## Amazon Textract

**Amazon Textract** is a document analysis service that detects and extracts printed text, handwriting, structured data (such as fields of interest and their values), and tables from images and scans of documents. Its machine learning models have been trained on millions of documents, so that virtually any type of uploaded document is recognized and processed automatically for text extraction.

When it extracts information from a document, the service returns a confidence score for each element it identifies, so that you can make informed decisions about how to use the results.

For example, when extracting information from tax documents, you can set custom rules so that any information extracted with a confidence score below 95% is flagged. In addition, all extracted data is returned with bounding-box coordinates, a rectangular frame that fully encloses each identified item, so that you can quickly locate where a word or number appears in the document.
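
The two ideas described here, flagging low-confidence results and locating them on the page, can be sketched in a few lines. This is a minimal illustration rather than part of the original notebook: the helper names `flag_low_confidence` and `bbox_to_pixels`, the sample blocks, and the 1000 x 800 page size are assumptions. It relies on Textract reporting `Confidence` on a 0 to 100 scale and `BoundingBox` values as ratios of the page width and height.

```python
def flag_low_confidence(blocks, threshold=95.0):
    """Return the blocks whose confidence falls below the threshold."""
    return [
        b for b in blocks
        if "Confidence" in b and b["Confidence"] < threshold
    ]


def bbox_to_pixels(bbox, page_width, page_height):
    """Convert a ratio-based bounding box to (left, top, right, bottom) pixels."""
    left = bbox["Left"] * page_width
    top = bbox["Top"] * page_height
    right = left + bbox["Width"] * page_width
    bottom = top + bbox["Height"] * page_height
    return round(left), round(top), round(right), round(bottom)


# Hypothetical blocks shaped like Textract output (all values are made up).
sample_blocks = [
    {"BlockType": "LINE", "Text": "Total income", "Confidence": 99.5,
     "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.2,
                                  "Width": 0.5, "Height": 0.05}}},
    {"BlockType": "LINE", "Text": "12,3A5.00", "Confidence": 81.2,
     "Geometry": {"BoundingBox": {"Left": 0.7, "Top": 0.2,
                                  "Width": 0.2, "Height": 0.05}}},
]

# Assumed page size in pixels, used only to place the box on an image.
for block in flag_low_confidence(sample_blocks):
    box = bbox_to_pixels(block["Geometry"]["BoundingBox"], 1000, 800)
    print("Review:", block["Text"], box)
```

The threshold is a parameter because the right cut-off depends on the document type and on how costly a wrong value is.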

#### [Source code on GitHub](https://github.com/aws-samples/amazon-textract-code-samples/tree/master/python)

In [2]:
import boto3
from pprint import pprint
import time

In [4]:
# Document
documentName = "simple-document-image.jpg"

# Read document content
with open(documentName, 'rb') as document:
    imageBytes = bytearray(document.read())

In [8]:
# Amazon Textract client
textract = boto3.client('textract')

# Call Amazon Textract
response = textract.detect_document_text(Document={'Bytes': imageBytes})

pprint(response)

{'Blocks': [{'BlockType': 'PAGE',
             'Geometry': {'BoundingBox': {'Height': 1.0,
                                          'Left': 0.0,
                                          'Top': 0.0,
                                          'Width': 1.0},
                          'Polygon': [{'X': 0.0, 'Y': 0.0},
                                      {'X': 1.0, 'Y': 0.0},
                                      {'X': 1.0, 'Y': 1.0},
                                      {'X': 0.0, 'Y': 1.0}]},
             'Id': '85abcc54-b09b-4109-a79d-6a94ac48d94b',
             'Relationships': [{'Ids': ['6e8949e3-8b07-4885-b771-ccbe48854b24',
                                        'd093c43f-e192-4f03-b403-337ede610f95',
                                        '3edb55c3-7522-429a-82d5-db95b6aa777b',
                                        '825fcb78-e888-4d8d-9612-33d32dfa6149'],
                                'Type': 'CHILD'}]},
            {'BlockType': 'LINE',
             'Confidence': 99.52398
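
The truncated response above shows how blocks refer to each other: the `PAGE` block lists its children by `Id` under a `CHILD` relationship. The following is a minimal sketch of following those references, using a made-up response with shortened ids; the helper name `child_blocks` is an assumption for illustration.

```python
def child_blocks(block, blocks_by_id):
    """Return the blocks referenced by a block's CHILD relationships, in order."""
    children = []
    for relationship in block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            children.extend(
                blocks_by_id[i] for i in relationship["Ids"] if i in blocks_by_id
            )
    return children


# Hypothetical response in the same shape as the output above (ids shortened).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1", "l2"]}]},
        {"BlockType": "LINE", "Id": "l1", "Text": "first line", "Confidence": 99.5},
        {"BlockType": "LINE", "Id": "l2", "Text": "second line", "Confidence": 98.7},
    ]
}

# Index every block by Id once, then resolve the page's children.
blocks_by_id = {b["Id"]: b for b in sample_response["Blocks"]}
page = next(b for b in sample_response["Blocks"] if b["BlockType"] == "PAGE")
page_lines = [child["Text"] for child in child_blocks(page, blocks_by_id)]
print(page_lines)
```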

In [20]:
# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' +  item["Text"] + '\033[0m')

Amazon.com, Inc. is located in Seattle, WA
It was founded July 5th, 1994 by Jeff Bezos
Amazon.com allows customers to buy everything from books to blenders
Seattle is north of Portland and south of Vancouver, BC.


### From S3 Bucket

In [21]:
# Document
s3BucketName = "ki-textract-demo-docs"
documentName = "simple-document-image.jpg"

In [22]:
# Call Amazon Textract
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })
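
Because the same request shape is reused for every S3 document in this notebook, it can be wrapped in a small helper. The function name `detect_lines_from_s3` is an assumption and not part of boto3; only `detect_document_text` and its `Document={'S3Object': ...}` argument come from the cell above. The client is passed in as a parameter, and a stand-in client is used here so the sketch runs without AWS credentials.

```python
def detect_lines_from_s3(textract_client, bucket, name):
    """Run synchronous text detection on an S3 object and return its LINE texts."""
    response = textract_client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": name}}
    )
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]


class FakeTextract:
    """Stand-in for the real client (hypothetical, for offline demonstration)."""

    def detect_document_text(self, Document):
        # Record the request so the shape can be inspected afterwards.
        self.last_document = Document
        return {"Blocks": [
            {"BlockType": "PAGE"},
            {"BlockType": "LINE", "Text": "hello world"},
            {"BlockType": "WORD", "Text": "hello"},
            {"BlockType": "WORD", "Text": "world"},
        ]}


fake_client = FakeTextract()
demo_lines = detect_lines_from_s3(fake_client, "example-bucket", "example.jpg")
print(demo_lines)
```

With the real client from the earlier cell, the call would be `detect_lines_from_s3(textract, s3BucketName, documentName)`.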


In [23]:
# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' +  item["Text"] + '\033[0m')

Amazon.com, Inc. is located in Seattle, WA
It was founded July 5th, 1994 by Jeff Bezos
Amazon.com allows customers to buy everything from books to blenders
Seattle is north of Portland and south of Vancouver, BC.


In [24]:
documentName = "two-column-image.jpg"
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

In [25]:
# Detect columns and print lines
columns = []
lines = []
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        column_found = False
        bbox = item["Geometry"]["BoundingBox"]
        bbox_left = bbox["Left"]
        bbox_right = bbox["Left"] + bbox["Width"]
        bbox_centre = bbox["Left"] + bbox["Width"] / 2
        for index, column in enumerate(columns):
            # Midpoint of the column's horizontal span
            column_centre = (column['left'] + column['right']) / 2

            if (column['left'] < bbox_centre < column['right']) or (bbox_left < column_centre < bbox_right):
                # The line overlaps this column
                lines.append([index, item["Text"]])
                column_found = True
                break
        if not column_found:
            # No column matched, so this line starts a new one
            columns.append({'left': bbox_left, 'right': bbox_right})
            lines.append([len(columns) - 1, item["Text"]])

lines.sort(key=lambda x: x[0])
for line in lines:
    print (line[1])

Extract data quickly &
accurately
Textract makes it easy to quickly and
accurately extract data from
documents and forms. Textract
automatically detects a document's
layout and the key elements on the
page, understands the data
relationships in any embedded forms
or tables, and extracts everything
with its context intact. This means
you can instantly use the extracted
data in an application or store it in a
database without a lot of
complicated code in between
No code or templates to
maintain
Textract's pre-trained machine
learning models eliminate the need
to write code for data extraction,
because they have already been
trained on tens of millions of
documents from virtually every
industry, including invoices, receipts,
contracts, tax documents, sales
orders, enrollment forms, benefit
applications, insurance claims, policy
documents and many more. You no
longer need to maintain code for
every document or form you might
receive or worry about how page
layouts change over time.
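
The column heuristic above can also be packaged as a function, which makes it easier to test. The sketch below is one possible arrangement, not the sample's own code: it assigns each line to the first column whose horizontal span contains the line's centre (or whose centre falls inside the line), and it additionally orders columns left to right and lines top to bottom, which the cell above does not do. The names `lines_in_reading_order` and `_line`, and the sample blocks, are assumptions.

```python
def lines_in_reading_order(blocks):
    """Group LINE blocks into columns by horizontal overlap, then read
    columns left to right and lines top to bottom."""
    columns = []  # each column: {"left", "right", "lines": [(top, text), ...]}
    for block in blocks:
        if block["BlockType"] != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        left, right = box["Left"], box["Left"] + box["Width"]
        centre = (left + right) / 2
        for column in columns:
            column_centre = (column["left"] + column["right"]) / 2
            if column["left"] < centre < column["right"] or left < column_centre < right:
                column["lines"].append((box["Top"], block["Text"]))
                break
        else:
            # No existing column matched, so this line starts a new one.
            columns.append({"left": left, "right": right,
                            "lines": [(box["Top"], block["Text"])]})

    ordered = []
    for column in sorted(columns, key=lambda c: c["left"]):
        ordered.extend(text for _, text in sorted(column["lines"], key=lambda l: l[0]))
    return ordered


def _line(text, left, top, width=0.4):
    """Build a hypothetical LINE block for the demonstration."""
    return {"BlockType": "LINE", "Text": text,
            "Geometry": {"BoundingBox": {"Left": left, "Top": top,
                                         "Width": width, "Height": 0.03}}}


# Made-up two-column page, listed row by row rather than column by column.
sample = [_line("left 1", 0.05, 0.10), _line("right 1", 0.55, 0.10),
          _line("left 2", 0.05, 0.15), _line("right 2", 0.55, 0.15)]
print(lines_in_reading_order(sample))
```

Like the original cell, this is a heuristic: a heading that spans both columns would be merged into whichever column it overlaps first.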


### NLP-Comprehend

In [26]:
documentName = "simple-document-image.jpg"
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

In [27]:
# Print text
print("\nText\n========")
text = ""
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' +  item["Text"] + '\033[0m')
        text = text + " " + item["Text"]


Text
Amazon.com, Inc. is located in Seattle, WA
It was founded July 5th, 1994 by Jeff Bezos
Amazon.com allows customers to buy everything from books to blenders
Seattle is north of Portland and south of Vancouver, BC.


In [28]:
# Amazon Comprehend client
comprehend = boto3.client('comprehend')

# Detect sentiment
sentiment =  comprehend.detect_sentiment(LanguageCode="en", Text=text)
print ("\nSentiment\n========\n{}".format(sentiment.get('Sentiment')))

# Detect entities
entities =  comprehend.detect_entities(LanguageCode="en", Text=text)
print("\nEntities\n========")
for entity in entities["Entities"]:
    print ("{}\t=>\t{}".format(entity["Type"], entity["Text"]))


Sentiment
NEUTRAL

Entities
ORGANIZATION	=>	Amazon.com, Inc.
LOCATION	=>	Seattle, WA
DATE	=>	July 5th, 1994
PERSON	=>	Jeff Bezos
ORGANIZATION	=>	Amazon.com
LOCATION	=>	Seattle
LOCATION	=>	Portland
LOCATION	=>	Vancouver, BC
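
The entity list is often easier to consume when grouped by type. The following is a minimal sketch, assuming entities in the shape printed above; the sample list is transcribed by hand from that output rather than fetched from Comprehend, and `group_entities_by_type` is an illustrative name.

```python
from collections import defaultdict


def group_entities_by_type(entity_list):
    """Map each entity type to the list of texts found for it, in order."""
    grouped = defaultdict(list)
    for entity in entity_list:
        grouped[entity["Type"]].append(entity["Text"])
    return dict(grouped)


# Entities transcribed from the output above (only Type and Text are kept).
sample_entities = [
    {"Type": "ORGANIZATION", "Text": "Amazon.com, Inc."},
    {"Type": "LOCATION", "Text": "Seattle, WA"},
    {"Type": "DATE", "Text": "July 5th, 1994"},
    {"Type": "PERSON", "Text": "Jeff Bezos"},
    {"Type": "ORGANIZATION", "Text": "Amazon.com"},
    {"Type": "LOCATION", "Text": "Seattle"},
    {"Type": "LOCATION", "Text": "Portland"},
    {"Type": "LOCATION", "Text": "Vancouver, BC"},
]

grouped = group_entities_by_type(sample_entities)
for entity_type, texts in grouped.items():
    print(entity_type, "=>", "; ".join(texts))
```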


### NLP-Medical

In [29]:
documentName = "medical-notes.png"
response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

In [30]:
# Print text
print("\nText\n========")
text = ""
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print ('\033[94m' +  item["Text"] + '\033[0m')
        text = text + " " + item["Text"]


Text
Patient visit notes
Pt is 40yo mother, high school teacher
HPI : Sleeping trouble on present dosage of Clonidine.
Severe Rash on face and leg, slightly itchy
Meds : Vyvanse 50 mgs po at breakfast daily,
Clonidine 0.2 mgs -- 1 and 1/2 tabs po qhs
HEENT : Boggy inferior turbinates, No oropharyngeal lesion
Lungs : clear
Heart : Regular rhythm
Skin : Mild erythematous eruption to hairline
Follow-up as scheduled


In [33]:
# Amazon Comprehend Medical client
comprehend = boto3.client('comprehendmedical')

# Detect medical entities
# (boto3 also provides detect_entities_v2, which AWS documents as the successor to this call)
entities =  comprehend.detect_entities(Text=text)
print("\nMedical Entities\n========")
for entity in entities["Entities"]:
    print("- {}".format(entity["Text"]))
    print ("   Type: {}".format(entity["Type"]))
    print ("   Category: {}".format(entity["Category"]))
    if entity["Traits"]:
        print("   Traits:")
        for trait in entity["Traits"]:
            print ("    - {}".format(trait["Name"]))
    print("\n")


Medical Entities
- 40yo
   Type: AGE
   Category: PROTECTED_HEALTH_INFORMATION


- Sleeping trouble
   Type: DX_NAME
   Category: MEDICAL_CONDITION
   Traits:
    - SYMPTOM


- Clonidine
   Type: GENERIC_NAME
   Category: MEDICATION


- Rash
   Type: DX_NAME
   Category: MEDICAL_CONDITION
   Traits:
    - SYMPTOM


- face
   Type: SYSTEM_ORGAN_SITE
   Category: ANATOMY


- leg
   Type: SYSTEM_ORGAN_SITE
   Category: ANATOMY


- itchy
   Type: DX_NAME
   Category: MEDICAL_CONDITION


- Meds
   Type: TREATMENT_NAME
   Category: TEST_TREATMENT_PROCEDURE


- Vyvanse
   Type: BRAND_NAME
   Category: MEDICATION


- Clonidine
   Type: GENERIC_NAME
   Category: MEDICATION


- HEENT
   Type: SYSTEM_ORGAN_SITE
   Category: ANATOMY


- Boggy
   Type: DX_NAME
   Category: MEDICAL_CONDITION


- inferior turbinates
   Type: SYSTEM_ORGAN_SITE
   Category: ANATOMY


- oropharyngeal
   Type: SYSTEM_ORGAN_SITE
   Category: ANATOMY


- lesion
   Type: DX_NAME
   Category: MEDICAL_CONDITION
   Traits:
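
A long entity listing like the one above can be condensed by grouping on `Category`. The following is a minimal sketch, assuming entities in the shape printed above; the sample list is transcribed by hand from the first few entries of that output, and `summarize_medical_entities` is an illustrative name.

```python
def summarize_medical_entities(entity_list):
    """Group entity texts by Category, appending trait names in brackets."""
    summary = {}
    for entity in entity_list:
        traits = [t["Name"] for t in entity.get("Traits", [])]
        label = entity["Text"]
        if traits:
            label += " [" + ", ".join(traits) + "]"
        summary.setdefault(entity["Category"], []).append(label)
    return summary


# A few entities transcribed from the output above.
sample_medical = [
    {"Text": "40yo", "Type": "AGE",
     "Category": "PROTECTED_HEALTH_INFORMATION", "Traits": []},
    {"Text": "Sleeping trouble", "Type": "DX_NAME",
     "Category": "MEDICAL_CONDITION", "Traits": [{"Name": "SYMPTOM"}]},
    {"Text": "Clonidine", "Type": "GENERIC_NAME",
     "Category": "MEDICATION", "Traits": []},
    {"Text": "Rash", "Type": "DX_NAME",
     "Category": "MEDICAL_CONDITION", "Traits": [{"Name": "SYMPTOM"}]},
    {"Text": "Vyvanse", "Type": "BRAND_NAME",
     "Category": "MEDICATION", "Traits": []},
]

medical_summary = summarize_medical_entities(sample_medical)
for category, labels in medical_summary.items():
    print("{}: {}".format(category, "; ".join(labels)))
```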
  

### Read PDF

In [25]:
client = boto3.client('textract')
object_name = "Amazon-Textract-Pdf.pdf"
s3_bucket_name = "ki-textract-demo-docs"

In [26]:
response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3_bucket_name,
                'Name': object_name
            }})

In [27]:
job_id=response["JobId"]
print("JobId:", job_id)

JobId: 6b204a38ca435e8402dcf9669457279ca9cc0ad6cc981099fa5765fbb5b0bf9d


In [28]:
response = client.get_document_text_detection(JobId=job_id)
status = response["JobStatus"]
print("Job status: {}".format(status))

while status == "IN_PROGRESS":
    time.sleep(3)
    response = client.get_document_text_detection(JobId=job_id)
    status = response["JobStatus"]
print("Job status: {}".format(status))

Job status: IN_PROGRESS
Job status: SUCCEEDED
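
The polling loop above runs until the job leaves `IN_PROGRESS`, but it does not stop on a timeout or distinguish a failed job from a successful one. The following is a minimal sketch of a stricter variant. The function name `wait_for_job`, its default intervals, and the stand-in client are assumptions; treating every final status other than `SUCCEEDED` as an error is a design choice of this sketch, not something the original notebook does.

```python
import time


def wait_for_job(client, job_id, poll_seconds=3, timeout_seconds=300):
    """Poll get_document_text_detection until the job leaves IN_PROGRESS.

    Returns the final response; raises on a non-successful status or timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status == "SUCCEEDED":
            return response
        if status != "IN_PROGRESS":
            # For example FAILED: surface it instead of reading text from it.
            raise RuntimeError(
                "Textract job {} ended with status {}".format(job_id, status))
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "Textract job {} still running after {}s".format(job_id, timeout_seconds))
        time.sleep(poll_seconds)


class FakeTextract:
    """Stand-in client so the sketch runs offline: IN_PROGRESS twice, then done."""

    def __init__(self):
        self.calls = 0

    def get_document_text_detection(self, JobId):
        self.calls += 1
        status = "IN_PROGRESS" if self.calls < 3 else "SUCCEEDED"
        return {"JobStatus": status, "Blocks": []}


fake = FakeTextract()
final = wait_for_job(fake, "example-job-id", poll_seconds=0)
print(final["JobStatus"], "after", fake.calls, "calls")
```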


In [29]:
response = client.get_document_text_detection(JobId=job_id)

In [30]:
pages = []
pages.append(response)
# DocumentMetadata['Pages'] is the number of pages in the document
print("Resultset page received: {}".format(response['DocumentMetadata']['Pages']))
next_token = response.get('NextToken')

# Follow NextToken until the service stops returning one
while next_token:
    time.sleep(1)
    response = client.get_document_text_detection(JobId=job_id, NextToken=next_token)
    pages.append(response)
    print("Resultset page received: {}".format(len(pages)))
    next_token = response.get('NextToken')

Resultset page received: 2
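
The `NextToken` loop above can also be written as a generator, so that callers iterate over result pages without managing the token themselves. This is a minimal sketch: `iter_result_pages` and the stand-in client are assumptions, while `get_document_text_detection`, `JobId`, and `NextToken` are taken from the cells above.

```python
def iter_result_pages(client, job_id):
    """Yield every response page of a finished text-detection job."""
    kwargs = {"JobId": job_id}
    while True:
        response = client.get_document_text_detection(**kwargs)
        yield response
        next_token = response.get("NextToken")
        if not next_token:
            break
        kwargs["NextToken"] = next_token


class FakePagedTextract:
    """Stand-in client so the sketch runs offline: serves two result pages."""

    _pages = {
        None: {"Blocks": [{"BlockType": "LINE", "Text": "page one"}],
               "NextToken": "t1"},
        "t1": {"Blocks": [{"BlockType": "LINE", "Text": "page two"}]},
    }

    def get_document_text_detection(self, JobId, NextToken=None):
        return self._pages[NextToken]


all_lines = [
    block["Text"]
    for page in iter_result_pages(FakePagedTextract(), "example-job-id")
    for block in page["Blocks"]
    if block["BlockType"] == "LINE"
]
print(all_lines)
```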


In [31]:
# Print detected text
for result_page in pages:
    for item in result_page["Blocks"]:
        if item["BlockType"] == "LINE":
            print('\033[94m' + item["Text"] + '\033[0m')

Amazon Textract
Amazon Textract is a service that automatically extracts text and data from scanned
documents. Amazon Textract goes beyond simple optical character recognition (OCR) to
also identify the contents of fields in forms and information stored in tables.
Many companies today extract data from documents and forms through manual data
entry that's slow and expensive or through simple optical character recognition (OCR)
software that is difficult to customize. Rules and workflows for each document and form
often need to be hard-coded and updated with each change to the form or when dealing
with multiple forms. If the form deviates from the rules, the output is often scrambled
and unusable.
Amazon Textract overcomes these challenges by using machine learning to instantly
"read" virtually any type of document to accurately extract text and data without the
need for any m