# Detecting headers and footers using Amazon Textract

A common customer use case when extracting text from documents is to identify, and in many cases ignore, headers and footers. Textract returns the position of every detected line on the page, which is enough metadata to make this possible. Note that this sample demonstrates only a simple version of the use case and is not meant to be a general-purpose, out-of-the-box solution.

Before running this notebook:

1. Upload the sample PDF document into an S3 bucket
2. Define an SNS topic that will receive a notification of completion of the async text detection job
3. Define an IAM role that Textract can assume to publish to the SNS topic
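
The three prerequisites above could also be scripted with boto3. The following is a minimal sketch and not part of the original notebook: the helper name `setup_prerequisites`, the default topic and role names, and the choice of the AWS managed policy `AmazonTextractServiceRole` are all assumptions made for illustration. That managed policy is understood to allow publishing only to topics whose names begin with `AmazonTextract`, which is why the default topic name uses that prefix. The clients are passed in as arguments so the helper itself makes no assumption about credentials or region.

```python
import json

# Trust policy that lets the Textract service assume the role
TEXTRACT_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "textract.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# AWS managed policy (an assumption of this sketch) granting publish access
# to SNS topics named AmazonTextract*
TEXTRACT_SNS_POLICY_ARN = (
    "arn:aws:iam::aws:policy/service-role/AmazonTextractServiceRole"
)


def setup_prerequisites(s3, sns, iam, bucket, key, local_pdf,
                        topic_name="AmazonTextractHeaderFooterDemo",
                        role_name="TextractHeaderFooterDemoRole"):
    """Upload the PDF, create the SNS topic and IAM role.

    Returns (topic_arn, role_arn). The topic and role names are hypothetical.
    """
    # Step 1: upload the sample PDF to S3
    s3.upload_file(local_pdf, bucket, key)
    # Step 2: create the SNS topic that receives the completion notification
    topic_arn = sns.create_topic(Name=topic_name)["TopicArn"]
    # Step 3: create a role Textract can assume, then let it publish to the topic
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(TEXTRACT_TRUST_POLICY),
    )
    iam.attach_role_policy(RoleName=role_name, PolicyArn=TEXTRACT_SNS_POLICY_ARN)
    return topic_arn, role["Role"]["Arn"]
```

With real clients, this would be called with `boto3.client('s3')`, `boto3.client('sns')` and `boto3.client('iam')`, and the two returned ARNs would become the `SNS_TOPIC` and `SNS_ROLE` values used later. Note that `create_role` raises an error if the role already exists, so the sketch is intended for a first-time setup only.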

In [1]:
import boto3

# The async start_document_text_detection call below uses an SNS topic and a role
# for its completion notification. BUCKET is the S3 bucket that holds the sample PDF.
BUCKET    = '<YOUR BUCKET NAME HERE>'

# If you placed the PDF file in a different prefix in your S3 bucket, update PREFIX
PREFIX    = 'textract/test_input/'

# If you used a different PDF filename, update FILENAME
FILENAME  = PREFIX + 'sample_with_headers.pdf'

# Capture your SNS topic and role ARNs here
SNS_TOPIC = '<YOUR SNS TOPIC ARN HERE>'
SNS_ROLE  = '<YOUR SNS ROLE ARN HERE>'

client = boto3.client('textract')

In [13]:
doc_loc       = {'S3Object': {'Bucket': BUCKET, 'Name': FILENAME}} 
notif_channel = {'SNSTopicArn': SNS_TOPIC, 'RoleArn': SNS_ROLE}

response = client.start_document_text_detection(
    DocumentLocation=doc_loc,
    JobTag='1',
    NotificationChannel=notif_channel
)
response

{'JobId': 'a79704d403a47d45c1b0d2f0f33beba20c6920e169c3f132b05ba0b9e41a2b88',
 'ResponseMetadata': {'RequestId': 'cf39c7c4-ff01-4564-b475-2568388afe87',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Fri, 26 Apr 2019 18:07:46 GMT',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '76',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'cf39c7c4-ff01-4564-b475-2568388afe87'},
  'RetryAttempts': 0}}

In [14]:
job_id = response['JobId']

In [15]:
job_id

'a79704d403a47d45c1b0d2f0f33beba20c6920e169c3f132b05ba0b9e41a2b88'

In [23]:
response = client.get_document_text_detection(
    JobId=job_id,
    MaxResults=500
)

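job_status = response['JobStatus']
if job_status == 'SUCCEEDED':
    # NextToken is present only when more results remain, so use .get() to avoid a KeyError
    next_token = response.get('NextToken')
    # Only the first page of results (at most MaxResults blocks) is captured here
    blocks = response['Blocks']
    print('Job SUCCEEDED, returned {} blocks'.format(len(blocks)))
else:
    print('Job status: ' + job_status)

# If the job is still IN_PROGRESS, re-run this cell until it reports SUCCEEDED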

Job SUCCEEDED, returned 500 blocks
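
The cell above has to be re-run by hand until the job finishes, and it keeps only the first 500 blocks even though the response carries a `NextToken`. A minimal sketch of automating both steps follows. The helper names `wait_for_job` and `get_all_blocks`, the polling interval, and the attempt limit are assumptions for illustration, not part of the original notebook. The client is passed in, so the same `client` object created earlier can be used.

```python
import time


def wait_for_job(client, job_id, poll_seconds=5, max_attempts=60):
    """Poll until the job is no longer IN_PROGRESS; return the first response page."""
    for _ in range(max_attempts):
        response = client.get_document_text_detection(JobId=job_id, MaxResults=500)
        if response['JobStatus'] != 'IN_PROGRESS':
            return response
        time.sleep(poll_seconds)
    raise TimeoutError(
        'Job {} still in progress after {} polls'.format(job_id, max_attempts)
    )


def get_all_blocks(client, job_id, first_response):
    """Follow NextToken until every page of results has been collected."""
    blocks = list(first_response.get('Blocks', []))
    next_token = first_response.get('NextToken')
    while next_token:
        page = client.get_document_text_detection(
            JobId=job_id, MaxResults=500, NextToken=next_token
        )
        blocks.extend(page.get('Blocks', []))
        # The last page has no NextToken, which ends the loop
        next_token = page.get('NextToken')
    return blocks
```

A typical use would be `first = wait_for_job(client, job_id)`, then a check that `first['JobStatus'] == 'SUCCEEDED'`, then `blocks = get_all_blocks(client, job_id, first)`. Polling is the simplest option for a notebook; subscribing to the SNS topic avoids repeated calls but needs more setup.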


In [24]:
response

{'DocumentMetadata': {'Pages': 5},
 'JobStatus': 'SUCCEEDED',
 'NextToken': 'bWhL0lel3REElcoozHWS/WSUKLth6Mx/wSCYsDvppbHUW82AbTknjN4fYXqKZOmw5BEMZrIlBjH7qbYIeqTTWNGZW/pa/wdJBcSJ6/PT65sO9DQd',
 'Blocks': [{'BlockType': 'PAGE',
   'Geometry': {'BoundingBox': {'Width': 1.0,
     'Height': 1.0,
     'Left': 0.0,
     'Top': 0.0},
    'Polygon': [{'X': 0.0, 'Y': 0.0},
     {'X': 1.0, 'Y': 0.0},
     {'X': 1.0, 'Y': 1.0},
     {'X': 0.0, 'Y': 1.0}]},
   'Id': '691b5dd0-3d66-4617-808d-3c7521ae137f',
   'Relationships': [{'Type': 'CHILD',
     'Ids': ['e452d817-aae0-4fb6-9db6-cc3fa998c69c',
      '2a233469-36b7-43f2-a6c7-b00840da9af6',
      '4b896554-d31d-4f3d-b808-a577c549711f',
      '41688818-cf48-46dc-a2f7-a92a8b3fc673',
      '76d7daf9-280e-409c-9ede-dda1e93a4f7f',
      '48c4368e-2b88-43d2-9eaf-841acb4d3890',
      '44329fa8-d14c-4c8a-a55b-9967c05af330',
      '378fb05b-589a-4681-ba58-346e55c0302b',
      '70216932-1259-4a81-8bfe-7f72221deb59',
      '6ea343ea-624c-4f65-8f4c-c7a931466c0
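
As the truncated response above shows, each block carries a `Geometry` with a `BoundingBox` and a `Polygon`, both expressed as fractions of the page (0.0 to 1.0). The sketch below illustrates one way to turn that geometry into a header, body, or footer decision per page. The function names `vertical_extent` and `split_by_position` are hypothetical, and the default thresholds are simply the values this notebook uses for the sample document. Taking the minimum and maximum `Y` over all polygon points, rather than relying on a fixed point order, is a design choice of this sketch.

```python
from collections import defaultdict


def vertical_extent(block):
    """Return (top, bottom) of a block as fractions of page height."""
    ys = [point['Y'] for point in block['Geometry']['Polygon']]
    return min(ys), max(ys)


def split_by_position(blocks, header_max_bottom=0.035, footer_min_top=0.95):
    """Group LINE text per page into 'header', 'body' and 'footer' lists."""
    pages = defaultdict(lambda: {'header': [], 'body': [], 'footer': []})
    for block in blocks:
        if block['BlockType'] != 'LINE':
            continue
        top, bottom = vertical_extent(block)
        if bottom < header_max_bottom:
            region = 'header'   # line lies entirely within the top margin
        elif top > footer_min_top:
            region = 'footer'   # line lies entirely within the bottom margin
        else:
            region = 'body'
        # 'Page' is assumed to be present on each block; fall back to 1 if it is not
        pages[block.get('Page', 1)][region].append(block['Text'])
    return dict(pages)
```

Calling `split_by_position(blocks)` returns a dictionary keyed by page number, so the body text of page 2 would be `result[2]['body']`. The thresholds would likely need adjusting for documents with different margins.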

In [25]:
discarded_header = []
discarded_footer = []
saved_text = []

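# Thresholds are fractions of page height (0.0 = top edge, 1.0 = bottom edge),
# chosen for this sample document
HEADER_MAX_BOTTOM = 0.035
FOOTER_MIN_TOP    = 0.95

for block in blocks:
    if block['BlockType'] == 'PAGE':
        print('Page: {}'.format(block['Page']))
    elif block['BlockType'] == 'LINE':
        text = block['Text']
        # Polygon points start at the top-left corner and run clockwise,
        # so [0] is top-left and [2] is bottom-right
        polygon_top    = block['Geometry']['Polygon'][0]['Y']
        polygon_bottom = block['Geometry']['Polygon'][2]['Y']
        if polygon_bottom < HEADER_MAX_BOTTOM:
            # the whole line sits inside the top margin
            discarded_header.append(text)
        elif polygon_top > FOOTER_MIN_TOP:
            # the whole line sits inside the bottom margin
            discarded_footer.append(text)
        else:
            saved_text.append(text)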

print('\nDiscarded headers:')
for h in discarded_header:
    print(h)
print('\nDiscarded footers:')
for f in discarded_footer:
    print(f)
    
print('\nSaved text:')
for s in saved_text:
    print(s)


Page: 1
Page: 2

Discarded headers:
5/9/2016
Add headers, footers, and Bates numbering to PDFs, Adobe Acrobat
5/9/2016
Add dheaders, footers, and Bates numbering to PDFs, Adobe Acrobat

Discarded footers:
htpshelpxadobe.comlacrobatusingadd-headers-iooters-pdishim|
2/6
tpshelpxadobe.comlacrobatlusingadctheaders-iooters-pdishtm|
3/6

Saved text:
Headers, footers, and Bates numbering
Add headers and footers, with an open document
Add headers and footers, with no document open (Windows only)
Update the headers and footers
Add another header and footer
Replace all headers and footers
Remove all headers and footers
Show all
To the top
Headers, footers, and Bates numbering
Acrobat DC lets you add a header and footer throughout a PDF. Headers and footers can include a date,
automatic page numbering, Bates numbers for legal documents, or the title and author. You can add
headers and footers to one or more PDFS.
You can vary the headers and footers within a PDF. For example, you can add a header