# Identifying and Extracting Tables from PDFs

Locating and extracting tables from PDFs is a common challenge, and we've got a solution! This notebook demonstrates how to tag and extract a simple table found in a PDF.

We'll be using Kodexa's Pattern-based Table Tagger (action) to locate the table in the PDF, tag the table rows and columns, and extract the table values. Documentation is available here: [Pattern-based Table Tagger](https://developer.kodexa.com/#/kodexa-platform/actions/kodexa-pattern-table-tagger)

In these examples, we'll explore different parameters and their effects on the tagging and extraction process.

## Set up our imports

In [1]:
from kodexa import Document, Pipeline, RemoteAction, KodexaPlatform

CLOUD_URL = 'https://platform.kodexa.com' 

## Set Platform Environment and Access Token Credential

In the next cell, you'll be prompted to enter the access token you created in the environment specified by CLOUD_URL.
If you haven't created a token already, follow the steps in our [Getting Started](https://developer.kodexa.com/kodexa-cloud/accessing-kodexa-cloud) guide.

* Note:  The text you enter in the prompt field will be masked.  Once you're done entering the access token value, hit enter to complete the action in the cell.

In [2]:
import getpass

# Only request a login if we aren't logged in

if KodexaPlatform.get_access_token() is None:
    
    ACCESS_TOKEN = getpass.getpass("Enter access token:")

    KodexaPlatform.set_url(CLOUD_URL)
    KodexaPlatform.set_access_token(ACCESS_TOKEN)

Enter access token: ································


In [3]:
import os

# Setting up location of data file
DATA_FOLDER = '_data'
PDF_FOLDER = 'pdfs'
DATA_FILE = 'Ice_Cream_Specials.pdf'

FULL_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, PDF_FOLDER, DATA_FILE)

print(f'\nThis is where the PDF document is located: {FULL_PATH}\n')



This is where the PDF document is located: /home/skep/Projects/Kodexa/get-started-with-python/2_Examples_by_Source_Type/../_data/pdfs/Ice_Cream_Specials.pdf



## First, parse the PDF

We'll start by constructing a pipeline that parses the PDF.  We'll then use the resulting Kodexa document for our other tagging explorations.

In [4]:

# Let's set up a pipeline that parses the PDF.  We're doing this as a separate piece of work
# so we can spend time really digging into the table tagging parameters later

pipeline = Pipeline.from_file(FULL_PATH)
pipeline.add_step(RemoteAction(slug='kodexa/pdf-parser', attach_source=True))
pipeline.run()

kodexa_doc = pipeline.context.output_document


## Next, tag and extract the table data

When tagging a table in a PDF document, we focus on the areas of the document above and below the table.  Tables may or may not have headers, may span multiple pages, and may have rows of varying heights.  We have parameters that can control for all of those variations in presentation.  Let's start with the basics!

We'll be using the Pattern-based table tagger (slug 'kodexa/pattern-table-tagger') to identify tables in our PDF documents.  That tagger identifies and tags a table using text patterns and then leverages spatial awareness to find columns and rows.

The pattern-table-tagger has three required parameters:
* **tag_to_apply**:  The tag we'll apply to the table once it's identified.
* **page_start_re**: A regular expression that identifies the page the table will be found on.  This could be a page number or any bit of text that appears on the page before the table contents begin.  If the table is the only data on the page, you can enter some text you expect to be in the table or an empty string.
* **table_start_re**: A regular expression that identifies a line of text that starts the table, such as the column header line or other identifier.

We're also going to set the 'extract' parameter to True so we can access the tagged table as a TableDataStore.  That store can be directly converted to a pandas dataframe, which is a familiar data structure for most Python developers.

* **extract**: Boolean value indicating that the tagged table should be added to the pipeline context's stores
* extract_options - **store_name**: The name of the extracted table's TableDataStore
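
The options above are passed to the tagger as a nested dictionary, and the same shape is repeated in every tagging cell in this notebook. As a minimal sketch, a small helper can assemble that dictionary. `build_tagger_options` is a hypothetical name, not part of the Kodexa API; it only uses the option names shown in this notebook and does not call the platform.

```python
def build_tagger_options(tag_to_apply, page_start_re, table_start_re,
                         extract=False, **extract_options):
    """Assemble the options dict for the pattern-table-tagger step.

    Hypothetical helper: it mirrors the option names used in this
    notebook and does not call any Kodexa API.
    """
    options = {
        "tag_to_apply": tag_to_apply,
        "page_start_re": page_start_re,
        "table_start_re": table_start_re,
    }
    if extract:
        options["extract"] = True
        # As in this notebook's cells, the store is named after the tag
        # unless a different store_name is passed explicitly.
        options["extract_options"] = {"store_name": tag_to_apply, **extract_options}
    return options


options = build_tagger_options("Ice Cream Specials", ".", ".*", extract=True)
print(options)
```

The resulting dictionary could then be passed as the `options` argument of `RemoteAction`, exactly as the cells in this notebook do by hand.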

In [5]:
# Setting up the regex values for this table

# Because this is a 'tagger', we must provide a name for the tag
tag_name = "Ice Cream Specials"

# Since this table is the only text on the page, we'll set the page start regular expression to match any character
page_start_re = "."  

# Again, since this table is the only text on the page, we can use a regular expression that matches one or more characters (any type)
table_start_re = ".*"

pipeline = Pipeline(kodexa_doc)
pipeline.add_step(RemoteAction(slug='kodexa/pattern-table-tagger', 
                                     options={"tag_to_apply": tag_name, 
                                              "page_start_re": page_start_re, 
                                              "table_start_re":table_start_re, 
                                              "extract":True, 
                                                  "extract_options": {'store_name': tag_name}
                                              }))


context = pipeline.run()
kodexa_doc = pipeline.context.output_document

## Let's see what tags were applied to the document

This action is first and foremost a tagger, so the table we're trying to identify should be tagged with the name we specified.  Let's look and see if we do, indeed, have the expected tag.

In [6]:
kodexa_doc.get_root().get_all_tags()

['Ice Cream Specials', 'col1', 'col0', 'col2']

## Wow!  More tags than expected!

Our 'Ice Cream Specials' table was tagged, and three additional column tags (col1, col0, and col2) were included.  Let's take a look at that data.

In [7]:

## Select all the content_nodes that have been tagged with 'Ice Cream Specials'
ice_cream_special_nodes = kodexa_doc.select("//*[hasTag('" + tag_name + "')]")

for n in ice_cream_special_nodes:
    print(f'UUID: {n.uuid} :: node_type: {n.node_type} :: all content: {n.get_all_content()}')

UUID: 3e59be32-d531-4dea-b2c2-e68190fd1677 :: node_type: line :: all content: Date Flavor of the Day Description
UUID: 7ba4b3fb-c02c-4cdd-afed-c0624e0b7bbf :: node_type: line :: all content: July 1, 2020 Rocky Road Chocolate ice cream with bits
UUID: 530ea145-9eca-4e7e-8522-7953248d6c60 :: node_type: line :: all content: of marshmallow and nuts
UUID: 054a3080-0607-4f4a-a52d-c42c4b946d19 :: node_type: line :: all content: folded in.
UUID: ccb9b62a-e7be-46c0-b4f4-7a39789ce41d :: node_type: line :: all content: July 8, 2020 Bananas Foster A banana and rum flavored
UUID: 1abaea20-81d5-497c-9d55-f6b7c59f89ea :: node_type: line :: all content: ice cream with bits of real
UUID: f748a0c4-d041-496f-958b-605d2fd7e036 :: node_type: line :: all content: banana and brown sugar
UUID: f989cbe3-4a0e-4a36-a782-e2ba8e59ac3c :: node_type: line :: all content: swirled in.
UUID: 0fe4926e-43e0-467d-b934-f61d79b88163 :: node_type: line :: all content: July 7, 2020 Peachy Dream Ice cream made with fresh
UUID:

## Each line in the PDF is tagged as a row

If you look at the PDF, you see a table with three columns, one header row, and three rows of data.  When the PDF is parsed and transformed into a Kodexa Document, the resulting Kodexa document contains a series of lines.  The tagger has identified the table and tagged each line with the table's tag.  
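
The selector used to find tagged nodes is built by string concatenation. A small helper can make that pattern easier to reuse; this is only a sketch, `has_tag_selector` is a hypothetical name, and the selector syntax is copied from the cells in this notebook rather than from the Kodexa documentation.

```python
def has_tag_selector(tag_name):
    """Return a selector string that matches every node carrying tag_name.

    Hypothetical helper; the syntax is taken from this notebook's cells.
    Tag names containing a single quote are not handled.
    """
    return f"//*[hasTag('{tag_name}')]"


for tag in ["Ice Cream Specials", "col0", "col1", "col2"]:
    print(has_tag_selector(tag))
```

The returned string could be passed to `kodexa_doc.select(...)` in place of the concatenated expression.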

Let's take a look at the column tags to see what nodes have been tagged as columns.


In [8]:

## Select all the content_nodes that have been tagged with 'col0'
col0_nodes = kodexa_doc.select("//*[hasTag('col0')]")

print('Nodes tagged with col0:')
for n in col0_nodes:
    print(f'\tUUID: {n.uuid} :: node_type: {n.node_type} :: all content: {n.get_all_content()}')
    
## Select all the content_nodes that have been tagged with 'col1'
col1_nodes = kodexa_doc.select("//*[hasTag('col1')]")

print('\nNodes tagged with col1:')
for n in col1_nodes:
    print(f'\tUUID: {n.uuid} :: node_type: {n.node_type} :: all content: {n.get_all_content()}')
    
    
## Select all the content_nodes that have been tagged with 'col2'
col2_nodes = kodexa_doc.select("//*[hasTag('col2')]")

print('\nNodes tagged with col2:')
for n in col2_nodes:
    print(f'\tUUID: {n.uuid} :: node_type: {n.node_type} :: all content: {n.get_all_content()}')

Nodes tagged with col0:
	UUID: 71f12a88-6b03-4442-b75b-f43e659fa65f :: node_type: column :: all content: Date
	UUID: b1a36c7e-22c9-45bd-a495-3d7746b56504 :: node_type: column :: all content: July 1, 2020
	UUID: bd2a11e5-f1e6-4cad-a843-592308d322d1 :: node_type: column :: all content: 
	UUID: 038915e6-bff3-4da0-a9e3-9a143ce37958 :: node_type: column :: all content: 
	UUID: 7e06ab1c-18fe-4b29-a3c4-7ebeb6d2ea23 :: node_type: column :: all content: July 8, 2020
	UUID: ef4e38c1-1bef-4266-b5aa-bd662e658234 :: node_type: column :: all content: 
	UUID: 63d3a71f-858c-45a5-a36c-3115dc86ab20 :: node_type: column :: all content: 
	UUID: 228a4e7d-9d9d-48fb-b52e-6a525b95d869 :: node_type: column :: all content: 
	UUID: 2ba7067d-b54a-47f5-978c-df4cbc762e85 :: node_type: column :: all content: July 7, 2020
	UUID: d3780a34-71fa-4eab-bb5f-7c7ccb5a2836 :: node_type: column :: all content: 
	UUID: 46b62cff-80ea-4a35-a080-35c0b11397e4 :: node_type: column :: all content: 

Nodes tagged with col1:
	UUID: 1b

## Our column data is as expected!

That's cool - the data we expected to appear in each column has, indeed, been tagged with the correct column groups.  

Now let's take a look at the data as a dataframe:

In [9]:
context.get_store(tag_name).to_df()

| | Date | Flavor of the Day | Description |
|---|---|---|---|
| 0 | July 1, 2020 | Rocky Road | Chocolate ice cream with bits |
| 1 | | | of marshmallow and nuts |
| 2 | | | folded in. |
| 3 | July 8, 2020 | Bananas Foster | A banana and rum flavored |
| 4 | | | ice cream with bits of real |
| 5 | | | banana and brown sugar |
| 6 | | | swirled in. |
| 7 | July 7, 2020 | Peachy Dream | Ice cream made with fresh |
| 8 | | | Georgia peaches, accented |
| 9 | | | with bits of toasted almonds. |


## It's easy to convert the data to a dataframe

Just as expected, we have three columns of data (yea!). However, we have ten rows of data and we really only want three.  We can change that!

To merge the unwanted rows of data into a single row, we need to tell the tagger where we expect each row's data to begin.  In this case, we expect data to begin in the first column (index 0), so we'll set a new parameter in the extract options specifying that index.

* extract_options - **col_index_with_text**: The index of the column where we expect to see data.  We expect to see data in the 'Date' column, so we'll set the value to 0.
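
To make the effect of this option concrete, here is a minimal sketch of the merging idea in plain Python, using the rows from the dataframe above. It illustrates the concept only and is not the tagger's actual implementation; `merge_continuation_rows` is a hypothetical name.

```python
def merge_continuation_rows(rows, col_index_with_text=0):
    """Fold rows with an empty key column into the preceding row."""
    merged = []
    for row in rows:
        if row[col_index_with_text] or not merged:
            # Text in the key column marks the start of a new logical row.
            merged.append(list(row))
        else:
            # Otherwise append each non-empty cell to the previous row.
            previous = merged[-1]
            for i, cell in enumerate(row):
                if cell:
                    previous[i] = f"{previous[i]} {cell}".strip()
    return merged


rows = [
    ["July 1, 2020", "Rocky Road", "Chocolate ice cream with bits"],
    ["", "", "of marshmallow and nuts"],
    ["", "", "folded in."],
    ["July 8, 2020", "Bananas Foster", "A banana and rum flavored"],
    ["", "", "ice cream with bits of real"],
    ["", "", "banana and brown sugar"],
    ["", "", "swirled in."],
    ["July 7, 2020", "Peachy Dream", "Ice cream made with fresh"],
    ["", "", "Georgia peaches, accented"],
    ["", "", "with bits of toasted almonds."],
]

for row in merge_continuation_rows(rows):
    print(row)
```

Each line that has no value in column 0 is treated as a continuation of the row above it, so the ten extracted lines collapse into three rows.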



In [10]:

# First, remove all existing tags from the document so they don't interfere with our new round of tagging
for t in kodexa_doc.get_root().get_all_tags():
    for n in kodexa_doc.select("//*[hasTag('" + t + "')]"):
        n.remove_tag(t)
    
    
# Use the same options values as before, and add the 'extract_options' parameter 'col_index_with_text' with a value of 0.
pipeline = Pipeline(kodexa_doc)
pipeline.add_step(RemoteAction(slug='kodexa/pattern-table-tagger', 
                                     options={"tag_to_apply": tag_name, 
                                              "page_start_re": page_start_re, 
                                              "table_start_re":table_start_re, 
                                              "extract":True, 
                                                  "extract_options": {'store_name': tag_name, 
                                                                      'col_index_with_text': 0}
                                              }))


context = pipeline.run()
kodexa_doc = pipeline.context.output_document

In [11]:
context.get_store(tag_name).to_df()

| | Date | Flavor of the Day | Description |
|---|---|---|---|
| 0 | July 1, 2020 | Rocky Road | Chocolate ice cream with bits of marshmallow a... |
| 1 | July 8, 2020 | Bananas Foster | A banana and rum flavored ice cream with bits ... |
| 2 | July 7, 2020 | Peachy Dream | Ice cream made with fresh Georgia peaches, acc... |


##  Bingo!  The data is extracted correctly!

### Let's look closer at the header row

When we set the 'extract' parameter to True to extract the table into a store, a number of additional 'extract_options' are available to us.  We've already used two of them - 'store_name' (required) and 'col_index_with_text'.

If you look at the data displayed in the cell above, you'll see the Date, Flavor of the Day, and Description values are all bold, indicating they're the column headers.  

We can see that clearly when we look at the columns on the dataframe:

In [12]:

# Print the name of the columns in the dataframe
context.get_store(tag_name).to_df().columns

Index(['Date', 'Flavor of the Day', 'Description'], dtype='object')

## How do column names get set?

The 'extract_options' parameter includes an option named 'header_lines_count', which defaults to 1.  That means the extract logic expects the first row of data to be the header for the table.

Let's see what happens if we set that value to 0:


In [13]:
# Removing all existing tags
for t in kodexa_doc.get_root().get_all_tags():
    for n in kodexa_doc.select("//*[hasTag('" + t + "')]"):
        n.remove_tag(t)
    
    
# Use the same options values as before, and add the 'extract_options' parameter 'header_lines_count' with a value of 0.
pipeline = Pipeline(kodexa_doc)
pipeline.add_step(RemoteAction(slug='kodexa/pattern-table-tagger', 
                                     options={"tag_to_apply": tag_name, 
                                              "page_start_re": page_start_re, 
                                              "table_start_re":table_start_re, 
                                              "extract":True, 
                                                  "extract_options": {'store_name': tag_name, 
                                                                      'col_index_with_text': 0,
                                                                      'header_lines_count': 0
                                                                     }
                                              }))


context = pipeline.run()
kodexa_doc = pipeline.context.output_document

In [14]:

print(f'Column headers are now:\n\t{context.get_store(tag_name).to_df().columns}')
print('\nDataframe values:')
context.get_store(tag_name).to_df()

Column headers are now:
	Index(['', '', ''], dtype='object')

Dataframe values:


| | | | |
|---|---|---|---|
| 0 | Date | Flavor of the Day | Description |
| 1 | July 1, 2020 | Rocky Road | Chocolate ice cream with bits of marshmallow a... |
| 2 | July 8, 2020 | Bananas Foster | A banana and rum flavored ice cream with bits ... |
| 3 | July 7, 2020 | Peachy Dream | Ice cream made with fresh Georgia peaches, acc... |


## Our column headers are gone!

By setting the header_lines_count value to 0, we've instructed the extract logic that there are no header lines and everything should be included as data.
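
The idea behind this option can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the extract logic itself: `split_header` is a hypothetical name, and joining multi-line headers with a space is an assumption made for this example.

```python
def split_header(rows, header_lines_count=1):
    """Split extracted rows into (header, data_rows)."""
    header_rows = rows[:header_lines_count]
    data_rows = rows[header_lines_count:]
    if header_rows:
        # Combine the header lines column by column (assumed behavior).
        header = [" ".join(filter(None, parts)) for parts in zip(*header_rows)]
    else:
        # No header lines: every column gets an empty name.
        header = [""] * (len(rows[0]) if rows else 0)
    return header, data_rows


rows = [
    ["Date", "Flavor of the Day", "Description"],
    ["July 1, 2020", "Rocky Road", "Chocolate ice cream with bits"],
    ["July 8, 2020", "Bananas Foster", "A banana and rum flavored"],
]

print(split_header(rows, header_lines_count=1))
print(split_header(rows, header_lines_count=0))
```

With a count of 1 the first row becomes the column names; with a count of 0 the column names are empty strings and the former header row stays in the data, which matches the output shown above.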

When would this be helpful?

Sometimes we want to ignore the headers on a table and just select the data...or maybe there are no headers on the table at all.  In this case, we can write our table_start_re selector to select the first data cell, and set the header_lines_count extract option to 0 (zero).
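
Before running the tagger, a candidate table_start_re can be checked locally against a few sample lines with Python's `re` module. This is a minimal sketch: `first_matching_line` is a hypothetical helper, and the use of `re.match` is an assumption, since the tagger's exact matching semantics are not described here.

```python
import re

# Sample lines taken from the parsed document output shown earlier.
lines = [
    "Date Flavor of the Day Description",
    "July 1, 2020 Rocky Road Chocolate ice cream with bits",
    "of marshmallow and nuts",
]


def first_matching_line(pattern, lines):
    """Return the index of the first line the pattern matches, or None."""
    compiled = re.compile(pattern)
    for index, line in enumerate(lines):
        if compiled.match(line):
            return index
    return None


print(first_matching_line(".*", lines))        # 0: the header line
print(first_matching_line(".*July.*", lines))  # 1: the first data line
```

The pattern `.*` matches the header line, while `.*July.*` skips it and first matches the row that begins with a date.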



In [15]:
# Removing all existing tags
for t in kodexa_doc.get_root().get_all_tags():
    for n in kodexa_doc.select("//*[hasTag('" + t + "')]"):
        n.remove_tag(t)
    
    
# Setting our table_start regex to look for data in the first row.  Since we know this data begins with the month of July, we'll use that for our match.
table_start_re = ".*July.*"
    
# Use the same options values as the previous run ('header_lines_count' is still 0); only table_start_re has changed.
pipeline = Pipeline(kodexa_doc)
pipeline.add_step(RemoteAction(slug='kodexa/pattern-table-tagger', 
                                     options={"tag_to_apply": tag_name, 
                                              "page_start_re": page_start_re, 
                                              "table_start_re": table_start_re, 
                                              "extract":True, 
                                                  "extract_options": {'store_name': tag_name, 
                                                                      'col_index_with_text': 0,
                                                                      'header_lines_count': 0
                                                                     }
                                              }))


context = pipeline.run()
kodexa_doc = pipeline.context.output_document

In [16]:

print(f'Column headers are now:\n\t{context.get_store(tag_name).to_df().columns}')
print('\nDataframe values:')
context.get_store(tag_name).to_df()

Column headers are now:
	Index(['', '', ''], dtype='object')

Dataframe values:


| | | | |
|---|---|---|---|
| 0 | July 1, 2020 | Rocky Road | Chocolate ice cream with bits of marshmallow a... |
| 1 | July 8, 2020 | Bananas Foster | A banana and rum flavored ice cream with bits ... |
| 2 | July 7, 2020 | Peachy Dream | Ice cream made with fresh Georgia peaches, acc... |


## Look at that! We were able to extract only the data from the table!

What else can we do with this table?

## Let's transpose the data

The 'extract_options' parameter also includes an option named 'transpose', which defaults to False.  If we set that value to True, the data in the extracted TableDataStore will be transposed: we'll have a column for every date, with the flavors and descriptions as rows.  Let's try that out.
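
As a rough illustration of what transposing means for this table, the same reshaping can be sketched in plain Python with `zip`. Treating the first transposed row as the header is an assumption made for this example; how the store does it internally is not described here. The description cells are shortened to their first extracted line.

```python
rows = [
    ["July 1, 2020", "Rocky Road", "Chocolate ice cream with bits"],
    ["July 8, 2020", "Bananas Foster", "A banana and rum flavored"],
    ["July 7, 2020", "Peachy Dream", "Ice cream made with fresh"],
]

# zip(*rows) turns each original column into a row.
transposed = [list(column) for column in zip(*rows)]

# Assumption for this sketch: the first transposed row (the dates)
# serves as the header, and the remaining rows are data.
header, data = transposed[0], transposed[1:]

print(header)
for row in data:
    print(row)
```

The three dates become the column names, and the flavors and descriptions become the two data rows.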

In [17]:
# Removing all existing tags
for t in kodexa_doc.get_root().get_all_tags():
    for n in kodexa_doc.select("//*[hasTag('" + t + "')]"):
        n.remove_tag(t)
    
    
# Setting our table_start regex to look for data in the first row.  Since we know this data begins with the month of July, we'll use that for our match.
table_start_re = ".*July.*"
    
# Use the same options values as before, and add the 'extract_options' parameter 'transpose' with a value of True.
pipeline = Pipeline(kodexa_doc)
pipeline.add_step(RemoteAction(slug='kodexa/pattern-table-tagger', 
                                     options={"tag_to_apply": tag_name, 
                                              "page_start_re": page_start_re, 
                                              "table_start_re": table_start_re, 
                                              "extract":True, 
                                                  "extract_options": {'store_name': tag_name, 
                                                                      'col_index_with_text': 0,
                                                                      'header_lines_count': 0,
                                                                      'transpose': True
                                                                     }
                                              }))


context = pipeline.run()
kodexa_doc = pipeline.context.output_document

In [18]:
print('\nDataframe values:')
context.get_store(tag_name).to_df()


Dataframe values:


| | July 1, 2020 | July 8, 2020 | July 7, 2020 |
|---|---|---|---|
| 0 | Rocky Road | Bananas Foster | Peachy Dream |
| 1 | Chocolate ice cream with bits of marshmallow a... | A banana and rum flavored ice cream with bits ... | Ice cream made with fresh Georgia peaches, acc... |
