# Working with Bank Statements - Bank of America Example

Example problem:  Bank statements come in every month and must be processed.  These statements may share a similar format but vary depending on customer, content, account types, etc.

This notebook demonstrates how to extract several different categories of account data from a single statement.  

We provide statements for two different months to demonstrate the flexibility of our tagging and extraction process.  The data differs slightly from month to month: some data is present in one month and not the next, and some tables span more than one page in one month but not in the other.  Our PatternTableTagger, used when extracting data from spatially analyzed documents such as PDFs, is flexible enough to handle these types of differences between documents.

## Set up our imports

In [1]:
from kodexa import Document, Pipeline, RemoteAction, KodexaPlatform

CLOUD_URL = 'https://platform.kodexa.com' 

## Set Platform Environment and Access Token Credential

In the next cell, you'll be prompted to enter your access token that you've created in the environment specified by the CLOUD_URL.
If you haven't created a token already, follow the steps in our [Getting Started](https://developer.kodexa.com/kodexa-cloud/accessing-kodexa-cloud) guide.

* Note:  The text you enter in the prompt field will be masked.  Once you're done entering the access token value, hit enter to complete the action in the cell.

In [2]:
import getpass

ACCESS_TOKEN = getpass.getpass("Enter access token:")

KodexaPlatform.set_url(CLOUD_URL)
KodexaPlatform.set_access_token(ACCESS_TOKEN)

Enter access token: ································


## First, parse the PDF

We'll start by constructing a pipeline that parses the PDF.  We'll use this parsed Kodexa document for the rest of our processing.

There are two sample documents provided in the _data folder - try parsing each of these documents and processing them to see how the data is extracted for each (parse & process one at a time).

In [3]:

# Set up a pipeline that parses the PDF.  We're doing this as a separate piece of work
# so we can spend time really digging into the table tagging parameters later


# May statement data
pipeline = Pipeline.from_file('_data/BofA_2020_05_07.pdf')

# June statement data
#pipeline = Pipeline.from_file('_data/BofA_2020_06_09.pdf')
pipeline.add_step(RemoteAction(slug='kodexa/pdf-parser', attach_source=True))
pipeline.run()

kodexa_doc = pipeline.context.output_document

## Next, we'll tag and extract the checking Account Summary table

When tagging a table in a PDF document, we focus on the areas of the document above and below the table.  Tables may or may not have headers, may span multiple pages, and may have rows of varying heights.  We have parameters that can control for all of those variations in presentation.  Let's start with the basics!

We'll be using the Pattern-based table tagger (slug 'kodexa/pattern-table-tagger') to identify tables in our PDF documents.  That tagger identifies and tags a table using text patterns and then leverages spatial awareness to find columns and rows.

The pattern-table-tagger has three required parameters:
* **tag_to_apply**:  The tag we'll apply to the table once it's identified.
* **page_start_re**: A regular expression that identifies the page the table will be found on.  This could be a page number or any bit of text that appears on the page before the table contents begin.  If the table is the only data on the page, you can enter some text you expect to be in the table or an empty string.
* **table_start_re**: A regular expression that identifies a line of text that starts the table, such as the column header line or other identifier.

In addition to identifying the page_start and table_start, we're going to supply a few other parameters in order to extract specific table data:
* **page_end_re**:  A regular expression that identifies the page the table will end on.  This could be the format of a page number or any bit of text that appears on the page after the table contents end.
* **table_end_re**:  A regular expression used to identify a line that is at the end of the table - important when the table spans multiple pages.
* **include_end_line**:  Boolean indicating that the line identified by table_end_re should be included in the tagged table data so it's available for extraction.


We're also going to set the 'extract' parameter to True so we can access the tagged table as a TableDataStore.  That store can be directly converted to a pandas dataframe, which is a familiar data structure for most Python developers.

* **extract**: Boolean value indicating that the tagged table should be added to the pipeline context's stores
* extract_options - **store_name**: The name of the extracted table's TableDataStore
* extract_options - **header_lines_count**: The number of lines we expect to be header data (default is 1).  Since our summary tables don't have named headers above the table data, we're setting that value to zero.
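
To make the roles of the start and end patterns more concrete, here is a minimal sketch that uses only Python's `re` module.  It is not the tagger's implementation (the real tagger also relies on spatial information and page boundaries); the `find_table_lines` helper and the sample lines are hypothetical and exist only to show how `table_start_re`, `table_end_re`, and `include_end_line` interact.

```python
import re


def find_table_lines(lines, table_start_re, table_end_re, include_end_line=False):
    """Collect lines from the first start match up to the next end match."""
    start = re.compile(table_start_re)
    end = re.compile(table_end_re)
    collected = []
    in_table = False
    for line in lines:
        if not in_table:
            # Keep scanning until a line matches the table start pattern
            if start.match(line):
                in_table = True
                collected.append(line)
            continue
        if end.match(line):
            # The end line is only kept when include_end_line is set
            if include_end_line:
                collected.append(line)
            break
        collected.append(line)
    return collected


# Hypothetical page text, loosely modeled on the summary table in this notebook
sample_lines = [
    "Account summary",
    "Beginning balance on April 10, 2020 $4,181.74",
    "Deposits and other additions 14,118.32",
    "Ending balance on May 7, 2020 $4,340.23",
    "Page 1 of 6",
]

with_end = find_table_lines(
    sample_lines, r"^Beginning balance.*", r"^Ending balance.*$", include_end_line=True
)
without_end = find_table_lines(sample_lines, r"^Beginning balance.*", r"^Ending balance.*$")
print(len(with_end), len(without_end))
```

In this sketch the ending balance line is only part of the result when `include_end_line` is True, which is why we set that option for the summary tables.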


In [15]:
# Remove all existing tags from the document (if this cell has already been processed) so they don't interfere with our new round of tagging
for t in kodexa_doc.get_root().get_all_tags():
    [n.remove_tag(t) for n in kodexa_doc.select("//*[hasTag('" + t + "')]")]
    

# Setting up the regex values for this table
checking_account_summary_table_tag_name = "Checking Account Summary"
checking_page_start_re = "^Your.*Ch.*"
summary_table_start_re = "^Beginning balance.*"
summary_table_end_re = "^Ending balance.*$"
page_number_re = r".*Page \d+ of \d+$"  # raw string so \d is not treated as a string escape

pipeline = Pipeline(kodexa_doc)
pipeline.add_step(RemoteAction(slug='kodexa/pattern-table-tagger', 
                                     options={"tag_to_apply": checking_account_summary_table_tag_name, 
                                              "page_start_re": checking_page_start_re, 
                                              "page_end_re": page_number_re,
                                              "table_start_re":summary_table_start_re, 
                                              "table_end_re": summary_table_end_re,
                                              "include_end_line" : True,  #This is set so the summary Ending Balance line is also extracted
                                              "extract":True, 
                                              "extract_options": {'store_name': checking_account_summary_table_tag_name, 
                                                                 'header_lines_count': 0}
                                              }))


context = pipeline.run()


In [16]:
if context.get_store(checking_account_summary_table_tag_name):
    checking_summary_df = context.get_store(checking_account_summary_table_tag_name).to_df()
    print(f'There are {len(checking_summary_df.columns)} columns which have header values of: {checking_summary_df.columns}')
    display(checking_summary_df)
else:
    print("No checking summary extracted")
    

There are 2 columns which have header values of: Index(['', ''], dtype='object')


| | | |
|---|---|---|
| 0 | Beginning balance on April 10, 2020 | $4,181.74 |
| 1 | Deposits and other additions | 14118.32 |
| 2 | Withdrawals and other subtractions | -13959.83 |
| 3 | Checks | -0.00 |
| 4 | Service fees | -0.00 |
| 5 | Ending balance on May 7, 2020 | $4,340.23 |


## Examine the extracted data

We were able to identify and extract the data as expected (yea!).  Take a look at the column names - you'll see there aren't any.  We have two columns but both are identified with an empty string.  This table doesn't have a proper table header to reference, so there are no descriptive column names set for the dataframe.
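
If you want descriptive names for downstream work, you can assign them yourself once the data is in a dataframe.  The sketch below is one possible approach, not part of the extraction itself: it builds a small stand-in dataframe by hand (so it runs without the platform), and the names `description` and `amount` are hypothetical choices.  It also assumes the amount cells may arrive as strings with currency formatting, which may or may not be true of your extracted data.

```python
import pandas as pd

# Stand-in for the extracted summary: two columns, both named with an empty string
summary_df = pd.DataFrame(
    [
        ["Beginning balance on April 10, 2020", "$4,181.74"],
        ["Deposits and other additions", "14118.32"],
        ["Withdrawals and other subtractions", "-13959.83"],
        ["Ending balance on May 7, 2020", "$4,340.23"],
    ],
    columns=["", ""],
)

# Assign names by position (hypothetical names; choose whatever suits your code)
summary_df.columns = ["description", "amount"]

# Strip currency symbols and thousands separators, then convert to numbers
summary_df["amount"] = pd.to_numeric(
    summary_df["amount"].astype(str).str.replace(r"[$,]", "", regex=True)
)
print(summary_df)
```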


## Get the Savings summary

We'll use a few of the same regex values that we defined for the checking summary extraction and add a few new ones that identify savings summary data.  If you look at the original PDF, you'll be able to see that the savings summary table is titled "Account Summary", same as the checking summary.  The savings summary table begins on a page with a heading of "Your Regular Savings".
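
Because both summary tables share the same start and end text, the page start pattern is what tells them apart.  The quick check below uses plain `re` to show that idea.  "Your Regular Savings" is the heading mentioned above; the checking heading is a hypothetical stand-in, since its exact wording isn't shown here, and whether the tagger applies patterns with `match` or `search` semantics is an assumption on our part.

```python
import re

checking_page_start_re = "^Your.*Ch.*"
savings_page_start_re = "^Your.*Sa.*"

# The first heading is a hypothetical stand-in; the second comes from the statement
headings = ["Your Example Checking", "Your Regular Savings"]

for heading in headings:
    is_checking = bool(re.match(checking_page_start_re, heading))
    is_savings = bool(re.match(savings_page_start_re, heading))
    print(f"{heading!r}: checking={is_checking}, savings={is_savings}")
```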


In [17]:
# Remove all existing tags from the document (if this cell has already been processed) so they don't interfere with our new round of tagging
for t in kodexa_doc.get_root().get_all_tags():
    [n.remove_tag(t) for n in kodexa_doc.select("//*[hasTag('" + t + "')]")]

    
# Using the already processed PDF, so we can skip parsing
pipeline = Pipeline(kodexa_doc)

# Setting up the regex values for this table
savings_account_summary_table_tag_name = "Savings Account Summary"
savings_page_start_re = "^Your.*Sa.*"

pipeline.add_step(RemoteAction(slug='kodexa/pattern-table-tagger', 
                                     options={"tag_to_apply": savings_account_summary_table_tag_name, 
                                              "page_start_re": savings_page_start_re, 
                                              "page_end_re": page_number_re,
                                              "table_start_re":summary_table_start_re, 
                                              "table_end_re": summary_table_end_re,
                                              "include_end_line" : True,  #This is set so the summary Ending Balance line is also extracted
                                              "extract":True, 
                                              "extract_options": {'store_name': savings_account_summary_table_tag_name, 
                                                                  'header_lines_count': 0}
                                              }))


context = pipeline.run()

In [18]:
if context.get_store(savings_account_summary_table_tag_name):
    savings_summary_df = context.get_store(savings_account_summary_table_tag_name).to_df()
    print(f'There are {len(savings_summary_df.columns)} columns which have header values of: {savings_summary_df.columns}')
    display(savings_summary_df)
else:
    print("No savings summary extracted")

There are 2 columns which have header values of: Index(['', ''], dtype='object')


| | | |
|---|---|---|
| 0 | Beginning balance on April 10, 2020 | $1,601.51 |
| 1 | Deposits and other additions | 275.01 |
| 2 | Withdrawals and other subtractions | -0.00 |
| 3 | Service fees | -0.00 |
| 4 | Ending balance on May 7, 2020 | $1,876.52 |


## Examine the savings account summary output

Again, we can see that our data was tagged and extracted as expected, and we can see that the columns in the dataframe have not been named.  This summary table doesn't have a proper table header either, so no descriptive names are set.
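
Each tagging cell in this notebook starts with the same loop that clears existing tags.  If you find the repetition distracting, you could wrap it in a small helper.  The sketch below uses only the three document and node methods that already appear in the cells above; the `clear_all_tags` name is our own, and the `FakeNode` and `FakeDocument` classes are stand-ins (not Kodexa classes) so the example runs without the SDK.

```python
def clear_all_tags(document):
    """Remove every tag from every tagged node in the document."""
    for tag in list(document.get_root().get_all_tags()):
        for node in document.select("//*[hasTag('" + tag + "')]"):
            node.remove_tag(tag)


# Minimal stand-ins so the sketch runs without the Kodexa SDK.  They imitate
# only the methods used above and are NOT the real Kodexa classes.
class FakeNode:
    def __init__(self, tags):
        self.tags = set(tags)

    def get_all_tags(self):
        return sorted(self.tags)

    def remove_tag(self, tag):
        self.tags.discard(tag)


class FakeDocument:
    def __init__(self, root):
        self.root = root

    def get_root(self):
        return self.root

    def select(self, selector):
        # Only understands the "//*[hasTag('name')]" form used in this notebook
        tag = selector.split("'")[1]
        return [self.root] if tag in self.root.tags else []


doc = FakeDocument(FakeNode(["Checking Account Summary", "Savings Account Summary"]))
clear_all_tags(doc)
print(doc.get_root().get_all_tags())
```

With a parsed document you would call `clear_all_tags(kodexa_doc)` at the top of a tagging cell in place of the loop.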

## Get the detailed deposit and withdrawal information.

Starting with checking, we'll get details for deposits and withdrawals.

### New parameters for detailed transaction tables

Take a look at the source PDF and you'll see that the detailed transaction sections (checking/savings deposits & withdrawals) all have a table header of "Date", "Description", and "Amount".  Our summary tables did not have header information, so we explicitly set the "header_lines_count" value to 0.  Since our detailed information has headers, we'll change that value to 1.

Refer to the source PDF once again and you'll see that the data in the detailed transaction sections sometimes spans more than one line.  We can collapse that multi-line data into a single row by setting the "col_index_with_text" parameter to zero.

* extract_options - **header_lines_count**: The number of lines we expect to be header data (default is 1).  We're explicitly setting the value to 1 for this example, but since 1 is the default, you could choose to omit this parameter altogether.
* extract_options - **col_index_with_text**: The index of the column where we expect to see data.  We expect to see data in the 'Date' column, so we'll set the value to 0.
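
To picture what collapsing multi-line rows means, here is a rough, plain-Python sketch of the idea.  It is an illustration under our own assumptions, not how the platform implements `col_index_with_text`: the `collapse_rows` helper and the sample rows are hypothetical, and we assume a continuation line is one that has no text in the chosen column.

```python
def collapse_rows(rows, col_index_with_text=0):
    """Merge rows whose anchor column is empty into the preceding row."""
    merged = []
    for row in rows:
        if row[col_index_with_text].strip() or not merged:
            # Text in the anchor column (or nothing to merge into): start a new row
            merged.append(list(row))
        else:
            # No text in the anchor column: treat the line as a continuation
            previous = merged[-1]
            for i, cell in enumerate(row):
                if cell.strip():
                    previous[i] = (previous[i] + " " + cell.strip()).strip()
    return merged


# Hypothetical rows: the second line continues the first transaction's description
rows = [
    ["04/10/20", "PLACE FOR JOBS DES:Payroll", "4515.02"],
    ["", "ID:CER000XX8XX2", ""],
    ["04/28/20", "Transfer JANE SMITH", "3015.91"],
]
collapsed = collapse_rows(rows, col_index_with_text=0)
print(collapsed)
```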



In [19]:
# Remove all existing tags from the document (if this cell has already been processed) so they don't interfere with our new round of tagging
for t in kodexa_doc.get_root().get_all_tags():
    [n.remove_tag(t) for n in kodexa_doc.select("//*[hasTag('" + t + "')]")]

    
# Again, skip parsing by using the same doc we've already parsed
pipeline = Pipeline(kodexa_doc)

# Setting up the regex values for this table
checking_deposits_table_tag_name = "Checking Account Deposits"
detailed_transactions_header_re = r'^Date\s+Description\s+Amount$'  # raw string so \s is not treated as a string escape
detailed_deposits_table_end = '^Total deposits.*'
continued_re = '^continued.*$'

deposit_re = '^Deposits.*additions$'

pipeline.add_step(RemoteAction(slug='kodexa/pattern-table-tagger', 
                                     options={"tag_to_apply": checking_deposits_table_tag_name, 
                                              "page_start_re": deposit_re,  #since these are the checking deposits, we'll look for them in the checking section
                                              "page_end_re": detailed_deposits_table_end, #this is text that appears at the end of the table - after it has spanned all pages
                                              "table_start_re":detailed_transactions_header_re, 
                                              "table_end_re": continued_re,
                                              "extract":True, 
                                              "extract_options": {'store_name': checking_deposits_table_tag_name, 
                                                                  'header_lines_count': 1,
                                                                  'col_index_with_text': 0} #need the col_index_with_text to get all the entries on one line
                                              }))


context = pipeline.run()


In [20]:
if context.get_store(checking_deposits_table_tag_name):
    checking_deposit_df = context.get_store(checking_deposits_table_tag_name).to_df()
    print(f'There are {len(checking_deposit_df.columns)} columns which have header values of: {checking_deposit_df.columns}')
    display(checking_deposit_df)
else:
    print("No checking deposit detail information extracted")

There are 4 columns which have header values of: Index(['Date', 'Description', '', 'Amount'], dtype='object')


| | Date | Description | | Amount |
|---|---|---|---|---|
| 0 | 04/10/20 | PLACE FOR JOBS DES:Payroll ID:CER000XX8XX2 IND... | CO | 4515.02 |
| 1 | 04/15/20 | ABC ENTERPRISES, L DES:PAYROLL ID:01X2000-XXX-... | CO | 1036.14 |
| 2 | 04/24/20 | PLACE FOR JOBS DES:Payroll ID:CER000XX8XX2 IND... | CO | 4515.03 |
| 3 | 04/28/20 | Transfer JANE SMITH | | 3015.91 |
| 4 | 04/30/20 | ABC ENTERPRISES, L DES:PAYROLL ID:01X2000-XXX-... | CO | 1036.22 |


## Examine the checking deposit detail output

It looks like our table was tagged and our data was extracted, but we've got 4 columns instead of the expected 3.  The alignment of the text "CO", spaced far off the rest of the description text, makes it look like that data could be in its own column.  We know that the data really belongs to the "Description" column, so we're going to provide a new parameter so the data is merged correctly.

* **col_marker_re**:  A regular expression that identifies a row in the table whose text positions indicate where each column sits.  In this case, we have a header row that indicates column positions.  If we didn't have a header row, we could use some other row in the table to serve as the 'master' position indicator.  Note - this option is not always necessary, but it is useful when data is irregularly spaced within a column.  In the code below, it is passed alongside the other tagger options rather than inside extract_options.
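
The idea of using one row as the position reference can be sketched in a few lines of plain Python.  This is only an illustration of the concept, not the tagger's algorithm: the `assign_to_columns` helper, the x positions, and the text fragments are all hypothetical, and we assume each fragment belongs to the nearest column whose left edge is at or before it.

```python
from bisect import bisect_right


def assign_to_columns(fragments, column_starts):
    """Place (x, text) fragments into the column whose left edge precedes them."""
    cells = [[] for _ in column_starts]
    for x, text in sorted(fragments):
        # Index of the last column start that is <= x (clamped to the first column)
        index = max(bisect_right(column_starts, x) - 1, 0)
        cells[index].append(text)
    return [" ".join(parts) for parts in cells]


# Hypothetical left edges taken from a marker row: Date, Description, Amount
column_starts = [0, 80, 500]

# Hypothetical fragments from one transaction line; "CO" sits far to the right
# of the rest of the description but still left of the Amount column
fragments = [
    (0, "04/10/20"),
    (80, "PLACE FOR JOBS DES:Payroll"),
    (420, "CO"),
    (505, "4515.02"),
]

row = assign_to_columns(fragments, column_starts)
print(row)
```

With the marker row defining three column positions, the stray "CO" fragment lands in the Description cell instead of forming a fourth column.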


## Set the new col_marker_re option and execute the extraction again

Once again, extracting detailed checking deposit information.

In [21]:
# Remove all existing tags from the document (if this cell has already been processed) so they don't interfere with our new round of tagging
for t in kodexa_doc.get_root().get_all_tags():
    [n.remove_tag(t) for n in kodexa_doc.select("//*[hasTag('" + t + "')]")]

    
# Using the same doc we've already parsed
pipeline = Pipeline(kodexa_doc)
pipeline.add_step(RemoteAction(slug='kodexa/pattern-table-tagger', 
                                     options={"tag_to_apply": checking_deposits_table_tag_name, 
                                              "page_start_re": deposit_re,  #since these are the checking deposits, we'll look for them in the checking section
                                              "page_end_re": detailed_deposits_table_end, #this is text that appears at the end of the table - after it has spanned all pages
                                              "table_start_re":detailed_transactions_header_re, 
                                              "table_end_re": continued_re,
                                              "col_marker_re" : detailed_transactions_header_re,
                                              "extract":True, 
                                              "extract_options": {'store_name': checking_deposits_table_tag_name, 
                                                                  'header_lines_count': 1,
                                                                  'col_index_with_text': 0} #need the col_index_with_text to get all the entries on one line
                                              }))


context = pipeline.run()


In [22]:
if context.get_store(checking_deposits_table_tag_name):
    checking_deposit_df = context.get_store(checking_deposits_table_tag_name).to_df()
    print(f'There are {len(checking_deposit_df.columns)} columns which have header values of: {checking_deposit_df.columns}')
    display(checking_deposit_df)
else:
    print("No checking deposit detail information extracted")

There are 3 columns which have header values of: Index(['Date', 'Description', 'Amount'], dtype='object')


| | Date | Description | Amount |
|---|---|---|---|
| 0 | 04/10/20 | PLACE FOR JOBS DES:Payroll ID:CER000XX8XX2 IND... | 4515.02 |
| 1 | 04/15/20 | ABC ENTERPRISES, L DES:PAYROLL ID:01X2000-XXX-... | 1036.14 |
| 2 | 04/24/20 | PLACE FOR JOBS DES:Payroll ID:CER000XX8XX2 IND... | 4515.03 |
| 3 | 04/28/20 | Transfer JANE SMITH | 3015.91 |
| 4 | 04/30/20 | ABC ENTERPRISES, L DES:PAYROLL ID:01X2000-XXX-... | 1036.22 |


## Fantastic!  Now we have the expected number of columns!

Since the format of the checking withdrawal table is the same as that used for the checking deposit table, we'll use that "col_marker_re" parameter for the next table data extraction as well.
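
Before moving on, one quick sanity check you might run is to compare the extracted deposit rows against the "Deposits and other additions" figure from the checking summary.  The sketch below copies the amounts shown in the outputs above into a stand-in dataframe so it runs on its own; it assumes the Amount column is numeric (if yours holds strings, convert it with `pd.to_numeric` first).

```python
import math

import pandas as pd

# Amounts copied from the checking deposit output shown above
deposits = pd.DataFrame({"Amount": [4515.02, 1036.14, 4515.03, 3015.91, 1036.22]})

# "Deposits and other additions" from the Checking Account Summary output
summary_deposits = 14118.32

detail_total = deposits["Amount"].sum()

# Compare with a small tolerance to avoid floating point surprises
totals_match = math.isclose(detail_total, summary_deposits, abs_tol=0.005)
print(f"detail total = {detail_total:.2f}, summary = {summary_deposits:.2f}, match = {totals_match}")
```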

### Checking withdrawals

Again, we'll use the same approach (and several of the same regex values) for this table as we did for the deposits table.
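
Since the detailed transaction tables share most of their options, you could build the options dictionary with a small helper instead of repeating it.  This is just a convenience sketch: the `detail_table_options` function is our own (hypothetical) wrapper, and it only assembles the option keys already used in this notebook.

```python
def detail_table_options(tag_name, page_start_re, page_end_re, table_start_re,
                         table_end_re, col_marker_re=None):
    """Build the options dict shared by the detailed transaction tables."""
    options = {
        "tag_to_apply": tag_name,
        "page_start_re": page_start_re,
        "page_end_re": page_end_re,
        "table_start_re": table_start_re,
        "table_end_re": table_end_re,
        "extract": True,
        "extract_options": {
            "store_name": tag_name,
            "header_lines_count": 1,
            "col_index_with_text": 0,
        },
    }
    # Only include the column marker when one is supplied
    if col_marker_re is not None:
        options["col_marker_re"] = col_marker_re
    return options


header_re = r"^Date\s+Description\s+Amount$"
withdrawal_options = detail_table_options(
    tag_name="Checking Account Withdrawals",
    page_start_re="^W.*als.*subtractions$",
    page_end_re="^Total withd.*",
    table_start_re=header_re,
    table_end_re="^continued.*$",
    col_marker_re=header_re,
)
print(withdrawal_options)
```

The resulting dictionary can be passed as the `options` argument of `RemoteAction(slug='kodexa/pattern-table-tagger', ...)`, exactly as the inline dictionaries are in the cells of this notebook.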

In [23]:
# Remove all existing tags from the document (if this cell has already been processed) so they don't interfere with our new round of tagging
for t in kodexa_doc.get_root().get_all_tags():
    [n.remove_tag(t) for n in kodexa_doc.select("//*[hasTag('" + t + "')]")]


# Using the previously parsed kodexa_doc
pipeline = Pipeline(kodexa_doc)

# Setting up the regex values for this table
checking_withdrawals_page_start_re = "^W.*als.*subtractions$"
checking_withdrawals_table_tag_name = "Checking Account Withdrawals"
detailed_withdrawal_table_end = '^Total withd.*'

pipeline.add_step(RemoteAction(slug='kodexa/pattern-table-tagger', 
                                     options={"tag_to_apply": checking_withdrawals_table_tag_name, 
                                              "page_start_re": checking_withdrawals_page_start_re,  #since these are the checking withdrawals, we'll look for them in the checking section
                                              "page_end_re": detailed_withdrawal_table_end, #this is text that appears at the end of the table - after it has spanned all pages
                                              "table_start_re":detailed_transactions_header_re, 
                                              "table_end_re": continued_re,
                                              "col_marker_re" : detailed_transactions_header_re,
                                              "extract":True, 
                                              "extract_options": {'store_name': checking_withdrawals_table_tag_name, 
                                                                  'header_lines_count': 1,
                                                                  'col_index_with_text': 0} #need the col_index_with_text to get all the entries on one line
                                              }))


context = pipeline.run()


In [24]:
if context.get_store(checking_withdrawals_table_tag_name):
    checking_withdrawals_df = context.get_store(checking_withdrawals_table_tag_name).to_df()
    print(f'There are {len(checking_withdrawals_df.columns)} columns which have header values of: {checking_withdrawals_df.columns}')
    display(checking_withdrawals_df)
else:
    print("No checking withdrawal detail information extracted")
    

There are 3 columns which have header values of: Index(['Date', 'Description', 'Amount'], dtype='object')


| | Date | Description | Amount |
|---|---|---|---|
| 0 | 04/10/20 | CAPITAL ONE DES:ONLINE PMT ID:0101399XXXX656 I... | -50.0 |
| 1 | 04/13/20 | Low e s C C DES:LWS EPAY ID:233XXXX2 INDN: 79X... | -74.89 |
| 2 | 04/14/20 | BANK DES:$TRANSFER ID:2XXXXXXX1 INDN:JANE SMIT... | -300.0 |
| 3 | 04/14/20 | LIFE DES:INSUR PREM ID:P 2AXXXXX756 INDN:SMITH... | -61.63 |
| 4 | 04/15/20 | AMERICAN EXPRESS DES:ACH PMT ID:WXXX2 INDN:JAN... | -194.94 |
| 5 | 04/16/20 | BKOFAMERICA ATM 04/16 #000XXXXX2 WITHDRWL CHAR... | -300.0 |
| 6 | 04/16/20 | BANK DES:$TRANSFER ID:2XXXXXXX1 INDN:JANE SMIT... | -75.0 |
| 7 | 04/17/20 | WATER DES:UTIL-PMNT ID:22XXXX1 INDN:SMITH JOE ... | -37.82 |
| 8 | 04/20/20 | BKOFAMERICA ATM 04/18 #000XXXXX2 WITHDRWL CHAR... | -40.0 |
| 9 | 04/20/20 | CHECKCARD 0419 PAYPAL *EBAYINCSHIP 402-935-773... | -8.76 |


## And now detailed Savings information

### Savings Deposits

In [25]:
# Remove all existing tags from the document (if this cell has already been processed) so they don't interfere with our new round of tagging
for t in kodexa_doc.get_root().get_all_tags():
    [n.remove_tag(t) for n in kodexa_doc.select("//*[hasTag('" + t + "')]")]


pipeline = Pipeline(kodexa_doc)

# Setting up the regex values for this table
savings_deposit_page_start_re = "^Your.*Sav.*"
savings_deposits_table_tag_name = "Savings Account Deposits"

pipeline.add_step(RemoteAction(slug='kodexa/pattern-table-tagger', 
                                     options={"tag_to_apply": savings_deposits_table_tag_name, 
                                              "page_start_re": savings_deposit_page_start_re,  #since these are the savings deposits, we'll look for them in the savings section
                                              "page_end_re": detailed_deposits_table_end, #this is text that appears at the end of the table - after it has spanned all pages
                                              "table_start_re":detailed_transactions_header_re, 
                                              "table_end_re": continued_re,
                                              "extract":True, 
                                              "extract_options": {'store_name': savings_deposits_table_tag_name, 
                                                                  'header_lines_count': 1,
                                                                  'col_index_with_text': 0} #need the col_index_with_text to get all the entries on one line
                                              }))


context = pipeline.run()


In [26]:
if context.get_store(savings_deposits_table_tag_name):
    savings_deposits_df = context.get_store(savings_deposits_table_tag_name).to_df()
    print(f'There are {len(savings_deposits_df.columns)} columns which have header values of: {savings_deposits_df.columns}')
    display(savings_deposits_df)
else:
    print('No savings deposit detail information extracted')

There are 3 columns which have header values of: Index(['Date', 'Description', 'Amount'], dtype='object')


| | Date | Description | Amount |
|---|---|---|---|
| 0 | 05/01/20 | Automatic Transfer from CHK 0000 Confirmation#... | 275.0 |
| 1 | 05/07/20 | Interest Earned | 0.01 |


### Savings Withdrawals

In [27]:
# Remove all existing tags from the document (if this cell has already been processed) so they don't interfere with our new round of tagging
for t in kodexa_doc.get_root().get_all_tags():
    [n.remove_tag(t) for n in kodexa_doc.select("//*[hasTag('" + t + "')]")]


pipeline = Pipeline(kodexa_doc)

# Setting up the regex values for this table
savings_withdrawals_page_start_re = "^Your.*Sav.*$"
savings_withdrawals_table_tag_name = "Savings Account Withdrawals"

pipeline.add_step(RemoteAction(slug='kodexa/pattern-table-tagger', 
                                     options={"tag_to_apply": savings_withdrawals_table_tag_name, 
                                              "page_start_re": savings_withdrawals_page_start_re,  #since these are the savings withdrawals, we'll look for them in the savings section
                                              "page_end_re": detailed_withdrawal_table_end, #this is text that appears at the end of the table - after it has spanned all pages
                                              "table_start_re":'^Withdrawals and other subtractions$', 
                                              "table_end_re": continued_re,
                                              "extract":True, 
                                              "extract_options": {'store_name': savings_withdrawals_table_tag_name, 
                                                                  'header_lines_count': 1,
                                                                  'col_index_with_text': 0} #need the col_index_with_text to get all the entries on one line
                                              }))


context = pipeline.run()


In [28]:
if context.get_store(savings_withdrawals_table_tag_name):
    savings_withdrawals_df = context.get_store(savings_withdrawals_table_tag_name).to_df()
    print(f'There are {len(savings_withdrawals_df.columns)} columns which have header values of: {savings_withdrawals_df.columns}')
    display(savings_withdrawals_df)
    
else:
    print('No savings withdrawal detail information extracted')

No savings withdrawal detail information extracted

This result is consistent with the savings summary extracted earlier, which lists -0.00 for "Withdrawals and other subtractions" in this statement period.
