# Extracting Data from Bank Statements

In this notebook, we'll walk through a couple of simple starting points for working with bank statements.

**Note**: this is just an initial setup. A complete solution would need to go further, but this shows how quickly we can get started.

All of our processing will occur in Kodexa's cloud environment. To access the platform, you'll need to register for an account and generate an access token. If you haven't done that already, follow the steps in our [Getting Started](https://developer.kodexa.com/kodexa-cloud/accessing-kodexa-cloud) guide.


## Set up our imports

1. Our actions occur in the cloud, so we'll need to import KodexaPlatform and RemoteAction.
2. We'll be building pipelines to process our document, so we'll import Kodexa's Pipeline.
3. All files that have been processed/parsed in Kodexa become Kodexa Documents, so we'll import Document as well.

We're also setting the CLOUD_URL value to the platform environment on which we want to perform our processing.


In [10]:
from kodexa import Document, Pipeline, RemoteAction, KodexaPlatform

#CLOUD_URL = 'https://platform.kodexa.com' 
CLOUD_URL = 'https://quantum.kodexa.com' 

## Set Platform Environment and Access Token Credential

In the next cell, you'll be prompted to enter the access token you created in the environment specified by CLOUD_URL.
If you haven't created a token yet, follow the steps in our [Getting Started](https://developer.kodexa.com/org-management/manage-access-token) guide.

* Note: The text you enter in the prompt field will be masked. Once you've entered the access token value, press Enter to complete the action in the cell. **You will then need to manually select the next cell to continue.**

In [12]:
import getpass

ACCESS_TOKEN = getpass.getpass("Enter access token:")

KodexaPlatform.set_url(CLOUD_URL)
KodexaPlatform.set_access_token(ACCESS_TOKEN)
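
When the notebook is run unattended, the masked prompt above can be a nuisance. The sketch below reads the token from an environment variable and falls back to the prompt only when the variable is unset. The variable name `KODEXA_ACCESS_TOKEN` and the helper `resolve_access_token` are assumptions made for illustration, not part of the Kodexa library.

```python
import getpass
import os


def resolve_access_token(env_var="KODEXA_ACCESS_TOKEN", prompt="Enter access token:"):
    """Return the token from the environment, or prompt for it if unset.

    The default environment variable name is a hypothetical choice.
    """
    token = os.environ.get(env_var, "").strip()
    if token:
        return token
    # Fall back to the same masked prompt used in the cell above.
    return getpass.getpass(prompt).strip()
```

The result could then be passed to `KodexaPlatform.set_access_token(...)` exactly as `ACCESS_TOKEN` is in the cell above.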

In [13]:
import os

# Setting up location of data file
DATA_FOLDER = '_data'
PDF_FOLDER = 'pdfs'
DATA_FILE = 'USBankSample.pdf'

FULL_PATH = os.path.join(os.getcwd(), '..', DATA_FOLDER, PDF_FOLDER, DATA_FILE)

print(f'\nThis is where the bank statement is located: {FULL_PATH}\n')


This is where the bank statement is located: /home/skep/Projects/Kodexa/get-started-with-python/3_Advanced_Examples/../_data/pdfs/USBankSample.pdf
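
The printed path still contains the `..` segment, and nothing so far confirms that the file exists. As a minimal sketch, the helper below (a hypothetical name, `resolve_statement_path`) normalizes the path with `pathlib` and fails early with a clear message if the PDF is missing.

```python
from pathlib import Path


def resolve_statement_path(base_dir, *parts):
    """Join the parts onto base_dir, normalize the result, and check the file exists."""
    path = Path(base_dir, *parts).resolve()  # collapses '..' segments
    if not path.is_file():
        raise FileNotFoundError(f"Bank statement not found: {path}")
    return path
```

For this notebook the call would look like `resolve_statement_path(os.getcwd(), '..', DATA_FOLDER, PDF_FOLDER, DATA_FILE)`, with `str(...)` applied to the result if a plain string is needed.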



## Define a basic pipeline

This is a basic pipeline to show the concept using a sample statement (found on the internet).

In [14]:
# Create the pipeline

pipeline = Pipeline.from_file(FULL_PATH)
pipeline.add_step(RemoteAction(slug='kodexa/pdf-parser', 
                               options={"layout_analysis_options":{"rollup":"word","space_multiplier":1},
                                      "analyze_layout":True},
                               attach_source=True))

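col_space_multiplier = 3.0

# Raw strings keep the regex backslashes (\d, \s, \() from being read as Python escapes.
# page_number_re and continued_re are defined here but not used by the steps in this cell.
page_number_re = r".*Page \d+ of \d+$"

transactions_header_re = r'^Date\s+Description.*\s+Amount$'
continued_re = r'^.*\(continued\)$'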

# Extract Other Deposits
other_deposits_table_tag_name = "Other Deposits"
pipeline.add_step(RemoteAction(slug='kodexa/pattern-table-tagger',
                                     options={"col_space_multiplier": col_space_multiplier,
                                              "tag_to_apply": other_deposits_table_tag_name,
                                              "page_start_re": '^Other Deposits$',
                                              "page_end_re": r'^Total Other Deposits.*\d{2}$',
                                              "table_start_re": transactions_header_re,
                                              "table_end_re": '^BALANCE YOUR ACCOUNT$',
                                              "col_marker_re": transactions_header_re,
                                              "insert_col_index": 2,
                                              "extract": True,
                                              "extract_options": {'store_name': other_deposits_table_tag_name,
                                                                  'header_lines_count': 1,
                                                                  'col_index_with_text': 0}
                                             }))


# Extract Card Withdrawals
card_withdrawals_table_tag_name = "Card Withdrawals"
pipeline.add_step(RemoteAction(slug='kodexa/pattern-table-tagger', 
                                     options={"col_space_multiplier": col_space_multiplier,
                                              "tag_to_apply": card_withdrawals_table_tag_name,
                                              "page_start_re": '^Card Withdrawals$',
                                              "page_end_re": '^Card.*Subtotal.*',
                                              "table_start_re": transactions_header_re,
                                              "table_end_re": '^Card.*Subtotal.*',
                                              "col_marker_re": transactions_header_re,
                                              "insert_col_index": 2,
                                              "extract": True,
                                              "extract_options": {'store_name': card_withdrawals_table_tag_name,
                                                                  'header_lines_count': 1,
                                                                  'col_index_with_text': 0}
                                             }))


# Extract Other Withdrawals
other_withdrawals_table_tag_name = "Other Withdrawals"
pipeline.add_step(RemoteAction(slug='kodexa/pattern-table-tagger', 
                                     options={"col_space_multiplier": col_space_multiplier,
                                              "tag_to_apply": other_withdrawals_table_tag_name,
                                              "page_start_re": '^Other Withdrawals$',
                                              "page_end_re": '^Total Other Withdrawals',
                                              "table_start_re": transactions_header_re,
                                              "table_end_re": '^Total Other Withdrawals',
                                              "col_marker_re": transactions_header_re,
                                              "insert_col_index": 2,
                                              "extract": True,
                                              "extract_options": {'store_name': other_withdrawals_table_tag_name,
                                                                  'header_lines_count': 1,
                                                                  'col_index_with_text': 0}
                                             }))

# Extract Checks
checks_table_tag_name = "Checks"
check_transactions_re = '^Check Date .* Ref Number Amount$'

pipeline.add_step(RemoteAction(slug='kodexa/pattern-table-tagger',
                                     options={"col_space_multiplier": col_space_multiplier,
                                              "tag_to_apply": checks_table_tag_name,
                                              "page_start_re": '^Checks Presented Conventionally$',
                                              "page_end_re": r'.*Conventional Checks Paid.*\d{2}.$',
                                              "table_start_re": check_transactions_re,
                                              "table_end_re": '',
                                              "col_marker_re": check_transactions_re,
                                              "extract": True,
                                              "extract_options": {'store_name': checks_table_tag_name,
                                                                  'header_lines_count': 1,
                                                                  'col_index_with_text': 0,
                                                                  'tables_in_page_count': 2}
                                             }))


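# Run the pipeline; the returned context gives access to the extracted data stores
context = pipeline.run()

# The processed Kodexa Document is available on the pipeline's context
kodexa_doc = pipeline.context.output_document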
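
Because the table tagger is driven entirely by regular expressions, it can help to try the patterns locally before running the pipeline remotely. The sketch below checks three of the patterns from the cell above against a few sample lines. The sample lines are hypothetical, written to resemble the statement text, and the use of `re.search` is an assumption about how the patterns are applied.

```python
import re

# Patterns copied from the pipeline cell above.
PATTERNS = {
    "transactions_header": r'^Date\s+Description.*\s+Amount$',
    "check_transactions": r'^Check Date .* Ref Number Amount$',
    "page_number": r'.*Page \d+ of \d+$',
}


def matching_patterns(line):
    """Return the names of the patterns that match the given line."""
    return [name for name, pattern in PATTERNS.items() if re.search(pattern, line)]


# Hypothetical sample lines, not taken from the actual PDF.
samples = [
    "Date Description of Transaction Ref Number Amount",
    "Check Date Ref Number Amount Check Date Ref Number Amount",
    "Check Date Ref Number Amount",
    "Account Number 1234 Page 2 of 5",
]

for line in samples:
    print(f"{line!r} -> {matching_patterns(line)}")
```

Note that the check header pattern only matches when the header text appears twice on one line, which fits the `tables_in_page_count` value of 2 used for the Checks table; a single `Check Date Ref Number Amount` header does not match it.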

In [15]:
context.get_store('Checks').to_df()

Unnamed: 0,Check,Date,Ref Number,Amount
0,4667,Oct16,8054899793,3777.34
1,4670*,Oct 2,8058398645,1146.08
2,4671,Oct10,8355786671,1146.08
3,4672,Oct 2,8059110408,70.0
4,4675*,Oct11,8656387853,200.0
5,4681,Oct27,9254685896,425.71
6,4682,Oct 6,9250487356,1804.6
7,4683,Oct 4,8655296985,3000.0
8,4685*,Oct10,8450678071,379.26
9,4687*,Oct20,9254033749,287.66
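
The extracted values are still raw text: dates appear as `Oct16` or `Oct 2`, and some check numbers carry a trailing `*`. A minimal cleanup sketch follows. The helper names are hypothetical, the year is not part of the extracted value so the caller must supply it, and the meaning of the asterisk is not stated in this notebook (on many statements it flags a gap in the check sequence), so it is only preserved as a boolean flag.

```python
import re
from datetime import date

MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def parse_statement_date(text, year):
    """Parse values such as 'Oct16' or 'Oct 2' into a date for the given year."""
    match = re.fullmatch(r"([A-Za-z]{3})\s*(\d{1,2})", text.strip())
    if match is None:
        raise ValueError(f"Unrecognized statement date: {text!r}")
    month = MONTHS.get(match.group(1).title())
    if month is None:
        raise ValueError(f"Unrecognized month in: {text!r}")
    return date(year, month, int(match.group(2)))


def split_check_number(text):
    """Split '4670*' into ('4670', True); values without '*' return False."""
    text = text.strip()
    return text.rstrip("*"), text.endswith("*")
```

With pandas, these could be applied column by column, for example `df['Date'].map(lambda d: parse_statement_date(d, 2017))`, where 2017 is an assumed statement year.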


In [16]:
context.get_store("Other Deposits").to_df()

Unnamed: 0,Date,Description of Transaction,Unnamed: 3,Ref Number,Amount
0,Oct 2,Electronic Deposit REF=172750102657000N00,From 36 TREAS 310 9101036151 MISC PAY431833386...,$,7265.0
1,Oct 3,Electronic Deposit REF=172760027363140N00,From 36 TREAS 310 9101036151 MISC PAY431833386...,,6400.0
2,Oct 4,Electronic Deposit REF=172720058223180Y00,From CGS ADMINISTATOR 6202552297HCCLAIMPMT267605,,11911.98
3,Oct11,Electronic Deposit REF=172780042278970Y00,From CGS ADMINISTATOR 6202552297HCCLAIMPMT267605,,4972.84
4,Oct12,Electronic Deposit REF=172850024413210N00,From 36 TREAS 310 9101036151 MISC PAY431833386...,,4510.0
5,Oct12,Electronic Deposit REF=172790122999750Y00,From CGS ADMINISTATOR 6202552297HCCLAIMPMT267605,,5597.43
6,Oct16,Electronic Deposit REF=172890066691320N00,From 36 TREAS 310 9101036151 MISC PAY431833386...,,2641.08
7,Oct17,Electronic Deposit REF=172850047281610Y00,From CGS ADMINISTATOR 6202552297HCCLAIMPMT267605,,3036.78
8,Oct20,Electronic Deposit REF=172920112272070N00,From 36 TREAS 310 9101036151 MISC PAY431833386...,,760.0
9,Oct20,Electronic Deposit REF=172900111651190Y00,From CGS ADMINISTATOR 6202552297HCCLAIMPMT267605,,11414.48


In [17]:
context.get_store('Card Withdrawals').to_df()

Unnamed: 0,Date,Description of Transaction,Unnamed: 3,Ref Number,Amount
0,Oct 2,Debit Purchase - VISA PANERA BREAD #20 *******...,On 092917 MISSION KS REF # 2442733727272004013...,2720040130 $,7.00-
1,Oct 2,Debit Purchase - VISA CHICK-FIL-A #029 *******...,On 092917 MISSION KS REF # 2442733727371002372...,3710023729,7.02-
2,Oct 4,Debit Purchase - VISA MO SEC OF STATE ********...,On 100317 WWW.SOS.MO.G MO REF # 24540457277235...,7235570575,51.25-
3,Oct 6,Debit Purchase - VISA FREDPRYOR CAREER *******...,On 100517 800-5563012 KS REF # 249064172780452...,8045224965,149.00-
4,Oct10,Debit Purchase - VISA THE PLUMBING PRO *******...,On 100617 816-7638200 MO REF # 241831072799000...,9900014900,372.00-


In [18]:
context.get_store('Other Withdrawals').to_df()

Unnamed: 0,Date,Description of Transaction,Unnamed: 3,Ref Number,Amount
0,Oct 6,Electronic Withdrawal REF=172780089093620N00,To BlueKc Com Stlmt 4431257251WEB PYMNT 38589255,$,195.34-
1,Oct 6,Customer Withdrawal,,9255094845,500.00-
2,Oct12,Branch Account Transfer,To Account 145574108240,,"8,000.00-"
3,Oct12,Branch Account Transfer,To Account 145570459670,,"12,000.00-"
4,Oct16,Electronic Withdrawal REF=172890065018750N00,From PHILA INS CO 2316092819INS IN 80092172,,5.00-
5,Oct16,Analysis Service Charge,,1600000000,24.95-
6,Oct16,Electronic Withdrawal REF=172890065018740N00,From PHILA INS CO 2316092819INS IN 80092172,,"7,514.68-"
7,Oct25,Electronic Withdrawal REF=172970047495440N00,From ATT 9864031006Payment 401469002EPAYQ,,308.48-
8,Oct25,Branch Account Transfer,To Account 145574108240,,"7,700.00-"
9,Oct25,Branch Account Transfer,To Account 145570459670,,"10,776.00-"
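
In the withdrawal tables the amounts come back as strings with a trailing minus sign and thousands separators (for example `"8,000.00-"`), and the first row of each table carries a stray `$` in the Ref Number column. Before doing any arithmetic, the amounts need to be converted to signed numbers. The helper below, `parse_amount`, is a hypothetical sketch of that conversion using `Decimal` to avoid floating-point rounding on currency values.

```python
from decimal import Decimal


def parse_amount(value):
    """Convert values such as '8,000.00-', '$7.00-' or 7265.0 into a signed Decimal."""
    cleaned = str(value).replace("$", "").replace(",", "").strip()
    negative = cleaned.endswith("-")
    if negative:
        cleaned = cleaned[:-1].strip()
    amount = Decimal(cleaned)
    return -amount if negative else amount
```

Applied to a dataframe, this might look like `df['Amount'] = df['Amount'].map(parse_amount)`; the same function also accepts the already-numeric amounts seen in the deposits and checks tables.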
