## Personal Finance Tracker through PDF Statement Data Extraction + Post-Analysis

This notebook interprets Scotiabank bank statements. It extracts transactions from the statement PDFs, sorts them into categories based on the extracted metadata, and then produces insights and visualizations on spending habits and overall financial health. The goal is to understand spending, track long-term growth and find ways to improve financial health.

In [1]:
import sys
import os
sys.path.append(os.path.abspath('../..'))

In [2]:
from src.modules.pdf_interpreter import PDFReader
from src.modules.helper_fns import CategoryUpdater

### Step 1: Extract raw data from PDF bank statement

In [3]:
StatementReader = PDFReader()

Reading PDFs from 'Ultimate Package' bucket:   0%|          | 0/40 [00:00<?, ?it/s]

Reading PDFs from 'Passport Infinite' bucket:   0%|          | 0/23 [00:00<?, ?it/s]

Reading PDFs from 'Visa' bucket:   0%|          | 0/24 [00:00<?, ?it/s]

Reading PDFs from 'MoneyMaster' bucket:   0%|          | 0/47 [00:00<?, ?it/s]

### Step 2A: Process raw data based on metadata + extracted insights

In [4]:
StatementReader.process_raw_df()

Pre-processing bank statements.
Recalibrating amounts in bank statements.
Savings MoneyMaster
Credit Visa
Credit Passport Infinite
Chequing Ultimate Package
Post-processing bank statements.


### Step 2B: Review unlabelled transactions and assign categories (otherwise they are left as 'uncharacterized')

In [5]:
classifier = CategoryUpdater(StatementReader.filtered_df)

HBox(children=(VBox(children=(Dropdown(description='Filter:', options=('All', 'Groceries', 'Activities', 'Onli…
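The fallback to 'uncharacterized' in this step can be pictured with a small keyword lookup. This is only a sketch in plain Python and not the `CategoryUpdater` implementation: the `classify` function and the `KEYWORD_CATEGORIES` table are hypothetical, and the keyword-to-category pairs are made up for illustration.

```python
from __future__ import annotations

# Hypothetical keyword table; the real associations live in the project code.
KEYWORD_CATEGORIES = {
    "costco": "Groceries",
    "airbnb": "Activities",
}


def classify(details: str, keyword_categories: dict[str, str]) -> tuple[str, str | None]:
    """Return (category, matched keyword) for one transaction description."""
    words = details.lower().split()
    for keyword, category in keyword_categories.items():
        # A keyword matches when it appears inside any word of the description.
        if any(keyword in word for word in words):
            return category, keyword
    # Nothing matched: leave the transaction for manual review.
    return "uncharacterized", None


labelled = classify("COSTCO WHOLESALE W123", KEYWORD_CATEGORIES)
unlabelled = classify("Unknown Merchant 42", KEYWORD_CATEGORIES)
```

Transactions that come back as 'uncharacterized' are the ones a manual review step like the widget above would need to handle.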

In [None]:
# StatementReader.process_raw_df()

### Step 3: Visualize data

#### Balance variation + insights

The graphs below show the following:
- Balance variation across statement transaction dates
- Various filters used to assess fit to the data (only relevant when adjusting the data processing)
- Rates of change, used to assess spending habits and growth of capital
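As a rough illustration of the rate-of-change idea, the sketch below computes the average change in balance per day between consecutive observations. It uses plain Python with made-up balances. The `daily_rates` helper is hypothetical and is not part of `PDFReader`.

```python
from datetime import date


def daily_rates(points):
    """Average change in balance per day between consecutive (date, balance) points."""
    points = sorted(points)
    rates = []
    for (d0, b0), (d1, b1) in zip(points, points[1:]):
        days = (d1 - d0).days
        if days > 0:  # skip same-day entries to avoid dividing by zero
            rates.append((d1, (b1 - b0) / days))
    return rates


# Hypothetical balances, for illustration only.
balances = [
    (date(2023, 1, 1), 1000.0),
    (date(2023, 1, 11), 1200.0),
    (date(2023, 1, 31), 900.0),
]
rates = daily_rates(balances)
```

A positive rate means the balance grew over that interval, and a negative rate means net spending.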

In [7]:
StatementReader.plot_attribute_against_datetime(
    StatementReader.filtered_df,
    segment_settings={
        'penalty': 100,
        'min_size': 50,
        'jump': 50,
        'model': "l2"
    },
    view_segments=True)
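The keys in `segment_settings` resemble the parameters commonly used for penalized change-point detection (for example in the `ruptures` library), although this notebook does not show how `plot_attribute_against_datetime` uses them. Assuming that reading, the sketch below shows what each setting would control. It is a plain-Python exhaustive dynamic program with an L2 (squared deviation from the mean) cost, and it is not the project's implementation. The `segment` and `l2_cost` helpers are hypothetical.

```python
def l2_cost(prefix, prefix_sq, start, end):
    """Sum of squared deviations from the mean over signal[start:end]."""
    n = end - start
    total = prefix[end] - prefix[start]
    total_sq = prefix_sq[end] - prefix_sq[start]
    return total_sq - total * total / n


def segment(signal, penalty, min_size=1, jump=1):
    """Return segment end indices minimising total L2 cost plus `penalty` per breakpoint."""
    n = len(signal)
    if n < 2 * min_size:
        return [n] if n else []  # too short to split

    # Prefix sums allow each segment cost to be evaluated in constant time.
    prefix, prefix_sq = [0.0], [0.0]
    for x in signal:
        prefix.append(prefix[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    # Breakpoints are only considered every `jump` samples; n closes the last segment.
    candidates = [0] + list(range(jump, n, jump)) + [n]
    best = {0: (-penalty, None)}  # end index -> (best cost, previous breakpoint)
    for end in candidates[1:]:
        options = [
            (best[start][0] + l2_cost(prefix, prefix_sq, start, end) + penalty, start)
            for start in candidates
            if start in best and end - start >= min_size  # enforce minimum segment length
        ]
        if options:
            best[end] = min(options)

    # Walk back from the end of the signal to recover the chosen breakpoints.
    ends, t = [], n
    while t:
        ends.append(t)
        t = best[t][1]
    return ends[::-1]


# Hypothetical step signal: five low values followed by five high values.
step = [0.0] * 5 + [10.0] * 5
breaks = segment(step, penalty=1.0, min_size=2, jump=1)
```

Under these assumptions, a larger `penalty` produces fewer segments, `min_size` sets the shortest allowed segment, and `jump` trades precision in breakpoint position for speed.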

#### Categorical Spending Habits per Month

The plot below is built as follows:
- Transactions are grouped by TRANSACTION month and then by CATEGORY
- Categories are assigned in earlier code cells, either by running the transaction details through a word encoder or by prompting for keyword associations
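The two-level grouping described above can be sketched in plain Python as follows. The `monthly_totals` helper and the sample transactions are hypothetical, and `plot_stacked` may aggregate differently internally.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(transactions):
    """Sum amounts by (year, month) and then by category."""
    totals = defaultdict(lambda: defaultdict(float))
    for when, category, amount in transactions:
        totals[(when.year, when.month)][category] += amount
    # Convert to plain dicts, with months in chronological order.
    return {month: dict(cats) for month, cats in sorted(totals.items())}


# Hypothetical transactions: (transaction date, category, amount).
sample = [
    (date(2023, 5, 2), "Groceries", 50.0),
    (date(2023, 5, 20), "Groceries", 25.0),
    (date(2023, 5, 21), "Activities", 10.0),
    (date(2023, 6, 1), "Groceries", 40.0),
]
by_month = monthly_totals(sample)
```

Each month then maps to one stacked bar, with one coloured segment per category.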

In [8]:
StatementReader.plot_stacked(StatementReader.filtered_df)

### Step 4: (Optional) View specifics of data to gain deeper insights

In [9]:
# filtered_df_4, filtered_df_5 and filtered_df_6 are not defined in this notebook.
# Assumption: StatementReader.filtered_df (used in the cells above) is the intended frame.
filtered_df = StatementReader.filtered_df

# Rows whose processed details contain the word 'payroll':
# filtered_rows = filtered_df[filtered_df['Processed Details'].apply(lambda x: any(s.lower() == 'payroll' for s in x))]

# Rows matched to the keyword 'costco':
filtered_rows = filtered_df[filtered_df['Matched Keyword'] == 'costco']
# Rows that are still unlabelled:
# filtered_rows = filtered_df[filtered_df['Classification'] == 'uncharacterized']
# filtered_rows[filtered_rows['DateTime'] >= '2023-05-01'].head(20)

# filtered_df[filtered_df['DateTime'] >= '2023-12-20'].head(20)

# Rows whose raw details mention 'airbnb':
# filtered_rows = filtered_df[filtered_df['Details'].apply(
#     lambda x: any('airbnb' in word for word in x.lower().split())
# )]
filtered_rows.head(20)