## Overview
This Jupyter notebook processes bank account transactions. It reads transaction data, classifies each transaction with a pre-trained machine learning model, and writes out the categorized records. The notebook has sections for data loading, preprocessing, model inference, and results display.

## Limitations
1. Hard-coded paths: file locations are written directly into the cells. Configuration files or environment variables would make the notebook portable.
2. Error handling: the notebook has little error handling or logging. Both should be made more robust.
3. Batch size management: the inference batch size is fixed. It could be adjusted dynamically based on available resources.
4. Naming conventions: names should be made clear and consistent throughout the notebook.
5. Other banks' data: the notebook may not work with data from other banks without minor adjustments.
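
For the first limitation, one possible approach is to read the project root from an environment variable and derive the other paths from it. This is a minimal sketch, not the notebook's current behavior: the variable name `FINANCIAL_TRACKER_HOME` and the helper `project_path` are hypothetical, and the default simply mirrors the root folder used in the cells of this notebook.

```python
import os
from pathlib import Path

# Hypothetical environment variable; the default mirrors the root used in this notebook.
DEFAULT_ROOT = r"E:\Data analytics projects\Financial Tracker"


def project_path(*parts):
    """Join path components onto the project root taken from the environment."""
    root = Path(os.environ.get("FINANCIAL_TRACKER_HOME", DEFAULT_ROOT))
    return root.joinpath(*parts)


credit_master = project_path(
    "Dashboarding", "Data Extraction", "Credit", "Credit processed master.csv"
)
print(credit_master)
```

With this pattern, moving the project to another machine only requires setting the environment variable rather than editing each cell.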

## Run the scripts to process data extracted from the Scotiabank website

The scripts are located in a different directory. They validate the data and ensure consistency.

- The two scripts are credit data processing and current data processing.

- Running these scripts does the following:
1. Checks the relevant directory (Dashboarding/Data Extraction/Credit or Dashboarding/Data Extraction/Current) for files to be processed.
2. If files are found, the scripts process all of them, enforcing a consistent data format and column order and enriching the data with supporting columns.
3. After the CSV files are processed as described in step 2, the scripts move them to the 'Archive' folder so that no file is processed twice.
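
The three steps above can be sketched roughly as follows. The actual scripts are not shown in this notebook, so this is only an illustration of the check, process, and archive pattern: the function `process_pending_files`, the `*.csv` filter, and the no-op processing step in the demo are all assumptions.

```python
import shutil
import tempfile
from pathlib import Path


def process_pending_files(inbox, archive, process_file):
    """Process every CSV in `inbox`, then move it to `archive`.

    `process_file` is a caller-supplied function that takes a Path.
    Returns the names of the files that were processed.
    """
    inbox, archive = Path(inbox), Path(archive)
    pending = sorted(inbox.glob("*.csv"))
    if not pending:
        print("No files to process")
        return []

    archive.mkdir(parents=True, exist_ok=True)
    processed = []
    for csv_path in pending:
        process_file(csv_path)
        # Moving the file out of the inbox prevents it from being processed again.
        shutil.move(str(csv_path), str(archive / csv_path.name))
        processed.append(csv_path.name)
    return processed


# Demo with a throwaway directory and a no-op processing step
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "statement.csv").write_text("Date,Amount\n2024-04-30,-7.33\n")
first_run = process_pending_files(demo_dir, demo_dir / "Archive", lambda path: None)
second_run = process_pending_files(demo_dir, demo_dir / "Archive", lambda path: None)
print(first_run, second_run)
```

The second call finds nothing to do because the file has already been moved, which is the behavior step 3 describes.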

Documentation for these scripts can be found at E:\Data analytics projects\Financial Tracker\Dashboarding\Data Extraction\Documentation.txt


In [1]:
# Run the processing scripts in the Credit and Current directories

import subprocess

# Define the paths to the Python scripts in other directories
credit_script_path = r"E:\Data analytics projects\Financial Tracker\Dashboarding\Data Extraction\Credit\Credit Data processing.py"
current_script_path = r"E:\Data analytics projects\Financial Tracker\Dashboarding\Data Extraction\Current\Current Data processing.py"

# Run the 'credit' script
print("Running Credit Data processing.py:")
credit_output = subprocess.run(['python', credit_script_path], capture_output=True, text=True)
print(credit_output.stdout)

# Run the 'current' script
print("\nRunning Current Data processing.py:")
current_output = subprocess.run(['python', current_script_path], capture_output=True, text=True)
print(current_output.stdout)

Running Credit Data processing.py:
No files to process


Running Current Data processing.py:
No files to process in the directory
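
The cell above prints only `stdout`, so a script that crashes would produce empty output without any hint of the failure. A hedged sketch of one way to surface errors: the helper `run_script` is hypothetical, and the inline `-c` commands stand in for the real script paths.

```python
import subprocess
import sys


def run_script(script_args):
    """Run Python with the current interpreter and report failures.

    `script_args` is the list of arguments after the interpreter,
    for example [path_to_script].
    """
    result = subprocess.run(
        [sys.executable, *script_args],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # stderr carries the traceback or error message when the script fails
        print(f"Script failed with exit code {result.returncode}:")
        print(result.stderr)
    else:
        print(result.stdout)
    return result


ok = run_script(["-c", "print('No files to process')"])
failed = run_script(["-c", "raise SystemExit(3)"])
```

Using `sys.executable` instead of the literal `'python'` runs the scripts with the same interpreter, and therefore the same installed packages, as the notebook kernel.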



## Conduct data validation checks before merging into one dataframe

In [2]:
import pandas as pd
import numpy as np

current_master_directory = r"E:\Data analytics projects\Financial Tracker\Dashboarding\Data Extraction\Current\Current processed master.csv"
credit_master_directory = r"E:\Data analytics projects\Financial Tracker\Dashboarding\Data Extraction\Credit\Credit processed master.csv"


In [3]:

def check_null_values(df):
    """
    Report null values in a pandas DataFrame and drop fully empty rows.
    
    Parameters:
    df (pd.DataFrame): Input DataFrame (modified in place)
    
    Returns:
    None
    """
    # Check for null values in the entire DataFrame
    null_values = df.isnull()
    
    # Check if any null values exist in the DataFrame
    if null_values.any().any():
        # Separate rows that are entirely null from rows that are only partly null
        all_null_rows = null_values.all(axis=1)
        partial_nulls = (null_values.any(axis=1) & ~all_null_rows).sum()
        if partial_nulls > 0:
            print(f"{partial_nulls} rows with partial nulls exist in the DataFrame.")
        
        # Drop rows with all null values, if there are any
        if all_null_rows.any():
            df.dropna(how='all', inplace=True)
            print("Rows with all null values have been dropped.")
    else:
        print("No null values exist in the DataFrame.")

def check_duplicate_values(df):
    """
    Report duplicate rows in a pandas DataFrame and drop them in place.
    
    Parameters:
    df (pd.DataFrame): Input DataFrame
    
    Returns:
    None
    """
    # Check for duplicate rows in the DataFrame
    duplicate_rows = df.duplicated()
    
    # Check if any duplicate rows exist in the DataFrame
    if duplicate_rows.any():
        # Count the number of duplicate rows
        num_duplicate_rows = duplicate_rows.sum()
        print(f"{num_duplicate_rows} duplicate rows exist in the DataFrame.")
        
        # Drop duplicate rows
        df.drop_duplicates(inplace=True)
        
        print("Duplicate rows have been dropped.")
    else:
        print("No duplicate rows exist in the DataFrame.")
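
The two checks above print their findings and modify the DataFrame in place. Where the counts are needed programmatically (for logging, for example), a non-mutating variant could return them instead. This is a sketch under that assumption; `summarize_quality` and the sample rows are hypothetical.

```python
import pandas as pd


def summarize_quality(df):
    """Return counts of fully-null rows, partially-null rows and duplicate rows."""
    nulls = df.isnull()
    all_null = nulls.all(axis=1)
    return {
        "all_null_rows": int(all_null.sum()),
        "partial_null_rows": int((nulls.any(axis=1) & ~all_null).sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Made-up rows: one duplicate, one partly empty row, one fully empty row
sample = pd.DataFrame({
    "Transaction_description": ["SHELL", "SHELL", None, None],
    "Transaction_amount": [-13.75, -13.75, -5.0, None],
})
summary = summarize_quality(sample)
print(summary)
```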


In [4]:
current_df = pd.read_csv(current_master_directory)
current_df['Date'] = pd.to_datetime(current_df['Date'])
check_null_values(current_df)
check_duplicate_values(current_df)
current_df.head()

No null values exist in the DataFrame.
No duplicate rows exist in the DataFrame.


Unnamed: 0,Date,Transaction_description,Transaction_amount,Account
0,2024-04-30,POS Purchase OPOS GOOGLE *Wallet g.co/,-7.33,Current A/c
1,2024-04-30,POS Purchase FPOS THE BEER STORE #2362ETOBI,-16.2,Current A/c
2,2024-04-30,POS Purchase FPOS SHELL C45051,-13.75,Current A/c
3,2024-04-30,POS Purchase GPOS TIM HORTONS #17 ETOBI,-13.81,Current A/c
4,2024-04-29,DEPOSIT FREE INTERAC E-TRANSFER,10.0,Current A/c


In [5]:
credit_df = pd.read_csv(credit_master_directory)
credit_df['Date'] = pd.to_datetime(credit_df['Date'])
check_null_values(credit_df)
check_duplicate_values(credit_df)
credit_df.head()

No null values exist in the DataFrame.
No duplicate rows exist in the DataFrame.


Unnamed: 0,Date,Transaction_description,Transaction_amount,Account
0,2024-04-04,DIVINE FLOWER TORONTO ON (GOO...,-5.08,Credit Card
1,2024-04-04,CIRCLE K #69015 TORONTO ON (GOO...,-14.24,Credit Card
2,2024-04-05,TIM HORTONS #2000 TORONTO ON (GOO...,-14.52,Credit Card
3,2024-04-05,UBER CANADA/UBERTRIP TORONTO ON (GOO...,-38.18,Credit Card
4,2024-04-06,FROM - *****11*6870,259.01,Credit Card


### Merge both files and conduct EDA 

In [6]:
master_df = pd.concat([current_df, credit_df], axis=0, ignore_index=True)

In [7]:
master_df['Account'].value_counts()

Account
Current A/c    158
Credit Card     87
Name: count, dtype: int64

In [8]:
master_df['Type'] = master_df['Transaction_amount'].apply(lambda x: 'Income' if x > 0 else 'Expense')
master_df.head()

Unnamed: 0,Date,Transaction_description,Transaction_amount,Account,Type
0,2024-04-30,POS Purchase OPOS GOOGLE *Wallet g.co/,-7.33,Current A/c,Expense
1,2024-04-30,POS Purchase FPOS THE BEER STORE #2362ETOBI,-16.2,Current A/c,Expense
2,2024-04-30,POS Purchase FPOS SHELL C45051,-13.75,Current A/c,Expense
3,2024-04-30,POS Purchase GPOS TIM HORTONS #17 ETOBI,-13.81,Current A/c,Expense
4,2024-04-29,DEPOSIT FREE INTERAC E-TRANSFER,10.0,Current A/c,Income


In [9]:
master_df.Type.value_counts()

Type
Expense    206
Income      39
Name: count, dtype: int64

Creating the primary key column:


In [10]:

def create_key(row):
    """Build a key from the account prefix, the date (YYMMDD) and the cents of the amount."""
    account_prefix = 'DEB' if row['Account'] == 'Current A/c' else 'CRE'
    date_part = row['Date'].strftime('%y%m%d')  # Use strftime to format date
    # Cents portion of the amount, always two digits (e.g. -16.2 -> '20')
    amount_part = f"{row['Transaction_amount']:.2f}".split('.')[1]
    return f"{account_prefix}{date_part}{amount_part}"


# Apply the function to create a new 'Primary_key' column
master_df['Primary_key'] = master_df.apply(create_key, axis=1)

# Optionally set the 'Primary_key' column as the index
#master_df = master_df.set_index('Primary_key')

# Reorder the columns so that the key comes first
master_df = master_df[['Primary_key', 'Date', 'Transaction_description', 'Transaction_amount', 'Account', 'Type']]
master_df.head()

Unnamed: 0,Primary_key,Date,Transaction_description,Transaction_amount,Account,Type
0,DEB24043033,2024-04-30,POS Purchase OPOS GOOGLE *Wallet g.co/,-7.33,Current A/c,Expense
1,DEB24043020,2024-04-30,POS Purchase FPOS THE BEER STORE #2362ETOBI,-16.2,Current A/c,Expense
2,DEB24043075,2024-04-30,POS Purchase FPOS SHELL C45051,-13.75,Current A/c,Expense
3,DEB24043081,2024-04-30,POS Purchase GPOS TIM HORTONS #17 ETOBI,-13.81,Current A/c,Expense
4,DEB24042900,2024-04-29,DEPOSIT FREE INTERAC E-TRANSFER,10.0,Current A/c,Income
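
Because the key is built only from the account prefix, the date, and the cents of the amount, two transactions on the same account and day can receive the same key. The sketch below shows the effect on made-up rows; `make_key` is a hypothetical standalone copy of the recipe, not part of the notebook.

```python
from collections import Counter
from datetime import date


def make_key(account, day, amount):
    """Same recipe as create_key: account prefix + YYMMDD + cents of the amount."""
    prefix = "DEB" if account == "Current A/c" else "CRE"
    cents = f"{amount:.2f}".split(".")[1]
    return f"{prefix}{day.strftime('%y%m%d')}{cents}"


# Made-up transactions: the first two differ in amount but share account, date and cents.
transactions = [
    ("Current A/c", date(2024, 4, 30), -7.33),
    ("Current A/c", date(2024, 4, 30), -12.33),
    ("Credit Card", date(2024, 4, 30), -7.33),
]
keys = [make_key(*t) for t in transactions]
duplicates = {key: count for key, count in Counter(keys).items() if count > 1}
print(keys)
print(duplicates)
```

A check like `master_df['Primary_key'].is_unique` would reveal whether this happens in the real data before the key is used for merging.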


In [11]:
master_df.columns

Index(['Primary_key', 'Date', 'Transaction_description', 'Transaction_amount',
       'Account', 'Type'],
      dtype='object')

## Configuring the data features before running them through the saved model

In [12]:
# model_df is another name for master_df (not a copy), so the columns added below also appear on master_df
model_df = master_df
# Flag transactions whose absolute amount is above $100.
# The $100 threshold may need to change if a significant share of rows exceed it.
model_df['Big_amount'] = model_df['Transaction_amount'].abs() > 100
# Combine the transaction description and the amount into one text field; the amount may give the model additional context.

model_df['Classification_text'] = model_df['Transaction_description'] + ': $' + model_df['Transaction_amount'].astype(str)


In [13]:
model_df = model_df[['Primary_key', 'Classification_text', 'Big_amount']]
model_df.loc[:, 'Classification_text'] = model_df['Classification_text'].str.lower()
model_df.head()

Unnamed: 0,Primary_key,Classification_text,Big_amount
0,DEB24043033,pos purchase opos google *wallet g.co/: $...,False
1,DEB24043020,pos purchase fpos the beer store #2362etobi: $...,False
2,DEB24043075,pos purchase fpos shell c45051: $-13.75,False
3,DEB24043081,pos purchase gpos tim hortons #17 etobi: $...,False
4,DEB24042900,deposit free interac e-transfer: $10.0,False


In [14]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned classifier and the tokenizer of its base model
model_2 = AutoModelForSequenceClassification.from_pretrained("E:\\Data analytics projects\\Financial Tracker\\Model training\\Finance_categorization_fine_tuned")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Get the text data from the DataFrame column as a plain list
text_data = model_df["Classification_text"].tolist()

# Tokenize the text, padding and truncating to the model's maximum length
encoded_data = tokenizer(text_data, padding="max_length", truncation=True, return_tensors="pt")

# Split the data into batches for efficient processing
batch_size = 4  # Adjust based on your hardware and memory limitations
predictions = []
model_2.eval()
with torch.no_grad():  # Inference only, so gradients are not needed
    for i in range(0, len(text_data), batch_size):
        batch_data = {key: value[i:i+batch_size] for key, value in encoded_data.items()}
        # Run model inference on the batch
        outputs = model_2(**batch_data)
        # The predicted class is the index of the largest logit
        predictions.extend(outputs.logits.argmax(dim=-1).tolist())




In [17]:
# Load the label-to-id mapping, which is stored as JSON
import json
with open('Data Extraction/classification_labels.txt') as f:
    classification_labels = json.load(f)

# Create a reverse dictionary for efficient lookup
reverse_labels = {v: k for k, v in classification_labels.items()}

# Convert numbers to their respective keys using list comprehension
keys = [reverse_labels[number] for number in predictions]
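
The lookup above raises a `KeyError` if the model ever predicts an id that is missing from the labels file. A small sketch of a more forgiving lookup follows. It assumes the file holds a JSON object mapping label name to class id, as the reverse dictionary implies; the three entries are taken from the outputs shown in this notebook, and `decode_predictions` is a hypothetical helper.

```python
import json

# Assumed shape of the labels file: {"label name": class_id, ...}.
# Only three entries are reproduced here; the real file has more.
labels_json = '{"Alcohol": 0, "Bill payments, Subscriptions": 1, "Shopping": 14}'
classification_labels = json.loads(labels_json)

id_to_label = {class_id: label for label, class_id in classification_labels.items()}


def decode_predictions(predicted_ids, id_to_label, unknown="Unknown"):
    """Translate class ids to label names, using `unknown` for ids missing from the mapping."""
    return [id_to_label.get(class_id, unknown) for class_id in predicted_ids]


decoded = decode_predictions([1, 0, 14, 99], id_to_label)
print(decoded)
```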


In [18]:
model_df.loc[: ,'Predictions'] = predictions
model_df.loc[:, 'Category'] = keys

model_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  model_df.loc[: ,'Predictions'] = predictions
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  model_df.loc[:, 'Category'] = keys


Unnamed: 0,Primary_key,Classification_text,Big_amount,Predictions,Category
0,DEB24043033,pos purchase opos google *wallet g.co/: $...,False,1,"Bill payments, Subscriptions"
1,DEB24043020,pos purchase fpos the beer store #2362etobi: $...,False,0,Alcohol
2,DEB24043075,pos purchase fpos shell c45051: $-13.75,False,4,Ciggerates
3,DEB24043081,pos purchase gpos tim hortons #17 etobi: $...,False,13,"Restaurant, fast-food"
4,DEB24042900,deposit free interac e-transfer: $10.0,False,9,Interact
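
The warning in this cell's output appears because `model_df` was created by selecting columns from another DataFrame, so pandas cannot tell whether the assignment is meant to affect the original. A common way to avoid it is to take an explicit copy when making the selection. A minimal sketch on made-up data (the frame `source` and its values are illustrative only):

```python
import pandas as pd

source = pd.DataFrame({
    "Primary_key": ["DEB24043033", "DEB24043020"],
    "Classification_text": ["Google Wallet: $-7.33", "Beer Store: $-16.2"],
    "Transaction_amount": [-7.33, -16.2],
})

# .copy() makes the selection an independent DataFrame, so later assignments are unambiguous.
subset = source[["Primary_key", "Classification_text"]].copy()
subset["Classification_text"] = subset["Classification_text"].str.lower()
subset["Predictions"] = [1, 0]

print(subset)
print(list(source.columns))  # the original frame is unchanged
```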


In [19]:
# Save the processed file to another directory, named after the minimum and maximum transaction dates.
min_date = master_df.Date.min().strftime('%Y-%m-%d')
max_date = master_df.Date.max().strftime('%Y-%m-%d')

file_name = min_date + " to " + max_date + ".csv"
model_df.to_csv(f'E:/Data analytics projects/Financial Tracker/Dashboarding/Categorized by model files/{file_name}')

## Staging the data for dashboarding
- Merging the original data with the data processed by the model.

- Appending the data to a master dataset file for dashboarding.

In [20]:
print("Size of the datasets being merged: ","\n", master_df.shape,"\n",model_df.shape)

Size of the datasets being merged:  
 (245, 8) 
 (245, 5)


In [21]:
merged_df = pd.merge(master_df, model_df, how='inner' ,on = master_df['Primary_key'])
merged_df

Unnamed: 0,key_0,Primary_key_x,Date,Transaction_description,Transaction_amount,Account,Type,Big_amount_x,Classification_text_x,Primary_key_y,Classification_text_y,Big_amount_y,Predictions,Category
0,DEB24043033,DEB24043033,2024-04-30,POS Purchase OPOS GOOGLE *Wallet g.co/,-7.33,Current A/c,Expense,False,POS Purchase OPOS GOOGLE *Wallet g.co/: $...,DEB24043033,pos purchase opos google *wallet g.co/: $...,False,1,"Bill payments, Subscriptions"
1,DEB24043020,DEB24043020,2024-04-30,POS Purchase FPOS THE BEER STORE #2362ETOBI,-16.20,Current A/c,Expense,False,POS Purchase FPOS THE BEER STORE #2362ETOBI: $...,DEB24043020,pos purchase fpos the beer store #2362etobi: $...,False,0,Alcohol
2,DEB24043075,DEB24043075,2024-04-30,POS Purchase FPOS SHELL C45051,-13.75,Current A/c,Expense,False,POS Purchase FPOS SHELL C45051: $-13.75,DEB24043075,pos purchase fpos shell c45051: $-13.75,False,4,Ciggerates
3,DEB24043081,DEB24043081,2024-04-30,POS Purchase GPOS TIM HORTONS #17 ETOBI,-13.81,Current A/c,Expense,False,POS Purchase GPOS TIM HORTONS #17 ETOBI: $...,DEB24043081,pos purchase gpos tim hortons #17 etobi: $...,False,13,"Restaurant, fast-food"
4,DEB24042900,DEB24042900,2024-04-29,DEPOSIT FREE INTERAC E-TRANSFER,10.00,Current A/c,Income,False,DEPOSIT FREE INTERAC E-TRANSFER: $10.0,DEB24042900,deposit free interac e-transfer: $10.0,False,9,Interact
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336,CRE24051817,CRE24051817,2024-05-18,TIM HORTONS #2000 416-506-1972 ON (GOO...,-2.17,Credit Card,Expense,False,TIM HORTONS #2000 416-506-1972 ON (GOO...,CRE24051817,tim hortons #2000 416-506-1972 on (goo...,False,13,"Restaurant, fast-food"
337,CRE24051991,CRE24051991,2024-05-19,SHELL FLYING J #80500 ETOBICOKE ON,-28.91,Credit Card,Expense,False,SHELL FLYING J #80500 ETOBICOKE ON : $-2...,CRE24051991,shell flying j #80500 etobicoke on : $-2...,False,4,Ciggerates
338,CRE24052079,CRE24052079,2024-05-20,AMZN Mktp CA*JV14A5VD3 WWW.AMAZON.CAON,-15.79,Credit Card,Expense,False,AMZN Mktp CA*JV14A5VD3 WWW.AMAZON.CAON : $-1...,CRE24052079,amzn mktp ca*jv14a5vd3 www.amazon.caon : $-1...,False,14,Shopping
339,CRE24052054,CRE24052054,2024-05-20,AMZN Mktp CA*Q25IU47B3 WWW.AMAZON.CAON,-152.54,Credit Card,Expense,True,AMZN Mktp CA*Q25IU47B3 WWW.AMAZON.CAON : $-1...,CRE24052054,amzn mktp ca*q25iu47b3 www.amazon.caon : $-1...,True,14,Shopping
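
Both inputs have 245 rows, yet the merged index in this output runs to 339. An inner merge can only grow like this when some `Primary_key` values repeat, because every repeat on one side is matched with every repeat on the other. A hedged sketch on made-up data of how that happens and how the `validate` argument of `pd.merge` can surface it; the frames `left` and `right` are illustrative, not the notebook's data:

```python
import pandas as pd

left = pd.DataFrame({
    "Primary_key": ["DEB24043033", "DEB24043033", "CRE24040408"],
    "Transaction_amount": [-7.33, -12.33, -5.08],
})
right = pd.DataFrame({
    "Primary_key": ["DEB24043033", "DEB24043033", "CRE24040408"],
    "Category": ["Shopping", "Alcohol", "Shopping"],
})

# Each duplicated key on the left pairs with each duplicate on the right: 2 x 2 + 1 = 5 rows.
merged = pd.merge(left, right, how="inner", on="Primary_key")
print(len(left), len(right), len(merged))

# validate="one_to_one" makes pandas raise instead of silently multiplying rows.
try:
    pd.merge(left, right, how="inner", on="Primary_key", validate="one_to_one")
    rejected = False
except pd.errors.MergeError as exc:
    rejected = True
    print("Merge rejected:", exc)
```

Since `master_df` and `model_df` hold the same rows in the same order, joining on the row index would be an alternative that does not depend on the key being unique.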


In [33]:
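# Flag transfers by comparing the category to the transfer label (case-insensitive; assumes the label is spelled "Transfer")
merged_df['Transfer'] = merged_df['Category'].str.upper() == 'TRANSFER'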
merged_df = merged_df[['Primary_key_x','Date','Transaction_description','Transaction_amount','Type','Transfer','Account','Category']]
merged_df.columns = ['Primary_key', 'Date', 'Transaction_description','Transaction_amount', 'Type', 'Transfer', 'Account', 'Category']
merged_df.head()

Unnamed: 0,Primary_key,Date,Transaction_description,Transaction_amount,Type,Transfer,Account,Category
0,DEB24043033,2024-04-30,POS Purchase OPOS GOOGLE *Wallet g.co/,-7.33,Expense,False,Current A/c,"Bill payments, Subscriptions"
1,DEB24043020,2024-04-30,POS Purchase FPOS THE BEER STORE #2362ETOBI,-16.2,Expense,False,Current A/c,Alcohol
2,DEB24043075,2024-04-30,POS Purchase FPOS SHELL C45051,-13.75,Expense,False,Current A/c,Ciggerates
3,DEB24043081,2024-04-30,POS Purchase GPOS TIM HORTONS #17 ETOBI,-13.81,Expense,False,Current A/c,"Restaurant, fast-food"
4,DEB24042900,2024-04-29,DEPOSIT FREE INTERAC E-TRANSFER,10.0,Income,False,Current A/c,Interact


In [35]:
# df = pd.read_csv(r"E:\Data analytics projects\Financial Tracker\Dashboarding\Master Dataset.csv")
# df.Date = pd.to_datetime(df['Date'])
# df['Primary_key'] = df.apply(create_key, axis=1)
# df = df[['Primary_key', 'Date', 'Transaction_description', 'Transaction_amount', 'type', 'transfer', 'Account', 'Category']]
# df.columns = ['Primary_key', 'Date', 'Transaction_description','Transaction_amount', 'Type', 'Transfer', 'Account', 'Category']
# df.head()
# df.to_csv("Master_Dataset_final.csv", index=False)


Unnamed: 0,Primary_key,Date,Transaction_description,Transaction_amount,Type,Transfer,Account,Category
0,DEB23101764,2023-10-17,Customer Transfer Cr. INVESTMENT PROCEEDS,678.64,Income,False,Current A/c,Child Support
1,DEB23111764,2023-11-17,Customer Transfer Cr. INVESTMENT PROCEEDS,678.64,Income,False,Current A/c,Child Support
2,DEB23121864,2023-12-18,Customer Transfer Cr. INVESTMENT PROCEEDS,678.64,Income,False,Current A/c,Child Support
3,DEB24011764,2024-01-17,Customer Transfer Cr. INVESTMENT PROCEEDS,678.64,Income,False,Current A/c,Child Support
4,DEB24021764,2024-02-17,Customer Transfer Cr. INVESTMENT PROCEEDS,678.64,Income,False,Current A/c,Child Support


In [39]:
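# Append to the existing master file; header=False avoids writing a second header row into the data
merged_df.to_csv("Master_Dataset_final.csv", mode="a", index=False, header=False)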
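
When appending with `mode="a"`, the header row should end up in the file exactly once. A small sketch of one way to handle that whether or not the master file already exists; `append_to_master` is a hypothetical helper, and the demo uses made-up rows in a temporary file rather than the real master dataset.

```python
import os
import tempfile

import pandas as pd


def append_to_master(new_rows, master_path):
    """Append rows to a CSV, writing the header only when the file is first created."""
    file_exists = os.path.exists(master_path)
    new_rows.to_csv(master_path, mode="a", index=False, header=not file_exists)


# Demo in a temporary directory with made-up rows
demo_path = os.path.join(tempfile.mkdtemp(), "master_demo.csv")
batch = pd.DataFrame({"Primary_key": ["DEB24043033"], "Category": ["Shopping"]})
append_to_master(batch, demo_path)
append_to_master(batch, demo_path)

with open(demo_path) as f:
    lines = f.read().splitlines()
print(lines)
```

The demo appends the same batch twice on purpose: the header appears once, but the data row is duplicated, so re-running the notebook on already appended data would still need a separate duplicate check.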