## PERSONAL FINANCE TRACKER

### Business Understanding
The Kenyan financial landscape is dominated by mobile money, with M-Pesa serving over 34 million users and processing more than 254 million transactions monthly. Despite this dominance, there is a significant "financial invisibility" gap. While traditional banking apps offer automated insights, mobile money users are forced to manually track their finances through unstructured SMS messages or PDF statements. This project, the Personal Finance Tracker, applies FinTech analytics and behavioral economics to the Kenyan mobile money ecosystem. It aims to bridge the gap between high transaction volumes and low financial visibility by transforming raw M-Pesa data into actionable intelligence.

### Business Problem
The Kenyan financial ecosystem is currently hindered by a significant "financial invisibility" gap that affects over 34 million M-Pesa users. Although mobile money has revolutionized digital payments, it lacks the sophisticated analytical infrastructure found in traditional banking apps, leaving users without a clear understanding of their economic behavior. This transparency gap is particularly damaging because the average user loses between 11% and 15% of their monthly budget to cumulative transaction fees and untracked micro-expenses. Without an automated way to parse password-protected PDF statements or unstructured SMS logs, users struggle to identify "budget leaks," separate personal spending from small business cash flows, or establish a verifiable financial history. This lack of data-driven insight prevents millions from optimizing their spending, meeting savings goals, or accessing formal credit opportunities despite having significant transaction volumes.
The Personal Finance Tracker addresses this gap by introducing an automated, privacy-centric financial intelligence layer designed specifically for the M-Pesa ecosystem. The project transforms raw transaction records into actionable intelligence, highlighting high-cost patterns and providing personalized recommendations.

### Project Objectives
##### General Objective
To develop an automated, privacy-first financial intelligence system that categorizes M-Pesa transactions and provides optimized spending and budgeting recommendations to enhance the financial well-being of Kenyan users.

##### Specific Objectives
Develop a Robust Extraction Pipeline: Build a 5-stage data pipeline to accurately extract and clean data from password-protected M-Pesa PDF statements.

Automate Categorization: Implement a multi-class classification model that assigns each transaction to one of a fixed set of categories (e.g., Food, Transport, Utilities) with at least 92% accuracy.

Optimize Transaction Costs: Identify patterns in transaction fees to provide recommendations that could reduce M-Pesa fees by up to 30% through consolidation.

Provide Actionable Insights: Create an interactive web dashboard using Streamlit to visualize monthly trends, budget leaks, and progress toward savings goals.

Ensure Data Privacy: Implement a "local-first" architecture where sensitive financial data is processed on the user's device rather than stored in the cloud.

### Success Criteria
Categorization Accuracy: > 92% via ML enhancement.

User Effort: < 5% manual labeling required.

Economic Value: Identification of potential savings of up to KSh 30,000+ per month for high-volume users.

Privacy: Full compliance with a "zero-storage" policy for sensitive personally identifiable information (PII).

### Data Understanding
##### Data Source
The primary data source for this project consists of personal M-Pesa transaction statements. These are typically exported as password-protected PDF documents directly from the Safaricom M-Pesa ecosystem. For the purpose of analysis and model training, this data is extracted and structured into a CSV format containing historical transaction records.

##### Data Components
The dataset is composed of several key categories of information extracted from each transaction:

Transaction Identifiers: Receipt numbers (e.g., SBH9E59WPT) and completion timestamps. A receipt number is shared by a transaction and its M-Pesa fee row, so it is not unique per row.

Financial Details: The original transaction details, descriptions, and the specific amounts either "paid in" or "withdrawn".

Entity Information: Extracted fields containing agent till numbers, merchant names, paybill numbers, and recipient details (names and masked phone numbers).

System Metadata: Current account balances after each transaction and the transaction status (e.g., "Completed").

Enriched Analytics: Derived features such as the day of the week, time of day (e.g., "Morning", "Evening"), and specific categorization levels (e.g., "Finance & Fees", "Social & Leisure").
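
The entity information described above is stored in the `extracted_fields` column as a stringified Python dictionary. The following is a minimal sketch of how such strings could be parsed safely with `ast.literal_eval`. The helper name `parse_extracted_fields` and the agent name value are illustrative assumptions, not part of the original pipeline.

```python
import ast


def parse_extracted_fields(raw):
    """Parse a stringified dict such as "{'agent_till': '305622'}".

    Hypothetical helper: returns an empty dict for missing values,
    malformed strings, or literals that are not dictionaries.
    """
    if not isinstance(raw, str) or not raw.strip():
        return {}
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return {}
    return parsed if isinstance(parsed, dict) else {}


# Sample strings; 'Example Agent' is a placeholder value.
samples = [
    "{}",
    "{'agent_till': '305622', 'agent_name': 'Example Agent'}",
    "{'recipient_number': '2547******795'}",
    "not a dict",
]
for raw in samples:
    print(parse_extracted_fields(raw))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text that originates from user-supplied statements.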


In [1]:
# Import and Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns
import warnings
import ast
import re
from pathlib import Path


print("  SECTION 1 · LOAD & INITIAL INSPECTION")

df = pd.read_csv("final_analysis.csv")

print(f"\n✓ Loaded dataset successfully")
print(f"  Shape           : {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"  Memory usage    : {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

# Preview
print("\n── First 3 rows ")
print(df.head(3).to_string())

print("\n── Last 3 rows ")
print(df.tail(3).to_string())



  SECTION 1 · LOAD & INITIAL INSPECTION

✓ Loaded dataset successfully
  Shape           : 2,715 rows × 35 columns
  Memory usage    : 3976.7 KB

── First 3 rows 
   receipt_no      completion_time                                                                              details_original                                                                                 description     status   paid_in withdrawn  balance          type                                                                              extracted_fields      category merchant_subcategory merchant_id final_category      category_level1 category_level2             datetime        date  year  month month_name  day   weekday  weekday_num  hour  is_weekend time_of_day  amount_spent  amount_received  net_flow  cumulative_spent  cumulative_received  balance_change  is_essential  is_discretionary
0  SBH9E59WPT  2024-02-17 18:31:52                                                                              Airtime Purcha

In [2]:
# 2. Schema & Data Types
print(" SECTION 2 . SCHEMA & DATA TYPES")

print("\n── Column data types")
dtype_df = pd.DataFrame({
    "column":   df.columns,
    "dtype":    df.dtypes.values,
    "non_null": df.notnull().sum().values,
    "null":     df.isnull().sum().values,
    "unique":   df.nunique().values,
    "sample":   [str(df[c].dropna().iloc[0])[:50] if df[c].notnull().any() else "—"
                 for c in df.columns],
})
print(dtype_df.to_string(index=False))

# Type groups
numeric_cols      = df.select_dtypes(include="number").columns.tolist()
categorical_cols  = df.select_dtypes(include="object").columns.tolist()
print(f"\n  Numeric columns    : {len(numeric_cols)}  → {numeric_cols}")
print(f"  Categorical columns: {len(categorical_cols)} → {categorical_cols}")


 SECTION 2 . SCHEMA & DATA TYPES

── Column data types
              column   dtype  non_null  null  unique              sample
          receipt_no  object      2715     0    1936          SBH9E59WPT
     completion_time  object      2715     0    1936 2024-02-17 18:31:52
    details_original  object      2714     1     658    Airtime Purchase
         description  object      2714     1     655    Airtime Purchase
              status  object      2715     0       1           Completed
             paid_in  object       586  2129     125            3,000.00
           withdrawn  object      2129   586     294               -30.0
             balance float64      2715     0    2364               95.15
                type  object      2715     0      13             Airtime
    extracted_fields  object      2715     0     477                  {}
            category  object      2715     0      16             Airtime
merchant_subcategory  object       807  1908      12            Shopp

In [3]:
# 3. Missing Values Analysis
print("\n SECTION 3 . MISSING VALUES ANALYSIS")

missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
    "missing_count": missing,
    "missing_%": missing_pct,
}).query("missing_count > 0")

print(f"\n  Total columns with missing data : {len(missing_df)}")
print(f"  Total missing cells             : {missing.sum():,}")
# Use the actual column count rather than a hard-coded 35
print(f"  Fully complete columns          : {(missing == 0).sum()}/{df.shape[1]}\n")
print(missing_df.to_string())


 SECTION 3 . MISSING VALUES ANALYSIS

  Total columns with missing data : 7
  Total missing cells             : 6,546
  Fully complete columns          : 28/35

                      missing_count  missing_%
details_original                  1       0.04
description                       1       0.04
paid_in                        2129      78.42
withdrawn                       586      21.58
merchant_subcategory           1908      70.28
merchant_id                    1920      70.72
balance_change                    1       0.04
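
The missing counts for `paid_in` (2,129) and `withdrawn` (586) sum to the 2,715 rows in the dataset, which suggests that each row fills exactly one of the two columns. The counts alone do not prove this, so a quick check is worthwhile. The sketch below runs that check on a small toy frame with illustrative values; on the real data, `toy` would be replaced by `df`.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the two amount columns; the values are illustrative only.
toy = pd.DataFrame({
    "paid_in":   [np.nan, "3,000.00", np.nan, np.nan],
    "withdrawn": ["-30.0", np.nan, "-13.0", "-550.0"],
})

# True where exactly one of the two columns is populated (logical XOR).
exactly_one = toy["paid_in"].notna() ^ toy["withdrawn"].notna()

print(f"Rows with exactly one amount filled : {exactly_one.sum()}/{len(toy)}")
print(f"Rows violating the pattern          : {(~exactly_one).sum()}")
```

If the second count is zero on the full dataset, the missing values in these two columns are structural rather than a data-quality problem.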


In [4]:
# 4. Duplicate Analysis
print("\n SECTION 4 . DUPLICATE ANALYSIS")

total_rows        = len(df)
unique_receipts   = df["receipt_no"].nunique()
dup_receipt_rows  = df.duplicated(subset=["receipt_no"]).sum()
fully_dup_rows    = df.duplicated().sum()

print(f"\n  Total rows                 : {total_rows:,}")
print(f"  Unique receipt numbers     : {unique_receipts:,}")
print(f"  Duplicate receipt_no rows  : {dup_receipt_rows:,}  ← M-Pesa fee pairs")
print(f"  Fully duplicate rows   : {fully_dup_rows:,}  ← Exact duplicates")

fee_rows     = df[df["type"] == "M-Pesa Fee"]
dup_receipts = df[df.duplicated(subset=["receipt_no"], keep=False)]["receipt_no"].nunique()
print(f"\n  M-Pesa Fee rows  : {len(fee_rows):,}")
print(f"  Duplicate receipt_no groups : {dup_receipts:,}")
print(f"  → Each duplicate pair = 1 primary txn + 1 M-Pesa Fee row (same receipt_no)")

example_receipt = df[df.duplicated(subset=["receipt_no"], keep=False)].iloc[0]["receipt_no"]
print(f"\n  Example duplicate pair (receipt: {example_receipt}):")
print(df[df["receipt_no"] == example_receipt][
    ["receipt_no", "type", "description", "amount_spent", "amount_received"]
].to_string(index=False))




 SECTION 4 . DUPLICATE ANALYSIS

  Total rows                 : 2,715
  Unique receipt numbers     : 1,936
  Duplicate receipt_no rows  : 779  ← M-Pesa fee pairs
  Fully duplicate rows   : 0  ← Exact duplicates

  M-Pesa Fee rows  : 779
  Duplicate receipt_no groups : 779
  → Each duplicate pair = 1 primary txn + 1 M-Pesa Fee row (same receipt_no)

  Example duplicate pair (receipt: SBJ4ILP3K2):
receipt_no       type                                     description  amount_spent  amount_received
SBJ4ILP3K2 M-Pesa Fee               Customer Transfer of Funds Charge          13.0              0.0
SBJ4ILP3K2 Send Money Customer Transfer to - 2547******795 KIOKO PAUL         550.0              0.0
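
Because a fee row shares its `receipt_no` with the primary transaction, the two can be joined to estimate the effective fee rate per transaction. The sketch below shows one way to do this on toy rows modelled on the example pair above. The column names `fee` and `fee_pct` are assumptions introduced for illustration.

```python
import pandas as pd

# Toy rows modelled on the example pair, plus one transaction without a fee.
toy = pd.DataFrame({
    "receipt_no":   ["SBJ4ILP3K2", "SBJ4ILP3K2", "SBH9E59WPT"],
    "type":         ["M-Pesa Fee", "Send Money", "Airtime"],
    "amount_spent": [13.0, 550.0, 30.0],
})

is_fee = toy["type"] == "M-Pesa Fee"

# Total fee per receipt, as a two-column frame (receipt_no, fee).
fees = (
    toy[is_fee]
    .groupby("receipt_no")["amount_spent"]
    .sum()
    .rename("fee")
    .reset_index()
)

# Attach the fee to each primary (non-fee) transaction; no fee means 0.0.
primary = toy[~is_fee].merge(fees, on="receipt_no", how="left")
primary["fee"] = primary["fee"].fillna(0.0)
primary["fee_pct"] = (primary["fee"] / primary["amount_spent"] * 100).round(2)

print(primary.to_string(index=False))
```

Aggregating `fee_pct` by transaction type or amount band is one possible starting point for the fee-consolidation recommendations named in the objectives.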


In [5]:
# 5. Descriptive Statistics
print("\n SECTION 5 . DESCRIPTIVE STATISTICS")

key_numeric = ["amount_spent", "amount_received", "net_flow", "balance", "balance_change"]

print("\n── Full describe() ")
print(df[key_numeric].describe().round(2).to_string())

print("\n── Extended percentiles")
percs = [1, 5, 10, 25, 50, 75, 90, 95, 99]
ext = df[key_numeric].quantile([p / 100 for p in percs])
ext.index = [f"p{p}" for p in percs]
print(ext.round(2).to_string())

print("\n── Skewness & Kurtosis ")
sk_df = pd.DataFrame({
    "skewness": df[key_numeric].skew().round(3),
    "kurtosis": df[key_numeric].kurt().round(3),
    "interpretation": [
        "Heavy right-skew (many small, few huge txns)",
        "Heavy right-skew (income mostly zero; large spikes)",
        "Approximately symmetric around 0",
        "Right-skewed (low median vs high mean)",
        "Mixed (deposits & withdrawals)"
    ]
})
print(sk_df.to_string())


 SECTION 5 . DESCRIPTIVE STATISTICS

── Full describe() 
       amount_spent  amount_received  net_flow   balance  balance_change
count       2715.00          2715.00   2715.00   2715.00         2714.00
mean        1182.54          1182.78      0.24   5378.22            0.25
std         3783.89          4142.76   5854.80   7822.34         5879.11
min            0.00             0.00 -70000.00      1.15       -70000.00
25%            7.00             0.00   -500.00    709.30         -500.00
50%           50.00             0.00    -50.00   2179.45          -50.00
75%          500.00             0.00     -7.00   6381.95           33.00
max        70000.00         70000.00  70000.00  72820.45        69060.00

── Extended percentiles
     amount_spent  amount_received  net_flow   balance  balance_change
p1            0.0              0.0  -17779.0     44.45        -17780.5
p5            0.0              0.0   -8000.0    135.85         -8000.0
p10           0.0              0.0   -2750.0   

##### Key Data Properties
Temporal Nature: The data is a time-series record, allowing for the analysis of spending trends over days, months, and years.

Categorical Diversity: Transactions cover a wide range of types, including Airtime Purchases, PayBill payments, Pochi la Biashara, Cash Deposits, and Send Money transfers.

High Granularity: The data provides granular insights into specific merchants (like KPLC or Nairobi Water) and individual peer-to-peer recipients.

Relational Structure: There is a clear relationship between primary transactions and their associated M-Pesa transaction fees. Each fee is recorded as a separate row that shares the receipt number of its primary transaction.
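
The temporal features used throughout the analysis (`weekday`, `is_weekend`, `time_of_day`) can be derived from the completion timestamp alone. The sketch below shows one possible derivation using only the standard library. The hour boundaries for each time-of-day bucket and the Monday = 0 weekday numbering are assumptions, since the dataset does not document how these columns were produced.

```python
from datetime import datetime


def time_of_day(hour):
    """Map an hour (0-23) to a coarse bucket; the boundaries are assumed."""
    if 5 <= hour < 12:
        return "Morning"
    if 12 <= hour < 17:
        return "Afternoon"
    if 17 <= hour < 21:
        return "Evening"
    return "Night"


def temporal_features(timestamp):
    """Derive calendar features from a 'YYYY-MM-DD HH:MM:SS' string."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "weekday_num": dt.weekday(),            # Monday = 0 ... Sunday = 6
        "hour": dt.hour,
        "is_weekend": int(dt.weekday() >= 5),   # Saturday or Sunday
        "time_of_day": time_of_day(dt.hour),
    }


# Timestamp of the first transaction in the dataset.
print(temporal_features("2024-02-17 18:31:52"))
```

For the first transaction this yields a weekend flag of 1 and the "Evening" bucket, which agrees with the values stored in the dataset for that row.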

##### Data Limitations
Unstructured Descriptions: Original transaction details are often messy or contain concatenated strings (e.g., "Deposit of Funds at Agent Till692803..."), requiring significant cleaning and parsing.

Privacy Masking: Phone numbers are partially masked (e.g., 2547******795), which protects privacy but limits the ability to uniquely identify repeat individual recipients across different datasets.

Data Invisibility: Raw data is trapped in password-protected PDFs, making it inaccessible for automated tools without a dedicated decryption and extraction pipeline.

Categorization Complexity: Some transactions, especially "Send Money" transfers to individuals, may serve various purposes (e.g., a "gift" vs. a "payment for service"), making automated categorization challenging without additional user input.
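
To illustrate the parsing that unstructured descriptions require, the sketch below applies regular expressions to the two description shapes quoted in this document: agent deposits and customer transfers. The patterns, the `recipient` field name, and the "Unparsed" fallback label are assumptions based only on those examples. Real statements will likely need more patterns.

```python
import re

# Tolerates a missing space after "Till", as in "Agent Till692803".
AGENT_DEPOSIT = re.compile(
    r"Deposit of Funds at Agent Till\s*(?P<agent_till>\d+)\s*-\s*(?P<agent_name>.+)"
)
# Matches a masked number such as 2547******795 followed by a name.
SEND_MONEY = re.compile(
    r"Customer Transfer to\s*-\s*(?P<recipient_number>\d+\*+\d+)\s+(?P<recipient>.+)"
)


def parse_description(text):
    """Return the transaction type and any fields found in the description."""
    for kind, pattern in (("Cash Deposit", AGENT_DEPOSIT), ("Send Money", SEND_MONEY)):
        match = pattern.search(text)
        if match:
            fields = {key: value.strip() for key, value in match.groupdict().items()}
            return {"type": kind, **fields}
    return {"type": "Unparsed"}


# 'Example Agent' is a placeholder name.
print(parse_description("Deposit of Funds at Agent Till692803 - Example Agent"))
print(parse_description("Customer Transfer to - 2547******795 KIOKO PAUL"))
print(parse_description("Airtime Purchase"))
```

Rows that fall through to "Unparsed" are natural candidates for manual review, which ties in with the goal of keeping manual labeling below 5%.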


### Data Preparation
The data preparation phase for the Personal Finance Tracker involves transforming raw, semi-structured M-Pesa records into a clean, feature-rich dataset suitable for classification.




In [6]:
# 1. Dropping Redundant Columns
# Rationale: 'details_original' is unstructured raw text; 'category_level1' and
# 'category_level2' are repetitive because 'final_category' is kept; 'completion_time'
# is repetitive because the parsed 'datetime' and 'date' columns are kept.
redundant_columns = [
    'details_original',
    'category_level1',
    'category_level2',
    'completion_time',
]
df_cleaned = df.drop(columns=redundant_columns)



# Final check of the prepared data
print(f"Original shape: {df.shape}")
print(f"Cleaned shape: {df_cleaned.shape}")


df_cleaned.head()



Original shape: (2715, 35)
Cleaned shape: (2715, 31)


| | receipt_no | description | status | paid_in | withdrawn | balance | type | extracted_fields | category | merchant_subcategory | ... | is_weekend | time_of_day | amount_spent | amount_received | net_flow | cumulative_spent | cumulative_received | balance_change | is_essential | is_discretionary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SBH9E59WPT | Airtime Purchase | Completed | NaN | -30.0 | 95.15 | Airtime | {} | Airtime | NaN | ... | 1 | Evening | 30.0 | 0.0 | -30.0 | 30.0 | 0.0 | NaN | 1 | 0 |
| 1 | SBI2H3UA02 | Deposit of Funds at Agent Till 305622 - Daljos... | Completed | 3000.0 | NaN | 3095.15 | Cash Deposit | {'agent_till': '305622', 'agent_name': 'Daljos... | Cash Deposit | NaN | ... | 1 | Afternoon | 0.0 | 3000.0 | 3000.0 | 30.0 | 3000.0 | 3000.0 | 0 | 0 |
| 2 | SBJ4ILP3K2 | Customer Transfer of Funds Charge | Completed | NaN | -13.0 | 2532.15 | M-Pesa Fee | {} | M-Pesa Fees | NaN | ... | 0 | Morning | 13.0 | 0.0 | -13.0 | 43.0 | 3000.0 | -563.0 | 0 | 0 |
| 3 | SBJ4ILP3K2 | Customer Transfer to - 2547******795 KIOKO PAUL | Completed | NaN | -550.0 | 2545.15 | Send Money | {'recipient_number': '2547******795', 'recipie... | Friends & Family | NaN | ... | 0 | Morning | 550.0 | 0.0 | -550.0 | 593.0 | 3000.0 | 13.0 | 0 | 0 |
| 4 | SBJ4JNM6VS | Deposit of Funds at Agent Till 692803 - Kilwa ... | Completed | 1000.0 | NaN | 3532.15 | Cash Deposit | {'agent_till': '692803', 'agent_name': 'Kilwa ... | Cash Deposit | NaN | ... | 0 | Afternoon | 0.0 | 1000.0 | 1000.0 | 593.0 | 4000.0 | 987.0 | 0 | 0 |


Why this is necessary:

Model Performance: Dropping details_original after extraction prevents the model from overfitting on raw, messy strings.

Financial Accuracy: Replacing NaN with 0.0 in paid_in and withdrawn ensures that "Total Spent" and "Net Flow" calculations are mathematically sound. The cell above does not perform this step; the derived amount_spent and amount_received columns already hold 0.0 where the raw columns are empty.

Redundancy: Keeping only final_category and removing category_level1 and category_level2 reduces the dimensionality of the data, making the classifier faster and more efficient.
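
A minimal sketch of the NaN-to-zero conversion described above follows. It assumes the raw `paid_in` and `withdrawn` columns hold strings such as "3,000.00" and "-30.0", as the schema output suggests, and it is not necessarily how the original `amount_spent`, `amount_received`, and `net_flow` columns were produced. The helper name `to_amount` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with string amounts and missing values; illustrative only.
toy = pd.DataFrame({
    "paid_in":   [np.nan, "3,000.00", np.nan],
    "withdrawn": ["-30.0", np.nan, "-550.0"],
})


def to_amount(series):
    """Strip thousands separators, coerce to float, and treat missing as 0.0."""
    cleaned = series.str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce").fillna(0.0)


toy["amount_received"] = to_amount(toy["paid_in"])
toy["amount_spent"] = to_amount(toy["withdrawn"]).abs()   # withdrawals are negative
toy["net_flow"] = toy["amount_received"] - toy["amount_spent"]

print(toy[["amount_received", "amount_spent", "net_flow"]].to_string(index=False))
```

With `errors="coerce"`, any value that cannot be parsed becomes NaN and is then counted as 0.0, so it is worth comparing the number of non-null inputs and non-zero outputs before relying on the totals.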