#  M-PESA DATA EXTRACTION PIPELINE
## From PDF Upload to Analytics-Ready Data

---

###  Complete Workflow:

```
PDF UPLOAD ‚Üí AUTOMATED PROCESSING ‚Üí MERCHANT LEARNING ‚Üí ANALYTICS READY
```

###  What This Notebook Does:

1. **Stage 1-5:** Automated processing (PDF ‚Üí Categorized CSV)
2. **Stage 6A:** [OPTIONAL] Import manual labels
3. **Stage 6B:** Hybrid merchant learning (interactive)
4. **Stage 7:** Create unified category system
5. **Final:** Analysis-ready CSV with all features

###  Final Output:

- **ONE clean CSV** with unified categories
- **Temporal features** (weekday, hour, payday indicators)
- **Financial features** (spending velocity, balance trends)
- **Behavioral features** (recurring merchants, patterns)
- **100% analysis-ready** for EDA, dashboards, ML

---

##  SETUP & CONFIGURATION

In [116]:
# Core imports
import pandas as pd
import numpy as np
import json
import os
import re
import tabula
from collections import Counter
from datetime import datetime
from typing import Dict
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

print("‚úÖ Libraries loaded")


‚úÖ Libraries loaded


###  Project Configuration

**Update these settings for your project:**

In [117]:
# ============================================================================
# PROJECT SETTINGS
# ============================================================================

# User ID (for personal merchant database)
USER_ID = "john"  # Change per user

# Input: M-Pesa PDF statement
PDF_PATH = r"C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\mpesa_statement_john.pdf"
PDF_PASSWORD = "335419"

# Optional: Manual labels (if you have pre-labeled data)
MANUAL_LABELS_CSV = r"C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\Statement3.csv"  # Set path or None
MANUAL_LABELS_COLUMN = "Merchant_Subcategory"  # Column with labels

# Output directory
OUTPUT_DIR = r"C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\processed"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Database directory (for merchant learning)
DATABASE_DIR = os.path.join(OUTPUT_DIR, "merchant_databases")
os.makedirs(DATABASE_DIR, exist_ok=True)

# ============================================================================
# STAGE OUTPUTS
# ============================================================================

STAGE1 = os.path.join(OUTPUT_DIR, "stage1_raw.csv")
STAGE2 = os.path.join(OUTPUT_DIR, "stage2_types.csv")
STAGE3 = os.path.join(OUTPUT_DIR, "stage3_categories.csv")
STAGE4 = os.path.join(OUTPUT_DIR, "stage4_smart_rules.csv")
STAGE5 = os.path.join(OUTPUT_DIR, "stage5_clean.csv")
STAGE6 = os.path.join(OUTPUT_DIR, "stage6_merchants_learned.csv")
FINAL = os.path.join(OUTPUT_DIR, "FINAL_ANALYSIS_READY.csv")

print("‚úÖ Configuration complete")
print(f"   User: {USER_ID}")
print(f"   Output: {OUTPUT_DIR}")
print(f"   Database: {DATABASE_DIR}")

‚úÖ Configuration complete
   User: john
   Output: C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\processed
   Database: C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\processed\merchant_databases


---
#  AUTOMATED PROCESSING PIPELINE
## Stages 1-5 (No User Input Required)
---

## STAGE 1: PDF ‚Üí CSV

Extract M-Pesa statement from password-protected PDF.

In [118]:
"""
STAGE 1: PDF TO CSV CONVERSION
"""

import tabula

print("=" * 80)
print("STAGE 1: PDF ‚Üí CSV")
print("=" * 80)
print()

print(f" Reading: {PDF_PATH}")
print(f" Password: {'*' * len(PDF_PASSWORD)}")
print()

# Extract all tables from all pages
tables = tabula.read_pdf(
    PDF_PATH,
    password=PDF_PASSWORD,
    encoding='latin-1',
    pages='all',
    multiple_tables=True
)

print(f"‚úì Extracted {len(tables)} tables")

# Combine tables
df = pd.concat(tables, ignore_index=True)

print(f"‚úì Combined: {len(df):,} rows, {len(df.columns)} columns")

# Save
df.to_csv(STAGE1, index=False, encoding='utf-8-sig')

print(f" Saved: {STAGE1}")
print()
print(" STAGE 1 COMPLETE")
print()

STAGE 1: PDF ‚Üí CSV

 Reading: C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\mpesa_statement_john.pdf
 Password: ******



Got stderr: Feb 20, 2026 2:17:17 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>



‚úì Extracted 148 tables
‚úì Combined: 2,869 rows, 14 columns
 Saved: C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\processed\stage1_raw.csv

 STAGE 1 COMPLETE



## STAGE 2: Transaction Type Identification

Identify: Send Money, Till Payment, PayBill, Pochi, Airtime, etc.

In [119]:
"""
STAGE 2: TRANSACTION TYPE IDENTIFICATION (COMPLETE FIX)
‚úÖ Separates Fuliza LOAN (OverDraft of Credit Party) from Fuliza PAYMENTS
‚úÖ Loan Repayment as separate category
‚úÖ All other fixes included
"""
class TransactionTypeIdentifier:
    """Enhanced transaction type identification with all fixes"""
    
    def __init__(self):
        """Define patterns in strict priority order"""
        # Format: (type_name, [patterns], priority)
        self.type_patterns = [
            # PRIORITY 1: Fees (check first - often confused with other types)
            ('M-Pesa Fee', [
                r'transfer\s+of\s+funds\s+charge',
                r'pay\s+bill\s+charge',
                r'pay\s+merchant\s+charge',
                r'withdraw(al)?\s+charge',
                r'\bcharge\b$',
            ], 1),
            
            # PRIORITY 2: Fuliza/Overdraft LOAN (the credit itself - OverDraft of Credit Party)
            ('Fuliza', [
                r'overdraft\s+of\s+credit\s+party',  # This is the LOAN
            ], 2),
            
            # PRIORITY 3: Loan Repayment (paying back loans, including Fuliza payments)
            ('Loan Repayment', [
                r'od\s+loan\s+repayment',
                r'loan\s+repayment',
                r'fuliza\s+repayment',
                r'overdraw',
            ], 3),
            
            # PRIORITY 4: LOOP Payment (Income from LOOP)
            ('LOOP Payment', [
                r'promotion\s+payment\s+from.*loop\s+b2c',
                r'loop\s+b2c',
            ], 4),
            
            # PRIORITY 5: Received Money (Income)
            ('Income', [
                r'funds\s+received\s+from',
                r'business\s+payment\s+from',
                r'received\s+from',
                r'salary\s+payment\s+from',
            ], 5),
            
            # PRIORITY 6: Cash Deposit (at agent)
            ('Cash Deposit', [
                r'deposit\s+of\s+funds\s+at\s+agent',
            ], 6),
            
            # PRIORITY 7: Cash Withdrawal (at agent)
            ('Cash Withdrawal', [
                r'customer\s+withdrawal\s+at\s+agent',
                r'withdrawal\s+at\s+agent',
            ], 7),
            
            # PRIORITY 8: Data Bundles (separated from Airtime)
            # Including Fuliza-powered data bundles
            ('Data Bundles', [
                r'safaricom\s+data',
                r'safaricom\s+data\s+bundles',
                r'customer\s+bundle\s+purchase\s+with\s+fuliza.*4093441',
                r'(?i)buy\s+bundle',
                r'(?i)customer\s+bundle\s+purchase',
                r'customer\s+bundle\s+purchase\s+with\s+fuliza',
            
            ], 8),
            
            # PRIORITY 9: Airtime (separated from Data, includes Direct Pay)
            # Including Fuliza-powered airtime
            ('Airtime', [
                r'(?i)safaricom\s+offers',  # Safaricom Offers = Airtime
                r'airtime\s+purchase',
                r'pay\s+bill.*direct\s+pay.*atl\d+',  # Direct Pay airtime
                r'4187661.*direct\s+pay',  # Direct Pay paybill
                r'4093275.*direct\s+pay',  # Another Direct Pay paybill
                r'recharge\s+for\s+customer',
                r'pay\s+bill.*220220.*pesapal.*airt\d+',
                r'(?i).\bpesapal\b.',
                r'(?i)merchant\s+payment.to\s+\d+\s-\s*TINGG',# Catch TINGG via Merchant Payment
                r'(?i)pay\s+bill.to\s+\d+\s-\s*TINGG',   # Catch TINGG via Pay Bill (The one you just found)
                r'TINGG',  # Catch any remaining TINGG transactions as Airtime
            ], 9),
            
            # PRIORITY 10: Send Money (including Fuliza-powered transfers)
            ('Send Money', [
                r'(?i)customer\s+transfer\s+to\s+-\s+(2547|07|01)[\d\*]+',
                r'customer\s+transfer\s+to\s+-\s+',
                r'(?i)customer\stransfer',
                r'customer\s+send\s+money.*fuliza.*to\s+-\s+(2547|07|01)[\d\*]+',
                r'(?i)customer\s+transfer\s+fuliza\s+mpesa\s*to\s+-\s+(2547|07|01)[\d\*]+',
            ], 10),
            
            # PRIORITY 11: Pochi la Biashara
            ('Pochi la Biashara', [
                r'customer\s+payment\s+to\s+small\s+business',
            ], 11),
            
            # PRIORITY 12: Till Payment (including Fuliza-powered)
            ('Till Payment', [
                r'merchant\s+payment\s+(online\s+)?to\s+\d+',
                r'merchant\s+payment\s+fuliza\s+m-?pesa\s*to\s+\d+',
                r'till\s+\d+',
            ], 12),
            
            # PRIORITY 13: PayBill (including Fuliza-powered)
            ('PayBill', [
                r'pay\s+bill\s+(online\s+)?to\s+\d+',
                r'pay\s+bill\s+fuliza\s+m-?pesa\s+to\s+\d+',
                r'pay\s+bill\s+online\s+fuliza\s+m-pesa\s+to\s+(\d+)\s+-\s+([\w\s]+?)\s+acc\.?\s+([\w\s]+)',
            ], 13),
            
            # PRIORITY 14: M-Shwari
            ('M-Shwari', [
                r'm-?\s*shwari',
            ], 14),
            
            # PRIORITY 15: Unit Trust
            ('Unit Trust', [
                r'unit\s+trust',
                r'ziidi',
            ], 15),
            
            # PRIORITY 16: Reversal
            ('Reversal', [
                r'reversal',
            ], 16),
        ]
    
    def identify_type(self, description: str) -> str:
        """Identify transaction type"""
        if pd.isna(description) or description == '':
            return 'Other'
        
        desc_lower = str(description).lower().strip()
        
        # Check in priority order
        for trans_type, patterns, _ in self.type_patterns:
            for pattern in patterns:
                if re.search(pattern, desc_lower, re.IGNORECASE):
                    return trans_type
        
        return 'Other'
    
    def extract_fields(self, description: str, txn_type: str) -> Dict:
        """Extract key fields from description"""
        if pd.isna(description):
            return {}
        
        fields = {}
        desc = str(description)
        
        if txn_type == "Send Money":
            # Regular transfer
            match = re.search(
                r'(?i)customer\s+transfer\s+(?:fuliza\s+mpesa\s*)?to\s+-\s+((2547|07|01)[\d\*]+)\s+(.*)',
                desc
            )
            if match:
                fields["recipient_number"] = match.group(1)
                fields["recipient_name"] = match.group(3).strip()
        
        elif txn_type == "Pochi la Biashara":
            match = re.search(
                r'(?i)small\s+business\s+to\s+-\s+((2547|07|01)[\d\*]+)\s+(.*)',
                desc
            )
            if match:
                fields["recipient_number"] = match.group(1)
                fields["recipient_name"] = match.group(3).strip()
        
        elif txn_type == "Till Payment":
            # Regular or Fuliza merchant payment
            match = re.search(
                r'(?i)merchant\s+payment\s+(?:fuliza\s+m-?pesa\s*)?(?:online\s+)?to\s+(\d+)\s+-\s+(.*)',
                desc
            )
            if match:
                fields["till_number"] = match.group(1)
                raw_merchant = match.group(2).strip()
                raw_merchant = re.sub(
                    r'(?i)\s+via\s+(coop|equity|kcb|ncba|family)\s+bank\.?$',
                    '', raw_merchant
                ).strip()
                fields["merchant_name"] = raw_merchant
        
        elif txn_type == "PayBill":
            # Regular or Fuliza paybill
            match = re.search(
                r'(?i)pay\s+bill\s+(?:fuliza\s+m-?pesa\s*)?(?:online\s+)?to\s+(\d+)\s+[-‚Äì]\s+([\w\s]+?)\s+[Aa]cc\.?\s+([\w#]+)',
                desc
            )
            if match:
                fields["paybill_number"] = match.group(1)
                fields["merchant_name"] = match.group(2).strip()
                fields["account_number"] = match.group(3).strip()
            else:
                match2 = re.search(
                    r'(?i)pay\s+bill\s+(?:fuliza\s+m-?pesa\s*)?(?:online\s+)?to\s+(\d+)\s+[-‚Äì]?\s+(.*)',
                    desc
                )
                if match2:
                    fields["paybill_number"] = match2.group(1)
                    fields["merchant_name"] = match2.group(2).strip()
        
        elif txn_type in ["Cash Withdrawal", "Cash Deposit"]:
            match = re.search(
                r'(?i)agent\s+till\s+(\d+)\s+[-‚Äì]\s+(.*)',
                desc
            )
            if match:
                fields["agent_till"] = match.group(1)
                fields["agent_name"] = match.group(2).strip()
        
        elif txn_type in ["Received Money", "LOOP Payment"]:
            match = re.search(
                r'(?i)(?:funds\s+received|payment)\s+from\s+[-‚Äì]?\s+(\d+)\s+[-‚Äì]\s+(.*)',
                desc
            )
            if match:
                fields["sender_number"] = match.group(1)
                fields["sender_name"] = match.group(2).strip()
        
        return fields
    
    def process_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
        """Add transaction_type and extracted_fields"""
        print(" Identifying transaction types...")
        
        # Clean description
        df['description_clean'] = df['Details'].apply(self._clean_text)
        
        # Identify types
        df['transaction_type'] = df['description_clean'].apply(self.identify_type)
        
        # Extract fields
        print(" Extracting details...")
        df['extracted_fields'] = df.apply(
            lambda row: self.extract_fields(row['description_clean'], row['transaction_type']),
            axis=1
        )
        
        print(f"‚úì Identified {df['transaction_type'].nunique()} transaction types")
        
        return df
    
    def _clean_text(self, text: str) -> str:
        """Clean multiline PDF text"""
        if pd.isna(text):
            return ''
        text = str(text).replace('\\r', ' ').replace('\\n', ' ').replace('\r', ' ').replace('\n', ' ')
        text = re.sub(r'\s+', ' ', text)
        return text.strip()


def run_stage2(input_csv: str, output_csv: str):
    """Run Stage 2"""
    

    # Load
    print(f" Loading: {input_csv}")
    df = pd.read_csv(input_csv, low_memory=False)
    
    # Filter to transactions
    mask = df['Receipt No.'].notna() & (df['Receipt No.'] != '')
    df = df[mask].copy()
    
    print(f"‚úì Loaded {len(df)} transactions")
    print()
    
    # Process
    identifier = TransactionTypeIdentifier()
    df = identifier.process_dataframe(df)
    
    print()
    print("=" * 80)
    print("TRANSACTION TYPE BREAKDOWN")
    print("=" * 80)
    for trans_type, count in df['transaction_type'].value_counts().items():
        pct = (count / len(df)) * 100
        print(f"{trans_type:25s}: {count:5d} ({pct:5.1f}%)")
    
    print()
    print("=" * 80)
    print("VERIFICATION - KEY TYPES")
    print("=" * 80)
    
    # Verify critical fixes
    key_types = ['Fuliza', 'Loan Repayment', 'LOOP Payment', 'Data Bundles', 
                 'Airtime', 'Cash Deposit', 'Cash Withdrawal']
    
    for trans_type in key_types:
        type_df = df[df['transaction_type'] == trans_type]
        if len(type_df) > 0:
            print(f"\n{trans_type} ({len(type_df)} transactions):")
            for _, row in type_df.head(2).iterrows():
                print(f"  {row['description_clean'][:70]}")
    
    print()
    
    # Show Fuliza-powered transactions
    fuliza_powered = df[df['description_clean'].str.contains('fuliza', case=False, na=False)]
    print("=" * 80)
    print(f"FULIZA-POWERED TRANSACTIONS: {len(fuliza_powered)} total")
    print("=" * 80)
    fuliza_breakdown = fuliza_powered['transaction_type'].value_counts()
    for txn_type, count in fuliza_breakdown.items():
        print(f"  {txn_type:25s}: {count:5d}")
    print()
    print("‚úì Fuliza LOAN (OverDraft) vs Fuliza PAYMENTS properly separated!")
    print()
    
    # Save
    df['extracted_fields_str'] = df['extracted_fields'].apply(str)
    df.to_csv(output_csv, index=False)
    print(f"‚úÖ Saved: {output_csv}")
    print()
    
    return df


if __name__ == "__main__":
    INPUT = r"C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\stage1_mpesa_raw.csv"
    OUTPUT = r"C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\stage2_with_types.csv"
    
    df = run_stage2(INPUT, OUTPUT)
    print("‚úÖ Ready for Stage 3")

 Loading: C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\stage1_mpesa_raw.csv
‚úì Loaded 2715 transactions

 Identifying transaction types...
 Extracting details...
‚úì Identified 13 transaction types

TRANSACTION TYPE BREAKDOWN
M-Pesa Fee               :   779 ( 28.7%)
Send Money               :   625 ( 23.0%)
PayBill                  :   381 ( 14.0%)
Cash Deposit             :   297 ( 10.9%)
Income                   :   245 (  9.0%)
Pochi la Biashara        :   139 (  5.1%)
M-Shwari                 :    85 (  3.1%)
Till Payment             :    79 (  2.9%)
Airtime                  :    58 (  2.1%)
Cash Withdrawal          :    21 (  0.8%)
Reversal                 :     3 (  0.1%)
Data Bundles             :     2 (  0.1%)
Other                    :     1 (  0.0%)

VERIFICATION - KEY TYPES

Data Bundles (2 transactions):
  Customer Bundle Purchase to 4093441SAFARICOM DATA BUNDLES by - 2547***
  Customer Bundle Purchase to 4093441SAFARICOM DATA BUNDLES by - 2547***

Airtime (58 trans

## STAGE 3: Keyword Categorization

Apply keyword rules: Income, Bills, Shopping, Airtime, etc.

In [120]:

class KeywordCategorizer: 
    def __init__(self):
        self.category_keywords = {
            # HIGH PRIORITY
            'Health Care': {
                'keywords': [
                    'hospital', 'clinic', 'pharmacy', 'medical',
                    'chemist', 'doctor', 'laboratory', 'lab', 'diagnostic',
                    'aga khan', 'nairobi hospital', 'mater', 'kenyatta hospital',
                    'mp shah', 'gertrudes', 'lancet', 'dental', 'optical',
                ],
                'priority': 1,
            },
            
            'Government Bills': {
                'keywords': [
                    'government', 'tax', 'revenue', 'nssf', 'nhif', 'kra', 'SHIF', 'SHA',
                    'pension', 'social security', 'national insurance', 'E-CITIZEN', 'E-Citizen','E- CITIZEN',
                ],
                'priority': 1,
            },
            
            'Betting': {
                'keywords': [
                    'sportpesa', 'sportybet', 'betika', '1xbet', 'stake', 
                    'bangbet', '22bet', 'mozzart bet', 'betway', 'odibets',
                    'kareco holdings', 'melbet', 'betin', 'betpawa', 'shabiki',
                    'bet', 'betting', 'lotto', 'lottery', 'casino', 'CHEZA WIN', 
                    'SHINDA SASA', 'DUMUKENYA', 'BETWIN', 'DUMU KENYA',
                ],
                'priority': 1,
            },
            
            # Loans - REMOVED "fuliza" (now handled via transaction type)
            'Loans': {
                'keywords': [
                    'm-shwari loan', 'kcb m-pesa loan', 'hustler fund',
                    'okash', 'zenka', 'timiza', 'Overdraft',
                ],
                'priority': 1,
            },

            'Loan Repayment': {
                'keywords': ['repayment', 'overdraw'],
                'priority': 1,
            },
            
            # Online Shopping - REMOVED "online" and "online purchase"
            'Online Shopping': {
                'keywords': [
                    'jumia', 'kilimall', 'masoko', 'glovo', 'jiji',
                    'aliexpress', 'amazon', 'alibaba', 'uber eats', 'bolt food',
                    'sky garden', 'food delivery', 'home delivery',
                ],
                'priority': 1,
            },
            
            'Bills': {
                'keywords': [
                    'kplc', 'water', 'rent', 'insurance', 'gas refill',
                    'internet', 'wifi', 'land rates', 'security',
                    'parking', 'electricity', 'prepaid', 'postpaid',
                ],
                'priority': 1,
            },
            
            'Subscriptions': {
                'keywords': [
                    'netflix', 'spotify', 'youtube', 'prime', 'hbo',
                    'GOTV', 'dstv', 'showmax', 'apple music', 'startimes',
                    'zuku', 'subscription', 'microsoft 365', 'office 365', 'GO TV'
                ],
                'priority': 1,
            },
            
            'Education': {
                'keywords': [
                    'university', 'school', 'college', 'helb', 'kuccps',
                    'knec', 'tvet', 'kmtc', 'fees', 'tuition', 'catering',
                    'kabarak', 'student', 'academy', 'exam fee', 'hostel',
                ],
                'priority': 1,
            },
            
            'Savings': {
                'keywords': [
                    'mshwari deposit', 'unit trust', 'mmf', 'fixed deposit',
                    'investment', 'koala', 'ndovu', 'etica', 'chama',
                    'ziidi', 'savings', 'sacco deposit', 'Sacco', 'co-op', 
                    'equity', 'kcb', 'ncba', 'family bank', 'absa', 'stanbic', 
                    'co-operative', 'self help group', 'mfb', 'mf', 'savings group', 
                    'savings account', 'savings deposit', 'retirement', 
                    'retirement fund', 'pension fund',
                ],
                'priority': 1,
            },
            
            # MEDIUM PRIORITY
            'Shopping': {
                'keywords': [
                    'supermarket', 'naivas', 'quickmart', 'quick mart',
                    'carrefour', 'chandarana', 'foodplus', 'cleanshelf',
                    'eastmatt', 'tuskys', 'kabsmart', 'nakumatt', 'Store', 
                    'majid al futtah', 'market',
                ],
                'priority': 2,
            },
            
            'Fast Foods': {
                'keywords': [
                    'kfc', 'chicken inn', 'java house', 'artcaffe',
                    'pizza', 'burger king', 'dominos', 'debonairs',
                    'pizza hut', 'pizza inn', 'subway', 'steers', 'inn', 
                    'cafe', 'chips', 'snack', 'chips and snacks',
                ],
                'priority': 2,
            },
            
            'Food & Dining': {
                'keywords': [
                    'restaurant', 'hotel', 'cafe', 'eatery', 'food court',
                    'dining', 'meat', 'vegetables', 'fruits', 'milk', 'food',
                ],
                'priority': 2,
            },
            
            'Personal Care': {
                'keywords': [
                    'beauty', 'cosmetics', 'skincare', 'makeup', 'barber',
                    'salon', 'spa', 'kinyozi', 'grooming', 'hair', 'nails',
                ],
                'priority': 2,
            },
            
            'Transport': {
                'keywords': [
                    'uber', 'bolt', 'taxi', 'little cab', 'transport',
                    'fuel', 'petrol', 'diesel', 'shell', 'total', 'parking', 'rubis',
                ],
                'priority': 2,
            },
            
            'Entertainment': {
                'keywords': [
                    'liquor', 'bar', 'wine', 'beer', 'club', 'lounge',
                    'pub', 'cinema', 'bowling', 'arcade', 'entertainment',
                ],
                'priority': 2,
            },
            
            # LOW PRIORITY
            'Bank Transfer': {
                'keywords': [
                    'equity', 'kcb', 'family bank', 'co-op', 'ncba',
                    'stanbic', 'absa', 'bank transfer',
                ],
                'priority': 3,
            },
        }
    
    def categorize(self, description: str, transaction_type: str, extracted_fields: Dict = None) -> str:
        """Categorize transaction"""
        if pd.isna(description):
            return 'Uncategorized'
        
        desc_lower = str(description).lower()
        
        # Build search text
        search_text = desc_lower
        if extracted_fields:
            for key in ['merchant_name', 'recipient_name', 'sender_name', 'agent_name']:
                if key in extracted_fields:
                    search_text += ' ' + str(extracted_fields[key]).lower()
        
        # PRIORITY 1: Transaction type based (FIXED)
        
        # Income
        if transaction_type in ['Received Money', 'LOOP Payment', 'Income']:
            return 'Income'
        
        # Cash operations
        if transaction_type == 'Cash Deposit':
            return 'Cash Deposit'
        
        if transaction_type == 'Cash Withdrawal':
            return 'Cash Withdrawal'
        
        # Loans (including Fuliza/OverDraft from Stage 2)
        if transaction_type == 'Overdraft':
            return 'Loans'
        
        # Data vs Airtime (now separated in Stage 2)
        if transaction_type == 'Data Bundles':
            return 'Data Bundles'
        
        if transaction_type == 'Airtime':
            return 'Airtime'
        
        # Fees
        if transaction_type == 'M-Pesa Fee':
            return 'M-Pesa Fees'
        
        # M-Shwari
        if transaction_type == 'M-Shwari':
            if 'withdraw' in desc_lower:
                return 'Cash Withdrawal'
            else:
                return 'Savings'
        
        # Unit Trust
        if transaction_type == 'Unit Trust':
            return 'Savings'
        
        # Reversal
        if transaction_type == 'Reversal':
            return 'Reversal'
        
        # PRIORITY 2: Send Money - LEAVE UNCATEGORIZED for Stage 4
        if transaction_type == 'Send Money':
            return 'Uncategorized'
        
        # PRIORITY 3: Till/PayBill/Pochi - Try keywords, fallback to Merchant
        if transaction_type in ['Till Payment', 'PayBill', 'Pochi la Biashara']:
            matched = self._match_keywords(search_text)
            return matched if matched else 'Merchant'
        
        # PRIORITY 4: Other - Try keywords
        matched = self._match_keywords(search_text)
        return matched if matched else 'Other'
    
    def _match_keywords(self, search_text: str) -> str:
        """Match keywords"""
        sorted_categories = sorted(
            self.category_keywords.items(),
            key=lambda x: x[1].get('priority', 99)
        )
        
        for category, rules in sorted_categories:
            for keyword in rules.get('keywords', []):
                pattern = r'\b' + re.escape(keyword) + r'\b'
                if re.search(pattern, search_text, re.IGNORECASE):
                    return category
        
        return ''
    
    def process_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
        """Add category column"""
        print("  Categorizing...")
        
        # Parse extracted_fields
        if 'extracted_fields' in df.columns:
            try:
                import ast
                df['extracted_fields_dict'] = df['extracted_fields'].apply(
                    lambda x: ast.literal_eval(x) if isinstance(x, str) and x.strip() else {}
                )
            except:
                df['extracted_fields_dict'] = df['extracted_fields']
        else:
            df['extracted_fields_dict'] = [{}] * len(df)
        
        # Categorize
        df['category'] = df.apply(
            lambda row: self.categorize(
                row['description_clean'], 
                row['transaction_type'],
                row.get('extracted_fields_dict', {})
            ),
            axis=1
        )
        
        print(f"‚úì Categorized into {df['category'].nunique()} categories")
        
        return df


def run_stage3(input_csv: str, output_csv: str):
    """Run Stage 3"""
    
    
    # Load
    print(f" Loading: {input_csv}")
    df = pd.read_csv(input_csv, low_memory=False)
    print(f"‚úì Loaded {len(df)} transactions")
    print()
    
    # Categorize
    categorizer = KeywordCategorizer()
    df = categorizer.process_dataframe(df)
    
    print()
    print("=" * 80)
    print("CATEGORY BREAKDOWN")
    print("=" * 80)
    for category, count in df['category'].value_counts().items():
        pct = (count / len(df)) * 100
        print(f"{category:30s}: {count:5d} ({pct:5.1f}%)")
    
    print()
    
    # Verify key categories
    print("=" * 80)
    print("VERIFICATION - KEY CATEGORIES")
    print("=" * 80)
    
    key_cats = ['Loans', 'Data Bundles', 'Airtime', 'Income', 
                'Cash Deposit', 'Cash Withdrawal', 'Loan Repayment']
    
    for cat in key_cats:
        cat_df = df[df['category'] == cat]
        if len(cat_df) > 0:
            print(f"\n{cat} ({len(cat_df)} transactions):")
            for _, row in cat_df.head(2).iterrows():
                print(f"  {row['description_clean'][:70]}")
    
    print()
    
    # Save
    df.to_csv(output_csv, index=False)
    print(f"‚úÖ Saved: {output_csv}")
    print()
    
    return df


if __name__ == "__main__":
    INPUT = r"C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\stage2_with_types.csv"
    OUTPUT = r"C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\stage3_with_categories.csv"
    
    df = run_stage3(INPUT, OUTPUT)
    print("‚úÖ Ready for Stage 4")

 Loading: C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\stage2_with_types.csv
‚úì Loaded 2715 transactions

  Categorizing...
‚úì Categorized into 19 categories

CATEGORY BREAKDOWN
M-Pesa Fees                   :   779 ( 28.7%)
Uncategorized                 :   626 ( 23.1%)
Cash Deposit                  :   297 ( 10.9%)
Income                        :   245 (  9.0%)
Savings                       :   228 (  8.4%)
Merchant                      :   220 (  8.1%)
Bills                         :   104 (  3.8%)
Cash Withdrawal               :    62 (  2.3%)
Airtime                       :    58 (  2.1%)
Government Bills              :    32 (  1.2%)
Subscriptions                 :    26 (  1.0%)
Betting                       :    20 (  0.7%)
Shopping                      :     6 (  0.2%)
Personal Care                 :     4 (  0.1%)
Reversal                      :     3 (  0.1%)
Data Bundles                  :     2 (  0.1%)
Fast Foods                    :     1 (  0.0%)
Transport       

## STAGE 4: Send Money Smart Rules

Apply customizable rules to categorize Send Money transactions.

In [121]:
"""
STAGE 4: SEND MONEY CATEGORIZATION (CUSTOMIZABLE RULES)
Now with customizable thresholds that can be set from UI or manually

CUSTOMIZABLE RULES:
1. Send Money + Recurring (‚â•X times) + Amount > Y ‚Üí Category A
2. Send Money + Recurring (‚â•X times) + Amount ‚â§ Y ‚Üí Category B  
3. Send Money + Non-recurring (any amount) ‚Üí Category C

Where X, Y, A, B, C are user-defined
"""

class CustomizableSendMoneyCategorizer:
    """
    Categorize Send Money transactions with customizable rules
    """
    
    def __init__(self, rules_config: dict = None):
        """
        Initialize with custom rules configuration
        
        Args:
            rules_config: Dictionary with rule parameters
                {
                    'rule1': {
                        'min_occurrences': 2,
                        'amount_threshold': 500,
                        'category': 'Friends & Family'
                    },
                    'rule2': {
                        'min_occurrences': 2,
                        'amount_threshold': 500,
                        'category': 'Merchant'
                    },
                    'rule3': {
                        'category': 'Merchant'
                    }
                }
        """
        
        # Default rules
        self.default_rules = {
            'rule1': {
                'name': 'Recurring + High Amount',
                'min_occurrences': 2,
                'amount_threshold': 500.0,
                'amount_comparison': 'greater',  # 'greater' or 'less_equal'
                'category': 'Friends & Family',
                'description': 'Recurring recipient + Amount > threshold'
            },
            'rule2': {
                'name': 'Recurring + Low Amount',
                'min_occurrences': 2,
                'amount_threshold': 500.0,
                'amount_comparison': 'less_equal',
                'category': 'Merchant',
                'description': 'Recurring recipient + Amount ‚â§ threshold'
            },
            'rule3': {
                'name': 'One-time Transactions',
                'category': 'Merchant',
                'description': 'Non-recurring transactions'
            }
        }
        
        # Use provided config or default
        if rules_config:
            self.rules = {**self.default_rules, **rules_config}
        else:
            self.rules = self.default_rules
    
    def save_rules(self, filepath: str):
        """Save rules configuration to JSON file"""
        with open(filepath, 'w') as f:
            json.dump(self.rules, f, indent=2)
        print(f"‚úì Rules saved to: {filepath}")
    
    @classmethod
    def load_rules(cls, filepath: str):
        """Load rules configuration from JSON file"""
        with open(filepath, 'r') as f:
            rules = json.load(f)
        return cls(rules)
    
    def extract_recipient_id(self, extracted_fields_str: str) -> str:
        """Extract recipient identifier from extracted_fields"""
        if pd.isna(extracted_fields_str) or extracted_fields_str == '':
            return None
        
        try:
            import ast
            fields = ast.literal_eval(extracted_fields_str)
            
            # Use phone number as primary identifier
            if 'recipient_number' in fields:
                return fields['recipient_number']
            elif 'recipient_name' in fields:
                return fields['recipient_name']
        except:
            pass
        
        return None
    
    def detect_recurring_recipients(self, df: pd.DataFrame) -> dict:
        """Detect recurring recipients in Send Money transactions"""
        
        # Filter to uncategorized Send Money
        send_money_df = df[
            (df['transaction_type'] == 'Send Money') & 
            (df['category'] == 'Uncategorized')
        ].copy()
        
        if len(send_money_df) == 0:
            return {}
        
        # Extract recipient IDs
        send_money_df['recipient_id'] = send_money_df['extracted_fields_str'].apply(
            self.extract_recipient_id
        )
        
        # Count occurrences
        recipient_counts = send_money_df['recipient_id'].value_counts().to_dict()
        
        # Remove None
        recipient_counts = {k: v for k, v in recipient_counts.items() if k is not None}
        
        return recipient_counts
    
    def categorize_send_money(self, row: pd.Series, recipient_counts: dict) -> str:
        """Categorize a single Send Money transaction using custom rules"""
        
        # Extract recipient ID
        recipient_id = self.extract_recipient_id(row['extracted_fields_str'])
        
        # Get recipient count
        count = recipient_counts.get(recipient_id, 1) if recipient_id else 1
        
        # Get amount
        amount = 0
        if pd.notna(row.get('Withdrawn')):
            try:
                amount = abs(float(row['Withdrawn']))
            except:
                amount = 0
        elif pd.notna(row.get('withdrawn')):
            try:
                amount = abs(float(row['withdrawn']))
            except:
                amount = 0
        
        # Check Rule 1: Recurring + High Amount
        rule1 = self.rules.get('rule1', {})
        if count >= rule1.get('min_occurrences', 2):
            threshold = rule1.get('amount_threshold', 500.0)
            comparison = rule1.get('amount_comparison', 'greater')
            
            if comparison == 'greater' and amount > threshold:
                return rule1.get('category', 'Friends & Family')
            
            # Check Rule 2: Recurring + Low Amount
            rule2 = self.rules.get('rule2', {})
            if count >= rule2.get('min_occurrences', 2):
                threshold2 = rule2.get('amount_threshold', 500.0)
                comparison2 = rule2.get('amount_comparison', 'less_equal')
                
                if comparison2 == 'less_equal' and amount <= threshold2:
                    return rule2.get('category', 'Merchant')
        
        # Rule 3: Non-recurring (default fallback)
        rule3 = self.rules.get('rule3', {})
        return rule3.get('category', 'Merchant')
    
    def process_dataframe(self, df: pd.DataFrame) -> pd.DataFrame:
        """Process entire dataframe with custom rules"""
        
        print("ü§ñ Categorizing Send Money transactions with custom rules...")
        print()
        
        # Show current rules
        print("=" * 80)
        print("ACTIVE RULES")
        print("=" * 80)
        for rule_id, rule in self.rules.items():
            print(f"\n{rule_id.upper()}: {rule.get('name', 'Unnamed')}")
            if 'min_occurrences' in rule:
                print(f"  Min occurrences: {rule['min_occurrences']}")
            if 'amount_threshold' in rule:
                print(f"  Amount threshold: KES {rule['amount_threshold']:,.0f}")
            if 'amount_comparison' in rule:
                comp = '>' if rule['amount_comparison'] == 'greater' else '‚â§'
                print(f"  Amount condition: {comp} threshold")
            print(f"  ‚Üí Categorize as: {rule['category']}")
            print(f"  ({rule.get('description', '')})")
        
        print()
        
        # Detect recurring recipients
        recipient_counts = self.detect_recurring_recipients(df)
        
        if recipient_counts:
            print(f"üîç Found {len(recipient_counts)} unique recipients")
            recurring = {k: v for k, v in recipient_counts.items() if v >= self.rules['rule1'].get('min_occurrences', 2)}
            print(f"‚úì {len(recurring)} recurring recipients (‚â•{self.rules['rule1'].get('min_occurrences', 2)} transactions)")
            
            if recurring:
                print()
                print("Top 15 recurring recipients:")
                sorted_recipients = sorted(recurring.items(), key=lambda x: x[1], reverse=True)[:15]
                for recipient, count in sorted_recipients:
                    print(f"  {recipient[:45]:45s}: {count:3d} times")
        
        print()
        
        # Count before
        before_uncat = len(df[
            (df['transaction_type'] == 'Send Money') & 
            (df['category'] == 'Uncategorized')
        ])
        
        # Apply categorization
        mask = (df['transaction_type'] == 'Send Money') & (df['category'] == 'Uncategorized')
        
        df.loc[mask, 'category'] = df[mask].apply(
            lambda row: self.categorize_send_money(row, recipient_counts),
            axis=1
        )
        
        # Show results
        print("=" * 80)
        print("SEND MONEY CATEGORIZATION RESULTS")
        print("=" * 80)
        
        # Count by resulting category
        category_breakdown = df[
            (df['transaction_type'] == 'Send Money')
        ]['category'].value_counts()
        
        print(f"Processed: {before_uncat:,} Send Money transactions")
        print()
        print("Results:")
        for category, count in category_breakdown.items():
            pct = count / before_uncat * 100 if before_uncat > 0 else 0
            print(f"  {category:30s}: {count:5,} ({pct:5.1f}%)")
        
        print()
        
        return df


def run_stage4_customizable(input_csv: str, 
                            output_csv: str,
                            rules_config: dict = None,
                            rules_file: str = None):
    """
    Run Stage 4 with customizable rules
    
    Args:
        input_csv: Path to Stage 3 output CSV
        output_csv: Path to save final categorized CSV
        rules_config: Dictionary with custom rules (optional)
        rules_file: Path to JSON file with rules (optional)
    """
    
    print("=" * 80)
    print("STAGE 4: SEND MONEY CATEGORIZATION (CUSTOMIZABLE)")
    print("=" * 80)
    print()
    
    # Load rules
    if rules_file:
        print(f"üìÇ Loading rules from: {rules_file}")
        categorizer = CustomizableSendMoneyCategorizer.load_rules(rules_file)
    elif rules_config:
        print("‚úì Using provided rules configuration")
        categorizer = CustomizableSendMoneyCategorizer(rules_config)
    else:
        print("‚úì Using default rules")
        categorizer = CustomizableSendMoneyCategorizer()
    
    print()
    
    # Load data
    print(f"üìÇ Loading: {input_csv}")
    df = pd.read_csv(input_csv, low_memory=False)
    print(f"‚úì Loaded {len(df):,} transactions")
    print()
    
    # Process
    df = categorizer.process_dataframe(df)
    
    # Final summary
    print("=" * 80)
    print("FINAL CATEGORY BREAKDOWN")
    print("=" * 80)
    
    category_counts = df['category'].value_counts().sort_values(ascending=False)
    for category, count in category_counts.items():
        pct = (count / len(df)) * 100
        print(f"{category:30s}: {count:6,} ({pct:5.1f}%)")
    
    print()
    
    # Save
    df.to_csv(output_csv, index=False)
    print(f"üíæ Saved: {output_csv}")
    print()
    
    # Optionally save rules
    rules_output = output_csv.replace('.csv', '_rules.json')
    categorizer.save_rules(rules_output)
    print()
    
    print("=" * 80)
    print("STAGE 4 COMPLETE! üéâ")
    print("=" * 80)
    print()
    
    return df


if __name__ == "__main__":
    # EXAMPLE 1: Use default rules
    INPUT_CSV = r"C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\stage3_with_categories.csv"
    OUTPUT_CSV = r"C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\stage4_final_categorized.csv"
    
    df = run_stage4_customizable(INPUT_CSV, OUTPUT_CSV)
    
    # EXAMPLE 2: Use custom rules (e.g., from UI)
    """
    CUSTOM_RULES = {
        'rule1': {
            'name': 'Recurring + High Amount',
            'min_occurrences': 3,  # More strict: need 3+ occurrences
            'amount_threshold': 1000.0,  # Higher threshold: 1000 KES
            'amount_comparison': 'greater',
            'category': 'Family'  # Different category name
        },
        'rule2': {
            'name': 'Recurring + Low Amount',
            'min_occurrences': 3,
            'amount_threshold': 1000.0,
            'amount_comparison': 'less_equal',
            'category': 'Food Vendor'  # More specific
        },
        'rule3': {
            'name': 'One-time Transactions',
            'category': 'Uncategorized'  # Flag for manual review
        }
    }
    
    df = run_stage4_customizable(INPUT_CSV, OUTPUT_CSV, rules_config=CUSTOM_RULES)
    """
    
    # EXAMPLE 3: Load rules from UI-generated JSON
    """
    df = run_stage4_customizable(
        INPUT_CSV, 
        OUTPUT_CSV, 
        rules_file='ui_generated_rules.json'
    )
    """

STAGE 4: SEND MONEY CATEGORIZATION (CUSTOMIZABLE)

‚úì Using default rules

üìÇ Loading: C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\stage3_with_categories.csv
‚úì Loaded 2,715 transactions

ü§ñ Categorizing Send Money transactions with custom rules...

ACTIVE RULES

RULE1: Recurring + High Amount
  Min occurrences: 2
  Amount threshold: KES 500
  Amount condition: > threshold
  ‚Üí Categorize as: Friends & Family
  (Recurring recipient + Amount > threshold)

RULE2: Recurring + Low Amount
  Min occurrences: 2
  Amount threshold: KES 500
  Amount condition: ‚â§ threshold
  ‚Üí Categorize as: Merchant
  (Recurring recipient + Amount ‚â§ threshold)

RULE3: One-time Transactions
  ‚Üí Categorize as: Merchant
  (Non-recurring transactions)

üîç Found 198 unique recipients
‚úì 72 recurring recipients (‚â•2 transactions)

Top 15 recurring recipients:
  2547******463                                : 196 times
  2547******402                                :  20 times
  07******162    

## STAGE 5: CSV Cleanup

Remove empty columns, standardize names.

In [122]:
"""
STAGE 5: CSV CLEANUP
Removes empty columns, duplicate columns, and standardizes column names
"""

def clean_csv(input_csv: str, output_csv: str):
    """
    Clean the categorized CSV by removing empty/duplicate columns
    and standardizing column names
    
    Args:
        input_csv: Path to Stage 4 output CSV
        output_csv: Path to save cleaned CSV
    """
    
    print("=" * 80)
    print("STAGE 5: CSV CLEANUP")
    print("=" * 80)
    print()
    
    # Load data
    print(f" Loading: {input_csv}")
    df = pd.read_csv(input_csv, low_memory=False)
    print(f"‚úì Loaded {len(df):,} rows, {len(df.columns)} columns")
    print()
    
    # Show original columns
    print("Original columns:")
    for i, col in enumerate(df.columns, 1):
        null_count = df[col].isna().sum()
        null_pct = (null_count / len(df)) * 100
        print(f"  {i:2d}. {col:40s} - {null_count:5,} nulls ({null_pct:5.1f}%)")
    print()
    
    # Remove columns with 100% null values
    print("üßπ Removing empty columns...")
    before_cols = len(df.columns)
    
    empty_cols = [col for col in df.columns if df[col].isna().all()]
    if empty_cols:
        print(f"  Removing {len(empty_cols)} completely empty columns:")
        for col in empty_cols:
            print(f"    ‚Ä¢ {col}")
        df = df.drop(columns=empty_cols)
    else:
        print("  No completely empty columns found")
    
    print()
    
    # Remove duplicate columns
    print(" Checking for duplicate columns...")
    
    # Check for columns with same content
    duplicate_cols = []
    checked = set()
    
    for col1 in df.columns:
        if col1 in checked:
            continue
        for col2 in df.columns:
            if col1 != col2 and col2 not in checked:
                if df[col1].equals(df[col2]):
                    duplicate_cols.append(col2)
                    checked.add(col2)
    
    if duplicate_cols:
        print(f"  Removing {len(duplicate_cols)} duplicate columns:")
        for col in duplicate_cols:
            print(f"    ‚Ä¢ {col}")
        df = df.drop(columns=duplicate_cols)
    else:
        print("  No duplicate columns found")
    
    print()
    
    # Standardize essential column names
    print(" Standardizing column names...")
    
    rename_map = {
        'Receipt No.': 'receipt_no',
        'Completion Time': 'completion_time',
        'Details': 'details_original',
        'Transaction Status': 'status',
        'Paid In': 'paid_in',
        'Withdrawn': 'withdrawn',
        'Balance': 'balance',
        'description_clean': 'description',
        'transaction_type': 'type',
        'extracted_fields_str': 'extracted_fields',
        'category': 'category'
    }
    
    # Only rename columns that exist
    actual_rename = {old: new for old, new in rename_map.items() if old in df.columns}
    df = df.rename(columns=actual_rename)
    
    for old, new in actual_rename.items():
        print(f"  {old} ‚Üí {new}")
    
    print()
    
    # Keep only essential columns in specific order
    essential_columns = [
        'receipt_no',
        'completion_time',
        'description',
        'status',
        'paid_in',
        'withdrawn',
        'balance',
        'type',
        'extracted_fields',
        'category'
    ]
    
    # Add details_original if it exists
    if 'details_original' in df.columns:
        essential_columns.insert(2, 'details_original')
    
    # Filter to existing columns
    available_columns = [col for col in essential_columns if col in df.columns]
    df_clean = df[available_columns].copy()
    
    print("‚úÖ Final columns:")
    for i, col in enumerate(df_clean.columns, 1):
        null_count = df_clean[col].isna().sum()
        null_pct = (null_count / len(df_clean)) * 100
        print(f"  {i:2d}. {col:25s} - {null_count:5,} nulls ({null_pct:4.1f}%)")
    
    print()
    
    # Summary
    print("=" * 80)
    print("CLEANUP SUMMARY")
    print("=" * 80)
    print(f"Original: {before_cols} columns")
    print(f"Removed:  {before_cols - len(df_clean.columns)} columns")
    print(f"Final:    {len(df_clean.columns)} columns")
    print(f"Rows:     {len(df_clean):,} (unchanged)")
    print()
    
    # Save
    df_clean.to_csv(output_csv, index=False)
    print(f" Saved: {output_csv}")
    print()
    
    print("=" * 80)
    print("STAGE 5 COMPLETE! ‚ú®")
    print("=" * 80)
    print()
    
    return df_clean


if __name__ == "__main__":
    INPUT_CSV = r"C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\stage4_final_categorized.csv"
    OUTPUT_CSV = r"C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\final_categorized_clean.csv"
    
    df = clean_csv(INPUT_CSV, OUTPUT_CSV)
    
    print("Ready for manual merchant labeling!")
    print("Next: Open CSV and add 'merchant_subcategory' column")

STAGE 5: CSV CLEANUP

 Loading: C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\stage4_final_categorized.csv
‚úì Loaded 2,715 rows, 20 columns

Original columns:
   1. Unnamed: 0                               - 2,715 nulls (100.0%)
   2. Unnamed: 1                               - 2,715 nulls (100.0%)
   3. TRANSACTION TYPE                         - 2,715 nulls (100.0%)
   4. PAID IN                                  - 2,715 nulls (100.0%)
   5. PAID OUT                                 - 2,715 nulls (100.0%)
   6. Receipt No.                              -     0 nulls (  0.0%)
   7. Completion Time                          -     0 nulls (  0.0%)
   8. Details                                  -     1 nulls (  0.0%)
   9. Transaction Status                       -     0 nulls (  0.0%)
  10. Paid In                                  - 2,129 nulls ( 78.4%)
  11. Withdrawn                                -   586 nulls ( 21.6%)
  12. Balance                                  -     0 nulls (  0.

---
# üéì MERCHANT LEARNING SYSTEM
## Stage 6: Hybrid Personal Learning
---


## STAGE 6A: [OPTIONAL] Import Manual Labels

**Run this ONLY if you have manually labeled data.**

This pre-populates your merchant database, so you skip labeling those merchants.

In [123]:
"""
OPTIONAL: IMPORT MANUAL LABELS
Skip this cell if you don't have manual labels
"""

# Only run if MANUAL_LABELS_CSV is set and file exists
if MANUAL_LABELS_CSV and os.path.exists(MANUAL_LABELS_CSV):
    print(" Importing manual labels...")
    print()
    
    # Import function (embedded)
    class ManualLabelImporter:
        def __init__(self, user_id, database_dir):
            self.user_id = user_id
            self.db_path = os.path.join(database_dir, f"user_{user_id}_merchants.json")
            self.merchant_db = {}
            if os.path.exists(self.db_path):
                with open(self.db_path, 'r') as f:
                    self.merchant_db = json.load(f)
        
        def _extract_merchant_id(self, row):
            # Try extracted_fields
            extracted = row.get('extracted_fields_str', '') or row.get('extracted_fields', '')
            if pd.notna(extracted) and str(extracted) not in ['', '{}', 'nan']:
                try:
                    import ast
                    fields = ast.literal_eval(str(extracted))
                    if fields.get('till_number'):
                        return f"TILL_{fields['till_number']}"
                    elif fields.get('paybill_number'):
                        return f"PAYBILL_{fields['paybill_number']}"
                    elif fields.get('recipient_number'):
                        return f"PHONE_{fields['recipient_number']}"
                    elif fields.get('merchant_name'):
                        return f"NAME_{fields['merchant_name'][:50].upper()}"
                except:
                    pass
            return None
        
        def import_csv(self, csv_path, category_col):
            df = pd.read_csv(csv_path, low_memory=False)
            
            if 'category' in df.columns:
                df = df[df['category'] == 'Merchant']
            
            labeled = df[df[category_col].notna()].copy()
            labeled['merchant_id'] = labeled.apply(self._extract_merchant_id, axis=1)
            labeled = labeled[labeled['merchant_id'].notna()]
            
            imported = 0
            for merchant_id, group in labeled.groupby('merchant_id'):
                category = group[category_col].value_counts().index[0]
                sample = str(group.iloc[0].get('description', ''))[:70]
                
                self.merchant_db[merchant_id] = {
                    'category': category,
                    'sample_description': sample,
                    'first_seen': datetime.now().isoformat(),
                    'transaction_count': len(group),
                    'imported_from': csv_path
                }
                imported += 1
            
            # Save
            with open(self.db_path, 'w') as f:
                json.dump(self.merchant_db, f, indent=2)
            
            print(f"‚úÖ Imported {imported} merchants from manual labels")
            print(f"‚úÖ Database saved: {self.db_path}")
            return imported
    
    # Run import
    importer = ManualLabelImporter(USER_ID, DATABASE_DIR)
    count = importer.import_csv(MANUAL_LABELS_CSV, MANUAL_LABELS_COLUMN)
    
    print()
    print(f"üéâ {count} merchants pre-loaded!")
    print(f"   These will be auto-categorized in Stage 6B")
    print()
else:
    print("‚Ñπ  No manual labels to import - will learn interactively in Stage 6B")
    print()

 Importing manual labels...

‚úÖ Imported 285 merchants from manual labels
‚úÖ Database saved: C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\processed\merchant_databases\user_john_merchants.json

üéâ 285 merchants pre-loaded!
   These will be auto-categorized in Stage 6B



## STAGE 6B: Hybrid Merchant Learning (Interactive)

**This is where the UI integration happens!**

### How It Works:
1. Checks database first (from manual import or previous months)
2. Auto-labels known merchants
3. Shows interactive prompts for unknown merchants
4. Saves learnings to personal database

### In Production UI:
- Replace `input()` with React/HTML form
- Show progress bar
- Display merchant info visually
- Save to backend API

In [124]:
"""
STAGE 6B: HYBRID MERCHANT LEARNING
"""

print("=" * 80)
print("STAGE 6B: MERCHANT LEARNING")
print("=" * 80)
print()

# Load Stage 5 output
STAGE5 = r"C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\final_categorized_clean.csv"
print(f" Loading: {STAGE5}")
df = pd.read_csv(STAGE5, low_memory=False)
print(f"‚úì {len(df):,} transactions")
print()

# Simple merchant learner
class MerchantLearner:
    def __init__(self, user_id, db_dir):
        self.db_path = os.path.join(db_dir, f"user_{user_id}_merchants.json")
        if os.path.exists(self.db_path):
            with open(self.db_path, 'r') as f:
                self.db = json.load(f)
        else:
            self.db = {}
        print(f"‚úì Database loaded: {len(self.db)} known merchants")
    
    def extract_id(self, row):
        # Try to get merchant ID from extracted_fields
        extracted = str(row.get('extracted_fields_str', '') or row.get('extracted_fields', ''))
        if extracted and extracted not in ['', '{}', 'nan']:
            try:
                import ast
                fields = ast.literal_eval(extracted)
                if fields.get('till_number'):
                    return f"TILL_{fields['till_number']}"
                elif fields.get('recipient_number'):
                    return f"PHONE_{fields['recipient_number']}"
                elif fields.get('merchant_name'):
                    return f"NAME_{fields['merchant_name'][:30]}"
            except:
                pass
        return None
    
    def categorize(self, df, interactive=True):
        df = df.copy()
        df['merchant_subcategory'] = None
        df['merchant_id'] = None
        
        # Extract IDs
        merchant_mask = df['category'] == 'Merchant'
        for idx in df[merchant_mask].index:
            df.at[idx, 'merchant_id'] = self.extract_id(df.loc[idx])
        
        # Auto-label from database
        from_db = 0
        for idx in df[merchant_mask].index:
            mid = df.at[idx, 'merchant_id']
            if mid and mid in self.db:
                df.at[idx, 'merchant_subcategory'] = self.db[mid]['category']
                from_db += 1
        
        need_input = merchant_mask.sum() - from_db
        
        print(f"‚úÖ Auto-labeled: {from_db:,}")
        print(f"‚ùì Need input: {need_input:,}")
        print()
        
        # Interactive learning
        if interactive and need_input > 0:
            unknown = df[(merchant_mask) & (df['merchant_subcategory'].isna())]
            unique = unknown.groupby('merchant_id').first().reset_index()
            
            print("üéì Learning new merchants...")
            print()
            
            learned = 0
            for _, row in unique.iterrows():
                mid = row['merchant_id']
                if not mid or mid == 'None':
                    continue
                
                desc = str(row.get('description', ''))[:65]
                count = len(df[df['merchant_id'] == mid])
                
                print("-" * 80)
                print(f"Merchant: {desc}")
                print(f"Appears: {count} times")
                print()
                print("Category?")
                print("1=Transport 2=Groceries 3=Shopping 4=Food&Dining")
                print("5=Construction 6=Family 7=Investment 8=Other S=Skip")
                
                choice = input("‚Üí ").strip().upper()
                
                cats = {'1':'Transport','2':'Groceries','3':'Shopping',
                       '4':'Food & Dining','5':'Construction','6':'Friends & Family', '7': 'Investment', 
                        '8':'Personal Care', '9':'Business','10': 'Contribution', '11': 'Health Care','12': 'Other', '13':'Skip'}
                
                if choice == 'S':
                    continue
                elif choice in cats:
                    cat = cats[choice]
                elif choice == '7':
                    cat = input("Enter category: ").strip()
                else:
                    continue
                
                df.loc[df['merchant_id']==mid, 'merchant_subcategory'] = cat
                self.db[mid] = {'category': cat, 'learned': datetime.now().isoformat()}
                learned += 1
                print(f"‚úì {cat}")
                print()
            
            # Save database
            with open(self.db_path, 'w') as f:
                json.dump(self.db, f, indent=2)
            
            print("=" * 80)
            print(f"üéâ Learned {learned} merchants!")
            print(f"‚úì Total known: {len(self.db)}")
            print("=" * 80)
        
        return df

# Run learning
learner = MerchantLearner(USER_ID, DATABASE_DIR)
df = learner.categorize(df, interactive=True)

# Save
STAGE6 = r"C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\final_categorized_with_merchants.csv"
df.to_csv(STAGE6, index=False)
print(f" Saved: {STAGE6}")
print()
print("‚úÖ STAGE 6 COMPLETE")
print()

STAGE 6B: MERCHANT LEARNING

 Loading: C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\final_categorized_clean.csv
‚úì 2,715 transactions

‚úì Database loaded: 333 known merchants
‚úÖ Auto-labeled: 795
‚ùì Need input: 0

 Saved: C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\final_categorized_with_merchants.csv

‚úÖ STAGE 6 COMPLETE



---
# ‚ú® FINAL PREPARATION
## Stage 7: Unified Categories + Analysis Features
---

## STAGE 7: Create Unified Category System

**What This Does:**
1. Merges keyword-caught merchants (Shopping, Transport, etc.) into Merchant category
2. Creates `final_category` - ONE category for all analysis
3. Creates category hierarchy (high-level ‚Üî detailed)
4. Adds temporal, financial, and behavioral features
5. Standardizes all labels (Transport not transport)

**Output:** 100% analysis-ready CSV

In [125]:
"""
STAGE 7: UNIFIED CATEGORY SYSTEM + FEATURES
"""

print("=" * 80)
print("STAGE 7: CREATING ANALYSIS-READY DATA")
print("=" * 80)
print()

# Load Stage 6
df = pd.read_csv(STAGE6, low_memory=False)

# Remove empty duplicate if exists
if 'Merchant_Subcategory' in df.columns:
    df = df.drop(columns=['Merchant_Subcategory'])

# Standardize merchant subcategory labels
label_map = {
    'transport': 'Transport', 'Transport': 'Transport',
    'groceries': 'Groceries', 'Groceries': 'Groceries',
    'shopping': 'Shopping', 'Shopping': 'Shopping',
    'construction': 'Construction',
    'business': 'Business',
    'contribution': 'Contribution',
    'food & dining': 'Food & Dining',
    'Friends & Family': 'Friends & Family',
    'Personal Care': 'Personal Care',
}

if 'merchant_subcategory' in df.columns:
    df['merchant_subcategory'] = df['merchant_subcategory'].map(
        lambda x: label_map.get(x, x) if pd.notna(x) else x
    )

# Merge keyword-caught merchants
keyword_merchants = ['Shopping', 'Groceries', 'Transport', 'Food & Dining', 
                    'Personal Care', 'Health Care', 'Entertainment']

merge_mask = (
    df['category'].isin(keyword_merchants) &
    df['type'].isin(['Till Payment', 'PayBill', 'Pochi la Biashara'])
)

df.loc[merge_mask & df['merchant_subcategory'].isna(), 'merchant_subcategory'] = df.loc[merge_mask & df['merchant_subcategory'].isna(), 'category']
df.loc[merge_mask, 'category'] = 'Merchant'

print(f"‚úì Merged {merge_mask.sum()} keyword merchants")

# Create final_category
df['final_category'] = df.apply(
    lambda r: r['merchant_subcategory'] if r['category']=='Merchant' and pd.notna(r['merchant_subcategory']) else r['category'],
    axis=1
)

print("‚úì Created final_category")

# Create hierarchy
def get_level1(cat):
    if pd.isna(cat): return 'Other'
    if cat in ['Income']: return 'Income'
    if cat in ['Transport','Groceries','Shopping','Food & Dining','Construction','Personal Care','Entertainment','Clothing','Business','Labor']:
        return 'Spending'
    if cat in ['Bills','Government Bills','Subscriptions','Health Care','Education']:
        return 'Bills & Obligations'
    if cat in ['Friends & Family','Contribution','Betting']:
        return 'Social & Leisure'
    if cat in ['M-Pesa Fees','Loan Repayment','Loans','Fuliza']:
        return 'Finance & Fees'
    if cat in ['Cash Deposit','Cash Withdrawal','Savings','Bank Transfer']:
        return 'Transfers & Savings'
    return 'Other'

df['category_level1'] = df['final_category'].apply(get_level1)
df['category_level2'] = df['final_category']

print("‚úì Created category hierarchy")

# 1. Parse datetime
date_col = 'completion_time' if 'completion_time' in df.columns else 'Completion Time'
df['datetime'] = pd.to_datetime(df[date_col], errors='coerce')

# 2. Create temporal features
print("üìÖ Creating temporal features...")
df['date'] = df['datetime'].dt.date
df['year'] = df['datetime'].dt.year
df['month'] = df['datetime'].dt.month
df['month_name'] = df['datetime'].dt.strftime('%B')
df['day'] = df['datetime'].dt.day
df['weekday'] = df['datetime'].dt.day_name()
df['weekday_num'] = df['datetime'].dt.dayofweek
df['hour'] = df['datetime'].dt.hour
df['is_weekend'] = df['weekday_num'].isin([5,6]).astype(int)

def time_of_day(hour):
    if pd.isna(hour): return 'Unknown'
    if 5 <= hour < 12: return 'Morning'
    elif 12 <= hour < 17: return 'Afternoon'
    elif 17 <= hour < 21: return 'Evening'
    else: return 'Night'

df['time_of_day'] = df['hour'].apply(time_of_day)


# Add features
# Clean amounts
if 'withdrawn' in df.columns:
    df['amount_spent'] = pd.to_numeric(df['withdrawn'].astype(str).str.replace(',',''), errors='coerce').abs().fillna(0)
elif 'Withdrawn' in df.columns:
    df['amount_spent'] = pd.to_numeric(df['Withdrawn'].astype(str).str.replace(',',''), errors='coerce').abs().fillna(0)
else:
    df['amount_spent'] = 0

if 'paid_in' in df.columns:
    df['amount_received'] = pd.to_numeric(df['paid_in'].astype(str).str.replace(',',''), errors='coerce').abs().fillna(0)
elif 'Paid In' in df.columns:
    df['amount_received'] = pd.to_numeric(df['Paid In'].astype(str).str.replace(',',''), errors='coerce').abs().fillna(0)
else:
    df['amount_received'] = 0

if 'balance' in df.columns:
    df['balance'] = pd.to_numeric(df['balance'].astype(str).str.replace(',',''), errors='coerce')
elif 'Balance' in df.columns:
    df['balance'] = pd.to_numeric(df['Balance'].astype(str).str.replace(',',''), errors='coerce')

# Sort by time
df = df.sort_values('datetime').reset_index(drop=True)

# Net flow
df['net_flow'] = df['amount_received'] - df['amount_spent']

# Running totals
df['cumulative_spent'] = df['amount_spent'].cumsum()
df['cumulative_received'] = df['amount_received'].cumsum()

# Balance change
if 'balance' in df.columns:
    df['balance_change'] = df['balance'].diff()

# 4. Categorical features
print("üè∑Ô∏è  Creating categorical features...")

print("‚úì Added analysis features")
print()

# Save
FINAL = r"C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\final_analysis_ready.csv"
df.to_csv(FINAL, index=False)

print("=" * 80)
print("‚úÖ ANALYSIS-READY DATA CREATED!")
print("=" * 80)
print()
print(f" Saved: {FINAL}")
print()
print(f"Rows: {len(df):,}")
print(f"Columns: {len(df.columns)}")
print(f"Date range: {df['datetime'].min()} to {df['datetime'].max()}")
print()

print("High-Level Categories:")
for cat, cnt in df['category_level1'].value_counts().items():
    pct = cnt/len(df)*100
    print(f"  {cat:30s}: {cnt:6,} ({pct:5.1f}%)")

print()
print("Top 15 Detailed Categories:")
for cat, cnt in df['final_category'].value_counts().head(15).items():
    pct = cnt/len(df)*100
    print(f"  {cat:30s}: {cnt:6,} ({pct:5.1f}%)")

print()
print("‚úÖ STAGE 7 COMPLETE")

STAGE 7: CREATING ANALYSIS-READY DATA

‚úì Merged 12 keyword merchants
‚úì Created final_category
‚úì Created category hierarchy
üìÖ Creating temporal features...
üè∑Ô∏è  Creating categorical features...
‚úì Added analysis features

‚úÖ ANALYSIS-READY DATA CREATED!

 Saved: C:\Users\setla\Documents\Flatiron\PHASE5\Capstone\final_analysis_ready.csv

Rows: 2,715
Columns: 33
Date range: 2024-02-17 18:31:52 to 2026-02-10 10:30:15

High-Level Categories:
  Finance & Fees                :    779 ( 28.7%)
  Spending                      :    685 ( 25.2%)
  Transfers & Savings           :    587 ( 21.6%)
  Income                        :    245 (  9.0%)
  Bills & Obligations           :    163 (  6.0%)
  Social & Leisure              :    157 (  5.8%)
  Other                         :     99 (  3.6%)

Top 15 Detailed Categories:
  M-Pesa Fees                   :    779 ( 28.7%)
  Cash Deposit                  :    297 ( 10.9%)
  Income                        :    245 (  9.0%)
  Savings      

## ESSENTIAL AND DISCRETIONARY CLASSIFICATION



In [127]:
"""

ESSENTIAL (True needs - can't avoid for basic living):
‚úì Transport - Getting to work/school
‚úì Groceries - Food for home
‚úì Bills - Utilities (KPLC, Water)
‚úì Government Bills - NHIF, NSSF, taxes
‚úì Health Care - Medical expenses
‚úì Education - School fees
‚úì Airtime - Communication (basic need in modern life)

DISCRETIONARY (Wants - can reduce/avoid):
‚úì Betting - Gambling
‚úì Fast Foods - Eating out (non-essential)
‚úì Food & Dining - Restaurants (non-essential)
‚úì Entertainment - Leisure activities
‚úì Shopping - Non-grocery shopping
‚úì Personal Care - Salons, spas (beyond basic)
‚úì Contribution - Donations, offerings (generous but optional)

NEITHER (Not spending or social obligations):
‚Ä¢ Income
‚Ä¢ Cash Deposit/Withdrawal  
‚Ä¢ Savings
‚Ä¢ M-Pesa Fees
‚Ä¢ Friends & Family - Social obligations (different from discretionary)
‚Ä¢ Construction - Business/investment
‚Ä¢ Business - Business expenses
‚Ä¢ All other transfers
"""
def fix_essential_discretionary_classification(input_csv: str, output_csv: str):
    """
    Fix essential vs discretionary classification with correct logic
    """
    
    
    # ESSENTIAL - True basic needs
    essential_categories = [
        'Transport',        # Getting to work/school
        'Groceries',        # Food for home
        'Bills',           # Utilities
        'Government Bills', # Taxes, NHIF, NSSF
        'Health Care',     # Medical
        'Education',       # School fees
        'Airtime',         # Basic communication
    ]
    
    # DISCRETIONARY - Wants, can reduce/avoid
    discretionary_categories = [
        'Betting',         # Gambling
        'Fast Foods',      # Eating out
        'Food & Dining',   # Restaurants (NOT groceries)
        'Entertainment',   # Leisure
        'Shopping',        # Non-grocery shopping
        'Personal Care',   # Salons, spas
        'Contribution',    # Donations, offerings
        'Subscriptions',   # Entertainment subscriptions (Netflix, GOTV, etc.)
    ]
    
    # Apply classification
    df['is_essential'] = df['final_category'].isin(essential_categories).astype(int)
    df['is_discretionary'] = df['final_category'].isin(discretionary_categories).astype(int)
    
    print("‚úì Classification applied")
    print()
    
    # =========================================================================
    # VALIDATION & SUMMARY
    # =========================================================================
    
    print("=" * 80)
    print("NEW CLASSIFICATION SUMMARY")
    print("=" * 80)
    print()
    
    essential = df[df['is_essential'] == 1]
    discretionary = df[df['is_discretionary'] == 1]
    neither = df[(df['is_essential'] == 0) & (df['is_discretionary'] == 0)]
    
    print(f"ESSENTIAL:      {len(essential):6,} ({len(essential)/len(df)*100:5.1f}%)")
    print(f"DISCRETIONARY:  {len(discretionary):6,} ({len(discretionary)/len(df)*100:5.1f}%)")
    print(f"NEITHER:        {len(neither):6,} ({len(neither)/len(df)*100:5.1f}%)")
    print()
    
    # Essential breakdown
    print("ESSENTIAL Categories:")
    for cat in essential_categories:
        count = len(df[(df['final_category'] == cat) & (df['is_essential'] == 1)])
        if count > 0:
            amount = df[(df['final_category'] == cat) & (df['is_essential'] == 1)]['amount_spent'].sum()
            print(f"  {cat:25s}: {count:5,} txns | KES {amount:12,.2f}")
    
    print()
    
    # Discretionary breakdown
    print("DISCRETIONARY Categories:")
    for cat in discretionary_categories:
        count = len(df[(df['final_category'] == cat) & (df['is_discretionary'] == 1)])
        if count > 0:
            amount = df[(df['final_category'] == cat) & (df['is_discretionary'] == 1)]['amount_spent'].sum()
            print(f"  {cat:25s}: {count:5,} txns | KES {amount:12,.2f}")
    
    print()
    
    # Financial summary
    essential_spending = essential['amount_spent'].sum()
    discretionary_spending = discretionary['amount_spent'].sum()
    total_spending = df['amount_spent'].sum()
    
    print("=" * 80)
    print("SPENDING ANALYSIS")
    print("=" * 80)
    print()
    print(f"Essential Spending:     KES {essential_spending:14,.2f} ({essential_spending/total_spending*100:5.1f}%)")
    print(f"Discretionary Spending: KES {discretionary_spending:14,.2f} ({discretionary_spending/total_spending*100:5.1f}%)")
    print(f"Other (Neither):        KES {total_spending - essential_spending - discretionary_spending:14,.2f}")
    print(f"Total Spending:         KES {total_spending:14,.2f}")
    print()
    
    # Savings potential
    if discretionary_spending > 0:
        print("üí° INSIGHTS:")
        print(f"   ‚Ä¢ If you reduce discretionary spending by 25%: Save KES {discretionary_spending * 0.25:,.2f}/month")
        print(f"   ‚Ä¢ If you reduce discretionary spending by 50%: Save KES {discretionary_spending * 0.50:,.2f}/month")
        print()
    
    # Save
    df.to_csv(output_csv, index=False)
    print(f"üíæ Saved: {output_csv}")
    print()
    
    return df


if __name__ == "__main__":
    INPUT = "final_analysis_ready.csv"
    OUTPUT = "final_analysis.csv"
    
    df = fix_essential_discretionary_classification(INPUT, OUTPUT)
    
    print("‚úÖ Ready for analysis with correct classification!")

‚úì Classification applied

NEW CLASSIFICATION SUMMARY

ESSENTIAL:         475 ( 17.5%)
DISCRETIONARY:     398 ( 14.7%)
NEITHER:         1,842 ( 67.8%)

ESSENTIAL Categories:
  Transport                :   190 txns | KES   111,570.00
  Groceries                :    90 txns | KES   103,264.00
  Bills                    :   104 txns | KES   115,656.00
  Government Bills         :    32 txns | KES    32,536.00
  Health Care              :     1 txns | KES        90.00
  Airtime                  :    58 txns | KES    44,400.00

DISCRETIONARY Categories:
  Betting                  :    20 txns | KES     1,005.00
  Fast Foods               :     1 txns | KES       310.00
  Food & Dining            :   221 txns | KES    31,575.00
  Shopping                 :    56 txns | KES    45,444.00
  Personal Care            :    28 txns | KES     2,240.00
  Contribution             :    46 txns | KES    80,117.00
  Subscriptions            :    26 txns | KES    45,231.00

SPENDING ANALYSIS

Essential S

---
# ‚úÖ WORKFLOW COMPLETE!
---

## üéâ Success!

Your M-Pesa data is now **100% analysis-ready**!

### üìÅ Output Files:

1. **`FINAL_ANALYSIS_READY.csv`** ‚Üê Use this for analytics!
2. **`user_{id}_merchants.json`** ‚Üê Personal database (persistent)
3. Intermediate files (stage1-6) for debugging

### ‚ú® What's in the Final CSV:

- ‚úÖ **Unified categories** (no duplicates)
- ‚úÖ **Category hierarchy** (high-level ‚Üî detailed)
- ‚úÖ **Temporal features** (weekday, hour, payday indicators)
- ‚úÖ **Financial features** (amounts, balance, trends)
- ‚úÖ **Behavioral features** (recurring, essential/discretionary)
- ‚úÖ **Standardized labels** (Transport not transport)

### üöÄ Next Steps:

1. **Exploratory Data Analysis (EDA)**
   - Spending patterns by time
   - Category breakdowns
   - Trend analysis

2. **Recommendations Engine**
   - Budget optimization
   - Savings opportunities
   - Spending predictions

3. **Interactive Dashboard**
   - Real-time visualizations
   - Drill-down capabilities
   - Comparative analysis

### üí° For Future Statements:

Run this notebook again with a new PDF:
- Stages 1-5: Automated (no input)
- Stage 6: Fewer merchants to label (database grows!)
- After 3-6 months: 95%+ auto-categorized üéØ

---