# Money Mop

## Simulating Data

To analyze spend data before real data is available, we start by simulating it.

This notebook sets up the basic structure for generating that data.

### Imports/Set Up

In [1]:
import pandas as pd
import numpy as np
from faker import Faker
import random
from datetime import datetime, timedelta
import seaborn as sns
import matplotlib.pyplot as plt
import os

In [2]:
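fake = Faker()

# Seed every random source used below (Faker, random, NumPy) so runs are reproducible
Faker.seed(88)
random.seed(88)
np.random.seed(88)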

### Create Fake Transactions 

In [3]:
num_records = 50000
companies = ['Spendora', 'Expensivus', 'ClarityLedger', 'TrueSpend', 'Fintrix', 'Procuro', 'LedgrIQ', 'Zentro']
departments = ['Engineering', 'Marketing', 'Sales', 'Finance', 'HR', 'Operations']
categories = ['Travel', 'Meals', 'Supplies', 'Entertainment', 'Misc']
merchants = {
    'Travel': ['Delta', 'Uber', 'Lyft', 'Marriott', 'Hilton'],
    'Meals': ['Starbucks', 'Chipotle', 'Panera', 'Olive Garden'],
    'Supplies': ['Staples', 'Office Depot', 'Amazon'],
    'Entertainment': ['AMC', 'TopGolf', 'Dave & Buster\'s'],
    'Misc': ['Etsy', 'Other', 'Unknown']
}
# (mean, std) of the normal distribution used to draw amounts for each category
category_amounts = {
    'Travel': (500, 150),         # flights/hotels
    'Meals': (40, 10),            # meals, coffee
    'Software': (200, 100),       # licenses, tools (not in `categories`, so never drawn below)
    'Supplies': (50, 20),         # office stuff
    'Entertainment': (100, 50),   # events, activities
    'Misc': (75, 75)              # uncertain, high variance
}

In [4]:
data = []
for _ in range(num_records):
    company = random.choice(companies)
    department = random.choice(departments)
    category = random.choice(categories)
    merchant = random.choice(merchants[category])
    
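    # Draw the amount from the category's normal distribution, rounded to cents
    mean, std = category_amounts[category]
    amount = round(np.random.normal(loc=mean, scale=std), 2)
    amount = max(1, amount)  # floor at 1 so no amount is zero or negative
    # Disabled alternative: uniform draw (category_amount_ranges is not defined in this notebook)
    # min_amt, max_amt = category_amount_ranges[category]
    # amount = round(random.uniform(min_amt, max_amt), 2)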

    employee = fake.name()
    date = fake.date_between(start_date='-180d', end_date='today')
    
    data.append([employee, company, department, category, merchant, amount, date, 'transaction'])

df_transactions = pd.DataFrame(data, columns=[
    'employee', 'company', 'department', 'category', 'merchant', 'amount', 'date', 'type'
])
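The loop above draws each amount from a normal distribution and floors it at 1. The following is a minimal vectorized sketch of the same idea, assuming NumPy's `Generator` API. `draw_amounts` is a hypothetical helper that does not appear in the notebook. It also makes one side effect visible: with the `'Misc'` parameters (mean 75, std 75), a noticeable share of draws falls below 1 and is clamped to the floor.

```python
import numpy as np

def draw_amounts(mean, std, size, rng):
    """Hypothetical helper: draw `size` amounts from N(mean, std),
    round to cents, and floor at 1 (same rule as the loop above)."""
    raw = rng.normal(loc=mean, scale=std, size=size)
    return np.maximum(1.0, np.round(raw, 2))

rng = np.random.default_rng(88)

# 'Misc' uses (mean=75, std=75) in category_amounts
misc_amounts = draw_amounts(75, 75, 10_000, rng)

# Fraction of draws that ended up at the floor value
share_floored = float((misc_amounts == 1.0).mean())
print(f"Share of Misc amounts clamped to 1: {share_floored:.1%}")
```

Because the floor sits about one standard deviation below the mean for `'Misc'`, roughly one in six draws is clamped. Whether that spike at 1 is acceptable depends on how the simulated data will be used.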

### Create Fake Subscriptions

In [5]:
software_subscriptions = {
    'Slack': 120,
    'Zoom': 80,
    'Adobe': 150,
    'AWS': 4000,
    'Snowflake': 10000,
    'Notion': 100
}

In [6]:
software_assignments = {
    'Engineering': ['Slack', 'Zoom', 'AWS', 'Snowflake'],
    'Marketing': ['Slack', 'Notion', 'Adobe'],
    'Sales': ['Zoom', 'Slack'],
    'Finance': ['Slack', 'Notion'],
    'HR': ['Slack', 'Notion']
}

In [7]:
def generate_monthly_dates(start_date, end_date):
    return pd.date_range(start=start_date, end=end_date, freq='MS')

In [8]:
start_date = pd.to_datetime('2025-01-01')
end_date = pd.to_datetime('2025-06-30')
dates = generate_monthly_dates(start_date, end_date)
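As a quick check of what `freq='MS'` (month start) produces, the sketch below rebuilds the same range with plain `pd.date_range`. The second range, with mid-month endpoints, is an illustrative assumption and not part of the notebook. It shows that only month-start dates inside the window are returned.

```python
import pandas as pd

# Same window as the notebook: one billing date per month start
dates = pd.date_range(start="2025-01-01", end="2025-06-30", freq="MS")
print([d.strftime("%Y-%m-%d") for d in dates])

# Illustrative only: a window that starts mid-month skips that month's start
partial = pd.date_range(start="2025-01-15", end="2025-03-10", freq="MS")
print([d.strftime("%Y-%m-%d") for d in partial])
```

The notebook's window therefore yields six billing dates, January 1 through June 1, 2025.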

In [9]:
recurring_rows = []

for company in companies:
    for dept, tools in software_assignments.items():
        for tool in tools:
            amount = software_subscriptions[tool]
            employee = fake.name()

            for bill_date in dates:
                recurring_rows.append([
                    employee,
                    company,
                    dept,
                    'Software',
                    tool,
                    amount,
                    bill_date.date(),
                    'subscription'
                ])

In [10]:
df_recurring = pd.DataFrame(recurring_rows, columns=[
    'employee', 'company', 'department', 'category', 'merchant', 'amount', 'date', 'type'
])

In [11]:
df = pd.concat([df_transactions, df_recurring], ignore_index=True)

In [12]:
df[df['type'] == 'subscription'].head()

Unnamed: 0,employee,company,department,category,merchant,amount,date,type
50000,Troy Jones,Spendora,Engineering,Software,Slack,120.0,2025-01-01,subscription
50001,Troy Jones,Spendora,Engineering,Software,Slack,120.0,2025-02-01,subscription
50002,Troy Jones,Spendora,Engineering,Software,Slack,120.0,2025-03-01,subscription
50003,Troy Jones,Spendora,Engineering,Software,Slack,120.0,2025-04-01,subscription
50004,Troy Jones,Spendora,Engineering,Software,Slack,120.0,2025-05-01,subscription


### Simulate Overspends

In [13]:
outliers = df.sample(frac=0.03).copy()
outliers['amount'] *= 3  # Big overspend
outliers['category'] = 'Misc'  # Unclear category
df.update(outliers)
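The cell above relies on `DataFrame.update` aligning on the index, so only the sampled rows are overwritten. A minimal sketch on a toy frame shows the effect. The frame and the fixed `random_state` are assumptions added here for a repeatable illustration.

```python
import pandas as pd

# Toy frame: 100 identical rows so the changed rows are easy to spot
toy = pd.DataFrame({
    "amount": [10.0] * 100,
    "category": ["Travel"] * 100,
})

# Sample 3% of rows, triple the amount, and relabel the category
outliers = toy.sample(frac=0.03, random_state=0).copy()
outliers["amount"] *= 3
outliers["category"] = "Misc"

# update() matches rows by index label and overwrites them in place
toy.update(outliers)

n_changed = int((toy["category"] == "Misc").sum())
print(n_changed, toy["amount"].sum())
```

Note that in the notebook the sample is taken from the combined frame, so subscription rows can also be tripled and relabeled as `'Misc'`.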

In [14]:
df.to_csv('simulated_expense_transactions.csv', index=False)

In [15]:
df.head()

Unnamed: 0,employee,company,department,category,merchant,amount,date,type
0,Marcia Bauer,Expensivus,Operations,Misc,Other,83.02,2025-02-05,transaction
1,Robert Terrell,LedgrIQ,Engineering,Meals,Panera,62.06,2025-05-24,transaction
2,Thomas Vega,ClarityLedger,Engineering,Misc,Unknown,146.74,2025-02-17,transaction
3,Victoria Roman,Fintrix,HR,Travel,Uber,510.26,2025-05-18,transaction
4,Jessica Holland,Spendora,Marketing,Supplies,Amazon,71.37,2025-03-06,transaction


## Enrichment

This notebook performs enrichment tasks such as calculating z-scores and flagging policy violations inline. In a production-grade architecture, this logic would typically run downstream in the data pipeline.

For example:

- Daily ingestion jobs (e.g., simulated swipes or one-time expenses) might run via scheduled scripts or serverless jobs.

- Monthly batch jobs could simulate recurring expenses like software subscriptions.

- Z-score calculations, merchant averages, and anomaly detection are examples of enrichment that might be performed after ingestion, using tools like AWS Glue, dbt, or a Spark job in a data lake environment.

For the purposes of this prototype, all steps are consolidated into this notebook. As the project evolves, the plan is to modularize these stages into a pipeline that simulates real-world ETL behavior more closely.
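One possible shape for that modularization is sketched below: each stage becomes a small function that takes and returns a DataFrame, chained with `DataFrame.pipe`. The function names (`ingest`, `add_calendar_features`, `add_merchant_stats`, `run_pipeline`) and the three sample rows are hypothetical and exist only to illustrate the idea. This is not the project's actual pipeline.

```python
import pandas as pd

def ingest(rows):
    """Hypothetical stage 1: load raw expense rows and parse dates."""
    df = pd.DataFrame(rows, columns=["merchant", "amount", "date"])
    df["date"] = pd.to_datetime(df["date"])
    return df

def add_calendar_features(df):
    """Hypothetical stage 2: derive day-of-week, weekend flag, and month."""
    out = df.copy()
    out["day_of_week"] = out["date"].dt.day_name()
    out["is_weekend"] = out["day_of_week"].isin(["Saturday", "Sunday"])
    out["month"] = out["date"].dt.month
    return out

def add_merchant_stats(df):
    """Hypothetical stage 3: per-merchant mean, std, and z-score."""
    out = df.copy()
    grouped = out.groupby("merchant")["amount"]
    out["merchant_mean"] = grouped.transform("mean")
    out["merchant_std"] = grouped.transform("std")
    out["amount_z_score"] = (out["amount"] - out["merchant_mean"]) / out["merchant_std"]
    return out

def run_pipeline(rows):
    """Chain the stages; each one could later become its own job."""
    return ingest(rows).pipe(add_calendar_features).pipe(add_merchant_stats)

# Illustrative rows only
rows = [
    ("Panera", 40.0, "2025-05-24"),
    ("Panera", 60.0, "2025-05-26"),
    ("Uber", 500.0, "2025-05-18"),
]
enriched = run_pipeline(rows)
print(enriched[["merchant", "amount", "is_weekend", "merchant_mean", "amount_z_score"]])
```

Keeping each stage free of side effects (copy in, new frame out) makes it easier to move a stage into a scheduled job or a dbt-style transformation later without changing the others.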

In [16]:
# Parse the date column once and reuse it for all calendar features
parsed_dates = pd.to_datetime(df['date'])
df['day_of_week'] = parsed_dates.dt.day_name()
df['is_weekend'] = df['day_of_week'].isin(['Saturday', 'Sunday'])
df['month'] = parsed_dates.dt.month

In [17]:
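# Per-merchant mean and sample standard deviation, broadcast back to every row
df['merchant_mean'] = df.groupby('merchant')['amount'].transform('mean')
df['merchant_std'] = df.groupby('merchant')['amount'].transform('std')

# Z-score: how many standard deviations an amount sits from its merchant's mean
df['amount_z_score'] = (df['amount'] - df['merchant_mean']) / df['merchant_std']

# Flag amounts more than 3 standard deviations above the merchant mean
df['is_policy_violation'] = (df['amount_z_score'] > 3)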
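To make the z-score rule concrete, here is a small worked sketch on toy data. The merchants `Cafe` and `Solo` and their amounts are made up for illustration. It also covers one edge case: a merchant with a single row has no sample standard deviation, so its z-score is `NaN` and the row is never flagged.

```python
import pandas as pd

# Toy data: 19 ordinary Cafe charges, one large Cafe charge, one lone Solo charge
toy = pd.DataFrame({
    "merchant": ["Cafe"] * 20 + ["Solo"],
    "amount": [10.0] * 19 + [100.0] + [55.0],
})

grouped = toy.groupby("merchant")["amount"]
toy["merchant_mean"] = grouped.transform("mean")
toy["merchant_std"] = grouped.transform("std")  # sample std (ddof=1); NaN for one-row groups

toy["amount_z_score"] = (toy["amount"] - toy["merchant_mean"]) / toy["merchant_std"]
toy["is_policy_violation"] = toy["amount_z_score"] > 3  # NaN > 3 evaluates to False

print(toy.tail(3))
```

Here the 100.0 charge has a z-score of about 4.25 and is the only flagged row. Because the outlier itself inflates the mean and standard deviation it is compared against, small groups can hide large charges. With the sample standard deviation, a single extreme value among `n` rows can reach a z-score of at most `(n - 1) / sqrt(n)`, so a group needs at least 11 rows before any one row can exceed 3.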

### Estimated Savings

In [18]:
df['potential_savings'] = 0.0
df.loc[df['amount_z_score'] > 3, 'potential_savings'] = df['amount'] - (df['merchant_mean'] + 3 * df['merchant_std'])
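The savings estimate above is the portion of a flagged amount that exceeds the merchant's threshold of mean plus three standard deviations. A minimal sketch on toy data (the `Cafe` rows are illustrative, not from the dataset) expresses the same rule with `Series.where`, which avoids assigning through `.loc`:

```python
import pandas as pd

# Toy data: 19 ordinary charges and one large charge at the same merchant
toy = pd.DataFrame({
    "merchant": ["Cafe"] * 20,
    "amount": [10.0] * 19 + [100.0],
})

mean = toy.groupby("merchant")["amount"].transform("mean")
std = toy.groupby("merchant")["amount"].transform("std")

# Per-merchant spending threshold: mean + 3 standard deviations
threshold = mean + 3 * std

# Same condition as z-score > 3 whenever std is positive
over = toy["amount"] > threshold

# Savings = excess above the threshold for flagged rows, 0.0 otherwise
toy["potential_savings"] = (toy["amount"] - threshold).where(over, 0.0)

print(toy["potential_savings"].sum())
```

In this example the threshold is about 74.87, so the 100.0 charge contributes roughly 25.13 in potential savings and every other row contributes 0.0.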


In [23]:
output_path = '/Users/griffinbrown/Documents/money_mop/data'
filename = 'simulated_data.csv'
os.makedirs(output_path, exist_ok=True)
df.to_csv(os.path.join(output_path, filename), index=False)

In [21]:
df.head(20)

Unnamed: 0,employee,company,department,category,merchant,amount,date,type,day_of_week,is_weekend,month,merchant_mean,merchant_std,amount_z_score,is_policy_violation,potential_savings
0,Marcia Bauer,Expensivus,Operations,Misc,Other,83.02,2025-02-05,transaction,Wednesday,False,2,85.029928,73.593481,-0.027311,False,0.0
1,Robert Terrell,LedgrIQ,Engineering,Meals,Panera,62.06,2025-05-24,transaction,Saturday,True,5,42.07121,16.724194,1.195202,False,0.0
2,Thomas Vega,ClarityLedger,Engineering,Misc,Unknown,146.74,2025-02-17,transaction,Monday,False,2,86.498069,76.934703,0.783027,False,0.0
3,Victoria Roman,Fintrix,HR,Travel,Uber,510.26,2025-05-18,transaction,Sunday,True,5,531.188385,239.136145,-0.087517,False,0.0
4,Jessica Holland,Spendora,Marketing,Supplies,Amazon,71.37,2025-03-06,transaction,Thursday,False,3,54.341907,28.890219,0.589407,False,0.0
5,Danielle Burke,Zentro,Operations,Meals,Olive Garden,49.97,2025-06-12,transaction,Thursday,False,6,42.026389,16.652833,0.477013,False,0.0
6,Karen Gilbert,LedgrIQ,Engineering,Travel,Hilton,360.27,2025-03-15,transaction,Saturday,True,3,535.848537,245.056369,-0.716482,False,0.0
7,Benjamin Gonzalez,Zentro,Operations,Meals,Starbucks,47.3,2025-03-15,transaction,Saturday,True,3,42.537389,17.843407,0.266912,False,0.0
8,Neil Lamb,ClarityLedger,HR,Entertainment,Dave & Buster's,91.44,2025-02-12,transaction,Wednesday,False,2,108.911119,67.085387,-0.260431,False,0.0
9,Kathy Barnes,Spendora,HR,Supplies,Staples,24.23,2025-04-21,transaction,Monday,False,4,53.323356,28.039722,-1.037576,False,0.0
