# Zoo Financial Exploratory Data Analysis
What is the motivation for conducting this analysis?

Exploring the CheckChicks' cashflow can reveal that...

Users who enjoy tracking their expenses and income might be described,
depending on their approach and enthusiasm for financial management, as:

- **Budget-conscious** – Someone who carefully plans and monitors their spending.
- **Financially meticulous** – A person who pays close attention to financial details.
- **Expense tracker** – Someone who actively records their spending habits.
- **Frugal planner** – A person who enjoys optimizing their finances for savings.
- **Money-savvy** – Someone who is knowledgeable and strategic about finances.
- **Personal finance enthusiast** – A broader term for those who enjoy managing their money.
- **Data-driven spender** – Someone who makes financial decisions based on recorded data.
- **Financial optimizer** – A person who seeks to maximize efficiency in their financial habits.

On the other hand, users who don’t track their expenses and income regularly might be,
depending on their habits and attitudes toward financial management, described as such:

- **Spontaneous spender** – Someone who makes purchases without much planning.
- **Financially carefree** – A person who doesn’t stress about tracking money closely.
- **Unstructured budgeter** – Someone who manages finances loosely without detailed records.
- **Impulse buyer** – A person who tends to make purchases on a whim.
- **Money-agnostic** – Someone who doesn’t prioritize financial tracking.
- **Casual earner** – A person who earns and spends without strict oversight.
- **Non-budgeter** – Someone who avoids formal budgeting altogether.
- **Financially intuitive** – A person who relies on instinct rather than detailed tracking.

# Data Source

- acc_user
- acc_cashflow

In [None]:
import os
from sqlalchemy import create_engine
import pandas as pd

user = os.getenv("MYSQL_USER")
password = os.getenv("MYSQL_PASSWORD")
host = "localhost"
database = "zoo"

engine = create_engine(f"mysql+pymysql://{user}:{password}@{host}/{database}")
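
If the password contains reserved characters such as `@` or `/`, a URL built with a plain f-string may be parsed incorrectly. A minimal sketch that escapes the credentials with the standard library's `urllib.parse.quote_plus`; the helper name `build_mysql_url` and the sample credentials are made up for illustration:

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, database):
    """Return a MySQL connection URL with URL-escaped credentials."""
    return (f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}/{database}")


# hypothetical credentials, chosen to contain reserved characters
url = build_mysql_url("analyst", "p@ss/word", "localhost", "zoo")
print(url)
```

The resulting string could be passed to `create_engine` in place of the f-string.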

## User Registration Date

In [None]:
# user registration (i.e., account creation) date

query = """
SELECT user_id, min(DATE(CREDTM)) registered
FROM zoo_checkchick2.ACC_USER
GROUP BY user_id
"""

users = pd.read_sql(query, con=engine,
                    dtype=({'registered':'datetime64[ns]'}))
users.describe()

In [None]:
# user cohorts by registered year

query = """
SELECT
  year(CREDTM) registered_yr,
  count(*) n_new_user
FROM (
  SELECT
    USER_ID,
    min(CREDTM) AS CREDTM
  FROM zoo_checkchick2.ACC_USER
  GROUP BY USER_ID
) AS foo
GROUP BY registered_yr
ORDER BY registered_yr
;
"""

pd.read_sql(query, con=engine)

## Cashflow

In [None]:
# check number of records (expense and income entries) each month

query = """
SELECT DATE_FORMAT(date, '%%Y-%%m') yyyy_mm,
  COUNT(*) total,
  SUM(is_expense) nbr_expense_entry,
  SUM(is_group) nbr_group_entry
FROM zoo.acc_cashflow
GROUP BY yyyy_mm
"""

df = pd.read_sql(query, con=engine)
df.describe()

_**Observation:**_ About half of the monthly periods in the data set contain very few records. Erroneous dates?
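
One way to probe the erroneous-date hypothesis is to flag months whose record count is tiny compared with a busy month. A minimal sketch on made-up counts, assuming a frame shaped like `df` (`yyyy_mm`, `total`). Because roughly half of the periods look sparse, the median would be a poor baseline, so the cutoff is set against the 90th percentile; the 1% factor is an arbitrary assumption.

```python
import pandas as pd

# made-up monthly counts; the first and last periods mimic bad timestamps
monthly = pd.DataFrame({
    "yyyy_mm": ["1970-01", "2018-06", "2018-07", "2018-08", "2099-12"],
    "total": [3, 12000, 15000, 14000, 1],
})

# flag months with fewer than 1% of the volume seen in a busy month
cutoff = 0.01 * monthly["total"].quantile(0.9)
suspect = monthly[monthly["total"] < cutoff]
print(suspect["yyyy_mm"].tolist())
```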

In [None]:
import seaborn as sns

# Reshape data for Seaborn
df_melted = df.melt(id_vars=['yyyy_mm'], var_name='Metric', value_name='Value')
df_melted['yyyy_mm'] = df_melted['yyyy_mm'].astype('string')

ax = sns.lineplot(data=df_melted, x='yyyy_mm', y='Value', hue='Metric')

# Rotate x-axis labels and show only every 4th one;
# there is one tick per period, not one per row of df_melted
ax.tick_params(axis='x', labelrotation=90)
for i, label in enumerate(ax.get_xticklabels()):
    label.set_visible(i % 4 == 0)

ax.figure.set_size_inches(12, 6)
ax.set_xlabel('Period')
ax.set_ylabel('Count')

# ax.figure.show()  # Ensure proper display without Matplotlib calls

__*Observation:*__
- erroneous timestamps
- personal entries make up a significantly larger portion of the records; I had expected more group entries.
- expenses make up the larger portion of the records; this aligns with the norm for personal finance datasets.

In [None]:
# acc_cashflow dataset
# left and right date range censored 
# between '2018-06-01' AND '2025-01-31'
# to exclude erroneous rows

cashflow = pd.read_feather('../data/cashflow.feather'
                          ).query("'2018-06-01' <= date <= '2025-01-31'")
cashflow.info()

_**n.b.,**_ __amt__ is float64. _I expect this to be a whole number._

In [None]:
cashflow.isna().sum()

In [None]:
print(cashflow.group_id.count() / cashflow.shape[0])

_**Observation:**_ group entries make up about 15% of the records.

_**Question:**_ Should fractional amounts be rounded up?

In [None]:
# should amt be whole number?
bad = cashflow.query("amt % 1 != 0").copy()
bad.describe()

In [None]:
# create quartile bins
bad['quartile'] = pd.qcut(bad['amt'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bad.groupby('quartile', observed=True).agg({'amt':['count', 'sum', 'mean']})

In [None]:
bad.groupby(['quartile', 'is_expense'], observed=True).agg({'amt':'sum'}).unstack(level=1)

In [None]:
bad['category'].unique()

In [None]:
bad['note'].unique()

In [None]:
# impute fractional amounts by rounding to the nearest whole number
# (n.b. Series.round() sends exact halves to the even neighbour)

cashflow['amt'] = cashflow['amt'].round().astype('Int64')
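
Note that `Series.round()` rounds to the nearest integer, with ties going to the even neighbour, which is not the same as rounding away from zero (up if > 0, down if < 0). A small sketch on made-up values contrasts the two; which behaviour is wanted for `amt` is left open here.

```python
import numpy as np
import pandas as pd

amt = pd.Series([0.5, 1.5, 2.5, -0.5, -1.2])

# Series.round(): nearest integer, exact halves go to the even neighbour
nearest = amt.round().astype("Int64")

# alternative: always round away from zero (up if > 0, down if < 0)
away = (np.sign(amt) * np.ceil(amt.abs())).astype("Int64")

print(nearest.tolist())  # [0, 2, 2, 0, -1]
print(away.tolist())     # [1, 2, 3, -1, -2]
```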

In [None]:
cashflow.info()

In [None]:
# Should rows where amount is 0 be excluded?

cashflow[cashflow.amt == 0].count()

In [None]:
# check number of active users in each month, i.e. period

cashflow['yyyy_mm'] = cashflow['date'].dt.to_period('M')
cashflow['yyyy'] = cashflow['date'].dt.year
cashflow['mm'] = cashflow['date'].dt.month

#cashflow.groupby(['yyyy', 'mm']).agg({'user_id':'nunique'}).unstack(level=1)

In [None]:
_ = cashflow.groupby('yyyy_mm').agg({'user_id':'nunique'}) \
    .plot(y='user_id', kind='line', figsize=(12, 6), title="Number of Active Users Each Month")

__*Observation:*__ The number of active users declined gradually, though the trend is not pronounced.

In [None]:
cashflow.describe()

# Tidy Dataset

Let the final tidy dataset be _td_. Proceed with the data preparation as follows:

## Tenure

In [None]:
# user tenure, income and expense entry stats,
# including group entries
# excluding amount = 0
# n.b. tenure begins when the user submits their first entry; it does not
#      consider when the user first registered with the app

tenure = cashflow[cashflow.amt != 0] \
    .groupby('user_id') \
    .agg(user_tenure = ('date', lambda x: x.max() - x.min()),
         first_entry = ('date', 'min'),
         last_entry = ('date', 'max'),
         nbr_entry = ('user_id', 'count'),
         total_exp = ('amt', lambda x: x[x < 0].sum()),
         nbr_exp = ('amt', lambda x: x[x < 0].count()),
         total_inc = ('amt', lambda x: x[x > 0].sum()),
         nbr_inc = ('amt', lambda x: x[x > 0].count())
        )
tenure.info()

In [None]:
tenure.describe()

_**Initial observation:**_ Of the 399,125 users, 25% had logged entries in the last 48 to 50 days.
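
A figure like this can be cross-checked by measuring, for each user, the days elapsed between their last entry and the newest entry in the data. A minimal sketch on made-up dates; the 50-day window is taken from the observation, everything else is illustrative.

```python
import pandas as pd

# made-up last-entry dates, one per user
last_entry = pd.Series(pd.to_datetime(
    ["2024-01-10", "2024-06-01", "2024-12-20", "2025-01-31"]))

# days elapsed between each user's last entry and the newest entry overall
days_since = (last_entry.max() - last_entry).dt.days

# share of users whose last entry falls within a 50-day window
share_recent = (days_since <= 50).mean()
print(share_recent)
```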

In [None]:
cashflow.groupby(['user_id', 'is_group', 'is_expense'])['category_id'].nunique().unstack(level=[1,2], fill_value=0).describe()

_**Observation:**_ Is it true that entries are not well categorized by the bottom 75% of users?

## Group Bookkeeping

In [None]:
# user group expense and income entry stat per user

td_grp = cashflow[(cashflow.amt != 0) & (cashflow.is_group == True)] \
    .groupby('user_id') \
    .agg(n_grp = ('group_id', 'nunique'),
         first_grp_entry = ('date', 'min'),
         last_grp_entry = ('date', 'max'),
         grp_exp = ('amt', lambda x: x[x < 0].sum()),
         nbr_grp_exp = ('amt', lambda x: x[x < 0].count()),
         grp_inc = ('amt', lambda x: x[x > 0].sum()),
         nbr_grp_inc = ('amt', lambda x: x[x > 0].count())
        )
td_grp.info()

In [None]:
td_grp.describe()

## Co-bookkeepers

Self-join the group_id:user_id pairs from _cashflow_ to compute
the number of distinct users each user interacted with within groups.

In [None]:
# unique group_id:user_id linkage 
grp = cashflow.loc[(cashflow.amt != 0) & (cashflow.group_id.notnull()),
                 ['group_id', 'user_id']].drop_duplicates()
grp.info()

In [None]:
# user's groups and their associated users (members),
# i.e., user's connections with other users through cooperative bookkeeping

mbr = grp[['user_id', 'group_id']].merge(grp[['user_id', 'group_id']], on='group_id', how='left')
mbr.columns = ['user_id', 'group_id', 'member_id']
mbr.describe()

In [None]:
# count the participants at each group level for every user

x = mbr.groupby(['user_id', 'group_id']) \
    .agg({'member_id':'nunique'}) \
    .reset_index(1) \
    .rename(columns={'member_id':'nbr_member'})
x.describe()

In [None]:
x.plot.hist(bins=30, alpha=0.7)

_**Observation:**_ Groups with only one participant should be excluded from the summary statistics.
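
One possible way to apply that exclusion is to drop single-member groups before the self-join. A minimal sketch on a made-up linkage table shaped like `grp`; the names `links`, `shared`, `pairs`, and `connections` are hypothetical.

```python
import pandas as pd

# made-up group_id:user_id linkage, shaped like `grp`
links = pd.DataFrame({
    "group_id": ["g1", "g1", "g2", "g3", "g3", "g3"],
    "user_id":  ["u1", "u2", "u3", "u1", "u4", "u5"],
})

# keep only groups that have at least two distinct members
size = links.groupby("group_id")["user_id"].transform("nunique")
shared = links[size >= 2]

# self-join on group_id, drop self-pairs, count distinct co-members per user
pairs = shared.merge(shared, on="group_id", suffixes=("", "_member"))
pairs = pairs[pairs["user_id"] != pairs["user_id_member"]]
connections = pairs.groupby("user_id")["user_id_member"].nunique()
print(connections.to_dict())
```

Users who only ever book in a one-person group (here `u3`) drop out of the result entirely, so they would need a zero filled in after any later left join.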

In [None]:
# for every user, count the unique users across _all_ associated groups

cnx = mbr.groupby('user_id').agg({'group_id': 'nunique', 'member_id':'nunique'})
cnx.columns = ['n_grp', 'nbr_connection']
cnx['nbr_connection'] = cnx['nbr_connection'] - 1 # remove user itself from count
cnx.hist(bins=40, grid=False, alpha=.7)

## Merge and Impute

In [None]:
td = users.merge(tenure, on='user_id', how='right') \
    .merge(cnx['nbr_connection'], on='user_id', how='left') \
    .merge(td_grp, on='user_id', how='left') \
    .convert_dtypes()
td.info()

In [None]:
values = {'nbr_connection':0, 'n_grp':0,
          'grp_exp':0, 'nbr_grp_exp':0,
          'grp_inc':0, 'nbr_grp_inc':0}
td.fillna(value=values, inplace=True)
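
Passing a dict to `fillna` fills only the listed columns, so count and amount columns become 0 while date columns such as `first_grp_entry` stay missing for users without group entries. A small sketch on made-up frames illustrates this; the frame names are hypothetical.

```python
import pandas as pd

base = pd.DataFrame({"user_id": ["u1", "u2", "u3"], "nbr_entry": [10, 4, 7]})
grp_stats = pd.DataFrame({
    "user_id": ["u1"],
    "n_grp": [2],
    "first_grp_entry": pd.to_datetime(["2023-05-01"]),
})

# the left join leaves missing values for users without group stats
merged = base.merge(grp_stats, on="user_id", how="left").convert_dtypes()

# fill only the count column; the date column keeps its missing values
merged = merged.fillna(value={"n_grp": 0})
print(merged)
```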

In [None]:
td.describe()

In [None]:
td.to_csv('../reports/tidy.csv')

# EDA

In [None]:
# select users having made at least 1 entry in the past 50 days

t1 = max(tenure.last_entry)
t0 = t1 - pd.Timedelta(days=50)

td.loc[td.last_entry >= t0,]['nbr_entry'].quantile([0.25, 0.5, 0.75])

In [None]:
# number of users active in the past 50 days
td.loc[td.last_entry >= t0,].shape[0]

In [None]:
# percent of active users in the past 50 days 
td[td.last_entry >= t0].shape[0] / td.shape[0]

In [None]:
x = td[td.last_entry >= t0].copy()
x['quantile'] = pd.qcut(x['user_tenure'], q=10)
(x['user_tenure'].dt.days / 365.2425).mean()

In [None]:
x.groupby('quantile', observed=True).agg({'user_tenure':'median', 'nbr_entry':'mean', 'last_entry':'median'})

In [None]:
# number of members in each group
nbr_mbr_grp = grp.groupby('group_id')['user_id'].nunique()

nbr_mbr_grp.agg(['min', 'max', 'mean', 'median'])

In [None]:
nbr_mbr_grp.quantile([.25, .5, .75, .8, .9, .95, .99])

In [None]:
# is this interesting?

cashflow.groupby(['user_id'])['amt'].sum().quantile([.1, .25, .5, .75, .9])

## Expense by Category



In [None]:
# private entries
cashflow.query("is_group == False & is_expense == True").groupby('category') \
    .agg(n_user = ('user_id','nunique'),
         nbr_expense = ('category', 'count'),
         nbr_expense_per_user = ('user_id', lambda x: round(x.count() / x.nunique(), 2)),
         avg_amt = ('amt', lambda x: -(round(x.mean())))
        ) \
    .sort_values(by='nbr_expense',  ascending=False)

In [None]:
# group entries
cashflow.query("is_group == True & is_expense == True").groupby('category') \
    .agg(n_user = ('user_id','nunique'),
         nbr_expense = ('category', 'count'),
         nbr_expense_per_user = ('user_id', lambda x: round(x.count() / x.nunique(), 2)),
         avg_amt = ('amt', lambda x: -(round(x.mean())))
        ) \
    .sort_values(by='nbr_expense',  ascending=False)
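
The private and group summaries differ only in the `is_group` filter, so the aggregation could be wrapped in a single function. A minimal sketch on made-up rows; the name `expense_by_category` is hypothetical, and rows are counted via `user_id` here, which assumes that column is never null.

```python
import pandas as pd


def expense_by_category(df, is_group):
    """Summarise expense entries per category for private or group entries."""
    subset = df[(df["is_group"] == is_group) & (df["is_expense"])]
    return (subset.groupby("category")
            .agg(n_user=("user_id", "nunique"),
                 nbr_expense=("user_id", "count"),
                 nbr_expense_per_user=("user_id",
                                       lambda x: round(x.count() / x.nunique(), 2)),
                 avg_amt=("amt", lambda x: -round(x.mean())))
            .sort_values(by="nbr_expense", ascending=False))


# made-up entries; expenses are stored as negative amounts
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3"],
    "is_group": [False, False, False, True, False],
    "is_expense": [True, True, True, True, False],
    "category": ["food", "food", "food", "rent", "salary"],
    "amt": [-100, -50, -30, -900, 3000],
})
private = expense_by_category(toy, is_group=False)
print(private)
```

Calling `expense_by_category(toy, is_group=True)` would give the group-entry counterpart.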

# Supplemental

```sql
-- consolidate expense and income entry to simplify analysis

USE zoo;

DROP TABLE IF EXISTS acc_cashflow;

CREATE TABLE acc_cashflow (
  user_id VARCHAR(100) NOT NULL,
  date DATE NOT NULL,
  amt DOUBLE NOT NULL,
  is_expense BOOLEAN,
  is_group BOOLEAN,
  group_id VARCHAR(100),
  category_id INT,
  category VARCHAR(100),
  note VARCHAR(160)
) COMMENT = 'consolidated expense and income entries'
;

-- insert personal expense
INSERT INTO acc_cashflow
SELECT
  USER_ID user_id,
  DATE(CREDTM) date,
  -AMOUNT amt,
  TRUE is_expense,
  FALSE is_group,
  NULL group_id,
  A.CATEGORY category_id,
  B.CATEGORY category,
  NOTE note
FROM zoo.ACC_USER_DETAIL A
LEFT JOIN zoo_checkchick3.ACC_CATEGORY B
ON A.CATEGORY = B.ID
;

-- insert group expense
INSERT INTO acc_cashflow
SELECT
  USER_ID user_id,
  DATE(CREDTM) date,
  -AMOUNT amt,
  TRUE is_expense,
  TRUE is_group,
  GROUP_ID group_id,
  A.CATEGORY category_id,
  B.CATEGORY category,
  NOTE note
FROM zoo_checkchick.ACC_GROUP_DETAIL A
LEFT JOIN zoo_checkchick3.ACC_CATEGORY B
ON A.CATEGORY = B.ID
;

-- insert personal income
INSERT INTO acc_cashflow
SELECT
  USER_ID user_id,
  DATE(CREDTM) date,
  AMOUNT amt,
  FALSE is_expense,
  FALSE is_group,
  NULL group_id,
  A.CATEGORY category_id,
  B.CATEGORY category,
  NOTE note
FROM zoo_checkchick2.ACC_USER_DETAIL_INCOME A
LEFT JOIN zoo_checkchick3.ACC_CATEGORY_INCOME B
ON A.CATEGORY = B.ID
;

-- insert group income
INSERT INTO acc_cashflow
SELECT
  USER_ID user_id,
  DATE(CREDTM) date,
  AMOUNT amt,
  FALSE is_expense,
  TRUE is_group,
  GROUP_ID group_id,
  A.CATEGORY category_id,
  B.CATEGORY category,
  NOTE note
FROM zoo_checkchick2.ACC_GROUP_DETAIL_INCOME A
LEFT JOIN zoo_checkchick3.ACC_CATEGORY_INCOME B
ON A.CATEGORY = B.ID
;
```