# Analysis of Personal Finance Bookkeeping Activity
Per Official Account (OA) dashboard
1/31/2025  
628,181 Friends added  
312,619 Target reach  
288,289 Blocked count

Transaction Database  
651,793 users (acc_user)  
317,726 blocked (acc_user)  
399,125 unique users (acc_cashflow)

What is the motivation for conducting this analysis?



# Data Source

- acc_user
- acc_cashflow

In [None]:
import os
from sqlalchemy import create_engine
import pandas as pd

user = os.getenv("MYSQL_USER")
password = os.getenv("MYSQL_PASSWORD")
host = "localhost"
database = "zoo"

engine = create_engine(f"mysql+pymysql://{user}:{password}@{host}/{database}")

## Users
`user_id`  
`is_bot`  
`is_agree`: `False` when the user has blocked communication from the LINE OA  
`ts`: timestamp when the record was created; it is unclear whether it changes when the record is updated later

In [None]:
query = """
SELECT
  user_id,
  isBot is_bot,
  isAgree is_agree,
  min(CREDTM) ts
FROM zoo_checkchick2.ACC_USER
GROUP BY user_id, is_bot, is_agree
"""

users = pd.read_sql(query, con=engine,
                    dtype=({'is_bot':'bool', 'is_agree':'bool', 'ts':'datetime64[ns]'}))
users.info()

In [None]:
# check for duplicates
x = users.groupby('user_id').size()
dups=x[x > 1].index
len(dups)

In [None]:
users[users.user_id.isin(dups)].shape

In [None]:
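# what is the dup ratio (rows per duplicated user_id)?
# e.g. 89192 rows / 44596 user_ids = 2.0
users.user_id.isin(dups).sum() / len(dups)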

In [None]:
users[users.user_id.isin(dups)].sort_values(['user_id', 'ts']).tail(10)

Remove duplicates by keeping the last row (most recent) in each group.
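
A small sketch on made-up rows of two ways to keep the most recent record per user. The frame `toy` and the names `by_row` and `by_groupby` are illustrative only. Note that `GroupBy.last()` returns the last non-null value of each column, which is not always the same as the last row.

```python
import pandas as pd

# hypothetical toy data: user "Ua" appears twice and the later row has a missing flag
toy = pd.DataFrame({
    "user_id": ["Ua", "Ua", "Ub"],
    "is_agree": [True, None, False],
    "ts": pd.to_datetime(["2020-01-01", "2021-01-01", "2020-06-01"]),
}).sort_values(["user_id", "ts"])

# keep the last row of each user exactly as stored
by_row = toy.drop_duplicates("user_id", keep="last").set_index("user_id")

# GroupBy.last() takes the last non-null value column by column
by_groupby = toy.groupby("user_id").last()

print(by_row)
print(by_groupby)
```

If the most recent row can carry nulls, the two results differ for that user, so the choice should be deliberate.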


In [None]:
users.sort_values(['user_id','ts'], inplace=True)

In [None]:
x = users.groupby('user_id', as_index=False).last()
x[x.user_id.isin(dups)]

In [None]:
x.shape

In [None]:
users = x

In [None]:
# how many Bots?
print(users.is_bot.sum())

In [None]:
users.query("is_bot")

Impute `user_id`  
A real `user_id` starts with a capital 'U'
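
A sketch on made-up IDs (not real LINE user IDs) comparing two ways to fix the leading letter. `str.capitalize()` also lower-cases every character after the first, so an alternative that touches only the first character is shown next to it.

```python
import pandas as pd

# hypothetical IDs; a real ID is assumed to be 'U' followed by other characters
ids = pd.Series(["uab12", "Ucd34", "uEF56", "xyz"])

starts_with_u = ids.str.lower().str.startswith("u")

# str.capitalize() upper-cases the first character and lower-cases the rest
capitalized = ids.where(~starts_with_u, ids.str.capitalize())

# alternative: change only the first character and leave the rest as stored
first_only = ids.where(~starts_with_u, "U" + ids.str[1:])

print(capitalized.tolist())
print(first_only.tolist())
```

Both give the same result when the rest of the ID is already lower case.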

In [None]:
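# IDs that start with 'u' or 'U'
i = users.user_id.str.lower().str.startswith('u')
print(users[i].shape)
# n.b. str.capitalize() upper-cases the first character and lower-cases the rest
users.loc[i, 'user_id'] = users.loc[i, 'user_id'].str.capitalize()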

Cohorts by user timestamp  
_n.b._ this is not always the time the user followed or added the OA

In [None]:
cohorts = users.groupby(users.ts.dt.year).size()
cohorts.name = 'n_new_user'
_ = cohorts.plot.bar(rot=0, xlabel='cohort year', ylabel='users')

In [None]:
# user count & block count
(users.user_id.str.startswith('U').sum(), (users.user_id.str.startswith('U') & ~users.is_agree).sum())

## Cashflow

In [None]:
# acc_cashflow dataset
# check number of records (expense and income entries) each month

cashflow = pd.read_feather('../data/cashflow.feather')

df = cashflow \
    .groupby(cashflow.ts.dt.to_period('M')) \
    .agg(total = ('ts', 'size'),
         nbr_expense_entry = ('is_expense', 'sum'),
         nbr_group_entry = ('is_group', 'sum')
        )
_ = df.plot.line(y=['total', 'nbr_expense_entry', 'nbr_group_entry'], xlabel='')
print(df.describe())

In [None]:
print(df[df.total > df.total.quantile(.51)])

__*Observation:*__
- erroneous timestamps
- contrary to my expectation, personal entries make up a significantly larger portion of the records
- expenses make up the larger portion of the records; this aligns with the norm for personal finance datasets

Let `tsl` be the observation time period

In [None]:
# select date range
# between '2018-06-01' AND '2025-02-01'
# to exclude erroneous rows

tsl = pd.to_datetime(['2018-06-01', '2025-02-01'])
cashflow = cashflow.query("@tsl[0] <= ts < @tsl[1]").copy()
cashflow.info()

_**n.b.,**_ __amt__ is float64. _I expect this to be a whole number._

In [None]:
cashflow.isna().sum()

In [None]:
print(cashflow.group_id.count() / cashflow.shape[0])

_**Observation:**_ group entries make up about 15% of the records.

_**Question:**_ Should fractional amounts be rounded up?
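
To make the choice concrete, a sketch on made-up amounts comparing `Series.round()` (nearest integer, exact halves go to the even neighbour) with rounding away from zero. The names `nearest_even` and `away_from_zero` are illustrative.

```python
import numpy as np
import pandas as pd

# hypothetical amounts: negative = expense, positive = income
amt = pd.Series([-2.5, -1.5, -0.4, 0.5, 1.5, 2.5])

# Series.round() rounds to the nearest integer; exact halves go to the even neighbour
nearest_even = amt.round().astype("Int64")

# rounding away from zero: up for positive amounts, down for negative ones
away_from_zero = (np.sign(amt) * np.ceil(amt.abs())).astype("Int64")

print(pd.DataFrame({"amt": amt, "nearest_even": nearest_even, "away_from_zero": away_from_zero}))
```

Under nearest rounding an entry of 0.5 becomes 0 and would then be dropped together with the zero-amount entries.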

In [None]:
# should amt be whole number?
bad = cashflow.query("amt % 1 != 0")
bad.groupby('is_expense')['amt'].describe()

In [None]:
bad['category'].unique()

In [None]:
bad['note'].unique()

In [None]:
# round fractional amounts to the nearest whole number
# n.b. round() sends exact halves to the nearest even integer, e.g. 0.5 -> 0, 1.5 -> 2

cashflow['amt'] = cashflow['amt'].round().astype('Int64')

In [None]:
cashflow.info()

In [None]:
# Should entries with an amount equal to 0 be excluded?

cashflow[cashflow.amt == 0].count()

In [None]:
# check number of active users in each month, i.e. period

cashflow['yyyy_mm'] = cashflow['ts'].dt.to_period('M')
cashflow['yyyy'] = cashflow['ts'].dt.year
cashflow['mm'] = cashflow['ts'].dt.month
cashflow['wk'] = cashflow['ts'].dt.isocalendar().week

#cashflow.groupby(['yyyy', 'mm']).agg({'user_id':'nunique'}).unstack(level=1)

In [None]:
_ = cashflow.groupby('yyyy_mm').agg({'user_id':'nunique'}) \
    .plot(y='user_id', kind='line', figsize=(12, 6), title="Active Users Each Month")

__*Observation:*__ the number of active users declined gradually, but the decline is not pronounced

In [None]:
cashflow.describe()

_**Observation:**_ expense records outweigh income records 3:1

Impute `group_id` where it is blank but not `None`

In [None]:
bad = cashflow.group_id.str.strip().str.len() == 0  # group_id missing
bad.value_counts()

In [None]:
cashflow.loc[bad, 'user_id'].nunique()

In [None]:
# impute
cashflow.loc[bad, ['group_id', 'is_group']] = [None, False]

Erroneous `user_id`  
A valid `user_id` must start with a capital U

In [None]:
cashflow['isBad'] = ~cashflow.user_id.str.lower().str.startswith('u')

In [None]:
cashflow.groupby('isBad').size()

In [None]:
cashflow[cashflow.isBad].describe()

Impute valid `user_id`  
Make sure they start with capital U

In [None]:
cashflow.loc[~cashflow.isBad, 'user_id'] = cashflow[~cashflow.isBad]['user_id'].str.capitalize()

In [None]:
cashflow.user_id.nunique()

# Tidy Dataset

Let the final tidy dataset be `td`. Proceed with the data preparation as follows:

## Tenure

Based on acc_cashflow,
tenure is defined here as beginning when a user submits the first entry;
it does not consider when the user first followed or added the LINE OA
(official account).

Transactions with an amount of 0 (zero) are excluded.
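
A minimal sketch of this definition on made-up entries. The frame `toy` and the result `tenure_toy` are illustrative only.

```python
import pandas as pd

# hypothetical entries: negative amounts are expenses, positive amounts are income
toy = pd.DataFrame({
    "user_id": ["Ua", "Ua", "Ua", "Ub"],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-11", "2024-02-01", "2024-03-05"]),
    "amt": [-100, 0, 500, -30],
})

# drop zero-amount entries, then measure first-to-last entry per user
tenure_toy = (
    toy[toy.amt != 0]
    .groupby("user_id")
    .agg(first_entry=("ts", "min"),
         last_entry=("ts", "max"),
         nbr_entry=("amt", "size"))
)
tenure_toy["user_tenure"] = tenure_toy.last_entry - tenure_toy.first_entry
print(tenure_toy)
```

A user with a single entry gets a tenure of zero days, which is worth keeping in mind when reading the percentiles.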

In [None]:
# user tenure, income and expense entry stats,
# including group entries
# excluding amount = 0

tenure = cashflow[(cashflow.amt != 0) & ~cashflow.isBad] \
    .groupby('user_id') \
    .agg(user_tenure = ('ts', lambda x: x.max() - x.min()),
         first_entry = ('ts', 'min'),
         last_entry = ('ts', 'max'),
         nbr_entry = ('user_id', 'count'),
         total_exp = ('amt', lambda x: x[x < 0].sum()),
         nbr_exp = ('amt', lambda x: x[x < 0].count()),
         total_inc = ('amt', lambda x: x[x > 0].sum()),
         nbr_inc = ('amt', lambda x: x[x > 0].count())
        )
# tenure.info()

In [None]:
tenure.describe(percentiles=[.25, .5, .75, .9, .95, .97, .99])

_**Initial observation:**_ Of the 397,208 users, 97% had **not** logged an entry in the last 55 days

### WIP Transaction Category 

In [None]:
# WIP category count
td_cat = cashflow[~cashflow.isBad].groupby(['user_id', 'is_expense'])['category_id'].nunique().unstack(level=1, fill_value=0)
td_cat.describe()

_**Initial observation:**_ Is it true that entries are not well categorized by the bottom 75% of users?

## Group Bookkeeping

_to-do:_ add count of categories

In [None]:
# user group expense and income entry stat per user

td_grp = cashflow[(cashflow.amt != 0) & cashflow.is_group & ~cashflow.isBad] \
    .groupby('user_id') \
    .agg(n_grp = ('group_id', 'nunique'),
         first_grp_entry = ('ts', 'min'),
         last_grp_entry = ('ts', 'max'),
         grp_exp = ('amt', lambda x: x[x < 0].sum()),
         nbr_grp_exp = ('amt', lambda x: x[x < 0].count()),
         grp_inc = ('amt', lambda x: x[x > 0].sum()),
         nbr_grp_inc = ('amt', lambda x: x[x > 0].count())
        )
# td_grp.info()

In [None]:
td_grp.describe()

## Co-bookkeepers

Self-join the group_id:user_id pairs from _cashflow_ to compute the
number of distinct users each user interacted with within groups
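
The idea on a made-up membership table (`links`, `pairs` and `connections` are illustrative names). Here the pairing of a user with itself is dropped right after the join.

```python
import pandas as pd

# hypothetical memberships, one row per group_id:user_id pair
links = pd.DataFrame({
    "group_id": ["g1", "g1", "g1", "g2", "g2", "g3"],
    "user_id":  ["Ua", "Ub", "Uc", "Ua", "Ud", "Ue"],
})

# self-join on group_id pairs every user with every member of the same group
pairs = links.merge(links, on="group_id", suffixes=("", "_member"))

# drop the pairing of a user with itself, then count distinct co-members
pairs = pairs[pairs.user_id != pairs.user_id_member]
connections = (
    pairs.groupby("user_id")["user_id_member"].nunique()
    .reindex(links.user_id.unique(), fill_value=0)
)
print(connections)
```

`Ua` shares `g1` with two users and `g2` with one, so it has three connections; `Ue` is alone in `g3` and has none.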

In [None]:
# unique group_id:user_id linkage 
grp = cashflow.loc[(cashflow.amt != 0) & (cashflow.group_id.notnull()) & ~cashflow.isBad,
                 ['group_id', 'user_id']].drop_duplicates()
# grp.info()

In [None]:
# user's groups and their associated users (members),
# i.e., user's connections with other users through cooperative bookkeeping

mbr = grp[['user_id', 'group_id']].merge(grp[['user_id', 'group_id']], on='group_id', how='left')
mbr.columns = ['user_id', 'group_id', 'member_id']
mbr.describe()

In [None]:
# count the participants at each group level for every user

x = mbr.groupby(['user_id', 'group_id']) \
    .agg({'member_id':'nunique'}) \
    .reset_index(1) \
    .rename(columns={'member_id':'nbr_member'})
x.describe()

In [None]:
x.plot.hist(bins=30, alpha=0.7)

_**Observation:**_ Groups with only one participant should be excluded from the summary statistics.
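
One possible way to apply that exclusion, sketched on a made-up membership table (`links`, `group_size` and `shared` are illustrative names).

```python
import pandas as pd

# hypothetical memberships: g2 has a single participant
links = pd.DataFrame({
    "group_id": ["g1", "g1", "g2", "g3", "g3", "g3"],
    "user_id":  ["Ua", "Ub", "Ua", "Uc", "Ud", "Ue"],
})

# participants per group, keeping only groups that are actually shared
group_size = links.groupby("group_id")["user_id"].nunique()
shared = group_size[group_size > 1]

print(shared.describe())
```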

In [None]:
# for every user, count the unique users across _all_ associated groups

cnx = mbr.groupby('user_id').agg({'group_id': 'nunique', 'member_id':'nunique'})
cnx.columns = ['n_grp', 'nbr_connection']
cnx['nbr_connection'] = cnx['nbr_connection'] - 1 # remove user itself from count
cnx.hist(bins=40, grid=False, alpha=.7)

## Frequency

What is the typical frequency of logging financial transactions?  
- number of times per week
- interval (or elapsed time) between events (logging transactions)
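
Both measures on made-up timestamps, as a sketch. The frame `toy` is illustrative, and the interval is expressed in whole days here so that the aggregation runs on plain numbers.

```python
import pandas as pd

# hypothetical entry timestamps for two users (2024-01-01 is a Monday)
toy = pd.DataFrame({
    "user_id": ["Ua"] * 4 + ["Ub"] * 2,
    "ts": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-10",
        "2024-01-01", "2024-01-31",
    ]),
})

# entries per user per calendar week; only weeks with at least one entry appear
weekly = toy.groupby(["user_id", pd.Grouper(key="ts", freq="W")]).size()
per_week = weekly.groupby("user_id").agg(nbr_wks="count", fq_mean="mean")

# elapsed days between consecutive entries of the same user
toy = toy.sort_values(["user_id", "ts"])
toy["gap_days"] = toy.groupby("user_id")["ts"].diff().dt.days
gap_median = toy.groupby("user_id")["gap_days"].median()

print(per_week)
print(gap_median)
```

The weekly mean counts active weeks only, so it describes how much a user logs in a week when logging at all, not the rate over the whole tenure.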

In [None]:
# set datetime index in order to resample frequency of event
# cashflow.reset_index(inplace=True)
cashflow = cashflow.set_index('ts').sort_index()

In [None]:
weekly_counts = cashflow[~cashflow.isBad].groupby(['user_id', pd.Grouper(freq='W')])['user_id'].size()

In [None]:
# named aggregation fixes the column names; a set of function names has no guaranteed order
fq = weekly_counts.groupby('user_id').agg(nbr_wks='count', fq_median='median', fq_mean='mean')
fq.info()

In [None]:
fq.describe(percentiles=[.25, .5, .6, .7, .75, .8, .9, .95, .99])

Calculate interval

In [None]:
# if we just want to know how frequently users record their personal finances,
# it is not important to separate income from expense entries

cashflow.reset_index(inplace=True)
y_sorted = cashflow[~cashflow.isBad].sort_values(['user_id', 'ts'])
y_sorted['time_elapsed'] = y_sorted.groupby('user_id')['ts'].diff()

In [None]:
# y_sorted[['user_id', 'ts', 'time_elapsed']].tail(30)

In [None]:
# intervals = y_sorted.groupby('user_id').agg({'time_elapsed':['median', 'mean', 'max']})
# n.b. a known bug with median, use quantile(0.5) workaround
#      workaround also includes renaming the column <lambda_0>
intervals = y_sorted.groupby('user_id').agg({'time_elapsed':[lambda x: x.quantile(0.5), 'mean', 'max']})
# relabel column by column, in the same order as the function list;
# set_levels relabels the level values, whose order may differ from the column order
intervals.columns = pd.MultiIndex.from_product([['time_elapsed'], ['median', 'mean', 'max']])
intervals.describe()

In [None]:
# flatten multilevel column index
# intervals.columns = intervals.columns.to_flat_index()
intervals.columns = ['_'.join(col) for col in intervals.columns]

In [None]:
# intervals.head(30)

In [None]:
fq.info()

In [None]:
intervals.info()

In [None]:
freq = fq.merge(intervals, on='user_id').convert_dtypes()

In [None]:
freq.info()

### WIP

In [None]:
# 
x_sorted = cashflow.sort_values(['user_id', 'is_expense', 'ts'])
x_sorted['days_elapsed'] = x_sorted.groupby(['user_id', 'is_expense'])['ts'].diff()

In [None]:
x_sorted[['user_id', 'is_expense', 'ts', 'days_elapsed']].tail(30)

In [None]:
fq_ = x_sorted.groupby(['user_id', 'is_expense']).agg({'days_elapsed':[lambda x: x.quantile(0.5), 'mean', 'max']})

In [None]:
fq_.query("is_expense").describe()

In [None]:
#fq =
fq_.groupby(['is_expense']).agg(['min', 'median', 'mean', 'max'])
# fq

In [None]:
fq_.xs('Uffffed94576a41cb306b899c40719ed9', level='user_id')

In [None]:
fq_.xs(False, level='is_expense').agg(['min','mean','median'])
# fq.index

## Merge and Impute
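
The pattern used in this section, sketched on made-up frames: keep every user from the base table, left-join the optional statistics, then fill the gaps with a neutral value. `base` and `grp_stats` are illustrative stand-ins for `tenure` and `td_grp`.

```python
import pandas as pd

# hypothetical base table (every user) and optional table (group users only)
base = pd.DataFrame({"user_id": ["Ua", "Ub", "Uc"], "nbr_entry": [10, 3, 1]})
grp_stats = pd.DataFrame({"user_id": ["Ua"], "n_grp": [2]})

# the left join leaves missing values for users without group activity;
# convert_dtypes() turns the resulting float column into nullable Int64
merged = base.merge(grp_stats, on="user_id", how="left").convert_dtypes()

# a user without group entries has zero groups
merged = merged.fillna({"n_grp": 0})
print(merged)
```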

In [None]:
td = users[['user_id', 'is_agree', 'ts']].rename(columns={'ts':'user_ts'}) \
    .merge(tenure, on='user_id', how='right') \
    .merge(freq, on='user_id', how='left') \
    .merge(cnx['nbr_connection'], on='user_id', how='left') \
    .merge(td_grp, on='user_id', how='left') \
    .convert_dtypes()
# td.info()

In [None]:
values = {'is_agree':False,
          'nbr_connection':0, 'n_grp':0,
          'grp_exp':0, 'nbr_grp_exp':0,
          'grp_inc':0, 'nbr_grp_inc':0}
td.fillna(value=values, inplace=True)

In [None]:
td.describe()

In [None]:
td.info()

In [None]:
# mode='x' raises FileExistsError if the file exists; catch it so the run can proceed
# to-do: convert user_tenure from timedelta days to int before .to_csv
try:
    td.to_csv('../reports/tidy.csv', mode='x')
except FileExistsError:
    print('tidy.csv already exists; not overwritten')

In [None]:
td[td.user_id=='Uffffed94576a41cb306b899c40719ed9']

# EDA

## WIP: Group of One

What are the patterns of this cluster of users
who track income or expenses using one or more groups
with no other members in the group?

In [None]:
x = mbr.groupby('group_id').agg(n_member=('member_id', 'nunique'))
y = x.query('n_member == 1')
grp_lst = y.index.tolist()
grp_one = cashflow[(cashflow.amt != 0) & (cashflow.group_id.isin(grp_lst)) & ~cashflow.isBad] \
    .groupby('group_id') \
    .agg(first_entry=('ts', 'min'), last_entry=('ts', 'max'),
         nbr_entry=('amt', 'count'),
         grp_exp = ('amt', lambda x: x[x < 0].sum()),
         nbr_grp_exp = ('amt', lambda x: x[x < 0].count()),
         grp_inc = ('amt', lambda x: x[x > 0].sum()),
         nbr_grp_inc = ('amt', lambda x: x[x > 0].count())
        )

grp_one.describe()

In [None]:
len(grp_lst) / x.shape[0]
# x.shape[0]

## Survival Analysis
churn event: user blocked/unfollowed

`churned`: `is_agree` is `False`

right censored: `churned` is `False`

if churned and `registered` (a.k.a. `user_ts`) > `last_entry`, make end time = `registered`

if not churned, make end time = `tsl[1]`, the end of the observation period

Open question: how to treat users with `registered` < `last_entry` but `churned` is `True`?
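
The notebook does not say which estimator will be used on these definitions. As one common choice for right-censored data, here is a hand-rolled Kaplan-Meier sketch on made-up durations; the frame `toy` and its values are illustrative only.

```python
import pandas as pd

# hypothetical durations in days; churned=False means right-censored
toy = pd.DataFrame({
    "duration": [5, 8, 8, 12, 20, 20],
    "churned":  [True, True, False, True, False, False],
})

# per distinct duration: churn events and users leaving the risk set
by_time = toy.groupby("duration").agg(events=("churned", "sum"),
                                      leaving=("churned", "size"))

# users still at risk just before each duration
by_time["at_risk"] = len(toy) - by_time.leaving.cumsum().shift(fill_value=0)

# Kaplan-Meier estimate: running product of (1 - events / at_risk)
by_time["survival"] = (1 - by_time.events / by_time.at_risk).cumprod()
print(by_time)
```

Censored users lower the risk set without lowering the curve, which is why a heavily censored dataset still yields an estimate.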


In [None]:
# data set highly censored -> bias observations
td.groupby('is_agree').size()

In [None]:
# time elapsed since the last transaction entry until the
# observation cut off period tsl[1]
td['days_since'] = tsl[1] - td.last_entry

In [None]:
# data issue?
td.query("user_ts.isna()").shape

calculate survival time

In [None]:
# Is left censoring necessary? No.
(td.user_ts < tsl[0]).sum()

In [None]:
# calculate start time
td['t0'] = td[['user_ts', 'first_entry']].min(axis=1)

calculate end time and survival_time

`tsl[1]` is the observation end time  

if `is_agree` is `False`, set the user end time to the larger of
`last_entry` and the user record timestamp `ts` from the acc_user table
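
The rule on three made-up users, as a sketch. `obs_end` plays the role of `tsl[1]`; the frame `toy` and its dates are illustrative only.

```python
import pandas as pd

obs_end = pd.Timestamp("2025-02-01")  # plays the role of tsl[1]

# hypothetical users: Ua still follows, Ub and Uc have blocked the OA
toy = pd.DataFrame({
    "user_id": ["Ua", "Ub", "Uc"],
    "is_agree": [True, False, False],
    "user_ts": pd.to_datetime(["2020-01-01", "2021-06-01", "2020-03-01"]),
    "first_entry": pd.to_datetime(["2020-01-05", "2020-02-01", "2020-03-02"]),
    "last_entry": pd.to_datetime(["2024-12-01", "2021-01-01", "2022-03-01"]),
}).set_index("user_id")

# start: whichever comes first, the user record or the first entry
toy["t0"] = toy[["user_ts", "first_entry"]].min(axis=1)

# end: the later of user record and last entry for blocked users,
# otherwise the end of the observation period (right-censored)
blocked = ~toy.is_agree
toy["t1"] = toy[["user_ts", "last_entry"]].max(axis=1).where(blocked, obs_end)

toy["survival_time"] = toy.t1 - toy.t0
print(toy[["t0", "t1", "survival_time"]])
```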

In [None]:
td['t1'] = tsl[1]
td.loc[~td.is_agree, 't1'] = td.loc[~td.is_agree, ['user_ts', 'last_entry']].max(axis=1, skipna=True)
td['survival_time'] = td.t1 - td.t0

In [None]:
# how many have churned at or beyond the end of the observation period
td.loc[~td.is_agree & (td.t1 >= tsl[1])]

In [None]:
# what portion of users have churned?
(~td.is_agree).sum() / td.shape[0]

In [None]:
# how many have churned before the end of the observation period
td[~td.is_agree & (td.t1 < tsl[1])].shape

In [None]:
# ... and what is the percentage?
td[~td.is_agree & (td.t1 < tsl[1])].shape[0] / td.shape[0]

In [None]:
# how many have supposedly blocked but still made entries
# vs. true churned(?) user_ts > last_entry
(td[(~td.is_agree) & (td.user_ts < td.last_entry)].shape[0],
td[(~td.is_agree) & (td.user_ts > td.last_entry)].shape[0])

Calculating Churn

In [None]:
# churned vs.
# semi-churned, when users blocked but still made entries past the user_ts flag
td['churned'] = ~td.is_agree
td['churned_'] = ~td.is_agree & (td.user_ts < td.last_entry)

# blocked users (is_agree is False) who still made entries after user_ts
td.loc[~td.is_agree & td.churned_,
       ['user_ts', 'last_entry', 'user_tenure',
        'days_since', 'survival_time', 't1', 't0',
        'churned_', 'user_id']]

In [None]:
td.churned.sum() / td.shape[0], (~td.is_agree).sum() / td.shape[0]

In [None]:
td[['last_entry', 'user_tenure', 'days_since', 'survival_time']].describe()

_**Observations:**_
- (incorrect, to be double-checked) > 50% of users churned after 7 days, 75% churned after 139 days (or ~4.6 months)
- 75% of users had not made any entries for >1146 days (`days_since`)
- (incorrect, to be double-checked) the top quartile (best or most active users) had made at least one entry since 2021-12-12 (or 48~50 days)

In [None]:
# what do you observe from the top quartile...

t0 = td.days_since.quantile(.25) # 1146 days since last entry
print('days_since =', t0.days)
(td.loc[td.days_since < t0, ['days_since', 'user_tenure', 'nbr_entry', 'last_entry']].describe())

In [None]:
td[td.user_id == 'U000046b3786c997220a07872c5191c37']

In [None]:
fq.xs('U000046b3786c997220a07872c5191c37')

## Segmentation by `days_since`, `survival_time` and `user_tenure`
- `days_since` is the number of days since the user made the last entry
- `user_tenure` is the number of days between the user's first and last expense or income entry
- `survival_time` is the number of days between the time the system first recognizes the user's activity and the last day of observation ('2025-01-31') or the time the user churned
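
A sketch of the binning used for this segmentation, on made-up values: convert the timedelta to whole days, cut into quantile bins, and tag each bin by its right edge. Five bins are used here only to keep the toy example small.

```python
import pandas as pd

# hypothetical days since the last entry for ten users
days_since = pd.Series(
    pd.to_timedelta([3, 8, 15, 30, 45, 60, 90, 120, 200, 400], unit="D")
)

# qcut works on numbers, so convert the timedeltas to whole days first
bins = pd.qcut(days_since.dt.days, q=5)

# tag every bin by its (rounded) right edge to get a compact, sortable label
right_edge = bins.apply(lambda iv: round(iv.right)).astype(int)

print(right_edge.value_counts().sort_index())
```

Each label reads as "at most this many days", which makes the heatmap axes easier to interpret than interval strings.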



In [None]:
# [(x.left.round('D').days, x.right.days) for x in pd.qcut(td.survival_time, q=10).unique().sort_values()]

In [None]:
# [(x.left.round('D').days, x.right.days) for x in pd.qcut(td.days_since, q=10).unique().sort_values()]

In [None]:
# [(round(x.left), round(x.right)) for x in pd.qcut(td.user_tenure.dt.days, q=10, duplicates='drop').unique().sort_values()]
# td.info()

In [None]:
td_ = td #.merge(freq, on='user_id', how='left')
td_.isna().sum()

In [None]:
# count of null time_elapsed...
# n.b. parentheses are required: & binds more tightly than ==
td_[td_.time_elapsed_median.isna() & (td_.nbr_entry == 1)].shape

In [None]:
# [round(i.right) for i in pd.qcut(df.fq_mean, q=12, duplicates='drop').cat.categories]

In [None]:
df = td_[~td_.churned]
df = df.loc[(df.days_since < df.days_since.quantile(.10)) #& (td.user_tenure > td.user_tenure.quantile(.5))
       , ['user_id', 'days_since', 'user_tenure', 'survival_time', 'fq_mean', 'fq_median']].copy()
df.info()
df.describe(percentiles=[.1, .2, .3, .4, .5, .6, .7, .8, .9])

In [None]:
# cut and tag quantiles
df['days_since_decile'] = pd.qcut(df.days_since.dt.days, q=10)
df['days_since_decile'] = df['days_since_decile'].apply(lambda x: (x.right).astype('int') )  
df['user_tenure_decile'] = pd.qcut(df.user_tenure.dt.days, q=10, duplicates='drop')
df['user_tenure_decile'] = df['user_tenure_decile'].apply(lambda x: round(x.right)) 
df['survival_time_decile'] = pd.qcut(df.survival_time.dt.days, q=10)
df['survival_time_decile'] = df['survival_time_decile'].apply(lambda x: round(x.right))
df['mean_wk_fq'] = pd.qcut(df.fq_mean, q=12, duplicates='drop')
df['mean_wk_fq'] = df['mean_wk_fq'].apply(lambda x: round(x.right))
df['median_wk_fq'] = pd.qcut(df.fq_median, q=12, duplicates='drop')
df['median_wk_fq'] = df['median_wk_fq'].apply(lambda x: round(x.right))

In [None]:
import seaborn as sns

# df_plot = df.groupby(['days_since_decile', 'user_tenure_decile'], observed=True).agg({'user_id':'count'}).reset_index()
# df_plot = df_plot.pivot(index='user_tenure_decile', columns='days_since_decile', values='user_id')
df_plot = df.groupby(['days_since_decile', 'survival_time_decile'], observed=True).agg({'user_id':'count'}).reset_index()
df_plot = df_plot.pivot(index='survival_time_decile', columns='days_since_decile', values='user_id')

# Set figure size globally
sns.set_theme(rc={'figure.figsize': (12, 6)})

_ = sns.heatmap(df_plot, annot=False, cmap='Greens')
df_plot

In [None]:
import seaborn as sns

# df_plot = df.groupby(['days_since_decile', 'user_tenure_decile'], observed=True).agg({'user_id':'count'}).reset_index()
# df_plot = df_plot.pivot(index='user_tenure_decile', columns='days_since_decile', values='user_id')
df_plot = df.groupby(['days_since_decile', 'mean_wk_fq'], observed=True).agg({'user_id':'count'}).reset_index()
df_plot = df_plot.pivot(index='mean_wk_fq', columns='days_since_decile', values='user_id')

# Set figure size globally
sns.set_theme(rc={'figure.figsize': (12, 6)})

_ = sns.heatmap(df_plot, annot=False, cmap='Purples')
df_plot

In [None]:
# same view using the median weekly frequency
df_plot = df.groupby(['days_since_decile', 'median_wk_fq'], observed=True).agg({'user_id':'count'}).reset_index()
df_plot = df_plot.pivot(index='median_wk_fq', columns='days_since_decile', values='user_id')

# Set figure size globally
sns.set_theme(rc={'figure.figsize': (12, 6)})

_ = sns.heatmap(df_plot, annot=False, cmap='Blues')
df_plot

In [None]:
# extreme cases
td_.loc[(td_.fq_mean > 70) & ~td_.churned,
['fq_mean', 'user_tenure', 'days_since', 'user_id', 'nbr_entry', 'n_grp', 'nbr_connection']]

In [None]:
cashflow.query("user_id == 'U8835e86e095f591d93b8d36454174525'").groupby('yyyy_mm').size().plot()

In [None]:
cashflow.query("user_id == 'Uff7dc69b55ff36a6cf8fa0bd1e0356c8' & ts > '2025-01-25'")

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Compute correlation matrix
corr_matrix = df.drop(columns='user_id').corr()

# Create heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix Heatmap")
plt.show()

In [None]:
# Create pairplot
sns.pairplot(df.drop(columns='user_id'), kind="scatter", corner=True)
plt.suptitle("Pairwise Scatterplots of Correlations", y=1.02)

## Explore Group 

In [None]:
# number of members in each group
nbr_mbr_grp = grp.groupby('group_id')['user_id'].nunique()

nbr_mbr_grp.agg(['min', 'max', 'mean', 'median'])

In [None]:
nbr_mbr_grp.quantile([.25, .5, .75, .8, .9, .95, .99])

In [None]:
# is this interesting?

cashflow.groupby(['user_id'])['amt'].sum().quantile([.1, .25, .5, .75, .9])

## Expense by Category



In [None]:
# private entries
cashflow.query("is_group == False & is_expense == True").groupby('category') \
    .agg(n_user = ('user_id','nunique'),
         nbr_expense = ('category', 'count'),
         nbr_expense_per_user = ('user_id', lambda x: round(x.count() / x.nunique(), 2)),
         avg_amt = ('amt', lambda x: -(round(x.mean())))
        ) \
    .sort_values(by='nbr_expense',  ascending=False)

In [None]:
# group entries
cashflow.query("is_group == True & is_expense == True").groupby('category') \
    .agg(n_user = ('user_id','nunique'),
         nbr_expense = ('category', 'count'),
         nbr_expense_per_user = ('user_id', lambda x: round(x.count() / x.nunique(), 2)),
         avg_amt = ('amt', lambda x: -(round(x.mean())))
        ) \
    .sort_values(by='nbr_expense',  ascending=False)

# Supplemental

```sql
-- consolidate expense and income entry to simplify analysis

USE zoo;

DROP TABLE IF EXISTS acc_cashflow;

CREATE TABLE acc_cashflow (
  user_id VARCHAR(100) NOT NULL,
  date DATE NOT NULL,
  amt DOUBLE NOT NULL,
  is_expense BOOLEAN,
  is_group BOOLEAN,
  group_id VARCHAR(100),
  category_id INT,
  category VARCHAR(100),
  note VARCHAR(160)
) COMMENT = 'consolidated expense and income entries'
;

-- insert personal expense
INSERT INTO acc_cashflow
SELECT
  USER_ID user_id,
  DATE(CREDTM) date,
  -AMOUNT amt,
  TRUE is_expense,
  FALSE is_group,
  NULL group_id,
  A.CATEGORY category_id,
  B.CATEGORY category,
  NOTE note
FROM zoo.ACC_USER_DETAIL A
LEFT JOIN zoo_checkchick3.ACC_CATEGORY B
ON A.CATEGORY = B.ID
;

-- insert group expense
INSERT INTO acc_cashflow
SELECT
  USER_ID user_id,
  DATE(CREDTM) date,
  -AMOUNT amt,
  TRUE is_expense,
  TRUE is_group,
  GROUP_ID group_id,
  A.CATEGORY category_id,
  B.CATEGORY category,
  NOTE note
FROM zoo_checkchick.ACC_GROUP_DETAIL A
LEFT JOIN zoo_checkchick3.ACC_CATEGORY B
ON A.CATEGORY = B.ID
;

-- insert personal income
INSERT INTO acc_cashflow
SELECT
  USER_ID user_id,
  DATE(CREDTM) date,
  AMOUNT amt,
  FALSE is_expense,
  FALSE is_group,
  NULL group_id,
  A.CATEGORY category_id,
  B.CATEGORY category,
  NOTE note
FROM zoo_checkchick2.ACC_USER_DETAIL_INCOME A
LEFT JOIN zoo_checkchick3.ACC_CATEGORY_INCOME B
ON A.CATEGORY = B.ID
;

-- insert group income
INSERT INTO acc_cashflow
SELECT
  USER_ID user_id,
  DATE(CREDTM) date,
  AMOUNT amt,
  FALSE is_expense,
  TRUE is_group,
  GROUP_ID group_id,
  A.CATEGORY category_id,
  B.CATEGORY category,
  NOTE note
FROM zoo_checkchick2.ACC_GROUP_DETAIL_INCOME A
LEFT JOIN zoo_checkchick3.ACC_CATEGORY_INCOME B
ON A.CATEGORY = B.ID
;
```