# **Analysis and visualization of the data**

**Questions for this analysis**.
1. Where did my money go?
2. How can I cut expenses?

To answer these questions, I create proxy tasks:
1. Identify expenses:
   - total by category
   - total per month
   - total and average per place
2. Identify the top places by length of stay and break them down by category
3. Determine the nature of the expense (need, want) and compare the goal/actual ratio
4. Break down the top categories by subcategory
5. Create recommendations based on the results

## **Import libraries and read the data from SQLite database file**

In [1]:
import pandas as pd
import sqlite3 as sq
import numpy as np
import altair as alt
from datetime import datetime
from altair import datum


In [2]:
# read data from SQLite database file, parse dates from 'date' column
cnx = sq.connect("data/EXPENSES.db")
df = pd.read_sql_query("SELECT * FROM dftravel",
                       cnx, parse_dates=["date"])
df.head()


| | date | year | month | day | weekday | time | category | subcategory | nature | amount | account | payment_type | lat | lng | place | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-08-31 14:33:25 | 2022 | August | 31 | Wednesday | 14:33:25.000000 | Food and Drinks | Coffee | want | 1.4 | Hanseatic Visa | Credit card | 12.923556 | 100.882455 | Pattaya | Thailand |
| 1 | 2022-08-31 14:33:25 | 2022 | August | 31 | Wednesday | 14:33:25.000000 | Food and Drinks | Groceries | need | 5.53 | Hanseatic Visa | Credit card | 12.923556 | 100.882455 | Pattaya | Thailand |
| 2 | 2022-08-31 14:02:09 | 2022 | August | 31 | Wednesday | 14:02:09.000000 | Transportation | Public transport | need | 0.55 | Thai Baht cash | Cash | 12.923556 | 100.882455 | Pattaya | Thailand |
| 3 | 2022-08-31 12:21:11 | 2022 | August | 31 | Wednesday | 12:21:11.000000 | Transportation | Public transport | need | 0.27 | Thai Baht cash | Cash | 12.923556 | 100.882455 | Pattaya | Thailand |
| 4 | 2022-08-31 11:12:21 | 2022 | August | 31 | Wednesday | 11:12:21.000000 | Food and Drinks | Coffee | want | 3.3 | Thai Baht cash | Cash | 12.923556 | 100.882455 | Pattaya | Thailand |


### **Total expenses**  
Here I just quickly aggregate the data to find out the total expenses.

#### **Quick look at the summary**

In [3]:
df["amount"].agg(["sum", "count", "mean", "max", "min"])


sum      18553.07000
count     2113.00000
mean         8.78044
max       2296.35000
min          0.04000
Name: amount, dtype: float64

#### **Total per Category**  
Group by category to find the totals of each category for the whole period.

In [39]:
b1 = alt.Chart(df).mark_bar(cornerRadius=2).encode(
    alt.X("sum(amount)", title="Total Amount"),
    alt.Y("category:N", sort="-x", title="Category"),
).properties(
    title='Total Travel Expenses per Category from Oct 2021 to Aug 2022'
).configure_axis(grid=False)
b1


The bar graph above reveals the top categories of spending and the total amount spent.
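To read the exact figures behind the bars, the same aggregation can be done in pandas. This is a minimal sketch, not part of the original notebook: the helper name `category_totals` and the `share_pct` column are illustrative assumptions, and the sample rows are made up so the block runs on its own.

```python
import pandas as pd


def category_totals(expenses: pd.DataFrame) -> pd.DataFrame:
    """Total amount and share of overall spending per category, largest first."""
    totals = (
        expenses.groupby("category")["amount"]
        .sum()
        .sort_values(ascending=False)
        .reset_index()
    )
    # share of each category in the overall total, in percent
    totals["share_pct"] = (totals["amount"] / totals["amount"].sum() * 100).round(1)
    return totals


# Made-up sample rows using the same column names as df (not real data)
sample_expenses = pd.DataFrame(
    {
        "category": ["Food and Drinks", "Shopping", "Food and Drinks", "Transportation"],
        "amount": [10.0, 30.0, 5.0, 5.0],
    }
)
totals = category_totals(sample_expenses)
print(totals)
```

In the notebook itself, `category_totals(df)` would return the numbers that the chart displays.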

#### **Total per Month**  
What about total spending per month? 

In [44]:
bars = (
    alt.Chart(df)
    .mark_bar(cornerRadius=2)
    .encode(
        alt.Y("sum(amount)", title="Total Amount"),
        alt.X("month(date):O", title=None),
        # color='category:N',
    )
)

# one facet column per year; independent x scales so each year shows only its own months
alt.layer(bars, data=df).facet(
    column=alt.Column("year(date)", title=None)
).resolve_scale(
    x="independent"
).configure_axis(grid=False
).properties(
    title='Total Expenses by Month')




The graph shows the total spending per month. The first three months of travel were the most expensive. From January onward, expenses were significantly lower, and the trend is downward.

#### **Total and average per place**
Let's find out how much I spent in total and on average in each place I've been to.

In [6]:
# make a new df with totals and avg per place
# days_cnt = number of distinct calendar days with at least one expense in a place
place_count_days = (
    df.groupby(["place"])["date"].apply(
        lambda x: x.dt.date.nunique()).reset_index()
)
sum_place = df.groupby("place")["amount"].sum().reset_index()
place_sum = pd.merge(place_count_days, sum_place, on="place")
place_sum = place_sum.rename(columns={"date": "days_cnt"})
# average spending per day, rounded to whole units
place_sum["day_avg"] = (place_sum["amount"] / place_sum["days_cnt"]).round()


In [47]:
alt.Chart(place_sum).mark_point(filled=True).encode(
    alt.X("place", title="Place"),
    alt.Y("day_avg", title="Daily AVG"),
    size=alt.Size("days_cnt", title="Day Count"),
    tooltip=["place", "days_cnt", "amount", "day_avg"],
).interactive(
).properties(
    title='Average Amount and Duration in Days per Place'
)


In the graph above, the y-axis shows the daily average amount and the x-axis the place. The size of each point represents the length of stay in that place.
The two top places by length of stay are Bangkok and Pattaya. The amount spent in Bangkok is twice as high as in Pattaya.
The daily average ranges from about 25 to 50 EUR in most places; Bangkok, Hua Hin and Phuket are exceptions at more than 60 EUR per day.
Hover over the points to see additional information.


### **Category by place**  
The previous graph revealed the top places in terms of duration: Bangkok and Pattaya. I will focus on these two places and compare their category expenses.

In [48]:
# shared category order for the facet columns and the color legend
category_order = [
    "Shopping",
    "Food and Drinks",
    "Life and Entertainment",
    "Housing",
    "Financial Expenses",
]

alt.Chart(df).mark_bar(cornerRadius=2).encode(
    alt.X(
        "place:N",
        title=None,
        sort=alt.EncodingSortField(
            field="amount", op="sum", order="descending"),
    ),
    alt.Y(
        "sum(amount):Q",
        title="Total Amount",
    ),
    alt.Column(
        "category",
        sort=category_order,
        title=None,
        header=alt.Header(labels=False)
    ),
    alt.Color(
        "category",
        sort=category_order,
        title='Category'
    ),
).transform_filter(
    alt.FieldOneOfPredicate(field="place", oneOf=["Bangkok", "Pattaya"])
).configure_axis(grid=False
).properties(
    title='Category Total Amounts Bangkok vs Pattaya')


It seems like I've spent more in Bangkok in almost every category.
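The "almost every category" impression can be checked numerically with a pivot table. The following is a minimal sketch under stated assumptions: `compare_places` and the `difference` column are hypothetical names, and the sample rows are invented purely to make the block runnable.

```python
import pandas as pd


def compare_places(expenses: pd.DataFrame, first: str, second: str) -> pd.DataFrame:
    """Category totals for two places side by side, plus first minus second."""
    subset = expenses[expenses["place"].isin([first, second])]
    table = subset.pivot_table(
        index="category",
        columns="place",
        values="amount",
        aggfunc="sum",
        fill_value=0,  # a category with no spending in one place counts as 0
    )
    # positive values mean more was spent in `first` than in `second`
    table["difference"] = table[first] - table[second]
    return table


# Made-up sample rows using the same column names as df (not real data)
sample_places = pd.DataFrame(
    {
        "place": ["Bangkok", "Pattaya", "Bangkok", "Pattaya", "Hua Hin"],
        "category": ["Shopping", "Shopping", "Food and Drinks", "Food and Drinks", "Food and Drinks"],
        "amount": [100.0, 40.0, 50.0, 60.0, 20.0],
    }
)
comparison = compare_places(sample_places, "Bangkok", "Pattaya")
print(comparison)
```

Applied to the real data, `compare_places(df, "Bangkok", "Pattaya")` would list the categories where the difference is negative, that is, where Pattaya was the more expensive place.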


### **Expense nature: need and want**
I want to know the monthly ratio of need vs want spending and see whether I met the target ratio: need 80%, want 20%.
Considering how much I spent on shopping, as shown in the previous graph, I doubt I hit the target ratio.

In [53]:
# create a new df: group by year, month and calculate the percentage of need and want nature of total monthly expense
need_want = df.groupby(['year', 'month', 'nature'])[
    'amount'].sum().reset_index()

need_want['perc_total'] = round((need_want['amount'].div(
    need_want.groupby(['year', 'month'])['amount'].transform('sum')))*100)


In [55]:
data = need_want
bars = (
    alt.Chart(need_want).mark_bar(cornerRadius=2).encode(
        x=alt.X('sum(perc_total)', stack="normalize",
                axis=alt.Axis(format='%'), title='% of monthly total'),
        y=alt.Y('month:O', title=None, sort=['January', 'February', 'March', 'April', 'May',
                'June', 'July', 'August', 'September', 'October', 'November', 'December']),
        color=alt.Color('nature', title='Spending Nature')
    )
)

alt.layer(bars, data=data).facet(row=alt.Row("year", title=None)).resolve_scale(
    y="independent"
).properties(
    title='Spending Nature by Month')


The graph shows that I missed the target ratio in every month except February and March 2022, although the general trend is toward the goal.
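The goal/actual comparison can also be expressed as a table with one flag per month. This is a minimal sketch: it assumes that "hitting the target" means a need share of at least 80%, which is one reading of the target ratio, and the names `need_share_by_month`, `need_pct` and `target_met` are illustrative. The sample rows are made up.

```python
import pandas as pd

TARGET_NEED_PCT = 80  # target from the text: need 80%, want 20%


def need_share_by_month(expenses: pd.DataFrame) -> pd.DataFrame:
    """Share of 'need' spending per month and whether it reaches the target."""
    totals = (
        expenses.groupby(["year", "month", "nature"])["amount"]
        .sum()
        .unstack("nature", fill_value=0)  # one column per nature (need, want)
    )
    need_pct = (totals["need"] / totals.sum(axis=1) * 100).round(1)
    result = need_pct.rename("need_pct").reset_index()
    # assumption: the target is met when the need share is at least 80%
    result["target_met"] = result["need_pct"] >= TARGET_NEED_PCT
    return result


# Made-up sample rows using the same column names as df (not real data)
sample_nature = pd.DataFrame(
    {
        "year": [2022, 2022, 2022, 2022],
        "month": ["February", "February", "July", "July"],
        "nature": ["need", "want", "need", "want"],
        "amount": [90.0, 10.0, 60.0, 40.0],
    }
)
nature_report = need_share_by_month(sample_nature)
print(nature_report)
```

Note that grouping by month name sorts the months alphabetically, so the result would need an explicit month order before plotting.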

### **Top 3 Categories broken down by subcategories**
What are the total category expenses broken down by subcategories?

In [58]:
selection = alt.selection_multi(fields=['category'], bind='legend')

alt.Chart(df).mark_bar(cornerRadius=2).encode(
    alt.X('sum(amount)', title='Total Amount'),
    alt.Y('subcategory', sort='-x', title='Subcategory'),
    alt.Color('category', title='Category'),
    opacity=alt.condition(selection, alt.value(1), alt.value(0.3))
).transform_filter(
    alt.FieldOneOfPredicate(field="category", oneOf=[
        "Food and Drinks", "Shopping", "Life and Entertainment"])
).add_selection(
    selection
).properties(
    title='Category broken down by Subcategory'
).configure_axis(grid=False)




The graph shows the total amount per category broken down by subcategory. Click a category in the legend to highlight its subcategories in the graph.


### **Actions to take to reduce expenses and further plans to enhance the report**

#### **Actions to take**
Based on the findings of the analysis, there are several actions I should take to better control my finances:
1. Budgeting:
   1. Set a total budget per month and actively monitor whether I stick to it.
   2. Set category budgets, especially for shopping, where I tend to spend a lot of money on things that are not necessarily important.
   3. Control the expense nature ratio with a weekly finance review.
      1. Adjust the target ratio once there is a source of income again: need 50%, want 30%, save 20%.
2. Create more subcategories in shopping and food categories in order to better track the expenses. For example, expand the "Food and Drinks" category by "Eating out" and "Coffee" as there are more to
3. Mindfulness training to reduce impulsive buying habits.
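Monitoring a monthly budget could look roughly like the sketch below. The budget value of 1500 is purely hypothetical (the notebook does not state one), the helper name `monthly_budget_report` is an assumption, and the sample rows are invented.

```python
import pandas as pd

MONTHLY_BUDGET = 1500.0  # hypothetical value, not taken from the analysis


def monthly_budget_report(expenses: pd.DataFrame, budget: float) -> pd.DataFrame:
    """Spending per calendar month compared with a fixed monthly budget."""
    month = expenses["date"].dt.to_period("M").rename("month")
    report = expenses.groupby(month)["amount"].sum().rename("spent").reset_index()
    report["remaining"] = budget - report["spent"]  # negative means overspent
    report["over_budget"] = report["spent"] > budget
    return report


# Made-up sample rows using the same column names as df (not real data)
sample_budget = pd.DataFrame(
    {
        "date": pd.to_datetime(["2022-01-05", "2022-01-20", "2022-02-10"]),
        "amount": [1000.0, 700.0, 900.0],
    }
)
budget_report = monthly_budget_report(sample_budget, MONTHLY_BUDGET)
print(budget_report)
```

The same pattern extends to category budgets by adding `"category"` to the grouping keys and comparing against a per-category limit.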

#### **Report enhancement plans**
1. Make the report more dynamic by building a Plotly dashboard to simplify monitoring
2. Add further graphs to fully utilize the available data and reveal possible insights and patterns
   - Cumulative sum over the whole period
   - Plot the relationship between average amount per category vs total category amount
   - ...
3. Add geo visualization
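
As a starting point for the planned cumulative-sum graph, the running total can be prepared in pandas before plotting. This is a minimal sketch, not an existing part of the report: `cumulative_spending`, `day` and `cumulative_amount` are illustrative names, and the sample rows are made up.

```python
import pandas as pd


def cumulative_spending(expenses: pd.DataFrame) -> pd.DataFrame:
    """Running total of spending, with one row per calendar day."""
    # drop the time of day so all expenses of one day fall into the same group
    day = expenses["date"].dt.normalize().rename("day")
    daily = expenses.groupby(day)["amount"].sum()  # groupby sorts the days
    return daily.cumsum().rename("cumulative_amount").reset_index()


# Made-up sample rows using the same column names as df (not real data)
sample_days = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2022-01-01 10:00", "2022-01-01 12:00", "2022-01-02 09:00"]
        ),
        "amount": [5.0, 3.0, 2.0],
    }
)
cumulative = cumulative_spending(sample_days)
print(cumulative)
```

The resulting frame could then be passed to an Altair line chart with `day` on the x-axis and `cumulative_amount` on the y-axis.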
   