## **Personal Expenses Data Preparation**

### **Data Loading and First Look**
I export the data from the smartphone app I use to track my expenses. The export is a semicolon-delimited CSV file, so I can load it into a pandas DataFrame by specifying the delimiter. I also restrict the load to the columns I need and parse the 'date' column as dates.

In [None]:
import pandas as pd
import numpy as np

# save file path to a variable
fname = "data/report_2022-08-05_110949.csv"
# load the data
df = pd.read_csv(
    fname,
    sep=";",
    usecols=[
        "date",
        "category",
        "account",
        "ref_currency_amount",
        "payment_type_local",
        "gps_latitude",
        "gps_longitude",
        "labels",
    ],
    parse_dates=["date"],
)
df.head()


In [None]:
# rename the columns to something more meaningful to me.
# 'category' becomes 'subcategory' because the values actually refer to subcategories. I will add a 'category' column later.
# rename() maps by name, so the result does not depend on the column order in the file.
df = df.rename(
    columns={
        "category": "subcategory",
        "ref_currency_amount": "amount",
        "payment_type_local": "payment_type",
        "gps_latitude": "lat",
        "gps_longitude": "long",
    }
)
df.head()


### **Data Cleaning and Preparation**

#### **Handling Missing Data**

In [None]:
# check summary of each column
df.info()


Here I see that the 'lat' and 'long' geodata columns do not contain any values. That is fine for now: I will take the place names from the 'labels' column and use the geopy library to look up the coordinates of the places I visited during my travels.
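
To make the empty columns explicit rather than reading them off the `info()` output, the missing values can also be counted directly. This is a minimal sketch on a made-up toy frame (the values are placeholders, not my real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the expense data (all values are made up)
toy = pd.DataFrame(
    {
        "amount": [-12.5, -3.0, 40.0],
        "lat": [np.nan, np.nan, np.nan],
        "labels": ["BIG TRIP|Bangkok", None, "Thailand"],
    }
)

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Names of the columns that contain no values at all
empty_cols = [col for col in toy.columns if toy[col].isna().all()]

print(missing_counts)
print(empty_cols)
```

On the real DataFrame the same two expressions would be applied to `df` instead of `toy`.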

#### **Data Transformation**

In [None]:
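# count fully duplicated rows. Identical rows can be legitimate
# (e.g. two equal purchases on the same day), so inspect them before removing anything
df.duplicated().sum()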
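
`duplicated()` on its own only returns a boolean mask. A small sketch of how the mask could be used to count, inspect and, if really necessary, drop duplicates, again on a made-up toy frame:

```python
import pandas as pd

# Toy frame with one repeated row (values are made up)
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-10-05", "2021-10-05", "2021-10-06"]),
        "subcategory": ["Groceries", "Groceries", "Taxi"],
        "amount": [-4.0, -4.0, -7.5],
    }
)

# Rows that repeat an earlier row
n_dupes = int(toy.duplicated().sum())

# keep=False marks every member of a duplicate group, which is handy for inspection
dupes = toy[toy.duplicated(keep=False)]

# Drop the repeats only after deciding they are not legitimate records
deduped = toy.drop_duplicates().reset_index(drop=True)
```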


In this step, I add the category names. The exported data set doesn't contain them, so I copied them manually from the application and created a dictionary (***category: subcategories***).
I then assign a category to each row based on its subcategory using pandas **map()**.

In [None]:
# create a dictionary with categories as keys and subcategories as values
# also assign the missing category for Fitness Supplements

d = {
    "Food_drinks": [
        "Food & Drinks",
        "Bar, cafe",
        "Groceries",
        "Restaurant, fast-food",
        "Fitness Supplements",
    ],
    "Shopping": [
        "Shopping",
        "Clothes & shoes",
        "Drug-store, chemist",
        "Electronics, accessories",
        "Camera expenses",
        "Free time",
        "Gifts, joy",
        "Health and beauty",
        "Home, garden",
        "Jewels, accessories",
        "Stationery, tools",
    ],
    "Housing": ["Housing", "Energy, utilities", "Maintenance, repairs", "Rent"],
    "Transportation": [
        "Transportation",
        "Business trips",
        "Long distance",
        "Public transport",
        "Taxi",
    ],
    "Vehicle": [
        "Vehicle",
        "Fuel",
        "Leasing",
        "Parking",
        "Rentals",
        "Vehicle insurance",
        "Vehicle maintenance",
    ],
    "Life_Entertainment": [
        "Life & Entertainment",
        "Active sport, fitness",
        "Alcohol, tobacco",
        "Books, audio, subscriptions",
        "Charity, gifts",
        "Culture, sport events",
        "Education, development",
        "Health care, doctor",
        "Hobbies",
        "Holiday, trips, hotels",
        "Sightseeing, activities",
        "Accommodation",
        "Life events",
        "Lottery, gambling",
        "TV, Streaming",
        "Wellness, beauty",
    ],
    "Communication_PC": [
        "Communication, PC",
        "Internet",
        "Phone, mobile phone",
        "Postal services",
        "Software, apps, games",
        "Phone, cell phone",
    ],
    "Financial_expenses": [
        "Financial expenses",
        "Advisory",
        "Charges, Fees",
        "Fines",
        "Insurances",
        "Loan, interests",
        "Taxes",
    ],
    "Investments": [
        "Investments",
        "Financial investments",
        "Collections",
        "Realty",
        "Savings",
        "Vehicles, chattels",
    ],
    "Income": ["Income", "Gifts", "Refunds (tax, purchase)", "Sale", "Wage, invoices"],
    "Other": ["Missing", "Other"],
}


In [None]:
# the dictionary needs to be flattened (inverted to subcategory -> category) before using map()
def flatten_dict(d):
    nd = {}
    for k, v in d.items():
        # if the value is a non-string iterable (e.g. a list), map each item to the key
        if hasattr(v, "__iter__") and not isinstance(v, str):
            for item in v:
                nd[item] = k
        else:
            # a single value is mapped to the key directly
            nd[v] = k
    return nd


In [None]:
# use the new function to flatten the dict
flatten_d = flatten_dict(d)


In [None]:
# and finally map using the pandas map() function to assign the values
df["category"] = df["subcategory"].map(flatten_d)
df.head()
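
`map()` leaves NaN wherever a subcategory has no entry in the dictionary, so it is worth checking for unmapped values afterwards. A minimal sketch with a hypothetical miniature dictionary and a made-up 'Pets' subcategory that is deliberately missing from it:

```python
import pandas as pd

# Hypothetical miniature of the category dictionary
toy_categories = {
    "Food_drinks": ["Groceries", "Bar, cafe"],
    "Vehicle": ["Fuel"],
}

# Invert it to subcategory -> category
sub_to_cat = {sub: cat for cat, subs in toy_categories.items() for sub in subs}

toy = pd.DataFrame({"subcategory": ["Groceries", "Fuel", "Pets"]})
toy["category"] = toy["subcategory"].map(sub_to_cat)

# Subcategories that did not receive a category
unmapped = sorted(toy.loc[toy["category"].isna(), "subcategory"].unique())
print(unmapped)
```

An empty `unmapped` list would mean every row received a category.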


In [None]:
# rearrange the column order
df = df[
    [
        "date",
        "category",
        "subcategory",
        "amount",
        "account",
        "payment_type",
        "lat",
        "long",
        "labels",
    ]
]
df.head()


In [None]:
# convert the amount column to absolute value
df["amount"] = df["amount"].abs()
df.head()


##### **Create a subset DataFrame for the time period at home**

In [None]:
# first of the two subsets: expenses at home, i.e. everything before the cutoff date
dfhome = df.loc[
    (df["date"] < "2021-10-02T00:00:00"),
    ["date", "category", "subcategory", "amount", "account", "payment_type"],
].reset_index(drop=True)
dfhome.head()


##### **Create a subset of the DataFrame containing expenses while travelling**

In [None]:
# create a df subset with the data while travelling
# use >= so that a record dated exactly at the cutoff is not lost between the two subsets
cols = list(df.columns)
dftravel = df.loc[df["date"] >= "2021-10-02T00:00:00", cols].reset_index(drop=True)
dftravel.head()
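
One way to make sure that every record ends up in exactly one of the two subsets is to build a single boolean mask and use its negation for the second subset. A sketch on made-up toy data, using the same cutoff date:

```python
import pandas as pd

# Toy data with one record exactly at the cutoff (values are made up)
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2021-10-01 18:30", "2021-10-02 00:00", "2021-10-02 09:15"]
        ),
        "amount": [10.0, 20.0, 30.0],
    }
)

cutoff = pd.Timestamp("2021-10-02")
is_home = toy["date"] < cutoff

home = toy[is_home].reset_index(drop=True)
travel = toy[~is_home].reset_index(drop=True)  # everything that is not "home"
```

Note that a row with a missing date compares as False and would therefore land in the `travel` subset.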


In [None]:
# exclude the deposit records as they don't count as expenses
# filter out the records
dftravel = dftravel[
    ~(
        (dftravel["category"] == "Financial_expenses")
        & (dftravel["subcategory"] == "Loan, interests")
    )
]


###### **Split the *Labels* column into 4 columns, as it contains multiple values**

In [None]:
dftravel[["l1", "l2", "l3", "l4"]] = dftravel["labels"].str.rsplit("|", expand=True)
dftravel[["l1", "l2", "l3", "l4"]]
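
With `expand=True`, the number of resulting columns equals the largest number of pieces found in any row, and shorter rows are padded with missing values. A small sketch with made-up label strings shows the behaviour:

```python
import pandas as pd

# Made-up label strings with a varying number of pieces
labels = pd.Series(["BIG TRIP|Thailand|Bangkok", "Thailand|Phuket", None])

parts = labels.str.split("|", expand=True)
parts.columns = [f"l{i + 1}" for i in range(parts.shape[1])]

# Number of pieces per row (NaN where the label itself is missing)
pieces_per_row = labels.str.count(r"\|") + 1
print(parts)
```

Assigning the result to a fixed list of column names, as I do above, only works when the maximum number of pieces matches the number of names.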


The values are mixed across these 4 label columns. I convert the Series to lists so I can move each value to the correct place.

In [None]:
# save the split columns to lists to iterate over and change the values
list_1 = dftravel["l1"].to_list()
list_2 = dftravel["l2"].to_list()
list_3 = dftravel["l3"].to_list()
list_4 = dftravel["l4"].to_list()


In [None]:

places = list(dftravel["l3"].unique())  # unique values of l3 (mostly place names)
del_place = [1, 4, 6, 7, 19]  # positions of the invalid names and NaN values in `places`
places_1 = np.delete(places, del_place).tolist()  # drop those positions with numpy and convert back to a list


In [None]:
# iterate row by row; l3 holds the majority of the correct values.
# if the l3 value is not a known place, take the place from another label column of the same row
nvalid = ("BIG TRIP", "Thailand")
place = []
for v1, v2, v3, v4 in zip(list_1, list_2, list_3, list_4):
    if v3 in places_1:
        place.append(v3)
    elif v3 in nvalid and v2 in nvalid:
        place.append(v1)
    elif v3 in nvalid and v1 in nvalid:
        place.append(v2)
    elif v3 == "Accommodation":
        # use l4 if it holds a value, otherwise keep l3 and correct it manually later
        place.append(v4 if pd.notna(v4) else v3)
    else:
        place.append(v3)


In [None]:
# append the new list to the dataframe
dftravel["place"] = place


In [None]:
# fill missing values in the place column with the last known place (forward fill)
dftravel["place"] = dftravel["place"].ffill()
dftravel.info()
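
Forward fill copies the last non-missing value downwards, which fits here because the records are in chronological order and I stay in a place for a while. A tiny sketch with made-up values:

```python
import numpy as np
import pandas as pd

place = pd.Series(["Bangkok", np.nan, np.nan, "Phuket", np.nan])
filled = place.ffill()

# A missing value at the very start has nothing before it and stays missing
leading = pd.Series([np.nan, "Bangkok"]).ffill()
```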


In [None]:
# change values that were not correctly filled in previous step
dftravel.loc[dftravel["place"] == "Accommodation", ["place"]] = "Phuket"
dftravel.loc[dftravel["place"] == "Road trip", ["place"]] = "Sangkhlaburi"
dftravel.loc[dftravel["place"] == "BIG TRIP", ["place"]] = "Bangkok"


In [None]:
# create a new column 'country'
dftravel["country"] = "Thailand"


In [None]:
# finally drop the columns that are no longer needed
dftravel.drop(["labels", "l1", "l2", "l3", "l4"], axis=1, inplace=True)


In [None]:
# check summary for each column to spot possible issues
dftravel.info()


#### **Get latitude and longitude for the places using the geopy library**

In [None]:
# look up the geodata for each unique place and store it in another list
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="disbalanxx@gmail.com")
keys = dftravel["place"].unique()
geodata = []
for x in keys:
    print(f"Fetching geodata of {x} and appending to the list")
    location = geolocator.geocode(x)
    # geocode() returns None when nothing is found, so fail with a clear message
    if location is None:
        raise ValueError(f"No geodata found for {x!r}")
    geodata.append(str(location.latitude) + " " + str(location.longitude))

In [None]:
# use the zip function to make a dict from two lists
geo_dict = dict(zip(keys, geodata))
geo_dict


In [None]:
# and finally map the dict values to the dataframe
dftravel["gdata"] = dftravel["place"].map(geo_dict)
dftravel.head()


In [None]:
# latitude and longitude are stored in one column, so I split it into two columns
dftravel[["lat", "long"]] = dftravel["gdata"].str.rsplit(expand=True)
dftravel.drop("gdata", axis=1, inplace=True)


In [None]:
# convert the geodata to float number type
convert = {"lat": float, "long": float}
dftravel = dftravel.astype(convert)
print(dftravel.dtypes)
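
The round trip through a string column could probably be avoided by keeping the coordinates as numbers from the start. The sketch below is only an illustration: `build_geo_lookup` and `FakeLocation` are hypothetical names, and the stand-in geocoder returns placeholder coordinates so the example runs offline. In the notebook, `geolocator.geocode` would be passed in its place.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

import pandas as pd


def build_geo_lookup(
    places: Iterable[str],
    geocode: Callable[[str], Optional[object]],
) -> Dict[str, Tuple[float, float]]:
    """Return {place: (lat, lon)} for every place the geocoder can resolve."""
    lookup = {}
    for name in places:
        location = geocode(name)
        if location is None:  # nothing found for this name, leave it out
            continue
        lookup[name] = (float(location.latitude), float(location.longitude))
    return lookup


class FakeLocation:
    """Stand-in for a geocoding result; only carries the two attributes used."""

    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


# Placeholder coordinates, not real ones
fake_results = {"Bangkok": FakeLocation(1.0, 2.0)}

toy = pd.DataFrame({"place": ["Bangkok", "Nowhere", "Bangkok"]})
lookup = build_geo_lookup(toy["place"].unique(), fake_results.get)

# One row per resolved place, then a left join on the place name
geo = pd.DataFrame.from_dict(lookup, orient="index", columns=["lat", "long"])
result = toy.join(geo, on="place")
```

Places that could not be resolved simply end up with NaN in both coordinate columns.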


#### **Write the data to a SQLite database file**

In [None]:
# write the travel expenses DataFrame to a SQLite database file
import sqlite3 as sq

data = dftravel
sql_data = "EXPENSES.db"
conn = sq.connect(sql_data)
# if_exists="replace" drops an existing table first, so no separate DROP TABLE is needed
data.to_sql("travel_expenses", conn, if_exists="replace", index=False)
conn.commit()
# read the table back to verify the write, then display it as the cell output
check = pd.read_sql("select * from travel_expenses", conn)
conn.close()
check


In [None]:
# write the home expenses DataFrame to the same SQLite database file

data = dfhome
sql_data = "EXPENSES.db"  # same file name as above, so both tables live in one database
conn = sq.connect(sql_data)
# if_exists="replace" drops an existing table first, so no separate DROP TABLE is needed
data.to_sql("home_expenses", conn, if_exists="replace", index=False)
conn.commit()
# read the table back to verify the write, then display it as the cell output
check = pd.read_sql("select * from home_expenses", conn)
conn.close()
check
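
The write-and-verify pattern can be tried without touching any file by using an in-memory database. This is a sketch on made-up toy data; the table name `toy_expenses` is a placeholder, and `contextlib.closing` is used so the connection is closed even if a step fails:

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Made-up toy data
toy = pd.DataFrame({"category": ["Vehicle", "Food_drinks"], "amount": [7.5, 4.0]})

# ":memory:" creates a temporary database that disappears with the connection
with closing(sqlite3.connect(":memory:")) as conn:
    toy.to_sql("toy_expenses", conn, if_exists="replace", index=False)
    read_back = pd.read_sql("SELECT * FROM toy_expenses", conn)

print(read_back)
```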
