## **Personal Expenses Data Preparation**

### **Data Loading and First Look**
I export the data from the smartphone app I use to track my expenses. The export is a CSV file, so I can load it into a pandas DataFrame by specifying the delimiter. I also specify which columns to load and tell pandas to parse the 'date' column as dates.

In [55]:
import pandas as pd
import numpy as np

fname = "data/report_2022-08-05_110949.csv"
# load the data
df = pd.read_csv(
    fname,
    sep=";",
    usecols=[
        "date",
        "category",
        "account",
        "ref_currency_amount",
        "payment_type_local",
        "gps_latitude",
        "gps_longitude",
        "labels",
    ],
    parse_dates=["date"],
)
df.head()


Unnamed: 0,account,category,ref_currency_amount,payment_type_local,date,gps_latitude,gps_longitude,labels
0,Hanseatic Visa,Groceries,-10.81,Credit card,2022-07-31 15:49:30,,,Thailand|BIG TRIP|Chiang Mai
1,Hanseatic Visa,"Restaurant, fast-food",-15.57,Credit card,2022-07-31 15:16:38,,,Thailand|BIG TRIP|Chiang Mai
2,Thai Baht cash,Life & Entertainment,-1.08,Cash,2022-07-30 13:41:38,,,Thailand|BIG TRIP|Chiang Mai
3,Thai Baht cash,"Restaurant, fast-food",-0.54,Cash,2022-07-30 08:55:38,,,Thailand|BIG TRIP|Chiang Mai
4,Thai Baht cash,"Restaurant, fast-food",-6.46,Cash,2022-07-30 08:55:38,,,Thailand|BIG TRIP|Chiang Mai


In [56]:
# rename the columns to something more meaningful to me (names are assigned by position, in file order)
df.columns = [
    "account",
    "category",
    "amount",
    "payment_type",
    "date",
    "lat",
    "lng",
    "labels",
]
df.head()


Unnamed: 0,account,category,amount,payment_type,date,lat,lng,labels
0,Hanseatic Visa,Groceries,-10.81,Credit card,2022-07-31 15:49:30,,,Thailand|BIG TRIP|Chiang Mai
1,Hanseatic Visa,"Restaurant, fast-food",-15.57,Credit card,2022-07-31 15:16:38,,,Thailand|BIG TRIP|Chiang Mai
2,Thai Baht cash,Life & Entertainment,-1.08,Cash,2022-07-30 13:41:38,,,Thailand|BIG TRIP|Chiang Mai
3,Thai Baht cash,"Restaurant, fast-food",-0.54,Cash,2022-07-30 08:55:38,,,Thailand|BIG TRIP|Chiang Mai
4,Thai Baht cash,"Restaurant, fast-food",-6.46,Cash,2022-07-30 08:55:38,,,Thailand|BIG TRIP|Chiang Mai


### **Data Cleaning and Preparation**

#### **Handling Missing Data**

In [57]:
# check summary of each column
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2126 entries, 0 to 2125
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   account       2126 non-null   object        
 1   category      2126 non-null   object        
 2   amount        2126 non-null   float64       
 3   payment_type  2126 non-null   object        
 4   date          2126 non-null   datetime64[ns]
 5   lat           0 non-null      float64       
 6   lng           0 non-null      float64       
 7   labels        2041 non-null   object        
dtypes: datetime64[ns](1), float64(3), object(4)
memory usage: 133.0+ KB


Here I see that the 'lat' and 'lng' geodata columns do not contain any values. That is fine for now: I will take the place names from the 'labels' column and geocode them to get coordinates for the places I visited during my travels.
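
To see at a glance which columns are completely empty, the null counts can also be tallied directly. This is a minimal sketch on a small made-up frame; the values and the names `toy`, `missing` and `empty_cols` are illustrative and not part of my export.

```python
import numpy as np
import pandas as pd

# toy frame with the same kind of gaps as the export (made-up values)
toy = pd.DataFrame(
    {
        "amount": [-10.81, -15.57, -1.08],
        "lat": [np.nan, np.nan, np.nan],
        "labels": ["Thailand|BIG TRIP", None, "Thailand"],
    }
)

# number of missing values per column
missing = toy.isna().sum()
# columns that contain no values at all
empty_cols = missing[missing == len(toy)].index.tolist()
print(missing)
print(empty_cols)
```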

#### **Data Transformation**

In [58]:
# flag duplicate rows (True marks a repeat of an earlier row)
df.duplicated()


0       False
1       False
2       False
3       False
4       False
        ...  
2121    False
2122    False
2123    False
2124    False
2125    False
Length: 2126, dtype: bool
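
A Boolean Series with 2126 entries is hard to read by eye, so a count is more useful. The sketch below, on made-up rows, shows how the duplicates could be counted and, if wanted, dropped. Whether dropping is appropriate is a judgment call, since two identical purchases with the same timestamp can be genuine.

```python
import pandas as pd

# made-up rows; the third repeats the first exactly
toy = pd.DataFrame(
    {
        "date": ["2022-07-30 08:55:38"] * 3,
        "subcategory": ["Restaurant, fast-food"] * 3,
        "amount": [-0.54, -6.46, -0.54],
    }
)

# count rows that repeat an earlier row
n_dupes = int(toy.duplicated().sum())
# keep the first occurrence of each row and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```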

In this step, I add the category names. The exported data set contains only the subcategories, so I copied the categories manually from the application and created a dictionary (***category : subcategories***).
After that, I assign a category to each row based on its subcategory using pandas **map()**.

In [59]:
# create a dictionary with categories as keys and subcategories as values
# also assign the missing category for Fitness Supplements

d = {
    "Food_drinks": [
        "Food & Drinks",
        "Bar, cafe",
        "Groceries",
        "Restaurant, fast-food",
        "Fitness Supplements",
    ],
    "Shopping": [
        "Shopping",
        "Clothes & shoes",
        "Drug-store, chemist",
        "Electronics, accessories",
        "Camera expenses",
        "Free time",
        "Gifts, joy",
        "Health and beauty",
        "Home, garden",
        "Jewels, accessories",
        "Stationery, tools",
    ],
    "Housing": ["Housing", "Energy, utilities", "Maintenance, repairs", "Rent"],
    "Transportation": [
        "Transportation",
        "Business trips",
        "Long distance",
        "Public transport",
        "Taxi",
    ],
    "Vehicle": [
        "Vehicle",
        "Fuel",
        "Leasing",
        "Parking",
        "Rentals",
        "Vehicle insurance",
        "Vehicle maintenance",
    ],
    "Life_Entertainment": [
        "Life & Entertainment",
        "Active sport, fitness",
        "Alcohol, tobacco",
        "Books, audio, subscriptions",
        "Charity, gifts",
        "Culture, sport events",
        "Education, development",
        "Health care, doctor",
        "Hobbies",
        "Holiday, trips, hotels",
        "Sightseeing, activities",
        "Accommodation",
        "Life events",
        "Lottery, gambling",
        "TV, Streaming",
        "Wellness, beauty",
    ],
    "Communication_PC": [
        "Communication, PC",
        "Internet",
        "Phone, mobile phone",
        "Postal services",
        "Software, apps, games",
        "Phone, cell phone",
    ],
    "Financial_expenses": [
        "Financial expenses",
        "Advisory",
        "Charges, Fees",
        "Fines",
        "Insurances",
        "Loan, interests",
        "Taxes",
    ],
    "Investments": [
        "Investments",
        "Financial investments",
        "Collections",
        "Realty",
        "Savings",
        "Vehicles, chattels",
    ],
    "Income": ["Income", "Gifts", "Refunds (tax, purchase)", "Sale", "Wage, invoices"],
    "Other": ["Missing", "Other"],
}


In [60]:
# the dictionary needs to be flattened before using the map function
def flatten_dict(d):
    nd = {}
    for k, v in d.items():
        # Check if it's a list, if so then iterate through
        if hasattr(v, "__iter__") and not isinstance(v, str):
            for item in v:
                nd[item] = k
        else:
            nd[v] = k
    return nd
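
When every value in the dictionary is a list, as in `d`, the same inversion can also be written as a dict comprehension. A minimal sketch on a shortened, made-up dictionary (`toy` and `lookup` are illustrative names):

```python
# toy input in the same shape as d: category -> list of subcategories
toy = {
    "Food_drinks": ["Groceries", "Bar, cafe"],
    "Housing": ["Rent"],
}

# invert to subcategory -> category
lookup = {sub: cat for cat, subs in toy.items() for sub in subs}
print(lookup)
```

Unlike `flatten_dict`, this version does not handle values that are plain strings.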


In [61]:
# use the new function to flatten the dict
flatten_d = flatten_dict(d)


In [62]:
# change the column name of category column to subcategory
df = df.rename(columns={'category' : 'subcategory'})

In [63]:
# and finally map using the pandas map() function to assign the values
df["category"] = df["subcategory"].map(flatten_d)
df.head()


Unnamed: 0,account,subcategory,amount,payment_type,date,lat,lng,labels,category
0,Hanseatic Visa,Groceries,-10.81,Credit card,2022-07-31 15:49:30,,,Thailand|BIG TRIP|Chiang Mai,Food_drinks
1,Hanseatic Visa,"Restaurant, fast-food",-15.57,Credit card,2022-07-31 15:16:38,,,Thailand|BIG TRIP|Chiang Mai,Food_drinks
2,Thai Baht cash,Life & Entertainment,-1.08,Cash,2022-07-30 13:41:38,,,Thailand|BIG TRIP|Chiang Mai,Life_Entertainment
3,Thai Baht cash,"Restaurant, fast-food",-0.54,Cash,2022-07-30 08:55:38,,,Thailand|BIG TRIP|Chiang Mai,Food_drinks
4,Thai Baht cash,"Restaurant, fast-food",-6.46,Cash,2022-07-30 08:55:38,,,Thailand|BIG TRIP|Chiang Mai,Food_drinks
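
`map()` leaves NaN wherever a subcategory is missing from the dictionary, so it is worth checking for unmapped values after this step. A minimal sketch with a made-up subcategory ("Pets") that is deliberately absent from the lookup:

```python
import pandas as pd

lookup = {"Groceries": "Food_drinks", "Rent": "Housing"}
# "Pets" is a made-up value with no entry in the lookup
toy = pd.DataFrame({"subcategory": ["Groceries", "Rent", "Pets"]})

toy["category"] = toy["subcategory"].map(lookup)
# subcategories that did not receive a category
unmapped = toy.loc[toy["category"].isna(), "subcategory"].unique().tolist()
print(unmapped)
```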


In [64]:
# rearrange the column order
df = df[
    [
        "date",
        "category",
        "subcategory",
        "amount",
        "account",
        "payment_type",
        "lat",
        "lng",
        "labels",
    ]
]
df.head()


Unnamed: 0,date,category,subcategory,amount,account,payment_type,lat,lng,labels
0,2022-07-31 15:49:30,Food_drinks,Groceries,-10.81,Hanseatic Visa,Credit card,,,Thailand|BIG TRIP|Chiang Mai
1,2022-07-31 15:16:38,Food_drinks,"Restaurant, fast-food",-15.57,Hanseatic Visa,Credit card,,,Thailand|BIG TRIP|Chiang Mai
2,2022-07-30 13:41:38,Life_Entertainment,Life & Entertainment,-1.08,Thai Baht cash,Cash,,,Thailand|BIG TRIP|Chiang Mai
3,2022-07-30 08:55:38,Food_drinks,"Restaurant, fast-food",-0.54,Thai Baht cash,Cash,,,Thailand|BIG TRIP|Chiang Mai
4,2022-07-30 08:55:38,Food_drinks,"Restaurant, fast-food",-6.46,Thai Baht cash,Cash,,,Thailand|BIG TRIP|Chiang Mai


In [65]:
# convert the amount column to absolute value
df["amount"] = df["amount"].abs()
df.head()


Unnamed: 0,date,category,subcategory,amount,account,payment_type,lat,lng,labels
0,2022-07-31 15:49:30,Food_drinks,Groceries,10.81,Hanseatic Visa,Credit card,,,Thailand|BIG TRIP|Chiang Mai
1,2022-07-31 15:16:38,Food_drinks,"Restaurant, fast-food",15.57,Hanseatic Visa,Credit card,,,Thailand|BIG TRIP|Chiang Mai
2,2022-07-30 13:41:38,Life_Entertainment,Life & Entertainment,1.08,Thai Baht cash,Cash,,,Thailand|BIG TRIP|Chiang Mai
3,2022-07-30 08:55:38,Food_drinks,"Restaurant, fast-food",0.54,Thai Baht cash,Cash,,,Thailand|BIG TRIP|Chiang Mai
4,2022-07-30 08:55:38,Food_drinks,"Restaurant, fast-food",6.46,Thai Baht cash,Cash,,,Thailand|BIG TRIP|Chiang Mai


##### **Create a subset DataFrame for a time period at home.**

In [66]:
# first of two subsets: records from the time at home
dfhome = df.loc[
    (df["date"] < "2021-10-02T00:00:00"),
    ["date", "category", "subcategory", "amount", "account", "payment_type"],
].reset_index(drop=True)
dfhome.head()


Unnamed: 0,date,category,subcategory,amount,account,payment_type
0,2021-10-01 13:58:25,Communication_PC,Postal services,0.8,DKB Debit Card,Debit card
1,2021-10-01 13:58:01,Housing,"Energy, utilities",32.0,DKB Debit Card,Debit card
2,2021-09-30 00:16:25,Shopping,"Home, garden",9.9,DKB Visa,Credit card
3,2021-09-29 22:40:03,Communication_PC,"Phone, cell phone",10.59,DKB Debit Card,Debit card
4,2021-09-29 11:28:49,Transportation,Public transport,9.9,DKB Debit Card,Debit card


##### **Create a subset of the DataFrame containing expenses while travelling**

In [67]:
# create a df subset with data while travelling
cols = list(df.columns)
dftravel = df.loc[df["date"] > "2021-10-02T00:00:00", cols].reset_index(drop=True)
dftravel.head()


Unnamed: 0,date,category,subcategory,amount,account,payment_type,lat,lng,labels
0,2022-07-31 15:49:30,Food_drinks,Groceries,10.81,Hanseatic Visa,Credit card,,,Thailand|BIG TRIP|Chiang Mai
1,2022-07-31 15:16:38,Food_drinks,"Restaurant, fast-food",15.57,Hanseatic Visa,Credit card,,,Thailand|BIG TRIP|Chiang Mai
2,2022-07-30 13:41:38,Life_Entertainment,Life & Entertainment,1.08,Thai Baht cash,Cash,,,Thailand|BIG TRIP|Chiang Mai
3,2022-07-30 08:55:38,Food_drinks,"Restaurant, fast-food",0.54,Thai Baht cash,Cash,,,Thailand|BIG TRIP|Chiang Mai
4,2022-07-30 08:55:38,Food_drinks,"Restaurant, fast-food",6.46,Thai Baht cash,Cash,,,Thailand|BIG TRIP|Chiang Mai
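
The two filters use `<` and `>`, so a record stamped exactly 2021-10-02 00:00:00 would end up in neither subset. A minimal sketch, on made-up dates, of deriving both subsets from one mask and its complement so that every row lands in exactly one of them:

```python
import pandas as pd

# made-up dates; the second sits exactly on the cutoff
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2021-10-01 13:58:25", "2021-10-02 00:00:00", "2021-10-02 09:00:00"]
        )
    }
)
cutoff = pd.Timestamp("2021-10-02")

at_home = toy["date"] < cutoff
home = toy[at_home].reset_index(drop=True)
travel = toy[~at_home].reset_index(drop=True)  # complement of the same mask
print(len(home), len(travel))
```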


In [68]:
# exclude the deposit records as they don't count as expenses
# filter out the records
dftravel = dftravel[
    ~(
        (dftravel["category"] == "Financial_expenses")
        & (dftravel["subcategory"] == "Loan, interests")
    )
]


###### **Split the *labels* column into 4 columns, as it contains multiple values**

In [69]:
dftravel[["l1", "l2", "l3", "l4"]] = dftravel["labels"].str.rsplit("|", expand=True)
dftravel[["l1", "l3", "l3", "l4"]]


Unnamed: 0,l1,l3,l3.1,l4
0,Thailand,Chiang Mai,Chiang Mai,
1,Thailand,Chiang Mai,Chiang Mai,
2,Thailand,Chiang Mai,Chiang Mai,
3,Thailand,Chiang Mai,Chiang Mai,
4,Thailand,Chiang Mai,Chiang Mai,
...,...,...,...,...
1970,BIG TRIP,Phuket,Phuket,
1971,BIG TRIP,Accommodation,Accommodation,Phuket
1972,Thailand,,,
1973,Thailand,,,


The values are mixed across these 4 label columns. I convert the Series to lists so that I can move the values into the correct place.

In [70]:
# save the split columns to lists to iterate over and change the values
list_1 = dftravel["l1"].to_list()
list_2 = dftravel["l2"].to_list()
list_3 = dftravel["l3"].to_list()
list_4 = dftravel["l4"].to_list()


In [71]:

places = list(dftravel["l3"].unique())  # unique values of l3 (mostly place names)
del_place = [1, 4, 6, 7, 19]  # positions of invalid names or NaN values in places
places_1 = np.delete(places, del_place).tolist()  # remove them with numpy and convert back to a list


In [72]:
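# list_3 holds most of the correct values.
# Iterate through it; if a value is not in the list of valid places,
# look in the other columns and append what is found to a new list.
# Note: list.index(x) returns the first occurrence of x, not the current row,
# so some values still need manual correction in a later step.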
nvalid = ("BIG TRIP", "Thailand")
place = []
for x in list_3:
    if x in places_1:
        place.append(x)
    elif x in nvalid and list_2[list_1.index(x)] in nvalid:
        place.append(list_1[list_3.index(x)])
    elif x in nvalid and list_1[list_3.index(x)] in nvalid:
        place.append(list_2[list_1.index(x)])
    elif x == "Accomodation":
        x = list_4[list_1.index(x)]
        place.append(x)
    else:
        place.append(x)
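
An alternative that avoids the positional lookups is to work on each label string on its own and take the first token that is not a known tag. This is only a sketch of that idea, not the method used above: `pick_place` and `NOT_PLACES` are hypothetical names, and the second label string in the usage lines is made up.

```python
# tokens that are tags rather than places (taken from the labels seen above)
NOT_PLACES = {"Thailand", "BIG TRIP", "Accommodation"}


def pick_place(labels):
    """Return the first label token that is not a known tag, else None."""
    if not isinstance(labels, str):  # NaN or None: no labels recorded
        return None
    for token in labels.split("|"):
        if token not in NOT_PLACES:
            return token
    return None


print(pick_place("Thailand|BIG TRIP|Chiang Mai"))
print(pick_place("BIG TRIP|Thailand|Accommodation|Phuket"))
```

It could be applied with `dftravel["labels"].map(pick_place)`. Labels such as "Road trip" would still come through as a place and need a manual correction.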


In [73]:
# append the new list to the dataframe
dftravel["place"] = place


In [74]:
# fill missing values in the place column with forward fill
dftravel["place"] = dftravel["place"].ffill()
dftravel.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1967 entries, 0 to 1974
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date          1967 non-null   datetime64[ns]
 1   category      1967 non-null   object        
 2   subcategory   1967 non-null   object        
 3   amount        1967 non-null   float64       
 4   account       1967 non-null   object        
 5   payment_type  1967 non-null   object        
 6   lat           0 non-null      float64       
 7   lng           0 non-null      float64       
 8   labels        1951 non-null   object        
 9   l1            1951 non-null   object        
 10  l2            1949 non-null   object        
 11  l3            1939 non-null   object        
 12  l4            1 non-null      object        
 13  place         1967 non-null   object        
dtypes: datetime64[ns](1), float64(3), object(10)
memory usage: 230.5+ KB


In [75]:
# change values that were not correctly filled in previous step
dftravel.loc[dftravel["place"] == "Accommodation", ["place"]] = "Phuket"
dftravel.loc[dftravel["place"] == "Road trip", ["place"]] = "Sangkhlaburi"
dftravel.loc[dftravel["place"] == "BIG TRIP", ["place"]] = "Bangkok"


In [76]:
# create a new column 'country'
dftravel["country"] = "Thailand"


In [77]:
# finally drop the columns that are no longer needed
dftravel.drop(["labels", "l1", "l2", "l3", "l4"], axis=1, inplace=True)


In [78]:
# check summary for each column to spot possible issues
dftravel.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1967 entries, 0 to 1974
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date          1967 non-null   datetime64[ns]
 1   category      1967 non-null   object        
 2   subcategory   1967 non-null   object        
 3   amount        1967 non-null   float64       
 4   account       1967 non-null   object        
 5   payment_type  1967 non-null   object        
 6   lat           0 non-null      float64       
 7   lng           0 non-null      float64       
 8   place         1967 non-null   object        
 9   country       1967 non-null   object        
dtypes: datetime64[ns](1), float64(3), object(6)
memory usage: 169.0+ KB


#### **Get latitude and longitude for the places**

In [None]:
import urllib.request, urllib.parse, urllib.error
import json
import ssl

api_key = False
# If you have a Google Places API key, enter it here
# api_key = 'AIzaSy___IDByT70'
# https://developers.google.com/maps/documentation/geocoding/intro

if api_key is False:
    api_key = 42
    serviceurl = 'http://py4e-data.dr-chuck.net/json?'
else :
    serviceurl = 'https://maps.googleapis.com/maps/api/geocode/json?'

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

keys = dftravel["place"].unique()
geodata = list()

for place in keys:
    parms = dict()
    parms['address'] = place
    
    if api_key is not False: parms['key'] = api_key
    url = serviceurl + urllib.parse.urlencode(parms)

    print('Retrieving', url)
    uh = urllib.request.urlopen(url, context=ctx)
    data = uh.read().decode()
    print('Retrieved', len(data), 'characters')

    try:
        js = json.loads(data)
    except json.JSONDecodeError:
        js = None

    if not js or 'status' not in js or js['status'] != 'OK':
        print('==== Failure To Retrieve ====')
        print(data)
        # keep geodata aligned with keys, which are zipped together later
        geodata.append([None, None])
        continue

    lat = js['results'][0]['geometry']['location']['lat']
    lng = js['results'][0]['geometry']['location']['lng']
    geodata.append([lat, lng])
    print('lat', lat, 'lng', lng)
    location = js['results'][0]['formatted_address']
    print(location)
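
Because `geodata` is later zipped with `keys`, the two must stay the same length and in the same order. One way to rule out a mismatch is to key the results by place name directly. The sketch below uses made-up response dictionaries in place of real requests; `extract_lat_lng`, `responses` and `geo` are hypothetical names, and the response shape is assumed to match the fields read in the loop above.

```python
def extract_lat_lng(js):
    """Return [lat, lng] from a geocoding response dict, or None if unusable."""
    if not js or js.get("status") != "OK" or not js.get("results"):
        return None
    loc = js["results"][0]["geometry"]["location"]
    return [loc["lat"], loc["lng"]]


# made-up responses standing in for what the service returns
responses = {
    "Chiang Mai": {
        "status": "OK",
        "results": [{"geometry": {"location": {"lat": 18.79, "lng": 98.99}}}],
    },
    "Nowhere": {"status": "ZERO_RESULTS", "results": []},
}

geo = {}
for name, js in responses.items():
    coords = extract_lat_lng(js)
    if coords is not None:
        # keyed by place, so a failed lookup cannot shift later entries
        geo[name] = coords
print(geo)
```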

In [79]:
geodata

[[18.7883439, 98.98530079999999],
 [12.9235557, 100.8824551],
 [13.7563309, 100.5017651],
 [12.0479159, 102.3234816],
 [11.6680759, 102.5642261],
 [13.5282893, 99.8134211],
 [12.5683747, 99.9576888],
 [14.4391554, 101.3722299],
 [12.9182259, 100.7802624],
 [15.1542081, 98.45306579999999],
 [14.1011393, 99.4179431],
 [9.134194899999999, 99.3334198],
 [8.9873143, 98.6294329],
 [8.0854803, 98.9062856],
 [7.8804479, 98.3922504]]

In [80]:
# use the zip function to make a dict from two lists
geo_dict = dict(zip(keys, geodata))
geo_dict


{'Chiang Mai': [18.7883439, 98.98530079999999],
 'Pattaya': [12.9235557, 100.8824551],
 'Bangkok': [13.7563309, 100.5017651],
 'Koh Chang': [12.0479159, 102.3234816],
 'Koh Kud': [11.6680759, 102.5642261],
 'Ratchaburi': [13.5282893, 99.8134211],
 'Hua Hin': [12.5683747, 99.9576888],
 'Khao Yai': [14.4391554, 101.3722299],
 'Ko Larn': [12.9182259, 100.7802624],
 'Sangkhlaburi': [15.1542081, 98.45306579999999],
 'Kanchanaburi': [14.1011393, 99.4179431],
 'Suratthani': [9.134194899999999, 99.3334198],
 'Khao Sok': [8.9873143, 98.6294329],
 'Krabi': [8.0854803, 98.9062856],
 'Phuket': [7.8804479, 98.3922504]}

In [81]:
# and finally map the dict values to the dataframe
dftravel["gdata"] = dftravel["place"].map(geo_dict)
dftravel.head()


Unnamed: 0,date,category,subcategory,amount,account,payment_type,lat,lng,place,country,gdata
0,2022-07-31 15:49:30,Food_drinks,Groceries,10.81,Hanseatic Visa,Credit card,,,Chiang Mai,Thailand,"[18.7883439, 98.98530079999999]"
1,2022-07-31 15:16:38,Food_drinks,"Restaurant, fast-food",15.57,Hanseatic Visa,Credit card,,,Chiang Mai,Thailand,"[18.7883439, 98.98530079999999]"
2,2022-07-30 13:41:38,Life_Entertainment,Life & Entertainment,1.08,Thai Baht cash,Cash,,,Chiang Mai,Thailand,"[18.7883439, 98.98530079999999]"
3,2022-07-30 08:55:38,Food_drinks,"Restaurant, fast-food",0.54,Thai Baht cash,Cash,,,Chiang Mai,Thailand,"[18.7883439, 98.98530079999999]"
4,2022-07-30 08:55:38,Food_drinks,"Restaurant, fast-food",6.46,Thai Baht cash,Cash,,,Chiang Mai,Thailand,"[18.7883439, 98.98530079999999]"



In [82]:
# latitude and longitude are stored together in one column, so I split it into two columns
dftravel[["lat", "lng"]] = pd.DataFrame(dftravel.gdata.to_list(), index=dftravel.index)
dftravel.drop("gdata", axis=1, inplace=True)
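
The `index=dftravel.index` argument matters here: after the earlier filtering, the DataFrame index has gaps, and pandas aligns on the index when assigning. A minimal sketch on a made-up frame with a non-default index:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "place": ["Krabi", "Phuket"],
        "gdata": [[8.0854803, 98.9062856], [7.8804479, 98.3922504]],
    },
    index=[3, 7],  # non-default index, like a frame after filtering
)

# reuse the original index so the new columns line up with the existing rows
toy[["lat", "lng"]] = pd.DataFrame(toy["gdata"].to_list(), index=toy.index)
toy = toy.drop(columns="gdata")
print(toy)
```

Without `index=toy.index`, the helper frame would be indexed 0 and 1, and the assignment would produce NaN for rows 3 and 7.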


In [83]:
# check the dtypes; lat and lng should be float64
# if they were not, they could be converted like this:
# convert = {"lat": float, "lng": float}
# dftravel = dftravel.astype(convert)
print(dftravel.dtypes)


date            datetime64[ns]
category                object
subcategory             object
amount                 float64
account                 object
payment_type            object
lat                    float64
lng                    float64
place                   object
country                 object
dtype: object


#### **Write the data to CSV**

In [84]:
dfhome.to_csv("2022-08-30_home_expenses.csv", index=False)
dftravel.to_csv("2022-08-30_travel_expenses.csv", index=False)
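
A CSV file stores dates as plain text, so the 'date' column has to be parsed again when the file is read back. A minimal round-trip sketch on a made-up row, using an in-memory buffer instead of a file on disk:

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"date": pd.to_datetime(["2021-10-01 13:58:25"]), "amount": [0.8]}
)

buf = io.StringIO()  # in-memory stand-in for a file on disk
toy.to_csv(buf, index=False)
buf.seek(0)

# parse_dates restores the datetime dtype that the CSV text lost
back = pd.read_csv(buf, parse_dates=["date"])
print(back.dtypes)
```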


#### **Write the data to a SQLite database file.**

In [None]:
# write the travel expenses DataFrame to a SQLite database file
import sqlite3 as sq

data = dftravel
sql_data = "EXPENSES.db"
conn = sq.connect(sql_data)
cur = conn.cursor()
cur.execute("""DROP TABLE IF EXISTS travel_expenses""")
# write the DataFrame to the SQLite database
data.to_sql("travel_expenses", conn, if_exists="replace", index=False)
pd.read_sql("select * from travel_expenses", conn)
conn.commit()
conn.close()


In [None]:
# write the home expenses DataFrame to a SQLite database file

data = dfhome
sql_data = "EXPENSES.db"  # same database file as the travel expenses
conn = sq.connect(sql_data)
cur = conn.cursor()
cur.execute("""DROP TABLE IF EXISTS home_expenses""")
# write the DataFrame to the SQLite database
data.to_sql("home_expenses", conn, if_exists="replace", index=False)
pd.read_sql("select * from home_expenses", conn)
conn.commit()
conn.close()
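
The same write can be sketched more compactly. `if_exists="replace"` already drops and recreates the table, so the separate DROP statement is optional, and `contextlib.closing` makes sure the connection is closed even if something fails. The example uses made-up rows and an in-memory database so that it leaves no file behind.

```python
import sqlite3
from contextlib import closing

import pandas as pd

toy = pd.DataFrame({"place": ["Krabi", "Phuket"], "amount": [8.0, 12.5]})

# ":memory:" keeps this example off disk; a real run would pass a file path
with closing(sqlite3.connect(":memory:")) as conn:
    # if_exists="replace" drops and recreates the table if it already exists
    toy.to_sql("travel_expenses", conn, if_exists="replace", index=False)
    conn.commit()
    # read the table back as a quick check
    back = pd.read_sql("SELECT place, amount FROM travel_expenses", conn)
print(back)
```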
