## **Personal Expenses Data Preparation**


### **Read data from CSV**

I export the data from the smartphone app I use to track my expenses. The export is a semicolon-delimited CSV file, so it loads directly into a pandas DataFrame once the delimiter is specified. I also list the columns to import.


In [2]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

fname = "data/report_2022-10-16_090806.csv"
df = pd.read_csv(
    fname,
    sep=";",
    usecols=[
        "account",
        "category",
        "currency",
        "amount",
        "ref_currency_amount",
        "type",
        "payment_type",
        "payment_type_local",
        "note",
        "date",
        "labels",
    ],
    parse_dates=["date"],
)
df.head()


Unnamed: 0,account,category,currency,amount,ref_currency_amount,type,payment_type,payment_type_local,note,date,labels
0,Cash THB,Taxi,THB,-40.0,-1.1,Expenses,CASH,Cash,,2022-09-30 21:36:30,BIG TRIP|Thailand|Pattaya
1,Cash THB,Taxi,THB,-40.0,-1.1,Expenses,CASH,Cash,,2022-09-30 21:12:16,BIG TRIP|Thailand|Pattaya
2,Credit Card,"Phone, cell phone",EUR,-12.33,-12.33,Expenses,CREDIT_CARD,Credit card,Top-up,2022-09-30 19:16:32,Pattaya|Thailand|BIG TRIP
3,Cash THB,Taxi,THB,-40.0,-1.1,Expenses,CASH,Cash,,2022-09-30 14:04:26,BIG TRIP|Thailand|Pattaya
4,Cash THB,Taxi,THB,-40.0,-1.1,Expenses,CASH,Cash,,2022-09-30 13:50:27,BIG TRIP|Thailand|Pattaya
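
A minimal sketch of why the delimiter matters here: some category values contain commas (for example "Phone, cell phone"), which a semicolon-delimited file keeps intact. The inline sample and the `extra` column are made up for illustration; only `sep`, `usecols` and `parse_dates` mirror the call above.

```python
import io

import pandas as pd

# Inline stand-in for the exported file (hypothetical rows, same layout idea).
raw = io.StringIO(
    "account;category;amount;date;extra\n"
    "Cash THB;Taxi;-40.0;2022-09-30 21:36:30;x\n"
    "Credit Card;Phone, cell phone;-12.33;2022-09-30 19:16:32;y\n"
)

sample = pd.read_csv(
    raw,
    sep=";",                                             # fields are separated by semicolons
    usecols=["account", "category", "amount", "date"],   # "extra" is skipped
    parse_dates=["date"],                                # parse the column as datetime
)
print(sample.dtypes)
```

Columns left out of `usecols` are never loaded, which keeps the frame small when the export carries fields that are not needed.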


### **Data Cleaning and Preparation**


#### **Check for duplicated and missing data**


In [3]:
# check non-null count and dtype of each variable
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2498 entries, 0 to 2497
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   account              2498 non-null   object        
 1   category             2498 non-null   object        
 2   currency             2498 non-null   object        
 3   amount               2498 non-null   float64       
 4   ref_currency_amount  2498 non-null   float64       
 5   type                 2498 non-null   object        
 6   payment_type         2498 non-null   object        
 7   payment_type_local   2498 non-null   object        
 8   note                 446 non-null    object        
 9   date                 2498 non-null   datetime64[ns]
 10  labels               2387 non-null   object        
dtypes: datetime64[ns](1), float64(2), object(8)
memory usage: 214.8+ KB


"Note" and "labels" variables contain missing values. These fields are optional when I create entries in the app and will not impact the accuracy of the analysis. I fill them with "NA".


In [4]:
# fill NaN values with "NA"
df.fillna("NA", inplace=True)
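
As a possible variation, the fill can be limited to the optional text columns so that a numeric column with an unexpected gap is not silently turned into text. This is a sketch on a toy frame; the column names follow the notebook, the values are invented.

```python
import pandas as pd

# Toy frame standing in for the export.
toy = pd.DataFrame(
    {
        "amount": [-40.0, -12.33, -10.0],
        "note": [None, "Top-up", None],
        "labels": ["BIG TRIP|Thailand", None, "Food"],
    }
)

# Count missing values per column before filling anything.
missing_before = toy.isna().sum()

# Fill only the optional text columns; numeric columns stay untouched.
optional_cols = ["note", "labels"]
toy[optional_cols] = toy[optional_cols].fillna("NA")

missing_after = toy.isna().sum()
print(missing_before.to_dict())
print(missing_after.to_dict())
```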


Check for duplicated rows and remove any that are found.


In [5]:
# get duplicated values
df[df.duplicated(keep=False)]


Unnamed: 0,account,category,currency,amount,ref_currency_amount,type,payment_type,payment_type_local,note,date,labels
336,Cash THB,Public transport,THB,-10.0,-0.27,Expenses,CASH,Cash,Public taxi,2022-08-03 06:25:24,BIG TRIP|Thailand|Pattaya
337,Cash THB,Public transport,THB,-10.0,-0.27,Expenses,CASH,Cash,Public taxi,2022-08-03 06:25:24,BIG TRIP|Thailand|Pattaya


In [6]:
# remove duplicated rows
df.drop_duplicates(inplace=True)
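
The two calls above behave differently, which the following toy example tries to make visible: `duplicated(keep=False)` flags every copy for inspection, while `drop_duplicates()` keeps the first copy by default. The rows are invented and only loosely modelled on the duplicate shown above.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "subcategory": ["Public transport", "Public transport", "Taxi"],
        "amount": [-10.0, -10.0, -40.0],
        "date": pd.to_datetime(
            ["2022-08-03 06:25:24", "2022-08-03 06:25:24", "2022-09-30 21:36:30"]
        ),
    }
)

# keep=False marks every member of a duplicate group.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each group.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(len(all_copies), len(deduped))
```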


#### **Transform, add additional variables**


The variable "category" actually contains subcategory entries, so I add a new variable that holds the parent category.  
In the same step I add the nature of each expense: need or want.


In [7]:
# get distinct subcategories
subcategories = df["category"].unique().tolist()


In [8]:
# create two dicts: category -> list of subcategories, and nature [need, want] -> list of subcategories

d = {
    "Food and Drinks": [
        "Food & Drinks",
        "Bar, cafe",
        "Groceries",
        "Restaurant, fast-food",
        "Fitness Supplements",
        "Coffee",
        "Eating out",
    ],
    "Shopping": [
        "Shopping",
        "Clothes & shoes",
        "Drug-store, chemist",
        "Electronics, accessories",
        "Camera expenses",
        "Free time",
        "Gifts, joy",
        "Health and beauty",
        "Teeth care",
        "Skincare face",
        "Supplements",
        "Medicine",
        "Home, garden",
        "Jewels, accessories",
        "Stationery, tools",
    ],
    "Housing": ["Housing", "Energy, utilities", "Maintenance, repairs", "Rent"],
    "Transportation": [
        "Transportation",
        "Business trips",
        "Long distance",
        "Public transport",
        "Taxi",
    ],
    "Vehicle": [
        "Vehicle",
        "Fuel",
        "Leasing",
        "Parking",
        "Rentals",
        "Vehicle insurance",
        "Vehicle maintenance",
    ],
    "Life and Entertainment": [
        "Life & Entertainment",
        "Active sport, fitness",
        "Alcohol, tobacco",
        "Books, audio, subscriptions",
        "Charity, gifts",
        "Culture, sport events",
        "Education, development",
        "Health care, doctor",
        "Hobbies",
        "Holiday, trips, hotels",
        "Sightseeing, activities",
        "Accommodation",
        "Life events",
        "Lottery, gambling",
        "TV, Streaming",
        "Wellness, beauty",
    ],
    "Communication and PC": [
        "Communication, PC",
        "Internet",
        "Phone, mobile phone",
        "Postal services",
        "Software, apps, games",
        "Phone, cell phone",
    ],
    "Financial Expenses": [
        "Financial expenses",
        "Advisory",
        "Charges, Fees",
        "Fines",
        "Insurances",
        "Loan, interests",
        "Taxes",
    ],
    "Investments": [
        "Investments",
        "Financial investments",
        "Collections",
        "Realty",
        "Savings",
        "Vehicles, chattels",
    ],
    "Income": [
        "Income",
        "Gifts",
        "Refunds (tax, purchase)",
        "Sale",
        "Wage, invoices",
        "Lending, renting",
        "Rentals",
    ],
    "Other": ["Missing", "Other"],
}

d_nat = {
    "need": [
        "Food & Drinks",
        "Groceries",
        "Restaurant, fast-food",
        "Clothes & shoes",
        "Drug-store, chemist",
        "Teeth care",
        "Supplements",
        "Medicine",
        "Home, garden",
        "Housing",
        "Energy, utilities",
        "Maintenance, repairs",
        "Rent",
        "Transportation",
        "Long distance",
        "Public transport",
        "Taxi",
        "Active sport, fitness",
        "Communication, PC",
        "Internet",
        "Phone, mobile phone",
        "Postal services",
        "Phone, cell phone",
        "Charges, Fees",
        "Fines",
        "Insurances",
        "Loan, interests",
        "Taxes",
        "Other",
        "Missing",
        "Housing",
        "Financial expenses",
    ],
    "want": [
        "Bar, cafe",
        "Fitness Supplements",
        "Coffee",
        "Eating out",
        "Shopping",
        "Electronics, accessories",
        "Camera expenses",
        "Free time",
        "Gifts, joy",
        "Health and beauty",
        "Skincare face",
        "Skincare body",
        "Jewels, accessories",
        "Stationery, tools",
        "Business trips",
        "Vehicle",
        "Fuel",
        "Leasing",
        "Parking",
        "Rentals",
        "Vehicle insurance",
        "Vehicle maintenance",
        "Life & Entertainment",
        "Alcohol, tobacco",
        "Books, audio, subscriptions",
        "Charity, gifts",
        "Culture, sport events",
        "Education, development",
        "Health care, doctor",
        "Hobbies",
        "Holiday, trips, hotels",
        "Sightseeing, activities",
        "Accommodation",
        "Life events",
        "Lottery, gambling",
        "TV, Streaming",
        "Wellness, beauty",
        "Software, apps, games",
        "Advisory",
        "Shopping",
    ],
}


In [9]:
# define a function to flatten dict
def flatten_dict(d):
    """This function flattens dictionaries"""
    nd = {}
    for k, v in d.items():
        # Check if it's a list, if so then iterate through
        if hasattr(v, "__iter__") and not isinstance(v, str):
            for item in v:
                nd[item] = k
        else:
            nd[v] = k
    return nd
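
Because `flatten_dict` writes `nd[item] = k` for every item, a subcategory listed under two keys silently ends up with whichever key comes last. In the category dictionary above, "Rentals" appears under both "Vehicle" and "Income". The helper below is a small sketch for spotting such cases; `find_overlaps` is a hypothetical name, and the dictionary is only an excerpt.

```python
from collections import defaultdict


def find_overlaps(groups):
    """Return {item: [group, ...]} for items listed under more than one group."""
    seen = defaultdict(list)
    for group, items in groups.items():
        for item in items:
            if group not in seen[item]:
                seen[item].append(group)
    return {item: found for item, found in seen.items() if len(found) > 1}


# Small excerpt of the category dictionary from the notebook.
excerpt = {
    "Vehicle": ["Vehicle", "Fuel", "Rentals"],
    "Income": ["Income", "Gifts", "Rentals"],
}

overlaps = find_overlaps(excerpt)
print(overlaps)
```

Running a check like this on both dictionaries before flattening makes it easier to decide on purpose which group an ambiguous subcategory belongs to.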


In [10]:
# flatten the category and nature dictionaries
flatten_d = flatten_dict(d)
flatten_d_nat = flatten_dict(d_nat)


In [11]:
# rename the category to subcategory
df = df.rename(columns={"category": "subcategory"})


In [12]:
# map the values from dictionaries to corresponding values in data frame
df["category"] = df["subcategory"].map(flatten_d)
df["nature"] = df["subcategory"].map(flatten_d_nat)
df.head()


Unnamed: 0,account,subcategory,currency,amount,ref_currency_amount,type,payment_type,payment_type_local,note,date,labels,category,nature
0,Cash THB,Taxi,THB,-40.0,-1.1,Expenses,CASH,Cash,,2022-09-30 21:36:30,BIG TRIP|Thailand|Pattaya,Transportation,need
1,Cash THB,Taxi,THB,-40.0,-1.1,Expenses,CASH,Cash,,2022-09-30 21:12:16,BIG TRIP|Thailand|Pattaya,Transportation,need
2,Credit Card,"Phone, cell phone",EUR,-12.33,-12.33,Expenses,CREDIT_CARD,Credit card,Top-up,2022-09-30 19:16:32,Pattaya|Thailand|BIG TRIP,Communication and PC,need
3,Cash THB,Taxi,THB,-40.0,-1.1,Expenses,CASH,Cash,,2022-09-30 14:04:26,BIG TRIP|Thailand|Pattaya,Transportation,need
4,Cash THB,Taxi,THB,-40.0,-1.1,Expenses,CASH,Cash,,2022-09-30 13:50:27,BIG TRIP|Thailand|Pattaya,Transportation,need
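
`Series.map` with a dict returns NaN for every value that has no key, so a quick look at the unmapped subcategories can catch entries missing from the dictionaries. A sketch with a reduced mapping; the third value is chosen as an example of a subcategory without a category entry.

```python
import pandas as pd

mapping = {"Taxi": "Transportation", "Groceries": "Food and Drinks"}
sub = pd.Series(["Taxi", "Groceries", "Skincare body"], name="subcategory")

# Values without a key in the mapping become NaN.
category = sub.map(mapping)

# List the subcategories that did not receive a category.
unmapped = sorted(sub[category.isna()].unique())
print(unmapped)
```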


In [13]:
# convert the amount variables to absolute values
df[["amount", "ref_currency_amount"]] = df[["amount", "ref_currency_amount"]].abs()
df.head()


Unnamed: 0,account,subcategory,currency,amount,ref_currency_amount,type,payment_type,payment_type_local,note,date,labels,category,nature
0,Cash THB,Taxi,THB,40.0,1.1,Expenses,CASH,Cash,,2022-09-30 21:36:30,BIG TRIP|Thailand|Pattaya,Transportation,need
1,Cash THB,Taxi,THB,40.0,1.1,Expenses,CASH,Cash,,2022-09-30 21:12:16,BIG TRIP|Thailand|Pattaya,Transportation,need
2,Credit Card,"Phone, cell phone",EUR,12.33,12.33,Expenses,CREDIT_CARD,Credit card,Top-up,2022-09-30 19:16:32,Pattaya|Thailand|BIG TRIP,Communication and PC,need
3,Cash THB,Taxi,THB,40.0,1.1,Expenses,CASH,Cash,,2022-09-30 14:04:26,BIG TRIP|Thailand|Pattaya,Transportation,need
4,Cash THB,Taxi,THB,40.0,1.1,Expenses,CASH,Cash,,2022-09-30 13:50:27,BIG TRIP|Thailand|Pattaya,Transportation,need


Extract the time of day from the date column into its own column.


In [14]:
df["time"] = df["date"].dt.time
df.head()


Unnamed: 0,account,subcategory,currency,amount,ref_currency_amount,type,payment_type,payment_type_local,note,date,labels,category,nature,time
0,Cash THB,Taxi,THB,40.0,1.1,Expenses,CASH,Cash,,2022-09-30 21:36:30,BIG TRIP|Thailand|Pattaya,Transportation,need,21:36:30
1,Cash THB,Taxi,THB,40.0,1.1,Expenses,CASH,Cash,,2022-09-30 21:12:16,BIG TRIP|Thailand|Pattaya,Transportation,need,21:12:16
2,Credit Card,"Phone, cell phone",EUR,12.33,12.33,Expenses,CREDIT_CARD,Credit card,Top-up,2022-09-30 19:16:32,Pattaya|Thailand|BIG TRIP,Communication and PC,need,19:16:32
3,Cash THB,Taxi,THB,40.0,1.1,Expenses,CASH,Cash,,2022-09-30 14:04:26,BIG TRIP|Thailand|Pattaya,Transportation,need,14:04:26
4,Cash THB,Taxi,THB,40.0,1.1,Expenses,CASH,Cash,,2022-09-30 13:50:27,BIG TRIP|Thailand|Pattaya,Transportation,need,13:50:27
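
If a date-only column is also wanted, the `.dt` accessor offers two options that differ in the resulting dtype. This is a sketch on two timestamps taken from the rows above; whether the later analysis needs such a column is an assumption.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2022-09-30 21:36:30", "2022-09-30 13:50:27"]))

time_part = stamps.dt.time         # datetime.time objects (object dtype)
date_part = stamps.dt.normalize()  # timestamps set to midnight, still datetime64
day_only = stamps.dt.date          # datetime.date objects (object dtype)

print(time_part.iloc[0], date_part.iloc[0], day_only.iloc[0])
```

`normalize()` keeps the datetime dtype, which is convenient for resampling and date comparisons, while `.dt.date` yields plain Python dates.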


###### **Split the _labels_ column into 4 columns, as it contains multiple values**.


In [15]:
df[["l1", "l2", "l3", "l4"]] = df["labels"].str.rsplit("|", expand=True)
df[["l1", "l2", "l3", "l4"]]


Unnamed: 0,l1,l2,l3,l4
0,BIG TRIP,Thailand,Pattaya,
1,BIG TRIP,Thailand,Pattaya,
2,Pattaya,Thailand,BIG TRIP,
3,BIG TRIP,Thailand,Pattaya,
4,BIG TRIP,Thailand,Pattaya,
...,...,...,...,...
2493,Food,,,
2494,Food,,,
2495,Fixed expenses,,,
2496,Food,,,


The values are mixed across these 4 label columns. I convert the Series to lists so that I can move each value to the correct place.


In [16]:
# save the split columns to lists to iterate over and change the values
list_1 = df["l1"].to_list()
list_2 = df["l2"].to_list()
list_3 = df["l3"].to_list()
list_4 = df["l4"].to_list()


In [17]:
# get unique values (these are the place names)
places = list(df["l3"].unique())


In [18]:
# positions in places that hold invalid names or NaN values
del_place = [1, 2, 6, 18]
# remove them with numpy and convert back to a list
places_1 = np.delete(places, del_place).tolist()


In [19]:
# list_3 holds the majority of the correct values.
# Iterate through it; if a value is not in the list of valid places,
# look in the other columns and append the result to a new list
nvalid = ("BIG TRIP", "Thailand")
place = []
for x in list_3:
    if x in places_1:
        place.append(x)
    elif x in nvalid and list_2[list_1.index(x)] in nvalid:
        place.append(list_1[list_3.index(x)])
    elif x in nvalid and list_1[list_3.index(x)] in nvalid:
        place.append(list_2[list_1.index(x)])
    elif x == "Accommodation":
        x = list_4[list_3.index(x)]
        place.append(x)
    else:
        place.append(x)
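
One thing to watch in a loop like this: `list.index(x)` returns the position of the first occurrence of `x`, not the position of the current row, so lookups into the other lists may read from a different row. A row-wise alternative is sketched below. It uses a simpler rule than the loop above (take the first label part that is not a known non-place tag); `pick_place` and the contents of `NOT_PLACES` are assumptions, and labels such as "Food" would have to be added to the exclusion set for real data.

```python
import pandas as pd

# Tags that are known not to be place names (assumed, incomplete list).
NOT_PLACES = {"BIG TRIP", "Thailand", "Accommodation", "NA"}


def pick_place(labels):
    """Return the first label part that is not a known non-place tag, else None."""
    if not isinstance(labels, str):
        return None
    for part in labels.split("|"):
        if part not in NOT_PLACES:
            return part
    return None


toy = pd.Series(
    ["BIG TRIP|Thailand|Pattaya", "Pattaya|Thailand|BIG TRIP", "BIG TRIP", "NA"]
)

# Each row is resolved from its own label string only.
place = toy.map(pick_place)
print(place.tolist())
```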


In [20]:
# append the new list to the data frame
df["place"] = place
df.head()


Unnamed: 0,account,subcategory,currency,amount,ref_currency_amount,type,payment_type,payment_type_local,note,date,labels,category,nature,time,l1,l2,l3,l4,place
0,Cash THB,Taxi,THB,40.0,1.1,Expenses,CASH,Cash,,2022-09-30 21:36:30,BIG TRIP|Thailand|Pattaya,Transportation,need,21:36:30,BIG TRIP,Thailand,Pattaya,,Pattaya
1,Cash THB,Taxi,THB,40.0,1.1,Expenses,CASH,Cash,,2022-09-30 21:12:16,BIG TRIP|Thailand|Pattaya,Transportation,need,21:12:16,BIG TRIP,Thailand,Pattaya,,Pattaya
2,Credit Card,"Phone, cell phone",EUR,12.33,12.33,Expenses,CREDIT_CARD,Credit card,Top-up,2022-09-30 19:16:32,Pattaya|Thailand|BIG TRIP,Communication and PC,need,19:16:32,Pattaya,Thailand,BIG TRIP,,Pattaya
3,Cash THB,Taxi,THB,40.0,1.1,Expenses,CASH,Cash,,2022-09-30 14:04:26,BIG TRIP|Thailand|Pattaya,Transportation,need,14:04:26,BIG TRIP,Thailand,Pattaya,,Pattaya
4,Cash THB,Taxi,THB,40.0,1.1,Expenses,CASH,Cash,,2022-09-30 13:50:27,BIG TRIP|Thailand|Pattaya,Transportation,need,13:50:27,BIG TRIP,Thailand,Pattaya,,Pattaya


In [21]:
# add a note to a wrongly recorded entry so that it is filtered out in the next step
df.loc[df["amount"] == 3000, ["note"]] = "Transfer"


In [22]:
# exclude/filter out deposit and transfer
filter_dep_trans = ["Transfer", "Deposit"]

df = df[~df.note.str.contains("|".join(filter_dep_trans))]
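
`str.contains` interprets its pattern as a regular expression, so joining the words with `|` matches any of them. The sketch below adds two optional safeguards that the cell above does not need for these particular words: `re.escape` in case a keyword contains regex characters, and `na=False` in case a note is still missing.

```python
import re

import pandas as pd

notes = pd.Series(["Transfer to savings", "Top-up", "NA", "Deposit", None])
excluded = ["Transfer", "Deposit"]

# Build one alternation pattern: "Transfer|Deposit".
pattern = "|".join(re.escape(word) for word in excluded)

# na=False treats missing notes as "no match" instead of propagating NaN.
mask = notes.str.contains(pattern, na=False)
kept = notes[~mask]
print(mask.tolist())
```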


In [23]:
# fill NaN values in the nature column with NA (for income entries)
df["nature"] = df["nature"].fillna(value="NA")


In [24]:
# create new column
df[["country", "lat", "lng"]] = "NA"


In [25]:
# split the data into the periods before and during travel
start_date = pd.Timestamp(2021, 10, 2)
end_date = pd.Timestamp(2022, 10, 24)

# .copy() creates independent frames, so later assignments do not act on a slice of df
home_df = df.loc[df["date"] < start_date].copy()
travel_df = df.loc[(df["date"] >= start_date) & (df["date"] <= end_date)].copy()

non_travel_exp = [
    "Camera expenses",
    "Electronics, accessories",
    "Books, audio, subscriptions",
    "Education, development",
    "Software, apps, games",
]

trip_label = "BIG TRIP"


In [26]:
# assign conditional column
home_df["travel_expense"] = home_df["labels"].str.contains(trip_label)
travel_df["travel_expense"] = ~travel_df["subcategory"].isin(non_travel_exp)


In [27]:
# inspect the dataframe and correct some values
travel_df.loc[travel_df["l2"] == "Phuket", ["place"]] = "Phuket"
travel_df.loc[travel_df["place"] == "Road trip", ["place"]] = "Sangkhlaburi"
travel_df.loc[travel_df["place"] == "BIG TRIP", ["place"]] = "Bangkok"
travel_df.loc[travel_df["type"] == "Income", ["travel_expense"]] = "NA"


In [28]:
# change travel expense entries to False
travel_df.loc[
    (travel_df["subcategory"] == "Internet") & (travel_df["currency"] == "EUR"),
    ["travel_expense"],
] = False
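
The pattern used in the last few cells (split by date, flag rows, then override some of them) can be tried out on a toy frame. This is a sketch with invented rows; it assumes `pd.Timestamp` for the boundary date and an explicit `.copy()` so that the assignments target an independent frame.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-09-29", "2021-12-22", "2022-03-11"]),
        "subcategory": ["Public transport", "Internet", "Taxi"],
        "currency": ["EUR", "EUR", "THB"],
    }
)

start_date = pd.Timestamp(2021, 10, 2)

# .copy() gives an independent frame, so later assignments do not target a slice.
travel = toy.loc[toy["date"] >= start_date].copy()

# A boolean expression already is a True/False column.
travel["travel_expense"] = ~travel["subcategory"].isin(["Software, apps, games"])

# Override selected rows with a combined mask.
home_bill = (travel["subcategory"] == "Internet") & (travel["currency"] == "EUR")
travel.loc[home_bill, "travel_expense"] = False
print(travel)
```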


In [29]:
# apply filter where travel expense = True and place is NaN
travel_df.loc[
    (travel_df["place"].isnull()) & (travel_df["travel_expense"] == True)
].head()


Unnamed: 0,account,subcategory,currency,amount,ref_currency_amount,type,payment_type,payment_type_local,note,date,...,time,l1,l2,l3,l4,place,country,lat,lng,travel_expense
58,Credit Card,Long distance,EUR,423.13,423.13,Expenses,CREDIT_CARD,Credit card,Flight back home,2022-09-23 09:42:05,...,09:42:05,BIG TRIP,,,,,,,,True
1134,DKB Visa,"Drug-store, chemist",EUR,6.14,6.14,Expenses,CREDIT_CARD,Credit card,Ocean brush,2022-03-11 10:52:14,...,10:52:14,BIG TRIP,Thailand,,,,,,,True
1470,DKB Visa,Missing,EUR,15.01,15.01,Expenses,CASH,Cash,,2022-01-03 10:44:28,...,10:44:28,,,,,,,,,True
1473,Cash THB,Missing,THB,1359.78,34.88,Expenses,CASH,Cash,,2022-01-03 10:39:36,...,10:39:36,,,,,,,,,True
1567,Debit Card,"Phone, cell phone",EUR,9.99,9.99,Expenses,DEBIT_CARD,Debit card,Klarmobil,2021-12-22 14:48:44,...,14:48:44,,,,,,,,,True


In [30]:
# fill NaN values in the place variable with NA
travel_df["place"] = travel_df["place"].fillna(value="NA")


In [31]:
# assign country to rows where travel expense is true
travel_df.loc[travel_df["travel_expense"] == True, ["country"]] = "Thailand"


In [32]:
# fill NaN values in the place variable with NA
home_df["place"] = home_df["place"].fillna(value="NA")


In [33]:
# combine the datasets again
df = pd.concat([home_df, travel_df])


In [34]:
# finally drop not needed columns
df.drop(["labels", "l1", "l2", "l3", "l4"], axis=1, inplace=True)


In [35]:
# check summary for each column to spot possible issues
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2478 entries, 2336 to 2335
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   account              2478 non-null   object        
 1   subcategory          2478 non-null   object        
 2   currency             2478 non-null   object        
 3   amount               2478 non-null   float64       
 4   ref_currency_amount  2478 non-null   float64       
 5   type                 2478 non-null   object        
 6   payment_type         2478 non-null   object        
 7   payment_type_local   2478 non-null   object        
 8   note                 2478 non-null   object        
 9   date                 2478 non-null   datetime64[ns]
 10  category             2478 non-null   object        
 11  nature               2478 non-null   object        
 12  time                 2478 non-null   object        
 13  place                2478 non-

#### **Get latitude and longitude for the places**


In [36]:
import urllib.request
import urllib.parse
import urllib.error
import json
import ssl

api_key = False
# If you have a Google Places API key, enter it here
# api_key = 'AIzaSy___IDByT70'
# https://developers.google.com/maps/documentation/geocoding/intro

if api_key is False:
    api_key = 42
    serviceurl = "http://py4e-data.dr-chuck.net/json?"
else:
    serviceurl = "https://maps.googleapis.com/maps/api/geocode/json?"

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

keys = df["place"].unique().tolist()
keys.remove("NA")
geodata = list()

for place in keys:
    parms = dict()
    parms["address"] = place

    if api_key is not False:
        parms["key"] = api_key
    url = serviceurl + urllib.parse.urlencode(parms)

    print("Retrieving", url)
    uh = urllib.request.urlopen(url, context=ctx)
    data = uh.read().decode()
    print("Retrieved", len(data), "characters")

    try:
        js = json.loads(data)
    except json.JSONDecodeError:
        js = None

    if not js or "status" not in js or js["status"] != "OK":
        print("==== Failure To Retrieve ====")
        print(data)
        # keep geodata aligned with keys so that a later zip pairs each place correctly
        geodata.append([None, None])
        continue

    lat = js["results"][0]["geometry"]["location"]["lat"]
    lng = js["results"][0]["geometry"]["location"]["lng"]
    geodata.append([lat, lng])
    print("lat", lat, "lng", lng)
    location = js["results"][0]["formatted_address"]
    print(location)


Retrieving http://py4e-data.dr-chuck.net/json?address=Pattaya&key=42
Retrieved 1938 characters
lat 12.9235557 lng 100.8824551
Pattaya City, Bang Lamung District, Chon Buri 20150, Thailand
Retrieving http://py4e-data.dr-chuck.net/json?address=Bangkok&key=42
Retrieved 1526 characters
lat 13.7563309 lng 100.5017651
Bangkok, Thailand
Retrieving http://py4e-data.dr-chuck.net/json?address=Chiang+Mai&key=42
Retrieved 1840 characters
lat 18.7883439 lng 98.98530079999999
Chiang Mai, Mueang Chiang Mai District, Chiang Mai, Thailand
Retrieving http://py4e-data.dr-chuck.net/json?address=Koh+Chang&key=42
Retrieved 1597 characters
lat 12.0479159 lng 102.3234816
Ko Chang District, Trat, Thailand
Retrieving http://py4e-data.dr-chuck.net/json?address=Koh+Kud&key=42
Retrieved 3564 characters
lat 11.6680759 lng 102.5642261
Koh Kood, Ko Kut, Ko Kut District, Trat, Thailand
Retrieving http://py4e-data.dr-chuck.net/json?address=Ratchaburi&key=42
Retrieved 1393 characters
lat 13.5282893 lng 99.8134211
Ratcha

In [37]:
# use the zip function to make a dict from two lists
geo_dict = dict(zip(keys, geodata))
geo_dict


{'Pattaya': [12.9235557, 100.8824551],
 'Bangkok': [13.7563309, 100.5017651],
 'Chiang Mai': [18.7883439, 98.98530079999999],
 'Koh Chang': [12.0479159, 102.3234816],
 'Koh Kud': [11.6680759, 102.5642261],
 'Ratchaburi': [13.5282893, 99.8134211],
 'Hua Hin': [12.5683747, 99.9576888],
 'Khao Yai': [14.4391554, 101.3722299],
 'Ko Larn': [12.9182259, 100.7802624],
 'Sangkhlaburi': [15.1542081, 98.45306579999999],
 'Kanchanaburi': [14.1011393, 99.4179431],
 'Suratthani': [9.134194899999999, 99.3334198],
 'Khao Sok': [8.9873143, 98.6294329],
 'Krabi': [8.0854803, 98.9062856],
 'Phuket': [7.8804479, 98.3922504]}
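
Pairing two lists with `zip` relies on both having one entry per place in the same order. An alternative that cannot drift is to fill the dictionary directly, keyed by place, and to skip failed lookups. The sketch below does this without any network access: `parse_location` is a hypothetical helper, and `fake_responses` holds canned JSON shaped like the geocoding responses read above.

```python
import json


def parse_location(payload):
    """Extract (lat, lng) from a geocoding-style JSON string, or None on failure."""
    try:
        js = json.loads(payload)
    except json.JSONDecodeError:
        return None
    if not isinstance(js, dict) or js.get("status") != "OK" or not js.get("results"):
        return None
    loc = js["results"][0]["geometry"]["location"]
    return loc["lat"], loc["lng"]


def ok_response(lat, lng):
    """Build a minimal successful response body (test data only)."""
    return json.dumps(
        {"status": "OK", "results": [{"geometry": {"location": {"lat": lat, "lng": lng}}}]}
    )


# Canned responses stand in for the HTTP calls.
fake_responses = {
    "Pattaya": ok_response(12.9235557, 100.8824551),
    "Nowhere": json.dumps({"status": "ZERO_RESULTS", "results": []}),
    "Bangkok": ok_response(13.7563309, 100.5017651),
}

geo = {}
for name, payload in fake_responses.items():
    coords = parse_location(payload)
    if coords is not None:
        geo[name] = list(coords)  # failed lookups are simply left out

print(geo)
```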

In [38]:
# and finally map the dict values to the dataframe
df["gdata"] = df["place"].map(geo_dict)
df.head()


Unnamed: 0,account,subcategory,currency,amount,ref_currency_amount,type,payment_type,payment_type_local,note,date,category,nature,time,place,country,lat,lng,travel_expense,gdata
2336,Debit Card,Postal services,EUR,0.8,0.8,Expenses,DEBIT_CARD,Debit card,,2021-10-01 13:58:25,Communication and PC,need,13:58:25,,,,,False,
2337,Debit Card,"Energy, utilities",EUR,32.0,32.0,Expenses,DEBIT_CARD,Debit card,,2021-10-01 13:58:01,Housing,need,13:58:01,,,,,False,
2338,DKB Visa,"Home, garden",EUR,9.9,9.9,Expenses,CREDIT_CARD,Credit card,,2021-09-30 00:16:25,Shopping,need,00:16:25,,,,,True,
2339,Debit Card,"Phone, cell phone",EUR,10.59,10.59,Expenses,DEBIT_CARD,Debit card,Klarmobil,2021-09-29 22:40:03,Communication and PC,need,22:40:03,,,,,False,
2340,Debit Card,Public transport,EUR,9.9,9.9,Expenses,DEBIT_CARD,Debit card,,2021-09-29 11:28:49,Transportation,need,11:28:49,,,,,False,


In [39]:
df["gdata"].fillna(value="NA", inplace=True)


In [42]:
# split into rows with and without a place; .copy() avoids assigning to a slice of df
df_gdata = df[df["place"] != "NA"].copy()
df_wo_gdata = df[df["place"] == "NA"].copy()


In [None]:
# latitude and longitude are stored in one column; split it into two columns
df_gdata[["lat", "lng"]] = pd.DataFrame(df_gdata.gdata.to_list(), index=df_gdata.index)


In [None]:
# set lat and lng columns to NA
df_wo_gdata[["lat", "lng"]] = "NA"


In [47]:
# concat the dataframes
df = pd.concat([df_gdata, df_wo_gdata])
df.drop("gdata", axis=1, inplace=True)
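
A possible shortcut for the split, fill and concat steps is to turn the coordinate dictionary into a small lookup table and join it on the place name. This is a sketch on toy data; note that rows without a match receive NaN rather than the string "NA", which keeps `lat` and `lng` numeric.

```python
import pandas as pd

geo = {"Pattaya": [12.9235557, 100.8824551], "Bangkok": [13.7563309, 100.5017651]}
toy = pd.DataFrame({"place": ["Pattaya", "NA", "Bangkok"]})

# One row per place, indexed by the place name.
coords = pd.DataFrame.from_dict(geo, orient="index", columns=["lat", "lng"])

# Join the lookup table on the "place" column; unmatched rows get NaN.
toy = toy.join(coords, on="place")
print(toy)
```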


#### **Write cleaned data to CSV**


In [48]:
df.to_csv("data/2022-10-29_Expenses_clean.csv", index=False)


#### **Write the data to a SQLite database file.**


In [49]:
# write data to a SQLite database file
import sqlite3 as sq

sql_data = "data/EXPENSES.db"
conn = sq.connect(sql_data)
cur = conn.cursor()
cur.execute("""DROP TABLE IF EXISTS travel_expenses""")
df.to_sql("expenses", conn, if_exists="replace", index=False)
conn.commit()
conn.close()
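
To confirm that a write like this worked, the table can be read back right away. The sketch below uses an in-memory database so that it stays self-contained; the table name matches the cell above, the rows are invented.

```python
import sqlite3
from contextlib import closing

import pandas as pd

toy = pd.DataFrame({"place": ["Pattaya", "Bangkok"], "amount": [40.0, 12.33]})

# ":memory:" keeps the example self-contained; the notebook writes to a file instead.
# closing() makes sure the connection is closed even if an error occurs.
with closing(sqlite3.connect(":memory:")) as conn:
    toy.to_sql("expenses", conn, if_exists="replace", index=False)
    row_count = conn.execute("SELECT COUNT(*) FROM expenses").fetchone()[0]
    round_trip = pd.read_sql("SELECT place, amount FROM expenses", conn)

print(row_count)
print(round_trip)
```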
