# DataCreation

In this notebook, we construct a synthetic bank statement covering January 2015 through July 2017.

In [None]:
import pandas as pd
import numpy as np
import numpy.random as npr

We set up two dictionaries: one for regular payments and one for expenses that occur (relatively) unscheduled. A regular payment is characterized simply by its amount. An unscheduled expense is associated with a tuple of three values: the probability of the expense occurring on any given day, and the mean and standard deviation of the amount spent.

In [None]:
regular = {}
regular["Rent"] = 1000
regular["Car Loan"] = 250
regular["Insurances"] = 30
regular["Phone"] = 60

random = {}
random["CoffeeBrothers"] = (6./7, 2., 0.1)
random["BeansBeansBeans"] = (3./7, 2.2, 0.3)
random["Who likes dough?"] = (2./30, 5., 3.)
random["BestBakers"] = (5./30, 3, 2)
random["Cuisine Francaise"] = (2./50, 100, 10)
random["Best Burgers"] = (1./50, 7, 2)
random["Gimme Gyros"] = (1./50, 4, 1)
random["Bernhard Bratwurst"] = (3./50, 2, 0.05)
random["Trattoria Accento"] = (1./40, 15, 1)
random["Convencience"] = (1./20, 7., 2)
random["Plumber"] = (1./300, 70, 30)
random["DIY Furniture"] = (5./300, 150, 20)
random["Electrobuddy"] = (1./100, 500, 15)
random["Farmer's Market"] = (1./7, 25, 1)
random["Franny's Fantastic Food"] = (1./7, 60, 4)
random["Super Foods Market"] = (1./4, 60, 3)
random["Gas"] = (1./8, 65, 1)
random["Clothes left and right"] = (1./30, 45, 5)

messages = ["Thank you for your purchase", "See you soon!", "We value your business", "Always a good choice", ""]
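
To get a feel for what these tuples imply, the sketch below estimates the expected number of purchases and the approximate yearly spending for a few entries. It assumes a 365-day year and ignores the lower bound that is applied to the amounts later in the notebook. The helper name `expected_yearly` and the `sample` dictionary are illustrative only and not part of the notebook.

```python
def expected_yearly(prob, mu, days=365):
    """Return (expected purchase count, approximate expected spending) per year."""
    count = prob * days
    return count, count * mu

# A few entries copied from the `random` dictionary (probability, mean, std)
sample = {
    "CoffeeBrothers": (6. / 7, 2., 0.1),
    "Gas": (1. / 8, 65, 1),
    "Plumber": (1. / 300, 70, 30),
}

for name, (prob, mu, sigma) in sample.items():
    count, spend = expected_yearly(prob, mu)
    print("{0}: ~{1:.1f} purchases, ~{2:.0f} spent per year".format(name, count, spend))
```

The standard deviation does not enter the estimate, since it only spreads individual amounts around the mean.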

The data frame is constructed by providing the column names and a preliminary length equal to the number of regular payments times the number of months in the period, one row per payment per month. Strictly speaking, this is not necessary, but it lets pandas allocate the memory up front. This matters more for larger datasets, where adding data row by row can become prohibitively expensive.

In [None]:
# pd.datetime was deprecated in pandas 1.0 and is unavailable in current releases; use pd.Timestamp
frst_day = pd.Timestamp(2015, 1, 1)
last_day = pd.Timestamp(2017, 7, 31)

# One row per regular payment and month ("MS" = month start)
months = pd.date_range(frst_day, last_day, freq="MS")
bank_account = pd.DataFrame(columns=["Date", "Recipient", "Subject", "ID", "Amount"], index=range(len(months) * len(regular)))

r_idx = 0
for d in pd.date_range(frst_day, last_day, freq="MS"):
    for name, amount in regular.items():
        entry = [d, name, "", "", -amount]
        bank_account.loc[r_idx, :] = entry
        r_idx += 1        
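
An alternative to pre-allocating and filling rows with `.loc` is to collect the rows in a plain Python list and build the frame once. This is a minimal sketch and not the approach used in this notebook; `build_regular_frame` is a hypothetical helper, and the small `regular_demo` dictionary and the three-month range exist only to keep the block self-contained.

```python
import pandas as pd

def build_regular_frame(regular, first_day, last_day):
    """Build one row per regular payment and month in a single DataFrame call."""
    rows = []
    for d in pd.date_range(first_day, last_day, freq="MS"):
        for name, amount in regular.items():
            # Payments are stored as negative amounts, as in the loop above
            rows.append({"Date": d, "Recipient": name, "Subject": "", "ID": "", "Amount": -amount})
    return pd.DataFrame(rows, columns=["Date", "Recipient", "Subject", "ID", "Amount"])

regular_demo = {"Rent": 1000, "Phone": 60}
demo = build_regular_frame(regular_demo, pd.Timestamp(2015, 1, 1), pd.Timestamp(2015, 3, 31))
print(demo)
```

Building the frame in one call avoids having to know the final length in advance.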

For the random entries, we first construct a range of all days in the period. Using NumPy's random number generators, we decide for each day whether a purchase happened. For each purchase, we also choose a random message and a random ID. The data is then compiled in a separate dataframe, which also shows how to construct a dataframe from a dictionary, and appended to the overall frame.

In [None]:
days = pd.date_range(frst_day, last_day, freq="1d")

for recipient, props in random.items():
    prob = props[0]
    mu = props[1]
    sigma = props[2]
    
    purchased = npr.rand(len(days)) < prob
    dates = days[purchased]
    recipients = [recipient] * len(dates)
    subjects = npr.choice(messages, len(dates))
    ids = npr.randint(10000, 100000, len(dates))
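    # Draw amounts, bound them from below at mu - sigma, and store them as negative (outgoing) values
    amounts = -1. * np.around([max(mu - sigma, a) for a in npr.normal(mu, sigma, len(dates))], decimals=2)
    
    to_insert = pd.DataFrame({"Date":dates, "Recipient":recipients, "Subject":subjects, "ID":ids, "Amount":amounts})
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    bank_account = pd.concat([bank_account, to_insert], ignore_index=True)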
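
The sampling step in the loop above can be isolated into a small function to make it easier to inspect. The sketch below is an assumption-laden variant: it swaps the notebook's `npr` calls for NumPy's newer `Generator` API with a fixed seed so that runs are reproducible, and `simulate_purchases` and `demo_days` are hypothetical names.

```python
import numpy as np
import pandas as pd

def simulate_purchases(days, prob, mu, sigma, rng):
    """Sample purchase days and amounts for a single recipient."""
    # One uniform draw per day; a purchase happens when the draw falls below prob
    purchased = rng.random(len(days)) < prob
    dates = days[purchased]
    # Normally distributed amounts, bounded from below at mu - sigma
    raw = rng.normal(mu, sigma, len(dates))
    amounts = -np.round(np.maximum(raw, mu - sigma), 2)
    return pd.DataFrame({"Date": dates, "Amount": amounts})

rng = np.random.default_rng(42)  # fixed seed, chosen arbitrarily
demo_days = pd.date_range("2015-01-01", "2015-12-31", freq="D")
coffee = simulate_purchases(demo_days, 6. / 7, 2., 0.1, rng)
print(len(coffee), coffee["Amount"].min(), coffee["Amount"].max())
```

Because of the lower bound, no purchase is ever cheaper than `mu - sigma`, which also keeps all amounts negative after the sign flip as long as `mu > sigma`.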

After all data is inserted, the dataframe is sorted by date and saved as CSV. To make the analysis a bit more difficult, we create overlapping datasets, so that duplicate transactions have to be filtered out later.

In [None]:
# Convert to datetime first so that sorting compares proper timestamps
bank_account["Date"] = pd.to_datetime(bank_account["Date"])
bank_account.sort_values(["Date"], inplace=True)

checkpoints = [frst_day, pd.Timestamp(2015, 4, 16), pd.Timestamp(2015, 12, 1), pd.Timestamp(2016, 7, 17), pd.Timestamp(2016, 9, 23), pd.Timestamp(2017, 3, 1), last_day]

for first, last in zip(checkpoints[:-2], checkpoints[2:]):
    # Zero-padded dates keep the file names unambiguous (e.g. 20150101 rather than 201511)
    f_str = first.strftime("%Y%m%d")
    l_str = last.strftime("%Y%m%d")
    
    bank_account[(bank_account.Date >= first) & (bank_account.Date <= last)].to_csv("money_bin_{0}_{1}.csv".format(f_str, l_str), index=False)
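
Because consecutive files overlap, the same transaction can appear more than once when the files are read back. One possible way to recombine them is to concatenate the parts and drop exact duplicate rows, as sketched below with two small in-memory frames standing in for the CSV files. This assumes that a duplicate is a row that matches in every column; two genuinely distinct purchases that happen to agree in all columns would be merged as well.

```python
import pandas as pd

# Two overlapping excerpts; the "Gas" row on 2015-01-03 appears in both
part_a = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-03"]),
    "Recipient": ["Rent", "CoffeeBrothers", "Gas"],
    "ID": ["", "12345", "54321"],
    "Amount": [-1000.0, -2.01, -65.2],
})
part_b = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-03", "2015-01-04"]),
    "Recipient": ["Gas", "CoffeeBrothers"],
    "ID": ["54321", "23456"],
    "Amount": [-65.2, -1.98],
})

combined = (
    pd.concat([part_a, part_b], ignore_index=True)
    .drop_duplicates()          # remove rows that are identical in every column
    .sort_values("Date")
    .reset_index(drop=True)
)
print(combined)
```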