# Financial Accounting ETL

This notebook transforms the [Online Retail](https://archive.ics.uci.edu/dataset/352/online+retail) dataset into a double-entry bookkeeping format. In practice, financial data would typically arrive from an ERP system (e.g. Workday, SAP) already in double-entry form, so this conversion stands in for that export.

In [63]:
import pandas as pd

In [64]:
online_schema = {
    'InvoiceNo': 'string',
    'InvoiceDate': 'string',
    'StockCode': 'string',
    'Quantity': 'int64',
    'UnitPrice' : 'float64',
    'CustomerID' : 'string',
    'Country' : 'string'
}
schema_de = {
    'InvoiceNo': 'string',
    'InvoiceDate': 'string',
    'Type': 'string',
    'Account': 'string',
    'Amount': 'float64',
    'CustomerID' : 'string',
    'Country' : 'string'
}

In [65]:
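# Draw a 100,000-row random sample and save it so later steps can reuse the same rows.
online = pd.read_csv('../data/online_retail.csv', dtype=online_schema).sample(100000)
online.to_csv('../data/online_retail_sampled.csv', index=False)
# Drop the time of day but keep a datetime dtype.
online['InvoiceDate'] = pd.to_datetime(online['InvoiceDate']).dt.normalize()
online['PurchaseTotal'] = online['Quantity'] * online['UnitPrice']
# Shift dates forward by 52*11 weeks (roughly 11 years). Timedelta has no year unit
# because years vary in length; pd.DateOffset(years=11) would shift by exact calendar years.
online['InvoiceDate'] = online['InvoiceDate'] + pd.to_timedelta(52*11, unit='W')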
online.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | PurchaseTotal |
|---|---|---|---|---|---|---|---|---|---|
| 262786 | 559923 | 20761 | BLUE PAISLEY SKETCHBOOK | 1 | 2022-06-29 | 7.46 | | United Kingdom | 7.46 |
| 94294 | 544341 | 22671 | FRENCH LAUNDRY SIGN BLUE METAL | 12 | 2022-02-04 | 1.65 | 13012.0 | United Kingdom | 19.8 |
| 232073 | 557301 | 22191 | IVORY DINER WALL CLOCK | 2 | 2022-06-05 | 8.5 | 15373.0 | United Kingdom | 17.0 |
| 70800 | 542102 | 21621 | VINTAGE UNION JACK BUNTING | 12 | 2022-01-11 | 8.5 | 12744.0 | Singapore | 102.0 |
| 431690 | 573744 | 23503 | PLAYING CARDS KEEP CALM & CARRY ON | 12 | 2022-10-18 | 1.25 | 17733.0 | United Kingdom | 15.0 |
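
The date shift in the cell above uses a fixed number of weeks, so it is not an exact 11-year move. The sketch below compares it with a calendar-aware offset on two illustrative dates (the input dates are assumptions chosen for the example, not values read from the dataset). It is only meant to show the size of the difference.

```python
import pandas as pd

# Illustrative input dates (assumed for this example).
dates = pd.Series(pd.to_datetime(["2011-07-13", "2011-02-18"]))

# Fixed-length shift, as in the notebook: 52 * 11 = 572 weeks = 4004 days.
shifted_weeks = dates + pd.to_timedelta(52 * 11, unit="W")

# Calendar-aware shift: same month and day, 11 years later.
shifted_years = dates + pd.DateOffset(years=11)

# Difference between the two approaches, in days.
gap_days = (shifted_years - shifted_weeks).dt.days

print(shifted_weeks.dt.date.tolist())
print(shifted_years.dt.date.tolist())
print(gap_days.tolist())
```

For these dates the week-based shift lands 14 days earlier than the calendar shift. Whether that matters depends on whether downstream reports need the original month and day preserved.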


# Enforcing Double-Entry Bookkeeping Compliant Format

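The source data is single-entry: each row records one invoice line with its quantity and unit price. To convert it to double-entry, we emit two rows per invoice line, a debit to Accounts Receivable and a credit to Revenue for the same amount, so that total debits equal total credits and the accounting equation stays balanced.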
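
As a small worked example, here is how a single sale could expand into its two ledger lines. The values mirror one row of the sample output above (12 units at 1.65); the variable names are illustrative and not part of the notebook's pipeline.

```python
import pandas as pd

# One invoice line: 12 units at 1.65 each.
sale = {"InvoiceNo": "544341", "Quantity": 12, "UnitPrice": 1.65}
amount = round(sale["Quantity"] * sale["UnitPrice"], 2)

# The same amount is recorded twice: once as a debit, once as a credit.
entries = pd.DataFrame([
    {"InvoiceNo": sale["InvoiceNo"], "Type": "Debit",
     "Account": "Accounts Receivable", "Amount": amount},
    {"InvoiceNo": sale["InvoiceNo"], "Type": "Credit",
     "Account": "Revenue", "Amount": amount},
])

debits = entries.loc[entries["Type"] == "Debit", "Amount"].sum()
credits = entries.loc[entries["Type"] == "Credit", "Amount"].sum()

print(entries)
print("balanced:", debits == credits)
```

Because both lines carry the same amount, debits minus credits is zero for every sale, which is the property the transformation needs to preserve.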

In [68]:
# Keep only rows that have every field the ledger needs.
online_cleaned = online.dropna(subset=['Description', 'Quantity', 'UnitPrice', 'CustomerID', 'Country'])

double_entry = pd.DataFrame(columns=schema_de.keys()).astype(schema_de)

In [67]:
def create_double_entry(row):
    """Return a two-row frame (debit, then credit) for one invoice line."""
    debit_row = {
        'InvoiceNo': row['InvoiceNo'],
        'InvoiceDate': row['InvoiceDate'],
        'Type': 'Debit',
        'Account': 'Accounts Receivable',
        'Amount': row['PurchaseTotal'],
        'CustomerID': row['CustomerID'],
        'Country': row['Country']
    }

    credit_row = {
        'InvoiceNo': row['InvoiceNo'],
        'InvoiceDate': row['InvoiceDate'],
        'Type': 'Credit',
        'Account': 'Revenue',
        'Amount': row['PurchaseTotal'],
        'CustomerID': row['CustomerID'],
        'Country': row['Country']
    }
    return pd.DataFrame([debit_row, credit_row], columns=schema_de.keys())

# Build one two-row frame per invoice line, then stack them into a single ledger.
double_entry = pd.concat(online_cleaned.apply(create_double_entry, axis=1).to_list(), ignore_index=True)
double_entry.to_csv('../data/double_entry_online_retail.csv', index=False)
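
Building a small DataFrame for every input row can be slow on a 100,000-row sample. The following is a minimal sketch of a vectorized alternative together with a balance check, assuming the same column names as above. `to_double_entry`, `is_balanced`, and `LEDGER_COLUMNS` are hypothetical names introduced here for illustration, and the sample rows are copied from the output shown earlier in this notebook.

```python
import pandas as pd

LEDGER_COLUMNS = ["InvoiceNo", "InvoiceDate", "Type", "Account",
                  "Amount", "CustomerID", "Country"]

def to_double_entry(sales: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical vectorized counterpart to create_double_entry."""
    base = sales[["InvoiceNo", "InvoiceDate", "CustomerID", "Country"]].copy()
    base["Amount"] = sales["PurchaseTotal"]
    debit = base.assign(Type="Debit", Account="Accounts Receivable")
    credit = base.assign(Type="Credit", Account="Revenue")
    # A stable sort on the shared index puts each debit directly before its credit.
    ledger = pd.concat([debit, credit]).sort_index(kind="mergesort")
    return ledger[LEDGER_COLUMNS].reset_index(drop=True)

def is_balanced(ledger: pd.DataFrame, tol: float = 1e-9) -> bool:
    """Check that debits equal credits within every invoice."""
    signed = ledger["Amount"].where(ledger["Type"] == "Debit", -ledger["Amount"])
    net = signed.groupby(ledger["InvoiceNo"]).sum()
    return bool((net.abs() <= tol).all())

# Three rows taken from the sample output shown earlier in this notebook.
sample = pd.DataFrame({
    "InvoiceNo": ["544341", "557301", "542102"],
    "InvoiceDate": pd.to_datetime(["2022-02-04", "2022-06-05", "2022-01-11"]),
    "CustomerID": ["13012.0", "15373.0", "12744.0"],
    "Country": ["United Kingdom", "United Kingdom", "Singapore"],
    "PurchaseTotal": [19.8, 17.0, 102.0],
})

ledger = to_double_entry(sample)
print(ledger)
print("balanced:", is_balanced(ledger))
```

The same `is_balanced` check could be run on `double_entry` before writing the CSV, as a guard against rows being dropped or duplicated on only one side of the ledger.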