# Sam's awesome Machine Learning experiment

This is my attempt to train a machine learning system to auto-categorise my transactions. Leggo.

## Problem statement

Use the machine learning workflow to process and transform my personal categorisation data (up to August 2016) into a prediction model. This model must predict how I will categorise future transactions with at least 90% accuracy.

**TODO:** Check the current accuracy of 22seven's keyword categorisation system on my own data, as of today.

## Import things

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

## Load the csv

In [5]:
fs_base = '/Users/Sam/Source/categorisation/'
training_url = fs_base + 'categorised-txns-training-set.csv'

df = pd.read_csv(training_url)
df

Unnamed: 0,id,date,description,account,amount,category
0,SBE0001,23/08/2016,Uber Za Aug20 Vcftj 4* *28,A Wallet,-170.00,Public Transport & Taxi Services
1,SBE0002,23/08/2016,Uber Bv 4* *28,A Wallet,-97.00,Public Transport & Taxi Services
2,SBE0003,22/08/2016,Epspg_Wechat-Haa5u1h5 _ Rosebank,FNB credit card,-25.00,Eating Out & Takeouts
3,SBE0004,21/08/2016,Ciao Bella Strand Za,FNB credit card,-170.00,Eating Out & Takeouts
4,SBE0005,21/08/2016,Cafeen Claremont,FNB credit card,-130.00,Eating Out & Takeouts
5,SBE0006,20/08/2016,Wechat Snapscan Rosebank,FNB credit card,-30.00,Coffee
6,SBE0007,19/08/2016,Vineyard Service Stationnewlands Za,FNB credit card,-295.15,Eating Out & Takeouts
7,SBE0008,19/08/2016,St Park Solutions Cashi Cape Town,FNB credit card,-104.00,"Tolls, Tickets & Parking"
8,SBE0009,18/08/2016,Uber Za Aug16 Zucne 4* *28,A Wallet,-62.00,Public Transport & Taxi Services
9,SBE0010,18/08/2016,Mpower,A Wallet,5000.00,Income
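
The `Unnamed: 0` column in the output above is what pandas shows when a CSV carries a saved row index, and the `date` column looks day-first (`23/08/2016`). Here is a minimal sketch of one way to handle both at load time, assuming those two readings are right. It uses three rows copied from the output above in an in-memory buffer rather than the real file, and the names `sample_csv` and `sample_df` are just for illustration.

```python
import io
import pandas as pd

# Three rows copied from the output above, standing in for the real CSV file.
sample_csv = io.StringIO(
    ",id,date,description,account,amount,category\n"
    "0,SBE0001,23/08/2016,Uber Za Aug20 Vcftj 4* *28,A Wallet,-170.00,Public Transport & Taxi Services\n"
    "7,SBE0008,19/08/2016,St Park Solutions Cashi Cape Town,FNB credit card,-104.00,\"Tolls, Tickets & Parking\"\n"
    "9,SBE0010,18/08/2016,Mpower,A Wallet,5000.00,Income\n"
)

# index_col=0 reuses the saved index instead of creating an 'Unnamed: 0' column.
sample_df = pd.read_csv(sample_csv, index_col=0)

# Assumption: dates are day/month/year, so the format is given explicitly.
sample_df['date'] = pd.to_datetime(sample_df['date'], format='%d/%m/%Y')

print(sample_df.dtypes)
```

Parsing the dates up front would make it easier to split the data by time later, for example to hold out the most recent month for testing.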


## Check that we're covering rare categories

In [9]:
num_publictransport = len(df.loc[df['category'] == 'Public Transport & Taxi Services'])
num_gifts = len(df.loc[df['category'] == 'Gifts'])
print('Number of Public Transport & Taxi transactions: {0} ({1:2.2f}%)'.format(num_publictransport, (num_publictransport / len(df)) * 100))
print('Number of Gift transactions: {0} ({1:2.2f}%)'.format(num_gifts, (num_gifts / len(df)) * 100))

Number of Public Transport & Taxi transactions: 243 (12.47%)
Number of Gift transactions: 56 (2.87%)
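
To check every category at once rather than two hand-picked ones, `value_counts` can produce the same counts and percentages. This is a small sketch using the categories of the ten rows shown earlier as stand-in data. The real `df` has far more rows, so the percentages here will not match the output above.

```python
import pandas as pd

# Stand-in data: the categories of the ten rows shown in the CSV output.
sample = pd.DataFrame({'category': [
    'Public Transport & Taxi Services', 'Public Transport & Taxi Services',
    'Eating Out & Takeouts', 'Eating Out & Takeouts', 'Eating Out & Takeouts',
    'Coffee', 'Eating Out & Takeouts', 'Tolls, Tickets & Parking',
    'Public Transport & Taxi Services', 'Income',
]})

# Absolute counts per category, most frequent first.
counts = sample['category'].value_counts()

# normalize=True returns fractions, so multiply by 100 for percentages.
percentages = sample['category'].value_counts(normalize=True) * 100

summary = pd.DataFrame({'count': counts, 'percent': percentages.round(2)})
print(summary)
```

Sorting `summary` by `count` ascending would put the rarest categories at the top, which are the ones most likely to be under-represented in training.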


## INTERLUDE: which algorithm to choose?

What I'm looking for is a supervised classification algorithm that can handle text.
Naive Bayes is a place to start, but it will treat all fields as equally important.
Logistic regression is better.
Decision Trees sound ideal but are super complicated.
Random Forests :)
SVM? That sounds like exactly the right thing. Support Vector Machines.
Essentially I need a neural net?

In summary: start with Naive Bayes, progress to SVM.
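
As a rough illustration of the "start with Naive Bayes" step, here is a sketch using scikit-learn. That library is my assumption, since nothing has been chosen at this point in the notebook. It trains on six descriptions copied from the sample output above, uses only the `description` text, and is far too small to say anything about real accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Six (description, category) pairs copied from the sample rows above.
descriptions = [
    'Uber Za Aug20 Vcftj 4* *28',
    'Uber Bv 4* *28',
    'Ciao Bella Strand Za',
    'Cafeen Claremont',
    'Wechat Snapscan Rosebank',
    'St Park Solutions Cashi Cape Town',
]
labels = [
    'Public Transport & Taxi Services',
    'Public Transport & Taxi Services',
    'Eating Out & Takeouts',
    'Eating Out & Takeouts',
    'Coffee',
    'Tolls, Tickets & Parking',
]

# CountVectorizer turns each description into word counts;
# MultinomialNB then learns per-category word frequencies.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(descriptions, labels)

# A description the model has not seen, also taken from the sample rows.
prediction = model.predict(['Uber Za Aug16 Zucne 4* *28'])[0]
print(prediction)
```

Because the classifier sits inside a pipeline, swapping `MultinomialNB()` for an SVM later should only mean changing that one step.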

This could be a useful thing to read later: http://www.nltk.org/book/

## Transform some columns to integers

In [13]:
account_map = {'A Wallet' : 0, 'FNB credit card' : 1, 'Top 40 Savings' : 2,
       'Old Mutual Global FTSE RAFI® All World Index Feeder Fund (tax free)' : 3,
       'Old Mutual Global FTSE RAFI® All World Index Feeder Fund (not tax free)' : 4,
       'Tax Free Core Diversified Investment' : 5, '22seven Wallet' : 6,
       'FNB Vehicle Loan xxxx7483' : 7, 'Debt I will crush' : 8,
       'Investment with PSG RA' : 9, 'Z - PSG Short Term Investments' : 10,
       'Z - FNB Savings Pocket' : 11, 'Z - Stanlib Short Term' : 12,
       'FNB Old Credit Card' : 13, 'FNB Gold Credit Card' : 14,
       'FNB Credit Card xxxx8000' : 15}

category_map = {'Public Transport & Taxi Services' : 0, 'Eating Out & Takeouts' : 1,
       'Coffee' : 2, 'Tolls, Tickets & Parking' : 3, 'Income' : 4,
       'Tuition and Course Fees' : 5, 'Inter-account transfers' : 6,
       'Groceries & Household Essentials' : 7, 'Fuel' : 8, 'Gifts' : 9,
       'Concerts, Events & Tickets' : 10, 'Holidays, Travel, Adventure' : 11,
       'Phone & Internet' : 12, 'Clothes & Shoes' : 13, 'TV, Music & Streaming' : 14,
       'Investments' : 15, 'Alcohol Cigarettes & Vaping' : 16, 'Interest' : 17,
       'Insurance' : 18, 'Digital Services & Software' : 19, 'Donations to charity' : 20,
       'Domestic Worker & Garden Service' : 21, 'Rent' : 22,
       'Bank charges & interest' : 23, 'Appliances Furniture Decor' : 24,
       'ATM & Cash Transactions' : 25, 'Car Repayments' : 26,
       'Electricity, Water & Rates' : 27, 'Hobbies' : 28, 'Health and Medical Costs' : 29,
       'Things I will be paid back for' : 30, 'Grooming / Cosmetics' : 31, 'Gaming' : 32,
       'Books' : 33, 'Supporting Others' : 34, 'External Consultants / Services' : 35,
       'Savings' : 36, 'Tech' : 37, 'Car Costs' : 38, 'Accessories & Toys' : 39,
       'Sports & Exercise' : 40, 'Taxes' : 41, 'Cape Town Move' : 42,
       'Home Improvement & Maintenance' : 43, 'Refundable work expenses' : 44}
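
The maps above only define the encoding. One way to apply them is `Series.map`, which leaves `NaN` wherever a value is missing from the map, so a quick check for gaps is worthwhile. This is a sketch on stand-in rows with shortened copies of the two maps. The `_small` names and the `_code` column names are my own choices for illustration, not something from the cells above.

```python
import pandas as pd

# Shortened copies of the maps, just enough for the stand-in rows below.
account_map_small = {'A Wallet': 0, 'FNB credit card': 1}
category_map_small = {
    'Public Transport & Taxi Services': 0,
    'Eating Out & Takeouts': 1,
    'Coffee': 2,
}

sample = pd.DataFrame({
    'account': ['A Wallet', 'FNB credit card', 'FNB credit card'],
    'category': ['Public Transport & Taxi Services', 'Eating Out & Takeouts', 'Coffee'],
})

# map() looks each value up in the dict; values not in the dict become NaN.
sample['account_code'] = sample['account'].map(account_map_small)
sample['category_code'] = sample['category'].map(category_map_small)

# Rows where either lookup failed, i.e. a value missing from a map.
unmapped = sample[sample[['account_code', 'category_code']].isna().any(axis=1)]
print('Unmapped rows:', len(unmapped))
```

Writing to new columns keeps the original text around, which helps when reading predictions back later.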

In [12]:
df.account.unique()

array(['A Wallet', 'FNB credit card', 'Top 40 Savings',
       'Old Mutual Global FTSE RAFI® All World Index Feeder Fund (tax free)',
       'Old Mutual Global FTSE RAFI® All World Index Feeder Fund (not tax free)',
       'Tax Free Core Diversified Investment', '22seven Wallet',
       'FNB Vehicle Loan xxxx7483', 'Debt I will crush',
       'Investment with PSG RA', 'Z - PSG Short Term Investments',
       'Z - FNB Savings Pocket', 'Z - Stanlib Short Term',
       'FNB Old Credit Card', 'FNB Gold Credit Card',
       'FNB Credit Card xxxx8000'], dtype=object)
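
Since `unique()` returns values in order of first appearance, a map like `account_map` could also be generated rather than typed out. This is a sketch on stand-in data, with `generated_map` as an illustrative name. Note that the resulting codes depend on row order, so they would change if the CSV were re-sorted.

```python
import pandas as pd

# Stand-in data: a few account names in the order they might first appear.
sample = pd.DataFrame({'account': [
    'A Wallet', 'A Wallet', 'FNB credit card', 'A Wallet', 'Top 40 Savings',
]})

# enumerate() numbers the unique values 0, 1, 2, ... in order of appearance.
generated_map = {name: code for code, name in enumerate(sample['account'].unique())}
print(generated_map)
```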

In [15]:
df.category.unique()

array(['Public Transport & Taxi Services', 'Eating Out & Takeouts',
       'Coffee', 'Tolls, Tickets & Parking', 'Income',
       'Tuition and Course Fees', 'Inter-account transfers',
       'Groceries & Household Essentials', 'Fuel', 'Gifts',
       'Concerts, Events & Tickets', 'Holidays, Travel, Adventure',
       'Phone & Internet', 'Clothes & Shoes', 'TV, Music & Streaming',
       'Investments', 'Alcohol Cigarettes & Vaping', 'Interest',
       'Insurance', 'Digital Services & Software', 'Donations to charity',
       'Domestic Worker & Garden Service', 'Rent',
       'Bank charges & interest', 'Appliances Furniture Decor',
       'ATM & Cash Transactions', 'Car Repayments',
       'Electricity, Water & Rates', 'Hobbies', 'Health and Medical Costs',
       'Things I will be paid back for', 'Grooming / Cosmetics', 'Gaming',
       'Books', 'Supporting Others', 'External Consultants / Services',
       'Savings', 'Tech', 'Car Costs', 'Accessories & Toys',
       'Sports & Exercise'