# Classify Financial Transactions
A package to classify financial transactions with a lightweight neural network. I use this to automatically categorize my spending between `common expenses` vs `personal`.   
This uses `pytorch` and [`pytorch-lighting`](https://pytorch-lightning.readthedocs.io/en/latest/)

## Let's look at the data
We'll use some financial transactions exported as csv with several useful fields.

In [2]:
from models import TransClassifier, read_and_process_data, train_tokenizer, compute_features, prepair_training_dataset, \
    inspect
import pytorch_lightning as pl
from argparse import Namespace
from logging import getLogger
import torch
import os
from shutil import copyfile
device = "cuda" if torch.cuda.is_available() else "cpu"

In [3]:
data = read_and_process_data("transactions_training_data.csv", after='2017-07-01')
data.head()

INFO:root:0.32961435847683723


Unnamed: 0,Date,Description,Original Description,Amount,Transaction Type,Category,Account Name,Labels,Notes,feature_string,label,feature_float,weights
0,2020-01-19,LOYAL NINE,LOYAL NINE,22.47,debit,Coffee Shops,CREDIT CARD,,,coffee shops sunday credit card debit loyal nine,False,22.47,1.0
1,2020-01-19,LOYAL NINE,LOYAL NINE,12.2,debit,Coffee Shops,CREDIT CARD,,,coffee shops sunday credit card debit loyal nine,False,12.2,1.0
2,2020-01-18,Stop & Shop,STOP & SHOP 0039,13.68,debit,Groceries,CREDIT CARD,,,groceries saturday credit card debit stop & sh...,False,13.68,1.0
3,2020-01-17,Liquor Junction,LIQUOR JUNCTION-,55.51,debit,Alcohol & Bars,CREDIT CARD,,,alcohol & bars friday credit card debit liquor...,False,55.51,1.0
4,2018-03-21,Lonestar Taqueria,LONESTAR TAQUERIA,14.16,debit,Restaurants,CREDIT CARD,,,restaurants wednesday credit card debit lonest...,False,14.16,1.0


We'll concatenate a bunch of string fields into `feature_string`. We'll also use a numerical transaction `Amount` as a feature a well.

In [4]:
data.loc[:,["feature_string","feature_float","label"]].head(5)

Unnamed: 0,feature_string,feature_float,label
0,coffee shops sunday credit card debit loyal nine,22.47,False
1,coffee shops sunday credit card debit loyal nine,12.2,False
2,groceries saturday credit card debit stop & sh...,13.68,False
3,alcohol & bars friday credit card debit liquor...,55.51,False
4,restaurants wednesday credit card debit lonest...,14.16,False


## Tokenizer
Let's fit a subword tokenizer and convert the data to pytorch tensor dataset consisting of `string features, numerical features, labels, weights`

In [5]:
train_tokenizer(data)
features_ids = compute_features(data)
dataset = prepair_training_dataset(features_ids, data)
dataset.tensors

(tensor([[176, 208,  92,  ...,   0,   0,   0],
         [176, 208,  92,  ...,   0,   0,   0],
         [213, 576,  64,  ...,   0,   0,   0],
         ...,
         [201,  86, 199,  ...,   0,   0,   0],
         [196, 326,  92,  ...,   0,   0,   0],
         [139,  86, 165,  ...,   0,   0,   0]]),
 tensor([[22.4700],
         [12.2000],
         [13.6800],
         ...,
         [36.4700],
         [53.8700],
         [57.2200]]),
 tensor([0, 0, 0,  ..., 0, 1, 1]),
 tensor([1., 1., 1.,  ..., 1., 1., 1.]))

## Model
A lightweight 1D CNN based model to encode a string sequence. Define some hparams

In [6]:
hparams = Namespace(gpus=1 if device == "cude" else None,
                        dropout_rate=.2,
                        hidden_dim=32,
                        batch_size=256,
                        seq_type="cnn",
                        max_epochs=100,
                        min_epochs=10,
                        progress_bar_refresh_rate=0,
                        best_model_path="model.ckpt")

In [7]:
model = TransClassifier(hparams)
model

TransClassifier(
  (emb): Embedding(1000, 32)
  (seq_encoder): Conv1d(32, 32, kernel_size=(3,), stride=(1,))
  (cont_lin): Linear(in_features=1, out_features=1, bias=True)
  (cls): Linear(in_features=33, out_features=2, bias=True)
  (drop): Dropout(p=0.2, inplace=False)
)

## Traing
train using `pl`. using `tensorboard --logdir="./"`, we can inspect training at `localhost:6006`

In [None]:
trainer = pl.Trainer(max_epochs=hparams.max_epochs,
                         min_epochs=hparams.min_epochs,
                         gpus=hparams.gpus,
                         progress_bar_refresh_rate=hparams.progress_bar_refresh_rate)

trainer.fit(model)

![training](imgs/img1.png)

## Inspect

Let's see what our model got wront

In [9]:
model = TransClassifier.load_from_checkpoint(trainer.checkpoint_callback.kth_best_model)
    # check how model did
with torch.no_grad():
    x_s, x_f, y, w = dataset.tensors
    out = model(x_s.to(device), x_f.to(device))
    probs = out[0].softmax(dim=1)[:, 1].cpu().numpy()
    
wrong = ((probs > .5) != data.label.values)
data["probs"] = probs
data["wrong"] = wrong
f"mean error {wrong.mean()}"

'mean error 0.15110356536502548'

top false positives

In [10]:
data.query("wrong").sort_values("probs").tail().loc[:,["Date", "Original Description", "Labels", "Amount", "probs"]]

Unnamed: 0,Date,Original Description,Labels,Amount,probs
970,2019-06-04,PETPOCKETBOOK,,44.0,0.958228
2052,2018-10-04,Medford MA Utility ~ Tran: ACHDW,,54.96,0.962399
1665,2018-12-31,WHOLEFDS MDF 10380,,10.84,0.970529
1909,2018-11-05,Loyal Nine,,31.25,0.971103
1617,2019-01-11,NAVEO CU ONLINE PMT 190111,,178.58,0.978446


top false negatives

In [11]:
data.query("wrong").sort_values("probs").head().loc[:,["Date", "Original Description", "Labels", "Amount", "probs"]]

Unnamed: 0,Date,Original Description,Labels,Amount,probs
2129,2018-09-23,UBER TECHNOLOGIES INC,Common,23.84,0.006179
1072,2019-05-12,SQ *THE BACON TRUCK LLC,Common,29.21,0.00811
1318,2019-03-17,JETBLUE 2792607175535,Common,60.0,0.01022
4003,2017-07-30,Amazon.com,Common,43.55,0.01114
3463,2017-11-26,UBER *TRIP 4FPFB,Common,16.17,0.012539
