# Bank Statement Description Classification

This Jupyter notebook walks through a machine learning approach to classifying bank statement descriptions into predefined categories. Bank statement descriptions are often abbreviated or vague, which makes them hard to interpret. Classifying them into categories lets financial institutions analyze transaction data more easily for purposes such as fraud detection, customer segmentation, and trend analysis.

What we're going to cover:

1. Getting the data ready
2. Choosing the right estimator/algorithm for our problem
3. Fitting the model/algorithm and using it to make predictions on our data
4. Evaluating a model
5. Improving a model
6. Saving and loading a trained model
7. Putting it all together!

## 1. Getting the data ready

In [55]:
import pandas as pd
import numpy as np

bank_statement = pd.read_csv("./data/bankstatement_seyi.csv")
bank_statement.head()

Unnamed: 0,DESCRIPTION,TRX_TYPE,CLASS,SUB-CLASS,BANK
0,AKANNI O EMMANUEL/MOB/UTO/ROTIMI EMMANUE/t/184...,Credit,USSD_TRANSFER,,Access Bank
1,TRF/Transfer/FRMROTIMI EMMANUEL AKANNI TO MUHA...,Debit,BANK_TRANSFER,,Access Bank
2,KANU WINNER U/MOBILE/UNION Transfer from KANU ...,Credit,APP_TRANSFER,,Access Bank
3,TRF//FRMROTIMI EMMANUEL AKANNI TO ROTIMI EMMAN...,Credit,APP_TRANSFER,,Access Bank
4,TRF/NULL/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,Credit,APP_TRANSFER,,Access Bank


In [56]:
# Total number of rows in bank statement
len(bank_statement)

4113

In [57]:
bank_statement.dtypes

DESCRIPTION    object
TRX_TYPE       object
CLASS          object
SUB-CLASS      object
BANK           object
dtype: object

### 1a. Drop rows with N/A values

In [58]:
# Count the N/A values in each column
bank_statement.isna().sum()

DESCRIPTION       2
TRX_TYPE          4
CLASS            13
SUB-CLASS      1183
BANK             10
dtype: int64

In [59]:
# Drop every row that has an N/A value in any column (including SUB-CLASS)
bank_statement.dropna(inplace=True)
bank_statement.head()

Unnamed: 0,DESCRIPTION,TRX_TYPE,CLASS,SUB-CLASS,BANK
5,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,Debit,APP_TRANSFER,REFUND,Access Bank
6,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,Debit,APP_TRANSFER,Meat pie,Access Bank
12,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,Credit,APP_TRANSFER,Fuel,Access Bank
15,AIRTIME/ 9MOBILE/08179000904,Debit,UTILITY,Airtime,Access Bank
16,RVSL_AIRTIME/ 9MOBILE/08179000904,Credit,REFUND,Airtime,Access Bank


In [60]:
bank_statement.isna().sum()

DESCRIPTION    0
TRX_TYPE       0
CLASS          0
SUB-CLASS      0
BANK           0
dtype: int64
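
Calling `dropna` with no arguments removes a row if any column is missing, so the 1183 gaps in `SUB-CLASS` account for most of the rows lost here (4113 before, 2916 after), even though `SUB-CLASS` is removed in the next step. The sketch below uses a made-up toy frame, not the real statement, to show how the `subset` parameter of `dropna` restricts the check to chosen columns. It is an illustration of an alternative, not what this notebook does.

```python
import pandas as pd

# Toy frame standing in for the bank statement (all values are made up).
toy = pd.DataFrame({
    "DESCRIPTION": ["AIRTIME/ 9MOBILE", "TRF/Fuel", None, "POS PURCHASE"],
    "CLASS": ["UTILITY", "APP_TRANSFER", "ATM", None],
    "SUB-CLASS": ["Airtime", None, None, None],
})

# Checking every column removes any row with a gap anywhere.
drop_any = toy.dropna()

# Checking only the columns the model needs keeps more rows.
drop_needed = toy.dropna(subset=["DESCRIPTION", "CLASS"])

print(len(toy), len(drop_any), len(drop_needed))
```

On the toy frame, only one row survives the unrestricted `dropna`, while two survive when only `DESCRIPTION` and `CLASS` are checked.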

### 1b. Removing Unnecessary Columns
Remove unwanted columns, keeping only `DESCRIPTION` and `CLASS`.

In [61]:
streamlined_bank_statement = bank_statement.drop("TRX_TYPE", axis=1)
streamlined_bank_statement.head()

Unnamed: 0,DESCRIPTION,CLASS,SUB-CLASS,BANK
5,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,APP_TRANSFER,REFUND,Access Bank
6,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,APP_TRANSFER,Meat pie,Access Bank
12,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,APP_TRANSFER,Fuel,Access Bank
15,AIRTIME/ 9MOBILE/08179000904,UTILITY,Airtime,Access Bank
16,RVSL_AIRTIME/ 9MOBILE/08179000904,REFUND,Airtime,Access Bank


In [62]:
streamlined_bank_statement = streamlined_bank_statement.drop("SUB-CLASS", axis=1)
streamlined_bank_statement.head()

Unnamed: 0,DESCRIPTION,CLASS,BANK
5,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,APP_TRANSFER,Access Bank
6,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,APP_TRANSFER,Access Bank
12,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,APP_TRANSFER,Access Bank
15,AIRTIME/ 9MOBILE/08179000904,UTILITY,Access Bank
16,RVSL_AIRTIME/ 9MOBILE/08179000904,REFUND,Access Bank


In [63]:
streamlined_bank_statement = streamlined_bank_statement.drop("BANK", axis=1)
streamlined_bank_statement.head()

Unnamed: 0,DESCRIPTION,CLASS
5,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,APP_TRANSFER
6,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,APP_TRANSFER
12,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,APP_TRANSFER
15,AIRTIME/ 9MOBILE/08179000904,UTILITY
16,RVSL_AIRTIME/ 9MOBILE/08179000904,REFUND


In [64]:
# Check Number of Rows
len(streamlined_bank_statement)

2916
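
The three `drop` calls above can also be written as a single step. The following is a minimal sketch on a made-up one-row frame with the same column names. It shows two equivalent options: passing a list to `drop(columns=...)`, or selecting the columns to keep.

```python
import pandas as pd

# Toy frame with the same column names as the bank statement (values made up).
toy = pd.DataFrame({
    "DESCRIPTION": ["AIRTIME/ 9MOBILE"],
    "TRX_TYPE": ["Debit"],
    "CLASS": ["UTILITY"],
    "SUB-CLASS": ["Airtime"],
    "BANK": ["Access Bank"],
})

# Option 1: drop several columns in one call.
dropped = toy.drop(columns=["TRX_TYPE", "SUB-CLASS", "BANK"])

# Option 2: select only the columns to keep.
kept = toy[["DESCRIPTION", "CLASS"]]

print(list(dropped.columns))
print(dropped.equals(kept))
```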

### 1c. Rename Columns

In [65]:
streamlined_bank_statement['txn_description'] = streamlined_bank_statement['DESCRIPTION']
streamlined_bank_statement.head()

Unnamed: 0,DESCRIPTION,CLASS,txn_description
5,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,APP_TRANSFER,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...
6,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,APP_TRANSFER,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...
12,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,APP_TRANSFER,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...
15,AIRTIME/ 9MOBILE/08179000904,UTILITY,AIRTIME/ 9MOBILE/08179000904
16,RVSL_AIRTIME/ 9MOBILE/08179000904,REFUND,RVSL_AIRTIME/ 9MOBILE/08179000904


In [66]:
streamlined_bank_statement['target'] = streamlined_bank_statement['CLASS']
streamlined_bank_statement.head()

Unnamed: 0,DESCRIPTION,CLASS,txn_description,target
5,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,APP_TRANSFER,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,APP_TRANSFER
6,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,APP_TRANSFER,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,APP_TRANSFER
12,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,APP_TRANSFER,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,APP_TRANSFER
15,AIRTIME/ 9MOBILE/08179000904,UTILITY,AIRTIME/ 9MOBILE/08179000904,UTILITY
16,RVSL_AIRTIME/ 9MOBILE/08179000904,REFUND,RVSL_AIRTIME/ 9MOBILE/08179000904,REFUND


In [67]:
streamlined_bank_statement = streamlined_bank_statement.drop("DESCRIPTION", axis=1)
streamlined_bank_statement.head()

Unnamed: 0,CLASS,txn_description,target
5,APP_TRANSFER,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,APP_TRANSFER
6,APP_TRANSFER,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,APP_TRANSFER
12,APP_TRANSFER,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,APP_TRANSFER
15,UTILITY,AIRTIME/ 9MOBILE/08179000904,UTILITY
16,REFUND,RVSL_AIRTIME/ 9MOBILE/08179000904,REFUND


In [68]:
streamlined_bank_statement = streamlined_bank_statement.drop("CLASS", axis=1)
streamlined_bank_statement.head()

Unnamed: 0,txn_description,target
5,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,APP_TRANSFER
6,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,APP_TRANSFER
12,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,APP_TRANSFER
15,AIRTIME/ 9MOBILE/08179000904,UTILITY
16,RVSL_AIRTIME/ 9MOBILE/08179000904,REFUND
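
The copy-then-drop steps above produce the same result as a direct rename. As a small sketch on made-up data, `DataFrame.rename` with a `columns` mapping does it in one call:

```python
import pandas as pd

# Toy frame with the two remaining columns (values made up).
toy = pd.DataFrame({
    "DESCRIPTION": ["AIRTIME/ 9MOBILE", "TRF/Fuel"],
    "CLASS": ["UTILITY", "APP_TRANSFER"],
})

# Rename both columns at once; the original frame is left unchanged.
renamed = toy.rename(columns={"DESCRIPTION": "txn_description", "CLASS": "target"})

print(list(renamed.columns))
```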


In [69]:
# Save the refined data as a CSV file
streamlined_bank_statement.to_csv("./data/streamlined_bank_statement.csv", index=False)

### 1d. Convert target labels to integers

In [70]:
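# Map each class label to an integer code.
# Note: code 5 is unused, and there is no key for 'BANK_TRANSFER' (with an
# underscore), which appears in the raw data above. Series.map() turns any
# label that is missing from this dictionary into NaN.
LABELS = {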
    'AGENT_WITHDRAWAL': 0,
    'APP_TRANSFER': 1,
    'ATM': 2,
    'BANK TRANSFER': 3,
    'BANK_CHARGES': 4,
    'CASH_WITHDRAWAL': 6,
    'Class': 7,
    'DEBIT': 8,
    'LOAN': 9,
    'ONLINE_TRANSFER': 10,
    'POS': 11,
    'REFUND': 12,
    'SALARY': 13,
    'TAX': 14,
    'USSD_TRANSFER': 15,
    'UTILITY': 16
}

In [71]:
streamlined_bank_statement['target'] = streamlined_bank_statement['target'].map(LABELS)
streamlined_bank_statement.head()

Unnamed: 0,txn_description,target
5,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,1.0
6,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,1.0
12,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,1.0
15,AIRTIME/ 9MOBILE/08179000904,16.0
16,RVSL_AIRTIME/ 9MOBILE/08179000904,12.0
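
The mapped `target` values print as floats (`1.0`, `16.0`) rather than integers. pandas produces a float column when a mapped column contains at least one `NaN`, so this is consistent with some labels not being found in `LABELS`. The sketch below uses a shortened, hypothetical dictionary and a toy series to show one way to list such unmapped labels before training.

```python
import pandas as pd

# Shortened, hypothetical label dictionary and toy targets (for illustration only).
labels = {"APP_TRANSFER": 1, "REFUND": 12, "UTILITY": 16}
target = pd.Series(["APP_TRANSFER", "UTILITY", "BANK_TRANSFER", "REFUND"])

mapped = target.map(labels)

# Labels that are missing from the dictionary become NaN after map().
unmapped = sorted(target[mapped.isna()].unique())

print(unmapped)
print(mapped.dtype)
```

If the list is not empty, the dictionary can be extended or the affected rows handled explicitly, since most estimators reject `NaN` targets.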


### 1e. Split Data into X/y

In [72]:
X = streamlined_bank_statement.drop("target", axis=1)
y = streamlined_bank_statement["target"]

### 1f. Convert txn_description from text to numbers using OneHotEncoder

In [73]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# One-hot encode the text column; any other columns pass through unchanged
categorical_features = ["txn_description"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                               remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

<2916x2086 sparse matrix of type '<class 'numpy.float64'>'
	with 2916 stored elements in Compressed Sparse Row format>
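
The output shape says a lot about what one-hot encoding does to free text. Each distinct description becomes its own column (2086 columns for 2916 rows), and each row has exactly one non-zero entry (2916 stored elements). The toy sketch below, with made-up descriptions, shows this behaviour and what happens to a description the encoder has never seen. The `handle_unknown="ignore"` setting is an assumption added for the illustration and is not used in the notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up descriptions: two distinct values across three training rows.
train = pd.DataFrame({"txn_description": [
    "AIRTIME/ 9MOBILE/0817",
    "TRF/Fuel/FRM A TO B",
    "AIRTIME/ 9MOBILE/0817",
]})
# A description that does not appear in the training data.
new = pd.DataFrame({"txn_description": ["AIRTIME/ MTN/0803"]})

# handle_unknown="ignore" encodes unseen values as all zeros instead of raising.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded_train = encoder.fit_transform(train)
encoded_new = encoder.transform(new)

print(encoded_train.shape)      # one column per distinct description
print(encoded_new.toarray())    # the unseen description has no active column
```

With this encoding, a model can only recognise descriptions that exactly match one seen during fitting, which is worth keeping in mind when evaluating it on new statements.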

### 1g. Check how many rows we have now

In [76]:
len(y), len(X)

(2916, 2916)

## 2. Choosing the right estimator/algorithm for your problem
Using the scikit-learn machine learning map (https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) as a roadmap for picking a model, our options are:
- Linear SVC
- KNeighbors Classifier
- Ensemble Classifier
- SVC
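
As a rough sketch of how these options could be compared, the code below cross-validates one estimator from each family on synthetic data from `make_classification`. The synthetic data, the choice of `RandomForestClassifier` to stand in for "Ensemble Classifier", and all parameter values are assumptions made for illustration. The scores say nothing about performance on the bank statement data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC

# Synthetic three-class data standing in for the encoded features and targets.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# One candidate per option in the list above.
candidates = {
    "LinearSVC": LinearSVC(max_iter=10000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "SVC": SVC(),
}

# Mean accuracy over 5-fold cross-validation for each candidate.
scores = {
    name: cross_val_score(model, X_toy, y_toy, cv=5).mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```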