# Bank Statement Description Classification

This Jupyter notebook provides a machine learning approach to classify bank statement descriptions into predefined categories. Bank statement descriptions often contain abbreviated or vague information, making it challenging to interpret their meaning. By classifying these descriptions into categories, financial institutions can better analyze transaction data for various purposes such as fraud detection, customer segmentation, and trend analysis.

What we're going to cover:

1. Getting the data ready
2. Choosing the right estimator/algorithm for our problem
3. Fit the model/algorithm and use it to make predictions on our data
4. Evaluating a model
5. Improve a model
6. Save and load a trained model
7. Putting it all together!

## 1. Getting the data ready

In [86]:
import pandas as pd
import numpy as np

bank_statement = pd.read_csv("./data/bankstatement_seyi.csv")
bank_statement.head()

Unnamed: 0,DESCRIPTION,TRX_TYPE,CLASS,SUB-CLASS,BANK
0,AKANNI O EMMANUEL/MOB/UTO/ROTIMI EMMANUE/t/184...,Credit,USSD_TRANSFER,,Access Bank
1,TRF/Transfer/FRMROTIMI EMMANUEL AKANNI TO MUHA...,Debit,BANK_TRANSFER,,Access Bank
2,KANU WINNER U/MOBILE/UNION Transfer from KANU ...,Credit,APP_TRANSFER,,Access Bank
3,TRF//FRMROTIMI EMMANUEL AKANNI TO ROTIMI EMMAN...,Credit,APP_TRANSFER,,Access Bank
4,TRF/NULL/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,Credit,APP_TRANSFER,,Access Bank


In [87]:
# Total number of rows in bank statement
len(bank_statement)

4113

In [88]:
bank_statement.dtypes

DESCRIPTION    object
TRX_TYPE       object
CLASS          object
SUB-CLASS      object
BANK           object
dtype: object

### 1a. Drop rows with N/A values

In [89]:
# Count N/A values in each column
bank_statement.isna().sum()

DESCRIPTION       2
TRX_TYPE          4
CLASS            13
SUB-CLASS      1183
BANK             10
dtype: int64

In [90]:
bank_statement.dropna(inplace=True)
bank_statement.head()

Unnamed: 0,DESCRIPTION,TRX_TYPE,CLASS,SUB-CLASS,BANK
5,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,Debit,APP_TRANSFER,REFUND,Access Bank
6,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,Debit,APP_TRANSFER,Meat pie,Access Bank
12,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,Credit,APP_TRANSFER,Fuel,Access Bank
15,AIRTIME/ 9MOBILE/08179000904,Debit,UTILITY,Airtime,Access Bank
16,RVSL_AIRTIME/ 9MOBILE/08179000904,Credit,REFUND,Airtime,Access Bank


In [91]:
bank_statement.isna().sum()

DESCRIPTION    0
TRX_TYPE       0
CLASS          0
SUB-CLASS      0
BANK           0
dtype: int64
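
Calling `dropna()` with no arguments removes every row that has a missing value in any column. Here the row count falls from 4113 to 2916, and `SUB-CLASS` alone has 1183 missing values even though that column is removed in the next step. A minimal sketch of restricting the check to the columns the model needs follows; the small frame is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the bank statement frame (illustrative values only).
toy = pd.DataFrame({
    "DESCRIPTION": ["AIRTIME/ 9MOBILE", "TRF/Fuel/FRM A TO B", None, "POS PURCHASE"],
    "CLASS": ["UTILITY", "APP_TRANSFER", "POS", None],
    "SUB-CLASS": ["Airtime", None, None, None],
})

# Dropping on every column keeps only rows that are complete everywhere.
all_columns = toy.dropna()

# Dropping only on the columns used for modelling keeps more rows.
needed_columns = toy.dropna(subset=["DESCRIPTION", "CLASS"])

print(len(all_columns), len(needed_columns))
```

Whether the extra rows are worth keeping depends on how reliable their `CLASS` labels are, which this sketch does not check.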

### 1b. Removing Unnecessary Columns
Remove unwanted columns, keeping only `DESCRIPTION` and `CLASS`.

In [92]:
streamlined_bank_statement = bank_statement.drop("TRX_TYPE", axis=1)
streamlined_bank_statement.head()

Unnamed: 0,DESCRIPTION,CLASS,SUB-CLASS,BANK
5,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,APP_TRANSFER,REFUND,Access Bank
6,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,APP_TRANSFER,Meat pie,Access Bank
12,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,APP_TRANSFER,Fuel,Access Bank
15,AIRTIME/ 9MOBILE/08179000904,UTILITY,Airtime,Access Bank
16,RVSL_AIRTIME/ 9MOBILE/08179000904,REFUND,Airtime,Access Bank


In [93]:
streamlined_bank_statement = streamlined_bank_statement.drop("SUB-CLASS", axis=1)
streamlined_bank_statement.head()

Unnamed: 0,DESCRIPTION,CLASS,BANK
5,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,APP_TRANSFER,Access Bank
6,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,APP_TRANSFER,Access Bank
12,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,APP_TRANSFER,Access Bank
15,AIRTIME/ 9MOBILE/08179000904,UTILITY,Access Bank
16,RVSL_AIRTIME/ 9MOBILE/08179000904,REFUND,Access Bank


In [94]:
streamlined_bank_statement = streamlined_bank_statement.drop("BANK", axis=1)
streamlined_bank_statement.head()

Unnamed: 0,DESCRIPTION,CLASS
5,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,APP_TRANSFER
6,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,APP_TRANSFER
12,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,APP_TRANSFER
15,AIRTIME/ 9MOBILE/08179000904,UTILITY
16,RVSL_AIRTIME/ 9MOBILE/08179000904,REFUND


In [97]:
# Check Number of Rows
len(streamlined_bank_statement)


2916

In [98]:
streamlined_bank_statement.isna().sum()

DESCRIPTION    0
CLASS          0
dtype: int64

### 1c. Rename Columns

In [99]:
streamlined_bank_statement['txn_description'] = streamlined_bank_statement['DESCRIPTION']
streamlined_bank_statement.head()

Unnamed: 0,DESCRIPTION,CLASS,txn_description
5,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,APP_TRANSFER,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...
6,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,APP_TRANSFER,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...
12,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,APP_TRANSFER,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...
15,AIRTIME/ 9MOBILE/08179000904,UTILITY,AIRTIME/ 9MOBILE/08179000904
16,RVSL_AIRTIME/ 9MOBILE/08179000904,REFUND,RVSL_AIRTIME/ 9MOBILE/08179000904


In [100]:
streamlined_bank_statement['target'] = streamlined_bank_statement['CLASS']
streamlined_bank_statement.head()

Unnamed: 0,DESCRIPTION,CLASS,txn_description,target
5,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,APP_TRANSFER,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,APP_TRANSFER
6,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,APP_TRANSFER,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,APP_TRANSFER
12,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,APP_TRANSFER,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,APP_TRANSFER
15,AIRTIME/ 9MOBILE/08179000904,UTILITY,AIRTIME/ 9MOBILE/08179000904,UTILITY
16,RVSL_AIRTIME/ 9MOBILE/08179000904,REFUND,RVSL_AIRTIME/ 9MOBILE/08179000904,REFUND


In [101]:
streamlined_bank_statement = streamlined_bank_statement.drop("DESCRIPTION", axis=1)
streamlined_bank_statement.head()

Unnamed: 0,CLASS,txn_description,target
5,APP_TRANSFER,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,APP_TRANSFER
6,APP_TRANSFER,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,APP_TRANSFER
12,APP_TRANSFER,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,APP_TRANSFER
15,UTILITY,AIRTIME/ 9MOBILE/08179000904,UTILITY
16,REFUND,RVSL_AIRTIME/ 9MOBILE/08179000904,REFUND


In [102]:
streamlined_bank_statement = streamlined_bank_statement.drop("CLASS", axis=1)
streamlined_bank_statement.head()

Unnamed: 0,txn_description,target
5,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,APP_TRANSFER
6,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,APP_TRANSFER
12,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,APP_TRANSFER
15,AIRTIME/ 9MOBILE/08179000904,UTILITY
16,RVSL_AIRTIME/ 9MOBILE/08179000904,REFUND


In [103]:
# Save Refined data as csv
streamlined_bank_statement.to_csv("./data/streamlined_bank_statement.csv", index=False)
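
The column selection and renaming above are done one column at a time by copying and dropping. As a sketch of an equivalent shorter route, the same result can be reached by selecting the two columns and calling `DataFrame.rename`. The frame below is a made-up stand-in with the same column names as the notebook's data.

```python
import pandas as pd

# Hypothetical stand-in for bank_statement (illustrative values only).
toy = pd.DataFrame({
    "DESCRIPTION": ["AIRTIME/ 9MOBILE/08179000904", "RVSL_AIRTIME/ 9MOBILE/08179000904"],
    "TRX_TYPE": ["Debit", "Credit"],
    "CLASS": ["UTILITY", "REFUND"],
    "SUB-CLASS": ["Airtime", "Airtime"],
    "BANK": ["Access Bank", "Access Bank"],
})

# Keep only the two columns of interest and rename them in one step.
streamlined = (
    toy[["DESCRIPTION", "CLASS"]]
    .rename(columns={"DESCRIPTION": "txn_description", "CLASS": "target"})
)

print(streamlined.columns.tolist())
```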

### 1d. Convert target labels to integers

In [70]:
LABELS = {
    'AGENT_WITHDRAWAL': 0,
    'APP_TRANSFER': 1,
    'ATM': 2,
    'BANK TRANSFER': 3,  # the data spells this class 'BANK_TRANSFER', so this key never matches and those rows map to NaN
    'BANK_CHARGES': 4,
    'CASH_WITHDRAWAL': 6,
    'Class': 7,
    'DEBIT': 8,
    'LOAN': 9,
    'ONLINE_TRANSFER': 10,
    'POS': 11,
    'REFUND': 12,
    'SALARY': 13,
    'TAX': 14,
    'USSD_TRANSFER': 15,
    'UTILITY': 16
}

In [104]:
streamlined_bank_statement.isna().sum()

txn_description    0
target             0
dtype: int64

In [105]:
streamlined_bank_statement['target'] = streamlined_bank_statement['target'].map(LABELS)
streamlined_bank_statement.head()

Unnamed: 0,txn_description,target
5,TRF/REFUND/FRMROTIMI EMMANUEL AKANNI TO KANU W...,1.0
6,TRF/Meat pie/FRMROTIMI EMMANUEL AKANNI TO AKIN...,1.0
12,TRF/Fuel/FRMUFUOMA BENEDICTA MARCHIE TO ROTIMI...,1.0
15,AIRTIME/ 9MOBILE/08179000904,16.0
16,RVSL_AIRTIME/ 9MOBILE/08179000904,12.0


In [106]:
streamlined_bank_statement.isna().sum()

txn_description      0
target             242
dtype: int64

In [107]:
streamlined_bank_statement.dropna(inplace=True)

In [108]:
streamlined_bank_statement.isna().sum()

txn_description    0
target             0
dtype: int64

In [110]:
len(streamlined_bank_statement)

2674
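
The mapping step leaves 242 missing targets because `Series.map` with a dict returns NaN for any value that has no matching key. One mismatch is visible in the data: the dict key is `'BANK TRANSFER'` while the `CLASS` column contains `'BANK_TRANSFER'`. Whether that accounts for all 242 rows is not shown in the notebook. The sketch below, on made-up values, lists which labels fail to map before they are dropped, and shows `pd.factorize` as one way to derive integer codes directly from the data.

```python
import pandas as pd

# A subset of the LABELS dict, with the same keys as in the notebook.
LABELS_SUBSET = {"APP_TRANSFER": 1, "BANK TRANSFER": 3, "UTILITY": 16}

# Hypothetical target column (illustrative values only).
toy_target = pd.Series(["APP_TRANSFER", "BANK_TRANSFER", "UTILITY", "BANK_TRANSFER"])

mapped = toy_target.map(LABELS_SUBSET)

# Count the original labels that did not find a key in the dict.
unmapped_counts = toy_target[mapped.isna()].value_counts()
print(unmapped_counts)

# Alternative: derive integer codes from the labels that actually occur.
codes, classes = pd.factorize(toy_target, sort=True)
print(list(codes), list(classes))
```

Codes produced by `pd.factorize` depend on the labels present, so they will generally differ from the hand-written numbers in `LABELS`.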

### 1e. Split data into X/y

In [111]:
X = streamlined_bank_statement.drop("target", axis=1)
y = streamlined_bank_statement["target"]

### 1f. Encode txn_description using OneHotEncoder

In [112]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["txn_description"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                               remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X

<2674x2074 sparse matrix of type '<class 'numpy.float64'>'
	with 2674 stored elements in Compressed Sparse Row format>
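
The output above has 2674 rows, 2074 columns, and exactly 2674 stored elements: `OneHotEncoder` treats each whole description string as a single category, so every row has one non-zero entry and two descriptions share a feature only when they are identical. A possible alternative, which is not what this notebook uses, is to build features from pieces of the text. The sketch below uses scikit-learn's `TfidfVectorizer` with character n-grams. The first two strings appear in the data above; the third is a made-up placeholder.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "AIRTIME/ 9MOBILE/08179000904",
    "RVSL_AIRTIME/ 9MOBILE/08179000904",
    "TRF/Fuel/FRM SENDER TO RECEIVER",  # hypothetical placeholder text
]

# Character n-grams (2 to 4 characters, within word boundaries) let
# similar descriptions share features even when they are not identical.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
features = vectorizer.fit_transform(descriptions)

# TF-IDF rows are L2-normalised by default, so the dot product is cosine similarity.
similarity = (features @ features.T).toarray()

print(features.shape)
print(similarity.round(2))
```

The n-gram range and analyzer here are assumptions chosen for illustration; suitable values would need to be tuned on the real descriptions.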

### 1g. Check how many rows we now have

In [113]:
len(y), len(X)

(2674, 2674)

### 1h. Check for N/A

In [114]:
y.isna().sum()

0

In [115]:
X.isna().sum()

txn_description    0
dtype: int64

## 2. Choosing the right estimator/algorithm for our problem
Using the scikit-learn machine learning map (https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) as a roadmap for picking a model, our options are:
- Linear SVC
- KNeighbors Classifier
- Ensemble Classifier


## 3. Fit the model/algorithm and use it to make predictions on our data

In this step, we split the data into training and testing sets, allocating 80% for training and 20% for testing.

In [116]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
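
The split above does not pass `random_state`, so it depends on the state of NumPy's global random generator at the time the cell runs. As a sketch of a more repeatable variant, `train_test_split` accepts `random_state` for a fixed split and `stratify` to keep class proportions similar in both sets. The arrays below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 15 of class 0 and 5 of class 1.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

# random_state fixes the shuffle; stratify preserves the 75/25 class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))
```

Stratifying requires at least two samples in every class, so very rare classes may need to be merged or removed first.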

#### 3a. Linear SVC

In [121]:
from sklearn.svm import LinearSVC

# Set up random seed
np.random.seed(42)
linear_svc_clf = LinearSVC(max_iter=10000, dual='auto')
linear_svc_clf.fit(X_train, y_train)

# Evaluate LinearSVC
linear_svc_clf.score(X_test, y_test)

0.6728971962616822

In [125]:
np.random.seed(42)
for i in range(10, 100, 1):
    print(f"Trying LinearSVC with {i} max_iter")
    linear_svc_clf = LinearSVC(max_iter=i, dual='auto')
    linear_svc_clf.fit(X_train, y_train)
    print(f" Model accuracy on test set: {linear_svc_clf.score(X_test, y_test)*100:.2f}%")
    
    print("")

Trying LinearSVC with 10 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 11 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 12 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 13 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 14 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 15 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 16 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 17 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 18 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 19 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 20 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 21 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 22 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 23 max_iter
 Model accuracy on test set: 67.29%

[... output truncated ...]
 Model accuracy on test set: 67.29%

Trying LinearSVC with 28 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 29 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 30 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 31 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 32 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 33 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 34 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 35 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 36 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 37 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 38 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 39 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 40 max_iter
 Model accuracy on test set: 67.29%

Trying LinearSVC with 41 max_iter
[... output truncated ...]
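
Every `max_iter` value above reports the same 67.29% accuracy as the first LinearSVC run. One possible explanation, which the notebook does not verify, is that the model predicts the most frequent class for most test rows. A quick way to check is to compare against scikit-learn's `DummyClassifier`, which always predicts the majority class. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical integer labels (illustrative values only).
y_train_toy = np.array([1] * 7 + [16] * 2 + [12] * 1)
y_test_toy = np.array([1, 1, 16, 1, 12])

# The baseline ignores the features, so placeholder columns are enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)
baseline_accuracy = baseline.score(X_test_toy, y_test_toy)

print(f"Majority-class baseline accuracy: {baseline_accuracy * 100:.2f}%")
```

If the baseline on the real split scores about the same as the trained model, the features are probably contributing little to the predictions.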

#### 3b. KNeighbors Classifier

In [123]:
from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)
knn_classifier.score(X_test, y_test)


0.6149532710280374

In [128]:
np.random.seed(42)
for i in range(1, 100, 1):
    print(f"Trying KNeighborsClassifier with {i} n_neighbors")
    knn_classifier = KNeighborsClassifier(n_neighbors=i)
    knn_classifier.fit(X_train, y_train)
    print(f" Model accuracy on test set: {knn_classifier.score(X_test, y_test)*100:.2f}%")
    
    print("")

Trying KNeighborsClassifier with 1 n_neighbors
 Model accuracy on test set: 67.29%

Trying KNeighborsClassifier with 2 n_neighbors
 Model accuracy on test set: 34.39%

Trying KNeighborsClassifier with 3 n_neighbors
 Model accuracy on test set: 62.99%

Trying KNeighborsClassifier with 4 n_neighbors
 Model accuracy on test set: 67.10%

Trying KNeighborsClassifier with 5 n_neighbors
 Model accuracy on test set: 61.50%

Trying KNeighborsClassifier with 6 n_neighbors
 Model accuracy on test set: 61.50%

Trying KNeighborsClassifier with 7 n_neighbors
 Model accuracy on test set: 61.50%

Trying KNeighborsClassifier with 8 n_neighbors
 Model accuracy on test set: 61.50%

Trying KNeighborsClassifier with 9 n_neighbors
 Model accuracy on test set: 61.31%

Trying KNeighborsClassifier with 10 n_neighbors
 Model accuracy on test set: 61.31%

Trying KNeighborsClassifier with 11 n_neighbors
 Model accuracy on test set: 61.12%

Trying KNeighborsClassifier with 12 n_neighbors
[... output truncated ...]
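
The loop above picks `n_neighbors` by scoring each candidate on the test set, which means the test set no longer gives an unbiased estimate for the chosen model. A common alternative, sketched here on synthetic data rather than the bank statement data, is to select the value with cross-validation on the training data using `GridSearchCV`. The candidate values and fold count are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (not the notebook's data).
X_toy, y_toy = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Try several neighbour counts with 5-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(f"Mean cross-validated accuracy: {search.best_score_ * 100:.2f}%")
```

After the search, `search.best_estimator_` is refit on all the data passed to `fit` and can be scored once on the held-out test set.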

#### 3c. Ensemble Classifier

In [129]:
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

random_forest_classifier_model = RandomForestClassifier(n_estimators=100)
random_forest_classifier_model.fit(X_train, y_train)

# Check score of model
random_forest_classifier_model.score(X_test, y_test)

0.6728971962616822

In [130]:
np.random.seed(42)
for i in range(1, 100, 1):
    print(f"Trying RandomForestClassifier with {i} n_estimators")
    random_forest_classifier_model = RandomForestClassifier(n_estimators=i)
    random_forest_classifier_model.fit(X_train, y_train)
    print(f" Model accuracy on test set: {random_forest_classifier_model.score(X_test, y_test)*100:.2f}%")
    
    print("")

Trying RandomForestClassifier with 1 n_estimators
 Model accuracy on test set: 65.61%

Trying RandomForestClassifier with 2 n_estimators
 Model accuracy on test set: 66.54%

Trying RandomForestClassifier with 3 n_estimators
 Model accuracy on test set: 65.79%

Trying RandomForestClassifier with 4 n_estimators
 Model accuracy on test set: 66.92%

Trying RandomForestClassifier with 5 n_estimators
 Model accuracy on test set: 66.17%

Trying RandomForestClassifier with 6 n_estimators
 Model accuracy on test set: 66.73%

Trying RandomForestClassifier with 7 n_estimators
 Model accuracy on test set: 65.61%

Trying RandomForestClassifier with 8 n_estimators
 Model accuracy on test set: 66.92%

Trying RandomForestClassifier with 9 n_estimators
 Model accuracy on test set: 65.98%

Trying RandomForestClassifier with 10 n_estimators
 Model accuracy on test set: 66.92%

Trying RandomForestClassifier with 11 n_estimators
 Model accuracy on test set: 65.79%

[... output truncated ...]