In [58]:
import pandas as pd
import numpy as np

In [59]:
transactions = pd.read_csv("../data/bank_transactions.csv")
transactions.head()

Unnamed: 0,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
0,PAYMENT,983.09,C1454812978,36730.24,35747.15,M1491308340,0.0,0.0,0,0
1,PAYMENT,55215.25,C1031766358,99414.0,44198.75,M2102868029,0.0,0.0,0,0
2,CASH_IN,220986.01,C1451868666,7773074.97,7994060.98,C1339195526,924031.48,703045.48,0,0
3,TRANSFER,2357394.75,C458368123,0.0,0.0,C620979654,4202580.45,6559975.19,0,0
4,CASH_OUT,67990.14,C1098978063,0.0,0.0,C142246322,625317.04,693307.19,0,0


The following features are not predictive and can be dropped:
- `nameOrig`: Origin account name — not useful for prediction.
- `nameDest`: Destination account name — not useful for prediction.
- `isFlaggedFraud`: Output of a hard-coded fraud rule — ineffective and redundant.

In [60]:
df = transactions.drop(columns=['nameOrig', 'nameDest', 'isFlaggedFraud'])
df.head()

Unnamed: 0,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud
0,PAYMENT,983.09,36730.24,35747.15,0.0,0.0,0
1,PAYMENT,55215.25,99414.0,44198.75,0.0,0.0,0
2,CASH_IN,220986.01,7773074.97,7994060.98,924031.48,703045.48,0
3,TRANSFER,2357394.75,0.0,0.0,4202580.45,6559975.19,0
4,CASH_OUT,67990.14,0.0,0.0,625317.04,693307.19,0


- **TRANSFER** and **CASH_OUT** transactions are the only types associated with fraud.
- These two types also involve significantly higher transaction amounts on average.
- Other types such as **PAYMENT**, **DEBIT**, and **CASH_IN** have much smaller amounts and no fraud cases.
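
One way to check observations like these is a per-type summary of fraud counts and amounts. The sketch below is a minimal example, not the analysis that produced the bullets above: the `toy` frame and its rows are invented for illustration, and only the column names (`type`, `amount`, `isFraud`) come from the dataset.

```python
import pandas as pd

# Toy stand-in for `df`; these rows are invented for illustration only.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "TRANSFER", "CASH_OUT", "CASH_OUT", "CASH_IN", "DEBIT"],
    "amount": [1_000.0, 500_000.0, 900_000.0, 300_000.0, 700_000.0, 50_000.0, 2_000.0],
    "isFraud": [0, 1, 0, 0, 1, 0, 0],
})

# Per-type transaction count, fraud count, fraud rate, and mean amount.
summary = toy.groupby("type").agg(
    n=("isFraud", "size"),
    fraud_count=("isFraud", "sum"),
    fraud_rate=("isFraud", "mean"),
    mean_amount=("amount", "mean"),
)

# Transaction types with at least one fraud case.
fraud_types = sorted(summary.index[summary["fraud_count"] > 0])

print(summary)
print(fraud_types)
```

Running the same `groupby` on `df` instead of `toy` would give the corresponding figures for the real data.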

In [61]:
df_encoded = pd.get_dummies(
    data=df,
    columns=['type'],
    dtype='int',
    drop_first=True
)
df_encoded.head()

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,983.09,36730.24,35747.15,0.0,0.0,0,0,0,1,0
1,55215.25,99414.0,44198.75,0.0,0.0,0,0,0,1,0
2,220986.01,7773074.97,7994060.98,924031.48,703045.48,0,0,0,0,0
3,2357394.75,0.0,0.0,4202580.45,6559975.19,0,0,0,0,1
4,67990.14,0.0,0.0,625317.04,693307.19,0,1,0,0,0


Fraudulent transactions make up only 0.13% of the dataset, which poses a significant challenge for model training. A classifier that always predicts the majority class (non-fraud) would reach 99.87% accuracy while detecting no fraud at all, so accuracy alone is a misleading metric here.

To address this, the following strategies were considered:

- **Class Weights**: Using `class_weight='balanced'` helps the model give more importance to the minority class without altering the data.
- **SMOTE**: Synthetic Minority Over-sampling Technique generates synthetic examples of the minority class. It should only be applied to the **training set** after splitting.

**Approach Taken**:
Model training starts with `class_weight='balanced'` for simplicity and efficiency. If performance is inadequate, SMOTE will be applied as a follow-up step.
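
To give a sense of scale, the weights implied by `class_weight='balanced'` can be computed by hand. The sketch below assumes the scikit-learn convention, `n_samples / (n_classes * np.bincount(y))`, and uses an invented label vector with 0.13% positives rather than the real `isFraud` column.

```python
import numpy as np

# Illustrative labels only: 9,987 non-fraud and 13 fraud rows (0.13% fraud).
y = np.array([0] * 9_987 + [1] * 13)

# "Balanced" heuristic: n_samples / (n_classes * count_per_class).
counts = np.bincount(y)
weights = len(y) / (len(counts) * counts)

# Map each class label to its weight.
class_weight = {label: float(w) for label, w in enumerate(weights)}
print(class_weight)
```

Under this assumption the fraud class is weighted several hundred times more heavily than the non-fraud class, so a misclassified fraud case costs the model far more than a misclassified legitimate one.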

Some patterns in the data combine multiple features, like high amounts in certain transaction types. To capture these, new features were created:

- `high_risk_type`: Flags `TRANSFER` or `CASH_OUT` transactions with an amount over 400,000. These are the only two types in which fraud occurs.

- `orig_diff` and `dest_diff`: Measure the balance change in the origin and destination accounts. These do not always match `amount`: in rows 3 and 4 of the preview above, the origin balance is 0.0 both before and after the transaction. Such mismatches may help flag irregular or suspicious transactions.

These features help the model see patterns not directly shown in the original data.

In [62]:
# Flag TRANSFER or CASH_OUT transactions with an amount above 400,000.
is_risky_type = df_encoded[['type_TRANSFER', 'type_CASH_OUT']].any(axis=1)
df_encoded['high_risk_type'] = (
    is_risky_type & (df_encoded['amount'] > 400000)
).astype(int)

In [63]:
df_encoded['orig_diff'] = df_encoded['oldbalanceOrg'] - df_encoded['newbalanceOrig']
df_encoded['dest_diff'] = df_encoded['newbalanceDest'] - df_encoded['oldbalanceDest']

In [64]:
df_encoded.head()

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER,high_risk_type,orig_diff,dest_diff
0,983.09,36730.24,35747.15,0.0,0.0,0,0,0,1,0,0,983.09,0.0
1,55215.25,99414.0,44198.75,0.0,0.0,0,0,0,1,0,0,55215.25,0.0
2,220986.01,7773074.97,7994060.98,924031.48,703045.48,0,0,0,0,0,0,-220986.01,-220986.0
3,2357394.75,0.0,0.0,4202580.45,6559975.19,0,0,0,0,1,1,0.0,2357394.74
4,67990.14,0.0,0.0,625317.04,693307.19,0,1,0,0,0,0,0.0,67990.15


Although the new features introduce some redundancy with existing variables (`high_risk_type` is derived from `amount` and `type_*`), they capture interactions that a model might otherwise miss. Tree-based models are generally robust to this kind of multicollinearity. Its impact will be evaluated during modeling.
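
A quick way to see this redundancy is to correlate the new features with the columns they are derived from. The sketch below is only a rough check: it uses the five preview rows shown above (a subset of columns), which is far too few to draw conclusions from. On the full data, `sample` would be replaced by `df_encoded`.

```python
import pandas as pd

# The five rows from the preview above (subset of columns).
sample = pd.DataFrame({
    "amount": [983.09, 55215.25, 220986.01, 2357394.75, 67990.14],
    "type_CASH_OUT": [0, 0, 0, 0, 1],
    "type_TRANSFER": [0, 0, 0, 1, 0],
    "high_risk_type": [0, 0, 0, 1, 0],
    "orig_diff": [983.09, 55215.25, -220986.01, 0.0, 0.0],
    "dest_diff": [0.0, 0.0, -220986.0, 2357394.74, 67990.15],
})

new_cols = ["high_risk_type", "orig_diff", "dest_diff"]
base_cols = ["amount", "type_CASH_OUT", "type_TRANSFER"]

# Pearson correlation of each new feature with each existing one.
corr = sample.corr().loc[new_cols, base_cols]
print(corr.round(2))
```

In this small sample `high_risk_type` coincides exactly with `type_TRANSFER`, which is the kind of overlap the modeling step needs to account for.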

In [65]:
df_encoded.to_csv('../data/bank_transactions_transformed.csv', index=False)