# Data Transform

In this notebook, we will ask you a series of questions to evaluate your findings from your EDA. Based on your response & justification, we will ask you to also apply a subsequent data transformation. 

If you state that you will not apply any data transformations for this step, you must **justify** as to why your dataset/machine-learning does not require the mentioned data preprocessing step.

The bonus step is completely optional, but if you provide a sufficient feature engineering step in this project we will add `1000` points to your Kahoot leaderboard score.

You will write out this transformed dataframe as a `.csv` file to your `data/` folder.

**Note**: Again, this dataset is quite large. If some data operations take too long to complete on your machine, use the `sample()` method to transform a subset of your data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# import data 
transactions = pd.read_csv("../data/bank_transactions.csv")
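
If the full table is too slow to work with, the `sample()` approach mentioned in the note could look roughly like the sketch below. The toy frame, the 10% fraction, and the seed are assumptions for illustration, not values taken from the project.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real transactions table (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "amount": rng.uniform(1, 10_000, size=1_000).round(2),
    "isFraud": rng.integers(0, 2, size=1_000),
})

# Keep a reproducible 10% random subset; random_state fixes the draw.
subset = toy.sample(frac=0.1, random_state=42)
print(subset.shape)
```

Because fraud rows are rare in the real data, a small random subset may contain very few of them, so it is worth checking the class counts after sampling.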

## Q1

Does your model contain any missing values or "non-predictive" columns? If so, which adjustments should you take to ensure that your model has good predictive capabilities? Apply your data transformations (if any) in the code-block below.

Answer here

EDA found no missing values in the dataset. However, the `nameOrig` and `nameDest` columns were identified as non-predictive because they act as unique identifiers, so they should be removed before training a model to ensure better predictive performance. Additionally, the `isFlaggedFraud` column was found to be unreliable: its flag is wrong for most fraudulent transactions. This column should be dropped as well.

In [3]:
# drop columns

transactions = transactions.drop(['nameOrig','nameDest','isFlaggedFraud'], axis=1)
transactions.head()

Unnamed: 0,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud
0,PAYMENT,983.09,36730.24,35747.15,0.0,0.0,0
1,PAYMENT,55215.25,99414.0,44198.75,0.0,0.0,0
2,CASH_IN,220986.01,7773074.97,7994060.98,924031.48,703045.48,0
3,TRANSFER,2357394.75,0.0,0.0,4202580.45,6559975.19,0
4,CASH_OUT,67990.14,0.0,0.0,625317.04,693307.19,0
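
The two checks behind the Q1 answer (no missing values, identifier-like columns) can be reproduced with a few lines of pandas. This is a minimal sketch on a made-up toy frame; the `0.9` uniqueness cutoff is an assumed heuristic, not a rule from the EDA.

```python
import pandas as pd

# Toy stand-in for the raw table; values are made up for illustration.
toy = pd.DataFrame({
    "nameOrig": ["C1", "C2", "C3", "C4"],
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [10.0, 25.5, 300.0, 42.0],
    "isFraud": [0, 0, 1, 0],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()

# Share of distinct values per column. A ratio near 1.0 on a non-numeric
# column suggests an identifier rather than a predictive feature.
unique_ratio = toy.nunique() / len(toy)
id_like = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col]) and unique_ratio[col] > 0.9
]

print(missing_per_column)
print(id_like)
```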


## Q2

Do certain transaction types consistently differ in amount or fraud likelihood? If so, how might you transform the type column to make this pattern usable by a machine learning model? Apply your data transformations (if any) in the code-block below.

Answer here

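Yes. The EDA showed that `TRANSFER` and `CASH_OUT` transactions are significantly more often associated with fraudulent activity than the other types. Because `type` is a categorical column, it is one-hot encoded in the code-block below so that a model can use this pattern.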

In [11]:
# encoding type column

df_dummy = pd.get_dummies(transactions, columns=['type'], drop_first=True)
df_dummy.head()


Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,983.09,36730.24,35747.15,0.0,0.0,0,False,False,True,False
1,55215.25,99414.0,44198.75,0.0,0.0,0,False,False,True,False
2,220986.01,7773074.97,7994060.98,924031.48,703045.48,0,False,False,False,False
3,2357394.75,0.0,0.0,4202580.45,6559975.19,0,False,False,False,True
4,67990.14,0.0,0.0,625317.04,693307.19,0,True,False,False,False
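
As a rough illustration of the pattern described in the answer, the sketch below computes the fraud rate per transaction type and then applies the same one-hot encoding. The toy rows are invented; only the column names and type labels come from the dataset.

```python
import pandas as pd

# Toy rows (made up) using the dataset's type labels.
toy = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "TRANSFER", "CASH_OUT", "CASH_IN"],
    "isFraud": [0, 0, 1, 0, 1, 0],
})

# Mean of a 0/1 label per group is the fraud rate for that type.
fraud_rate_by_type = (
    toy.groupby("type")["isFraud"].mean().sort_values(ascending=False)
)

# drop_first=True removes the first category in sorted order (CASH_IN here),
# which becomes the baseline: a CASH_IN row has False in every dummy column.
encoded = pd.get_dummies(toy, columns=["type"], drop_first=True)

print(fraud_rate_by_type)
print(list(encoded.columns))
```

Dropping the first level avoids a redundant column, which matters mainly for linear models; tree-based models are usually indifferent to it.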


## Q3

After exploring your data, you may have noticed that fraudulent transactions are rare compared to non-fraudulent ones. What challenges might this pose when training a machine learning model? What strategies could you use to ensure your model learns meaningful patterns from the minority class? Apply your data transformations (if any) in the code-block below.

Answer here

The minority class here is fraudulent transactions: only 1,297 of 1,000,000 rows (about 0.13%) are labeled as fraud. This creates a class imbalance problem, where the model may become biased toward predicting the majority class (non-fraud) and miss actual fraud cases.

To address the imbalance, we can apply techniques such as SMOTE or undersampling the majority class.

In [None]:

# apply smote
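
The cell above is left empty in the notebook. As a dependency-free illustration of the other strategy named in the answer, here is a minimal sketch of random undersampling of the majority class with pandas. It is not SMOTE (which is provided by the separate `imbalanced-learn` package), and the toy data and 1:1 target ratio are assumptions.

```python
import pandas as pd

# Toy imbalanced table: 95 non-fraud rows and 5 fraud rows (made-up values).
toy = pd.DataFrame({
    "amount": [float(i) for i in range(100)],
    "isFraud": [0] * 95 + [1] * 5,
})

fraud = toy[toy["isFraud"] == 1]
non_fraud = toy[toy["isFraud"] == 0]

# Keep every fraud row and draw an equally sized random sample of non-fraud rows.
non_fraud_down = non_fraud.sample(n=len(fraud), random_state=42)

# Recombine and shuffle so the two classes are interleaved.
balanced = (
    pd.concat([fraud, non_fraud_down])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["isFraud"].value_counts())
```

Resampling of either kind is normally applied to the training split only, so that evaluation still reflects the real class ratio.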


## Bonus (optional)

Are there interaction effects between variables (e.g., fraud and high amount and transaction type) that aren't captured directly in the dataset? Would it be helpful to manually engineer any new features that reflect these interactions? Apply your data transformations (if any) in the code-block below.

Answer here

Yes. During EDA, we noticed that fraudulent transactions often involved high amounts and occurred mostly in `TRANSFER` or `CASH_OUT` types. This suggests an interaction effect between transaction type and amount that isn’t explicitly captured in the dataset: the combination of a high amount and a risky type may be a better indicator of fraud than either feature alone.
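
One way such an interaction could be engineered is sketched below. The feature names (`is_risky_type`, `is_high_amount`, `risky_high_amount`) and the quantile used as the "high amount" cutoff are hypothetical choices for illustration, and the rows are made up.

```python
import pandas as pd

# Toy rows (made up) using the dataset's column names.
toy = pd.DataFrame({
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "CASH_IN", "TRANSFER"],
    "amount": [50.0, 9_000.0, 120.0, 8_500.0, 200.0],
})

RISKY_TYPES = ["TRANSFER", "CASH_OUT"]  # types the EDA linked to fraud
HIGH_AMOUNT_QUANTILE = 0.6              # hypothetical cutoff, tune on real data

threshold = toy["amount"].quantile(HIGH_AMOUNT_QUANTILE)

# Two binary indicators and their product, which is 1 only when a
# transaction is both a risky type and above the amount threshold.
toy["is_risky_type"] = toy["type"].isin(RISKY_TYPES).astype(int)
toy["is_high_amount"] = (toy["amount"] > threshold).astype(int)
toy["risky_high_amount"] = toy["is_risky_type"] * toy["is_high_amount"]

print(toy)
```

If the threshold is derived from the data, it should be computed on the training split and reused on the test split to avoid leakage.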

In [2]:
# write out newly transformed dataset to your folder
...
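
The write-out step is left as a placeholder above. A minimal sketch of what it might look like follows; the file name `transactions_transformed.csv` is hypothetical, and a temporary directory stands in for the project's `data/` folder so the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the transformed dataframe.
toy = pd.DataFrame({"amount": [983.09, 55215.25], "isFraud": [0, 0]})

# A temporary directory stands in for the project's data/ folder here.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "transactions_transformed.csv"  # hypothetical name
    # index=False keeps the row index from being written as an extra column.
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```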