# Data Transform

In this notebook, we will ask you a series of questions to evaluate your findings from your EDA. Based on your response & justification, we will ask you to also apply a subsequent data transformation. 

If you state that you will not apply any data transformations for this step, you must **justify** why your dataset or machine-learning model does not require the mentioned data preprocessing step.

The bonus step is completely optional, but if you provide a sufficient feature engineering step in this project we will add `1000` points to your Kahoot leaderboard score.

You will write out this transformed dataframe as a `.csv` file to your `data/` folder.

**Note**: This dataset is quite large. If some data operations take too long to complete on your machine, use the `sample()` method to transform a subset of your data.
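As a rough illustration of the note above, the sketch below draws a subset in two ways from a made-up frame (the 1,000 rows and 2% fraud share are assumptions for illustration, not properties of the real dataset). Sampling per class is one possible way to keep the rare fraud rows represented in the subset.

```python
import pandas as pd

# Made-up frame standing in for the full dataset: 1,000 rows, 2% fraud
toy = pd.DataFrame({
    "amount": range(1000),
    "isFraud": [1] * 20 + [0] * 980,
})

# Plain random subset: 10% of the rows
subset = toy.sample(frac=0.1, random_state=42)

# Subset drawn separately within each class, so the fraud share is preserved
stratified = toy.groupby("isFraud").sample(frac=0.1, random_state=42)

print(len(subset), len(stratified))
print(stratified["isFraud"].value_counts())
```

A plain `sample()` can by chance return very few fraud rows, which is why the per-class variant may be the safer choice for this kind of data.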

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

from imblearn.over_sampling import SMOTE
from collections import Counter


## Q1

Does your model contain any missing values or "non-predictive" columns? If so, which adjustments should you take to ensure that your model has good predictive capabilities? Apply your data transformations (if any) in the code-block below.

Yes, the dataset does contain non-predictive columns. I decided to drop the columns 'nameOrig', 'nameDest', and 'isFlaggedFraud'.

In [10]:
sample = pd.read_csv("../data/bank_transactions.csv")
mod1 = sample.drop(columns=['nameOrig', 'nameDest', 'isFlaggedFraud'])
mod1.head()

Unnamed: 0,type,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud
0,PAYMENT,983.09,36730.24,35747.15,0.0,0.0,0
1,PAYMENT,55215.25,99414.0,44198.75,0.0,0.0,0
2,CASH_IN,220986.01,7773074.97,7994060.98,924031.48,703045.48,0
3,TRANSFER,2357394.75,0.0,0.0,4202580.45,6559975.19,0
4,CASH_OUT,67990.14,0.0,0.0,625317.04,693307.19,0


In [11]:
x = mod1[['oldbalanceOrg', 'amount']]  # predictor variables
y = mod1['isFraud']  # target variable

# Split into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# Train KNN on the imbalanced data
knn_imb = KNeighborsClassifier(n_neighbors=3, metric='cityblock')
knn_imb.fit(x_train, y_train)

In [12]:
yhat = knn_imb.predict(x_test)
baseline_accuracy = accuracy_score(y_test, yhat)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")


Baseline accuracy: 1.00


100% classification accuracy.
The classifier essentially labeled everything as 0 (non-fraudulent), so this type of accuracy should not be trusted: the model did nothing of interest.
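One way to check this suspicion is to look at recall and the confusion matrix next to accuracy. The sketch below uses made-up labels (10 fraud cases among 1,000 transactions, a ratio assumed only for illustration) and a stand-in "model" that predicts 0 for every row; it is not the KNN model trained above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 990 non-fraud (0) and 10 fraud (1) transactions
y_true = np.array([0] * 990 + [1] * 10)

# A stand-in "model" that labels every transaction as non-fraudulent
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

print(f"Accuracy: {acc:.2f}")  # high, even though no fraud is caught
print(f"Recall:   {rec:.2f}")  # share of fraud cases actually caught
print(cm)                      # rows: true class, columns: predicted class
```

Running the same three metrics on `y_test` and `yhat` from the cell above would show whether the KNN baseline behaves like this stand-in.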

## Q2

Do certain transaction types consistently differ in amount or fraud likelihood? If so, how might you transform the type column to make this pattern usable by a machine learning model? Apply your data transformations (if any) in the code-block below.

From our EDA, we learned that the fraudulent behavior we detected occurred in cash-in and transfer transactions. Based on these findings, we convert the 'type' column using one-hot encoding. This transformation turns the categorical variable into binary indicator columns suitable for future machine learning models.

In [13]:
# Using one-hot encoding to transform the 'type' column
encoded_mod1 = pd.get_dummies(mod1, columns=['type'], drop_first=True)
encoded_mod1.head()  # get_dummies returns True/False here, but 0/1 values are preferred

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,983.09,36730.24,35747.15,0.0,0.0,0,False,False,True,False
1,55215.25,99414.0,44198.75,0.0,0.0,0,False,False,True,False
2,220986.01,7773074.97,7994060.98,924031.48,703045.48,0,False,False,False,False
3,2357394.75,0.0,0.0,4202580.45,6559975.19,0,False,False,False,True
4,67990.14,0.0,0.0,625317.04,693307.19,0,True,False,False,False
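If 0/1 indicators are preferred over True/False, one option is the `dtype` argument of `pd.get_dummies`. The sketch below runs on a tiny made-up frame (the rows are illustrative and not taken from the dataset).

```python
import pandas as pd

# Tiny made-up frame that mimics the 'type' column
toy = pd.DataFrame({
    "type": ["PAYMENT", "CASH_IN", "TRANSFER", "CASH_OUT", "DEBIT"],
    "amount": [983.09, 220986.01, 2357394.75, 67990.14, 100.0],
})

# dtype=int makes the indicator columns 0/1 instead of True/False
toy_encoded = pd.get_dummies(toy, columns=["type"], drop_first=True, dtype=int)
print(toy_encoded)
```

With `drop_first=True`, the first category in sorted order (`CASH_IN` here) becomes the baseline and gets no column of its own.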


In [17]:
# Initialize encoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform
encoded_array = encoder.fit_transform(mod1[['type']])
encoded_df = pd.DataFrame(encoded_array, columns=encoder.get_feature_names_out(['type']))

# Concatenate the original DataFrame with the encoded DataFrame and drop the original 'type' column
mod1_encoded = pd.concat([mod1.drop('type', axis=1), encoded_df], axis=1)
mod1_encoded

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,983.09,36730.24,35747.15,0.00,0.00,0,0.0,0.0,0.0,1.0,0.0
1,55215.25,99414.00,44198.75,0.00,0.00,0,0.0,0.0,0.0,1.0,0.0
2,220986.01,7773074.97,7994060.98,924031.48,703045.48,0,1.0,0.0,0.0,0.0,0.0
3,2357394.75,0.00,0.00,4202580.45,6559975.19,0,0.0,0.0,0.0,0.0,1.0
4,67990.14,0.00,0.00,625317.04,693307.19,0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
999995,13606.07,114122.11,100516.04,0.00,0.00,0,0.0,0.0,0.0,1.0,0.0
999996,9139.61,0.00,0.00,0.00,0.00,0,0.0,0.0,0.0,1.0,0.0
999997,153650.41,50677.00,0.00,0.00,380368.36,0,0.0,1.0,0.0,0.0,0.0
999998,163810.52,0.00,0.00,357850.15,521660.67,0,0.0,1.0,0.0,0.0,0.0


## Q3

After exploring your data, you may have noticed that fraudulent transactions are rare compared to non-fraudulent ones. What challenges might this pose when training a machine learning model? What strategies could you use to ensure your model learns meaningful patterns from the minority class? Apply your data transformations (if any) in the code-block below.

Because fraudulent transactions are so rare compared to non-fraudulent ones, the main challenge is class imbalance. A model trained on an imbalanced dataset becomes heavily biased towards the majority class (here, non-fraudulent cases), so only a small percentage of fraudulent transactions are correctly identified. To address this challenge, we applied SMOTE (Synthetic Minority Over-sampling Technique) to the training data. SMOTE generates synthetic minority-class samples by interpolating between existing fraud cases and their nearest neighbors, so the resampled training set has equal representation of non-fraudulent and fraudulent cases and the model can learn meaningful patterns from the minority class.
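To make the idea behind SMOTE more concrete, here is a simplified NumPy sketch of its interpolation step. It is not the `imblearn` implementation; the points, the helper name `synthesize`, and the choice of `k` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up minority-class points (two features each)
minority = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.8]])

def synthesize(points, n_new, k=2):
    """Create n_new synthetic points by interpolating toward near neighbors."""
    new_points = []
    for _ in range(n_new):
        base = points[rng.integers(len(points))]
        # Distances from the base point to every minority point
        dists = np.linalg.norm(points - base, axis=1)
        # Indices of the k nearest neighbors, skipping the base point itself
        neighbors = np.argsort(dists)[1:k + 1]
        neighbor = points[rng.choice(neighbors)]
        # Random position on the segment between base and neighbor
        gap = rng.random()
        new_points.append(base + gap * (neighbor - base))
    return np.array(new_points)

synthetic = synthesize(minority, n_new=6)
print(synthetic.shape)
```

Each synthetic point lies on a line segment between two real minority points, which is why oversampling of this kind should be applied to the training split only and never to the test set.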

In [28]:
# Separate mod1_encoded into features and target
X = mod1_encoded.drop('isFraud', axis=1)
y = mod1_encoded['isFraud']

# Split into train and test sets (use X, the full encoded feature set)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Apply SMOTE to the training data only
smote = SMOTE(random_state=42)
x_train_resampled, y_train_resampled = smote.fit_resample(x_train, y_train)

# Train a Random Forest Classifier on the resampled data
trained_mod = RandomForestClassifier(random_state=42)
trained_mod.fit(x_train_resampled, y_train_resampled)

# Predict on the untouched test set
yhat = trained_mod.predict(x_test)

print(f"Resampled class distribution: {Counter(y_train_resampled)}")

Resampled class distribution: Counter({0: 699087, 1: 699087})


## Bonus (optional)

Are there interaction effects between variables (e.g., fraud and high amount and transaction type) that aren't captured directly in the dataset? Would it be helpful to manually engineer any new features that reflect these interactions? Apply your data transformations (if any) in the code-block below.

Answer Here

In [16]:
# write out newly transformed dataset to your folder
...
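A minimal sketch of the write-out step, assuming a made-up frame and a hypothetical file name (`transformed_example.csv`); a temporary directory stands in for the `data/` folder so the example can run anywhere.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for the transformed dataset
toy = pd.DataFrame({"amount": [983.09, 55215.25], "isFraud": [0, 0]})

# Hypothetical output location; in the notebook this would be a path under data/
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "transformed_example.csv")

# index=False keeps the row index out of the file, so reading it back
# does not produce an extra "Unnamed: 0" column
toy.to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```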