# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you created in your previous data transformation step (not the original dataset).

## Questions
Is this a classification or regression task?  

Answer here

Are you predicting for multiple classes or binary classes?  

Answer here

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

List your models here

## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


In [5]:
# import data 
transactions = pd.read_csv("../data/bank_transactions.csv")
transactions = transactions.drop(columns=["nameOrig", "nameDest"])

In [6]:
from sklearn.preprocessing import OneHotEncoder

# Select the categorical columns that will need one-hot encoding.
# nameOrig and nameDest were already dropped when the data was loaded.

# TODO: select your choice of categorical columns
cat_features = transactions.select_dtypes(include=['object']).columns
cat_features

Index(['type'], dtype='object')

In [7]:
                            
# numerical columns (note: this selection still includes the target column isFraud)
num_features = transactions.select_dtypes(include=['int64', 'float64']).columns
num_features

Index(['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest',
       'newbalanceDest', 'isFraud', 'isFlaggedFraud'],
      dtype='object')

In [8]:

X_cat = transactions[cat_features]
X_num = transactions[num_features]

X_cat.head()

       type
0   PAYMENT
1   PAYMENT
2   CASH_IN
3  TRANSFER
4  CASH_OUT


In [9]:
# TODO: Implement your machine learning model!
# Split features and target
X = transactions.select_dtypes(include=['int64', 'float64']).drop(columns=["isFraud"])
y = transactions['isFraud']
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features/ scale data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Return shape for confirmation
X_train.shape, X_test.shape

((800000, 6), (200000, 6))
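
Because fraud cases are rare, a purely random split can leave the training and test sets with slightly different fraud rates. One option is to pass `stratify=y` to `train_test_split`. The sketch below shows the effect on toy labels (the 5% positive rate and the names `X_toy` and `y_toy` are assumptions for illustration only).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 950 negatives and 50 positives (5% positive rate).
y_toy = np.array([0] * 950 + [1] * 50)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

# stratify=y_toy keeps the class proportions the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```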

In [10]:
# instantiate model 
# logistic regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report

# Initialize the logistic regression model
logreg_model = LogisticRegression()
logreg_model.fit(X_train_scaled, y_train)
# Make predictions on the test set
y_pred = logreg_model.predict(X_test_scaled)
# F1-score and classification report
print("F1-score:", f1_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["Not Fraud", "Is Fraud"]))

F1-score: 0.5582655826558266

Classification Report:
              precision    recall  f1-score   support

   Not Fraud       1.00      1.00      1.00    199743
    Is Fraud       0.92      0.40      0.56       257

    accuracy                           1.00    200000
   macro avg       0.96      0.70      0.78    200000
weighted avg       1.00      1.00      1.00    200000



### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.
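
As a rough sketch of this step, the block below runs `GridSearchCV` over a scaled logistic regression. It uses synthetic imbalanced data from `make_classification` rather than the transactions dataset, and the parameter grid, the F1 scoring choice, and the fold count are illustrative assumptions, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 5% positives (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: regularization strength and optional class weighting.
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__class_weight": [None, "balanced"],
}

# F1 on the positive class is used because accuracy is uninformative here.
search = GridSearchCV(
    pipe,
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```

After fitting, `search.best_estimator_` holds the pipeline refit on the full training split with the best parameters.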

### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, and recall (also called sensitivity).
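
The metrics named in this step can all be read off the confusion matrix, and recall and sensitivity are two names for the same quantity. The sketch below uses hand-made toy labels (not model output) to show the arithmetic. Specificity is included only for contrast.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration: 6 negatives and 4 positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)       # identical to sensitivity
specificity = tn / (tn + fp)  # true negative rate

print("accuracy:", accuracy)
print("precision:", precision)
print("recall / sensitivity:", recall)
print("specificity:", specificity)
```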

## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare the evaluation metrics of the two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.
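
One way to make the comparison concrete is to tabulate the minority-class metrics for each model side by side. The helper `compare_models` below is hypothetical (it is not part of scikit-learn or of this notebook), and the sketch runs on synthetic data rather than the transactions dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def compare_models(models, X_train, y_train, X_test, y_test):
    """Fit each model and collect positive-class precision, recall, and F1."""
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rows.append({
            "model": name,
            "precision": precision_score(y_test, pred, zero_division=0),
            "recall": recall_score(y_test, pred, zero_division=0),
            "f1": f1_score(y_test, pred, zero_division=0),
        })
    return pd.DataFrame(rows).set_index("model")


# Synthetic imbalanced data (about 5% positives) as a stand-in.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

results = compare_models(
    {"logreg": LogisticRegression(max_iter=1000), "knn": KNeighborsClassifier(n_neighbors=5)},
    X_tr_scaled, y_tr, X_te_scaled, y_te,
)
print(results)
```

With rare positives, the recall column is usually the most telling, since it shows the share of actual fraud cases each model catches.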

In [11]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score, classification_report

# Initialize the KNN model
knn_model = KNeighborsClassifier(n_neighbors=5)
# Fit the model on the training data
knn_model.fit(X_train_scaled, y_train)
# Make predictions on the test set
y_pred_knn = knn_model.predict(X_test_scaled)
# F1-score and classification report
print("F1-score:", f1_score(y_test, y_pred_knn))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_knn, target_names=["Not Fraud", "Is Fraud"]))

F1-score: 0.7317073170731707

Classification Report:
              precision    recall  f1-score   support

   Not Fraud       1.00      1.00      1.00    199743
    Is Fraud       0.85      0.64      0.73       257

    accuracy                           1.00    200000
   macro avg       0.93      0.82      0.87    200000
weighted avg       1.00      1.00      1.00    200000



### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.

In [None]:
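from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Initialize the SVM model
# Note: fit time for a kernel SVC grows at least quadratically with the number
# of samples, so fitting on the full training set may take a very long time.
svm_model = SVC(kernel='rbf')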
# Fit the model on the training data
svm_model.fit(X_train_scaled, y_train)
# Make predictions on the test set
y_pred_svm = svm_model.predict(X_test_scaled)
# F1-score and classification report
print("F1-score:", f1_score(y_test, y_pred_svm))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_svm, target_names=["Not Fraud", "Is Fraud"]))

In [None]:
# The imblearn module is distributed on PyPI as imbalanced-learn
!pip install imbalanced-learn

In [None]:
# import SMOTE
from imblearn.over_sampling import SMOTE

# draw a 10,000-row sample; fixing random_state makes the sample reproducible
sample = transactions.sample(10000, random_state=42)
sample.head()

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    f1_score, precision_score, recall_score,
)

In [None]:
# TODO: split the data into features and labels, select 2 numerical columns
X = sample[["oldbalanceOrg", "amount"]]
y = sample["isFraud"]

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# train kNN on the imbalanced data
knn_imb = KNeighborsClassifier(n_neighbors=3)
knn_imb.fit(X_train, y_train)

yhat = knn_imb.predict(X_test)
baseline_acc = accuracy_score(y_test, yhat)

print("Baseline testing accuracy (imbalanced) (WOW AMAZING!):", baseline_acc)

In [None]:
print(precision_score(y_test, yhat))
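
A high baseline accuracy says little on data this imbalanced. As a minimal illustration on toy labels (1% positives, an assumed rate rather than the real fraud rate), a classifier that always predicts the majority class scores 99% accuracy while catching no positives at all.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: 990 negatives and 10 positives.
y_toy = np.array([0] * 990 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit (here: 0).
majority = DummyClassifier(strategy="most_frequent")
majority.fit(X_toy, y_toy)
pred = majority.predict(X_toy)

acc = accuracy_score(y_toy, pred)
rec = recall_score(y_toy, pred)
print("accuracy:", acc)  # 0.99
print("recall:", rec)    # 0.0
```

This is why precision, recall, and F1 on the fraud class are more informative here than accuracy alone.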

In [None]:
# Apply SMOTE to rebalance the training set
# (k_neighbors must be less than the number of minority-class samples)
smote = SMOTE(k_neighbors=2, random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("Class distribution after SMOTE:")
print(y_train_smote.value_counts())

In [None]:
# Retrain kNN on the balanced data
knn_smote = KNeighborsClassifier(n_neighbors=3)
knn_smote.fit(X_train_smote, y_train_smote)

yhat_pred = knn_smote.predict(X_test)
smote_acc = accuracy_score(y_test, yhat_pred)

print("Testing accuracy after applying SMOTE:", smote_acc)