# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you created in your previous data transformation step (not the original dataset).

## Questions
Is this a classification or regression task?  

This is a classification task.

Are you predicting for multiple classes or binary classes?  

We are predicting binary classes. 

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

Based on these observations, I will use a K-Nearest Neighbors (KNN) model and a Naive Bayes model. For the optional third model, I attempted a support vector machine (SVM), but it took too long to run, so no results are reported for it.

## First Model: KNN

Using the first model that you've chosen, implement the following steps.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

from imblearn.over_sampling import SMOTE
from collections import Counter

In [3]:
# Load the cleaned and transformed dataset for modeling
model_train = pd.read_csv("../data/model_train.csv")
model_train.head()

# The dataset is large, so work with a 20% random sample
mod_train = model_train.sample(frac=0.2, random_state=42)

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [4]:
# Select predictors and target
X = mod_train.drop(columns=['isFraud'])
y = mod_train['isFraud']

# Create the train-test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Base KNN estimator, passed to the grid search in step 2
knn = KNeighborsClassifier(n_neighbors=3)
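
With so few fraud rows, a purely random split can leave the test set with a noticeably different fraud ratio than the training set. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both partitions. It runs on synthetic stand-in data (`X_demo`, `y_demo`, and the 1% positive rate are assumptions for illustration), not on the fraud dataset, and it is not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10,000 rows with 1% positives (assumed ratio)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 4))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:100] = 1

# stratify=y_demo keeps the class ratio identical in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("positives in train:", int(y_tr.sum()))
print("positives in test:", int(y_te.sum()))
```

Applied here, that would mean passing `stratify=y` to the existing call, which would change the split and therefore the reported metrics.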

In [5]:
# KNN is distance-based, so all features must be on the same scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply SMOTE to the training data only, to balance the classes
sm = SMOTE(random_state=42)
X_train_smote, y_train_smote = sm.fit_resample(X_train_scaled, y_train)
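
The scaler above is fit once on the whole training set. During cross-validation, each validation fold has then already influenced the scaling statistics. One common way around this is to put the scaler inside a scikit-learn `Pipeline`, so it is refit on the training part of every fold. The following is a minimal sketch on synthetic data (`X_demo`, `y_demo`, `pipe`, `pipe_params`, and `pipe_search` are illustrative names), not the setup used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the fraud dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.9, 0.1], random_state=42
)

# The scaler is refit on the training part of every fold,
# so no statistics leak in from the validation part
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# The step-name prefix ("knn__") routes each parameter to the right step
pipe_params = {
    "knn__n_neighbors": [3, 5, 7],
    "knn__weights": ["uniform", "distance"],
}

pipe_search = GridSearchCV(pipe, param_grid=pipe_params, cv=3, scoring="accuracy")
pipe_search.fit(X_demo, y_demo)

print("Best parameters:", pipe_search.best_params_)
```

The same concern applies to SMOTE. The imbalanced-learn package ships its own `Pipeline` class that accepts samplers, so resampling could be kept inside each fold in a similar way.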

In [6]:
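# Train KNN on the balanced (SMOTE) and scaled training data.
# Note: the grid search and final model below are fit on X_train_scaled, not on this resampled data.
knn_s_smote = KNeighborsClassifier(n_neighbors=3)
knn_s_smote.fit(X_train_smote, y_train_smote)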

### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [7]:
params = {
    'n_neighbors': [1, 3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'metric': ['manhattan', 'euclidean']
}
# Set up GridSearchCV with 3-fold cross-validation, scored on accuracy
grid_search = GridSearchCV(estimator=knn, param_grid=params, cv=3, scoring='accuracy')
grid_search.fit(X_train_scaled, y_train)

# Best parameters and the refitted best estimator
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
print("Best parameters:", grid_search.best_params_)
print("Best Cross-Validation Score:", grid_search.best_score_)

Best parameters: {'metric': 'manhattan', 'n_neighbors': 7, 'weights': 'distance'}
Best Cross-Validation Score: 0.9995125003515222
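
A cross-validation accuracy this close to 1.0 is expected when almost every row belongs to the majority class, so accuracy alone says little about how well fraud is detected. `GridSearchCV` can track several metrics at once and pick the winner on one of them through `refit`. The sketch below uses synthetic data and illustrative names (`X_demo`, `y_demo`, `multi_search`). It shows the mechanism only and was not run on the fraud data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with roughly 5% positives (assumed ratio)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.95, 0.05], random_state=0
)

multi_search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 5, 9]},
    cv=3,
    scoring=["accuracy", "recall", "f1"],  # track all three metrics
    refit="f1",                            # choose the best model by F1
)
multi_search.fit(X_demo, y_demo)

# cv_results_ holds one mean_test_<metric> array per tracked metric
results = multi_search.cv_results_
for k, acc, rec, f1_cv in zip(
    results["param_n_neighbors"],
    results["mean_test_accuracy"],
    results["mean_test_recall"],
    results["mean_test_f1"],
):
    print(f"k={k}: accuracy={acc:.3f} recall={rec:.3f} f1={f1_cv:.3f}")

print("Best parameters by F1:", multi_search.best_params_)
```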


### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, recall (sensitivity), and F1 score.  

In [8]:
# Long-hand way: spell out the best parameters, even though best_params already stores them
knn = KNeighborsClassifier(metric = "manhattan", n_neighbors=7, weights= 'distance')
knn.fit(X_train_scaled, y_train)

#Evaluate the classifier on the scaled test data 
yhat = knn.predict(X_test_scaled)

accuracy = accuracy_score( y_test, yhat)
precision = precision_score(y_test, yhat)
recall = recall_score(y_test, yhat)
f1 = f1_score(y_test, yhat)

print("Testing accuracy :", accuracy)
print("Testing precision :", precision )
print("Testing sensitivity :", recall )
print("Testing F1 score:", f1)

Testing accuracy : 0.99955
Testing precision : 0.925
Testing sensitivity : 0.7115384615384616
Testing F1 score: 0.8043478260869565
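
To see how the four numbers relate, they can be rebuilt from confusion-matrix counts. The counts below are inferred rather than printed by the notebook: they assume a test set of 40,000 rows with 52 fraud cases (the support shown in the classification report further down) and are the values that reproduce the reported metrics. Variable names such as `accuracy_chk` are illustrative.

```python
# Counts inferred from the reported metrics, assuming 40,000 test rows
# of which 52 are fraud (an inference, not an output of the notebook)
tp, fp, fn = 37, 3, 15
tn = 40_000 - tp - fp - fn

accuracy_chk = (tp + tn) / (tp + tn + fp + fn)   # share of all rows classified correctly
precision_chk = tp / (tp + fp)                   # share of fraud alerts that are real fraud
recall_chk = tp / (tp + fn)                      # share of real fraud that is caught
f1_chk = 2 * precision_chk * recall_chk / (precision_chk + recall_chk)

print("accuracy :", accuracy_chk)
print("precision:", precision_chk)
print("recall   :", recall_chk)
print("F1       :", f1_chk)
```

Under these assumed counts the model misses 15 of the 52 fraud cases, which the 0.99955 accuracy hides almost entirely.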


## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [10]:
# select predictor & target
X = mod_train.drop(columns= ['isFraud'])
y = mod_train['isFraud']

# Create train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
# Put all features on the same scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply SMOTE to the training data only, to balance the classes
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)
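
For intuition about what SMOTE adds to the training set: it creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The function below is a simplified sketch of that idea. `smote_sketch` is a hypothetical helper written for illustration and is not how imbalanced-learn implements `SMOTE`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 because each point comes back as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)

    base = rng.integers(0, len(X_min), size=n_new)          # random minority rows
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]   # one of their k neighbours
    gap = rng.random((n_new, 1))                            # position along the segment
    return X_min[base] + gap * (X_min[neigh] - X_min[base])


# Toy minority class: 20 rows with 3 features
X_min_demo = np.random.default_rng(0).normal(size=(20, 3))
X_new = smote_sketch(X_min_demo, n_new=50)
print(X_new.shape)
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region the minority class already occupies.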

In [12]:
# Fit a Gaussian Naive Bayes classifier on the balanced and scaled training data
gnb = GaussianNB()
gnb.fit(X_train_smote, y_train_smote)

### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for this model.

In [13]:
# Define the parameter grid
param_grid = {
    'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]
}

# GridSearchCV with 5-fold cross-validation (default accuracy scoring), fit on the balanced training data
grid_search = GridSearchCV(gnb, param_grid=param_grid, cv=5)
grid_search.fit(X_train_smote, y_train_smote)

# Retrieve the best model and hyperparameters
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

print("Best parameters:", grid_search.best_params_)
print("Best Cross-Validation Score:", grid_search.best_score_)

Best parameters: {'var_smoothing': 1e-05}
Best Cross-Validation Score: 0.7850714978566289
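
The only hyperparameter searched here is `var_smoothing`. In scikit-learn's `GaussianNB` it is the fraction of the largest feature variance that gets added to every variance for numerical stability. The fitted model exposes the resulting absolute value as `epsilon_`. A small sketch on synthetic data (`X_demo`, `y_demo`, and `gnb_demo` are illustrative names):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (not the fraud dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=42)

for vs in [1e-9, 1e-5]:
    gnb_demo = GaussianNB(var_smoothing=vs).fit(X_demo, y_demo)
    # epsilon_ is the absolute amount added to every per-class feature variance
    expected = vs * X_demo.var(axis=0).max()
    print(f"var_smoothing={vs:g}  epsilon_={gnb_demo.epsilon_:.3e}  expected={expected:.3e}")
```

The grid above selected its largest value, `1e-05`, which may mean the best setting lies beyond the searched range. Widening the grid would be one way to check.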


### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, recall (sensitivity), and F1 score.  

In [14]:
#Use the best model to make predictions on the test set
y_pred = best_model.predict(X_test_scaled)

# create a classification report and evaluate
class_report = classification_report(y_test, y_pred)

print("Best hyperparameters:", best_params)
print(class_report)

Best hyperparameters: {'var_smoothing': 1e-05}
              precision    recall  f1-score   support

           0       1.00      0.58      0.74     39948
           1       0.00      0.98      0.01        52

    accuracy                           0.58     40000
   macro avg       0.50      0.78      0.37     40000
weighted avg       1.00      0.58      0.74     40000



In [15]:
# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)  # Recall is sensitivity
f1 = f1_score(y_test, y_pred)

# Display metrics
print("Testing accuracy :", accuracy)
print("Testing precision :", precision )
print("Testing sensitivity :", recall )
print("Testing F1 score :",  f1)

Testing accuracy : 0.584325
Testing precision : 0.0030581039755351682
Testing sensitivity : 0.9807692307692307
Testing F1 score : 0.006097196485145556
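
On the numbers reported in this notebook, KNN appears to be the stronger of the two models: it keeps precision high (0.925) while catching about 71% of fraud, whereas Naive Bayes catches about 98% of fraud but flags so many legitimate rows that its precision falls to roughly 0.003. Note that the final KNN was fit on the scaled but un-resampled training data, while Naive Bayes was fit on the SMOTE-balanced data, so the comparison is not strictly like for like. The sketch below puts both results next to a trivial "always predict not fraud" reference, using the test-set composition from the classification report above. The rounded metric values and the `reported` dictionary are transcribed for illustration.

```python
# Test-set composition taken from the classification report above
n_neg, n_pos = 39_948, 52

# Reference point: a model that always predicts "not fraud" (class 0)
baseline_accuracy = n_neg / (n_neg + n_pos)
baseline_recall = 0 / n_pos  # it never catches a fraud case

# Metrics reported earlier in this notebook (rounded)
reported = {
    "KNN": {"accuracy": 0.99955, "precision": 0.925, "recall": 0.7115, "f1": 0.8043},
    "Naive Bayes": {"accuracy": 0.584325, "precision": 0.0031, "recall": 0.9808, "f1": 0.0061},
}

print(f"always-0 baseline: accuracy={baseline_accuracy:.4f} recall={baseline_recall:.1f}")
for name, m in reported.items():
    lift = m["accuracy"] - baseline_accuracy
    print(f"{name}: accuracy lift over baseline={lift:+.4f} f1={m['f1']:.4f}")
```

The baseline already reaches about 99.87% accuracy, which is why F1, precision, and recall are more informative than accuracy for this task.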


## (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.

In [5]:
from sklearn.svm import LinearSVC, SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import RandomizedSearchCV

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [6]:
# select predictor & target
X = mod_train.drop(columns=['isFraud'])
y = mod_train['isFraud']

# Create train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
# SVMs are sensitive to feature scale, so standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply SMOTE to the training data only, to balance the classes
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train_scaled, y_train)

In [8]:
# Fit a linear support vector classifier with C=1.0 and max_iter=100000 on the balanced data
lin_svc = LinearSVC(C=1.0, max_iter=100000, random_state=42)
lin_svc.fit(X_train_sm, y_train_sm)

### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for this model.

In [None]:
# Define the parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'max_iter': [1000, 5000, 10000, 20000],
    'penalty': ['l2'],
    'dual': [True, False]
}

svc = LinearSVC(random_state=42)

# Set up RandomizedSearchCV with 3-fold cross-validation
# (n_iter is left at its default of 10 sampled parameter settings)
random_search = RandomizedSearchCV(svc, param_distributions=param_grid, cv=3, random_state=42)
random_search.fit(X_train_sm, y_train_sm)

# Retrieve the best model and hyperparameters
best_params = random_search.best_params_
best_model = random_search.best_estimator_

print("Best parameters:", random_search.best_params_)
print("Best Cross-Validation Score:", random_search.best_score_)
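
Since this search did not finish in reasonable time, here is one possible way to shrink it. The sketch is an assumption about what might help, not something that was run on the fraud data. It fixes `dual=False`, which scikit-learn's documentation recommends when there are more samples than features, searches only `C`, and limits the number of sampled settings with `n_iter`. The data and the names `X_demo`, `y_demo`, `svc_demo`, and `svc_search` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import LinearSVC

# Synthetic, imbalanced stand-in data (not the fraud dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=42
)

# dual=False solves the primal problem, which the scikit-learn
# documentation prefers when n_samples > n_features
svc_demo = LinearSVC(dual=False, random_state=42)

# Only C is searched; max_iter is a convergence limit rather than a model setting
svc_search = RandomizedSearchCV(
    svc_demo,
    param_distributions={"C": [0.01, 0.1, 1, 10, 100]},
    n_iter=3,        # number of sampled settings (the default is 10)
    cv=3,
    scoring="f1",
    random_state=42,
)
svc_search.fit(X_demo, y_demo)

print("Best parameters:", svc_search.best_params_)
print("Best Cross-Validation Score:", svc_search.best_score_)
```

Passing `n_jobs=-1` to `RandomizedSearchCV` would additionally spread the fits across the available CPU cores.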



### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, recall (sensitivity), and F1 score.  

In [None]:
#Use the best model to make predictions on the test set
y_pred = best_model.predict(X_test_scaled)

# create a classification report and evaluate
class_report = classification_report(y_test, y_pred)

print("Best hyperparameters:", best_params)
print(class_report)

In [None]:
# Model Evaluation
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)  # Recall is sensitivity
f1 = f1_score(y_test, y_pred)

# Display metrics
print("Testing accuracy :", accuracy)
print("Testing precision :", precision )
print("Testing sensitivity :", recall )
print("Testing F1 score :",  f1)