# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you've created in your previous data transformation step (not the original dataset).

Revisit: Can I apply a min-max scaler to all the columns, now that the transaction type columns are already between 0 and 1 and might otherwise be interpreted as unimportant even though they matter? If so, should the scaling come after splitting the data and before fitting the model?
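
On the scaling question: one common approach is to split first, then fit the scaler on the training rows only and reuse the fitted scaler on the test rows, so the test set's minima and maxima do not leak into training. A minimal sketch on made-up data (the toy columns and values below are assumptions, not the real transaction data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the transformed dataset (values are made up for illustration).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "amount": rng.uniform(0, 1_000_000, size=200),
    "type_TRANSFER": rng.integers(0, 2, size=200).astype(float),
})
toy_labels = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_labels, test_size=0.2, random_state=42)

# Fit on the training split only, then apply the same min/max to the test split.
scaler = MinMaxScaler()
X_tr_scaled = pd.DataFrame(scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.describe().loc[["min", "max"]])
```

A 0/1 one-hot column comes out of min-max scaling unchanged, while the monetary columns are brought into the same 0 to 1 range. That matters mostly for distance-based models such as KNN.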

## Questions
Is this a classification or regression task?  

Classification

Are you predicting for multiple classes or binary classes?  

Binary classes

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

Logistic Regression, KNN classifier, and possibly Random Forest?

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV

import pandas as pd
import numpy as np


## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [32]:
transactions = pd.read_csv("../data/bank_transactions_cleaned.csv")
# Work with a random sample of 20,000 rows
transactions = transactions.sample(n=20000)

# Features are every column except the isFraud label
X = transactions.drop(columns=["isFraud"])
y = transactions["isFraud"]
X.sample(5)

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
123770,2240.79,156375.0,154134.21,0.0,0.0,0.0,0.0,1.0,0.0
738838,261004.64,0.0,0.0,860043.03,1121047.67,1.0,0.0,0.0,0.0
170860,92339.32,29904.0,0.0,304772.23,397111.55,1.0,0.0,0.0,0.0
938899,88409.98,13873.0,0.0,0.0,88409.98,1.0,0.0,0.0,0.0
564897,2603.95,56629.0,54025.05,0.0,0.0,0.0,0.0,1.0,0.0


In [33]:
y.sample(5)

88964     0
207265    0
121910    0
113535    0
224783    0
Name: isFraud, dtype: int64

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.head()

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
908583,96664.81,5059.98,0.0,0.0,219272.02,1.0,0.0,0.0,0.0
531126,234606.48,2827.0,0.0,1768390.93,2002997.41,1.0,0.0,0.0,0.0
439249,37290.2,843707.91,806417.7,0.0,0.0,0.0,0.0,1.0,0.0
860853,32243.2,195.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
950382,70683.93,4037456.72,4108140.65,222862.81,152178.88,0.0,0.0,0.0,0.0


In [35]:
y_train.value_counts()

isFraud
0    15980
1       20
Name: count, dtype: int64
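
With only 20 fraud rows among 16,000 training rows, a purely random split can leave very few positives on one side. One option is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both splits. A small sketch on synthetic labels (the counts are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 1,000 rows, 10 of them positive (made-up numbers).
toy_y = np.array([1] * 10 + [0] * 990)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

# stratify=toy_y preserves the 1% positive rate in both splits.
X_a, X_b, y_a, y_b = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

print("train positives:", y_a.sum(), "test positives:", y_b.sum())
```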

Apply SMOTE first? Only the training split is oversampled below; the test set keeps its original class balance.

In [36]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=2, random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print("Class distribution after SMOTE:")
print(y_train_smote.value_counts())

Class distribution after SMOTE:
isFraud
0    15980
1    15980
Name: count, dtype: int64
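
For intuition about what the resampling step adds: SMOTE builds each synthetic minority row by taking a minority sample, choosing one of its nearest minority neighbours, and interpolating between the two at a random point. A simplified sketch of that idea (the helper `smote_like_sample` is hypothetical and is not imblearn's implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_sample(X_minority, n_new, k_neighbors=2, seed=42):
    """Generate n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k_neighbors + 1 because column 0 is the point itself (distance 0).
    nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    base = rng.integers(0, len(X_minority), size=n_new)   # starting minority row
    pick = rng.integers(1, k_neighbors + 1, size=n_new)   # which neighbour (skip self)
    neighbors = X_minority[neighbor_idx[base, pick]]
    gap = rng.random((n_new, 1))                          # interpolation fraction in [0, 1)
    return X_minority[base] + gap * (neighbors - X_minority[base])

# Made-up minority points in two dimensions.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like_sample(minority, n_new=8)
print(synthetic.shape)
```

Because every synthetic row is a blend of existing training rows, resampling only the training split keeps the test set free of synthetic data.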


### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [7]:
log_reg = LogisticRegression()

param_dist = {
    'penalty': ['l1', 'l2'],
    'C': np.linspace(0.01, 1, 100),
    'solver': ['saga'], 
    'max_iter': [10000]
}

random_search_log = RandomizedSearchCV(log_reg, param_distributions=param_dist, cv=5, scoring='accuracy', random_state=42)
random_search_log.fit(X_train_smote, y_train_smote)
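
The random forest search further down uses `scoring='f1'`; the same idea can be tried for logistic regression, and `C` can be drawn from a log-uniform distribution instead of a fixed grid. A sketch on synthetic data, assuming `scipy` is installed (the dataset and parameter ranges are illustrative, not tuned for the transaction data):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, mildly imbalanced stand-in data (not the transaction dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.9, 0.1], random_state=42
)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={
        "C": loguniform(1e-3, 1e2),            # sample C on a log scale
        "class_weight": [None, "balanced"],    # optionally reweight the rare class
    },
    n_iter=10,
    scoring="f1",                              # score on the positive class, not accuracy
    cv=5,
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```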


### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, recall, and sensitivity.  

In [10]:
best_random_log = random_search_log.best_estimator_

yhat_log = best_random_log.predict(X_test)

# Evaluate the logistic regression predictions on the held-out test set
confusion_log = confusion_matrix(y_test, yhat_log)
class_report_log = classification_report(y_test, yhat_log)

print("Confusion Matrix \n", confusion_log)
print("\nClassification Report\n", class_report_log)

Confusion Matrix 
 [[1750  244]
 [   0    6]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      0.88      0.93      1994
           1       0.02      1.00      0.05         6

    accuracy                           0.88      2000
   macro avg       0.51      0.94      0.49      2000
weighted avg       1.00      0.88      0.93      2000



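Not a great F1-score for predicting fraud using logistic regression: it catches all 6 fraud cases (recall 1.00) but also flags 244 legitimate transactions, so fraud precision is only 0.02 and the fraud F1-score is 0.05.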
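
The figures in the classification report can be recomputed by hand from the confusion matrix, which also gives sensitivity (another name for recall) and specificity. A short worked sketch using the counts printed above:

```python
import numpy as np

# Confusion matrix from the logistic regression output:
# rows = actual class, columns = predicted class.
cm = np.array([[1750, 244],
               [0, 6]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)      # share of flagged transactions that really are fraud
recall = tp / (tp + fn)         # share of fraud cases caught (sensitivity)
specificity = tn / (tn + fp)    # share of legitimate transactions left alone
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} "
      f"specificity={specificity:.3f} f1={f1:.3f}")
```

Precision is 6 / 250, which is why the fraud F1-score stays low even though no fraud case is missed.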

## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [37]:
params = {
    "n_neighbors": range(1,30),
    "metric": ["cityblock", "euclidean", "cosine"]
}

knn = KNeighborsClassifier()

random_search_knn = RandomizedSearchCV(knn, param_distributions=params, cv=5, random_state=42)
random_search_knn.fit(X_train_smote, y_train_smote)


In [38]:
best_random_knn = random_search_knn.best_estimator_

yhat_knn = best_random_knn.predict(X_test)

# Evaluate the KNN predictions on the held-out test set
confusion_knn = confusion_matrix(y_test, yhat_knn)
class_report_knn = classification_report(y_test, yhat_knn)

print("Confusion Matrix \n", confusion_knn)
print("\nClassification Report\n", class_report_knn)

Confusion Matrix 
 [[3990    3]
 [   1    6]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      3993
           1       0.67      0.86      0.75         7

    accuracy                           1.00      4000
   macro avg       0.83      0.93      0.87      4000
weighted avg       1.00      1.00      1.00      4000



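Improvement from logistic regression: fraud precision rises from 0.02 to 0.67 and fraud F1 from 0.05 to 0.75, with one fraud case missed (recall 0.86). Note that the logistic regression output above comes from a 2,000-row test set and the KNN output from a 4,000-row one, so the comparison is indicative rather than exact.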
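
KNN is distance-based, so unscaled monetary columns can dominate the one-hot type columns. One way to explore this is to put a scaler and the classifier in a `Pipeline` and search over the pipeline; step parameters are addressed with the `step__param` naming. A sketch on synthetic data (the step names `scale` and `knn` are arbitrary choices made here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the transaction dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)

pipe = Pipeline([
    ("scale", MinMaxScaler()),         # refit on the training part of every CV fold
    ("knn", KNeighborsClassifier()),
])

knn_search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "knn__n_neighbors": list(range(1, 30)),
        "knn__metric": ["cityblock", "euclidean", "cosine"],
    },
    n_iter=10,
    scoring="f1",
    cv=5,
    random_state=42,
)
knn_search.fit(X_demo, y_demo)
print(knn_search.best_params_)
```

Keeping the scaler inside the pipeline means each cross-validation fold is scaled using only its own training portion.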

## (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.

In [39]:
from sklearn.ensemble import RandomForestClassifier

param_dist = {
    "criterion": ["gini", "entropy", "log_loss"],
    "max_depth": range(1, 30),
    "min_samples_split": range(2, 20),
    "max_features": ["sqrt", "log2"]
}

rf = RandomForestClassifier(random_state=42)
random_search_rf = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=20, scoring='f1', cv=5, random_state=42)
random_search_rf.fit(X_train_smote, y_train_smote)


In [40]:
best_random_rf = random_search_rf.best_estimator_

yhat_rf = best_random_rf.predict(X_test)

confusion_rf = confusion_matrix(y_test, yhat_rf)
class_report_rf = classification_report(y_test, yhat_rf)

print("Confusion Matrix \n", confusion_rf)
print("\nClassification Report\n", class_report_rf)

Confusion Matrix 
 [[3992    1]
 [   1    6]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      3993
           1       0.86      0.86      0.86         7

    accuracy                           1.00      4000
   macro avg       0.93      0.93      0.93      4000
weighted avg       1.00      1.00      1.00      4000



The best macro avg F1-score came from the random forest classifier: 0.93, compared with 0.87 for KNN and 0.49 for logistic regression.
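
To compare the three models on the fraud class alone, the F1-score can be computed directly from each confusion matrix. A small sketch using the counts printed in the outputs above (the helper `fraud_f1` is introduced here for illustration); keep in mind that the logistic regression counts come from a 2,000-row test set and the other two from a 4,000-row one:

```python
def fraud_f1(tn, fp, fn, tp):
    """F1 for the positive (fraud) class from confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)

# Counts copied from the three confusion matrices, in (tn, fp, fn, tp) order.
results = {
    "logistic regression": (1750, 244, 0, 6),
    "knn": (3990, 3, 1, 6),
    "random forest": (3992, 1, 1, 6),
}

for name, counts in results.items():
    print(f"{name}: fraud F1 = {fraud_f1(*counts):.2f}")
```

All three models find 6 fraud cases; the differences come almost entirely from how many legitimate transactions each one flags.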