# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you created in the previous data transformation step (not the original dataset).

Revisit: Can I apply a min-max scaler to all the columns? The one-hot encoded transaction type columns are between 0 and 1, so next to the much larger amount and balance columns they might not be interpreted as important even though they are. If so, should the scaler be fitted after splitting the data, before fitting the model?
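
One way to approach the question above, as a sketch rather than this notebook's actual pipeline: fit the scaler on the training split only, then reuse its learned minimum and maximum on the test split so that no information from the test set leaks into training. The toy arrays below are made up for illustration; `MinMaxScaler` is scikit-learn's scaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: column 0 mimics an amount, column 1 a one-hot flag.
X_train = np.array([[10.0, 0.0], [20.0, 1.0], [30.0, 0.0]])
X_test = np.array([[40.0, 1.0]])

scaler = MinMaxScaler()
# Learn the per-column min and max from the training rows only.
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same training min and max to the test rows.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
print(X_test_scaled)
```

Test values can fall outside the 0 to 1 range when they exceed what was seen in training, which is expected with this approach. One-hot columns that are already 0 or 1 are left unchanged by the scaling.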

## Questions
Is this a classification or regression task?  

Classification

Are you predicting for multiple classes or binary classes?  

Binary classes

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

Logistic Regression, KNN classifier, Random Forest

In [44]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV


import pandas as pd
import numpy as np


## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [29]:
transactions = pd.read_csv("../data/bank_transactions_cleaned.csv")
transactions = transactions.sample(n=20000)

X = transactions.drop(columns=["isFraud"])
y = transactions["isFraud"]
X.sample(5)

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
762242,50764.47,6873446.81,6924211.29,824449.73,773685.25,True,False,False,False,False
443511,18720.17,0.0,0.0,852967.46,871687.63,False,True,False,False,False
139456,390466.98,0.0,0.0,1025430.55,1415897.53,False,False,False,False,True
678077,165495.52,28492.0,193987.52,0.0,0.0,True,False,False,False,False
652511,16734.39,32442.0,15707.61,0.0,0.0,False,False,False,True,False


In [30]:
y.sample(5)

965302    0
595642    0
257154    0
593602    0
855731    0
Name: isFraud, dtype: int64

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train.head()

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
652217,493060.37,1540118.64,2033179.01,972224.32,479163.94,True,False,False,False,False
535639,14552.4,0.0,0.0,0.0,0.0,False,False,False,True,False
738818,4670.91,45566.0,40895.09,0.0,0.0,False,False,False,True,False
865196,20446.74,66818.0,46371.26,0.0,0.0,False,False,False,True,False
637451,120379.59,0.0,0.0,3153007.63,3273387.22,False,True,False,False,False


In [32]:
y_train.value_counts()

isFraud
0    13985
1       15
Name: count, dtype: int64
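
With only 15 fraud cases among 14,000 training rows, a purely random split can leave either split with very few positives. A possible mitigation, sketched here on made-up labels rather than the real data, is to pass `stratify=y` to `train_test_split` so both splits keep the same class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (assumed for illustration): 1,000 rows, 10 of them positive.
y = np.array([1] * 10 + [0] * 990)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the 1% positive rate in both the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("positives in train:", int(y_train.sum()))
print("positives in test:", int(y_test.sum()))
```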

In [33]:
X_train.shape

(14000, 10)

In [34]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print("Class distribution after SMOTE:")
print(y_train_smote.value_counts())

Class distribution after SMOTE:
isFraud
0    13985
1    13985
Name: count, dtype: int64


In [35]:
X_train_smote.shape

(27970, 10)
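
To make the resampling step less of a black box, here is a simplified illustration of the interpolation idea behind SMOTE. It is not imbalanced-learn's implementation, and the points, the helper name `smote_like_sample`, and the choice of `k` are assumptions for this sketch: each synthetic row is placed at a random position on the line segment between a minority point and one of its nearest minority neighbors.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up minority-class points (2 features) for illustration.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])


def smote_like_sample(points, k, rng):
    """Create one synthetic point between a random point and one of its k nearest neighbors."""
    i = rng.integers(len(points))
    distances = np.linalg.norm(points - points[i], axis=1)
    # Skip position 0 of the sorted order: that is the point itself (distance 0).
    neighbor_ids = np.argsort(distances)[1 : k + 1]
    j = rng.choice(neighbor_ids)
    gap = rng.random()  # uniform in [0, 1)
    return points[i] + gap * (points[j] - points[i])


synthetic = np.array([smote_like_sample(minority, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

Because every synthetic row is a blend of two existing fraud rows, the 13,985 fraud rows after resampling are all derived from the original 15.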

### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [40]:
log_reg = LogisticRegression()

param_dist = {
    'penalty': ['l1', 'l2'],
    'C': np.linspace(0.01, 1, 100),
    'solver': ['saga'], 
    'max_iter': [10000]
}

random_search_log = RandomizedSearchCV(log_reg, param_distributions=param_dist, cv=5, scoring='accuracy', random_state=42)
random_search_log.fit(X_train_smote, y_train_smote)
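
The scikit-learn documentation notes that the `saga` solver is only guaranteed to converge quickly when features are on roughly the same scale, which is not the case for raw amounts and balances next to 0/1 columns. A possible variation, sketched on synthetic data (an assumption, not the transaction dataset), puts a `StandardScaler` in a pipeline so scaling is refit inside each cross-validation fold, and scores with F1 instead of accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the transaction data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=42)

# make_pipeline names each step after its lowercased class name.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=5000),
)

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_dist = {"logisticregression__C": np.logspace(-2, 1, 20)}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_dist,
    n_iter=5,
    cv=5,
    scoring="f1",
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```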


Reusable function to evaluate a fitted search object on the test set.

In [None]:
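def print_metrics(model):
    """Print the confusion matrix and classification report for a fitted search object.

    Relies on the notebook-level X_test and y_test defined above.
    """
    best_model = model.best_estimator_
    yhat = best_model.predict(X_test)

    # Rows are true classes, columns are predicted classes.
    conf_matrix = confusion_matrix(y_test, yhat)
    class_report = classification_report(y_test, yhat)

    print("Confusion Matrix \n", conf_matrix)
    print("\nClassification Report\n", class_report)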
    

### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, and recall (also called sensitivity).  

In [41]:
print_metrics(random_search_log)

Confusion Matrix 
 [[4444 1549]
 [   0    7]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      0.74      0.85      5993
           1       0.00      1.00      0.01         7

    accuracy                           0.74      6000
   macro avg       0.50      0.87      0.43      6000
weighted avg       1.00      0.74      0.85      6000



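Logistic regression with randomized search: accuracy was 74%, and for the fraud class the F1-score was only 1%, precision rounded to 0% (7 of the 1,556 transactions flagged as fraud were actually fraud), and recall was 100%. The model catches every fraudulent transaction in the test set, but it also flags 1,549 legitimate transactions as fraud, so a fraud prediction from it is almost never correct.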
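
As a worked check on the fraud-class numbers, the metrics can be recomputed by hand from the confusion matrix printed above. The variable names are only for this illustration.

```python
# Confusion matrix from the logistic regression output above:
# rows are true classes, columns are predicted classes.
tn, fp = 4444, 1549
fn, tp = 0, 7

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of fraud predictions that were correct
recall = tp / (tp + fn)     # share of actual fraud cases that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The precision is small but not exactly zero; the classification report shows 0.00 only because it rounds to two decimals.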



## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [36]:
params = {
    "n_neighbors": range(1,30),
    "metric": ["cityblock", "euclidean", "cosine"]
}

knn = KNeighborsClassifier()

random_search_knn = RandomizedSearchCV(knn, param_distributions=params, cv=5, random_state=42)
random_search_knn.fit(X_train_smote, y_train_smote)


In [37]:
print_metrics(random_search_knn)

Confusion Matrix 
 [[5986    7]
 [   0    7]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      5993
           1       0.50      1.00      0.67         7

    accuracy                           1.00      6000
   macro avg       0.75      1.00      0.83      6000
weighted avg       1.00      1.00      1.00      6000



KNN with random search was an improvement over logistic regression: accuracy was 99.9% (shown as 1.00 in the report), and for the fraud class the F1-score was 67%, precision 50%, and recall 100%. The model catches all 7 fraudulent transactions and misclassifies only 7 of the 5,993 legitimate ones as fraud. However, these numbers may not be reliable because the test set contains only 7 fraudulent cases.
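
A quick arithmetic check, using only the counts printed in the report above, suggests why accuracy alone says little on data this imbalanced: a hypothetical baseline that labels every transaction as non-fraud would reach the same accuracy as KNN while catching no fraud at all.

```python
# Test-set class counts from the classification report above.
n_legit, n_fraud = 5993, 7
total = n_legit + n_fraud

# A trivial baseline that labels every transaction as non-fraud.
baseline_accuracy = n_legit / total
baseline_recall = 0 / n_fraud

# KNN results from its confusion matrix [[5986, 7], [0, 7]].
knn_accuracy = (5986 + 7) / total
knn_recall = 7 / (7 + 0)

print(f"baseline: accuracy={baseline_accuracy:.4f} recall={baseline_recall:.2f}")
print(f"knn:      accuracy={knn_accuracy:.4f} recall={knn_recall:.2f}")
```

Recall, precision, and F1 for the fraud class are therefore the more informative numbers when comparing these models.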

### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.

In [59]:
from sklearn.ensemble import RandomForestClassifier

param_dist = {
    "criterion": ["gini", "entropy", "log_loss"],
    "max_depth": range(1, 30),
    "min_samples_split": range(2, 20),
    "max_features": ["sqrt", "log2"]
}

rf = RandomForestClassifier(random_state=42)
random_search_rf = RandomizedSearchCV(
    rf, param_distributions=param_dist, n_iter=20, scoring='f1', cv=5, random_state=42
)
random_search_rf.fit(X_train_smote, y_train_smote)


In [61]:
print_metrics(random_search_rf)

Confusion Matrix 
 [[5981   12]
 [   0    7]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      5993
           1       0.37      1.00      0.54         7

    accuracy                           1.00      6000
   macro avg       0.68      1.00      0.77      6000
weighted avg       1.00      1.00      1.00      6000



The random forest classifier was an improvement over logistic regression but not as good as KNN: accuracy was 99.8% (shown as 1.00 in the report), and for the fraud class the F1-score was 54%, precision 37%, and recall 100%.

The model with the highest F1-score for the fraud class was KNN at 67%.