# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you've created in your previous data transformation step (not the original dataset).

## Questions
Is this a classification or regression task?  

This is a classification task since the target variable, isFraud, is categorical with two classes.

Are you predicting for multiple classes or binary classes?  

I am predicting for binary classes since isFraud only has two possible values, 0 and 1.

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

Logistic Regression,
Naive Bayes,
K-Nearest Neighbors (KNN)


## First Model

Using the first model that you've chosen, implement the following steps.

In [18]:
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [7]:
df = pd.read_csv("../data/transformed_transactions.csv")

# Define features and target
X = df.drop('isFraud', axis=1)
y = df['isFraud']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
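
The split above does not stratify on the label. A minimal sketch of how a stratified split could be checked is shown below; it uses synthetic stand-in data (`X_demo` and `y_demo` are hypothetical, not the notebook's dataset) and assumes a 10% positive rate purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the transformed dataset (hypothetical, 10% positives)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo keeps the class ratio the same in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"Positive rate in train: {train_rate:.3f}, test: {test_rate:.3f}")
```

Whether stratification matters here depends on how balanced `isFraud` is in the transformed dataset; with a nearly balanced label the effect is small.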


### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [9]:

# Randomly search for the best hyperparameters on a logistic regression model
param_dist = {
    'penalty': ['l1', 'l2'],
    'C': np.linspace(0.01, 1, 100),
    'solver': ['saga'], 
    'max_iter': [10000]
}

random_search = RandomizedSearchCV(LogisticRegression(), param_distributions=param_dist, cv=5, scoring='accuracy', random_state=42)
random_search.fit(X_train, y_train)

# Best model from random search
best_params_random = random_search.best_params_
best_score_random = random_search.best_score_

print(f"RandomizedSearchCV - Best Params: {best_params_random}")
print(f"RandomizedSearchCV - Cross-Val Accuracy: {best_score_random:.2f}")

RandomizedSearchCV - Best Params: {'solver': 'saga', 'penalty': 'l2', 'max_iter': 10000, 'C': np.float64(0.48000000000000004)}
RandomizedSearchCV - Cross-Val Accuracy: 0.93
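
The `saga` solver tends to converge faster when features share a similar scale, so one option is to put a scaler and the classifier into a single `Pipeline` and search over that. The sketch below is an illustration on synthetic data, not the search that produced the results in this notebook: `X_demo`, `y_demo`, `pipe`, and `search_demo` are hypothetical names, only `C` is searched, and scoring on recall is an assumption about what matters for fraud detection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the transactions dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)

# Scaling happens inside each CV fold, so validation folds never influence the scaler
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="saga", max_iter=10000)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_dist_demo = {"clf__C": np.linspace(0.01, 1, 100)}

search_demo = RandomizedSearchCV(
    pipe,
    param_distributions=param_dist_demo,
    n_iter=5,
    cv=5,
    scoring="recall",  # assumption: missed frauds cost more than false alarms
    random_state=42,
)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)
print(f"Cross-val recall: {search_demo.best_score_:.2f}")
```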


### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, recall, and sensitivity.

In [12]:
# Use the best model found from RandomizedSearchCV to predict on unseen test data

# extract the best estimator
best_log = random_search.best_estimator_

# predict on testing data
log_predictions = best_log.predict(X_test)

# evaluate its accuracy (scikit-learn expects the true labels first, then the predictions)
test_score = accuracy_score(y_test, log_predictions)

print(f"RandomizedSearchCV - Coefficients: {best_log.coef_}")
print(f"RandomizedSearchCV - Test Accuracy: {test_score:.2f}")


print("Confusion Matrix:")
print(confusion_matrix(y_test, log_predictions))

print("\nClassification Report:")
print(classification_report(y_test, log_predictions, digits=4))

RandomizedSearchCV - Coefficients: [[-3.76340955e-06  1.79107653e-05 -2.26321116e-05  6.53035309e-06
  -6.63268667e-06  2.95876825e-12 -4.03312285e-11 -2.63518958e-11
  -1.06990901e-11 -4.40184976e-10  1.35391882e-10]]
RandomizedSearchCV - Test Accuracy: 0.92
Confusion Matrix:
[[335  50]
 [ 14 380]]

Classification Report:
              precision    recall  f1-score   support

           0     0.9599    0.8701    0.9128       385
           1     0.8837    0.9645    0.9223       394

    accuracy                         0.9178       779
   macro avg     0.9218    0.9173    0.9176       779
weighted avg     0.9214    0.9178    0.9176       779
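
The step asks for sensitivity, which the classification report lists as the recall of class 1. As a worked check, the numbers in the report can be derived by hand from the confusion matrix printed above; the variable names below are illustrative only.

```python
# Confusion matrix from the logistic regression output above:
# rows are true classes (0, 1), columns are predicted classes (0, 1)
tn, fp = 335, 50
fn, tp = 14, 380

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
sensitivity = tp / (tp + fn)  # same quantity as recall for class 1
specificity = tn / (tn + fp)  # same quantity as recall for class 0

print(f"accuracy={accuracy:.4f} precision={precision:.4f}")
print(f"sensitivity={sensitivity:.4f} specificity={specificity:.4f}")
```

These values match the report: 0.9178 accuracy, 0.8837 precision, 0.9645 sensitivity, and 0.8701 specificity.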



## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [15]:
# create a Gaussian NB classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# generate predictions on the test set
y_pred = gnb.predict(X_test)

In [16]:
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Confusion Matrix:
 [[375  10]
 [233 161]]

Classification Report:
               precision    recall  f1-score   support

           0       0.62      0.97      0.76       385
           1       0.94      0.41      0.57       394

    accuracy                           0.69       779
   macro avg       0.78      0.69      0.66       779
weighted avg       0.78      0.69      0.66       779
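
The Naive Bayes model above was fit with default settings, so step (2) was not rerun for it. A minimal sketch of what a search could look like follows, assuming `var_smoothing` is the parameter of interest and using synthetic stand-in data; `X_demo`, `y_demo`, `nb_grid`, and `nb_search` are hypothetical names, and the grid range is an arbitrary illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (hypothetical, not the transactions dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)

# var_smoothing adds a fraction of the largest feature variance to all variances
nb_grid = {"var_smoothing": np.logspace(-12, -3, 10)}

nb_search = GridSearchCV(GaussianNB(), param_grid=nb_grid, cv=5, scoring="recall")
nb_search.fit(X_demo, y_demo)

print(nb_search.best_params_)
print(f"Cross-val recall: {nb_search.best_score_:.2f}")
```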



### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.

In [19]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaled, y_train)

y_pred = knn.predict(X_test_scaled)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nAccuracy:", accuracy_score(y_test, y_pred))

Confusion Matrix:
 [[347  38]
 [ 26 368]]

Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.90      0.92       385
           1       0.91      0.93      0.92       394

    accuracy                           0.92       779
   macro avg       0.92      0.92      0.92       779
weighted avg       0.92      0.92      0.92       779


Accuracy: 0.9178433889602053
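
The KNN model above uses a fixed `n_neighbors=3`. One possible way to choose the neighbor count by cross-validation is sketched below on synthetic data; the grid values and the names `knn_pipe`, `knn_grid`, and `knn_search` are assumptions for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical, not the transactions dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)

# Keeping the scaler in the pipeline refits it on the training part of every fold
knn_pipe = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsClassifier())])

knn_grid = {
    "knn__n_neighbors": [3, 5, 7, 9, 11],
    "knn__weights": ["uniform", "distance"],
}

knn_search = GridSearchCV(knn_pipe, param_grid=knn_grid, cv=5, scoring="accuracy")
knn_search.fit(X_demo, y_demo)

print(knn_search.best_params_)
print(f"Cross-val accuracy: {knn_search.best_score_:.2f}")
```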


Naive Bayes performed the worst of the three models, with an accuracy of 0.69 and, for the fraud class, a precision of 0.94, a recall of 0.41, and an F1-score of 0.57. Its precision is high but its recall is very low: it missed 233 of the 394 fraud cases in the test set, which suggests it only flags the most obvious ones. The test set is nearly balanced (385 non-fraud vs. 394 fraud transactions), so the 0.69 accuracy reflects these missed frauds rather than class imbalance.

Logistic Regression and KNN reach the same test accuracy of 91.78%. KNN has better fraud-class precision (0.91 vs. 0.88), meaning fewer false alarms (38 vs. 50). Logistic Regression has higher recall (96.5% vs. 93% for KNN), which means it catches more fraud than KNN (380 vs. 368 cases). Logistic Regression is also slightly ahead on the fraud-class F1-score (0.9223 vs. 0.92).

Overall, Logistic Regression is slightly better than KNN for this task because its higher recall lets it catch more frauds.
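
If recall is the priority, one way to put that preference into a single number is an F-beta score with beta greater than 1, which weights recall more heavily than precision. The sketch below computes F2 for the fraud class from the three confusion matrices above; the choice of beta = 2 is an assumption, and `f_beta` is a helper written for this illustration.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score for the positive class; beta > 1 weights recall over precision."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# (tp, fp, fn) for the fraud class, read from the confusion matrices above
models = {
    "Logistic Regression": (380, 50, 14),
    "Naive Bayes": (161, 10, 233),
    "KNN": (368, 38, 26),
}

f2_scores = {name: f_beta(*counts) for name, counts in models.items()}
for name, score in f2_scores.items():
    print(f"{name}: F2 = {score:.3f}")
```

Under this weighting the ranking is Logistic Regression, then KNN, then Naive Bayes, which is consistent with the conclusion above.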

