# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you created in your previous data transformation step (not the original dataset).

## Questions
Is this a classification or regression task?  

Answer here

Are you predicting for multiple classes or binary classes?  

Answer here

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

List your models here

## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, f1_score
from imblearn.over_sampling import RandomOverSampler

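# Load the transformed dataset and separate the features from the isFraud label.
transactions = pd.read_csv('../data/transformed_bank_transcations.csv')
X = transactions.drop(columns=['isFraud'])
y = transactions['isFraud']

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Oversample the minority class in the training split only,
# so the test set keeps its original class balance.
overSampling = RandomOverSampler(random_state=42)
X_train_resampled, y_train_resampled = overSampling.fit_resample(X_train, y_train)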
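To make the resampling step less of a black box, here is a minimal sketch of the idea behind random oversampling, written with the standard library only. It is not imblearn's implementation; the helper `random_oversample` and the toy rows are hypothetical and exist purely for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate rows of the smaller classes (sampling with replacement)
    until every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows that belong to this class.
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data: 8 rows of class 0 and 2 rows of class 1 (hypothetical values).
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

res_rows, res_labels = random_oversample(rows, labels)
print(Counter(labels))      # class counts before resampling
print(Counter(res_labels))  # class counts after resampling
```

Because the added rows are copies of existing minority rows, resampling only the training split keeps duplicated information out of the test set.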


### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [2]:
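from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

base_model = LogisticRegression()

# Candidate hyperparameters. Note: the default 'lbfgs' solver does not support
# the 'l1' penalty, so those candidates fail to fit unless a solver such as
# 'liblinear' or 'saga' is set.
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'max_iter': [100, 300, 500]
}

# 3-fold cross-validated grid search, scored on F1 for the positive (fraud) class.
grid_search = GridSearchCV(base_model, param_grid, cv=3, scoring='f1')
grid_search.fit(X_train_resampled, y_train_resampled)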

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
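The warning above suggests scaling the data or trying another solver. A minimal sketch of both suggestions follows, assuming a `StandardScaler` step and the `liblinear` solver (which accepts both `l1` and `l2` penalties). The synthetic data from `make_classification` and the names `X_demo`, `y_demo`, `pipe`, and `param_grid_demo` are stand-ins for illustration, not the bank dataset or the grid used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (hypothetical; not the bank dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Scaling lives inside the pipeline, so each CV fold fits the scaler
# on its own training portion only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid_demo = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipe, param_grid_demo, cv=3, scoring="f1")
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Whether scaling removes the warning on the real data depends on the features, so treat this as a starting point rather than a guaranteed fix.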

### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, and recall (also called sensitivity).

In [3]:
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print("best parameters:", grid_search.best_params_)
print("Classification Report:", classification_report(y_test, y_pred))
print("f1_score: ", f1_score(y_test, y_pred))

best parameters: {'C': 10, 'max_iter': 300, 'penalty': 'l2'}
Classification Report:               precision    recall  f1-score   support

           0       1.00      0.95      0.97    199743
           1       0.02      0.98      0.05       257

    accuracy                           0.95    200000
   macro avg       0.51      0.96      0.51    200000
weighted avg       1.00      0.95      0.97    200000

f1_score:  0.046550445103857564
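To see why the fraud class has high recall but very low precision, it helps to look at raw counts. The sketch below recomputes the metrics from confusion-matrix counts. These counts are not printed anywhere in this notebook: they are back-solved from the reported support (257 fraud rows, 199743 non-fraud rows) and the reported F1 score, so treat them as an inference. The helper `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1, and accuracy for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Inferred counts (assumption: back-solved from the report, not notebook output).
tp, fn = 251, 6        # 251 + 6 = 257 fraud rows in the test set
fp = 10276             # non-fraud rows flagged as fraud
tn = 199743 - fp       # remaining non-fraud rows

precision, recall, f1, accuracy = binary_metrics(tp, fp, fn, tn)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
print(f"f1={f1}")
```

Under these counts the model catches almost every fraud case, but roughly 10,000 false alarms swamp the 251 true positives, which is what drives precision and F1 toward zero.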


## Second Model

Create a second machine learning model and rerun steps (2) and (3) on it. Compare the evaluation metrics of the two models. Which handles the class imbalance more effectively?

Create as many code blocks as needed.

In [4]:
from sklearn.neighbors import KNeighborsClassifier

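base_model = KNeighborsClassifier()

# Candidate neighborhood sizes and distance metrics.
param_grid = {
    'n_neighbors': [2, 3, 5],
    'metric': ['euclidean', 'manhattan']
}

# 5-fold cross-validated grid search, scored on F1 for the positive (fraud) class.
grid_search = GridSearchCV(base_model, param_grid, cv=5, scoring='f1')
grid_search.fit(X_train_resampled, y_train_resampled)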

In [5]:
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print("best parameters:", grid_search.best_params_)
print("Classification Report:", classification_report(y_test, y_pred))
print("f1_score: ", f1_score(y_test, y_pred))

best parameters: {'metric': 'euclidean', 'n_neighbors': 2}
Classification Report:               precision    recall  f1-score   support

           0       1.00      1.00      1.00    199743
           1       0.64      0.72      0.68       257

    accuracy                           1.00    200000
   macro avg       0.82      0.86      0.84    200000
weighted avg       1.00      1.00      1.00    200000

f1_score:  0.6751824817518248
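When comparing the two models, keep in mind that accuracy is dominated by the majority class. The short sketch below uses only the class supports and scores reported in the outputs of this notebook to compute the accuracy of a trivial "always predict not fraud" baseline. The dictionary `reported` is a hypothetical container for those copied values.

```python
# Test-set class counts, taken from the "support" column of the reports.
n_legit, n_fraud = 199743, 257

# Accuracy of a trivial model that predicts "not fraud" for every row.
baseline_accuracy = n_legit / (n_legit + n_fraud)

# Values copied from the printed outputs (accuracy is rounded to 2 decimals there).
reported = {
    "LogisticRegression": {"accuracy": 0.95, "fraud_f1": 0.0466},
    "KNeighborsClassifier": {"accuracy": 1.00, "fraud_f1": 0.6752},
}

print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
for name, m in reported.items():
    verdict = "above" if m["accuracy"] > baseline_accuracy else "not above"
    print(f"{name}: accuracy {m['accuracy']:.2f} ({verdict} baseline), "
          f"fraud F1 {m['fraud_f1']:.4f}")
```

The baseline already reaches about 99.87% accuracy without detecting a single fraud case, so the F1 score of the fraud class is the more informative number for this comparison. Note that the two-decimal rounding of the reported accuracy hides how far the second model actually sits above the baseline.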


### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code blocks as needed.
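If you are unsure how to structure a third model, one possible shape is sketched below. It assumes a `RandomForestClassifier` with `class_weight="balanced"` tuned by `RandomizedSearchCV`. This is an illustrative choice, not a required or recommended answer, and it runs on synthetic stand-in data (`X_demo`, `y_demo`) rather than the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (hypothetical; not the bank dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Step (2): randomized search over a small, illustrative parameter space.
search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_distributions={
        "n_estimators": [50, 100],
        "max_depth": [4, 8, None],
        "min_samples_leaf": [1, 5],
    },
    n_iter=4,
    cv=3,
    scoring="f1",
    random_state=42,
)
search.fit(X_tr, y_tr)

# Step (3): evaluate the best estimator on the held-out test split.
y_pred_demo = search.best_estimator_.predict(X_te)
demo_f1 = f1_score(y_te, y_pred_demo)
print("best parameters:", search.best_params_)
print(classification_report(y_te, y_pred_demo))
print("f1_score:", demo_f1)
```

Here class weighting stands in for oversampling as the way to handle imbalance; on the real data you could equally keep the resampled training set from the first model so that all three models are compared on the same footing.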