# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you've created in your previous data transformation step (not the original dataset).

## Questions
Is this a classification or regression task?  

This is a classification task.

Are you predicting for multiple classes or binary classes? 

We are predicting binary classes (fraud/not fraud).

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

I will be using RandomForestClassifier and Logistic Regression. 

## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  
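
One way to check the "numeric or properly encoded" requirement before splitting is to list any columns whose dtype is not numeric. The sketch below uses a tiny made-up frame; the `amount` and `type` column names are assumptions for illustration, not columns confirmed by this notebook. The same calls could be applied to the loaded dataset.

```python
import pandas as pd

# hypothetical toy frame standing in for the transformed dataset
toy = pd.DataFrame({
    "amount": [10.5, 200.0, 3.2],
    "type": ["TRANSFER", "CASH_OUT", "PAYMENT"],  # still a string column
    "isFraud": [0, 1, 0],
})

# columns that are not numeric and therefore still need encoding
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print("Needs encoding:", non_numeric)

# one-hot encode the leftovers so that every feature is numeric
encoded = pd.get_dummies(toy, columns=non_numeric, dtype=int)
remaining = encoded.select_dtypes(exclude="number").columns.tolist()
print("Still non-numeric after encoding:", remaining)
```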

In [30]:
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from imblearn.over_sampling import SMOTE


# first load the data into this notebook
data = pd.read_csv("../data/transactions_transformed.csv")

# take a 10% sample of the dataset so that the models can run without crashing Jupyter Notebook
data_sample = data.sample(frac=0.1, random_state=42)

# select the predictors and target variable
X = data_sample.drop(columns=["isFraud"])
y = data_sample["isFraud"]

# train test split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# stratify=y keeps the minority class (isFraud == 1) in the same proportion in both the training and testing splits
# apply SMOTE to the training split only, so the test set keeps its real class distribution
sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

y_resampled.value_counts()  # checking to see the class distribution after SMOTE

isFraud
0    79891
1    79891
Name: count, dtype: int64
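
The counts above also show how much of the balanced training set is synthetic. The arithmetic below is a sketch that assumes the split sizes follow from the outputs in this notebook: a 20,000-row test set at `test_size=0.2` implies an 80,000-row training split, and SMOTE with its default settings leaves the majority class untouched.

```python
# figures taken from the notebook outputs
test_rows = 20_000             # total support in the classification reports
test_size = 0.2                # from train_test_split
majority_after_smote = 79_891  # class 0 count after SMOTE (unchanged by SMOTE)

# assumption: the training split is the remaining 80% of the sampled data
train_rows = round(test_rows / test_size) - test_rows

# every training row that is not class 0 must be a real fraud case
fraud_in_train = train_rows - majority_after_smote

# SMOTE tops the minority class up to the majority count
synthetic_rows = majority_after_smote - fraud_in_train

print("training rows:", train_rows)
print("real fraud rows in training split:", fraud_in_train)
print("synthetic fraud rows added by SMOTE:", synthetic_rows)
```

Under these assumptions, nearly all of the fraud examples the models train on are interpolated rather than observed, which is worth keeping in mind when reading the scores below.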

In [31]:
# imports for creating machine learning models
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

In [32]:
# instantiate a RandomForestClassifier with a fixed random seed for reproducibility
rf = RandomForestClassifier(random_state=42)

# fitting the data using the resampled data points
rf.fit(X_resampled, y_resampled) 

# generate a classification report to see how well our random forest performed
yhat = rf.predict(X_test) 

confusion_rf = confusion_matrix(y_test, yhat)
class_report_rf = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion_rf)
print("\nClassification Report\n", class_report_rf)

Confusion Matrix 
 [[19973     0]
 [    0    27]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     19973
           1       1.00      1.00      1.00        27

    accuracy                           1.00     20000
   macro avg       1.00      1.00      1.00     20000
weighted avg       1.00      1.00      1.00     20000



### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [33]:
# implementing RandomizedSearchCV to find the best hyperparameters for the RF model

param_dist = {
    "n_estimators": range(10,100),
    "criterion": ["entropy", "gini", "log_loss"],
    "max_depth": [10, 20, 30, 40],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False]
}

# instantiate a RandomForestClassifier
rf = RandomForestClassifier()

# set up RandomizedSearchCV with 5-fold cross-validation
random_search_rf = RandomizedSearchCV(rf, param_distributions=param_dist, cv=5, random_state=42)

# fit the search on the resampled training data
random_search_rf.fit(X_resampled, y_resampled)
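
For a sense of scale, the search above samples only a small fraction of the possible settings. The sketch below counts the combinations in `param_dist` and the number of cross-validation fits, assuming scikit-learn's default of `n_iter=10` for `RandomizedSearchCV` (the call above does not set it).

```python
from math import prod

# same search space as the cell above
param_dist = {
    "n_estimators": range(10, 100),
    "criterion": ["entropy", "gini", "log_loss"],
    "max_depth": [10, 20, 30, 40],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

# size of the full grid: the product of the number of options per parameter
total_combinations = prod(len(values) for values in param_dist.values())

n_iter = 10  # assumed: RandomizedSearchCV default when n_iter is not passed
cv = 5       # number of folds used above
cv_fits = n_iter * cv

print("possible combinations:", total_combinations)
print("cross-validation fits:", cv_fits)
print(f"share of the grid sampled: {n_iter / total_combinations:.3%}")
```

Raising `n_iter` widens the search at a proportional cost in fitting time.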


### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, recall, and sensitivity.

In [37]:
# use best params to retrain 
best_params_rf =  {
    "n_estimators": 65,
    "criterion": "log_loss",
    "max_depth": 40,
    "max_features": "log2",
    "bootstrap": True
}

rf_final = RandomForestClassifier(**best_params_rf, random_state=42)

rf_final.fit(X_resampled, y_resampled)

# predict with the retrained model; the earlier yhat came from the untuned rf
yhat_final = rf_final.predict(X_test)

print("Final Model Classification Report:\n", classification_report(y_test, yhat_final))
print("Final Model Confusion Matrix:\n", confusion_matrix(y_test, yhat_final))

Final Model Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     19973
           1       1.00      1.00      1.00        27

    accuracy                           1.00     20000
   macro avg       1.00      1.00      1.00     20000
weighted avg       1.00      1.00      1.00     20000

Final Model Confusion Matrix:
 [[19973     0]
 [    0    27]]
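
Rather than copying the winning settings by hand, the fitted search object exposes them directly. The following is a minimal sketch on synthetic data: `make_classification` stands in for the resampled training set, and the small grid, `n_iter=4`, and `cv=3` are illustrative choices rather than this notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# synthetic, mildly imbalanced stand-in data (assumption, not the fraud dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": [10, 20, 30], "max_depth": [3, 5, None]},
    n_iter=4,
    cv=3,
    scoring="f1",
    random_state=42,
)
search.fit(X_tr, y_tr)

# best_params_ holds the winning settings; with the default refit=True,
# best_estimator_ has already been refit on all of X_tr with those settings
tuned = search.best_estimator_
y_hat_tuned = tuned.predict(X_te)  # predictions from the tuned model itself

print("Best parameters:", search.best_params_)
print(classification_report(y_te, y_hat_tuned))
```

Reading the settings from `best_params_` avoids transcription slips and keeps every searched parameter in the final model.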


## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [47]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

# Define parameter grid
param_dist_logreg = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],  # regularization strength
    "penalty": ['l1', 'l2'],              # l1 = Lasso, l2 = Ridge
    "solver": ['saga'],
}

# Initialize Logistic Regression
logreg = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)

# Randomized Search
random_search_logreg = RandomizedSearchCV(
    estimator=logreg,
    param_distributions=param_dist_logreg,
    n_iter=10,
    scoring='f1',
    cv=5,
    random_state=42,
    n_jobs=-1
)

# Fit on resampled data
random_search_logreg.fit(X_resampled, y_resampled)

# Show best parameters
print("Best Logistic Regression Parameters:", random_search_logreg.best_params_)


Best Logistic Regression Parameters: {'solver': 'saga', 'penalty': 'l1', 'C': 100}




In [48]:
# Get best model
logreg_final = random_search_logreg.best_estimator_

# Predict
y_pred_logreg = logreg_final.predict(X_test)

# Evaluation
print("Logistic Regression Classification Report:\n", classification_report(y_test, y_pred_logreg))
print("Logistic Regression Confusion Matrix:\n", confusion_matrix(y_test, y_pred_logreg))


Logistic Regression Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.89      0.94     19973
           1       0.01      1.00      0.02        27

    accuracy                           0.89     20000
   macro avg       0.51      0.94      0.48     20000
weighted avg       1.00      0.89      0.94     20000

Logistic Regression Confusion Matrix:
 [[17752  2221]
 [    0    27]]
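
The report's rounded figures can be recomputed from the confusion matrix above, which makes the trade-off easier to see. A minimal sketch, treating fraud (class 1) as the positive class; the variable names are illustrative.

```python
# logistic regression confusion matrix from the output above
# rows are true classes, columns are predicted classes
tn, fp = 17752, 2221
fn, tp = 0, 27

precision = tp / (tp + fp)    # share of fraud alerts that are real fraud
recall = tp / (tp + fn)       # share of real fraud that is caught (sensitivity)
specificity = tn / (tn + fp)  # share of legitimate transactions left alone
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision:   {precision:.4f}")
print(f"recall:      {recall:.4f}")
print(f"specificity: {specificity:.4f}")
print(f"f1:          {f1:.4f}")
print(f"accuracy:    {accuracy:.4f}")
```

Logistic regression catches every fraud case in the test set but flags 2,221 legitimate transactions to do so, which means roughly 1 in 83 alerts is real fraud.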


### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.

In [49]:
# AdaBoost model
# import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# create a decision stump (a tree with max_depth=1) to use as the weak learner
stump = DecisionTreeClassifier(max_depth=1)

# initialize the AdaBoostClassifier with the default 50 weak learners (n_estimators is not set here) and a fixed random state
ada = AdaBoostClassifier(estimator=stump, random_state=42)

ada.fit(X_resampled, y_resampled)

# Parameter grid
param_dist_ada = {
    'n_estimators': [50, 100, 150, 200],
    'learning_rate': [0.05, 0.1, 0.5, 1]
}

# Randomized Search CV
random_search_ada = RandomizedSearchCV(ada, param_distributions=param_dist_ada, cv=5, random_state=42)
random_search_ada.fit(X_resampled, y_resampled)

# Best Model Evaluation
best_ada = random_search_ada.best_estimator_
y_pred_ada = best_ada.predict(X_test)

print("Best Parameters (AdaBoost):", random_search_ada.best_params_)
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_ada))
print("\nClassification Report:\n", classification_report(y_test, y_pred_ada))


Best Parameters (AdaBoost): {'n_estimators': 150, 'learning_rate': 1}

Confusion Matrix:
 [[19973     0]
 [    0    27]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     19973
           1       1.00      1.00      1.00        27

    accuracy                           1.00     20000
   macro avg       1.00      1.00      1.00     20000
weighted avg       1.00      1.00      1.00     20000



In [50]:
# Get best ada model
ada_final = random_search_ada.best_estimator_

# Predict on test set
y_pred_ada = ada_final.predict(X_test)

# Evaluation
print("AdaBoost Classification Report:\n", classification_report(y_test, y_pred_ada))
print("AdaBoost Confusion Matrix:\n", confusion_matrix(y_test, y_pred_ada))

AdaBoost Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     19973
           1       1.00      1.00      1.00        27

    accuracy                           1.00     20000
   macro avg       1.00      1.00      1.00     20000
weighted avg       1.00      1.00      1.00     20000

AdaBoost Confusion Matrix:
 [[19973     0]
 [    0    27]]
