# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you created in your previous data transformation step (not the original dataset).

In [20]:
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np

In [21]:
df = pd.read_csv('../data/bank_transactions_transformed.csv')

In [22]:
df.head()

| Unnamed: 0 | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | type_CASH_OUT | type_DEBIT | type_PAYMENT | type_TRANSFER | high_risk_type | orig_diff | dest_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 983.09 | 36730.24 | 35747.15 | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 | 983.09 | 0.0 |
| 1 | 55215.25 | 99414.0 | 44198.75 | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 0 | 0 | 55215.25 | 0.0 |
| 2 | 220986.01 | 7773074.97 | 7994060.98 | 924031.48 | 703045.48 | 0 | 0 | 0 | 0 | 0 | 0 | -220986.01 | -220986.0 |
| 3 | 2357394.75 | 0.0 | 0.0 | 4202580.45 | 6559975.19 | 0 | 0 | 0 | 0 | 1 | 1 | 0.0 | 2357394.74 |
| 4 | 67990.14 | 0.0 | 0.0 | 625317.04 | 693307.19 | 0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 67990.15 |


## Questions
Is this a classification or regression task?  

This is a **classification task** since the target `isFraud` is binary (0 or 1).

Are you predicting for multiple classes or binary classes?  

This is a **binary classification** problem: predicting whether a transaction is fraudulent (`1`) or not (`0`).

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

- **Logistic Regression**: Simple and interpretable; serves as a linear baseline.  
- **Random Forest**: Typically more accurate and robust than a single decision tree.  
- **XGBoost**: High-performance boosting algorithm that supports class weighting (`scale_pos_weight`), which is useful for imbalanced data.

Together these models cover an interpretable baseline and two stronger tree ensembles.

In [23]:
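from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)


def evaluate_model(name, model, X_test, y_test):
    """Print the confusion matrix and key metrics for a fitted classifier."""
    y_pred = model.predict(X_test)

    # Precision, recall, and F1 are computed for the positive class (isFraud == 1).
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    print(f"\n{name} Evaluation:")
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print(f"Accuracy:  {acc:.4f}")
    print(f"Precision: {prec:.4f}")
    print(f"Recall:    {rec:.4f}")
    print(f"F1 Score:  {f1:.4f}")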

## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [24]:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Note: 'Unnamed: 0' (apparently the saved row index) stays in X as a feature here.
X = df.drop(columns=['isFraud'])
y = df['isFraud']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# optional_cols = ['high_risk_type', 'orig_diff', 'dest_diff']
# X_train_resampled_reduced = X_train_resampled.drop(columns=optional_cols)
# X_test_reduced = X_test.drop(columns=optional_cols)
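
SMOTE is applied to the training split only, so the test set keeps its natural class balance. To give a rough idea of what the resampler adds, the sketch below builds one synthetic minority point by interpolating between a sample and one of its nearest minority neighbours. It is a simplified illustration in NumPy, not imbalanced-learn's implementation; the function name `smote_like_sample` and the toy points are assumptions made for this example.

```python
import numpy as np


def smote_like_sample(X_minority, i, k, rng):
    """Return one synthetic point between X_minority[i] and one of its k nearest neighbours."""
    # Euclidean distance from the chosen sample to every minority sample.
    distances = np.linalg.norm(X_minority - X_minority[i], axis=1)
    # Position 0 after sorting is the sample itself (distance 0), so skip it.
    neighbours = np.argsort(distances)[1:k + 1]
    j = rng.choice(neighbours)
    gap = rng.random()  # uniform in [0, 1)
    # The new point lies on the segment joining the sample and its neighbour.
    return X_minority[i] + gap * (X_minority[j] - X_minority[i])


rng = np.random.default_rng(42)
# Toy minority-class points (hypothetical, two features each).
X_minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]])

synthetic = smote_like_sample(X_minority, i=0, k=2, rng=rng)
print(synthetic)
```

Because every synthetic point is a blend of two existing fraud cases, the resampled training set contains no new information about fraud, only a denser version of the patterns already present.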

In [25]:
from sklearn.preprocessing import StandardScaler

cols_to_scale = ['amount', 'oldbalanceOrg', 'newbalanceOrig',
                 'oldbalanceDest', 'newbalanceDest', 'orig_diff', 'dest_diff']

X_train_resampled_scaled = X_train_resampled.copy()
X_test_scaled = X_test.copy()

scaler = StandardScaler()
X_train_resampled_scaled[cols_to_scale] = scaler.fit_transform(X_train_resampled[cols_to_scale])
X_test_scaled[cols_to_scale] = scaler.transform(X_test[cols_to_scale])
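
The scaler is fitted on the training data and then reused, unchanged, on the test data. A minimal NumPy sketch of that idea follows, with made-up amounts; `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Hypothetical single-feature data, standing in for a column such as 'amount'.
train = np.array([[100.0], [200.0], [300.0]])
test = np.array([[400.0]])

# Statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0, matching StandardScaler

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuses the training mean and std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled test value falls outside the range seen in training, which is expected: the test set is transformed with the training statistics rather than its own.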

### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [26]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Candidate values for the inverse regularization strength C (L2 penalty only).
param_dist = {
    'penalty': ['l2'],
    'C': [0.01, 0.1, 1, 10],
}

logreg = LogisticRegression(solver='lbfgs', max_iter=1000)

# n_iter=4 equals the number of candidates, so every combination is evaluated.
random_search = RandomizedSearchCV(
    logreg,
    param_distributions=param_dist,
    n_iter=4,
    scoring='f1',
    cv=3,
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train_resampled_scaled, y_train_resampled)
best_logreg = random_search.best_estimator_
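
After a search finishes, it can help to look at which candidate won and how it scored in cross-validation. The sketch below does this on a small synthetic dataset from `make_classification` (an assumption for illustration, not the bank data) and searches over `C` only. Because the search in this notebook runs on SMOTE-resampled data, its cross-validated F1 is likely to be more optimistic than the F1 measured on the untouched test set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic binary dataset, used only to keep the example self-contained.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(solver='lbfgs', max_iter=1000),
    param_distributions={'C': [0.01, 0.1, 1, 10]},
    n_iter=4,
    scoring='f1',
    cv=3,
    random_state=42,
)
search.fit(X_toy, y_toy)

# Winning hyperparameters and their mean cross-validated F1.
print(search.best_params_)
print(round(search.best_score_, 4))

# One entry per evaluated candidate.
n_candidates = len(search.cv_results_['params'])
print(n_candidates)
```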

### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, recall (sensitivity), and F1 score.  

In [27]:
evaluate_model("Logistic Regression", best_logreg, X_test_scaled, y_test)


Logistic Regression Evaluation:
Confusion Matrix:
 [[286653  12958]
 [     4    385]]
Accuracy:  0.9568
Precision: 0.0289
Recall:    0.9897
F1 Score:  0.0561
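
As a check on the figures above, each metric can be recomputed by hand from the confusion matrix. The sketch assumes scikit-learn's layout for binary labels, where rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`. The variable names are illustrative.

```python
# Counts copied from the Logistic Regression confusion matrix above.
tn, fp = 286653, 12958
fn, tp = 4, 385

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of fraud cases that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

Of the 13,343 transactions the model flags as fraud, only 385 are fraudulent, which is why precision is so low even though only 4 fraud cases are missed.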


## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [29]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier(random_state=42)  # fixed seed so reruns give the same forest
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, 20, None],
}

grid_rf = GridSearchCV(
    rf, 
    param_grid, 
    scoring='f1', 
    cv=3, 
    n_jobs=-1
)

grid_rf.fit(X_train_resampled_scaled, y_train_resampled)

best_rf = grid_rf.best_estimator_
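
Grid search cost grows with the product of the option counts, which matters on a training set of this size. The sketch below enumerates the same grid with `itertools.product` (a stand-in for what `GridSearchCV` does internally, not its actual code) and counts the model fits, assuming the default `refit=True` adds one final fit on the full training data.

```python
from itertools import product

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, 20, None],
}
cv_folds = 3

# Every combination of the listed values is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per fold, plus one refit of the winner.
total_fits = len(candidates) * cv_folds + 1

print(len(candidates), "candidates,", total_fits, "fits")
```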

In [31]:
evaluate_model("Random Forest", best_rf, X_test_scaled, y_test)


Random Forest Evaluation:
Confusion Matrix:
 [[299315    296]
 [    21    368]]
Accuracy:  0.9989
Precision: 0.5542
Recall:    0.9460
F1 Score:  0.6990
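
Accuracy alone is a weak yardstick on data this imbalanced. As a rough illustration, the sketch below scores a hypothetical baseline that labels every test transaction as non-fraud, using the class counts implied by the confusion matrix above.

```python
# Test-set class counts, taken from the row sums of the confusion matrix above.
n_legit = 299315 + 296   # true non-fraud transactions
n_fraud = 21 + 368       # true fraud transactions

# A baseline that always predicts "not fraud" gets every legitimate
# transaction right and every fraud case wrong.
baseline_accuracy = n_legit / (n_legit + n_fraud)
baseline_recall = 0 / n_fraud

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline recall:   {baseline_recall:.4f}")
```

The baseline reaches an accuracy close to the Random Forest's 0.9989 while catching no fraud at all, so precision, recall, and F1 are the more informative metrics for comparing these models.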


## (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.

In [30]:
from xgboost import XGBClassifier

xgb = XGBClassifier(
    objective='binary:logistic',
    scale_pos_weight=1,
    eval_metric='logloss',
    random_state=42
)

xgb.fit(X_train_resampled_scaled, y_train_resampled)

In [32]:
evaluate_model("XGBoost", xgb, X_test_scaled, y_test)


XGBoost Evaluation:
Confusion Matrix:
 [[298875    736]
 [    14    375]]
Accuracy:  0.9975
Precision: 0.3375
Recall:    0.9640
F1 Score:  0.5000


## Model Comparison and Decision

### Model Performance on Resampled Data (SMOTE Applied)

| Metric        | Logistic Regression | Random Forest       | XGBoost              |
|---------------|---------------------|---------------------|----------------------|
| Accuracy      | 0.9568              | **0.9989**          | 0.9975               |
| Precision     | 0.0289              | **0.5542**          | 0.3375               |
| Recall        | **0.9897**          | 0.9460              | 0.9640               |
| F1 Score      | 0.0561              | **0.6990**          | 0.5000               |

- **Random Forest** offers the best balance between precision and recall, achieving the highest F1 score (0.6990) and the fewest false positives (296).
- **XGBoost** has higher recall than Random Forest but produces more false positives (736 vs. 296), lowering its precision.
- **Logistic Regression** has the highest recall (0.9897) but very poor precision and F1 score: it flags 12,958 legitimate transactions as fraud, making it less useful in practice despite catching nearly all fraud cases.
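
The false positive rates behind this comparison can be computed directly from the three confusion matrices. The sketch below assumes the `[[TN, FP], [FN, TP]]` layout that scikit-learn uses for binary labels; the dictionary is just a convenient container for the counts reported above.

```python
# (TN, FP, FN, TP) for each model, copied from the evaluation outputs.
confusion = {
    "Logistic Regression": (286653, 12958, 4, 385),
    "Random Forest": (299315, 296, 21, 368),
    "XGBoost": (298875, 736, 14, 375),
}

# False positive rate = FP / (FP + TN): share of legitimate transactions flagged.
fpr = {name: fp / (fp + tn) for name, (tn, fp, fn, tp) in confusion.items()}

for name, rate in sorted(fpr.items(), key=lambda kv: kv[1]):
    print(f"{name:<20} FPR = {rate:.4%}")

lowest = min(fpr, key=fpr.get)
print("Lowest FPR:", lowest)
```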

### Next Step: Try Without SMOTE

SMOTE trains the models on synthetic fraud samples, which may not reflect real fraud patterns. The next step is to **retrain XGBoost on the original (non-SMOTE) training split**, handling the imbalance with `scale_pos_weight` instead. This shows how the model performs when trained on naturally imbalanced data, which better reflects real-world fraud detection.

In [33]:
from xgboost import XGBClassifier

# Ratio of negative to positive training labels, used to upweight fraud cases.
weight = (y_train == 0).sum() / (y_train == 1).sum()

xgb = XGBClassifier(
    objective='binary:logistic',
    scale_pos_weight=weight, 
    eval_metric='logloss',
    random_state=42
)

xgb.fit(X_train, y_train)
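
The weight is the ratio of negative to positive examples, which the XGBoost documentation suggests as a typical starting value for `scale_pos_weight`. The training-split counts are not printed in this notebook, so the sketch below uses the test-set counts (299,611 legitimate and 389 fraud) as a stand-in; with a stratified split the training ratio should be very close.

```python
# Hypothetical label vector with the same class counts as the test set.
labels = [0] * 299611 + [1] * 389

n_neg = labels.count(0)
n_pos = labels.count(1)

# Each fraud example is weighted roughly this many times a legitimate one.
weight = n_neg / n_pos
print(f"scale_pos_weight is about {weight:.1f}")
```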

In [34]:
evaluate_model("XGBoost (No SMOTE)", xgb, X_test, y_test)


XGBoost (No SMOTE) Evaluation:
Confusion Matrix:
 [[299353    258]
 [    33    356]]
Accuracy:  0.9990
Precision: 0.5798
Recall:    0.9152
F1 Score:  0.7099


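**XGBoost (no SMOTE, with `scale_pos_weight`)** achieved the highest F1 score and precision of the four runs:

- **F1 Score**: 0.7099  
- **Precision**: 0.5798  
- **Recall**: 0.9152  

Its F1 score edges out Random Forest (0.6990) and is far above Logistic Regression (0.0561), although its recall is the lowest of the four runs.  
Using `scale_pos_weight` reweights the real fraud cases instead of adding synthetic ones. Compared with XGBoost on SMOTE data, it raised precision (0.5798 vs. 0.3375) at some cost in recall (0.9152 vs. 0.9640).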
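
The gap between the two strongest runs is small, so it helps to look at raw counts rather than ratios. The sketch below compares the Random Forest and the weighted XGBoost run using the counts from their confusion matrices; the dictionary names are illustrative.

```python
# False positives, false negatives, and true positives from the outputs above.
rf_counts = {"fp": 296, "fn": 21, "tp": 368}
xgb_weighted_counts = {"fp": 258, "fn": 33, "tp": 356}


def f1_from_counts(c):
    """F1 expressed directly in counts: 2TP / (2TP + FP + FN)."""
    return 2 * c["tp"] / (2 * c["tp"] + c["fp"] + c["fn"])


fewer_false_alarms = rf_counts["fp"] - xgb_weighted_counts["fp"]
extra_missed_frauds = xgb_weighted_counts["fn"] - rf_counts["fn"]
f1_gap = f1_from_counts(xgb_weighted_counts) - f1_from_counts(rf_counts)

print("Fewer false alarms: ", fewer_false_alarms)
print("Extra missed frauds:", extra_missed_frauds)
print(f"F1 gap: {f1_gap:.4f}")
```

Whether trading 12 additional missed frauds for 38 fewer false alarms is worthwhile depends on the relative cost of each error, which this notebook does not specify.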

## Model Comparison Summary

- **Logistic Regression** has very high recall (0.9897) but very low precision (0.0289). It catches almost all fraud but raises many false positives.
- **Random Forest** balances recall (0.9460) and precision (0.5542) well. F1 score: 0.6990.
- **XGBoost with SMOTE** has high recall (0.9640) but lower precision (0.3375). F1 score: 0.5000.
- **XGBoost without SMOTE** has the best F1 score (0.7099) and precision (0.5798), with somewhat lower recall (0.9152).

**Conclusion**:  
XGBoost without SMOTE is the top choice by F1 score. It narrowly beats Random Forest while training only on real transactions, without synthetic data.