# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you created in your previous data transformation step (not the original dataset).

## Questions
Is this a classification or regression task?  

Classification, since `isFraud` is binary.

Are you predicting for multiple classes or binary classes?  

Binary classes, because we're predicting whether the transaction is or is not fraud.

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

List your models here: Linear Regression & Random Forest

## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [13]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, LogisticRegression

from sklearn.ensemble import RandomForestRegressor

from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV

import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np


In [14]:
transactions = pd.read_csv("../data/bank_transactions.csv")
transactions.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   type            1000000 non-null  object 
 1   amount          1000000 non-null  float64
 2   nameOrig        1000000 non-null  object 
 3   oldbalanceOrg   1000000 non-null  float64
 4   newbalanceOrig  1000000 non-null  float64
 5   nameDest        1000000 non-null  object 
 6   oldbalanceDest  1000000 non-null  float64
 7   newbalanceDest  1000000 non-null  float64
 8   isFraud         1000000 non-null  int64  
 9   isFlaggedFraud  1000000 non-null  int64  
dtypes: float64(5), int64(2), object(3)
memory usage: 76.3+ MB


In [15]:
# Drop ID columns
transactions_cleaned = transactions.drop(columns=['nameOrig', 'nameDest'])

# One-hot encode 'type'
transactions_cleaned = pd.get_dummies(transactions_cleaned, columns=['type'], drop_first=True)

# Then re-define X, y
X = transactions_cleaned.drop(columns=['isFraud'])
y = transactions_cleaned['isFraud']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Drop object columns
X_train = X_train.drop(columns=['type', 'nameOrig', 'nameDest'], errors='ignore')
X_test = X_test.drop(columns=['type', 'nameOrig', 'nameDest'], errors='ignore')
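
Because fraud is typically rare, it can be worth confirming that `stratify=y` kept the positive rate the same in both splits. The sketch below is a minimal check on synthetic data; `X_demo`, `y_demo`, and the 5% positive rate are assumptions for illustration, not values from the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the transactions data: 1000 rows, 5% positives (assumed rate)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:50] = 1

# Same call pattern as the notebook: stratified split with the default 25% test size
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# With stratification, both rates should sit close to the overall 5%
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```

On the real data, the same comparison would be `y_train.mean()` against `y_test.mean()`.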


In [16]:
# train basic linear regression 
lin_model = LinearRegression()
lin_model.fit(X_train, y_train) 

y_pred_lin = lin_model.predict(X_test) 
mse_lin = mean_squared_error(y_test, y_pred_lin)
r2_lin = r2_score(y_test, y_pred_lin)

print(f"Test MSE for linear regression: {mse_lin:.2f}")
print(f"R2 for linear regression: {r2_lin:.2f}")

Test MSE for linear regression: 0.00
R2 for linear regression: 0.19
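
MSE and R2 describe how close the continuous outputs are to the 0/1 labels, which says little about classification quality. One possible way to score a regression model on this task is to threshold its outputs into labels first. The sketch below assumes a cutoff of 0.5 and uses toy arrays in place of `y_test` and `y_pred_lin`; the helper name `to_labels` is hypothetical.

```python
import numpy as np


def to_labels(scores, threshold=0.5):
    """Convert continuous regression outputs to 0/1 labels (0.5 is an assumed cutoff)."""
    return (np.asarray(scores) >= threshold).astype(int)


# Toy stand-ins for y_test and y_pred_lin
y_true_demo = np.array([0, 0, 0, 1, 1, 0])
scores_demo = np.array([0.02, 0.10, 0.55, 0.70, 0.30, -0.05])

labels_demo = to_labels(scores_demo)
accuracy_demo = float((labels_demo == y_true_demo).mean())
print(labels_demo, round(accuracy_demo, 3))
```

Linear regression outputs can fall outside the 0 to 1 range, as the last toy score does, which is one reason a classifier such as `LogisticRegression` (already imported above) may be a more natural fit.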


In [17]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
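
Fitting the scaler on the training split only, as above, keeps test-set statistics out of training. During cross-validation the same concern applies to each fold, and one way to handle it is to wrap the scaler and model in a `Pipeline`. The sketch below is a minimal example on synthetic data; `LogisticRegression` is a stand-in classifier, and `X_demo`, `y_demo`, and `pipe` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the transformed transactions
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.9, 0.1], random_state=0
)

# The scaler is re-fit on the training portion of every fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X_demo, y_demo, cv=3, scoring="f1")
print(scores)
```

Tree-based models such as random forests split on feature thresholds, so they generally do not need scaled inputs.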

### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [18]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
import numpy as np

param_dist = {
    'alpha': np.linspace(0.01, 10, 100),
    'max_iter': [100]
}


lasso_model = Lasso()

random_search = RandomizedSearchCV(
    estimator=lasso_model, 
    param_distributions=param_dist,
    n_iter = 100,
    cv = 5
)


random_search.fit(X_train, y_train)

best_lasso = random_search.best_estimator_

y_pred_lasso = best_lasso.predict(X_test)

mse_lasso = mean_squared_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)

  model = cd_fast.enet_coordinate_descent(


KeyboardInterrupt: 
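
The search above requests 100 candidates with 5-fold cross-validation, which is 500 fits on roughly 750,000 training rows. One way to keep the run time manageable is to tune on a stratified subsample with fewer candidates. The sketch below is an assumption-laden illustration on synthetic data: the 20% subsample, `n_iter=5`, the `C` range, and the use of `LogisticRegression` are choices made for the example, not settings from this notebook.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the full training set
X_big, y_big = make_classification(
    n_samples=5000, n_features=8, weights=[0.95, 0.05], random_state=0
)

# Keep a stratified 20% subsample for tuning (assumed fraction)
X_sub, _, y_sub, _ = train_test_split(
    X_big, y_big, train_size=0.2, stratify=y_big, random_state=0
)

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # sample C on a log scale
    n_iter=5,        # 5 candidates x 3 folds = 15 fits
    cv=3,
    scoring="f1",    # F1 is less dominated by the majority class than accuracy
    random_state=0,
)
search.fit(X_sub, y_sub)
best_C = search.best_params_["C"]
print(best_C)
```

Once a promising setting is found on the subsample, the chosen model can be refit once on the full training set.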

### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, and recall (also called sensitivity).
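
As a reference for these metrics, the sketch below computes them directly from confusion-matrix counts for binary labels. Recall and sensitivity are two names for the same quantity; specificity is added here as an extra. The function name `classification_metrics` and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute metrics for binary labels (1 = fraud, 0 = not fraud) from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud correctly caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-fraud correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed fraud
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,  # also called sensitivity
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Toy labels: 3 fraud cases out of 10
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics_demo = classification_metrics(y_true_demo, y_pred_demo)
print(metrics_demo)
```

In the toy case, accuracy is 0.8 even though one of the three fraud cases is missed, which is why precision and recall matter on imbalanced data.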

In [None]:
param_dist = {
    # choose alpha between 0.01 and 10
    'alpha': np.linspace(0.01, 10, 100),
    'max_iter': [100]
}
lasso_model = Lasso()
random_search = RandomizedSearchCV(
    estimator=lasso_model, 
    param_distributions=param_dist,
    n_iter = 10,
    cv = 5
)
random_search.fit(X_train, y_train)

  model = cd_fast.enet_coordinate_descent(

## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.
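
One option for handling class imbalance, sketched below, is scikit-learn's `class_weight="balanced"` setting, which weights each class by `n_samples / (n_classes * class_count)`. The 950/50 label split is an assumed example, not the real fraud rate, and whether this improves results on the actual data would need to be checked.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Assumed toy labels: 950 non-fraud, 50 fraud
y_demo = np.array([0] * 950 + [1] * 50)

# Weights the "balanced" setting would assign to each class
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
print(weights)  # 1000 / (2 * 950) for class 0, 1000 / (2 * 50) for class 1

# The same setting can be passed straight to the classifier
rf_balanced = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
)
```

`rf_balanced` could then be tuned with the same grid search as an unweighted forest, so the two can be compared on recall and F1.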

In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier()
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, 20, None],
}

grid_rf = GridSearchCV(
    rf, 
    param_grid, 
    scoring='f1', 
    cv=3, 
    n_jobs=-1
)

grid_rf.fit(X_train_scaled, y_train)

best_rf = grid_rf.best_estimator_

NameError: name 'evaluate_model' is not defined
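
The `NameError` suggests a helper called `evaluate_model` was used without being defined in this notebook. Its intended signature is not shown, so the version below is a hypothetical sketch: it assumes the helper takes a fitted model plus evaluation features and labels, and returns common classification metrics. The synthetic data and `rf_demo` are stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_eval, y_eval):
    """Hypothetical helper: return classification metrics for a fitted model."""
    y_hat = model.predict(X_eval)
    return {
        "accuracy": accuracy_score(y_eval, y_hat),
        # zero_division=0 avoids warnings when no positives are predicted
        "precision": precision_score(y_eval, y_hat, zero_division=0),
        "recall": recall_score(y_eval, y_hat, zero_division=0),
        "f1": f1_score(y_eval, y_hat, zero_division=0),
    }


# Synthetic stand-in data and model
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)
rf_demo = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_tr, y_tr)

metrics_demo = evaluate_model(rf_demo, X_te, y_te)
print(metrics_demo)
```

In this notebook the equivalent call would be `evaluate_model(best_rf, X_test_scaled, y_test)`, since `best_rf` was fit on the scaled features.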

### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.