<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Supervised Learning Model Comparison

---

### Let us begin...

Recall the `data science process`.
   1. Define the problem.
   2. Gather the data.
   3. Explore the data.
   4. Model the data.
   5. Evaluate the model.
   6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

## Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts is invested, with the hope that the balance will have grown substantially by the time you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

#### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

#### When predicting `e401k`, you may use the entire dataframe if you wish.

In [3]:
# Import lib
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import BaggingRegressor, BaggingClassifier, RandomForestRegressor, RandomForestClassifier, AdaBoostRegressor, AdaBoostClassifier
from sklearn.metrics import classification_report
from sklearn import svm
from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt

## Step 2: Gather the data.

##### 1. Read in the data.

In [5]:
df = pd.read_csv('./401ksubs.csv')

In [6]:
df.columns

Index(['e401k', 'inc', 'marr', 'male', 'age', 'fsize', 'nettfa', 'p401k',
       'pira', 'incsq', 'agesq'],
      dtype='object')

In [7]:
df.head()

   e401k     inc  marr  male  age  fsize  nettfa  p401k  pira     incsq  agesq
0      0   13.17     0     0   40      1   4.575      0     1  173.4489   1600
1      1   61.23     0     1   35      1   154.0      1     0  3749.113   1225
2      0  12.858     1     0   44      2     0.0      0     0  165.3282   1936
3      0   98.88     1     1   44      2    21.8      0     0  9777.254   1936
4      0  22.614     0     0   53      1   18.45      0     0   511.393   2809


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

In [9]:
# 1. Highest level of education: education is likely to be strongly related to income.
# 2. Credit score: reflects the participant's financial behavior.
# 3. Debt: participants with high debt may be less likely to enroll in a 401(k).

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

In [11]:
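# It would be unethical because race is a sensitive, protected attribute. Using it to decide who is targeted
# with advertising for financial products risks discriminatory treatment, and it also raises privacy concerns.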

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

In [13]:
# Ans
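# 'inc': this is what we want to predict, so it is the target (y), not a feature.
# 'incsq': it is computed directly from 'inc', so using it would leak the target into the features.
# As instructed above, 'e401k', 'p401k', and 'pira' are also treated as unavailable.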

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs (Subject Matter Experts) might have done this!

In [15]:
# Ans
# Two variables in the dataset were created through feature engineering:
# 1. 'incsq' > income squared
# 2. 'agesq' > age squared
# The SMEs probably expected income and age to have non-linear relationships with the outcomes of interest.
# Squared terms let a linear model capture such curved patterns, which can improve its accuracy.
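To illustrate why a squared term can help, here is a minimal sketch on synthetic data (not the 401(k) data). The variable names, the helper `r_squared`, and the curved relationship are all assumptions made purely for illustration.

```python
import numpy as np

# Synthetic example: y follows a curve in x that peaks mid-range (assumed for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(20, 65, size=500)
y = -0.05 * (x - 45) ** 2 + 60 + rng.normal(0, 2, size=500)

def r_squared(features, target):
    """Fit ordinary least squares with an intercept and return R^2."""
    design = np.column_stack([np.ones(len(target))] + list(features))
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coefs
    return 1 - residuals.var() / target.var()

r2_linear = r_squared([x], y)              # x only
r2_quadratic = r_squared([x, x ** 2], y)   # x and x squared

print(f"R^2 with x only:     {r2_linear:.3f}")
print(f"R^2 with x and x**2: {r2_quadratic:.3f}")
```

With a curved relationship like this one, the model that includes the squared term should fit far better than the model with `x` alone.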

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

In [17]:
df.head()

   e401k     inc  marr  male  age  fsize  nettfa  p401k  pira     incsq  agesq
0      0   13.17     0     0   40      1   4.575      0     1  173.4489   1600
1      1   61.23     0     1   35      1   154.0      1     0  3749.113   1225
2      0  12.858     1     0   44      2     0.0      0     0  165.3282   1936
3      0   98.88     1     1   44      2    21.8      0     0  9777.254   1936
4      0  22.614     0     0   53      1   18.45      0     0   511.393   2809


In [18]:
# I actually see two variables whose descriptions appear to be wrong.

# 1. 'inc' is described as inc^2; a correct description would be something like 'income per person'.
#    The description also does not state the unit or scale of the values (for example, whether income is in $1000s).

# 2. 'age' is described as age^2; a correct description would be 'age of the participant'.

In [19]:
# I want to explore the correlation
correlations = df.corr()
print(correlations['inc'])

e401k     0.268178
inc       1.000000
marr      0.362008
male     -0.069871
age       0.105638
fsize     0.110170
nettfa    0.376586
p401k     0.270833
pira      0.364354
incsq     0.940161
agesq     0.087305
Name: inc, dtype: float64


## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

In [21]:
# 1. Linear Regression – Yes: Helps us understand the relationship between each feature and income.
# 2. Ridge Regression – Yes: Regularization helps when features are multicollinear.
# 3. Lasso Regression – Yes: Regularization that also performs feature selection.
# 4. XGBoost – Yes: Feature importances help us understand the influence of each feature.
# 5. Random Forest Regression – No: Hard to interpret.
# 6. KNN Regression – No: Hard to understand the influence of each feature.

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [23]:
df.columns

Index(['e401k', 'inc', 'marr', 'male', 'age', 'fsize', 'nettfa', 'p401k',
       'pira', 'incsq', 'agesq'],
      dtype='object')

In [24]:
from sklearn.pipeline import make_pipeline
# make_pipeline is shorthand for Pipeline that names each step automatically, which is handy for quick, simple pipelines.

# Set random seed for reproducibility
RANDOM_SEED = 42

# Set up X,y
X = df[['marr', 'agesq', 'fsize', 'nettfa']]  # Features
y = df['inc']  # Target variable

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)

# Define models in a dictionary to easily loop over them
models = {
    "Linear Regression": LinearRegression(),
    "K-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsRegressor()),
    "Decision Tree": DecisionTreeRegressor(random_state=RANDOM_SEED),
    "Bagged Decision Trees": make_pipeline(StandardScaler(), BaggingRegressor(estimator=DecisionTreeRegressor(), random_state=RANDOM_SEED)),
    "Random Forest": RandomForestRegressor(random_state=RANDOM_SEED),
    "AdaBoost": AdaBoostRegressor(random_state=RANDOM_SEED),
    "Support Vector Regressor": make_pipeline(StandardScaler(), svm.SVR())  
    # Other models like Decision Trees, Random Forests, and AdaBoost don’t require scaling for the model itself.
}

# Store results
results = []

# Fit each model and calculate MSE, Train R^2, and Test R^2
for name, model in models.items():
    model.fit(X_train, y_train)  # Train the model
    y_pred = model.predict(X_test)  # Predict on test data
    mse = mean_squared_error(y_test, y_pred)  # Calculate MSE
    
    # Train and test scores (R^2)
    train_score = model.score(X_train, y_train)  # R^2 on training set
    test_score = model.score(X_test, y_test)  # R^2 on test set
    
    # Store the results
    results.append({
        'Model': name,
        'Mean Squared Error': mse,
        'Train R^2': train_score,
        'Test R^2': test_score
    })

# Convert the results to a DataFrame so the output is easier to read
results_df = pd.DataFrame(results)

# Display the results
print(results_df)

                      Model  Mean Squared Error  Train R^2  Test R^2
0         Linear Regression          458.058245   0.266254  0.239491
1       K-Nearest Neighbors          414.573642   0.531082  0.311688
2             Decision Tree          706.395105   0.987854 -0.172820
3     Bagged Decision Trees          457.567714   0.862075  0.240305
4             Random Forest          420.472157   0.892758  0.301895
5                  AdaBoost          580.425115   0.080711  0.036326
6  Support Vector Regressor          416.436236   0.323204  0.308595


##### 9. What is bootstrapping?

In [26]:
# Bootstrapping is a resampling method: we repeatedly draw samples of the same size from the observed data,
# with replacement, in order to estimate the sampling distribution of a statistic.
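As a rough illustration of the idea, the sketch below bootstraps the mean of a made-up sample. The data are hypothetical (a synthetic "income-like" sample, not the lab data), and the number of resamples is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample of 200 incomes; synthetic, not the lab data
sample = rng.lognormal(mean=3.5, sigma=0.6, size=200)

n_boot = 2000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Resample WITH replacement, same size as the original sample
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[i] = resample.mean()

# The spread of the bootstrap means approximates the standard error of the mean
boot_se = boot_means.std(ddof=1)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"Sample mean: {sample.mean():.2f}")
print(f"Bootstrap SE of the mean: {boot_se:.2f}")
print(f"95% percentile interval: ({ci_low:.2f}, {ci_high:.2f})")
```

The bootstrap standard error should land close to the textbook formula `sample.std(ddof=1) / np.sqrt(len(sample))`, which is a handy sanity check.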

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

In [28]:
# A decision tree is a single predictive model which splits data into branches based on feature values to make predictions.

# A set of bagged decision trees trains multiple trees on different bootstrap samples and averages their predictions, 
# reducing overfitting and improving accuracy by lowering model variance.
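The difference can be sketched by doing the bagging by hand. This is only an illustration on synthetic data from `make_regression` (an assumption, not the 401(k) data), and the choice of 50 trees is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption; not the 401(k) data)
X, y = make_regression(n_samples=600, n_features=5, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# One fully grown tree fit on the whole training set
single_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
single_mse = mean_squared_error(y_test, single_tree.predict(X_test))

# Bagging by hand: fit each tree on a bootstrap sample, then average the predictions
rng = np.random.default_rng(0)
n_trees = 50
all_preds = np.zeros((n_trees, len(X_test)))
for b in range(n_trees):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # rows drawn with replacement
    tree = DecisionTreeRegressor(random_state=b).fit(X_train[idx], y_train[idx])
    all_preds[b] = tree.predict(X_test)

bagged_mse = mean_squared_error(y_test, all_preds.mean(axis=0))

print(f"Single tree test MSE:  {single_mse:.1f}")
print(f"Bagged trees test MSE: {bagged_mse:.1f}")
```

On noisy data like this, the averaged prediction typically has a noticeably lower test error than the single tree, which is the variance reduction described above.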

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

In [30]:
# A set of bagged decision trees uses multiple trees trained on different bootstrap samples, averaging their predictions to reduce variance.

# In a random forest, the process also begins with bootstrap sampling,
# but with an added step: at each node, only a random subset of features is considered for splitting.
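In scikit-learn, this difference shows up in the `max_features` setting of the individual trees. The sketch below uses synthetic data and arbitrary settings (25 trees, 3 features per split), so treat it as an illustration rather than the lab's configuration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with 8 features (an assumption; not the 401(k) data)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Bagged trees: every split may consider all 8 features
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=25, random_state=0).fit(X, y)

# Random forest: each split considers only a random subset of 3 features
forest = RandomForestRegressor(n_estimators=25, max_features=3, random_state=0).fit(X, y)

bagged_split_features = bagged.estimators_[0].max_features_
forest_split_features = forest.estimators_[0].max_features_

print("Features considered per split (bagged tree):", bagged_split_features)
print("Features considered per split (forest tree):", forest_split_features)
```

One caveat worth checking against your installed version: in recent scikit-learn releases `RandomForestRegressor` defaults to `max_features=1.0` (all features), so with default settings it behaves much like bagged trees unless `max_features` is lowered.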

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

In [32]:
# Random forests usually perform better than bagging because random feature selection at each split
# reduces the correlation between the trees, so averaging them reduces variance more effectively.

# With lower variance, a random forest often generalizes better to new, unseen data.
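One way to make this precise is a standard result for the variance of an average, stated here under the simplifying assumption that all $B$ trees $T_b$ have the same variance $\sigma^2$ and the same pairwise correlation $\rho$:

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding more trees shrinks only the second term. The first term stays at $\rho\,\sigma^2$ no matter how many trees are averaged, so the only way to reduce it is to lower $\rho$, which is what random feature selection aims to do.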

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [34]:
def rmse(name, model, X_train, X_test, y_train, y_test):
    """Print training and testing RMSE for an already-fitted model."""
    # MSE on the training data
    mse_train = mean_squared_error(y_true=y_train, y_pred=model.predict(X_train))

    # MSE on the testing data
    mse_test = mean_squared_error(y_true=y_test, y_pred=model.predict(X_test))

    # RMSE is the square root of MSE
    rmse_train = np.sqrt(mse_train)
    rmse_test = np.sqrt(mse_test)
    print(f'Model name : , {name}')
    print(f'Training RMSE: , {rmse_train}')
    print(f'Testing RMSE: , {rmse_test}')
    print('==================================')

# Fit each model and evaluate RMSE
for name, model in models.items():
    model.fit(X_train, y_train)  # Train the model
    rmse(name, model, X_train, X_test, y_train, y_test)  # Print RMSE for train and test

Model name : , Linear Regression
Training RMSE: , 20.53593520213564
Testing RMSE: , 21.402295330329196
Model name : , K-Nearest Neighbors
Training RMSE: , 16.416859796590547
Testing RMSE: , 20.361081542817715
Model name : , Decision Tree
Training RMSE: , 2.642139911336674
Testing RMSE: , 26.57809446479684
Model name : , Bagged Decision Trees
Training RMSE: , 8.903542637985298
Testing RMSE: , 21.39083248154066
Model name : , Random Forest
Training RMSE: , 7.851000059550053
Testing RMSE: , 20.50541774231338
Model name : , AdaBoost
Training RMSE: , 22.986234492672242
Testing RMSE: , 24.092013508724484
Model name : , Support Vector Regressor
Training RMSE: , 19.72288789084942
Testing RMSE: , 20.40676936181438
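For reference, RMSE is just the square root of the mean squared error, so it is expressed in the same units as the target. The tiny example below uses made-up numbers (hypothetical true values and predictions) to show the calculation by hand next to the scikit-learn route.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions, for illustration only
y_true = np.array([30.0, 45.0, 12.0, 80.0])
y_pred = np.array([28.0, 50.0, 20.0, 70.0])

# By hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same value via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"RMSE by hand:      {rmse_manual:.3f}")
print(f"RMSE via sklearn:  {rmse_sklearn:.3f}")
```

If `inc` is recorded in thousands of dollars (an assumption about the units), a testing RMSE of about 21 would mean predictions are typically off by roughly $21,000.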


##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

In [36]:
# Overfitting is most evident in:
# - Decision Tree (training RMSE 2.64 vs. testing RMSE 26.58)
# - Bagged Decision Trees (training RMSE 8.90 vs. testing RMSE 21.39)
# - Random Forest (training RMSE 7.85 vs. testing RMSE 20.51)
# K-Nearest Neighbors shows a smaller gap (16.42 vs. 20.36).

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

In [38]:
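# I would choose Linear Regression over the others because:
# - The training and testing RMSE are close (20.54 vs. 21.40), so there is little sign of overfitting.
# - Linear Regression is easy to interpret, which matters because the question asks which features best predict income.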

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

In [40]:
# 1. Hyperparameter tuning: I did not use GridSearchCV to tune any hyperparameters.
# 2. Cross-validation: evaluate with cross_val_score instead of relying on a single train/test split.
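A minimal sketch of what that tuning could look like is shown below. It uses synthetic data and a KNN pipeline with an arbitrary grid of `n_neighbors` values; none of these choices come from the lab, they are only there to show the mechanics of `GridSearchCV` with 5-fold cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption; not the 401(k) data)
X, y = make_regression(n_samples=400, n_features=4, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = make_pipeline(StandardScaler(), KNeighborsRegressor())

# make_pipeline names each step after its lowercased class name
param_grid = {"kneighborsregressor__n_neighbors": [3, 5, 11, 21]}

# 5-fold cross-validation on the training data only; the scorer is negated RMSE
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print(f"Best CV RMSE: {-search.best_score_:.2f}")
# search.score uses the same scorer, so negate it to get the test RMSE
print(f"Test RMSE: {-search.score(X_test, y_test):.2f}")
```

Because the search only ever sees the training split, the held-out test set still gives an honest estimate of performance for the chosen hyperparameters.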

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

In [42]:
# Calculate the correlation between e401k and p401k
correlation = df['e401k'].corr(df['p401k'])
print(f'Correlation between e401k and p401k:, {correlation}')

Correlation between e401k and p401k:, 0.7691696232534353


In [43]:
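# Using p401k as a feature can cause data leakage, because it is directly tied to the target (e401k):
# a person can only participate in a 401(k) if they are eligible for one, so p401k = 1 implies e401k = 1.

# The model would look very accurate, but it would not generalize: for a potential customer,
# we would not know participation before we know eligibility.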

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

In [45]:
# All of the models below are appropriate for a classification problem.
# Logistic Regression: Great for simple binary classification.
# K-Nearest Neighbors: Works well for smaller datasets but may struggle with large or high-dimensional data.
# Decision Tree: Can easily overfit, but good for interpretable models.
# Bagged Decision Trees: Averages many trees to reduce the variance of a single tree.
# Random Forest: A robust, powerful classifier that reduces overfitting.
# AdaBoost: Effective for complex problems and can boost the performance of weak models.

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [47]:
from sklearn.metrics import accuracy_score

# Set random seed for reproducibility
RANDOM_SEED = 42

# Set up X,y 
X = df.drop(columns=['e401k', 'p401k'])
y = df['e401k']  # Target variable

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_SEED)

# Define models in a dictionary to easily loop over them
models = {
    "Logistic Regression": LogisticRegression(random_state=RANDOM_SEED, max_iter=10_000, solver='saga'),
    "K-Nearest Neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision Tree": DecisionTreeClassifier(random_state=RANDOM_SEED),
    "Bagged Decision Trees": make_pipeline(StandardScaler(), BaggingClassifier(estimator=DecisionTreeClassifier(), random_state=RANDOM_SEED)),
    "Random Forest": RandomForestClassifier(random_state=RANDOM_SEED),
    "AdaBoost": AdaBoostClassifier(algorithm='SAMME', random_state=RANDOM_SEED)
}

# Store results
results = []

# Fit each model, predict, and calculate accuracy
for name, model in models.items():
    model.fit(X_train, y_train)  # Train the model
    y_pred = model.predict(X_test)  # Predict on test data
    accuracy = accuracy_score(y_test, y_pred)  # Calculate accuracy
    
    # Store the results
    results.append({
        'Model': name,
        'Accuracy': accuracy
    })

# Convert the results to DataFrame for easier visualization
results_update = pd.DataFrame(results)

# Display the result
print(results_update)

                   Model  Accuracy
0    Logistic Regression  0.659838
1    K-Nearest Neighbors  0.639353
2          Decision Tree  0.585984
3  Bagged Decision Trees  0.639353
4          Random Forest  0.669003
5               AdaBoost  0.686792


In [48]:
print(y.value_counts(normalize=True))

e401k
0    0.607871
1    0.392129
Name: proportion, dtype: float64
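These proportions give a useful reference point: a model that always predicts the majority class would already be right about 61% of the time. The sketch below reproduces that baseline with `DummyClassifier` on synthetic labels that mimic the class balance (the 608/392 split is an assumption chosen to match the proportions above, not the actual row counts).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with roughly the class balance shown above; not the lab data
y = np.array([0] * 608 + [1] * 392)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any accuracy in the results table is best read against this baseline rather than against 0.5.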


## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

In [50]:
# False Positives : These are individuals who are predicted to be eligible for a 401(k) (positive class) but are actually not eligible. 

# False Negatives : These are individuals who are predicted to be not eligible for a 401(k) (negative class) but are actually eligible.

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

In [52]:
# Minimize false negatives, because:
# a false negative is an eligible person we fail to identify, which means a missed potential customer.
# A false positive only costs us some wasted outreach to someone who turns out not to be eligible.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

In [54]:
# Recall (also called sensitivity), because it penalizes false negatives:
# Recall = TP / (TP + FN)
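As a small worked example, the sketch below computes recall from a confusion matrix and checks it against `recall_score`. The labels and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels and predictions (1 = eligible, 0 = not eligible)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall_manual = tp / (tp + fn)
recall_sklearn = recall_score(y_true, y_pred)

print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
print(f"Recall by hand:     {recall_manual:.2f}")
print(f"Recall via sklearn: {recall_sklearn:.2f}")
```

Here two of the four eligible people are missed, so recall is 0.5 even though most predictions overall are correct.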

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

In [56]:
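# F1-score is the harmonic mean of precision and recall, so it is only high when both false positives and false negatives are low.
# It is also more informative than accuracy here because the classes are imbalanced (about 61% not eligible vs. 39% eligible).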
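The toy example below (made-up labels, not the lab data) shows how accuracy can look acceptable while F1 exposes a model that misses most of the positive class.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 4 positives, 6 negatives; the model finds only one positive
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1:        {f1_sklearn:.2f}")
```

Accuracy comes out at 0.70 because the negatives dominate, while F1 is only 0.40 because recall is 0.25.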

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [58]:
from sklearn.metrics import f1_score

# Fit each model and calculate F1-score for Train and Test sets
for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on training and testing sets
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Calculate F1-score for both training and testing data
    f1_train = f1_score(y_train, y_train_pred)
    f1_test = f1_score(y_test, y_test_pred)
    
    # Print the F1-score results for each model
    print(f"Model: {name}")
    print(f"Train F1-score: {f1_train:.4f}")
    print(f"Test F1-score: {f1_test:.4f}")
    print("===========================") 

Model: Logistic Regression
Train F1-score: 0.3603
Test F1-score: 0.3808
Model: K-Nearest Neighbors
Train F1-score: 0.6515
Test F1-score: 0.4966
Model: Decision Tree
Train F1-score: 1.0000
Test F1-score: 0.4674
Model: Bagged Decision Trees
Train F1-score: 0.9705
Test F1-score: 0.4858
Model: Random Forest
Train F1-score: 1.0000
Test F1-score: 0.5377
Model: AdaBoost
Train F1-score: 0.5601
Test F1-score: 0.5667


##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

In [60]:
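# The models showing clear overfitting are:
# - Decision Tree (train 1.0000 vs. test 0.4674)
# - Random Forest (train 1.0000 vs. test 0.5377)
# - Bagged Decision Trees (train 0.9705 vs. test 0.4858)
# These three models have very high F1-scores on the training data but much lower scores on the testing data.
# K-Nearest Neighbors shows a smaller gap (0.6515 vs. 0.4966).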

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

In [62]:
# Based on F1-score, overfitting, and generalization,
# I would choose AdaBoost as the final model.

# It has the highest test F1-score (0.5667) and test accuracy (0.6868), and its train and test F1-scores
# are nearly identical (0.5601 vs. 0.5667), so it shows no sign of overfitting.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

In [64]:
# 1. Feature Engineering 
# 2. Hyperparameter Tuning: adjust n_estimators, learning_rate
# 3. Stratified sampling
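For the stratified sampling idea, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both splits. The sketch below uses synthetic labels with roughly the class balance seen earlier (an assumption for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with about 39% positives; not the lab data
y = np.array([0] * 608 + [1] * 392)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

# stratify=y preserves the share of each class in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Share of positives overall: {y.mean():.3f}")
print(f"Share of positives (train): {y_train.mean():.3f}")
print(f"Share of positives (test):  {y_test.mean():.3f}")
```

Without `stratify`, the share of positives in a small test set can drift by chance, which makes metrics such as F1 noisier from one split to the next.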

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

In [66]:
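# For the regression problem,
# no model explains more than about 31% of the variance in income on the test set,
# which suggests that key variables are missing or that the features need further refinement.
# K-Nearest Neighbors, Support Vector Regressor, and Random Forest give slightly better test RMSE (about 20.4 to 20.5)
# than Linear Regression (21.40), but Linear Regression overfits least and is the easiest to interpret.


# For the classification problem,
# AdaBoost is the most accurate model for predicting 401(k) eligibility (test accuracy 0.687, test F1-score 0.567),
# outperforming the others, including Logistic Regression and Random Forest.
# However, it is only modestly better than always predicting the majority class (about 0.61 accuracy),
# so further feature engineering or hyperparameter tuning would be needed before relying on it.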