<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Supervised Learning Model Comparison

---

### Let us begin...

Recall the `data science process`.
   1. Define the problem.
   2. Gather the data.
   3. Explore the data.
   4. Model the data.
   5. Evaluate the model.
   6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, so the account will hopefully hold a lot more money by the time you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

#### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable. 

#### When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data.

In [4]:
# library imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression, Lasso, Ridge, ElasticNet, LinearRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.ensemble import BaggingRegressor, BaggingClassifier,\
RandomForestRegressor, RandomForestClassifier, AdaBoostRegressor, AdaBoostClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import metrics
import time

In [5]:
data = pd.read_csv('401ksubs.csv')

In [6]:
data.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


In [7]:
data.dtypes

e401k       int64
inc       float64
marr        int64
male        int64
age         int64
fsize       int64
nettfa    float64
p401k       int64
pira        int64
incsq     float64
agesq       int64
dtype: object

In [8]:
data.shape

(9275, 11)

In [9]:
data.describe()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
count,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0,9275.0
mean,0.392129,39.254641,0.628571,0.20442,41.080216,2.885067,19.071675,0.276226,0.25434,2121.192483,1793.652722
std,0.488252,24.090002,0.483213,0.403299,10.299517,1.525835,63.963838,0.447154,0.435513,3001.469424,895.648841
min,0.0,10.008,0.0,0.0,25.0,1.0,-502.302,0.0,0.0,100.1601,625.0
25%,0.0,21.66,0.0,0.0,33.0,2.0,-0.5,0.0,0.0,469.1556,1089.0
50%,0.0,33.288,1.0,0.0,40.0,3.0,2.0,0.0,0.0,1108.091,1600.0
75%,1.0,50.16,1.0,0.0,48.0,4.0,18.4495,1.0,1.0,2516.0255,2304.0
max,1.0,199.041,1.0,1.0,64.0,13.0,1536.798,1.0,1.0,39617.32,4096.0


In [10]:
data.isnull().sum()

e401k     0
inc       0
marr      0
male      0
age       0
fsize     0
nettfa    0
p401k     0
pira      0
incsq     0
agesq     0
dtype: int64

##### 2. What are 2-3 other variables that, if available, would be helpful to have?

In [12]:
# Years of Experience, Net Debt, Employment Status, Type of Work

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

In [14]:
# Using race in the model might lead to discriminatory practices. 
# Basing predictions on race could reinforce biases and limit access to financial resources and information for certain racial groups.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

In [16]:
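# When predicting inc, we would not use e401k, p401k, or pira, since the lab asks us to pretend we do not have access to them.
# We would also exclude incsq: it is computed directly from inc, so it would leak the target into the features.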

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs (Subject Matter Experts) might have done this!

In [18]:
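# agesq and incsq were created through feature engineering (they are the squares of age and inc).
# Hypothesis: SMEs added them so that a linear model can capture curved relationships,
# for example income that rises with age and then levels off.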

In [19]:
data.drop(columns = ['agesq', 'incsq'], inplace = True)
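
One way to see why a squared term can help is a small sketch on synthetic data. The numbers below are made up for illustration and are not drawn from the lab's dataset; the assumption is simply that "income" rises with age and then falls off.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, hypothetical data: "income" peaks around age 50 and falls off on either side.
rng = np.random.default_rng(0)
age = rng.uniform(25, 64, size=500)
income = 40 - 0.05 * (age - 50) ** 2 + rng.normal(0, 2, size=500)

X_linear = age.reshape(-1, 1)                  # age only
X_squared = np.column_stack([age, age ** 2])   # age plus age squared

# Fit the same linear model on both feature sets and compare in-sample R^2.
r2_linear = LinearRegression().fit(X_linear, income).score(X_linear, income)
r2_squared = LinearRegression().fit(X_squared, income).score(X_squared, income)

print(f"R^2 with age only:      {r2_linear:.3f}")
print(f"R^2 with age and age^2: {r2_squared:.3f}")
```

The model is still linear in its coefficients; the squared column just gives it a curve to work with.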

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

In [21]:
# inc is described as inches^2, but it may actually represent income.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

1. Multiple Linear Regression: A good choice for simplicity and interpretability. (**appropriate**)
2. Decision Trees: Handles non-linear relationships and interactions well but can overfit. (**appropriate**)
3. Random Forests: Reduces the overfitting tendency of decision trees by averaging multiple trees, though less interpretable. (**appropriate**)
4. k-Nearest Neighbors (kNN): Sensitive to the scale of features and works better with low-dimensional data. (**appropriate**)
5. Gradient Boosting Machines: Offers strong predictive power and handles complex relationships, though less interpretable. (**appropriate**)

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [25]:
X = data.drop(columns = ['e401k', 'p401k' ,'pira','inc'])
y = data['inc']
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)

##### 9. What is bootstrapping?

In [27]:
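# Bootstrapping creates new datasets by randomly sampling rows from the original dataset with replacement.
# Each bootstrap sample is typically the same size as the original, so some rows repeat and others are left out.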
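
As a minimal sketch of sampling with replacement, the toy array below stands in for a dataset; the values and the number of samples are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
original = np.array([10, 20, 30, 40, 50])   # toy "dataset" of five observations

# Draw three bootstrap samples, each the same size as the original,
# sampling with replacement so a value can appear more than once.
bootstrap_samples = [
    rng.choice(original, size=len(original), replace=True)
    for _ in range(3)
]

for i, sample in enumerate(bootstrap_samples, start=1):
    print(f"Bootstrap sample {i}: {sample}")
```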

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

In [29]:
# Decision tree ---> A single model that splits the data based on feature values to make predictions.
# Set of bagged decision trees ---> An ensemble of decision trees, each trained on a different bootstrap sample of the original dataset.
# The final prediction is the average across all trees (for regression) or the majority vote (for classification).

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

In [31]:
# Set of bagged decision trees ---> A collection of trees fit on bootstrapped data; every tree may consider all features at every split.
# Random Forest ---> Bagged decision trees with added feature randomness: at each split, only a random subset of the features is considered.

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

In [33]:
# Reduce correlation among trees: Each tree considers only a random subset of features, making it less likely to produce similar splits.
# Less similar splits across trees reduce variance and improve accuracy.

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [35]:
Regressor = [
    ('Linear Regression', LinearRegression()),
    ('k-Nearest Neighbors', KNeighborsRegressor()),
    ('Decision Tree', DecisionTreeRegressor()),
    ('Bagged Decision Trees', BaggingRegressor()),
    ('Random Forest', RandomForestRegressor()),
    ('AdaBoost',  AdaBoostRegressor())
]

In [36]:
def compare_regression(models):
    
    for name,model in models:        
        pipeline = Pipeline([
            ('scaler', StandardScaler()),
            ('model', model)
        ])

        pipeline.fit(X_train, y_train)

        y_train_pred = pipeline.predict(X_train)
        y_test_pred = pipeline.predict(X_test)

        rmse_train = metrics.root_mean_squared_error(y_train, y_train_pred)
        rmse_test = metrics.root_mean_squared_error(y_test, y_test_pred)

        print(f"{name} model")
        print(f"Training RMSE: {rmse_train:.4f}")
        print(f"Testing RMSE: {rmse_test:.4f}")
        print()
        
compare_regression(Regressor)

Linear Regression model
Training RMSE: 20.4019
Testing RMSE: 21.4684

k-Nearest Neighbors model
Training RMSE: 16.5042
Testing RMSE: 20.2338

Decision Tree model
Training RMSE: 2.2981
Testing RMSE: 27.3855

Bagged Decision Trees model
Training RMSE: 8.7880
Testing RMSE: 20.9266

Random Forest model
Training RMSE: 7.6859
Testing RMSE: 20.4184

AdaBoost model
Training RMSE: 22.4442
Testing RMSE: 23.4737
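
For reference, RMSE is the square root of the average squared error. The sketch below computes it by hand with NumPy on a few hypothetical values (not taken from the dataset).

```python
import numpy as np

# Hypothetical actual and predicted incomes.
y_true = np.array([30.0, 45.0, 60.0, 25.0])
y_pred = np.array([28.0, 50.0, 55.0, 30.0])

# RMSE: square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(f"RMSE: {rmse:.4f}")
```

Because of the square root, RMSE is in the same units as the target, which makes it easier to interpret than MSE.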



##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

In [38]:
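# Overfitting is evident in the Decision Tree (training RMSE 2.30 vs. testing 27.39), Bagged Decision Trees (8.79 vs. 20.93),
# and Random Forest (7.69 vs. 20.42): each has a training RMSE far below its testing RMSE.
# k-Nearest Neighbors shows a milder gap (16.50 vs. 20.23).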

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

In [40]:
# The Linear Regression model.
# Reason: its training and testing RMSE are close (20.4019 vs. 21.4684), so there is little sign of overfitting.
# Its testing RMSE is not the lowest (k-Nearest Neighbors reaches 20.2338), but the gap is small,
# and its coefficients are directly interpretable, which matters for a question about which features best predict income.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

In [42]:
regulars = [Ridge(), Lasso(), ElasticNet()]
params = {'alpha': [0.01 ,0.1, 1, 10, 100]}

for regular in regulars:
    gridsearch = GridSearchCV(regular, params, cv=5, verbose=1)
    gridsearch.fit(X_train, y_train)

    y_train_pred = gridsearch.predict(X_train)
    y_test_pred = gridsearch.predict(X_test)

    rmse_train = metrics.root_mean_squared_error(y_train, y_train_pred)
    rmse_test = metrics.root_mean_squared_error(y_test, y_test_pred)

    print("Best Estimator:", gridsearch.best_estimator_)
    print(f"Training RMSE: {rmse_train:.4f}")
    print(f"Testing RMSE: {rmse_test:.4f}")
    print()
    
# Baseline for comparison: plain (unregularized) linear regression.
lr = LinearRegression()
lr.fit(X_train, y_train)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

rmse_train = metrics.root_mean_squared_error(y_train, y_train_pred)
rmse_test = metrics.root_mean_squared_error(y_test, y_test_pred)

print("LinearRegression Best Estimator")
print(f"Training RMSE: {rmse_train:.4f}")
print(f"Testing RMSE: {rmse_test:.4f}")


Fitting 5 folds for each of 5 candidates, totalling 25 fits
Best Estimator: Ridge(alpha=0.01)
Training RMSE: 20.4019
Testing RMSE: 21.4684

Fitting 5 folds for each of 5 candidates, totalling 25 fits
Best Estimator: Lasso(alpha=0.01)
Training RMSE: 20.4019
Testing RMSE: 21.4690

Fitting 5 folds for each of 5 candidates, totalling 25 fits
Best Estimator: ElasticNet(alpha=0.01)
Training RMSE: 20.4039
Testing RMSE: 21.4716

LinearRegression Best Estimator
Training RMSE: 20.4019
Testing RMSE: 21.4684
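
Regularization penalties depend on the scale of the features, so one further thing to try is a grid search over a pipeline that standardizes first. The sketch below uses synthetic data from `make_regression`; the step name `ridge` and the choice of scoring are assumptions made for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration.
X_demo, y_demo = make_regression(n_samples=400, n_features=5, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Scaling lives inside the pipeline, so it is refit on each CV training fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"ridge__alpha": [0.01, 0.1, 1, 10, 100]}

# Score on (negated) RMSE so the search optimizes the metric being reported.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_tr, y_tr)

print("Best alpha:", search.best_params_["ridge__alpha"])
print(f"Cross-validated RMSE: {-search.best_score_:.4f}")
```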


In [43]:
# Plain linear regression ties Ridge for the lowest testing RMSE (21.4684), so regularization does not help here.
# Refit it as the final model.
lr = LinearRegression()
lr.fit(X_train,y_train)
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

rmse_train = metrics.root_mean_squared_error(y_train, y_train_pred)
rmse_test = metrics.root_mean_squared_error(y_test, y_test_pred)

print(f"Training RMSE: {rmse_train:.4f}")
print(f"Testing RMSE: {rmse_test:.4f}")

Training RMSE: 20.4019
Testing RMSE: 21.4684


In [44]:
df_coef = pd.DataFrame({'Feature': X.columns,'Coefficient': lr.coef_})
df_coef

Unnamed: 0,Feature,Coefficient
0,marr,20.290498
1,male,3.075914
2,age,0.047262
3,fsize,-1.508391
4,nettfa,0.131868


## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

In [46]:
# Participation implies eligibility: a person must be eligible for a 401k before they can participate in one,
# so anyone with p401k = 1 should also have e401k = 1.
# Using p401k therefore leaks the target into the features. The model would look more accurate than it really is,
# and it would not help with prospective customers, whose participation status we would not know.
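
A quick way to check for this kind of leakage is to cross-tabulate the two columns. The rows below are toy, hypothetical values that reuse the lab's column names; on the real data the same check would be run against `data`.

```python
import pandas as pd

# Toy, hypothetical rows using the lab's column names.
toy = pd.DataFrame({
    "e401k": [1, 1, 0, 0, 1],
    "p401k": [1, 0, 0, 0, 1],
})

# Cross-tabulate eligibility against participation.
print(pd.crosstab(toy["e401k"], toy["p401k"]))

# Count participants who are recorded as not eligible.
ineligible_participants = ((toy["p401k"] == 1) & (toy["e401k"] == 0)).sum()
print("Participants who are not eligible:", ineligible_participants)
```

If that count is zero, every participant is eligible, and `p401k = 1` effectively gives away the answer for those rows.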

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

1. Logistic Regression: A natural choice for binary classification because it is simple and interpretable. (**appropriate**)
2. Decision Trees: Handle non-linear relationships and interactions well but can overfit. (**appropriate**)
3. Random Forests: Reduces the overfitting tendency of decision trees by averaging multiple trees, though it is less interpretable. (**appropriate**)
4. k-Nearest Neighbors (kNN): Sensitive to the scale of features and works better with low-dimensional data. (**appropriate**)
5. AdaBoost model: This model enhances weak learners like decision trees. (**appropriate**)
6. Gradient Boosting: This model builds an ensemble of weak learners sequentially, with strong predictive performance but is less interpretable. (**appropriate**)
7. XGBoost model: This is an optimized version of Gradient Boosting that is widely used for classification tasks. (**appropriate**)

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [50]:
data['e401k'].value_counts(normalize = True)

e401k
0    0.607871
1    0.392129
Name: proportion, dtype: float64

In [51]:
X = data.drop(columns = ['p401k','e401k'])
y = data['e401k']

In [52]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42, stratify=y)

In [53]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((6956, 7), (2319, 7), (6956,), (2319,))

In [54]:
y_train.value_counts(normalize = True)

e401k
0    0.607821
1    0.392179
Name: proportion, dtype: float64

In [55]:
y_test.value_counts(normalize = True)

e401k
0    0.608021
1    0.391979
Name: proportion, dtype: float64
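
The matching proportions above come from `stratify=y`. As a small sketch on hypothetical labels (60% zeros, 40% ones, chosen only for illustration), stratifying keeps the class balance the same in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60% zeros, 40% ones.
y_demo = np.array([0] * 600 + [1] * 400)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

print(f"Positive share overall:  {y_demo.mean():.3f}")
print(f"Positive share in train: {y_tr.mean():.3f}")
print(f"Positive share in test:  {y_te.mean():.3f}")
```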

In [56]:
Classifier = [
    ('Logistic Regression', LogisticRegression(solver='liblinear', penalty='l2', C=1.0, max_iter=1000)),
    ('k-Nearest Neighbors', KNeighborsClassifier()),
    ('Decision Tree', DecisionTreeClassifier()),
    ('Bagged Decision Trees', BaggingClassifier()),
    ('Random Forest', RandomForestClassifier()),
    ('AdaBoost',  AdaBoostClassifier(algorithm='SAMME'))
]

In [57]:
def compare_classification(models):
    
    f1_list = []
    
    for name,model in models:        
        pipeline = Pipeline([
            ('scaler', StandardScaler()),        
            ('model', model)           
        ])

        pipeline.fit(X_train, y_train)

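        # Note: this cross-validates the bare model, so the scaler is not applied within the CV folds;
        # pass `pipeline` instead of `model` to include scaling.
        cv_scores = cross_val_score(model, X_train, y_train, cv=5)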
        train_score = pipeline.score(X_train, y_train)
        test_score = pipeline.score(X_test, y_test)

        # Predict on training and testing data
        train_pred = pipeline.predict(X_train)
        test_pred = pipeline.predict(X_test)

        train_f1 = metrics.f1_score(y_train, train_pred)
        test_f1 = metrics.f1_score(y_test, test_pred)

        print(f"{name} model")
        print(f"Cross Validation score {cv_scores.mean():.4f}")
        print(f"Training score: {train_score:.4f}")
        print(f"Testing score: {test_score:.4f}")
        
        print()

        # Append results to the list
        f1_list.append({
            'model': name,
            'train f1': train_f1,
            'test f1': test_f1
        })
        
    # Convert the list to a DataFrame
    f1_df = pd.DataFrame(f1_list)

    return f1_df
        

f1_score = compare_classification(Classifier)

Logistic Regression model
Cross Validation score 0.6489
Training score: 0.6486
Testing score: 0.6542

k-Nearest Neighbors model
Cross Validation score 0.6333
Training score: 0.7542
Testing score: 0.6266

Decision Tree model
Cross Validation score 0.5966
Training score: 1.0000
Testing score: 0.5826

Bagged Decision Trees model
Cross Validation score 0.6429
Training score: 0.9770
Testing score: 0.6481

Random Forest model
Cross Validation score 0.6686
Training score: 1.0000
Testing score: 0.6636

AdaBoost model
Cross Validation score 0.6801
Training score: 0.6875
Testing score: 0.6900



## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

In [59]:
# False positives: Individuals who are not eligible for a 401k but are predicted to be eligible.
# False negatives: Individuals who are eligible for a 401k but are predicted not to be eligible.
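
These counts can be read off a confusion matrix. The sketch below uses hypothetical labels, with 1 meaning "eligible", rather than the lab's actual predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = eligible for a 401(k), 0 = not eligible.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"False positives (predicted eligible, actually not): {fp}")
print(f"False negatives (predicted not eligible, actually eligible): {fn}")
```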

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

In [61]:
# We would rather minimize false negatives. A false negative means we overlook someone who is actually eligible,
# which is a missed potential customer. A false positive only costs us some wasted outreach
# to someone who turns out not to be eligible.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

In [63]:
# Optimize sensitivity (also known as recall).

##### 23. Suppose that instead of optimizing for the metric in problem 22, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

In [65]:
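# f1-score is the harmonic mean of precision and sensitivity (recall), so it is high only when the model both
# identifies true positives (sensitivity) and avoids false positives (precision).
# It is also more informative than accuracy here because the classes are imbalanced (about 61% not eligible, 39% eligible).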
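
As a small worked example on hypothetical labels, f1-score can be computed by hand from precision and recall and checked against scikit-learn:

```python
from sklearn import metrics

# Hypothetical labels: 1 = eligible, 0 = not eligible.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

precision = metrics.precision_score(y_true, y_pred)   # tp / (tp + fp)
recall = metrics.recall_score(y_true, y_pred)         # tp / (tp + fn)

# f1 is the harmonic mean of precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = metrics.f1_score(y_true, y_pred)

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
print(f"f1 by hand: {f1_manual:.2f}, f1 from scikit-learn: {f1_sklearn:.2f}")
```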

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [67]:
f1_score

Unnamed: 0,model,train f1,test f1
0,Logistic Regression,0.374616,0.391502
1,k-Nearest Neighbors,0.658819,0.475787
2,Decision Tree,1.0,0.476757
3,Bagged Decision Trees,0.969947,0.487437
4,Random Forest,1.0,0.527273
5,AdaBoost,0.584321,0.585113


##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

In [69]:
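# Overfitting is clearly present in the Decision Tree (training f1 1.000 vs. testing 0.477), Random Forest (1.000 vs. 0.527),
# and Bagged Decision Trees (0.970 vs. 0.487), and to a lesser degree in k-Nearest Neighbors (0.659 vs. 0.476),
# based on the gap between their training and testing f1-scores.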

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

In [71]:
# AdaBoost. Random Forest and Decision Tree score perfectly on the training data, but they overfit badly.
# AdaBoost has the highest testing f1-score (0.5851) and an almost identical training f1-score (0.5843),
# making it the most suitable model for reliable predictions in this case.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

In [73]:
t0 = time.time()

ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(),algorithm='SAMME')
params = {
    'n_estimators' : [30,50,100],
    'estimator__max_depth' : [1,2],
    'learning_rate': [0.7,0.9,1.0],
}

gridsearch = GridSearchCV(ada, params, cv=5, verbose=1)

gridsearch.fit(X_train,y_train)

# Print best results
print("Best Score:", gridsearch.best_score_)
print("Best Params:", gridsearch.best_params_)

# Predict using the best model
train_pred = gridsearch.best_estimator_.predict(X_train)
test_pred = gridsearch.best_estimator_.predict(X_test)

# Compute F1 scores
train_f1 = metrics.f1_score(y_train, train_pred)
test_f1 = metrics.f1_score(y_test, test_pred)

# Output F1 scores
print(f"Training f1-score: {train_f1:.4f}")
print(f"Testing f1-score: {test_f1:.4f}")

# Print total runtime
print(f"How long did all take to run: {(time.time() - t0):.0f} s")

Fitting 5 folds for each of 18 candidates, totalling 90 fits
Best Score: 0.6838702413710471
Best Params: {'estimator__max_depth': 2, 'learning_rate': 1.0, 'n_estimators': 30}
Training f1-score: 0.5709
Testing f1-score: 0.5764
How long did all take to run: 62 s


In [74]:
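# The default AdaBoost had a slightly higher testing f1-score (0.5851 vs. 0.5764 after tuning).
# Here we refit with the grid search's best parameters to inspect accuracy and feature importances.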
ada = AdaBoostClassifier( estimator= DecisionTreeClassifier(max_depth=2),
                         algorithm='SAMME', learning_rate= 1.0, n_estimators= 30)

ada.fit(X_train,y_train)
train_pred = ada.predict(X_train)
test_pred = ada.predict(X_test)

train_acc = metrics.accuracy_score(y_train, train_pred)
test_acc = metrics.accuracy_score(y_test, test_pred)

print(f"Training accuracy score: {train_acc:.4f}")
print(f"Testing accuracy score: {test_acc:.4f}\n")

feature_importances_df = pd.DataFrame({'feature': X.columns, 'importance': ada.feature_importances_})\
.sort_values('importance', ascending=  False)

print(feature_importances_df)

Training accuracy score: 0.6913
Testing accuracy score: 0.6964

  feature  importance
5  nettfa    0.534129
0     inc    0.327972
3     age    0.066109
1    marr    0.029835
6    pira    0.023770
4   fsize    0.018186
2    male    0.000000


## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

In [76]:
# Regression
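# Marital status has the largest coefficient in the linear model: being married is associated with an income
# about 20.29 units higher, holding the other features constant. Each one-unit increase in net total financial
# assets (nettfa) is associated with an income about 0.13 units higher.
# If both variables are recorded in thousands of dollars, as the data dictionary indicates, that is roughly
# $20,290 for being married and about $132 per additional $1,000 in assets.
# Because the features are on different scales, coefficient size alone does not rank feature importance.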

# Classification
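# For predicting eligibility for a 401(k) plan, net total financial assets (nettfa) is the most influential feature
# in the AdaBoost model (importance of about 0.53), followed by income (about 0.33).
# Feature importances show how heavily a feature is used, not the direction of its effect, so they alone
# do not tell us whether higher assets raise or lower the predicted probability of eligibility.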

# Limitations
# The dataset has limited features for predicting 401(k) plan eligibility, including only income, family size, 
# net total financial assets, age, marital status, and sex. Adding features such as job contract type (e.g., permanent, temporary, freelance) 
# and years of work experience could improve model accuracy and insights.