## Supervised Learning Model Comparison

Recall one formulation of the data science process.

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model. Iterate.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. We'll define the problem and gather the data for you.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, and hopefully each account holds a lot more money when you retire.
- These are a little different from regular bank accounts in that they have tax benefits. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, you can check out [this site](investopedia.com/ask/answers/12/401k.asp).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether someone is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

---
### NOTE ⚠️

When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

---

### Step 2: Import the data.

##### 1. Read in the data from the repository.

In [1]:
import pandas as pd
import seaborn as sns

In [2]:
data = pd.read_csv('401ksubs.csv')

In [3]:
data.head()

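|   | e401k | inc | marr | male | age | fsize | nettfa | p401k | pira | incsq | agesq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 13.17 | 0 | 0 | 40 | 1 | 4.575 | 0 | 1 | 173.4489 | 1600 |
| 1 | 1 | 61.23 | 0 | 1 | 35 | 1 | 154.0 | 1 | 0 | 3749.113 | 1225 |
| 2 | 0 | 12.858 | 1 | 0 | 44 | 2 | 0.0 | 0 | 0 | 165.3282 | 1936 |
| 3 | 0 | 98.88 | 1 | 1 | 44 | 2 | 21.8 | 0 | 0 | 9777.254 | 1936 |
| 4 | 0 | 22.614 | 0 | 0 | 53 | 1 | 18.45 | 0 | 0 | 511.393 | 2809 |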


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9275 entries, 0 to 9274
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   e401k   9275 non-null   int64  
 1   inc     9275 non-null   float64
 2   marr    9275 non-null   int64  
 3   male    9275 non-null   int64  
 4   age     9275 non-null   int64  
 5   fsize   9275 non-null   int64  
 6   nettfa  9275 non-null   float64
 7   p401k   9275 non-null   int64  
 8   pira    9275 non-null   int64  
 9   incsq   9275 non-null   float64
 10  agesq   9275 non-null   int64  
dtypes: float64(3), int64(8)
memory usage: 797.2 KB


##### 2. What are 2-3 other variables that, if they were available, would be helpful to have?

- Occupation would probably be a good predictor of income.
- Education level
- Zip Code
- Renter or Homeowner
- Expected Retirement age

##### 3. Suppose a peer recommended putting `race` into your model to better predict who to target when advertising IRAs and 401(k)s. Why might this be unethical?

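This would be a discriminatory tactic: using `race` to decide who sees advertising for financial products treats people differently on the basis of a protected characteristic and could systematically exclude some groups from retirement-savings opportunities.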

### Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we exclude? Why?

In [5]:
data.corr()['inc']

e401k     0.268178
inc       1.000000
marr      0.362008
male     -0.069871
age       0.105638
fsize     0.110170
nettfa    0.376586
p401k     0.270833
pira      0.364354
incsq     0.940161
agesq     0.087305
Name: inc, dtype: float64

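Per the note above, we exclude `e401k`, `p401k`, and `pira`. We should also exclude `incsq`: it is computed directly from `inc` (correlation 0.94), so using it would leak the target into the features. Among the remaining features, `male` and `agesq` have correlations with income that are closest to 0, so they could also be candidates for exclusion.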

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

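Income squared (`incsq`) and age squared (`agesq`) have already been created through feature engineering. SMEs have probably found that these features improve the performance of their models, for example because a squared term lets a linear model capture a curved relationship such as earnings that rise and then level off with age.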
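
One plausible reason for adding squared terms is that they let a linear model fit a curved relationship. The sketch below illustrates this with synthetic data only; the `age` and `income` arrays and the shape of the curve are assumptions made up for the example, not values from the 401(k) dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(13)  # seed chosen only for reproducibility

# Hypothetical data: income rises with age, peaks around 45, then falls.
age = rng.uniform(25, 65, size=500)
income = 60 - 0.05 * (age - 45) ** 2 + rng.normal(0, 2, size=500)

X_linear = age.reshape(-1, 1)                    # age only
X_quadratic = np.column_stack([age, age ** 2])   # age plus age squared

r2_linear = LinearRegression().fit(X_linear, income).score(X_linear, income)
r2_quadratic = LinearRegression().fit(X_quadratic, income).score(X_quadratic, income)

print(f"R2 with age only:        {r2_linear:.3f}")
print(f"R2 with age and age^2:   {r2_quadratic:.3f}")
```

On data shaped like this, the model with the squared term should fit noticeably better, because a straight line cannot follow a curve that goes up and then down.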

##### 6. Looking at the data dictionary, one or more variable descriptions appear to be erroneous. What's the issue and what do you think the correct value(s) should be?

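The descriptions for `inc` and `age` look like the erroneous ones; they were probably overwritten when the squared features were engineered. Judging by the values in the data, `inc` should describe income and `age` should describe age in years.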

### Step 4: Model the data. (Part 1: Regression Problem)

- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all models you've learned through the date this lab was assigned that could be used to solve a regression problem. For each model type, identify whether it might be appropriate for solving this specific regression problem and explain why.

Regression Models and their potential for solving this problem:
- Linear/Multiple Linear Regression.
- Ridge Regression.
- Lasso Regression.
- KNN Regression.
- Decision Tree Regression.
- Bagging.
- Random Forest.

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a K-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - a boosting model

    
> As always, be sure to do a train/test split! To compare modeling techniques, use the same train-test split on each. 

> You may find it helpful to set up pipelines, but you are not required to do so.
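
A minimal sketch of the pipeline idea mentioned above, assuming synthetic data from `make_regression` as a stand-in for the 401(k) dataframe. The names `X_demo`, `y_demo`, and `pipe` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the 401(k) dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, noise=10.0, random_state=13
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=13)

# Scaling lives inside the pipeline, so the scaler is fit on training rows only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])
pipe.fit(X_tr, y_tr)

r2_demo = pipe.score(X_te, y_te)
print(f"KNN pipeline test R2: {r2_demo:.3f}")
```

Swapping the `"knn"` step for another estimator reuses the same scaling and the same split, which keeps the model comparison fair.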

In [68]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor

models = {
    'LinReg': LinearRegression(),
    'KNN': KNeighborsRegressor(),
    'Tree': DecisionTreeRegressor(),
    'Bagging': BaggingRegressor(),
    'RandomForest': RandomForestRegressor(),
    'Boost': AdaBoostRegressor()
}

In [89]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Seems like including squared income here is wrong, but no indication above to exclude it
X = data.drop(columns=['e401k', 'p401k', 'pira', 'inc', 'incsq'])
y = data['inc']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13)

ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [90]:
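def regressions(X_train, X_test, y_train, y_test, models):
    """Fit each model, then print its test R2 and any coefficients or importances."""
    for k, model in models.items():
        model.fit(X_train, y_train)
        print(f"{k} R2 score: {model.score(X_test, y_test)}")
        # Linear models expose coef_; tree-based models expose feature_importances_.
        for attr in ("coef_", "feature_importances_"):
            if hasattr(model, attr):
                print(getattr(model, attr))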

In [91]:
regressions(X_train_sc, X_test_sc, y_train, y_test, models)

LinReg R2 score: 0.2896129712905864
[ 10.1143359    1.33706639  31.79434649  -3.21247748   8.36036042
 -31.88611564]
KNN R2 score: 0.31368738655845974
Tree R2 score: -0.18326789592736503
[0.10284809 0.02034971 0.10595146 0.06766602 0.60312921 0.10005549]
Bagging R2 score: 0.3045835877717117
RandomForest R2 score: 0.3230888918946818
[0.09970107 0.02181028 0.10344717 0.07011745 0.60158544 0.10333859]
Boost R2 score: -0.03157212900556883
[0.14951596 0.01173512 0.10576606 0.01495378 0.67015389 0.04787519]


In [92]:
X.columns

Index(['marr', 'male', 'age', 'fsize', 'nettfa', 'agesq'], dtype='object')

##### 9. What is bootstrapping?

Bootstrapping is resampling our training data with replacement, so that some data/rows might show up multiple times.
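
A minimal sketch of one bootstrap sample, using a hypothetical toy set of 10 row indices rather than the real training data.

```python
import numpy as np

rng = np.random.default_rng(13)  # seed chosen only for reproducibility

# Hypothetical toy "training set": 10 row indices.
rows = np.arange(10)

# One bootstrap sample: same size as the original, drawn WITH replacement.
boot = rng.choice(rows, size=rows.size, replace=True)

# Rows that were never drawn are "out-of-bag" for this sample.
out_of_bag = np.setdiff1d(rows, boot)

print("bootstrap sample:", boot)
print("unique rows drawn:", np.unique(boot).size)
print("out-of-bag rows:", out_of_bag)
```

Because sampling is with replacement, some rows typically appear more than once and others not at all.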

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific!

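A single decision tree is fit once on the full training set, and a deep tree tends to have high variance. A set of bagged decision trees fits many trees, each on its own bootstrapped sample of the training data, and averages their predictions (or takes a majority vote for classification), which reduces that variance.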

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific!

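A random forest is bagging with one extra step: at each split, a tree may only choose from a random subset of the features. This decorrelates the trees, which would otherwise be highly correlated in a plain bagging ensemble because they all tend to split on the same strong predictors.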

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

By decorrelating the decision trees in a random forest, model variance is reduced and bias is increased slightly, improving the performance of random forests over bagging in general.
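
The difference can be sketched in scikit-learn as follows, assuming synthetic data from `make_regression`. The names `bagged` and `forest` are hypothetical. `max_features="sqrt"` is set explicitly because the regressor's default in recent scikit-learn versions considers every feature at each split, which would make the forest behave like plain bagging.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the 401(k) dataset).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=13
)

# Bagging: each tree sees a bootstrap sample and may split on ANY feature.
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=13)

# Random forest: bootstrap samples PLUS a random feature subset at each split.
forest = RandomForestRegressor(n_estimators=50, max_features="sqrt", random_state=13)

for name, model in [("bagging", bagged), ("random forest", forest)]:
    model.fit(X_demo, y_demo)
    print(f"{name}: {len(model.estimators_)} trees fit")
```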

### Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [55]:
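from sklearn.metrics import mean_squared_error

def regressions_rmse(X_train, X_test, y_train, y_test, models):
    """Refit each model and print its RMSE on the test set."""
    for k, model in models.items():
        model.fit(X_train, y_train)
        # Take the square root directly: the `squared=False` argument is
        # deprecated in recent scikit-learn releases.
        rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
        print(f"{k} RMSE score: {rmse}")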

In [57]:
regressions_rmse(X_train_sc, X_test_sc, y_train, y_test, models)

LinReg RMSE score: 20.713430555278393
KNN RMSE score: 19.93374910374894
Tree RMSE score: 26.509139005538216
Bagging RMSE score: 20.55602552123894
RandomForest RMSE score: 20.02721865877122
Boost RMSE score: 20.69949056616224


##### 14. Which model performs best?

The KNN model has the lowest test RMSE at about 19.93, so on that metric it is the best performing model. The Random Forest is close behind at about 20.03.

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use for this problem, which one model would you pick? Why?

I would pick the KNN model because it has the lowest RMSE and the second highest R2.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

I would run a Grid Search and tune hyperparameters:
- 'n_neighbors'
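
A minimal sketch of such a grid search, assuming synthetic data from `make_regression`. The grid values are illustrative only, not tuned recommendations, and the step names `scale` and `knn` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the 401(k) dataset).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=6, noise=20.0, random_state=13
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=13)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsRegressor())])

# Hypothetical grid; "<step name>__<parameter>" targets a pipeline step.
param_grid = {
    "knn__n_neighbors": [3, 5, 11, 21],
    "knn__weights": ["uniform", "distance"],
}

# scikit-learn maximizes scores, so RMSE is exposed as a negated value.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("cross-validated RMSE:", -search.best_score_)
```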

### Step 4: Model the data. (Part 2: Classification Problem)

- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

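`p401k` leaks the target: everyone participating in a 401k has to be eligible first, so `p401k == 1` guarantees `e401k == 1`, although not everyone eligible is participating. A model would lean heavily on this feature, and for a new prospect we would rarely know participation without already knowing eligibility.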

Also, commercially, if someone is already participating in a 401k then they might not be the best target for our company.

##### 18. List all models you've learned that could be used to solve a classification problem. For each, identify whether it is appropriate for solving this specific classification problem and explain why.

- Logistic Regression
- KNN
- Decision Trees
- Ensemble methods

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a K-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - a boosting model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [58]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = data.drop(columns=['e401k'])
y = data['e401k']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=13)

ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.transform(X_test)

In [59]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier

models_class = {
    'LogReg': LogisticRegression(),
    'KNN Classifier': KNeighborsClassifier(),
    'Tree Classifier': DecisionTreeClassifier(),
    'Bagging Classifier': BaggingClassifier(),
    'RandomForest Classifier': RandomForestClassifier(),
    'Boost Classifier': AdaBoostClassifier()
}

In [60]:
def classifications(X_train, X_test, y_train, y_test, models):
    for k, model in models.items():
        model.fit(X_train, y_train)
        print(f"{k} Accuracy score: {model.score(X_test, y_test)}")

In [61]:
classifications(X_train_sc, X_test_sc, y_train, y_test, models_class)

LogReg Accuracy score: 0.8822768434670116
KNN Classifier Accuracy score: 0.8667529107373868
Tree Classifier Accuracy score: 0.8059508408796895
Bagging Classifier Accuracy score: 0.8710651142733937
RandomForest Classifier Accuracy score: 0.877533419577404
Boost Classifier Accuracy score: 0.8796895213454075


### Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

- False positives are cases where our model predicts someone is eligible for a 401k when they are, in fact, not eligible.
- False negatives are cases where our model predicts someone is NOT eligible when they are, in fact, eligible.
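
These definitions can be checked with a confusion matrix. The sketch below uses hypothetical labels (1 = eligible, 0 = not eligible), not predictions from the models above.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = eligible for a 401(k), 0 = not eligible.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]

# With labels ordered [0, 1], ravel() yields counts as tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
# → TN=2 FP=2 FN=1 TP=3
```

Here the two false positives are the ineligible people predicted as eligible, and the one false negative is the eligible person predicted as not eligible.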

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

Since the problem statement says we are trying to identify potential customers, I would rather minimize false positives, which saves our company from contacting people who aren't even eligible for the product we are selling (401k accounts).

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

With help from: https://towardsdatascience.com/what-metrics-should-we-use-on-imbalanced-data-set-precision-recall-roc-e2e79252aeba

To select a model that minimizes false positives we would optimize for precision.

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

F1 score is the harmonic mean of precision and recall. Because it is only high when both are high, it balances false positives and false negatives and would be appropriate in this case.
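
A worked example of the harmonic mean, using hypothetical confusion-matrix counts. The helper `precision_recall_f1` is an illustrative name, not a scikit-learn function.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
arithmetic_mean = (precision + recall) / 2

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"F1={f1:.3f} vs arithmetic mean={arithmetic_mean:.3f}")
```

The harmonic mean sits below the arithmetic mean whenever precision and recall differ, so a model cannot earn a high F1 by doing well on only one of them.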

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [64]:
from sklearn.metrics import f1_score

def classifications_f1(X_train, X_test, y_train, y_test, models):
    for k, model in models.items():
        model.fit(X_train, y_train)
        try:
            print(f"{k} Train F1 score: {f1_score(y_train, model.predict(X_train))}")
            print(f"{k} Test F1 score: {f1_score(y_test, model.predict(X_test))}")
        except:
            print(f"Could not calculate F1 for {k}")

In [65]:
classifications_f1(X_train_sc, X_test_sc, y_train, y_test, models_class)

LogReg Train F1 score: 0.8276751181779115
LogReg Test F1 score: 0.8233009708737864
KNN Classifier Train F1 score: 0.8489179256839525
KNN Classifier Test F1 score: 0.8086687306501547
Tree Classifier Train F1 score: 1.0
Tree Classifier Test F1 score: 0.7490389895661724
Bagging Classifier Train F1 score: 0.9782852864095845
Bagging Classifier Test F1 score: 0.8189762796504368
RandomForest Classifier Train F1 score: 1.0
RandomForest Classifier Test F1 score: 0.8196303377947737
Boost Classifier Train F1 score: 0.8286818376985831
Boost Classifier Test F1 score: 0.8201160541586074


##### 25. Based on training f1-score and testing f1-score, is there evidence of a great deal of overfitting in any of your models? Which ones?

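The most overfit models by F1 score, judged by the gap between training and testing scores, are:
1. Decision Tree Classifier (train 1.00, test 0.75)
1. Random Forest Classifier (train 1.00, test 0.82)
1. Bagging Classifier (train 0.98, test 0.82)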
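
The ranking can be reproduced by computing the train-minus-test gap for each model. The scores below are copied from the output above and rounded to four decimal places; the names `f1_scores` and `gaps` are hypothetical.

```python
# (train F1, test F1) pairs copied from the output above, rounded to 4 places.
f1_scores = {
    "LogReg": (0.8277, 0.8233),
    "KNN Classifier": (0.8489, 0.8087),
    "Tree Classifier": (1.0, 0.7490),
    "Bagging Classifier": (0.9783, 0.8190),
    "RandomForest Classifier": (1.0, 0.8196),
    "Boost Classifier": (0.8287, 0.8201),
}

# A larger train-minus-test gap suggests more overfitting.
gaps = {name: round(train - test, 4) for name, (train, test) in f1_scores.items()}
ranked = sorted(gaps, key=gaps.get, reverse=True)

for name in ranked:
    print(f"{name}: gap {gaps[name]:.4f}")
```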


##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

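I would use a LogReg classifier to find potential customers: in the runs above it has the highest test F1 score (about 0.823) and the highest test accuracy (about 0.882), it shows almost no gap between training and testing F1, and it is more interpretable and easier to tune than the ensemble models.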

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

I would drop rows where `p401k == 1` and run a grid search over the hyperparameters for LogReg.

### Step 6: Answer the problem.

##### Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

### Regression
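We were not able to accurately predict income unless the income squared feature was included, and that feature is computed from the target itself, so it isn't reasonable to include. Without it, the best test R2 was about 0.32 (Random Forest) and the best test RMSE was about 19.9 (KNN).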

Our tree-based models identify Net Total Financial Assets (`nettfa`) as the most important predictor of income by a wide margin, followed by age.

### Classification
Our models were able to predict Eligibility for 401k with good accuracy, though not great, and with good F1 scores, that is, with a good balance of minimizing both false positives and false negatives.