## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.

Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested and hopefully has a lot more money in it when you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('./401ksubs.csv')
df.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

1. Region
2. Number of working months
3. Disability

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

It could create unconscious bias towards a certain race.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

inc and age because we already have incsq and agesq. It would create multicollinearity.

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

incsq and agesq. They might provide the model with better prediction.

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

inc, the label shows inc^2

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

Linear Regression

Ridge

Lasso

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [11]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

In [5]:
df.columns

Index(['e401k', 'inc', 'marr', 'male', 'age', 'fsize', 'nettfa', 'p401k',
       'pira', 'incsq', 'agesq'],
      dtype='object')

In [7]:
X = df[['marr','male','age','fsize','nettfa']]
y = df['inc']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=12)

In [25]:
models = [RandomForestRegressor(), SVR(), KNeighborsRegressor(), AdaBoostRegressor()]

In [27]:
for model in models:
    pipe = Pipeline([
        ('ss', StandardScaler()),
        ('model', model)
    ])
    pipe.fit(X_train, y_train)
    print (f'{model}')
    print (pipe.score(X_test, y_test))

RandomForestRegressor()
0.24900909018917827
SVR()
0.28710094479897597
KNeighborsRegressor()
0.2756849818385504
AdaBoostRegressor()
-0.06242167421889033


##### 9. What is bootstrapping?

It is taking samples from the existing population.

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

Decision tree use a single group of data for prediction.
Bagged decision tree ustilizes bootstrap method and perform deiciosn tree on each group of samples and the final prediction will be based on the majority/average of all predictions.

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

In random forest, only a subset of features based on the randomly selected samples will be used to perform the best split.

In bagged decision trees, all the features available are considered for splitting a node.

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

Random forest split the nodes with optimal features available in each sample. Thus, less noise is included in each tree.

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [28]:
from sklearn.metrics import mean_squared_error as mse

In [30]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.fit_transform(X_test)

In [33]:
for model in models:
    y_train_p = model.predict(X_train_sc)
    y_test_p = model.predict(X_test_sc)
    rmse_train = mse(y_train, y_train_p, squared=False)
    rmse_test = mse(y_test, y_test_p, squared=False)
    
    print (f'{model}')
    print (f'RMSE of train data {rmse_train}')
    print (f'RMSE of test data {rmse_test}')

RandomForestRegressor()
RMSE of train data 7.745249830426105
RMSE of test data 20.498883704796235
SVR()
RMSE of train data 19.956593132110527
RMSE of test data 19.6444755616537
KNeighborsRegressor()
RMSE of train data 16.553799088481348
RMSE of test data 20.223635493359218
AdaBoostRegressor()
RMSE of train data 24.112955301206462
RMSE of test data 26.589316630584317


##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

Support Vector Machine has slight overfitting but I would say it is quite negligible.

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

Support Vector Machine because it provides the RMSE of train and test data are closest to each other which is not very overfitting not underfitting.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

1. Have more rows.
2. Try different feature engineering.
3. Add some variables as stated in Step 2.2

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

If p401k = 1, it means the person has already participated in 401k and definitely eligible for 401k.

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

Logistic Regression,
KNN,
Decision Tree,
Random Forest,
AdaBoost,
SVM

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [34]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

In [37]:
df.columns

Index(['e401k', 'inc', 'marr', 'male', 'age', 'fsize', 'nettfa', 'p401k',
       'pira', 'incsq', 'agesq'],
      dtype='object')

In [38]:
X = df.drop(columns=['e401k', 'p401k'])
y = df['e401k']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=12)

In [39]:
models = [RandomForestClassifier(), SVC(), KNeighborsClassifier(), AdaBoostClassifier(), LogisticRegression(), DecisionTreeClassifier()]

In [40]:
for model in models:
    pipe = Pipeline([
        ('ss', StandardScaler()),
        ('model', model)
    ])
    pipe.fit(X_train, y_train)
    print (f'{model}')
    print (pipe.score(X_test, y_test))

RandomForestClassifier()
0.6452830188679245
SVC()
0.6501347708894879
KNeighborsClassifier()
0.6264150943396226
AdaBoostClassifier()
0.663611859838275
LogisticRegression()
0.6490566037735849
DecisionTreeClassifier()
0.584366576819407


## Step 5: Evaluate the model. (Part 2: Classfication Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

False positive = predicted positive but actual negative

False negative = predicted negative but actual positive

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

Minimize false negative in order to maximize value gain.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

Precision

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [42]:
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, accuracy_score, plot_roc_curve, roc_auc_score, recall_score, precision_score, f1_score


In [46]:
ss = StandardScaler()
X_train_sc = ss.fit_transform(X_train)
X_test_sc = ss.fit_transform(X_test)

In [48]:
for model in models:
    y_train_p = model.predict(X_train_sc)
    y_test_p = model.predict(X_test_sc)
    rmse_train = mse(y_train, y_train_p, squared=False)
    rmse_test = mse(y_test, y_test_p, squared=False)
    
    print (f'{model}')
    print (f'Train f1 score: {f1_score(y_train, y_train_p)}')
    print (f'Test f1 score: {f1_score(y_test, y_test_p)}')

RandomForestClassifier()
Train f1 score: 1.0
Test f1 score: 0.5144827586206897
SVC()
Train f1 score: 0.4849565411187876
Test f1 score: 0.4238178633975482
KNeighborsClassifier()
Train f1 score: 0.6554526554526554
Test f1 score: 0.4889867841409692
AdaBoostClassifier()
Train f1 score: 0.5800186741363212
Test f1 score: 0.5374917053749172
LogisticRegression()
Train f1 score: 0.4774700289375775
Test f1 score: 0.4686737184703011
DecisionTreeClassifier()
Train f1 score: 1.0
Test f1 score: 0.4883720930232558


##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

They all are overfitting except AdaBoost and LogisticRegression which are not as overfitting.

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

I would choose AdaBoost because it is not so much overfitting and higher score than logistic regression.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.