In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, and hopefully the account holds a lot more money by the time you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [2]:
import pandas as pd
df = pd.read_csv('./401ksubs.csv')
df.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

- whether the employer matches 401k contributions
- how long the employee has been at the company
- level of education

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

It could limit access to retirement accounts on the basis of race.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

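`incsq` (income squared) should not be used: it is a direct function of the target `inc`, so including it would give the answer away. Per the note above, `e401k`, `p401k`, and `pira` are also off-limits for this problem.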

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

`incsq` (income squared) and `agesq` (age squared). Squared terms let a linear model capture curved relationships, for example if the effect of income or age grows or levels off at higher values.

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

Both `inc` and `age` are described as squares. Income, like net total financial assets, is probably measured in thousands of dollars.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

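- Standardization/Scaling: Appropriate. It matters for distance-based models such as k-nearest neighbors and SVR, and it does not hurt tree-based models.
- Regularization: Appropriate. It could help us avoid overfitting.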

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler



In [4]:
df.columns

Index(['e401k', 'inc', 'marr', 'male', 'age', 'fsize', 'nettfa', 'p401k',
       'pira', 'incsq', 'agesq'],
      dtype='object')

In [5]:
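# NOTE: 'inc' is also the target below, so listing it as a feature leaks the answer
# into the model; the near-perfect scores that follow should be read with that in mind.
features = ['inc', 'marr', 'male', 'age', 'fsize', 'nettfa', 'agesq']
X = df[features]
y = df.inc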

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [7]:
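def regression(regressor):
    """Fit a scaled pipeline around `regressor` (an estimator class) and print train/test R^2."""
    steps = [
        ('scaler', StandardScaler()),
        ('regressor', regressor())  # instantiated with default hyperparameters
    ]

    pipe = Pipeline(steps = steps)

    model = pipe.fit(X_train, y_train)
    # For regressors, .score() returns R^2.
    print(f'train score: {model.score(X_train, y_train)}')
    print(f'test score: {model.score(X_test, y_test)}')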

In [8]:
regression(LinearRegression)

train score: 1.0
test score: 1.0


In [9]:
regression(KNeighborsRegressor)

train score: 0.986037407993265
test score: 0.9808575706489306


In [10]:
regression(DecisionTreeRegressor)

train score: 1.0
test score: 0.9999589182377502


In [11]:
regression(BaggingRegressor)

train score: 0.9999381526275468
test score: 0.9999512057270288


In [12]:
regression(RandomForestRegressor)

train score: 0.9999818107147898
test score: 0.9999775553716564


In [13]:
regression(AdaBoostRegressor)

train score: 0.9909406180625542
test score: 0.991561501352494


In [14]:
regression(SVR)

train score: 0.915903244988523
test score: 0.9186377989207589


##### 9. What is bootstrapping?

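Bootstrapping means drawing rows from our data at random, with replacement, to build new samples of the same size as the original. In bagging, each decision tree is fit on a different bootstrap sample, so each tree sees a slightly different version of the training data.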
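A minimal sketch of one bootstrap draw, assuming a toy array of ten row indices in place of the real dataframe (the seed and the names `rows`, `boot`, and `out_of_bag` are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only

# Toy stand-in for the row indices of a dataset (hypothetical).
rows = np.arange(10)

# One bootstrap sample: same size as the original, drawn WITH replacement,
# so some rows appear more than once and others not at all.
boot = rng.choice(rows, size=len(rows), replace=True)

# Rows that were never drawn are "out-of-bag" for this sample.
out_of_bag = np.setdiff1d(rows, boot)

print("bootstrap sample:", boot)
print("out-of-bag rows: ", out_of_bag)
```

Repeating the draw with a different seed gives a different sample, which is what lets each tree in a bagged ensemble train on different data.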

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

A single decision tree is fit once, on the full training set. A set of bagged decision trees fits many trees, each on a different bootstrap sample of the training set, and combines them into one aggregate prediction (an average for regression, a vote for classification). This is an ensemble method: aggregating many deep trees mainly reduces the high variance that any single deep tree has.

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

A random forest differs from a set of bagged decision trees in only one way: a random forest only considers a random subset of variables (features) at each split in the learning process.
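The contrast can be sketched in scikit-learn on synthetic data. This is an illustration under assumptions, not the lab's own code: the data comes from `make_regression`, the `_demo` names are hypothetical, and `max_features="sqrt"` is set explicitly because the default for `RandomForestRegressor` depends on the scikit-learn version and may consider all features at each split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data with 9 features (hypothetical stand-in for the lab data).
X_demo, y_demo = make_regression(n_samples=500, n_features=9, noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

# Bagged trees: the default base estimator is a decision tree, and every
# split may consider all 9 features.
bagged = BaggingRegressor(n_estimators=50, random_state=0)

# Random forest: also bootstrap samples, but each split considers only a
# random subset of the features (here sqrt(9) = 3 of them).
forest = RandomForestRegressor(n_estimators=50, max_features="sqrt", random_state=0)

for name, model in [("bagged trees", bagged), ("random forest", forest)]:
    model.fit(Xtr, ytr)
    print(f"{name}: test R^2 = {model.score(Xte, yte):.3f}")

# Number of features each individual tree considers per split.
print(bagged.estimators_[0].max_features_, forest.estimators_[0].max_features_)
```

Which of the two scores higher depends on the data; the point of the sketch is only the difference in how many features each split is allowed to see.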

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

Because bagged decision trees are generated on similar data, the trees are correlated with one another, and the model can tend to have high variance. A random forest de-correlates the trees by only considering a random subset of variables at each split, thus reducing variance.
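The variance argument can be made concrete with a standard result, stated here as a sketch under the simplifying assumption that the $B$ trees $T_b$ are identically distributed, each with variance $\sigma^2$ and pairwise correlation $\rho$:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Adding more trees shrinks only the second term. Lowering $\rho$, which is what the random feature subsets aim to do, shrinks the first.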

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [15]:
from sklearn.metrics import mean_squared_error

In [16]:
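def rmse(regressor):
    """Fit a scaled pipeline around `regressor` (an estimator class) and print train/test RMSE."""
    steps = [
        ('scaler', StandardScaler()),
        ('regressor', regressor())  # instantiated with default hyperparameters
    ]

    pipe = Pipeline(steps = steps)

    model = pipe.fit(X_train, y_train)

    # RMSE is the square root of the mean squared error, in the units of the target.
    train_pred = model.predict(X_train)
    print(f'train RMSE: {np.sqrt(mean_squared_error(y_train, train_pred))}')
    test_pred = model.predict(X_test)
    print(f'test RMSE: {np.sqrt(mean_squared_error(y_test, test_pred))}')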

In [17]:
rmse(LinearRegression)

train RMSE: 3.46095717534282e-14
test RMSE: 3.5423191560726574e-14


In [18]:
rmse(KNeighborsRegressor)

train RMSE: 2.833322907531698
test RMSE: 3.378338530863359


In [19]:
rmse(DecisionTreeRegressor)

train RMSE: 6.80555106201231e-16
test RMSE: 0.16274720408165236


In [20]:
rmse(BaggingRegressor)

train RMSE: 0.08357582910598772
test RMSE: 0.07408518039588405


In [21]:
rmse(RandomForestRegressor)

train RMSE: 0.10689516215193925
test RMSE: 0.18715579436035787


In [22]:
rmse(AdaBoostRegressor)

train RMSE: 2.212310773266255
test RMSE: 2.183340116088242


In [23]:
rmse(SVR)

train RMSE: 6.9534873369912855
test RMSE: 6.964917627608414


##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

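DecisionTreeRegressor shows the clearest evidence: its train RMSE is essentially zero, while its test RMSE is about 0.16. RandomForestRegressor and KNeighborsRegressor also have a higher test RMSE than train RMSE, though the gaps are proportionally much smaller.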

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

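The LinearRegression model, which has extremely low RMSE on both the training and test sets, with little evidence of overfitting. One caveat: `inc` was included among the features, so this near-perfect fit reflects the target leaking into the model rather than real predictive power.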
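The `features` list used for the regression includes `inc`, which is also the target, and that alone is enough to produce a perfect linear fit. A small sketch on made-up data (the arrays `x` and `target` are hypothetical, not the lab data) shows the effect:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: a noisy target only weakly related to a single feature.
x = rng.normal(size=200)
target = 0.5 * x + rng.normal(size=200)

X_clean = x.reshape(-1, 1)              # the legitimate feature only
X_leaky = np.column_stack([x, target])  # the target itself added as a feature

r2_clean = LinearRegression().fit(X_clean, target).score(X_clean, target)
r2_leaky = LinearRegression().fit(X_leaky, target).score(X_leaky, target)

print(f"R^2 without the target as a feature: {r2_clean:.3f}")
print(f"R^2 with the target as a feature:    {r2_leaky:.3f}")
```

Dropping `inc` (and `incsq`) from the feature list and refitting would give a more honest comparison of the models.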

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

1. collect more data
2. engineer more features
3. hyperparameter tuning

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

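Someone can only participate in a 401(k) if they are eligible for one, so `p401k` very nearly gives away the target. It is also unlikely that we would know participation status for the potential customers we actually want to score.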

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

- Bootstrapping
- Bagging
- Boosting
- Voting

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC

In [25]:
df.columns

Index(['e401k', 'inc', 'marr', 'male', 'age', 'fsize', 'nettfa', 'p401k',
       'pira', 'incsq', 'agesq'],
      dtype='object')

In [52]:
features = ['inc', 'marr', 'male', 'age', 'fsize', 'nettfa', 'pira', 'incsq', 'agesq']
X = df[features]
y = df.e401k

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=23)

In [28]:
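def classification(clf):
    """Fit a scaled pipeline around `clf` (an estimator class) and print train/test accuracy."""
    steps = [
        ('scaler', StandardScaler()),
        ('classifier', clf())  # instantiated with default hyperparameters
    ]
    pipe = Pipeline(steps=steps)
    model = pipe.fit(X_train, y_train)
    # For classifiers, .score() returns accuracy.
    print(f'train score: {model.score(X_train, y_train)}')
    print(f'test score: {model.score(X_test, y_test)}')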

In [29]:
classification(LogisticRegression)

train score: 0.6541115583668775
test score: 0.6589047003018542


In [30]:
classification(KNeighborsClassifier)

train score: 0.7524439332949971
test score: 0.6326002587322122


In [31]:
classification(DecisionTreeClassifier)

train score: 1.0
test score: 0.6226821905993963


In [32]:
classification(BaggingClassifier)

train score: 0.977142035652674
test score: 0.6576110392410521


In [33]:
classification(RandomForestClassifier)

train score: 0.9751293847038528
test score: 0.6541612764122466


In [34]:
classification(AdaBoostClassifier)

train score: 0.6926394479585969
test score: 0.6821905993962915


In [35]:
classification(SVC)

train score: 0.683438757906843
test score: 0.6597671410090556


## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

False positives are those we predict to be eligible for a 401k who are actually ineligible.

False negatives are those we predict to be ineligible who are actually eligible.
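These counts can be read off a confusion matrix. A minimal sketch with made-up labels (the `_demo` lists are hypothetical; 1 means eligible, 0 means not eligible):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels: 1 = eligible for a 401(k), 0 = not.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 1, 1]

# For binary labels {0, 1}, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

print(f"false positives (predicted eligible, actually not): {fp}")
print(f"false negatives (predicted not eligible, actually eligible): {fn}")
```

With these labels there are 2 false positives and 1 false negative.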

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

We want to minimize false negatives. False positives mean we target people who are ineligible, in which case the worst outcome is that they just ignore our advertising, but false negatives mean that we don't target people who are eligible, and they have no opportunity to get our product.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

Sensitivity (also called recall): TP / (TP + FN).

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

The F1 score is the harmonic mean of precision and recall. Higher precision means fewer false positives and higher recall means fewer false negatives, and the harmonic mean is pulled toward the lower of the two, so a high F1 score requires keeping both kinds of error low.
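As a worked check that F1 is the harmonic mean of precision and recall, here is a short sketch on made-up labels (the `_demo` lists are hypothetical):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical true and predicted labels: 1 = eligible for a 401(k), 0 = not.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 1, 1]

precision = precision_score(y_true_demo, y_pred_demo)  # tp / (tp + fp) = 3 / 5
recall = recall_score(y_true_demo, y_pred_demo)        # tp / (tp + fn) = 3 / 4

# Harmonic mean of precision and recall, computed by hand.
harmonic_mean = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3f}, recall = {recall:.3f}")
print(f"harmonic mean = {harmonic_mean:.3f}")
print(f"f1_score      = {f1_score(y_true_demo, y_pred_demo):.3f}")
```

Both of the last two lines print 0.667, which sits closer to the lower of the two inputs (0.6) than their arithmetic mean of 0.675 does.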

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [36]:
from sklearn.metrics import f1_score

In [37]:
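def f1(classifier):
    """Fit a scaled pipeline around `classifier` (an estimator class) and print train/test F1."""
    steps = [
        ('scaler', StandardScaler()),
        ('classifier', classifier())  # instantiated with default hyperparameters
    ]

    pipe = Pipeline(steps = steps)

    model = pipe.fit(X_train, y_train)

    # f1_score treats label 1 (eligible) as the positive class by default.
    train_pred = model.predict(X_train)
    print(f'train f1 score: {f1_score(y_train, train_pred)}')
    test_pred = model.predict(X_test)
    print(f'test f1 score: {f1_score(y_test, test_pred)}')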

In [38]:
f1(LogisticRegression)

train f1 score: 0.46246648793565687
test f1 score: 0.4651791751183232


In [39]:
f1(KNeighborsClassifier)

train f1 score: 0.6508515815085159
test f1 score: 0.4785801713586291


In [40]:
f1(DecisionTreeClassifier)

train f1 score: 1.0
test f1 score: 0.4940509915014165


In [41]:
f1(BaggingClassifier)

train f1 score: 0.9690566037735849
test f1 score: 0.47948717948717945


In [42]:
f1(RandomForestClassifier)

train f1 score: 0.968448894766673
test f1 score: 0.48771021992238034


In [43]:
f1(AdaBoostClassifier)

train f1 score: 0.5686037126715092
test f1 score: 0.5536038764385222


In [44]:
f1(SVC)

train f1 score: 0.45763546798029553
test f1 score: 0.41771217712177117


##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

DecisionTree, Bagging, and RandomForest all score significantly higher on the training set than on the test set.

KNeighbors, to a lesser degree, also shows evidence of overfitting.

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

I'd choose AdaBoost, since it has the highest f1 scores with only slight evidence of overfitting.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

1. engineer more features
2. collect more data
3. hyperparameter-tuning
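Hyperparameter tuning could look something like the following sketch. It makes several assumptions: it uses synthetic data from `make_classification` rather than the lab data, the grid values are arbitrary, and the `_demo` names are hypothetical. The step name `classifier` matches the pipeline used above, which is why the grid keys are prefixed with `classifier__`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (hypothetical stand-in for the lab data).
X_demo, y_demo = make_classification(n_samples=400, n_features=9, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", AdaBoostClassifier(random_state=0)),
])

# Arbitrary illustrative grid: 2 x 2 = 4 candidate settings.
param_grid = {
    "classifier__n_estimators": [25, 50],
    "classifier__learning_rate": [0.5, 1.0],
}

# Cross-validated search on the training set only, scored by F1.
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=3)
search.fit(Xtr, ytr)

print("best parameters:", search.best_params_)
print(f"best cross-validated F1: {search.best_score_:.3f}")
print(f"test F1 of the refit model: {search.score(Xte, yte):.3f}")
```

Because the search uses cross-validation inside the training set, the test set stays untouched until the final comparison.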

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.