## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, and hopefully the account holds a lot more money when you retire.
- These are a little different from regular bank accounts in that they come with certain tax benefits. Also, employers frequently match money that you put into a 401(k).
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend that you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [100]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor



In [101]:
df_401k = pd.read_csv('./401ksubs.csv')

In [102]:
df_401k.isnull().sum()

e401k     0
inc       0
marr      0
male      0
age       0
fsize     0
nettfa    0
p401k     0
pira      0
incsq     0
agesq     0
dtype: int64

In [103]:
df_401k.head()

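| | e401k | inc | marr | male | age | fsize | nettfa | p401k | pira | incsq | agesq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 13.17 | 0 | 0 | 40 | 1 | 4.575 | 0 | 1 | 173.4489 | 1600 |
| 1 | 1 | 61.23 | 0 | 1 | 35 | 1 | 154.0 | 1 | 0 | 3749.113 | 1225 |
| 2 | 0 | 12.858 | 1 | 0 | 44 | 2 | 0.0 | 0 | 0 | 165.3282 | 1936 |
| 3 | 0 | 98.88 | 1 | 1 | 44 | 2 | 21.8 | 0 | 0 | 9777.254 | 1936 |
| 4 | 0 | 22.614 | 0 | 0 | 53 | 1 | 18.45 | 0 | 0 | 511.393 | 2809 |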


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

*Occupation, education level*

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

*The model may predict advertising to one race over another, possibly creating a barrier to access, or creating a bias within the team/company on who to target for financial services. We would not want to avoid targeting a certain group of people due to their race.*

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

*We would want to make sure that the features we include are not collinear. For example, marriage and family size may be too highly correlated to include both in the model, so we would want to test them for collinearity before including them. Similarly, Age^2 is an engineered feature that is directly correlated with Age, so we would not want to include both; we would keep just Age^2.*
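
One way to run the collinearity check described above is to look at pairwise correlations before choosing features. The following is a minimal sketch, assuming a small synthetic frame whose column names mirror the lab data; the values are made up for illustration and are not taken from `401ksubs.csv`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the lab data; values are illustrative only.
rng = np.random.default_rng(42)
age = rng.integers(25, 65, size=500)
demo = pd.DataFrame({
    "age": age,
    "agesq": age ** 2,                      # engineered directly from age
    "fsize": rng.integers(1, 7, size=500),  # drawn independently of age here
})

# Pairwise Pearson correlations; values near 1 or -1 flag redundant features.
corr = demo.corr()
print(corr.round(3))

age_agesq_corr = corr.loc["age", "agesq"]
print(f"corr(age, agesq) = {age_agesq_corr:.3f}")
```

On the real dataframe the same idea would be `df_401k.corr()`, read pair by pair for the candidate features.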

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

https://stats.stackexchange.com/questions/52585/what-happens-when-i-include-a-squared-variable-in-my-regression

*`agesq` and `incsq` are squared versions of `age` and `inc`. This may be because the underlying relationships are not linear; for example, income may eventually reach a plateau with age.*

*Including a squared term lets a linear model capture that kind of curved (quadratic) relationship, which a single linear term cannot.*
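
To illustrate why a squared term can help, here is a hedged sketch on synthetic data in which the outcome rises with age and then flattens. The coefficients and noise level are arbitrary assumptions chosen for illustration, not estimates from the lab data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: the outcome climbs with age, peaks, then flattens (illustrative only).
rng = np.random.default_rng(0)
age = rng.uniform(25, 65, size=400)
outcome = 4.0 * age - 0.04 * age ** 2 + rng.normal(0, 2, size=400)

X_linear = age.reshape(-1, 1)                   # age only
X_quadratic = np.column_stack([age, age ** 2])  # age and age squared

r2_linear = LinearRegression().fit(X_linear, outcome).score(X_linear, outcome)
r2_quadratic = LinearRegression().fit(X_quadratic, outcome).score(X_quadratic, outcome)

print(f"R^2 with age only:      {r2_linear:.3f}")
print(f"R^2 with age and age^2: {r2_quadratic:.3f}")
```

Because the second model nests the first, its in-sample R^2 can never be lower; the size of the gap indicates how much curvature the linear term was missing.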

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

*Both age^2 and inc^2 appear in the data dictionary twice. When examining the data, however, the variables themselves seem to be correct.*

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend that you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

**Multiple Linear Regression** - Uses multiple variables to predict y, and its interpretable coefficients make it very useful in this situation. Since we are predicting a numeric variable, we can use MLR.

**Logistic Regression** - This predicts categories, so it would not be appropriate for this problem.

**K-Nearest Neighbors** - Non-parametric (it makes no assumptions about the distribution of the data) and distance-based, so the data needs to be scaled. Its predictions are difficult to attribute to individual features, which matters here because we want to understand which features predict income.

**Decision Tree Regression** - Does not require scaling and is easy to interpret, which is what we are looking for in this question, but it is prone to overfitting.

**Random Forests and ExtraTrees** - These de-correlate the individual trees and reduce variance by considering a random subset of features at each split. If a single decision tree is too overfit, these ensembles can reduce variance and strengthen the model.

**AdaBoost Model** - An ensemble method that fits weak learners sequentially, up-weighting the observations that earlier learners predicted poorly so that later learners focus on correcting those errors.

**Support Vector Regressor** - A black-box model, which is not helpful for interpretation.

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!
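
As a sketch of the pipeline idea mentioned above, the loop below wraps each candidate model in a `Pipeline` with a `StandardScaler` and scores every model on the same train/test split. The synthetic data from `make_regression` and the names `candidates` and `scores` are assumptions for illustration; the lab itself fits each model in separate cells.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the lab features (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

candidates = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "tree": DecisionTreeRegressor(random_state=42),
}

scores = {}
for name, estimator in candidates.items():
    # The scaler is fit on the training split only, inside the pipeline.
    pipe = Pipeline([("scale", StandardScaler()), ("model", estimator)])
    pipe.fit(X_tr, y_tr)
    scores[name] = (pipe.score(X_tr, y_tr), pipe.score(X_te, y_te))

for name, (train_r2, test_r2) in scores.items():
    print(f"{name}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

Keeping the scaler inside the pipeline means `predict` and `score` always apply the same transformation that was learned during `fit`.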

In [104]:
np.random.seed(42)

In [105]:
X = df_401k[['marr', 'male', 'age', 'fsize', 'nettfa', 'incsq', 'agesq']]
y = df_401k['inc']

In [106]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#### Multiple Linear Regression

In [142]:
#instantiate the Model
lr = LinearRegression()

#fit the model 
lr.fit(X_train, y_train)

#score the  data
print(f'Training Score: {lr.score(X_train, y_train)}, Testing Score: {lr.score(X_test, y_test)}')

Training Score: 0.8948254673897011, Testing Score: 0.9055024120733456


In [143]:
from sklearn import metrics

In [145]:
#Train rmse
preds = lr.predict(X_train)

resids = preds - y_train

rmse_train_mlr = np.sqrt(metrics.mean_squared_error(y_train, preds))

In [146]:
#test rmse
preds = lr.predict(X_test)

resids = preds - y_test

rmse_mlr = np.sqrt(metrics.mean_squared_error(y_test, preds))

#### K-Nearest Neighbor

In [110]:
sc = StandardScaler()

In [111]:
sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)

In [112]:
knnr = KNeighborsRegressor(n_neighbors=5)
knnr.fit(Z_train, y_train)

KNeighborsRegressor()

In [113]:
print(f'Training Score: {knnr.score(Z_train, y_train)},Testing Score: {knnr.score(Z_test, y_test)}')

Training Score: 0.9795704183969427,Testing Score: 0.9728418554699854


In [147]:
#Train rmse
preds = knnr.predict(Z_train)

rmse_train_knnr = np.sqrt(metrics.mean_squared_error(y_train, preds))

In [148]:
#Test rmse

preds = knnr.predict(Z_test)

rmse_knnr = np.sqrt(metrics.mean_squared_error(y_test, preds))

### Decision Tree

In [255]:
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

In [116]:
dtr = DecisionTreeRegressor()

In [117]:
dtr.fit(Z_train, y_train)

DecisionTreeRegressor()

In [118]:
dtr.score(Z_train, y_train), dtr.score(Z_test, y_test)

(1.0, 0.9999481486457086)

In [151]:
#Train rmse
preds = dtr.predict(Z_train)

rmse_train_dtr = np.sqrt(metrics.mean_squared_error(y_train, preds))

In [150]:
#Test rmse

preds = dtr.predict(Z_test)

rmse_dtr = np.sqrt(metrics.mean_squared_error(y_test, preds))

### Bagged Decision Tree

In [120]:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html

from sklearn.ensemble import BaggingRegressor

In [121]:
bgr = BaggingRegressor(random_state = 42, base_estimator = DecisionTreeRegressor())

In [122]:
bgr.fit(Z_train, y_train)

BaggingRegressor(base_estimator=DecisionTreeRegressor(), random_state=42)

In [123]:
bgr.score(Z_train, y_train), bgr.score(Z_test, y_test)

(0.999989416529091, 0.9999883790392515)

In [152]:
#Train rmse
preds = bgr.predict(Z_train)

rmse_train_bgr = np.sqrt(metrics.mean_squared_error(y_train, preds))

In [153]:
#Test rmse
preds = bgr.predict(Z_test)

rmse_bgr = np.sqrt(metrics.mean_squared_error(y_test, preds))

### Random Forest

In [125]:
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

In [126]:
rfr = RandomForestRegressor(n_estimators = 100)

In [127]:
rfr.fit(Z_train, y_train)

RandomForestRegressor()

In [128]:
rfr.score(Z_train, y_train),rfr.score(Z_test, y_test)

(0.9999889348368476, 0.9999907283001671)

In [154]:
#Train rmse
preds = rfr.predict(Z_train)

rmse_train_rfr = np.sqrt(metrics.mean_squared_error(y_train, preds))

In [155]:
#Test rmse

preds = rfr.predict(Z_test)

rmse_rfr = np.sqrt(metrics.mean_squared_error(y_test, preds))

### Ada Boosting

In [130]:
from sklearn.ensemble import AdaBoostRegressor

In [131]:
abr = AdaBoostRegressor(random_state = 42)

In [132]:
abr.fit(Z_train, y_train)

AdaBoostRegressor(random_state=42)

In [133]:
abr.score(Z_train, y_train)

0.9913321095472987

In [134]:
abr.score(Z_test, y_test)

0.9916962982297579

In [135]:
abr.estimators_[:5]

[DecisionTreeRegressor(max_depth=3, random_state=1608637542),
 DecisionTreeRegressor(max_depth=3, random_state=1776589882),
 DecisionTreeRegressor(max_depth=3, random_state=1666910835),
 DecisionTreeRegressor(max_depth=3, random_state=754668208),
 DecisionTreeRegressor(max_depth=3, random_state=1177281088)]

In [156]:
#Train rmse
preds = abr.predict(Z_train)

rmse_train_abr = np.sqrt(metrics.mean_squared_error(y_train, preds))

In [157]:
#Test rmse

preds = abr.predict(Z_test)

rmse_abr = np.sqrt(metrics.mean_squared_error(y_test, preds))

### Support Vector Regressor

In [160]:
from sklearn.svm import LinearSVR

In [161]:
svr = LinearSVR(random_state=42, max_iter=5000)

In [162]:
svr.fit(Z_train, y_train)

LinearSVR(max_iter=5000, random_state=42)

In [163]:
svr.score(Z_train, y_train), svr.score(Z_test, y_test)

(0.8149823407006432, 0.8266033728325769)

In [164]:
#Train rmse
preds = svr.predict(Z_train)

rmse_train_svr = np.sqrt(metrics.mean_squared_error(y_train, preds))

In [165]:
#test rmse
preds = svr.predict(Z_test)

rmse_svr = np.sqrt(metrics.mean_squared_error(y_test, preds))

##### 9. What is bootstrapping?

*Bootstrapping is a resampling method in which we repeatedly draw random samples, with replacement, from the observed data, each the same size as the original sample. A model (or statistic) is fit on each bootstrapped sample, and the results are then combined.*
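
A small sketch of bootstrapping with NumPy, assuming a made-up sample of eight values; the numbers and the choice of 1,000 resamples are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = np.array([12.0, 15.5, 9.8, 22.1, 18.4, 30.2, 11.7, 25.0])  # made-up values

n_resamples = 1000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Draw n observations WITH replacement from the observed sample.
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means[i] = resample.mean()

# Percentile interval for the mean, read off the bootstrap distribution.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {sample.mean():.2f}")
print(f"95% bootstrap interval: ({lower:.2f}, {upper:.2f})")
```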

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

*A decision tree is a single model that learns rules from the X data, repeatedly splitting the data into smaller subsets. 'Bagging' stands for bootstrap aggregating. A set of bagged decision trees draws many bootstrap samples (random samples taken with replacement) from the training data and fits a separate decision tree on each one. The predictions of all the trees are then aggregated (averaged for regression, majority vote for classification) to make the final prediction.*
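
The bootstrap-then-aggregate idea can be sketched by hand. This is a simplified illustration on synthetic data, not how `BaggingRegressor` is implemented internally; the number of trees and all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the lab data (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

rng = np.random.default_rng(42)
n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap: sample row indices with replacement from the training set.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Aggregate: average the predictions of all trees (one column per tree).
all_preds = np.column_stack([tree.predict(X_te) for tree in trees])
bagged_pred = all_preds.mean(axis=1)

single_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
rmse_single = np.sqrt(np.mean((single_tree.predict(X_te) - y_te) ** 2))
rmse_bagged = np.sqrt(np.mean((bagged_pred - y_te) ** 2))
print(f"single tree test RMSE:  {rmse_single:.2f}")
print(f"bagged trees test RMSE: {rmse_bagged:.2f}")
```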

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

*Bagged decision trees fit each tree on a bootstrap resample of the training data and then aggregate the results of all the trees into a final prediction. A random forest does the same but also de-correlates the trees: at each split, the algorithm considers only a **random** subset of the features. This prevents the correlation that tends to occur when a few features are very strong predictors of the target and would otherwise be used near the top of almost every tree.*
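
In scikit-learn, this difference shows up in how many features each tree may consider per split. The sketch below uses synthetic data with nine features, and `max_features=3` is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with 9 features (assumption, for illustration only).
X_demo, y_demo = make_regression(n_samples=200, n_features=9, noise=10.0, random_state=42)

# Bagging: each tree sees a bootstrap sample of rows; every split may use all 9 features.
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=42)
bagged.fit(X_demo, y_demo)

# Random forest: bootstrap rows AND consider only 3 randomly chosen features per split.
forest = RandomForestRegressor(n_estimators=50, max_features=3, random_state=42)
forest.fit(X_demo, y_demo)

bagged_split_features = bagged.estimators_[0].max_features_
forest_split_features = forest.estimators_[0].max_features_
print(f"features considered per split, bagged tree: {bagged_split_features}")
print(f"features considered per split, forest tree: {forest_split_features}")
```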

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

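*Bagged trees tend to be highly correlated with one another, which limits how much averaging them reduces variance. Because a random forest considers only a random subset of the features at each split, its trees are less correlated, so averaging them reduces variance further and helps when the individual trees are overfit.*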

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

#### Test RMSE

In [99]:
print(f'MLR: {rmse_mlr}')
print(f'KNN: {rmse_knnr}')
print(f'DTR: {rmse_dtr}')
print(f'BGR: {rmse_bgr}')
print(f'RFR: {rmse_rfr}')
print(f'ABR: {rmse_abr}')
print(f'SVR: {rmse_svr}')

MLR: 7.506111330932499
KNN: 4.02396956785657
DTR: 0.14957548524733985
BGR: 0.08323877058190046
RFR: 0.0806068262554918
ABR: 2.2250537752157458
SVR: 10.167752419073258


#### Train RMSE 

In [166]:
print(f'MLR: {rmse_train_mlr}')
print(f'KNN: {rmse_train_knnr}')
print(f'DTR: {rmse_train_dtr}')
print(f'BGR: {rmse_train_bgr}')
print(f'RFR: {rmse_train_rfr}')
print(f'ABR: {rmse_train_abr}')
print(f'SVR: {rmse_train_svr}')

MLR: 7.776216516193241
KNN: 3.4272263258974696
DTR: 5.553987297632147e-16
BGR: 0.07800583843213096
RFR: 0.07976125153602444
ABR: 2.2323883936942956
SVR: 10.313822194654584


In [207]:
data = {'rmse': [rmse_mlr,rmse_knnr,rmse_dtr,rmse_bgr,rmse_rfr,rmse_abr, rmse_svr], 'train_rmse': [rmse_train_mlr,rmse_train_knnr,rmse_train_dtr,rmse_train_bgr,rmse_train_rfr,rmse_train_abr,rmse_train_svr]}
index = ['MLR',
'KNN',
'DTR',
'BGR',
'RFR',
'ABR',
'SVR']
df = pd.DataFrame.from_dict(data)
df.index = index


In [208]:
df.head()

| | rmse | train_rmse |
|---|---|---|
| MLR | 7.506111 | 7.776217 |
| KNN | 4.02397 | 3.427226 |
| DTR | 0.175827 | 5.553987e-16 |
| BGR | 0.083239 | 0.07800584 |
| RFR | 0.074351 | 0.07976125 |


In [211]:
# Baseline RMSE:
# Predict the training mean for every test row, then take the square root of the MSE.
baseline = np.sqrt(metrics.mean_squared_error(y_test, [y_train.mean()] * len(y_test)))
baseline

24.417756069233555

##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

*It looks like the Decision Tree, Bagged Decision Trees, and Random Forest are overfit.*

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

*I would pick the MLR because its training and testing scores are very close together and its coefficients are interpretable.*

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

*I would continue feature engineering via polynomial features and add a grid search to find the best hyperparameters.*

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

*If you are participating in a 401k, you are already eligible for the 401k, so these variables would be correlated.*

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

**Logistic Regression** - This predicts categories, so it would be appropriate for this problem.

**K-Nearest Neighbors** - Non-parametric (it makes no assumptions about the distribution of the data) and distance-based, so the data needs to be scaled. Its predictions are difficult to attribute to individual features, but here we are only trying to predict whether the person is eligible.

**Decision Tree Classifier** - Does not require scaling, but it is prone to overfitting.

**Random Forests and ExtraTrees Classifier** - These de-correlate the individual decision trees and reduce variance by considering a random subset of features at each split. If a single decision tree classifier is too overfit, these ensembles can reduce variance and strengthen the model.

**AdaBoost Classifier** - An ensemble method that fits weak learners sequentially, up-weighting the observations that earlier learners misclassified so that later learners focus on correcting them.

**Support Vector Classifier** - A black-box predictor.

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [247]:
X = df_401k.drop(columns =['e401k','p401k'])
y = df_401k['e401k']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

sc = StandardScaler()
Z_train = sc.fit_transform(X_train)
Z_test = sc.transform(X_test)

In [260]:
# What is the accuracy of our baseline model?
y.value_counts(normalize=True)

0    0.607871
1    0.392129
Name: e401k, dtype: float64
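
The baseline accuracy is the share of the majority class, which is what a model that always predicts the most frequent label would score. A minimal sketch with `DummyClassifier`, assuming illustrative labels with roughly the same 61/39 balance as `e401k` above:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Illustrative labels with roughly the same class balance as `e401k` (assumption).
y_demo = np.array([0] * 61 + [1] * 39)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit.
baseline_clf = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_accuracy = baseline_clf.score(X_demo, y_demo)
print(f"baseline accuracy: {baseline_accuracy:.2f}")  # prints 0.61
```

Any classifier worth keeping should beat this number on the test set.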

### Logistic Regression

In [251]:
lgr = LogisticRegression()
lgr.fit(Z_train, y_train)
lgr.score(Z_train, y_train), lgr.score(Z_test, y_test)

(0.6569867740080506, 0.6550237171194481)

### k-nearest neighbors

In [254]:
knn = KNeighborsClassifier()
knn.fit(Z_train, y_train)
knn.score(Z_train, y_train), knn.score(Z_test, y_test)

(0.7515813686026452, 0.6399310047434239)

### decision tree 

In [256]:
dtc = DecisionTreeClassifier()
dtc.fit(Z_train, y_train)
dtc.score(Z_train, y_train), dtc.score(Z_test, y_test)

(1.0, 0.5890470030185425)

### bagged decision trees

In [257]:
from sklearn.ensemble import BaggingClassifier

In [263]:
bgc = BaggingClassifier()
bgc.fit(Z_train, y_train)
bgc.score(Z_train, y_train), bgc.score(Z_test, y_test)

(0.9764232317423807, 0.6338939197930142)

### random forest

In [268]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

In [267]:
rfc = RandomForestClassifier()
rfc.fit(Z_train, y_train)
rfc.score(Z_train, y_train), rfc.score(Z_test, y_test)

(0.9998562392179413, 0.6580422595946529)

In [273]:
etc = ExtraTreesClassifier(n_estimators=500)
etc.fit(Z_train, y_train)
etc.score(Z_train, y_train), etc.score(Z_test, y_test)

(1.0, 0.6520051746442432)

### Adaboost model

In [274]:
from sklearn.ensemble import AdaBoostClassifier
abc = AdaBoostClassifier()
abc.fit(Z_train, y_train)
abc.score(Z_train, y_train), abc.score(Z_test, y_test)

(0.6927832087406556, 0.685640362225097)

### support vector classifier

In [275]:
from sklearn.svm import SVC

svc = SVC()
svc.fit(Z_train, y_train)
svc.score(Z_train, y_train), svc.score(Z_test, y_test)

(0.683870040253019, 0.6774471755066839)

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

**Positive class = eligible for 401k**

*False Positive* = not eligible for 401k but classified **as** eligible 

*False Negative* = eligible for 401k but classified as **not** eligible
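
These counts can be read directly from a confusion matrix. A small sketch with made-up labels, where 1 means "eligible":

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration; 1 = eligible for a 401(k), 0 = not eligible.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"false positives (not eligible, predicted eligible): {fp}")  # 1
print(f"false negatives (eligible, predicted not eligible): {fn}")  # 2
```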

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

*I would argue we should minimize false negatives, because it could prevent people from saving their money and taking advantage of the 401k investments.*

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

https://stats.stackexchange.com/questions/277347/optimise-svm-to-avoid-false-negative-in-binary-classification


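*To minimize false negatives, we would optimize recall (sensitivity). I would use the SVM as my final model because its training and testing scores are closest together, and I would tune its `class_weight` hyperparameter so that false negatives are penalized more heavily.*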
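
One way to push a classifier toward fewer false negatives is to weight the positive class more heavily and track recall. A minimal sketch on synthetic data follows; the dataset, the 3x weight, and the variable names are illustrative assumptions, not values from the lab.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, mildly imbalanced data standing in for the lab data (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.6, 0.4], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

scaler = StandardScaler()
Z_tr = scaler.fit_transform(X_tr)
Z_te = scaler.transform(X_te)

recalls = {}
for label, class_weight in [("unweighted", None), ("positive class x3", {0: 1, 1: 3})]:
    # A larger weight on class 1 makes missed positives (false negatives) costlier.
    model = SVC(class_weight=class_weight).fit(Z_tr, y_tr)
    recalls[label] = recall_score(y_te, model.predict(Z_te))

for label, value in recalls.items():
    print(f"{label}: test recall = {value:.3f}")
```

Raising recall this way usually costs some precision, so both should be checked before settling on a weight.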

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

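*Precision = true positives / (true positives + false positives), i.e., true positives over all results that were classified as positive.*

*Recall = true positives / (true positives + false negatives).*

*The F1 score is the harmonic mean of precision and recall, which works out to 2 × true positives / (2 × true positives + false positives + false negatives). Because it is only high when both precision and recall are high, it balances false positives against false negatives.*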
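
As a quick check of these definitions, the sketch below computes precision, recall, and F1 on made-up labels and compares the hand-computed harmonic mean with `f1_score`. Here there are 2 true positives, 1 false positive, and 2 false negatives.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels for illustration: TP = 2, FP = 1, FN = 2, TN = 5.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4

# Harmonic mean of precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred)

print(f"precision = {precision:.4f}, recall = {recall:.4f}")
print(f"F1 (manual) = {f1_manual:.4f}, F1 (sklearn) = {f1_sklearn:.4f}")  # both 0.5714
```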

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [279]:
from sklearn.metrics import f1_score

In [291]:
#with help from lab review breakfast hour

def f1_scorer(model,X_train, X_test, y_train, y_test):
    f1_train = f1_score(y_true=y_train,
                       y_pred = model.predict(X_train))
    f1_test = f1_score(y_true = y_test,
                      y_pred = model.predict(X_test))
    print(f'Training score: {model} is {f1_train}, Testing score: {model} is {f1_test}')
    print()

In [292]:
f1_scorer(lgr,X_train, X_test, y_train, y_test)

f1_scorer(knn,X_train, X_test, y_train, y_test)

f1_scorer(dtc,X_train, X_test, y_train, y_test)

f1_scorer(bgc,X_train, X_test, y_train, y_test)

f1_scorer(rfc,X_train, X_test, y_train, y_test)

f1_scorer(etc,X_train, X_test, y_train, y_test)

f1_scorer(abc,X_train, X_test, y_train, y_test)

f1_scorer(svc,X_train, X_test, y_train, y_test)


Training score: LogisticRegression() is 0.0, Testing score: LogisticRegression() is 0.0

Training score: KNeighborsClassifier() is 0.07669808254793631, Testing score: KNeighborsClassifier() is 0.08040201005025127

Training score: DecisionTreeClassifier() is 0.5576208178438662, Testing score: DecisionTreeClassifier() is 0.5456638526477361

Training score: BaggingClassifier() is 0.5799736495388669, Testing score: BaggingClassifier() is 0.5800391389432484

Training score: RandomForestClassifier() is 0.15424912689173456, Testing score: RandomForestClassifier() is 0.15602836879432622

Training score: ExtraTreesClassifier(n_estimators=500) is 0.246064139941691, Testing score: ExtraTreesClassifier(n_estimators=500) is 0.264783759929391

Training score: AdaBoostClassifier() is 0.48099516240497575, Testing score: AdaBoostClassifier() is 0.4846912298910223

Training score: SVC() is 0.0, Testing score: SVC() is 0.0
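
An F1 of exactly 0.0 means the model predicted no true positives at all. Since the classifiers above were fit on the standardized arrays (`Z_train`), one thing worth checking is that `predict` receives inputs scaled the same way; unscaled inputs can collapse the predictions into a single class. The sketch below compares both cases on synthetic data. The dataset, the shift and stretch applied to it, and the helper name `f1_train_test` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption), shifted and stretched to mimic unscaled columns.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0,
    class_sep=2.0, random_state=42,
)
X_demo = X_demo * 50 + 1000
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

scaler = StandardScaler()
Z_tr = scaler.fit_transform(X_tr)
Z_te = scaler.transform(X_te)

clf = LogisticRegression().fit(Z_tr, y_tr)  # fit on scaled features


def f1_train_test(model, features_train, features_test, labels_train, labels_test):
    """Return (train F1, test F1) for an already fitted classifier."""
    return (
        f1_score(labels_train, model.predict(features_train), zero_division=0),
        f1_score(labels_test, model.predict(features_test), zero_division=0),
    )


f1_scaled = f1_train_test(clf, Z_tr, Z_te, y_tr, y_te)    # matches how clf was fit
f1_unscaled = f1_train_test(clf, X_tr, X_te, y_tr, y_te)  # mismatched inputs
print(f"F1 on scaled inputs   (train, test): {f1_scaled}")
print(f"F1 on unscaled inputs (train, test): {f1_unscaled}")
```

Wrapping the scaler and the model in a single `Pipeline` removes the need to remember which array to pass.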



##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

#### "The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, if either the precision or the recall is zero."

*I do not see evidence of overfitting in these scores; they are either at the lowest possible value or very similar between training and testing.*

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

*I would use the Adaboost Classifier as it has the highest f1 scores in both training and testing.*

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

*I would use a grid search to tune hyperparameters such as the depth of the decision trees and the number of estimators.*
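
A hedged sketch of what that grid search could look like with `GridSearchCV`, scoring on F1. The synthetic data and the small parameter grid are assumptions chosen to keep the example fast, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the lab data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

# Small illustrative grid; a real search would cover more values.
param_grid = {"n_estimators": [25, 50], "learning_rate": [0.5, 1.0]}

search = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    param_grid,
    scoring="f1",  # choose hyperparameters by cross-validated F1
    cv=3,
)
search.fit(X_tr, y_tr)

print(f"best params: {search.best_params_}")
print(f"best cross-validated F1: {search.best_score_:.3f}")
print(f"test F1 of refit model: {search.score(X_te, y_te):.3f}")
```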

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.
