## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from math import sqrt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import BaggingRegressor, BaggingClassifier, RandomForestRegressor, RandomForestClassifier, AdaBoostRegressor, AdaBoostClassifier
from sklearn.metrics import mean_squared_error, f1_score
from sklearn import svm

%matplotlib inline

## Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, and hopefully the accounts hold a lot more money when you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

## Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [3]:
df = pd.read_csv('401ksubs.csv')

##### 2. What are 2-3 other variables that, if available, would be helpful to have?

In [4]:
df.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


Retirement date, children's ages, and early withdrawal history would be helpful for identifying potential customers and detecting abuse.

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

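Targeting customers based on race would build racial bias into the model: people could be offered or denied financial products because of their race rather than their circumstances, which is discriminatory.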

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

We would not use `e401k`, `p401k`, or `pira`, because the lab asks us to pretend we do not have access to them when predicting income. We would also not use `incsq`, because it is computed directly from `inc` and would leak the target into the features.

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

The two variables `incsq` and `agesq` appear to have been created through feature engineering, since the raw `inc` and `age` variables are already present. Subject-matter experts may have suspected that these variables have a non-linear relationship with the outcome; for example, income might rise with age and then level off. A squared term lets a linear model capture that curvature, which `age` and `inc` alone could not.

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

The descriptions of `age` and `inc` appear to be errors: each is described as the square of itself when it should be described as the raw, original value.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

The following regression models would be appropriate for predicting one's income:

Linear Regression: predictions and coefficients can easily be interpreted.

Ridge Regression: coefficients are harder to interpret, but regularizing them can improve the predictive performance of the model.

Lasso Regression: regularizes coefficients more harshly than Ridge Regression (it can shrink them all the way to zero), which can improve the predictive performance of the model.

ElasticNet Regression: combines elements of Lasso Regression and Ridge Regression.

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [6]:
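# Exclude e401k, p401k, and pira (per the note above) and the income columns.
features = ['marr','male','agesq','fsize','nettfa']
X = df[features]
# NOTE: the target used here is squared income (`incsq`), not `inc`,
# so every score and RMSE below is in squared-income units.
y = df['incsq']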

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   random_state=42)

In [10]:
ss = StandardScaler()

In [11]:
ss.fit(X_train)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [12]:
X_train_sc = ss.transform(X_train)

In [13]:
X_test_sc = ss.transform(X_test)

### Linear Regression

In [14]:
lr = LinearRegression()

In [15]:
lr.fit(X_train_sc, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [16]:
cross_val_score(lr, X_train_sc, y_train).mean()



0.2508303293885998

In [17]:
lr.score(X_train_sc, y_train)

0.2535546958398234

In [18]:
lr.score(X_test_sc, y_test)

0.1767209397886863

### K Nearest Neighbors

In [19]:
knn = KNeighborsRegressor()

In [20]:
knn.fit(X_train_sc, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                    metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                    weights='uniform')

In [21]:
cross_val_score(knn, X_train_sc, y_train).mean()



0.23088654138199372

In [22]:
knn.score(X_train_sc, y_train)

0.48097097365025354

In [23]:
knn.score(X_test_sc,y_test)

0.22033294839055806

### Decision Tree

In [24]:
dt = DecisionTreeRegressor()

In [25]:
dt.fit(X_train_sc, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

In [26]:
cross_val_score(dt, X_train_sc,y_train).mean()



-0.3331787593133045

In [27]:
dt.score(X_train_sc, y_train)

0.9930593439412755

In [28]:
dt.score(X_test_sc, y_test)

-0.5289731075991289

### Bagged Decision Tree

In [29]:
bag = BaggingRegressor()

In [30]:
bag.fit(X_train_sc, y_train)

BaggingRegressor(base_estimator=None, bootstrap=True, bootstrap_features=False,
                 max_features=1.0, max_samples=1.0, n_estimators=10,
                 n_jobs=None, oob_score=False, random_state=None, verbose=0,
                 warm_start=False)

In [31]:
cross_val_score(bag, X_train_sc, y_train).mean()



0.18759057401619106

In [32]:
bag.score(X_train_sc, y_train)

0.8469138507353811

In [33]:
bag.score(X_test_sc,y_test)

0.12495589954806585

### Random Forests

In [34]:
rf = RandomForestRegressor()

In [35]:
rf.fit(X_train_sc, y_train)



RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=10,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

In [36]:
cross_val_score(rf, X_train_sc, y_train).mean()



0.20391540038136358

In [37]:
rf.score(X_train_sc, y_train)

0.8462960783486458

In [38]:
rf.score(X_test_sc, y_test)

0.10996700939283766

### AdaBoost

In [39]:
ada = AdaBoostRegressor()

In [40]:
ada.fit(X_train_sc, y_train)

AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
                  n_estimators=50, random_state=None)

In [41]:
cross_val_score(ada, X_train_sc, y_train).mean()



-0.18036926446648657

In [42]:
ada.score(X_train_sc, y_train)

-0.4053579547544468

In [43]:
ada.score(X_test_sc, y_test)

-0.4826476266622852

### Support Vector Machine

In [44]:
svr = svm.SVR()

In [45]:
svr.fit(X_train_sc, y_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='rbf', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)

In [46]:
cross_val_score(svr, X_train_sc, y_train).mean()



-0.07432116300293101

In [48]:
svr.score(X_train_sc, y_train)

-0.06097722225719848

In [50]:
svr.score(X_test_sc, y_test)

-0.06038690657261925

##### 9. What is bootstrapping?

Bootstrapping is random sampling with replacement from the observed data, typically drawing a sample the same size as the original.
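
As a minimal sketch of the definition above, the snippet below draws one bootstrap sample from a small list using only the standard library. The `ages` values are the five ages shown in `df.head()`; the function name `bootstrap_sample` and the seed are illustrative choices, not part of the lab.

```python
import random

def bootstrap_sample(rows, seed=None):
    """Draw len(rows) items from rows at random, with replacement."""
    rng = random.Random(seed)
    return rng.choices(rows, k=len(rows))

ages = [40, 35, 44, 44, 53]          # the five ages shown in df.head()
sample = bootstrap_sample(ages, seed=42)

print(sample)                         # same length as ages; repeats are allowed
print(len(sample) == len(ages))       # True
```

Because the draw is with replacement, some rows typically appear more than once in a sample and others do not appear at all.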

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

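A single decision tree is fit once, on the full training set. A set of bagged decision trees is built by repeatedly drawing a bootstrap sample (a random sample of the rows, taken with replacement), fitting a separate decision tree to each sample, and then combining the trees' predictions: averaging them for regression or taking a majority vote for classification.

Bagging is an ensemble method. Because each tree sees a slightly different sample, aggregating the trees' predictions reduces the variance of the model compared with a single tree.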
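
The bagging procedure can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the implementation behind `BaggingRegressor`: to keep it short, a one-nearest-neighbor rule stands in for the decision tree as the high-variance base learner, and the toy `ages` and `incomes` values are made up.

```python
import random

def fit_1nn(xs, ys):
    """Base learner: predict the target of the closest training point."""
    pairs = list(zip(xs, ys))
    def predict(x):
        return min(pairs, key=lambda p: abs(p[0] - x))[1]
    return predict

def fit_bagged(xs, ys, n_estimators=25, seed=0):
    """Fit one base learner per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]   # rows drawn with replacement
        models.append(fit_1nn([xs[i] for i in idx], [ys[i] for i in idx]))
    def predict(x):
        return sum(m(x) for m in models) / len(models)
    return predict

# Toy data (made up for illustration): age -> income.
ages = [25, 30, 35, 40, 45, 50, 55, 60]
incomes = [20.0, 31.0, 28.0, 45.0, 41.0, 52.0, 47.0, 50.0]

single = fit_1nn(ages, incomes)
bagged = fit_bagged(ages, incomes)

print(single(42))   # 40 is the closest age, so this copies its income: 45.0
print(bagged(42))   # an average over 25 resampled learners
```

The single learner copies one training point exactly, while the bagged prediction blends several nearby points, which is the variance reduction the answer describes.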

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

Both methods fit each tree on a bootstrap sample of the rows. The difference is that a random forest also considers only a random subset of the features at each split, whereas bagged decision trees consider every feature at every split.
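
The feature-sampling step can be illustrated on its own. The sketch below only shows which features would be offered to a tree at a split; it does not build any trees. The helper name `candidate_features` and the subset size of 2 are assumptions made for illustration.

```python
import random

# Regression features used in this lab.
features = ['marr', 'male', 'agesq', 'fsize', 'nettfa']

def candidate_features(all_features, max_features, rng):
    """Return the features a tree may consider at one split."""
    return rng.sample(all_features, max_features)

rng = random.Random(42)

# Bagged trees: every split may use every feature.
bagging_split = candidate_features(features, len(features), rng)

# Random forest: each split sees only a random subset (size 2 is an arbitrary choice).
forest_splits = [candidate_features(features, 2, rng) for _ in range(3)]

print(sorted(bagging_split) == sorted(features))   # True
print(forest_splits)                                # three subsets of two features each
```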

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

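A random forest might be superior because restricting the features available at each split makes the trees less correlated with one another. Averaging less-correlated trees lowers the variance of the ensemble more than averaging bagged trees does, usually at the cost of a small increase in bias.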
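
A standard textbook result (not derived in this lab) makes the variance argument concrete. Assume the predictions of the $B$ trees, written $T_b(x)$, are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$; these symbols are introduced here for illustration. The variance of the averaged prediction is then:

```latex
\operatorname{Var}\left( \frac{1}{B} \sum_{b=1}^{B} T_b(x) \right)
  = \rho \, \sigma^2 + \frac{1 - \rho}{B} \, \sigma^2
```

Adding more trees shrinks only the second term. Sampling features at each split lowers $\rho$, which shrinks the first term as well.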

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

### Linear Regression

In [66]:
lr_predics_train = lr.predict(X_train_sc)
lr_rms_train = sqrt(mean_squared_error(y_train, lr_predics_train))
lr_rms_train

2577.458304052816

In [54]:
lr_predics_test = lr.predict(X_test_sc)
lr_rms_test = sqrt(mean_squared_error(y_test, lr_predics_test))
lr_rms_test

2771.707049080118

### K Nearest Neighbors

In [55]:
knn_predics_train = knn.predict(X_train_sc)
knn_rms_train = sqrt(mean_squared_error(y_train, knn_predics_train))
knn_rms_train

2149.257625655916

In [56]:
knn_predics_test = knn.predict(X_test_sc)
knn_rms_test = sqrt(mean_squared_error(y_test, knn_predics_test))
knn_rms_test

2697.294596593576

### Decision Tree

In [57]:
dt_predics_train = dt.predict(X_train_sc)
dt_rms_train = sqrt(mean_squared_error(y_train, dt_predics_train))
dt_rms_train

248.53806629171348

In [58]:
dt_predics_test = dt.predict(X_test_sc)
dt_rms_test = sqrt(mean_squared_error(y_test, dt_predics_test))
dt_rms_test

3777.2324763332067

### Bagged Decision Tree

In [59]:
bag_predics_train = bag.predict(X_train_sc)
bag_rms_train = sqrt(mean_squared_error(y_train, bag_predics_train))
bag_rms_train

1167.241184648297

In [60]:
bag_predics_test = bag.predict(X_test_sc)
bag_rms_test = sqrt(mean_squared_error(y_test, bag_predics_test))
bag_rms_test

2857.516601275474

### Random Forests

In [61]:
rf_predics_train = rf.predict(X_train_sc)
rf_rms_train = sqrt(mean_squared_error(y_train, rf_predics_train))
rf_rms_train

1169.5939884750744

In [62]:
rf_predics_test = rf.predict(X_test_sc)
rf_rms_test = sqrt(mean_squared_error(y_test, rf_predics_test))
rf_rms_test

2881.8863104531865

### AdaBoost

In [64]:
ada_predics_train = ada.predict(X_train_sc)
ada_rms_train = sqrt(mean_squared_error(y_train, ada_predics_train))
ada_rms_train

3536.601537168621

In [65]:
ada_predics_test = ada.predict(X_test_sc)
ada_rms_test = sqrt(mean_squared_error(y_test, ada_predics_test))
ada_rms_test

3719.5702458906035

### Support Vector Machine

In [67]:
svr_predics_train = svr.predict(X_train_sc)
svr_rms_train = sqrt(mean_squared_error(y_train, svr_predics_train))
svr_rms_train

3072.880584136111

In [68]:
svr_predics_test = svr.predict(X_test_sc)
svr_rms_test = sqrt(mean_squared_error(y_test, svr_predics_test))
svr_rms_test

3145.617917298943
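
Each RMSE cell above repeats the same calculation: square the errors, average them, and take the square root. A minimal standard-library sketch of that calculation is shown below; the helper name `rmse` and the toy numbers are illustrative and not part of the lab.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the average squared difference."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Toy example: errors of 3 and 4 give sqrt((9 + 16) / 2), about 3.54.
print(rmse([10.0, 20.0], [13.0, 16.0]))
```

RMSE is expressed in the units of the target. Because the target in this section is `incsq`, the values reported above are in the units of `incsq`, not of `inc`.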

##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

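Every model has a higher testing RMSE than training RMSE, so all of them show some overfitting. The Linear Regression, AdaBoost, and Support Vector Machine models are only slightly overfit, while the Decision Tree, Bagged Decision Tree, and Random Forest models are heavily overfit.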
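
One possible way to quantify the gap between training and testing error is the ratio of testing RMSE to training RMSE. The sketch below applies that ratio to the values reported in question 13 (rounded to two decimals); the ratio itself is an illustrative choice, not a metric prescribed by the lab.

```python
# (training RMSE, testing RMSE) as reported in question 13, rounded to two decimals.
rmse_scores = {
    "Linear Regression":      (2577.46, 2771.71),
    "K Nearest Neighbors":    (2149.26, 2697.29),
    "Decision Tree":          (248.54, 3777.23),
    "Bagged Decision Tree":   (1167.24, 2857.52),
    "Random Forests":         (1169.59, 2881.89),
    "AdaBoost":               (3536.60, 3719.57),
    "Support Vector Machine": (3072.88, 3145.62),
}

# A ratio well above 1 means the model does much worse on unseen data.
ratios = {name: test / train for name, (train, test) in rmse_scores.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:<24} test/train RMSE = {ratio:.2f}")
```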

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

I would pick the Linear Regression model because the gap between its RMSE on the training data and on the testing data is among the smallest of all the models, which suggests it will perform about as well on unseen data as it did on the training data. The KNN model has a slightly lower testing RMSE, but it is more overfit than the Linear Regression model. Therefore, I am more comfortable selecting the Linear Regression model.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

1. Bin the `age` column into ranges and turn the bins into categorical (dummy) columns with the `get_dummies` function.
2. Pass the features through polynomial features and use the resulting columns as my `X`.
3. Cube the `age` column instead of only squaring it and see whether that further improves model performance.
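
The first idea can be sketched without any libraries. In pandas, the same idea is usually expressed with `pd.cut` followed by `pd.get_dummies`; the version below spells out the two steps by hand. The cut points and bin labels are arbitrary assumptions chosen for illustration.

```python
def age_bin(age):
    """Map an age to a range label; the cut points are arbitrary assumptions."""
    if age < 35:
        return "under_35"
    if age < 45:
        return "35_to_44"
    if age < 55:
        return "45_to_54"
    return "55_plus"

BINS = ["under_35", "35_to_44", "45_to_54", "55_plus"]

def one_hot(label):
    """Return one 0/1 indicator per bin, in the order of BINS."""
    return [1 if label == b else 0 for b in BINS]

ages = [40, 35, 44, 44, 53]           # the five ages shown in df.head()
encoded = [one_hot(age_bin(a)) for a in ages]

for age, row in zip(ages, encoded):
    print(age, row)
```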

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

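Anyone who participates in a 401(k) must be eligible for one, so `p401k` nearly gives away the target. Including it is close to training the model with the target variable among the features, and the resulting scores would not reflect how the model performs when participation is unknown.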
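
A quick sanity check of this relationship can be run on the five rows shown in `df.head()`. This is only a sketch: five rows say nothing conclusive about the full dataset, and the variable names `violations` and `agreement` are introduced here for illustration.

```python
# e401k and p401k for the five rows shown in df.head().
e401k = [0, 1, 0, 0, 0]
p401k = [0, 1, 0, 0, 0]

# Participation should imply eligibility: no row may have p401k = 1 and e401k = 0.
violations = [i for i, (e, p) in enumerate(zip(e401k, p401k)) if p == 1 and e == 0]
print(violations)        # [] for these five rows

# How often does p401k alone match the target in these rows?
agreement = sum(e == p for e, p in zip(e401k, p401k)) / len(e401k)
print(agreement)         # 1.0 for these five rows
```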

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

The following models are appropriate for solving a classification problem:

Logistic Regression: coefficients can be interpreted.

K-Nearest Neighbors: classifies an observation by a vote among its nearest neighbors; with distance weighting, nearer neighbors contribute more to the vote than more distant ones.

Decision Trees: can predict a target that takes a discrete set of values.

Bagged Decision Trees: reduce the variance of a single decision tree.

Random Forest: each individual tree outputs a class prediction, and the class with the most votes becomes the model's prediction.

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [72]:
features = ['incsq', 'marr', 'male', 'agesq', 'fsize', 'nettfa', 'pira']

X = df[features]
y = df['e401k']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ss = StandardScaler()

In [73]:
ss.fit(X_train)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [75]:
X_train_sc = ss.transform(X_train)
X_test_sc = ss.transform(X_test)

### Logistic Regression

In [78]:
logreg= LogisticRegression()

In [79]:
logreg.fit(X_train_sc, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [80]:
cross_val_score(logreg, X_train_sc, y_train).mean()



0.6355647456685807

In [81]:
logreg.score(X_train_sc,y_train)

0.6367165037377803

In [82]:
logreg.score(X_test_sc, y_test)

0.648124191461837

### K Nearest Neighbors

In [83]:
knn = KNeighborsClassifier()

In [84]:
knn.fit(X_train_sc, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [85]:
cross_val_score(knn, X_train_sc, y_train).mean()



0.625791903444309

In [86]:
knn.score(X_train_sc, y_train)

0.7514376078205866

In [87]:
knn.score(X_test_sc, y_test)

0.64381198792583

### Decision Tree

In [88]:
dt = DecisionTreeClassifier()

In [89]:
dt.fit(X_train_sc, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [90]:
cross_val_score(dt, X_train_sc, y_train).mean()



0.5990536332351512

In [91]:
dt.score(X_train_sc, y_train)

1.0

In [92]:
dt.score(X_test_sc, y_test)

0.5942216472617508

### Bagged Decision Tree

In [93]:
bag = BaggingClassifier()

In [94]:
bag.fit(X_train_sc, y_train)

BaggingClassifier(base_estimator=None, bootstrap=True, bootstrap_features=False,
                  max_features=1.0, max_samples=1.0, n_estimators=10,
                  n_jobs=None, oob_score=False, random_state=None, verbose=0,
                  warm_start=False)

In [95]:
cross_val_score(bag, X_train_sc, y_train).mean()



0.651381740104926

In [96]:
bag.score(X_train_sc, y_train)

0.9777170787809085

In [97]:
bag.score(X_test_sc, y_test)

0.6425183268650281

### Random Forests

In [98]:
rf = RandomForestClassifier()

In [99]:
rf.fit(X_train_sc, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [100]:
cross_val_score(rf, X_train_sc, y_train).mean()



0.6551210913093927

In [101]:
rf.score(X_train_sc, y_train)

0.9746981023576768

In [102]:
rf.score(X_test_sc, y_test)

0.6511427339370418

### AdaBoost

In [103]:
ada = AdaBoostClassifier()

In [104]:
ada.fit(X_train_sc, y_train)

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=50, random_state=None)

In [105]:
cross_val_score(ada, X_train_sc, y_train).mean()



0.679989933851021

In [106]:
ada.score(X_train_sc, y_train)

0.6927832087406556

In [107]:
ada.score(X_test_sc, y_test)

0.685640362225097

### Support Vector Machine

In [108]:
svc = svm.SVC()

In [109]:
svc.fit(X_train_sc, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [110]:
cross_val_score(svc, X_train_sc, y_train).mean()



0.6656138119464857

In [111]:
svc.score(X_train_sc, y_train)

0.6837262794709603

In [112]:
svc.score(X_test_sc, y_test)

0.6774471755066839

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

False Positives: Predicting that someone is eligible for a 401(k) when they are not.

False Negatives: Predicting that someone is not eligible for a 401(k) when they are.
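
To make the two error types concrete, the sketch below counts them from a pair of label lists. The labels are made up for illustration, and `confusion_counts` is a hypothetical helper written for this example, not a scikit-learn function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1          # predicted eligible, actually not eligible
        else:
            if actual == positive:
                fn += 1          # predicted not eligible, actually eligible
            else:
                tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}

# Made-up labels for illustration: 1 = eligible for a 401(k), 0 = not eligible.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

print(confusion_counts(y_true, y_pred))
# {'tp': 2, 'fp': 1, 'tn': 3, 'fn': 2}
```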

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

We would rather minimize false positives, assuming that the cost to the financial services company is greater when it offers a 401(k) to someone who is not actually eligible than when it fails to offer one to someone who is eligible.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

Specificity, TN / (TN + FP), is the metric to optimize, because it increases as the number of false positives decreases.
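
As a small worked example, specificity can be computed directly from the counts of true negatives and false positives. The counts below are hypothetical and the helper name `specificity` is an illustrative choice.

```python
def specificity(tn, fp):
    """True negative rate: share of actual negatives predicted as negative."""
    if tn + fp == 0:
        raise ValueError("no actual negatives, specificity is undefined")
    return tn / (tn + fp)

# Hypothetical counts: 90 ineligible people correctly rejected, 10 wrongly flagged as eligible.
print(specificity(tn=90, fp=10))   # 0.9

# Halving the false positives raises specificity.
print(specificity(tn=95, fp=5))    # 0.95
```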

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

The F1 score is appropriate because it is the harmonic mean of precision and recall, so it is high only when both false positives and false negatives are kept low.
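
The sketch below computes F1 from confusion-matrix counts and compares it with accuracy. The counts are hypothetical, chosen to show how a model that rarely predicts the positive class can look reasonable on accuracy while scoring poorly on F1; `f1_from_counts` is an illustrative helper, not a scikit-learn function.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = harmonic mean of precision and recall (0.0 if there are no true positives)."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for a model that rarely predicts the positive class.
tp, fp, tn, fn = 10, 5, 60, 25

accuracy = (tp + tn) / (tp + fp + tn + fn)
f1 = f1_from_counts(tp, fp, fn)

print(round(accuracy, 3))   # 0.7
print(round(f1, 3))         # 0.4
```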

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

### Logistic Regression

In [113]:
logreg_predics_train = logreg.predict(X_train_sc)
logreg_f1_train = f1_score(y_train,logreg_predics_train)
logreg_f1_train

0.29825048597611775

In [114]:
logreg_predics_test = logreg.predict(X_test_sc)
logreg_f1_test = f1_score(y_test, logreg_predics_test)
logreg_f1_test

0.31197301854974707

### K Nearest Neighbors

In [116]:
knn_predics_train = knn.predict(X_train_sc)
knn_f1_train = f1_score(y_train, knn_predics_train)
knn_f1_train

0.6567401230891404

In [117]:
knn_predics_test = knn.predict(X_test_sc)
knn_f1_test = f1_score(y_test, knn_predics_test)
knn_f1_test

0.5006045949214027

### Decision Tree

In [119]:
dt_predics_train = dt.predict(X_train_sc)
dt_f1_train = f1_score(y_train, dt_predics_train)
dt_f1_train

1.0

In [120]:
dt_predics_test = dt.predict(X_test_sc)
dt_f1_test = f1_score(y_test, dt_predics_test)
dt_f1_test

0.48325096101043385

### Bagged Decision Tree

In [121]:
bag_predics_train = bag.predict(X_train_sc)
bag_f1_train = f1_score(y_train, bag_predics_train)
bag_f1_train

0.9710442742387446

In [122]:
bag_predics_test = bag.predict(X_test_sc)
bag_f1_test = f1_score(y_test, bag_predics_test)
bag_f1_test

0.47894406033940917

### Random Forests

In [123]:
rf_predics_train = rf.predict(X_train_sc)
rf_f1_train = f1_score(y_train, rf_predics_train)
rf_f1_train

0.9670658682634731

In [124]:
rf_predics_test = rf.predict(X_test_sc)
rf_f1_test = f1_score(y_test, rf_predics_test)
rf_f1_test

0.49595015576323986

### AdaBoost

In [125]:
ada_predics_train = ada.predict(X_train_sc)
ada_f1_train = f1_score(y_train, ada_predics_train)
ada_f1_train

0.569066344020972

In [126]:
ada_predics_test = ada.predict(X_test_sc)
ada_f1_test = f1_score(y_test, ada_predics_test)
ada_f1_test

0.5552165954850519

### Support Vector Machine

In [127]:
svc_predics_train = svc.predict(X_train_sc)
svc_f1_train = f1_score(y_train, svc_predics_train)
svc_f1_train

0.46962391513982643

In [128]:
svc_predics_test = svc.predict(X_test_sc)
svc_f1_test = f1_score(y_test, svc_predics_test)
svc_f1_test

0.4491899852724595

##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

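There is evidence of overfitting in the K Nearest Neighbors, Decision Tree, Bagged Decision Tree, Random Forest, AdaBoost, and Support Vector Machine models, since each scores higher on the training data than on the testing data. The gap is largest for the Decision Tree, Bagged Decision Tree, and Random Forest models and small for the AdaBoost and Support Vector Machine models.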
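
The comparison behind this answer can be tabulated from the scores reported in question 24 (rounded to three decimals). The sketch below computes the train-minus-test gap for each model; using the raw difference as the measure of overfitting is an illustrative choice.

```python
# (training f1, testing f1) as reported in question 24, rounded to three decimals.
f1_scores = {
    "Logistic Regression":    (0.298, 0.312),
    "K Nearest Neighbors":    (0.657, 0.501),
    "Decision Tree":          (1.000, 0.483),
    "Bagged Decision Tree":   (0.971, 0.479),
    "Random Forests":         (0.967, 0.496),
    "AdaBoost":               (0.569, 0.555),
    "Support Vector Machine": (0.470, 0.449),
}

# A positive gap means the model scores higher on data it was trained on.
gaps = {name: train - test for name, (train, test) in f1_scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<24} train - test f1 = {gap:+.3f}")

best_on_test = max(f1_scores, key=lambda name: f1_scores[name][1])
print(best_on_test)   # AdaBoost
```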

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

I would pick the AdaBoost model: it has the highest testing f1-score of all the models and is only slightly overfit.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

1. Spend more time on feature engineering, including passing the features through polynomial features.

2. Grid search each model's hyperparameters to see whether its performance improves.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.