## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, so the account will hopefully hold a lot more money by the time you retire.
- These are a little different from regular bank accounts in that there are certain tax benefits to these accounts. Also, employers frequently match money that you put into a 401k.
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [2]:
df_401k = pd.read_csv('401ksubs.csv')
df_401k.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,1,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,0,511.393,2809


In [3]:
df_401k.dtypes

e401k       int64
inc       float64
marr        int64
male        int64
age         int64
fsize       int64
nettfa    float64
p401k       int64
pira        int64
incsq     float64
agesq       int64
dtype: object

In [4]:
df_401k.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9275 entries, 0 to 9274
Data columns (total 11 columns):
e401k     9275 non-null int64
inc       9275 non-null float64
marr      9275 non-null int64
male      9275 non-null int64
age       9275 non-null int64
fsize     9275 non-null int64
nettfa    9275 non-null float64
p401k     9275 non-null int64
pira      9275 non-null int64
incsq     9275 non-null float64
agesq     9275 non-null int64
dtypes: float64(3), int64(8)
memory usage: 797.1 KB


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

Possible strong predictors of income:
1. Education
2. Occupation or industry
3. Years of total working experience

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

Race is a sensitive attribute, and including it can increase the identifiability of people in the data. Using it to decide who receives advertising for financial products would mean treating people differently on the basis of race, which is discriminatory. Employers are likewise expected to be fair and impartial when deciding on salary or 401(k) eligibility and must not decide based on race, so race should not be used as a model input.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we reasonably not use? Why?

- We would not use `pira` (a dummy variable equal to 1 if the person has an IRA), `e401k` or `p401k` (which indicate whether the person is eligible for and participates in a 401(k), respectively).

- We should also be wary of `incsq`: it is simply `inc` squared, so it is computed from the target itself and would not be available for someone whose income is unknown.

- For predicting income, we might not use `nettfa` (net total financial assets, in $1000s), because net worth and income are different concepts and may not be linearly related. A high income does give a person more purchasing power to acquire financial assets, but the very wealthy or retirees may be asset-rich while having little or no income.

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

- `agesq` and `incsq` are the two variables that have been engineered by subject-matter experts.
- The relationship between age and income is probably not linear. In roughly the first 20 years of life, a person most likely has no income (or negligible income from part-time work) because they are a dependent or still in school. Over the next 30 to 40 years, income has the potential to grow commensurate with work experience. Including `agesq` lets a linear model fit a quadratic curve in age instead of a straight line, which may reflect this nonlinear relationship more accurately.
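To illustrate the hypothesis above, here is a minimal sketch on synthetic data. The income curve, its peak at age 50 and the noise level are all made up for illustration and are not taken from the 401k dataset. It compares a linear model that sees only age with one that also sees age squared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
age = rng.uniform(20, 65, size=500)
# Hypothetical concave income curve peaking at age 50, plus noise.
income = 60 - 0.05 * (age - 50) ** 2 + rng.normal(0, 2, size=500)

X_linear = age.reshape(-1, 1)               # age only
X_quad = np.column_stack([age, age ** 2])   # age and age squared

r2_linear = LinearRegression().fit(X_linear, income).score(X_linear, income)
r2_quad = LinearRegression().fit(X_quad, income).score(X_quad, income)

print(f"R^2 with age only:      {r2_linear:.3f}")
print(f"R^2 with age and age^2: {r2_quad:.3f}")
```

The model is still linear in its coefficients; the squared column is what allows the fitted curve to bend.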

##### 6. Looking at the data dictionary, one variable description appears to be an error. What is this error, and what do you think the correct value would be?

- The variable label for `age` should not be age^2 but simply age. There is a separate variable, `agesq`, for age^2.

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable.

##### 7. List all modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific regression problem and explain why or why not.

Answer:

- *OLS regression*

In multiple linear regression, the best-fit line (a hyperplane when there are several predictors) is found by minimizing the sum of squared residuals, i.e. the squared differences between the observed and predicted values of the target. The final model can be restricted to statistically significant variables by iteratively removing those with a high p-value (e.g. above 0.05). Because it predicts a numeric target like income, it is _suitable_.

**Pros**: It will do the job of predicting a continuous variable like income, with some feature engineering. The model can be extended with standardization/scaling and regularisation (adding a penalty on coefficient size to the cost function).

**Cons**: We need to check that the data meet the key OLS assumptions (linearity, independence of errors, normally distributed residuals, equal variances/homoscedasticity and, for multiple regression, no strong multicollinearity among predictors). The strength of the model depends on which predictor variables remain after removing insignificant ones based on their p-values.

- Logistic Regression 

Logistic regression predicts class membership, so it applies to classification problems only; _not suitable_.

- K nearest neighbours (usually introduced for classification, but it also handles regression by averaging the neighbours' target values; requires standardisation/scaling)

**Pros**: May be a good indicator if we assume that people of the same sex, marital status, family size, net financial assets, etc. are in roughly the same income band.

**Cons**: We can't really rely on the assumption above, as the variables given depend on one's circumstances and are not close predictors of earning ability (e.g. with a variable like education level, we could assume with more confidence that people at the same level are in similar income bands, which is less true of family size).

- Decision Trees/ Bagged Decision Trees

_suitable_

- Random Forest

_suitable_


##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [5]:
from sklearn.pipeline import Pipeline, make_pipeline #can try to explore make_pipeline, don't have to write so much

from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

from sklearn.preprocessing import StandardScaler

In [6]:
df_401k_inc = df_401k.drop(columns = ['e401k', 'p401k', 'pira'], inplace = False) 
#df_401k_inc = df_401k.drop(['e401k', 'p401k', 'pira'], axis = 1) #This is equivalent

df_401k_inc.head()

Unnamed: 0,inc,marr,male,age,fsize,nettfa,incsq,agesq
0,13.17,0,0,40,1,4.575,173.4489,1600
1,61.23,0,1,35,1,154.0,3749.113,1225
2,12.858,1,0,44,2,0.0,165.3282,1936
3,98.88,1,1,44,2,21.8,9777.254,1936
4,22.614,0,0,53,1,18.45,511.393,2809


In [7]:
df_401k_inc.shape

(9275, 8)

In [8]:
# Conduct train/test split
X_train, X_test, y_train, y_test = train_test_split(df_401k_inc.drop(['inc'], axis=1),
                                                    df_401k_inc['inc'],
                                                    test_size = 0.2,
                                                    random_state = 88)

In [9]:
#Check
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(7420, 7)
(1855, 7)
(7420,)
(1855,)


A) Multiple Linear Regression model

In [10]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV

In [11]:
ss = StandardScaler()
linreg = LinearRegression()

Note to self: Standardization is necessary for regularized regression because the penalty is applied to the size of the beta coefficients, so the predictor variables must be on the same scale. If betas differ in size just because of the units of the predictor variables, the regularization term penalizes them unevenly and cannot tell which betas are more or less important based on their size.
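The note above can be checked with a small experiment. This is only a sketch on synthetic data (the features, coefficients and `alpha=100` are arbitrary choices for illustration): the same information is fed to `Ridge` twice, once with the first feature multiplied by 1000, with and without a `StandardScaler` in front.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Same information, but the first feature is recorded in units 1000x smaller.
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000

# Without scaling, the penalty treats the two versions differently.
raw_a = Ridge(alpha=100).fit(X, y).predict(X)
raw_b = Ridge(alpha=100).fit(X_rescaled, y).predict(X_rescaled)

# With scaling inside the pipeline, the unit change no longer matters.
scaled_a = make_pipeline(StandardScaler(), Ridge(alpha=100)).fit(X, y).predict(X)
scaled_b = make_pipeline(StandardScaler(), Ridge(alpha=100)).fit(X_rescaled, y).predict(X_rescaled)

print("Max prediction change without scaling:", np.abs(raw_a - raw_b).max())
print("Max prediction change with scaling:   ", np.abs(scaled_a - scaled_b).max())
```

An unregularized `LinearRegression` would give the same predictions in both cases, which is why scaling matters mainly once a penalty is added.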

In [12]:
pipe_lm = make_pipeline(ss, linreg)

In [13]:
pipe_lm.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linearregression', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])

In [14]:
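# Scale the training features, then cross-validate the linear regression.
# Note: scaling does not change R^2 for an unregularized linear model.
Xs = ss.fit_transform(X_train)

linreg_scores = cross_val_score(linreg, Xs, y_train, cv=10)

print(linreg_scores)           # R^2 on each of the 10 folds
print(np.mean(linreg_scores))  # mean cross-validated R^2 (baseline)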

[0.91368188 0.90285288 0.88221961 0.88364166 0.91373542 0.87978744
 0.91778289 0.89438678 0.90410633 0.88279409]
0.8974988976517574


The mean cross-validated R² of the model is about 0.90.

In [None]:
#Optional
#Do more: drop p values
#Do more: regularisation

B) K Nearest Neighbours

In [20]:
from sklearn.neighbors import KNeighborsRegressor
ss = StandardScaler()
knn_reg = KNeighborsRegressor()
pipe_knn_reg = make_pipeline(ss, knn_reg)

In [23]:
pipe_knn_reg.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('kneighborsregressor', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=5, p=2,
          weights='uniform'))])

In [82]:
print(f'Score on training set: {pipe_knn_reg.score(X_train, y_train)}')
print(f'Score on testing set: {pipe_knn_reg.score(X_test, y_test)}')

Score on training set: 0.980641394648469
Score on testing set: 0.9705777227109545


In [None]:
#Optional - to do GridSearch

#pipe_gs = Pipeline([
#        ('ss', ss),
#        ('knn_reg', knn_reg)
#        ])

#pipe_gs_params = {'ss__with_mean': [True, False], 
#                 'ss__with_std': [True, False],
#                 'knn_reg__p': [1, 2], 
#                 'knn_reg__weights': ['uniform', 'distance'],
#                 'knn_reg__n_neighbors': [3, 5, 10]}
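For reference, a minimal runnable sketch of a grid search over a scaler-plus-KNN pipeline. It uses synthetic data from `make_regression` in place of the 401k data, and the grid values are arbitrary. The point it shows is the naming rule: each parameter key is the pipeline step name, two underscores, then the parameter name.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration only).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

pipe = Pipeline([
    ("ss", StandardScaler()),
    ("knn_reg", KNeighborsRegressor()),
])

# Keys follow the pattern <step name>__<parameter name>.
param_grid = {
    "knn_reg__p": [1, 2],
    "knn_reg__weights": ["uniform", "distance"],
    "knn_reg__n_neighbors": [3, 5, 10],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the scaler sits inside the pipeline, it is refit on the training portion of every fold, so no information from the validation fold leaks into the scaling.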

C) Decision Trees

In [25]:
from sklearn.tree import DecisionTreeRegressor #DecisionTreeClassifier

In [26]:
# Define Gini function, called gini.
def gini(obs):
    
    # Create a list to store my squared class probabilities.
    gini_sum = []
    
    # Iterate through each class.
    for class_i in set(obs):
        
        # Calculate observed probability of class i.
        prob = (obs.count(class_i) / len(obs))
        
        # Square the probability and append it to gini_sum.
        gini_sum.append(prob ** 2)
        
    # Return Gini impurity.
    return 1 - sum(gini_sum)

In [27]:
dt_reg = DecisionTreeRegressor()
dt_reg.fit(X_train, y_train)
# Evaluate model.
print(f'Score on training set: {dt_reg.score(X_train, y_train)}')
print(f'Score on testing set: {dt_reg.score(X_test, y_test)}')

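The R² score for the decision tree is high, close to 1.0. Note that `incsq` is the square of the target `inc` and is still among the features, so a tree can recover income almost exactly from it; the score likely reflects this leakage rather than genuine predictive power.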

D) Bagged Decision Trees

In [31]:
from sklearn.ensemble import BaggingRegressor

# Instantiate BaggingRegressor.
bag_reg = BaggingRegressor(random_state = 42)

# Fit BaggingRegressor.
bag_reg.fit(X_train, y_train)

# Score BaggingRegressor (R^2 on the test set).
bag_reg.score(X_test, y_test)

0.9998032442893479

The R² score for bagged decision trees is similarly high, close to 1.0.


E) Random Forest

In [32]:
from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor(n_estimators=100)
cross_val_score(rf_reg, X_train, y_train, cv=5).mean()

In [35]:
# GridSearch: slow because it fits 3 x 6 = 18 parameter combinations x 5 folds = 90 forests, plus a final refit.
# Open question: how to choose n_estimators and max_depth?

rf_reg_params = {
    'n_estimators': [100, 150, 200],
    'max_depth': [None, 1, 2, 3, 4, 5],
}
gs = GridSearchCV(rf_reg, param_grid=rf_reg_params, cv=5)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_

0.9999543749162977


{'max_depth': None, 'n_estimators': 150}

In [36]:
gs.score(X_train, y_train)

0.9999957615973312

Comment: The R² was already close to 1.0 before GridSearch, so tuning had little room to improve it.

F) Adaboost model

In [42]:
from sklearn.ensemble import AdaBoostRegressor
ada = AdaBoostRegressor(base_estimator=DecisionTreeRegressor())
ada_params = {
    'n_estimators': [50,100],
    'base_estimator__max_depth': [1,2],
    'learning_rate': [.9, 1.]
}
gs = GridSearchCV(ada, param_grid=ada_params, cv=3)
gs.fit(X_train, y_train)
print(gs.best_score_)
gs.best_params_

0.9326629713342318


{'base_estimator__max_depth': 2, 'learning_rate': 1.0, 'n_estimators': 50}

In [41]:
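# `ada` itself was never fit (GridSearchCV fits clones of it), so fit it before scoring.
ada.fit(X_train, y_train)
print(f'Score on training set: {ada.score(X_train, y_train)}')
print(f'Score on testing set: {ada.score(X_test, y_test)}')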

In [None]:
print(f'Score on training set (GridSearch): {gs.score(X_train, y_train)}')
print(f'Score on testing set (GridSearch): {gs.score(X_test, y_test)}')

Comment:

G) Support Vector Regressor

In [46]:
from sklearn.svm import LinearSVR, SVR
from sklearn.model_selection import StratifiedKFold

In [91]:
svr = LinearSVR(max_iter=20000)
svr.fit(X_train, y_train)

LinearSVR(C=1.0, dual=True, epsilon=0.0, fit_intercept=True,
     intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=20000,
     random_state=None, tol=0.0001, verbose=0)

In [92]:
# Evaluate model.
print(f'Score on training set: {svr.score(X_train, y_train)}')
print(f'Score on testing set: {svr.score(X_test, y_test)}')

Score on training set: 0.5454566880487903
Score on testing set: 0.4925086411469213


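Comment: R² is low on both the training set (0.55) and the testing set (0.49), which points to underfitting more than overfitting. The features were not scaled before fitting `LinearSVR`, which may contribute to the poor fit.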

##### 9. What is bootstrapping?

Bootstrapping is a resampling technique that involves repeatedly drawing *random* samples from our source data *with replacement*, typically each the same size as the original data, in order to estimate a population parameter or the variability of a statistic.
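A minimal sketch of the idea, using a made-up sample of 200 values (not the 401k data) and 2000 resamples, both arbitrary choices. Each resample is drawn with replacement and has the same size as the original sample; the spread of the resampled means gives a rough 95% interval for the mean.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=10, size=200)  # hypothetical observed data

# Draw 2000 bootstrap resamples and record the mean of each one.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(2000)
])

# Percentile interval from the bootstrap distribution of the mean.
lower, upper = np.percentile(boot_means, [2.5, 97.5])

print(f"Sample mean: {sample.mean():.2f}")
print(f"95% bootstrap interval: ({lower:.2f}, {upper:.2f})")
```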

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

Ans: A single decision tree is fit once on the full training set. Bagging (bootstrap aggregation) is used when our goal is to reduce the variance of a decision tree. The idea is to create several bootstrap samples of the training data, each drawn randomly with replacement, and to train a separate decision tree on each sample. As a result, we end up with an ensemble of different trees. Their predictions are averaged (or put to a vote for classification), which is more robust than a single decision tree.
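The procedure can be written out by hand in a few lines. This is a sketch on synthetic data, with 50 trees chosen arbitrarily; in practice `BaggingRegressor` does the same job. It fits one tree per bootstrap sample and averages their predictions, then compares the result with a single tree.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration only).
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(50):
    # Bootstrap sample: row indices drawn with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Aggregate: average the predictions of all trees.
bagged_pred = np.mean([t.predict(X_te) for t in trees], axis=0)
single_pred = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr).predict(X_te)

bagged_rmse = float(np.sqrt(np.mean((y_te - bagged_pred) ** 2)))
single_rmse = float(np.sqrt(np.mean((y_te - single_pred) ** 2)))

print(f"Single tree test RMSE:  {single_rmse:.1f}")
print(f"Bagged trees test RMSE: {bagged_rmse:.1f}")
```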

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

Ans: A random forest is an extension of bagging. In addition to training each tree on a random bootstrap sample of the data, it considers only a random subset of the features at each split, rather than all features, when growing the trees.

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

Ans: A random forest decorrelates the trees in the ensemble from one another. In bagging, a few strong predictors tend to dominate the top splits of every tree, so the trees are highly correlated and averaging them reduces variance less than it could. By looking at a randomly selected subset of the features (X variables) at each split, a random forest accepts slightly higher bias in each individual tree in exchange for lower variance in the averaged ensemble.

The fundamental difference is that in random forests, only a subset of features is selected at random out of the total, and the best split feature from that subset is used to split each node in a tree, unlike in bagging, where all features are considered for splitting a node.
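In scikit-learn this difference comes down to the `max_features` setting of the individual trees. The sketch below uses synthetic data with 8 features and sets `max_features=3` for the forest; both numbers are arbitrary choices for illustration. It then inspects how many features each fitted tree considers per split.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration only).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Bagging: each tree sees a bootstrap sample and considers all 8 features at every split.
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=25, random_state=0)
bag.fit(X_demo, y_demo)

# Random forest: also bootstraps rows, but each split considers only 3 random features.
forest = RandomForestRegressor(n_estimators=25, max_features=3, random_state=0)
forest.fit(X_demo, y_demo)

print("Features considered per split (bagging):", bag.estimators_[0].max_features_)
print("Features considered per split (forest): ", forest.estimators_[0].max_features_)
```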

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [None]:
#Is there a python function for rmse or do i define a function on my own?
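One common approach is to take the square root of scikit-learn's `mean_squared_error`. The helper name `rmse` and the three example values below are made up for illustration. Recent scikit-learn releases (1.4 and later) also provide `sklearn.metrics.root_mean_squared_error`, which may not exist in older installations.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Hypothetical values: the errors are 1, 0 and -2.
example = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(example)
```

With a fitted model, the same helper could be called as `rmse(y_train, model.predict(X_train))` and `rmse(y_test, model.predict(X_test))`.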

##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

In [None]:
# If train RMSE is higher, is it overfit or underfit and vice versa?
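As a rule of thumb, a training RMSE much lower than the testing RMSE suggests overfitting, while a high RMSE on both suggests underfitting. The sketch below illustrates both cases on synthetic data (not the 401k data) with an unrestricted tree and a one-split tree; the helper `train_test_rmse` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption for illustration only).
X_demo, y_demo = make_regression(n_samples=400, n_features=5, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

def train_test_rmse(model):
    """Fit the model and return (training RMSE, testing RMSE)."""
    model.fit(X_tr, y_tr)
    train = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    test = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    return float(train), float(test)

# Unrestricted tree: memorizes the training data (overfit).
deep_train, deep_test = train_test_rmse(DecisionTreeRegressor(random_state=0))
# One-split tree: too simple to fit even the training data well (underfit).
stump_train, stump_test = train_test_rmse(DecisionTreeRegressor(max_depth=1, random_state=0))

print(f"Deep tree  train RMSE {deep_train:.1f}, test RMSE {deep_test:.1f}")
print(f"Stump      train RMSE {stump_train:.1f}, test RMSE {stump_test:.1f}")
```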

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

In [None]:
# Answer at the end only

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

In [None]:
# Depends on the answer to Q15. Things we could try include regularisation and collecting more data (more rows if the final model is overfit, more columns/features if it is underfit).

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

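Participation implies eligibility: anyone with `p401k` = 1 must also have `e401k` = 1, so `p401k` leaks the target. A model using it would look more accurate than it really is, and it would be of little use for finding potential customers, because we would not know participation for someone whose eligibility is unknown.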
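A quick way to see the relationship between participation and eligibility is a cross-tabulation. The eight rows below are invented for illustration; with the real data the same call would be `pd.crosstab(df_401k['p401k'], df_401k['e401k'])`.

```python
import pandas as pd

# Toy rows (made up): participation is only possible when eligible.
toy = pd.DataFrame({
    "e401k": [1, 1, 1, 0, 0, 1, 0, 1],
    "p401k": [1, 0, 1, 0, 0, 1, 0, 0],
})

# Rows are p401k values, columns are e401k values.
table = pd.crosstab(toy["p401k"], toy["e401k"])
print(table)

# Count of participants who are not eligible.
participants_not_eligible = int(table.loc[1, 0])
print(participants_not_eligible)
```

If the cell for participants who are not eligible is zero, then knowing `p401k` = 1 determines the target exactly.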

In [50]:
#Drop the variable 'p401k'

df_401k_eligible = df_401k.drop(['p401k'], axis = 1)

In [51]:
df_401k_eligible.head()

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,pira,incsq,agesq
0,0,13.17,0,0,40,1,4.575,1,173.4489,1600
1,1,61.23,0,1,35,1,154.0,0,3749.113,1225
2,0,12.858,1,0,44,2,0.0,0,165.3282,1936
3,0,98.88,1,1,44,2,21.8,0,9777.254,1936
4,0,22.614,0,0,53,1,18.45,0,511.393,2809


##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6). For each tactic, identify whether it is or is not appropriate for solving this specific classification problem and explain why or why not.

Answer:

- Logistic Regression 

Designed for classification problems with a binary target like this one.

**Pros**: It predicts the probability of a binary outcome such as 401(k) eligibility. The model can be improved with standardization/scaling and regularisation.

**Cons**: The strength of the model depends on which predictor variables remain after removing insignificant ones based on their p-values.

**Suitability**: Suitable, but it may not give high accuracy if the features are not linearly related to the log-odds of eligibility.

- K nearest neighbours

**Pros**: May work well if we assume that people of the same sex, marital status, family size, net financial assets, etc. tend to share the same eligibility status.

**Cons**: We can't really rely on the assumption above, as the variables given describe one's personal circumstances and are not close predictors of eligibility, which depends on whether one's employer offers a plan.
**Suitability**:

- Decision Trees

**Pros**:
**Cons**:
**Suitability**:

- A Set of Bagged Decision Trees

**Pros**:
**Cons**:
**Suitability**:

- Random Forest

**Pros**:
**Cons**:
**Suitability**:

_suitable_


##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [53]:
df_401k_eligible.shape

(9275, 10)

In [54]:
# Conduct train/test split
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(df_401k_eligible.drop(['e401k'], axis=1),
                                                    df_401k_eligible['e401k'],
                                                    test_size = 0.2,
                                                    random_state = 88)

In [55]:
#Check
print(X_train_2.shape)
print(X_test_2.shape)
print(y_train_2.shape)
print(y_test_2.shape)

(7420, 9)
(1855, 9)
(7420,)
(1855,)


A) Logistic Regression Model

In [56]:
from sklearn.linear_model import LogisticRegression

In [100]:
logreg = LogisticRegression()

In [101]:
pipe_lr = make_pipeline(ss, logreg) #Remember to scale

In [102]:
pipe_lr.fit(X_train_2, y_train_2)

Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [103]:
# Evaluate model.
print(f'Score on training set: {pipe_lr.score(X_train_2, y_train_2)}')
print(f'Score on testing set: {pipe_lr.score(X_test_2, y_test_2)}')

Score on training set: 0.6586253369272237
Score on testing set: 0.6490566037735849


Comment: Train and test score are similar, no overfitting. Accuracy score is rather low at 0.65.

B) K nearest neighbours

In [57]:
from sklearn.neighbors import KNeighborsClassifier
ss = StandardScaler()
knn_class = KNeighborsClassifier()
pipe_knn_class = make_pipeline(ss, knn_class)

In [58]:
pipe_knn_class.fit(X_train_2, y_train_2)

Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('kneighborsclassifier', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))])

In [93]:
# Evaluate model.
print(f'Score on training set: {pipe_knn_class.score(X_train_2, y_train_2)}')
print(f'Score on testing set: {pipe_knn_class.score(X_test_2, y_test_2)}')

Score on training set: 0.7536388140161725
Score on testing set: 0.631266846361186


Comment: Evidence of overfitting. The test score accuracy is low at 0.63.

C) Decision Tree

In [60]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
#Note: no need to do standard scaler for decision trees
dt_class = DecisionTreeClassifier()
dt_class.fit(X_train_2, y_train_2)

In [65]:
# Evaluate model.
print(f'Score on training set: {dt_class.score(X_train_2, y_train_2)}')
print(f'Score on testing set: {dt_class.score(X_test_2, y_test_2)}')

Score on training set: 1.0
Score on testing set: 0.5859838274932615


Comment: The model is super overfitted

In [99]:
from sklearn import tree  # gives access to tree.DecisionTreeClassifier and tree.plot_tree
#clf = tree.DecisionTreeClassifier(random_state=0)
#clf = clf.fit(X_train_2, y_train_2)
#tree.plot_tree(clf)

#Is this not working due to old version of scikitlearn? scikit-learn 1.1.3
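If plotting is not available, a fitted tree can also be inspected as text with `export_text`. This is a sketch on synthetic data, with hypothetical feature names `f0` to `f3` and an arbitrary `max_depth=2`. Both `export_text` and `plot_tree` were added to `sklearn.tree` in scikit-learn 0.21, so older installations will not have them.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (an assumption for illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Keep the tree shallow so the printed rules stay readable.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Text rendering of the split rules and leaf classes.
rules = export_text(clf, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```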

D) A set of bagged decision Trees

In [86]:
from sklearn.ensemble import BaggingClassifier

# Instantiate BaggingClassifier.
bag_class = BaggingClassifier(random_state = 42)

# Fit BaggingClassifier.
bag_class.fit(X_train_2, y_train_2)


BaggingClassifier(base_estimator=None, bootstrap=True,
         bootstrap_features=False, max_features=1.0, max_samples=1.0,
         n_estimators=10, n_jobs=1, oob_score=False, random_state=42,
         verbose=0, warm_start=False)

In [85]:
print(f'Score on training set: {bag_class.score(X_train_2, y_train_2)}')
print(f'Score on testing set: {bag_class.score(X_test_2, y_test_2)}')

Score on training set: 0.9788409703504043
Score on testing set: 0.6188679245283019


Comment: Model is very overfitted, accuracy is low.

E) Random Forest

In [61]:
from sklearn.ensemble import RandomForestClassifier
rf_class = RandomForestClassifier(n_estimators=100)
cross_val_score(rf_class, X_train_2, y_train_2, cv=5).mean()

0.6644204851752021

In [75]:
rf_class.fit(X_train_2, y_train_2)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [77]:
print(f'Score on training set: {rf_class.score(X_train_2, y_train_2)}')
print(f'Score on testing set: {rf_class.score(X_test_2, y_test_2)}')

Score on training set: 1.0
Score on testing set: 0.6539083557951483


Comment: Model is overfitted, accuracy is low.

In [87]:
# Visualisation - only for classifier
# Import plot_tree from sklearn.tree module.

# Establish size of figure.

#plt.figure(figsize = (50, 30))

# Plot our tree.
#plot_tree(grid.best_estimator_,
#          feature_names = X_train_2.columns,
#          filled = True);

F) Adaboost model

In [95]:
from sklearn.ensemble import AdaBoostClassifier
ada_class = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())

In [96]:
ada_class.fit(X_train_2, y_train_2)

AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
          learning_rate=1.0, n_estimators=50, random_state=None)

In [97]:
print(f'Score on training set: {ada_class.score(X_train_2, y_train_2)}')
print(f'Score on testing set: {ada_class.score(X_test_2, y_test_2)}')

Score on training set: 1.0
Score on testing set: 0.5859838274932615


In [74]:
ada_params = {
    'n_estimators': [50,100],
    'base_estimator__max_depth': [1,2],
    'learning_rate': [.9, 1.]
}
gs = GridSearchCV(ada_class, param_grid=ada_params, cv=3)
gs.fit(X_train_2, y_train_2)
print(gs.best_score_)
gs.best_params_

0.6853099730458221


{'base_estimator__max_depth': 1, 'learning_rate': 1.0, 'n_estimators': 50}

In [81]:
print(f'Score on training set: {gs.score(X_train_2, y_train_2)}')
print(f'Score on testing set: {gs.score(X_test_2, y_test_2)}')

Score on training set: 0.6952830188679245
Score on testing set: 0.6738544474393531


Comment: Slight overfitting. After Grid Search, model accuracy is improved but still rather low at 0.67.

G) Support Vector Classifier

In [78]:
from sklearn.svm import LinearSVC, SVC
from sklearn.model_selection import StratifiedKFold

In [79]:
# C values to GridSearch over
pgrid = {"C": np.linspace(0.0001, 2, 10)}

svc = LinearSVC(max_iter=20000)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

gcv = GridSearchCV(svc, pgrid, cv=cv)
gcv.fit(X_train_2, y_train_2);

In [80]:
print(f'Score on training set: {gcv.score(X_train_2, y_train_2)}')
print(f'Score on testing set: {gcv.score(X_test_2, y_test_2)}')

Score on training set: 0.6517520215633423
Score on testing set: 0.6355795148247978


Comment: Slight overfitting, model accuracy is low.

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

Ans: 

False positives would be to wrongly classify someone as eligible for 401k when they are ineligible.

False negatives would be to wrongly classify someone as ineligible for 401k when they are in fact eligible.

Think of the confusion matrix: tn, tp, fn, fp.
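The four counts can be read off scikit-learn's `confusion_matrix`. The labels below are made up for illustration, with 1 meaning eligible. For binary labels 0 and 1, `ravel()` returns the counts in the order tn, fp, fn, tp.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = eligible for a 401(k), 0 = not eligible.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"True negatives:  {tn}")
print(f"False positives: {fp}  (predicted eligible, actually ineligible)")
print(f"False negatives: {fn}  (predicted ineligible, actually eligible)")
print(f"True positives:  {tp}")
```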

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

Recall the Problem: To predict whether or not one is eligible for a 401k.


Imagine that I wish to identify everyone who is eligible for a 401(k) so that I can target them and invite them to a talk on starting to use their 401(k).
I would rather wrongly classify someone as eligible than miss someone who is eligible, so that my predictions capture as many eligible people as possible. In this case, I would rather minimize false negatives.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?
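One candidate, if the goal is to have as few false negatives as possible, is recall (also called sensitivity), defined as TP / (TP + FN). The sketch below computes it with scikit-learn on made-up labels, alongside precision for comparison.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = eligible for a 401(k), 0 = not eligible.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# Recall: share of truly eligible people the model finds, TP / (TP + FN).
recall = recall_score(y_true, y_pred)
# Precision: share of predicted-eligible people who really are eligible, TP / (TP + FP).
precision = precision_score(y_true, y_pred)

print(f"Recall:    {recall:.2f}")
print(f"Precision: {precision:.2f}")
```

Recall rises as false negatives fall, but it can be pushed to 1.0 trivially by predicting everyone as eligible, which is why it is usually read together with precision.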

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?
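For reference, the F1 score is the harmonic mean of precision and recall, F1 = 2 * P * R / (P + R), so it is high only when both are high. A short sketch on made-up labels, checking the formula against scikit-learn's `f1_score`:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = eligible for a 401(k), 0 = not eligible.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 2 true positives out of 3 predicted positives
r = recall_score(y_true, y_pred)     # 2 true positives out of 4 actual positives

# Harmonic mean of precision and recall.
f1_manual = 2 * p * r / (p + r)
f1_sklearn = f1_score(y_true, y_pred)

print(f"Precision {p:.3f}, recall {r:.3f}")
print(f"F1 (manual)  {f1_manual:.3f}")
print(f"F1 (sklearn) {f1_sklearn:.3f}")
```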

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

## Step 6: Answer the problem.

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Regression:

The best predictors of income are:

Classification:

The best predictors of eligibility are: