## 6.01 - Supervised Learning Model Comparison

Recall the "data science process."

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. Thus, we'll define the problem and gather the data for you.
Most of the questions requiring a written response can be written in 2-3 sentences.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data in order to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts gets invested, and hopefully the account holds a lot more by the time you retire.
- These are a little different from regular bank accounts in that they carry certain tax benefits. Also, employers frequently match money that you put into a 401(k).
- If you want to learn more about them, check out [this site](https://www.nerdwallet.com/article/ira-vs-401k-retirement-accounts).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

### NOTE: When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

### Step 2: Gather the data.

##### 1. Read in the data from the repository.

In [1]:
import pandas as pd
import numpy as np

In [2]:
path = "401ksubs.csv"
df = pd.read_csv(path)
df.head()

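| Unnamed: 0 | e401k | inc | marr | male | age | fsize | nettfa | p401k | pira | incsq | agesq |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 13.17 | 0 | 0 | 40 | 1 | 4.575 | 0 | 1 | 173.4489 | 1600 |
| 1 | 1 | 61.23 | 0 | 1 | 35 | 1 | 154.0 | 1 | 0 | 3749.113 | 1225 |
| 2 | 0 | 12.858 | 1 | 0 | 44 | 2 | 0.0 | 0 | 0 | 165.3282 | 1936 |
| 3 | 0 | 98.88 | 1 | 1 | 44 | 2 | 21.8 | 0 | 0 | 9777.254 | 1936 |
| 4 | 0 | 22.614 | 0 | 0 | 53 | 1 | 18.45 | 0 | 0 | 511.393 | 2809 |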


##### 2. What are 2-3 other variables that, if available, would be helpful to have?

Debt-to-income ratio and credit score would both be helpful to have.

##### 3. Suppose a peer recommended putting `race` into your model in order to better predict who to target when advertising IRAs and 401(k)s. Why would this be an unethical decision?

This would be unethical because race should not determine who is offered a financial product. Using it would amount to stereotyping: the model would target or exclude people based on their race rather than on their individual circumstances.

## Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) in our dataset would we reasonably not use? Why?

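We would not use `e401k`, `p401k`, or `pira`, per the note above, and we should not use `incsq` because it is computed directly from the target `inc`. Beyond those, `male`, `marr`, and `fsize` seem like features we could do without, since I doubt they would add much value to the models.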

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

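The two engineered variables are `incsq` (income squared) and `agesq` (age squared). The SMEs may have believed that the effects of income and age are not linear, and squared terms let a linear model capture that kind of curvature.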



##### 6. Looking at the data dictionary, two variable descriptions appear to be errors. What are these errors, and what do you think the correct value would be, looking at the data?

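It looks like the descriptions for `age` and `inc` are incorrect. `age` should be described as age rather than age^2, and `inc` should be described as income rather than inc^2. The values in the data (for example, ages of 35 to 53 in the first five rows) are clearly not squared.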

## Step 4: Model the data. (Part 1: Regression Problem)

Recall:
- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables.

##### 7. List all models/modeling tactics we've learned that could be used to solve a regression problem (as of Wednesday afternoon of Week 6).

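Linear regression (optionally regularized with lasso or ridge), KNN, decision trees, bagging, and random forests. Logistic regression is a classification model, so it does not belong on this list.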

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above. You will be asked to evaluate your models later in Step 5:
    - a multiple linear regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector regressor
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend setting a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [3]:
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import mean_squared_error, f1_score

In [4]:
X = df.drop(columns=['e401k', 'p401k', 'pira', 'inc', 'incsq'])
y = df['inc']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

In [5]:
ss = StandardScaler()
Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)
np.random.seed(42)

In [6]:
rfr = RandomForestRegressor()
rfr.fit(Xs_train, y_train)

In [7]:
knn_r = KNeighborsRegressor()
knn_r.fit(Xs_train, y_train)
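
The cells above fit two of the seven requested models. The sketch below shows one way the full set could be fit and compared on a single shared split. It is a minimal illustration that runs on synthetic data from `make_regression` rather than on `401ksubs.csv`, and the names `X_demo`, `regressors`, and `rmse_results` are assumptions introduced here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the lab's features and `inc` (assumption: not the real data).
X_demo, y_demo = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=23)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=23)

regressors = {
    "linear regression": LinearRegression(),
    "knn": KNeighborsRegressor(),
    "decision tree": DecisionTreeRegressor(random_state=42),
    "bagged trees": BaggingRegressor(random_state=42),  # default base estimator is a decision tree
    "random forest": RandomForestRegressor(random_state=42),
    "adaboost": AdaBoostRegressor(random_state=42),
    "svr": SVR(),
}

rmse_results = {}
for name, model in regressors.items():
    # Scaling lives inside the pipeline, so it is fit on the training data only.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_tr, y_tr)
    rmse_results[name] = {
        "train": float(np.sqrt(mean_squared_error(y_tr, pipe.predict(X_tr)))),
        "test": float(np.sqrt(mean_squared_error(y_te, pipe.predict(X_te)))),
    }

for name, scores in rmse_results.items():
    print(f"{name:18s} train RMSE: {scores['train']:8.2f}   test RMSE: {scores['test']:8.2f}")
```

Putting the scaler in each pipeline keeps the comparison fair: every model sees the same split and the same preprocessing.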

##### 9. What is bootstrapping?

Bootstrapping is a resampling technique that involves repeatedly drawing samples, with replacement, from our source data. Each bootstrap sample is usually the same size as the original data.
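
As a small illustration of the idea, the sketch below bootstraps the mean of the five `inc` values shown in the `df.head()` output. Five observations are far too few for a meaningful interval, so treat this purely as a demonstration of the mechanics. The names `incomes`, `n_boot`, and `boot_means` are introduced here.

```python
import numpy as np

rng = np.random.default_rng(42)

# The five `inc` values visible in the df.head() output.
incomes = np.array([13.17, 61.23, 12.858, 98.88, 22.614])

n_boot = 1000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Draw a sample of the same size as the original, with replacement.
    resample = rng.choice(incomes, size=incomes.size, replace=True)
    boot_means[i] = resample.mean()

lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {incomes.mean():.2f}")
print(f"95% bootstrap interval for the mean: ({lower:.2f}, {upper:.2f})")
```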

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific and precise!

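A single decision tree is fit once on the full training set. A set of bagged decision trees fits many trees, each on a different bootstrap sample of the training data, and then aggregates their predictions (averaging for regression, voting for classification). Aggregating reduces the variance we would get from a single decision tree.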
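
The variance reduction can be seen in a quick experiment. The sketch below compares a single tree with 100 bagged trees using 5-fold cross-validated R². It assumes synthetic data from `make_regression`, not the lab data, and the variable names are introduced here.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic regression problem (assumption: a stand-in for the lab data).
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=25.0, random_state=23)

single_tree = DecisionTreeRegressor(random_state=42)
# Each of the 100 trees is fit on its own bootstrap sample; predictions are averaged.
bagged_trees = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, random_state=42)

tree_r2 = cross_val_score(single_tree, X_demo, y_demo, cv=5).mean()
bagged_r2 = cross_val_score(bagged_trees, X_demo, y_demo, cv=5).mean()

print(f"single tree  mean CV R^2: {tree_r2:.3f}")
print(f"bagged trees mean CV R^2: {bagged_r2:.3f}")
```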

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific and precise!

Both methods fit each tree on a bootstrap sample. The difference is that in a random forest, each split in each tree considers only a random subset of the features, and the best split within that subset is used. In bagging, all features are considered at every split.
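
In scikit-learn this difference shows up in how many features each fitted tree is allowed to consider per split. The sketch below is a minimal check on synthetic data. The choice of 9 features and `max_features=3` is an assumption made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=300, n_features=9, noise=10.0, random_state=23)

# Bagging: every split in every tree may choose among all 9 features.
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50, random_state=42)
bagged.fit(X_demo, y_demo)

# Random forest: every split considers a fresh random subset of 3 features.
forest = RandomForestRegressor(n_estimators=50, max_features=3, random_state=42)
forest.fit(X_demo, y_demo)

bagged_features_per_split = bagged.estimators_[0].max_features_
forest_features_per_split = forest.estimators_[0].max_features_
print("bagged tree, features considered per split:", bagged_features_per_split)
print("forest tree, features considered per split:", forest_features_per_split)
```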

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

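Both methods reduce variance by aggregating many trees, but bagged trees tend to be highly correlated with one another because they all split on the same strong features. By restricting each split to a random subset of features, a random forest decorrelates its trees, so averaging them reduces variance further, usually at the cost of only a small increase in bias.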
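
The bias-variance argument can be made concrete with a standard result for the average of B identically distributed trees, each with variance σ² and pairwise correlation ρ. These symbols are introduced here for illustration and do not appear elsewhere in the lab.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Adding more trees shrinks only the second term. The first term depends on ρ, which is the quantity that random feature subsets are meant to lower.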

## Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

In [8]:
# mean_squared_error returns MSE, so take the square root to report RMSE.
print('RandomForest')
print(f'Train RMSE: {np.sqrt(mean_squared_error(y_train, rfr.predict(Xs_train))):.2f}')
print(f'Test RMSE: {np.sqrt(mean_squared_error(y_test, rfr.predict(Xs_test))):.2f}')
print(' ')
print('KNearestNeighbors')
print(f'Train RMSE: {np.sqrt(mean_squared_error(y_train, knn_r.predict(Xs_train))):.2f}')
print(f'Test RMSE: {np.sqrt(mean_squared_error(y_test, knn_r.predict(Xs_test))):.2f}')

RandomForest
Train RMSE: 7.77
Test RMSE: 19.55
 
KNearestNeighbors
Train RMSE: 16.64
Test RMSE: 19.57



##### 14. Based on training RMSE and testing RMSE, is there evidence of overfitting in any of your models? Which ones?

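Yes. Both models have a higher error on the testing data than on the training data, which is evidence of overfitting. The gap is much larger for the random forest, whose training error is a small fraction of its testing error, than for KNN.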
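
To show what a train/test gap looks like as a model grows more flexible, the sketch below fits decision trees of increasing depth and records the difference between test and train RMSE. It assumes synthetic data, and `gaps` is a name introduced here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data (assumption: a stand-in for the lab data).
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=25.0, random_state=23)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=23)

gaps = {}
for depth in [2, 5, None]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    train_rmse = float(np.sqrt(mean_squared_error(y_tr, tree.predict(X_tr))))
    test_rmse = float(np.sqrt(mean_squared_error(y_te, tree.predict(X_te))))
    gaps[depth] = test_rmse - train_rmse
    print(f"max_depth={depth}: train RMSE {train_rmse:.2f}, test RMSE {test_rmse:.2f}")
```

A gap that widens as flexibility increases, while test error stops improving, is the usual signature of overfitting.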

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

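The two models have almost identical testing error (a test RMSE of roughly 19.6, which corresponds to a test MSE of about 382), so neither clearly wins on predictive performance. The random forest is the more overfit of the two, not the less overfit, because its training error is far below its testing error. I would still pick the random forest: its feature importances speak directly to the question of which features best predict income, and its overfitting could be reduced by tuning parameters such as `max_depth` or `min_samples_leaf`.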

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

As I stated in question 7, I think lasso and ridge regression would be worth trying and could improve on these results. I could also have done more EDA on the data to see which features are best to keep.
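
A minimal sketch of the lasso and ridge idea follows, using scikit-learn's `LassoCV` and `RidgeCV` to choose the penalty strength by cross-validation. It runs on synthetic data with 10 features, of which 4 are informative. These numbers, like the variable names, are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=4, noise=15.0, random_state=23
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=23)

# Both penalties are scale-sensitive, so standardize inside the pipeline.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=42)).fit(X_tr, y_tr)
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13))).fit(X_tr, y_tr)

test_rmse = {}
for name, model in [("lasso", lasso), ("ridge", ridge)]:
    test_rmse[name] = float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))
    print(f"{name}: chosen alpha {model[-1].alpha_:.4f}, test RMSE {test_rmse[name]:.2f}")

# Lasso can shrink coefficients exactly to zero, which doubles as feature selection.
n_zeroed = int(np.sum(lasso[-1].coef_ == 0))
print("coefficients set to zero by lasso:", n_zeroed)
```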

## Step 4: Model the data. (Part 2: Classification Problem)

Recall:
- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

Anyone who participates in a 401(k) must be eligible for one, so `p401k` effectively gives the model the answer whenever it equals 1. A model trained with this feature would look better than it really is, and it would be of little use on new data where participation is unknown.
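
This kind of leakage is easy to check with a cross-tabulation. The sketch below uses a handful of made-up rows, not rows from `401ksubs.csv`, and assumes that participation implies eligibility.

```python
import pandas as pd

# Made-up rows for illustration only (assumption: participation implies eligibility).
toy = pd.DataFrame({
    "e401k": [1, 1, 1, 0, 0, 1, 0, 0],
    "p401k": [1, 0, 1, 0, 0, 1, 0, 0],
})

# Rows are p401k values, columns are e401k values.
table = pd.crosstab(toy["p401k"], toy["e401k"])
print(table)

# Among participants, the share who are eligible is 100%: the feature reveals the target.
participants = toy[toy["p401k"] == 1]
share_eligible = participants["e401k"].mean()
print("share of participants who are eligible:", share_eligible)
```

Running the same `pd.crosstab` call on the real dataframe would show whether the `p401k == 1`, `e401k == 0` cell is empty there as well.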

##### 18. List all modeling tactics we've learned that could be used to solve a classification problem (as of Wednesday afternoon of Week 6).

Random forest, KNN, decision tree, logistic regression, and bagging

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above. You will be asked to evaluate your models later in Step 5:
    - a logistic regression model
    - a k-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - an Adaboost model
    - a support vector classifier
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [12]:
X = df.drop(columns=['e401k', 'p401k', 'pira'])
y = df['e401k']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

In [13]:
ss = StandardScaler()
Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)

In [14]:
rfc = RandomForestClassifier()
rfc.fit(Xs_train, y_train)

In [15]:
knn_e = KNeighborsClassifier()
knn_e.fit(Xs_train, y_train)
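
As with the regression problem, only two of the seven requested classifiers are fit above. The sketch below shows one way to fit and score all seven on a shared split. It uses synthetic data from `make_classification` with a roughly 60/40 class balance, which is an assumption, and the names `classifiers` and `f1_results` are introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lab's features and `e401k` (assumption: not the real data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, weights=[0.6, 0.4], random_state=23
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=23
)

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "bagged trees": BaggingClassifier(random_state=42),  # default base estimator is a decision tree
    "random forest": RandomForestClassifier(random_state=42),
    "adaboost": AdaBoostClassifier(random_state=42),
    "svc": SVC(),
}

f1_results = {}
for name, model in classifiers.items():
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_tr, y_tr)
    f1_results[name] = {
        "train": f1_score(y_tr, pipe.predict(X_tr)),
        "test": f1_score(y_te, pipe.predict(X_te)),
    }

for name, scores in f1_results.items():
    print(f"{name:20s} train F1: {scores['train']:.3f}   test F1: {scores['test']:.3f}")
```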

In [21]:
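# y holds integers, so np.full_like truncates y.mean() to 0: this baseline
# predicts "not eligible" for everyone, and the value below is its RMSE.
y_preds_null = np.full_like(y, y.mean())
mean_squared_error(y, y_preds_null) ** 0.5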

0.6262023475314575
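
One detail of the baseline cell is worth spelling out: `np.full_like` copies the dtype of its first argument, so filling an integer array with a fractional mean silently truncates it. The sketch below demonstrates this on a made-up label vector. The names `y_int`, `truncated`, and `as_float` are introduced here.

```python
import numpy as np

# Made-up integer labels with a mean of 0.4.
y_int = np.array([0, 1, 0, 0, 1])

# full_like inherits the int dtype, so 0.4 is cast to 0.
truncated = np.full_like(y_int, y_int.mean())

# Passing dtype=float keeps the mean as the constant prediction.
as_float = np.full_like(y_int, y_int.mean(), dtype=float)

rmse_truncated = float(np.sqrt(np.mean((y_int - truncated) ** 2)))
rmse_mean = float(np.sqrt(np.mean((y_int - as_float) ** 2)))

print(truncated, rmse_truncated)
print(as_float, rmse_mean)
```

For a classification problem, a more natural baseline than RMSE is the score of always predicting the majority class, which `sklearn.dummy.DummyClassifier(strategy="most_frequent")` provides.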

## Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

False positives are people whom the model classifies as eligible but who are not. False negatives are people whom the model classifies as not eligible but who are.
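
These counts can be read straight off a confusion matrix. The sketch below uses eight made-up labels, where 1 means eligible, purely to show how scikit-learn orders the cells.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = eligible, 0 = not eligible.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("false positives (predicted eligible, actually not):", fp)
print("false negatives (predicted not eligible, actually eligible):", fn)
```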

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

We would probably rather minimize false positives, so that we don't waste time and advertising budget on people who are not eligible.

##### 22. Suppose we wanted to optimize for (minimize) the answer you provided in problem 21. Which metric would we optimize (maximize) in this case?

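Specificity (the true negative rate), because it increases as false positives decrease. Precision would be another reasonable choice.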
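
scikit-learn has no dedicated specificity function, but specificity is the recall of the negative class, so `recall_score` with `pos_label=0` computes it. The sketch below checks this against the definition on made-up labels.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = eligible, 0 = not eligible.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Specificity = TN / (TN + FP): the share of truly ineligible people we got right.
specificity_manual = tn / (tn + fp)
specificity_sklearn = recall_score(y_true, y_pred, pos_label=0)

# Precision = TP / (TP + FP) also rises as false positives fall.
precision = precision_score(y_true, y_pred)

print("specificity:", specificity_manual, specificity_sklearn)
print("precision:", precision)
```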

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

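The F1 score is the harmonic mean of precision and recall, so it focuses on the positive class and is pulled down by both false positives and false negatives. That makes it a reasonable single metric when we care about both kinds of error.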
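
The sketch below works through the F1 calculation by hand on made-up labels and compares it with `f1_score`, to show how both error types enter the metric.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels for illustration: 1 = eligible, 0 = not eligible.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # hurt by false positives
recall = tp / (tp + fn)     # hurt by false negatives

# Harmonic mean of precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)
f1_sklearn = f1_score(y_true, y_pred)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"F1 by hand={f1_manual:.3f} F1 from sklearn={f1_sklearn:.3f}")
```

True negatives never appear in the formula, which is why F1 says nothing about how well the negative class is handled.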

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [16]:
# random forest, train
print(f1_score(y_train, rfc.predict(Xs_train)))

1.0


In [17]:
# random forest, test
print(f1_score(y_test, rfc.predict(Xs_test)))

0.5141113653699467


In [18]:
#knn train
print(f1_score(y_train, knn_e.predict(Xs_train)))

0.6609556443936798


In [19]:
#knn test
print(f1_score(y_test, knn_e.predict(Xs_test)))

0.4752623688155922


##### 25. Based on training f1-score and testing f1-score, is there evidence of overfitting in any of your models? Which ones?

Yes, both models show overfitting, especially the random forest, which scores a perfect 1.0 on the training data but only about 0.51 on the testing data.

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

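This is a hard question because both models score poorly on the testing data. The random forest has the better test F1 score (about 0.51 versus 0.48), but it is also far more overfit. I would lean toward KNN, because its smaller gap between training and testing scores suggests it will behave more predictably on unseen data.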

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

I think regularization, in the spirit of lasso and ridge but applied to a logistic regression, would be worth trying and could improve on these results. I could also have done more EDA on the data to see which features are best to keep.
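
Hyperparameter tuning is another option worth listing here. The sketch below runs a small grid search over KNN settings, scored on F1. It uses synthetic data from `make_classification`, and the grid values and step names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab data (assumption).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=23
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=23
)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {
    "knn__n_neighbors": [3, 5, 11, 21],
    "knn__weights": ["uniform", "distance"],
}

# Cross-validation happens on the training split only; the test split stays untouched.
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_tr, y_tr)

test_f1 = f1_score(y_te, search.predict(X_te))
print("best parameters:", search.best_params_)
print(f"cross-validated F1: {search.best_score_:.3f}, test F1: {test_f1:.3f}")
```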

## Step 6: Answer the problem. [BONUS] 

##### BONUS: Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.