## Supervised Learning Model Comparison

Recall one formulation of the data science process.

1. Define the problem.
2. Gather the data.
3. Explore the data.
4. Model the data.
5. Evaluate the model.
6. Answer the problem.

In this lab, we're going to focus mostly on creating (and then comparing) many regression and classification models. We'll define the problem and gather the data for you.

### Step 1: Define the problem.

You are a data scientist with a financial services company. Specifically, you want to leverage data to identify potential customers.

If you are unfamiliar with "401(k)s" or "IRAs," these are two types of retirement accounts. Very broadly speaking:
- You can put money for retirement into both of these accounts.
- The money in these accounts is invested, with the hope that it will have grown substantially by the time you retire.
- These are a little different from regular bank accounts in that they have tax benefits. Also, employers frequently match money that you put into a 401(k).
- If you want to learn more about them, check out [this site](investopedia.com/ask/answers/12/401k.asp).

We will tackle one regression problem and one classification problem today.
- Regression: What features best predict one's income?
- Classification: Predict whether someone is eligible for a 401k.

Check out the data dictionary [here](http://fmwww.bc.edu/ec-p/data/wooldridge2k/401KSUBS.DES).

---
### NOTE ⚠️

When predicting `inc`, you should pretend as though you do not have access to the `e401k`, `p401k`, and `pira` variables. When predicting `e401k`, you may use the entire dataframe if you wish.

---

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import mean_squared_error, confusion_matrix, f1_score

### Step 2: Import the data.

##### 1. Read in the data from the repository.

In [2]:
customers = pd.read_csv('401ksubs.csv')
customers

Unnamed: 0,e401k,inc,marr,male,age,fsize,nettfa,p401k,pira,incsq,agesq
0,0,13.170,0,0,40,1,4.575,0,1,173.4489,1600
1,1,61.230,0,1,35,1,154.000,1,0,3749.1130,1225
2,0,12.858,1,0,44,2,0.000,0,0,165.3282,1936
3,0,98.880,1,1,44,2,21.800,0,0,9777.2540,1936
4,0,22.614,0,0,53,1,18.450,0,0,511.3930,2809
...,...,...,...,...,...,...,...,...,...,...,...
9270,0,58.428,1,0,33,4,-1.200,0,0,3413.8310,1089
9271,0,24.546,0,1,37,3,2.000,0,0,602.5061,1369
9272,0,38.550,1,0,33,3,-13.600,0,1,1486.1020,1089
9273,0,34.410,1,0,57,3,3.550,0,0,1184.0480,3249


##### 2. What are 2-3 other variables that, if they were available, would be helpful to have?

Education level, location, job title

##### 3. Suppose a peer recommended putting `race` into your model to better predict who to target when advertising IRAs and 401(k)s. Why might this be unethical?

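This might be unethical because targeting financial products by race is a form of discrimination: groups the model scores as less likely customers would see fewer offers for retirement accounts, which could reinforce existing wealth gaps. Race is also sensitive personal information that customers may not have agreed to share for marketing purposes.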

### Step 3: Explore the data.

##### 4. When attempting to predict income, which feature(s) would we exclude? Why?

In [3]:
customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9275 entries, 0 to 9274
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   e401k   9275 non-null   int64  
 1   inc     9275 non-null   float64
 2   marr    9275 non-null   int64  
 3   male    9275 non-null   int64  
 4   age     9275 non-null   int64  
 5   fsize   9275 non-null   int64  
 6   nettfa  9275 non-null   float64
 7   p401k   9275 non-null   int64  
 8   pira    9275 non-null   int64  
 9   incsq   9275 non-null   float64
 10  agesq   9275 non-null   int64  
dtypes: float64(3), int64(8)
memory usage: 797.2 KB


In addition to the variables listed above (`e401k`, `p401k`, and `pira`), we should exclude `incsq` when predicting `inc` because it is computed directly from our y variable.

##### 5. What two variables have already been created for us through feature engineering? Come up with a hypothesis as to why subject-matter experts may have done this.
> This need not be a "statistical hypothesis." Just brainstorm why SMEs might have done this!

Income squared and age squared have been added to the data. If income or age does not have a linear relationship with the dependent variable, the squared terms allow a linear model to capture that curvature more accurately.

##### 6. Looking at the data dictionary, one or more variable descriptions appear to be erroneous. What's the issue and what do you think the correct value(s) should be?

`inc` is described as inc^2 and `age` is described as age^2. I think they are intended to be described as the customer's original income and age.

### Step 4: Model the data. (Part 1: Regression Problem)

- Problem: What features best predict one's income?
- When predicting `inc`, you should pretend as though you do not have access to the `e401k`, the `p401k` variable, and the `pira` variable.

##### 7. List all models you've learned through the date this lab was assigned that could be used to solve a regression problem. For each model type, identify whether it might be appropriate for solving this specific regression problem and explain why.

* linear regression - yes, the target is continuous and the relationship may be linear
* ridge - yes, is a form of linear
* lasso - yes, is a form of linear
* elastic net - yes, is a form of linear
* knn - not sure, may not produce the most accurate results on a continuous dataset
* decision trees - yes, this data can be split based on lowest mse
* random forests - yes, related to decision trees
* bagging models - yes, related to decision trees
* boosting models - yes, related to decision trees
* ann - no, may be overcomplicated

##### 8. Regardless of your answer to number 7, fit at least one of each of the following models to attempt to solve the regression problem above:
    - a multiple linear regression model
    - a K-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - a boosting model

    
> As always, be sure to do a train/test split! To compare modeling techniques, use the same train-test split on each. 

> You may find it helpful to set up pipelines, but you are not required to do so.

In [4]:
X = customers.drop(columns = ['e401k', 'p401k', 'pira', 'inc', 'incsq'])
y = customers['inc']

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 88)

#### Linear Regression

In [6]:
lr = LinearRegression()
lr.fit(X_train, y_train)
lr.score(X_train, y_train)

0.2788276937242058

In [7]:
lr.score(X_test, y_test)

0.31708172488564146

#### Decision Tree

In [8]:
tree = DecisionTreeRegressor(max_depth = 6)
tree.fit(X_train, y_train)
tree.score(X_train, y_train)

0.42793521708417304

In [9]:
tree.score(X_test, y_test)

0.41567052241802216

#### Random Forest

In [10]:
forest = RandomForestRegressor(n_estimators = 300, max_depth = 6)
forest.fit(X_train, y_train)
forest.score(X_train, y_train)

0.44844326533841894

In [11]:
forest.score(X_test, y_test)

0.4347062804270504

##### 9. What is bootstrapping?

 Random resampling with replacement.
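
To make the definition concrete, here is a minimal sketch of one bootstrap sample in plain Python. The helper name `bootstrap_sample` and the seed are my own choices for illustration, and the five values are the first five `inc` entries from the table above.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows uniformly at random, WITH replacement."""
    return rng.choices(rows, k=len(rows))

rng = random.Random(88)  # seed picked only so the sketch is repeatable
incomes = [13.170, 61.230, 12.858, 98.880, 22.614]  # first five `inc` values above

sample = bootstrap_sample(incomes, rng)

# The sample has the same size as the original data, but individual
# values may appear more than once and others may be left out entirely.
print(sample)
```

Because the draw is with replacement, each bootstrap sample is a slightly different view of the same data, which is what bagging relies on.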

##### 10. What is the difference between a decision tree and a set of bagged decision trees? Be specific!

A single decision tree is fit once on the full training set and tends to overfit it. A set of bagged decision trees fits many trees, each on a different bootstrap sample of the training data, and aggregates their predictions (averaging for regression, majority vote for classification).
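
The difference can be sketched in plain Python with depth-1 regression trees ("stumps") on a single feature. This is only an illustration under my own assumptions: the function names are made up, the age/income numbers are hypothetical toy data rather than rows from `customers`, and real bagging would use deeper trees.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # candidate split points
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x identical: fall back to the overall mean
        overall = mean(ys)
        return (float("inf"), overall, overall)
    return best[1:]  # (threshold, left_mean, right_mean)

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

def fit_bagged_stumps(xs, ys, n_estimators, rng):
    """Fit one stump per bootstrap sample of the rows."""
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = rng.choices(range(n), k=n)  # bootstrap row indices
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict_bagged(stumps, x):
    """Aggregate by averaging the individual predictions (regression)."""
    return mean(predict_stump(s, x) for s in stumps)

# Hypothetical toy data: age vs. income in $1000s.
ages = [25, 30, 35, 40, 45, 50, 55, 60]
incomes = [20, 24, 31, 38, 45, 47, 46, 44]

single = fit_stump(ages, incomes)  # one tree, one fit
bagged = fit_bagged_stumps(ages, incomes, 50, random.Random(88))
print(predict_stump(single, 42), predict_bagged(bagged, 42))
```

The single stump can only output two values, while the bagged prediction is an average over 50 slightly different stumps, so it changes more smoothly as the input moves.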

##### 11. What is the difference between a set of bagged decision trees and a random forest? Be specific!

A random forest uses a modified tree-learning algorithm that considers only a random subset of the features at each split. Bagged decision trees consider every feature at every split, so the trees tend to choose the same strong predictors, which makes them highly correlated and limits how much averaging can reduce variance.
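
The per-split feature sampling can be sketched as follows. The feature list is the set of columns in `X` above; the helper name `candidate_features` and the choice of roughly sqrt(p) features per split are assumptions made for illustration, not settings taken from this lab.

```python
import math
import random

features = ["marr", "male", "age", "fsize", "nettfa", "agesq"]  # columns of X above

def candidate_features(all_features, max_features, rng):
    """Features a random-forest tree is allowed to consider at ONE split."""
    return rng.sample(all_features, k=max_features)

rng = random.Random(88)
k = max(1, round(math.sqrt(len(features))))  # sqrt(p) is one common choice, assumed here

# Bagged trees: every split may use any of the six features.
bagging_split = list(features)

# Random forest: each split draws a fresh random subset of size k.
forest_splits = [candidate_features(features, k, rng) for _ in range(3)]
print(k, forest_splits)
```

Since a dominant feature such as `nettfa` is missing from many of these subsets, different trees are pushed toward different splits.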

##### 12. Why might a random forest be superior to a set of bagged decision trees?
> Hint: Consider the bias-variance tradeoff.

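As mentioned above, bagged trees stay correlated because every tree can split on the same strong features, so averaging them removes less variance. A random forest decorrelates the trees, which lowers the variance of the ensemble, though restricting the features at each split could slightly increase bias.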
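
A standard way to see why correlation matters: if each tree's prediction has variance `sigma2` and every pair of trees has correlation `rho`, the variance of their average is `rho * sigma2 + (1 - rho) * sigma2 / n_trees`. The sketch below just evaluates that formula; the equal-variance, equal-correlation setup is a simplifying assumption, and the numbers are illustrative rather than measured from the models in this lab.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the average of n_trees predictions that each have
    variance sigma2 and share a common pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees

n_trees = 300  # same number of estimators as the forest fit above

# Highly correlated trees (bagging-like) vs. less correlated trees (forest-like).
for rho in (0.9, 0.3):
    print(rho, ensemble_variance(1.0, rho, n_trees))
```

Adding more trees only shrinks the second term; the first term, `rho * sigma2`, stays put, so lowering the correlation is the only way to push the variance further down.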

### Step 5: Evaluate the model. (Part 1: Regression Problem)

##### 13. Using RMSE, evaluate each of the models you fit on both the training and testing data.

#### Linear Regression

In [12]:
lr_preds_train = lr.predict(X_train)
lr_preds_test = lr.predict(X_test)

In [13]:
mean_squared_error(y_train, lr_preds_train, squared = False)

20.389791684422967

In [14]:
mean_squared_error(y_test, lr_preds_test, squared = False)

20.09565239559958

#### Decision Tree

In [15]:
tree_preds_train = tree.predict(X_train)
tree_preds_test = tree.predict(X_test)

In [16]:
mean_squared_error(y_train, tree_preds_train, squared = False)

18.16000127426358

In [17]:
mean_squared_error(y_test, tree_preds_test, squared = False)

18.58859821305125

#### Random Forest

In [18]:
forest_preds_train = forest.predict(X_train)
forest_preds_test = forest.predict(X_test)

In [19]:
mean_squared_error(y_train, forest_preds_train, squared = False)

17.83151996432132

In [20]:
mean_squared_error(y_test, forest_preds_test, squared = False)

18.283309993287876

In [21]:
#null model
null_pred = np.full_like(y_test, np.mean(y_test))
mean_squared_error(y_test, null_pred, squared = False)

24.31743381631067
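
As a sanity check on the numbers above, RMSE can be written out by hand, and because the null model predicts the mean of `y_test`, each test R2 reported earlier should equal 1 - (model RMSE / null RMSE)^2. The helper names below are my own; the RMSE values are copied from the outputs above.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, written out explicitly."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2_from_rmse(model_rmse, null_rmse):
    """R^2 = 1 - MSE_model / MSE_null, valid when the null model predicts
    the mean of the same targets the model is scored on."""
    return 1 - (model_rmse / null_rmse) ** 2

null_rmse = 24.31743381631067  # null model, test set
test_rmse = {
    "linear regression": 20.09565239559958,
    "decision tree": 18.58859821305125,
    "random forest": 18.283309993287876,
}

for name, value in test_rmse.items():
    print(name, round(r2_from_rmse(value, null_rmse), 4))
```

The results should line up with the `.score(X_test, y_test)` outputs from Step 4, which is a quick way to confirm that RMSE and R2 rank the models in the same order here.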

##### 14. Which model performs best?

The random forest regressor performs the best on the test set.

##### 15. Based on everything we've covered so far, if you had to pick just one model as your final model to use for this problem, which one model would you pick? Why?

Based on the information covered in this lab I would pick the random forest regressor because it has the lowest test RMSE, and I think it would be the most reliable model on new data.

##### 16. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

I would improve my model by:
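* trying different hyperparameters, such as more estimators or a different `max_depth`
* scaling the data, which would mainly help a distance-based model like KNN since tree-based models are largely insensitive to feature scale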

### Step 4: Model the data. (Part 2: Classification Problem)

- Problem: Predict whether or not one is eligible for a 401k.
- When predicting `e401k`, you may use the entire dataframe if you wish.

##### 17. While you're allowed to use every variable in your dataframe, mention at least one disadvantage of using `p401k` in your model.

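It leaks the target: anyone who participates in a 401(k) plan must be eligible for one, so the model can lean almost entirely on `p401k`. The scores will then look better than they would for customers whose participation status is unknown, which are exactly the people we want to identify.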

##### 18. List all models you've learned that could be used to solve a classification problem. For each, identify whether it is appropriate for solving this specific classification problem and explain why.

* logistic regression - yes, the target is binary and the relationship with the log-odds may be linear
* knn - yes, the probability can be found based on similarities to other data points
* decision trees - yes, this data can be split based on Gini impurity
* random forests - yes, related to decision trees
* bagging models - yes, related to decision trees
* boosting models - yes, related to decision trees
* ann - no, may be overcomplicated

##### 19. Regardless of your answer to number 18, fit at least one of each of the following models to attempt to solve the classification problem above:
    - a logistic regression model
    - a K-nearest neighbors model
    - a decision tree
    - a set of bagged decision trees
    - a random forest
    - a boosting model
    
> As always, be sure to do a train/test split! In order to compare modeling techniques, you should use the same train-test split on each. I recommend using a random seed here.

> You may find it helpful to set up a pipeline to try each modeling technique, but you are not required to do so!

In [22]:
X2 = customers.drop(columns = 'e401k')
y2 = customers['e401k']

In [23]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, random_state = 88)

#### Logistic Regression

In [24]:
logr = LogisticRegression(max_iter = 1000)
logr.fit(X2_train, y2_train)
logr.score(X2_train, y2_train)

0.8842725704427832

In [25]:
logr.score(X2_test, y2_test)

0.8835705045278137

#### KNN

In [26]:
knn = KNeighborsClassifier()
knn.fit(X2_train, y2_train)
knn.score(X2_train, y2_train)

0.7390741805635422

In [27]:
knn.score(X2_test, y2_test)

0.605002156101768

#### Random Forest

In [28]:
rforest = RandomForestClassifier()
rforest.fit(X2_train, y2_train)
rforest.score(X2_train, y2_train)

1.0

In [29]:
rforest.score(X2_test, y2_test)

0.8831392841742131

In [30]:
#null model
y2_test.value_counts(normalize = True)

0    0.598534
1    0.401466
Name: e401k, dtype: float64
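
The baseline accuracy is just the share of the majority class, which is what a model that always predicts "not eligible" would score. A minimal sketch, with a helper name of my own; the class counts (1388 zeros, 931 ones) are the row sums of the test-set confusion matrices in Step 5.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Test-set class counts: 1388 not eligible (0), 931 eligible (1).
y_test_labels = [0] * 1388 + [1] * 931

baseline = majority_baseline_accuracy(y_test_labels)
print(round(baseline, 6))
```

Any classifier worth keeping should beat this number on the test set.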

### Step 5: Evaluate the model. (Part 2: Classification Problem)

##### 20. Suppose our "positive" class is that someone is eligible for a 401(k). What are our false positives? What are our false negatives?

A false positive is someone we predict is eligible for a 401(k) who is actually not eligible. A false negative is someone we predict is not eligible who actually is. The counts for each model are below.

#### Logistic Regression

In [31]:
logr_preds_train = logr.predict(X2_train)
logr_preds_test = logr.predict(X2_test)

In [32]:
cm_logr = confusion_matrix(y2_test, logr_preds_test)
cm_logr

array([[1388,    0],
       [ 270,  661]], dtype=int64)

In [33]:
tn, fp, fn, tp = cm_logr.ravel()
f'False Positives: {fp}, False Negatives: {fn}'

'False Positives: 0, False Negatives: 270'

#### KNN

In [34]:
knn_preds_train = knn.predict(X2_train)
knn_preds_test = knn.predict(X2_test)

In [35]:
cm_knn = confusion_matrix(y2_test, knn_preds_test)
cm_knn

array([[1004,  384],
       [ 532,  399]], dtype=int64)

In [36]:
tn, fp, fn, tp = cm_knn.ravel()
f'False Positives: {fp}, False Negatives: {fn}'

'False Positives: 384, False Negatives: 532'

#### Random Forest

In [37]:
rforest_preds_train = rforest.predict(X2_train)
rforest_preds_test = rforest.predict(X2_test)

In [38]:
cm_rforest = confusion_matrix(y2_test, rforest_preds_test)
cm_rforest

array([[1376,   12],
       [ 259,  672]], dtype=int64)

In [39]:
tn, fp, fn, tp = cm_rforest.ravel()
f'False Positives: {fp}, False Negatives: {fn}'

'False Positives: 12, False Negatives: 259'

##### 21. In this specific case, would we rather minimize false positives or minimize false negatives? Defend your choice.

We should minimize false negatives. Because this isn't a life or death situation, I think the goal would be to not miss anyone who is eligible. It's less costly for a financial services company to accidentally send marketing to someone who isn't eligible than to miss out on a potential customer.

##### 22. Suppose we wanted to optimize for the answer you provided in problem 21. Which metric would we optimize in this case?

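Sensitivity (recall), which is TP / (TP + FN): maximizing it is the same as minimizing false negatives. Of the models fit above, the random forest has the fewest false negatives on the test set (259).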
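
One way to compare the models on false negatives is recall (sensitivity), TP / (TP + FN), computed straight from the test-set confusion matrices above. This is a small sketch with names of my own choosing; the counts are copied from the outputs above.

```python
def recall(tp, fn):
    """Share of truly eligible people that the model catches."""
    return tp / (tp + fn)

# (tn, fp, fn, tp) from the test-set confusion matrices above.
confusion = {
    "logistic regression": (1388, 0, 270, 661),
    "knn": (1004, 384, 532, 399),
    "random forest": (1376, 12, 259, 672),
}

recalls = {name: recall(tp, fn) for name, (tn, fp, fn, tp) in confusion.items()}
best = max(recalls, key=recalls.get)

for name, value in recalls.items():
    print(name, round(value, 3))
print("highest recall:", best)
```

With the default 0.5 threshold the gap between logistic regression and random forest is small, so lowering the decision threshold would be another lever for trading false negatives against false positives.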

##### 23. Suppose that instead of optimizing for the metric in problem 21, we wanted to balance our false positives and false negatives using `f1-score`. Why might [f1-score](https://en.wikipedia.org/wiki/F1_score) be an appropriate metric to use here?

The f1-score is the harmonic mean of precision and recall, so it is penalized by both false positives and false negatives. That makes it an appropriate metric when neither type of error clearly matters more, and it is less flattering than accuracy when the classes are somewhat imbalanced (about 60/40 here).

##### 24. Using f1-score, evaluate each of the models you fit on both the training and testing data.

In [40]:
f'Train: {f1_score(y2_train, logr_preds_train)}, Test: {f1_score(y2_test, logr_preds_test)}'

'Train: 0.8252658997178206, Test: 0.8304020100502513'

In [41]:
f'Train: {f1_score(y2_train, knn_preds_train)}, Test: {f1_score(y2_test, knn_preds_test)}'

'Train: 0.638085742771685, Test: 0.46557759626604434'

In [42]:
f'Train: {f1_score(y2_train, rforest_preds_train)}, Test: {f1_score(y2_test, rforest_preds_test)}'

'Train: 1.0, Test: 0.8321981424148606'
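
The test f1-scores above can be reproduced by hand from the confusion matrix counts in question 20, which shows how false positives and false negatives both enter the metric. The helper name is my own; the counts and expected scores are copied from the outputs above.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# (tp, fp, fn) taken from the test-set confusion matrices.
test_counts = {
    "logistic regression": (661, 0, 270),
    "knn": (399, 384, 532),
    "random forest": (672, 12, 259),
}

for name, (tp, fp, fn) in test_counts.items():
    print(name, f1_from_counts(tp, fp, fn))
```

Logistic regression has perfect precision but misses 270 eligible people, while the random forest trades 12 false positives for 11 fewer false negatives, which is why the two test scores end up so close.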

##### 25. Based on training f1-score and testing f1-score, is there evidence of a great deal of overfitting in any of your models? Which ones?

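KNN and Random Forest are overfit compared to logistic regression. KNN drops from 0.638 on the training set to 0.466 on the test set, and Random Forest drops from 1.0 to 0.832, while logistic regression scores about 0.83 on both.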

##### 26. Based on everything we've covered so far, if you had to pick just one model as your final model to use to answer the problem in front of you, which one model would you pick? Defend your choice.

I would pick Logistic Regression because its f1 score on the test set (0.830) is essentially tied with the random forest's (0.832) and, unlike the random forest, it is not overfit.

##### 27. Suppose you wanted to improve the performance of your final model. Brainstorm 2-3 things that, if you had more time, you would attempt.

I would improve my model by:
* trying different hyperparameters on Logistic Regression, such as the penalty or `C`
* scaling the data
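
To show what scaling would do, here is a minimal z-score sketch in plain Python using the first five `inc` and `incsq` values from the table above. The `standardize` helper is my own; in scikit-learn this is typically done with `StandardScaler`, fit on the training split only.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# First five rows of the table above.
inc = [13.170, 61.230, 12.858, 98.880, 22.614]
incsq = [173.4489, 3749.1130, 165.3282, 9777.2540, 511.3930]

inc_z = standardize(inc)
incsq_z = standardize(incsq)

# Before scaling, incsq spans thousands while inc spans tens, so incsq
# dominates any distance or penalty term. After scaling they are comparable.
print(max(inc) - min(inc), max(incsq) - min(incsq))
print(max(inc_z) - min(inc_z), max(incsq_z) - min(incsq_z))
```

This matters for the regularized logistic regression, whose penalty depends on coefficient size, and it is a likely reason the unscaled KNN model did so poorly.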

### Step 6: Answer the problem.

##### Briefly summarize your answers to the regression and classification problems. Be sure to include any limitations or hesitations in your answer.

- Regression: What features best predict one's income?
- Classification: Predict whether or not one is eligible for a 401k.

In [43]:
importances = forest.feature_importances_
features = X_train.columns
# call sort_values (the original line only referenced the method) and rank by importance
pd.DataFrame(importances, features).sort_values(by = 0, ascending = False)

               0
nettfa  0.684532
marr    0.221671
agesq   0.036030
age     0.033686
fsize   0.012849
male    0.011232

I chose random forest as my final regression model. The model's R2 is 43.5% on my test set and its RMSE is 18.28, which is lower than the baseline model's 24.32. It scored higher than the other models I tried, but could probably be improved by tuning its hyperparameters. I only set `n_estimators` and `max_depth` by hand; if there was more time I would do a grid search to explore more options. According to the feature importances above, net financial assets (`nettfa`) and marital status (`marr`) are the best predictors of one's income, while sex (`male`) contributes the least.

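I chose logistic regression as my classification model. The accuracy score was 88.4% on the test set, which is marginally better than the random forest (88.3%) and well above KNN (60.5%) and the baseline of 60%. Even though this model didn't have the highest f1 score of all the models I tested, it was the most consistent across the train and test sets. One hesitation is that the features include `p401k`, which implies eligibility by definition, so these scores likely overstate how well the model would identify eligible customers who are not already participating. Again, if there was more time I would use grid search to test multiple hyperparameters to find the best fit.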