# Required end-of-module assignment: Hyperparameters  
---

## Overview

In this assignment, you will compare different models and different hyperparameters to determine the best model in a classification setting.

This activity is designed to build your familiarity and comfort coding in Python while also helping you review key topics from each module. As you progress through the activity, the questions become increasingly complex. It is important that you adopt a programmer's mindset when completing this activity. Remember to run the code in each cell before submitting your activity, as doing so gives you a chance to fix any errors.



### Learning outcome addressed

- Apply hyperparameter tuning techniques to a business case.



## Index:

- [Question 1](#Question-1)
- [Question 2](#Question-2)
- [Question 3](#Question-3)
- [Question 4](#Question-4)
- [Question 5](#Question-5)
- [Question 6](#Question-6)

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.dummy import DummyClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

import warnings
warnings.filterwarnings('ignore')

### The data set

The data below contains demographic information extracted from census data in the United States.  The goal of the task is to learn to identify the characteristics of individuals making over 50,000 USD per year in 1994.

In [5]:
from sklearn.datasets import fetch_openml
df_adult = fetch_openml(
    'adult', version=2, data_home='./data/scikit_learn_data')
print(df_adult.DESCR) 

**Author**: Ronny Kohavi and Barry Becker  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Adult) - 1996  
**Please cite**: Ron Kohavi, "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996  

Prediction task is to determine whether a person makes over 50K a year. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))

This is the original version from the UCI repository, with training and test sets merged.

### Variable description

Variables are all self-explanatory except __fnlwgt__. This is a proxy for the demographic background of the people: "People with similar demographic characteristics should have similar weights". This similarity-statement is not transferable across the 51 different states.


In [6]:
df_adult.frame['class'].value_counts(normalize = True)

class
<=50K    0.760718
>50K     0.239282
Name: proportion, dtype: float64

In [7]:
X = df_adult.frame.drop('class', axis = 1)
y = df_adult.frame['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=22)

In [8]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   age             48842 non-null  int64   
 1   workclass       46043 non-null  category
 2   fnlwgt          48842 non-null  int64   
 3   education       48842 non-null  category
 4   education-num   48842 non-null  int64   
 5   marital-status  48842 non-null  category
 6   occupation      46033 non-null  category
 7   relationship    48842 non-null  category
 8   race            48842 non-null  category
 9   sex             48842 non-null  category
 10  capital-gain    48842 non-null  int64   
 11  capital-loss    48842 non-null  int64   
 12  hours-per-week  48842 non-null  int64   
 13  native-country  47985 non-null  category
dtypes: category(8), int64(6)
memory usage: 2.6 MB


###### [Back to top](#Index:)

### Question 1

What is the baseline accuracy for the classification task using the majority class from `y`?  Enter your response as a float accurate to three decimal places (between 0 and 1) to `ans1` below.

In [10]:
ans1 = None
### BEGIN SOLUTION
ans1 = 0.760718
### END SOLUTION
#Answer test
print(f'Guessing the majority class gives accuracy {ans1: .3f}')

Guessing the majority class gives accuracy  0.761


In [11]:
### BEGIN HIDDEN TESTS
ans1_ =  0.760718



#
#
#
assert round(ans1, 3) == round(ans1_, 3), 'Check that your answer is between 0 and 1'
### END HIDDEN TESTS
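
The majority-class baseline can also be computed rather than typed in. The sketch below is illustrative only: `y_toy` and `X_toy` are hypothetical stand-ins (not the census data), and it shows two equivalent routes, counting labels directly and scoring a `DummyClassifier` with `strategy='most_frequent'`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels: 3 of every 4 samples belong to the majority class.
y_toy = np.array(['<=50K', '<=50K', '<=50K', '>50K'] * 25)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Route 1: share of the most frequent label.
labels, counts = np.unique(y_toy, return_counts=True)
baseline_from_counts = counts.max() / counts.sum()

# Route 2: a classifier that always predicts the most frequent label.
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_toy, y_toy)
baseline_from_dummy = dummy.score(X_toy, y_toy)

print(f'{baseline_from_counts:.3f} {baseline_from_dummy:.3f}')
```

On the assignment data, the same idea would apply to `y` (or to `X_train`, `y_train` for the dummy model).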

###### [Back to top](#Index:)

### Question 2

Use the transformed training data below to perform 5-fold cross-validation with `cross_val_score` and the `LogisticRegression` model `lgr` below. Keep the default scoring metric (accuracy) and assign the results to `cv_scores_1`.

In [13]:
lgr = LogisticRegression(random_state=22)
transformer = make_column_transformer((OneHotEncoder(), X.select_dtypes('category').columns.tolist()),
                                     remainder = 'passthrough')

In [14]:
X_train_transformed = transformer.fit_transform(X_train)
X_test_transformed = transformer.transform(X_test)

In [15]:
cv_scores_1 = None
### BEGIN SOLUTION
cv_scores_1 = cross_val_score(lgr, X_train_transformed, y_train, cv = 5)
### END SOLUTION
#Answer test
print(f'''The basic logistic regression model gives average accuracy: {cv_scores_1.mean(): .3f}
and deviations {cv_scores_1.std(): .4f}''')

The basic logistic regression model gives average accuracy:  0.799
and deviations  0.0050


In [16]:
### BEGIN HIDDEN TESTS
cv_scores_1_ = cross_val_score(lgr, X_train_transformed, y_train, cv = 5)



#
#
#
np.testing.assert_almost_equal(cv_scores_1, cv_scores_1_, err_msg = "Make sure you're using the transformed data")
### END HIDDEN TESTS
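
An alternative to transforming the data once up front is to wrap the encoder and the model in a pipeline, so the encoder is refit on the training portion of each fold. The sketch below is an assumption-laden illustration, not the graded approach: `toy_X`, `toy_y`, the column names, `handle_unknown='ignore'`, and `max_iter=1000` are all choices made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical and one numeric column.
rng = np.random.default_rng(22)
n = 200
toy_X = pd.DataFrame({
    'job': pd.Categorical(rng.choice(['a', 'b', 'c'], size=n)),
    'hours': rng.integers(10, 60, size=n),
})
toy_y = (toy_X['hours'] + rng.normal(0, 5, size=n) > 35).astype(int)

# Encode the categorical column and pass the numeric column through unchanged.
encoder = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['job']),
    remainder='passthrough',
)

# The pipeline refits the encoder inside every cross-validation fold.
pipe = make_pipeline(encoder, LogisticRegression(max_iter=1000, random_state=22))
toy_scores = cross_val_score(pipe, toy_X, toy_y, cv=5)
print(f'mean accuracy {toy_scores.mean():.3f}, std {toy_scores.std():.4f}')
```

With a one-hot encoder the difference from pre-transforming is usually small, but the pipeline form keeps any preprocessing step from seeing the held-out fold.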

###### [Back to top](#Index:)

### Question 3

Use the transformed training data below to perform 5-fold cross-validation with `cross_val_score` and the `RandomForestClassifier` model `forest` below. Keep the default scoring metric (accuracy) and assign the results to `cv_scores_2`.

In [18]:
forest = RandomForestClassifier(n_estimators=10, random_state=22)

In [19]:
cv_scores_2 = None
### BEGIN SOLUTION
cv_scores_2 = cross_val_score(forest, X_train_transformed, y_train, cv = 5)
### END SOLUTION
#Answer test
print(f'''The basic random forest model gives average accuracy: {cv_scores_2.mean(): .3f}
and deviations {cv_scores_2.std(): .4f}''')

The basic random forest model gives average accuracy:  0.848
and deviations  0.0041


In [20]:
### BEGIN HIDDEN TESTS
cv_scores_2_ = cross_val_score(forest, X_train_transformed, y_train, cv = 5)



#
#
#
np.testing.assert_almost_equal(cv_scores_2, cv_scores_2_, err_msg = "Make sure you're using the transformed data")
### END HIDDEN TESTS

###### [Back to top](#Index:)

### Question 4

While the default forest model seems to do better than both the baseline and logistic model, perhaps you can squeeze some further performance out of the model by tuning the hyperparameters.  Because this is a slow process and may hang up the autograder, the results of searching over hyperparameters are presented as a `DataFrame` below.  Use this dataframe to select the model with the highest average accuracy over the five folds.

What was the optimal number of estimators and the best depth parameter based on the average accuracy?  Assign `best_estimators` as an integer and `best_depth` as an integer or `None` below.

The code below was used to create and store the search results.

```python
#parameters to consider in model
params = {'n_estimators': [10, 50, 100], 'max_depth': [1, 2, 3, None]}
#grid search and cross validate model with parameters
grid = GridSearchCV(forest, param_grid=params)
#fit the transformed data
grid.fit(X_train_transformed, y_train)
#create dataframe of results
results_df = pd.DataFrame(grid.cv_results_)
```

In [23]:
results_df = pd.read_csv('results.csv')

In [24]:
results_df.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.07366,0.010755,0.015565,0.003091,1.0,10,"{'max_depth': 1, 'n_estimators': 10}",0.760748,0.760715,0.760715,0.760715,0.760715,0.760722,1.3e-05,10
1,0.207638,0.027449,0.026752,0.001558,1.0,50,"{'max_depth': 1, 'n_estimators': 50}",0.760748,0.760715,0.760715,0.760715,0.760715,0.760722,1.3e-05,10
2,0.39604,0.023567,0.055455,0.013361,1.0,100,"{'max_depth': 1, 'n_estimators': 100}",0.760748,0.760715,0.760715,0.760715,0.760715,0.760722,1.3e-05,10
3,0.085616,0.015366,0.014412,0.001526,2.0,10,"{'max_depth': 2, 'n_estimators': 10}",0.778627,0.779143,0.776686,0.779962,0.777778,0.778439,0.001128,7
4,0.27764,0.049556,0.031804,0.005028,2.0,50,"{'max_depth': 2, 'n_estimators': 50}",0.776034,0.77573,0.775048,0.777778,0.776276,0.776173,0.000902,8


In [25]:
best_estimators = None
best_depth = None
### BEGIN SOLUTION
best_params = results_df.nsmallest(1, 'rank_test_score')['params'].values[0]
best_estimators = 100
best_depth = None
### END SOLUTION
#Answer test
print(f'The best forest uses depth {best_depth} with {best_estimators} trees.')

The best forest uses depth None with 100 trees.


In [26]:
### BEGIN HIDDEN TESTS
best_estimators_ = 100
best_depth_ = None



#
#
#
assert best_depth == best_depth_
assert best_estimators == best_estimators_
### END HIDDEN TESTS
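
The best row of a `cv_results_` table can be selected programmatically. The sketch below runs a small grid search on synthetic data (`X_demo`, `y_demo`, and the reduced grid are assumptions made for speed, not the assignment's search) and picks the row with the highest `mean_test_score`. It also shows one way to parse a `params` entry that has become a string, which is likely what happens when the results are saved to CSV and reloaded.

```python
import ast

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a deliberately small grid.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=22)
demo_grid = GridSearchCV(
    RandomForestClassifier(random_state=22),
    param_grid={'n_estimators': [5, 20], 'max_depth': [1, None]},
    cv=3,
)
demo_grid.fit(X_demo, y_demo)
demo_results = pd.DataFrame(demo_grid.cv_results_)

# Row with the highest mean cross-validated accuracy.
best_row = demo_results.loc[demo_results['mean_test_score'].idxmax()]
print(best_row['params'], round(best_row['mean_test_score'], 3))

# After a CSV round trip the params column holds text; literal_eval
# turns that text back into a dict (None is handled correctly).
params_as_text = str(best_row['params'])
parsed = ast.literal_eval(params_as_text)
print(parsed['n_estimators'], parsed['max_depth'])
```

Parsing `params` is safer than reading `param_max_depth` directly from a reloaded file, since a `None` depth would be expected to appear there as a missing value.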

###### [Back to top](#Index:)

### Question 5

While this model seems to offer a slight improvement over the original, you may want to retrain it on the full training set with these optimal parameters and see whether the test score improves over the cross-validated version.

In [28]:
forest = RandomForestClassifier(n_estimators=best_estimators, max_depth=best_depth, random_state=22)
forest.fit(X_train_transformed, y_train)
print(f'Accuracy on full training set: {forest.score(X_train_transformed, y_train)}')
print(f'Accuracy on test set: {forest.score(X_test_transformed, y_test)}')

Accuracy on full training set: 0.9999181021539133
Accuracy on test set: 0.8550487265580214


Did the grid search improve the performance on the testing data even if only by a little?  Assign your answer choice `a`, `b`, `c` or `d` as a string to `ans5` below.

```
a. There is no performance increase from hyperparameter tuning.
b. There is a slight performance increase with the new hyperparameters.
c. The baseline model performs better than both.
d. Cannot determine.
```

In [30]:
ans5 = None
### BEGIN SOLUTION
ans5 = 'b'
### END SOLUTION
#Answer test
print(type(ans5))
print(ans5)

<class 'str'>
b


In [31]:
### BEGIN HIDDEN TESTS
ans5_ = 'b'



#
#
#
assert ans5 == ans5_, 'Make sure to compare to the original test score and forest model in problem 4.'
### END HIDDEN TESTS

###### [Back to top](#Index:)

### Question 6

Below, a `DataFrame` named `importance_df` reports the feature importances from the random forest model.  Use it to determine the most important feature in predicting whether an individual earns more than 50,000 dollars a year.

Assign the most important feature according to the `DataFrame` as a string to `most_important_feature`.  

In [33]:
# You may execute this cell on your local machine. Answer the next cell based on the image below.
#importance_df = pd.DataFrame({'features': transformer.get_feature_names_out(), 'importance': forest.feature_importances_})
#importance_df.head()

![Important Features](./important-features.png)

In [35]:
most_important_feature = None
### BEGIN SOLUTION
most_important_feature = 'onehotencoder__workclass_Private'
### END SOLUTION
#Answer test
print(type(most_important_feature))
print(f'The most important feature is {most_important_feature}.')

<class 'str'>
The most important feature is onehotencoder__workclass_Private.


In [36]:
### BEGIN HIDDEN TESTS
most_important_feature_ = 'onehotencoder__workclass_Private'



#
#
#
assert most_important_feature == most_important_feature_, 'Make sure to use the exact spelling from importance_df.'
### END HIDDEN TESTS

### Summary
One thing to note here is the nature of the features selected as most important.  The `.feature_importances_` attribute is biased toward continuous features, because they offer more candidate split points and so tend to be used in splits more often than a binary categorical variable.  There are different approaches to understanding the importance of features, including permutation feature importance, which attempts to address this shortcoming.

You are encouraged to explore the [scikit-learn documentation on inspection](https://scikit-learn.org/stable/inspection.html) and compare the feature importance results from both methods.  Interpreting 'black box' models is an active area of research, and numerous other packages implement alternative interpretation approaches.
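
As a starting point for that comparison, the sketch below computes both impurity-based and permutation importances for a small forest. It uses synthetic data (`X_syn`, `y_syn`, and all parameter values are assumptions for illustration), so the numbers say nothing about the census features; `permutation_importance` comes from `sklearn.inspection` and is evaluated here on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the first two columns are the informative ones.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=22,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, stratify=y_syn, random_state=22
)

model = RandomForestClassifier(n_estimators=50, random_state=22)
model.fit(X_tr, y_tr)

# Shuffle each column of the held-out set in turn and measure the accuracy drop.
perm = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=22)

for i in range(X_syn.shape[1]):
    print(
        f'feature {i}: impurity={model.feature_importances_[i]:.3f}, '
        f'permutation={perm.importances_mean[i]:.3f}'
    )
```

For the assignment's model, the analogous call would pass `forest`, `X_test_transformed`, and `y_test`; a sparse transformed matrix may first need converting to a dense array.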