
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Cross-Validation Lab

**Objective**: *Assess your ability to apply cross-validated hyperparameter tuning to a model.*

In this lab, you will apply what you've learned in this lesson. When complete, please use the answers to the exercises to answer questions in the following quiz within Coursera.

In [0]:
%run "../../Includes/Classroom-Setup"


## Exercise 1

In this exercise, you will create an enhanced user-level table to try to better predict whether or not each user takes at least *8,000* steps in a day. For this exercise, assume we only have access to heart rate information.

Fill in the blanks in the cell below to create the `adsda.ht_user_metrics_cv_lab` table.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Note that this lab is focused on predicting whether users take 8,000 steps per day rather than 10,000 steps per day.

In [0]:
%sql
-- ANSWER
CREATE OR REPLACE TABLE adsda.ht_user_metrics_cv_lab
USING DELTA LOCATION "/adsda/ht-user-metrics-cv-lab" AS (
  SELECT min(resting_heartrate) AS min_resting_heartrate,
         avg(resting_heartrate) AS avg_resting_heartrate,
         max(resting_heartrate) AS max_resting_heartrate,
         max(resting_heartrate) - min(resting_heartrate) AS resting_heartrate_change,
         min(active_heartrate) AS min_active_heartrate,
         avg(active_heartrate) AS avg_active_heartrate,
         max(active_heartrate) AS max_active_heartrate,
         max(active_heartrate) - min(active_heartrate) AS active_heartrate_change,
         CASE WHEN avg(steps) > 8000 THEN 1 ELSE 0 END AS steps_8000
  FROM adsda.ht_daily_metrics
  GROUP BY device_id
)
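The per-user aggregation above can also be sketched in plain Python to make the logic explicit. This is an illustrative sketch only: the `daily_metrics` rows and device names are hypothetical toy values, not data from `adsda.ht_daily_metrics`, and only the resting heart rate features are reproduced (the active heart rate features follow the same pattern).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily rows: (device_id, resting_heartrate, active_heartrate, steps).
daily_metrics = [
    ("device-a", 60.0, 120.0, 9000),
    ("device-a", 64.0, 130.0, 8500),
    ("device-b", 70.0, 110.0, 5000),
    ("device-b", 75.0, 125.0, 7000),
]

# Group the daily rows by device, like GROUP BY device_id.
rows_by_device = defaultdict(list)
for device_id, resting, active, steps in daily_metrics:
    rows_by_device[device_id].append((resting, active, steps))

# Build one row of aggregated features and a label per device.
user_metrics = {}
for device_id, rows in rows_by_device.items():
    resting = [row[0] for row in rows]
    steps = [row[2] for row in rows]
    user_metrics[device_id] = {
        "min_resting_heartrate": min(resting),
        "avg_resting_heartrate": mean(resting),
        "max_resting_heartrate": max(resting),
        "resting_heartrate_change": max(resting) - min(resting),
        # Mirrors the SQL CASE expression: 1 when average steps exceed 8000.
        "steps_8000": 1 if mean(steps) > 8000 else 0,
    }

print(user_metrics)
```

Each device ends up with exactly one row, which is what makes the result a user-level table.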

In [0]:
%sql
SELECT *
FROM adsda.ht_user_metrics_cv_lab

min_resting_heartrate,avg_resting_heartrate,max_resting_heartrate,resting_heartrate_change,min_active_heartrate,avg_active_heartrate,max_active_heartrate,active_heartrate_change,steps_8000
100.12190323385116,82.68379727873081,99.1382809914936,-0.9836222423575692,120.30779135992243,139.43487473206162,162.34782728523246,42.04003592531002,0
52.71287564903336,77.73294228506452,97.93773119202254,45.22485554298918,109.04938651327842,127.05715346661702,146.86986814834722,37.8204816350688,0
100.37366097806768,86.51162895591307,99.78933645583436,-0.5843245222333167,129.5517281106538,147.31573126952208,177.78314996951866,48.23142185886485,0
58.41880559722201,77.55054135762612,98.8753293685824,40.4565237713604,110.84551702943472,129.5770039396946,146.7386499653505,35.89313293591579,0
49.81689070901131,68.93310580458204,92.68678860925904,42.86989790024773,116.68894230853844,136.50268661405897,162.3638297702582,45.67488746171978,0
47.35569868189229,69.31244794850774,92.76249605385787,45.40679737196558,147.8761660158842,167.18585016710105,186.283614902432,38.40744888654783,0
43.51461112051503,64.64397544858174,82.07381856112256,38.55920744060754,126.1426366136658,152.9654977304546,173.98748183735378,47.84484522368797,0
100.00309226714644,81.33282756113321,99.2374154614732,-0.7656768056732375,119.4038219789244,137.57131998347788,156.9670310686102,37.563209089685785,0
40.7141729432126,64.79507042723496,82.96829328067244,42.254120337459845,116.4176566249598,139.39836367080545,157.39528580691194,40.97762918195215,0
100.00362526100596,89.51117796589962,99.91629552898304,-0.0873297320229369,106.10261352131862,126.57048164605168,148.28320037619991,42.1805868548813,0


**Coursera Quiz:** How many users in `adsda.ht_user_metrics_cv_lab` take, on average, more than 8,000 steps per day (that is, have `steps_8000` equal to 1)?

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Refer back to the previous lab for guidance on how to answer this question.

## Exercise 2

In this exercise, you will split your data into a cross-validation set (`cross_val_df`) and test set (`test_df`).

Fill in the blanks below to split your data.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Refer to the previous demo for guidance.

In [0]:
# ANSWER
from sklearn.model_selection import train_test_split

ht_user_metrics_pd_df = spark.table("adsda.ht_user_metrics_cv_lab").toPandas()

cross_val_df, test_df = train_test_split(ht_user_metrics_pd_df, train_size=0.80, test_size=0.20, random_state=42)

**Coursera Quiz:** How many rows are in the `cross_val_df` DataFrame?

Fill in the blanks below to answer the question.

In [0]:
# ANSWER
cross_val_df.shape
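As a rough illustration of how the row counts follow from the split fractions, the sketch below applies the same `train_test_split` arguments to a hypothetical list of 100 row identifiers. The list and its size are assumptions made for the example and say nothing about the size of the lab table.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the rows of a table: 100 row identifiers.
rows = list(range(100))

# Same fractions and seed as the lab cell.
cross_val_rows, test_rows = train_test_split(
    rows, train_size=0.80, test_size=0.20, random_state=42
)

# 80% of the rows go to cross-validation, 20% are held out for testing.
print(len(cross_val_rows), len(test_rows))
```

Every row lands in exactly one of the two sets, so the held-out test rows never influence the cross-validated search.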

## Exercise 3

In this exercise, you will prepare your random forest classifier.

Fill in the blanks below to complete the task.

In [0]:
# ANSWER
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=42)

## Exercise 4

In this exercise, you will create a hyperparameter grid to use during the grid search process.

Use the following hyperparameter values:

1. `max_depth`: 5, 8, 20
1. `n_estimators`: 25, 50, 100
1. `min_samples_split`: 2, 4
1. `max_features`: 3, 4
1. `max_samples`: 0.6, 0.8

Fill in the blanks below to create the grid.

In [0]:
# ANSWER
parameter_grid = {
  "max_depth": [5, 8, 20],
  "n_estimators": [25, 50, 100],
  "min_samples_split": [2, 4],
  "max_features": [3, 4],
  "max_samples": [0.6, 0.8]
}

**Coursera Quiz**: How many total unique combinations of hyperparameters are there in `parameter_grid`?

Use the cell below to determine the answer to the above question.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Refer to the previous lesson's lab for guidance.

In [0]:
# ANSWER
len(parameter_grid["max_depth"]) * len(parameter_grid["n_estimators"]) * len(parameter_grid["min_samples_split"]) * len(parameter_grid["max_features"]) * len(parameter_grid["max_samples"])
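The product of the list lengths can be cross-checked by enumerating the combinations explicitly. The following is a minimal sketch using only the standard library; the `combinations` name is introduced here for illustration.

```python
from itertools import product

parameter_grid = {
    "max_depth": [5, 8, 20],
    "n_estimators": [25, 50, 100],
    "min_samples_split": [2, 4],
    "max_features": [3, 4],
    "max_samples": [0.6, 0.8],
}

# Build one dict per combination by pairing each key with one candidate value.
combinations = [
    dict(zip(parameter_grid, values))
    for values in product(*parameter_grid.values())
]

n_combinations = len(combinations)
print(n_combinations)
print(combinations[0])
```

scikit-learn offers `sklearn.model_selection.ParameterGrid` for the same kind of enumeration, so `len(ParameterGrid(parameter_grid))` should give the same count.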

## Exercise 5

In this exercise, you will create the cross-validated grid-search object that you will use to optimize your hyperparameter values while using cross-validation.

Fill in the blanks below to create the object.

**Note:** Please use 4-fold cross-validation.

In [0]:
# ANSWER
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=rfc, cv=4, param_grid=parameter_grid)

## Exercise 6

In this exercise, you will fit the grid search process.

Fill in the blanks below to perform the grid search process.

In [0]:
# ANSWER
grid_search.fit(cross_val_df.drop("steps_8000", axis=1), cross_val_df["steps_8000"])

**Coursera Quiz**: How many unique models are being trained by the cross-validated grid search process?

* 4
* 289
* 21
* 288

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Consider the number of unique hyperparameter combinations, the number of cross-validation folds, and the final retraining of the model on the entire cross-validation set.
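
The arithmetic behind the hint can be sketched as follows. The helper `count_fits` is a hypothetical name introduced for this example, and the sketch assumes the grid search refits once on the full cross-validation set, as it does with scikit-learn's default `refit=True`.

```python
def count_fits(n_combinations, n_folds, refit=True):
    """Count model fits: one per combination per fold, plus an optional final refit."""
    cv_fits = n_combinations * n_folds
    return cv_fits + (1 if refit else 0)

# Number of candidate values per hyperparameter in the lab's grid:
# max_depth (3), n_estimators (3), min_samples_split (2),
# max_features (2), max_samples (2).
n_combinations = 3 * 3 * 2 * 2 * 2

total_fits = count_fits(n_combinations, n_folds=4)
print(n_combinations, total_fits)
```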

## Exercise 7

In this exercise, you will return a Pandas DataFrame of the `grid_search` results.

Fill in the blanks below to return the DataFrame.

In [0]:
# ANSWER
import pandas as pd
pd.DataFrame(grid_search.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_max_features,param_max_samples,param_min_samples_split,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
0,0.073803,0.001329,0.006431,0.000190,5,3,0.6,2,25,"{'max_depth': 5, 'max_features': 3, 'max_sampl...",0.891667,0.876667,0.885000,0.891667,0.886250,0.006166,41
1,0.142304,0.001015,0.010535,0.000110,5,3,0.6,2,50,"{'max_depth': 5, 'max_features': 3, 'max_sampl...",0.896667,0.888333,0.885000,0.893333,0.890833,0.004488,12
2,0.286086,0.008795,0.024689,0.008326,5,3,0.6,2,100,"{'max_depth': 5, 'max_features': 3, 'max_sampl...",0.896667,0.890000,0.890000,0.895000,0.892917,0.002976,2
3,0.074263,0.000989,0.006450,0.000064,5,3,0.6,4,25,"{'max_depth': 5, 'max_features': 3, 'max_sampl...",0.891667,0.886667,0.886667,0.891667,0.889167,0.002500,21
4,0.146175,0.002696,0.010875,0.000239,5,3,0.6,4,50,"{'max_depth': 5, 'max_features': 3, 'max_sampl...",0.895000,0.885000,0.888333,0.893333,0.890417,0.003975,15
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67,0.243197,0.030370,0.013240,0.002320,20,4,0.8,2,50,"{'max_depth': 20, 'max_features': 4, 'max_samp...",0.881667,0.868333,0.868333,0.873333,0.872917,0.005449,72
68,0.443375,0.033354,0.020192,0.000291,20,4,0.8,2,100,"{'max_depth': 20, 'max_features': 4, 'max_samp...",0.895000,0.871667,0.868333,0.881667,0.879167,0.010375,67
69,0.106932,0.002034,0.006974,0.000099,20,4,0.8,4,25,"{'max_depth': 20, 'max_features': 4, 'max_samp...",0.898333,0.868333,0.878333,0.886667,0.882917,0.011016,54
70,0.211066,0.002696,0.011531,0.000294,20,4,0.8,4,50,"{'max_depth': 20, 'max_features': 4, 'max_samp...",0.890000,0.870000,0.875000,0.886667,0.880417,0.008197,60
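
The `mean_test_score` and `std_test_score` columns summarize the per-fold scores in the `split*_test_score` columns. As a quick check, the sketch below recomputes both for the first row of the output above, treating the standard deviation as a population standard deviation; the variable names are introduced here for illustration.

```python
from statistics import mean, pstdev

# Per-fold scores copied from row 0 of the results shown above.
split_scores = [0.891667, 0.876667, 0.885000, 0.891667]

# Mean and population standard deviation across the four folds.
mean_test_score = mean(split_scores)
std_test_score = pstdev(split_scores)

print(round(mean_test_score, 6), round(std_test_score, 6))
```

The recomputed values agree with the `0.886250` and `0.006166` reported for that row, up to the rounding of the displayed fold scores.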


## Exercise 8

In this exercise, you will identify the optimal hyperparameter values.

Fill in the blanks below to identify the optimal hyperparameter values.

In [0]:
# ANSWER
grid_search.best_params_

**Coursera Quiz:** What is the optimal hyperparameter value for `max_depth` according to the cross-validated grid search process?

## Exercise 9

In this exercise, you will identify the test accuracy achieved by the final, refit model.

Fill in the blanks below to identify the test accuracy.

In [0]:
# ANSWER
from sklearn.metrics import accuracy_score

accuracy_score(
  test_df["steps_8000"], 
  grid_search.predict(test_df.drop("steps_8000", axis=1))
)
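
The cell above relies on `grid_search.predict` using the refit best model. The self-contained sketch below illustrates that behavior on synthetic data. The dataset from `make_classification`, the deliberately small `small_grid`, and all variable names are assumptions chosen so the example runs quickly; they are not the lab's data or grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the lab data (hypothetical: 200 rows, 8 features).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_cv, X_test, y_cv, y_test = train_test_split(
    X, y, train_size=0.80, test_size=0.20, random_state=42
)

# A small grid (2 x 2 = 4 combinations) so the search finishes quickly.
small_grid = {"max_depth": [2, 5], "n_estimators": [10, 20]}
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    cv=4,
    param_grid=small_grid,
)
search.fit(X_cv, y_cv)

# With the default refit=True, predict() delegates to best_estimator_,
# which was retrained on the whole cross-validation set.
test_accuracy = accuracy_score(y_test, search.predict(X_test))
direct_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))

print(search.best_params_, test_accuracy)
```

Because the test rows were held out before the search, `test_accuracy` is an estimate of performance on unseen data rather than a score the search optimized.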

**Coursera Quiz:** What is the test set accuracy?

Congratulations! That concludes our lesson on cross-validated hyperparameter optimization and our course!

Be sure to submit your quiz answers to Coursera to fully complete the course!

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>