
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# K-fold Cross-Validation with Random Forest

**Objective**: *Demonstrate the use of K-fold cross-validation.*

In this demo, we will complete a series of exercises to identify optimal hyperparameters using cross-validation and grid-search.

In [0]:
%run "../../Includes/Classroom-Setup"

Out[2]: DataFrame[]


## Prepare data

We'll create an **`adsda.ht_user_metrics`** table. This table will be at the user-level. 

We'll also be adding a new binary column **`steps_10000`** indicating whether or not the individual takes an average of at least 10,000 steps per day (`1` for yes, `0` for no).

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Note that we're using fewer features than the previous lab.

In [0]:
%sql
CREATE OR REPLACE TABLE adsda.ht_user_metrics
USING DELTA LOCATION "/adsda/ht-user-metrics" AS (
  SELECT avg(metrics.resting_heartrate) AS avg_resting_heartrate,
         avg(metrics.active_heartrate) AS avg_active_heartrate,
         avg(metrics.bmi) AS avg_bmi,
         avg(metrics.vo2) AS avg_vo2,
         avg(metrics.workout_minutes) AS avg_workout_minutes,
         CASE WHEN avg(metrics.steps) >= 10000 THEN 1 ELSE 0 END AS steps_10000
  FROM adsda.ht_daily_metrics metrics
  INNER JOIN adsda.ht_users users ON metrics.device_id = users.device_id
  GROUP BY metrics.device_id, users.lifestyle
)

num_affected_rows,num_inserted_rows



### Train-Test Split

When we perform cross-validation, remember that it's still important to separate out the cross-validation set from the test (or holdout) set.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> We are using 80 percent of data for cross-validation and 20 percent of data for test. This is because we no longer have to split the non-test data between training and validation sets.

In [0]:
from sklearn.model_selection import train_test_split

ht_user_metrics_pd_df = spark.table("adsda.ht_user_metrics").toPandas()

cross_val_df, test_df = train_test_split(ht_user_metrics_pd_df, train_size=0.80, test_size=0.20, random_state=42)

We now have two DataFrames: `cross_val_df` and `test_df`. It should be noted that `cross_val_df` contains all of the folds of our cross-validation data.
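
As a quick sanity check on the split, here is a minimal sketch using a synthetic stand-in for the user-level table. The names `demo_df`, `cross_val_demo`, and `test_demo` and the 1,000-row size are assumptions made for illustration, not the lesson's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the user-level table (an assumption, not the real data).
rng = np.random.default_rng(42)
demo_df = pd.DataFrame({
    "avg_bmi": rng.normal(25, 3, size=1000),
    "steps_10000": rng.integers(0, 2, size=1000),
})

# Same split proportions as in the lesson: 80% cross-validation, 20% test.
cross_val_demo, test_demo = train_test_split(
    demo_df, train_size=0.80, test_size=0.20, random_state=42
)

print(len(cross_val_demo), len(test_demo))  # 800 200

# No row should appear in both sets.
overlap = cross_val_demo.index.intersection(test_demo.index)
print(len(overlap))  # 0
```

The test rows never take part in any fold, which is what keeps the final evaluation honest.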

## Hyperparameter Tuning via Grid Search with Cross-Validation

As a reminder, we're building a random forest to predict whether each user takes 10,000 steps per day.

### Random Forest

We'll start by defining our random forest estimator.

In [0]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=42)

### Hyperparameter Grid

Just like in the last lesson, our first step is to create a hyperparameter grid.

We'll focus on two hyperparameters:

1. `max_depth` - the maximum depth of each tree
2. `n_estimators` - the number of trees in the forest

In [0]:
parameter_grid = {
  'max_depth':[2, 4, 5, 8, 10, 15, 20, 25], 
  'n_estimators':[3, 5, 10, 50, 100]
}
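
To see what this grid expands to, we can enumerate it with scikit-learn's `ParameterGrid`. This is only an illustrative sketch; the variable name `combinations` is ours.

```python
from sklearn.model_selection import ParameterGrid

parameter_grid = {
    "max_depth": [2, 4, 5, 8, 10, 15, 20, 25],
    "n_estimators": [3, 5, 10, 50, 100],
}

# Each element is one dict of hyperparameter values that grid search will try.
combinations = list(ParameterGrid(parameter_grid))

print(len(combinations))  # 40 (8 depths x 5 forest sizes)
print(combinations[:3])
```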

### Cross-Validated Grid Search

We are now ready to create our grid-search object. We'll use each of the objects we've created thus far.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Instead of passing a `PredefinedSplit` object to the `cv` parameter, we're simply passing the number of folds.

In [0]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=rfc, cv=3, param_grid=parameter_grid)
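
When `cv` is an integer and the estimator is a classifier with a binary or multiclass target, scikit-learn builds the folds with `StratifiedKFold`. The sketch below shows what three such folds look like on a tiny made-up array. `X_demo` and `y_demo` are illustrative assumptions, not the lesson's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Tiny made-up dataset: 12 rows, balanced binary labels.
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([0, 1] * 6)

skf = StratifiedKFold(n_splits=3)

fold_sizes = []
validation_rows = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_demo, y_demo)):
    # Each fold trains on 2/3 of the rows and validates on the remaining 1/3.
    fold_sizes.append((len(train_idx), len(val_idx)))
    validation_rows.extend(val_idx.tolist())
    print(fold, len(train_idx), len(val_idx), y_demo[val_idx].mean())

# Every row is used for validation exactly once across the three folds.
print(sorted(validation_rows) == list(range(12)))  # True
```

Stratification keeps the share of each class roughly the same in every fold, which matters when the label is imbalanced.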

### Training the Models

Now that we've created our `grid_search` object, we're ready to fit it.

This is the same process as in the previous lesson.

In [0]:
grid_search.fit(cross_val_df.drop("steps_10000", axis=1), cross_val_df["steps_10000"])

Out[11]: GridSearchCV(cv=3, estimator=RandomForestClassifier(random_state=42),
             param_grid={'max_depth': [2, 4, 5, 8, 10, 15, 20, 25],
                         'n_estimators': [3, 5, 10, 50, 100]})

**Question:** How many models are we training right now?

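*Number of Unique Hyperparameter Combinations* x *Number of Folds* + 1, where the extra model is the final refit on all of the cross-validation data. For our grid, that is 8 x 5 x 3 + 1 = 121 models.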
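
We can check this count empirically. The sketch below fits a deliberately small grid on synthetic data (`X_demo`, `y_demo`, `small_grid`, and `demo_search` are all assumptions for illustration) and reads the number of combinations and folds back from the fitted object.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small grid so the example runs quickly (assumptions).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)
small_grid = {"max_depth": [2, 4], "n_estimators": [3, 5, 10]}

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=3,
)
demo_search.fit(X_demo, y_demo)

n_combinations = len(demo_search.cv_results_["params"])  # 2 x 3 = 6
n_folds = demo_search.n_splits_                          # 3
n_fits = n_combinations * n_folds + 1                    # +1 for the final refit

print(n_fits)  # 19

# The same arithmetic for the grid used in this lesson.
lesson_fits = 8 * 5 * 3 + 1
print(lesson_fits)  # 121
```

The `+ 1` only applies when `refit=True`, which is the default.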

### Cross-validated Results

If you want to examine the results for each individual fold, you can use `grid_search`'s `cv_results_` attribute.

Note that each row of the DataFrame corresponds to a unique set of hyperparameters.

In [0]:
import pandas as pd
pd.DataFrame(grid_search.cv_results_).head()



|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.03386 | 0.010356 | 0.009586 | 0.00734 | 2 | 3 | {'max_depth': 2, 'n_estimators': 3} | 0.8875 | 0.895 | 0.89 | 0.890833 | 0.003118 | 34 |
| 1 | 0.059752 | 0.01676 | 0.005521 | 0.00066 | 2 | 5 | {'max_depth': 2, 'n_estimators': 5} | 0.89 | 0.89 | 0.885 | 0.888333 | 0.002357 | 39 |
| 2 | 0.046336 | 0.004228 | 0.011601 | 0.003222 | 2 | 10 | {'max_depth': 2, 'n_estimators': 10} | 0.89375 | 0.89625 | 0.88625 | 0.892083 | 0.004249 | 33 |
| 3 | 0.426534 | 0.135433 | 0.037327 | 0.010789 | 2 | 50 | {'max_depth': 2, 'n_estimators': 50} | 0.9075 | 0.89625 | 0.89375 | 0.899167 | 0.00598 | 23 |
| 4 | 0.874123 | 0.134755 | 0.077893 | 0.026389 | 2 | 100 | {'max_depth': 2, 'n_estimators': 100} | 0.90875 | 0.89625 | 0.89375 | 0.899583 | 0.006562 | 21 |
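
To make the relationship between the per-fold columns and the summary columns concrete, the sketch below recomputes `mean_test_score` and `std_test_score` from the five rows of fold scores shown above. The DataFrame names `fold_scores` and `recomputed` are ours.

```python
import pandas as pd

# Per-fold accuracies copied from the first five rows of the results above.
fold_scores = pd.DataFrame({
    "split0_test_score": [0.8875, 0.89, 0.89375, 0.9075, 0.90875],
    "split1_test_score": [0.895, 0.89, 0.89625, 0.89625, 0.89625],
    "split2_test_score": [0.89, 0.885, 0.88625, 0.89375, 0.89375],
})

recomputed = pd.DataFrame({
    # Average accuracy across the three folds.
    "mean_test_score": fold_scores.mean(axis=1),
    # Population standard deviation (ddof=0) across the three folds.
    "std_test_score": fold_scores.std(axis=1, ddof=0),
})

print(recomputed.round(6))
```

The recomputed values match the `mean_test_score` and `std_test_score` columns in the output, so each row's summary is just the mean and spread of its three fold scores.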


### Optimal Hyperparameters

If you don't want to dig through the above DataFrame to determine your optimal hyperparameters, you can still access them using `best_params_`.

In [0]:
grid_search.best_params_

Out[13]: {'max_depth': 15, 'n_estimators': 50}

And we can also see the average accuracy associated with these hyperparameter values.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> This is a bit better than we saw with the train-validation-test split.

In [0]:
grid_search.best_score_

Out[14]: 0.9095833333333334
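
`best_params_` and `best_score_` are shortcuts into the results table: they correspond to the row at `best_index_`. Here is a minimal sketch on synthetic data, where `X_demo`, `y_demo`, and `demo_search` are illustrative assumptions rather than the lesson's objects.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small grid so the example runs quickly (assumptions).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [2, 5], "n_estimators": [5, 10]},
    cv=3,
)
demo_search.fit(X_demo, y_demo)

best_row = demo_search.best_index_
results = demo_search.cv_results_

# The "best" attributes are simply the top-ranked row of cv_results_.
print(results["params"][best_row] == demo_search.best_params_)          # True
print(results["mean_test_score"][best_row] == demo_search.best_score_)  # True
print(results["rank_test_score"][best_row])                             # 1
```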

### Final Evaluation

If we want to see how the final model performs, we can assess it against the test set. This is the model that was refit on the entirety of `cross_val_df` using the optimal hyperparameters.

In [0]:
from sklearn.metrics import accuracy_score

accuracy_score(
  test_df["steps_10000"], 
  grid_search.predict(test_df.drop("steps_10000", axis=1))
)

Out[15]: 0.8816666666666667
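
Calling `predict` on a fitted `GridSearchCV` object delegates to the refit model stored in `best_estimator_`. The sketch below illustrates this on synthetic data; every name in it (`X_demo`, `X_cv`, `demo_search`, and so on) is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, split the same way as in the lesson (80% / 20%).
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=1)
X_cv, X_test, y_cv, y_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [2, 5], "n_estimators": [5, 10]},
    cv=3,
)
demo_search.fit(X_cv, y_cv)

# Predictions from the search object and from the refit model are identical.
search_preds = demo_search.predict(X_test)
refit_preds = demo_search.best_estimator_.predict(X_test)
print(np.array_equal(search_preds, refit_preds))  # True

# With no custom scoring, score() on the search object reports test accuracy.
test_accuracy = accuracy_score(y_test, search_preds)
search_score = demo_search.score(X_test, y_test)
print(abs(test_accuracy - search_score) < 1e-12)  # True
```

Note that the test accuracy is usually a little lower than `best_score_`, since the hyperparameters were chosen to maximize the cross-validated score.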

Great! We have one more lecture video before we finish off the lesson with a lab.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>