
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Grid Search for Random Forests

**Objective**: *Demonstrate the grid-search process using a validation set.*

In this demo, we will complete a series of exercises to automate the optimization of hyperparameters.

In [0]:
%run "../../Includes/Classroom-Setup"

Out[2]: DataFrame[]

## Prepare data

We'll use the same data we used in this lesson's previous demo.

As a reminder, we'll create an **`adsda.ht_user_metrics`** table. This table will be at the user level.

We'll also be adding a new binary column **`steps_10000`** indicating whether or not the individual takes an average of at least 10,000 steps per day (`1` for yes, `0` for no).

In [0]:
%sql
CREATE OR REPLACE TABLE adsda.ht_user_metrics
USING DELTA LOCATION "/adsda/ht-user-metrics" AS (
  SELECT avg(metrics.resting_heartrate) AS avg_resting_heartrate,
         avg(metrics.active_heartrate) AS avg_active_heartrate,
         avg(metrics.bmi) AS avg_bmi,
         avg(metrics.vo2) AS avg_vo2,
         avg(metrics.workout_minutes) AS avg_workout_minutes,
         CASE WHEN avg(metrics.steps) >= 10000 THEN 1 ELSE 0 END AS steps_10000
  FROM adsda.ht_daily_metrics metrics
  INNER JOIN adsda.ht_users users ON metrics.device_id = users.device_id
  GROUP BY metrics.device_id, users.lifestyle
)



### Train-Validation-Test Split

Our first step is to separate out our true holdout set, our test set.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Notice the holdout set is a bit smaller this time – this is to maximize the amount of data we can use on the training set and validation set.

In [0]:
from sklearn.model_selection import train_test_split

ht_user_metrics_pd_df = spark.table("adsda.ht_user_metrics").toPandas()

train_val_df, test_df = train_test_split(ht_user_metrics_pd_df, train_size=0.85, test_size=0.15, random_state=42)

We now have two DataFrames: `train_val_df` and `test_df`. Note that `train_val_df` contains the data for both the training set and the validation set; we haven't separated those yet.

We need to perform the `train_test_split` again to separate `train_val_df` into a training set and a validation set.

In [0]:
train_df, val_df = train_test_split(train_val_df, train_size=0.7, test_size=0.3, random_state=42)

Now we have our three DataFrames:

1. `train_df`
1. `val_df`
1. `test_df`

But keep in mind we also have `train_val_df`, which is a combination of our training set and our validation set.
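To make the effect of the two consecutive splits concrete, here is a minimal sketch that applies the same `train_size`/`test_size` settings to a hypothetical array of 1,000 row ids (the array `rows` and its size are assumptions for illustration, not the real table). It shows that the training, validation, and test sets end up with roughly 59.5%, 25.5%, and 15% of the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the rows of ht_user_metrics_pd_df (assumed size: 1,000)
rows = np.arange(1000)

# First split: hold out 15% as the test set
train_val, test = train_test_split(rows, train_size=0.85, test_size=0.15, random_state=42)

# Second split: divide the remaining 85% into 70% training and 30% validation
train, val = train_test_split(train_val, train_size=0.7, test_size=0.3, random_state=42)

# Share of the original rows that lands in each set
split_shares = {
    "train": len(train) / len(rows),
    "val": len(val) / len(rows),
    "test": len(test) / len(rows),
}
print(split_shares)
```

Because the second split is applied to the 85% that remains, the validation set is 0.85 × 0.3 ≈ 25.5% of the original data, not 30%.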

## Hyperparameter Tuning via Grid Search

As a reminder, we're building a random forest to predict whether each user takes 10,000 steps per day.

### Random Forest

We'll start by defining our random forest estimator.

In [0]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=42)

### Hyperparameter Grid

Our first step is to create a hyperparameter grid.

We'll focus on two hyperparameters:

1. `max_depth` – the maximum depth of each tree
2. `n_estimators` – the number of trees in the forest

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The `parameter_grid` is just a Python dictionary of values that we are predefining.

In [0]:
parameter_grid = {
  'max_depth':[2, 4, 5, 8, 10, 15, 20, 25, 30], 
  'n_estimators':[3, 5, 10, 50, 100, 150, 250, 500]
}
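A grid search tries every combination of the listed values, so the size of the grid determines how many models get trained. As a rough sketch, scikit-learn's `ParameterGrid` can enumerate the combinations for the grid above; the variable name `candidates` is only illustrative.

```python
from sklearn.model_selection import ParameterGrid

parameter_grid = {
    "max_depth": [2, 4, 5, 8, 10, 15, 20, 25, 30],
    "n_estimators": [3, 5, 10, 50, 100, 150, 250, 500],
}

# Expand the dictionary into one dict per hyperparameter combination
candidates = list(ParameterGrid(parameter_grid))

print(len(candidates))  # 9 values of max_depth * 8 values of n_estimators = 72
print(candidates[0])
```

With a single validation split, that means 72 candidate models are fit during the search, plus one final refit.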

### Predefined Split

To use our validation set in an automated process, we need to create a predefined split to pass into our grid-search process. This is because the grid-search process takes a single DataFrame – for us this will be `train_val_df`.

See this [note](https://scikit-learn.org/stable/modules/cross_validation.html#predefined-fold-splits-validation-sets) in the documentation for more explanation.

In [0]:
from sklearn.model_selection import PredefinedSplit

# Label each row of train_val_df: -1 keeps the row in the training set, 0 assigns it to the validation fold
split_index = [-1 if row in train_df.index else 0 for row in train_val_df.index]

# Create predefined split object
predefined_split = PredefinedSplit(test_fold=split_index)
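To see what the predefined split does, here is a small sketch on a hypothetical five-row `test_fold` (the values are made up for illustration). Rows marked `-1` are never used for validation, and rows marked `0` form the single validation fold.

```python
from sklearn.model_selection import PredefinedSplit

# Hypothetical labels: rows 0, 1 and 3 are training rows; rows 2 and 4 are validation rows
test_fold = [-1, -1, 0, -1, 0]
toy_split = PredefinedSplit(test_fold=test_fold)

# Only one fold label (0) is present, so there is exactly one train/validation split
n_splits = toy_split.get_n_splits()

# split() yields positional indices, not DataFrame index labels
splits = [(train_idx.tolist(), val_idx.tolist()) for train_idx, val_idx in toy_split.split()]

print(n_splits)  # 1
print(splits)    # [([0, 1, 3], [2, 4])]
```

Since the indices are positional, the list passed as `test_fold` has to be in the same row order as the DataFrame later passed to `fit`, which is why `split_index` is built by iterating over `train_val_df.index`.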

### Grid Search

We are now ready to create our grid-search object. We'll use each of the objects we've created thus far.

In [0]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=rfc, cv=predefined_split, param_grid=parameter_grid)

### Training the Models

Now that we've created our `grid_search` object, we're ready to perform the process. `sklearn` makes this easy by providing a familiar `fit` method on `grid_search`.

In [0]:
grid_search.fit(train_val_df.drop("steps_10000", axis=1), train_val_df["steps_10000"])

Out[13]: GridSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ..., -1,  0])),
             estimator=RandomForestClassifier(random_state=42),
             param_grid={'max_depth': [2, 4, 5, 8, 10, 15, 20, 25, 30],
                         'n_estimators': [3, 5, 10, 50, 100, 150, 250, 500]})

### Optimal Hyperparameters

If we're curious about what our optimal hyperparameter values are, we can access them pretty easily.

In [0]:
grid_search.best_params_

Out[14]: {'max_depth': 15, 'n_estimators': 100}

And we can also see the validation accuracy associated with these values.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> This might not be quite as high as expected. Remember that each candidate model was trained on less data: only the training rows of `train_val_df`, not the validation rows.

In [0]:
grid_search.best_score_

Out[15]: 0.9124183006535947
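Beyond the single best score, the fitted search object also keeps the validation score of every combination in `cv_results_`. The following is a minimal sketch on synthetic data (the dataset from `make_classification`, the 140/60 split, and the reduced grid are all assumptions chosen to keep the example fast), showing how those results might be ranked in a pandas DataFrame.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Synthetic stand-in data; not the ht_user_metrics table
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# First 140 rows are training rows (-1), last 60 rows are the validation fold (0)
test_fold = [-1] * 140 + [0] * 60

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    cv=PredefinedSplit(test_fold=test_fold),
    param_grid={"max_depth": [2, 4], "n_estimators": [5, 10]},
)
search.fit(X, y)

# One row per hyperparameter combination, best validation score first
results = (
    pd.DataFrame(search.cv_results_)[
        ["param_max_depth", "param_n_estimators", "mean_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
)
print(results)
```

No `scoring` argument is passed here, so the scores come from the classifier's own `score` method, which is accuracy. With a single predefined split, `mean_test_score` is simply the score on the validation fold.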

### Final Evaluation

To see how the final model performs, we can evaluate it on the test set. By default, `GridSearchCV` refits a model on the entirety of `train_val_df` using the optimal hyperparameters, and this refit model is the one that `grid_search.predict` uses.

In [0]:
from sklearn.metrics import accuracy_score

accuracy_score(
  test_df["steps_10000"], 
  grid_search.predict(test_df.drop("steps_10000", axis=1))
)

Out[16]: 0.8711111111111111
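The refit model is also available directly as `best_estimator_`. As a hedged sketch on synthetic data (the dataset, the split sizes, and the one-parameter grid are assumptions made for illustration), the example below checks that predicting through the search object and through `best_estimator_` gives the same result.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

# Synthetic stand-in data; not the ht_user_metrics table
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

# Mark the first 70% of the remaining rows as training (-1), the rest as validation (0)
n_rows = len(X_train_val)
test_fold = np.where(np.arange(n_rows) < int(0.7 * n_rows), -1, 0)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [2, 4]},
    cv=PredefinedSplit(test_fold=test_fold),
    refit=True,  # the default: refit the best combination on all rows passed to fit
)
search.fit(X_train_val, y_train_val)

# search.predict delegates to the refit best_estimator_
same_predictions = np.array_equal(
    search.predict(X_test), search.best_estimator_.predict(X_test)
)
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(same_predictions, test_accuracy)
```

Keeping `best_estimator_` around can be convenient when the final model needs to be inspected or saved separately from the search object.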

Great! Now it's your opportunity to try this yourself in this lesson's lab.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>