
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Hyperparameters in Tree-based Models

**Objective**: *Demonstrate the manual process of changing hyperparameters and testing values.*

In this demo, we will complete a series of exercises to explore the use of hyperparameters.

In [0]:
%run "../../Includes/Classroom-Setup"

Out[2]: DataFrame[]

## Prepare data

In this demo, we'll create an **`adsda.ht_user_metrics`** table. This table is aggregated to the user level.

We'll also add a new binary column **`steps_10000`** indicating whether or not the individual takes an average of at least 10,000 steps per day (`1` for yes, `0` for no).

In [0]:
%sql
CREATE OR REPLACE TABLE adsda.ht_user_metrics
USING DELTA LOCATION "/adsda/ht-user-metrics" AS (
  SELECT avg(metrics.resting_heartrate) AS avg_resting_heartrate,
         avg(metrics.active_heartrate) AS avg_active_heartrate,
         avg(metrics.bmi) AS avg_bmi,
         avg(metrics.vo2) AS avg_vo2,
         avg(metrics.workout_minutes) AS avg_workout_minutes,
         CASE WHEN avg(metrics.steps) >= 10000 THEN 1 ELSE 0 END AS steps_10000
  FROM adsda.ht_daily_metrics metrics
  INNER JOIN adsda.ht_users users ON metrics.device_id = users.device_id
  GROUP BY metrics.device_id, users.lifestyle
)

num_affected_rows,num_inserted_rows


We can display our new table and confirm our **`steps_10000`** column.

In [0]:
%sql
SELECT * FROM adsda.ht_user_metrics LIMIT 10

avg_resting_heartrate,avg_active_heartrate,avg_bmi,avg_vo2,avg_workout_minutes,steps_10000
54.0284328440395,108.28615934932893,12.654053882785398,33.03972742973578,49.85972670403038,1
54.77908266525594,108.15071784658238,20.91366854373003,36.50039642145559,32.38710983013407,1
82.71709277672224,123.75660244868575,21.775089958039707,18.609142842404506,40.607693444622576,1
54.65683212557368,110.26062426718684,10.975399291225694,29.183789361817823,53.62276604633095,1
81.1145477254848,131.062285127167,22.33872524194543,24.338430651153843,5.7806587919834,0
49.121654377406905,116.95658221162827,23.873621520980663,41.20961725020616,44.92626070562924,1
54.77494950167725,128.54677955505105,22.100005515666016,35.79725405982539,48.14001371848655,1
74.70086634248925,121.67227153993254,28.792669925300466,25.805953355652942,34.87085859794914,0
82.23193357821977,138.53324585681642,24.707027023043285,20.56033559557636,36.731238237383536,0
96.05237605034688,147.66059236683043,17.117911573521425,15.281249859214014,49.09652773997195,0


### Train-test Split

Remember that we need to split our data into a training set and a test set so we can determine whether or not our models generalize well to data they have not seen.

In [0]:
from sklearn.model_selection import train_test_split

ht_user_metrics_pd_df = spark.table("adsda.ht_user_metrics").toPandas()

train_df, test_df = train_test_split(ht_user_metrics_pd_df, train_size=0.8, test_size=0.2, random_state=42)
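
As a quick sanity check on what the split does, here is a minimal sketch on a toy DataFrame. The `toy_df` data and the `toy_*` names are made up for illustration and are not part of the course tables; the sketch only shows that an 80/20 split yields two non-overlapping sets of rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row stand-in for the user metrics table
toy_df = pd.DataFrame({
    "avg_bmi": [float(20 + i) for i in range(10)],
    "steps_10000": [0, 1] * 5,
})

# Same split arguments as the cell above
toy_train_df, toy_test_df = train_test_split(
    toy_df, train_size=0.8, test_size=0.2, random_state=42
)

# No row should appear in both sets
overlap = set(toy_train_df.index) & set(toy_test_df.index)

print(len(toy_train_df), len(toy_test_df))  # prints: 8 2
print("Shared rows:", len(overlap))         # prints: Shared rows: 0
```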

## Random Forest

In this demo, we'll try to build a random forest to predict whether each user takes 10,000 steps per day.

### Hyperparameters

We'll focus on two hyperparameters:

1. `max_depth` - the maximum depth of each tree
2. `n_estimators` - the number of trees in the forest
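
To make these two hyperparameters concrete, the following sketch fits a tiny forest and inspects its structure. It assumes synthetic data from `make_classification` rather than the course table, and the names `forest`, `n_trees`, and `tree_depths` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the adsda tables)
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

forest = RandomForestClassifier(max_depth=2, n_estimators=3, random_state=42)
forest.fit(X, y)

# n_estimators controls how many trees are fitted
n_trees = len(forest.estimators_)

# max_depth is an upper bound on the depth of each individual tree
tree_depths = [tree.get_depth() for tree in forest.estimators_]

print("Number of trees:", n_trees)
print("Depth of each tree:", tree_depths)
```

Each entry in `tree_depths` is at most `max_depth`; a tree can be shallower if it runs out of useful splits.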

In [0]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=42)

Let's take a look at our current values for each of these hyperparameters.

In [0]:
rfc.get_params()

Out[10]: {'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

These aren't the hyperparameter values we want: `max_depth` is `None` (no depth limit) and `n_estimators` is `100`. Let's change them.

In [0]:
rfc.set_params(max_depth=2, n_estimators=3)

Out[11]: RandomForestClassifier(max_depth=2, n_estimators=3, random_state=42)

And we can verify that we changed the hyperparameters.

In [0]:
rfc.get_params()

Out[12]: {'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': 2,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 3,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}
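
Because the full dictionary is long, it can be easier to check only the values we changed. This is a minimal sketch using a separate estimator, `rfc_sketch`, a name introduced here for illustration. It relies on `set_params` returning the estimator itself and `get_params` returning a plain dictionary.

```python
from sklearn.ensemble import RandomForestClassifier

rfc_sketch = RandomForestClassifier(random_state=42)

# set_params updates the estimator in place and returns that same estimator
returned = rfc_sketch.set_params(max_depth=2, n_estimators=3)

# Pull out only the two hyperparameters of interest
params = rfc_sketch.get_params()
selected = {name: params[name] for name in ("max_depth", "n_estimators")}

print(selected)  # prints: {'max_depth': 2, 'n_estimators': 3}
```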

### Train and Evaluate Model

Now that we've set our hyperparameters to our preferred values, let's train and evaluate our model using accuracy.

In [0]:
from sklearn.metrics import accuracy_score

# Fit the model
rfc.fit(train_df.drop("steps_10000", axis=1), train_df["steps_10000"])

# Train accuracy
train_accuracy = accuracy_score(
  train_df["steps_10000"], 
  rfc.predict(train_df.drop("steps_10000", axis=1))
)

# Test accuracy
test_accuracy = accuracy_score(
  test_df["steps_10000"], 
  rfc.predict(test_df.drop("steps_10000", axis=1))
)
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)

Train accuracy: 0.8891666666666667
Test accuracy: 0.86


With such a small forest and such shallow trees, the model is probably underfitting slightly: it reaches only about 0.89 accuracy even on the data it was trained on.

### New Hyperparameter Values

Let's change our hyperparameter values to see if we can get better accuracy.

In [0]:
rfc.set_params(max_depth=5, n_estimators=10)

Out[14]: RandomForestClassifier(max_depth=5, n_estimators=10, random_state=42)

In [0]:
# Fit the model
rfc.fit(train_df.drop("steps_10000", axis=1), train_df["steps_10000"])

# Train accuracy
train_accuracy = accuracy_score(
  train_df["steps_10000"], 
  rfc.predict(train_df.drop("steps_10000", axis=1))
)

# Test accuracy
test_accuracy = accuracy_score(
  test_df["steps_10000"], 
  rfc.predict(test_df.drop("steps_10000", axis=1))
)
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)

Train accuracy: 0.9191666666666667
Test accuracy: 0.8683333333333333


That's a little better: test accuracy rose from 0.86 to about 0.868.

### One More Time

Let's try one more time.

In [0]:
rfc.set_params(max_depth=8, n_estimators=100)

# Fit the model
rfc.fit(train_df.drop("steps_10000", axis=1), train_df["steps_10000"])

# Train accuracy
train_accuracy = accuracy_score(
  train_df["steps_10000"], 
  rfc.predict(train_df.drop("steps_10000", axis=1))
)

# Test accuracy
test_accuracy = accuracy_score(
  test_df["steps_10000"], 
  rfc.predict(test_df.drop("steps_10000", axis=1))
)
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)

Train accuracy: 0.9475
Test accuracy: 0.88


And that's even better! Hopefully it's clear how changing the hyperparameter values can affect the training process and, as a result, the performance of the model.
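
The three rounds above repeat the same fit-and-score steps with different values. The sketch below wraps those steps in a helper and loops over the same three settings. It assumes synthetic data from `make_classification` in place of the course table, so its accuracies will not match the ones shown above; `fit_and_score`, `settings`, and `results` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the adsda tables)
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=42
)

def fit_and_score(max_depth, n_estimators):
    """Fit a forest with the given values and return (train, test) accuracy."""
    model = RandomForestClassifier(
        max_depth=max_depth, n_estimators=n_estimators, random_state=42
    )
    model.fit(X_train, y_train)
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    return train_accuracy, test_accuracy

# The (max_depth, n_estimators) pairs tried manually in this demo
settings = [(2, 3), (5, 10), (8, 100)]

results = {}
for max_depth, n_estimators in settings:
    results[(max_depth, n_estimators)] = fit_and_score(max_depth, n_estimators)
    train_accuracy, test_accuracy = results[(max_depth, n_estimators)]
    print(f"max_depth={max_depth}, n_estimators={n_estimators}: "
          f"train={train_accuracy:.3f}, test={test_accuracy:.3f}")
```

Comparing the train and test columns side by side makes it easier to see when a deeper, larger forest improves training accuracy more than test accuracy.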

**Question:** How could we determine the optimal values for our hyperparameters?

Through the rest of this lesson, we'll look at how to optimize these values.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>