
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Applied Random Forest Lab

**Objective**: *Apply random forests to a regression problem in an effort to improve model generalization.*

In this lab, you will complete a series of guided exercises to build a random forest model for a regression problem. You will need to prepare the categorical variable appropriately and assess the model's output. When you are done, use your answers to the exercises to answer the questions in the accompanying Coursera quiz.

In [0]:
%run ../../Includes/Classroom-Setup

Out[39]: DataFrame[]

In [0]:
dbutils.fs.ls('adsda')

Out[43]: [FileInfo(path='dbfs:/adsda/ht-daily-metrics-agg/', name='ht-daily-metrics-agg/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/adsda/ht-user-metrics-lab/', name='ht-user-metrics-lab/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/adsda/ht-user-metrics-lifestyle/', name='ht-user-metrics-lifestyle/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/adsda/ht-user-metrics-pca/', name='ht-user-metrics-pca/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/adsda/ht-user-metrics-pca-lab/', name='ht-user-metrics-pca-lab/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/adsda/ht_daily_metrics/', name='ht_daily_metrics/', size=0, modificationTime=0),
 FileInfo(path='dbfs:/adsda/ht_users/', name='ht_users/', size=0, modificationTime=0)]

## Exercise 1

In this exercise, you will use the user-level lifestyle table. Run the following cell to make sure you can access the `adsda.ht_user_metrics_lifestyle` table.

In [0]:
%sql
SELECT *
FROM adsda.ht_user_metrics_lifestyle
LIMIT 10

avg_resting_heartrate,avg_active_heartrate,avg_bmi,avg_vo2,avg_workout_minutes,avg_steps,lifestyle
82.68379727873081,139.43487473206162,22.398063650890798,20.99401157735923,5.5026324666656405,5171.495890410959,Sedentary
77.73294228506452,127.05715346661702,25.150812654086295,25.52747526955064,37.2167018100805,7115.591780821917,Weight Trainer
86.51162895591307,147.31573126952208,19.14825600046248,19.448406520026342,45.00008651086257,7257.693150684931,Weight Trainer
77.55054135762612,129.5770039396946,24.2403757288568,21.40130178285617,37.886068725488464,7129.690410958904,Weight Trainer
68.93310580458204,136.50268661405897,30.726595797380472,28.85523016925364,32.24198398599063,6958.378082191781,Weight Trainer
69.31244794850774,167.18585016710105,27.1326690342849,30.939205114246853,5.119426899323105,5128.024657534246,Sedentary
64.64397544858174,152.9654977304546,29.17716498363452,28.92795344089978,5.015081852287961,5167.789041095891,Sedentary
81.33282756113321,137.57131998347788,20.850071485672636,22.56400630458249,42.37552145726232,7281.586301369863,Weight Trainer
64.79507042723496,139.39836367080545,31.386431213715436,29.096510773429188,33.3298371336183,7029.608219178082,Weight Trainer
89.51117796589962,126.57048164605168,19.83075371640161,19.750462151303648,43.30528046136424,7362.769863013698,Weight Trainer


Fill in the following cell to create a Pandas DataFrame from the Spark table.

In [0]:
# TODO
ht_metrics_pd_df = spark.table("adsda.ht_user_metrics_lifestyle").toPandas()

## Exercise 2

In this exercise, you will encode the categorical `lifestyle` column using `LabelEncoder`.

Fill in the blanks to complete this task.

In [0]:
# TODO
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
ht_metrics_pd_df['lifestyle_cat'] = le.fit_transform(ht_metrics_pd_df['lifestyle'])
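
To make the encoding step concrete, here is a minimal sketch of how `LabelEncoder` maps strings to integers. It uses a small toy DataFrame rather than the lab table; the names `toy_df` and `toy_le` and the `"Athlete"` value are made up for illustration and are not taken from the lab data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the `lifestyle` column ("Athlete" is a hypothetical value)
toy_df = pd.DataFrame(
    {"lifestyle": ["Sedentary", "Weight Trainer", "Sedentary", "Athlete"]}
)

toy_le = LabelEncoder()
toy_df["lifestyle_cat"] = toy_le.fit_transform(toy_df["lifestyle"])

# classes_ is sorted; each label's integer code is its index in this array
print(list(toy_le.classes_))             # ['Athlete', 'Sedentary', 'Weight Trainer']
print(toy_df["lifestyle_cat"].tolist())  # [1, 2, 1, 0]

# inverse_transform recovers the original strings from the codes
print(list(toy_le.inverse_transform([0, 1, 2])))
```

The integer codes follow alphabetical order, so they carry no real ranking. Tree-based models such as random forests are generally tolerant of this, but it is worth keeping in mind when interpreting the encoded column.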

## Exercise 3

In this exercise, you will build a random forest regression model.

We will once again try to predict a user's average VO2 (the `avg_vo2` column) using their other metrics.

Remember to set the `random_state` to 42!

Before splitting the data and fitting the model, import the packages you will need from sklearn for the train test split and the Random Forest regressor.

In [0]:
# TODO
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

In [0]:
# TODO
X = ht_metrics_pd_df[['avg_resting_heartrate', 'avg_active_heartrate', 'avg_bmi', 'avg_steps', 'avg_workout_minutes', 'lifestyle_cat']]
y = ht_metrics_pd_df['avg_vo2']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42) 

rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

y_train_predicted = rf.predict(X_train)
y_test_predicted = rf.predict(X_test)

print("R2 on training set: ", round(rf.score(X_train, y_train), 3))
print("R2 on test set: ", round(rf.score(X_test, y_test), 3))

R2 on training set:  0.992
R2 on test set:  0.944


**Coursera Quiz:** For the `rf` model, what is the R-squared score on the training and test set?
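
As a reminder of what the reported number means, the sketch below computes R-squared by hand and checks that it matches both `score` and `sklearn.metrics.r2_score`. It runs on synthetic data from `make_regression`, which is only a stand-in for the lab table; all `*_demo` and `*d_*` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the lab table (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=42)

demo_rf = RandomForestRegressor(n_estimators=20, random_state=42).fit(Xd_train, yd_train)
pred = demo_rf.predict(Xd_test)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((yd_test - pred) ** 2)
ss_tot = np.sum((yd_test - yd_test.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

# For a regressor, score() reports this same quantity
print(np.isclose(manual_r2, demo_rf.score(Xd_test, yd_test)))
print(np.isclose(manual_r2, r2_score(yd_test, pred)))
```

A gap between the training and test values, like the one printed above for `rf`, is the usual signal to look at when judging how well a model generalizes.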

## Exercise 4

Even though the untuned random forest did very well already, explore how tuning some hyperparameters affects the output.

You will build three models:
1. With `n_estimators=10`
1. With `max_depth=2`
1. With `bootstrap=False`

In [0]:
rf_tuned_1 = RandomForestRegressor(n_estimators=10, random_state=42)

rf_tuned_1.fit(X_train, y_train)

y_train_predicted = rf_tuned_1.predict(X_train)
y_test_predicted = rf_tuned_1.predict(X_test)

print("R2 on training set: ", round(rf_tuned_1.score(X_train, y_train),3))
print("R2 on test set: ", round(rf_tuned_1.score(X_test, y_test), 3))

R2 on training set:  0.99
R2 on test set:  0.937


**Coursera Quiz:** For the `rf_tuned_1` model, what is the R-squared score on the training and test set?
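
To illustrate what `n_estimators` controls, this sketch shows that a fitted forest holds that many trees and that its prediction is the average of the individual tree predictions. The data is synthetic and the variable names are illustrative, so the numbers will not match the lab.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data stands in for the lab table (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=42)

demo_rf = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# The fitted trees are exposed through the estimators_ attribute
print(len(demo_rf.estimators_))

# Averaging the per-tree predictions reproduces the forest prediction
per_tree = np.stack([tree.predict(X_demo) for tree in demo_rf.estimators_])
forest_pred = demo_rf.predict(X_demo)
print(np.allclose(per_tree.mean(axis=0), forest_pred))
```

Fewer trees means the average is taken over fewer models, which tends to make predictions slightly noisier. That is consistent with the small drop in test R-squared seen for `rf_tuned_1`.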

In [0]:
rf_tuned_2 = RandomForestRegressor(max_depth=2, random_state=42)

rf_tuned_2.fit(X_train, y_train)

y_train_predicted = rf_tuned_2.predict(X_train)
y_test_predicted = rf_tuned_2.predict(X_test)

print("R2 on training set: ", round(rf_tuned_2.score(X_train, y_train),3))
print("R2 on test set: ", round(rf_tuned_2.score(X_test, y_test), 3))

R2 on training set:  0.868
R2 on test set:  0.86


**Coursera Quiz:** For the `rf_tuned_2` model, what is the R-squared score on the training and test set?
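
One way to see why `max_depth=2` lowers both scores is to inspect the fitted trees: a tree of depth 2 has at most four leaves, so each tree can output at most four distinct values. The sketch below checks this on synthetic data; `shallow_rf` and the other names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data stands in for the lab table (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=42)

shallow_rf = RandomForestRegressor(max_depth=2, n_estimators=10, random_state=42)
shallow_rf.fit(X_demo, y_demo)

# Inspect the depth and leaf count of every tree in the forest
depths = [tree.get_depth() for tree in shallow_rf.estimators_]
leaves = [tree.get_n_leaves() for tree in shallow_rf.estimators_]

print("max depth across trees:", max(depths))
print("max leaves across trees:", max(leaves))
```

Such shallow trees are heavily constrained, which typically narrows the gap between training and test scores at the cost of lower accuracy overall.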

In [0]:
rf_tuned_3 = RandomForestRegressor(bootstrap=False, random_state=42)

rf_tuned_3.fit(X_train, y_train)

y_train_predicted = rf_tuned_3.predict(X_train)
y_test_predicted = rf_tuned_3.predict(X_test)

print("R2 on training set: ", round(rf_tuned_3.score(X_train, y_train),3))
print("R2 on test set: ", round(rf_tuned_3.score(X_test, y_test), 3))

R2 on training set:  1.0
R2 on test set:  0.901


**Coursera Quiz:** For the `rf_tuned_3` model, what is the R-squared score on the training and test set?
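
A training R-squared of 1.0 is expected when `bootstrap=False`: every tree is trained on the full training set and, with the default unlimited depth, grows until it reproduces each training target. The sketch below demonstrates this on synthetic data, assuming no two rows share identical features with different targets; the names `no_boot` and `with_boot` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data stands in for the lab table (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Without bootstrapping, each fully grown tree sees every training row
no_boot = RandomForestRegressor(bootstrap=False, n_estimators=10, random_state=42)
no_boot.fit(X_demo, y_demo)

# With bootstrapping, each tree sees a resampled copy of the training rows
with_boot = RandomForestRegressor(bootstrap=True, n_estimators=10, random_state=42)
with_boot.fit(X_demo, y_demo)

no_boot_train_r2 = no_boot.score(X_demo, y_demo)
with_boot_train_r2 = with_boot.score(X_demo, y_demo)

print("training R2 without bootstrap:", no_boot_train_r2)
print("training R2 with bootstrap:   ", with_boot_train_r2)
```

A perfect training score says nothing about generalization, so the test score is the number to compare across models.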

**Coursera Quiz:** Which of the tuned random forest models had the best performance on the test set?
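
If you want to compare several configurations side by side, a loop over parameter dictionaries keeps the code short. This is only a sketch on synthetic data: `configs`, `test_scores`, and the other names are illustrative, and the printed values will differ from the lab results.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the lab table (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=42)

# One entry per model variant from Exercise 4
configs = {
    "n_estimators=10": {"n_estimators": 10},
    "max_depth=2": {"max_depth": 2},
    "bootstrap=False": {"bootstrap": False},
}

test_scores = {}
for name, params in configs.items():
    model = RandomForestRegressor(random_state=42, **params).fit(Xd_train, yd_train)
    train_r2 = round(model.score(Xd_train, yd_train), 3)
    test_scores[name] = round(model.score(Xd_test, yd_test), 3)
    print(f"{name}: train R2 = {train_r2}, test R2 = {test_scores[name]}")

# Pick the configuration with the highest test R2
best = max(test_scores, key=test_scores.get)
print("Best on test set:", best)
```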

&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>