d-sandbox

<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Feature Selection Lab

**Objective**: *Apply feature selection to a dataset to derive more meaningful features and improve predictions.*

In this lab, you will apply what you've learned in this lesson. When complete, please use the answers to the exercises to answer questions in the following quiz within Coursera.

In [0]:
%run "../../Includes/Classroom-Setup"

## Exercise 1

In this exercise, you will create a user-level table with the following columns:

1. `avg_resting_heartrate` – the average resting heartrate
1. `avg_active_heartrate` - the average active heartrate
1. `avg_bmi` – the average BMI
1. `avg_vo2` - the average oxygen volume
1. `avg_workout_minutes` - the average of total workout minutes
1. `avg_steps` - the average of total steps
1. `lifestyle` - the lifestyle that best describes the observation

Run the cell below to create the table.

In [0]:
%sql

CREATE OR REPLACE TABLE adsda.ht_user_metrics_pca
USING DELTA LOCATION "/adsda/ht-user-metrics-pca" AS (
  SELECT avg(resting_heartrate) AS avg_resting_heartrate,
  avg(active_heartrate) AS avg_active_heartrate,
  avg(bmi) AS avg_bmi,
  avg(vo2) AS avg_vo2,
  avg(workout_minutes) AS avg_workout_minutes,
  avg(steps) AS avg_steps,
  first(lifestyle) AS lifestyle
  FROM adsda.ht_daily_metrics
  GROUP BY device_id
)

num_affected_rows,num_inserted_rows


Run the cell below to convert to a Pandas DataFrame and introduce missing values.

In [0]:
import numpy as np
import pandas as pd
np.random.seed(0)
df = spark.table("adsda.ht_user_metrics_pca").toPandas()

In [0]:
df.describe()

Unnamed: 0,avg_resting_heartrate,avg_active_heartrate,avg_bmi,avg_vo2,avg_workout_minutes,avg_steps,dummy_Athlete,dummy_Cardio Enthusiast,dummy_Sedentary,dummy_Weight Trainer
count,3000.0,2460.0,3000.0,3000.0,3000.0,2850.0,3000.0,3000.0,3000.0,3000.0
mean,62.26662,120.061389,22.902468,32.351569,35.57314,10209.555233,0.286333,0.354667,0.104,0.255
std,12.521525,17.118151,4.49268,7.029757,12.472619,2991.656138,0.452122,0.478492,0.305311,0.435934
min,45.04649,82.041834,7.592313,10.934276,4.219295,5047.646575,0.0,0.0,0.0,0.0
25%,52.024483,106.470456,19.761279,27.334516,32.626821,7183.227397,0.0,0.0,0.0,0.0
50%,58.526237,117.792745,22.912607,33.212109,36.840635,10836.173973,0.0,0.0,0.0,0.0
75%,70.799247,132.390796,26.005915,37.412472,41.755371,12761.878767,1.0,1.0,0.0,1.0
max,105.810105,182.959229,38.47558,50.749876,103.628452,17481.479452,1.0,1.0,1.0,1.0


In [0]:
df.loc[df.sample(frac=0.18).index, 'avg_active_heartrate'] = np.nan
df.loc[df.sample(frac=0.05).index, 'avg_steps'] = np.nan
df.shape

## Exercise 2

In this exercise, you'll one-hot encode the `lifestyle` column.

Fill in the blanks below to complete the task.

In [0]:
# TODO
df = pd.get_dummies(df, prefix='dummy', columns=['lifestyle'])

Run this cell to ensure that all columns are numeric.

In [0]:
df = df.apply(pd.to_numeric)

## Exercise 2

In this exercise, you'll one-hot encode the `lifestyle` column.

Fill in the blanks below to complete the task.

In [0]:
# TODO
from sklearn.model_selection import train_test_split

train_df, inference_df = train_test_split(df, train_size=0.85, test_size=0.15, random_state=42)

In [0]:
train_df.describe()

Unnamed: 0,avg_resting_heartrate,avg_active_heartrate,avg_bmi,avg_vo2,avg_workout_minutes,avg_steps,dummy_Athlete,dummy_Cardio Enthusiast,dummy_Sedentary,dummy_Weight Trainer
count,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0
mean,62.287825,119.629108,22.984884,32.35609,35.491891,10230.095076,0.281961,0.357255,0.103137,0.257647
std,12.52367,15.522505,4.553164,7.06048,12.434204,2924.339027,0.450043,0.479285,0.304198,0.437424
min,45.04649,83.969552,7.592313,11.522095,4.537248,5047.646575,0.0,0.0,0.0,0.0
25%,52.019572,108.96469,19.756044,27.240501,32.497853,7194.461644,0.0,0.0,0.0,0.0
50%,58.619246,117.850047,22.998389,33.12104,36.646534,10836.756164,0.0,0.0,0.0,0.0
75%,70.850809,128.478442,26.118651,37.461475,41.548309,12654.369178,1.0,1.0,0.0,1.0
max,105.810105,182.959229,38.47558,50.749876,103.628452,17481.479452,1.0,1.0,1.0,1.0


-sandbox
**Coursera Quiz:** How many rows have missing values in the `min_steps` column in the training set?

Write your code in the empty cell below to answer the question.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Refer back to the previous lesson for guidance on how to complete this task.

In [0]:
# TODO
train_df['avg_steps'].isnull().sum()

-sandbox
## Exercise 3

In this exercise, you will fill in these missing values. Using the identified columns from the previous exercise, fill in the missing values with the mean of their respective column.

Fill in the blanks below to complete the task.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Recall that we want to find the mean of training set and use that to impute values on the training set *and* the test set.

In [0]:
mean_training_data_steps = train_df['avg_steps'].median()
train_df['avg_steps'] = train_df['avg_steps'].fillna(mean_training_data_steps)
mean_training_data_heartrate = train_df['avg_active_heartrate'].median()
train_df['avg_active_heartrate'] = train_df['avg_active_heartrate'].fillna(mean_training_data_heartrate)

mean_inference_data_steps = inference_df['avg_steps'].median()
inference_df['avg_steps'] = inference_df['avg_steps'].fillna(mean_inference_data_steps)
mean_inference_data_heartrate = inference_df['avg_active_heartrate'].median()
inference_df['avg_active_heartrate'] = inference_df['avg_active_heartrate'].fillna(mean_inference_data_heartrate)

**Coursera Quiz:** What is the mean of the `min_steps_mean` feature rounded to the nearest hundredth place?

In [0]:
# TODO # ERRADO
mean_training_data_steps

## Exercise 4

Create the `X_train`, `X_test`, `y_train`, `y_test` from the train_df. Recall that we are trying to predict the `avg_bmi`.

Fill in the blanks below to complete the task.

In [0]:
# TODO
from sklearn.model_selection import train_test_split

X = train_df.drop('avg_bmi', axis=1)
y = train_df['avg_bmi']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=42)

**Coursera Quiz**: How many rows are in the training set?

In [0]:
X_train.shape

## Exercise 5

In this exercise, you will fit a LASSO model. Fill in the blanks to fit a model with a `0.01` alpha, then run the cells to check coefficients.

Fill in the blanks below to complete the task.

In [0]:
# TODO
from sklearn.linear_model import Lasso

lr = Lasso(alpha=.01)
lr.fit(X_train, y_train)

Print out the R^2 score

In [0]:
print(lr.score(X_test, y_test))

**Coursera Quiz**: Which feature had the largest coefficient?

Uncomment the next cell to see a printout of features and their corresponding coefficients.

In [0]:
pd.DataFrame(list(zip(lr.coef_, X.columns)), columns=['coef', 'feature_name']).sort_values('coef', ascending=False)

Unnamed: 0,coef,feature_name
5,6.738475,dummy_Athlete
8,5.746307,dummy_Weight Trainer
1,0.016868,avg_active_heartrate
4,-0.000351,avg_steps
2,-0.132007,avg_vo2
0,-0.232984,avg_resting_heartrate
3,-0.605483,avg_workout_minutes
6,-1.520724,dummy_Cardio Enthusiast
7,-15.212624,dummy_Sedentary


## Exercise 6

In this exercise, you will take the feature with the highest coeficients and refit a model.

Fill in the blanks below to complete the task.

In [0]:
# TODO
X = train_df[['dummy_Weight Trainer']]
y = train_df['avg_bmi']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=42)
lr = Lasso(alpha=.01)
lr.fit(X_train, y_train)

Compute the the R-squared score.

Fill in the blanks below to complete the task.

In [0]:
# TODO
lr.score(X_test, y_test)

Congrats! That concludes our lesson on feature selection!

Be sure to submit your quiz answers to Coursera, and join us in the next lesson to learn about tree based models!

-sandbox
&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>