
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# A Review of Assigning Classes

**Objective**: *Demonstrate how to assign classes based on predicted probabilities.*

In this video, we review how to assign classes based on predicted probabilities, and we look at how different threshold values change those assignments.

In [0]:
%run "../../Includes/Classroom-Setup"

Out[2]: DataFrame[]

## Prepare data

We will once again be using the `adsda.ht_user_metrics_lifestyle` table that we previously created. This time, however, we will attempt to predict which lifestyle class each user falls into:

In [0]:
ht_lifestyle_pd_df = spark.table("adsda.ht_user_metrics_lifestyle").toPandas()

In [0]:
ht_lifestyle_pd_df['lifestyle'].unique().tolist()

Out[7]: ['Sedentary', 'Weight Trainer', 'Athlete', 'Cardio Enthusiast']

We want to group these four categories into two: users who are "Sedentary" or "Cardio Enthusiast", and users who are "Athlete" or "Weight Trainer".

Therefore we will convert the four categories to two numeric classes: 0 for "Sedentary" and "Cardio Enthusiast", and 1 for "Athlete" and "Weight Trainer". We will accomplish this using a list comprehension and Pandas.

In [0]:
ht_lifestyle_pd_df['lifestyle_num'] = [0 if (x=='Sedentary' or x=='Cardio Enthusiast') else 1 for x in ht_lifestyle_pd_df['lifestyle']]
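
The same grouping can also be written as a vectorised pandas expression. The sketch below uses a small hypothetical DataFrame (`toy_df` is a stand-in, not the course table) and `Series.isin`, and assumes the four labels listed above are the only values in the column.

```python
import pandas as pd

# Hypothetical toy data standing in for ht_lifestyle_pd_df
toy_df = pd.DataFrame({
    "lifestyle": ["Sedentary", "Weight Trainer", "Athlete", "Cardio Enthusiast"]
})

# Labels that map to class 0; every other label maps to class 1
class_0_labels = ["Sedentary", "Cardio Enthusiast"]

# isin() is True for class-0 labels; negating and casting to int
# turns class-0 labels into 0 and the remaining labels into 1
toy_df["lifestyle_num"] = (~toy_df["lifestyle"].isin(class_0_labels)).astype(int)

print(toy_df)
```

On a column this small the two approaches behave the same; the vectorised form mainly avoids a Python-level loop on larger tables.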

We can check how many of each category there are:

In [0]:
ht_lifestyle_pd_df['lifestyle_num'].value_counts()

Out[9]: 1    1624
0    1376
Name: lifestyle_num, dtype: int64
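
Class shares can be easier to read than raw counts. As a rough sketch, the snippet below rebuilds a Series with the same class counts as the output above (the Series is a stand-in, not the course table) and passes `normalize=True` to `value_counts`.

```python
import pandas as pd

# Stand-in Series with the same class counts as the output above
lifestyle_num = pd.Series([1] * 1624 + [0] * 1376, name="lifestyle_num")

# normalize=True returns the share of each class instead of the raw count
class_share = lifestyle_num.value_counts(normalize=True)

print(class_share.round(3))
```

With these counts the two classes are roughly balanced, at about 54% for class 1 and 46% for class 0.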

## Build a classification model

Now that we have a two-class target, we will build a model to predict which class a user is in.

In [0]:
X = ht_lifestyle_pd_df[['avg_resting_heartrate', 'avg_active_heartrate', 'avg_bmi', 'avg_vo2', 'avg_workout_minutes', 'avg_steps']]

y = ht_lifestyle_pd_df['lifestyle_num']

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

##### Fit a random forest classifier

In [0]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()

rf.fit(X_train, y_train)

Out[13]: RandomForestClassifier()

## Examine the output

In this example, we are not interested in evaluating how well the model performs (for example, its accuracy); we are only looking at the predicted classes. Therefore, we will skip the classification metrics and go straight to the predictions.

In [0]:
import pandas as pd

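# Collect the true class, the predicted class, and the predicted
# probability of class 0 for each row of the test set
preds_df = pd.DataFrame({
    'lifestyle': y_test,
    'lifestyle_predicted': rf.predict(X_test),
    'predicted_proba_0': rf.predict_proba(X_test)[:, 0]
})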



In [0]:
preds_df.sample(20)

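| | lifestyle | lifestyle_predicted | predicted_proba_0 |
|---|---|---|---|
| 2722 | 1 | 1 | 0.0 |
| 1011 | 0 | 0 | 1.0 |
| 2687 | 0 | 0 | 1.0 |
| 1415 | 1 | 1 | 0.03 |
| 2385 | 1 | 1 | 0.01 |
| 945 | 0 | 0 | 1.0 |
| 1069 | 1 | 1 | 0.0 |
| 2230 | 1 | 1 | 0.01 |
| 2657 | 1 | 1 | 0.0 |
| 2611 | 1 | 1 | 0.0 |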


## Adjusting the threshold for predicted class

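By default, `predict` assigns each sample to the class with the highest predicted probability. With two classes, this means a sample is assigned to class 0 when its predicted probability for that class is greater than 0.5, so 0.5 acts as the default threshold.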
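
To make the default rule concrete, here is a minimal sketch on synthetic data (generated with `make_classification`, not the course table; all variable names are illustrative). It checks that, for a fitted `RandomForestClassifier`, `predict` returns the class whose column in `predict_proba` is largest, and that the columns follow the order of `classes_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=25, random_state=0)
rf_demo.fit(X_demo, y_demo)

# One row per sample, one column per class
proba = rf_demo.predict_proba(X_demo)

# Columns of predict_proba are ordered like rf_demo.classes_
print(rf_demo.classes_)

# Each row of probabilities sums to 1
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

# predict() matches the class with the largest predicted probability
argmax_classes = rf_demo.classes_[np.argmax(proba, axis=1)]
matches_predict = np.array_equal(argmax_classes, rf_demo.predict(X_demo))

print(rows_sum_to_one, matches_predict)
```

Because the two probabilities in each row sum to 1, choosing the larger one is the same as comparing either column against 0.5.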

In this case, we have decided to err on the side of caution: we want to be very careful not to mistakenly assign someone to class 0 (sedentary or cardio enthusiast), even if that means we incorrectly assign some users to class 1. We will do this by raising the probability threshold for class 0 to 0.7. This means that we only assign someone to class 0 if the model says the probability of belonging to that class is greater than 0.7 (70%).

In [0]:
preds_df['lifestyle_predicted_adjusted'] = [0 if x > 0.7 else 1 for x in preds_df['predicted_proba_0']]
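
The same rule can be wrapped in a small helper so that different thresholds are easy to try. `assign_class` and the sample probabilities below are hypothetical, introduced only for illustration; the sketch assumes NumPy is available and that the input holds class-0 probabilities.

```python
import numpy as np

def assign_class(proba_0, threshold=0.5):
    """Return 0 where the class-0 probability exceeds threshold, else 1."""
    proba_0 = np.asarray(proba_0, dtype=float)
    return np.where(proba_0 > threshold, 0, 1)

# Hypothetical class-0 probabilities
example_proba_0 = [0.95, 0.65, 0.55, 0.30]

default_classes = assign_class(example_proba_0)        # threshold 0.5
cautious_classes = assign_class(example_proba_0, 0.7)  # threshold 0.7

print(default_classes.tolist())   # [0, 0, 0, 1]
print(cautious_classes.tolist())  # [0, 1, 1, 1]
```

Applied to the notebook's data, a call such as `assign_class(preds_df['predicted_proba_0'], 0.7)` should give the same assignments as the list comprehension above.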

In [0]:
preds_df.sample(20)

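| | lifestyle | lifestyle_predicted | predicted_proba_0 | lifestyle_predicted_adjusted |
|---|---|---|---|---|
| 651 | 0 | 0 | 0.95 | 0 |
| 212 | 1 | 1 | 0.0 | 1 |
| 1914 | 0 | 0 | 1.0 | 0 |
| 2482 | 1 | 1 | 0.0 | 1 |
| 1815 | 0 | 0 | 1.0 | 0 |
| 2192 | 1 | 1 | 0.0 | 1 |
| 2263 | 0 | 0 | 1.0 | 0 |
| 2871 | 1 | 1 | 0.0 | 1 |
| 995 | 1 | 1 | 0.0 | 1 |
| 1113 | 1 | 1 | 0.0 | 1 |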
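
One way to see the effect of the stricter threshold is to cross-tabulate the default and adjusted assignments. The following is a sketch on hypothetical probabilities (`demo_df` and its values are made up for illustration and do not come from the model above).

```python
import pandas as pd

# Hypothetical class-0 probabilities, chosen to straddle both thresholds
demo_df = pd.DataFrame({"predicted_proba_0": [0.95, 0.65, 0.55, 0.30, 0.72, 0.50]})

# Class 1 whenever the class-0 probability does not exceed the threshold
demo_df["default"] = (demo_df["predicted_proba_0"] <= 0.5).astype(int)
demo_df["adjusted"] = (demo_df["predicted_proba_0"] <= 0.7).astype(int)

# Number of rows whose assignment differs between the two thresholds
n_changed = int((demo_df["default"] != demo_df["adjusted"]).sum())

# Counts for each (default, adjusted) combination
comparison = pd.crosstab(demo_df["default"], demo_df["adjusted"])

print(n_changed)
print(comparison)
```

Only rows with a class-0 probability above 0.5 but not above 0.7 change class, and they always move from class 0 to class 1.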


&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>