
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Weighting Classes in Random Forest

**Objective**: *Demonstrate label-based record weighting as a method for balancing classes in evaluation.*

In this video we will demonstrate how to perform record weighting based on the class distribution in the training data set, in order to achieve equal weighting of label classes when evaluating models.

In [0]:
%pip install imbalanced-learn

In [0]:
%run "../../Includes/Classroom-Setup"

## Prepare data

We will once again be using the `adsda.ht_user_metrics_lifestyle` table that we previously created, but now joined to the `adsda.ht_users` table, and predicting which country a user is from.

In [0]:
%sql
CREATE OR REPLACE TABLE adsda.ht_user_metrics_lifestyle_country
USING DELTA LOCATION "/adsda/ht-user-metrics-lifestyle_country" AS (
  SELECT metrics.*, users.country 
  FROM adsda.ht_user_metrics_lifestyle AS metrics
  JOIN adsda.ht_users AS users
  ON metrics.device_id = users.device_id
  )

In [0]:
ht_df = spark.table("adsda.ht_user_metrics_lifestyle_country").toPandas()

In [0]:
ht_df.head()

| | device_id | avg_resting_heartrate | avg_active_heartrate | bmi | avg_vo2 | avg_workout_minutes | steps | country |
|---|---|---|---|---|---|---|---|---|
| 0 | fce425f2-e48a-11ea-8204-0242ac110002 | 73.846705 | 141.766552 | 25.972346 | 30.314502 | 35.606041 | 7007.131507 | United States |
| 1 | fd2073e0-e48a-11ea-8204-0242ac110002 | 66.651361 | 147.19022 | 28.657224 | 26.331489 | 4.933199 | 5222.191781 | United States |
| 2 | d5b6536a-e48a-11ea-8204-0242ac110002 | 61.535264 | 115.354649 | 28.069176 | 30.505854 | 26.808979 | 11651.545205 | United States |
| 3 | d62d31e2-e48a-11ea-8204-0242ac110002 | 60.127616 | 109.560125 | 24.272347 | 33.00946 | 30.203698 | 12232.284932 | United States |
| 4 | d72e7fc4-e48a-11ea-8204-0242ac110002 | 57.679282 | 107.348045 | 26.136668 | 33.622192 | 41.929783 | 10685.441096 | United States |


We can check how many of each class there are:

In [0]:
print(ht_df['country'].value_counts())

In [0]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

ht_df['country_cat'] = le.fit_transform(ht_df['country'])

ht_df.head(5)

| | device_id | avg_resting_heartrate | avg_active_heartrate | bmi | avg_vo2 | avg_workout_minutes | steps | country | country_cat |
|---|---|---|---|---|---|---|---|---|---|
| 0 | fce425f2-e48a-11ea-8204-0242ac110002 | 73.846705 | 141.766552 | 25.972346 | 30.314502 | 35.606041 | 7007.131507 | United States | 1 |
| 1 | fd2073e0-e48a-11ea-8204-0242ac110002 | 66.651361 | 147.19022 | 28.657224 | 26.331489 | 4.933199 | 5222.191781 | United States | 1 |
| 2 | d5b6536a-e48a-11ea-8204-0242ac110002 | 61.535264 | 115.354649 | 28.069176 | 30.505854 | 26.808979 | 11651.545205 | United States | 1 |
| 3 | d62d31e2-e48a-11ea-8204-0242ac110002 | 60.127616 | 109.560125 | 24.272347 | 33.00946 | 30.203698 | 12232.284932 | United States | 1 |
| 4 | d72e7fc4-e48a-11ea-8204-0242ac110002 | 57.679282 | 107.348045 | 26.136668 | 33.622192 | 41.929783 | 10685.441096 | United States | 1 |
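
To read the confusion matrices later on, it helps to know which integer stands for which country. `LabelEncoder` sorts the unique labels and uses each label's position in that sorted order as its code. The sketch below reproduces that rule in plain Python; the second country name (`"Canada"`) is a hypothetical placeholder, since only `United States` is visible in the output above.

```python
# Hypothetical label values: only "United States" appears in the notebook output,
# "Canada" is a placeholder for the other class
countries = ["United States", "Canada", "United States", "United States", "Canada"]

# Sort the unique labels; each label's position becomes its integer code
classes = sorted(set(countries))
code_of = {label: code for code, label in enumerate(classes)}

# Encode every record with the code of its label
encoded = [code_of[label] for label in countries]

print(code_of)
print(encoded)
```

On the fitted encoder itself, `le.classes_` holds the sorted labels and `le.inverse_transform` maps codes back to country names.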


In [0]:
# Features: every column except the identifier and the two label columns
X = ht_df.drop(columns=["country", "country_cat", "device_id"])

# Target: the integer-encoded country
y = ht_df['country_cat']

## Train a random forest model using class weights

Recall that sklearn has a built-in utility function that calculates class weights from class frequencies: each class is weighted inversely proportional to how frequently it appears in the data.

The same calculation is available through the `class_weight` parameter of a model, which accepts several options:

 - `None`
   - this is the default
   - the class weights will be uniform
 - `"balanced"`
   - the class weights are calculated automatically from the class frequencies
 - `"balanced_subsample"`
   - same as `"balanced"`, except that weights are computed based on the bootstrap sample for each individual tree
 - as a dictionary
   - the keys are the classes and the values are the desired class weights
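
As a rough illustration of what the `balanced` option computes, the scikit-learn documentation gives the formula `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula in plain Python to hypothetical toy labels (the helper name `balanced_class_weights` and the 90/10 split are assumptions for illustration, not part of this notebook), and then to one bootstrap sample to suggest how `balanced_subsample` can differ from tree to tree.

```python
from collections import Counter
import random

def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in sorted(counts.items())}

# Hypothetical labels: 90 records of class 1 and 10 records of class 0
y_toy = [1] * 90 + [0] * 10

# Weights computed once from all labels, as with "balanced"
full_weights = balanced_class_weights(y_toy)
print(full_weights)

# Weights computed from one bootstrap sample (drawn with replacement),
# loosely mirroring what "balanced_subsample" does for each tree
rng = random.Random(0)
bootstrap = rng.choices(y_toy, k=len(y_toy))
print(balanced_class_weights(bootstrap))
```

With these weights, each class contributes the same total weight: 10 records at weight 5.0 and 90 records at roughly 0.56 both sum to 50.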

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [0]:
y_train.value_counts()

In [0]:
y_test.value_counts()

**Default (None):**

In [0]:
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(class_weight=None)

rf.fit(X_train, y_train)

print(confusion_matrix(y_test, rf.predict(X_test)))

**Balanced:**

In [0]:
rf = RandomForestClassifier(class_weight="balanced")

rf.fit(X_train, y_train)

print(confusion_matrix(y_test, rf.predict(X_test)))

**Balanced subsample:**

In [0]:
rf = RandomForestClassifier(class_weight="balanced_subsample")

rf.fit(X_train, y_train)

print(confusion_matrix(y_test, rf.predict(X_test)))

**Dictionary of ratios:**

We can calculate the exact weights that would evenly balance the classes and pass them to the model as a class weight dictionary. The `compute_class_weight` function in `sklearn.utils.class_weight` does this calculation.

In [0]:
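import numpy as np
from sklearn.utils import class_weight

# compute_class_weight expects the class labels as a NumPy array
classes = np.unique(y_train)

# Compute the weights from the training labels only, so the test set plays no part
weights = class_weight.compute_class_weight(class_weight='balanced', classes=classes, y=y_train)

print(weights)

In [0]:
# Map each class label to its weight
class_weights_dict = dict(zip(classes.tolist(), weights))

print(class_weights_dict)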

In [0]:
rf = RandomForestClassifier(class_weight=class_weights_dict)

rf.fit(X_train, y_train)

print(confusion_matrix(y_test, rf.predict(X_test)))

In [0]:
# Deliberately extreme weights, to see how far class weighting alone can shift the predictions
rf = RandomForestClassifier(class_weight={0: 999, 1: 0.0009})

rf.fit(X_train, y_train)

print(confusion_matrix(y_test, rf.predict(X_test)))
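
The confusion matrices printed above are easier to compare when reduced to per-class recall. In sklearn's `confusion_matrix`, rows are the true classes and columns are the predicted classes, so the recall of class `i` is the diagonal entry divided by the row sum. The sketch below uses a hypothetical matrix (not output from this notebook) to show how plain accuracy can look good while the minority class is almost never predicted correctly.

```python
def per_class_recall(cm):
    """Fraction of each true class (row) that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]

# Hypothetical confusion matrix: rows are true classes, columns are predictions
cm = [[5, 45],
      [10, 440]]

recalls = per_class_recall(cm)

# Plain accuracy: correct predictions over all records
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

# Balanced accuracy: the unweighted mean of the per-class recalls
balanced_accuracy = sum(recalls) / len(recalls)

print("recall per class:", recalls)
print("accuracy:", accuracy)
print("balanced accuracy:", balanced_accuracy)
```

The mean of the per-class recalls is what `sklearn.metrics.balanced_accuracy_score` reports, and `sklearn.metrics.classification_report` lists recall for each class.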

The performance isn't improving with any of these class-weighting methods, which suggests that we might need to try some hyperparameter tuning, or another type of machine learning model altogether, to obtain more accurate classifications. First, we can check how a random forest model would do if we balanced the classes by oversampling before training, instead of using class weights.

In [0]:
from imblearn.over_sampling import RandomOverSampler

# Split first, so that duplicated minority rows cannot end up in the test set
X_train, X_test, y_train, y_test = train_test_split(X, y)

oversample = RandomOverSampler(sampling_strategy='minority')

# Oversample the training split only; the test split keeps its original distribution
X_over, y_over = oversample.fit_resample(X_train, y_train)

print(y_over.value_counts())

In [0]:
rf = RandomForestClassifier()

rf.fit(X_over, y_over)

print(confusion_matrix(y_test, rf.predict(X_test)))
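
One point to keep in mind with random oversampling is the order of resampling and splitting. Because oversampling duplicates minority rows, resampling the full data set and then splitting can place copies of the same row in both the training and the test set, which tends to make the test results look better than they are. The sketch below illustrates this in plain Python on hypothetical `(row_id, label)` pairs; the helper names are assumptions for illustration and it does not use `imblearn`.

```python
import random

rng = random.Random(0)

# Hypothetical data: (row_id, label) pairs with 40 majority and 8 minority rows
rows = [(i, 1) for i in range(40)] + [(100 + i, 0) for i in range(8)]

def oversample_minority(data, rng):
    """Randomly duplicate label-0 rows until both classes have equal counts."""
    minority = [r for r in data if r[1] == 0]
    majority = [r for r in data if r[1] == 1]
    if not minority:
        return data
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return data + extra

def split(data, rng, test_fraction=0.25):
    """Shuffle a copy of the data and cut it into train and test parts."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def shared_ids(train, test):
    """Row ids that occur in both the train and the test part."""
    return {r[0] for r in train} & {r[0] for r in test}

# Oversample, then split: copies of one row can land on both sides
train_a, test_a = split(oversample_minority(rows, rng), rng)

# Split, then oversample the training part only: the test part stays untouched
train_b, test_b = split(rows, rng)
train_b = oversample_minority(train_b, rng)

print("shared ids, oversample then split:", len(shared_ids(train_a, test_a)))
print("shared ids, split then oversample:", len(shared_ids(train_b, test_b)))
```

Splitting first guarantees that no test row has a copy in the training data, because the duplicates are drawn from the training part alone.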

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>