

# Oversampling and Undersampling Classes

**Objective**: *Demonstrate how to bootstrap records based on their label values.*

In this video, we will demonstrate how to resample training records with the bootstrap method, guided by the distribution of the target class, so that the resulting training set is more balanced.

In [0]:
%run "../../Includes/Classroom-Setup"

Out[2]: DataFrame[]

We also need to install the following library.

In [0]:
%pip install imbalanced-learn

Python interpreter will be restarted.
Collecting imbalanced-learn
  Using cached imbalanced_learn-0.10.1-py3-none-any.whl (226 kB)
Installing collected packages: imbalanced-learn
Successfully installed imbalanced-learn-0.10.1
Python interpreter will be restarted.


## Prepare data

We will once again be using the `adsda.ht_user_metrics_lifestyle` table that we previously created, and predicting which lifestyle class users fall into.

In [0]:
ht_lifestyle_pd_df = spark.table("adsda.ht_user_metrics_lifestyle").toPandas()

In [0]:
ht_lifestyle_pd_df['lifestyle'].unique().tolist()

Out[2]: ['Sedentary', 'Weight Trainer', 'Athlete', 'Cardio Enthusiast']

In this example, we want to know whether each user is "sedentary" or "active" (any of the other three lifestyle classes). 

Therefore, we will convert the four categories into two numeric classes, 0 and 1, using a list comprehension and pandas.

In [0]:
ht_lifestyle_pd_df['lifestyle_num'] = [0 if x=='Sedentary' else 1 for x in ht_lifestyle_pd_df['lifestyle']]
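As a small illustration of the mapping, here is the same list comprehension applied to the four lifestyle labels shown above. The names `labels` and `encoded` are made up for this sketch and are not part of the notebook.

```python
# The four lifestyle labels returned by unique() above
labels = ['Sedentary', 'Weight Trainer', 'Athlete', 'Cardio Enthusiast']

# Same rule as the notebook cell: 'Sedentary' -> 0, anything else -> 1
encoded = [0 if x == 'Sedentary' else 1 for x in labels]

print(dict(zip(labels, encoded)))
```

Only "Sedentary" maps to 0; the three active lifestyles all collapse into class 1.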

We can check how many of each category there are:

In [0]:
ht_lifestyle_pd_df['lifestyle_num'].value_counts()

Out[4]: 1    2688
0     312
Name: lifestyle_num, dtype: int64

In [0]:
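# Features: every column except the original label and its numeric encoding
X = ht_lifestyle_pd_df.drop(columns=["lifestyle", "lifestyle_num"])

# Target: the binary label (0 = sedentary, 1 = active)
y = ht_lifestyle_pd_df['lifestyle_num']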

## Bootstrap sample the minority class


We now have a dataset with an imbalanced target class: class 1 (2,688 records) is about 8.6 times larger than class 0 (312 records). (This is not a very major imbalance, relatively speaking.) We will attempt to balance the dataset using the bootstrap method.
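The size of the imbalance can be worked out directly from the class counts printed earlier. The following is a minimal sketch using plain Python; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above
counts = {1: 2688, 0: 312}

majority_n = max(counts.values())
minority_n = min(counts.values())

# How many majority records exist per minority record
imbalance_ratio = majority_n / minority_n

# Fraction of all records that belong to the minority class
minority_share = minority_n / sum(counts.values())

print(f"majority/minority ratio: {imbalance_ratio:.2f}")
print(f"minority share of all records: {minority_share:.1%}")
```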

We will perform random oversampling, where we randomly duplicate examples in the minority class by sampling with replacement.

We set the sampling strategy to `minority`, which resamples only the minority class until it is the same size as the majority class.

In [0]:
from imblearn.over_sampling import RandomOverSampler

oversample = RandomOverSampler(sampling_strategy='minority')

Fit and apply the oversample transformation:

In [0]:
X_over, y_over = oversample.fit_resample(X, y)

In [0]:
y_over.value_counts()

Out[8]: 0    2688
1    2688
Name: lifestyle_num, dtype: int64
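To make the idea of random oversampling concrete, here is a rough sketch in plain Python. It is not how `RandomOverSampler` is implemented; `random_oversample` is a hypothetical helper written for this illustration, and the `rows` list simply stands in for the feature records.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Keep every record and add minority records, drawn with
    replacement, until both classes have the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority_label = min(counts, key=counts.get)
    majority_n = max(counts.values())

    # Positions of the minority records we are allowed to duplicate
    minority_idx = [i for i, lab in enumerate(labels) if lab == minority_label]

    # Number of extra minority records needed to match the majority
    n_extra = majority_n - counts[minority_label]
    extra_idx = rng.choices(minority_idx, k=n_extra)  # with replacement

    new_rows = list(rows) + [rows[i] for i in extra_idx]
    new_labels = list(labels) + [labels[i] for i in extra_idx]
    return new_rows, new_labels

# Toy data with the same class counts as the notebook
labels = [1] * 2688 + [0] * 312
rows = list(range(len(labels)))

rows_over, labels_over = random_oversample(rows, labels)
over_counts = Counter(labels_over)
print(over_counts)
```

Because the extra records are exact copies of existing minority records, oversampling adds no new information; it only changes how heavily the minority class is weighted during training.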

## Undersample the majority class

Next, we will perform random undersampling, where we randomly delete examples from the majority class.

Setting the sampling strategy to `majority` will make the majority class the same size as the minority class.

In [0]:
from imblearn.under_sampling import RandomUnderSampler

undersample = RandomUnderSampler(sampling_strategy='majority')

In [0]:
X_under, y_under = undersample.fit_resample(X, y)

In [0]:
y_under.value_counts()

Out[11]: 0    312
1    312
Name: lifestyle_num, dtype: int64
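Random undersampling can be sketched in the same spirit. Again, this is only an illustration in plain Python and not the library's implementation; `random_undersample` is a hypothetical helper, and `rows` stands in for the feature records.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=0):
    """Keep all minority records and a random subset of majority
    records (drawn without replacement) of the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_label = max(counts, key=counts.get)
    minority_n = min(counts.values())

    # Positions of the majority records, from which we keep a subset
    majority_idx = [i for i, lab in enumerate(labels) if lab == majority_label]
    kept_majority = set(rng.sample(majority_idx, k=minority_n))

    # Keep every non-majority record plus the sampled majority records
    keep = [i for i, lab in enumerate(labels)
            if lab != majority_label or i in kept_majority]
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data with the same class counts as the notebook
labels = [1] * 2688 + [0] * 312
rows = list(range(len(labels)))

rows_under, labels_under = random_undersample(rows, labels)
under_counts = Counter(labels_under)
print(under_counts)
```

Unlike oversampling, this discards data: here 2,376 of the 2,688 majority records are dropped.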

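Instead of making the majority class exactly the same size as the minority class, we can pass a float that specifies the desired ratio of minority to majority records after resampling. With 312 minority records, a ratio of 0.75 keeps 312 / 0.75 = 416 majority records: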

In [0]:
undersample_2 = RandomUnderSampler(sampling_strategy=0.75)

In [0]:
X_under2, y_under2 = undersample_2.fit_resample(X, y)

In [0]:
y_under2.value_counts()

Out[14]: 1    416
0    312
Name: lifestyle_num, dtype: int64
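The counts above follow from simple arithmetic on the ratio. The sketch below reproduces that calculation in plain Python; `majority_target_size` is a hypothetical helper, and truncating to an integer is an assumption about how fractional results would be handled.

```python
def majority_target_size(n_minority, ratio):
    """Number of majority records to keep so that
    n_minority / n_majority is (approximately) equal to ratio."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in the interval (0, 1]")
    # Assumption: fractional results are truncated toward zero
    return int(n_minority / ratio)

n_keep = majority_target_size(312, 0.75)
print(n_keep)
```

A ratio of 1.0 gives a fully balanced set, while smaller ratios keep more of the majority class.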

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>