### VIDEO 9: HANDLING IMBALANCED DATASET

Hello learners!

In this video, let's use the techniques that we have learnt in the last video to handle class imbalance. Apart from that, we will also be using oversampling and undersampling techniques to balance the dataset. So, let's get started!

First, let's preprocess the data just as we did while building the classification and regression models.

In [1]:
#Importing libraries
import pandas as pd
import numpy as np

#Loading the data
data = pd.read_csv("Synergix_data_preprocessed_new.csv")

#Storing the ratio in a list named Rating_ratio
Rating_ratio = []
for row in data.values:
    if(row[4]+row[5] == 0):
        if(row[7]+row[8] == 0):
            #If all the ratings are zero then overall rating ratio will also be zero
            Rating_ratio.append(0.0)
        else:
            #If only the denominator (1 and 2 star) ratings are zero then adding -99999 to the list temporarily, which
            #will be replaced with the maximum ratio further below.
            Rating_ratio.append(-99999)
    else:
        Rating_ratio.append((int(row[7])+int(row[8]))/(int(row[4])+int(row[5])))

#replacing -99999 with the maximum ratio in the list
max_rating = max(Rating_ratio)
for x in range(len(Rating_ratio)):
    if(Rating_ratio[x] == -99999):
        Rating_ratio[x] = max_rating

#adding the column 'Good_By_Bad_Rating' to the dataframe
data['Good_By_Bad_Rating'] = Rating_ratio

data = data.drop(columns = ['1_Star_Rating', '2_Star_Rating', '3_Star_Rating', '4_Star_Rating', '5_Star_Rating'])
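
The rule used in the loop above can also be written as a small helper that works on named counts instead of positional indices. This is only a sketch: the function name `good_by_bad_ratios` and the `(bad, good)` tuple layout are hypothetical, and it uses `None` rather than -99999 as the temporary placeholder.

```python
def good_by_bad_ratios(rows):
    """rows: iterable of (bad, good) counts, where bad = 1+2 star and good = 4+5 star ratings."""
    ratios = []
    for bad, good in rows:
        if bad == 0 and good == 0:
            ratios.append(0.0)      # no ratings at all
        elif bad == 0:
            ratios.append(None)     # ratio is undefined, filled in below
        else:
            ratios.append(good / bad)

    # Replace undefined ratios with the largest defined ratio (0.0 if none exists).
    defined = [r for r in ratios if r is not None]
    fill = max(defined) if defined else 0.0
    return [fill if r is None else r for r in ratios]


example = good_by_bad_ratios([(2, 8), (0, 0), (0, 5), (4, 2)])
print(example)
```

The third row has good ratings but no bad ones, so it receives the largest ratio found in the other rows.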

In [2]:
from sklearn.preprocessing import LabelEncoder
data[['Segment']] = data[['Segment']].apply(LabelEncoder().fit_transform)
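
To see what label encoding does to a column such as Segment, here is a minimal pure-Python sketch. `LabelEncoder` sorts the unique labels and maps them to the integers 0, 1, 2 and so on; the helper name `label_encode` and the segment names below are made up for illustration.

```python
def label_encode(values):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes


# Hypothetical segment names, not taken from the actual dataset.
codes, classes = label_encode(["Consumer", "Corporate", "Consumer", "Home Office"])
print(codes, classes)
```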

In [3]:
data = data.drop(columns = 'Units_sold')

X = data.drop(columns = 'Units_sold>1000')
y = data['Units_sold>1000']

In [4]:
# Importing the train-test split from scikit-learn
from sklearn.model_selection import train_test_split

# Performing train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 6)

Before we start the stratified split, let's check the ratio of both the classes in the y_train and y_test splits.

In [5]:
y_train.value_counts(normalize=True)

1    0.598137
0    0.401863
Name: Units_sold>1000, dtype: float64

In [6]:
y_test.value_counts(normalize=True)

1    0.614261
0    0.385739
Name: Units_sold>1000, dtype: float64

To incorporate a stratified split in our model, we can call train_test_split with the stratify parameter set to y (the target variable itself, not the string 'y'). This ensures that the class proportions of y are preserved in both the train and test splits.

Now, let's perform the stratified split and see if this ratio changes.

In [7]:
# Performing train test split with stratification
X_train_st, X_test_st, y_train_st, y_test_st = train_test_split(X, y, test_size = 0.3, stratify = y, random_state = 6)

In [8]:
y_train_st.value_counts(normalize=True)

1    0.60294
0    0.39706
Name: Units_sold>1000, dtype: float64

In [9]:
y_test_st.value_counts(normalize=True)

1    0.603056
0    0.396944
Name: Units_sold>1000, dtype: float64

As you can see, the ratios change only slightly, but the train and test splits now have almost identical class proportions (about 60% to 40%). The change is small because our dataset is large, so the chance of sampling bias in a random split is low. Still, as a good practice, we should always use a stratified split for classification problems with an imbalanced dataset.
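
To make the idea behind stratification concrete, here is a minimal pure-Python sketch: the indices are grouped by class, each group is shuffled, and the same fraction of every group goes to the test set. This is not how scikit-learn implements it internally; the function name `stratified_split` and the toy labels are assumptions for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.3, seed=6):
    """Return (train_idx, test_idx) so that each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same fraction taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 60 + [0] * 40   # toy target with a 60:40 class ratio
train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both parts keep the 60:40 ratio of the full toy target, which is the property the stratify parameter gives us.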

Now let's try the other technique, which is the class_weight hyperparameter.
As discussed earlier, some machine learning algorithms, such as Decision Tree and Logistic Regression, allow us to set the class_weight hyperparameter. Let's set it to 'balanced' and use the stratified splits we obtained earlier to build the model.

In [10]:
from sklearn.tree import DecisionTreeClassifier
DT_model = DecisionTreeClassifier(max_depth = 11, min_samples_leaf= 6, random_state=42, class_weight = 'balanced')

However, we can also assign customised class weights by defining a dictionary where the keys represent the class labels and the values represent the weights we want to assign to each class, for example:

DT_model = DecisionTreeClassifier(max_depth = 11, min_samples_leaf= 6, random_state=42, class_weight = {0:0.8, 1:0.2})
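
For reference, scikit-learn documents the 'balanced' setting as weighting each class by n_samples / (n_classes * class_count), so rarer classes get larger weights. The sketch below reproduces that formula in plain Python on a toy 60:40 target; the helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


labels = [1] * 60 + [0] * 40   # toy target, roughly the ratio seen in this dataset
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, the total weight of each class is the same (60 × 0.833 ≈ 40 × 1.25 = 50), so neither class dominates the training criterion.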

In [11]:
# Train the model
DT_model.fit(X_train_st, y_train_st)

Now, let's make the predictions on the train and test data.

In [12]:
from sklearn.metrics import f1_score

# Make predictions on the train dataset
y_train_pred = DT_model.predict(X_train_st)

# Make predictions on the test dataset
y_test_pred = DT_model.predict(X_test_st)

# Let's display the model performance on the train and test data.

print('Train score: ', f1_score(y_train_st, y_train_pred))
print('Test score: ', f1_score(y_test_st, y_test_pred))

Train score:  0.8870789957134109
Test score:  0.8261986301369864
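
As a reminder of what these scores measure, here is a minimal sketch of the F1 computation from true and predicted labels. By default, f1_score reports the score for the positive label 1, which is the majority class in this dataset. The helper name `f1_from_labels` and the toy labels are assumptions for illustration.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
score = f1_from_labels(y_true, y_pred)
print(score)
```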


---

### Undersampling

Now that we have implemented the tricks to handle class imbalance, let's dive deeper to understand how to balance the dataset using undersampling and oversampling techniques. First, let's do undersampling!

To do so, we first have to install imbalanced-learn, a library built to work alongside scikit-learn and imported as 'imblearn'. It is specifically designed to deal with imbalanced datasets and helps us implement methods such as undersampling, oversampling and SMOTE. Run the following cell to install it!

In [None]:
#pip install --upgrade scikit-learn imbalanced-learn

In [13]:
!pip install imblearn

Defaulting to user installation because normal site-packages is not writeable
Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Collecting imbalanced-learn
  Downloading imbalanced_learn-0.13.0-py3-none-any.whl (238 kB)
     -------------------------------------- 238.4/238.4 kB 2.1 MB/s eta 0:00:00
Collecting sklearn-compat<1,>=0.1
  Downloading sklearn_compat-0.1.3-py3-none-any.whl (18 kB)
Installing collected packages: sklearn-compat, imbalanced-learn, imblearn
Successfully installed imbalanced-learn-0.13.0 imblearn-0.0 sklearn-compat-0.1.3





Once the library is installed, let's import RandomUnderSampler from imblearn.under_sampling.

In [14]:
from imblearn.under_sampling import RandomUnderSampler

Now, let's resample the training data and build the decision tree with the resampled data.

In [15]:
sampler = RandomUnderSampler(random_state = 42)
X_train_rus, y_train_rus = sampler.fit_resample(X_train, y_train)
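
To make the resampling step more concrete, here is a minimal pure-Python sketch of the idea behind random undersampling: every class is cut down, without replacement, to the size of the smallest class. It only illustrates the concept and is not imblearn's implementation; the function name `random_undersample` and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_undersample(rows, labels, seed=42):
    """Randomly drop rows so that every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    n_min = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_min))  # sampling without replacement
    kept.sort()
    return [rows[i] for i in kept], [labels[i] for i in kept]


rows = list(range(10))            # stand-in for feature rows
labels = [1] * 6 + [0] * 4        # 6 majority rows, 4 minority rows
rows_rus, labels_rus = random_undersample(rows, labels)
counts = Counter(labels_rus)
print(counts)
```

All minority rows survive, while two of the six majority rows are discarded. Those discarded rows are where the potential loss of information comes from.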

In [16]:
print(y_train_rus.value_counts(normalize = True))

0    0.5
1    0.5
Name: Units_sold>1000, dtype: float64


As you can see from the ratios, the class distribution is now equal as a result of random undersampling.
Let's build the model using this modified training data and see its performance!

In [17]:
DT_model = DecisionTreeClassifier(max_depth = 11, min_samples_leaf= 6, random_state=42)

DT_model.fit(X_train_rus, y_train_rus)

y_train_pred = DT_model.predict(X_train_rus)
y_pred = DT_model.predict(X_test)


print('Train F1 Score: ', f1_score(y_train_rus, y_train_pred))
print('Test F1 Score: ', f1_score(y_test, y_pred))

Train F1 Score:  0.8667085539897675
Test F1 Score:  0.8229858504187121


As you can see, the test performance has deteriorated slightly compared with the class-weighted model. This is likely because undersampling randomly removes instances from the majority class, which may lead to a loss of information. Let's try random oversampling now.

### Random oversampling

Random oversampling follows the same steps as random undersampling. The only difference is that we import RandomOverSampler from imblearn.over_sampling before resampling the data and building the model.

In [18]:
from imblearn.over_sampling import RandomOverSampler

Let's resample the original training data.

In [19]:
sampler = RandomOverSampler(random_state = 42)
X_train_ros, y_train_ros = sampler.fit_resample(X_train, y_train)
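
The idea behind random oversampling can be sketched in plain Python as well: rows of the smaller classes are duplicated, by sampling with replacement, until every class is as large as the biggest one. Again, this is a conceptual sketch rather than imblearn's implementation, and the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    n_max = max(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # Draw the missing rows with replacement from the same class.
        extra = rng.choices(indices, k=n_max - len(indices))
        kept.extend(indices + extra)
    return [rows[i] for i in kept], [labels[i] for i in kept]


rows = list(range(10))            # stand-in for feature rows
labels = [1] * 6 + [0] * 4        # 6 majority rows, 4 minority rows
rows_ros, labels_ros = random_oversample(rows, labels)
counts = Counter(labels_ros)
print(counts)
```

No original row is removed here; the minority class simply gains two exact copies of existing rows.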

Once random oversampling is done, we can quickly check whether the classes are now balanced.

In [20]:
y_train_ros.value_counts(normalize = True)

1    0.5
0    0.5
Name: Units_sold>1000, dtype: float64

As you can see, the classes are now equally represented. Let's go to the next step and build the model.

In [21]:
DT_model = DecisionTreeClassifier(max_depth = 11, min_samples_leaf= 6, random_state=42)

DT_model.fit(X_train_ros, y_train_ros)

y_train_pred = DT_model.predict(X_train_ros)
y_pred = DT_model.predict(X_test)


# Printing the F1 score of the train and test data
print('Train F1 Score: ', f1_score(y_train_ros, y_train_pred))
print('Test F1 Score: ', f1_score(y_test, y_pred))

Train F1 Score:  0.8634057971014493
Test F1 Score:  0.8295983086680762


As you can see, the test performance is slightly better than in the undersampling scenario, likely because no information has been removed from the dataset. Let's try implementing another oversampling technique, which is SMOTE.

#### SMOTE

Let's start by importing the SMOTE (Synthetic Minority Over-sampling Technique) method from the imblearn.over_sampling module.

In [22]:
from imblearn.over_sampling import SMOTE

Next, SMOTE is initialized with a random seed for reproducibility.

In [23]:
smote = SMOTE(random_state = 42)

SMOTE is applied to the training data resulting in oversampled training data.

In [24]:
X_train_smt, y_train_smt = smote.fit_resample(X_train,y_train)
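
Unlike random oversampling, SMOTE does not copy existing rows. It creates new minority points by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows only that interpolation step on toy 2-D points. It is a simplified illustration, not imblearn's implementation, and the name `smote_like`, the choice k=2 and the toy points are assumptions.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points on segments between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the point itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]  # toy minority-class points
new_points = smote_like(minority, n_new=3)
print(new_points)
```

Every synthetic point lies between two existing minority points, so the new rows stay inside the region already occupied by the minority class.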

Let's quickly check the ratio of the classes in the training data.

In [25]:
y_train_smt.value_counts(normalize = True)

1    0.5
0    0.5
Name: Units_sold>1000, dtype: float64

Let's build the decision tree model using the oversampled training data.

In [26]:
DT_model = DecisionTreeClassifier(max_depth = 11, min_samples_leaf= 6, random_state=42)
DT_model.fit(X_train_smt, y_train_smt)


#Making predictions
y_train_pred = DT_model.predict(X_train_smt)
y_pred = DT_model.predict(X_test)

#Evaluating the model
print('Training F1 score: ', f1_score(y_train_smt, y_train_pred))
print('Testing F1 score: ', f1_score(y_test, y_pred))

Training F1 score:  0.862752248966691
Testing F1 score:  0.832123024348569


As you can see, in this scenario the model performance with SMOTE is similar to that with random oversampling. As we know, iterating over different techniques is an essential part of the machine learning model building process, so it is advisable to try out different techniques and tools to get the best performance. In the next video, we will compare the performance of all the models that we have built so far. So see you in the next one!