# Dealing with Imbalanced Data Set

Welcome to the hands-on lab. This is part of the series of exercises to help you acquire skills in different techniques to fine-tune your model.

In this lab, you will learn:
- how to apply resampling correctly to an imbalanced data set
- how to perform resampling with K-fold cross-validation



In this exercise, we will use a highly imbalanced data set from Lending Club that consists of data for both 'bad' and 'good' loans to illustrate how we can apply oversampling and undersampling techniques to improve our model performance. You will also learn to apply resampling correctly when using cross-validation.

## Import the libraries

In [None]:
from __future__ import print_function

import warnings
warnings.filterwarnings('ignore', module='sklearn')

import pandas as pd
import numpy as np
import urllib.request
import shutil
import zipfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score, recall_score

## Get the data

In [None]:
url = 'https://github.com/nyp-sit/data/raw/master/lending-club-data.csv.zip'
zip_file = 'lending-club-data.csv.zip'

# download the zip file and save it locally as 'lending-club-data.csv.zip'
with urllib.request.urlopen(url) as response, open(zip_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

# unzip the file to a folder 'data'
data_file = 'data/lending-club-data.csv'

with zipfile.ZipFile(zip_file,"r") as zip_ref:
    zip_ref.extractall('data')

## Exploratory Data analysis

Here we find out some basic information about the data set.

In [None]:
df = pd.read_csv('data/lending-club-data.csv')

Let us just find out about different features and their data types. 

In [None]:
df.info()

In this exercise, we are trying to predict whether a member will default on their loan. We will use the column `bad_loans` as the label for our classification task. If the value of `bad_loans` is 1, the loan is a default (a bad loan); otherwise, it is 0.

***Exercise:***

Find out how many samples in the data set are bad loans and how many are not. 

Hint: `value_counts()` in [pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) gives you the count of each unique value.

<p>
<details><summary>Click here for answer</summary>

```python

df.bad_loans.value_counts()

```

</details>

In [None]:
## Complete the code below ## 



Is the data set imbalanced? Clearly, we have many more good loans than bad loans (around 4 times more).
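
To quantify the imbalance rather than eyeball it, you can also look at class shares and the majority-to-minority ratio. The sketch below uses a small made-up label column (the values are hypothetical, not taken from the Lending Club data) purely to show the calls.

```python
import pandas as pd

# Toy label column (hypothetical values): 1 = bad loan, 0 = good loan
labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name='bad_loans')

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction of samples per class
imbalance_ratio = counts.max() / counts.min()  # majority size / minority size

print(counts)
print(shares)
print(imbalance_ratio)
```

The same three lines should work on `df.bad_loans` once the data is loaded.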

## Data Preparation

There are quite a lot of features in this data set, but we will use only a few of them for demonstration purposes (we are not really interested in the actual performance of our model).

In [None]:
features = ['grade', 'home_ownership','emp_length_num', 'sub_grade','short_emp',
            'dti', 'term', 'purpose', 'int_rate', 'last_delinq_none', 'last_major_derog_none',
            'revol_util', 'total_rec_late_fee', 'payment_inc_ratio', 'bad_loans']

***Exercise:*** 

Create a data frame that consists of the subset of features listed above.

<p>
<details><summary>Click here for answer</summary>
    
```python
df = df[features]
```
</details>

In [None]:
## Complete the code below ## 



In [None]:
df.info()

Notice that `payment_inc_ratio` has some null values. Since only a small number of rows are affected, we simply remove the rows that have null values for `payment_inc_ratio`.

***Exercise:***

Create a new data frame in which the rows that contain null values for `payment_inc_ratio` are removed. 

<p>
<details><summary>Click here for answer</summary>
    
```python
loans_df = df.dropna()
```
</details>

In [None]:
## Complete the code below ## 



In [None]:
loans_df.info()

***Exercise:*** 

Encode the categorical columns (dtype=object). You can use the convenience method `get_dummies()` provided by [pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html).

<p>
<details><summary>Click here for answer</summary>
    
```python
loans_encoded = pd.get_dummies(loans_df)
```
</details>

In [None]:
## Complete the code below ## 




In [None]:
loans_encoded.info()
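
For intuition about what one-hot encoding does to the column layout, here is a minimal sketch on a made-up two-column frame (the values are hypothetical). Numeric columns are passed through unchanged, while each object column is replaced by one indicator column per category.

```python
import pandas as pd

# Toy frame (hypothetical values): one categorical and one numeric column
toy = pd.DataFrame({
    'grade': ['A', 'B', 'A'],
    'dti': [10.5, 22.1, 15.0],
})

encoded = pd.get_dummies(toy)

# 'grade' is replaced by 'grade_A' and 'grade_B'; 'dti' is left as it is
print(encoded.columns.tolist())
print(encoded)
```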

### Split the data set into train and test set

***Exercise:*** 

First, separate the features and the label.  

Hint: use `df.drop()` and specify `axis=1` to remove a particular column in dataframe.

Then, split the data into train set (called `X_train, y_train`) and test set (`X_test, y_test`). Think about the splitting strategy, e.g. do you need to ensure the distribution of good/bad is the same in both train and test set?

<p>
<details><summary>Click here for answer</summary>
    
```python

X_df = loans_encoded.drop(['bad_loans'], axis=1)
y_df = loans_encoded['bad_loans']

X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, 
                                                    test_size = .2, 
                                                    stratify = y_df,
                                                    random_state = 42)

```
</details>

In [None]:
## Complete the code below ## 

# X_df contains all the feature columns and y_df contains only the label, i.e. bad_loans column

X_df = None
y_df = None

# split the data into train and test set
X_train, X_test, y_train, y_test = None, None, None, None



In [None]:
print(y_train.value_counts())
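
A quick way to confirm that a stratified split keeps the class distribution is to compare the share of positives before and after splitting. The sketch below does this on made-up labels (`X_toy` and `y_toy` are hypothetical stand-ins, not the loan data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 20% of them positive
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

# Share of positives in the full data, the train set and the test set
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

With `stratify` set, all three shares should agree; without it, the test share can drift, which matters most when the minority class is small.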

## Train a baseline model

For the sake of comparison, we will first train a classifier and see its performance on the test set.
As we are interested in how well our model picks out bad loans, it is useful to look at the recall score of the model. 
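
As a reminder of what the two scores measure, the sketch below computes them on a small set of made-up predictions (the labels are hypothetical). Recall is the fraction of actual bad loans that the model catches; precision is the fraction of predicted bad loans that really are bad.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels (hypothetical): 4 bad loans (1) and 6 good loans (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

# Here TP = 1, FP = 1, FN = 3
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

print(precision, recall)
```

Note that this toy model is right on 6 of 10 samples yet finds only one of the four bad loans, which is why accuracy alone is a poor guide on imbalanced data.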

In [None]:
clf_rf = RandomForestClassifier(n_estimators=25, random_state=42)
clf_rf.fit(X_train, y_train)

In [None]:
y_test_pred = clf_rf.predict(X_test)

print(precision_score(y_test, y_test_pred))
print(recall_score(y_test, y_test_pred))


## Oversampling

Now we will try oversampling to see if we can improve our model's performance on bad loans.

### The ***wrong*** way to oversample ###

With the training data created, we can oversample the minority class (`bad_loans` = 1). In this exercise, we will use SMOTE (from the [imblearn](https://imbalanced-learn.readthedocs.io/en/stable/index.html) library) to create synthetic samples of the minority class. 

After upsampling to a class ratio of 1.0 (i.e. a 1-to-1 ratio between the positive and negative classes), you will have a balanced data set. In practice, there is often no need to balance the classes completely.
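
To give a feel for what a "synthetic sample" is, here is a simplified sketch of the interpolation idea behind SMOTE: a new point is placed at a random position on the line segment between a minority sample and one of its minority-class neighbours. This is an illustration only, not imblearn's implementation; the two points and the names `x`, `neighbour` and `lam` are made up, and the nearest-neighbour search is skipped.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical minority-class points that are neighbours in feature space
x = np.array([1.0, 2.0])
neighbour = np.array([3.0, 6.0])

# Pick a random position between the two points (0 = at x, 1 = at neighbour)
lam = rng.uniform(0.0, 1.0)
synthetic = x + lam * (neighbour - x)

print(lam, synthetic)
```

Because every synthetic point is built from existing minority samples, it carries information about them, which becomes important when deciding when to resample.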

In [None]:
from imblearn.over_sampling import SMOTE

# Set sampling_strategy='minority' to oversample only the minority class 

sm = SMOTE(sampling_strategy='minority',random_state=42)
X_upsample, y_upsample = sm.fit_resample(X_train, y_train)

Now let's see the number of samples we have for each class. You will see that our train set is now totally balanced, with an equal number of samples for each class.


In [None]:
y_upsample.value_counts()

Now let us split the up-sampled training data set into training and validation set.

***Note:***

It might be a bit confusing as we talk about training sets. We have our original data set, `X` and we split into `X_train` and `X_test`.  We up-sample the `X_train` to get `X_upsample`. And then from the `X_upsample`, we further set aside a train set and validation set, which we call: `X_train_final`, and `X_val_final` to differentiate from the earlier `X_train` and `X_upsample`. 

In [None]:
#now split into train/validation sets

X_train_final, X_val_final, y_train_final, y_val_final = train_test_split(X_upsample, y_upsample, 
                                                                          test_size=.1, random_state=42)

We then train a classifier and look at the performance of the classifier.

In [None]:
clf_rf = RandomForestClassifier(n_estimators=25, random_state=42)
clf_rf.fit(X_train_final, y_train_final)

Let's see how well our model performs on the validation set, i.e. `X_val_final`

In [None]:
y_val_final_pred = clf_rf.predict(X_val_final)

print(precision_score(y_val_final, y_val_final_pred))
print(recall_score(y_val_final, y_val_final_pred))

Our precision and recall have improved to 93% and 82% respectively. Impressive! But is this actually representative of how the model will perform? Let's put our model to the test.

In [None]:
y_test_pred = clf_rf.predict(X_test)

print(precision_score(y_test, y_test_pred))
print(recall_score(y_test, y_test_pred))

That’s disappointing! What has happened?

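By oversampling before splitting into training and validation data sets, we “leaked” information from the validation set into the training of the model (refer to your lecture for more details). SMOTE builds each synthetic sample from existing minority samples, so after a random split the validation set contains points that are closely related to points the model was trained on, and the validation scores become over-optimistic.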
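
The leak is easiest to see with plain duplication instead of SMOTE, because duplicated rows can be counted exactly. The sketch below is a stand-in under that assumption: it uses made-up row ids, "oversamples" the minority rows by copying them, splits afterwards, and counts how many validation rows also sit in the train split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical row ids: 0-9 are minority rows, 10-99 are majority rows
minority_ids = np.arange(10)
majority_ids = np.arange(10, 100)

# "Oversample" by duplicating minority rows until both classes have 90 rows
oversampled_minority = rng.choice(minority_ids, size=len(majority_ids), replace=True)
all_ids = np.concatenate([majority_ids, oversampled_minority])

# Split AFTER oversampling: 10% validation, 90% train
shuffled = rng.permutation(all_ids)
n_val = len(shuffled) // 10
val_ids, train_ids = shuffled[:n_val], shuffled[n_val:]

# Validation rows whose source row is also present in the train split
leaked = int(np.isin(val_ids, train_ids).sum())
print(f'{leaked} of {n_val} validation rows also appear in the train split')
```

Every leaked row is a minority row, so it is precisely the class we care about whose validation score gets inflated. With SMOTE the copies are interpolated rather than exact, but the dependence between the two splits is the same in spirit.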

### The ***right way*** to oversample

So, let's do it the right way and see what happens. This time round, we will oversample only the training set, not the train + validation set. Oversampling is done after we set aside the validation set. 

In [None]:
## Here we set aside a validation set first 

X_train_proper,  X_val_proper, y_train_proper, y_val_proper = train_test_split(X_train, y_train, test_size=.1, random_state=42)

Now as before, we use SMOTE to oversample the minority class, but this time we only oversample from the train set.  

***Exercise:***

Use SMOTE (as before) to oversample `X_train_proper`. Train a classifier as before and evaluate it on the validation data (`X_val_proper`). 

What are your precision and recall scores now?

<p>
<details><summary>Click here for answer</summary>
    
```python

sm = SMOTE(sampling_strategy='minority',random_state=42) 
X_train_proper_upsampled, y_train_proper_upsampled = sm.fit_resample(X_train_proper, y_train_proper)
    
clf_rf = RandomForestClassifier(n_estimators=25, random_state=42)
clf_rf.fit(X_train_proper_upsampled, y_train_proper_upsampled)

y_val_proper_pred = clf_rf.predict(X_val_proper)

print(precision_score(y_val_proper, y_val_proper_pred))
print(recall_score(y_val_proper, y_val_proper_pred))
```
</details>

In [None]:
## Complete the code below ## 



Let's see if this validation result is closer to what we will get for the test set.

In [None]:
y_test_pred = clf_rf.predict(X_test)

print(precision_score(y_test, y_test_pred))
print(recall_score(y_test, y_test_pred))


Now, we can see that the recall rate obtained from the validation set matches more closely the result from the test set, which is about 18% recall. 

### Oversampling when doing K-Fold 

If you are doing K-fold cross-validation, you can use a pipeline so that the resampling is repeated inside each fold. However, the sklearn `Pipeline` expects every intermediate step to be a transformer (implementing `fit` and `transform`), whereas imblearn samplers such as SMOTE implement `fit_resample` instead. Fortunately, imblearn provides its own `Pipeline` implementation, which we can use in place of the sklearn one. In the code below, we use the imblearn `Pipeline` to first oversample the minority class in the train folds, then train a classifier on the train folds, and finally validate on the validation fold. 

In [None]:
from imblearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedStratifiedKFold

sm = SMOTE(sampling_strategy='minority', random_state=42)
clf = RandomForestClassifier(n_estimators=25, random_state=42)

# declare a pipeline that consists of the oversampler and the classifier
steps = [('ovr', sm), ('clf', clf)]
pipeline = Pipeline(steps=steps)
scoring = ['precision','recall']

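# the oversampling is only applied to the train folds
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)
scores = cross_validate(pipeline, X_train, y_train, scoring=scoring, cv=cv, n_jobs=-1)

# print the average scores across the validation folds
print('average precision: {:.3f}'.format(scores['test_precision'].mean()))
print('average recall: {:.3f}'.format(scores['test_recall'].mean()))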
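
To make concrete what "only applied to the train folds" means, the sketch below writes the per-fold logic out by hand on made-up data. It is an illustration of the idea, not what imblearn runs internally: the helper `oversample_minority` is hypothetical and uses simple duplication as a stand-in for SMOTE, and no classifier is actually fitted.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)

# Toy data (hypothetical): 100 samples, 20 of them positive
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 20 + [0] * 80)

def oversample_minority(X, y, rng):
    """Duplicate random minority rows until both classes are the same size."""
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    n_extra = len(majority_idx) - len(minority_idx)
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra])
    return X[keep], y[keep]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
fold_summary = []
for train_idx, val_idx in cv.split(X_toy, y_toy):
    # Resample the train folds only
    X_res, y_res = oversample_minority(X_toy[train_idx], y_toy[train_idx], rng)
    # The validation fold keeps its original class distribution
    y_val = y_toy[val_idx]
    # A classifier would be fitted on (X_res, y_res) and scored on the validation fold here
    fold_summary.append((int(y_res.sum()), int((y_res == 0).sum()),
                         int(y_val.sum()), len(y_val)))

# (train positives, train negatives, validation positives, validation size) per fold
print(fold_summary)
```

In every fold the training data is balanced while the validation fold stays at its original 20% positives, so the validation scores reflect the class distribution the model will meet in practice.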


## Undersampling

It does not seem that we have had much success with oversampling. Let us try undersampling to see if we can get a better model.
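
Random undersampling works in the opposite direction from oversampling: all minority samples are kept and a random subset of the majority class is discarded. The sketch below shows the idea with NumPy on made-up labels; it is a simplified illustration, not imblearn's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy labels (hypothetical): 20 positives, 80 negatives
y_toy = np.array([1] * 20 + [0] * 80)

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Keep every minority row, and as many randomly chosen majority rows
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.sort(np.concatenate([minority_idx, kept_majority]))
y_under = y_toy[keep]

# Number of samples per class after undersampling
print(np.bincount(y_under))
```

The price of balancing this way is that most of the majority-class data is thrown away, which is worth keeping in mind when comparing results.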

**Exercise:**

Complete the code cell below. Using `RandomUnderSampler`, resample only the majority class. Cross-validate with `RandomForestClassifier` as before and compare the result with the oversampling approach. What do you observe about the result?

<details><summary>Click here for answer</summary>
<br/>
    
```python
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline 


undersampler  = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
clf = RandomForestClassifier(n_estimators=25, random_state=42)

# declare a pipeline that consists of the undersampler and the classifier
steps = [('under', undersampler), ('clf', clf)]
pipeline = Pipeline(steps=steps)

# let's monitor precision recall scores
scoring = ['precision','recall']

# the undersampling is only applied to the train folds
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)
scores = cross_validate(pipeline, X_train, y_train, scoring=scoring, cv=cv, n_jobs=-1)

print('average precision: {:.3f}'.format(scores['test_precision'].mean()))
print('average recall: {:.3f}'.format(scores['test_recall'].mean()))
    
```
</details>

In [None]:
## Complete your code below ##

# import the packages


# create the RandomUnderSampler and classifier 


# declare a pipeline that consists of the undersampler and the classifier


# let's monitor precision recall scores


# use cross validation to compute the scores 


# print the precision recall scores


### Combining Oversampling and Undersampling

Can we do better by combining oversampling and undersampling? Let's find out. 

**Exercise:**

Complete the code below to first downsample the majority class such that the minority class is 50% of the majority class, and then upsample the minority class to the same size as the majority class. 

Compare the result with previous oversample-only and undersample-only results. What do you observe? 
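
It helps to work out the class sizes that the two steps should produce. For a binary problem, a float `sampling_strategy` in imblearn is the desired ratio of minority to majority samples after resampling. The sketch below does the arithmetic for made-up class sizes; the helper `counts_after_resampling` is hypothetical and only mirrors that ratio definition.

```python
def counts_after_resampling(n_majority, n_minority, under_ratio, over_ratio):
    """Return (majority, minority) sizes after undersampling, then oversampling."""
    # Undersampling shrinks the majority so that minority / majority == under_ratio
    n_majority_after = int(n_minority / under_ratio)
    # Oversampling grows the minority so that minority / majority == over_ratio
    n_minority_after = int(n_majority_after * over_ratio)
    return n_majority_after, n_minority_after

# Hypothetical train fold: 8000 good loans and 2000 bad loans
result = counts_after_resampling(8000, 2000, under_ratio=0.5, over_ratio=1.0)
print(result)
```

Compared with oversampling alone, only half as many synthetic minority samples are needed, and compared with undersampling alone, fewer majority samples are discarded.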

<details><summary>Click here for answer</summary>
<br/>
We observe that precision has improved compared to undersample-only, but recall has gone down compared to undersample-only. 
<br/>

```python
# first downsample the majority classes such that minority class is 50% of majority class
undersampler  = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
# we then upsample the minority class such that both are the same ratio
oversampler = SMOTE(sampling_strategy=1.0, random_state=42)

clf = RandomForestClassifier(n_estimators=25, random_state=42)

# declare a pipeline that first undersample, and then oversample, followed by classifier
steps = [('under', undersampler), ('over', oversampler), ('clf', clf)]

pipeline = Pipeline(steps=steps)

# let's monitor precision recall scores
scoring = ['precision','recall']

# the resampling is only applied to the train folds
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)
scores = cross_validate(pipeline, X_train, y_train, scoring=scoring, cv=cv, n_jobs=-1)
    
# print the results 
print('average precision: {:.3f}'.format(scores['test_precision'].mean()))
print('average recall: {:.3f}'.format(scores['test_recall'].mean()))
```
</details>
 

In [None]:
## Complete the code below ## 

# create a downsampler that downsamples the majority class such that the minority class is 50% of the majority class


# create an upsampler that upsamples the minority class such that both classes are the same size


# create the classifier 


# declare a pipeline that first undersample, and then oversample, followed by classifier


# let's monitor precision recall scores


# run the cross validation here and print the results
