<img src="images/nyp_ago_logo.png" width='400'/>

# Dealing with an Imbalanced Data Set

In this lab, you will learn:
- how to apply over-sampling correctly to an imbalanced data set
- how to perform resampling within K-fold cross-validation

In this exercise, we will use an imbalanced data set from Lending Club, containing both 'bad' and 'good' loans, to illustrate how oversampling and undersampling techniques can be applied to improve model performance. You will also learn to apply resampling correctly when using cross-validation.

## Import the libraries

In [None]:
import pandas as pd
import numpy as np
import urllib.request
import shutil
import zipfile

from sklearn.tree import DecisionTreeClassifier
#from sklearn.ensemble import (
#    RandomForestClassifier,
#    GradientBoostingClassifier
#)

from sklearn.model_selection import (
    train_test_split,
    RepeatedStratifiedKFold,
    cross_validate
)

from sklearn.metrics import (
    classification_report,
    roc_curve,
    roc_auc_score,
    auc,
    precision_recall_curve,
    RocCurveDisplay
)

from imblearn.pipeline import Pipeline

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

%matplotlib inline

## Understand the data

Here we find out some basic information about the data set.

In [None]:
df = pd.read_csv('data/lending-club-data.csv')

Let us look at the different features and their data types.

In [None]:
df.info()

In this exercise, we are trying to predict whether a member will default on a loan. We will therefore use the column `bad_loans` as the label for our classification task. A `bad_loans` value of 1 means a default (a bad loan); otherwise, the value is 0.

***Exercise:***

Find out how many samples in the data set are bad loans and how many are not.

Hint: `value_counts()` in [pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) gives you the count of each unique value.

<p>
<details><summary>Click here for answer</summary>

```python
df.bad_loans.value_counts()
```

</details>

In [None]:
### Complete the code below ###


Is the data set imbalanced? Clearly, we have a lot more good loans than bad loans (around 4 times more).
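
To put a number on the imbalance, the class counts can be turned into proportions and a majority-to-minority ratio. The sketch below uses a small made-up label column (the values are hypothetical, not the Lending Club data) and only illustrates the idea.

```python
import pandas as pd

# Toy label column (hypothetical values, not the Lending Club data):
# 8 good loans (0) and 2 bad loans (1).
labels = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="bad_loans")

counts = labels.value_counts()                     # absolute count per class
proportions = labels.value_counts(normalize=True)  # fraction per class

# Ratio of the majority-class size to the minority-class size
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(proportions)
print("imbalance ratio:", imbalance_ratio)
```

The same three lines can be applied to `df.bad_loans` once you have completed the exercise above.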

## Data Preparation

There are quite a lot of features in this data set, but we are going to use only a few for demonstration purposes (we are not really interested in the actual performance of our model).

In [None]:
features = ['grade', 'home_ownership','emp_length_num', 'sub_grade','short_emp',
            'dti', 'term', 'purpose', 'int_rate', 'last_delinq_none', 'last_major_derog_none',
            'revol_util', 'total_rec_late_fee', 'payment_inc_ratio', 'bad_loans']

In [None]:
df = df[features]
df.info()

Notice that `payment_inc_ratio` has some null values. Since only a small number of rows are affected, we simply remove the rows that have null values for `payment_inc_ratio`.

In [None]:
loans_df = df.dropna()
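
Before dropping rows, it can help to check how many null values each column actually has. The following is a minimal sketch on a toy frame (the values and the name `toy` are made up for illustration); it also shows `dropna(subset=...)`, which restricts the check to one column, whereas a plain `dropna()` drops a row if any column is null.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing payment_inc_ratio
toy = pd.DataFrame({
    "dti": [10.5, 22.1, 15.0, 8.3],
    "payment_inc_ratio": [4.2, np.nan, 6.1, 3.3],
})

# Number of missing values in each column
null_counts = toy.isna().sum()
print(null_counts)

# Drop only the rows where payment_inc_ratio is missing
cleaned = toy.dropna(subset=["payment_inc_ratio"])
print(len(toy), "->", len(cleaned))
```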

We will go ahead and encode our categorical columns.

In [None]:
loans_encoded = pd.get_dummies(loans_df)
loans_encoded.info()
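
To see what `pd.get_dummies` does, here is a small sketch on a toy frame (hypothetical values, using a few of the column names from this lab). Text columns are expanded into one indicator column per category, while numeric columns are left unchanged.

```python
import pandas as pd

# Toy frame (hypothetical values): two categorical columns, one numeric
toy = pd.DataFrame({
    "grade": ["A", "B", "A"],
    "home_ownership": ["RENT", "OWN", "RENT"],
    "dti": [10.5, 22.1, 15.0],
})

# Each category becomes its own indicator column named <column>_<category>
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

Because every category adds a column, columns with many distinct values (such as `sub_grade`) widen the encoded data set considerably.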

### Split the data set into train and test set

***Exercise:***

First, separate the features and the label.  

Hint: use `df.drop()` and specify `axis=1` to remove a particular column from the DataFrame.

Then, split the data into a train set (called `X_train, y_train`) and a test set (`X_test, y_test`). Think about the splitting strategy, e.g. do you need to ensure that the distribution of good/bad loans is the same in both the train and test sets?

<p>
<details><summary>Click here for answer</summary>
    
```python

X_df = loans_encoded.drop(['bad_loans'], axis=1)
y_df = loans_encoded['bad_loans']

X_train, X_test, y_train, y_test = train_test_split(X_df, y_df,
                                                    test_size = .2,
                                                    stratify = y_df,
                                                    random_state = 42)

```
</details>

In [None]:
## Complete the code 


In [None]:
print(y_train.value_counts())
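
The effect of stratification can be checked by comparing the positive-class rate in the two splits. Below is a minimal sketch on synthetic labels (made-up data with a 4-to-1 ratio; the names `X_tr`, `X_te`, `y_tr` and `y_te` are chosen so that they do not overwrite the lab's variables).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 800 negatives and 200 positives
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 800 + [1] * 200)

# stratify=y keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_rate = y_tr.mean()  # fraction of positives in the train split
test_rate = y_te.mean()   # fraction of positives in the test split
print("positive rate in train:", train_rate)
print("positive rate in test: ", test_rate)
```

Without `stratify`, the two rates can drift apart by chance, which matters more the rarer the minority class is.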

## Train a baseline model

Now, for the sake of comparison, we will evaluate a baseline model without any resampling.
As we are dealing with an imbalanced data set, it is useful to look at the ROC AUC score.

In [None]:
clf = DecisionTreeClassifier(random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=0)
scores = cross_validate(clf, X_train, y_train, scoring=['roc_auc'], cv=cv, n_jobs=-1)

print('ROC_AUC of baseline model = {}'.format(scores['test_roc_auc'].mean()))
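
To see why accuracy alone can mislead on imbalanced data, consider a "model" that always predicts the majority class. The sketch below uses made-up labels with the same 4-to-1 ratio; the variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels: 80 good loans (0) and 20 bad loans (1)
y_true = np.array([0] * 80 + [1] * 20)

# A trivial "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)
# ...and gives every sample the same score for the positive class
y_score = np.zeros(len(y_true))

accuracy = accuracy_score(y_true, y_pred)
auc_value = roc_auc_score(y_true, y_score)

print("accuracy:", accuracy)   # high, although no bad loan is detected
print("ROC AUC: ", auc_value)  # no better than random ranking
```

The accuracy looks respectable only because the majority class dominates, while the ROC AUC shows that the model cannot rank bad loans above good ones.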

## Oversampling

Now we will try over-sampling to see if we can improve our model's performance on the 'bad loan' class.

### The ***wrong*** way to oversample ###

With the training data created, we can oversample the minority class (`bad_loans` = 1). In this exercise, we will use SMOTE (from the [imblearn](https://imbalanced-learn.readthedocs.io/en/stable/index.html) library) to create synthetic samples of the minority class.

After upsampling to a class ratio of 1.0 (i.e. a 1-to-1 ratio between positive and negative classes), you should have a balanced data set. In practice, there is often no need to balance the classes completely.
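
The core idea behind SMOTE is that each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified NumPy illustration of that interpolation step on made-up points; the helper `synthesize_one` is hypothetical and is not how imblearn implements SMOTE internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class points in 2-D (hypothetical values)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])

def synthesize_one(points, k, rng):
    """Create one synthetic point by interpolating between a random
    minority point and one of its k nearest minority neighbours."""
    base = points[rng.integers(len(points))]
    dists = np.linalg.norm(points - base, axis=1)
    # argsort()[0] is the base point itself (distance 0), so skip it
    neighbour_idx = np.argsort(dists)[1:k + 1]
    neighbour = points[rng.choice(neighbour_idx)]
    lam = rng.uniform(0.0, 1.0)  # random position along the segment
    return base, neighbour, base + lam * (neighbour - base)

base, neighbour, new_point = synthesize_one(minority, k=2, rng=rng)
print("base:     ", base)
print("neighbour:", neighbour)
print("synthetic:", new_point)
```

Since every synthetic point is built from two real minority points, it carries information about both of them, which is worth keeping in mind for the next part of this lab.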

In [None]:
# With sampling_strategy='auto', SMOTE resamples every class except the majority class;
# in this binary problem, that means only the minority class is oversampled

sm = SMOTE(sampling_strategy='auto',random_state=0)

X_upsample, y_upsample = sm.fit_resample(X_train, y_train)

Now let's see the number of samples we have for each class. You will see that our train set is now totally balanced, with an equal number of samples in each class.


In [None]:
y_upsample.value_counts()

In [None]:
clf = DecisionTreeClassifier(random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=0)
scores = cross_validate(clf, X_upsample, y_upsample, scoring=['roc_auc'], cv=cv, n_jobs=-1)

print('Cross-validation ROC_AUC score SMOTE-wrong way = {}'.format(scores['test_roc_auc'].mean()))

Our ROC AUC score has improved to around 0.915. Impressive! But is this actually representative of how the model will perform? Let's put our model to the test.

Now let's train the model on the full up-sampled training set and evaluate it on the test set.

In [None]:
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_upsample, y_upsample)

y_probas = clf.predict_proba(X_test)[:,1]

roc_auc = roc_auc_score(y_test, y_probas)

print('Test ROC_AUC with SMOTE-wrong way = {}'.format(roc_auc))

You will get around 0.55. That’s disappointing! What has happened?

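By oversampling before splitting into training and validation folds, we “leaked” information from the validation data into the training of the model: synthetic minority samples in the training folds were generated from points that ended up in the validation folds, so the cross-validation score was overly optimistic (refer to your lecture for more details).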
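
The leak can be made visible with a small experiment. The sketch below swaps SMOTE for plain duplication of minority rows (a simpler stand-in, chosen so that each resampled row can be traced back to its source row) on made-up data, then counts how many validation rows share a source row with the training fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)

# Made-up data: 80 majority rows (0) and 20 minority rows (1)
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

# Oversample BEFORE splitting by duplicating random minority rows
minority_idx = np.where(y == 1)[0]
extra = rng.choice(minority_idx, size=60, replace=True)
X_over = np.vstack([X, X[extra]])
y_over = np.concatenate([y, y[extra]])

# origin[i] is the original row that resampled row i was copied from
origin = np.concatenate([np.arange(len(y)), extra])

# Take the first cross-validation split of the oversampled data
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
train_idx, val_idx = next(skf.split(X_over, y_over))

# Validation rows whose source row also appears in the training fold
leaked = int(np.isin(origin[val_idx], origin[train_idx]).sum())
print("validation rows:", len(val_idx))
print("validation rows with a copy in the training fold:", leaked)
```

With SMOTE the copies are interpolations rather than exact duplicates, but the effect is similar: the model is validated on points that are very close to what it was trained on.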

### The ***right way*** to oversample

So, let us do it the right way and see what happens. This time round, we will oversample only the training folds, not the training and validation data together. Oversampling is done after the validation fold has been set aside.

In [None]:
sm = SMOTE(sampling_strategy='auto', random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# declare a pipeline that consists of the oversampler and the classifier
steps = [('ovr', sm), ('clf', clf)]
pipeline = Pipeline(steps=steps)

# the oversampling is only applied to the train folds
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=0)
scores = cross_validate(pipeline, X_train, y_train, scoring=['roc_auc'], cv=cv, n_jobs=-1)

print('average roc_auc = {}'.format(scores['test_roc_auc'].mean()))
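
Conceptually, the pipeline resamples inside each cross-validation split. The sketch below spells that loop out by hand on synthetic data, again using simple duplication in place of SMOTE; the helper `oversample_by_duplication` and all data are hypothetical, and this is only an illustration of the idea, not imblearn's implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 80% class 0, 20% class 1)
X, y = make_classification(n_samples=500, n_features=5,
                           weights=[0.8, 0.2], random_state=0)
rng = np.random.default_rng(0)

def oversample_by_duplication(X_fold, y_fold, rng):
    """Duplicate random minority rows until both classes are the same size."""
    minority_idx = np.where(y_fold == 1)[0]
    majority_idx = np.where(y_fold == 0)[0]
    n_extra = len(majority_idx) - len(minority_idx)
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y_fold)), extra])
    return X_fold[keep], y_fold[keep]

fold_scores = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Resample the training fold only; the validation fold stays untouched
    X_res, y_res = oversample_by_duplication(X[train_idx], y[train_idx], rng)
    model = DecisionTreeClassifier(random_state=0).fit(X_res, y_res)
    val_scores = model.predict_proba(X[val_idx])[:, 1]
    fold_scores.append(roc_auc_score(y[val_idx], val_scores))

print("mean ROC AUC over folds:", np.mean(fold_scores))
```

Because each validation fold keeps its original class distribution and contains no resampled rows, the score is a fairer estimate of performance on unseen data.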

## Undersampling

It does not seem that we have had much success with oversampling (it is only marginally better than the baseline model). Let us try undersampling to see if we can get a better model.
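
Random undersampling works in the opposite direction from oversampling: instead of adding minority samples, it discards majority samples until the classes are balanced. Here is a minimal NumPy sketch of the idea on made-up data (an illustration only, not the imblearn implementation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 80 majority rows (0) and 20 minority rows (1)
y = np.array([0] * 80 + [1] * 20)
X = rng.normal(size=(len(y), 3))

minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

# Keep a random subset of the majority class, as large as the minority class
kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
keep = np.sort(np.concatenate([kept_majority, minority_idx]))

X_under, y_under = X[keep], y[keep]
print("class counts after undersampling:", np.bincount(y_under))
```

The trade-off is that the discarded majority rows no longer contribute to training, so some information is lost in exchange for balance.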

**Exercise:**

Complete the code cell below using `RandomUnderSampler` to resample only the majority class. Cross-validate with `DecisionTreeClassifier` like before and compare the result with the oversampling approach. What do you observe about the result?

<details><summary>Click here for answer</summary>
<br/>
    
```python

undersampler = RandomUnderSampler(sampling_strategy='auto', random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# declare a pipeline that consists of the undersampler and the classifier
steps = [('under', undersampler), ('clf', clf)]
pipeline = Pipeline(steps=steps)

# the undersampling is only applied to the train folds
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=1)
scores = cross_validate(pipeline, X_train, y_train, scoring=['roc_auc'], cv=cv, n_jobs=-1)

print('Cross-validation ROC_AUC score Random Undersampling = {}'.format(scores['test_roc_auc'].mean()))
    
```
</details>

In [None]:
## Complete the code below ##
