<a href="https://colab.research.google.com/github/nyp-sit/sdaai-iti103/blob/master/session-7/Oversampling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" align="left"/></a>

# Dealing with Imbalanced Data Set

Welcome to the programming exercise. This is part of the series of exercises to help you acquire skills in different techniques to fine-tune your model.

**You will learn:**
- how to use oversampling correctly on an imbalanced data set
- how to perform oversampling with K-fold cross-validation


## Oversampling

In this exercise, we will use a highly imbalanced data set from Lending Club, containing both 'bad' and 'good' loans, to illustrate the proper way of oversampling. The focus of this exercise is not to produce an accurate model but to illustrate the effect that incorrect oversampling has on the measured model performance.

### 1. Import the libraries

In [None]:
from __future__ import print_function

import warnings
warnings.filterwarnings('ignore', module='sklearn')

import pandas as pd
import numpy as np
import urllib.request
import shutil
import zipfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE

### 2. Get the data

In [None]:
url = 'https://github.com/nyp-sit/data/raw/master/lending-club-data.csv.zip'
zip_file = 'lending-club-data.csv.zip'

# download the zip file and save it locally as 'lending-club-data.csv.zip'
with urllib.request.urlopen(url) as response, open(zip_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

# path of the extracted CSV file (read in the next section)
data_file = 'data/lending-club-data.csv'

# unzip the file into the folder 'data'
with zipfile.ZipFile(zip_file, 'r') as zip_ref:
    zip_ref.extractall('data')

### 3. Exploratory data analysis

Here we find out some basic information about the data set.

In [None]:
df = pd.read_csv('data/lending-club-data.csv')

Let us just find out about different features and their data types. 

In [None]:
df.info()

In this exercise, we are trying to predict whether a member will default on a loan. We will therefore use the column `bad_loans` as the label for our classification task. A value of 1 in `bad_loans` means the loan defaulted (a bad loan); otherwise the value is 0.

***Exercise:***

Find out how many samples in the data set are bad loans and how many are not.

Hint: `value_counts()` in [pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) gives you the count of each unique value.

<p>
<details><summary>Click here for answer</summary>

```python

df.bad_loans.value_counts()

```

</details>

In [None]:
### START CODE HERE ###


### END CODE HERE

Is the data set imbalanced? Clearly, there are far more good loans than bad loans (around 4 times as many).
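
To put a number on the imbalance, the counts can be turned into class shares and a majority-to-minority ratio. The snippet below is a minimal sketch on a made-up label column (the values are hypothetical and are not taken from the Lending Club data); the same calls should work on `df.bad_loans`.

```python
import pandas as pd

# Toy label column (hypothetical values): 8 good loans (0) and 2 bad loans (1).
labels = pd.Series([0, 0, 0, 0, 1, 0, 0, 1, 0, 0], name="bad_loans")

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()  # majority size / minority size

print(counts)
print(shares)
print("majority-to-minority ratio:", imbalance_ratio)
```

A ratio well above 1 signals that accuracy alone will be a misleading metric, because always predicting the majority class already scores highly.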

### 4. Data Preparation

There are quite a lot of features in this data set, but we are going to use only a few for demonstration purposes (as we are not really interested in the actual performance of our model).

In [None]:
features = ['grade', 'home_ownership','emp_length_num', 'sub_grade','short_emp',
            'dti', 'term', 'purpose', 'int_rate', 'last_delinq_none', 'last_major_derog_none',
            'revol_util', 'total_rec_late_fee', 'payment_inc_ratio', 'bad_loans']

***Exercise:*** 

Create a data frame that consists of the subset of features listed above.

<p>
<details><summary>Click here for answer</summary>
    
```python
df = df[features]
```
</details>

In [None]:
### START CODE HERE ### 



### END CODE HERE ###

In [None]:
df.info()

Notice that `payment_inc_ratio` has some null values. Since only a small number of rows are affected, simply remove the rows where `payment_inc_ratio` is null.

***Exercise:***

Create a new data frame with the rows that contain null values for `payment_inc_ratio` removed.

Hint: `~df.payment_inc_ratio.isnull()` returns a boolean Series (a True/False mask) indicating which rows of `payment_inc_ratio` are **NOT** null. Construct the new data frame using `df[boolean mask]`.

<p>
<details><summary>Click here for answer</summary>
    
```python
loans_df = df[~df.payment_inc_ratio.isnull()]
```
</details>

In [None]:
### START CODE HERE ###

loans_df = None 

#### END CODE HERE ###

In [None]:
loans_df.info()

***Exercise:*** 

Encode the categorical columns (dtype=object). You can use the convenience method `get_dummies()` provided by [pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html).

<p>
<details><summary>Click here for answer</summary>
    
```python
loans_encoded = pd.get_dummies(loans_df)
```
</details>
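
To see what `get_dummies()` does before applying it to the loans data, here is a small sketch on a toy data frame. The column values are invented for illustration; only the column names `grade` and `dti` are borrowed from the feature list above.

```python
import pandas as pd

# Toy frame (hypothetical values) with one categorical and one numeric column.
toy = pd.DataFrame({
    "grade": ["A", "B", "A"],   # text column -> one-hot encoded
    "dti": [10.5, 22.1, 15.0],  # numeric column -> passed through unchanged
})

encoded = pd.get_dummies(toy)

# Each distinct category becomes its own indicator column named <column>_<value>.
print(encoded.columns.tolist())
print(encoded)
```

Numeric columns are left as they are, so the label column `bad_loans` (which is already 0/1) is not affected by the encoding.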

In [None]:
### START CODE HERE ###

loans_encoded = None

### END CODE HERE ###

In [None]:
loans_encoded.info()

### 5. Split the data set into train and test set

***Exercise:*** 

Separate the features and the label.  

Hint: use `df.drop()` and specify `axis=1` to remove a particular column in dataframe.

<p>
<details><summary>Click here for answer</summary>
    
```python

X_df = loans_encoded.drop(['bad_loans'], axis=1)
y_df = loans_encoded['bad_loans']

```
</details>

In [None]:
### START CODE HERE ###

# X_df contains all the feature columns and y_df contains only the label, i.e. bad_loans column

X_df = None
y_df = None 


### END CODE HERE ### 

X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, 
                                                    test_size = .1, 
                                                    stratify = y_df,
                                                    random_state = 42)

In [None]:
print(y_train.value_counts())
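
The `stratify` argument used above keeps the class proportions the same in the train and test sets, which matters when one class is rare. The following is a minimal sketch on synthetic labels (the arrays `X_toy` and `y_toy` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 1000 rows, of which 20% are positives.
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 200 + [0] * 800)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, stratify=y_toy, random_state=42
)

# With stratification, both splits keep the positive share of the full data.
print("positive share in train:", y_tr.mean())
print("positive share in test: ", y_te.mean())
```

Without `stratify`, a small test set could by chance contain noticeably fewer (or more) bad loans than the full data, which would make the test recall noisier.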

### 6. The ***wrong*** way to oversample ###

With the training data created, we can oversample the minority class (`bad_loans` = 1). In this exercise, we will use SMOTE (from the [imblearn](https://imbalanced-learn.readthedocs.io/en/stable/index.html) library) to create synthetic samples of the minority class.

After upsampling to a class ratio of 1.0 (i.e. a 1:1 ratio between the positive and negative classes), you have a balanced data set. In practice, it is often unnecessary to balance the classes completely.

In [None]:
# Set sampling_strategy='minority' to oversample only the minority class 

sm = SMOTE(sampling_strategy='minority',random_state=42)
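# fit_resample() replaces the older fit_sample(), which was removed from recent imblearn releases
X_upsample, y_upsample = sm.fit_resample(X_train, y_train)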

Now you can see that the classes are perfectly balanced. `np.bincount()` counts the number of occurrences of each value in an array of non-negative ints.


In [None]:
print(np.bincount(y_upsample))

Now let us split the up-sampled training data set into a training set and a validation set.

***Note:***

It might be a bit confusing as we talk about several training sets. We start with the original data set, `X_df`, and split it into `X_train` and `X_test`. We up-sample `X_train` to get `X_upsample`. Then, from `X_upsample`, we set aside a train set and a validation set, which we call `X_train_final` and `X_val_final` to differentiate them from the earlier `X_train` and `X_upsample`.

In [None]:
# now split the up-sampled data into a training set and a validation set

X_train_final, X_val_final, y_train_final, y_val_final = train_test_split(X_upsample, y_upsample, 
                                                                          test_size=.1, random_state=42)

We then train a classifier and look at the performance of the classifier

In [None]:
clf_rf = RandomForestClassifier(n_estimators=25, random_state=42)
clf_rf.fit(X_train_final, y_train_final)

As we are interested in how well our model picks out bad loans, it is useful to look at the recall score of the model.
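
Recall is the fraction of actual positives that the model finds: TP / (TP + FN). The toy example below (labels invented for illustration) computes it by hand and with `recall_score`, and shows how accuracy can look acceptable while recall is poor.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels (hypothetical): 4 bad loans (1) and 6 good loans (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# A model that flags only one of the four bad loans.
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Count true positives and false negatives by hand.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
manual_recall = tp / (tp + fn)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)

print("accuracy      =", acc)
print("recall        =", rec)
print("manual recall =", manual_recall)
```

Here the model is right on 7 of 10 rows but catches only 1 of 4 bad loans, which is why recall is the metric to watch on imbalanced data.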

***Exercise:*** 

Find the accuracy and the recall of the model on the validation set, i.e. `X_val_final`

<p>
<details><summary>Click here for answer</summary>
    
```python

y_val_final_pred = clf_rf.predict(X_val_final)

accuracy = accuracy_score(y_val_final, y_val_final_pred)
recall = recall_score(y_val_final, y_val_final_pred)

```
</details>

In [None]:
### START THE CODE ### 

y_val_final_pred = None

accuracy = None
recall = None

### END THE CODE ### 

print('accuracy = {}'.format(accuracy))
print('recall = {}'.format(recall))

You should expect to see around 80% recall, which is pretty good! It means the model correctly identified about 80% of the bad loans in the validation set. But is this actually representative of how the model will perform? To find out, let's test the model on the test set we created initially.

***Exercise:*** 

Find the accuracy and the recall of the model on the test set, i.e. `X_test`


<p>
<details><summary>Click here for answer</summary>
    
```python

y_test_pred = clf_rf.predict(X_test)

accuracy = accuracy_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)

```
</details>

In [None]:
### START THE CODE ### 
y_test_pred = None

accuracy = None
recall = None 


### END THE CODE ### 

print('accuracy = {}'.format(accuracy))
print('recall = {}'.format(recall))

Only 80% accuracy and 15% recall on the test data. That’s disappointing! What has happened?

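By oversampling before splitting into training and validation sets, we “leaked” information from the validation set into the training of the model. SMOTE builds each synthetic sample by interpolating between a minority sample and one of its nearest minority neighbours, so after the split the validation set contains points that are close relatives of points in the training set. The validation score is therefore inflated (refer to your lecture for more details).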
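
The leak is easier to see with a simplified, SMOTE-like interpolation written in plain NumPy. This is only a sketch of the idea under stated assumptions (the helper `smote_like` and the toy points are hypothetical; it is not imblearn's implementation): every synthetic point lies on the line segment between two real minority points.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in 2-D (not the Lending Club data).
minority = rng.normal(size=(20, 2))

def smote_like(points, n_new, k=3, rng=rng):
    """Simplified SMOTE-style sampler: interpolate between a random point
    and one of its k nearest neighbours. Illustration only."""
    # Pairwise Euclidean distances; ignore each point's distance to itself.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(points), size=n_new)              # parent 1
    pick = neighbours[base, rng.integers(0, k, size=n_new)]      # parent 2
    gap = rng.random(size=(n_new, 1))                            # position on the segment
    synthetic = points[base] + gap * (points[pick] - points[base])
    return synthetic, base, pick

synthetic, base_idx, nb_idx = smote_like(minority, n_new=40)

# Each synthetic point is no farther from its first parent than the second parent is.
d_to_parent = np.linalg.norm(synthetic - minority[base_idx], axis=1)
d_parent_pair = np.linalg.norm(minority[nb_idx] - minority[base_idx], axis=1)
print("largest distance ratio:", (d_to_parent / d_parent_pair).max())
```

If such synthetic points are created first and the data is split afterwards, a synthetic point can land in the validation set while both of its parents sit in the training set, so the model is effectively validated on data it has already seen.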

### 7. The ***right way*** to oversample

So, let us do it the right way and see what happens. This time round, we oversample only the training set, not the train + validation set: oversampling is done after we set aside the validation set.

In [None]:
## Here we set aside a validation set first

X_train_proper,  X_val_proper, y_train_proper, y_val_proper = train_test_split(X_train, y_train, test_size=.1, random_state=42)

Now, as before, we use SMOTE to oversample the minority class, but this time we oversample only the train set.

***Note:***

It might be a bit confusing as we talk about several training sets. We start with the original data set, `X_df`, and split it into `X_train` and `X_test`. Then, from `X_train`, we set aside a train set and a validation set, which we call `X_train_proper` and `X_val_proper` to differentiate them from the earlier `X_train`.

***Exercise:***

Use SMOTE (as before) to oversample `X_train_proper`.


<p>
<details><summary>Click here for answer</summary>
    
```python

sm = SMOTE(sampling_strategy='minority',random_state=42) 
X_train_proper_upsampled, y_train_proper_upsampled = sm.fit_resample(X_train_proper, y_train_proper)

```
</details>

In [None]:
### START CODE HERE ###

sm = None
X_train_proper_upsampled, y_train_proper_upsampled = None, None


### END CODE HERE ### 

We then train a classifier and look at the performance of the classifier

In [None]:
clf_rf = RandomForestClassifier(n_estimators=25, random_state=42)
clf_rf.fit(X_train_proper_upsampled, y_train_proper_upsampled)

***Exercise:*** 

As before, find the accuracy and recall of the model on the validation set, i.e. `X_val_proper`


<p>
<details><summary>Click here for answer</summary>
    
```python

y_val_proper_pred = clf_rf.predict(X_val_proper)

accuracy = accuracy_score(y_val_proper, y_val_proper_pred )
recall = recall_score(y_val_proper, y_val_proper_pred )


```
</details>

In [None]:
### START CODE HERE ###

y_val_proper_pred = None

accuracy = None
recall = None


### END THE CODE ### 

print('accuracy = {}'.format(accuracy))
print('recall = {}'.format(recall))


This time round, we got only 17% recall. Let's see if this recall rate is more representative of the result on the test set. 

***Exercise:*** 

Find the accuracy and the recall of the model on the test set, i.e. `X_test`

<p>
<details><summary>Click here for answer</summary>
    
```python

y_test_pred = clf_rf.predict(X_test)

accuracy = accuracy_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)



```
</details>

In [None]:
### START THE CODE ### 
y_test_pred = None

accuracy = None
recall = None



### END THE CODE ### 

print('accuracy = {}'.format(accuracy))
print('recall = {}'.format(recall))

Now, we can see that the recall rate obtained from the validation set matches more closely the result from the test set, which is about 18% recall. 

### 8. Oversampling when doing K-Fold 

If you are doing K-fold cross-validation, the code below shows how to do the oversampling properly: oversample only the training folds inside each iteration, and leave the validation fold untouched.

In [None]:
from sklearn.model_selection import StratifiedKFold

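# random_state only takes effect with shuffle=True (recent scikit-learn versions raise an error otherwise)
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)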

clf_rf = RandomForestClassifier(n_estimators=25, random_state=42)
sm = SMOTE(sampling_strategy='minority',random_state=42) 

# We use enumerate() to return also the index position of the list so that we can print out the fold number
for fold, (train_index, val_index) in enumerate(skfolds.split(X_train, y_train)):
    #print(train_index, val_index)
    X_train_fold = X_train.iloc[train_index]
    y_train_fold = y_train.iloc[train_index]
    X_val_fold = X_train.iloc[val_index]
    y_val_fold = y_train.iloc[val_index]
    # oversample the training fold only; the validation fold stays as it is
    X_train_fold_oversample, y_train_fold_oversample = sm.fit_resample(X_train_fold, y_train_fold)
    clf_rf.fit(X_train_fold_oversample, y_train_fold_oversample)
    y_val_fold_pred = clf_rf.predict(X_val_fold)
    print('Accuracy score for {} fold: {}'.format(fold, accuracy_score(y_val_fold,y_val_fold_pred)))
    print('Recall score for {} fold: {}'.format(fold, recall_score(y_val_fold, y_val_fold_pred)))

If you prefer to work with NumPy arrays rather than data frames in the StratifiedKFold loop, you can first convert `X_train` and `y_train` to NumPy arrays as follows:

In [None]:
X_train_arr = X_train.values
y_train_arr = y_train.values

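# random_state only takes effect with shuffle=True (recent scikit-learn versions raise an error otherwise)
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)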

clf_rf = RandomForestClassifier(n_estimators=25, random_state=42)
sm = SMOTE(sampling_strategy='minority',random_state=42) 

# We use enumerate() to return also the index position of the list so that we can print out the fold number
for fold, (train_index, val_index) in enumerate(skfolds.split(X_train_arr, y_train_arr)):
    #print(train_index, val_index)
    X_train_fold = X_train_arr[train_index]
    y_train_fold = y_train_arr[train_index]
    X_val_fold = X_train_arr[val_index]
    y_val_fold = y_train_arr[val_index]
    # oversample the training fold only; the validation fold stays as it is
    X_train_fold_oversample, y_train_fold_oversample = sm.fit_resample(X_train_fold, y_train_fold)
    clf_rf.fit(X_train_fold_oversample, y_train_fold_oversample)
    y_val_fold_pred = clf_rf.predict(X_val_fold)
    print('Accuracy score for {} fold: {}'.format(fold, accuracy_score(y_val_fold,y_val_fold_pred)))
    print('Recall score for {} fold: {}'.format(fold, recall_score(y_val_fold, y_val_fold_pred)))