<a href="https://colab.research.google.com/github/nyp-sit/sdaai-staff-repo/blob/master/iti103/session-7(imbalanced-data)/Oversampling(Solution).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" align="left"/></a>

# Dealing with Imbalanced Data Set

Welcome to the programming exercise. This is part of the series of exercises to help you acquire skills in different techniques to fine-tune your model.

**You will learn:**
- how to apply oversampling correctly to an imbalanced data set
- how to oversample correctly when using K-fold cross-validation


## Oversampling

In this exercise, we will use a highly imbalanced data set from Lending Club, which contains data for both 'bad' and 'good' loans, to illustrate the proper way of oversampling. The focus of this exercise is not to produce an accurate model but to illustrate the effect that incorrect oversampling has on the measured model performance.

### 1. Import the libraries

In [1]:
from __future__ import print_function

import warnings
warnings.filterwarnings('ignore', module='sklearn')

import pandas as pd
import numpy as np
import urllib.request
import shutil
import zipfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE



### 2. Get the data

In [None]:
url = 'https://github.com/nyp-sit/data/raw/master/lending-club-data.csv.zip'
zip_file = 'lending-club-data.csv.zip'

# Download the zip file and save it locally as 'lending-club-data.csv.zip'
with urllib.request.urlopen(url) as response, open(zip_file, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)

# Unzip the archive into the folder 'data'; the extracted CSV ends up at data_file
data_file = 'data/lending-club-data.csv'

with zipfile.ZipFile(zip_file,"r") as zip_ref:
    zip_ref.extractall('data')

### 3. Exploratory data analysis

Here we find out some basic information about the data set.

In [4]:
df = pd.read_csv('data/lending-club-data.csv')



Let us just find out about different features and their data types. 

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122607 entries, 0 to 122606
Data columns (total 68 columns):
id                             122607 non-null int64
member_id                      122607 non-null int64
loan_amnt                      122607 non-null int64
funded_amnt                    122607 non-null int64
funded_amnt_inv                122607 non-null int64
term                           122607 non-null object
int_rate                       122607 non-null float64
installment                    122607 non-null float64
grade                          122607 non-null object
sub_grade                      122607 non-null object
emp_title                      115767 non-null object
emp_length                     118516 non-null object
home_ownership                 122607 non-null object
annual_inc                     122603 non-null float64
is_inc_v                       122607 non-null object
issue_d                        122607 non-null object
loan_status                

In this exercise, we are trying to predict whether a member will default on a loan. We will therefore use the column `bad_loans` as the label for our classification task. If the value of `bad_loans` is 1, the loan is a default (a bad loan); otherwise the value is 0.

***Exercise:***

Find out how many samples in the data set are bad loans and how many are not.

Hint: `value_counts()` in [pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) gives you the count of each unique value.

In [6]:
### START CODE HERE ###

df.bad_loans.value_counts()

### END CODE HERE

0    99457
1    23150
Name: bad_loans, dtype: int64

Is the data set imbalanced? Clearly we have many more good loans than bad loans (about 4.3 times as many).
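To put numbers on the imbalance, the two counts printed above can be turned into a ratio and a baseline. The sketch below is plain arithmetic on those counts (the variable names are ours, not part of the exercise). It also shows the accuracy a trivial model would reach by predicting 'good loan' for every row, which is worth keeping in mind when reading accuracy scores later on.

```python
# Class counts copied from the value_counts() output above
good_loans = 99457
bad_loans = 23150
total = good_loans + bad_loans

imbalance_ratio = good_loans / bad_loans          # good loans per bad loan
minority_share = bad_loans / total                # fraction of rows that are bad loans
majority_baseline_accuracy = good_loans / total   # accuracy of always predicting 0

print('good loans per bad loan    = {:.2f}'.format(imbalance_ratio))
print('share of bad loans         = {:.4f}'.format(minority_share))
print('always-predict-0 accuracy  = {:.4f}'.format(majority_baseline_accuracy))
```

An accuracy of roughly 0.81 is therefore achievable without detecting a single bad loan, which is why recall is the more informative metric in this exercise.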

### 4. Data Preparation

There are quite a lot of features in this data set, but we are going to use only a few of them for demonstration purposes (we are not really interested in the actual performance of our model).

In [None]:
features = ['grade', 'home_ownership','emp_length_num', 'sub_grade','short_emp',
            'dti', 'term', 'purpose', 'int_rate', 'last_delinq_none', 'last_major_derog_none',
            'revol_util', 'total_rec_late_fee', 'payment_inc_ratio', 'bad_loans']

***Exercise:*** 

Create a data frame that consists of the subset of features listed above.

In [None]:
### START CODE HERE ### 

df = df[features]

### END CODE HERE ###

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122607 entries, 0 to 122606
Data columns (total 15 columns):
grade                    122607 non-null object
home_ownership           122607 non-null object
emp_length_num           122607 non-null int64
sub_grade                122607 non-null object
short_emp                122607 non-null int64
dti                      122607 non-null float64
term                     122607 non-null object
purpose                  122607 non-null object
int_rate                 122607 non-null float64
last_delinq_none         122607 non-null int64
last_major_derog_none    122607 non-null int64
revol_util               122607 non-null float64
total_rec_late_fee       122607 non-null float64
payment_inc_ratio        122603 non-null float64
bad_loans                122607 non-null int64
dtypes: float64(5), int64(5), object(5)
memory usage: 14.0+ MB


Notice that `payment_inc_ratio` has some null values. Since only a few rows are affected (4 out of 122607), we simply remove the rows that have null values for `payment_inc_ratio`.

***Exercise:***

Create a new data frame in which the rows that contain null values for `payment_inc_ratio` are removed.

Hint: `~df.payment_inc_ratio.isnull()` returns a boolean series (a true/false mask) indicating which rows of `payment_inc_ratio` are **NOT** null. Construct the new data frame using `df[boolean mask]`.
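To see how such a mask behaves before applying it to the full data set, here is a minimal sketch on a three-row toy frame. The frame `toy` and its values are made up for illustration; only the masking pattern is the one from the hint.

```python
import pandas as pd

# Hypothetical toy data: the middle row has a missing payment_inc_ratio
toy = pd.DataFrame({
    'payment_inc_ratio': [4.2, None, 7.9],
    'bad_loans': [0, 1, 0],
})

# True where the value is present, False where it is null
mask = ~toy.payment_inc_ratio.isnull()

# Indexing with the mask keeps only the rows marked True
cleaned = toy[mask]

print(mask.tolist())
print(cleaned)
```

Note that the surviving rows keep their original index labels (0 and 2 here), which is why the index of the filtered data frame can still run up to the original last label even though some rows are gone.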

In [None]:
### START CODE HERE ###

# Keep only the rows where payment_inc_ratio is not null
loans_df = df[~df.payment_inc_ratio.isnull()]

### END CODE HERE ###

In [11]:
loans_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 122603 entries, 0 to 122606
Data columns (total 15 columns):
grade                    122603 non-null object
home_ownership           122603 non-null object
emp_length_num           122603 non-null int64
sub_grade                122603 non-null object
short_emp                122603 non-null int64
dti                      122603 non-null float64
term                     122603 non-null object
purpose                  122603 non-null object
int_rate                 122603 non-null float64
last_delinq_none         122603 non-null int64
last_major_derog_none    122603 non-null int64
revol_util               122603 non-null float64
total_rec_late_fee       122603 non-null float64
payment_inc_ratio        122603 non-null float64
bad_loans                122603 non-null int64
dtypes: float64(5), int64(5), object(5)
memory usage: 15.0+ MB


***Exercise:*** 

Encode the categorical columns (dtype=object). You can use the convenience method `get_dummies()` provided by [pandas](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html).
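As a quick illustration of what `get_dummies()` does, the sketch below encodes a made-up two-column frame (the frame `toy` is hypothetical and not part of the exercise data). The numeric column is passed through unchanged, and the object column is replaced by one indicator column per category.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric column
toy = pd.DataFrame({
    'grade': ['A', 'B', 'A'],
    'dti': [1.0, 2.0, 3.0],
})

encoded = pd.get_dummies(toy)

# Numeric columns come first, followed by the new indicator columns
print(list(encoded.columns))
print(encoded)
```

The same pattern explains why the encoded loans data frame ends up with many more columns than the original: each category of `grade`, `sub_grade`, `purpose` and so on becomes its own column.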

In [None]:
### START CODE HERE ###

loans_encoded = pd.get_dummies(loans_df)

### END CODE HERE ###

In [13]:
loans_encoded.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 122603 entries, 0 to 122606
Data columns (total 70 columns):
emp_length_num                122603 non-null int64
short_emp                     122603 non-null int64
dti                           122603 non-null float64
int_rate                      122603 non-null float64
last_delinq_none              122603 non-null int64
last_major_derog_none         122603 non-null int64
revol_util                    122603 non-null float64
total_rec_late_fee            122603 non-null float64
payment_inc_ratio             122603 non-null float64
bad_loans                     122603 non-null int64
grade_A                       122603 non-null uint8
grade_B                       122603 non-null uint8
grade_C                       122603 non-null uint8
grade_D                       122603 non-null uint8
grade_E                       122603 non-null uint8
grade_F                       122603 non-null uint8
grade_G                       122603 non-null u

### 5. Split the data set into train and test set

***Exercise:*** 

Separate the features and the label.  

Hint: use `df.drop()` and specify `axis=1` to remove a particular column in dataframe.

In [None]:
### START CODE HERE ###

# X_df contains all the feature columns and y_df contains only the label, i.e. bad_loans column

X_df = loans_encoded.drop(['bad_loans'], axis=1)
y_df = loans_encoded['bad_loans']

### END CODE HERE ### 

X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, 
                                                    test_size = .1, 
                                                    stratify = y_df,
                                                    random_state = 42)

In [15]:
print(y_train.value_counts())

0    89507
1    20835
Name: bad_loans, dtype: int64


### 6. The ***wrong*** way to oversample ###

With the training data created, we can oversample the minority class (`bad_loans` = 1). In this exercise, we will use SMOTE (from the [imblearn](https://imbalanced-learn.readthedocs.io/en/stable/index.html) library) to create synthetic samples of the minority class.

After upsampling to a class ratio of 1.0 (i.e. a 1-to-1 ratio between the positive and negative classes), you have a balanced data set. In practice, there is often no need to balance the classes completely.
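To make "synthetic samples" more concrete, the sketch below shows the core interpolation idea behind SMOTE in plain Python. It is a simplified illustration, not the imblearn implementation: the real algorithm first finds the k nearest minority-class neighbours of a sample and picks one of them at random, whereas here the neighbour is simply given. The function name `smote_like_sample` and the feature values are hypothetical.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Return a new point somewhere on the line segment between x and neighbor."""
    lam = rng.random()  # random interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(42)

# Two hypothetical minority-class samples with two features each
x = [10.0, 2.0]
neighbor = [14.0, 3.0]

synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

Because every synthetic point lies between two existing minority samples, it is by construction very similar to the data it was generated from.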

In [None]:
# Set sampling_strategy='minority' to oversample only the minority class
# fit_resample replaces the older fit_sample, which has been removed from imbalanced-learn

sm = SMOTE(sampling_strategy='minority', random_state=42)
X_upsample, y_upsample = sm.fit_resample(X_train, y_train)

Now you can see that the classes are totally balanced. `np.bincount()` counts the number of occurrences of each value in an array of non-negative integers.


In [17]:
print(np.bincount(y_upsample))

[89507 89507]
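It is worth checking how much of the balanced data is synthetic. The arithmetic below uses only the class counts printed in this notebook (the variable names are ours): the minority class had 20835 samples in `y_train` and has 89507 after upsampling.

```python
# Counts taken from the outputs above
minority_before = 20835   # bad loans in y_train before SMOTE
minority_after = 89507    # bad loans in y_upsample after SMOTE

synthetic_count = minority_after - minority_before
synthetic_fraction = synthetic_count / minority_after

print('synthetic minority samples = {}'.format(synthetic_count))
print('fraction of minority class that is synthetic = {:.4f}'.format(synthetic_fraction))
```

Roughly three out of four minority samples in the up-sampled set are synthetic, which matters for how that set may safely be split.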


Now let us split the up-sampled training data set into a training set and a validation set.

***Note:***

The naming of the training sets can be confusing. We start with the original data set, `X_df`, and split it into `X_train` and `X_test`. We up-sample `X_train` to get `X_upsample`. From `X_upsample` we then set aside a training set and a validation set, which we call `X_train_final` and `X_val_final` to distinguish them from the earlier `X_train` and `X_upsample`.

In [None]:
# Now split the up-sampled data into training and validation sets

X_train_final, X_val_final, y_train_final, y_val_final = train_test_split(X_upsample, y_upsample, 
                                                                          test_size=.1, random_state=42)

We then train a classifier and look at its performance.

In [19]:
clf_rf = RandomForestClassifier(n_estimators=25, random_state=42)
clf_rf.fit(X_train_final, y_train_final)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=25,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

As we are interested in how well our model picks out bad loans, it is useful to look at the recall score of the model, i.e. the fraction of actual bad loans that the model correctly flags.
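The difference between accuracy and recall is easy to see on a small hand-made example. The sketch below computes both metrics from scratch on ten made-up labels (the lists and helper names are hypothetical); for binary 0/1 labels, `accuracy_score` and `recall_score` from scikit-learn compute the same quantities.

```python
def accuracy(y_true, y_pred):
    """Fraction of all samples that are predicted correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def recall(y_true, y_pred):
    """Fraction of actual positives (label 1) that are predicted as positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return true_pos / (true_pos + false_neg)

# Toy labels: 4 bad loans (1) and 6 good loans (0); the model finds only one bad loan
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print('accuracy = {}'.format(accuracy(y_true, y_pred)))  # 0.7
print('recall = {}'.format(recall(y_true, y_pred)))      # 0.25
```

A model can look acceptable on accuracy while missing most of the positive class, which is exactly the situation this exercise is designed to expose.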

***Exercise:*** 

Find the accuracy and the recall of the model on the validation set, i.e. `X_val_final`

In [20]:
### START THE CODE ### 
y_val_final_pred = clf_rf.predict(X_val_final)

accuracy = accuracy_score(y_val_final, y_val_final_pred)
recall = recall_score(y_val_final, y_val_final_pred)

### END THE CODE ### 

print('accuracy = {}'.format(accuracy))
print('recall = {}'.format(recall))

accuracy = 0.8788403530331806
recall = 0.8043982760526025


Expected output: 

<img src='https://github.com/nyp-sit/sdaai-staff-repo/blob/master/iti103/session-7(imbalanced-data)/images/acc_recall_score1.png?raw=1' />

80% recall, that is pretty good! It means the model correctly identified 80% of the total bad loans. But is this actually representative of how the model will perform? To find out, let's test the model on the test set we created initially.

***Exercise:*** 

Find the accuracy and the recall of the model on the test set, i.e. `X_test`

In [21]:
### START THE CODE ### 
y_test_pred = clf_rf.predict(X_test)

accuracy = accuracy_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)

### END THE CODE ### 

print('accuracy = {}'.format(accuracy))
print('recall = {}'.format(recall))

accuracy = 0.8045836391811435
recall = 0.14514038876889848


Expected output: 

<img src='https://github.com/nyp-sit/sdaai-staff-repo/blob/master/iti103/session-7(imbalanced-data)/images/acc_recall_score2.png?raw=1' />

Only 80% accuracy and 15% recall on the test data. That’s disappointing! What has happened?

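By oversampling before splitting into training and validation sets, we “leaked” information from the validation set into the training of the model. SMOTE builds each synthetic sample by interpolating between existing minority samples, so after the split many validation samples are close relatives of samples that the model has already seen during training. The validation score is therefore far too optimistic (refer to your lecture for more details).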
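The leakage can be made visible with a small bookkeeping sketch. It uses no real data: 20 hypothetical minority samples are identified only by an index, each of 80 synthetic samples remembers the two originals it was interpolated from, and the combined list is split 90/10 in a fixed, deterministic way after oversampling. All names and sizes here are illustrative assumptions.

```python
# 20 original minority samples, identified by index 0..19 (toy data)
n_original = 20
originals = [('orig', i) for i in range(n_original)]

# 80 synthetic samples; each records the pair of originals it was interpolated from
synthetics = [('synth', (k % n_original, (k + 1) % n_original)) for k in range(80)]

samples = originals + synthetics

# Deterministic 90/10 split done AFTER oversampling: every 10th sample goes to validation
val = [s for idx, s in enumerate(samples) if idx % 10 == 0]
train = [s for idx, s in enumerate(samples) if idx % 10 != 0]

# Original samples that the model gets to see during training
train_originals = {payload for kind, payload in train if kind == 'orig'}

# Synthetic validation samples with at least one parent in the training set
val_synth = [parents for kind, parents in val if kind == 'synth']
leaked = [p for p in val_synth if p[0] in train_originals or p[1] in train_originals]

print('synthetic samples in validation: {}'.format(len(val_synth)))
print('of which built from a training sample: {}'.format(len(leaked)))
```

In this toy setup every synthetic validation sample was built from at least one sample that sits in the training set, so the validation set is not an independent check of the model.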

### 7. The ***right way*** to oversample

So, let us do it the right way and see what happens. This time, we oversample only the training set, not the combined training and validation data. Oversampling is done after we have set aside the validation set.

In [None]:
# Set aside a validation set first, before any oversampling

X_train_proper,  X_val_proper, y_train_proper, y_val_proper = train_test_split(X_train, y_train, test_size=.1, random_state=42)

As before, we now use SMOTE to oversample the minority class, but this time we oversample only the training portion.

***Note:***

The naming of the training sets can be confusing. We start with the original data set, `X_df`, and split it into `X_train` and `X_test`. From `X_train` we then set aside a training set and a validation set, which we call `X_train_proper` and `X_val_proper` to distinguish them from the earlier `X_train`.

***Exercise:***

Use SMOTE (as before) to over-sample the `X_train_proper`. 


In [None]:
### START CODE HERE ###

sm = SMOTE(sampling_strategy='minority', random_state=42)
# fit_resample replaces the older fit_sample, which has been removed from imbalanced-learn
X_train_proper_upsampled, y_train_proper_upsampled = sm.fit_resample(X_train_proper, y_train_proper)

### END CODE HERE ### 

We then train a classifier and look at its performance.

In [24]:
clf_rf = RandomForestClassifier(n_estimators=25, random_state=42)
clf_rf.fit(X_train_proper_upsampled, y_train_proper_upsampled)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=25,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

***Exercise:*** 

As before, find the accuracy and recall of the model on the validation set, i.e. `X_val_proper`

In [25]:
### START THE CODE ### 
y_val_proper_pred = clf_rf.predict(X_val_proper)

accuracy = accuracy_score(y_val_proper, y_val_proper_pred)
recall = recall_score(y_val_proper, y_val_proper_pred)

### END THE CODE ### 

print('accuracy = {}'.format(accuracy))
print('recall = {}'.format(recall))


accuracy = 0.8121431807884005
recall = 0.1484848484848485


Expected output: 

<img src='https://github.com/nyp-sit/sdaai-staff-repo/blob/master/iti103/session-7(imbalanced-data)/images/acc_recall_score3.png?raw=1' />

This time round, we got only 15% recall. Let's see if this recall rate is more representative of the result on the test set. 

***Exercise:*** 

Find the accuracy and the recall of the model on the test set, i.e. `X_test`

In [26]:
### START THE CODE ### 
y_test_pred = clf_rf.predict(X_test)

accuracy = accuracy_score(y_test, y_test_pred)
recall = recall_score(y_test, y_test_pred)

### END THE CODE ### 

print('accuracy = {}'.format(accuracy))
print('recall = {}'.format(recall))

accuracy = 0.8016475002038985
recall = 0.14211663066954644


Expected output: 

<img src='https://github.com/nyp-sit/sdaai-staff-repo/blob/master/iti103/session-7(imbalanced-data)/images/acc_recall_score4.png?raw=1' />

Now we can see that the recall obtained on the validation set (about 15%) closely matches the recall on the test set (about 14%).

### 8. Oversampling when doing K-Fold 

If you are doing K-fold cross-validation, the code below shows how to do the oversampling properly: within each fold, only the training portion is oversampled, and the validation portion is left untouched.

In [27]:
from sklearn.model_selection import StratifiedKFold

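# No shuffling is used, so random_state is omitted
# (recent scikit-learn versions raise an error if it is set while shuffle=False)
skfolds = StratifiedKFold(n_splits=3)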

clf_rf = RandomForestClassifier(n_estimators=25, random_state=42)
sm = SMOTE(sampling_strategy='minority',random_state=42) 

# We use enumerate() to return also the index position of the list so that we can print out the fold number
for fold, (train_index, val_index) in enumerate(skfolds.split(X_train, y_train)):
    #print(train_index, val_index)
    X_train_fold = X_train.iloc[train_index]
    y_train_fold = y_train.iloc[train_index]
    X_val_fold = X_train.iloc[val_index]
    y_val_fold = y_train.iloc[val_index]
    # Oversample only the training part of this fold (fit_resample replaces the removed fit_sample)
    X_train_fold_oversample, y_train_fold_oversample = sm.fit_resample(X_train_fold, y_train_fold)
    clf_rf.fit(X_train_fold_oversample, y_train_fold_oversample)
    y_val_fold_pred = clf_rf.predict(X_val_fold)
    print('Accuracy score for {} fold: {}'.format(fold, accuracy_score(y_val_fold,y_val_fold_pred)))
    print('Recall score for {} fold: {}'.format(fold, recall_score(y_val_fold, y_val_fold_pred)))

Accuracy score for 0 fold: 0.8018814061607895
Recall score for 0 fold: 0.14240460763138948
Accuracy score for 1 fold: 0.8001957532421631
Recall score for 1 fold: 0.1375089992800576
Accuracy score for 2 fold: 0.8020663404023926
Recall score for 2 fold: 0.14398848092152627


If you prefer to work with numpy arrays rather than data frames for the StratifiedKFold, you can first convert `X_train` and `y_train` to numpy arrays as follows:

In [28]:
X_train_arr = X_train.values
y_train_arr = y_train.values

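# No shuffling is used, so random_state is omitted
# (recent scikit-learn versions raise an error if it is set while shuffle=False)
skfolds = StratifiedKFold(n_splits=3)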

clf_rf = RandomForestClassifier(n_estimators=25, random_state=42)
sm = SMOTE(sampling_strategy='minority',random_state=42) 

# We use enumerate() to return also the index position of the list so that we can print out the fold number
for fold, (train_index, val_index) in enumerate(skfolds.split(X_train_arr, y_train_arr)):
    #print(train_index, val_index)
    X_train_fold = X_train_arr[train_index]
    y_train_fold = y_train_arr[train_index]
    X_val_fold = X_train_arr[val_index]
    y_val_fold = y_train_arr[val_index]
    # Oversample only the training part of this fold (fit_resample replaces the removed fit_sample)
    X_train_fold_oversample, y_train_fold_oversample = sm.fit_resample(X_train_fold, y_train_fold)
    clf_rf.fit(X_train_fold_oversample, y_train_fold_oversample)
    y_val_fold_pred = clf_rf.predict(X_val_fold)
    print('Accuracy score for {} fold: {}'.format(fold, accuracy_score(y_val_fold,y_val_fold_pred)))
    print('Recall score for {} fold: {}'.format(fold, recall_score(y_val_fold, y_val_fold_pred)))

Accuracy score for 0 fold: 0.8018814061607895
Recall score for 0 fold: 0.14240460763138948
Accuracy score for 1 fold: 0.8001957532421631
Recall score for 1 fold: 0.1375089992800576
Accuracy score for 2 fold: 0.8020663404023926
Recall score for 2 fold: 0.14398848092152627
