## Imbalanced data for risk detection 


Imbalanced data refers to a classification problem in which the classes are not equally represented: one class (the majority) makes up most of the dataset, and the other (the minority) makes up only a small part.

Imbalanced datasets abound in the real world. Examples include credit card fraud prediction and network intrusion detection, where the risk events are rare (less than 1% of observations).

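Most machine learning algorithms perform poorly on imbalanced datasets by default: they tend to optimize overall accuracy, which is dominated by the majority class, so the rare risk events are easily missed.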

We will work through a simple example that illustrates some of the challenges imbalanced datasets pose when building machine learning models to identify risk patterns and other forms of deception.


This notebook outlines the choice of:

- evaluation metrics
- sampling technique
- ensemble learning: gradient boosting or random forest classifier

As imbalanced data, we use loan data from Lending Club.

Most loans are paid off.
Output (target) variable: bad_loans, which is 1 if the loan was charged off or the borrower defaulted, and 0 otherwise.



In [103]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Oversampling library for dealing with imbalanced datasets
from imblearn.over_sampling import SMOTE
from collections import Counter


from sklearn.metrics import confusion_matrix, accuracy_score

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use("fivethirtyeight")
import warnings
warnings.filterwarnings('ignore')


### Load the data

In [22]:
loans = pd.read_csv('C:/Users/uknow/Desktop/loan2.csv')

loans.tail()

Unnamed: 0.1,Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,...,sub_grade_num,delinq_2yrs_zero,pub_rec_zero,collections_12_mths_zero,short_emp,payment_inc_ratio,final_d,last_delinq_none,last_record_none,last_major_derog_none
50738,50738,6904966,8547084,24000,24000,24000,36 months,11.55,792.0,B,...,0.6,1.0,1.0,1.0,0,7.04,20160901T000000,1,1,1
50739,50739,6905934,8547992,6000,6000,6000,36 months,24.08,235.65,F,...,0.4,0.0,1.0,1.0,0,2.45896,20160801T000000,0,1,0
50740,50740,6906006,8548063,11000,11000,11000,36 months,11.55,363.0,B,...,0.6,1.0,1.0,1.0,0,5.12471,20160801T000000,1,1,1
50741,50741,6915061,8557157,18825,18825,18825,36 months,16.78,669.11,C,...,1.0,1.0,1.0,1.0,0,18.6728,20160901T000000,1,1,1
50742,50742,6896033,8538061,8675,8675,8675,36 months,16.78,308.34,C,...,1.0,1.0,0.0,1.0,0,14.5661,20160901T000000,1,0,1


### Explore some features

In [23]:
loans.columns

Index(['Unnamed: 0', 'id', 'member_id', 'loan_amnt', 'funded_amnt',
       'funded_amnt_inv', 'term', 'int_rate', 'installment', 'grade',
       'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc',
       'is_inc_v', 'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc',
       'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs',
       'earliest_cr_line', 'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'initial_list_status', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp',
       'total_rec_int', 'total_rec_late_fee', 'recoveries',
       'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt',
       'next_pymnt_d', 'last_credit_pull_d', 'collections_12_mths_ex_med',
       'mths_since_last_major_derog', 'policy_code', 'not_compliant', 'status',
       'inactive_loans', 'bad_loans', 'emp_length_num', 'grade_

In [24]:
# features:
loans.iloc[0]

Unnamed: 0                                                                     0
id                                                                       1077501
member_id                                                                1296599
loan_amnt                                                                   5000
funded_amnt                                                                 5000
funded_amnt_inv                                                             4975
term                                                                   36 months
int_rate                                                                   10.65
installment                                                               162.87
grade                                                                          B
sub_grade                                                                     B2
emp_title                                                                    NaN
emp_length                  

In [25]:
loans.shape

(50743, 69)

In [26]:
loans['home_ownership']

0            RENT
1            RENT
2            RENT
3            RENT
4            RENT
5            RENT
6             OWN
7            RENT
8             OWN
9             OWN
10           RENT
11           RENT
12           RENT
13           RENT
14           RENT
15       MORTGAGE
16       MORTGAGE
17           RENT
18           RENT
19            OWN
20           RENT
21           RENT
22       MORTGAGE
23           RENT
24           RENT
25       MORTGAGE
26           RENT
27       MORTGAGE
28       MORTGAGE
29           RENT
           ...   
50713    MORTGAGE
50714        RENT
50715        RENT
50716    MORTGAGE
50717    MORTGAGE
50718    MORTGAGE
50719    MORTGAGE
50720         OWN
50721    MORTGAGE
50722        RENT
50723        RENT
50724    MORTGAGE
50725    MORTGAGE
50726        RENT
50727         OWN
50728    MORTGAGE
50729    MORTGAGE
50730    MORTGAGE
50731         OWN
50732        RENT
50733        RENT
50734    MORTGAGE
50735    MORTGAGE
50736    MORTGAGE
50737    M

### Explore the distribution of the column bad_loans

In [27]:
loans.bad_loans.value_counts()

0    42325
1     8418
Name: bad_loans, dtype: int64

In [28]:
# Drop rows where payment_inc_ratio is missing
loans = loans[~loans.payment_inc_ratio.isnull()]

### Feature selection 

We use a subset of the features for the classification algorithm.

In [29]:

model_variables = ['grade', 'home_ownership','emp_length_num', 'sub_grade','short_emp',
            'dti', 'term', 'purpose', 'int_rate', 'last_delinq_none', 'last_major_derog_none',
            'revol_util', 'total_rec_late_fee', 'payment_inc_ratio', 'bad_loans']

loans_data_relevent = loans[model_variables]

In [30]:
loans_relevant_enconded = pd.get_dummies(loans_data_relevent)

In [31]:
training_features, test_features, \
training_target, test_target, = train_test_split(loans_relevant_enconded.drop(['bad_loans'], axis=1),
                                               loans_relevant_enconded['bad_loans'],
                                               test_size = .1,
                                               random_state=12)

### Resample the training set

#### Over-sampling

Oversampling tries to balance the dataset by increasing the number of rare (minority-class) samples. Here, the new rare samples are generated with SMOTE.

SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling algorithm that creates synthetic data by interpolating between a minority sample and its nearest minority-class neighbors.

Oversample only the training data, so that none of the information in the validation data is used to create synthetic observations.
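The nearest-neighbor idea behind SMOTE can be sketched in a few lines of plain Python. This is only an illustration of the interpolation step, not the imbalanced-learn implementation: the function name `smote_sketch`, the value `k=3`, and the four sample points are all hypothetical.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=12):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (illustration only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Hypothetical 2-D minority samples (not taken from the loans data)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=5)
for point in new_points:
    print(point)
```

Every synthetic point lies on a line segment between two real minority samples, which is why the validation and test data must be set aside before this step: otherwise synthetic training points would be built from observations the model is later evaluated on.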

In [33]:
x_train, x_val, y_train, y_val = train_test_split(training_features, training_target,
                                                  test_size = .1,
                                                  random_state=12)

In [34]:
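# Recent imbalanced-learn releases use sampling_strategy (formerly ratio)
# and fit_resample (formerly fit_sample)
sm = SMOTE(random_state=12, sampling_strategy=1.0)
x_train_res, y_train_res = sm.fit_resample(x_train, y_train)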

### Build a Random Forest classifier

In [51]:
rf = RandomForestClassifier(n_estimators=25, random_state=12)
rf.fit(x_train_res, y_train_res)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=25,
                       n_jobs=None, oob_score=False, random_state=12, verbose=0,
                       warm_start=False)

In [52]:
rf.feature_importances_

array([0.03992866, 0.00746543, 0.06581897, 0.07290858, 0.050041  ,
       0.0070018 , 0.06482734, 0.01917981, 0.06571771, 0.02086042,
       0.01936511, 0.02210716, 0.03838506, 0.01277534, 0.00645334,
       0.00128952, 0.04693943, 0.00048213, 0.0122958 , 0.05439946,
       0.00092985, 0.00227116, 0.00218933, 0.0041764 , 0.00946774,
       0.00583736, 0.00629839, 0.01341714, 0.00968797, 0.00768973,
       0.00767839, 0.00653448, 0.00732547, 0.00424021, 0.00465881,
       0.00740389, 0.00623926, 0.00736257, 0.00457321, 0.00352546,
       0.00245484, 0.00173584, 0.00214417, 0.00174463, 0.00138867,
       0.00161682, 0.0010767 , 0.00107546, 0.00063436, 0.00081056,
       0.00045505, 0.00037515, 0.00028669, 0.0003426 , 0.00029952,
       0.05191662, 0.05705597, 0.00291123, 0.02136387, 0.05492115,
       0.00837549, 0.0010081 , 0.00443686, 0.00214672, 0.00219954,
       0.01447906, 0.01041723, 0.00100959, 0.00156864])

In [54]:
rf.predict_proba(x_train_res)

array([[0.88, 0.12],
       [0.16, 0.84],
       [1.  , 0.  ],
       ...,
       [0.  , 1.  ],
       [0.04, 0.96],
       [0.  , 1.  ]])

In [93]:
# Accuracy on the resampled training data

rfpred = rf.predict(x_train_res)
print(round(accuracy_score(y_train_res, rfpred),2)*100)

100.0


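The 100% above is accuracy on the resampled training data, so it mostly shows that the forest has memorized its training set. More generally, accuracy is a misleading measure of model quality on imbalanced data: a model that classifies every sample as "0" would still reach about 83% accuracy on the original loans data (42,325 of 50,743 loans are good) while detecting no bad loans at all. This metric alone therefore gives us little useful information.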
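A model that always predicts 0 can be scored by hand from the class counts printed earlier by `value_counts()`. The short calculation below is a sketch that uses only those two counts; the variable names are illustrative.

```python
# Class counts from the bad_loans value_counts() output earlier
n_good, n_bad = 42325, 8418

# A trivial "model" that predicts 0 (good loan) for every row
true_negatives = n_good   # every good loan is labelled correctly
true_positives = 0        # no bad loan is ever flagged

baseline_accuracy = (true_negatives + true_positives) / (n_good + n_bad)
baseline_recall = true_positives / n_bad

print(f"accuracy: {baseline_accuracy:.3f}")  # accuracy: 0.834
print(f"recall:   {baseline_recall:.3f}")    # recall:   0.000
```

Any accuracy figure reported for this dataset is best read against that baseline rather than against 50%.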

### Use the right evaluation metrics

Apply evaluation metrics that remain informative for models built on imbalanced data, such as:

- Precision: the fraction of instances predicted positive that are actually positive.
- Recall/Sensitivity: the fraction of actual positive instances that the model identifies.
- F1 score: the harmonic mean of precision and recall.
- AUC: the area under the ROC curve, which plots the true-positive rate against the false-positive rate.
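AUC has a convenient ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python; the helper `auc_by_ranking` and the six labels and scores are hypothetical and not taken from the loans data.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities of class 1
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.10, 0.30, 0.35, 0.80, 0.40, 0.90]

auc = auc_by_ranking(y_true, scores)
print(auc)  # 0.875 (7 of the 8 positive/negative pairs are ranked correctly)
```

Because it depends only on how examples are ranked, AUC does not change with the decision threshold or with the share of positives in the data, which is one reason it is often preferred over accuracy for imbalanced problems.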


In [91]:
rfpred = rf.predict(x_train_res)
print(confusion_matrix(y_train_res, rfpred ))

[[34248     1]
 [  116 34133]]
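The counts in this confusion matrix are enough to recompute the metrics by hand. The sketch below assumes scikit-learn's layout, in which rows are true classes and columns are predicted classes, and it describes the resampled training data only, not held-out data.

```python
# Confusion matrix printed above; scikit-learn's layout is
# rows = true class, columns = predicted class: [[TN, FP], [FN, TP]]
tn, fp = 34248, 1
fn, tp = 116, 34133

precision = tp / (tp + fp)   # share of predicted bad loans that are bad
recall = tp / (tp + fn)      # share of bad loans that are detected
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
print(f"accuracy:  {accuracy:.4f}")
```

All four values are close to 1 here because the forest is being scored on the same resampled rows it was trained on.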


In [55]:
rf.score(x_val, y_val)

0.8327129406612656

In [73]:
recall_score(y_val, rf.predict(x_val))

0.13451086956521738

In [74]:
rf.score(test_features, test_target)

0.8259755616870319

In [75]:
recall_score(test_target, rf.predict(test_features))

0.12484993997599039

### Build a gradient boosting classifier

In [48]:
# scikit-learn's gradient boosting implementation (not the XGBoost library)
from sklearn.ensemble import GradientBoostingClassifier

In [49]:
gbk = GradientBoostingClassifier()
gbk.fit(x_train_res, y_train_res)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [94]:
gbkpred = gbk.predict(x_train_res)

# Accuracy on the resampled training data
print(round(accuracy_score(y_train_res, gbkpred),2)*100)


89.0


### Validation results

In [79]:
# Confusion matrix on the resampled training data

print(confusion_matrix(y_train_res, gbkpred ))

[[33477   772]
 [ 6778 27471]]


In [98]:
gbk.score(x_val, y_val)

0.8390628421283118

In [99]:
recall_score(y_val, gbk.predict(x_val))

0.12635869565217392

In [84]:
gbk.score(test_features, test_target)

0.8332676389436342

In [86]:
recall_score(test_target, gbk.predict(test_features))

0.10684273709483794

### Interpretation


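Random forests and gradient boosting combine many trees, which typically makes them less prone to overfitting than a single decision tree. Even so, both models fit the resampled training data far better than held-out data: the random forest reaches 100% training accuracy but about 83% validation accuracy, and recall on bad loans is only roughly 11% to 13% on the validation and test sets.

In our case, the validation results closely match the results on the unseen test data.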
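How closely validation and test results agree can be quantified with a short calculation. The sketch below copies the scores from the cell outputs above, rounded to four decimals; the dictionary layout is only an illustrative choice.

```python
# (validation, test) scores copied from the cell outputs above
results = {
    "random forest": {
        "accuracy": (0.8327, 0.8260),
        "recall": (0.1345, 0.1248),
    },
    "gradient boosting": {
        "accuracy": (0.8391, 0.8333),
        "recall": (0.1264, 0.1068),
    },
}

gaps = {}
for model, metrics in results.items():
    for metric, (val, test) in metrics.items():
        gaps[(model, metric)] = round(val - test, 4)
        print(f"{model:17s} {metric:8s} val={val:.4f} test={test:.4f} "
              f"gap={gaps[(model, metric)]:.4f}")
```

Every gap is below two percentage points, so the validation set gives a reasonable preview of test performance, including the low recall on bad loans.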