In [1]:
#Tutorial link: https://www.analyticsvidhya.com/blog/2016/03/practical-guide-deal-imbalanced-classification-problems/

In [3]:
#This tutorial shows how to tackle imbalanced classification
#imbalanced classification is the problem where one class in the dataset heavily outnumbers the other
#"imbalanced" refers to a disparity in the target variable
#e.g. a dataset contains 100,000 instances: 98% of them belong to the positive class, the other 2% to the negative class

In [1]:
#Why machine learning algorithms struggle with imbalanced datasets
#For example, LogisticRegression aims to minimize the average error over all instances in the dataset
##if the positive class is only a small portion of the dataset, it contributes far less to that average than the negative class, which has many more instances
##so the trained model mostly minimizes errors on the negative class rather than errors on the positive class
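
#A minimal sketch of the point above, using made-up labels with the 98% / 2% split from the earlier example
#(the names n_major, n_minor, y_true, y_pred are hypothetical and not part of the tutorial)
#a "model" that always predicts the majority class gets high accuracy but never finds the minority class

```r
# Hypothetical labels: 98,000 majority instances and 2,000 minority instances
n_major <- 98000
n_minor <- 2000
y_true <- c(rep("major", n_major), rep("minor", n_minor))

# A trivial classifier that always predicts the majority class
y_pred <- rep("major", n_major + n_minor)

# Accuracy looks excellent even though the classifier learned nothing
accuracy <- mean(y_pred == y_true)

# Recall on the minority class: share of true minority instances that were predicted as minority
minority_recall <- sum(y_pred == "minor" & y_true == "minor") / sum(y_true == "minor")

print(accuracy)
print(minority_recall)
```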

In [7]:
#load imbalanced dataset
library(ROSE)
data(hacide)
#hacide is already split into a train set (hacide.train) and a test set (hacide.test)
head(hacide.train, 5)

cls,x1,x2
0,0.2007981,0.67803761
0,0.01662009,1.5765579
0,0.22872469,-0.55953375
0,0.12637877,-0.09381378
0,0.60082129,-0.29839489


In [8]:
head(hacide.test, 5)

cls,x1,x2
0,0.05558898,2.09865792
0,-0.74531524,-2.84903952
0,-0.18493608,0.38072888
0,-0.98002974,0.01893521
0,0.10627565,0.90209911


In [10]:
#let's look at the class distribution in the training set
table(hacide.train$cls)


  0   1 
980  20 

In [13]:
prop.table(table(hacide.train$cls))


   0    1 
0.98 0.02 

In [14]:
#the training set is severely skewed: only 2% of the instances belong to class 1

In [18]:
#next, train a decision tree classifier on this skewed data
library(rpart)
tree_clf <- rpart(cls ~ ., data = hacide.train)

#predict the test set
y_test_pred <- predict(tree_clf, newdata = hacide.test)

#compute the model's precision, recall, and F score
#(column 2 of the prediction matrix holds the predicted probability of class 1)
accuracy.meas(hacide.test$cls, y_test_pred[, 2])


Call: 
accuracy.meas(response = hacide.test$cls, predicted = y_test_pred[, 
    2])

Examples are labelled as positive when predicted is greater than 0.5 

precision: 1.000
recall: 0.200
F: 0.167

In [19]:
#the model has a very low recall, meaning it finds very few of the positive instances in the test set
#since recall = TP / (TP + FN), a low recall means a high number of false negatives (FN)
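
#A small arithmetic check, assuming the usual F1 definition F1 = 2 * precision * recall / (precision + recall)
#it recomputes F1 from the precision and recall printed above; the result differs from the printed F of 0.167
#why accuracy.meas printed 0.167 is not clear from the output alone; it may depend on the installed ROSE version (an assumption)

```r
# Values copied from the accuracy.meas output above
precision <- 1.000
recall <- 0.200

# Standard F1 score: harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)

print(round(f1, 3))
```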

In [21]:
#compute model AUC
roc.curve(hacide.test$cls, y_test_pred[, 2], plotit = F)

Area under the curve (AUC): 0.600

In [22]:
#the AUC is 0.6, which is very poor: 0.5 corresponds to random guessing and 1.0 to a perfect ranking
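
#A minimal sketch of what AUC measures, assuming labels coded 0/1 and one score per instance
#auc_rank is a hypothetical helper written for illustration (it is not part of ROSE); the labels and scores are made up
#AUC is the share of (positive, negative) pairs in which the positive instance gets the higher score

```r
# Rank-based AUC (Mann-Whitney statistic); ties receive average ranks
auc_rank <- function(labels, scores) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  n_pos <- length(pos)
  n_neg <- length(neg)
  r <- rank(c(pos, neg))
  # Sum of the positive ranks, shifted and scaled to lie in [0, 1]
  (sum(r[seq_len(n_pos)]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up example: 4 of the 6 (positive, negative) pairs are ordered correctly
labels <- c(0, 0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8, 0.3)
auc_example <- auc_rank(labels, scores)

# A constant score carries no ranking information and gives 0.5
auc_constant <- auc_rank(labels, rep(0.5, 5))

print(auc_example)
print(auc_constant)
```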

In [25]:
#Now, let's create new balanced datasets using the following techniques
#1. Oversampling (sample more of the minority class)
#2. Undersampling (reduce the majority class)
#3. Both (apply over- and undersampling)
#4. ROSE (generate new synthetic data)

#Oversampling
over_data <- ovun.sample(cls ~ ., hacide.train, method = 'over', N = 1960)$data
table(over_data$cls)


  0   1 
980 980 
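
#A base-R sketch of the idea behind random oversampling, on a made-up toy data frame
#(toy, minority, majority, over_toy are hypothetical names; this is not necessarily how ovun.sample is implemented)
#minority rows are re-drawn with replacement until both classes have the same size

```r
set.seed(1)

# Toy data: 8 majority rows (cls = 0) and 2 minority rows (cls = 1)
toy <- data.frame(cls = c(rep(0, 8), rep(1, 2)), x1 = rnorm(10))
minority <- toy[toy$cls == 1, ]
majority <- toy[toy$cls == 0, ]

# Draw extra minority rows with replacement to close the gap
n_extra <- nrow(majority) - nrow(minority)
extra_idx <- sample(nrow(minority), n_extra, replace = TRUE)

# Keep every original row and append the duplicated minority rows
over_toy <- rbind(majority, minority, minority[extra_idx, ])
print(table(over_toy$cls))
```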

In [28]:
#Undersampling
under_data <- ovun.sample(cls ~., data = hacide.train, method = 'under', N = 40, seed = 1)$data
table(under_data$cls)


 0  1 
20 20 
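
#A base-R sketch of the idea behind random undersampling, on a made-up toy data frame
#(toy, minority, majority, under_toy are hypothetical names; this is not necessarily how ovun.sample is implemented)
#majority rows are dropped at random until both classes have the same size, so some information is discarded

```r
set.seed(1)

# Toy data: 8 majority rows (cls = 0) and 2 minority rows (cls = 1)
toy <- data.frame(cls = c(rep(0, 8), rep(1, 2)), x1 = rnorm(10))
minority <- toy[toy$cls == 1, ]
majority <- toy[toy$cls == 0, ]

# Keep a random subset of majority rows, as many as there are minority rows
keep_idx <- sample(nrow(majority), nrow(minority))

under_toy <- rbind(majority[keep_idx, ], minority)
print(table(under_toy$cls))
```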

In [30]:
#Both (over- and undersampling)
both_data <- ovun.sample(cls ~ ., data = hacide.train, method = 'both', N = 1000, seed = 1)$data
table(both_data$cls)


  0   1 
520 480 

In [32]:
#ROSE (generate new synthetic data)
rose_data <- ROSE(cls ~., hacide.train, N = 1000, seed = 1)$data
table(rose_data$cls)


  0   1 
520 480 
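
#A heavily simplified sketch of generating synthetic data around existing rows, on a made-up toy data frame
#assumptions: a fixed noise_sd stands in for the smoothing that ROSE selects itself, and all names here are hypothetical
#each new row picks a class at random, picks an existing row of that class, and adds Gaussian noise to its features
#because the class is drawn at random, the class counts come out close to equal but usually not exactly equal

```r
set.seed(1)

# Toy data: 8 majority rows (cls = 0) and 2 minority rows (cls = 1)
toy <- data.frame(cls = c(rep(0, 8), rep(1, 2)), x1 = rnorm(10), x2 = rnorm(10))

n_new <- 100      # number of synthetic rows to generate
noise_sd <- 0.1   # assumed fixed noise level (illustration only)

# Step 1: draw a class label for each synthetic row with equal probability
new_cls <- sample(c(0, 1), n_new, replace = TRUE)

# Step 2: for each label, pick one existing row of that class as the centre
seed_rows <- vapply(new_cls, function(k) {
  candidates <- which(toy$cls == k)
  candidates[sample.int(length(candidates), 1)]
}, integer(1))

# Step 3: perturb the centre's features with Gaussian noise
synthetic <- data.frame(
  cls = new_cls,
  x1 = toy$x1[seed_rows] + rnorm(n_new, sd = noise_sd),
  x2 = toy$x2[seed_rows] + rnorm(n_new, sd = noise_sd)
)
print(table(synthetic$cls))
```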

In [38]:
#Next, train 4 decision tree classifiers, one on each balanced dataset
over_tree_clf <- rpart(cls ~., data = over_data)
under_tree_clf <- rpart(cls ~., data = under_data)
both_tree_clf <- rpart(cls ~., data = both_data)
rose_tree_clf <- rpart(cls ~., data = rose_data)

#evaluate the performance of these models on the test set
test_over_pred <- predict(over_tree_clf, newdata = hacide.test)
test_under_pred <- predict(under_tree_clf, newdata = hacide.test)
test_both_pred <- predict(both_tree_clf, newdata = hacide.test)
test_rose_pred <- predict(rose_tree_clf, newdata = hacide.test)

In [42]:
#AUC score of model trained with Oversampling data
roc.curve(hacide.test$cls, test_over_pred[, 2], plotit = F)

Area under the curve (AUC): 0.798

In [43]:
#AUC score of model trained with Undersampling data
roc.curve(hacide.test$cls, test_under_pred[, 2], plotit = F)

Area under the curve (AUC): 0.867

In [44]:
#AUC score of model trained with both over- and undersampling data
roc.curve(hacide.test$cls, test_both_pred[, 2], plotit = F)

Area under the curve (AUC): 0.798

In [45]:
#AUC score of model trained with ROSE (synthetic) data
roc.curve(hacide.test$cls, test_rose_pred[, 2], plotit = F)

Area under the curve (AUC): 0.989

In [46]:
#The model trained on the ROSE-generated data has the highest AUC (0.989) on this test set
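
#For a side-by-side view, the AUC values printed in the cells above can be collected into one named vector
#(the vector name auc_scores and its element names are hypothetical; the numbers are copied from the outputs above)

```r
# AUC values copied from the roc.curve outputs above
auc_scores <- c(original = 0.600, over = 0.798, under = 0.867, both = 0.798, rose = 0.989)

# Show the models from best to worst
print(sort(auc_scores, decreasing = TRUE))

# Name of the training set that gave the highest AUC
best <- names(which.max(auc_scores))
print(best)
```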