# **Handling Imbalanced Dataset**

In simple terms, for a binary classifier, data imbalance means that one class of the **target variable** dominates the other.

Class imbalance also occurs in datasets for multiclass classification.

Imbalanced datasets are characterized by a rare class, which represents a small portion of the entire population (1 out of 1,000, 1 out of 10,000, or even rarer). Class imbalance can be intrinsic to the problem, meaning the data is imbalanced by its own nature, or it can result from limitations in data collection, caused by economic or privacy reasons.

The minority class is scarce, and so are examples of its characteristics and patterns, yet that information is extremely important for the trained model to discriminate the few minority samples from the crowd.

<hr>

In Machine Learning and Data Science we often come across the term Imbalanced Data Distribution, which generally arises when the observations in one class are much more or much less numerous than in the other classes. Because Machine Learning algorithms tend to increase accuracy by reducing the overall error, they do not take the class distribution into account. This problem is prevalent in applications such as Fraud Detection, Anomaly Detection, and Facial Recognition.

Standard ML techniques such as Decision Tree and Logistic Regression are biased towards the majority class and tend to ignore the minority class. They tend to predict only the majority class, so the minority class is misclassified far more often than the majority class. In more technical terms, with an imbalanced data distribution the model is prone to having negligible or very low recall on the minority class.

<hr>

### **Example**

- User churn in telecom industry.

- Detection of rare diseases.

- Detecting fraudulent credit card transactions.

In the examples above, the class of interest has far fewer data points than the other class.

**Balanced Data**

<img src = 'a_img.png'>

**Imbalanced Data**

<img src = 'b_img.png'>

Suppose we train a machine learning model to distinguish non-spam emails from spam emails. The entire dataset is composed of 44 emails, including 40 non-spam emails and 4 spam emails. The model used is a standard algorithm and doesn’t take the class distribution into account. The result achieved is the following:

<img src = 'f_img.png'>

The model obtains 90.9% accuracy. Great result! But is it a good model? Obviously not! The model acts like a Zero Rule model: only the majority class is found, while the rare class, which is the more interesting one, is ignored.

Accuracy treats every prediction as equally important, which is why it can’t be used as a measure of goodness for models working on class-imbalanced datasets.
Other metrics are necessary, such as:

-    Recall
-    Precision
-    F1 Measure
-    ROC curve and AUC

Which one is better? There isn’t a single best metric. It depends on many factors, such as the goal, the context, and the cost function: is it better to correctly classify one more unit of the rare class at the price of more False Positive errors (classifying non-spam email as spam), or to misclassify some units of the rare class while reducing False Positive errors?
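To make the spam example concrete, here is a minimal sketch that recomputes the metrics by hand. It assumes the model predicted "non-spam" for every email, which is what a Zero Rule model would do and is consistent with the 90.9% accuracy quoted above; the variable names are illustrative only.

```python
import numpy as np

# 44 emails: 40 non-spam (label 0) and 4 spam (label 1).
y_true = np.array([0] * 40 + [1] * 4)

# Assumption: the model behaves like a Zero Rule classifier and
# predicts the majority class (non-spam) for every email.
y_pred = np.zeros_like(y_true)

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # spam caught
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # spam missed
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # non-spam flagged as spam
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # non-spam let through

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
# Precision is undefined when nothing is predicted positive; report 0.0 here.
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0

print(f"accuracy  = {accuracy:.3f}")   # 0.909
print(f"recall    = {recall:.3f}")     # 0.000
print(f"precision = {precision:.3f}")  # 0.000
```

Accuracy looks excellent, while recall on the spam class is zero: not a single spam email is caught.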

<hr>

## **Ways to Handle Imbalanced Data**

### **1. Change the ML Algorithm or Model**
### **2. Change Metrics**
### **3. Re-sampling techniques**
📌 **Natural Resampling**: the main goal is to collect more data of the minority class. It’s a simple idea, but it’s not always possible. In datasets where the imbalance is part of the problem’s own nature, it’s hard to collect as much minority-class data as majority-class data.

📌 **Artificial Resampling**: it can be accomplished by:

-    **Undersampling**, by reducing data of the majority class
-    **Oversampling**, by replicating the minority class
-    **SMOTE** (Synthetic Minority Oversampling TEchnique), which makes synthetic data points by interpolating between each minority sample and its nearest minority neighbours
-    **NearMiss Algorithm**, an under-sampling technique that balances the class distribution by keeping only the majority class examples selected by their distance to the minority class

<hr>

In re-sampling, we either reduce the proportion of the dominant class, which is under-sampling, or increase the proportion of the minority class, which is oversampling.

However, the most successful approaches use oversampling and under-sampling together.




## **3.1. Under-sampling**
Undersampling can cut out some important and valuable information from the dataset. To overcome the loss of information caused by undersampling, a cluster-based undersampling approach can be applied. It uses the k-means algorithm: the clusters it creates have low within-cluster variance and high between-cluster variance, so the records inside a cluster share similar characteristics. The majority class is undersampled by taking only the centroids of the clusters created.

<img src = 'c_img.png'>

### **3.1.1 Random Under-sampling**

Random under-sampling is a very simple and intuitive under-sampling technique. It works by randomly choosing which samples of the dominant class to keep and discarding the rest.

There are two major drawbacks of this technique:

1. Samples are eliminated at random, which may discard potentially useful information.

2. The purpose of machine learning is for the classifier to estimate the probability distribution of the target population. Since that distribution is unknown, we estimate it from a sample distribution. Statistics tells us that as long as the sample is drawn randomly, the sample distribution can be used to estimate the population distribution from which it was drawn. Under-sampling only the majority class deliberately changes the class proportions, so the resulting sample is no longer a random draw from the population and the classifier's estimates may be biased.
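A minimal NumPy sketch of random under-sampling follows. The helper name `random_undersample` and the toy data are hypothetical, chosen only for illustration; the function assumes a binary problem in which the majority class is at least as large as the minority class.

```python
import numpy as np

def random_undersample(X, y, majority_label, random_state=0):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    rng = np.random.default_rng(random_state)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    # Sample without replacement so no majority row is duplicated.
    keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.sort(np.concatenate([keep_maj, min_idx]))
    return X[keep], y[keep]

# Toy data (hypothetical): 100 majority rows, 10 minority rows.
rng = np.random.default_rng(42)
X = rng.normal(size=(110, 2))
y = np.array([0] * 100 + [1] * 10)

X_res, y_res = random_undersample(X, y, majority_label=0)
print(np.bincount(y_res))  # [10 10]
```

Ninety of the 100 majority rows are thrown away without looking at them, which is exactly the information loss described in the first drawback.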

### **3.1.2 Cluster centroid under-sampling**

In this technique we form clusters of the majority class and replace each cluster with its centroid. So we under-sample the majority class by forming clusters and keeping only the cluster centroids.

For example:
- Majority class : 100 samples
- Minority class : 10 samples

Here, we can form 10 clusters of the majority class and replace the 100 points with 10 data points, i.e. the cluster centroids.
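The 100-versus-10 example can be sketched with scikit-learn's `KMeans`. The toy data below is randomly generated and purely illustrative; setting the number of clusters equal to the minority class size is one possible choice, not the only one.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (hypothetical): 100 majority points and 10 minority points.
rng = np.random.default_rng(0)
X_maj = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
X_min = rng.normal(loc=3.0, scale=0.5, size=(10, 2))

# One cluster per minority sample, so both classes end up the same size.
kmeans = KMeans(n_clusters=len(X_min), n_init=10, random_state=0).fit(X_maj)
X_maj_res = kmeans.cluster_centers_  # 10 centroids replace the 100 points

X_res = np.vstack([X_maj_res, X_min])
y_res = np.array([0] * len(X_maj_res) + [1] * len(X_min))
print(X_res.shape, np.bincount(y_res))  # (20, 2) [10 10]
```

Note that the centroids are synthetic points rather than original records. The imbalanced-learn library offers a ready-made `ClusterCentroids` sampler built on the same idea.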

### **3.1.3 Under-sampling using Tomek links**

A Tomek link is a pair of data points from opposite classes that are each other's nearest neighbours. The main idea is to separate the minority and majority classes more clearly.

Suppose,
- d(A,B) is the distance between two data points A & B that belong to different classes
- Then (A,B) is a Tomek link if and only if
- there is no point ‘C’ such that d(A,C) < d(A,B) or d(B,C) < d(A,B)

If a pair of samples forms a Tomek link, then either one of the samples is noise or both lie at the class border.

As an under-sampling technique we eliminate only the majority class point of each link, while as part of data mining we eliminate both points.
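The definition translates almost directly into code. The following is a brute-force sketch (all pairwise distances, so suitable only for small data); `find_tomek_links` and the one-dimensional toy data are hypothetical names and values used for illustration.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and belong to different classes."""
    # Pairwise Euclidean distances; a point is not its own neighbour.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)  # nearest neighbour of every point

    links = []
    for i, j in enumerate(nn):
        if i < j and nn[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# Toy 1-D data (hypothetical): class 0 is the majority class.
X = np.array([[0.0], [1.0], [2.0], [2.4], [5.0], [5.3]])
y = np.array([0, 0, 0, 1, 1, 1])

links = find_tomek_links(X, y)
print(links)  # [(2, 3)]

# Under-sampling: drop only the majority-class member of each link.
maj_in_links = [i if y[i] == 0 else j for i, j in links]
keep = np.setdiff1d(np.arange(len(y)), maj_in_links)
X_res, y_res = X[keep], y[keep]
print(np.bincount(y_res))  # [2 3]
```

Points 2 and 3 sit right at the border between the classes, so they form the only Tomek link; removing point 2 widens the gap between the two classes.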

### **3.1.4 NearMiss Algorithm**

NearMiss is an under-sampling technique. It aims to balance the class distribution by eliminating majority class examples. Rather than discarding them at random, it uses the distances between majority and minority instances to decide which majority instances to keep.

To limit the information loss that affects most under-sampling techniques, near-neighbor methods are widely used.

The basic intuition about the working of near-neighbor methods is as follows:
- **Step 1**: The method first finds the distances between all instances of the majority class and the instances of the minority class. Here, majority class is to be under-sampled.

- **Step 2**: Then, n instances of the majority class that have the smallest distances to those in the minority class are selected.

- **Step 3**: If there are k instances in the minority class, the nearest method will result in k*n instances of the majority class.

There are several variations of the NearMiss algorithm for choosing the n closest instances in the majority class:

-    NearMiss – Version 1: It selects the samples of the majority class whose average distance to the k closest instances of the minority class is smallest.
-    NearMiss – Version 2: It selects the samples of the majority class whose average distance to the k farthest instances of the minority class is smallest.
-    NearMiss – Version 3: It works in 2 steps. First, for each minority class instance, its M nearest neighbors are stored. Then the majority class instances whose average distance to the N nearest neighbors is largest are selected.
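As an illustration, Version 1 can be sketched in a few lines of NumPy. This is a simplified, brute-force reading of the rule above, not the imbalanced-learn implementation; `nearmiss1`, the default `k=3`, and the toy data are assumptions made for the example.

```python
import numpy as np

def nearmiss1(X, y, majority_label, k=3):
    """NearMiss version 1 (sketch): keep the majority samples whose mean
    distance to their k closest minority samples is smallest."""
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)

    # Distance from every majority sample to every minority sample.
    diff = X[maj_idx][:, None, :] - X[min_idx][None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Mean distance to the k closest minority samples, per majority sample.
    mean_k = np.sort(dist, axis=1)[:, :k].mean(axis=1)

    # Keep as many majority samples as there are minority samples.
    keep_maj = maj_idx[np.argsort(mean_k)[: len(min_idx)]]
    keep = np.sort(np.concatenate([keep_maj, min_idx]))
    return X[keep], y[keep]

# Toy data (hypothetical): 100 majority rows, 10 minority rows.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(3, 1, size=(10, 2))])
y = np.array([0] * 100 + [1] * 10)

X_res, y_res = nearmiss1(X, y, majority_label=0)
print(np.bincount(y_res))  # [10 10]
```

In imbalanced-learn, the variant is chosen with the `version` argument of `NearMiss`, for example `NearMiss(version=1)`.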

## **3.2. Over-sampling**
Oversampling, on the other hand, can lead to overfitting. Although adjusting the class distribution towards the optimal one has been shown to improve performance drastically, finding the best distribution is really difficult. Some datasets respond better to a fully balanced class distribution, while others achieve greater performance with a merely less skewed one. The researcher has to find the best solution by trial and error and some heuristics.

<img src = 'd_img.png'>

### **3.2.1. Random Oversampling**

In the random oversampling technique we replicate existing minority samples, chosen at random. In simple terms, we just multiply the existing minority class.

The major drawback of this technique is that it is highly prone to overfitting.
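A short NumPy sketch shows why the technique overfits so easily. The helper `random_oversample` and the toy data are hypothetical and only meant to illustrate the idea for a binary problem.

```python
import numpy as np

def random_oversample(X, y, minority_label, random_state=0):
    """Duplicate random minority rows until both classes have the same size."""
    rng = np.random.default_rng(random_state)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    # Draw with replacement: the same minority row can be copied many times.
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

# Toy data (hypothetical): 100 majority rows, 10 minority rows.
rng = np.random.default_rng(42)
X = rng.normal(size=(110, 2))
y = np.array([0] * 100 + [1] * 10)

X_res, y_res = random_oversample(X, y, minority_label=1)
print(np.bincount(y_res))  # [100 100]

# The minority class is larger, but it still holds only 10 distinct points.
n_distinct = len(np.unique(X_res[y_res == 1], axis=0))
print(n_distinct)  # 10
```

The classes are balanced, yet no new information was added: the model sees the same 10 minority points over and over and can simply memorise them.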

### **3.2.2. SMOTE (Synthetic Minority Oversampling Technique)**

In this technique we calculate the difference between the sample under consideration and one of its nearest neighbours. We multiply that difference by a random number between 0 and 1 and add the result to the sample under consideration.

This gives us a new sample point for the minority class.

Depending on the amount of oversampling required, neighbours from the k nearest neighbours are randomly chosen.

<img src = 'e_img.png'>

SMOTE (synthetic minority oversampling technique) is one of the most commonly used oversampling methods to solve the imbalance problem.
It aims to balance the class distribution by adding synthetic minority class examples rather than replicating existing ones.

SMOTE synthesises new minority instances between existing minority instances. It generates the virtual training records by linear interpolation for the minority class. These synthetic training records are generated by randomly selecting one or more of the k-nearest neighbors for each example in the minority class. After the oversampling process, the data is reconstructed and several classification models can be applied to the processed data.
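The interpolation step can be sketched as follows. This is a simplified illustration of the idea, not the imbalanced-learn implementation: `smote_sketch`, its parameters, and the toy data are assumptions, and the brute-force neighbour search only suits small minority classes with more than `k` samples.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, random_state=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours."""
    rng = np.random.default_rng(random_state)

    # k nearest minority neighbours of every minority sample (brute force).
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)         # sample under consideration
    nb = neighbours[base, rng.integers(0, k, size=n_new)]  # one random neighbour each
    gap = rng.random((n_new, 1))                           # random number in [0, 1)

    # new point = sample + gap * (neighbour - sample)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Toy data (hypothetical): 10 minority samples, 90 synthetic ones requested.
rng = np.random.default_rng(7)
X_min = rng.normal(size=(10, 2))
X_new = smote_sketch(X_min, n_new=90)
print(X_new.shape)  # (90, 2)
```

Every synthetic point lies on the line segment between two existing minority samples, so the new data stays inside the region already occupied by the minority class.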

<hr>

## **Credit Card Dataset**

**Context**

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

**Content**

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

In [8]:
import pandas  as pd 
import matplotlib.pyplot as plt 
import numpy as np 

from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LogisticRegression 
from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import confusion_matrix, classification_report 

In [3]:
# load the data set 
data = pd.read_csv('creditcard.csv') 

print(data.info()) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [6]:
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [7]:
# normalise the amount column 
data['normAmount'] = StandardScaler().fit_transform(np.array(data['Amount']).reshape(-1, 1)) 
  
# drop Time and Amount (not relevant for prediction purpose)
data = data.drop(['Time', 'Amount'], axis = 1) 
  
# there are 492 fraud transactions. 
data['Class'].value_counts() 

0    284315
1       492
Name: Class, dtype: int64

In [27]:
X = data.drop(['Class'], axis=1)
y = data['Class']

In [28]:
# split (80:20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0) 
  
# describes info about train and test set 
print("Number transactions X_train dataset: ", X_train.shape) 
print("Number transactions y_train dataset: ", y_train.shape) 
print("Number transactions X_test dataset: ", X_test.shape) 
print("Number transactions y_test dataset: ", y_test.shape) 


Number transactions X_train dataset:  (227845, 29)
Number transactions y_train dataset:  (227845,)
Number transactions X_test dataset:  (56962, 29)
Number transactions y_test dataset:  (56962,)


In [29]:
# logistic regression object 
lr = LogisticRegression() 
  
# train the model on train set 
lr.fit(X_train, y_train.ravel()) 
  
predictions = lr.predict(X_test) 
  
# print classification report 
print(classification_report(y_test, predictions)) 


              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56861
           1       0.88      0.63      0.74       101

    accuracy                           1.00     56962
   macro avg       0.94      0.82      0.87     56962
weighted avg       1.00      1.00      1.00     56962



### **Using SMOTE**

In [35]:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1))) 
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0))) 
  
# import SMOTE module from imblearn library 
from imblearn.over_sampling import SMOTE 
sm = SMOTE(random_state = 2) 
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
X_train_res, y_train_res = sm.fit_resample(X_train, y_train.ravel()) 
  
print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape)) 
print('After OverSampling, the shape of train_y: {} \n'.format(y_train_res.shape)) 
  
print("After OverSampling, counts of label '1': {}".format(sum(y_train_res == 1))) 
print("After OverSampling, counts of label '0': {}".format(sum(y_train_res == 0))) 

Before OverSampling, counts of label '1': 391
Before OverSampling, counts of label '0': 227454 

After OverSampling, the shape of train_X: (454908, 29)
After OverSampling, the shape of train_y: (454908,) 

After OverSampling, counts of label '1': 227454
After OverSampling, counts of label '0': 227454


The SMOTE algorithm has oversampled the minority instances until they equal the majority class in number, so both categories now have the same number of records.
Now let's look at the accuracy and recall results after applying SMOTE (oversampling).

In [36]:
lr1 = LogisticRegression() 
lr1.fit(X_train_res, y_train_res.ravel()) 
predictions = lr1.predict(X_test) 
  
# print classification report 
print(classification_report(y_test, predictions)) 

              precision    recall  f1-score   support

           0       1.00      0.97      0.99     56861
           1       0.06      0.94      0.12       101

    accuracy                           0.97     56962
   macro avg       0.53      0.96      0.55     56962
weighted avg       1.00      0.97      0.99     56962



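Accuracy has dropped to 97% compared with the previous model, but the recall of the minority class has improved from 63% to 94%. The price is precision on the fraud class, which fell from 0.88 to 0.06, so most flagged transactions are false alarms. Judged by recall, this model is better than the previous one.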

## **NearMiss Algorithm**

NearMiss is an under-sampling technique. It aims to balance the class distribution by eliminating majority class examples. Rather than discarding them at random, it uses the distances between majority and minority instances to decide which majority instances to keep.

In [33]:
print("Before Undersampling, counts of label '1': {}".format(sum(y_train == 1))) 
print("Before Undersampling, counts of label '0': {} \n".format(sum(y_train == 0))) 
  
# apply near miss 
from imblearn.under_sampling import NearMiss 
nr = NearMiss() 
  
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
X_train_miss, y_train_miss = nr.fit_resample(X_train, y_train.ravel()) 
  
print('After Undersampling, the shape of train_X: {}'.format(X_train_miss.shape)) 
print('After Undersampling, the shape of train_y: {} \n'.format(y_train_miss.shape)) 
  
print("After Undersampling, counts of label '1': {}".format(sum(y_train_miss == 1))) 
print("After Undersampling, counts of label '0': {}".format(sum(y_train_miss == 0))) 


Before Undersampling, counts of label '1': 391
Before Undersampling, counts of label '0': 227454 

After Undersampling, the shape of train_X: (782, 29)
After Undersampling, the shape of train_y: (782,) 

After Undersampling, counts of label '1': 391
After Undersampling, counts of label '0': 391


The NearMiss algorithm has undersampled the majority instances and made their number equal to that of the minority class. Here, the majority class has been reduced to the size of the minority class, so that both classes have an equal number of records.

In [34]:
# train the model on train set 
lr2 = LogisticRegression() 
lr2.fit(X_train_miss, y_train_miss.ravel()) 
predictions = lr2.predict(X_test) 
  
# print classification report 
print(classification_report(y_test, predictions)) 

              precision    recall  f1-score   support

           0       1.00      0.62      0.76     56861
           1       0.00      0.95      0.01       101

    accuracy                           0.62     56962
   macro avg       0.50      0.78      0.39     56962
weighted avg       1.00      0.62      0.76     56962



This model beats the first model on minority-class recall, which reaches 95%. But because of the undersampling of the majority class, the **recall of the majority class has decreased to 62%**, and precision on the fraud class is close to 0.

**SMOTE is better to use in this case.**

<hr>

## **Take Home Exercise**

**1. Use any ML classification model to predict credit card fraud in this dataset, before and after re-sampling the target:**
- Logistic Regression (as in the example above)
- Decision Tree
- Random Forest
- KNN

Which model is best on the Precision / Recall metrics?

**2. Use hyperparameter tuning after re-sampling the target (SMOTE):**
- **Logistic Regression** (Ivan, Gabriella, Taufik, Stevanus, Rifki, Aziz, Dicky, Ramzy)
- **Decision Tree** (Dani, Rizky P., Rinta, Jihar, Cahya, Indra, Shabrina, Agung)
- **Random Forest** (Dimas, Michael, Rizki L., Faris, Pradana, Imanta, Goldy, Fariz G.)

Which model is best on the Precision / Recall metrics?


## **Reference**:
- Aditya Patil, "Dealing with Imbalance Data", https://medium.com/@patiladitya81295/dealing-with-imbalance-data-1bacc7d68dff
- Roberta Pollastro, "How to handle Class Imbalance Problem", https://medium.com/quantyca/how-to-handle-class-imbalance-problem-9ee3062f2499
- Baptiste Rocca, "Handling imbalanced datasets in machine learning", https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28
- Geeksforgeeks, "ML | Handling Imbalanced Data with SMOTE and Near Miss Algorithm in Python", https://www.geeksforgeeks.org/ml-handling-imbalanced-data-with-smote-and-near-miss-algorithm-in-python/
- Rahul Agarwal, "The 5 most useful Techniques to Handle Imbalanced datasets", https://towardsdatascience.com/the-5-most-useful-techniques-to-handle-imbalanced-datasets-6cdba096d55a
- Dataset source: https://www.kaggle.com/mlg-ulb/creditcardfraud