# Handling imbalanced datasets with machine learning

- Imbalanced data refers to datasets where the target class (dependent variable) has an uneven distribution of observations.
- One class label has a very high number of observations and the other has a very low number.
- Imbalance is said to have less impact when ensemble or decision-tree models are used.
- If a model trained on imbalanced data reports very high accuracy, treat it with suspicion and check other performance metrics such as precision and recall.
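To see why accuracy alone misleads, here is a minimal sketch that scores a hypothetical "always predict 0" baseline, using the class counts reported by `value_counts()` later in this notebook. No model is trained; the variable names are illustrative only.

```python
# Class counts taken from df['Class'].value_counts() in this notebook.
n_legit, n_fraud = 284315, 492

# A "model" that always predicts class 0 gets every legitimate row right
# and every fraud row wrong.
baseline_accuracy = n_legit / (n_legit + n_fraud)
baseline_recall_fraud = 0 / n_fraud  # it never flags a single fraud

print(f"accuracy: {baseline_accuracy:.4f}")        # accuracy: 0.9983
print(f"fraud recall: {baseline_recall_fraud:.1f}")  # fraud recall: 0.0
```

Any real model therefore has to beat roughly 99.83% accuracy before accuracy says anything useful.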


### Ways to deal with an imbalanced dataset
1. Hyperparameter tuning
2. Under Sampling
3. Over Sampling
4. SMOTE Tomek
5. Ensemble

In [1]:
import pandas as pd
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df= pd.read_csv('creditcard.csv')
df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [3]:
df.shape

(284807, 31)

In [4]:
df.isnull().sum()
# no null values

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

In [5]:
# Class is the dependent variable.
df['Class'].value_counts()

# There are far more 0s (no fraud) than 1s (fraud).

# The dependent variable is heavily imbalanced.

0    284315
1       492
Name: Class, dtype: int64

In [6]:
# Separate the independent and dependent features

X= df.drop("Class", axis=1)
y= df["Class"]

In [7]:
#train test split 

from sklearn.model_selection import train_test_split
X_train, X_test, y_train,y_test= train_test_split(X,y, train_size=0.70)
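The split above is purely random, so the share of fraud rows in train and test can drift between runs. A possible alternative, shown here on made-up toy labels rather than `creditcard.csv`, is to pass `stratify` (and a fixed `random_state` for reproducibility) to `train_test_split`. This is a sketch of the option, not what the notebook ran.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data with 1% positives (made up for illustration).
X_toy = [[i] for i in range(1000)]
y_toy = [0] * 990 + [1] * 10

# stratify keeps the class ratio the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.70, stratify=y_toy, random_state=0
)
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts, test_counts)
```

With 10 positives and a 70/30 split, stratification places 7 in the training set and 3 in the test set.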

## 1. Cross validation and hyperparameter tuning
### Use this solution as a last resort

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report # to check accuracy
from sklearn.model_selection import KFold  # to use cross validation
from sklearn.model_selection import GridSearchCV

In [9]:
10.0**np.arange(-2,3)
# This is the list of candidate values for C, the inverse of regularization strength
# (smaller C means stronger regularization); it is not a learning rate.

# Using 10 instead of 10.0 raises: Integers to negative integer powers are not allowed.

array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])
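The integer-versus-float behaviour can be checked directly. The sketch below reproduces the error and shows `np.logspace` as an equivalent way to build the same grid; the variable names are illustrative.

```python
import numpy as np

# An integer base with negative integer exponents raises ValueError in NumPy.
try:
    10 ** np.arange(-2, 3)
    int_power_failed = False
except ValueError:
    int_power_failed = True

# Two equivalent ways to get a log-spaced grid of floats.
grid_a = 10.0 ** np.arange(-2, 3)
grid_b = np.logspace(-2, 2, num=5)

print(int_power_failed, np.allclose(grid_a, grid_b))
```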

In [10]:
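# NOTE: the default solver (lbfgs) supports only the 'l2' penalty, so every 'l1'
# candidate fails (see the warning below); solver='liblinear' would accept both.
log_class= LogisticRegression()
grid={'C':10.0 **np.arange(-2,3), 'penalty':['l1','l2']} # hyperparameter grid
cv= KFold(n_splits=5, shuffle= False, random_state=None)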

In [11]:
clf= GridSearchCV(log_class, grid, cv= cv, n_jobs=-1, scoring="f1_macro")
# f1_macro averages the F1 score of both classes equally, so the rare class counts as much as the common one

In [12]:
clf.fit(X_train, y_train)

25 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/melissavidiera/opt/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/melissavidiera/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/Users/melissavidiera/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 

GridSearchCV(cv=KFold(n_splits=5, random_state=None, shuffle=False),
             estimator=LogisticRegression(), n_jobs=-1,
             param_grid={'C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]),
                         'penalty': ['l1', 'l2']},
             scoring='f1_macro')
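Half of the fits above failed because `lbfgs` rejects the `l1` penalty. One way to search both penalties is to use the `liblinear` solver, which supports `l1` and `l2`. The sketch below runs on synthetic data from `make_classification` (a stand-in, not `creditcard.csv`) and assumes a scikit-learn version that accepts the `penalty` parameter; `StratifiedKFold` is swapped in for `KFold` so that every fold contains some positives.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (about 5% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# liblinear supports both 'l1' and 'l2', so no combination is rejected.
param_grid = {"C": 10.0 ** np.arange(-2, 3), "penalty": ["l1", "l2"]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    LogisticRegression(solver="liblinear"), param_grid, cv=cv, scoring="f1_macro"
)
search.fit(X_demo, y_demo)

# Count parameter combinations whose score is NaN (i.e. whose fits failed).
n_failed = int(np.isnan(search.cv_results_["mean_test_score"]).sum())
print(search.best_params_, n_failed)
```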

In [13]:
y_pred=clf.predict(X_test)
print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))
print("Accuracy: " ,accuracy_score(y_test, y_pred))
print("Classification report: \n", classification_report(y_test, y_pred))


Confusion Matrix: 
 [[85247    51]
 [   54    91]]
Accuracy:  0.9987711105649381
Classification report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     85298
           1       0.64      0.63      0.63       145

    accuracy                           1.00     85443
   macro avg       0.82      0.81      0.82     85443
weighted avg       1.00      1.00      1.00     85443



- Accuracy here is 99.9%, but only because there are far fewer 1s than 0s: the model still misses 54 of the 145 frauds in the test set.
- As mentioned before, high accuracy on an imbalanced dataset is a warning sign, so we need to check other performance metrics.
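The class 1 row of the classification report can be recomputed by hand from the confusion matrix printed above, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. A small sketch:

```python
# Confusion matrix printed above, laid out as [[TN, FP], [FN, TP]].
tn, fp = 85247, 51
fn, tp = 54, 91

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the rows flagged as fraud, how many were fraud
recall = tp / (tp + fn)      # of the actual frauds, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.9988 precision=0.64 recall=0.63 f1=0.63
```

Accuracy is dominated by the 85,247 true negatives, while precision and recall depend only on the few hundred rows involving class 1.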

In [14]:
# Class weights
# Before running random forest, we will change the class weight.
# Checking value counts

y_train.value_counts()
# We have about 199K 0s and 347 1s. To balance this, we can weight each 1 by 100
# so that the minority class carries more importance.


0    199017
1       347
Name: Class, dtype: int64

In [15]:
class_weight= {0: 1, 1: 100}
# Class weights are passed as a dictionary.
# Class 0 gets weight 1 (no change).
# Class 1 gets weight 100 (errors on class 1 count 100 times as much).
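The weight of 100 is picked by hand. For comparison, scikit-learn's `class_weight="balanced"` option uses the heuristic `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the training counts shown above; it only does the arithmetic and does not train anything.

```python
# Training-set class counts from y_train.value_counts() above.
counts = {0: 199017, 1: 347}

n_samples = sum(counts.values())
n_classes = len(counts)

# Heuristic behind class_weight="balanced":
# weight = n_samples / (n_classes * class_count)
balanced = {label: n_samples / (n_classes * n) for label, n in counts.items()}

print({label: round(w, 2) for label, w in balanced.items()})
# Relative weight of class 1 versus class 0.
print(round(balanced[1] / balanced[0], 1))
```

Under this heuristic class 1 would weigh several hundred times more than class 0, so 1:100 is a milder correction than full balancing.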

In [16]:
# Applying random forest,
# as imbalance is said to have less impact when ensemble/decision-tree models are used
# Applying class weights

from sklearn.ensemble import RandomForestClassifier
classifier= RandomForestClassifier(class_weight=class_weight)
classifier.fit(X_train, y_train)

RandomForestClassifier(class_weight={0: 1, 1: 100})

In [17]:
# printing accuracy scores

y_pred=classifier.predict(X_test)
print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))
print("Accuracy: " ,accuracy_score(y_test, y_pred))
print("Classification report: \n", classification_report(y_test, y_pred))

# False positives drop from 51 to 7 and false negatives from 54 to 36, better than the previous model.
# True positives increase from 91 to 109.
# For class 1, precision is 0.94, recall is 0.75 and F1 is 0.84.
# Whether to prioritise reducing false positives or false negatives depends on the use case.

Confusion Matrix: 
 [[85291     7]
 [   36   109]]
Accuracy:  0.9994967405170698
Classification report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     85298
           1       0.94      0.75      0.84       145

    accuracy                           1.00     85443
   macro avg       0.97      0.88      0.92     85443
weighted avg       1.00      1.00      1.00     85443



## 2. Under Sampling 

- Undersampling balances an uneven dataset by keeping all of the data in the minority class and decreasing the size of the majority class.
- It is usually not preferred because it discards data.


- (int) <b>How and when to use undersampling</b>
- If the dataset is small, we can go with undersampling.
- We also need to look at all the metrics (recall, precision, F1 score) and use domain knowledge to decide whether false positives or false negatives matter more. Based on this, I check the ROC score. If performance is not good, I try SMOTE or oversampling.
- If nothing works, I finally go with an ensemble such as XGBoost, together with hyperparameter tuning such as class weights.

In [18]:
y_train.value_counts()

0    199017
1       347
Name: Class, dtype: int64

In [19]:
from collections import Counter # counts the 0s and 1s
Counter(y_train)

Counter({0: 199017, 1: 347})

In [20]:
from imblearn.under_sampling import NearMiss

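# sampling_strategy=0.80 asks for minority/majority = 0.8 after resampling;
# recent imbalanced-learn versions require it as a keyword argument.
ns= NearMiss(sampling_strategy=0.80)
X_train_ns, y_train_ns= ns.fit_resample(X_train,y_train) # creates two new variables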

print("The number of classes before fit {}".format(Counter(y_train)))
print("The number of classes after fit {}".format(Counter(y_train_ns)))



The number of classes before fit Counter({0: 199017, 1: 347})
The number of classes after fit Counter({0: 433, 1: 347})


##### The number of 0s has been reduced from about 199K to 433

In [21]:
# sampling_strategy=0.8 means (number of 1s) = 0.8 * (number of 0s) after resampling.
# The 440 and 352 here do not match the split shown above (433 and 347), but the ratio is the same.
0.8* 440 

352.0
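For the split actually shown above (199,017 zeros and 347 ones), the same ratio arithmetic can be sketched as follows. The truncation with `int` is an assumption about how the count is rounded; it is consistent with the 433 printed by the NearMiss cell.

```python
# Class counts in the training split shown above.
n_majority, n_minority = 199017, 347

ratio = 0.80  # sampling_strategy: desired minority / majority after resampling

# Under-sampling keeps every minority row and shrinks the majority class.
majority_after = int(n_minority / ratio)

print(majority_after)                          # 433
print(round(n_minority / majority_after, 3))   # achieved ratio, close to 0.8
```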

In [22]:
# fit_sample raises an error in current imbalanced-learn: the method is now fit_resample.
# RandomUnderSampler can be used instead of NearMiss.

#from imblearn.under_sampling import RandomUnderSampler  
#under_sampler = RandomUnderSampler(sampling_strategy=0.80)

#X_train_ns, y_train_ns= under_sampler.fit_resample(X_train,y_train) # creates two new variables

#print("The number of classes before fit {}".format(Counter(y_train)))
#print("The number of classes after fit {}".format(Counter(y_train_ns)))


In [23]:
# Train a random forest on the undersampled data
from sklearn.ensemble import RandomForestClassifier
classifier= RandomForestClassifier()
classifier.fit(X_train_ns, y_train_ns)

RandomForestClassifier()

In [24]:
y_pred=classifier.predict(X_test)
print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))
print("Accuracy: " ,accuracy_score(y_test, y_pred))
print("Classification report: \n", classification_report(y_test, y_pred))

# False positives are far higher than before (26,892), so class 1 precision is about 0.005 (134 / 27,026).
# Accuracy drops to about 69% because undersampling discarded most of the majority-class data.
# This is why undersampling is usually not preferred.


Confusion Matrix: 
 [[58406 26892]
 [   11   134]]
Accuracy:  0.6851351193193123
Classification report: 
               precision    recall  f1-score   support

           0       1.00      0.68      0.81     85298
           1       0.00      0.92      0.01       145

    accuracy                           0.69     85443
   macro avg       0.50      0.80      0.41     85443
weighted avg       1.00      0.69      0.81     85443



## 3. Over Sampling

- Random oversampling duplicates randomly chosen rows of the class with fewer observations until it reaches the requested ratio relative to the majority class.

In [25]:
from imblearn.over_sampling import RandomOverSampler

os= RandomOverSampler(sampling_strategy=0.75)   # also checked with 0.5
X_train_ns, y_train_ns= os.fit_resample(X_train,y_train) # creates two new variables

print("The number of classes before fit {}".format(Counter(y_train)))
print("The number of classes after fit {}".format(Counter(y_train_ns)))

# 1s are increased from 347 to about 149K



The number of classes before fit Counter({0: 199017, 1: 347})
The number of classes after fit Counter({0: 199017, 1: 149262})
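The new minority count matches 0.75 × 199,017, truncated to 149,262. The mechanism itself is just sampling with replacement from the minority rows. Below is a pure-Python sketch on toy data; `random_oversample` is a hypothetical helper written for illustration, not the imbalanced-learn implementation.

```python
import random
from collections import Counter

def random_oversample(rows, labels, ratio, seed=0):
    """Duplicate random minority rows until minority/majority reaches `ratio`.

    Hypothetical helper for illustration; assumes binary labels.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    target = int(counts[majority] * ratio)
    pool = [r for r, l in zip(rows, labels) if l == minority]
    # Sample with replacement: the extra rows are exact copies of existing ones.
    extra = rng.choices(pool, k=max(target - counts[minority], 0))
    return rows + extra, labels + [minority] * len(extra)

rows = [[i] for i in range(104)]
labels = [0] * 100 + [1] * 4
new_rows, new_labels = random_oversample(rows, labels, ratio=0.75)
print(Counter(new_labels))
```

Because every added row is a copy, the model sees no new information about the minority class, only a stronger emphasis on the rows it already has.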


In [26]:
from sklearn.ensemble import RandomForestClassifier
classifier= RandomForestClassifier()
classifier.fit(X_train_ns, y_train_ns)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

RandomForestClassifier()

In [27]:
y_pred=classifier.predict(X_test)
print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))
print("Accuracy: " ,accuracy_score(y_test, y_pred))
print("Classification report: \n", classification_report(y_test, y_pred))

# False positives are reduced greatly compared with undersampling. The 0.75 ratio in RandomOverSampler can be changed to tune this.
# The confusion matrix for undersampling was
#[[58406 26892]
# [   11   134]]

Confusion Matrix: 
 [[85289     9]
 [   35   110]]
Accuracy:  0.9994850368081645
Classification report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     85298
           1       0.92      0.76      0.83       145

    accuracy                           1.00     85443
   macro avg       0.96      0.88      0.92     85443
weighted avg       1.00      1.00      1.00     85443



## 4. SMOTE Tomek

- Uses a combination of oversampling (SMOTE) and undersampling (removal of Tomek links).
- Unlike random oversampling, which only duplicates random examples from the minority class, SMOTE generates new examples by interpolating between a minority point and its nearest minority-class neighbors (usually by Euclidean distance), so the generated examples differ from the original ones.
- As new points are created, it takes more time to execute.

### Steps:

1. Choose a random point from the minority class.
2. Find its k nearest minority-class neighbors using Euclidean distance, and pick one of them.
3. Multiply the difference between the two points by a random number between 0 and 1, and add the result to the original point to obtain a synthetic sample.
4. Repeat the procedure until the desired proportion of the minority class is reached.
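The steps above can be sketched in a few lines of pure Python. `smote_sample` is a hypothetical helper for illustration on toy 2-D points; it is not the imbalanced-learn implementation and ignores details such as how many samples to draw per point.

```python
import math
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point by interpolating toward a nearby minority point.

    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random(0)
    # Step 1: pick a random minority point.
    base = rng.choice(minority)
    # Step 2: find its k nearest minority neighbours by Euclidean distance.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    # Step 3: move a random fraction of the way from base toward the neighbour.
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

minority_points = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [5.0, 5.0]]
rng = random.Random(42)
# Step 4: repeat until enough synthetic points exist (5 here, chosen arbitrarily).
synthetic = [smote_sample(minority_points, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Each synthetic point lies on the line segment between two existing minority points, which is why SMOTE output differs from the original rows while staying inside the region they span.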

In [28]:
from imblearn.combine import SMOTETomek


In [29]:
os= SMOTETomek(sampling_strategy=0.75)  
X_train_ns, y_train_ns= os.fit_resample(X_train,y_train) 
print("The number of classes before fit {}".format(Counter(y_train)))
print("The number of classes after fit {}".format(Counter(y_train_ns)))



The number of classes before fit Counter({0: 199017, 1: 347})
The number of classes after fit Counter({0: 198263, 1: 148508})
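Both classes end up 754 rows smaller than plain 0.75 oversampling would give (199,017 vs 198,263 and 149,262 vs 148,508), which is consistent with the Tomek step removing pairs of points. A Tomek link is a pair of points from different classes that are each other's nearest neighbour. The brute-force sketch below finds such pairs on toy data; `tomek_links` is a hypothetical helper, not the library routine.

```python
import math

def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels.

    Brute-force O(n^2) sketch for illustration only.
    """
    def nearest(i):
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    nn = [nearest(i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]

points = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.05, 1.0], [3.0, 3.0]]
labels = [0, 0, 0, 1, 1]
links = tomek_links(points, labels)
print(links)  # [(2, 3)]
```

Points 0 and 1 are mutual nearest neighbours but share a label, so only the pair (2, 3), which straddles the class boundary, counts as a Tomek link.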


In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier()
classifier.fit(X_train_ns,y_train_ns)

In [None]:
y_pred=classifier.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

## 5. Ensemble Technique

- If a decision tree, which splits the data hierarchically, is used, class imbalance is said to have less impact.

In [None]:
from imblearn.ensemble import EasyEnsembleClassifier

In [None]:
easy= EasyEnsembleClassifier()
easy.fit(X_train,y_train)
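`EasyEnsembleClassifier` trains several boosted learners, each on a balanced sample obtained by randomly under-sampling the majority class. The sketch below shows only how such balanced subsets could be drawn; `balanced_subsets` is a hypothetical helper, and the real classifier also fits and combines a model per subset.

```python
import random

def balanced_subsets(labels, n_subsets=3, seed=0):
    """Yield index lists: all minority rows plus an equal-size random majority draw.

    Hypothetical helper; assumes binary labels with 1 as the minority class.
    """
    rng = random.Random(seed)
    minority = [i for i, l in enumerate(labels) if l == 1]
    majority = [i for i, l in enumerate(labels) if l == 0]
    for _ in range(n_subsets):
        yield minority + rng.sample(majority, k=len(minority))

toy_labels = [0] * 95 + [1] * 5
subsets = list(balanced_subsets(toy_labels))
for idx in subsets:
    n_pos = sum(toy_labels[i] for i in idx)
    print(len(idx), n_pos)  # each subset has 10 rows, 5 of them positive
```

Each member sees a different slice of the majority class, so the ensemble as a whole discards less information than a single undersampled model.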

In [None]:
y_pred=easy.predict(X_test)  # store in y_pred so the metrics below score this model
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

- It gives poor results without any hyperparameter tuning.