## Class Imbalance
### Objectives
1. Why class imbalances are a problem
1. Naive oversampling/undersampling
1. Using the `class_weight` parameter (review)
1. Advanced methods (SMOTE and ADASYN)

A classification task in which some classes have far more samples than others is said to suffer from class imbalance.

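This is extremely common in real-world settings, and it complicates both training and evaluation: a model can reach high accuracy simply by predicting the majority class every time, while never identifying the minority class.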
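As a small illustration of why accuracy alone can mislead, the sketch below scores a "model" that always predicts the majority class. The labels are made up for this example (990 negatives, 10 positives) and are not taken from the banknote data used later.

```python
import numpy as np

# Hypothetical labels: 990 samples of class 0 and 10 samples of class 1.
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that ignores its input and always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = np.mean(y_pred == y_true)
# Recall on the minority class: share of true class-1 samples that were found.
minority_recall = np.mean(y_pred[y_true == 1] == 1)

print(f"accuracy: {accuracy:.2f}")                # accuracy: 0.99
print(f"minority recall: {minority_recall:.2f}")  # minority recall: 0.00
```

The accuracy looks excellent even though not a single minority sample is detected, which is why metrics such as recall or precision per class are usually checked as well.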

<img src="http://harishsivasubramanian.com/wp-content/uploads/2016/09/class.png" width=300>

### Solution 1: Cost-Sensitive Learning

Many classifiers in sklearn accept a `class_weight` parameter. It defaults to `None`, which gives every class the same weight.  

It can also be set to `"balanced"`, which weights each class inversely to its frequency using the following formula:  
`n_samples / (n_classes * np.bincount(y))`  
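The formula can be evaluated by hand with NumPy. The sketch below uses a made-up label vector (90 zeros, 10 ones) purely for illustration.

```python
import numpy as np

# Hypothetical labels: 90 samples of class 0 and 10 samples of class 1.
y = np.array([0] * 90 + [1] * 10)

n_samples = len(y)
n_classes = len(np.unique(y))

# One weight per class; np.bincount(y) gives the sample count of each class.
balanced_weights = n_samples / (n_classes * np.bincount(y))

print(balanced_weights)  # roughly 0.556 for class 0 and 5.0 for class 1

# Each class now contributes the same total weight (50 and 50 here).
print(balanced_weights * np.bincount(y))
```

The rare class receives a weight nine times larger than the common class, so both classes carry equal total weight during training.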

Weights can also be set manually and passed in as a dictionary that maps each class label to a weight. The larger the weight, the more the model is penalized for misclassifying samples of that class. For example: 

`class_weight = {0: 0.00001, 1: 0.99999}`
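To make the effect of such a dictionary concrete, here is a minimal sketch of a class-weighted log loss in NumPy. The function `weighted_log_loss` and the toy labels and probabilities are hypothetical and exist only for this illustration; sklearn estimators apply class weights inside their own loss functions, and their exact normalization may differ.

```python
import numpy as np

def weighted_log_loss(y_true, p_pred, class_weight):
    """Mean log loss with each sample scaled by the weight of its true class."""
    y_true = np.asarray(y_true)
    # Clip probabilities so that log(0) can never occur.
    p = np.clip(np.asarray(p_pred, dtype=float), 1e-12, 1 - 1e-12)
    per_sample = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    weights = np.array([class_weight[int(label)] for label in y_true])
    return np.mean(weights * per_sample)

y_true = np.array([0, 0, 0, 1])
# Predicted probability of class 1; the only true class-1 sample is badly missed.
p_pred = np.array([0.1, 0.1, 0.1, 0.1])

uniform = weighted_log_loss(y_true, p_pred, {0: 1.0, 1: 1.0})
upweighted = weighted_log_loss(y_true, p_pred, {0: 1.0, 1: 10.0})

print(f"uniform weights:   {uniform:.3f}")
print(f"class 1 weight 10: {upweighted:.3f}")
```

With the larger weight on class 1, the same missed minority sample dominates the loss, so an optimizer minimizing it is pushed harder toward getting that class right.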


### Solution 2: Naive Over/Undersampling

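Naive oversampling and undersampling are also common ways of dealing with class imbalance. Oversampling randomly duplicates minority-class samples, while undersampling randomly discards majority-class samples, until the classes are closer to balanced. The two techniques are not mutually exclusive and can be used together. 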

<img src="https://i.stack.imgur.com/FEOjd.jpg" width=350>
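Both naive strategies can be sketched with plain NumPy index sampling. The data below is randomly generated for illustration only; the variable names are not from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: 95 samples of class 0 and 5 of class 1.
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)

majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)

# Oversampling: draw minority rows WITH replacement until the classes match.
n_extra = len(majority_idx) - len(minority_idx)
extra = rng.choice(minority_idx, size=n_extra, replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra])
X_over, y_over = X[over_idx], y[over_idx]

# Undersampling: keep a random subset of majority rows, WITHOUT replacement.
kept = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

print(np.bincount(y_over))   # [95 95]
print(np.bincount(y_under))  # [5 5]
```

Oversampling keeps all the original information but repeats minority rows, whereas undersampling throws majority rows away. Resampling is normally applied to the training split only, so that the test set keeps the real class proportions.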

### Advanced Methods: SMOTE & ADASYN

A more advanced approach to dealing with class imbalance is synthetic data generation. The two most popular algorithms for this are SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic Sampling). 

These methods start by finding, for each point in the minority class, its k nearest neighbors within the minority class. They then draw a straight line between the point and one of its neighbors, and pick a random point on that line as a new synthetic sample. 
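The interpolation step described above can be sketched in NumPy. This is a simplified illustration and not the imbalanced-learn implementation; the function name `smote_like_samples` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # a point cannot have more than n - 1 neighbors

    # Pairwise Euclidean distances between all minority points.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its k nearest neighbors.
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]

    # Move a random fraction of the way along the line from base to neighbor.
    gap = rng.random(size=(n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Usage on made-up minority data with two features.
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_like_samples(X_min, n_new=30)
print(X_new.shape)  # (30, 2)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class.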

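ADASYN extends SMOTE by generating more synthetic points around the minority samples that are harder to learn, that is, those with more majority-class points among their nearest neighbors. Some descriptions of ADASYN also add a bit of noise to the points, so they do not lie exactly on the line between the two real minority samples at each end. 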

<img src="https://miro.medium.com/max/1400/1*6UFpLFl59O9e3e38ffTXJQ.png" width=400>

## Implementation & Comparison

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  
%matplotlib inline
from sklearn.model_selection import train_test_split  

In [2]:
bankdata = pd.read_csv('https://raw.githubusercontent.com/matbesancon/BankNotes/master/data_banknote_authentication.txt', header=None)
bankdata.head()

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |


In [3]:
# our data doesn't have a header row, so we add the column names manually
headers = ["Variance", "Skewness", "Curtosis", "Entropy", "Class"]
bankdata.columns = headers

bankdata.head()

|   | Variance | Skewness | Curtosis | Entropy | Class |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |


__Train-Test-Split__

In [4]:
X = bankdata.drop('Class', axis=1)  
y = bankdata['Class'] 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1029, stratify=y)

__Scaling Data__

In [5]:
from sklearn.preprocessing import StandardScaler

In [6]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

__Class Imbalance__

In [7]:
!pip install imbalanced-learn --user



In [8]:
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

In [9]:
# How should we implement