Working with imbalanced data is normal in the real world. Some cases can be handled by choosing the right metric or by making sure the class distribution is the same in the training and testing sets. Otherwise, there are two methods to reduce imbalance in machine learning: upsampling (over-sampling) and downsampling (under-sampling).
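
To see why the choice of metric matters, here is a minimal sketch on a hypothetical toy dataset (the numbers are made up for illustration): 95 negatives, 5 positives, and a model that always predicts the majority class.

```python
# Hypothetical toy labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: fraction of true positives that the model actually finds.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(accuracy, recall)  # 0.95 0.0
```

Accuracy looks excellent while recall on the minority class is zero, which is why accuracy alone is a poor guide on imbalanced data.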

# 1. Over-sampling 

Over-sampling generates new observations in the minority class. This method should be used only when the dataset is small, because it carries risks of data leakage and degraded model performance. If over-sampling is applied before splitting, the same data point can show up in both the training and testing sets, so always do it after splitting the data into training and validation folds. Over-sampling can also hurt model performance because the model is trained at an operating point (class balance) that is a bit different from real-world data.

## 1.1. Random

As simple as its name, the random method duplicates observations in the minority class by sampling with replacement.
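
A minimal sketch of random over-sampling in plain Python, assuming a binary problem and that `X_train`, `y_train` are already the training fold (so nothing leaks into validation). The function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, minority_label, seed=0):
    """Duplicate minority rows (sampling with replacement) until both classes match."""
    rng = random.Random(seed)
    majority_count = max(Counter(y).values())
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_extra = majority_count - len(minority_idx)
    # Draw indices of existing minority rows, with replacement.
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    X_res = list(X) + [X[i] for i in extra]
    y_res = list(y) + [y[i] for i in extra]
    return X_res, y_res


# Toy training fold: 8 majority rows (label 0) and 2 minority rows (label 1).
X_train = [[float(i)] for i in range(10)]
y_train = [0] * 8 + [1] * 2

X_res, y_res = random_oversample(X_train, y_train, minority_label=1)
print(Counter(y_res))  # both classes now have 8 rows
```

Every added row is an exact copy of an existing minority row, which is why this method adds no new information and can encourage overfitting.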

## 1.2. Synthetic Minority  

[Synthetic Minority Oversampling Technique (SMOTE)]() uses an algorithm to generate new data points. It works by selecting points that are close in the feature space, drawing a line between them, and drawing a new sample at a point along that line. Let $x_i$ and $x_j$ be points in the minority class, where $x_j$ is one of the $k$ nearest neighbors of $x_i$. The new sample $x_n$ is computed by:

$$x_n = x_i + \lambda (x_j-x_i)$$

where $\lambda$ is a random number in the range $[0,1]$. There are some variants of SMOTE:
- [Borderline SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.BorderlineSMOTE.html#imblearn.over_sampling.BorderlineSMOTE): In some cases, minority instances lie so close to the majority class that they can be misclassified. These instances are called borderline points: points with more than half of their neighbors in the majority class. To improve prediction, we need to fill in the *border*, so the algorithm synthesizes new points from the borderline points instead of the others
- [SVM SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SVMSMOTE.html#imblearn.over_sampling.SVMSMOTE): Similar to Borderline SMOTE, SVM SMOTE creates new points based on support vectors, but a new point can lie either inside or outside the segment between a support vector and its neighbor. If fewer than half of the $k$ nearest neighbors of a point are in the majority class, the new points are created outside the segment to expand the minority class area.
- [Kmeans SMOTE](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.KMeansSMOTE.html#imblearn.over_sampling.KMeansSMOTE): Uses k-means to cluster the dataset before over-sampling; any cluster with at least 50% minority instances is chosen for generating new samples.
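
The basic SMOTE interpolation formula $x_n = x_i + \lambda (x_j-x_i)$ can be sketched in plain Python as follows. This is an illustration under simple assumptions (Euclidean distance, brute-force neighbor search, a hypothetical function name `smote_sample`), not the imbalanced-learn implementation.

```python
import math
import random


def smote_sample(minority, k=2, n_new=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    new_points = []
    for _ in range(n_new):
        # Pick a random minority point x_i.
        i = rng.randrange(len(minority))
        x_i = minority[i]
        # Find the k nearest minority neighbors of x_i (excluding x_i itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x_i, p))[:k]
        # Pick one neighbor x_j and a random lambda (random() draws from [0, 1)).
        x_j = rng.choice(neighbors)
        lam = rng.random()
        # x_n = x_i + lambda * (x_j - x_i), applied per coordinate.
        new_points.append([a + lam * (b - a) for a, b in zip(x_i, x_j)])
    return new_points


# Toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sample(minority, k=2, n_new=5)
```

Because each new point lies on a segment between two existing minority points, all synthetic samples here stay inside the unit square.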

## 1.3. Adaptive Synthetic
Adaptive Synthetic Sampling Approach (ADASYN) works similarly to regular SMOTE. However, the number of samples generated for each $x_i$ is proportional to the number of samples in a given neighborhood that are not from the same class as $x_i$. The algorithm generates more new samples where the density of the minority class is low and fewer or none where the density is high. This improves the learning process for the minority class.

Let $n_s$ be the number of samples in the minority class and $n_l$ the number of samples in the majority class.
- Compute the total number of samples to generate: $G = (n_l - n_s) \times \beta$ with $\beta$ in the range $(0,1]$. If $\beta=1$, the two classes end up with the same number of samples
- For each minority sample $x_i$, find its $k$ nearest neighbors and put them into the set $S_i$
- Find $\delta_i$, the number of samples among the $k$ nearest neighbors that belong to the majority class. Calculate the ratio $r_i = \frac{\delta_i}{k}$
- Normalize $r_i$ according to $$\hat{r}_i = \frac{r_i}{\sum_{j=1}^{n_s} r_j}$$
- Calculate the number of samples to generate for each $x_i$: $g_i = \hat{r}_i\times G$
- For each $x_i$, randomly pick a minority neighbor $x_j$ from $S_i$, $g_i$ times, and generate new samples following the formula:
$$x_n = x_i + \lambda\times (x_j - x_i)$$
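
The allocation part of the steps above (how many synthetic samples each minority point receives) can be sketched in plain Python. The function name `adasyn_allocation`, the rounding of $g_i$ to an integer, and the toy data are assumptions made for this illustration.

```python
import math


def adasyn_allocation(X, y, minority_label, k=2, beta=1.0):
    """Return {minority index: number of synthetic samples to generate}."""
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_s = len(minority_idx)
    n_l = len(y) - n_s
    G = (n_l - n_s) * beta  # total number of samples to generate

    ratios = []
    for i in minority_idx:
        # k nearest neighbors of x_i among all other samples (both classes).
        order = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )
        # delta_i: how many of those neighbors belong to the majority class.
        delta_i = sum(1 for j in order[:k] if y[j] != minority_label)
        ratios.append(delta_i / k)

    total = sum(ratios)
    if total == 0:
        # No minority point has majority neighbors: nothing to allocate.
        return {i: 0 for i in minority_idx}
    # Normalize the ratios and scale by G (rounded, so the sum may differ from G).
    return {i: round(r / total * G) for i, r in zip(minority_idx, ratios)}


# Toy 1-D data: three clustered minority points and one isolated at 3.0.
X = [[0.0], [0.1], [0.2], [3.0],
     [2.9], [3.1], [5.0], [6.0], [7.0], [8.0], [9.0], [10.0]]
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

allocation = adasyn_allocation(X, y, minority_label=1, k=2)
print(allocation)  # {0: 0, 1: 0, 2: 0, 3: 4}
```

The three clustered minority points have only minority neighbors and receive nothing, while the isolated point surrounded by majority samples receives all four synthetic samples.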


# 2. Under-sampling 

In contrast to over-sampling, under-sampling reduces the number of samples in the majority class. It is simple and suitable for large datasets, but the risk is that the most informative samples may be dropped, which leads to poor model predictions.

## 2.1. Random 

## 2.2. Near miss
[Near Miss](https://imbalanced-learn.org/stable/under_sampling.html#mathematical-formulation) chooses which majority samples to keep based on their average distance to $N$ minority samples:
- NearMiss-1: keeps the majority samples with the smallest average distance to their $N$ closest minority samples
- NearMiss-2: keeps the majority samples with the smallest average distance to their $N$ farthest minority samples
- NearMiss-3: keeps, for each sample in the minority class, its closest majority samples.

NearMiss-3 seems to be the most accurate version, because it only keeps the majority-class points that are near the decision boundary.
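
As an illustration of the selection rule, here is a minimal sketch of NearMiss-1 in plain Python. The function name `near_miss_1`, its parameters, and the toy data are hypothetical; distances are Euclidean.

```python
import math


def near_miss_1(majority, minority, n_keep, n_neighbors=3):
    """Keep the n_keep majority samples closest (on average) to their nearest minority samples."""

    def avg_dist_to_closest(x):
        # Average distance from x to its n_neighbors closest minority samples.
        closest = sorted(math.dist(x, m) for m in minority)[:n_neighbors]
        return sum(closest) / len(closest)

    return sorted(majority, key=avg_dist_to_closest)[:n_keep]


# Toy 1-D data: two minority points and four majority points.
minority = [[0.0], [1.0]]
majority = [[0.5], [2.0], [10.0], [20.0]]

kept = near_miss_1(majority, minority, n_keep=2, n_neighbors=2)
print(kept)  # [[0.5], [2.0]]
```

The two majority points far from the minority class are dropped, so the kept samples concentrate around the region where the classes meet.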

## 2.3. Tomek links 
A [Tomek link](https://imbalanced-learn.org/stable/under_sampling.html#tomek-s-links) exists if two samples from different classes are each other's nearest neighbor. The algorithm then removes the majority-class point of the link, or both points. This removes the majority-class points that are easy to misclassify.
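
A minimal sketch of Tomek link detection in plain Python, assuming Euclidean distance and a brute-force nearest-neighbor search. The helper names and toy data are hypothetical.

```python
import math


def nearest_index(X, i):
    """Index of the nearest neighbor of X[i], excluding i itself."""
    return min(
        (j for j in range(len(X)) if j != i),
        key=lambda j: math.dist(X[i], X[j]),
    )


def tomek_links(X, y):
    """Pairs (i, j) from different classes that are each other's nearest neighbor."""
    links = []
    for i in range(len(X)):
        j = nearest_index(X, i)
        # i < j avoids reporting the same pair twice.
        if i < j and y[i] != y[j] and nearest_index(X, j) == i:
            links.append((i, j))
    return links


def remove_majority_in_links(X, y, majority_label):
    """Drop only the majority-class member of each Tomek link."""
    drop = {
        idx
        for pair in tomek_links(X, y)
        for idx in pair
        if y[idx] == majority_label
    }
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy 1-D data: the points at 2.0 (majority) and 2.2 (minority) form a link.
X = [[0.0], [1.5], [2.0], [2.2], [5.0]]
y = [0, 0, 0, 1, 1]

print(tomek_links(X, y))  # [(2, 3)]
X_clean, y_clean = remove_majority_in_links(X, y, majority_label=0)
```

Only the majority point at 2.0 is removed here; dropping both members of each link would be the stricter variant.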


## 2.4. Condensed Nearest Neighbors
Condensed Nearest Neighbors (CNN) uses the 1-NN rule to decide whether a sample should be removed. First, put all minority samples in a set C and all other samples in a set S. Then go through set S sample by sample and classify each sample with the 1-NN rule using C. If the sample is misclassified, add it to C; otherwise do nothing. Repeat these steps until no more samples are added. Set C is then used for classification instead of the whole dataset. This method lowers execution time and space complexity, but it is sensitive to noise.
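
The procedure can be sketched in plain Python as follows. This is an illustration assuming Euclidean distance and the initialization described above (C starts with all minority samples); the function name and toy data are hypothetical.

```python
import math


def condensed_nearest_neighbors(X, y, minority_label):
    """Return the sorted indices of the condensed set C."""
    C = [i for i, label in enumerate(y) if label == minority_label]
    S = [i for i, label in enumerate(y) if label != minority_label]

    added = True
    while added:  # repeat passes over S until nothing is added to C
        added = False
        for i in list(S):
            # 1-NN rule: classify X[i] with its nearest neighbor in C.
            nearest = min(C, key=lambda j: math.dist(X[i], X[j]))
            if y[nearest] != y[i]:
                # Misclassified: move the sample from S to C.
                C.append(i)
                S.remove(i)
                added = True
    return sorted(C)


# Toy 1-D data: two minority points and four majority points.
X = [[0.0], [0.2], [3.0], [3.1], [3.2], [6.0]]
y = [1, 1, 0, 0, 0, 0]

kept = condensed_nearest_neighbors(X, y, minority_label=1)
print(kept)  # [0, 1, 2]
```

Once the majority point at 3.0 is in C, the remaining majority points are classified correctly by it and are left out, so a single majority sample represents the whole cluster.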

## 2.5. Edited Nearest Neighbors

[Edited Nearest Neighbors (ENN)](https://imbalanced-learn.org/stable/under_sampling.html#edited-data-set-using-nearest-neighbours) uses a nearest-neighbor algorithm to edit the dataset by removing samples that do not agree with their neighbors. Assume the instance $p$ has 3 nearest neighbors. If $p$ is in the majority class but is misclassified by these 3 neighbors, then $p$ is dropped. If $p$ is in the minority class and its 3 neighbors are in the majority class, then these neighbors of $p$ are removed.

In the [ENN](https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.EditedNearestNeighbours.html#imblearn.under_sampling.EditedNearestNeighbours) function, the `kind_sel` parameter controls how strict the removal is: with `kind_sel='mode'` a sample is dropped when the majority vote of its neighbors disagrees with its class, and with `kind_sel='all'` a sample is kept only if all of its neighbors share its class. Note that `kind_sel='all'` removes more samples from the dataset.
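
To make the difference between the two selection strategies concrete, here is a minimal sketch in plain Python. It is not the imbalanced-learn implementation: it assumes that only majority samples are candidates for removal, that `'all'` keeps a sample only when every neighbor shares its class, and that `'mode'` keeps it when the majority vote of the neighbors matches its class. The function name and toy data are hypothetical.

```python
import math
from collections import Counter


def edited_nearest_neighbors(X, y, majority_label, k=3, kind_sel="all"):
    """Return the indices of the samples that survive the editing step."""
    keep = []
    for i in range(len(X)):
        if y[i] != majority_label:
            keep.append(i)  # minority samples are never removed in this sketch
            continue
        # Labels of the k nearest neighbors of X[i] (excluding itself).
        order = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )
        labels = [y[j] for j in order[:k]]
        if kind_sel == "all":
            agrees = all(label == y[i] for label in labels)
        else:  # "mode": majority vote of the neighbors
            agrees = Counter(labels).most_common(1)[0][0] == y[i]
        if agrees:
            keep.append(i)
    return keep


# Toy 1-D data: five majority points (0) and three minority points (1).
X = [[0.0], [0.9], [2.0], [3.0], [5.0], [3.4], [5.2], [5.4]]
y = [0, 0, 0, 0, 0, 1, 1, 1]

print(edited_nearest_neighbors(X, y, majority_label=0, kind_sel="all"))   # [0, 1, 5, 6, 7]
print(edited_nearest_neighbors(X, y, majority_label=0, kind_sel="mode"))  # [0, 1, 2, 3, 5, 6, 7]
```

With `'all'`, the majority points at 2.0 and 3.0 are removed because one of their three neighbors is a minority sample; with `'mode'` they survive, and only the point at 5.0 (surrounded by minority samples) is dropped.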

# 3. Combination
Tomek links and ENN are actually two cleaning methods. Cleaning methods do not let you specify the number of samples to keep in each class, so they are added to the pipeline after over-sampling to drop noisy samples from the feature space.

## 3.1. SMOTE-Tomek 

## 3.2. SMOTE-ENN 

# 4. Ensemble 
In sklearn's ensemble classifiers, bagging and boosting methods build several estimators on different subsets of the data. However, these classifiers do not balance each subset. Therefore, the imbalanced-learn (`imblearn`) package introduces bagging and boosting classifiers that integrate `sampling_strategy` into the classifier. The limitation of this approach is that it only uses ensemble classifiers from sklearn.
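
The idea of balancing each subset before fitting an estimator can be sketched in plain Python. This only illustrates the resampling step, not how any library implements it: each bag under-samples every class to the size of the smallest class. The function name `balanced_bags` and the toy labels are hypothetical.

```python
import random


def balanced_bags(y, n_bags=3, seed=0):
    """Build n_bags index subsets in which every class has the same number of samples."""
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())

    bags = []
    for _ in range(n_bags):
        bag = []
        for idx in by_class.values():
            # Draw n_min indices from each class, without replacement.
            bag.extend(rng.sample(idx, n_min))
        bags.append(bag)
    return bags


# Toy labels: 8 majority samples (0) and 2 minority samples (1).
y = [0] * 8 + [1] * 2
bags = balanced_bags(y, n_bags=3)
```

Each bag would then be used to fit one base estimator, so every estimator sees a balanced subset while the ensemble as a whole still covers different parts of the majority class.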