## Class imbalance

### Key concepts

| <p style="font-size: 15px">concept</p>      | <p style="font-size: 15px">description</p>  | 
| ----------- | ----------- |
| <p style="font-size: 15px">class imbalance</p>      | <p style="font-size: 15px">at least one class has far more/less observations than other classes</p>       | 
| <p style="font-size: 15px">under-sampling</p>      | <p style="font-size: 15px">balance the *training* set by reducing the number of observations in the majority class</p>       | 
| <p style="font-size: 15px">over-sampling</p>      | <p style="font-size: 15px">balance the *training* set by increasing the number of observations in the minority class</p>       | 
| <p style="font-size: 15px">SMOTE</p>      | <p style="font-size: 15px">a popular technique for over-sampling that generates new samples by interpolation</p>       | 
| <p style="font-size: 15px">weighting</p>      | <p style="font-size: 15px">assign higher weights to observations from minority class during training</p>       | 
| <p style="font-size: 15px">precision-recall curve</p>      | <p style="font-size: 15px">plots precision against recall for a range of threshold values</p>       | 
| <p style="font-size: 15px">threshold value</p>      | <p style="font-size: 15px">assign an observation to class `1` if the predicted score is larger than the threshold</p>       | 

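As a minimal sketch of how the threshold value and the precision-recall curve relate, the following dependency-free Python computes precision and recall for a few thresholds. The labels, the scores, and the helper `precision_recall_at` are made up for illustration. Returning a precision of 1.0 when nothing is predicted positive is a convention chosen here.

```python
def precision_recall_at(y_true, scores, threshold):
    """Return (precision, recall) when class 1 is predicted for scores above the threshold."""
    y_pred = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Convention (assumption): precision is 1.0 if no sample is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical toy data: 3 positives among 10 observations, with predicted scores.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6, 0.7, 0.9]

# One (precision, recall) point of the curve per threshold.
curve = {t: precision_recall_at(y_true, scores, t) for t in (0.25, 0.5, 0.75)}
for threshold, (precision, recall) in curve.items():
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Raising the threshold trades recall for precision: in this toy data, precision rises from 3/7 to 1.0 while recall drops from 1.0 to 1/3. In practice, `precision_recall_curve` from `sklearn.metrics` computes these values for all candidate thresholds.
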
### Intro
For a supervised classification problem, **class imbalance** means that the classes do not have a similar number of samples (observations). Often, one class has only very few samples. This negatively affects the training phase of many machine learning algorithms. In the end, classifiers become biased towards the majority class: they favor the class with more samples during prediction.

Note that for many real-world problems, the issue is not the class imbalance itself but the fact that we often have only very few samples in the **minority class**. This makes it very difficult for any ML algorithm to properly distinguish between the classes. The **decision boundaries** estimated by a classifier become biased and do not properly separate the classes from each other.

In practice, we have several options at hand to tackle the imbalance problem. They can be categorized into methods that are applied **before**, **during**, or **after** the training phase.

<span style="background-color:orange">**-> Change your perspective!**</span>
If you are faced with severe class imbalance and very few samples in the minority class, ordinary supervised classification may not be the right choice for your problem. Instead, have a look at methods for **outlier or anomaly detection**.

### Pre-training strategies: Sampling
The goal of sampling is to draw a balanced sample of the **training** dataset in which each class has roughly the same number of observations. We apply sampling only to the training data, **not** to the test data, because in the end we have to evaluate our classifier under real-world conditions.
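
The order of operations can be sketched as follows: split first, then resample only the training part. This is a dependency-free illustration, not a library API. The toy data and the helpers `stratified_split` and `random_oversample` are hypothetical names introduced here.

```python
import random
from collections import Counter

rng = random.Random(0)

# Hypothetical toy data: 100 samples, 10 of them in the minority class (label 1).
data = [([rng.random()], 1 if i < 10 else 0) for i in range(100)]


def group_by_class(samples):
    """Collect (features, label) pairs into one list per label."""
    groups = {}
    for features, label in samples:
        groups.setdefault(label, []).append((features, label))
    return groups


def stratified_split(samples, test_fraction, rng):
    """Split per class so that train and test keep the original class ratio."""
    train, test = [], []
    for group in group_by_class(samples).values():
        group = group[:]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def random_oversample(samples, rng):
    """Duplicate random samples of the smaller classes until all classes match the largest."""
    groups = group_by_class(samples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# 1) Split first, 2) resample the training part only.
train, test = stratified_split(data, test_fraction=0.2, rng=rng)
train_balanced = random_oversample(train, rng)

print("train (balanced):", Counter(label for _, label in train_balanced))
print("test (untouched):", Counter(label for _, label in test))
```

The test set keeps the original 9:1 ratio, so any metric computed on it reflects the conditions the classifier will actually face.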

#### Oversampling
Random oversampling draws a random sample with replacement from the minority class until the dataset becomes balanced. We can then fit a classifier on the resampled data:

```
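from imblearn.over_sampling import RandomOverSampler

# Resample the training data only; the test set keeps its original class ratio.
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

# clf is any scikit-learn style classifier created beforehand.
clf.fit(X_resampled, y_resampled)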
```

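Other oversampling strategies include `SMOTE` (Synthetic Minority Oversampling Technique) and `ADASYN`, which generate synthetic minority samples instead of duplicating existing ones.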
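
To illustrate the interpolation idea behind SMOTE, here is a simplified, dependency-free sketch. It picks a minority sample `x`, chooses one of its `k` nearest minority neighbours `x_nn`, and creates `x + lam * (x_nn - x)` with a random `lam` between 0 and 1. The function name `smote_sketch` and the toy points are assumptions made for this example. For real work, use the `SMOTE` class from `imblearn.over_sampling`.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (Euclidean distance), excluding x itself.
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        x_nn = rng.choice(neighbours)
        lam = rng.random()  # position on the line segment between x and x_nn
        synthetic.append([a + lam * (b - a) for a, b in zip(x, x_nn)])
    return synthetic


# Hypothetical minority-class points in two dimensions.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6)
for point in new_points:
    print([round(value, 3) for value in point])
```

Every synthetic point lies on a line segment between two existing minority samples, so the new samples stay inside the region already covered by the minority class.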

#### Undersampling
Undersampling removes observations from the majority class until the dataset becomes balanced. This strategy should only be applied in **data-rich** situations: as a rule, avoid throwing away data, since it may contain valuable information!
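
A minimal sketch of random undersampling, written without external dependencies. The helper `random_undersample` and the toy data are hypothetical. `RandomUnderSampler` from `imblearn.under_sampling` offers the same idea with the `fit_resample` interface used above.

```python
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, as large as the smallest class."""
    rng = random.Random(seed)
    indices_by_class = {}
    for i, label in enumerate(y):
        indices_by_class.setdefault(label, []).append(i)
    n_min = min(len(indices) for indices in indices_by_class.values())
    keep = []
    for indices in indices_by_class.values():
        keep.extend(rng.sample(indices, n_min))  # sample without replacement
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical training data: 3 minority samples (label 1), 17 majority samples (label 0).
X_train = [[i] for i in range(20)]
y_train = [1] * 3 + [0] * 17

X_under, y_under = random_undersample(X_train, y_train)
print(Counter(y_under))
```

In this toy example, 14 of the 17 majority samples are discarded, which shows why the strategy is costly when data is scarce.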

### What is it?
At least one of the classes in a classification problem is strongly over- or underrepresented. (We are interested in the labels, `y`.)

### Why do we care?
If there is significant class imbalance, the model has little incentive to learn anything about the minority class: it can reach a low training error by mostly predicting the majority class.

**Example: credit card fraud**
- most of the transactions will not be fraudulent
- only a tiny fraction of the transactions will actually be fraudulent
- the model therefore has a strong incentive to classify every observation as non-fraudulent. The model is punished for every wrong classification. If almost all observations belong to one class, the model can avoid most of the punishment by leaning strongly towards the majority class
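
The effect can be made concrete with a few lines of Python. The numbers below are invented for illustration (10,000 transactions, 50 of them fraudulent), and the "model" simply predicts the majority class for everything.

```python
# Hypothetical numbers: 10,000 transactions, 50 of them fraudulent (label 1).
y_true = [1] * 50 + [0] * 9950

# A "model" that always predicts non-fraudulent (label 0).
y_pred = [0] * len(y_true)

# Accuracy: share of correctly classified transactions.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of fraudulent transactions that were actually detected.
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = detected / sum(y_true)

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

Accuracy is 99.5% although not a single fraudulent transaction is found, which is why accuracy alone is a misleading metric for imbalanced data.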

### What can we do about it?
- collect more data? --> generally does not fix the class imbalance itself
- add weights to the classes: punish a misclassification of the minority class more strongly than a misclassification of the majority class (some scikit-learn models have a `class_weight` parameter for this)
- drop some data points from the majority class --> undersampling
- create data points for the minority class --> oversampling
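
As a sketch of the weighting option, the snippet below computes per-class weights with the heuristic `n_samples / (n_classes * count_of_class)`, which is the one scikit-learn documents for `class_weight="balanced"`. The helper `balanced_class_weights` and the label vector are hypothetical and only serve the illustration.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class inversely proportional to its frequency."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical labels: 90 majority samples (0) and 10 minority samples (1).
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)

# Total weight per class: with these weights both classes contribute equally.
total_weight = {label: weights[label] * count for label, count in Counter(y).items()}

print(weights)
print(total_weight)
```

Here a misclassified minority sample costs nine times as much as a misclassified majority sample, so both classes carry the same total weight during training. With scikit-learn, passing `class_weight="balanced"` to a model that supports the parameter applies this weighting without resampling the data.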

### Undersampling
- at random
- throw out data points that are far away from the minority class

### Oversampling
- at random (randomly duplicate data points from the minority class)
- SMOTE (generate synthetic minority samples by interpolation)