Dealing with Class Imbalance
===

Author: Nathan A. Mahynski

Date: 2023/09/06

Description: What to do when we have imbalanced classes.


When building classifiers, class imbalance in the training set can significantly affect the quality of your final model.  If a set is 80% A and 20% B, then a simple majority classifier (predict everything to be A) will have 80% accuracy.  However, this is neither sensible nor necessarily relevant to the real world if the training balance is skewed relative to what is expected in practice.  Dealing with class imbalance has been, and continues to be, the subject of much research, but there are many existing tools which can handle this problem reasonably well.

However, authentication is not the same as classification and it is important that class imbalance be understood in the proper context when doing authentication.

<h3>Option 1: Choose a better metric</h3>

First, consider using a metric other than accuracy.  Accuracy is misleading when a majority class dominates (as illustrated above); other metrics like precision or recall might be more helpful depending on your application.
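As a minimal sketch of why the metric matters, the 80/20 example from the introduction can be scored by hand in plain Python. The `recall` helper is a hypothetical name used only for this illustration, and balanced accuracy (the mean of the per-class recalls) is an additional metric not discussed above.

```python
# 80 samples of class "A" and 20 of class "B", as in the example above.
y_true = ["A"] * 80 + ["B"] * 20
# A majority classifier predicts "A" for everything.
y_pred = ["A"] * len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the true `label` samples that were predicted as `label`."""
    hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    total = sum(t == label for t in y_true)
    return hits / total


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall_a = recall(y_true, y_pred, "A")
recall_b = recall(y_true, y_pred, "B")
# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (recall_a + recall_b) / 2

print(accuracy, recall_b, balanced_accuracy)  # 0.8 0.0 0.5
```

The accuracy looks respectable, but the recall for class B shows that the minority class is never detected.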

<h3>Option 2: Weight data points by their inverse frequency</h3>

Second, there are class balancing tools available <i>in-situ</i> in many machine learning models. For example, [trees](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decision%20tree#sklearn.tree.DecisionTreeClassifier) and [SVCs](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) can weight the error of misprediction by the inverse of class frequency in the training set.  scikit-learn implements this as the `class_weight="balanced"` option; refer to the documentation for more details.  This is particularly nice since methods based on ensembles of these atomic classifiers, like random forests (RFs), also inherit this built-in capability.  Since RFs are almost always one of the best methods, if not the best, for classifying dense, tabular data, we can often rely on this feature; it can even be treated as a hyperparameter during cross-validation to test its importance and impact automatically.
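To make "inverse frequency" concrete, the sketch below computes per-class weights in plain Python using the heuristic that the scikit-learn documentation gives for the balanced mode, `n_samples / (n_classes * count)`. The function name `balanced_class_weights` is hypothetical and is not part of scikit-learn.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


y = ["A"] * 80 + ["B"] * 20
weights = balanced_class_weights(y)
print(weights)  # {'A': 0.625, 'B': 2.5}

# After weighting, each class contributes the same total weight.
total_a = weights["A"] * 80
total_b = weights["B"] * 20
```

With these weights, a mistake on a B sample costs four times as much as a mistake on an A sample, so both classes carry equal total weight (50 each) during fitting.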

<h3>Option 3: Re-sampling</h3>

Unfortunately, not all models can handle this.  Unsupervised methods (e.g., PCA) in particular ignore class labels, and if one class is heavily sampled the data structure can be biased toward that region of latent space, which usually results in a poor model in production.  As a result, you must resort to other methods to balance the classes; this also allows you to more fairly compare pipelines involving classifiers that cannot automatically balance classes with those that can.  These methods generally involve re-sampling the dataset, but there are a number of different ways to do so.

<h4>Data Duplication</h4>

The simplest approach is to [resample](https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html?highlight=resample) (draw with replacement) the minority classes until all classes have the same number of data points. This is referred to as "oversampling" since it boosts the minority classes. However, the repeated reuse of the data can amplify bias toward these specific observations.  It is also possible to "undersample" the majority class(es) by randomly removing some points.  Generally, the latter is less popular since data is usually very precious and we want to make as much use of it as possible.  It is also common practice to combine the two: shrink the majority class(es) a little and amplify the minority class(es) a little so they meet somewhere in the middle.
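A minimal sketch of random oversampling, written with only the standard library so the mechanics are visible. The function name `random_oversample` and the toy data are assumptions for illustration; in practice a library routine such as the `resample` utility linked above would normally be used.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate minority rows (drawn with replacement) up to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())

    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Draw the missing number of rows with replacement from this class.
        extra = rng.choices(rows, k=target - len(rows))
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


# Toy data: 8 samples of "A" and 2 of "B".
X = [[float(i)] for i in range(10)]
y = ["A"] * 8 + ["B"] * 2
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have 8 samples
```

Note that every added row is an exact copy of an existing minority observation, which is the source of the amplified bias mentioned above.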

<h4>Synthetic Data</h4>

Instead of re-using "real" data, it is also possible to create synthetic data.  There are a number of approaches, but perhaps the most common is the [Synthetic Minority Over-sampling TEchnique (SMOTE)](http://www.jair.org/index.php/jair/article/view/10302); a nice Python library called [imblearn](https://imbalanced-learn.org/stable/index.html) implements this and several alternatives, and has excellent examples and tutorials. SMOTE works by looking at the `k` nearest neighbors of a point (which belong to the same class), selecting one randomly, then choosing a random distance to move along the vector connecting the two, essentially interpolating between them.  There are a number of variants of SMOTE, but the vanilla version is common; it does have a number of issues, though:

1. It is unclear what value of `k` to choose, and also, this imposes a minimum number of examples that must be in a dataset (e.g., `k`=10 won't work if you only have 9 observations of a class).

2. It uses Euclidean distance, which (usually) means features need to be on the same scale to make sense.

3. It results in a "stringy" dataset where synthetic points follow "lines" between real points, which is a bit artificial; moreover, noise is introduced when points of different classes are very near each other, which tends to make it harder to recognize decision boundaries.
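The interpolation step described above can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions, not the `imblearn` implementation: the function name `smote_sample`, its parameters, and the four-point minority class are all hypothetical.

```python
import math
import random


def smote_sample(minority, k=2, n_new=5, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation."""
    if len(minority) <= k:
        # Issue 1 above: each point needs k same-class neighbors.
        raise ValueError("need more than k minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest same-class neighbors of base (Euclidean), excluding base.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbors)
        gap = rng.random()  # random fraction in [0, 1)
        # Move from base toward partner by the fraction `gap`.
        synthetic.append([b + gap * (q - b) for b, q in zip(base, partner)])
    return synthetic


# Minority class sitting on the corners of a unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sample(minority, k=2, n_new=5)
```

With `k=2`, the two nearest neighbors of each corner are the adjacent corners, so every synthetic point lands on an edge of the square. This is the "stringy" behavior noted in the third issue.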

The first point is relatively simple to solve if you treat `k` as a hyperparameter and optimize it with cross-validation.  This means that SMOTE should really be part of your overall data modeling pipeline, not just a single preprocessing step.  With enough data, you can split the data into test/train sets; then, you should balance <b>only the training set</b>.  We wish to train on data that doesn't bias the model fitting, but we need to assess the model using only real-world data.  If you generated synthetic data for testing you could not be sure those points are meaningful.  The test data should always remain imbalanced so the model is evaluated on "real" data only. `imblearn` handles this automatically: its pipeline is a drop-in replacement for `sklearn`'s pipeline that applies re-sampling steps only when fitting, not when predicting.

The second point can be solved by standardizing the data before using SMOTE; this essentially non-dimensionalizes (autoscales) the data and places it all on the same scale.  For example, if people are being measured and you have features of height in millimeters and weight in kilograms, the height feature will have a much larger magnitude.  SMOTE uses Euclidean distance to determine the nearest neighbors, so in this case differences in weight appear very small relative to height variations; the `k` nearest neighbors would really just be the `k` samples with the most similar height.  Thus, the synthetic data would make a minority class appear to have very similar heights, which could very easily confuse a model trained on that data.  The solution is to (1) standardize, then (2) SMOTE resample, then (3) de-standardize so the resulting dataset has synthetic data in its original units, but which has been resampled in a more even fashion across the features.  It is also important to stratify your test/train sets.  A final note is that outliers can affect standardization, so [robust scaling](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html) can be preferable to "standard scaling".
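The effect of scale on the nearest-neighbor search can be illustrated with a small standard-library sketch. The five (height, weight) measurements and all helper names are hypothetical, and a single midpoint interpolation stands in for the SMOTE step.

```python
import math
import statistics

# Hypothetical people: (height in mm, weight in kg).
people = [
    (1700.0, 60.0),  # query person
    (1702.0, 75.0),  # almost the same height, quite different weight
    (1720.0, 61.0),  # slightly taller, almost the same weight
    (1850.0, 80.0),
    (1550.0, 50.0),
]


def fit_scaler(rows):
    """Per-feature mean and (population) standard deviation."""
    cols = list(zip(*rows))
    return [statistics.mean(c) for c in cols], [statistics.pstdev(c) for c in cols]


def standardize(rows, means, stds):
    return [tuple((v - m) / s for v, m, s in zip(r, means, stds)) for r in rows]


def destandardize(rows, means, stds):
    return [tuple(v * s + m for v, m, s in zip(r, means, stds)) for r in rows]


def nearest_to_first(rows):
    """Index of the row closest (Euclidean) to the first row."""
    return min(range(1, len(rows)), key=lambda i: math.dist(rows[0], rows[i]))


means, stds = fit_scaler(people)
scaled = standardize(people, means, stds)

raw_nn = nearest_to_first(people)     # 1: height dominates the distance
scaled_nn = nearest_to_first(scaled)  # 2: both features now contribute

# (2) interpolate halfway in scaled space, then (3) map back to original units.
midpoint_scaled = tuple((a + b) / 2 for a, b in zip(scaled[0], scaled[scaled_nn]))
midpoint = destandardize([midpoint_scaled], means, stds)[0]  # approximately (1710.0, 60.5)
```

In raw units the person with a 2 mm height difference is the nearest neighbor despite being 15 kg heavier; after standardization the neighbor is the person who is similar in both features.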

The third problem can be solved by combining the oversampling with undersampling. [imblearn](https://imbalanced-learn.org/stable/index.html) currently implements two different cleaning methods for this: Tomek's links and edited nearest neighbors (ENN).  While subjective, [SMOTE-ENN](https://imbalanced-learn.org/stable/combine.html#combine) tends to clean up more noisy (re)samples than using Tomek's links.  Refer to their documentation for more information.
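To give a feel for what the cleaning step looks for, here is a small sketch that finds Tomek links, taken here to mean pairs of points from different classes that are each other's nearest neighbor. The function name `tomek_links` and the one-dimensional toy data are hypothetical; this is not how `imblearn` implements it.

```python
import math


def tomek_links(X, y):
    """Return index pairs (i, j) of opposite-class mutual nearest neighbors."""
    def nearest(i):
        return min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )

    nn = [nearest(i) for i in range(len(X))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]


# Points 2 ("A") and 3 ("B") sit right next to each other across the boundary.
X = [[0.0, 0.0], [0.2, 0.0], [1.0, 0.0], [1.1, 0.0], [3.0, 0.0]]
y = ["A", "A", "A", "B", "B"]
links = tomek_links(X, y)
print(links)  # [(2, 3)]
```

Removing one or both members of each link thins out the region where the classes overlap, which is the kind of noise that SMOTE can introduce near a decision boundary.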

<h3>Load the Data</h3>

Using In-Situ Methods
---

Re-sampling
---