# Handling Imbalanced Data

## Introduction

In practical machine learning problems, the classes being predicted often differ greatly in how frequently they occur. For example, when detecting cancer we can expect datasets with a large number of negative outcomes and a relatively small number of positive outcomes.

The overall performance of any model trained on such data will be constrained by its ability to predict rare points. In problems where these rare points are equally or less important than non-rare points, this constraint may only become significant in the later "tuning" stages of building the model. But in problems where the rare points are important, or even the whole point of the classifier (as in the cancer example), dealing with their scarcity is a first-order concern for the model builder.

The relative importance of performance on rare observations should inform your choice of error metric for the problem you are working on; the more important they are, the more your metric should penalize underperformance on them.
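To see why the choice of metric matters, here is a minimal sketch in plain Python. The labels, the 990-to-10 class split, and the always-predict-the-majority "model" are all hypothetical, chosen only to illustrate the point; the helper functions are hand-written for this example, not taken from any library.

```python
# Hypothetical labels: 990 common (0) and 10 rare (1) observations.
y_true = [0] * 990 + [1] * 10
# A degenerate model that always predicts the common class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of observations predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of observations of class `positive` that were found."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    classes = sorted(set(y_true))
    return sum(recall(y_true, y_pred, c) for c in classes) / len(classes)


acc = accuracy(y_true, y_pred)            # 0.99
rec = recall(y_true, y_pred)              # 0.0
bal = balanced_accuracy(y_true, y_pred)   # 0.5
print(acc, rec, bal)
```

Accuracy rewards this model even though it never finds a rare point, while recall on the rare class and balanced accuracy both expose the failure.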

Several techniques exist in practice for dealing with imbalanced datasets. The most naive class of techniques is `sampling`: changing the data presented to the model by undersampling common classes, oversampling (duplicating) rare classes, or both.
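As a rough illustration of both ideas, the sketch below implements random oversampling and random undersampling using only the Python standard library. The function names `random_oversample` and `random_undersample`, the `seed` parameter, and the toy data are assumptions made for this example; libraries such as imbalanced-learn provide their own, more complete implementations.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Collect the rows of X under their class label."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Draw the missing rows with replacement from the existing ones.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    target = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Sample without replacement so no row appears twice.
        X_out.extend(rng.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out


# Hypothetical toy data: 95 common rows and 5 rare rows.
X = [[i] for i in range(100)]
y = [0] * 95 + [1] * 5

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over))    # both classes now have 95 rows
print(Counter(y_under))   # both classes now have 5 rows
```

Note that the outputs are grouped by class, so they should be shuffled before being fed to any model that is sensitive to row order. Resampling should also be applied to the training split only, otherwise duplicated points can leak into the evaluation data.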

## Learning Curve

In the context of the `bias-variance tradeoff`, what we hope to achieve by resampling data is to reduce `bias`, or `underfit` (recall the model from the illustrative example that effectively treats one class as the only class), by more than we increase variance, or overfit (which goes up when we decrease the number of input observations or copy-paste points). A way of quantifying this hope is to look at a learning curve.
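A learning curve records a model's score on its training data and on held-out data as the training set grows. The sketch below computes one by hand with the standard library only. Everything in it is an assumption made for illustration: the synthetic one-dimensional data, the 10% rare-class rate, the nearest-centroid "model", and the use of balanced accuracy as the score.

```python
import random


def make_data(n, pos_rate, rng):
    """Hypothetical data: common class ~ N(0, 1), rare class ~ N(2, 1)."""
    data = []
    for _ in range(n):
        label = 1 if rng.random() < pos_rate else 0
        data.append((rng.gauss(2.0 * label, 1.0), label))
    return data


def fit_centroids(data):
    """Nearest-centroid model: the mean feature value of each class seen."""
    sums, counts = {}, {}
    for x, label in data:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def balanced_accuracy(centroids, data):
    """Mean per-class recall over the classes present in `data`."""
    recalls = []
    for label in (0, 1):
        members = [x for x, l in data if l == label]
        if members:
            hits = sum(predict(centroids, x) == label for x in members)
            recalls.append(hits / len(members))
    return sum(recalls) / len(recalls)


rng = random.Random(0)
train = make_data(2000, 0.1, rng)
valid = make_data(1000, 0.1, rng)

# One (size, training score, validation score) triple per training size.
curve = []
for n in (20, 50, 100, 500, 2000):
    subset = train[:n]
    model = fit_centroids(subset)
    curve.append((n, balanced_accuracy(model, subset),
                  balanced_accuracy(model, valid)))

for n, train_score, valid_score in curve:
    print(f"n={n:5d}  train={train_score:.3f}  valid={valid_score:.3f}")
```

A large gap between the two scores at small sizes suggests variance, while two scores that converge at a low value suggest bias. Running the same loop on a resampled training set and comparing the curves is one way to check whether resampling helped more than it hurt.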

## Resources

- [imbalanced-learn documentation](https://imbalanced-learn.org/en/stable/index.html)

- [Undersampling and oversampling imbalanced data](https://www.kaggle.com/residentmario/undersampling-and-oversampling-imbalanced-data/notebook)