# Better Solving Heavily Unbalanced Classification Problems With KXY

## The Problem

Let us consider a heavily unbalanced binary classification problem (i.e. one where positive outcomes are far rarer than negative ones). Examples include fraud detection, claim prediction, network attack detection, etc.

Let us assume we are given $n$ i.i.d. observations $(y_i, x_i)$, of which only $m \ll n$ have class $y=1$, the other $n-m$ having class $0$. To ease notation, let's assume the first $m$ $x_i$ have class $1$, so that the log-likelihood reads $$\mathcal{LL}(\theta) = \sum_{i=1}^m \log q_\theta(1|x_i) + \sum_{i=m+1}^n \log q_\theta(0|x_i),$$ where the function $q_\theta: (y, x) \to q_\theta(y|x)$, which satisfies $q_\theta(1|x) + q_\theta(0|x) = 1$ for any $x$ and $\theta$, is the probability that input $x$ is associated with class $y$ under the model corresponding to parameters $\theta$.

Because $m \ll n$, the contribution $\sum_{i=1}^m \log q_\theta(1|x_i)$ of positive outcomes to the overall log-likelihood $\mathcal{LL}(\theta)$ will tend to be negligible. 

### Subsampling to avoid false-negatives

Thus, maximizing $\mathcal{LL}(\theta)$ will likely result in a model with *a lot of false-negatives* (in the extreme, a model that assigns every input to class $0$).
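
To make the extreme case concrete, here is a minimal sketch that scores a classifier assigning every input to class $0$. The sizes `n = 100_000` and `m = 100` and the helper name `accuracy_and_recall` are assumptions chosen for illustration only.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall on class 1) for two equal-length 0/1 label lists."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    return correct / len(y_true), true_positives / positives


# Hypothetical sizes: n observations, of which m are positive.
n, m = 100_000, 100
y_true = [1] * m + [0] * (n - m)

# A degenerate model that never predicts the positive class.
y_pred = [0] * n

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, recall={recall:.1f}")  # accuracy=0.999, recall=0.0
```

The accuracy looks excellent while every positive is missed, which is why accuracy alone is a poor guide in this setting.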

This is very problematic in applications like fraud detection, where a single false negative (i.e. a fraudulent transaction that wasn't detected) can be very costly.

A popular workaround to this problem that does not require redefining the loss function is to downsample the $n-m$ negative explanatory variables into a subset $(\tilde{x}_{m+1}, \dots, \tilde{x}_{m+k})$ of size $k$ that is close enough to $m$, and to maximize the new log-likelihood $$\mathcal{LL}_d(\theta) = \sum_{i=1}^m \log q_\theta(1|x_i) + \sum_{i=m+1}^{m+k} \log q_\theta(0|\tilde{x}_i).$$
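
The downsampling step can be sketched as follows, assuming uniform sampling without replacement, which is only one possible scheme. The function name `downsample_negatives` and the returned selection mask are illustrative choices, not part of any library.

```python
import random


def downsample_negatives(negatives, k, seed=None):
    """Draw k negatives uniformly without replacement.

    Returns (subsample, mask), where mask[i] is 1 if negatives[i] was kept
    and 0 otherwise.
    """
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(negatives)), k))
    mask = [1 if i in chosen else 0 for i in range(len(negatives))]
    subsample = [x for x, keep in zip(negatives, mask) if keep]
    return subsample, mask


# Hypothetical example: 1,000 negative inputs, keep 50 of them.
negatives = [[float(i), float(i % 7)] for i in range(1000)]
subsample, mask = downsample_negatives(negatives, k=50, seed=0)
print(len(subsample), sum(mask))  # 50 50
```

Keeping the mask alongside the subsample makes it possible to audit the selection afterwards.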

### Subsampling bias and false-positives
However, because $m \ll n$, the subsampling scheme risks introducing a bias in the distribution of explanatory variables corresponding to negative outcomes (i.e. our subsample might not fully reflect $P_{x|y=0}$).

Whenever the subsampling introduces a bias in the empirical distribution of negative explanatory variables, the risk is that a model trained by maximizing $\mathcal{LL}_d(\theta)$ would never see certain parts of the distribution $P_{x|y=0}$ of negative explanatory variables.

This would result in *an excessive number of false-positives* in production (e.g. too many valid transactions classified as fraudulent).

In addition to being costly (e.g. too many valid transactions classified as fraudulent would require additional staff in the fraud department), an excessive number of false-positives would degrade the customer experience.



## The Solution: Unbiased Subsampling With KXY

To avoid these excess false-positives, it is important to verify that the subsampling scheme did not introduce a bias.

Let $s_i$, for $m < i \leq n$, be the indicator variable taking value $1$ if and only if explanatory variable $x_i$ was selected by the subsampling scheme, and $0$ otherwise.

Fundamentally, the subsampling scheme introduced a bias in the empirical distribution of negative explanatory variables if and only if we may find a binary classifier that can predict $s_i$ solely from knowing $x_i$ better than chance (i.e. with an accuracy higher than $\max \left(1-\frac{k}{n-m}, \frac{k}{n-m} \right)$).
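
As a worked instance of this threshold, take hypothetical sizes $n-m = 99900$ negatives of which $k = 100$ are retained (numbers chosen only for illustration): $$\frac{k}{n-m} = \frac{100}{99900} \approx 0.001, \qquad \max \left(1-\frac{k}{n-m}, \frac{k}{n-m} \right) \approx 0.999.$$ A classifier that always predicts $s_i = 0$ already reaches this accuracy, so only accuracy above it is evidence of bias.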

This is where the `kxy` package comes in. By computing the highest performance achievable when using $(x_{m+1}, \dots, x_n)$ to predict $(s_{m+1}, \dots, s_{n})$, we may quantify the amount of bias the subsampler introduced. 

The subsampler used induced a bias-free partition of our empirical distribution of negative explanatory variables when the highest classification accuracy achievable is as close to $\max \left(1-\frac{k}{n-m}, \frac{k}{n-m} \right)$ as possible. 

The higher the highest achievable accuracy, the higher the bias introduced.
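
The comparison against the chance level can be sketched with a deliberately crude stand-in for the highest achievable accuracy: the best in-sample accuracy of a one-dimensional threshold rule. This is not what the `kxy` package computes, and it only gives a lower bound on the achievable accuracy. The function name `best_stump_accuracy`, the synthetic Gaussian feature, and the sizes are all assumptions made for illustration.

```python
import random


def best_stump_accuracy(x, s):
    """Best in-sample accuracy of any rule 's=1 iff x>t' or 's=1 iff x<=t'.

    Constant predictions are included, so the result is never below the
    chance level max(k/N, 1 - k/N).
    """
    pairs = sorted(zip(x, s))
    n = len(pairs)
    total_ones = sum(s)
    best = max(total_ones, n - total_ones)  # always-0 or always-1 rule
    ones_left = 0
    for i, (xi, si) in enumerate(pairs, start=1):
        ones_left += si
        if i < n and pairs[i][0] == xi:
            continue  # cannot place a threshold between tied values
        zeros_left = i - ones_left
        ones_right = total_ones - ones_left
        zeros_right = (n - i) - ones_right
        best = max(best, zeros_left + ones_right, ones_left + zeros_right)
    return best / n


rng = random.Random(0)
n_neg, k = 2000, 200                      # hypothetical sizes
x = [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
baseline = max(k / n_neg, 1 - k / n_neg)  # chance-level accuracy

# Biased subsampler: keep only the k largest values of x.
cutoff = sorted(x)[-k]
s_biased = [1 if xi >= cutoff else 0 for xi in x]

# Uniform random subsampler.
chosen = set(rng.sample(range(n_neg), k))
s_random = [1 if i in chosen else 0 for i in range(n_neg)]

acc_biased = best_stump_accuracy(x, s_biased)
acc_random = best_stump_accuracy(x, s_random)
print("excess accuracy, biased:", acc_biased - baseline)
print("excess accuracy, random:", acc_random - baseline)
```

The biased selection is perfectly separable by a threshold, so its excess over the chance level is large, while the random selection stays near the chance level. Because the accuracy is measured in-sample, a small positive excess can appear even for an unbiased draw; a held-out evaluation would be the more careful choice.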

It is important to note that, even when a random subsampler is used, a specific instance/split thereof can be biased, so it is important to always check.
