# Addressing class imbalance 

This notebook examines a Random Forest approach on skewed data. The point is to explore how class imbalance leads to undesirable classification performance and how we can use non-unit weighting to address this.

### The fraudulent transaction dataset and costs

The dataset used here is highly unbalanced. The relevant class is whether the transaction is fraudulent (class 1) or not (class 0). In the fraud setting in particular, each undetected fraudulent transaction carries an average cost, e.g., due to direct monetary loss or indirect reputational loss. On the other hand, a cost may also be associated with false positives, e.g., because a human must manually examine the transaction.

For this reason, it does not make sense to just accept an arbitrary tradeoff dictated by a classifier. An acceptable tradeoff would balance expectations about costs of both false positives and false negatives.
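To make the cost argument concrete, here is a minimal sketch of how two operating points could be compared once per-error costs are fixed. The cost values and the confusion counts below are hypothetical, chosen only for illustration; they are not taken from the dataset.

```python
def expected_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of an outcome, given fixed per-error costs."""
    return fp * cost_fp + fn * cost_fn

# Hypothetical costs: a manual review is cheap, a missed fraud is expensive.
COST_FP = 5.0
COST_FN = 500.0

# Two hypothetical operating points of a classifier.
few_alerts = expected_cost(fp=10, fn=40, cost_fp=COST_FP, cost_fn=COST_FN)
many_alerts = expected_cost(fp=800, fn=30, cost_fp=COST_FP, cost_fn=COST_FN)

print("few alerts: ", few_alerts)
print("many alerts:", many_alerts)
```

Under these assumed costs the operating point with many more false positives is the cheaper one; with a lower cost per missed fraud (say 100 instead of 500) the ranking flips, which is why the tradeoff cannot be chosen without cost expectations.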

### Striking a balance between false negatives and false positives

Class imbalance leads to undesirable classification performance, in particular favouritism of the majority class. We explore whether we can (indirectly) adapt the loss function to create a more appropriate balance between the false positive rate and the false negative rate, rather than just minimizing the total number of misclassifications.

While Random Forests do not have an explicit loss function, we implicitly have one due to the way we split nodes in each decision tree. To simplify the argument, let's consider a single decision tree where the Gini impurity is the splitting criterion (this is the default in scikit-learn) and assume that we have only two classes:
 
 $$I_{G} = 1 - (p_0^2  + p_1^2)$$ 
 
 With weights, we essentially end up with something like

  $$I_{G} = 1 - (w_0 p_0^2  + w_1 p_1^2)$$ 
  
 (possibly with some normalization of the weights to make Gini more interpretable)
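One concrete reading of the weighted impurity can be sketched as follows. It assumes that class weights act as per-sample weights, so that each $p_j$ becomes a weighted class fraction; this is our understanding of how weighted node splits behave, and it is one possible normalization rather than a literal transcription of the expression above. The helper `weighted_gini` is a hypothetical name introduced for illustration.

```python
def weighted_gini(counts, weights):
    """Gini impurity of a node where class j has counts[j] samples,
    each carrying weight weights[j] (assumed normalization)."""
    weighted = [w * n for w, n in zip(weights, counts)]
    total = sum(weighted)
    return 1.0 - sum((c / total) ** 2 for c in weighted)

# A node with 99 negatives and 1 positive.
node_counts = [99, 1]

# Unit weights: the node looks almost pure.
unweighted = weighted_gini(node_counts, [1.0, 1.0])

# Up-weighting the single positive by 99 makes the node look maximally mixed.
upweighted = weighted_gini(node_counts, [1.0, 99.0])

print(unweighted, upweighted)
```

With unit weights the tree has little incentive to split this node further; once the positive sample is weighted heavily, the same node has high impurity and separating the positive becomes worthwhile.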

 
 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn as sk

%matplotlib inline

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

### Load the data

In [None]:
data = pd.read_csv("../input/creditcard.csv")
data.head()

### Checking the target classes

In [None]:
count_classes = data['Class'].value_counts().sort_index()
count_classes.plot(kind = 'bar')
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.yscale("log")

### Clearly the data is totally unbalanced!!  

Since the data is so unbalanced, using a typical accuracy score to evaluate our classification algorithm would be misleading. For example, if we simply assigned the majority class to all records, we would still get a high accuracy, but we would classify every fraudulent transaction (class 1) incorrectly.
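A small illustration of this point, using made-up labels (998 legitimate and 2 fraudulent transactions, not the actual class counts of the dataset) and a "classifier" that always predicts the majority class:

```python
# Hypothetical labels: 998 legitimate (0) and 2 fraudulent (1) transactions.
y_true_demo = [0] * 998 + [1] * 2

# A trivial baseline that always predicts the majority class.
y_pred_demo = [0] * len(y_true_demo)

pairs = list(zip(y_true_demo, y_pred_demo))
accuracy = sum(t == p for t, p in pairs) / len(pairs)

# Recall on the fraud class: fraction of frauds that were actually caught.
frauds = sum(y_true_demo)
caught = sum(1 for t, p in pairs if t == 1 and p == 1)
recall = caught / frauds

print("accuracy:", accuracy)
print("fraud recall:", recall)
```

The accuracy is close to 1 even though not a single fraud is detected, which is why the rest of the notebook reports the false negative rate and false positive rate instead.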

### Splitting data into train and test set. 

To keep this simple, we use a traditional training/test split.

In [None]:
from sklearn.model_selection import train_test_split

# Features: every column except the first one and the last one
X = data.iloc[:, 1:-1]
# Target: the last column, used as the class label
y = data.iloc[:, -1]

# Whole dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))

In [None]:
def confusion(classifier, X_test, y_test):
    y_pred  = classifier.predict(X_test)
    return confusion_matrix(y_test, y_pred).ravel()

In [None]:
def show(tn,fp,fn,tp):
    print("TN:" + str(tn) + " FP:" + str(fp) + " FN:" + str(fn) + " TP:" + str(tp) + 
          " FNR=" + str(fn/(fn+tp)) + " FPR=" + str(fp/(fp+tn)))

### Using ***unbalanced*** RandomForestClassifier approach:

This is the usual Random Forest classifier with scikit-learn's default class weighting. All transactions, regardless of label, are weighted the same.

In [None]:
show(*confusion(RandomForestClassifier(random_state=0, n_jobs=-1, n_estimators=10).fit(X_train,y_train),X_test,y_test))

Clearly, the false negative rate is much higher than the false positive rate. 
This demonstrates the inherent  _bias_ toward classification of the majority class in this unbalanced data set.

#### Using ***balanced*** RandomForestClassifier approach:


Weights are inversely proportional to the frequency of class observations: $$w_j = \frac{n}{k n_j}$$

where $w_j$ is the weight of class $j$, $n$ is the number of observations, $n_j$ is the number of observations in class $j$, and $k$ is the total number of classes. In our case, that indicates that the minority class label (fraud) should be weighted higher.
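As a quick check of the "balanced" heuristic $w_j = n / (k \, n_j)$, the following sketch computes the weights by hand for a made-up label vector (990 negatives, 10 positives). The helper `balanced_class_weights` is a hypothetical name, not a scikit-learn function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: n / (k * n_j)} for the given label sequence."""
    counts = Counter(labels)
    n = len(labels)   # total number of observations
    k = len(counts)   # number of distinct classes
    return {cls: n / (k * n_j) for cls, n_j in counts.items()}

# Hypothetical labels with a 99:1 imbalance.
y_demo = [0] * 990 + [1] * 10
weights = balanced_class_weights(y_demo)

print(weights)
print("ratio w_1 / w_0:", weights[1] / weights[0])
```

The ratio of the two weights equals the inverse ratio of the class frequencies, so each class contributes the same total weight to the training set.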

In [None]:
show(*confusion(RandomForestClassifier(random_state=0, n_jobs=-1, n_estimators=10, class_weight="balanced").fit(X_train,y_train),X_test,y_test))

This makes only a small difference, and again, the tradeoff between the false positive rate and the false negative rate is an arbitrary one.  We still have quite the  _bias_ toward classification of the majority class.

#### Using balanced RandomForestClassifier in `balanced_subsample` mode
This is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown.

In [None]:
show(*confusion(RandomForestClassifier(random_state=0, n_jobs=-1, n_estimators=10, class_weight="balanced_subsample").fit(X_train,y_train),X_test,y_test))

#### Custom weighting

Positive samples are given exponentially increasing weights $w_{pos}$ relative to negative samples, which have a constant but very small weight $w_{neg} = 10^{-4}$. The intention is to find a better tradeoff for the classifier, with fewer false negatives at the cost of accepting a large (but not unreasonable) number of false positives.

The weights $w_{neg}$ and $w_{pos}$ become hyperparameters that we can optimize. For the sake of simplicity we evaluate these on the test set here, but we stress that in practice we should evaluate these hyperparameters using cross-validation (CV) on the training set alone to avoid leakage from the test set.
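A minimal sketch of what such a CV-based selection could look like. It runs on a small synthetic dataset from `make_classification` as a stand-in for the training split, and the per-error costs are hypothetical; all names ending in `_demo` are introduced here for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic, imbalanced stand-in for the training data (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.95, 0.05], random_state=0
)

# Hypothetical costs of a false positive and a false negative.
COST_FP, COST_FN = 1.0, 50.0

# Stratified folds keep the rare class present in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

cv_costs = {}
for w_pos_demo in [1.0, 10.0, 100.0]:
    clf = RandomForestClassifier(
        n_estimators=10, random_state=0,
        class_weight={0: 1.0, 1: w_pos_demo},
    )
    # Out-of-fold predictions: every sample is predicted by a model
    # that never saw it during fitting.
    y_oof = cross_val_predict(clf, X_demo, y_demo, cv=cv)
    tn, fp, fn, tp = confusion_matrix(y_demo, y_oof, labels=[0, 1]).ravel()
    cv_costs[w_pos_demo] = COST_FP * fp + COST_FN * fn

best_w_pos = min(cv_costs, key=cv_costs.get)
print(cv_costs, "-> best w_pos:", best_w_pos)
```

The test set is then used only once, to report the performance of the weight chosen this way.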

The balance tips within a rather small numeric range of $\frac{w_{pos}}{w_{neg}}$, so we search for $w_{pos}$ in exponential increments. A finer-grained search around the bend might find a better tradeoff.

In [None]:
w_neg = 10**-4
w_pos_range = np.exp(np.arange(np.log(1), np.log(10**9)))

In [None]:
for w_pos in w_pos_range:
    print("w_pos: " + str(w_pos))
    show(*confusion(RandomForestClassifier(random_state=0, n_jobs=-1, n_estimators=10, class_weight={0: w_neg, 1: w_pos}).fit(X_train,y_train),X_test,y_test))

Note, for instance:
```
w_pos: 59874.14171519782
TN:84457 FP:839 FN:29 TP:118 FNR=0.19727891156462585 FPR=0.009836334646407803
```

Here we have fewer than 30 false negatives, which is about 25% fewer than for the unweighted or even the balanced approach, but at the cost of having 8 times as many false positives. However, in a setting where false negatives are far more expensive than false positives, this may be a more acceptable balance.

#### Conclusion
It is feasible to strike a balance between false negatives and false positives using a class weight hyperparameter. The optimal setting of this hyperparameter depends not only on the cost of false negatives and the cost of false positives, but also on the class imbalance. In addition to estimating the expected cost associated with a given set of class weights, it makes sense to bootstrap the data to estimate the variance around this expectation.
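As a rough sketch of the bootstrap idea, the confusion counts from the example output quoted earlier (TN 84457, FP 839, FN 29, TP 118) can be resampled to see how much the total cost would vary. Drawing the four counts from a multinomial distribution is equivalent to resampling the individual test predictions with replacement. The costs are hypothetical, and this captures only test-set sampling variability for a fixed classifier, not the variability from retraining.

```python
import numpy as np

# Confusion counts (TN, FP, FN, TP) from the quoted example output.
counts = np.array([84457, 839, 29, 118])

# Hypothetical per-error costs.
COST_FP, COST_FN = 1.0, 100.0

rng = np.random.default_rng(0)
n = counts.sum()

# 1000 bootstrap replicates of the confusion counts.
resampled = rng.multinomial(n, counts / n, size=1000)

# Cost of each replicate, using its FP and FN columns.
costs = COST_FP * resampled[:, 1] + COST_FN * resampled[:, 2]

point_estimate = COST_FP * counts[1] + COST_FN * counts[2]
low, high = np.percentile(costs, [2.5, 97.5])

print("cost:", point_estimate, "95% interval:", (low, high))
```

Because the cost is dominated by a small number of false negatives, the interval is wide relative to the point estimate, so differences between nearby weight settings should be interpreted with care.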

### References:

- Good blog article on different methods handling class imbalance:
https://www.svds.com/learning-imbalanced-classes/
- A Survey of Predictive Modelling under Imbalanced Distributions: https://arxiv.org/pdf/1505.01658.pdf
- Short blog post about scikit-learn RandomForestClassifier balanced mode: https://chrisalbon.com/machine_learning/trees_and_forests/handle_imbalanced_classes_in_random_forests/
- The mechanism of doing the _weighted_ node-split in the tree in scikit learn: https://github.com/scikit-learn/scikit-learn/blob/70f170dedf2927c2d805144425522459d92700a7/sklearn/tree/_criterion.pyx#L635
- Paper that introduces "Weighted Random Forests" as a method to deal with class imbalance: https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
- This paper frames the concepts around "cost sensitive learning", i.e., minimizing cost rather than just misclassification loss: https://cling.csd.uwo.ca/papers/cost_sensitive.pdf