# Imbalanced Data Handling

This notebook describes different approaches to handling imbalanced data.

**Approaches**

1. no modifications to dataset
2. using the default balance_classes option
3. using the weights column
4. ensemble of upsampled models

In this example, we import a dataset with an imbalanced binary target.  The dataset contains anonymized credit card transactions, each labeled as fraudulent or legitimate: <https://www.kaggle.com/mlg-ulb/creditcardfraud>

In [77]:
import h2o
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_241"; Java(TM) SE Runtime Environment (build 1.8.0_241-b07); Java HotSpot(TM) 64-Bit Server VM (build 25.241-b07, mixed mode)
  Starting server from /Users/megankurka/env2/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/fk/z2fjbsq163scfcsq9fhsw7r00000gn/T/tmpaqrw0quy
  JVM stdout: /var/folders/fk/z2fjbsq163scfcsq9fhsw7r00000gn/T/tmpaqrw0quy/h2o_megankurka_started_from_python.out
  JVM stderr: /var/folders/fk/z2fjbsq163scfcsq9fhsw7r00000gn/T/tmpaqrw0quy/h2o_megankurka_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,01 secs
H2O_cluster_timezone:,America/New_York
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.30.1.1
H2O_cluster_version_age:,"14 days, 7 hours and 13 minutes"
H2O_cluster_name:,H2O_from_python_megankurka_8gg4bh
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.556 Gb
H2O_cluster_total_cores:,16
H2O_cluster_allowed_cores:,16


In [78]:
df = h2o.import_file("../../../../Data/FraudDetection/creditcard.csv", 
                    col_types = {'Class': "enum"})

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [79]:
df.head()

Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-1.35981,-0.0727812,2.53635,1.37816,-0.338321,0.462388,0.239599,0.0986979,0.363787,0.0907942,-0.5516,-0.617801,-0.99139,-0.311169,1.46818,-0.470401,0.207971,0.0257906,0.403993,0.251412,-0.0183068,0.277838,-0.110474,0.0669281,0.128539,-0.189115,0.133558,-0.0210531,149.62,0
0,1.19186,0.266151,0.16648,0.448154,0.0600176,-0.0823608,-0.078803,0.0851017,-0.255425,-0.166974,1.61273,1.06524,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.0690831,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.0089831,0.0147242,2.69,0
1,-1.35835,-1.34016,1.77321,0.37978,-0.503198,1.8005,0.791461,0.247676,-1.51465,0.207643,0.624501,0.0660837,0.717293,-0.165946,2.34586,-2.89008,1.10997,-0.121359,-2.26186,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.0553528,-0.0597518,378.66,0
1,-0.966272,-0.185226,1.79299,-0.863291,-0.0103089,1.2472,0.237609,0.377436,-1.38702,-0.0549519,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.05965,-0.684093,1.96578,-1.23262,-0.208038,-0.1083,0.0052736,-0.190321,-1.17558,0.647376,-0.221929,0.0627228,0.0614576,123.5,0
2,-1.15823,0.877737,1.54872,0.403034,-0.407193,0.0959215,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.34585,-1.11967,0.175121,-0.451449,-0.237033,-0.0381948,0.803487,0.408542,-0.0094307,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0
2,-0.425966,0.960523,1.14111,-0.168252,0.420987,-0.0297276,0.476201,0.260314,-0.568671,-0.371407,1.34126,0.359894,-0.358091,-0.137134,0.517617,0.401726,-0.0581328,0.0686531,-0.0331938,0.0849677,-0.208254,-0.559825,-0.0263977,-0.371427,-0.232794,0.105915,0.253844,0.0810803,3.67,0
4,1.22966,0.141004,0.0453708,1.20261,0.191881,0.272708,-0.005159,0.0812129,0.46496,-0.0992543,-1.41691,-0.153826,-0.751063,0.167372,0.0501436,-0.443587,0.00282051,-0.611987,-0.045575,-0.219633,-0.167716,-0.27071,-0.154104,-0.780055,0.750137,-0.257237,0.0345074,0.00516777,4.99,0
7,-0.644269,1.41796,1.07438,-0.492199,0.948934,0.428118,1.12063,-3.80786,0.615375,1.24938,-0.619468,0.291474,1.75796,-1.32387,0.686133,-0.076127,-1.22213,-0.358222,0.324505,-0.156742,1.94347,-1.01545,0.0575035,-0.649709,-0.415267,-0.0516343,-1.20692,-1.08534,40.8,0
7,-0.894286,0.286157,-0.113192,-0.271526,2.6696,3.72182,0.370145,0.851084,-0.392048,-0.41043,-0.705117,-0.110452,-0.286254,0.0743554,-0.328783,-0.210077,-0.499768,0.118765,0.570328,0.0527357,-0.0734251,-0.268092,-0.204233,1.01159,0.373205,-0.384157,0.0117474,0.142404,93.2,0
9,-0.338262,1.11959,1.04437,-0.222187,0.499361,-0.246761,0.651583,0.0695386,-0.736727,-0.366846,1.01761,0.83639,1.00684,-0.443523,0.150219,0.739453,-0.54098,0.476677,0.451773,0.203711,-0.246914,-0.633753,-0.120794,-0.38505,-0.069733,0.0941988,0.246219,0.0830756,3.68,0




The minority class (fraud, `Class` equal to 1) accounts for only about 0.17% of the rows, so the data is highly imbalanced.

In [80]:
100*df[df["Class"] == "1"].nrow/df.nrow

0.1727485630620034

## Approach 1: No Modifications to Dataset

We will begin by building models as we normally would, without any special handling of the imbalance.  These models serve as the baseline that the imbalanced data handling techniques in the later sections need to beat.

We will split our dataset by time, so that the validation data comes strictly after the training data, and build a few models using different algorithms.

In [81]:
train = df[df["Time"] < 140000]
valid = df[df["Time"] >= 140000]

In [82]:
x = [i for i in train.col_names if i not in ["Time", "Class"]]

In [83]:
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

glm = H2OGeneralizedLinearEstimator(model_id="glm.hex",
                                    nfolds=3, seed=1234,
                                    lambda_search=True,
                                    family="binomial"
                                   )
glm.train(y="Class", x=x, training_frame=train, validation_frame=valid)

glm Model Build progress: |███████████████████████████████████████████████| 100%


In [84]:
from h2o.estimators.random_forest import H2ORandomForestEstimator

rf = H2ORandomForestEstimator(model_id="rf.hex",
                              seed=1234
                             )
rf.train(y="Class", x=x, training_frame=train, validation_frame=valid)

drf Model Build progress: |███████████████████████████████████████████████| 100%


In [85]:
from h2o.estimators.xgboost import H2OXGBoostEstimator

xgb = H2OXGBoostEstimator(model_id="xgb.hex",
                          nfolds=3, seed=1234,
                          score_tree_interval=5,
                          ntrees=10000, stopping_rounds=3, # early stopping
                          col_sample_rate=0.8)
xgb.train(y="Class", x=x, training_frame=train, validation_frame=valid)

xgboost Model Build progress: |███████████████████████████████████████████| 100%


We can now compare the performance of our 3 models on the validation data.  We use AUCPR instead of AUC.

AUC is the area under the ROC curve, while AUCPR is the area under the Precision-Recall curve. Precision and recall are computed from True Positives, False Positives, and False Negatives only, so the Precision-Recall curve ignores True Negatives. For imbalanced data, the large number of True Negatives keeps the false positive rate low and can mask a number of False Positives that is large relative to the handful of actual positives. AUCPR is therefore much more sensitive than AUC to changes in True Positives, False Positives, and False Negatives, and is recommended over AUC for highly imbalanced data.
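
The contrast between the two metrics can be sketched on toy data. The snippet below is a minimal illustration in plain Python, not H2O's implementation: `roc_auc` and `average_precision` are hypothetical helper names, the scores are made up, and AUCPR is approximated by step-wise average precision, which may differ slightly from the value H2O reports.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve."""
    ranked = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    true_positives = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos


# Toy data (assumption): 1000 negatives and 10 positives, about 1% positives.
# Every positive outranks at least 95% of the negatives, yet a few
# negatives still score higher than each positive.
labels = [0] * 1000 + [1] * 10
scores = [k / 1000 for k in range(1000)] + [0.9505 + 0.005 * j for j in range(10)]

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print("ROC AUC:           {:.4f}".format(auc))
print("Average precision: {:.4f}".format(ap))
```

On this toy data the ROC AUC is about 0.97 while the average precision stays below 0.2: a few dozen high-scoring negatives barely move the false positive rate, but they outnumber the positives at the top of the ranking.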

In [86]:
print("GLM AUCPR: {:,.3f}".format(glm.aucpr(valid=True)))
print("Random Forest AUCPR: {:,.3f}".format(rf.aucpr(valid=True)))
print("XGBoost AUCPR: {:,.3f}".format(xgb.aucpr(valid=True)))

GLM AUCPR: 0.713
Random Forest AUCPR: 0.790
XGBoost AUCPR: 0.778


We see the best performance from our Random Forest model (AUCPR of 0.790), followed by XGBoost (0.778) and GLM (0.713).

Note: For imbalanced datasets, we sometimes find that simpler models (GLM, Random Forest) do better since they are less likely to overfit.  We recommend evaluating these models alongside gradient boosting estimators.

The confusion matrix of the Random Forest model on the validation data is shown below.  At the max-F1 threshold, the model flags only 3 legitimate transactions as fraud, but it misses 25 of the 91 fraud cases.

In [87]:
rf.confusion_matrix(valid=True)


Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4: 


Actual,0,1,Error,Rate
0,69442.0,3.0,0.0,(3.0/69445.0)
1,25.0,66.0,0.2747,(25.0/91.0)
Total,69467.0,69.0,0.0004,(28.0/69536.0)
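
As a worked example, the counts in the confusion matrix above can be turned into precision, recall, and F1, and compared with the accuracy of a trivial model that never predicts fraud. This is plain arithmetic on the numbers shown; the variable names are only for illustration.

```python
# Counts copied from the validation confusion matrix above
# (rows are actual classes, columns are predicted classes).
tn, fp = 69442, 3   # actual 0: predicted 0, predicted 1
fn, tp = 25, 66     # actual 1: predicted 0, predicted 1
n = tn + fp + fn + tp

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / n

# Accuracy of a model that labels every transaction as non-fraud.
baseline_accuracy = (tn + fp) / n

print("precision: {:.4f}".format(precision))
print("recall:    {:.4f}".format(recall))
print("F1:        {:.4f}".format(f1))
print("accuracy:  {:.5f}".format(accuracy))
print("always-0:  {:.5f}".format(baseline_accuracy))
```

Accuracy barely separates the model from the always-0 baseline, because both exceed 99.8% on data this imbalanced. Precision and recall make the 25 missed fraud cases visible.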




## Approach 2: Using the `balance_classes` Parameter

We will train a Random Forest again, but this time we will turn on `balance_classes`.  The `balance_classes` parameter balances the class counts of the training data via over/under-sampling.  The amount of over/under-sampling is determined automatically so that, by default, the balanced data is at most 5 times the size of the original.  This maximum relative size is configurable through the `max_after_balance_size` parameter.
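
One way to picture balancing under a size cap is to compute a per-class sampling factor. The sketch below is an assumption about how such a scheme can work, not H2O's internal algorithm: `balancing_factors` is a hypothetical helper, the class counts are made up, and only the parameter name `max_after_balance_size` comes from H2O.

```python
def balancing_factors(class_counts, max_after_balance_size=5.0):
    """Per-class sampling factors that equalize class counts under a size cap.

    A factor above 1 means over-sampling, below 1 means under-sampling.
    """
    total = sum(class_counts.values())
    target = max(class_counts.values())  # bring every class up to the largest
    factors = {c: target / n for c, n in class_counts.items()}

    balanced_total = target * len(class_counts)
    limit = max_after_balance_size * total
    if balanced_total > limit:
        # Shrink all classes by the same ratio so they stay balanced
        # while the total row count respects the cap.
        shrink = limit / balanced_total
        factors = {c: f * shrink for c, f in factors.items()}
    return factors


# Toy class counts (assumption), not the counts of the fraud dataset.
counts = {"0": 9980, "1": 20}

factors = balancing_factors(counts)                             # cap not reached
capped = balancing_factors(counts, max_after_balance_size=1.0)  # cap reached

print(factors)
print(capped)
```

With the default cap the minority class is simply over-sampled up to the majority count. With a cap of 1.0 the majority class is under-sampled as well, so both classes end up with 5,000 rows.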

In [88]:
rf_balanced = H2ORandomForestEstimator(model_id="rf_balanced.hex",
                                       seed=1234,
                                       balance_classes=True)
rf_balanced.train(y="Class", training_frame=train, validation_frame=valid)

drf Model Build progress: |███████████████████████████████████████████████| 100%


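Balancing the data provided to the Random Forest improves the validation AUCPR from 0.790 to 0.807.  Note that the call above does not pass `x`, so unlike the baseline model it can also use `Time` as a predictor; part of the difference may come from that rather than from balancing.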

In [89]:
print("Random Forest AUCPR: {:,.3f}".format(rf.aucpr(valid=True)))
print("Random Forest Balanced AUCPR: {:,.3f}".format(rf_balanced.aucpr(valid=True)))

Random Forest AUCPR: 0.790
Random Forest Balanced AUCPR: 0.807


## Approach 3: Using Weights Column

We will now try using a weights column to see if this helps improve our model performance.

We will give each fraud record a weight of 100 and every other record a weight of 1.  A weight of 100 acts like repeating the record 100 times, so this has a similar effect to upsampling the minority class.  To prevent the performance metrics on the validation data from also being weighted, we will set all weights to 1 on the validation dataset.
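
The correspondence between weights and upsampling can be checked on a simple statistic. The sketch below uses made-up labels and compares the weighted fraud share with the fraud share after physically repeating each row `weight` times. It illustrates the idea only; an algorithm that samples rows, such as Random Forest, is not guaranteed to produce identical models in the two cases.

```python
# Toy labels (assumption): 990 non-fraud rows and 10 fraud rows.
labels = [0] * 990 + [1] * 10
weights = [100 if y == 1 else 1 for y in labels]

# Fraud share when each row counts in proportion to its weight.
weighted_share = sum(w * y for w, y in zip(weights, labels)) / sum(weights)

# Fraud share after repeating each row `weight` times.
upsampled = [y for y, w in zip(labels, weights) for _ in range(w)]
upsampled_share = sum(upsampled) / len(upsampled)

print("weighted share:  {:.4f}".format(weighted_share))
print("upsampled share: {:.4f}".format(upsampled_share))
print("rows stored: {} weighted vs {} upsampled".format(len(labels), len(upsampled)))
```

Both shares are identical, but the weighted version keeps the original 1,000 rows instead of materializing 1,990.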

In [90]:
train["weight"] = (train["Class"] == "1").ifelse(100, 1)
valid["weight"] = 1

In [91]:
rf_weighted = H2ORandomForestEstimator(model_id="rf_weighted.hex",
                                       seed=1234,
                                       weights_column="weight")
rf_weighted.train(y="Class", training_frame=train, validation_frame=valid)

drf Model Build progress: |███████████████████████████████████████████████| 100%


The weighted Random Forest reaches a validation AUCPR of 0.808, up from 0.790 for the baseline and essentially on par with the `balance_classes` model (0.807).

In [92]:
print("Random Forest AUCPR: {:,.3f}".format(rf.aucpr(valid=True)))
print("Random Forest Balanced AUCPR: {:,.3f}".format(rf_balanced.aucpr(valid=True)))
print("Random Forest Weighted AUCPR: {:,.3f}".format(rf_weighted.aucpr(valid=True)))

Random Forest AUCPR: 0.790
Random Forest Balanced AUCPR: 0.807
Random Forest Weighted AUCPR: 0.808


## Approach 4: Ensemble of Upsampled Models

In this approach, we upsample the minority class (using weights) and train multiple models.  For each model, a different random sample of the minority class is upsampled.  These models are then ensembled together in the hopes of creating a more robust model.

For each model, we will randomly upsample about 50% of the minority class. To do this, we use the weights column: the selected fraud records get a weight of 2, which is the same as sampling a record twice, and all other records keep a weight of 1.
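
The weight assignment can be sketched in plain Python. This is a minimal illustration under assumptions: `upsample_weights` is a hypothetical helper, the labels are made up, and an explicit seed per model stands in for the random draw used in the H2O code.

```python
import random


def upsample_weights(labels, minority="1", fraction=0.5, seed=0):
    """Weight 2 for a random subset of minority rows, weight 1 for all other rows."""
    rng = random.Random(seed)
    return [
        2 if y == minority and rng.random() < fraction else 1
        for y in labels
    ]


# Toy labels (assumption): 980 majority rows and 20 minority rows.
labels = ["0"] * 980 + ["1"] * 20

# One weight vector per base model, each drawn with a different seed.
weights_by_model = [upsample_weights(labels, seed=i) for i in range(3)]

for i, weights in enumerate(weights_by_model):
    n_upsampled = sum(1 for w in weights if w == 2)
    print("model {}: {} of 20 minority rows upsampled".format(i, n_upsampled))
```

Because every model sees a different subset of doubled fraud records, the base models differ from each other, which is what gives the ensemble something to combine.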

In [93]:
base_models = []
for i in range(3):
    print("Training Base Model #{}".format(i))
    train["weight"] = 1
    
    # randomly sample some minority class to be upsampled
    train[(train.runif() > 0.5) & (train["Class"] == "1"), "weight"] = 2
    
    # train random forest model with cross validation
    # (this rebinds the name `rf`; the baseline model stays available
    # in the cluster under its model ID "rf.hex")
    rf = H2ORandomForestEstimator(model_id="rf_{}.hex".format(i),
                                  nfolds=3, seed=1234,
                                  fold_assignment="Modulo", keep_cross_validation_predictions=True,
                                  weights_column="weight"
                                 )
    rf.train(y="Class", training_frame=train, validation_frame=valid)
    base_models.append(rf)

Training Base Model #0
drf Model Build progress: |███████████████████████████████████████████████| 100%
Training Base Model #1
drf Model Build progress: |███████████████████████████████████████████████| 100%
Training Base Model #2
drf Model Build progress: |███████████████████████████████████████████████| 100%


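Now that we have our upsampled models using different samples, we will ensemble them together using the Stacked Ensemble algorithm.  Stacked Ensemble trains its metalearner on the cross-validated predictions of the base models, which is why the base models above were built with the same `nfolds`, `fold_assignment="Modulo"`, and `keep_cross_validation_predictions=True`.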

In [94]:
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

stack = H2OStackedEnsembleEstimator(model_id="ensemble",
                                    base_models=[i.model_id for i in base_models])
stack.train(y="Class", training_frame=train, validation_frame=valid)

stackedensemble Model Build progress: |███████████████████████████████████| 100%


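The AUCPR of the stacked ensemble on the unweighted validation data is shown below, followed by the AUCPR of each base model.  The ensemble performs better than 2 of the 3 individual base models and falls just short of the best one.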

In [95]:
stack.aucpr(valid=True)

0.7973530777469093

In [96]:
for i in base_models:
    print(h2o.get_model(i.model_id).model_performance(valid).aucpr())

0.7925736086280576
0.7977647646336248
0.7876173709756262


Below we compare all 4 approaches.  Each imbalance handling approach scores above the baseline Random Forest.  The weights column (0.808) and `balance_classes` (0.807) perform best, while the stacked ensemble (0.797) gives a smaller gain.

In [97]:
# `rf` was reassigned in the Approach 4 loop, so fetch the baseline model by its ID
rf_baseline = h2o.get_model("rf.hex")
print("Random Forest AUCPR: {:,.3f}".format(rf_baseline.aucpr(valid=True)))
print("Random Forest Balanced AUCPR: {:,.3f}".format(rf_balanced.aucpr(valid=True)))
print("Random Forest Weighted AUCPR: {:,.3f}".format(rf_weighted.aucpr(valid=True)))
print("Stacked Ensemble AUCPR: {:,.3f}".format(stack.aucpr(valid=True)))

Random Forest AUCPR: 0.790
Random Forest Balanced AUCPR: 0.807
Random Forest Weighted AUCPR: 0.808
Stacked Ensemble AUCPR: 0.797


In [98]:
h2o.cluster().shutdown()

H2O session _sid_9e36 closed.
