In [1]:
import numpy as np
import pandas as pd

# Scikit-learn
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
## Evaluation functions
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
## Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# imbalanced-learn
from imblearn.pipeline import Pipeline

All relevant functions and classes are contained in the module `wgan`, which has PyTorch as its main dependency.

In [2]:
from wgan.imblearn import GANbalancer
#import wgan.data_loader as data_loader

We'll load the COIL00 dataset from an old competition on customer scoring; it contains numeric and categorical variables. The preprocessing is outsourced to the function `load_coil00`. I assume that the data contains no missing values and is loaded as a pandas `DataFrame` in which all categorical variables have the pandas dtype `category`. The balancer object itself will run on a numpy array.

In [3]:
%run -i './data/load_coil00.py'
X, y = load_coil00("./data")

For this exposition, I subsample 5 continuous and 5 categorical variables. 

In [4]:
var_set = list(range(0,5)) + list(range(60,65))
X = X.iloc[:,var_set]

In [5]:
X.shape

(5822, 10)

The GANbalancer creates an embedding layer for each categorical variable. To do so, it needs to know which variables are categorical and how many levels each of them has. This is easier to automate if the information is stored in the data object, but we will pass it to the GANbalancer manually, so that it is possible to avoid pandas.

In [6]:
# Initialize index lists
idx_cont = None
idx_cat  = None

if idx_cat is None:
    idx_cat = list(np.where(X.dtypes == 'category')[0])
    idx_cat = [int(x) for x in idx_cat]

if idx_cont is None:
    idx_cont = [x for x in range(X.shape[1]) if x not in idx_cat]
    idx_cont = [int(x) for x in idx_cont]

In [7]:
idx_cat

[5, 6, 7, 8, 9]

The GANbalancer takes as input a list of tuples, one per categorical variable, each holding the column index of the variable, the number of levels within the variable and the size of the embedding used for it in the critic. The critic takes the raw output of the generator as input, where each categorical variable is one-hot encoded (soft one-hot encoding for the generator output). The critic then runs each categorical variable through an embedding layer, the size of which we specify here for each variable separately.

In [8]:
# Initialize embedding tuples
categorical = None
if idx_cat is not None:
    categorical = [(i, # Cat. Var. index
                    len(X.iloc[:,i].cat.categories), # Number of categories (generator output size for this variable)
                    # Critic Embedding layer size: I suggest the heuristic no_nodes = min(no. of levels/2, 15)
                    int(min(15., np.ceil(0.5*len(X.iloc[:,i].cat.categories))))
                   )
                    for i in idx_cat]

# Make sure the columns are ordered [continuous, categorical]
if idx_cat and idx_cont and max(idx_cont) > min(idx_cat):
    raise ValueError("Variables need to be ordered [cont, cat]")
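To make the role of the embedding size concrete, here is a minimal NumPy sketch of what an embedding layer does with a hard and a soft one-hot vector. It is an illustration only: I have not checked how `wgan` implements its critic, and the names `embedding`, `hard` and `soft` are made up for this example. The embedding is modelled as a plain matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)

n_levels = 6                                          # levels of one toy categorical variable
emb_size = int(min(15, np.ceil(0.5 * n_levels)))      # heuristic from the cell above
embedding = rng.normal(size=(n_levels, emb_size))     # hypothetical embedding weights

# Real data: a hard one-hot vector selects exactly one row of the matrix.
hard = np.zeros(n_levels)
hard[2] = 1.0
hard_vec = hard @ embedding

# Generator output: a soft one-hot vector (softmax) mixes all rows.
logits = rng.normal(size=n_levels)
soft = np.exp(logits - logits.max())
soft /= soft.sum()
soft_vec = soft @ embedding

print(emb_size, hard_vec.shape, soft_vec.shape)
```

Under this view, real and generated rows end up in the same `emb_size`-dimensional space, which is why the critic can be fed both.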

After collecting the information on the categorical variables, we can transform the data to a numpy array. Note that categorical variables are level encoded (0, ..., no. of levels - 1) rather than one-hot encoded at this point.

In [9]:
X=X.to_numpy(dtype=np.float32)
y=y.to_numpy(dtype=np.int32)
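The conversion above relies on the category values already being numeric level codes. If a `DataFrame` holds string categories instead, the level encoding can be done explicitly with `.cat.codes` first. The following is a small sketch on a made-up frame (`toy`, `income` and `region` are hypothetical and not part of COIL00); I do not know whether `load_coil00` does something equivalent internally.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "income": [1.5, 2.0, 0.7, 3.1],
    "region": pd.Categorical(["north", "south", "north", "west"]),
})

encoded = toy.copy()
# Replace every category column by its integer level codes (0, 1, 2, ...).
for col in encoded.select_dtypes(include="category").columns:
    encoded[col] = encoded[col].cat.codes

toy_array = encoded.to_numpy(dtype=np.float32)
print(toy_array)
```

The codes follow the order of `cat.categories`, so the mapping back to the original labels stays available on the `DataFrame`.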

Since the critic contains embedding layers in the input module, the GANbalancer works on label-encoded variables (0,1,2,...) rather than one-hot encoded variables. Since most sklearn classifiers take one-hot encoded input, we'll use two different preprocessing objects.

In [10]:
# Preprocessing for GANbalancer
preproc_sampler = ColumnTransformer([
    ('scaler', MinMaxScaler(), idx_cont),
    ('pass',   'passthrough',  idx_cat)
])

# Preprocessing for Logistic Regression
preproc_clf = ColumnTransformer([
    ('pass', 'passthrough', idx_cont),
    ('ohe',   OneHotEncoder(categories='auto', handle_unknown='ignore'),  idx_cat)
])
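As a quick check of what the two preprocessing objects do to the column layout, here is a sketch on a toy array with two continuous columns and one categorical column with three levels. The `toy_*` names are made up for this example; the transformers are configured as in the cell above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

toy_X = np.array([[10., 200., 0.],
                  [20., 400., 1.],
                  [30., 300., 2.],
                  [40., 100., 1.]], dtype=np.float32)
toy_cont, toy_cat = [0, 1], [2]

toy_sampler_prep = ColumnTransformer([
    ('scaler', MinMaxScaler(), toy_cont),
    ('pass',   'passthrough',  toy_cat)
])
toy_clf_prep = ColumnTransformer([
    ('pass', 'passthrough', toy_cont),
    ('ohe',  OneHotEncoder(categories='auto', handle_unknown='ignore'), toy_cat)
])

# Step 1: scale continuous columns, keep the level codes untouched.
for_sampler = toy_sampler_prep.fit_transform(toy_X)
# Step 2: keep the scaled columns, expand the codes to one-hot columns.
for_clf = toy_clf_prep.fit_transform(for_sampler)

print(for_sampler.shape, for_clf.shape)
```

The column order [continuous, categorical] is preserved by both steps, so the same index lists can be reused for the second transformer.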

The GANbalancer itself is now simple to construct through a nice wrapper. Defaults are currently fast but not optimal and require more empirical validation. 

In [11]:
sampler = GANbalancer(
            sampling_strategy={1:700}, # How many samples to have after sampling? "auto" balances to 50:50
            idx_cont=idx_cont,  # List of indices of continuous variables
            categorical=categorical, # List of tuples with info on categorical variables (see above)
            
            verbose = 1,
    
            auxiliary=True, # Train one conditional generator for all classes
            gan_architecture="fisher", # Which GAN loss function to use? Fisher GAN is recommended
            generator_input= X.shape[1], # Noise input to the generator
            generator_layers=[50,50], # Hidden layer sizes in the generator, typically one or two hidden layers
            critic_layers=[50,50], # Hidden layer sizes in the critic (w/o embedding layers), typically one or two hidden layers
            layer_norm=True, # Layer normalization? Recommended
            
            batch_size=64, 
            learning_rate=(5e-05, 5e-05),
            n_iter=1e5,  # No. of Generator updates. In the range of [1e5 to 1e6]
            critic_iterations=3) # Number of critic updates before each generator update
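The dictionary passed as `sampling_strategy` gives the number of samples each class should have after resampling, so the number of synthetic rows is the difference to the current class count. A small sketch of that arithmetic on made-up class counts follows; `n_synthetic_needed` is a hypothetical helper for illustration, not a function of `wgan` or imbalanced-learn, and clipping at zero is my simplification.

```python
from collections import Counter

def n_synthetic_needed(y, sampling_strategy):
    """Rows to generate per class so that each class reaches its target count."""
    counts = Counter(y)
    return {label: max(target - counts[label], 0)
            for label, target in sampling_strategy.items()}

toy_y = [0] * 900 + [1] * 60          # made-up imbalanced labels
needed = n_synthetic_needed(toy_y, {1: 700})
print(needed)
```

With 60 minority rows and a target of 700, the sampler would have to generate 640 synthetic minority rows.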

The GANbalancer works with any model from sklearn.

In [12]:
model = LogisticRegression(solver='liblinear')

Construct a pipeline for convenience. The GANbalancer is only needed during model training and can be discarded for deployment. 

In [13]:
pipeline = Pipeline(steps=[
            ('preproc_sampler', preproc_sampler),
            ('sampler', sampler),
            ('preproc_clf', preproc_clf),
            ('classifier', model)
          ])

Training of the GANbalancer may require some time depending on the number of updates `n_iter`. 10000 iterations on full COIL00 should take ~20 minutes on an average CPU. 

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

In [15]:
%%time
pipeline.fit(X=X_train,y=y_train)

100010it [1:04:23, 27.76it/s]                             


Pipeline(memory=None,
     steps=[('preproc_sampler', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1)), [0, 1, 2, 3, 4]), ('pass', 'passthrough', [5, 6, 7, 8, 9])])), ('sampler', GANbalancer...ty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False))])

Fit a benchmark model without addressing the imbalance in the dataset.

In [16]:
pipeline_benchmark = Pipeline(steps=[
            ('preproc_clf', ColumnTransformer([
                                ('scaler', MinMaxScaler(), idx_cont),
                                ('ohe',   OneHotEncoder(categories='auto', handle_unknown='ignore'),  idx_cat)
                            ])
            ),
            ('classifier', LogisticRegression(solver='liblinear'))
          ])

pipeline_benchmark.fit(X=X_train,y=y_train)

Pipeline(memory=None,
     steps=[('preproc_clf', ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('scaler', MinMaxScaler(copy=True, feature_range=(0, 1)), [0, 1, 2, 3, 4]), ('ohe', OneHotEncoder(categorical_features=None, categories='auto',
    ...ty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False))])

Simple comparison of the AUC as a sanity check.

In [17]:
print("AUC of model trained on training and synthetic data:",    roc_auc_score(y_true=y_test, y_score=pipeline.predict_proba(X_test)[:,1]))
print("AUC of model trained only on imbalanced training data :", roc_auc_score(y_true=y_test, y_score=pipeline_benchmark.predict_proba(X_test)[:,1]))

AUC of model trained on training and synthetic data: 0.5774791566963049
AUC of model trained only on imbalanced training data : 0.6300722903705196


Creating synthetic data manually

In [18]:
gan = pipeline.named_steps["sampler"]

In [19]:
syn_data = gan.generator.sample_data(num_samples=10000, class_index=1)

The GANbalancer creates continuous variables on the normalized scale (after the first preprocessing step in the pipeline). Categorical variable levels are sampled and returned as a single column of labels to match the original data.

In [20]:
syn_data.shape

(10000, 10)

In [21]:
syn_data = pipeline.named_steps["preproc_clf"].transform(syn_data)

In [22]:
syn_data.shape

(10000, 39)

In [23]:
syn_data.mean(axis=0)[:,0:10]

matrix([[-0.5809464 , -0.15651135,  0.7383057 ,  0.96522825,  1.085758  ,
          0.        ,  0.6354    ,  0.361     ,  0.0036    ,  0.        ]])

In [24]:
pipeline_benchmark.named_steps["preproc_clf"].transform(X_train).mean(axis=0)[:,0:10]

matrix([[0.01168117, 0.42046496, 0.07716191, 0.51613479, 0.21516262,
         0.01259734, 0.2478241 , 0.51855245, 0.18140174, 0.03481448]])
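The two outputs above compare synthetic rows of class 1 with the means of the whole training set. A like-for-like check would restrict the real data to the minority class before taking column means. The sketch below shows the idea on tiny made-up arrays (`real_X`, `real_y` and `synthetic` are hypothetical and not taken from COIL00).

```python
import numpy as np

real_X = np.array([[0.1, 0.0],
                   [0.2, 1.0],
                   [0.8, 1.0],
                   [0.9, 1.0]])
real_y = np.array([0, 0, 1, 1])
synthetic = np.array([[0.7, 1.0],
                      [1.0, 1.0]])     # stands in for generated class-1 rows

syn_mean = synthetic.mean(axis=0)
gap_all = np.abs(syn_mean - real_X.mean(axis=0))                    # vs. all rows
gap_minority = np.abs(syn_mean - real_X[real_y == 1].mean(axis=0))  # vs. class 1 only

print(gap_all, gap_minority)
```

In the notebook, the same filter would be `X_train[y_train == 1]` before applying the preprocessing step, assuming the arrays defined in the cells above.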