<a href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/ucc-minority-class-boosting/docs/notebooks/boost_minority_class.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Minority class boosting using Synthetic Data

This notebook illustrates how to (a) train a Gretel ACTGAN model on a dataset that has only a few instances of the minority class, and (b) conditionally generate additional minority samples that you can use to augment the original dataset, e.g., with the goal of improving a downstream ML task.

If the data is highly imbalanced, we suggest resampling the minority class prior to synthetic model training, and we show how this can benefit the quality of the generated synthetic data. We report the Synthetic Quality Score (SQS) and visually inspect the samples. Note that Gretel's Privacy Filters can give synthetic samples privacy protection that other resampling techniques lack; in this notebook, the filters are turned off for maximum accuracy.

## 1. Dataset and data processing

The data we are using here contains transactions made by credit cards in September 2013 by European cardholders. The dataset is a subset of the [Kaggle CreditCardFraud](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) dataset.

This subset was constructed by selecting 25,000 financial records: the non-fraud transactions were downsampled, and all 492 fraud transactions from the original dataset were kept. The dataset is hence highly imbalanced, i.e., the positive class (fraud transactions) accounts for 1.97% of all transactions.
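
The imbalance figure follows directly from the counts above. A quick check, assuming that 25,000 is the total size of the subset (which matches the stated percentage); the variable names are illustrative only:

```python
# Counts stated in the text above
total_records = 25_000
fraud_records = 492

fraud_ratio = fraud_records / total_records
print(f"fraud share: {fraud_ratio:.2%}")
```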


In [None]:
#@title Define dataset specific settings

# path to dataset
DATASET_PATH = 'https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/creditcard_kaggle_25k.csv.zip'
# column name containing class labels
TARGET_COLUMN='Class' #@param {type:"string"}

# minority/majority class label value             
MAJORITY_CLASS_VALUE=0 #@param {type:"integer"}
MINORITY_CLASS_VALUE=1 #@param {type:"integer"}


In [None]:
#@title Load training data and prepare for conditional data generation

import pandas as pd

data_source = pd.read_csv(DATASET_PATH)

# A requirement for the Gretel ACTGAN to conditionally sample from a column is to have the values in categorical/string format.
data_source[TARGET_COLUMN] = data_source[TARGET_COLUMN].replace(
      [MAJORITY_CLASS_VALUE, MINORITY_CLASS_VALUE],
      ['negative', 'positive']
    )

# show class imbalance in original dataset
print("target class ratio of original dataset")
print(data_source[TARGET_COLUMN].value_counts(normalize=True))

As can be seen, the class imbalance in the target labels is quite high in this dataset. In such cases, we suggest upsampling the positive class to give the Gretel ACTGAN enough examples from both classes to train on.
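
To make the idea concrete, here is a minimal sketch of upsampling on a toy table. It uses pandas' `DataFrame.sample` with replacement rather than scikit-learn's `resample`; the toy data, column names, and counts are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the real data: 95 negative rows and 5 positive rows
toy = pd.DataFrame({
    "Amount": range(100),
    "Class": ["negative"] * 95 + ["positive"] * 5,
})

majority = toy[toy["Class"] == "negative"]
minority = toy[toy["Class"] == "positive"]

# Draw enough extra minority rows (with replacement) to match the majority count
extra = minority.sample(
    n=len(majority) - len(minority), replace=True, random_state=0
)
balanced = pd.concat([toy, extra], ignore_index=True)

counts = balanced["Class"].value_counts()
print(counts)
```

Because the extra rows are exact copies of existing minority rows, this adds no new information; it only changes how often the model sees each class during training.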

In [None]:
#@title [Optional] Resampling the minority class to establish class balance prior to synthetic model training

UPSAMPLE_MINORITY_CLASS=True #@param {type:"boolean"}

# In this notebook, we use a simple resampling strategy. Other upsampling methods can be explored as well.
from sklearn.utils import resample

majority_samples = data_source[data_source[TARGET_COLUMN] == 'negative']
minority_samples = data_source[data_source[TARGET_COLUMN] == 'positive']

if UPSAMPLE_MINORITY_CLASS:

  minority_samples_resampled = resample(
        minority_samples, 
        replace=True, 
        n_samples=len(majority_samples)-len(minority_samples)
      )
  data_source = pd.concat([data_source, minority_samples_resampled])

  # show balance
  print("target class ratio after resampling")
  print(data_source[TARGET_COLUMN].value_counts(normalize=True))

## 2. Train Gretel ACTGAN model

In [None]:
#@title Install the gretel-client
%%capture
!pip install -U gretel-client

In [None]:
#@title Import the libraries and configure the session
from gretel_client import configure_session
from gretel_client.projects.models import read_model_config
from gretel_client.projects import create_or_get_unique_project
from gretel_client.helpers import poll

configure_session(api_key="prompt", cache="yes", validate=True)

In [None]:
#@title Create (or retrieve) the Gretel project.

# Gretel project name
GRETEL_PROJECT_NAME = 'boost-minority-class-example' #@param {type:"string"}

project = create_or_get_unique_project(name=GRETEL_PROJECT_NAME)

In [None]:
#@title Load and modify Gretel ACTGAN config
config = read_model_config("synthetics/tabular-actgan")

# We set the number of epochs to 200
training_epochs = 200 #@param {type:"integer"}
config['models'][0]['actgan']['params']['epochs'] = training_epochs

# Turn off privacy filters for maximum accuracy.
config["models"][0]['actgan']['privacy_filters']["outliers"] = None
config["models"][0]['actgan']['privacy_filters']["similarity"] = None
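
The overrides above edit a nested dictionary in place. As a rough sketch of the same idea that leaves the loaded template untouched, the snippet below works on a deep copy. `example_config` is a hypothetical stand-in that shows only the keys used in this notebook, with made-up values; `with_overrides` is an illustrative helper, not part of the Gretel client.

```python
import copy

# Hypothetical stand-in for the dictionary returned by read_model_config;
# only the keys touched in this notebook are shown, and the values are made up.
example_config = {
    "models": [
        {
            "actgan": {
                "params": {"epochs": "auto"},
                "privacy_filters": {"outliers": "auto", "similarity": "auto"},
            }
        }
    ]
}


def with_overrides(config, epochs):
    """Return a modified copy so the loaded template stays untouched."""
    new_config = copy.deepcopy(config)
    actgan = new_config["models"][0]["actgan"]
    actgan["params"]["epochs"] = epochs
    actgan["privacy_filters"]["outliers"] = None
    actgan["privacy_filters"]["similarity"] = None
    return new_config


custom = with_overrides(example_config, epochs=200)
print(custom["models"][0]["actgan"])
```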

In [None]:
#@title Train the Gretel ACTGAN model.

# Train the model on our training data set
model = project.create_model_obj(model_config=config, data_source=data_source)
model.submit_cloud()
poll(model)

In [None]:
#@title Inspect the Synthetic Quality Score (SQS)

report_summary = model.get_report_summary()['summary']
df = pd.DataFrame(report_summary, columns=['field','value'])

# Print SQS
print("Synthetic Data Quality Report summary")
display(df)

As the Synthetic Data Quality report shows, we are getting a synthetic quality score of ~75. If you had run the notebook without upsampling the original minority samples (by setting `UPSAMPLE_MINORITY_CLASS = False`), the SQS would have been much lower.
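
If you want to compare runs with and without upsampling, it can help to pull a single value out of the report summary. The sketch below assumes the summary is a list of dictionaries with `field` and `value` keys, as the DataFrame construction above suggests; the field names and numbers in `example_summary` are made up, and `summary_value` is an illustrative helper.

```python
def summary_value(report_summary, field_name):
    """Return the value stored for field_name, or None if it is absent."""
    for entry in report_summary:
        if entry.get("field") == field_name:
            return entry.get("value")
    return None


# Made-up entries, only to illustrate the lookup
example_summary = [
    {"field": "example_quality_score", "value": 75},
    {"field": "example_other_metric", "value": 80},
]

print(summary_value(example_summary, "example_quality_score"))
```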

Next, we will generate samples from the model and visualize the synthetic data against the original data samples.

## 3. Generate synthetic samples of the minority data

In this section, we conditionally synthesize minority class samples using the trained Gretel ACTGAN model.
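
Conditional generation is driven by a seed table: one row per requested record, holding the desired value of the conditioning column. A minimal sketch of building such a table follows; `make_seeds` is an illustrative helper, and the default column name `Class` is assumed to match the target column used in this notebook.

```python
import pandas as pd


def make_seeds(label, n, column="Class"):
    """Build a one-column DataFrame with n copies of label."""
    return pd.DataFrame({column: [label] * n})


example_seeds = make_seeds("positive", 3)
print(example_seeds)
```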

In [None]:
#@title Specify the number of minority samples you want to generate

GENERATED_MINORITY_SAMPLES = 500 #@param {type:"integer"}

In [None]:
#@title Conditionally generate minority class samples with the ACTGAN model.

seeds = pd.DataFrame(data=['positive'] * GENERATED_MINORITY_SAMPLES, columns=[TARGET_COLUMN])

rh = model.create_record_handler_obj(
    data_source=seeds, params={"num_records": len(seeds)}
)
rh.submit_cloud()
poll(rh)

In [None]:
#@title Fetch the synthetic data samples.

print(rh.record_id + ' is complete. Fetching the synthetic data.')
# download the generated synthetic positive samples
synth_data = pd.read_csv(rh.get_artifact_link("data"), compression="gzip")
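
One possible next step, mentioned in the introduction, is to augment the original (not upsampled) training data with the synthetic positives before training a downstream classifier. The sketch below uses small made-up tables in place of the real data and assumes the original labels were the integers 0 and 1, as configured at the top of this notebook.

```python
import pandas as pd

# Toy stand-ins for the original training data and the generated samples
original = pd.DataFrame({"Amount": [10.0, 25.0, 40.0], "Class": [0, 0, 1]})
synthetic = pd.DataFrame(
    {"Amount": [38.5, 42.1], "Class": ["positive", "positive"]}
)

# Map the string labels used for conditional generation back to integers
label_map = {"negative": 0, "positive": 1}
synthetic_numeric = synthetic.assign(Class=synthetic["Class"].map(label_map))

augmented = pd.concat([original, synthetic_numeric], ignore_index=True)
print(augmented["Class"].value_counts())
```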

## 4. Visualize the synthetic samples against the original data

In [None]:
#@title Principal Component Analysis

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition

def create_plot(data_source, majority_samples, minority_samples, synth_data, TARGET_COLUMN):
    scaler = StandardScaler()
    pca = decomposition.PCA(n_components = 2)

    features = data_source.drop(columns=[TARGET_COLUMN]).columns
    df = pd.concat([majority_samples[features], minority_samples[features]])

    # normalize and compute PCA on training data
    X = df.iloc[:, 1:-1]
    scaler.fit(X)
    x_std = scaler.transform(X)
    pca.fit(x_std)

    # work on a copy so the caller's synth_data keeps its original labels
    minority_upsampled = synth_data.copy()
    minority_upsampled[TARGET_COLUMN] = minority_upsampled[TARGET_COLUMN].replace('positive', 'synthetic positive')

    df = pd.concat([majority_samples[features], minority_samples[features], minority_upsampled[features],])
    df_lbl = pd.concat([majority_samples[TARGET_COLUMN], minority_samples[TARGET_COLUMN], minority_upsampled[TARGET_COLUMN],])
    X = df.iloc[:, 1:-1]
    x_std = scaler.transform(X)
    pca_data = pca.transform(x_std)
    # keep the PCA coordinates numeric and attach the labels as a separate column
    pca_df = pd.DataFrame(data=pca_data, columns=["X", "Y"])
    pca_df["labels"] = df_lbl.to_numpy()

    sns.FacetGrid(pca_df, hue="labels", height=6).map(plt.scatter, 'X', 'Y').add_legend().set(title="Original samples vs. samples generated with Gretel ACTGAN")
    plt.grid()

    plt.show()

# build the PCA chart
create_plot(data_source, majority_samples, minority_samples, synth_data, TARGET_COLUMN)


As the plot shows, the Gretel ACTGAN model produces minority samples that are close to the positive minority samples of the original dataset, i.e., the synthetic minority class samples (green) overlap well with the original minority class samples (orange).
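
A rough numeric complement to the visual check is to compare centroid distances in the same PCA space. The sketch below is only an illustration under stated assumptions: it uses made-up Gaussian clusters instead of the real data, and it computes the standardization and PCA with plain NumPy (via SVD) rather than scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy clusters standing in for the three groups in the plot
majority = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
minority = rng.normal(loc=4.0, scale=1.0, size=(20, 5))
synthetic = rng.normal(loc=4.0, scale=1.0, size=(50, 5))

# Fit standardization and a 2-component PCA on the original data only
original = np.vstack([majority, minority])
mean, std = original.mean(axis=0), original.std(axis=0)
_, _, vt = np.linalg.svd((original - mean) / std, full_matrices=False)
components = vt[:2]


def project(x):
    """Standardize x with the original statistics and project onto the PCs."""
    return ((x - mean) / std) @ components.T


synthetic_centroid = project(synthetic).mean(axis=0)
dist_to_minority = np.linalg.norm(synthetic_centroid - project(minority).mean(axis=0))
dist_to_majority = np.linalg.norm(synthetic_centroid - project(majority).mean(axis=0))

print(f"distance to minority centroid: {dist_to_minority:.2f}")
print(f"distance to majority centroid: {dist_to_majority:.2f}")
```

A synthetic centroid that sits much closer to the original minority centroid than to the majority centroid is consistent with the overlap seen in the plot, although centroids alone say nothing about the spread of the samples.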

*Note: If only a limited number of minority samples are available in the original training data, we suggest resampling the minority class as a way to "help" the Gretel ACTGAN model produce more meaningful examples.*