# Driverless AI: Handling Imbalanced Data

This notebook provides an example of using Driverless AI to train a model to predict an imbalanced target. The data used in this notebook is the [UCI Bank Marketing Dataset](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing). The goal is to predict whether a marketing call results in the client subscribing to a bank term deposit. Only about 4.6% of clients in the data used here end up subscribing.

When a dataset has an imbalanced target column, over-sampling the minority class or under-sampling the majority class can improve model performance. Driverless AI automatically under-samples the majority class only if the data is considered both large and imbalanced (see: http://docs.h2o.ai/driverless-ai/latest-stable/docs/userguide/data-sampling.html). In other words, it does not perform any imbalanced sampling unless the dataset needs to be sampled down because of its large size.
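
As a rough illustration of what under-sampling the majority class means, the sketch below randomly keeps only as many majority rows as there are minority rows. It is not Driverless AI's sampling logic, which the linked documentation describes; the function name `undersample_majority` and the fixed seed are assumptions made for this example, and the class counts are the ones this notebook reports for the `y` column.

```python
import random
from collections import Counter

def undersample_majority(labels, seed=0):
    """Return sorted row indices with every class cut down to the minority size."""
    counts = Counter(labels)
    minority_size = min(counts.values())
    rng = random.Random(seed)
    kept = []
    for cls in counts:
        idx = [i for i, label in enumerate(labels) if label == cls]
        # Larger classes are randomly sampled down; the minority class is kept whole.
        kept.extend(rng.sample(idx, minority_size))
    return sorted(kept)

# Class counts as reported by value_counts() in this notebook.
labels = ["no"] * 19086 + ["yes"] * 913
kept = undersample_majority(labels)
balanced = Counter(labels[i] for i in kept)
print(balanced)
```

Under-sampling discards most of the majority rows, which is one reason a weight column can be attractive on a dataset of this size: it changes the emphasis without throwing data away.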

If you want to increase the weight, or importance, of a particular class with a high level of control, you can add a weight column. A weight column tells the Driverless AI models how important each row is when calculating the error of the model. If one row has a weight of 2 and all other rows have a weight of 1, the row with a weight of 2 is considered twice as important. The weight column can therefore act as a way of over-sampling the minority class. For example, a weight column that is 10 for all of our subscription clients and 1 for everyone else is essentially the same as copying each subscription client 10 times, that is, over-sampling the minority class 10 times.
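
The equivalence between a row weight and a duplicated row can be checked with a small sketch. Log loss is used here only as an illustrative error measure; the text above does not say which loss Driverless AI applies the weights to, and the function name and toy predictions are assumptions for this example.

```python
import math

def weighted_log_loss(y_true, p_pred, weights):
    """Weighted mean of the per-row log loss."""
    total = 0.0
    for y, p, w in zip(y_true, p_pred, weights):
        row_loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += w * row_loss
    return total / sum(weights)

# Toy labels and predicted probabilities (made up for illustration).
y = [1, 0, 0, 0]
p = [0.3, 0.2, 0.1, 0.4]

# Option A: give the single positive row a weight of 2.
loss_weighted = weighted_log_loss(y, p, [2, 1, 1, 1])

# Option B: physically duplicate the positive row and weight every row equally.
loss_duplicated = weighted_log_loss(y + [1], p + [0.3], [1, 1, 1, 1, 1])

print(math.isclose(loss_weighted, loss_duplicated))
```

Both options give the same error value, which is why a weight of 2 on a row behaves like including that row twice.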

The goal of this notebook is to programmatically figure out the best weight column by evaluating how an experiment with a specific weight column performs on our test data.  The test data will not be given a weight column so we are comparing the un-weighted performance metrics on the test data.


## Notes

* This is an early release of the Driverless AI Python client.
* Python 3.6 is the only supported version.
* You must install the `h2oai_client` wheel into your local Python environment. The wheel is available from the RESOURCES link in the top menu of the UI.

![py-client](images/py_client_link.png)

## Workflow Steps

1. Sign in
2. Import Data and Decide on Weights to Use
3. Launch Driverless AI Experiments for Different Weights (over-weighting the minority class) and Link to Project
4. Compare Experiments in the Project

## 1. Sign In

Import the required modules and log in.

Instantiate the `Client` class with the address of your Driverless AI server and your login credentials. The client creates an authentication token that is sent to the Driverless AI server, in the same way that signing in through the Driverless AI web page does.

In [1]:
from h2oai_client import Client

In [2]:
address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
h2oai = Client(address = address, username = username, password = password)
# Use the same username and password that you use to sign in through the GUI.
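
If you would rather not hard-code credentials in the notebook, one option is to read them from environment variables. The sketch below is only a suggestion: the variable names `DAI_ADDRESS`, `DAI_USERNAME`, and `DAI_PASSWORD` and the helper `load_dai_credentials` are hypothetical and are not defined by Driverless AI. The `Client` call in the trailing comment uses the same keyword arguments as the cell above.

```python
import os

def load_dai_credentials(env=os.environ):
    """Collect Client keyword arguments from environment variables.

    The variable names below are hypothetical, chosen for this example.
    """
    names = ("DAI_ADDRESS", "DAI_USERNAME", "DAI_PASSWORD")
    missing = [name for name in names if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "address": env["DAI_ADDRESS"],
        "username": env["DAI_USERNAME"],
        "password": env["DAI_PASSWORD"],
    }

# Usage, assuming the variables are set and h2oai_client is installed:
# from h2oai_client import Client
# h2oai = Client(**load_dai_credentials())
```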

## 2. Import Data and Determine Weights

We will start by using Pandas to import the data into Python.  We will then evaluate how imbalanced the data is and use that information to guide us on what weights to use for our minority class.

In [3]:
import pandas as pd
df_pd = pd.read_csv("./bank-additional-full.csv", sep=";")

In [4]:
df_pd["y"].value_counts()

no     19086
yes      913
Name: y, dtype: int64

We can see that only about 4.6% of clients (913 of 19,999) end up subscribing. For the two classes to be evenly distributed, the `yes` class would need to occur about 20 times as often. Therefore, we will evaluate the following weights:

* weight = 1: no re-weighting (this is our baseline)
* weight = 2: equivalent to doubling the `yes` class
* weight = 5: equivalent to duplicating the `yes` class 5 times
* weight = 10: equivalent to duplicating the `yes` class 10 times
* weight = 20: equivalent to duplicating the `yes` class 20 times (this would mean both `yes` and `no` occur about the same number of times in the dataset)
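
The arithmetic behind these choices can be sketched directly from the class counts printed above. The variable names are made up for this example; "effective share" here simply means the fraction of total weight carried by the `yes` rows.

```python
# Class counts from the value_counts() output above.
n_no, n_yes = 19086, 913

print(f"share of yes: {n_yes / (n_no + n_yes):.3f}")
print(f"no-to-yes ratio: {n_no / n_yes:.1f}")

# Effective share of the yes class once each yes row counts `weight` times.
effective_share = {}
for weight in [1, 2, 5, 10, 20]:
    weighted_yes = weight * n_yes
    effective_share[weight] = weighted_yes / (weighted_yes + n_no)
    print(f"weight={weight:>2}: effective yes share = {effective_share[weight]:.3f}")
```

With a weight of 20, the `yes` rows carry just under half of the total weight, which is why that setting is described as roughly balancing the two classes.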

## 3. Launch Driverless AI Experiments for Different Weights

We will use a for loop to train an experiment for each weight setting.  We will then calculate how well this experiment does on our un-weighted test data.  The idea is that by weighting the minority class more highly, perhaps we will create a model that performs better overall on our un-weighted test data. 

We will first split the data into train and test and then create a project that will contain all our experiments for this use case.

In [5]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df_pd, test_size=0.2, random_state=1234)

In [6]:
# Make a Project
project_key = h2oai.create_project("campaign_weights", "effects of weighting for our campaign use case")

Now we will upload the test dataset.  The test dataset will not have a weight column because we want to see how the Driverless AI experiment performs on un-weighted hold out data.

In [7]:
# Upload Test Data
# The test data does not get a weight column
test.to_csv("./test.csv", index=False)
test_path = './test.csv'
test_dai = h2oai.upload_dataset_sync(test_path)

# Link Test Data to Project
h2oai.link_dataset_to_project(project_key = project_key, dataset_key = test_dai.key, dataset_type = "Testing")

Our for loop will: 

1. Add a weight column to the training dataset (over-weighting the minority class)
2. Link the training data to our project
3. Launch an experiment with the weighted training data
4. Save the unweighted test MCC to a dataframe for later comparison.

Note: For demo purposes, the accuracy and time settings of these experiments are set to 1. Consider increasing these values for your use case.

In [8]:
# Run experiments with different weight columns
weights = [1, 2, 5, 10, 20]
results = pd.DataFrame()
for weight in weights:
    # Add weight column to train
    weighted_train = train.copy()
    weighted_train["weight"] = 1
    weighted_train.loc[weighted_train["y"] == "yes", "weight"] = weight
    weighted_train.to_csv("./campaign_train-weighted.csv", index=False)
    train_weighted_dai = h2oai.upload_dataset_sync("./campaign_train-weighted.csv")
    
    # Link Train Data to Project
    h2oai.link_dataset_to_project(project_key = project_key, 
                                  dataset_key = train_weighted_dai.key, 
                                  dataset_type = "Training")
    
    # Launch Experiment
    experiment = h2oai.start_experiment_sync(
        dataset_key=train_weighted_dai.key,
        testset_key=test_dai.key,
        target_col="y",
        is_classification=True,
        accuracy=1,
        time=1,
        interpretability=1,
        scorer="MCC",
        seed=1234,
        weight_col="weight",
    )
    
    # Update Experiment Description
    h2oai.update_model_description(experiment.key, "weight: " + str(weight))
    
    # Link Experiment to Project
    h2oai.link_experiment_to_project(project_key = project_key, 
                                     experiment_key = experiment.key)
    
    # Save the un-weighted Test MCC to our results frame
    exp_results = pd.DataFrame([{'weight': weight, 'test_score': experiment.test_score}])
    results = pd.concat([results, exp_results], axis = 0)
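
The experiments above are scored with MCC, the Matthews correlation coefficient, which is often preferred over plain accuracy on imbalanced targets. The sketch below shows the standard formula computed from confusion-matrix counts; it is not Driverless AI's scorer implementation, and the counts are hypothetical values chosen for illustration, not results from this notebook.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical counts for illustration only.
always_no = mcc(tp=0, fp=0, fn=50, tn=950)       # a model that never predicts "yes"
some_signal = mcc(tp=30, fp=20, fn=20, tn=930)   # a model that finds some "yes" rows
print(always_no, round(some_signal, 3))
```

A model that always predicts the majority class reaches 95% accuracy on these made-up counts but gets an MCC of 0, which is the behavior that makes MCC a useful comparison metric here.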

## 4. Compare Experiments in the Project

The table below shows the un-weighted MCC on the test data for the different weights we supplied. The highest MCC (0.557) occurs with a weight of 2, that is, when the minority class is treated as twice as important; the baseline without re-weighting scores 0.551.

In [9]:
results

|   | test_score | weight |
|---|------------|--------|
| 0 | 0.550519   | 1      |
| 0 | 0.557476   | 2      |
| 0 | 0.537705   | 5      |
| 0 | 0.544846   | 10     |
| 0 | 0.532728   | 20     |
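
To pick the best weight programmatically, the scores can be ranked and compared against the baseline. The sketch below copies the test MCC values reported above into a plain dictionary so that it runs on its own; in the notebook the same values live in the `results` frame, and the variable names here are assumptions for this example.

```python
# Test MCC per weight, copied from the results reported above.
test_mcc = {1: 0.550519, 2: 0.557476, 5: 0.537705, 10: 0.544846, 20: 0.532728}

baseline = test_mcc[1]
best_weight = max(test_mcc, key=test_mcc.get)

# Rank the weights by test MCC and show each one's change against the baseline.
for weight, score in sorted(test_mcc.items(), key=lambda item: item[1], reverse=True):
    print(f"weight={weight:>2}  test MCC={score:.6f}  vs. baseline={score - baseline:+.6f}")

print("best weight:", best_weight)
```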


### Results in Project

We can also use the Projects page in the UI to compare performance. Below we are comparing the experiments by their Test MCC.

![](./images/weighted_project.png)


We can also compare three experiments at a time side by side. Here we are comparing the top three experiments based on their Test MCC and viewing the differences in their experiment summary and variable importance.


![](./images/compare_weighted_experiments.png)