# Creating, Training and Evaluating Uplift Models

## Introduction

In this notebook, we demonstrate how to create, train, and evaluate uplift models and how to apply uplift modelling techniques.

- What is uplift modelling?

    It is a family of causal inference techniques that use machine learning models to estimate the causal impact of a treatment on an individual's behaviour. Individuals are commonly divided into four groups according to how they respond:

    - **Persuadables** respond positively only if they receive the treatment
    - **Sleeping dogs** have a strong negative response to the treatment
    - **Lost Causes** never reach the outcome, even with the treatment
    - **Sure Things** always reach the outcome, with or without the treatment

    The goal of uplift modelling is to identify the "persuadables", avoid wasting effort on "sure things" and "lost causes", and avoid bothering "sleeping dogs".

- How does uplift modelling work?
    - **Meta Learner**: predicts the difference between an individual's behaviour when there is a treatment and when there is no treatment

    - **Uplift Tree**: a tree-based algorithm where the splitting criterion is based on differences in uplift

    - **NN-based Model**: a neural network model that usually works with observational data

- Where can uplift modelling work?
    - Marketing: helps identify the persuadables who should receive a treatment such as a coupon or an online advertisement
    - Medical Treatment: helps understand how a treatment can affect certain groups differently
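
The four groups described above can be expressed in terms of two hypothetical outcomes per person: what would happen with the treatment and what would happen without it. The sketch below is illustrative only; the function `segment` and its argument names are not part of this notebook.

```python
def segment(outcome_if_treated: int, outcome_if_untreated: int) -> str:
    """Map a pair of hypothetical binary outcomes (1 = reaches the outcome) to a group name."""
    if outcome_if_treated and not outcome_if_untreated:
        return "Persuadable"  # responds only when treated
    if outcome_if_untreated and not outcome_if_treated:
        return "Sleeping dog"  # the treatment makes things worse
    if outcome_if_treated and outcome_if_untreated:
        return "Sure thing"  # responds either way
    return "Lost cause"  # never responds


for treated, untreated in [(1, 0), (0, 1), (1, 1), (0, 0)]:
    print(treated, untreated, segment(treated, untreated))
```

In real data only one of the two outcomes is ever observed for a given person, so these labels cannot be computed directly. That is why an uplift model estimates the difference instead.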
    


## Step 1: Load the Data

### Notebook Configurations

By defining the parameters below, we can easily apply this notebook to different datasets.

In [None]:
IS_CUSTOMER_DATA = False  # if True, dataset has to be uploaded manually by user
DATA_FOLDER = "Files/uplift-modelling"
DATA_FILE = "criteo-research-uplift-v2.1.csv"

# data schema
FEATURE_COLUMNS = [f"f{i}" for i in range(12)]
TREATMENT_COLUMN = "treatment"
LABEL_COLUMN = "visit"

### Import dependencies

In [None]:
import pyspark.sql.functions as F
from pyspark.sql.types import *

import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns

%matplotlib inline


from synapse.ml.featurize import Featurize
from synapse.ml.lightgbm import *
from synapse.ml.train import ComputeModelStatistics

import os
import gzip

### Download dataset and upload to lakehouse

- Dataset description: This dataset was created by the Criteo AI Lab. It consists of 13M rows, each representing a user with 12 features, a treatment indicator, and 2 binary labels (visit and conversion).
    - f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11: feature values (dense, float)
    - treatment: treatment group (1 = treated, 0 = control), indicating whether a customer was randomly targeted by advertising
    - conversion: whether a conversion occurred for this user (binary, label)
    - visit: whether a visit occurred for this user (binary, label)

- Dataset homepage: https://ailab.criteo.com/criteo-uplift-prediction-dataset/

- Citation:
    ```
    @inproceedings{Diemert2018,
    author = {{Diemert Eustache, Betlei Artem} and Renaudin, Christophe and Massih-Reza, Amini},
    title={A Large Scale Benchmark for Uplift Modeling},
    publisher = {ACM},
    booktitle = {Proceedings of the AdKDD and TargetAd Workshop, KDD, London, United Kingdom, August, 20, 2018},
    year = {2018}
    }
    ```

In [None]:
if not IS_CUSTOMER_DATA:
    # Download the demo data files into the lakehouse if they do not already exist
    remote_url = "http://go.criteo.net/criteo-research-uplift-v2.1.csv.gz"
    download_file = "criteo-research-uplift-v2.1.csv.gz"

    # First check whether the dataset files are already in the default lakehouse; if not, download them.
    import os
    import requests

    if not os.path.exists("/lakehouse/default"):
        # ask user to add a lakehouse if no default lakehouse added to the notebook.
        # a new notebook will not link to any lakehouse by default.
        raise FileNotFoundError(
            "Default lakehouse not found, please add a lakehouse for the notebook."
        )
    else:
        # check whether the needed files are already in the lakehouse, and try to download them if not.
        # print an error message if the download fails.
        os.makedirs(f"/lakehouse/default/{DATA_FOLDER}/raw/", exist_ok=True)

        if not os.path.exists(f"/lakehouse/default/{DATA_FOLDER}/raw/{DATA_FILE}"):
            try:
                r = requests.get(remote_url, timeout=30)
                r.raise_for_status()  # treat HTTP error responses as download failures
                with open(
                    f"/lakehouse/default/{DATA_FOLDER}/raw/{download_file}", "wb"
                ) as f:
                    f.write(r.content)
                print(f"Downloaded {download_file} into {DATA_FOLDER}/raw/.")

                with gzip.open(
                    f"/lakehouse/default/{DATA_FOLDER}/raw/{download_file}", "rb"
                ) as fin:
                    with open(
                        f"/lakehouse/default/{DATA_FOLDER}/raw/{DATA_FILE}", "wb"
                    ) as fout:
                        fout.write(fin.read())
                print(f"Unzipped {download_file} into {DATA_FOLDER}/raw/{DATA_FILE}.")
            except Exception as e:
                print(f"Failed to download {DATA_FILE}, error message: {e}")
        else:
            print(f"{DATA_FILE} already exists in {DATA_FOLDER}/raw/.")
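
The cell above reads the whole compressed file and the whole decompressed CSV into memory (`r.content` and `fin.read()`). With 13M rows this can be sizeable, so one possible alternative for the decompression step is to stream it in chunks. The helper name `gunzip_file` and the `chunk_size` default below are assumptions for illustration, not part of the original notebook.

```python
import gzip
import shutil


def gunzip_file(src_path, dst_path, chunk_size=1024 * 1024):
    """Decompress a .gz file to dst_path, copying chunk by chunk instead of all at once."""
    with gzip.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
```

In the notebook it could be called with the same two lakehouse paths that the `gzip.open` and `open` calls use.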

### Read data from lakehouse

In [None]:
raw_df = spark.read.csv(f"{DATA_FOLDER}/raw/{DATA_FILE}", header=True, inferSchema=True)
display(raw_df.limit(20))
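
Because the notebook is meant to be reused on other datasets through the configuration parameters, it can help to confirm that the loaded data actually contains the configured columns before going further. The following is a minimal sketch; `missing_columns` is a hypothetical helper, not something the notebook or Spark provides.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, keeping their original order."""
    available = set(available)
    return [name for name in required if name not in available]


# Hypothetical usage in this notebook:
#   required = FEATURE_COLUMNS + [TREATMENT_COLUMN, LABEL_COLUMN]
#   missing = missing_columns(raw_df.columns, required)
#   if missing:
#       raise ValueError(f"Columns missing from the dataset: {missing}")
```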

## Step 2: Data Preprocessing

### Data exploration

- **The overall rate of users that visit/convert**

In [None]:
raw_df.select(
    F.mean("visit").alias("Percentage of users that visit"),
    F.mean("conversion").alias("Percentage of users that convert"),
    (F.sum("conversion") / F.sum("visit")).alias("Percentage of visitors that convert"),
).show()

- **The overall average treatment effect on visit**

In [None]:
raw_df.groupby("treatment").agg(
    F.mean("visit").alias("Mean of visit"),
    F.sum("visit").alias("Sum of visit"),
    F.count("visit").alias("Count"),
).show()

- **The overall average treatment effect on conversion**

In [None]:
raw_df.groupby("treatment").agg(
    F.mean("conversion").alias("Mean of conversion"),
    F.sum("conversion").alias("Sum of conversion"),
    F.count("conversion").alias("Count"),
).show()
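
Since the treatment in this dataset was assigned randomly, the difference between the treated and control group means is an estimate of the average treatment effect. The tables above show the two means separately; the sketch below shows the subtraction on plain Python lists. The function name `average_treatment_effect` is an assumption made for this example.

```python
def average_treatment_effect(treatments, outcomes):
    """Difference between the mean outcome of treated rows and of control rows."""
    treated = [y for t, y in zip(treatments, outcomes) if t != 0]
    control = [y for t, y in zip(treatments, outcomes) if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


# Tiny made-up example: 4 treated users (2 respond), 4 control users (1 responds)
treatments = [1, 1, 1, 1, 0, 0, 0, 0]
outcomes = [1, 1, 0, 0, 1, 0, 0, 0]
print(average_treatment_effect(treatments, outcomes))  # 0.5 - 0.25 = 0.25
```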

### Split train-test dataset

In [None]:
transformer = (
    Featurize().setOutputCol("features").setInputCols(FEATURE_COLUMNS).fit(raw_df)
)

df = transformer.transform(raw_df)

In [None]:
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

print("Size of train dataset: %d" % train_df.count())
print("Size of test dataset: %d" % test_df.count())

train_df.groupby(TREATMENT_COLUMN).count().show()

### Split treatment-control dataset

In [None]:
treatment_train_df = train_df.where(f"{TREATMENT_COLUMN} > 0")
control_train_df = train_df.where(f"{TREATMENT_COLUMN} = 0")

## Step 3: Model Training and Evaluation

### Uplift Modelling: T-Learner with LightGBM

In [None]:
def train(train_df):
    classifier = (
        LightGBMClassifier()
        .setFeaturesCol("features")
        .setNumLeaves(10)
        .setNumIterations(100)
        .setObjective("binary")
        .setLabelCol(LABEL_COLUMN)
    )

    model = classifier.fit(train_df)

    return model


treatment_model = train(treatment_train_df)
control_model = train(control_train_df)
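
The two models trained above follow the T-learner pattern: one outcome model is fitted on the treated users and another on the control users. In notation introduced here for illustration (the symbols $\hat{\mu}_1$, $\hat{\mu}_0$ and $\hat{\tau}$ do not appear in the notebook), the estimated uplift for a user with features $x$ is:

```latex
\hat{\tau}(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x),
\qquad
\hat{\mu}_1(x) \approx P(Y = 1 \mid X = x,\ T = 1),
\qquad
\hat{\mu}_0(x) \approx P(Y = 1 \mid X = x,\ T = 0)
```

Here $\hat{\mu}_1$ corresponds to `treatment_model`, $\hat{\mu}_0$ to `control_model`, $Y$ to the label column, and $T$ to the treatment column.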

### Predict on test dataset

In [None]:
getPred = F.udf(lambda v: float(v[1]), FloatType())

test_pred_df = (
    treatment_model.transform(test_df)
    .withColumn("treatment_pred", getPred("probability"))
    .drop("rawPrediction", "probability", "prediction")
)

test_pred_df = (
    control_model.transform(test_pred_df)
    .withColumn("control_pred", getPred("probability"))
    .drop("rawPrediction", "probability", "prediction")
)

test_pred_df = test_pred_df.withColumn(
    "lift_pred", F.col("treatment_pred") - F.col("control_pred")
).select(TREATMENT_COLUMN, LABEL_COLUMN, "treatment_pred", "control_pred", "lift_pred")

display(test_pred_df.limit(20))
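
To see the same treatment-minus-control computation without Spark, here is a minimal, self-contained sketch. It replaces the LightGBM classifiers with a deliberately trivial base learner (the mean label per feature value), so it only illustrates the mechanics. The names `fit_mean_model` and `t_learner_uplift` and the toy data are assumptions for this example.

```python
from collections import defaultdict


def fit_mean_model(rows):
    """Toy base learner: the mean label for each feature value. rows holds (feature, label) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in rows:
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}


def t_learner_uplift(data):
    """data holds (feature, treatment, label) triples. Returns the estimated uplift per feature value."""
    treated = [(x, y) for x, t, y in data if t != 0]
    control = [(x, y) for x, t, y in data if t == 0]
    treatment_model = fit_mean_model(treated)  # fitted on treated rows only
    control_model = fit_mean_model(control)  # fitted on control rows only
    shared = treatment_model.keys() & control_model.keys()
    return {x: treatment_model[x] - control_model[x] for x in shared}


toy_data = [
    ("a", 1, 1), ("a", 1, 1), ("a", 0, 0), ("a", 0, 1),
    ("b", 1, 0), ("b", 1, 1), ("b", 0, 1), ("b", 0, 1),
]
print(t_learner_uplift(toy_data))
```

In this toy data, segment "a" responds better when treated (positive uplift) and segment "b" responds worse (negative uplift).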

### Model evaluation

Since the actual uplift cannot be observed for each individual, we measure the uplift over a group of customers.

- **Uplift Curve**: plots the real cumulative uplift across the population

In [None]:
test_pred_pandas_df = test_pred_df.toPandas()
test_pred_pandas_df.head()

First, we define some helper functions to plot the uplift curve.

In [None]:
def uplift_rank(uplift_df, treatment_col, label_col, uplift_col):
    # Rank the data by the uplift score
    ranked = pd.DataFrame(
        {"treatment": [], "label": [], "uplift_score": [], "ranked_uplift": []}
    )
    ranked["treatment"] = uplift_df[treatment_col]
    ranked["label"] = uplift_df[label_col]
    ranked["uplift_score"] = uplift_df[uplift_col]
    ranked["ranked_uplift"] = ranked.uplift_score.rank(pct=True, ascending=False)
    ranked = ranked.sort_values(by="ranked_uplift").reset_index(drop=True)
    return ranked


def uplift_eval(ranked):
    # Work on a copy so that the caller's dataframe is not modified
    uplift_df = ranked.copy()
    # Use the treatment and control groups to calculate the uplift (incremental gain)
    is_control = uplift_df.treatment == 0
    C, T = is_control.sum(), (~is_control).sum()
    # Cumulative response rate of each group, following the ranked order
    cr = uplift_df.label.where(is_control, 0).cumsum() / C
    tr = uplift_df.label.where(~is_control, 0).cumsum() / T
    # Calculate the uplift value and put it into the dataframe
    uplift_df["uplift"] = (tr - cr).round(5)

    # Add q0, the origin point of the curve
    q0 = pd.DataFrame({"ranked_uplift": 0, "uplift": 0, "treatment": None}, index=[0])
    uplift_df = pd.concat([q0, uplift_df]).reset_index(drop=True)

    return uplift_df


def uplift_plot(uplift_df):
    gain_x = uplift_df.ranked_uplift
    gain_y = uplift_df.uplift
    # plot the data
    plt.figure(figsize=(10, 6))
    mpl.rcParams["font.size"] = 8

    ax = plt.plot(gain_x, gain_y, color="#2077B4", label="Normalized Uplift Model")

    plt.plot(
        [0, gain_x.max()],
        [0, gain_y.max()],
        "--",
        color="tab:orange",
        label="Random Treatment",
    )
    plt.legend()
    plt.xlabel("Proportion Targeted")
    plt.ylabel("Uplift")
    # pass the flag positionally: the `b` keyword was renamed to `visible` in newer Matplotlib
    plt.grid(True, which="major")

    return ax
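
The calculation inside `uplift_eval` can be easier to follow on a handful of rows. The sketch below reproduces the same idea with plain Python lists: sort by predicted uplift, then track the cumulative treated response rate minus the cumulative control response rate. The function name `cumulative_uplift_curve` and the sample values are assumptions. Unlike the helper above, it does not round the values, and tied scores keep their input order.

```python
def cumulative_uplift_curve(scores, treatments, labels):
    """Return the uplift after targeting the top 0, 1, 2, ... users ranked by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_treated = sum(1 for t in treatments if t != 0)
    n_control = len(treatments) - n_treated
    cum_treated = cum_control = 0
    curve = [0.0]  # origin point, like q0 in uplift_eval
    for i in order:
        if treatments[i] != 0:
            cum_treated += labels[i]
        else:
            cum_control += labels[i]
        curve.append(cum_treated / n_treated - cum_control / n_control)
    return curve


scores = [0.9, 0.8, 0.2, 0.1]
treatments = [1, 0, 1, 0]
labels = [1, 0, 0, 1]
print(cumulative_uplift_curve(scores, treatments, labels))
```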

Now we can plot the uplift curve for the predictions on the test dataset.

In [None]:
ranked_df = uplift_rank(
    test_pred_pandas_df,
    treatment_col=TREATMENT_COLUMN,
    label_col=LABEL_COLUMN,
    uplift_col="lift_pred",
)
uplift_df = uplift_eval(ranked_df)
uplift_plot(uplift_df)

From the uplift curve above, we notice that the top 20% of the population, ranked by our prediction, would have a large gain if they were given the treatment, which means they are the **persuadables**. Therefore, we can print the cutoff score at the 20% mark to identify the target customers.

In [None]:
cutoff_percentage = 0.2
cutoff_score = ranked_df.iloc[int(len(ranked_df) * cutoff_percentage)]["uplift_score"]

print("Users with an uplift score higher than {:.4f} are Persuadables".format(cutoff_score))
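
Once a cutoff score is known, it can be applied to any list of predicted uplift scores to pick the users to target. This is a minimal sketch on plain Python lists; `select_targets` and the sample scores are hypothetical and not part of the notebook.

```python
def select_targets(uplift_scores, cutoff_score):
    """Return the positions of the users whose predicted uplift is above the cutoff."""
    return [i for i, score in enumerate(uplift_scores) if score > cutoff_score]


sample_scores = [0.30, 0.05, 0.12, -0.10]
print(select_targets(sample_scores, cutoff_score=0.10))
```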