# Criteo Uplift Modeling Dataset

This dataset is constructed by assembling data resulting from several incrementality tests, a particular randomized trial procedure where a random part of the population is prevented from being targeted by advertising.

This dataset is released along with the paper "A Large Scale Benchmark for Uplift Modeling" by Eustache Diemert, Artem Betlei, Christophe Renaudin (Criteo AI Lab), and Massih-Reza Amini (LIG, Grenoble INP). The work was published at the AdKDD 2018 Workshop, held in conjunction with KDD 2018.

#### Major columns:

- **f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11**: feature values (dense, float)

- **treatment**: treatment group. Flag if a company participates in the RTB auction for a particular user (binary: 1 = treated, 0 = control)

- **exposure**: whether the user was effectively exposed to the treatment. Flag if a company wins the RTB auction for the user (binary)

- **conversion**: whether a conversion occurred for this user (binary, label)

- **visit**: whether a visit occurred for this user (binary, label)

In [None]:
import sys

# install the uplift modeling library scikit-uplift, plus dill and catboost
!{sys.executable} -m pip install scikit-uplift dill catboost

## 📝 Load data

The dataset can be loaded from the `sklift.datasets` module using the `fetch_criteo` function.

In [3]:
from sklift.datasets import fetch_criteo

# returns an sklearn Bunch object with the keys data, target and treatment:
# data holds the features (pd.DataFrame),
# target and treatment hold the corresponding values (pd.Series)
dataset = fetch_criteo()

In [4]:
print(f"Dataset type: {type(dataset)}\n")
print(f"Dataset features shape: {dataset.data.shape}")
print(f"Dataset target shape: {dataset.target.shape}")
print(f"Dataset treatment shape: {dataset.treatment.shape}")

Dataset type: <class 'sklearn.utils.Bunch'>

Dataset features shape: (13979592, 12)
Dataset target shape: (13979592,)
Dataset treatment shape: (13979592,)


We can load only 10 percent of the data by passing the parameter `percent10=True`:

In [5]:
dataset10 = fetch_criteo(percent10=True)

In [6]:
print(f"Dataset features shape: {dataset10.data.shape}")
print(f"Dataset target shape: {dataset10.target.shape}")
print(f"Dataset treatment shape: {dataset10.treatment.shape}")

Dataset features shape: (1397960, 12)
Dataset target shape: (1397960,)
Dataset treatment shape: (1397960,)


## 📝 EDA

In [7]:
import pandas as pd  # DataFrame.append was removed in pandas 2.0, so use pd.concat

pd.concat([dataset.data.head(), dataset.data.tail()])

| | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.616365 | 10.059654 | 8.976429 | 4.679882 | 10.280525 | 4.115453 | 0.294443 | 4.833815 | 3.955396 | 13.190056 | 5.300375 | -0.168679 |
| 1 | 12.616365 | 10.059654 | 9.002689 | 4.679882 | 10.280525 | 4.115453 | 0.294443 | 4.833815 | 3.955396 | 13.190056 | 5.300375 | -0.168679 |
| 2 | 12.616365 | 10.059654 | 8.964775 | 4.679882 | 10.280525 | 4.115453 | 0.294443 | 4.833815 | 3.955396 | 13.190056 | 5.300375 | -0.168679 |
| 3 | 12.616365 | 10.059654 | 9.002801 | 4.679882 | 10.280525 | 4.115453 | 0.294443 | 4.833815 | 3.955396 | 13.190056 | 5.300375 | -0.168679 |
| 4 | 12.616365 | 10.059654 | 9.037999 | 4.679882 | 10.280525 | 4.115453 | 0.294443 | 4.833815 | 3.955396 | 13.190056 | 5.300375 | -0.168679 |
| 13979587 | 26.297764 | 10.059654 | 9.00625 | 4.679882 | 10.280525 | 4.115453 | -3.282109 | 4.833815 | 3.839578 | 13.190056 | 5.300375 | -0.168679 |
| 13979588 | 12.642207 | 10.679513 | 8.214383 | -1.700105 | 10.280525 | 3.013064 | -13.95515 | 6.269026 | 3.971858 | 13.190056 | 5.300375 | -0.168679 |
| 13979589 | 12.976557 | 10.059654 | 8.381868 | 0.842442 | 11.029584 | 4.115453 | -8.281971 | 4.833815 | 3.779212 | 23.570168 | 6.169187 | -0.168679 |
| 13979590 | 24.805064 | 10.059654 | 8.214383 | 4.679882 | 10.280525 | 4.115453 | -1.288207 | 4.833815 | 3.971858 | 13.190056 | 5.300375 | -0.168679 |
| 13979591 | 12.616365 | 10.059654 | 8.214383 | 4.679882 | 10.280525 | 3.013064 | 0.294443 | 9.332563 | 3.971858 | 13.190056 | 5.300375 | -0.168679 |


In [17]:
dataset.data.describe()

| | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | f11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 | 13979590.0 |
| mean | 19.6203 | 10.06998 | 8.446582 | 4.178923 | 10.33884 | 4.028513 | -4.155356 | 5.101765 | 3.933581 | 16.02764 | 5.333396 | -0.1709672 |
| std | 5.377464 | 0.1047557 | 0.2993161 | 1.336645 | 0.3433081 | 0.4310974 | 4.577914 | 1.205248 | 0.05665958 | 7.018975 | 0.1682288 | 0.02283277 |
| min | 12.61636 | 10.05965 | 8.214383 | -8.398387 | 10.28053 | -9.011892 | -31.42978 | 4.833815 | 3.635107 | 13.19006 | 5.300375 | -1.383941 |
| 25% | 12.61636 | 10.05965 | 8.214383 | 4.679882 | 10.28053 | 4.115453 | -6.699321 | 4.833815 | 3.910792 | 13.19006 | 5.300375 | -0.1686792 |
| 50% | 21.92341 | 10.05965 | 8.214383 | 4.679882 | 10.28053 | 4.115453 | -2.411115 | 4.833815 | 3.971858 | 13.19006 | 5.300375 | -0.1686792 |
| 75% | 24.43646 | 10.05965 | 8.723335 | 4.679882 | 10.28053 | 4.115453 | 0.2944427 | 4.833815 | 3.971858 | 13.19006 | 5.300375 | -0.1686792 |
| max | 26.74526 | 16.34419 | 9.051962 | 4.679882 | 21.12351 | 4.115453 | 0.2944427 | 11.9984 | 3.971858 | 75.29502 | 6.473917 | -0.1686792 |


In [20]:
dataset.data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13979592 entries, 0 to 13979591
Data columns (total 12 columns):
 #   Column  Dtype  
---  ------  -----  
 0   f0      float64
 1   f1      float64
 2   f2      float64
 3   f3      float64
 4   f4      float64
 5   f5      float64
 6   f6      float64
 7   f7      float64
 8   f8      float64
 9   f9      float64
 10  f10     float64
 11  f11     float64
dtypes: float64(12)
memory usage: 1.2 GB


There are no missing values in the data, as the following check confirms.

In [22]:
print('Number NA:', dataset.data.isna().sum().sum())

Number NA: 0


### 🤔 Target share for `treatment / control`

In [10]:
dataset.treatment.value_counts()

1    11882655
0     2096937
Name: treatment, dtype: Int64
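
These counts imply an unbalanced split between the treated and control groups. A minimal sketch of the arithmetic, using the two counts copied from the output above (the names `counts` and `shares` are illustrative and not part of the original notebook):

```python
import pandas as pd

# Group sizes copied from the value_counts() output above.
counts = pd.Series({1: 11882655, 0: 2096937}, name="treatment")

# Share of each group in the full dataset.
shares = counts / counts.sum()
print(shares.round(4))
```

Roughly 85% of users are in the treated group and 15% in the control group. On the loaded data, `dataset.treatment.value_counts(normalize=True)` should give the same shares directly.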

In [11]:
dataset.target.value_counts()

0    13322663
1      656929
Name: visit, dtype: Int64

In [8]:
import pandas as pd 

pd.crosstab(dataset.treatment, dataset.target, normalize='index')

| treatment | visit = 0 | visit = 1 |
|---|---|---|
| 0 | 0.961799 | 0.038201 |
| 1 | 0.951457 | 0.048543 |
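
Because each row is normalized, the table gives the visit rate within each group: about 3.8% for control and 4.9% for treated. Since treatment was assigned at random, the difference between these two rates is a simple estimate of the average uplift in visits. The sketch below shows one way to compute it. The helper `visit_rate_uplift` and the toy series are hypothetical and only illustrate the calculation; the last line reuses the two rates reported in the table.

```python
import pandas as pd


def visit_rate_uplift(treatment: pd.Series, target: pd.Series) -> float:
    """Difference in mean target between treated (1) and control (0) rows."""
    rates = target.groupby(treatment).mean()
    return float(rates.loc[1] - rates.loc[0])


# Toy data (hypothetical, not taken from the Criteo dataset):
# treated visit rate is 2/4, control visit rate is 1/4.
toy_treatment = pd.Series([1, 1, 1, 1, 0, 0, 0, 0])
toy_target = pd.Series([1, 1, 0, 0, 1, 0, 0, 0])
toy_uplift = visit_rate_uplift(toy_treatment, toy_target)

# Same calculation with the visit rates reported in the crosstab above.
reported_uplift = 0.048543 - 0.038201
print(f"toy uplift: {toy_uplift:.2f}, reported uplift: {reported_uplift:.6f}")
```

With the reported rates the absolute uplift is about 0.0103, that is, roughly one percentage point, or about 27% relative to the control visit rate. On the loaded data the same helper could be called as `visit_rate_uplift(dataset.treatment, dataset.target)`.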
