Problem statement:

We have a set of samples that vary wildly in the metric we are trying to optimize. We run A/B experiments on these samples and apply CUPED corrections to the metric to reduce its variance. However, each sample actually consists of a set of subsamples, which we cannot randomize individually because the intervention only works at the sample level (i.e. the same intervention is always applied to all subsamples of a sample). We could nevertheless do the A/B split at the sample level but do the evaluation at the subsample level (with the corresponding CUPED corrections also applied at the subsample level).

In this notebook I hope to show that with such an approach we can reduce the spread of both the uncorrected and the CUPED-corrected effect sizes in an A/A setting (i.e. the effect sizes we see in an A/A test should form a narrower distribution around 0 for the subsampling approach than for the sampling one).
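
To make the two evaluation schemes concrete, here is a minimal sketch on toy data. Everything in it is an assumption for illustration only: the Poisson toy metrics, the names `metrics_per_sample`, `is_treatment`, `sample_level_diff` and `subsample_level_diff`, and the choice of a plain difference in means as the effect size. It only shows that the arm is assigned per sample while the evaluation can pool either per-sample totals or individual subsamples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 6 samples, each holding an array of subsample metrics.
metrics_per_sample = [rng.poisson(1.0, size=n) for n in (3, 5, 2, 4, 6, 1)]

# Randomize at the sample level: every subsample inherits its sample's arm.
n_samples = len(metrics_per_sample)
is_treatment = rng.permutation(n_samples) < n_samples // 2


def sample_level_diff(metrics, is_treatment):
    """Difference in mean per-sample totals between the two arms."""
    totals = np.array([m.sum() for m in metrics])
    return totals[is_treatment].mean() - totals[~is_treatment].mean()


def subsample_level_diff(metrics, is_treatment):
    """Difference in mean per-subsample metric, pooling subsamples within each arm."""
    treat = np.concatenate([m for m, t in zip(metrics, is_treatment) if t])
    ctrl = np.concatenate([m for m, t in zip(metrics, is_treatment) if not t])
    return treat.mean() - ctrl.mean()


print(sample_level_diff(metrics_per_sample, is_treatment))
print(subsample_level_diff(metrics_per_sample, is_treatment))
```

In an A/A setting one would repeat the random split many times and compare the spread of the two resulting distributions. The two differences live on different scales (per sample versus per subsample), so such a comparison would presumably use relative rather than absolute differences.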

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
%matplotlib inline

In [2]:
# Getting data
data = pd.read_csv('experiment_data.csv')

In [3]:
data.head()

Unnamed: 0,identifier,n_sub_identifiers,covariates,metrics
0,15467721,175,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 5, 0, ..."
1,15467841,8,"[4, 0, 0, 0, 0, 0, 0, 0]","[8, 0, 0, 0, 0, 0, 0, 0]"
2,15916026,96,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,15928806,24,"[0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 1, 1, 5, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ..."
4,16041186,213,"[0, 0, 0, 0, 0, 2, 6, 0, 0, 1, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 3, 0, 1, 0, 0, 1, 0, 0, 0, ..."


Explanation of columns:

- identifier: unique sample ID
- n_sub_identifiers: number of subsamples per sample
- metrics: the metric of interest per subsample, collected during the experiment period
- covariates: the covariate for CUPED corrections per subsample, this is the metric collected in the pre-experiment period
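
As a reminder of how the covariate enters the correction, here is a minimal sketch of the standard CUPED adjustment, y_cuped = y - theta * (x - mean(x)) with theta = cov(y, x) / var(x). The function name `cuped_adjust` is hypothetical and not taken from this notebook. The example arrays are the covariates and metrics of the second row shown in `data.head()`, used purely for illustration; in an actual analysis theta would be estimated over all units in the experiment rather than within a single sample.

```python
import numpy as np


def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted metric values: y - theta * (x - mean(x))."""
    metric = np.asarray(metric, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    var_x = covariate.var()
    if var_x == 0:
        # A constant covariate carries no information, so leave the metric as is.
        return metric.copy()
    # Population covariance (bias=True) to match the ddof=0 variance above.
    theta = np.cov(metric, covariate, bias=True)[0, 1] / var_x
    return metric - theta * (covariate - covariate.mean())


# Illustration on one row of the data shown above (pre-period vs. experiment period).
covariate = np.array([4, 0, 0, 0, 0, 0, 0, 0])
metric = np.array([8, 0, 0, 0, 0, 0, 0, 0])
adjusted = cuped_adjust(metric, covariate)
print(metric.var(), adjusted.var())
```

The adjustment leaves the mean of the metric unchanged and removes the part of its variance that is linearly explained by the pre-experiment covariate.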

In [4]:
data.shape[0]

164870

In [5]:
data.dtypes

identifier            int64
n_sub_identifiers     int64
covariates           object
metrics              object
dtype: object

In [6]:
data.isnull().sum()

identifier           0
n_sub_identifiers    0
covariates           0
metrics              0
dtype: int64

In [41]:
# Small data-munging: replace the identifier with a running sample index
tmp = data.drop('identifier', axis=1).reset_index()
tmp = tmp.rename(columns={'index': 'sample', 'n_sub_identifiers': 'n_sub_samples'})

def parse_int_array(s):
    """Parse a string such as '[4, 0, 0]' into an integer np.array."""
    # Skipping empty tokens means '[]' yields an empty array instead of raising
    return np.array([int(y) for y in s.strip('[]').split(',') if y.strip()], dtype=int)

# The arrays are stored as strings in the CSV, so cast them back to np.arrays
tmp['covariates'] = tmp['covariates'].apply(parse_int_array)
tmp['metrics'] = tmp['metrics'].apply(parse_int_array)
# Sample-level totals over all subsamples
tmp['sum_covariates'] = tmp['covariates'].apply(np.sum)
tmp['sum_metrics'] = tmp['metrics'].apply(np.sum)
data_munged = tmp
data_munged.head()

Unnamed: 0,sample,n_sub_samples,covariates,metrics,sum_covariates,sum_metrics
0,0,175,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 5, 0, ...",127,137
1,1,8,"[4, 0, 0, 0, 0, 0, 0, 0]","[8, 0, 0, 0, 0, 0, 0, 0]",4,8
2,2,96,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",19,24
3,3,24,"[0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 1, 1, 5, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...",12,1
4,4,213,"[0, 0, 0, 0, 0, 2, 6, 0, 0, 1, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 3, 0, 1, 0, 0, 1, 0, 0, 0, ...",166,123
