# CUPED & Bucketing Techniques

## Tasks

### Task 1

Let's apply `CUPED` in practice on synthetic data (on real data, the effect would not be as large).  
Apply `CUPED` to the data below and calculate by what factor the variance of the metric is reduced.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# generating synthetic data
users_num = 10000

df = pd.DataFrame()
df['user'] = range(users_num)
df['group'] = np.random.rand(users_num) < 0.5
df['user_mean'] = np.random.lognormal(mean=np.log(1000), sigma=0.5, size=users_num)

df['cost_before'] = np.abs(
    df['user_mean'] + np.random.normal(0, 100, size=users_num)
)
df['cost'] = np.abs(
    df['user_mean'] + np.random.normal(50, 100, size=users_num)
)

In [3]:
df

|  | user | group | user_mean | cost_before | cost |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | False | 591.534451 | 608.936295 | 607.480687 |
| 1 | 1 | True | 1084.754140 | 1324.583799 | 1218.101397 |
| 2 | 2 | True | 2089.395664 | 2257.651877 | 2094.460092 |
| 3 | 3 | True | 3401.260793 | 3350.711729 | 3359.489573 |
| 4 | 4 | False | 1534.640477 | 1670.066993 | 1477.014864 |
| ... | ... | ... | ... | ... | ... |
| 9995 | 9995 | False | 2379.615025 | 2367.656157 | 2605.246740 |
| 9996 | 9996 | False | 2361.264322 | 2445.388605 | 2505.677260 |
| 9997 | 9997 | True | 648.334432 | 514.279253 | 681.662309 |
| 9998 | 9998 | False | 545.096525 | 652.999266 | 577.494732 |


In [4]:
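# calculating theta coefficient for CUPED: theta = cov(cost, cost_before) / var(cost_before)
# note: np.cov defaults to ddof=1 while np.var defaults to ddof=0;
# with 10000 users the resulting difference in theta is negligible
cov_yx = np.cov(df['cost'], df['cost_before'])[0, 1]
var_x = np.var(df['cost_before'])
theta = cov_yx / var_x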
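
For reference, the reason for this choice of `theta` can be sketched as a short derivation. This is the standard argument, written with $Y$ for `cost`, $X$ for `cost_before`, and $\rho$ for their correlation; these symbols are introduced here and are not part of the original notebook.

$$
\begin{aligned}
\operatorname{Var}(Y - \theta X) &= \operatorname{Var}(Y) - 2\theta\,\operatorname{Cov}(Y, X) + \theta^{2}\,\operatorname{Var}(X), \\
\theta^{*} &= \frac{\operatorname{Cov}(Y, X)}{\operatorname{Var}(X)}, \\
\operatorname{Var}(Y - \theta^{*} X) &= \operatorname{Var}(Y) - \frac{\operatorname{Cov}(Y, X)^{2}}{\operatorname{Var}(X)} = \operatorname{Var}(Y)\,\bigl(1 - \rho^{2}\bigr).
\end{aligned}
$$

The first line is a quadratic in $\theta$, so setting its derivative to zero gives $\theta^{*}$. The last line shows that the variance shrinks by a factor of $1 / (1 - \rho^{2})$, which is why a strongly correlated pre-experiment metric helps so much.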

In [5]:
# applying CUPED
df['cost_cuped'] = df['cost'] - theta * df['cost_before']
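
The cell above subtracts `theta * cost_before` without centering, so the adjusted metric no longer has the same mean as `cost` (the values shown are close to zero rather than close to the original costs). A common variant subtracts the covariate's deviation from its mean instead. The sketch below compares the two on freshly generated data; the seed, the `toy` frame, and the `cost_cuped_centered` column are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # illustrative seed, not from the notebook
n = 10_000
base = rng.lognormal(mean=np.log(1000), sigma=0.5, size=n)

toy = pd.DataFrame({
    'cost_before': np.abs(base + rng.normal(0, 100, size=n)),
    'cost': np.abs(base + rng.normal(50, 100, size=n)),
})

# Series.cov and Series.var both use ddof=1, so theta is computed consistently
theta = toy['cost'].cov(toy['cost_before']) / toy['cost_before'].var()

# uncentered version, as in the notebook
toy['cost_cuped'] = toy['cost'] - theta * toy['cost_before']

# centered version: subtract only the deviation of the covariate from its mean
toy['cost_cuped_centered'] = toy['cost'] - theta * (
    toy['cost_before'] - toy['cost_before'].mean()
)

print(toy[['cost', 'cost_cuped', 'cost_cuped_centered']].agg(['mean', 'var']))
```

Both versions have the same variance, because subtracting a constant does not change it; only the centered one keeps the mean of `cost`, which makes the adjusted metric easier to read on the original scale.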

In [6]:
df

|  | user | group | user_mean | cost_before | cost | cost_cuped |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | False | 591.534451 | 608.936295 | 607.480687 | 17.830694 |
| 1 | 1 | True | 1084.754140 | 1324.583799 | 1218.101397 | -64.530028 |
| 2 | 2 | True | 2089.395664 | 2257.651877 | 2094.460092 | -91.687169 |
| 3 | 3 | True | 3401.260793 | 3350.711729 | 3359.489573 | 114.901979 |
| 4 | 4 | False | 1534.640477 | 1670.066993 | 1477.014864 | -140.157571 |
| ... | ... | ... | ... | ... | ... | ... |
| 9995 | 9995 | False | 2379.615025 | 2367.656157 | 2605.246740 | 312.579267 |
| 9996 | 9996 | False | 2361.264322 | 2445.388605 | 2505.677260 | 137.739290 |
| 9997 | 9997 | True | 648.334432 | 514.279253 | 681.662309 | 183.671369 |
| 9998 | 9998 | False | 545.096525 | 652.999266 | 577.494732 | -54.822664 |


In [7]:
# calculate variances before and after CUPED
original_variance = np.var(df['cost'])
cuped_variance = np.var(df['cost_cuped'])

In [8]:
# calculating variance reduction ratio
variance_reduction_ratio = original_variance / cuped_variance

print(f'Original variance: {original_variance:.2f}')
print(f'CUPED variance: {cuped_variance:.2f}')
print(f'Variance reduction ratio: {variance_reduction_ratio:.2f}x')

Original variance: 363059.81
CUPED variance: 19530.72
Variance reduction ratio: 18.59x
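
As a sanity check on a ratio like this, the reduction factor can be compared with `1 / (1 - rho**2)`, where `rho` is the sample correlation between `cost` and `cost_before`. The sketch below regenerates data with the same recipe as the notebook but with its own seed, so its numbers will not match the output above; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not from the notebook
n = 10_000
user_mean = rng.lognormal(mean=np.log(1000), sigma=0.5, size=n)
cost_before = np.abs(user_mean + rng.normal(0, 100, size=n))
cost = np.abs(user_mean + rng.normal(50, 100, size=n))

# use the same ddof for covariance and variance so the identity holds exactly
theta = np.cov(cost, cost_before, ddof=1)[0, 1] / np.var(cost_before, ddof=1)
cost_cuped = cost - theta * cost_before

ratio = np.var(cost, ddof=1) / np.var(cost_cuped, ddof=1)
rho = np.corrcoef(cost, cost_before)[0, 1]
predicted = 1 / (1 - rho**2)

print(f'Observed variance reduction:  {ratio:.2f}x')
print(f'Predicted from correlation:   {predicted:.2f}x')
```

The two printed values should agree up to floating-point error, which confirms that the size of the reduction is driven entirely by how strongly the pre-experiment metric correlates with the target metric.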


### Task 2

Let's practice bucket testing and, at the same time, check how the variance of the bucket-level CTR differs depending on which of two ways we use to calculate it.  
Generate synthetic data.  
Assign users to `buckets` (use 100 of them) by applying an `MD5 hash` with a salt to the user ID.  
Then calculate two types of CTR:
- group CTR (total clicks divided by total views in each bucket)
- average user CTR in the bucket

For one of the groups, calculate the standard deviation of each CTR across buckets and compare the two methods.

In [9]:
import hashlib

In [10]:
# generating data
np.random.seed(6)

users_num = 10000
mean_user_ctr = 0.2
beta = 20
alpha = mean_user_ctr * beta / (1 - mean_user_ctr)

df = pd.DataFrame()
df['user'] = range(users_num)
df['group'] = np.random.rand(users_num) < 0.5

df['base_user_ctr'] = np.random.beta(alpha, beta, size=users_num)
df['views'] = np.random.lognormal(mean=1, sigma=1, size=users_num).astype(int) + 1
df['clicks'] = np.random.binomial(df['views'], df['base_user_ctr'])

df['user_ctr'] = df['clicks'] / df['views']

In [11]:
df

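|  | user | group | base_user_ctr | views | clicks | user_ctr |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | False | 0.322688 | 2 | 2 | 1.000000 |
| 1 | 1 | True | 0.182517 | 6 | 1 | 0.166667 |
| 2 | 2 | False | 0.260975 | 6 | 3 | 0.500000 |
| 3 | 3 | True | 0.260439 | 2 | 0 | 0.000000 |
| 4 | 4 | True | 0.332355 | 1 | 1 | 1.000000 |
| ... | ... | ... | ... | ... | ... | ... |
| 9995 | 9995 | False | 0.216499 | 25 | 5 | 0.200000 |
| 9996 | 9996 | False | 0.259650 | 2 | 1 | 0.500000 |
| 9997 | 9997 | True | 0.268912 | 2 | 0 | 0.000000 |
| 9998 | 9998 | True | 0.147879 | 4 | 0 | 0.000000 |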


In [12]:
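# creating buckets assignment function
def get_bucket(user_id, salt='my_salt', num_buckets=100):
    # salting the ID makes the split specific to this experiment
    hash_input = f"{user_id}_{salt}"
    hash_value = hashlib.md5(hash_input.encode()).hexdigest()
    # the first 8 hex digits form a 32-bit integer, reduced modulo the number of buckets
    bucket = int(hash_value[:8], 16) % num_buckets
    return bucket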
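
A quick way to build confidence in a hash-based assignment like this is to check that it is deterministic, that bucket sizes are roughly even, and that changing the salt reshuffles users. The sketch below repeats the function so it runs on its own; the second salt, `'another_salt'`, is a made-up value used only for the comparison.

```python
import hashlib
from collections import Counter


def get_bucket(user_id, salt='my_salt', num_buckets=100):
    # same logic as the notebook's function, repeated to keep the sketch self-contained
    hash_value = hashlib.md5(f"{user_id}_{salt}".encode()).hexdigest()
    return int(hash_value[:8], 16) % num_buckets


user_ids = range(10000)
buckets = [get_bucket(u) for u in user_ids]

# determinism: hashing the same ID with the same salt always gives the same bucket
repeat = [get_bucket(u) for u in user_ids]
print('deterministic:', buckets == repeat)

# balance: with 10000 users and 100 buckets, sizes should hover around 100
sizes = Counter(buckets)
print(f'buckets used: {len(sizes)}, '
      f'smallest: {min(sizes.values())}, largest: {max(sizes.values())}')

# a different (hypothetical) salt should produce an unrelated assignment,
# so only about 1 in 100 users is expected to keep the same bucket
other = [get_bucket(u, salt='another_salt') for u in user_ids]
same_share = sum(a == b for a, b in zip(buckets, other)) / len(buckets)
print(f'share of users keeping their bucket under a new salt: {same_share:.3f}')
```

Using a fresh salt per experiment is what keeps bucket membership in one test from lining up with bucket membership in another.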

In [13]:
# assigning users to buckets
df['bucket'] = df['user'].apply(get_bucket)

In [14]:
df

Unnamed: 0,user,group,base_user_ctr,views,clicks,user_ctr,bucket
0,0,False,0.322688,2,2,1.000000,83
1,1,True,0.182517,6,1,0.166667,52
2,2,False,0.260975,6,3,0.500000,64
3,3,True,0.260439,2,0,0.000000,26
4,4,True,0.332355,1,1,1.000000,36
...,...,...,...,...,...,...,...
9995,9995,False,0.216499,25,5,0.200000,58
9996,9996,False,0.259650,2,1,0.500000,64
9997,9997,True,0.268912,2,0,0.000000,57
9998,9998,True,0.147879,4,0,0.000000,33


In [15]:
df.bucket.nunique()

100

In [16]:
# method 1: calculating group CTR for each bucket (total clicks / total views)
bucket_group_stats = df.groupby(['group', 'bucket']).agg({
    'clicks': 'sum',
    'views': 'sum'
}).reset_index()

bucket_group_stats['group_ctr'] = bucket_group_stats['clicks'] / bucket_group_stats['views']

In [17]:
# method 2: calculating average user CTR for each bucket
user_ctrs = df.groupby(['group', 'bucket'])['user_ctr'].mean().reset_index()
user_ctrs.rename(columns={'user_ctr': 'avg_user_ctr'}, inplace=True)
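
The difference between the two methods comes down to weighting: the group CTR is a views-weighted average of user CTRs, while the average user CTR gives every user the same weight. The toy example below, with three made-up users, illustrates this; the `toy` frame is not part of the notebook's data.

```python
import numpy as np
import pandas as pd

# three hypothetical users with very different numbers of views
toy = pd.DataFrame({'views': [1, 4, 20], 'clicks': [1, 1, 4]})
toy['user_ctr'] = toy['clicks'] / toy['views']

# method 1: total clicks / total views
group_ctr = toy['clicks'].sum() / toy['views'].sum()

# method 2: plain mean of per-user CTRs
avg_user_ctr = toy['user_ctr'].mean()

# method 1 rewritten as a weighted mean of per-user CTRs, with views as weights
weighted_user_ctr = np.average(toy['user_ctr'], weights=toy['views'])

print(f'group CTR:             {group_ctr:.4f}')
print(f'views-weighted CTR:    {weighted_user_ctr:.4f}')
print(f'average user CTR:      {avg_user_ctr:.4f}')
```

Here the single-view user with a CTR of 1.0 pulls the plain average up, while in the group CTR that user accounts for only one of 25 views.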

In [18]:
# merging two methods for further comparison
combined = pd.merge(bucket_group_stats, user_ctrs, on=['group', 'bucket'])

In [19]:
# calculating standard deviation for each group and method
std_devs = combined.groupby('group').agg({
    'group_ctr': 'std',
    'avg_user_ctr': 'std'
}).reset_index()

In [20]:
print(std_devs)
print('\nComparison ratios (avg_user_ctr_std / group_ctr_std):')
std_devs['ratio'] = std_devs['avg_user_ctr'] / std_devs['group_ctr']
print(std_devs[['group', 'ratio']])

   group  group_ctr  avg_user_ctr
0  False   0.028374      0.036418
1   True   0.031556      0.037832

Comparison ratios (avg_user_ctr_std / group_ctr_std):
   group     ratio
0  False  1.283489
1   True  1.198864
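
One plausible reading of this result, not stated in the notebook itself, is that the average user CTR gives equal weight to users with one or two views, whose individual CTRs are very noisy (often exactly 0 or 1), whereas the group CTR weights users by views. The sketch below checks whether the pattern persists on a larger simulation. It makes several assumptions for illustration: a different seed, 200,000 users, 1,000 buckets, random bucket assignment instead of MD5, and no split into groups.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # illustrative seed, not from the notebook
users_num = 200_000             # more users than the notebook, to steady the estimates
num_buckets = 1000

sim = pd.DataFrame({
    # random bucket assignment stands in for the MD5-based one
    'bucket': rng.integers(0, num_buckets, size=users_num),
    # Beta(5, 20) matches alpha = 0.2 * 20 / (1 - 0.2) and beta = 20 from the notebook
    'base_user_ctr': rng.beta(5, 20, size=users_num),
    'views': rng.lognormal(mean=1, sigma=1, size=users_num).astype(int) + 1,
})
sim['clicks'] = rng.binomial(sim['views'].to_numpy(), sim['base_user_ctr'].to_numpy())
sim['user_ctr'] = sim['clicks'] / sim['views']

per_bucket = sim.groupby('bucket').agg(
    clicks=('clicks', 'sum'),
    views=('views', 'sum'),
    avg_user_ctr=('user_ctr', 'mean'),
)
per_bucket['group_ctr'] = per_bucket['clicks'] / per_bucket['views']

ratio = per_bucket['avg_user_ctr'].std() / per_bucket['group_ctr'].std()
print(f'std ratio (avg_user_ctr / group_ctr): {ratio:.3f}')
print(f"share of users with a single view: {(sim['views'] == 1).mean():.3f}")
```

If the ratio stays above 1 here as well, that supports the weighting explanation for this data-generating process; it is not a general rule, since with other view distributions the ordering could differ.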
