# A/B Testing Course

## Lesson 10. Multiple Testing

### Homework

#### Import Libraries

In [1]:
import os
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
from scipy.stats import ttest_ind

#### Task 1. 

How many incompatible independent experiments can be conducted simultaneously if we collect 10,000 observations during the experiment?

If we decide to run 10 experiments, we can allocate 1,000 observations per experiment, with a group size of 500.

Experiment parameters:

- Testing the hypothesis of equal means.
- Significance level: 0.05.
- Allowable probability of Type II error: 0.1.
- Expected effect: increase in values by 3%.
- Method of introducing the effect in synthetic A/B experiments: multiplication by a constant.
- Assuming that the measured values follow a normal distribution with a mean of 100 and a standard deviation of 10.

As the answer, please enter the maximum number of experiments that can be conducted with the given parameters.

In [3]:
users = 10000
alpha = 0.05
beta = 0.1
effect = 1.03
mean = 100
std = 10

In [4]:
num_exp_lst = []
fpr_lst = []
power_lst = []

for num in range(1, 26):
    users_in_exp = users // num
    users_in_group = users_in_exp // 2
    alpha_ = alpha / num  # Bonferroni correction
    
    pval_a_lst = []
    pval_b_lst = []
    for _ in range(1000):
        a = np.random.normal(mean, std, users_in_group)
        aa = np.random.normal(mean, std, users_in_group)
        b = a * effect

        _, pval_a = ttest_ind(a, aa)
        _, pval_b = ttest_ind(a, b)
        pval_a_lst.append(pval_a)
        pval_b_lst.append(pval_b)

    fp = [x for x in pval_a_lst if x < alpha_]
    tp = [x for x in pval_b_lst if x < alpha_]

    fpr = len(fp) / len(pval_a_lst)
    power = len(tp) / len(pval_b_lst)
    
    num_exp_lst.append(num)
    fpr_lst.append(fpr)
    power_lst.append(power)
    
    
result_df = pd.DataFrame(
    {'experiments': num_exp_lst,
     'fpr': fpr_lst,
     'power': power_lst
    }
)  

In [5]:
result_df

Unnamed: 0,experiments,fpr,power
0,1,0.041,1.0
1,2,0.023,1.0
2,3,0.015,1.0
3,4,0.006,1.0
4,5,0.006,1.0
5,6,0.017,1.0
6,7,0.007,1.0
7,8,0.004,1.0
8,9,0.006,1.0
9,10,0.006,1.0


In [6]:
ans = result_df[(result_df['fpr'] < alpha) & (result_df['power'] > 1 - beta)].shape[0]

print(f'The maximum number of experiments that can be conducted with the given parameters: {ans}')

The maximum number of experiments that can be conducted with the given parameters: 20


#### Task 2. 

The task is similar to the previous one, but now the decision is not made independently for each experiment. For example, we have 5 texts for marketing campaigns and want to test which one works better, if any. The algorithm will be as follows:

Create non-overlapping control and experimental groups for each of the 5 variations.
Conduct 5 experiments in parallel.
Use the Holm's method to determine in which experiments there were statistically significant differences.
If no significant differences are found, conclude that there is no effect and reject all variations.
If significant differences are found, select the variation with the lowest p-value among the variations with a significant effect for further use.
We will consider a Type I error to be committed if significant differences are found when there were none in any of the variations.

We will consider a Type II error to be committed if:

No significant differences are found when there were actually some.
The selected variation for further use actually had no effect, while there were variations with an effect.
Experiment parameters:

Testing the hypothesis of equal means.
- Significance level: 0.05.
- Allowable probability of Type II error: 0.1.
- Expected effect: increase in values by 3%.
- Method of introducing the effect in synthetic A/B experiments: multiplication by a constant.
- Note: When evaluating the probability of Type II error, we need to consider the worst-case scenario, where the effect exists only in one of the experiments. The more experiments have an effect, the lower the probability of Type II error will be.

We will assume that the distribution of measured values follows a normal distribution with a mean of 100 and a standard deviation of 10.

As the answer, please enter the maximum number of experiments that can be conducted with the given parameters.

In [7]:
def method_holm(pvalues, alpha=0.05):
    """Applies the Holm's method to test the significance of changes.

    pvalues - List[float] - a list of p-values.
    alpha - float, significance level.
    return - np.array, an array of zeros and ones, 
        where 0 indicates no effect and 1 indicates the presence of an effect.
    """
    m = len(pvalues)
    array_alpha = np.arange(m, 0, -1)
    array_alpha = alpha / array_alpha
    sorted_pvalue_indexes = np.argsort(pvalues)
    res = np.zeros(m)
    for idx, pvalue_index in enumerate(sorted_pvalue_indexes):
        pvalue = pvalues[pvalue_index]
        alpha_ = array_alpha[idx]
        if pvalue < alpha_:
            res[pvalue_index] = 1
        else:
            break
    res = res.astype(int)
    return res

In [8]:
num_exp_lst = []
fpr_lst = []
power_lst = []


trials = 100
for max_exp in range(1, 16):
    users_in_exp = users // max_exp
    users_in_group = users_in_exp // 2
    exps = []
    pval_a_lst = []
    pval_b_lst = []
    fp = []
    tp = []
    
    for _ in range(trials):

        for i in range(1, max_exp+1):
            a = np.random.normal(mean, std, users_in_group)
            aa = np.random.normal(mean, std, users_in_group)
            b = a * effect
            exps.append((a, aa, b))

        for exp in exps:
            _, pval_a = ttest_ind(exp[0], exp[1])
            _, pval_b = ttest_ind(exp[0], exp[2])
            pval_a_lst.append(pval_a)
            pval_b_lst.append(pval_b)

        res_a = method_holm(pval_a_lst, alpha=0.05)
        res_b = method_holm(pval_b_lst, alpha=0.05)

        if sum(res_a) != 0:
            fp.append(1)

        if sum(res_b) > 0:
            tp.append(1)
    
    fpr = sum(fp) / trials
    power = sum(tp) / trials
    
    num_exp_lst.append(max_exp)
    fpr_lst.append(fpr)
    power_lst.append(power)
    
    
result_df = pd.DataFrame(
    {'experiments': num_exp_lst,
     'fpr': fpr_lst,
     'power': power_lst
    }
)  

In [9]:
result_df

Unnamed: 0,experiments,fpr,power
0,1,0.0,1.0
1,2,0.0,1.0
2,3,0.0,1.0
3,4,0.0,1.0
4,5,0.0,1.0
5,6,0.0,1.0
6,7,0.0,1.0
7,8,0.0,1.0
8,9,0.0,1.0
9,10,0.04,1.0


In [10]:
ans = result_df[(result_df['fpr'] < alpha) & (result_df['power'] > 1 - beta)].shape[0]

print(f'The maximum number of experiments that can be conducted with the given parameters: {ans}')

The maximum number of experiments that can be conducted with the given parameters: 11
