# AB Testing
Yang Xi <br>
20 Aug, 2021

# Introduction
*Reference: https://vwo.com/ab-testing/#a-b-testing-challenges*

## The Iterative Process of A/B Testing
- Research
- Logging Observations
- Formulating Hypothesis
- A/B Testing
- Deploying the Winning Version

<br><br>

## Frequentist vs. Bayesian
### Frequentist
- Use only data from current experiment
- Estimate mean and standard deviation

### Bayesian
- Incorporate prior knowledge from the previous experiments
- Estimate the probability of A beating B, as well as the range of the expected improvement
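
As a rough illustration of the Bayesian view, the sketch below estimates the probability that B beats A for conversion data. It assumes Beta priors (uniform by default) and uses hypothetical counts; the function name `prob_b_beats_a` and its parameters are illustrative. Prior knowledge from earlier experiments could be encoded by changing `prior_alpha` and `prior_beta`.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b,
                   prior_alpha=1.0, prior_beta=1.0, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior of a conversion rate: Beta(prior_alpha + successes, prior_beta + failures)
        rate_a = rng.betavariate(prior_alpha + conv_a, prior_beta + n_a - conv_a)
        rate_b = rng.betavariate(prior_alpha + conv_b, prior_beta + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical counts: 150/1000 conversions for A, 180/1000 for B
p_win = prob_b_beats_a(150, 1000, 180, 1000)
print(f"P(B beats A) ~ {p_win:.3f}")
```

The same posterior draws could also be used to summarize the range of the expected improvement, for example through percentiles of `rate_b - rate_a`.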

<br><br>

## Challenges
- Decide **what to test**
- Formulate **hypothesis**
- Determine **sample size** (or test duration)
- **Interpret** test results – why did a variation stand out?
- Maintain a **testing culture** – A/B testing is iterative

### Type I and Type II Errors
- Type I: False Positive – the result looks significant but the effect is not real – usually the more severe error
    - Cannot be avoided entirely – use a confidence level of at least 95% (a significance level α of at most 0.05)
- Type II: False Negative – the result looks insignificant but the effect is real – can be discouraging
    - Improve statistical power (at least 0.8) – more samples, fewer variants, etc.
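
To make the Type I error rate concrete, the sketch below simulates A/A tests, where both groups share the same true conversion rate, and counts how often a pooled two-proportion z-test still reports significance at α = 0.05. The traffic numbers and the helper name `two_sided_p` are assumptions chosen for illustration.

```python
import random
from statistics import NormalDist

def two_sided_p(conv1, n1, conv2, n2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv1 + conv2) / (n1 + n2)
    std_err = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    if std_err == 0:
        return 1.0
    z = (conv1 / n1 - conv2 / n2) / std_err
    return 2 * NormalDist().cdf(-abs(z))

rng = random.Random(42)
n, rate, alpha, runs = 500, 0.15, 0.05, 1000  # hypothetical settings
false_positives = 0
for _ in range(runs):
    # Both groups share the same true rate, so any "significant" result is a Type I error
    conv_a = sum(rng.random() < rate for _ in range(n))
    conv_b = sum(rng.random() < rate for _ in range(n))
    false_positives += two_sided_p(conv_a, n, conv_b, n) <= alpha

fp_rate = false_positives / runs
print(f"False positive rate ~ {fp_rate:.3f}")
```

The observed rate should land near α, which is why false positives cannot be eliminated, only bounded.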

<br><br>

## Common Mistakes & Key Notes
- An **invalid hypothesis** makes the test unlikely to succeed.<br>
This can happen if you copy someone else’s success story without analyzing your own data.
- Testing **too many elements** together makes it difficult to pinpoint which element influenced the test’s success/failure the most.
- Focus on statistical significance, not on personal **opinions or gut feelings**.
- **Unbalanced traffic** will increase the chance of failure.
- **Running too long or too short** can both result in failure.
- Try not to **change your experiment** settings / goals / designs in the middle of a test.
- A/B testing is an iterative process. Do not **stop testing** after the first success / failure.
- Tests should be run in **comparable periods**.
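
One way to read the point about unbalanced traffic: for a fixed total number of visitors, the standard error of the difference between two group means is smallest with an even split. The sketch below is a minimal illustration under the assumption of equal standard deviations in both groups; `diff_std_err` and the traffic numbers are hypothetical.

```python
def diff_std_err(total, share_a, std=1.0):
    """Standard error of a difference in means for a given traffic split."""
    n_a = total * share_a
    n_b = total * (1 - share_a)
    return std * (1 / n_a + 1 / n_b) ** 0.5

total = 10000  # hypothetical total number of visitors
for share in (0.5, 0.7, 0.9):
    se = diff_std_err(total, share)
    print(f"split {share:.0%}/{1 - share:.0%}: std err = {se:.4f}")
```

With these numbers, a 90/10 split gives a standard error roughly 1.67 times that of a 50/50 split, so the same effect is harder to detect.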

<br><br>

## Scale Up A/B Testing
- When doing many tests
    - Ensure none of the tests affect others or the overall (website’s) performance.
    - No more than two tests should overlap in any given week.
- More sophisticated testing methods
    - Split URL testing
    - Multivariate testing (MVT)
    - Multipage testing


# Distributions and Hypothesis Testing

Should A/B testing use **one-tailed or two-tailed tests**? <br>
This article discusses the debate: *https://blog.analytics-toolkit.com/2017/one-tailed-two-tailed-tests-significance-ab-testing/*
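
To show what the choice changes numerically, here is a small sketch that converts one z statistic into the three p-values. The statistic `-1.8` is a made-up value and `z_test_p_values` is an illustrative helper, not part of any library.

```python
from statistics import NormalDist

def z_test_p_values(z):
    """Return (two-sided, 'less', 'greater') p-values for a z statistic."""
    norm = NormalDist()
    p_less = norm.cdf(z)             # H1: group 1 < group 2
    p_greater = 1 - norm.cdf(z)      # H1: group 1 > group 2
    p_two_sided = 2 * norm.cdf(-abs(z))
    return p_two_sided, p_less, p_greater

p_two, p_less, p_greater = z_test_p_values(-1.8)  # hypothetical statistic
print(f"two-sided = {p_two:.4f}, less = {p_less:.4f}, greater = {p_greater:.4f}")
```

With z = -1.8 the one-sided "less" test falls below 0.05 while the two-sided test does not, because the two-sided p-value is twice the smaller one-sided p-value.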


## Definitions
*Reference: https://machinelearningmastery.com/statistical-power-and-power-analysis-in-python/*

* p-value vs. Significance Level (α)
    * **p-value (p)**: Probability, assuming the null hypothesis is true, of obtaining a result equal to or more extreme than the one observed in the data.
    * **Significance Level (α)**: Threshold on the p-value below which a finding is declared statistically significant.
        * Commonly selected at α = 0.05
        * 5% probability of a **Type I Error** when the null hypothesis is true
    * Reject the Null Hypothesis H0 if p <= α

<br>

* **Statistical Power (sensitivity)**: The probability that the test correctly rejects the null hypothesis when it is false.
    * Power = 1-β
    * β = probability of a **Type II Error** (false negative)
    * Commonly selected at Power = 0.8
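
The relation between α, β and sample size can be sketched with a normal approximation. The function below is an illustration, assuming a two-sided two-sample z-test with equal group sizes and a standardized effect size (difference divided by the standard deviation); `z_test_power` and the effect size `0.3` are hypothetical.

```python
from statistics import NormalDist

def z_test_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with equal groups.

    effect_size is the standardized difference (difference / std deviation).
    The small chance of rejecting in the wrong direction is ignored.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return norm.cdf(shift - z_crit)

for n in (20, 50, 100, 200):
    power = z_test_power(effect_size=0.3, n_per_group=n)
    print(f"n per group = {n:>3}: power = {power:.2f}, beta = {1 - power:.2f}")
```

Power rises with the sample size, which is why "more samples" is the usual remedy for Type II errors.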

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as ss
import statsmodels.stats.api as sms

## Gaussian Distribution
A Gaussian distribution can be observed in data like **"Average revenue per user"**.
- **Null hypothesis**: two means are equal (m1 = m2)
- **Welch's t-test** (standard): designed for comparing two normally distributed populations with **unequal variance**. 
- **Student's t-test** (alternative): designed for comparing two normally distributed populations with **equal variance**. 
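
The difference between the two tests matters most when group sizes and variances are both unequal. The sketch below uses `scipy.stats.ttest_ind_from_stats` on made-up summary statistics (all numbers are hypothetical) to compare the two variants.

```python
from scipy import stats

# Hypothetical summary statistics: a small, noisy group and a large, stable one
mean_a, std_a, n_a = 0.20, 0.15, 20
mean_b, std_b, n_b = 0.15, 0.05, 200

t_student, p_student = stats.ttest_ind_from_stats(
    mean_a, std_a, n_a, mean_b, std_b, n_b, equal_var=True)
t_welch, p_welch = stats.ttest_ind_from_stats(
    mean_a, std_a, n_a, mean_b, std_b, n_b, equal_var=False)

print(f"Student: t = {t_student:.2f}, p = {p_student:.4f}")
print(f"Welch:   t = {t_welch:.2f}, p = {p_welch:.4f}")
```

Here Student's pooled variance is dominated by the larger, low-variance group, so it understates the uncertainty of the small group and reports a much smaller p-value than Welch's test. This is one reason to treat Welch's test as the default.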

In [2]:
np.random.seed(1)
n1, n2 = 1000, 1000
ar1 = np.random.normal(loc=0.15, scale=0.05, size=n1)
ar2 = np.random.normal(loc=0.18, scale=0.05, size=n2)

m1, m2 = ar1.mean(), ar2.mean()
print(f"Measured m1 = {m1}, m2 = {m2}\n")

print("one-sided:")
t_Welch, p_Welch = ss.ttest_ind(ar1, ar2, equal_var=False, alternative='less') # two-sided / less / greater
print(f"Welch's t-test: statistic = {t_Welch}, pvalue = {p_Welch}")

t_Student, p_Student = ss.ttest_ind(ar1, ar2, equal_var=True, alternative='less') # two-sided / less / greater
print(f"Student's t-test: statistic = {t_Student}, pvalue = {p_Student}")

print("")
print("two-sided:")
t_Welch, p_Welch = ss.ttest_ind(ar1, ar2, equal_var=False, alternative='two-sided') # two-sided / less / greater
print(f"Welch's t-test: statistic = {t_Welch}, pvalue = {p_Welch}")

t_Student, p_Student = ss.ttest_ind(ar1, ar2, equal_var=True, alternative='two-sided') # two-sided / less / greater
print(f"Student's t-test: statistic = {t_Student}, pvalue = {p_Student}")

Measured m1 = 0.1519406238079801, m2 = 0.18136627216972892

one-sided:
Welch's t-test: statistic = -13.07686619626439, pvalue = 7.478507328479177e-38
Student's t-test: statistic = -13.076866196264389, pvalue = 7.419818818667648e-38

two-sided:
Welch's t-test: statistic = -13.07686619626439, pvalue = 1.4957014656958354e-37
Student's t-test: statistic = -13.076866196264389, pvalue = 1.4839637637335296e-37


### T-test Formulation (manual calculation)

Reference:
- *https://en.wikipedia.org/wiki/Student%27s_t-test*
- *https://en.wikipedia.org/wiki/Welch%27s_t-test*
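
For reference, the two statistics computed in the function below can be written out as follows, where $m_i$, $s_i^2$ and $n_i$ are the sample mean, sample variance (with `ddof=1`) and size of group $i$, and $\nu$ is the degrees of freedom. The notation is chosen here for illustration; the formulas follow the two pages referenced above.

```latex
% Welch's t-test (unequal variances)
t = \frac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
                 {\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}

% Student's t-test (equal variances, pooled standard deviation s_p)
t = \frac{m_1 - m_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}},
\qquad
\nu = n_1 + n_2 - 2
```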

In [3]:
def t_test(ar1, ar2, method='Welch', alternative='two-sided'):
    n1, n2 = len(ar1), len(ar2)
    m1, m2 = ar1.mean(), ar2.mean()
    var1, var2  = ar1.var(ddof=1), ar2.var(ddof=1)

    if method=='Welch':
        dof = (var1/n1 + var2/n2)**2 / ((var1**2)/(n1**2)/(n1-1) + (var2**2)/(n2**2)/(n2-1))
        if alternative=='two-sided':
            t = -np.abs(m1 - m2) / np.sqrt(var1/n1 + var2/n2)
            p = ss.t.cdf(t, df=dof)*2
        else:
            t = (m1 - m2) / np.sqrt(var1/n1 + var2/n2)
            if alternative=='less': # significant if m1 < m2
                p = ss.t.cdf(t, df=dof)
            if alternative=='greater': # significant if m1 > m2
                p = ss.t.sf(t, df=dof)

    if method=='Student':
        dof = n1 + n2 -2
        if alternative=='two-sided':
            t = -np.abs(m1 - m2)/ (np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2)) * np.sqrt(1/n1 + 1/n2))
            p = ss.t.cdf(t, df=dof)*2
        else:
            t = (m1 - m2)/ (np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2)) * np.sqrt(1/n1 + 1/n2))
            if alternative=='less': # significant if m1 < m2
                p = ss.t.cdf(t, df=dof)
            if alternative=='greater': # significant if m1 > m2
                p = ss.t.sf(t, df=dof)
            
    return t, p, dof

In [4]:
print("one-sided (manual):")

t_Welch_manual, p_Welch_manual, dof_Welch = t_test(ar1, ar2, method='Welch', alternative='less') # two-sided / less / greater
print(f"Welch's t-test: statistic = {t_Welch_manual}, pvalue = {p_Welch_manual}, dof = {dof_Welch}")

t_Student_manual, p_Student_manual, dof_Student = t_test(ar1, ar2, method='Student', alternative='less') # two-sided / less / greater
print(f"Student's t-test: statistic = {t_Student_manual}, pvalue = {p_Student_manual}, dof = {dof_Student}")

print("")
print("two-sided (manual):")
t_Welch_manual, p_Welch_manual, dof_Welch = t_test(ar1, ar2, method='Welch', alternative='two-sided') # two-sided / less / greater
print(f"Welch's t-test: statistic = {t_Welch_manual}, pvalue = {p_Welch_manual}, dof = {dof_Welch}")

t_Student_manual, p_Student_manual, dof_Student = t_test(ar1, ar2, method='Student', alternative='two-sided') # two-sided / less / greater
print(f"Student's t-test: statistic = {t_Student_manual}, pvalue = {p_Student_manual}, dof = {dof_Student}")

one-sided (manual):
Welch's t-test: statistic = -13.07686619626439, pvalue = 7.478507328479177e-38, dof = 1993.2657733704975
Student's t-test: statistic = -13.076866196264389, pvalue = 7.419818818667648e-38, dof = 1998

two-sided (manual):
Welch's t-test: statistic = -13.07686619626439, pvalue = 1.4957014656958354e-37, dof = 1993.2657733704975
Student's t-test: statistic = -13.076866196264389, pvalue = 1.4839637637335296e-37, dof = 1998


### Statistical Power

Reference: *https://towardsdatascience.com/required-sample-size-for-a-b-testing-6f6608dd330a*

Benchmark: *https://www.stat.ubc.ca/~rollin/stats/ssize/n2.html*

In [5]:
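alpha = 0.05 # significance level
var1, var2  = ar1.var(ddof=1), ar2.var(ddof=1)
dof = n1 + n2 -2
std = np.sqrt((var1+var2)/2)
effect_size = np.abs((m2 - m1) / std)
ratio = n2 / n1

# Use statsmodels (keyword arguments, so that ratio and dof are not read as alpha and ratio)
analysis = sms.TTestIndPower()
power = analysis.power(effect_size=effect_size, nobs1=n1, alpha=alpha, ratio=ratio, df=dof, alternative='two-sided')

# Manual calculation (approximation): compare the t statistic with the critical value at alpha
t_crit = ss.t.ppf(1 - alpha/2, dof)
x = np.abs(m1-m2) / np.sqrt(var1/n1 + var2/n2) - t_crit
power_manual = ss.t.cdf(x, dof)

print(f"Statistical Power = {power}; manually calculated = {power_manual}")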

Statistical Power = 1.0; manually calculated = 1.0


### Theoretical Minimum Number of Samples Required

Reference: *https://towardsdatascience.com/required-sample-size-for-a-b-testing-6f6608dd330a*

Benchmark: *https://www.stat.ubc.ca/~rollin/stats/ssize/n2.html*
- Yang Xi: a much larger sample size is required in practice

In [6]:
m1, std1 = 0.15, 0.05 # measured
m2, std2 = 0.18, 0.05 # expected
alpha, power = 0.05, 0.8 # required
ratio = 1 # required; ratio = n2 / n1

# Use statsmodels
std = np.sqrt((std1**2 + std2**2)/2)
effect_size = np.abs((m2 - m1) / std)
n1 = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=ratio, alternative='two-sided')

# Manual
n1_manual = (std1**2 + std2**2) * (ss.norm.ppf(1-alpha/2) + ss.norm.ppf(power))**2 / (m1-m2)**2

print(f"n1 = {n1}, n2 = {n1*ratio}; manually calculated n1 = {n1_manual}, n2 = {n1_manual*ratio}")

n1 = 44.58579025908025, n2 = 44.58579025908025; manually calculated n1 = 43.604887413050506, n2 = 43.604887413050506
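
The two results above differ by about one sample. A plausible reason is that the manual formula uses normal quantiles, whereas `solve_power` works with the t distribution. The sketch below is an approximation under that assumption: it repeats the manual formula with t quantiles and iterates, since the degrees of freedom depend on n. The helper `n_per_group_t` is hypothetical and assumes equal group sizes and a common standard deviation.

```python
from scipy import stats

def n_per_group_t(delta, std, alpha=0.05, power=0.8, iterations=20):
    """Refine the normal-based sample size using t quantiles (equal groups)."""
    z_sum = stats.norm.ppf(1 - alpha / 2) + stats.norm.ppf(power)
    n = 2 * (std * z_sum / delta) ** 2           # normal-approximation start
    for _ in range(iterations):
        dof = 2 * n - 2                          # degrees of freedom for two groups of size n
        t_sum = stats.t.ppf(1 - alpha / 2, dof) + stats.t.ppf(power, dof)
        n = 2 * (std * t_sum / delta) ** 2
    return n

n_t = n_per_group_t(delta=0.03, std=0.05)
print(f"n per group with t quantiles ~ {n_t:.2f}")
```

The refined value should land close to the statsmodels result, and the gap shrinks as the required sample size grows.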


## Binomial Distribution
A binomial distribution can be observed in data like **"conversion rate"** (converted / not converted).
- **Null hypothesis**: two proportions are equal (p1=p2)

In [7]:
np.random.seed(0)
n1, n2 = 10000, 10000
p1, p2 = 0.15, 0.18
q1, q2 = 1-p1, 1-p2
ar1 = np.random.choice([0,1], size=n1, p=[q1, p1])
ar2 = np.random.choice([0,1], size=n2, p=[q2, p2])

p1 = (ar1==1).sum()/n1
p2 = (ar2==1).sum()/n2
print(f"Measured p1 = {p1}, p2 = {p2}\n")

print("one-sided:")
z, p = sms.proportions_ztest(count=[n1*p1, n2*p2], nobs=[n1,n2], alternative='smaller') # two-sided, smaller, larger
print(f"z statistics = {z}; p value = {p}")

print("")
print("two-sided:")
z, p = sms.proportions_ztest(count=[n1*p1, n2*p2], nobs=[n1,n2], alternative='two-sided')
print(f"z statistics = {z}; p value = {p}")


Measured p1 = 0.1487, p2 = 0.1818

one-sided:
z statistics = -6.301791760015639; p value = 1.4711214992766808e-10

two-sided:
z statistics = -6.301791760015639; p value = 2.9422429985533616e-10


### Proportion Z-test Formulation (manual calculation)
As the null hypothesis assumes p1=p2, both samples estimate the same common proportion pc, so the standard error is computed from the pooled proportion.

*Reference: https://online.stat.psu.edu/stat800/lesson/5/5.5*
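
Written out, the pooled proportion and the test statistic used in the function below are as follows, with $\Phi$ denoting the standard normal CDF (notation chosen here for illustration):

```latex
p_c = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2},
\qquad
z = \frac{p_1 - p_2}{\sqrt{p_c (1 - p_c) \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}

% p-values for the three alternatives
p_{\text{smaller}} = \Phi(z), \qquad
p_{\text{larger}} = 1 - \Phi(z), \qquad
p_{\text{two-sided}} = 2\, \Phi(-|z|)
```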

In [8]:
def prop_ztest(n1, p1, n2, p2, alternative='two-sided'):
    pc = (n1*p1 + n2*p2) / (n1+n2)
    std_err = np.sqrt(pc*(1-pc)*(1/n1 + 1/n2))

    if alternative=='two-sided':
        z_manual = -np.abs(p1-p2)/std_err
        p_manual = 2*ss.norm.cdf(z_manual)
    if alternative=='smaller': # significant if p1 < p2
        z_manual = (p1-p2)/std_err
        p_manual = ss.norm.cdf(z_manual)
    if alternative=='larger': # significant if p1 > p2
        z_manual = (p1-p2)/std_err
        p_manual = ss.norm.sf(z_manual)
        
    return z_manual, p_manual

In [9]:
print("one-sided (manual):")
z_manual, p_manual = prop_ztest(n1, p1, n2, p2, alternative='smaller')
print(f"z statistics = {z_manual}; p value = {p_manual}")

print("")
print("two-sided (manual):")
z_manual, p_manual = prop_ztest(n1, p1, n2, p2, alternative='two-sided')
print(f"z statistics = {z_manual}; p value = {p_manual}")

one-sided (manual):
z statistics = -6.301791760015639; p value = 1.4711214992766808e-10

two-sided (manual):
z statistics = -6.301791760015639; p value = 2.9422429985533616e-10


### Statistical Power

Reference: *https://towardsdatascience.com/required-sample-size-for-a-b-testing-6f6608dd330a*

Benchmark: *https://www.stat.ubc.ca/~rollin/stats/ssize/b2.html*

In [10]:
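n1, n2 = len(ar1), len(ar2)
# NOTE: alpha is set to the observed two-sided p-value, so the critical value equals
# the observed |z| and the power comes out at ~0.5 by construction.
# For a meaningful estimate, set alpha to the planned significance level (e.g. 0.05).
alpha = p_manual

ratio = n2/n1
p_bar = (p1 + ratio*p2) / (1 + ratio)
q_bar = 1 - p_bar

numerator = np.sqrt(n1*(p1-p2)**2) - np.sqrt(p_bar*q_bar*(1+1/ratio)) * ss.norm.ppf(1-alpha/2)
denominator = np.sqrt(p1*q1 + p2*q2/ratio) # q1, q2 still hold the values set in In [7]
z = numerator/denominator
power = ss.norm.cdf(z)

print(f"Statistical Power = {power}")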

Statistical Power = 0.49999997912111893
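
Power is usually evaluated at a significance level fixed before the test. The sketch below wraps the same formula in a function with `alpha` as an input, so that power can be tabulated against the sample size. The name `proportion_power` and the inputs (true rates 0.15 and 0.18) are assumptions for illustration.

```python
from statistics import NormalDist

def proportion_power(p1, p2, n1, alpha=0.05, ratio=1.0):
    """Approximate power of a two-sided two-proportion z-test (n2 = ratio * n1)."""
    norm = NormalDist()
    q1, q2 = 1 - p1, 1 - p2
    p_bar = (p1 + ratio * p2) / (1 + ratio)
    q_bar = 1 - p_bar
    numerator = (n1 * (p1 - p2) ** 2) ** 0.5
    numerator -= (p_bar * q_bar * (1 + 1 / ratio)) ** 0.5 * norm.inv_cdf(1 - alpha / 2)
    denominator = (p1 * q1 + p2 * q2 / ratio) ** 0.5
    return norm.cdf(numerator / denominator)

for n in (500, 1000, 2402, 5000):
    print(f"n1 = {n:>4}: power = {proportion_power(0.15, 0.18, n):.3f}")
```

At α = 0.05, the power reaches about 0.8 near n1 = 2402, which matches the minimum sample size derived in the next section.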


### Theoretical Minimum Number of Samples Required

Reference: *https://towardsdatascience.com/required-sample-size-for-a-b-testing-6f6608dd330a*

Benchmark: *https://www.stat.ubc.ca/~rollin/stats/ssize/b2.html*
- Yang Xi: a much larger sample size is required in practice

In [11]:
p1 = 0.15 # measured
p2 = 0.18 # expected
alpha, power = 0.05, 0.8 # required
ratio = 1 # required; ratio = n2 / n1

q1, q2 = 1-p1, 1-p2
p_bar = (p1 + ratio*p2) / (1 + ratio)
q_bar = 1 - p_bar

numerator = np.sqrt(p_bar*q_bar*(1+1/ratio)) * ss.norm.ppf(1-alpha/2)
numerator += np.sqrt(p1*q1 + p2*q2/ratio) * ss.norm.ppf(power)
numerator = numerator**2
n1 = numerator/(p1-p2)**2
print(f"n1 = {n1}; n2 = {n1*ratio}")

n1 = 2401.8860715204223; n2 = 2401.8860715204223
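
Because the required sample size scales with the inverse square of the expected difference, small uplifts get expensive quickly. The sketch below wraps the formula above in a function and rounds up to whole users. The name `required_n1` and the list of uplifts are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def required_n1(p1, p2, alpha=0.05, power=0.8, ratio=1.0):
    """Minimum n1 for a two-sided two-proportion z-test (n2 = ratio * n1)."""
    norm = NormalDist()
    q1, q2 = 1 - p1, 1 - p2
    p_bar = (p1 + ratio * p2) / (1 + ratio)
    q_bar = 1 - p_bar
    root = (p_bar * q_bar * (1 + 1 / ratio)) ** 0.5 * norm.inv_cdf(1 - alpha / 2)
    root += (p1 * q1 + p2 * q2 / ratio) ** 0.5 * norm.inv_cdf(power)
    return ceil(root ** 2 / (p1 - p2) ** 2)

baseline = 0.15
for uplift in (0.03, 0.02, 0.01):  # hypothetical absolute uplifts
    n_needed = required_n1(baseline, baseline + uplift)
    print(f"uplift = {uplift:.2f}: n1 = {n_needed}")
```

Cutting the expected uplift from 0.03 to 0.01 increases the required sample size by more than a factor of eight in this setting.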
