# AB Testing
Yang Xi <br>
20 Aug, 2021

# Introduction
*Reference: https://vwo.com/ab-testing/#a-b-testing-challenges*

## The Iterative Process of A/B Testing
- Research
- Logging Observations
- Formulating Hypothesis
- A/B Testing
- Deploying the Winning Version

<br><br>

## Frequentist vs. Bayesian
### Frequentist
- Use only data from current experiment
- Estimate mean and standard deviation

### Bayesian
- Incorporate prior knowledge from the previous experiments
- Probability of A beating B, as well as the range of the expected improvement
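
As a rough illustration of the Bayesian bullets above, the sketch below estimates the probability that B beats A for a conversion-rate test. The counts and the uniform Beta(1, 1) prior are assumptions made up for the example; with results from previous experiments, an informative prior would replace the two 1s.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data (illustrative only): conversions / visitors per variant
conv_a, n_a = 150, 1000
conv_b, n_b = 180, 1000

# With a Beta(1, 1) prior, the posterior of each conversion rate is
# Beta(1 + conversions, 1 + non-conversions). Draw samples from both posteriors.
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

# Probability that B's true rate exceeds A's, and a 95% credible interval of the lift
prob_b_beats_a = (post_b > post_a).mean()
lift = post_b - post_a
ci_low, ci_high = np.percentile(lift, [2.5, 97.5])

print(f"P(B > A) = {prob_b_beats_a:.3f}")
print(f"95% credible interval of (B - A) = [{ci_low:.4f}, {ci_high:.4f}]")
```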

<br><br>

## Challenges
- Decide **what to test**
- Formulate **hypothesis**
- Determine **sample size** (or test duration)
- **Interpret** test results – why did a variation stand out?
- Maintain a **testing culture** – A/B testing is iterative

### Type I and Type II Errors
- Type I: False Positive – the result looks significant when there is no real effect – usually the more severe error
    - Cannot be avoided entirely – use at least a 95% confidence level (a significance level of at most 0.05)
- Type II: False Negative – the result looks non-significant when there is a real effect – which can be discouraging
    - Improve statistical power (at least 0.8) – more samples, fewer variants, etc.
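
Both error rates can be checked by simulation. The sketch below is only an illustration: the means, standard deviation and sample size are assumed values, and `rejection_rate` is a hypothetical helper. It repeats a Welch's t-test many times, first with no real difference (every rejection is a Type I error) and then with a real difference (every non-rejection is a Type II error).

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)
alpha, n_sims, n = 0.05, 2000, 90  # assumed settings for the illustration

def rejection_rate(mean_b):
    """Share of simulated tests with p <= alpha for A ~ N(0.15, 0.05), B ~ N(mean_b, 0.05)."""
    a = rng.normal(0.15, 0.05, size=(n_sims, n))
    b = rng.normal(mean_b, 0.05, size=(n_sims, n))
    _, p = ss.ttest_ind(a, b, axis=1, equal_var=False)  # one test per row
    return (p <= alpha).mean()

type_1_rate = rejection_rate(0.15)  # A/A test: no real effect, rejections are false positives
power = rejection_rate(0.17)        # real effect: rejections are true positives
type_2_rate = 1 - power             # share of real effects that were missed

print(f"Type I error rate  = {type_1_rate:.3f} (should be close to alpha)")
print(f"Power              = {power:.3f}")
print(f"Type II error rate = {type_2_rate:.3f}")
```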

<br><br>

## Common Mistakes & Key Notes
- An **invalid hypothesis** makes the test unlikely to succeed.<br>
This can happen if you copy someone else's success story without analyzing your own data.
- Testing **too many elements** together makes it difficult to pinpoint which element influenced the test’s success/failure the most.
- Focus on statistical significance rather than personal **opinions or gut feelings**.
- **Unbalanced traffic** will increase the chance of failure.
- **Running a test too long or too short** can result in failure.
- Try not to **change your experiment** settings / goals / designs in the middle of a test.
- A/B testing is an iterative process. Do not **stop testing** after the first success / failure.
- Tests should be run in **comparable periods**.

<br><br>

## Scale Up A/B Testing
- When doing many tests
    - Ensure none of the tests affect others or the overall (website’s) performance.
    - No more than two tests should overlap in any given week.
- More sophisticated testing methods
    - Split URL testing
    - Multivariate testing (MVT)
    - Multipage testing


# Distributions and Hypothesis Testing

Should A/B testing use **one-tailed or two-tailed tests**? <br>
This article covers the debate: *https://blog.analytics-toolkit.com/2017/one-tailed-two-tailed-tests-significance-ab-testing/*
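
To make the choice concrete, the sketch below runs the same Welch's t-test two-tailed and one-tailed on simulated data (the means and sizes are assumptions for illustration). It relies on the `alternative` argument of `scipy.stats.ttest_ind`, which needs SciPy 1.6 or later. When the observed difference is in the hypothesized direction, the one-tailed p-value is half the two-tailed one.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)

# Simulated metric for control and variant (illustrative values only)
control = rng.normal(0.15, 0.05, size=90)
variant = rng.normal(0.18, 0.05, size=90)

# Two-tailed: H1 is "variant differs from control" in either direction
_, p_two = ss.ttest_ind(variant, control, equal_var=False, alternative='two-sided')

# One-tailed: H1 is "variant is greater than control"
_, p_one = ss.ttest_ind(variant, control, equal_var=False, alternative='greater')

print(f"two-tailed p = {p_two:.3g}, one-tailed p = {p_one:.3g}")
```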

<br>

*More Reference:*
- *https://en.wikipedia.org/wiki/A/B_testing*
- *https://towardsdatascience.com/required-sample-size-for-a-b-testing-6f6608dd330a*

## Definitions
*Reference: https://machinelearningmastery.com/statistical-power-and-power-analysis-in-python/*

* p-value vs. Significance Level (α)
    * **p-value (p)**: Probability, assuming the null hypothesis is true, of obtaining a result equal to or more extreme than the one observed in the data.
    * **Significance Level (α)**: Threshold that the p-value is compared against to declare a finding statistically significant.
        * Commonly selected at α = 0.05
        * This means a 5% probability of a **Type I Error** when the null hypothesis is true
    * Reject the null hypothesis H0 if p <= α

<br>

* **Statistical Power (sensitivity)**: The probability that the test correctly rejects the null hypothesis when it is false.
    * Power = 1-β
    * β = probability of a **Type II Error** (false negative)
    * Commonly selected at Power = 0.8

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as ss
import statsmodels.stats.api as sms

## Gaussian Distribution
Gaussian Distribution can be observed in data like **"Average revenue per user"**.
- **Null hypothesis**: two means are equal (m1 = m2)
- **Welch's t-test** (standard): designed for comparing two normally distributed populations with **unequal variance**. 
- **Student's t-test** (alternative): designed for comparing two normally distributed populations with **equal variance**. 

In [2]:
np.random.seed(0)
ar1 = np.random.normal(loc=0.15, scale=0.05, size=90)
ar2 = np.random.normal(loc=0.18, scale=0.05, size=90)

t_Welch, p_Welch = ss.ttest_ind(ar1, ar2, equal_var=False) # default to two-tailed test
print(f"Welch's t-test: statistic = {t_Welch}, pvalue = {p_Welch}")

t_Student, p_Student = ss.ttest_ind(ar1, ar2, equal_var=True) # default to two-tailed test
print(f"Student's t-test: statistic = {t_Student}, pvalue = {p_Student}")

Welch's t-test: statistic = -5.143878201186483, pvalue = 7.072288064264301e-07
Student's t-test: statistic = -5.143878201186483, pvalue = 7.055312345768804e-07


### T-test Formulation (manual calculation)
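
For reference, the standard textbook forms of the two statistics are written out below. The notation is an assumption of this note: $m_i$, $s_i^2$ and $n_i$ are the sample mean, the unbiased sample variance (`ddof=1`) and the size of group $i$, and $T_\nu$ is a t-distributed variable with $\nu$ degrees of freedom.

```latex
% Welch's t-test (unequal variances)
\[
t_{\mathrm{Welch}} = \frac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu_{\mathrm{Welch}} =
\frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
     {\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
\]

% Student's t-test (equal variances, pooled standard deviation s_p)
\[
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}},
\qquad
t_{\mathrm{Student}} = \frac{m_1 - m_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
\nu_{\mathrm{Student}} = n_1 + n_2 - 2
\]

% Two-tailed p-value for either test
\[
p = 2 \, P\!\left(T_{\nu} \ge |t|\right)
\]
```

With equal group sizes the two statistics coincide, so only the degrees of freedom (and therefore the p-value) can differ.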

In [3]:
n1, n2 = len(ar1), len(ar2)
m1, m2 = ar1.mean(), ar2.mean()
var1, var2  = ar1.var(ddof=1), ar2.var(ddof=1)

dof = (var1/n1 + var2/n2)**2 / ((var1**2)/(n1**2)/(n1-1) + (var2**2)/(n2**2)/(n2-1))
print(f"degree of freedom = {dof}")

t_Welch_manual = (m1 - m2) / np.sqrt(var1/n1 + var2/n2)
p_Welch_manual = ss.t.sf(abs(t_Welch_manual), df=dof)*2
print(f"Welch's t-test: statistic = {t_Welch_manual}, pvalue = {p_Welch_manual}")

t_Student_manual = (m1 - m2)/ (np.sqrt(((n1-1)*var1 + (n2-1)*var2) / (n1+n2-2)) * np.sqrt(1/n1 + 1/n2))
dof_Student = n1 + n2 - 2 # Student's t-test uses the pooled degrees of freedom, not the Welch value
p_Student_manual = ss.t.sf(abs(t_Student_manual), df=dof_Student)*2
print(f"Student's t-test: statistic = {t_Student_manual}, pvalue = {p_Student_manual}")

degree of freedom = 177.51519061571628
Welch's t-test: statistic = -5.143878201186483, pvalue = 7.072288064264301e-07
Student's t-test: statistic = -5.143878201186483, pvalue = 7.055312345768804e-07


### Statistical Power

In [4]:
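std = np.sqrt((var1+var2)/2)
effect_size = np.abs((m2 - m1) / std)
ratio = n2 / n1
alpha = 0.05 # significance level assumed for the power calculation

# Use statsmodels (keyword arguments, because the third positional parameter of power() is alpha, not ratio)
analysis = sms.TTestIndPower()
power = analysis.power(effect_size=effect_size, nobs1=n1, alpha=alpha, ratio=ratio, df=dof, alternative='two-sided')

# Manual calculation (central-t approximation): standardized difference minus the critical value
t_crit = ss.t.ppf(1 - alpha/2, dof)
x = np.abs(m1-m2) / np.sqrt(var1/n1 + var2/n2) - t_crit
power_manual = ss.t.cdf(x, dof)

print(f"Statistical Power = {power:.3f}; manually calculated = {power_manual:.3f}")

Statistical Power = 0.999; manually calculated = 0.999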


### Theoretical Minimum Number of Samples Required
Benchmark: *https://www.stat.ubc.ca/~rollin/stats/ssize/n2.html*

Yang Xi: I prefer to use twice this number.

In [5]:
m1, std1 = 0.15, 0.05 # measured
m2, std2 = 0.18, 0.05 # expected
alpha, power = 0.05, 0.8 # required
ratio = 1 # required; ratio = n2 / n1

# Use statsmodels
std = np.sqrt((std1**2 + std2**2)/2)
effect_size = np.abs((m2 - m1) / std)
n1 = analysis.solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=ratio, alternative='two-sided')

# Manual
n1_manual = (std1**2 + std2**2) * (ss.norm.ppf(1-alpha/2) + ss.norm.ppf(power))**2 / (m1-m2)**2

print(f"n1 = {n1}, n2 = {n1*ratio}; manually calculated n1 = {n1_manual}, n2 = {n1_manual*ratio}")

n1 = 44.5857902590805, n2 = 44.5857902590805; manually calculated n1 = 43.604887413050506, n2 = 43.604887413050506
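
The manual line in the cell above is the usual normal-approximation formula for equal group sizes, written out here with the cell's numbers as a worked check. The symbol $z_q$ (the $q$-quantile of the standard normal distribution) is notation introduced for this note.

```latex
\[
n_1 = n_2 = \frac{\left(\sigma_1^2 + \sigma_2^2\right)\left(z_{1-\alpha/2} + z_{\mathrm{power}}\right)^2}{(m_1 - m_2)^2}
\]

% Worked example: sigma_1 = sigma_2 = 0.05, m_1 = 0.15, m_2 = 0.18, alpha = 0.05, power = 0.8
\[
n_1 \approx \frac{(0.0025 + 0.0025)\,(1.960 + 0.842)^2}{(0.03)^2}
    = \frac{0.005 \times 7.849}{0.0009}
    \approx 43.6
\]
```

The statsmodels figure is slightly larger because it solves with the t-distribution rather than the normal approximation.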


## Binomial Distribution
Binomial Distribution can be observed in data like **"conversion rate"** (converted / not converted).
- **Null hypothesis**: two proportions are equal (p1 = p2)
- **Fisher's exact test** (standard): Can be applied for all sample sizes.
- **Chi-square test of independence** (alternative): An alternative when numbers in the contingency table are large.
- **Barnard's test** (alternative): Claimed to work better for 2x2 contingency tables, but for larger tables the computation time increases and the power advantage decreases.
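
Barnard's test is available as `scipy.stats.barnard_exact` (SciPy 1.7 or later). The sketch below compares it with Fisher's exact test on a small 2x2 table. The counts are hypothetical and kept small on purpose, because Barnard's test becomes expensive on large tables.

```python
import numpy as np
import scipy.stats as ss

# Hypothetical small 2x2 table (illustrative only):
# rows = converted / not converted, columns = variant A / variant B
table = np.array([[7, 12],
                  [8, 3]])

# Barnard's exact test (unconditional)
res_barnard = ss.barnard_exact(table, alternative='two-sided')

# Fisher's exact test (conditional on the table margins) for comparison
_, p_fisher_small = ss.fisher_exact(table, alternative='two-sided')

print(f"Barnard's test: statistic = {res_barnard.statistic:.4f}, pvalue = {res_barnard.pvalue:.4f}")
print(f"Fisher's exact test: pvalue = {p_fisher_small:.4f}")
```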


In [6]:
np.random.seed(0)
p1, p2 = 0.15, 0.18
q1, q2 = 1-p1, 1-p2
ar1 = np.random.choice([0,1], size=4800, p=[q1, p1])
ar2 = np.random.choice([0,1], size=4800, p=[q2, p2])

contingency_table = np.array([[(ar1==1).sum(),(ar2==1).sum()],[(ar1==0).sum(), (ar2==0).sum()]])
print(contingency_table)

prior_odds_ratio, p_fisher = ss.fisher_exact(contingency_table, alternative='two-sided')
print(f"Fisher's exact test: prior odds ratio = {prior_odds_ratio}, pvalue = {p_fisher}")

chi2, p_chi2, dof, expected = ss.chi2_contingency(contingency_table, correction=False) # Pearson's chi-squared without Yates' continuity correction
## degrees of freedom (dof) = (rows-1) * (cols-1)
print(f"Chi-square test: chi2 = {chi2}, pvalue = {p_chi2}")

[[ 723  816]
 [4077 3984]]
Fisher's exact test: prior odds ratio = 0.8658182919967103, pvalue = 0.01047388250523832
Chi-square test: chi2 = 6.6928268444339984, pvalue = 0.009680159093701474


### Statistical Power

In [7]:
n1, n2 = len(ar1), len(ar2)
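alpha = p_fisher # observed p-value used as the significance level here; substitute the pre-specified level (e.g. 0.05) for a design-stage estimate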

ratio = n2/n1
p_bar = (p1 + ratio*p2) / (1 + ratio)
q_bar = 1 - p_bar

numerator = np.sqrt(n1*(p1-p2)**2) - np.sqrt(p_bar*q_bar*(1+1/ratio)) * ss.norm.ppf(1-alpha/2)
denominator = np.sqrt(p1*q1 + p2*q2/ratio)
z = numerator/denominator
power = ss.norm.cdf(z)

print(f"Statistical Power = {power}")

Statistical Power = 0.9193746662786612
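
Written as a formula, the cell above computes the following normal-approximation power for two proportions. The symbols $r$, $\Phi$ (standard normal CDF) and $z_{1-\alpha/2}$ (standard normal quantile) are notation introduced for this note.

```latex
\[
r = \frac{n_2}{n_1}, \qquad q_i = 1 - p_i, \qquad
\bar{p} = \frac{p_1 + r\,p_2}{1 + r}, \qquad \bar{q} = 1 - \bar{p}
\]

\[
z = \frac{\sqrt{n_1\,(p_1 - p_2)^2} \;-\; z_{1-\alpha/2}\,\sqrt{\bar{p}\,\bar{q}\,\left(1 + \dfrac{1}{r}\right)}}
         {\sqrt{p_1 q_1 + \dfrac{p_2 q_2}{r}}},
\qquad
\mathrm{Power} = \Phi(z)
\]
```

Setting $z = z_{\mathrm{power}}$ and solving the same expression for $n_1$ gives the minimum sample size for a required power.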


### Theoretical Minimum Number of Samples Required
Benchmark: *https://www.stat.ubc.ca/~rollin/stats/ssize/b2.html*

Yang Xi: I prefer to use twice this number.

In [8]:
p1 = 0.15 # measured
p2 = 0.18 # expected
alpha, power = 0.05, 0.8 # required
ratio = 1 # required; ratio = n2 / n1

q1, q2 = 1-p1, 1-p2
p_bar = (p1 + ratio*p2) / (1 + ratio)
q_bar = 1 - p_bar

numerator = np.sqrt(p_bar*q_bar*(1+1/ratio)) * ss.norm.ppf(1-alpha/2)
numerator += np.sqrt(p1*q1 + p2*q2/ratio) * ss.norm.ppf(power)
numerator = numerator**2
n1 = numerator/(p1-p2)**2
print(f"n1 = {n1}; n2 = {n1*ratio}")

n1 = 2401.8860715204223; n2 = 2401.8860715204223
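
A quick way to sanity-check this sample size is to simulate many experiments at that size and count how often a test detects the difference. The sketch below is only an illustration: it uses a pooled two-proportion z-test (an assumption, chosen because it is easy to vectorize, rather than Fisher's exact test) and rounds the group size up to 2402. The share of significant results should land near the required power of 0.8.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)
p1, p2, alpha = 0.15, 0.18, 0.05
n = 2402          # per-group sample size, rounded up from the formula's result
n_sims = 20_000   # number of simulated experiments

# Simulated number of conversions in each group for every experiment
x1 = rng.binomial(n, p1, size=n_sims)
x2 = rng.binomial(n, p2, size=n_sims)

# Pooled two-proportion z-test (two-sided), computed for all experiments at once
p_hat1, p_hat2 = x1 / n, x2 / n
p_pool = (x1 + x2) / (2 * n)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
z = (p_hat2 - p_hat1) / se
p_values = 2 * ss.norm.sf(np.abs(z))

# Empirical power: share of experiments that reject the null hypothesis
empirical_power = (p_values <= alpha).mean()
print(f"Empirical power at n = {n} per group: {empirical_power:.3f}")
```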


### More on Wikipedia
The A/B testing page on Wikipedia also lists:
- **Poisson Distribution** (transactions per paying user): E-test or C-test
- **Multinomial Distribution** (number of each product purchased): chi-squared test
- **Unknown distribution**: Mann-Whitney U test / Gibbs sampling
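
For the unknown-distribution case, a minimal sketch of the Mann-Whitney U test with `scipy.stats.mannwhitneyu` follows. The skewed metric is simulated from log-normal distributions with assumed parameters, purely for illustration.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)

# Hypothetical skewed metric (illustrative only), e.g. time spent per session
group_a = rng.lognormal(mean=0.0, sigma=1.0, size=500)
group_b = rng.lognormal(mean=0.5, sigma=1.0, size=500)

# Rank-based test: makes no normality assumption about the metric
u_stat, p_mwu = ss.mannwhitneyu(group_a, group_b, alternative='two-sided')
print(f"Mann-Whitney U test: statistic = {u_stat}, pvalue = {p_mwu:.3g}")
```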