# A/B Test

**Content**

1. Key Principles
2. Estimate Required Sample Size
3. Network Effect
4. Primacy and Novelty Effects

## 1. Key Principles

1. Control Type I Error
2. Minimize Type II Error

## 2. Estimate Required Sample Size

(1) Two Means

<center>
$sample\ size\ (n) = \frac{2{\sigma}^2}{{\Delta}^2} (Z_{\alpha/2} + Z_{\beta})^2 \approx \frac{16 {\sigma}^2}{{\Delta}^2}$
</center>

- $\sigma^2$ (sample variance; $\sigma$ is the standard deviation)
- $\Delta$ (difference between control and treatment)
- $\alpha$ (Type I error rate, i.e. false positive rate; 0.05 is the common value)
- $\beta$ (Type II error rate, i.e. false negative rate; 0.2 is the common value)

**[Derivation](https://www.youtube.com/watch?v=JEAsoUrX6KQ)**: Central limit theorem: the difference in sample means follows $\bar{X}_{treatment} - \bar{X}_{control} \sim N (\Delta, \frac{\sigma^2 + \sigma^2}{n})$
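As a rough illustration of the two-means calculation, the sketch below evaluates the per-group sample size using the variance $\frac{2\sigma^2}{n}$ from the derivation, i.e. $n = \frac{2\sigma^2}{\Delta^2}(Z_{\alpha/2} + Z_{\beta})^2$, and compares it with the $\frac{16\sigma^2}{\Delta^2}$ shortcut. The function name `sample_size_two_means` and the input values are assumptions made for illustration; only the Python standard library is used.

```python
import math
from statistics import NormalDist


def sample_size_two_means(sigma, delta, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided test on the difference of two means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # Z_{beta}, with power = 1 - beta
    n = 2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2
    return math.ceil(n)


# Hypothetical inputs: standard deviation 1.0, difference to detect 0.1
n_exact = sample_size_two_means(sigma=1.0, delta=0.1)
n_rule_of_thumb = 16 * 1.0**2 / 0.1**2  # the 16 * sigma^2 / delta^2 shortcut
print(n_exact, n_rule_of_thumb)
```

With $\alpha = 0.05$ and $\beta = 0.2$, $2(Z_{\alpha/2} + Z_{\beta})^2 \approx 15.7$, which is where the constant 16 in the shortcut comes from.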

(2) Two Proportions

<center>
$sample\ size\ (n) = \frac{(Z_{\alpha/2} \cdot \sqrt{2 \cdot \frac{(p_1 + p_2)}{2} \cdot (1 - \frac{(p_1 + p_2)}{2})} + Z_{\beta} \cdot \sqrt{p_1 \cdot (1 - p_1) + p_2 \cdot (1 - p_2)})^2}{|p_1 - p_2|^2}$
</center>

**Note**: $n$ is the sample size per group. Because an A/B test usually involves two groups (control and treatment), the total required sample size is $2n$

Reference: https://towardsdatascience.com/required-sample-size-for-a-b-testing-6f6608dd330a

```python
from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize as es

# prop1 (p1) = 0.3
# prop2 (p2) = 0.305
# alpha (Type I error)
# power (1-beta, 1-Type II error)
zt_ind_solve_power(effect_size=es(prop1=0.30, prop2=0.305), alpha=0.05, power=0.8, alternative="two-sided")
```

Output (sample size per group):

```
132482.80728417353
```
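For comparison, here is a minimal sketch that evaluates the closed-form two-proportions formula directly, using only the standard library. The function name `sample_size_two_proportions` is an assumption made for illustration. The result should land close to the `statsmodels` figure but not match it exactly, because `proportion_effectsize` works with an arcsine-transformed effect size rather than the raw difference $|p_1 - p_2|$.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided test on two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # Z_{beta}, with power = 1 - beta
    p_bar = (p1 + p2) / 2                          # pooled proportion
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return numerator / (p1 - p2) ** 2


n_per_group = sample_size_two_proportions(0.30, 0.305)
# Per-group size, then total across control and treatment (2n)
print(round(n_per_group), 2 * round(n_per_group))
```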

## 3. Network Effect

**SUTVA (Stable Unit Treatment Value Assumption)**: Every user's behavior is affected only by their treatment and NOT by the treatment of other users

SUTVA is the most basic assumption in any experimental design. However, SUTVA may not hold in products built around social connections (e.g. Facebook, Twitter), where users are influenced not only by the treatment they receive but also by other users and their reactions to different treatments.

**Spill-over Effect**: Users in the treatment and control groups can communicate with each other, so users in the control group end up affected by the treatment as well

**Possible Solutions** 

1. **Create control and treatment groups in different time windows**. Pros: eliminates the network effect. Cons: inconsistent user experience across time. More suitable for backend experiments, where users do not directly experience the change
2. **Create control and treatment groups in different spaces**: Combine with synthetic control to analyze the causal relationship
3. **Create control and treatment groups across both space and time**: Alternate the treatment over space and/or time to create diverse control/treatment pairs
4. **Apply treatment in a cluster-based randomized experiment**. Reference: [Detecting Network Effects: Randomizing Over Randomized Experiments](https://www.youtube.com/watch?v=1v5_CzdRVAc&t=116s)
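To make the cluster-based option more concrete, the sketch below randomizes at the cluster level so that connected users share the same arm. It is only an illustration under assumptions: the user-to-cluster mapping, the experiment name `exp_001`, and the helper `assign_cluster` are all hypothetical, and in practice the clusters would come from some graph-partitioning step that is not shown here.

```python
import hashlib


def assign_cluster(cluster_id, experiment="exp_001"):
    """Deterministically map a whole cluster to 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"


# Hypothetical user -> cluster mapping (e.g. the output of a community-detection step)
user_cluster = {"u1": "c1", "u2": "c1", "u3": "c2", "u4": "c2", "u5": "c3"}

# Every user inherits the arm of their cluster, so friends in the same
# cluster never end up split between control and treatment.
assignment = {user: assign_cluster(cluster) for user, cluster in user_cluster.items()}
print(assignment)
```

Hashing the experiment name together with the cluster ID keeps the assignment stable across runs while still giving different splits for different experiments.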

## 4. Primacy and Novelty Effects

**Primacy Effect**: users are resistant to change. For example, suppose we want to learn how a new version affects the CTR. If the new version is very different from the old one, experienced users may get confused and click (open) multiple links.

**Novelty Effect**: users are temporarily excited by new things. For example, existing users want to try out all the new functions, which leads to a short-lived increase in the metrics (e.g. CTR).

*Note*: Any increase or decrease in the metric caused by primacy or novelty effects dies out quickly, usually within days
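One simple way to check whether such an effect is present is to look at how the lift between treatment and control changes over the course of the experiment. The sketch below does this with made-up daily CTR values; the numbers, the three-day windows, and the variable names are all assumptions for illustration, not results from a real experiment.

```python
from statistics import mean

# Hypothetical daily CTRs for days 1..10 (made-up numbers, illustration only)
control_ctr = [0.100, 0.101, 0.099, 0.100, 0.102, 0.100, 0.099, 0.101, 0.100, 0.100]
treatment_ctr = [0.130, 0.122, 0.115, 0.110, 0.106, 0.104, 0.103, 0.104, 0.103, 0.103]

# Lift per day: treatment CTR minus control CTR
daily_lift = [t - c for t, c in zip(treatment_ctr, control_ctr)]

early_lift = mean(daily_lift[:3])   # first days, possibly inflated by novelty
late_lift = mean(daily_lift[-3:])   # later days, closer to the steady state

print(f"early lift: {early_lift:.4f}, late lift: {late_lift:.4f}")
```

A lift that shrinks noticeably from the early window to the late window hints at a novelty effect, while a lift that starts low or negative and then recovers hints at a primacy effect.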