# 6. Experimentation Basics: Introduction to Power Analysis

*note: if you see any mistakes, please feel free to let me know so that I can improve the notebook!*

In this notebook, we will first discuss what the power of an experiment is. Then, we will cover power analysis, along with some examples and the math behind sample size calculation.

## Power of a Test

The power of a hypothesis test is the probability that the test correctly rejects the null hypothesis ($H_0$) when the alternative ($H_1$) is true.

$$\text{power}=\Pr(\text{reject }H_0|H_1\text{ is true})$$

In a binary hypothesis test, there are 4 different scenarios:
1. The null is true but you incorrectly reject the null: false positive (i.e., *falsely* thinking the test is *positive*), also known as Type I error ($\alpha$)
2. The null is false but you incorrectly do not reject the null: false negative, also known as Type II error ($\beta$)
3. The null is true and you correctly do not reject the null: true negative ($1-\alpha$)
4. The null is false and you correctly reject the null: true positive ($1-\beta$)

This can also be represented concisely in the table below:

| | The null ($H_0$) is true   | The null ($H_0$) is false |
|-|--|-|
|Test rejects $H_0$|$\alpha$|$1-\beta$|
|Test doesn't reject $H_0$|$1-\alpha$|$\beta$|

Intuitively, you want your A/B test to have both a low Type I error rate ($\alpha$) and a low Type II error rate ($\beta$). However, for a fixed sample size, lowering one tends to raise the other, which creates a trade-off. As an extreme example, suppose you rejected the null no matter what. You would then never fail to reject a false null, so $\beta=0$ (your false negative rate is 0). But you would also always reject a true null, so $\alpha=1$ (your false positive rate is 1).
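
To make the trade-off concrete, here is a small simulation sketch. It assumes a two-sample z-test with a known standard deviation, and the function name `rejection_rate` as well as every parameter value (group size of 50, a true difference of 0.5, the three $\alpha$ levels) are illustrative choices rather than anything prescribed above.

```python
import random
from statistics import NormalDist


def rejection_rate(true_diff, alpha, n=50, sigma=1.0, trials=1000, seed=0):
    """Fraction of simulated two-sample z-tests that reject H0 (sigma assumed known)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = (2 * sigma**2 / n) ** 0.5
    rejections = 0
    for _ in range(trials):
        control = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
        treatment = sum(rng.gauss(true_diff, sigma) for _ in range(n)) / n
        if abs((treatment - control) / se) > z_crit:
            rejections += 1
    return rejections / trials


for alpha in (0.01, 0.05, 0.20):
    # With true_diff=0 the null is true, so every rejection is a false positive.
    type_1 = rejection_rate(true_diff=0.0, alpha=alpha)
    # With true_diff=0.5 the null is false, so every non-rejection is a false negative.
    type_2 = 1 - rejection_rate(true_diff=0.5, alpha=alpha)
    print(f"alpha={alpha:.2f}  type I rate={type_1:.3f}  type II rate={type_2:.3f}")
```

As $\alpha$ is loosened, the simulated Type I rate should rise while the Type II rate falls.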

In binary classification, the power of a test is also called the *sensitivity*.

## Power Analysis

Power analysis is used extensively in A/B tests to calculate the minimum sample size required to reasonably detect an effect of a given size. For example, if you would only ship a feature if it moved your primary metric by 1%, you would want to know how many samples it takes to detect a 1% lift if there actually is one.

There are lots of online tools such as [this one](https://www.evanmiller.org/ab-testing/sample-size.html).

The concept of power can also be used to compare different test procedures (such as deciding between a parametric and a nonparametric test for the same hypothesis).

While you can use the calculator linked above, a common rule of thumb is that the sample size $n$ per group for a two-sided two-sample $t$-test with 80% power ($\beta=0.2$) and $\alpha=0.05$ (the conventional defaults) should be:

$$n=16\frac{s^2}{d^2}$$

where $s^2$ is an estimate of the population variance and $d=\mu_1-\mu_2$ the to-be-detected difference in the mean values of both samples. For a one-sample test, replace the 16 with 8.
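
As a quick sketch, the rule of thumb can be wrapped in a small helper. The function name `rule_of_thumb_n` and the example numbers ($s=10$, $d=1$ and $d=0.5$) are made up for illustration.

```python
import math


def rule_of_thumb_n(s, d, two_sample=True):
    """Per-group sample size for 80% power at alpha=0.05 using n = 16 s^2 / d^2 (8 for one-sample)."""
    factor = 16 if two_sample else 8
    return math.ceil(factor * s**2 / d**2)


print(rule_of_thumb_n(s=10.0, d=1.0))  # 1600
print(rule_of_thumb_n(s=10.0, d=0.5))  # 6400
```

Halving the detectable difference quadruples the required sample size, because $d$ enters the formula squared.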

The power will depend on 3 major factors:
1. The statistical significance criterion ($\alpha$) used for the test: being less conservative and increasing $\alpha$ will generally increase your power. However, this raises the risk of a Type I error, and without substantive justification analysts typically stick with $\alpha=0.05$.
2. The magnitude of the effect you are interested in: to detect a smaller effect, you'll need more samples. In the rule of thumb, a decrease in $d^2$ increases $n$.
3. Sample size: more samples boost the power of a test because the estimate of the difference becomes less noisy. More data is usually better in a hypothesis test.
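
The three factors can be explored numerically with a rough sketch. It assumes the large-sample normal approximation for a two-sided two-sample test with equal group sizes and equal variances; `approx_power` and the baseline values are hypothetical.

```python
from statistics import NormalDist


def approx_power(d, s, n, alpha=0.05):
    """Approximate power of a two-sided two-sample test with n observations per group."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    standardized_shift = abs(d) / (2 * s**2 / n) ** 0.5
    # Ignores the negligible chance of rejecting in the wrong direction.
    return NormalDist().cdf(standardized_shift - z_crit)


baseline = approx_power(d=1.0, s=10.0, n=1600)
print(f"baseline:        {baseline:.3f}")
print(f"larger alpha:    {approx_power(d=1.0, s=10.0, n=1600, alpha=0.10):.3f}")
print(f"smaller effect:  {approx_power(d=0.5, s=10.0, n=1600):.3f}")
print(f"more samples:    {approx_power(d=1.0, s=10.0, n=3200):.3f}")
```

Relative to the baseline, a larger $\alpha$ or more samples raises the power, while a smaller effect lowers it.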

There are also other rules of thumb; for example, for a fixed total sample size, splitting traffic evenly between control and treatment generally gives more power than an uneven split.
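
A short sketch of why even splits help, assuming equal variances in both groups and the same normal approximation. The helper `power_with_split` and the total of 3,200 users are illustrative assumptions.

```python
from statistics import NormalDist


def power_with_split(d, s, n_control, n_treatment, alpha=0.05):
    """Approximate two-sided power when the two groups have different sizes."""
    se = (s**2 / n_control + s**2 / n_treatment) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(d) / se - z_crit)


# Same total of 3,200 users, split 50/50, 70/30, and 90/10.
for n_c, n_t in ((1600, 1600), (2240, 960), (2880, 320)):
    print(n_c, n_t, round(power_with_split(d=1.0, s=10.0, n_control=n_c, n_treatment=n_t), 3))
```

The standard error $\sqrt{s^2/n_c+s^2/n_t}$ is smallest when $n_c=n_t$, so the even split has the highest power.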

Next, we'll go through the math that results in the rule-of-thumb above.

## Sample Size Calculation

When scouring the internet for the math behind sample size calculation, the derivation can be a little difficult to find, as the top sites tend to present a formula but rarely elaborate on where it comes from.

Generally speaking, the procedure is to take the formula for calculating Power and do some algebra to isolate $n$ so that you calculate the sample size given various variables.

We'll follow this handy [link](https://www.youtube.com/watch?v=JEAsoUrX6KQ).

The video starts by assuming that you are performing a two-sample $t$-test, which is one of the more common tests in an A/B test setting.

$$H_0:\mu_c=\mu_t$$
$$H_1:\mu_c\neq\mu_t$$

The null and alternative hypotheses look pretty typical. $\mu_c$ represents the mean of the $c$ontrol and $\mu_t$ the mean of the $t$reatment.

To justify the formulation of the test statistic, the video claims that under the Central Limit Theorem (e.g., sufficiently high sample sizes, which is not unreasonable for an A/B test), we have:

$$\bar x_c-\bar x_t\sim\mathcal{N}\left(\mu_c-\mu_t,\frac{2\sigma^2}{n}\right)$$

While the mean may seem familiar from previous notebooks, you may wonder how the variance is derived. My assumption is that it relies on the following 2 properties:
1. $s$, the estimate of the standard deviation, converges to $\sigma$
2. as assumed by the classical $t$-test, the variance and sample size are equal in both groups.

Then, the [sampling distribution](https://en.wikipedia.org/wiki/Sampling_distribution) of the difference between the sample means of two independent normal distributions has variance $\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}$, which simplifies to $\frac{2\sigma^2}{n}$ when $\sigma_1=\sigma_2=\sigma$ and $n_1=n_2=n$.
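
The variance of the difference can be sanity-checked with a quick simulation. This is only a sketch: the values $n=50$, $\sigma=2$, and 4,000 repetitions are arbitrary, and both groups are drawn from the same normal distribution.

```python
import random
from statistics import variance

rng = random.Random(42)
n, sigma, trials = 50, 2.0, 4000

diffs = []
for _ in range(trials):
    # One simulated experiment: a sample mean for each group, then their difference.
    xbar_c = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
    xbar_t = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
    diffs.append(xbar_c - xbar_t)

empirical_var = variance(diffs)
theoretical_var = 2 * sigma**2 / n  # 2 * 4 / 50 = 0.16

print(f"empirical:   {empirical_var:.4f}")
print(f"theoretical: {theoretical_var:.4f}")
```

The two numbers should agree closely, with the gap shrinking as the number of repetitions grows.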

Then we get the test statistic,

$$Z=\frac{(\bar x_c-\bar x_t)-(\mu_c-\mu_t)}{\sqrt{2\sigma^2/n}}\sim\mathcal{N}(0,1)$$

Before the next part, remember that $\beta$ represents the Type II error rate, the probability of falsely not rejecting the null. In other words,

$$\beta=P(\text{fail to reject }H_0|H_1\text{ is true})$$

Also, we know that we would reject the null if $|Z| > z_{1-\alpha/2}$, where $Z$ is the test statistic computed under the null (so that $\mu_c-\mu_t=0$). Given $\alpha=0.05$, we know from the previous notebook that $z_{1-\alpha/2}\approx 1.96$.
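
As a minimal sketch of the rejection rule, the critical value and the decision can be computed with Python's standard library. The helper names and the example means (10.0 vs. 10.5, $\sigma=2$, $n=200$) are hypothetical, and $\sigma$ is treated as known.

```python
from statistics import NormalDist


def z_statistic(xbar_c, xbar_t, sigma, n):
    """Test statistic under the null, with n observations per group."""
    return (xbar_c - xbar_t) / (2 * sigma**2 / n) ** 0.5


def reject_null(z, alpha=0.05):
    """Two-sided decision: reject when |Z| exceeds the (1 - alpha/2) quantile."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > z_crit


print(round(NormalDist().inv_cdf(1 - 0.05 / 2), 2))  # 1.96

z = z_statistic(xbar_c=10.0, xbar_t=10.5, sigma=2.0, n=200)
print(round(z, 2), reject_null(z))  # -2.5 True
```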

So in this case,

$$Z=\frac{(\bar x_c-\bar x_t)}{\sqrt{2\sigma^2/n}}\sim\mathcal{N}(0,1)$$

Thus, the Type II error rate is:

$$\beta=P(|Z|< z_{1-\alpha/2}|\mu_c-\mu_t\neq 0)$$

To calculate this, we can utilize $Z$ from earlier but we have to add in the $\mu_c-\mu_t$ term because we are no longer assuming the null. We also want to remove the absolute value.

$$=P\left(-z_{1-\alpha/2}-\frac{\mu_c-\mu_t}{\sqrt{2\sigma^2/n}}\leq\frac{(\bar x_c-\bar x_t)-(\mu_c-\mu_t)}{\sqrt{2\sigma^2/n}}\leq z_{1-\alpha/2}-\frac{\mu_c-\mu_t}{\sqrt{2\sigma^2/n}}\right)$$

Even without assuming the null, the middle term still follows a standard normal distribution. Thus,

$$=\Phi\left(z_{1-\alpha/2}-\frac{\mu_c-\mu_t}{\sqrt{2\sigma^2/n}}\right)-\Phi\left(-z_{1-\alpha/2}-\frac{\mu_c-\mu_t}{\sqrt{2\sigma^2/n}}\right)$$

Since the result is symmetric in whether $\mu_c$ is greater or less than $\mu_t$, we focus on just one case and arbitrarily assume $\mu_c> \mu_t$.

The video then focuses on the second $\Phi(\cdot)$ term (the one being subtracted) and claims that it can be treated as approximately $0$. This is reasonable: $\Phi(-z_{1-\alpha/2})=\Phi(-1.96)\approx 0.025$ (that is, $\alpha/2$), and subtracting $\frac{\mu_c-\mu_t}{\sqrt{2\sigma^2/n}}$, which is positive because we assumed $\mu_c > \mu_t$, pushes the argument further into the left tail, so the term can only be smaller than $0.025$.
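
To see how small the dropped term is in practice, here is a sketch that evaluates both $\Phi(\cdot)$ terms for one made-up scenario ($\mu_c-\mu_t=1$, $\sigma=10$, $n=1570$ per group). The function name `type_ii_terms` is an illustrative choice.

```python
from statistics import NormalDist


def type_ii_terms(delta, sigma, n, alpha=0.05):
    """Return the two Phi terms whose difference is the Type II error rate."""
    phi = NormalDist().cdf
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = delta / (2 * sigma**2 / n) ** 0.5
    first = phi(z_crit - shift)    # the term that is kept
    second = phi(-z_crit - shift)  # the term treated as approximately 0
    return first, second


first, second = type_ii_terms(delta=1.0, sigma=10.0, n=1570)
print(f"kept term:    {first:.4f}")
print(f"dropped term: {second:.2e}")
print(f"beta:         {first - second:.4f}")
```

In this scenario the dropped term is several orders of magnitude smaller than the kept one, so ignoring it barely changes $\beta$.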


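In addition, the video uses the property that $\beta=\Phi(-z_{1-\beta})$. This follows from the definition of the quantile, $\Phi(z_{1-\beta})=1-\beta$, together with the symmetry of the standard normal, $\Phi(-z)=1-\Phi(z)$. Recall that $1-\beta$ is the power.

Thus, for the Type II error rate we are left with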

$$\Phi(-z_{1-\beta})=\Phi\left(z_{1-\alpha/2}-\frac{\mu_c-\mu_t}{\sqrt{2\sigma^2/n}}\right)$$

Taking $\Phi^{-1}(\cdot)$ of both sides, we are left with

$$-z_{1-\beta}=z_{1-\alpha/2}-\frac{\mu_c-\mu_t}{\sqrt{2\sigma^2/n}}$$

Then,

$$\frac{\mu_c-\mu_t}{\sqrt{2\sigma^2/n}}=z_{1-\alpha/2}+z_{1-\beta}$$

$$\left(\frac{\mu_c-\mu_t}{z_{1-\alpha/2}+z_{1-\beta}}\right)^2=\frac{2\sigma^2}{n}$$

$$n=\frac{2\sigma^2(z_{1-\alpha/2}+z_{1-\beta})^2}{(\mu_c-\mu_t)^2}$$
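
The final formula translates directly into code. The sketch below assumes $\sigma$ is known (or well estimated) and rounds up to a whole number of observations; `sample_size_per_group` and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(sigma, diff, alpha=0.05, power=0.80):
    """n = 2 sigma^2 (z_{1-alpha/2} + z_{1-beta})^2 / diff^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)  # power = 1 - beta
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / diff**2)


print(sample_size_per_group(sigma=10.0, diff=1.0))  # 1570
```

Requesting more power or a smaller $\alpha$ increases the $z$ terms and therefore the required $n$.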

Assuming we want $\alpha=0.05$ and power of $1-\beta = 0.80$, this means that

$$(z_{1-0.05/2}+z_{1-0.2})^2\approx (1.96+0.842)^2\approx 7.85\approx 8$$

Substituting this back into the formula for $n$, the factor $2\cdot 8=16$ gives

$$n\approx 16\frac{s^2}{d^2}$$

where $s$ is the standard deviation estimate (converges to $\sigma$ as sample size grows) and $d$ is the difference we expect from our means, $\mu_c-\mu_t$.
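
As a final check, the constant in front of $s^2/d^2$ can be computed for other settings. In this sketch, `rule_constant` and its `variance_factor` argument (2 for the two-sample case, 1 for the one-sample case) are names introduced here for illustration.

```python
from statistics import NormalDist


def rule_constant(alpha=0.05, power=0.80, variance_factor=2):
    """Multiplier c in n = c * s^2 / d^2 for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return variance_factor * (z_alpha + z_beta) ** 2


print(f"two-sample, 80% power: {rule_constant():.2f}")
print(f"one-sample, 80% power: {rule_constant(variance_factor=1):.2f}")
print(f"two-sample, 90% power: {rule_constant(power=0.90):.2f}")
```

At 80% power the constants come out just under 16 and 8, which is why the rule of thumb rounds them up. At 90% power the two-sample constant rises to roughly 21.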



