Notes on AB_Testing. Inspired by: https://www.callboxinc.com/growth-hacking/math-behind-ab-testing-visual-guide/

![1.jpg](attachment:1.jpg)


Let’s say you wanted to see if changing the color of your call-to-action (CTA) button on your whitepaper landing page from red to green would impact the number of downloads. You then randomly split your traffic 50-50, with one half assigned to the page having the red-colored CTA (the control group) and the other half assigned to the page which has the green-colored CTA (the variation group).

After recording 500 unique visits for each page, you observe that the conversion rate (number of downloads as a percentage of page traffic) for the control group was $7\%$, while the conversion rate for the variation group was $9\%$. You may be tempted to conclude that changing the CTA’s color has a real impact on conversions. But before you accept the results as valid, you first need to carefully answer a number of questions about your findings, such as:

- Do I have enough samples (page views) for each of the two groups?

- How likely is it that I got the test results simply by chance?

- Is the difference between the conversion rates big enough to justify making the change?

- If I ran the test again and again, how confident am I that it’s going to give me similar results?

These are only a few of the things you need to think about when planning and carrying out A/B tests. Below, we’ll go over the mathematical/statistical tools to help us objectively answer each of these questions.

## Null Hypothesis Testing

When running A/B tests, we’re actually applying a process called null hypothesis testing (NHT). We compare the conversion rates of the two landing pages and test the null hypothesis that there is no difference between the two conversion rates (meaning the 2-percentage-point difference between the control’s $7\%$ and the variation’s $9\%$ simply happened by chance).

In A/B tests, the null hypothesis typically states that the change (or changes) you made to the page has no effect on conversions.

We reject the null hypothesis if the *p-value* is less than the *significance level* we set (more on this below). Rejecting the null hypothesis means our test shows evidence that there’s a “statistically significant” difference between the $7\%$ and $9\%$ conversion rates we saw earlier.

Having a “statistically significant” result in our A/B test indicates that the change we made to the landing page probably had an impact on the conversion rate.

## Significance Level and p-value

**The significance level** is the probability that your A/B test **incorrectly rejects a null hypothesis** that’s actually true (i.e., the chance that you conclude there’s an effect when there’s really none). In other words, the significance level is the probability of getting a false positive result (or a Type 1 error).

It’s up to you which significance level to use, but it is typically set to 5%. Having a 5% significance level means you’re willing to accept a 5% chance of a false positive result in your A/B test.

A related concept is the **p-value**. Statistics textbooks define the p-value as the *probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.*

If you get confused by the “assuming the null hypothesis was true” portion, think of it as simply assuming you ran a test that’s only made up of the control group (i.e., you made no variation).

Let’s say that in our landing page split-test example, we got a p-value of $3.2\%$ or 0.032. This means there’s a $3.2\%$ chance of getting at least a $9\%$ conversion rate for the green-buttoned landing page (the variation group), assuming that the variation’s conversion rate was the same as the control’s $7\%$ conversion rate.

Since we set the significance level at $5\%$, the p-value falls below the rejection threshold. This means a result as extreme as the $9\%$ conversion rate would be unlikely if the null hypothesis were true. This is taken as evidence against the null hypothesis, and so we reject it.

In other words, **the p-value simply tells us how surprising a given result is**. If it’s very surprising (i.e., p-value is less than the significance level), then it’s most likely safe to reject the null hypothesis.
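As a rough illustration of where a p-value comes from, here is a minimal sketch of a pooled two-proportion z-test using only the Python standard library. The function name `two_proportion_z_test` and the download counts (35 and 45, which is what $7\%$ and $9\%$ of 500 visits work out to) are assumptions made for this example, not part of the original guide.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, one_sided_p, two_sided_p) for H0: both pages share one conversion rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under H0 the best estimate of the common rate pools both groups.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    std_normal = NormalDist()
    one_sided = 1 - std_normal.cdf(z)              # P(Z >= z): variation beats control
    two_sided = 2 * (1 - std_normal.cdf(abs(z)))   # P(|Z| >= |z|): any difference
    return z, one_sided, two_sided

# Assumed counts: 7% and 9% of 500 visits are 35 and 45 downloads.
z, p_one, p_two = two_proportion_z_test(35, 500, 45, 500)
alpha = 0.05  # significance level
print(f"z = {z:.3f}, one-sided p = {p_one:.3f}, two-sided p = {p_two:.3f}")
print("reject H0" if p_two < alpha else "fail to reject H0")
```

With these assumed counts the p-value comes out well above 0.05, so the $3.2\%$ used in the text is best read as a hypothetical figure chosen to illustrate the decision rule.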

## Statistical Power

**Statistical power** refers to the probability that your A/B test will correctly reject a false null hypothesis. In plain English, **it’s the chance that your test detects a specific effect when an effect actually exists.**

A low-power A/B test will be less likely to pick out an effect than a high-power test. The higher the statistical power, the lower the chance that your test makes a Type 2 error (failing to reject a false null hypothesis or false negative).

According to ConversionXL, A/B tests follow an $80\%$ power standard. To improve your test’s statistical power, you need to increase the sample size, increase the effect size, or extend the test’s duration.
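To make the link between sample size and power concrete, the sketch below approximates the power of a two-sided two-proportion z-test with equal group sizes, using the normal approximation. The helper name `approx_power` and the choice of true rates ($7\%$ and $9\%$, borrowed from the running example) are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_control, p_variation, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test with equal group sizes."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    p_bar = (p_control + p_variation) / 2
    # Standard error of the difference under H0 (pooled) and under the alternative.
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = sqrt((p_control * (1 - p_control)
                   + p_variation * (1 - p_variation)) / n_per_group)
    diff = abs(p_variation - p_control)
    # Probability that the observed difference clears the critical value
    # (the negligible chance of rejecting in the wrong direction is ignored).
    return std_normal.cdf((diff - z_crit * se_null) / se_alt)

for n in (500, 1000, 3000, 5000):
    print(n, round(approx_power(0.07, 0.09, n), 3))
```

If the true rates really were $7\%$ and $9\%$, 500 visits per page would give a power far below the $80\%$ standard under this approximation, while a few thousand visits per page would reach it.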

## Effect Size

For your A/B tests to be actionable and useful, you need to determine not only whether a given variation has an effect, but also how large that effect is. The significance level, p-value, and statistical power are only the starting point. You also need to analyze the effect size.

In our earlier example, the effect size is the absolute difference between the two groups’ conversion rates (2 percentage points). We may also express the effect size in units of standard deviation.

It’s important to estimate and/or compute the effect size in an A/B test. Estimating the effect size at the start of a test helps you determine the sample size and statistical power, while reporting the effect size after the experiment allows you to make more informed decisions about the variations you’re analyzing.
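The sketch below shows one common way an expected effect size feeds into planning: a normal-approximation formula for the number of visits needed per group. The function name `sample_size_per_group`, the default $5\%$ significance level and $80\%$ power, and the use of the example's $7\%$ and $9\%$ as the expected rates are all assumptions for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p_control, p_variation, alpha=0.05, power=0.80):
    """Visits needed per group for a two-sided two-proportion z-test (normal approximation)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = std_normal.inv_cdf(power)           # quantile matching the desired power
    p_bar = (p_control + p_variation) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_control * (1 - p_control)
                                 + p_variation * (1 - p_variation)))
    return ceil((numerator / (p_variation - p_control)) ** 2)

absolute_lift = 0.09 - 0.07           # 2 percentage points
relative_lift = absolute_lift / 0.07  # lift relative to the control rate
n_needed = sample_size_per_group(0.07, 0.09)
print(f"absolute lift = {absolute_lift:.3f}, relative lift = {relative_lift:.1%}")
print(f"visits needed per group: {n_needed}")
```

Halving the expected effect (for example, $7\%$ versus $8\%$) increases the required sample size sharply, which is why the effect size estimate matters before the test starts.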

## Confidence Intervals

The $7\%$ and $9\%$ conversion rates from our earlier example are called point estimates (i.e., each of them corresponds to a single estimated number). But, since these values have only been estimated from samples, they may or may not coincide with the true conversion rates for each group.

That’s why you also need to build **confidence intervals** for your estimated conversion rates. *Confidence intervals measure the reliability of an estimate by specifying the range of likely values where the true conversion rate will probably be found.*

For example, here’s how we would most likely report a confidence interval for the variation’s conversion rate: “We are $95\%$ confident that the true conversion rate for the green-colored landing page is $9\% \pm 2\%$.”

In this example, we’re saying that given the test results we have, our best estimate for the tweaked landing page’s conversion rate is $9\%$ and that we’re $95\%$ confident that the true conversion rate lies within $7\%$ to $11\%$. The “$\pm 2\%$” value is called the margin of error.

Since we’ve also made a point estimate of the control group’s conversion rate, we need to construct a separate confidence interval for it. If we find, for example, that a $95\%$ confidence interval for the control group’s conversion rate overlaps with the other landing page’s confidence interval, we may need to keep testing to arrive at a statistically valid result.

Keep in mind that, in general, the larger the sample size, the narrower the confidence interval becomes (since more samples mean a more reliable estimate).
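The following is a minimal sketch of the overlap check and of the effect of sample size, using the normal-approximation (Wald) interval for a proportion. The helper `proportion_ci` and the counts of 35 and 45 downloads out of 500 visits are assumptions derived from the example's $7\%$ and $9\%$ rates.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(conversions, visits, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a conversion rate."""
    p_hat = conversions / visits
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / visits)  # margin of error
    return p_hat - margin, p_hat + margin

control = proportion_ci(35, 500)    # assumed: 7% of 500 visits
variation = proportion_ci(45, 500)  # assumed: 9% of 500 visits
overlap = control[1] >= variation[0] and variation[1] >= control[0]
print(f"control:   {control[0]:.4f} to {control[1]:.4f}")
print(f"variation: {variation[0]:.4f} to {variation[1]:.4f}")
print("intervals overlap" if overlap else "intervals do not overlap")

# Same 9% rate observed on larger samples: the interval narrows.
widths = []
for visits in (500, 2000, 8000):
    low, high = proportion_ci(round(0.09 * visits), visits)
    widths.append(high - low)
    print(visits, f"width = {high - low:.4f}")
```

With these assumed counts the two intervals overlap at 500 visits per page, and the interval width roughly halves each time the sample size is quadrupled.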

## Math of AB testing

Mathematically, each visit is a Bernoulli trial (conversion or non-conversion), so the number of conversions in `n` visits is a binomial random variable. Let’s call the true **conversion rate** `p`. Our job is to estimate the value of `p`, and for that we run `n` trials (that is, observe `n` visits to the website). After observing those `n` visits, we count how many of them resulted in a conversion. That fraction (which we express from 0 to 1 instead of $0\%$ to $100\%$) is the estimated conversion rate of your website.

Now imagine that you repeat this experiment multiple times. Due to chance, you will very likely calculate a slightly different value of `p` each time. Collecting all those (different) values gives you a range for the conversion rate (which is what we want for the next step of the analysis). To avoid running repeated experiments, statistics has a neat trick in its toolbox: the standard error, which tells you how much deviation from the average conversion rate `(p)` to expect if the experiment were repeated many times. The smaller the deviation, the more confident you can be in your estimate of the true conversion rate. For a given conversion rate `(p)` and number of trials `(n)`, the standard error is calculated as:

$$SE = \frac{std}{\sqrt{n}} =  \frac{\sigma}{\sqrt{n}}$$
 
If each visit is a Bernoulli trial with conversion probability `p`, the standard deviation of a single outcome can be computed analytically as the square root of the variance $Var = p * (1-p)$. Hence, one can rewrite the above equation as:

$$SE = \sqrt{\frac{p * (1-p)}{n}}$$ 
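The repeated-experiment idea can be checked with a small simulation. The sketch below is only an illustration under assumed values (a true rate of 0.07, 500 visits per experiment, 1000 repetitions, and a fixed random seed); it compares the spread of the simulated conversion rates with the analytic standard error.

```python
import random
from math import sqrt
from statistics import pstdev

def simulate_conversion_rates(p_true, n_visits, n_experiments, seed=0):
    """Repeat the experiment many times and return the observed conversion rates."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_experiments):
        # Each visit converts independently with probability p_true.
        conversions = sum(rng.random() < p_true for _ in range(n_visits))
        rates.append(conversions / n_visits)
    return rates

p_true, n_visits = 0.07, 500  # assumed values for illustration
rates = simulate_conversion_rates(p_true, n_visits, n_experiments=1000)
empirical_se = pstdev(rates)                            # spread across repeated experiments
analytic_se = sqrt(p_true * (1 - p_true) / n_visits)    # SE formula from the text
print(f"empirical SE = {empirical_se:.4f}, analytic SE = {analytic_se:.4f}")
```

The two numbers should agree closely, which is why a single experiment plus the formula can stand in for many repeated experiments.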


### Standard deviation

A few words about the standard deviation in the formulas above. The formula for the standard error assumes the **population** standard deviation, which can be computed as:

$$ \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i} - \mu)^{2}}$$
where $\sigma$ - **population** std, $\mu$ - **population** mean, $N$ - population size

However, sometimes our data is only a **sample of the whole population**. Luckily, we can still estimate the standard deviation, but when we use the sample as an estimate of the whole population, the formula changes to this:

$$ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i} - \bar{x})^{2}}$$
where $s$ - **sample** std, $\bar{x}$ - **sample** mean, $n$ - number of samples
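Python's standard `statistics` module implements both versions: `pstdev` divides by the number of observations, while `stdev` divides by one less. The short sketch below applies them to an assumed data set of 35 conversions (1) and 465 non-conversions (0), matching the control group's $7\%$ rate.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

# Assumed data: 35 conversions and 465 non-conversions out of 500 visits.
outcomes = [1] * 35 + [0] * 465

p_hat = mean(outcomes)     # observed conversion rate
sigma = pstdev(outcomes)   # population formula: divides by the number of observations
s = stdev(outcomes)        # sample formula: divides by the number of observations minus 1

print(f"p_hat = {p_hat:.3f}")
print(f"population std = {sigma:.5f}, sqrt(p(1-p)) = {sqrt(p_hat * (1 - p_hat)):.5f}")
print(f"sample std     = {s:.5f}")
```

For 0/1 data the population formula reduces exactly to $\sqrt{p(1-p)}$, and with hundreds of observations the two versions differ only slightly.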

### Confidence Interval

One use of the SE is to build **confidence intervals** for the unknown population mean. If the sampling distribution of the mean is approximately normal, the sample mean, the standard error, and the quantiles of the normal distribution can be used to calculate confidence intervals for the true population mean. The following expression gives the lower and upper limits of the (most commonly used) $95\%$ confidence interval, where $\bar{x}$ is the sample mean, $SE$ is the standard error of the sample mean, and 1.96 is the 0.975 quantile of the standard normal distribution:

$$ CI = \bar{x}\space \pm \space(1.96*SE) $$

In probability and statistics, **1.96** is the approximate value of the **97.5th percentile** point of the standard normal distribution. $95\%$ of the area under a normal curve lies within roughly 1.96 standard deviations of the mean, and due to the central limit theorem, this number is therefore used in the construction of approximate $95\%$ confidence intervals.
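The 1.96 multiplier can be recovered from the inverse CDF of the standard normal distribution, and the same approach gives the multiplier for other confidence levels. The helper name `z_for_confidence` below is an assumption for this sketch.

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """Two-sided critical value: leaves (1 - confidence) / 2 in each tail."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> z = {z_for_confidence(level):.4f}")
```

For the $95\%$ level the upper tail is 0.025, so the function evaluates the 0.975 quantile, which is the 1.96 used above.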


![2.png](attachment:2.png)
Picture taken from Wikipedia.

$95\%$ of the area under the normal distribution lies within 1.96 (almost 2) standard deviations of the mean.

![3.svg](attachment:3.svg)



The prediction interval for any standard score $z$ corresponds numerically to $1 - (1 - \Phi_{\mu,\sigma^2}(z)) \cdot 2$.

For example, $\Phi(2) \approx 0.9772$, or $Pr(X \le \mu + 2\sigma) \approx 0.9772$, corresponding to a prediction interval of $1 - (1 - 0.97725) \cdot 2 = 0.9545 = 95.45\%$. Note that $0.9772$ by itself is not the probability of a symmetrical interval; it is merely the probability that an observation is less than $\mu + 2\sigma$. To compute the probability that an observation is within two standard deviations of the mean (small differences due to rounding):

$$Pr(\mu - 2\sigma \le X \le \mu + 2\sigma) = \Phi(2) - \Phi(-2) \approx 0.9772 - (1 - 0.9772) \approx 0.9545$$

This is related to the confidence intervals used in statistics, as shown above.
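The normal-distribution probabilities quoted in this section can be checked numerically with the standard library's `NormalDist`. This is a small sketch; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

phi = NormalDist().cdf  # CDF of the standard normal distribution

below_two_sigma = phi(2)             # P(X <= mu + 2*sigma)
within_two_sigma = phi(2) - phi(-2)  # P(mu - 2*sigma <= X <= mu + 2*sigma)
within_1_96_sigma = phi(1.96) - phi(-1.96)

print(f"P(X <= mu + 2 sigma)    = {below_two_sigma:.5f}")
print(f"P(within 2 sigma)       = {within_two_sigma:.4f}")
print(f"P(within 1.96 sigma)    = {within_1_96_sigma:.4f}")
```

The symmetric probability equals $1 - 2(1 - \Phi(2))$, which is the same expression used for the prediction interval above.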
