# A/B Testing

**Outline**

* Statistics Concepts
    * [Hypothesis Testing](#ht)
        * Set hypotheses
            * Type I & II error, Power        
        * Test statistics
        * Make a decision
            * P-value
* A/B Testing
    * [Steps](#steps)
        * [Step 1: Start with value propositions and define metric](#step1)
        * [Step 2: Separate Traffic](#step2)
        * [Step 3: Hypothesis Testing](#step3)
    * [Determining Sample Size](#size)
    * [Determining Duration](#duration)
    * [Caveats for A/B Testing](#caveats)
* [Reference](#reference)
    

In [1]:
%load_ext watermark

import pandas as pd
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
import statsmodels.stats.api as sms

%watermark -a 'Johnny' -d -t -v -p pandas,numpy,statsmodels

Johnny 2018-01-06 15:52:48 

CPython 3.6.2
IPython 6.2.1

pandas 0.20.3
numpy 1.13.1
statsmodels 0.8.0




---

## <a id="ht">Hypothesis Testing</a>

Hypothesis testing is a process for testing claims about a population on the basis of a sample.

Here is a simple example explaining the concept of hypothesis testing that I found on Quora:

> Suppose, I've applied for a typing job and I've stated in my resume that my typing speed is over 60 words per minute on an average. My recruiter may want to test my claim. If he finds my claim to be acceptable, he will hire me otherwise reject my candidature. So he asked me to type a sample letter and found that my speed is 54 words a minute. Now, he can decide on whether to hire me or not.

Hypothesis testing helps us decide, using sample data, whether a claim about the population should be rejected or not.

Conducting a hypothesis test involves three steps:
1. **Set hypotheses**: what is the claim about the population that we want to test?
2. **Test statistics**: which statistic do we want to calculate from the sample we have?
3. **Make a decision**: decide whether to accept or reject the claim.

**1. Set hypotheses**

From the previous example, our claim will be whether "my typing speed is over 60 words per minute on an average" or not, i.e.,

$$\mu_{speed} \le 60 \quad \text{or} \quad \mu_{speed} > 60$$

The recruiter then wants to know which of the two he should believe, given the sample data. In hypothesis testing, the claim that contains the equality sign is called the **$H_0$ or Null Hypothesis**, and the one without it is called the **$H_1$ or Alternative Hypothesis**.

![](_pic/htable.png)

Since we don't know the actual typing speed (suppose there is a true value for my typing speed; we could probably come up with a better example), it is possible that the null hypothesis is actually true but we still reject it. When this happens, we say we are committing a **Type I error**. A **Type II error** occurs when the null hypothesis is false but we fail to reject it.

Here is an easy way to understand the concept:

![](_pic/pregant.jpg)

Another important term is **Power**. It is the probability of correctly rejecting $H_0$ when it is indeed false. The larger the difference between $\mu_{h1}$ and $\mu_{h0}$, the better our ability to correctly identify $H_0$ as false: when the sampling distribution of the sample mean under $H_1$ is further away from the one under $H_0$, the area that represents the power is bigger. We can see this clearly in the gif below.

![](_pic/zEffectSize.gif)
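
The relationship between effect size and power can also be checked numerically. The snippet below is a minimal sketch, assuming a one-sided z-test with a known standard deviation; the function name `power_one_sided_z` and the values `sigma=5` and `n=36` are illustrative assumptions, not part of the original example.

```python
import math
from statistics import NormalDist


def power_one_sided_z(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu <= mu0 when the true mean is mu1.

    Assumes a known standard deviation `sigma` (an illustrative simplification).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)            # rejection threshold on the z scale
    shift = (mu1 - mu0) / (sigma / math.sqrt(n))      # true effect in standard-error units
    return 1 - std_normal.cdf(z_crit - shift)


# Power grows as the true mean moves further away from the null value of 60.
for mu1 in (60.5, 61, 62, 63):
    print(mu1, round(power_one_sided_z(60, mu1, sigma=5, n=36), 3))
```

When the true mean equals the null value, the "power" collapses to the significance level, which is a quick sanity check for the formula.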

**2. Test statistics**

The recruiter then wants to use the sample data he collected to calculate a value on which to base his decision. In this case, we want to test whether my mean typing speed is over 60 or not. We assume that the sample mean of the typing speed is normally distributed: if my typing speed is measured a large number of times then, according to the Central Limit Theorem, the distribution of the sample mean is approximately normal.

The recruiter would then calculate the sample mean, i.e., the average typing speed from the test I just did, so that he can decide whether to accept the claim or not.

**3. Make a decision**

It turns out the sample mean is 64 in this case. Let's also assume that I typed for 36 minutes and that the sample standard deviation of my per-minute speed is 5. How can the recruiter decide whether to accept or reject the claim?

Intuitively, if the sample mean is too far away from the hypothesized population mean, we are more likely to reject the null hypothesis. However, a skeptic might argue that anyone's performance can be unusually good or bad from time to time.
This could just be a chance event. So, the question would then be "how can the recruiter determine if I am really that good?"

That's why we need a significance level to set the threshold. Before running the test, we should first decide how large a **Type I error** probability we are willing to accept. Usually, people use 5% as the threshold.

If $H_0$ is true, we can plot the sampling distribution of the sample mean as the left bell-shaped curve in the plot below. The bell-shaped curve on the right-hand side is the sampling distribution of the sample mean given that $H_1$ is true.

Once we get the actual sample mean from our data, we can see where it falls in the plot. Intuitively, we want to know the probability of scoring 64 or better on the test if $H_0$ were true. If that probability is large, there is no strong evidence that my typing speed is over 60 words per minute on average. This probability is called the **p-value**: the probability of obtaining the observed or a more extreme outcome, given that the null hypothesis is true. We use the p-value to measure the strength of the evidence against $H_0$.

![](_pic/error.png)

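In our example, the sample mean (from which we compute the test statistic) is 64. Since we assume that the population is normally distributed, the sampling distribution of the sample mean is normally distributed as well. Because the population variance is unknown and has to be estimated by the sample standard deviation $S$, we know that the following statistic follows a t-distribution with $n-1$ degrees of freedom: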

$$ t = \frac{\bar{X}-\mu_X}{S/\sqrt{n}}$$

Then we can calculate

$$Pr(\bar{X}\geq \bar{X}_* | H_0 \text{ is true}) = Pr(t \geq t_* | H_0 \text{ is true}) = 0.05$$

$$t_* = 1.6895 \text{ when df = 36-1}$$

$$t_* = \frac{\bar{X_*}-\mu_0}{S/\sqrt{n}} = \frac{\bar{X_*}-60}{5/6} = 1.6895$$

$$\bar{X_*} = 61.41$$

Therefore, the number we should put at the "Any Mean" line in the plot above is 61.41. When our sample mean is over that threshold, we reject $H_0$ and accept that we have a 5% chance of committing a Type I error; if the sample mean is below the threshold, we don't reject $H_0$, since we don't have enough evidence to say that the population mean is larger than 60.

We can also calculate our **p-value** using the sample mean we get. 

$$Pr(\bar{X}\geq \bar{X}_0 | H_0 \text{ is true}) = Pr(\bar{X}\geq 64 | H_0 \text{ is true}) = Pr(t \geq \frac{64-60}{5/\sqrt{36}}) = Pr(t \geq 4.8) = 0.000015$$

This value represents the area under the left bell-shaped curve to the right of our observed sample mean, 64. It is smaller than the red region shown in the picture, since 64 is larger than the threshold we just calculated.
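
The critical value, the rejection threshold, and the p-value above can be reproduced with a few lines of code. This is a minimal sketch that assumes `scipy` is installed (it is a dependency of `statsmodels`); the variable names are illustrative.

```python
from math import sqrt

from scipy import stats

mu0, x_bar, s, n = 60, 64, 5, 36   # null mean, sample mean, sample std, sample size
alpha = 0.05
df = n - 1
se = s / sqrt(n)                    # standard error of the sample mean

t_crit = stats.t.ppf(1 - alpha, df)     # one-sided critical value of the t-distribution
x_crit = mu0 + t_crit * se              # threshold on the scale of the sample mean
t_obs = (x_bar - mu0) / se              # observed t statistic
p_value = stats.t.sf(t_obs, df)         # P(t >= t_obs | H0 is true)

print('t_crit = {:.4f}, threshold = {:.2f}'.format(t_crit, x_crit))
print('t_obs = {:.1f}, p-value = {:.6f}'.format(t_obs, p_value))
```

Because the observed statistic is well above the critical value, the p-value is far below the 5% significance level.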

---

# <a id="steps">A/B Testing</a>

Let's also start our introduction with an example:
> Google wants to change a button into a new one to their main search page. How can they determine whether or not people enjoy this new button feature?

To determine this, we need to run an A/B test: measure the difference between the two versions and then use that difference to decide whether the new feature is good or not. Below are the steps I think we should take in order to make a decision.

### <a id="step1">Step 1: Start with value propositions and define metric</a>

Before deciding which metric to use to compare the results, we should think about the value proposition of the company: the metric should align with the value that the business provides. For Google, the value proposition should be something like
**Google creates an extremely user friendly platform which directly connects people's queries to the information desired, enhancing the overall user experience.**

We want to make sure the new button feature helps Google's users find their desired information in a better way. It can be easier, more efficient, or any other improvement that you can think of.

Some possible metrics in this case are
* Daily active users (DAU), monthly active users (MAU)
* CTR of the button in question

A metric can also be something we create. For example, LinkedIn uses **Quality Signup**. It tracks the number of new members who have taken steps to establish their identity and grow their network within their first week as a member. Specifically, they track new members who have listed an occupation, made their first connection, and are reachable by other members. They consider these the basic requirements for any member to start receiving value on LinkedIn. As we can imagine, getting the number of **Quality Signups** for each version in a test takes a lot of work, since it combines several different user interactions with LinkedIn. Also, to decide which features should define a quality signup, machine learning techniques can be used to determine which features matter most for the target response: they need to label the users who qualify as a quality signup and build a model on those labels to see which factors affect the outcome. People call this kind of metric a **true-north metric**, which refers to what we should, not what we can, use to compare the difference.

In LinkedIn's case, there are several steps to define and predict who will be a quality signup:
1. **Data collection, label, and features**: They gathered all new members from a six-month cohort as samples. They classified new members who were still active six months after registration as positive outcomes, and those who were not as negative outcomes. By using “active” as the label, they made the assumption that members who were active were the ones who were receiving sufficient value from LinkedIn.
2. **Obligatory machine learning**: build a simple classification model from the features to the outcome.
3. **Making your metric actionable (drive product strategy)**: think about the tradeoff between accuracy and simplicity. The latter is more interpretable.
4. **Validating your metric in its ecosystem**: run some A/B tests using the metric we created and see if it makes sense.

Note: LinkedIn first labels users in a six-month cohort in order to build a machine learning model. This way of labeling users as positive or negative cannot be used directly in an A/B test, since six months is far too long to wait just to know whether a new feature is good or bad. Instead, after building a machine learning model on all the features we think may directly or indirectly affect the label, we can use the most important features to predict, during an A/B test, whether a new user will be classified as positive or negative. One thing to keep in mind is that when picking features, there is a tradeoff between accuracy and simplicity. Dropping some features may decrease the accuracy of the model, but it makes the model easier to explain. In other words, when we run an A/B test, an observation may be incorrectly classified because we wanted a more interpretable model.

For our example, since we don't know what the new button does, let's assume that a higher CTR means it helps more people find their desired information, and hence that the metric is relevant to Google's value proposition.

### <a id="step2">Step 2: Separate Traffic</a>

We then want to separate the traffic of the website into test and control so that we can compare the metric on the two versions of the page.

To do this, we need to collect events and gather samples of what we’re trying to measure. In our case, since our metric is CTR, we simply need to collect click events (and the corresponding impressions).
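
One common way to separate traffic is to hash a stable user identifier into a bucket, so that the same user always sees the same version. The snippet below is a minimal sketch of that idea; the function `assign_group`, the experiment name `new_button`, and the user id format are hypothetical and only serve as illustration.

```python
import hashlib


def assign_group(user_id, experiment="new_button", treatment_share=0.5):
    """Deterministically assign a user to control ('A') or test ('B').

    Hypothetical helper: hashing the experiment name together with the user id
    keeps the assignment stable across visits and independent across experiments.
    """
    key = "{}:{}".format(experiment, user_id).encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # pseudo-uniform number in [0, 1)
    return "B" if bucket < treatment_share else "A"


groups = [assign_group("user-{}".format(i)) for i in range(10000)]
print("share of users in B:", groups.count("B") / len(groups))
```

Hashing instead of drawing a fresh random number on every visit means no assignment table has to be stored, and a returning user never switches groups in the middle of the experiment.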

### <a id="step3">Step 3: Hypothesis Testing</a>

In our case, the metric we want to compare is CTR, the proportion of clicks to impressions, so we want to run a hypothesis test comparing two population proportions.

* **Set Hypothesis**

We denote the CTR of the original version as $p_A$, and that of the page with the new button as $p_B$.
We state our hypotheses as follows:

$$H_0: p_A-p_B=0$$
$$H_1: p_A-p_B > 0$$

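A one-sided alternative is stated here for charting simplicity. Note that the code below calls `proportions_ztest` with `alternative = 'two-sided'`, so the p-value it reports is for the two-sided alternative $H_1: p_A-p_B \neq 0$.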

* **Test statistics**

For our test the underlying metric is a binary yes/no variable (event), which means the appropriate test statistic is a test for differences in proportions:

$$Z=\frac{(\hat{p_A}-\hat{p_B})-(p_A-p_B)}{SE(\hat{p_A}-\hat{p_B})} \sim N(0, 1)$$

The test statistic makes sense: it measures the gap between the observed difference in proportions and the hypothesized one, standardized by an estimate of the standard error of that difference. Its reference distribution is the sampling distribution of the difference between two proportions.

To compute the test statistic, we first need the variance of $\hat{p_A}-\hat{p_B}$:

$$Var(\hat{p_A}-\hat{p_B}) = Var(\hat{p_A}) + Var(\hat{p_B}) - 2\,Cov(\hat{p_A},\hat{p_B})$$
$$ = Var(\hat{p_A}) + Var(\hat{p_B}) $$
$$ = \frac{p_A(1-p_A)}{n_A} + \frac{p_B(1-p_B)}{n_B} $$
$$ = p(1-p)\Big(\frac{1}{n_A}+ \frac{1}{n_B}\Big) $$

Where
* $n_i$ is the number of samples in group $i$.
* $p$ is the common proportion under $H_0$ ($p_A = p_B = p$), estimated by the pooled proportion $\hat{p} = \frac{n_A\hat{p_A}+n_B\hat{p_B}}{n_A+n_B}$.

The covariance term drops out because, when we separate the traffic, the two groups should be independent of each other, so the covariance between the two is 0. The last step uses the null hypothesis $p_A = p_B = p$.

Given that we assume the null hypothesis is true, the test statistic, i.e., where our sample falls in the sampling distribution, becomes

$$Z = \frac{\hat{p_A}-\hat{p_B}-0}{\sqrt{\hat{p}(1-\hat{p})\Big(\frac{1}{n_A}+ \frac{1}{n_B} \Big)}}$$

Let's assume the number we get from two groups are as follows:

In [2]:
data = pd.DataFrame({
    'version': ['A', 'B'],
    'impression': [5000, 5000],
    'click': [486, 527]
})[['version', 'impression', 'click']]
data

|   | version | impression | click |
|---|---------|------------|-------|
| 0 | A       | 5000       | 486   |
| 1 | B       | 5000       | 527   |


In [3]:
counts = np.array([486, 527])
nobs = np.array([5000, 5000])

zscore, pvalue = proportions_ztest(counts, nobs, alternative = 'two-sided')
print('zscore = {:.3f}, pvalue = {:.3f}'.format(zscore, pvalue))

zscore = -1.359, pvalue = 0.174
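
To connect the library call with the formula for $Z$, the same numbers can be computed by hand. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
import math
from statistics import NormalDist

clicks_a, n_a = 486, 5000
clicks_b, n_b = 527, 5000

p_a = clicks_a / n_a
p_b = clicks_b / n_b
pooled = (clicks_a + clicks_b) / (n_a + n_b)                 # pooled proportion under H0
se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))  # standard error under H0

z = (p_a - p_b) / se
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))   # H1: p_A - p_B != 0
p_one_sided_greater = 1 - NormalDist().cdf(z)      # H1: p_A - p_B > 0

print('z = {:.3f}, two-sided p = {:.3f}, one-sided p = {:.3f}'.format(
    z, p_two_sided, p_one_sided_greater))
```

The z-score and two-sided p-value agree with the `proportions_ztest` output. For the one-sided alternative $p_A - p_B > 0$, the p-value is about 0.91, because the observed difference points in the opposite direction.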


* **Make a decision**

Then we can make a decision based on the p-value from the two-proportion test. In the example above, the p-value of 0.174 is larger than the 5% significance level, so we don't reject the null hypothesis that there is no difference between the two versions.

---

## <a id="size">Determining Sample Size</a>

When running a two-proportion hypothesis test, how can we decide the sample size?

When designing a test, we need to consider both type I and type II errors when choosing the sample size. More specifically, we need to fix the following two probabilities to make the test trustworthy.
* **Significance level**: the probability of committing a false positive error (type I error, $\alpha$ error), i.e., of concluding there is an effect when the observed difference was actually due to chance. When that happens, we end up recommending something that does not work. A rule-of-thumb value for this probability is 5%.
* **Statistical Power**: the probability of detecting an effect when there actually is one. A rule-of-thumb value for this probability is 80%, i.e., there is an 80% chance that if there was an effect, we would detect it.

To actually solve for a suitable sample size, we also need to specify the detectable difference: the level of impact we want to be able to detect with our test. When conducting a hypothesis test, if our claim is the same as above, i.e., 

$$H_0: p_A-p_B=0$$
$$H_1: p_A-p_B > 0$$

we actually don't know how large or how small the difference will be. If we want our test to be able to detect a small difference, we need a very large sample size; on the other hand, if we only need the test to detect a large difference, a smaller sample size is enough to achieve the same significance level and statistical power.

Let's consider two illustrative examples. If we want our test to be able to detect a difference of, say, 0.0001, then the sampling distributions under $H_0$ and $H_1$ will be so close that they are nearly indistinguishable, and we need a very large sample to reach a power of 80%. On the other hand, if we want our test to detect a difference of, say, 0.1, the two distributions will be further apart, which means we can run the test with a much smaller sample size.

In the following gif, we can see that as the sample size goes up, each distribution becomes narrower. This is because the standard error of the sample mean is

$$\sigma_{\bar{X}}=\frac{\sigma}{\sqrt{N}}$$

Therefore, if we want our test to be able to detect a small difference between two groups, we need a bigger sample size.

![](_pic/PowerSampleSize.gif)

Let's use the function copied from Ethen's [blog post](http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/ab_tests/frequentist_ab_test.ipynb) about A/B testing to calculate the sample size we need for our test to be able to detect a difference of 0.02 in CTR.

In [6]:
def compute_sample_size(prop1, min_diff, significance = 0.05, power = 0.8):
    """
    Computes the sample sized required for a two-proportion A/B test;
    result matches R's pwr.2p.test from the pwr package
    
    Parameters
    ----------
    prop1 : float
        The baseline proportion, e.g. ctr
        
    min_diff : float
        Minimum detectable difference
        
    significance : float, default 0.05
        Often denoted as alpha. Governs the chance of a false positive.
        A significance level of 0.05 means that there is a 5% chance of
        a false positive. In other words, our confidence level is
        1 - 0.05 = 0.95
    
    power : float, default 0.8
        Often denoted as 1 - beta. Power of 0.80 means that there is an 80%
        chance that if there was an effect, we would detect it
        (or a 20% chance that we'd miss the effect)
        
    Returns
    -------
    sample_size : float
        Required sample size for each group of the experiment (round up in practice)

    References
    ----------
    R pwr package's vignette
    - https://cran.r-project.org/web/packages/pwr/vignettes/pwr-vignette.html

    Stackoverflow: Is there a python (scipy) function to determine parameters
    needed to obtain a target power?
    - https://stackoverflow.com/questions/15204070/is-there-a-python-scipy-function-to-determine-parameters-needed-to-obtain-a-ta
    """
    prop2 = prop1 + min_diff
    effect_size = sms.proportion_effectsize(prop1, prop2)
    print(effect_size)
    sample_size = sms.NormalIndPower().solve_power(
        effect_size, power = power, alpha = significance, ratio = 1)
    
    return sample_size

In [7]:
sample_size = compute_sample_size(prop1 = 0.0972, min_diff = 0.02)
print('sample size required per group:', sample_size)

-0.0647140191204
sample size required per group: 3748.3476230693946
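
To see where this number comes from, the sample size can also be approximated in closed form. The sketch below assumes the usual normal approximation with Cohen's effect size $h = 2\arcsin\sqrt{p_2} - 2\arcsin\sqrt{p_1}$ and a two-sided test, $n \approx 2\big(\frac{z_{1-\alpha/2} + z_{power}}{h}\big)^2$; the function name `sample_size_per_group` is illustrative, and the result may differ slightly from `solve_power`, which solves the power equation numerically.

```python
import math
from statistics import NormalDist


def sample_size_per_group(prop1, min_diff, significance=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided two-proportion test."""
    prop2 = prop1 + min_diff
    # Cohen's effect size h for two proportions (only its magnitude matters here)
    h = 2 * math.asin(math.sqrt(prop2)) - 2 * math.asin(math.sqrt(prop1))
    z_alpha = NormalDist().inv_cdf(1 - significance / 2)
    z_power = NormalDist().inv_cdf(power)
    return 2 * ((z_alpha + z_power) / h) ** 2


# The smaller the difference we want to detect, the more samples we need.
for min_diff in (0.005, 0.01, 0.02, 0.05):
    n = sample_size_per_group(0.0972, min_diff)
    print('min_diff = {}: about {} samples per group'.format(min_diff, math.ceil(n)))
```

Since $h$ is squared in the denominator, halving the detectable difference roughly quadruples the required sample size.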


## <a id="duration">Determining Duration</a>

Once we have the sample size we need for each group, the next thing we need to know is how long to run our A/B test. The answer is actually quite straightforward; the one number we definitely need is the traffic to our website.

For example, let's assume our website doesn't have a high volume: daily traffic to that specific page is 10k. To do the A/B test, we split our traffic 50/50, so the daily traffic for each group is 5k. After defining the significance level and statistical power for the hypothesis test, if the required sample size for each group is 100k, then we need 100k / 5k = 20 days in total to run the test.
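
The duration calculation is simple enough to wrap in a small helper. This is a minimal sketch; the function name `duration_in_days` is illustrative, and the example assumes 10k daily visitors split 50/50, i.e. 5k per group per day.

```python
import math


def duration_in_days(sample_size_per_group, daily_traffic, share_per_group=0.5):
    """Number of whole days needed to collect the required sample in each group."""
    daily_per_group = daily_traffic * share_per_group
    return math.ceil(sample_size_per_group / daily_per_group)


# 100k samples per group at 5k visitors per group per day
print(duration_in_days(100000, daily_traffic=10000))
```

Rounding up with `math.ceil` reflects that a test can only be stopped after a full day of traffic has been collected.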

There is always a tradeoff between the sample size and the minimum difference we want our test to detect. If we use a small sample size and there is indeed a difference between the two groups, but a small one, we will not be able to detect it. This is something we as data scientists need to discuss with the product manager.

## <a id="caveats">Caveats for A/B Testing</a>

**1. Decide on a statistical power first to estimate the sample size**

When running any A/B test, we need to make sure the test has a low false positive rate as well as a decent probability of detecting an effect when there actually is one.

Let’s imagine we perform 100 tests on a website and, by running each test for 2 months, we have a large enough sample to achieve 80% power. Let's also assume that out of the 100 tests, there are 10 truly effective changes. In practice, of course, we won't know which ones they are. Since our power is 80%, we expect to detect 80%, or 8, of these true effects. If we use a p-value cutoff of 5%, we also expect to see about 5 false positives. So, on average, we will see 8 + 5 = 13 winning results from 100 A/B tests. Therefore, 38% of the winning tests are imaginary.
 
What happens if the test is under-powered? Let’s say we are too impatient to wait for two months, so we cut the test short after two weeks. The smaller sample size reduces the power of this test from a respectable 80% to less than 30%.
Now we will have 3 true positives and 5 false positives: 63% of our winning tests are completely imaginary.
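
The arithmetic above can be reproduced with a short function. This is a minimal sketch with an illustrative function name; it applies the 5% cutoff only to the 90 changes that have no real effect, so it expects 4.5 rather than 5 false positives and gives about 36% and 60% instead of the rounded 38% and 63% quoted above.

```python
def false_winner_share(n_tests, n_true_effects, power, alpha=0.05):
    """Expected share of 'winning' tests that are actually false positives."""
    true_positives = power * n_true_effects
    false_positives = alpha * (n_tests - n_true_effects)
    return false_positives / (true_positives + false_positives)


print('80% power: {:.0%} of winners are false'.format(false_winner_share(100, 10, 0.8)))
print('30% power: {:.0%} of winners are false'.format(false_winner_share(100, 10, 0.3)))
```

Either way, the conclusion is the same: the lower the power, the larger the share of winning results that are not real.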

Therefore, if we don’t use the statistical power to calculate the required sample size up front, we might not run our experiment for long enough. Even if there is an uplift, we won’t have enough data to detect it, and the experiment will likely be a waste of time.

**2. Do not stop the test early if we use ‘classical methods’ of testing**

We should not stop the test early as soon as the result looks significant. A temporarily significant result does not mean the overall result is significant, and repeatedly checking and stopping at the first significant look inflates the false positive rate above the nominal significance level. In addition, if we stop the experiment early, the sample size will be lower than originally required, which lowers the statistical power of the test. The test will therefore suffer from the same problem as described in Caveat 1.
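
A small simulation illustrates why peeking is dangerous. The sketch below runs A/A experiments (both groups share the same true CTR, so every significant result is a false positive) and compares "stop at the first significant look" with "test once at the end". All names and parameter values are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist


def two_prop_pvalue(clicks_a, n_a, clicks_b, n_b):
    """Two-sided p-value of the pooled two-proportion z-test."""
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (clicks_a / n_a - clicks_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_peeking(n_experiments=300, n_looks=10, batch=200,
                        ctr=0.1, alpha=0.05, seed=0):
    """Return (false positive rate with peeking, false positive rate at the end)."""
    rng = random.Random(seed)
    significant_at_any_look = 0
    significant_at_end = 0
    for _ in range(n_experiments):
        clicks_a = clicks_b = n = 0
        ever_significant = False
        for _ in range(n_looks):
            # Each look adds one batch of impressions to both groups.
            clicks_a += sum(rng.random() < ctr for _ in range(batch))
            clicks_b += sum(rng.random() < ctr for _ in range(batch))
            n += batch
            p = two_prop_pvalue(clicks_a, n, clicks_b, n)
            ever_significant = ever_significant or p < alpha
        significant_at_any_look += ever_significant
        significant_at_end += p < alpha   # p from the final look only
    return (significant_at_any_look / n_experiments,
            significant_at_end / n_experiments)


peek_rate, final_rate = simulate_aa_peeking()
print('false positive rate with peeking:', peek_rate)
print('false positive rate testing once:', final_rate)
```

Testing once at the end should keep the false positive rate near the chosen 5%, while stopping at the first significant look produces noticeably more false winners.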

**3. Perform a second ‘validation’ test repeating your original test to check that the effect is real**

Whenever we see tests that don’t seem to maintain their uplift over time, we should ask ourselves: was the original test conducted properly (e.g., was it adequately powered)? If we want to be sure of the result, we should always perform a second validation study to check that our results are robust.

**4. Be aware of Seasonality**

When running an A/B test, we should be aware of any seasonality in the users of the website or product. For example, if there are more students in summer than in any other season, the result of an A/B test run during summer will probably differ from the result of the same test run in another season.

Similar concepts to seasonality are *day of week* and *holiday* effects.

**5. Using incorrect metrics to evaluate the result of the A/B testing**

For example, for an e-commerce website, if we only use click-through rate as the metric to evaluate an A/B test for a marketing campaign, we may end up choosing the most clickbaity ad instead of the ad that attracts people who have a genuine interest in buying products.

**6. Controlling all the differences.**

When we run an A/B test, we should make sure that everything other than the change under test is held fixed. Otherwise, any other difference may confound the result, i.e., the final result differs not because of the change we set out to test, but because something else changed without our noticing.

**7. Be aware of statistical interference when the two groups cannot be perfectly separated**

Let's use a simple example to illustrate this. Before Facebook extended everyone's Like button with the Love, Haha, Wow, Sad, and Angry buttons, they wanted to know whether the new buttons would make more users react to others' posts. Therefore, they selected some Facebook users and separated them into two groups. Let's assume they only want to see the difference in the total number of reactions from the users of the two groups, with group A keeping the original button and group B getting the new buttons. The higher the number of reactions, the better the result.

If some people in group A are friends with people in group B, the result will be affected by statistical interference. In other words, the result differs not because of the new buttons, but because the groups are not fully separated. Let's say, for example, that a user, Johnny, in group A publishes a very good post that all of his friends in group B like. If Johnny has lots of friends in group B, this will probably push the final number, that is, the total reactions of group B, above that of group A.

Therefore, we should be aware of this kind of problem when running an A/B test in which the two groups cannot be perfectly separated. Lyft has also illustrated the same kind of problem in their experiments in a ridesharing marketplace. You can find the post [here](https://eng.lyft.com/experimentation-in-a-ridesharing-marketplace-b39db027a66e).

**8. Novelty Effect**

When we run an A/B test for a new feature, it is easy to imagine that when users see a shiny new feature, they'll interact with it just to try it out. The metrics may look positive initially, which makes us think that the new feature is quite successful. However, in the long run, this is not necessarily true.

**9. Biased rollouts**

We of course shouldn't trust our A/B tests when they're not actually random. 

The following is quoted from [here](https://www.quora.com/When-should-A-B-testing-not-be-trusted-to-make-decisions):
>This should be obvious, but when I was at Twitter and we launched the Twitter 2.0 redesign, we actually allowed users to opt in to the redesign, which meant we couldn't trust our A/B test results at all! I've also seen this happen at Google, where new features or algorithms were non-randomly rolled out to the users we thought would be most engaged with them.

---

### <a id="reference">Reference</a>
* [Quora: Hypothesis Testing to a laymen](https://www.quora.com/How-do-you-explain-hypothesis-testing-to-a-layman)
* [Illinois State University: Lab 14 Statistical Power](http://my.ilstu.edu/~wjschne/138/Psychology138Lab14.html)
* [The Science of Quality Growth](https://engineering.linkedin.com/blog/2017/06/the-science-of-quality-growth)
* [Ethen's frequentist A/B test](http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/ab_tests/frequentist_ab_test.ipynb)
* [Most Winning A/B Test Results are illusory](https://www1.qubit.com/sites/default/files/pdf/mostwinningabtestresultsareillusory_0.pdf)
* [Experimentation in a Ridesharing Marketplace](https://eng.lyft.com/experimentation-in-a-ridesharing-marketplace-b39db027a66e)
* [When should A/B testing not be trusted to make decisions?](https://www.quora.com/When-should-A-B-testing-not-be-trusted-to-make-decisions)