# Design of A/B Testing  

**Reference**: [A/B Testing - Udacity](https://www.udacity.com/course/ab-testing--ud257)

## What is A/B Testing?  

> [A/B testing](https://en.wikipedia.org/wiki/A/B_testing) is a term for a randomized experiment with two variants, A and B, which are the control and variation in the controlled experiment. A/B testing is a form of statistical hypothesis testing with two variants leading to the technical term, two-sample hypothesis testing, used in the field of statistics. Other terms used for this method include bucket tests and split-run testing.  

* In the online world, the goal of A/B testing is to determine whether or not the users will like a particular new product / feature.  

### Experimental vs Observational  

* Experimental
    * Apply treatments to experimental units (people, animals, land, etc.) and observe the effect of the treatment
    * Can be used to establish causality
    * Example: Randomly assign students to two groups, only give homework to one group, and measure the performance of the two groups  
    
* Observational
    * Observe subjects and measure variables of interest without assigning treatments to subjects
    * Can’t be used to establish causality
    * Example: Observe and record whether or not the students do homework and their grades  

### Confounding Factor  

* An extraneous attribute that correlates with both the dependent variable (performance) and the independent variable (homework or not)  
* Example:  
    * How hard-working the student is  
    * The students that are more hard-working are more likely to complete their homework and perform better

### Experimental Design   

* Randomization into groups of equal size  
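    * Example: Randomly generate a number between 0 and 1; if it is less than 0.5, assign homework (**treatment/experiment group**); otherwise, assign no homework (**control group**). Independent draws like this give groups of only approximately equal size. 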
 
* Assume independent observations  
    * Assume the students don’t know if the other students have homework or not  
    * Otherwise that knowledge might affect performance  
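The randomization step can be sketched in R as follows. This is a minimal illustration, not a prescribed procedure: the student IDs and the seed are made up. It contrasts independent coin flips (group sizes only approximately equal) with shuffling a balanced vector of labels (group sizes exactly equal).

```r
# Hypothetical roster of 20 students (IDs are made up for illustration)
set.seed(42)
students <- paste0("student_", 1:20)

# Option 1: independent draws from Uniform(0, 1), as in the example.
# Group sizes are only approximately equal.
u <- runif(length(students))
coin_flip <- ifelse(u < 0.5, "treatment", "control")
table(coin_flip)

# Option 2: shuffle a balanced vector of labels to force equal group sizes.
balanced <- sample(rep(c("treatment", "control"), length.out = length(students)))
table(balanced)
```

In an online experiment, users usually arrive one at a time, so something closer to Option 1 (or a hash of the user ID) is typically what is available.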

**Q**: What can we test with A/B testing? Can you think of any example when we can't use A/B testing?  

### Other Techniques  

* Retrospective / observational study  
* User experience research  
* Focus groups
* Surveys
* Human evaluation    

**Note**: These techniques can also be used for generating or validating the metrics used in A/B testing.  

## Choosing Metrics  

**Example**: An online education company (like Udacity) is trying to test features that increase student engagement. A typical user flow through their website might look like:  
* Visit the homepage  
* Explore the site  
* Create an account  
* Complete a class / make a purchase

We'll consider an experimental change to the "Start Now" button on the company's homepage. If users click this button, they will see a list of the online courses. We want to test the hypothesis that changing the "Start Now" button from orange to pink will increase how many students explore the online courses.  

**Q**: What metrics should we use for testing the hypothesis?  

* Total number of courses completed?  
* The number of users who click the "Start Now" button?  
* Click-through rate (CTR): the number of clicks divided by the number of pageviews?  
* Click-through probability: the number of unique visitors who click at least once divided by the number of unique visitors who view the page?  

**Q**: How would you measure the metric(s) of your choice?  

**Q**: Say we choose click-through probability as the metric. We observe that 1000 unique visitors viewed the page and 100 of them clicked at least once. Can we construct a 95% confidence interval for the click-through probability? How? What does this interval tell us?  

The 95% confidence interval for one sample proportion $p$ is given by  

$$ (\hat{p} - 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \text{, } \hat{p} + 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}) $$
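As a worked check of this formula, the interval can be computed by hand in R. This is a sketch using the counts from the question (100 clicks out of 1000 visitors); the variable names are arbitrary. Note that R's `prop.test` reports a Wilson score interval with a continuity correction by default, so its interval will differ slightly from this normal-approximation (Wald) interval.

```r
# Observed data from the question
x <- 100    # unique visitors who clicked
n <- 1000   # unique visitors who viewed the page

p_hat <- x / n
z <- qnorm(0.975)                      # about 1.96 for a 95% interval
se <- sqrt(p_hat * (1 - p_hat) / n)    # standard error of p_hat

ci <- c(lower = p_hat - z * se, upper = p_hat + z * se)
round(ci, 4)   # roughly 0.0814 and 0.1186
```

Replacing `0.975` with `0.995` gives the multiplier for a 99% interval.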

In [1]:
# One sample proportion test in R
prop.test(x = 100, n = 1000)

In [6]:
# To get the confidence interval
res = prop.test(x = 100, n = 1000)
res$conf.int

**Q**: Can you construct a 99% confidence interval for the click-through probability?  

### Comparing the Two Versions  

**Example**: Say we randomly show half of the visitors the homepage with the orange "Start Now" button, and the other half the homepage with the pink "Start Now" button. 1000 visitors see the orange button and there are 100 unique clicks on it, while 1000 visitors see the pink button and there are 130 unique clicks on it. 

**Q**: Which group is the control group? Which is the treatment/experiment group?  

**Q**: How should we compare the click-through probability between the two groups?  

Say we are just interested in testing whether there is a statistically significant difference between the two groups:  

$H_0: p_0 = p_1$ or $p_0 - p_1 = 0$  

$H_a: p_0 \neq p_1$ or $p_0 - p_1 \neq 0$

The overall (pooled) probability is:  

$$ \hat{p}_{pool} = \frac{X_0 + X_1}{n_0 + n_1} $$  

And the 95% confidence interval for $p_0 - p_1$ is given by  

$$ (\hat{p}_0 - \hat{p}_1) \pm 1.96 \sqrt{\hat{p}_{pool}(1 - \hat{p}_{pool}) \left(\frac{1}{n_0} + \frac{1}{n_1} \right)} $$
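The pooled formulas can be evaluated directly in R. The sketch below assumes the orange button is group 0 and the pink button is group 1; the variable names are arbitrary. The interval reported by `prop.test` is slightly different because it uses an unpooled standard error plus a continuity correction.

```r
# Clicks and visitors for the orange (group 0) and pink (group 1) buttons
x0 <- 100; n0 <- 1000
x1 <- 130; n1 <- 1000

p0_hat <- x0 / n0
p1_hat <- x1 / n1
p_pool <- (x0 + x1) / (n0 + n1)

# Pooled standard error of the difference in proportions
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n0 + 1 / n1))

d_hat <- p0_hat - p1_hat
z_stat <- d_hat / se_pool
p_value <- 2 * pnorm(-abs(z_stat))    # two-sided p-value

ci <- d_hat + c(-1, 1) * qnorm(0.975) * se_pool
round(ci, 4)   # roughly -0.0580 and -0.0020
```

Since the interval excludes 0, this hand calculation agrees with the test in rejecting $H_0$ at the 5% level.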

In [16]:
prop.test(x = c(100, 130), n = c(1000, 1000))


	2-sample test for equality of proportions with continuity correction

data:  c(100, 130) out of c(1000, 1000)
X-squared = 4.1317, df = 1, p-value = 0.04209
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.058932066 -0.001067934
sample estimates:
prop 1 prop 2 
  0.10   0.13 


### Practical Significance vs Statistical Significance  

When we conclude from the previous hypothesis test that there is a significant difference between the two groups, we are talking about statistical significance. But usually we don't just want to test whether the difference is 0; we are more interested in whether, say, the click-through probability increases by at least 1% due to the new feature. That is practical significance.   

**Q**: Is the experiment result practically significant?  

## Determining the Sample Size   

Now that we have chosen a metric, how do we decide how large a sample we need for the experiment?  

### Statistical Power  

The statistical power of a test is defined as the probability that the test correctly rejects the null hypothesis ($H_0$) when the alternative hypothesis ($H_a$) is true. I.e.,  

$$power = P(\text{reject } H_0 \ | \ H_a \text{ is true}) = 1 - \beta$$  
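One way to build intuition for this definition is to estimate power by simulation: generate many experiments in which $H_a$ is true and count how often the test rejects $H_0$. The sketch below is illustrative only. It assumes a baseline of 0.10, a true treatment probability of 0.12, 1000 visitors per group, and a pooled two-sided z-test; the seed and number of simulations are arbitrary.

```r
set.seed(1)
n <- 1000        # visitors per group
p0 <- 0.10       # control click-through probability
p1 <- 0.12       # experiment click-through probability (H_a is true)
alpha <- 0.05
n_sim <- 2000    # number of simulated experiments

rejected <- replicate(n_sim, {
  x0 <- rbinom(1, n, p0)
  x1 <- rbinom(1, n, p1)
  p_pool <- (x0 + x1) / (2 * n)
  se <- sqrt(p_pool * (1 - p_pool) * 2 / n)
  z <- (x1 / n - x0 / n) / se
  abs(z) > qnorm(1 - alpha / 2)    # TRUE if H_0 is rejected
})

# Fraction of simulated experiments that reject H_0: the estimated power
mean(rejected)
```

The estimate carries Monte Carlo error, so it will only roughly agree with an analytical power calculation.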

**Q**: What is the power of the test if the true difference in click-through probability is 0.02? The sample size for each group is 1000, and the click-through probability for the control group is 10%.  

In [35]:
power.prop.test(n = 1000, p1 = 0.1, p2 = 0.1 + 0.02)


     Two-sample comparison of proportions power calculation 

              n = 1000
             p1 = 0.1
             p2 = 0.12
      sig.level = 0.05
          power = 0.2977321
    alternative = two.sided

NOTE: n is number in *each* group


**Example**: Now say we know that our baseline click-through probability is 10% (the click-through probability before the new feature is introduced), and to be practically significant, we need an absolute difference of 2% in the click-through probability between the control and the experiment groups. In order to have a statistical power of 80%, what is the required sample size for each group?  

In [36]:
power.prop.test(p1 = 0.1, p2 = 0.1 + 0.02, power = 0.8)


     Two-sample comparison of proportions power calculation 

              n = 3840.847
             p1 = 0.1
             p2 = 0.12
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
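For reference, the same sample size can be reproduced with a closed-form normal approximation. The formula below is a sketch of what I understand `power.prop.test` to be solving numerically in its default (non-strict) mode, which would explain why the numbers agree:

$$ n = \frac{\left( z_{1-\alpha/2} \sqrt{2 \bar{p} (1 - \bar{p})} + z_{1-\beta} \sqrt{p_1 (1 - p_1) + p_2 (1 - p_2)} \right)^2}{(p_1 - p_2)^2}, \quad \bar{p} = \frac{p_1 + p_2}{2} $$

```r
p1 <- 0.10          # baseline click-through probability
p2 <- 0.12          # baseline plus the minimum difference of interest
alpha <- 0.05
power <- 0.80

z_alpha <- qnorm(1 - alpha / 2)
z_beta <- qnorm(power)
p_bar <- (p1 + p2) / 2

n <- (z_alpha * sqrt(2 * p_bar * (1 - p_bar)) +
      z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / (p1 - p2)^2
ceiling(n)   # 3841 per group
```

Because $(p_1 - p_2)^2$ sits in the denominator, halving the detectable difference roughly quadruples the required sample size.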


**Q**: How does the required sample size change as you:  

* Increase the practical significance level (the minimum difference you care to detect)?  
* Increase the confidence level?  
* Increase the desired power of the test?  

In [37]:
power.prop.test(p1 = 0.1, p2 = 0.1 + 0.03, power = 0.8)


     Two-sample comparison of proportions power calculation 

              n = 1773.976
             p1 = 0.1
             p2 = 0.13
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group


In [39]:
power.prop.test(p1 = 0.1, p2 = 0.1 + 0.02, sig.level = 0.01, power = 0.8)


     Two-sample comparison of proportions power calculation 

              n = 5715.417
             p1 = 0.1
             p2 = 0.12
      sig.level = 0.01
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group


In [40]:
power.prop.test(p1 = 0.1, p2 = 0.1 + 0.02, power = 0.9)


     Two-sample comparison of proportions power calculation 

              n = 5141.306
             p1 = 0.1
             p2 = 0.12
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

NOTE: n is number in *each* group


## The Two Uses of Metrics  

* **Invariant checking**: metrics that shouldn't differ between the experiment and control groups  
    * Do you have the same number of users across the two?  
    * Do you have comparable numbers of users across countries, or by language?  
    
* **Evaluation**: 
    * High level business metrics, e.g., how much revenue you make, how many users you have  
    * More detailed metrics, e.g., how long do the users stay on your page  

**Q**: For the following new features to the online education site, think about what metrics you would use to test them:  

* Adding course descriptions on the course list page  
* Increasing the size of the "Start Now" button  
* Explaining the benefits of the paid services  

Some metrics can be difficult to obtain / measure:  

* You don't have access to that data
* It takes too long to collect  

**Q**: Why might the following metrics be difficult to obtain?  

* The rate at which students return to take a 2nd course after taking the 1st course  
* The percentage of the students who get jobs after taking the online courses  
* The average happiness level of Amazon shoppers  
* The probability that the users find the information they look for on Google  

## Define the Metrics  

**Q**: For the click-through probability we chose in the previous example, how exactly are we going to capture this probability from observed data? (How do we define "Unique"?)  

* For each time interval (1 minute, 1 hour, 1 day, etc.), the cookie-level click-through probability:  

$$\frac{\text{number of unique cookies with at least one click}}{\text{number of unique cookies}} $$  

* For each time interval (1 minute, 1 hour, 1 day, etc.),  

$$\frac{\text{number of pageviews with clicks}}{\text{number of pageviews}} $$ 

* Click-through rate (CTR)    

$$ \frac{\text{number of clicks}}{\text{number of pageviews}} $$
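To make the three definitions concrete, here is a small sketch that computes each of them from a made-up pageview log covering a single time interval. The data frame, its column names, and its values are hypothetical. The cookie-level numerator is read as "cookies with at least one click", which is an interpretation consistent with the click-through probability defined earlier.

```r
# Hypothetical pageview log for one time interval: one row per pageview
pageviews <- data.frame(
  cookie = c("a", "a", "b", "c", "c", "c", "d"),
  clicks = c(0, 2, 0, 1, 0, 1, 0)   # clicks recorded on that pageview
)

# Cookie-level click-through probability:
# cookies with at least one click / unique cookies
clicks_per_cookie <- tapply(pageviews$clicks, pageviews$cookie, sum)
cookie_prob <- mean(clicks_per_cookie > 0)

# Pageview-level click-through probability:
# pageviews with at least one click / pageviews
pageview_prob <- mean(pageviews$clicks > 0)

# Click-through rate: total clicks / pageviews (can exceed 1)
ctr <- sum(pageviews$clicks) / nrow(pageviews)

c(cookie_prob = cookie_prob, pageview_prob = pageview_prob, ctr = ctr)
```

On this toy log the three definitions give 0.5, about 0.43, and about 0.57, which shows that they are not interchangeable.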

**Q**: In certain cases, the metric(s) we defined might not be measuring what we think they are measuring. Can you think of any such cases?  

**Q**: How would you check whether there's any problem with the data for the defined metrics?  

## Summary Metrics  

* Sum and count  
* Mean, median, percentiles
* Probability and rates
* Ratios

**Q**: Say you want to measure the loading time of videos on your website. How would you choose a metric?  

### Sensitivity and Robustness

* Sensitivity: the metric picks up the changes you care about
* Robustness: the metric doesn't move much in response to changes you don't care about  
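A quick way to see the trade-off is to compare how different summary metrics react to a single extreme observation. The load times below are made up for illustration, and the helper `summarise_times` is a hypothetical name.

```r
# Hypothetical video load times in seconds (made-up data)
load_times <- c(1.1, 1.3, 1.2, 1.4, 1.0, 1.5, 1.2, 1.3, 1.1, 1.4)

summarise_times <- function(x) {
  c(mean = mean(x),
    median = median(x),
    p90 = unname(quantile(x, 0.90)))
}

# Add one extreme outlier (e.g. a stalled connection)
with_outlier <- c(load_times, 60)

round(rbind(original = summarise_times(load_times),
            with_outlier = summarise_times(with_outlier)), 2)
```

Here the mean jumps from 1.25 to about 6.59 seconds while the median only moves from 1.25 to 1.3. The mean is sensitive but not robust to outliers, and the median is the reverse. Whether that robustness is desirable depends on whether such outliers are changes you care about.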

## Variability of Metrics  

* Analytically
    * If the metric follows a known distribution, we can find the variance of the metric analytically from the variance of that distribution
    * E.g., by the CLT, the sample mean is approximately Normally distributed for large samples

* Empirically  
    * A/A tests
    * Bootstrap

**Q**: Say we measured the number of daily visits to our website and chose the mean number of site visits per day as our metric. How would you estimate the variance of the metric?  

In [43]:
visits = c(452, 593, 932, 854, 362, 481, 459, 512, 835, 480, 783, 291, 452, 843, 673, 841, 733, 910, 486, 928)
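One possible approach, sketched below, is to compute the variance of the mean both analytically and with a bootstrap, and compare the two. The analytical estimate assumes the daily counts are independent and identically distributed, which may not hold if there are weekly patterns. The seed and the number of bootstrap resamples are arbitrary.

```r
visits <- c(452, 593, 932, 854, 362, 481, 459, 512, 835, 480,
            783, 291, 452, 843, 673, 841, 733, 910, 486, 928)
n <- length(visits)

# Analytical estimate: Var(sample mean) = sample variance / n
var_analytic <- var(visits) / n

# Bootstrap estimate: resample the days with replacement many times
# and take the variance of the resampled means
set.seed(123)
boot_means <- replicate(5000, mean(sample(visits, size = n, replace = TRUE)))
var_boot <- var(boot_means)

c(analytic = var_analytic, bootstrap = var_boot)
```

The two estimates should be close. A large gap would suggest that the assumptions behind the analytical formula are questionable.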

## Design the Experiment  

### Unit of Diversion  

* The subject of the experiment
    * How do we define what an individual subject is in the experiment?  
* Commonly used:  
    * User ID
    * Anonymous ID (cookies)
    * Event
* Less commonly used:  
    * Device ID
    * IP address

**Q**: Which unit of diversion would you use for the following experiments on the online education site:  

* Reduce video loading time  
* Change the color and size of a button  
* Change the order of search results
* Add instructor notes before quizzes

### Unit of Analysis  

* Basically whatever the denominator of your metric is
* If the unit of analysis and unit of diversion are the same, the analytical variability of the metric is likely to be close to the empirical one
* In general, the unit of diversion needs to be at least as "big" as your unit of analysis

**Q**: If the metric of the test is the click-through rate (# clicks / # pageviews), and the unit of diversion is cookie, would you expect the analytical variance to match the empirical variance?  
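A small simulation can help explore this question. The sketch below is a toy model with made-up parameters: each cookie has its own click probability, so pageviews from the same cookie are correlated. It compares the empirical variance of the CTR across simulated experiments with the analytical variance that treats every pageview as an independent Bernoulli trial. For simplicity, each pageview yields at most one click.

```r
set.seed(7)

simulate_ctr <- function(n_cookies = 500) {
  # Pageviews per cookie (at least 1)
  n_views <- rpois(n_cookies, lambda = 5) + 1
  # Per-cookie click probability, mean 0.1, varying widely between cookies
  click_prob <- rbeta(n_cookies, shape1 = 0.5, shape2 = 4.5)
  clicks <- rbinom(n_cookies, size = n_views, prob = click_prob)
  c(ctr = sum(clicks) / sum(n_views), n_pageviews = sum(n_views))
}

sims <- replicate(2000, simulate_ctr())

# Empirical variance of CTR across simulated experiments
var_empirical <- var(sims["ctr", ])

# Analytical variance assuming independent pageviews
p <- mean(sims["ctr", ])
var_analytic <- p * (1 - p) / mean(sims["n_pageviews", ])

c(empirical = var_empirical, analytic = var_analytic,
  ratio = var_empirical / var_analytic)
```

In this toy setup the empirical variance comes out noticeably larger than the analytical one, because the unit of diversion (cookie) is bigger than the unit of analysis (pageview).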

### Target Population  

* You have to decide who you're targeting in your users  
    * E.g. by browser, country, language, demographic information, etc.  
* You might want to restrict how many of your users can see the feature  
* You might not want to overlap with other experiments running at the same time  

### Sizing and Duration  

* Based on the unit of diversion we choose, we may have to estimate the variability of the metric empirically, then choose the sample size based on that variance  

**Q**: How can you estimate the variance of the metric if the unit of diversion is different from the unit of analysis?  

* Then you need to decide 
    * What percentage of your users you want to expose the new feature to
    * How long you have to run the experiment  
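Exposure and duration trade off against each other, as the back-of-the-envelope sketch below shows. The per-group sample size is taken from the earlier power calculation (baseline 10%, difference 2%, power 80%). The daily traffic and exposure fraction are hypothetical.

```r
n_per_group <- 3841        # from the power calculation (p1 = 0.10, p2 = 0.12, power = 0.8)
daily_visitors <- 2000     # hypothetical unique visitors per day
exposure <- 0.5            # hypothetical fraction of traffic sent into the experiment

# Visitors entering the experiment each day are split evenly across the two groups
visitors_per_group_per_day <- daily_visitors * exposure / 2
days_needed <- ceiling(n_per_group / visitors_per_group_per_day)
days_needed   # 8
```

Halving `exposure` would double the duration. In practice you may also want to round up to whole weeks so that every day of the week is represented.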
    
**Q**: Why wouldn't you want to expose all your users to the experiment?  

# Analysis of A/B Testing Results

* Now that we've decided on the metric(s) to evaluate, chosen the sample size and run the experiment, we want to see what we can conclude / recommend from the data collected.  

## Invariant Checking  

* Before we dive into comparing the click-through probability, we need to do some sanity checks to make sure the experiment was actually run properly
    * Something might have gone wrong in the experiment diversion. Are your control and experiment groups still comparable?  
    * Did the data capture the events you were looking for?  

* We need to check if the experiment and control populations are actually comparable  
* The invariants shouldn't change when you run your experiment  

**Q**: Say the metric we want to evaluate is the click-through rate. What metrics can we use for invariant checking?   

**Q**: If we observed a total of 8294 pageviews in the control group and 8095 pageviews in the experiment group, how do we use these counts for invariant checking?   

In [None]:
# We can perform a Binomial/proportion test on the invariant
prop.test(8294, 8294 + 8095)
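The same check can be done by hand, which makes the logic explicit. The sketch assumes traffic was meant to be split 50/50, so each pageview should land in the control group with probability 0.5. The variable names are arbitrary.

```r
n_control <- 8294
n_experiment <- 8095
n_total <- n_control + n_experiment

# Under a 50/50 split, the control fraction should be close to 0.5
p_expected <- 0.5
se <- sqrt(p_expected * (1 - p_expected) / n_total)
bounds <- p_expected + c(-1, 1) * qnorm(0.975) * se

p_observed <- n_control / n_total
round(c(lower = bounds[1], observed = p_observed, upper = bounds[2]), 4)

# TRUE means the observed split is within the range expected by chance
p_observed >= bounds[1] && p_observed <= bounds[2]
```

Here the observed fraction (about 0.5061) falls inside the 95% range (about 0.4923 to 0.5077), so this invariant passes.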

## Analyzing the Results for Single Metric   

* We have looked at how to analyze the results for a single metric during the morning lecture  
* Another test we might be interested in performing is the 'Sign Test'  

### Sign Test

* The sign test can be used if we want to further confirm the results, or if the metric doesn't follow any known distribution  

**Example**: We ran a test for one week and obtained the daily CTR for the control and the experiment groups below.

In [None]:
ctr_cont = c(0.33, 0.45, 0.39, 0.40, 0.57, 0.63, 0.61)
ctr_exp = c(0.53, 0.55, 0.61, 0.58, 0.68, 0.72, 0.70)

# If we take the difference of the daily CTR's
ctr_exp - ctr_cont

* If the true CTR were the same between the two groups, what would we expect to see in terms of the signs of the differences?   

* What is the probability of observing 7 positive differences?

In [None]:
# The one sample proportion test won't work too well
# Since the sample size is too small for the Normal approximation
prop.test(7, 7)

In [None]:
# We can calculate the p-value 
# by using the PMF/CDF of a Binomial distribution

dbinom(7, 7, 0.5) * 2
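An equivalent way to run the sign test is R's exact binomial test, `binom.test`, applied to the number of positive differences. The sketch below recomputes the differences so that it is self-contained. Dropping zero differences is the usual convention for a sign test, although there are none in this data.

```r
ctr_cont <- c(0.33, 0.45, 0.39, 0.40, 0.57, 0.63, 0.61)
ctr_exp <- c(0.53, 0.55, 0.61, 0.58, 0.68, 0.72, 0.70)

diffs <- ctr_exp - ctr_cont
n_positive <- sum(diffs > 0)
n_nonzero <- sum(diffs != 0)    # ties are dropped in a sign test

# Exact two-sided test of H_0: P(positive difference) = 0.5
sign_test <- binom.test(n_positive, n_nonzero, p = 0.5, alternative = "two.sided")
sign_test$p.value   # 0.015625, the same as dbinom(7, 7, 0.5) * 2
```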

## Analyzing the Results for Multiple Metrics   

**Q**: If we are testing 20 different metrics, and use the 95% confidence interval to make the decision on each test, what is the probability that you get at least one significant result by chance?  

### Bonferroni Correction  

* The more tests we run, the more likely that we will have significant results just by chance  
* One way to adjust for this is to reduce the significance level of each individual test
* The Bonferroni correction controls the familywise error rate (FWER) by rejecting the null hypothesis for all $p_i \leq \frac{\alpha}{m}$, where $p_i$ is the p-value for the $i$th test and $m$ is the number of tests
    * The familywise error rate (FWER) is the probability of rejecting at least one true $H_0$, i.e. the probability of making at least one type I error  
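The effect of the correction can be illustrated numerically. The first part of the sketch assumes the $m$ tests are independent and all null hypotheses are true. The p-values in the second part are made up for illustration.

```r
alpha <- 0.05
m <- 20

# Probability of at least one false positive across m independent tests
fwer_uncorrected <- 1 - (1 - alpha)^m
fwer_bonferroni <- 1 - (1 - alpha / m)^m
round(c(uncorrected = fwer_uncorrected, bonferroni = fwer_bonferroni), 3)
# about 0.642 and 0.049

# Hypothetical p-values from five tests (made up for illustration)
p_values <- c(0.001, 0.008, 0.020, 0.041, 0.300)

# Direct rule: reject when p_i <= alpha / m
p_values <= alpha / length(p_values)

# Equivalent rule using Bonferroni-adjusted p-values
p.adjust(p_values, method = "bonferroni") <= alpha
```

Both rules reject only the first two hypotheses here, even though four of the five raw p-values are below 0.05.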

### False Discovery Rate  (FDR)    

* Instead of controlling the FWER, we can also control the expected proportion of false discoveries - the FDR  

$$ FDR = E \left(\frac{\text{number of false positives}}{\text{number of rejections of the } H_0} \right)$$  

* Benjamini–Hochberg procedure:
    * Order the p-values of the $m$ tests: $p_{(1)}, p_{(2)}, \dots, p_{(m)}$
    * For a given $\alpha$ , find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} \alpha$
    * Reject the null hypothesis for the tests with the $k$ smallest p-values, i.e. those corresponding to $p_{(1)}, p_{(2)}, \dots, p_{(k)}$
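The steps of the Benjamini–Hochberg procedure can be sketched as follows, with a cross-check against R's built-in `p.adjust`. The p-values are made up for illustration and chosen so that the third-smallest one fails its own threshold but is still rejected, because a later one passes.

```r
# Hypothetical p-values from m = 8 tests (made up for illustration)
p_values <- c(0.001, 0.008, 0.020, 0.024, 0.042, 0.060, 0.074, 0.205)
alpha <- 0.05
m <- length(p_values)

# Step 1: order the p-values
ord <- order(p_values)
p_sorted <- p_values[ord]

# Step 2: find the largest k with p_(k) <= (k / m) * alpha
passes <- p_sorted <= (seq_len(m) / m) * alpha
k <- if (any(passes)) max(which(passes)) else 0

# Step 3: reject the k hypotheses with the smallest p-values
reject <- rep(FALSE, m)
reject[ord[seq_len(k)]] <- TRUE
reject

# Cross-check against the built-in BH adjustment
p.adjust(p_values, method = "BH") <= alpha
```

With these values $k = 4$, so the four smallest p-values are rejected. A Bonferroni correction at the same $\alpha$ would reject only the smallest one.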

## Make Recommendations  

* Do we have statistically and practically significant results?  
* Do we understand the results? Do they make sense?
* Do we want to launch the change? Is it worth it?  
* Do we want to launch the change for a slice of the users?  
* Do we need to run further tests?