# Lecture 3.2: Analysis of A/B Testing Results

**Reference**: [A/B Testing - Udacity](https://www.udacity.com/course/ab-testing--ud257)

* Now that we've decided on the metric(s) to evaluate, chosen the sample size, and run the experiment, we want to see what we can conclude and recommend from the data collected.  

## Invariant Checking  

* Before we dive into comparing the click-through probability, we need to do some sanity checks to make sure the experiment was actually run properly.
    * Something might have gone wrong in the experiment diversion. Are your control and experiment groups still comparable?  
    * Did the data capture the events you were looking for?  

* We need to check whether the experiment and control populations are actually comparable  
* Invariants are metrics that shouldn't change between the control and experiment groups when you run your experiment  

**Q**: Suppose the metric we want to evaluate is the click-through rate. Which metrics can we use for invariant checking?   

**Q**: If we observed a total of 8294 pageviews in the control group and 8095 pageviews in the experiment group, how do we use these counts for invariant checking?   

In [4]:
# We can perform a Binomial/proportion test on the invariant
prop.test(8294, 8294 + 8095)


	1-sample proportions test with continuity correction

data:  8294 out of 8294 + 8095, null probability 0.5
X-squared = 2.3921, df = 1, p-value = 0.122
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.4983857 0.5137537
sample estimates:
        p 
0.5060711 
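
The p-value of 0.122 is above 0.05, so the test gives no evidence that the pageview split differs from 50/50. As a cross-check, the same question can be answered by building a 95% interval around the expected fraction and seeing whether the observed fraction falls inside it. The sketch below assumes the intended split is 50/50 (the same null value `prop.test` uses by default) and relies on the Normal approximation; the variable names are illustrative only.

```r
# Observed pageview counts from the example above
n_control <- 8294
n_experiment <- 8095
n_total <- n_control + n_experiment

# Assumption: traffic was meant to be split evenly between the groups
p_expected <- 0.5

# Standard error of the control fraction under the expected split
se <- sqrt(p_expected * (1 - p_expected) / n_total)

# 95% interval for the control fraction if the diversion worked as intended
margin <- qnorm(0.975) * se
lower <- p_expected - margin
upper <- p_expected + margin

# Observed fraction of pageviews that landed in the control group
p_observed <- n_control / n_total
within_bounds <- p_observed >= lower && p_observed <= upper

print(round(c(lower = lower, observed = p_observed, upper = upper), 4))
print(within_bounds)
```

If `within_bounds` is `TRUE`, the invariant passes this sanity check. If it is `FALSE`, the diversion or the data capture should be investigated before analyzing the main metric.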


## Analyzing the Results for Single Metric   

* We have looked at how to analyze the results for a single metric during the morning lecture  
* Another test we might be interested in performing is the 'Sign Test'  

### Sign Test

* The sign test can be used if we want to further confirm the results, or if we can't assume that the metric follows any known distribution  

**Example**: We ran a test for one week and obtained the daily CTRs for the control and experiment groups shown below.

In [9]:
ctr_cont = c(0.33, 0.45, 0.39, 0.40, 0.57, 0.63, 0.61)
ctr_exp = c(0.53, 0.55, 0.61, 0.58, 0.68, 0.72, 0.70)

# Take the difference of the daily CTRs (experiment minus control)
ctr_exp - ctr_cont

* If the true CTR were the same in the two groups, what would we expect to see in terms of the signs of the differences?   

* Under that assumption, what is the probability of observing 7 positive differences out of 7?

In [10]:
# The one-sample proportion test is unreliable here because
# the sample size is too small for the Normal approximation
prop.test(7, 7)

“Chi-squared approximation may be incorrect”


	1-sample proportions test with continuity correction

data:  7 out of 7, null probability 0.5
X-squared = 5.1429, df = 1, p-value = 0.02334
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5609339 1.0000000
sample estimates:
p 
1 


In [12]:
# We can calculate the exact two-sided p-value
# from the PMF of a Binomial(7, 0.5) distribution:
# P(7 positives) + P(7 negatives) = 2 * 0.5^7 = 0.015625

dbinom(7, 7, 0.5) * 2
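
The same exact calculation can be delegated to `binom.test`, which avoids the Normal approximation altogether. The sketch below recomputes the daily differences from the example data, counts the positive signs, and runs a two-sided exact binomial test against a success probability of 0.5. Dropping days with a zero difference is a common sign-test convention and is an assumption here (there are no ties in this data).

```r
# Daily CTRs from the example above
ctr_cont <- c(0.33, 0.45, 0.39, 0.40, 0.57, 0.63, 0.61)
ctr_exp  <- c(0.53, 0.55, 0.61, 0.58, 0.68, 0.72, 0.70)

# Sign of each daily difference (experiment minus control)
diffs <- ctr_exp - ctr_cont

# Assumption: ties (zero differences) are dropped; none occur here
n_days <- sum(diffs != 0)
n_positive <- sum(diffs > 0)

# Exact two-sided test of H0: P(positive difference) = 0.5
sign_test <- binom.test(n_positive, n_days, p = 0.5,
                        alternative = "two.sided")
print(sign_test$p.value)
```

With 7 positive differences out of 7, the p-value matches `dbinom(7, 7, 0.5) * 2`.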

## Analyzing the Results for Multiple Metrics   

**Q**: If we are testing 20 different metrics and use a 95% confidence level to make the decision on each test, what is the probability that we get at least one significant result by chance?  
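
One way to reason about this question, assuming the 20 tests are independent and every null hypothesis is true (both are simplifying assumptions, since metrics in a real experiment are often correlated), is to take the complement of the probability that no test comes out significant:

```r
alpha <- 0.05   # per-test significance level (95% confidence)
m <- 20         # number of metrics tested

# P(no false positive in any single test) = 1 - alpha
# Assumption: the m tests are independent and all nulls are true
p_at_least_one <- 1 - (1 - alpha)^m
print(round(p_at_least_one, 3))
```

Under these assumptions the probability is roughly 0.64, far above the 0.05 we might naively expect.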

### Bonferroni Correction  

* The more tests we run, the more likely that we will have significant results just by chance  
* One way to adjust for this is to reduce the significance level of each individual test
* The Bonferroni correction controls the familywise error rate (FWER) by rejecting the null hypothesis for all tests with $p_i \leq \frac{\alpha}{m}$, where $p_i$ is the p-value for the $i$th test and $m$ is the number of tests
    * The familywise error rate (FWER) is the probability of rejecting at least one true $H_0$, i.e. the probability of making at least one type I error  
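
As a minimal sketch of the correction, the code below applies the $\frac{\alpha}{m}$ threshold to a set of p-values and compares the result with R's built-in `p.adjust`. The p-values are made up purely for illustration and do not come from any experiment in this lecture.

```r
# Hypothetical p-values for m = 10 metrics (illustrative only)
p_values <- c(0.041, 0.001, 0.216, 0.008, 0.060,
              0.039, 0.212, 0.042, 0.074, 0.205)
alpha <- 0.05
m <- length(p_values)

# Bonferroni: reject H0 only when p_i <= alpha / m
reject_manual <- p_values <= alpha / m

# Equivalent: multiply each p-value by m (capped at 1), compare to alpha
reject_adjusted <- p.adjust(p_values, method = "bonferroni") <= alpha

# Indices of the metrics that remain significant
print(which(reject_manual))
```

Without any correction, five of these hypothetical p-values fall below 0.05; with the Bonferroni threshold of 0.005, only one does.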

### False Discovery Rate  (FDR)    

* Instead of controlling the FWER, we can also control the expected proportion of false discoveries - the FDR  

$$ FDR = E \left(\frac{\text{number of false positives}}{\text{number of rejections of the } H_0} \right)$$  

* Benjamini–Hochberg procedure:
    * Order the p-values of the $m$ tests: $p_{(1)}, p_{(2)}, \dots, p_{(m)}$
    * For a given $\alpha$, find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} \alpha$
    * Reject the null hypothesis for the tests with the $k$ smallest p-values, $p_{(1)}, p_{(2)}, \dots, p_{(k)}$
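
The steps above can be sketched directly in R and checked against the built-in `p.adjust(..., method = "BH")`. The p-values are hypothetical and chosen only to illustrate the procedure; the variable names are likewise illustrative.

```r
# Hypothetical p-values for m = 10 metrics (illustrative only)
p_values <- c(0.041, 0.001, 0.216, 0.008, 0.060,
              0.039, 0.212, 0.042, 0.074, 0.205)
alpha <- 0.05
m <- length(p_values)

# Step 1: order the p-values from smallest to largest
ord <- order(p_values)
sorted_p <- p_values[ord]

# Step 2: find the largest k with p_(k) <= (k / m) * alpha
passes <- sorted_p <= (seq_len(m) / m) * alpha
k <- if (any(passes)) max(which(passes)) else 0

# Step 3: reject H0 for the tests with the k smallest p-values
reject_bh <- rep(FALSE, m)
if (k > 0) reject_bh[ord[seq_len(k)]] <- TRUE

# Cross-check with R's built-in BH-adjusted p-values
reject_builtin <- p.adjust(p_values, method = "BH") <= alpha

print(k)
print(which(reject_bh))
```

For these made-up values the procedure rejects two null hypotheses, whereas a Bonferroni threshold of $\frac{\alpha}{m} = 0.005$ would reject only one, which illustrates why controlling the FDR is usually less conservative than controlling the FWER.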

## Make Recommendations  

* Do we have statistically and practically significant results?  
* Do we understand the results? Do they make sense?
* Do we want to launch the change? Is it worth it?  
* Do we want to launch the change for a slice of the users?  
* Do we need to run further tests?