# Split Testing
Practical Overview of A/B Testing

In [2]:
import scipy.stats as scs
import numpy as np

In [2]:
# preview data

In [3]:
# equations in LaTeX

In [4]:
# conclusion

## Types of Split Testing
* A/B Testing
* Multivariate Testing
* Multi-Armed Bandit

## General Procedure

### What can be tested? See an aggregated list below:
* Headlines
* Sub headlines
* Paragraph Text
* Testimonials
* Call to Action text
* Call to Action Button
* Links
* Images
* Content near the fold
* Social proof
* Media mentions
* Awards and badges
* Traffic
* App installs
* Lead generation
* Conversions
* Video views
* Catalog sales
* Reach
* Engagement

For the sources of this list, see the References section.

## Definitions

* confidence interval:
    * a type of interval estimate, computed from the statistics of the observed data, that might contain the true value of an unknown population parameter [Wikipedia](https://en.wikipedia.org/wiki/Confidence_interval)
    * the interval has an associated confidence level that, loosely speaking, quantifies the level of confidence that the parameter lies in the interval
    * it is not a definitive range of plausible values for the sample parameter, though it may be understood as an estimate of plausible values for the population parameter
    * a particular confidence interval of 95% calculated from an experiment does not mean that there is a 95% probability of a sample parameter from a repeat of the experiment falling within this interval
* critical value:  [StatisticsHowTo](http://www.statisticshowto.com/probability-and-statistics/find-critical-values/)
* effect size: 
* false positive: in split testing, a false positive result suggests that a variant improves a metric when the metric is actually unchanged or the observed difference is due to chance or other factors; the false positive rate is controlled by the significance level ($\alpha$), while larger sample sizes mainly reduce the risk of false negatives
* margin of error
* sample size, minimum
* significance level ($\alpha$): 
    * the probability of rejecting the null hypothesis when the null hypothesis is actually true, i.e. of making a type I error [StatisticsHowTo](http://www.statisticshowto.com/what-is-an-alpha-level/)
    * typically experiments are run with a significance level of 0.05 but ultimately the significance level will depend on the experiment
* standard error of the mean: the standard deviation of the sampling distribution of the sample mean
* statistical power: 
    * probability of finding an effect if it is real [4]
    * probability of rejecting the null hypothesis when the alternative hypothesis is true [Wikipedia](https://en.wikipedia.org/wiki/Power_(statistics))
* statistical significance: 
    * statistical significance occurs when the resulting p-value from an experiment is less than the level of significance, $\alpha$
    * if there is statistical significance, the null hypothesis can be rejected
* t-distribution or Student's t-distribution: 
    * continuous probability distribution that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown [Wikipedia](https://en.wikipedia.org/wiki/Student%27s_t-distribution)
    * can be used to approximate the confidence interval of the true mean of a normal distribution
* t-test, Student: [Wikipedia](https://en.wikipedia.org/wiki/Student%27s_t-test)
    * standard Student's t-test for two independent samples with equal sample sizes and equal variance
    
* t-test, Welch's: [Wikipedia](https://en.wikipedia.org/wiki/Welch%27s_t-test)
    * adaptation of Student's t-test for two independent samples that may have unequal variances and unequal sample sizes
* type I error:
    * false positive
* type II error: 
    * false negative
    * failing to reject the null hypothesis when the null hypothesis is false
    * probability of type II error decreases as statistical power increases
* variance ($\sigma^2$):
    * standard deviation squared
    * for a binomial distribution: $np(1-p)$
* z-score: a z-score is the distance measured in number of population standard deviations from a data point to the population mean [StatisticsHowTo](http://www.statisticshowto.com/probability-and-statistics/z-score/)


## Rule of Thumb for Estimating Minimum Sample Size [8]

For a power of 80% (typical):

$$ n = 16 \frac{\sigma^2}{\delta^2} $$

where:
* $n$ is the minimum sample size
* $\sigma^2$ is the variance
* $\delta$ is the minimum effect size
* the constant 16 corresponds to a statistical power of 80%; use 26 for a statistical power of 95%

For a binomial proportion:
$$ \sigma^2 = p(1-p) $$
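
As a rough illustration of the rule of thumb, the sketch below computes a minimum sample size for a conversion-rate test. The function name `rule_of_thumb_sample_size`, its parameters, and the example baseline rate are hypothetical. It assumes the per-observation variance of a proportion, $p(1-p)$, and reads $n$ as the size of each variation.

```python
import math

def rule_of_thumb_sample_size(p, delta, constant=16):
    """Approximate minimum sample size using n = constant * variance / delta^2.

    p        -- assumed baseline conversion rate (hypothetical input)
    delta    -- minimum absolute effect to detect, e.g. 0.02 for 2 points
    constant -- 16 for roughly 80% power, 26 for roughly 95% power
    """
    variance = p * (1 - p)  # per-observation variance of a proportion
    raw = constant * variance / delta ** 2
    # round first to absorb floating-point noise, then round up to a whole unit
    return math.ceil(round(raw, 6))

# hypothetical example: 10% baseline rate, detect an absolute change of 2 points
print(rule_of_thumb_sample_size(p=0.10, delta=0.02))
```

Halving the minimum effect size quadruples the required sample size, because $\delta$ enters the formula squared.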

## Power Formula

$$ Z_{power} = \frac{\text{difference}}{\text{standard error}(\text{difference})} - Z_{\alpha/2} $$
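
A minimal sketch of the power formula, assuming a two-tailed test and a normally distributed difference. The function name `power_from_difference` and the example inputs are hypothetical. It uses the standard library's `statistics.NormalDist`, whose `inv_cdf` and `cdf` play the same role here as `ppf` and `cdf` on `scs.norm()`.

```python
from statistics import NormalDist

def power_from_difference(difference, se_difference, significance=0.05):
    """Return (z_power, power) for an observed or expected difference."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z_alpha = std_normal.inv_cdf(1 - significance / 2)  # Z_{alpha/2}
    z_power = difference / se_difference - z_alpha
    power = std_normal.cdf(z_power)  # area to the left of z_power
    return z_power, power

# hypothetical example: a difference 2.8 standard errors away from zero
z_power, power = power_from_difference(difference=2.8, se_difference=1.0)
print(z_power, power)
```

This ignores the small probability of rejecting in the opposite tail, which is negligible when the difference is several standard errors wide.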

## Equation for Standard Error of the Mean

$$ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} $$
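
In practice $\sigma$ is usually unknown and is estimated by the sample standard deviation. The sketch below makes that substitution explicit; the function name `standard_error_of_mean` and the data are hypothetical.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Estimate the standard error of the mean from a sample."""
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(len(sample))

# hypothetical measurements
print(standard_error_of_mean([12, 15, 11, 14, 13]))
```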

## Equation for Minimum Sample Size
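
One common form, offered here as an assumption rather than the author's intended equation, follows from rearranging the power formula for two equal-sized groups with equal variance, where the difference is $\delta$ and its standard error is $\sqrt{2\sigma^2/n}$:

$$ n = \frac{2\sigma^2 \left(Z_{\alpha/2} + Z_{power}\right)^2}{\delta^2} $$

The sketch below evaluates it; the function name `min_sample_size` and the example values are hypothetical, and the standard library's `statistics.NormalDist` stands in for `scs.norm()`.

```python
import math
from statistics import NormalDist

def min_sample_size(variance, delta, significance=0.05, power=0.80):
    """Minimum sample size per group for a two-tailed test (equal groups assumed)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - significance / 2)  # Z_{alpha/2}
    z_power = std_normal.inv_cdf(power)                 # Z_{power}
    return math.ceil(2 * variance * (z_alpha + z_power) ** 2 / delta ** 2)

# hypothetical example: 10% baseline rate, detect an absolute change of 2 points
print(min_sample_size(variance=0.10 * 0.90, delta=0.02))
```

With a significance level of 0.05 and a power of 80%, the factor $2(Z_{\alpha/2} + Z_{power})^2$ is about 15.7, which is where the constant 16 in the rule of thumb comes from.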



## Given significance level, find z-score and critical value
How: use a [z-table](http://www.z-table.com/) or the `get_zscore` function in the cell below.

#### For one-tailed test:
1. Subtract the significance level from 1 to get the cumulative area under the curve to the left of the critical point.
2. Find the z-value whose cumulative area (the area to its left) equals the area computed in the first step.

#### For two-tailed test (typical):
1. Subtract half of the significance level from 1 to get the cumulative area under the curve to the left of the upper critical point.
2. Find the z-value whose cumulative area (the area to its left) equals the area computed in the first step.

#### Common z-scores for two-tailed tests
Confidence Level | Significance Level | z-score
--- | --- | ---
80% | 0.20 | 1.28
85% | 0.15 | 1.44
90% | 0.10 | 1.65
95% | 0.05 | 1.96
99% | 0.01 | 2.58

The z score of a raw score, x:
$$ z = \frac{x - \mu}{\sigma} $$

### To find the critical value
Take the sample mean and add/subtract the z-score multiplied by the standard error of the mean:
$$ cv = \bar{x} \pm z \times s_{\bar{x}} $$

**When comparing two independent samples, the statistical power is the area under the variant's distribution that lies beyond the critical value (to the right of it when the variant's effect is positive).**
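
The statement above can be sketched as follows, assuming the control and variant sampling distributions are normal with the same standard error and the variant's mean is the larger one. The function name `power_from_critical_value` and the example numbers are hypothetical; `statistics.NormalDist` from the standard library is used in place of `scs.norm()`.

```python
from statistics import NormalDist

def power_from_critical_value(mean_control, mean_variant, standard_error,
                              significance=0.05):
    """Area of the variant's distribution to the right of the upper critical value."""
    z = NormalDist().inv_cdf(1 - significance / 2)
    # upper critical value, measured from the control mean
    critical_value = mean_control + z * standard_error
    variant_dist = NormalDist(mu=mean_variant, sigma=standard_error)
    return 1 - variant_dist.cdf(critical_value)

# hypothetical example: 10% vs 12% conversion, standard error of about 0.0071
print(power_from_critical_value(0.10, 0.12, 0.02 / 2.8))
```

When the two means are equal, the result falls to half the significance level, which is the chance of a false positive in that tail.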

In [9]:
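def get_zscore(significance=0.05, two_tailed=True):
    """Return the z-score (critical z-value) for a given significance level.

    Arguments:
        significance (float): typically 0.05 for a 5% significance level, but ultimately depends on the experiment
        two_tailed (bool): True for a two-tailed test, False for a one-tailed test
    Returns:
        z_score (float)
    """
    norm_dist = scs.norm()  # standard normal distribution (mean 0, standard deviation 1)
    if two_tailed:
        # split the significance level evenly between the two tails
        cumulative_area = 1 - significance / 2
    else:
        cumulative_area = 1 - significance
    # ppf is the inverse CDF: the z-value with this much area to its left
    return norm_dist.ppf(cumulative_area)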

In [16]:
get_zscore(significance=0.01, two_tailed=True)

2.5758293035489004

## Given level of statistical power, find z_power
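
A minimal sketch, assuming $Z_{power}$ is the standard normal quantile of the desired power, which matches how it appears in the power formula. The function name `get_zpower` is hypothetical. It uses the standard library's `statistics.NormalDist`; with the `scs` import in this notebook, `scs.norm().ppf(power)` is the equivalent call.

```python
from statistics import NormalDist

def get_zpower(power=0.80):
    """Return the z-value with `power` of the standard normal area to its left."""
    return NormalDist().inv_cdf(power)

print(get_zpower(0.80))
print(get_zpower(0.95))
```

For 80% power this is roughly 0.84, and for 95% power roughly 1.645.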



## Equation to Calculate the Normal Confidence Interval for the Population Mean

For a known standard deviation:
$$ \left(\bar{x} - z^{*}\frac{\sigma}{\sqrt{n}}, \: \bar{x} + z^{*}\frac{\sigma}{\sqrt{n}}\right) $$

For an unknown standard deviation:
$$ \left(\bar{x} - t^{*}\frac{s}{\sqrt{n}}, \: \bar{x} + t^{*}\frac{s}{\sqrt{n}}\right) $$
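
A short sketch of the known-standard-deviation case, using the standard error $\sigma/\sqrt{n}$ from the Standard Error of the Mean section. The function name `normal_confidence_interval` and the example values are hypothetical, and `statistics.NormalDist` stands in for `scs.norm()`.

```python
import math
from statistics import NormalDist

def normal_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) bounds of the confidence interval for the mean."""
    # two-tailed critical value z*, e.g. about 1.96 for 95% confidence
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / math.sqrt(n)  # margin of error
    return sample_mean - margin, sample_mean + margin

# hypothetical example: mean 100, known sigma 15, 25 observations
print(normal_confidence_interval(100, 15, 25))
```

For an unknown standard deviation, the same structure applies with $s$ in place of $\sigma$ and $t^{*}$ taken from a t-distribution with $n-1$ degrees of freedom, for example `scs.t.ppf(1 - (1 - confidence) / 2, df=n - 1)`.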



## Equation to Calculate the Binomial Confidence Interval
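
One common choice, given here as an assumption since the section does not state which interval is intended, is the normal approximation (Wald) interval for an observed proportion $\hat{p}$ from $n$ trials:

$$ \left(\hat{p} - z^{*}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \: \hat{p} + z^{*}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) $$

The sketch below evaluates it; the function name `binomial_confidence_interval` and the example counts are hypothetical, and `statistics.NormalDist` stands in for `scs.norm()`.

```python
import math
from statistics import NormalDist

def binomial_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a binomial proportion."""
    p_hat = successes / n  # observed proportion
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# hypothetical example: 100 conversions out of 1000 visitors
print(binomial_confidence_interval(100, 1000))
```

The approximation is weakest for small samples or proportions close to 0 or 1, where the bounds can even fall outside the range from 0 to 1.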


## References

1. https://support.google.com/analytics/answer/2844870?hl=en
2. https://www.facebook.com/business/help/1738164643098669?helpref=related
3. https://www.exp-platform.com/Documents/controlledExperimentDMKD.pdf
4. https://web.stanford.edu/~kcobb/hrp261/lecture8.ppt
3. https://vwo.com/ab-testing/
4. https://conversionsciences.com/blog/ab-testing-guide/
5. https://conversionxl-com.cdn.ampproject.org/v/s/conversionxl.com/blog/ab-testing-guide/
6. http://blog.analytics-toolkit.com/2017/importance-statistical-power-online-ab-tests/
7. https://medium.com/airbnb-engineering/experiments-at-airbnb-e2db3abf39e7
8. http://www.evanmiller.org/how-not-to-run-an-ab-test.html
9. http://www.evanmiller.org/ab-testing/sample-size.html