# A/B Testing - Power Analysis

Defining a boundary of practical significance in advance is very useful when planning an A/B test. Once you know the minimum effect you expect or need to achieve, you can determine __how many observations__ are necessary to reach a desired level of measurement reliability. Where external constraints limit your sample size, you can work backwards and estimate the __minimum detectable effect__ required for a significant experiment result. Finally, this knowledge can be translated into __how long__ your test needs to run and whether it is feasible.
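
As a rough illustration of the last point, the sketch below converts a required sample size into a test duration. Both `n_per_group` and `daily_visitors` are hypothetical placeholder values, and the calculation assumes that every daily visitor enters the experiment and is split evenly between control and variation.

```r
# Hypothetical inputs (assumptions for illustration only)
n_per_group    <- 57748   # required observations per group from a power analysis
daily_visitors <- 8000    # eligible visitors per day entering the experiment

# Two groups share the traffic, so the total requirement is 2 * n_per_group
total_needed <- 2 * n_per_group

# Round up to full days
days_needed <- ceiling(total_needed / daily_visitors)
days_needed
```

If the resulting duration is longer than you can afford, that is a signal to revisit the effect size, power or traffic assumptions before starting the test.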

A power analysis allows you to calculate any one statistical parameter of your test given the others. In most cases, it is used to estimate the required sample size given an expected effect size as well as alpha and power levels. Details on each parameter are outlined below:

__Alpha:__
- P(Type I Error) = Probability of finding an effect when there is none (false positive)
- Typically set to .05

__Power:__
- 1 - Beta = Probability of finding an effect when there is one (true positive)
- Where Beta:
    - P(Type II Error) = Probability of NOT finding an effect when there is one (false negative)
- Typically set to .80

__Effect size:__
- Usually referred to as expected effect
- Defined by
    - Previous tests and/or educated guess
    - Estimation (e.g. Cohen's d or Cohen's h)

__N:__
- Sample size (usually the parameter you are solving for)
- May be known/fixed due to external constraints
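
The interplay of these four parameters can be sketched with base R's `power.t.test`, which is swapped in here for the `pwr` package only to keep the example free of dependencies. Whichever of `n`, `delta` or `power` is left out is solved for. The standardized effect of 0.5 (with `sd = 1`, so `delta` equals Cohen's d) and the group size of 64 are arbitrary illustration values.

```r
# Solve for the sample size per group given effect size, alpha and power
res_n <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)
res_n$n

# Solve for power given a fixed sample size per group
res_power <- power.t.test(n = 64, delta = 0.5, sd = 1, sig.level = 0.05)
res_power$power

# Solve for the detectable difference given sample size and power
res_delta <- power.t.test(n = 64, sd = 1, sig.level = 0.05, power = 0.80)
res_delta$delta
```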

Since alpha and power are usually set to .05 and .80, your primary concern is the effect size. The most commonly asked question at this point is 'How can I know the expected effect when I haven't run the test yet?'. Approach this question by thinking about your desired level of practical significance, i.e. the threshold that, once passed, justifies the additional effort of rolling out the variation to all of your users. If you're unsure, you can fall back on the general effect size levels shown for Cohen's d and Cohen's h below.


__Please note:__ Although there are no formal standards for power, most researchers assess the power of their tests using .80 as a standard for adequacy. This convention implies a four-to-one trade off between β-risk and α-risk. (β is the probability of a Type II error, and α is the probability of a Type I error; .2 and .05 are conventional values for β and α). However, there will be times when this 4-to-1 weighting is inappropriate.
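
To get a feel for what a stricter power requirement costs, the sketch below uses the normal approximation $n \approx 2 (z_{1-\alpha/2} + z_{1-\beta})^2 / d^2$ for the per-group sample size of a two-sided, two-sample test. The helper name `n_per_group_approx` and the effect size of 0.2 are illustrative assumptions, and the approximation slightly understates the exact t-test result for small samples.

```r
# Approximate per-group sample size for a two-sided two-sample test
# (normal approximation; helper name is hypothetical)
n_per_group_approx <- function(d, alpha = 0.05, power = 0.80) {
    z_alpha <- qnorm(1 - alpha / 2)  # critical value for the two-sided test
    z_beta  <- qnorm(power)          # quantile matching the desired power
    2 * (z_alpha + z_beta)^2 / d^2
}

# Ratio of beta-risk to alpha-risk under the conventional settings
beta_alpha_ratio <- (1 - 0.80) / 0.05

# Relative increase in sample size when moving from 80% to 90% power
n_80 <- n_per_group_approx(d = 0.2, power = 0.80)
n_90 <- n_per_group_approx(d = 0.2, power = 0.90)
n_ratio <- n_90 / n_80

beta_alpha_ratio
n_ratio
```

Under this approximation, raising power from .80 to .90 requires roughly a third more observations per group, independent of the effect size.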

__References:__
- [USU](http://rgs.usu.edu/irb/wp-content/uploads/sites/12/2015/08/A_Researchers_Guide_to_Power_Analysis_USU.pdf)
- [Wikipedia](https://en.wikipedia.org/wiki/Power_(statistics))

### Effect Size

#### Cohen's d
Cohen’s d is a __standardized measure of the difference between two means__ from two normally distributed variables.
It is defined as the difference between two means divided by the standard deviation of the data:

\begin{equation*}
\ d = \frac{\mu_1 - \mu_2}{s}
\end{equation*}

Cohen defined s, the pooled standard deviation, as (for two independent samples):

\begin{equation*}
\ s = \sqrt{\frac{(n_1 - 1)s_1^2  +  (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}
\end{equation*}

where the variance for one of the groups is defined as:

\begin{equation*}
\ s_1^2 = \frac{1}{n_1-1} \sum_{i=1}^{n_1}(x_{1i} - {\bar{x_1}})^2
\end{equation*}

and similarly for the other group. Because it's a standardized measure, there are specific thresholds which can be used to __interpret the effect size's magnitude__:
- d = 0.01: Very small
- d = 0.20: Small
- d = 0.50: Medium
- d = 0.80: Large
- d = 1.20: Very large
- d = 2.00: Huge
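
The definitions above can be sketched on simulated data as follows. The two groups, their means and the common standard deviation are made-up values for illustration, and `s_pooled` and `d_manual` are hypothetical names.

```r
set.seed(42)

# Simulated control and test groups (illustrative values only)
x <- rnorm(5000, mean = 10.0, sd = 2)
y <- rnorm(5000, mean = 10.4, sd = 2)

n1 <- length(x)
n2 <- length(y)

# Pooled standard deviation for two independent samples
s_pooled <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))

# Cohen's d as the absolute standardized difference in means
d_manual <- abs(mean(x) - mean(y)) / s_pooled
d_manual
```

With a true standardized difference of 0.4 / 2 = 0.2, the estimate should land near the "small" threshold.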

__References:__
- [Wikipedia](https://en.wikipedia.org/wiki/Effect_size#Cohen%27s_d)

In [4]:
# Effect size function for Cohen's d
cohens_d <- function(x, y=NULL, mde=NULL){
    # Estimates Cohen's d for two sample t-test
    #
    # Args:
    # - x: Vector with data for control group
    # - y: Vector with data for test group
    # - mde: Minimum detectable effect (threshold of practical significance)
    #
    # Returns:
    # - d: Cohen's d effect size
  
    if (!is.null(y)){
        # Sample size
        n1 <- max(length(x), 2)
        n2 <- max(length(y), 2)
        
        # Difference in means
        diff_means  <- abs(mean(x) - mean(y))
        
        # Pooled standard deviation
        sd <- sqrt(((n1-1) * var(x) + (n2-1) * var(y)) / (n1 + n2 - 2))
    
    } else if (!is.null(mde)){
        # Sample size
        n1 <- max(length(x), 2)
        n2 <- n1
    
        # Difference in means
        diff_means  <- abs(mean(x) * (1+mde) - mean(x))
    
        # Pooled standard deviation assuming equal variances
        sd <- sqrt(((n1-1) * var(x) + (n2-1) * var(x)) / (n1 + n2 - 2))
    
    } else {
        stop('Please specify either y or mde.')
    }

    # Return Cohen's d
    return(diff_means / sd)
}


# Effect size
effect_size <- cohens_d(data_control, data_test)
effect_size

#### Cohen's h
Similar to the above, Cohen's h can be used as a standardized measure of the __difference between two independent proportions__. Given two proportions (or probabilities) $p_1$ and $p_2$, each between 0 and 1, it is defined as the difference between their arcsine transformations, i.e.:

$\ h = 2 \arcsin (\sqrt{p_1}) - 2 \arcsin (\sqrt{p_2}) $

Sometimes Cohen's h is referred to as "directional h" because, in addition to showing the magnitude of the difference, it shows which of the two proportions is greater. Nonetheless, researchers often report a "non-directional h", which is just the absolute value of the directional h, $|h|$. Interpretations of magnitude are similar to those for Cohen's d shown above:

- h = 0.20: Small
- h = 0.50: Medium
- h = 0.80: Large
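
As a small worked example, the sketch below computes the directional and non-directional h for a hypothetical baseline conversion rate of 10% and a 5% relative improvement. The helper name `h_directional` and the rates are assumptions for illustration.

```r
# Directional Cohen's h for two proportions (hypothetical helper)
h_directional <- function(p1, p2) {
    2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
}

p_control <- 0.10                # baseline conversion rate (assumed)
p_test    <- p_control * 1.05    # 5% relative improvement (assumed)

# Negative here because the second proportion is the larger one
h_dir    <- h_directional(p_control, p_test)
h_nondir <- abs(h_dir)

h_dir
h_nondir
```

A relative lift of this size yields an h far below the "small" threshold of 0.20, which is why such tests tend to need large samples.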

__References:__
- [Wikipedia](https://en.wikipedia.org/wiki/Cohen%27s_h)

In [5]:
# Effect size function for Cohen's h
cohens_h <- function(x, y=NULL, mde=NULL){
    # Estimates non-directional Cohen's h for two sample test of binomial proportions
    #
    # Args:
    # - x: Either vector with data for control group or control group mean
    # - y: Either vector with data for test group or test group mean
    # - mde: Minimum detectable effect (threshold of practical significance)
    #
    # Returns:
    # - h: Cohen's h effect size
    
    if (!is.null(y)){
        # Proportions
        p1 <- mean(x)
        p2 <- mean(y)
    
    } else if (!is.null(mde)){
        # Proportions
        p1 <- mean(x)
        p2 <- p1 * (1+mde)
        
    } else {
        stop('Please specify either y or mde.')
    }
    
    # Return Cohen's h
    return(abs(2*asin(sqrt(p2)) - 2*asin(sqrt(p1))))
}


# (Non-directional) effect size
effect_size <- cohens_h(data_control, data_test)
effect_size

### Sample Size Estimation

#### Test for Two Independent Samples
_Normal Distributions (Approximately)_

When testing for two independent samples it can be difficult to estimate an expected effect size using Cohen's d, mainly because we don't know whether our variation is going to influence all users within our test group. Thus, you could choose a value from the list of thresholds above as a proxy.

In [34]:
# Sample size estimation for equal sample sizes
# Source: https://www.statmethods.net/stats/power.html
library(pwr)

# Plug-in proxy for effect size ...
effect_size <- 0.01

# ... or estimate Cohen's d based on MDE (e.g. 5% improvement)
mde <- 0.05
effect_size <- cohens_d(data_control, mde=mde)

# Estimate required sample size per group
pwr.t.test(
    d = effect_size,
    sig.level = 0.05,
    power = 0.8,
    alternative = 'two.sided'
)


     Two-sample t test power calculation 

              n = 482063
              d = 0.005706461
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
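
The reported sample size can be cross-checked with the normal approximation $n \approx 2 (z_{1-\alpha/2} + z_{1-\beta})^2 / d^2$, which is a close stand-in for the t-test calculation when n is large. This is a sketch for plausibility checking, not what `pwr.t.test` computes internally (it works with the noncentral t distribution). The effect size is copied from the output above.

```r
d     <- 0.005706461   # effect size reported in the output above
alpha <- 0.05
power <- 0.80

# Standard normal quantiles for the two-sided test and the desired power
z_alpha <- qnorm(1 - alpha / 2)
z_beta  <- qnorm(power)

# Approximate number of observations required in each group
n_approx <- 2 * (z_alpha + z_beta)^2 / d^2
n_approx
```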


#### Test for Two Independent Proportions
_Binomial Distributions_

As opposed to two (approximately) normally distributed independent samples, you have much more control when working with proportions (e.g. conversion rates). Starting from your current baseline, you can simply apply the expected improvement based on your practical significance boundary.

In [None]:
# Sample size estimation for equal sample sizes
# Source: https://www.statmethods.net/stats/power.html
library(pwr)

# Plug-in proxy for effect size ...
effect_size <- 0.01

# ... or estimate Cohen's h based on MDE (e.g. 5% improvement)
mde <- 0.05
effect_size <- cohens_h(data_control, mde=mde)

# Estimate required sample size per group
pwr.2p.test(
    h = effect_size,
    sig.level = 0.05,
    power = 0.8,
    alternative = 'two.sided'
)
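
To make the step of applying the expected improvement to the baseline concrete, the following sketch starts from a hypothetical baseline conversion rate of 10%, applies a 5% relative improvement and derives an approximate sample size per group via $n \approx 2 (z_{1-\alpha/2} + z_{1-\beta})^2 / h^2$. The baseline rate is an assumption, and the normal approximation is used here instead of `pwr.2p.test` to keep the example dependency-free.

```r
p_baseline <- 0.10   # current conversion rate (assumed)
mde        <- 0.05   # relative improvement of practical significance

# Expected conversion rate of the variation
p_variation <- p_baseline * (1 + mde)

# Non-directional Cohen's h between baseline and variation
h <- abs(2 * asin(sqrt(p_variation)) - 2 * asin(sqrt(p_baseline)))

# Approximate observations per group for alpha = .05 and power = .80
n_approx <- 2 * (qnorm(1 - 0.05 / 2) + qnorm(0.80))^2 / h^2
ceiling(n_approx)
```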

### Minimum Detectable Effect Estimation

Oftentimes we find ourselves in a situation where we have __external constraints on the size of our experiment__ (e.g. number of users). Given a fixed sample size for our test and control groups we can estimate what minimum detectable effect (MDE) we need to observe in order to achieve a desired level of measurement reliability. By using the knowledge about the size of our control group and its statistical properties we can estimate the minimum effect size required to achieve a significant test result. Finally, we use the estimated effect size and translate it to a minimum detectable effect.
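
Under the normal approximation, the smallest standardized effect detectable with n observations per group is $d_{min} \approx (z_{1-\alpha/2} + z_{1-\beta}) \sqrt{2/n}$. The sketch below applies this and translates the result into a relative MDE. The control group mean and standard deviation are hypothetical values, and the exact t-based figure from `pwr.t.test` may differ slightly.

```r
n_per_group <- 10000   # fixed group size (external constraint)
alpha       <- 0.05
power       <- 0.80

# Smallest detectable standardized effect (normal approximation)
d_min <- (qnorm(1 - alpha / 2) + qnorm(power)) * sqrt(2 / n_per_group)

# Hypothetical control group statistics
control_mean <- 25.0
control_sd   <- 12.0

# Translate into an absolute difference in means and a relative MDE
diff_means   <- d_min * control_sd
mde_relative <- diff_means / control_mean

d_min
mde_relative
```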

#### Test for Two Independent Samples
_Normal Distributions (Approximately)_

In [30]:
# Minimum detectable effect (MDE) estimation for equal sample sizes
library(pwr)

# Function to estimate minimum detectable effect (MDE) for two sample t-test
mde.t.test <- function(x, d){
    # Estimates minimum detectable effect for two sample t-test given Cohen's d
    #
    # Args:
    # - x: Vector with data for control group
    # - d: Cohen's d effect size measure
    #
    # Returns:
    # - mde: Minimum detectable effect
  
    # Sample size
    n1 <- max(length(x), 2)
    n2 <- n1
  
    # Pooled standard deviation assuming equal sample size and variances
    sd <- sqrt(((n1-1) * var(x) + (n2-1) * var(x)) / (n1 + n2 - 2))
  
    # Difference in means
    diff_means <- d*sd
  
    # Return MDE
    return(diff_means / mean(x))
}


# Cohen's d effect size for fixed sample size
n1 <- 10000
pwr.results <- pwr.t.test(
    n = n1,
    sig.level = 0.05,
    power = 0.8,
    alternative = 'two.sided'
)

# MDE
d <- pwr.results$d
mde.t.test(data_control, d)

#### Test for Two Independent Proportions
_Binomial Distributions_

In [29]:
# Minimum detectable effect (MDE) estimation for equal sample sizes
library(pwr)

# Minimum detectable effect (MDE) for proportions with given effect size
mde.2p.test <- function(x, h){
    # Estimates minimum detectable effect for two sample binomial test
    # given Cohen's h
    #
    # Args:
    # - x: Either vector with data for control group or control group mean
    # - h: Cohen's h effect size measure
    #
    # Returns:
    # - mde: Minimum detectable effect
  
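    # Proportions
    p1 <- mean(x)
    # Invert h = 2*asin(sqrt(p2)) - 2*asin(sqrt(p1)) for p2
    phi2 <- h + 2*asin(sqrt(p1))
    p2 <- sin(phi2 / 2)^2
  
    # Return MDE as relative change, consistent with p2 = p1 * (1 + mde)
    return(p2/p1 - 1)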
}


# Cohen's h effect size for fixed sample size
n1 <- 100000
pwr.results <- pwr.2p.test(
    n = n1,
    sig.level = 0.05,
    power = 0.8,
    alternative = 'two.sided'
)

# MDE
h <- pwr.results$h
mde.2p.test(data_control, h)
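
A quick round trip helps confirm that the conversion from h back to a relative MDE is consistent with the definition of Cohen's h. The sketch assumes a hypothetical baseline rate of 10%, takes the variation to be the larger proportion, and uses the normal approximation for h instead of `pwr.2p.test`.

```r
p1 <- 0.10       # baseline conversion rate (assumed)
n  <- 100000     # fixed observations per group

# Smallest detectable h for alpha = .05 and power = .80 (normal approximation)
h <- (qnorm(1 - 0.05 / 2) + qnorm(0.80)) * sqrt(2 / n)

# Invert h = 2*asin(sqrt(p2)) - 2*asin(sqrt(p1)) for p2
p2 <- sin(asin(sqrt(p1)) + h / 2)^2

# Relative MDE, i.e. p2 = p1 * (1 + mde)
mde <- p2 / p1 - 1

# Round trip: recomputing h from p1 and p2 should reproduce the input
h_back <- 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))

mde
all.equal(h, h_back)
```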