# A/B Testing

Consider 2 situations:
1. You are a data scientist at an advertising firm, hunting for clicks.  You have built a model that you think will **significantly improve the click-thru rate** over the current advertising strategy.  ***How do you know?***
1. You are a data scientist for a non-profit, soliciting charitable donations via email.  You believe that you have a better version of the email message that will **generate more donations**.  ***How do you know?***

Enter, **A/B Testing**

### What is A/B Testing?
- Randomized, controlled statistical experiment
- 2 different "treatments" on a population: "A" and "B"
- Want to detect a difference between the 2
  - e.g. does new ad B improve engagement over old ad A?
  - e.g. is this drug effective in treating a disease?
- Use **Hypothesis Testing** techniques to look for significant differences

### Things to Remember - Experimental Design
- A and B group members should be randomly selected
- A and B groups should be representative of the general population
- Ideally, only the treatment (A vs B) changes between the groups
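
A minimal sketch of random assignment, assuming the population is just a list of user ids. The helper name `assign_groups` and the `seed` parameter are hypothetical, chosen for illustration; shuffling before splitting is one simple way to make group membership independent of any ordering in the data.

```python
import random

def assign_groups(user_ids, seed=None):
    """Shuffle the ids, then split them into two (near) equal groups A and B."""
    rng = random.Random(seed)      # local generator so the split is reproducible
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical population of 5000 users, identified by integers 0..4999
group_a, group_b = assign_groups(range(5000), seed=42)
print(len(group_a), len(group_b))
```

Every user lands in exactly one group, and the treatment each user receives does not depend on anything but the shuffle.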

## Problem 1: Optimizing Ad Clicks  
You think your fancy new model is so good? Let's just see!

### Planning the Experiment
- Randomly select treatment groups A and B
  - Question to think about: How many in each group?

#### Defining some variables
- $N_A$, $N_B$: Number in each treatment group
- $n_A$, $n_B$: Number that clicked in each group
- $p$: True population click-rate
- $\hat{p}$: Overall (pooled) sample proportion that clicked
- $p_A$, $p_B$: True click-rate for each group
- $\hat{p}_A$, $\hat{p}_B$: Sample proportion that clicked in each group

#### Hypotheses
- $H_0\text{:}\quad p_A = p_B = p$
- $H_A\text{:}\quad p_A \ne p_B$

### Which did better, A or B? 
- If $H_0$ were true:
  - $\hat{p} = \frac{n_A + n_B}{N_A + N_B}$
  - $\hat{p}_A \sim \mathcal{N}(p, p(1-p)/N_A)\quad$ and $\quad\hat{p}_B \sim \mathcal{N}(p, p(1-p)/N_B)$
    - **Note**: Here we've used $\sigma^2 = p(1-p)$ for a **Bernoulli** Random Variable
    - Thus, CLT yields the above, and...
  - $\hat{p}_A - \hat{p}_B \sim \mathcal{N}\left(0, p(1-p)\times\left[{\frac{1}{N_A} + \frac{1}{N_B}}\right]\right)$
    - **Note**: Here we use $\text{var}(aX+bY) = a^2\text{var}(X) + b^2\text{var}(Y)$ for independent $X$ and $Y$ $\rightarrow \text{var}(\hat{p}_A - \hat{p}_B) = p(1-p)\times\left[{\frac{1}{N_A} + \frac{1}{N_B}}\right]$
  - $\hat{p}$ is our **Maximum Likelihood Estimate** for $p$
- So let's do a **Z-test** on $\hat{p}_A-\hat{p}_B$
  - $Z = \frac{\hat{p}_A - \hat{p}_B - 0}{\sqrt{\hat{p}(1-\hat{p})\left[\frac{1}{N_A}+\frac{1}{N_B}\right]}}$

#### Let's Plug in some Numbers!
| Group | Sample Size | Ads Clicked |
|-------|-------------|-------------|
| A     |  2500       | 76          |
| B     |  2500       | 94          |

$Z = \frac{\frac{76}{2500} - \frac{94}{2500}}{\sqrt{\frac{170}{5000}\times\frac{4830}{5000}\times\left(\frac{1}{2500} + \frac{1}{2500}\right)}} = -1.405$

***So do we reject?***  Is there a difference between A and B??
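
To answer the question numerically, here is a small sketch of the pooled two-proportion Z-test using only the Python standard library. The function name `two_proportion_z_test` is made up for this example, and the two-sided p-value matches the $H_A\text{:}\ p_A \ne p_B$ stated above.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (Z, two-sided p-value) for H0: p_A = p_B, using the pooled proportion."""
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)           # p-hat under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))           # both tails
    return z, p_value

z, p_value = two_proportion_z_test(76, 2500, 94, 2500)
print(f"Z = {z:.3f}, two-sided p-value = {p_value:.3f}")
```

The p-value comes out at roughly 0.16, well above a 5% threshold, so at that significance level we would not reject $H_0$.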

## Problem 2: Gettin' that Money!  
So your ad model failed, what about your donation-scrounging ability?

### Planning the Experiment
- Randomly select treatment groups A and B
  - Question to think about: How many in each group?

#### Defining some variables
- $N_A$, $N_B$: Number in each treatment group
- $\mu$: True population average contribution
- $\hat{\bar{x}}$: Observed population average contribution
- $\sigma^2$: True population contribution variance
- $\hat{\sigma}^2$: Observed population contribution variance
- $\mu_A$, $\mu_B$: True population average for each group
- $\hat{\bar{x}}_A$, $\hat{\bar{x}}_B$: Observed average contribution for each group
- $\sigma_A^2$, $\sigma_B^2$: True within-group contribution variance
- $\hat{\sigma}_A^2$, $\hat{\sigma}_B^2$: Observed within-group contribution variance

#### Hypotheses
- $H_0\text{:}\quad \mu_A = \mu_B = \mu$
- $H_A\text{:}\quad \mu_B \gt \mu_A$

### Which did better, A or B? 
- This time, the result (contribution) is presumed $\mathcal{N}(\mu, \sigma^2)$
- And we don't know the population variance...
- So we'll have to estimate it from the sample $\rightarrow$ Use a **t-distribution/test**
  - In essence, we can use the same approach as the **Z-test** with some minor changes:
    - $Z = \frac{\hat{p}_A - \hat{p}_B - 0}{\sqrt{\hat{p}(1-\hat{p})\left[\frac{1}{N_A}+\frac{1}{N_B}\right]}}\quad\rightarrow\quad t = \frac{\hat{\bar{x}}_B - \hat{\bar{x}}_A - 0}{\hat{s}\sqrt{\left(\frac{1}{N_A}+\frac{1}{N_B}\right)}}$ (ordered B minus A so that a large positive $t$ supports $H_A$)
    - where $\hat{s}^2$ is a pooled estimate of the **sample variance** of contributions, and $\hat{s}$ is its square root:
      - $\hat{s}^2 = \frac{(N_A-1)\hat{\sigma}_A^2+(N_B-1)\hat{\sigma}_B^2}{N_A+N_B-2}$
- The process is exactly the same now!  Namely:
  1. We have a test statistic ($t$ instead of $Z$)
  1. We look up the p-value against a **t-table** (with $N_A+N_B-2$ degrees of freedom)
  1. We report whether this p-value is significant to our desired degree.

#### Let's Plug in some Numbers!
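| Group | Sample Size | Average | Std. Dev. |
|-------|-------------|---------|-----------|
| A     |  1000       | 100     | 10        |
| B     |  1000       | 105     | 14        |

$\rightarrow t = \frac{105 - 100}{\sqrt{\frac{999\times 10^2 + 999\times 14^2}{1998}}\sqrt{\frac{1}{1000}+\frac{1}{1000}}} \approx 9.19$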

***So do we reject?***  Is there a difference between A and B??
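
The pooled t statistic can be checked with a short standard-library sketch. It reads the table's spread values (10 and 14) as standard deviations, which is the reading that reproduces $t \approx 9.19$. The function name `pooled_t_statistic` is hypothetical, and the p-value uses a normal tail in place of a t-table, an approximation that is only reasonable because the degrees of freedom are so large here.

```python
from math import erfc, sqrt

def pooled_t_statistic(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (t, degrees of freedom) for the equal-variance two-sample t-test, B minus A."""
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / dof
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_b - mean_a) / se, dof

t_stat, dof = pooled_t_statistic(100, 10, 1000, 105, 14, 1000)

# Assumption: with ~2000 degrees of freedom the t distribution is nearly normal,
# so the upper normal tail stands in for the t-table lookup.
# erfc keeps precision for tails this small, where 1 - cdf would round to 0.
p_one_sided = 0.5 * erfc(t_stat / sqrt(2))
print(f"t = {t_stat:.2f}, dof = {dof}, approx one-sided p = {p_one_sided:.1e}")
```

A statistic this far into the tail gives a vanishingly small p-value, so $H_0$ is rejected in favor of $\mu_B \gt \mu_A$ at any conventional significance level.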

## Steps for Hypothesis Testing
1. Choose **appropriate test statistic** (z, t, F, $\chi^2$, etc) for the situation (usually, just look it up!)
2. Formulate **null** and **alternative hypotheses**
3. Compute value of **test statistic under null hypothesis**
4. How extreme is the test statistic?  $\rightarrow$ **p-value**
   - Probability of results at least this extreme occurring by chance if $H_0$ is true
5. **Reject/Don't Reject** by comparing p-value to desired significance

## Power and Sample Size Calculations

**Consider:** We have light bulbs and we believe that the mean lifetime before they die is roughly $\mu=100$ days with a known $\sigma=16$ days.  You want to test for deviations from this belief at **5% significance**.

**Question 1:** Given that a signal exists in our data (i.e., the true mean differs from 100), what is the probability that we detect it?

**Question 2:** How many samples do we need so that, if a signal is there, we are 95% sure we'll find it?

### Power
$$
\text{Power} = \text{P}(\text{Reject }H_0\; | \;H_0 \text{ is false})
$$
- Depends on the size of effect we're trying to detect (small effect needs more samples)

#### Question 1: What is the power?
- Let's say you want to be able to **detect differences in $\mu$ of 8 days or more**.
- What is the power of your test?
- $H_0$: $\mu = 100$, $\sigma = 16$
- $H_A$: $\mu > 100$
- The smallest $\mu$ under $H_A$ that we want to detect is $\mu = 108$.
- This is a **one-tail test**, so $z^* = 1.645$ instead of 1.96.
- If the observed $z \ge z^* = 1.645$ under $H_0$, then we reject $H_0$.
- To calculate power, what fraction of a normal curve centered at $\mu=108$ lies in the rejection region $z \ge z^*$?
  - Take a sample of $n = 16$ bulbs, the size consistent with the 106.58 cutoff, so the standard error is $16/\sqrt{16} = 4$
  - $z^* = 1.645 \; \rightarrow \; \bar{x} = 100 + 1.645 \times 4 = 106.58$
  - $\text{P}(\bar{x} \ge 106.58 \; | \; \mu = 108) = \text{P}\left(Z \ge \frac{106.58 - 108}{16/\sqrt{16}}\right) = \text{P}(Z \ge -0.355) = \mathbf{0.6387}$
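
The power calculation above can be sketched with the standard library's `NormalDist`. The sample size $n = 16$ is an assumption here: it is the value that reproduces the 106.58 cutoff (since $100 + 1.645 \times 16/\sqrt{16} = 106.58$). The function name `one_sided_power` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_power(mu0, mu_alt, sigma, n, alpha=0.05):
    """Power of the upper-tail z-test of H0: mu = mu0 when the true mean is mu_alt."""
    se = sigma / sqrt(n)                                   # standard error of the sample mean
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se    # start of the rejection region
    # Probability that the sample mean lands beyond the cutoff under the alternative
    return 1 - NormalDist(mu_alt, se).cdf(cutoff)

# Assumption: n = 16, the sample size implied by the 106.58 cutoff
power = one_sided_power(mu0=100, mu_alt=108, sigma=16, n=16)
print(f"power = {power:.3f}")
```

The result agrees with the hand calculation (about 0.64): with only 16 bulbs, a true shift of 8 days would be missed more than a third of the time.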
  
#### Question 2: How many samples?
- How many samples are needed for **95% power** on **differences of 8** or more?
- Start of rejection region: $R = 100 + 1.645\times16/\sqrt{n}$
- 5th percentile of the sample mean under $H_A$ ($\mu = 108$): $A = 108 - 1.645\times16/\sqrt{n}$
- We need: $R \le A \; \rightarrow \; 32\times1.645/\sqrt{n} \le 8 \; \rightarrow \; \sqrt{n} \ge 6.58 \; \rightarrow \; n \ge 44$
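
The same inequality can be solved for any effect size and power. This is a minimal sketch for the one-sided, known-$\sigma$ case only; the function name `required_sample_size` and its parameter names are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(delta, sigma, alpha=0.05, power=0.95):
    """Smallest n for a one-sided z-test to detect a mean shift of `delta` with the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # cutoff under H0 (1.645 at 5%)
    z_power = NormalDist().inv_cdf(power)       # distance below the alternative mean
    # R <= A  <=>  (z_alpha + z_power) * sigma / sqrt(n) <= delta
    return ceil(((z_alpha + z_power) * sigma / delta) ** 2)

n = required_sample_size(delta=8, sigma=16)
print(n)  # 44
```

Because $n$ scales with $(\sigma/\delta)^2$, halving the detectable difference roughly quadruples the required sample size.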