#### Introduction
An A/B Test is a scientific method of choosing between two options (Option A and Option B). Some examples of A/B tests include:

- What number of sale items on a website makes customers most likely to purchase something: 25 or 50?
- What color button are customers more likely to click on: blue or green?
- Do people spend more time on a website if the background is green or orange?

When the outcome of interest (e.g., whether or not a customer makes a purchase) is categorical, an A/B test is typically analyzed with a Chi-Square hypothesis test. To determine the sample size necessary for this kind of test, a sample size calculator requires three numbers:

- Baseline conversion rate
- Minimum detectable effect (also called the minimum desired lift)
- Statistical significance threshold


#### Baseline Conversion Rate
A/B tests usually compare an option that we’re currently using to a new option that we suspect might be better. In order to compare the two options, we need a metric. Often, our metric will be the percent of users who take a certain action after interacting with one of our options. For instance:

- The percent of customers who buy a t-shirt after visiting one of two versions of a website
- The percent of users who click on one of two versions of an ad

In the t-shirt example above, the baseline conversion rate is our estimate for the percent of people who will buy a shirt under the current website design.

We can generally calculate a baseline by looking at historical data for the option that we’re currently using. For example, suppose that 2000 people visited a website over the past three months and 320 of those visitors purchased a shirt. We could estimate the baseline rate as follows:

`baseline = 320/2000*100`

`print(baseline) #output: 16.0`

This number may be written as a proportion (e.g., 0.16) or a percent (e.g., 16%).

#### Minimum Detectable Effect
Suppose we’re running an A/B Test to find out if a new website layout drives more subscriptions than the current one. If the new layout is only a tiny percent better, would we really care?

Detecting very small differences requires a very large sample size. To choose a sample size, we therefore need to decide on the smallest difference that we actually care to detect. This “smallest difference” is our desired minimum detectable effect, sometimes also called the desired lift.

Minimum detectable effect or lift is generally expressed as a percent of the baseline conversion rate. Suppose that 6% of customers currently subscribe to our website (that’s our baseline conversion rate). Changing a website layout is hard, so we only think that it’s worth doing if at least 8% of our customers would subscribe with the new layout. To calculate this as a percentage of our baseline:

`baseline = 6`

`new = 8`

`min_detectable_effect = (new - baseline) / baseline * 100`

`print(min_detectable_effect) #output: 33.33 (rounded)`

Our minimum detectable effect/desired lift is about 33%.

#### Significance Threshold
When we run an A/B test, we usually want to use the results of the test to make a decision: use version A or B? In order to make that decision, many data scientists use a pre-determined significance threshold for their hypothesis test. For example, if we set a significance threshold of 0.05 (a commonly chosen value), we’ll “reject the null hypothesis” and conclude that the conversion rate for version B is significantly different from version A if we get a p-value less than 0.05.

It turns out that this significance threshold is the false positive rate for the test: the probability of finding a significant difference when there really is none. As a business owner, we don’t want to make this kind of mistake, because then we might invest money in a change that doesn’t actually make a difference!

Unfortunately, there’s a trade-off between false positives and false negatives. A false negative occurs when there is a difference between version A and B, but the test doesn’t detect it. This is a potential missed opportunity for a business owner!

Most A/B test sample size calculators estimate the sample size needed for a 20% false negative rate (that is, 80% power), while the data scientist chooses the false positive rate they are comfortable with. The lower the false positive rate, the larger the sample size will need to be!
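To make the relationship between the false positive rate and sample size concrete, here is a minimal sketch based on the textbook normal-approximation formula for comparing two proportions. This formula is an assumption made for illustration: a given sample size calculator may use a different method, so its numbers can differ somewhat. The function name `sample_size_per_group` and its parameters are hypothetical, and the default `power=0.80` corresponds to the 20% false negative rate mentioned above.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(baseline, lift, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided test of two proportions.

    baseline: current conversion rate as a proportion (e.g. 0.06)
    lift:     minimum detectable effect relative to the baseline (e.g. 1/3 for 6% -> 8%)
    alpha:    significance threshold (false positive rate)
    power:    1 - false negative rate
    """
    p1 = baseline
    p2 = baseline * (1 + lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Same baseline (6%) and lift (6% -> 8%), three different significance thresholds
for alpha in (0.10, 0.05, 0.01):
    n = sample_size_per_group(baseline=0.06, lift=1 / 3, alpha=alpha)
    print(f"alpha={alpha:.2f}: about {n} visitors per group")
```

Running the loop shows the required sample size growing as the threshold gets stricter, which is the trade-off described above.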

#### Don't Interfere With Your Tests
Suppose that a Product Manager is running an A/B Test for a redesign of a landing page. Before starting the test, she used a sample size calculator to determine the sample size: 2,200 total website visitors. After reaching 2,200 visits, she ran a Chi-Square Test. The new website design performed slightly better, but the results were not statistically significant.

It might be tempting to run the test for another week to see if the difference becomes significant, but that would be a big mistake! By choosing to extend the A/B test past the original sample size, the product manager would introduce personal bias into the results of the test; she would be more likely to get the results she wants, regardless of whether those results reflect reality.

Here are two important rules for making sure that A/B tests remain unbiased:

- Don’t keep running the test past the predetermined sample size until “significant” results are found.
- Don’t stop a test before reaching the predetermined sample size just because your results reach significance early (unless there are ethical reasons that require you to stop, as in a prescription drug trial).

Test results are sensitive to changes in sample size, which is why it is important to calculate the sample size beforehand.

<img src = 'ab.png'>

Inspect the graph above. It shows an A/B Test where the baseline was 5%, and we want to see a lift of 50% (i.e., we want our second option to have at least a 7.5% conversion rate). A sample size calculator tells us that we need 210 observations. The chart shows the cumulative conversion rate after each new observation. When we reach our desired sample size of 210, our cumulative conversion rate is slightly higher than 5%, but the difference is not statistically significant (indicated by red). If we extend the experiment to 320 samples, the difference becomes significant (indicated by green), and we might conclude that our results are significant if we stopped the experiment at this point. However, we can see that this is a temporary fluctuation: after this brief moment of “significance,” the conversion rate decreases and our results become insignificant again. By arbitrarily extending the study until it reaches significance, we fool ourselves!

Try this: Flip a coin five times. Which side came up more frequently? Perhaps you now suspect that the coin is biased. Keep flipping the coin until that side shows up even more frequently. By changing your sample size in the middle of an experiment, you can easily convince yourself that a fair coin is biased.
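The coin-flip thought experiment can also be run as a simulation. The sketch below is illustrative only: it simulates an "A/A" test in which both versions have the same true conversion rate, so every "significant" result is a false positive. It then compares testing once at the planned sample size with peeking repeatedly and stopping at the first significant result. All names and numbers (`false_positive_rates`, `planned_n`, `step`, and so on) are hypothetical choices, and the p-value comes from a hand-written uncorrected Chi-Square test for a 2x2 table rather than from a library call.

```python
import random
from math import erfc, sqrt


def chi_square_pvalue(conv_a, conv_b, n):
    """Uncorrected chi-square p-value for two groups of n visitors each (1 dof)."""
    conversions = conv_a + conv_b
    if conversions == 0 or conversions == 2 * n:
        return 1.0  # every visitor had the same outcome, so there is nothing to test
    expected_yes = conversions / 2  # expected conversions per group if A and B are equal
    expected_no = n - expected_yes
    stat = 0.0
    for conv in (conv_a, conv_b):
        stat += (conv - expected_yes) ** 2 / expected_yes
        stat += ((n - conv) - expected_no) ** 2 / expected_no
    return erfc(sqrt(stat / 2))  # chi-square survival function for 1 dof


def false_positive_rates(planned_n=200, max_n=400, step=20, rate=0.5,
                         alpha=0.05, n_sims=2000, seed=0):
    """Return (rate when testing once at planned_n, rate when peeking every `step`)."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        hit_at_plan = False
        hit_at_any_peek = False
        for n in range(1, max_n + 1):
            # Both versions share the same true rate: there is no real difference to find
            conv_a += int(rng.random() < rate)
            conv_b += int(rng.random() < rate)
            if n % step == 0 or n == planned_n:
                significant = chi_square_pvalue(conv_a, conv_b, n) < alpha
                hit_at_any_peek = hit_at_any_peek or significant
                if n == planned_n:
                    hit_at_plan = significant
        fixed_hits += hit_at_plan
        peeking_hits += hit_at_any_peek
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = false_positive_rates()
print(f"Test once at the planned sample size: {fixed_rate:.3f}")
print(f"Stop at the first significant peek:   {peeking_rate:.3f}")
```

Because the planned sample size is itself one of the peeks, the peeking strategy can never produce fewer false positives than the single planned test; in practice it produces noticeably more.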

As a final exercise, let’s put everything together into a single calculation. Suppose that you are running a business and want to see if a new advertisement will drive more clicks on your website. Currently, about 10% of people who see your ad are clicking on it. You want to run the new ad if at least 14% of people will click the new ad. When you run your Chi-Square test after collecting your data, you plan to use a significance threshold of 0.05, so that your chances of a false positive are relatively low. Try the following:

- Based on the description above, identify the baseline conversion rate and significance threshold
- Based on the description above, calculate the minimum detectable effect (hint: it’s not 4%!)
- Plug in your baseline, minimum detectable effect, and significance threshold to the provided calculator
- Calculate the total sample size needed for this experiment (note: this calculator assumes that exactly half of the sample will see each version of the ad)

In [1]:
MDE = ((14 - 10)/10) * 100
print("Minimum Detectable Effect: ", MDE)

# significance threshold, as a percent:
sig_threshold = 5
print("significance threshold: ", sig_threshold)

# total sample size reported by the calculator:
samp_size = 2060
print("sample size: ", samp_size)

Minimum Detectable Effect:  40.0
significance threshold:  5
sample size:  2060


### Sample Size Determination With Simulation

We will use simulation to understand some of the considerations for setting up an A/B test: sample size, power, and the false positive rate. But before we think about designing an A/B test, let’s first remind ourselves how to conduct the test itself, after planning and collecting data.

Suppose that a media company currently has a weekly newsletter email and wants to see if using the recipient’s first name in the email subject will cause more people to open the email (e.g., “Bob! Check out this week’s updates” vs. “Check out this week’s updates”). They randomly assign a group of 100 recipients to receive one of the two email subjects and record whether or not each recipient opened the email. The first few rows of their data might look something like this:


<img src = 'image4.png'>


In order to run a hypothesis test to decide whether there is a significant difference in the open rate for these emails, we would run a Chi-Square test. To accomplish this, we would first create a contingency table for the Email and Opened variables in the above table:

`X = pd.crosstab(data.Email, data.Opened)`

`print(X)`

Output: <img src = 'image5.png'>


We would then use this table to run a Chi-Square test and get a p-value:

`chi2, pval, dof, expected = chi2_contingency(X)`

`print(pval) #Output: 0.2186`

Based on the p-value, we would make a decision about which email to use; a small p-value would provide evidence that the open rates are significantly different for the two groups, while a large p-value would suggest no significant difference.
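To see what the test computes, here is a standard-library sketch that tabulates a small dataset and calculates the Chi-Square statistic and p-value by hand for a 2x2 table. The records are hypothetical (they are not the data shown above), and `chi_square_2x2` is a name invented for this example. The `correction` flag applies Yates' continuity correction, which, as far as we know, is what `chi2_contingency` applies by default to 2x2 tables; passing `correction=False` to `chi2_contingency` should correspond to the uncorrected version.

```python
from collections import Counter
from math import erfc, sqrt

# Hypothetical records (not from the source data): (email version, opened?)
records = ([("control", "yes")] * 22 + [("control", "no")] * 28
           + [("name", "yes")] * 29 + [("name", "no")] * 21)

# Tabulate counts per (group, outcome) pair, similar to what pd.crosstab produces
counts = Counter(records)
groups = ["control", "name"]
outcomes = ["no", "yes"]
row_totals = {g: sum(counts[(g, o)] for o in outcomes) for g in groups}
col_totals = {o: sum(counts[(g, o)] for g in groups) for o in outcomes}
n = sum(row_totals.values())


def chi_square_2x2(correction):
    """Chi-square statistic and p-value for the 2x2 table above (1 degree of freedom)."""
    stat = 0.0
    for g in groups:
        for o in outcomes:
            # Count we would expect in this cell if open rates were identical
            expected = row_totals[g] * col_totals[o] / n
            diff = abs(counts[(g, o)] - expected)
            if correction:
                diff = max(0.0, diff - 0.5)  # Yates' continuity correction
            stat += diff ** 2 / expected
    return stat, erfc(sqrt(stat / 2))  # erfc gives the 1-dof chi-square tail probability


for correction in (False, True):
    stat, pval = chi_square_2x2(correction)
    print(f"correction={correction}: chi2={stat:.3f}, p-value={pval:.4f}")
```

The corrected test is more conservative, so it reports a larger p-value on the same table.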

In [4]:
import pandas as pd
from scipy.stats import chi2_contingency

data = pd.read_csv('/Users/elorm/Documents/Repos/Datasets/ab_data.csv')
print(data.head())

# calculate contingency table here
ab_contingency = pd.crosstab(data.Web_Version, data.Purchased)

# run your chi square test here
chi2, pval, dof, expected = chi2_contingency(ab_contingency)
print(pval)


  Web_Version Purchased
0           A        no
1           A        no
2           A       yes
3           A       yes
4           A       yes
0.10096676200907678


#### Simulating Data for a Chi-Square test
In the last exercise, we used some data from an A/B test to run a Chi-Square test. In the next few exercises, we’ll build up a simulation to understand the considerations that go into choosing a sample size for that test.

Again consider the A/B test example from the previous exercise, comparing email subjects with and without the recipient’s first name. Suppose we know that recipients have a 50% chance of opening the control email and a 65% chance of opening the name email (30% lift!).

Here we use lift to refer to the inherent difference in the distributions of our two groups of data. In the A/B Testing: Sample Size Calculators lesson, we learned that minimum detectable effect is the smallest size of the difference between the two groups that we want our test to be able to detect. If we set up our experiment with a minimum detectable effect of at least 20%, our statistical test should detect a difference with a “lift” or “effect” of 20% or greater. In this lesson we are going to simulate data that has a lift of 30% to demonstrate how the inherent lift impacts the power of our statistical test.

We can use the aforementioned probabilities to simulate a dataset of 100 email recipients as follows:

<img src = 'img4.png' >

This gives us two simulated samples, of 50 recipients each, who hypothetically saw the name or control email subject. Each one looks something like ['yes' 'no' 'no' 'no' 'yes' 'yes' ...], where 'yes' corresponds to an opened email.

Next, we can assemble these arrays into a data frame that looks a lot like the one we saw in exercise 1:

<img src = 'img5.png'>

<img src = 'img6.png'>

Because of how we created this data frame, all of the “control” observations will be listed first, followed by all of the “name” observations.

In [3]:
import numpy as np
import pandas as pd

sample_size = 4
lift = .3
control_rate = .5
name_rate = (1 + lift) * control_rate

sample_control = np.random.choice(['yes', 'no'], size=int(sample_size/2), p=[control_rate,1-control_rate])
sample_name = np.random.choice(['yes', 'no'], size=int(sample_size/2), p=[name_rate, 1-name_rate])

group = ['control']*int(sample_size/2) + ['name']*int(sample_size/2)
outcome = list(sample_control) + list(sample_name)
sim_data = {"Email": group, "Opened": outcome}
sim_data = pd.DataFrame(sim_data)
print(sim_data)

     Email Opened
0  control    yes
1  control     no
2     name     no
3     name    yes


#### Determining Significance
Now that we’ve practiced simulating data for an A/B test, let’s actually run a Chi-Square test for each simulated dataset and consider the decision we would make based on the outcome.

If we were really running this test, we would want to use the data to make a decision about whether to use the control (old) or name (new) email subject. To make that decision, we can use a significance threshold. For example, if we’re using a significance threshold of 0.05, we’ll “reject the null hypothesis” for any p-value less than 0.05. In this context, rejecting the null would mean that we conclude that there is a significant difference between the open rates for the two email subjects and therefore we should switch to the email subject that uses the recipient’s first name.

We can use the following Python statement to record whether a particular p-value is significant or not, based on a threshold of 0.05:

<img src = 'img7.png'>

In [5]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# pre-set values
significance_threshold = 0.05
sample_size = 100
lift = .3
control_rate = .5
name_rate = (1 + lift) * control_rate

# simulate a dataset
sample_control = np.random.choice(['yes', 'no'], size=int(sample_size/2), p=[control_rate,1-control_rate])
sample_name = np.random.choice(['yes', 'no'], size=int(sample_size/2), p=[name_rate, 1-name_rate])

group = ['control']*int(sample_size/2) + ['name']*int(sample_size/2)
outcome = list(sample_control) + list(sample_name)
sim_data = {"Email": group, "Opened": outcome}
sim_data = pd.DataFrame(sim_data)

# run a chi-square test
ab_contingency = pd.crosstab(np.array(sim_data.Email), np.array(sim_data.Opened))
chi2, pval, dof, expected = chi2_contingency(ab_contingency, correction=False)
print("P Value:")
print(pval)

# determine significance here:
result = ('significant' if pval < significance_threshold else 'not significant')

print("Result:")
print(result)

P Value:
0.5385089712166174
Result:
not significant


#### Estimating Power
In the last exercise, we learned how to simulate a dataset for a Chi-Square test, run the test, and then output a result: ‘significant’ or ‘not significant’. In this exercise, we’ll repeat that process many times so that we can inspect the relative frequency of each outcome.

To do this, we’ll start by creating an empty list to store the results of our repeated experiments. Next, we’ll move all of our simulation code (to create a sample dataset, run a Chi-Square test, and determine a result) inside of a for-loop. In each iteration of the loop, we’ll append the outcome to our results list so that we can inspect it later.

The outline of the code looks something like this:

- Set the sample size and open probabilities
- Create an empty list named `results`
- Repeat 100 times in a for-loop:
  - Simulate a dataset
  - Run a Chi-Square test
  - Use the p-value to determine significance
  - Append the result ('significant' or 'not significant') to `results`
   

Finally, we can inspect results by calculating the proportion of simulated tests where the result was 'significant':

`results =  np.array(results)`

`print(np.sum(results == 'significant')/100)`
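The outline above can be sketched end to end as follows. This is one possible implementation and not the lesson's own code: it uses only the standard library, computes the uncorrected 2x2 Chi-Square p-value by hand, and runs 1000 repetitions by default instead of 100 to reduce noise. The function names `simulate_pvalue` and `estimate_power` and the fixed seed are assumptions made for this example.

```python
import random
from math import erfc, sqrt


def simulate_pvalue(rng, n_per_group, control_rate, name_rate):
    """Simulate one A/B test and return the uncorrected chi-square p-value (1 dof)."""
    # Number of opened emails in the control group and in the name group
    opened = [sum(rng.random() < rate for _ in range(n_per_group))
              for rate in (control_rate, name_rate)]
    total_opened = sum(opened)
    if total_opened in (0, 2 * n_per_group):
        return 1.0  # all outcomes identical, so no difference can be detected
    expected_yes = total_opened / 2  # expected opens per group under the null
    expected_no = n_per_group - expected_yes
    stat = sum((k - expected_yes) ** 2 / expected_yes
               + ((n_per_group - k) - expected_no) ** 2 / expected_no
               for k in opened)
    return erfc(sqrt(stat / 2))


def estimate_power(sample_size=100, control_rate=0.5, lift=0.3,
                   significance_threshold=0.05, n_sims=1000, seed=42):
    """Proportion of simulated tests that come out 'significant'."""
    rng = random.Random(seed)
    name_rate = (1 + lift) * control_rate
    results = []
    for _ in range(n_sims):
        pval = simulate_pvalue(rng, sample_size // 2, control_rate, name_rate)
        results.append('significant' if pval < significance_threshold
                       else 'not significant')
    return results.count('significant') / n_sims


for size in (100, 400, 1000):
    print(f"sample size {size}: estimated power {estimate_power(sample_size=size):.2f}")
```

Since the simulated data really does have a 30% lift, the proportion of 'significant' results is an estimate of the test's power, and it rises as the sample size increases.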