# A/B Testing - An Introductory Project on Experimentation

In this Jupyter notebook, we will learn how to conduct and analyze a real-world A/B test. We will use a dataset from Kaggle to carry out the experiment. 



We will work through the following A/B testing process:
1. `Hypothesis`
<br>
2. `Experiment Prerequisites`
<br>
3. `Experiment Design`
<br>
4. `Running an Experiment`
<br>
5. `Result to Decision`


**So, what is A/B testing to begin with?**

An **A/B test** is a randomized experiment that compares **two variants** differing in a single variable, such as a procedure, visual, or product feature, that might affect the user behavior being studied. Users are split into two groups: one receives the treatment and the other (the control) does not.  

In an A/B test, the goal is to **compare the performance between the two variations and determine whether the new treatment or control group performs better**, given a specific success metric.



#### Proposing A Potential Business Scenario
Let's propose a hypothetical business scenario! 

<blockquote>
You just started working as a Data Scientist at a software development company that specializes in helping athletes track their workouts. The marketing team has been working on a new ad campaign that they think will increase the number of users on the app. The team has let you know that the conversion rate has stayed steady at 9.5% over the past year, and they hope that the updated ad will raise it to 11%. In other words, they will consider the campaign a success if the rate increases by 1.5 percentage points. However, a member of the marketing team is concerned that the new ad may distract or confuse customers and ultimately lead them not to click the ad.
</blockquote> 
<blockquote>
Since the new ad will be finalized in the coming weeks and has not been tested on any users, the marketing team wants you to design and run an A/B test on a subset of users to determine whether or not the new ad campaign should be rolled out to all users. 
</blockquote>


## 1. `Hypothesis`
As with any experiment we need to start out with a robust hypothesis. 

In order to truly be able to interpret the results of an experiment, our first step will be to formulate a hypothesis. This will help us clearly define our experiment design and the interpretation of the results of our experiment.

A **hypothesis test** is a statistical procedure that uses sample data to determine whether there is enough evidence to draw a conclusion about a population. 

There are two types of hypothesis tests: a **one-tailed test** and a **two-tailed test**. In our hypothetical business case, we cannot be certain that the new ad campaign will increase conversions, since the outcome could actually be a decrease. We will therefore use a two-tailed test. 

<blockquote>
A two-tailed test checks for a difference in either direction, so it has two critical regions, one in each tail of the distribution. Simply put, the experiment will test whether there is any difference in conversions between the original and new ad campaigns.
</blockquote>
    

<center>$H_0: p_0 = p_1$</center>
<br>
<center>$H_a: p_0 \neq p_1$</center>
   

Our **null hypothesis, $H_0$**, states that the conversion rates of the original and new ad campaigns are the same. Given the concern raised by a member of our team, the **alternative hypothesis, $H_a$**, is that the conversion rates of the original and new ad campaigns are not the same. 
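
To make the difference between the two kinds of test concrete, the sketch below converts a test statistic into a one-tailed and a two-tailed p-value. The value `z = 2.0` is made up for illustration and is not a result from this experiment, and the standard normal distribution is assumed as the reference distribution.

```python
from statistics import NormalDist

# Hypothetical z-statistic, chosen only for illustration.
z = 2.0

std_normal = NormalDist()  # mean 0, standard deviation 1

# One-tailed: probability of a statistic at least this large in one direction.
p_one_tailed = 1 - std_normal.cdf(abs(z))

# Two-tailed: extreme values in either direction count as evidence against H_0.
p_two_tailed = 2 * p_one_tailed

print(f"one-tailed p-value: {p_one_tailed:.4f}")
print(f"two-tailed p-value: {p_two_tailed:.4f}")
```

For the same statistic, the two-tailed p-value is twice the one-tailed one. That is the price of being able to detect a decrease as well as an increase.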

## 2. `Experiment Prerequisites`

**Objective and Key Metrics** <br>
- The marketing team is interested in whether **a user clicks on the ad or not**. Therefore, a good metric for this experiment is the **click-through rate (CTR)**. 
- The CTR is the number of clicks on the ad divided by the number of times the ad was shown (impressions), expressed as a percentage.
<br>
<center> $CTR = \frac{Clicks}{Impressions} \times 100$ </center>
<br>
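
As a quick illustration of the CTR formula, here is a minimal sketch. The function name `click_through_rate` and the counts passed to it are hypothetical and do not come from the dataset used later.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return the CTR as a percentage: clicks / impressions * 100."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions * 100


# Hypothetical counts, for illustration only.
example_ctr = click_through_rate(clicks=95, impressions=1000)
print(f"CTR: {example_ctr:.1f}%")
```

Note that the analysis later in this notebook keeps the CTR as a proportion (without the factor of 100), which is the form the statistical functions expect.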

**Variants**
- A/B tests can contain more than two variants. However, in our simple example we will keep it to one control group and one treatment group. Below are our variants.
<br>
<center> <br>Control: Original Ad Campaign<br>Treatment: New Ad Campaign</center> 


**Randomization Units**
- For this example we will randomize the experiment by using the users. Users will be randomly assigned to the control or treatment group. This means that **users will be randomly shown either the existing ad campaign or the new ad campaign**. <blockquote> This is with the assumption that there are enough users in our test.</blockquote> 
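
One common way to randomize at the user level is to hash each user ID into a bucket, so that the same user always sees the same ad. The sketch below is only an illustration of that idea: the function `assign_group`, the experiment name `new_ad_test`, and the user IDs are all hypothetical, and this is not how the Kaggle dataset used later was generated.

```python
import hashlib


def assign_group(user_id: str, experiment: str = "new_ad_test") -> str:
    """Deterministically map a user to 'control' or 'treatment' (50/50 split)."""
    # Hashing the experiment name together with the user id gives each user
    # a stable, effectively random bucket for this experiment.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in 0..99
    return "control" if bucket < 50 else "treatment"


# Hypothetical user ids, for illustration only.
groups = [assign_group(f"user_{i}") for i in range(10_000)]
control_share = groups.count("control") / len(groups)
print(f"control share: {control_share:.3f}")
```

Because the assignment depends only on the user ID and the experiment name, a returning user stays in the same group for the whole test.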

## 3. `Experiment Design`

**What users should be targeted?**
- What users should this experiment target? As with any website, there is a funnel that users follow. Given that we are working with an ad campaign and the marketing team wants us to evaluate the performance of the new ad campaign, we are interested in all users.

<br>**What is the Practical Significance Boundary?**
- Practical significance refers to the real-world importance of a statistical result. **How big a change from the new ad campaign really matters to the team? How much matters from a business perspective?** <blockquote> As mentioned above, the marketing team has agreed that the **click-through rate needs to increase by 1.5 percentage points (from 9.5% to 11%) for the new ad to be deemed successful.**</blockquote> 
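
It helps to be explicit about whether a lift is absolute or relative. The short sketch below uses the two rates from the business scenario (9.5% and 11%). The variable names are illustrative only.

```python
baseline_rate = 0.095  # current rate stated in the scenario
target_rate = 0.11     # rate the marketing team hopes to reach

absolute_lift = target_rate - baseline_rate    # difference in proportions
relative_lift = absolute_lift / baseline_rate  # change relative to the baseline

print(f"absolute lift: {absolute_lift * 100:.1f} percentage points")
print(f"relative lift: {relative_lift:.1%}")
```

The same target is a 1.5 percentage point absolute lift but roughly a 16% relative lift, so it is worth confirming with the team which one they mean.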


## 4. `Running an Experiment` 

What are the needed values to run a hypothesis test?

#### Power
- What is power? Power is the probability of rejecting the null hypothesis when it is false. Simply put, it is the probability that the test detects a difference between the two groups when a true difference exists. 
<blockquote>The higher the power, the less likely the test is to miss a real effect. However, increasing the power requires a larger sample size for the experiment.</blockquote>
<br> <center>Power = Pr(reject $H_0 | H_a$ is true) = 1 - Pr(fail to reject $H_0 | H_0$ is false)</center>
<blockquote>An industry standard for power in A/B testing is 80%, which is what we will use for our experiment. Usually you would agree on the power value with a stakeholder. </blockquote> 
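
The trade-off between power and sample size can be sketched numerically. The helper `approx_power` below is a hypothetical function, not part of any library. It assumes a two-sided test, equal group sizes, and a normal approximation with Cohen's h as the effect size, and it uses the 9.5% and 11% rates from the scenario.

```python
import math
from statistics import NormalDist


def approx_power(p1: float, p2: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-sample test of proportions."""
    std_normal = NormalDist()
    # Cohen's h: effect size on the arcsine-square-root scale.
    h = abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = h * math.sqrt(n_per_group / 2)
    # Probability that the statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


for n in (1000, 3000, 6000, 10000):
    print(n, round(approx_power(0.095, 0.11, n), 3))
```

Power rises with the number of users per group, which is why the target power has to be fixed before the sample size can be calculated.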

#### Significance Level Alpha 
- What is alpha? Alpha is the probability of rejecting the null hypothesis when it is actually true, that is, of declaring a difference that is due only to random chance (a false positive). 
<center> $\alpha = 0.05$ </center>
<blockquote>As with power, the industry standard of 5% is a common default, but one should always confirm it with a stakeholder.</blockquote>
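
One way to see what alpha means is to simulate many "A/A" tests in which both groups share the same true rate, so that every rejection is a false positive. The sketch below does this with a pooled two-proportion z-test. The true rate, group size, and number of simulated experiments are hypothetical values chosen to keep the run fast.

```python
import math
import random
from statistics import NormalDist

random.seed(0)
TRUE_RATE = 0.10      # both groups share this rate, so H_0 is true
N_PER_GROUP = 500     # hypothetical group size
N_EXPERIMENTS = 1000  # number of simulated A/A tests
ALPHA = 0.05

z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)


def simulate_clicks(n: int, rate: float) -> int:
    """Count clicks among n simulated impressions."""
    return sum(random.random() < rate for _ in range(n))


false_positives = 0
for _ in range(N_EXPERIMENTS):
    clicks_a = simulate_clicks(N_PER_GROUP, TRUE_RATE)
    clicks_b = simulate_clicks(N_PER_GROUP, TRUE_RATE)
    pooled = (clicks_a + clicks_b) / (2 * N_PER_GROUP)
    se = math.sqrt(pooled * (1 - pooled) * (2 / N_PER_GROUP))
    if se == 0:
        continue  # no variation at all, nothing to test
    z = (clicks_a - clicks_b) / N_PER_GROUP / se
    if abs(z) > z_crit:
        false_positives += 1

false_positive_rate = false_positives / N_EXPERIMENTS
print(f"share of A/A tests rejected: {false_positive_rate:.3f}")
```

The share of rejected A/A tests should come out close to alpha, since that is exactly the error rate the threshold is designed to allow.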


#### Sample Size 
- Why do we care about sample size? In an experiment, you usually **cannot test the entire population**, as it may be infeasible and require too many resources. On the other hand, a sample that is too small may not be representative of the entire population and will provide unreliable results. This is why an experiment uses a sample of the correct size, and its results are used to draw a conclusion about the entire population. The accuracy of your interpretation of the statistical analysis depends directly on the sample size that you used. **So, how do we go about calculating the size of the sample that we should use?**
<blockquote>In order to calculate the sample size, we need the alpha level, the power, and the effect size. </blockquote>
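
Under a normal approximation with equal group sizes, these three inputs combine as $n = 2\left(\frac{z_{1-\alpha/2} + z_{power}}{h}\right)^2$ users per group, where $h$ is Cohen's effect size for two proportions. The sketch below evaluates this by hand for the rates in our scenario. It is a simplified approximation, so it should land very close to, but not necessarily exactly on, the library result.

```python
import math
from statistics import NormalDist

alpha = 0.05
power = 0.80
p_baseline, p_target = 0.095, 0.11

std_normal = NormalDist()
z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
z_power = std_normal.inv_cdf(power)          # quantile matching the target power

# Cohen's h effect size for two proportions.
h = 2 * math.asin(math.sqrt(p_baseline)) - 2 * math.asin(math.sqrt(p_target))

# Required users per group, rounded up to a whole user.
n_per_group = math.ceil(2 * ((z_alpha + z_power) / h) ** 2)
print(n_per_group)
```

Because $h$ is squared in the denominator, halving the effect size roughly quadruples the required sample.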




In [5]:
# Calculating sample size
import pandas as pd
import statsmodels.stats.api as sms
import statsmodels.api as sm
import numpy as np
import math

# using normal
effect_size = sm.stats.proportion_effectsize(0.095, 0.11)
sample_size =  sms.NormalIndPower().solve_power(
                effect_size=effect_size,
                nobs1=None,
                alpha=0.05,
                power=0.8,
                ratio=1.0,
                alternative='two-sided'
)
sample_size = math.ceil(sample_size)
print(sample_size)

6411


In [8]:
#using t test
effect_size = sm.stats.proportion_effectsize(0.095, 0.11)
sample_size =  sms.TTestIndPower().solve_power(
                    effect_size=effect_size,
                    nobs1=None,
                    alpha=0.05,
                    power=0.8,
                    ratio=1.0,
                    alternative='two-sided'
)
sample_size = math.ceil(sample_size)
print(sample_size)

6412


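Based on our calculations, the experiment needs about **6,411 users** in each group: 6,411 with the normal approximation and 6,412 with the t-test version. As you can see, the two functions give very similar results. Because `sample_size` was last set by the t-test calculation, the sampling below uses 6,412 users per group. 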

### Experiment Run and Data Collection 

After calculating our sample size, we can now run our experiment. Usually, setting up and running an experiment needs to be done with the help of the engineering team.

Since this is a hypothetical business case, our data will come from a dataset found on Kaggle.

In [9]:
df = pd.read_csv('ab_test_results.csv')

In [10]:
df.groupby('group')['user_id'].nunique()

group
control    75000
test       75000
Name: user_id, dtype: int64

Theoretically, we need a sample of 6,412 users in each group. Let's randomly pick users from the data set for each group.

In [29]:
control = df[df['group'] == 'control'].sample(sample_size, random_state=74)
treatment = df[df['group'] =='test'].sample(sample_size, random_state=74)

results = pd.concat([control, treatment], ignore_index = True)

In [30]:
results.group.value_counts()

control    6412
test       6412
Name: group, dtype: int64

Now that we have created our hypothetical dataset, consisting of a control and a treatment group of the calculated sample size, we can 'analyze' it and calculate the results of our A/B test. 

## 5. `Result to Decision`
Now that we have created our hypothetical dataset, let's analyze the 'results' of our test. 

- **Sanity Checks**
Usually at this stage, once the a/b test has been run and 
What is our CTR?
We can calculate our click-through rate using the CTR equation defined earlier.  

In [31]:
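# Sum only the numeric columns we need, so other columns cannot break the aggregation
grouped_results = results.groupby('group')[['views', 'clicks']].sum().reset_index()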
grouped_results['ctr'] = grouped_results['clicks']/grouped_results['views']
grouped_results[['group', 'views', 'clicks', 'ctr']]

|   | group | views | clicks | ctr |
|---|---|---|---|---|
| 0 | control | 31886.0 | 3049.0 | 0.095622 |
| 1 | test | 32557.0 | 3277.0 | 0.100654 |


The results of our A/B test show that the treatment group has a higher click-through rate than the control group: 10.07% vs 9.56%. **However, is the increase in click-through rate seen in the treatment group statistically significant?**

**Are the Results Statistically Significant?**

Since our sample is quite large and our metric is the proportion of a binary event, we can use a z-test to calculate the p-value and determine whether the difference we see between the control and treatment groups is statistically significant. 

In [39]:
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

In [40]:
# Look up clicks and views by group name instead of by positional index
by_group = grouped_results.set_index('group')
count = by_group.loc[['control', 'test'], 'clicks'].to_numpy()
nobs = by_group.loc[['control', 'test'], 'views'].to_numpy()


stat, pval = proportions_ztest(count, nobs)

In [41]:
print(f'z-statistic: {stat:.3f}')
print(f'p-value: {pval:.3f}')

z-statistic: -2.147
p-value: 0.032
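
To see where these numbers come from, the z-statistic can be reproduced by hand from the click and view totals in the grouped results. The sketch below assumes the standard pooled two-proportion z-test with a two-sided alternative. The variable names are illustrative.

```python
import math
from statistics import NormalDist

# Counts copied from the grouped results table above.
clicks_control, views_control = 3049, 31886
clicks_test, views_test = 3277, 32557

p_control = clicks_control / views_control
p_test = clicks_test / views_test

# Under H_0 both groups share one rate, so the counts are pooled.
p_pooled = (clicks_control + clicks_test) / (views_control + views_test)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / views_control + 1 / views_test))

# Control minus test, matching the order in which the counts were passed.
z = (p_control - p_test) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z-statistic: {z:.3f}")
print(f"p-value: {p_value:.3f}")
```

The negative sign simply reflects that the control rate is listed first and is lower than the treatment rate.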


In [37]:
(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(count, nobs=nobs, alpha=0.05)

In [38]:
print(f'95% CI for Control group: ({lower_con:.3f}, {upper_con:.3f})')
print(f'95% CI for Treatment group: ({lower_treat:.3f}, {upper_treat:.3f})')

95% CI for Control group: (0.092, 0.099)
95% CI for Treatment group: (0.097, 0.104)


The p-value of 0.032 is less than the alpha of 0.05 that we set. This means there is sufficient evidence to reject the null hypothesis and conclude that the click-through rate in the treatment group is different from that of the control group. 

The 95% confidence interval for the treatment group, (0.097, 0.104), does not include the 11% click-through rate that the team was aiming for. The result may therefore be statistically significant without being practically significant by the team's definition of success, and further testing should be done. 
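
A complementary way to check practical significance is to build a confidence interval for the difference between the two rates and compare it with the 1.5 percentage point boundary. The sketch below assumes an unpooled (Wald) interval and reuses the totals from the grouped results. It is an additional check, not part of the original analysis.

```python
import math
from statistics import NormalDist

# Counts copied from the grouped results table.
clicks_control, views_control = 3049, 31886
clicks_test, views_test = 3277, 32557
min_lift = 0.015  # practical significance boundary: 1.5 percentage points

p_control = clicks_control / views_control
p_test = clicks_test / views_test
diff = p_test - p_control

# Unpooled (Wald) standard error of the difference in proportions.
se_diff = math.sqrt(p_control * (1 - p_control) / views_control
                    + p_test * (1 - p_test) / views_test)
z_crit = NormalDist().inv_cdf(0.975)
ci_low, ci_high = diff - z_crit * se_diff, diff + z_crit * se_diff

print(f"observed lift: {diff:.4f}")
print(f"95% CI for the lift: ({ci_low:.4f}, {ci_high:.4f})")
print("CI reaches the practical boundary:", ci_high >= min_lift)
```

If the whole interval sits above zero but below the boundary, the lift is likely real yet smaller than what the team defined as a success.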