# A/B Testing

## Learning Objectives
- Learn what A/B testing is
- PICOT
- How to design a good A/B experiment
- Measuring the results of an A/B test

Undoubtedly you have all heard of an A/B test before. In most cases, it is when we want to test a variation of a product/service and see how that influences a metric we have in mind. Some examples are: what effect does adding a delay to my website have on the profit I am making (Amazon); which banner should I display to a user to increase their click-through rate (Netflix).

Many of you have probably already realised that this is just applied hypothesis testing! We are testing whether a change in some product has a significant effect on a desired metric. Let's work through this under the following context: Whether changing the blue button to orange on an optional newsletter signup box increases the number of emails collected. When doing A/B testing, we should formulate our hypothesis under the PICOT acronym (not sure why they didn't choose PIVOT as the acronym...):
- **P**opulation: The specific group of people you will be studying in the experiment
- **I**ntervention: What is the variant you are introducing
- **C**omparison: What reference are you comparing this variant against
- **O**utcome: What metric/result are you measuring
- **T**ime: How long are you running the experiment for

Why is time important? Besides the constraint that some metrics are only reported at a regularly specified interval (e.g. every quarter), we also have to consider the **novelty factor**. That is, some change to our product may encourage users to experiment and play around with the product - increasing the outcome you're measuring. However, such an increase may only be coming from the novelty of the new feature. In the long run, it could turn out that the inclusion of the variant/feature is detrimental to the outcome you're trying to measure. A lengthier experiment helps mitigate this issue.

Under this acronym, what would an appropriate null and alternative hypothesis be for our experiment?
<details>
    <summary><b>> Click here to reveal the answer</b></summary>
    H0: Non-registered visitors of our website who see the orange button will <i>not</i> sign up to the newsletter at a higher rate over the period of one month than those who see the blue button. <br />
    Ha: Non-registered visitors of our website who see the orange button will sign up to the newsletter at a higher rate over the period of one month than those who see the blue button.

</details>


And can you break the hypothesis down so that each part maps to a letter of PICOT?
<details>
    <summary><b>> Click here to reveal the answer</b></summary>
    <ul>
        <li><b>P</b>opulation: Non-registered visitors of our website</li>
        <li><b>I</b>ntervention: Showing an orange button</li>
        <li><b>C</b>omparison: Showing a blue button</li>
        <li><b>O</b>utcome: Number of people who sign up to the newsletter</li>
        <li><b>T</b>ime: One month</li>
    </ul>
</details>

With that knowledge, what's wrong with the following hypotheses:
- H0: Milk is not a good combination with cookies
- H1: Milk is a good combination with cookies

<details>
    <summary><b>> Click here to reveal the answer</b></summary>
    <ul>
        <li>No clear definition of what a 'good combination' is. Sure, they may taste great together, but they could also cause health issues down the line. Would it be a good combination then?</li>
        <li>How do we measure this? It's not made clear.</li>
        <li>Which population are we targeting? If we went to an Asian country, people might enjoy the cookies, but (cow's) milk might make them ill. Then it wouldn't be a good combination for them.</li>
    </ul>
</details>

Now that a hypothesis has been formulated, let's look at how we can pick fair samples, before diving into the statistics. We need a **treatment** and a **control group**. Control groups are measured as a baseline (e.g. they're not shown the orange button), and treatment groups are measured with the change of interest - to try to reject the null hypothesis. We can usually assign groups randomly or by sampling our treatment group from users who have opted into a beta test, but both of these introduce biases:

**Randomisation bias** can occur when we collect samples of our data with poor randomisation. This will lead to over/under-representation in your samples. This would mean that some of the variables of one of the groups may differ from the distribution of the population. For example, different countries might have different affinities to the blue vs orange button. Capturing information from one country only, when you want to model the global population of your users would introduce randomisation bias and may skew your results.

**Selection bias** occurs when users are given the choice to A/B test a feature. This bias is more dangerous than randomisation bias as those who opt into these kinds of tests might have a higher risk appetite. Selection bias leads to harder to measure latent variables being encoded into the sample group results.

There is a statistically sound method known as [power analysis](https://www.youtube.com/watch?v=VX_M3tIyiYk) which allows you to determine a sample size that reflects the truth of the population data. We won't run through the methodology behind determining the sample size, but when it comes round to you designing and implementing your own A/B tests, researching how this method works is critical. For the sake of this notebook, we'll work under the assumption that the data we've obtained is representative of the population data. 
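To give a feel for what a power analysis produces, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions with a one-sided test. The function name `sample_size_per_group`, the 10% and 12% signup rates, and the defaults for `alpha` and `power` are all illustrative assumptions, not values from this notebook.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a one-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value for a one-sided test
    z_beta = NormalDist().inv_cdf(power)       # quantile matching the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical numbers: 10% of visitors sign up today and we hope to reach 12%.
n_required = sample_size_per_group(0.10, 0.12)
print(f"Visitors needed in each group: {n_required}")
```

Note how the required sample size grows quickly as the effect you want to detect shrinks, since the effect appears squared in the denominator.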

We'll be deviating away from the example above and be working with some data collected on a mobile game. The data is called [Cookie Cats and sourced from Kaggle](https://www.kaggle.com/yufengsui/mobile-games-ab-testing). We're given a hefty amount of datapoints (over 90,000), and we'll be testing the users' 1-day retention on the game, depending on the variant that they were shown. The variant works as follows. In the game, a player will occasionally encounter something known as a gate, which forces them to wait a non-trivial amount of time or make an in-app purchase to continue. We've been given data on what level the user is allowed to reach before the first gate is shown to them. In the control group, the gate was displayed at level 40; in the treatment group, at level 30. Retention is defined as a player coming back and playing the game 1 day after installation.

We should also specify something known as the **minimum detectable effect** (MDE). The MDE denotes the minimum change which is practically significant to a business. Its existence means that even when a result is statistically significant, the change may not be large enough to be of practical use to the business. Here, we will choose an MDE of 1% (i.e. one percentage point of retention).

So what are our null and alternative hypotheses with this being the case?
<details>
    <summary><b>> Click here to reveal the answer</b></summary>
    H0: New players of Cookie Cats that are shown the first gate at level 30 will <i>not</i> result in at least a 1% higher 1-day retention rate over the players who were shown the first gate at level 40. <br />
    Ha: New players of Cookie Cats that are shown the first gate at level 30 will result in at least a 1% higher 1-day retention rate over the players who were shown the first gate at level 40.
</details>

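Mathematically, with $\mu_{30}$ and $\mu_{40}$ denoting the 1-day retention rate of each group (and leaving the MDE aside until we interpret the results),
- H0: $\mu_{30} \leq \mu_{40}$
- Ha: $\mu_{30} > \mu_{40}$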

In [None]:
import pandas as pd
df = pd.read_csv("../DATA/cookie_cats.csv")
df

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90189 entries, 0 to 90188
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   userid          90189 non-null  int64 
 1   version         90189 non-null  object
 2   sum_gamerounds  90189 non-null  int64 
 3   retention_1     90189 non-null  bool  
 4   retention_7     90189 non-null  bool  
dtypes: bool(2), int64(2), object(1)
memory usage: 2.2+ MB


It looks like the variables represent the following:
- `userid`: Unique identifier for a user/player
- `version`: Whether the player was placed in control group (gate_40) or treatment group (gate_30)
- `sum_gamerounds`: The number of rounds played by a player in the first 14 days after installing
- `retention_1`: Whether the player continued playing the game 1 day after installing
- `retention_7`: Whether the player continued playing the game 7 days after installing

Let's check how many players have been placed into each group. We're looking for these numbers to be close to each other, and high enough that we can assume that our findings will be statistically relevant.

In [39]:
## How many players are present in each group?

version
gate_30    44700
gate_40    45489
dtype: int64
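One possible way to get counts like these is a `groupby` on `version` followed by `size()`. The sketch below runs on a tiny made-up DataFrame (`toy`) that only borrows the column names from the real data, so it can be run without the CSV.

```python
import pandas as pd

# Toy stand-in for the real data: same column names, made-up values.
toy = pd.DataFrame({
    "userid": [1, 2, 3, 4, 5],
    "version": ["gate_30", "gate_40", "gate_30", "gate_40", "gate_40"],
})

# Count the rows (players) that fall into each A/B group.
group_sizes = toy.groupby("version").size()
print(group_sizes)
```

On the real data, the same call would be applied to `df` instead of `toy`.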

Great - looks like both groups have similar numbers of data collected on them and that the size of the data is pretty large. Before continuing, I want to gauge what the distribution of the number of gamerounds played actually is.

In [51]:
## Produce a plot which shows the frequency of the number of gamerounds played
# Only plot up to 100 gamerounds played

One thing I wanted to find out from this task is whether we have users who installed the app but didn't play any game rounds. It looks like we do have a fair few of these cases. This is relevant, so let's keep it in mind.

I initially went to plot the entirety of the gamerounds played and found that this extended to 50,000... Seems like that could indicate an outlier? Let's boxplot the different groups to find out what the data is saying.

In [59]:
import plotly.express as px

# One box per A/B group, showing the spread of rounds played
px.box(df, x="version", y="sum_gamerounds",
       labels={"sum_gamerounds": "Gamerounds played"}, title="Boxplot of games played by game version")

In [60]:
## Remove the outlier and replot the data
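A minimal sketch of the removal step only (the replot is left out), assuming the outlier is the single row with the largest `sum_gamerounds`. The `toy` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in: one absurdly large value among otherwise small ones.
toy = pd.DataFrame({
    "version": ["gate_30", "gate_40", "gate_30", "gate_40", "gate_40"],
    "sum_gamerounds": [3, 12, 50000, 7, 0],
})

# Keep every row strictly below the maximum, which drops the single extreme value.
cleaned = toy[toy["sum_gamerounds"] < toy["sum_gamerounds"].max()]
print(cleaned)
```

This only works cleanly when there is one extreme row; with several suspicious values, a threshold or quantile-based filter would be a more deliberate choice.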


Earlier, we saw that we had players who had installed the game but hadn't played it yet. Some players had installed the game and played it a little bit - some players had played it a lot. If we were working for the business, our job would be to find out what we need to do to keep the players engaged as long as possible. A common metric for measuring engagement in games is 1-day retention: the percentage of players that come back and play the game one day after they've installed it. Let's look at the overall 1-day retention.

Subsequently, we should work out the 1-day retention for each of the A/B groups:

In [137]:
## Work out the percentage of users that attained 1-day retention


## Work out the percentage of gate_40 users that attained 1-day retention

## Work out the percentage of gate_30 users that attained 1-day retention


Overall 1-day retention: 44.521444094558035
Gate 40 1-day retention: 44.22827496757458
Gate 30 1-day retention: 44.81979462627799
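Figures like these can be computed by taking the mean of the boolean `retention_1` column, overall and per group. The sketch below is one possible approach on a made-up `toy` DataFrame, so its numbers are not the ones printed above.

```python
import pandas as pd

# Toy stand-in for the real data: same column names, made-up values.
toy = pd.DataFrame({
    "version": ["gate_30", "gate_30", "gate_40", "gate_40", "gate_40", "gate_30"],
    "retention_1": [True, False, True, False, False, True],
})

# The mean of a boolean column is the fraction of True values.
overall = toy["retention_1"].mean() * 100
by_group = toy.groupby("version")["retention_1"].mean() * 100

print(f"Overall 1-day retention: {overall}")
print(f"Gate 40 1-day retention: {by_group['gate_40']}")
print(f"Gate 30 1-day retention: {by_group['gate_30']}")
```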


Looks like there's a 0.6 percentage point increase in 1-day retention when players are placed on Gate 30 instead of Gate 40. We should check how confident we are in this increase by testing whether it is statistically significant. Since we have a large number of samples, we're able to assume normality (via the CLT) and use either a z-test or a t-test to test our hypothesis. If we hadn't had such a large number of datapoints to work with, an appropriate solution might be to bootstrap our data.
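For reference, the bootstrap alternative mentioned above could look roughly like the sketch below. It uses NumPy and works from the group sizes and retention rates printed earlier rather than the raw rows; the seed and the number of resamples (`n_boot`) are arbitrary choices. The group sizes were printed before the outlier was removed, so they may be off by a row, which is negligible at this scale.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Group sizes and retention rates as printed in the notebook outputs.
n_control, n_treatment = 45489, 44700
p_control = 0.4422827496757458
p_treatment = 0.4481979462627799

n_boot = 10_000
# Resampling a 0/1 column with replacement and counting the ones is the same as
# drawing from a binomial with the observed rate, so we draw the counts directly.
boot_control = rng.binomial(n_control, p_control, n_boot) / n_control
boot_treatment = rng.binomial(n_treatment, p_treatment, n_boot) / n_treatment
boot_diff = boot_treatment - boot_control

share_positive = (boot_diff > 0).mean()
ci_low, ci_high = np.percentile(boot_diff, [2.5, 97.5])
print(f"Share of resamples where gate_30 retains better: {share_positive:.3f}")
print(f"95% bootstrap interval for the difference: ({ci_low:.4f}, {ci_high:.4f})")
```

With raw rows available, the more literal version would resample each group's `retention_1` column with replacement and recompute the mean on every iteration; the binomial shortcut gives the same distribution for binary data.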

We can use either statsmodels or SciPy to run our statistical test. Choose your preferred library and test, and implement a solution below which tests our hypothesis. Be aware of the tails of the test based on our null/alternative hypothesis, and write up your conclusion regarding statistical significance afterwards.

Bonus: Search around to find out how to extract/implement confidence intervals for each of the groups too
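As a point of comparison for your own solution, here is a hand-rolled sketch of a one-sided, pooled two-proportion z-test with normal-approximation (Wald) confidence intervals, written with the standard library only. It is not necessarily the method used to produce the output in this notebook. The inputs are the group sizes and retention rates printed earlier; the group sizes predate the outlier removal, so they may be off by a row. The helper name `wald_interval` is made up for this example. If you prefer a library, statsmodels offers `proportions_ztest` and `proportion_confint` in `statsmodels.stats.proportion`.

```python
from math import sqrt
from statistics import NormalDist

# Group sizes and retention rates as printed in the notebook outputs.
n_control, n_treatment = 45489, 44700
p_control = 0.4422827496757458
p_treatment = 0.4481979462627799

# Under H0 both groups share one retention rate, so pool them for the standard error.
p_pooled = (p_control * n_control + p_treatment * n_treatment) / (n_control + n_treatment)
se_pooled = sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_treatment))
z_stat = (p_treatment - p_control) / se_pooled

# One-sided test: Ha says the treatment rate is larger, so use the upper tail.
p_value = 1 - NormalDist().cdf(z_stat)


def wald_interval(p, n, confidence=0.95):
    """Normal-approximation confidence interval for a single proportion."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * sqrt(p * (1 - p) / n)
    return p - margin, p + margin


ci_control = wald_interval(p_control, n_control)
ci_treatment = wald_interval(p_treatment, n_treatment)

print(f"z statistic: {z_stat}")
print(f"p-value: {p_value}")
print(f"95% confidence interval control: {ci_control}")
print(f"95% confidence interval treatment: {ci_treatment}")
```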

p-value: 0.036961103150912514
95% confidence interval control: (0.4377185118600718, 0.4468469874914197)
95% confidence interval treatment: (0.4435875060684543, 0.4528083864571055)


Since our p-value (0.037) is below $\alpha$ = 0.05, we can reject the null hypothesis and conclude that moving the first gate from level 40 to level 30 does significantly increase the 1-day retention of players. Looking at the confidence interval for the `treatment` group, [0.444, 0.453], we can see that it:
1. Starts above our baseline of 44.2%
2. Includes the MDE improvement over the baseline (44.2% + 1% = 45.2%)

What this means is that, firstly, we can be reasonably confident that the treatment is effective (as the lower bound of the interval is higher than the baseline), and secondly, that the true retention rate of the treatment *could* meet our MDE. However, the MDE target only just falls inside the upper end of our confidence interval. This means that we'll probably need to leave the decision to our stakeholders as to whether it'd be worth implementing the change!
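The two numbered checks above are simple comparisons, sketched here with the rounded figures quoted in the text. The variable names are illustrative.

```python
baseline = 0.442               # control group's 1-day retention (rounded)
mde = 0.01                     # minimum detectable effect: one percentage point
ci_treatment = (0.444, 0.453)  # rounded 95% interval for the treatment group

mde_target = baseline + mde

# Check 1: the whole interval sits above the baseline.
starts_above_baseline = ci_treatment[0] > baseline
# Check 2: the MDE target falls somewhere inside the interval.
includes_mde_target = ci_treatment[0] <= mde_target <= ci_treatment[1]

print(f"Interval starts above baseline: {starts_above_baseline}")
print(f"Interval includes MDE target ({mde_target:.3f}): {includes_mde_target}")
```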