# Last two weeks of the course

The plan for the last two weeks of the course:
1. April 30th (Wednesday) - live group project consultations. Each group gets a 15-minute slot to share progress and issues and get feedback from me. Very likely to impact your grade!
2. May 2nd (Friday) - course review. We will go through the whole course and work on some questions.
3. May 7th (Wednesday) - a presentation from an employee of the State Data Agency (Valstybės duomenų agentūra). Their work is really interesting, so I highly recommend coming!
4. May 9th (Friday) - group project presentations.


# AB Testing

Agenda:
1. Randomized controlled experiments
2. AB testing
3. Some intuition on p-values and confidence intervals
4. Analysing an AB test


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

# if this import fails, run pip install statsmodels or comment it out
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)


## Randomized Controlled Experiments

A randomized controlled experiment is a scientific method for measuring the effectiveness of an intervention. Some examples:
- Collecting 107 patients with pulmonary tuberculosis, splitting them into 2 groups randomly. One group receives streptomycin (an antibiotic), while the other does not receive anything. ([link](https://www.jameslindlibrary.org/medical-research-council-1948b/))
- Randomizing which fertilizer you're using for which fields of barley and then measuring the yield of each field.

![Randomized Controlled Trial](https://www.simplypsychology.org/wp-content/uploads/randomized-controlled-trial-1024x596.jpeg)

What makes an experiment randomized and controlled:
1. One of the groups does not receive the treatment.
2. Participants of the experiment are assigned randomly into treatment and control groups.
3. Other than the treatment, the conditions that the two groups are exposed to are identical.
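
Random assignment (point 2) can be sketched in a few lines. This is only a minimal illustration using Python's standard library; the participant IDs, the seed and the 50/50 split are assumptions made up for the example.

```python
import random

# Hypothetical participant IDs; in a real test these would be user or patient IDs.
participants = [f"user_{i}" for i in range(10)]

assignment_rng = random.Random(0)  # fixed seed so the split is reproducible
shuffled = participants[:]         # copy, so the original order is kept
assignment_rng.shuffle(shuffled)

# After shuffling, the first half becomes control and the second half treatment.
half = len(shuffled) // 2
control = shuffled[:half]
treatment = shuffled[half:]

print("control:  ", control)
print("treatment:", treatment)
```

Because the order is shuffled before splitting, nothing about a participant (age, activity, sign-up date) can influence which group they end up in.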

Some terminology:
- Treatment - a.k.a. intervention, the thing we want to measure the impact of and that we have control over. The drug, the new page layout, etc.
- Control group - group that does not receive the treatment.
- Treatment group - group that receives the treatment.
- Outcome - the variable we are trying to impact with the experiment.
- Treatment effect - the difference in outcome between the treatment group and the control group.
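
As a rough illustration of the last term, a treatment effect can be estimated as the difference in average outcome between the two groups. The yields below are made-up numbers in the spirit of the barley example, and the "usual" and "new" fertilizer labels are assumptions, not real data.

```python
from statistics import mean

# Hypothetical barley yields (tonnes per hectare), invented for illustration.
control_yield = [4.1, 3.8, 4.4, 4.0]     # fields with the usual fertilizer
treatment_yield = [4.6, 4.3, 4.9, 4.2]   # fields with the new fertilizer

# Estimated treatment effect: difference in mean outcome between the groups.
treatment_effect = mean(treatment_yield) - mean(control_yield)
print(f"estimated treatment effect: {treatment_effect:.3f}")
```

With samples this small the estimate is very noisy, which is why the p-values and confidence intervals later in the lecture are needed.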


## AB Testing

AB tests are what technology companies call randomized controlled experiments. At the basic level, the statistical techniques are the same as in randomized controlled trials (RCTs). Usually group A is the control group, and group B is the treatment group. But, just like an RCT, an AB test can have more than two groups (i.e. more than one treatment).

Some examples of changes that are AB tested:
1. Improving the layout of the checkout page in an eshop.
2. Netflix improving their recommendation algorithm, to recommend shows that users want to watch.
3. Vinted improving machine learning models that detect counterfeit items in order to protect buyers.

In all of these examples, there is a clear treatment being applied - reorganized checkout page, new algorithm, etc. However, we are never sure whether the change we are making is really an improvement until we test it: split users into two groups randomly, expose one of the groups to the change and observe their behavior. We need AB testing because we do not want to make changes based on intuition or expertise alone; we need to back our decisions with data. Just like we don't know whether a new drug will be effective in curing the disease, we also don't know if a specific change to our product will improve it.

However, all of the examples above raise the question: what does it mean to "improve"? How we operationalize abstract concepts such as "user experience" is very important in AB tests, because we need to measure the outcomes for both groups precisely. Therefore, before every AB test we define the metrics that we will use to decide whether the test was successful.

Some popular metrics:
1. Conversion rates - the proportion of test participants that "convert" to some action - buying, watching a movie, creating a customer support ticket.
2. Retention rates - for instance, of all users that bought something this week, what proportion also bought during the next week?
3. Just counts or sums - e.g. average number of items bought or amount spent.
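
To make the first metric concrete, here is a small sketch of computing a conversion rate per group. The records are hypothetical and the group labels `A` and `B` are just the usual AB test convention; real data would typically sit in a pandas DataFrame instead of a list.

```python
from collections import defaultdict

# Hypothetical per-user records: (group, converted). Not real data.
records = [
    ("A", True), ("A", False), ("A", False), ("A", False),
    ("B", True), ("B", True), ("B", False), ("B", False),
]

# Count participants and conversions in each group.
counts = defaultdict(lambda: {"n": 0, "converted": 0})
for group, converted in records:
    counts[group]["n"] += 1
    counts[group]["converted"] += int(converted)

# Conversion rate = converted participants / all participants in the group.
conversion_rate = {g: c["converted"] / c["n"] for g, c in counts.items()}
print(conversion_rate)  # {'A': 0.25, 'B': 0.5}
```

Note that the denominator is everyone assigned to the group, not only the users who were active.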


## Some statistics intuition

In [2]:
def z_test_pvalue(x_a, n_a, x_b, n_b):
    """
    x_a - number of successes in group A
    n_a - sample size of group A
    x_b - number of successes in group B
    n_b - sample size of group B
    """
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1/n_a + 1/n_b))
    p_a = x_a / n_a
    p_b = x_b / n_b
    z  = (p_b - p_a) / se
    p = 2 * (1 - stats.norm.cdf(abs(z)))
    return p

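def z_test_ci(x_a, n_a, x_b, n_b, alpha=0.05):
    """
    Confidence interval for the treatment effect (p_b - p_a).
    Arguments are the same as in z_test_pvalue; alpha is the significance level.
    Uses the pooled standard error, so the interval agrees with z_test_pvalue.
    """
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1/n_a + 1/n_b))
    te = x_b / n_b - x_a / n_a  # observed treatment effect

    z = stats.norm.ppf(1 - alpha/2)  # critical value, about 1.96 for alpha=0.05
    moe = z * se # margin of error
    ci_lower = te - moe
    ci_upper = te + moe
    return ci_lower, ci_upper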
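
To see how the numbers fit together, here is a worked example that follows the same steps as the two functions above. The counts are hypothetical, and the sketch uses `statistics.NormalDist` from the standard library in place of `scipy.stats.norm` so that it runs without extra installs.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical counts, chosen only for illustration.
x_a, n_a = 100, 1000   # control: 100 conversions out of 1000 users
x_b, n_b = 130, 1000   # treatment: 130 conversions out of 1000 users

# Pooled conversion rate and standard error of the difference.
p_pool = (x_a + x_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
te = x_b / n_b - x_a / n_a     # observed treatment effect

std_normal = NormalDist()      # standard normal: mean 0, standard deviation 1
z = te / se
p_value = 2 * (1 - std_normal.cdf(abs(z)))   # two-sided p-value

alpha = 0.05
moe = std_normal.inv_cdf(1 - alpha / 2) * se  # margin of error
ci_lower, ci_upper = te - moe, te + moe

print(f"effect={te:.3f}, p-value={p_value:.3f}, CI=({ci_lower:.3f}, {ci_upper:.3f})")
```

With these counts the p-value is below 0.05 and the interval does not contain 0, so both summaries point to the same conclusion.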


## Analysing a real experiment

Let's now analyse a real experiment.


### Task 1

Read the dataset description [here](https://www.kaggle.com/code/mursideyarkin/mobile-games-ab-testing-with-cookie-cats) and look at the dataset. Identify the following components:
- Treatment
- Treatment unit (subject)
- The number of groups in the test
- Which group is the control group


### Task 2

Calculate 1-day retention rates and counts ($n$) for all groups.



### Task 3

Analyse the experiment - calculate the p-value and the confidence interval. Write down the interpretation of both.

Assume a significance level of $\alpha = 0.05$.

Would you recommend scaling the experiment?


## Recap

1. A randomized controlled experiment is an experiment that has these components:
    1. There is a control group that does not receive the treatment.
    2. Units are assigned into control and treatment groups randomly.
    3. Aside from the treatment, everything is the same between groups.
2. AB tests are what technology companies call randomized controlled experiments.
3. We need AB tests because our intuition can be wrong, and we need to back our decisions with data.
4. A key component of running successful AB tests is a valid target metric.
5. We can summarize the results of an experiment via a p-value or a confidence interval.

| statistic | question it answers | interpretation |
| ------ | ---- | --- |
| p-value | if the true effect is zero, how surprising is the observed result? | if the p-value is below $\alpha$, we reject the null hypothesis |
| confidence interval | which effect sizes are consistent with the data? | the narrower the interval, the more precise our estimate. 0 being outside of the interval is equivalent to a significant effect |
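
The p-value row can be checked with a small simulation. In the sketch below both groups have the same true conversion rate (an "A/A test"), so every significant result is a false alarm. The sample size, conversion rate and number of simulations are arbitrary assumptions, and the z-test is re-implemented with the standard library so the block runs on its own.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_test_pvalue(x_a, n_a, x_b, n_b):
    """Two-sided pooled z-test for a difference in proportions."""
    p_pool = (x_a + x_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (x_b / n_b - x_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


sim_rng = random.Random(0)  # fixed seed for reproducibility
n_users, true_rate, n_sims, alpha = 500, 0.2, 1000, 0.05

false_positives = 0
for _ in range(n_sims):
    # Both groups convert at the same true rate: the true effect is zero.
    x_a = sum(sim_rng.random() < true_rate for _ in range(n_users))
    x_b = sum(sim_rng.random() < true_rate for _ in range(n_users))
    if z_test_pvalue(x_a, n_users, x_b, n_users) < alpha:
        false_positives += 1

false_positive_rate = false_positives / n_sims
print(f"share of A/A tests with p < {alpha}: {false_positive_rate:.3f}")
```

The printed share should land close to $\alpha$: when the true effect is zero, a p-value below $\alpha$ is surprising, but it still happens in roughly 5% of experiments.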
