# **A/B Tests**

Imagine you are responsible for a service, say a website for pet adoption. You care about people booking a meeting to come visit and possibly adopt a stray. You have social-media outreach and keyword advertising to drive traffic, and you care specifically about the **conversion rate**: among the visitors who land on the home page, what fraction book a meeting.

To improve that conversion rate, you have an idea: a cleaner page design or a more compelling call-to-action. You are not sure how effective it will be. The best way to find out is to **split** visitors at random into two groups and show one half your existing set-up ('Control', or A) and the other half your modified, hopefully improved version ('Treatment', or B). If the improved version leads to a clearly better conversion rate, you probably want to adopt it.

You compare the two conversion rates using a $t$-test to decide whether the observed difference is larger than random variation alone would plausibly produce, which is what we call **statistically significant**.

Splitting the traffic to an online service and comparing conversion rates with a $t$-test is a very widely used approach to understand the impact of a change. In science, it’s known as a **randomized controlled trial** (RCT), but online services prefer to call it an **A/B test**.


In A/B testing, two versions of the same marketing material are created: version A (the control group) and version B (the treatment group). The two versions are then randomly shown to different groups of users, and their responses are measured and compared.
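
One common way to implement the random split is to hash a stable user identifier, so that a returning visitor always sees the same version. The sketch below is illustrative only: the helper `assign_variant`, the salt string, and the 50/50 split are assumptions, not something taken from this notebook's data pipeline.

```python
import hashlib

def assign_variant(user_id: str, salt: str = "homepage-test") -> str:
    """Deterministically map a user id to 'Control' or 'Treatment' (hypothetical helper)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Read the hash as a large integer: even -> Control, odd -> Treatment
    return "Control" if int(digest, 16) % 2 == 0 else "Treatment"

# Assign 10,000 made-up user ids and check that the split is close to 50/50
assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
share_treatment = assignments.count("Treatment") / len(assignments)
print(f"Share in Treatment: {share_treatment:.3f}")
```

Changing the salt for each new experiment keeps assignments in one test from lining up with assignments in another.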

In A/B testing, the treatment group is the group that receives the modified version of the marketing material being tested (version B), while the control group is the group that receives the original or existing version of the marketing material (version A).

The purpose of the control group is to establish a baseline or benchmark against which the performance of the treatment group can be measured. By measuring the performance of both groups, analysts can determine whether the changes made in the treatment group had a statistically significant impact on the measured response metrics.



## **T-Test**

A t-test is a statistical test used to determine if there is a significant difference between the response metrics of the treatment group and the control group. It helps in evaluating whether the observed difference in performance between the two groups is statistically significant or simply due to random chance.

There are two common types of t-tests used in A/B testing: the independent samples t-test and the paired samples t-test.

Independent samples t-test: This type of t-test is used when the treatment and control groups are independent of each other, meaning that each participant is assigned to either the treatment or control group, but not both. The independent samples t-test compares the means of the response metrics between the two groups. It assumes that the response metrics are normally distributed and have equal variances.
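
As a minimal sketch of the independent samples case, `scipy.stats.ttest_ind` can be applied to two arrays of per-user outcomes. The data here are simulated, and the conversion probabilities (10% and 12%) and group sizes are made-up assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=0)
# Simulated per-user outcomes: 1 = converted, 0 = did not convert
control = rng.binomial(1, 0.10, size=5_000)
treatment = rng.binomial(1, 0.12, size=5_000)

# equal_var=True assumes equal variances (Student's test);
# equal_var=False would drop that assumption (Welch's test)
result = ttest_ind(treatment, control, equal_var=True)
print(f"t = {result.statistic:.3f}, p-value = {result.pvalue:.4f}")
```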

Paired samples t-test: This type of t-test is used when the treatment and control groups are dependent on each other. It is employed when each participant is exposed to both the treatment and control conditions, such as in a before-and-after scenario. The paired samples t-test compares the mean differences between the response metrics of the two conditions. It also assumes that the differences are normally distributed.
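
A paired test can be sketched with `scipy.stats.ttest_rel`. The before/after measurements below are simulated, and their means and spreads are arbitrary assumptions. The second call illustrates that a paired test is the same as a one-sample test on the per-participant differences.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

rng = np.random.default_rng(seed=1)
# Simulated metric for the same 200 participants under both conditions
before = rng.normal(loc=20.0, scale=5.0, size=200)
after = before + rng.normal(loc=0.5, scale=2.0, size=200)

paired = ttest_rel(after, before)
# Equivalent formulation: test whether the mean difference is zero
one_sample = ttest_1samp(after - before, popmean=0.0)

print(f"paired:     t = {paired.statistic:.3f}, p = {paired.pvalue:.4f}")
print(f"one-sample: t = {one_sample.statistic:.3f}, p = {one_sample.pvalue:.4f}")
```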

In both types of t-tests, the null hypothesis assumes that there is no difference between the mean response metrics of the treatment and control groups. The alternative hypothesis is that there is a difference. The t-statistic is calculated and compared with the t-distribution to obtain a p-value: the probability of observing a difference at least this extreme if the null hypothesis were true. If the p-value is below a predetermined significance level (typically 0.05), the null hypothesis is rejected, indicating a statistically significant difference between the groups.
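
The decision rule can be sketched with `scipy.stats.t`. The t-statistic and degrees of freedom below are hypothetical numbers chosen for illustration. Comparing the p-value with the significance level and comparing $|t|$ with the critical value are two views of the same two-sided test.

```python
from scipy.stats import t

alpha = 0.05              # significance level
degrees_freedom = 1_000   # hypothetical
t_statistic = 2.3         # hypothetical observed value

# Two-sided critical value: |t| beyond this is significant at level alpha
critical_value = t.ppf(1 - alpha / 2, degrees_freedom)
# Two-sided p-value: probability of |t| at least this large under the null
p_value = 2 * t.sf(abs(t_statistic), degrees_freedom)

reject_null = p_value < alpha
print(f"critical value = {critical_value:.3f}, p-value = {p_value:.4f}, "
      f"reject null: {reject_null}")
```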

T-tests provide a statistical framework to assess the significance of the observed differences in response metrics between the treatment and control groups, helping to determine the effectiveness of the changes made in the A/B test.

# 1. Import 
## Key libraries

In [None]:
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
plt.style.use('dark_background')

## Event log data from a pickle file


In [None]:
events_df = pd.read_pickle('../data/events.pkl')

## Exploration

In [None]:
events_df.head()

In [None]:
# change the datatype
events_df['event_timestamp'] =\
    events_df['event_timestamp'].astype('datetime64[ns]')


In [None]:
daily_conversion = events_df\
    .groupby(['page_url_path', events_df.event_timestamp.dt.to_period('D')])\
    .agg({'user_domain_id': 'count'})\
    .unstack(level='page_url_path')
daily_conversion.head()

In [None]:
daily_conversion.columns = daily_conversion.columns.droplevel()
order=['/home','/product_a','/product_b','/cart','/payment','/confirmation']
daily_conversion = daily_conversion[order]

In [None]:
daily_conversion

In [None]:
daily_conversion.plot()
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))

# 2. Data processing 
## Aggregate into conversion rates

In [None]:
total_users = events_df\
    .groupby(['variant'])\
    ['user_domain_id'].nunique()
conversion_count = events_df\
    .groupby(['variant', 'page_url_path'])\
    ['user_domain_id'].nunique()

In [None]:
conversion_count

In [None]:
conversion_rate = pd.DataFrame(conversion_count)\
    .join(total_users, on='variant', rsuffix='_total')
conversion_rate['conversion_rate'] =       \
    conversion_rate['user_domain_id'] /    \
    conversion_rate['user_domain_id_total']
conversion_rate

In [None]:
conversion_rate.drop(
    columns=['user_domain_id'], inplace=True)
conversion_rate.columns = \
    ['visitors', 'conversion_rate']
conversion_rate
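
To make the aggregation above concrete, here is a toy version on a hand-made table. The four users and their page visits are invented for illustration; only the column names (`user_domain_id`, `variant`, `page_url_path`) follow the event log used in this notebook.

```python
import pandas as pd

# Invented events: each row is one page view by one user
toy_events = pd.DataFrame({
    'user_domain_id': ['u1', 'u1', 'u2', 'u3', 'u3', 'u4'],
    'variant': ['Control', 'Control', 'Control',
                'Treatment', 'Treatment', 'Treatment'],
    'page_url_path': ['/home', '/cart', '/home', '/home', '/cart', '/home'],
})

# Distinct users per variant, and distinct users reaching each page per variant
toy_total = toy_events.groupby('variant')['user_domain_id'].nunique()
toy_reached = toy_events\
    .groupby(['variant', 'page_url_path'])['user_domain_id'].nunique()

# Conversion rate: share of a variant's users who reached each page
toy_rate = toy_reached.div(toy_total, level='variant')
print(toy_rate)
```

Counting distinct users rather than events means a visitor who reloads a page several times is still counted once.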

## Pivot into parallel columns

In [None]:
conversion_rate = conversion_rate\
    .pivot_table(
        index='page_url_path',
        columns='variant',
        values=['visitors', 'conversion_rate']) \
    .sort_values(
        ('conversion_rate', 'Treatment'),
        ascending=False
    )
conversion_rate

# 3. Compute statistical tests

## $t$-test

The most commonly used statistical test is the $t$-test.

$t$ is the **normalised** difference between two measures using the expected standard deviation:

$$t = \frac{\Delta\overline{X}}{s_{\Delta\bar{X}}}$$

where 
* $X$ is our objective (conversion),
*  $\overline{X}$ the average over a sample (conversion rate),
*  $\Delta\overline{X}$ the difference between the two averages, and 
*  $s_{\Delta\overline{X}}$ is the **standard deviation** of that difference (its standard error), i.e. the typical size of the difference between two independent measurements drawn from the same population.

This score can be compared to a normalised $t$-distribution:

![t-distribution](https://upload.wikimedia.org/wikipedia/commons/thumb/c/c0/Student_T-Distribution_Table_Diagram.svg/2560px-Student_T-Distribution_Table_Diagram.svg.png)

If we assume there is no true difference, then a value of $t$ that is large in absolute value (typically beyond about 2) would be an outlier under that assumption. We therefore reject the assumption and consider the difference significant.

## Welch’s $t$-test

Welch suggested a version for when Control and Treatment each provide their own estimate of the standard deviation: 
$$ t = \frac{\overline{X}_T - \overline{X}_C}
            {\sqrt{
               {s_{\overline{X}_T}^2} +
               {s_{\overline{X}_C}^2}
            }} $$

With conversion rate, the variance of one observation can be estimated as:

$$ s_{X_Ti}^2 = \overline{X}_T (1-\overline{X}_T) \qquad
   s_{X_Ci}^2 = \overline{X}_C (1-\overline{X}_C)$$

The variance of the sum of $n$ independent observations, each of variance $s^2$, is $n$ times larger: $n s^2$. Dividing that sum by $n$ to get the average divides the variance by $n^2$:
$$ s_{\overline{X}_T}^2 = \frac{n_T \, s_{X_Ti}^2}{n_T^2} = \frac{s_{X_Ti}^2}{n_T} \qquad
   s_{\overline{X}_C}^2 = \frac{n_C \, s_{X_Ci}^2}{n_C^2} = \frac{s_{X_Ci}^2}{n_C}
$$
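
The scaling of the variance of an average can be checked with a quick simulation. This is a sketch under assumed values: a true conversion probability of 10%, 2,000 visitors per experiment, and 20,000 repeated experiments are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
p, n, n_experiments = 0.10, 2_000, 20_000  # assumed values for illustration

# Each experiment: count conversions among n visitors, then turn it into a rate
rates = rng.binomial(n, p, size=n_experiments) / n

empirical_variance = rates.var()
theoretical_variance = p * (1 - p) / n  # variance of one observation, divided by n

print(f"empirical:   {empirical_variance:.3e}")
print(f"theoretical: {theoretical_variance:.3e}")
```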

Therefore, 
$$ t = \frac{\overline{X}_T - \overline{X}_C}
            {s_{\Delta\overline{X}}} $$
where 
$$ s_{\Delta\overline{X}} = \sqrt{
   \frac{\overline{X}_T(1-\overline{X}_T)}{n_T} +
   \frac{\overline{X}_C(1-\overline{X}_C)}{n_C}
} $$
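
As a worked example of the formula above, take made-up counts: 1,000 conversions out of 10,000 Control visitors and 1,100 out of 10,000 Treatment visitors. The degrees of freedom used for the p-value ($n_T + n_C - 2$) are a simplifying assumption; with samples this large the exact choice barely changes the result.

```python
import numpy as np
from scipy.stats import t

# Hypothetical counts, for illustration only
n_c, conversions_c = 10_000, 1_000
n_t, conversions_t = 10_000, 1_100

rate_c = conversions_c / n_c
rate_t = conversions_t / n_t

# Standard error of the difference between the two conversion rates
std_error = np.sqrt(rate_t * (1 - rate_t) / n_t + rate_c * (1 - rate_c) / n_c)
t_score = (rate_t - rate_c) / std_error

# Two-sided p-value (simplified degrees of freedom, see the note above)
p_value = 2 * t.sf(abs(t_score), n_t + n_c - 2)
print(f"difference = {rate_t - rate_c:.4f}, t = {t_score:.3f}, p = {p_value:.4f}")
```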


In [None]:
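from scipy.stats import t
alpha = 0.05  # significance level for the two-sided test

def compute_t_test(c):
    """Add Welch's t-test columns comparing Treatment with Control."""
    rate_t = c['conversion_rate']['Treatment']
    rate_c = c['conversion_rate']['Control']
    n_t = c['visitors']['Treatment']
    n_c = c['visitors']['Control']

    # Variance of each conversion rate: p * (1 - p) / n
    var_t = rate_t * (1 - rate_t) / n_t
    var_c = rate_c * (1 - rate_c) / n_c

    c['difference'] = rate_t - rate_c
    # Standard error of the difference between the two rates
    c['stdev'] = (var_t + var_c) ** 0.5
    c['t-score'] = c['difference'] / c['stdev']
    # Welch-Satterthwaite approximation of the degrees of freedom
    # (the original n_C + n_T - 1 matches neither Student's nor Welch's test)
    c['degrees_freedom'] = (var_t + var_c) ** 2 / (
        var_t ** 2 / (n_t - 1) + var_c ** 2 / (n_c - 1)
    )
    # Two-sided p-value: probability of |t| at least this large under the null
    c['p-value'] = t.sf(
        np.abs(c['t-score']),
        c['degrees_freedom']
    ) * 2
    # Smallest observed difference that would be significant at level alpha
    # (this threshold does not account for statistical power)
    c['minimum_detectable_effect'] = \
        t.ppf(1 - alpha/2, c['degrees_freedom']) * c['stdev']
    c['significant'] = c['p-value'] < alpha

    return c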

In [None]:
for d in [alpha/2, 1 - alpha/2]:
    print(f'{d=}: t={t.ppf(d, 1000)}')

# Results of the t-test

In [None]:
compute_t_test(conversion_rate)

## Interview Questions

* What is A/B testing, and why is it commonly used in data-driven decision-making?

* What is the difference between the treatment group and the control group in an A/B test?

* How do you determine the sample size for an A/B test? What factors influence this decision?


* What is hypothesis testing, and how does it apply to A/B testing?

* What is the p-value in A/B testing, and how do you interpret its significance?

* What are Type I and Type II errors in A/B testing, and how do they relate to statistical significance and statistical power?

