A/B testing, also known as split testing, is a statistical method used in data science to compare two versions of a product, webpage, or marketing campaign to determine which one performs better based on a specific metric. This approach allows data scientists and marketers to make data-driven decisions rather than relying on intuition or guesswork.

## A/B Testing Procedure

1. __Problem Statement__ - What is the goal of the experiment?
2. __Hypothesis Testing__ - What result do you hypothesize from the experiment?
3. __Design the Experiment__ - What are the experiment parameters?
4. __Run the Experiment__ - What are the requirements for running it?
5. __Validity Checks__ - Did the experiment run soundly without errors or bias?
6. __Interpret the Results__ - Is the metric significant statistically and practically?
7. __Launch Decision__ - Take action based on the experiment results

## Tips

- Talk about the business goal first (user journey)
- Use the user funnel to create the success metric
- A success metric must be: measurable, attributable, sensitive and timely

## Example

A web store wants to change the product ranking recommendation system.

- __Success Metric__: revenue per day per user
- __Null Hypothesis ($H_0$)__: the average revenue per day per user is the same for the baseline and variant ranking algorithms
- __Alternative Hypothesis ($H_a$)__: the average revenue per day per user differs between the baseline and variant ranking algorithms
- __Significance Level ($\alpha = 0.05$)__: If the p-value is $< \alpha$, reject $H_0$ in favor of $H_a$. $\alpha = P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ is true})$
- __Statistical Power ($1 - \beta = 0.80$)__: The probability of detecting an effect when the alternative hypothesis is true.
- __Minimum Detectable Effect ($MDE = 1\%$ lift)__: If the change is at least 1% higher in revenue per day per user then it is practically significant.

- Set the randomization unit: User
- Target population in the experiment: Visitors who search for a product
- Determine the sample size
$$n \approx \frac{16\sigma^{2}}{\delta^{2}}$$
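Where $n$ is the sample size per group, $\sigma$ is the sample standard deviation of the metric, and $\delta$ is the difference between the control and treatment that the test should detect. (The constant 16 assumes a two-sided test with $\alpha=0.05$ and $power=0.80$.)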
- Duration of the experiment
- Run experiment
  - Set up instruments and data pipelines to collect data
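  - Avoid peeking at p-values before the planned sample size is reached, since repeated checks inflate the false positive rate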
- Validity checks (check for bias)
  - Check for instrumentation effects
  - External factors
  - Selection bias
  - Sample ratio mismatch (Chi-Square Goodness of Fit Test)
  - Novelty effect (Segment by new and old visitors)
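The sample size rule of thumb in the design steps above can be sketched in Python. The constant 16 is a rounding of $2(z_{1-\alpha/2} + z_{1-\beta})^2 \approx 2(1.96 + 0.84)^2 \approx 15.7$, so the sketch computes both versions. The function names and the input values (`sigma = 10`, `delta = 0.25`) are hypothetical and chosen only for illustration.

```python
from math import ceil

from scipy.stats import norm


def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided comparison of two means."""
    z_alpha = norm.ppf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # about 0.84 for power = 0.80
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma**2 / delta**2)


def rule_of_thumb(sigma, delta):
    """Rounded version of the same formula: n is roughly 16 * sigma^2 / delta^2."""
    return ceil(16 * sigma**2 / delta**2)


# Hypothetical inputs: a standard deviation of 10 and a target difference of 0.25
# in revenue per day per user.
sigma, delta = 10.0, 0.25
n_exact = sample_size_per_group(sigma, delta)
n_rule = rule_of_thumb(sigma, delta)
print(f"Exact formula: {n_exact} users per group")
print(f"Rule of thumb: {n_rule} users per group")
```

The rule of thumb is slightly conservative, because 16 is rounded up from about 15.7.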

- Interpret the results
- Launch decision
  - Metric trade-offs
  - Cost of launching
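The sample ratio mismatch check listed under the validity checks can be sketched with a chi-square goodness-of-fit test. The group counts, the planned 50/50 split, and the 0.001 threshold below are assumptions made for illustration, not values from this experiment.

```python
from scipy.stats import chisquare

# Hypothetical number of users actually assigned to each group.
observed = [10_000, 10_200]

# Expected counts under the planned (assumed) 50/50 split.
total = sum(observed)
expected = [total * 0.5, total * 0.5]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)

# Assumed threshold: a strict cutoff keeps false alarms rare.
srm_threshold = 0.001
srm_detected = p_value < srm_threshold

print(f"Chi-square statistic: {stat:.4f}, p-value: {p_value:.4f}")
print("Sample ratio mismatch detected" if srm_detected else "No evidence of sample ratio mismatch")
```

If a mismatch is detected, the assignment or logging pipeline is worth investigating before interpreting the metric results.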


| Group     | Metric Value | Absolute Difference | Relative Difference | P-Value | Confidence Interval |
|-----------|--------------|---------------------|---------------------|---------|---------------------|
| Control   | 25.00        | 1.10                | 4.40 %              | 0.001   | (3.40 %, 5.40 %)    |
| Treatment | 26.10        | 1.10                | 4.40 %              | 0.001   | (3.40 %, 5.40 %)    |

## Statistical Tests

### Discrete Metrics

- Fisher's exact test
- Pearson's chi-squared test

### Continuous Metrics

- Z-test
- Student's t-test
- Welch's t-test
- Mann-Whitney U test
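Most of the tests listed above are available in `scipy.stats`. The following is a minimal sketch of how they might be called; the contingency table and the simulated revenue samples are made-up data, used only to show the function calls.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Discrete metric (hypothetical counts).
# Rows are groups, columns are [clicked, did not click].
table = np.array([[200, 800], [250, 750]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)  # Pearson's chi-squared test
odds_ratio, p_fisher = stats.fisher_exact(table)             # Fisher's exact test

# Continuous metric (simulated samples with unequal variances).
control = rng.normal(25.0, 10.0, size=500)
treatment = rng.normal(26.0, 12.0, size=500)
t_student, p_student = stats.ttest_ind(control, treatment)               # Student's t-test
t_welch, p_welch = stats.ttest_ind(control, treatment, equal_var=False)  # Welch's t-test
u_stat, p_mwu = stats.mannwhitneyu(control, treatment, alternative="two-sided")

print(f"Chi-squared p-value:    {p_chi2:.4f}")
print(f"Fisher exact p-value:   {p_fisher:.4f}")
print(f"Student's t p-value:    {p_student:.4f}")
print(f"Welch's t p-value:      {p_welch:.4f}")
print(f"Mann-Whitney U p-value: {p_mwu:.4f}")
```

Welch's t-test does not assume equal variances between the groups, and the Mann-Whitney U test does not assume normality, which can matter for skewed metrics such as revenue.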

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import norm

np.random.seed(42)

### Simulating click data for A/B testing

In [2]:
N_experiment = 10000
N_control = 10000

click_experiment = pd.Series(np.random.binomial(1, 0.5, size=N_experiment))
click_control = pd.Series(np.random.binomial(1, 0.2, size=N_control))

In [3]:
df = pd.concat(
    [
        pd.DataFrame(
            {
                "Click": click_experiment,
                "Group Label": "Experiment",
            }
        ),
        pd.DataFrame(
            {
                "Click": click_control,
                "Group Label": "Control",
            }
        ),
    ]
).reset_index(drop=True)
df

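|       | Click | Group Label |
|-------|-------|-------------|
| 0     | 0     | Experiment  |
| 1     | 1     | Experiment  |
| 2     | 1     | Experiment  |
| 3     | 1     | Experiment  |
| 4     | 0     | Experiment  |
| ...   | ...   | ...         |
| 19995 | 1     | Control     |
| 19996 | 0     | Control     |
| 19997 | 0     | Control     |
| 19998 | 0     | Control     |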


In [4]:
X_experiment = df.groupby("Group Label")["Click"].sum().loc["Experiment"]
X_control = df.groupby("Group Label")["Click"].sum().loc["Control"]
print(
    f"# Clicks in 'Control' group: {X_control}\n# Clicks in 'Experiment' group: {X_experiment}"
)

# Clicks in 'Control' group: 2033
# Clicks in 'Experiment' group: 4924


In [5]:
# calculating probabilities
p_experiment_hat = X_experiment / N_experiment
p_control_hat = X_control / N_control
print(
    f"Click probability in 'Control' group: {p_control_hat}\nClick probability in 'Experiment' group: {p_experiment_hat}"
)

Click probability in 'Control' group: 0.2033
Click probability in 'Experiment' group: 0.4924


In [6]:
# pooled click probability under Ho (both groups share one click rate)
p_pooled_hat = (X_control + X_experiment) / (N_control + N_experiment)
pooled_variance = (
    p_pooled_hat * (1 - p_pooled_hat) * (1 / N_control + 1 / N_experiment)
)

In [7]:
SE = np.sqrt(pooled_variance)
print(f"Standard Error: {SE}")

Standard Error: 0.00673573125206165


In [8]:
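# z statistic for the difference in click rates (control minus experiment,
# so a negative value means the experiment group clicked more)
test_stat = (p_control_hat - p_experiment_hat) / SE
test_stat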

-42.92035848542996

In [9]:
alpha = 0.05

In [10]:
z_crit = norm.ppf(1 - alpha / 2)
z_crit

1.959963984540054

In [11]:
p_val = 2 * norm.sf(abs(test_stat))
p_val

0.0

In [12]:
if p_val < alpha:
    print("Reject Ho !")
else:
    print("Fail to reject Ho !")

Reject Ho !


In [13]:
CI = [
    round((p_experiment_hat - p_control_hat) - SE * z_crit, 3),
    round((p_experiment_hat - p_control_hat) + SE * z_crit, 3),
]
CI

[0.276, 0.302]
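The steps above can also be collected into one helper. The sketch below is not part of the original notebook: the function name `two_proportion_summary` is hypothetical, the difference is taken as experiment minus control (so the z statistic is positive here), and the interval uses the unpooled standard error. That is a common alternative to the pooled one, because the pooled estimate assumes both groups share one click rate.

```python
import numpy as np
from scipy.stats import norm


def two_proportion_summary(x_control, n_control, x_experiment, n_experiment, alpha=0.05):
    """Two-sided z-test and confidence interval for a difference in proportions."""
    p_c = x_control / n_control
    p_e = x_experiment / n_experiment
    diff = p_e - p_c

    # Pooled standard error for the test, since Ho assumes equal click rates.
    p_pool = (x_control + x_experiment) / (n_control + n_experiment)
    se_pooled = np.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_experiment))
    z = diff / se_pooled
    p_value = 2 * norm.sf(abs(z))

    # Unpooled standard error for the interval (assumption: Wald interval).
    se_unpooled = np.sqrt(p_c * (1 - p_c) / n_control + p_e * (1 - p_e) / n_experiment)
    z_crit = norm.ppf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    return {"diff": diff, "z": z, "p_value": p_value, "ci": ci}


# Click counts taken from the simulated data above.
result = two_proportion_summary(2033, 10_000, 4924, 10_000)
print(result)
```

With samples this large and an effect this strong, the pooled and unpooled intervals are nearly identical; the choice matters more for small samples or small effects.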