In [7]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.stats.api as sms
from math import ceil

**Reference**: https://towardsdatascience.com/ab-testing-with-python-e5964dd66143

**Data**: Data contains results of an A/B test on what seems to be 2 different designs of a website page (old_page vs new_page).
https://www.kaggle.com/zhangluyuan/ab-testing?select=ab_data.csv

**General Steps to tackle A/B Testing** (see OneNote Studying Cheat Sheet):
- Define a success metric
- State the hypothesis (i.e., `Ho` = no difference between control and treatment vs `Ha` = a difference)
- Define the statistical parameters (e.g., alpha, power, MDE)
- Choose the randomization unit (i.e., how to split users into groups)
- Determine the sample size
- Estimate the experimentation time
- Perform a test for statistical significance (e.g., a two-sample t-test to compare two sample means, or a z-test to compare two proportions)
- Make final business decision to launch the new feature or not

**Scenario Background**: You work on the product team at a medium-sized online e-commerce business. The UX designer worked really hard on a new version of the product page, with the hope that it will lead to a higher conversion rate. The PM told you that the current conversion rate is about 13% on average throughout the year, and that the team would be happy with an increase of 2 percentage points. In other words, the new design will be considered a success if it raises the conversion rate to 15%.

### 1. Define a Success Metric

Based on the scenario, our success metric has already been established in talks with the UX designer and the PM. The success metric will be **conversion rate**, and the new design counts as a success if it raises the conversion rate to 15%.

It will be calculated based on every user session with a binary variable, where `0` = user did not buy the product during this user session and `1` = user bought the product during this user session.
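
As a minimal sketch of how this metric is computed: if each session is recorded as a 0/1 outcome, the conversion rate is just the mean of those outcomes. The `sessions` list below is made-up illustrative data, not taken from the Kaggle file.

```python
from math import sqrt

# Hypothetical outcomes for 10 user sessions (1 = purchase, 0 = no purchase)
sessions = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

n_sessions = len(sessions)
conversion_rate = sum(sessions) / n_sessions

# Standard error of a sample proportion: sqrt(p * (1 - p) / n)
std_error = sqrt(conversion_rate * (1 - conversion_rate) / n_sessions)

print(f"conversion rate = {conversion_rate:.2f}, standard error = {std_error:.3f}")
```

With only 10 sessions the standard error is large relative to the rate itself, which is one reason the sample size question matters.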

### 2. Formulate a Hypothesis

The null hypothesis is that there is no difference in performance between the old website (control = c) and new website (treatment = t). The alternative hypothesis is that there is a difference in performance between the two.

`Ho: mu_c = mu_t`

`Ha: mu_c != mu_t`

### 3. Define Statistical Parameters

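Let's set a confidence level of 95%. This means that our significance level is `α = 0.05` (industry standard), i.e., the probability of a Type I error (rejecting `Ho` when it is actually true) that we are willing to accept. We will also set `power = 0.80` (also industry standard). Recall that power is `1-β`, where `β` is the probability of a Type II error (failing to reject `Ho` when a real difference exists).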
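
To make these two parameters concrete, the sketch below maps them to standard normal quantiles using only the Python standard library. It assumes a two-sided test, consistent with `Ha` above; the variable names are ours.

```python
from statistics import NormalDist

alpha = 0.05   # significance level: accepted Type I error rate
power = 0.80   # 1 - beta, where beta is the Type II error rate

std_normal = NormalDist()  # mean 0, standard deviation 1

# Two-sided test: alpha is split evenly across both tails
z_alpha = std_normal.inv_cdf(1 - alpha / 2)
# Quantile corresponding to the desired power
z_power = std_normal.inv_cdf(power)

print(f"z_alpha = {z_alpha:.3f}, z_power = {z_power:.3f}")
```

These are the familiar values of roughly 1.96 and 0.84 that show up in textbook sample size formulas.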

### 4. Sample Randomization

We need two groups:

1. A control group: they'll be shown the old design
2. A treatment group: they'll be shown the new design

Having a control group lets us compare its results directly with the treatment group's. If the only systematic difference between the two groups is the design of the product page (i.e., what we're evaluating the impact of), we can attribute any difference in results to the design.

*How do we form the groups?*
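
One common answer is to assign each user to a group at random, so that the groups do not differ systematically in anything except the page they see. Below is a minimal sketch of deterministic, hash-based assignment. It assumes each user has a stable ID; `assign_group`, the salt string, and the user IDs are hypothetical names for illustration, not part of the dataset.

```python
import hashlib

def assign_group(user_id: str, salt: str = "product_page_test") -> str:
    """Deterministically map a user ID to 'control' or 'treatment' (about 50/50)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Treat the hash as a large integer: even -> control, odd -> treatment
    return "control" if int(digest, 16) % 2 == 0 else "treatment"

# Hypothetical user IDs, for illustration only
groups = [assign_group(f"user_{i}") for i in range(10_000)]
share_treatment = groups.count("treatment") / len(groups)

print(f"treatment share: {share_treatment:.3f}")
```

Hashing the ID rather than drawing a fresh random number means a returning user always lands in the same group, which keeps their experience consistent across sessions.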

### 5. Determine Sample Size

It's important to remember that since we can't test our *whole* user base, the conversion rates that we'll get will just be *estimates* of the true rates.

The larger the sample size (i.e., number of user sessions we capture in each group), the more precise our estimates and the higher the chance to detect a difference in the two groups, if present. However, the larger the sample, the more expensive the study becomes.

We can perform a **power analysis** to determine our necessary sample size. It requires the following inputs (see statistical parameters above):

- Power `(1 - β)`: This represents the probability of finding a statistically significant difference between the groups in our test when a difference is actually present. This is usually set at 0.8 by convention.
- Alpha value `(α)`: The significance level we set earlier; 0.05 is the typical industry standard.
- Effect size: How big a difference we expect there to be between the two groups in the success metric (i.e., conversion rate in our case).

In [2]:
# First, calculate the effect size (statsmodels library).
# prop1 = 0.13, since this is the original conversion rate
# prop2 = 0.15, since this is the rate we're hoping for

effect_size = sms.proportion_effectsize(0.13, 0.15)
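
For context, the effect size for two proportions is commonly expressed as Cohen's h, the difference between the arcsine-transformed proportions. As far as we can tell this is the quantity `proportion_effectsize` returns, but treat that as an assumption and check the statsmodels documentation. A hand-rolled sketch (the helper name `cohens_h` is ours):

```python
from math import asin, sqrt

def cohens_h(prop1: float, prop2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * asin(sqrt(prop1)) - 2 * asin(sqrt(prop2))

h = cohens_h(0.13, 0.15)
print(f"Cohen's h = {h:.4f}")
```

The value comes out slightly negative (about -0.058) because `prop1` is smaller than `prop2`; in a two-sided power calculation only its magnitude matters.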

In [16]:
#now we can use formula for our required sample size
#NormalIndPower() is used for power calculations for 2 independent samples

required_n = sms.NormalIndPower().solve_power(effect_size = effect_size,
                                             power = 0.8,
                                             alpha = 0.05,
                                             ratio = 1 #for even samples
                                             )

#round up
required_n = ceil(required_n)

print(required_n)

4720


The results of our power analysis indicate that we'll need at least 4720 observations in each group.
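
As a sanity check, the same figure can be approximated with the textbook closed-form expression for two equally sized groups, `n = 2 * (z_alpha + z_power)^2 / h^2`, where `h` is Cohen's h. This is a sketch under the normal approximation using only the standard library, not a reimplementation of what `NormalIndPower` does internally.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist

alpha, power = 0.05, 0.80
p_control, p_target = 0.13, 0.15

# Effect size (Cohen's h) for the two proportions
h = 2 * asin(sqrt(p_control)) - 2 * asin(sqrt(p_target))

# Standard normal quantiles for a two-sided test and the desired power
z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
z_power = NormalDist().inv_cdf(power)

# Required observations per group, assuming equal group sizes
n_per_group = 2 * (z_alpha + z_power) ** 2 / h ** 2

print(ceil(n_per_group))
```

The result should land at, or within a few observations of, the value reported by statsmodels.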

### 6. Experimentation Time + Collecting & Preparing the Data

Typically after establishing the required sample size, you would work with your team to set up the experiment and start collecting the data over a pre-determined amount of time (e.g., run the experiment over the course of 2 weeks).
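
A rough way to estimate the run time is to divide the total required sample by the expected daily traffic. The sketch below uses a made-up figure of 1,000 eligible sessions per day, which is purely an assumption for illustration; the scenario does not give a traffic number.

```python
from math import ceil

required_n = 4720        # per-group sample size from the power analysis
n_groups = 2             # control + treatment
daily_sessions = 1_000   # ASSUMPTION: hypothetical eligible sessions per day

# Days needed to collect the full sample across both groups
days_needed = ceil(required_n * n_groups / daily_sessions)

# Optionally round up to whole weeks so every weekday is covered equally
weeks_needed = ceil(days_needed / 7)

print(f"{days_needed} days, i.e. about {weeks_needed} full week(s)")
```

Rounding up to full weeks is a common precaution, since conversion behavior often differs between weekdays and weekends.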

We'll simulate this and assume that the Kaggle data represents the data resulting from our experiment.

In [None]:
#import data
