# Set up the experiment.
Run the test and record the success rate for each group and then
plot the distribution of the difference between the two samples.
Calculate statistics to add insights.
Evaluate impact of sample size as relates to A/B tests.

The goal of running an A/B test is to evaluate if a change in a marketing page will lead to improved performance in a specific metric or it could aid in... lets say we want to look at the steps required in signing up a new user or processing the sale on an online marketplace

For this example let's make up a current conversion rate that is the current rate at which we sign up new users under the existing marketing page

# Goal
2% increase in our sign up rate, after we change out marketing page. We currently sign up 5 out of 50 users who are offered a enterprise account.

In [38]:
crnt = 0.12  # current conversion rate
gl = 0.02  # difference we seek between the groups


# Test Group
 Users participating in the A/B are a small percentage of the total users. Users are randomly selected and assigned to either a control group or a test group. Sample size/type[of event] will dicate how long until data is collected to analyze and will impact how you prioritize improving other metrics. For this example, we collected 5,000 users

In [39]:
c = 5000 # A is the control Group
t = 5000 # B is the test Group

# Run the A/B
TestConsidering this is a demonstration of an approach of mine to A/B testing , I found a function to generate hpyothetical data for us courtesy of https://github.com/mnguyenngo/ab-framework/blob/master/src/data.py

In [40]:
import scipy.stats as scs
import pandas as pd
import numpy as np


def generate_data(c, t, p_A, p_B, days=None, control_label='A',
                  test_label='B'):
    """Returns a pandas dataframe with fake CTR data
    Example:
    Parameters:
        A (int): sample size for control group
        B (int): sample size for test group
            Note: final sample size may not match N_A provided because the
            group at each row is chosen at random (50/50).
        p_A (float): conversion rate; conversion rate of control group
        p_B (float): conversion rate; conversion rate of test group
        days (int): optional; if provided, a column for 'ts' will be included
            to divide the data in chunks of time
            Note: overflow data will be included in an extra day
        control_label (str)
        test_label (str)
    Returns:
        df (df)
    """

    # initiate empty container
    data = []

    # total amount of rows in the data
    N = A + B

    group_bern = scs.bernoulli(0.5)

    # initiate bernoulli distributions to randomly sample from
    A_bern = scs.bernoulli(p_A)
    B_bern = scs.bernoulli(p_B)

    for idx in range(N):
        # initite empty row
        row = {}
        # for 'ts' column
        if days is not None:
            if type(days) == int:
                row['ts'] = idx // (N // days)
            else:
                raise ValueError("Provide an integer for the days parameter.")
        # assign group based on 50/50 probability
        row['group'] = group_bern.rvs()

        if row['group'] == 0:
            # assign conversion based on provided parameters
            row['converted'] = A_bern.rvs()
        else:
            row['converted'] = B_bern.rvs()
        # collect row into data container
        data.append(row)

    # convert data into pandas dataframe
    df = pd.DataFrame(data)

    # transform group labels of 0s and 1s to user-defined group labels
    df['group'] = df['group'].apply(
        lambda x: control_label if x == 0 else test_label)

    return df


In [41]:
data = generate_data(c, t, crnt, gl)


View Sample Data Table that we genereated below, 'converted column' indicates whether a user opted for the enterprise service or not with a 1 or 0, respectively. The A group will be used for our control group & B group will be our test group.

In [43]:
data.head(5)

Unnamed: 0,converted,group
0,0,A
1,0,B
2,0,B
3,0,A
4,0,B


Pandas pivot to view that data better

In [44]:
pivot = data.pivot_table(values='converted', index='group', aggfunc=np.sum)
#create a total column to the pivot and a rate 
pivot['total'] = data.pivot_table(values='converted', index='group', aggfunc=lambda x: len(x))
pivot['rate'] = data.pivot_table(values='converted', index='group')


In [45]:
pivot

Unnamed: 0_level_0,converted,total,rate
group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A,129,984,0.131098
B,21,1016,0.020669


# Compare
Looking at the conversion rate of group A less group B, we see a difference of .111. Let's use vizualize this data to get a closer look. 

https://towardsdatascience.com/the-math-behind-a-b-testing-with-example-code-part-1-of-2-7be752e1d06f - look into this for viz ideas