Data cleaning and analysis for A/B testing

Hypotheses
H0: null hypothesis (the new design does not change revenue; any difference observed in the treatment group is due to chance)
H1: alternative hypothesis (the new design changes revenue; ideally the new membership model is going to make us lots of money)

control group: old design
treatment group: new design

dependent variable: revenue per user


# choose sample size?
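
One common way to approach the sample-size question is a power calculation for a two-sided, two-sample comparison under a normal approximation. The sketch below is illustrative only: the effect size (Cohen's d = 0.2), the significance level, and the power are assumed values, not figures from this experiment, and the helper name `required_n_per_group` is hypothetical. For heavily skewed revenue data the normal approximation is rough, so treat the result as a starting point.

```python
import math
from scipy.stats import norm


def required_n_per_group(effect_size, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided two-sample z-test with equal group sizes."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = norm.ppf(power)          # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Assumed inputs: standardized effect size of 0.2, alpha = 0.05, power = 0.80
n = required_n_per_group(effect_size=0.2)
print("Users needed per group:", n)
```

Smaller assumed effects drive the required sample size up quickly, since it scales with the inverse square of the effect size.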


In [1]:
# Package imports
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns


import statsmodels.stats.api as sm
from scipy.stats import shapiro, mannwhitneyu, anderson, kstest


In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("sergylog/ab-test-data")


print("Path to dataset files:", path)

Path to dataset files: /Users/issrar/.cache/kagglehub/datasets/sergylog/ab-test-data/versions/3


In [3]:
df = pd.read_csv(path + "/AB_Test_Results.csv").copy()

df.head()


,USER_ID,VARIANT_NAME,REVENUE
0,737,variant,0.0
1,2423,control,0.0
2,9411,control,0.0
3,7311,control,0.0
4,6174,variant,0.0


In [4]:
def summary(df):
    print(f'data shape: {df.shape}')

    # Get data types
    dtypes_df = pd.DataFrame(df.dtypes, columns=['data type'])

    # Count missing values
    missing_df = pd.DataFrame(df.isnull().sum(), columns=['number of missing values'])
    missing_df['percent missing'] = (df.isnull().mean() * 100).round(2)
    
    # Count unique values
    unique_df = pd.DataFrame(df.nunique(), columns=['number of unique values'])

    # Describe statistics
    desc_df = df.describe().T
    stats_df = desc_df[['mean', 'std', 'min', 'max']].copy()
    stats_df['mean'] = stats_df['mean'].round(2)
    stats_df['std'] = stats_df['std'].round(2)
    stats_df['min'] = stats_df['min'].round(2)
    stats_df['max'] = stats_df['max'].round(2)

    # Count duplicate rows
    duplicate_rows = df.duplicated().sum()
 
    # Concatenate all dataframes
    summary_df = pd.concat([dtypes_df, missing_df, unique_df, stats_df], axis=1)
    summary_df.loc['number of duplicate rows'] = duplicate_rows
    summary_df = summary_df.fillna('N/A')
    summary_df = summary_df.reset_index().rename(columns={'index': 'column name'})
    summary_df = summary_df.set_index('column name')

    styled_summary = summary_df.style.background_gradient(cmap='GnBu')
    return styled_summary
   

summary(df)


data shape: (10000, 3)


column name,data type,number of missing values,percent missing,number of unique values,mean,std,min,max
USER_ID,int64,0,0.0,6324,4981.08,2890.59,2.0,10000.0
VARIANT_NAME,object,0,0.0,2,,,,
REVENUE,float64,0,0.0,101,0.1,2.32,0.0,196.01
number of duplicate rows,2067,2067,2067.0,2067,2067.0,2067.0,2067.0,2067.0
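
The summary shows 6324 unique USER_ID values across 10000 rows, so many users appear more than once. One possible cleaning step, which this notebook does not perform, is to check whether any user was exposed to both groups and to aggregate revenue to one row per user. The sketch below uses a small made-up DataFrame with the same column names; whether to drop mixed users is a judgment call, and it is shown here only as an assumption.

```python
import pandas as pd

# Toy data with the same column names as the dataset (values are made up)
toy = pd.DataFrame({
    "USER_ID": [1, 1, 2, 3, 3, 4],
    "VARIANT_NAME": ["control", "variant", "control", "variant", "variant", "control"],
    "REVENUE": [0.0, 1.5, 0.0, 2.0, 0.0, 0.0],
})

# Number of distinct variants each user was exposed to
variants_per_user = toy.groupby("USER_ID")["VARIANT_NAME"].nunique()
mixed_users = variants_per_user[variants_per_user > 1].index

# Keep only users seen in a single variant, then sum revenue per user
clean = toy[~toy["USER_ID"].isin(mixed_users)]
per_user = clean.groupby(["USER_ID", "VARIANT_NAME"], as_index=False)["REVENUE"].sum()

print("Users in both groups:", list(mixed_users))
print(per_user)
```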


# Analysis 

In [5]:
# Variable definition 

control = df.loc[df['VARIANT_NAME'] == 'control']
variant = df.loc[df['VARIANT_NAME'] == 'variant']

print(control.shape)
print(variant.shape)

(4984, 3)
(5016, 3)


# Hypotheses 

H0: There is no significant difference in revenue brought by users between the control group (existing membership model) and the variant group (new membership model) in the A/B test.

H1: There is a significant difference in revenue brought by users between the control group (existing membership model) and the variant group (new membership model) in the A/B test, indicating that the new membership model leads to either an increase or a decrease in revenue compared to the existing model.

In [6]:
print('Control group mean revenue:', control['REVENUE'].mean())
print('Variant group mean revenue:', variant['REVENUE'].mean())

Control group mean revenue: 0.12901284109149277
Variant group mean revenue: 0.07006977671451356


# Normality check (Shapiro-Wilk test)
We can use the Shapiro-Wilk test to check for normality. This test assesses whether a sample comes from a normally distributed population.

H0: null hypothesis; the data follows a normal distribution
H1: alternative hypothesis; the data does not follow a normal distribution

In [7]:
# Perform the Shapiro-Wilk test for normality

shapiro_control = shapiro(control['REVENUE'])
shapiro_variant = shapiro(variant['REVENUE'])
print('Shapiro-Wilk test for control group - p-value:', shapiro_control.pvalue) 
print('Shapiro-Wilk test for variant group - p-value:', shapiro_variant.pvalue)

Shapiro-Wilk test for control group - p-value: 2.4983285100057203e-95
Shapiro-Wilk test for variant group - p-value: 2.7310104028899288e-95




SciPy warns that the Shapiro-Wilk p-value may not be accurate for samples larger than 5000 observations. The variant group (5016 rows) exceeds that limit and the control group (4984 rows) sits just below it, so these p-values may not be reliable and we must consider alternatives.

# The Anderson-Darling test
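For large sample sizes, the Anderson-Darling test is an alternative way to assess normality. It returns a test statistic together with a set of critical values, one per significance level; if the statistic exceeds a critical value, normality is rejected at that level. The further the statistic lies above the critical values, the stronger the evidence against the null hypothesis.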

In [8]:
# Perform the Anderson-Darling test for normality
anderson_control = anderson(control['REVENUE'])
anderson_variant = anderson(variant['REVENUE'])

# Extract critical values and statistic from the test results 
anderson_control_stat = anderson_control.statistic
anderson_variant_stat = anderson_variant.statistic
anderson_control_critical = anderson_control.critical_values
anderson_variant_critical = anderson_variant.critical_values

# Print the critical values and statistic
print('Anderson-Darling test for control group - statistic:', anderson_control_stat)
print('Anderson-Darling test for control group - critical values:', anderson_control_critical)
print('Anderson-Darling test for variant group - statistic:', anderson_variant_stat)
print('Anderson-Darling test for variant group - critical values:', anderson_variant_critical)


Anderson-Darling test for control group - statistic: 1855.221244001521
Anderson-Darling test for control group - critical values: [0.576 0.655 0.786 0.917 1.091]
Anderson-Darling test for variant group - statistic: 1865.809035320146
Anderson-Darling test for variant group - critical values: [0.576 0.655 0.786 0.917 1.091]
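
The comparison between the statistic and each critical value can be made explicit. The sketch below runs on synthetic exponential data rather than the revenue column, so its numbers are illustrative only; it relies on the `significance_level` attribute that SciPy returns alongside `critical_values`.

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=1000)  # synthetic, clearly non-normal data

result = anderson(sample, dist="norm")

# For dist="norm", SciPy reports critical values at the 15%, 10%, 5%, 2.5% and 1% levels
rejected = {}
for level, critical in zip(result.significance_level, result.critical_values):
    rejected[float(level)] = bool(result.statistic > critical)
    print(f"{level}% level: critical value {critical:.3f}, "
          f"reject normality: {rejected[float(level)]}")
```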


Both statistics (about 1855 and 1866) are far above the largest critical value (1.091, the 1% level), so our data deviates significantly from the expected bell-curve pattern.

We need to approach the A/B test with caution. Since the normality assumption is not met, t-tests and ANOVA may not be appropriate for comparing means between groups.

We should consider non-parametric tests or transformations to handle the non-normality.

# Kolmogorov-Smirnov test

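Another test for assessing the goodness of fit between the empirical distribution function of the data and a normal distribution. Note that `kstest(x, 'norm')` compares the data against the standard normal distribution (mean 0, standard deviation 1) unless a location and scale are passed via `args`.

KS statistic (D-value): a larger KS statistic indicates a greater deviation from normality.
P-value: a low p-value suggests strong evidence to reject the null hypothesis.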

In [9]:
# Perform Kolmogorov-Smirnov test for normality
ks_control = kstest(control['REVENUE'], 'norm')
ks_variant = kstest(variant['REVENUE'], 'norm')
print('Kolmogorov-Smirnov test for control group - statistic:', ks_control.statistic)
print('Kolmogorov-Smirnov test for control group - p-value:', ks_control.pvalue)
print('Kolmogorov-Smirnov test for variant group - statistic:', ks_variant.statistic)
print('Kolmogorov-Smirnov test for variant group - p-value:', ks_variant.pvalue)

Kolmogorov-Smirnov test for control group - statistic: 0.5
Kolmogorov-Smirnov test for control group - p-value: 0.0
Kolmogorov-Smirnov test for variant group - statistic: 0.5
Kolmogorov-Smirnov test for variant group - p-value: 0.0
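
Because `kstest(..., 'norm')` tests against the standard normal, it rejects any data whose mean and spread differ from 0 and 1, even if the data is normal. A hedged sketch of the difference, using synthetic data rather than the revenue column: passing the sample mean and standard deviation through `args` tests against a normal with matching location and scale. The p-value is then only approximate (it tends to be conservative), because the parameters are estimated from the same data.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=500)  # synthetic normal data, not N(0, 1)

# Default: compared against the standard normal, so even normal data is rejected
ks_standard = kstest(sample, "norm")

# Compared against a normal with the sample's own mean and standard deviation
ks_fitted = kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1)))

print("standard normal:", ks_standard.statistic, ks_standard.pvalue)
print("fitted normal:  ", ks_fitted.statistic, ks_fitted.pvalue)
```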


This reinforces the previous normality test results: neither the control nor the variant group follows a normal distribution.
Non-parametric tests should be considered to ensure accurate conclusions.

# Mann-Whitney U test
wikipedia: "since the Mann–Whitney U test is an ordinal test, medians are usually recommended"

In [10]:
# Perform Mann-Whitney U test
mannwhitney = mannwhitneyu(control['REVENUE'], variant['REVENUE'])
print('Mann-Whitney U test - statistic:', mannwhitney.statistic)
print('Mann-Whitney U test - p-value:', mannwhitney.pvalue)

Mann-Whitney U test - statistic: 12521564.0
Mann-Whitney U test - p-value: 0.47825247965294926
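
The p-value alone says nothing about the size of the difference. As a rough sketch, an effect size can be derived from the numbers already printed in this notebook. It assumes that the reported statistic is the U value for the first sample passed to `mannwhitneyu` (the control group), which is how recent SciPy versions define it.

```python
# Values reported in the notebook outputs
u_statistic = 12521564.0  # Mann-Whitney U, assumed to refer to the control sample
n_control = 4984
n_variant = 5016

# Common-language effect size: P(control > variant) + 0.5 * P(tie)
cles = u_statistic / (n_control * n_variant)

# Rank-biserial correlation, from -1 to 1 (0 means no tendency either way)
rank_biserial = 2 * cles - 1

print("Common-language effect size:", round(cles, 4))
print("Rank-biserial correlation:", round(rank_biserial, 4))
```

A common-language effect size close to 0.5 means that a randomly chosen control user is about as likely to out-spend a randomly chosen variant user as the reverse.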


p-value (0.478) > 0.05, so we don't find a statistically significant difference between the *medians* of the control and variant groups in terms of revenue brought by users.


=> the modifications made in the variant group did not lead to a statistically significant change in revenue per user compared to the control group during the A/B test.