# A/B Testing and Causal Inference

Dataset: https://www.kaggle.com/datasets/ahmedmohameddawoud/ecommerce-ab-testing

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

## Load and Inspect Data

In [2]:
abtest_df = pd.read_csv("Downloads/E-commerce AB testing - Kaggle/ab_test.csv",
                          header=0)

countries_df = pd.read_csv("Downloads/E-commerce AB testing - Kaggle/countries_ab.csv",
                          header=0)

In [3]:
abtest_df.head()

|   | id | time | con_treat | page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 11:48.6 | control | old_page | 0 |
| 1 | 804228 | 01:45.2 | control | old_page | 0 |
| 2 | 661590 | 55:06.2 | treatment | new_page | 0 |
| 3 | 853541 | 28:03.1 | treatment | new_page | 0 |
| 4 | 864975 | 52:26.2 | control | old_page | 1 |


In [4]:
abtest_df.shape

(294478, 5)

In [7]:
abtest_df.dtypes

id            int64
time         object
con_treat    object
page         object
converted     int64
dtype: object

In [5]:
abtest_df.describe()

|   | id | converted |
|---|---|---|
| count | 294478.0 | 294478.0 |
| mean | 787974.124733 | 0.119659 |
| std | 91210.823776 | 0.324563 |
| min | 630000.0 | 0.0 |
| 25% | 709032.25 | 0.0 |
| 50% | 787933.5 | 0.0 |
| 75% | 866911.75 | 0.0 |
| max | 945999.0 | 1.0 |


In [6]:
abtest_df.describe(include="object")

|   | time | con_treat | page |
|---|---|---|---|
| count | 294478 | 294478 | 294478 |
| unique | 35993 | 2 | 2 |
| top | 36:26.7 | treatment | old_page |
| freq | 22 | 147276 | 147239 |


The `time` variable is stored as a string and needs to be converted to a numeric value representing time spent on the page.

In [8]:
abtest_df["id"].nunique()

290584

There are 294,478 rows but only 290,584 unique IDs, so the duplicate IDs need to be investigated.
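One way to begin that investigation is to pull out every row whose `id` occurs more than once and look at the rows side by side. The sketch below runs on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from the dataset) that only mimics the column names of `abtest_df`.

```python
import pandas as pd

# Toy frame mimicking abtest_df's columns (values are illustrative only).
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "con_treat": ["control", "treatment", "control", "treatment"],
    "page": ["old_page", "new_page", "new_page", "new_page"],
    "converted": [0, 0, 0, 1],
})

# keep=False flags every row of a repeated id, not only the later repeats.
dup_rows = toy[toy.duplicated(subset="id", keep=False)].sort_values("id")
n_dup_ids = dup_rows["id"].nunique()

print(dup_rows)
print(f"{n_dup_ids} id(s) appear more than once")
```

Note that `DataFrame.duplicated()` without `subset` compares whole rows, which is why it can report zero duplicates even when IDs repeat.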

In [11]:
abtest_df.duplicated().sum()

0

In [12]:
abtest_df.isna().sum()

id           0
time         0
con_treat    0
page         0
converted    0
dtype: int64

In [13]:
countries_df.head()

|   | id | country |
|---|---|---|
| 0 | 834778 | UK |
| 1 | 928468 | US |
| 2 | 822059 | UK |
| 3 | 711597 | UK |
| 4 | 710616 | UK |


In [14]:
countries_df.shape

(290584, 2)

`countries_df` has 290,584 rows, the same as the number of unique IDs in `abtest_df`, so the two dataframes can be merged on `id`.
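A merge like this can silently duplicate or drop rows, so it may be worth asking pandas to check the key relationship. The following is a minimal sketch on made-up frames (`left` and `right` are hypothetical stand-ins for `abtest_df` and `countries_df`); it uses a left join with `indicator=True` purely to expose unmatched rows, which differs from the inner join used in the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for abtest_df (left) and countries_df (right).
left = pd.DataFrame({"id": [1, 2, 2, 3], "converted": [0, 1, 0, 0]})
right = pd.DataFrame({"id": [1, 2, 3], "country": ["US", "UK", "CA"]})

# validate="many_to_one" raises MergeError if an id repeats in `right`;
# indicator=True adds a _merge column recording where each row was found.
merged = pd.merge(left, right, how="left", on="id",
                  validate="many_to_one", indicator=True)

unmatched = int((merged["_merge"] != "both").sum())
merged = merged.drop(columns="_merge")

print(merged.shape, unmatched)
```

If `unmatched` is zero and the row count equals that of the left frame, an inner join on the same key returns the same rows.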

In [18]:
abtest_df = pd.merge(abtest_df, countries_df, how = "inner", on = "id")

In [19]:
abtest_df.head()

|   | id | time | con_treat | page | converted | country |
|---|---|---|---|---|---|---|
| 0 | 851104 | 11:48.6 | control | old_page | 0 | US |
| 1 | 804228 | 01:45.2 | control | old_page | 0 | US |
| 2 | 661590 | 55:06.2 | treatment | new_page | 0 | US |
| 3 | 853541 | 28:03.1 | treatment | new_page | 0 | US |
| 4 | 864975 | 52:26.2 | control | old_page | 1 | US |


In [22]:
abtest_df["country"].value_counts()

country
US    206364
UK     73419
CA     14695
Name: count, dtype: int64

In [23]:
abtest_df["converted"].value_counts()

converted
0    259241
1     35237
Name: count, dtype: int64

## Data Cleaning and Preprocessing

In [28]:
(abtest_df["id"].value_counts() > 1).sum()

3894

In [41]:
abtest_df[abtest_df["id"] == 783176]

|   | id | time | con_treat | page | converted | country |
|---|---|---|---|---|---|---|
| 22271 | 783176 | 21:20.2 | treatment | new_page | 0 | US |
| 35716 | 783176 | 41:30.9 | control | new_page | 0 | US |


In [36]:
abtest_df["id"].value_counts() > 1

id
805339     True
754884     True
722274     True
783176     True
898232     True
          ...  
642985    False
771499    False
923606    False
712675    False
715931    False
Name: count, Length: 290584, dtype: bool

### Misaligned rows

In [42]:
pd.crosstab(abtest_df['con_treat'], abtest_df['page'])

| con_treat (rows) / page (columns) | new_page | old_page |
|---|---|---|
| control | 1928 | 145274 |
| treatment | 145311 | 1965 |


In [45]:
misaligned = abtest_df[
    ((abtest_df['con_treat'] == 'control') & (abtest_df['page'] == 'new_page')) |
    ((abtest_df['con_treat'] == 'treatment') & (abtest_df['page'] == 'old_page'))
]

misaligned.shape

(3893, 6)

In [48]:
abtest_df = abtest_df.drop(misaligned.index, axis=0)

In [49]:
abtest_df["id"].value_counts()

id
773192    2
851104    1
688307    1
718297    1
838144    1
         ..
755610    1
804629    1
837875    1
889019    1
715931    1
Name: count, Length: 290584, dtype: int64

In [59]:
abtest_df = abtest_df.drop(abtest_df[(abtest_df["id"] == 773192)].index, axis=0).reset_index(drop=True)

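After dropping the misaligned rows, only one ID (773192) was still duplicated, and both of its rows were removed. All IDs are now unique and there is no mismatch between the variables `con_treat` and `page`, so `page` carries no extra information and can be dropped.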
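The two cleaning steps above can also be expressed as a mapping from group to expected page followed by a de-duplication. This is a sketch on a made-up frame (`toy` is hypothetical), and it keeps the first row of each repeated ID, which is a different choice from the notebook, where both rows of the remaining duplicate were dropped.

```python
import pandas as pd

# Hypothetical rows: id 2 has a misaligned record, id 3 is an aligned duplicate.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3, 3],
    "con_treat": ["control", "treatment", "control", "treatment", "treatment"],
    "page": ["old_page", "new_page", "new_page", "new_page", "new_page"],
})

# Each group should only ever see one page.
expected_page = toy["con_treat"].map(
    {"control": "old_page", "treatment": "new_page"}
)
aligned = toy[toy["page"] == expected_page]

# Assumption: keep the first record per id instead of discarding the user.
deduped = aligned.drop_duplicates(subset="id", keep="first").reset_index(drop=True)

print(deduped)
```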

In [61]:
abtest_df = abtest_df.drop("page", axis=1)

### Fixing Data Types

First, I will convert the `time` variable from a `minutes:seconds` string to a numeric column representing time spent on the site in seconds.

In [78]:
abtest_df[["minutes", "seconds"]] = abtest_df["time"].str.split(":", expand=True)

In [91]:
# Both parts are strings after the split, so cast before doing arithmetic.
abtest_df["time_spent_seconds"] = (
    abtest_df["minutes"].astype(int) * 60
    + abtest_df["seconds"].astype(float).round().astype(int)
)

In [93]:
abtest_df.head()

|   | id | time | con_treat | converted | country | minutes | seconds | time_spent_seconds |
|---|---|---|---|---|---|---|---|---|
| 0 | 851104 | 11:48.6 | control | 0 | US | 11 | 48.6 | 709 |
| 1 | 804228 | 01:45.2 | control | 0 | US | 1 | 45.2 | 105 |
| 2 | 661590 | 55:06.2 | treatment | 0 | US | 55 | 6.2 | 3306 |
| 3 | 853541 | 28:03.1 | treatment | 0 | US | 28 | 3.1 | 1683 |
| 4 | 864975 | 52:26.2 | control | 1 | US | 52 | 26.2 | 3146 |
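The conversion above can be wrapped in a small helper that avoids the temporary `minutes` and `seconds` columns. `mmss_to_seconds` is a hypothetical name introduced here for illustration; the sample strings are taken from the first rows of the table.

```python
import pandas as pd

def mmss_to_seconds(col: pd.Series) -> pd.Series:
    """Convert 'MM:SS.s' strings to whole seconds (rounded to nearest)."""
    parts = col.str.split(":", expand=True)
    minutes = parts[0].astype(int)
    seconds = parts[1].astype(float)
    return (minutes * 60 + seconds).round().astype(int)

sample = pd.Series(["11:48.6", "01:45.2", "55:06.2"])
result = mmss_to_seconds(sample)
print(result.tolist())
```

On these three inputs the helper returns 709, 105 and 3306, matching the `time_spent_seconds` column shown above.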


In [96]:
abtest_df = abtest_df.drop(["minutes", "seconds", "time"], axis=1)

In [97]:
abtest_df.head()

|   | id | con_treat | converted | country | time_spent_seconds |
|---|---|---|---|---|---|
| 0 | 851104 | control | 0 | US | 709 |
| 1 | 804228 | control | 0 | US | 105 |
| 2 | 661590 | treatment | 0 | US | 3306 |
| 3 | 853541 | treatment | 0 | US | 1683 |
| 4 | 864975 | control | 1 | US | 3146 |


I will change the con_treat variable to be numeric.

In [100]:
# 1 for the treatment group, 0 for control
abtest_df["treatment"] = (abtest_df["con_treat"] == "treatment").astype(int)

In [102]:
abtest_df = abtest_df.drop("con_treat", axis=1)

I will one-hot encode the country variable.

In [110]:
abtest_df = pd.get_dummies(abtest_df, columns = ["country"], drop_first=True, dtype=int)

In [166]:
abtest_df.head()

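|   | id | converted | time_spent_seconds | treatment | country_UK | country_US |
|---|---|---|---|---|---|---|
| 0 | 851104 | 0 | 709 | 0 | 0 | 1 |
| 1 | 804228 | 0 | 105 | 0 | 0 | 1 |
| 2 | 661590 | 0 | 3306 | 1 | 0 | 1 |
| 3 | 853541 | 0 | 1683 | 1 | 0 | 1 |
| 4 | 864975 | 1 | 3146 | 0 | 0 | 1 |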


In [114]:
abtest_df.shape

(290583, 6)

## Sample Ratio Mismatch (SRM)

In [128]:
treatment_counts = abtest_df.treatment.value_counts().sort_index()
expected = np.repeat(len(abtest_df) / 2, 2)

stats.chisquare(f_obs=treatment_counts, f_exp=expected)


Power_divergenceResult(statistic=0.0042156629947381645, pvalue=0.9482312176431853)

Since the p-value is very high (~0.95), we fail to reject the null hypothesis, indicating the observed sample split is consistent with random variation and there is no evidence of a sample ratio mismatch.
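The SRM statistic is small enough to verify by hand. The sketch below recomputes it with only the standard library, using the group sizes reported by `treatment_counts` (145,274 control and 145,309 treatment) and the fact that, for one degree of freedom, the chi-square survival function reduces to `erfc(sqrt(x / 2))`.

```python
import math

observed = [145274, 145309]            # control, treatment group sizes
expected = [sum(observed) / 2] * 2     # a 50/50 split was intended

# Pearson chi-square statistic: sum of (O - E)^2 / E
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2 = {chi2:.6f}, p = {p_value:.4f}")
```

The groups differ by only 35 users out of roughly 290,000, which is why the statistic is so close to zero.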

## A/B Test

In [138]:
treatment_counts

treatment
0    145274
1    145309
Name: count, dtype: int64

In [134]:
converted_counts = abtest_df.groupby("treatment")["converted"].sum()
converted_counts

treatment
0    17489
1    17264
Name: converted, dtype: int64

In [136]:
observed_props = converted_counts/treatment_counts
observed_props

treatment
0    0.120386
1    0.118809
dtype: float64

In [123]:
successes = np.array([converted_counts.tolist()[0], converted_counts.tolist()[1]])
nobs = np.array([treatment_counts.tolist()[0], treatment_counts.tolist()[1]])

In [124]:
z_stat, p_value = sm.stats.proportions_ztest(successes, nobs)
print(f"Z-statistic: {z_stat:.4f}, p-value: {p_value:.4f}")

Z-statistic: 1.2624, p-value: 0.2068


The null hypothesis is that there is no difference in the proportion of users who convert between the control and treatment groups. The alternative hypothesis is that there is a difference in proportions (a two-sided test). Using a significance level of alpha = 0.05, the null hypothesis is rejected if the p-value is less than 0.05 or, equivalently, if the absolute value of the z statistic is greater than 1.96. Since the p-value is 0.21 and the z statistic is 1.26, we fail to reject the null hypothesis: there is no statistically significant difference between the two proportions.
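As a cross-check, the pooled two-proportion z-test can be computed directly from its formula. The sketch below uses only the standard library; `two_proportion_ztest` is a hypothetical helper, and the counts are passed explicitly so that each success count is paired with its own group size (17,489 of 145,274 for control, 17,264 of 145,309 for treatment).

```python
import math
from statistics import NormalDist

def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-sided z-test for the difference of two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = two_proportion_ztest(17489, 145274, 17264, 145309)
print(f"z = {z:.4f}, p = {p:.4f}")
```

With this pairing the statistic comes out near 1.31 (p about 0.19) rather than the printed 1.2624. That gap would be consistent with the `nobs` array in the cell above having been built from a differently ordered `treatment_counts`, so it may be worth re-running the cells in order. The conclusion at alpha = 0.05 is the same either way.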

In [132]:
sm.stats.proportion_confint(successes, nobs, alpha=0.05, method="wilson")

(array([0.11869435, 0.11718356]), array([0.12204033, 0.12051162]))

The 95% confidence interval for the proportion of users who converted without the treatment is (0.1187, 0.1220). The 95% confidence interval for the proportion for those with treatment is (0.1172, 0.1205).
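Per-group intervals show the uncertainty in each rate, but the comparison of interest is the difference between them. A minimal sketch of a Wald (unpooled) interval for that difference follows; `diff_confint` is a hypothetical helper, and the Wald method is an assumption chosen for simplicity rather than the Wilson method used above.

```python
import math
from statistics import NormalDist

def diff_confint(x1, n1, x2, n2, alpha=0.05):
    """Wald confidence interval for p1 - p2 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p1 - p2
    return diff - z_crit * se, diff + z_crit * se

# control minus treatment, using the counts from this section
lower, upper = diff_confint(17489, 145274, 17264, 145309)
print(f"95% CI for the difference: ({lower:.4f}, {upper:.4f})")
```

An interval that contains zero agrees with the z-test's failure to reject the null hypothesis.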

### Statistical Power

In [170]:
import statsmodels.stats.api as sms

effect_size = sms.proportion_effectsize(
    observed_props.tolist()[0], observed_props.tolist()[1])
print(f"The effect size is {effect_size:.4f}")


# Power calculator for a two-sample z-test; reused in the MDE section below
analysis = sms.NormalIndPower()
#power = analysis.power(
#    effect_size=effect_size,
#    nobs1=treatment_counts.tolist()[0],  # group size, not number of conversions
#    alpha=0.05,
#    ratio=treatment_counts.tolist()[1]/treatment_counts.tolist()[0]
#)
#print(f"The statistical power is {power:.4f}")


The effect size is 0.0049


In [158]:
print(f"There is a {observed_props[0] - observed_props[1]:.4f} difference between the treatment groups.")

There is a 0.0016 difference between the treatment groups.


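The effect size reported by `proportion_effectsize` is Cohen's h, for which a value of about 0.2 is conventionally considered small. An effect size of 0.0049 is far below that, indicating a very small difference between the two proportions.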
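For reference, Cohen's h is the difference of arcsine-transformed proportions, and the power of a two-sided z-test at that effect size follows from the normal distribution. The sketch below uses only the standard library; `cohens_h` and `two_sided_power` are hypothetical helpers, and the power formula is the usual normal approximation, offered as an illustration of what the commented-out cell appears to be aiming at rather than as a reproduction of it.

```python
import math
from statistics import NormalDist

def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def two_sided_power(h, n1, n2, alpha=0.05):
    """Normal-approximation power of a two-sided two-sample z-test."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Effect size scaled by the effective sample size of the two groups.
    shift = abs(h) * math.sqrt(1 / (1 / n1 + 1 / n2))
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)

h = cohens_h(17489 / 145274, 17264 / 145309)
power = two_sided_power(h, 145274, 145309)
print(f"h = {h:.4f}, power = {power:.3f}")
```

Power computed from the observed effect is a direct function of the p-value, so it adds little beyond the test itself; the minimum detectable effect is usually the more informative quantity.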

### Minimum Detectable Effect (MDE)

For an effect size this small, let's see what the minimum detectable effect (MDE) is.

In [165]:
analysis = sms.NormalIndPower()

# nobs1 is the number of users in the control group, not the number of conversions
mde = analysis.solve_power(
    effect_size=None,
    nobs1=treatment_counts.tolist()[0],
    alpha=0.05,
    power=0.8,
    ratio=treatment_counts.tolist()[1]/treatment_counts.tolist()[0]
)
print(f"MDE: {mde:.4f}")

# Convert Cohen's h to an approximate absolute difference in proportions
absolute_mde = mde * np.sqrt(observed_props[0] * (1 - observed_props[0]))
print(f"Absolute MDE: {absolute_mde:.4f}")

MDE: 0.0104
Absolute MDE: 0.0034


Given roughly 145,000 users per group, the effect size (Cohen's h) would need to be at least about 0.0104 to be detected with a statistical power of 0.8 at alpha = 0.05. This corresponds to an absolute change of about 0.0034 (0.34 percentage points) from the control conversion rate of 0.1204. The observed difference of 0.0016 is below this threshold, so the experiment was not sized to reliably detect an effect as small as the one observed.
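The minimum detectable effect also has a closed form under the normal approximation: h_min = (z_{1-alpha/2} + z_{power}) * sqrt(1/n1 + 1/n2), where n1 and n2 are the numbers of users in each group. The sketch below evaluates it with the standard library for the group sizes in this analysis; `minimum_detectable_h` is a hypothetical helper, and the conversion to an absolute difference uses the first-order approximation delta_p = h * sqrt(p * (1 - p)).

```python
import math
from statistics import NormalDist

def minimum_detectable_h(n1, n2, alpha=0.05, power=0.8):
    """Smallest Cohen's h detectable by a two-sided z-test (normal approx.)."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    return (z_alpha + z_power) * math.sqrt(1 / n1 + 1 / n2)

n_control, n_treatment = 145274, 145309   # users per group
p_control = 17489 / n_control             # baseline conversion rate

h_min = minimum_detectable_h(n_control, n_treatment)
# First-order conversion from Cohen's h to a difference in proportions.
abs_mde = h_min * math.sqrt(p_control * (1 - p_control))

print(f"h_min = {h_min:.4f}, absolute MDE = {abs_mde:.4f}")
```

The key design point is that the sample size entering the formula is the number of users per group, since every user contributes an observation whether or not they convert.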

## Causal Inference

In [160]:
logistic_model = smf.logit(
    formula="converted ~ treatment + time_spent_seconds + country_UK + country_US",
    data=abtest_df
).fit()

logistic_model.summary()

Optimization terminated successfully.
         Current function value: 0.366113
         Iterations 6


|   |   |   |   |
|---|---|---|---|
| Dep. Variable: | converted | No. Observations: | 290583 |
| Model: | Logit | Df Residuals: | 290578 |
| Method: | MLE | Df Model: | 4 |
| Date: | Thu, 25 Dec 2025 | Pseudo R-squ.: | 2.429e-05 |
| Time: | 11:53:22 | Log-Likelihood: | -106390.0 |
| converged: | True | LL-Null: | -106390.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.2705 |

|   | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -2.0253 | 0.028 | -71.238 | 0.000 | -2.081 | -1.970 |
| treatment | -0.0149 | 0.011 | -1.307 | 0.191 | -0.037 | 0.007 |
| time_spent_seconds | -2.62e-06 | 5.51e-06 | -0.476 | 0.634 | -1.34e-05 | 8.17e-06 |
| country_UK | 0.0506 | 0.028 | 1.783 | 0.075 | -0.005 | 0.106 |
| country_US | 0.0407 | 0.027 | 1.515 | 0.130 | -0.012 | 0.093 |


An LLR p-value of 0.27 means the model as a whole does not fit significantly better than a model with just the intercept. No individual predictor is statistically significant at the 0.05 level either. The treatment coefficient of -0.0149 (p = 0.191) corresponds to an odds ratio of about 0.985, and its 95% confidence interval includes zero, so after adjusting for time spent and country there is still no evidence that the new page changes the conversion rate. This is consistent with the result of the z-test.
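Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to interpret. The short sketch below applies this to the `treatment` row of the summary table, using the rounded values printed there (so the results are only as precise as those three decimals).

```python
import math

# Values copied from the treatment row of the logit summary.
coef = -0.0149
ci_low, ci_high = -0.037, 0.007

odds_ratio = math.exp(coef)
or_low, or_high = math.exp(ci_low), math.exp(ci_high)

print(f"Odds ratio: {odds_ratio:.3f} (95% CI {or_low:.3f} to {or_high:.3f})")
```

An odds ratio interval that spans 1 is the same statement as a coefficient interval that spans 0: the data are compatible with the new page having no effect on the odds of conversion.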