# **Classic and Sequential A/B Testing Analysis for Evaluating Ad Campaign Performance**

**Objective**  
Evaluate whether an online advertising campaign by SmartAd is effective in increasing brand awareness. The analysis focuses on determining if exposure to a creative, interactive ad (exposed group) produces a statistically significant increase in "Yes" responses compared to a dummy ad (control group).

**Methods**  
- **Classic A/B Testing**

    Data are collected until a predetermined sample size is reached. A two-sample z-test for proportions compares the performance between control and exposed groups.

- **Sequential A/B Testing**

    Data are monitored at interim looks as they accumulate, allowing the experiment to stop early if a sufficiently large difference is observed. Adjusted critical values at each look (O'Brien-Fleming-style boundaries) keep the overall type I error rate under control.
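
To make the "predetermined sample size" of the classic design concrete, the sketch below applies the standard normal-approximation formula for comparing two proportions. The baseline rate (0.45), the target rate (0.50), and the 80% power are illustrative planning assumptions, not values taken from the campaign design, and `sample_size_per_group` is a hypothetical helper name.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_control, p_exposed, alpha=0.05, power=0.80):
    """Approximate responders needed per group for a one-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # one-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_exposed) / 2
    null_sd = sqrt(2 * p_bar * (1 - p_bar))     # spread under H0 (pooled rate)
    alt_sd = sqrt(p_control * (1 - p_control) + p_exposed * (1 - p_exposed))
    n = ((z_alpha * null_sd + z_beta * alt_sd) / (p_exposed - p_control)) ** 2
    return ceil(n)


# Assumed planning values: detect a lift from 45% to 50% "Yes" responses.
n_required = sample_size_per_group(0.45, 0.50)
print(f"Responders needed per group: {n_required}")
```

Smaller assumed lifts drive the required sample size up quickly, because the lift enters the formula squared in the denominator.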


In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

from scipy.stats import norm
from statsmodels.stats.proportion import proportions_ztest

## **Data Import and Initial Inspection**

The dataset is loaded from a GitHub repository. The first few rows are inspected to understand the structure.


In [2]:
# Data import
url = 'https://raw.githubusercontent.com/mwi-kali/AB-Hypothesis-Testing-Ad-campaign-performance/refs/heads/master/data/AdSmartABdata.csv'
data = pd.read_csv(url, low_memory=False)

# Display the first few rows for a quick inspection
data.head()

Unnamed: 0,auction_id,experiment,date,hour,device_make,platform_os,browser,yes,no
0,0008ef63-77a7-448b-bd1e-075f42c55e39,exposed,2020-07-10,8,Generic Smartphone,6,Chrome Mobile,0,0
1,000eabc5-17ce-4137-8efe-44734d914446,exposed,2020-07-07,10,Generic Smartphone,6,Chrome Mobile,0,0
2,0016d14a-ae18-4a02-a204-6ba53b52f2ed,exposed,2020-07-05,2,E5823,6,Chrome Mobile WebView,0,1
3,00187412-2932-4542-a8ef-3633901c98d9,control,2020-07-03,15,Samsung SM-A705FN,6,Facebook,0,0
4,001a7785-d3fe-4e11-a344-c8735acacc2c,control,2020-07-03,15,Generic Smartphone,6,Chrome Mobile,0,0


## **Data Overview and Preliminary Exploration**

The dataset structure is examined with summary statistics and a missing value check. An initial visualization presents the distribution of users across experimental groups.


In [3]:
print("Data Information:")
print(data.info())

Data Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8077 entries, 0 to 8076
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   auction_id   8077 non-null   object
 1   experiment   8077 non-null   object
 2   date         8077 non-null   object
 3   hour         8077 non-null   int64 
 4   device_make  8077 non-null   object
 5   platform_os  8077 non-null   int64 
 6   browser      8077 non-null   object
 7   yes          8077 non-null   int64 
 8   no           8077 non-null   int64 
dtypes: int64(4), object(5)
memory usage: 568.0+ KB
None


There are 8077 rows and 9 columns. No missing values are detected.

In [4]:
print("\nSummary Statistics:")
print(data.describe())


Summary Statistics:
              hour  platform_os          yes           no
count  8077.000000  8077.000000  8077.000000  8077.000000
mean     11.615080     5.947134     0.070818     0.083075
std       5.734879     0.224333     0.256537     0.276013
min       0.000000     5.000000     0.000000     0.000000
25%       7.000000     6.000000     0.000000     0.000000
50%      13.000000     6.000000     0.000000     0.000000
75%      15.000000     6.000000     0.000000     0.000000
max      23.000000     7.000000     1.000000     1.000000


The mean hour is approximately 11.62 and the median is 13, so impressions are concentrated around midday and early afternoon; the mean alone does not locate a peak. Both `yes` and `no` have means below 0.1 (about 7.1% and 8.3% of impressions), so only roughly 15% of impressions yield a response.

In [5]:
print("\nMissing values per column:")
print(data.isnull().sum())


Missing values per column:
auction_id     0
experiment     0
date           0
hour           0
device_make    0
platform_os    0
browser        0
yes            0
no             0
dtype: int64


No column contains missing values, so no imputation or row filtering is needed before the analysis.

### **Distribution of Experimental Groups**

The bar chart below shows the count of users assigned to each experimental group (control vs. exposed), providing an initial overview of the experimental balance.



In [6]:
group_counts = data['experiment'].value_counts().reset_index()
group_counts.columns = ['experiment', 'count']

fig = px.bar(
    group_counts,
    x='experiment',
    y='count',
    title='Distribution of Users by Experimental Group',
    labels={'experiment': 'Experiment Group', 'count': 'Number of Users'},
    text='count'
)
fig.show()


The bar chart shows the number of users in the control and exposed groups. The two groups are similar in size; the exact counts are printed on the bars.
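
A visual check of balance can be backed by a sample ratio check. The sketch below is a chi-square goodness-of-fit test with one degree of freedom; it assumes the intended allocation was 50/50, which the source does not state, and the counts in the usage lines are hypothetical. `srm_p_value` is an illustrative helper, not part of any library.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_exposed, expected_share_control=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for group sizes."""
    total = n_control + n_exposed
    exp_control = total * expected_share_control
    exp_exposed = total - exp_control
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_exposed - exp_exposed) ** 2 / exp_exposed)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


# Hypothetical counts, for illustration only.
p_balanced = srm_p_value(5000, 5000)
p_skewed = srm_p_value(5000, 5300)
print(f"balanced split p-value: {p_balanced:.3f}")
print(f"skewed split p-value:   {p_skewed:.4f}")
```

A very small p-value would suggest that assignment or logging deviated from the intended split, which is worth investigating before interpreting any treatment effect.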

### **Response Analysis**

Not every impression results in a response: when both `yes` and `no` are 0, the user did not answer the questionnaire. A new column, `responded`, flags the rows where a response was provided. Totals and proportions of "Yes" responses are then computed for each experimental group, using responders as the denominator.


In [7]:
data['responded'] = (data['yes'] + data['no']) > 0

response_counts = data[data['responded']].groupby('experiment')['responded'].count()
yes_counts = data.groupby('experiment')['yes'].sum()
prop_yes = yes_counts / response_counts

print("Total responses per group:")
print(response_counts)
print("\nTotal 'Yes' responses per group:")
print(yes_counts)
print("\nProportion of 'Yes' responses per group:")
print(prop_yes)


Total responses per group:
experiment
control    586
exposed    657
Name: responded, dtype: int64

Total 'Yes' responses per group:
experiment
control    264
exposed    308
Name: yes, dtype: int64

Proportion of 'Yes' responses per group:
experiment
control    0.450512
exposed    0.468798
dtype: float64


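The exposed group's "Yes" rate (46.9%) is about 1.8 percentage points higher than the control group's (45.1%). Whether a gap of that size is distinguishable from sampling noise is assessed in the classic A/B test.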
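
To put the size of the gap in context, the sketch below computes a normal-approximation (Wald) 95% confidence interval for the difference in "Yes" rates, using the counts printed above. The choice of a Wald interval is an assumption made for simplicity; other interval methods exist.

```python
from math import sqrt
from statistics import NormalDist

# Counts taken from the output above.
yes_c, n_c = 264, 586   # control
yes_e, n_e = 308, 657   # exposed

p_c, p_e = yes_c / n_c, yes_e / n_e
diff = p_e - p_c
se = sqrt(p_c * (1 - p_c) / n_c + p_e * (1 - p_e) / n_e)   # unpooled standard error
z_crit = NormalDist().inv_cdf(0.975)                        # two-sided 95%
ci_low, ci_high = diff - z_crit * se, diff + z_crit * se

print(f"Difference (exposed - control): {diff:.4f}")
print(f"95% CI: ({ci_low:.4f}, {ci_high:.4f})")
```

The interval spans zero, so the data are compatible with no lift as well as with a lift of several percentage points.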

### **Segmentation Analysis**

Segmenting the responders shows where responses are concentrated. The following sections break down response counts by device brand, operating system, time of day, and browser type.
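
Response counts show where the volume is, but not whether the ad works differently in each segment. A minimal sketch of the complementary view, the "Yes" rate per segment and experimental group, is given below. The toy rows are hypothetical and only reuse the dataset's column names.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset.
toy = pd.DataFrame({
    'experiment': ['control', 'control', 'exposed', 'exposed', 'exposed', 'control'],
    'browser': ['Chrome Mobile', 'Facebook', 'Chrome Mobile',
                'Chrome Mobile', 'Facebook', 'Chrome Mobile'],
    'yes': [1, 0, 1, 0, 1, 0],
    'no':  [0, 1, 0, 1, 0, 0],
})
toy['responded'] = (toy['yes'] + toy['no']) > 0

# Keep responders only, then compute volume and "Yes" rate per cell.
segment_rates = (
    toy[toy['responded']]
    .groupby(['browser', 'experiment'])
    .agg(responses=('responded', 'size'), yes_rate=('yes', 'mean'))
    .reset_index()
)
print(segment_rates)
```

With real data, segments with few responders produce unstable rates, so the `responses` column should be read alongside `yes_rate`.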


#### **Analysis by Device Type**

The bar chart shows response counts by device brand, derived from the device model string in `device_make` through keyword matching. This segmentation shows which brands account for most of the responses.


In [8]:
brand_mapping = {
    'samsung': 'Samsung',
    'sm-': 'Samsung',
    'gt-': 'Samsung',
    'iphone': 'Apple',
    'htc': 'HTC',
    'lg-': 'LG',
    'moto': 'Motorola',
    'xt': 'Motorola',
    'pixel': 'Google',
    'nexus': 'Google',
    'oneplus': 'OnePlus',
    'xiaomi': 'XiaoMi',
    'redmi': 'XiaoMi',
    'huawei': 'Huawei',
    'lx1': 'Huawei',
    'col-l29': 'Huawei',
    'fig-lx1': 'Huawei',
    'ane-lx1': 'Huawei',
    'vog-l09': 'Huawei',
    'vog-l29': 'Huawei',
    'clt-l09': 'Huawei',
    'mar-lx1a': 'Huawei',
    'bkl-l09': 'Huawei',
    'lya-l09': 'Huawei',
    'pra-lx1': 'Huawei',
    'plk-l01': 'Huawei',
    'vce-l22': 'Huawei',
    'ele-l09': 'Huawei',
    'ele-l29': 'Huawei',
    'lenovo': 'Lenovo',
    'nokia': 'Nokia',
    'f8331': 'Sony',
    'e5823': 'Sony',
    'i3312': 'Sony',
    'h3311': 'Sony',
    'generic': 'Generic'
}

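def map_device_to_brand(device_name):
    """Return the brand of the first keyword found in the device name.

    Matching is by substring in dictionary order, so short keywords such as
    'xt' or 'lx1' can also match unrelated model names.
    """
    device_lower = device_name.lower()
    for keyword, brand in brand_mapping.items():
        if keyword in device_lower:
            return brand
    return 'Other'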

data['device_brand'] = data['device_make'].apply(map_device_to_brand)

brand_counts = (
    data[data['responded']]
    .groupby('device_brand')['responded']
    .count()
    .reset_index()
    .rename(columns={'responded': 'response_count'})
)

fig_brand = px.bar(
    brand_counts,
    x='device_brand',
    y='response_count',
    title='Response Count by Device Brand',
    labels={'device_brand': 'Brand', 'response_count': 'Number of Responses'},
    text='response_count'
)
fig_brand.show()


#### **Analysis by Operating System**

The bar chart below shows response counts across different operating systems (`platform_os`), which may indicate distinct patterns in ad performance.


In [9]:
os_counts = (
    data[data['responded']]
    .groupby('platform_os')['responded']
    .count()
    .reset_index()
    .rename(columns={'responded': 'response_count'})
)

fig_os = px.bar(
    os_counts,
    x='platform_os',
    y='response_count',
    title='Response Count by Operating System',
    labels={'platform_os': 'Operating System', 'response_count': 'Number of Responses'},
    text='response_count'
)
fig_os.show()


#### **Analysis by Time of Day**

The line chart below illustrates hourly response counts. This segmentation by `hour` provides insight into peak engagement times.



In [10]:
hourly_summary = (
    data[data['responded']]
    .groupby('hour')['responded']
    .count()
    .reset_index()
    .rename(columns={'responded': 'response_count'})
)

fig_hour = px.line(
    hourly_summary,
    x='hour',
    y='response_count',
    title='Hourly Response Count',
    labels={'hour': 'Hour of Day', 'response_count': 'Number of Responses'},
    markers=True
)
fig_hour.show()


#### **Analysis by Browser Type**

The bar chart below shows response counts segmented by browser type. This segmentation can provide insights into whether particular browsers are associated with higher or lower engagement.


In [11]:
browser_counts = (
    data[data['responded']]
    .groupby('browser')['responded']
    .count()
    .reset_index()
    .rename(columns={'responded': 'response_count'})
)

fig_browser = px.bar(
    browser_counts,
    x='response_count',
    y='browser',
    title='Response Count by Browser Type',
    labels={'browser': 'Browser', 'response_count': 'Number of Responses'},
    text='response_count'
)
fig_browser.show()


## **Classic A/B Testing Analysis**

**Hypotheses**  
- **Null Hypothesis (H₀)**

    The proportion of "Yes" responses in the exposed group equals that in the control group.  

- **Alternative Hypothesis (H₁)**

    The exposed group has a higher proportion of "Yes" responses.

A one-sided two-sample z-test for proportions is applied. A p-value below 0.05 indicates a statistically significant lift in brand awareness for the exposed group.
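
The test statistic can also be worked out by hand, which makes the direction of the test explicit. The sketch below uses the responder counts and "Yes" totals reported in the response analysis and the pooled standard error under H₀; variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

yes_c, n_c = 264, 586   # control: "Yes" responses, responders
yes_e, n_e = 308, 657   # exposed: "Yes" responses, responders

p_c, p_e = yes_c / n_c, yes_e / n_e
p_pool = (yes_c + yes_e) / (n_c + n_e)                      # pooled rate under H0
se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_e))

z_manual = (p_c - p_e) / se_pool        # control minus exposed
p_manual = NormalDist().cdf(z_manual)   # P(Z <= z), matching H1: p_control < p_exposed

print(f"z = {z_manual:.4f}, one-sided p = {p_manual:.4f}")
```

Because the statistic is defined as control minus exposed, a negative value means the exposed group has the higher observed rate, and the one-sided p-value is the lower-tail probability.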


In [12]:
yes_control = data[(data['experiment'] == 'control') & (data['responded'])]['yes'].sum()
n_control = data[(data['experiment'] == 'control') & (data['responded'])].shape[0]

yes_exposed = data[(data['experiment'] == 'exposed') & (data['responded'])]['yes'].sum()
n_exposed = data[(data['experiment'] == 'exposed') & (data['responded'])].shape[0]

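# Order matters: control first, exposed second.
counts = np.array([yes_control, yes_exposed])
nobs = np.array([n_control, n_exposed])

# alternative='smaller' tests H1: p_control < p_exposed,
# i.e. the exposed group has the higher "Yes" rate.
z_stat, p_val = proportions_ztest(counts, nobs, alternative='smaller')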

print("Classic A/B Testing Results:")
print(f"Z-test statistic: {z_stat:.4f}")
print(f"P-value: {p_val:.4f}")


Classic A/B Testing Results:
Z-test statistic: -0.6457
P-value: 0.2592


The z-test statistic is approximately **-0.6457** with a p-value of **0.2592**. The statistic is negative because it is computed as control minus exposed, so a negative value means the exposed group has the higher observed rate. Since the p-value is well above 0.05, there is no statistically significant evidence that the exposed group has a higher proportion of "Yes" responses than the control group.
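
A non-significant result is easier to interpret alongside the smallest lift the test could reasonably have detected. The sketch below estimates that minimum detectable lift with the usual normal approximation. The 80% power target is an assumed convention, and the pooled "Yes" rate is used for the standard error as a simplification.

```python
from math import sqrt
from statistics import NormalDist

n_c, n_e = 586, 657          # responders per group, from the response analysis
p_base = 572 / 1243          # pooled "Yes" rate: (264 + 308) out of (586 + 657)
alpha, power = 0.05, 0.80    # one-sided alpha; 80% power is an assumption

se = sqrt(p_base * (1 - p_base) * (1 / n_c + 1 / n_e))
z_alpha = NormalDist().inv_cdf(1 - alpha)
z_beta = NormalDist().inv_cdf(power)
mde = (z_alpha + z_beta) * se

print(f"Approximate minimum detectable lift: {mde:.3f}")
```

Under these assumptions the detectable lift is roughly 7 percentage points, well above the observed difference of about 1.8 points, so the result does not rule out a small positive effect.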


## **Sequential A/B Testing Analysis**

Sequential A/B testing enables continuous monitoring of the experiment. Data accumulate over time, and the z-test statistic is calculated at each interim analysis, which may allow early stopping if significant evidence emerges.

### **Aggregating Data by Date and Experiment**

Convert the `date` column to datetime format, filter for rows with responses, and aggregate daily totals of responses and "Yes" responses for each experimental group.


In [13]:
data['date'] = pd.to_datetime(data['date'])
responded_data = data[data['responded']].copy()

daily_summary = responded_data.groupby(['date', 'experiment']).agg(
    total_responses=('responded', 'count'),
    yes_count=('yes', 'sum')
).reset_index()

daily_summary.sort_values('date', inplace=True)
daily_summary.head()


Unnamed: 0,date,experiment,total_responses,yes_count
0,2020-07-03,control,233,104
1,2020-07-03,exposed,92,43
2,2020-07-04,control,68,30
3,2020-07-04,exposed,91,46
4,2020-07-05,control,43,17


### **Calculating Cumulative Totals**

Cumulative totals of responses and "Yes" responses are computed over time to illustrate how statistical evidence builds as data accumulate.


In [14]:
pivot_total = daily_summary.pivot(index='date', columns='experiment', values='total_responses').fillna(0)
pivot_yes = daily_summary.pivot(index='date', columns='experiment', values='yes_count').fillna(0)

cum_total = pivot_total.cumsum()
cum_yes = pivot_yes.cumsum()

cum_summary = pd.concat([cum_total, cum_yes], axis=1, keys=['total', 'yes'])
cum_summary.columns = ['total_control', 'total_exposed', 'yes_control', 'yes_exposed']
cum_summary.reset_index(inplace=True)
cum_summary.head()


Unnamed: 0,date,total_control,total_exposed,yes_control,yes_exposed
0,2020-07-03,233,92,104,43
1,2020-07-04,301,183,134,89
2,2020-07-05,344,257,151,124
3,2020-07-06,370,305,163,147
4,2020-07-07,407,351,179,169


### **Computing the Sequential Z-Statistic Over Time**

The sequential z-test statistic is computed at each time point using the cumulative data from both experimental groups. This provides a dynamic view of how evidence accumulates.



In [15]:
z_stats = []
dates = []
sample_sizes = []

for _, row in cum_summary.iterrows():
    count_control = row['yes_control']
    n_control = row['total_control']
    count_exposed = row['yes_exposed']
    n_exposed = row['total_exposed']

    if n_control > 0 and n_exposed > 0:
        counts_seq = np.array([count_control, count_exposed])
        nobs_seq = np.array([n_control, n_exposed])
        z, _ = proportions_ztest(counts_seq, nobs_seq, alternative='smaller')
        # proportions_ztest returns z for (control - exposed). Flip the sign so that
        # positive values favor the exposed group and are comparable with the
        # positive critical values (e.g. 1.645) used for the boundaries.
        z_stats.append(-z)
        dates.append(row['date'])
        sample_sizes.append(n_control + n_exposed)

seq_df = pd.DataFrame({
    'date': dates,
    'cumulative_sample_size': sample_sizes,
    'z_stat': z_stats
})
seq_df.head()


Unnamed: 0,date,cumulative_sample_size,z_stat
0,2020-07-03,325,0.343303
1,2020-07-04,484,0.880831
2,2020-07-05,601,1.059898
3,2020-07-06,675,1.074869
4,2020-07-07,758,1.148183


### **Evolution of the Cumulative Z-Statistic**

The line chart below displays the evolution of the cumulative z-statistic over time. A horizontal dashed line represents the conventional significance threshold of **1.645** for a one-sided test at the 5% level.
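
Comparing every interim z-statistic with the fixed threshold of 1.645 is not a valid stopping rule by itself, because repeated looks inflate the false positive rate. The simulation below is a sketch of that effect under the null hypothesis. It assumes 8 equally informative looks (for example, one per day), which is an illustrative choice and not a property of the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_looks = 100_000, 8   # assumed number of equally spaced looks

# Under H0 the cumulative z-statistic at look k behaves like S_k / sqrt(k),
# where S_k is the sum of k independent standard normal increments.
steps = rng.standard_normal((n_sims, n_looks))
cum_sums = steps.cumsum(axis=1)
z_paths = cum_sums / np.sqrt(np.arange(1, n_looks + 1))

# Share of null experiments that cross 1.645 at one or more looks.
crossed_any = (z_paths >= 1.645).any(axis=1)
false_positive_rate = crossed_any.mean()
print(f"False positive rate with unadjusted peeking: {false_positive_rate:.3f}")
```

The simulated rate is well above the nominal 5%, which is the motivation for the adjusted boundaries in the next section.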


In [16]:
fig_seq = go.Figure()

fig_seq.add_trace(go.Scatter(
    x=seq_df['date'],
    y=seq_df['z_stat'],
    mode='lines+markers',
    name='Cumulative Z-Statistic',
    marker=dict(size=8)
))

fig_seq.add_trace(go.Scatter(
    x=seq_df['date'],
    y=[1.645] * len(seq_df),
    mode='lines',
    name='Significance Threshold (1.645)',
    line=dict(dash='dash', color='red')
))

fig_seq.update_layout(
    title='Sequential A/B Testing: Cumulative Z-Statistic Over Time',
    xaxis_title='Date',
    yaxis_title='Z-Statistic',
    hovermode='x unified'
)
fig_seq.show()


### **Sequential Testing Boundaries with Alpha Spending**

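In sequential analysis, the critical value at each interim look must be adjusted to control the overall type I error rate. The step thresholds used here mimic the shape of O'Brien-Fleming boundaries (strict early, relaxed later) but are illustrative rather than exact: a true O'Brien-Fleming boundary is proportional to 1/sqrt(t) in the information fraction t, and its final critical value lies slightly above 1.645 because part of the error budget is spent at the interim looks.

- **Up to 25% information**, the critical z-value is **2.34**.
- **Up to 50% information**, it is **1.98**.
- **Up to 75% information**, it is **1.75**.
- **Beyond 75% (through the full data)**, the threshold is **1.645**.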
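
For reference, the idea of alpha spending can be sketched with the Lan-DeMets spending function of O'Brien-Fleming type, which gives the cumulative type I error allowed by information fraction t. This is a standard textbook form and is shown here only as an illustration; it is not the rule used to derive the step thresholds above, and `obf_alpha_spent` is a hypothetical helper name.

```python
from statistics import NormalDist

_std_normal = NormalDist()


def obf_alpha_spent(info_frac, alpha=0.05):
    """Cumulative one-sided alpha spent by information fraction t (0 < t <= 1)."""
    z_half = _std_normal.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _std_normal.cdf(z_half / info_frac ** 0.5))


for t in (0.25, 0.50, 0.75, 1.00):
    print(f"t = {t:.2f}: cumulative alpha spent = {obf_alpha_spent(t):.5f}")
```

Almost none of the error budget is spent at the earliest look, and the full 0.05 is reached only at t = 1, which is why early O'Brien-Fleming thresholds are so strict.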


In [17]:
total_sample = seq_df['cumulative_sample_size'].max()
information_fraction = seq_df['cumulative_sample_size'] / total_sample

def obf_boundary(frac):
    if frac <= 0.25:
        return 2.34
    elif frac <= 0.50:
        return 1.98
    elif frac <= 0.75:
        return 1.75
    else:
        return 1.645

obf_boundaries = information_fraction.apply(obf_boundary)
seq_df['info_frac'] = information_fraction
seq_df['obf_boundary'] = obf_boundaries
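
As a cross-check on the step thresholds, boundaries with the exact O'Brien-Fleming shape can be calibrated by simulation. The sketch below assumes four equally spaced looks and a one-sided alpha of 0.05; both are illustrative assumptions, and the Monte Carlo estimate carries some sampling error.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = 0.05
info_fracs = np.array([0.25, 0.50, 0.75, 1.00])   # assumed equally spaced looks

# Under H0 the score process B(t) has independent normal increments whose
# variance equals the increase in information fraction; z_k = B(t_k) / sqrt(t_k).
increment_sd = np.sqrt(np.diff(info_fracs, prepend=0.0))
paths = (rng.standard_normal((400_000, info_fracs.size)) * increment_sd).cumsum(axis=1)

# z_k >= c / sqrt(t_k) is the same event as B(t_k) >= c, so c is the
# (1 - alpha) quantile of the maximum of B over the looks.
c = np.quantile(paths.max(axis=1), 1 - alpha)
z_bounds = c / np.sqrt(info_fracs)

for t, z in zip(info_fracs, z_bounds):
    print(f"t = {t:.2f}: critical z is roughly {z:.2f}")
```

Boundaries calibrated this way are considerably stricter at the early looks than the step values, and the final threshold sits slightly above 1.645.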


The plot below shows these approximate boundaries compared with the cumulative z-statistic.

In [18]:
fig_advanced = go.Figure()

fig_advanced.add_trace(go.Scatter(
    x=seq_df['date'],
    y=seq_df['z_stat'],
    mode='lines+markers',
    name='Cumulative Z-Statistic',
    marker=dict(size=8)
))

fig_advanced.add_trace(go.Scatter(
    x=seq_df['date'],
    y=seq_df['obf_boundary'],
    mode='lines',
    name='O’Brien-Fleming Boundary',
    line=dict(dash='dash', color='green')
))

fig_advanced.add_trace(go.Scatter(
    x=seq_df['date'],
    y=[1.645] * len(seq_df),
    mode='lines',
    name='Conventional Boundary (1.645)',
    line=dict(dash='dot', color='red')
))

fig_advanced.update_layout(
    title='Advanced Sequential Testing with O’Brien-Fleming Alpha Spending',
    xaxis_title='Date',
    yaxis_title='Z-Statistic',
    hovermode='x unified'
)
fig_advanced.show()

### **Segmentation Insights**  
- **Device Brand**

    A large proportion of responses come from generic devices, with Samsung being the most common identifiable brand.  

- **Operating System**
    
    Responses are highly concentrated on a single `platform_os` code (6), so the data offer little basis for comparing platforms.  

- **Time of Day**
    
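    Response counts peak around mid-afternoon, particularly near 3 PM. Because the chart shows counts rather than rates, this reflects when responses were collected and not necessarily when users were most likely to respond.  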

- **Browser Type**

    Mobile browsers dominate the dataset, with Chrome Mobile receiving the majority of responses.

### **Classic A/B Testing Findings**  
Analysis of the final aggregated data using a one-sided z-test yields a z-statistic of approximately **-0.6457** and a p-value of **0.2592**. The exposed group shows a slightly higher proportion of "Yes" responses (46.88%) compared to the control group (45.05%), but the difference is not statistically significant at the 5% level.

### **Sequential A/B Testing Findings**  
Monitoring the cumulative z-statistic over time reveals that it remains below the conventional threshold (and far from the adjusted boundaries) throughout the experiment period. This indicates that no interim analysis justifies early stopping in favor of the exposed group.

### **Conclusion**  
Neither classic nor sequential A/B testing provides statistically significant evidence that the exposed ad improves brand awareness. The segmentation nonetheless shows where responses are concentrated (generic and Samsung devices, Chrome Mobile, mid-afternoon hours), which can inform targeting in future campaigns.





