In [43]:
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.proportion import proportions_ztest
import warnings
warnings.filterwarnings("ignore")

# Task 1

The goal of this Jupyter notebook is to perform an A/B test to evaluate the impact of the Change Vote button in the "Who will win" feature on the Android platform in February 2024. We are comparing user behavior between the control group (without the Change Vote option) and the treatment group (with the Change Vote button).

## Data loading

We are loading data from the CSV file obtained using the task1.sql query.

In [20]:
column_names = [
    "event_date",
    "event_timestamp",
    "event_name",
    "user_pseudo_id",
    "geo_country",
    "app_info_version",
    "platform",
    "status",
    "id",
    "event_id",
    "name",
    "item_name",
    "previous_first_open_count",
    "firebase_experiments"
]

In [21]:
data = pd.read_csv('./events_data.csv', names=column_names, header=0)  
data.head(5)

Unnamed: 0,event_date,event_timestamp,event_name,user_pseudo_id,geo_country,app_info_version,platform,status,id,event_id,name,item_name,previous_first_open_count,firebase_experiments
0,20240219,1708300359346037,add_favorite_event,53357ec1f1dca1f79fabed2ee14adfb6,Croatia,6.16.9,ANDROID,,11911139.0,,,,,['firebase_exp_84_1']
1,20240219,1708300395373042,add_favorite_event,53357ec1f1dca1f79fabed2ee14adfb6,Croatia,6.16.9,ANDROID,,11938093.0,,,,,['firebase_exp_84_1']
2,20240219,1708343711592023,add_favorite_event,53357ec1f1dca1f79fabed2ee14adfb6,Croatia,6.16.9,ANDROID,,11515794.0,,,,,['firebase_exp_84_1']
3,20240219,1708343723166037,add_favorite_event,53357ec1f1dca1f79fabed2ee14adfb6,Croatia,6.16.9,ANDROID,,11515796.0,,,,,['firebase_exp_84_1']
4,20240219,1708343760654072,add_favorite_event,53357ec1f1dca1f79fabed2ee14adfb6,Croatia,6.16.9,ANDROID,,12076702.0,,,,,['firebase_exp_84_1']


## Data preparation

We are removing the `platform` column because it is constant (every row is `ANDROID`).

In [22]:
columns_to_drop = ['platform']
data.drop(columns=columns_to_drop, inplace=True)
data.head(5)

Unnamed: 0,event_date,event_timestamp,event_name,user_pseudo_id,geo_country,app_info_version,status,id,event_id,name,item_name,previous_first_open_count,firebase_experiments
0,20240219,1708300359346037,add_favorite_event,53357ec1f1dca1f79fabed2ee14adfb6,Croatia,6.16.9,,11911139.0,,,,,['firebase_exp_84_1']
1,20240219,1708300395373042,add_favorite_event,53357ec1f1dca1f79fabed2ee14adfb6,Croatia,6.16.9,,11938093.0,,,,,['firebase_exp_84_1']
2,20240219,1708343711592023,add_favorite_event,53357ec1f1dca1f79fabed2ee14adfb6,Croatia,6.16.9,,11515794.0,,,,,['firebase_exp_84_1']
3,20240219,1708343723166037,add_favorite_event,53357ec1f1dca1f79fabed2ee14adfb6,Croatia,6.16.9,,11515796.0,,,,,['firebase_exp_84_1']
4,20240219,1708343760654072,add_favorite_event,53357ec1f1dca1f79fabed2ee14adfb6,Croatia,6.16.9,,12076702.0,,,,,['firebase_exp_84_1']


We split the data into control and treatment groups based on the `firebase_experiments` variable and check that the two groups are of comparable size.

In [23]:
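# .copy() makes each group an independent DataFrame, so columns added later do not write into a slice of `data`
control_group = data[data['firebase_experiments'].apply(lambda x: 'firebase_exp_84_0' in x)].copy()
treatment_group = data[data['firebase_experiments'].apply(lambda x: 'firebase_exp_84_1' in x)].copy()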

The split is roughly even: about 48% of the rows (events) belong to the control group and 52% to the treatment group. These shares are computed over events, not over distinct users.

In [24]:
print("Percentage of control group: ", round(len(control_group)/len(data), 2))
print("Percentage of treatment group: ", round(len(treatment_group)/len(data), 2))

Percentage of control group:  0.48
Percentage of treatment group:  0.52
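
The shares above are descriptive. A sample ratio mismatch check on distinct users is one way to test whether the split deviates from the planned allocation. The sketch below assumes a 50/50 planned split, which the source does not state, and uses made-up user counts; `srm_check` is a hypothetical helper, not part of the notebook.

```python
from scipy import stats

def srm_check(n_control_users, n_treatment_users, expected_share_control=0.5):
    """Chi-square goodness-of-fit test of observed user counts against the planned split."""
    total = n_control_users + n_treatment_users
    expected = [total * expected_share_control, total * (1 - expected_share_control)]
    statistic, p_value = stats.chisquare([n_control_users, n_treatment_users], f_exp=expected)
    return statistic, p_value

# Hypothetical counts for illustration only. In the notebook they would come from
# control_group['user_pseudo_id'].nunique() and treatment_group['user_pseudo_id'].nunique().
stat_balanced, p_balanced = srm_check(10_000, 10_000)
stat_skewed, p_skewed = srm_check(10_000, 11_000)

print("Balanced split:", stat_balanced, p_balanced)
print("Skewed split:", round(stat_skewed, 2), p_skewed)
```

A very small p-value would suggest that assignment or logging is unbalanced and should be investigated before interpreting the metrics.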


## Hypothesis

**Null Hypothesis (H0):**

There is no difference in the average number of clicks per user on the vote button between the treatment group (with access to the Change Vote button) and the control group (without it).

**Alternative Hypothesis (H1):**

Users in the treatment group (with access to the Change Vote button) have a higher average number of clicks per user on the vote button compared to the control group.

## Metrics

**Goal metric (OEC):**

- Average clicks on the vote button per user

Our primary assumption is that the average number of vote button clicks per user will be higher in the treatment group, since users in this group have access to the "change vote" option. This additional feature is expected to encourage more interaction with the voting system, potentially leading to more clicks per user compared to the control group.

In [None]:
control_votes = control_group[control_group['event_name'] == 'event_vote']
treatment_votes = treatment_group[treatment_group['event_name'] == 'event_vote']

total_control_clicks = len(control_votes)
total_treatment_clicks = len(treatment_votes)

n_control_users = control_group['user_pseudo_id'].nunique()
n_treatment_users = treatment_group['user_pseudo_id'].nunique()

avg_clicks_control = round(total_control_clicks / n_control_users, 2)
avg_clicks_treatment = round(total_treatment_clicks / n_treatment_users, 2)

print("Average vote clicks per user (control):", avg_clicks_control)
print("Average vote clicks per user (treatment):", avg_clicks_treatment)

Average vote clicks per user (control): 9.96
Average vote clicks per user (treatment): 9.85


It can be observed that the average number of vote button clicks per user is 9.96 in the control group and 9.85 in the treatment group.
To further investigate this difference statistically, we first test whether the variances of the two groups are equal using Bartlett’s test.


In [58]:
clicks_per_user_control = control_votes.groupby('user_pseudo_id').size()
clicks_per_user_treatment = treatment_votes.groupby('user_pseudo_id').size()

statistic, p_value = stats.bartlett(clicks_per_user_control, clicks_per_user_treatment)

print("Bartlett test statistic:", round(statistic, 2))
print("p-value:", p_value)

Bartlett test statistic: 539.63
p-value: 2.272523389900825e-119


Since the resulting p-value is extremely small (2.27e-119), we reject the null hypothesis that the variances are equal and conclude that the control and treatment groups have significantly different variances.
Because of this, we cannot use Student’s t-test, which assumes equal variances. Instead, we apply Welch’s t-test, which does not rely on this assumption. The Welch’s t-test compares the average number of vote clicks per user between the control and treatment groups. 
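
Bartlett's test is known to be sensitive to departures from normality, and per-user click counts are typically right-skewed. As a cross-check, Levene's test with `center='median'` (the Brown-Forsythe variant) is a more robust alternative. The sketch below runs it on synthetic counts, not on the experiment's data; the distributions and sample sizes are assumptions chosen only to illustrate the call.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, skewed per-user counts (illustrative only).
# Both samples have mean 3, but the second has a much larger variance.
clicks_a = rng.poisson(lam=3, size=2000)
clicks_b = rng.negative_binomial(n=1, p=0.25, size=2000)

# center='median' selects the Brown-Forsythe variant of Levene's test
levene_stat, levene_p = stats.levene(clicks_a, clicks_b, center='median')

print("Levene statistic:", round(levene_stat, 2))
print("p-value:", levene_p)
```

In the notebook, the same call could be applied to `clicks_per_user_control` and `clicks_per_user_treatment`. Either way, Welch's t-test remains valid whether or not the variances are equal.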

In [60]:
clicks_per_user_control = control_votes.groupby('user_pseudo_id').size()
clicks_per_user_treatment = treatment_votes.groupby('user_pseudo_id').size()

t_stat, p_value = stats.ttest_ind(
    clicks_per_user_control,
    clicks_per_user_treatment,
    equal_var=False
)

print("Welch's t-test statistic:", round(t_stat, 2))
print("p-value:", round(p_value, 2))

Welch's t-test statistic: 0.39
p-value: 0.69


The test statistic is 0.39, with a two-sided p-value of 0.69, which is well above the common significance threshold of 0.05. The positive sign reflects a slightly higher mean in the control sample, which was passed first. This means we fail to reject the null hypothesis that the two groups have the same average number of vote clicks per user.

In other words, there is no statistically significant difference in the number of vote clicks per user between the control and treatment groups.
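
Two details of the test above are worth making explicit. First, the per-user series are built with `groupby` on vote rows only, so users with no `event_vote` events are left out, whereas the printed averages divide by all users in each group. Second, H1 is directional, while `stats.ttest_ind` is two-sided by default. The sketch below shows one way to handle both on toy data. `clicks_per_user` is a hypothetical helper, the toy rows are invented, and the `alternative` argument assumes SciPy 1.6 or newer.

```python
import pandas as pd
from scipy import stats

def clicks_per_user(group, event_name='event_vote'):
    """Count events of one type per user, keeping users with zero such events."""
    all_users = group['user_pseudo_id'].unique()
    counts = group[group['event_name'] == event_name].groupby('user_pseudo_id').size()
    return counts.reindex(all_users, fill_value=0)

# Toy data for illustration only
toy_control = pd.DataFrame({
    'user_pseudo_id': ['a', 'a', 'b', 'c'],
    'event_name': ['event_vote', 'event_vote', 'add_favorite_event', 'event_vote'],
})
toy_treatment = pd.DataFrame({
    'user_pseudo_id': ['d', 'd', 'd', 'e', 'f'],
    'event_name': ['event_vote', 'event_vote', 'event_vote', 'event_vote', 'add_favorite_event'],
})

control_counts = clicks_per_user(toy_control)      # a: 2, b: 0, c: 1
treatment_counts = clicks_per_user(toy_treatment)  # d: 3, e: 1, f: 0

# H1 states that treatment > control, so the treatment sample goes first
t_stat, p_one_sided = stats.ttest_ind(
    treatment_counts,
    control_counts,
    equal_var=False,
    alternative='greater'
)

print("Welch's t-test statistic:", round(t_stat, 2))
print("one-sided p-value:", round(p_one_sided, 2))
```

With zero-click users included, the sample means match the "average clicks per user" definition used for the goal metric.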

**Secondary metrics:**

- Average ads_impression_custom per user

Our goal is to observe positive growth in this secondary metric alongside the primary metric, since engagement with the Change Vote option should lead to more ads being shown.

In [61]:
n_control_users = control_group['user_pseudo_id'].nunique()
n_treatment_users = treatment_group['user_pseudo_id'].nunique()

ads_impressions_control = control_group[control_group['event_name'] == 'ads_impression_custom']
ads_impressions_treatment = treatment_group[treatment_group['event_name'] == 'ads_impression_custom']

total_control_impressions = len(ads_impressions_control)
total_treatment_impressions = len(ads_impressions_treatment)

avg_impressions_per_user_control = total_control_impressions / n_control_users
avg_impressions_per_user_treatment = total_treatment_impressions / n_treatment_users

print("Average ad impressions per user (control):", round(avg_impressions_per_user_control, 2))
print("Average ad impressions per user (treatment):", round(avg_impressions_per_user_treatment, 2))

Average ad impressions per user (control): 398.46
Average ad impressions per user (treatment): 401.64


The control group has an average of 398.46 ad impressions per user, while the treatment group has 401.64.

In [None]:
impressions_per_user_control = ads_impressions_control.groupby('user_pseudo_id').size()
impressions_per_user_treatment = ads_impressions_treatment.groupby('user_pseudo_id').size()

t_stat, p_value = stats.ttest_ind(
    impressions_per_user_control,
    impressions_per_user_treatment,
    equal_var=False  
)

print("Welch’s t-test statistic:", round(t_stat, 2))
print("p-value:", round(p_value, 2))

Welch’s t-test statistic: -0.39
p-value: 0.7


Although the treatment group shows a slightly higher mean, we use Welch’s t-test to assess whether this difference is statistically significant. Bartlett’s test was run on vote clicks, not on ad impressions, so equal variances cannot be assumed for this metric; Welch’s t-test is the safer choice because it does not require them. \
The test resulted in a t-statistic of -0.39 and a two-sided p-value of 0.70, which is much higher than the common significance threshold of 0.05.
This means that we fail to reject the null hypothesis: we do not observe a statistically significant difference in average ad impressions per user between the control and treatment groups. The observed difference is consistent with random variation.

**Guardrail metrics:**

- Retention (1, 5, 10, 15, and 17 days after the first event)

We calculated retention rates for 1, 5, 10, 15, and 17 days after the first event separately for users in the control and treatment groups.

Next, we performed a two-proportion Z-test to statistically compare the retention rates between the two groups for each of these time points. The two-proportion Z-test evaluates whether the difference between two population proportions is statistically significant by comparing the observed difference relative to the variability expected under the null hypothesis of no difference.
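
To make the test concrete, the sketch below computes the pooled two-proportion z-statistic and its two-sided p-value by hand, using only the standard library. This is the same pooled form that `proportions_ztest` uses by default. `two_proportion_ztest` is a hypothetical helper and the counts are invented for illustration.

```python
import math

def two_proportion_ztest(count_a, nobs_a, count_b, nobs_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = count_a / nobs_a
    p_b = count_b / nobs_b
    # Under H0 both groups share one proportion, estimated from the pooled data
    p_pool = (count_a + count_b) / (nobs_a + nobs_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / nobs_a + 1 / nobs_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 690 of 1000 users retained vs. 700 of 1000
z, p = two_proportion_ztest(690, 1000, 700, 1000)
print("z-statistic:", round(z, 2))
print("p-value:", round(p, 2))

# Identical proportions give z = 0 and p = 1
z_same, p_same = two_proportion_ztest(500, 1000, 500, 1000)
```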

In [31]:
control_group['event_date'] = pd.to_datetime(control_group['event_date'], format='%Y%m%d')
treatment_group['event_date'] = pd.to_datetime(treatment_group['event_date'], format='%Y%m%d')

In [32]:
first_event_control = control_group.groupby('user_pseudo_id')['event_date'].min()
first_event_treatment = treatment_group.groupby('user_pseudo_id')['event_date'].min()

In [33]:
control_group = control_group.merge(first_event_control.rename('first_date'), on='user_pseudo_id')
treatment_group = treatment_group.merge(first_event_treatment.rename('first_date'), on='user_pseudo_id')

In [36]:
control_group['days_since_first_event'] = (control_group['event_date'] - control_group['first_date']).dt.days
treatment_group['days_since_first_event'] = (treatment_group['event_date'] - treatment_group['first_date']).dt.days

In [None]:
days = [1, 5, 10, 15, 17]
retention_data = []

for day in days:
    active_users_c = control_group[control_group['days_since_first_event'] == day]['user_pseudo_id'].nunique()
    total_users_c = control_group['user_pseudo_id'].nunique()
    retention_rate_c = active_users_c / total_users_c if total_users_c > 0 else 0

    active_users_t = treatment_group[treatment_group['days_since_first_event'] == day]['user_pseudo_id'].nunique()
    total_users_t = treatment_group['user_pseudo_id'].nunique()
    retention_rate_t = active_users_t / total_users_t if total_users_t > 0 else 0

    counts = [active_users_c, active_users_t]
    nobs = [total_users_c, total_users_t]
    z_stat, p_val = proportions_ztest(counts, nobs)

    retention_data.append({
        'day_since_first_event': day,
        'retention_rate_control': round(retention_rate_c, 2),
        'active_users_control': active_users_c,
        'retention_rate_treatment': round(retention_rate_t, 2),
        'active_users_treatment': active_users_t,
        'z_statistic': round(z_stat, 2),
        'p_value': round(p_val, 2)
    })

retention_df = pd.DataFrame(retention_data)

In [64]:
retention_df.head(5)

Unnamed: 0,day_since_first_event,retention_rate_control,active_users_control,retention_rate_treatment,active_users_treatment,z_statistic,p_value
0,1,0.69,13493,0.69,14123,0.57,0.57
1,5,0.44,8665,0.44,9132,-0.27,0.79
2,10,0.22,4389,0.23,4630,-0.21,0.83
3,15,0.05,995,0.05,978,1.52,0.13
4,17,0.02,381,0.02,418,-0.62,0.54


The results show that all p-values are greater than 0.05, indicating that there is no statistically significant difference in retention rates between the control and treatment groups at any of the evaluated time points.
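
One caveat: the denominators above include every user in a group, even users whose first event is too close to the end of the data window to be observed on day N. A possible refinement, which the notebook does not apply, is to count only users who could have reached day N. The sketch below illustrates this on toy data. `retention_on_day` is a hypothetical helper, and it assumes the `first_date` and `days_since_first_event` columns built earlier.

```python
import pandas as pd

def retention_on_day(group, day):
    """Return (active_users, eligible_users) for activity exactly `day` days after the first event.

    A user is eligible only if first_date + day falls within the observed date range.
    """
    last_date = group['event_date'].max()
    eligible = group[group['first_date'] + pd.Timedelta(days=day) <= last_date]
    eligible_users = eligible['user_pseudo_id'].nunique()
    active_users = eligible.loc[
        eligible['days_since_first_event'] == day, 'user_pseudo_id'
    ].nunique()
    return active_users, eligible_users

# Toy data for illustration only
toy = pd.DataFrame({
    'user_pseudo_id': ['a', 'a', 'b', 'c'],
    'event_date': pd.to_datetime(['2024-02-01', '2024-02-03', '2024-02-01', '2024-02-03']),
})
toy['first_date'] = toy.groupby('user_pseudo_id')['event_date'].transform('min')
toy['days_since_first_event'] = (toy['event_date'] - toy['first_date']).dt.days

# User 'c' starts on the last observed date, so it is not eligible for day 2
active, eligible = retention_on_day(toy, day=2)
print(active, eligible)
```

The resulting pairs can be passed to `proportions_ztest` as `counts` and `nobs` in place of the totals used in the loop.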