# **Basics of AB testing for Mean and Proportions**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

In [2]:
trx_amount_df = pd.read_csv('datasets/trx_amount_df.csv')
payment_duration_df = pd.read_csv('datasets/payment_duration_df.csv')

## **Normal Distribution and Z-Score**

In [4]:
# IQ follows a normal distribution with mean=100, std=15
# What is the probability that IQ lies between 120 and 140?
dist = stats.norm(loc=100, scale=15)
prob = dist.cdf(140) - dist.cdf(120)
print(f"Probability IQ between 120 and 140 is {prob:.4f}")

Probability IQ between 120 and 140 is 0.0874


In [4]:
# Convert a significance level (alpha) into the critical z-value
alpha = 0.05
n_sided = 2 # in practice, it's (almost) always 2 sided test
z_crit = stats.norm.ppf(1-alpha/n_sided)
print(f"Z-critical value for alpha = 0.05(two-tailed) is -{z_crit:.2f} and +{z_crit:.2f}")

Z-critical value for alpha = 0.05(two-tailed) is -1.96 and +1.96
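As a quick check on the critical value: `ppf` is the inverse of `cdf`, so the area between -z* and +z* should equal 1 - alpha. The sketch below is illustrative only; the one-sided value is included for comparison and the variable names are not part of the original notebook.

```python
from scipy import stats

alpha = 0.05
z_two_sided = stats.norm.ppf(1 - alpha / 2)  # alpha split across both tails
z_one_sided = stats.norm.ppf(1 - alpha)      # all of alpha in the upper tail

# Area under the standard normal curve between -z* and +z*
central_area = stats.norm.cdf(z_two_sided) - stats.norm.cdf(-z_two_sided)

print(f"two-sided z*: {z_two_sided:.4f}")
print(f"one-sided z*: {z_one_sided:.4f}")
print(f"area within +/- z*: {central_area:.4f}")
```

The two-sided critical value is larger than the one-sided one because each tail only receives alpha/2.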


### Exercise

In [6]:
# Calculate P(100_000 < X < 150_000) for
# normally distributed data with mean=120_000, std=50_000
dist_e1 = stats.norm(loc=120000, scale=50000)
prob_e1 = dist_e1.cdf(150000) - dist_e1.cdf(100000)
print(f"Probability of data measured between 100.000 and 150.000 is {prob_e1:.4f}")

Probability of data measured between 100.000 and 150.000 is 0.3812


In [7]:
# Compute P(X > 1.96) for a standard normal variable (mean=0, std=1)
dist2 = stats.norm(loc=0, scale=1)
prob2 = 1 - dist2.cdf(1.96)
print(f"Probability X > 1.96 is {prob2:.4f}")

Probability X > 1.96 is 0.0250


## **Student's T-test**

### Transaction Amount Data

In [5]:
trx_amount_df.head(3)

Unnamed: 0,control,variant
0,149900,122899
1,124500,233400
2,155900,149400


In [12]:
# summary statistics
trx_amount_df.describe()

Unnamed: 0,control,variant
count,30.0,30.0
mean,122476.6,144553.266667
std,35996.270772,41901.599542
min,53500.0,61800.0
25%,106375.0,118075.0
50%,120600.0,147100.0
75%,144400.0,174500.0
max,193200.0,233400.0


From the mean values: variant > control. Is this difference significant? We can test it using the two-sample Student's t-test. We use the t-test instead of the z-test because the population variance is unknown and has to be estimated from the samples.

In [14]:
# t-test
t_stat, p_val = stats.ttest_ind(trx_amount_df['variant'], 
                                trx_amount_df['control'],
                                alternative='two-sided')
print(f"t_stat is {t_stat:.4f}")
print(f"p_val is {p_val:.4f}")

t_stat is 2.1890
p_val is 0.0326


Since the p-value < alpha = 0.05 (two-tailed), we can reject H0, which states that there is no difference between the means of the two groups.
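The same test can be reproduced from the summary statistics alone, which makes it clearer what goes into the statistic. This is a minimal sketch that assumes the means, standard deviations and counts printed by `describe()` above; it uses `scipy.stats.ttest_ind_from_stats` with `equal_var=True`, the pooled-variance form that `ttest_ind` applies by default.

```python
from scipy import stats

# Summary statistics copied from trx_amount_df.describe() above
mean_control, std_control, n_control = 122476.6, 35996.270772, 30
mean_variant, std_variant, n_variant = 144553.266667, 41901.599542, 30

# equal_var=True gives the pooled (Student) two-sample t-test
t_stat, p_val = stats.ttest_ind_from_stats(
    mean_variant, std_variant, n_variant,
    mean_control, std_control, n_control,
    equal_var=True,
)
print(f"t_stat is {t_stat:.4f}")
print(f"p_val is {p_val:.4f}")
```

Passing `equal_var=False` instead would run Welch's test, which does not assume equal variances.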

However, the p-value alone is not the best way to evaluate an A/B test of two samples. We can also look at the confidence interval (CI) for the difference between the two sample means, which shows the size of the effect as well as its uncertainty. How the CI is calculated depends on whether we compare the means with a z-test or a t-test.

For CI with Z-score:
$$
\begin{align*}
  CI_{\alpha=0.05} &= (\bar{x_1}-\bar{x_2})\pm z^*\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}} \\
  z^* &= z_{(1-\alpha/2)}
\end{align*}
$$

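For a CI with the t-score, first determine whether we can use the pooled variance of the two samples. One way to do this is with an F-test on the ratio of the sample variances, compared against a critical value $F^*$ of the F distribution:

$$
\begin{align*}
  F &= \frac{s_{larger}^2}{s_{smaller}^2}, \quad F^* = F_{(1-\alpha/2),\ n_{larger}-1,\ n_{smaller}-1}\\
  &\text{if } F \leq F^*, \text{ then pool}\\
  &\text{if } F > F^*, \text{ then don't pool}
\end{align*}
$$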
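The value to compare a variance ratio against is a quantile of the F distribution. Below is a minimal sketch for the transaction amount data, assuming the standard deviations printed by `describe()` above. Because the larger variance is always placed in the numerator, the upper `1 - alpha/2` quantile is used. The names `f_stat`, `f_crit` and `pool` are illustrative.

```python
from scipy import stats

alpha = 0.05
# Sample standard deviations and sizes from trx_amount_df.describe() above
s_control, n_control = 35996.270772, 30
s_variant, n_variant = 41901.599542, 30

# Put the larger variance in the numerator and track its degrees of freedom
if s_variant >= s_control:
    f_stat = s_variant**2 / s_control**2
    dfn, dfd = n_variant - 1, n_control - 1
else:
    f_stat = s_control**2 / s_variant**2
    dfn, dfd = n_control - 1, n_variant - 1

# Upper critical value of the F distribution for a two-sided test
f_crit = stats.f.ppf(1 - alpha / 2, dfn, dfd)
pool = f_stat <= f_crit

print(f"F-score is {f_stat:.4f}")
print(f"Critical F-score is {f_crit:.4f}")
print("Pool the variances" if pool else "Do not pool the variances")
```

The F-test is sensitive to departures from normality; `scipy.stats.levene` is a commonly used alternative when that is a concern.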

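CI with t-score and unpooled variance (Welch's approximation for $df$)
$$
\begin{align*}
  CI_{\alpha=0.05} &= (\bar{x_1}-\bar{x_2})\pm t^*\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} \\
  t^* &= t_{(1-\alpha/2),\ df} \\
  df &= \frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1}+\frac{(s_2^2/n_2)^2}{n_2-1}}
\end{align*}
$$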
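The unpooled interval can be sketched as follows for the transaction amount data. The inputs are the summary statistics printed by `describe()` above, and the degrees of freedom use the Welch-Satterthwaite approximation; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

alpha = 0.05
# Summary statistics from trx_amount_df.describe() above
mean_control, std_control, n_control = 122476.6, 35996.270772, 30
mean_variant, std_variant, n_variant = 144553.266667, 41901.599542, 30

# Per-group variance of the sample mean
v_control = std_control**2 / n_control
v_variant = std_variant**2 / n_variant
std_error = np.sqrt(v_control + v_variant)

# Welch-Satterthwaite degrees of freedom (generally not an integer)
df_welch = (v_control + v_variant) ** 2 / (
    v_control**2 / (n_control - 1) + v_variant**2 / (n_variant - 1)
)

t_crit = stats.t.ppf(1 - alpha / 2, df_welch)
delta_mean = mean_variant - mean_control
ci_lower = delta_mean - t_crit * std_error
ci_upper = delta_mean + t_crit * std_error

print(f"Welch df is {df_welch:.2f}")
print(f"95% CI (unpooled) is ({ci_lower:.2f}, {ci_upper:.2f})")
```

With equal group sizes the unpooled and pooled standard errors coincide, so the two intervals differ only through the degrees of freedom.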

CI with t-score and pooled variance ($s_p^2$)
$$
\begin{align*}
  CI_{\alpha=0.05} &= (\bar{x_1}-\bar{x_2})\pm t^*(s_p)\sqrt{\frac{1}{n_1}+\frac{1}{n_2}} \\
  t^* &= t_{(1-\alpha/2),\ df}, \quad df=n_1+n_2-2 \\
  s_p &= \sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}
\end{align*}
$$


In [23]:
alpha = 0.05; n_sided = 2; n_control = 30; n_variant = 30
s_control = stats.tstd(trx_amount_df['control'].to_numpy())
s_variant = stats.tstd(trx_amount_df['variant'].to_numpy())

F_score = (s_control**2)/(s_variant**2) if s_control > s_variant else (s_variant**2)/(s_control**2)
t_crit = stats.t.ppf(1-alpha/n_sided, n_control+n_variant-2)
print(f"F-score is {F_score:.4f}")
print(f"Critical t-score is {t_crit:.4f}")  # t* for the CI below, which uses the pooled variance

delta_mean = trx_amount_df['variant'].mean() - trx_amount_df['control'].mean()
pooled_std = np.sqrt(((n_control-1)*s_control**2 + (n_variant-1)*s_variant**2)/ \
                     (n_control+n_variant-2))
CI_upper_t_pooled = delta_mean + t_crit*pooled_std*np.sqrt(1/n_control+1/n_variant)
CI_lower_t_pooled = delta_mean - t_crit*pooled_std*np.sqrt(1/n_control+1/n_variant)
print("95% Confidence interval of transaction amount difference between variant and control is "
     f"({CI_lower_t_pooled:.2f}, {CI_upper_t_pooled:.2f})")

F-score is 1.3550
Critical t-score is 2.0017
95% Confidence interval of transaction amount difference between variant and control is (1888.49, 42264.84)


In [24]:
# Comparing it with CI using the z-score approach
z_crit = stats.norm.ppf(1-alpha/n_sided)
print(f"Critical z-score is {z_crit:.2f}")
std_error = np.sqrt((s_control**2/n_control)+(s_variant**2/n_variant))
CI_upper_z = delta_mean+z_crit*std_error
CI_lower_z = delta_mean-z_crit*std_error
print("95% Confidence interval of transaction amount difference between variant and control is "
     f"({CI_lower_z:.2f}, {CI_upper_z:.2f})")

Critical z-score is 1.96
95% Confidence interval of transaction amount difference between variant and control is (2309.59, 41843.74)


**CONCLUSION:**

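Both intervals lie entirely above zero, so a zero difference between the variant and control means is not among the values compatible with the data at the 95% confidence level. This agrees with the t-test result above. Note what the 95% refers to: if the experiment were replicated many times, about 95% of the intervals built this way would contain the true difference in means. It does not say that 95% of replications would show a positive difference.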
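One way to see what the 95% in a confidence interval refers to is a small simulation. The sketch below draws many replicated experiments from two normal populations and counts how often the pooled t-interval contains the true difference. The population parameters (`true_diff`, `sd`, the control mean) are hypothetical values chosen for illustration, not estimates from the data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps, alpha = 30, 2000, 0.05
true_diff = 20_000   # hypothetical true difference in means
sd = 40_000          # hypothetical common standard deviation

# Each row is one replicated experiment with n observations per group
control = rng.normal(120_000, sd, size=(reps, n))
variant = rng.normal(120_000 + true_diff, sd, size=(reps, n))

delta = variant.mean(axis=1) - control.mean(axis=1)
# Pooled variance reduces to the average of the two variances when n is equal
pooled_var = (control.var(axis=1, ddof=1) + variant.var(axis=1, ddof=1)) / 2
half_width = stats.t.ppf(1 - alpha / 2, 2 * n - 2) * np.sqrt(pooled_var * 2 / n)

covered = (delta - half_width <= true_diff) & (true_diff <= delta + half_width)
coverage = covered.mean()
positive_share = (delta > 0).mean()

print(f"Share of intervals containing the true difference: {coverage:.3f}")
print(f"Share of replications with a positive difference: {positive_share:.3f}")
```

The coverage should land near 0.95 whatever the effect size is, while the share of positive differences depends on how large the true effect is relative to the noise.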

### Exercise

Do the same analysis for the effect of the treatment on customers' payment duration, which is the time they took to finish a payment.

In [41]:
payment_duration_df.head(3)

Unnamed: 0,control,variant
0,3.46,4.38
1,4.1,4.14
2,3.56,3.02


In [7]:
# summary statistics
payment_duration_df.describe()

Unnamed: 0,control,variant
count,1000.0,1000.0
mean,3.39082,3.49901
std,0.5168,0.625253
min,1.65,1.59
25%,3.05,3.11
50%,3.4,3.47
75%,3.7325,3.91
max,4.96,5.44


In [15]:
# t-test
t_stat, p_val = stats.ttest_ind(payment_duration_df['variant'], 
                                payment_duration_df['control'])
print(f"t_stat is {t_stat:.4f}")
print(f"p_val is {p_val:.4f}")

t_stat is 4.2176
p_val is 0.0000


Since the p-value < alpha = 0.05 (two-tailed), we can reject H0, which states that there is no difference between the means of the two groups.

In [25]:
alpha = 0.05; n_sided = 2; n_control = 1000; n_variant = 1000
s_control = stats.tstd(payment_duration_df['control'].to_numpy())
s_variant = stats.tstd(payment_duration_df['variant'].to_numpy())

F_score = (s_control**2)/(s_variant**2) if s_control > s_variant else (s_variant**2)/(s_control**2)
t_crit = stats.t.ppf(1-alpha/n_sided, n_control+n_variant-2)
print(f"F-score is {F_score:.4f}")
print(f"Critical t-score is {t_crit:.4f}")  # t* for the CI below, which uses the pooled variance

delta_mean = payment_duration_df['variant'].mean() - payment_duration_df['control'].mean()
pooled_std = np.sqrt(((n_control-1)*s_control**2 + (n_variant-1)*s_variant**2)/ \
                     (n_control+n_variant-2))
CI_upper_t_pooled = delta_mean + t_crit*pooled_std*np.sqrt(1/n_control+1/n_variant)
CI_lower_t_pooled = delta_mean - t_crit*pooled_std*np.sqrt(1/n_control+1/n_variant)
print("95% Confidence interval of payment duration difference between variant and control is "
     f"({CI_lower_t_pooled:.2f}, {CI_upper_t_pooled:.2f})")

F-score is 1.4637
Critical t-score is 1.9612
95% Confidence interval of payment duration difference between variant and control is (0.06, 0.16)


In [26]:
# Comparing it with CI using the z-score approach
z_crit = stats.norm.ppf(1-alpha/n_sided)
print(f"Critical z-score is {z_crit:.2f}")
std_error = np.sqrt((s_control**2/n_control)+(s_variant**2/n_variant))
CI_upper_z = delta_mean+z_crit*std_error
CI_lower_z = delta_mean-z_crit*std_error
print("95% Confidence interval of payment duration difference between variant and control is "
     f"({CI_lower_z:.2f}, {CI_upper_z:.2f})")

Critical z-score is 1.96
95% Confidence interval of payment duration difference between variant and control is (0.06, 0.16)


**CONCLUSION:**

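Both intervals lie entirely above zero, so a zero difference between the variant and control mean payment durations is not among the values compatible with the data at the 95% confidence level. With 1000 observations per group, the t-based and z-based intervals agree to two decimals. As before, the 95% describes how often intervals built this way would contain the true difference across replications, not the share of replications that would show a positive difference.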
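The t-based and z-based intervals nearly coincide here because the t distribution approaches the normal distribution as the degrees of freedom grow. A short sketch comparing the critical values for the two sample sizes used in this notebook (df = 58 and df = 1998); the dictionary name is illustrative:

```python
from scipy import stats

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)

# Critical t-values for n1 + n2 - 2 with 30 and 1000 observations per group
t_crit_by_df = {df: stats.t.ppf(1 - alpha / 2, df) for df in (58, 1998)}

print(f"z*            : {z_crit:.4f}")
for df, t_crit in t_crit_by_df.items():
    print(f"t* (df={df:>4}): {t_crit:.4f}")
```

With 30 observations per group the t critical value is visibly larger than z*, which is why the t-interval for transaction amount is the wider of the two.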

## **Z-proportion test**

If the parameter that we want to compare is in the form of proportions, we can use the Z-proportion test for it. 

We don't need a t-test equivalent here because the data are binomial (i.e., the number of 'successes' out of a known total of 'trials') rather than normal. In the binomial distribution, the **standard deviation is a function of the mean**, so **once you have estimated the mean there is no separate variance parameter left to estimate**. With large enough samples, the normal distribution can therefore be used as a model of the sampling distribution of the test statistic.

In [31]:
# Suppose we have the following experiment data
count = np.array([7, 13])
nobs = np.array([83, 99])

prop_df = pd.DataFrame({
    'event': count,
    'sample': nobs,
    'proportion': (count/nobs).round(4)
}, index=['control','variant'])
prop_df

Unnamed: 0,event,sample,proportion
control,7,83,0.0843
variant,13,99,0.1313


From the data, the proportion in the variant group is higher than in control. Is this difference significant?

In [32]:
# z-proportion test
from statsmodels.stats.proportion import proportions_ztest
z_stat, p_val = proportions_ztest(prop_df['event'], 
                                  prop_df['sample'])
print(f"z_stat is {z_stat:.4f}")
print(f"p_val is {p_val:.4f}")

z_stat is -1.0092
p_val is 0.3129


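The p-value is greater than 0.05, so there is not enough evidence to reject H0. The z-statistic is negative only because control is listed first, so it measures control minus variant.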

Similar to comparing two means of continuous parameters, we can also evaluate the confidence intervals (CI) for the difference between proportions of two samples.

CI of proportion with Z-score:
$$
\begin{align*}
  CI &= (\hat{p_1}-\hat{p_2})\pm z_{crit}\cdot SE\\
  SE &= \sqrt{\frac{\hat{p_1}(1-\hat{p_1})}{n_1}+\frac{\hat{p_2}(1-\hat{p_2})}{n_2}}
\end{align*}
$$

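However, the distribution of the test statistic is always evaluated as though the null hypothesis were true, i.e. $p_1 = p_2$ in this case. Under that assumption both samples estimate the same proportion, so we can use the **pooled proportion estimate** to verify the success-failure condition and also **to estimate the standard error.** The pooled form is what the z-test statistic uses; the cell below also uses it for the interval, whereas the unpooled standard error above is the more common choice for a confidence interval.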
$$
\begin{align*}
  SE &= \sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1}+\frac{1}{n_2})}\\
  \hat{p} &= \frac{e_1+e_2}{n_1+n_2}
\end{align*}
$$
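The pooled standard error is the one the z-test reported earlier is built on. As a minimal sketch, the statistic and p-value can be recomputed by hand from the counts used above, with control listed first as in `prop_df`:

```python
import numpy as np
from scipy import stats

# Same experiment data as above: [control, variant]
count = np.array([7, 13])
nobs = np.array([83, 99])

p_hat = count / nobs                 # per-group proportions
p_pool = count.sum() / nobs.sum()    # pooled proportion under H0

# Pooled standard error of the difference in proportions
std_error = np.sqrt(p_pool * (1 - p_pool) * (1 / nobs[0] + 1 / nobs[1]))

# First minus second, i.e. control minus variant
z_stat = (p_hat[0] - p_hat[1]) / std_error
p_val = 2 * stats.norm.sf(abs(z_stat))  # two-sided p-value

print(f"z_stat is {z_stat:.4f}")
print(f"p_val is {p_val:.4f}")
```

Swapping the order of the two groups flips the sign of the z-statistic but leaves the two-sided p-value unchanged.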


In [33]:
# Computing the confidence interval of proportions difference
p_control = prop_df.loc['control','proportion']
p_variant = prop_df.loc['variant','proportion']
n_control = prop_df.loc['control','sample']
n_variant = prop_df.loc['variant','sample']

p_pool = prop_df['event'].sum()/prop_df['sample'].sum()
std_error = np.sqrt(p_pool*(1-p_pool)*(1/n_control+1/n_variant))
prop_delta = p_variant - p_control
CI_upper_z_pooled = prop_delta+z_crit*std_error
CI_lower_z_pooled = prop_delta-z_crit*std_error

print("95% Confidence interval of proportion difference between variant and control is "
      f"({CI_lower_z_pooled:.4f}, {CI_upper_z_pooled:.4f})")

95% Confidence interval of proportion difference between variant and control is (-0.0442, 0.1382)


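Since p1 - p2 = 0 lies inside the interval, we cannot rule out that the control and variant proportions are equal: the data do not provide enough evidence of a difference at the 95% confidence level. This is not the same as being 95% confident that there is no difference.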
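For comparison, the interval can also be computed with the unpooled standard error from the first CI formula in this section. This is a sketch using the same counts, with proportions left unrounded; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Same experiment data as above: [control, variant]
count = np.array([7, 13])
nobs = np.array([83, 99])

p_control, p_variant = count / nobs
n_control, n_variant = nobs

# Unpooled standard error: each group contributes its own variance
se_unpooled = np.sqrt(
    p_control * (1 - p_control) / n_control
    + p_variant * (1 - p_variant) / n_variant
)

z_crit = stats.norm.ppf(0.975)  # two-sided, alpha = 0.05
prop_delta = p_variant - p_control
ci_lower = prop_delta - z_crit * se_unpooled
ci_upper = prop_delta + z_crit * se_unpooled

print("95% CI (unpooled SE) of proportion difference is "
      f"({ci_lower:.4f}, {ci_upper:.4f})")
```

For this data the unpooled interval also contains zero, so the choice of standard error does not change the conclusion.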

### Exercise
Use this data on the proportion of transactions made under two different voucher schemes to test whether the new (variant) scheme results in a significant difference in proportion.

In [34]:
# suppose we have the following experiment data
count = np.array([1513, 1853])
nobs = np.array([15646, 15130])

trx_turnover_df = pd.DataFrame({
    'event': count,
    'sample': nobs,
    'proportion': (count/nobs).round(4)
}, index=['control','variant'])

trx_turnover_df

Unnamed: 0,event,sample,proportion
control,1513,15646,0.0967
variant,1853,15130,0.1225


From the data, the proportion in the variant group is higher than in control. Is this difference significant?

In [35]:
# z-proportion test
from statsmodels.stats.proportion import proportions_ztest

z_stat, p_val = proportions_ztest(trx_turnover_df['event'], 
                                  trx_turnover_df['sample'])
print(f"z_stat is {z_stat:.4f}")
print(f"p_val is {p_val:.4f}")

z_stat is -7.2415
p_val is 0.0000


Since the p-value < 0.05, there is strong enough evidence to reject H0: the difference between the proportions is statistically significant. The negative z-statistic only reflects the order of the inputs (control minus variant).

In [36]:
# Computing the confidence interval of proportions difference
p_control = trx_turnover_df.loc['control','proportion']
p_variant = trx_turnover_df.loc['variant','proportion']
n_control = trx_turnover_df.loc['control','sample']
n_variant = trx_turnover_df.loc['variant','sample']

p_pool = trx_turnover_df['event'].sum()/trx_turnover_df['sample'].sum()
std_error = np.sqrt(p_pool*(1-p_pool)*(1/n_control+1/n_variant))
prop_delta = p_variant - p_control
CI_upper_z_pooled = prop_delta+z_crit*std_error
CI_lower_z_pooled = prop_delta-z_crit*std_error

print("95% Confidence interval of proportion difference between variant and control is "
      f"({CI_lower_z_pooled:.4f}, {CI_upper_z_pooled:.4f})")

95% Confidence interval of proportion difference between variant and control is (0.0188, 0.0328)


Since p1 - p2 = 0 is not in the interval, the difference in proportions between the two treatments is statistically significant at the 5% level, and the interval suggests the variant's proportion is higher by roughly 1.9 to 3.3 percentage points.