
# Hypothesis Testing in Action!

Hypothesis testing often involves experiments. Designing an experiment means settling several things up front, such as the sample size, the target goal, and the confidence level. This notebook walks through each of these topics using an A/B test.

## What is A/B Testing?

- A popular form of experimental analysis: hypothesis testing based on actual experiments
- Comparing **two** products / designs / services / website designs / groups
- E.g., one group of users with the original product (<ins>control group</ins>) & another group of users with the newly designed product (<ins>experiment group</ins>)
- Popular in
    - marketing (e.g., two products, two ads, two prices)
    - MIS (e.g., two new website designs, two software user interfaces)
    - supply chain (e.g., two store layouts)
    - engineering (e.g., two product designs)
    - agribusiness (e.g., two soil treatments)
    - medicine (e.g., two therapies)
    - ...

## Data & Hypothesis

In [14]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [15]:
#import packages
import pandas as pd
import numpy as np
import scipy.stats as stats
import math
import statsmodels.stats.api as sms
import scipy.stats as st

In [16]:
#import data

df = pd.read_csv('/content/drive/My Drive/665/6_z_ABTesting/ab_data_cleaned.csv')

print("Number of rows: ", df.shape[0], " Number of columns: ", df.shape[1])
df.head()

Number of rows:  290584  Number of columns:  5


|   | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 11:48.6 | control | old_page | 0 |
| 1 | 804228 | 01:45.2 | control | old_page | 0 |
| 2 | 661590 | 55:06.2 | treatment | new_page | 0 |
| 3 | 853541 | 28:03.1 | treatment | new_page | 0 |
| 4 | 864975 | 52:26.2 | control | old_page | 1 |


Two user groups

- The control group sees the old webpage design; the treatment group sees the new webpage design

Converted

- 0: the user did not purchase the product
- 1: the user purchased the product

## Testing Null Hypothesis

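- Null hypothesis: the probability of conversion in the treatment group **=** the probability of conversion in the control group
- If the p-value falls below the chosen significance level (e.g., 0.05), reject the null hypothesis and conclude that the two conversion rates differ. The treatment (or new design) works only if that difference also favors the treatment group.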
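
Written out formally, and assuming a two-sided test (the default for the `proportions_ztest` call used later in this notebook), the hypotheses can be stated as follows. The symbols $p_c$ and $p_t$ are shorthand introduced here for the true conversion probabilities of the control and treatment groups; they do not appear in the original code.

$$
H_0:\; p_t = p_c
\qquad \text{vs.} \qquad
H_1:\; p_t \neq p_c
$$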

<img src='https://assets.justinmind.com/wp-content/webp-express/webp-images/uploads/2019/12/obama-email-campaign-ab-testing-example.png.webp'>

## Data understanding & transformation

### Probability of conversion per group

In [17]:
# how many in each group?

control = df.loc[df["group"] == "control"]   # control group
treatment = df.loc[df["group"] == "treatment"]   # treatment group
print(len(control), len(treatment))

145274 145310


In [18]:
# share of users in each group (as a fraction of all users)

print(len(control)/len(df))
print(len(treatment)/len(df))

0.4999380557773312
0.5000619442226688


Control group conversion rate

In [19]:
conversions_control = df.loc[(df["group"] == "control") & (df['converted'] == 1)]

print("Number of control users who converted on old page: ", len(conversions_control))
print("Percentage of control users who converted: ", round((len(conversions_control) / len(control)) * 100, 2), "%")

Number of control users who converted on old page:  17489
Percentage of control users who converted:  12.04 %


Treatment group conversion rate

In [20]:
conversions_treatment = df.loc[(df["group"] == "treatment") & (df['converted'] == 1)]

print("Number of treatment users who converted on new page: ", len(conversions_treatment))
print("Percentage of treatment users who converted: ", round((len(conversions_treatment) / len(treatment)) * 100, 2), "%")

Number of treatment users who converted on new page:  17264
Percentage of treatment users who converted:  11.88 %


### Summary of conversion rates of the two groups

- [Mean](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html): Return the mean of the values for the requested axis.
- [Standard deviation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.std.html): Return sample standard deviation over requested axis
- [Standard error](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sem.html): Return unbiased standard error of the mean

In [21]:
print(df.groupby('group')['converted'].mean())   # mean
print(df.groupby('group')['converted'].std())    # standard deviation
print(df.groupby('group')['converted'].sem())    # standard error

group
control      0.120386
treatment    0.118808
Name: converted, dtype: float64
group
control      0.325414
treatment    0.323564
Name: converted, dtype: float64
group
control      0.000854
treatment    0.000849
Name: converted, dtype: float64
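
Because `converted` is a 0/1 column, the three statistics above are tied together by simple formulas. The sketch below recomputes the control group's values from the counts printed earlier, using only the standard library. It assumes the pandas defaults (`ddof=1`) for `std` and `sem`; the variable names are illustrative.

```python
import math

# Counts reported earlier in this notebook for the control group
n = 145274            # users in the control group
conversions = 17489   # control users who converted

p = conversions / n   # mean of a 0/1 column is the conversion rate

# Sample variance of a 0/1 column with the n - 1 correction (ddof=1)
variance = p * (1 - p) * n / (n - 1)
std = math.sqrt(variance)

# Standard error of the mean: sample standard deviation over sqrt(n)
sem = std / math.sqrt(n)

print(f"mean={p:.6f}  std={std:.6f}  sem={sem:.6f}")
```

The printed values should agree with the control row of the pandas output to the displayed precision.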


In [22]:
# table format
df.groupby('group')['converted'].agg(['mean', 'std', 'sem'])

# show three decimal points only
#df.groupby('group')['converted'].agg(['mean', 'std', 'sem']).style.format('{:.3f}')

| group | mean | std | sem |
|---|---|---|---|
| control | 0.120386 | 0.325414 | 0.000854 |
| treatment | 0.118808 | 0.323564 | 0.000849 |


**The two designs performed very similarly. In fact, the new design converted slightly worse (11.88% vs. 12.04%) :-(**

## A/B Test (Hypothesis Testing)

### Set Test Parameters

- Determine if the sample size is suitable: How many people should we have in each group?
- Determine the confidence level you want to have (e.g., 95 percent)

### Additional Terminologies

- Effect size: A standardized measure of the difference in conversion between the control group and the treatment group. Zero means the new design converts no differently from the old design; larger absolute values mean a bigger difference. For two proportions, `sms.proportion_effectsize` (used in the next cell) returns Cohen's h.
- Baseline rate: The conversion rate of the control group (12%)
- Practical significance: The smallest improvement from the new design that you would care about. For example, 2 percentage points means your target conversion rate is 14% (= baseline rate + practical significance)
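
To make the effect size concrete, here is a minimal sketch of Cohen's h, the arcsine-transformed difference between two proportions that `sms.proportion_effectsize` computes. The helper name `cohens_h` and the rounded rates 0.12 and 0.14 are illustrative assumptions, not part of the original notebook.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions: 2 * (asin(sqrt(p1)) - asin(sqrt(p2)))."""
    return 2 * (math.asin(math.sqrt(p1)) - math.asin(math.sqrt(p2)))

baseline = 0.12   # rounded control conversion rate (illustrative)
target = 0.14     # baseline plus a 2 percentage point uplift

h = cohens_h(baseline, target)
print(f"Cohen's h: {h:.4f}")
```

The sign only reflects the argument order (baseline first, target second); the power calculation depends on the magnitude.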

In [23]:
#Check what sample size is required

baseline_rate = len(conversions_control) / len(control)
print("Baseline rate: ", baseline_rate)

practical_significance = 0.02   # user defined: we'd be happy with an uplift of 2 percentage points (target rate of about 14%)
alpha = 0.05                    # user defined significance level, i.e. a 95% confidence level

effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + practical_significance)   # Cohen's h for a test comparing two proportions

# power = 0.8: an 80% chance of detecting the uplift if it is real; ratio = 1: equal group sizes
sample_size = sms.NormalIndPower().solve_power(effect_size = effect_size, power = 0.8, alpha = alpha, ratio=1)

print("Required sample size: ", round(sample_size), " per group")

Baseline rate:  0.1203863045004612
Required sample size:  4444  per group


We'd need **at least 4444 observations for each group.** With 145,274 control users and 145,310 treatment users, the dataset comfortably exceeds that.
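
As a rough cross-check on the required sample size, the usual closed-form approximation for a two-sided, two-sample z-test with equal group sizes is n = 2 * ((z_alpha + z_power) / h)^2 per group. The sketch below applies it with the standard library only. It is an approximation, not the statsmodels solver, so it may differ from `solve_power` by an observation or so; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate per-group sample size for a two-sided two-proportion z-test."""
    # Cohen's h: standardized difference between the two proportions
    h = 2 * (math.asin(math.sqrt(p1)) - math.asin(math.sqrt(p2)))
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the two-sided test
    z_power = NormalDist().inv_cdf(power)           # quantile matching the desired power
    return 2 * ((z_alpha + z_power) / h) ** 2

baseline_rate = 17489 / 145274   # control conversion rate from this notebook
n_required = sample_size_per_group(baseline_rate, baseline_rate + 0.02)
print("Approximate required sample size per group:", math.ceil(n_required))
```

Halving the uplift you want to detect roughly quadruples the required sample size, because n scales with 1 / h^2.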

### Run A/B test

In [24]:
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

In [25]:
# how many people in each group?

nobs = [len(control), len(treatment)]
nobs

[145274, 145310]

In [26]:
# how many per each group converted?

successes = [len(conversions_control), len(conversions_treatment)]
successes

[17489, 17264]

In [27]:
z_stat, pval = proportions_ztest(successes, nobs=nobs)
(lower_con, lower_treat), (upper_con, upper_treat) = proportion_confint(successes, nobs=nobs, alpha=0.05)

print(f'z statistic: {z_stat:.2f}')
print(f'p-value: {pval:.3f}')
print(f'ci 95% for control group: [{lower_con:.3f}, {upper_con:.3f}]')
print(f'ci 95% for treatment group: [{lower_treat:.3f}, {upper_treat:.3f}]')

z statistic: 1.31
p-value: 0.190
ci 95% for control group: [0.119, 0.122]
ci 95% for treatment group: [0.117, 0.120]
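
To see where the z statistic and p-value come from, the following sketch redoes the two-sided, pooled two-proportion z-test by hand with the counts from this notebook. It uses only the standard library and is meant as an illustration of the calculation, not a replacement for `proportions_ztest`; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Counts reported earlier in this notebook
n_control, n_treatment = 145274, 145310
x_control, x_treatment = 17489, 17264

p_control = x_control / n_control
p_treatment = x_treatment / n_treatment

# Pooled conversion rate under the null hypothesis of equal rates
p_pooled = (x_control + x_treatment) / (n_control + n_treatment)

# Standard error of the difference under the null hypothesis
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_treatment))

z = (p_control - p_treatment) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value

print(f"z statistic: {z:.2f}")
print(f"p-value: {p_value:.3f}")
```

The z statistic is positive because the control rate is entered first and is the higher of the two.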


## Conclusions

Since our **p-value=0.190 is well above our α=0.05 threshold (95 percent confidence level)**, we cannot reject the null hypothesis H₀, meaning that **our new design did not perform significantly differently (let alone better) than our old one :(**

Additionally, if we look at the confidence interval for the treatment group ([0.117, 0.120], or 11.7-12.0%) we notice that:

    1. It includes our baseline conversion rate of about 12.0%, though only barely, at the upper edge of the interval
    2. It does not include our target value of 14% (the 2 percentage point uplift we were aiming for)

What this means is that the true conversion rate of the new design is more likely to be close to our baseline than to the 14% target we had hoped for. This is further evidence that our new design is unlikely to be an improvement on our old design.
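
A complementary way to read the result is a confidence interval for the difference between the two conversion rates. The sketch below builds a simple Wald (normal approximation) 95% interval from the counts in this notebook. This interval is an addition for illustration and was not computed in the original analysis; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Counts reported earlier in this notebook
n_control, n_treatment = 145274, 145310
x_control, x_treatment = 17489, 17264

p_control = x_control / n_control
p_treatment = x_treatment / n_treatment
diff = p_control - p_treatment   # control minus treatment

# Unpooled standard error of the difference (Wald interval)
se_diff = math.sqrt(p_control * (1 - p_control) / n_control
                    + p_treatment * (1 - p_treatment) / n_treatment)

z_crit = NormalDist().inv_cdf(0.975)   # critical value for a 95% interval
lower = diff - z_crit * se_diff
upper = diff + z_crit * se_diff

print(f"difference (control - treatment): {diff:.4f}")
print(f"ci 95% for the difference: [{lower:.4f}, {upper:.4f}]")
```

If the interval contains zero, the data are consistent with no difference between the two designs, which matches the p-value above.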

## Reproduced from
- https://github.com/RobbieGeoghegan/AB_Testing
- https://medium.com/@robbiegeoghegan/implementing-a-b-tests-in-python-514e9eb5b3a1
- https://medium.com/@RenatoFillinich/ab-testing-with-python-e5964dd66143