Problem Statement:
+ The company has developed a new web page in order to try and increase the number of users who "convert," meaning the number of users who decide to pay for the company's product.
+ Your goal is to work through this notebook to help the company understand if they should implement this new page, keep the old page, or perhaps run the experiment longer to make their decision.
+ The new landing page is marked as "new_page", while the existing page is marked as "old_page".

About the Dataset:
+ user_id: unique user ID
+ timestamp: time
+ group: experiment group (control or treatment)
+ landing_page: page shown to the user (old_page or new_page)
+ converted: sign-up status after viewing the page (0 = not converted, 1 = converted)

Hypothesis:
+ H0: The conversion rate is the same for the old page and the new page.
+ H1: The conversion rate differs between the old page and the new page.

The decision will be based on the p-value obtained from the test, using a significance level of 0.05:
+ If the p-value is less than 0.05, reject H0 in favor of H1.
+ Otherwise, fail to reject H0.

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

In [2]:
# Dataset overview
df = pd.read_csv('ab_data.csv')
df.head()

index,user_id,timestamp,group,landing_page,converted
0,851104,11:48.6,control,old_page,0
1,804228,01:45.2,control,old_page,0
2,661590,55:06.2,treatment,new_page,0
3,853541,28:03.1,treatment,new_page,0
4,864975,52:26.2,control,old_page,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294480 entries, 0 to 294479
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294480 non-null  int64 
 1   timestamp     294480 non-null  object
 2   group         294480 non-null  object
 3   landing_page  294480 non-null  object
 4   converted     294480 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


In [9]:
df.isnull().sum()

user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

+ There are no null values in the dataset.

In [6]:
df.describe()

statistic,user_id,converted
count,294480.0,294480.0
mean,787973.538896,0.119658
std,91210.917091,0.324562
min,630000.0,0.0
25%,709031.75,0.0
50%,787932.5,0.0
75%,866911.25,0.0
max,945999.0,1.0


In [11]:
df['landing_page'].value_counts()

landing_page
new_page    147241
old_page    147239
Name: count, dtype: int64

+ This is almost exactly a 50/50 split (the two pages differ by only 2 rows out of 294,480). This proportion is acceptable.
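+ As a rough check on the split, the sketch below runs a chi-square goodness-of-fit test on the two row counts using only the standard library. It assumes the experiment intended a 50/50 allocation, which the notebook does not state explicitly; the names `chi2_split` and `p_split` are introduced here for illustration.

```python
import math

# Row counts per landing page, copied from the value_counts() output above.
n_new, n_old = 147241, 147239
total = n_new + n_old

# Expected count per page under an intended 50/50 allocation (assumption).
expected = total / 2

# Pearson goodness-of-fit statistic with 1 degree of freedom.
chi2_split = ((n_new - expected) ** 2 + (n_old - expected) ** 2) / expected

# For 1 degree of freedom, the chi-square tail probability is erfc(sqrt(x / 2)).
p_split = math.erfc(math.sqrt(chi2_split / 2))

print(f"Share of new_page rows: {n_new / total:.4%}")
print(f"Split check: chi2 = {chi2_split:.6f}, p = {p_split:.4f}")
```

+ A p-value close to 1 here would indicate that the observed split is fully consistent with an even allocation.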

In [15]:
df.groupby('landing_page')['converted'].mean()

landing_page
new_page    0.118839
old_page    0.120478
Name: converted, dtype: float64

+ From the raw numbers, the conversion rates look very close (about 11.9% for new_page versus 12.0% for old_page). Let's perform an A/B test to check whether the difference is statistically significant.
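+ To put a number on "very close", here is a small sketch that computes the absolute and relative gap from the two rates printed above. The variable names are my own, and the rates are the rounded values from the output, so the last digits are approximate.

```python
# Conversion rates copied from the groupby output above (rounded to 6 decimals).
rate_new = 0.118839
rate_old = 0.120478

# Absolute gap in conversion rate (new minus old).
abs_diff = rate_new - rate_old

# Relative gap, expressed as a fraction of the old page's rate.
rel_diff = abs_diff / rate_old

print(f"Absolute difference: {abs_diff * 100:.3f} percentage points")
print(f"Relative difference: {rel_diff:.2%}")
```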

+ Both landing_page and converted are binary categorical variables, so I will perform a chi-square test of independence on their 2x2 contingency table:

In [20]:
crosstable = pd.crosstab(df['landing_page'],df['converted'])
crosstable

landing_page,converted=0,converted=1
new_page,129743,17498
old_page,129500,17739
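+ The chi-square test compares these observed counts with the counts expected if conversion were independent of the landing page (row total × column total / grand total). The sketch below computes those expected counts by hand from the table above; it is only an illustration with names of my own choosing, not a replacement for the scipy call.

```python
# Observed counts copied from the crosstab above: [not converted, converted].
observed = {
    "new_page": [129743, 17498],
    "old_page": [129500, 17739],
}

row_totals = {page: sum(counts) for page, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Expected count under independence: row total * column total / grand total.
expected = {
    page: [row_totals[page] * col / grand_total for col in col_totals]
    for page in observed
}

for page, counts in expected.items():
    print(page, [round(x, 2) for x in counts])
```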


In [21]:
# Note: for a 2x2 table, chi2_contingency applies Yates' continuity correction by default
chi2, p_chi, dof, expected = stats.chi2_contingency(crosstable)
print(f"Chi-square p-value: {p_chi:.10f}")
print(f"Chi statistic: {chi2:.2f}")

Chi-square p-value: 0.1725629856
Chi statistic: 1.86
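+ For a 2x2 table, `stats.chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the statistic above is the corrected one. The sketch below recomputes the statistic by hand, with and without the correction, using only the standard library. The helper names `pearson_chi2` and `chi2_sf_1dof` are hypothetical and introduced only for this illustration.

```python
import math

# Observed counts from the crosstab: rows = new_page, old_page;
# columns = not converted, converted.
observed = [[129743, 17498], [129500, 17739]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)


def pearson_chi2(correction):
    """Pearson chi-square for the 2x2 table, optionally with Yates' correction."""
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            diff = abs(obs - exp)
            if correction:
                # Yates: shrink each |observed - expected| by 0.5 (not below 0).
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / exp
    return stat


def chi2_sf_1dof(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


chi2_yates = pearson_chi2(correction=True)
chi2_plain = pearson_chi2(correction=False)
p_yates = chi2_sf_1dof(chi2_yates)
p_plain = chi2_sf_1dof(chi2_plain)

print(f"With Yates' correction:    chi2 = {chi2_yates:.2f}, p = {p_yates:.4f}")
print(f"Without Yates' correction: chi2 = {chi2_plain:.2f}, p = {p_plain:.4f}")
```

+ With samples this large the correction barely moves the result, and both versions lead to the same decision at the 0.05 level.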


The p-value (0.17) is higher than 0.05.
+ We therefore fail to reject the null hypothesis: the data show no statistically significant difference in conversion rate between the new page and the old page.
+ I would suggest that the company provide more data, such as sales value, to further determine whether the new page is more profitable.
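+ As a complement to the p-value, a confidence interval shows which differences in conversion rate are still plausible given the data. The sketch below computes an approximate 95% Wald interval for (new_page rate − old_page rate) from the crosstab counts. The Wald (normal approximation) interval is my choice here, not something the notebook used, and the variable names are illustrative.

```python
import math

# Conversions and sample sizes taken from the crosstab above.
conv_new, n_new = 17498, 147241
conv_old, n_old = 17739, 147239

p_new = conv_new / n_new
p_old = conv_old / n_old
diff = p_new - p_old

# Unpooled standard error of the difference between two proportions.
se = math.sqrt(p_new * (1 - p_new) / n_new + p_old * (1 - p_old) / n_old)

# Two-sided 95% quantile of the standard normal distribution.
z_95 = 1.959964

ci_low = diff - z_95 * se
ci_high = diff + z_95 * se

print(f"Difference (new - old): {diff:.5f}")
print(f"Approx. 95% CI: [{ci_low:.5f}, {ci_high:.5f}]")
```

+ If the interval contains zero, that is consistent with failing to reject H0, and its width gives a sense of how large an effect the current data could still be hiding.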