# Notes

* HYPOTHESIS TESTING is the process of making inferences about a population from sample data
* Example: human trials that test whether a drug is effective
* H0 is the null hypothesis: the default assumption, or status quo
* H1 is the alternative hypothesis: the new proposal, which contradicts H0
* The idea is to use the sample data to try to reject H0 in favour of H1

## Steps we can follow
1. Gather the necessary data required for the hypothesis test
2. Define the null (H0) and alternative (H1) hypotheses
3. Choose the significance level alpha, which is the probability of rejecting the null hypothesis when it is true
4. Select an appropriate statistical test. Examples include:
   1. t-tests for comparing the means of two groups
   2. chi-square tests for categorical data
   3. ANOVA for comparing means across more than two groups
5. Perform the chosen statistical test on your data
6. Find the p-value and interpret the result of your statistical test
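
The steps above can be sketched end to end on synthetic data. The two samples, their means and spreads, and `alpha = 0.05` are all assumptions chosen for illustration; none of them come from the dataset used later in this notebook.

```python
import numpy as np
from scipy.stats import ttest_ind

# Step 1: gather data (synthetic here; both groups are made-up samples)
rng = np.random.default_rng(seed=42)
group_a = rng.normal(loc=0.25, scale=0.05, size=200)
group_b = rng.normal(loc=0.30, scale=0.05, size=200)

# Step 2: H0: the two group means are equal; H1: the two group means differ
# Step 3: significance level (assumed; 0.05 is a common convention)
alpha = 0.05

# Steps 4-5: two independent groups and a numeric metric, so run a t-test
# (equal_var=False selects Welch's t-test)
t_stat, p_value = ttest_ind(group_a, group_b, equal_var=False)

# Step 6: compare the p-value with alpha and state the decision
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.3g} -> {decision}")
```

Because the synthetic groups were generated with clearly different means, the test rejects H0 here; with real data the outcome depends on the effect size and the sample size.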

# Libraries

In [19]:
import os
import pandas as pd
from scipy.stats import ttest_ind

import warnings
warnings.filterwarnings('ignore')

In [20]:
df = pd.read_csv('website_ab_test.csv')
df.head()

Unnamed: 0,Theme,Click Through Rate,Conversion Rate,Bounce Rate,Scroll_Depth,Age,Location,Session_Duration,Purchases,Added_to_Cart
0,Light Theme,0.05492,0.282367,0.405085,72.489458,25,Chennai,1535,No,Yes
1,Light Theme,0.113932,0.032973,0.732759,61.858568,19,Pune,303,No,Yes
2,Dark Theme,0.323352,0.178763,0.296543,45.737376,47,Chennai,563,Yes,Yes
3,Light Theme,0.485836,0.325225,0.245001,76.305298,58,Pune,385,Yes,No
4,Light Theme,0.034783,0.196766,0.7651,48.927407,25,New Delhi,1437,No,No


* The data describes the performance of two website themes (light and dark)
* Our task is to find which theme performs better using hypothesis testing
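
The hypotheses are not written out in this notebook. One natural formulation for each metric, assumed here and consistent with the default two-sided behaviour of `ttest_ind`, is:

$$
H_0:\ \mu_{\text{light}} = \mu_{\text{dark}}
\qquad \text{vs.} \qquad
H_1:\ \mu_{\text{light}} \neq \mu_{\text{dark}}
$$

where $\mu_{\text{light}}$ and $\mu_{\text{dark}}$ are the population means of the metric under each theme. A two-sided test only detects that the means differ; which theme does better is read from the sign of the difference in sample means.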

In [21]:
print(f'No. of Records available : {df.shape[0]}\
    \nNo. of Columns available : {df.shape[1]} \
        \n Missing values in the dataset : {df.isnull().sum()}')

No. of Records available : 1000    
No. of Columns available : 10         
 Missing values in the dataset : Theme                 0
Click Through Rate    0
Conversion Rate       0
Bounce Rate           0
Scroll_Depth          0
Age                   0
Location              0
Session_Duration      0
Purchases             0
Added_to_Cart         0
dtype: int64


In [22]:
print(f'        Numerical columns summary :')
df.describe()

        Numerical columns summary :


Unnamed: 0,Click Through Rate,Conversion Rate,Bounce Rate,Scroll_Depth,Age,Session_Duration
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,0.256048,0.253312,0.505758,50.319494,41.528,924.999
std,0.139265,0.139092,0.172195,16.895269,14.114334,508.231723
min,0.010767,0.010881,0.20072,20.011738,18.0,38.0
25%,0.140794,0.131564,0.353609,35.655167,29.0,466.5
50%,0.253715,0.252823,0.514049,51.130712,42.0,931.0
75%,0.370674,0.37304,0.648557,64.666258,54.0,1375.25
max,0.499989,0.498916,0.799658,79.997108,65.0,1797.0


In [23]:
# grouping data by theme and calculating the mean of each group

num_cols = df.select_dtypes(include='number').columns
theme_group = df.groupby('Theme')[num_cols].mean()
theme_group.sort_values(by='Conversion Rate', ascending=False, inplace=True)
theme_group

Theme,Click Through Rate,Conversion Rate,Bounce Rate,Scroll_Depth,Age,Session_Duration
Light Theme,0.247109,0.255459,0.499035,50.735232,41.734568,930.833333
Dark Theme,0.264501,0.251282,0.512115,49.926404,41.332685,919.48249


# Hypothesis testing

In [28]:
# Welch's t-test on the conversion rate of the two themes
conversion_rates_light = df[df['Theme'] == 'Light Theme']['Conversion Rate']
conversion_rates_dark = df[df['Theme'] == 'Dark Theme']['Conversion Rate']

# equal_var=False selects Welch's t-test, which does not assume the two groups have equal variances
cr_t_test, cr_p_value = ttest_ind(conversion_rates_light, conversion_rates_dark, equal_var=False)
print(f'The t-test statistic is {cr_t_test}')
print(f'The p-value is {cr_p_value}')

The t-test statistic is 0.4748494462782632
The p-value is 0.6349982678451778
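
With `equal_var=False`, `ttest_ind` performs Welch's t-test. As a rough sketch of what that computes, the statistic and its degrees of freedom can be worked out by hand and checked against SciPy. The two samples below are synthetic stand-ins (not this notebook's data), and `welch_t_test` is a hypothetical helper name introduced only for this illustration.

```python
import numpy as np
from scipy import stats

def welch_t_test(a, b):
    """Two-sided Welch's t-test computed from first principles."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Squared standard error of each sample mean (sample variance uses ddof=1)
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (se2_a + se2_b) ** 2 / (se2_a**2 / (a.size - 1) + se2_b**2 / (b.size - 1))
    # Two-sided p-value from the t distribution
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, dof, p_value

# Synthetic samples with unequal sizes and spreads (illustrative only)
rng = np.random.default_rng(seed=0)
light = rng.normal(loc=0.255, scale=0.14, size=480)
dark = rng.normal(loc=0.251, scale=0.12, size=520)

t_manual, dof, p_manual = welch_t_test(light, dark)
t_scipy, p_scipy = stats.ttest_ind(light, dark, equal_var=False)
print(f"manual: t = {t_manual:.4f}, dof = {dof:.1f}, p = {p_manual:.4f}")
print(f"scipy : t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

The manual and SciPy values should agree up to floating-point error, which is a quick way to confirm that the test statistic is simply the difference in sample means divided by its estimated standard error.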


In [30]:
# calculating the t-test on click through rate of the two themes
click_through_rates_light = df[df['Theme'] == 'Light Theme']['Click Through Rate']
click_through_rates_dark = df[df['Theme'] == 'Dark Theme']['Click Through Rate']

ctr_t_test, ctr_p_value = ttest_ind(click_through_rates_light, click_through_rates_dark, equal_var=False)
print(f'The t-test statistic is for click through rates {ctr_t_test}')
print(f'The p-value is {ctr_p_value}')


The t-test statistic is for click through rates -1.9781708664172253
The p-value is 0.04818435371010704


In [36]:
# calculating the t-test on bounce rates, scroll_depth, session_duration of the two themes
bounce_rates_light = df[df['Theme'] == 'Light Theme']['Bounce Rate']
bounce_rates_dark = df[df['Theme'] == 'Dark Theme']['Bounce Rate']

br_t_test, br_p_value = ttest_ind(bounce_rates_light, bounce_rates_dark, equal_var=False)
print(f'The t-test statistic is for bounce rates {br_t_test}')
print(f'The p-value is {br_p_value}')

scroll_depth_light = df[df['Theme'] == 'Light Theme']['Scroll_Depth']
scroll_depth_dark = df[df['Theme'] == 'Dark Theme']['Scroll_Depth']
sd_t_test, sd_p_value = ttest_ind(scroll_depth_light, scroll_depth_dark, equal_var=False)
print(f'\nThe t-test statistic is for scroll depth {sd_t_test}')
print(f'The p-value is {sd_p_value}')

session_duration_light = df[df['Theme'] == 'Light Theme']['Session_Duration']
session_duration_dark = df[df['Theme'] == 'Dark Theme']['Session_Duration']

sess_t_test, sess_p_value = ttest_ind(session_duration_light, session_duration_dark, equal_var=False)
print(f'\nThe t-test statistic is for session duration {sess_t_test}')
print(f'The p-value is {sess_p_value}')

The t-test statistic is for bounce rates -1.2018883310494073
The p-value is 0.229692077505148

The t-test statistic is for scroll depth 0.7562277864140986
The p-value is 0.4496919249484911

The t-test statistic is for session duration 0.3529118117099043
The p-value is 0.7242292543366338


* From the results above, assuming the conventional significance level alpha = 0.05:
* Only click-through rate differs significantly between the themes (p = 0.048). The dark theme has the higher mean CTR (0.2645 vs 0.2471), but the p-value is only just below alpha, so the evidence is weak rather than conclusive
* For the other performance indicators (conversion rate, bounce rate, scroll depth, session duration)
  * the choice between a light theme and a dark theme does not significantly affect user behaviour according to the data provided (all p-values are above 0.2)
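
As a compact way to read these results, the p-values reported above can be compared against a significance level in one pass. The sketch below assumes alpha = 0.05, which this notebook never fixes explicitly. It also shows a Bonferroni-adjusted threshold as an optional, more conservative check when five tests are run on the same data; that adjustment is an addition here, not part of the original analysis.

```python
# p-values copied from the t-test outputs above
p_values = {
    "Conversion Rate": 0.6349982678451778,
    "Click Through Rate": 0.04818435371010704,
    "Bounce Rate": 0.229692077505148,
    "Scroll_Depth": 0.4496919249484911,
    "Session_Duration": 0.7242292543366338,
}

alpha = 0.05  # assumed significance level
bonferroni_alpha = alpha / len(p_values)  # adjusted threshold for 5 tests

significant = [m for m, p in p_values.items() if p < alpha]
significant_bonferroni = [m for m, p in p_values.items() if p < bonferroni_alpha]

for metric, p in p_values.items():
    verdict = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{metric:<20} p = {p:.4f} -> {verdict}")

print(f"Significant at alpha = {alpha}: {significant}")
print(f"Significant at Bonferroni alpha = {bonferroni_alpha}: {significant_bonferroni}")
```

Under these assumptions only click-through rate clears the unadjusted threshold and nothing clears the adjusted one, so the CTR difference is best treated as tentative.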