## Table of Contents

3. Problem statement
4. Deep dive into data
     - 4.1. Loading Data
     - 4.2. Splitting data in two groups
          - 4.2.1. Control Group
          - 4.2.2. Treatment Group
          - 4.2.3. Observation
     - 4.3. Loading Packages
          - 4.3.1. Installing package
          - 4.3.2. Importing the package
     - 4.4. Performing t-Test
          - 4.4.1. Calculating t-Test for dependent (paired) samples
          - 4.4.2. Interpret via critical value
          - 4.4.3. Interpret via p-value
5. Conclusion


## 3. Problem statement

- While working with the dataset, keep the objective in mind:
  "**Help Shades N Style decide whether to implement the new webpage now or wait longer before implementing it.**"

- To evaluate this objective, we attempt a t-test on the control and treatment groups in the data.


## 4. Deep dive into data


### 4.1 Loading Data and packages

**To get started, import our libraries**

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

**Importing the dataset from the local drive and taking a look at the top few rows**

In [2]:
data = pd.read_csv("ab_data.csv")

In [3]:
data.shape

(294478, 5)

In [4]:
data.shape[0]

294478

In [5]:
data.columns

Index(['user_id', 'timestamp', 'group', 'landing_page', 'converted'], dtype='object')

In [6]:
data.head()

| | user_id | timestamp | group | landing_page | converted |
| --- | --- | --- | --- | --- | --- |
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |
| 3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |


In [7]:
data.dtypes

user_id          int64
timestamp       object
group           object
landing_page    object
converted        int64
dtype: object

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
user_id         294478 non-null int64
timestamp       294478 non-null object
group           294478 non-null object
landing_page    294478 non-null object
converted       294478 non-null int64
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


In [9]:
data.describe()

| | user_id | converted |
| --- | --- | --- |
| count | 294478.0 | 294478.0 |
| mean | 787974.124733 | 0.119659 |
| std | 91210.823776 | 0.324563 |
| min | 630000.0 | 0.0 |
| 25% | 709032.25 | 0.0 |
| 50% | 787933.5 | 0.0 |
| 75% | 866911.75 | 0.0 |
| max | 945999.0 | 1.0 |


In [10]:
data.isnull().sum() 

user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

**Identify the number of unique users in the dataset:**

In [11]:
data['user_id'].nunique()

290584

**The proportion of users converted:**

In [12]:
data['converted'].sum()/float(data.shape[0])

0.11965919355605512

**The number of times the new_page and treatment don't line up:**

In [13]:
data.query("group == 'treatment' and landing_page == 'old_page'").shape[0] + data.query("group == 'control' and landing_page == 'new_page'").shape[0]

3893
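
The two `query` calls above count the off-diagonal combinations of `group` and `landing_page`. As a minimal sketch, the same check can be read off a cross-tabulation. The `toy` frame below is a hypothetical miniature of the `ab_data.csv` layout with made-up values, not the real data.

```python
import pandas as pd

# Hypothetical miniature of the ab_data.csv layout (values are made up)
toy = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment", "treatment", "control"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page", "new_page", "old_page"],
})

# Cross-tabulate group against landing_page; the off-diagonal cells are mismatches
counts = pd.crosstab(toy["group"], toy["landing_page"])

# control + new_page and treatment + old_page are the two mismatched combinations
mismatches = counts.loc["control", "new_page"] + counts.loc["treatment", "old_page"]

print(counts)
print("mismatched rows:", mismatches)
```

On the real dataset, `pd.crosstab(data["group"], data["landing_page"])` would show all four combinations at once.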

**Do any of the rows have missing values?**

In [14]:
pd.isnull(data).any(axis = 1).sum()

0

**For rows where "treatment" is not paired with "new_page" or "control" is not paired with "old_page", we cannot be sure whether the user actually received the new or the old page, so we keep only the rows where the two agree.**

In [15]:
x = data.query("group == 'treatment' and landing_page == 'new_page'")
y = data.query("group == 'control' and landing_page == 'old_page'")
data2 = pd.concat([x, y], ignore_index = True)  # DataFrame.append was removed in pandas 2.0

In [16]:
# Double Check all of the correct rows were removed - this should be 0
data2[((data2['group'] == 'treatment') == (data2['landing_page'] == 'new_page')) == False].shape[0]

0

**How many unique user_ids are in data2?**

In [17]:
data2['user_id'].nunique()

290584

**One user_id is repeated in data2. Which one is it?**

In [18]:
data2.loc[data2['user_id'].duplicated(),:]['user_id']

1404    773192
Name: user_id, dtype: int64

**Details of the repeated user_id:**

In [19]:
data2.loc[data2['user_id'].duplicated(),:]

| | user_id | timestamp | group | landing_page | converted |
| --- | --- | --- | --- | --- | --- |
| 1404 | 773192 | 2017-01-14 02:55:59.590927 | treatment | new_page | 0 |
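
Only one row is listed because `duplicated()` flags just the later occurrence of a repeated value by default. As a small sketch on a hypothetical frame (made-up values), passing `keep=False` lists every row that shares a `user_id`, which helps when deciding which copy to drop.

```python
import pandas as pd

# Hypothetical frame with one repeated user_id (values are made up)
toy = pd.DataFrame({
    "user_id": [101, 102, 103, 102],
    "converted": [0, 1, 0, 1],
})

# Default keep='first': only the second and later occurrences are flagged
later_only = toy[toy["user_id"].duplicated()]

# keep=False: every row whose user_id appears more than once is flagged
all_copies = toy[toy["user_id"].duplicated(keep=False)]

print(later_only)
print(all_copies)
```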


**Removing the row with the duplicate user_id, keeping the dataframe as data2**

In [20]:
data2.drop_duplicates(subset =['user_id'] , inplace = True)

**What is the probability of an individual converting regardless of the page they receive?**

In [21]:
data_conv = data2.query(" converted == 1 ")
p_conv = data_conv.shape[0]/float(data2.shape[0])
p_conv # P(converted)

0.11959708724499628


### 4.2 Splitting data in two groups


#### 4.2.1 Control Group

In [22]:
data_control= data[data['group']=='control']

In [23]:
data_control.shape[0]

147202


#### 4.2.2 Treatment Group

In [24]:
data_treatment= data[data['group']=='treatment']

In [25]:
data_treatment.shape[0]

147276

**Given that an individual was in the control group, what is the probability they converted?**

In [26]:
data_control = data2.query(" group == 'control' ")
p_conv_control = data_control['converted'].sum()/data_control.shape[0]
p_conv_control # P(converted | control)

0.1203863045004612

**Given that an individual was in the treatment group, what is the probability they converted?**

In [27]:
data_treatment = data2.query(" group == 'treatment' ")
p_conv_treatment = data_treatment['converted'].sum()/data_treatment.shape[0]
p_conv_treatment # P(converted | treatment)

0.11880806551510564

**What is the probability that an individual received the new page?**

In [28]:
data2.query(" landing_page == 'new_page' ").shape[0]/float(data2.shape[0])

0.5000619442226688


#### 4.2.3 Observation

This evidence suggests that neither page leads to more conversions. The conversion rate is approximately the same for the new page and the old page.

Comparing the two, the conversion rate falls slightly from 0.1204 on the old page (control) to 0.1188 on the new page (treatment). The value 0.1196 is the overall conversion rate across both pages.
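
To put a size on the gap, here is a short worked calculation using the two conversion rates printed above. The variable names `abs_diff` and `rel_diff` are introduced here for illustration only.

```python
# Conversion rates as printed earlier in this notebook
p_conv_control = 0.1203863045004612     # old page
p_conv_treatment = 0.11880806551510564  # new page

# Absolute difference in conversion rate (new page minus old page)
abs_diff = p_conv_treatment - p_conv_control

# Relative difference, expressed as a fraction of the old page's rate
rel_diff = abs_diff / p_conv_control

print("absolute difference: %.4f" % abs_diff)
print("relative difference: %.2f%%" % (100 * rel_diff))
```

Whether a gap of this size is more than noise is what the hypothesis test is meant to answer.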

## Let's apply the "t-Test" to the dataset


### 4.3 Loading Packages


#### 4.3.1 Installing package

**Loading the packages needed to perform the t-test:**

In [29]:
# Paired Student's t-test
from numpy.random import seed
from numpy.random import randn
from scipy.stats import ttest_rel
# seed the random number generator
seed(1)

In [30]:
# generate two synthetic samples with randn (this overwrites the data_control and
# data_treatment DataFrames built in section 4.2)
data_control = 5 * randn(100) + 50
data_treatment = 5 * randn(100) + 51

In [31]:
# compare samples
stat, p = ttest_rel(data_control, data_treatment)
print('Statistics=%.3f, p=%.3f' % (stat, p))

Statistics=-2.372, p=0.020


#### 4.3.2 Importing the package

In [32]:
# t-test for dependent samples
from math import sqrt
from numpy.random import seed
from numpy.random import randn
from numpy import mean
from scipy.stats import t


### 4.4 Performing t-Test

#### 4.4.1 Calculating t-Test for dependent (paired) samples

In [33]:
# function for calculating the t-test for two dependent (paired) samples
def dependent_ttest(data_control, data_treatment, alpha):
    # calculate means
    mean_control, mean_treatment = mean(data_control), mean(data_treatment)
    # number of paired samples
    n = len(data_control)
    # sum of squared differences between paired observations
    d1 = sum([(data_control[i] - data_treatment[i])**2 for i in range(n)])
    # sum of differences between paired observations
    d2 = sum([data_control[i] - data_treatment[i] for i in range(n)])
    # standard deviation of the paired differences
    sd = sqrt((d1 - (d2**2 / n)) / (n - 1))
    # standard error of the mean difference
    sed = sd / sqrt(n)
    # calculate the t statistic
    t_stat = (mean_control - mean_treatment) / sed
    # degrees of freedom
    df = n - 1
    # calculate the critical value (1.0 - alpha gives the one-sided value;
    # a two-sided test at level alpha would use 1.0 - alpha / 2.0)
    cv = t.ppf(1.0 - alpha, df)
    # calculate the two-sided p-value
    p = (1.0 - t.cdf(abs(t_stat), df)) * 2.0
    # return everything
    return t_stat, df, cv, p

In [34]:
# seed the random number generator
seed(1)
# generate two independent samples (pretend they are dependent)
data_control = 5 * randn(100) + 50
data_treatment = 5 * randn(100) + 51

In [35]:
# calculate the t test
alpha = 0.05
t_stat, df, cv, p = dependent_ttest(data_control, data_treatment, alpha)
print('t=%.3f, df=%d, cv=%.3f, p=%.3f' % (t_stat, df, cv, p))

t=-2.372, df=99, cv=1.660, p=0.020
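
The manual result matches the `ttest_rel` output from section 4.3.1 (statistic -2.372, p 0.020). As a sketch of why, the paired t-statistic is the mean of the differences divided by its standard error, which the block below compares with `scipy.stats.ttest_rel`. It uses `numpy.random.default_rng` as an assumption for reproducibility, so the generated values differ from the `seed(1)` samples above.

```python
import numpy as np
from scipy.stats import t, ttest_rel

# Synthetic paired samples; default_rng(1) is NOT the same stream as numpy.random.seed(1)
rng = np.random.default_rng(1)
a = 5 * rng.standard_normal(100) + 50
b = 5 * rng.standard_normal(100) + 51

# Paired t-test by hand: mean difference over its standard error
d = a - b
n = d.size
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the survival function of the t-distribution
p_manual = 2.0 * t.sf(abs(t_manual), n - 1)

# Library result for comparison
res = ttest_rel(a, b)
print("manual: t=%.3f, p=%.3f" % (t_manual, p_manual))
print("scipy:  t=%.3f, p=%.3f" % (res.statistic, res.pvalue))
```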



#### 4.4.2 Interpret via critical value

In [36]:
# interpret via critical value
if abs(t_stat) <= cv:
    print('interpret via critical value:Fail to reject the null hypothesis that the means are equal.')
else:
    print('interpret via critical value:Reject the null hypothesis that the means are equal.')

interpret via critical value:Reject the null hypothesis that the means are equal.
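
The printed `cv=1.660` comes from `t.ppf(1.0 - alpha, df)`, which is the one-sided critical value. Since the comparison uses `abs(t_stat)` and the p-value is two-sided, a two-sided critical value is arguably the matching threshold. A minimal sketch of both, using `alpha` and `df` from the run above:

```python
from scipy.stats import t

alpha = 0.05
df = 99  # degrees of freedom from the run above (n - 1 with n = 100)

# One-sided critical value: all of alpha sits in one tail
cv_one_sided = t.ppf(1.0 - alpha, df)

# Two-sided critical value: alpha is split evenly between the two tails
cv_two_sided = t.ppf(1.0 - alpha / 2.0, df)

print("one-sided cv=%.3f, two-sided cv=%.3f" % (cv_one_sided, cv_two_sided))
```

The two-sided value is always the larger of the two, so it is the stricter threshold for `abs(t_stat)`.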



#### 4.4.3 Interpret via p-value

In [37]:
# interpret via p-value
if p > alpha:
    print('interpret via p-value:Fail to reject the null hypothesis that the means are equal.')
else:
    print('interpret via p-value:Reject the null hypothesis that the means are equal.')

interpret via p-value:Reject the null hypothesis that the means are equal.
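
The samples tested above are generated with `randn`, not drawn from `ab_data.csv`. One way the same kind of test might be carried over to the cleaned conversion data is sketched below. It rebuilds 0/1 conversion arrays from figures printed earlier in this notebook (290584 unique users, the share who received the new page, and the two group conversion rates). It then runs `scipy.stats.ttest_ind` with `equal_var=False` (Welch's test), because the two groups are different users rather than matched pairs. The rounding of group sizes and the helper `binary_sample` are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Figures printed earlier in this notebook for the cleaned data (data2)
n_total = 290584
p_new_page = 0.5000619442226688
p_conv_control = 0.1203863045004612
p_conv_treatment = 0.11880806551510564

# Assumption: recover integer group sizes and conversion counts by rounding
n_treatment = round(n_total * p_new_page)
n_control = n_total - n_treatment
conv_treatment = round(n_treatment * p_conv_treatment)
conv_control = round(n_control * p_conv_control)

def binary_sample(n, successes):
    """Hypothetical helper: a 0/1 array with `successes` ones out of n entries."""
    return np.concatenate([np.ones(successes), np.zeros(n - successes)])

control = binary_sample(n_control, conv_control)
treatment = binary_sample(n_treatment, conv_treatment)

# Independent-samples (Welch) t-test on the two conversion arrays
stat, p = ttest_ind(control, treatment, equal_var=False)
print("t=%.3f, p=%.3f" % (stat, p))
```

With `data2` in memory, the `converted` columns of the control and treatment rows could be passed to `ttest_ind` directly instead of the rebuilt arrays.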



## 5. Conclusion

1. Interpret via critical value:
   The absolute value of the t-statistic is compared with the critical value. Because the test is two-tailed, rejecting the null hypothesis can mean that the first mean is either smaller or greater than the second mean.

2. Interpret via p-value:
   The cumulative distribution function (CDF) of the t-distribution is used to convert the t-statistic into a p-value. The p-value is then compared with a chosen significance level (alpha), such as 0.05, to determine whether the null hypothesis can be rejected.

**Here, the null hypothesis is rejected using both the critical value and the p-value. Hence it is recommended that *Shades N Style* review their new page and make it more customer friendly, then re-run a similar test before moving ahead with the page refresh.**

## Some Reference Links:
- https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)
- https://en.wikipedia.org/wiki/Normal_distribution

| Name | GitHub |
| --- | --- |
| Jigna | https://github.com/jmps967/Statistics |

## THANK YOU !