<a href="https://colab.research.google.com/github/hazrakeruboO/DS-Colabs/blob/main/SLCopy_of_3_Statistical_Experiments_and_Significance_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Statistics in Data Science Webinar Series

## Week 3: Statistical Experiments and Significance Testing

February 20, 2021
<br>

Angela Gaetano, M.S. Data Science, Manager of Data Science, Medtronic
<br>
Sneha Thanasekaran, M.S. Information Systems, Data Scientist, Medtronic
<br>
Sowmya Uk
<br>
Shaivya Kodan

<br>

Data files: https://drive.google.com/drive/u/1/folders/1kvVDL6ps5q8vQr58j9WjMCrYWostiZKy


### Topics
1. Introduction: Why Run Statistical Experiments?
2. Variability in Estimates
3. Hypothesis Testing
 1. Hypothesis Testing Framework: Step-by-Step Guide
 2. Choosing the Appropriate Statistical Test, Testing Assumptions, and Interpreting the Results (i.e., "Significance")
 3. Parametric tests including t-tests, ANOVA, chi-squared tests
 4. Non-Parametric tests including Mann Whitney, Kruskal-Wallis, Wilcoxon Signed Rank Tests 
4. Supporting your Statistical Experiments with additional evidence
 1. Resampling - Permutation Testing
 2. Power and Sample Size
5. A/B Testing: A Common Application of Statistical Experimentation

### Introduction: Why Run Statistical Experiments?

Statistics is an attainable and yet powerful tool for the Data Scientist. By leveraging principles of mathematical probability and inference, we can provide answers to meaningful questions that would otherwise be impossible to answer in our world of large numbers and possibilities. By finding ways to harness samples to represent broader populations and themes, we can advance the learnings of the scientific community. Some say, "Data is Power." I say, "Statistics with Power is Significant!"
<br>
<br>

Not only will this lesson enable you to run your own statistical experiments, but it will also empower you to look at the world with a scientific eye, being able to research and read clinical studies and understand what they truly mean. For example, with nothing more than a basic level of expertise in statistical experiments and hypothesis testing, anyone can learn more about the Covid-19 Vaccine Clinical Trials, and how to interpret the results. Understanding these principles of experimentation can be useful to anyone in our day to day lives!

#### Principles of Experiments

* **Controlling** – control for differences between groups other than what is being tested for
* **Randomization** – helps account for uncontrollable confounding variables
* **Replication** – be able to perform tests on multiple subjects, or even replicate an entire study
* **Blocking** – intentionally group individuals who share a common uncontrollable confounding variable, and then randomly assign equal-sized subsets of each group to the treatment groups

### Variability in Estimates

* **Point Estimate**: the sample estimate of a population parameter (e.g., $\bar x$ is the point estimate for $\mu$, and $\hat p$ is the point estimate for $p$)
* The Point Estimate will vary slightly each time you take a new sample of the population (i.e., sampling variation)
* If you take samples and measure the point estimate many times, you can build up the sampling distribution of the point estimate. “Understanding the sampling distribution is central to understanding statistical inference.”
* The standard error of the estimate is the standard deviation of the sampling distribution
* From a single sample, we can use the following equation to calculate the standard error of the sample mean: $SE = \frac{\sigma}{\sqrt{n}}$ (use $s$ when $\sigma$ is unknown)
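
The relationship between the sampling distribution and the standard error formula can be checked with a short simulation. The sketch below assumes a hypothetical normal population (mean 100, standard deviation 15) and a sample size of 25; none of these values come from the webinar data.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is reproducible

# Hypothetical population parameters and sample size (assumptions for illustration)
mu, sigma, n = 100.0, 15.0, 25

# Draw 10,000 samples of size n and record the sample mean of each one
sample_means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)

# Standard deviation of the simulated sampling distribution vs. sigma / sqrt(n)
empirical_se = sample_means.std(ddof=1)
theoretical_se = sigma / np.sqrt(n)

# Estimate from one single sample, using s in place of sigma
one_sample = rng.normal(mu, sigma, size=n)
estimated_se = one_sample.std(ddof=1) / np.sqrt(n)

print(f"empirical SE           = {empirical_se:.3f}")
print(f"theoretical SE         = {theoretical_se:.3f}")
print(f"single-sample estimate = {estimated_se:.3f}")
```

The empirical and theoretical values should agree closely, while the single-sample estimate will wander a little because $s$ itself varies from sample to sample.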

![test pic](https://drive.google.com/uc?id=1_J9ef_vy-EI81yRYLZIh3kOWVh-qKsLV)

### Hypothesis Testing: Introduction

* Takes the position of the skeptic:
 + I will not be convinced that there is any difference unless strong evidence is presented in favor of that alternative
 + Innocent until proven guilty
* Formal testing using p-values
 + The p-value is the probability of observing data at least as favorable to the alternative hypothesis as our current data set, if the null hypothesis is true
 + The lower the p-value, the stronger the data favor the alternative hypothesis
 + Why 0.05 significance level (α = 0.05)? It turns out to be the most common “natural threshold” for human skepticism. For more info, visit www.openintro.org/why05


![image](https://drive.google.com/uc?export=view&id=1XCgrFhPAinvsmYJC_pSomd5mWHwh3oCA)




### Hypothesis Testing: Step-by-Step

1. Define the null and alternative hypothesis
* Null hypothesis is always the one that contains equality
* Alternative implies inequality (or “difference”)

2. Set the significance Level, α
* 0.05 is most commonly used. Use 0.01 if you want to make the test more strict

3. Select an appropriate statistical test
* Consider the type of data in the study
 + How many groups of data? Are the data _paired_? Is the output you are trying to measure Continuous, Binary, or Categorical?
* Understand whether you need to test for assumptions. Typical assumptions of parametric tests are
 + Samples are independent and unbiased
 + Data are approximately normal (within all groups). Can use the Anderson-Darling test or the Shapiro-Wilk test (the two most commonly used). If the sample size is large enough and the data appear approximately normal with no significant skewness, there is no need to test the assumption further
 + If there is more than one group of data, need to know whether the variances are equal. Can use an F-test or Levene's test

4. Decide one or two tailed
* A one-tailed test means that the alternative hypothesis is a directional inequality (i.e., "greater than" or "less than"). Only use a one-tailed test if the direction was stated before the experiment was run. Otherwise, use a two-tailed test (i.e., the alternative hypothesis is "not equal"). Most statistical functions default to a two-tailed test

5. Run the Statistical Test
* Usually can be done with one line of code that takes the data as input as well as other specific testing parameters

6. Interpret the output: Decide whether to Reject or Fail to Reject the Null hypothesis
* Use the p-value to determine the outcome of the test. If the p-value < α, Reject the Null Hypothesis in favor of the alternative. Otherwise, Fail to Reject the Null: there was not enough evidence to reject it, which is not the same as showing that the null is true
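
The six steps can be sketched end to end on synthetic data. The two groups below are randomly generated for illustration (they are not from the webinar files), and the fallback to a Mann-Whitney test when normality is doubtful is one reasonable choice rather than the only one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): two independent groups
group_a = rng.normal(loc=50.0, scale=5.0, size=40)
group_b = rng.normal(loc=58.0, scale=5.0, size=40)

# Step 1: H0: mu_A = mu_B, HA: mu_A != mu_B
# Step 2: set the significance level
alpha = 0.05

# Step 3: check assumptions (normality in each group, then equal variances)
_, p_norm_a = stats.shapiro(group_a)
_, p_norm_b = stats.shapiro(group_b)
_, p_var = stats.levene(group_a, group_b)
looks_normal = bool(p_norm_a > alpha and p_norm_b > alpha)
equal_var = bool(p_var > alpha)

# Steps 4-5: run a two-tailed test, chosen according to the assumption checks
if looks_normal:
    test_name = "Student's t-test" if equal_var else "Welch's t-test"
    _, p_value = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
else:
    test_name = "Mann-Whitney U test"
    _, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Step 6: interpret the p-value against alpha
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"{test_name}: p-value = {p_value:.3g} -> {decision}")
```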


### Choosing the Appropriate Statistical Test, Testing Assumptions, and Interpreting the Results (i.e., "Significance")

To choose the appropriate statistical test, you must ask certain key questions:
1. What is the data type of the response? i.e., what data type is the value you are measuring? Continuous, Binary, or Categorical?
2. What is the data type of the input? Binary input implies you are dealing with two groups, "A" and "B". Categorical input implies you have multiple groups. Continuous input implies you are interested in correlation
 * If the input is Binary, and your response is continuous, ask if the data are "_paired_"
 * If the input is Binary or Categorical, and your response is continuous, test whether the variances between the groups are equal
 * If the input is Continuous, run a correlation test using Pearson's, Spearman's, or Kendall's, depending on the results of (3) below
3. Do your data satisfy assumptions for a Parametric test, or do you need to choose a Non-Parametric test?
 * Recall assumptions for parametric tests from above
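
The questions above can be captured in a small lookup helper. `choose_test` is a hypothetical function written for this sketch; it encodes a common textbook mapping and is not a transcription of the flowcharts. Welch's t-test is included here as the usual option when variances are unequal, which is an addition to the tests named in this lesson.

```python
def choose_test(input_type, response_type, paired=False, normal=True, equal_var=True):
    """Suggest a commonly used test for a study design (illustrative helper only)."""
    if response_type in ("binary", "categorical"):
        if input_type in ("binary", "categorical"):
            # Two categorical variables: compare counts in a contingency table
            return "Chi-squared test of independence"
        raise ValueError("This sketch does not cover a continuous input with a categorical response")

    # From here on the response is continuous
    if input_type == "continuous":
        return "Pearson correlation" if normal else "Spearman correlation"
    if input_type == "binary":
        if paired:
            return "Paired t-test" if normal else "Wilcoxon signed-rank test"
        if not normal:
            return "Mann-Whitney U test"
        return "Student's t-test" if equal_var else "Welch's t-test"
    if input_type == "categorical":
        return "One-way ANOVA" if normal else "Kruskal-Wallis test"
    raise ValueError(f"Unsupported input type: {input_type!r}")


# Example: two independent groups, continuous response, clearly non-normal data
print(choose_test("binary", "continuous", normal=False))
```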


Below are some flowcharts created by one of my colleagues to assist you in your decision-making when your inputs are either Binary or Categorical (Credit: Alice Townsend: [LinkedIn](https://www.linkedin.com/in/alice-townsend-04474a77/)).

![](https://drive.google.com/uc?id=1grEMp76x_RNng2faf-wp-IVsbkvtVXjY)

![](https://drive.google.com/uc?id=1TMn7Qih_8zvvJ2DqUpJZvxSYZOD8dRVP)

### Let's work through some examples

In [None]:
import scipy.stats as stats
import numpy as np 
import pandas as pd

## Exercise 1: Are Mean Crime Rates different between the North and the South?


In [None]:
url = 'https://drive.google.com/file/d/1RxU0OyMgyH4O843AT7OrczjHilhYrnCG/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
crime = pd.read_csv(path)
crime.head()

Unnamed: 0,CrimeRate,Youth,Southern,Education,ExpenditureYear0,LabourForce,Males,MoreMales,StateSize,YouthUnemployment,MatureUnemployment,HighYouthUnemploy,Wage,BelowWage,CrimeRate10,Youth10,Education10,ExpenditureYear10,LabourForce10,Males10,MoreMales10,StateSize10,YouthUnemploy10,MatureUnemploy10,HighYouthUnemploy10,Wage10,BelowWage10
0,45.5,135,0,12.4,69,540,965,0,6,80,22,1,564,139,26.5,135,12.5,71,564,974,0,6,82,20,1,632,142
1,52.3,140,0,10.9,55,535,1045,1,6,135,40,1,453,200,35.9,135,10.9,54,540,1039,1,7,138,39,1,521,210
2,56.6,157,1,11.2,47,512,962,0,22,97,34,0,288,276,37.1,153,11.0,44,529,959,0,24,98,33,0,359,256
3,60.3,139,1,11.9,46,480,968,0,19,135,53,0,457,249,42.7,139,11.8,41,497,983,0,20,131,50,0,510,235
4,64.2,126,0,12.2,106,599,989,0,40,78,25,1,593,171,46.7,125,12.2,97,602,989,0,42,79,24,1,660,162


### Step 1: Define the null and alternative hypothesis

$H_{0}: \mu_{N} - \mu_{S} = 0$

$H_{A}: \mu_{N} - \mu_{S} \neq 0$


### Step 2: Set Significance Level

$\alpha = 0.05$

### Step 3: Select the appropriate statistical test

#### What type of variable is the "input"?

The input is a binary classification, "North" vs "South". So, the answer is **Binary**

#### What type of variable is the "response"?

The response is "Crime Rate," which is a **continuous** value

#### Is the response paired?

**No**, the responses are not paired by the groups; they are independent groups.

#### Does the response follow a normal distribution in both groups?

Need to either test for Normality with an Anderson-Darling hypothesis test, or assume normality based on other descriptive details of the data (sample size, skewness, etc.)




In [None]:
SouthCrime = crime.loc[(crime.Southern == 1)] #South
SouthCrime.describe()

Unnamed: 0,CrimeRate,Youth,Southern,Education,ExpenditureYear0,LabourForce,Males,MoreMales,StateSize,YouthUnemployment,MatureUnemployment,HighYouthUnemploy,Wage,BelowWage,CrimeRate10,Youth10,Education10,ExpenditureYear10,LabourForce10,Males10,MoreMales10,StateSize10,YouthUnemploy10,MatureUnemploy10,HighYouthUnemploy10,Wage10,BelowWage10
count,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0,16.0
mean,100.68125,148.6875,1.0,11.625,69.75,533.0625,970.25,0.0,34.0,91.1875,34.8125,0.0625,440.75,234.5,100.525,147.625,11.6125,65.75,540.75,973.3125,0.0625,34.75,92.875,34.1875,0.1875,515.125,233.0625
std,22.588233,13.174312,0.0,0.982174,22.734702,35.448496,12.829653,0.0,20.136203,17.026328,8.01015,0.25,91.394019,31.411251,31.243143,11.552056,1.022334,21.430508,34.032338,15.413063,0.25,20.673655,15.34438,7.203876,0.403113,92.404094,26.611949
min,56.6,127.0,1.0,10.0,45.0,480.0,950.0,0.0,4.0,72.0,24.0,0.0,288.0,165.0,37.1,125.0,10.1,41.0,497.0,945.0,0.0,4.0,76.0,24.0,0.0,359.0,170.0
25%,94.0,141.25,1.0,10.975,57.75,514.25,960.5,0.0,22.75,78.25,28.0,0.0,394.75,227.0,92.3,139.0,10.975,53.75,518.75,961.25,0.0,24.0,81.5,28.75,0.0,468.75,226.75
50%,101.45,148.0,1.0,11.4,62.0,526.0,970.5,0.0,32.0,88.5,34.0,0.0,424.0,243.0,102.45,151.0,11.25,59.0,537.0,977.0,0.0,32.0,90.5,33.5,0.0,505.0,243.0
75%,112.65,157.0,1.0,12.125,81.25,544.75,978.75,0.0,39.25,97.0,38.75,0.0,476.0,251.75,117.15,154.5,12.025,74.75,554.5,983.0,0.0,39.5,98.25,39.25,0.0,553.0,250.25
max,145.4,177.0,1.0,13.9,123.0,638.0,996.0,0.0,96.0,135.0,53.0,1.0,631.0,276.0,157.3,164.0,14.0,115.0,638.0,1001.0,1.0,99.0,131.0,50.0,1.0,703.0,257.0


In [None]:
NorthCrime = crime.loc[(crime.Southern == 0)] #North
NorthCrime.describe()

Unnamed: 0,CrimeRate,Youth,Southern,Education,ExpenditureYear0,LabourForce,Males,MoreMales,StateSize,YouthUnemployment,MatureUnemployment,HighYouthUnemploy,Wage,BelowWage,CrimeRate10,Youth10,Education10,ExpenditureYear10,LabourForce10,Males10,MoreMales10,StateSize10,YouthUnemploy10,MatureUnemploy10,HighYouthUnemploy10,Wage10,BelowWage10
count,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0,31.0
mean,103.906452,133.354839,0.0,12.787097,92.870968,575.709677,989.612903,0.290323,37.967742,97.677419,33.548387,0.451613,569.064516,173.096774,102.867742,136.870968,12.809677,87.709677,578.322581,993.870968,0.290323,39.225806,99.806452,32.935484,0.516129,635.677419,172.258065
std,31.957555,8.526619,0.0,0.983444,30.129434,35.156975,33.425217,0.461414,44.879456,18.401788,8.759143,0.505879,65.634308,24.919009,43.307947,8.674645,1.037739,28.276249,33.125405,33.141859,0.461414,46.588775,18.803226,9.058828,0.508001,63.991868,25.464966
min,45.5,119.0,0.0,10.4,51.0,521.0,934.0,0.0,3.0,70.0,20.0,0.0,425.0,126.0,26.5,120.0,10.2,47.0,524.0,935.0,0.0,3.0,71.0,15.0,0.0,499.0,126.0
25%,80.2,126.0,0.0,12.25,70.0,541.0,965.5,0.0,8.0,83.0,27.0,0.0,527.5,159.0,70.4,132.5,12.2,66.5,549.0,973.5,0.0,8.5,82.5,27.0,0.0,590.0,155.0
50%,104.3,132.0,0.0,12.9,90.0,577.0,985.0,0.0,18.0,99.0,35.0,0.0,572.0,170.0,104.5,134.0,12.9,83.0,580.0,989.0,0.0,19.0,97.0,34.0,1.0,639.0,169.0
75%,129.8,140.0,0.0,13.45,109.0,599.0,1012.0,1.0,45.0,106.0,38.5,1.0,619.5,192.0,136.05,141.0,13.5,99.5,601.0,1007.0,1.0,47.0,110.0,37.5,1.0,681.5,185.5
max,161.8,152.0,0.0,15.1,166.0,641.0,1071.0,1.0,168.0,142.0,58.0,1.0,689.0,227.0,178.2,153.0,15.2,157.0,641.0,1079.0,1.0,180.0,143.0,59.0,1.0,748.0,234.0


The CrimeRate summaries show that the mean and median are close in both groups (South: 100.7 vs 101.45; North: 103.9 vs 104.3), so there is no sign of strong skew. The South sample is small (n = 16) and the North sample is moderate (n = 31). On that basis we assume approximate normality, although a formal test would be a reasonable extra check given the small Southern sample.

#### Are variances between groups equal?

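Need to test for equal variances.

In [None]:
# Caution: f_oneway is a one-way ANOVA, which compares group means, not variances
stats.f_oneway(NorthCrime['CrimeRate'],SouthCrime['CrimeRate'])

F_onewayResult(statistic=0.12900371305145147, pvalue=0.7211471608862728)

p-value > 0.05, so we Fail To Reject the Null Hypothesis of this test. Note, however, that `stats.f_oneway` performs a one-way ANOVA on the group means; with two groups it is equivalent to Student's t-test, which is why its p-value matches the t-test result in Step 5. It therefore says nothing about whether the variances are equal. A dedicated test such as Levene's (`stats.levene`) is the appropriate check, and the sample standard deviations above (about 32.0 for the North and 22.6 for the South) suggest the check is worth running.
<br>
The analysis below proceeds with a **Student's t-test**, which assumes equal variances. If that assumption is in doubt, Welch's t-test (`stats.ttest_ind(..., equal_var=False)`) is the safer choice.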
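
A direct check of equal variances can be sketched with Levene's test. The example below uses synthetic groups with deliberately different spreads (an assumption for illustration, not the crime data) and also shows that `stats.f_oneway`, being a one-way ANOVA on the means, reproduces the Student's t-test p-value when there are only two groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins for two groups with the same mean but different spreads
narrow = rng.normal(loc=100.0, scale=5.0, size=200)
wide = rng.normal(loc=100.0, scale=20.0, size=200)

# Levene's test: H0 is that the groups have equal variances
levene_stat, levene_p = stats.levene(narrow, wide)

# One-way ANOVA on the same data: H0 is that the group means are equal
anova_stat, anova_p = stats.f_oneway(narrow, wide)

# With two groups, one-way ANOVA and Student's t-test give the same p-value
ttest_stat, ttest_p = stats.ttest_ind(narrow, wide, equal_var=True)

print(f"Levene p = {levene_p:.3g}")
print(f"ANOVA p  = {anova_p:.3g}, t-test p = {ttest_p:.3g}")
```

If Levene's p-value falls below α, passing `equal_var=False` to `stats.ttest_ind` switches to Welch's t-test, which does not assume equal variances.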


### Step 4: Decide One or Two Tailed

Two-tailed

### Step 5: Run the Statistical Test

In [None]:
stats.ttest_ind(NorthCrime['CrimeRate'],SouthCrime['CrimeRate']) ## from scipy.stats

Ttest_indResult(statistic=0.35917086887921945, pvalue=0.7211471608862718)

p-value > 0.05, so we Fail To Reject the Null Hypothesis. **There is not enough evidence to conclude that mean crime rates differ between the Northern and Southern regions.**

## Exercise 2: Is there a difference in ticket fare between Titanic survivors and non-survivors?

In [None]:
url = 'https://drive.google.com/file/d/1Yyyd9rnsaIpbeAFcnrgtnCs1lm9fxKbP/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
titanic = pd.read_csv(path)
titanic

In [None]:
titanic.dtypes

pclass        int64
survived      int64
Residence     int64
name         object
age          object
sibsp         int64
parch         int64
ticket       object
fare         object
cabin        object
embarked     object
boat         object
body         object
home.dest    object
Gender        int64
dtype: object

#### Need to clean data to measure Fare - convert to numeric

In [None]:
# Drop the row with a blank fare (found by sorting the dataframe by fare).
# .copy() makes the filtered frame independent, which avoids SettingWithCopyWarning below.
titanic = titanic[titanic.fare != ' '].copy()
titanic['fare'] = titanic['fare'].astype('float')
titanic.dtypes


pclass         int64
survived       int64
Residence      int64
name          object
age           object
sibsp          int64
parch          int64
ticket        object
fare         float64
cabin         object
embarked      object
boat          object
body          object
home.dest     object
Gender         int64
dtype: object

### Step 1: Define the null and alternative hypothesis

$H_{0}: \mu_{S} - \mu_{N} = 0$

$H_{A}: \mu_{S} - \mu_{N} \neq 0$


### Step 2: Set Significance Level

$\alpha = 0.05$

### Step 3: Select the appropriate statistical test

#### What type of variable is the "input"?

The input is a binary classification, "Survivor" vs "Non-Survivor". So, the answer is **Binary**

#### What type of variable is the "response"?

The response is "Fare," which is a **continuous** value

#### Is the response paired?

**No**, the responses are not paired by the groups; they are independent groups.

#### Does the response follow a normal distribution in both groups?

Need to either test for Normality with a Shapiro-Wilk hypothesis test, or assume normality based on other descriptive details of the data (sample size, skewness, etc.)


In [None]:
SurvivorFares = titanic.loc[(titanic.survived == 1)] #Survivors
SurvivorFares.describe()

Unnamed: 0,pclass,survived,Residence,sibsp,parch,fare,Gender
count,500.0,500.0,500.0,500.0,500.0,500.0,500.0
mean,1.962,1.0,1.228,0.462,0.476,49.361184,0.678
std,0.872972,0.0,0.870363,0.685197,0.776292,68.648795,0.467711
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,1.0,0.0,0.0,0.0,11.2146,0.0
50%,2.0,1.0,2.0,0.0,0.0,26.0,1.0
75%,3.0,1.0,2.0,1.0,1.0,57.75,1.0
max,3.0,1.0,2.0,4.0,5.0,512.3292,1.0


#### We can tell there is significant skewness in fare, so we need to run a test for normality

In [None]:
stats.shapiro(SurvivorFares['fare'])

(0.607525110244751, 4.0718098693662947e-32)

#### Since the p-value is small, we reject the null hypothesis that the fare data is normally distributed. Therefore, we need to use a non-parametric test. There is no need now to test the other data group's normality, since it already failed for the first group, but it is described below for visibility

In [None]:
NonSurvivorFares = titanic.loc[(titanic.survived == 0)] #Non-Survivors
NonSurvivorFares.describe()

Unnamed: 0,pclass,survived,Residence,sibsp,parch,fare,Gender
count,808.0,808.0,808.0,808.0,808.0,808.0,808.0
mean,2.5,0.0,1.466584,0.522277,0.329208,23.353831,0.157178
std,0.745079,0.0,0.72749,1.21106,0.912824,34.145096,0.364194
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,0.0,1.0,0.0,0.0,7.8542,0.0
50%,3.0,0.0,2.0,0.0,0.0,10.5,0.0
75%,3.0,0.0,2.0,1.0,0.0,26.0,0.0
max,3.0,0.0,2.0,8.0,9.0,263.0,1.0


#### Based on these steps, the appropriate statistical test is the Mann-Whitney rank test to compare medians

### Step 4: Decide One or Two Tailed

Two-tailed

### Step 5: Run the Statistical Test

In [None]:
stats.mannwhitneyu(SurvivorFares['fare'],NonSurvivorFares['fare']) ## from scipy.stats

MannwhitneyuResult(statistic=131452.5, pvalue=1.0885770643327487e-26)

p-value < 0.05, so we **Reject the Null Hypothesis. There is a difference in fares between the survivors and non survivors of the Titanic.**
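
Because Step 4 calls for a two-tailed test, it can help to request that alternative explicitly: in older SciPy releases the default for `alternative` in `mannwhitneyu` was not two-sided. The sketch below uses synthetic right-skewed values (log-normal draws, not the Titanic fares) to show the explicit call.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic right-skewed "fares" (an assumption for illustration)
fares_a = rng.lognormal(mean=3.5, sigma=0.8, size=300)
fares_b = rng.lognormal(mean=3.0, sigma=0.8, size=300)

# Request the two-sided alternative explicitly so the call matches Step 4
u_stat, p_two_sided = stats.mannwhitneyu(fares_a, fares_b, alternative="two-sided")

# One-sided version for comparison: does group A tend to have larger values than group B?
_, p_greater = stats.mannwhitneyu(fares_a, fares_b, alternative="greater")

print(f"two-sided p = {p_two_sided:.3g}, one-sided p = {p_greater:.3g}")
```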

## Exercise 3: Is there a difference in the proportions of favorite ice cream flavors between females and males? Said another way, is there a difference in the proportion of each gender across favorite ice cream flavors? Are gender and favorite ice cream flavor dependent on each other?

In [None]:
import scipy.stats as stats
import numpy as np 
import pandas as pd

url = 'https://drive.google.com/file/d/1rbJNoqy9y27BKfIfNeCoxnYNska9DQBV/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
ice_cream = pd.read_csv(path)
ice_cream.head()

Unnamed: 0,id,female,ice_cream,video,puzzle
0,70,0,2,47,57
1,121,1,1,63,61
2,86,0,3,58,31
3,141,0,3,53,56
4,172,0,1,53,61


In [None]:
# H0: favorite ice cream flavor is independent of gender (flavor proportions are the same for females and males)
# H1: favorite ice cream flavor depends on gender
α=0.05


In [None]:
from matplotlib import pyplot
import pandas as pd
import numpy as np
# Importing functions from numpy
from numpy.random import seed
from numpy.random import randn



In [None]:
# Gender and favorite flavor are both categorical, so a normality check does not apply.
# Cross-tabulate the two variables and run a chi-squared test of independence.
contingency = pd.crosstab(ice_cream['female'], ice_cream['ice_cream'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
print(contingency)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.4f}")

If the p-value is below α, we Reject the Null Hypothesis and conclude that gender and favorite flavor are associated; otherwise we Fail to Reject it.

## Exercise 4: Is there a relationship between video game scores and puzzle scores?
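
One possible approach, since both scores are continuous, is a correlation test. The sketch below uses a synthetic DataFrame that borrows the `video` and `puzzle` column names from `ice_cream`; the values are generated for illustration and are not the real scores.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic stand-in with the same column names as ice_cream (not the real data)
video = rng.normal(50, 10, size=200)
puzzle = 0.6 * video + rng.normal(20, 8, size=200)
scores = pd.DataFrame({"video": video, "puzzle": puzzle})

# Pearson measures linear association and assumes roughly normal data
pearson_r, pearson_p = stats.pearsonr(scores["video"], scores["puzzle"])

# Spearman is rank-based and is the usual non-parametric alternative
spearman_rho, spearman_p = stats.spearmanr(scores["video"], scores["puzzle"])

print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3g})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3g})")
```

With the real data, the same two calls would take `ice_cream['video']` and `ice_cream['puzzle']` as inputs.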



## Exercise 5: Is there a difference in puzzle scores between groups of favorite ice cream flavors?
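
One possible approach, with a categorical input (three flavors) and a continuous response, is a one-way ANOVA or its non-parametric counterpart, Kruskal-Wallis. The sketch below generates synthetic data that reuses the `ice_cream` and `puzzle` column names; the flavor codes 1 to 3 and the shift applied to flavor 3 are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(5)

# Synthetic stand-in: flavor codes 1-3 and one puzzle score per person
flavors = rng.integers(1, 4, size=150)
puzzle = rng.normal(50, 10, size=150) + 15 * (flavors == 3)  # flavor 3 is shifted upward
df = pd.DataFrame({"ice_cream": flavors, "puzzle": puzzle})

# Build one array of puzzle scores per flavor group
groups = [g["puzzle"].to_numpy() for _, g in df.groupby("ice_cream")]

# Parametric: one-way ANOVA (H0: all group means are equal)
anova_f, anova_p = stats.f_oneway(*groups)

# Non-parametric: Kruskal-Wallis (H0: all groups come from the same distribution)
kruskal_h, kruskal_p = stats.kruskal(*groups)

print(f"ANOVA p = {anova_p:.3g}, Kruskal-Wallis p = {kruskal_p:.3g}")
```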

## Exercise 6: Was cholesterol reduced after introducing a particular brand of margarine (margarine A)?

In [None]:
url = 'https://drive.google.com/file/d/16PaXKkf2dd18l9ci-FP6M6XMmcppyM9F/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
cholesterol = pd.read_csv(path)
cholesterol.head()
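
Because the same subjects are measured before and after, this is a paired design, and the word "reduced" points to a one-tailed alternative. The sketch below uses synthetic `before` and `after` arrays for 18 hypothetical subjects, since the column names of the cholesterol file are not shown here. The `alternative` argument of `ttest_rel` requires SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)

# Synthetic paired measurements (not the cholesterol file)
before = rng.normal(6.4, 1.0, size=18)
after = before - rng.normal(0.5, 0.2, size=18)  # each subject drops by about 0.5

# H0: mean(before - after) = 0, HA: mean(before - after) > 0 (a reduction)
t_stat, p_one_sided = stats.ttest_rel(before, after, alternative="greater")

# Non-parametric counterpart if the paired differences are clearly non-normal
w_stat, w_p = stats.wilcoxon(before, after, alternative="greater")

print(f"paired t-test p = {p_one_sided:.3g}, Wilcoxon signed-rank p = {w_p:.3g}")
```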

### A/B Testing: A Common Application of Statistical Experimentation

A/B testing is a common application of statistical experimentation, used often in web development. You might have heard of A/B testing in the context of comparing UI designs for a website: some visitors are assigned to a control group "A" and others to a "treatment" group "B". You can then directly measure and compare metrics of interest between the groups, for example the "conversion rate" (the proportion of visitors who convert to a sale).

An A/B test is an application of statistical hypothesis testing.

![](https://drive.google.com/uc?id=1I5uoJm3UMJrH0iKxx1K3u9tzKjPX293l)

Let’s see an example...

Imagine that you are running a UI experiment to understand the difference in time spent on your initial layout versus a new layout (for example, a vertical layout versus a horizontal layout).

Imagine we are using the average session time (time spent on the page) as our metric to analyze the result of the A/B test. We aim to understand whether the new design of the page gets more attention from users and increases the time they spend on the page.
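
Before collecting data for an experiment like this, it helps to estimate how many sessions per design are needed to detect a difference of a given size. The sketch below uses the standard normal approximation for a two-sided, two-sample comparison of means; `sample_size_per_group` is a hypothetical helper, and the effect sizes (Cohen's d of 0.5 and 0.2) are illustrative assumptions.

```python
import math
from scipy import stats


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * (z_{1 - alpha/2} + z_{power})^2 / d^2,
    where d is the standardized difference between the group means (Cohen's d).
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


n_medium = sample_size_per_group(0.5)  # medium effect
n_small = sample_size_per_group(0.2)   # small effect needs far more data
print(f"d = 0.5: about {n_medium} per group")
print(f"d = 0.2: about {n_small} per group")
```

The approximation slightly understates the sample size an exact t-based calculation would give, so treat the result as a lower bound for planning.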


In [None]:
import pandas as pd
import statsmodels.api as sm
import numpy as np
from scipy import stats
import seaborn as sns
url = 'https://drive.google.com/file/d/1Z43HPoX916GEJ5sMZs-6fLnvsTW8P3BV/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
design = pd.read_csv(path, sep=';')
design.head()
#Time is in minutes

Unnamed: 0,Page,Time
0,Old design,0.35
1,New design,1.2
2,Old design,0.8
3,New design,1.2
4,Old design,1.5



### Step 1: Define the null and alternative hypothesis

$H_{0}: \mu_{OldDesign} = \mu_{NewDesign}$

$H_{A}: \mu_{OldDesign} \neq \mu_{NewDesign}$


In this experiment, the null hypothesis states that the average session times on the two pages are equal, and that any observed difference is due to chance alone. The alternative hypothesis states that the average session times differ.

### Step 2: Set Significance Level

$\alpha = 0.05$

### Step 3: Select the appropriate statistical test

#### What type of variable is the "input"?

The input is a binary classification, "old design" vs "new design". So, the answer is **Binary**

#### What type of variable is the "response"?

The response is "average session time," which is a **continuous** value

#### Is the response paired?

**No**, the responses are not paired by the groups; they are independent groups.

#### Does the response follow a normal distribution in both groups?

Need to either test for Normality with a Shapiro-Wilk hypothesis test, or assume normality based on other descriptive details of the data (sample size, skewness, etc.)

In [None]:
old_design = design.loc[(design.Page == 'Old design')] # Old design
old_design.describe()

Unnamed: 0,Time
count,21.0
mean,1.435714
std,0.589734
min,0.35
25%,0.9
50%,1.5
75%,1.9
max,2.5


In [None]:
stats.shapiro(old_design['Time'])

(0.9582112431526184, 0.48085227608680725)

#### Since the p-value is large, we do not reject the null hypothesis, so there is no evidence that **time** departs from a normal distribution. A Student's t-test therefore remains a candidate. Let's do the same for the next set

In [None]:
new_design = design.loc[(design.Page == 'New design')] # New design
new_design.describe()

Unnamed: 0,Time
count,28.0
mean,1.664286
std,0.645907
min,0.4
25%,1.275
50%,1.5
75%,2.05
max,2.9


In [None]:
stats.shapiro(new_design['Time'])

(0.9624826312065125, 0.398596853017807)

#### Are variances between groups equal?

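Need to test for equal variances.

In [None]:
# Caution: f_oneway is a one-way ANOVA, which compares group means, not variances
stats.f_oneway(old_design['Time'],new_design['Time'])

F_onewayResult(statistic=1.6172405295817571, pvalue=0.20973393181735653)

p-value > 0.05, so we **Fail To Reject the Null Hypothesis** of this test. Note, however, that `stats.f_oneway` is a one-way ANOVA on the group means, which is why its p-value matches the t-test result in Step 5; it does not test whether the variances are equal. Levene's test (`stats.levene`) is the appropriate check. The sample standard deviations above (about 0.59 and 0.65) are similar, so we proceed with a **Student's two-sample t-test**, which assumes equal variances.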

### Step 4: Decide One or Two Tailed

Two-tailed

### Step 5: Run the Statistical Test



In [None]:
stats.ttest_ind(old_design['Time'],new_design['Time']) ## from scipy.stats

Ttest_indResult(statistic=-1.271707721759113, pvalue=0.20973393181735656)

p-value > 0.05, so we Fail To Reject the Null Hypothesis. **There is not enough evidence to conclude that session time differs between the new design and the old design**
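
A permutation test is one way to support a result like this without relying on distributional assumptions. The sketch below implements a two-sided permutation test for a difference in means; `permutation_test` is a hypothetical helper, and the session times are synthetic (drawn from one common distribution), not the webinar's design data.

```python
import numpy as np


def permutation_test(a, b, n_permutations=5_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_permutations):
        # Shuffle the group labels by permuting the pooled values
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add 1 to numerator and denominator so the p-value is never exactly zero
    return observed, (count + 1) / (n_permutations + 1)


# Synthetic session times in minutes, both groups drawn from the same distribution
rng = np.random.default_rng(2)
old_times = rng.normal(1.4, 0.6, size=21)
new_times = rng.normal(1.4, 0.6, size=28)

observed_diff, p_value = permutation_test(old_times, new_times)
print(f"observed difference = {observed_diff:.3f}, permutation p-value = {p_value:.3f}")
```

With the real data, `old_design['Time']` and `new_design['Time']` would take the place of the synthetic arrays, and the resulting p-value can be compared with the one from the t-test.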