# Hypothesis testing using Python

https://towardsdatascience.com/hypothesis-testing-in-machine-learning-using-python-a0dc89e169ce

“Hypothesis testing is a statistical method that is used in making statistical decisions using experimental data. Hypothesis Testing is basically an assumption that we make about the population parameter.”

- video 1 is better than video 2 in terms of CTR
- page design 1 is better than page design 2 in terms of session time (engagement)


“A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data.”

The normal distribution assumption:

- normalization (standardization): a value $x$ is transformed into $z = (x - \mu)/\sigma$
- standard normal distribution: mean $\mu = 0$ and variance $\sigma^2 = 1$
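A minimal sketch of standardization (z-scores) with NumPy, on a small made-up array: subtracting the mean and dividing by the standard deviation yields values with mean 0 and variance 1. The helper name `standardize` is hypothetical.

```python
import numpy as np

def standardize(x):
    """Return z-scores (x - mean) / standard deviation, using the population SD (ddof=0)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# illustrative values (not from any dataset in this notebook)
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(values)
print(z)
print(z.mean(), z.var())  # mean 0 and variance 1 (up to floating point)
```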

- Null hypothesis ($H_0$): a general statement or default position that there is no relationship between two measured phenomena, or no association among groups.
- Alternative hypothesis ($H_1$): the hypothesis that is contrary to the null hypothesis. It is usually taken to be that the observations are the result of a real effect (with some amount of chance variation superposed).


In terms of a classification problem in machine learning, the null hypothesis (no difference), i.e. failing to reject the null hypothesis, corresponds to a negative; the alternative hypothesis, i.e. rejecting the null hypothesis, corresponds to a positive.

- Type I error ($\alpha$): false positive (rejecting the null when the null is actually true, e.g., claiming there is a difference when there actually is none)
- Type II error ($\beta$): false negative (failing to reject the null when the null is actually false)
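To make $\alpha$ concrete: if the null hypothesis is true and we test at $\alpha = 0.05$, roughly 5% of tests will still (wrongly) reject it. A small simulation sketch, assuming normally distributed data whose true mean really is 30; the seed, sample size and number of experiments are illustrative choices.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 2000

false_positives = 0
for _ in range(n_experiments):
    # H0 is true by construction: the population mean is exactly 30
    sample = rng.normal(loc=30, scale=5, size=20)
    _, p = ttest_1samp(sample, 30)
    if p < alpha:
        false_positives += 1  # a Type I error

type1_rate = false_positives / n_experiments
print(type1_rate)  # should be close to alpha
```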


![null-alternative](https://user-images.githubusercontent.com/595772/144764014-183e8597-41a1-488d-b6a2-3a44e73d99db.png)

- One-tailed test: a test of a statistical hypothesis where the region of rejection is on only one side of the sampling distribution.
Example: a college has ≥ 4000 students, or ≤ 80% of organizations have adopted data science.
- Two-tailed test: a statistical test in which the critical area of the distribution is two-sided; it tests whether a sample is greater than or less than a certain range of values. If the sample being tested falls into either of the critical areas, the null hypothesis is rejected in favor of the alternative hypothesis.
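In SciPy, the `alternative` argument of `scipy.stats.ttest_1samp` (assuming SciPy 1.6 or later) selects a one- or two-tailed test. A sketch using the ages sample from the one-sample t-test below: because the sample mean (about 31.15) is above 30, the one-tailed `'greater'` p-value is half of the two-tailed one.

```python
from scipy.stats import ttest_1samp

ages = [32, 34, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22]

# two-tailed: H1 is "mean != 30" (rejection region in both tails)
_, p_two = ttest_1samp(ages, 30, alternative='two-sided')
# one-tailed: H1 is "mean > 30" (rejection region only in the upper tail)
_, p_greater = ttest_1samp(ages, 30, alternative='greater')
# one-tailed: H1 is "mean < 30" (rejection region only in the lower tail)
_, p_less = ttest_1samp(ages, 30, alternative='less')

print(p_two, p_greater, p_less)
```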

P-value: the P value, or calculated probability, is the probability of finding the observed, or more extreme, results when the null hypothesis ($H_0$) of a study question is true; the definition of ‘extreme’ depends on how the hypothesis is being tested.
If your P value is less than the chosen significance level, you reject the null hypothesis, i.e. you accept that your sample gives reasonable evidence to support the alternative hypothesis. It does NOT imply a “meaningful” or “important” difference; that is for you to decide when considering the real-world relevance of your result.
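For a two-tailed z-test, "as extreme or more extreme" means both tails beyond the observed statistic, so $p = 2\,(1 - \Phi(|z|))$, where $\Phi$ is the standard normal CDF. A small sketch with `scipy.stats.norm`; the helper name `two_tailed_p` and the z values are illustrative.

```python
from scipy.stats import norm

def two_tailed_p(z):
    """Two-tailed p-value for a standard normal test statistic z."""
    # norm.sf(x) = 1 - norm.cdf(x), the upper-tail probability
    return 2 * norm.sf(abs(z))

alpha = 0.05
for z in (0.5, 1.96, 3.0):
    p = two_tailed_p(z)
    print(z, round(p, 4), "reject H0" if p < alpha else "fail to reject H0")
```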


In [114]:

```python
import numpy as np
import pandas as pd
```

## One sample t-test
The One Sample t Test determines whether the sample mean is statistically different from a known or hypothesised population mean. The One Sample t Test is a parametric test.
Example: you have a sample of ages and you are checking whether the average age is 30 or not.
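The one-sample t statistic is $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$, where $s$ is the sample standard deviation (ddof=1), compared against a t distribution with $n-1$ degrees of freedom. A sketch that computes it by hand for the ages sample below and checks it against `scipy.stats.ttest_1samp`:

```python
import numpy as np
from scipy import stats

ages = np.array([32, 34, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22], dtype=float)
mu0 = 30  # hypothesised population mean

n = len(ages)
s = ages.std(ddof=1)                                 # sample standard deviation
t_manual = (ages.mean() - mu0) / (s / np.sqrt(n))    # t statistic
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)   # two-tailed p-value

t_scipy, p_scipy = stats.ttest_1samp(ages, mu0)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```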

In [115]:
```python
# one-sample t-test
# null hypothesis: the mean age of the population is 30
# we randomly sampled 13 people
from scipy.stats import ttest_1samp

ages = [32, 34, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22]
ages_mean = np.mean(ages)
print(ages_mean)

tstat, pval = ttest_1samp(ages, 30)
print("p-value", pval)

if pval < 0.05:    # alpha value is 0.05 or 5%
    print("we are rejecting null hypothesis")
else:
    print("we are failing to reject null hypothesis")
```

```
31.153846153846153
p-value 0.5336578606482159
we are failing to reject null hypothesis
```


## Two-sample t-test
The two-sample t-test (also known as the independent samples t-test) is a method used to test whether the unknown population means of two groups are equal or not.

https://www.jmp.com/en_us/statistics-knowledge-portal/t-test/two-sample-t-test.html


One way to measure a person’s fitness is to measure their body fat percentage. Average body fat percentages vary by age, but according to some guidelines, the normal range for men is 15-20% body fat, and the normal range for women is 20-25% body fat.

Our sample data is from a group of men and women who did workouts at a gym three times a week for a year. Then, their trainer measured the body fat. The table below shows the data.

Body fat percentages by group:

- Men: [13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0]
- Women: [22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0]

A national fitness club chain wants to know whether the mean body fat percentage differs between men and women.

In [116]:
```python
# independent two-sample t-test
from scipy.stats import ttest_ind

men = [13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0]
women = [22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0]

# note: np.std defaults to ddof=0 (population SD); pass ddof=1 for the sample SD
men_mean = np.mean(men)
men_std = np.std(men)
women_mean = np.mean(women)
women_std = np.std(women)

print('men data:', men_mean, men_std)
print('women data:', women_mean, women_std)

# ttest_ind assumes equal population variances by default (Student's t-test)
ttest, pval = ttest_ind(men, women)

print("p-value", pval)
if pval < 0.05:
    print("we reject null hypothesis")
else:
    print("we fail to reject null hypothesis")
```

```
men data: 14.946153846153846 6.574146962460124
women data: 22.29 5.046672170846844
p-value 0.010730607904197957
we reject null hypothesis
```
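`ttest_ind` pools the two variances by default (Student's t-test). When the groups may have different variances, Welch's t-test (`equal_var=False`) is a common alternative; it is a different variant from the test run above. A sketch that computes Welch's statistic by hand, $t = (\bar{x}_1 - \bar{x}_2)/\sqrt{s_1^2/n_1 + s_2^2/n_2}$, and compares it with SciPy:

```python
import numpy as np
from scipy.stats import ttest_ind

men = np.array([13.3, 6.0, 20.0, 8.0, 14.0, 19.0, 18.0, 25.0, 16.0, 24.0, 15.0, 1.0, 15.0])
women = np.array([22.0, 16.0, 21.7, 21.0, 30.0, 26.0, 12.0, 23.2, 28.0, 23.0])

# Welch's t statistic uses each group's own sample variance (ddof=1)
v1 = men.var(ddof=1) / len(men)
v2 = women.var(ddof=1) / len(women)
t_welch = (men.mean() - women.mean()) / np.sqrt(v1 + v2)

res = ttest_ind(men, women, equal_var=False)
print(t_welch, res.statistic, res.pvalue)
```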


## Paired sample t-test
The paired sample t-test is also called the dependent sample t-test. It is a univariate test that tests for a significant difference between two related variables. An example of this is collecting the blood pressure of an individual before and after some treatment, condition, or time point.

- $H_0$: the mean difference between the two samples is 0
- $H_1$: the mean difference between the two samples is not 0
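A paired t-test is equivalent to a one-sample t-test of the within-pair differences against 0. A sketch with made-up before/after values (hypothetical, not the `blood-pressure.csv` data) showing that both give the same statistic and p-value:

```python
import numpy as np
from scipy import stats

# hypothetical before/after measurements for the same 6 individuals
before = np.array([150.0, 162.0, 145.0, 170.0, 158.0, 149.0])
after = np.array([146.0, 155.0, 147.0, 160.0, 151.0, 148.0])

# paired t-test on the two related samples
t_rel, p_rel = stats.ttest_rel(before, after)

# one-sample t-test of the differences against a mean difference of 0
t_diff, p_diff = stats.ttest_1samp(before - after, 0)

print(t_rel, p_rel)
print(t_diff, p_diff)
```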

In [117]:
```python
import pandas as pd
from scipy import stats
df = pd.read_csv("blood-pressure.csv")
df[['bp_before','bp_after']].describe()
```

| | bp_before | bp_after |
|---|---|---|
| count | 120.0 | 120.0 |
| mean | 156.45 | 151.358333 |
| std | 11.389845 | 14.177622 |
| min | 138.0 | 125.0 |
| 25% | 147.0 | 140.75 |
| 50% | 154.5 | 149.5 |
| 75% | 164.0 | 161.0 |
| max | 185.0 | 185.0 |


In [118]:
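```python
# paired t-test: compares before/after measurements of the same individuals
ttest, pval = stats.ttest_rel(df['bp_before'], df['bp_after'])
print(pval)
if pval < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")
```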

```
0.0011297914644840823
reject null hypothesis
```


## Z-test
Several different types of tests are used in statistics (e.g., F-test, chi-square test, t-test). You would use a z-test if:
- Your sample size is greater than 30. Otherwise, use a t-test.
- Data points are independent of each other. In other words, one data point isn't related to, and doesn't affect, another data point.
- Your data are normally distributed. However, for large sample sizes (over 30) this doesn't always matter.
- Your data are randomly selected from a population, where each item has an equal chance of being selected.
- Sample sizes should be equal if at all possible.
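The one-sample z statistic has the same form as the t statistic, $z = (\bar{x} - \mu_0)/(s/\sqrt{n})$, but its p-value comes from the standard normal distribution instead of a t distribution. A sketch with simulated data (the sample, seed and `mu0` are illustrative assumptions); with $n > 30$ the two p-values are close, the z-test's being slightly smaller because the normal distribution has lighter tails.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=157, scale=11, size=120)  # simulated data, n > 30
mu0 = 156                                         # hypothesised mean

n = len(sample)
z = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
p_z = 2 * stats.norm.sf(abs(z))                   # two-tailed, normal reference

t, p_t = stats.ttest_1samp(sample, mu0)           # same statistic, t reference
print(z, p_z)
print(t, p_t)
```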

Below, the previous t-tests are repeated with z-tests.

In [119]:
```python
# one-sample z-test
# null hypothesis: the mean blood pressure before the treatment is 156
from statsmodels.stats import weightstats as stests

ztest, pval = stests.ztest(df['bp_before'], x2=None, value=156)
print(float(pval))
if pval < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")
```

```
0.6651614730255063
fail to reject null hypothesis
```


In [120]:
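```python
# two-sample z-test
# null hypothesis: no difference between the group means before and after
# note: this treats before/after as independent samples, although they are paired
# measurements on the same individuals (the paired t-test above accounts for that)
ztest, pval1 = stests.ztest(df['bp_before'], x2=df['bp_after'], value=0, alternative='two-sided')
print(float(pval1))
if pval1 < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")
```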

```
0.002162306611369422
reject null hypothesis
```


## ANOVA (F-TEST)
The analysis of variance or ANOVA is a statistical inference test that lets you compare multiple groups at the same time.

The t-test works well when dealing with two groups, but sometimes we want to compare more than two groups at the same time. For example, if we wanted to test whether voter age differs based on some categorical variable like race, we would have to compare the means of each level or group of the variable. We could carry out a separate t-test for each pair of groups, but when you conduct many tests you increase the chances of false positives.
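Why many t-tests inflate false positives: if each of $m$ independent tests is run at level $\alpha$, the probability of at least one false positive (the family-wise error rate) is $1 - (1 - \alpha)^m$. A quick sketch, assuming the tests are independent (pairwise comparisons that share groups are not exactly independent, so this is an approximation); the helper name is hypothetical.

```python
from math import comb

def family_wise_error_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

for k in (3, 4, 5):
    m = comb(k, 2)  # number of pairwise t-tests among k groups
    print(k, "groups ->", m, "tests, FWER =", round(family_wise_error_rate(m), 3))
```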

## One-way F-test (ANOVA)

It tells whether the means of two or more groups are similar or not, based on the F-score.
Example: there are three different categories of plant with their weights, and we need to check whether all three groups are similar.

The following dataset has three groups, one control group and two treatment groups; we want to see whether they are similar or not.

One-way means there is one variable and two or more groups: here, the weight variable with three groups.
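The one-way F statistic is the ratio of between-group to within-group variability, $F = \frac{SS_B/(k-1)}{SS_W/(N-k)}$, with $k$ groups and $N$ observations in total. A sketch on small made-up groups (hypothetical values, not `plant-growth.csv`), checked against `scipy.stats.f_oneway`:

```python
import numpy as np
from scipy import stats

# hypothetical weights for three groups
groups = [
    np.array([4.2, 5.6, 5.2, 6.1, 4.5]),
    np.array([4.8, 4.2, 3.6, 4.4, 5.9]),
    np.array([6.3, 5.5, 5.1, 5.8, 6.2]),
]

k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# between-group sum of squares: spread of group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

F_manual = (ss_between / (k - 1)) / (ss_within / (N - k))
p_manual = stats.f.sf(F_manual, k - 1, N - k)  # upper-tail probability of the F distribution

F_scipy, p_scipy = stats.f_oneway(*groups)
print(F_manual, p_manual)
print(F_scipy, p_scipy)
```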

In [121]:
```python
from scipy import stats
df_anova = pd.read_csv('plant-growth.csv')
df_anova.head()
```

| | Unnamed: 0 | weight | group |
|---|---|---|---|
| 0 | 1 | 4.17 | ctrl |
| 1 | 2 | 5.58 | ctrl |
| 2 | 3 | 5.18 | ctrl |
| 3 | 4 | 6.11 | ctrl |
| 4 | 5 | 4.5 | ctrl |




In [123]:
```python
df_anova['group'].values
```

```
array(['ctrl', 'ctrl', 'ctrl', 'ctrl', 'ctrl', 'ctrl', 'ctrl', 'ctrl',
       'ctrl', 'ctrl', 'trt1', 'trt1', 'trt1', 'trt1', 'trt1', 'trt1',
       'trt1', 'trt1', 'trt1', 'trt1', 'trt2', 'trt2', 'trt2', 'trt2',
       'trt2', 'trt2', 'trt2', 'trt2', 'trt2', 'trt2'], dtype=object)
```

In [124]:
```python
# there are three groups
grps = pd.unique(df_anova['group'].values)
grps
```

```
array(['ctrl', 'trt1', 'trt2'], dtype=object)
```

In [125]:
```python
ctrl = df_anova['weight'][df_anova['group'] == 'ctrl']
trt1 = df_anova['weight'][df_anova['group'] == 'trt1']
trt2 = df_anova['weight'][df_anova['group'] == 'trt2']
```

In [126]:
```python
# more compact: a list with one Series of weights per group, in the order of grps
d_data = [df_anova['weight'][df_anova.group == grp] for grp in grps]
```

In [127]:
```python
F, p = stats.f_oneway(ctrl, trt1, trt2)
# equivalent, using the list built above (d_data is a list, not a dict):
# F, p = stats.f_oneway(*d_data)

print("p-value for significance is: ", p)
if p < 0.05:
    print("reject null hypothesis - at least one group mean differs")
else:
    print("fail to reject null hypothesis - no evidence that the group means differ")
```

```
p-value for significance is:  0.0159099583256229
reject null hypothesis - at least one group mean differs
```


## Two Way F-test

The two-way F-test is an extension of the one-way F-test; it is used when we have two independent variables (factors), each with two or more groups. The F-tests tell us whether each factor (and their interaction) is significant, but not which specific groups differ from each other; to check that, post-hoc testing needs to be performed.
Now let's take a look at the grand mean crop yield (the mean crop yield not broken down by any sub-group), as well as the mean crop yield by each factor and by the two factors grouped together.
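The grand mean and the group means can be computed with pandas `groupby`. A minimal sketch, assuming a DataFrame with the `Fert`, `Water` and `Yield` columns of `crop-yield.csv`; the helper `yield_means` and the small frame below are hypothetical and only demonstrate the calls.

```python
import pandas as pd

def yield_means(df):
    """Grand mean, per-factor means and per-cell (Fert x Water) means of Yield."""
    grand_mean = df['Yield'].mean()
    by_fert = df.groupby('Fert')['Yield'].mean()
    by_water = df.groupby('Water')['Yield'].mean()
    by_both = df.groupby(['Fert', 'Water'])['Yield'].mean()
    return grand_mean, by_fert, by_water, by_both

# hypothetical example frame with the same columns as crop-yield.csv
example = pd.DataFrame({
    'Fert':  ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'Water': ['High', 'High', 'High', 'High', 'Low', 'Low', 'Low', 'Low'],
    'Yield': [30.0, 32.0, 28.0, 30.0, 26.0, 28.0, 24.0, 22.0],
})

grand_mean, by_fert, by_water, by_both = yield_means(example)
print(grand_mean)
print(by_fert)
print(by_water)
print(by_both)
```

With the notebook's data, the same helper would be called as `yield_means(df_anova2)`.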

In [128]:
```python
import statsmodels.api as sm
from statsmodels.formula.api import ols
df_anova2 = pd.read_csv("crop-yield.csv")
df_anova2
```

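| | Fert | Water | Yield |
|---|---|---|---|
| 0 | A | High | 27.4 |
| 1 | A | High | 33.6 |
| 2 | A | High | 29.8 |
| 3 | A | High | 35.2 |
| 4 | A | High | 33.0 |
| 5 | B | High | 34.8 |
| 6 | B | High | 27.0 |
| 7 | B | High | 30.2 |
| 8 | B | High | 30.8 |
| 9 | B | High | 26.4 |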


In [129]:
```python
# model Yield with both factors and their interaction: Fert + Water + Fert:Water
mod = ols('Yield ~ C(Fert)*C(Water)', df_anova2)
res = mod.fit()

# overall F-test of the model
print(f"Overall model: F({res.df_model: .0f},{res.df_resid: .0f}) = {res.fvalue: .3f}, p = {res.f_pvalue: .4f}")

print(res.summary())
```


```
Overall model: F( 3, 16) =  4.112, p =  0.0243
                            OLS Regression Results                            
Dep. Variable:                  Yield   R-squared:                       0.435
Model:                            OLS   Adj. R-squared:                  0.330
Method:                 Least Squares   F-statistic:                     4.112
Date:                Tue, 07 Dec 2021   Prob (F-statistic):             0.0243
Time:                        15:54:08   Log-Likelihood:                -50.996
No. Observations:                  20   AIC:                             110.0
Df Residuals:                      16   BIC:                             114.0
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------
```

In [130]:
```python
anova_table = sm.stats.anova_lm(res, typ=2)
anova_table
```

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| C(Fert) | 69.192 | 1.0 | 5.766 | 0.028847 |
| C(Water) | 63.368 | 1.0 | 5.280667 | 0.035386 |
| C(Fert):C(Water) | 15.488 | 1.0 | 1.290667 | 0.272656 |
| Residual | 192.0 | 16.0 | | |


## Chi-Square Test

The key here is **categorical variables**

The test is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.
For example, in an election survey, voters might be classified by gender (male or female) and voting preference (Democrat, Republican, or Independent). We could use a chi-square test for independence to determine whether gender is related to voting preference.
See the Python example below.

In [131]:
```python
df_chi = pd.read_csv('chi-test.csv')
df_chi
```


| | Gender | Shopping |
|---|---|---|
| 0 | Male | No |
| 1 | Female | Yes |
| 2 | Male | Yes |
| 3 | Female | Yes |
| 4 | Female | Yes |
| 5 | Male | Yes |
| 6 | Male | No |
| 7 | Female | No |
| 8 | Female | No |


In [132]:
```python
contingency_table = pd.crosstab(df_chi["Gender"], df_chi["Shopping"])
contingency_table
```

| Gender \ Shopping | No | Yes |
|---|---|---|
| Female | 2 | 3 |
| Male | 2 | 2 |


In [133]:
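```python
# observed values
Observed_Values = contingency_table.values
print("Observed Values :-\n", Observed_Values)

# chi2_contingency returns (statistic, p-value, degrees of freedom, expected frequencies);
# for 2x2 tables it applies Yates' continuity correction to the statistic by default
b = stats.chi2_contingency(contingency_table)
Expected_Values = b[3]
print("Expected Values :-\n", Expected_Values)
```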

```
Observed Values :-
 [[2 3]
 [2 2]]
Expected Values :-
 [[2.22222222 2.77777778]
 [1.77777778 2.22222222]]
```
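Under independence, each expected count is (row total × column total) / grand total, $E_{ij} = \frac{R_i C_j}{N}$. A NumPy sketch that reproduces the expected values above from the observed Gender × Shopping counts:

```python
import numpy as np

observed = np.array([[2, 3],
                     [2, 2]])  # rows: Female, Male; columns: No, Yes

row_totals = observed.sum(axis=1)   # [5, 4]
col_totals = observed.sum(axis=0)   # [4, 5]
n_total = observed.sum()            # 9

# outer product of the margins divided by the grand total
expected = np.outer(row_totals, col_totals) / n_total
print(expected)
```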


In [134]:
```python
# degrees of freedom = (rows - 1) * (columns - 1), for a table of any size
no_of_rows, no_of_columns = contingency_table.shape
ddof = (no_of_rows - 1) * (no_of_columns - 1)
print("Degree of Freedom:-", ddof)
```


```
Degree of Freedom:- 1
```


In [135]:
```python
from scipy.stats import chi2

alpha = 0.05

# chi-square statistic: sum over all cells of (observed - expected)^2 / expected
# (uncorrected statistic, i.e. without Yates' continuity correction)
chi_square = sum([(o - e) ** 2. / e for o, e in zip(Observed_Values, Expected_Values)])
chi_square_statistic = chi_square.sum()

# critical value of the chi-square distribution at the chosen alpha
critical_value = chi2.ppf(q=1 - alpha, df=ddof)

# p-value: probability of a statistic at least this large under H0
p_value = 1 - chi2.cdf(x=chi_square_statistic, df=ddof)

print('Significance level: ', alpha)
print('Degree of Freedom: ', ddof)
print('chi-square statistic:', chi_square_statistic)
print('critical_value:', critical_value)
print('p-value:', p_value)

print('*' * 20)
print('use chi critical value')
if chi_square_statistic >= critical_value:
    print("Reject H0: there is a relationship between the 2 categorical variables")
else:
    print("Retain H0: no evidence of a relationship between the 2 categorical variables")

print('*' * 20)
print('use p value')
if p_value <= alpha:
    print("Reject H0: there is a relationship between the 2 categorical variables")
else:
    print("Retain H0: no evidence of a relationship between the 2 categorical variables")
```

```
Significance level:  0.05
Degree of Freedom:  1
chi-square statistic: 0.09000000000000008
critical_value: 3.841458820694124
p-value: 0.7641771556220945
********************
use chi critical value
Retain H0: no evidence of a relationship between the 2 categorical variables
********************
use p value
Retain H0: no evidence of a relationship between the 2 categorical variables
```
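As a cross-check, `scipy.stats.chi2_contingency` reproduces the manual statistic when its Yates continuity correction is switched off (`correction=False`); by default the correction is applied to 2×2 tables, so the statistic it reports would differ. Note also that every expected count here is below 5, a common rule-of-thumb threshold below which the chi-square approximation becomes unreliable; for small 2×2 tables, Fisher's exact test (`scipy.stats.fisher_exact`) is often used instead. A sketch:

```python
import numpy as np
from scipy import stats

observed = np.array([[2, 3],
                     [2, 2]])  # same counts as the Gender x Shopping table

# uncorrected chi-square test (matches the manual computation)
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)
print(chi2_stat, p_value, dof)

# Fisher's exact test for a small 2x2 table
odds_ratio, p_fisher = stats.fisher_exact(observed)
print(odds_ratio, p_fisher)
```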
