# Hypothesis Testing

A t-test assesses whether the means of two groups or conditions are statistically different from one another.

## One-sample T-test with Python
The one-sample t-test tells us whether the mean of a sample differs significantly from a known or hypothesized population mean.

In [None]:
# Null hypothesis: there is no significant difference between the population mean and the sample mean
# Alternate hypothesis: there is a significant difference between the population mean and the sample mean

In [28]:
ages=[10,20,35,50,28,40,55,18,16,55,30,25,43,18,30,28,14,24,16,17,32,35,26,27,65,18,43,23,21,20,19,70]

In [29]:
len(ages)

32

In [30]:
import numpy as np
ages_mean=np.mean(ages)
print(ages_mean)

30.34375


In [31]:
## Let's take a sample of 10 ages (np.random.choice samples with replacement by default)

sample_size=10
age_sample=np.random.choice(ages,sample_size)

In [32]:
age_sample

array([23, 35, 17, 28, 16, 32, 14, 10, 16, 25])
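`np.random.choice` draws with replacement by default and uses NumPy's global random state, so every run of the cell above gives a different sample, and the same entry can be picked more than once. A minimal sketch of a reproducible draw without replacement, assuming the newer `np.random.default_rng` generator is acceptable here (the seed `42` and the names `rng` and `repeat_sample` are illustrative choices, not part of the original notebook):

```python
import numpy as np

ages = [10, 20, 35, 50, 28, 40, 55, 18, 16, 55, 30, 25, 43, 18, 30, 28,
        14, 24, 16, 17, 32, 35, 26, 27, 65, 18, 43, 23, 21, 20, 19, 70]

# A seeded generator makes the draw repeatable (seed value is arbitrary)
rng = np.random.default_rng(42)

# replace=False picks 10 distinct entries of the list
age_sample = rng.choice(ages, size=10, replace=False)
print(age_sample, age_sample.mean())

# Re-creating the generator with the same seed reproduces the same sample
repeat_sample = np.random.default_rng(42).choice(ages, size=10, replace=False)
print(np.array_equal(age_sample, repeat_sample))
```

Because `ages` itself contains repeated values, the sample can still show the same age twice even with `replace=False`; what is guaranteed is that no list position is drawn twice.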

In [33]:
ages_sample_mean=np.mean(age_sample)
print(ages_sample_mean)

21.6


In [34]:
from scipy.stats import ttest_1samp

In [35]:
ttest,p_value=ttest_1samp(age_sample,30) 

#### The second parameter, 30, is the hypothesized population mean: the population mean computed above is about 30.34, so we round it to 30. If the sample mean is close enough to 30 we fail to reject (informally, accept) the null hypothesis; otherwise we reject the null hypothesis and accept the alternate hypothesis
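To make the role of the population mean concrete, the one-sample statistic can be computed by hand as t = (sample mean - population mean) / (s / sqrt(n)) and compared with what `ttest_1samp` returns. The sketch below uses the first sample printed above; the variable names (`t_manual`, `p_manual`, and so on) are illustrative.

```python
import numpy as np
from scipy import stats

age_sample = np.array([23, 35, 17, 28, 16, 32, 14, 10, 16, 25])  # first sample shown above
pop_mean = 30                                                    # hypothesized population mean

n = len(age_sample)
sample_mean = age_sample.mean()
sample_std = age_sample.std(ddof=1)            # sample standard deviation (n - 1 denominator)

# t statistic: distance between the means in units of the standard error
t_manual = (sample_mean - pop_mean) / (sample_std / np.sqrt(n))

# two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(age_sample, pop_mean)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

Both lines should print the same pair of numbers, with a p-value of roughly 0.011 for this sample.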

In [36]:
print(p_value)

0.010752674457011259


In [37]:
if p_value < 0.05:    # alpha value is 0.05 or 5%
    print(" we are rejecting null hypothesis")
else:
    print("we are accepting null hypothesis")

 we are rejecting null hypothesis


#### Here we reject the null hypothesis because the p-value (about 0.011) is less than 0.05; the sample mean (21.6) differs significantly from the population mean of 30

In [38]:
## Let's take another sample

sample_size=10
age_sample=np.random.choice(ages,sample_size)

In [39]:
age_sample

array([18, 23, 32, 23, 21, 43, 16, 35, 25, 55])

In [40]:
ages_sample_mean=np.mean(age_sample)
print(ages_sample_mean)

29.1


In [41]:
ttest,p_value=ttest_1samp(age_sample,30)

In [42]:
print(p_value)

0.8220159739857206


In [43]:
if p_value < 0.05:    # alpha value is 0.05 or 5%
    print(" we are rejecting null hypothesis")
else:
    print("we are accepting null hypothesis")

we are accepting null hypothesis


#### Here we fail to reject (accept) the null hypothesis because the p-value (about 0.82) is greater than 0.05; the sample mean (29.1) does not differ significantly from the population mean of 30
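The two samples above led to opposite decisions even though they came from the same list, which is ordinary sampling variability. A rough way to see how often that happens is to repeat the draw many times and count the rejections. This is only a sketch: the seed, the number of trials, and the names `n_trials` and `rejection_rate` are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp

ages = [10, 20, 35, 50, 28, 40, 55, 18, 16, 55, 30, 25, 43, 18, 30, 28,
        14, 24, 16, 17, 32, 35, 26, 27, 65, 18, 43, 23, 21, 20, 19, 70]

rng = np.random.default_rng(0)   # arbitrary seed so the run is repeatable
n_trials = 1000
rejections = 0

for _ in range(n_trials):
    sample = rng.choice(ages, size=10)   # with replacement, as in the cells above
    _, p = ttest_1samp(sample, 30)
    if p < 0.05:                         # same alpha as in the notebook
        rejections += 1

rejection_rate = rejections / n_trials
print(rejection_rate)
```

Since the true mean of `ages` (about 30.34) is very close to 30, most samples should not reject; the occasional rejection, like the first sample above, is the kind of false alarm that alpha = 0.05 allows for.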

In [44]:
## Let's take another sample

sample_size=10
age_sample=np.random.choice(ages,sample_size)

In [45]:
age_sample

array([30, 23, 19, 40, 43, 32, 55, 25, 19, 21])

In [46]:
ages_sample_mean=np.mean(age_sample)
print(ages_sample_mean)

30.7


In [47]:
ttest,p_value=ttest_1samp(age_sample,popmean=ages_sample_mean) 
# here popmean is the sample's own mean, so t = 0 and the p-value is always 1.0; pass the population mean (ages_mean) to test the sample against the population

In [48]:
print(p_value)

1.0


In [49]:
if p_value < 0.05:    # alpha value is 0.05 or 5%
    print(" we are rejecting null hypothesis")
else:
    print("we are accepting null hypothesis")

we are accepting null hypothesis
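A p-value of exactly 1.0 is what you get whenever a sample is tested against its own mean: the numerator of the t statistic is zero. The sketch below contrasts that with a test against the mean of the full `ages` list, using the third sample printed above (the names `t_self`, `p_self`, `t_pop`, and `p_pop` are illustrative).

```python
import numpy as np
from scipy.stats import ttest_1samp

ages = [10, 20, 35, 50, 28, 40, 55, 18, 16, 55, 30, 25, 43, 18, 30, 28,
        14, 24, 16, 17, 32, 35, 26, 27, 65, 18, 43, 23, 21, 20, 19, 70]
age_sample = np.array([30, 23, 19, 40, 43, 32, 55, 25, 19, 21])  # third sample shown above

# Testing the sample against its own mean: the difference is zero by construction
t_self, p_self = ttest_1samp(age_sample, popmean=age_sample.mean())

# Testing the sample against the mean of the whole list
t_pop, p_pop = ttest_1samp(age_sample, popmean=np.mean(ages))

print(t_self, p_self)
print(t_pop, p_pop)
```

The first test carries no information about the population; only the second one compares the sample with it.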


In [50]:
# Another example

In [51]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import math
np.random.seed(6)
school_ages=stats.poisson.rvs(loc=18,mu=35,size=1500)
# loc=18 shifts the Poisson(mu=35) draws up by 18, so ages start at 18 and average about 18 + 35 = 53; 1500 students in total
classA_ages=stats.poisson.rvs(loc=18,mu=30,size=60)
# loc=18 with mu=30: ages start at 18 and average about 18 + 30 = 48; 60 students in total

In [52]:
classA_ages.mean()

46.9

In [53]:
_,p_value=stats.ttest_1samp(a=classA_ages,popmean=school_ages.mean())

In [54]:
p_value

1.139027071016194e-13

In [55]:
school_ages.mean()

53.303333333333335
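The means printed above sit near 53 and 47 rather than 35 and 30 because `loc` shifts every Poisson draw upward. A short check, assuming the same seed as the notebook cell (the name `theoretical_mean` is illustrative):

```python
import numpy as np
from scipy import stats

np.random.seed(6)
school_ages = stats.poisson.rvs(loc=18, mu=35, size=1500)

# Mean of a shifted Poisson distribution is mu + loc
theoretical_mean = stats.poisson.mean(mu=35, loc=18)

print(theoretical_mean)                         # mu + loc
print(school_ages.mean(), school_ages.min())    # sample mean and smallest age drawn
```

The smallest possible value is `loc` itself, so no simulated age can fall below 18.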

In [56]:
if p_value < 0.05:    # alpha value is 0.05 or 5%
    print(" we are rejecting null hypothesis")
else:
    print("we are accepting null hypothesis")

 we are rejecting null hypothesis


#### Note: So far we compared the mean of a sample with a population mean; next we will compare the means of two independent groups

## Two-sample T-test With Python
The independent samples t-test (also called the two-sample t-test) compares the means of two independent groups to determine whether there is statistical evidence that the associated population means differ. It is a parametric test.

In [174]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import math
np.random.seed(6)
schoolA_ages=stats.poisson.rvs(loc=18,mu=35,size=1500)
# loc=18, mu=35: ages start at 18 and average about 53; 1500 students in total
schoolB_ages=stats.poisson.rvs(loc=25,mu=30,size=1500)
# loc=25, mu=30: ages start at 25 and average about 55; 1500 students in total

In [175]:
schoolA_ages.mean()

53.303333333333335

In [176]:
schoolB_ages.mean()

55.24466666666667

In [177]:
_,p_value=stats.ttest_ind(a=schoolA_ages,b=schoolB_ages,equal_var=False)

In [178]:
p_value

5.31439935202909e-20

In [179]:
if p_value < 0.05:    # alpha value is 0.05 or 5%
    print(" we are rejecting null hypothesis")
else:
    print("we are accepting null hypothesis")

 we are rejecting null hypothesis


In [None]:
# Here we are comparing two independent groups, schoolA_ages and schoolB_ages

#### Here we reject the null hypothesis and accept the alternate hypothesis: there is a significant difference between the population means of schoolA_ages and schoolB_ages
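Passing `equal_var=False` to `ttest_ind` selects Welch's version of the test, which does not assume the two groups share a variance. The sketch below computes Welch's statistic and its approximate degrees of freedom by hand and compares them with SciPy. The data are synthetic normal draws, not the school ages, and every name here is illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=53, scale=6, size=200)   # hypothetical group A
group_b = rng.normal(loc=55, scale=5, size=150)   # hypothetical group B

# Squared standard error contributed by each group
var_a = group_a.var(ddof=1) / len(group_a)
var_b = group_b.var(ddof=1) / len(group_b)

# Welch's t statistic
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(var_a + var_b)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch = (var_a + var_b) ** 2 / (
    var_a ** 2 / (len(group_a) - 1) + var_b ** 2 / (len(group_b) - 1)
)
p_manual = 2 * stats.t.sf(abs(t_manual), df=df_welch)

t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```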

## Paired Sample t-test
A paired t-test compares two measurements taken on the same subjects (for example, weight before and after a treatment) by testing whether the mean of the per-subject differences is zero.

In [180]:
weight1=[25,30,28,35,28,34,26,29,30,26,28,32,31,30,45]
weight2=weight1+stats.norm.rvs(scale=5,loc=-1.25,size=len(weight1))

In [181]:
print(weight1)
print(weight2)

[25, 30, 28, 35, 28, 34, 26, 29, 30, 26, 28, 32, 31, 30, 45]
[28.1910884  33.77436861 24.67549793 33.52152749 23.63231126 36.83538707
 27.74117409 28.16788971 27.00633951 27.93887629 25.71758717 31.02440861
 22.28851128 31.00964264 42.78462996]


In [182]:
weight_df=pd.DataFrame({"Befor_weight":np.array(weight1),
                         "After_weight":np.array(weight2),
                       "Weight_change":np.array(weight2)-np.array(weight1)})

In [183]:
weight_df

Unnamed: 0,Befor_weight,After_weight,Weight_change
0,25,28.191088,3.191088
1,30,33.774369,3.774369
2,28,24.675498,-3.324502
3,35,33.521527,-1.478473
4,28,23.632311,-4.367689
5,34,36.835387,2.835387
6,26,27.741174,1.741174
7,29,28.16789,-0.83211
8,30,27.00634,-2.99366
9,26,27.938876,1.938876


In [184]:
_,p_value=stats.ttest_rel(a=weight1,b=weight2)

In [185]:
p_value

0.346614666075463

In [186]:
if p_value < 0.05:    # alpha value is 0.05 or 5%
    print(" we are rejecting null hypothesis")
else:
    print("we are accepting null hypothesis")

we are accepting null hypothesis
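A paired t-test is equivalent to a one-sample t-test on the per-subject differences against zero. The sketch below checks this with the two weight arrays printed above; `after` is copied from the printed output, so it is rounded to eight decimals, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

before = np.array([25, 30, 28, 35, 28, 34, 26, 29, 30, 26, 28, 32, 31, 30, 45])
after = np.array([28.1910884, 33.77436861, 24.67549793, 33.52152749, 23.63231126,
                  36.83538707, 27.74117409, 28.16788971, 27.00633951, 27.93887629,
                  25.71758717, 31.02440861, 22.28851128, 31.00964264, 42.78462996])

# Paired test on the two measurements
t_rel, p_rel = stats.ttest_rel(before, after)

# One-sample test on the differences, with a hypothesized mean difference of 0
diff = before - after
t_one, p_one = stats.ttest_1samp(diff, 0)

print(t_rel, p_rel)
print(t_one, p_one)
```

Both calls should agree, with a p-value of about 0.35, so the average weight change in this simulated data is not significantly different from zero.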


## Correlation

In [189]:
import seaborn as sns
df=sns.load_dataset('iris')

In [190]:
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [191]:
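df.corr(numeric_only=True)  # numeric_only skips the non-numeric species column (needed in pandas 2.0 and later)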

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
sepal_length,1.0,-0.11757,0.871754,0.817941
sepal_width,-0.11757,1.0,-0.42844,-0.366126
petal_length,0.871754,-0.42844,1.0,0.962865
petal_width,0.817941,-0.366126,0.962865,1.0
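Each entry in the matrix above is a Pearson correlation coefficient. As a sketch of what that number is, the block below computes Pearson's r from its definition and compares it with `scipy.stats.pearsonr`, which also returns a p-value for the null hypothesis of no linear correlation. The data are synthetic and the names `x`, `y`, and `r_manual` are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)                          # hypothetical variable
y = 0.8 * x + rng.normal(scale=0.5, size=100)     # built to correlate positively with x

# Pearson's r: covariance of x and y divided by the product of their spreads
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = np.sum(x_centered * y_centered) / np.sqrt(
    np.sum(x_centered ** 2) * np.sum(y_centered ** 2)
)

r_scipy, p_value = stats.pearsonr(x, y)
print(r_manual, r_scipy, p_value)
```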


## Analysis Of Variance (ANOVA) Test

### One Way ANOVA

One-way ANOVA tests whether the mean of a single continuous dependent variable differs across the levels of a single categorical independent factor.

In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [6]:
df = pd.read_csv("Soils.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Group,Contour,Depth,Gp,Block,pH,N,Dens,P,Ca,Mg,K,Na,Conduc
0,1,1,Top,0-10,T0,1,5.4,0.188,0.92,215,16.35,7.65,0.72,1.14,1.09
1,2,1,Top,0-10,T0,2,5.65,0.165,1.04,208,12.25,5.15,0.71,0.94,1.35
2,3,1,Top,0-10,T0,3,5.14,0.26,0.95,300,13.02,5.68,0.68,0.6,1.41
3,4,1,Top,0-10,T0,4,5.14,0.169,1.1,248,11.92,7.88,1.09,1.01,1.64
4,5,2,Top,10-30,T1,1,5.14,0.164,1.12,174,14.17,8.12,0.7,2.17,1.85


In [10]:
df['Depth'].value_counts() # Categorical

30-60    12
60-90    12
0-10     12
10-30    12
Name: Depth, dtype: int64

### Here the null hypothesis H0 is that Depth has no significant effect on pH (the mean pH is the same at every depth)
### And the alternate hypothesis H1 is that Depth has a significant effect on pH

### If the p-value is greater than 0.05 we fail to reject (accept) the null hypothesis; if the p-value is less than 0.05 we reject the null hypothesis and accept the alternate hypothesis

In [11]:
mod = ols('pH~Depth', data=df).fit() # pH is the continuous variable and Depth is categorical

In [12]:
aov = sm.stats.anova_lm(mod, typ=1) # the keyword is typ, not type; typ=1 (sequential sums of squares) matches the table layout below
print(aov)

            df     sum_sq   mean_sq          F        PR(>F)
Depth      3.0  14.958973  4.986324  35.068327  9.810893e-12
Residual  44.0   6.256308  0.142189        NaN           NaN


#### Here the p-value (about 9.8e-12) is far below 0.05, so we reject the null hypothesis and accept the alternate hypothesis: Depth has a significant effect on pH
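The F statistic and p-value in the table can be recovered from the sums of squares and degrees of freedom it reports: each mean square is a sum of squares divided by its degrees of freedom, F is the ratio of the two mean squares, and the p-value is the upper tail of the F distribution. The numbers below are copied from the printed table; the variable names are illustrative.

```python
from scipy import stats

# Values copied from the ANOVA table above
ss_depth, df_depth = 14.958973, 3      # between-group (Depth) sum of squares and df
ss_resid, df_resid = 6.256308, 44      # residual sum of squares and df

ms_depth = ss_depth / df_depth         # mean square for Depth
ms_resid = ss_resid / df_resid         # residual mean square

f_stat = ms_depth / ms_resid
p_value = stats.f.sf(f_stat, df_depth, df_resid)   # P(F > f_stat) under the null

print(ms_depth, ms_resid)
print(f_stat, p_value)
```

Up to rounding of the copied inputs, this should reproduce the `F` and `PR(>F)` columns of the table.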

### Two Way ANOVA

Two-way ANOVA tests how two categorical independent factors relate to a single continuous dependent variable.

Example1:

Here we run a two-way ANOVA on pH to see the contribution of Depth and the contribution of Contour to pH

In [13]:
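mod1 = ols('pH~Depth+Contour', data=df).fit() # pH is the continuous variable; Depth and Contour are categorical
aov1 = sm.stats.anova_lm(mod1, typ=1) # the keyword is typ, not type; typ=1 (sequential sums of squares) matches the table layout below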
print(aov1)

            df     sum_sq   mean_sq          F        PR(>F)
Depth      3.0  14.958973  4.986324  34.929618  1.736182e-11
Contour    2.0   0.260662  0.130331   0.912981  4.091423e-01
Residual  42.0   5.995646  0.142753        NaN           NaN


#### Here the p-value of Contour (about 0.41) is greater than 0.05, which means Contour has no significant effect on pH, so we fail to reject (accept) the null hypothesis for Contour
#### On the other hand, the p-value of Depth is less than 0.05, which means Depth has a significant effect on pH, so we reject the null hypothesis and accept the alternate hypothesis for Depth

Example2:

Here we run a two-way ANOVA on pH to see the contribution of Depth and the contribution of Block to pH

In [14]:
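mod2 = ols('pH~Depth+Block', data=df).fit() # pH is continuous and Depth is categorical; Block is stored as an integer, so the formula treats it as numeric (1 df below) unless it is written as C(Block)
aov2 = sm.stats.anova_lm(mod2, typ=1) # the keyword is typ, not type; typ=1 (sequential sums of squares) matches the table layout below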
print(aov2)

            df     sum_sq   mean_sq          F        PR(>F)
Depth      3.0  14.958973  4.986324  39.274230  2.213912e-12
Block      1.0   0.796954  0.796954   6.277118  1.609733e-02
Residual  43.0   5.459355  0.126962        NaN           NaN


#### Here the p-value of Block (about 0.016) is less than 0.05, which means Block has a significant effect on pH, so we reject the null hypothesis and accept the alternate hypothesis for Block

#### The p-value of Depth is also less than 0.05, which means Depth has a significant effect on pH, so we reject the null hypothesis and accept the alternate hypothesis for Depth

### Two Way ANOVA (combined)

Example3:

Here we run a two-way ANOVA on pH to see the contribution of Depth, the contribution of Block, and the contribution of Depth and Block combined (their interaction) to pH

In [16]:
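mod3 = ols('pH~Depth+Block+Depth*Block', data=df).fit() # pH is continuous and Depth is categorical; Block is an integer column, so it is treated as numeric
# Depth*Block expands to Depth + Block + Depth:Block, where Depth:Block is the combined (interaction) term
aov3 = sm.stats.anova_lm(mod3, typ=1) # the keyword is typ, not type; typ=1 (sequential sums of squares) matches the table layout below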
print(aov3)

               df     sum_sq   mean_sq          F        PR(>F)
Depth         3.0  14.958973  4.986324  37.786849  9.459884e-12
Block         1.0   0.796954  0.796954   6.039393  1.842085e-02
Depth:Block   3.0   0.180985  0.060328   0.457173  7.137201e-01
Residual     40.0   5.278370  0.131959        NaN           NaN


#### Here the p-value of Block (about 0.018) is less than 0.05, which means Block has a significant effect on pH, so we reject the null hypothesis and accept the alternate hypothesis for Block
#### The p-value of Depth is also less than 0.05, which means Depth has a significant effect on pH, so we reject the null hypothesis and accept the alternate hypothesis for Depth
#### In addition, the p-value of the combined term Depth:Block (about 0.71) is greater than 0.05, which means the combination of Depth and Block has no significant effect on pH

In [None]:
# Another Example

In [194]:
import seaborn as sns
df1=sns.load_dataset('iris')
df1

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [195]:
df1['species'].value_counts()

versicolor    50
setosa        50
virginica     50
Name: species, dtype: int64

In [200]:
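mod4 = ols('petal_length~species', data=df1).fit() # petal_length is the continuous variable and species is categorical
aov4 = sm.stats.anova_lm(mod4, typ=1) # the keyword is typ, not type; typ=1 (sequential sums of squares) matches the table layout below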
print(aov4)

             df    sum_sq     mean_sq            F        PR(>F)
species     2.0  437.1028  218.551400  1180.161182  2.856777e-91
Residual  147.0   27.2226    0.185188          NaN           NaN


#### Here the p-value of species is far below 0.05, so we reject the null hypothesis and accept the alternate hypothesis: species has a significant effect on petal_length

## Chi-Square Test of independence
The test is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.

In [206]:
!pip install bioinfokit


In [207]:
import scipy.stats as stats
import seaborn as sns
import pandas as pd
import numpy as np
from bioinfokit.analys import stat
df2=sns.load_dataset('tips')
df2.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [203]:
cross_tab = pd.crosstab(df2.sex,df2.smoker)
print(cross_tab) #observed values

smoker  Yes  No
sex            
Male     60  97
Female   33  54


In [208]:
res = stat()
res.chisq(df=cross_tab)
print(res.summary)


Chi-squared test for independence

Test              Df    Chi-square    P-value
--------------  ----  ------------  ---------
Pearson            1    0.00876329   0.925417
Log-likelihood     1    0.00875825   0.925438



#### Note: Here the p-value (about 0.93) is greater than 0.05, so we fail to reject (accept) the null hypothesis: there is no significant association between the two categorical variables (sex, smoker), i.e. they are independent of each other. Checking for independence is exactly what the chi-square test is for

In [210]:
print(res.expected_df) # the observed and expected counts are almost the same


Expected frequency counts

        Yes       No
--  -------  -------
 0  59.8402  97.1598
 1  33.1598  53.8402

