## Hypothesis Testing Hands On

Author: Dhrub Satyam


### 1 Sample T Test

The one-sample t-test determines whether a sample mean is statistically different from a known or hypothesised population mean. It is a parametric test.
Example: you have a sample of 28 ages and you test whether the mean age is 30.

In [1]:
from scipy.stats import ttest_1samp
import numpy as np
ages = [20,25,30,31,54,65,67,21,23,45,32,43,32,23,56,30,30,30,30,30,30,30,30,30,30,30,30,30]
print(ages)
ages_mean = np.mean(ages)
print(ages_mean)
t, pval = ttest_1samp(ages, 30)
print(pval)
print(t)
if pval < 0.05:    # significance level alpha = 0.05 (5%)
    print("we are rejecting the null hypothesis")
else:
    print("we are failing to reject the null hypothesis")

[20, 25, 30, 31, 54, 65, 67, 21, 23, 45, 32, 43, 32, 23, 56, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30]
34.17857142857143
0.08268758534670989
1.802213147983849
we are failing to reject the null hypothesis
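
As a minimal sketch of what `ttest_1samp` computes, the statistic can be reproduced by hand as t = (sample mean - mu0) / (s / sqrt(n)), where s is the sample standard deviation. The names `mu0`, `t_manual` and `p_manual` are illustrative and do not appear in the original notebook.

```python
import numpy as np
from scipy.stats import t as t_dist, ttest_1samp

ages = [20, 25, 30, 31, 54, 65, 67, 21, 23, 45, 32, 43, 32, 23, 56,
        30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30]
mu0 = 30  # hypothesised population mean

n = len(ages)
sample_mean = np.mean(ages)
sample_std = np.std(ages, ddof=1)  # sample standard deviation (n - 1 in the denominator)

# t statistic: distance of the sample mean from mu0 in standard-error units
t_manual = (sample_mean - mu0) / (sample_std / np.sqrt(n))

# two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * t_dist.sf(abs(t_manual), df=n - 1)

# reference values from SciPy for comparison
t_scipy, p_scipy = ttest_1samp(ages, mu0)
print("t by hand:", t_manual, "t from scipy:", t_scipy)
print("p by hand:", p_manual, "p from scipy:", p_scipy)
```

Both pairs of values should agree up to floating-point rounding.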


In [2]:
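# Two equivalent ways to reach a decision:
# 1. Compare p-values: reject H0 if the p-value is below alpha (0.05 here).
# 2. Compare statistics: reject H0 if the calculated t statistic lies beyond the critical value (in the rejection region).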
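
The two decision routes can be sketched side by side for the ages example. This is an illustration under the assumption of a two-sided test at alpha = 0.05; the names `t_crit`, `reject_by_pvalue` and `reject_by_statistic` are made up for this sketch.

```python
from scipy.stats import t as t_dist, ttest_1samp

ages = [20, 25, 30, 31, 54, 65, 67, 21, 23, 45, 32, 43, 32, 23, 56,
        30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30]
alpha = 0.05

t_stat, pval = ttest_1samp(ages, 30)
dof = len(ages) - 1  # degrees of freedom for a one-sample t-test

# two-sided critical value: |t| beyond this lies in the rejection region
t_crit = t_dist.ppf(1 - alpha / 2, dof)

# route 1: compare the p-value with alpha
reject_by_pvalue = pval < alpha
# route 2: compare the calculated statistic with the critical value
reject_by_statistic = abs(t_stat) > t_crit

print("critical value:", t_crit)
print("reject by p-value:", reject_by_pvalue)
print("reject by statistic:", reject_by_statistic)
```

For a two-sided test the two routes always give the same decision, because the p-value falls below alpha exactly when |t| exceeds the critical value.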

### 2 Sample T-Test

The independent-samples t-test (also called the two-sample t-test or independent t-test) compares the means of two independent groups to determine whether there is statistical evidence that the associated population means differ. It is a parametric test.
Example: is there a significant difference between the mean of week1 and the mean of week2? (Python code is given below.)

In [3]:
from scipy.stats import ttest_ind
import numpy as np
week1 = np.random.uniform(low=10.0, high=25.3, size=(50,))
week2 = np.random.uniform(low=10.2, high=26.5, size=(50,))
print("\nweek1 data :-\n")
print(week1)
print("\nweek2 data :-\n")
print(week2)
week1_mean = np.mean(week1)
week2_mean = np.mean(week2)
print("\nweek1 mean value:",week1_mean)
print("week2 mean value:",week2_mean)
week1_std = np.std(week1)
week2_std = np.std(week2)
print("\nweek1 std value:",week1_std)
print("week2 std value:",week2_std)
ttest,pval = ttest_ind(week1,week2)
print("\np-value",pval)
if pval <0.05:
  print("We will reject the null hypothesis")
else:
  print("We have failed to reject the null hypothesis")


week1 data :-

[15.13752002 11.71068074 18.1433447  14.89889832 10.52701699 15.68345972
 25.12696171 17.82848277 17.41883341 11.78313534 12.82185191 19.07843171
 20.53249048 23.46787617 20.94107919 12.83071399 21.90640683 19.8425833
 14.35209358 22.29968045 17.00550679 20.55325992 17.83722245 24.71561971
 21.26731847 20.94412729 11.72131192 23.63780371 16.20697042 23.41005702
 24.83766365 12.84436677 17.35155348 14.10907176 20.20209476 21.20307286
 17.70266199 19.81383787 10.03496278 14.74837633 17.78264519 14.97178117
 14.75650288 23.88879449 10.67730214 15.20050552 20.11706761 22.08028515
 13.14894981 11.36261298]

week2 data :-

[18.87369496 10.28106891 13.24956206 24.70366841 13.55155085 17.31484644
 13.08965113 25.41880179 11.06031893 26.35008782 20.58055351 21.57636283
 11.28589643 24.56226545 21.19387463 22.53625677 18.68668893 23.40068919
 19.06584755 15.38197324 21.12740094 15.64102061 11.82712873 22.72750299
 16.54539252 18.33622972 14.41014423 19.80840103 21.57153729 26.080
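
The cell above draws fresh random numbers on every run, and `ttest_ind` assumes equal variances by default. A hedged variant, not the notebook's own method, fixes a seed (the value 0 is an arbitrary choice) and uses Welch's t-test via `equal_var=False`, which does not assume equal variances. The hand calculation is included only to show what the statistic measures.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)  # fixed seed (arbitrary) so the run is repeatable
week1 = rng.uniform(low=10.0, high=25.3, size=50)
week2 = rng.uniform(low=10.2, high=26.5, size=50)

# Welch's t-test: does not assume the two groups share a variance
t_welch, p_welch = ttest_ind(week1, week2, equal_var=False)

# the same statistic by hand, using sample variances (ddof=1)
std_err = np.sqrt(week1.var(ddof=1) / week1.size + week2.var(ddof=1) / week2.size)
t_manual = (week1.mean() - week2.mean()) / std_err

print("t (scipy):", t_welch, "t (by hand):", t_manual)
print("p-value:", p_welch)
if p_welch < 0.05:
    print("We will reject the null hypothesis")
else:
    print("We have failed to reject the null hypothesis")
```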

### Paired Sample T-Test

The paired-sample t-test, also called the dependent-sample t-test, is a univariate test for a significant difference between two related variables. An example is collecting an individual's blood pressure before and after some treatment, condition, or time point.

H0: the mean difference between before and after treatment is 0

H1: the mean difference between before and after treatment is not 0

In [4]:
import pandas as pd
from scipy import stats
df = pd.read_csv("blood_pressure.csv")
df.head()
# df[['bp_before','bp_after']].describe()


Unnamed: 0,patient,sex,agegrp,bp_before,bp_after
0,1,Male,30-45,143,153
1,2,Male,30-45,163,170
2,3,Male,30-45,153,168
3,4,Male,30-45,153,142
4,5,Male,30-45,146,141


In [5]:
ttest,pval = stats.ttest_rel(df['bp_before'], df['bp_after'])
print(pval)
if pval<0.05:
    print("reject null hypothesis")
else:
    print("failed to reject the null hypothesis")

0.0011297914644840823
reject null hypothesis
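
A paired t-test is equivalent to a one-sample t-test on the per-subject differences against 0. The sketch below illustrates this using only the five rows displayed by `df.head()` above, so its p-value will not match the full-data result printed above.

```python
import numpy as np
from scipy import stats

# only the five rows shown by df.head() above, not the full data set
bp_before = np.array([143, 163, 153, 153, 146])
bp_after = np.array([153, 170, 168, 142, 141])

# paired t-test on the two related samples
t_paired, p_paired = stats.ttest_rel(bp_before, bp_after)

# one-sample t-test on the differences, with H0: mean difference = 0
diff = bp_before - bp_after
t_diff, p_diff = stats.ttest_1samp(diff, 0)

print("paired:      t =", t_paired, "p =", p_paired)
print("differences: t =", t_diff, "p =", p_diff)
```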


### One Sample Z Test

We use a one-sample z-test to check whether the mean blood pressure before treatment differs from a hypothesised value (159 in the code below).

In [6]:
import pandas as pd
from scipy import stats
from statsmodels.stats import weightstats as stests
ztest, pval = stests.ztest(df['bp_before'], x2=None, value=159)  # value = hypothesised mean under H0
print(float(pval))
if pval<0.05:
    print("reject null hypothesis")
else:
    print("failed to reject the null hypothesis")

0.014185854025524306
reject null hypothesis
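
As a rough sketch of the textbook calculation behind a one-sample z-test, z = (sample mean - mu0) / (s / sqrt(n)), with the p-value taken from the standard normal distribution. It uses only the five `bp_before` values shown by `df.head()` earlier, which is far too few for a z-test in practice, so treat it purely as an illustration of the arithmetic. The names `mu0`, `std_err` and `p_two_sided` are illustrative.

```python
import numpy as np
from scipy.stats import norm

# only the five bp_before values displayed earlier, not the full data set
bp_before = np.array([143, 163, 153, 153, 146])
mu0 = 159  # hypothesised mean under H0

n = bp_before.size
std_err = bp_before.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# z statistic: distance of the sample mean from mu0 in standard-error units
z = (bp_before.mean() - mu0) / std_err

# two-sided p-value from the standard normal distribution
p_two_sided = 2 * norm.sf(abs(z))

print("z:", z, "p-value:", p_two_sided)
```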


### 2 Sample Z Test

In the two-sample z-test, as in the two-sample t-test, we compare two independent groups and decide whether their means are equal.

H0: the difference between the two group means is 0

H1: the difference between the two group means is not 0

Example: we compare the blood pressure data before and after treatment. (These measurements are paired rather than independent, so this serves only to illustrate the function call.)

In [37]:
ztest, pval1 = stests.ztest(df['bp_before'], x2=df['bp_after'], value=0, alternative='two-sided')
print(pval1)
if pval1 < 0.05:  # use pval1 from this test, not pval from the previous cell
    print("reject null hypothesis")
else:
    print("failed to reject the null hypothesis")

0.002162306611369422
reject null hypothesis


### One Way ANOVA

The data contain plant weights for three groups (ctrl, trt1, trt2); we test whether the mean weight is the same across all three groups.

In [39]:
df_anova = pd.read_csv('PlantGrowth.csv')
df_anova = df_anova[['weight','group']]
grps = pd.unique(df_anova.group.values)
df_anova

Unnamed: 0,weight,group
0,4.17,ctrl
1,5.58,ctrl
2,5.18,ctrl
3,6.11,ctrl
4,4.5,ctrl
5,4.61,ctrl
6,5.17,ctrl
7,4.53,ctrl
8,5.33,ctrl
9,5.14,ctrl


In [41]:
d_data = {grp:df_anova['weight'][df_anova.group == grp] for grp in grps}
 
Ftest, pval = stats.f_oneway(d_data['ctrl'], d_data['trt1'], d_data['trt2'])
print("p-value for significance is: ", pval)
if pval<0.05:
    print("reject null hypothesis")
else:
    print("failed to reject the null hypothesis")

p-value for significance is:  0.0159099583256229
reject null hypothesis


Here is an example on a shell measurement (the length of the anterior adductor muscle scar, standardized by dividing by length) in the mussel Mytilus trossulus from five locations: Tillamook, Oregon; Newport, Oregon; Petersburg, Alaska; Magadan, Russia; and Tvarminne, Finland, taken from a much larger data set used in McDonald et al. (1991).

In [43]:
from scipy.stats import f_oneway
tillamook = [0.0571, 0.0813, 0.0831, 0.0976, 0.0817, 0.0859, 0.0735,
             0.0659, 0.0923, 0.0836]
newport = [0.0873, 0.0662, 0.0672, 0.0819, 0.0749, 0.0649, 0.0835,
           0.0725]
petersburg = [0.0974, 0.1352, 0.0817, 0.1016, 0.0968, 0.1064, 0.105]
magadan = [0.1033, 0.0915, 0.0781, 0.0685, 0.0677, 0.0697, 0.0764,
           0.0689]
tvarminne = [0.0703, 0.1026, 0.0956, 0.0973, 0.1039, 0.1045]

F,p = f_oneway(tillamook, newport, petersburg, magadan, tvarminne)
print("P Value calculated is ", p)

if p<0.05:
    print("We will reject the null hypothesis")
else:
    print("failed to reject the null hypothesis")

P Value calculated is  0.0002812242314534544
We will reject the null hypothesis
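
To make the F statistic less of a black box, here is a minimal sketch that rebuilds it from the between-group and within-group sums of squares for the same mussel data and compares it with `f_oneway`. The helper names (`ss_between`, `ss_within`, `F_manual`, `p_manual`) are illustrative.

```python
import numpy as np
from scipy.stats import f as f_dist, f_oneway

tillamook = [0.0571, 0.0813, 0.0831, 0.0976, 0.0817, 0.0859, 0.0735,
             0.0659, 0.0923, 0.0836]
newport = [0.0873, 0.0662, 0.0672, 0.0819, 0.0749, 0.0649, 0.0835,
           0.0725]
petersburg = [0.0974, 0.1352, 0.0817, 0.1016, 0.0968, 0.1064, 0.105]
magadan = [0.1033, 0.0915, 0.0781, 0.0685, 0.0677, 0.0697, 0.0764,
           0.0689]
tvarminne = [0.0703, 0.1026, 0.0956, 0.0973, 0.1039, 0.1045]

groups = [np.array(g) for g in (tillamook, newport, petersburg, magadan, tvarminne)]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# variation of the group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = all_values.size - len(groups)

# F is the ratio of the two mean squares
F_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = f_dist.sf(F_manual, df_between, df_within)

F_scipy, p_scipy = f_oneway(*groups)
print("F by hand:", F_manual, "F from scipy:", F_scipy)
print("p by hand:", p_manual, "p from scipy:", p_scipy)
```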


### Two Way ANOVA

Two-way ANOVA is used when there are two independent variables (factors), each with two or more levels (groups).

In [44]:
import numpy as np
import pandas as pd

#create data
df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
                              6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
                              4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})

In [45]:
df

Unnamed: 0,water,sun,height
0,daily,low,6
1,daily,low,6
2,daily,low,6
3,daily,low,5
4,daily,low,6
5,daily,med,5
6,daily,med,5
7,daily,med,6
8,daily,med,4
9,daily,med,5


In [46]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

#perform two-way ANOVA
model = ols('height ~ C(water) + C(sun) + C(water):C(sun)', data=df).fit()
sm.stats.anova_lm(model, typ=2)

Unnamed: 0,sum_sq,df,F,PR(>F)
C(water),8.533333,1.0,16.0,0.000527
C(sun),24.866667,2.0,23.3125,2e-06
C(water):C(sun),2.466667,2.0,2.3125,0.120667
Residual,12.8,24.0,,


We can see the following p-values for each of the factors in the table:

water: p-value = .000527
sun: p-value = .000002
water*sun: p-value = .120667

Since the p-values for water and sun are both less than .05, both factors have a statistically significant effect on plant height.

And since the p-value for the interaction effect (.120667) is not less than .05, there is no significant interaction effect between sunlight exposure and watering frequency.

In [47]:
#Write down the decision loop for each condition provided above for 2 way ANOVA.

Null Hypothesis: the factor (or interaction) has no significant effect on plant height

Alt Hypothesis: the factor (or interaction) has a significant effect on plant height

Decision rule: if pval < 0.05, reject the null hypothesis
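
One possible way to write the decision loop is sketched below. To stay self-contained it copies the p-values from the `PR(>F)` column of the ANOVA table above into a dictionary rather than re-fitting the model; the names `p_values` and `decisions` are illustrative.

```python
alpha = 0.05  # significance level

# p-values copied from the PR(>F) column of the ANOVA table above
p_values = {
    "C(water)": 0.000527,
    "C(sun)": 0.000002,
    "C(water):C(sun)": 0.120667,
}

decisions = {}
for term, pval in p_values.items():
    # same rule as in the earlier examples: reject H0 when p < alpha
    if pval < alpha:
        decisions[term] = "reject null hypothesis"
    else:
        decisions[term] = "failed to reject the null hypothesis"
    print(f"{term}: p-value = {pval} -> {decisions[term]}")
```

With the fitted model at hand, the same loop could iterate over the `PR(>F)` column of the table returned by `anova_lm`, skipping the `Residual` row, whose p-value is missing.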

In [7]:
# Exercise: write the null and alternative hypotheses for all the examples above.