https://towardsdatascience.com/hypothesis-testing-in-machine-learning-using-python-a0dc89e169ce/

T-test :- A t-test is an inferential statistical test used to determine whether there is a significant difference between the means of two groups, which may be related in certain features. It is mostly used when the data are expected to follow a normal distribution and the population variance is unknown; the number of heads recorded from flipping a coin 100 times is one example of approximately normal data. The t-test is a hypothesis testing tool that lets you test an assumption about a population.

The t-test comes in several types. This article covers three: 1. the one-sample t-test 2. the two-sample t-test 3. the paired-sample t-test.

**One sample** t-test :- The one-sample t-test determines whether the sample mean is statistically different from a known or hypothesised population mean. It is a parametric test.
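A one-sample test can be run with `scipy.stats.ttest_1samp`. The sketch below uses a small made-up sample and a hypothesised population mean of 5.0; both the data and the value 5.0 are assumptions for illustration and do not come from the tree growth dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical sample (made-up values, for illustration only)
sample = np.array([5.1, 4.8, 5.6, 5.3, 4.9, 5.7, 5.2, 5.4])
hypothesised_mean = 5.0  # assumed population mean under H0

# H0: the population mean equals hypothesised_mean
t_stat, p_value = stats.ttest_1samp(sample, popmean=hypothesised_mean)
print("t statistic:", t_stat)
print("p-value:", p_value)

if p_value < 0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")
```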

**Two sampled** t-test :- The independent samples t-test, or two-sample t-test, compares the means of two independent groups to determine whether there is statistical evidence that the associated population means are significantly different. It is a parametric test, also known as the independent t-test.


In [9]:
import pandas as pd
import numpy as np
from scipy import stats

In [17]:
growth_pd=pd.read_csv('treegrowth.csv')
growth_pd.groupby('group').weight.mean()

group
ctrl    4.876894
trt1    5.478568
trt2    6.349131
Name: weight, dtype: float64

In [8]:
# let's test whether the difference in mean weight between the control group (ctrl) and the treatment group (trt1) is statistically significant
t_test_2group=growth_pd[growth_pd['group'].isin(['ctrl','trt1'])]

In [12]:
t_test_2group[t_test_2group.group=='ctrl']

Unnamed: 0.1,Unnamed: 0,weight,group
0,0,5.838792,ctrl
1,1,4.004314,ctrl
2,2,4.906519,ctrl
3,3,5.852097,ctrl
4,4,4.314734,ctrl
5,5,4.899277,ctrl
6,6,5.177049,ctrl
7,7,4.612868,ctrl
8,8,5.054972,ctrl
9,9,4.108316,ctrl


In [15]:
ttest,pval = stats.ttest_ind(t_test_2group[t_test_2group.group=='ctrl'].weight, t_test_2group[t_test_2group.group=='trt1'].weight)
print(pval)

0.015926741965344212


In [16]:
if pval<0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")

reject null hypothesis
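By default, `stats.ttest_ind` assumes both groups share the same variance. When that assumption is doubtful, passing `equal_var=False` runs Welch's t-test instead. The sketch below uses made-up samples rather than `treegrowth.csv`, so the numbers are illustrative only.

```python
import numpy as np
from scipy import stats

# Hypothetical groups with visibly different spreads (made-up values)
group_a = np.array([4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.9, 5.1])
group_b = np.array([5.9, 4.2, 6.8, 5.1, 7.0, 4.6, 6.3, 5.5])

# Student's t-test: pooled variance (scipy's default, equal_var=True)
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b)

# Welch's t-test: no equal-variance assumption
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print("pooled variance p-value:", p_pooled)
print("Welch p-value:", p_welch)
```

With equal group sizes, as here, the two t statistics coincide and only the degrees of freedom (and therefore the p-values) differ.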


**Paired sampled** t-test :- The paired sample t-test is also called the dependent sample t-test. It is a univariate test for a significant difference between two related variables. An example is collecting the blood pressure of an individual before and after some treatment, condition, or time point.

In [20]:
df = pd.read_csv("bloodpressure.csv")
df[['bp_before','bp_after']].describe()

Unnamed: 0,bp_before,bp_after
count,120.0,120.0
mean,160.833333,168.208333
std,10.549072,5.956277
min,140.0,160.0
25%,153.0,162.0
50%,160.0,167.5
75%,170.0,174.0
max,179.0,179.0


In [24]:
ttest,pval = stats.ttest_rel(df['bp_before'], df['bp_after'])
print(pval)
if pval<0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")

8.02479928440675e-10
reject null hypothesis
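A paired t-test is equivalent to a one-sample t-test on the per-subject differences against a mean of zero. The sketch below checks this on made-up before/after readings (not the `bloodpressure.csv` data).

```python
import numpy as np
from scipy import stats

# Hypothetical before/after readings for the same 8 subjects (made-up values)
before = np.array([150, 162, 158, 171, 149, 165, 160, 155], dtype=float)
after = np.array([145, 160, 150, 168, 150, 158, 152, 151], dtype=float)

# Paired t-test on the two related samples
t_paired, p_paired = stats.ttest_rel(before, after)

# One-sample t-test on the differences, H0: mean difference is 0
t_diff, p_diff = stats.ttest_1samp(before - after, popmean=0.0)

print("paired:      t =", t_paired, " p =", p_paired)
print("differences: t =", t_diff, " p =", p_diff)
```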


## When you can run a Z Test.

* Your sample size is greater than 30. Otherwise, use a t test.
* Your data should be normally distributed. However, for large sample sizes (over 30) this doesn’t always matter.

In [31]:
from statsmodels.stats import weightstats as stests
ztest ,pval1 = stests.ztest(df['bp_before'], x2=df['bp_after'], value=0,alternative='two-sided')
print(float(pval1))
# compare the z-test p-value (pval1), not the earlier t-test pval
if pval1<0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")

2.57879808579463e-11
reject null hypothesis
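To see what a two-sample z-test computes, the statistic can be worked out by hand. This is a minimal sketch of the textbook form using each sample's own variance, run on simulated data; `stests.ztest` may use a pooled variance estimate, so its numbers can differ slightly from this calculation.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical samples with more than 30 observations each (simulated)
rng = np.random.default_rng(0)
x1 = rng.normal(loc=160.0, scale=10.0, size=120)
x2 = rng.normal(loc=165.0, scale=8.0, size=120)

# Standard error of the difference in means, using each sample's own variance
se = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))

# z statistic for H0: the two population means are equal
z = (x1.mean() - x2.mean()) / se

# Two-sided p-value from the standard normal distribution
p_value = 2 * norm.sf(abs(z))
print("z statistic:", z)
print("p-value:", p_value)
```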


**ANOVA** (F-TEST) :- The t-test works well when dealing with two groups, but sometimes we want to compare more than two groups at the same time. For example, if we wanted to test whether voter age differs based on some categorical variable like race, we would have to compare the means of each level or group of the variable. We could carry out a separate t-test for each pair of groups, but conducting many tests increases the chance of false positives. The analysis of variance, or ANOVA, is a statistical inference test that lets you compare multiple groups at the same time.

One Way F-test (ANOVA) :- It tells whether two or more groups are similar or not, based on their means and the F-score.

In [33]:
df_anova = growth_pd[['weight','group']]
grps = pd.unique(growth_pd.group.values)
d_data = {grp:df_anova['weight'][df_anova.group == grp] for grp in grps}
 
F, p = stats.f_oneway(d_data['ctrl'], d_data['trt1'], d_data['trt2'])
print("p-value for significance is: ", p)
if p<0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")

p-value for significance is:  4.287496423404694e-06
reject null hypothesis
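The F-score returned by `stats.f_oneway` is the ratio of between-group to within-group variability. The sketch below computes it by hand on made-up data (not `treegrowth.csv`) and compares the result with scipy.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups (made-up values)
groups = [
    np.array([4.2, 4.8, 5.1, 4.6, 5.0]),
    np.array([5.3, 5.9, 5.6, 6.1, 5.4]),
    np.array([6.0, 6.4, 6.9, 6.2, 6.6]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)       # number of groups
n = len(all_values)   # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual = stats.f.sf(f_manual, k - 1, n - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print("manual: F =", f_manual, " p =", p_manual)
print("scipy:  F =", f_scipy, " p =", p_scipy)
```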


Two Way F-test :- The two-way F-test is an extension of the one-way F-test. It is used when we have two independent variables (factors), each with two or more groups.
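As a rough illustration, a two-way ANOVA for a balanced design (the same number of replicates in every cell) can be computed directly from sums of squares. The data, the factor names A and B, and the balanced layout are all assumptions made for this sketch; unbalanced designs need a regression-based approach instead.

```python
import numpy as np
from scipy import stats

# Hypothetical balanced design: factor A (2 levels) x factor B (3 levels),
# 4 replicates per cell. data[i, j, :] holds the replicates for cell (i, j).
rng = np.random.default_rng(1)
data = rng.normal(loc=5.0, scale=1.0, size=(2, 3, 4))
data[1] += 1.0  # assumed shift for the second level of factor A

a, b, r = data.shape
grand = data.mean()
mean_a = data.mean(axis=(1, 2))   # one mean per level of A
mean_b = data.mean(axis=(0, 2))   # one mean per level of B
mean_cell = data.mean(axis=2)     # one mean per (A, B) cell

# Sums of squares for the main effects, the interaction, and the error
ss_a = b * r * ((mean_a - grand) ** 2).sum()
ss_b = a * r * ((mean_b - grand) ** 2).sum()
ss_ab = r * ((mean_cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2).sum()
ss_err = ((data - mean_cell[:, :, None]) ** 2).sum()

df_a, df_b, df_ab, df_err = a - 1, b - 1, (a - 1) * (b - 1), a * b * (r - 1)
ms_err = ss_err / df_err

# One F-test per effect, each compared against the error mean square
for name, ss, df in [("A", ss_a, df_a), ("B", ss_b, df_b), ("A x B", ss_ab, df_ab)]:
    f_stat = (ss / df) / ms_err
    p = stats.f.sf(f_stat, df, df_err)
    print(f"{name}: F = {f_stat:.3f}, p = {p:.4f}")
```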

**Chi-Square** Test- The test is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.

In [39]:
df_chi = pd.read_csv('click_ad.csv')
contingency_table=pd.crosstab(df_chi["ad"],df_chi["click"])
print('contingency_table\n',contingency_table)

contingency_table
 click  no  yes
ad            
a       4    2
b       4    0


In [46]:
Expected_Values=stats.chi2_contingency(contingency_table)
print("Expected Values\n",Expected_Values[3])

Expected Values
 [[4.8 1.2]
 [3.2 0.8]]


In [53]:
# degrees of freedom
col=len(contingency_table.iloc[0,:])
row=len(contingency_table.iloc[:,0])
n_fd=(col-1)*(row-1)
print('degrees of freedom:',n_fd)

degrees of freedom: 1


In [62]:
from scipy.stats import chi2
chi_square=sum([(o-e)**2/e for o,e in zip(contingency_table.values,Expected_Values[3])])
chi_square_statistic=chi_square[0]+chi_square[1]
print("chi-square statistic:-",chi_square_statistic)

chi-square statistic:- 1.6666666666666665


In [68]:
alpha=0.05
critical_value=chi2.ppf(q=1-alpha,df=n_fd)
print('critical_value:',critical_value)

critical_value: 3.841458820694124


In [70]:
#p-value
p_value=1-chi2.cdf(x=chi_square_statistic,df=n_fd)
print('p-value:',p_value)
print('Significance level: ',alpha)
print('Degree of Freedom: ',n_fd)
print('chi-square statistic:',chi_square_statistic)
print('critical_value:',critical_value)
print('p-value:',p_value)
if chi_square_statistic>=critical_value:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")
    
if p_value<=alpha:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")

p-value: 0.19670560245894675
Significance level:  0.05
Degree of Freedom:  1
chi-square statistic: 1.6666666666666665
critical_value: 3.841458820694124
p-value: 0.19670560245894675
Retain H0,There is no relationship between 2 categorical variables
Retain H0,There is no relationship between 2 categorical variables
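The manual calculation can be cross-checked against `stats.chi2_contingency`, which returns the statistic, p-value, degrees of freedom, and expected values in one call. The sketch below re-enters the observed counts from the contingency table above as a plain array instead of reading `click_ad.csv`.

```python
import numpy as np
from scipy import stats

# Observed counts from the contingency table above
# (rows: ad a, ad b; columns: click no, click yes)
observed = np.array([[4, 2],
                     [4, 0]])

# correction=False disables Yates' continuity correction, which scipy applies
# by default to 2x2 tables
chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)

print("chi-square statistic:", chi2_stat)
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected values:\n", expected)
```

With `correction=False`, the statistic and p-value should agree with the manual results (about 1.667 and 0.197). Leaving the default correction on gives a smaller statistic for this 2x2 table.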
