# Basic concepts
Hypothesis testing is a method of statistical inference that tests the validity of a claim about the population, using sample data. Hypothesis testing takes into account the following concepts.

#### Hypotheses
- The *null hypothesis* (denoted $H_0$): the common view whose validity needs to be tested.
- The *alternative hypothesis* (denoted $H_1$): what will be believed if $H_0$ is rejected.

#### Significance level
The *significance level* (denoted $\alpha$) is a pre-selected number between $0$ and $1$ that sets the maximum accepted probability of rejecting the *null hypothesis* when it is actually true (a type I error). Common values of $\alpha$ are $0.05$ and $0.01$. A related concept is the *confidence level*, $\gamma = 1-\alpha$.

#### Test statistic
The *test statistic* is a quantity computed from the sample that follows a known theoretical distribution when $H_0$ is true. Because that distribution is known, we can calculate the probability of observing a value at least as extreme as the computed one, which indicates how compatible the data are with $H_0$. Most test statistics in this document can be read as a fraction where:
- The numerator is the strength of the signal (denoted $P_{signal}$), for example the difference between a sample estimate and the hypothesized value.
- The denominator is the strength of the noise (denoted $P_{noise}$), for example the standard error of that estimate.

#### p-value
The *p-value* is the probability, computed under the assumption that $H_0$ is true, of obtaining a test statistic at least as extreme as the one observed. It is not the probability that $H_0$ is true. Rejecting $H_0$ only when the p-value is less than $\alpha$ keeps the probability of a type I error (rejecting $H_0$ when it is true) at most $\alpha$. The smaller the p-value, the stronger the evidence against $H_0$. With $\alpha=0.05$:
- A p-value less than $0.05$ indicates the difference is significant: if $H_0$ were true, a result this extreme would occur less than $5\%$ of the time. Therefore, $H_0$ is rejected in favor of $H_1$.
- A p-value higher than $0.05$ indicates the difference is not significant. In this case, we fail to reject $H_0$, which does not prove that $H_0$ is true.
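As a minimal sketch of how $\alpha$, the critical value and the p-value relate, the snippet below uses the standard normal distribution from `scipy.stats`. The statistic `z = 0.63` is an illustrative value chosen for this sketch, not a result computed from the dataset.

```python
from scipy import stats

alpha = 0.05
z = 0.63  # illustrative test statistic (assumed value)

# critical value of a two-tailed test: P(|Z| > z_crit) = alpha under H0
z_crit = stats.norm.ppf(1 - alpha / 2)

# p-values for the three kinds of alternative hypothesis
p_two_sided = 2 * stats.norm.sf(abs(z))
p_right = stats.norm.sf(z)   # H1: the parameter is larger
p_left = stats.norm.cdf(z)   # H1: the parameter is smaller

# rejecting when p < alpha is the same decision as |z| > z_crit
reject = p_two_sided < alpha
print(round(z_crit, 2), round(p_two_sided, 4), reject)
```

The two-sided p-value is simply twice the smaller tail probability, which is why one-sided and two-sided tests on the same statistic can lead to different decisions.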

#### Descriptive statistics
For populations:
- $N$: population size
- $\mu$: population mean
- $\sigma$: population standard deviation
- $\sigma^2$: population variance
- $p$: proportion of successes in population

For samples:
- $n$: sample size
- $\hat\mu$ or $\bar x$: sample mean
- $\hat\sigma$ or $SD$: sample standard deviation
- $\hat\sigma^2$ or $s^2$: sample variance
- $\hat p$: proportion of successes in sample
- $SE_{\mu}$: standard error of mean
- $SE_p$: standard error of proportion
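The sample quantities above can be sketched with NumPy as follows. The data and the definition of a "success" (a value above 75) are made up for illustration only.

```python
import numpy as np

x = np.array([72, 77, 84, 79, 74, 67, 74, 77, 79, 89])  # illustrative sample

n = len(x)
mean = x.mean()              # sample mean
sd = x.std(ddof=1)           # sample standard deviation (n - 1 in the denominator)
var = x.var(ddof=1)          # sample variance
se_mean = sd / np.sqrt(n)    # standard error of the mean

successes = (x > 75).sum()   # hypothetical "success": a value above 75
p_hat = successes / n        # sample proportion
se_p = np.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
```

Note that NumPy's `std` and `var` default to `ddof=0` (the population formulas), so `ddof=1` has to be passed explicitly to get the sample versions.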

# Hypothesis testing summary
Type|Usage|Test statistic|
:----------|:--------------|:--------------------------|
Z-test|1. Comparing the means of 1 or 2 populations <br> 2. Comparing the proportions of 1 or 2 populations|$Z$|
F-test|Comparing the variances of 2 populations|$F$|
T-test|Comparing the means of 1 or 2 populations|$T$|
Chi-squared test|1. Comparing the proportions of 3 or more populations <br> 2. Testing the relationship between qualitative variables|$\chi^2$|
ANOVA|Comparing the means of 3 or more populations|$F$|
KS test|Testing whether data follow a given distribution|$D$|

# 1. Z-test
Usage:
- Comparing the means of one or two populations
- Comparing the proportions of one or two populations

Assumptions:
- Populations are normally distributed
- Samples are random and must have more than 30 observations
- Population variances are known (only in mean z-test)

In [1]:
import math
import numpy as np
import pandas as pd
from statsmodels.stats.weightstats import ztest
from scipy import stats

from collections import namedtuple

In [3]:
df = pd.read_excel('data/hypothesis.xlsx')
df.head()

Unnamed: 0,Code,area,gender,age,age_group,year_of_school,degree,job,know_english,know_france,...,flight_date,flight_status,professionally_staff,customer_service,diversity_product,good_price,easily_transaction,goodlooking_staff,diversity_flighttime,good_construction
0,1,central,female,69,middle,16,master,manager,1,1,...,01/05/2013,1,2,2,1,1,2,2,1,1
1,2,southern,female,50,middle,12,highshool,officer,0,0,...,01/05/2013,1,3,3,3,2,3,3,2,3
2,3,northern,male,73,elder,12,highshool,officer,1,0,...,01/05/2013,1,2,2,1,2,3,5,1,1
3,4,northern,female,73,elder,12,highshool,officer,0,0,...,01/05/2013,1,5,3,2,4,5,3,2,2
4,5,central,male,69,middle,16,master,officer,1,0,...,01/05/2013,0,3,3,3,3,3,3,3,3


In [60]:
def _ztest_(data1, 
            data2=None, 
            value=0, 
            var1=0, 
            var2=None, 
            alternative='2s'):
    # var1 and var2 are the known population variances;
    # value is the hypothesized mean (one sample) or difference of means (two samples)
    ZtestResult = namedtuple('ZtestResult', ['zstat', 'pvalue'])
    if var1 <= 0:
        raise ValueError("var1 must be a positive population variance")
    x1 = np.array(data1)
    x1_mean = x1.mean()
    x1_len = len(data1)
    # one-sample statistic
    zstat = (x1_mean - value) / (np.sqrt(var1/x1_len))
    # two-sample statistic, used when a second sample and its variance are given
    if data2 is not None and var2 is not None:
        x2 = np.array(data2)
        x2_mean = x2.mean()
        x2_len = len(data2)
        zstat = (x1_mean - x2_mean - value) / (np.sqrt(var1/x1_len+var2/x2_len))
    if alternative in ["two-sided", "2-sided", "2s"]:
        pvalue = stats.norm.sf(np.abs(zstat)) * 2
    elif alternative in ["larger", "l"]:
        pvalue = stats.norm.sf(zstat)
    elif alternative in ["smaller", "s"]:
        pvalue = stats.norm.cdf(zstat)
    else:
        raise ValueError("invalid alternative")
    return ZtestResult(zstat, pvalue)

## 1.1. One sample mean Z-test

*Problem:* Given a random sample of size $n=500$ of people's income, drawn from a population with standard deviation $\sigma=5000$. At the significance level $\alpha=0.05$, can we conclude that the population mean is $\mu=A=14000$?

First, state the hypotheses from the information:
- $H_0: \mu = 14000$
- $H_1: \mu \neq 14000$

Since this is a two-tailed test, the critical value is $z_{\alpha/2}=z_{0.025} = 1.96$: if $|Z|>1.96$, reject $H_0$ in favor of $H_1$. In this example, $|Z|=0.63$, which corresponds to a one-tail probability of $0.2643$ and therefore a two-sided p-value of about $0.5287$, so $H_0$ cannot be rejected. The formula for the test statistic is:

$$Z = \frac{\hat{\mu}-A}{\sigma/\sqrt{n}} = \frac{\hat{\mu}-A}{SE_{\mu}}$$

In [61]:
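# this cell applies the same test to the `age` column (hypothesized mean 59),
# using the sample variance as a stand-in for the known population variance
_ztest_(df.age, value=59, var1=np.var(df.age), alternative='two-sided')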

ZtestResult(zstat=1.480134241749042, pvalue=0.13883742504819616)

## 1.2. Two sample mean Z-test
*Problem:* Is the average income of males more than $5000$ higher than that of females? Given $\alpha = 0.05$, the population standard deviations of income for males and females are $\sigma_1=7000$ and $\sigma_2=5000$, respectively.

The hypotheses:
- $H_0: \mu_1 = \mu_2+5000$
- $H_1: \mu_1 > \mu_2+5000$

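This is a right-tailed test, so the critical value is $z_{\alpha}=z_{0.05} = 1.64$. If $Z>1.64$, reject $H_0$ and conclude that the average income of males exceeds that of females by more than $5000$. In this example, $Z=2.57$ and the corresponding one-sided p-value is $0.0051$, so $H_0$ is rejected. Note that the cell below passes `alternative='two-sided'`, so the p-value it prints is twice the one-sided value. The formula for the test statistic is: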

$$Z = \frac{\hat{\mu}_1-\hat{\mu}_2-A}{\sqrt{\sigma_1^2/n_1+\sigma_2^2/n_2}}
= \frac{\hat{\mu}_1-\hat{\mu}_2-A}{SE_{\mu}}$$

In [5]:
x1 = df[df['gender']=='male'].income
x2 = df[df['gender']=='female'].income

In [6]:
_ztest_(x1, x2, value=5000, var1=7000**2, var2=5000**2 ,alternative='two-sided')

(2.5734791058298225, 0.010068172587777743)

## 1.3. One sample proportion z-test
*Problem:* In a large consignment of food packets, a random sample of $n=100$ packets revealed that 5 packets were leaking. Can we conclude that the population contains at least $A=10\%$ of leaked packets at $\alpha=0.05$?

The hypotheses:
- $H_0: p\geq0.1$
- $H_1: p<0.1$

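This is a left-tailed test: $H_0$ is rejected if $Z<-z_{0.05}=-1.64$. Here $Z=-2.294$ and the corresponding one-sided p-value is $0.011$ ($<0.05$), so $H_0$ is rejected. The call `_ztestp_(p1=5/100, n1=100, value=0.1)` further down uses the default two-sided alternative, so it prints twice this p-value. The formula for the test statistic, with the standard error estimated from $\hat p$, is: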

$$Z = \frac{\hat{p}-A}{\sqrt{\hat{p}(1-\hat{p})/n}} = \frac{\hat{p}-A}{SE_p}$$
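The arithmetic for this example can be checked directly. The sketch below follows the formula above and uses the left-tailed alternative from the stated hypotheses. The variant `z_null`, which plugs the hypothesized value $A$ into the standard error as many textbooks do, is an addition for comparison and not part of the original formula.

```python
import numpy as np
from scipy import stats

n, leaking, A = 100, 5, 0.1
p_hat = leaking / n

# standard error estimated from the sample proportion, as in the formula above
se_p = np.sqrt(p_hat * (1 - p_hat) / n)
z = (p_hat - A) / se_p
p_left = stats.norm.cdf(z)   # left-tailed p-value for H1: p < A

# variant (assumption for comparison): standard error computed under H0
z_null = (p_hat - A) / np.sqrt(A * (1 - A) / n)
p_left_null = stats.norm.cdf(z_null)

print(round(z, 3), round(p_left, 4), round(z_null, 3))
```

The two versions can lead to different conclusions near the critical value, so it is worth stating which standard error is used when reporting the result.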

In [5]:
def _ztestp_(p1, 
            p2= None, 
            n1=1,
            n2=None,
            value=0, 
            alternative='2s'):
    ZtestResult = namedtuple('ZtestResult', ['zstat', 'pvalue'])
    se = np.sqrt(p1*(1-p1)/n1)
    if se == 0:
        raise ValueError("divide by zero")
    else:
        zstat = (p1 - value) / se
        if p2 is not None and n2 is not None:
            se1 = p1*(1-p1)/n1
            se2 = p2*(1-p2)/n2
            zstat = (p1 - p2 - value) / np.sqrt(se1+se2)
        if alternative in ["two-sided", "2-sided", "2s"]:
            pvalue = stats.norm.sf(np.abs(zstat)) * 2
        elif alternative in ["larger", "l"]:
            pvalue = stats.norm.sf(zstat)
        elif alternative in ["smaller", "s"]:
            pvalue = stats.norm.cdf(zstat)
        else:
            raise ValueError("invalid alternative")
    return ZtestResult(zstat, pvalue)

In [6]:
_ztestp_(p1=5/100, n1=100, value=0.1)

ZtestResult(zstat=-2.294157338705618, pvalue=0.021781462791119477)

## 1.4. Two sample proportion z-test 
*Problem:* A machine turns out 16 imperfect articles in a sample of $n_1=500$. After maintenance, it turns out 3 imperfect articles in a sample of $n_2=100$. Has the machine improved after maintenance, at significance level $\alpha=0.05$?

The hypotheses:
- $H_0: p_1=p_2$
- $H_1: p_1>p_2$

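This is a right-tailed test: if $Z>z_{0.05}=1.64$, reject $H_0$. In this example $Z\approx0.106$, so $H_0$ cannot be rejected. With $A=0$ here, the formula for the test statistic is: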

$$Z = \frac{\hat{p}_1-\hat{p}_2-A}{\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}
= \frac{\hat{p}_1-\hat{p}_2-A}{SE_p}$$

In [7]:
_ztestp_(p1=16/500,p2=3/100,n1=500,n2=100,value=0)

ZtestResult(zstat=0.10645649716403335, pvalue=0.9152201694418036)

# 2. F-test
Usage:
- Comparing the variances of two populations
- Used in one-way ANOVA to compare means between groups (section 5)
- Used in multivariate linear regression to test the significance of R-squared

Assumptions:
- Populations are normally distributed
- The two samples are random and independent

In [7]:
import math
import numpy as np
import pandas as pd
from scipy import stats
from collections import namedtuple

In [64]:
def _ftest_(data1, data2, ratio, alternative):
    # tests H0: var1 = ratio * var2
    FtestResult = namedtuple('FtestResult', ['fstat', 'pvalue'])
    if ratio <= 0:
        raise ValueError('Invalid ratio')
    x1 = np.array(data1)
    x2 = np.array(data2)
    # note: ndarray.var() defaults to ddof=0; pass ddof=1 for the unbiased sample variance
    var1 = x1.var()
    var2 = x2.var()
    df1 = len(x1)-1
    df2 = len(x2)-1
    fstat = var1/(ratio*var2)
    if alternative in ["two-sided", "2-sided", "2s"]:
        # the F distribution is not symmetric, so double the smaller tail
        pvalue = 2 * min(stats.f.cdf(fstat, df1, df2), stats.f.sf(fstat, df1, df2))
    elif alternative in ["larger", "l"]:
        pvalue = stats.f.sf(fstat, df1, df2)
    elif alternative in ["smaller", "s"]:
        pvalue = stats.f.cdf(fstat, df1, df2)
    else:
        raise ValueError("invalid alternative")
    return FtestResult(fstat, pvalue)

*Problem:* At the significance level $\alpha=0.05$, is the population variance of male income more than $A=5$ times that of female income?

The hypotheses:
- $H_0: \sigma^2_1 = 5\sigma^2_2$
- $H_1: \sigma^2_1 > 5\sigma^2_2$

If p-value $<0.05$: reject $H_0$. The formula for test statistic is:

$$F = \frac{1}{A}\frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2}$$

In [10]:
x1 = df[df['gender']=='male'].income
x2 = df[df['gender']=='female'].income

In [65]:
_ftest_(x1, x2, ratio=5, alternative='larger')

FtestResult(fstat=0.25990922844175496, pvalue=0.971297775456867)
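For comparison, here is a minimal sketch of the same F-test written with unbiased sample variances (`ddof=1`). The two samples are synthetic and generated only for illustration, so the numbers do not correspond to the income data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x1 = rng.normal(loc=0, scale=3.0, size=200)  # synthetic sample 1
x2 = rng.normal(loc=0, scale=1.0, size=150)  # synthetic sample 2
A = 5                                        # hypothesized variance ratio

# unbiased sample variances
s1_sq = x1.var(ddof=1)
s2_sq = x2.var(ddof=1)

f_stat = s1_sq / (A * s2_sq)
df1, df2 = len(x1) - 1, len(x2) - 1

p_right = stats.f.sf(f_stat, df1, df2)    # H1: var1 > A * var2
p_left = stats.f.cdf(f_stat, df1, df2)    # H1: var1 < A * var2
p_two_sided = 2 * min(p_left, p_right)    # H1: var1 != A * var2
```

Because the F distribution is not symmetric, the two-sided p-value doubles the smaller of the two tail probabilities rather than using an absolute value.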

# 3. T-test

In [12]:
import math
import numpy as np
import pandas as pd
from scipy import stats
import pingouin as pg
from collections import namedtuple

## 3.1. One sample T-test
Usage: To compare the mean of a population with a hypothesized value when the population variance is unknown.

Assumptions:
- The population is normally distributed
- The sample is random

*Problem:* At the significance level $\alpha=0.05$, is the mean income equal to $13000$ or not?

The hypotheses:
- $H_0: \mu=13000$
- $H_1: \mu\neq13000$

The formula for test statistic is:

$$T = \frac{\hat{\mu}-A}{\hat\sigma/\sqrt{n}}$$
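The formula can be checked against `stats.ttest_1samp`. The sketch below uses synthetic data (not the `income` column), so only the agreement between the manual and library results matters.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=13500, scale=5000, size=500)  # synthetic incomes
A = 13000                                        # hypothesized mean

n = len(x)
# T statistic from the formula above, using the sample standard deviation
t_manual = (x.mean() - A) / (x.std(ddof=1) / np.sqrt(n))
# two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(x, A)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```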

In [14]:
# the `alternative` argument is new in scipy version 1.6.0
stats.ttest_1samp(df.income, 13000, alternative='two-sided')

Ttest_1sampResult(statistic=2.6904932307376574, pvalue=0.007373867945841888)

In [15]:
pg.ttest(x = df.income, y = 13000.0, tail='two-sided')

Unnamed: 0,T,dof,tail,p-val,CI95%,cohen-d,BF10,power
T-test,2.690493,499,two-sided,0.007374,"[13231.88, 14487.36]",0.120323,1.79,0.765882


## 3.2. Two independent sample T-test
Usage: To compare the means of two populations using their independent samples. An F-test should be run first to check the equality of the two population variances.

Assumptions:
- Two populations are normally distributed
- Two samples are independent and random
- The two variances are equal (for the pooled form of the test; the unequal-variance form is also given below)

*Problem:* With $\alpha=0.05$, are the average incomes of males and females equal?

The hypotheses:
- $H_0: \mu_1 = \mu_2$
- $H_1: \mu_1 \neq \mu_2$

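Let $A$ be the hypothesized difference between the two means ($A=0$ in this example). If $\sigma_1^2 \neq \sigma_2^2$ (the case used in this example; see the F-test in section 2), the test statistic, known as Welch's t-test, is: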

$$T = \frac{\hat{\mu}_1-\hat{\mu}_2-A}{\sqrt{\frac{\hat{\sigma}_1^2}{n_1}+\frac{\hat{\sigma}_2^2}{n_2}}}$$

If $\sigma_1^2 = \sigma_2^2$, the test statistic is:

$$T = \frac{\hat{\mu}_1-\hat{\mu}_2-A}{\hat\sigma_p \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$$

where

$$\hat\sigma_p = \sqrt{\frac{(n_1-1)\hat{\sigma}_1^2 + (n_2-1)\hat{\sigma}_2^2}{n_1+n_2-2}}$$

is the pooled standard deviation of the two samples.
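Both forms of the statistic can be reproduced by hand and compared with `stats.ttest_ind`. The sketch below uses synthetic samples and $A=0$, so the values are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x1 = rng.normal(15000, 6000, size=260)  # synthetic sample 1
x2 = rng.normal(9000, 3000, size=240)   # synthetic sample 2

n1, n2 = len(x1), len(x2)
v1, v2 = x1.var(ddof=1), x2.var(ddof=1)  # sample variances
diff = x1.mean() - x2.mean()             # hypothesized difference A = 0

# unequal variances (Welch)
t_welch = diff / np.sqrt(v1 / n1 + v2 / n2)

# equal variances (pooled standard deviation)
sd_pooled = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
t_pooled = diff / (sd_pooled * np.sqrt(1 / n1 + 1 / n2))

t_welch_scipy = stats.ttest_ind(x1, x2, equal_var=False).statistic
t_pooled_scipy = stats.ttest_ind(x1, x2, equal_var=True).statistic
```

The `equal_var` argument is what switches `stats.ttest_ind` between the two formulas.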

In [16]:
x1 = df[df['gender']=='male'].income
x2 = df[df['gender']=='female'].income

In [17]:
stats.ttest_ind(x1, x2, equal_var=False, alternative='two-sided')

Ttest_indResult(statistic=11.816737235568048, pvalue=1.6702963183174316e-27)

In [18]:
pg.ttest(x = x1, y = x2, correction = True, tail = 'two-sided')

Unnamed: 0,T,dof,tail,p-val,CI95%,cohen-d,BF10,power
T-test,11.816737,364.874101,two-sided,1.670296e-27,"[5323.12, 7448.52]",0.997539,1.904e+25,1.0


## 3.3. Paired sample T-test
Usage: Comparing two population means, given their dependent samples. A paired samples t-test calculates the difference between each pair of observations and then performs a one-sample t-test on those differences.

Assumptions:
- The two populations should be both normally distributed
- The two random samples come in pairs (before and after data for example)
- Same sample sizes

In [19]:
x1 = [72,77,84,79,74,67,74,77,79,89]
x2 = [65,68,77,73,66,61,66,71,71,78]

*Problem:* With $\alpha=0.05$, is the average weight after at least 8 kg less than before?

The hypotheses:
- $H_0: \mu_1-\mu_2\geq8$
- $H_1: \mu_1-\mu_2<8$

The test statistic is:

$$T = \frac{\hat\mu_1-\hat\mu_2-A}{\hat\sigma_d/\sqrt n} = \frac{\hat\mu_d-A}{\hat\sigma_d/\sqrt n}$$
where
- $\hat\mu_d$ is the sample mean of the differences
- $\hat\sigma_d$ is the sample standard deviation of the differences
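Since the paired test is a one-sample t-test on the differences, a non-zero hypothesized difference $A$ can be handled by passing the differences to `stats.ttest_1samp`. The sketch below uses the weights from this section; the `alternative` argument assumes scipy 1.6.0 or later.

```python
import numpy as np
from scipy import stats

before = np.array([72, 77, 84, 79, 74, 67, 74, 77, 79, 89])
after = np.array([65, 68, 77, 73, 66, 61, 66, 71, 71, 78])
A = 8  # hypothesized mean difference

d = before - after
n = len(d)

# T statistic from the formula above
t_manual = (d.mean() - A) / (d.std(ddof=1) / np.sqrt(n))
p_manual = stats.t.cdf(t_manual, df=n - 1)  # left-tailed, H1: mu_d < A

# same test through scipy, applied to the differences
t_scipy, p_scipy = stats.ttest_1samp(d, popmean=A, alternative='less')
print(round(t_manual, 4), round(p_manual, 4))
```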

In [20]:
# scipy and pingouin always test a mean difference of 0 here; the hypothesized value cannot be changed
stats.ttest_rel(x1, x2, alternative='two-sided')

Ttest_relResult(statistic=15.23389078900819, pvalue=9.861813084066749e-08)

In [21]:
pg.ttest(x=x1, y=x2, paired=True, tail='two-sided')

Unnamed: 0,T,dof,tail,p-val,CI95%,cohen-d,BF10,power
T-test,15.233891,9,two-sided,9.861813e-08,"[6.47, 8.73]",1.30767,117600.0,0.955888


In [22]:
def pair_ttest(data1, data2, mu=0, alternative='two-sided'):
    # paired differences
    d = np.array(data1) - np.array(data2)
    n = len(d)
    dof = n - 1
    # sample standard deviation of the differences
    sd_d = d.std(ddof=1)
    tstat = (d.mean() - mu)/(sd_d/np.sqrt(n))
    if alternative in ["two-sided", "2-sided", "2s"]:
        pvalue = stats.t.sf(np.abs(tstat), dof) * 2
    elif alternative in ["larger", "l"]:
        pvalue = stats.t.sf(tstat, dof)
    elif alternative in ["smaller", "s"]:
        pvalue = stats.t.cdf(tstat, dof)
    else:
        raise ValueError("invalid alternative")
    return tstat, pvalue

In [23]:
pair_ttest(x1,x2, mu=8, alternative='smaller')

(-0.8017837257372562, 0.22166592508483474)

# 4. Chi-square
Usage:
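- Comparing the proportions of two or more populations
- Independence testing between qualitative variables

Assumptions:
- Observations are independent and each one falls into exactly one category
- Expected counts are large enough (a common rule of thumb is at least 5 per cell)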

In [24]:
import math
import numpy as np
import pandas as pd
from scipy import stats
import pingouin as pg

## 4.1. Chi-square test of independence
*Problem*: Is there a relationship between `age_group` and `degree`?

The hypotheses:
- $H_0:$ The two variables are independent
- $H_1:$ The two variables are dependent

`age_group` and `degree` are considered dependent if p-value $<0.05$. The p-value indicates whether a relationship exists, not how strong it is.
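To make the test concrete, the sketch below computes the expected counts and the $\chi^2$ statistic by hand for a small hypothetical contingency table, then compares the result with `stats.chi2_contingency`. The counts are invented for illustration.

```python
import numpy as np
from scipy import stats

# hypothetical 2x3 contingency table of observed counts
observed = np.array([[20, 30, 50],
                     [30, 30, 40]])

# expected count of a cell under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed)
```

The degrees of freedom depend only on the table shape, which is why the `dof` value is the same for every test variant on a given table.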

In [31]:
table = pd.crosstab(df.age_group,df.degree)

In [32]:
chi, pvalue, dof, _ = stats.chi2_contingency(table)
print("chi stats:", chi)
print('p-value:', pvalue)

chi stats: 84.7456748205155
p-value: 1.717625046407707e-17


In [33]:
# the result is named chi2_stats so that the scipy `stats` module is not overwritten
expected, observed, chi2_stats = pg.chi2_independence(data=df, x='age_group', y='degree')


In [34]:
expected

degree,bachelor,highshool,master
age_group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
elder,25.542,50.886,22.572
middle,91.074,181.442,80.484
youth,12.384,24.672,10.944


In [35]:
chi2_stats

Unnamed: 0,test,lambda,chi2,dof,pval,cramer,power
0,pearson,1.0,84.745675,4.0,1.7176250000000003e-17,0.291111,0.999915
1,cressie-read,0.666667,89.164397,4.0,1.981493e-18,0.298604,0.999956
2,log-likelihood,0.0,108.047053,4.0,1.89858e-22,0.328705,0.999998
3,freeman-tukey,-0.5,,4.0,,,
4,mod-log-likelihood,-1.0,inf,4.0,0.0,inf,1.0
5,neyman,-2.0,,4.0,,,


## 4.2. Proportion chi-square

In R, the `prop.test` function applies Yates's continuity correction by default. Pearson's chi-square statistic is biased upward for a 2x2 contingency table, which tends to make results larger than they should be, so Yates's correction adds an adjustment term to the formula of the chi-square statistic. However, the correction is often considered too strict, so it is not applied here.

Instead, the `stats.chisquare` function can be used for the proportion chi-square test.

*Problem*: Is the number of officers equal to the number of salespeople and 5 times the number of managers?

- $H_0: p_1=1/11, p_2=p_3=5/11$, where $p_1$, $p_2$ and $p_3$ are the proportions of managers, officers and salespeople
- $H_1$: At least one of these equalities does not hold.

In [33]:
df_chi = df.groupby('job').count()[['Code']].reset_index()

df_chi['obs'] = df_chi.Code/len(df)
df_chi['exp'] = [1/11,5/11,5/11]

In [34]:
df_chi

Unnamed: 0,job,Code,obs,exp
0,manager,44,0.088,0.090909
1,officer,239,0.478,0.454545
2,sale,217,0.434,0.454545


In [35]:
# `stats.chisquare` expects observed and expected counts rather than proportions
stats.chisquare(df_chi.Code, df_chi.exp * df_chi.Code.sum())

Power_divergenceResult(statistic≈1.116, pvalue≈0.572)

# 5. ANOVA
ANOVA (Analysis of Variance) is a collection of statistical tests for analyzing the differences among the means of two or more groups. The means are calculated from a quantitative variable; the groups are determined using qualitative variables.

## One way ANOVA
Usage: Compare multiple population means when you have one categorical variable containing at least three categories.

Assumptions:
- Populations are normally distributed
- Samples are random
- Homogeneity of variances

The work flow:
1. Test the homogeneity of variances, using one of the following tests:
    - Bartlett test (`stats.bartlett` function)
    - Levene test (`stats.levene` function)
    - Fligner-Killeen test (`stats.fligner` function)
2. Test the equality of population means:
    - If the variances are equal, use the `stats.f_oneway` or `pg.anova` function
    - If the variances are not equal, use the `pg.welch_anova` function
3. Post-hoc test to compare pairwise population means:
    - If the variances are equal, use Tukey HSD test (`pg.pairwise_tukey`)
    - If the variances are not equal, use Games-Howell test (`pg.pairwise_gameshowell`)
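The first two steps of the work flow can be sketched with scipy alone. The three groups below are synthetic, and the manual $F$ computation (between-group over within-group mean squares) is included only to show what `stats.f_oneway` calculates.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# three synthetic groups with the same spread but different means
groups = [rng.normal(10, 2, size=80),
          rng.normal(11, 2, size=80),
          rng.normal(13, 2, size=80)]
alpha = 0.05

# step 1: homogeneity of variances (Levene test)
_, p_levene = stats.levene(*groups)
equal_var = p_levene >= alpha

# step 2: classic one-way ANOVA (appropriate when the variances look equal)
f_stat, p_anova = stats.f_oneway(*groups)

# the same F statistic by hand: mean square between / mean square within
k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_manual = (ss_between / (k - 1)) / (ss_within / (N - k))

print(equal_var, np.isclose(f_manual, f_stat), p_anova < alpha)
```

When `equal_var` is false, the work flow above switches to `pg.welch_anova` and the Games-Howell post-hoc test instead.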

In [36]:
import math
import numpy as np
import pandas as pd
from scipy import stats
import pingouin as pg

*Problem:* Is there a difference in average income between 3 areas (Central, Southern and Northern)? If there is, which group differs from the others?

**Step 1:** Check the equality of population variances. If p-value $<0.05$, then reject $H_0$. The hypotheses:
- $H_0: \sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2$
- $H_1$: There exists at least one pair $\sigma_i^2 \neq \sigma_j^2$ where $i \neq j$

In [38]:
central = df[df['area'] =='central']['income']
northern = df[df['area'] =='northern']['income']
southern = df[df['area'] =='southern']['income']

In [39]:
stats.bartlett(central, northern, southern)

BartlettResult(statistic=865.4773139975722, pvalue=1.1587484312813138e-188)

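**Step 2:** Test whether the population means are equal or not. If p-value $<0.05$, then reject $H_0$. Because Bartlett's test in step 1 rejects the equality of variances, the Welch ANOVA result is the relevant one for this example; the standard ANOVA is shown for comparison. The hypotheses: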
- $H_0$: $\mu_1 = \mu_2 = \dots = \mu_k$
- $H_1$: There is at least one pair $\mu_i \neq \mu_j $ where $i \neq j$

In [40]:
stats.f_oneway(central, northern, southern)

F_onewayResult(statistic=222.22364903952098, pvalue=1.1382909390301103e-69)

In [41]:
pg.anova(data=df, dv='income', between='area')

Unnamed: 0,Source,ddof1,ddof2,F,p-unc,np2
0,area,2,497,222.223649,1.1382910000000001e-69,0.472089


In [42]:
pg.welch_anova(data=df, dv='income', between='area')

Unnamed: 0,Source,ddof1,ddof2,F,p-unc,np2
0,area,2,283.472665,441.890825,7.605951e-88,0.472089


**Step 3:** Post-hoc test to compare pairwise means. Any pair having p-value $<0.05$ can be considered significantly different in mean. Because the variances are unequal in this example, the Games-Howell result is the relevant one.

In [43]:
pg.pairwise_tukey(data=df, dv='income', between='area')

Unnamed: 0,A,B,mean(A),mean(B),diff,se,T,p-tukey,hedges
0,central,northern,19586.096618,10750.421622,8835.674997,526.24002,16.7902,0.001,1.695475
1,central,southern,19586.096618,8209.814815,11376.281804,617.404943,18.425965,0.001,2.181955
2,northern,southern,10750.421622,8209.814815,2540.606807,629.865678,4.033569,0.001,0.487196


In [44]:
pg.pairwise_gameshowell(data=df, dv='income', between='area')

Unnamed: 0,A,B,mean(A),mean(B),diff,se,T,df,pval,hedges
0,central,northern,19586.096618,10750.421622,8835.674997,560.507528,15.763704,211.211292,0.001,1.591819
1,central,southern,19586.096618,8209.814815,11376.281804,563.372921,20.193164,215.382039,0.001,2.391222
2,northern,southern,10750.421622,8209.814815,2540.606807,105.242773,24.14044,218.527723,0.001,2.915812


# 6. Distribution test
The Kolmogorov-Smirnov test (KS test) is used to test whether a random variable follows a specific distribution or not. The test statistic $D$ is the largest absolute difference between the empirical CDF of the observed variable and the CDF of the reference distribution. In scipy, the `stats.kstest` function performs the Kolmogorov-Smirnov test.

Here are popular distributions from `scipy.stats` whose names can be passed to `stats.kstest`, with their main parameters in scipy naming. The test is designed for continuous reference distributions, so results for discrete ones such as the binomial and Poisson should be read with caution.

Distribution|Function|Parameters         |
:-----------|:---------|:------------------|
Binomial    |`binom`  |`n`, `p`           |
Poisson     |`poisson`|`mu`               |
Uniform     |`uniform`|`loc`, `scale`     |
Normal      |`norm`   |`loc`, `scale`     |
Cauchy      |`cauchy` |`loc`, `scale`     |
T           |`t`      |`df`               |
F           |`f`      |`dfn`, `dfd`       |
Chi-squared |`chi2`   |`df`               |
Beta        |`beta`   |`a`, `b`           |
Gamma       |`gamma`  |`a`, `scale`       |
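As a minimal sketch of what `stats.kstest` computes, the snippet below draws synthetic normal data, runs the test against a normal distribution with assumed parameters, and reproduces $D$ by hand from the empirical CDF.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(loc=50, scale=10, size=300)  # synthetic data

# reference distribution: normal with assumed (not estimated) parameters
loc, scale = 50, 10
result = stats.kstest(x, 'norm', args=(loc, scale))

# manual D: largest gap between the empirical CDF and the reference CDF
xs = np.sort(x)
n = len(xs)
cdf = stats.norm.cdf(xs, loc, scale)
d_plus = np.max(np.arange(1, n + 1) / n - cdf)   # ECDF above the reference CDF
d_minus = np.max(cdf - np.arange(0, n) / n)      # ECDF below the reference CDF
d_manual = max(d_plus, d_minus)

print(np.isclose(d_manual, result.statistic))
```

The values passed through `args` are forwarded to the distribution's `cdf`, so for `norm` they are read as `loc` (mean) and `scale` (standard deviation).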

In [45]:
import math
import numpy as np
import pandas as pd
from scipy import stats

In [47]:
stats.kstest(df.age, cdf = 'norm', args=(24, 0.05))

KstestResult(statistic=1.0, pvalue=0.0)

---
*&#9829; By Quang Hung x Thuy Linh &#9829;*