# SciPy: Hypothesis Testing

## 1. Overview

### 1.1. Basic concepts
[Hypothesis testing](https://en.wikipedia.org/wiki/Statistical_hypothesis_testing) is a method of statistical inference that tests the validity of a claim about the population, using sample data. It makes use of the following concepts.

#### Hypotheses
- The [null hypothesis] (denoted $H_0$): a common view whose validity needs to be tested.
- The [alternative hypothesis] (denoted $H_1$): what will be believed if $H_0$ is rejected.

[null hypothesis]: https://en.wikipedia.org/wiki/Null_hypothesis
[alternative hypothesis]: https://en.wikipedia.org/wiki/Alternative_hypothesis

#### Significance level
[Significance level] (denoted $\alpha$) is a pre-selected number between $0$ and $1$ that sets the probability of rejecting the null hypothesis when it is actually true. Common values of $\alpha$ are $0.05$ and $0.01$. A related concept is the [confidence level] (denoted $\gamma=1-\alpha$). Each significance level corresponds to a critical value ($c$).

[Significance level]: https://en.wikipedia.org/wiki/Statistical_significance
[confidence level]: https://en.wikipedia.org/wiki/Confidence_interval

#### Test statistic
As one of the most important components of a test, the [test statistic](https://en.wikipedia.org/wiki/Test_statistic) (denoted $T$) is a value computed from the sample data that follows a known theoretical distribution under $H_0$. Since that distribution is known, the probability of observing such a value can be calculated, which tells how compatible the data are with $H_0$. Many test statistics are represented by a fraction where:
- The numerator is the signal (the observed effect)
- The denominator is the noise (the variability of the estimate)

[test statistic]: https://en.wikipedia.org/wiki/Test_statistic

#### p-value
The [p-value] is the probability, computed assuming $H_0$ is true, of obtaining a test statistic at least as extreme as the one observed. It is not the probability that $H_0$ is true. When the p-value is less than $\alpha$, $H_0$ is rejected; the probability of making a [type I error] (rejecting $H_0$ when it is true) is then at most $\alpha$. The smaller the p-value, the stronger the evidence against $H_0$.
- A p-value less than $0.05$ indicates the difference is significant at $\alpha=0.05$: data this extreme would occur less than $5\%$ of the time if $H_0$ were true. Therefore, $H_0$ is rejected in favor of $H_1$.
- A p-value higher than $0.05$ indicates the difference is not significant. In this case, we fail to reject $H_0$, which does not prove that $H_0$ is true.

[p-value]: https://en.wikipedia.org/wiki/P-value
[type I error]: https://en.wikipedia.org/wiki/Type_I_and_type_II_errors

In [None]:
from dsutil import np, stats

In [36]:
mu, std = 0, 1
dist = stats.norm(mu, std)

In [28]:
# compute critical value (c) given a significance level (alpha)
# 2-tailed test
alpha = 0.05
crit = dist.isf(alpha/2)
crit

1.9599639845400545

In [35]:
# compute p-value (p) given a test statistic (t)
# 2-tailed test
test = 1.96
test = np.abs(test)
pval = dist.cdf(-test) + dist.sf(test)
pval

0.04999579029644087
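
The same frozen distribution also gives one-tailed p-values. The following is a minimal sketch that imports `scipy.stats` directly rather than through `dsutil`; the helper name `p_value` and its `tail` argument are assumptions made for this illustration.

```python
from scipy import stats

def p_value(t, tail="two", dist=stats.norm(0, 1)):
    """p-value of a test statistic t under a symmetric null distribution."""
    if tail == "left":        # H1: parameter is smaller than the hypothesized value
        return dist.cdf(t)
    if tail == "right":       # H1: parameter is larger than the hypothesized value
        return dist.sf(t)
    return 2 * dist.sf(abs(t))  # two-tailed: both tails beyond |t|

print(p_value(1.96, "two"))
print(p_value(1.96, "right"))
print(p_value(-1.96, "left"))
```

For a symmetric distribution, the two-tailed p-value is twice the one-tailed value of $|T|$.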

#### Descriptive statistics
For populations:
- $N$: population size
- $\mu$: population mean
- $\sigma$: population standard deviation
- $\sigma^2$: population variance
- $p$: proportion of successes in population

For samples:
- $n$: sample size
- $\hat\mu$ or $\bar x$: sample mean
- $\hat\sigma$ or $\text{SD}$: sample standard deviation
- $\hat\sigma^2$ or $s^2$: sample variance
- $\hat p$: proportion of successes in sample
- $\text{SE}_{\mu}$: standard error of mean
- $\text{SE}_p$: standard error of proportion

### 1.2. Hypothesis testing summary
Type  |Usage
:----------|:--------------
Z-test|Comparing population means or proportions
F-test|Comparing the variances of 2 populations
t-test|Comparing the means of 1 or 2 populations
Chi-squared test|Testing the relationship between qualitative variables
ANOVA|Comparing the means of 3 or more populations
KS test|Testing of distribution

## 2. Z-test
The usage of [Z-test]:
- Comparing the mean of a population with a specific number or comparing the means of two populations
- Comparing the proportion of a population with a specific number or comparing the proportions of two populations

Assumptions:
- Populations are normally distributed
- Samples are random and must have more than 30 observations
- Population variances are already known (only in mean Z-test)

[Z-test]: https://en.wikipedia.org/wiki/Z-test

In [1]:
from dsutil import np, pd
from dsutil import ZTestMean, ZTestProportion

In [2]:
columns = ['age', 'gender', 'income', 'age_group', 'degree', 'area', 'id', 'job']
df = pd.read_csv('../data/hypothesis.csv', usecols=columns)
df.head()

Unnamed: 0,id,area,gender,age,age_group,degree,job,income
0,1,central,female,69,middle,master,manager,33250
1,2,southern,female,50,middle,highshool,officer,6960
2,3,northern,male,73,elder,highshool,officer,11100
3,4,northern,female,73,elder,highshool,officer,11100
4,5,central,male,69,middle,master,officer,16140


### 2.1. One-sample mean
&#9800;&nbsp;<b>Practice</b><br>
Given a random sample of size $n=500$ of people's income from a population with standard deviation $\sigma=5000$, and the significance level $\alpha=0.05$, can we conclude that the population mean is $\mu=A=14000$?

First, state the hypotheses from the information:
- $H_0: \mu = 14000$
- $H_1: \mu \neq 14000$

Since it is a two-tailed test, the critical value is $z_{\alpha/2}=z_{0.025} = 1.96$. If $|T|>1.96$, reject $H_0$ and accept $H_1$. However, in this example, $|T|=0.63$, which gives a one-tailed probability of $0.2643$ and a two-tailed p-value of about $0.529$, so $H_0$ cannot be rejected. The formula for the test statistic is:

$$T = \frac{\hat{\mu}-A}{\text{SE}_{\mu}}\quad\text{for } \text{SE}_\mu = \sqrt{\frac{\sigma^2}{n}}$$
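
The formula can be evaluated by hand from summary statistics. This is a minimal sketch using only the Python standard library; the function name `z_test_mean` is hypothetical, and the sample mean of $14141$ is an assumed value chosen for illustration, since the practice statement does not give one.

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(sample_mean, sigma, n, A):
    """Two-tailed one-sample Z-test for a mean with known population sigma."""
    se = sqrt(sigma**2 / n)                 # standard error of the mean
    t = (sample_mean - A) / se              # test statistic
    p = 2 * (1 - NormalDist().cdf(abs(t)))  # two-tailed p-value
    return t, p

# sample_mean=14141 is a hypothetical value, not taken from the data set
t, p = z_test_mean(sample_mean=14141, sigma=5000, n=500, A=14000)
print(f"T = {t:.2f}, p-value = {p:.4f}")
```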

In [3]:
test = ZTestMean(df['age'], var1=140, const=59)
test.conduct()

H1: mean != 59
p-value = 0.1384 > 0.05
Conclusion: fail to reject H0


### 2.2. Two-sample mean
&#9800;&nbsp;<b>Practice</b><br>
The average income of males is $5000$ higher than that of females, true or false? Given $\alpha = 0.05$, the population standard deviations of income for males and females are $\sigma_1=7000$ and $\sigma_2=5000$, respectively.

The hypotheses:
- $H_0: \mu_1 = \mu_2+5000$
- $H_1: \mu_1 > \mu_2+5000$

This is a right-tailed test, so $z_{\alpha}=z_{0.05} = 1.64$ is used. If $T>1.64$, reject $H_0$ and conclude that the average income of males exceeds that of females by more than $5000$. In this example, $T=2.57$ and the corresponding p-value is $0.0051$. The formula for the test statistic is:

$$T=\frac{(\hat{\mu}_1-\hat{\mu}_2)-A}{\text{SE}_{\mu}}
\quad\text{for }\text{SE}_\mu=\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}$$
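
A right-tailed version of this statistic can be sketched with the standard library alone. The function name `z_test_two_means`, the sample means, and the sample sizes below are all assumptions for illustration; only the standard deviations and $A=5000$ come from the practice statement.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_means(mean1, mean2, sigma1, sigma2, n1, n2, A=0.0):
    """Right-tailed two-sample Z-test: H1 is mu1 - mu2 > A."""
    se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)  # standard error of the difference
    t = ((mean1 - mean2) - A) / se              # test statistic
    p = 1 - NormalDist().cdf(t)                 # right-tail p-value
    return t, p

# hypothetical sample means (16500, 10000) and sample sizes (250 each)
t, p = z_test_two_means(16500, 10000, 7000, 5000, n1=250, n2=250, A=5000)
print(f"T = {t:.2f}, p-value = {p:.4f}")
```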

In [4]:
x1 = df.query("gender=='male'")['income']
x2 = df.query("gender=='female'")['income']

test = ZTestMean(x1=x1, x2=x2, var1=7000**2, var2=5000**2, const=5000, alternative='2s')
test.conduct()

H1: mean1 - mean2 != 5000
p-value = 0.0101 < 0.05
Conclusion: reject H0


### 2.3. One-sample proportion
&#9800;&nbsp;<b>Practice</b><br>
In a large consignment of food packets, a random sample of $n=100$ packets revealed that 5 packets were leaking. Can we conclude that the population contains at least $A=10\%$ leaking packets at $\alpha=0.05$?

The hypotheses:
- $H_0: p\geq0.1$
- $H_1: p<0.1$

This is a left-tailed test, so $H_0$ will be rejected if $T<-z_{0.05}=-1.64$. For $T=-2.294$, the corresponding p-value is $0.011$ ($<0.05$). The formula for the test statistic is:

$$T = \frac{\hat{p}-A}{\text{SE}_p}
\quad\text{for }\text{SE}_p=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
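
The numbers quoted above can be reproduced from the formula. This is a minimal standard-library sketch; the function name `z_test_proportion` is hypothetical, and the standard error uses $\hat p$ exactly as in the formula above.

```python
from math import sqrt
from statistics import NormalDist

def z_test_proportion(p_hat, n, A):
    """Left-tailed one-sample Z-test for a proportion: H1 is p < A."""
    se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
    t = (p_hat - A) / se                # test statistic
    p = NormalDist().cdf(t)             # left-tail p-value
    return t, p

t, p = z_test_proportion(p_hat=5 / 100, n=100, A=0.1)
print(f"T = {t:.3f}, p-value = {p:.3f}")
```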

In [5]:
test = ZTestProportion(p1=5/100, n1=100, const=0.1)
test.conduct()

H1: p != 0.1
p-value = 0.0218 < 0.05
Conclusion: reject H0


### 2.4. Two-sample proportion
&#9800;&nbsp;<b>Practice</b><br>
A machine turns out 16 imperfect articles in a sample of $n_1=500$. After maintenance, it turns out 3 imperfect articles in a sample of $n_2=100$. Has the machine improved after maintenance at the significance level of $\alpha=0.05$?

The hypotheses:
- $H_0: p_1=p_2$
- $H_1: p_1>p_2$

If $T>z_{0.05}=1.64$, reject $H_0$. The formula for the test statistic is:

$$T = \frac{(\hat{p}_1-\hat{p}_2)-A}{\text{SE}_p}
\quad\text{for }\text{SE}_p=\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$
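
As a rough check, the statistic for the machine example can be computed directly. This sketch uses only the standard library; the function name `z_test_two_proportions` is hypothetical, and the unpooled standard error follows the formula above.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_proportions(p1, n1, p2, n2, A=0.0):
    """Right-tailed two-sample Z-test for proportions: H1 is p1 - p2 > A."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # unpooled standard error
    t = ((p1 - p2) - A) / se                            # test statistic
    p = 1 - NormalDist().cdf(t)                         # right-tail p-value
    return t, p

t, p = z_test_two_proportions(16 / 500, 500, 3 / 100, 100)
print(f"T = {t:.4f}, one-tailed p = {p:.4f}, two-tailed p = {2 * p:.4f}")
```

Since $T$ is far below $1.64$, $H_0$ is not rejected under either the one-tailed or the two-tailed alternative.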

In [6]:
test = ZTestProportion(p1=16/500, n1=500, p2=3/100, n2=100, const=0)
test.conduct()

H1: p1 - p2 != 0
p-value = 0.9152 > 0.05
Conclusion: fail to reject H0


## 3. F-test
The usage of [F-test]:
- Comparing the variances of two populations
- Being used in one-way ANOVA to compare the means between groups (section 6.1)
- Being used in multiple linear regression to test the significance of R-squared

Assumptions:
- Populations are normally distributed
- The two random samples are independent

[F-test]: https://en.wikipedia.org/wiki/F-test

In [1]:
from dsutil import np, pd, stats
from dsutil import FTest

In [2]:
columns = ['age', 'gender', 'income', 'age_group', 'degree', 'area', 'id', 'job']
df = pd.read_csv('../data/hypothesis.csv', usecols=columns)
df.head()

Unnamed: 0,id,area,gender,age,age_group,degree,job,income
0,1,central,female,69,middle,master,manager,33250
1,2,southern,female,50,middle,highshool,officer,6960
2,3,northern,male,73,elder,highshool,officer,11100
3,4,northern,female,73,elder,highshool,officer,11100
4,5,central,male,69,middle,master,officer,16140


&#9800;&nbsp;<b>Practice</b><br>
With the significance level $\alpha=0.05$, test whether the population variance of male income is more than $A=5$ times that of female income.

The hypotheses:
- $H_0: \sigma^2_1 = 5\sigma^2_2$
- $H_1: \sigma^2_1 > 5\sigma^2_2$

If p-value $<0.05$: reject $H_0$. The formula for the test statistic is:

$$T = \frac{1}{A}\frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2}$$
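
Under $H_0$, this statistic follows an F-distribution with $n_1-1$ and $n_2-1$ degrees of freedom. The sketch below assumes that setup and uses synthetic data; the function name `f_test` is hypothetical and is not the `FTest` class from `dsutil`.

```python
import numpy as np
from scipy import stats

def f_test(x1, x2, A=1.0):
    """Right-tailed F-test: H1 is var1 / var2 > A."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    t = np.var(x1, ddof=1) / np.var(x2, ddof=1) / A  # test statistic
    dfn, dfd = len(x1) - 1, len(x2) - 1              # degrees of freedom
    p = stats.f.sf(t, dfn, dfd)                      # right-tail p-value
    return t, p

rng = np.random.default_rng(0)
x1 = rng.normal(0, 3, size=200)  # synthetic sample, standard deviation 3
x2 = rng.normal(0, 1, size=200)  # synthetic sample, standard deviation 1
t, p = f_test(x1, x2, A=5)
print(t, p)
```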

In [7]:
x1 = df.query("gender == 'male'")['income']
x2 = df.query("gender == 'female'")['income']

FTest(x1, x2, const=5, alternative='larger').conduct()

H1: var1 / var2 > 5
p-value = 0.0299 < 0.05
Conclusion: reject H0


## 4. t-test
The [t-test] is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis.

[t-test]: https://en.wikipedia.org/wiki/Student%27s_t-test

In [1]:
from dsutil import np, pd, stats
from dsutil import TTestOneSample, TTestIndSample, TTestPairedSample

In [2]:
columns = ['age', 'gender', 'income', 'age_group', 'degree', 'area', 'id', 'job']
df = pd.read_csv('../data/hypothesis.csv', usecols=columns)
df.head()

Unnamed: 0,id,area,gender,age,age_group,degree,job,income
0,1,central,female,69,middle,master,manager,33250
1,2,southern,female,50,middle,highshool,officer,6960
2,3,northern,male,73,elder,highshool,officer,11100
3,4,northern,female,73,elder,highshool,officer,11100
4,5,central,male,69,middle,master,officer,16140


### 4.1. One-sample
The usage of [one-sample t-test]: To compare the mean of a population with a number, when the population variance is unknown.

Assumptions:
- The population is normally distributed
- The sample is random

[one-sample t-test]: https://en.wikipedia.org/wiki/Student%27s_t-test#One-sample_t-test

&#9800;&nbsp;<b>Practice</b><br>
With the significance level of $\alpha=0.05$, is the mean income equal to $13000$ or not?

The hypotheses:
- $H_0: \mu=13000$
- $H_1: \mu\neq13000$

The formula for the test statistic is:

$$T = \frac{\hat{\mu}-A}{\hat\sigma/\sqrt{n}}$$
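
The formula can be checked against SciPy's built-in `stats.ttest_1samp()`. This is a minimal sketch on a small hypothetical sample (the values in `x` and the constant `A` are made up for illustration).

```python
import numpy as np
from scipy import stats

x = np.array([12.1, 13.4, 12.9, 14.2, 13.8, 12.5, 13.1, 14.0])  # hypothetical sample
A = 13.0                                                         # hypothesized mean

n = len(x)
t_manual = (x.mean() - A) / (x.std(ddof=1) / np.sqrt(n))  # formula above
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)        # two-tailed, n-1 degrees of freedom

res = stats.ttest_1samp(x, popmean=A)
print(t_manual, p_manual)
print(res.statistic, res.pvalue)
```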

In [3]:
test = TTestOneSample(df['income'], 13000)
test.conduct()

H1: mean != 13000
p-value = 0.0074 < 0.05
Conclusion: reject H0


### 4.2. Independent two-sample
The usage of the [independent two-sample t-test]: to compare the means of two populations using their independent samples. An F-test should be used first to check the equality of the two population variances.

Assumptions:
- Two populations are normally distributed
- Two samples are independent and random
- Two variances are equal (otherwise, Welch's version of the statistic is used)

[independent two-sample t-test]: https://en.wikipedia.org/wiki/Student%27s_t-test#Independent_two-sample_t-test

&#9800;&nbsp;<b>Practice</b><br>
With $\alpha=0.05$, the average income of male and female are equal, true or false?

The hypotheses:
- $H_0: \mu_1 = \mu_2$
- $H_1: \mu_1 \neq \mu_2$

If $\sigma_1^2 \neq \sigma_2^2$ (the case in this example, as tested in section 3), the formula for the test statistic (Welch's t-test) is:

$$T = \frac{\hat{\mu}_1-\hat{\mu}_2-A}{\sqrt{\dfrac{\hat{\sigma}_1^2}{n_1}+\dfrac{\hat{\sigma}_2^2}{n_2}}}$$

If $\sigma_1^2 = \sigma_2^2$, the test statistic is:

$$T = \frac{\hat{\mu}_1-\hat{\mu}_2-A}{\hat\sigma_p \sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}}$$

where

$$\hat\sigma_p = \sqrt{\frac{(n_1-1)\hat{\sigma}_1^2 + (n_2-1)\hat{\sigma}_2^2}{n_1+n_2-2}}$$

is the pooled standard deviation of the two samples.
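
Both statistics can be compared with `stats.ttest_ind()`, whose `equal_var` argument switches between the pooled and the Welch form. This is a sketch on two small hypothetical samples with $A=0$.

```python
import numpy as np
from scipy import stats

x1 = np.array([15.2, 14.8, 16.1, 15.5, 17.0, 14.9, 16.4])  # hypothetical sample 1
x2 = np.array([13.9, 14.2, 13.1, 15.0, 12.8, 14.5])        # hypothetical sample 2

n1, n2 = len(x1), len(x2)
v1, v2 = x1.var(ddof=1), x2.var(ddof=1)  # sample variances
diff = x1.mean() - x2.mean()

# unequal variances (Welch)
t_welch = diff / np.sqrt(v1 / n1 + v2 / n2)

# equal variances (pooled standard deviation)
sp = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
t_pooled = diff / (sp * np.sqrt(1 / n1 + 1 / n2))

print(t_welch, stats.ttest_ind(x1, x2, equal_var=False).statistic)
print(t_pooled, stats.ttest_ind(x1, x2, equal_var=True).statistic)
```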

In [3]:
x1 = df[df['gender']=='male'].income
x2 = df[df['gender']=='female'].income

In [4]:
test = TTestIndSample(x1, x2)
test.conduct()

H1: mean1 - mean2 != 0
p-value = 0.0000 < 0.05
Conclusion: reject H0


### 4.3. Dependent paired samples
The usage of the [dependent paired samples t-test]: to compare two population means, given their dependent samples. A paired samples t-test calculates the difference between paired observations and then performs a one-sample t-test on those differences.

Assumptions:
- The two populations should be both normally distributed
- The two random samples come in pairs (before and after data for example)
- Same sample sizes

[dependent paired samples t-test]: https://en.wikipedia.org/wiki/Student%27s_t-test#Dependent_t-test_for_paired_samples

In [5]:
x1 = np.array([72,77,84,79,74,67,74,77,79,89])
x2 = np.array([65,68,77,73,66,61,66,71,71,78])

&#9800;&nbsp;<b>Practice</b><br>
With $\alpha=0.05$, the average weight after is at least 8 kg less than before, true or false?

The hypotheses:
- $H_0: \mu_1-\mu_2\geq8$
- $H_1: \mu_1-\mu_2<8$

The test statistic is:

$$T = \frac{\hat\mu_1-\hat\mu_2-A}{\hat\sigma_d/\sqrt n} = \frac{\hat\mu_d-A}{\hat\sigma_d/\sqrt n}$$
where
- $\hat\mu_d$ is the sample mean of the differences
- $\hat\sigma_d$ is the sample standard deviation of the differences
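
Because the paired test is a one-sample t-test on the differences, the left-tailed hypothesis above can be sketched with `stats.ttest_1samp()`. The example assumes that `x1` holds the "before" weights and `x2` the "after" weights, and that SciPy is recent enough (1.6 or later) to accept the `alternative` argument.

```python
import numpy as np
from scipy import stats

x1 = np.array([72, 77, 84, 79, 74, 67, 74, 77, 79, 89])  # assumed: weights before
x2 = np.array([65, 68, 77, 73, 66, 61, 66, 71, 71, 78])  # assumed: weights after
A = 8

d = x1 - x2                                                     # paired differences
t_manual = (d.mean() - A) / (d.std(ddof=1) / np.sqrt(len(d)))   # formula above
res = stats.ttest_1samp(d, popmean=A, alternative="less")       # H1: mean difference < 8

print(t_manual)
print(res.statistic, res.pvalue)
```

The mean difference is $7.6$ kg and $T\approx-0.80$; the left-tailed p-value is above $0.05$, so $H_0$ is not rejected.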

In [7]:
test = TTestPairedSample(x1, x2)
test.conduct()

H1: mean1 - mean2 != 0
p-value = 0.0000 < 0.05
Conclusion: reject H0


## 5. Chi-squared test
The usage of [chi-square test]:
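- Comparing the proportions of two or more populations
- Independence testing between qualitative variables

Assumptions:
- Observations are independent and samples are random
- Expected counts in each cell are sufficiently large (a common rule of thumb is at least 5)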

[chi-square test]: https://en.wikipedia.org/wiki/Chi-squared_test

In [10]:
import numpy as np
import pandas as pd
from scipy import stats
import pingouin as pg
import warnings
warnings.filterwarnings("ignore")

In [4]:
columns = ['age', 'gender', 'income', 'age_group', 'degree', 'area', 'id', 'job']
df = pd.read_csv('../data/hypothesis.csv', usecols=columns)
df.head()

Unnamed: 0,id,area,gender,age,age_group,degree,job,income
0,1,central,female,69,middle,master,manager,33250
1,2,southern,female,50,middle,highshool,officer,6960
2,3,northern,male,73,elder,highshool,officer,11100
3,4,northern,female,73,elder,highshool,officer,11100
4,5,central,male,69,middle,master,officer,16140


### 5.1. Dependence test
&#9800;&nbsp;<b>Practice</b><br>
Is there a relationship between <code style="font-size:13px">age_group</code> and <code style="font-size:13px">degree</code>?

The hypotheses:
- $H_0:$ The two variables are independent
- $H_1:$ The two variables are dependent

<code style="font-size:13px">age_group</code> and <code style="font-size:13px">degree</code> are considered dependent ($H_0$ is rejected) if p-value $<0.05$. The p-value alone does not measure how strong the relationship is.

In [6]:
table = pd.crosstab(df.age_group,df.degree)

In [7]:
chi, pvalue, dof, _ = stats.chi2_contingency(table)
print("chi stats:", chi)
print('p-value:', pvalue)

chi stats: 84.7456748205155
p-value: 1.717625046407707e-17
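
To show what `stats.chi2_contingency()` computes, the sketch below rebuilds the statistic by hand: expected counts come from the row and column totals, and the statistic sums the squared deviations relative to them. The contingency table here is hypothetical, not the one built from `df`.

```python
import numpy as np
from scipy import stats

observed = np.array([[20, 30, 50],
                     [30, 30, 40]])  # hypothetical counts

row = observed.sum(axis=1, keepdims=True)  # row totals
col = observed.sum(axis=0, keepdims=True)  # column totals
expected = row * col / observed.sum()      # expected counts under independence

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p = stats.chi2.sf(chi2, dof)               # right-tail p-value

res = stats.chi2_contingency(observed, correction=False)
print(chi2, p, dof)
print(res[0], res[1], res[2])
```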


In [11]:
expected, observed, summary = pg.chi2_independence(data=df, x='age_group', y='degree')

In [9]:
expected

degree,bachelor,highshool,master
age_group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
elder,25.542,50.886,22.572
middle,91.074,181.442,80.484
youth,12.384,24.672,10.944


In [33]:
summary

Unnamed: 0,test,lambda,chi2,dof,pval,cramer,power
0,pearson,1.0,84.745675,4.0,1.7176250000000003e-17,0.291111,0.999915
1,cressie-read,0.666667,89.164397,4.0,1.981493e-18,0.298604,0.999956
2,log-likelihood,0.0,108.047053,4.0,1.89858e-22,0.328705,0.999998
3,freeman-tukey,-0.5,,4.0,,,
4,mod-log-likelihood,-1.0,inf,4.0,0.0,inf,1.0
5,neyman,-2.0,,4.0,,,


### 5.2. Proportion test
In R, the <code style="font-size:13px">prop.test()</code> function uses the chi-squared test with Yates's continuity correction. For a 2x2 contingency table, Pearson's chi-squared statistic is biased upwards, meaning it tends to be larger than it should be, so Yates's correction adds an adjustment term to the formula of the chi-squared statistic. However, Yates's correction is often considered too strict (overly conservative) for making decisions on data, so it is not used here. Instead, the <code style="font-size:13px">stats.chisquare()</code> function can be used for the proportion chi-squared test.
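
The effect of the correction can be seen with the `correction` argument of `stats.chi2_contingency()`, which applies only to 2x2 tables. This is a minimal sketch on a hypothetical table.

```python
import numpy as np
from scipy import stats

table = np.array([[30, 70],
                  [45, 55]])  # hypothetical 2x2 counts

res_yates = stats.chi2_contingency(table, correction=True)   # with Yates's correction
res_plain = stats.chi2_contingency(table, correction=False)  # plain Pearson statistic

print("Yates:  ", res_yates[0], res_yates[1])
print("Pearson:", res_plain[0], res_plain[1])
```

The corrected statistic is smaller and its p-value larger, which is why the correction is described as conservative.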

&#9800;&nbsp;<b>Practice</b><br>
The number of officers is equal to the number of salespersons, and each is 5 times the number of managers, true or false?

- $H_0: p_1=1/11, p_2=p_3=5/11$
- $H_1$: At least one of these equalities does not hold.

In [34]:
df_chi = df.groupby('job').count()[['id']].reset_index()

df_chi['obs'] = df_chi.id/len(df)
df_chi['exp'] = [1/11,5/11,5/11]

In [35]:
df_chi

Unnamed: 0,job,id,obs,exp
0,manager,44,0.088,0.090909
1,officer,239,0.478,0.454545
2,sale,217,0.434,0.454545


In [36]:
# chisquare() expects observed and expected counts, not proportions
stats.chisquare(df_chi.id, df_chi.exp * len(df))

Power_divergenceResult(statistic=1.116, pvalue=0.5724)

## 6. Other tests

### 6.1. One-way ANOVA
[ANOVA] (Analysis of Variance) is a collection of statistical tests that analyze the differences among the means of two or more groups. The means are calculated from a quantitative variable; the groups are determined by a qualitative variable.

[ANOVA]: https://en.wikipedia.org/wiki/Analysis_of_variance

Usage: Compare multiple population means when there is a categorical variable containing at least three categories.

Assumptions:
- Populations are normally distributed
- Samples are random
- Homogeneity of variances

The workflow:
1. Test the homogeneity of variances, using one of the following tests:
    - [Bartlett's test], implemented via 
<code style="font-size:13px">[stats.bartlett()]</code>
    - [Levene's test], implemented via
<code style="font-size:13px">[stats.levene()]</code>
    - Fligner-Killeen test, implemented via
<code style="font-size:13px">[stats.fligner()]</code>
2. Test the equality of population means:
    - If the variances are equal, use
<code style="font-size:13px">[stats.f_oneway()]</code>
or
<code style="font-size:13px">[pg.anova()]</code>
    - If the variances are not equal, use
<code style="font-size:13px">[pg.welch_anova()]</code>
3. Post-hoc test to compare pairwise population means:
    - If the variances are equal, use [Tukey's HSD test], implemented via
<code style="font-size:13px">[pg.pairwise_tukey()]</code>
    - If the variances are not equal, use Games-Howell test, implemented via
<code style="font-size:13px">[pg.pairwise_gameshowell()]</code>
    
[Bartlett's test]: https://en.wikipedia.org/wiki/Bartlett%27s_test
[Levene's test]: https://en.wikipedia.org/wiki/Levene%27s_test
[Tukey's HSD test]: https://en.wikipedia.org/wiki/Tukey%27s_range_test

[stats.bartlett()]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bartlett.html
[stats.levene()]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.levene.html
[stats.fligner()]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fligner.html
[stats.f_oneway()]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html
[pg.anova()]: https://pingouin-stats.org/build/html/generated/pingouin.anova.html
[pg.welch_anova()]: https://pingouin-stats.org/build/html/generated/pingouin.welch_anova.html
[pg.pairwise_tukey()]: https://pingouin-stats.org/build/html/generated/pingouin.pairwise_tukey.html
[pg.pairwise_gameshowell()]: https://pingouin-stats.org/build/html/generated/pingouin.pairwise_gameshowell.html

In [37]:
import numpy as np
import pandas as pd
from scipy import stats
import pingouin as pg

In [4]:
columns = ['age', 'gender', 'income', 'age_group', 'degree', 'area', 'id', 'job']
df = pd.read_csv('../data/hypothesis.csv', usecols=columns)
df.head()

Unnamed: 0,id,area,gender,age,age_group,degree,job,income
0,1,central,female,69,middle,master,manager,33250
1,2,southern,female,50,middle,highshool,officer,6960
2,3,northern,male,73,elder,highshool,officer,11100
3,4,northern,female,73,elder,highshool,officer,11100
4,5,central,male,69,middle,master,officer,16140


&#9800;&nbsp;<b>Practice</b><br>
Is there a difference in average income between 3 areas (Central, Southern and Northern)? If there is, which group differs from the others?

*Step 1*: Check the equality of population variances. If p-value $<0.05$, then reject $H_0$. The hypotheses:
- $H_0: \sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2$
- $H_1$: There exists at least one pair $\sigma_i^2 \neq \sigma_j^2$ where $i \neq j$

In [39]:
central = df[df['area'] =='central']['income']
northern = df[df['area'] =='northern']['income']
southern = df[df['area'] =='southern']['income']

In [40]:
stats.bartlett(central, northern, southern)

BartlettResult(statistic=865.4773139975722, pvalue=1.1587484312813138e-188)

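*Step 2*: Test whether the population means are equal or not. If p-value $<0.05$, then reject $H_0$. Since Bartlett's test rejects the equality of variances for this data, Welch's ANOVA is the appropriate test; the standard ANOVA is shown for comparison. The hypotheses:
- $H_0$: $\mu_1 = \mu_2 = \dots = \mu_k$
- $H_1$: There is at least one pair $\mu_i \neq \mu_j$ where $i \neq j$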

In [41]:
stats.f_oneway(central, northern, southern)

F_onewayResult(statistic=222.22364903952098, pvalue=1.1382909390301103e-69)

In [42]:
pg.anova(data=df, dv='income', between='area')

Unnamed: 0,Source,ddof1,ddof2,F,p-unc,np2
0,area,2,497,222.223649,1.1382910000000001e-69,0.472089


In [43]:
pg.welch_anova(data=df, dv='income', between='area')

Unnamed: 0,Source,ddof1,ddof2,F,p-unc,np2
0,area,2,283.472665,441.890825,7.605951e-88,0.472089


*Step 3*: Post-hoc test to compare pairwise means. Any pair having p-value $<0.05$ can be considered significantly different in mean. With unequal variances, the Games-Howell results are the ones to rely on.

In [44]:
pg.pairwise_tukey(data=df, dv='income', between='area')

Unnamed: 0,A,B,mean(A),mean(B),diff,se,T,p-tukey,hedges
0,central,northern,19586.096618,10750.421622,8835.674997,526.24002,16.7902,0.001,1.695475
1,central,southern,19586.096618,8209.814815,11376.281804,617.404943,18.425965,0.001,2.181955
2,northern,southern,10750.421622,8209.814815,2540.606807,629.865678,4.033569,0.001,0.487196


In [45]:
pg.pairwise_gameshowell(data=df, dv='income', between='area')

Unnamed: 0,A,B,mean(A),mean(B),diff,se,T,df,pval,hedges
0,central,northern,19586.096618,10750.421622,8835.674997,560.507528,15.763704,211.211292,0.001,1.591819
1,central,southern,19586.096618,8209.814815,11376.281804,563.372921,20.193164,215.382039,0.001,2.391222
2,northern,southern,10750.421622,8209.814815,2540.606807,105.242773,24.14044,218.527723,0.001,2.915812


### 6.2. Distribution test
The [Kolmogorov-Smirnov test] (KS test) is used to test whether a random variable follows a specific distribution or not. The test statistic is the largest absolute difference between the empirical CDF of the observed variable and the CDF of the reference distribution. In SciPy, the
<code style="font-size:13px">[stats.kstest()]</code>
function performs the Kolmogorov-Smirnov test and supports any [SciPy distribution].

[Kolmogorov-Smirnov test]: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
[SciPy distribution]: https://docs.scipy.org/doc/scipy/reference/stats.html#probability-distributions
[stats.kstest()]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html
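
The KS statistic can be reproduced by comparing the empirical CDF with the reference CDF at every sorted observation. The sketch below does this on synthetic normal data; the sample and its parameters (mean 50, standard deviation 10) are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = np.sort(rng.normal(loc=50, scale=10, size=300))  # synthetic sample, sorted

n = len(x)
cdf = stats.norm.cdf(x, loc=50, scale=10)            # reference CDF at each point
d_plus = np.max(np.arange(1, n + 1) / n - cdf)       # ECDF above the reference CDF
d_minus = np.max(cdf - np.arange(0, n) / n)          # ECDF below the reference CDF
d_manual = max(d_plus, d_minus)                      # KS statistic

res = stats.kstest(x, "norm", args=(50, 10))
print(d_manual)
print(res.statistic, res.pvalue)
```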

In [46]:
import numpy as np
import pandas as pd
from scipy import stats

In [4]:
columns = ['age', 'gender', 'income', 'age_group', 'degree', 'area', 'id', 'job']
df = pd.read_csv('../data/hypothesis.csv', usecols=columns)
df.head()

Unnamed: 0,id,area,gender,age,age_group,degree,job,income
0,1,central,female,69,middle,master,manager,33250
1,2,southern,female,50,middle,highshool,officer,6960
2,3,northern,male,73,elder,highshool,officer,11100
3,4,northern,female,73,elder,highshool,officer,11100
4,5,central,male,69,middle,master,officer,16140


In [48]:
stats.kstest(df.age, cdf = 'norm', args=(24, 0.05))

KstestResult(statistic=1.0, pvalue=0.0)

## Resources
- *ethanweed.github.io - [Learning Statistics with Python](https://ethanweed.github.io/pythonbook/landingpage.html)*

---
*&#9829; By Quang Hung x Thuy Linh &#9829;*