# Analysis and Visualization of Complex Agro-Environmental Data
---
## Hypothesis testing

Most hypothesis testing functions in Python are provided by the `stats` submodule of SciPy. Other packages, such as statsmodels and scikit-posthocs, produce more detailed outputs and offer additional functions relevant to hypothesis testing.

##### Import modules:

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as sts
import statsmodels.stats as stm
import scikit_posthocs as sp
import seaborn as sns
import matplotlib.pyplot as plt


##### Simulate populations (N = 100000)

In [None]:
# seed the random number generator
np.random.seed(24)
# generate univariate observations
pop1 = np.random.normal(50,20,100000)
pop2 = np.random.normal(70,25,100000)
pop3 = np.random.exponential(50, 100000)
pop4 = np.random.exponential(100, 100000)

### Parametric one-sample tests
#### One-sample t-test

1. Define H0 : The population mean is 40

2. Take a sample from the population (pop1)

In [None]:
# Take random samples from data (n=30)
import random
sample1 = random.sample(list(pop1), 30)
sns.histplot(sample1)
plt.show()

3. Compute the statistic and check *p-value*

In [None]:
# perform one-sample t-test
# H0: The population mean is 40
stat, p = sts.ttest_1samp(a=sample1, popmean=40)
print('t-stat=%.3f, p-value=%.3f' % (stat, p))
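
As a cross-check, the statistic returned by `ttest_1samp` can be reproduced by hand from t = (mean - mu0) / (s / sqrt(n)). The following is a minimal sketch: the sample is a fresh illustrative draw, and the names `rng`, `sample` and `mu0` are assumptions that are not part of the cells above.

```python
import numpy as np
import scipy.stats as sts

rng = np.random.default_rng(24)           # illustrative seed (assumption)
sample = rng.normal(50, 20, 30)           # stand-in for sample1
mu0 = 40                                  # population mean under H0

n = len(sample)
se = sample.std(ddof=1) / np.sqrt(n)      # standard error of the mean
t_manual = (sample.mean() - mu0) / se
# two-tailed p-value from Student's t distribution with n-1 degrees of freedom
p_manual = 2 * sts.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = sts.ttest_1samp(a=sample, popmean=mu0)
print('manual: t-stat=%.3f, p-value=%.3f' % (t_manual, p_manual))
print('scipy : t-stat=%.3f, p-value=%.3f' % (t_scipy, p_scipy))
```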

### Parametric two-sample tests
#### Two-sample *t* test (two-tailed)

1. Define H0 : The samples are drawn from populations with equal means

2. Take samples from the populations (pop1 and pop2)

In [None]:
# Take random samples from data (n=30)
import random
random.seed(123)  # seed Python's random module, which random.sample uses
sample1 = random.sample(list(pop1), 30)
sample2 = random.sample(list(pop2), 30)

3. Check assumptions: outliers, overall normality, homogeneity of variances

In [None]:
# outliers
sns.stripplot(sample1, label="sample 1")
sns.stripplot(sample2, label="sample 2")
plt.legend()
plt.show()

In [None]:
# normality
sns.kdeplot(sample1, label="sample 1")
sns.kdeplot(sample2, label="sample 2")
plt.legend(frameon=False)
plt.show()

In [None]:
# Homogeneity of variances
# Levene's test - tests the null hypothesis that the population variances are equal
stat, p = sts.levene(sample1, sample2, center='median')
print('Statistics=%.3f, p=%.3f' % (stat, p)) # print outputs
alpha = 0.05
if p > alpha:
    print('fail to reject H0. Rejecting H0 has an error probability >0.05')
else:
    print('reject H0 with an error probability <0.05')

4. Compute the t-statistic and check *p-value*

In [None]:
# t-test - tests the null hypothesis that samples 1 and 2 are drawn from populations with the same mean
stat, p = sts.ttest_ind(sample1, sample2)
print('Statistics=%.3f, p=%.3f' % (stat, p)) # print outputs
alpha = 0.05
if p > alpha:
    print('fail to reject H0. Rejecting H0 has an error probability >0.05')
else:
    print('reject H0 with an error probability <0.05')
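
If Levene's test suggests that the variances differ, `ttest_ind` offers Welch's variant through `equal_var=False`. A small sketch on illustrative data follows; the arrays `a` and `b` are assumptions that only mimic samples from pop1 and pop2.

```python
import numpy as np
import scipy.stats as sts

rng = np.random.default_rng(123)          # illustrative seed (assumption)
a = rng.normal(50, 20, 30)                # stand-in for sample1
b = rng.normal(70, 25, 30)                # stand-in for sample2

# Student's t-test: assumes equal population variances (the default)
stat_student, p_student = sts.ttest_ind(a, b)
# Welch's t-test: does not assume equal variances
stat_welch, p_welch = sts.ttest_ind(a, b, equal_var=False)

# with equal sample sizes both statistics coincide;
# only the degrees of freedom, and hence the p-value, differ
print('Student: Statistics=%.3f, p=%.6f' % (stat_student, p_student))
print('Welch  : Statistics=%.3f, p=%.6f' % (stat_welch, p_welch))
```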

#### Two-sample *t* test (one-tailed)

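H0 : The mean of Population 1 is less than or equal to the mean of Population 2 (`alternative='greater'` sets H1 : the mean of Population 1 is greater)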

In [None]:
# alternative='greater' tests H1: mean of population 1 > mean of population 2
stat, p = sts.ttest_ind(sample1, sample2, alternative='greater')
print('Statistics=%.3f, p=%.3f' % (stat, p)) # print outputs
alpha = 0.05
if p > alpha:
    print('fail to reject H0. Rejecting H0 has an error probability >0.05')
else:
    print('reject H0 with an error probability <0.05')
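
The link between one-tailed and two-tailed p-values can be checked numerically: the two one-tailed p-values sum to 1, and the smaller of them is half the two-tailed p-value. This is a sketch on illustrative data (the arrays `a` and `b` are assumptions, not the notebook's samples).

```python
import numpy as np
import scipy.stats as sts

rng = np.random.default_rng(123)          # illustrative seed (assumption)
a = rng.normal(50, 20, 30)
b = rng.normal(70, 25, 30)

stat, p_two = sts.ttest_ind(a, b)                              # H1: means differ
_, p_greater = sts.ttest_ind(a, b, alternative='greater')      # H1: mean(a) > mean(b)
_, p_less = sts.ttest_ind(a, b, alternative='less')            # H1: mean(a) < mean(b)

print('Statistics=%.3f' % stat)
print('two-tailed p=%.6f' % p_two)
print('greater    p=%.6f' % p_greater)
print('less       p=%.6f' % p_less)
```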

##### `Now try running the last two tests with the same code, but using big data (the whole population or a large sample)`
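
One way to explore the effect of sample size is sketched below. It uses two hypothetical populations whose means differ by only 0.5 units (these populations, the seed and the names `big1` and `big2` are assumptions for illustration) and reports Cohen's d alongside the p-value.

```python
import numpy as np
import scipy.stats as sts

rng = np.random.default_rng(42)               # illustrative seed (assumption)
# two hypothetical populations with almost identical means
big1 = rng.normal(50.0, 20, 100000)
big2 = rng.normal(50.5, 20, 100000)

for n in (30, 100000):
    s1, s2 = big1[:n], big2[:n]               # the first n values act as a random sample
    stat, p = sts.ttest_ind(s1, s2)
    # Cohen's d: mean difference in units of the pooled standard deviation
    pooled_sd = np.sqrt((s1.var(ddof=1) + s2.var(ddof=1)) / 2)
    d = (s1.mean() - s2.mean()) / pooled_sd
    print('n=%d: Statistics=%.3f, p=%.3g, Cohen d=%.3f' % (n, stat, p, d))
```

With very large samples even a negligible difference tends to give a very small p-value, so it helps to report an effect size next to the test result.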

#### Paired two-sample *t* test (two-tailed)

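H0 : The mean of the paired differences is zero (the populations have equal means). Note that sample1 and sample2 are independent draws here, so the pairing is only illustrative.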

In [None]:
stat, p = sts.ttest_rel(sample1, sample2)
print('t-stat=%.3f, p-value=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('fail to reject H0. Rejecting H0 has an error probability >0.05')
else:
    print('reject H0 with an error probability <0.05')
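
The paired test is equivalent to a one-sample t-test on the within-pair differences, which the sketch below verifies. The `before` and `after` arrays are hypothetical repeated measurements on the same 30 units, introduced only for illustration.

```python
import numpy as np
import scipy.stats as sts

rng = np.random.default_rng(1)                # illustrative seed (assumption)
before = rng.normal(50, 20, 30)               # hypothetical first measurement
after = before + rng.normal(5, 5, 30)         # hypothetical second measurement on the same units

# paired t-test on the two related samples
t_rel, p_rel = sts.ttest_rel(before, after)
# one-sample t-test on the differences, H0: mean difference is 0
t_one, p_one = sts.ttest_1samp(before - after, popmean=0)

print('paired    : t-stat=%.3f, p-value=%.6f' % (t_rel, p_rel))
print('one-sample: t-stat=%.3f, p-value=%.6f' % (t_one, p_one))
```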

### Parametric multiple sample tests
#### One-way ANOVA
1. Define H0 : The samples are drawn from populations with equal means
2. Take samples from populations pop1 to pop4

In [None]:
# Take random samples from data (n=50)
import random
sample1 = random.sample(list(pop1), 50)
sample2 = random.sample(list(pop2), 50)
sample3 = random.sample(list(pop3), 50)
sample4 = random.sample(list(pop4), 50)
sns.kdeplot(sample1, label='Pop1')
sns.kdeplot(sample2, label='Pop2')
sns.kdeplot(sample3, label='Pop3')
sns.kdeplot(sample4, label='Pop4')
plt.legend(frameon=False, loc='upper right')
plt.show()

3. Compute the statistic and check the *p-value*

In [None]:
stat, p = sts.f_oneway(sample1, sample2, sample3, sample4)
print('F-statistic=%.3f, p=%.6f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('fail to reject H0. Rejecting H0 has an error probability >0.05')
else:
    print('reject H0 with an error probability <0.05')

SciPy does not provide the usual ANOVA table. An alternative is the statsmodels API, which produces a more complete output:

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# the statsmodels ANOVA needs the data in a long-format DataFrame
list_sample = [sample1, sample2, sample3, sample4]
df = pd.DataFrame(list_sample)
df = df.T
df.columns = ["sample1", "sample2", "sample3", "sample4"]
df = df.stack()
df = df.reset_index()
df.rename(columns={'level_1': 'group', 0: 'value'}, inplace=True)
df.drop('level_0', inplace=True, axis=1)
df

In [None]:
mod = ols('value ~ group', data=df).fit()

# typ is the type of sums of squares to use ('I', 'II' or 'III', or 1, 2, 3);
# type II does not consider interactions, which suits this single-factor model
aov_table = sm.stats.anova_lm(mod, typ=2)
print(aov_table) # provides the usual ANOVA table

#### Two-way ANOVA

Tests whether two factors affect the mean of a response variable. It also tests whether there is an interaction between the two factors (whether the effect of one factor depends on the level of the other).

Possible H0: 
1. There is no difference in the means of factor A.
2. There is no difference in means of factor B.
3. There is no interaction between factors A and B.

In [None]:
# create data (example taken from https://www.statology.org/two-way-anova-python/) - influence of sunlight exposure and watering frequency on plant growth
df2 = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
                              6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
                              4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})
print(df2)

In [None]:
#perform two-way ANOVA
model = ols('height ~ C(water) + C(sun) + C(water):C(sun)', data=df2).fit()
sm.stats.anova_lm(model, typ=2)
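
A table of cell means can help in reading the interaction term: if the difference between watering regimes changes across sunlight levels, an interaction is plausible. The sketch below rebuilds the same data so that it runs on its own; the name `cell_means` is an assumption.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                    'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                    'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
                               6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
                               4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})

# mean height for every combination of watering frequency and sunlight level
cell_means = df2.pivot_table(values='height', index='water',
                             columns='sun', aggfunc='mean')
print(cell_means)

# difference between watering regimes within each sunlight level
print(cell_means.loc['daily'] - cell_means.loc['weekly'])
```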

#### Repeated measures ANOVA

Used when the responses from the same subjects (experimental units) are measured repeatedly over a period of time or under different experimental conditions.

H0: The means of the treatment or time groups are equal

In [None]:
#create data - Ex: Measurements (time of response to stressor) taken over time for the same fish individuals
df3 = pd.DataFrame({'fish': np.repeat([1, 2, 3, 4, 5], 4),
                   'time': np.tile([1, 2, 3, 4], 5),
                   'time of response': [30, 28, 16, 34,
                                14, 18, 10, 22,
                                24, 20, 18, 30,
                                38, 34, 20, 44, 
                                26, 28, 14, 30]})
df3

In [None]:
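# import AnovaRM explicitly rather than relying on attribute access through statsmodels.stats
from statsmodels.stats.anova import AnovaRM
rmanova = AnovaRM(data=df3, depvar='time of response', subject='fish', within=['time']).fit()
print(rmanova)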
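
To my knowledge, `AnovaRM` only supports fully balanced designs, so it can be worth confirming that every subject has exactly one observation per time point before fitting. The sketch below does that and then reads the results as a DataFrame through the `anova_table` attribute; the names `counts`, `balanced` and `res` are assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

df3 = pd.DataFrame({'fish': np.repeat([1, 2, 3, 4, 5], 4),
                    'time': np.tile([1, 2, 3, 4], 5),
                    'time of response': [30, 28, 16, 34,
                                         14, 18, 10, 22,
                                         24, 20, 18, 30,
                                         38, 34, 20, 44,
                                         26, 28, 14, 30]})

# number of observations per fish (rows) and time point (columns)
counts = df3.groupby(['fish', 'time']).size().unstack()
print(counts)
balanced = bool((counts == 1).all().all())
print('balanced design:', balanced)

res = AnovaRM(data=df3, depvar='time of response',
              subject='fish', within=['time']).fit()
print(res.anova_table)  # F value, degrees of freedom and p-value as a DataFrame
```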

### Post-hoc or multiple comparison tests

#### Tukey's test

Checks which pairs of groups differ in their mean values. It can also be used as a stand-alone test.
It is implemented in `statsmodels.stats.multicomp`.

In [None]:
# perform Tukey's test using the df dataframe defined above
from statsmodels.stats.multicomp import pairwise_tukeyhsd
tukey = pairwise_tukeyhsd(endog=df['value'],
                          groups=df['group'],
                          alpha=0.05)
# display results
print(tukey)
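
For further processing, the pairwise results can be collected in a DataFrame instead of a printed table. The sketch below uses illustrative data and the `groupsunique`, `meandiffs` and `reject` attributes of the result object; it assumes that the pairs follow the same order as `itertools.combinations` over the sorted group labels, which is worth checking against the printed summary.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)                # illustrative seed (assumption)
values = np.concatenate([rng.normal(50, 20, 50),
                         rng.normal(70, 25, 50),
                         rng.exponential(50, 50),
                         rng.exponential(100, 50)])
labels = np.repeat(['sample1', 'sample2', 'sample3', 'sample4'], 50)

tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey)

# assumed pair order: (g1, g2), (g1, g3), (g1, g4), (g2, g3), ...
pairs = list(combinations(tukey.groupsunique, 2))
res = pd.DataFrame({'group1': [g1 for g1, _ in pairs],
                    'group2': [g2 for _, g2 in pairs],
                    'meandiff': tukey.meandiffs,
                    'reject': tukey.reject})
print(res)
```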