# Geospatial Data Analysis I 

## Statistical Tests - Solution

In this exercise we will test different statistical hypotheses related to groundwater parameters in the area of Karlsruhe. 

- Load the dataset "Data_GW_KA.csv" (or "Data_GW_KA.xslx") and create two DataFrames: one should contain the data from the forest area (Land use = 2), the other the data from the urban area (Land use = 1).

In [3]:
# [1]
import pandas as pd

data = pd.read_excel("Data_GW_KA.xlsx")

data_urban = data.loc[data.Land_use == 1,:]
data_forest = data.loc[data.Land_use == 2,:]

### Hypothesis 1: "The average groundwater temperature in the forest is 11°C." 

This null hypothesis is based on the observation that the air temperature in this area is aroudn 11°C. So, we could expect groundwater temperatures in the same range. The alternative hypothesis is accordingly: "The average temperature in the forest is not 11°C". Such hypotheses can be tested with a two-sided t-Test. 

The t-test is a parametric test, meaning that for a valid test results the samples need to be normally distributed. In `scipy.stats` there is a function `shapiro()`, which can be used to test for normal distribution. This function takes (at least) one input, i.e. the data set to be tested, and creates two outputs (the test statistic and the correspinding *p*-value). 

- Test the samples of groundwater temperature in the forest for normal distribution. Based on a significance level of $\alpha$=0.01, would you accept the hypothesis that the values are normally distributed? 



In [5]:
# [2]
from scipy.stats import shapiro
stat, p = shapiro(data_forest["GW_Temperature_°C"])
print(p)

0.7650041580200195


As the p-value is substantially higher than the significance level ($\alpha$=0.01), we can accept the null hypothesis that the values are normally distributed. So, we can continue with the t-test for our sample. In `scipy.stats` there is a function `ttest_1samp()`, which takes as input arguments the data to be tested and the hypothetical mean value (here 11°C).  

- Write a piece of code containing an `if else` condition, which outputs (in addition to the p-value) a suggestion "accept H0" for p-values > 0.01, and "reject H0" for p-values <= 0.01. 


In [7]:
# [4]
from scipy import stats
stat, p = stats.ttest_1samp(data_forest["GW_Temperature_°C"], 11)
print('stat=%.4f, p=%.4f' % (stat, p))
if p > 0.01:
	print('accept H0')
else:
	print('reject H0')


stat=-6.1775, p=0.0005
reject H0


How did you decide regarding the hypothesis above? Of course, it interesting to known (after testing!) what the actual mean value is.  

- Calculate the mean groundwater temperature in the forest area, and compare it to the hypothetical value. Also, determine the number of samples ('n'). How would you judge the result? 

In [8]:
# [5]
import statistics
mean_GWT = statistics.mean(data_forest["GW_Temperature_°C"])
n = len(data_forest)
print(mean_GWT, n)

10.69125 8


### Hypothesis 2: "Wells in the urban area have a lower content of dissolved oxygen than in the Hardtwald" 

This hypothesis can also be tested with a t-test. As we are comparing to data sets (forest vs. urban area) we a Python function for a two-sample t-test. Does this have to be one or two-sided? 

For a valid two-sample t-test both data sets have to be normally distributed, both need to have the same variance. 

- First, test whether the contents of dissolved oxygen in the urban area and the forest are normally distribiuted, using the same function `shapiro()` as above.  

In [9]:
# [6]
stat_f, p_f = shapiro(data_forest["Oxygen_mg/l"])
print(p_f)
stat_u, p_u = shapiro(data_urban["Oxygen_mg/l"])
print(p_u)

0.9679363965988159
0.1985752433538437


A common test for variances is the F-test. The according null hypothesis is that the variances are equal, the alternative hypothesis is that they are different. 

The equation for calculation the F-statistic is: 

<img src="https://latex.codecogs.com/gif.latex?\hat{F}&space;=&space;\frac{s_{a}^{2}}{s_{b}^{2}}" title="\hat{F} = \frac{s_{a}^{2}}{s_{b}^{2}}" />

where data *a* and *b* have to set so that s<sub>a</sub><sup>2</sup> > s<sub>b</sub><sup>2</sup>. 

- First, calculate the variances of both datasets, to decide on which data is *a* and which is *b*. 

- Then, calculate the F-statistic accrodingly using the equation above. 


In [10]:
# [7]
import statistics
var_urban = statistics.variance(data_urban["Oxygen_mg/l"])
var_forest = statistics.variance(data_forest["Oxygen_mg/l"])
if var_urban > var_forest:
    sa = var_urban
    sb = var_forest
else:
    sa = var_forest
    sb = var_urban
F = sa/sb
print (F)

1.7914795850111387


For testing the hypothesis we now compare the calculated F-statistic with the critical F-value of the Fisher distribution, and calculatet the corresponding p-value. 

- Use `1 - scipy.stats.f.cdf()` to calculate the p-value. This function needs the F-statistic as an input, as well as the degrees of freedom of both data sets *a* and *b*. As variances are calculated using the individual data points the degrees of freedom are here equal to `n-1`.

- Output the p-value and compare to the signficance level. Would you accept or reject the null hypothesis that both variances are equal? 

In [11]:
# [8]
import scipy
dfn = data_forest["Oxygen_mg/l"].size-1
dfd = data_urban["Oxygen_mg/l"].size-1
p = 1-scipy.stats.f.cdf(F, dfn, dfd)
print(p)

0.12584668944869004


As both conditions (normal distribution and equal variances) can be accepted, we can now proceed to applying the two-sided t-test. One option in Python to do so, is the function `scipy.stats.ttest_ind()`, which requires both data sets *a* and *b* as inputs, where mean(*a*) > mean(*b*). As output the function delivers the t-statistic and the p-value.

- Calculate the mean values of data sets *a* and *b*. 

- Then, use an if-else condition to calculate the t-statistic and the p-value according to the data sets *a* and *b*. 

- As the hypothesis requires a one-sided t-test, you have to divide the p-value by 2, before comparing it to the significance level. Would you accept or reject the null hypothesis? 

In [None]:
# [9]
mean_oxy_forest = data_forest["Oxygen_mg/l"].mean()
mean_oxy_urban = data_urban["Oxygen_mg/l"].mean()

print(mean_oxy_forest, mean_oxy_urban)

# am besten über if-else Bedingung anhand des größeren Mittelwert lösen
if mean_oxy_forest > mean_oxy_urban: 
    t_value, p = stats.ttest_ind(data_forest["Oxygen_mg/l"], data_urban["Oxygen_mg/l"])

else: 
    t_value, p = stats.ttest_ind(data_urban["Oxygen_mg/l"], data_forest["Oxygen_mg/l"])

p = p*0.5
print (t_value, p)

7.93875 4.468387096774194
3.4696694607406515 0.0006702352381517308


Similar as for the F-test, there is an option in scipy to calculate the critical t-value. The corresponding function is `scipy.stats.t.ppf()`, which requires as inputs the value of "1 - significance level/2" as well as the sum of the degrees of freedom. 

- Calculate the critical t-value and compare it to the t-statistic from above. If the t-statistic is larger than the critical t-value, the null hypothesis should be rejected. Would you accept or reject the null hypothesis? 

In [None]:
# [10]
df = dfn + dfd 
t_crit = scipy.stats.t.ppf(1-0.05/2, df)
print(t_crit)

2.0261924630291093


### What to do if the hypothesis for normal distribution is rejected? 

Commonly, data sets and parameters in geoscience are not normally distributed. The non-parametric equivalent to the t-test is the Mann-Whitney U-test. The null hypothesis for this test is that both data sets have the same distribution, the alternative hypothesis is that they are not equal. The corresponding function is `scipy.stats.mannwhitneyu()`, which takes as inputs the two data sets, and returns the U-statistic and the p-value. 

- Use a for-loop to iterate over every parameter in the data set, and test for each parameter whether the samples from the urban area and the forest are normally distributed. If the samples are not normally distributed, they can be further tested for having the same distribution using a Mann-Whitney U-test. 

- Add another if-else condition to your code, which prints information on those parameters that have the same distribution. Or rather the ones for which we would accept the hypothesis that the distributions are equal (p-value > significance level). 

- Tip: Build your code step-by-step (1. the for-loop, 2. first if-else, 3. second if-else), and test the functionality before adding further elements. 

In [13]:
# [11]

for item in range(1, data.shape[1]-1): 
    #print(data[0, item])
    stat1, p1 = shapiro(pd.Series(data_forest.iloc[:,item]))
    stat2, p2 = shapiro(pd.Series(data_urban.iloc[:,item]))

    if p1 and p2 < 0.01:
        U_value, p_value = scipy.stats.mannwhitneyu(pd.Series(data_forest.iloc[:,item]), pd.Series(data_urban.iloc[:,item]))

        if p_value > 0.01:
            print (data.columns[item], U_value, p_value)
        

Phosphate_mg/l 69.5 0.058506478816411724
Detritus 113.0 0.6966670993087477
Sediment 127.5 0.9142043078871558
Geology 85.0 0.14231142289670165
Abundancy_Species 145.5 0.4520197594195483
Abundancy_Individuals 160.5 0.2103966930599971
Perc_Crustaceen_% 162.0 0.18509761043690465




## END