# Geospatial Data Analysis I 

## Inferential and bivariate statistics - Exercise

###  Exercise 1: Inferential statistics 

In this exercise, we will take a closer look at the groundwater data from the last lesson, and infer information about the entire population in the area of the Hardtwald forest based on the measured samples. 

First, read in the dataset `Data_GW_KA.csv` (or the corresponding Excel file). 

In [None]:
# [1]



In the last exercise we already characterised the groundwater temperature, whose distribution was roughly normal. Now, let's assume you need 50 representative temperature values for some modelling exercise (but you only have 39 measurements). 

In order to generate these 50 values, you can fit a normal distribution to the measured values, and then use Python to draw the desired number of random (but representative) values, which should have the same statistical characteristics as the measured data. To do so, you first need to find out which parameters of the normal distribution (mean, standard deviation) fit the measured data best. 

- Use the function `scipy.stats.norm.fit()` to estimate the two parameters ($\mu$, $\sigma$) from the groundwater temperature data. 

- Print the mean and standard deviation to inspect them. 
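
As a minimal sketch of the fitting step: the array `temperature` below is a synthetic stand-in (39 values drawn around an assumed 13 °C), not the measured data. In the exercise, pass the temperature column of your own DataFrame instead.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the 39 measured groundwater temperatures (°C).
# The values 13.0 and 1.5 are assumptions for illustration only.
rng = np.random.default_rng(seed=42)
temperature = rng.normal(loc=13.0, scale=1.5, size=39)

# norm.fit returns the maximum-likelihood estimates (loc, scale) = (mu, sigma)
mu, sigma = stats.norm.fit(temperature)

print(f"mean (mu):   {mu:.2f} °C")
print(f"std (sigma): {sigma:.2f} °C")
```

Note that the maximum-likelihood estimate of $\sigma$ divides by $n$, whereas `pandas.Series.std()` divides by $n-1$ by default, so the two values can differ slightly for small samples.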

In [None]:
# [2]


Using the function `scipy.stats.norm.rvs()` you can now generate a specific number of normally distributed values. As inputs, you need to specify the mean (`loc`), the standard deviation (`scale`), and the number of values (`size`, e.g. 50). 
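
A possible call could look like the sketch below. The values of `mu` and `sigma` are hypothetical placeholders for the parameters you fitted above; `random_state` is optional and only makes the draw reproducible.

```python
from scipy import stats

# Hypothetical fitted parameters; replace with your own estimates
mu, sigma = 13.0, 1.5

# Draw 50 random values from N(mu, sigma)
samples = stats.norm.rvs(loc=mu, scale=sigma, size=50, random_state=1)

print(samples[:5])  # first five generated temperatures
```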

In [None]:
# [3]


Now, have a look at the generated values, especially at their mean value and standard deviation. Are they identical to the empirical values from the measured samples?

In [None]:
# [4]


Most likely the values for mean and standard deviation differ slightly from the empirical ones. This might be due to the small number (n = 50) of generated values. 

- Generate another set of random values, now with n = 500,000 (German principle: "viel hilft viel!", roughly "a lot helps a lot!")

- Then compare the new mean and standard deviation again to the empirical measures. 
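
The effect of the sample size can be sketched as follows, again assuming hypothetical fitted parameters `mu` and `sigma`. The loop draws both sample sizes and prints their statistics side by side.

```python
from scipy import stats

# Hypothetical fitted parameters; replace with your own estimates
mu, sigma = 13.0, 1.5

results = {}
for n in (50, 500_000):
    values = stats.norm.rvs(loc=mu, scale=sigma, size=n, random_state=0)
    # Store sample mean and sample standard deviation (ddof=1)
    results[n] = (values.mean(), values.std(ddof=1))
    print(f"n = {n:>7}: mean = {results[n][0]:.3f}, std = {results[n][1]:.3f}")
```

The larger sample should reproduce `mu` and `sigma` much more closely, because the standard error of the mean shrinks with $1/\sqrt{n}$.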

In [None]:
# [5]


Mean value and standard deviation should now be close to the empirically derived values. However, the large number of generated values leads to a different issue. 

- Calculate the minimum and maximum value of the generated 500,000 values. Are they physically reasonable, given the range of the measured data? 
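
One way to check the extremes is sketched below, with hypothetical parameters `mu` and `sigma`. Expressing the minimum and maximum as a distance from the mean in units of $\sigma$ makes it easier to judge how far into the tails the generated values reach.

```python
from scipy import stats

# Hypothetical fitted parameters; replace with your own estimates
mu, sigma = 13.0, 1.5

values = stats.norm.rvs(loc=mu, scale=sigma, size=500_000, random_state=0)

v_min, v_max = values.min(), values.max()
print(f"min: {v_min:.2f} °C ({(v_min - mu) / sigma:+.1f} sigma)")
print(f"max: {v_max:.2f} °C ({(v_max - mu) / sigma:+.1f} sigma)")
```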

In [None]:
# [6]


Based on the measured data, the minimum seems too low and the maximum too high. The reason is that with such a large number of randomly generated values, some values from the extremely unlikely tails of the distribution get picked as well. 

You can avoid this by working with truncated distributions. The function `scipy.stats.truncnorm.rvs()` draws values from a truncated normal distribution. Its two shape parameters (*a*, *b*) define the cut-off points, expressed as the number of standard deviations from the mean; the mean and standard deviation themselves are still passed as `loc` and `scale`: 

<img src="https://latex.codecogs.com/gif.latex?a&space;=&space;(minimum&space;-&space;mean)/&space;std" title="a = (minimum - mean)/ std" />

and 

<img src="https://latex.codecogs.com/gif.latex?b&space;=&space;(maximum&space;-&space;mean)/&space;std" title="b = (maximum - mean)/ std" />

- Think of a reasonable minimum and maximum value for your theoretical distribution of groundwater temperatures, and calculate *a* and *b* accordingly.

- Use them in `scipy.stats.truncnorm.rvs()` (together with `loc` and `scale`) to generate 500,000 random values, and check the statistical characteristics of those values (e.g. using `scipy.stats.describe()`). 
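
A minimal sketch of the truncation step is given below. The parameters `mu`, `sigma` and the bounds `t_min`, `t_max` are hypothetical; choose bounds that are plausible for your own data.

```python
from scipy import stats

# Hypothetical fitted parameters and plausible temperature bounds (°C)
mu, sigma = 13.0, 1.5
t_min, t_max = 9.0, 17.0

# Convert the bounds to units of standard deviations from the mean
a = (t_min - mu) / sigma
b = (t_max - mu) / sigma

# loc and scale must be passed as well, otherwise the values are
# drawn from a truncated *standard* normal distribution
values = stats.truncnorm.rvs(a, b, loc=mu, scale=sigma,
                             size=500_000, random_state=0)

print(stats.describe(values))
```

Keep in mind that truncation also narrows the distribution slightly, so the standard deviation of the generated values will be somewhat smaller than `sigma`.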


In [None]:
# [7]


The new values are much more reasonable to represent groundwater temperatures than the un-truncated ones, and could now be used for further modelling / analyses. 

There are also functions to fit theoretical distributions other than the normal distribution to existing data. You can find an overview of the distributions available in `scipy` here: https://docs.scipy.org/doc/scipy/reference/stats.html

To visualise the measured data and the fitted distribution, we can now plot them using `matplotlib`. 

- First, generate a sequence of x-values on which to evaluate the probability density function, using `numpy.linspace()`.

- Then, calculate the corresponding y-values using `scipy.stats.norm.pdf(x, mean, std)` with the estimated mean and standard deviation from above. Note that the third argument (`scale`) is the standard deviation, not the variance. 

- Finally, plot the probability density function as a line (`plt.plot(x, y)`) and the measured data as a histogram (`plt.hist(data, density=True)`) in the same figure. The argument `density=True` normalises the histogram so that the bar areas sum to 1 (a probability density instead of counts), which makes it comparable to the fitted curve.
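
The three steps could be combined as in the sketch below. The array `temperature` is a synthetic stand-in for the measured data, and the `Agg` backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Synthetic stand-in for the measured temperatures (assumed values)
rng = np.random.default_rng(seed=42)
temperature = rng.normal(loc=13.0, scale=1.5, size=39)
mu, sigma = stats.norm.fit(temperature)

# x-values covering the data range, and the fitted density on them
x = np.linspace(temperature.min() - 1, temperature.max() + 1, 200)
y = stats.norm.pdf(x, loc=mu, scale=sigma)  # scale = standard deviation

fig, ax = plt.subplots()
ax.hist(temperature, density=True, alpha=0.5, label="data (synthetic here)")
ax.plot(x, y, label="fitted normal PDF")
ax.set_xlabel("Groundwater temperature (°C)")
ax.set_ylabel("Probability density")
ax.legend()
```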

In [None]:
# [8]


### Exercise 2: Bivariate statistics

After looking at one individual parameter above, we now analyse the relationships between two measured variables. 

First, read in the shortened data set on groundwater in Karlsruhe from Koch et al. (2020) ('Data_GW_KA_short.csv') as a dataframe using pandas. 


In [None]:
# [9]


#### Explorative Data Analysis: Histograms and scatterplots

If you are confronted with a new data set, the first step is to get an overview of the measured variables, their values and some basic characteristics. It is also recommended to always visualize the data, e.g. with histograms for each parameter, as this is the easiest way to identify outliers, patterns in the data, etc. If you are interested in the relationship between two variables, a scatterplot is a good option. 

The Python package `seaborn` has a very useful function (`seaborn.jointplot()`) to plot the scatterplot of two variables and the histograms of their marginal distributions in one go. `seaborn` is a Python package similar to `matplotlib`, which is also included in Anaconda and offers specific functionality for visualising data. 

- Import `seaborn` with the abbreviation 'sns', and use `seaborn.jointplot(data=dataframe, x=variable1, y=variable2)` to visualise two measured variables (free choice) from the dataframe. 

In [None]:
# [10]


To get a quick overview of all bivariate relations in the entire data set, there is the function `seaborn.pairplot()`, which combines histograms and scatterplots for all variables. 

- Apply the function `seaborn.pairplot()` to the groundwater data set. Depending on the computational capacity of your laptop this might take a few seconds.

In [None]:
# [11]


#### Correlation coefficients

Now we would like to quantify the relationship between individual variables. The basic measure for this is the covariance: 

<img src="https://latex.codecogs.com/gif.latex?cov_{xy}&space;=&space;\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})" title="cov_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})" />

To calculate the covariance for all variable pairs in a Pandas DataFrame, you can use the function `pandas.DataFrame.cov()`. 

- First, remove the column with the well names from the data set to avoid errors caused by the string data type: `DataFrame.drop(columns='Name')`. 

- Then calculate the covariances using `DataFrame.cov()`, and print the resulting covariance matrix. 

- Interpret the shown values with respect to the strength of the bivariate relationship, and compare the covariances with the visual impression from the figure above. 
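
The sketch below illustrates these steps on a small synthetic DataFrame. Only the column name `Name` comes from the exercise; the columns `depth` and `temperature` and all their values are hypothetical. It also recomputes one covariance by hand with the $1/(n-1)$ normalisation to show that this matches the pandas result.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the groundwater data (all values are assumptions)
rng = np.random.default_rng(seed=0)
n = 30
depth = rng.uniform(2, 15, n)
df = pd.DataFrame({
    "Name": [f"well_{i}" for i in range(n)],
    "depth": depth,
    "temperature": 14 - 0.1 * depth + rng.normal(0, 0.5, n),
})

# drop() returns a new DataFrame; the original df is left unchanged
numeric = df.drop(columns="Name")
cov_matrix = numeric.cov()
print(cov_matrix)

# Manual covariance using the 1/(n-1) normalisation
x, y = numeric["depth"], numeric["temperature"]
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)
print(f"manual cov(depth, temperature) = {manual_cov:.4f}")
```

The diagonal of the covariance matrix contains the variances of the individual variables.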

In [None]:
# [12]


The strength of the individual relations is not easy to interpret because the values of the variables and thus their (co)variances vary quite a lot. Dividing the covariance by the product of the individual standard deviations solves this problem, and returns Pearson's correlation coefficient. 

- Use the Pandas function `DataFrame.corr()` to calculate the correlation coefficient matrix for your data set, and print it. 

- Compare the results to the covariance matrix and the pairplot above. Which is the variable pair with the strongest linear relationship?
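
A possible way to compute the correlation matrix and pick out the strongest pair is sketched below. The DataFrame and its three column names are hypothetical stand-ins for the groundwater data, and `manual_r` checks the relation between covariance and Pearson's coefficient for one pair.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (column names and values are assumptions)
rng = np.random.default_rng(seed=0)
n = 30
depth = rng.uniform(2, 15, n)
df = pd.DataFrame({
    "depth": depth,
    "temperature": 14 - 0.3 * depth + rng.normal(0, 0.3, n),
    "conductivity": rng.normal(700, 80, n),
})

corr_matrix = df.corr()  # Pearson's r is the default method
print(corr_matrix.round(2))

# Pearson's r = covariance / (std_x * std_y)
manual_r = df["depth"].cov(df["temperature"]) / (
    df["depth"].std() * df["temperature"].std()
)

# Hide the diagonal (always 1) and find the largest absolute coefficient
off_diagonal = corr_matrix.abs().where(~np.eye(len(corr_matrix), dtype=bool))
strongest_pair = off_diagonal.unstack().idxmax()
print("strongest linear relationship:", strongest_pair)
```

Taking the absolute value matters here, because a strongly negative coefficient indicates just as strong a linear relationship as a strongly positive one.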

In [None]:
# [13]


Generally, the correlation coefficients in this dataset are quite small. One reason might be that the relations are not strictly linear, but more complex (see pairplot above). 

- Use `scipy.stats.spearmanr(x, y)` to calculate Spearman's correlation coefficient for a chosen variable pair. This function outputs the correlation coefficient (statistic) and the p-value (measure for the statistical significance, more on that later...)

- Compare the Spearman coefficient to the one from Pearson above. Do the values differ, and what does this mean for the kind of bivariate relationship?  
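
To illustrate why the two coefficients can differ, the sketch below uses a purely hypothetical, perfectly monotonic but non-linear relationship ($y = e^x$) instead of the groundwater data.

```python
import numpy as np
from scipy import stats

# Hypothetical data: strictly increasing, but clearly non-linear
x = np.linspace(0, 5, 40)
y = np.exp(x)

r, p_pearson = stats.pearsonr(x, y)      # measures linear association
rho, p_spearman = stats.spearmanr(x, y)  # measures monotonic association

print(f"Pearson r    = {r:.3f}")
print(f"Spearman rho = {rho:.3f}")
```

Spearman's coefficient is computed on the ranks of the values, so it reaches 1 for any strictly increasing relationship, while Pearson's coefficient stays below 1 unless the relationship is linear.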

In [None]:
# [14]



As mentioned on the slides, there is a third commonly-used correlation coefficient, which can also be used for discrete and ordinal data. 

- Pick a suitable variable pair from the dataset, and use `scipy.stats.kendalltau()` to calculate Kendall's correlation coefficient. 

- Check for differences between Pearson's, Spearman's and Kendall's correlation coefficient and interpret them. 
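
As a small worked example on hypothetical ordinal data (two rankings of five items, not taken from the dataset), Kendall's coefficient can be checked by counting pairs: of the 10 possible pairs, 8 are concordant and 2 are discordant, giving $\tau = (8 - 2)/10 = 0.6$.

```python
import numpy as np
from scipy import stats

# Hypothetical ordinal data: two rankings of the same five items
x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 3, 2, 5, 4])

tau, p_kendall = stats.kendalltau(x, y)
rho, p_spearman = stats.spearmanr(x, y)

print(f"Kendall tau  = {tau:.2f}")  # 0.60
print(f"Spearman rho = {rho:.2f}")  # 0.80
```

Kendall's $\tau$ is typically smaller in magnitude than Spearman's $\rho$ for the same data, so the two should not be compared on the same scale.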

In [None]:
# [15]


A nice way to visualise matrix values (e.g. from a correlation matrix) is a heatmap plot. 

- Use `seaborn.heatmap(data)` to visualise the correlation matrix from Pandas above. By adding the argument `annot=True` you can plot the numbers of the coefficients as well. 
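
A heatmap of a correlation matrix could be drawn as in the sketch below. The DataFrame is a synthetic stand-in with hypothetical column names, and the arguments `fmt`, `vmin`, `vmax` and `cmap` are optional styling choices rather than part of the exercise.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (column names and values are assumptions)
rng = np.random.default_rng(seed=0)
n = 30
depth = rng.uniform(2, 15, n)
df = pd.DataFrame({
    "depth": depth,
    "temperature": 14 - 0.3 * depth + rng.normal(0, 0.3, n),
    "conductivity": rng.normal(700, 80, n),
})
corr_matrix = df.corr()

fig, ax = plt.subplots()
# Fixing the colour range to [-1, 1] keeps the colours comparable
# between different correlation matrices
sns.heatmap(corr_matrix, annot=True, fmt=".2f",
            vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
ax.set_title("Pearson correlation (synthetic data)")
```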

In [None]:
# [16]


If there is time left you can visit the seaborn webpage (https://seaborn.pydata.org/examples/index.html) to see more examples of data visualisation. 

### END

#### References: 

Koch et al. (2020), Groundwater fauna in an urban area: natural or affected? https://hess.copernicus.org/preprints/hess-2020-151/hess-2020-151.pdf
