# Geospatial Data Analysis I 

## Multivariate statistics - Solution

This exercise is about analysing and visualising multi-dimensional data. For this, we will again use the dataset on groundwater parameters in Karlsruhe (the large one: "Data_GW_KA.csv").

- Import the CSV or Excel file as a DataFrame via Pandas.

- Check the dataframe for any columns that contain strings or NaNs and drop them, as we will not be able to work with them here. 
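
One possible way to do the clean-up is sketched below. The small DataFrame and its column names are made up for illustration and only stand in for the result of `pd.read_csv("Data_GW_KA.csv")`; the real file will have different columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_csv("Data_GW_KA.csv"); column names are invented.
df = pd.DataFrame({
    "site": ["A", "B", "C", "D"],         # string column
    "temp": [12.1, 13.4, 11.8, 14.0],
    "pH": [7.1, np.nan, 6.9, 7.3],        # contains a NaN
    "nitrate": [10.0, 22.5, 5.2, 18.1],
})

# Keep only the numeric columns, then drop every column that contains a NaN.
df_num = df.select_dtypes(include="number").dropna(axis=1)

print(df_num.columns.tolist())
print(df_num.isna().sum().sum())  # number of remaining NaNs
```

`dropna(axis=1)` removes whole columns; with `axis=0` it would remove rows (samples) instead, which may be preferable if only a few measurements are missing.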

In [None]:
# [1]


### Exercise 1: Principal component analysis in 5 steps

#### 1. Standardising the data

From last week we already know that the variances and covariances of the dataset vary over several orders of magnitude. Accordingly, we need to standardise (or transform) all parameters before further analysis, so that each parameter has a mean of 0 and a standard deviation of 1. Note that standardising only shifts and rescales the values; it does not change the shape of a parameter's distribution.

One option in Python to do this is the function `sklearn.preprocessing.StandardScaler().fit_transform()` from the package `sklearn`. 

- Import the function `sklearn.preprocessing.StandardScaler().fit_transform()`. 

- Create a variable for the standardised data using the method above. The first pair of brackets takes no arguments (i.e. it remains empty), the second one takes the name of the dataset to be transformed.

- Check the statistical characteristics of the created standardised dataset.  
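
A minimal sketch of this step, assuming a synthetic array in place of the cleaned groundwater data (three made-up parameters on very different scales):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 39 samples, 3 parameters with very different scales.
rng = np.random.default_rng(0)
data = rng.normal(loc=[10.0, 200.0, 0.5], scale=[2.0, 50.0, 0.1], size=(39, 3))

# First brackets: create the scaler; second brackets: fit it and transform the data.
data_std = StandardScaler().fit_transform(data)

# After standardising, each column should have mean ~0 and standard deviation ~1.
print(data_std.mean(axis=0).round(6))
print(data_std.std(axis=0).round(6))
```

The result is a NumPy array, even if a DataFrame is passed in, so the column names have to be kept separately if they are needed later.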

In [None]:
# [2] 



#### 2. Calculating eigenvalues and eigenvectors

Now we can calculate the eigenvalues and eigenvectors of the covariance matrix, i.e. the axes in the parameter space that later represent the principal components.

- First, calculate the covariance matrix of your data using `numpy.cov(data.T)`. The `.T` transposes the data matrix (columns become rows). This is needed because `numpy.cov` by default treats each row as one variable and each column as one observation.

- Then, use `numpy.linalg.eig()` to calculate the eigenvalues and eigenvectors (i.e. two outputs for one function) with the covariance matrix as the input. 
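
The two calls could look roughly like this. The random array is only a stand-in for the standardised dataset, and the variable names are illustrative.

```python
import numpy as np

# Synthetic stand-in for the standardised data (39 samples, 4 parameters).
rng = np.random.default_rng(0)
data_std = rng.normal(size=(39, 4))
data_std = (data_std - data_std.mean(axis=0)) / data_std.std(axis=0)

# Covariance matrix of the parameters: one row/column per parameter.
cov_mat = np.cov(data_std.T)

# eig returns two outputs: the eigenvalues and a matrix whose COLUMNS are the eigenvectors.
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print(cov_mat.shape)
print(eig_vals)
```

Each pair satisfies `cov_mat @ eig_vecs[:, i] == eig_vals[i] * eig_vecs[:, i]` up to floating-point error, which is a quick way to check the result.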

In [None]:
# [3] 


#### 3. Determining principal components

The aim of principal component analysis is to reduce the dimensions of the dataset, e.g. for visualising the information of the data in 2D, while preserving as much information, i.e. variance, as possible. The directions of the principal components (the orientation of the new axes in the parameter space) are given by the eigenvectors, each of which has unit length. The eigenvalue of an eigenvector is the variance of the data along that direction, so in order to preserve as much information as possible, we are interested in the eigenvectors with the highest eigenvalues.

- First, define a new variable as an empty list (`name = [[]]*n`), with `n` being the number of eigenvalues. 

- Now, we want to fill this list with the eigenvalues (one value) and the corresponding eigenvectors (*n* values) per row: `row_list = [(eigenvalue, eigenvector)]`. You can e.g. use a for-loop to do so. Because we are interested in the overall information content, use the absolute values `numpy.abs(array)` for the eigenvalues. Leave the eigenvectors unchanged: their signs define the direction of the axis, and taking absolute values would change that direction.

Tip: in the for-loop, the eigenvalues are indexed with a simple index (`[i]`), while the eigenvectors need to be indexed column-wise (`[:,i]`), as the eigenvectors are stored as the columns of a 15x15 matrix.
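
A sketch of the loop, again on synthetic stand-in data with four parameters instead of 15. It takes the absolute value of the eigenvalue only and keeps the signs of the eigenvector, since those define the direction of the axis.

```python
import numpy as np

# Synthetic stand-in for the standardised data (39 samples, 4 parameters).
rng = np.random.default_rng(1)
data_std = rng.normal(size=(39, 4))

eig_vals, eig_vecs = np.linalg.eig(np.cov(data_std.T))

n = len(eig_vals)
eig_pairs = [[]] * n  # placeholder list with n entries
for i in range(n):
    # Eigenvalue i belongs to COLUMN i of the eigenvector matrix.
    eig_pairs[i] = (np.abs(eig_vals[i]), eig_vecs[:, i])

print(len(eig_pairs))
print(eig_pairs[0][0], eig_pairs[0][1].shape)
```

Each entry is a tuple, so `eig_pairs[i][0]` is the eigenvalue and `eig_pairs[i][1]` the matching eigenvector.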

In [None]:
# [4]


Now we need to identify the rows with the highest eigenvalues. 

- Sort the created list using `list.sort()` and `list.reverse()`, so that the first entry has the highest eigenvalue.
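
The sorting can be sketched on a tiny hand-made list of pairs (the values are hypothetical). The `key` argument is an addition to the plain `list.sort()` call: it makes Python compare the eigenvalues only, so it never has to compare two NumPy arrays if two eigenvalues happen to be equal.

```python
import numpy as np

# Hypothetical (eigenvalue, eigenvector) pairs for a 3-parameter example.
eig_pairs = [
    (0.4, np.array([0.0, 1.0, 0.0])),
    (2.1, np.array([1.0, 0.0, 0.0])),
    (0.5, np.array([0.0, 0.0, 1.0])),
]

# Sort ascending by eigenvalue, then reverse to get descending order.
eig_pairs.sort(key=lambda pair: pair[0])
eig_pairs.reverse()

print([val for val, _ in eig_pairs])
```

The same result can be obtained in one call with `eig_pairs.sort(key=lambda pair: pair[0], reverse=True)`.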

In [None]:
# [5] 



The first two rows of this list now contain the two main principal components with the largest variances. 

- Save these first two rows as two separate variables of the type `list`.  

In [None]:
# [6]


#### 4. Constructing the matrix for visualisation 

Using the principal components as rotational axes, we now want to transform (or project) the original data points onto the two identified axes in the parameter space (i.e. the principal components). In order to reduce 15 parameters to two dimensions we need a 15x2 matrix, with the two eigenvectors (those with the highest eigenvalues) as columns. You have already saved these two eigenvectors and their eigenvalues as separate lists in the last step.

- Use `numpy.stack((vector1, vector2))` to create the required transformation matrix. With the additional argument `axis=-1` you can make sure that the two vectors are placed side by side as columns. Also, make sure to use the correct index, so that only the eigenvectors are contained in the matrix (we do not need the eigenvalues anymore).

- Check the size of the created matrix to make sure it has 15x2 dimensions. 
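
A small sketch of the stacking step, using two made-up (eigenvalue, eigenvector) pairs with three parameters instead of 15. Index `[1]` picks the eigenvector out of each pair.

```python
import numpy as np

# Hypothetical top-two (eigenvalue, eigenvector) pairs for a 3-parameter example.
pc1 = (2.1, np.array([0.6, 0.8, 0.0]))
pc2 = (0.5, np.array([0.0, 0.0, 1.0]))

# axis=-1 places the two eigenvectors side by side as COLUMNS.
W = np.stack((pc1[1], pc2[1]), axis=-1)

print(W.shape)
print(W)
```

Without `axis=-1` the vectors would be stacked as rows, giving the transposed shape, and the matrix multiplication in the next step would fail.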

In [None]:
# [7]


#### 5. Projection onto the new axes and visualisation

Now, you can perform the transformation of your standardised matrix from above using the equation `Y = data x W`, where `W` is the 15x2 transformation matrix from the previous step.

- Use the syntax for matrix multiplication `matrix1.dot(matrix2)` to transform your data. The output matrix should have the dimension 39 x 2 (datapoints x PCs).
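
The projection could be sketched as follows, with synthetic data standing in for the standardised dataset. To keep the example short, the two leading eigenvectors are selected with `numpy.argsort` rather than with the sorted list from the previous steps; that shortcut is an assumption of this sketch, not part of the exercise.

```python
import numpy as np

# Synthetic stand-in for the standardised data (39 samples, 4 parameters).
rng = np.random.default_rng(2)
data_std = rng.normal(size=(39, 4))
data_std = (data_std - data_std.mean(axis=0)) / data_std.std(axis=0)

eig_vals, eig_vecs = np.linalg.eig(np.cov(data_std.T))

# Shortcut: column indices of the eigenvectors, largest eigenvalue first.
order = np.argsort(eig_vals)[::-1]
W = eig_vecs[:, order[:2]]  # transformation matrix, parameters x 2

# Project the data onto the two principal components.
Y = data_std.dot(W)

print(Y.shape)
print(Y.var(axis=0, ddof=1))  # variance along PC1 and PC2
```

The variances of the two columns of `Y` equal the two largest eigenvalues, which is a useful consistency check.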

In [None]:
# [8]


To make further analysis of the PCA results easier, we can convert the matrix into a `pandas` DataFrame. Also, we can then add additional information to this DataFrame for enhanced visualisation. 

- Convert the matrix (or ndarray) into a DataFrame: `x = pd.DataFrame(data = data, columns = ['column', 'more columns'])`.

- Add a column with the values for the land use type (the original ones, not the standardised ones) to this DataFrame. 

- Then, visualise the PCA results as a Score Plot, i.e. a scatter plot with the first PC on the x-axis, and the second PC on the y-axis. Use the land use type as color for the scatter points. What do the results indicate for the land use?  
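
A possible layout for the Score Plot is sketched below. Both the scores and the land use labels are randomly generated placeholders; the category names are invented, and in the real dataset the land use may be coded differently (e.g. as numbers).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholders: random "scores" and invented land use categories.
rng = np.random.default_rng(3)
Y = rng.normal(size=(39, 2))
land_use = rng.choice(["urban", "forest", "agriculture"], size=39)

scores = pd.DataFrame(data=Y, columns=["PC1", "PC2"])
scores["land_use"] = land_use  # original (non-standardised) labels

# One scatter call per land use type, so each type gets its own colour and legend entry.
fig, ax = plt.subplots()
for name, group in scores.groupby("land_use"):
    ax.scatter(group["PC1"], group["PC2"], label=name)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend(title="Land use")
```

If points of the same land use type form separate clusters in this plot, land use is likely related to the main patterns of variation in the data; if the colours are fully mixed, it is not.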

In [None]:
# [9]




### Principal Component Analysis with sklearn

The Python package `sklearn` contains many useful functions and methods for statistical analysis and machine learning. One of them is `sklearn.decomposition.PCA()` for principal component analysis.

- Define a variable using `sklearn.decomposition.PCA()`, with the desired number of components as an input argument (`n_components=15`). Check the data type of the created variable. 

- You can then calculate the transformed dataset by calling the method `.fit_transform()` on the variable you just created. The method requires the standardised data as an input argument.
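
As a rough sketch, assuming synthetic data with five parameters (so `n_components=5` here instead of 15):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the groundwater data (39 samples, 5 parameters).
rng = np.random.default_rng(4)
data = rng.normal(size=(39, 5))
data_std = StandardScaler().fit_transform(data)

# Create the PCA object; nothing is computed yet.
pca = PCA(n_components=5)
print(type(pca))

# Fit the PCA to the standardised data and project the data in one step.
scores = pca.fit_transform(data_std)
print(scores.shape)
```

The columns of `scores` are ordered by decreasing explained variance, so the first two columns correspond to the first two PCs.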

In [None]:
# [10]


To evaluate how meaningful the PCA results are, it is important to know how much of the original variance (and thus information) is contained in the new transformed dataset.

- Calculate the explained variance by applying the attribute `.explained_variance_ratio_` to the created PCA object from above. 

- Also, calculate the sum of all variances. Check (e.g. print) the values of both the individually explained variances for each PC and the sum of all variances. How would you evaluate the results?
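
The explained variance could be inspected as in this sketch, which uses random stand-in data. The cumulative sum is an extra that the task does not ask for, included here because it shows how much variance the first *k* PCs retain together.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the standardised data (39 samples, 5 parameters).
rng = np.random.default_rng(5)
data_std = rng.normal(size=(39, 5))

pca = PCA(n_components=5).fit(data_std)

ratios = pca.explained_variance_ratio_  # share of total variance per PC
print(ratios.round(3))
print(ratios.sum())        # all components kept, so the shares add up to 1
print(np.cumsum(ratios))   # variance retained by the first k PCs
```

When all components are kept, the sum is 1 by construction. The more informative question is how large the share of the first two PCs is, since only those are shown in the Score Plot.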

In [None]:
# [11] 


- Now you can plot the PCA results from `sklearn` analogously to the Score Plot above, and compare both visualisations.

In [None]:
# [12] 



If you have done everything correctly, both plots should look identical (or mirrored, if the orientation of one of the axes is flipped). Mirroring is not an error: an eigenvector multiplied by -1 is still an eigenvector, so the sign of each PC is arbitrary.
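
The agreement between the two approaches can also be checked numerically. The sketch below uses synthetic, centred data and compares absolute values, so that a flipped sign of a PC does not count as a difference.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the standardised data, centred column-wise.
rng = np.random.default_rng(7)
data_std = rng.normal(size=(39, 4))
data_std = data_std - data_std.mean(axis=0)

# Manual PCA: eigenvectors of the covariance matrix, largest eigenvalue first.
eig_vals, eig_vecs = np.linalg.eig(np.cov(data_std.T))
order = np.argsort(eig_vals)[::-1]
Y_manual = data_std.dot(eig_vecs[:, order[:2]])

# PCA with sklearn on the same data.
Y_sklearn = PCA(n_components=2).fit_transform(data_std)

# Compare up to the sign of each component.
same_up_to_sign = np.allclose(np.abs(Y_manual), np.abs(Y_sklearn))
print(same_up_to_sign)
```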

Other ways to inspect and visualise PCA results include the Loading Plot, i.e. a bar plot showing the contribution of each parameter to the principal components. 

- Create a Loading Plot for the first two PCs (e.g. using `subplots` in `matplotlib`), with one bar per parameter in your dataset.

- Looking at this plot, you can now assess the importance of each parameter in the different PCs. How would you rate the importance of the land use? 
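
One way to build such a plot is sketched below. It assumes that the rows of `pca.components_` (the weights of each parameter in each PC) are used as bar heights; some definitions of "loadings" additionally scale these weights by the square root of the explained variance. The parameter names and the data are invented.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Invented parameter names and random stand-in data.
params = ["temp", "pH", "nitrate", "land_use"]
rng = np.random.default_rng(6)
data_std = rng.normal(size=(39, len(params)))

pca = PCA(n_components=2).fit(data_std)

# One subplot per PC, one bar per parameter.
fig, axes = plt.subplots(2, 1, sharex=True)
for k, ax in enumerate(axes):
    ax.bar(params, pca.components_[k])
    ax.axhline(0.0, color="black", linewidth=0.8)
    ax.set_ylabel(f"PC{k + 1}")
axes[-1].set_xlabel("Parameter")
```

Parameters with bars far from zero contribute strongly to that PC, while bars close to zero indicate little influence.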

In [None]:
# [13] Loading Plot



## END

### References: 

Koch et al. (2020), Groundwater fauna in an urban area: natural or affected? https://hess.copernicus.org/preprints/hess-2020-151/hess-2020-151.pdf

Lever et al. (2017) Principal component analysis, Nature Methods 14(7), 641-642, https://doi.org/10.1038/nmeth.4346 

https://towardsdatascience.com/a-complete-guide-to-principal-component-analysis-pca-in-machine-learning-664f34fc3e5a