# A - PCA in Forest Science

This series of notebooks addresses the topic of principal component analysis, PCA for short, from the point of view of forest science. They were originally developed for a short seminar in the Forest Information Technology program of the Hochschule für nachhaltige Entwicklung Eberswalde (HNEE - Eberswalde University for Sustainable Development).

## Contents and general organization

The four notebooks are designed to show the implementation of PCA. They start from the theory and math behind the method and build up to an application using meteorological data from the forest botanical garden of the University.

Notebooks:
* <font color=darkgreen>**A - PCA in Forest Science.**</font> This notebook, giving an overview of the seminar and its motivations as well as the requirements to run the code examples.
* <font color=darkgreen>**B - PCA Math, Change of Coordinates.**</font> A simple example featuring a 3D data table, to show the math behind the PCA method (linear algebra, covariance matrix) and the concept of projection on another coordinate system (3D→2D).
* <font color=darkgreen>**C - *Forstbotanischer Garten* 3D.**</font> An example similar to that in notebook B, with real meteorological data from the forest botanical garden in Eberswalde. Key points are the correlations between variables and the importance of scaling them prior to PCA.
* <font color=darkgreen>**D - *Forstbotanischer Garten* Stations.**</font> A more complex example including 6 meteorological stations located in the forest botanical garden in Eberswalde, each giving measurements of 7 variables. It uses **scree plots** and **biplots** to examine feature importance and clustering.

## Motivation, short story

Principal Component Analysis (PCA) is used to transform a set of data, i.e. a table, into another one. The information in the original variables is mixed together: each new variable is a linear combination of the original ones. The number of variables is the same, meaning that the new data table has the same dimensions as the first one. The good point is that the new variables (the principal components) are ordered according to their importance in explaining the variance of the data in the original table.

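In other words: we transform a table into another of the same size, and then we can strip away part of the second one, keeping a reduced version of the original dataset. Even better: the transformation is reversible. With all components kept, the original table is recovered exactly; after stripping some of them, it is recovered approximately.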
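The transform, strip and reverse steps described above can be sketched with scikit-learn. This is a minimal illustration, not the code used in the later notebooks: the data table `X` is synthetic and all variable names are made up for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data table: 100 observations (rows) of 3 variables (columns).
# The first two variables are strongly correlated, the third is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(100, 1)),
    2 * base + 0.1 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

# Full PCA: the new table has the same dimensions as the original one.
pca_full = PCA()
Z = pca_full.fit_transform(X)
print("original:", X.shape, "transformed:", Z.shape)

# The components are ordered by the share of variance they explain.
print("explained variance ratio:", pca_full.explained_variance_ratio_)

# With all components kept, the transformation can be reversed exactly
# (up to floating point precision).
X_back = pca_full.inverse_transform(Z)
print("exact recovery:", np.allclose(X, X_back))

# Stripping the last component gives a reduced table and an approximate recovery.
pca_reduced = PCA(n_components=2)
Z_reduced = pca_reduced.fit_transform(X)
X_approx = pca_reduced.inverse_transform(Z_reduced)
print("reduced:", Z_reduced.shape, "approximation:", X_approx.shape)
```

Because two of the three synthetic variables carry almost the same information, two components are enough to describe most of the variance in this table.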

---

Forest science, and in general all disciplines relying on complex processes and using large amounts of data, can benefit from PCA because it helps to organize large data tables and reduce their dimensions.

Dimensionality reduction and PCA in particular can be used:

* To filter noisy signals (also images!)
* To compress tables (and images)
* To look for similarities and differences (clustering)
* To build simpler models (e.g. regression models) using fewer inputs

The underlying idea is that we have **correlated variables**, and that this correlation can be interpreted as **redundancy**. This is common in biology, ecology, agronomy and forest science. Many things change together, at least partially, for example:

* Biomass production and solar radiation during the year
* Population of a pest and air temperature
* Fungal infections and humidity
* Another pest and rainfall
* Presence of two species
* Prevalence of a disease and a tree species
* More ideas?
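The link between correlation and redundancy can be made concrete with a small numpy sketch. The numbers are synthetic and only loosely inspired by the first item of the list: `radiation` and `biomass` are hypothetical, standardized daily values, and the coefficients 0.8 and 0.2 are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n_days = 365

# Hypothetical standardized daily values. Biomass production is assumed
# to follow solar radiation closely, plus some independent noise.
radiation = rng.normal(size=n_days)
biomass = 0.8 * radiation + 0.2 * rng.normal(size=n_days)

# Correlation coefficient between the two variables.
r = np.corrcoef(radiation, biomass)[0, 1]
print("correlation:", round(r, 3))

# Eigenvalues of the covariance matrix, sorted from largest to smallest.
# Their relative size is the share of variance along each principal direction.
table = np.column_stack([radiation, biomass])
eigenvalues = np.linalg.eigvalsh(np.cov(table, rowvar=False))[::-1]
variance_share = eigenvalues / eigenvalues.sum()
print("share of variance per component:", variance_share)
```

When the two variables are strongly correlated, almost all the variance falls on the first component. The second one adds very little, which is what makes it a candidate for removal.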

---

A super simple example:

We measure the temperature at 2 points close to each other in the same room, 5 times. We take the average temperature and thus reduce the dimensions of the measurement matrix from $[2 \times 5]$ to $[1 \times 5]$.

The two series are correlated after all, so little information is lost!
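The room example can be written out in a few lines of numpy. The temperature readings below are invented for illustration. The sketch also compares the simple average with the first principal component of the two series, computed from the eigenvectors of the $2 \times 2$ covariance matrix.

```python
import numpy as np

# Hypothetical readings in °C: rows are the 2 sensors, columns the 5 measurements.
T = np.array([
    [20.1, 20.6, 21.2, 21.9, 22.4],
    [20.3, 20.5, 21.4, 21.8, 22.6],
])

# Averaging the two sensors reduces the matrix from [2 x 5] to [1 x 5].
T_mean = T.mean(axis=0, keepdims=True)
print("shape before:", T.shape, "shape after:", T_mean.shape)

# The two series are strongly correlated.
r = np.corrcoef(T)[0, 1]
print("correlation between sensors:", round(r, 3))

# PCA view: center each sensor, then project on the eigenvector of the
# covariance matrix with the largest eigenvalue (eigh sorts them ascending).
T_centered = T - T.mean(axis=1, keepdims=True)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(T))
first_direction = eigenvectors[:, -1]
pc1 = first_direction @ T_centered

# Up to sign and scale, the first component carries the same signal as the average.
similarity = abs(np.corrcoef(pc1, T_mean[0])[0, 1])
print("|correlation| between PC1 and the average:", round(similarity, 3))
```

For two nearly identical sensors the first principal direction weights both series almost equally, so in this case PCA arrives at essentially the same reduction as the average.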

## What you need to get the most out of this course

#### Yourself

* Understanding of linear algebra: matrix product, transpose of a matrix, eigenvalues, eigenvectors.
* Statistics: covariance, correlation
* Graph literacy, coordinate systems
* Meteorology: microclimate
* Programming in general, *Python* in particular

#### Your computer

* Python 3.x, numpy, pandas, matplotlib, scikit-learn
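A quick way to check whether these requirements are met is to ask Python which of the packages it can find. This is only a convenience sketch using the standard library. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util
import sys

# Import names of the required packages (scikit-learn is imported as sklearn).
required = ["numpy", "pandas", "matplotlib", "sklearn"]

# find_spec returns None when a top-level package is not installed.
missing = [name for name in required if importlib.util.find_spec(name) is None]

print("Python version:", sys.version.split()[0])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

Any package reported as missing can be installed with the package manager of your Python distribution before opening the other notebooks.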

## Course materials

These notebooks, as well as the dataset used in the examples, are available in a public repository and can be downloaded at any time.

The repository is located at:
    
https://github.com/mirandal-gh/pca

## Some more resources

#### Websites, blogs

https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues

(The first answer in particular gives a very nice explanation of the PCA method)

---
   
https://towardsdatascience.com/principal-components-of-pca-bea010cc1d33

(Neat explanation of the math; projections 3D→2D)

---

https://medium.com/apprentice-journal/pca-application-in-machine-learning-4827c07a61db

(Application on input reduction for model building)

#### Documents, books

http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

(A tutorial on Principal Component Analysis by Lindsay I Smith)

https://www.elsevier.com/books/statistical-methods-in-the-atmospheric-sciences/wilks/978-0-12-385022-5

(Statistical Methods in the Atmospheric Sciences by Daniel S. Wilks)

## Contact

Luis Miranda

Doctor of Agricultural Sciences, Humboldt-Universität zu Berlin

Researcher in Data science, Artificial Intelligence and Biosystems Engineering for Agriculture and Horticulture

luis.miranda@posteo.net