# Repository workflow + THEORY

Before running the notebooks, it is useful to understand the repository as a single piece. The README file, accessible [here](https://javitausia.github.io/geocean-nz-ss/README.html), shows the project organization chart; this page focuses on how the notebooks, Python scripts and datasets are used together.

## Repository GLOBAL aspect

The repository is designed around notebooks: all the results in the project are obtained by running them, and each notebook internally imports the corresponding Python scripts. For this purpose, we wrote modular code, organized in classes and functions, that provides a statistical / numerical tool to be applied to atmospheric and oceanographic data. A summary of the global picture is shown below:

```{figure} ../media/images/repo-sketch.png
---
width: 600px
name: repo-sketch
---
Global picture of the geocean-nz-ss repository. As can be seen, everything is hosted on GitHub, and all the code can be run using Anaconda / Miniconda.
```

The **data_visualization** notebook already uses the **Loader** class, which internally relies on more basic Python libraries. This is the basic idea of the project: we build Python classes and functions on top of existing Python packages, so that the development effort is kept to a minimum.

```{note}
Jupyter notebooks were preferred because they are easy to use, even at a very basic level, and because they are very popular. Have a look at [this article](https://www.nature.com/articles/d41586-018-07196-1#:~:text=Jupyter%20is%20a%20free%2C%20open,resources%20in%20a%20single%20document.) and [this video](https://www.youtube.com/watch?v=9Q6sLbz37gk&t=1956s) for more reasons why notebooks are helpful.
```

## Main Python packages

The main pre-built Python libraries used in the project are briefly described below:

* [NumPy](https://numpy.org/): NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

* [xarray](http://xarray.pydata.org/en/stable/#): xarray (formerly xray) is an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun! Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. The package includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures. Xarray is inspired by and borrows heavily from [pandas](https://pandas.pydata.org/), the popular data analysis package focused on labelled tabular data. It is particularly tailored to working with [netCDF](https://unidata.github.io/netcdf4-python/) files, which were the source of xarray’s data model, and integrates tightly with dask for parallel computing.

* [matplotlib](https://matplotlib.org/stable/index.html#): Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. The majority of the plots are obtained using this plotting library.

* [cartopy](https://scitools.org.uk/cartopy/docs/latest/): Cartopy is a Python package designed for geospatial data processing in order to produce maps and other geospatial data analyses. Cartopy makes use of the powerful PROJ.4, NumPy and Shapely libraries and includes a programmatic interface built on top of Matplotlib for the creation of publication quality maps.

* [scikit-learn](https://scikit-learn.org/stable/index.html#): scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license. The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed.

and less common libraries such as:

* [pyextremes](https://georgebv.github.io/pyextremes/): pyextremes is a Python library aimed at performing univariate Extreme Value Analysis (EVA).

* [kats](https://facebookresearch.github.io/Kats/): Kats is a toolkit to analyze time series data, a lightweight, easy-to-use, and generalizable framework to perform time series analysis. Time series analysis is an essential component of Data Science and Engineering work in industry, from understanding the key statistics and characteristics, detecting regressions and anomalies, to forecasting future trends. Kats aims to provide a one-stop shop for time series analysis, including detection, forecasting, feature extraction/embedding, multivariate analysis, etc. Kats is released by Facebook's Infrastructure Data Science team.


## Model metrics for Regression

scikit-learn is the machine learning package where most of the algorithms and mathematics used in this project are implemented. Different techniques are used, and each of them is briefly described in its own notebook; the paragraphs below introduce the evaluation metrics used for all the regression / interpolation models. These metrics are documented [here](https://scikit-learn.org/stable/modules/model_evaluation.html).

When measuring the goodness of a fit, we can calculate several statistical metrics, such as the bias, the RMSE, or the Pearson and Spearman correlation coefficients. This part briefly describes how each of them works and how it should be interpreted:

* [Bias](https://en.wikipedia.org/wiki/Bias_(statistics)): This is the difference between the means of the real and predicted observations. It indicates how much the predictions deviate from the real values on average. With the sign convention used here (real minus predicted), a negative bias means that the model overestimates on average.

$$
\boxed{
\textrm{BIAS} = \frac{\sum_{i=1}^Ny_{i \textrm{ (real) }}}{N} - \frac{\sum_{i=1}^Ny_{i \textrm{ (pred) }}}{N}
}
$$
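
As a minimal sketch of the bias formula above, assuming NumPy arrays of real and predicted values (the function name `bias` is illustrative and is not necessarily how `sscode/validation.py` implements it):

```python
import numpy as np

def bias(y_real, y_pred):
    """Mean of the real values minus the mean of the predicted values."""
    y_real = np.asarray(y_real, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return y_real.mean() - y_pred.mean()

# Illustrative data: every prediction is 0.5 too high
y_real = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.5, 3.5, 4.5])
print(bias(y_real, y_pred))  # -0.5
```

With this sign convention, the negative value reflects that the predictions are, on average, larger than the real values.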

* [RMSE](https://en.wikipedia.org/wiki/Root-mean-square_deviation): The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed. The RMSE represents the square root of the second sample moment of the differences between predicted values and observed values or the quadratic mean of these differences. These deviations are called residuals when the calculations are performed over the data sample that was used for estimation and are called errors (or prediction errors) when computed out-of-sample. The RMSD serves to aggregate the magnitudes of the errors in predictions for various data points into a single measure of predictive power. RMSD is a measure of accuracy, to compare forecasting errors of different models for a particular dataset and not between datasets, as it is scale-dependent.

$$
\boxed{
\textrm{RMSE} = \sqrt{\frac{\sum_{i=1}^N \left(y_{i \textrm{ (real) }}-y_{i \textrm{ (pred) }}\right)^2}{N}}
}
$$
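
The RMSE formula above can be sketched in a few lines of NumPy. The function name `rmse` and the sample data are illustrative assumptions; the second call shows the scale dependence mentioned in the text.

```python
import numpy as np

def rmse(y_real, y_pred):
    """Square root of the mean squared difference between real and predicted values."""
    y_real = np.asarray(y_real, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_real - y_pred) ** 2))

# Illustrative data: three perfect predictions and one error of 2
y_real = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])
print(rmse(y_real, y_pred))              # 1.0

# Rescaling the data by 100 rescales the RMSE by the same factor
print(rmse(100 * y_real, 100 * y_pred))  # 100.0
```

Because the RMSE carries the units of the data, it is suited to comparing models on the same dataset rather than across datasets.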

* [SI](https://www.sciencedirect.com/science/article/pii/S1463500313001418): The BIAS and the RMSE both depend on the magnitude of the measurements, and they miss some important features of the fit. Here, we introduce the scatter index (SI), which measures the dispersion of the points around the bisector (the line from (0,0) to (x,x)), normalized by the magnitude of the observations.

$$
\boxed{
\textrm{SI} = \sqrt{
\frac{\sum_{i=1}^N \left[
\left(P_i-\bar{P}\right) - \left(O_i-\bar{O}\right)\right]^2}{\sum_{i=1}^N O_i^2}
}}
$$

where $P_i$ and $O_i$ are the predicted and observed values, and $\bar{P}$ and $\bar{O}$ are their respective means.
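
A minimal NumPy sketch of the scatter index as written above (the name `scatter_index` and the sample values are illustrative assumptions). It also checks two properties that follow from the formula: a constant offset does not change the SI, and rescaling both series leaves it unchanged.

```python
import numpy as np

def scatter_index(obs, pred):
    """Scatter index: centered differences normalized by the observed magnitudes."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    centered_diff = (pred - pred.mean()) - (obs - obs.mean())
    return np.sqrt(np.sum(centered_diff ** 2) / np.sum(obs ** 2))

# Illustrative observed and predicted values
obs = np.array([1.0, 2.0, 3.0, 4.0])
pred = np.array([1.2, 1.9, 3.3, 3.8])

si = scatter_index(obs, pred)
print(si)

# A constant offset is removed by subtracting the means, so the SI is unchanged
print(np.isclose(scatter_index(obs, pred + 2.0), si))

# Multiplying both series by the same factor leaves the SI unchanged
print(np.isclose(scatter_index(10 * obs, 10 * pred), si))
```

Since the means are subtracted, the SI complements the bias: the bias captures the systematic offset, while the SI captures the remaining scatter.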

* [Pearson](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) and [Spearman](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) correlation coefficients: In statistics, the Pearson correlation coefficient is a measure of linear correlation between two sets of data. It is the covariance of two variables, divided by the product of their standard deviations; thus it is essentially a normalised measurement of the covariance, such that the result always has a value between −1 and 1. As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationship or correlation. As a simple example, one would expect the age and height of a sample of teenagers from a high school to have a Pearson correlation coefficient significantly greater than 0, but less than 1 (as 1 would represent an unrealistically perfect correlation). The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while Pearson's correlation assesses linear relationships, Spearman's correlation assesses [monotonic](https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/supporting-topics/basics/linear-nonlinear-and-monotonic-relationships/) relationships (whether linear or not). If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other. The equations are given below:

$$
\boxed{ \textrm{Pearson's  } \rightarrow
\rho _{X,Y} = \frac {\operatorname {cov} (X,Y)}{\sigma _{X}\sigma _{Y}}} \simeq
\boxed{ r_{xy}={\frac {\sum _{i=1}^{N}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{{\sqrt {\sum _{i=1}^{N}(x_{i}-{\bar {x}})^{2}}}{\sqrt {\sum _{i=1}^{N}(y_{i}-{\bar {y}})^{2}}}}}}
\\\\
\boxed{ \textrm{Spearman's  } \rightarrow
\rho _{rg_X,rg_Y} = \frac {\operatorname {cov} (rg_X,rg_Y)}{\sigma _{rg_X}\sigma _{rg_Y}}}
$$

where $rg_X$ and $rg_Y$ are the [rank variables](https://en.wikipedia.org/wiki/Ranking) of $X$ and $Y$.
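
The difference between the two coefficients can be illustrated with a short NumPy sketch. The helper names (`pearson`, `ranks`, `spearman`) are illustrative, and the ranking helper assumes there are no repeated values; for data with ties, a library routine such as `scipy.stats.spearmanr` would be the safer choice.

```python
import numpy as np

def pearson(x, y):
    """Sample Pearson correlation coefficient r_xy."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

def ranks(values):
    """Ranks 1..N of the values (assumes no ties)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank variables."""
    return pearson(ranks(x), ranks(y))

# A monotonic but strongly nonlinear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(x)

print(pearson(x, y))   # clearly below 1: the relationship is not linear
print(spearman(x, y))  # 1 up to rounding: the relationship is perfectly monotonic
```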

* [$R^2$ score](https://en.wikipedia.org/wiki/Coefficient_of_determination): The coefficient of determination, denoted $R^2$ or $r^2$ and pronounced "R squared", is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model. It is used internally by scikit-learn, the main package used throughout this project, so please have a look at its documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html) for further information.

$$
\boxed{R^2=1-\frac{SS_{res}}{SS_{tot}}}
$$

where $SS_{tot}=\sum _{i}(y_{i}-{\bar {y}})^{2}$ is the total sum of squares (proportional to the variance of the data) and $SS_{res}=\sum _{i}(y_{i}-f_{i})^{2}=\sum _{i}e_{i}^2$ is the sum of squares of residuals, also called the residual sum of squares, with $f_i$ the value predicted for $y_i$.
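
The definition above translates directly into NumPy. This is a sketch under illustrative names and data; in practice `sklearn.metrics.r2_score` computes the same quantity.

```python
import numpy as np

def r2(y_real, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_real = np.asarray(y_real, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_real - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_real - y_real.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_real = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])

# SS_res = 4 and SS_tot = 5, so R^2 is approximately 0.2
print(r2(y_real, y_pred))

# Always predicting the mean of the data gives R^2 = 0
print(r2(y_real, np.full_like(y_real, y_real.mean())))
```

Note that $R^2$ can be negative when the model fits worse than simply predicting the mean.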

* [Explained variance](https://en.wikipedia.org/wiki/Explained_variation): The `explained_variance_score` function computes the explained variance regression score. If $\hat{y}$ is the estimated target output, $y$ the corresponding (correct) target output, and $Var$ is the variance (the square of the standard deviation), then the explained variance is estimated as follows:

$$
explained\_{}variance(y, \hat{y}) = 1 - \frac{Var\{ y - \hat{y}\}}{Var\{y\}}
$$

The best possible score is 1.0; lower values are worse.
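
As a sketch of the formula above (illustrative names and data, written with NumPy rather than calling `sklearn.metrics.explained_variance_score`), the example below also contrasts it with $R^2$: a constant offset in the predictions does not affect the variance of the residuals, so the explained variance stays at 1 while $R^2$ drops.

```python
import numpy as np

def explained_variance(y_real, y_pred):
    """Explained variance score: 1 - Var(y - y_hat) / Var(y)."""
    y_real = np.asarray(y_real, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 1.0 - np.var(y_real - y_pred) / np.var(y_real)

y_real = np.array([1.0, 2.0, 3.0, 4.0])
y_biased = y_real + 1.0  # perfect shape, constant offset of 1

# The residuals are constant, so their variance is zero
ev = explained_variance(y_real, y_biased)
print(ev)  # 1.0

# R^2 penalizes the offset: SS_res = 4 and SS_tot = 5, so R^2 is about 0.2
ss_res = np.sum((y_real - y_biased) ** 2)
ss_tot = np.sum((y_real - y_real.mean()) ** 2)
r2_value = 1.0 - ss_res / ss_tot
print(r2_value)
```

This is why the two scores are usually read together with the bias: they coincide only when the mean of the residuals is zero.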


and much more...


```{note}
Although all these metrics are defined in the sscode/validation.py module, several of them (for example `r2_score`, `explained_variance_score` and `mean_squared_error`) can also be imported directly from scikit-learn using the **sklearn.metrics** module.
```