# Useful theory

Below is some useful theory that helps to explain the details of the project.

## Model metrics for Regression

Most of the algorithms and mathematics used in this project come from one machine learning package, scikit-learn. The different techniques are briefly described in each notebook; in the paragraphs below, we introduce the evaluation metrics used in all the regression / interpolation models. These metrics can be found [here](https://scikit-learn.org/stable/modules/model_evaluation.html).

When measuring the goodness of a fit, we can calculate several statistical metrics, such as the bias, the RMSE or the Pearson and Spearman correlation coefficients. In this part, we briefly describe how each of them works and how it should be interpreted:

* [Bias](https://en.wikipedia.org/wiki/Bias_(statistics)): This is the difference between the means of the real and the predicted observations, and should be interpreted as how much the predicted points differ from the real points on average.

$$
\boxed{
\textrm{BIAS:  } = \frac{\sum_{i=1}^Ny_{i \textrm{ (real) }}}{N} - \frac{\sum_{i=1}^Ny_{i \textrm{ (pred) }}}{N}
}
$$

* [RMSE](https://en.wikipedia.org/wiki/Root-mean-square_deviation): The root-mean-square deviation (RMSD) or root-mean-square error (RMSE) is a frequently used measure of the differences between values (sample or population values) predicted by a model or an estimator and the values observed. The RMSE represents the square root of the second sample moment of the differences between predicted values and observed values or the quadratic mean of these differences. These deviations are called residuals when the calculations are performed over the data sample that was used for estimation and are called errors (or prediction errors) when computed out-of-sample. The RMSD serves to aggregate the magnitudes of the errors in predictions for various data points into a single measure of predictive power. RMSD is a measure of accuracy, to compare forecasting errors of different models for a particular dataset and not between datasets, as it is scale-dependent.

$$
\boxed{
\textrm{RMSE:  } = \sqrt{\frac{\sum_{i=1}^N \left(y_{i \textrm{ (real) }}-y_{i \textrm{ (pred) }}\right)^2}{N}}
}
$$

* [SI](https://www.sciencedirect.com/science/article/pii/S1463500313001418): The BIAS and the RMSE both depend on the magnitude of the measurements, and they miss some important features of the fit. Here, we introduce the scatter index (SI), which measures the dispersion of the points around the 1:1 line (the bisector from (0,0) to (x,x)), relative to the magnitude of the observations.

$$
\boxed{
\textrm{SI:  } = \sqrt{
\frac{\sum_{i=1}^N \left[
\left(P_i-\bar{P}\right) - \left(O_i-\bar{O}\right)\right]^2}{\sum_{i=1}^N O_i^2}
}}
$$

where $P_i$ and $O_i$ are the predicted and observed values, and $\bar{P}$ and $\bar{O}$ are their means.
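The three formulas above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `bias`, `rmse` and `scatter_index` are hypothetical and are not necessarily the ones used in sscode/validation.py. The example uses a prediction with a constant offset, which shows that the bias and the RMSE pick up the offset while the SI, which works on demeaned values, does not.

```python
import numpy as np


def bias(y_real, y_pred):
    """Mean of the observations minus mean of the predictions."""
    return np.mean(y_real) - np.mean(y_pred)


def rmse(y_real, y_pred):
    """Root-mean-square error between observations and predictions."""
    diff = np.asarray(y_real, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean(diff ** 2))


def scatter_index(y_real, y_pred):
    """Scatter index: demeaned differences normalised by the observations."""
    o = np.asarray(y_real, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    numerator = np.sum(((p - p.mean()) - (o - o.mean())) ** 2)
    return np.sqrt(numerator / np.sum(o ** 2))


# toy data (made up for illustration): predictions with a constant offset
y_real = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = y_real + 0.5

bias_value = bias(y_real, y_pred)
rmse_value = rmse(y_real, y_pred)
si_value = scatter_index(y_real, y_pred)
print(bias_value, rmse_value, si_value)
```

With this sign convention (real minus predicted), a negative bias means that the model overestimates on average.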

* [Pearson](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) and [Spearman](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) correlation coefficients: In statistics, the Pearson correlation coefficient is a measure of linear correlation between two sets of data. It is the covariance of two variables, divided by the product of their standard deviations; thus it is essentially a normalised measurement of the covariance, such that the result always has a value between −1 and 1. As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationship or correlation. As a simple example, one would expect the age and height of a sample of teenagers from a high school to have a Pearson correlation coefficient significantly greater than 0, but less than 1 (as 1 would represent an unrealistically perfect correlation). The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while Pearson's correlation assesses linear relationships, Spearman's correlation assesses [monotonic](https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/supporting-topics/basics/linear-nonlinear-and-monotonic-relationships/) relationships (whether linear or not). If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other. The equations are given below:

$$
\boxed{ \textrm{Pearson's  } \rightarrow
\rho _{X,Y} = \frac {\operatorname {cov} (X,Y)}{\sigma _{X}\sigma _{Y}}} \simeq
\boxed{ r_{xy}={\frac {\sum _{i=1}^{N}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{{\sqrt {\sum _{i=1}^{N}(x_{i}-{\bar {x}})^{2}}}{\sqrt {\sum _{i=1}^{N}(y_{i}-{\bar {y}})^{2}}}}}}
\\\\
\boxed{ \textrm{Spearman's  } \rightarrow
\rho _{rg_X,rg_Y} = \frac {\operatorname {cov} (rg_X,rg_Y)}{\sigma _{rg_X}\sigma _{rg_Y}}}
$$

where $rg_X$ and $rg_Y$ are the [rank variables](https://en.wikipedia.org/wiki/Ranking) of $X$ and $Y$.
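The statement that the Spearman coefficient is the Pearson coefficient of the ranks can be checked with a small NumPy sketch. The helper names `pearson`, `ranks` and `spearman` are hypothetical, and the simple ranking used here assumes there are no repeated values. The data are monotonic but strongly nonlinear, so the Spearman coefficient is 1 while the Pearson coefficient stays clearly below 1.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation: covariance over the product of the std deviations."""
    return np.corrcoef(x, y)[0, 1]


def ranks(values):
    """Rank of each value (0 = smallest). Only valid without repeated values."""
    return np.argsort(np.argsort(values))


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank variables."""
    return pearson(ranks(x), ranks(y))


# monotonic but strongly nonlinear relationship (made-up data)
x = np.linspace(0.0, 2.0, 50)
y = np.exp(3.0 * x)

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(r_pearson, r_spearman)
```

For real data with possible ties, `scipy.stats.pearsonr` and `scipy.stats.spearmanr` are a more robust choice.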

* [R² score](https://en.wikipedia.org/wiki/Coefficient_of_determination): The coefficient of determination, denoted $R^2$ or $r^2$ and pronounced "R squared", is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model. It is used internally by the main package we rely on (scikit-learn), so please have a look at the Python documentation for further information [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html).

$$
\boxed{R^2=1-\frac{SS_{res}}{SS_{tot}}}
$$

where $SS_{tot}=\sum _{i}(y_{i}-{\bar {y}})^{2}$ is the total sum of squares (proportional to the variance of the data) and $SS_{res}=\sum _{i}(y_{i}-f_{i})^{2}=\sum _{i}e_{i}^2$ is the sum of squares of the residuals, also called the residual sum of squares. Here $f_i$ are the predicted values and $\bar{y}$ is the mean of the observed data.

* [Explained variance](https://en.wikipedia.org/wiki/Explained_variation): The explained_variance_score function computes the explained variance regression score. If $\hat{y}$ is the estimated target output, $y$ the corresponding (correct) target output, and $Var$ is the variance (the square of the standard deviation), then the explained variance is estimated as follows:

$$
\boxed{explained\_{}variance(y, \hat{y}) = 1 - \frac{Var\{ y - \hat{y}\}}{Var\{y\}}}
$$

The best possible score is 1.0; lower values are worse.
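The difference between $R^2$ and the explained variance is easiest to see with a biased prediction. The sketch below implements both formulas directly in NumPy (the function names `r2` and `explained_variance` are hypothetical); the results should agree with `r2_score` and `explained_variance_score` from **sklearn.metrics**. Because the explained variance only looks at the variance of the residuals, a constant offset is not penalised, whereas $R^2$ drops.

```python
import numpy as np


def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def explained_variance(y, y_hat):
    """Explained variance: 1 - Var(y - y_hat) / Var(y)."""
    return 1.0 - np.var(y - y_hat) / np.var(y)


# made-up data: the prediction is perfect except for a constant offset of 1
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = y + 1.0

r2_value = r2(y, y_hat)
ev_value = explained_variance(y, y_hat)
print(r2_value, ev_value)
```

When the mean of the residuals is zero, both scores coincide.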

* [POCID](https://reader.elsevier.com/reader/sd/pii/S0020025519300945?token=0C57BE865A188E39245C968D0BB4F249D09A7F3E490FF047DA7BD3CEDE05B1389253D080AD7B2087D791E41117E16AA8&originRegion=eu-west-1&originCreation=20210908111213): Another performance index considered is the POCID, which is formalized by the equation below. In this equation, the term $D_t$ takes the value **1** if $(\hat{y}_t - \hat{y}_{t-1})(y_t - y_{t-1}) > 0$, and **0** otherwise. The idea of this index is to estimate how accurately the direction of change of the data is predicted, i.e., whether the future value will increase or decrease compared to the current value. POCID should be used as a complement to the analysis of the prediction errors. It is not advisable to make a decision based solely on POCID values.

$$
\boxed{\textrm{POCID} = 100 \times \frac{\sum_{t=1}^{N}D_t}{N}}
$$

* [TU (test)](https://reader.elsevier.com/reader/sd/pii/S0020025519300945?token=0C57BE865A188E39245C968D0BB4F249D09A7F3E490FF047DA7BD3CEDE05B1389253D080AD7B2087D791E41117E16AA8&originRegion=eu-west-1&originCreation=20210908111213): The TU coefficient, expressed below, is based on the MSE of the predictor, normalized by the prediction error of a trivial (or naïve) model. The naïve model assumes that the best value for time t + 1 is the value obtained at time t. The values obtained can be interpreted in the following way: if TU > 1, the algorithm performs worse than the naïve model; if TU = 1, the algorithm performs the same as the naïve model; if TU < 1, the algorithm performs better than the naïve model; and if TU ≤ [0.55](https://reader.elsevier.com/reader/sd/pii/S0020025519300945?token=0C57BE865A188E39245C968D0BB4F249D09A7F3E490FF047DA7BD3CEDE05B1389253D080AD7B2087D791E41117E16AA8&originRegion=eu-west-1&originCreation=20210908111213), the algorithm of interest is trusted to carry out future predictions.

$$
\boxed{\textrm{TU} = \frac{\sum_{t=1}^{N}(y_t-\hat{y}_t)^2}{\sum_{t=1}^{N}(y_t-y_{t-1})^2}}
$$
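Both time-series indices can be sketched with `np.diff`. The function names `pocid` and `theil_u` are hypothetical, and two assumptions are made here: only the time steps that have a previous value are used, and POCID is normalised by the number of those steps. TU is computed as the ratio of the summed squared errors of the model to those of the naïve model, so feeding the naïve prediction itself should return exactly 1.

```python
import numpy as np


def pocid(y, y_hat):
    """Percentage of steps where prediction and observation move the same way."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    same_direction = (np.diff(y_hat) * np.diff(y)) > 0  # D_t for each step
    # assumption: normalise by the number of available steps
    return 100.0 * np.sum(same_direction) / same_direction.size


def theil_u(y, y_hat):
    """Squared error of the model relative to the naive model y_hat_t = y_{t-1}."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    model_error = np.sum((y[1:] - y_hat[1:]) ** 2)
    naive_error = np.sum((y[1:] - y[:-1]) ** 2)
    return model_error / naive_error


# made-up series and prediction
y = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
y_hat = np.array([1.1, 2.2, 3.5, 3.4, 4.6])

pocid_value = pocid(y, y_hat)
tu_value = theil_u(y, y_hat)

# the naive prediction repeats the previous observation
y_naive = np.concatenate(([y[0]], y[:-1]))
tu_naive = theil_u(y, y_naive)
print(pocid_value, tu_value, tu_naive)
```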

Many other metrics exist beyond the ones described here.


```{note}
Although all these metrics are defined in the sscode/validation.py module, some of them (for example r2_score and explained_variance_score) can also be imported directly from scikit-learn using the **sklearn.metrics** module
```

## GEV analysis (monthly)

As we are still trying to understand the data we will be using, let's plot some more of it. In this case, we plot the results of the Generalized Extreme Value (GEV) analysis performed over all the nodes of the Moana hindcast, for 2 different months of the year.

As a reminder, this probability density function is usually used to statistically represent the distribution of extreme events (in this case, monthly maxima), and it takes the following forms:

$$
\boxed{
f(x)=\left\{\begin{matrix}
\exp\left[ -\left( 1+\xi \frac{x^{(r)}-\mu}{\psi} \right)^{-\frac{1}{\xi}}_+ \right] \cdot \prod_{k=1}^{r}\psi^{-1}\left( 1+\xi \frac{x^{(k)}-\mu}{\psi} \right)^{-1-\frac{1}{\xi}}_+ & \textrm{for } \xi\neq 0 \\ 
\exp\left[ -\exp\left( -\frac{x^{(r)}-\mu}{\psi} \right) \right] \cdot \prod_{k=1}^{r}\psi^{-1}\exp\left( -\frac{x^{(k)}-\mu}{\psi} \right)
 & \textrm{for } \xi = 0
\end{matrix}\right.
}
$$

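where $(a)_+ = \max(0,a)$, $x^{(k)}$ is the $k$-th largest value in each block (so $r=1$ gives the usual GEV density for block maxima), and $\mu$, $\psi$ and $\xi$ are the location, scale and shape parameters of the GEV distribution. These parameters are estimated by maximum likelihood, which consists in finding the parameters $\theta=(\mu,\psi,\xi)$ that maximize the logarithm of the likelihood, $L(\theta)$, of the data: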

$$
\boxed{
\theta^{\textrm{ opt}}=\underset{\theta}{\textrm{arg max }}L(\theta)
}
$$

More information can be found [here](https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution).
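As a minimal sketch of the maximum likelihood step, the snippet below fits a GEV distribution to synthetic block maxima with `scipy.stats.genextreme`. This is not the project's `gev_matrix` function, and the sample and its "true" parameters are made up for illustration. Note that SciPy uses the opposite sign convention for the shape parameter, so its `c` corresponds to $-\xi$.

```python
import numpy as np
from scipy.stats import genextreme

# made-up "true" parameters used to generate synthetic maxima
xi_true, mu_true, psi_true = 0.1, 0.5, 0.2

# scipy's shape parameter is c = -xi
maxima = genextreme.rvs(
    c=-xi_true, loc=mu_true, scale=psi_true, size=2000, random_state=0
)

# maximum likelihood estimation of (shape, location, scale)
c_hat, mu_hat, psi_hat = genextreme.fit(maxima)
xi_hat = -c_hat  # convert back to the sign convention used in this page

print('mu = {:.3f}, psi = {:.3f}, xi = {:.3f}'.format(mu_hat, psi_hat, xi_hat))
```

With a sample this large, the estimates should land close to the values used to generate the data; with only a few dozen monthly maxima per node, the shape parameter in particular is much more uncertain.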

```python
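import numpy as np
import pandas as pd
import xarray as xr

from sscode.statistical import gev_matrix
# NOTE: load_moana_hindcast_ss is a project helper that is assumed to be
# already imported in the session from the sscode package

# load the daily hindcast once and interpolate it to a coarser grid
ss_daily = load_moana_hindcast_ss(daily=True).interp(
    lon=np.arange(160, 190, 2.5),
    lat=np.arange(-52, -30, 2.5),
)
months_in_data = pd.to_datetime(ss_daily.time.values).month

# perform the GEV analysis over the monthly maxima of the storm surge
gev_data_list = []
for month in [2, 10]:  # select the months
    # keep only the days of the selected month, then take the monthly maxima
    monthly_maxima = ss_daily.isel(
        time=np.where(months_in_data == month)[0]
    ).resample(time='1M').max()
    gev_data_list.append(
        gev_matrix(
            monthly_maxima, 'lon', 'lat', plot=True,
            gev_title='GEV parameters in month = {}'.format(month),
        )[['mu', 'phi', 'xi']].expand_dims({'month': [month]})
    )

save = False  # set to True to save the results
if save:
    xr.concat(gev_data_list, dim='month').to_netcdf(
        '../data/statistics/stats_ss_gev_moana_monthly.nc'
    )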
```