In [1]:
from myst_nb import glue

from src.commons import NUM_EXPERIMENTS

glue('num_experiments', NUM_EXPERIMENTS, display=False)

# Statistical verification of results

To assess the significance and performance of the obtained results, two statistical approaches were used. Most metrics were compared using the Bayesian estimation (BEST) method {cite}`kruschke2013bayesian`, while the remaining, more straightforward metrics were simply averaged.

## Bayesian analysis
The Bayesian approach to comparing data from multiple groups was used instead of the traditional methods of _null hypothesis significance testing_ (NHST). Bayesian methods are more intuitive than the calculation and interpretation of _p-value_ scores, provide complete information about credible parameter values, and allow more coherent inferences from data {cite}`dienes2011bayesian`.

The perils of the frequentist NHST approach when comparing machine learning classifiers were described by Benavoli et al. in {cite}`benavoli2017time`, an analysis that is particularly relevant to this work. They give the following reasons against using NHST methods:
- it does not estimate the probability of hypotheses,
- point-wise null hypotheses are practically always false,
- the p-value does not separate the effect size from the sample size,
- it ignores magnitude and uncertainty,
- it yields no information about the null hypothesis,
- there is no principled way to decide the $\alpha$ level.

Additionally, in 2016 the _[American Statistical Association](https://www.amstat.org/)_ issued a statement warning against the misuse of p-values {cite}`wasserstein2016asa`, which may motivate other disciplines to adopt the Bayesian approach.

````{margin}
```{admonition} T-distribution
The T distribution, like the normal distribution, is bell-shaped and symmetric, but it has heavier tails, which means it tends to produce values that fall far from its mean.

Tail heaviness is determined by a parameter called _degrees of freedom_ ($\nu$): smaller values give heavier tails, while higher values make the T distribution resemble a standard normal distribution with a mean of 0 and a standard deviation of 1.
```
````

In this work we focus on establishing a descriptive mathematical model of the data $D$ using Bayes' rule, given in Equation {eq}`bayesian_approach`.

```{math}
:label: bayesian_approach
\underbrace{p(\mu, \sigma, \nu|D)}_{\text{posterior}} = \underbrace{p(D|\mu, \sigma, \nu)}_{\text{likelihood}} \times \underbrace{p(\mu, \sigma, \nu)}_{\text{prior}} \big/ \underbrace{p(D)}_{\text{evidence}}
```
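
To make the roles of the four terms concrete, the sketch below evaluates the same rule numerically for a deliberately simplified case: a single unknown parameter $\mu$ on a grid, with a normal likelihood and a known spread. The observations, the prior and the helper `normal_pdf` are assumptions made only for this illustration; they are not the model used in this work.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution (helper for this sketch only)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


# Hypothetical observations and a known spread, chosen only for illustration.
data = [0.82, 0.79, 0.85, 0.81]
known_sd = 0.05

# Candidate values of mu on a regular grid over [0, 1].
grid = [i / 1000 for i in range(1001)]
step = grid[1] - grid[0]

# Prior p(mu): a broad normal centred on 0.5 (an assumption of this sketch).
prior = [normal_pdf(mu, 0.5, 0.5) for mu in grid]

# Likelihood p(D | mu): product of the densities of all observations.
likelihood = [
    math.prod(normal_pdf(x, mu, known_sd) for x in data) for mu in grid
]

# Evidence p(D): integral of likelihood * prior, approximated by a Riemann sum.
unnormalised = [lik * pri for lik, pri in zip(likelihood, prior)]
evidence = sum(unnormalised) * step

# Posterior p(mu | D): integrates to one over the grid by construction.
posterior = [u / evidence for u in unnormalised]
posterior_mean = sum(mu * p for mu, p in zip(grid, posterior)) * step
print(f"posterior mean of mu: {posterior_mean:.4f}")
```

A grid like this becomes impractical as the number of parameters grows, which is why the posterior over three parameters is instead approximated by sampling.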

Each experiment is performed {glue:}`num_experiments` times, generating independent samples, which, according to the Central Limit Theorem, should be enough to treat the results as approximately normally distributed {cite}`islam2018sample`. To make the model more robust to potential outliers, the _Student t-distribution_ is used to describe the data instead. The prior distributions of its three parameters, $\mu$ (expected mean value), $\sigma$ (standard deviation) and $\nu$ (degrees of freedom), are given in Equation {eq}`bayesian_prior`, where $\mu_D$ and $\sigma_D$ denote the mean and standard deviation of the observed data $D$.
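
The robustness argument can be illustrated by comparing how strongly each distribution penalises an observation far from the mean. The following minimal sketch implements both log-densities with the standard library; the function names and the chosen values ($\mu = 0$, $\sigma = 1$, $\nu = 3$) are assumptions made for illustration only.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log-density of a normal distribution."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def student_t_logpdf(x, mu, sigma, nu):
    """Log-density of a Student t-distribution with location mu and scale sigma."""
    z = (x - mu) / sigma
    return (
        math.lgamma((nu + 1) / 2)
        - math.lgamma(nu / 2)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sigma)
        - (nu + 1) / 2 * math.log1p(z * z / nu)
    )


# Compare both log-densities at the centre, at a typical value and at an outlier.
for x in (0.0, 2.0, 6.0):
    print(
        f"x={x:4.1f}  normal={normal_logpdf(x, 0.0, 1.0):8.3f}  "
        f"t(nu=3)={student_t_logpdf(x, 0.0, 1.0, nu=3):8.3f}"
    )
```

Because the heavy-tailed density decays polynomially rather than exponentially, a single extreme result lowers the likelihood far less under the t-distribution, so it pulls the estimated mean less than it would under a normal model.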

```{math}
:label: bayesian_prior
\begin{align}
    \mu &\sim \mathrm{N}(\mu_D, \sigma^2_D) \\
    \sigma &\sim \mathrm{U}(\frac{1}{100}, 1000) \\
    \nu &\sim \mathrm{Exp}(\frac{1}{29})
\end{align}
```
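
As a rough illustration of what these priors imply, the sketch below draws parameter combinations from them using only the Python standard library. It assumes that $\mathrm{Exp}(\frac{1}{29})$ is parameterised by its rate, so that the prior mean of $\nu$ is 29; the values passed for $\mu_D$ and $\sigma_D$ and the helper `draw_prior` are hypothetical.

```python
import random
import statistics


def draw_prior(mu_d, sigma_d, rng):
    """Draw one (mu, sigma, nu) combination from the three priors."""
    mu = rng.gauss(mu_d, sigma_d)       # mu ~ N(mu_D, sigma_D^2)
    sigma = rng.uniform(1 / 100, 1000)  # sigma ~ U(1/100, 1000)
    nu = rng.expovariate(1 / 29)        # nu ~ Exp(rate = 1/29), mean 29
    return mu, sigma, nu


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical data summary: mu_D = 0.8, sigma_D = 0.05.
draws = [draw_prior(0.8, 0.05, rng) for _ in range(20_000)]

mean_nu = statistics.fmean(nu for _, _, nu in draws)
print(f"average nu over prior draws: {mean_nu:.1f}")
```

A prior mean of about 29 degrees of freedom spreads the prior over both nearly normal and clearly heavy-tailed shapes, and the wide uniform prior on $\sigma$ leaves the scale to be determined almost entirely by the data.
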
The posterior distribution is approximated to arbitrarily high accuracy by generating a large representative sample from it using _Markov chain Monte Carlo_ (MCMC) methods. This sample provides many thousands of combinations of parameter values $\langle \mu, \sigma, \nu \rangle$. Each such combination is representative of credible parameter values that simultaneously accommodate the observed data and the prior distribution. From the MCMC sample, one can infer credible values of quantities such as the mean or the standard deviation.
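
The idea can be sketched with a plain random-walk Metropolis sampler, one of the simplest MCMC algorithms. This is an illustrative stand-in and not the sampler used in this work: the data, the step sizes, the number of draws and the function names are all assumptions, and the priors follow Equation {eq}`bayesian_prior` with $\mu_D$ and $\sigma_D$ taken as the mean and population standard deviation of the data.

```python
import math
import random
import statistics


def log_posterior(mu, sigma, nu, data, mu_d, sigma_d):
    """Unnormalised log posterior: log prior plus Student-t log likelihood."""
    if not (0.01 <= sigma <= 1000) or nu <= 0:
        return -math.inf  # outside the support of the priors
    # Normal prior on mu and exponential prior on nu (constants dropped);
    # the uniform prior on sigma is constant inside its support.
    log_prior = -0.5 * ((mu - mu_d) / sigma_d) ** 2 - nu / 29
    const = (
        math.lgamma((nu + 1) / 2)
        - math.lgamma(nu / 2)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sigma)
    )
    log_lik = sum(
        const - (nu + 1) / 2 * math.log1p(((x - mu) / sigma) ** 2 / nu)
        for x in data
    )
    return log_prior + log_lik


def metropolis(data, n_draws=20_000, burn_in=5_000, seed=0):
    """Return n_draws posterior samples of (mu, sigma, nu)."""
    rng = random.Random(seed)
    mu_d, sigma_d = statistics.fmean(data), statistics.pstdev(data)
    steps = (0.5 * sigma_d, 0.5 * sigma_d, 10.0)  # proposal widths (assumed)
    state = (mu_d, sigma_d, 29.0)
    current_lp = log_posterior(*state, data, mu_d, sigma_d)
    chain = []
    for i in range(burn_in + n_draws):
        # Symmetric Gaussian proposal around the current state.
        proposal = tuple(v + rng.gauss(0.0, s) for v, s in zip(state, steps))
        proposal_lp = log_posterior(*proposal, data, mu_d, sigma_d)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            state, current_lp = proposal, proposal_lp
        if i >= burn_in:
            chain.append(state)
    return chain


# Hypothetical results of twelve repeated runs of one experiment.
data = [9.1, 10.4, 9.8, 10.9, 10.2, 9.5, 10.0, 10.6, 9.7, 10.3, 9.9, 10.1]
chain = metropolis(data)

posterior_mu = statistics.fmean(mu for mu, _, _ in chain)
posterior_sigma = statistics.fmean(sigma for _, sigma, _ in chain)
print(f"posterior mean of mu:    {posterior_mu:.3f}")
print(f"posterior mean of sigma: {posterior_sigma:.3f}")
```

Each element of `chain` is one credible combination of the three parameters, so any summary (a mean, a standard deviation, or an interval) is computed directly from the retained draws. Production samplers are considerably more efficient, but the retained sample is used in the same way.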

To perform Bayesian estimation, this work uses the publicly accessible [PyMC3](https://docs.pymc.io/en/v3/) framework for probabilistic programming in Python.
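
A model with the priors from Equation {eq}`bayesian_prior` might be expressed in PyMC3 roughly as follows. This is a hedged sketch rather than the code used in this work: the function name `sample_best_posterior`, its arguments and the sampling settings are hypothetical, and $\sigma_D$ is assumed to be the population standard deviation of the data.

```python
def sample_best_posterior(samples, draws=2000, tune=1000, seed=0):
    """Hypothetical helper: fit the Student-t model to one group of results."""
    # Imported lazily so this module can be loaded where PyMC3 is not installed.
    import numpy as np
    import pymc3 as pm

    data = np.asarray(samples, dtype=float)
    with pm.Model():
        # Priors on the three parameters of the Student t-distribution.
        mu = pm.Normal("mu", mu=data.mean(), sigma=data.std())
        sigma = pm.Uniform("sigma", lower=1 / 100, upper=1000)
        nu = pm.Exponential("nu", lam=1 / 29)
        # Likelihood of the observed results.
        pm.StudentT("obs", nu=nu, mu=mu, sigma=sigma, observed=data)
        # Draw the MCMC sample from the posterior.
        trace = pm.sample(draws=draws, tune=tune, random_seed=seed)
    return trace
```

Assuming the default return type of PyMC3 3.x, the returned trace can be indexed by variable name, so `trace["mu"].mean()` would give the posterior mean of $\mu$ for the group.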
