# Notes on model evaluation so far

These notes were written while I was trying to convert an `HDDM` object to an `arviz` `InferenceData`.

This process was very painful for me, because I basically relied on searching different sources on the internet. I posted questions on several forums and issue trackers, but none of them really got a reply. Fortunately, I am an old hand on the internet and can piece together information from different sources, finding hints here and there.

## Quantitative indices

### $R^2$ and Explained variance
$R^2$ is a goodness-of-fit measure that tells you how well the model we created fits the data. Put more simply, it is the proportion of variance in the outcomes that the independent variables explain.

### How to calculate $R^2$:
Suppose we observe data $y_i$ and the fitted model predicts $f_i$ for each point $i$. The mean of the observed data, $y_{mean}$, is:
$$y_{mean} = \frac{1}{n}\sum_{i}{y_i}$$

* The total sum of squares, which is proportional to the variance of the data, is:
$$ SS_{total} = \sum_{i} (y_i - y_{mean})^2$$

* The residual sum of squares (also called the error) is defined as:
$$ SS_{res} = \sum_{i} (y_i - f_{i})^2$$

* $R^2$ is then defined as:
$$ R^2 = 1 - \frac {SS_{res}}{SS_{total}}$$

### How to interpret $R^2$:
* $R^2$ is 1 for a model that fits the observed data perfectly.
* If the model always predicts $y_{mean}$, then $SS_{res} = SS_{total}$ and $R^2 = 0$. This is the baseline model against which all other models can be compared.
* Any model that performs worse than the baseline model will have a negative $R^2$ score.
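
A minimal sketch of the three cases above in plain Python. The data and the helper name `r_squared` are made up for illustration.

```python
def r_squared(y, f):
    """Coefficient of determination: 1 - SS_res / SS_total."""
    y_mean = sum(y) / len(y)
    ss_total = sum((yi - y_mean) ** 2 for yi in y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, f))
    return 1 - ss_res / ss_total


y = [1.0, 2.0, 3.0, 4.0]  # made-up observations
y_mean = sum(y) / len(y)

r2_perfect = r_squared(y, y)                    # predictions equal the data
r2_baseline = r_squared(y, [y_mean] * len(y))   # always predict the mean
r2_worse = r_squared(y, [4.0, 3.0, 2.0, 1.0])   # predictions run opposite to the data

print(r2_perfect, r2_baseline, r2_worse)
```

The last call shows that $R^2$ is not bounded below by 0: predictions that are worse than the mean give a negative value.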

### Explained variance
The term $\frac {SS_{res}}{SS_{total}}$ is also called the fraction of variance unexplained, so $R^2 = 1 - \frac {SS_{res}}{SS_{total}}$ is the fraction of variance explained.

### PPC

#### MSE
Mean squared error (MSE) is the most intuitive index. It measures the error of our model with respect to the data that the model is trying to fit:

$$MSE=\frac {1}{n} \sum_{i=1}^{n}(y_{true,i} - y_{predicted,i})^2$$
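
As a quick check of the formula, here is a small sketch with made-up numbers. In a posterior predictive check, `y_pred` could be, for example, the posterior predictive mean for each data point; that choice is my assumption, not something fixed by the formula.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]   # made-up observations
y_pred = [1.5, 2.0, 2.5, 5.0]   # made-up predictions
error = mse(y_true, y_pred)
print(error)
```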


## Cross validation
In k-fold cross-validation, we divide the data into k folds (subsets), train the model on k - 1 folds, and assess its performance on the one fold that is left out. This is repeated so that each fold serves once as the held-out fold.

If the number of folds is equal to the number of data points, we have leave-one-out cross validation. 
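
A rough sketch of the procedure, using a toy "model" that just predicts the training mean and scores itself with the MSE on the held-out fold. The function names are hypothetical, and in practice the data would usually be shuffled before splitting.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(y, k):
    """Average held-out MSE of a toy model that predicts the training mean."""
    scores = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train = [y[i] for i in range(len(y)) if i not in held_out]
        prediction = sum(train) / len(train)   # "fit" on the other k - 1 folds
        fold_mse = sum((y[i] - prediction) ** 2 for i in test_idx) / len(test_idx)
        scores.append(fold_mse)                # assess on the held-out fold
    return sum(scores) / len(scores)


y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]   # made-up data
cv_3fold = cross_validate(y, k=3)
cv_loo = cross_validate(y, k=len(y))   # k = n gives leave-one-out
print(cv_3fold, cv_loo)
```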

The ideal test is prediction on another, independent sample ('out-of-sample' prediction).

### loo-PSIS


## Information criteria
1. Log-likelihood (Log predictive density) and deviance
2. Akaike information criterion (AIC)
3. Widely applicable information criterion (WAIC)
4. Deviance information criterion (DIC)
5. Bayesian Information criterion (BIC)

For (2) to (5):
* They take the form of a two-term equation:
$$\text{metric} = \text{model fit} + \text{penalization}$$
* The model fit is measured by the log-likelihood of the data given the model parameters (either at a point estimate or over the full posterior distribution).
* Lower values imply a better fit.

AIC, BIC, and DIC use the joint probability of the data, whereas WAIC uses the pointwise probability of the data.

**Note**: AIC and BIC are regarded as non-Bayesian criteria and were excluded from *Statistical Rethinking* (2nd ed.).

### Log-likelihood and Deviance
These terms, like MSE, are used to measure the error in our model with regards to the data that the model is trying to fit.

MSE is acceptable especially when the likelihood is a normal distribution. 

#### Log-likelihood

---

over-thinking: likelihood

The likelihood expresses how plausible it is that a population with a specified set of parameters produced a given sample of observations.

$$L(\text{population} \mid x_1, x_2, x_3, ..., x_n)$$

$$L(\text{population} \mid x) = P(x \mid \text{population})$$

--- 

A more theoretically justified way to measure the performance of a model is using the log-likelihood function.

$$\text{log-likelihood} = \sum_{i=1}^{n} \log p(y_i | \theta)$$

If the likelihood is Normal with a fixed variance, the log-likelihood is a constant minus a term proportional to the MSE, so maximizing one is equivalent to minimizing the other.
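
To spell out the Normal case, assume each $y_i$ is Normal with mean $f_i$ (the model prediction) and a fixed, known standard deviation $\sigma$. Both symbols are my own notation for this sketch.

$$\sum_{i=1}^{n} \log N(y_i | f_i, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - f_i)^2 = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{n}{2\sigma^2} MSE$$

The first term does not depend on the predictions, so a higher log-likelihood corresponds to a lower MSE.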

#### Deviance

$$ Deviance = -2 \sum_{i=1}^{n} \left( \log p(y_i | \theta) - \log p(y_i | \theta_s) \right)$$

where $\log p(y_i | \theta_s)$ is the log-likelihood of the saturated model and $\theta_s$ are its parameters. A saturated model is overfitted to the point that it fits the observed data perfectly. The deviance ranges from 0 to $\infty$.

$p(y_i | \theta)$ ranges from 0 to 1 (for discrete outcomes; a continuous density can exceed 1): 0 means no fit at all, and 1 means a perfect fit. $\sum_{i=1}^{n} \log p(y_i | \theta)$ takes the log of the likelihood and thus takes values from $-\infty$ to $0$. Multiplying the log-likelihood by $-2$ results in a number that can be interpreted similarly to the MSE:
* Poorly fit models have large positive values
* A perfectly fit model has a value of 0
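
A small sketch of the deviance for count data, assuming a Poisson likelihood and the common GLM convention that the saturated model has one mean per observation, set equal to that observation. Both are my assumptions for illustration, not something these notes or `HDDM` prescribe.

```python
import math


def poisson_logpmf(y, mu):
    """log p(y | mu) for a Poisson distribution, with p(0 | mu=0) = 1."""
    if mu == 0:
        return 0.0 if y == 0 else float("-inf")
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def deviance(y, mu):
    """-2 * sum_i [log p(y_i | mu_i) - log p(y_i | saturated mean y_i)]."""
    return -2 * sum(
        poisson_logpmf(yi, mi) - poisson_logpmf(yi, yi) for yi, mi in zip(y, mu)
    )


y = [0, 1, 3, 5]            # made-up counts
mean_y = sum(y) / len(y)

dev_saturated = deviance(y, y)                   # model reproduces the data exactly
dev_mean_only = deviance(y, [mean_y] * len(y))   # one shared mean for all points
print(dev_saturated, dev_mean_only)
```

The saturated model gives a deviance of 0, and the single-mean model gives a positive value.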

***over-thinking: MLE and deviance:***
Maximum likelihood estimation (MLE): estimating the parameters $\theta$ that maximize the likelihood $\prod_{i=1}^{n} p(y_i|\theta)$ or, equivalently, the log-likelihood $\sum_{i=1}^{n} \log p(y_i|\theta)$. Under regularity conditions, the MLE is asymptotically the most efficient estimator of the distribution parameter $\theta$. A disadvantage of the MLE arises with non-regular distributions, i.e., distributions whose parameters are constrained by the observed values. For such distributions, a maximum likelihood estimate may not exist.

**Question**: How is the saturated model obtained?

### Posterior Predictive Distribution

$$ accuracy = p(y_{new} | y) =  \int p(y_{new} | \theta) p(\theta|y) \, d\theta $$
where $p(\theta|y)$ is the posterior distribution for $\theta$ and we integrate over the entire distribution of $\theta$. It is the average of the probabilities of seeing $y_{new}$, calculated over all possible values of $\theta$ and weighted by their posterior probability.

$$ accuracy = E[p(y_{new} | \theta)]$$

This has the following steps:
1. Draw a $\theta_{i}$ from the posterior distribution for $\theta$.
2. Given the value of $\theta_{i}$, compute how likely you are to see $y_{new}$, i.e., $p(y_{new} | \theta_{i})$.
3. Repeat the above two steps many times and average the results to estimate the expectation of $p(y_{new} | \theta)$.

This index is also computed on the log scale:
$$ accuracy = \log(E[p(y_{new} | \theta)])$$
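
The three steps can be sketched with a toy example. I assume a Normal likelihood with known noise `sigma` and a Normal posterior for $\theta$; all numbers are made up. For this particular case the posterior predictive density has a closed form, which gives something to compare the Monte Carlo estimate against.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma) distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


random.seed(0)
post_mean, post_sd = 1.0, 0.5   # assumed posterior for theta (the mean of the data)
sigma = 1.0                     # assumed known observation noise
y_new = 1.8

n_draws = 20000
total = 0.0
for _ in range(n_draws):
    theta_i = random.gauss(post_mean, post_sd)   # step 1: draw theta_i from the posterior
    total += normal_pdf(y_new, theta_i, sigma)   # step 2: p(y_new | theta_i)
expected_density = total / n_draws               # step 3: average over the draws
log_predictive = math.log(expected_density)

# Closed form for this toy case: y_new ~ Normal(post_mean, sqrt(post_sd^2 + sigma^2))
exact = normal_pdf(y_new, post_mean, math.sqrt(post_sd ** 2 + sigma ** 2))
print(expected_density, exact, log_predictive)
```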


### AIC (Akaike Information Criterion)

$$ AIC = -2 \sum_{i=1}^{n} \log p(y_i | \theta_{mle}) + 2n_{parameter}$$

where $n_{parameter}$ is the number of parameters in the model and $\theta_{mle}$ is the maximum likelihood estimate of $\theta$.

AIC does not use the posterior distribution, so it does not take into account any information about the uncertainty of the parameters.

### BIC (Bayesian Information Criterion)

$$ BIC = -2 \sum_{i=1}^{n} \log p(y_i | \theta_{mle}) + n_{parameter} \log(n_{sample})$$

where $n_{sample}$ is the number of observations.
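
A minimal sketch of both criteria for a Normal model fitted by maximum likelihood. The data are made up, and the model (two parameters, `mu` and `sigma`) is only an illustration.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of a Normal(mu, sigma) distribution at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


y = [4.1, 5.3, 4.8, 6.0, 5.5, 4.4, 5.1, 5.9]   # made-up sample
n = len(y)

# Maximum likelihood estimates for a Normal model
mu_mle = sum(y) / n
sigma_mle = math.sqrt(sum((yi - mu_mle) ** 2 for yi in y) / n)

log_lik = sum(normal_logpdf(yi, mu_mle, sigma_mle) for yi in y)
n_parameter = 2   # mu and sigma

aic = -2 * log_lik + 2 * n_parameter
bic = -2 * log_lik + n_parameter * math.log(n)
print(aic, bic)
```

Both criteria share the same fit term and differ only in the penalty, so BIC penalizes each parameter more heavily than AIC once $\log(n_{sample}) > 2$, i.e., from 8 observations upward.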


### DIC (Deviance Information Criterion)

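$$ DIC = -2 \sum_{i=1}^{n} \log p(y_i | \theta_{Bayes}) + 2 k_{DIC}, \quad k_{DIC} = 2 \, var_{posterior}\left(\sum_{i=1}^{n} \log p(y_i | \theta)\right)$$

**What do $\theta_{Bayes}$ and $\theta$ mean here?** In the usual definition, $\theta_{Bayes}$ is a point estimate of the parameters (typically the posterior mean), while $\theta$ in the variance term ranges over the posterior draws.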


### WAIC (Widely applicable information criterion)
The derivation of the log pointwise predictive density starts from the log-likelihood.

#### Log pointwise predictive density (lppd)
$$p_{post}(y_{new}) = \int p(y_{new} | \theta) p_{post}(\theta) d\theta$$

If we take the log of both sides:

$$\log(p_{post}(y_{new})) = \log\left(\int p(y_{new} | \theta) p_{post}(\theta) d\theta\right)$$

where $p_{post}(\theta)$ is the posterior distribution of $\theta$ obtained from the training set. This is the predictive fit for the new point. If we have a number of new data points $i = 1, 2, ..., n$, the log pointwise predictive density of a model on the new data is:

$$lppd = \log \prod_{i} p_{post}(y_{new_i}) = \sum_{i} \log \int p(y_{new_i} | \theta) p_{post}(\theta) d\theta $$

* In practice, the inner integral over $\theta$ is computed as an average over $S$ sampled values of $\theta$, denoted $\theta_{s}$:

$$\sum_{i} \log \int p(y_{new_i} | \theta) p_{post}(\theta) d\theta \approx \sum_{i} \log \frac {1} {S} \sum_{s=1}^{S}p(y_{new_{i}} | \theta_{s})$$

* Now suppose we do not have a new dataset $y_{new}$ and instead compute the $lppd$ over our training set $y_i$. That overestimates the future performance of the model, so WAIC adds a term to correct for it:

$$2 \sum_{i} Var_{s}(\log p(y_{i} | \theta_{s}))$$

* WAIC is then defined as the sum of $-2 \, lppd$ and this penalty:

$$WAIC = -2 \sum_{i} \log \frac {1} {S} \sum_{s=1}^{S}p(y_{i} | \theta_{s}) + 2 \sum_{i} Var_{s}(\log p(y_{i} | \theta_{s}))$$
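
A sketch of this computation from a matrix of pointwise log-likelihoods, written in plain Python rather than with `ArviZ`. The posterior draws are simulated stand-ins for MCMC output, and the Normal model with known noise is an assumption for illustration.

```python
import math
import random


def normal_logpdf(x, mu, sigma):
    """Log density of a Normal(mu, sigma) distribution at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def variance(values):
    """Sample variance (divides by S - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def waic(log_lik):
    """log_lik[s][i] = log p(y_i | theta_s); returns (waic, lppd, penalty)."""
    n_points = len(log_lik[0])
    columns = [[row[i] for row in log_lik] for i in range(n_points)]
    lppd = sum(log_mean_exp(col) for col in columns)   # sum_i log (1/S) sum_s p(y_i | theta_s)
    penalty = sum(variance(col) for col in columns)    # sum_i Var_s(log p(y_i | theta_s))
    return -2 * lppd + 2 * penalty, lppd, penalty


random.seed(1)
y = [0.3, -0.5, 1.2, 0.8, -0.1]                              # made-up data
mu_draws = [random.gauss(0.34, 0.447) for _ in range(2000)]  # stand-in posterior draws
log_lik = [[normal_logpdf(yi, mu, 1.0) for yi in y] for mu in mu_draws]

waic_value, lppd, penalty = waic(log_lik)
print(waic_value, lppd, penalty)
```

Averaging on the probability scale via `log_mean_exp` avoids underflow when individual likelihoods are very small.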

## Hierarchical models

In `PyMC3`, the log-likelihood of a hierarchical model has only **four** dimensions: chain, draw, obs_dim_0, obs_dim_1.

As Oriol Abril mentioned [here](https://discourse.pymc.io/t/modeling-human-response-time-data-with-an-exponential-model-model-comparison-issues/3666/9) and [here](https://discourse.pymc.io/t/calculating-waic-for-models-with-multiple-likelihood-functions/4834/5), there are different ways to compute `loo` and `waic` for hierarchical models.

Some useful resources for understanding this: https://avehtari.github.io/modelselection/rats_kcv.html

### Understand the log-likelihood of HDDM

#### First, figure out how the `DIC` is calculated

##### `DIC`'s formula as described in Wiecki et al. (2013)?

Actually, there is no formula for DIC in that paper; only a reference to Spiegelhalter et al. (2002) is given.

##### `DIC`'s common formula (Student's Guide)

$$ DIC = -2 * \widehat {elpd_{DIC}} $$
$$\widehat {elpd_{DIC}} = \log[p(y | \hat{\theta}_{Bayes})] - k_{DIC}$$
$$ k_{DIC} = 2var_{posterior}(\log[p(y | \theta)])$$


##### How `DIC` is calculated in the HDDM script:

In `HDDM`, the `DIC` information is calculated by the `dic_info()` function, which is defined @L665 of kabuki/hierarchical.py. This code uses the `DIC` function of the `MCMC` object from `pymc2`.

In `pymc2`'s source (pymc/MCMC.py), we can find `_calc_dic` @L419, where the `DIC` is calculated from two variables: `mean_deviance` and `self.deviance`.

`self.deviance` is a property of the `MCMC` model object, defined as `-2 * sum([v.get_logp() for v in self.observed_stochastics])` [@L219 of pymc/Model.py](https://github.com/pymc-devs/pymc/blob/6b1b51ddea1a74c50d9a027741252b30810b29e0/pymc/Model.py#L219).

`mean_deviance = np.mean(self.db.trace('deviance')(), axis=0)`

`dic` is `2* mean_deviance - self.deviance`
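
To make the formula concrete, here is a sketch that mimics it outside of `pymc2`. It assumes that the second term is the deviance evaluated at the posterior mean of the parameters, which is my reading of the formula rather than something established in these notes. The Normal model and the simulated draws are stand-ins.

```python
import math
import random


def normal_logpdf(x, mu, sigma):
    """Log density of a Normal(mu, sigma) distribution at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def deviance(y, mu, sigma=1.0):
    """-2 * joint log-likelihood: a single number per parameter value."""
    return -2 * sum(normal_logpdf(yi, mu, sigma) for yi in y)


random.seed(2)
y = [0.3, -0.5, 1.2, 0.8, -0.1]                              # made-up data
mu_draws = [random.gauss(0.34, 0.447) for _ in range(1500)]  # stand-in for an MCMC trace

deviance_trace = [deviance(y, mu) for mu in mu_draws]   # one value per draw
mean_deviance = sum(deviance_trace) / len(deviance_trace)

mu_bar = sum(mu_draws) / len(mu_draws)
deviance_at_mean = deviance(y, mu_bar)   # deviance at the posterior mean (assumed)

dic = 2 * mean_deviance - deviance_at_mean
p_d = mean_deviance - deviance_at_mean   # effective number of parameters
print(dic, p_d)
```

In this sketch the deviance trace has one entry per draw (1500) even though each entry sums over all data points, and `dic` equals `deviance_at_mean + 2 * p_d`.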


**Here are the unresolved issues:**

* `mean_deviance` is the mean of `tmp.mc.db.trace('deviance')()`. If the `deviance` in the trace is defined as @L842 of `pymc2/Model.py`, then it uses `_sum_deviance()`, which computes `-2 * sum([v.get_logp() for v in self.observed_stochastics])`. However, `self.observed_stochastics` has 42 elements, each a `wfpt` distribution object, whereas `tmp.mc.db.trace('deviance')` has 1500 elements, the same as the number of MCMC samples. So it seems to me that `tmp.mc.db.trace('deviance')` and the log-likelihood calculated from `self.observed_stochastics` are different.

* Is `logp` determined by `wfpt_like`?


arviz/stats/stats_utils.py#L257: the log-likelihood used to be stored in the `sample_stats` group, but that has been deprecated.

#### Second, compare `DIC` and `WAIC`

##### `WAIC` in Student's guide:

We consider each of the $n$ data points separately. For a single data point $y_i$, we can take the log of the average value of the likelihood across the posterior distribution:

$$\widehat{lpd} = \log[E_{posterior} (p(y_i | \theta))]$$

where $E_{posterior}$ is the expectation with respect to the posterior distribution. If we sum the corresponding terms for each of the $n$ points and include a bias correction term, we obtain an estimate of the expected log pointwise predictive density (*elppd*):

$$\widehat{elppd} = \sum_{i=1}^{n} \log[E_{posterior}(p(y_i | \theta))] - k_{WAIC}$$

where:

$$k_{WAIC} = \sum_{i=1}^{n} var_{posterior}(\log p(y_i | \theta))$$

$$WAIC = -2 * \widehat{elppd}$$


`log_likelihood` in `ArviZ` is the pointwise log-likelihood. Its sample dimension should match that of `posterior` (1500 in this case), and its variable should match the `observed_data` variable (42 = 14 * 3 in this case). This means that, for each of the 1500 samples, we can calculate the likelihood of each data point (14 * 3 * 1500 arrays, where the length of each array depends on the number of data points in it).

Differences between `logp` in `DIC` and `point-wise log-likelihood` (which uses `pdf` function) in `WAIC` or `loo`:

In the student's guide to Bayesian statistics, point-wise likelihood is the log-likelihood for each data point across all posterior samples (page 393).

```
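generated quantities {
    vector[N] loglikelihood;  // pointwise log-likelihood, one entry per data point
    for (i in 1:N) {
        loglikelihood[i] = normal_lpdf(X[i] | mu, sigma);
    }
}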
```
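
The same idea in Python, as a rough sketch: evaluating the log density of every data point at every posterior draw gives a draws-by-points matrix. The draws of `mu` and `sigma` below are simulated stand-ins, not real sampler output.

```python
import math
import random


def normal_logpdf(x, mu, sigma):
    """Log density of a Normal(mu, sigma) distribution at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


random.seed(3)
X = [2.1, 1.7, 2.9, 2.4]   # made-up data, N = 4
# Stand-ins for S = 1000 posterior draws of (mu, sigma); sigma is kept positive
draws = [
    (random.gauss(2.3, 0.3), abs(random.gauss(0.6, 0.1)) + 1e-6)
    for _ in range(1000)
]

# Pointwise log-likelihood: one row per posterior draw, one column per data point
pointwise = [[normal_logpdf(x, mu, sigma) for x in X] for mu, sigma in draws]

# Summing each row over the data points gives the joint log-likelihood per draw,
# which is the quantity a deviance trace is built from (deviance = -2 * joint).
joint = [sum(row) for row in pointwise]
print(len(pointwise), len(pointwise[0]), len(joint))
```

This is the distinction I care about here: `WAIC` and `loo` need the full matrix, while a `DIC`-style deviance only keeps the row sums.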


## WAIC and LOO for Hierarchical Models

https://discourse.pymc.io/t/calculating-waic-for-models-with-multiple-likelihood-functions/4834/19
https://discourse.pymc.io/t/modeling-human-response-time-data-with-an-exponential-model-model-comparison-issues/3666/9

https://avehtari.github.io/modelselection/rats_kcv.html

https://statmodeling.stat.columbia.edu/2018/08/03/loo-cross-validation-approaches-valid/