## Deriving Laplace approximation, Fisher matrix

We assume that the learning process produces a model $p(y | x, \theta)$ that maps an example's feature vector $x$ to an output variable $y$. The model comes from a family parametrised by a $d$-dimensional vector $\theta \in \mathbb{R}^d$. The parameter vector may have a prior distribution $P(\theta)$, which acts as a regulariser. Training is performed on a dataset of $n$ examples, $\mathbb{D} = \{(x_i, y_i)\}_{i = 1}^n$, assumed to be independent given $\theta$, and finds the parameter vector $\hat \theta$ that maximises the unnormalised log-posterior $L(\theta)$ (the MAP estimate):
$$
\hat \theta = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log \left[ P(\mathbb{D} | \theta) \cdot P(\theta) \right] = \arg\max_{\theta} \left[ \sum_{i = 1}^n \log p(y_i|x_i, \theta) + \log P (\theta) \right]
$$
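
As an illustration of the MAP objective above, here is a minimal sketch in Python with NumPy. It assumes a model family that the text does not fix: logistic regression with labels $y_i \in \{0, 1\}$ and a zero-mean isotropic Gaussian prior with variance `tau2`. The synthetic data, the function names, and the plain gradient-ascent optimiser are all choices made for this example only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_posterior(theta, X, y, tau2):
    """Unnormalised log-posterior L(theta) = log-likelihood + log-prior (up to a constant)."""
    z = X @ theta
    # Bernoulli log-likelihood with logit z_i: y_i * z_i - log(1 + exp(z_i))
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    # Assumed prior: zero-mean isotropic Gaussian with variance tau2
    log_prior = -0.5 * (theta @ theta) / tau2
    return log_lik + log_prior

def grad_log_posterior(theta, X, y, tau2):
    """Gradient of L(theta) with respect to theta."""
    return X.T @ (y - sigmoid(X @ theta)) - theta / tau2

def find_map(X, y, tau2, lr=0.5, n_steps=2000):
    """Gradient ascent on L(theta); the step is scaled by 1/n for stability."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        theta = theta + lr * grad_log_posterior(theta, X, y, tau2) / len(y)
    return theta

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_theta = np.array([1.5, -1.0])
y = (rng.uniform(size=200) < sigmoid(X @ true_theta)).astype(float)

tau2 = 1.0
theta_hat = find_map(X, y, tau2)
print("MAP estimate:", theta_hat)
print("gradient norm at MAP:", np.linalg.norm(grad_log_posterior(theta_hat, X, y, tau2)))
```

Any other optimiser could be substituted here; the only requirement is that the gradient of $L$ is (numerically) zero at the returned $\hat \theta$.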

After finding the MAP estimate $\hat \theta$, we take a second-order Taylor expansion of the log-posterior $L(\theta)$ around the local maximum $\hat \theta$:
$$
L(\theta) \approx L(\hat \theta) + \left( \nabla_{\hat \theta} L \right)^T (\theta - \hat \theta) + \frac{1}{2}(\theta - \hat \theta)^T \mathbb{H} (\theta - \hat \theta)
$$
where $\mathbb{H}$ is the Hessian of $L$ evaluated at the point $\theta = \hat \theta$.

Since $\hat \theta$ is a local maximum, the gradient $\nabla_{\hat \theta} L$ is the zero vector, so the first-order term vanishes. Hence, we can rewrite the previous equation as:
$$
L(\theta) \approx L(\hat \theta) + \frac{1}{2}(\theta - \hat \theta)^T \mathbb{H} (\theta - \hat \theta)
$$
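
To make the Hessian concrete, consider a worked example under assumptions that the derivation itself does not impose: logistic regression with $y_i \in \{0, 1\}$, $p(y_i = 1 | x_i, \theta) = \sigma(x_i^T \theta)$ with $\sigma$ the logistic sigmoid, and a zero-mean isotropic Gaussian prior with variance $\tau^2$. Writing $\sigma_i = \sigma(x_i^T \theta)$, the log-posterior and its gradient are
$$
L(\theta) = \sum_{i = 1}^n \left[ y_i \log \sigma_i + (1 - y_i) \log (1 - \sigma_i) \right] - \frac{1}{2 \tau^2} \theta^T \theta + \text{const}, \qquad \nabla_{\theta} L = \sum_{i = 1}^n (y_i - \sigma_i) x_i - \frac{1}{\tau^2} \theta
$$
and the Hessian, with every $\sigma_i$ evaluated at $\theta = \hat \theta$, is
$$
\mathbb{H} = - \sum_{i = 1}^n \sigma_i (1 - \sigma_i) x_i x_i^T - \frac{1}{\tau^2} I_d
$$
In this particular case $-\mathbb{H}$ is positive definite for any dataset, because the prior contributes the term $\frac{1}{\tau^2} I_d$ and the data term is positive semi-definite.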

Exponentiating this quadratic form, we can approximate the posterior $p(\theta| \mathbb{D})$ by a Normal distribution centred at the mode $\hat \theta$. Indeed, the log-density of a Normal distribution is a quadratic form in $\theta$, just like the one we obtained for $\log p(\theta| \mathbb{D})$, which equals $L(\theta)$ up to an additive constant. Assuming that $-\mathbb{H}$ is positive definite (at a local maximum $\mathbb{H}$ is negative semi-definite; we assume the maximum is strict), we obtain:
$$
p(\theta | \mathbb{D}) \propto \exp \left[-\frac{1}{2}(\theta - \hat \theta)^T (-\mathbb{H}) (\theta - \hat \theta) \right] \Leftrightarrow \theta \sim \mathbb{N}(\hat \theta; \left( - \mathbb{H}\right)^{-1})
$$
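
The full-covariance approximation above can be sketched numerically as follows. This is an illustrative example rather than a reference implementation: it assumes logistic regression with a zero-mean isotropic Gaussian prior of variance `tau2`, uses synthetic data, and finds the mode with undamped Newton steps. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y, tau2):
    """Gradient of the log-posterior L for the assumed logistic model."""
    return X.T @ (y - sigmoid(X @ theta)) - theta / tau2

def hessian(theta, X, tau2):
    """Hessian of L: -X^T W X - I / tau2, with W = diag(s_i * (1 - s_i))."""
    s = sigmoid(X @ theta)
    w = s * (1.0 - s)
    return -(X.T * w) @ X - np.eye(X.shape[1]) / tau2

def laplace_approximation(X, y, tau2, n_newton=25):
    """Return the mode theta_hat, the Hessian H at the mode, and the covariance (-H)^{-1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_newton):
        # Newton step: theta <- theta - H^{-1} grad
        theta = theta - np.linalg.solve(hessian(theta, X, tau2),
                                        gradient(theta, X, y, tau2))
    H = hessian(theta, X, tau2)
    cov = np.linalg.inv(-H)
    cov = 0.5 * (cov + cov.T)  # remove round-off asymmetry
    return theta, H, cov

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (rng.uniform(size=200) < sigmoid(X @ np.array([1.5, -1.0]))).astype(float)

tau2 = 1.0
theta_hat, H, cov = laplace_approximation(X, y, tau2)

# Draw parameter vectors from the Gaussian approximation of the posterior
samples = rng.multivariate_normal(theta_hat, cov, size=1000)
print("mode:", theta_hat)
print("posterior std (Laplace):", np.sqrt(np.diag(cov)))
```

Inverting $-\mathbb{H}$ explicitly is only reasonable for small $d$, as in this toy example.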

Working with the full Hessian $\mathbb{H}$ is impractical in high-dimensional spaces, as it has $d^2$ entries. A common approach is to approximate it by its diagonal.\footnote{If $-\mathbb{H}$ is a positive definite matrix, then $-diag(\mathbb{H})$ is also positive definite, since the diagonal entries of a positive definite matrix are positive.} Denoting $\mathbb{A} = -diag(\mathbb{H})$, we obtain the diagonal Laplace approximation of the posterior:
$$
\theta \sim \mathbb{N}(\hat \theta; \mathbb{A}^{-1}) 
$$
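
A small numerical sketch of the diagonal approximation follows. The mode `theta_hat` and the Hessian `H` below are made-up values chosen for illustration, not results from any model. The sketch stores $\mathbb{A}$ as a vector of $d$ entries and compares the resulting marginal variances with those of the full covariance $(-\mathbb{H})^{-1}$.

```python
import numpy as np

# Hypothetical values for illustration only: a MAP estimate and the Hessian of L at it
theta_hat = np.array([1.2, -0.8])
H = -np.array([[4.0, 1.5],
               [1.5, 2.0]])

# A = -diag(H), kept as a length-d vector instead of a d x d matrix
a = -np.diag(H)
assert np.all(a > 0), "the diagonal of -H must be positive"

var_diag = 1.0 / a                      # variances under the diagonal approximation
var_full = np.diag(np.linalg.inv(-H))   # marginal variances under the full approximation

# Sampling needs only element-wise operations, i.e. O(d) time and memory per sample
rng = np.random.default_rng(0)
samples = theta_hat + rng.standard_normal((1000, theta_hat.size)) * np.sqrt(var_diag)

print("diagonal variances:", var_diag)
print("full marginal variances:", var_full)
```

For a positive definite matrix the diagonal of the inverse is never smaller than the inverse of the diagonal, so the diagonal approximation ignores correlations between parameters and cannot report larger marginal variances than the full approximation does.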

## Estimating Fisher Matrix
...