# Chapter.03 Regression
---

### 3.3. Bayesian Regression
3.3.1. Overview<br>
Statistical inference is the process of extracting information about unknown variables or unknown models from available data, for example a set of observations or input-output pairs. Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability of a hypothesis as more evidence or information becomes available.<br><br>

Let $\theta$ denote the unknown parameters. The following table summarizes the differences between Bayesian statistics and classical statistics.


|                        | Bayesian statistics                                       |  Classical statistics                        |
|------------------------|-----------------------------------------------------------|----------------------------------------------|
| Property of variables  | $\theta$ : unknown, random (with known prior distribution) | $\theta$ : unknown, deterministic            |
| Goal                   | Finding the posterior distribution of $\theta$ | Finding an estimate of $\theta$ based on the likelihood |
<br>

Suppose the prior distribution $p(\theta)$ of the unknown parameters $\theta$ and the model $p(x | \theta)$ of the observation $X = (X_1, \cdots, X_n)$ are given. After observing the value $x$ of $X$, we calculate the posterior distribution $p(\theta | x)$ of $\theta$ using Bayes' rule, $p(\theta | x) = \frac{p(x | \theta) p(\theta)}{p(x)}$.

> (Bayesian inference) = (Inferring with the posterior distribution)
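
The posterior update can be checked numerically on a grid. The sketch below is a minimal illustration under assumed settings that do not come from this chapter: scalar observations $x_i \sim N(\theta, 1)$, a prior $\theta \sim N(0, 1)$, and a hypothetical true value `true_theta`. For this particular setup the posterior mean is known in closed form, $\sum_i x_i / (n + 1)$, which the grid result should approximately reproduce.

```python
import numpy as np

# Assumed toy setup (not from the text): x_i ~ N(theta, 1), prior theta ~ N(0, 1).
rng = np.random.default_rng(0)
true_theta = 1.5  # hypothetical value used only to generate data
x = rng.normal(true_theta, 1.0, size=20)

# Discretize theta and evaluate log-prior and log-likelihood on the grid
# (additive constants are dropped because they cancel after normalization).
theta_grid = np.linspace(-5.0, 5.0, 2001)
log_prior = -0.5 * theta_grid**2
log_lik = -0.5 * ((x[:, None] - theta_grid[None, :]) ** 2).sum(axis=0)

# Bayes' rule: posterior is proportional to likelihood times prior.
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())  # subtract the max for numerical stability
dtheta = theta_grid[1] - theta_grid[0]
post /= post.sum() * dtheta  # normalize so the density integrates to 1 on the grid

grid_mean = (theta_grid * post).sum() * dtheta
exact_mean = x.sum() / (len(x) + 1)  # conjugate Gaussian result for this setup
print(grid_mean, exact_mean)
```

The grid approach only scales to very low-dimensional $\theta$; it is used here because it applies Bayes' rule literally, without relying on any closed-form result.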





3.3.2. Maximum A Posteriori (MAP) estimation<br>
The MAP estimate of the parameters is
$$
\begin{align*}
\mathbf{w}_{\text{MAP}} &= \arg\max_\mathbf{w} p(\mathbf{w} | y, \mathbf{x}) \qquad \text{(Posterior)} \\
                        &= \arg\max_\mathbf{w} p(y| \mathbf{w}, \mathbf{x}) p(\mathbf{w}) \qquad \text{(Likelihood, prior)} \,\ (\because \,\ \text{Bayes rule})
\end{align*}
$$

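- The prior $p(\mathbf{w})$ adds our confidence in each value of $\mathbf{w}$ before the data are observed.
- The parameters $\mathbf{w}$ are treated as random variables and estimated probabilistically.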

An equivalent form is
$$ \mathbf{w}_{\text{MAP}} = \arg\max_\mathbf{w} \log p(\mathbf{w} | y, \mathbf{x}) $$

- MAP estimation generally leads to a nonlinear estimator.
- It requires knowledge of both the prior and the likelihood.

3.3.3. Maximum Likelihood (ML) estimation<br>
The ML estimate of the parameters is
$$ \mathbf{w}_\text{ML} = \arg \max_\mathbf{w} p(y | \mathbf{w}, \mathbf{x}) \qquad \text{(Likelihood, observation density)} $$

- ML chooses the parameters under which the observed training set has the highest probability of occurrence.
- ML can be considered a special case of MAP with a uniform (equally likely) prior.

An equivalent form is
$$ \mathbf{w}_\text{ML} = \arg \max_\mathbf{w} \log p(y | \mathbf{w}, \mathbf{x}) \quad (\because \,\ \text{logarithm function is monotonically increasing.})$$

- ML estimation generally leads to a nonlinear estimator.
- It requires knowledge of the likelihood function only.
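
The relation between the two estimators can be seen on a small grid search. The sketch below assumes a hypothetical scalar model $y = w x + \epsilon$ with known noise level `sigma` and an assumed prior precision `alpha`; none of these values come from the text. With a flat prior the MAP estimate coincides with the ML estimate, while a zero-mean Gaussian prior pulls the estimate toward zero.

```python
import numpy as np

# Hypothetical scalar model y = w * x + noise, used only for illustration.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=15)
sigma = 0.5  # assumed known noise standard deviation
y = 2.0 * x + rng.normal(0.0, sigma, size=15)

# Candidate values of w and the log-likelihood of the data for each candidate
# (constant terms that do not depend on w are dropped).
w_grid = np.linspace(-5.0, 5.0, 10001)
resid = y[:, None] - x[:, None] * w_grid[None, :]
log_lik = -0.5 * (resid**2).sum(axis=0) / sigma**2

alpha = 10.0  # assumed prior precision for w ~ N(0, 1/alpha)
log_prior_flat = np.zeros_like(w_grid)    # equally likely prior
log_prior_gauss = -0.5 * alpha * w_grid**2

w_ml = w_grid[np.argmax(log_lik)]                          # ML: likelihood only
w_map_flat = w_grid[np.argmax(log_lik + log_prior_flat)]   # MAP with flat prior
w_map_gauss = w_grid[np.argmax(log_lik + log_prior_gauss)] # MAP with Gaussian prior
print(w_ml, w_map_flat, w_map_gauss)
```

Working with log-densities is safe here because the logarithm is monotonically increasing, so the maximizer is unchanged.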

3.3.4. Bayesian linear regression with ML estimation<br>
Consider a set of training samples $\{\mathbf{x}_i, y_i\}_{i = 1}^{N}$ from a linear regression model with i.i.d. Gaussian errors:
$$ y_i = \mathbf{w}^T \mathbf{x}_i + \epsilon_i, \quad i = 1, 2, \cdots, N $$
where $\epsilon_i \sim N(0, \sigma^2)$ is i.i.d. error with $p(\epsilon_i) = \frac{1}{\sqrt{2 \pi \sigma^2 }} \exp\{ - \frac{\epsilon_i^2}{2 \sigma^2 } \}$

- $\mathbf{y} = [y_1, \cdots, y_N]_{N \times 1}^T$
- $X = [\mathbf{x}_1, \cdots, \mathbf{x}_N]_{N \times m}^T$
- $m$ : order of the model (dimension of $\mathbf{w}$)
- $N$ : number of training samples

The ML estimate for the regression model is
$$ \mathbf{w}_\text{ML} = \arg \max_\mathbf{w} p(\mathbf{y} | X, \mathbf{w}) $$

- Likelihood function of $y_i$ given $\mathbf{w}$ and $\mathbf{x}_i$ :

$$ p(y_i | \mathbf{w}, \mathbf{x}_i) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp( - \frac{(y_i - \mathbf{w}^T \mathbf{x}_i )^2}{2 \sigma^2} ), \quad i = 1, 2, \cdots, N $$

- Likelihood function of $\mathbf{y}$ given $\mathbf{w}$ and $X$ : 

$$ 
\begin{align*}
p(\mathbf{y} | \mathbf{w}, X) &= \prod_{i = 1}^{N} p(y_i | \mathbf{w}, \mathbf{x}_i) \qquad (i.i.d.) \\
                              &= \frac{1}{(\sqrt{2 \pi \sigma^2})^N} \prod_{i = 1}^{N} \exp(- \frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2 \sigma^2}) \\
                              &= \frac{1}{(\sqrt{2 \pi \sigma^2})^N} \exp(- \frac{1}{2 \sigma^2} \sum_{i = 1}^{N}(y_i - \mathbf{w}^T \mathbf{x}_i)^2 )
\end{align*}
$$

In this setting,
$$
\begin{align*}
\mathbf{w}_\text{ML} &= \arg\max_\mathbf{w} p(\mathbf{y} | X, \mathbf{w}) \\
                     &= \arg\max_\mathbf{w} \log p(\mathbf{y} | X, \mathbf{w}) \quad (\because \,\ \text{Log is monotonically increasing}) \\
                     &= \arg\min_\mathbf{w} \frac{1}{2 \sigma^2} \sum_{i = 1}^{N} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 \quad (\because \,\ \text{the remaining terms do not depend on} \,\ \mathbf{w}) \\
                     &= \arg\min_\mathbf{w} || \mathbf{y} - X \mathbf{w} ||^2 \\
                     &= (X^T X)^{-1} X^T \mathbf{y} = \mathbf{w}_\text{LS} \quad (\text{assuming} \,\ N \ge m \,\ \text{and} \,\ X \,\ \text{has full column rank}) \\
\end{align*}
$$

This shows that the ML estimate of $\mathbf{w}$ under Gaussian errors coincides with the least squares solution (CH03.02).
<br><br>
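
The equivalence with least squares can be checked numerically. The sketch below uses synthetic data with assumed values for `N`, `m`, `w_true`, and `sigma` (all hypothetical), solves the normal equations $X^T X \mathbf{w} = X^T \mathbf{y}$, and compares the result with NumPy's least squares solver.

```python
import numpy as np

# Synthetic data under the model y_i = w^T x_i + eps_i (values are assumptions).
rng = np.random.default_rng(0)
N, m = 50, 3
X = rng.normal(size=(N, m))          # rows are the inputs x_i^T
w_true = np.array([1.0, -2.0, 0.5])  # hypothetical ground-truth weights
sigma = 0.1
y = X @ w_true + rng.normal(0.0, sigma, size=N)

# ML estimate: solve the normal equations (X^T X) w = X^T y.
# np.linalg.solve avoids forming the explicit inverse (X^T X)^{-1}.
w_ml = np.linalg.solve(X.T @ X, X.T @ y)

# Reference: NumPy's least squares solver applied to the same problem.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_ml)
print(w_ls)
```

Solving the linear system instead of inverting $X^T X$ is a numerical choice only; both express the same closed-form estimator, which requires $X$ to have full column rank.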

We can also maximize the likelihood with respect to the error variance $\sigma^2$:
$$ p(\mathbf{y} | \mathbf{w}, X) = (\frac{\beta}{2 \pi})^\frac{N}{2} \exp(- \beta J(\mathbf{w}))  \qquad \text{where} \,\ \beta \triangleq \frac{1}{\sigma^2}, \,\ J(\mathbf{w}) \triangleq \frac{1}{2} \sum_{i = 1}^{N} (y_i - \mathbf{w}^T \mathbf{x}_i)^2 $$
$$ \log p(\mathbf{y} | \mathbf{w}, X) = \frac{N}{2} \log \beta - \frac{N}{2} \log 2 \pi - \beta J(\mathbf{w}) $$
$$
\begin{align*}
 \frac{\partial \log p}{\partial \beta} = \frac{N}{2 \beta} - J(\mathbf{w}) = 0 \quad \Rightarrow \quad \sigma_{\text{ML}}^2 
 &= \arg\max_{\sigma^2} \log p(\mathbf{y} | X, \mathbf{w}) \\
 &= \frac{1}{\beta} = \frac{2 J(\mathbf{w})}{N} \\
\end{align*}
$$
- The ML estimate of $\sigma^2$ equals the mean squared error (MSE) of the fit.
- The ML estimate of the standard deviation $\sigma$ equals the root mean square (RMS) error.
$$ \sigma_{\text{ML}} = \sqrt{\frac{1}{\beta}} = \sqrt{\frac{2 J(\mathbf{w})}{N}} = \epsilon_{\text{RMS}} $$
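
As a numerical check of the variance estimate, the sketch below fits $\mathbf{w}_\text{ML}$ on synthetic data (the sizes, weights, and `sigma_true` are assumptions chosen for illustration), evaluates $J(\mathbf{w}_\text{ML})$, and compares $2 J / N$ with the mean squared residual.

```python
import numpy as np

# Synthetic data; all numeric values here are illustrative assumptions.
rng = np.random.default_rng(0)
N, m = 200, 2
X = rng.normal(size=(N, m))
sigma_true = 0.3
y = X @ np.array([0.7, -1.2]) + rng.normal(0.0, sigma_true, size=N)

# ML estimate of the weights (least squares solution).
w_ml = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ w_ml

# J(w) = 1/2 * sum of squared residuals, as defined in the text.
J = 0.5 * np.sum(residuals**2)

# Setting d(log p)/d(beta) = N / (2 beta) - J = 0 gives 1 / beta = 2 J / N.
sigma2_ml = 2.0 * J / N
rms = np.sqrt(np.mean(residuals**2))

print(sigma2_ml, np.mean(residuals**2))  # variance estimate vs. MSE
print(np.sqrt(sigma2_ml), rms)           # std estimate vs. RMS error
```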

3.3.5. Bayesian linear regression with MAP estimation<br>
Consider a set of training samples $\{\mathbf{x}_i, y_i\}_{i = 1}^{N}$ from a linear regression model with i.i.d. Gaussian errors:
$$ y_i = \mathbf{w}^T \mathbf{x}_i + \epsilon_i, \quad i = 1, 2, \cdots, N $$
where $\epsilon_i \sim N(0, \sigma^2)$ is i.i.d. error with $p(\epsilon_i) = \frac{1}{\sqrt{2 \pi \sigma^2 }} \exp\{ - \frac{\epsilon_i^2}{2 \sigma^2 } \}$

- $\mathbf{y} = [y_1, \cdots, y_N]_{N \times 1}^T$
- $X = [\mathbf{x}_1, \cdots, \mathbf{x}_N]_{N \times m}^T$
- $m$ : order of the model (dimension of $\mathbf{w}$)
- $N$ : number of training samples

Let the prior distribution of the parameters be:
$$ \mathbf{w} \sim N(\boldsymbol{\mu}, \Sigma) \,\ \text{with} \,\ p(\mathbf{w}) = \frac{1}{\sqrt{|2 \pi \Sigma|}} \exp\{- \frac{1}{2} (\mathbf{w} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{w} - \boldsymbol{\mu}) \}$$

The posterior probability density is:
$$ 
\begin{align*}
p(\mathbf{w} | X, \mathbf{y}) &\propto p(\mathbf{y} | X, \mathbf{w}) p(\mathbf{w}) \\
                              &= K \cdot \exp\{ -\frac{1}{2 \sigma^2} || \mathbf{y} - X \mathbf{w} ||^2 - \frac{1}{2} (\mathbf{w} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{w} - \boldsymbol{\mu}) \} \\
\end{align*}
$$

In this setting, maximizing the posterior is equivalent to minimizing the negative of its exponent:

$$
\begin{align*}
\mathbf{w}_{\text{MAP}} &= \arg\max_{\mathbf{w}} p(\mathbf{w} | X, \mathbf{y}) \\
                        &= \arg\min_{\mathbf{w}} \frac{1}{2}[ \frac{1}{\sigma^2} || \mathbf{y} - X\mathbf{w} ||^2 + (\mathbf{w} - \boldsymbol{\mu})^T \Sigma^{-1}(\mathbf{w} - \boldsymbol{\mu}) ] \\
\end{align*}
$$

$$ \frac{\partial}{\partial \mathbf{w}} \frac{1}{2} [\frac{1}{\sigma^2} || \mathbf{y} - X\mathbf{w} ||^2 + (\mathbf{w} - \boldsymbol{\mu})^T \Sigma^{-1}(\mathbf{w} - \boldsymbol{\mu}) ] = \Sigma^{-1} (\mathbf{w} - \boldsymbol{\mu}) - \frac{1}{\sigma^2} X^T (\mathbf{y}- X \mathbf{w}) = \mathbf{0} $$

$$ \therefore \quad \mathbf{w}_{\text{MAP}} = (\Sigma^{-1} + \frac{1}{\sigma^2} X^T X)^{-1}(\Sigma^{-1} \boldsymbol{\mu} + \frac{1}{\sigma^2} X^T \mathbf{y}) $$
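
The closed form can be sketched in NumPy as follows. The helper name `map_estimate` and all numeric values (data sizes, `sigma2`, `mu`, `Sigma`) are hypothetical choices for illustration. The example also evaluates the gradient condition at the solution and shows that a very broad prior brings the MAP estimate close to the least squares solution.

```python
import numpy as np

def map_estimate(X, y, sigma2, mu, Sigma):
    """MAP weights for y = Xw + eps, eps ~ N(0, sigma2 I), prior w ~ N(mu, Sigma)."""
    Sigma_inv = np.linalg.inv(Sigma)
    A = Sigma_inv + (X.T @ X) / sigma2       # Sigma^{-1} + X^T X / sigma^2
    b = Sigma_inv @ mu + (X.T @ y) / sigma2  # Sigma^{-1} mu + X^T y / sigma^2
    return np.linalg.solve(A, b)

# Synthetic data; every number below is an assumption for illustration.
rng = np.random.default_rng(0)
N, m = 30, 2
X = rng.normal(size=(N, m))
sigma2 = 0.25
y = X @ np.array([1.0, -1.0]) + rng.normal(0.0, np.sqrt(sigma2), size=N)

mu = np.array([0.5, 0.5])      # assumed prior mean
Sigma = np.diag([1.0, 4.0])    # assumed prior covariance
w_map = map_estimate(X, y, sigma2, mu, Sigma)

# Stationarity condition: Sigma^{-1}(w - mu) - X^T (y - Xw) / sigma^2 = 0.
grad = np.linalg.inv(Sigma) @ (w_map - mu) - X.T @ (y - X @ w_map) / sigma2

# With a very broad prior the MAP estimate approaches the least squares solution.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
w_broad = map_estimate(X, y, sigma2, mu, 1e8 * np.eye(m))

print(w_map, grad)
print(w_ls, w_broad)
```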

If the components of the weight vector $\mathbf{w}$ are i.i.d. Gaussian with zero mean and equal variance, we have
$$
\begin{align*}
p(\mathbf{w}) &= \prod_{k = 1}^{m} p(w_k) = \prod_{k = 1}^{m} \sqrt{\frac{\alpha}{2 \pi}} \exp(- \frac{\alpha w_k^2}{2}) \qquad \text{where} \,\ \alpha \,\ \text{is the inverse of the variance of} \,\ w_k \\
              &= (\frac{\alpha}{2 \pi})^{\frac{m}{2}} \exp(- \frac{\alpha}{2} \sum_{k = 1}^{m} w_k^2) = (\frac{\alpha}{2 \pi})^{\frac{m}{2}} \exp(- \frac{\alpha}{2} || \mathbf{w} ||^2) \\
\end{align*}
$$
We then get

$$
\begin{align*}
\mathbf{w}_{\text{MAP}} &= \arg\max_\mathbf{w} p(\mathbf{w} | X, \mathbf{y}) \\
                        &= \arg\min_\mathbf{w} (\beta || \mathbf{y} - X \mathbf{w} ||^2 + \alpha || \mathbf{w} ||^2) \quad \text{where} \,\ \beta = \frac{1}{\sigma^2} \\
                        &= (X^T X + \frac{\alpha}{\beta} I )^{-1} X^T \mathbf{y} 
\end{align*}
$$

This result is exactly regularized least squares (CH03.02) with $\lambda = \frac{\alpha}{\beta}$.
<br><br>
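
The correspondence with regularized least squares can be verified numerically. In the sketch below, `alpha`, `sigma`, and the synthetic data are assumed values; the ridge form with $\lambda = \alpha / \beta$ is compared against the general MAP formula with $\boldsymbol{\mu} = \mathbf{0}$ and $\Sigma = \alpha^{-1} I$.

```python
import numpy as np

# Synthetic data; sizes, weights and noise level are illustrative assumptions.
rng = np.random.default_rng(0)
N, m = 40, 4
X = rng.normal(size=(N, m))
sigma = 0.5
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(0.0, sigma, size=N)

alpha = 2.0            # assumed prior precision of each weight
beta = 1.0 / sigma**2  # noise precision, beta = 1 / sigma^2
lam = alpha / beta     # equivalent regularization strength

# Regularized least squares form: (X^T X + lambda I)^{-1} X^T y.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

# General MAP formula with mu = 0 and Sigma = I / alpha:
# (alpha I + beta X^T X)^{-1} (beta X^T y).
w_general = np.linalg.solve(alpha * np.eye(m) + beta * (X.T @ X), beta * (X.T @ y))

# Unregularized least squares for comparison.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

print(w_ridge)
print(w_general)
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ls))
```

The zero-mean prior shrinks the weights, so the norm of the regularized solution is smaller than that of the plain least squares solution.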

Let the prior distribution of $\mathbf{w}$ be
$$ p(\mathbf{w}) = N(\boldsymbol{\mu}_0, \Sigma_0) $$
Posterior pdf of $\mathbf{w}$ (for $\mathbf{y} = X\mathbf{w} + \boldsymbol{\epsilon}$ where $p(\boldsymbol{\epsilon}) = N(\mathbf{0}, \beta^{-1} I)$ and $\beta = \frac{1}{\sigma^2}$):

## -Block-
[PDF File](./res/ch03/note_map.pdf)
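
For a Gaussian prior $N(\boldsymbol{\mu}_0, \Sigma_0)$ and Gaussian noise with precision $\beta$, the posterior of $\mathbf{w}$ is again Gaussian. The sketch below assumes the standard conjugate result, $\Sigma_N^{-1} = \Sigma_0^{-1} + \beta X^T X$ and $\boldsymbol{\mu}_N = \Sigma_N (\Sigma_0^{-1} \boldsymbol{\mu}_0 + \beta X^T \mathbf{y})$, which is stated here as an assumption rather than derived; its mean has the same form as $\mathbf{w}_{\text{MAP}}$ above. The function name `posterior` and all numeric values are hypothetical. As a sanity check, updating on the data in two halves gives the same posterior as a single batch update.

```python
import numpy as np

def posterior(X, y, beta, mu0, Sigma0):
    """Gaussian posterior N(muN, SigmaN) of w under prior N(mu0, Sigma0).

    Assumes y = Xw + eps with eps ~ N(0, I / beta).
    """
    Sigma0_inv = np.linalg.inv(Sigma0)
    SigmaN = np.linalg.inv(Sigma0_inv + beta * (X.T @ X))
    muN = SigmaN @ (Sigma0_inv @ mu0 + beta * (X.T @ y))
    return muN, SigmaN

# Synthetic data; every value below is an assumption for illustration.
rng = np.random.default_rng(0)
N, m = 60, 3
X = rng.normal(size=(N, m))
beta = 4.0  # noise precision, i.e. sigma^2 = 0.25
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(0.0, beta**-0.5, size=N)

mu0 = np.zeros(m)
Sigma0 = np.eye(m)

# Batch update using all samples at once.
mu_batch, Sigma_batch = posterior(X, y, beta, mu0, Sigma0)

# Sequential update: the posterior after the first half is the prior for the second.
half = N // 2
mu1, Sigma1 = posterior(X[:half], y[:half], beta, mu0, Sigma0)
mu_seq, Sigma_seq = posterior(X[half:], y[half:], beta, mu1, Sigma1)

print(mu_batch)
print(mu_seq)
```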

3.3.6. Bayesian linear regression with MAP estimation<br>
