In [1]:
import sys
sys.path.insert(0, '../zdrojaky')

import numpy as np
import matplotlib.pylab as plt
import nig
np.set_printoptions(precision=2)

# Lecture 2: Sequential estimation of linear models, prediction


Let us assume a random process $\{Y_t|X_t; t=1,2,\ldots\}$ with **independent identically distributed (iid) realizations** $y_1, y_2,\ldots$ determined by a **known observable variable** $X_t$, e.g., the **regressor**, if present. Our goal is to model this process using a suitable probabilistic **model** $f(y_t|x_t, \theta)$, where $\theta$ is an **unknown parameter**. Its reliable estimation is crucial. Furthermore, let our task be an online, i.e., **sequential**, estimation.

There are multiple ways towards online modelling, for instance:
- periodic estimation on a **data window** that aggregates all data from the initial time $t=1$ up to the present time. As the amount of data increases, so do the demands on memory and computational performance.
- periodic estimation on a **floating (sliding) data window**. This approach is one of the most popular, as it quite reasonably exploits only recent data, e.g., the last 100 measurements.
- **fully sequential estimation**, where the previously available information is only updated by the most recent data. We saw a demonstration of this in the previous lecture.

## Motivation example

Say that we measure the altitude $y_t$ of a climbing object with an *imprecise* radar instrument. We denote the speed at the moment of the first detection by $v_0$ and assume a constant acceleration $a$. The altitude at the first detection is set to $y_0 = 0$.

**The goals are:**
- online sequential **prediction of the altitude** at the next time instant, $y_{t+1}$,
- online **estimation of $a$ and $v_0$**,
- online **estimation of the noise variance**. Note that this noise is due to the terrain, weather conditions, intrinsic noise in the instrument etc.

From physics we know that the height is given by $v_0 t + \frac{1}{2}a t^2$. We add a noise variable $\varepsilon$, and the resulting model reads

$$
    y_t = v_0 t + \frac{1}{2} a t^2 + \varepsilon_t.
$$
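
To experiment with this model, one can simulate noisy measurements. The following is a minimal sketch; the values $v_0 = 5$, $a = 2$, the noise standard deviation 10, and the helper name `simulate_altitude` are illustrative assumptions, not values used in the lecture figures.

```python
import numpy as np

def simulate_altitude(n, v0=5.0, a=2.0, sigma=10.0, seed=0):
    """Simulate n noisy altitudes y_t = v0*t + 0.5*a*t^2 + eps_t (assumed values)."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, n + 1, dtype=float)          # time instants t = 1, ..., n
    y_true = v0 * t + 0.5 * a * t**2              # noise-free altitude
    y = y_true + rng.normal(0.0, sigma, size=n)   # add iid normal noise
    return t, y_true, y

t, y_true, y = simulate_altitude(80)
print(y[:5])
```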

Let us look at the evolution of the first measurements and the online predictions. The blue line depicts true measurements, crosses stand for predictions (red are current, green are past).
![Altitude and prediction](img/l2-regrese-anim.gif)

And for the whole data set after 3, 10, 20, 30, 50 and 80 measurements:
![Altitude and prediction](img/l2-predikce.jpg)


Let us now briefly dive into the theory and then return to this example...

---

## Bayesian estimation

Denote by $y_t$ the random variable observed at discrete time instants $t=0, 1,2,\ldots$, and by $y_{0:t-1} = [y_0, y_1, \ldots, y_{t-1}]$. Assume that the $y_t$ are iid. Furthermore, let $y_t$ be determined by a known observed variable $x_t$ (e.g., the regressor), and by a constant parameter $\theta$. We denote $x_{0:t-1} = [x_0, x_1, \ldots, x_{t-1}]$. 

Now assume that there is a prior distribution (density) $\pi(\theta|x_0, y_0)$ expressing our prior knowledge. $x_0$ and $y_0$ can be seen as pseudo-data.

> **Bayes' theorem**
>
> Let $f(y_t|x_t, \theta)$ be a pdf of $y_t|x_t,\theta$. Let $\pi(\theta|y_{0:t-1}, x_{0:t-1})$ be the prior density for $\theta$. The one-step Bayesian update yields a posterior density of the form
>
$$
\begin{aligned}
\pi(\theta|y_{0:t}, x_{0:t}) 
&= 
\frac
{f(y_t|x_t, \theta) \pi(\theta|x_{0:t-1}, y_{0:t-1})}
{\int f(y_t|x_t, \theta) \pi(\theta|x_{0:t-1}, y_{0:t-1})d\theta} \\
&=
\frac
{f(y_t|x_t, \theta) \pi(\theta|x_{0:t-1}, y_{0:t-1})}
{f(y_t|x_t)} \\
&\propto
f(y_t|x_t, \theta) \pi(\theta|x_{0:t-1}, y_{0:t-1}).
\end{aligned}
$$

Recall that the last row expresses the Bayesian update in terms of proportionality, i.e., without the normalizing density in the denominator. (How to calculate it?)


Now we are interested in the sequential Bayesian update. It is as follows:

$$
\begin{aligned}
\pi(\theta|y_{0:1}, x_{0:1})
&\propto
f(y_1|x_1, \theta) \pi(\theta|x_{0}, y_{0}) \\
\pi(\theta|y_{0:2}, x_{0:2})
&\propto
f(y_2|x_2, \theta) \pi(\theta|x_{0:1}, y_{0:1}) \\
&\propto \pi(\theta|x_{0}, y_{0}) f(y_1|x_1, \theta) f(y_2|x_2, \theta) \\
&\vdots \\
\pi(\theta|y_{0:t}, x_{0:t})
&\propto
\pi(\theta|y_{0}, x_{0})
\prod_{\tau = 1}^{t}
f(y_{\tau}|x_{\tau}, \theta) \\
&\propto
\pi(\theta|y_{0:\tau-1}, x_{0:\tau-1})
\prod_{\widetilde{\tau} = \tau}^{t}
f(y_{\widetilde\tau}|x_{\widetilde\tau}, \theta),
\end{aligned}
$$

i.e., we may either multiply the models and then update the initial prior in one step, or equivalently update at each time step by the most recent data. **Conclusion: the sequential one-by-one update is equivalent to the update by a batch of data.**
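
This equivalence can be checked numerically on a parameter grid. The sketch below is only an illustration under assumptions: it uses a Bernoulli model without a regressor, a flat prior, and made-up observations. The normalization integral is approximated by a sum over the grid.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)       # grid over the parameter
prior = np.ones_like(theta) / theta.size     # flat prior (assumption)
data = np.array([1, 0, 1, 1, 0, 1])          # made-up Bernoulli observations

def likelihood(y, theta):
    """Bernoulli model f(y|theta) evaluated on the whole grid."""
    return theta**y * (1.0 - theta)**(1 - y)

# Sequential update: the posterior becomes the prior of the next step.
post_seq = prior.copy()
for y in data:
    post_seq = post_seq * likelihood(y, theta)
    post_seq /= post_seq.sum()               # normalize (sum approximates the integral)

# Batch update: multiply all models first, then update the initial prior once.
post_batch = prior * np.prod([likelihood(y, theta) for y in data], axis=0)
post_batch /= post_batch.sum()

print(np.allclose(post_seq, post_batch))
```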


**Think abouts...**
- How to calculate the normalization term in the Bayes' theorem?
- Think about the posterior distribution - which properties does it have? For instance, if we take any model and any prior, can we make direct conclusions about the mean value?

## Sequential estimation

We saw that the sequential Bayesian update is - theoretically - a simple task. We use the prior distribution, update it by new data, obtain the posterior distribution, which we reuse as the prior for the next time step:

$$
\pi(\theta|x_{0}, y_{0}) \xrightarrow[\text{Bayes}]{x_1, y_1}
\pi(\theta|x_{0:1}, y_{0:1}) \xrightarrow[\text{Bayes}]{x_2, y_2}
\pi(\theta|x_{0:2}, y_{0:2}) \rightarrow
\cdots \xrightarrow[\text{Bayes}]{x_t, y_t}
\pi(\theta|x_{0:t}, y_{0:t}) \rightarrow
\cdots
$$

Furthermore, recall that the following may serve as a point estimate of $\theta$:
- the expected value $\mathbb{E}[\theta]$, more precisely $\mathbb{E}[\theta|x_{0:t}, y_{0:t}]$, or 
- the mode, i.e., the maximum a posteriori (MAP) estimate, or 
- the median.

The uncertainty in the Bayesian estimate is represented mostly by the variance of the prior/posterior distribution, $\text{var}\theta$.

**Although the general equations look quite simple, the most fundamental problem of Bayesian modelling is the derivation of the posterior distribution and its properties. This is particularly true for sequential modelling. If the posterior distribution after the first update is not a "standard" distribution, each subsequent update makes it even more complicated.**

**Fortunately, there are cases where Bayes' theorem yields an analytically tractable (and "standard") distribution. For this purpose, let us introduce the exponential family of distributions and conjugate prior distributions.**

> **Definition (exponential family of distributions, EF)**
>
> Assume a random variable $y$ conditioned by $x$ and a parameter $\theta$. The EF is a **class of distributions** with pdfs of the form
>
> $$
f(y|x, \theta) = h(y, x) g(\theta) \exp \left[ \eta^{\intercal} T(y,x) \right],
$$
>
> where $\eta \equiv \eta(\theta)$ is the **natural parameter**, $T(y,x)$ is the **sufficient statistic**, $h(y,x)$ is a **known function**, and $g(\theta)$ is a **normalization function**. If $\eta(\theta)=\theta$ the class is canonical.

> **Definition (conjugate prior distribution)**
>
> Let $y|x, \theta$ have a distribution from the EF. We say that the **prior distribution** for $\theta$ with **hyperparameters** $\xi$ and $\nu$ is **conjugate to the model**, if its pdf has the form
>
>$$
        \pi(\theta) = q(\xi, \nu) g(\theta)^{\nu} \exp \left[ \eta^{\intercal} \xi \right],
$$
>
>where $\xi$ has the same size as $T(y,x)$, $\nu\in\mathbb{R}^{+}$, and $q(\xi,\nu)$ is a known function. The function $g(\theta)$ is the same as in the definition of EF for the model of $y|x, \theta$.

Recall that hyperparameters are parameters of the prior; the name avoids confusion with the model parameter $\theta$. If the prior has its own prior, its hyperparameters are sometimes called hyper-hyperparameters, but it's a bit...strange :)

### Examples of conjugate priors

Although we did not rewrite the binomial distribution into the EF form and the beta distribution into the compatible conjugate form, we saw that the conjugacy was fulfilled. Of course, under conjugacy, it is possible to evaluate posteriors without any rewriting, but...we will see shortly why Kamil likes it :)

| Model | Use | Conjugate prior |
|:---|:---:|:---|
|Normal with known variance | Tracking, physics... :-) | Normal |
|Normal with unknown variance | Everywhere :-) | Normal inverse-gamma |
|Bernoulli | Success-Failure (coin, reliability) | Beta |
|Binomial |  Success-Failure (coin, reliability) | Beta |
|Poisson | Traffic, particle physics | Gamma |
|Multinomial | Classification | Dirichlet |

See [wikipedia](https://en.wikipedia.org/wiki/Conjugate_prior).

## Bayesian estimation with conjugate prior
If we assume a conjugate prior with hyperparameters $\xi_{t-1}$ and $\nu_{t-1}$, the Bayes update

$$
\pi(\theta|y_{0:t}, x_{0:t}) 
\propto
f(y_t|x_t, \theta) \pi(\theta|x_{0:t-1}, y_{0:t-1})
$$

is a trivial summation

$$
\begin{aligned}
    \xi_{t} &= \xi_{t-1} + T(y_{t},x_{t}), \\
    \nu_{t} &= \nu_{t-1} + 1.
\end{aligned}
$$

---

With multiple data,

$$
\pi(\theta|y_{0:t}, x_{0:t})
\propto
\pi(\theta|y_{0:\tau-1}, x_{0:\tau-1})
\prod_{\widetilde{\tau} = \tau}^{t}
f(y_{\widetilde\tau}|x_{\widetilde\tau}, \theta),
$$

it is again a summation!

$$
\begin{aligned}
    \xi_{t} &= \xi_{\tau-1} + \sum_{\widetilde{\tau}=\tau}^{t} T(y_{\widetilde{\tau}},x_{\widetilde{\tau}}),\\
    \nu_{t} &= \nu_{\tau-1} + t - \tau+1.
\end{aligned}
$$

**Conclusion: Under conjugacy, the Bayes' theorem is simply a sum of the sufficient statistics and the hyperparameter $\xi_{t-1}$ and an incrementation of $\nu_{t-1}$.**
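
As a rough illustration, the two update rules can be written in a few lines. The function names `update` and `batch_update`, the initial hyperparameters, and the sufficient statistics below are hypothetical choices made only for this sketch.

```python
import numpy as np

def update(xi, nu, T):
    """One-step conjugate update by a single sufficient statistic T."""
    return xi + T, nu + 1

def batch_update(xi, nu, T_batch):
    """Batch update; T_batch holds one sufficient statistic per row."""
    T_batch = np.asarray(T_batch, dtype=float)
    return xi + T_batch.sum(axis=0), nu + T_batch.shape[0]

xi0, nu0 = np.ones(2), 1.0                     # assumed prior hyperparameters
T_batch = np.array([[1.0, 0.0],                # made-up sufficient statistics
                    [0.0, 1.0],
                    [1.0, 0.0]])

# Sequential one-by-one updates ...
xi, nu = xi0, nu0
for T in T_batch:
    xi, nu = update(xi, nu, T)

# ... give the same hyperparameters as a single batch update.
xi_b, nu_b = batch_update(xi0, nu0, T_batch)
print(xi, nu, xi_b, nu_b)
```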

### Example: Ping
The availability of an internet server is tested by a *ping* message (ECHO REQUEST). Say that we send it every 500 ms and expect the reply (ECHO REPLY) to arrive within 50 ms. We describe the server availability by the probability $p \in [0, 1]$ of receiving a reply in time, an event we denote by $X=1$ ("success").

#### Model
Since the modelled variable $X\in\{0, 1\}$ is binary and has a probability $p$, a suitable distribution is the Bernoulli distribution with a pmf
$$
\begin{aligned}
f(x_t|p) &= p^{x_t} (1-p)^{1-x_t} \\
&= \exp\{ \ln [p^{x_t} \cdot (1-p)^{1-x_t}] \} \\
&= \exp\{x_t \ln p + (1-x_t) \ln(1-p)\} \\
&= \exp 
\left\{
\begin{bmatrix}
\ln p \\
\ln (1-p)
\end{bmatrix}^\intercal
\begin{bmatrix}
x_t \\
1-x_t
\end{bmatrix}
\right\}
\end{aligned}
$$

hence $h(x) = 1$ and $g(\theta)=g(p) = 1$. Note that the EF form is neither unique nor nice here, but it is practical. We will see why.

#### Prior for $p$
We also know that the probability $p$ can be modelled by the beta distribution with hyperparameters $a_{t-1}, b_{t-1}>0$. Its pdf is

$$
\begin{aligned}
\pi(p|a_{t-1}, b_{t-1})
&= \frac{1}{B(a_{t-1}, b_{t-1})} p^{a_{t-1}-1} (1-p)^{b_{t-1}-1} \\
&= \frac{1}{B(a_{t-1}, b_{t-1})} 
\exp 
\left\{
\begin{bmatrix}
\ln p \\
\ln (1-p)
\end{bmatrix}^\intercal
\begin{bmatrix}
a_{t-1} - 1 \\
b_{t-1} - 1
\end{bmatrix}
\right\}
\end{aligned}.
$$

Thus we see that the prior is conjugate with $q(\cdot) = 1/B(\cdot)$. The hyperparameter $\nu_{t-1}$ is the exponent of $g(\theta) = 1$ (thus we may ignore it), and the vector $\xi_{t-1}$ is the second vector in the exponent.

#### Sequential Bayesian update
$$
\xi_t = \xi_{t-1} + T(x_t) \qquad \Rightarrow \qquad a_t = a_{t-1} + x_t, \qquad b_t = b_{t-1} + (1-x_t).
$$

#### Posterior estimates
$$
\hat{p} = \mathbb{E}[p|\cdot] = \frac{a_t}{a_t + b_t}, \qquad \operatorname{var}(p|\cdot) = \frac{a_t b_t}{(a_t + b_t)^2 (a_t + b_t + 1)}.
$$
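
The whole ping example fits into a few lines of code. This is a minimal sketch: the uniform prior $a_0 = b_0 = 1$ and the sequence of replies are made-up assumptions, and the function names are hypothetical.

```python
def beta_update(a, b, x):
    """Sequential update of the beta hyperparameters by one observation x in {0, 1}."""
    return a + x, b + (1 - x)

def beta_mean_var(a, b):
    """Posterior mean and variance of p under the beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

a, b = 1.0, 1.0                       # uniform prior on p (assumption)
replies = [1, 1, 0, 1, 1, 1, 0, 1]    # made-up data: 1 = reply in time, 0 = no reply

for x in replies:
    a, b = beta_update(a, b, x)
    mean, var = beta_mean_var(a, b)
    print(f"x={x}  a={a:.0f}  b={b:.0f}  mean={mean:.3f}  var={var:.4f}")
```

The posterior variance shrinks as more replies are observed, while the mean moves towards the observed success rate.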

## Linear regression model

We should already know the linear regression model

$$
y_t = \beta^\intercal x_t + \varepsilon_t, \qquad t=1,2,\ldots,
$$

where $y_t$ is an observed scalar variable, $x_t$ is a known regression vector, and $\beta$ is a vector of regression coefficients of the same size. $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$ is an iid noise. 
Depending on the form of the regressor, we obtain several model types, e.g., a straight line, a quadratic function, a higher-order polynomial etc.

From statistics we should know that the maximum likelihood estimate is $\hat{\beta} = (X^\intercal X)^{-1} X^\intercal y$, where $X$ is a matrix whose rows are the regressors $x_t^\intercal$, and $y$ is a column vector of the $y_t$s.
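
For comparison with the Bayesian approach, the batch estimate can be computed directly with NumPy. The data are simulated under assumed values ($v_0 = 5$, $a = 2$, noise standard deviation 10) with the regressor of the altitude example; this is a sketch only.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(1, 51, dtype=float)
X = np.column_stack([t, 0.5 * t**2])           # rows are the regressors x_t
beta_true = np.array([5.0, 2.0])               # assumed [v0, a]
y = X @ beta_true + rng.normal(0.0, 10.0, size=t.size)

# Normal equations (X^T X) beta = X^T y, solved without an explicit inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same estimate from the least-squares routine, numerically more robust.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal, beta_lstsq)
```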

![Regression](img/l2-linmodely.jpg) 

## Bayesian linear regression

### Model
Assume an observed $y_t$ determined by the regressor $x_t\in\mathbb{R}^p$ and a vector of regression coefficients $\beta\in\mathbb{R}^p$,

$$
\begin{aligned}
y_t &= \beta^\intercal x_t + \varepsilon_t, \\
\varepsilon_t &\sim \mathcal{N}(0, \sigma^2) \qquad \text{iid}.
\end{aligned}
$$

---

> Recall our altitude tracking example:
> ![Regression](img/l2-sigmapas.jpg)
> We already know that the corresponding model is
$$
y_t = v_0 t + \frac{1}{2} a t^2 + \varepsilon_t =
\underbrace{
\begin{bmatrix}
v_0 \\
a
\end{bmatrix}^\intercal
}_{\beta^\intercal}
\underbrace{
\begin{bmatrix}
t \\
\frac{1}{2}t^2
\end{bmatrix}
}_{x_t}
+ \varepsilon_t
$$

> **Think abouts...**
- Modify the model for nonzero initial altitude.
- How to predict the altitude for a preset time (say $t=100$) if we know $a$ and $v_0$?
- That is: what do we need to know in order to make predictions?

---

**Task: estimation of unknown constant $\beta$ and $\sigma^2$. And naturally prediction for given $x'$.**

Since the measurement noise is normal, the model is normal too, $y_t\sim\mathcal{N}(\beta^\intercal x_t, \sigma^2)$. Its pdf is

$$
\begin{aligned}
    f(y_{t}|x_{t}, \beta, \sigma^{2}) 
    &= \frac{(\sigma^{2})^{-\frac{1}{2}}}{\sqrt{2\pi}}
       \exp
       \left\{ 
           -\frac{1}{2\sigma^{2}} (y_{t} - \beta^{\intercal}x_{t})^{2} 
       \right\} \notag \\
    &= \frac{(\sigma^{2})^{-\frac{1}{2}}}{\sqrt{2\pi}}
       \exp
       \Bigg\{ 
           \text{Tr}
           \bigg( 
               \underbrace{
                   -\frac{1}{2\sigma^{2}}
                   \begin{bmatrix}
                       1 \\ -\beta
                   \end{bmatrix}
                   \begin{bmatrix}
                       1 \\ -\beta
                   \end{bmatrix}^{\intercal}
               }_{\eta}
               \underbrace{
                   \begin{bmatrix}
                       y_{t} \\ x_{t}
                   \end{bmatrix}
                   \begin{bmatrix}
                       y_{t} \\ x_{t}
                   \end{bmatrix}^{\intercal}
               }_{T(y_{t}, x_{t})}    
           \bigg)
       \Bigg\}.
\end{aligned}
$$

Now we need a convenient prior, preferably conjugate.

> *Remark: The only "trick" above is the matrix trace Tr. It is defined as the sum of diagonal elements, e.g., the 3x3 identity matrix has Tr(I) = 1 + 1 + 1 = 3. The trace has very appealing properties. E.g., for three compatible matrices A, B, and C it holds $Tr(ABC) = Tr(CAB) = Tr(BCA)$. From this it follows that:*
$$
\begin{aligned}
x &= [a, b], \\
y &= [c, d]^{\intercal}, \\
x\cdot y &= ac + bd = Tr(x\cdot y) \qquad\text{(trace of a scalar is the same scalar)} \\
Tr(y\cdot x) &= Tr([c, d]^{\intercal} [a, b]) \\ 
&= Tr
\begin{bmatrix}
ac & bc \\
ad & bd
\end{bmatrix}
= ac + bd. \qquad\text{(the trace is invariant under cyclic permutation of the arguments)}
\end{aligned}
$$
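
Both the cyclic property and the trace form of the squared residual are easy to verify numerically. The matrices and the values of $\beta$, $x_t$, and $y_t$ below are arbitrary numbers chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Cyclic property for three compatible (non-square) matrices.
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 2))
traces = [np.trace(A @ B @ C), np.trace(C @ A @ B), np.trace(B @ C @ A)]

# Squared residual written as Tr(u u^T z z^T), i.e., up to the factor -1/(2 sigma^2).
beta = np.array([5.0, 2.0])
x = np.array([3.0, 4.5])
y = 20.0
u = np.concatenate(([1.0], -beta))     # vector [1, -beta]
z = np.concatenate(([y], x))           # vector [y_t, x_t]
residual_sq = (y - beta @ x) ** 2
trace_form = np.trace(np.outer(u, u) @ np.outer(z, z))

print(traces, residual_sq, trace_form)
```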

### Prior distribution
Since we know neither the regression coefficients $\beta$ nor the noise variance $\sigma^2$, we aim to estimate both. It is known that a convenient prior distribution $\pi(\beta, \sigma^2)$ is the **normal inverse-gamma** distribution

$$
\beta, \sigma^{2} 
\sim \mathcal{N}i\mathcal{G}(m_{t-1}, V_{t-1}, a_{t-1}, b_{t-1})
= \underbrace{\mathcal{N}(m_{t-1}, \sigma^{2} V_{t-1})}_{\pi(\beta|\sigma^2)} 
\times 
\underbrace{i\mathcal{G}(a_{t-1}, b_{t-1})}_{\pi(\sigma^2)},
$$

with real hyperparameters $a_{t-1}>0$ and $b_{t-1}>0$, a vector of mean values $m_{t-1}\in\mathbb{R}^{p}$, and a scale matrix $V_{t-1}$ of a corresponding size. The figure depicts examples of the marginal normal and inverse-gamma distributions. Naturally, the NiG distribution is more complicated.
![N x iG](img/l2-apriorno-nig.jpg)

Curious students may want to see the pdf. It is as follows:

$$
\pi(\beta, \sigma^{2}|\cdot)
    = \frac{b_{t-1}^{a_{t-1}} (\sigma^{2})^{-(a_{t-1}+1+\frac{p}{2})}}{(2\pi)^{\frac{p}{2}}|V_{t-1}|^{\frac{1}{2}}\Gamma(a_{t-1})}
       \exp
       \Bigg\{ 
           -\frac{1}{2\sigma^{2}}
           \bigg[ 
           2b_{t-1} + 
               \text{Tr}
               \bigg( 
                       \begin{bmatrix}
                           1 \\ -\beta
                       \end{bmatrix}
                       \begin{bmatrix}
                           1 \\ -\beta
                       \end{bmatrix}^{\intercal}
                       \begin{bmatrix}
                           m_{t-1}^{\intercal} \\ I 
                       \end{bmatrix}
                       V_{t-1}^{-1}
                       \begin{bmatrix}
                           m_{t-1}^{\intercal} \\ I 
                       \end{bmatrix}^{\intercal}
                \bigg)
            \bigg]
       \Bigg\}.
$$

If we look at the model, we surely want to derive $\xi_{t-1}$ and $\nu_{t-1}$ that will replace (or more precisely *represent*) our $a_{t-1}, b_{t-1}, m_{t-1}$, and $V_{t-1}$. These are:

$$
\begin{aligned}
    \xi_{t-1} 
    &=
    \begin{bmatrix}
        m_{t-1}^{\intercal} V_{t-1}^{-1} m_{t-1} + 2b_{t-1} & m_{t-1}^{\intercal} V_{t-1}^{-1} \\
        V_{t-1}^{-1}m_{t-1} & V_{t-1}^{-1}
    \end{bmatrix} \\
    &=
    \begin{bmatrix}
        \xi_{t-1}^{[11]} & \xi_{t-1}^{[12]} \\
        \xi_{t-1}^{[21]} & \xi_{t-1}^{[22]}
    \end{bmatrix}, \\
\nu_{t-1} &= 2a_{t-1}.
\end{aligned}
$$
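
The mapping between the two parametrizations can be sketched as a pair of helper functions. The names `nig_to_xi` and `xi_to_nig` are hypothetical, the numbers are made up, and the inverse mapping uses the block formulas for $m$, $V$, and $b$ in terms of $\xi$.

```python
import numpy as np

def nig_to_xi(m, V, a, b):
    """Build the conjugate hyperparameters (xi, nu) from (m, V, a, b)."""
    p = m.size
    V_inv = np.linalg.inv(V)
    xi = np.empty((p + 1, p + 1))
    xi[0, 0] = m @ V_inv @ m + 2.0 * b     # block [11]
    xi[0, 1:] = V_inv @ m                  # block [12] (V is symmetric)
    xi[1:, 0] = V_inv @ m                  # block [21]
    xi[1:, 1:] = V_inv                     # block [22]
    return xi, 2.0 * a

def xi_to_nig(xi, nu):
    """Recover (m, V, a, b) from the conjugate hyperparameters (xi, nu)."""
    V = np.linalg.inv(xi[1:, 1:])
    m = V @ xi[1:, 0]
    b = 0.5 * (xi[0, 0] - xi[0, 1:] @ V @ xi[0, 1:])
    return m, V, 0.5 * nu, b

m0 = np.array([5.0, 2.0])                  # made-up hyperparameters
V0 = np.array([[2.0, 0.5], [0.5, 1.0]])
xi0, nu0 = nig_to_xi(m0, V0, a=3.0, b=4.0)
m1, V1, a1, b1 = xi_to_nig(xi0, nu0)
print(xi0, nu0)
```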

Now, the Bayes' theorem can be used in its full simplicity due to the conjugate representation.

### Bayesian update
Recall that the update is a sum of $\xi_{t-1}$ and $T(y_t, x_t)$. Simple algebra shows that

$$
\begin{aligned}
    V_{t} &= \left( V_{t-1}^{-1} + x_{t}x_{t}^{\intercal} \right)^{-1}
           = V_{t-1} - \frac{V_{t-1} x_{t}x_{t}^{\intercal} V_{t-1}}{1+x_{t}^{\intercal} V_{t-1} x_{t}}= \left(\xi_{t}^{[22]}\right)^{-1}, \\ 
    m_{t} &= V_{t}(V_{t-1}^{-1}m_{t-1} + y_{t}x_{t}) = \left(\xi_{t}^{[22]}\right)^{-1} \xi_{t}^{[21]}, \\
    a_{t} &= a_{t-1} + \frac{1}{2} = \frac{1}{2}(\nu_{t-1} + 1) = \frac{1}{2}\nu_{t}, \label{eq:nig-update} \\
    b_{t} &= b_{t-1} + \frac{1}{2} \left(-m_{t}^{\intercal}V_{t}^{-1}m_{t} + m_{t-1}^{\intercal} V_{t-1}^{-1}m_{t-1} + y_{t}^{2} \right) \\
    &= \frac{1}{2}\left[\xi_{t}^{[11]} - \xi_{t}^{[12]}\left( \xi_{t}^{[22]} \right)^{-1} \left( \xi_{t}^{[12]} \right)^{\intercal}\right], \notag
\end{aligned}
$$
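
A direct transcription of these update equations might look as follows. This is a sketch, not the lecture's `nig` module: the function name `nig_update`, the simulated data with random regressors, and the prior hyperparameters are all assumptions.

```python
import numpy as np

def nig_update(m, V, a, b, x, y):
    """One-step NiG update by a single pair (x, y)."""
    Vx = V @ x
    V_new = V - np.outer(Vx, Vx) / (1.0 + x @ Vx)       # Sherman-Morrison form
    m_new = V_new @ (np.linalg.solve(V, m) + y * x)
    a_new = a + 0.5
    b_new = b + 0.5 * (m @ np.linalg.solve(V, m) + y**2
                       - m_new @ np.linalg.solve(V_new, m_new))
    return m_new, V_new, a_new, b_new

# Made-up data: random regressors, assumed true beta and noise std.
rng = np.random.default_rng(0)
n, beta_true, sigma = 200, np.array([1.0, -2.0]), 0.5
X = rng.normal(size=(n, 2))
y = X @ beta_true + rng.normal(0.0, sigma, size=n)

# Fairly flat prior (assumption), then one update per measurement.
m0, V0, a0, b0 = np.zeros(2), 10.0 * np.eye(2), 1.0, 1.0
m, V, a, b = m0, V0, a0, b0
for x_t, y_t in zip(X, y):
    m, V, a, b = nig_update(m, V, a, b, x_t, y_t)

print("m_t =", m)
print("estimate of sigma^2 =", b / (a - 1.0))
```

The Sherman-Morrison form avoids inverting $V_{t-1}^{-1} + x_t x_t^\intercal$ at each step; the remaining linear solves could be avoided as well by storing $\xi_t$ instead of $V_t$.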

An interesting point is that this approach to the derivation of the posterior hyperparameters is easy. However, the standard literature mostly uses the traditional forms of pdfs and tedious algebra with certain tricks ;-)

Recall that

$$
\beta, \sigma^{2} \sim 
\underbrace{\mathcal{N}(m_{t-1}, \sigma^{2} V_{t-1})}_{\pi(\beta|\sigma^2)} 
\times
\underbrace{i\mathcal{G}(a_{t-1}, b_{t-1})}_{\pi(\sigma^2)}.
$$

**The estimates follow from the marginal distributions:**
- $\hat{\sigma}^2 = \frac{b_{t}}{a_{t}-1}$. This follows from the marginal [inverse-gamma distribution](https://en.wikipedia.org/wiki/Inverse-gamma_distribution). The variance is $\operatorname{var}(\sigma^{2}|\cdot) = \frac{b_{t}^{2}}{(a_{t}-1)^{2}(a_{t}-2)}$. **The uncertainty - measured by variance - tends to zero with  $t\to\infty$.**
- $\hat{\beta} = m_t$ follows from the marginal distribution $\int \pi(\beta|\sigma^2) \pi(\sigma^2)d\sigma^2$ which is the [Student t distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution#Non-standardized_Student%27s_t-distribution) with $2a_t$ degrees of freedom, centered at $m_t$ and with a scale matrix $\frac{b_t}{a_t}V_t$. The variance is $\operatorname{var}(\beta|\cdot) = \frac{b_t}{a_t-1}V_t$. **The uncertainty follows from the finite number of measurements and from the noise $\varepsilon_t$ with the variance $\sigma^2$.**
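
These point estimates and their variances translate into code directly. In the sketch below, the function name `nig_estimates` and the hyperparameter values are made up; the formulas require $a_t > 2$ for the variance of $\sigma^2$ to exist.

```python
import numpy as np

def nig_estimates(m, V, a, b):
    """Point estimates and variances from the NiG hyperparameters (needs a > 2)."""
    if a <= 2.0:
        raise ValueError("the variance of sigma^2 exists only for a > 2")
    sigma2_hat = b / (a - 1.0)                            # inverse-gamma mean
    sigma2_var = b**2 / ((a - 1.0) ** 2 * (a - 2.0))      # inverse-gamma variance
    beta_hat = m                                          # center of the Student t
    beta_cov = b / (a - 1.0) * V                          # covariance of beta
    return beta_hat, beta_cov, sigma2_hat, sigma2_var

# Made-up posterior hyperparameters, for illustration only.
m_t = np.array([5.0, 2.0])
V_t = np.diag([0.04, 0.0001])
beta_hat, beta_cov, sigma2_hat, sigma2_var = nig_estimates(m_t, V_t, a=12.0, b=1100.0)

beta_std = np.sqrt(np.diag(beta_cov))
print("beta:", beta_hat, "+/- 3 std:", 3.0 * beta_std)
print("sigma^2:", sigma2_hat, "variance:", sigma2_var)
```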

> The regression with the altitude-model:
> - estimate of $\beta = [\beta_1, \beta_2] \equiv [v_0, a]$ including the $\pm$3 standard deviations band
![Ebeta](img/l2-regrese-Ebeta.jpg)
> - detail
![Ebeta detail](img/l2-regrese-Ebeta-detail.jpg)
> - estimate of $\sigma^2$ including $\pm$3 standard deviations band,
![Esigma2](img/l2-regrese-Esigma2.jpg)

> **Naturally it depends on the prior distribution. If it were too narrow and with a wrong location, the convergence would be slower. On the other hand, a flat prior converges faster, but the uncertainty (variance) will be larger in the beginning.**

### Prediction
Assume that we want to know (predict) the value of $y'$ for given $x'$, e.g., the future value. The Bayesian approach performs predictions via the *predictive distribution*, which reads:

$$
f(y'|y_{0:t},x_{0:t},x') = \iint f(y'|x', \beta, \sigma^{2}) \pi(\beta, \sigma^{2}|y_{0:t}, x_{0:t}) \mathrm{d}\beta \mathrm{d}\sigma^{2}.
$$

This is again the [Student t distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution#Non-standardized_Student%27s_t-distribution)
$$
y'|y_{0:t}, x_{0:t}, x' \sim t_{2a_{t}}\left(m_{t}^{\intercal}x', \frac{b_{t}}{a_{t}} \left(1 + (x')^{\intercal}V_{t}x'\right) \right).
$$

Note that it is centered at $m_t^{\intercal} x' = \hat{\beta}^\intercal x'$, exactly as expected. Moreover, we have a measure of uncertainty, $var(y'|\cdot) = \frac{b_t}{a_t-1} \left(1 + (x')^{\intercal}V_{t}x'\right)$. **This combines the uncertainty both in $\beta$ and $\sigma^2$.**
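
The predictive mean and variance can be evaluated in a few lines. The following sketch uses made-up posterior hyperparameters and a hypothetical function name `nig_predict`, and predicts the altitude at an assumed time $t' = 10$ with the regressor $x' = [t', \frac{1}{2}t'^2]$.

```python
import numpy as np

def nig_predict(m, V, a, b, x_new):
    """Mean and variance of the Student t predictive distribution (needs a > 1)."""
    mean = m @ x_new
    scale2 = b / a * (1.0 + x_new @ V @ x_new)          # squared scale of the t
    var = b / (a - 1.0) * (1.0 + x_new @ V @ x_new)     # predictive variance
    return mean, scale2, var

# Made-up posterior hyperparameters, for illustration only.
m_t = np.array([5.0, 2.0])
V_t = np.diag([0.04, 0.0001])
a_t, b_t = 12.0, 1100.0

t_new = 10.0
x_new = np.array([t_new, 0.5 * t_new**2])
mean, scale2, var = nig_predict(m_t, V_t, a_t, b_t, x_new)

std = np.sqrt(var)
print(f"prediction: {mean:.1f}, +/- 3 std band: [{mean - 3*std:.1f}, {mean + 3*std:.1f}]")
```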