#### Setup

A `changepoint` is an abrupt shift in the parameters that generate a data sequence (e.g. the mean of a Gaussian suddenly jumps). Here, we focus on the online (causal) task: at each step $t$, inference uses only the observations seen so far, $x_{1:t}$

To do this, we introduce a latent variable named `run length` $r_t$, indicating the number of steps since the most recent changepoint

If a changepoint occurs at step $t$, then $r_t=0$; otherwise the run length increments by 1, i.e. $r_t=r_{t-1}+1$
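The reset-or-increment rule can be illustrated with a minimal sketch. The helper name `run_lengths` and the boolean flag input are hypothetical, introduced only for illustration; in the actual problem the changepoints are latent and not observed

```python
def run_lengths(is_changepoint):
    """Return r_t for every step, given one boolean changepoint flag per step.

    Assumption (not from the notes): the sequence is treated as starting
    right after a changepoint, so a first step without a flag gets r = 0.
    """
    r = -1
    out = []
    for cp in is_changepoint:
        # Reset to 0 at a changepoint, otherwise extend the current run by 1.
        r = 0 if cp else r + 1
        out.append(r)
    return out


flags = [True, False, False, True, False]
print(run_lengths(flags))  # [0, 1, 2, 0, 1]
```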

We want to maintain the full run length `posterior` $p(r_t|x_{1:t})$ and update it `recursively`

In addition, we want to update the sequence `predictive distribution` for $x_{t+1}$ using only $x_{1:t}$


#### Predictive distribution

Using marginalization, we can write the `predictive` as

$$
\begin{align*}
p(x_{t+1}|x_{1:t})&=\sum_{r_t=0}^tp(x_{t+1}|r_t, x_{t-r_t:t})p(r_t|x_{1:t}) \\
&=\sum_{r_t}p(x_{t+1}|r_t, x_t^{(r)})p(r_t|x_{1:t})
\end{align*}
$$

where $x_t^{(r)}$ denotes the most recent portion of $x_{1:t}$ that belongs to the `current` run

This shows that the sequence predictive is a mixture: each $p(x_{t+1}|r_t, x_t^{(r)})$ is weighted by the run length posterior $p(r_t|x_{1:t})$

$p(x_{t+1}|r_t, x_t^{(r)})$ is often referred to as the underlying probabilistic model `(UPM) predictive` to distinguish it from the sequence predictive $p(x_{t+1}|x_{1:t})$
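As a small numerical illustration of this marginalization, the sketch below mixes per-run-length UPM predictive values with the run length posterior. The function name `sequence_predictive` and the numbers are made up for illustration; how the UPM predictive values are obtained is left open here

```python
def sequence_predictive(upm_pred, run_posterior):
    """Compute p(x_{t+1} | x_{1:t}) as a posterior-weighted sum.

    upm_pred[r]      : p(x_{t+1} | r_t = r, x_t^{(r)}), evaluated at a candidate x_{t+1}
    run_posterior[r] : p(r_t = r | x_{1:t}), must sum to 1
    """
    if len(upm_pred) != len(run_posterior):
        raise ValueError("need one UPM predictive value per run length")
    if abs(sum(run_posterior) - 1.0) > 1e-9:
        raise ValueError("run length posterior must sum to 1")
    return sum(p * w for p, w in zip(upm_pred, run_posterior))


# Illustrative values for r_t = 0, 1, 2 (not taken from any real data)
upm = [0.1, 0.4, 0.5]
posterior = [0.2, 0.3, 0.5]
print(round(sequence_predictive(upm, posterior), 6))  # 0.39
```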

#### Run length posterior

To compute the run length posterior, we apply the definition of conditional probability: the joint divided by the evidence

$$\begin{align*}
p(r_t|x_{1:t})&=\frac{p(r_t, x_{1:t})}{\sum_{r'_t}p(r'_t,x_{1:t})}
\end{align*}$$

We express the joint in a `recursive` manner

$$\begin{align*}
p(r_t, x_{1:t})&=\sum_{r_{t-1}}p(r_t, r_{t-1}, x_t, x_{1:t-1}) \\
&=\sum_{r_{t-1}}p(r_t, x_t | r_{t-1},  x_{1:t-1})p(r_{t-1}, x_{1:t-1})\\
&=\sum_{r_{t-1}}p(x_t|r_t, r_{t-1}, x_{1:t-1})p(r_t|r_{t-1}, x_{1:t-1})p(r_{t-1}, x_{1:t-1}) \\
& \text{assumption: } x_t \text{ is conditionally independent of } r_t \text{ given } r_{t-1},\, x_{1:t-1} \\
& \text{assumption: } r_t \text{ is conditionally independent of } x_{1:t-1} \text{ given } r_{t-1} \\
&=\sum_{r_{t-1}}p(x_t|r_{t-1}, x_{1:t-1})p(r_t|r_{t-1})p(r_{t-1}, x_{1:t-1})\\
&=\sum_{r_{t-1}}p(x_t|r_{t-1}, x_{t-1}^{(r)})p(r_t|r_{t-1})p(r_{t-1}, x_{1:t-1})\\
\end{align*}$$

Therefore, the joint at step $t$, $p(r_t, x_{1:t})$, depends on three terms: the UPM predictive $p(x_t|r_{t-1}, x_{t-1}^{(r)})$, the joint $p(r_{t-1}, x_{1:t-1})$ from step $t-1$, and a changepoint prior $p(r_t|r_{t-1})$

The derivation above shows that once we set the initial joint $p(r_0)$, what remains to be calculated is the UPM predictive and the changepoint prior
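One step of this recursion can be sketched as follows. The notes do not specify the changepoint prior, so the sketch assumes a constant hazard: $p(r_t=0|r_{t-1})=h$ and $p(r_t=r_{t-1}+1|r_{t-1})=1-h$. The names `update_joint`, `normalize` and `hazard` are hypothetical, and the UPM predictive values are supplied by the caller

```python
def update_joint(prev_joint, upm_pred, hazard):
    """One recursive step: p(r_{t-1}, x_{1:t-1}) -> p(r_t, x_{1:t}).

    prev_joint[r] : p(r_{t-1} = r, x_{1:t-1})
    upm_pred[r]   : p(x_t | r_{t-1} = r, x_{t-1}^{(r)})
    hazard        : assumed constant changepoint probability h
    """
    weighted = [j * p for j, p in zip(prev_joint, upm_pred)]
    # r_t = 0 can be reached from every previous run length (a changepoint).
    changepoint_mass = hazard * sum(weighted)
    # r_t = r + 1 can only be reached from r_{t-1} = r (the run grows).
    growth_mass = [(1.0 - hazard) * w for w in weighted]
    return [changepoint_mass] + growth_mass


def normalize(joint):
    """Turn the joint p(r_t, x_{1:t}) into the posterior p(r_t | x_{1:t})."""
    evidence = sum(joint)  # p(x_{1:t})
    return [j / evidence for j in joint]


# Start from p(r_0 = 0) = 1 and feed one illustrative UPM predictive value.
joint = update_joint([1.0], [0.5], hazard=0.1)
print([round(p, 6) for p in normalize(joint)])  # [0.1, 0.9]
```

The joint grows by one entry per step, since one more run length becomes possible at each new observation.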

#### Conjugate models

The computation of the UPM predictive leverages a `conjugate model`

Assume we have observations $D$, model parameters $\theta$ and hyperparameters $\alpha$. Then the prior predictive distribution marginalized over parameters can be written as

$$p(x|\alpha)=\int p(x|\theta)p(\theta |\alpha)\,d\theta$$

where $p(x|\theta)$ is the `predictive model` given the parameters and $p(\theta |\alpha)$ is the prior over the `parameters`

This is called `prior predictive distribution` because this is the prediction `before` we observe any data (that is, $D$ is not taken into account)

Similarly, we can write `posterior predictive distribution` as

$$
\begin{align*}
p(x|D, \alpha)&=\int p(x|\theta)p(\theta |D, \alpha)\,d\theta
\end{align*}
$$

A wonderful property of a conjugate model is that the prior and the posterior over the parameters have the same form. Therefore, if the parameter prior is conjugate, the posterior is the prior with updated hyperparameters $\alpha'$

$$p(\theta|D, \alpha) = p(\theta | \alpha')$$

Therefore

$$
\begin{align*}
p(x|D, \alpha)&=\int p(x|\theta)p(\theta |D, \alpha)\,d\theta \\
&=\int p(x|\theta)p(\theta |\alpha')\,d\theta\\
&=p(x|\alpha')
\end{align*}
$$

That is, the posterior predictive distribution has the same form as the prior predictive; only the hyperparameters change, from $\alpha$ to $\alpha'$

This allows us to bypass the integration entirely if we can compute $\alpha'$
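To make this concrete, here is a minimal sketch with a Beta-Bernoulli model, chosen purely for illustration (the notes do not commit to a specific model). Here $\alpha=(a,b)$, the prior predictive is $p(x=1|a,b)=a/(a+b)$, and the update is $a'=a+k$, $b'=b+n-k$ for $k$ ones among $n$ observations. All function names are hypothetical. The numerical integral is included only to check that the closed form matches $\int p(x|\theta)p(\theta|\alpha')\,d\theta$

```python
import math


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * theta ** (a - 1) * (1 - theta) ** (b - 1)


def update_hyperparameters(a, b, data):
    """Conjugate update alpha -> alpha' for Bernoulli observations (0 or 1)."""
    k = sum(data)
    return a + k, b + len(data) - k


def predictive_one(a, b):
    """Closed-form p(x = 1 | a, b); no integration needed."""
    return a / (a + b)


def predictive_by_integration(a, b, n_grid=10_000):
    """Midpoint-rule approximation of the integral of theta * Beta(theta | a, b)."""
    h = 1.0 / n_grid
    total = 0.0
    for i in range(n_grid):
        theta = (i + 0.5) * h
        total += theta * beta_pdf(theta, a, b)
    return total * h


a_post, b_post = update_hyperparameters(2, 2, [1, 1, 0, 1])
print(a_post, b_post)                  # 5 3
print(predictive_one(a_post, b_post))  # 0.625
print(abs(predictive_by_integration(a_post, b_post) - 0.625) < 1e-6)  # True
```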

Among conjugate models, the exponential family is particularly attractive because $\alpha'$ can be computed efficiently