#### Setup

A `changepoint` is an abrupt shift in the parameters that generate a data sequence (e.g. the mean of a Gaussian suddenly jumps). Here, we focus on the online (causal) task: at each step $t$ we reason about changepoints using only the data $x_{1:t}$ observed so far

To do this, we introduce a latent variable at time step $t$ named `run length` $r_t$, indicating the number of steps since the most recent changepoint

Based on observation $x_t$, if a changepoint occurs at $t$, then $r_t=0$; otherwise it increments by 1 from $r_{t-1}$
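The reset-or-increment rule can be sketched in a few lines. This is only an illustration: the function name `run_lengths` and the boolean changepoint flags are assumptions, since in practice the changepoints are latent and not observed directly.

```python
def run_lengths(is_changepoint):
    """Return the run length r_t for each step, given boolean changepoint flags.

    Assumes (for illustration) that the changepoint positions are known.
    """
    lengths = []
    r = 0
    for flag in is_changepoint:
        # A changepoint resets the run length; otherwise it grows by one.
        r = 0 if flag else r + 1
        lengths.append(r)
    return lengths


flags = [True, False, False, True, False]
print(run_lengths(flags))  # [0, 1, 2, 0, 1]
```

Since the flags are unknown in the online task, the rest of the derivation tracks a distribution over $r_t$ instead of a single value.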

We want to maintain the full run length `posterior` $p(r_t|x_{1:t})$ and update it `recursively`

In addition, we want to update the sequence `predictive distribution` for $x_{t+1}$ using only $x_{1:t}$


#### Predictive distribution

Using marginalization, we can write the `predictive` as

$$
\begin{align*}
p(x_{t+1}|x_{1:t})&=\sum_{r_t=0}^tp(x_{t+1}|r_t, x_{t-r_t:t})p(r_t|x_{1:t}) \\
&=\sum_{r_t=0}^tp(x_{t+1}|r_t, x_t^{(r)})p(r_t|x_{1:t})
\end{align*}
$$

where $x_t^{(r)}$ denotes the portion of $x_{1:t}$ that belongs to the `current` run

This shows that the sequence predictive is determined by $p(x_{t+1}|r_t, x_t^{(r)})$ and the run length posterior $p(r_t|x_{1:t})$

$p(x_{t+1}|r_t, x_t^{(r)})$ is often referred to as the underlying probabilistic model `(UPM) predictive` to distinguish it from the sequence predictive $p(x_{t+1}|x_{1:t})$
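The marginalization is a mixture: each run length contributes its UPM predictive, weighted by the run length posterior. A minimal sketch, where the function name and the numbers are hypothetical and the UPM predictive densities are assumed to have been evaluated already:

```python
def sequence_predictive(upm_predictive, run_length_posterior):
    """Mix per-run-length UPM predictive densities by the run length posterior.

    upm_predictive[r]       ~ p(x_{t+1} | r_t = r, x_t^{(r)})
    run_length_posterior[r] ~ p(r_t = r | x_{1:t})
    """
    if len(upm_predictive) != len(run_length_posterior):
        raise ValueError("need one UPM predictive value per run length")
    return sum(p * w for p, w in zip(upm_predictive, run_length_posterior))


# Hypothetical values: density of a candidate x_{t+1} under run lengths 0, 1, 2.
upm = [0.10, 0.40, 0.30]
posterior = [0.2, 0.5, 0.3]  # sums to 1
print(sequence_predictive(upm, posterior))
```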

Next, we look at run length posterior and UPM predictive

#### Run length posterior

To compute run length posterior, we use the expression of conditional probability

$$\begin{align*}
p(r_t|x_{1:t})&=\frac{p(r_t, x_{1:t})}{\sum_{r_t'}p(r_t',x_{1:t})}
\end{align*}$$

We express the joint in a `recursive` manner

$$\begin{align*}
p(r_t, x_{1:t})&=\sum_{r_{t-1}}p(r_t, r_{t-1}, x_t, x_{1:t-1}) \\
&=\sum_{r_{t-1}}p(r_t, x_t | r_{t-1},  x_{1:t-1})p(r_{t-1}, x_{1:t-1})\\
&=\sum_{r_{t-1}}p(x_t|r_t, r_{t-1}, x_{1:t-1})p(r_t|r_{t-1}, x_{1:t-1})p(r_{t-1}, x_{1:t-1}) \\
& \text{assumption: } x_t \text{ conditionally independent of } r_t \text{ given } r_{t-1}, x_{1:t-1} \\
& \text{assumption: } r_t \text{ conditionally independent of } x_{1:t-1} \text{ given } r_{t-1} \\
&=\sum_{r_{t-1}}p(x_t|r_{t-1}, x_{1:t-1})p(r_t|r_{t-1})p(r_{t-1}, x_{1:t-1})\\
&=\sum_{r_{t-1}}p(x_t|r_{t-1}, x_{t-1}^{(r)})p(r_t|r_{t-1})p(r_{t-1}, x_{1:t-1})\\
\end{align*}$$

Therefore, the joint at step $t$, $p(r_t, x_{1:t})$, depends on three terms: the UPM predictive $p(x_t|r_{t-1}, x_{t-1}^{(r)})$, the joint $p(r_{t-1}, x_{1:t-1})$ at step $t-1$, and the changepoint prior $p(r_t|r_{t-1})$

Once the initial joint $p(r_0)$ is set, all that remains at each step is to update the UPM predictive efficiently and evaluate the changepoint prior
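One step of this recursion, followed by the normalization that yields the run length posterior, might look like the sketch below. The changepoint prior used here is an assumption not specified in the text: a constant `hazard` gives $p(r_t=0|r_{t-1})$, and $1-$`hazard` gives $p(r_t=r_{t-1}+1|r_{t-1})$. The function names are likewise hypothetical.

```python
def update_joint(prev_joint, upm_pred, hazard):
    """One recursion step for the joint p(r_t, x_{1:t}).

    prev_joint[r] ~ p(r_{t-1} = r, x_{1:t-1})
    upm_pred[r]   ~ p(x_t | r_{t-1} = r, x_{t-1}^{(r)})
    hazard        ~ assumed constant changepoint probability
    Returns a list one entry longer than prev_joint.
    """
    weighted = [j * p for j, p in zip(prev_joint, upm_pred)]
    # r_t = 0: a changepoint can follow any previous run length, so sum over them.
    changepoint_mass = hazard * sum(weighted)
    # r_t = r_{t-1} + 1: each previous run length grows by one.
    growth = [(1.0 - hazard) * w for w in weighted]
    return [changepoint_mass] + growth


def normalize(joint):
    """Divide by the evidence p(x_{1:t}) to get p(r_t | x_{1:t})."""
    total = sum(joint)
    return [j / total for j in joint]


joint = [1.0]  # initial joint: p(r_0 = 0) = 1
joint = update_joint(joint, upm_pred=[0.5], hazard=0.1)
print(joint, normalize(joint))
```

In this toy step the joint sums to the evidence $p(x_1)=0.5$, and the posterior puts mass 0.1 on a changepoint and 0.9 on growth, up to floating-point rounding.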

#### UPM predictive

The computation of the UPM predictive leverages `conjugate models`

##### Conjugate models

Assume we have observations $D$, model parameters $\theta$ and hyperparameters $\alpha$. Then the prior predictive distribution marginalized over parameters can be written as

$$p(x|\alpha)=\int p(x|\theta)p(\theta |\alpha)\,d\theta$$

where $p(x|\theta)$ is the `predictive model` given the parameters and $p(\theta |\alpha)$ is the prior over the `parameters`

This is called `prior predictive distribution` because this is the prediction `before` we observe any data (that is, $D$ is not taken into account)

Similarly, we can write `posterior predictive distribution` as

$$
\begin{align*}
p(x|D, \alpha)&=\int p(x|\theta)p(\theta |D, \alpha)\,d\theta \\
\end{align*}
$$

A useful property of a conjugate model is that the prior and the posterior have the same functional form. Therefore, if the parameter prior is conjugate, then we have

$$p(\theta|D, \alpha) = p(\theta | \alpha')$$

Therefore

$$
\begin{align*}
p(x|D, \alpha)&=\int p(x|\theta)p(\theta |D, \alpha)\,d\theta \\
&=\int p(x|\theta)p(\theta |\alpha')\,d\theta\\
&=p(x|\alpha')
\end{align*}
$$

That is, the posterior predictive distribution has the `same form` as the prior predictive, with only the hyperparameters changed to $\alpha'$

This lets us skip the integration entirely, provided that we can compute $\alpha'$
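To make this concrete, the sketch below uses a Beta-Bernoulli pair, chosen here purely as an illustration and not taken from the text. With a Beta$(a, b)$ prior on the success probability, the predictive $p(x=1|a,b)$ is $a/(a+b)$, and observing data only changes $a$ and $b$. The code compares that closed form against a brute-force numerical integral over $\theta$.

```python
import math


def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta))


def predictive_closed_form(a, b):
    """p(x = 1 | a, b) for a Bernoulli likelihood with a Beta(a, b) prior."""
    return a / (a + b)


def predictive_by_integration(a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of theta * Beta(theta; a, b)."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        total += theta * beta_pdf(theta, a, b)
    return total * h


data = [1, 0, 1, 1]
a, b = 2.0, 2.0                      # prior hyperparameters (alpha)
a_post = a + sum(data)               # updated hyperparameters (alpha')
b_post = b + len(data) - sum(data)
print(predictive_closed_form(a_post, b_post))     # 0.625
print(predictive_by_integration(a_post, b_post))  # numerically close to the line above
```

The two numbers agree up to integration error, but the closed form costs a single division.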

One family of conjugate models for which $\alpha'$ is particularly cheap to compute is the exponential family

##### Exponential family

Standard distributions with conjugate priors can be cast into a canonical form (we can think of it as `likelihood`)

$$p(x|\eta)=h(x)\exp \left[\eta^T u(x)-A(\eta)\right]$$

where $h(x)$ is the base measure carrying every factor that does not involve $\eta$, $u(x)$ is the sufficient-statistics vector, $\eta$ is the `natural-parameter` vector, and $A(\eta)$ is the log-normalizer (log-partition function) that makes the density integrate to one

The `conjugate prior` for $\eta$ is expressed as

$$p(\eta | \chi, \nu)=\exp\left[\eta^T\chi-\nu A(\eta)-\tilde{A}(\chi, \nu)\right]$$

where $\chi$ and $\nu$ are `hyperparameters`

Now we can compute its posterior given data $x_{1:n}$

Let

$$\bar{u}=\sum_{i=1}^n u(x_i)$$

We then multiply likelihood with the prior

$$
\begin{align*}
p(\eta|x_{1:n}, \chi, \nu) &\propto \left[\prod_{i=1}^np(x_i|\eta)\right]p(\eta|\chi, \nu) \\
& \text{drop terms that are constant w.r.t. } \eta \\
&\propto \exp \left[\eta^T\ (\chi + \bar{u})-(\nu+n)A(\eta)\right]
\end{align*}
$$

We can see that hyperparameter update is simply

$$\begin{align*}
\nu'&=\nu+n \\
\chi' &= \chi + \sum_{i=1}^nu(x_i)
\end{align*}$$
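Because the update is additive, it can be applied one observation at a time, which is what the online setting needs. The sketch below is illustrative: the function names are hypothetical, and the sufficient statistic $u(x)=(x, x^2)$ is one common choice for a Gaussian, assumed here only for the example.

```python
def update_hyperparameters(chi, nu, xs, u):
    """Apply chi' = chi + sum_i u(x_i) and nu' = nu + n.

    chi : list of floats, same length as u(x)
    nu  : float
    xs  : observations
    u   : function mapping one observation to its sufficient-statistics list
    """
    chi_new = list(chi)
    for x in xs:
        for k, u_k in enumerate(u(x)):
            chi_new[k] += u_k
    return chi_new, nu + len(xs)


def gaussian_stats(x):
    """Assumed sufficient statistics for the example: u(x) = (x, x^2)."""
    return [x, x * x]


data = [1.0, 2.0, 3.0]
chi0, nu0 = [0.5, 1.0], 1.0

# Batch update with all observations at once.
print(update_hyperparameters(chi0, nu0, data, gaussian_stats))  # ([6.5, 15.0], 4.0)

# Streaming update, one observation per step, reaches the same hyperparameters.
chi, nu = chi0, nu0
for x in data:
    chi, nu = update_hyperparameters(chi, nu, [x], gaussian_stats)
print((chi, nu))  # ([6.5, 15.0], 4.0)
```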

##### Example of Gaussian

As an example, consider a Gaussian likelihood with unknown mean $\mu$ and `fixed` variance $\sigma^2$, where $\mu$ has the `prior` $\mu \sim N(\mu_0, \sigma_0^2)$ governed by the two hyperparameters $\mu_0$ and $\sigma_0^2$

If we have $n$ data points $x_{1:n}$, we can write out the `likelihood`

$$
\begin{align*}L(x_{1:n}|\mu, \sigma^2)&=\prod_{i=1}^n p(x_i|\mu) \\
&=\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\right]
\end{align*}$$

We can now write the `joint`

$$
\begin{align*}
p(x_{1:n}, \mu) &= \frac{1}{\sqrt{2\pi \sigma_0^2}}\exp\left[-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2\right]\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\right]\\
&\propto \exp\left[-\frac{1}{2\sigma_0^2}\left(\mu^2-2\mu \mu_0+\mu_0^2\right)-\frac{1}{2\sigma^2}\sum_{i=1}^n\left(x_i^2-2x_i\mu+\mu^2\right)\right]\\
&=\exp \left[-\frac{\mu^2}{2\sigma_0^2}+\frac{\mu\mu_0}{\sigma_0^2}-\frac{\mu_0^2}{2\sigma_0^2}-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2 +\frac{\mu}{\sigma^2}\sum_{i=1}^nx_i -\frac{\mu^2}{2\sigma^2}n\right] \\
&\propto \exp\left[-\frac{\mu^2\sigma^2+\mu^2\sigma_0^2n}{2\sigma_0^2\sigma^2}+\frac{\mu\mu_0\sigma^2+\mu\sigma_0^2\sum_{i=1}^nx_i}{\sigma_0^2\sigma^2}\right] \\
&=\exp\left[-\frac{1}{2\sigma_0^2\sigma^2}\left((\sigma^2+\sigma_0^2n)\mu^2-2(\mu_0\sigma^2+\sigma_0^2\sum_{i=1}^nx_i)\mu\right)\right]\\
&=\exp\left[-\frac{1}{2\sigma_0^2\sigma^2}\left(a\mu^2+b\mu+c\right)\right] \\
&=\exp\left[-\frac{1}{2\sigma_0^2\sigma^2}\left(a(\mu-d)^2+e\right)\right]\\
& \text{where, completing the square,} \\
a&=\sigma^2+\sigma_0^2n \\
b&=-2\left(\sigma^2\mu_0+\sigma_0^2\sum_{i=1}^nx_i\right), \quad c=0 \\
d&=-\frac{b}{2a}=\frac{\sigma^2\mu_0+\sigma_0^2\sum_{i=1}^nx_i}{\sigma^2+\sigma_0^2n}\\
e&=c-\frac{b^2}{4a}=-\frac{(\sigma^2\mu_0+\sigma_0^2\sum_{i=1}^nx_i)^2}{\sigma^2+\sigma_0^2n} \\
&=\exp\left[-\frac{1}{2\sigma_0^2\sigma^2}\left((\sigma^2+\sigma_0^2n)\left(\mu-\frac{\sigma^2\mu_0+\sigma_0^2\sum_{i=1}^nx_i}{\sigma^2+\sigma_0^2n}\right)^2-\frac{(\sigma^2\mu_0+\sigma_0^2\sum_{i=1}^nx_i)^2}{\sigma^2+\sigma_0^2n}\right)\right] \\
&\propto \exp\left[-\frac{\sigma^2+\sigma_0^2n}{2\sigma_0^2\sigma^2}\left(\mu-\frac{\sigma^2\mu_0+\sigma_0^2\sum_{i=1}^nx_i}{\sigma^2+\sigma_0^2n}\right)^2\right]
\end{align*}
$$

Using Bayes' rule, we know that `posterior` is proportional to this joint and therefore

$$p(\mu|x_{1:n})\propto \exp\left[-\frac{\sigma^2+\sigma_0^2n}{2\sigma_0^2\sigma^2}\left(\mu-\frac{\sigma^2\mu_0+\sigma_0^2\sum_{i=1}^nx_i}{\sigma^2+\sigma_0^2n}\right)^2\right]$$

Therefore, the posterior `mean` is

$$
\begin{align*}
\mu' &= \frac{\sigma^2\mu_0+\sigma_0^2\sum_{i=1}^nx_i}{\sigma^2+\sigma_0^2n} \\
&=\frac{\sigma^2}{\sigma^2+\sigma_0^2n}\mu_0+\frac{\sigma_0^2}{\sigma^2+\sigma_0^2n}\sum_{i=1}^nx_i
\end{align*}$$

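Recall that $\mu_{ML}=\frac{1}{n}\sum_{i=1}^nx_i$ is the `maximum likelihood` estimate of $\mu$

So the posterior mean is a weighted average of the prior mean $\mu_0$ and the ML estimate $\mu_{ML}$, with weights $\frac{\sigma^2}{\sigma^2+\sigma_0^2n}$ and $\frac{\sigma_0^2n}{\sigma^2+\sigma_0^2n}$ that sum to one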

The posterior `variance` is

$$\sigma'^2 = \frac{\sigma_0^2\sigma^2}{\sigma^2+\sigma_0^2n}$$

or

$$
\begin{align*}
\frac{1}{\sigma'^2} &=\frac{\sigma^2+\sigma_0^2n}{\sigma_0^2\sigma^2} \\
&=\frac{\sigma^2}{\sigma_0^2\sigma^2}+\frac{\sigma_0^2}{\sigma^2\sigma_0^2}n \\
&=\frac{1}{\sigma_0^2}+\frac{1}{\sigma^2}n
\end{align*}
$$
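The posterior mean and variance derived above can be checked numerically. The sketch below uses hypothetical function names and confirms that updating one observation at a time gives the same result as a single batch update. It also returns the parameters of the Gaussian posterior predictive, $N(\mu', \sigma'^2+\sigma^2)$, which is a standard result for this model but is not derived in the text above.

```python
def gaussian_posterior(mu0, var0, var, xs):
    """Posterior (mean, variance) of mu for prior N(mu0, var0) and known noise variance var."""
    n = len(xs)
    mu_post = (var * mu0 + var0 * sum(xs)) / (var + var0 * n)
    # Precisions add: 1 / var_post = 1 / var0 + n / var
    var_post = 1.0 / (1.0 / var0 + n / var)
    return mu_post, var_post


def predictive_params(mu_post, var_post, var):
    """Mean and variance of the Gaussian posterior predictive for the next observation."""
    return mu_post, var_post + var


xs = [1.0, 2.0, 3.0]

# Batch update with all data at once.
mu_b, var_b = gaussian_posterior(0.0, 1.0, 1.0, xs)
print(mu_b, var_b)  # 1.5 0.25

# Sequential update: the posterior after each point becomes the next prior.
mu_s, var_s = 0.0, 1.0
for x in xs:
    mu_s, var_s = gaussian_posterior(mu_s, var_s, 1.0, [x])
print(mu_s, var_s)  # matches the batch result up to rounding

print(predictive_params(mu_b, var_b, 1.0))  # (1.5, 1.25)
```

In the changepoint setting, each candidate run length would carry its own $(\mu', \sigma'^2)$ pair computed from the data in that run.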