Open in [nbviewer](http://nbviewer.jupyter.org/github/luiarthur/stochastic_AMS263/blob/master/notes/notes2.ipynb)
$
% Latex definitions
% note: Ctrl-shift-p for shortcuts menu
\newcommand{\iid}{\overset{iid}{\sim}}
\newcommand{\ind}{\overset{ind}{\sim}}
\newcommand{\p}[1]{\left(#1\right)}
\newcommand{\bk}[1]{\left[#1\right]}
\newcommand{\bc}[1]{ \left\{#1\right\} }
\newcommand{\abs}[1]{ \left|#1\right| }
\newcommand{\ceil}[1]{ \lceil#1\rceil }
\newcommand{\norm}[1]{ \left|\left|#1\right|\right| }
\newcommand{\E}{ \text{E} }
\newcommand{\N}{ \mathcal N }
\newcommand{\ds}{ \displaystyle }
\newcommand{\R}{ \mathbb{R} }
\newcommand{\suml}{ \sum_{i=1}^n }
\newcommand{\prodl}{ \prod_{i=1}^n }
\newcommand{\overunderset}[3]{\overset{#1}{\underset{#2}{#3}}}
\newcommand{\asym}{\overset{\cdot}{\sim}}
\newcommand{\given}{\bigg |}
\newcommand{\M}{\mathcal{M}}
\newcommand{\Mult}{\text{Mult}}
\newcommand{\F}{\mathcal{F}}
\newcommand{\P}{\mathcal{P}}
$

# Hidden Markov Models
Observations $Y_n$, for $n=0,1,...$, are generated from a conditional distribution $f(y_n \mid x_n)$ whose parameters depend on an unobserved (hidden) state $x_n \in \bc{1,2,...,K}$. The hidden states evolve according to a transition matrix $P$.

## Partially Observed Data: Inference Example
Let the data be observed at times $t_1, t_6, t_9, t_{20}, t_{35}$. Let the transition matrix be given by $P = \p{p_{ij}}_{i,j=1}^m$.

If all observations were present, the likelihood would be $\prod_{i,j=1}^m p_{ij}^{n_{ij}}$, where $n_{ij}$ is the number of transitions from $i$ to $j$.

Let $x_0$ be the known initial state, and suppose we observe only $X_0=(x_{n_1},...,x_{n_r})$, where $n_1 < ... < n_r \in \mathbb{N}$ are the observation times and $n_0 = 0$. The likelihood is then

$$
L(P \mid x_0) = \prod_{i=1}^r p_{x_{n_{i-1}},\,x_{n_i}}^{(n_i-n_{i-1})}
$$

where $p_{ij}^{(t)}$ is the $(i,j)$-th entry of the $t$-step transition matrix, i.e. $P^t$.
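
The partially observed likelihood can be sketched numerically as follows. This is a minimal illustration, not code from the lecture: the function name `partial_obs_loglik`, the 0-indexed states, and the example matrix are all assumptions made here. Each gap between consecutive observation times contributes one entry of the corresponding power of $P$.

```python
import numpy as np

def partial_obs_loglik(P, x0, times, states):
    """Log-likelihood of a chain observed only at `times`, with known state x0 at time 0."""
    P = np.asarray(P, dtype=float)
    loglik = 0.0
    prev_t, prev_s = 0, x0
    for t, s in zip(times, states):
        # (t - prev_t)-step transition matrix P^(t - prev_t)
        Pt = np.linalg.matrix_power(P, t - prev_t)
        loglik += np.log(Pt[prev_s, s])
        prev_t, prev_s = t, s
    return loglik

# Hypothetical two-state chain (states 0 and 1), observed at times 1, 6, 9, 20, 35.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
ll = partial_obs_loglik(P, x0=0, times=[1, 6, 9, 20, 35], states=[0, 1, 1, 0, 0])
```

In an MCMC setting this function would be evaluated at each proposed $P$; the matrix powers could be cached per distinct gap length.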

## Hidden Markov Model (HMM)
An HMM is based on unobserved finite-state random variables $S_t \in \bc{1,...,m}$ which evolve according to a Markov chain, i.e. 

$$ P(S_t=j \mid S_{t-1}=i) = p_{ij} $$

where $\p{p_{ij}}_{i,j=1}^m$ is a transition matrix.

Let $\pi_1$ be the probability distribution of $S_1$.

Assume that the chain is irreducible, aperiodic and time homogeneous. These are required for identifiability.

At each observation point $t$, a realization of the state occurs. Given $S_t=k$, $y_t$ is drawn as follows:

$$ y_t \mid Y_{t-1},\theta_k \sim f(y_t\mid Y_{t-1},\theta_k) $$

where $Y_{t-1} = (y_1,...,y_{t-1})$ and $k=1,...,m$.

This implies that

$$
f(y_t \mid Y_{t-1},s_{t-1},\theta) = 
\begin{cases}
\sum_{k=1}^m f(y_t \mid Y_{t-1},\theta_k)\, \pi_1(s_t=k), & t=1 \\
\sum_{k=1}^m f(y_t \mid Y_{t-1},\theta_k)\, P(s_t=k \mid s_{t-1}), & t\ge 2 \\
\end{cases}
$$

where $\theta = (\theta_1,...,\theta_m, p_{ij}, i,j=1,...,m)$.

This is very different from a mixture model. In a mixture model, the component-specific latent variables are generally independent. In an HMM, they are serially correlated.

This representation is computationally cumbersome. So we treat $s_1,...,s_n$ as latent parameters and sample them alongside $\theta$.

Good paper to read: **[Chib (1996) HMM](../resources/chib1996.pdf)**.

Define: $S_t = (s_1,...,s_t)$,  $S^{t+1} = (s_{t+1},...s_n)$. Similarly, $Y_t=(y_1,...,y_t)$ and $Y^{t+1}=(y_{t+1},...,y_n)$.

$$p(S_n \mid Y_n, \theta) = p(s_n\mid Y_n,\theta) \times ... \times p(s_t\mid Y_n,S^{t+1},\theta) \times ... \times p(s_1\mid Y_n,S^2,\theta)$$

$p(s_t \mid Y_n, S^{t+1},\theta)$ is a typical term in this product.

$$
\begin{split}
p(s_t \mid Y_n, S^{t+1},\theta) &\propto p(s_t \mid Y_t, \theta) ~g(Y^{t+1},S^{t+1}\mid Y_t,s_t,\theta) \\
&\propto p(s_t \mid Y_t, \theta)~ p(s_{t+1}\mid s_t,\theta) ~g(Y^{t+1},S^{t+2}\mid Y_t,s_t,s_{t+1},\theta) \\
\\
\Rightarrow p(s_t \mid Y_n, S^{t+1},\theta) &\propto p(s_t \mid Y_t, \theta)~ p(s_{t+1}\mid s_t,\theta)
\end{split}
$$

The last step follows because, given $s_{t+1}$ (and $Y_t$), the future $(Y^{t+1},S^{t+2})$ is independent of $s_t$ by the Markov property, so the factor $g$ does not depend on $s_t$.

Thus the mass function of $s_t$ is proportional to the product of two terms: the mass function of $s_t$ given $(Y_t,\theta)$, and the transition probability from $s_t$ to $s_{t+1}$ given $\theta$.

Assume $p(s_{t-1}\mid Y_{t-1},\theta)$ is available, then repeat the following steps:

### Prediction step
$$ p(s_t|Y_{t-1},\theta) = \sum_{k=1}^m p(s_t|s_{t-1}=k,\theta) p(s_{t-1}=k|Y_{t-1},\theta)$$

### Update step
$$ p(s_t \mid Y_{t},\theta) \propto p(s_t \mid Y_{t-1},\theta)\, f(y_{t} \mid Y_{t-1},\theta_{s_t})$$


### Algorithm
Initialize at $t=1$ by setting $p(s_1 \mid Y_0,\theta)$ to be the stationary distribution of the chain.

Run the prediction and update steps recursively to compute the mass function $p(s_t \mid Y_t,\theta)$ for $t=1,...,n$.
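
The prediction and update recursion can be sketched as below. This is an illustrative sketch under assumptions: the function names are made up here, the emission densities are passed in as a precomputed matrix `lik[t, k]` $= f(y_t \mid Y_{t-1},\theta_k)$, and the usage example assumes Gaussian emissions with hypothetical state means that do not depend on past observations.

```python
import numpy as np

def stationary_dist(P):
    """Stationary distribution: solve pi P = pi subject to sum(pi) = 1."""
    m = P.shape[0]
    A = np.vstack([P.T - np.eye(m), np.ones(m)])
    b = np.append(np.zeros(m), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def forward_filter(P, lik):
    """Return filt[t, k] = p(s_t = k | Y_t, theta), given lik[t, k] = f(y_t | Y_{t-1}, theta_k)."""
    n, m = lik.shape
    filt = np.zeros((n, m))
    pred = stationary_dist(P)            # p(s_1 | Y_0, theta)
    for t in range(n):
        if t > 0:
            pred = filt[t - 1] @ P       # prediction step
        post = pred * lik[t]             # update step (unnormalized)
        filt[t] = post / post.sum()
    return filt

# Hypothetical example: two states with N(mu_k, 1) emissions.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
mu = np.array([0.0, 3.0])
y = np.array([0.1, -0.4, 2.8, 3.3, 0.2])
lik = np.exp(-0.5 * (y[:, None] - mu[None, :]) ** 2) / np.sqrt(2 * np.pi)
filt = forward_filter(P, lik)
```

For long series the products can underflow; normalizing at every step, as done here, is usually enough to avoid that.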

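$s_n$ is drawn first, from $p(s_n \mid Y_n,\theta)$. Then the remaining states are simulated backwards, for $t=n-1,...,1$, from $p(s_t \mid Y_n, S^{t+1},\theta) \propto p(s_t \mid Y_t, \theta)\, p(s_{t+1}\mid s_t,\theta)$.

We now know how to draw samples of $s_1,...,s_n$. 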
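
The backward simulation of the states can be sketched as follows. This is a minimal sketch with assumed names: `filt[t, k]` stands for the filtered masses $p(s_t = k \mid Y_t,\theta)$, states are 0-indexed, and the toy `filt` in the usage example is made up rather than computed from data.

```python
import numpy as np

def backward_sample(P, filt, rng):
    """Draw (s_1, ..., s_n) from p(S_n | Y_n, theta) given filtered masses filt[t, k]."""
    n, m = filt.shape
    s = np.empty(n, dtype=int)
    s[n - 1] = rng.choice(m, p=filt[n - 1])      # s_n ~ p(s_n | Y_n, theta)
    for t in range(n - 2, -1, -1):
        # p(s_t | Y_n, S^{t+1}) is proportional to p(s_t | Y_t) * p(s_{t+1} | s_t)
        w = filt[t] * P[:, s[t + 1]]
        s[t] = rng.choice(m, p=w / w.sum())
    return s

# Hypothetical filtered probabilities for n = 4 time points and m = 2 states.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
filt = np.array([[0.8, 0.2],
                 [0.6, 0.4],
                 [0.3, 0.7],
                 [0.1, 0.9]])
rng = np.random.default_rng(0)
s_draw = backward_sample(P, filt, rng)
```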

### p-update
We use $p_i=(p_{i1},...,p_{im})\sim Dir(\alpha_{i1},...,\alpha_{im})$; then $p_i \mid S_n \sim Dir(\alpha_{i1}+n_{i1},...,\alpha_{im}+n_{im})$, where $n_{ik}$ is the total number of transitions from $i$ to $k$ in $S_n$.
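
A short sketch of this conjugate update, with assumed function names and 0-indexed states: count the transitions in the sampled state sequence, then draw each row of the transition matrix from its Dirichlet posterior.

```python
import numpy as np

def count_transitions(s, m):
    """counts[i, k] = number of transitions from state i to state k in the sequence s."""
    counts = np.zeros((m, m), dtype=int)
    for i, k in zip(s[:-1], s[1:]):
        counts[i, k] += 1
    return counts

def sample_transition_matrix(s, alpha, rng):
    """Draw each row p_i ~ Dir(alpha_i + n_i), independently across rows."""
    m = alpha.shape[0]
    counts = count_transitions(s, m)
    return np.vstack([rng.dirichlet(alpha[i] + counts[i]) for i in range(m)])

# Hypothetical state sequence and a flat Dir(1, 1) prior on each row.
s = np.array([0, 0, 1, 1, 0])
P_draw = sample_transition_matrix(s, np.ones((2, 2)), np.random.default_rng(0))
```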



### See Example in Chib 1996 Section 4.1

In fact, just refer to the paper for this lecture on HMMs.

# TO DO

- choose project by 3 March.
    - 20 minutes
- make-up class tomorrow 1-2 pm. 246 Porter.


# Point Processes

Point processes are stochastic processes (SPs) for events that occur separated in time or space.

If points are independently distributed, the location of each point is independent of the locations of the other points. But the points can also follow certain patterns, and we would like to model those.

The Poisson process plays an important role in the study of point processes.

# Non-homogeneous Poisson Process (NHPP)

An NHPP is defined on the observation window $R$ with intensity $\lambda(x), x \in R$, which is a non-negative and locally integrable function. For all bounded $B \subset R$, the following holds:

1. for any $B$, the number of points in $B$, $N(B) \sim Pois(\Lambda(B))$, where $\Lambda(B) = \int_B\lambda(x) dx$
2. Given $N(B)$, the point locations within $B$ are iid with density $\frac{\lambda(x)}{\int_B\lambda(x) dx}$

Consider first NHPP in 1-dim. Spatial NHPP (in 2-dim) will follow later.

Let us study the NHPP on the interval $R = (0,1)$ with events occurring at points $0 < t_1 < t_2<...<t_N<1$.

$P(N \text{ events occur in } (0,1)) = \frac{e^{-\Lambda}\Lambda^N}{N!}$, where $\Lambda = \Lambda(R) = \int_0^1 \lambda(x) dx$.

By (2), $P(\text{events happened at } t_1<t_2<...<t_N \mid N \text{ events}) = \prod_{i=1}^N \frac{\lambda(t_i)}{\Lambda}$.

Multiplying the two, the factor $\Lambda^N$ cancels and $P(N \text{ events happened at points } t_1<...<t_N) = e^{-\Lambda}\frac{\prod_{i=1}^N\lambda(t_i)}{N!}$.
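
The two defining properties give a direct way to simulate an NHPP on $(0,1)$ and to evaluate its log-likelihood. The sketch below is illustrative only: the function names, the linear intensity $\lambda(t) = 5 + 20t$, and the use of rejection sampling for the iid locations are assumptions made here, not part of the lecture.

```python
import numpy as np

def simulate_nhpp(lam, lam_max, Lambda, rng):
    """N ~ Pois(Lambda), then N iid locations with density lam / Lambda (by rejection sampling)."""
    n = rng.poisson(Lambda)
    pts = []
    while len(pts) < n:
        t = rng.uniform()                        # uniform proposal on [0, 1)
        if rng.uniform() * lam_max <= lam(t):    # accept with probability lam(t) / lam_max
            pts.append(t)
    return np.sort(np.array(pts))

def nhpp_loglik(times, lam, Lambda):
    """log of exp(-Lambda) * prod_i lam(t_i) / N!"""
    times = np.asarray(times, dtype=float)
    n = len(times)
    log_factorial = np.sum(np.log(np.arange(1, n + 1)))
    return -Lambda + np.sum(np.log(lam(times))) - log_factorial

# Hypothetical intensity: lam(t) = 5 + 20 t, so Lambda = 15 and lam_max = 25 on (0, 1).
lam = lambda t: 5.0 + 20.0 * t
rng = np.random.default_rng(0)
pts = simulate_nhpp(lam, lam_max=25.0, Lambda=15.0, rng=rng)
ll = nhpp_loglik(pts, lam, Lambda=15.0)
```

Rejection sampling needs an upper bound `lam_max` on the intensity; when $\Lambda$ has no closed form it would have to be computed numerically.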

### Prior on $\lambda(t)$

1. assume some parametric form of $\lambda(t)$ and put priors on parameters
2. assume fully non-parametric prior on $\lambda(t)$



### Parametric method:

Assume $\lambda(t)=\alpha t^{-\beta}$, for $\alpha>0, \beta\in \mathbb{R}$
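
As a worked step under this parametric form (a sketch; note that on $(0,1)$ the integrated intensity is finite only when $\beta<1$, a restriction that is an observation added here), the integrated intensity and the log-likelihood of events at $t_1<...<t_N$ are

$$
\Lambda = \int_0^1 \alpha t^{-\beta}\,dt = \frac{\alpha}{1-\beta}, \qquad \beta<1,
$$

$$
\log L(\alpha,\beta) = N\log\alpha - \beta\sum_{i=1}^N \log t_i - \frac{\alpha}{1-\beta} - \log N!
$$

which follows by substituting $\lambda(t_i)=\alpha t_i^{-\beta}$ into $e^{-\Lambda}\prod_{i=1}^N\lambda(t_i)/N!$.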

### Nonparametric Prior

Define $f(t)=\frac{\lambda(t)}{\nu}, \nu = \int_0^1\lambda(u)du$

$f(t)$ is a density function on (0,1). $(f,\nu)$ provides an equivalent representation of $\lambda$. So a nonparametric prior for $f$ with a parametric prior on $\nu$ will induce a semi-parametric prior on $\lambda$.

$\nu$ determines the scale and $f$ determines the shape of $\lambda$.

There are two different non-parametric priors that one can think of in estimating $f$.

1. DP mixture prior
    - $f(t) = \int \text{Beta}(t; \mu,\tau) dG(\mu,\tau)$, where $\mu \in (0,1)$ and scale parameter $\tau>0$
    - $G \sim DP(\alpha, G_0)$
    - DP book references:
        - Dey, Muller, Sinha (1998)
        - Ghosh & Ramamoorthi (2003)
        - Hjort, Holmes, Muller, Walker (2010)
        - Muller & Rodriguez (2013)
2. Logistic GP prior

# Modeling the intensity function $\lambda(t)$ of a non-homogeneous Poisson process

$f(t) = \frac{\lambda(t)}{\nu}, \nu = \int_0^1\lambda(t)dt$

We want to put a prior on $\lambda(t)$, but we instead (equivalently) put priors on $\nu$ and $f(t)$.

$f(t) = \int \text{Beta}(t;\mu,\tau) dG(\mu,\tau)$, where $\mu\in(0,1)$ and $\tau>0$.

$G\sim DP(\alpha,G_0)$, with $G_0(\mu,\tau) = G_{01}(\mu)G_{02}(\tau)$. We take the base distribution for $\mu$ to be Uniform(0,1) and the base distribution for $\tau$ to be Gamma(a,b).

Prior on $\nu$: $p(\nu)\propto 1/\nu$.

**Likelihood:** $e^{-\Lambda}\frac{\prod_{i=1}^N\lambda(t_i)}{N!}$,
which, since $\Lambda=\nu$ and $\lambda(t_i)=\nu f(t_i)$, is proportional to $e^{-\nu} \prod_{i=1}^N\bc{\nu f(t_i)}$.
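
One consequence worth writing out (a short derivation added here, assuming $N \ge 1$ observed events): with the $1/\nu$ prior, the total intensity $\nu = \int_0^1\lambda(t)dt$ separates from $f$ in the posterior, and its full conditional has a closed form,

$$
p(\nu \mid t_1,...,t_N, f) \propto \frac{1}{\nu}\, e^{-\nu}\, \nu^{N} \prod_{i=1}^N f(t_i) \propto \nu^{N-1} e^{-\nu}
\quad\Rightarrow\quad \nu \mid t_1,...,t_N \sim \text{Gamma}(N, 1).
$$

The remaining factor $\prod_{i=1}^N f(t_i)$ is the likelihood of a density estimation problem for $f$ based on the event times.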



# Logistic Gaussian Process Prior for Estimating $f(t)$

Tokdar et al. (2007)

Take $I=[0,1]$. We are interested in estimating a density that is defined over $[0,1]$. Let $\sigma_0(\cdot,\cdot)$
be a fixed positive definite function on $\mathbb{R}\times\mathbb{R}$.
That is, for any $m$ and any points $t_1,...,t_m$, 

$\Sigma = \p{\sigma_0(t_i,t_j)}_{i,j=1}^m$ is a positive definite matrix.

Define a real-valued process $f_W$ on $I$ as follows:

$f_W(t) = \ds\frac{e^{W(t)}}{\int_I e^{W(s)}ds}$, $t\in I$, where, given $\gamma=(\tau,\beta) \in \mathbb{R}^+\times \mathbb{R}^+$,

$W\sim GP(0,\sigma_\gamma(s,t))$ with $\sigma_\gamma(s,t)=\tau^2\sigma_0(\beta s, \beta t)$. The prior on $f$ generates realizations of the form $f_W$, and by construction $\int_I f_W(t) dt = 1$. This prior is called the logistic Gaussian process prior.

Small values of $\beta$ result in smooth sample paths, while large $\beta$ produces oscillating sample paths.
$\tau$ controls the variability of $f_W$ under the prior.
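
To see the roles of $\beta$ and $\tau$, one can draw from the prior on a grid. The sketch below is an illustration under assumptions: the notes do not fix $\sigma_0$, so a squared-exponential kernel $\sigma_0(s,t)=\exp\{-(s-t)^2\}$ is assumed here, the function names are made up, and the normalizing integral is approximated by the trapezoid rule.

```python
import numpy as np

def gp_kernel(s, t, tau, beta):
    """sigma_gamma(s, t) = tau^2 * sigma_0(beta s, beta t), with sigma_0 assumed squared-exponential."""
    d = beta * (s[:, None] - t[None, :])
    return tau ** 2 * np.exp(-d ** 2)

def draw_logistic_gp(grid, tau, beta, rng):
    """One prior draw of f_W evaluated on `grid` (normalized with the trapezoid rule)."""
    K = gp_kernel(grid, grid, tau, beta)
    # Eigendecomposition with clipping is robust to tiny negative eigenvalues from rounding.
    vals, vecs = np.linalg.eigh(K)
    W = vecs @ (np.sqrt(np.clip(vals, 0.0, None)) * rng.standard_normal(len(grid)))
    e = np.exp(W - W.max())                      # subtract max for numerical stability
    integral = np.sum((e[1:] + e[:-1]) / 2 * np.diff(grid))
    return e / integral

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 201)
f_smooth = draw_logistic_gp(grid, tau=1.0, beta=1.0, rng=rng)    # small beta: smooth path
f_wiggly = draw_logistic_gp(grid, tau=1.0, beta=20.0, rng=rng)   # large beta: oscillating path
```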

The posterior distribution given observations at points $t_1,...,t_n$ is proportional to 
$\ds e^{-\nu}\bc{\prodl \nu f_W(t_i)} \times \pi(\beta) \pi(\tau^2) \times \N\p{(W(t_1),...,W(t_n))'\mid 0,\Sigma}$

When the number of points $n$ is large, MCMC becomes prohibitively expensive, as it requires inverting the $n\times n$ matrix $\Sigma$ in every iteration.

Computational issues can be addressed by imputation. Let $T=\bc{x_1,...,x_m} \subset I$ be a set of nodes. We approximate $W$ by a new process $z(t)=E[W(t) \mid W_m,\gamma], t\in I$, where $W_m=(W(x_1),...,W(x_m))$. 
This gives us 

$z(t) = W_m'\Sigma_\gamma^{-1}\sigma_\gamma(t)$, where $\Sigma_\gamma=\p{\sigma_\gamma(x_i,x_j)}_{i,j=1}^m$ and

$\sigma_\gamma(t) = (\sigma_\gamma(x_1,t),...,\sigma_\gamma(x_m,t))'$.

The resulting density is $\ds f_Z(t) = f_{X^TA_\gamma}(t) = \frac{\exp\p{X^T A_\gamma(t)}}{\int_0^1 \exp\p{X^T A_\gamma(s)} ds}$,

where $X\sim \N_m(0,I)$ and $A_\gamma(t) = \Sigma_\gamma^{-1/2} \sigma_\gamma(t)$.

Put a prior on $\gamma=(\beta,\tau^2)$, and call it $H(\gamma)$.

Given $\gamma$, the distribution of $W_m$ is normal, $W_m \sim \N_m(0,\Sigma_\gamma)$, and hence $z(t)$ has the same distribution as $X^TA_\gamma(t)$. This approximation of $W(t)$ by $z(t)$ is useful if the posterior distribution of $f_W \mid y$ is well approximated by that of $f_Z \mid y$.
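
The node-based approximation can be sketched as follows. Assumptions made for this illustration: a squared-exponential $\sigma_0(s,t)=\exp\{-(s-t)^2\}$ (the notes leave $\sigma_0$ unspecified), made-up function names, six equally spaced nodes, and a trapezoid-rule normalization.

```python
import numpy as np

def gp_kernel(s, t, tau, beta):
    """sigma_gamma(s, t) = tau^2 * sigma_0(beta s, beta t), with sigma_0 assumed squared-exponential."""
    s, t = np.asarray(s, dtype=float), np.asarray(t, dtype=float)
    d = beta * (s[:, None] - t[None, :])
    return tau ** 2 * np.exp(-d ** 2)

def node_approximation(t, nodes, W_m, tau, beta):
    """z(t) = W_m' Sigma_gamma^{-1} sigma_gamma(t) = E[W(t) | W_m, gamma]."""
    Sigma = gp_kernel(nodes, nodes, tau, beta)       # m x m
    sigma_t = gp_kernel(nodes, t, tau, beta)         # m x len(t)
    return np.linalg.solve(Sigma, sigma_t).T @ W_m

def density_on_grid(z, grid):
    """f_Z on the grid: exp(z) normalized by a trapezoid-rule integral."""
    e = np.exp(z - z.max())
    integral = np.sum((e[1:] + e[:-1]) / 2 * np.diff(grid))
    return e / integral

rng = np.random.default_rng(0)
nodes = np.linspace(0.0, 1.0, 6)
Sigma = gp_kernel(nodes, nodes, tau=1.0, beta=3.0)
W_m = rng.multivariate_normal(np.zeros(len(nodes)), Sigma)   # W_m ~ N_m(0, Sigma_gamma)
grid = np.linspace(0.0, 1.0, 201)
f_Z = density_on_grid(node_approximation(grid, nodes, W_m, 1.0, 3.0), grid)
```

Only the $m \times m$ matrix $\Sigma_\gamma$ is factorized, so the cost no longer grows with the number of data points in the same way. At the nodes themselves, $z(x_i)$ reproduces $W(x_i)$.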

Let $\widehat{f_W} = \E\bk{f_W \mid y}$ and $\widehat{f_Z} = \E\bk{f_Z \mid y}$.
Assume that there exist positive $c,q$ such that for all $s,t\in \mathbb{R}$ and $\gamma\in\text{supp}(H)$,

$ \sqrt{Var(W(s)-W(t))} \le c\norm{s-t}^q $. Let $\delta(T) = \sup_{t\in I} \min_{1\le i\le m}\norm{t-x_i}$, where $t$ ranges over the unit interval.

$\delta(T)$ is called the fineness of the nodes. Then $KL(\widehat{f_W},\widehat{f_Z})\rightarrow 0$ as $\delta(T)\rightarrow 0$.

### Computation
Given data $(y_1,...,y_n)$, the posterior density of $(X,\gamma)$ can be written as 

$$ p(X,\gamma \mid y) \propto \bc{\prod_{j=1}^n f_{X^TA_\gamma}(y_j)} \N(X \mid 0,I_m) \times H(\gamma) $$

### Algorithm

1. Initialize $X$ and $\gamma$.
2. Propose a move from $X$ to $X'=(x_1',x_2,...,x_m)$, where $x_1'$ is generated from $\N(x_1,\sigma_{x_1}^2)$. Use Metropolis-Hastings (MH).
    - Accept the move with probability $\alpha = \ds\min\bc{1,\frac{\phi_1(x_1')}{\phi_1(x_1)}\frac{\prod_{j=1}^n f_{{X'}^TA_\gamma}(y_j)}{\prod_{j=1}^n f_{X^TA_\gamma}(y_j)}}$, where $\phi_1$ is the $\N(0,1)$ prior density of $x_1$.
    - Do this for each $x_j$ in turn.
3. Update $\gamma$ using Metropolis.
    - Recall that $f_{X^TA_\gamma}(t)$ (defined above) requires the computation of an integral.
    - One can evaluate the process on a very fine grid $G\subset [0,1]$ and approximate the integral numerically (as an area). Do this at every MCMC iteration.
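
Step 2 of the algorithm (the coordinate-wise update of $X$ for fixed $\gamma$) can be sketched as below. This is an illustration under assumptions, not the implementation used in the lecture or in the reference: $\sigma_0$ is taken to be squared-exponential, the function names, node locations, step size and toy data are made up, and the normalizing integral is approximated on a grid by the trapezoid rule.

```python
import numpy as np

def gp_kernel(s, t, tau, beta):
    """sigma_gamma(s, t) with an assumed squared-exponential sigma_0."""
    s, t = np.asarray(s, dtype=float), np.asarray(t, dtype=float)
    d = beta * (s[:, None] - t[None, :])
    return tau ** 2 * np.exp(-d ** 2)

def basis(t, nodes, tau, beta):
    """A_gamma(t) = Sigma_gamma^{-1/2} sigma_gamma(t); one column per element of t."""
    vals, vecs = np.linalg.eigh(gp_kernel(nodes, nodes, tau, beta))
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return inv_sqrt @ gp_kernel(nodes, t, tau, beta)

def log_post_x(X, A_y, A_grid, grid):
    """sum_j log f_{X'A}(y_j) + log N(X | 0, I), up to an additive constant."""
    z_grid = X @ A_grid
    shift = z_grid.max()
    e = np.exp(z_grid - shift)
    log_norm = shift + np.log(np.sum((e[1:] + e[:-1]) / 2 * np.diff(grid)))
    return np.sum(X @ A_y - log_norm) - 0.5 * X @ X

def mh_sweep(X, A_y, A_grid, grid, step, rng):
    """One random-walk MH update of each coordinate x_j in turn."""
    X = X.copy()
    for j in range(len(X)):
        prop = X.copy()
        prop[j] += step * rng.standard_normal()
        log_alpha = log_post_x(prop, A_y, A_grid, grid) - log_post_x(X, A_y, A_grid, grid)
        if np.log(rng.uniform()) < log_alpha:
            X = prop
    return X

rng = np.random.default_rng(2)
nodes, grid = np.linspace(0.0, 1.0, 6), np.linspace(0.0, 1.0, 201)
y = np.array([0.20, 0.25, 0.30, 0.70])            # toy data on (0, 1)
A_y, A_grid = basis(y, nodes, 1.0, 3.0), basis(grid, nodes, 1.0, 3.0)
X = np.zeros(len(nodes))
for _ in range(50):
    X = mh_sweep(X, A_y, A_grid, grid, step=0.5, rng=rng)
```

Because $\gamma$ is held fixed, the matrices `A_y` and `A_grid` are computed once; a $\gamma$ update would have to recompute them.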

# Log Gaussian Cox Process

The log Gaussian Cox process is a doubly stochastic Poisson process.

In a Poisson process, the intensity function is an unknown but fixed function $\lambda(t)$. In a log Gaussian Cox process, 
$\lambda(t)$ is assumed to be a random function.

Both the process $Y(t)$ and the intensity function $\Lambda=\bc{\lambda(t):t\in\mathbb{R}}$ are SPs.

We assume $y \mid \Lambda \sim$ Poisson process with intensity $\lambda$.

In general, one restricts attention to cases where $\Lambda$, and hence $y$, is stationary and sometimes also
isotropic. One models $\lambda(s)$ using a log GP.

$\lambda(s) = \exp(z(s))$, where $Z=\bc{z(s): s\in \mathbb{R}}$ is a real-valued GP, i.e. $\log\lambda(s) = z(s)$.
If there are predictors, they can be incorporated through 
$\log\lambda(s) = z(s) + x(s)'\beta$, where $x(s)$ is the vector of predictors at $s$.
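
The doubly stochastic structure can be illustrated with a discretized simulation on $(0,1)$: first draw the GP, then draw the Poisson counts given the intensity. This is a rough sketch under assumptions: a squared-exponential covariance for $z$, a constant mean in place of predictors, an intensity treated as piecewise constant on grid cells, and made-up function and parameter names.

```python
import numpy as np

def simulate_lgcp(edges, mean, tau, beta, rng):
    """Discretized LGCP: z at cell midpoints ~ GP, lambda = exp(z), cell counts ~ Pois(lambda * width)."""
    mids = (edges[:-1] + edges[1:]) / 2
    widths = np.diff(edges)
    d = beta * (mids[:, None] - mids[None, :])
    K = tau ** 2 * np.exp(-d ** 2)               # assumed squared-exponential covariance
    vals, vecs = np.linalg.eigh(K)
    z = mean + vecs @ (np.sqrt(np.clip(vals, 0.0, None)) * rng.standard_normal(len(mids)))
    lam = np.exp(z)                              # random intensity, one value per cell
    counts = rng.poisson(lam * widths)           # Poisson counts given the intensity
    # Within each cell, points are placed uniformly (piecewise-constant intensity).
    pts = np.concatenate([rng.uniform(a, b, size=c)
                          for a, b, c in zip(edges[:-1], edges[1:], counts)])
    return np.sort(pts), lam

rng = np.random.default_rng(3)
edges = np.linspace(0.0, 1.0, 51)
points, lam = simulate_lgcp(edges, mean=np.log(50.0), tau=0.5, beta=5.0, rng=rng)
```

A finer grid gives a closer approximation to the continuous process, at the cost of a larger covariance matrix.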

Previous work has also replaced the GP $Z(s)$ with a predictive process or a kernel convolution for computational tractability.

### Likelihood of log-Gaussian Cox process
Let the observations be found at locations $t_1,...,t_n$. The likelihood is given by 
$\E_\lambda\bk{\exp\bc{-\int_0^1 \lambda(s) ds} \prodl \lambda(t_i)}$.

We know $\log \lambda(t) = z(t)$, which follows a GP. We have to calculate the joint distribution of $(\lambda(t_1),...,\lambda(t_n))$, which is known as the joint distribution from a log Gaussian Cox process.

One can also simplify this a bit by assuming a sparse GP on $Z$, so that the expectation is always over 
$(z(x_1),...,z(x_m))$, where $\bc{x_1,...,x_m}$ are knot points.