Skip to content

Latest commit

 

History

History
476 lines (341 loc) · 22.7 KB

_2022-11-28-IS-and-SMC.md

File metadata and controls

476 lines (341 loc) · 22.7 KB
layout title date comments author math summary
post
State-Space Models and Particle Filtering
2022-11-28 2:22
true
Jonathan Ramkissoon
true
Building up to particle filtering for the Bayesian filtering problem

In this post I introduce state-space (latent variable) models and explain the filtering problem. We'll then see how importance sampling and particle filtering are used to do inference on this class of models.

State Space Models

State-space models are a general class of latent variable models widely used in engineering, economics and finance to model the evolution of dynamic systems. They are characterized by an unobserved discrete-time Markov process, $X_t, t \ge 0$ and observations, $Y_t, t \ge 1$. In these models, we assume that $Y_t$ is conditionally independent given $X_t$, and both the transition density for the Markov process, $f(X_t \mid X_{t-1})$ and the observation model, $g(Y_t \mid X_t)$ are given.

These densities are parameterzied by $\Theta$, and parameter inference on these models is concerned with the estimation of $\Theta$.

Visual representation of a generic state-space model. $X_t$ is the state of the system at time $t$, which we don't have "access" to, and $Y_t$ is the observation of the system.

A very simple SSM is a random-walk + noise:

$$ \begin{align*} x_t &= x_{t-1} + \epsilon_{t-1}, \qquad \epsilon_t \sim N(0, \sigma^2_{\epsilon}) \\ y_t &= x_t + r_t, \qquad r_t \sim N(0, \sigma^2_{r}) \\ \end{align*} $$

N = 250
sigma2_eps = np.sqrt(0.5)
sigma2_r = np.sqrt(1.2)
eps = np.random.normal(scale=sigma2_eps, size=N)
r = np.random.normal(scale=sigma2_r, size=N)

x = np.cumsum(eps)
y = x + r

fig, ax = plt.subplots(figsize=(12, 5))
plt.plot(x, label="Latent states")
plt.scatter(np.arange(N), y, s=3, color="firebrick", label="Observations")
plt.legend();

Practitioners are typically interested in either parameter estimation or estimation of the latent states, $X_t$. For example, a common usecase of these models in engineering is for object tracking, where we only have access to a noisy measurement of the object's position at a point in time. In this case, the latent state is the exact position of the object, which we are more interested in estimating compared to the model parameters, $\Theta$. This problem of estimating the latent states is called the Bayesian filtering problem.

At this point you may wonder why I have made the distinction between estimation of parameters and latent states, since we can't really have one without the other. Actually, we can. There are some methods for parameter inference that lend well to cases where we don't care about the latent states, and therefore these are integrated out. To see this, consider the likelihood function for the state space model above:

$$ \begin{align*} \mathcal{L}(\Theta \mid Y_{1:T}) &= \int \left[ \pi(X_0) \prod_{t=0}^T g(Y_t \mid X_t, \Theta) \prod_{n=1}^T f(X_n \mid X_{n=1}, \Theta) \right] dX_{0:T} \end{align*} $$

If this integral could be approximated somehow, we can bypass the filtering problem and directly maximize the likelihood function to find the MLE. Anyway, that's enough of a tangent.

Bayesian Filtering

Bayesian filtering is the problem of estimating the marginal posterior of the state $X_t$ given all previous observations, $Y_{1:t}$, $p(X_t \mid Y_{1:t})$. Note that this distribution is in some sense (not really) the reverse of the observation model, but does not follow any Markov or conditional independence assumptions. However, the conditional independence and Markov assumptions can be used to construct recursive definitions of the quantities we are interested in. Bayesian Filtering and Smoothing does a good job explaining these recursive equations for computing the predictive distribution, $p(X_t \mid Y_{1:t-1})$ and filtering distribution, $p(X_t \mid Y_{1:t})$:

$$ \begin{align*} p(X_t \mid Y_{1:t-1}) &= \int f(x_t \mid x_{t-1}) g(x_{t-1} \mid y_{1:t-1}) dx_{t-1} \\ p(X_t \mid Y_{1:t}) &= \frac{1}{Z_t} g(y_t \mid x_t) p(x_t \mid y_{1:t-1}) \\ \end{align*} $$

Where $Z_t$ is a normalizing constant. The question now is how do we actually "find" these distributions in practice? I say "find" here because we would really be interested in a closed form solution, but if that isn't available we'll take the next best thing. It turns out that when the underlying dynamics are linear and Gaussian, as in our random walk + noise example, the Kalman filter provides a closed form for the filtering distribution. However, if the underlying dynamics are non-linear but still Gaussian, we no longer have a closed form solution but can approximate the filtering distribution with an extended Kalman filter. Finally, if the underlying dynamics are non-linear and the filtering distribution is non-Gaussian, which can happen for multi-modal or states with discrete components, we use particle filters.

Particle Filtering

Particle filters are a class of sequential Monte Carlo methods that approximate the filtering distribution and the log-likelihood of state-space models. As mentioned above, since we don't have closed form solutions to these distributions, they are approximated in the typical Monte Carlo way - using a set of weighted particles (samples). Broadly speaking, it works by proposing initial particles and stepping these particles through the SSM and re-weighting them at each timestep.

This section will build up the intuition behind particle filtering step by step, so if you don't care about that, skip to the Bootstrap Particle Filter section. I will use the same toy example used in A Tutorial on Particle Filtering and Smoothing: Fifteen years later for illustration, then a more real stochastic volatility example after.

$$ \begin{align*} y_t &= \frac{x_t^2}{20} + w_t, \qquad &w_t \sim N(0, 1) \\ x_t &= \frac{1}{2}x_{t-1} + \frac{25x_{t-1}}{1 + x_{t-1}^2} + 8cos(1.2t) + v_t, \qquad &v_k \sim N(0, \sigma_k^2) \\ x_0 &\sim N(0, \sigma_1^2) \end{align*} $$

def transition (x, t):
    """ transition funciton from x_t to x_t+1 """
    if type(x) == np.float64 or type(x) == int:
        n_v = 1
    else : 
        n_v = len(x)
    v = norm(loc = 0, scale = np.sqrt(10)).rvs(size=n_v)
    return 0.5*x + ((25*x)/(1 + x**2)) + 8*np.cos(1.2*t) + v

def marginal_distribution (x):
    """ forward samping marginal for y_t | x_t """
    w = norm.rvs(size=1)
    return ((x**2)/ 20) + w

def conditional_marginal (y, x):
    """ conditional marginal y_t | x_t """
    y_x = norm(loc = ((x**2)/20), scale = 1)
    return y_x.pdf(y)
    
def step (T, x0, transition, marginal):
    """ step through T times """
    x = np.zeros(T)
    x[0] = x0
    y = np.zeros(T)
    y[0] = marginal(x[0])
    for t in range(1, T):
        x[t] = transition(x[t-1], t) 
        y[t] = marginal(x[t])
    return x, y

T = 250
x, y = step(T, x_0, marginal = marginal_distribution, transition = transition)

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(12, 5), sharey = True)
sns.lineplot(x = np.arange(T), y = x, ax = ax, label = "Latent States")
sns.lineplot(x = np.arange(T), y = y, ax = ax, label = "Observed Data")
plt.legend();

Particle Filter Intuition

The goal of the particle filter is to approximate a distribution, which we can do with importance sampling (IS). IS is more commonly known as a way to approximate integrals, however we can use the fact that any distribution, $\pi(x)$, can be expressed as a sum of Dirac delta functions to get an approximation. Suppose ${X^{(i)}}_{i=1}^N$ are samples from $\pi(x)$, then the empirical representation of $\pi(x)$ is:

$$ \hat{\pi}^N(dx) = \frac{1}{N} \sum_{i=1}^N \delta_{X^{(i)}}(dx) $$

Where $\delta_{X}(A)$ is the Dirac measure, defined as:

$$ \delta_{X}(A) = \begin{cases} 1 \qquad X \in A,\\ 0 \qquad X \notin A \end{cases} $$

Suppose we want to approximate a distribution that is known up to a normalizing constant. In the filtering problem, this distribution is $p(X_t \mid Y_{1:t}) = \frac{1}{Z_t} g(y_t \mid x_t) p(x_t \mid y_{1:t-1})$, but we will generalize this to be:

$$ \pi(x) = \frac{\gamma(x)}{\int \gamma(x) dx} $$

Using the importance sampling approach with proposal density, $q(x)$:

$$ \begin{aligned} \pi(x) &= \frac{\frac{\gamma(x)}{q(x)}q(x)}{\int \frac{\gamma(x)}{q(x)}q(x) dx} \\ &= \frac{w(x)q(x)}{\int w(x)q(x) dx} \end{aligned} $$

Let ${X^{(i)}}_{i=1}^N$ be $N$ samples from the proposal density. The empirical representation of $q(x)$ is then:

$$ \hat{q}(dx) = \frac{1}{N} \sum_{i=1}^N \delta_{X^{(i)}}(dx) $$

Then the approximation of $\pi(x)$ becomes:

$$ \begin{aligned} \hat{\pi}(dx) &= \frac{w(x)\hat{q}(x)}{\int w(x)\hat{q}(x) dx} \\ &= \frac{w(x)\frac{1}{N} \sum_{i=1}^N \delta_{X^{(i)}}(dx)}{\int w(x) \frac{1}{N} \sum_{i=1}^N \delta_{X^{(i)}}(dx) dx} \\ &= \frac{\sum_{i=1}^N w(X^{(i)})\delta_{X^{(i)}}(dx)}{\frac{1}{N} \sum_{i=1}^N w(X^{(i)})}\\ \end{aligned} $$

We can use this approach at every timestep to approximate the filtering distribution. This is called sequential importance sampling and works as follows:

  • Draw $N$ samples, $x_0^{(i)}$ from the prior on the initial state, $x_0 \sim p(x_0)$ and set the weights, $w_0^{(i)} = \frac{1}{N}$
  • For each $k=1, \dots, T$:
    • Draw proposal states from the importance distribution: $$ x_k^{(i)} \sim \pi (x_k \mid x_{k-1}) $$
    • Calculate new weights according to:

$$ w_k^{(i)} \propto w_{k-1}^{(i)} \frac{g(y_k \mid x_k^{(i)}) f(x_k^{(i)} \mid x_{k-1}^{(i)})}{\pi (x_k \mid x_{k-1})} $$

The code below runs this algorithm on our example SSM and uses the transition density as the importance sampling proposal. The plot shows the estimated latent state (posterior mean of the filtering distribution) and the filtering distribution as each timestemp, where the point size is proportional to the particle weight at that time.

fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(14, 8), sharey = True)

N = 100 # num particles
particles = np.zeros((N, T)) # memory for N particles at each timestep
weights = np.zeros((N, T))   # weight for each particle
weights[:, 0] = np.ones(N) / N

# initialization: 
particles[:, 0] = norm(0, np.sqrt(10)).rvs(N)

ind = np.arange(N) # indices of rows of particle matrix
for t in range(1, T):
    # 1) use the transition density as the importance sampling proposal density (laziness)
    particles[:, t] = transition(particles[:, t-1], t)
    
    # 2) Evaluate importance weights and normalize: 
    weights[:, t] = conditional_marginal(y = y[t], x = particles[:, t])
    weights[:, t] = np.nan_to_num(weights[:, t] / np.sum(weights[:, t]), 0)
    
    p1 = sns.scatterplot(x = t, y = particles[:, t], ax = ax, 
                s = weights[:, t]*N, color = "green")
    
sns.lineplot(x = np.arange(T), y = x, ax = ax, linewidth=1,
             label = "Latent States", color = "firebrick")
sns.scatterplot(x = t, y = particles[:, t], ax = ax, 
                s = weights[:, t]*N, color = "green",
                label="Particles")
sns.lineplot(x = np.arange(T), label = "Estimated path", linewidth=1,
             y = np.apply_along_axis(np.mean, 0, particles),)\
    .set_title(f"Sequential Importance Sampling - {N} Particles")
plt.legend(loc="upper left"); 
plt.show();

TODO: Plot histograms of particle weights at different timesteps.

Notice that the posterior mean is not a great estimate of the true latent state. This is due to the degeneracy problem, where the weights of many particles are zero or very close to zero. When this happens, these particles die and never contribute to the posterior estimate. An intuitive solution to this problem is to upsample important particles and downsample unimportant particles. We can think of important and unimportant particles as particles that are close or far away from where the models think the true latent state is. When the weight of a particle is low, it is because this particle is far away from the true latent state. In practice, the solution to this problem is to introduce a resampling step, where the weights parameterize a multinomial distribution which is resampled at certain points during inference. This is called sequential importance resampling or the particle filter.

Bootstrap Particle Filters

In the bootstrap particle filter, we use the transition density as the proposal density in the importance samling step (this was already done in our example). Resampling is performed only when the effective sample size (ESS) is too small. ESS is a measure of how badly we are suffering from the degeneracy problem and is calculated as:

$$ n_{\text{eff}} \approx \frac{1}{\sum_{i=1}^N (w_t^{(i)})^2} $$

Outline of bootstrap particle filter:

  • Initialization:

    • For $i = 1, ..., N$, sample $x_0^{(i)} \sim p(x_0)$ and set $t = 1$.
  • Importance sampling step:

    • For $i = 1, ..., N$, sample $x_t \sim p(x_t \mid x_{t-1}^{(i)})$ and inlcude in $x_{0:t}$
    • For $i = 1, ..., N$, evaluate the importance weights: $$ w_t^{(i)} = p(y_t \mid x_t^{(i)}) $$
    • Normalize importance weights
  • Selection step:

    • Resample with replacement $N$ particles $(x_{0:t}^{(i)}, i = 1, ..., N)$ according to $w_t^{(i)}$
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(14, 8), sharey = True)

N = 100 # num particles
particles = np.zeros((N, T)) # memory for N particles at each timestep
weights = np.zeros((N, T))   # weight for each particle
weights[:, 0] = np.ones(N) / N

# initialization: 
particles[:, 0] = norm(0, np.sqrt(10)).rvs(N)

ind = np.arange(N) # indices of rows of particle matrix
for t in range(1, T):
    # 1) Step forward every x_t (sampling step):
    particles[:, t] = transition(particles[:, t-1], t)
    
    # 2) Evaluate importance weights and normalize: 
    weights[:, t] = conditional_marginal(y = y[t], x = particles[:, t])
    weights[:, t] = np.nan_to_num(weights[:, t] / np.sum(weights[:, t]), 0)
    p1 = sns.scatterplot(x = t, y = particles[:, t], ax = ax, 
                s = weights[:, t]*N, color = "green")
    # 3) Resample with replacement using weights
    # calculate ESS: 
    ess = 1 / sum(weights[:, t]**2)
    if ess < (N / 2):
        # multinomial resampling:
        resample_ind = np.random.choice(ind, size = N, p = weights[:, t])
        particles[:, t] = particles[resample_ind, t]
    p2 = sns.scatterplot(x = t, y = particles[:, t], ax = ax, 
                    s = weights[resample_ind, t]*N, color = "black")

print("RMSE: ", rmse(np.apply_along_axis(np.mean, 0, particles), x))
    
sns.lineplot(x = np.arange(T), y = x, ax = ax, label = "Latent States", linewidth=1,
             color = "firebrick")
sns.scatterplot(x = t, y = particles[:, t], ax = ax, 
                    s = weights[resample_ind, t]*N, color = "black", label="Resampled particles")
sns.lineplot(x = np.arange(T), label = "Estimated path", linewidth=1,
             y = np.apply_along_axis(np.mean, 0, particles),)\
    .set_title("Bootstrap Particle Filter w/ Multinomial Resampling")
plt.legend(loc="upper left"); 
plt.show();

With the added resampling step, the posterior mean of the filtering distribution matches the true latent state much closer.

Stochastic Volatility Model Example

I will also use a toy discrete-time stochastic volatility model, which is non-linear and non-Gaussian for illustrations along the way.

$$ \begin{align*} \text{Observation model:} \quad y_t &= \exp\left(\frac{x_t}{2}\right) \cdot e_t, \quad \text{where} \quad e_t \sim \mathcal{N}(0,1) \\ \text{State model:} \quad x_t &= x_{t-1} + w_t, \quad \text{where} \quad w_t \sim \mathcal{N}(0, \sigma) \end{align*} $$

$$ \begin{align*} dS_t &= \mu S_t dt + \sqrt{v_t} S_t dW_t^1 \\ dv_t &= \kappa (\theta - v_t) dt + \sigma \sqrt{v_t} dW_t^2 \end{align*} $$

Discrete-time approximation:

$$ \begin{align*} \Delta S_t &= \mu S_t \Delta t + \sqrt{v_t} S_t \Delta W_t^1 \\ v_{t+1} &= v_t + \kappa (\theta - v_t) \Delta t + \sigma \sqrt{v_t \Delta t} (W_{t+1}^2 - W_t^2) \end{align*} $$