# Experiments on backward variational ICA 

***Uncomment and run the following cell if you're using Colab***

In [None]:
# !rm -rf *
# !git clone https://github.com/mchagneux/backward_ica.git
# !mv backward_ica/* ./
# !rm -rf backward_ica

### Imports

In [1]:
import torch
from tqdm import tqdm
torch.set_default_dtype(torch.float64)  # double precision for all new floating-point tensors
from functools import partial

from src.eval import mse_expectation_against_true_states
from src.kalman import Kalman, NumpyKalman
from src.hmm import AdditiveGaussianHMM, LinearGaussianHMM
from src.elbo import get_appropriate_elbo
# torch.set_printoptions(precision=10)

## sanity checks
hmm = LinearGaussianHMM(state_dim=2, obs_dim=2)
states, observations = hmm.sample_joint_sequence(10)

for param in hmm.model.parameters():
    param.requires_grad = False  # freeze the parameters of the true model
likelihood_torch = Kalman(hmm.model).filter(observations)[4]  # Kalman filter with torch operators
likelihood_numpy = NumpyKalman(hmm.model).filter(observations.numpy())[2]  # Kalman filter with numpy operators
fully_linear_gaussian_elbo = get_appropriate_elbo('linear_gaussian','linear_emission')
likelihood_via_elbo = fully_linear_gaussian_elbo(hmm.model, hmm.model)(observations)  # ELBO with q = p

# both differences should be close to 0
print(likelihood_numpy - likelihood_torch)
print(likelihood_numpy - likelihood_via_elbo)

tensor(7.1054e-15)
tensor(-2.8422e-14)




## Introduction

This notebook consists of a series of experiments that attempt to recover expectations $\mathbb{E}[h(z_{1:t})|x_{1:t}]$ via variational approximations, when the process $(z_t, x_t)_{t \ge 1}$ is an HMM. The main metric $\ell$ throughout is the MSE against the true states $z_{1:T}^*$ when $h$ is a plain sum, i.e.

$$\ell = \left(\sum_{t=1}^T z_t^* - \sum_{t=1}^T \mathbb{E}_{q_T(z_t)}[z_t] \right)^2$$

where $q_T(z_t) = q(z_t|x_{1:T})$ is the marginal smoothing distribution at $t$.
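
As a minimal sketch of how this metric can be evaluated, the snippet below sums the true states and the smoothed means over time before comparing them. The helper name `additive_smoothing_error` and the toy arrays are illustrative and not part of the `src` package; for vector-valued states the square is read here as a squared Euclidean norm, which is an assumption.

```python
import numpy as np

def additive_smoothing_error(true_states, smoothed_means):
    """Squared error between sum_t z_t^* and sum_t E_q[z_t].

    Both arguments have shape (T, state_dim). For vector-valued states the
    square is taken as a squared Euclidean norm (assumption).
    """
    true_states = np.asarray(true_states, dtype=float)
    smoothed_means = np.asarray(smoothed_means, dtype=float)
    diff = true_states.sum(axis=0) - smoothed_means.sum(axis=0)
    return float(diff @ diff)

# toy example: T = 3 time steps, 2-dimensional states (made-up numbers)
z_star = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 1.0]])
q_means = np.array([[1.0, 0.5], [2.0, 0.5], [2.0, 1.0]])
error = additive_smoothing_error(z_star, q_means)
print(error)
```

Because the metric compares sums, pointwise errors of opposite sign cancel, as they do in the second coordinate of the toy example.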

In all the following, we assume that the variational smoothing distribution factorizes as $q_\phi(z_{1:t}|x_{1:t}) = q_\phi(z_t|x_{1:t}) \prod_{s=1}^{t-1} q_\phi(z_s|z_{s+1},x_{1:s})$. We always assume that the filtering distribution is Gaussian,

$$q_\phi(z_t|x_{1:t}) = \mathcal{N}(\mu_{1:t}, \Sigma_{1:t})$$

and that the backward kernels are Gaussian as well,

$$q_\phi(z_s|z_{s+1},x_{1:s}) = \mathcal{N}(\overleftarrow{\mu}_{1:s}(z_{s+1}), \overleftarrow{\Sigma}_{1:s})$$

In the following, we make several assumptions on both $p_\theta$ and $q_\phi$.

When both $p_\theta$ and $q_\phi$ are linear Gaussian (Section 1. a.), not only should the expectations be correctly recovered, but the parameters $\phi$ and $\theta$ should also be identifiable. We also know that in this case the best estimate of $z_{1:t}^*$ for any sequence is obtained via the Kalman smoothing recursions applied with parameters $\theta$ to the observations $x_{1:t}$.



## 1. Linear Gaussian HMM 

First we assume that observation sequences $x_{1:T}$ arise from $p_\theta(z_{1:T},x_{1:T})$ defined as
$$z_t = A_\theta z_{t-1} + a_\theta + \eta_\theta$$ 
$$x_t = B_\theta z_t + b_\theta + \epsilon_\theta$$

where $\eta_\theta \sim \mathcal{N}(0,Q_\theta)$ and $\epsilon_\theta \sim \mathcal{N}(0,R_\theta)$ are drawn independently at each time step.
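
The generative model above can be simulated in a few lines. The following is a sketch in NumPy rather than the `LinearGaussianHMM.sample_joint_sequence` method used in the cells of this notebook; the function name, the Gaussian prior on $z_1$ and the parameter values are assumptions made for illustration.

```python
import numpy as np

def sample_linear_gaussian_hmm(A, a, Q, B, b, R, prior_mean, prior_cov, T, rng):
    """Draw (z_{1:T}, x_{1:T}) from a linear Gaussian HMM.

    z_1 is drawn from an assumed Gaussian prior N(prior_mean, prior_cov).
    """
    state_dim, obs_dim = A.shape[0], B.shape[0]
    states, observations = [], []
    z = rng.multivariate_normal(prior_mean, prior_cov)
    for t in range(T):
        if t > 0:
            # transition: z_t = A z_{t-1} + a + eta, eta ~ N(0, Q)
            z = A @ z + a + rng.multivariate_normal(np.zeros(state_dim), Q)
        # emission: x_t = B z_t + b + eps, eps ~ N(0, R)
        x = B @ z + b + rng.multivariate_normal(np.zeros(obs_dim), R)
        states.append(z)
        observations.append(x)
    return np.stack(states), np.stack(observations)

# hypothetical parameters for a 2-dimensional state and observation
A = np.array([[0.9, 0.1], [0.0, 0.8]])
a = np.zeros(2)
Q = 0.1 * np.eye(2)
B = np.eye(2)
b = np.zeros(2)
R = 0.05 * np.eye(2)

states, observations = sample_linear_gaussian_hmm(
    A, a, Q, B, b, R,
    prior_mean=np.zeros(2), prior_cov=np.eye(2),
    T=8, rng=np.random.default_rng(0),
)
print(states.shape, observations.shape)
```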

### 1. a. Approximated by a linear Gaussian HMM

We start by recovering $p_\theta$ when $q_\phi$ is in the family of the true model. We do this by prescribing the model for $q_\phi$ in forward time, with the same HMM structure as $p_\theta$ (but random initial parameters). In this case, the parameters of the filtering and backward distributions are available in closed form via the Kalman recursions.
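
The closed-form expressions mentioned above can be sketched as follows. This is a minimal NumPy sketch under the linear Gaussian assumptions of this section, not the `Kalman` class from `src.kalman`; the function names, the Gaussian prior $(m_0, P_0)$ on $z_1$ and the toy parameters are assumptions made for illustration. The backward kernel follows from Gaussian conditioning of $z_s$ on $z_{s+1}$ given the filtering distribution at $s$.

```python
import numpy as np

def kalman_filter(xs, A, a, Q, B, b, R, m0, P0):
    """Filtering means and covariances of q(z_t | x_{1:t})."""
    means, covs = [], []
    m_pred, P_pred = m0, P0  # assumed Gaussian prior on z_1
    for t, x in enumerate(xs):
        if t > 0:
            m_pred = A @ means[-1] + a
            P_pred = A @ covs[-1] @ A.T + Q
        S = B @ P_pred @ B.T + R             # innovation covariance
        K = P_pred @ B.T @ np.linalg.inv(S)  # Kalman gain
        means.append(m_pred + K @ (x - B @ m_pred - b))
        covs.append(P_pred - K @ S @ K.T)
    return means, covs

def backward_kernel(m, P, A, a, Q):
    """Parameters of q(z_s | z_{s+1}, x_{1:s}) = N(A_bw z_{s+1} + a_bw, Sigma_bw)."""
    A_bw = P @ A.T @ np.linalg.inv(A @ P @ A.T + Q)
    a_bw = m - A_bw @ (A @ m + a)
    Sigma_bw = P - A_bw @ A @ P
    return A_bw, a_bw, Sigma_bw

def smoothed_means(xs, A, a, Q, B, b, R, m0, P0):
    """Marginal smoothing means E[z_t | x_{1:T}] via the backward factorization."""
    means, covs = kalman_filter(xs, A, a, Q, B, b, R, m0, P0)
    out = [means[-1]]
    for m, P in zip(reversed(means[:-1]), reversed(covs[:-1])):
        A_bw, a_bw, _ = backward_kernel(m, P, A, a, Q)
        # the backward mean is affine in z_{s+1}, so means propagate linearly
        out.append(A_bw @ out[-1] + a_bw)
    return np.stack(out[::-1])

# scalar toy model with T = 2 observations (made-up numbers)
A = np.array([[0.9]]); a = np.array([0.1]); Q = np.array([[0.5]])
B = np.array([[1.2]]); b = np.array([-0.3]); R = np.array([[0.2]])
m0 = np.array([0.0]); P0 = np.array([[1.0]])
xs = np.array([[0.7], [1.5]])

sm = smoothed_means(xs, A, a, Q, B, b, R, m0, P0)
print(sm.shape)
```

Summing `sm` over time gives the expectation of the additive functional under the smoothing distribution when $h$ is a plain sum.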

In [7]:
hmm = LinearGaussianHMM(state_dim=2, obs_dim=2)  # pick some true model p
for param in hmm.model.parameters():
    param.requires_grad = False  # not learning the parameters of the true model for now



# sampling 10 sequences from the hmm 
samples = [hmm.sample_joint_sequence(8) for _ in range(10)] 
state_sequences = [sample[0] for sample in samples]
observation_sequences = [sample[1] for sample in samples] 


# the variational model q is a random linear Gaussian HMM with the same dimensions, and we will not learn the covariances for now
q = LinearGaussianHMM.get_random_model(2,2)
q.prior.parametrizations.cov.original.requires_grad = False
q.transition.parametrizations.cov.original.requires_grad = False 
q.emission.parametrizations.cov.original.requires_grad = False 

# the elbo object with p and q as arguments
elbo = fully_linear_gaussian_elbo(hmm.model, q)

# optimize the parameters of the ELBO (theta is frozen above, so only phi is updated)
optimizer = torch.optim.Adam(params=elbo.parameters(), lr=1e-2)
true_evidence_all_sequences = sum(Kalman(hmm.model).filter(observations)[-1] for observations in observation_sequences)

print('True evidence across all sequences:', true_evidence_all_sequences)

# optimizing q: stop once the gap to the true evidence changes by less than 1e-3 between epochs
distance_to_true_objective = torch.abs(true_evidence_all_sequences - torch.inf)
eps = distance_to_true_objective

while eps > 1e-3:
    epoch_loss = 0.0  # accumulates the ELBO (minus the loss) over all sequences
    for observations in observation_sequences:
        optimizer.zero_grad()
        loss = -elbo(observations)
        loss.backward()
        optimizer.step()
        epoch_loss += -loss.detach()  # detach so the graph is not kept alive across steps
    with torch.no_grad():
        new_distance_true_objective = torch.abs(true_evidence_all_sequences - epoch_loss)
        eps = torch.abs(new_distance_true_objective - distance_to_true_objective)
        distance_to_true_objective = new_distance_true_objective
        print('Epoch gain w.r.t to objective:', eps)



True evidence across all sequences: tensor(298.9847)
Epoch gain w.r.t to objective: tensor(inf)
Epoch gain w.r.t to objective: tensor(52984.8457)
Epoch gain w.r.t to objective: tensor(11180.8882)
Epoch gain w.r.t to objective: tensor(2283.8153)
Epoch gain w.r.t to objective: tensor(2449.6759)
Epoch gain w.r.t to objective: tensor(1178.3995)
Epoch gain w.r.t to objective: tensor(902.9804)
Epoch gain w.r.t to objective: tensor(666.8344)
Epoch gain w.r.t to objective: tensor(513.4298)
Epoch gain w.r.t to objective: tensor(404.8531)
Epoch gain w.r.t to objective: tensor(335.3184)
Epoch gain w.r.t to objective: tensor(287.6167)
Epoch gain w.r.t to objective: tensor(255.8584)
Epoch gain w.r.t to objective: tensor(234.5712)
Epoch gain w.r.t to objective: tensor(216.0975)
Epoch gain w.r.t to objective: tensor(200.6539)
Epoch gain w.r.t to objective: tensor(185.4191)
Epoch gain w.r.t to objective: tensor(170.7478)
Epoch gain w.r.t to objective: tensor(156.2305)
Epoch gain w.r.t to objective: t

In [8]:
# checking expectations under the approximate model when the additive functional is just the sum
with torch.no_grad():
    additive_functional = partial(torch.sum, dim=0)
    smoothed_with_true_model = mse_expectation_against_true_states(state_sequences, observation_sequences, hmm.model, additive_functional)
    smoothed_with_approximate_model = mse_expectation_against_true_states(state_sequences, observation_sequences, q, additive_functional)

    print('MSE when smoothed with p:',smoothed_with_true_model)
    print('MSE when smoothed with q:',smoothed_with_approximate_model)

MSE when smoothed with p: tensor(0.0079)
MSE when smoothed with q: tensor(0.0112)


### 1. b. Using a neural network to compute the backward parameters instead of Kalman recursions
We make the same assumptions on $p_\theta$, but now we attempt to recover the backward parameters via a neural network.

## 2. A nonlinear emission model

We now assume that $p_\theta$ has a nonlinear emission distribution, i.e. $x_t = f_\theta(z_t) + \epsilon$.

### 2. a. Approximated by a linear Gaussian model
We keep a linear Gaussian distribution for $q_\phi$, but we add a mapping to compute the expectation of the emission term from $p_\theta$. We need to approximate the following quantity:

$$\mathbb{E}_{q(z_t|z_{t+1}, x_{1:t})}\left[(x_t - f_\theta(z_t))^T R_\theta^{-1}(x_t - f_\theta(z_t))\right]$$

And similarly for the last expectation under the filtering distribution:

$$\mathbb{E}_{q(z_T|x_{1:T})}\left[(x_T - f_\theta(z_T))^T R_\theta^{-1}(x_T - f_\theta(z_T))\right]$$

#### 2. a. i. A sampling-free approach. 


If we know the mean $\mu$ and covariance $\Sigma$ of a random variable $v$ (which need not be Gaussian), then for any matrix $\Omega$:

$$\mathbb{E}_{v}\left[(x - v)^T \Omega (x - v)\right] = \mathrm{tr}(\Sigma \Omega) + (\mu - x)^T \Omega (\mu - x)$$
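
The identity above can be checked numerically. The sketch below uses a small discrete (hence non-Gaussian) distribution for $v$ so that both sides are computed exactly rather than by Monte Carlo; the support points, probabilities, $x$ and $\Omega$ are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# a discrete, non-Gaussian random vector v with 5 support points in R^2
values = rng.normal(size=(5, 2))
probs = np.array([0.1, 0.2, 0.3, 0.25, 0.15])

x = np.array([0.5, -1.0])
Omega = np.array([[2.0, 0.3], [0.3, 1.0]])

# exact mean and covariance of v
mu = probs @ values
centered = values - mu
Sigma = (centered * probs[:, None]).T @ centered

# left-hand side: exact expectation of the quadratic form
lhs = sum(p * (x - v) @ Omega @ (x - v) for p, v in zip(probs, values))

# right-hand side: trace term plus quadratic form at the mean
rhs = np.trace(Sigma @ Omega) + (mu - x) @ Omega @ (mu - x)

print(np.isclose(lhs, rhs))
```

Only the first two moments of $v$ enter the right-hand side, which is what makes a sampling-free estimate possible once those moments are available.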

Suppose we have a neural network which approximates the mean and covariance of $v = f_\theta(z)$ when $z \sim p_z$, given the parameters of $p_z$. Denote by $\tilde{\mu}$ and $\tilde{\Sigma}$ the mean and covariance estimated by this network. For the filtering case, we feed the network with the filtering mean and covariance at $T$ to obtain $\tilde{\mu}$ and $\tilde{\Sigma}$, then:

$$\mathbb{E}_{q(z_T|x_{1:T})}\left[(x_T - f_\theta(z_T))^T R_\theta^{-1}(x_T - f_\theta(z_T))\right] \approx \mathrm{tr}(\tilde{\Sigma} R_\theta^{-1}) + (\tilde{\mu} - x_T)^T R_\theta^{-1} (\tilde{\mu} - x_T)$$

For the backward case this is not as simple, because $\overleftarrow{\mu}_{1:t}$ is a function of $z_{t+1}$; therefore $\mathbb{E}_{q(z_t|z_{t+1}, x_{1:t})}[f_\theta(z_t)]$ and $\mathbb{V}_{q(z_t|z_{t+1}, x_{1:t})}[f_\theta(z_t)]$ are also functions of $z_{t+1}$.

We still attempt to use one network for both the filtering and the backward cases via the following scheme:

- Build a neural network $g_\alpha(A, a, \Sigma)$ which outputs $\tilde{A}, \tilde{a}$ and $\tilde{\Sigma}$
- For the backward case, use $A = \overleftarrow{A}_{1:t}$, $a = \overleftarrow{a}_{1:t}$ and $\Sigma = \overleftarrow{\Sigma}_{1:t}$ (where $\overleftarrow{\mu}_{1:t}(z_{t+1}) = \overleftarrow{A}_{1:t} z_{t+1} + \overleftarrow{a}_{1:t}$), and consider that $\tilde{\mu} = \tilde{A}z_{t+1} + \tilde{a}$, while $\tilde{\Sigma}$ does not depend on $z_{t+1}$ (which is known to be false). In this case, the expected emission term built from $\tilde{A}$ and $\tilde{a}$ is a quadratic form in $z_{t+1}$, as wanted.
- For the filtering case, use $A = 0$, $a = \mu_{1:T}$ and $\Sigma = \Sigma_{1:T}$, and consider that $\tilde{\mu} = \tilde{a}$ (without using the output $\tilde{A}$).

*This method, which is tried below, fails to learn anything as of now.*
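
To make the backward case of the scheme concrete, the sketch below expands the approximated emission term into explicit quadratic-form coefficients in $z_{t+1}$. It assumes $R_\theta^{-1}$ is symmetric and treats $\tilde{A}$, $\tilde{a}$ and $\tilde{\Sigma}$ as given (random stand-ins here, not outputs of an actual network $g_\alpha$). The function name is hypothetical and unrelated to the `_quad_form_in_emission_term` method of the `src` package.

```python
import numpy as np

def emission_quadratic_in_next_state(A_tilde, a_tilde, Sigma_tilde, x, R_inv):
    """Coefficients (W, v, c) such that, for z = z_{t+1},

        tr(Sigma_tilde R_inv) + (A_tilde z + a_tilde - x)^T R_inv (A_tilde z + a_tilde - x)
            = z^T W z + 2 v^T z + c

    Assumes R_inv is symmetric.
    """
    r = a_tilde - x
    W = A_tilde.T @ R_inv @ A_tilde
    v = A_tilde.T @ R_inv @ r
    c = r @ R_inv @ r + np.trace(Sigma_tilde @ R_inv)
    return W, v, c

rng = np.random.default_rng(0)
A_tilde = rng.normal(size=(2, 2))   # stand-in for a network output
a_tilde = rng.normal(size=2)        # stand-in for a network output
M = rng.normal(size=(2, 2))
Sigma_tilde = M @ M.T               # stand-in covariance, positive semi-definite
R_inv = np.linalg.inv(np.array([[0.5, 0.1], [0.1, 0.3]]))
x = rng.normal(size=2)
z_next = rng.normal(size=2)

W, v, c = emission_quadratic_in_next_state(A_tilde, a_tilde, Sigma_tilde, x, R_inv)
quadratic = z_next @ W @ z_next + 2 * v @ z_next + c

# direct evaluation with mu_tilde = A_tilde z_{t+1} + a_tilde
mu_tilde = A_tilde @ z_next + a_tilde
direct = np.trace(Sigma_tilde @ R_inv) + (mu_tilde - x) @ R_inv @ (mu_tilde - x)
print(np.isclose(quadratic, direct))
```

For the filtering case, setting `A_tilde` to zero leaves only the constant `c`, which matches the expression with $\tilde{\mu} = \tilde{a}$.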

In [4]:
hmm = AdditiveGaussianHMM(state_dim=2, obs_dim=2) # we now take an HMM with a nonlinear emission

# sampling 10 sequences from the hmm 
samples = [hmm.sample_joint_sequence(8) for _ in range(10)] 
state_sequences = [sample[0] for sample in samples]
observation_sequences = [sample[1] for sample in samples] 


# the variational model q is a random linear Gaussian HMM with the same dimensions, and we will not learn the covariances for now
q = LinearGaussianHMM.get_random_model(2,2)
q.prior.parametrizations.cov.original.requires_grad = False
q.transition.parametrizations.cov.original.requires_grad = False 
q.emission.parametrizations.cov.original.requires_grad = False 


elbo_nonlinear_emission = get_appropriate_elbo(q_description='linear_gaussian', 
                                            p_description='nonlinear_emission')

elbo = elbo_nonlinear_emission(hmm.model, q)

# print(elbo_nonlinear_emission(observation_sequences[0]))



# optimize the parameters of the ELBO (but theta deactivated above)
optimizer = torch.optim.Adam(params=elbo.parameters(), lr=1e-2)


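# optimizing q; there is no stopping criterion yet, so the loop runs until interrupted
while True:
    epoch_loss = 0.0  # accumulates the ELBO (minus the loss) over all sequences
    for observations in observation_sequences:
        optimizer.zero_grad()
        loss = -elbo(observations)
        loss.backward()
        optimizer.step()
        epoch_loss += -loss.detach()  # detach so the graph is not kept alive across steps
    with torch.no_grad():
        print("Loss:", epoch_loss)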

TypeError: _quad_form_in_emission_term() missing 1 required positional argument: 'z'

#### 2. a. ii. The Johnson trick
