# Experiments on backward variational ICA 

***Uncomment and run the following cell if you're using Colab***

In [None]:
# !rm -rf *
# !git clone https://github.com/mchagneux/backward_ica.git
# !mv backward_ica/* ./
# !rm -rf backward_ica

### Imports

In [None]:
from functools import partial
from src.eval import mse_expectation_against_true_states
from src.kalman import Kalman, NumpyKalman
from src.hmm import AdditiveGaussianHMM, LinearGaussianHMM
from src.elbo import LinearGaussianELBO
import torch
from tqdm import tqdm
torch.set_default_dtype(torch.float64)  # use double precision for all new tensors
# torch.set_printoptions(precision=10)

## sanity checks
hmm = LinearGaussianHMM(state_dim=2, obs_dim=2)
states, observations = hmm.sample_joint_sequence(10)

for param in hmm.model.parameters():
    param.requires_grad = False
likelihood_torch = Kalman(hmm.model).filter(observations)[4]  # Kalman filter with torch operators
likelihood_numpy = NumpyKalman(hmm.model).filter(observations.numpy())[2]  # Kalman filter with numpy operators
likelihood_via_elbo = LinearGaussianELBO(hmm.model, hmm.model)(observations)  # ELBO with q = p

# both should be close to 0
print(likelihood_numpy - likelihood_torch)
print(likelihood_numpy - likelihood_via_elbo)

## Introduction

This notebook presents a series of experiments that attempt to recover expectations $\mathbb{E}[h(z_{1:t})|x_{1:t}]$ via variational approximations, when the process $(z_t, x_t)_{t \ge 1}$ is an HMM. The main metric $\ell$ throughout is the MSE against the true states when $h$ is a plain sum, i.e.,

$$\ell = \left(\sum_{t=1}^T z_t^* - \sum_{t=1}^T \mathbb{E}_{q_T(z_t)}[z_t] \right)^2$$

where $q_T(z_t) = q(z_t|x_{1:T})$ is the marginal smoothing distribution at $t$.
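
As a minimal sketch of this metric, the snippet below computes $\ell$ from a sequence of true states and a sequence of smoothed means, both given as lists of per-time-step vectors. It assumes that the square in the formula denotes the squared Euclidean norm of the difference of the two sums. The helper name `additive_mse` is hypothetical and is not the notebook's `mse_expectation_against_true_states`.

```python
def additive_mse(true_states, smoothed_means):
    """Squared Euclidean distance between the sums over time of two sequences.

    Both arguments are lists of length T whose entries are lists of length d.
    Assumption: the square in the formula is the squared Euclidean norm.
    """
    if len(true_states) != len(smoothed_means):
        raise ValueError("sequences must have the same length")
    # Sum each coordinate over time for both sequences.
    sum_true = [sum(coords) for coords in zip(*true_states)]
    sum_smoothed = [sum(coords) for coords in zip(*smoothed_means)]
    # Squared norm of the difference of the two sums.
    return sum((a - b) ** 2 for a, b in zip(sum_true, sum_smoothed))


# Toy example with T = 3 and d = 2 (made-up numbers).
z_true = [[1.0, 0.0], [2.0, 1.0], [3.0, 1.0]]
z_hat = [[1.0, 0.5], [2.0, 1.0], [2.0, 1.0]]
loss = additive_mse(z_true, z_hat)
```

Because the sums are taken before squaring, per-step errors of opposite sign can cancel, so a small $\ell$ does not by itself imply that every $z_t$ is well estimated.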

In all the following, we assume that the variational smoothing distribution factorizes as $q_\phi(z_{1:t}|x_{1:t}) = q_\phi(z_t|x_{1:t}) \prod_{s=1}^{t-1} q_\phi(z_s|z_{s+1},x_{1:s})$. We always assume that

$$q_\phi(z_t|x_{1:t}) \sim \mathcal{N}(\mu_{1:t}, \Sigma_{1:t})$$

and

$$q_\phi(z_s|z_{s+1},x_{1:s}) \sim \mathcal{N}(\overleftarrow{\mu}_{1:s}(z_{s+1}), \overleftarrow{\Sigma}_{1:s})$$
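
To illustrate how this backward factorization yields the marginal smoothing distributions used in $\ell$, here is a minimal sketch for scalar states. It assumes that the backward mean is affine in $z_{s+1}$ (which holds in the linear Gaussian case, but is an assumption in general); the function name and the arguments `gains`, `offsets` and `back_vars` are hypothetical. The means follow from the tower property and the variances from the law of total variance.

```python
def smoothing_marginals(mu_T, var_T, gains, offsets, back_vars):
    """Marginal means and variances of q(z_s | x_{1:T}) for s = 1..T (scalar states).

    Assumption: the backward kernel at step s is
    N(gains[s] * z_next + offsets[s], back_vars[s]), i.e. its mean is affine
    in z_{s+1}. The three lists have length T - 1 and are ordered s = 1..T-1.
    """
    means, variances = [mu_T], [var_T]
    # Walk backward from s = T - 1 down to s = 1.
    for g, c, v in zip(reversed(gains), reversed(offsets), reversed(back_vars)):
        m_next, p_next = means[0], variances[0]
        means.insert(0, g * m_next + c)          # tower property
        variances.insert(0, g * g * p_next + v)  # law of total variance
    return means, variances


# Toy example with T = 3 (made-up numbers).
means, variances = smoothing_marginals(
    mu_T=1.0, var_T=0.5, gains=[0.5, 0.8], offsets=[0.1, 0.0], back_vars=[0.2, 0.3]
)
```

Summing `means` over time then gives the second term of $\ell$ for a scalar state.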

In the following, we make several assumptions on both $p_\theta$ and $q_\phi$.


When $p_\theta$ is a linear Gaussian HMM and $q_\phi$ belongs to the same family, not only should the expectations be correctly recovered, but the parameters $\phi$ and $\theta$ should also be identifiable. In this setting, the best estimate of $z_{1:t}^*$ for any sequence is obtained via the Kalman smoothing recursions applied with parameters $\theta$ to the observations $x_{1:t}$.



## 1. Linear Gaussian HMM 

First we assume that observation sequences $x_{1:T}$ arise from $p_\theta(z_{1:T},x_{1:T})$ defined as
$$z_t = A_\theta z_{t-1} + a_\theta + \eta_\theta$$ 
$$x_t = B_\theta z_t + b_\theta + \epsilon_\theta$$

where $\eta_\theta \sim \mathcal{N}(0,Q_\theta)$ and $\epsilon_\theta \sim \mathcal{N}(0,R_\theta)$.
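
The generative model above can be sketched in a few lines for scalar states and observations. This is only an illustration and not the implementation behind `LinearGaussianHMM.sample_joint_sequence`. The function name is hypothetical, `Q` and `R` are variances, and the Gaussian initial distribution on $z_1$ is an assumption, since the equations above do not specify it.

```python
import math
import random


def sample_lg_hmm(T, A, a, Q, B, b, R, prior_mean=0.0, prior_var=1.0, seed=0):
    """Sample (z_{1:T}, x_{1:T}) from a scalar linear Gaussian HMM.

    Q and R are variances. Assumption: z_1 ~ N(prior_mean, prior_var).
    """
    rng = random.Random(seed)
    states, observations = [], []
    for t in range(T):
        if t == 0:
            # Initial state drawn from the (assumed) Gaussian prior.
            z = rng.gauss(prior_mean, math.sqrt(prior_var))
        else:
            # Transition: z_t = A z_{t-1} + a + eta, with eta ~ N(0, Q).
            z = A * states[-1] + a + rng.gauss(0.0, math.sqrt(Q))
        # Emission: x_t = B z_t + b + eps, with eps ~ N(0, R).
        x = B * z + b + rng.gauss(0.0, math.sqrt(R))
        states.append(z)
        observations.append(x)
    return states, observations


states, observations = sample_lg_hmm(T=8, A=0.9, a=0.0, Q=0.1, B=1.0, b=0.0, R=0.05)
```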

### 1. a. Approximated by a linear Gaussian HMM

We start by recovering $p_\theta$ when $q_\phi$ is in the family of the true model. We do this by specifying $q_\phi$ in forward time with the same HMM structure as $p_\theta$ (but random initial parameters). In this case, the parameters of the filtering and backward distributions are available in closed form via the Kalman recursions.
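
For reference, here is a sketch of the closed-form quantities involved, written for a variational model with parameters $(A_\phi, a_\phi, Q_\phi, B_\phi, b_\phi, R_\phi)$. The symbols $\mu_{s+1|s}$, $\Sigma_{s+1|s}$, $K_{s+1}$ and $G_s$ are notation introduced here for illustration (the standard Kalman prediction, Kalman gain and backward gain); they may not match the names used in `src.kalman`. Starting from the filtering distribution $\mathcal{N}(\mu_{1:s}, \Sigma_{1:s})$ at step $s$, one filtering step reads

$$\begin{aligned}
\mu_{s+1|s} &= A_\phi \mu_{1:s} + a_\phi, & \Sigma_{s+1|s} &= A_\phi \Sigma_{1:s} A_\phi^\top + Q_\phi, \\
K_{s+1} &= \Sigma_{s+1|s} B_\phi^\top \left(B_\phi \Sigma_{s+1|s} B_\phi^\top + R_\phi\right)^{-1}, & & \\
\mu_{1:s+1} &= \mu_{s+1|s} + K_{s+1}\left(x_{s+1} - B_\phi \mu_{s+1|s} - b_\phi\right), & \Sigma_{1:s+1} &= \left(I - K_{s+1} B_\phi\right) \Sigma_{s+1|s},
\end{aligned}$$

and the backward kernel $q_\phi(z_s|z_{s+1},x_{1:s})$ is Gaussian with

$$\begin{aligned}
G_s &= \Sigma_{1:s} A_\phi^\top \Sigma_{s+1|s}^{-1}, \\
\overleftarrow{\mu}_{1:s}(z_{s+1}) &= \mu_{1:s} + G_s\left(z_{s+1} - \mu_{s+1|s}\right), \\
\overleftarrow{\Sigma}_{1:s} &= \Sigma_{1:s} - G_s \Sigma_{s+1|s} G_s^\top.
\end{aligned}$$

In particular, the backward mean is affine in $z_{s+1}$ and the backward covariance does not depend on $z_{s+1}$.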

In [None]:
hmm = LinearGaussianHMM(state_dim=2, obs_dim=2) # pick some true model p 
for param in hmm.model.parameters(): param.requires_grad = False # not learning the parameters of the true model for now 



# sampling 10 sequences from the hmm 
samples = [hmm.sample_joint_sequence(8) for _ in range(10)] 
state_sequences = [sample[0] for sample in samples]
observation_sequences = [sample[1] for sample in samples] 


# the variational model is a random linear Gaussian HMM with the same dimensions, and we will not learn the covariances for now
v_model = LinearGaussianHMM.get_random_model(2,2)
v_model.prior.parametrizations.cov.original.requires_grad = False
v_model.transition.parametrizations.cov.original.requires_grad = False 
v_model.emission.parametrizations.cov.original.requires_grad = False 

# the elbo object with p and q as arguments
elbo = LinearGaussianELBO(hmm.model, v_model)

# optimize the parameters of the ELBO (but theta deactivated above)
optimizer = torch.optim.Adam(params=elbo.parameters(), lr=1e-2)
true_evidence_all_sequences = sum(Kalman(hmm.model).filter(observations)[-1] for observations in observation_sequences)

print('True evidence across all sequences:', true_evidence_all_sequences)

eps = torch.inf
# optimize until the summed ELBO is within 0.1 of the true evidence
while eps > 0.1:
    epoch_elbo = 0.0
    for observations in observation_sequences:
        optimizer.zero_grad()
        loss = -elbo(observations)
        loss.backward()
        optimizer.step()
        epoch_elbo += -loss.detach()  # accumulate the ELBO without keeping the autograd graph
    with torch.no_grad():
        eps = torch.abs(true_evidence_all_sequences - epoch_elbo)
        print('Absolute gap between L(theta, phi) and log(p_theta(x)), summed over sequences:', eps)

In [None]:
# checking expectations under approximate model 
with torch.no_grad():
    additive_functional = partial(torch.sum, dim=0)
    smoothed_with_true_model = mse_expectation_against_true_states(state_sequences, observation_sequences, hmm.model, additive_functional)
    smoothed_with_approximate_model = mse_expectation_against_true_states(state_sequences, observation_sequences, v_model, additive_functional)

    print('MSE when smoothed with true model:', smoothed_with_true_model)
    print('MSE when smoothed with variational model:', smoothed_with_approximate_model)

### 1. b. Using a neural network to compute the backward parameters instead of Kalman recursions
We make the same assumptions on $p_\theta$, but now we attempt to recover the backward parameters via a neural network.

## 2. A nonlinear emission model

We now assume that $p_\theta$ has a nonlinear emission distribution, i.e., $x_t = f_\theta(z_t) + \epsilon$.
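
As a rough illustration of such an emission model for scalar states, the sketch below draws observations given a state sequence. The choice of `f_theta` (a tanh of an affine map) and the Gaussian noise with variance `R` are assumptions made for the example only; the text above does not fix either.

```python
import math
import random


def sample_nonlinear_emissions(states, f, R, seed=0):
    """Draw x_t = f(z_t) + eps_t with eps_t ~ N(0, R) for scalar states."""
    rng = random.Random(seed)
    return [f(z) + rng.gauss(0.0, math.sqrt(R)) for z in states]


def f_theta(z):
    """Hypothetical nonlinear emission map, chosen only for illustration."""
    return math.tanh(2.0 * z - 0.5)


xs = sample_nonlinear_emissions([0.0, 0.25, 1.0], f_theta, R=0.01)
```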

### 2. a. Approximated by a linear Gaussian model
We keep a linear Gaussian distribution for $q_\phi$, but we add a mapping to compute the expectation of the emission term from $p_\theta$.