# analytical approach

- analytically tractable problem: Gaussian prior, Gaussian proposal, _linear-Gaussian likelihood_
- analytically tractable MDN: linear-affine network
- analytically tractable gradients and closed-form solution for MDN parameters for given dataset


TO DO:
- check gradients again (and again and again...)
- analytical division for actual CDELFI posterior estimates (vs. 'proposal-posterior' estimates atm)


prior: 

$p(\theta) = \mathcal{N}(\theta \ | \ 0, \eta^2)$

proposal prior: 

$\tilde{p}(\theta) = \mathcal{N}(\theta \ | \ \nu, \xi^2)$

simulator: 

$p(x \ | \ \theta) =  \mathcal{N}(x \ | \ \theta, \sigma^2)$

analytic posteriors: 

$p(\theta \ | \ x) = \mathcal{N}(\theta \ | \frac{\eta^2}{\eta^2 + \sigma^2} x, \eta^2 - \frac{\eta^4}{\eta^2 + \sigma^2})$ 

$\tilde{p}(\theta \ | \ x) = \mathcal{N}(\theta \ | \frac{\xi^2}{\xi^2 + \sigma^2} x + \frac{\sigma^2}{\xi^2 + \sigma^2} \nu, \xi^2 - \frac{\xi^4}{\xi^2 + \sigma^2})$
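
A minimal sketch evaluating the two analytic posteriors above for concrete numbers. The helper name `gaussian_posterior` is made up for illustration and is not part of `util` or `delfi`; the parameter values mirror the setup cell further down, with one example proposal variance.

```python
import numpy as np

def gaussian_posterior(x, prior_mean, prior_var, noise_var):
    """Posterior (mean, var) of theta for x ~ N(theta, noise_var), theta ~ N(prior_mean, prior_var)."""
    gain = prior_var / (prior_var + noise_var)        # weight given to the observation x
    mean = gain * x + (1.0 - gain) * prior_mean
    var = prior_var - prior_var**2 / (prior_var + noise_var)
    return mean, var

eta2, sig2, x0 = 1.0, 1.0 / 9.0, 0.8   # prior variance, likelihood variance, observation
ksi2 = 0.1 * eta2                      # one example proposal variance
nu = eta2 / (eta2 + sig2) * x0         # proposal mean (set to the true posterior mean)

post = gaussian_posterior(x0, 0.0, eta2, sig2)       # p(theta | x0)
prop_post = gaussian_posterior(x0, nu, ksi2, sig2)   # proposal-posterior
print(post, prop_post)
```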

Data:

$(x_n, \theta_n) \sim \tilde{p}(\theta) p(x \ | \ \theta) = \mathcal{N}( (x_n, \theta_n) \ | \ (\nu, \nu), 
\begin{pmatrix}
\xi^{2} + \sigma^{2} &  \xi^{2}  \\
\xi^{2} & \xi^{2}  \\
\end{pmatrix})$
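
A quick empirical sanity check of the joint moments of the training data, assuming parameters are drawn from a Gaussian proposal and pushed through the linear-Gaussian simulator. Plain `numpy` sampling stands in for the `delfi` generator here, and the numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nu, ksi2, sig2 = 0.72, 0.1, 1.0 / 9.0   # illustrative proposal mean/variance and noise variance
n = 200_000

theta = nu + np.sqrt(ksi2) * rng.standard_normal(n)   # theta_n drawn from the proposal prior
x = theta + np.sqrt(sig2) * rng.standard_normal(n)    # x_n drawn from the simulator given theta_n

emp_mean = np.array([x.mean(), theta.mean()])
emp_cov = np.cov(np.stack([x, theta]))                # rows are the variables (x, theta)

mean_expected = np.array([nu, nu])                    # E[x] = E[theta] = nu, since x = theta + noise
cov_expected = np.array([[ksi2 + sig2, ksi2],
                         [ksi2,        ksi2]])
print(np.abs(emp_mean - mean_expected).max(), np.abs(emp_cov - cov_expected).max())
```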

Loss: 

$ \mathcal{L}(\phi) = \sum_n \frac{{p}(\theta_n)}{\tilde{p}(\theta_n)} K_\epsilon(x_n | x_0) \ \log q_\phi(\theta_n | x_n)$

Model: 

$ q_\phi(\theta_n | x_n) = \mathcal{N}(\theta_n \ | \ \mu_\phi(x_n), \sigma^2_\phi(x_n))$

$ (\mu_\phi(x), \sigma^2_\phi(x)) = MDN_\phi(x) = \begin{pmatrix} \beta \\ 0 \end{pmatrix} x + \begin{pmatrix} \alpha \\ \gamma^2 \end{pmatrix}$

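Gradients (stated for the rescaled loss $-2 \mathcal{L}$, with the importance weights $\frac{p(\theta_n)}{\tilde{p}(\theta_n)} K_\epsilon(x_n | x_0)$ written as $\mathcal{N}_n$, equal up to a constant factor; neither change moves the optima): 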

$\mathcal{N}_n := \mathcal{N}(x_n, \theta_n \ | \ \mu_y, \Sigma_y)$

$\Sigma_y = 
\begin{pmatrix}
\epsilon^2  &  0  \\
0 & \left( \eta^{-2} - \xi^{-2} \right)^{-1}  \\
\end{pmatrix}$


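$\mu_y = \begin{pmatrix} x_0  \\ \frac{\eta^2}{\eta^2 - \xi^2}\nu \end{pmatrix}$

(the $\theta$-entry follows from completing the square in $\log p(\theta) - \log \tilde{p}(\theta)$: precision $\eta^{-2} - \xi^{-2}$, precision-weighted mean $-\xi^{-2} \nu$)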

$\frac{\partial{}\mathcal{L}}{\partial{}\alpha} = -2 \sum_n \mathcal{N}_n \frac{\theta_n - \mu_\phi(x_n)}{\sigma^2_\phi(x_n)}$

$\frac{\partial{}\mathcal{L}}{\partial{}\beta} = -2 \sum_n \mathcal{N}_n \frac{\theta_n - \mu_\phi(x_n)}{\sigma^2_\phi(x_n)} x_n$

$\frac{\partial{}\mathcal{L}}{\partial{}\gamma^2} 
= \sum_n \mathcal{N}_n \left( \frac{1}{\sigma^2_\phi(x_n)} 
- \frac{\left(\theta_n - \mu_\phi(x_n) \right)^2}{\sigma^4_\phi(x_n)} \right) 
= \frac{1}{\gamma^2} \sum_n \mathcal{N}_n \left( 1 
- \frac{\left(\theta_n - \mu_\phi(x_n) \right)^2}{\gamma^2} \right) $

Optima: 

$\hat{\alpha} = 
\frac{\sum_n \mathcal{N}_n \left(\theta_n - \frac{\sum_m \mathcal{N}_m \theta_m x_m}{\sum_m \mathcal{N}_m x_m^2} x_n \right)}{\sum_n \mathcal{N}_n - \frac{\left( \sum_n \mathcal{N}_n x_n \right)^2}{\sum_n \mathcal{N}_n x_n^2}}$

$\hat{\beta} = 
\frac{\sum_n \mathcal{N}_n \theta_n x_n - \hat{\alpha} \sum_n \mathcal{N}_n x_n}{\sum_n \mathcal{N}_n x_n^2}$

$\hat{\gamma}^2 = 
\frac{\sum_n \mathcal{N}_n \left( \theta_n - \hat{\alpha} - \hat{\beta} x_n \right)^2}{\sum_n \mathcal{N}_n}$
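
A small self-contained check that the closed-form optima above do set the three stated gradients to zero. The functions `closed_form_optimum` and `gradients` are written only for this sketch (they are not the `util.alpha`, `util.beta`, `util.gamma2` implementations), and the weights are arbitrary positive numbers standing in for $\mathcal{N}_n$.

```python
import numpy as np

def closed_form_optimum(theta, x, w):
    """Weighted optimum for mu(x) = alpha + beta * x with constant variance gamma2."""
    sw, swx, swxx = w.sum(), (w * x).sum(), (w * x**2).sum()
    swt, swtx = (w * theta).sum(), (w * theta * x).sum()
    alpha = (swt - swtx / swxx * swx) / (sw - swx**2 / swxx)
    beta = (swtx - alpha * swx) / swxx
    gamma2 = (w * (theta - alpha - beta * x) ** 2).sum() / sw
    return alpha, beta, gamma2

def gradients(theta, x, w, alpha, beta, gamma2):
    """Gradients in the form stated above, with w playing the role of N_n."""
    r = theta - alpha - beta * x
    d_alpha = -2.0 * (w * r / gamma2).sum()
    d_beta = -2.0 * (w * r / gamma2 * x).sum()
    d_gamma2 = (w * (1.0 - r**2 / gamma2)).sum() / gamma2
    return d_alpha, d_beta, d_gamma2

rng = np.random.default_rng(1)
theta = 0.7 + 0.3 * rng.standard_normal(50)
x = theta + 0.3 * rng.standard_normal(50)
w = rng.uniform(0.1, 1.0, size=50)     # arbitrary positive weights (assumption)

a_hat, b_hat, g2_hat = closed_form_optimum(theta, x, w)
grads = gradients(theta, x, w, a_hat, b_hat, g2_hat)
print(a_hat, b_hat, g2_hat, grads)     # all three gradients should be close to zero
```

For positive weights, $(\hat{\beta}, \hat{\alpha})$ should also agree with `np.polyfit(x, theta, 1, w=np.sqrt(w))`, since `polyfit` squares its weights together with the residuals.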

# define problem setup

- proposals narrower than the prior seem to induce some form of bias in the MLE estimates (non-SVI fits), especially for the posterior variances
- up to a certain degree, this can be mitigated by switching to Student's t proposals (with few degrees of freedom)
- apart from the bias, even Student's t proposals result in increased variance of the estimates (probably an effective sample size (ESS) effect?)
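
To put a number on the ESS suspicion, here is a rough sketch that computes the effective sample size fraction of the importance weights $p(\theta)/\tilde{p}(\theta)$ for Gaussian proposals of different widths. The function name, the proposal mean `nu = 0.72` and the large sample size are assumptions made for this illustration only.

```python
import numpy as np

def ess_fraction(theta, eta2, nu, ksi2):
    """ESS / N of the weights prior(theta) / proposal(theta), both Gaussian."""
    # log prior minus log proposal; normalising constants drop out of the ESS
    log_w = -0.5 * theta**2 / eta2 + 0.5 * (theta - nu) ** 2 / ksi2
    w = np.exp(log_w - log_w.max())          # rescale for numerical stability
    return w.sum() ** 2 / (w**2).sum() / len(theta)

rng = np.random.default_rng(0)
eta2, nu = 1.0, 0.72
n = 100_000   # large sample (assumption) so that the ESS estimate itself is stable
fractions = {}
for ksi2 in (0.01, 0.1, 0.999):
    theta = nu + np.sqrt(ksi2) * rng.standard_normal(n)   # draws from the Gaussian proposal
    fractions[ksi2] = ess_fraction(theta, eta2, nu, ksi2)
print(fractions)
```

For a Gaussian proposal with $\xi^2 \leq \eta^2 / 2$ the weights have infinite variance, so the ESS fraction keeps shrinking as more samples are drawn.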


- generally speaking, the real challenge for SNPE arises when $\sigma^2 \ll \eta^2$, i.e. when the likelihood (and hence the posterior) is much narrower than the prior
- in a multi-round SNPE fit, the proposal will tend to be much narrower than the prior after several rounds
- the 'relevant' (relative) proposal width is thus the one that corresponds to the posterior variance
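
As a quick reference, the ratio of posterior to prior variance in this model is $\sigma^2 / (\eta^2 + \sigma^2)$. The sketch below tabulates it for a few noise levels; the helper name is hypothetical.

```python
import numpy as np

def posterior_to_prior_variance(sig2, eta2):
    """Analytic posterior variance divided by prior variance: sig2 / (eta2 + sig2)."""
    post_var = eta2 - eta2**2 / (eta2 + sig2)
    return post_var / eta2

eta2 = 1.0
for sig2 in (1.0, 1.0 / 9.0, 1.0 / 99.0):
    print(sig2, posterior_to_prior_variance(sig2, eta2))
```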


- for a simple setup that can be largely 'fixed' with Student's t proposals, choose $\sigma^2 = \eta^2$ (i.e. posterior variance is $50\%$ of prior variance)

- intermediate setup:  $\eta^2 = 9 \cdot \sigma^2$ (i.e. posterior variance is $10\%$ of prior variance)

- to watch Rome burn, try  $\eta^2 = 99 \cdot \sigma^2$ (i.e. posterior variance is $1\%$ of prior variance)

In [None]:
%%capture 
import util
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

import delfi.distribution as dd
import delfi.generator as dg
import delfi.inference as infer
import delfi.summarystats as ds
from delfi.simulator.Gauss import Gauss


## problem setup ##

n_params = 1

assert n_params == 1 # cannot be overstressed: everything in this notebook goes downhill sharply otherwise

sig2 = 1.0/9. # likelihood variance
eta2 = 1.0     # prior variance
eps2 = 1e20    # calibration kernel width (everything above a certain threshold will be treated as 'uniform')

# pick observed summary statistics
x0 = 0.8 * np.ones(1) #_,obs = g.gen(1) 

## simulation setup ##

n_fits = 200  # number of MLE fits (i.e. dataset draws); each is a single-round fit with a pre-specified proposal
N      = 300  # number of simulations per dataset

# set proposal priors (one per experiment)
ksi2s = np.array([0.01, 0.1, 0.999]) * eta2  # proposal variance
nus = eta2/(eta2+sig2)*x0[0]* np.ones(len(ksi2s))              # proposal mean

n_bins = 50 # number of bins for plotting


res = {'normal'  : np.zeros((len(ksi2s), n_fits,5)),
       't_df10'  : np.zeros((len(ksi2s), n_fits,5)),
       't_df3'   : np.zeros((len(ksi2s), n_fits,5)),
       'cdelfi'  : np.zeros((len(ksi2s), n_fits,5)),
       'unif_w1' : np.zeros((len(ksi2s), n_fits,5)),
       'unif_w6' : np.zeros((len(ksi2s), n_fits,5)),
       'sig2' : sig2,
       'eta2' : eta2,
       'eps2' : eps2,
       'ksi2s' : ksi2s,
       'nus' : nus,
       'x0' : x0,
      }

# SNPE (Gaussian proposals) 
## multiple fits, with different proposal widths
- n_fits many runs, every run with different seed, i.e. different data set $\{(x_n, \theta_n)\}_{n=1}^N$
- collect statistics on mean and std. of estimated posterior mean and variance (bias?)
- outer loop over different proposal distribution variances (can in principle also move means)
  
Keep in mind that for large enough $N$, in theory SNPE should return ground-truth posteriors irrespective of the proposal prior (cf. Kaan's proof in JakobsNotes.pdf)

In [None]:
proposal_form = 'normal'
df = None

out_snpe  = res[proposal_form]

out_snpe = util.test_setting(out_snpe, n_params, N, sig2, eta2, eps2, x0, ksi2s, nus, proposal_form, track_rp=True, if_plot=False);


# SNPE (Student's t proposals, df = 10) 
## multiple fits, with different proposal widths
- same as before, but with Student's t proposals instead of Gaussian proposals (same mean and variance)
- df $=10$ in 1D : roughly Gaussian, but with slightly heavier tails


In [None]:
proposal_form = 'studentT'
df = 10

out_snpe  = res['t_df' + str(df)]

util.test_setting(out_snpe, n_params, N, sig2, eta2, eps2, x0, ksi2s, nus, proposal_form, track_rp=True, df=df);

# SNPE (Student's t proposals, df = 3) 
## multiple fits, with different proposal widths
- same as before, but with Student's t proposals instead of Gaussian proposals (same mean and variance)
- df $=3$ in 1D : much heavier tails than Gaussian (need df>2 for a well-defined Student's t variance)
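
One way to get a Student's t proposal with the 'same mean and variance' as a Gaussian one is to rescale a standard t draw, whose variance is $\mathrm{df} / (\mathrm{df} - 2)$. The sketch below assumes this construction; whether `util.test_setting` does exactly this is not verified here, and `sample_student_t` is a hypothetical helper.

```python
import numpy as np

def sample_student_t(rng, n, df, mean, var):
    """Draw n samples from a Student's t with the requested mean and variance (needs df > 2)."""
    if df <= 2:
        raise ValueError("variance is only defined for df > 2")
    scale = np.sqrt(var * (df - 2.0) / df)   # standard t has variance df / (df - 2)
    return mean + scale * rng.standard_t(df, size=n)

rng = np.random.default_rng(0)
nu, ksi2 = 0.72, 0.1                         # illustrative proposal mean and variance
samples = sample_student_t(rng, 400_000, df=10, mean=nu, var=ksi2)
print(samples.mean(), samples.var())         # should be close to nu and ksi2
```

For df $=3$ the same scaling applies, but the sample variance converges slowly because the fourth moment is infinite.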


In [None]:
proposal_form = 'studentT'
df = 3

out_snpe  = res['t_df' + str(df)]

util.test_setting(out_snpe, n_params, N, sig2, eta2, eps2, x0, ksi2s, nus, proposal_form, track_rp=True, df=df);

# uniform proposals
- uniform proposals 'cut' the prior tails, directly influencing the target posterior
- the importance sampling on this narrower prior may however be easier to achieve
- can we balance the damage to the posterior against better importance sampling and gain something overall?
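
To get a feeling for the 'damage to the posterior', the sketch below computes the mean and variance of the posterior under a prior truncated to an interval, by brute-force integration on a grid. It assumes a uniform proposal on $[lo, hi]$ effectively restricts the prior to that interval; the helper name and the half-widths are made up for illustration.

```python
import numpy as np

def truncated_posterior_moments(x0, eta2, sig2, lo, hi, n_grid=200_001):
    """Mean and variance of the posterior when the Gaussian prior is restricted to [lo, hi]."""
    theta = np.linspace(lo, hi, n_grid)
    # unnormalised log prior + log likelihood on the grid
    log_dens = -0.5 * theta**2 / eta2 - 0.5 * (x0 - theta) ** 2 / sig2
    dens = np.exp(log_dens - log_dens.max())
    dens /= dens.sum()                        # normalise on the uniform grid
    mean = (dens * theta).sum()
    var = (dens * (theta - mean) ** 2).sum()
    return mean, var

eta2, sig2, x0 = 1.0, 1.0 / 9.0, 0.8
nu = eta2 / (eta2 + sig2) * x0                # centre of the uniform proposal
moments = {}
for half_width in (0.3, 1.0, 3.0):            # illustrative half-widths (assumption)
    moments[half_width] = truncated_posterior_moments(x0, eta2, sig2, nu - half_width, nu + half_width)
print(moments)
```

With a wide interval the untruncated analytic posterior is recovered; a narrow interval shrinks the variance of the target.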

In [None]:
proposal_form = 'unif'
sds = 1
marg = np.sqrt(sds) * np.sqrt(12)
out_snpe  = res['unif_w'+str(sds)]

util.test_setting(out_snpe, n_params, N, sig2, eta2, eps2, x0, ksi2s, nus, proposal_form, track_rp=True, marg=marg);

# compare with CDELFI
- simply set track_rp = False to compare MLE solution with mean and variance of proposal-posterior
- this however requires Gaussian proposals, so that the fit can be compared directly with the proposal-posterior
- if the proposal-posterior is good, so should be the prior-corrected result of the analytical division
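
The analytical division can be sketched for the 1D Gaussian case as follows: multiply the proposal-posterior by the prior and divide by the proposal, which adds and subtracts precisions. `divide_out_proposal` is a hypothetical helper written for this note, not a `delfi` function, and the proposal values are examples.

```python
import numpy as np

def divide_out_proposal(m_pp, v_pp, prior_mean, prior_var, prop_mean, prop_var):
    """Gaussian proposal-posterior * prior / proposal, returned as (mean, variance)."""
    precision = 1.0 / v_pp + 1.0 / prior_var - 1.0 / prop_var
    if precision <= 0:
        raise ValueError("corrected precision is not positive; result is not a proper Gaussian")
    mean = (m_pp / v_pp + prior_mean / prior_var - prop_mean / prop_var) / precision
    return mean, 1.0 / precision

eta2, sig2, x0 = 1.0, 1.0 / 9.0, 0.8
nu, ksi2 = 0.72, 0.1                                            # example Gaussian proposal
m_pp = ksi2 / (ksi2 + sig2) * x0 + sig2 / (ksi2 + sig2) * nu    # analytic proposal-posterior mean
v_pp = ksi2 - ksi2**2 / (ksi2 + sig2)                           # analytic proposal-posterior variance

corrected = divide_out_proposal(m_pp, v_pp, 0.0, eta2, nu, ksi2)
print(corrected)   # should match the analytic posterior mean and variance
```

The precision check matters in practice: an estimated proposal-posterior that is too wide can make the corrected precision negative.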


In [None]:
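proposal_form = 'normal'

# write into the CDELFI slot instead of reusing the uniform-proposal array from the previous cell
out_snpe = res['cdelfi']

util.test_setting(out_snpe, n_params, N, sig2, eta2, eps2, x0, ksi2s, nus, proposal_form, track_rp=False);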

In [None]:

# res is a dict, so np.save pickles it; reload with np.load(<file>.npy, allow_pickle=True).item()
np.save('res_analytic_n_fits' + str(n_fits) + '_N' + str(N), res)


# numerical checks for gradients


In [None]:
N = 3
track_rp = True
proposal_form = 'normal'
df = None

nu = 0.
ksi2 = 0.5 * eta2

ppr = dd.Gaussian(m=nu * np.ones(n_params), 
                S=ksi2 * np.eye(n_params))
s = ds.Identity()
m = Gauss(noise_cov=sig2)
g = dg.Default(model=m, prior=ppr, summary=s)

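# analytic posterior; parentheses make the whole variance multiply the identity (matters for n_params > 1)
post   = dd.Gaussian(m = np.ones(n_params) * eta2/(eta2+sig2)*x0[0], 
                     S=(eta2 - eta2**2 / (eta2 + sig2)) * np.eye(n_params)) 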
    
seed = 42
g.model.seed = seed
g.prior.seed = seed
g.seed = seed

data = g.gen(N, verbose=False)
params, stats = data[0].reshape(-1), data[1].reshape(-1)

normals = util.get_weights(proposal_form, eta2, ksi2, eps2, x0, nu, stats, params, df=df) if track_rp else np.ones(N)/N


# numerically check $\frac{\partial}{\partial{}\alpha}$

- $\frac{\partial}{\partial\alpha}$ is being difficult here. The analytic solution $\hat{\alpha}$ still fails to numerically set the stated partial derivative $\frac{\partial\mathcal{L}}{\partial{}\alpha}(\hat{\alpha})$ to zero ...
- The obtained $\hat{\alpha}$ is nevertheless sensible (correct 'ballpark')
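
One way to separate an error in the gradient formulas from an error in $\hat{\alpha}$ is a finite-difference check of the gradients at an arbitrary point. The sketch below assumes the stated gradients belong to $-2 \times$ the weighted Gaussian log-likelihood; all function names, the random weights and the test point are made up for this check.

```python
import numpy as np

def loss(phi, theta, x, w):
    """-2 * weighted Gaussian log-likelihood for mu = alpha + beta * x, variance gamma2."""
    alpha, beta, gamma2 = phi
    r = theta - alpha - beta * x
    log_q = -0.5 * np.log(2.0 * np.pi * gamma2) - 0.5 * r**2 / gamma2
    return -2.0 * (w * log_q).sum()

def analytic_grad(phi, theta, x, w):
    """Gradients w.r.t. (alpha, beta, gamma2) in the form stated in the first section."""
    alpha, beta, gamma2 = phi
    r = theta - alpha - beta * x
    return np.array([-2.0 * (w * r / gamma2).sum(),
                     -2.0 * (w * r / gamma2 * x).sum(),
                     (w * (1.0 - r**2 / gamma2)).sum() / gamma2])

def finite_diff_grad(phi, theta, x, w, h=1e-6):
    """Central finite differences of the loss, one parameter at a time."""
    grad = np.zeros(3)
    for i in range(3):
        step = np.zeros(3)
        step[i] = h
        grad[i] = (loss(phi + step, theta, x, w) - loss(phi - step, theta, x, w)) / (2.0 * h)
    return grad

rng = np.random.default_rng(2)
theta = 0.7 + 0.3 * rng.standard_normal(20)
x = theta + 0.3 * rng.standard_normal(20)
w = rng.uniform(0.1, 1.0, size=20)    # stand-in for the weights N_n (assumption)
phi = np.array([0.1, 0.8, 0.05])      # arbitrary test point (alpha, beta, gamma2)

g_analytic = analytic_grad(phi, theta, x, w)
g_numeric = finite_diff_grad(phi, theta, x, w)
print(g_analytic, g_numeric)
```

If the two agree here but the derivative at $\hat{\alpha}$ is still non-zero in the cell below, the mismatch more likely sits in the weights or in the values plugged into the check than in the gradient formula.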

In [None]:
alpha_hat = np.array(util.alpha(params, stats, normals))

gamma2_ = post.std**2
alphas = np.linspace(-0.09, -0.02, 100000)

beta_ = util.beta(params, stats, normals, alpha_hat)
out_hat = -2*(normals.reshape(-1,1) * (params.reshape(-1,1) - beta_ * stats.reshape(-1,1) - alpha_hat)/gamma2_).sum(axis=0)
out = -2*(normals.reshape(-1,1) * (params.reshape(-1,1) - beta_ * stats.reshape(-1,1) - alphas.reshape(1,-1))/gamma2_).sum(axis=0)
plt.plot(alphas, out)
plt.show()


# (numerical solution, analytical solution, derivative evaluated at analytical solution (should be zero-ish) ) = 
alphas[np.argmin(np.abs(out))], alpha_hat, out_hat

# numerically check $\frac{\partial}{\partial{}\beta}$

In [None]:
alpha_ = alpha_hat
gamma2_ = post.std**2
betas = np.linspace(0., 1., 1000)
out = -2*(normals.reshape(-1,1) * (params.reshape(-1,1) - betas.reshape(1,-1) * stats.reshape(-1,1) - alpha_)/gamma2_ * stats.reshape(-1,1)).sum(axis=0)
plt.plot(betas, out)
plt.show()

beta_hat = util.beta(params, stats, normals, ahat=alpha_)
out_hat = -2*(normals.reshape(-1,1) * (params.reshape(-1,1) - beta_hat * stats.reshape(-1,1) - alpha_)/gamma2_ * stats.reshape(-1,1)).sum(axis=0)

# (numerical solution, analytical solution, derivative evaluated at analytical solution (should be zero-ish) ) = 
betas[np.argmin(np.abs(out))], beta_hat, out_hat

# numerically check $\frac{\partial}{\partial{}\gamma^2}$

In [None]:
alpha_ = alpha_hat
beta_ = beta_hat
gamma2s = np.linspace(0.0009, 0.001, 1000)

# something is off with the (hard-coded) gradients below; to disable the numerical solution, uncomment the two lines further down

tmp = (params.reshape(-1,1) - beta_*stats.reshape(-1,1) - alpha_)**2 / gamma2s.reshape(1,-1)
out = 1/gamma2s.reshape(-1,) * (normals.reshape(-1,1) * (1 - tmp)).sum(axis=0)

plt.plot(gamma2s, out)
plt.show()

gamma2_hat = util.gamma2(params, stats, normals, ahat=alpha_, bhat=beta_)
tmp_ = (params.reshape(-1,1) - beta_*stats.reshape(-1,1) - alpha_)**2 / gamma2_hat
out_hat = 1/gamma2_hat * (normals.reshape(-1,1) * (1 - tmp_)).sum(axis=0)

#gamma2s *= np.nan 
#out = np.zeros_like(gamma2s)

# (numerical solution, analytical solution, derivative evaluated at analytical solution (should be zero-ish) ) = 
gamma2s[np.argmin(np.abs(out))], gamma2_hat, out_hat