In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sys
import pyemu
assert "dependencies" in pyemu.__file__
sys.path.insert(0,"..")

# Introduction
“DSI” stands for “data space inversion”. Data space inversion (DSI) enables the exploration of a model prediction's posterior distribution without requiring the exploration of the posterior distribution of model parameters. This is achieved by constructing a surrogate model using principal component analysis (PCA) of the covariance matrix of model outputs (i.e., observations). This matrix links model outputs corresponding to field measurements with predictions of interest. The resulting predictions are then conditioned on real-world measurements of system behavior.

The general idea is to:
1. Generate an ensemble of model outputs by running the model with samples from the prior.
2. Estimate the covariance of the model outputs empirically from that ensemble.
3. Construct the surrogate model and condition it on measured data.
4. Obtain a sample of the posterior distribution of the prediction.

The following notebook goes through the method described by [Sun and Durlofsky (2017)](https://doi.org/10.1007/s11004-016-9672-8) and [Lima et al (2020)](https://doi.org/10.1007/s10596-020-09933-w).


# Generate the prior observation ensemble

First we need some "training data". Let us start by cooking up some fake "model outputs". Say we have a "model" that outputs three "measured" observations and a "prediction", and that we have run it with 1000 samples of the prior.

In [None]:
# Mean values for each variable
mean = [0, 1, 2, 3]

# Covariance matrix: answer at the back of the book...
true_cov = [
    [1, 0.8, 0.5, 0.5],  
    [0.8, 1, 0.3, 0.3],  
    [0.5, 0.3, 1, .2],   
    [0.5, 0.3,.2,1]
]

# Number of samples to generate a.k.a. ensemble size
nreal = 1000

# Generate the fake prior observation ensemble
fake_sim_ensemble = pd.DataFrame(np.random.multivariate_normal(mean, true_cov, nreal),
                                 columns=["prediction","obs1","obs2","obs3"])
fake_sim_ensemble.head()

# Data-space inversion

Following the notation in [Lima et al (2020)](https://doi.org/10.1007/s10596-020-09933-w), $\mathbf{d}$ is the vector of model simulated outputs that contains both predictions and measurements. As mentioned above, the main idea behind the method is to use PCA to write the vector of predicted data ($\mathbf{d}_{\text{PCA}}$) as:

$$
\mathbf{d}_{\text{PCA}} = \bar{\mathbf{d}} + \mathbf{C}_d^{1/2} \mathbf{x}
$$

in which $\bar{\mathbf{d}}$ and $\mathbf{C}_d$ are the mean and the covariance matrix of $\mathbf{d}$, both of which are obtained from the ensemble of model outputs, and $\mathbf{x}$ is a vector of random numbers.

## Calculate the mean-vector

In [None]:
# the mean
d_bar = fake_sim_ensemble.mean()
d_bar

In [None]:
# note that this is an approximation of the `true_cov` matrix
Cd = fake_sim_ensemble.cov()
Cd

## Calculate $\mathbf{C}_d^{1/2}$

$\mathbf{C}_d$ is calculated as:

$$
\mathbf{C}_d = \frac{1}{N-1} \sum_{i=1}^{N} (\mathbf{d}_i - \bar{\mathbf{d}}) (\mathbf{d}_i - \bar{\mathbf{d}})^T
$$

where $N$ is the number of samples in the ensemble, $\mathbf{d}_i$ is the $i$-th sample of the ensemble, and $\bar{\mathbf{d}}$ is the mean of the ensemble.
This is equivalent to:

$$
\mathbf{C}_d = \Delta\mathbf{D} \Delta\mathbf{D}^T
$$

where $\Delta\mathbf{D}$ is the matrix whose columns are the ensemble members with the ensemble mean subtracted, scaled by $1/\sqrt{N_e - 1}$:

$$
\Delta \mathbf{D} = \frac{1}{\sqrt{N_e - 1}} \left[ \mathbf{d}_1 - \bar{\mathbf{d}}, \ldots, \mathbf{d}_{N_e} - \bar{\mathbf{d}} \right]
$$

where $N_e$ is the number of samples in the ensemble (the same quantity written as $N$ in the sum above).
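
The equivalence between the sum-of-outer-products form of $\mathbf{C}_d$ and the product $\Delta\mathbf{D} \Delta\mathbf{D}^T$ can be checked numerically. The following is a minimal, self-contained sketch that uses a hypothetical stand-in ensemble (the names `ensemble`, `cov_sum` and `cov_product` are illustrative only and are not used elsewhere in this notebook):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for repeatability only

# Hypothetical stand-in ensemble: 1000 realizations (rows) of 4 outputs (columns)
n_real, n_out = 1000, 4
ensemble = rng.normal(size=(n_real, n_out))

# Explicit sum of outer products of the deviations from the ensemble mean
d_bar = ensemble.mean(axis=0)
dev = ensemble - d_bar
cov_sum = sum(np.outer(d, d) for d in dev) / (n_real - 1)

# Scaled deviation matrix, shape (n_out, n_real): one column per realization
delta_D = dev.T / np.sqrt(n_real - 1)
cov_product = delta_D @ delta_D.T

# np.cov treats rows as variables by default, hence rowvar=False here
cov_numpy = np.cov(ensemble, rowvar=False)

print(np.allclose(cov_sum, cov_product), np.allclose(cov_product, cov_numpy))
```

All three routes give the same matrix, which is why the notebook can work with $\Delta\mathbf{D}$ directly instead of forming $\mathbf{C}_d$ first.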

Let's calculate $\Delta \mathbf{D}$.

In [None]:
# NOTE: to maintain consistency with notation used in the papers, 
# here we need to transpose our ensemble to be of shape (nobs,nreal)
deltaD = fake_sim_ensemble.T.apply(lambda x: (x - x.mean()) / np.sqrt(fake_sim_ensemble.shape[0]-1),axis=1)
deltaD.head()

Why are we talking about $\Delta\mathbf{D}$? Because $\mathbf{C}_d^{1/2}$, used in the first equation we presented, is calculated using the singular value decomposition (SVD) of $\Delta\mathbf{D}$:

$$
\Delta\mathbf{D} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T
$$

where $\mathbf{U}$ and $\mathbf{V}$ are orthogonal matrices and $\mathbf{\Sigma}$ is a diagonal matrix with the singular values of $\Delta\mathbf{D}$.


In [None]:
# SVD
U, Sigma, Vt = np.linalg.svd(deltaD, full_matrices=False)
U.shape,Sigma.shape,Vt.shape



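From these, we can now calculate the square root of $\mathbf{C}_d$ as:

$$
\mathbf{C}_d^{1/2} = \mathbf{U} \mathbf{\Sigma}
$$

where $\mathbf{\Sigma}$ is the diagonal matrix of singular values of $\Delta\mathbf{D}$. This is a square root in the sense that $\mathbf{C}_d^{1/2} (\mathbf{C}_d^{1/2})^T = \mathbf{U} \mathbf{\Sigma}^2 \mathbf{U}^T = \Delta\mathbf{D} \Delta\mathbf{D}^T = \mathbf{C}_d$.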
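
As a quick sanity check, the factor $\mathbf{U} \mathbf{\Sigma}$ multiplied by its own transpose should reproduce the empirical covariance matrix. The sketch below assumes a hypothetical correlated ensemble (`mixing` and `ensemble` are made-up names for illustration) rather than the notebook's own variables:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ensemble with correlated outputs, shape (n_real, n_out)
n_real, n_out = 500, 4
mixing = rng.normal(size=(n_out, n_out))
ensemble = rng.normal(size=(n_real, n_out)) @ mixing

# Scaled deviation matrix, shape (n_out, n_real)
delta_D = (ensemble - ensemble.mean(axis=0)).T / np.sqrt(n_real - 1)

# Thin SVD: U is (n_out, n_out), sigma holds the n_out singular values
U, sigma, Vt = np.linalg.svd(delta_D, full_matrices=False)

# Square-root factor; U * sigma scales column j of U by sigma[j],
# which is the same as U @ np.diag(sigma)
Cd_sqrt = U * sigma

Cd = np.cov(ensemble, rowvar=False)
print(np.allclose(Cd_sqrt @ Cd_sqrt.T, Cd))
```

Note that $\mathbf{U} \mathbf{\Sigma}$ is generally not symmetric, so it is a square-root factor of $\mathbf{C}_d$ rather than the symmetric matrix square root.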

In [None]:
Cd_sqrt = np.dot(U,np.diag(Sigma)) #eq 14 in Lima 2020

## The model emulator

The emulator is nothing more than a linear transformation of the model outputs. The emulator is constructed by projecting the model outputs onto the principal components of the covariance matrix of the model outputs. 

The model emulator is "run" by calculating $\bar{\mathbf{d}} + \mathbf{C}_d^{1/2} \mathbf{x}$. The values of $\mathbf{x}$ will be "PEST adjustable parameters" that are sampled from a normal distribution with mean of zero.
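
To see what "running" the emulator does, the sketch below draws many $\mathbf{x}$ vectors and compares the statistics of the emulated outputs with those of the ensemble the emulator was built from. It is self-contained and rests on two assumptions that are not stated above: that $\mathbf{x}$ is standard normal (zero mean and unit variance, independent components), and that a hypothetical helper named `emulate` stands in for the forward run.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical prior ensemble of 4 model outputs, shape (n_real, n_out)
mean = [0.0, 1.0, 2.0, 3.0]
cov = [[1.0, 0.8, 0.5, 0.5],
       [0.8, 1.0, 0.3, 0.3],
       [0.5, 0.3, 1.0, 0.2],
       [0.5, 0.3, 0.2, 1.0]]
prior = rng.multivariate_normal(mean, cov, size=1000)

# Emulator ingredients: ensemble mean and the factor U @ diag(sigma)
d_bar = prior.mean(axis=0)
delta_D = (prior - d_bar).T / np.sqrt(prior.shape[0] - 1)
U, sigma, _ = np.linalg.svd(delta_D, full_matrices=False)
Cd_sqrt = U * sigma


def emulate(x):
    """One emulator 'forward run' for a parameter vector x (hypothetical helper)."""
    return d_bar + Cd_sqrt @ x


# With x = 0 the emulator returns the ensemble mean
print(np.allclose(emulate(np.zeros(4)), d_bar))

# Assumption: x ~ N(0, I). Run the emulator for many independent draws of x.
n_draws = 20000
X = rng.standard_normal(size=(n_draws, 4))
emulated = d_bar + X @ Cd_sqrt.T  # row i equals emulate(X[i])

# Largest mismatch between emulated and prior-ensemble statistics
mean_err = np.abs(emulated.mean(axis=0) - d_bar).max()
cov_err = np.abs(np.cov(emulated, rowvar=False) - np.cov(prior, rowvar=False)).max()
print(mean_err, cov_err)
```

Both mismatches shrink as `n_draws` grows, which illustrates that the emulator reproduces the mean and covariance of the prior output ensemble.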

In [None]:
# the "prior" mean of emulator "parameters"
x = np.zeros_like(Sigma)
x

In [None]:
# a model-emulator "forward run"
d_bar.values + np.dot(Cd_sqrt,x)

### Dummy calibration

In practice, how do we handle this with PEST? The vector $\bar{\mathbf{d}}$ and the matrix $\mathbf{C}_d^{1/2}$ are constructed and recorded in the PEST model directory. Then, a forward run script is prepared which reads these arrays, as well as the PEST-adjusted values of the vector $\mathbf{x}$, and calculates the model emulator outputs.
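
As a rough sketch of that file-based workflow, the example below stores the arrays as plain text files and reads them back inside a forward run function. The file names (`d_bar.dat`, `cd_sqrt.dat`, `x.dat`, `sim.dat`) and the numbers are hypothetical, and plain `np.savetxt` files stand in for whatever template and instruction files an actual PEST setup would use.

```python
import tempfile
from pathlib import Path

import numpy as np


def forward_run(workdir):
    """Read the emulator arrays and x from disk, then write the emulated outputs."""
    workdir = Path(workdir)
    d_bar = np.loadtxt(workdir / "d_bar.dat")               # hypothetical file name
    Cd_sqrt = np.loadtxt(workdir / "cd_sqrt.dat", ndmin=2)  # hypothetical file name
    x = np.loadtxt(workdir / "x.dat")                       # values PEST would adjust
    sim = d_bar + Cd_sqrt @ x
    np.savetxt(workdir / "sim.dat", sim)                    # outputs PEST would read back
    return sim


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Made-up emulator arrays and parameter values, for illustration only
    np.savetxt(tmp / "d_bar.dat", np.array([0.0, 1.0, 2.0, 3.0]))
    np.savetxt(tmp / "cd_sqrt.dat", 0.5 * np.eye(4))
    np.savetxt(tmp / "x.dat", np.array([1.0, 0.0, -1.0, 2.0]))

    sim = forward_run(tmp)
    sim_from_file = np.loadtxt(tmp / "sim.dat")

print(sim)
```

Keeping the forward run free of any state other than the files in its working directory is what lets it be called repeatedly, once per set of parameter values.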

Let's demonstrate this with a simple example. Here is what a forward run might look like:

In [None]:
def forward_run(x):
    #pretend to read d_bar
    #pretend to read Cd_sqrt
    #pretend to read x
    return d_bar.values + np.dot(Cd_sqrt, x)

x = np.zeros_like(Sigma)
obs = forward_run(x)
obs

And now let's choose a "truth":

In [None]:
# choose a realisation as the truth
truth = fake_sim_ensemble.loc[0]
truth

...and now calibrate the emulator to the truth observations (don't do this at home folks...this only works well because it is a super simple example):

In [None]:
# find the pvals that minimize the difference between the truth and the forward run
from scipy.optimize import minimize

def objective(x):
    # objective does not include the prediction column
    return np.sum((forward_run(x)[1:] - truth[1:])**2)

# initial parameters
pvals_guess = np.zeros_like(Sigma)

# optimize
res = minimize(objective, pvals_guess,tol=1e-5)

In [None]:
fig,ax=plt.subplots(1,1,figsize=(4,4))
ax.set_aspect('equal')
ax.scatter(truth,forward_run(x), label='initial parameters')
ax.scatter(truth,forward_run(res.x), label='calibrated parameters')
ax.set_ylabel('simulated')
ax.set_xlabel('measured')
ax.legend()
ax.grid(alpha=0.3)
#add one to one line
ax.plot([truth.min(),truth.max()],[truth.min(),truth.max()],'k--',alpha=0.3);
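
Because the emulator is linear in $\mathbf{x}$, the dummy calibration above can also be written as a linear least-squares problem. The sketch below is a swapped-in alternative to the `scipy.optimize.minimize` call, not the approach used in this notebook: it uses `np.linalg.lstsq` on a hypothetical, self-contained ensemble, and with three observations and four parameters it returns the minimum-norm $\mathbf{x}$ that fits the observations. It ignores measurement noise entirely.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical prior ensemble; columns are: prediction, obs1, obs2, obs3
mean = [0.0, 1.0, 2.0, 3.0]
cov = [[1.0, 0.8, 0.5, 0.5],
       [0.8, 1.0, 0.3, 0.3],
       [0.5, 0.3, 1.0, 0.2],
       [0.5, 0.3, 0.2, 1.0]]
prior = rng.multivariate_normal(mean, cov, size=1000)

# Emulator ingredients
d_bar = prior.mean(axis=0)
delta_D = (prior - d_bar).T / np.sqrt(prior.shape[0] - 1)
U, sigma, _ = np.linalg.svd(delta_D, full_matrices=False)
Cd_sqrt = U * sigma

# Use the first realization as the "truth"; only the observations are fitted
truth = prior[0]
obs_rows = [1, 2, 3]

# Solve Cd_sqrt[obs_rows] @ x = truth[obs_rows] - d_bar[obs_rows].
# The system is underdetermined (3 equations, 4 unknowns), so lstsq
# returns the minimum-norm solution.
x_fit, *_ = np.linalg.lstsq(
    Cd_sqrt[obs_rows], truth[obs_rows] - d_bar[obs_rows], rcond=None
)

sim = d_bar + Cd_sqrt @ x_fit
print(np.allclose(sim[obs_rows], truth[obs_rows]))  # observations are matched
print(sim[0], truth[0])  # emulated prediction versus the "true" prediction
```

The observations are reproduced exactly, while the emulated prediction is only informed through its correlation with the observations and need not equal the "true" prediction.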