In [None]:
import os
import shutil
import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning) 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt;
import psutil

import sys
import pyemu
import flopy
assert "dependencies" in flopy.__file__
assert "dependencies" in pyemu.__file__
sys.path.insert(0,"..")
import herebedragons as hbd

# Data-space Inversion

Data space inversion (DSI) enables the exploration of a model prediction's posterior distribution without requiring the exploration of the posterior distribution of model parameters. This is achieved by constructing a surrogate model using principal component analysis (PCA) of a covariance matrix of model outputs. This matrix links model outputs corresponding to field measurements with predictions of interest. The resulting predictions are then conditioned on real-world measurements of system behavior.

DSI is a powerful tool for predictive uncertainty quantification, as it allows for the exploration of the uncertainty in model predictions without the need to explore the uncertainty in model parameters. This is particularly useful when the model is complex and the parameter space is high-dimensional. DSI can also be used to explore the sensitivity of model predictions to different types of data, and to identify the most informative data types for reducing predictive uncertainty.
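To make the idea a little more concrete, here is a minimal NumPy sketch of the kind of surrogate described above. It is not the pyEMU implementation; the synthetic ensemble and all names (`sim_ensemble`, `proj_mat`, `emulate`) are assumptions made for illustration only. It builds a square root of the empirical output covariance matrix from an ensemble and uses it to map standard-normal surrogate parameters to emulated outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "training" ensemble: 200 realizations (rows) of 6 model outputs (columns).
# The mixing matrix is arbitrary and only serves to create correlated outputs.
n_reals, n_out = 200, 6
mixing = rng.normal(size=(3, n_out))
sim_ensemble = 10.0 + rng.normal(size=(n_reals, 3)) @ mixing

# Prior mean of the outputs and deviations about that mean
d_mean = sim_ensemble.mean(axis=0)
deviations = sim_ensemble - d_mean

# SVD of the scaled deviations gives a square root of the empirical covariance:
# proj_mat @ proj_mat.T equals the sample covariance of the outputs
u, s, vt = np.linalg.svd(deviations / np.sqrt(n_reals - 1), full_matrices=False)
proj_mat = vt.T * s  # shape (n_out, n_components)


def emulate(x):
    """Map a vector of surrogate parameters x to emulated model outputs."""
    return d_mean + proj_mat @ x


# Sampling x from a standard normal distribution yields an emulated "prior" ensemble
x_samples = rng.normal(size=(5000, proj_mat.shape[1]))
emulated = d_mean + x_samples @ proj_mat.T
```

Conditioning then amounts to adjusting `x` so that the emulated outputs corresponding to measurements fit the measured values, which is the job handed to pestpp-ies later in this notebook.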



See the "intro to DSI" notebook for a more detailed explanation of the method, as well as the original paper by [Sun and Durlofsky (2017)](https://doi.org/10.1007/s11004-016-9672-8) and subsequent variations (e.g., [Lima et al. (2020)](https://doi.org/10.1007/s10596-020-09933-w)). There are also [GMDSI webinars and lectures available on YouTube](https://www.youtube.com/watch?v=s2g3HaJa1Wk&t=1564s).

In these notebooks we will focus on how to implement DSI for predictive uncertainty quantification, using pyEMU and PEST++. We will not discuss the maths behind the method in detail, but rather focus on how to use the tools to implement the method. For a review of the maths see the ["intro to dsi"](../part0_intro_to_dsi/intro_to_dsi.ipynb) notebook.

## Getting ready

Undertaking DSI relies on the existence of an ensemble of model-generated outputs (i.e., observations). These are generated by running the model with a range of parameter values, which are sampled from a prior distribution. Note that this prior does not need to be Gaussian, and each model "parameterization" can be as complex as the user desires. Generating the prior is the only time that the numerical model needs to be run. Ideally, the ensemble size should be as large as you can afford. However, once generated, the DSI model runs very quickly.

Let us start by generating the ensemble of model outputs. We will make use of the prior ensemble generated in a previous tutorial. The next couple of cells load necessary dependencies and call a convenience function to prepare the PEST dataset folder for you. This is the same dataset that was constructed during the ["freyberg ies"](../part2_06_ies/freyberg_ies_1_basics.ipynb) tutorial. Simply press `shift+enter` to run the cells.

Note: in our tutorial case, there is not actually much computational advantage in using the emulator (i.e., DSI) versus using the numerical model (i.e. Freyberg model). This is because the Freyberg model is very fast already. However, the DSI method is very useful for more computationally expensive models.

..anyway...lez'go!

Prepare the template directory:

In [None]:
# specify the temporary working folder
t_d = os.path.join('master_ies_1a')
if os.path.exists(t_d):
    shutil.rmtree(t_d)

org_t_d = os.path.join("..","part2_06_ies","master_ies_1a")
if not os.path.exists(org_t_d):
    raise Exception(f"you need to run the {org_t_d} notebook")
shutil.copytree(org_t_d,t_d)

Let us quickly load the Pst control file and remind ourselves of observations and predictions:

In [None]:
pst_name = os.path.join(t_d, "freyberg_mf6.pst")
pst = pyemu.Pst(pst_name)
pst.pestpp_options

In [None]:
predictions = pst.pestpp_options['forecasts'].split(',')
predictions

Start by loading the prior observation ensemble and checking how many realizations it has. These are our "training data". Let's use all of them. (If you like, you can experiment with what happens when using fewer realizations in the training dataset.)

In [None]:
oe_name = pst_name.replace(".pst", ".0.obs.csv")
# prior observation ensemble (iteration 0) from the "freyberg ies" run
oe_pr = pyemu.ObservationEnsemble.from_csv(pst=pst, filename=oe_name)
# posterior observation ensemble (iteration 3) from the same run
oe_pt = pyemu.ObservationEnsemble.from_csv(pst=pst, filename=pst_name.replace(".pst", ".3.obs.csv"))

oe_pr.shape

In [None]:
num_reals = oe_pr.shape[0]
num_reals

In a previous notebook, we set the observation weights to be "balanced for visibility". Thus they do not reflect measurement noise. However, we included the `standard_deviation` column in the PEST observation data section. Ensembles of observation noise are generated using the latter.

Experience suggests that, when conditioning a Gaussian process model, there is not much advantage in weighting for visibility. Achieving model-to-measurement fits commensurate with noise is rarely an issue with these methods. In practice, it may be more convenient to assign weights explicitly as the inverse of the standard deviation of noise. This provides an easy way to verify whether over-fitting is occurring (i.e., Phi should be greater than or equal to the number of non-zero weighted observations).
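The reasoning behind that rule of thumb can be illustrated with a small synthetic sketch. Everything below (the number of observations, the noise level, the "truth") is an assumption for illustration and is unrelated to the Freyberg dataset. With weights equal to the inverse of the noise standard deviation, a model that matches the true system response exactly still yields a Phi of roughly the number of observations, because the measurements themselves carry noise.

```python
import numpy as np

rng = np.random.default_rng(1)

n_obs = 50
noise_sd = np.full(n_obs, 0.1)      # assumed measurement-noise standard deviations
truth = rng.normal(size=n_obs)      # hypothetical noise-free system response
measured = truth + rng.normal(scale=noise_sd)

weights = 1.0 / noise_sd            # weight = inverse of the noise standard deviation


def phi(simulated):
    """Weighted sum of squared residuals."""
    return float(np.sum((weights * (measured - simulated)) ** 2))


# A model that reproduces the truth exactly still misfits the noisy measurements;
# its phi is a chi-square variate with expectation n_obs.
phi_truth = phi(truth)

# A model that reproduces the measurements exactly has fitted the noise as well.
phi_overfit = phi(measured)

print(f"n_obs={n_obs}, phi(truth)={phi_truth:.1f}, phi(overfit)={phi_overfit:.1f}")
```

A Phi well below the number of non-zero weighted observations therefore suggests that noise is being fitted.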

In [None]:
obs = pst.observation_data
obs.loc[obs.weight>0, ['obsval','weight','standard_deviation']].tail()

Let's change our observation weights to be the inverse of the standard deviation of noise.

In [None]:
obs.loc[obs.weight>0,"weight"] = 1.0 / obs.loc[obs.weight>0,"standard_deviation"]
assert obs.weight.sum()>0, "no non-zero obs weights found"

In pyEMU, the `EnDS` class is the entry point for all things DSI. It is initialized with the PEST control file and the ensemble of model outputs. When initialized it prepares in memory the various components required for DSI and associated analyses. 

In [None]:
ends = pyemu.EnDS(pst=pst,
                  sim_ensemble=oe_pr,
                  predictions=predictions,
                  verbose=True)

Now is where things get fancy. We need to construct a surrogate model. We also need to construct a new PEST setup to condition the surrogate. Luckily, pyEMU has a function that does all the hard work for us!

When calling `.prep_for_dsi()` on the `EnDS` object, pyemu will prepare a Pst object and folder with all the files required to run and condition the surrogate model using pestpp-ies. The new PEST control file is named `dsi.pst`. It contains all the observations that were included in the Pst object passed to `EnDS`. Parameters in the new control file are the vector of $\mathbf{x}$, i.e. the surrogate model parameters.

The `.prep_for_dsi()` method provides optional arguments that let the user request a normal-score transformation of the observations and the use of truncated SVD. The latter can be useful to reduce the dimensionality of the problem and to improve numerical stability. The former is often useful to improve the Gaussianity of the data, a condition on which the method relies.
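As a rough illustration of what a normal-score transformation does, the sketch below maps a skewed sample onto standard-normal scores through its empirical ranks and then maps it back by interpolation. This is a simplified, hypothetical version written for this explanation; the function names and the plotting-position formula are assumptions, and pyEMU may handle details such as tail extrapolation differently.

```python
import numpy as np
from statistics import NormalDist


def normal_score_transform(values):
    """Return the sorted values and their matching standard-normal scores."""
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    n = sorted_vals.size
    # plotting positions strictly inside (0, 1) so that the scores stay finite
    probs = (np.arange(1, n + 1) - 0.5) / n
    scores = np.array([NormalDist().inv_cdf(p) for p in probs])
    return sorted_vals, scores


def forward(values, sorted_vals, scores):
    """Original data space -> normal-score space."""
    return np.interp(values, sorted_vals, scores)


def backward(z, sorted_vals, scores):
    """Normal-score space -> original data space."""
    return np.interp(z, scores, sorted_vals)


rng = np.random.default_rng(2)
travel_time = rng.lognormal(mean=8.0, sigma=1.0, size=500)  # skewed and strictly positive

sorted_vals, scores = normal_score_transform(travel_time)
z = forward(travel_time, sorted_vals, scores)
recovered = backward(z, sorted_vals, scores)

# np.interp clamps at the ends, so even extreme scores map back inside the training range
extreme = backward(np.array([-10.0, 10.0]), sorted_vals, scores)
```

In this simple version, back-transformed values can never leave the range of the training data, which is one reason a transformed emulator cannot produce negative values for a quantity that is always positive in the training ensemble.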

For now let us just use the defaults (i.e. no normal-score transformation and no truncated-SVD).

In [None]:
t_d = "dsi_template"
pst_dsi = ends.prep_for_dsi(t_d=t_d)

One extra step for our case, because `.prep_for_dsi` does not copy over executables.

In [None]:
# from pst_template copy exe files
for f in os.listdir(os.path.join("pst_template")):
    if f.startswith("pestpp-ies."):
        shutil.copy2(os.path.join("pst_template",f),os.path.join(t_d,f))

Let's take a quick look at what's in this new folder:

In [None]:
os.listdir(t_d)

You can see the various components of the `dsi.pst` version 2 control file. Files that start with `dsi_` are the model emulator input and output files. 

The `dsi_pr_mean.csv` and `dsi_proj_mat.jcb` files are inputs required by the emulator. They contain the prior mean observation values $\bar{\mathbf{d}}$ and the square root of the observation covariance matrix $\mathbf{C}_d^{1/2}$, respectively (see the "intro to dsi" notebook for more details).
The `dsi_sim_vals.csv` file contains the emulator generated observation values using the input parameters, $\mathbf{x}$, contained in `dsi_pars.csv`. 
The `forward_run.py` script contains code to read input files, run the emulator and record outputs.

Let's take a quick look at how npar has changed. The number of parameters in the original Pst object is the number of parameters in the model. The number of parameters in the new Pst object is the number of principal components used in the emulator.
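The link between principal components and the number of surrogate parameters can be sketched with synthetic data. The ensemble below and the 99.9% variance ("energy") threshold are assumptions chosen for illustration, not values used by pyEMU. Without truncation there is one component per singular value; with truncation, only the components needed to reach the chosen fraction of total variance are retained.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic ensemble: 100 realizations of 40 outputs driven by 5 latent factors plus small noise
n_reals, n_out, n_latent = 100, 40, 5
ensemble = (
    rng.normal(size=(n_reals, n_latent)) @ rng.normal(size=(n_latent, n_out))
    + 0.01 * rng.normal(size=(n_reals, n_out))
)

deviations = ensemble - ensemble.mean(axis=0)
s = np.linalg.svd(deviations / np.sqrt(n_reals - 1), compute_uv=False)

# Fraction of total variance ("energy") captured by the first k components
energy = np.cumsum(s**2) / np.sum(s**2)


def n_components_for(energy_threshold):
    """Smallest number of components whose cumulative energy reaches the threshold."""
    return int(np.searchsorted(energy, energy_threshold) + 1)


n_full = s.size                      # no truncation: min(n_reals, n_out) components
n_trunc = n_components_for(0.999)    # truncated: only the dominant components

print(n_full, n_trunc)
```

Here the truncated emulator needs far fewer parameters than the untruncated one because most of the variance is carried by a handful of components.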

In [None]:
pst_dsi.npar_adj, pst.npar_adj

Verify we have the same number of observations. 

In [None]:
pst_dsi.nobs, pst.nobs

In [None]:
pst_dsi.nnz_obs, pst.nnz_obs

Right on! We are ready to get cracking. Let's run pestpp-ies and see what we get.

#ATTENTION!

As always, set the number of workers according to your resources.

In [None]:
num_workers=10

Depending on the number of realizations and how many workers you have at your disposal, the next cell may take a while to run. Although the emulator model is super fast, we are running it many times.

In [None]:
# load the control file
pst = pyemu.Pst(os.path.join(t_d,"dsi.pst"))

# let's specify the number of realizations to use; as usual this should be as many as you can afford
pst.pestpp_options["ies_num_reals"] = num_reals

# and set options to reduce the number of lost reals
pst.pestpp_options["overdue_giveup_fac"] = 1e30
pst.pestpp_options["overdue_giveup_minutes"] = 1e30

# set noptmax 
pst.control_data.noptmax = 5

# and re-write
pst.write(os.path.join(t_d,"dsi.pst"),version=2)

# the master dir
m_d = "master_dsi"
pyemu.os_utils.start_workers(t_d,"pestpp-ies","dsi.pst",num_workers=num_workers,worker_root='.',
                                master_dir=m_d)

Right on. Let's check the evolution of the objective function. We should see it decrease as the emulator is conditioned on the observations, but we don't want it to fall below the number of non-zero weighted observations (`nnz_obs`).
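One way to act on that check is sketched below. The phi values and `nnz_obs` are made up for illustration, and the helper `last_iteration_before_overfit` is hypothetical rather than part of pyEMU. It simply picks the last iteration whose mean phi is still at or above `nnz_obs`.

```python
import pandas as pd

# Hypothetical phi summary, mimicking the `mean` column read from `dsi.phi.actual.csv`
phi_summary = pd.DataFrame(
    {"mean": [5200.0, 830.0, 190.0, 95.0, 61.0]},
    index=pd.Index([0, 1, 2, 3, 4], name="iteration"),
)
nnz_obs = 72  # hypothetical number of non-zero weighted observations


def last_iteration_before_overfit(phi_mean, nnz_obs):
    """Return the last iteration whose mean phi is still >= nnz_obs."""
    acceptable = phi_mean[phi_mean >= nnz_obs]
    if acceptable.empty:
        raise ValueError("mean phi is below nnz_obs at every iteration")
    return acceptable.index[-1]


chosen = last_iteration_before_overfit(phi_summary["mean"], nnz_obs)
print(f"use iteration {chosen}")
```

This is only a screening aid. Which iteration to carry forward remains a judgment call that also depends on how individual realizations and observation groups are fitted.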

In [None]:
m_d = os.path.join(".", "master_dsi")
pst = pyemu.Pst(os.path.join(m_d,"dsi.pst"))
phidf = pd.read_csv(os.path.join(m_d,"dsi.phi.actual.csv"),index_col=0)

fig,ax=plt.subplots(1,1,figsize=(4,4))
ax.plot(phidf.index,phidf['mean'],"bo-", label='dsi')

ax.set_yscale('log')
ax.set_ylabel('phi')
ax.set_xlabel('iteration')

ax.text(0.7,0.9,f"nnz_obs: {pst.nnz_obs}\nphi_dsi: {phidf['mean'].iloc[-1]:.2f}",
        transform=ax.transAxes,ha="right",va="top")


ax.legend()

And as usual, we bring this back to the predictions. How has DSI performed? While we are at it, let's compare to the numerical model derived predictions we obtained in the "freyberg ies" notebooks.

In [None]:
def plot_dsi_hist(m_d,pst,iteration=3):
    pr_oe_dsi = pd.read_csv(os.path.join(m_d,"dsi.0.obs.csv"),index_col=0)
    pt_oe_dsi = pd.read_csv(os.path.join(m_d, f"dsi.{iteration}.obs.csv"), index_col=0)

    #pv = pyemu.ObservationEnsemble(pst=pst,df=oe_pt).phi_vector
    #pv_dsi = pyemu.ObservationEnsemble(pst=pst, df=pt_oe_dsi).phi_vector

    fig,axes = plt.subplots(len(predictions),1,figsize=(7,7))
    for p,ax in zip(predictions,axes):
            #calculate consistent bin edges
            bins = np.linspace(
                    min(oe_pr.loc[:,p].values.min(),
                        oe_pt.loc[:,p].values.min(),
                        pr_oe_dsi.loc[:,p].values.min(),
                        pt_oe_dsi.loc[:,p].values.min()),
                    max(oe_pr.loc[:,p].values.max(),
                        oe_pt.loc[:,p].values.max(),
                        pr_oe_dsi.loc[:,p].values.max(),
                        pt_oe_dsi.loc[:,p].values.max()),
                    50)

            ax.hist(oe_pr.loc[:,p].values,alpha=0.5,facecolor="0.5",density=True,label="prior",bins=bins)
            ax.hist(oe_pt.loc[:, p].values,  alpha=0.5, facecolor="b",density=True,label="posterior",bins=bins)
            ax.hist(pr_oe_dsi.loc[:, p].values,  facecolor="none",hatch="/",edgecolor="0.5",lw=0.5,density=True,label="dsi prior",bins=bins)
            ax.hist(pt_oe_dsi.loc[:, p].values,  facecolor="none",density=True,hatch="/",edgecolor="b",lw=.5,label="dsi posterior",bins=bins)
            
            fval = pst.observation_data.loc[p,"obsval"]
            ax.plot([fval,fval],ax.get_ylim(),"r-",label='truth')
            
            ax.set_title(p,loc="left")
            ax.legend(loc="upper right")
            ax.set_yticks([])
    fig.tight_layout()


In [None]:
m_d = os.path.join(".", "master_dsi")
pst = pyemu.Pst(os.path.join(m_d,"dsi.pst"))

iteration=phidf.index.values[-1]
plot_dsi_hist(m_d,pst,iteration=iteration)

Not too shabby. Prediction variance has decreased for most forecasts. And the posterior distributions all capture the truth. 

There is a noticeable issue for the particle travel time prediction: the model emulator is predicting physically impossible values (i.e. negative times). On top of that, the physics-based model prior and posterior show a distinctly skewed distribution, whilst the emulator-derived outputs are symmetric. This is due to the Gaussian assumption of the emulator. To avoid it, we can (and should!) apply a transformation to the data. We shall do that in a minute.

First, let's look at another common issue: saw-tooth patterns in time-series outputs. Make a quick plot of the future time series outputs:
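Checks like these can also be made numerically rather than by eye. The sketch below uses synthetic samples and a hypothetical helper, `summarize_prediction`; the distributions and the "truth" value are assumptions for illustration only. It reports whether an ensemble brackets the truth and what fraction of realizations falls below a physical lower bound.

```python
import numpy as np


def summarize_prediction(values, truth, lower_bound=None):
    """Summarize an ensemble of predicted values against a known truth."""
    values = np.asarray(values, dtype=float)
    summary = {
        "captures_truth": bool(values.min() <= truth <= values.max()),
        "truth_percentile": float((values < truth).mean() * 100.0),
    }
    if lower_bound is not None:
        # fraction of realizations violating the physical lower bound
        summary["fraction_below_bound"] = float((values < lower_bound).mean())
    return summary


rng = np.random.default_rng(4)
# symmetric ensemble, standing in for an untransformed emulator output
gaussian_like = rng.normal(loc=2000.0, scale=1500.0, size=1000)
# skewed, strictly positive ensemble, standing in for a travel-time distribution
skewed = rng.lognormal(mean=7.3, sigma=0.7, size=1000)

s_gauss = summarize_prediction(gaussian_like, truth=1500.0, lower_bound=0.0)
s_skew = summarize_prediction(skewed, truth=1500.0, lower_bound=0.0)
print(s_gauss)
print(s_skew)
```

Both ensembles bracket the truth, but only the symmetric one contains negative values.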


In [None]:
def plot_tseries_ensembles(pt_oe, onames=["hds","sfr"]):
    pst.try_parse_name_metadata()
    # get the observation data from the control file and select 
    obs = pst.observation_data.copy()
    # onames provided in oname argument
    obs = obs.loc[obs.oname.apply(lambda x: x in onames)]
    # only non-zero observations
    obs = obs.loc[obs.obgnme.apply(lambda x: x in pst.nnz_obs_groups),:]
    # make a plot
    ogs = obs.obgnme.unique()
    fig,axes = plt.subplots(len(ogs),1,figsize=(7,2*len(ogs)))
    ogs.sort()
    # for each observation group (i.e. timeseries)
    for ax,og in zip(axes,ogs):
        # get values for x axis
        oobs = obs.loc[obs.obgnme==og,:].copy()
        oobs.loc[:,"time"] = oobs.time.astype(float)
        oobs.sort_values(by="time",inplace=True)
        tvals = oobs.time.values
        onames = oobs.obsnme.values
        ax.plot(oobs.time,oobs.obsval,color="fuchsia",lw=2,zorder=100)
        # plot prior
        #[ax.plot(tvals,pr_oe.loc[i,onames].values,"0.5",lw=0.5,alpha=0.5) for i in pr_oe.index]
        # plot posterior
        [ax.plot(tvals,pt_oe.loc[i,onames].values,"b",lw=0.5,alpha=0.5) for i in pt_oe.index]
        # plot measured+noise 
        oobs = oobs.loc[oobs.weight>0,:]
        tvals = oobs.time.values
        onames = oobs.obsnme.values
        #[ax.plot(tvals,noise.loc[i,onames].values,"r",lw=0.5,alpha=0.5,zorder=0) for i in noise.index]
        ax.plot(oobs.time,oobs.obsval,"r-o",lw=2,zorder=100)
        ax.set_title(og,loc="left")
    fig.tight_layout()
    return fig

The following cell plots the posterior of three forecast time series. As you can see, DSI-generated outputs are nice and smooth in the historical period. However, the same cannot be said for the forecast period, where "saw-toothed" behaviour can be seen, with the time series jumping up and down. It is especially evident in the early future for site `trgw-0-3-8`. This too will be mitigated by applying a normal-score transformation to model-derived outputs prior to undertaking DSI.
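If a number is preferred over a visual impression, saw-toothing can be quantified with a simple roughness measure. The metric below (mean absolute second difference) and the synthetic series are assumptions for illustration; they are not part of pyEMU or of the DSI method itself.

```python
import numpy as np


def roughness(series):
    """Mean absolute second difference along the time axis of each realization."""
    series = np.atleast_2d(series)
    return np.abs(np.diff(series, n=2, axis=1)).mean(axis=1)


t = np.linspace(0.0, 1.0, 25)
smooth = 35.0 + 0.5 * np.sin(2 * np.pi * t)
# add an alternating +/- perturbation to mimic a saw-tooth pattern
sawtooth = smooth + 0.1 * (-1.0) ** np.arange(t.size)

r_smooth = roughness(smooth)[0]
r_saw = roughness(sawtooth)[0]
print(f"smooth: {r_smooth:.3f}, saw-tooth: {r_saw:.3f}")
```

Applied to each realization of an observation ensemble, with columns ordered by time, a measure like this would allow the forecast period to be compared between emulator setups.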

In [None]:
iteration=phidf.index.values[-1]
pt_oe_dsi = pd.read_csv(os.path.join(m_d, f"dsi.{iteration}.obs.csv"), index_col=0)
fig = plot_tseries_ensembles(pt_oe_dsi, onames=["hds","sfr"])

### DSI with normal-score transformation

We will now repeat all of the above, but using the normal-score transformation (NST) option when calling `.prep_for_dsi()`. This option will first apply a normal-score transform to all model derived outputs in the training dataset. Transformed observation values will have a distribution which is more similar to Gaussian, thus better respecting the assumptions on which DSI relies. Then, during the DSI forward run, emulator-derived outputs are back-transformed into the original data space, so as to allow for direct comparison to measured values.

In the `pyemu` python implementation this comes at a slight cost of increased run time, both in setting up the emulator, as well as in the forward run. The latter may cost 10 seconds instead of 5. Sorry about that. 

Here we go:

In [None]:
t_d = "dsi_template"
pst_dsi = ends.prep_for_dsi(t_d=t_d,apply_normal_score_transform=True)

# from pst_template copy exe files
for f in os.listdir(os.path.join("pst_template")):
    if f.startswith("pestpp-ies."):
        shutil.copy2(os.path.join("pst_template",f),os.path.join(t_d,f))

# load the control file
pst = pyemu.Pst(os.path.join(t_d,"dsi.pst"))

# let's specify the number of realizations to use; as usual this should be as many as you can afford
pst.pestpp_options["ies_num_reals"] = num_reals

# and set options to reduce the number of lost reals
pst.pestpp_options["overdue_giveup_fac"] = 1e30
pst.pestpp_options["overdue_giveup_minutes"] = 1e30

# set noptmax 
pst.control_data.noptmax = 5

# and re-write
pst.write(os.path.join(t_d,"dsi.pst"),version=2)

# the master dir
m_d = "master_dsi_nst"
pyemu.os_utils.start_workers(t_d,"pestpp-ies","dsi.pst",num_workers=num_workers,worker_root='.',
                                master_dir=m_d)

Let's check how the Phi progressed during that:

In [None]:
m_d = os.path.join(".", "master_dsi_nst")
pst = pyemu.Pst(os.path.join(m_d,"dsi.pst"))
phidf = pd.read_csv(os.path.join(m_d,"dsi.phi.actual.csv"),index_col=0)
fig,ax=plt.subplots(1,1,figsize=(4,4))
ax.plot(phidf.index,phidf['mean'],"bo-", label='dsi')

ax.set_yscale('log')
ax.set_ylabel('phi')
ax.set_xlabel('iteration')

ax.text(0.7,0.9,f"nnz_obs: {pst.nnz_obs}\nphi_dsi: {phidf['mean'].iloc[-1]:.2f}",
        transform=ax.transAxes,ha="right",va="top")
ax.legend();

Again, we achieved near-noise levels of fit without too much effort. We would not want to fit any further, and we might even wish to not fit this well. But let's take the results of the last iteration and see how our predictions did:

In [None]:
iteration=phidf.index.values[-1]
pst = pyemu.Pst(os.path.join(m_d,"dsi.pst"))
plot_dsi_hist(m_d,pst,iteration=iteration)

Sweet! #winning. This time around, both the DSI prior and posterior look a lot more similar to those we obtained with the Freyberg model. On top of that, the particle travel time distribution no longer has physically infeasible values, and it follows the same skewed distribution with a long tail.

Finally, let's take a look at the time series:

In [None]:
pt_oe_dsi = pd.read_csv(os.path.join(m_d, f"dsi.{iteration}.obs.csv"), index_col=0)
fig = plot_tseries_ensembles(pt_oe_dsi, onames=["hds","sfr"])

There we go. A lot less "saw toothing", with somewhat smoother time-series in the forecast period.

# Final remarks

 - use as many reals as possible when generating training data. The more the better.
 - the same applies for DSI conditioning with IES. The more reals the better.
 - make sure the DSI-generated prior reflects the physics-based model prior. This may require a number of DSI realizations equal to or greater than the number used for training, especially when using normal-score transformation.
 - if forecasts are leaning towards the ends (or beyond!) of the prior distribution, this is a red flag. It is probably worth revising the physics-based model prior parameter distributions and re-training the emulator.
 - don't overfit...#duh
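The red-flag check mentioned in the list above can be turned into a simple screening calculation. The sketch below uses synthetic samples and a hypothetical helper, `fraction_outside_prior`; the 1st and 99th percentile limits are an arbitrary assumption. It reports the fraction of posterior realizations that fall outside the central bulk of the prior.

```python
import numpy as np


def fraction_outside_prior(prior, posterior, lower_q=1.0, upper_q=99.0):
    """Fraction of posterior values outside the [lower_q, upper_q] percentile range of the prior."""
    lo, hi = np.percentile(prior, [lower_q, upper_q])
    posterior = np.asarray(posterior, dtype=float)
    return float(((posterior < lo) | (posterior > hi)).mean())


rng = np.random.default_rng(5)
prior = rng.normal(0.0, 1.0, size=2000)
well_within = rng.normal(0.2, 0.3, size=500)   # posterior comfortably inside the prior
at_the_edge = rng.normal(2.6, 0.3, size=500)   # posterior pushed against the prior's upper tail

f_ok = fraction_outside_prior(prior, well_within)
f_flag = fraction_outside_prior(prior, at_the_edge)
print(f"well within: {f_ok:.2f}, at the edge: {f_flag:.2f}")
```

A large fraction for any forecast would be a prompt to revisit the prior parameter distributions of the physics-based model before trusting the emulator.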

 ### Benefits:
 - extreme numerical efficiency. We only need a few hundred runs of the physics-based model as we are not using it for parameter adjustment. It is only used to construct/train the statistical model.
 - can use DSI as a verification of IES forecasts: are these too wide and/or not wide enough?
 - allows for parameter distributions of arbitrary complexity. 
 - can be used for analyzing the worth of existing and as-of-yet uncollected data to reduce predictive uncertainty. With no assumption of linearity! (see next notebook)

### Drawbacks:
 - if the relationship between the past and the future is highly non-linear, it will not work. However, this would also be an issue when using IES with a physics-based model...
 - it is not possible to query the parameters that result in extreme forecasts
 - it is important that the physics-based model be complex enough to ensure that the connection between the past and the forecast has integrity.