In [None]:
%load_ext autoreload
%autoreload 2

import os
import shutil
import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning) 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt;
import psutil

import sys
import pyemu
import flopy
assert "dependencies" in flopy.__file__
assert "dependencies" in pyemu.__file__
sys.path.insert(0,"..")
import herebedragons as hbd

# Data-space Inversion

Data space inversion (DSI) enables the exploration of a model prediction's posterior distribution without requiring the exploration of the posterior distribution of model parameters. This is achieved by constructing a surrogate model using principal component analysis (PCA) of a covariance matrix of model outputs. This matrix links model outputs corresponding to field measurements with predictions of interest. The resulting predictions are then conditioned on real-world measurements of system behavior in the latent PCA subspace - whew!

DSI can be an efficient tool for predictive uncertainty quantification, as it allows for the exploration of the uncertainty in model predictions without the need to explore the uncertainty in model parameters. This is particularly useful when the model is complex and the parameter space is high-dimensional. DSI can also be used to explore the sensitivity of model predictions to different types of data, and to identify the most informative data types for reducing predictive uncertainty. 

See the "intro to EVA and DSI" notebook for a more detailed explanation of the method, as well as the original paper by [Sun and Durlofsky (2017)](https://doi.org/10.1007/s11004-016-9672-8) and subsequent variations (e.g., [Lima et al. (2020)](https://doi.org/10.1007/s10596-020-09933-w)). There are also [GMDSI webinars and lectures available on YouTube](https://www.youtube.com/watch?v=s2g3HaJa1Wk&t=1564s).

In these notebooks we will focus on how to implement DSI for predictive uncertainty quantification, using pyEMU and PEST++. We will not discuss the maths behind the method in detail, but rather focus on how to use the tools to implement it. For a review of the maths see the ["intro to eva and dsi"](../part0_intro_to_dsi/intro_to_dsi.ipynb) notebook.

## Getting ready

Undertaking DSI relies on the existence of an ensemble of model-generated outputs (i.e., observations in the PEST control file) for both historical observation quantities (e.g., heads, flows, concentrations) AND forecast quantities of interest. This is important, so we will say it again: DSI requires the results of a Monte Carlo set of runs for both historic and future/scenario (prediction) conditions. These results are generated by running the model with a range of parameter values, which are usually sampled from the prior parameter distribution. Note that this distribution does not need to be Gaussian, and each model "parameterization" can be as complex as the user desires. Generating the combined historic-future/scenario output ensemble is the only time that the numerical model needs to be run. Ideally, the ensemble size should be as large as you can afford. However, once generated, the DSI data-driven/emulator "model" runs very quickly.
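To make the shape of the training data concrete, here is a minimal sketch that uses a hypothetical `toy_model` function as a stand-in for the numerical model (it is not part of pyEMU or of this tutorial's dataset). Each realization is run once, and its historical and forecast outputs are stored side by side in a single row.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(par):
    """Hypothetical stand-in for the numerical model: returns (history, forecast)."""
    t_hist = np.arange(1, 6)   # five historical output times
    t_fore = np.arange(6, 9)   # three forecast output times
    hist = par[0] + par[1] * t_hist
    fore = par[0] + par[1] * t_fore
    return hist, fore

num_reals = 50
# sample parameters from a prior that is deliberately not Gaussian
prior_pars = np.column_stack([rng.uniform(0.0, 1.0, num_reals),
                              rng.lognormal(0.0, 0.5, num_reals)])

# one model run per realization, keeping history AND forecast outputs together
rows = [np.concatenate(toy_model(p)) for p in prior_pars]
sim_ensemble = np.vstack(rows)
print(sim_ensemble.shape)  # (50, 8): realizations x (historical + forecast outputs)
```

The resulting matrix (realizations by outputs) plays the same role as the observation ensemble that is loaded from file further down.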

Let us start by generating the ensemble of model outputs. We will make use of the prior ensemble generated in a previous tutorial. The next couple of cells load necessary dependencies and call a convenience function to prepare the PEST dataset folder for you. This is the same dataset that was constructed during the ["freyberg ies"](../part2_06_ies/freyberg_ies_1_basics.ipynb) tutorial. Simply press `shift+enter` to run the cells.

Note: in our tutorial case, there is not actually much computational advantage in using the emulator (i.e., DSI) versus using the numerical model (i.e. Freyberg model). This is because the Freyberg model is very fast already. However, the DSI method is very useful for more computationally expensive models.

..anyway...lez'go!

Prepare the template directory:

In [None]:
# specify the temporary working folder
ies_d = os.path.join('master_ies_1a')
if os.path.exists(ies_d):
    shutil.rmtree(ies_d)

org_t_d = os.path.join("..","part2_06_ies","master_ies_1a")
if not os.path.exists(org_t_d):
    raise Exception(f"you need to run the {org_t_d} notebook")
shutil.copytree(org_t_d,ies_d)

Let us quickly load the Pst control file and remind ourselves of observations and predictions:

In [None]:
pst_name = os.path.join(ies_d, "freyberg_mf6.pst")
pst_freyberg = pyemu.Pst(pst_name)
pst_freyberg.pestpp_options

In [None]:
predictions = pst_freyberg.pestpp_options['forecasts'].split(',')
predictions

Start by loading the prior observation ensemble and checking how many realizations it has. These are our "training data". Let's use all of them. (If you like, you can experiment with what happens when fewer realizations are used in the training dataset...)

In [None]:
oe_name = pst_name.replace(".pst", ".0.obs.csv")
# prior (iteration 0) and posterior (iteration 3) observation ensembles from the IES run
oe_pr = pyemu.ObservationEnsemble.from_csv(pst=pst_freyberg, filename=oe_name)
oe_pt = pyemu.ObservationEnsemble.from_csv(pst=pst_freyberg, filename=pst_name.replace(".pst", ".3.obs.csv"))

oe_pr.shape

In previous notebooks, we set the observation weights to be "balanced for visibility". Thus they do not reflect measurement noise. However, we included the `standard_deviation` column in the PEST observation data section. Ensembles of observation noise are generated using the latter.

Experience suggests that, when conditioning a Gaussian process model, there is not much advantage in weighting for visibility. Achieving model-to-measurement fits commensurate with noise is rarely an issue with these methods. In practice, it may be more convenient to assign weights explicitly as the inverse of the standard deviation of noise. This provides an easy way to verify whether over-fitting is occurring (i.e., Phi should be greater than or equal to the number of non-zero weighted observations).

In [None]:
obs = pst_freyberg.observation_data
obs.loc[obs.weight>0, ['obsval','weight','standard_deviation']].tail()
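The over-fitting check mentioned above can be illustrated with a small synthetic sketch. All values here are made up (they are not the Freyberg observations). With weights equal to the inverse of the noise standard deviation, a simulation that matches the underlying truth gives a Phi close to the number of non-zero weighted observations, while a simulation that reproduces the noise exactly gives a Phi of zero.

```python
import numpy as np

rng = np.random.default_rng(42)

n_obs = 500
std = np.full(n_obs, 0.1)               # assumed measurement noise standard deviation
truth = rng.normal(35.0, 1.0, n_obs)    # hypothetical "true" system values
obsval = truth + rng.normal(0.0, std)   # measurements = truth + noise

weight = 1.0 / std                      # weights as inverse of noise standard deviation

def phi(sim):
    """Sum of squared weighted residuals."""
    return float(np.sum((weight * (sim - obsval)) ** 2))

phi_noise_level = phi(truth)   # fits the truth, not the noise: Phi is roughly n_obs
phi_overfit = phi(obsval)      # reproduces the noise exactly: Phi is zero
print(n_obs, round(phi_noise_level, 1), phi_overfit)
```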

Let's change our observation weights to be the inverse of the standard deviation of noise. Alternatively we could explicitly pass in an ensemble of observation noise.

In [None]:
obs.loc[obs.weight>0,"weight"] = 1.0 / obs.loc[obs.weight>0,"standard_deviation"]
assert obs.weight.sum()>0, "no non-zero obs weights found"
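For reference, an ensemble of observation noise is conceptually just the measured values plus random draws scaled by the noise standard deviation. The sketch below uses plain NumPy and made-up values; it is an illustration of the idea under an independent Gaussian noise assumption, not the pyEMU or PEST++ implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

obsval = np.array([34.2, 34.5, 35.1, 1250.0])   # hypothetical measured values
std = np.array([0.1, 0.1, 0.1, 125.0])          # hypothetical noise standard deviations
num_reals = 1000

# each row is one realization of "measured value + noise" (independent Gaussian noise assumed)
noise_ensemble = obsval + rng.normal(0.0, 1.0, (num_reals, obsval.size)) * std

print(noise_ensemble.shape)
print(noise_ensemble.std(axis=0).round(2))
```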

Recall from previous notebooks that we are carrying along a whole bunch of zero-weighted observations. Just to speed things up (in particular when we implement normal-score transformation later on), let's drop these observations from the PEST control file. We will only keep the observation groups that have non-zero weights and the `predictions`. This is probably best practice for real-world DSI use cases...

Note that we don't have to do this; it just speeds things up a bit during setup.

In [None]:
[f for f in pst_freyberg.instruction_files]

In [None]:
# drop obs to reduce matrix size
drop_list = [f for f in pst_freyberg.instruction_files if f.startswith("hdslay")]
drop_list.extend([f for f in pst_freyberg.instruction_files if ".npf_k_layer1." in f])
drop_list.append("inc.csv.ins")
drop_list.append("cum.csv.ins")
for o in drop_list:
    pst_freyberg.drop_observations(os.path.join(ies_d,o),pst_path='.')

Do the same for the observation ensembles:

In [None]:
obsnmes = pst_freyberg.observation_data.obsnme.values

oe_pr = oe_pr.loc[:, obsnmes]
oe_pt = oe_pt.loc[:, obsnmes]

In [None]:
oe_pr.shape[0], "realizations used for DSI training"

## Preparing the DSI model and PEST setup

In pyEMU, the "ensemble data space" `EnDS` class is the entry point for all things DSI. It is initialized with the PEST control file and the ensemble of model outputs. When initialized it prepares in memory the various components required for DSI and associated analyses. 

In [None]:
ends = pyemu.EnDS(pst=pst_freyberg,
                  sim_ensemble=oe_pr,
                  predictions=predictions,
                  verbose=False)

Now is where things get fancy. We need to construct a surrogate model. We also need to construct a new PEST setup to condition the surrogate. Luckily, pyEMU has a function that does all the hard work for us!

When calling `.prep_for_dsi()` on the `EnDS` object, pyemu will prepare a Pst object and folder with all the files required to run and condition the surrogate model using pestpp-ies. The new PEST control file is named `dsi.pst`. It contains all the observations that were included in the Pst object passed to `EnDS`. Parameters in the new control file are the vector of $\mathbf{x}$, i.e. the surrogate model parameters.

The `.prep_for_dsi()` method provides some optional arguments that enable the user to specify whether observations should be subject to normal-score transformation (useful to improve the Gaussianity of the data - a condition on which this method relies) and which method to use for calculating the square root of the covariance matrix. If the `use_ztz` argument is set to `True`, DSI will be set up using the same approach employed by `DSI2` in the PEST utilities. The optional argument `energy` can be used to specify the level of singular value truncation. 
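As a rough illustration of what singular value truncation by "energy" means, the sketch below keeps the smallest number of leading singular values whose cumulative share reaches a target fraction. The helper `truncate_by_energy` is hypothetical, and defining energy as the cumulative fraction of squared singular values is an assumption; the exact definition used by pyEMU may differ.

```python
import numpy as np

def truncate_by_energy(singular_values, energy=0.99):
    """Return how many leading singular values are needed to reach `energy`.

    Assumption: energy is the cumulative fraction of squared singular values.
    """
    s2 = np.asarray(singular_values, dtype=float) ** 2
    cum_frac = np.cumsum(s2) / s2.sum()
    # index of the first cumulative fraction >= energy, converted to a count
    k = int(np.searchsorted(cum_frac, energy) + 1)
    return min(k, s2.size)

s = np.array([10.0, 5.0, 1.0, 0.1])
print(truncate_by_energy(s, energy=0.99))  # 2
```

Fewer retained components mean fewer surrogate model parameters, at the cost of discarding some of the variability present in the training ensemble.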

Let us start by using no normal-score transformation and the default approach.

In [None]:
t_d = "dsi_template"
pst_dsi = ends.prep_for_dsi(t_d=t_d)

One extra step for our case, because `.prep_for_dsi()` does not copy over executables (note that we don't need MODFLOW for DSI!).

In [None]:
# from pst_template copy exe files
found = False
for f in os.listdir(os.path.join(ies_d)):
    if f.startswith("pestpp-ies"):
        shutil.copy2(os.path.join(ies_d,f),os.path.join(t_d,f))
        found = True
if not found:
    raise Exception("couldn't find pestpp-ies binary in {0}".format(ies_d))

Let's take a quick look at what's in this new folder:

In [None]:
os.listdir(t_d)

You can see the various components of the `dsi.pst` version 2 control file. Files that start with `dsi_` are the model emulator input and output files. 

The `dsi_pr_mean.csv` and `dsi_proj_mat.npy` files are inputs required by the emulator. They contain the prior mean observation values $\bar{\mathbf{d}}$ and the square root of the observation covariance matrix $\mathbf{C}_d^{1/2}$ (see the "intro to dsi" notebook for more details).
The `dsi_sim_vals.csv` file contains the emulator generated observation values using the input parameters, $\mathbf{x}$, contained in `dsi_pars.csv`. 
The `forward_run.py` script contains code to read input files, run the emulator and record outputs.
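Conceptually, the emulator forward run evaluates $\mathbf{d} = \bar{\mathbf{d}} + \mathbf{C}_d^{1/2}\mathbf{x}$. The sketch below builds these pieces in memory from a synthetic ensemble using an SVD of the ensemble deviations. It is one possible way to obtain a square root of the covariance matrix, written under stated assumptions, and is not a copy of what pyEMU writes to `forward_run.py`.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical training ensemble: 200 realizations of 30 model outputs (rank 5)
num_reals, num_obs = 200, 30
latent = rng.normal(size=(num_reals, 5))
mixing = rng.normal(size=(5, num_obs))
sim_ensemble = 10.0 + latent @ mixing

d_mean = sim_ensemble.mean(axis=0)                       # prior mean of outputs
dev = (sim_ensemble - d_mean) / np.sqrt(num_reals - 1)   # scaled deviations
# dev = U S Vt, so C_d = dev.T @ dev = V S^2 Vt
u, s, vt = np.linalg.svd(dev, full_matrices=False)
keep = s > 1e-8 * s[0]                                   # drop numerically zero components
proj_mat = vt[keep].T * s[keep]                          # a square root of C_d (num_obs x npar)

def emulator(x):
    """Surrogate forward run: d = d_mean + C_d^(1/2) x."""
    return d_mean + proj_mat @ x

x = rng.normal(size=proj_mat.shape[1])   # x ~ N(0, I) in the latent space
d_sim = emulator(x)
print(proj_mat.shape, d_sim.shape)
```

Note that the number of surrogate parameters equals the number of retained components, not the number of parameters in the original model.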

Let's take a quick look at how npar has changed. The number of parameters in the original Pst object is the number of parameters in the model. The number of parameters in the new Pst object is the number of principal components used in the emulator.

In [None]:
pst_dsi.npar_adj, pst_freyberg.npar_adj

Verify we have the same number of observations. 

In [None]:
pst_dsi.nobs, pst_freyberg.nobs

In [None]:
pst_dsi.nnz_obs, pst_freyberg.nnz_obs

Right on! We are ready to get cracking. Let's run pestpp-ies and see what we get.

#ATTENTION!

As always, set the number of workers according to your resources.

In [None]:
num_workers=10

Depending on the number of realizations and how many workers you have at your disposal, the next cell may take a while to run. Although the emulator model is super fast, we are running it many times.

As always, ideally you would want as many realizations as you can afford. For the sake of the tutorial let's stick to 200. (The next cell might take 10-20min to run, depending on the number of workers.)

In [None]:
# load the control file
pst = pyemu.Pst(os.path.join(t_d,"dsi.pst"))

# let's specify the number of realizations to use; usually this should be as many as you can afford
num_reals = 200 #oe_pr.shape[0] # using few because it's a tutorial and we don't want it to take too long...

pst.pestpp_options["ies_num_reals"] = num_reals

# and some options to reduce the number of lost reals and improve conditioning
pst.pestpp_options["overdue_giveup_fac"] = 1e30
pst.pestpp_options["overdue_giveup_minutes"] = 1e30
pst.pestpp_options["ies_no_noise"] = False 
pst.pestpp_options["ies_subset_size"] = -10 # the more the merrier
pst.pestpp_options["ies_bad_phi_sigma"] = 2.0

# set noptmax 
pst.control_data.noptmax = 3

# and re-write
pst.write(os.path.join(t_d,"dsi.pst"),version=2)


In [None]:
# the master dir
m_d = "master_dsi"
pyemu.os_utils.start_workers(t_d,"pestpp-ies","dsi.pst",num_workers=num_workers,worker_root='.',
                                master_dir=m_d)

Right on. Let's check the evolution of the objective function (remember, this is the same objective function, observations and weights, that we have been using in the other notebooks on data assimilation/history matching!). We should see it decrease as the emulator is conditioned/trained on the observations, but we don't want it to become less than the number of nonzero-weighted observations (`nnz_obs`) #overfitting.

In [None]:
m_d = "master_dsi"
phidf = pd.read_csv(os.path.join(m_d,"dsi.phi.actual.csv"),index_col=0)

fig,ax=plt.subplots(1,1,figsize=(4,4))
ax.plot(phidf.index,phidf['mean'],"bo-", label='dsi')

ax.set_yscale('log')
ax.set_ylabel('phi')
ax.set_xlabel('iteration')

ax.text(0.7,0.9,f"nnz_obs: {pst.nnz_obs}\nphi_dsi: {phidf['mean'].iloc[-1]:.2f}",
        transform=ax.transAxes,ha="right",va="top")

ax.legend();
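One simple way to act on this criterion is to pick the last iteration whose mean Phi is still at or above the number of nonzero-weighted observations. The helper below, `pick_iteration`, is hypothetical and uses made-up Phi values; falling back to iteration 0 (the prior) when every iteration is below the threshold is a design choice for this sketch, not a PEST++ rule.

```python
def pick_iteration(mean_phi_by_iter, nnz_obs):
    """Return the last iteration whose mean phi is still >= nnz_obs.

    Falls back to iteration 0 (the prior) if no iteration qualifies.
    """
    ok = [it for it, phi in sorted(mean_phi_by_iter.items()) if phi >= nnz_obs]
    return ok[-1] if ok else 0

# hypothetical mean phi per iteration
mean_phi = {0: 5200.0, 1: 900.0, 2: 310.0, 3: 180.0}
print(pick_iteration(mean_phi, nnz_obs=250))  # 2
```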

And as usual, we bring this back to the predictions. How has DSI performed? While we are at it, let's compare to the numerical model derived predictions we obtained in the "freyberg ies" notebooks.

In [None]:
def plot_dsi_hist(m_d,pst,iteration=3):
    pr_oe_dsi = pd.read_csv(os.path.join(m_d,"dsi.0.obs.csv"),index_col=0)
    pt_oe_dsi = pd.read_csv(os.path.join(m_d, f"dsi.{iteration}.obs.csv"), index_col=0)

    #pv = pyemu.ObservationEnsemble(pst=pst,df=oe_pt).phi_vector
    #pv_dsi = pyemu.ObservationEnsemble(pst=pst, df=pt_oe_dsi).phi_vector

    fig,axes = plt.subplots(len(predictions),1,figsize=(7,7))
    for p,ax in zip(predictions,axes):
            #calculate consistent bin edges
            bins = np.linspace(
                    min(oe_pr.loc[:,p].values.min(),
                        oe_pt.loc[:,p].values.min(),
                        pr_oe_dsi.loc[:,p].values.min(),
                        pt_oe_dsi.loc[:,p].values.min()),
                    max(oe_pr.loc[:,p].values.max(),
                        oe_pt.loc[:,p].values.max(),
                        pr_oe_dsi.loc[:,p].values.max(),
                        pt_oe_dsi.loc[:,p].values.max()),
                    50)

            ax.hist(oe_pr.loc[:,p].values,alpha=0.5,facecolor="0.5",density=True,label="prior",bins=bins)
            ax.hist(oe_pt.loc[:, p].values,  alpha=0.5, facecolor="b",density=True,label="posterior",bins=bins)
            ax.hist(pr_oe_dsi.loc[:, p].values,  facecolor="none",hatch="/",edgecolor="0.5",lw=0.5,density=True,label="dsi prior",bins=bins)
            ax.hist(pt_oe_dsi.loc[:, p].values,  facecolor="none",density=True,hatch="/",edgecolor="b",lw=.5,label="dsi posterior",bins=bins)
            
            fval = pst.observation_data.loc[p,"obsval"]
            ax.plot([fval,fval],ax.get_ylim(),"r-",label='truth')
            
            ax.set_title(p,loc="left")
            ax.legend(loc="upper right")
            ax.set_yticks([])
    fig.tight_layout()


In [None]:
iteration=phidf.loc[phidf["mean"]>= pst.nnz_obs].index.values[-1]
plot_dsi_hist(m_d,pst,iteration=iteration)

Not too shabby. Prediction variance has decreased for most forecasts (blue histograms have less spread than grey ones). And the posterior distributions all capture the truth. 

There is a noticeable issue for the particle travel time prediction: the model emulator is predicting physically impossible values (i.e., negative times). On top of that, the physics-based model prior and posterior show a distinctly skewed distribution, whilst the emulator-derived outputs are symmetric. This is due to the Gaussian assumption of the emulator. To avoid it, we can (and should!) apply a normal-score transformation to the data. Stay tuned! (Alternatively, or additionally, we could log-transform the `part_time` values.)

First, let's look at another common issue: saw-tooth patterns in time-series outputs. Make a quick plot of the future time series outputs:
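The effect of the Gaussian assumption on a skewed, strictly positive quantity can be reproduced with synthetic numbers. In the sketch below (made-up lognormal "travel times", not the Freyberg outputs), a Gaussian approximation in native space produces negative values, whereas the same approximation in log space stays positive after back-transformation.

```python
import numpy as np

rng = np.random.default_rng(7)

# hypothetical skewed, strictly positive "travel time" training outputs
times = rng.lognormal(mean=8.0, sigma=1.0, size=2000)

# Gaussian approximation in native space: symmetric, and it can go negative
native = rng.normal(times.mean(), times.std(), size=2000)

# Gaussian approximation in log10 space, then back-transformed: always positive
log_t = np.log10(times)
back = 10.0 ** rng.normal(log_t.mean(), log_t.std(), size=2000)

print((native < 0).sum(), (back <= 0).sum())
```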


In [None]:
def plot_tseries_ensembles(pt_oe, onames=["hds","sfr"]):
    pst.try_parse_name_metadata()
    # get the observation data from the control file and select 
    obs = pst.observation_data.copy()
    # onames provided in oname argument
    obs = obs.loc[obs.oname.apply(lambda x: x in onames)]
    # only non-zero observations
    obs = obs.loc[obs.obgnme.apply(lambda x: x in pst.nnz_obs_groups),:]
    # make a plot
    ogs = obs.obgnme.unique()
    fig,axes = plt.subplots(len(ogs),1,figsize=(7,2*len(ogs)))
    ogs.sort()
    # for each observation group (i.e. timeseries)
    for ax,og in zip(axes,ogs):
        # get values for x axis
        oobs = obs.loc[obs.obgnme==og,:].copy()
        oobs.loc[:,"time"] = oobs.time.astype(float)
        oobs.sort_values(by="time",inplace=True)
        tvals = oobs.time.values
        onames = oobs.obsnme.values
        ax.plot(oobs.time,oobs.obsval,color="fuchsia",lw=2,zorder=100)
        # plot prior
        #[ax.plot(tvals,pr_oe.loc[i,onames].values,"0.5",lw=0.5,alpha=0.5) for i in pr_oe.index]
        # plot posterior
        [ax.plot(tvals,pt_oe.loc[i,onames].values,"b",lw=0.5,alpha=0.5) for i in pt_oe.index]
        # plot measured+noise 
        oobs = oobs.loc[oobs.weight>0,:]
        tvals = oobs.time.values
        onames = oobs.obsnme.values
        #[ax.plot(tvals,noise.loc[i,onames].values,"r",lw=0.5,alpha=0.5,zorder=0) for i in noise.index]
        ax.plot(oobs.time,oobs.obsval,"r-o",lw=2,zorder=100)
        ax.set_title(og,loc="left")
    fig.tight_layout()
    return fig

The following cell plots the posterior of three forecast time-series. As you can see, DSI-generated outputs are nice and smooth in the historical period. However, the same cannot be said for the forecast period, where "saw-toothed" behaviour can be seen, with the time-series jumping up and down. It is especially evident in the early future for site `trgw-0-3-8`. This may also be somewhat mitigated by using normal-score transformation...

In [None]:
pt_oe_dsi = pd.read_csv(os.path.join(m_d, f"dsi.{iteration}.obs.csv"), index_col=0)
fig = plot_tseries_ensembles(pt_oe_dsi, onames=["hds","sfr"])

## DSI with normal-score transformation

We will now repeat all of the above, but using the normal-score transformation (NST) option when calling `.prep_for_dsi()`. This option will first apply a normal-score transform to all model derived outputs in the training dataset. Transformed observation values will have a distribution which is more similar to Gaussian, thus better respecting the assumptions on which DSI relies. Then, during the DSI forward run, emulator-derived outputs are back-transformed into the original data space, so as to allow for direct comparison to measured values.

In the `pyemu` python implementation this comes at a slight cost of increased run time, both in setting up the emulator, as well as in the forward run. The latter may cost a couple of seconds more. Sorry about that. 
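For intuition, a normal-score transform can be sketched as a rank-based lookup table: training values are mapped to standard normal quantiles, and emulator outputs are mapped back by interpolating that table. The functions `nst_forward` and `nst_back` below are hypothetical illustrations, not the pyEMU implementation, and tail handling is deliberately simplistic (values outside the training range are clamped rather than extrapolated).

```python
import numpy as np
from statistics import NormalDist

def nst_forward(train):
    """Map training values to normal scores via their empirical ranks."""
    train = np.asarray(train, dtype=float)
    order = np.argsort(train)
    n = train.size
    # plotting-position probabilities strictly inside (0, 1)
    probs = (np.arange(1, n + 1) - 0.5) / n
    std_normal = NormalDist()
    scores_sorted = np.array([std_normal.inv_cdf(p) for p in probs])
    scores = np.empty(n)
    scores[order] = scores_sorted
    return scores, train[order], scores_sorted

def nst_back(z, sorted_vals, sorted_scores):
    """Back-transform normal scores by interpolating the training table.

    np.interp clamps outside the table, so there is no tail extrapolation here.
    """
    return np.interp(z, sorted_scores, sorted_vals)

rng = np.random.default_rng(3)
train = rng.lognormal(8.0, 1.0, 500)          # hypothetical skewed training outputs
z, sorted_vals, sorted_scores = nst_forward(train)
recovered = nst_back(z, sorted_vals, sorted_scores)
print(round(float(z.std()), 2), bool(np.allclose(recovered, train)))
```

Because back-transformed values are bounded by the training table in this sketch, a strictly positive quantity such as travel time cannot come back negative.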

(The next cell might take 10-20min to run, depending on the number of workers.)

Here we go:

In [None]:
t_d = "dsi_template"

ends = pyemu.EnDS(pst=pst_freyberg,
                  sim_ensemble=oe_pr,
                  predictions=predictions,
                  verbose=False)

pst_dsi = ends.prep_for_dsi(t_d=t_d,
                            apply_normal_score_transform=True,
                            nst_extrap="quadratic",
                            )

# from pst_template copy exe files
found = False
for f in os.listdir(os.path.join(ies_d)):
    if f.startswith("pestpp-ies"):
        shutil.copy2(os.path.join(ies_d,f),os.path.join(t_d,f))
        found = True
if not found:
    raise Exception("couldn't find pestpp-ies binary in {0}".format(ies_d))

# load the control file
pst = pyemu.Pst(os.path.join(t_d,"dsi.pst"))

# let's specify the number of realizations to use; as usual this should be as many as you can afford
pst.pestpp_options["ies_num_reals"] = num_reals

# and some options to reduce the number of lost reals and improve conditioning
pst.pestpp_options["overdue_giveup_fac"] = 1e30
pst.pestpp_options["overdue_giveup_minutes"] = 1e30
pst.pestpp_options["ies_no_noise"] = False 
pst.pestpp_options["ies_subset_size"] = -10
pst.pestpp_options["ies_bad_phi_sigma"] = 2.0


# set noptmax 
pst.control_data.noptmax = 3

# and re-write
pst.write(os.path.join(t_d,"dsi.pst"),version=2)

# the master dir
m_d = "master_dsi_nst"
pyemu.os_utils.start_workers(t_d,"pestpp-ies","dsi.pst",num_workers=num_workers,worker_root='.',
                                master_dir=m_d)

In [None]:
phidf = pd.read_csv(os.path.join(m_d,"dsi.phi.actual.csv"),index_col=0)
fig,ax=plt.subplots(1,1,figsize=(4,4))
ax.plot(phidf.index,phidf['mean'],"bo-", label='dsi')

ax.set_yscale('log')
ax.set_ylabel('phi')
ax.set_xlabel('iteration')

ax.text(0.7,0.9,f"nnz_obs: {pst.nnz_obs}\nphi_dsi: {phidf['mean'].iloc[-1]:.2f}",
        transform=ax.transAxes,ha="right",va="top")
ax.legend();

Again, we achieved near-noise levels of fit without too much effort. We would not want to fit any further...and we might even wish to not fit this well... But let's take the results of the last iteration with Phi at or above `nnz_obs` and see how our predictions did:

In [None]:
iteration=phidf.loc[phidf["mean"]>= pst.nnz_obs].index.values[-1]
plot_dsi_hist(m_d,pst,iteration=iteration)

Sweet! #winning. This time around, the DSI priors look a lot more similar to those we obtained with the Freyberg model. On top of that, the distribution of predicted particle travel time (`part_time`) no longer has physically infeasible values, and it follows the same skewed, long-tailed distribution as the physics-based model.

# Final remarks

Data space inversion provides a powerful tool for data assimilation and uncertainty quantification for cases in which model complexity and/or computational cost preclude the use of traditional methods. 

 - use as many reals as possible when generating training data. The more the better.
 - the same applies for DSI conditioning with IES. The more reals the better.
 - make sure the DSI-generated prior reflects the physics-based model prior. This may require a number of DSI realizations equal to or greater than the number in the training data, especially if using normal-score transformation.
 - if forecasts are leaning towards the ends (or beyond!) of the prior distribution, this is a red flag. It is probably worth revising the physics-based model prior parameter distributions and re-training the emulator.
 - don't overfit...#duh
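As a sketch of how the DSI prior might be checked against the physics-based prior for a single forecast, the hypothetical helper `compare_priors` below compares a few quantiles of the two ensembles. The data are synthetic, and the tolerance of 0.25 prior standard deviations is arbitrary and purely illustrative.

```python
import numpy as np

def compare_priors(physics_vals, dsi_vals, quantiles=(0.05, 0.5, 0.95)):
    """Return quantiles of both ensembles and their difference,
    scaled by the physics-based prior standard deviation."""
    q_phys = np.quantile(physics_vals, quantiles)
    q_dsi = np.quantile(dsi_vals, quantiles)
    scaled_diff = (q_dsi - q_phys) / np.std(physics_vals)
    return q_phys, q_dsi, scaled_diff

rng = np.random.default_rng(11)
# hypothetical skewed forecast from the physics-based model
physics_prior = rng.lognormal(8.0, 1.0, 1000)
# symmetric stand-in for an emulator prior without normal-score transformation
dsi_prior = rng.normal(physics_prior.mean(), physics_prior.std(), 1000)

q_phys, q_dsi, scaled_diff = compare_priors(physics_prior, dsi_prior)
flagged = np.abs(scaled_diff) > 0.25   # arbitrary illustrative tolerance
print(flagged)
```

A flagged quantile suggests that the emulator prior does not reproduce the shape of the physics-based prior for that forecast.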

### Benefits
 - extreme numerical efficiency. We only need a few hundred runs of the physics-based model as we are not using it for parameter adjustment. It is only used to construct/train the statistical model.
 - can use DSI as a verification of IES forecasts: are these too wide and/or not wide enough?
 - allows for parameter distributions of arbitrary complexity in the prior.
 - can be used for analyzing the worth of existing and as-of-yet uncollected data to reduce predictive uncertainty. With no assumption of linearity! (see next notebook)

### Drawbacks
 - if the relationship between the past and the future is highly non-linear, it will not work: DSI is predicated on an assumed linear relationship between model outputs from the past and model outputs from the future.
 - it is not possible to view the process-based model (e.g., MODFLOW) inputs that produce the DSI forecast posterior distribution, so we can't "see" what model inputs might be causing extreme results.
 - when simulating future scenarios, you have to rerun the training ensemble through the process-based model and also rerun the DSI training process for each scenario. But you would also have to run the ensemble for each scenario with the physics-based model...