In [None]:
%load_ext autoreload
%autoreload 2

import os
import shutil
import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning) 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt;
import psutil

import sys
import pyemu
import flopy
assert "dependencies" in flopy.__file__
assert "dependencies" in pyemu.__file__
sys.path.insert(0,"..")
import herebedragons as hbd

# Data-space Inversion

Data space inversion (DSI) enables the exploration of a model prediction's posterior distribution without requiring the exploration of the posterior distribution of model parameters. This is achieved by constructing a surrogate model using principal component analysis (PCA) of a covariance matrix of model outputs. This matrix links model outputs corresponding to field measurements with predictions of interest. The resulting predictions are then conditioned on real-world measurements of system behavior in the latent PCA subspace - whew!

tl;dr: DSI is a surrogate modelling approach that works by mapping statistical relationships between observations. It's super fast and relatively robust. It is good for uncertainty quantification, data assimilation and even optimization.

DSI can be an efficient tool for predictive uncertainty quantification, as it allows for the exploration of the uncertainty in model predictions without the need to explore the uncertainty in model parameters. This is particularly useful when the model is complex and the parameter space is high-dimensional. DSI can also be used to explore the sensitivity of model predictions to different types of data, and to identify the most informative data types for reducing predictive uncertainty. 

See the "intro to EVA and DSI" notebook for a more detailed explanation of the method, as well as the original papers [Sun and Durlofsky (2017)](https://doi.org/10.1007/s11004-016-9672-8), and subsequent variations (e.g., [Lima et al 2020](https://doi.org/10.1007/s10596-020-09933-w)). There are also [GMDSI webinars and lectures available on YouTube](https://www.youtube.com/watch?v=s2g3HaJa1Wk&t=1564s).

In these notebooks we will focus on how to implement DSI for predictive uncertainty quantification, using pyEMU and PEST++. We will not discuss the maths behind the method in detail, but rather focus on how to use the tools to implement it. For a review of the maths see the ["intro to eva and dsi"](../part0_intro_to_dsi/intro_to_dsi.ipynb) notebook.
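
To make the idea concrete before we reach for the tools, here is a minimal NumPy sketch of the kind of surrogate described above. It is not the pyEMU implementation; the ensemble is synthetic and all names (`ensemble`, `surrogate`, etc.) are made up for illustration. It centres an ensemble of outputs, takes an SVD, and maps standard normal latent values back to emulated outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training ensemble: 200 realizations (rows) of 6 model outputs
# (columns); a random mixing matrix makes the outputs correlated.
n_reals, n_out = 200, 6
mixing = rng.normal(size=(n_out, n_out))
ensemble = rng.normal(size=(n_reals, n_out)) @ mixing.T + 10.0

# Centre and scale the ensemble so that dev.T @ dev is the sample covariance.
mean = ensemble.mean(axis=0)
dev = (ensemble - mean) / np.sqrt(n_reals - 1)

# The SVD supplies the principal directions (rows of vt) and their scales (s).
_, s, vt = np.linalg.svd(dev, full_matrices=False)

def surrogate(x):
    """Map latent-space values x ~ N(0, I) to emulated model outputs."""
    return mean + (x * s) @ vt

# Covariance implied by the surrogate: V diag(s^2) V^T.
implied_cov = (vt.T * s**2) @ vt
sample_cov = np.cov(ensemble, rowvar=False)

# One emulated "forward run".
sim = surrogate(rng.normal(size=s.shape))
```

In this sketch the covariance implied by the surrogate equals the sample covariance of the training ensemble, which is why outputs tied to measurements and outputs tied to predictions stay statistically linked.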

## Getting ready

Undertaking DSI relies on the existence of an ensemble of model-generated outputs (i.e., observations in the pest control file) for both historical observation quantities (e.g., heads, flows, concentrations, etc.) AND forecast quantities of interest. This is important, so we will say it again: DSI requires the results of a Monte Carlo set of runs for both historic and future/scenario (prediction) conditions. These results are generated by running the model with a range of parameter values, which are usually sampled from the prior parameter distribution. Note that this distribution does not need to be Gaussian, and each model "parameterization" can be as complex as the user desires. Generating the combined historic-future/scenario output ensemble is the only time that the numerical model needs to be run. Ideally, the ensemble size should be as large as you can afford. However, once generated, the DSI data-driven/emulator "model" runs very quickly.

Let us start by generating the ensemble of model outputs. We will make use of the prior ensemble generated in a previous tutorial. The next couple of cells load necessary dependencies and call a convenience function to prepare the PEST dataset folder for you. This is the same dataset that was constructed during the ["freyberg ies"](../part2_06_ies/freyberg_ies_1_basics.ipynb) tutorial. Simply press `shift+enter` to run the cells.

Note: The DSI method is very useful for more computationally expensive models. In our tutorial case, there is not actually much computational advantage in using the emulator (i.e., DSI) versus using the numerical model (i.e., the Freyberg model). This is because the Freyberg model is very fast already. Even so, DSI is faster, especially when using pyEMU, as we can run all of the workers in memory, without needing to read/write to disk! We will demonstrate this with the new PyWorker functionality available in pyEMU.

..anyway...lez'go!

Prepare the template directory:

In [None]:
# specify the temporary working folder
ies_d = os.path.join('master_ies_1a')
if os.path.exists(ies_d):
    shutil.rmtree(ies_d)

org_t_d = os.path.join("..","part2_06_ies","master_ies_1a")
if not os.path.exists(org_t_d):
    raise Exception(f"you need to run the {org_t_d} notebook")
shutil.copytree(org_t_d,ies_d)

Let us quickly load the Pst control file and remind ourselves of observations and predictions:

In [None]:
pst_freyberg = pyemu.Pst(os.path.join(ies_d, "pest.pst"))
pst_freyberg

In [None]:
# load the observation ensembles for use in plotting later
oe_pr = pst_freyberg.ies.obsen0.copy()
oe_pt = pst_freyberg.ies.get("obsen",pst_freyberg.control_data.noptmax).copy()
oe_pr.shape

Take a quick look at the pestpp options:

In [None]:
pst_freyberg.pestpp_options

In [None]:
predictions = pst_freyberg.pestpp_options['forecasts'].split(',')
predictions

Start by loading the prior observation ensemble. Check how many realisations it has. These are our "training data". Let's use all of them. (If you like, you can experiment with what happens when using fewer realizations in the training dataset...)

In [None]:
data = pst_freyberg.ies.obsen0.copy()
data.shape

In previous notebooks, we set the observation weights to be "balanced for visibility". Thus they do not reflect a measure of measurement noise. However, we included the `standard_deviation` column in the PEST observation data section. Ensembles of observation noise are generated using the latter.

Experience suggests that, when conditioning DSI, there is not much advantage in weighting for visibility. Achieving model-to-measurement fits commensurate with noise is rarely an issue with these methods. In practice, it may be more convenient to assign weights explicitly as the inverse of the standard deviation of noise. This provides an easy way to check whether over-fitting is occurring (i.e., Phi should be greater than or equal to the number of non-zero weighted observations).
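
A small sketch of why inverse-standard-deviation weights make the over-fitting check easy. The observation values and standard deviations are hypothetical; `phi` here is simply the sum of squared weighted residuals.

```python
import numpy as np

# Hypothetical measured values and noise standard deviations.
obsval = np.array([34.2, 35.1, 1200.0, 950.0])
stdev = np.array([0.1, 0.1, 120.0, 95.0])
weight = 1.0 / stdev

def phi(sim):
    """Sum of squared weighted residuals."""
    return float(np.sum((weight * (sim - obsval)) ** 2))

# A simulation that misses every observation by exactly one standard deviation
# gives phi equal to the number of observations.
phi_at_noise = phi(obsval + stdev)

# A simulation that fits far more tightly than the noise is a warning sign.
phi_overfit = phi(obsval + 0.1 * stdev)
```

With these weights, each observation fitted at the level of its noise contributes about 1 to phi, so a phi well below the number of non-zero weighted observations suggests fitting the noise.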

In [None]:
obs = pst_freyberg.observation_data
obs.loc[obs.weight>0, ['obsval','weight','standard_deviation']].tail()

Let's change our observation weights to be the inverse of the standard deviation of noise. Alternatively we could explicitly pass in an ensemble of observation noise.

In [None]:
obs.loc[obs.weight>0,"weight"] = 1.0 / obs.loc[obs.weight>0,"standard_deviation"]
assert obs.weight.sum()>0, "no non-zero obs weights found"

In [None]:
obs.oname.unique()

Recall from previous notebooks that we are carrying along a whole bunch of zero-weighted observations. In principle, it is feasible to carry over all of these to DSI. In practice, you probably want to carry over only as much as you really need. This will keep the DSI forward run as fast as possible and provide the PCA with more useful degrees of freedom. Let's drop everything except nzobs and predictions to keep this super-light:

In [None]:
[f for f in pst_freyberg.instruction_files]

In [None]:
# drop obs to ncols
drop_list = [f for f in pst_freyberg.instruction_files if f.startswith("hdslay")]
drop_list.extend([f for f in pst_freyberg.instruction_files if ".npf_k_layer1." in f])
drop_list.append("inc.csv.ins")
drop_list.append("cum.csv.ins")
drop_list.extend([f for f in pst_freyberg.instruction_files if ".tdiff." in f])

for o in drop_list:
    pst_freyberg.drop_observations(os.path.join(ies_d,o),pst_path='.')

Do the same for the observation ensembles:

In [None]:
obsnmes = pst_freyberg.observation_data.obsnme.values

data = data.loc[:, obsnmes]

In [None]:
data.shape

## Preparing the DSI model and PEST setup

In pyEMU, `pyemu.emulators` is the entry point for all things emulation. The minimum requirement to instantiate a `DSI` object is the training data set. When initialized it prepares in memory the various components required for DSI and associated analyses. 

Optionally you can specify the energy level truncation for SVD, data transformations (we will get to this later) and an existing `Pst` object from the full-order model. `DSI` will use information in the `Pst` to help construct a dsi pestpp template directory later. 
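
To illustrate what an energy-based truncation does, here is a hedged sketch. It assumes that "energy" means the cumulative share of the sum of squared singular values; the exact definition used inside pyEMU may differ. The function name and the singular values are hypothetical.

```python
import numpy as np

def n_components_for_energy(s, energy_threshold):
    """Smallest number of leading singular values whose share of the total
    energy (sum of squares, an assumed definition) reaches the threshold."""
    energy = np.cumsum(s**2) / np.sum(s**2)
    # tiny tolerance guards against floating-point round-off near 1.0
    k = int(np.searchsorted(energy, energy_threshold - 1e-12)) + 1
    return min(k, len(s))

s = np.array([10.0, 5.0, 2.0, 1.0, 0.5])  # hypothetical singular values

k95 = n_components_for_energy(s, 0.95)
k99 = n_components_for_energy(s, 0.99)
k100 = n_components_for_energy(s, 1.0)
```

Under this definition, a threshold of 1.0 (as used in this notebook) keeps every component, while lower thresholds drop the trailing, low-energy ones.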

In [None]:
from pyemu.emulators import DSI

dsi = DSI(pst=pst_freyberg,#optional
          data=data,
          transforms=None, #optional
          energy_threshold=1., #optional
          verbose=True)
dsi.fit();

You can check on the various attributes like so:

In [None]:
dsi.__dict__.keys()

And access them like so:

In [None]:
dsi.data.head()

To run the dsi model directly, you can call `dsi.predict()` and pass in a `pvals` array (i.e., a vector of random normal values with the same shape as `dsi.s`). This is effectively the dsi "forward run". When we set up pestpp, we are parameterizing the `pvals` vector and allowing pestpp to adjust the values to improve the fit with observations.

In [None]:
# the singular values
dsi.s[:5]

In [None]:
pvals = np.random.normal(0,1,dsi.s.shape)
dsi.predict(pvals)

Now is where things get fancy. We also need to construct a new PEST setup to condition the surrogate. Luckily, pyEMU has a function that does all the hard work for us!

When calling `.prepare_pestpp()` on the `DSI` object, pyemu will prepare a Pst object and folder with all the files required to run and condition the surrogate model using pestpp-ies. The new PEST control file is named `dsi.pst`. It contains all the observations that were included in the Pst object passed to `DSI`. Parameters in the new control file are the vector of $\mathbf{x}$, i.e. the surrogate model parameters.

In [None]:
t_d = "dsi_template"
pst = dsi.prepare_pestpp(t_d=t_d)

One extra step for our case, because `.prepare_pestpp()` does not copy over executables (note we don't need MODFLOW for DSI!).

In [None]:
# from pst_template copy exe files
found = False
for f in os.listdir(os.path.join(ies_d)):
    if f.startswith("pestpp-ies"):
        shutil.copy2(os.path.join(ies_d,f),os.path.join(t_d,f))
        found = True
if not found:
    raise Exception("couldn't find pestpp-ies binary in {0}".format(ies_d))

Let's take a quick look at what's in this new folder:

In [None]:
os.listdir(t_d)

You can see the various components of the `dsi.pst` version 2 control file. Files that start with `dsi.` are the model emulator input and output files. 

The `DSI` class was pickled and stored for later use. 
The `dsi_sim_vals.csv` file contains the emulator generated observation values using the input parameters, $\mathbf{x}$, contained in `dsi_pars.csv`. 
The `forward_run.py` script contains code to read input files, run the emulator and record outputs.

Let's take a quick look at how npar has changed. The number of parameters in the original Pst object is the number of parameters in the model. The number of parameters in the new Pst object is the number of principal components used in the emulator.

In [None]:
pst.npar_adj, pst_freyberg.npar_adj

Verify we have the same number of observations. 

In [None]:
pst.nobs, pst_freyberg.nobs

In [None]:
pst.nnz_obs, pst_freyberg.nnz_obs

In [None]:
pst.pestpp_options

An important aspect in any emulation workflow is the idea of "extrapolation". It is usually a bad idea to use an emulator to extrapolate beyond the range of the training data. Note that we do provide an option to do so. However, it is usually not recommended. If extrapolation is necessary, it is probably a better idea to extend the training data set. To protect against #badtimes we recommend always using prior data conflict resolution when conditioning a DSI model.
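
As a rough illustration of what "outside the range of the training data" means, the sketch below flags measured values that the training ensemble never brackets. This is a simplified stand-in, not the test that PEST++ applies for `ies_drop_conflicts`; the function name and numbers are hypothetical.

```python
import numpy as np

def outside_training_range(ensemble, obsval):
    """Boolean mask of observations whose measured value lies outside the
    [min, max] range spanned by the training ensemble (rows = realizations)."""
    lo = ensemble.min(axis=0)
    hi = ensemble.max(axis=0)
    return (obsval < lo) | (obsval > hi)

# Hypothetical ensemble: 5 realizations of 3 outputs.
ensemble = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 12.0, 110.0],
    [3.0, 11.0, 105.0],
    [2.5, 13.0, 120.0],
    [1.5, 10.5, 115.0],
])
obsval = np.array([2.2, 15.0, 101.0])  # the second value is not covered
flags = outside_training_range(ensemble, obsval)
```

Any flagged observation would ask the emulator to reproduce a value it has never "seen", which is the situation the conflict handling is meant to guard against.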

In [None]:
pst.pestpp_options["ies_drop_conflicts"] = True

Right on! We are ready to get cracking. Let's run pestpp-ies and see what we get.

#ATTENTION!

As always, set the number of workers according to your resources.

Depending on the number of realizations and how many workers you have at your disposal, the next cell may take a while to run. Although the emulator model is super fast, we are running it many times.

As always, ideally you would want as many realizations as you can afford. For the sake of the tutorial let's stick to 200. (The next cell might take 10-20min to run, depending on the number of workers.)

In [None]:
# lets specify the number of realizations to use; usually this should be as many as you can afford
num_reals = data.shape[0] #using few 'cause its a tutorial and dont want to take too long...
pst.pestpp_options["ies_num_reals"] = num_reals

# set noptmax
pst.control_data.noptmax = 3

# and re-write
pst.write(os.path.join(t_d,"dsi.pst"),version=2)


Now we are going to introduce a new way to deploy pestpp: the `PyWorker`. This option is only available when working with python/pyemu. Because DSI is very fast, reading and writing files to disk becomes the bottleneck when running it many times. And because we are impatient, we want to make it as fast as possible. Our dsi implementation does not actually need to read/write files to disk, so using `PyWorkers` allows us to run all the workers directly in memory, avoiding disk i/o.

To do so, we use the same `pyemu.os_utils.start_workers` function as we always do, but with 2 new arguments: `ppw_function` and `ppw_kwargs`. These effectively tell the worker what function it needs to run (and the arguments that the function takes). pyemu has inbuilt pyworker functions for our implementation of dsi. So all you need to worry about is:

In [None]:
# the master dir
m_d = "master_dsi"

num_workers = 15

pvals = pd.read_csv(os.path.join(t_d, "dsi_pars.csv"), index_col=0)


pyemu.os_utils.start_workers(
    t_d,"pestpp-ies","dsi.pst", num_workers=num_workers,
    worker_root=".", master_dir=m_d, 
    ppw_function=pyemu.helpers.dsi_pyworker,
    ppw_kwargs={
        "dsi": dsi, "pvals": pvals,
    }
)

Right on. That was fast! (should be <1min)

Let's check the evolution of the objective function (remember - this is the same objective function - observations and weights - that we have been using in the other notebooks on data assimilation/history matching!). We should see it decreasing as the emulator is conditioned/trained on the observations, but we don't want it to become less than the number of nonzero-weighted observations (`nnzobs`) #overfitting.

In [None]:
pst = pyemu.Pst(os.path.join(m_d,"dsi.pst"))

phidf = pst.ies.phiactual

fig,ax=plt.subplots(1,1,figsize=(4,4))
ax.plot(phidf.index,phidf['mean'],"bo-", label='dsi')

ax.set_yscale('log')
ax.set_ylabel('phi')
ax.set_xlabel('iteration')

ax.text(0.7,0.9,f"nnz_obs: {pst.nnz_obs}\nphi_dsi: {phidf['mean'].iloc[-1]:.2f}",
        transform=ax.transAxes,ha="right",va="top")

ax.legend();

And as usual, we bring this back to the predictions. How has DSI performed? While we are at it, let's compare to the numerical model derived predictions we obtained in the "freyberg ies" notebooks.

In [None]:
def plot_dsi_hist(m_d,pst,iteration=3):
    pr_oe_dsi = pst.ies.obsen0.copy()
    pt_oe_dsi = pst.ies.get("obsen",iteration).copy()


    fig,axes = plt.subplots(len(predictions)//2,2,figsize=(7,7))
    for p,ax in zip(predictions,axes.flatten()):
            #calculate consistent bin edges
            bins = np.linspace(
                    min(oe_pr.loc[:,p].values.min(),
                        oe_pt.loc[:,p].values.min(),
                        pr_oe_dsi.loc[:,p].values.min(),
                        pt_oe_dsi.loc[:,p].values.min()),
                    max(oe_pr.loc[:,p].values.max(),
                        oe_pt.loc[:,p].values.max(),
                        pr_oe_dsi.loc[:,p].values.max(),
                        pt_oe_dsi.loc[:,p].values.max()),
                    30)

            ax.hist(oe_pr.loc[:,p].values,alpha=0.5,facecolor="0.5",density=True,label="prior",bins=bins)
            ax.hist(oe_pt.loc[:, p].values,  alpha=0.5, facecolor="b",density=True,label="posterior",bins=bins)
            ax.hist(pr_oe_dsi.loc[:, p].values,  facecolor="none",hatch="/",edgecolor="0.5",lw=0.5,density=True,label="dsi prior",bins=bins)
            ax.hist(pt_oe_dsi.loc[:, p].values,  facecolor="none",density=True,hatch="/",edgecolor="b",lw=.5,label="dsi posterior",bins=bins)
            
            fval = pst.observation_data.loc[p,"obsval"]
            ax.plot([fval,fval],ax.get_ylim(),"r-",label='truth')
            
            ax.set_title(p,loc="left")
            ax.legend(loc="best")
            ax.set_yticks([])
    fig.tight_layout()


In [None]:
iteration=1
plot_dsi_hist(m_d,pst,iteration=iteration)

Not too shabby. Prediction variance has decreased for most forecasts (blue histograms have less spread than grey ones). And the posterior distributions all capture the truth. 

There is a noticeable issue for the particle travel time prediction: the model emulator is predicting physically impossible values (i.e., negative times). On top of that, the physics-based model prior and posterior show a distinctly skewed distribution, whilst the emulator-derived outputs are symmetric. This is due to the Gaussian assumption of the emulator. To avoid it, we can apply a transformation to the observation.

First, let's look at another common issue: saw-tooth patterns in time-series outputs. Make a quick plot of the future time series outputs:

In [None]:
def plot_tseries_ensembles(pt_oe, onames=["hds","sfr"],noise_oe=None):
    pst.try_parse_name_metadata()
    # get the observation data from the control file and select 
    obs = pst.observation_data.copy()
    # onames provided in oname argument
    obs = obs.loc[obs.oname.apply(lambda x: x in onames)]
    # only non-zero observations
    obs = obs.loc[obs.obgnme.apply(lambda x: x in pst.nnz_obs_groups),:]
    # make a plot
    ogs = obs.obgnme.unique()
    fig,axes = plt.subplots(len(ogs),1,figsize=(7,2*len(ogs)))
    ogs.sort()
    # for each observation group (i.e. timeseries)
    for ax,og in zip(axes,ogs):
        # get values for x axis
        oobs = obs.loc[obs.obgnme==og,:].copy()
        oobs.loc[:,"time"] = oobs.time.astype(float)
        oobs.sort_values(by="time",inplace=True)
        tvals = oobs.time.values
        onames = oobs.obsnme.values
        ax.plot(oobs.time,oobs.obsval,color="fuchsia",lw=2,zorder=99)
        # plot prior
        #[ax.plot(tvals,pr_oe.loc[i,onames].values,"0.5",lw=0.5,alpha=0.5) for i in pr_oe.index]
        # plot posterior
        [ax.plot(tvals,pt_oe.loc[i,onames].values,"b",lw=0.5,alpha=0.5) for i in pt_oe.index]
        # plot measured+noise 
        oobs = oobs.loc[oobs.weight>0,:]
        tvals = oobs.time.values
        onames = oobs.obsnme.values
        if noise_oe is not None:
            [ax.plot(tvals,noise_oe.loc[i,onames].values,"r",lw=0.5,alpha=0.5,zorder=0) for i in noise_oe.index]
        #[ax.plot(tvals,noise.loc[i,onames].values,"r",lw=0.5,alpha=0.5,zorder=0) for i in noise.index]
        ax.plot(oobs.time,oobs.obsval,"r-o",lw=2,zorder=100)
        ax.set_title(og,loc="left")
    fig.tight_layout()
    return fig

The following cell plots the posterior of three forecast time-series. As you can see, DSI-generated outputs are nice and smooth in the historical period. However, the same cannot be said for the forecast period, where a "saw-toothed" behaviour can be seen, with the time-series jumping up and down. It is especially evident in the early future for site `trgw-0-3-8`. This may also be somewhat mitigated by using normal-score transformation...

In [None]:
pt_oe_dsi = pst.ies.get("obsen",iteration)
fig = plot_tseries_ensembles(pt_oe_dsi, onames=["hds","sfr"])

## Structured noise

As you saw, overfitting is too easy. As we discussed in the `ies_4_noise` notebook, adding structure to the noise ensemble can be helpful. Let's do that.
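
Before doing this with pyEMU objects, here is a plain NumPy sketch of the difference between independent and structured noise. The observation values and standard deviations are hypothetical. In the structured case a single standard normal draw per realization shifts the whole series by the same number of standard deviations.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical measured values and noise standard deviations for one time series.
obsval = np.array([34.0, 34.5, 35.0, 34.8])
stdev = np.array([0.1, 0.1, 0.2, 0.2])
num_reals = 50

# Independent noise: a separate standard normal draw for every observation.
independent = obsval + rng.normal(size=(num_reals, obsval.size)) * stdev

# Structured noise: one draw per realization, shared by all observations, so
# each realization is the whole series shifted by the same number of stdevs.
draws = rng.normal(size=(num_reals, 1))
structured = obsval + draws * stdev

# In standardized units every observation of a structured realization is equal.
z = (structured - obsval) / stdev
```

Because each structured realization is a coherent shift of the measured series, conditioning cannot chase independent wiggles at every observation.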

In [None]:
#generate standard normal deviates
num_reals = pst.ies.paren0.shape[0]
np.random.seed(pyemu.en.SEED)
draws = np.random.normal(0,1,num_reals)
#_ = plt.hist(draws)

obs = pst.observation_data
onames = obs.loc[(obs.weight>0) & (obs.oname.apply(lambda x: x in ["hds","sfr"])),"obsnme"]
# first generate a standard noise ensemble 
newnoise = pyemu.ObservationEnsemble.from_gaussian_draw(pst=pst,cov=pyemu.Cov.from_observation_data(pst),num_reals=num_reals)
newnoise.index = pst.ies.paren0.index
ovals =  obs.loc[onames,"obsval"].values
stdevs = obs.loc[onames,"standard_deviation"].values
for real,draw in zip(newnoise.index,draws):
    # index by realization label so rows match the ensemble index set above
    newnoise.loc[real,onames] = ovals + (draw * stdevs)
if "base" in newnoise.index:
    newnoise.loc["base",:] = obs.loc[newnoise.columns,"obsval"]
newnoise.head()

In [None]:
newnoise.to_csv(os.path.join(t_d,"extreme_noise.csv"))
pst.pestpp_options["ies_observation_ensemble"] = "extreme_noise.csv"
pst.write(os.path.join(t_d,"dsi.pst"),version=2)


# the master dir
m_d = "master_dsi_strctnoise"
pvals = pd.read_csv(os.path.join(t_d, "dsi_pars.csv"), index_col=0)

pyemu.os_utils.start_workers(
    t_d,"pestpp-ies","dsi.pst", num_workers=num_workers,
    worker_root=".", master_dir=m_d, 
    ppw_function=pyemu.helpers.dsi_pyworker,
    ppw_kwargs={
        "dsi": dsi, "pvals": pvals,
    }
)

In [None]:
pst = pyemu.Pst(os.path.join(m_d,"dsi.pst"))

phidf = pst.ies.phiactual

fig,ax=plt.subplots(1,1,figsize=(4,4))
ax.plot(phidf.index,phidf['mean'],"bo-", label='dsi')

ax.set_yscale('log')
ax.set_ylabel('phi')
ax.set_xlabel('iteration')

ax.text(0.7,0.9,f"nnz_obs: {pst.nnz_obs}\nphi_dsi: {phidf['mean'].iloc[-1]:.2f}",
        transform=ax.transAxes,ha="right",va="top")

ax.legend();

Not much difference in the prediction:

In [None]:
iteration=1
plot_dsi_hist(m_d,pst,iteration=iteration)

## Feature engineering and Data transformation

Ideally, DSI training data should be multi-Gaussian (jointly Gaussian). We cannot ensure that requirement, but we can try to ensure that at least each feature (observation) is approximately Gaussian. We accomplish this by transforming the data and calculating the covariance between the transformed observations. Transformed observation values will have a distribution which is more similar to Gaussian, thus better respecting the assumptions on which DSI relies. Then, during the DSI forward run, emulator-derived outputs must be back-transformed into the original data space, so as to allow for direct comparison to measured values.
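
The transform and back-transform round trip can be sketched as follows for a rank-based normal-score transform. This is an illustrative implementation using linear interpolation, not the one in pyEMU; all function names are hypothetical and the training values are synthetic.

```python
import numpy as np
from statistics import NormalDist

def fit_normal_score(values):
    """Return sorted training values and their standard-normal scores."""
    v = np.sort(np.asarray(values, dtype=float))
    # plotting positions strictly inside (0, 1) keep the scores finite
    p = (np.arange(v.size) + 0.5) / v.size
    z = np.array([NormalDist().inv_cdf(pi) for pi in p])
    return v, z

def forward(x, v, z):
    """Original space -> normal-score space (linear interpolation)."""
    return np.interp(x, v, z)

def backward(zq, v, z):
    """Normal-score space -> original space; clamps outside the training range."""
    return np.interp(zq, z, v)

rng = np.random.default_rng(1)
train = rng.lognormal(mean=2.0, sigma=1.0, size=500)  # skewed, strictly positive
v, z = fit_normal_score(train)

scores = forward(train, v, z)
recovered = backward(scores, v, z)
extreme = backward(np.array([-10.0, 10.0]), v, z)
```

Note that in this sketch the back-transform clamps to the smallest and largest training values, so the emulated outputs cannot leave the range of the training data. The mapping is also only as smooth as the sample allows, which matters for sparse tails.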

In the `pyemu` python implementation this comes at a slight cost of increased run time, both in setting up the emulator and in the forward run. The latter may cost a fraction of a second more. If you are transforming many observations, this can add up. Sorry about that.


In [None]:
data.shape

Let's take a look at the statistical distributions of some of the observations:

In [None]:
data.loc[:,predictions].hist(figsize=(4,4))
plt.tight_layout()

`part_time` is a clear candidate for transformation in the predictions, with a distinct log-normal distribution. We should probably apply a log-transformation to this feature.
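
A quick synthetic illustration of why the log-transform helps for a strictly positive, skewed quantity. The "travel times" below are made up; the point is only that a Gaussian fitted in the original space can produce negative samples, whereas a Gaussian fitted in log10 space and back-transformed cannot.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical travel-time ensemble: skewed and strictly positive.
times = rng.lognormal(mean=8.0, sigma=1.2, size=300)

# Gaussian fitted in the original space: samples can go negative.
raw = rng.normal(times.mean(), times.std(), size=5000)

# Gaussian fitted in log10 space, then back-transformed: always positive.
logt = np.log10(times)
back = 10.0 ** rng.normal(logt.mean(), logt.std(), size=5000)

n_negative_raw = int((raw < 0).sum())
n_negative_back = int((back < 0).sum())
```

The back-transformed samples also recover the skew of the original data, which a symmetric Gaussian in the original space cannot represent.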

Let's take a look at some non-zero obs:

In [None]:
data.loc[:,obs.loc[obs.weight>0,"obsnme"].tolist()].hist(figsize=(10,10))
plt.tight_layout()

A bit more challenging here. Most of the `hds` observations are pretty close to normal. They might benefit somewhat from normal-score transformation, but generally they are already pretty close. 

More of a challenge are the `gage-1` observations. At first glance it may appear that they too are close to log-normal. Unfortunately, that is not quite the case. You see, these observations are "bounded" at zero. Flow through the gage can never be <0, so we end up with a whole bunch of realizations with a value of "zero". (See figure below.) Transforming these types of distributions is challenging. Doing so requires some dirty tricks in the background, which sometimes hinder DSI performance. 

In [None]:
vals = data.loc[:,obs.loc[obs.weight>0,"obsnme"].tolist()[-1]].values
plt.figure(figsize=(3,3))
plt.hist(np.log10(vals+vals.min()+.1), bins=30)
plt.tight_layout()


An alternative approach is to throw out realizations that duplicate values at the bounds...but this means losing training data. Sub-optimal, but if you can afford to generate more training data, it can be worth this type of sample rejection to create a cleaner training dataset (or, preferably build this conceptual knowledge into the prior).

So we know something: we have measured values for `gage-1`. We know these do not reach zero; in fact they don't seem to go below 1000. And looking at the prior distribution, it appears that we have a bunch of "outliers" with flows below 100 (2 on the log10 scale). So we can perhaps throw out realizations where flows during the historical period are below 100 as a form of "prior conditioning".

In [None]:
obsnmes = obs.loc[(obs.weight>0) & (obs.usecol=="gage-1"),"obsnme"].tolist()
keep_reals = data.loc[~data.loc[:,obsnmes].lt(100).any(axis=1)].index.tolist()
print("keep reals:", len(keep_reals))
o = obs.loc[obs.weight>0,"obsnme"].tolist()[-1]

plt.figure(figsize=(6,3))

vals = data.loc[:,o].values
logvals = np.log10(vals+vals.min()+1e-6)
bins = np.linspace(logvals.min(),logvals.max(),50)
plt.hist(logvals, bins=bins,label='full prior',color='0.5')
vals = data.loc[keep_reals,o].values
logvals = np.log10(vals+vals.min()+1e-6)
plt.hist(logvals, bins=bins,label='reduced prior',color='b',alpha=0.3)
plt.vlines(np.log10(obs.loc[o].obsval+vals.min()+1e-6), 0, 30, color='r',label="measured")
plt.legend(fontsize=8,loc='upper left')
plt.tight_layout()

Nice! That looks a lot better. Pretty close to normal as it is. 

In [None]:
data.loc[keep_reals,obs.loc[obs.weight>0,"obsnme"].tolist()].hist(figsize=(10,10))
plt.tight_layout()

In [None]:
sim = flopy.mf6.MFSimulation.load(sim_ws=ies_d,verbosity_level=0)
gwf = sim.get_model()
top = gwf.dis.top.get_data()


usecols = obs.loc[obs.oname=='hds'].usecol.unique()
cells = np.array([[int(i.split("-")[-2]),int(i.split("-")[-1])] for i in usecols])
drop_rows=[]
for usecol in usecols:
    i,j = int(usecol.split("-")[-2]),int(usecol.split("-")[-1])
    elev = top.max()#top[i,j]
    obsnmes = obs.loc[(obs.usecol==usecol) & (obs.weight>0),"obsnme"].values
    if len(obsnmes) == 0:
        continue
    # drop rows in data where value > elev
    drop_rows.extend(data.loc[data.loc[:,obsnmes].gt(elev).any(axis=1)].index.tolist())

len(list(set(drop_rows)))
len(drop_rows)

In [None]:
data_redux = data.loc[keep_reals,:].copy()
data_redux.drop(index=[i for i in drop_rows if i in data_redux.index], inplace=True)

In [None]:
obs = pst_freyberg.observation_data
logcols =  obs.loc[(obs.usecol=="gage-1") & (obs.weight>0)].obsnme.tolist()


from sklearn.preprocessing import PowerTransformer
transforms = [
    {
        "type": "log10",
        "columns": ["part_time"],# + logcols,
    },
    #{
    #    "type": "sklearn",
    #    "columns": ["part_time"],
    #    "estimator": PowerTransformer(),
    #    "init_kwargs": {"method": "yeo-johnson","standardize": True}
    #},
    #{
    #    "type": "normal_score",
    #    "columns": obs.loc[obs.oname=="hds","obsnme"].tolist(),
    #    
    #},
    #{
    #    "type": "normal_score",
    #    "columns": obs.loc[obs.usecol=="tailwater"].obsnme.tolist(),#obs.loc[obs.oname=="sfr","obsnme"].tolist(),
    #    
    #},
    #{
    #    "type": "normal_score",
    #    "columns": obs.loc[obs.oname=="sfr","obsnme"].tolist(),
    #    
    #},
    {
        "type": "normal_score",
    }
]



dsi = DSI(pst=pst_freyberg,
          data=data.copy(), #if you want to see the impact of removing outliers, replace with: data_redux.copy()
          transforms=transforms,
          energy_threshold=1.,
          verbose=True)
dsi.fit();


t_d = "dsi_template"
dsi.prepare_pestpp(t_d=t_d)


# from pst_template copy exe files
found = False
for f in os.listdir(os.path.join(ies_d)):
    if f.startswith("pestpp-ies"):
        shutil.copy2(os.path.join(ies_d,f),os.path.join(t_d,f))
        found = True
if not found:
    raise Exception("couldn't find pestpp-ies binary in {0}".format(ies_d))

# load the control file
pst = pyemu.Pst(os.path.join(t_d,"dsi.pst"))

# lets specify the number of realizations to use; as usual this should be as many as you can afford
pst.pestpp_options["ies_num_reals"] = num_reals
pst.pestpp_options["ies_drop_conflicts"] = True

#noise.to_csv(os.path.join(t_d,"noise.csv"))
#pst.pestpp_options["ies_obs_en"] = "noise.csv"
newnoise.to_csv(os.path.join(t_d,"extreme_noise.csv"))
pst.pestpp_options["ies_observation_ensemble"] = "extreme_noise.csv"

# set noptmax 
pst.control_data.noptmax = 5

# and re-write
pst.write(os.path.join(t_d,"dsi.pst"),version=2)

# the master dir
m_d = "master_dsi_nst"

pvals = pd.read_csv(os.path.join(t_d, "dsi_pars.csv"), index_col=0)


pyemu.os_utils.start_workers(
    t_d,"pestpp-ies","dsi.pst", num_workers=num_workers,
    worker_root=".", master_dir=m_d, #port=_get_port(),
    ppw_function=pyemu.helpers.dsi_pyworker,
    ppw_kwargs={
        "dsi": dsi, "pvals": pvals,
    }
)

In [None]:
pst = pyemu.Pst(os.path.join(m_d,"dsi.pst"))
phidf = pst.ies.phiactual

fig,ax=plt.subplots(1,1,figsize=(4,4))
ax.plot(phidf.index,phidf['mean'],"bo-", label='dsi')

ax.set_yscale('log')
ax.set_ylabel('phi')
ax.set_xlabel('iteration')

ax.text(0.7,0.9,f"nnz_obs: {pst.nnz_obs}\nphi_dsi: {phidf['mean'].iloc[-1]:.2f}",
        transform=ax.transAxes,ha="right",va="top")
ax.legend();

Again, we achieved near-noise levels of fit without too much effort. We would not want to fit any further...and we might even wish to not fit this well...But let's take the results of the last iteration and see how our predictions did:

In [None]:
plot_dsi_hist(m_d,pst,iteration=5)

Ok, so what has happened? Starting with the good news: `part_time` is behaving nicely now. That log-transform did the job. The same goes for the `hds` prediction; DSI is slaying it! 

The head and tailwater predictions are a bit off, though. We see a left-leaning tail for DSI that did not occur with the full-order model, and that was better represented without the transforms. This is an important detail: normal-score transformation introduces error! This is especially true if the training data do not form a smooth, statistically representative distribution. Beware of observations that have bounded distributions, cases where the sample density does not provide even coverage across the entire range, and cases with long sparse tails. These introduce "noise" into the mapping back and forth between the data's distribution and the normal distribution. As we have seen, in some cases it can be more robust to not transform. Again, using more realizations in the training data should help to obtain a smoother normal-score transformer. Same as it ever was: use as many reals as you can...


Now a quick look at the time series:

In [None]:
pt_oe_dsi = pst.ies.get("obsen",5)
noise_oe = pst.ies.noise.copy()
fig = plot_tseries_ensembles(pt_oe_dsi, onames=["hds","sfr"], noise_oe=noise_oe)

Behaving a bit more "realistically". Still some weird responses in the head time series. Try running the last DSI history matching with `data_redux` as training data. You should see much more "realistic" time series behaviour. This suggests that, statistically, the measured data are insufficient to condition those forecasts. However, in the full-order model, physics does the job of translating that information to the forecast.

# Final remarks

Data space inversion provides a powerful tool for data assimilation and uncertainty quantification for cases in which model complexity and/or computational cost preclude the use of traditional methods. 

 - use as many reals as possible when generating training data. The more the better.
 - the same applies for DSI conditioning with IES. The more reals the better.
 - make sure the DSI-generated prior reflects the physical-model-based prior. This may require a number of DSI realizations equal to or greater than the number of realizations in the training data, especially if using normal-score transformation.
 - if forecasts are leaning towards the ends (or beyond!) of the prior distribution, this is a red flag. It is probably worth revising the physics-based model prior parameter distributions and re-training the emulator.
 - don't overfit...#duh
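
One simple way to act on the "DSI prior should reflect the physical prior" remark above is to compare a few quantiles of the two ensembles for each output of interest. The sketch below uses synthetic numbers and a hypothetical helper, `quantile_mismatch`; which quantiles to compare and what counts as acceptable are choices you would need to make for your own problem.

```python
import numpy as np

def quantile_mismatch(train, emulated, probs=(0.05, 0.5, 0.95)):
    """Largest absolute quantile difference, scaled by the training spread."""
    q_train = np.quantile(train, probs)
    q_emul = np.quantile(emulated, probs)
    return float(np.max(np.abs(q_train - q_emul)) / np.std(train))

rng = np.random.default_rng(3)
train = rng.normal(50.0, 5.0, size=2000)     # stand-in for model outputs
good = rng.normal(50.0, 5.0, size=2000)      # emulator prior that matches
shifted = rng.normal(60.0, 5.0, size=2000)   # emulator prior that is biased

mismatch_good = quantile_mismatch(train, good)
mismatch_shifted = quantile_mismatch(train, shifted)
```

A small mismatch means the emulator prior covers roughly the same range as the training ensemble; a large one is a prompt to add realizations or revisit the transforms.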

### Benefits
 - extreme numerical efficiency. We only need a few hundred runs of the physics-based model as we are not using it for parameter adjustment. It is only used to construct/train the statistical model.
 - can use DSI as a verification of IES forecasts: are these too wide and/or not wide enough?
 - allows for parameter distributions of arbitrary complexity in the prior.
 - can be used for analyzing the worth of existing and as-of-yet uncollected data to reduce predictive uncertainty. With no assumption of linearity! (see ensemble dataworth notebook)

### Drawbacks
 - if the relationship between the past and the future is highly non-linear, it will struggle. But then, so will any other method.
 - it is not possible to view the process-based model (e.g., MODFLOW) inputs that produce the DSI forecast posterior distribution, so you can't "see" what model inputs might be causing extreme results. However, you can include model parameters as "observations". Although the direct link between parameter values and model outputs may not be quite right, the statistical distribution will be informative.
 - when simulating future scenarios, you have to rerun the training ensemble through the process-based model and also rerun the DSI training process for each scenario. But you would also have to run the ensemble for each scenario with the physics-based model...