# Noise and PESTPP-IES

As discussed previously, formal data assimilation methods (including ensemble methods such as those encoded in PESTPP-IES) explicitly recognize and make use of expected noise in the observations being assimilated.  While it may be tempting to think of this noise as simply "measurement noise", it's actually much more complicated than that.  This noise should account for any expected deficiencies in the ability of the model to simulate the scales and processes that "generated" the observations.  That is, this noise should (attempt to) account for model error - yikes!  Well, what do we know about model error and how it manifests during the history matching process?  Previous theoretical and empirical works have shown that this "noise" is likely to be highly correlated in space and in time, and that this correlation probably can't be described by a simple (auto)correlation function, which sucks.  But just because we can't get something perfect doesn't mean we can't (greatly) improve upon it.
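
To make "correlated in time" concrete, here is a minimal NumPy sketch that draws noise realizations with and without temporal correlation. It assumes an exponential autocorrelation function, which is exactly the kind of simple description that is probably inadequate for real model error; the values of `sigma` and `corr_len` are made up for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_times, n_reals = 25, 2000
times = np.arange(n_times, dtype=float)
sigma = 0.5       # assumed noise standard deviation
corr_len = 10.0   # assumed correlation length, in time steps

# exponential autocorrelation: corr(t_i, t_j) = exp(-|t_i - t_j| / corr_len)
lag = np.abs(times[:, None] - times[None, :])
cov_corr = sigma**2 * np.exp(-lag / corr_len)
cov_iid = sigma**2 * np.eye(n_times)

# each row is one noise realization for the whole timeseries
corr_noise = rng.multivariate_normal(np.zeros(n_times), cov_corr, size=n_reals)
iid_noise = rng.multivariate_normal(np.zeros(n_times), cov_iid, size=n_reals)

def neighbor_corr(noise):
    """Correlation between the first two time steps, across realizations."""
    return np.corrcoef(noise[:, 0], noise[:, 1])[0, 1]

print("independent noise:", round(neighbor_corr(iid_noise), 2))
print("correlated noise: ", round(neighbor_corr(corr_noise), 2))
```

With independent noise, neighbouring times are essentially uncorrelated; with the assumed covariance, they move together.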

Let's take a deeper dive into the realm of noise...

## The Current Tutorial

In the current notebook we are going to pick up after the ["ies_1_basics"](../part2_06_ies/freyberg_ies_1_basics.ipynb) tutorial. There, we set up PESTPP-IES and ran it. We found that we can achieve great fits with historical data...but that (for some forecasts) the calculated posterior probabilities failed to cover the truth.

In this tutorial we are going to take a first stab at fixing that. Rather than accepting the default noise realizations, we are going to revisit how observation noise is represented and supply PESTPP-IES with noise realizations that are strongly correlated across the timeseries observations.

### Admin

The next couple of cells load necessary dependencies and call a convenience function to prepare the PEST dataset folder for you. Simply press `shift+enter` to run the cells.

In [None]:
import os
import shutil
import warnings
# silence all warnings (this includes DeprecationWarning) to keep the notebook output tidy
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import psutil

import sys
import pyemu
import flopy
assert "dependencies" in flopy.__file__
assert "dependencies" in pyemu.__file__
sys.path.insert(0,"..")
import herebedragons as hbd



Prepare the template directory:

In [None]:
# specify the temporary working folder
t_d = os.path.join('freyberg6_template_extreme')

org_t_d = os.path.join("master_ies_1")
if not os.path.exists(org_t_d):
    raise Exception("you need to run the '/freyberg_ies_1_basics.ipynb' notebook")

if os.path.exists(t_d):
    shutil.rmtree(t_d)
shutil.copytree(org_t_d,t_d)

## Looking at the previous run

First we need to load the existing control file and results from the basic PESTPP-IES run we did before:

In [None]:
pst_path = os.path.join(org_t_d, 'pest.pst')
pst = pyemu.Pst(pst_path)

Let's plot up the obs vs sim, including the noise realizations that PESTPP-IES generated for us using the `standard_deviation` column in the `* observation data` section.  Here are our plotting functions, for forecast histograms, hydraulic conductivity fields, and timeseries:

In [None]:
def plot_forecast_hist_compare(pr_oe,pt_oe, last_pt_oe=None,last_prior=None ):
        num_plots = len(pst.forecast_names)
        num_cols = 1
        if last_pt_oe is not None:
            num_cols=2
        fig,axes = plt.subplots(num_plots, num_cols, figsize=(5*num_cols,num_plots * 2.5), sharex='row',sharey='row')
        for axs,forecast in zip(axes, pst.forecast_names):
            # plot first column with current outcomes
            if num_cols==1:
                axs=[axs]
            ax = axs[0]
            # just for aesthetics
            bin_cols = [pt_oe.loc[:,forecast], pr_oe.loc[:,forecast],]
            if num_cols>1:
                bin_cols.extend([last_pt_oe.loc[:,forecast],last_prior.loc[:,forecast]])
            bins=np.histogram(pd.concat(bin_cols),
                                         bins=20)[1] #get the bin edges
            pr_oe.loc[:,forecast].hist(facecolor="0.5",alpha=0.5, bins=bins, ax=ax,density=True)
            pt_oe.loc[:,forecast].hist(facecolor="b",alpha=0.5, bins=bins, ax=ax,density=True)
            ax.set_title(forecast)
            fval = pst.observation_data.loc[forecast,"obsval"]
            ax.plot([fval,fval],ax.get_ylim(),"r-")
            ax.set_yticks([])
            # plot second column with other outcomes
            if num_cols >1:
                ax = axs[1]
                last_prior.loc[:,forecast].hist(facecolor="0.5",alpha=0.5, bins=bins, ax=ax,density=True)
                last_pt_oe.loc[:,forecast].hist(facecolor="b",alpha=0.5, bins=bins, ax=ax,density=True)
                ax.set_title(forecast)
                fval = pst.observation_data.loc[forecast,"obsval"]
                ax.plot([fval,fval],ax.get_ylim(),"r-")
                ax.set_yticks([])
                
        # set ax column titles
        if num_cols >1:
            axes.flatten()[0].text(0.5,1.2,"Current Attempt", transform=axes.flatten()[0].transAxes, weight='bold', fontsize=12, horizontalalignment='center')
            axes.flatten()[1].text(0.5,1.2,"Previous Attempt", transform=axes.flatten()[1].transAxes, weight='bold', fontsize=12, horizontalalignment='center')
        fig.tight_layout()
        

In [None]:
sim = flopy.mf6.MFSimulation.load(sim_ws=org_t_d)
ib = sim.get_model().dis.idomain.array[0,:,:]
def plot_hk(pr_oe, pt_oe, pst):
    obs = pst.observation_data
    hkobs = obs.loc[obs.oname=="hk",:].copy()
    hkobs["i"] = hkobs.i.astype(int)
    hkobs["j"] = hkobs.j.astype(int)
    real = pt_oe.index[0]
    
    fig,axes = plt.subplots(2,4,figsize=(15,10))
    prmn,prstd,prmev,prreal = np.zeros_like(ib,dtype=float),np.zeros_like(ib,dtype=float),np.zeros_like(ib,dtype=float),np.zeros_like(ib,dtype=float)
    prmn[hkobs.i,hkobs.j] = pr_oe.loc[:,hkobs.obsnme].mean()
    prstd[hkobs.i,hkobs.j] = pr_oe.loc[:,hkobs.obsnme].std()
    prmev[hkobs.i,hkobs.j] = pr_oe.loc["base",hkobs.obsnme].values   
    prreal[hkobs.i,hkobs.j] = pr_oe.loc[real,hkobs.obsnme].values   
    ptmn,ptstd,ptmev,ptreal = np.zeros_like(ib,dtype=float),np.zeros_like(ib,dtype=float),np.zeros_like(ib,dtype=float),np.zeros_like(ib,dtype=float)
    ptmn[hkobs.i,hkobs.j] = pt_oe.loc[:,hkobs.obsnme].mean()
    ptstd[hkobs.i,hkobs.j] = pt_oe.loc[:,hkobs.obsnme].std()
    ptmev[hkobs.i,hkobs.j] = pt_oe.loc["base",hkobs.obsnme].values  
    ptreal[hkobs.i,hkobs.j] = pt_oe.loc[real,hkobs.obsnme].values   
    for arr in [prmn,prstd,prmev,prreal,ptmn,ptstd,ptmev,ptreal]:
        #arr = np.log10(arr)
        arr[ib>0] = np.log10(arr[ib>0])
        arr[ib==0] = np.nan
    prarrs = [prmn,prstd,prmev,prreal]
    ptarrs = [ptmn,ptstd,ptmev,ptreal]
    titles = ["mean","stdev","MEV","realization {0}".format(real)]
    
    # loop over the columns of the axes grid; a distinct name keeps `axes` intact for the return
    for pr,pt,col_axes,title in zip(prarrs,ptarrs,axes.transpose(),titles):
        vmn,vmx = min(np.nanmin(pr),np.nanmin(pt)),max(np.nanmax(pr),np.nanmax(pt))
        cb = col_axes[0].imshow(pr,vmin=vmn,vmax=vmx)
        plt.colorbar(cb,ax=col_axes[0],label="$log_{10}\\frac{m}{d}$")
        col_axes[0].set_title("prior "+title)
        cb = col_axes[1].imshow(pt,vmin=vmn,vmax=vmx)
        plt.colorbar(cb,ax=col_axes[1],label="$log_{10}\\frac{m}{d}$")
        col_axes[1].set_title("posterior "+title)
    plt.tight_layout()
    return fig,axes

In [None]:
def plot_tseries_ensembles(pr_oe, pt_oe, noise, onames=["hds","sfr"]):
    pst.try_parse_name_metadata()
    # get the observation data from the control file and select 
    obs = pst.observation_data.copy()
    # onames provided in oname argument
    obs = obs.loc[obs.oname.apply(lambda x: x in onames)]
    # only non-zero observations
    obs = obs.loc[obs.obgnme.apply(lambda x: x in pst.nnz_obs_groups),:]
    # make a plot
    ogs = obs.obgnme.unique()
    fig,axes = plt.subplots(len(ogs),1,figsize=(10,2*len(ogs)))
    ogs.sort()
    # for each observation group (i.e. timeseries)
    for ax,og in zip(axes,ogs):
        # get values for x axis
        oobs = obs.loc[obs.obgnme==og,:].copy()
        oobs.loc[:,"time"] = oobs.time.astype(float)
        oobs.sort_values(by="time",inplace=True)
        tvals = oobs.time.values
        onames = oobs.obsnme.values
        ylim = None
        if pt_oe is not None:
        # plot posterior
            [ax.plot(tvals,pt_oe.loc[i,onames].values,"b",lw=0.5,alpha=0.5) for i in pt_oe.index]
        # plot measured+noise 
        if noise is not None:
            oobs = oobs.loc[oobs.weight>0,:]
            tvals = oobs.time.values
            onames = oobs.obsnme.values
            [ax.plot(tvals,noise.loc[i,onames].values,"r",lw=0.5,alpha=0.5) for i in noise.index]
            ax.scatter(oobs.time,oobs.obsval,marker='^',s=20,c="r",zorder=10)
            ylim = ax.get_ylim()
        if pr_oe is not None:
            # plot prior
            [ax.plot(tvals,pr_oe.loc[i,onames].values,"0.5",lw=0.5,alpha=0.5,zorder=1) for i in pr_oe.index]
        if ylim is not None:
            ax.set_ylim(ylim)
        ax.set_title(og,loc="left")
    _ = fig.tight_layout()
    

In [None]:
org_pr_oe,org_pt_oe = pst.ies.obsen0,pst.ies.get("obsen",pst.ies.phiactual.iteration.max())
org_noise = pst.ies.noise

In [None]:

_ = plot_forecast_hist_compare(org_pr_oe,org_pt_oe)

In [None]:
plot_tseries_ensembles(org_pr_oe,org_pt_oe,org_noise)

In [None]:
_ = plot_hk(org_pr_oe,org_pt_oe, pst)

Notice how the posterior is narrower than the noise - that's not good and it means that PESTPP-IES was not able to accommodate the noise during the assimilation process.
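
One way to put a number on "narrower than the noise" is to compare standard deviations observation by observation. The sketch below uses a hypothetical helper, `spread_ratio`, and synthetic pandas DataFrames standing in for the posterior and noise ensembles; it is not part of pyemu.

```python
import numpy as np
import pandas as pd

def spread_ratio(pt_oe, noise, obs_names):
    """Per-observation ratio of posterior standard deviation to noise standard deviation."""
    pt_std = pt_oe.loc[:, obs_names].std()
    nz_std = noise.loc[:, obs_names].std()
    return pt_std / nz_std

# synthetic stand-ins: rows are realizations, columns are observations
rng = np.random.default_rng(1)
names = ["obs_{0}".format(i) for i in range(5)]
toy_noise = pd.DataFrame(rng.normal(0.0, 1.0, (200, 5)), columns=names)
toy_post = pd.DataFrame(rng.normal(0.0, 0.2, (200, 5)), columns=names)

ratio = spread_ratio(toy_post, toy_noise, names)
print(ratio.round(2))
```

Ratios well below one mean the posterior is much tighter than the noise. In this notebook the same function could, in principle, be called with `org_pt_oe` and `org_noise` and the names of the non-zero weighted observations, assuming both objects behave like DataFrames with observation names as columns.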

Let's just see the noise:

In [None]:
plot_tseries_ensembles(None,None,org_noise)

Let's think about this for a minute.  With our representation of noise, do we care about "high frequency" temporal variation/randomness?  That's what is being represented here:  each point in each timeseries is independent of all other points. Maybe this is fine if we are trying to capture "measurement noise" only, but as we stated before, noise also needs to (try to) capture the effects of model error.  Let's also think about what the model can simulate:  looking back at the timeseries from posterior realizations, they have very few high-frequency components, so why are we trying to force the model to reproduce high-frequency noise?
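
A rough diagnostic for "high frequency" content is the lag-1 autocorrelation of a timeseries: close to one for something smooth, close to zero for independent noise. The sketch below uses a sine wave as a stand-in for a smooth simulated timeseries; both series are synthetic and `lag1_autocorr` is a hypothetical helper.

```python
import numpy as np

def lag1_autocorr(series):
    """Correlation between a series and itself shifted by one time step."""
    series = np.asarray(series, dtype=float)
    return np.corrcoef(series[:-1], series[1:])[0, 1]

rng = np.random.default_rng(2)
t = np.linspace(0.0, 2.0 * np.pi, 200)
smooth = np.sin(t)                   # stand-in for a smooth simulated timeseries
iid = rng.normal(0.0, 1.0, t.size)   # stand-in for one independent noise realization

print("smooth series:    ", round(lag1_autocorr(smooth), 3))
print("independent noise:", round(lag1_autocorr(iid), 3))
```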

So how do we generate "noise" that is a) something the model can reproduce and b) represents not just measurement error, but all those other sources of error, including model error? There is not a single answer here other than "it depends", in that how you formulate and use noise will be very problem specific.  That being said, one interesting and relatively simple trick is to shift all the timeseries together by a constant amount - this is analogous to a "constant" multiplier parameter.
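
Here is the shift idea in isolation, as a minimal NumPy sketch. The observed values and standard deviations are invented for illustration; the point is only that a single deviate per realization gives a constant offset within that realization, and therefore perfectly correlated noise between observations.

```python
import numpy as np

rng = np.random.default_rng(3)
obsvals = np.array([34.2, 34.5, 34.1, 33.9])   # hypothetical observed heads
stdevs = np.full(obsvals.size, 0.1)            # hypothetical noise standard deviations
draws = rng.normal(0.0, 1.0, 50)               # one standard normal deviate per realization

# each row is one realization: every observation is shifted using the same deviate
shifted = obsvals[None, :] + draws[:, None] * stdevs[None, :]

# the offset from the observed values is constant within a realization ...
offsets = shifted - obsvals[None, :]
print(np.allclose(offsets, offsets[:, [0]]))

# ... so the noise is perfectly correlated between observations
corr = np.corrcoef(shifted, rowvar=False)
print(np.allclose(corr, 1.0))
```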

To build this kind of noise for the tutorial model, first let's generate some standard normal deviates (random numbers from a normal distribution with mean of zero and standard deviation of one):

In [None]:
#generate standard normal deviates
num_reals = pst.ies.paren0.shape[0]
np.random.seed(pyemu.en.SEED)
draws = np.random.normal(0,1,num_reals)
_ = plt.hist(draws)


Now let's find the timeseries observations we want to apply these offsets to:

In [None]:
obs = pst.observation_data
onames = obs.loc[(obs.weight>0) & (obs.oname.apply(lambda x: x in ["hds","sfr"])),"obsnme"]

So what we are gonna do is first generate an observation noise ensemble using the `standard_deviation` column in the observation data - this will generate noise for all the non-zero weighted observations (including the difference obs).  Then we will replace the noise realizations for the timeseries observations such that each realization is the observation value plus one of the standard normal deviates we just generated, scaled by the observation's standard deviation.  The same deviate is applied to all timeseries obs in a realization at once, yielding extremely correlated observation noise.

In [None]:
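# first generate a standard noise ensemble 
newnoise = pyemu.ObservationEnsemble.from_gaussian_draw(pst=pst,cov=pyemu.Cov.from_observation_data(pst),num_reals=num_reals)
# use the same realization names as the prior parameter ensemble
newnoise.index = pst.ies.paren0.index
ovals =  obs.loc[onames,"obsval"].values
stdevs = obs.loc[onames,"standard_deviation"].values
# apply one deviate per realization to all timeseries obs, scaled by each obs's standard deviation;
# iterate over the realization labels (not integer positions) so .loc never adds new rows
for real,draw in zip(newnoise.index,draws):
    newnoise.loc[real,onames] = ovals + (draw * stdevs)
# the "base" realization carries the unperturbed observation values
if "base" in newnoise.index:
    newnoise.loc["base",:] = obs.loc[newnoise.columns,"obsval"]
newnoise.to_csv(os.path.join(t_d,"extreme_noise.csv"))
pst.pestpp_options["ies_observation_ensemble"] = "extreme_noise.csv"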

## Run PESTPP-IES

Right then, let's do this!

In [None]:
# set NOPTMAX to -2 for a quick test run and re-write the control file
pst.control_data.noptmax = -2
pst.pestpp_options.pop("ies_bad_phi_sigma",None)
pst.write(os.path.join(t_d, 'pest.pst'),version=2)

In [None]:
pyemu.os_utils.run("pestpp-ies pest.pst",cwd=t_d)


__Attention!__

You must specify the number which is adequate for ***your*** machine! Make sure to assign an appropriate value for the following `num_workers` variable - if it's too large for your machine, #badtimes.

You can check the number of physical cores available on your machine using `psutil`:

In [None]:
psutil.cpu_count(logical=False)

In [None]:
num_workers = 14 #update this according to your resources
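
If you would rather not hard-code the value, something like the following could be used instead. `suggest_num_workers` is a hypothetical helper (not part of pyemu or psutil), and leaving one core free for the manager and the operating system is only a rule-of-thumb assumption.

```python
import os

def suggest_num_workers(reserve=1):
    """Suggest a worker count, leaving `reserve` cores free for the manager and OS."""
    try:
        import psutil
        cores = psutil.cpu_count(logical=False)
    except ImportError:
        cores = None
    if cores is None:
        # psutil is missing or could not determine physical cores; fall back to logical cores
        cores = os.cpu_count() or 1
    return max(1, cores - reserve)

print(suggest_num_workers())
```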

Next, we shall specify the PEST run-manager/master directory folder as `m_d`. This is where outcomes of the PEST run will be recorded. It should be different from the `t_d` folder, which contains the "template" of the PEST dataset. This keeps everything separate and avoids silly mistakes.

In [None]:
m_d = os.path.join('master_ies_1extreme')

The following cell deploys the PEST agents and manager and then starts the run using PESTPP-IES. Run it by pressing `shift+enter`. If you wish to see the outputs in real-time, switch over to the terminal window (the one which you used to launch the `jupyter notebook`). There you should see PESTPP-IES's progress. 

If you open the tutorial folder, you should also see a bunch of new folders there named `worker_0`, `worker_1`, etc. These are the agent folders. The `master_ies_1extreme` folder is where the manager is running. 

This run should take several minutes to complete (depending on the number of workers and the speed of your machine). If you get an error, make sure that your firewall or antivirus software is not blocking PESTPP-IES from communicating with the agents (this is a common problem!).

> **Pro Tip**: Running PEST from within a `jupyter notebook` has a tendency to slow things down and hog a lot of RAM. When modelling in the "real world" it is often more efficient to implement workflows in scripts which you can call from the command line. 

In [None]:
# update NOPTMAX again and re-write the control file
pst.control_data.noptmax = 5
pst.write(os.path.join(t_d, 'pest.pst'),version=2)

In [None]:
pyemu.os_utils.start_workers(t_d, # the folder which contains the "template" PEST dataset
                            'pestpp-ies', #the PEST software version we want to run
                            'pest.pst', # the control file to use with PEST
                            num_workers=num_workers, #how many agents to deploy
                            worker_root='.', #where to deploy the agent directories; relative to where python is running
                            master_dir=m_d, #the manager directory
                            )

## Explore the Outcomes

Right then. PESTPP-IES completed successfully. Let's take a look at some of the outcomes.

In [None]:
pst = pyemu.Pst(os.path.join(m_d,"pest.pst"))

In [None]:
fig, axes = plt.subplots(1, 2, sharey=True, figsize=(10,3.5))
# left
ax = axes[0]
phi = pst.ies.phiactual
phi.index = phi.total_runs
phi.iloc[:,6:].apply(np.log10).plot(legend=False,lw=0.5,color='k', ax=ax)
ax.set_title(r'Actual $\Phi$')
ax.set_ylabel(r'log $\Phi$')
ax.grid()
# right
ax = axes[-1]
phimeas = pst.ies.phimeas
phimeas.index = phi.total_runs
phimeas.iloc[:,6:].apply(np.log10).plot(legend=False,lw=0.2,color='r', ax=ax)
ax.set_title(r'Measured+Noise $\Phi$')
ax.grid()
fig.tight_layout()

It's important to notice that now the "measured" phi is behaving differently than the "actual" phi, and in fact, measured phi is substantially lower than the actual phi.  This is an outcome we want to see because it indicates that PESTPP-IES is able to assimilate our representation of noise.
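
To see why the two can differ, here is a toy calculation. It assumes that "measured" phi is computed against the observed values plus each realization's noise, while "actual" phi is computed against the observed values alone; all numbers are synthetic and `phi` here is a plain sum of squared weighted residuals, not PESTPP-IES output.

```python
import numpy as np

def phi(sim, target, weights):
    """Sum of squared weighted residuals for each realization (row)."""
    return (((sim - target) * weights) ** 2).sum(axis=1)

rng = np.random.default_rng(4)
n_reals, n_obs = 100, 30
obsvals = np.linspace(34.0, 35.0, n_obs)   # hypothetical observed values
stdev = 0.1
weights = np.full(n_obs, 1.0 / stdev)

# shift-style noise: one offset per realization, shared by all observations
shifts = rng.normal(0.0, stdev, n_reals)
noisy_obs = obsvals[None, :] + shifts[:, None]

# pretend each posterior realization reproduces its own noisy observations,
# apart from a small leftover misfit
sim = noisy_obs + rng.normal(0.0, 0.01, (n_reals, n_obs))

phi_actual = phi(sim, obsvals[None, :], weights)
phi_meas = phi(sim, noisy_obs, weights)
print("median actual phi:  ", np.median(phi_actual))
print("median measured phi:", np.median(phi_meas))
```

When the realizations can follow their own noise, the misfit against the noisy targets is small even though the misfit against the un-noised observations is not.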

Now let's inspect the observed vs simulated timeseries:

In [None]:
pr_oe = pst.ies.obsen0
pt_oe = pst.ies.get("obsen",pst.ies.phiactual.iteration.max())
noise = pst.ies.noise

In [None]:
fig = plot_tseries_ensembles(None, pt_oe, noise, onames=["hds","sfr"])

Ok!  Now that is a very different pattern from what we saw before.  Pay particular attention to how well the posterior simulated results (in blue) respect the noise realizations (in red).  Previously, we saw that the posterior was (much?) narrower than the noise realizations, indicating that PESTPP-IES was "overfit" with respect to the noise.  That was an outcome of high-frequency, statistically independent noise that could not be simulated/assimilated: it presented "irreducible residuals" to PESTPP-IES, in which case PESTPP-IES simply "shot through the middle".  But now, we have correspondence between the noise and posterior.  Conceptually, these noise realizations have forced PESTPP-IES to find parameter sets that yield simulation results that are consistently higher or lower than the observed values.  Maybe, for your settings, model, and prediction(s), this is what you want. (Also, credit Eduardo de Sousa for this idea!)
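
The "shot through the middle" behaviour can be reproduced with a deliberately crude toy model, one that can only move a whole timeseries up or down. This is an illustrative assumption, not how the Freyberg model or PESTPP-IES works. For such a model, the least-squares fit to each noise realization is just that realization's mean.

```python
import numpy as np

rng = np.random.default_rng(5)
n_reals, n_obs, stdev = 500, 40, 0.1

# independent noise: a different deviate for every observation and realization
iid_noise = rng.normal(0.0, stdev, (n_reals, n_obs))
# shift noise: one deviate per realization, shared by all observations
shift_noise = np.repeat(rng.normal(0.0, stdev, (n_reals, 1)), n_obs, axis=1)

# toy "model": the best constant offset for each realization is the mean of its noise
fit_iid = iid_noise.mean(axis=1)
fit_shift = shift_noise.mean(axis=1)

print("fitted-offset spread, independent noise:", fit_iid.std())
print("fitted-offset spread, shift noise:      ", fit_shift.std())
print("noise standard deviation:               ", stdev)
```

With independent noise the fitted offsets cluster tightly around zero, so the "posterior" is much narrower than the noise; with shift noise the fitted offsets have roughly the same spread as the noise itself.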



In [None]:
_ = plot_hk(pr_oe,pt_oe, pst)

In [None]:
_ = plot_forecast_hist_compare(pr_oe,pt_oe,last_prior=org_pr_oe,last_pt_oe=org_pt_oe)

So we see that just messing with the noise has a nontrivial influence over the important predictive outcomes...kinda scary...

So that is a short exploration of noise and how it is used within ensemble methods.  Just like designing the parameterization, the prior, and the weighting strategy, designing a clever and appropriate noise scheme will be problem specific and require considerable common (or hydro) sense.