# U.S. Geological Survey Class GW3099
Advanced Modeling of Groundwater Flow (GW3099)\
Boise, Idaho\
September 16 - 20, 2024

![title](../../images/ClassLocation.jpg)

# Multi-process models in pywatershed
*(Note that this notebook follows the notebook in the pywatershed repository [examples/01_multi-process_models.ipynb](https://github.com/EC-USGS/pywatershed/blob/develop/examples/01_multi-process_models.ipynb) but it deviates in some of the details covered.)*

In notebook [`step1_processes.ipynb`](step1_processes.ipynb), we looked at how individual Process representations work and are designed. In this notebook we learn how to put multiple `Processes` together into composite models using the `Model` class. 

The starting point for the development of `pywatershed` was the National Hydrologic Model (NHM, Regan et al., 2018) configuration of the Precipitation-Runoff Modeling System (PRMS, Regan et al., 2015). In this notebook, we'll first construct a full NHM configuration. We will again use the spatial domain of the Delaware River Basin. Once we construct the full NHM, we'll look at how we can also construct sub-models of the NHM.

Along the way, we'll get into some of the guts of using pywatershed.

## Prerequisites

In [None]:
import pathlib as pl
import shutil
from copy import deepcopy
from platform import processor
from pprint import pprint
from sys import platform

import hvplot.xarray  # noqa
import jupyter_black
import numpy as np
import pywatershed as pws
import xarray as xr
import yaml
from helpers import do_not_run_this_cell, help_head, read_yaml, write_yaml
from pywatershed.utils import gis_files
from pywatershed.utils.path import dict_pl_to_str
from tqdm.notebook import tqdm

jupyter_black.load()  # auto-format the code in this notebook

pws.utils.gis_files.download()

pkg_root_dir = pws.constants.__pywatershed_root__
repo_root_dir = pkg_root_dir.parent

nb_output_dir = pl.Path("./step2_multi-process_models")

## Domain Plot to get to know the area

Before diving into pywatershed models, let's use one of its built-in tools to get familiar with the application domain. We'll combine the GIS files for the HRUs and the segments in this domain with their parameters to learn more about how the model represents quantities in physical space. Please zoom in and out and select different layers. We aim to add more functionality to this plot over time, so stay tuned.

In [None]:
domain_dir = pkg_root_dir / "data/drb_2yr"

domain_gis_dir = pkg_root_dir / "data/pywatershed_gis/drb_2yr"
shp_file_hru = domain_gis_dir / "HRU_subset.shp"
shp_file_seg = domain_gis_dir / "Segments_subset.shp"

In [None]:
dis_hru = pws.Parameters.from_netcdf(domain_dir / "parameters_dis_hru.nc")
start_lat = dis_hru.parameters["hru_lat"].mean()
start_lon = dis_hru.parameters["hru_lon"].mean()

pws.plot.DomainPlot(
    hru_shp_file=shp_file_hru,
    segment_shp_file=shp_file_seg,
    hru_parameters=domain_dir / "parameters_dis_hru.nc",
    hru_parameter_names=[
        "nhm_id",
        "hru_lat",
        "hru_lon",
        "hru_area",
    ],
    segment_parameters=domain_dir / "parameters_dis_seg.nc",
    segment_parameter_names=[
        "nhm_seg",
        "seg_length",
        "seg_slope",
        "seg_cum_area",
    ],
    start_lat=start_lat,
    start_lon=start_lon,
    start_zoom=7,
)

## An NHM multi-process model for the Delaware River Basin
The 8 conceptual `Process` classes that comprise the NHM are, in order:

In [None]:
nhm_processes = [
    pws.PRMSSolarGeometry,
    pws.PRMSAtmosphere,
    pws.PRMSCanopy,
    pws.PRMSSnow,
    pws.PRMSRunoff,
    pws.PRMSSoilzone,
    pws.PRMSGroundwater,
    pws.PRMSChannel,
]

We'll use this list of classes shortly to construct the NHM.

A multi-process model is assembled by the `Model` class. We can take a quick look at the first 22 lines of help on `Model`:

In [None]:
help_head(pws.Model, n=22)

The `help()` mentions that there are 2 distinct ways of instantiating a `Model` class. In this notebook, we focus on the pywatershed-centric instantiation and leave the PRMS-legacy instantiation for another time. 

With the pywatershed-centric approach, the first argument is a "model dictionary" which does nearly all the work (the other arguments will be their default values). The `help()` describes the model dictionary and provides examples. Please use it for reference and more details. Here we'll give an extended concrete example. The `help()` also describes how a `Model` can be instantiated from a model dictionary contained in a YAML file. First, we'll build a model dictionary in memory. Later, for the submodel, we'll write a model dictionary out as a YAML file and instantiate a model directly from that file.

### Construct the model specification in memory
Because our (pre-existing) parameter files (which come with `pywatershed`) and our `Process` classes are consistently named, we can begin to build the model dictionary quickly.

In [None]:
model_dict = {}

for proc in nhm_processes:
    # this is the class name
    proc_name = proc.__name__
    # the processes can have arbitrary names in the model_dict and
    # an instance should not have capitalized name anyway (according to
    # python convention), so rename from the class name
    proc_rename = "prms_" + proc_name[4:].lower()
    # each process has a dictionary of information
    model_dict[proc_rename] = {}
    # alias to shorten lines below
    proc_dict = model_dict[proc_rename]
    # required key "class" specifies the class
    proc_dict["class"] = proc
    # the "parameters" key provides an instance of Parameters
    proc_param_file = domain_dir / f"parameters_{proc_name}.nc"
    proc_dict["parameters"] = pws.Parameters.from_netcdf(proc_param_file)
    # the "dis" key provides the name of the discretizations
    # which we'll supply shortly to the model dictionary
    if proc_rename == "prms_channel":
        proc_dict["dis"] = "dis_both"
    else:
        proc_dict["dis"] = "dis_hru"

Let's look at what we have so far in the `model_dict`.

In [None]:
pprint(model_dict, sort_dicts=False)

We have given a name to each process and then supplied the class, its parameters, and its discretization for the full set of processes. Now we'll need to add the discretizations to the model dictionary. They are added at the top level and correspond to the `dis` names the processes used.

In [None]:
model_dict = model_dict | {
    "dis_hru": pws.Parameters.from_netcdf(
        domain_dir / "parameters_dis_hru.nc"
    ),
    "dis_both": pws.Parameters.from_netcdf(
        domain_dir / "parameters_dis_both.nc"
    ),
}
pprint(model_dict, sort_dicts=False)

For the time being, `PRMSChannel` needs to know about both HRUs and segments, so `dis_both` is used. We plan to remove this requirement in the near future by implementing "exchanges" between processes into the model dictionary.

You may have noticed that we are missing a `Control` object to provide time information to the processes. We'll create it and we'll also create a list of the order that the processes are executed.

In [None]:
run_dir = nb_output_dir / "run_dir"
control = pws.Control(
    start_time=np.datetime64("1979-01-01T00:00:00"),
    end_time=np.datetime64("1980-12-31T00:00:00"),
    time_step=np.timedelta64(24, "h"),
    options={
        "input_dir": domain_dir,
        "budget_type": "error",
        "netcdf_output_dir": run_dir,
    },
)
model_order = ["prms_" + proc.__name__[4:].lower() for proc in nhm_processes]
model_dict = model_dict | {"control": control, "model_order": model_order}
pprint(model_dict, sort_dicts=False)
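
The structure of the model dictionary described above can be checked mechanically. The following is a minimal sketch that uses plain strings in place of real classes and `Parameters` objects. The helper `check_model_dict` and the toy dictionary are hypothetical and are not part of pywatershed. The sketch assumes that process entries are the dict-valued items carrying a `"class"` key.

```python
def check_model_dict(model_dict):
    """Return a list of problems found in a model-dictionary-like mapping.

    Illustrative only: process entries are assumed to be the dict-valued
    items that carry a "class" key. Everything else is top-level data.
    """
    problems = []
    processes = {
        name: spec
        for name, spec in model_dict.items()
        if isinstance(spec, dict) and "class" in spec
    }
    # every process must point at a discretization defined at the top level
    for name, spec in processes.items():
        dis_name = spec.get("dis")
        if dis_name not in model_dict:
            problems.append(f"{name}: dis '{dis_name}' is not a top-level key")
    # every name in model_order must be one of the processes
    for name in model_dict.get("model_order", []):
        if name not in processes:
            problems.append(f"model_order: '{name}' is not a process")
    return problems


# toy dictionary with two deliberate mistakes
toy_model_dict = {
    "dis_hru": "parameters_dis_hru.nc",
    "prms_soilzone": {"class": "PRMSSoilzone", "dis": "dis_hru"},
    "prms_channel": {"class": "PRMSChannel", "dis": "dis_both"},
    "model_order": ["prms_soilzone", "prms_channel", "prms_snow"],
}
problems = check_model_dict(toy_model_dict)
for problem in problems:
    print(problem)
```

Here `dis_both` is missing from the top level and `prms_snow` is not a defined process, so both are reported.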

### Instantiate the model

The `model_dict` above now specifies a complete model built from multiple processes. Connecting the processes is handled by the `Model` class, which can figure it all out because each process fully describes itself (as we saw in the previous notebook), including its inputs and variables. Let's instantiate a model from this `model_dict`.

In [None]:
model = pws.Model(model_dict)
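
To make the idea of self-describing processes concrete, here is a toy sketch of how inputs could be matched to the processes that compute them, with anything unmatched falling back to a file. This is not how `Model` is implemented. The function `resolve_input_sources` is hypothetical, and the input and variable lists are shortened stand-ins rather than the real declarations.

```python
def resolve_input_sources(declared):
    """Map every process input to the process that computes it, or to "file".

    declared: {process_name: {"inputs": [...], "variables": [...]}}
    """
    # which process computes each variable
    provided_by = {
        var: name for name, spec in declared.items() for var in spec["variables"]
    }
    # an input not computed by any process must come from a file
    return {
        name: {inp: provided_by.get(inp, "file") for inp in spec["inputs"]}
        for name, spec in declared.items()
    }


# shortened, illustrative declarations (not the full pywatershed lists)
declared = {
    "soilzone": {
        "inputs": ["potet", "infil"],
        "variables": ["soil_to_gw", "ssr_to_gw"],
    },
    "groundwater": {
        "inputs": ["soil_to_gw", "ssr_to_gw"],
        "variables": ["gwres_flow"],
    },
    "channel": {"inputs": ["gwres_flow"], "variables": ["seg_outflow"]},
}
sources = resolve_input_sources(declared)
for name, inputs in sources.items():
    print(name, inputs)
```

In this toy setup `soilzone` has no process supplying `potet` or `infil`, so both would have to be read from files, while `groundwater` and `channel` are driven by other processes.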

### ModelGraph
Now we can examine how the `Processes` are all connected using the `ModelGraph` class. We'll bring in the default color scheme for NHM `Processes`.

In [None]:
palette = pws.analysis.utils.colorbrewer.nhm_process_colors(model)
pws.analysis.utils.colorbrewer.jupyter_palette(palette)
show_params = not (platform == "darwin" and processor() == "arm")
try:
    pws.analysis.ModelGraph(
        model,
        hide_variables=False,
        process_colors=palette,
        show_params=show_params,
    ).SVG(verbose=True, dpi=48)
except Exception:
    static_url = "https://github.com/EC-USGS/pywatershed/releases/download/1.1.0/notebook_01_cell_11_model_graph.png"
    print(
        f"Dot fails on some machines. You can see the graph at this url: {static_url}"
    )
    from IPython.display import Image

    display(Image(url=static_url, width=1300))

### Questions
* What are the inputs for this model and where are these found? Is there anything special about those files? Could we drive any process from file?
* Can you see where each process gets its inputs from in this model? What is the largest number of other processes a single process draws inputs from?
* Are some of the arrows 2-way?
* Which processes are mass conservative? Can you see the terms involved in mass conservation?
* Which process has the greatest/smallest ratio of number of parameters to number of variables?

### Run the model
Now we'll initialize NetCDF output and run the model.

In [None]:
%%time
model.run(finalize=True)

Now we have a finalized run of our model. Finalizing is important mainly so that open output files are closed. We can quite easily look at all the output resulting from our run by looking at the NetCDF files in the run directory.

In [None]:
output_files = sorted(run_dir.glob("*.nc"))
print(len(output_files))
pprint(output_files)

The following code will let us examine output variables, plotting the full timeseries at individual locations which can be scrolled through using the bar on the right side. It will not work for the budget output files, however. Note that this plot is not a custom plot function. It is base functionality in hvplot (with an xarray backend). Because of all the work getting dimensions and metadata into the NetCDF file, the scroll on the spatial dimension is appropriately named, the y-axis is appropriately labeled with units, and the time axis looks sharp.

In [None]:
var = "seg_outflow"
var_da = xr.load_dataarray(run_dir / f"{var}.nc")
var_da.hvplot(groupby=var_da.dims[1])

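The next cell (not run here) shows how to plot the time mean of an HRU variable, such as `unused_potet`, in space. It expects `var` and `var_da` to hold an HRU variable rather than the segment variable `seg_outflow` loaded above: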

In [None]:
%%do_not_run_this_cell
proc_plot = pws.analysis.process_plot.ProcessPlot(gis_files.gis_dir / "drb_2yr")
proc_classes = [model_dict[nn]["class"] for nn in model_order]


def get_var_proc_class(var_name):
    for proc_class in proc_classes:
        if var_name in proc_class.get_variables():
            return proc_class


proc_plot.plot_hru_var(
    var_name=var,
    process=get_var_proc_class(var),
    data=var_da.mean(dim="time"),
    data_units=var_da.attrs["units"],
    nhm_id=var_da["nhm_id"],
)

We can also make a spatial plot of the streamflow using a transform for line width representation. 

In [None]:
# var = "seg_outflow"
# var_da = xr.open_dataarray(run_dir / f"{var}.nc")

# def xform_width(vals):
#     flow_log = np.maximum(np.log(vals + 1.0e-4), 0.0)
#     width_max = 5
#     width_min = 0.2
#     flow_log_lw = (width_max - width_min) * (flow_log - np.min(flow_log)) / (
#         np.max(flow_log) - np.min(flow_log)
#     ) + width_min
#     return flow_log_lw


# proc_plot.plot(
#     var,
#     process=get_var_proc_class(var),
#     value_transform=xform_width,
#     data=var_da.mean(dim="time"),
#     title=f"{var}",
#     aesthetic_width=True,
# )

# #proc_plot.plot(var_name, proc, title=var_name)
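
The width transform in the commented-out cell maps log-flow linearly onto a range of line widths. The following is a pure-Python sketch of the same arithmetic using only the standard library. The function name `width_from_flow` and the sample flows are made up, and the handling of the all-equal case is an assumption of this sketch (the original expression would divide by zero there).

```python
import math


def width_from_flow(flows, width_min=0.2, width_max=5.0):
    """Scale log-flow linearly onto [width_min, width_max]."""
    # negative logs (flows below about 1) are clipped to zero
    flow_log = [max(math.log(flow + 1.0e-4), 0.0) for flow in flows]
    lo, hi = min(flow_log), max(flow_log)
    if hi == lo:
        # all flows equal: fall back to the minimum width (sketch assumption)
        return [width_min for _ in flow_log]
    return [
        (width_max - width_min) * (val - lo) / (hi - lo) + width_min
        for val in flow_log
    ]


widths = width_from_flow([0.5, 10.0, 1000.0])
print(widths)
```

The smallest flow gets the minimum width, the largest gets the maximum width, and intermediate flows fall in between on a logarithmic scale.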

### Reduce model output to disk
Quite a lot of output was written in the above example. In many cases, the amount of model output can be reduced to improve (reduce) model run time. In the next cell, we show how you would reduce the output by setting `control.options["netcdf_output_var_names"]`. We'll suppose that we only want the output variables from the `PRMSGroundwater` and `PRMSChannel` processes. Note that we are just combining the variable names returned by these two processes' `.get_variables()` methods. However, we could specify any list of variable names we like (variable names not present in the model are ignored silently, so spelling obviously matters). We don't run this cell; we just show what code you'd change above.

In [None]:
%%do_not_run_this_cell
desired_output = [
    *pws.PRMSGroundwater.get_variables(),
    *pws.PRMSChannel.get_variables(),
]
# work on a copy so the original control is left untouched
control_cp = deepcopy(control)
control_cp.options["netcdf_output_var_names"] = desired_output

When I reduce the original ~150 output files to just those specified in the above cell, run time is reduced by about 60% on my Mac.
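
Because unknown output names are ignored silently, it can help to check a requested list against the known names before running. Below is a small standard-library sketch. The helper `check_output_names` is hypothetical, and the short list of available names is only a stand-in.

```python
from difflib import get_close_matches


def check_output_names(requested, available):
    """Split requested names into known names and unknown names with suggestions."""
    available = list(available)
    known = [name for name in requested if name in available]
    unknown = {
        name: get_close_matches(name, available, n=1)
        for name in requested
        if name not in available
    }
    return known, unknown


# stand-in list; see the note below for how to build a real one
available_names = ["recharge", "seg_outflow", "gwres_flow"]
known, unknown = check_output_names(["recharge", "seg_outflw"], available_names)
print(known)
print(unknown)
```

With pywatershed, `available_names` could be assembled from the `.get_variables()` results of the processes in the model, as in the cell above.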

## NHM Submodel for the Delaware River Basin 
In many cases, running the full NHM model may not be necessary and it may be advantageous to just run some of the processes in it. Pywatershed gives you this flexibility. Suppose you wanted to change parameters or model process representation in just the PRMSSoilzone to better predict streamflow. As the model is 1-way coupled, you can simply run a submodel starting with PRMSSoilzone and running through PRMSChannel. 

In this example we'll construct our model using a YAML file, instead of in memory as above. To see how this works, we'll start from a YAML file that specifies the full NHM that we ran above.

In [None]:
model_dict_yaml_file = repo_root_dir / "test_data/drb_2yr/nhm_model.yaml"
model_dict_yaml = read_yaml(model_dict_yaml_file)
display(model_dict_yaml)

We can see above that the YAML file specifies the control via a control YAML file and all other data via NetCDF files. All other fields are strings.

Let's write our own YAML file for our submodel. Files specified with relative paths are relative to the location of the YAML file itself. We want to put this YAML file into a new run directory, so we'll want to supply paths to existing files, and since we don't want or need to copy those, we'll use absolute paths. All `pl.Path`s must be converted to `str`s in the YAML representation.

In [None]:
model_dict_new = {
    "control": "nhm_control.yaml",
    "dis_hru": str(domain_dir / "parameters_dis_hru.nc"),
    "dis_both": str(domain_dir / "parameters_dis_both.nc"),
    "soilzone": {
        "class": "PRMSSoilzone",
        "parameters": str(domain_dir / "parameters_PRMSSoilzone.nc"),
        "dis": "dis_hru",
    },
    "groundwater": {
        "class": "PRMSGroundwater",
        "parameters": str(domain_dir / "parameters_PRMSGroundwater.nc"),
        "dis": "dis_hru",
    },
    "channel": {
        "class": "PRMSChannel",
        "parameters": str(domain_dir / "parameters_PRMSChannel.nc"),
        "dis": "dis_both",
    },
    "model_order": ["soilzone", "groundwater", "channel"],
}
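
The path rule stated above (relative entries are interpreted relative to the YAML file's own directory, absolute entries are used as given) can be sketched with `pathlib`. The function `resolve_relative_to_yaml` and the example paths are hypothetical. This only illustrates the rule and is not pywatershed's implementation.

```python
import pathlib as pl


def resolve_relative_to_yaml(yaml_file, entry):
    """Resolve a path entry the way it is described for model YAML files."""
    yaml_dir = pl.PurePosixPath(yaml_file).parent
    entry_path = pl.PurePosixPath(entry)
    # absolute entries are kept, relative ones are anchored at the YAML's directory
    return entry_path if entry_path.is_absolute() else yaml_dir / entry_path


yaml_file = "/runs/run_dir_submodel/submodel.yaml"  # made-up location
control_path = resolve_relative_to_yaml(yaml_file, "nhm_control.yaml")
params_path = resolve_relative_to_yaml(yaml_file, "/data/drb_2yr/parameters_dis_hru.nc")
print(control_path)
print(params_path)
```

This is why `"control": "nhm_control.yaml"` in the dictionary above implies that the control file must sit next to the model YAML file.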

We'll need to place a control YAML file in our run dir since that's where we said it would be. We'll use a control YAML file that is used for running the full model as a starting point. But will we need to edit it? Let's take a look.

In [None]:
model_control_yaml_file = repo_root_dir / "test_data/drb_2yr/nhm_control.yaml"
model_control_yaml = read_yaml(model_control_yaml_file)
display(model_control_yaml)

Looking closely at this, we'll notice that `input_dir` is not specified. Trying to instantiate a model will throw an error telling us this. But where do we get our inputs? What are our inputs? What were our inputs above? Maybe, let's try that same directory we used for the full model.

In [None]:
model_control_yaml["input_dir"] = str(control.options["input_dir"])
run_dir_submodel = nb_output_dir / "run_dir_submodel"
run_dir_submodel.mkdir(exist_ok=True)
model_control_yaml["netcdf_output_dir"] = str(run_dir_submodel)

Now let's write out our model and control YAML files.

In [None]:
submodel_yaml_file = run_dir_submodel / "submodel.yaml"
write_yaml(model_dict_new, submodel_yaml_file)
write_yaml(
    model_control_yaml, run_dir_submodel / "nhm_control.yaml"
)  # as specified in model_dict_new

We'll run the model from YAML files.

In [None]:
try:
    submodel = pws.Model.from_yaml(submodel_yaml_file)
except Exception as error:
    print("An exception occurred:", error)

We got an error that the `potet.nc` file was not found. What is going on? Why is that an input file? Let's take a look at the `ModelGraph` for this submodel.

In [None]:
show_params = not (platform == "darwin" and processor() == "arm")
try:
    pws.analysis.ModelGraph(
        submodel,
        hide_variables=False,
        process_colors=palette,
        show_params=show_params,
    ).SVG(verbose=True, dpi=48)
except Exception:
    static_url = "https://github.com/EC-USGS/pywatershed/releases/download/1.1.0/notebook_01_cell_45_submodel_graph.png"
    print(
        f"Dot fails on some machines. You can see the graph at this url: {static_url}"
    )
    from IPython.display import Image

    display(Image(url=static_url, width=700))

OK, the submodel has a different set of inputs that the `ModelGraph` clearly shows. That's cool, but where will we find those files? Remember when we ran the full model above? Maybe it output the required inputs? How could we check this?

In [None]:
all_inputs = [
    *pws.PRMSSoilzone.get_inputs(),
    *pws.PRMSGroundwater.get_inputs(),
    *pws.PRMSChannel.get_inputs(),
]
all_run_output_names = [ff.name[0:-3] for ff in sorted(run_dir.glob("*.nc"))]

In [None]:
set(all_inputs).difference(set(all_run_output_names))

Oh snap! The set difference is empty, so all the input files are available from the first run. Let's fix our control's `input_dir`.

In [None]:
model_control_yaml["input_dir"] = str(run_dir.resolve())
write_yaml(
    model_control_yaml, run_dir_submodel / "nhm_control.yaml"
)  # as specified in model_dict_new
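
The input check used above (required input names minus the names of the files present) can be packaged as a small helper. This is a sketch with a hypothetical function name, `missing_input_files`, and it uses empty placeholder files in a temporary directory rather than real NetCDF output.

```python
import pathlib as pl
import tempfile


def missing_input_files(required_inputs, input_dir):
    """Return the sorted input names with no matching <name>.nc file in input_dir."""
    available = {ff.stem for ff in pl.Path(input_dir).glob("*.nc")}
    return sorted(set(required_inputs) - available)


with tempfile.TemporaryDirectory() as tmp:
    # empty placeholder files standing in for model output
    for name in ["potet", "infil"]:
        (pl.Path(tmp) / f"{name}.nc").touch()
    missing = missing_input_files(["potet", "infil", "recharge"], tmp)
print(missing)
```

An empty result means every required input has a file in the directory, which is the situation we found for the first run's output directory.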

In [None]:
submodel = pws.Model.from_yaml(submodel_yaml_file)

The model instantiated just fine. While we could just do `submodel.run(finalize=True)`, that'd be too easy. Let's write the expansion of the run loop implemented under the hood of the `Model` class so you can see how you might explore the internals of a `Model` instance. You can see some basics of the relationship of a `Model` to its `Processes`.

In [None]:
%%time
submodel.initialize_netcdf()
for tt in tqdm(range(control.n_times)):
    submodel.control.advance()
    for cls in submodel.process_order:
        submodel.processes[cls].advance()
        submodel.processes[cls].calculate(1.0)
        submodel.processes[cls].output()

submodel.finalize()

Well, the submodel saved us some time. Again, about 60% of the original run time (like when reducing the number of output variables). Below, we'll show that the submodel run is identical to the original run, for the processes included. 

First, let's look at the internals of the `submodel` in a bit more detail. The final time is still in memory so we can take a closer look at, say, recharge. We'll look at its metadata, its dimensions, shape, type, and dtype in the next cell.

In [None]:
pprint(pws.meta.find_variables("recharge"))
print(
    "PRMSSoilzone dimension names: ",
    submodel.processes["soilzone"].dimensions,
)
print("nhru: ", submodel.processes["soilzone"].nhru)
print(
    "PRMSSoilzone recharge shape: ",
    submodel.processes["soilzone"]["recharge"].shape,
)
print(
    "PRMSSoilzone recharge type: ",
    type(submodel.processes["soilzone"]["recharge"]),
)
print(
    "PRMSSoilzone recharge dtype: ",
    submodel.processes["soilzone"]["recharge"].dtype,
)

We see the length of the `nhru` dimension and that this is the only dimension on `recharge`. With the exception of the `PRMSSolarGeometry` and `PRMSAtmosphere` classes (which vectorize computations over time), `Processes` only have spatial dimensions. Their data are written to file with each timestep. Prognostic variables have a `variable_previous` (or `_old` or `_ante`, etc.) version to store the antecedent values. One design feature of pywatershed is that all such prognostic variables can be identified in a `Process`'s `.advance()` method.

For our current `submodel`, the last timestep is still in memory (even though we've finalized the run) and we can visualize it. The data are on the unstructured/polygon grid of Hydrologic Response Units (HRUs), so we'll visualize the spatial distribution at this final time.
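
As a toy illustration of the `_previous` convention described above, the sketch below keeps one prognostic variable and copies it to its antecedent slot in `advance()`. This is not pywatershed code. The `ToyBucket` class and its `leak_fraction` parameter are made up for illustration.

```python
class ToyBucket:
    """A toy storage process with one prognostic variable (not a pywatershed class)."""

    def __init__(self, storage=0.0, leak_fraction=0.1):
        self.storage = storage
        self.storage_previous = storage
        self.leak_fraction = leak_fraction

    def advance(self):
        # the prognostic variable is copied to its antecedent slot here
        self.storage_previous = self.storage

    def calculate(self, inflow):
        # the new state depends only on the antecedent state and the inflow
        outflow = self.leak_fraction * self.storage_previous
        self.storage = self.storage_previous + inflow - outflow
        return outflow


bucket = ToyBucket(storage=10.0)
history = []
for inflow in [1.0, 0.0, 2.0]:
    bucket.advance()
    bucket.calculate(inflow)
    history.append((bucket.storage_previous, bucket.storage))

for previous, current in history:
    print(previous, current)
```

Keeping the antecedent value makes the storage change over a timestep (`storage - storage_previous`) directly available, which is the kind of term a mass budget needs.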

In [None]:
%%do_not_run_this_cell
proc_plot = pws.analysis.process_plot.ProcessPlot(gis_files.gis_dir / "drb_2yr")
proc_name = "soilzone"
var_name = "ssr_to_gw"
proc = submodel.processes[proc_name]
display(proc_plot.plot(var_name, proc))

We can easily check the results of our submodel against our full model. This gives us an opportunity to look at the output files. We can start with recharge as our variable of interest. The model NetCDF output can be read in using `xarray` where we can see all the relevant metadata quickly.

In [None]:
var = "recharge"
nhm_da = xr.load_dataarray(run_dir / f"{var}.nc")
sub_da = xr.load_dataarray(run_dir_submodel / f"{var}.nc")

In [None]:
display(nhm_da)
display(sub_da)

Now we can compare all output variables common to both runs, asserting that the two runs gave equal output.

In [None]:
submodel_variables = [
    *pws.PRMSSoilzone.get_variables(),
    *pws.PRMSGroundwater.get_variables(),
    *pws.PRMSChannel.get_variables(),
]

In [None]:
for var in submodel_variables:
    nhm_da = xr.load_dataarray(run_dir / f"{var}.nc")
    sub_da = xr.load_dataarray(run_dir_submodel / f"{var}.nc")
    xr.testing.assert_equal(nhm_da, sub_da)

We can make some scatter plots and timeseries plots for any variable of interest, since you were not convinced by the `assert_equal` above.

In [None]:
var_name = "seg_outflow"
nhm_da = xr.load_dataarray(run_dir / f"{var_name}.nc")
sub_da = xr.load_dataarray(run_dir_submodel / f"{var_name}.nc")
# label each series by the run it came from: full NHM vs. submodel
scat = xr.merge(
    [nhm_da.rename(f"{var_name}_nhm"), sub_da.rename(f"{var_name}_submodel")]
)
space_dim = sub_da.dims[1]
display(
    scat.hvplot(
        x=f"{var_name}_nhm", y=f"{var_name}_submodel", groupby=space_dim
    ).opts(data_aspect=1)
)

scat.hvplot(y=f"{var_name}_submodel", groupby=space_dim)

### Adapter class
The `Adapter` class is the bit of magic behind how we drive `Processes` from files or from other `Processes`. Here we'll give a quick demo of how this class works.

In [None]:
control = pws.Control.from_yaml(run_dir_submodel / "nhm_control.yaml")
recharge_adapter = pws.adapter_factory(
    run_dir_submodel / "recharge.nc", "recharge", control
)

Before the control and the adapter are advanced in time, the adapter has missing values.

In [None]:
recharge_adapter.current

We advance through all time and we'll check that we get the values that are still in memory. This demo shows how the adapter class can easily make a NetCDF file look like a `Process`.

In [None]:
for tt in range(control.n_times):
    control.advance()
    recharge_adapter.advance()
    if tt == 0:
        display(recharge_adapter.current)

In [None]:
all(recharge_adapter.current == submodel.processes["soilzone"]["recharge"])
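
The adapter idea can be sketched without pywatershed: an object that exposes a `current` slice and an `advance()` method, so that stored data can stand in for a process. The `ToyFileAdapter` class below is hypothetical. It reads from an in-memory list instead of a NetCDF file and uses NaN as its stand-in for missing values.

```python
class ToyFileAdapter:
    """Serve one time slice at a time from pre-computed data (not pywatershed)."""

    def __init__(self, data_by_time):
        self._data = data_by_time  # one list of values per timestep
        self._index = -1
        # nothing is loaded before the first advance
        self.current = [float("nan")] * len(data_by_time[0])

    def advance(self):
        # move to the next time slice and expose it as the current values
        self._index += 1
        self.current = list(self._data[self._index])


adapter = ToyFileAdapter([[1.0, 2.0], [3.0, 4.0]])
before = list(adapter.current)
adapter.advance()
first = list(adapter.current)
adapter.advance()
last = list(adapter.current)
print(before, first, last)
```

A consumer only needs `advance()` and `current`, so it does not have to know whether the values come from a file or from another process.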

## References
* Regan, R. S., Markstrom, S. L., Hay, L. E., Viger, R. J., Norton, P. A., Driscoll, J. M., & LaFontaine, J. H. (2018). Description of the National Hydrologic Model for use with the Precipitation-Runoff Modeling System (PRMS) (No. 6-B9). US Geological Survey.
* Regan, R.S., Markstrom, S.L., LaFontaine, J.H., 2022, PRMS version 5.2.1: Precipitation-Runoff Modeling System (PRMS): U.S. Geological Survey Software Release, 02/10/2022.