# U.S. Geological Survey Class GW3099
Advanced Modeling of Groundwater Flow (GW3099)\
Boise, Idaho\
September 16 - 20, 2024

![title](../../images/ClassLocation.jpg)

# Multi-process models in pywatershed
*(Note that this notebook follows the notebook in the pywatershed repository [examples/01_multi-process_models.ipynb](https://github.com/EC-USGS/pywatershed/blob/develop/examples/01_multi-process_models.ipynb) but it deviates in some of the details covered.)*

In notebook [`step1_processes.ipynb`](step1_processes.ipynb), we looked at how individual Process representations work and are designed. In this notebook we learn how to put multiple `Processes` together into composite models using the `Model` class. 

The starting point for the development of `pywatershed` was the National Hydrologic Model (NHM, Regan et al., 2018) configuration of the Precipitation-Runoff Modeling System (PRMS, Regan et al., 2015). In this notebook, we'll first construct a full NHM configuration. The spatial domain we'll use will again be the Delaware River Basin. Once we construct the full NHM, we'll look at how we can also construct sub-models of the NHM.

## Prerequisites

In [None]:
import pathlib as pl
import pydoc
from copy import deepcopy
from platform import processor
from pprint import pprint
from sys import platform

import hvplot.xarray  # noqa
import jupyter_black
import numpy as np
import pywatershed as pws
import xarray as xr
import yaml
from helpers import do_not_run_this_cell
from pywatershed.utils import gis_files
from pywatershed.utils.path import dict_pl_to_str

jupyter_black.load()  # auto-format the code in this notebook

pws.utils.gis_files.download()  # this downloads GIS files

pkg_root_dir = pws.constants.__pywatershed_root__

## Domain Plot to get to know the area

Before diving into pywatershed models, let's use one of its built-in tools to get familiar with the application domain. We'll combine the GIS files for the HRUs and the Segments in this domain with their parameters to learn more about how the model represents quantities in physical space. Please zoom in and out and select different layers. We aim to add more functionality to this plot over time, so stay tuned.

In [None]:
nb_output_dir = pl.Path("./step2_multi-process_models")

domain_dir = pkg_root_dir / "data/drb_2yr"

domain_gis_dir = pkg_root_dir / "data/pywatershed_gis/drb_2yr"
shp_file_hru = domain_gis_dir / "HRU_subset.shp"
shp_file_seg = domain_gis_dir / "Segments_subset.shp"

In [None]:
dis_hru = pws.Parameters.from_netcdf(domain_dir / "parameters_dis_hru.nc")
start_lat = dis_hru.parameters["hru_lat"].mean()
start_lon = dis_hru.parameters["hru_lon"].mean()

pws.plot.DomainPlot(
    hru_shp_file=shp_file_hru,
    segment_shp_file=shp_file_seg,
    hru_parameters=domain_dir / "parameters_dis_hru.nc",
    hru_parameter_names=[
        "nhm_id",
        "hru_lat",
        "hru_lon",
        "hru_area",
    ],
    segment_parameters=domain_dir / "parameters_dis_seg.nc",
    segment_parameter_names=[
        "nhm_seg",
        "seg_length",
        "seg_slope",
        "seg_cum_area",
    ],
    start_lat=start_lat,
    start_lon=start_lon,
    start_zoom=7,
)

## An NHM multi-process model for the Delaware River Basin
The 8 conceptual `Process` classes that comprise the NHM are, in order:

In [None]:
nhm_processes = [
    pws.PRMSSolarGeometry,
    pws.PRMSAtmosphere,
    pws.PRMSCanopy,
    pws.PRMSSnow,
    pws.PRMSRunoff,
    pws.PRMSSoilzone,
    pws.PRMSGroundwater,
    pws.PRMSChannel,
]

We'll use this list of classes shortly to construct the NHM.

A multi-process model is assembled by the `Model` class. We can take a quick look at the first 22 lines of help on `Model`:

In [None]:
# this is equivalent to help() but we get the multiline string and just look at part of it
model_help = pydoc.render_doc(pws.Model, "Help on %s")
# the first 22 lines of help(pws.Model)
print("\n".join(model_help.splitlines()[0:22]))

The `help()` mentions that there are 2 distinct ways of instantiating a `Model` class. In this notebook, we focus on the pywatershed-centric instantiation and leave the PRMS-legacy instantiation for another time. 

With the pywatershed-centric approach, the first argument is a "model dictionary" which does nearly all the work (the other arguments will be their default values). The `help()` describes the model dictionary and provides examples. Please use it for reference and more details. Here we'll give an extended concrete example. The `help()` also describes how a `Model` can be instantiated from a model dictionary contained in a YAML file. First, we'll build a model dictionary in memory, then we'll write it out as a YAML file and instantiate our model directly from that file. 

### Construct the model specification in memory
Because our (pre-existing) parameter files (which come with `pywatershed`) and our `Process` classes are consistently named, we can begin to build the model dictionary quickly.

In [None]:
model_dict = {}

for proc in nhm_processes:
    # this is the class name
    proc_name = proc.__name__
    # the processes can have arbitrary names in the model_dict and
    # an instance should not have capitalized name anyway (according to
    # python convention), so rename from the class name
    proc_rename = "prms_" + proc_name[4:].lower()
    # each process has a dictionary of information
    model_dict[proc_rename] = {}
    # alias to shorten lines below
    proc_dict = model_dict[proc_rename]
    # required key "class" specifies the class
    proc_dict["class"] = proc
    # the "parameters" key provides an instance of Parameters
    proc_param_file = domain_dir / f"parameters_{proc_name}.nc"
    proc_dict["parameters"] = pws.Parameters.from_netcdf(proc_param_file)
    # the "dis" key provides the name of the discretizations
    # which we'll supply shortly to the model dictionary
    if proc_rename == "prms_channel":
        proc_dict["dis"] = "dis_both"
    else:
        proc_dict["dis"] = "dis_hru"
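
The renaming rule in the loop above is plain string manipulation, so it can be checked without pywatershed. The sketch below works on class names as strings; the helper `process_key` is a hypothetical name introduced only for illustration.

```python
# Class names as plain strings; only the names matter for this sketch.
class_names = ["PRMSSolarGeometry", "PRMSSoilzone", "PRMSChannel"]


def process_key(class_name):
    """Drop the leading 'PRMS' and build a lowercase, prefixed key.

    Hypothetical helper, not part of pywatershed.
    """
    return "prms_" + class_name[4:].lower()


process_keys = [process_key(name) for name in class_names]
print(process_keys)
```

The same expression is reused later to build `model_order`, which is why the keys in the model dictionary and the entries in the execution order always agree.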

Let's look at what we have so far in the `model_dict`.

In [None]:
pprint(model_dict, sort_dicts=False)

We have given a name to each process and then supplied the class, its parameters, and its discretization for the full set of processes. Now we'll need to add the discretizations to the model dictionary. They are added at the top level and correspond to the names the processes used. 

In [None]:
model_dict = model_dict | {
    "dis_hru": pws.Parameters.from_netcdf(
        domain_dir / "parameters_dis_hru.nc"
    ),
    "dis_both": pws.Parameters.from_netcdf(
        domain_dir / "parameters_dis_both.nc"
    ),
}
pprint(model_dict, sort_dicts=False)
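
The link between a process and its discretization is just a shared name: the string under a process's `"dis"` key must also appear as a top-level key. A minimal sketch of that consistency check follows, using placeholder strings instead of real `Parameters` instances. The function `missing_discretizations` is hypothetical, and this is not a claim about how `Model` validates its input.

```python
# Placeholder strings stand in for classes and Parameters instances.
toy_model_dict = {
    "prms_soilzone": {"class": "PRMSSoilzone", "dis": "dis_hru"},
    "prms_channel": {"class": "PRMSChannel", "dis": "dis_both"},
    "dis_hru": "<Parameters for HRUs>",
    "dis_both": "<Parameters for HRUs and segments>",
}


def missing_discretizations(model_dict):
    """Return (process, dis_name) pairs whose dis_name has no top-level key."""
    missing = []
    for name, entry in model_dict.items():
        # Only process entries are dictionaries carrying a "dis" key.
        if isinstance(entry, dict) and "dis" in entry:
            if entry["dis"] not in model_dict:
                missing.append((name, entry["dis"]))
    return missing


print(missing_discretizations(toy_model_dict))

# Removing a discretization makes the dangling reference visible.
broken = {k: v for k, v in toy_model_dict.items() if k != "dis_both"}
print(missing_discretizations(broken))
```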

For the time being, `PRMSChannel` needs to know about both HRUs and segments, so `dis_both` is used. We plan to remove this requirement in the near future by implementing "exchanges" between processes into the model dictionary.

You may have noticed that we are missing a `Control` object to provide time information to the processes. We'll create it and we'll also create a list of the order that the processes are executed.

Though we have input available to run 2 years of simulation, we'll restrict the model run to the first 6 months for demonstration purposes. (Feel free to increase this to the full 2 years available, if you like.)

In [None]:
run_dir = nb_output_dir / "run_dir"
control = pws.Control(
    start_time=np.datetime64("1979-01-01T00:00:00"),
    end_time=np.datetime64("1979-07-01T00:00:00"),
    time_step=np.timedelta64(24, "h"),
    options={
        "input_dir": domain_dir,
        "budget_type": "error",
        "netcdf_output_dir": run_dir,
    },
)
model_order = ["prms_" + proc.__name__[4:].lower() for proc in nhm_processes]
model_dict = model_dict | {"control": control, "model_order": model_order}
pprint(model_dict, sort_dicts=False)
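
For a sense of scale, the number of daily intervals between the start and end times above can be computed directly with NumPy. This sketch only does date arithmetic; whether the end time itself counts as a simulated step is decided by `Control` and is not addressed here.

```python
import numpy as np

# Same values as passed to pws.Control above.
start_time = np.datetime64("1979-01-01T00:00:00")
end_time = np.datetime64("1979-07-01T00:00:00")
time_step = np.timedelta64(24, "h")

# Dividing two timedeltas gives a plain float ratio.
n_intervals = int((end_time - start_time) / time_step)
print(n_intervals)
```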

### Instantiate the model and view the ModelGraph

The `model_dict` now specifies a complete model built from multiple processes. The way these processes are connected can be figured out by the `Model` class, because each process fully describes itself (as we saw in the previous notebook). If we instantiate a model from this `model_dict`,

In [None]:
model = pws.Model(model_dict)

we can examine how the `Processes` are all connected using the `ModelGraph` class. We'll bring in the default color scheme for NHM `Processes`.

In [None]:
palette = pws.analysis.utils.colorbrewer.nhm_process_colors(model)
pws.analysis.utils.colorbrewer.jupyter_palette(palette)
show_params = not (platform == "darwin" and processor() == "arm")
try:
    pws.analysis.ModelGraph(
        model,
        hide_variables=False,
        process_colors=palette,
        show_params=show_params,
    ).SVG(verbose=True, dpi=48)
except Exception:
    static_url = "https://github.com/EC-USGS/pywatershed/releases/download/1.1.0/notebook_01_cell_11_model_graph.png"
    print(
        f"Dot fails on some machines. You can see the graph at this url: {static_url}"
    )
    from IPython.display import Image

    display(Image(url=static_url, width=1300))
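
The idea that a model can be wired up from self-describing processes can be sketched in plain Python. The toy processes, their inputs, and their variables below are invented for illustration, and the function `wire` is hypothetical. It shows the general technique only, not the implementation used by `Model` or `ModelGraph`.

```python
# Hypothetical toy processes: all names here are made up.
toy_processes = {
    "upstream": {"inputs": ["forcing"], "variables": ["a", "b"]},
    "middle": {"inputs": ["a", "forcing"], "variables": ["c"]},
    "downstream": {"inputs": ["b", "c"], "variables": ["d"]},
}


def wire(processes):
    """Map each (process, input) pair to its provider, or 'file' if none."""
    # First pass: record which process computes each variable.
    provider = {}
    for name, desc in processes.items():
        for var in desc["variables"]:
            provider[var] = name
    # Second pass: resolve every input against the providers.
    wiring = {}
    for name, desc in processes.items():
        for inp in desc["inputs"]:
            wiring[(name, inp)] = provider.get(inp, "file")
    return wiring


wiring = wire(toy_processes)
for (proc, inp), source in wiring.items():
    print(f"{proc}.{inp} <- {source}")
```

Inputs that no process provides fall back to `"file"`, which mirrors the questions below about which inputs are read from file.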

### Questions
* What are the inputs for this model and where are these found? Is there anything special about those files? Could we drive any process from file?
* Can you see from where each process gets its inputs in this model? What is the largest number of other processes a single process draws inputs from?
* Are some of the arrows 2-way?
* Which processes are mass conservative? Can you see the terms involved in mass conservation?
* Which process has the greatest/smallest ratio of number of parameters to number of variables?

### Run the model
Now we'll initialize NetCDF output and run the model.

In [None]:
%%time
model.run(finalize=True)

Now we have a finalized run of our model. Before we look at the output of the run, note that the model specification in this example was constructed interactively in memory. We can also specify the model construction with a YAML file. This is shown in the notebook in the pywatershed repository [examples/01_multi-process_models.ipynb](https://github.com/EC-USGS/pywatershed/blob/develop/examples/01_multi-process_models.ipynb), on which this notebook is based. 

We can quite easily look at all the output resulting from our run by looking at the NetCDF files in the run directory. 

In [None]:
output_files = sorted(run_dir.glob("*.nc"))
print(len(output_files))
pprint(output_files)

The following code will let us examine output variables, plotting the full timeseries at individual locations which can be scrolled through using the bar on the right side. It will not work for the budget output files, however.

In [None]:
var = "albedo"
var_da = xr.open_dataarray(run_dir / f"{var}.nc")
var_da.hvplot(groupby=var_da.dims[1])

We'll also make a spatial plot of the time mean of the variable selected above (`albedo`) on the HRUs:

In [None]:
proc_plot = pws.analysis.process_plot.ProcessPlot(
    gis_files.gis_dir / "drb_2yr"
)
proc_classes = [model_dict[nn]["class"] for nn in model_order]


def get_var_proc_class(var_name):
    for proc_class in proc_classes:
        if var_name in proc_class.get_variables():
            return proc_class


proc_plot.plot_hru_var(
    var_name=var,
    process=get_var_proc_class(var),
    data=var_da.mean(dim="time"),
    data_units=var_da.attrs["units"],
    nhm_id=var_da["nhm_id"],
)

We can also make a spatial plot of the streamflow using a transform for line width representation. 

In [None]:
# var = "seg_outflow"
# var_da = xr.open_dataarray(run_dir / f"{var}.nc")

# def xform_width(vals):
#     flow_log = np.maximum(np.log(vals + 1.0e-4), 0.0)
#     width_max = 5
#     width_min = 0.2
#     flow_log_lw = (width_max - width_min) * (flow_log - np.min(flow_log)) / (
#         np.max(flow_log) - np.min(flow_log)
#     ) + width_min
#     return flow_log_lw


# proc_plot.plot(
#     var,
#     process=get_var_proc_class(var),
#     value_transform=xform_width,
#     data=var_da.mean(dim="time"),
#     title=f"{var}",
#     aesthetic_width=True,
# )

# #proc_plot.plot(var_name, proc, title=var_name)
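
The width transform in the commented-out cell can be tried on its own with NumPy. The sketch below follows the same formula: take the log of the flow, clip it at zero, and rescale to the range `[width_min, width_max]`. The guard for a constant input and the sample flow values are additions made for this illustration.

```python
import numpy as np


def xform_width(vals, width_min=0.2, width_max=5.0):
    """Map flows to line widths using a clipped-log, min-max rescaling."""
    # The small offset avoids log(0); negative logs are clipped to zero.
    flow_log = np.maximum(np.log(vals + 1.0e-4), 0.0)
    span = np.max(flow_log) - np.min(flow_log)
    if span == 0:
        # Added guard (not in the original cell): avoid dividing by zero.
        return np.full_like(flow_log, width_min)
    return (width_max - width_min) * (flow_log - np.min(flow_log)) / span + width_min


toy_flows = np.array([0.0, 1.0, 10.0, 1000.0])  # made-up flow values
widths = xform_width(toy_flows)
print(widths)
```

The log scaling keeps a few very large segments from flattening every other line to the minimum width.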

### Reduce model output to disk
It's worth noting that quite a lot of output was written and that in many cases the amount of output can be reduced to improve (reduce) model run time. In the next cell, we show how you would reduce the output by setting `control.options['netcdf_output_var_names']`. We'll suppose that we only want the output variables from the `PRMSGroundwater` and `PRMSChannel` processes. Note that we are just combining the variable names returned by these two processes' `.get_variables()` methods. However, we could specify any list of variable names we like.

In [None]:
%%do_not_run_this_cell
desired_output = [
    *pws.PRMSGroundwater.get_variables(),
    *pws.PRMSChannel.get_variables(),
]
control_cp.options["netcdf_output_var_names"] = desired_output

If I reduce the original ~150 output files to just those specified in the above cell, the run time is about 60% of the full-output run time on my Mac.

## NHM Submodel for the Delaware River Basin 
In many cases, running the full NHM model may not be necessary and it may be advantageous to just run some of the processes in it. Pywatershed gives you this flexibility. Suppose you wanted to change parameters or model process representation in the PRMSSoilzone to better predict streamflow. As the model is 1-way coupled, you can simply run a submodel starting with PRMSSoilzone and running through PRMSChannel.

In [None]:
submodel_processes = [pws.PRMSSoilzone, pws.PRMSGroundwater, pws.PRMSChannel]

This prompts the question: what inputs/forcing data do we need for this submodel? We can ask each individual process for its inputs.

In [None]:
submodel_input_dict = {
    pp.__name__: pp.get_inputs() for pp in submodel_processes
}
pprint(submodel_input_dict)

And which inputs are supplied by variables within this submodel? We ask each process for its variables.

In [None]:
submodel_vars_dict = {
    pp.__name__: pp.get_variables() for pp in submodel_processes
}
pprint(submodel_vars_dict)

We consolidate inputs and variables (each over all processes) and take a set difference of inputs and variables to know what inputs/forcings we need from file. 

In [None]:
submodel_inputs = set([ii for tt in submodel_input_dict.values() for ii in tt])
submodel_variables = set(
    [ii for tt in submodel_vars_dict.values() for ii in tt]
)
submodel_file_inputs = tuple(submodel_inputs - submodel_variables)
pprint(submodel_file_inputs)

And where will we get these input files? You'll notice that these files do not come with the repository. Instead they are generated when we ran the full NHM model above.

In [None]:
yaml_output_dir = pl.Path(control.options["netcdf_output_dir"])
for ii in submodel_file_inputs:
    input_file = yaml_output_dir / f"{ii}.nc"
    assert input_file.exists()
    print(input_file)

Well, that was a lot of work. But, as alluded to above, the `Model` object does the above so you don't have to. You just learned something about how the flow of information between processes is enabled by the design and how one can query individual processes in `pywatershed`. But we could instantiate the submodel and plot this wiring up, just as we plotted the `ModelGraph` of the full model. We'll create the submodel in a new `run_dir` and we'll use outputs from the full model above as inputs to this submodel.

In [None]:
run_dir = pl.Path(nb_output_dir / "nhm_sub").resolve()
run_dir.mkdir(exist_ok=True)


control_cp = deepcopy(control)
# It is key that inputs exist from previous full-model run
control_cp.options["input_dir"] = yaml_output_dir.resolve()
control_cp.options["netcdf_output_dir"] = run_dir.resolve()
control_yaml_file = run_dir / "control.yaml"
control_cp.to_yaml(control_yaml_file)
pprint(control.to_dict(), sort_dicts=False)

Now we will use the existing `model_dict` in memory, tailoring it to the above and keeping only the processes of interest in the submodel.

In [None]:
model_dict["control"] = str(control_yaml_file)
model_dict_yaml_file = run_dir / "model_dict.yaml"
keep_procs = ["prms_soilzone", "prms_groundwater", "prms_channel"]
model_dict["model_order"] = keep_procs
for kk in list(model_dict.keys()):
    if isinstance(model_dict[kk], dict) and kk not in keep_procs:
        del model_dict[kk]

pprint(model_dict, sort_dicts=False)
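
The pruning loop above relies on two details: iterating over a copy of the keys so entries can be deleted safely, and deleting only dictionary-valued entries so that non-dict top-level entries survive. A minimal sketch with a toy dictionary is shown below. The class `FakeParameters` is a hypothetical stand-in for any non-dict value and is not a pywatershed type.

```python
class FakeParameters:
    """Hypothetical stand-in for a non-dict top-level entry."""


toy_dict = {
    "prms_runoff": {"class": "PRMSRunoff"},
    "prms_soilzone": {"class": "PRMSSoilzone"},
    "prms_channel": {"class": "PRMSChannel"},
    "dis_hru": FakeParameters(),
    "model_order": ["prms_runoff", "prms_soilzone", "prms_channel"],
}

keep = ["prms_soilzone", "prms_channel"]
toy_dict["model_order"] = keep

# list(...) makes a copy of the keys, so deleting inside the loop is safe.
for key in list(toy_dict.keys()):
    if isinstance(toy_dict[key], dict) and key not in keep:
        del toy_dict[key]

print(sorted(toy_dict))
```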

The control was already written to a YAML file above. Now we write the model dictionary to a YAML file as well.

In [None]:
with open(model_dict_yaml_file, "w") as file:
    _ = yaml.dump(model_dict, file)

And finally we instantiate the submodel from the model dictionary yaml file. 

In [None]:
submodel = pws.Model.from_yaml(model_dict_yaml_file)
submodel

Now to look at the `ModelGraph` for the submodel.

In [None]:
show_params = not (platform == "darwin" and processor() == "arm")
try:
    pws.analysis.ModelGraph(
        submodel,
        hide_variables=False,
        process_colors=palette,
        show_params=show_params,
    ).SVG(verbose=True, dpi=48)
except Exception:
    static_url = "https://github.com/EC-USGS/pywatershed/releases/download/1.1.0/notebook_01_cell_45_submodel_graph.png"
    print(
        f"Dot fails on some machines. You can see the graph at this url: {static_url}"
    )
    from IPython.display import Image

    display(Image(url=static_url, width=700))

Note that the required inputs to the submodel are quite different and rely on these files having already been output by the full model. 

Now we can initialize output and run the submodel.

In [None]:
%%time
submodel.run(finalize=True)

Well, that saved us some time. The run is similar to before, just using fewer processes. 

The final time is still in memory. We can take a look at, say, recharge. Before plotting, let's take a look at the data and the metadata for recharge a bit closer.

In [None]:
pprint(pws.meta.find_variables("recharge"))
print(
    "PRMSSoilzone dimension names: ",
    submodel.processes["prms_soilzone"].dimensions,
)
print("nhru: ", submodel.processes["prms_soilzone"].nhru)
print(
    "PRMSSoilzone recharge shape: ",
    submodel.processes["prms_soilzone"]["recharge"].shape,
)
print(
    "PRMSSoilzone recharge type: ",
    type(submodel.processes["prms_soilzone"]["recharge"]),
)
print(
    "PRMSSoilzone recharge dtype: ",
    submodel.processes["prms_soilzone"]["recharge"].dtype,
)

First we access the metadata on `recharge` and we see its description, dimension, type, and units. Then we look at the dimension names of the `PRMSSoilzone` process in which it is found. We see the length of the `nhru` dimension and that this is the only dimension on `recharge`. We also see that `recharge` is a `numpy.ndarray` with data type `float64`.

So `recharge` has only a spatial dimension. It is written to file with each timestep (or periodically). However, the last timestep is still in memory (even though we've finalized the run) and we can visualize it. The data are on the unstructured/polygon grid of Hydrologic Response Units (HRUs), so we'll visualize a spatial distribution at this final time. Note that the cell below plots `ssr_to_gw`, a variable taken from the same `prms_soilzone` process.

In [None]:
proc_plot = pws.analysis.process_plot.ProcessPlot(
    gis_files.gis_dir / "drb_2yr"
)
proc_name = "prms_soilzone"
var_name = "ssr_to_gw"
proc = submodel.processes[proc_name]
display(proc_plot.plot(var_name, proc))

We can easily check the results of our submodel against our full model. This gives us an opportunity to look at the output files. We can start with recharge as our variable of interest. The model NetCDF output can be read in using `xarray` where we can see all the relevant metadata quickly.

In [None]:
var = "recharge"
nhm_da = xr.open_dataarray(yaml_output_dir / f"{var}.nc")
sub_da = xr.open_dataarray(run_dir / f"{var}.nc")

In [None]:
display(nhm_da)
display(sub_da)

Now we can compare all output variables common to both runs, asserting that the two runs gave equal output.

In [None]:
for var in submodel_variables:
    nhm_da = xr.open_dataarray(yaml_output_dir / f"{var}.nc")
    sub_da = xr.open_dataarray(run_dir / f"{var}.nc")
    xr.testing.assert_equal(nhm_da, sub_da)

In [None]:
# var_name = "dprst_seep_hru"
nhm_da = xr.open_dataarray(yaml_output_dir / f"{var_name}.nc")
sub_da = xr.open_dataarray(run_dir / f"{var_name}.nc")
scat = xr.merge(
    [nhm_da.rename(f"{var_name}_yaml"), sub_da.rename(f"{var_name}_subset")]
)

display(
    scat.hvplot(
        x=f"{var_name}_yaml", y=f"{var_name}_subset", groupby="nhm_id"
    ).opts(data_aspect=1)
)

scat.hvplot(y=f"{var_name}_subset", groupby="nhm_id")

## References
* Regan, R. S., Markstrom, S. L., Hay, L. E., Viger, R. J., Norton, P. A., Driscoll, J. M., & LaFontaine, J. H. (2018). Description of the National Hydrologic Model for use with the Precipitation-Runoff Modeling System (PRMS) (No. 6-B9). US Geological Survey.
* Regan, R.S., Markstrom, S.L., LaFontaine, J.H., 2022, PRMS version 5.2.1: Precipitation-Runoff Modeling System (PRMS): U.S. Geological Survey Software Release, 02/10/2022.