<img SRC="https://avatars2.githubusercontent.com/u/31697400?s=400&u=a5a6fc31ec93c07853dd53835936fd90c44f7483&v=4" WIDTH=125 ALIGN="right">

# Caching

*O.N. Ebbens, Artesia, 2021*

Groundwater flow models are often data-intensive. Execution times can be shortened significantly by caching data. This notebook shows some examples of caching using the nlmod package.

### Contents<a name="TOC"></a>
1. [Cache directory](#cachedir)
2. [Caching in nlmod](#cachingnlmod)
3. [Checking the cache](#3)
4. [Discussion](#4)

In [10]:
import matplotlib.pyplot as plt
import flopy
import os
import geopandas as gpd
import xarray as xr

import nlmod

print(f'nlmod version: {nlmod.__version__}')

nlmod version: 0.0.3b


### [1. Cache directory](#TOC)<a name="cachedir"></a>

When you create a model you usually start by assigning a model workspace. This is a directory where model data is stored. The `nlmod.util.get_model_dirs()` function can be used to create a file structure in two steps.
First, the model workspace directory is created if it does not exist yet. Second, two subdirectories are created: 'figure' and 'cache'. Calling the function below creates the `figdir` and `cachedir` variables, which hold the paths of these subdirectories. In this notebook we use `cachedir` to write and read cached data. You can also define your own cache directory.

In [2]:
model_ws = 'model5'

# Model directories
figdir, cachedir = nlmod.util.get_model_dirs(model_ws)

print(model_ws)
print(figdir)
print(cachedir)

model5
model5\figure
model5\cache
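
The two-step directory setup described above can be reproduced with the standard library. The sketch below is a hypothetical stand-in (`make_model_dirs` is not part of `nlmod`) and only assumes the folder names 'figure' and 'cache' mentioned in the text.

```python
import os
import tempfile


def make_model_dirs(model_ws):
    """Create the workspace plus 'figure' and 'cache' subdirectories.

    Hypothetical stand-in for illustration, not nlmod's implementation.
    """
    figdir = os.path.join(model_ws, "figure")
    cachedir = os.path.join(model_ws, "cache")
    # makedirs also creates the missing workspace directory itself and,
    # with exist_ok=True, does nothing if a directory already exists.
    for path in (figdir, cachedir):
        os.makedirs(path, exist_ok=True)
    return figdir, cachedir


# Use a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    figdir, cachedir = make_model_dirs(os.path.join(tmp, "model5"))
    dirs_exist = os.path.isdir(figdir) and os.path.isdir(cachedir)

print(dirs_exist)
```

Because of `exist_ok=True`, calling the function a second time on the same workspace is harmless.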


In [3]:
nlmod.read.regis.get_layer_models??

### [2. Caching in nlmod](#TOC)<a name="cachingnlmod"></a>

In `nlmod` you can use the `get_combined_layer_models` function to obtain a layer model based on `regis`. The function takes some time to complete because the data is read from a server and projected on the desired model grid. Every time you run this function you have to wait for this process to finish, which results in long execution times and an unhealthy number of coffee breaks. This is why we use caching.

In [21]:
layer_model = nlmod.read.regis.get_combined_layer_models(extent=[95000.0, 105000.0, 494000.0, 500000.0],
                                                         delr=100., delc=100., use_geotop=False)

The `layer_model` variable is an `xarray.Dataset`. An `xarray.Dataset` can be read and written easily using the NetCDF file format. To speed up execution times we write the `layer_model` to a NetCDF file so the next time we want to get the `layer_model` we can read the cached NetCDF file instead of downloading a new file.

In [22]:
# write netcdf with layer model data
layer_model.to_netcdf(os.path.join(cachedir, 'combined_layer_ds.nc'))

In [23]:
# read netcdf with layer model data
layer_model = xr.open_dataset(os.path.join(cachedir, 'combined_layer_ds.nc'))

Reading and writing NetCDF files is the main principle behind caching in `nlmod`. Since we store a lot of data in `xarray.Dataset` objects, we've created a general function that caches the data, or reads the cached data if it is available. This function is called `nlmod.util.get_cache_netcdf` and can be wrapped around any function that returns an `xarray.Dataset`. The `get_cache_netcdf` function needs a few extra arguments for this:
- `use_cache`, to indicate if you want to use the cached file if it is available.
- `cachedir`, the directory that is used to cache the data.
- `cache_name`, the name of the .nc file of the cached data.
- `get_dataset_func`, the function that returns the `xarray.Dataset` that you want to cache.

In the cell below we wrap the cache function around the `get_combined_layer_models` function.

In [24]:
layer_model = nlmod.util.get_cache_netcdf(use_cache=True, cachedir=cachedir, 
                                          cache_name='combined_layer_ds.nc',
                                          get_dataset_func=nlmod.read.regis.get_combined_layer_models,
                                          verbose=True,
                                          extent=[95000.0, 105000.0, 494000.0, 500000.0],
                                          delr=100., delc=100.,
                                          use_geotop=False)

found cached combined_layer_ds.nc, loading cached dataset
delr of current grid is the same as cached grid
delc of current grid is the same as cached grid
extent of current grid is the same as cached grid


This call to `get_cache_netcdf` executes the following steps:
1. See if there is a NetCDF file with the name 'combined_layer_ds.nc' in the cache directory. If the file exists go to step 2, otherwise go to step 3.
2. Check if the cached dataset has the same properties as the desired dataset. In this case, that means the extent, delr and delc of the cached dataset must match those of the desired dataset. If so, return the cached dataset; otherwise go to step 3. More info on this in [chapter 3](#3).
3. Call the `get_combined_layer_models` function to obtain a new dataset. Save this dataset as 'combined_layer_ds.nc' in the cache directory and return the dataset.
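
The three steps can be sketched in a few lines of plain Python. This is a simplified illustration and not the `nlmod` implementation: the "dataset" is a plain dict stored as JSON instead of an `xarray.Dataset` stored as NetCDF, and `get_cached` and `build_layer_model` are hypothetical names.

```python
import json
import os
import tempfile


def get_cached(use_cache, cachedir, cache_name, get_dataset_func, **kwargs):
    """Return a cached result if it matches the request, else rebuild it.

    Hypothetical stand-in: 'datasets' are plain dicts stored as JSON.
    """
    fname = os.path.join(cachedir, cache_name)
    # Step 1: look for the cache file.
    if use_cache and os.path.exists(fname):
        with open(fname) as f:
            cached = json.load(f)
        # Step 2: compare the properties that define the grid.
        keys = ("extent", "delr", "delc")
        if all(cached.get(key) == kwargs.get(key) for key in keys):
            return cached
    # Step 3: build a new dataset, store it, and return it.
    dataset = get_dataset_func(**kwargs)
    with open(fname, "w") as f:
        json.dump(dataset, f)
    return dataset


calls = []  # records every call to the expensive function


def build_layer_model(extent, delr, delc):
    """Stand-in for a slow function, such as a download from a server."""
    calls.append(delc)
    return {"extent": extent, "delr": delr, "delc": delc}


with tempfile.TemporaryDirectory() as cachedir:
    args = dict(use_cache=True, cachedir=cachedir, cache_name="layers.json",
                get_dataset_func=build_layer_model,
                extent=[95000.0, 105000.0, 494000.0, 500000.0], delr=100.0)
    get_cached(delc=100.0, **args)  # no cache yet: builds and stores
    get_cached(delc=100.0, **args)  # same grid: read from the cache
    result = get_cached(delc=50.0, **args)  # different delc: rebuilt

print(calls)  # [100.0, 50.0]
```

The expensive function runs only twice for three requests: the second request is served from the cache, and the third replaces the cached file because `delc` changed.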

The `get_cache_netcdf` function has a lot of arguments that often have the same default values. It can also feel counter-intuitive to call many different functions through the `get_cache_netcdf` wrapper. Therefore we wrap the function call in another function. This wrapper function is typically defined in the same module, directly above the function it wraps. For the `get_combined_layer_models` function, the wrapper is `get_layer_models`.

In [25]:
# layer model
layer_model = nlmod.read.regis.get_layer_models(extent=[95000.0, 105000.0, 494000.0, 500000.0],
                                                cachedir=cachedir, use_cache=True,
                                                delr=100., delc=100., use_geotop=False)

layer_model
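
The convenience-function pattern used here can be illustrated with stand-ins. Everything in the sketch below is hypothetical (`cache_call`, `get_combined_layers` and `get_layers` are not `nlmod` functions), and the cache is a plain dict rather than a directory with NetCDF files. For brevity, this stand-in does not check whether the cached grid matches the request.

```python
def cache_call(use_cache, cache, cache_name, get_dataset_func, **kwargs):
    """Stand-in for a generic cache wrapper; `cache` is a dict."""
    if use_cache and cache_name in cache:
        return cache[cache_name]
    cache[cache_name] = get_dataset_func(**kwargs)
    return cache[cache_name]


def get_combined_layers(extent, delr=100.0, delc=100.0):
    """Stand-in for the uncached function that does the actual work."""
    return {"extent": list(extent), "delr": delr, "delc": delc}


def get_layers(extent, delr=100.0, delc=100.0, cache=None, use_cache=False,
               cache_name="combined_layers"):
    """Convenience function: same arguments plus cache settings with defaults."""
    if cache is None:
        # Without a cache there is nothing to look up or store.
        return get_combined_layers(extent, delr=delr, delc=delc)
    return cache_call(use_cache, cache, cache_name, get_combined_layers,
                      extent=extent, delr=delr, delc=delc)


cache = {}
first = get_layers([0.0, 1000.0, 0.0, 500.0], cache=cache, use_cache=True)
second = get_layers([0.0, 1000.0, 0.0, 500.0], cache=cache, use_cache=True)
print(second is first)  # True
```

The caller only sees one function with sensible defaults, while the generic wrapper stays reusable for other data sources.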

### [3. Checking the cache](#TOC)<a name="3"></a>
Using cached data comes with some issues. For example, when you modify the model extent, you can no longer use the cached data. Simply reading the cached data anyway would produce notoriously indecipherable errors. Therefore the `get_cache_netcdf` function performs some standard checks.

When calling the `get_cache_netcdf` function there are three optional arguments: `model_ds`, `check_grid` and `check_time`. The `model_ds` argument is used to obtain information about the desired grid and time discretisation. The `check_grid` and `check_time` arguments indicate whether to check if the grid and the time discretisation, respectively, of the cached dataset correspond to those of the desired dataset. If one of these checks fails, the cached data is not used and a new dataset is cached.

Below you can see what happens if we call the cache function from the previous chapter with a `delc` of 50 instead of 100. When we use `verbose=True` we can actually see the outcome of the checks and see that a new dataset is created because the cached data did not correspond to the desired grid.

Note that these checks are not a guarantee that the cached data will be read exactly as you would expect. There are some cases where it is still difficult to know if the cached data can be used for the current model.

In [27]:
# layer model
layer_model = nlmod.read.regis.get_layer_models(extent=[95000.0, 105000.0, 494000.0, 500000.0],
                                                delr=100., delc=50., use_geotop=False,
                                                use_cache=True, fname_netcdf='combined_layer_ds.nc',
                                                cachedir=cachedir, verbose=True)
layer_model

found cached combined_layer_ds.nc, loading cached dataset
delr of current grid is the same as cached grid
delc of current grid is the same as cached grid
extent of current grid is the same as cached grid
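
To make the idea of the grid and time checks more concrete, here is a minimal sketch of what such a comparison could look like. It is not the `nlmod` implementation: `cache_is_valid` is a hypothetical function, the metadata are plain dicts, and the key names `nper` and `start_time` are assumptions chosen for illustration.

```python
import math


def cache_is_valid(cached, desired, check_grid=True, check_time=True):
    """Return True if the cached metadata matches the desired metadata.

    Hypothetical sketch: both arguments are plain dicts, and the key
    names are assumptions made for illustration.
    """
    if check_grid:
        # Compare floats with a tolerance instead of exact equality.
        for key in ("delr", "delc"):
            if not math.isclose(cached[key], desired[key]):
                return False
        same_extent = len(cached["extent"]) == len(desired["extent"]) and all(
            math.isclose(a, b)
            for a, b in zip(cached["extent"], desired["extent"])
        )
        if not same_extent:
            return False
    if check_time:
        for key in ("nper", "start_time"):
            if cached[key] != desired[key]:
                return False
    return True


cached = {"extent": [95000.0, 105000.0, 494000.0, 500000.0],
          "delr": 100.0, "delc": 100.0, "nper": 10, "start_time": "2015-01-01"}
desired = dict(cached, delc=50.0)  # same model, but with a finer delc

print(cache_is_valid(cached, cached))                     # True
print(cache_is_valid(cached, desired))                    # False
print(cache_is_valid(cached, desired, check_grid=False))  # True
```

The last call shows why the checks are no guarantee: with the grid check switched off, a cached dataset on a different grid is accepted.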


### [4. Discussion](#TOC)<a name="4"></a>

Caching in its current form has some considerable limitations:
- Every cached dataset requires two functions: the original function that obtains an xarray Dataset and the wrapper function that handles the caching. Maintaining two nearly identical functions is confusing and error-prone.
- If you wrap `get_cache_netcdf` around a function that in turn calls `get_cache_netcdf`, you get unexpected results, because `get_cache_netcdf` does not pass all parameters on to the function it wraps.