<img SRC="https://avatars2.githubusercontent.com/u/31697400?s=400&u=a5a6fc31ec93c07853dd53835936fd90c44f7483&v=4" WIDTH=125 ALIGN="right">

# Caching

*O.N. Ebbens, Artesia, 2021*

Groundwater flow models are often data-intensive. Execution times can be shortened significantly by caching data. This notebook explains how caching is implemented in `nlmod`. The first three chapters explain how to use the caching in `nlmod`. The last chapter contains more technical details on the implementation and limitations of caching in `nlmod`.

### Contents<a name="TOC"></a>
1. [Cache directory](#cachedir)
2. [Caching in nlmod](#cachingnlmod)
3. [Checking the cache](#3)
4. [Technical information](#4)

In [1]:
import matplotlib.pyplot as plt
import flopy
import os
import geopandas as gpd
import xarray as xr
import logging

import nlmod

# show information when calling functions
logging.basicConfig(level=logging.INFO)
print(f'nlmod version: {nlmod.__version__}')

nlmod version: 0.1.1b


### [1. Cache directory](#TOC)<a name="cachedir"></a>

When you create a model you usually start by assigning a model workspace. This is a directory where model data is stored. The `nlmod.util.get_model_dirs()` function can be used to create a file structure in two steps:
1. The model workspace directory is created if it does not exist yet. 
2. Two subdirectories are created: 'figure' and 'cache'. 

Calling the function below creates the `figdir` and `cachedir` variables, which hold the paths of the two subdirectories. In this notebook we will use this `cachedir` to write and read cached data. You can also define your own cache directory.

In [2]:
model_ws = 'model5'

# Model directories
figdir, cachedir = nlmod.util.get_model_dirs(model_ws)

print(model_ws)
print(figdir)
print(cachedir)

model5
model5\figure
model5\cache
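
If you prefer to manage your own cache directory, the two steps described above amount to roughly the following. This is a minimal sketch using only the standard library; `make_model_dirs` is a hypothetical name used for illustration and is not the `nlmod` implementation.

```python
import os
import tempfile


def make_model_dirs(model_ws):
    """Hypothetical helper: create a workspace with 'figure' and 'cache' subdirectories."""
    figdir = os.path.join(model_ws, "figure")
    cachedir = os.path.join(model_ws, "cache")
    # makedirs also creates model_ws itself; exist_ok makes repeated calls safe
    os.makedirs(figdir, exist_ok=True)
    os.makedirs(cachedir, exist_ok=True)
    return figdir, cachedir


# use a temporary location so the example does not touch a real workspace
demo_ws = os.path.join(tempfile.mkdtemp(), "model5")
demo_figdir, demo_cachedir = make_model_dirs(demo_ws)
```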


### [2. Caching in nlmod](#TOC)<a name="cachingnlmod"></a>

In `nlmod` you can use the `get_combined_layer_models` function to obtain a layer model based on `regis`.

In [5]:
layer_model = nlmod.read.regis.get_combined_layer_models(extent=[95000.0, 105000.0, 494000.0, 500000.0],
                                                         delr=100., delc=100., use_geotop=False)

INFO:nlmod.read.regis:redefining current extent: [95000.0, 105000.0, 494000.0, 500000.0], fit to regis raster
INFO:nlmod.read.regis:new extent is [95000.0, 105000.0, 494000.0, 500000.0] model has 60 rows and 100 columns
INFO:nlmod.read.regis:resample regis data to structured modelgrid
INFO:nlmod.read.regis:find active layers in raw layer model
INFO:nlmod.read.regis:there are 40 active layers within the extent
INFO:nlmod.read.regis:removing 40 nan layers from the model


As you may notice, this function takes some time to complete because the data is downloaded and projected onto the desired model grid. Every time you run this function you have to wait for the process to finish, which results in an unhealthy number of coffee breaks. This is why we use caching. To store our cache we use netCDF files. The `layer_model` variable is an `xarray.Dataset`. You can write an `xarray.Dataset` to a netCDF file, and read it back, using the code below.

In [6]:
# write netcdf with layer model data
layer_model.to_netcdf(os.path.join(cachedir, 'layer_test.nc'))

In [7]:
# read netcdf with layer model data
layer_model = xr.open_dataset(os.path.join(cachedir, 'layer_test.nc'))

Reading and writing netCDF files is the main principle behind caching in `nlmod`. We write the `layer_model` to a netCDF file when we call the `get_combined_layer_models` function for the first time. The next time we call the function we can read the cached netCDF file instead. This reduces execution time significantly. You can use this caching ability by specifying a `cachedir` and a `cachename` in the function call.

In [8]:
layer_model = nlmod.read.regis.get_combined_layer_models(extent=[95000.0, 105000.0, 494000.0, 500000.0],
                                                         delr=100., delc=100., use_geotop=False,
                                                         cachedir=cachedir, 
                                                         cachename='combined_layer_ds.nc')

INFO:nlmod.read.regis:redefining current extent: [95000.0, 105000.0, 494000.0, 500000.0], fit to regis raster
INFO:nlmod.read.regis:new extent is [95000.0, 105000.0, 494000.0, 500000.0] model has 60 rows and 100 columns
INFO:nlmod.read.regis:resample regis data to structured modelgrid
INFO:nlmod.read.regis:find active layers in raw layer model
INFO:nlmod.read.regis:there are 40 active layers within the extent
INFO:nlmod.read.regis:removing 40 nan layers from the model
INFO:nlmod.cache:caching data -> combined_layer_ds.nc


### caching steps<a name="steps"></a>

This type of caching is applied to a number of functions in nlmod that have an xarray dataset as output. When you call these functions using the `cachedir` and `cachename` arguments these steps are taken:
1. See if there is a netCDF file with the specified cachename in the specified cache directory. If the file exists go to step 2, otherwise go to step 3.
2. Read the netCDF file and return it as an xarray dataset if both of the following conditions hold (otherwise go to step 3):
    1. The cached dataset was created using the same function arguments as the current function call. 
    2. The module where the function is defined has not been changed after the cache was created.
3. Run the function to obtain an xarray dataset. Save this dataset as a netCDF file, using the specified cachename and cache directory, for next time. Also return the dataset.
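
The steps above can be sketched as a small decorator. This is an illustration only, not the `nlmod` code: it stores results with `pickle` instead of netCDF so that it runs with the standard library alone, it omits the module check of step 2B, and the names `cache_to_disk` and `slow_square` are hypothetical.

```python
import functools
import os
import pickle
import tempfile


def cache_to_disk(func):
    """Illustrative disk cache with an argument check (pickle stands in for netCDF)."""

    @functools.wraps(func)
    def wrapper(*args, cachedir=None, cachename=None, **kwargs):
        # without both cache arguments, simply run the function
        if cachedir is None or cachename is None:
            return func(*args, **kwargs)

        fname = os.path.join(cachedir, cachename)
        fname_args = fname + ".args"
        current_args = {"args": args, "kwargs": kwargs}

        # steps 1 and 2: use the cache only if it exists and the arguments match
        if os.path.exists(fname) and os.path.exists(fname_args):
            with open(fname_args, "rb") as f:
                cached_args = pickle.load(f)
            if cached_args == current_args:
                with open(fname, "rb") as f:
                    return pickle.load(f)

        # step 3: run the function, then store the result and its arguments
        result = func(*args, **kwargs)
        with open(fname, "wb") as f:
            pickle.dump(result, f)
        with open(fname_args, "wb") as f:
            pickle.dump(current_args, f)
        return result

    return wrapper


calls = []  # records every time the function body really runs


@cache_to_disk
def slow_square(x):
    calls.append(x)
    return x * x


demo_dir = tempfile.mkdtemp()
first = slow_square(3, cachedir=demo_dir, cachename="square.pkl")   # runs the function
second = slow_square(3, cachedir=demo_dir, cachename="square.pkl")  # read from the cache
third = slow_square(4, cachedir=demo_dir, cachename="square.pkl")   # argument changed, runs again
```

Because the cache file name is fixed, the third call overwrites the cached result of the first call rather than adding a second file.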

### caching functions

The following functions use the caching as described above:
- nlmod.read.regis.get_combined_layer_models
- nlmod.read.rws.surface_water_to_model_dataset
- nlmod.read.knmi.add_knmi_to_model_dataset
- nlmod.read.jarkus.find_sea_cells
- nlmod.read.jarkus.bathymetry_to_model_dataset
- nlmod.read.geotop.get_geotop_dataset
- nlmod.read.ahn.get_ahn_at_grid

### [3. Checking the cache](#TOC)<a name="3"></a>
One of the steps in the caching process ([step 2A](#steps)) is to check if the cache was created using the same function arguments as the current function call. This check has some limitations:
- Only function arguments with certain types are checked. These types include: int, float, bool, str, bytes, list, tuple, dict, numpy.ndarray, xarray.DataArray and xarray.Dataset. If a function argument has a different type the cache is never used. More types can be added to the checks over time.
- If one of the function arguments is an xarray Dataset the check is somewhat different. For a dataset we only check whether it has the same dimensions and coordinates as the cached netCDF file. There is no check on whether the variables in the dataset are identical.
- It is not possible to cache function output when more than one function argument is an xarray Dataset. This is due to the difference in checking datasets. If more than one xarray Dataset is given, the cache decorator raises a TypeError.
- If one of the function arguments is a filepath of type str we only check if the cached filepath is the same as the current filepath. We do not check if any changes were made to the file after the cache was created.
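
The first and last limitations can be illustrated with a simplified argument check. The sketch below is an assumption about the general shape of such a check, not the `nlmod` code: `arguments_match` and `SUPPORTED_TYPES` are hypothetical names, and the numpy and xarray types are left out to keep the example free of dependencies.

```python
# simplified list of supported types (numpy and xarray types omitted here)
SUPPORTED_TYPES = (int, float, bool, str, bytes, list, tuple, dict)


def arguments_match(cached_kwargs, current_kwargs):
    """Return True only if all arguments have a supported type and are unchanged."""
    if cached_kwargs.keys() != current_kwargs.keys():
        return False
    for name, value in current_kwargs.items():
        if not isinstance(value, SUPPORTED_TYPES):
            return False  # unsupported type: the cache is never used
        if cached_kwargs[name] != value:
            return False  # argument changed since the cache was created
    return True


extent = [95000.0, 105000.0, 494000.0, 500000.0]
cached = {"extent": extent, "delr": 100.0}

same = arguments_match(cached, {"extent": list(extent), "delr": 100.0})
changed = arguments_match(cached, {"extent": list(extent), "delr": 50.0})
# a set is not in the list of supported types, so the check fails even for equal values
unsupported = arguments_match({"ids": {1, 2}}, {"ids": {1, 2}})
# only the path string is compared; the file contents are never inspected
path_only = arguments_match({"fname": "data/input.shp"}, {"fname": "data/input.shp"})
```

Here `same` and `path_only` are `True`, while `changed` and `unsupported` are `False`. The `path_only` case passes even if the file on disk was edited after the cache was created.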

You can test how the caching works in different situations by running the function below a few times with different function arguments. The logs tell you whether the cache was used.

In [10]:
# layer model
layer_model = nlmod.read.regis.get_combined_layer_models(extent=[95000.0, 105000.0, 494000.0, 500000.0],
                                                         delr=50., delc=100., use_geotop=False,
                                                         cachename='combined_layer_ds.nc',
                                                         cachedir=cachedir)
layer_model

INFO:nlmod.cache:using cached data -> combined_layer_ds.nc


### clearing the cache

Sometimes you want to get rid of all the cached files to free disk space or to support your minimalistic lifestyle. You can use the `clear_cache` function to clear all cached files in a specific cache directory.

In [4]:
#nlmod.cache.clear_cache(cachedir)

this will remove all cached files in {cachedir} are you sure [Y/N] y


### [4. Technical information](#TOC)<a name="4"></a>

In nlmod we use a specific caching method called [memoization](https://en.wikipedia.org/wiki/Memoization). The memoization is implemented in the `nlmod` caching module. The `cache_netcdf` decorator function handles most of the magic for caching netcdf files. When the cache is created all function arguments are stored in a dictionary and saved (pickled) as a .pklz file. The check on function arguments (step 2A) is done by reading the pickle and comparing the output with the arguments of the current function call. 

Limitations:
- All function arguments are pickled and saved together with the netCDF file. If the function arguments use a lot of memory this process can become slow. Take this into account when you decide to use the cache decorator.
- Function arguments that cannot be pickled result in an error in the caching process.
- If one of the function arguments is an xarray Dataset we only check if the dataset has the same dimensions and coordinates as the cached netCDF file. There is no check on the variables (DataArrays) in the dataset because checking all of them would take too much time. Also, most of the time it is not necessary to check the variables, as they are not used to create the cached file. There is one example where a variable from the dataset is used to create the cached file: `nlmod.read.jarkus.bathymetry_to_model_dataset` uses the 'Northsea' DataArray to create a bathymetry dataset. If that function accessed the DataArray through `model_ds['Northsea']`, there would be no check that the 'Northsea' DataArray used to create the cache is the same as the one in the current function call. The current solution is to make the 'Northsea' DataArray a separate function argument of `bathymetry_to_model_dataset`. This also makes it clearer which data is used in the function.
- There is a check to see if the module where the function is defined has been changed since the cache was created. This prevents the cache from being used after changes are made to the function. However, when the function uses functions from other modules, those other modules are not checked.
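
One way a module check like the one in the last limitation could work is by comparing file modification times. The sketch below is an assumption for illustration, not necessarily how `nlmod` does it; `module_changed_since` is a hypothetical name, and `json.dumps` merely stands in for a cached function.

```python
import inspect
import json
import os
import tempfile


def module_changed_since(func, cache_fname):
    """Return True if the file defining `func` is newer than the cache file."""
    module_file = inspect.getsourcefile(func)
    return os.path.getmtime(module_file) > os.path.getmtime(cache_fname)


# an empty temporary file plays the role of a freshly written cache
with tempfile.NamedTemporaryFile(suffix=".nc", delete=False) as f:
    demo_cache = f.name

fresh = module_changed_since(json.dumps, demo_cache)

# set the cache timestamp far into the past to mimic an outdated cache
os.utime(demo_cache, (0, 0))
stale = module_changed_since(json.dumps, demo_cache)
```

Only the file that defines `func` is inspected, which is why changes in other modules used by the function go unnoticed.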

#### storing cache on disk

Many memoization methods use a hash of the function arguments as the filename, thus creating a separate file for each distinct function call. The memoization in `nlmod` uses a user-defined filename (`cachename`) to store the cache. This reduces the number of files and therefore the disk space used.
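
For comparison, the hash-based alternative could look like the sketch below. It is not part of `nlmod`; `hashed_cachename` is a hypothetical helper that shows why this approach produces one file per distinct set of arguments.

```python
import hashlib


def hashed_cachename(func_name, kwargs):
    """Derive a file name from the function name and its keyword arguments."""
    key = repr((func_name, sorted(kwargs.items())))
    return hashlib.sha1(key.encode("utf-8")).hexdigest() + ".nc"


name_100 = hashed_cachename("get_combined_layer_models", {"delr": 100.0, "delc": 100.0})
name_50 = hashed_cachename("get_combined_layer_models", {"delr": 50.0, "delc": 100.0})
```

The two calls give different file names, so both results would be kept on disk. With a fixed `cachename` the second call would overwrite the first.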