# Using [Intake-ESM's](https://intake-esm.readthedocs.io/en/latest/) New Derived Variable Functionality

Last week, [Anderson Banihirwe](https://github.com/andersy005) added a new feature to intake-esm, enabling users to add "derived variables" to catalogs! This is an exciting addition, and although this has not been included in a release yet, you are welcome to test out the functionality!

## What is a "Derived Variable"?
A derived variable, in this case, is a variable computed from other variables within a data catalog. The basic assumption is that the input variables can be combined **in the same dataset** (this ***could be extended*** to work with variables from multiple datasets). This means that the variables need to be **on the same grid with the same dimensionality**.

An example of a derived variable could be temperature in degrees Fahrenheit. Climate models often write temperature in Celsius or Kelvin, but the user may want degrees Fahrenheit!

A typical workflow for this consists of the following steps:
* Load the data within the catalog
* Apply some function to the loaded datasets
* Plot the output
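
As a concrete illustration of the first two steps done by hand, here is a minimal sketch of the Fahrenheit example. The `loaded` dictionary is a hypothetical stand-in for datasets read from a catalog, and the names `kelvin_to_fahrenheit`, `model_a`, and `model_b` are made up for illustration.

```python
def kelvin_to_fahrenheit(temp_k):
    """Convert a temperature from Kelvin to degrees Fahrenheit."""
    return (temp_k - 273.15) * 9 / 5 + 32


# Step 1: "load" the data (hypothetical stand-in for datasets from a catalog)
loaded = {'model_a': [273.15, 300.0], 'model_b': [250.0]}

# Step 2: apply the function to every loaded dataset
converted = {
    name: [kelvin_to_fahrenheit(t) for t in temps]
    for name, temps in loaded.items()
}

print(converted['model_a'][0])  # freezing point of water: 32.0
```

Every user who wants Fahrenheit has to repeat the second step themselves, which is the duplication a shared set of variable definitions would avoid.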

But what if we could couple those first two steps? What if we could have a set of **variable definitions**, each consisting of the variable's requirements, such as its `dependent variables`, and a function that derives the quantity? This is what the `derived_variable` functionality offers in `intake-esm`! It also enables users to share a "registry" of derived variables across catalogs!
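
To make the idea of a registry concrete, here is a simplified, pure-Python sketch of how variable definitions could be stored and applied. `MiniRegistry` and `MiniDerivedVariable` are hypothetical names invented for this illustration. This is not intake-esm's implementation, only the general pattern of pairing a function with the variables it depends on.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MiniDerivedVariable:
    """One variable definition: a function plus the variables it needs."""
    func: Callable
    variable: str
    dependent_variables: List[str]


class MiniRegistry:
    """Hypothetical, simplified registry (not intake-esm's implementation)."""

    def __init__(self):
        self._registry: Dict[str, MiniDerivedVariable] = {}

    def register(self, variable, dependent_variables):
        """Return a decorator that stores the decorated function."""
        def decorator(func):
            self._registry[variable] = MiniDerivedVariable(
                func, variable, dependent_variables
            )
            return func
        return decorator

    def apply(self, data):
        """Add each derived variable whose dependencies are all present."""
        for dv in self._registry.values():
            if all(name in data for name in dv.dependent_variables):
                data = dv.func(data)
        return data


registry = MiniRegistry()


@registry.register(variable='temp_f', dependent_variables=['temp_k'])
def calc_temp_f(data):
    data = dict(data)  # copy so the input is left untouched
    data['temp_f'] = [(t - 273.15) * 9 / 5 + 32 for t in data['temp_k']]
    return data


out = registry.apply({'temp_k': [273.15]})
print(out['temp_f'])  # [32.0]
```

Because the definitions live in one object, the same registry could in principle be reused with any data that provides the dependent variables.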

Let's get started with an example!

## Installing the Development Version of Intake-ESM
Since this has not yet been released in a version of Intake-ESM, you can install the development version using the following:

```bash
pip install git+https://github.com/intake/intake-esm.git
```

## Imports

```python
import holoviews as hv
import hvplot
import hvplot.xarray
import intake
import numpy as np
from distributed import Client
from intake_esm.derived import DerivedVariableRegistry
from ncar_jobqueue import NCARCluster

hv.extension('bokeh')
```

## Spin up a Dask Cluster

```python
cluster = NCARCluster()
cluster.scale(10)
client = Client(cluster)
client
```



## How to add a Derived Variable
Let's compute a derived variable: wind speed! It can be derived from the zonal (`U`) and meridional (`V`) components of the wind.

```python
def calc_wind_speed(u, v):
    return np.sqrt(u ** 2 + v ** 2)
```
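
As a quick sanity check, the function can be tried on a few made-up values before pointing it at real model output. The arrays below are illustrative only.

```python
import numpy as np


def calc_wind_speed(u, v):
    """Wind speed from the zonal (u) and meridional (v) components."""
    return np.sqrt(u ** 2 + v ** 2)


# Made-up wind components in m/s
u = np.array([3.0, 0.0, -6.0])
v = np.array([4.0, 2.0, 8.0])

speed = calc_wind_speed(u, v)
print(speed)  # [ 5.  2. 10.]
```

The sign of each component does not matter, since both are squared before summing.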

### Creating our Derived Variable Registry
We need to instantiate our derived variable registry, which will store our derived variable information! We use the variable `dvr` for this (**D**erived**V**ariable**R**egistry).

```python
dvr = DerivedVariableRegistry()
```

To register this derived variable, we add a [decorator](https://www.python.org/dev/peps/pep-0318/) to our function, as seen below. The decorator defines the name of the derived variable and its dependent variables, and ties them to the function that performs the calculation. Note that the function now takes a dataset, adds the new variable to it, and returns the dataset.

```python
@dvr.register(variable='wind_speed', dependent_variables=['U', 'V'])
def calc_wind_speed(ds):
    ds['wind_speed'] = np.sqrt(ds.U ** 2 + ds.V ** 2)
    return ds
```
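
Before relying on the registry, it can help to try the same function body on a tiny synthetic dataset. The sketch below assumes `xarray` is installed, leaves the decorator off so it runs without intake-esm, and uses made-up values for `U` and `V`.

```python
import numpy as np
import xarray as xr


def calc_wind_speed(ds):
    """Same body as the registered function, without the decorator."""
    ds['wind_speed'] = np.sqrt(ds.U ** 2 + ds.V ** 2)
    return ds


# Synthetic dataset with two points along a made-up dimension 'x'
synthetic = xr.Dataset(
    {
        'U': ('x', [3.0, 0.0]),
        'V': ('x', [4.0, 2.0]),
    }
)

result = calc_wind_speed(synthetic)
print(result['wind_speed'].values)  # [5. 2.]
```

The function only assumes that `U` and `V` exist in the dataset and share the same dimensions, which matches the "same grid, same dimensionality" requirement described earlier.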

You'll notice `dvr` now has a registered variable, `wind_speed`, which was defined in the cell above!

```python
dvr
```

DerivedVariableRegistry({'wind_speed': DerivedVariable(func=<function calc_wind_speed at 0x2b84d46c84c0>, variable='wind_speed', dependent_variables=['U', 'V'])})

## Read in Data with our Registry
In this case, we will use data from the CESM Large Ensemble. We pass our derived variable registry to the catalog using the `registry` argument.

```python
data_catalog = intake.open_esm_datastore(
    'https://raw.githubusercontent.com/NCAR/cesm-lens-aws/master/intake-catalogs/aws-cesm1-le.json',
    registry=dvr,
)
```

You'll notice we have a new field, `derived_variable`, which has 1 unique value.

```python
data_catalog
```

| | unique |
| --- | --- |
| variable | 78 |
| long_name | 75 |
| component | 5 |
| experiment | 4 |
| frequency | 6 |
| vertical_levels | 3 |
| spatial_domain | 5 |
| units | 25 |
| start_time | 12 |
| end_time | 13 |


Let's also subset for monthly frequency, as well as the 20th century (20C) and RCP 8.5 (RCP85) experiments.

```python
catalog_subset = data_catalog.search(
    variable=['wind_speed'], frequency='monthly', experiment=['20C', 'RCP85']
)

catalog_subset
```

| | unique |
| --- | --- |
| variable | 2 |
| long_name | 2 |
| component | 1 |
| experiment | 2 |
| frequency | 1 |
| vertical_levels | 1 |
| spatial_domain | 1 |
| units | 1 |
| start_time | 2 |
| end_time | 2 |
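
The summary above lists two unique variables even though we searched for only one. A plausible explanation is that the query for the derived variable is expanded into its dependent variables (`U` and `V`). The sketch below illustrates that idea with a hypothetical `expand_variables` helper; it is not intake-esm's actual search code.

```python
# Hypothetical mapping from a derived variable to its dependent variables
derived = {'wind_speed': ['U', 'V']}


def expand_variables(requested, derived):
    """Replace each derived variable name with the variables it depends on."""
    expanded = []
    for name in requested:
        # Names that are not derived variables are passed through unchanged
        for dep in derived.get(name, [name]):
            if dep not in expanded:
                expanded.append(dep)
    return expanded


print(expand_variables(['wind_speed'], derived))  # ['U', 'V']
```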


### Calling `to_dataset_dict` to Load in the Data
We load in the data, which lazily adds our calculation for `wind_speed` to the datasets!

```python
dsets = catalog_subset.to_dataset_dict(
    xarray_open_kwargs={'backend_kwargs': {'storage_options': {'anon': True}}}
)
```


--> The keys in the returned dictionary of datasets are constructed as follows:
	'component.experiment.frequency'
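
The key format above means each dataset can be looked up by joining its attributes with a period. A small sketch, where `build_key` and `parse_key` are hypothetical helpers written for this post rather than part of intake-esm:

```python
FIELDS = ('component', 'experiment', 'frequency')


def build_key(attrs, fields=FIELDS, sep='.'):
    """Join the attribute values in the order used by the dictionary keys."""
    return sep.join(attrs[field] for field in fields)


def parse_key(key, fields=FIELDS, sep='.'):
    """Split a key back into its named parts."""
    return dict(zip(fields, key.split(sep)))


key = build_key({'component': 'atm', 'experiment': 'RCP85', 'frequency': 'monthly'})
print(key)  # atm.RCP85.monthly
print(parse_key(key)['experiment'])  # RCP85
```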


```python
ds = dsets['atm.RCP85.monthly']
ds
```

Dataset summary (dask-backed arrays, in display order):

| Array | Bytes | Chunk bytes | Shape | Chunk shape | Count | Type |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 17.81 kiB | 17.81 kiB | (1140, 2) | (1140, 2) | 5 tasks, 1 chunk | object (numpy.ndarray) |
| 2 | 281.80 GiB | 113.91 MiB | (40, 1140, 30, 192, 288) | (1, 18, 30, 192, 288) | 2561 tasks, 2560 chunks | float32 (numpy.ndarray) |
| 3 | 281.80 GiB | 113.91 MiB | (40, 1140, 30, 192, 288) | (1, 18, 30, 192, 288) | 2561 tasks, 2560 chunks | float32 (numpy.ndarray) |
| 4 | 281.80 GiB | 113.91 MiB | (40, 1140, 30, 192, 288) | (1, 18, 30, 192, 288) | 15362 tasks, 2560 chunks | float32 (numpy.ndarray) |
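
The sizes reported in the array summary can be checked by hand. The sketch below recomputes them from the shape and chunk shape shown above, assuming 4 bytes per `float32` value.

```python
import math

shape = (40, 1140, 30, 192, 288)
chunks = (1, 18, 30, 192, 288)
itemsize = 4  # bytes per float32 value

total_bytes = math.prod(shape) * itemsize
chunk_bytes = math.prod(chunks) * itemsize

# Number of chunks along each dimension, rounding up for a partial last chunk
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

print(f'{total_bytes / 2**30:.2f} GiB')  # 281.80 GiB
print(f'{chunk_bytes / 2**20:.2f} MiB')  # 113.91 MiB
print(n_chunks)  # 2560
```

The chunk count comes entirely from the first two dimensions: 40 chunks along the first and 64 (1140 / 18, rounded up) along the second.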


## Apply an Annual Mean Calculation
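Let's apply an annual average to the data. Every year in this dataset is 365 days long, so no leap-year handling is needed. For simplicity, we take an unweighted mean of the monthly values here; a strict annual mean would weight each month by its number of days.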

```python
annual_mean = ds.groupby('time.year').mean('time')
```
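
Months differ in length, so a day-weighted annual mean can differ slightly from a plain average of the monthly values. The NumPy sketch below compares the two on made-up monthly data, assuming a calendar without leap years (365-day years); the `monthly` values are purely illustrative.

```python
import numpy as np

# Days per month in a calendar without leap years (sums to 365)
days_in_month = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])

# Made-up monthly means for two years (24 values)
monthly = np.arange(24, dtype=float)

# One row per year, one column per month
by_year = monthly.reshape(-1, 12)

unweighted = by_year.mean(axis=1)
weighted = np.average(by_year, axis=1, weights=days_in_month)

print(unweighted)  # [ 5.5 17.5]
print(weighted - unweighted)  # small positive offset for this data
```

For smoothly varying fields the difference is usually small, which is why the unweighted mean is a common shortcut for quick looks at the data.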

## Plot the Output
We use `hvPlot` here to plot a global map, selecting only the bottom 8 levels of the atmosphere.

```python
annual_mean.isel(lev=range(-8, 0)).wind_speed.hvplot.quadmesh(
    x='lon',
    y='lat',
    geo=True,
    rasterize=True,
    cmap='viridis',
    coastline=True,
    title='Wind Speed (m/s)',
)
```