# cf_xarray : Scale your analysis across datasets with less data wrangling and more metadata handling

_Deepak Cherian, Mattia Almansi, Pascal Bourgault_

There has been an explosion in the availability of terabyte to petabyte-scale
geoscience datasets, particularly on the cloud, prompting the development of
scalable tools and workflows to handle such big datasets by Earthcube projects
such as Pangeo. There is a parallel need for tools that enable the analysis of
datasets from a wide variety of sources that each have their own nomenclature.

Xarray is a Python package that enables easy and convenient labelled data
analytics by allowing users to leverage metadata such as dimension names and
coordinate labels. cf_xarray is an open-source, Apache-licensed Xarray extension
that decodes Climate and Forecast (CF) Metadata conventions adopted by the
geoscience community, allowing users to extensively use standardized metadata
such as “standard names” in their analysis pipelines. For example, the zonal
average of an Xarray dataset `ds` is seamlessly calculated as
`ds.cf.mean("longitude")` on a wide variety of CF-compliant datasets, regardless
of the actual name of the “longitude” variable (e.g. “lon”, “lon_rho”, “long”).
cf_xarray also provides tools and heuristics to optionally guess absent
attributes, allowing usage on incompletely tagged datasets. cf_xarray is now
seeing adoption in other packages such as xESMF, a package for regridding of
Xarray datasets; and NOAA’s Model Diagnostic Task Force (MDTF) diagnostic
workflow for validating model simulations.
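
The name-independent lookup described above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not cf_xarray's implementation: the dictionary layout, the function `mean_over`, and all names and values are assumptions made for this example, and the helper only handles 2-D fields.

```python
from statistics import mean

# Two toy "datasets" holding the same 2-D field under different names.
# All names and values here are made up for illustration.
dataset_a = {
    "coords": {
        "lat": {"standard_name": "latitude"},
        "lon": {"standard_name": "longitude"},
    },
    "dims": ("lat", "lon"),
    "data": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
}
dataset_b = {
    "coords": {
        "lon_rho": {"standard_name": "longitude"},
        "lat_rho": {"standard_name": "latitude"},
    },
    "dims": ("lon_rho", "lat_rho"),
    "data": [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]],
}


def mean_over(dataset, standard_name):
    """Average a 2-D field over the dimension tagged with `standard_name`."""
    # Resolve the CF key to the dataset's own dimension name.
    matches = [
        name
        for name, attrs in dataset["coords"].items()
        if attrs.get("standard_name") == standard_name
    ]
    if len(matches) != 1:
        raise KeyError(f"expected one match for {standard_name!r}, got {matches}")
    axis = dataset["dims"].index(matches[0])
    rows = dataset["data"]
    # Averaging over axis 1 collapses each row; over axis 0, each column.
    if axis == 1:
        return [mean(row) for row in rows]
    return [mean(column) for column in zip(*rows)]


zonal_a = mean_over(dataset_a, "longitude")
zonal_b = mean_over(dataset_b, "longitude")
print(zonal_a, zonal_b)
```

The same call works on both toy datasets even though one names its longitude dimension `lon` and the other `lon_rho`, and even though the dimension order differs.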

Our notebook will demonstrate the use of cf_xarray to build an analysis pipeline
that works on a wide variety of cloud-available datasets, such as the CMIP6
archive, the CESM Large Ensemble, and various satellite datasets, and that uses
xESMF to regrid these datasets to a common grid to facilitate analysis of
anomalies.


## Imports


In [None]:
import cf_xarray

import xarray as xr
import intake
import dask

import matplotlib.pyplot as plt

dask.config.set(**{"array.slicing.split_large_chunks": False})

## Open example datasets

The following code creates the two example datasets (MOM6 and CMIP6) used in
this notebook.


In [None]:
def assign_coordinates_and_cell_measures(ds):

    # Some CF metadata is missing in the example dataset.
    # CF-compliant datasets do not need this step.
    # Furthermore, functions to automatically assign missing coordinates
    # and measures metadata will be implemented in cf_xarray:
    # https://github.com/xarray-contrib/cf-xarray/issues/201

    for varname, variable in ds.data_vars.items():

        # Add coordinates attribute
        coordinates = []
        for coord in sum(ds.cf.coordinates.values(), []):
            if set(ds[coord].dims) <= set(variable.dims):
                coordinates.append(coord)
        if coordinates:
            variable.attrs["coordinates"] = " ".join(coordinates)
        else:
            variable.attrs.pop("coordinates", None)

        # Add cell_measures attribute
        cell_measures = {}
        for stdname in {"cell_thickness", "cell_area", "ocean_volume"} & set(
            ds.cf.standard_names
        ):
            key = stdname.split("_")[-1]
            value = ds.cf.standard_names[stdname]
            for measure in value:
                if (
                    set(ds[measure].dims) <= set(variable.dims)
                    and measure != varname
                ):
                    cell_measures[key] = measure
        if cell_measures:
            variable.attrs["cell_measures"] = " ".join(
                [f"{k}: {v}" for k, v in cell_measures.items()]
            )
        else:
            variable.attrs.pop("cell_measures", None)


# MOM6
# Open grid and variables, then merge
grid = xr.open_dataset("data/ocean_grid_sym_OM4_05.nc")
ds = xr.open_dataset(
    "http://35.188.34.63:8080/thredds/dodsC/OM4p5/ocean_monthly_z.200301-200712.nc4",
    chunks={"time": 1, "z_l": 5},
)
mom6_ds = xr.merge([grid, ds], compat="override")

# Illustrate the equivalent of a curvilinear grid case,
# where axes and coordinates are different
axes = ["xh", "xq", "yh", "yq"]
mom6_ds = mom6_ds.drop_vars(axes)
mom6_ds = mom6_ds.assign_coords({axis: mom6_ds[axis] for axis in axes})
mom6_ds = mom6_ds.set_coords(
    [
        var
        for var in mom6_ds.variables
        for prefix in ["geo"]
        if var.startswith(prefix)
    ]
)
assign_coordinates_and_cell_measures(mom6_ds)

# CMIP6
col = intake.open_esm_datastore(
    "https://storage.googleapis.com/cmip6/pangeo-cmip6.json"
)
cat = col.search(
    table_id="Omon",
    grid_label="gn",
    source_id="GFDL-CM4",
    experiment_id=["historical"],
    variable_id=["thetao", ".*cello"],
)
ddict = cat.to_dataset_dict(
    zarr_kwargs={
        "consolidated": True,
        "decode_times": True,
        "use_cftime": True,
    }
)
_, cmip6_ds = ddict.popitem()
assign_coordinates_and_cell_measures(cmip6_ds)

# `cf_xarray` features:


## 1. Wrapped functions

When `cf_xarray` is imported, the `cf` accessor is automatically added to any
`xarray` object.  
`cf_xarray` is able to wrap most `xarray` functions.


In [None]:
# cf_xarray accessor has been added to the xarray object
assert hasattr(mom6_ds, "cf")

# For example, we can apply mean along any dimension using either xarray or cf_xarray.
# cf_xarray adds an understanding of CF conventions.
# Therefore, .cf.mean() accepts either dimension names or CF keys.
for obj in [mom6_ds, mom6_ds.cf]:
    assert hasattr(obj, "mean")

## 2. Utility functions

`cf_xarray` also adds several utility functions to work with CF metadata.  
For example, in the cell below, we use `cf_xarray` to identify, guess, and add
missing CF metadata.


In [None]:
mom6_ds = mom6_ds.cf.guess_coord_axis(verbose=True)
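
To give a feel for what "guessing" an axis can mean, here is a minimal sketch of one plausible heuristic: matching variable names against patterns. The patterns and the function `guess_axis` are hypothetical and written for this example only; they are not the rules cf_xarray applies, and name-based heuristics of this kind can misfire on unusual names.

```python
import re

# Hypothetical name patterns, chosen for this example only.
AXIS_PATTERNS = {
    "X": re.compile(r"^(x\w?|lon\w*)$"),
    "Y": re.compile(r"^(y\w?|lat\w*)$"),
    "Z": re.compile(r"^(z(_\w+)?|depth|lev(el)?)$"),
    "T": re.compile(r"^(t|time\w*)$"),
}


def guess_axis(name):
    """Return a guessed CF axis ("X", "Y", "Z", "T") or None if nothing matches."""
    lowered = name.lower()
    for axis, pattern in AXIS_PATTERNS.items():
        if pattern.match(lowered):
            return axis
    return None


for name in ["xh", "yq", "z_l", "time", "thetao", "deptho"]:
    print(name, "->", guess_axis(name))
```

The patterns are anchored on purpose so that data variables such as `thetao` or `deptho` are not mistaken for a time or depth axis.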

## 3. Dictionaries mapping CF keys to variable names

The example object contains variables lying on staggered grids.  
Therefore, a CF key can be associated with multiple variables.  
`cf_xarray` provides several dictionaries mapping CF keys to lists of variable
names, such as:

- `.cf.axes`
- `.cf.coordinates`
- `.cf.cell_measures`
- `.cf.standard_names`
- `.cf.bounds`

The representation (repr) of the `cf` accessor is a handy way to explore all CF
metadata.


In [None]:
mom6_ds.cf
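
Conceptually, each of these dictionaries groups variable names by the value of one CF attribute. The sketch below builds such a mapping for a toy staggered grid. The variable names, attributes, and the helper `map_attribute_to_names` are assumptions for illustration, not cf_xarray internals.

```python
from collections import defaultdict

# Toy attribute table for a staggered grid (names are illustrative).
variables = {
    "xh": {"axis": "X"},
    "xq": {"axis": "X"},
    "yh": {"axis": "Y"},
    "yq": {"axis": "Y"},
    "z_l": {"axis": "Z"},
    "time": {"axis": "T"},
    "thetao": {"standard_name": "sea_water_potential_temperature"},
}


def map_attribute_to_names(variables, attribute):
    """Group variable names by the value of one CF attribute."""
    mapping = defaultdict(list)
    for name, attrs in variables.items():
        if attribute in attrs:
            mapping[attrs[attribute]].append(name)
    return dict(mapping)


axes = map_attribute_to_names(variables, "axis")
standard_names = map_attribute_to_names(variables, "standard_name")
print(axes)
print(standard_names)
```

Because the grid is staggered, the `"X"` and `"Y"` keys each map to two names, which is why the values are lists rather than single strings.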

## 4. Get variables using CF keys

CF metadata precisely describes the physical quantity that each variable
represents, so variables can be selected by CF keys such as standard names.


In [None]:
# Extract oceanic bathymetry using the appropriate standard name
xr_da = mom6_ds["deptho"]
cf_da = mom6_ds.cf["sea_floor_depth_below_geoid"]
cf_da

`cf_xarray` decodes CF metadata linking variables with each other (e.g.,
`coordinates`, `cell_measures`, `ancillary_variables`).  
Unlike `xr_da`, the `cf_da` extracted in the previous cell carries all
`cell_measures` associated with the variable as coordinates.


In [None]:
additional_coords = set(cf_da.coords) - set(xr_da.coords)
print("Cell measure extracted by cf_xarray:", additional_coords)
mom6_ds[list(additional_coords)[0]]
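
In the CF conventions, the `cell_measures` attribute is a blank-separated list of `measure: variable_name` pairs. The following sketch parses such a string into a dictionary. The function `parse_cell_measures` and the example variable names are hypothetical; cf_xarray does this decoding for you.

```python
def parse_cell_measures(attribute):
    """Parse a CF `cell_measures` string such as "area: areacello"."""
    tokens = attribute.split()
    if len(tokens) % 2 != 0:
        raise ValueError(f"malformed cell_measures attribute: {attribute!r}")
    measures = {}
    # Tokens alternate between "measure:" and the name of the measure variable.
    for key, name in zip(tokens[0::2], tokens[1::2]):
        if not key.endswith(":"):
            raise ValueError(f"expected 'measure:' but found {key!r}")
        measures[key[:-1]] = name
    return measures


parsed = parse_cell_measures("area: areacello volume: volcello")
print(parsed)
```

The resulting names tell us which variables must travel with the data variable as coordinates.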

## 5. Automagically set optional arguments

`cf_xarray` sets some of the optional keyword arguments of wrapped functions.  
As opposed to `xarray`, in the example below `cf_xarray` assigns the appropriate
coordinates to the plot axes (i.e., longitude and latitude).


In [None]:
fig, (xr_ax, cf_ax) = plt.subplots(1, 2, figsize=(12, 4))

# xarray plot
xr_da.plot(ax=xr_ax)
xr_ax.set_title("xarray")
# cf_xarray plot
cf_da.cf.plot(ax=cf_ax)
cf_ax.set_title("cf_xarray")

plt.tight_layout()

## 6. Expand CF keys

As mentioned above, the example dataset is characterized by multiple dimensions
associated with the same spatial axes.  
Such information is decoded by `cf_xarray` and is used under the hood of wrapped
functions. In the example below, the CF Axes keys (i.e., "X", "Y", and "Z") are
expanded and multiple dimensions are sliced at once:


In [None]:
mom6_ds_sliced = mom6_ds.cf.isel(
    X=slice(10), Y=slice(10), Z=slice(10), T=slice(10)
)
print("Original dataset sizes:", dict(mom6_ds.sizes))
print("  Sliced dataset sizes:", dict(mom6_ds_sliced.sizes))
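
The expansion step can be pictured as a dictionary rewrite: each CF key is replaced by every dimension name it maps to, while ordinary dimension names pass through untouched. This is a minimal sketch with a hypothetical `expand_cf_keys` helper and a made-up axes mapping, not the code cf_xarray runs.

```python
def expand_cf_keys(indexers, axes):
    """Replace CF axis keys by all the dimension names they map to."""
    expanded = {}
    for key, indexer in indexers.items():
        # Keys that are not CF axes are treated as plain dimension names.
        for dim in axes.get(key, [key]):
            expanded[dim] = indexer
    return expanded


# Illustrative mapping for a staggered grid.
axes = {"X": ["xh", "xq"], "Y": ["yh", "yq"], "Z": ["z_l"], "T": ["time"]}

expanded = expand_cf_keys({"X": slice(10), "T": slice(10)}, axes)
print(expanded)
```

A single `X=slice(10)` therefore becomes two indexers, one for each staggered X dimension.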

# A comprehensive example

One of the advantages of using `cf_xarray` is that the same code can be applied
to a wide variety of CF-compliant objects, each with its own nomenclature.  
In the example below, we define a function that uses many `cf_xarray` features,
then apply it to objects with different dimension and coordinate names.


In [None]:
def plot_top_10m_temp_anomaly(ds, **kwargs):

    # Compute the anomaly time series
    with xr.set_options(keep_attrs=True):
        # Extract temperature using its standard name
        da = ds.cf["sea_water_potential_temperature"]
        # Fill missing weights with zeros
        da = da.cf.assign_coords(volume=da.cf.coords["volume"].fillna(0))
        # Select temperature in the top 10m in 2003
        da = da.cf.sel(T="2003", Z=slice(0, 10))
        # Compute weighted mean
        da = da.cf.weighted("volume").mean(["X", "Y", "Z"])
        # Subtract climatology
        da = da - da.cf.mean("T")

    # Update metadata
    da.attrs["standard_name"] += "_anomaly"
    da.attrs["long_name"] += " Anomaly"

    # Plot
    da.squeeze(drop=True).cf.plot(**kwargs)


plot_top_10m_temp_anomaly(mom6_ds, label="mom6_ds")
plot_top_10m_temp_anomaly(cmip6_ds, label="cmip6_ds")
_ = plt.legend()
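
The function above fills missing cell volumes with zeros before taking the weighted mean. The plain-Python sketch below shows why: a single NaN weight contaminates both sums, whereas a zero weight simply removes that cell from the average. The numbers and the helper names are made up for illustration. xarray's weighted operations are documented to reject weights containing missing values, so the toy function here is only meant to show the arithmetic.

```python
import math


def weighted_mean(values, weights):
    """Weighted mean that skips missing (NaN) values but not NaN weights."""
    total = 0.0
    weight_sum = 0.0
    for value, weight in zip(values, weights):
        if math.isnan(value):
            continue
        total += value * weight
        weight_sum += weight
    return total / weight_sum


def fill_missing(weights, fill_value=0.0):
    """Replace NaN weights by `fill_value`."""
    return [fill_value if math.isnan(w) else w for w in weights]


temperatures = [10.0, 12.0, 14.0]
volumes = [1.0, 3.0, math.nan]  # the last cell has no volume information

unfilled = weighted_mean(temperatures, volumes)
filled = weighted_mean(temperatures, fill_missing(volumes))
print(unfilled, filled)
```

With the NaN weight left in place the result is NaN; after filling, the mean is computed from the two cells that have a valid volume.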

Alternatively, `cf_xarray` provides utility functions to rename variables and
dimensions in one object to match another object. Matching variables/dimensions
are determined using CF metadata.


In [None]:
mom6_da = mom6_ds.cf["sea_water_potential_temperature"]
cmip6_da = cmip6_ds.cf["sea_water_potential_temperature"]
renamed_mom6_da = mom6_da.cf.rename_like(cmip6_da)
print("        MOM6 dimensions:", mom6_da.dims)
print("       CMIP6 dimensions:", cmip6_da.dims)
print("renamed MOM6 dimensions:", renamed_mom6_da.dims)
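
The matching behind such a rename can be sketched as a comparison of CF attributes between two objects. The toy version below pairs names that share an `axis` or `standard_name` value and takes the first match it finds. The function `rename_map_like` and both attribute tables are hypothetical, and the real `rename_like` may resolve ambiguous matches differently.

```python
def rename_map_like(source, target, attributes=("axis", "standard_name")):
    """Map names in `source` to names in `target` that share a CF attribute."""
    mapping = {}
    for src_name, src_attrs in source.items():
        for tgt_name, tgt_attrs in target.items():
            shares_attribute = any(
                attr in src_attrs and src_attrs[attr] == tgt_attrs.get(attr)
                for attr in attributes
            )
            if shares_attribute:
                # Only record names that actually change.
                if src_name != tgt_name:
                    mapping[src_name] = tgt_name
                break
    return mapping


# Illustrative attribute tables for two models.
mom6_like = {
    "xh": {"axis": "X"},
    "yh": {"axis": "Y"},
    "z_l": {"axis": "Z"},
    "time": {"axis": "T"},
}
cmip6_like = {
    "x": {"axis": "X"},
    "y": {"axis": "Y"},
    "lev": {"axis": "Z"},
    "time": {"axis": "T"},
}

name_map = rename_map_like(mom6_like, cmip6_like)
print(name_map)
```

The `time` dimension is absent from the result because it already has the same name in both objects.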