# Xube: The CF data model built on top of Iris and powered by XArray
## Why do Atmospheric, Oceanic, and Earth Scientists have to choose between the strength of Iris' CF implementation and the strength of XArray's labelled nd-arrays?
## Or: how you can have your cake and eat it.

![](https://www.questers.com/sites/default/files/styles/header_image/public/1140x390-header-python.jpg?itok=kXCVEJNw)

[XArray](http://xarray.pydata.org/en/stable/) adopts the Common Data Model for self-describing scientific data in widespread use in the Earth sciences: `xarray.Dataset` is an in-memory representation of a netCDF file.

[Iris](https://scitools.org.uk/iris/docs/latest/) implements a data model inspired by the [Climate and Forecast](http://cfconventions.org/) conventions, which gives specific meaning to certain metadata.

They are therefore data model implementations at different levels of abstraction. These different levels gave rise to the statement:

> **XArray *represents* your data, Iris *understands* your data.**

The CF conventions were born out of the need to give semantic meaning to netCDF-like data, so it follows that Iris is a higher-level abstraction than XArray. If one were to attempt to share constructs between XArray and Iris, it would logically follow that Iris could be built on top of the XArray data model. Chronology is perhaps the main reason that Iris was not built upon XArray, but what follows is a prototype that lets Iris interface directly with XArray, so that we really can have our cake and eat it.

### The Iris data model becomes a proxy

The APIs of XArray and Iris are somewhat different. We have received a significant amount of conflicting feedback about both APIs, and it is clear that there are strong preferences towards both approaches. Furthermore, Iris is the basis for a corpus of applications and scripts that amounts to *multiple centuries* of accumulated scientist time (100+ scientists over the last 6 years). It is therefore highly undesirable to *significantly* break or remove API.

> For Iris to be built upon XArray, it **must** retain its API and constructs.

Having Iris proxy out to XArray is the obvious way to meet this condition. Rather than duplicating data and keeping redundant copies of data and metadata, the *only* viable approach is for Iris' objects to be very thin wrappers on top of XArray.

> All metadata **must** live within the XArray data objects.

Given these conditions, we have implementations of Iris' objects that canonically store data and metadata on DataArray and Dataset objects, translating to Iris constructs on-the-fly as requested.

In [1]:
import iris
import xarray as xr

fname = iris.sample_data_path('E1_north_america.nc')
ds = xr.open_dataset(fname)
ds

<xarray.Dataset>
Dimensions:                  (bnds: 2, latitude: 37, longitude: 49, time: 240)
Coordinates:
  * time                     (time) object 1860-06-01 00:00:00 ... 2099-06-01 00:00:00
  * latitude                 (latitude) float32 15.0 16.25 17.5 ... 58.75 60.0
  * longitude                (longitude) float32 225.0 226.875 ... 313.125 315.0
    forecast_period          (time) timedelta64[ns] ...
    forecast_reference_time  object ...
    height                   float64 ...
Dimensions without coordinates: bnds
Data variables:
    air_temperature          (time, latitude, longitude) float32 ...
    latitude_longitude       int32 ...
    time_bnds                (time, bnds) float64 ...
Attributes:
    Conventions:  CF-1.5

In [2]:
import iris.xcube as xube

cube = xube.XCube(ds, 'air_temperature')
cube

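| Air Temperature (K) | time | latitude | longitude |
| --- | --- | --- | --- |
| Shape | 240 | 37 | 49 |
| Dimension coordinates | | | |
| time | x | - | - |
| latitude | - | x | - |
| longitude | - | - | x |
| Auxiliary coordinates | | | |
| forecast_reference_time | - | - | - |
| height | - | - | - |
| forecast_period | x | - | - |
| Attributes | | | |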


It is hard to appreciate that this really is a proxy until we modify some details on the cube and see them updated on the dataset (and vice versa).

In [3]:
del cube.attributes['Model scenario']
del ds['air_temperature'].attrs['source']

print(ds['air_temperature'])
cube

<xarray.DataArray 'air_temperature' (time: 240, latitude: 37, longitude: 49)>
[435120 values with dtype=float32]
Coordinates:
  * time                     (time) object 1860-06-01 00:00:00 ... 2099-06-01 00:00:00
  * latitude                 (latitude) float32 15.0 16.25 17.5 ... 58.75 60.0
  * longitude                (longitude) float32 225.0 226.875 ... 313.125 315.0
    forecast_period          (time) timedelta64[ns] 449 days 18:00:00 ... 86489 days 18:00:00
    forecast_reference_time  object 1859-09-01 06:00:00
    height                   float64 1.5
Attributes:
    standard_name:          air_temperature
    units:                  K
    ukmo__um_stash_source:  m01s03i236
    cell_methods:           time: mean (interval: 6 hour)
    grid_mapping:           latitude_longitude


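| Air Temperature (K) | time | latitude | longitude |
| --- | --- | --- | --- |
| Shape | 240 | 37 | 49 |
| Dimension coordinates | | | |
| time | x | - | - |
| latitude | - | x | - |
| longitude | - | - | x |
| Auxiliary coordinates | | | |
| forecast_reference_time | - | - | - |
| height | - | - | - |
| forecast_period | x | - | - |
| Attributes | | | |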


Notice how `source` and `Model scenario` have been removed from both the cube and the dataset. There is no clever synchronisation of data going on here: there is **only one source of data** (the XArray objects), which is proxied into Iris objects on-the-fly.
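
To make the "only one source of data" idea concrete, here is a minimal sketch of the pattern, not the prototype's actual code. A plain dict stands in for the XArray dataset, `AttrsProxy` is a hypothetical name, and the attribute values are placeholders. The wrapper returns the very mapping held by the backing store, so an edit made on either side is visible on the other.

```python
# Illustrative stand-ins only: a plain dict plays the role of the xarray
# Dataset, and AttrsProxy is a hypothetical name, not the prototype's class.
class AttrsProxy:
    """Thin wrapper that keeps a reference to the backing store, never a copy."""

    def __init__(self, store, name):
        self._store = store
        self._name = name

    @property
    def attributes(self):
        # Return the very same mapping the backing store holds.
        return self._store[self._name]["attrs"]


# Placeholder attribute values; the real file's values are not reproduced here.
store = {
    "air_temperature": {
        "attrs": {
            "source": "placeholder",
            "Model scenario": "placeholder",
            "history": "placeholder",
        }
    }
}
cube = AttrsProxy(store, "air_temperature")

del cube.attributes["Model scenario"]            # edit through the proxy
del store["air_temperature"]["attrs"]["source"]  # edit the store directly

print(sorted(cube.attributes))                    # ['history']
print(sorted(store["air_temperature"]["attrs"]))  # ['history']
```

Because the wrapper never copies, there is nothing to keep in sync.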

A similar story is true for coordinates on a cube - they are 100% proxy objects that only exist when requested.

In [4]:
print(ds['height'].attrs)
print(cube.coord('height'))

OrderedDict([('units', 'm'), ('standard_name', 'height'), ('positive', 'up')])
XAuxCoord(array(1.5), standard_name='height', units=Unit('m'), var_name='height')


In [5]:
cube.coord('height').units = 'km'

print(cube.coord('height'))
print(ds['height'].attrs)

XAuxCoord(array(1.5), standard_name='height', units=Unit('km'), var_name='height')
OrderedDict([('units', 'km'), ('standard_name', 'height'), ('positive', 'up')])
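
The behaviour in the two cells above can be mimicked with a small sketch. It assumes a coordinate proxy that is rebuilt on every request and whose `units` setter writes straight into the backing variable's attributes. `CoordProxy` and `CubeProxy` are hypothetical names, and plain dicts stand in for XArray variables; the prototype's real classes may look quite different.

```python
# Hypothetical names throughout; plain dicts stand in for xarray variables.
class CoordProxy:
    """Coordinate view that reads and writes the backing variable directly."""

    def __init__(self, variable):
        self._variable = variable  # a reference, not a copy

    @property
    def points(self):
        return self._variable["data"]

    @property
    def units(self):
        return self._variable["attrs"].get("units")

    @units.setter
    def units(self, value):
        # Relabels the unit only; the stored value is left untouched.
        self._variable["attrs"]["units"] = str(value)


class CubeProxy:
    def __init__(self, variables):
        self._variables = variables

    def coord(self, name):
        # A fresh proxy is built on every request; nothing is cached.
        return CoordProxy(self._variables[name])


variables = {
    "height": {
        "data": 1.5,
        "attrs": {"units": "m", "standard_name": "height", "positive": "up"},
    }
}
cube = CubeProxy(variables)
cube.coord("height").units = "km"

print(variables["height"]["attrs"]["units"])         # km
print(cube.coord("height").points)                   # 1.5
print(cube.coord("height") is cube.coord("height"))  # False
```

As in the notebook output, assigning `units` relabels the coordinate and leaves the value `1.5` unchanged.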


Those familiar with Iris's API may wish to add metadata to their dataset, which is also supported:

In [6]:
import iris.coords
# Name the variable after the coordinate it holds (one value per time step).
custom_index = iris.coords.AuxCoord(points=range(240), long_name='custom_index')
cube.add_aux_coord(custom_index, 0)
ds

<xarray.Dataset>
Dimensions:                  (bnds: 2, latitude: 37, longitude: 49, time: 240)
Coordinates:
  * time                     (time) object 1860-06-01 00:00:00 ... 2099-06-01 00:00:00
  * latitude                 (latitude) float32 15.0 16.25 17.5 ... 58.75 60.0
  * longitude                (longitude) float32 225.0 226.875 ... 313.125 315.0
    forecast_period          (time) timedelta64[ns] 449 days 18:00:00 ... 86489 days 18:00:00
    forecast_reference_time  object 1859-09-01 06:00:00
    height                   float64 1.5
Dimensions without coordinates: bnds
Data variables:
    air_temperature          (time, latitude, longitude) float32 ...
    latitude_longitude       int32 ...
    time_bnds                (time, bnds) float64 -9.511e+05 ... 1.122e+06
    custom_index             (time) int64 0 1 2 3 4 5 ... 235 236 237 238 239
Attributes:
    Conventions:  CF-1.5

From day 1 Iris has supported the concept of bounded coordinates. In XArray these are just auxiliary variables that the user must associate manually, using knowledge of the CF conventions. This is fully automatic in the Iris proxy:

In [7]:
print(cube.coord('longitude').has_bounds())
print(cube.coord('time').has_bounds())

False
True
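
For readers unfamiliar with the CF rule involved: a coordinate variable points at its bounds through a `bounds` attribute that names the boundary variable. The sketch below shows the lookup a proxy could perform, using plain dicts in place of XArray variables. The `has_bounds` helper is hypothetical, and where XArray keeps the `bounds` attribute after decoding a file is an assumption that should be checked against the real dataset.

```python
# Plain-dict stand-ins for dataset variables; has_bounds is a hypothetical
# helper illustrating the CF "bounds" attribute lookup, not the prototype's code.
def has_bounds(variables, coord_name):
    """Return True if the coordinate names a bounds variable that exists."""
    bounds_name = variables[coord_name]["attrs"].get("bounds")
    return bounds_name is not None and bounds_name in variables


variables = {
    "time": {"dims": ("time",), "attrs": {"bounds": "time_bnds"}},
    "time_bnds": {"dims": ("time", "bnds"), "attrs": {}},
    "longitude": {"dims": ("longitude",), "attrs": {"units": "degrees_east"}},
}

print(has_bounds(variables, "longitude"))  # False
print(has_bounds(variables, "time"))       # True
```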


This is still a very early prototype and there is much functionality that either doesn't work or hasn't been tested. That said, parts of Iris are robust to these changes without the need for any kind of modification.

For example, Iris understands units and how to convert from Kelvin to degrees Celsius...

In [8]:
tmp_dv = ds['air_temperature']
print(tmp_dv.data.min(), tmp_dv.data.max())
cube.convert_units('degC')
print(tmp_dv.data.min(), tmp_dv.data.max())

257.31882 303.8437
-15.8311825 30.69369
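
The arithmetic behind that output is just a subtraction of 273.15. The sketch below applies it in place to a stand-in variable so that a reference taken beforehand sees the new values, much as `tmp_dv` does above. It uses plain Python floats rather than float32, so the trailing digits differ slightly from the notebook output. `convert_kelvin_to_celsius` is a hypothetical helper, and updating the `units` attribute alongside the data is an assumption about how a proxy would behave.

```python
KELVIN_TO_CELSIUS_OFFSET = 273.15


def convert_kelvin_to_celsius(variable):
    """Convert a stand-in variable from K to degC in place (hypothetical helper)."""
    if variable["attrs"].get("units") != "K":
        raise ValueError("expected units of 'K'")
    # Slice assignment mutates the existing list, so other references see it.
    variable["data"][:] = [v - KELVIN_TO_CELSIUS_OFFSET for v in variable["data"]]
    # Assumption: the unit label is written back next to the data.
    variable["attrs"]["units"] = "degC"


air_temperature = {"data": [257.31882, 303.8437], "attrs": {"units": "K"}}
data_ref = air_temperature["data"]  # taken before converting, like tmp_dv above

convert_kelvin_to_celsius(air_temperature)
print([round(v, 3) for v in data_ref])    # [-15.831, 30.694]
print(air_temperature["attrs"]["units"])  # degC
```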
