Let's load a cube from metadata alone. This is a proof of concept showing the potential of storing metadata separately from cubes.

The metadata here is JSON dumps of the original netCDF file headers. It could just as easily have come from a database query. One benefit of this setup is that the metadata is already on my local disk, so no database query is needed.

It is ~500 times faster to load cubes this way (almost entirely down to avoided network traffic): 210 s -> 0.4 s.

I don't bother adding attributes to the cube, because it messes up merging. It would be trivial to add all the attributes, and with a little thought we could add only the attributes that don't break the merge.
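
One way to keep only the merge-safe attributes is to keep those whose value is identical in every file. The sketch below assumes each file's global attributes are available as a plain dict; the example values and key names are made up, and where the attributes live in the JSON records is not shown in this notebook.

```python
def common_attributes(attribute_dicts):
    """Return the attributes whose values are identical in every dict."""
    attribute_dicts = list(attribute_dicts)
    if not attribute_dicts:
        return {}
    first, rest = attribute_dicts[0], attribute_dicts[1:]
    return {
        key: value
        for key, value in first.items()
        if all(key in other and other[key] == value for other in rest)
    }


# Made-up example values: per-file identifiers such as a uuid are dropped.
example = [
    {"institution": "Example Centre", "tracking_id": "aaa", "realization": 1},
    {"institution": "Example Centre", "tracking_id": "bbb", "realization": 2},
]
shared = common_attributes(example)
print(shared)
```

The resulting dict could be passed as `attributes=` when constructing each cube, so that every cube carries the same attributes.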

I also haven't added cell methods, which are very important. I think it's easy though.
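
A rough sketch of how cell methods could be recovered, assuming the header dump stores the CF `cell_methods` string (e.g. `"time: mean"`) on the data variable. The parser below is deliberately simple: it handles `name: method` pairs with an optional parenthesised comment and ignores CF qualifiers such as `within` and `over`.

```python
import re

# Matches e.g. "time: mean" or "lat: lon: mean (interval: 1 hour)".
_CELL_METHOD = re.compile(
    r"((?:\w+:\s+)+)"        # one or more "name: " prefixes
    r"(\w+)"                 # the method, e.g. "mean"
    r"(?:\s+\(([^)]*)\))?"   # optional parenthesised comment
)


def parse_cell_methods(text):
    """Split a CF cell_methods string into (method, coord_names, comment) tuples."""
    parsed = []
    for names, method, comment in _CELL_METHOD.findall(text):
        coords = tuple(name.strip() for name in names.split(":") if name.strip())
        parsed.append((method, coords, comment or None))
    return parsed


methods = parse_cell_methods("time: mean")
print(methods)
# Each tuple could then be turned into
# iris.coords.CellMethod(method, coords=coords, comments=comment)
# and attached with cube.add_cell_method(...). The names in the string are
# netCDF variable names (e.g. "lat"), which may differ from the coord names.
```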

There is a bug in this code. The notebook that generated this metadata only captures points for variables where len(variable.shape) == 1. That picks up e.g. lat and lon points and skips the data points, but it also skips too much: height is a scalar variable with len(variable.shape) == 0, and time_bnds is a variable with len(variable.shape) == 2, and both should have been captured. Easy to fix, but again important.
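
One possible fix, sketched here without access to the generating notebook, is to decide by the number of values rather than by the number of dimensions. The function name and the cutoff of 100,000 values are assumptions chosen for illustration, and the `tas` shape is inferred from the dimension sizes in the example record.

```python
import math


def should_capture_points(shape, max_values=100_000):
    """Decide whether a variable is small enough to store as metadata.

    `shape` is the variable's shape tuple; `max_values` is an arbitrary cutoff.
    """
    return math.prod(shape) <= max_values


shapes = {
    "height": (),            # scalar: math.prod(()) == 1
    "lat": (324,),
    "time_bnds": (48, 2),
    "tas": (48, 324, 432),   # the data itself: too large to capture
}
captured = {name for name, shape in shapes.items() if should_capture_points(shape)}
print(sorted(captured))
```

With this rule the scalar height and the 2D time_bnds are both captured, while the data variable is still left on disk.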

In [21]:
import json
import iris
import dask.array as da

In [3]:
with open('./example_nc_headers.json', 'r') as f:
    records = json.load(f)
print(len(records))

90


In [4]:
records[0]['dimensions']

{'bnds': {'name': 'bnds', 'size': 2, 'unlimited': False},
 'lat': {'name': 'lat', 'size': 324, 'unlimited': False},
 'lon': {'name': 'lon', 'size': 432, 'unlimited': False},
 'time': {'name': 'time', 'size': 48, 'unlimited': True}}

In [5]:
print(records[0]['variables']['height'])

{'axis': 'Z', 'chartostring': 'True', 'datatype': 'float64', 'dimensions': '()', 'dtype': 'float64', 'long_name': 'height', 'mask': 'True', 'name': 'height', 'ndim': '0', 'positive': 'up', 'scale': 'True', 'shape': '()', 'size': '1.0', 'standard_name': 'height', 'units': 'm'}
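
Every value in this dictionary is a string, including the tuples and numbers. A hedged sketch of how they could be parsed back safely with `ast.literal_eval`, restricted to keys known to hold Python literals (the list of keys is an assumption based on the record printed above; values such as `units` are left as strings so that e.g. a unit of `'1'` is not turned into a number):

```python
import ast

# Keys whose values are stringified Python literals in the header dump above.
LITERAL_KEYS = ('dimensions', 'shape', 'ndim', 'size', 'mask', 'scale', 'chartostring')


def parse_variable(variable):
    """Return a copy of a variable dict with literal-valued keys parsed."""
    parsed = dict(variable)
    for key in LITERAL_KEYS:
        if isinstance(parsed.get(key), str):
            parsed[key] = ast.literal_eval(parsed[key])
    return parsed


height = {'axis': 'Z', 'chartostring': 'True', 'datatype': 'float64',
          'dimensions': '()', 'dtype': 'float64', 'long_name': 'height',
          'mask': 'True', 'name': 'height', 'ndim': '0', 'positive': 'up',
          'scale': 'True', 'shape': '()', 'size': '1.0',
          'standard_name': 'height', 'units': 'm'}
parsed_height = parse_variable(height)
print(parsed_height['shape'], parsed_height['ndim'], parsed_height['size'])
```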


In [6]:
# fix bug in how the metadata was extracted
# (code only got data for variables with len(shape) == 1, height has len(shape) 0)
for record in records:
    record['variables']['height'].update({'points': [1.5]})

In [7]:
def variable_to_dimcoord(variable):
    """Build a DimCoord from one variable's header dictionary."""
    return iris.coords.DimCoord(
        points=variable['points'],
        standard_name=variable['standard_name'],
        long_name=variable['long_name'],
        var_name=None,  # variable['var_name'] is not used for now
        units=variable['units'])
    # not passed yet: bounds=None, attributes=None, coord_system=None, circular=False


In [8]:
lat = variable_to_dimcoord(records[0]['variables']['lat'])
lon = variable_to_dimcoord(records[0]['variables']['lon'])

In [9]:
import ast

def variable_to_cube(record, variable_name='tas'):
    # The export from netCDF stringified everything, so tuples have to be
    # parsed back out of strings. ast.literal_eval only accepts literals,
    # which makes it a safer choice than eval for re-importing.
    var = record['variables'][variable_name]  # dictionary
    dims = ast.literal_eval(var['dimensions'])  # tuple of dim names
    dim_coords = [variable_to_dimcoord(record['variables'][dim]) for dim in dims]

    # data object (a proxy for the array stored in the netCDF file)
    shape = ast.literal_eval(var['shape'])
    dtype = var['dtype']
    path = record['filename']
    data = iris.fileformats.netcdf.NetCDFDataProxy(
        shape=shape,
        dtype=dtype,
        path=path,
        variable_name=variable_name,
        fill_value=None)

    cube = iris.cube.Cube(
        data=da.from_array(data, chunks=shape),
        standard_name=var['standard_name'],
        long_name=var['long_name'],
        var_name=None,
        units=var['units'],
        dim_coords_and_dims=[(coord, i) for i, coord in enumerate(dim_coords)]
        )

    return cube

In [11]:
%%time
c = iris.cube.CubeList([variable_to_cube(record) for record in records]).concatenate()

CPU times: user 363 ms, sys: 7.51 ms, total: 371 ms
Wall time: 372 ms


In [16]:
print(c) # realization is only in the nc file as an attribute

0: air_temperature / (K)               (time: 648; latitude: 324; longitude: 432)
1: air_temperature / (K)               (time: 648; latitude: 324; longitude: 432)
2: air_temperature / (K)               (time: 648; latitude: 324; longitude: 432)
3: air_temperature / (K)               (time: 648; latitude: 324; longitude: 432)
4: air_temperature / (K)               (time: 648; latitude: 324; longitude: 432)
5: air_temperature / (K)               (time: 648; latitude: 324; longitude: 432)
6: air_temperature / (K)               (time: 648; latitude: 324; longitude: 432)
7: air_temperature / (K)               (time: 648; latitude: 324; longitude: 432)
8: air_temperature / (K)               (time: 648; latitude: 324; longitude: 432)
9: air_temperature / (K)               (time: 648; latitude: 324; longitude: 432)
10: air_temperature / (K)               (time: 648; latitude: 324; longitude: 432)
11: air_temperature / (K)               (time: 648; latitude: 324; longitude: 432)
12: air_temper

In [None]:
import warnings


In [20]:
"%e"% (648*324*432*14)

'1.269790e+09'

In [17]:
c.merge()

DuplicateDataError: failed to merge into a single cube.
  Duplicate 'air_temperature' cube, with scalar coordinates 
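
The merge fails because nothing on the cubes distinguishes one ensemble member from another. A possible way around this, sketched under assumptions: group the records by their realization attribute, then give each concatenated cube a scalar realization coordinate before merging. Where the realization lives inside each record is not shown in this notebook, so the accessor is passed in by the caller and the `'attributes'` key in the example is hypothetical.

```python
from collections import defaultdict


def group_by_realization(records, get_realization):
    """Group header records by ensemble member.

    `get_realization` is a caller-supplied function, since the location of
    the realization attribute inside each record is an assumption.
    """
    groups = defaultdict(list)
    for record in records:
        groups[get_realization(record)].append(record)
    return dict(groups)


def label_with_realization(cube, realization):
    """Attach a scalar realization coordinate so merge can tell cubes apart."""
    import iris.coords  # imported here so the grouping helper runs without iris

    cube.add_aux_coord(
        iris.coords.AuxCoord(realization, standard_name='realization', units='1'))
    return cube


# Made-up records; the 'attributes' key is hypothetical.
fake_records = [
    {'filename': 'a.nc', 'attributes': {'realization': 1}},
    {'filename': 'b.nc', 'attributes': {'realization': 2}},
    {'filename': 'c.nc', 'attributes': {'realization': 1}},
]
groups = group_by_realization(fake_records, lambda r: r['attributes']['realization'])
print({member: [r['filename'] for r in rs] for member, rs in groups.items()})
```

With the real records, each group could be converted with variable_to_cube, concatenated along time, labelled with label_with_realization, and the labelled cubes then merged.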

In [None]:
%%time
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    c2 = iris.load([record['filename'] for record in records]).concatenate()

In [None]:
print(c2) # iris won't concat as attributes differ (e.g. each nc file has a uuid :/)