Let's load a cube from metadata alone. This is a proof of concept showing the potential of storing metadata separately from cubes.

The metadata here is JSON dumps of the original NetCDF file headers. It could just as easily have come from a database query or similar. The benefit here is that the metadata is already on my local disk, so no database query is needed.

It is roughly 500 times faster to load cubes in this way because network traffic is minimal: about 3 min 30 s with iris.load versus about 400 ms from the metadata.

I don't bother adding attributes to the cube, because it messes up merging. It would be trivial to add all the attributes, and with a little thought we could add only the attributes that don't break the merge.
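
One way to pick "only the attributes that don't break the merge" is to keep just the attributes that have the same value in every record. The following is a minimal sketch of that idea in plain Python; the helper name `common_attributes` and the toy records are hypothetical, and the record layout (`record['attributes']`) is assumed to match the metadata file used below.

```python
def common_attributes(records):
    """Return the attributes whose values are identical in every record."""
    if not records:
        return {}
    # Start from the first record's attributes and drop any key that is
    # missing, or has a different value, in a later record.
    common = dict(records[0]['attributes'])
    for record in records[1:]:
        attrs = record['attributes']
        common = {k: v for k, v in common.items()
                  if k in attrs and attrs[k] == v}
    return common


# Toy records shaped like the entries in metadata.json (illustrative values).
toy_records = [
    {'attributes': {'model_id': 'HadGEM3-A-N216',
                    'creation_date': '2015-07-24T16:40:09Z'}},
    {'attributes': {'model_id': 'HadGEM3-A-N216',
                    'creation_date': '2015-07-31T05:55:16Z'}},
]
shared = common_attributes(toy_records)
print(shared)
```

The resulting dictionary could then be passed as the `attributes` argument when constructing each cube, on the assumption that attributes which are identical across cubes will not get in the way of combining them.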

I also haven't added cell methods, which are very important. I think it's easy though.
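
As a rough idea of what adding cell methods might involve: the NetCDF variable stores them as a string such as `time: mean`. The sketch below splits that string into (method, coordinate names) pairs using only the standard library. The function name `parse_cell_methods` is hypothetical, and it is a simplification: parenthesised intervals and comments are discarded rather than parsed.

```python
import re


def parse_cell_methods(text):
    """Split a CF cell_methods string into (method, coord_names) pairs."""
    # Drop parenthesised extras such as "(interval: 1 hour)"; a fuller
    # parser would keep them as intervals or comments.
    text = re.sub(r'\([^)]*\)', ' ', text)
    parsed = []
    # Each entry is one or more "name: " prefixes followed by a method.
    for names, method in re.findall(r'((?:\w+:\s+)+)(\w+)', text):
        coords = tuple(re.findall(r'(\w+):', names))
        parsed.append((method, coords))
    return parsed


single = parse_cell_methods('time: mean')
double = parse_cell_methods('time: mean (interval: 1 hour) area: maximum')
print(single)
print(double)
```

Each pair could then be turned into an `iris.coords.CellMethod(method, coords=coords)` and passed to the cube via its `cell_methods` argument. The names in the string are NetCDF variable names (for example `lat`), so they may need mapping to the coordinate names used on the cube.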

There is a bug in this code. The notebook that generates this metadata only captures points for variables where len(variable.shape) == 1, e.g. lat and lon points, but not data points. This is not correct: height is a scalar variable with len(variable.shape) == 0, and time_bnds is a variable with len(variable.shape) == 2, and both should have been captured. Easy to fix, but again important.
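
The extraction notebook is not shown here, so the following is only a sketch of one possible replacement for the `len(variable.shape) == 1` test: capture a variable's values whenever the total number of values is small, whatever its rank. The function name `should_capture_points` and the `max_size` threshold are assumptions, and the `time_bnds` shape is inferred from the `time` and `bnds` dimension sizes.

```python
from math import prod


def should_capture_points(shape, max_size=100_000):
    """Decide whether a variable is small enough to store in the metadata."""
    # prod(()) == 1, so a scalar variable such as height (shape ())
    # counts as a single value and is captured.
    return prod(shape) <= max_size


capture = {
    'height': should_capture_points(()),            # scalar coordinate
    'lat': should_capture_points((324,)),           # 1-D coordinate
    'time_bnds': should_capture_points((48, 2)),    # 2-D bounds
    'tas': should_capture_points((48, 324, 432)),   # full data array
}
print(capture)
```

A size threshold is a blunt rule. An alternative would be to capture everything except an explicit list of data variable names.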

In [78]:
import json
import iris
import dask.array as da

In [99]:
# use a context manager so the file handle is closed after loading
with open('./metadata.json', 'r') as f:
    records = json.load(f)
print(len(records))

90


In [4]:
records[0]['dimensions']

{'bnds': {'name': 'bnds', 'size': 2, 'unlimited': False},
 'lat': {'name': 'lat', 'size': 324, 'unlimited': False},
 'lon': {'name': 'lon', 'size': 432, 'unlimited': False},
 'time': {'name': 'time', 'size': 48, 'unlimited': True}}

In [126]:
print(records[0]['variables']['height'])

{'axis': 'Z', 'chartostring': 'True', 'datatype': 'float64', 'dimensions': '()', 'dtype': 'float64', 'long_name': 'height', 'mask': 'True', 'name': 'height', 'ndim': '0', 'positive': 'up', 'scale': 'True', 'shape': '()', 'size': '1.0', 'standard_name': 'height', 'units': 'm', 'points': [1.5]}
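
The record above shows that tuple and number fields such as 'dimensions', 'shape' and 'ndim' were exported as strings. A minimal sketch of turning them back into Python values with `ast.literal_eval` from the standard library, which only accepts literals and so avoids running arbitrary code. The helper name `decode_variable` and the choice of which fields to decode are assumptions.

```python
import ast

# Fields assumed to hold stringified Python literals.
LITERAL_FIELDS = ('dimensions', 'shape', 'ndim')


def decode_variable(variable):
    """Return a copy of a variable dict with its literal fields decoded."""
    decoded = dict(variable)
    for key in LITERAL_FIELDS:
        if key in decoded:
            decoded[key] = ast.literal_eval(decoded[key])
    return decoded


# A cut-down copy of the height record printed above.
height = {'name': 'height', 'dimensions': '()', 'shape': '()',
          'ndim': '0', 'units': 'm'}
decoded = decode_variable(height)
print(decoded)
```

Only the listed fields are decoded on purpose: decoding every string would turn a value such as a units string of '1' into an integer.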


In [125]:
# fix bug in how the metadata was extracted
# (code only got data for variables with len(shape) == 1, height has len(shape) 0)
for record in records:
    record['variables']['height'].update({'points': [1.5]})

In [40]:
def variable_to_dimcoord(variable):
    """Build an iris DimCoord from one variable's metadata dictionary."""
    # not set yet: bounds, attributes, coord_system, circular
    return iris.coords.DimCoord(
        points=variable['points'],
        standard_name=variable['standard_name'],
        long_name=variable['long_name'],
        var_name=None,  # variable['var_name'] is not used yet
        units=variable['units'])


In [41]:
lat = variable_to_dimcoord(records[0]['variables']['lat'])
lon = variable_to_dimcoord(records[0]['variables']['lon'])

In [136]:
import ast

def variable_to_cube(record, variable_name='tas'):
    # The export from netcdf stringified everything, so tuples have to be parsed back.
    # ast.literal_eval only accepts Python literals, which is safer than eval.
    var = record['variables'][variable_name]  # dictionary
    dims = ast.literal_eval(var['dimensions'])  # tuple of dim names
    dim_coords = [variable_to_dimcoord(record['variables'][dim]) for dim in dims]

    # data object: a lazy proxy, so no data is read from the file here
    shape = ast.literal_eval(var['shape'])
    dtype = var['dtype']
    path = record['filename']
    data = iris.fileformats.netcdf.NetCDFDataProxy(
        shape=shape,
        dtype=dtype,
        path=path,
        variable_name=variable_name,
        fill_value=None)

    cube = iris.cube.Cube(
        data=da.from_array(data, chunks=shape),
        standard_name=var['standard_name'],
        long_name=var['long_name'],
        var_name=None,
        units=var['units'],
        # each dimension coordinate maps to the data axis at the same position
        dim_coords_and_dims=[(coord, i) for i, coord in enumerate(dim_coords)]
        )

    return cube

In [137]:
%%time
c = iris.cube.CubeList([variable_to_cube(record) for record in records]).concatenate()

CPU times: user 312 ms, sys: 0 ns, total: 312 ms
Wall time: 371 ms


In [138]:
c # realization is only in the nc file as an attribute

[<iris 'Cube' of air_temperature / (K) (time: 648; latitude: 324; longitude: 432)>,
<iris 'Cube' of air_temperature / (K) (time: 648; latitude: 324; longitude: 432)>,
<iris 'Cube' of air_temperature / (K) (time: 648; latitude: 324; longitude: 432)>,
<iris 'Cube' of air_temperature / (K) (time: 648; latitude: 324; longitude: 432)>,
<iris 'Cube' of air_temperature / (K) (time: 648; latitude: 324; longitude: 432)>,
<iris 'Cube' of air_temperature / (K) (time: 648; latitude: 324; longitude: 432)>,
<iris 'Cube' of air_temperature / (K) (time: 648; latitude: 324; longitude: 432)>,
<iris 'Cube' of air_temperature / (K) (time: 648; latitude: 324; longitude: 432)>,
<iris 'Cube' of air_temperature / (K) (time: 648; latitude: 324; longitude: 432)>,
<iris 'Cube' of air_temperature / (K) (time: 648; latitude: 324; longitude: 432)>,
<iris 'Cube' of air_temperature / (K) (time: 648; latitude: 324; longitude: 432)>,
<iris 'Cube' of air_temperature / (K) (time: 648; latitude: 324; longitude: 432)>,
<ir

In [134]:
%%time
c2 = iris.load([record['filename'] for record in records]).concatenate()



CPU times: user 3.88 s, sys: 336 ms, total: 4.22 s
Wall time: 3min 35s
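
For reference, the speedup implied by the two wall times recorded in this notebook can be worked out directly. These are single runs, so the ratio is indicative only and will vary with network conditions and caching.

```python
# Wall times copied from the two %%time outputs above.
metadata_wall_s = 0.371          # "Wall time: 371 ms"
iris_load_wall_s = 3 * 60 + 35   # "Wall time: 3min 35s"

speedup = iris_load_wall_s / metadata_wall_s
print(f"speedup: about {speedup:.0f}x")
```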


In [102]:
print(c2[0])

air_temperature / (K)               (time: 120; latitude: 324; longitude: 432)
     Dimension coordinates:
          time                           x              -               -
          latitude                       -              x               -
          longitude                      -              -               x
     Scalar coordinates:
          height: 1.5 m
     Attributes:
          Conventions: CF-1.4
          associated_files: baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation gridspecFile: gridspec_atmos_fx_HadGEM3-A-N216_historical_r0i0p0.nc...
          branch_time: 0.0
          cmor_version: 2.9.1
          contact: peter.stott@metoffice.gov.uk, andrew.ciavarella@metoffice.gov.uk
          creation_date: 2015-07-31T05:55:16Z
          experiment: historical
          experiment_id: historical
          forcing: GHG, Oz, LU, Sl, Vl, AA, (GHG = CO2, N2O, CH4, CFCs, HFCs)
          frequency: mon
          history: 2015-07-31T05:55:16Z altered by CMOR: Treat

In [103]:
records[0]

{'attributes': {'Conventions': 'CF-1.4',
  'branch_time': '0.0',
  'cmor_version': '2.9.1',
  'contact': 'peter.stott@metoffice.gov.uk, andrew.ciavarella@metoffice.gov.uk',
  'creation_date': '2015-07-24T16:40:09Z',
  'experiment': 'historical',
  'experiment_id': 'historical',
  'forcing': 'GHG, Oz, LU, Sl, Vl, AA, (GHG = CO2, N2O, CH4, CFCs, HFCs)',
  'frequency': 'mon',
  'history': 'MOHC pp to CMOR/NetCDF convertor (version 1.16.2) 2015-07-22T13:51:32Z CMOR rewrote data to comply with CF standards and EUCLEIA requirements.',
  'initialization_method': '1',
  'institute_id': 'MOHC',
  'institution': 'Met Office Hadley Centre, Fitzroy Road, Exeter, Devon, EX1 3PB, UK, (http://www.metoffice.gov.uk)',
  'mo_runid': 'aojac',
  'model_id': 'HadGEM3-A-N216',
  'modeling_realm': 'atmos',
  'parent_experiment': 'N/A',
  'parent_experiment_id': 'N/A',
  'parent_experiment_rip': 'N/A',
  'physics_version': '3',
  'product': 'output',
  'project_id': 'EUCLEIA',
  'realization': '1',
  'source'

In [106]:
import netCDF4

In [107]:
ds = netCDF4.Dataset(records[0]['filename'])

In [111]:
ds.variables['tas']

<class 'netCDF4._netCDF4.Variable'>
float32 tas(time, lat, lon)
    standard_name: air_temperature
    long_name: Near-Surface Air Temperature
    units: K
    original_name: mo: m01s03i236
    cell_methods: time: mean
    cell_measures: area: areacella
    history: 2015-07-24T11:32:26Z altered by CMOR: Treated scalar dimension: 'height'. 2015-07-24T11:32:26Z altered by CMOR: replaced missing value flag (-1.07374e+09) with standard missing value (1e+20).
    coordinates: height
    missing_value: 1e+20
    _FillValue: 1e+20
    associated_files: baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation gridspecFile: gridspec_atmos_fx_HadGEM3-A-N216_historical_r0i0p0.nc areacella: areacella_fx_HadGEM3-A-N216_historical_r0i0p0.nc
unlimited dimensions: time
current shape = (48, 324, 432)
filling off

In [114]:
ds.variables['height']['points']

IndexError: only integers, slices (`:`), ellipsis (`...`), and 1-d integer or boolean arrays are valid indices

In [120]:
ds.dimensions['lat']

<class 'netCDF4._netCDF4.Dimension'>: name = 'lat', size = 324

In [121]:
ds.variables['lat'].shape

(324,)

In [122]:
ds.variables['height'].shape

()