# S3-NetCDF / CFA
https://github.com/cedadev/S3-netcdf-python

An extension package to netCDF4-python that enables reading and writing netCDF files and CFA-netCDF files from/to object stores and public clouds with an S3 HTTP interface, as well as to disk or to OPeNDAP.

CFA is CF Aggregation, a way of grouping many smaller datasets (nc files) into one logical dataset. A CFA file provides the metadata (such as dimensions) for a master array, plus pointers/references from regions of the master array to the other files/objects where the data is actually stored.
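
To make the "pointers to regions of the master array" idea concrete, here is a minimal pure-Python sketch. It is not the CFA on-disk encoding or the S3-NetCDF API: the `Partition` class, the `overlapping` helper, the half-open index ranges and the `s3://example-bucket/...` URIs are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# One (start, stop) pair per dimension; stop is exclusive (an assumption
# made here for convenience, not necessarily how CFA encodes locations).
Region = Tuple[Tuple[int, int], ...]


@dataclass
class Partition:
    """One sub-array of the master array, stored in its own file/object."""
    location: Region  # where this sub-array sits in the master array
    uri: str          # hypothetical file or object holding the data


def overlapping(partitions: List[Partition], request: Region) -> List[Partition]:
    """Return the partitions that intersect the requested region."""
    hits = []
    for p in partitions:
        if all(start < r_stop and r_start < stop
               for (start, stop), (r_start, r_stop) in zip(p.location, request)):
            hits.append(p)
    return hits


# Master array of shape (time=4, lat=10), split into two files along time.
partitions = [
    Partition(location=((0, 2), (0, 10)), uri="s3://example-bucket/part_0.nc"),
    Partition(location=((2, 4), (0, 10)), uri="s3://example-bucket/part_1.nc"),
]

# Only the files touching time index 1 need to be opened.
needed = overlapping(partitions, request=((1, 2), (0, 5)))
print([p.uri for p in needed])
```

The point is that a subset request only has to touch the files whose regions intersect it; the master file itself holds no data.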

Pros:

* Reads from and writes to S3.
* It is an extension to netCDF4-python, so migrating existing libraries to it, or using it within them, should be easier.
* Splits a potentially large 'master array' NetCDF into smaller chunks.
* Each chunk is a self-describing NetCDF file.
* While Iris doesn't support CFA (as far as I can tell), CFA is conceptually very similar to a merged cube created from multiple files. It should be possible to integrate CFA directly into Iris rather than via [S3-NetCDF](https://github.com/cedadev/S3-netcdf-python). This could gain better integration and perhaps performance, but would lose the S3 element. If Iris could write CFA-NetCDF we would have a method (Iris `merge`) for creating the 'master' files.


Cons (as is):
* Reads and writes in serial.
* Doesn't provide tools to build a CFA-NetCDF file from existing files (though other tools may exist, or it could be done by hand).
* Each 'chunk' is read whole (downloaded from S3) rather than partially using byte range requests. The performance of this may not suit our existing file/chunk scheme.
* Python 2, not 3.
* Much of it is written in Cython (only bad in that we don't have experience, so developing might be harder).
* Not using dask or lazy loading. This is maybe fine, as we can wrap it with `dask.array` as an array-like. However, under the hood it could (though currently does not) run parallel tasks when subsetting the data, and it might work best (?) if that parallelism used the same paradigm as whatever is calling the library (such as Iris wrapping it in dask).
* The master files (CFA files) are NetCDF files, so they are not human readable (unlike [ncml](https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/ncml/)), and it's not obvious that they could easily be put into a searchable datastore like Elasticsearch (if that even makes sense).
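
For reference on the byte range point, a partial read over HTTP is just a GET with a `Range` header. The sketch below builds such a request with the standard library without sending it; the `range_request` helper and the bucket URL are hypothetical, and nothing here is part of S3-NetCDF.

```python
import urllib.request


def range_request(url, offset, length):
    """Build (but do not send) a GET for `length` bytes starting at `offset`."""
    last = offset + length - 1  # HTTP byte ranges are inclusive at both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={offset}-{last}"})


# Hypothetical object URL; ask for 4096 bytes starting 1024 bytes in.
req = range_request(
    "https://example-bucket.s3.amazonaws.com/part_0.nc", offset=1024, length=4096
)
print(req.get_header("Range"))
```

Whether this beats downloading whole chunks depends on chunk size and on how many ranges a single read needs, which is the trade-off noted above.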


## Useful concepts

The design is based on [CF Aggregation](http://www.met.reading.ac.uk/~david/cfa/0.4/introduction.html), a proposed (perhaps now accepted) "general methodology for combining two or more CF fields which logically form one, larger dataset." The methodology centres on a partition matrix; see the [CFA framework description](http://www.met.reading.ac.uk/~david/cfa/0.4/framework.html) and the [S3 NetCDF README.md](https://github.com/cedadev/S3-netcdf-python/blob/master/README.md#cfa-netcdf-files) for an explanation.

## What we would need to do to use it
* Make it read in parallel
* Make it Python 3?
* Design/find a tool that could make CFA files from our existing data or adapt Iris to do so.
* Decide whether we are happy that it doesn't make byte range requests (this has pros and cons).
* Figure out how to hook it into Iris, xarray etc., but hopefully that is not too hard.
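
As a rough idea of what "read in parallel" could look like, the sketch below fetches several sub-array files concurrently with a thread pool. `fetch_subarray` is a hypothetical stand-in that only returns a string; a real version would download and open the file. This uses `concurrent.futures` for brevity, not dask, so it sidesteps rather than answers the question of which parallel paradigm to share with Iris.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_subarray(uri):
    """Hypothetical stand-in for downloading and opening one sub-array file."""
    return f"data from {uri}"


def fetch_all(uris, max_workers=4):
    """Fetch every sub-array concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of `uris`, not completion order.
        return list(pool.map(fetch_subarray, uris))


uris = [f"s3://example-bucket/part_{i}.nc" for i in range(3)]
results = fetch_all(uris)
print(results)
```

Threads suit this case because the work is dominated by network I/O rather than computation.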



# Zarr

Pros:
* Optimised for nD arrays stored on S3.
* Finding favour with many in the Pangeo community.

Cons:
* Chunks are 'meaningless' without the header/zarr file.
* A new format without a clear/easy route to implementing it in our existing libraries.
* Unclear to me how metadata is stored (though this is likely easily done).
* Requires us to re-process all our data and either duplicate it or throw away the existing NetCDF files, which are self-describing and which much of the atmospheric science community understands and has the tools and experience to deal with.
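
On the metadata question: as I understand the Zarr v2 storage spec, each array keeps a small JSON document under the key `.zarray` (shape, chunk shape, dtype and so on), user attributes go in `.zattrs`, and each chunk is stored under a key made of its chunk indices joined by `.`. The sketch below reproduces that layout with the standard library only, without using the `zarr` package; the helper names are made up for illustration.

```python
import itertools
import json
import math


def zarray_metadata(shape, chunks, dtype="<f4"):
    """Build a dict shaped like the JSON Zarr v2 stores under `.zarray`."""
    return {
        "zarr_format": 2,
        "shape": list(shape),
        "chunks": list(chunks),
        "dtype": dtype,
        "compressor": None,  # uncompressed, to keep the sketch small
        "fill_value": 0.0,
        "filters": None,
        "order": "C",
    }


def chunk_keys(shape, chunks):
    """List the store keys of every chunk, e.g. '0.0', '0.1', ..."""
    counts = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    grid = itertools.product(*(range(n) for n in counts))
    return [".".join(str(i) for i in idx) for idx in grid]


meta = zarray_metadata(shape=(4, 10), chunks=(2, 5))
print(json.dumps(meta, indent=2))
print(chunk_keys((4, 10), (2, 5)))
```

This also illustrates the first con: a chunk key such as `1.0` carries no shape, dtype or coordinate information of its own.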




# The NetCDF Markup Language (NcML)
https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/ncml/

Can be used to represent the content of a NetCDF file as XML. Can also be used to represent multiple files and [join/aggregate](https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/ncml/Aggregation.html) them into a logical dataset.

> Multiple CDM datasets can be aggregated into a single, logical dataset. This is done with the aggregation NcML element. There are several types of aggregation:
> 
> * Union: The union of all the dimensions, attributes, and variables in multiple NetCDF files.
> * JoinExisting: Variables of the same name (in different files) are connected along their existing, outer dimension, called the aggregation dimension. A coordinate variable must exist for the dimension.
> * JoinNew: Variables of the same name (in different files) are connected along a new outer dimension. Each file becomes one slice of the new dimension. A new coordinate variable is created for the dimension.
> * ForecastModelRunCollection (FMRC): A collection of forecast model runs with two time coordinates: a run time and a forecast time.

Pros:
* There is an [Iris extension for NcML](https://github.com/rockdoc/rockdoc.github.io/wiki/NcML-File-Loader-for-Iris) written by an AVD developer which, even if it no longer works, at least shows that it can be done.
* Master files are human readable and (if not too large) editable.
* NcML seems simple enough that, if no existing tools for creating the files suit us, we could probably hand write/script our own.
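
To give a feel for how little is needed to script our own, here is a minimal sketch that writes a `joinExisting` aggregation using only the Python standard library. The function name, file names and dimension name are hypothetical; the element and attribute names follow my reading of the NcML aggregation docs linked above and should be checked against them.

```python
import xml.etree.ElementTree as ET

# NcML 2.2 namespace as used in the Unidata documentation.
NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"


def join_existing_ncml(locations, dim_name):
    """Return NcML text joining `locations` along the existing `dim_name`."""
    ET.register_namespace("", NCML_NS)  # serialise with a default namespace
    root = ET.Element(f"{{{NCML_NS}}}netcdf")
    agg = ET.SubElement(
        root,
        f"{{{NCML_NS}}}aggregation",
        {"dimName": dim_name, "type": "joinExisting"},
    )
    for loc in locations:
        # One child <netcdf> element per file being aggregated.
        ET.SubElement(agg, f"{{{NCML_NS}}}netcdf", {"location": loc})
    return ET.tostring(root, encoding="unicode")


# Hypothetical file names, joined along an existing 'time' dimension.
print(join_existing_ncml(["part_0.nc", "part_1.nc"], dim_name="time"))
```

A real script would probably glob the data directory and sort the files by time before writing the aggregation.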

Cons:
* Supported tools to work with it are in Java.
* XML is verbose and less storage efficient (though it could be compressed).
* XML is harder to put in Elasticsearch or similar if we want to make a searchable catalogue (but this might be a separate issue).


