# earthdata-varinfo

Contact: owen.m.littlejohns@nasa.gov

### What is earthdata-varinfo?

* A NASA Earth Observing System Data and Information System (EOSDIS) maintained software package.
* A Pip-installable Python package that parses granule metadata from a number of input sources.
* Used extensively in 4 [Harmony](https://harmony.earthdata.nasa.gov) backend services (HOSS, Swath Projector, Trajectory Subsetter, Harmony Regridding Service).
* Leans heavily on the [Climate and Forecast metadata conventions](http://cfconventions.org/).
* Born from an effort to consolidate metadata and variable relationship parsing so that services being developed didn't have to repeat the same code.
* First internal release 2021-03-11, first PyPI release June 2023.
* Includes variable (UMM-Var) JSON generation support.
* Maintained by the NASA EOSDIS Transformation Train.
* Contributions welcome!

### Features

* Parses and extracts variable metadata from within source granules.
* Also parses relationships between variables, primarily using dimension information or CF-Convention attributes expected to contain such information.
* Variable classification via CF-Convention-based heuristics (e.g., using `units` and other attributes).
* Metadata can be supplemented or overwritten with a configuration file (which has a fully defined JSON schema).
* Extensible for further input formats (uses abstract base classes).
* Common Metadata Repository ([CMR](https://www.earthdata.nasa.gov/eosdis/science-system-description/eosdis-components/cmr)) compliant UMM-Var JSON generation.

### How to install earthdata-varinfo

```
pip install earthdata-varinfo
```

If that fails, clone the [git repository](https://github.com/nasa/earthdata-varinfo) and install the package in editable mode:

```
git clone https://github.com/nasa/earthdata-varinfo
cd earthdata-varinfo
pip install -e .
```

### Other notebook requirements:

When installing `earthdata-varinfo` via PyPI, required packages should automatically be installed as dependencies. For local development, without a standard pip installation, third-party requirements can be installed from the following files:

```
pip install -r requirements.txt -r dev-requirements.txt
pip install notebook
```


### Input formats

* OPeNDAP Dataset Metadata Response files (DMR).
* NetCDF-4 files (can also parse HDF-5 files, with varying success regarding variable relationships).

### Current output format

Currently `earthdata-varinfo` produces Python classes that can be used within Python scripts, Jupyter notebooks or a Python Read-Eval-Print Loop (REPL).

`earthdata-varinfo` is also able to create UMM-Var compliant JSON records.

### Basic classes:

**VarInfoFromNetCDF4 and VarInfoFromDMR:**

The main parent classes that represent the contents of a granule. They contain dictionaries of variable instances, and class methods to retrieve variables (e.g., "required variables" for a given set of requested variables).

**VariableFromNetCDF4 and VariableFromDMR:**

A representation of a single variable within a granule. Extracts metadata attributes from the input source and fully qualifies references to other variables (to allow determination of relationships with other variables).
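
To illustrate what "fully qualifying" a reference means, here is a minimal standard-library sketch. It is not the package's implementation: the `qualify_reference` helper is hypothetical, and it assumes a relative reference is resolved against the group containing the referring variable. The real resolution rules may differ.

```python
import posixpath


def qualify_reference(variable_path: str, reference: str) -> str:
    """Resolve a reference found in a variable's metadata to a full path.

    Hypothetical helper: absolute references are only normalised, while
    relative references are resolved against the referring variable's group.
    """
    if reference.startswith('/'):
        return posixpath.normpath(reference)

    group = posixpath.dirname(variable_path)
    return posixpath.normpath(posixpath.join(group, reference))


# A sibling reference, a parent-group reference and an absolute reference:
print(qualify_reference('/Grid/lat', 'lat_bnds'))
print(qualify_reference('/Grid/sub/data', '../time'))
print(qualify_reference('/Grid/lat', '/Grid/lon'))
```

Once every reference is a full path, two variables in different groups that mention `lat` can be told apart, which is what makes the relationship parsing described above possible.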

**CFConfig:**

One instance of `CFConfig` is associated with a single `VarInfoFromDMR` or `VarInfoFromNetCDF4` object. When declaring those classes, a short name is either supplied or searched for in the granule metadata. The `CFConfig` instance then retrieves any rules from the specified configuration file that apply to that collection and/or mission.

When each individual variable is parsed, any applicable rules from the configuration file are used to, for example, update metadata attribute values.
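
The rule-matching idea can be sketched without the package. The rule layout below (`short_name`, `variable_pattern`, `attributes`) is a simplified, hypothetical stand-in and not the real configuration file schema; the `apply_overrides` helper is likewise an assumption for illustration only.

```python
import re

# Hypothetical, simplified rules. This is NOT the real configuration schema.
EXAMPLE_RULES = [
    {
        'short_name': 'GPM_3IMERGHH',
        'variable_pattern': r'/Grid/.*',
        'attributes': {'grid_mapping': 'crs'},
    },
    {
        'short_name': 'OTHER_COLLECTION',
        'variable_pattern': r'.*',
        'attributes': {'units': 'K'},
    },
]


def apply_overrides(short_name, variable_path, attributes, rules):
    """Return a copy of `attributes` updated by every applicable rule."""
    updated = dict(attributes)
    for rule in rules:
        applies_to_collection = rule['short_name'] == short_name
        applies_to_variable = re.fullmatch(rule['variable_pattern'], variable_path)
        if applies_to_collection and applies_to_variable:
            updated.update(rule['attributes'])
    return updated


granule_attributes = {'units': 'mm/hr'}
print(apply_overrides('GPM_3IMERGHH', '/Grid/precipitationCal',
                      granule_attributes, EXAMPLE_RULES))
```

Only the first rule applies here, so the `units` value read from the granule is kept and a `grid_mapping` attribute is added. The input dictionary is copied rather than mutated, so the values originally read from the granule stay available.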


# Example usage:

Granules used:

* [GPM_3IMERGHH](https://cmr.uat.earthdata.nasa.gov/search/concepts/G1256265181-EEDTEST.umm_json)
* [GEDI L4A](https://cmr.uat.earthdata.nasa.gov/search/concepts/G1245557637-EEDTEST.umm_json)

The granules linked to above will not be circulated with this notebook, but can be downloaded via the `GET DATA` URLs in the UMM-G records.

In [None]:
import json

from varinfo import VarInfoFromNetCDF4

# Update the following paths to where you have downloaded the data using the links above:
gpm_granule_path = '/path/to/locally/saved/file/3B-HHR.MS.MRG.3IMERG.20200201-S233000-E235959.1410.V06B.HDF5'
gedi_l4a_granule_path = '/path/to/locally/saved/file/GEDI04_A_2021216232727_O14984_01_T04304_02_002_01_V002.h5'

### Instantiate VarInfoFromNetCDF4 for GPM_3IMERGHH collection:

In [None]:
gpm_imerg = VarInfoFromNetCDF4(gpm_granule_path, short_name='GPM_3IMERGHH')

### See which variables are in the granule:

In [None]:
gpm_imerg.get_all_variables()

### Inspect information on a single variable:

In [None]:
calibrated_precipitation = gpm_imerg.get_variable('/Grid/precipitationCal')
print('Variable attributes:')
print(calibrated_precipitation.attributes)

print('\n\nVariable references:')
print(calibrated_precipitation.get_references())

### Required variables:

One of the primary use-cases for `earthdata-varinfo` is traversing the relationships between variables. This is done (recursively) below:

In [None]:
gpm_imerg.get_required_variables({'/Grid/precipitationCal', })

In the output above, `/Grid/precipitationCal` has three dimensions: `/Grid/time`, `/Grid/lat` and `/Grid/lon`. These dimension variables also refer to their respective [bounds variables](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.10/cf-conventions.html#cell-boundaries). Because `get_required_variables` is recursive, the bounds variables are also considered required variables for `/Grid/precipitationCal`.
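
The traversal can be pictured as a walk over a graph of references. The sketch below is an iterative equivalent written with plain dictionaries, not the package's code. The `REFERENCES` mapping is hypothetical (including the assumption that a bounds variable refers back to its dimension), and in this sketch the requested variables are themselves part of the result.

```python
# Hypothetical reference graph: each variable maps to the variables it
# refers to directly (dimensions, bounds variables and similar).
REFERENCES = {
    '/Grid/precipitationCal': {'/Grid/time', '/Grid/lat', '/Grid/lon'},
    '/Grid/time': {'/Grid/time_bnds'},
    '/Grid/lat': {'/Grid/lat_bnds'},
    '/Grid/lon': {'/Grid/lon_bnds'},
    # Assumed back-reference, to show that cycles are handled:
    '/Grid/time_bnds': {'/Grid/time'},
}


def required_variables(requested, references):
    """Collect the requested variables plus everything they refer to."""
    required = set()
    to_visit = list(requested)
    while to_visit:
        variable = to_visit.pop()
        if variable in required:
            continue  # Already handled, which also breaks reference cycles.
        required.add(variable)
        to_visit.extend(references.get(variable, ()))
    return required


print(sorted(required_variables({'/Grid/precipitationCal'}, REFERENCES)))
```

Tracking visited variables matters because references are not guaranteed to form a tree: without the check, a dimension and its bounds variable pointing at each other would loop forever.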

### More variable relationship examples:

Users can supply a set of variable full paths to the following functions to retrieve only the spatial or temporal dimensions for a given variable (the methods below are present on both the `VarInfoFromNetCDF4` and `VarInfoFromDMR` classes):

* `get_geographic_spatial_dimensions` - Filters the retrieved set of recursively required dimensions to only return those that are considered geographic horizontal spatial dimensions per the [CF-Conventions](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.10/cf-conventions.html#latitude-coordinate). This means that the `units` metadata attribute is expected to have a value of `degrees_east` or `degrees_north`.
* `get_projected_spatial_dimensions` - Filters the retrieved set of recursively required dimensions to only return those that are considered projected horizontal spatial dimensions per the [CF-Conventions](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.10/cf-conventions.html#grid-mappings-and-projections). Specifically, the `standard_name` attribute is checked to see if it matches one of `projection_x_coordinate`, `projection_y_coordinate`, `projection_x_angular_coordinate` or `projection_y_angular_coordinate`.
* `get_spatial_dimensions` - Creates an output that combines results from both `get_geographic_spatial_dimensions` and `get_projected_spatial_dimensions`.
* `get_temporal_dimensions` - Filters the retrieved set of recursively required dimensions to only retrieve those that are considered temporal dimensions per the [CF-Conventions](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.10/cf-conventions.html#time-coordinate). This checks the `units` metadata attribute of each dimension for a string containing `' since '`.
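
The three checks described in the list above can be written as small predicates over an attribute dictionary. This is a minimal sketch of the stated heuristics only; the function names and the `DIMENSIONS` sample data are hypothetical, and the package's own checks may accept more values than the ones named here.

```python
GEOGRAPHIC_UNITS = {'degrees_east', 'degrees_north'}
PROJECTED_STANDARD_NAMES = {
    'projection_x_coordinate',
    'projection_y_coordinate',
    'projection_x_angular_coordinate',
    'projection_y_angular_coordinate',
}


def is_geographic(attributes):
    """Geographic dimensions are recognised by their `units` value."""
    return attributes.get('units') in GEOGRAPHIC_UNITS


def is_projected(attributes):
    """Projected dimensions are recognised by their `standard_name`."""
    return attributes.get('standard_name') in PROJECTED_STANDARD_NAMES


def is_temporal(attributes):
    """Temporal dimensions have `units` such as 'seconds since <epoch>'."""
    return ' since ' in str(attributes.get('units', ''))


# Hypothetical attributes for a handful of dimension variables:
DIMENSIONS = {
    '/Grid/lat': {'units': 'degrees_north'},
    '/Grid/lon': {'units': 'degrees_east'},
    '/Grid/time': {'units': 'seconds since 1970-01-01 00:00:00 UTC'},
    '/x': {'units': 'm', 'standard_name': 'projection_x_coordinate'},
}

spatial = {path for path, attrs in DIMENSIONS.items()
           if is_geographic(attrs) or is_projected(attrs)}
temporal = {path for path, attrs in DIMENSIONS.items() if is_temporal(attrs)}

print(sorted(spatial))
print(sorted(temporal))
```

Combining the geographic and projected predicates with `or` mirrors how `get_spatial_dimensions` is described as the union of the two more specific methods.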

In [None]:
print('Spatial dimensions for /Grid/precipitationCal')
print(gpm_imerg.get_spatial_dimensions({'/Grid/precipitationCal', }))

print('\nTemporal dimensions for /Grid/precipitationCal')
print(gpm_imerg.get_temporal_dimensions({'/Grid/precipitationCal', }))

### Grouping variables by dimensions:

One common task for variables in a NetCDF-4 file is to identify or group variables that share a common set of dimensions. For example, variables that are all mapped to the same spatiotemporal grid. `earthdata-varinfo` contains methods on the `VarInfoFromNetCDF4` and `VarInfoFromDMR` classes to facilitate this.

* `get_variables_with_dimensions` - Returns a set of variable paths for variables that have all of the listed input dimensions. Note - the returned variables might have other dimensions, too.
* `group_variables_by_dimensions` - Returns a dictionary structure, where the keys are tuples of dimensions and the values are full paths of the variables with exactly those dimensions.
* `group_variables_by_horizontal_dimensions` - Returns a dictionary structure, where the keys are tuples of dimensions and the values are full paths of the variables with those spatial dimensions. Note - the listed variables might have other dimensions in addition to the specified horizontal spatial dimensions, such that variables with dimensions `(time, latitude, longitude)` would be grouped with `(latitude, longitude)`.

The example below shows usage of the `group_variables_by_dimensions` method:

In [None]:
gpm_imerg.group_variables_by_dimensions()
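
The difference between exact grouping and "has at least these dimensions" can be shown with plain dictionaries. This is a sketch under stated assumptions, not the package's implementation: the helper names are hypothetical and the dimension tuples in `VARIABLE_DIMENSIONS` are illustrative rather than read from a granule.

```python
from collections import defaultdict

# Illustrative mapping of variable path to its ordered dimensions.
VARIABLE_DIMENSIONS = {
    '/Grid/precipitationCal': ('/Grid/time', '/Grid/lon', '/Grid/lat'),
    '/Grid/precipitationUncal': ('/Grid/time', '/Grid/lon', '/Grid/lat'),
    '/Grid/lat_bnds': ('/Grid/lat', '/Grid/latv'),
    '/Grid/lat': ('/Grid/lat',),
}


def group_by_dimensions(variable_dimensions):
    """Group variable paths by their exact tuple of dimensions."""
    groups = defaultdict(set)
    for path, dimensions in variable_dimensions.items():
        groups[tuple(dimensions)].add(path)
    return dict(groups)


def with_dimensions(variable_dimensions, wanted):
    """Find variables that have all wanted dimensions, and possibly more."""
    wanted = set(wanted)
    return {path for path, dimensions in variable_dimensions.items()
            if wanted.issubset(dimensions)}


for dimensions, paths in group_by_dimensions(VARIABLE_DIMENSIONS).items():
    print(dimensions, sorted(paths))

print(sorted(with_dimensions(VARIABLE_DIMENSIONS, {'/Grid/lat'})))
```

Using the full tuple as the dictionary key gives the "exactly those dimensions" behaviour, while the subset test matches every variable that includes `/Grid/lat`, whatever else it is dimensioned by.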

## UMM-Var JSON generation:

`earthdata-varinfo` is able to create UMM-Var compliant JSON records, supporting the following fields (items in bold are required fields):

* **Name**
* **LongName**
* **Definition**
* StandardName
* DataType
* Units
* Scale
* Offset
* Dimensions (although not all dimension types and size are captured)
* FillValues
* ValidRanges
* **MetadataSpecification**
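
As a quick sanity check before full schema validation, the required fields listed above can be tested for presence with a few lines of Python. The `missing_required_fields` helper and the sample record are hypothetical, and the record's values are placeholders rather than real UMM-Var content.

```python
REQUIRED_FIELDS = ('Name', 'LongName', 'Definition', 'MetadataSpecification')


def missing_required_fields(record):
    """List the required UMM-Var fields that are absent from a record."""
    return [field for field in REQUIRED_FIELDS if field not in record]


# Placeholder record: the values are illustrative only.
example_record = {
    'Name': 'Grid/precipitationCal',
    'LongName': 'Grid/precipitationCal',
    'Units': 'mm/hr',
}

print(missing_required_fields(example_record))
```

This only checks that the keys exist. It says nothing about whether the values are valid, which is what validation against the UMM-Var JSON schema is for.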

### For a single variable:

In [None]:
from varinfo import VarInfoFromNetCDF4
from varinfo.umm_var import get_all_umm_var, get_umm_var

var_info = VarInfoFromNetCDF4(gpm_granule_path, short_name='GPM_3IMERGHH')

# Get single UMM-Var record by variable name:
precipitation_variable = var_info.get_variable('/Grid/precipitationCal')
single_umm_var = get_umm_var(var_info, precipitation_variable)

print(json.dumps(single_umm_var, indent=2))

### For all variables in a granule:

Note: Non-variable dimensions (e.g., `/Grid/latv` for `/Grid/lat_bnds`) will be listed as dimensions within a UMM-Var record to ensure the UMM-Var record captures the true shape of the variable with such dimensions.

In [None]:
all_umm_var = get_all_umm_var(var_info)

print(json.dumps(all_umm_var, indent=2))

### Saving to disk

This was a requested feature, to enable manual inspection and editing of JSON files. Each UMM-Var record will be saved as a separate JSON file:

In [None]:
from glob import glob
from tempfile import mkdtemp

from varinfo.umm_var import export_all_umm_var_to_json


temp_dir = mkdtemp()

export_all_umm_var_to_json(all_umm_var.values(), output_dir=temp_dir)

print(f'Files stored in {temp_dir}\n')
print(json.dumps(glob(f'{temp_dir}/*json'), indent=2))

### UMM-Var validation:

In [None]:
from jsonschema import validate


# The path to the UMM-Var schema might need to be updated depending on
# the root path of your Jupyter notebook server.
# This file can also be obtained from:
# https://github.com/nasa/earthdata-varinfo/blob/main/tests/unit/data/umm_var_json_schema_1.8.2.json
with open('../tests/unit/data/umm_var_json_schema_1.8.2.json') as file_handler:
    umm_var_schema = json.load(file_handler)

for umm_var_record in all_umm_var.values():
    validate(schema=umm_var_schema, instance=umm_var_record)

### Proving bad records can be identified:

Adding `jsonschema.validate` to the unit tests helped identify a bug with a poorly read `scale` metadata attribute. The cell below will raise a validation error, as it updates the `Scale` property to be invalid per the UMM-Var schema.

In [None]:
bad_record = all_umm_var['/Grid/precipitationUncal'].copy()
bad_record['Scale'] = True

validate(schema=umm_var_schema, instance=bad_record)

### earthdata-varinfo UMM-Var future work:

The list below refers to potential improvements that can be made within the core Python package to improve the schema coverage of the generated UMM-Var records:

* Ensuring variables parsed from a DMR file also contain shape information, similar to current functionality of NetCDF-4 file parsing. This information is stored within a DMR in separate `<Dimension />` XML elements. This would allow UMM-Var JSON generated from DMR data to have sizes on their dimensions.
* Adding suitable heuristics to the `VariableBase` class to identify vertical spatial dimensions. Currently these map to the dimension type of "OTHER".
* Adding along- and across-track swath dimension identification heuristics to `VariableBase`.
* Improving the metadata for projected horizontal spatial dimensions. These are currently mapped to a dimension type of "OTHER". While they can be identified within the `Variable` classes, there is not currently an applicable UMM-Var option in the `DimensionType.Type` enumeration:
  * "LATITUDE_DIMENSION"
  * "LONGITUDE_DIMENSION"
  * "ALONG_TRACK_DIMENSION"
  * "CROSS_TRACK_DIMENSION"
  * "PRESSURE_DIMENSION"
  * "HEIGHT_DIMENSION"
  * "DEPTH_DIMENSION"
  * "TIME_DIMENSION"
  * "OTHER"
* Adding a mechanism to indicate that a variable size may vary between granules, beyond manual editing of generated UMM-Var JSON. This might be possible via comparing the parsing of multiple granules from the same collection.
* All fill values are currently denoted as "SCIENCE_FILLVALUES" - definitely interested in any heuristics to improve this.
* Implementing semantically anchored vocabulary standardisation.
* Considering how UMM-Var to UMM-Var associations can be identified, given that `earthdata-varinfo` identifies required variables via metadata attributes and dimensions.