# Science variables, required fields, `VarInfo` and PyDAP

_Used as a sprint demonstration for 20.2.2, on 2020-06-01._

See also: https://wiki.earthdata.nasa.gov/display/SITC/Science+Variables%2C+Required+Fields%2C+VarInfo+and+PyDAP

### Use case:

An end-user wants to subset a datafile to contain just a few science variables of interest, for follow-on processing (e.g., spatial subsetting, reprojection, further analysis).

**Elaboration:**

The end-user does not necessarily know about the dimensions, dimension-scale variables, coordinates, ancillary control or reference data, etc.  In particular, for ICESat-2, it can become complicated to trace subsetting and processing dependencies, including some cases of nested or recursive dependencies.  Further, with 1,000+ sub-datasets within the data files, it is not practical to manually curate this data without supporting tools.

**Approach:**

We should not create UMM-Var records for every variable within these data files. Rather, we should focus on the science datasets that carry the instrument data, and exclude the supporting datasets: those that provide geolocation and other organizational support, and the metadata for the granule and its sub-groups. We call the supporting datasets that a requested variable depends on the "required fields" for that variable.

**Example:**

![Metadata for `gt3r_land_segments_terrain_h_te_best_fit`](figures/variable_attributes_2020-06-01.png)

In the above example, requesting the `/gt3r/land_segments/terrain/h_te_best_fit` variable should result in the following list of required variables:

```python
required_variables = {'/gt3r/land_segments/terrain/h_te_best_fit',
                      '/gt3r/land_segments/delta_time',
                      '/gt3r/land_segments/latitude',
                      '/gt3r/land_segments/longitude'}
```

**Principal Basis for Automation:**

* Initial approximation: variables with coordinate references are science variables; those without are metadata.
* First refinement: all variables that are referred to by other variables (e.g., as dimension scales, coordinates or ancillary variables) should be excluded from the list of science variables.
* Second refinement: a configuration file is often necessary to apply CF conventions where they are missing, and to supplement the CF attributes, in some cases extensively, so that the attributes needed for subsetting and other operations are available.

Note: while writing the configuration file can be challenging, it is easier than manually curating the data or embedding complex rules within the subsetters.
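
The initial approximation and first refinement can be sketched on a toy example. The `toy_references` mapping and the variable paths in it are hand-written assumptions for illustration, not values read from a granule, and `classify` is a hypothetical helper rather than part of `VarInfo`.

```python
from typing import Dict, Set, Tuple

# Hypothetical metadata: full variable path -> fully qualified paths that the
# variable references through its `coordinates` attribute or its dimensions.
toy_references: Dict[str, Set[str]] = {
    '/gt1r/terrain/h_te_best_fit': {'/gt1r/delta_time', '/gt1r/latitude',
                                    '/gt1r/longitude'},
    '/gt1r/latitude': {'/gt1r/delta_time'},
    '/gt1r/longitude': {'/gt1r/delta_time'},
    '/gt1r/delta_time': set(),
    '/orbit_info/rgt': set(),
}


def classify(references: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """ Split variable names into (science, metadata) sets. """
    # Initial approximation: variables that reference others are candidates.
    with_references = {name for name, refs in references.items() if refs}
    # First refinement: anything referred to by another variable is excluded.
    referred_to = set().union(*references.values())

    science = with_references - referred_to
    metadata = set(references) - with_references - referred_to
    return science, metadata


science, metadata = classify(toy_references)
print(sorted(science))
print(sorted(metadata))
```

In this toy case only `/gt1r/terrain/h_te_best_fit` is classed as a science variable, because `latitude` and `longitude` carry references of their own but are also referred to by another variable.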

In [1]:
from typing import Dict, Set, Tuple
import re

from pydap.cas.urs import setup_session
from pydap.client import open_url
from pydap.model import BaseType

In [2]:
class Variable:
    """ A class to represent a single variable within the dmr or dmrpp file
        representing a granule.

    """

    def __init__(self, variable: BaseType):
        """ Create Variable object containing information compatible with
            UMM-Var records.

        """
        self.data_type = variable.dtype.name
        self.long_name = variable.attributes.get('long_name')
        self.definition = variable.attributes.get('description')
        self.scale = variable.attributes.get('scale', 1)
        self.offset = variable.attributes.get('offset', 0)
        self.acquisition_source_name = variable.attributes.get('source')
        self.units = variable.attributes.get('units')
        self.full_name_path = variable.attributes.get('fullnamepath')

        (self.group_path, self.name) = self._extract_group_and_name(variable)
        self.coordinates = self._extract_coordinates(variable)
        self.dimensions = self._extract_dimensions(variable)

        self.fill_value = variable.attributes.get('_FillValue')
        self.valid_max = variable.attributes.get('valid_max')
        self.valid_min = variable.attributes.get('valid_min')

    def _extract_coordinates(self, variable: BaseType) -> Set[str]:
        """ Check the variable attributes for one named 'coordinates'. From
            this attribute, retrieve the set of fully qualified coordinate
            datasets.

        """
        coordinates_string = variable.attributes.get('coordinates')

        if coordinates_string is not None:
            raw_coordinates = re.split(r'\s+|,\s*', coordinates_string)
            coordinates = self._qualify_references(raw_coordinates)
        else:
            coordinates = set()

        return coordinates


    def _extract_dimensions(self, variable: BaseType) -> Set[str]:
        """ Find the dimensions for the variable in question. Note, this will
            only return a set of fully qualified paths to the dimension, not
            a set of UMM-Var compatible objects.

        """
        return self._qualify_references(variable.dimensions)

    def _qualify_references(self, raw_references: Tuple[str]) -> Set[str]:
        """ Take a tuple of local references to other datasets, and prepend
            the group path, if it isn't already present in the reference.

        """
        if self.group_path is not None:
            references = {self._construct_absolute_path(reference)
                          if reference.startswith('../')
                          else f'{self.group_path}/{reference}'
                          if not reference.startswith(self.group_path)
                          else reference
                          for reference in raw_references}
        else:
            references = set(raw_references)

        return references

    def _construct_absolute_path(self, reference: str) -> str:
        """ For a relative reference to another variable (e.g. '../latitude'),
            construct an absolute path by combining the reference with the
            group path of the variable.

        """
        relative_prefix = '../'
        group_path_pieces = self.group_path.split('/')

        while reference.startswith(relative_prefix):
            reference = reference[len(relative_prefix):]
            group_path_pieces.pop()

        absolute_path = group_path_pieces + [reference]
        return '/'.join(absolute_path)

    def _extract_group_and_name(self, variable: BaseType) -> Tuple[str]:
        """ Check if the 'fullnamepath' attribute is defined. If so, derive the
            group and local name of the variable.

        """
        if self.full_name_path is not None:
            split_full_path = self.full_name_path.split('/')
            name = split_full_path.pop(-1)
            group_path = '/'.join(split_full_path) or None
        else:
            name = variable.name
            group_path = None

        return group_path, name

In [3]:
class VarInfo:
    """ A class to represent the full dataset of a granule, having read
        information from the dmr or dmrpp for it.

    """

    def __init__(self, dmr_file_url: str):
        """ Distinguish between variables containing references to other
            datasets, and those that do not. The former are considered science
            variables, providing they are not considered coordinates or
            dimensions for another variable.

            Unlike NCInfo, in SwotRepr, each variable contains references to
            its specific coordinates and dimensions, allowing the retrieval
            of all required variables for a specified list of science
            variables.

        """
        self.metadata_variables: Dict[str, Variable] = {}
        self.variables_with_coordinates: Dict[str, Variable] = {}
        self.ancillary_data: Set[str] = set()
        self.coordinates: Set[str] = set()
        self.dimensions: Set[str] = set()
        self.metadata = {}

        self._read_dataset_from_dmr(dmr_file_url)

    def _read_dataset_from_dmr(self, file_url: str):
        """ This method parses the specified dmr file. """
        self.dmr_input = open_url(file_url)

        for variable in self.dmr_input.values():
            variable_object = Variable(variable)
            if variable_object.coordinates is not None:
                self.coordinates.update(variable_object.coordinates)
                self.variables_with_coordinates[variable_object.full_name_path] = variable_object
            else:
                self.metadata_variables[variable_object.full_name_path] = variable_object

            if variable_object.dimensions is not None:
                self.dimensions.update(variable_object.dimensions)

    def get_science_variables(self) -> Set[str]:
        """ Retrieve set of names for all variables that have coordinate
            references, that are not themselves used as dimensions, coordinates
            or ancillary data for another variable.

        """
        return (set(self.variables_with_coordinates.keys()) - self.dimensions -
                self.coordinates - self.ancillary_data)

    def get_metadata_variables(self) -> Set[str]:
        """ Retrieve set of names for all variables that do not have
            coordinate references, that are not themselves used as dimensions,
            coordinates or ancillary data for another variable.

        """
        return (set(self.metadata_variables.keys()) - self.dimensions -
                self.coordinates - self.ancillary_data)

    def get_required_variables(self, requested_variables: Set[str]) -> Set[str]:
        """ Retrieve requested variables and recursively search for all
            associated dimension and coordinate variables. The returned set
            should be the union of the science variables, coordinates and
            dimensions.

        """
        required_variables: Set[str] = set()
        # Work on a copy, so that the set passed in by the caller is not
        # emptied as variables are popped off it.
        variables_to_check = set(requested_variables)

        while len(variables_to_check) > 0:
            variable_name = variables_to_check.pop()
            variable = (self.variables_with_coordinates.get(variable_name) or
                        self.metadata_variables.get(variable_name))

            if variable is not None:
                # Add variable. Enqueue coordinates and dimensions not already
                # present in required set.
                required_variables.add(variable_name)
                variables_to_check.update(
                    variable.coordinates.difference(required_variables)
                )
                variables_to_check.update(
                    variable.dimensions.difference(required_variables)
                )

        return required_variables

## Retrieve data from test server:

`VarInfo` uses the `pydap` package to request a dataset object from an OPeNDAP server. For development purposes, the granule used here is hosted on a test server rather than stored in the cloud.

Once the `pydap.client.open_url` request is completed, `VarInfo` iterates through all the variables present in the retrieved data, and creates a `Variable` object for each one. Each `Variable` records information such as the variables listed in its `coordinates` attribute, and its dimension variables.

`VarInfo` stores these objects in dictionaries of variables with and without coordinate references, and maintains sets of the fully qualified path names of all coordinates and dimensions (with definitions described above).
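
As a rough illustration of how a `coordinates` attribute becomes a set of fully qualified paths, the sketch below resolves each reference against the group of the variable. It uses `posixpath` from the standard library instead of the manual path handling in `Variable`, and the attribute string in the example is an assumed value, not one read from the granule.

```python
import posixpath
import re
from typing import Set


def qualify_references(group_path: str, coordinates_attribute: str) -> Set[str]:
    """ Resolve each name in a `coordinates` string against `group_path`.

        `qualify_references` is a hypothetical stand-alone helper. Relative
        references such as '../latitude' are collapsed by `normpath`, and
        references that are already absolute are left unchanged by `join`.

    """
    names = [name
             for name in re.split(r'[\s,]+', coordinates_attribute)
             if name]

    return {posixpath.normpath(posixpath.join(group_path, name))
            for name in names}


# Assumed attribute value, for illustration only:
qualified = qualify_references('/gt3r/land_segments/terrain',
                               '../delta_time, ../latitude ../longitude')
print(sorted(qualified))
```

With this assumed input, the three references resolve to paths in the parent group, `/gt3r/land_segments`.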

In [4]:
dataset = VarInfo('http://test.opendap.org/opendap/hyrax/slav/ATL08_20181016124656_02730110_002_01.h5')

### Display a list of all variables with coordinate attributes:

The `variables_with_coordinates` attribute is a dictionary that stores each `Variable` object, keyed by its fully qualified variable name (e.g. `/gt3r/land_segments/terrain/h_te_best_fit`, not `h_te_best_fit`).

In [5]:
dataset.variables_with_coordinates.keys()

dict_keys(['/orbit_info/cycle_number', '/orbit_info/orbit_number', '/orbit_info/lan', '/orbit_info/sc_orient', '/orbit_info/rgt', '/gt1r/signal_photons/classed_pc_flag', '/gt1r/signal_photons/classed_pc_indx', '/gt1r/signal_photons/d_flag', '/gt1r/signal_photons/ph_segment_id', '/gt1r/land_segments/segment_watermask', '/gt1r/land_segments/delta_time_end', '/gt1r/land_segments/rgt', '/gt1r/land_segments/dem_flag', '/gt1r/land_segments/msw_flag', '/gt1r/land_segments/cloud_flag_atm', '/gt1r/land_segments/snr', '/gt1r/land_segments/canopy/h_canopy', '/gt1r/land_segments/canopy/canopy_rh_conf', '/gt1r/land_segments/canopy/h_median_canopy_abs', '/gt1r/land_segments/canopy/h_min_canopy', '/gt1r/land_segments/canopy/h_mean_canopy_abs', '/gt1r/land_segments/canopy/h_median_canopy', '/gt1r/land_segments/canopy/h_canopy_abs', '/gt1r/land_segments/canopy/toc_roughness', '/gt1r/land_segments/canopy/h_min_canopy_abs', '/gt1r/land_segments/canopy/h_dif_canopy', '/gt1r/land_segments/canopy/h_canopy_q

### Science variables:

The method used below retrieves only those variables with a `coordinates` reference that are not themselves listed as a coordinate or dimension of another variable. It takes the set difference between all variables with coordinate attributes and those listed as dimensions or coordinates.

This does not currently implement a configuration file, which is upcoming work for sprint 20.2.3.
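
One possible shape for such a configuration is a list of path patterns that should never be reported as science variables. The sketch below is purely hypothetical: the `EXCLUDED_SCIENCE_PATTERNS` name, the patterns themselves and the `apply_exclusions` helper are assumptions for illustration, not the planned design.

```python
import re
from typing import Iterable, List, Set

# Hypothetical configuration content: regular expressions matching the paths
# of variables to drop from the science variable list.
EXCLUDED_SCIENCE_PATTERNS: List[str] = [r'^/ancillary_data/', r'^/orbit_info/']


def apply_exclusions(science_variables: Iterable[str],
                     patterns: Iterable[str]) -> Set[str]:
    """ Remove any variable whose path matches one of the patterns. """
    compiled = [re.compile(pattern) for pattern in patterns]

    return {name
            for name in science_variables
            if not any(regex.search(name) for regex in compiled)}


candidates = {'/ancillary_data/control',
              '/orbit_info/rgt',
              '/gt3r/land_segments/terrain/h_te_best_fit'}
filtered = apply_exclusions(candidates, EXCLUDED_SCIENCE_PATTERNS)
print(filtered)
```

Keeping such rules in configuration, rather than in the subsetter code, would follow the reasoning given earlier for preferring a configuration file over embedded rules.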

In [6]:
dataset.get_science_variables()

{'/ancillary_data/atlas_sdp_gps_epoch',
 '/ancillary_data/control',
 '/ancillary_data/data_end_utc',
 '/ancillary_data/data_start_utc',
 '/ancillary_data/end_cycle',
 '/ancillary_data/end_delta_time',
 '/ancillary_data/end_geoseg',
 '/ancillary_data/end_gpssow',
 '/ancillary_data/end_gpsweek',
 '/ancillary_data/end_orbit',
 '/ancillary_data/end_region',
 '/ancillary_data/end_rgt',
 '/ancillary_data/granule_end_utc',
 '/ancillary_data/granule_start_utc',
 '/ancillary_data/land/atl08_region',
 '/ancillary_data/land/bin_size_h',
 '/ancillary_data/land/bin_size_n',
 '/ancillary_data/land/bright_thresh',
 '/ancillary_data/land/ca_class',
 '/ancillary_data/land/can_noise_thresh',
 '/ancillary_data/land/can_stat_thresh',
 '/ancillary_data/land/canopy_flag_switch',
 '/ancillary_data/land/canopy_seg',
 '/ancillary_data/land/class_thresh',
 '/ancillary_data/land/cloud_filter_switch',
 '/ancillary_data/land/del_amp',
 '/ancillary_data/land/del_mu',
 '/ancillary_data/land/del_sigma',
 '/ancillary_

In [7]:
print(f'There are {len(dataset.variables_with_coordinates)} variables with coordinates.')
print(f'There are {len(dataset.get_science_variables())} science variables.')

There are 430 variables with coordinates.
There are 410 science variables.


### Required variables:

The cell below demonstrates the method that retrieves all required variables for an input set of variable names. The output includes all the requested variables plus their supporting coordinate and dimension variables.

The retrieval follows dependencies transitively. Starting from the requested variables, each required variable is checked in turn for further dependencies on other variables, until none remain unchecked.
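
The traversal can be sketched independently of `VarInfo` as a worklist over a dependency mapping. The `toy_dependencies` dictionary and its paths are made up for illustration, and `collect_required` is a hypothetical helper that mirrors the loop in `get_required_variables` under that assumption.

```python
from typing import Dict, Iterable, Set

# Hypothetical dependencies: variable path -> coordinates and dimensions.
toy_dependencies: Dict[str, Set[str]] = {
    '/group/science': {'/group/latitude', '/group/longitude'},
    '/group/latitude': {'/group/delta_time'},
    '/group/longitude': {'/group/delta_time'},
    '/group/delta_time': set(),
    '/group/unrelated': set(),
}


def collect_required(requested: Iterable[str],
                     dependencies: Dict[str, Set[str]]) -> Set[str]:
    """ Return the requested variables plus everything they depend on. """
    required: Set[str] = set()
    to_visit = set(requested)  # Copy, so the caller's collection is untouched.

    while to_visit:
        name = to_visit.pop()
        # Names that are not known variables are silently skipped.
        if name in dependencies and name not in required:
            required.add(name)
            to_visit.update(dependencies[name] - required)

    return required


required = collect_required({'/group/science'}, toy_dependencies)
print(sorted(required))
```

Here `/group/delta_time` is only reached through `latitude` and `longitude`, which shows a nested dependency being picked up, while `/group/unrelated` is never visited.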

In [8]:
dataset.get_required_variables({'/gt3r/land_segments/terrain/h_te_best_fit'})

{'/gt3r/land_segments/delta_time',
 '/gt3r/land_segments/latitude',
 '/gt3r/land_segments/longitude',
 '/gt3r/land_segments/terrain/h_te_best_fit'}