# Associating collections with HOSS

This notebook describes the steps required to associate a new collection with the [Harmony OPeNDAP SubSetter (HOSS)](https://github.com/nasa/harmony-opendap-subsetter), and provides example requests that can be performed to ensure the collection is compatible with HOSS.

## Contact:

There are a number of NASA EOSDIS Slack channels dedicated to either HOSS or Harmony:

* `#variable_subsetter` - A HOSS-specific channel.
* `#harmony` - A place for all things Harmony.
* `#harmony-service-providers` - A place for backend service specific Harmony discussions.

Alternatively, reach out via email to: <owen.m.littlejohns@nasa.gov> or <david.p.auty@nasa.gov>.

## Notebook prerequisites:

This Jupyter notebook assumes it is running in an environment containing the following Python packages:

* [harmony-py](https://github.com/nasa/harmony-py) - used to make requests against Harmony.
* [notebook](https://pypi.org/project/notebook/) - used to run this notebook.
* [netCDF4](https://pypi.org/project/netCDF4/) - used by `xarray` to open NetCDF-4 files.
* [xarray](https://pypi.org/project/xarray/) - used to verify output.

This notebook also assumes that an end-user has a `.netrc` file configured on their local machine, which should contain an entry for the Earthdata Login environment that will be used for test requests. Such an entry will look like:

```
machine urs.earthdata.nasa.gov
    login <EDL username>
    password <EDL password>
```
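
Before making any requests, it can help to confirm that the `.netrc` file contains the expected entry. The following is a minimal sketch using the Python standard library. The `has_edl_entry` function name is an assumption made for illustration and is not part of `harmony-py`.

```python
from netrc import netrc


def has_edl_entry(netrc_path, host='urs.earthdata.nasa.gov'):
    """Return True if the file has a login and password for the host.

    Only an entry for exactly this host counts; a `default` entry in
    the file is deliberately ignored.
    """
    entry = netrc(str(netrc_path)).hosts.get(host)
    if entry is None:
        return False

    # Each entry is a (login, account, password) tuple.
    login, _account, password = entry
    return bool(login) and bool(password)
```

For example, `has_edl_entry(Path.home() / '.netrc')` (with `Path` imported from `pathlib`) checks the default location on most systems. Pass a different `host` if the test requests use a different Earthdata Login environment.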

## Collection prerequisites:

HOSS is primarily designed to perform variable, temporal, bounding box spatial and shape file spatial subsetting on gridded collections (L3/L4). It is expected that these collections contain 1-D grid dimension variables and metadata adhering to the [NetCDF Climate and Forecast (CF) metadata conventions](http://cfconventions.org/).

The requirements of HOSS include:

* The collection has been ingested to the cloud, either via Cumulus or within the UAT EEDTEST provider.
* Each granule in the collection is accessible via OPeNDAP, with a sidecar `.dmrpp` file and an OPeNDAP related URL in its UMM-G record.
* All gridded variables within the source data file contain named dimensions, which have accompanying 1-D dimension variables within the granule.
* Any variables that support gridded variables are indicated via the appropriate metadata attribute, conforming to the CF-Conventions. Examples include "coordinates", "bounds" and "grid_mapping".
* [Projection-gridded collections must have a variable that encapsulates the Coordinate Reference System (CRS) of the granule](http://cfconventions.org/Data/cf-conventions/cf-conventions-1.10/cf-conventions.html#grid-mappings-and-projections). Each science variable that uses this projected grid must have a "grid_mapping" metadata attribute that refers to this CRS variable.
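
The dimension requirement above can be illustrated with a small check. This is only a sketch: the dictionary layout and the `missing_dimension_variables` function are hypothetical, are not part of HOSS, and assume the granule structure has already been read into plain Python objects by some other means.

```python
def missing_dimension_variables(variables):
    """Map each variable to the dimensions lacking a 1-D dimension variable.

    `variables` maps the full path of each variable to a dictionary with
    a 'dimensions' list of full dimension paths (a hypothetical layout).
    A 1-D dimension variable is one whose only dimension is itself.
    """
    missing = {}
    for name, info in variables.items():
        absent = [dimension for dimension in info['dimensions']
                  if dimension not in variables
                  or variables[dimension]['dimensions'] != [dimension]]
        if absent:
            missing[name] = absent

    return missing


# Illustrative granule structure in which '/Grid/lat' has no variable:
granule_variables = {
    '/Grid/time': {'dimensions': ['/Grid/time']},
    '/Grid/lon': {'dimensions': ['/Grid/lon']},
    '/Grid/precipitationCal': {
        'dimensions': ['/Grid/time', '/Grid/lon', '/Grid/lat']
    },
}
print(missing_dimension_variables(granule_variables))
# prints {'/Grid/precipitationCal': ['/Grid/lat']}
```

An empty dictionary would indicate that every named dimension has an accompanying 1-D dimension variable.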

## Making a UMM-S to UMM-C association:

The Common Metadata Repository (CMR) offers the ability to associate entities from different providers. For this reason, it is recommended that operators associate their collections to UMM-S records provided by the maintainers of HOSS. [This can be most easily achieved via the Metadata Management Tool](https://wiki.earthdata.nasa.gov/display/CMR/Metadata+Management+Tool+%28MMT%29+User%27s+Guide#MetadataManagementTool(MMT)User'sGuide-AssociateaServicewithoneormoreCollectionsformyprovider) ([MMT](https://mmt.earthdata.nasa.gov)).

A collection will only need to be associated with a single UMM-S record:

| Service Name                | Environment | UMM-S Concept ID     | What to associate with it                             |
|:----------------------------|:-----------:|:--------------------:|:------------------------------------------------------|
| SDS/HOSS Geographic         | Production  | S2164732315-XYZ_PROV | L3/L4 geographically gridded collections              |
| SDS/HOSS Projection-gridded | Production  | S2300730272-XYZ_PROV | L3/L4 projection-gridded collections                  |
| SDS/HOSS Geographic         | UAT         | S1240682712-EEDTEST  | L3/L4 geographically gridded collections              |
| SDS/HOSS Projection-gridded | UAT         | S1245117629-EEDTEST  | L3/L4 projection-gridded collections                  |
| sds-variable-subsetter      | UAT         | S1237976118-EEDTEST  | Non-gridded collections requiring variable subsetting |

These records should be visible via a search of service records in the appropriate version of MMT (either production or UAT). If they do not show up, please contact the maintainers of HOSS via the `#variable_subsetter` channel in the NASA EOSDIS Slack workspace.

Please note that Harmony makes use of the UAT instance of CMR in both its UAT _and_ SIT environments, as well as in any local testing via Harmony-in-a-Box. Associating a collection to HOSS in UAT also makes it available for use with HOSS in SIT and locally.
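
The choice of UMM-S record can be summarised as a lookup keyed on the environment and the type of collection. The sketch below copies the concept IDs from the table above; the key names (`'geographic'`, `'projected'`, `'variable-only'`) and the `hoss_service_concept_id` function are labels invented for this illustration.

```python
# Concept IDs copied from the table of UMM-S records in this notebook.
HOSS_SERVICES = {
    ('production', 'geographic'): 'S2164732315-XYZ_PROV',
    ('production', 'projected'): 'S2300730272-XYZ_PROV',
    ('uat', 'geographic'): 'S1240682712-EEDTEST',
    ('uat', 'projected'): 'S1245117629-EEDTEST',
    ('uat', 'variable-only'): 'S1237976118-EEDTEST',
}

# Harmony SIT and local (Harmony-in-a-Box) instances use the UAT CMR.
CMR_ENVIRONMENTS = {'production': 'production', 'uat': 'uat',
                    'sit': 'uat', 'local': 'uat'}


def hoss_service_concept_id(harmony_environment, collection_type):
    """Return the UMM-S concept ID to associate with a collection."""
    cmr_environment = CMR_ENVIRONMENTS[harmony_environment.lower()]
    key = (cmr_environment, collection_type)
    if key not in HOSS_SERVICES:
        raise ValueError(f'No UMM-S record listed for {key}')

    return HOSS_SERVICES[key]


print(hoss_service_concept_id('sit', 'geographic'))
# prints S1240682712-EEDTEST
```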

### What is the "Variable Subsetter"?

The table above has a separate entry for a Variable Subsetter service. This was the initial name for HOSS, as the first version of the service only offered variable subsetting via OPeNDAP as a transformation option. More capabilities have since been added to HOSS, including spatial and temporal subsetting. Currently, the Harmony UAT instance maintains an `sds-variable-subsetter` service, which uses the same Docker image as HOSS, but only accepts parameters that define a variable subset. This service is retained for use with collections that are not gridded (that is, not L3 or L4).

Upon migration to the NASA open-source GitHub organisation, efforts were taken to name the service and associated artefacts with terms directly relating to HOSS, as this is the form of the service that is available in production and is primarily what data curators are interested in using.

## Verifying the output:

The following sections of this notebook assume that a UMM-C to UMM-S association has been added between the collection to be tested and the appropriate HOSS UMM-S record, as described above.

Because collections vary in their variable content, the notebook below does not attempt to plot any output. Instead, consider using a tool like [Panoply](https://www.giss.nasa.gov/tools/panoply/download/) for visual verification of output.

### Import required functions and classes:

In [None]:
from datetime import datetime
from os import replace

from harmony import BBox, Client, Collection, Environment, Request
import xarray as xr

### Configure the notebook to test the associated collection:

The values in the following cells should be set as described. Example values for the GPM/IMERG half-hourly precipitation collection have been entered as a guide.

First, select the environment containing the collection. This should be `Environment.PROD`, `Environment.UAT` or `Environment.SIT`.

In [None]:
harmony_environment = Environment.UAT

Next, enter the UMM-C concept ID for the collection that has been associated with HOSS:

In [None]:
collection_concept_id = 'C1245618475-EEDTEST'

Enter the name of a variable that is within each granule of the collection. This should be the full path to the variable. Note: for some files without hierarchy, the full path may not need a leading slash.

In [None]:
variable_to_subset = '/Grid/precipitationCal'

Define a temporal range that should match at least one granule in the test collection:

In [None]:
temporal_range = {'start': datetime(2020, 1, 1, 0, 0, 0),
                  'stop': datetime(2020, 1, 31, 23, 59, 59)}

Define a bounding box within the coverage of the collection data:

In [None]:
bounding_box = BBox(w=-50, s=30, e=-20, n=60)
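
A quick sanity check of the bounding box values can catch transposed arguments before a request is submitted. The sketch below works on plain numbers rather than the `BBox` object, and `bounding_box_problems` is a hypothetical helper rather than anything provided by `harmony-py`.

```python
def bounding_box_problems(west, south, east, north):
    """Return a list of problems found with the bounding box, if any."""
    problems = []
    for name, latitude in (('south', south), ('north', north)):
        if not -90 <= latitude <= 90:
            problems.append(f'{name} must be between -90 and 90 degrees')

    for name, longitude in (('west', west), ('east', east)):
        if not -180 <= longitude <= 180:
            problems.append(f'{name} must be between -180 and 180 degrees')

    if south >= north:
        problems.append('south must be less than north')

    # west > east is not flagged, as it may describe a box that
    # crosses the antimeridian.
    return problems


print(bounding_box_problems(-50, 30, -20, 60))
# prints []
```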

Finally, specify a path to a local GeoJSON file that defines a shape file for spatial subsetting:

In [None]:
shape_file_path = 'shape_files/bermuda_triangle.geo.json'
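
Because the shape file is read from the local machine, it can be worth confirming that the path points to parseable GeoJSON before submitting a request. This is a minimal sketch using only the standard library. It checks only the top-level `type` member against the object types defined by the GeoJSON specification, and `geojson_top_level_type` is a name made up for this example.

```python
import json

# The nine object types defined by the GeoJSON specification (RFC 7946).
GEOJSON_TYPES = {'Point', 'MultiPoint', 'LineString', 'MultiLineString',
                 'Polygon', 'MultiPolygon', 'GeometryCollection',
                 'Feature', 'FeatureCollection'}


def geojson_top_level_type(path):
    """Return the top-level GeoJSON type of a file, or raise ValueError."""
    with open(path, encoding='utf-8') as file_handler:
        content = json.load(file_handler)

    geojson_type = content.get('type') if isinstance(content, dict) else None
    if geojson_type not in GEOJSON_TYPES:
        raise ValueError(f'{path} does not look like GeoJSON')

    return geojson_type
```

Calling `geojson_top_level_type(shape_file_path)` should return a value such as `'FeatureCollection'` for a valid file. This does not validate the geometry itself.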

After this point, none of the remaining cells should need to be updated.

### Set up a client with Harmony:

In [None]:
harmony_client = Client(env=harmony_environment)
collection = Collection(id=collection_concept_id)

### Extract variable parent group, for `xarray`:

In [None]:
variable_group = '/'.join(variable_to_subset.split('/')[:-1])
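
To illustrate what the expression above produces, the sketch below wraps it in a function (the `parent_group` name is only for this illustration) and applies it to paths with and without hierarchy. For a variable in the root group the result is an empty string, which the netCDF4 backend of `xarray` is understood to treat as the root group.

```python
def parent_group(variable_path):
    """Return the path of the group containing the variable."""
    return '/'.join(variable_path.split('/')[:-1])


print(repr(parent_group('/Grid/precipitationCal')))  # prints '/Grid'
print(repr(parent_group('/precipitationCal')))       # prints ''
print(repr(parent_group('precipitationCal')))        # prints ''
```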

### A variable subset:

This request will limit the returned variables to the one specified as `variable_to_subset`. The output will also include any supporting variables required to make a valid output. These include 1-D dimension variables or bounds variables.

In [None]:
# Define the request:
variable_subset_request = Request(collection=collection, variables=[variable_to_subset], max_results=1)

# Submit the request and download the results
variable_subset_job_id = harmony_client.submit(variable_subset_request)
harmony_client.wait_for_processing(variable_subset_job_id, show_progress=True)
variable_subset_outputs = [file_future.result()
                           for file_future
                           in harmony_client.download_all(variable_subset_job_id, overwrite=True)]

replace(variable_subset_outputs[0], 'hoss_variable_subset.nc4')

# Inspect the results:
with xr.open_dataset('hoss_variable_subset.nc4', group=variable_group) as dataset:
    print(dataset)

### A temporal subset:

This request will limit the temporal range of the output to only include pixels that cover the specified range. If the collection that has been associated does not contain a temporal grid dimension, then this will only act as a filter on the granules identified in CMR.

In [None]:
# Define the request:
temporal_subset_request = Request(collection=collection, temporal=temporal_range,
                                  variables=[variable_to_subset], max_results=1)

# Submit the request and download the results
temporal_subset_job_id = harmony_client.submit(temporal_subset_request)
harmony_client.wait_for_processing(temporal_subset_job_id, show_progress=True)
temporal_subset_outputs = [file_future.result()
                           for file_future
                           in harmony_client.download_all(temporal_subset_job_id, overwrite=True)]

replace(temporal_subset_outputs[0], 'hoss_temporal_subset.nc4')

# Inspect the results:
with xr.open_dataset('hoss_temporal_subset.nc4', group=variable_group) as dataset:
    print(dataset)

### A bounding box spatial subset:

This request will limit the spatial extent of the returned output. This request will be fulfilled differently depending on the UMM-S record associated with the data. This can be observed via the `/workflow-ui` endpoint of Harmony.

* SDS/HOSS Geographic:
  * `query-cmr` to filter granules to those with matching spatial coverage.
  * `ghcr.io/nasa/harmony-opendap-subsetter` to perform HOSS operations and extract a rectangular portion of the longitude latitude grid. This will match the bounding box.
* SDS/HOSS Projection-gridded:
  * `query-cmr` to filter granules to those with matching spatial coverage.
  * `ghcr.io/nasa/harmony-opendap-subsetter` to perform HOSS operations and extract a rectangular portion of the x, y grid. There will be pixels requiring filling in this output.
  * `ghcr.io/nasa/harmony-maskfill` to fill any pixels in the rectangular array segment, but outside the bounding box.

In [None]:
# Define the request:
bbox_subset_request = Request(collection=collection, spatial=bounding_box, max_results=1)

# Submit the request and download the results
bbox_subset_job_id = harmony_client.submit(bbox_subset_request)
harmony_client.wait_for_processing(bbox_subset_job_id, show_progress=True)
bbox_subset_outputs = [file_future.result()
                       for file_future
                       in harmony_client.download_all(bbox_subset_job_id, overwrite=True)]

replace(bbox_subset_outputs[0], 'hoss_bbox_subset.nc4')

# Inspect the results:
with xr.open_dataset('hoss_bbox_subset.nc4', group=variable_group) as dataset:
    print(dataset)

### A polygon spatial subset:

This request will limit the spatial extent of the returned output to the shape defined in the GeoJSON file. It will be fulfilled in three steps, which can be observed via the `/workflow-ui` endpoint of Harmony.

* `query-cmr` to filter granules to those with matching spatial coverage.
* `ghcr.io/nasa/harmony-opendap-subsetter` to perform HOSS operations and extract a rectangular portion of the longitude latitude grid. This will minimally encompass the user-defined GeoJSON shape.
* `ghcr.io/nasa/harmony-maskfill` to fill any pixels in the rectangular array segment, but outside the GeoJSON shape.

In [None]:
# Define the request:
shape_file_subset_request = Request(collection=collection, shape=shape_file_path, max_results=1)

# Submit the request and download the results
shape_file_subset_job_id = harmony_client.submit(shape_file_subset_request)
harmony_client.wait_for_processing(shape_file_subset_job_id, show_progress=True)
shape_file_subset_outputs = [file_future.result()
                             for file_future
                             in harmony_client.download_all(shape_file_subset_job_id, overwrite=True)]

replace(shape_file_subset_outputs[0], 'hoss_shape_file_subset.nc4')

# Inspect the results:
with xr.open_dataset('hoss_shape_file_subset.nc4', group=variable_group) as dataset:
    print(dataset)
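
### A possible helper for repeated requests:

The four requests above share the same submit, wait, download and rename pattern. If more requests are needed, that pattern could be wrapped in a function. The sketch below is one way to do so, assuming a client object that offers the same `submit`, `wait_for_processing` and `download_all` methods used in the cells above. The `run_request` name is hypothetical.

```python
from os import replace


def run_request(client, request, output_path):
    """Submit a request, wait for it, and save the first output file."""
    job_id = client.submit(request)
    client.wait_for_processing(job_id, show_progress=True)

    # Each item from download_all is a future resolving to a local path.
    downloaded = [file_future.result()
                  for file_future
                  in client.download_all(job_id, overwrite=True)]

    if not downloaded:
        raise RuntimeError(f'Harmony job {job_id} produced no output files')

    replace(downloaded[0], output_path)
    return output_path
```

With this in place, the variable subset above could be written as `run_request(harmony_client, variable_subset_request, 'hoss_variable_subset.nc4')`. Only the first output file is kept, matching the `max_results=1` setting used throughout this notebook.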