# How to use `earthdata-varinfo` to publish UMM-Var records to CMR

This notebook demonstrates how to create and publish Unified Metadata Model-Variable (UMM-Var) records to NASA's Common Metadata Repository (CMR) with `earthdata-varinfo` >= 2.0.0.

There are three main workflows described in this notebook:

* The use of a single overarching function `generate_collection_umm_var`, which:
  * Uses [python-cmr](https://github.com/nasa/python_cmr) to query CMR for collection granules.
  * Downloads one of these granules to the local machine.
  * Uses `VarInfoFromNetCDF4` to parse in-file metadata from the granule.
  * Creates UMM-Var JSON objects for each of the variables found in the downloaded granule.
  * (Optionally) Publishes these UMM-Var objects to CMR (in a specified environment: `OPS`, `UAT` or `SIT`).
* Performing the same workflow as above, but using individual functions and classes to perform each step in isolation.
* Publication of a single UMM-Var record.

It is recommended to use `generate_collection_umm_var` for most use-cases. However, if a local file already exists on the machine running this notebook, or a collection doesn't yet have granule metadata, then the second workflow can be used to skip the initial steps of identifying and downloading a granule.

### Setting up your environment to run this notebook

**Recommended option:**

Create and activate your `pyenv` or conda environment, then:

```
pip install earthdata-varinfo
```

**Alternative:**

If this doesn't work, you can instead clone the git repository and install the package in editable mode:

```
git clone https://github.com/nasa/earthdata-varinfo
cd earthdata-varinfo
pip install -e .
```

### Other notebook requirements:

When installing `earthdata-varinfo` via PyPI, required packages should automatically be installed as dependencies.
For local development without a standard pip installation, third-party requirements can be installed from the following files:

```
pip install -r requirements.txt -r dev-requirements.txt
pip install notebook
```

### Authorization:

This notebook uses two types of tokens for authentication with external resources.

* Launchpad tokens are required to query for and publish metadata records to CMR. The `Authorization` header for these tokens does not include an HTTP authentication scheme, so the value for the `Authorization` header is as follows:
  * `<Launchpad token>`
* Earthdata Login (EDL) tokens are used to download granule files. EDL tokens use the `Bearer` authentication scheme, meaning the `Authorization` header is as follows:
  * `Bearer <EDL token>`

To request a Launchpad Token visit:
* [Launchpad Authentication User's Guide](https://wiki.earthdata.nasa.gov/display/CMR/Launchpad+Authentication+User%27s+Guide)
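
As a minimal sketch of the two header formats described above, the helpers below build the `Authorization` value for each token type. The function names `launchpad_auth_header` and `edl_auth_header` are hypothetical and are not part of `earthdata-varinfo`; the token strings are placeholders.

```python
def launchpad_auth_header(token: str) -> str:
    """Hypothetical helper: a Launchpad token is sent as-is, with no scheme prefix."""
    return token.strip()


def edl_auth_header(token: str) -> str:
    """Hypothetical helper: an EDL token is sent with the Bearer scheme."""
    return f'Bearer {token.strip()}'


# Placeholder tokens, not real credentials.
cmr_headers = {'Authorization': launchpad_auth_header('<Launchpad token>')}
download_headers = {'Authorization': edl_auth_header('<EDL token>')}

print(cmr_headers)
print(download_headers)
```

Keeping the two formats in separate helpers makes it harder to accidentally prefix a Launchpad token with `Bearer`.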

### UMM-Var native IDs:

UMM records have a native ID that is required for publication of any record. `earthdata-varinfo` implements the following scheme for native IDs:

```
<collection_concept_id>-<variable path>

# e.g.:
C1234567890-PROV-variable_name

# Or for a nested variable:
C1234567890-PROV-variable_group_variable_path
```

Using `earthdata-varinfo` multiple times to generate UMM-Var records for the same collection will update the existing records, rather than creating duplicate UMM-Var records for the same variables.
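
The native ID scheme above can be sketched as follows. This is an illustration inferred from the examples only: the helper name `build_native_id` is hypothetical, and the flattening rule (strip the leading slash, replace remaining slashes with underscores) is an assumption rather than the library's confirmed implementation.

```python
def build_native_id(collection_concept_id: str, variable_path: str) -> str:
    """Hypothetical helper: join a collection concept ID and a flattened variable path.

    Assumption: group separators ('/') in the path become underscores.
    """
    flattened_path = variable_path.strip('/').replace('/', '_')
    return f'{collection_concept_id}-{flattened_path}'


print(build_native_id('C1234567890-PROV', '/variable_name'))
print(build_native_id('C1234567890-PROV', '/variable_group/variable_path'))
```

Because the ID depends only on the collection concept ID and the variable path, regenerating records for the same file yields the same native IDs, which is what allows updates instead of duplicates.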

### Examples in this notebook:

* Using `generate_collection_umm_var`: [GLDAS_NOAH10_3H](https://cmr.uat.earthdata.nasa.gov/search/collections.umm_json?concept-id=C1256543837-EEDTEST)
* Using individual functions: [M2I1NXASM](https://cmr.uat.earthdata.nasa.gov/search/collections.umm_json?concept-id=C1256535511-EEDTEST)
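
The collection links above follow a simple pattern: the UAT search root, the `collections.umm_json` endpoint, and a `concept-id` query parameter. A minimal sketch of building such a link is shown below. The helper name `collection_umm_json_url` is hypothetical, and the `CMR_UAT` value is redefined locally (matching the UAT URL used in this notebook) so the sketch runs without `python-cmr` installed.

```python
from urllib.parse import urlencode

# Local copy of the UAT search root used in this notebook's links.
CMR_UAT = 'https://cmr.uat.earthdata.nasa.gov/search/'


def collection_umm_json_url(concept_id: str, cmr_env: str = CMR_UAT) -> str:
    """Hypothetical helper: build a UMM-JSON collection lookup URL."""
    query = urlencode({'concept-id': concept_id})
    return f'{cmr_env}collections.umm_json?{query}'


print(collection_umm_json_url('C1256543837-EEDTEST'))
print(collection_umm_json_url('C1256535511-EEDTEST'))
```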

## Workflow 1: Using the single `generate_collection_umm_var` function (recommended):

**This option is recommended if you have a collection in CMR with granules, and want the simplest workflow**

This example shows how to publish UMM-Var records for **GLDAS_NOAH10_3H** with `generate_collection_umm_var`. `generate_collection_umm_var` is a wrapper function that combines the functionality of individual classes and functions of `earthdata-varinfo`, including: `varinfo.cmr_search`, `VarInfoFromNetCDF4` and `varinfo.umm_var`.

`generate_collection_umm_var` will:

* Query CMR to find the collection specified and links to granules in that collection.
* Download the most recent granule for **GLDAS_NOAH10_3H**.
* Parse the in-file metadata for the downloaded granule.
* Generate the UMM-Var records from the parsed file information.
* Publish these records to CMR if `publish=True`.
* If `publish=True`, a list containing the ingested variable concept-ids, or the error(s) from any unsuccessful ingest, is returned:
    * `['V1259971755-EEDTEST', 'V1259971757-EEDTEST', ...]` 
    * `['V1259971755-EEDTEST', '#: CMR error 1\n  #: CMR error 2', ...]`
* If `publish=False` (default) a list of UMM-Var entries is returned:
    * `[...{'Name': 'lat', 'LongName': 'lat', ...}, {'Name': 'time', 'LongName': 'time', ...}...]`
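
Since a published result list can mix concept-ids and error strings, it may help to separate the two before reporting. The sketch below is one way to do that. The helper `split_publication_results` is hypothetical, and the regular expression assumes variable concept-ids look like the examples above (`V`, digits, a hyphen, then the provider name).

```python
import re

# Assumed shape of a variable concept-id, based on the examples in this notebook.
VARIABLE_CONCEPT_ID = re.compile(r'V\d+-\w+')


def split_publication_results(results):
    """Hypothetical helper: separate concept-ids from CMR error strings."""
    concept_ids = [item for item in results if VARIABLE_CONCEPT_ID.fullmatch(item)]
    errors = [item for item in results if not VARIABLE_CONCEPT_ID.fullmatch(item)]
    return concept_ids, errors


example_results = [
    'V1259971755-EEDTEST',
    '#: CMR error 1\n  #: CMR error 2',
    'V1259971757-EEDTEST',
]
concept_ids, errors = split_publication_results(example_results)

print(f'{len(concept_ids)} published, {len(errors)} failed')
for error in errors:
    print(error)
```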


**Customising the cell below for a different collection:**

The following cell specifies the collection concept ID of **GLDAS_NOAH10_3H** (from the `EEDTEST` CMR provider). 
This can be updated to any concept-id for any provider.

Update `auth_header` in the cell below to include your Launchpad token.
An optional configuration file can be passed via `config_file` to override the default configuration.

In [None]:
from varinfo.generate_umm_var import generate_collection_umm_var


auth_header = '<Launchpad token>'
collection_concept_id_gldas = 'C1256543837-EEDTEST'
test_config_file = 'tests/unit/data/test_config.json'
generate_collection_umm_var(
    collection_concept_id=collection_concept_id_gldas,
    auth_header=auth_header,
    publish=True,
    config_file=test_config_file,
)

## Workflow 2: Publishing CMR records with lower-level functions:

**This workflow is primarily for collections without granules or when a file already exists on a local machine.**

The following example will create and publish UMM-Var entries for **M2I1NXASM**. It does so by using the individual pieces of functionality wrapped by `generate_collection_umm_var`.

* `varinfo.cmr_search`: queries CMR for a granule download link and downloads granules locally
* `VarInfoFromNetCDF4`: varinfo parent class that represents the contents of a granule
* `varinfo.umm_var`: contains functions for creating and publishing UMM-Var records to CMR
* `CMR_UAT` is a string constant (e.g. https://cmr.uat.earthdata.nasa.gov/search/) of a CMR environment

First import the individual functions and classes required:

In [None]:
from cmr import CMR_UAT

from varinfo import VarInfoFromNetCDF4
from varinfo.cmr_search import download_granule, get_granules, get_granule_link
from varinfo.umm_var import get_all_umm_var, publish_all_umm_var, publish_umm_var

Next define the CMR concept ID of the collection that will have UMM-Var records generated. In this example, the `collection_concept_id` used is for the **M2I1NXASM** collection in the EEDTEST provider, but this can be updated to a collection concept ID from any provider.

In [None]:
collection_concept_id_merra = 'C1256535511-EEDTEST'

Get the granule record and granule download URL with `get_granules` and `get_granule_link`:

* `get_granules`: queries `CMR_UAT` (default is `CMR_OPS`) for a UMM-G record (granule record) given a collection or granule concept-id
    * you can query any CMR environment by adding `cmr_env=CMR_UAT` or `cmr_env=CMR_SIT`
* `get_granule_link`: parses the UMM-G record from `get_granules` for a data download URL

**This step can be skipped if a granule file is already present on your machine.**

In [None]:
granule_response = get_granules(
    concept_id=collection_concept_id_merra, cmr_env=CMR_UAT, auth_header=auth_header
)

url = get_granule_link(granule_response)
print(url)

Download the granule locally with `download_granule`:
* Defaults to current directory
* Add optional argument `out_directory=/path/to/save/granule` to save to specified path
* Returns the path the granule was downloaded to (e.g. `/path/granule/was/saved/to`)

**This step can be skipped if a granule file is already present on your machine.**
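
To decide whether the download can be skipped, one option is to check for an existing local copy first. The sketch below assumes the downloaded file keeps the final path segment of the URL as its file name, which is an assumption and not confirmed behaviour of `download_granule`. The helper `find_local_granule` and the example URL are hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def find_local_granule(url: str, directory: str = '.'):
    """Hypothetical helper: return the path of an existing local copy, else None.

    Assumption: the local file name matches the last segment of the URL path.
    """
    file_name = Path(urlparse(url).path).name
    candidate = Path(directory) / file_name
    return candidate if candidate.is_file() else None


# Illustrative URL only; the host and path are placeholders.
example_url = 'https://example.com/data/MERRA2_400.inst1_2d_asm_Nx.20220130.nc4'
local_path = find_local_granule(example_url)

if local_path is None:
    print('No local copy found; the granule would need to be downloaded.')
else:
    print(f'Reusing local granule: {local_path}')
```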

In [None]:
download_granule(url, auth_header=auth_header)

**Start here if you have a local granule file already.**

Instantiate a `VarInfoFromNetCDF4` object for a local NetCDF-4 file. This will parse the in-file metadata for the specified NetCDF-4 file, including relationships between variables (such as coordinates, bounds, and dimensions).

In [None]:
var_info = VarInfoFromNetCDF4(
    'MERRA2_400.inst1_2d_asm_Nx.20220130.nc4', short_name='M2I1NXASM'
)

Alternatively, instantiate a `VarInfoFromNetCDF4` object with an optional configuration file, which overrides the default configuration.

In [None]:
test_config_file = 'tests/unit/data/test_config.json'
var_info = VarInfoFromNetCDF4(
    'MERRA2_400.inst1_2d_asm_Nx.20220130.nc4',
    short_name='M2I1NXASM',
    config_file=test_config_file,
)

Retrieve a dictionary of UMM-Var JSON records:
* Returns a nested dictionary of UMM-Var records with full variable paths as keys and their UMM-Var records as values
* e.g. `{'/lon': {'Name': 'lon', 'LongName': 'lon', ...}, '/lat': {'Name': 'lat', 'LongName': 'lat', ...}...}`
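
Because the dictionary is keyed by full variable path, standard dictionary operations can be used to pick out a subset of records, for example to review or publish only a few variables. The sketch below uses a truncated example dictionary in the shape shown above. The helper `select_umm_var` is hypothetical and not part of `earthdata-varinfo`.

```python
def select_umm_var(umm_var_records: dict, variable_paths: list) -> dict:
    """Hypothetical helper: keep only the records for the requested full paths."""
    missing = [path for path in variable_paths if path not in umm_var_records]
    if missing:
        raise KeyError(f'Variables not found: {missing}')
    return {path: umm_var_records[path] for path in variable_paths}


# Truncated example; real UMM-Var records contain many more fields.
example_records = {
    '/lon': {'Name': 'lon', 'LongName': 'lon'},
    '/lat': {'Name': 'lat', 'LongName': 'lat'},
}
subset = select_umm_var(example_records, ['/lat'])
print(subset)
```

Raising an error for unknown paths, rather than silently skipping them, makes a mistyped variable path visible before anything is published.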

In [None]:
umm_var_dict = get_all_umm_var(var_info)
print(umm_var_dict)

Publish all UMM-Var records for **M2I1NXASM** to CMR_UAT with `publish_all_umm_var`:
* Returns a dictionary with full variable paths as keys and the corresponding variable concept-ids as values.
* Example output: `{'/lon': 'V1259972387-EEDTEST', '/lat': 'V1259972389-EEDTEST'...}`

In [None]:
publish_all_umm_var(
    collection_concept_id_merra, umm_var_dict, auth_header=auth_header, cmr_env=CMR_UAT
)

## Workflow 3: Publishing a single UMM-Var record:

**This workflow is for updating or creating a single UMM-Var record.**

This example is another alternative to using `generate_collection_umm_var`. In this example we use a locally downloaded granule (**M2I1NXASM**) to create and ingest a single UMM-Var record for a variable of interest.
* Use `var_info.get_variable()` to retrieve the variable object from `var_info`
* Keys are the full variable paths (e.g. `'/TROPPV'`)

In [None]:
from cmr import CMR_UAT

from varinfo import VarInfoFromNetCDF4
from varinfo.umm_var import get_umm_var, publish_umm_var

First parse the local file, and from it identify the variable of interest:

In [None]:
var_info = VarInfoFromNetCDF4(
    'MERRA2_400.inst1_2d_asm_Nx.20220130.nc4', short_name='M2I1NXASM'
)

variable = var_info.get_variable('/TROPPV')

Check if the variable exists and, if so, get a dictionary of the variable's UMM-Var JSON record:

In [None]:
if variable is not None:
    umm_var_entry = get_umm_var(var_info, variable)
else:
    # Define the name anyway so the final expression does not raise a NameError
    umm_var_entry = None
    print('Selected variable was not found in granule')

umm_var_entry

Publish the UMM-Var record for `TROPPV` (from **M2I1NXASM**) to CMR_UAT with `publish_umm_var`. This will return a variable concept-id (e.g. `'V1259972421-EEDTEST'`).

In [None]:
publish_umm_var(
    collection_concept_id_merra, umm_var_entry, auth_header=auth_header, cmr_env=CMR_UAT
)