<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Getting-Started" data-toc-modified-id="Getting-Started-1">Getting Started</a></span><ul class="toc-item"><li><span><a href="#Parameters" data-toc-modified-id="Parameters-1.1">Parameters</a></span></li><li><span><a href="#Requirements" data-toc-modified-id="Requirements-1.2">Requirements</a></span><ul class="toc-item"><li><span><a href="#Earthdata-Authentication" data-toc-modified-id="Earthdata-Authentication-1.2.1">Earthdata Authentication</a></span></li></ul></li></ul></li><li><span><a href="#Hands-off-workflow" data-toc-modified-id="Hands-off-workflow-2">Hands-off workflow</a></span><ul class="toc-item"><li><span><a href="#1.-Retrieve-a-new-granule-listing-from-CMR" data-toc-modified-id="1.-Retrieve-a-new-granule-listing-from-CMR-2.1">1. Retrieve a new granule listing from CMR</a></span></li><li><span><a href="#1.-Check-the-status-of-the-local-data-directory" data-toc-modified-id="1.-Check-the-status-of-the-local-data-directory-2.2">1. Check the status of the local <code>data</code> directory</a></span></li><li><span><a href="#3.-Reconcile-local-data-with-current-CMR-granules" data-toc-modified-id="3.-Reconcile-local-data-with-current-CMR-granules-2.3">3. Reconcile local <code>data</code> with current CMR granules</a></span></li><li><span><a href="#Finishing-up" data-toc-modified-id="Finishing-up-2.4">Finishing up</a></span></li></ul></li></ul></div>

# Access Sentinel-6 Data

This notebook shows a simple way to download granules using the [CMR Search API](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html). 

> **User note:**
>  The notebook actually points to a MODIS SST collection for now ([https://doi.org/10.5067/GHMDA-2PJ19](https://doi.org/10.5067/GHMDA-2PJ19)). It will work just the same for Sentinel-6.

## Getting Started

### Parameters

This workflow/notebook can be used to download Sentinel-6 data granules using configurable CMR Search queries.

The variables in the cell below determine the workflow behavior:

* `params`: A dictionary of CMR Search API query parameters for the *granules* endpoint. 
  * May be passed to the notebook as a JSON string with [papermill](https://papermill.readthedocs.io). 
  * Read about the many granules search options in the [CMR Search API documentation](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html)
* `data`: The path to a local directory in which to download/maintain a copy of the NRT granule time series.
* `cmr`: The domain of the target CMR instance, either `cmr.earthdata.nasa.gov` or `cmr.uat.earthdata.nasa.gov`.
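
As a rough illustration of the first sub-bullet, the sketch below decodes a JSON string of query parameters and lays it over a dictionary of defaults. The helper name `merge_params` and the sample values are assumptions made for this example; they are not part of this notebook or of papermill.

```python
from json import loads


def merge_params(defaults: dict, overrides_json: str) -> dict:
    """Decode a JSON object and overlay it on the defaults (hypothetical helper)."""
    overrides = loads(overrides_json)
    if not isinstance(overrides, dict):
        raise ValueError("expected a JSON object of CMR query parameters")
    merged = {**defaults, **overrides}
    # Drop unset fields so they are not sent to CMR as empty parameters.
    return {key: value for key, value in merged.items() if value is not None}


# Sample defaults: two set fields and one unset field.
defaults = {"ShortName": "MODIS_T-JPL-L2P-v2019.0", "cycle": None, "page_size": 20}
merged = merge_params(defaults, '{"cycle": 230, "page_size": 50}')
print(merged)
```

Values in the JSON string win over the defaults, and any field still set to `None` after merging is dropped.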

This dictionary covers most of the useful granule search options:

In [1]:
#
# This cell accepts parameters from command line with papermill: 
#  https://papermill.readthedocs.io
#
# These variables should be set before the first run, then they 
#  should be left alone. All subsequent runs expect the values 
#  for params, cmr, data to be unchanged.
#

params = {

    # TARGET COLLECTION:
    'ShortName': "MODIS_T-JPL-L2P-v2019.0",       # "MODIS_T-JPL-L2P-v2019.0"
    'collection_concept_id': None,                # "C1234724471-POCLOUD"

    # TEMPORAL COVERAGE ("START,END"):
    'temporal': "2019-01-01T00:00:00Z,2019-01-10T00:00:00Z",

    # CMR/INGEST EVENT:
    'created_at': None,
    'updated_since': None,
    'production_date': None,
    'revision_date': None,
    
    # ORBIT/ACQUISITION REFERENCE ID:
    'orbit_number': None,
    'equator_crossing_longitude': None,
    'equator_crossing_date': None,
    'day_night_flag': None,                       # "day"/"night"/"unspecified"
    'cycle': None,                                # 230
    'passes[0][pass]': None,                      # 1
    'passes[1][pass]': None,                      # 2
    'passes[2][pass]': None,                      # 3

    # SPATIAL COVERAGE: 
    'bounding_box': "-150,55,-145,60",            # "-146.5,57.5,-146,58"
    'polygon': None,                              # "10,10,30,10,30,20,10,20,10,10"
    'point': None,                                # "100,20"
    'line': None,                                 # "-0.37,-14.07,4.75,1.27,25.13,-15.51"
    'circle': None,                               # "-87.629717,41.878112,1000"
    'shapefile': None,

    # API BEHAVIOR (Prob leave these alone.)
    "scroll": "true",
    "page_size": 20,
    "sort_key": "-start_date",

}

cmr = "cmr.uat.earthdata.nasa.gov"                # "cmr.earthdata.nasa.gov"

data = "resources/detailed"

The variable `data` points at a nearby folder, [`resources/detailed`](resources/detailed/), by default. You should change `data` to a suitable download path on your file system. An unlucky sequence of git commands could delete that folder and its downloads if you're not careful, so change it before the first run.

If replacement `params` were NOT provided from the command line (as a JSON string; see [papermill](https://papermill.readthedocs.io)), clean up the dictionary of defaults by dropping all fields set to `None`:

In [2]:
params = {par: val for par, val in params.items() if val is not None}
params

{'ShortName': 'MODIS_T-JPL-L2P-v2019.0',
 'temporal': '2019-01-01T00:00:00Z,2019-01-10T00:00:00Z',
 'bounding_box': '-150,55,-145,60',
 'scroll': 'true',
 'page_size': 20,
 'sort_key': '-start_date'}

### Requirements
All package imports are in the Python 3 standard library:

In [3]:
from netrc import netrc
from getpass import getpass
from json import dumps, load, loads
from os import listdir, makedirs
from os.path import basename, isdir, isfile, join
from datetime import datetime, timedelta
from http.cookiejar import CookieJar
from urllib.error import HTTPError, URLError
from urllib.parse import urlencode
from urllib import request

#### Earthdata Authentication

Before you begin this tutorial, make sure you have an Earthdata account: [https://urs.earthdata.nasa.gov](https://urs.earthdata.nasa.gov) for the operations environment (most common) or [https://uat.urs.earthdata.nasa.gov](https://uat.urs.earthdata.nasa.gov) for the UAT environment. Accounts are free to create and take just a moment to set up.

We need some boilerplate up front to log in to Earthdata Login.  The function below will allow Python
scripts to log into any Earthdata Login application programmatically.  To avoid being prompted for
credentials every time you run and also allow clients such as curl to log in, you can add the following
to a `.netrc` (`_netrc` on Windows) file in your home directory:

```
machine uat.urs.earthdata.nasa.gov
    login <your username>
    password <your password>
```

Make sure that this file is only readable by the current user or you will receive an error stating
"netrc access too permissive."

`$ chmod 0600 ~/.netrc` 
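
To check that an entry is readable before running the workflow, the standard-library `netrc` module can parse the file directly. The sketch below uses a throwaway file with made-up credentials instead of the real `~/.netrc`; the helper name `lookup_credentials` is hypothetical.

```python
import os
import tempfile
from netrc import netrc


def lookup_credentials(host: str, path: str):
    """Return (login, password) for host from the netrc file at path, or None."""
    entry = netrc(path).authenticators(host)
    if entry is None:
        return None
    login, _account, password = entry
    return login, password


# Demonstrate with a temporary file rather than the real ~/.netrc.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "netrc-sample")
    with open(sample, "w") as f:
        f.write(
            "machine uat.urs.earthdata.nasa.gov\n"
            "    login demo_user\n"
            "    password demo_pass\n"
        )
    os.chmod(sample, 0o600)  # Same permissions recommended for ~/.netrc.
    found = lookup_credentials("uat.urs.earthdata.nasa.gov", sample)
    missing = lookup_credentials("urs.earthdata.nasa.gov", sample)

print(found, missing)
```

A `None` result means the host has no entry in the file, which is the case in which the notebook falls back to prompting for credentials.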

*You'll need to authenticate using the netrc method when running from command line with [`papermill`](https://papermill.readthedocs.io/en/latest/). You can log in manually by executing the cell below when running in the notebook client in your browser.*

In [4]:
def setup_earthdata_login_auth(endpoint):
    """
    Set up the request library so that it authenticates against the given Earthdata Login
    endpoint and is able to track cookies between requests.  This looks in the .netrc file 
    first and if no credentials are found, it prompts for them.

    Valid endpoints include:
        uat.urs.earthdata.nasa.gov - Earthdata Login UAT (Harmony's current default)
        urs.earthdata.nasa.gov - Earthdata Login production
    """
    try:
        username, _, password = netrc().authenticators(endpoint)
    except (FileNotFoundError, TypeError):
        # FileNotFound = There's no .netrc file
        # TypeError = The endpoint isn't in the netrc file, causing the above to try unpacking None
        print('Please provide your Earthdata Login credentials to allow data access')
        print('Your credentials will only be passed to %s and will not be exposed in Jupyter' % (endpoint))
        username = input('Username:')
        password = getpass()

    manager = request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, endpoint, username, password)
    auth = request.HTTPBasicAuthHandler(manager)

    jar = CookieJar()
    processor = request.HTTPCookieProcessor(jar)
    opener = request.build_opener(auth, processor)
    request.install_opener(opener)

Authenticate with either `uat.urs.earthdata.nasa.gov` or `urs.earthdata.nasa.gov` depending on which CMR instance was selected in the parameters cell.

In [5]:
if cmr == "cmr.uat.earthdata.nasa.gov":
    setup_earthdata_login_auth('uat.urs.earthdata.nasa.gov')
elif cmr == "cmr.earthdata.nasa.gov":
    setup_earthdata_login_auth('urs.earthdata.nasa.gov')
else:
    raise Exception(f"ERROR: The CMR base url is invalid ({cmr})")
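
The same pairing of CMR domain and Earthdata Login host can also be written as a lookup table, which keeps the two supported environments in one place. This is only an alternative sketch; the names `URS_FOR_CMR` and `urs_endpoint` are made up for illustration.

```python
# Map each supported CMR domain to its Earthdata Login (URS) host.
URS_FOR_CMR = {
    "cmr.earthdata.nasa.gov": "urs.earthdata.nasa.gov",
    "cmr.uat.earthdata.nasa.gov": "uat.urs.earthdata.nasa.gov",
}


def urs_endpoint(cmr_domain: str) -> str:
    """Return the Earthdata Login host that matches a CMR domain (hypothetical helper)."""
    try:
        return URS_FOR_CMR[cmr_domain]
    except KeyError:
        raise ValueError(f"The CMR base url is invalid ({cmr_domain})") from None


print(urs_endpoint("cmr.uat.earthdata.nasa.gov"))
```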

## Hands-off workflow

### 1. Retrieve a new granule listing from CMR

Get the query parameters as a string and then the complete search url:

In [6]:
query = urlencode(params)
url = f"https://{cmr}/search/granules.umm_json?{query}"
print(url)

https://cmr.uat.earthdata.nasa.gov/search/granules.umm_json?ShortName=MODIS_T-JPL-L2P-v2019.0&temporal=2019-01-01T00%3A00%3A00Z%2C2019-01-10T00%3A00%3A00Z&bounding_box=-150%2C55%2C-145%2C60&scroll=true&page_size=20&sort_key=-start_date
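
The percent-encoded characters in the printed url (`%3A` for `:` and `%2C` for `,`) come from `urlencode`. As a quick sanity check, a url built this way can be decoded back into its parameters with `parse_qs`. The sketch below uses a reduced copy of the parameters; the `sample_` names are placeholders.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# A reduced copy of the query parameters, for illustration only.
sample_params = {
    "temporal": "2019-01-01T00:00:00Z,2019-01-10T00:00:00Z",
    "bounding_box": "-150,55,-145,60",
    "page_size": 20,
}

sample_url = (
    "https://cmr.uat.earthdata.nasa.gov/search/granules.umm_json?"
    + urlencode(sample_params)
)

# parse_qs reverses the encoding; every value comes back as a list of strings.
decoded = parse_qs(urlsplit(sample_url).query)
print(decoded["temporal"][0])
```

Note that `parse_qs` returns strings, so the integer `page_size` comes back as `"20"`.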


In [7]:
try:
    with request.urlopen(url) as response:
        timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")  # Record the time of the request.
        results = loads(response.read().decode())                 # Read-Decode-Load results from JSON.
        scroll = response.getheader("Cmr-Scroll-Id")              # Get response header for 'Cmr-Scroll-Id'.
except (HTTPError, URLError) as e:
    raise e
else:
    # Reuse the scroll id so that follow-up requests return the next page of results.
    req = request.Request(url, headers={'Client-Id': "S6", 'CMR-Scroll-Id': scroll})

print(f"Status after one request: ({len(results['items'])}/{results['hits']})")

Status after one request: (20/100)


Submit as many requests as needed to get all the target granule records.

In [8]:
while int(results['hits']) != len(results['items']):
    try:
        with request.urlopen(req) as response:
            aresults = loads(response.read().decode())
    except (HTTPError, URLError) as e:
        raise e
    else:
        # Stop when a page has no 'items' (missing or empty) to avoid looping forever.
        if not aresults.get('items'):
            break
        results['items'].extend(aresults['items'])

print(f"Status after request loop: ({len(results['items'])}/{results['hits']})")

Status after request loop: (100/100)
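
The paging logic can be exercised without touching the network by swapping the HTTP request for any callable that returns one page at a time. The sketch below is a simplified stand-in, not the notebook's code: `collect_items` and `fake_pages` are hypothetical names, and the fake records carry no real CMR metadata.

```python
from typing import Callable, Dict, List


def collect_items(fetch_page: Callable[[], Dict], hits: int) -> List[dict]:
    """Call fetch_page until `hits` items are collected or a page comes back empty."""
    items: List[dict] = []
    while len(items) < hits:
        page = fetch_page().get("items", [])
        if not page:
            break  # No more records; stop even if fewer than `hits` arrived.
        items.extend(page)
    return items


# Stand-in for CMR: two pages of fake records, then an empty page.
fake_pages = iter([
    {"items": [{"n": 1}, {"n": 2}]},
    {"items": [{"n": 3}]},
    {"items": []},
])

collected = collect_items(lambda: next(fake_pages), hits=5)
print(len(collected))
```

Here the loop ends on the empty third page with three records, even though five were expected, so a short result set cannot cause an endless loop.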


Neatly print the `RelatedUrls` field of the first granule record (if at least one was returned):

In [9]:
if len(results['items']) >= 1:
    print(dumps(results['items'][0]['umm']['RelatedUrls'], indent=2))

[
  {
    "URL": "https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20190109213501-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.0.nc",
    "Type": "GET DATA",
    "Description": "The base directory location for the granule."
  },
  {
    "URL": "https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-public/MODIS_T-JPL-L2P-v2019.0/20190109213501-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.0.cmr.json",
    "Type": "EXTENDED METADATA",
    "Description": "File to download"
  },
  {
    "URL": "https://archive.podaac.uat.earthdata.nasa.gov/s3credentials",
    "Type": "VIEW RELATED INFORMATION",
    "Description": "api endpoint to retrieve temporary credentials valid for same-region direct s3 access"
  }
]
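
Each entry in `RelatedUrls` is a small dictionary, and the `Type` field distinguishes the data file from metadata and service links. A minimal sketch of picking out the access urls follows, assuming the same rule used later in this notebook ("GET DATA" for HTTPS, the word "opendap" for OPeNDAP). The helper name `select_urls` and the `example.org` urls are placeholders.

```python
def select_urls(related_urls):
    """Return (https, opendap) access urls found in a UMM 'RelatedUrls' list."""
    https, opendap = None, None
    for ru in related_urls:
        url = ru.get("URL", "")
        if ru.get("Type") == "GET DATA":
            https = url
        # OPeNDAP links are recognized by the word 'opendap' in the url or description.
        if "opendap" in (url + ru.get("Description", "")).lower():
            opendap = url.replace(".html", "")
    return https, opendap


# Placeholder records that mimic the structure printed above.
sample_related_urls = [
    {"URL": "https://example.org/protected/granule.nc", "Type": "GET DATA"},
    {"URL": "https://example.org/public/granule.cmr.json", "Type": "EXTENDED METADATA"},
    {"URL": "https://example.org/opendap/granule.nc.html", "Type": "USE SERVICE API",
     "Description": "OPeNDAP request URL"},
]

print(select_urls(sample_related_urls))
```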


### 2. Check the status of the local `data` directory

The target directory for our granule downloads is defined by the `data` variable.

Make the output directory at the path given by `data`, if it doesn't already exist. If it *does* exist, look for a file `.history` inside and load it.

In [10]:
history_file = join(data, ".history")
history_data = {}

if not isdir(data):
    print(f"info:\tCreate the target `data` directory ({data})")
    makedirs(data)
else:
    print(f"info:\tTarget `data` directory already exists ({data})")
    if not isfile(history_file):
        print(f"warn:\tNo history file found in data directory ({history_file})")
    else:
        with open(history_file, "r") as f:
            history_data = load(f)
        print(f"info:\tThere are currently '{len(history_data)}' granules in local `data`.")

info:	Target `data` directory already exists (resources/detailed)
info:	There are currently '100' granules in local `data`.
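
The same check can be wrapped in a function that always returns a dictionary, which makes the first run and later runs look identical to the caller. This is a sketch under that assumption; `load_history` is a hypothetical helper, and the demonstration runs in a temporary directory.

```python
import json
import os
import tempfile


def load_history(directory: str) -> dict:
    """Create `directory` if needed and return its parsed .history file (or {})."""
    os.makedirs(directory, exist_ok=True)
    history_path = os.path.join(directory, ".history")
    if not os.path.isfile(history_path):
        return {}  # First run, or the history file was removed.
    with open(history_path, "r") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "granules")
    first = load_history(target)  # Directory is created; no history yet.
    with open(os.path.join(target, ".history"), "w") as f:
        json.dump({"granule-a": {"revision-id": 1}}, f)
    second = load_history(target)  # History is now read from disk.

print(first, second)
```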


### 3. Reconcile local `data` with current CMR granules

We'll iterate over the granule records returned by our most recent search(es) and download all the new ones that weren't downloaded during previous runs. If a granule *was* downloaded in a previous run, we'll download it again and overwrite the local copy only when the local file is missing or its version ('revision', technically) in CMR differs from the one recorded locally.

We'll track all changes in perpetuity using the `.history` file.

In [11]:
def progress_bar(i: int, n: int):
    """Hideous function that produces a nice ASCII progress bar."""
    F, P = int(55*i//n), ("{0:."+str(1)+"f}").format(100*(i/float(n)))
    print(f'\rProgress: |{"#"*F+"-"*(55-F)}| {i} / {n} ({P}%)', end='\r')
    if i == n:
        print()


for i, granule in enumerate(results['items']):

    # Get a timestamp marking the start of this potential granule download.
    time = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")  # UTC, to match the 'Z' suffix.
    
    # Print a simple ASCII progress bar to notify user of the current status.
    progress_bar(i+1, len(results['items']))
    
    # Select CMR Search and UMM metadata for the current granule (dicts).
    meta, umm = granule['meta'], granule['umm']
    
    # Get the granule's unique 'native-id' from the CMR Search metadata.
    nativeid, https, opendap = meta['native-id'], None, None

    # Loop over the "RelatedUrls" until access urls are identified.
    for ru in umm['RelatedUrls']:
        if ru['Type'] == "GET DATA":
            https = ru['URL']
        if "opendap" in (ru['URL']+ru['Description']).lower():
            opendap = ru['URL'].replace(".html","")
    
    # Print a warning if an HTTPS url was not found.
    if https is None:
        print(f"[{time}] warn: no HTTPS download for '{nativeid}'.\n")
        continue

    # 1. Is the granule referenced in the history file?
    if nativeid in history_data:
        # 2. Is some version of the granule already in the data directory?
        if basename(https) in listdir(data):
            # 3. Is the old 'revision-id' the same as the one in CMR?
            if meta['revision-id']==history_data[nativeid]['revision-id']:
                continue  # If all three are True, skip the download.
    try:
        # Attempt the download if satisfied all conditions above.
        request.urlretrieve(https, f"{data}/{basename(https)}")
    except Exception as e:
        print(f"[{time}] warn: HTTPS download failure for '{nativeid}'.\n")
        raise e

    # Replace the entry in the history dictionary to reflect new granule.
    history_data[nativeid] = {**meta, 'https': https, 'opendap': opendap, 'recon': time}

    # Dump the history file to overwrite the old one, capturing this successful download.
    with open(history_file, "w") as f:
        f.write(dumps(history_data))

Progress: |-------------------------------------------------------| 1 / 100 (1.0%)
Progress: |#------------------------------------------------------| 2 / 100 (2.0%)
Progress: |#------------------------------------------------------| 3 / 100 (3.0%)
Progress: |##-----------------------------------------------------| 4 / 100 (4.0%)
Progress: |##-----------------------------------------------------| 5 / 100 (5.0%)
Progress: |###----------------------------------------------------| 6 / 100 (6.0%)
Progress: |###----------------------------------------------------| 7 / 100 (7.0%)
Progress: |####---------------------------------------------------| 8 / 100 (8.0%)
Progress: |####---------------------------------------------------| 9 / 100 (9.0%)
Progress: |#####--------------------------------------------------| 10 / 100 (10.0%)
Progress: |######-------------------------------------------------| 11 / 100 (11.0%)
Progress: |######-------------------------------------------------| 12 
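
The three numbered checks in the loop amount to one decision: skip the download only when the granule is in the history, its file is present locally, and the recorded revision matches CMR. Pulling that decision into a pure function makes it easy to test. The function name `needs_download` and the sample values below are illustrative assumptions.

```python
def needs_download(native_id, revision_id, filename, history, local_files):
    """Return True unless history, local file, and revision all agree with CMR."""
    entry = history.get(native_id)
    if entry is None:
        return True  # 1. Never recorded in the history file.
    if filename not in local_files:
        return True  # 2. Recorded, but the file is missing locally.
    return entry.get("revision-id") != revision_id  # 3. Revision changed in CMR.


sample_history = {"granule-a": {"revision-id": 1}}
sample_files = {"granule-a.nc"}

print(needs_download("granule-a", 1, "granule-a.nc", sample_history, sample_files))
print(needs_download("granule-a", 2, "granule-a.nc", sample_history, sample_files))
print(needs_download("granule-b", 1, "granule-b.nc", sample_history, sample_files))
```

Only the first call reports that no download is needed; a newer revision or an unknown granule both trigger one.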

### Finishing up

Now read the history file back into a Python dictionary, print its size (item count) and its first item for the user's reference. We're storing:

* the CMR Search metadata (key pattern: `results['items'][#]['meta']`),
* the access urls from the UMM metadata (key pattern: `results['items'][#]['umm']['RelatedUrls']`),
* a timestamp indicating the time the granule was last downloaded/updated locally.
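
To make the shape of one stored record concrete, the sketch below assembles an entry from these three pieces and round-trips it through JSON. The helper `build_history_entry`, the granule name, and the `example.org` url are placeholders; only the field names (`https`, `opendap`, `recon`) follow the notebook.

```python
from datetime import datetime, timezone
from json import dumps, loads


def build_history_entry(meta: dict, https: str, opendap=None) -> dict:
    """Combine CMR 'meta', the access urls, and a UTC timestamp into one record."""
    recon = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {**meta, "https": https, "opendap": opendap, "recon": recon}


# Placeholder metadata standing in for results['items'][#]['meta'].
sample_meta = {"native-id": "example-granule", "revision-id": 1}
entry = build_history_entry(sample_meta, "https://example.org/example-granule.nc")

# The history file is a JSON object keyed by 'native-id'.
restored = loads(dumps({sample_meta["native-id"]: entry}))
print(dumps(restored, indent=2))
```

A missing OPeNDAP url is stored as `None`, which JSON writes as `null`.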

In [12]:
with open(history_file, "r") as f:
    history_data = load(f)

print(f"Count: {len(history_data)}")

Count: 100


In [13]:
print(dumps(list(history_data.items())[0][1], indent=2))

{
  "concept-type": "granule",
  "concept-id": "G1235913051-POCLOUD",
  "revision-id": 1,
  "native-id": "20190109213501-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.0",
  "provider-id": "POCLOUD",
  "format": "application/vnd.nasa.cmr.umm+json",
  "revision-date": "2020-07-04T08:44:56.751Z",
  "https": "https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20190109213501-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.0.nc",
  "opendap": null,
  "recon": "2020-09-10T16:36:16Z"
}
