<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Before-you-start" data-toc-modified-id="Before-you-start-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Before you start</a></span></li><li><span><a href="#Authentication-setup" data-toc-modified-id="Authentication-setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Authentication setup</a></span></li><li><span><a href="#Hands-off-workflow" data-toc-modified-id="Hands-off-workflow-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Hands-off workflow</a></span></li></ul></div>

# Access Sentinel-6 NRT Data

This notebook shows a simple way to maintain a local time series of Sentinel-6 NRT data using the [CMR Search API](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html). It downloads the granules ingested since the previous run to a designated data folder and, on success, overwrites a hidden file inside that folder with the timestamp of the CMR Search request.

> **User note:**
>  The notebook actually points to a MODIS SST collection for now ([https://doi.org/10.5067/GHMDA-2PJ19](https://doi.org/10.5067/GHMDA-2PJ19)). It will work just the same for Sentinel-6.

## Before you start

Before you begin this tutorial, make sure you have an Earthdata account: [https://urs.earthdata.nasa.gov](https://urs.earthdata.nasa.gov) for the operations environment (most common) or [https://uat.urs.earthdata.nasa.gov](https://uat.urs.earthdata.nasa.gov) for the UAT environment.

Accounts are free to create and take just a moment to set up.

## Authentication setup

We need some boilerplate up front to log in to Earthdata Login.  The function below will allow Python
scripts to log into any Earthdata Login application programmatically.  To avoid being prompted for
credentials every time you run and also allow clients such as curl to log in, you can add the following
to a `.netrc` (`_netrc` on Windows) file in your home directory:

```
machine uat.urs.earthdata.nasa.gov
    login <your username>
    password <your password>
```

Make sure that this file is only readable by the current user or you will receive an error stating
"netrc access too permissive."

`$ chmod 0600 ~/.netrc` 
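
As a rough illustration of the permission requirement above, the sketch below checks whether a file is accessible only by its owner. The helper name `netrc_is_private` is hypothetical, the check is meaningful on POSIX systems only, and it runs against a throwaway file rather than your real `.netrc`.

```python
import os
import stat
import tempfile


def netrc_is_private(path):
    """Return True if neither group nor others have any permission on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Any group or other permission bit makes the file too permissive.
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Demonstrate on a temporary file with placeholder contents.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "netrc_demo")
    with open(demo, "w") as f:
        f.write("machine example.invalid login user password secret\n")

    os.chmod(demo, 0o644)          # Readable by group and others.
    open_result = netrc_is_private(demo)

    os.chmod(demo, 0o600)          # Same effect as `chmod 0600`.
    private_result = netrc_is_private(demo)

print(open_result, private_result)
```

Running a check like this before a scheduled job can surface a permissions problem early instead of partway through a run.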

*You'll need to authenticate using the netrc method when running from the command line with [`papermill`](https://papermill.readthedocs.io/en/latest/). When running in the notebook client in your browser, you can log in manually by executing the cell below.*

In [1]:
from urllib import request
from http.cookiejar import CookieJar
import getpass
import netrc


def setup_earthdata_login_auth(endpoint):
    """
    Set up the request library so that it authenticates against the given Earthdata Login
    endpoint and is able to track cookies between requests.  This looks in the .netrc file 
    first and if no credentials are found, it prompts for them.

    Valid endpoints include:
        uat.urs.earthdata.nasa.gov - Earthdata Login UAT (Harmony's current default)
        urs.earthdata.nasa.gov - Earthdata Login production
    """
    try:
        username, _, password = netrc.netrc().authenticators(endpoint)
    except (FileNotFoundError, TypeError):
        # FileNotFound = There's no .netrc file
        # TypeError = The endpoint isn't in the netrc file, causing the above to try unpacking None
        print('Please provide your Earthdata Login credentials to allow data access')
        print('Your credentials will only be passed to %s and will not be exposed in Jupyter' % (endpoint))
        username = input('Username:')
        password = getpass.getpass()

    manager = request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, endpoint, username, password)
    auth = request.HTTPBasicAuthHandler(manager)

    jar = CookieJar()
    processor = request.HTTPCookieProcessor(jar)
    opener = request.build_opener(auth, processor)
    request.install_opener(opener)

In [2]:
setup_earthdata_login_auth('uat.urs.earthdata.nasa.gov')
#setup_earthdata_login_auth('urs.earthdata.nasa.gov')

## Hands-off workflow

This workflow/notebook can be run routinely to maintain a time series of NRT data, downloading new granules as they become available in CMR. 

The notebook writes (or overwrites) a file named `.update` in the target data directory on each successful run. The file tracks the date and time of the most recent update to the time series of NRT granules using a timestamp in the format `yyyy-mm-ddThh:mm:ssZ`. 

The timestamp matches the value used for the [`created_at`](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#g-created-at) parameter in the last successful run. This parameter finds the granules created within a range of datetimes. This workflow leverages the `created_at` parameter to search backwards in time for new granules ingested between the time of our timestamp and now.
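
To make the timestamp format concrete, here is a minimal sketch that converts between Python `datetime` objects and the `yyyy-mm-ddThh:mm:ssZ` strings described above. The helper names `to_cmr_timestamp` and `from_cmr_timestamp` are hypothetical and are not part of the notebook.

```python
from datetime import datetime

# strftime/strptime pattern for yyyy-mm-ddThh:mm:ssZ
CMR_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def to_cmr_timestamp(dt):
    """Format a datetime (assumed to already be in UTC) as a timestamp string."""
    return dt.strftime(CMR_TIME_FORMAT)


def from_cmr_timestamp(text):
    """Parse a timestamp string back into a naive datetime, ignoring stray whitespace."""
    return datetime.strptime(text.strip(), CMR_TIME_FORMAT)


example = datetime(2020, 9, 10, 21, 13, 17)
text = to_cmr_timestamp(example)
round_trip = from_cmr_timestamp(text + "\n")

print(text)
print(round_trip == example)
```

Parsing the stored value back into a `datetime` is a cheap way to confirm that a `.update` file has not been corrupted before it is used in a search.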

The variables in the cell below determine the workflow behavior on its initial run:

* `mins`: Initialize a new local time series by starting with the granules ingested within the last `mins` minutes. 
* `cmr`: The domain of the target CMR instance, either `cmr.earthdata.nasa.gov` or `cmr.uat.earthdata.nasa.gov`.
* `ShortName`: PODAAC's unique string identifier for the desired Collection.
* `data`: The path to a local directory in which to download/maintain a copy of the NRT granule time series.

In [3]:
#
# This cell accepts parameters from command line with papermill: 
#  https://papermill.readthedocs.io
#
# These variables should be set before the first run, then they 
#  should be left alone. All subsequent runs expect the values 
#  for cmr, ShortName, data to be unchanged. The mins value has no 
#  impact on subsequent runs.
#

mins = 10

cmr = "cmr.uat.earthdata.nasa.gov"

ShortName = "MODIS_T-JPL-L2P-v2019.0"  # "MODIS_A-JPL-L2P-v2019.0"

data = "resources/nrt"

The variable `data` is pointed at a nearby folder [`resources/nrt`](resources/nrt/) by default. **You should change `data` to a suitable download path on your file system.** An unlucky sequence of git commands could delete that folder and its downloads if you're not careful. Just change it.

The Python imports relevant to the workflow:

In [4]:
from os import makedirs
from os.path import isdir, basename
from urllib.parse import urlencode
from urllib.error import HTTPError, URLError
from urllib.request import urlopen, urlretrieve
from datetime import datetime, timedelta
from json import dumps, loads

**The search retrieves granules ingested during the last `n` minutes.** If a `.update` file exists in your local data directory, the search starts from the timestamp it contains. If not, the CMR Search falls back on the ten-minute window set by `mins`.

In [5]:
timestamp = (datetime.utcnow()-timedelta(minutes=mins)).strftime("%Y-%m-%dT%H:%M:%SZ")
timestamp

'2020-09-10T21:13:17Z'

This cell will replace the timestamp above with the one read from the `.update` file in the data directory, if it exists.

In [6]:
if not isdir(data):
    print(f"NOTE: Making new data directory at '{data}'. (This is the first run.)")
    makedirs(data)
else:
    try:
        with open(f"{data}/.update", "r") as f:
            timestamp = f.read().strip()  # Ignore stray whitespace or a trailing newline.
    except FileNotFoundError:
        print("WARN: No .update in the data directory. (Is this the first run?)")
    else:
        print(f"NOTE: .update found in the data directory. (The last run was at {timestamp}.)")

WARN: No .update in the data directory. (Is this the first run?)
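
The fallback logic in the two cells above can also be folded into a single helper, which may be easier to reuse in a script. This is only a sketch: the function name `starting_timestamp` and its `now` argument are hypothetical, and it uses a timezone-aware clock where the notebook uses `datetime.utcnow()`.

```python
import os
import tempfile
from datetime import datetime, timedelta, timezone

CMR_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def starting_timestamp(data_dir, mins, now=None):
    """Return the saved .update timestamp, or `mins` minutes before `now` if none exists."""
    if now is None:
        now = datetime.now(timezone.utc)
    try:
        with open(os.path.join(data_dir, ".update")) as f:
            saved = f.read().strip()
    except FileNotFoundError:
        saved = ""  # First run: no directory or no .update file yet.
    if saved:
        return saved
    return (now - timedelta(minutes=mins)).strftime(CMR_TIME_FORMAT)


# Demonstrate both branches in a temporary directory with a fixed clock.
with tempfile.TemporaryDirectory() as tmp:
    fixed_now = datetime(2020, 9, 10, 21, 23, 17, tzinfo=timezone.utc)
    first_run = starting_timestamp(tmp, 10, now=fixed_now)

    with open(os.path.join(tmp, ".update"), "w") as f:
        f.write("2020-09-10T21:20:00Z\n")
    later_run = starting_timestamp(tmp, 10, now=fixed_now)

print(first_run)
print(later_run)
```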


There are several ways to query for CMR updates that occurred during a given timeframe. Read on in the CMR Search documentation:

* https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#c-with-new-granules (Collections)
* https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#c-with-revised-granules (Collections)
* https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#g-production-date (Granules)
* https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#g-created-at (Granules)

The `created_at` parameter works for our purposes. It's a granule search parameter that returns the records ingested since the input timestamp.
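
Because `created_at` matches granules created within a range of datetimes, a bounded window can be useful when backfilling a known gap. The sketch below assumes the comma-separated `start,end` range form described in the CMR Search documentation linked above. Verify that syntax against the documentation before relying on it. The helper name `created_at_window` is hypothetical.

```python
from datetime import datetime, timedelta
from urllib.parse import urlencode

CMR_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def created_at_window(start, end):
    """Build a 'start,end' value for created_at (assumed range syntax, see lead-in)."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    return f"{start.strftime(CMR_TIME_FORMAT)},{end.strftime(CMR_TIME_FORMAT)}"


start = datetime(2020, 9, 10, 21, 13, 17)
end = start + timedelta(minutes=10)

window = created_at_window(start, end)
# urlencode percent-encodes the colons and the comma for use in a URL.
query = urlencode({"ShortName": "MODIS_T-JPL-L2P-v2019.0", "created_at": window})

print(window)
print(query)
```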

In [7]:
params = {
    'scroll': "true",
    'page_size': 2000,
    'sort_key': "-start_date",
    'ShortName': ShortName,
    'created_at': timestamp,
}

params

{'scroll': 'true',
 'page_size': 2000,
 'sort_key': '-start_date',
 'ShortName': 'MODIS_T-JPL-L2P-v2019.0',
 'created_at': '2020-09-10T21:13:17Z'}

Get the query parameters as a string and then the complete search url:

In [8]:
query = urlencode(params)
url = f"https://{cmr}/search/granules.umm_json?{query}"
print(url)

https://cmr.uat.earthdata.nasa.gov/search/granules.umm_json?scroll=true&page_size=2000&sort_key=-start_date&ShortName=MODIS_T-JPL-L2P-v2019.0&created_at=2020-09-10T21%3A13%3A17Z


Get a new timestamp that represents the UTC time of the search. Then download the records in `umm_json` format for granules that match our search parameters:

In [9]:
try:
    # urlopen itself raises HTTPError/URLError, so it must sit inside the try block.
    with request.urlopen(url) as response:
        timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")  # Record the time of the request.
        results = loads(response.read().decode())                 # Read-Decode-Load results from JSON.
        scroll = response.getheader("Cmr-Scroll-Id")              # Get response header for 'Cmr-Scroll-Id'.
except (HTTPError, URLError) as e:
    print(f"ERROR: CMR search request failed: {e}")
    raise
else:
    # Reuse the scroll ID on follow-up requests to page through any remaining results.
    req = request.Request(url, headers={'Client-Id': "S6", 'CMR-Scroll-Id': scroll})

print(f"Status after request loop: ({len(results['items'])}/{results['hits']})")

Status after request loop: (74/74)


Submit as many requests as needed to get all the target granule records.

In [10]:
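while int(results['hits']) != len(results['items']):
    try:
        with request.urlopen(req) as response:
            aresults = loads(response.read().decode())
    except (HTTPError, URLError) as e:
        print(f"ERROR: CMR scroll request failed: {e}")
        raise
    # Stop when a page has no items. Without this check the loop would never
    # end if the item count failed to reach the reported number of hits.
    items = aresults.get('items', [])
    if not items:
        break
    results['items'].extend(items)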

print(f"Status after request loop: ({len(results['items'])}/{results['hits']})")

Status after request loop: (74/74)
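
The paging pattern can be tried without touching the network by swapping the HTTP call for a stand-in. In this sketch, `collect_items` and `fetch_next_page` are hypothetical names and the pages are made-up data. The point is only the stopping rule: keep requesting until the item count reaches `hits`, or until a page comes back empty.

```python
def collect_items(first_page, fetch_next_page):
    """Gather items across pages until `hits` is reached or a page is empty."""
    items = list(first_page["items"])
    hits = int(first_page["hits"])
    while len(items) < hits:
        page = fetch_next_page()
        new_items = page.get("items", [])
        if not new_items:
            break  # Nothing more to read, even if hits was over-reported.
        items.extend(new_items)
    return items


# Case 1: the remaining pages supply exactly the missing items.
pages = iter([{"items": [3, 4]}, {"items": [5]}])
complete = collect_items({"hits": 5, "items": [1, 2]}, lambda: next(pages))

# Case 2: hits is larger than what the pages ever deliver.
short_pages = iter([{"items": [3]}, {"items": []}])
partial = collect_items({"hits": 10, "items": [1, 2]}, lambda: next(short_pages))

print(complete)
print(partial)
```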


Neatly print the first granule record (if one was returned):

In [11]:
if len(results['items'])>0:
    print(dumps(results['items'][0], indent=2))

{
  "meta": {
    "concept-type": "granule",
    "concept-id": "G1237944663-POCLOUD",
    "revision-id": 1,
    "native-id": "20031031035506-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.0",
    "provider-id": "POCLOUD",
    "format": "application/vnd.nasa.cmr.umm+json",
    "revision-date": "2020-09-10T21:19:57.693Z"
  },
  "umm": {
    "RelatedUrls": [
      {
        "URL": "https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20031031035506-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.0.nc",
        "Type": "GET DATA",
        "Description": "Download 20031031035506-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.0.nc"
      },
      {
        "URL": "https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-public/MODIS_T-JPL-L2P-v2019.0/20031031035506-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.0.nc.md5",
        "Type": "EXTENDED METADATA",
        "Description": "Download 20031031035506-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.0.

The link for HTTP access can be retrieved from each granule record's `RelatedUrls` field. The download link is identified by `"Type": "GET DATA"`.

Select the download URL for each of the granule records:

In [12]:
downloads = [[u['URL'] for u in r['umm']['RelatedUrls'] if u['Type']=="GET DATA"][0] for r in results['items']]
downloads[5:]

['https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20031030035006-JPL-L2P_GHRSST-SSTskin-MODIS_T-D-v02.0-fv01.0.nc',
 'https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20031030032006-JPL-L2P_GHRSST-SSTskin-MODIS_T-D-v02.0-fv01.0.nc',
 'https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20031030031005-JPL-L2P_GHRSST-SSTskin-MODIS_T-D-v02.0-fv01.0.nc',
 'https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20031029235506-JPL-L2P_GHRSST-SSTskin-MODIS_T-D-v02.0-fv01.0.nc',
 'https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20031029113005-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.0.nc',
 'https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20031029043506-JPL-L2P_GHRSST-SSTskin-MODIS_T-D-v02.0-fv01.0.nc'
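
The list comprehension in the cell above raises an `IndexError` if any record lacks a `GET DATA` entry. A more defensive variant, sketched here, skips such records instead. The function name `data_urls` is hypothetical, and the sample records use placeholder URLs rather than real granules.

```python
def data_urls(records):
    """Return the first 'GET DATA' URL from each record, skipping records without one."""
    urls = []
    for record in records:
        related = record.get("umm", {}).get("RelatedUrls", [])
        match = next((u["URL"] for u in related if u.get("Type") == "GET DATA"), None)
        if match is not None:
            urls.append(match)
    return urls


# Made-up records that mimic the shape of the umm_json results.
sample_records = [
    {"umm": {"RelatedUrls": [
        {"URL": "https://example.invalid/a.nc.md5", "Type": "EXTENDED METADATA"},
        {"URL": "https://example.invalid/a.nc", "Type": "GET DATA"},
    ]}},
    {"umm": {"RelatedUrls": [
        {"URL": "https://example.invalid/b.nc.md5", "Type": "EXTENDED METADATA"},
    ]}},
    {"umm": {}},
]

sample_downloads = data_urls(sample_records)
print(sample_downloads)
```

Whether to skip or fail on a record without a download link is a design choice. Skipping keeps a routine job running, while failing loudly makes missing data harder to overlook.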

Finish by downloading the files to the data directory in a loop. Overwrite `.update` with a new timestamp on success.

In [13]:
for f in downloads:
    try:
        request.urlretrieve(f, f"{data}/{basename(f)}")
    except Exception as e:
        print(f"[{datetime.now()}] FAILURE: {f}\n\n{e}\n")
        raise e
    else:
        print(f"[{datetime.now()}] SUCCESS: {f}")

[2020-09-10 17:23:26.789531] SUCCESS: https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20031031035506-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.0.nc
[2020-09-10 17:23:31.105989] SUCCESS: https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20031030223506-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.0.nc
[2020-09-10 17:23:35.368470] SUCCESS: https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20031030215506-JPL-L2P_GHRSST-SSTskin-MODIS_T-D-v02.0-fv01.0.nc
[2020-09-10 17:23:39.873141] SUCCESS: https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20031030055005-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.0.nc
[2020-09-10 17:23:44.276951] SUCCESS: https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20031030044505-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.

[2020-09-10 17:26:48.127118] SUCCESS: https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20010118002005-JPL-L2P_GHRSST-SSTskin-MODIS_T-D-v02.0-fv01.0.nc
[2020-09-10 17:26:53.445490] SUCCESS: https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20010117165505-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.0.nc
[2020-09-10 17:26:57.776385] SUCCESS: https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20010117051005-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.0.nc
[2020-09-10 17:27:01.920498] SUCCESS: https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20010116194006-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.0.nc
[2020-09-10 17:27:05.969631] SUCCESS: https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_T-JPL-L2P-v2019.0/20010116083005-JPL-L2P_GHRSST-SSTskin-MODIS_T-N-v02.0-fv01.

If there were updates to the local time series during this run and no exceptions were raised during the download loop, then overwrite the timestamp file that tracks updates to the data folder (`resources/nrt/.update`):

In [14]:
if len(results['items'])>0:
    with open(f"{data}/.update", "w") as f:
        f.write(timestamp)
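
If the notebook runs unattended, an interrupted write could leave `.update` empty or truncated. One possible safeguard, sketched below, is to write the timestamp to a temporary file in the same directory and then rename it over `.update`. The function name `write_update_file` is hypothetical and this is not part of the original workflow.

```python
import os
import tempfile


def write_update_file(data_dir, timestamp):
    """Replace data_dir/.update with `timestamp` using write-then-rename."""
    target = os.path.join(data_dir, ".update")
    # The temporary file lives in the same directory so the rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=data_dir, prefix=".update.")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(timestamp)
        os.replace(tmp_path, target)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return target


# Demonstrate in a temporary directory, overwriting an older value.
with tempfile.TemporaryDirectory() as tmp:
    write_update_file(tmp, "2020-09-10T21:13:17Z")
    path = write_update_file(tmp, "2020-09-10T21:23:17Z")
    with open(path) as f:
        stored = f.read()
    leftovers = sorted(os.listdir(tmp))

print(stored)
print(leftovers)
```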