<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Before-you-start" data-toc-modified-id="Before-you-start-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Before you start</a></span></li><li><span><a href="#Authentication-setup" data-toc-modified-id="Authentication-setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Authentication setup</a></span></li><li><span><a href="#Hands-off-workflow" data-toc-modified-id="Hands-off-workflow-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Hands-off workflow</a></span></li></ul></div>

# Access Sentinel-6 NRT Data

This notebook shows a simple way to maintain a local time series of [Sentinel-6](#) NRT data using the [CMR Search API](#). It downloads the granules ingested since the previous run to a designated data folder and, on success, overwrites a hidden file inside that folder with the timestamp of the CMR Search request.

> User note: The notebook actually points to a MODIS SST collection for now ([10.5067/GHMDA-2PJ19](https://doi.org/10.5067/GHMDA-2PJ19)). It'll work just the same for Sentinel-6.

## Before you start

Before you begin this tutorial, make sure you have an Earthdata account at [https://uat.urs.earthdata.nasa.gov](https://uat.urs.earthdata.nasa.gov).

Accounts are free to create and take just a moment to set up.

## Authentication setup

*You'll probably need to use the netrc method when running from the command line.*

We need some boilerplate up front to log in to Earthdata Login.  The function below will allow Python
scripts to log into any Earthdata Login application programmatically.  To avoid being prompted for
credentials every time you run and also allow clients such as curl to log in, you can add the following
to a `.netrc` (`_netrc` on Windows) file in your home directory:

```
machine uat.urs.earthdata.nasa.gov
    login <your username>
    password <your password>
```

Make sure that this file is only readable by the current user or you will receive an error stating
"netrc access too permissive."

`$ chmod 0600 ~/.netrc` 
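
As an optional sanity check before running the notebook, the sketch below looks for an entry for a given host in a netrc file and reports whether the file is readable only by its owner. The helper name `check_netrc` and its return shape are illustrative and not part of the notebook; the permission test is only meaningful on POSIX systems, and on Windows you would pass the path to your `_netrc` file explicitly.

```python
import netrc
import os
import stat

def check_netrc(host, path=None):
    """Return (has_entry, owner_only) for `host` in a netrc file.

    Hypothetical helper for illustration; `path` defaults to ~/.netrc.
    """
    path = path or os.path.join(os.path.expanduser("~"), ".netrc")
    if not os.path.isfile(path):
        return False, False
    # No group/other permission bits set means owner-only access (e.g. 0600).
    mode = stat.S_IMODE(os.stat(path).st_mode)
    owner_only = (mode & 0o077) == 0
    try:
        # authenticators() returns None when the host has no entry.
        has_entry = netrc.netrc(path).authenticators(host) is not None
    except netrc.NetrcParseError:
        has_entry = False
    return has_entry, owner_only
```

For example, `check_netrc('uat.urs.earthdata.nasa.gov')` would tell you whether the login function below is likely to find your credentials without prompting.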


In [3]:
from urllib import request
from http.cookiejar import CookieJar
import getpass
import netrc

def setup_earthdata_login_auth(endpoint):
    """
    Set up the request library so that it authenticates against the given Earthdata Login
    endpoint and is able to track cookies between requests.  This looks in the .netrc file 
    first and if no credentials are found, it prompts for them.

    Valid endpoints include:
        uat.urs.earthdata.nasa.gov - Earthdata Login UAT (Harmony's current default)
        urs.earthdata.nasa.gov - Earthdata Login production
    """
    try:
        username, _, password = netrc.netrc().authenticators(endpoint)
    except (FileNotFoundError, TypeError):
        # FileNotFound = There's no .netrc file
        # TypeError = The endpoint isn't in the netrc file, causing the above to try unpacking None
        print('Please provide your Earthdata Login credentials to allow data access')
        print('Your credentials will only be passed to %s and will not be exposed in Jupyter' % (endpoint))
        username = input('Username:')
        password = getpass.getpass()

    manager = request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, endpoint, username, password)
    auth = request.HTTPBasicAuthHandler(manager)

    jar = CookieJar()
    processor = request.HTTPCookieProcessor(jar)
    opener = request.build_opener(auth, processor)
    request.install_opener(opener)

In [4]:
setup_earthdata_login_auth('uat.urs.earthdata.nasa.gov')

## Hands-off workflow

This workflow/notebook can be run routinely to maintain a time series of NRT data, downloading new granules as they become available in CMR. There are at least a few ways to run it:

* from the notebook server, 
* from the command line with papermill, and
* using nbconvert/papermill with a job scheduler like cron.
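
For the command-line and scheduler options, a minimal sketch of assembling a papermill invocation is shown here. The notebook file names and the `/tmp/nrt` path are placeholders rather than names from this repository, and `papermill_command` is a hypothetical helper. papermill's `-p name value` option overrides one variable in the cell tagged `parameters`.

```python
import shlex

def papermill_command(notebook, output, **parameters):
    """Build a papermill command line that overrides notebook parameters."""
    cmd = ["papermill", notebook, output]
    for name, value in parameters.items():
        # -p passes one name/value pair to the notebook's parameters cell.
        cmd += ["-p", name, str(value)]
    return cmd

# Hypothetical notebook file names; substitute your own paths.
cmd = papermill_command("s6_nrt.ipynb", "s6_nrt.out.ipynb", mins=20, data="/tmp/nrt")
print(shlex.join(cmd))
```

The printed line can be pasted into a crontab entry, or the list can be passed to `subprocess.run(cmd, check=True)` from a Python script, assuming papermill is installed.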

The notebook writes a file `.update` to the target data directory the first time it's run. Every subsequent run that finishes successfully overwrites the file with a new timestamp corresponding to the time used for the granule search parameter `created_at`.

The variables in the cell below determine the workflow behavior in its initial run. (*`mins` only applies to the first run.*)

In [5]:
#
# This cell accepts parameters from command line with papermill: 
#  https://papermill.readthedocs.io
#

mins = 20  # Limit the results to granules ingested in the last ___ minutes.

cmr = "cmr.uat.earthdata.nasa.gov"  # The domain for CMR or CMR UAT

ccid = "C1234724470-POCLOUD"  # The 'concept-id' of the desired CMR collection

data = "resources/nrt"  # The target path for the NRT granules to download

The variable `data` is pointed at a nearby folder [`resources/nrt`](resources/nrt/) by default. **You should change `data` to a suitable download path on your file system.** An unlucky sequence of git commands could make that folder and its downloads disappear if you're not careful. Just change it.

In [6]:
from os import makedirs
from os.path import isdir, basename
from urllib.request import urlopen, urlretrieve
from datetime import datetime, timedelta
from json import dumps, loads

**The search retrieves granules ingested during the last `mins` minutes.** If a `.update` file exists in your local data directory, the timestamp it holds (written by the previous successful run) is used instead. If not, the CMR Search falls back on the `mins`-minute window.

In [7]:
timestamp = (datetime.utcnow()-timedelta(minutes=mins)).strftime("%Y-%m-%dT%H:%M:%SZ")
timestamp

'2020-07-28T02:55:58Z'
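
`datetime.utcnow()` is deprecated as of Python 3.12. A timezone-aware variant that produces the same string format could look like the sketch below; the function name `window_start` and the optional `now` argument (useful for testing) are illustrative and not part of the notebook.

```python
from datetime import datetime, timedelta, timezone

CMR_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def window_start(mins, now=None):
    """Return the UTC time `mins` minutes before `now`, formatted for CMR."""
    # `now` must be timezone-aware in UTC; it defaults to the current UTC time.
    now = now or datetime.now(timezone.utc)
    return (now - timedelta(minutes=mins)).strftime(CMR_TIME_FORMAT)
```

Calling `window_start(mins)` would then stand in for the expression in the cell above.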

This cell will replace the timestamp above with the one read from the `.update` file in the data directory, if it exists.

In [9]:
if not isdir(data):
    print(f"NOTE: Making new data directory at '{data}'. (This is the first run.)")
    makedirs(data)
else:
    try:
        with open(f"{data}/.update", "r") as f:
            timestamp = f.read()
    except FileNotFoundError:
        print("WARN: No .update in the data directory. (Is this the first run?)")
    else:
        print(f"NOTE: .update found in the data directory. (The last run was at {timestamp}.)")

NOTE: .update found in the data directory. (The last run was at 2020-07-27T21:56:41Z.)


There are several ways to query for CMR updates that occurred during a given timeframe. Read on in the CMR Search documentation:

* https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#c-with-new-granules (Collections)
* https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#c-with-revised-granules (Collections)
* https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#g-production-date (Granules)
* https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#g-created-at (Granules)

The `created_at` parameter works for our purposes. It's a granule search parameter that returns the records ingested since the input timestamp.

In [10]:
params = {
    'scroll': "true",
    'page_size': 2000,
    'sort_key': "-start_date",
    'collection_concept_id': ccid, 
    'created_at': timestamp,
    # Limit results to coverage for .5deg bbox in Gulf of Alaska:
    'bounding_box': "-146.5,57.5,-146,58",
}

params

{'scroll': 'true',
 'page_size': 2000,
 'sort_key': '-start_date',
 'collection_concept_id': 'C1234724470-POCLOUD',
 'created_at': '2020-07-27T21:56:41Z',
 'bounding_box': '-146.5,57.5,-146,58'}

Build the query string from the parameters dictionary: join each parameter to its value with `=`, then join the `parameter=value` pairs to each other with `&`.

In [11]:
query = "&".join([f"{p}={v}" for p,v in params.items()])
query

'scroll=true&page_size=2000&sort_key=-start_date&collection_concept_id=C1234724470-POCLOUD&created_at=2020-07-27T21:56:41Z&bounding_box=-146.5,57.5,-146,58'
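
The manual join works here because none of the values contain characters that need escaping. As a hedged alternative, the standard library's `urllib.parse.urlencode` builds the same string and percent-encodes anything unsafe. The `safe=",:"` argument is a choice made for this sketch so that the bounding box commas and timestamp colons stay readable, and `example_params` simply repeats the values printed above.

```python
from urllib.parse import urlencode

# Same values as the parameters dictionary shown above.
example_params = {
    'scroll': "true",
    'page_size': 2000,
    'sort_key': "-start_date",
    'collection_concept_id': "C1234724470-POCLOUD",
    'created_at': "2020-07-27T21:56:41Z",
    'bounding_box': "-146.5,57.5,-146,58",
}

# Leave commas and colons unescaped; everything else unsafe is percent-encoded.
encoded_query = urlencode(example_params, safe=",:")
print(encoded_query)
```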

Append the query string to the CMR Search endpoint for granules.

In [12]:
url = f"https://{cmr}/search/granules.umm_json?{query}"
print(url)

https://cmr.uat.earthdata.nasa.gov/search/granules.umm_json?scroll=true&page_size=2000&sort_key=-start_date&collection_concept_id=C1234724470-POCLOUD&created_at=2020-07-27T21:56:41Z&bounding_box=-146.5,57.5,-146,58


Download the granule records that match our search parameters.

In [13]:
with urlopen(url) as f:
    results = loads(f.read().decode())

print(f"{results['hits']} new granules ingested for '{ccid}' since '{timestamp}'.")

14 new granules ingested for 'C1234724470-POCLOUD' since '2020-07-27T21:56:41Z'.


Neatly print the first granule's record for reference (assuming at least one was returned).

In [14]:
if len(results['items'])>0:
    print(dumps(results['items'][0], indent=2))

{
  "meta": {
    "concept-type": "granule",
    "concept-id": "G1236616733-POCLOUD",
    "revision-id": 1,
    "native-id": "20200727220501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0",
    "provider-id": "POCLOUD",
    "format": "application/vnd.nasa.cmr.umm+json",
    "revision-date": "2020-07-28T01:28:46Z"
  },
  "umm": {
    "RelatedUrls": [
      {
        "URL": "https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_A-JPL-L2P-v2019.0/20200727220501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc",
        "Type": "GET DATA",
        "Description": "The base directory location for the granule."
      },
      {
        "URL": "https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-public/MODIS_A-JPL-L2P-v2019.0/20200727220501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.cmr.json",
        "Type": "EXTENDED METADATA",
        "Description": "File to download"
      },
      {
        "URL": "https://archive.podaac.uat.earthdata.nasa.gov/s3c

The link for HTTP access is denoted by `"Type": "GET DATA"` in the list of `RelatedUrls`.

Grab the download URL, but do it in a way that'll work for search results returning any number of granule records:

In [15]:
downloads = [r['umm']['RelatedUrls'][0]['URL'] for r in results['items']]
downloads

['https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_A-JPL-L2P-v2019.0/20200727220501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc',
 'https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_A-JPL-L2P-v2019.0/20200727203000-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc',
 'https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_A-JPL-L2P-v2019.0/20020725150006-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc',
 'https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_A-JPL-L2P-v2019.0/20020723115505-JPL-L2P_GHRSST-SSTskin-MODIS_A-N-v02.0-fv01.0.nc',
 'https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_A-JPL-L2P-v2019.0/20020718113505-JPL-L2P_GHRSST-SSTskin-MODIS_A-N-v02.0-fv01.0.nc',
 'https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_A-JPL-L2P-v2019.0/20020716150506-JPL-L2P_GHRSST-SSTskin-MODIS_A-N-v02.0-fv01.0.nc'
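
The list comprehension above relies on the `GET DATA` link being the first entry in `RelatedUrls`, which holds for the record printed earlier. If that ordering is not guaranteed for your collection, a sketch that selects by `Type` instead might look like this; `get_data_urls` is a hypothetical helper, and it silently skips records with no `GET DATA` entry.

```python
def get_data_urls(items):
    """Collect the first "GET DATA" URL from each UMM-JSON granule record."""
    urls = []
    for item in items:
        related = item.get("umm", {}).get("RelatedUrls", [])
        # Pick the first related URL whose Type marks it as the data file.
        match = next((r["URL"] for r in related if r.get("Type") == "GET DATA"), None)
        if match is not None:
            urls.append(match)
    return urls
```

With the search results from this notebook, `get_data_urls(results['items'])` would be a drop-in replacement for the comprehension.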

Finish by downloading the files to the data directory in a loop. Overwrite `.update` with a new timestamp on success.

In [17]:
for f in downloads:
    try:
        urlretrieve(f, f"{data}/{basename(f)}")
    except Exception as e:
        print(f"[{datetime.now()}] FAILURE: {f}\n\n{e}\n")
        raise e
    else:
        print(f"[{datetime.now()}] SUCCESS: {f}")

[2020-07-27 23:17:26.617694] SUCCESS: https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_A-JPL-L2P-v2019.0/20200727220501-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc
[2020-07-27 23:17:32.628612] SUCCESS: https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_A-JPL-L2P-v2019.0/20200727203000-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc
[2020-07-27 23:17:37.616443] SUCCESS: https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_A-JPL-L2P-v2019.0/20020725150006-JPL-L2P_GHRSST-SSTskin-MODIS_A-D-v02.0-fv01.0.nc
[2020-07-27 23:17:43.343401] SUCCESS: https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_A-JPL-L2P-v2019.0/20020723115505-JPL-L2P_GHRSST-SSTskin-MODIS_A-N-v02.0-fv01.0.nc
[2020-07-27 23:17:48.407171] SUCCESS: https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/MODIS_A-JPL-L2P-v2019.0/20020718113505-JPL-L2P_GHRSST-SSTskin-MODIS_A-N-v02.0-fv01.

If there were updates to the local time series during this run and no exceptions were raised during the download loop, then overwrite the timestamp file that tracks updates to the data folder (`resources/nrt/.update`):

In [18]:
if len(results['items'])>0:
    with open(f"{data}/.update", "w") as f:
        f.write(timestamp)
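
Note that `timestamp` at this point still holds the value that was used for `created_at` in the search. If you want the next run to start from the time of this run's search request, as described in the introduction, one option is to capture the current UTC time just before the request and write that value instead. The sketch below assumes a hypothetical helper `write_update_marker` and uses a temporary directory as a stand-in for the real data directory.

```python
import tempfile
from datetime import datetime, timezone
from os.path import join

def write_update_marker(data_dir, search_time):
    """Write `search_time` (a timezone-aware datetime) to `<data_dir>/.update`."""
    stamp = search_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    with open(join(data_dir, ".update"), "w") as f:
        f.write(stamp)
    return stamp

# Capture the time immediately before the CMR request would be sent ...
search_time = datetime.now(timezone.utc)
# ... run the search and the download loop here, then record the search time.
with tempfile.TemporaryDirectory() as demo_dir:  # stand-in for the data directory
    print(write_update_marker(demo_dir, search_time))
```

Recording the time from before the request, rather than after the downloads, means granules ingested while the downloads were running are still picked up by the next search.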