<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Before-you-start" data-toc-modified-id="Before-you-start-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Before you start</a></span></li><li><span><a href="#Authentication-setup" data-toc-modified-id="Authentication-setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Authentication setup</a></span></li><li><span><a href="#Find-granules-by-cycle/pass-number" data-toc-modified-id="Find-granules-by-cycle/pass-number-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Find granules by cycle/pass number</a></span></li></ul></div>

# Access Sentinel-6 Data by Cycle and Pass Number

This notebook shows a simple way to search for Sentinel-6 data granules for a specific cycle and pass using the [CMR Search API](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html) and download them to a local directory.

> **User note:**
>  The notebook actually points to a JASON-1 SSHA collection for now ([https://doi.org/10.5067/J1GPR-NC00E](https://doi.org/10.5067/J1GPR-NC00E)). It will work just the same for Sentinel-6.

## Before you start

Before you begin this tutorial, make sure you have an Earthdata account at [https://uat.urs.earthdata.nasa.gov](https://uat.urs.earthdata.nasa.gov).

Accounts are free to create and take just a moment to set up.

## Authentication setup

*You'll probably need to use the netrc method when running from the command line.*

We need some boilerplate up front to log in to Earthdata Login.  The function below will allow Python
scripts to log into any Earthdata Login application programmatically.  To avoid being prompted for
credentials every time you run and also allow clients such as curl to log in, you can add the following
to a `.netrc` (`_netrc` on Windows) file in your home directory:

```
machine uat.urs.earthdata.nasa.gov
    login <your username>
    password <your password>
```

Make sure that this file is only readable by the current user or you will receive an error stating
"netrc access too permissive."

`$ chmod 0600 ~/.netrc` 
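
As a quick check that an entry like the one above is picked up, the standard-library `netrc` module can parse the file directly. The sketch below is illustrative only: it writes a throwaway file in a temporary directory instead of touching your real `~/.netrc`, the credentials `demo_user`/`demo_pass` are placeholders, and the helper name `lookup_credentials` is hypothetical. It also shows why the login function in the next cell catches `TypeError`: `authenticators()` returns `None` for a host with no entry.

```python
import netrc
import os
import stat
import tempfile


def lookup_credentials(netrc_path, host):
    """Return (login, password) for host, or None if the file has no entry for it."""
    entry = netrc.netrc(netrc_path).authenticators(host)
    if entry is None:
        # Unpacking None directly is what raises TypeError in the notebook's function.
        return None
    login, _account, password = entry
    return login, password


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_netrc")
    with open(path, "w") as f:
        # Placeholder credentials, for illustration only.
        f.write(
            "machine uat.urs.earthdata.nasa.gov\n"
            "    login demo_user\n"
            "    password demo_pass\n"
        )
    # Same effect as `chmod 0600`: read/write for the owner only.
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)

    found = lookup_credentials(path, "uat.urs.earthdata.nasa.gov")
    missing = lookup_credentials(path, "urs.earthdata.nasa.gov")

print(found)
print(missing)
```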


In [1]:
from urllib import request
from http.cookiejar import CookieJar
import getpass
import netrc

def setup_earthdata_login_auth(endpoint):
    """
    Set up the request library so that it authenticates against the given Earthdata Login
    endpoint and is able to track cookies between requests.  This looks in the .netrc file 
    first and if no credentials are found, it prompts for them.

    Valid endpoints include:
        uat.urs.earthdata.nasa.gov - Earthdata Login UAT (Harmony's current default)
        urs.earthdata.nasa.gov - Earthdata Login production
    """
    try:
        username, _, password = netrc.netrc().authenticators(endpoint)
    except (FileNotFoundError, TypeError):
        # FileNotFound = There's no .netrc file
        # TypeError = The endpoint isn't in the netrc file, causing the above to try unpacking None
        print('Please provide your Earthdata Login credentials to allow data access')
        print('Your credentials will only be passed to %s and will not be exposed in Jupyter' % (endpoint))
        username = input('Username:')
        password = getpass.getpass()

    manager = request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, endpoint, username, password)
    auth = request.HTTPBasicAuthHandler(manager)

    jar = CookieJar()
    processor = request.HTTPCookieProcessor(jar)
    opener = request.build_opener(auth, processor)
    request.install_opener(opener)

In [2]:
setup_earthdata_login_auth('uat.urs.earthdata.nasa.gov')

## Find granules by cycle/pass number

The CMR Search API provides for searching ingested granules by their cycle and pass numbers. A third parameter, the `tile` identifier, is provisioned for use during the upcoming SWOT mission but isn't used by CMR Search at this time. Read more about these orbit identifiers [here](#). 

Passes within a cycle are unique; there are no repeats until the next cycle. Tile numbers are unique only within a pass, so if you look at tile numbers alone, each one occurs over 300 times per cycle but only once per pass.

*Info below may only apply to NRT use case:*

>This workflow/notebook can be run routinely to maintain a time series of NRT data, downloading new granules as they become available in CMR. 
>
>The notebook writes/overwrites a file `.update` to the target data directory with each successful run. The file tracks the date and time of the most recent update to the time series of NRT granules using a timestamp in the format `yyyy-mm-ddThh:mm:ssZ`. 
>
>The timestamp matches the value used for the [`created_at`](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#g-created-at) parameter in the last successful run. This parameter finds the granules created within a range of datetimes. This workflow leverages the `created_at` parameter to search backwards in time for new granules ingested between the time of our timestamp and now.
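
For reference, here is a minimal sketch of producing and parsing a timestamp in the `yyyy-mm-ddThh:mm:ssZ` format described above. The helper names `format_timestamp` and `parse_timestamp` are hypothetical and are not used elsewhere in this notebook; the sketch assumes timezone-aware datetimes and treats the trailing `Z` as UTC.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def format_timestamp(moment):
    """Format a timezone-aware datetime as a UTC string such as 2020-09-02T12:59:27Z."""
    return moment.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)


def parse_timestamp(text):
    """Parse a string produced by format_timestamp back into an aware UTC datetime."""
    return datetime.strptime(text.strip(), TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


example = datetime(2020, 9, 2, 12, 59, 27, tzinfo=timezone.utc)
stamp = format_timestamp(example)
print(stamp)  # 2020-09-02T12:59:27Z
```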

The variables in the cell below determine the workflow behavior on its initial run:

* `trackcycle` and `trackpass`: Set the cycle and pass numbers to use for the CMR granule search.
* `cmr`: The domain of the target CMR instance, either `cmr.earthdata.nasa.gov` or `cmr.uat.earthdata.nasa.gov`.
* `ccid`: The unique CMR `concept-id` of the desired collection.
* `data`: The path to a local directory in which to download/maintain a copy of the NRT granule time series.

In [3]:
#
# This cell accepts parameters from command line with papermill: 
#  https://papermill.readthedocs.io
#
# These variables should be set before the first run, then they 
#  should be left alone. All subsequent runs expect the values 
#  for cmr, ccid, data to be unchanged. The mins value has no 
#  impact on subsequent runs.
#

trackcycle = 230
trackpass = 1

cmr = "cmr.uat.earthdata.nasa.gov"

ccid = "C1234208437-POCLOUD"

data = "resources/cyclepass"

The variable `data` is pointed at a nearby folder [`resources/cyclepass`](resources/cyclepass/) by default. **You should change `data` to a suitable download path on your file system.** An unlucky sequence of git commands could make that folder and its downloads disappear if you're not careful. Just change it.

In [4]:
from os import makedirs
from os.path import isdir, basename
from urllib.parse import urlencode
from urllib.request import urlopen, urlretrieve
from datetime import datetime, timedelta
from json import dumps, loads

**The search retrieves granules ingested during the last `n` minutes.** If a file that tracks updates exists in your local data directory, the search window starts at the time recorded there. If not, the CMR search falls back on a ten-minute window.

In [5]:
#timestamp = (datetime.utcnow()-timedelta(minutes=mins)).strftime("%Y-%m-%dT%H:%M:%SZ")
#timestamp

This cell will replace the timestamp above with the one read from the `.update` file in the data directory, if it exists.

In [6]:
if not isdir(data):
    print(f"NOTE: Making new data directory at '{data}'. (This is the first run.)")
    makedirs(data)
#else:
#    try:
#        with open(f"{data}/.update", "r") as f:
#            timestamp = f.read()
#    except FileNotFoundError:
#        print("WARN: No .update in the data directory. (Is this the first run?)")
#    else:
#        print(f"NOTE: .update found in the data directory. (The last run was at {timestamp}.)")

NOTE: Making new data directory at 'resources/cyclepass'. (This is the first run.)
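
The timestamp logic is commented out in this notebook, but the fallback behavior it describes could be sketched roughly as follows. This is an assumption-laden illustration, not the notebook's own code: the function name `load_timestamp` and its `fallback_minutes` and `now` parameters are hypothetical, and the example runs against a temporary directory so it does not touch your data folder.

```python
import os
import tempfile
from datetime import datetime, timedelta, timezone


def load_timestamp(data_dir, fallback_minutes=10, now=None):
    """Return the timestamp stored in <data_dir>/.update, or a fallback window start."""
    now = now or datetime.now(timezone.utc)
    try:
        with open(os.path.join(data_dir, ".update"), "r") as f:
            return f.read().strip()
    except FileNotFoundError:
        # No previous run recorded: look back a fixed number of minutes instead.
        window_start = now - timedelta(minutes=fallback_minutes)
        return window_start.strftime("%Y-%m-%dT%H:%M:%SZ")


with tempfile.TemporaryDirectory() as tmp:
    fixed_now = datetime(2020, 9, 2, 13, 0, 0, tzinfo=timezone.utc)

    # First run: no .update file yet, so the ten-minute fallback applies.
    first_run = load_timestamp(tmp, now=fixed_now)

    # Simulate a previous successful run having written its timestamp.
    with open(os.path.join(tmp, ".update"), "w") as f:
        f.write("2020-09-02T12:59:27Z")
    later_run = load_timestamp(tmp, now=fixed_now)

print(first_run)
print(later_run)
```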


There are several ways to query for CMR updates that occurred during a given timeframe. Read on in the CMR Search documentation:

* https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#c-with-new-granules (Collections)
* https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#c-with-revised-granules (Collections)
* https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#g-production-date (Granules)
* https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#g-created-at (Granules)

The `created_at` parameter works for our purposes. It's a granule search parameter that returns the records ingested since the input timestamp.

In [7]:
params = {
    'scroll': "true",
    'page_size': 2000,
    'sort_key': "-start_date",
    'collection_concept_id': ccid, 
    #'created_at': timestamp,
    # Limit results to granules matching cycle, pass numbers:
    'cycle': trackcycle,
    'passes[0][pass]': trackpass,
}

params

{'scroll': 'true',
 'page_size': 2000,
 'sort_key': '-start_date',
 'collection_concept_id': 'C1234208437-POCLOUD',
 'cycle': 230,
 'passes[0][pass]': 1}

Get the query parameters as a string and then the complete search url:

In [8]:
query = urlencode(params)
url = f"https://{cmr}/search/granules.umm_json?{query}"
print(url)

https://cmr.uat.earthdata.nasa.gov/search/granules.umm_json?scroll=true&page_size=2000&sort_key=-start_date&collection_concept_id=C1234208437-POCLOUD&cycle=230&passes%5B0%5D%5Bpass%5D=1
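
Note how `urlencode` percent-encodes the square brackets in `passes[0][pass]` as `%5B` and `%5D`. The sketch below wraps the same URL construction in a hypothetical helper, `build_granule_search_url`, and decodes the query string again to confirm the parameter survives the round trip. Accepting a list of passes is an assumption on my part: check the CMR Search documentation before relying on more than one `passes[i][pass]` entry.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_granule_search_url(cmr_host, concept_id, cycle, passes, page_size=2000):
    """Build a granules.umm_json search URL for one cycle and one or more passes."""
    params = {
        "scroll": "true",
        "page_size": page_size,
        "sort_key": "-start_date",
        "collection_concept_id": concept_id,
        "cycle": cycle,
    }
    # One indexed entry per pass: passes[0][pass], passes[1][pass], ...
    for index, pass_number in enumerate(passes):
        params[f"passes[{index}][pass]"] = pass_number
    return f"https://{cmr_host}/search/granules.umm_json?{urlencode(params)}"


example_url = build_granule_search_url(
    "cmr.uat.earthdata.nasa.gov", "C1234208437-POCLOUD", cycle=230, passes=[1]
)
decoded = parse_qs(urlsplit(example_url).query)

print(example_url)
print(decoded["passes[0][pass]"])
```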


Download the granule records that match our search parameters.

In [9]:
with urlopen(url) as f:
    results = loads(f.read().decode())

print(f"{results['hits']} granule result(s) for '{ccid}' cycle '{trackcycle}' and pass '{trackpass}'.")

1 granule result(s) for 'C1234208437-POCLOUD' cycle '230' and pass '1'.


Neatly print the first granule's record for reference (assuming at least one was returned).

In [10]:
#if len(results['items'])>0:
#    print(dumps(results['items'][0], indent=2))
#    
#    # Also, replace timestamp with one corresponding to time of the search.
#    timestamp = datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")

The link for HTTP access is denoted by `"Type": "GET DATA"` in the list of `RelatedUrls`.

Grab the download URL, but do it in a way that'll work for search results returning any number of granule records:

In [11]:
downloads = [r['umm']['RelatedUrls'][0]['URL'] for r in results['items']]
downloads

['https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/JASON-1_L2_OST_GPR_E/JA1_GPR_2PeP230_001_20080403_213355_20080403_223004.nc']
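
The cell above takes the first entry of `RelatedUrls`, which happens to be the data link for this granule. If the order is not guaranteed, a more defensive approach is to select entries by their `Type`. The following is a minimal sketch under that assumption: `get_data_urls` is a hypothetical helper, and `sample_results` is made-up data with placeholder `example.com` URLs that only mimics the shape of the search response.

```python
def get_data_urls(granule):
    """Return the URLs of all RelatedUrls entries whose Type is 'GET DATA'."""
    related = granule.get("umm", {}).get("RelatedUrls", [])
    return [entry["URL"] for entry in related if entry.get("Type") == "GET DATA"]


# Made-up records for illustration; real records contain many more fields.
sample_results = {
    "items": [
        {"umm": {"RelatedUrls": [
            {"URL": "https://example.com/metadata/granule-a.cmr.json", "Type": "EXTENDED METADATA"},
            {"URL": "https://example.com/data/granule-a.nc", "Type": "GET DATA"},
        ]}},
        {"umm": {"RelatedUrls": [
            {"URL": "https://example.com/data/granule-b.nc", "Type": "GET DATA"},
        ]}},
    ]
}

sample_downloads = [
    url for item in sample_results["items"] for url in get_data_urls(item)
]
print(sample_downloads)
```

A granule with no `GET DATA` entry simply contributes nothing to the list, so the download loop can run unchanged on the result.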

Finish by downloading the files to the data directory in a loop. Overwrite `.update` with a new timestamp on success.

In [12]:
for f in downloads:
    try:
        urlretrieve(f, f"{data}/{basename(f)}")
    except Exception as e:
        print(f"[{datetime.now()}] FAILURE: {f}\n\n{e}\n")
        raise e
    else:
        print(f"[{datetime.now()}] SUCCESS: {f}")

[2020-09-02 12:59:27.589523] SUCCESS: https://archive.podaac.uat.earthdata.nasa.gov/podaac-uat-cumulus-protected/JASON-1_L2_OST_GPR_E/JA1_GPR_2PeP230_001_20080403_213355_20080403_223004.nc


If there were updates to the local time series during this run and no exceptions were raised during the download loop, then overwrite the timestamp file that tracks updates to the data folder (`.update` in the `data` directory):

In [13]:
#if len(results['items'])>0:
#    with open(f"{data}/.update", "w") as f:
#        f.write(timestamp)