# Demonstrating Earthdata Download w/ Harmony Subsetting

Much of this script was adapted from Jack McNelis's "Co-locate satellite and in-situ data for cross-validation" file, located here: https://github.com/podaac/tutorials/blob/master/notebooks/SWOT-EA-2021/Colocate_satellite_insitu_ocean.ipynb

**Goal**
<br/>
To access the MUR 1-km dataset stored on Amazon Web Services (AWS), to spatially and temporally subset it using Harmony, and to download that subset to the local machine.

**Dataset**
<br/>
MUR 1-km L4 SST (requires AWS early access in order to view on Earthdata Search) https://podaac.jpl.nasa.gov/MEaSUREs-MUR?tab=background&sections=about%2Bdata

### Import Modules

In [4]:
from netrc import netrc
from urllib import request
from platform import system
from getpass import getpass
from datetime import datetime
from http.cookiejar import CookieJar
from os.path import join, isfile, basename, abspath, expanduser
import xarray as xr
import requests
import json
import time

### Earthdata Login
<br/>
An Earthdata Login account is required to access data, as well as discover restricted data, from the NASA Earthdata system. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account. This account is free to create and only takes a moment to set up.

The setup_earthdata_login_auth function will allow Python scripts to log into any Earthdata Login application programmatically. To avoid being prompted for credentials every time you run and also allow clients such as curl to log in, you can add the following to a .netrc (_netrc on Windows) file in your home directory:

    machine urs.earthdata.nasa.gov
    login <your username>
    password <your password>

Make sure that this file is only readable by the current user or you will receive an error stating "netrc access too permissive."

    $ chmod 0600 ~/.netrc
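
Before relying on the file, it can help to confirm that it actually contains an entry for the Earthdata Login host. The following is a minimal sketch using only the standard-library `netrc` module; the helper name `has_earthdata_entry` and the placeholder credentials are assumptions made for illustration and are not part of the original notebook.

```python
import os
import stat
import tempfile
from netrc import netrc


def has_earthdata_entry(netrc_path, host="urs.earthdata.nasa.gov"):
    """Return True if netrc_path holds a login and password for host."""
    try:
        entry = netrc(file=netrc_path).authenticators(host)
    except FileNotFoundError:
        return False
    # authenticators() returns None when the host has no entry,
    # otherwise a (login, account, password) tuple.
    return entry is not None and bool(entry[0]) and bool(entry[2])


# Demonstration against a throwaway file with placeholder credentials
# (hypothetical values, never real ones).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, ".netrc")
    with open(demo_path, "w") as f:
        f.write("machine urs.earthdata.nasa.gov\n"
                "login demo_user\n"
                "password demo_pass\n")
    os.chmod(demo_path, stat.S_IRUSR | stat.S_IWUSR)  # same effect as chmod 0600
    found = has_earthdata_entry(demo_path)
    missing = has_earthdata_entry(os.path.join(tmp, "absent"))

print(found, missing)
```

To check your own file, pass its path (for example `join(expanduser('~'), '.netrc')`) instead of the temporary one.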

In [5]:
TOKEN_DATA = ("<token>"
              "<username>%s</username>"
              "<password>%s</password>"
              "<client_id>PODAAC CMR Client</client_id>"
              "<user_ip_address>%s</user_ip_address>"
              "</token>")


def setup_cmr_token_auth(endpoint: str='cmr.earthdata.nasa.gov'):
    ip = requests.get("https://ipinfo.io/ip").text.strip()
    return requests.post(
        url="https://%s/legacy-services/rest/tokens" % endpoint,
        data=TOKEN_DATA % (input("Username: "), getpass("Password: "), ip),
        headers={'Content-Type': 'application/xml', 'Accept': 'application/json'}
    ).json()['token']['id']


def setup_earthdata_login_auth(endpoint: str='urs.earthdata.nasa.gov'):
    netrc_name = "_netrc" if system()=="Windows" else ".netrc"
    try:
        username, _, password = netrc(file=join(expanduser('~'), netrc_name)).authenticators(endpoint)
    except (FileNotFoundError, TypeError):
        print('Please provide your Earthdata Login credentials for access.')
        print('Your info will only be passed to %s and will not be exposed in Jupyter.' % (endpoint))
        username = input('Username: ')
        password = getpass('Password: ')
    manager = request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, endpoint, username, password)
    auth = request.HTTPBasicAuthHandler(manager)
    jar = CookieJar()
    processor = request.HTTPCookieProcessor(jar)
    opener = request.build_opener(auth, processor)
    request.install_opener(opener)


# Get your authentication token for searching restricted records in the CMR:
_token = setup_cmr_token_auth(endpoint="cmr.earthdata.nasa.gov")

# Start authenticated session with URS to allow restricted data downloads:
setup_earthdata_login_auth(endpoint="urs.earthdata.nasa.gov")

Username: Matthew.A.Thompson
Password: ········


### Study region and period

In [6]:
# The timeframe of interest
start_date = "2019-01-01"
end_date   = "2019-01-31"

# The area/region of interest by latitude/longitude:
aoi_minlon = -160
aoi_minlat = 18
aoi_maxlon = -150
aoi_maxlat = 24

### Find Dataset Concept-Id
<br/>
The Harmony API requires a dataset identifier that we must obtain from the Common Metadata Repository (CMR). In the next cell, submit a request to the CMR API to retrieve the metadata for the dataset/collection.

In [7]:
mur_results = requests.get(
    url='https://cmr.earthdata.nasa.gov/search/collections.umm_json', 
    params={'provider': "POCLOUD",
            'ShortName': "MUR-JPL-L4-GLOB-v4.1",
            'token': _token}
).json()

# Select the first/only record in the JSON response:
mur_coll = mur_results['items'][0]

# Select the 'concept-id' from the 'meta' dictionary:
mur_ccid = mur_coll['meta']['concept-id']

mur_ccid

'C1996881146-POCLOUD'
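
Indexing `['items'][0]` raises an unhelpful `IndexError` if the search matches nothing (for example, when the token lacks access to the collection). The sketch below shows one way to make that failure explicit. The function name `get_concept_id` and the `sample_response` dictionary are illustrative assumptions; the sample only mimics the fields this notebook reads and is not a real CMR response.

```python
def get_concept_id(cmr_response):
    """Return the concept-id of the first collection in a CMR UMM-JSON response."""
    items = cmr_response.get("items", [])
    if not items:
        raise ValueError("CMR returned no matching collections")
    return items[0]["meta"]["concept-id"]


# Hand-made stand-in containing only the fields used above.
sample_response = {
    "items": [{"meta": {"concept-id": "C1996881146-POCLOUD"}}]
}

print(get_concept_id(sample_response))
```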

### Request Subsets from Harmony API
<br/>
We will submit a request to the Harmony API. The API is under active development, and it's therefore recommended that you test your input parameters in the Swagger API interface.

The next cell joins the base url for the API to the concept-id obtained above. Run the cell and print the complete url to confirm:

In [8]:
harmony_url = "https://harmony.earthdata.nasa.gov"
harmony_url_mur = f"{harmony_url}/{mur_ccid}/ogc-api-coverages/1.0.0/collections/all/coverage/rangeset?"

print(harmony_url_mur)

https://harmony.earthdata.nasa.gov/C1996881146-POCLOUD/ogc-api-coverages/1.0.0/collections/all/coverage/rangeset?


Make a dictionary of subset parameters and format the values to meet requirements of the Harmony API. (See the Swagger UI linked above for more information about those requirements.)

In [9]:
harmony_params_mur = {
    'time': f'("{start_date}T00:00:00.000Z":"{end_date}T23:59:59.999Z")',
    'lat': f'({aoi_minlat}:{aoi_maxlat})',
    'lon': f'({aoi_minlon}:{aoi_maxlon})',
}

harmony_params_mur

{'time': '("2019-01-01T00:00:00.000Z":"2019-01-31T23:59:59.999Z")',
 'lat': '(18:24)',
 'lon': '(-160:-150)'}

Complete the url by formatting the query portion using the parameters dictionary:

In [10]:
request_url_mur = harmony_url_mur+"subset=time{time}&subset=lat{lat}&subset=lon{lon}".format(**harmony_params_mur)

print(request_url_mur)

https://harmony.earthdata.nasa.gov/C1996881146-POCLOUD/ogc-api-coverages/1.0.0/collections/all/coverage/rangeset?subset=time("2019-01-01T00:00:00.000Z":"2019-01-31T23:59:59.999Z")&subset=lat(18:24)&subset=lon(-160:-150)
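
The two steps above (joining the base url to the concept-id, then appending the `subset` parameters) can be folded into one small helper. This is a sketch only: `build_harmony_url` is a hypothetical name, and the path segments are copied from the url printed above rather than taken from Harmony's documentation.

```python
def build_harmony_url(base_url, concept_id, subsets):
    """Assemble a coverages 'rangeset' url with one subset=<axis><range> per entry."""
    coverage = (f"{base_url}/{concept_id}"
                "/ogc-api-coverages/1.0.0/collections/all/coverage/rangeset")
    # Dictionaries keep insertion order, so parameters appear in the order given.
    query = "&".join(f"subset={axis}{rng}" for axis, rng in subsets.items())
    return f"{coverage}?{query}"


example_url = build_harmony_url(
    "https://harmony.earthdata.nasa.gov",
    "C1996881146-POCLOUD",
    {
        "time": '("2019-01-01T00:00:00.000Z":"2019-01-31T23:59:59.999Z")',
        "lat": "(18:24)",
        "lon": "(-160:-150)",
    },
)

print(example_url)
```

The helper leaves the quotes and parentheses unescaped, exactly as the cell above does.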


### Submit Request Parameters to Harmony API Endpoint
<br/>
The blank "job_status" list will be filled by the following cells.

In [11]:
job_status = []

The next cell submits the request and receives a JSON response describing it. Print the message field of that response:

In [12]:
request_urls_for_mur = [request_url_mur]

if len(job_status)==0:
    # Loop over the list of request urls:
    for r in request_urls_for_mur:
        # Submit the request and decode the response from json string to dict:
        response_mur = requests.get(r)
        # If the response came back with something other than '2xx', raise an error:
        if not response_mur.status_code // 100 == 2: 
            raise Exception(response_mur.text)
        else:
            response_data = response_mur.json()
        # Append the status endpoint to the list of 'job_status' urls:
        job_status.append(response_data['links'][0]['href'])
else:
    response_data = requests.get(job_status[0]).json()

response_data['message']

'Returning direct download links because the requested combination of operations: spatial subsetting on C1996881146-POCLOUD is unsupported.'

### Insights
<br/>
This message indicates an unsuccessful request, and as the cells below show, it eventually results in an error. The code should work in principle, but we discovered that the dataset we are trying to access is not configured for spatial subsetting while stored in AWS. If that option is enabled later, this script will be revisited. It is also possible that the call used to request the MUR URL via Harmony is incorrect for an L4 dataset, since this code was adapted from a script that accesses an L2 dataset.
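
One way to catch this situation early is to inspect the message before treating the first link as a job status url. The sketch below is an assumption-laden illustration: `is_direct_download_fallback` is a hypothetical helper, and it relies only on the message wording shown above, which Harmony may change.

```python
def is_direct_download_fallback(response_data):
    """Return True if the response message says direct download links were returned."""
    message = response_data.get("message", "")
    return message.startswith("Returning direct download links")


# The message text is copied from the output above.
unsupported = {
    "message": ("Returning direct download links because the requested "
                "combination of operations: spatial subsetting on "
                "C1996881146-POCLOUD is unsupported.")
}
# A made-up message standing in for any other response.
other = {"message": "some other message"}

print(is_direct_download_fallback(unsupported), is_direct_download_fallback(other))
```

When the check returns True, the links in the response point at whole granules rather than at a subsetting job, so polling them for a status will not work.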

### Remaining Code

In [13]:
if len(job_status)==0:
    try:
        job_status = [l['href'] for l in response_data['links'] if l['title']=="Job Status"]
    except (KeyError, IndexError) as e:
        raise e

print(job_status)

['https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20190101090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc']


In [14]:
wait = 10       # The number of seconds to wait between each status check
completed = {}  # A dict of JSON responses for completed jobs

# Loop repeatedly to check job status. Wait before retrying.
while True:
    for j in job_status:  # Iterate over list of job urls
        if j in completed:  # Skip if completed.
            continue
        # Get the current job's status as a JSON object.
        job_data = requests.get(j).json()
        if job_data['status']!='running':
            completed[j] = job_data  # Add to 'completed' if finished
    # Break loop if 'completed' dictionary contains all jobs.
    if len(completed)==len(job_status):
        break
    # If still processing, print a status update and wait ten seconds.
    print(f"# Job(s) in progress ({len(completed)+1}/{len(job_status)})")
    time.sleep(wait)
    
print(f"\n{'&'*40}\n%\t\tDONE!\n{'&'*40}\n")

KeyboardInterrupt: 
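
A polling loop with no upper bound has to be interrupted by hand when a job never reports completion. The following sketch bounds the number of checks and compares against the actual number of jobs. The names `poll_jobs`, `fetch_status`, and `max_checks` are hypothetical, the fake fetch function stands in for `requests.get(url).json()`, and the status strings are placeholders rather than confirmed Harmony values.

```python
import time


def poll_jobs(job_urls, fetch_status, wait=10, max_checks=30):
    """Poll each url until none reports 'running', or give up after max_checks rounds."""
    completed = {}
    for _ in range(max_checks):
        for url in job_urls:
            if url in completed:
                continue
            job_data = fetch_status(url)
            if job_data.get("status") != "running":
                completed[url] = job_data
        if len(completed) == len(job_urls):
            return completed
        time.sleep(wait)
    raise TimeoutError(f"{len(job_urls) - len(completed)} job(s) still running")


# Stand-in for a network call: reports 'running' twice, then a finished state.
_calls = {"n": 0}

def fake_fetch(url):
    _calls["n"] += 1
    return {"status": "running" if _calls["n"] < 3 else "successful"}


demo_url = "https://example.invalid/jobs/1"  # placeholder, not a real endpoint
result = poll_jobs([demo_url], fake_fetch, wait=0)
print(result[demo_url]["status"])
```

Passing the fetch function as an argument keeps the loop testable without network access; in the notebook it could be `lambda u: requests.get(u).json()`.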

In [None]:
print(json.dumps({k:v for k, v in job_data.items() if k!="links"}, indent=2))