# Demonstration of S3 in-region access to Earthdata Cloud data
In this notebook we demonstrate how to find cloud-hosted data in Earthdata Cloud (EDC) and access it through the AWS S3 API for in-region compute.
EDC data is hosted in us-west-2. To run this notebook you need the following:
- An Earthdata Login (EDL) account (you can create one at urs.earthdata.nasa.gov)
- A notebook server running in us-west-2

In [None]:
from urllib import request, parse
from http.cookiejar import CookieJar
import getpass
import netrc
import requests
import json
import os
import boto3
from urllib.parse import urlparse

## Registration and authentication

In order to access EDC data you need to register with Earthdata Login (EDL) and obtain EDL credentials for your data access.

The function below lets Python scripts log in to Earthdata Login programmatically. To avoid being prompted for credentials on every run, and to let clients such as curl log in as well, add the following to a .netrc file (_netrc on Windows) in your home directory:

machine urs.earthdata.nasa.gov
    login <your username>
    password <your password>

Make sure that this file is only readable by the current user or you will receive an error stating "netrc access too permissive."

$ chmod 0600 ~/.netrc
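
The .netrc lookup and the permission requirement described above can be sketched as a small helper. This is a minimal sketch, not part of the original notebook: the function name `read_netrc_credentials` is hypothetical, and the permission check assumes POSIX-style mode bits (it is not meaningful on Windows).

```python
import netrc
import os
import stat


def read_netrc_credentials(path, machine):
    """Return (login, password) for `machine` from the netrc file at `path`.

    Returns None when the file has no entry for `machine`.
    Hypothetical helper for illustration only.
    """
    # Refuse files that group or other users can access (POSIX mode bits).
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(path + " is accessible by group or others; run chmod 0600")

    entry = netrc.netrc(path).authenticators(machine)
    if entry is None:
        return None
    login, _account, password = entry
    return login, password
```

For example, `read_netrc_credentials(os.path.expanduser('~/.netrc'), 'urs.earthdata.nasa.gov')` would return your stored username and password, or None if no entry for that machine exists.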

In [None]:
def setup_earthdata_login_auth(endpoint):
    """
    Set up the request library so that it authenticates against the given Earthdata Login
    endpoint and is able to track cookies between requests.  This looks in the .netrc file 
    first and if no credentials are found, it prompts for them.

    Valid endpoints include:
        uat.urs.earthdata.nasa.gov - Earthdata Login UAT
        urs.earthdata.nasa.gov - Earthdata Login production
    """
    try:
        username, _, password = netrc.netrc().authenticators(endpoint)
    except (FileNotFoundError, TypeError):
        # FileNotFound = There's no .netrc file
        # TypeError = The endpoint isn't in the netrc file, causing the above to try unpacking None
        print('Please provide your Earthdata Login credentials to allow data access')
        print('Your credentials will only be passed to %s and will not be exposed in Jupyter' % (endpoint))
        username = input('Username:')
        password = getpass.getpass()

    manager = request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, endpoint, username, password)
    auth = request.HTTPBasicAuthHandler(manager)

    jar = CookieJar()
    processor = request.HTTPCookieProcessor(jar)
    opener = request.build_opener(auth, processor)
    request.install_opener(opener)

Let's set up our EDL authentication against the production environment at urs.earthdata.nasa.gov.

In [None]:
setup_earthdata_login_auth('urs.earthdata.nasa.gov')

## Data discovery via the Common Metadata Repository (CMR)
### Step 1: Collection/Dataset discovery.
We can use CMR to search for collections of interest in our cloud provider, POCLOUD.

In [None]:
response = requests.get('https://cmr.earthdata.nasa.gov/search/collections.json', params={'provider': 'POCLOUD'})
results = json.loads(response.content)

concept_id = results["feed"]["entry"][0]["id"]
print("Unique identifier of collection: " + concept_id)

Note that the collection metadata describes several things you need to know about accessing EDC data via S3. To use S3 with EDC, we need to know the following:
- the region the data is housed in
- how to obtain AWS STS credentials (i.e., the STS credentials endpoint and its documentation)

EDC needs to record metrics for each data access in terms of the user performing it. It does this by linking your EDL username to an STS role: the STS credentials endpoint asks for your EDL credentials and returns temporary STS credentials that you can use to set up an AWS S3 client.

In [None]:
response = requests.get('https://cmr.earthdata.nasa.gov/search/concepts/' + concept_id + '.umm-json')
results = json.loads(response.content)
aws_region = results["DirectDistributionInformation"]["Region"]
print('AWS region: ' + aws_region)
sts_endpoint = results["DirectDistributionInformation"]["S3CredentialsAPIEndpoint"]
print('AWS STS endpoint: ' + sts_endpoint)
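
The cell above raises a KeyError if the record has no DirectDistributionInformation block. As a minimal sketch, assuming that some collections may omit this block, the lookup can be wrapped in a helper that returns None instead. The function name `get_direct_s3_info` and the keys of the returned dictionary are hypothetical; the two UMM keys are the ones used above.

```python
def get_direct_s3_info(umm_record):
    """Extract the S3 access details from a UMM-JSON collection record.

    Returns a dict with 'region' and 'credentials_endpoint', or None when
    the record has no DirectDistributionInformation block.
    Hypothetical helper for illustration only.
    """
    info = umm_record.get("DirectDistributionInformation")
    if not info:
        return None
    return {
        "region": info.get("Region"),
        "credentials_endpoint": info.get("S3CredentialsAPIEndpoint"),
    }
```

With the `results` dictionary from the cell above, `get_direct_s3_info(results)` would give the same region and endpoint, and lets you skip collections that do not offer direct S3 access.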

### Step 2: Granule/file discovery.
Using the unique identifier of the first collection returned, we can search for granules and obtain one or more S3 URLs that locate the data.

In [None]:
response = requests.get('https://cmr.earthdata.nasa.gov/search/granules.json', params={'concept_id': concept_id})
results = json.loads(response.content)

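links = results["feed"]["entry"][0]["links"]
url = None
for link in links:
    # The s3# relation marks links for direct S3 access
    if link['rel'] == "http://esipfed.org/ns/fedsearch/1.1/s3#":
        url = link['href']
        break
if url is None:
    raise ValueError("No S3 link found for this granule")
print("S3 URL for data: " + url)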
o = urlparse(url, allow_fragments=False)

bucket = o.netloc

key = o.path.lstrip('/')
print("S3 bucket: " + bucket)
print("S3 key: " + key)
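
A granule can carry more than one S3 link. As a minimal sketch of the same filtering and URL splitting applied to every link, the hypothetical helper `s3_locations` below returns a (bucket, key) pair for each S3 URL in a granule entry. It assumes entries shaped like the CMR granule JSON used above.

```python
from urllib.parse import urlparse

# Link relation that marks direct S3 access, as used in the cell above
S3_REL = "http://esipfed.org/ns/fedsearch/1.1/s3#"


def s3_locations(granule_entry):
    """Return a list of (bucket, key) tuples for every S3 link in the entry.

    Hypothetical helper for illustration only.
    """
    locations = []
    for link in granule_entry.get("links", []):
        if link.get("rel") != S3_REL:
            continue
        parsed = urlparse(link["href"], allow_fragments=False)
        if parsed.scheme != "s3":
            continue  # ignore anything that is not an s3:// URL
        locations.append((parsed.netloc, parsed.path.lstrip("/")))
    return locations
```

For example, `s3_locations(results["feed"]["entry"][0])` would list every bucket and key advertised for the first granule.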

## Accessing data
Now that we have found the location of the data, we can use the S3 API to access it.
### Step 1: Obtain AWS STS credentials.
EDC requires AWS STS credentials for data access. The STS endpoint lets us use our EDL credentials to obtain them.
Note that the requests library does not use the urllib opener installed earlier; it picks up your EDL credentials from the .netrc file, so the STS endpoint can identify you and return STS credentials.

In [None]:
response = requests.get(sts_endpoint)
response.raise_for_status()  # fail early if the request did not succeed
creds = json.loads(response.content)

### Step 2: Accessing the data via the AWS S3 API

To use the S3 API, supply the access key ID, secret access key, and session token you retrieved from the STS credentials endpoint.
Note: STS credentials are valid for one hour!
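
Because the credentials are short-lived, a long-running notebook may need to request new ones. The sketch below checks whether an expiration timestamp is close. It is an illustration under assumptions: the helper name `credentials_expiring` and the five-minute margin are hypothetical, and it assumes the expiration value is an ISO 8601 string with a numeric UTC offset (for example `2021-01-27 00:50:09+00:00`), which the source does not confirm.

```python
from datetime import datetime, timedelta, timezone


def credentials_expiring(expiration, now=None, margin=timedelta(minutes=5)):
    """Return True if the credentials expire within `margin` of `now`.

    `expiration` is assumed to be an ISO 8601 string; a value without an
    offset is treated as UTC. Hypothetical helper for illustration only.
    """
    expires_at = datetime.fromisoformat(expiration)
    if expires_at.tzinfo is None:
        expires_at = expires_at.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expires_at - margin
```

If `credentials_expiring(creds["expiration"])` returns True, request fresh credentials from the STS endpoint and build a new S3 client before accessing more data.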

In [None]:

client = boto3.client(
    's3',
    aws_access_key_id=creds["accessKeyId"],
    aws_secret_access_key=creds["secretAccessKey"],
    aws_session_token=creds["sessionToken"],
)

print("STS credentials expire on " + creds["expiration"])

# Get your data
response = client.get_object(
    Bucket=bucket,
    Key=key,
)

print("You just accessed " + response["ResponseMetadata"]["HTTPHeaders"]["content-length"] + " bytes of data in-region via the AWS S3 API.")
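
The call above requests the whole object. When only part of a file is needed, such as a header, the S3 API also accepts a byte range. The following is a minimal sketch, assuming a boto3 S3 client like the one created above; the helper name `read_byte_range` is hypothetical.

```python
def read_byte_range(client, bucket, key, start, end):
    """Read bytes start..end (inclusive) of an S3 object and return them.

    `client` is expected to behave like a boto3 S3 client.
    Hypothetical helper for illustration only.
    """
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    response = client.get_object(
        Bucket=bucket,
        Key=key,
        Range="bytes={}-{}".format(start, end),  # HTTP Range header syntax
    )
    return response["Body"].read()
```

For example, `read_byte_range(client, bucket, key, 0, 1023)` would fetch only the first 1024 bytes of the granule.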