# Using CMR to View Cloud-Hosted Datasets
### Author: Chris Battisto
### Date Authored: 1-31-22

### Timing

Exercise: 15 minutes

### Overview

This notebook demonstrates how to access cloud-hosted GES DISC granules using the [Common Metadata Repository (CMR) API](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html).

### Prerequisites

This notebook was written using Python 3.8, and requires these libraries and files:
- xarray
- S3FS
- requests
- A previously generated netrc file (needed for the S3 credentials request)

Note: This notebook **will only run in an environment with <code>us-west-2</code> AWS region access**.

### Import Libraries

In [None]:
import requests
import xarray as xr
import s3fs


### Create a Function for CMR Catalog Requests

In [None]:
#
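The cell above is empty in this copy of the notebook. As a minimal sketch of what such a helper might look like, the block below builds a CMR search URL and sends it with `requests`. The names `CMR_SEARCH_ROOT`, `build_cmr_url`, and `cmr_request` are hypothetical and are not taken from the original notebook; the root URL is inferred from the API documentation link in the Overview.

```python
from urllib.parse import urlencode

# Assumed root of the CMR search API (inferred from the documentation link in the Overview).
CMR_SEARCH_ROOT = "https://cmr.earthdata.nasa.gov/search"


def build_cmr_url(endpoint, params=None):
    """Return the full CMR search URL for an endpoint such as 'collections.json'."""
    url = f"{CMR_SEARCH_ROOT}/{endpoint.lstrip('/')}"
    if params:
        # doseq=True lets a list value expand into repeated query parameters
        url += "?" + urlencode(params, doseq=True)
    return url


def cmr_request(endpoint, params=None, timeout=30):
    """Send a GET request to CMR and return the requests.Response object."""
    import requests  # already imported at the top of the notebook

    response = requests.get(build_cmr_url(endpoint, params), timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of failing later
    return response


# Build (but do not send) a collection query; the parameter values are illustrative.
print(build_cmr_url("collections.json", {"provider": "GES_DISC", "cloud_hosted": "true"}))
```

Separating the URL construction from the network call makes it possible to inspect a query before it is sent.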

### Search CMR Catalogs and Obtain Data URLs

First, check that the CMR catalog can be accessed:

In [None]:
# 

Let's see how many cloud-hosted data collections are currently in the GES DISC CMR catalog:

In [None]:
#

Here are the current GES DISC datasets available in the cloud as of March 2022:

In [None]:
#
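The cell above is empty in this copy of the notebook. One way such a listing could be produced, assuming the collection search is requested in CMR's JSON format (where results are returned under `feed` and then `entry`), is sketched below. The function name `summarize_collections` and the sample payload are hypothetical placeholders, not actual catalog entries.

```python
def summarize_collections(payload):
    """Return (short_name, version_id, title) tuples from a CMR collections JSON payload."""
    entries = payload.get("feed", {}).get("entry", [])
    return [
        (entry.get("short_name"), entry.get("version_id"), entry.get("title"))
        for entry in entries
    ]


# Made-up payload in the shape of a CMR `collections.json` response (not real catalog data).
sample_payload = {
    "feed": {
        "entry": [
            {"short_name": "EXAMPLE_A", "version_id": "1", "title": "Example collection A"},
            {"short_name": "EXAMPLE_B", "version_id": "2", "title": "Example collection B"},
        ]
    }
}

for short_name, version_id, title in summarize_collections(sample_payload):
    print(f"{short_name} v{version_id}: {title}")
```

In the notebook, the payload would instead come from the `.json()` body of a CMR collection search response, for example one filtered with the `provider` and `cloud_hosted` query parameters (the exact parameters used by the original cell are not shown here).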

S3 URLs currently cannot be obtained through the CMR API. Instead, they are obtained manually through the Earthdata Cloud search tool or through OPeNDAP, which will remain available. The parent link of these dataset directories can be switched to S3 (for example, change <code>https://goldsmr4.gesdisc.eosdis.nasa.gov/data/MERRA2/</code> to <code>s3://gesdisc-cumulus-prod-protected/MERRA2/</code>), which makes it easy to switch between cloud-hosted and on-prem data. Remember that datasets like GPM IMERG may have different file organization structures, so it is recommended to use the GES DISC subsetting tool, CMR, or Earthdata Search to generate their links.

Once we know the date and time of the granule(s) that we want to access, we can swap the on-prem prefix of their links for the S3 prefix by using Python's <code>str.replace</code> method.

In [None]:
#
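The cell above is empty in this copy of the notebook. A minimal sketch of the prefix swap, using the MERRA-2 on-prem and S3 parent links quoted in the text, could look like the following. The function name `to_s3_url` and the granule path are hypothetical and only illustrate the pattern.

```python
# Parent links quoted in the text above.
ON_PREM_ROOT = "https://goldsmr4.gesdisc.eosdis.nasa.gov/data/MERRA2/"
S3_ROOT = "s3://gesdisc-cumulus-prod-protected/MERRA2/"


def to_s3_url(on_prem_url, on_prem_root=ON_PREM_ROOT, s3_root=S3_ROOT):
    """Swap the on-prem parent link of a granule URL for its S3 parent link."""
    if not on_prem_url.startswith(on_prem_root):
        raise ValueError(f"URL does not start with {on_prem_root!r}: {on_prem_url!r}")
    # count=1 ensures that only the leading prefix is replaced
    return on_prem_url.replace(on_prem_root, s3_root, 1)


# Illustrative granule path (an assumption, not taken from the notebook).
on_prem_url = ON_PREM_ROOT + "EXAMPLE_COLLECTION/2019/03/example_granule.nc4"
s3_url = to_s3_url(on_prem_url)
print(s3_url)
```

Checking the prefix first guards against silently passing through links from datasets with a different directory layout, such as the GPM IMERG case mentioned above.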


### Obtain S3 Credentials and Bucket Links

Remember that the credential token requires a previously generated netrc file, and that it will only last for one hour before needing to be regenerated.

In [None]:
gesdisc_s3 = "https://data.gesdisc.earthdata.nasa.gov/s3credentials"

# Define a function for S3 access credentials
def begin_s3_direct_access(url: str = gesdisc_s3):
    # requests reads the Earthdata Login credentials from the netrc file
    response = requests.get(url)
    response.raise_for_status()  # fail early if the credentials request was rejected
    credentials = response.json()
    return s3fs.S3FileSystem(key=credentials['accessKeyId'],
                             secret=credentials['secretAccessKey'],
                             token=credentials['sessionToken'],
                             client_kwargs={'region_name': 'us-west-2'})

fs = begin_s3_direct_access()

# Check that an S3FileSystem object was returned, which means that temporary credentials were issued
# Common causes of rejected credential requests include an incorrect password stored in the netrc file, or a non-existent netrc file
type(fs)

Finally, we can open the granule in Xarray:

In [None]:
#

### Additional Exercise: Compare On-prem and S3 Granules

Xarray's <code>equals()</code> method can be called to compare any two Xarray data objects. In this case, we use it to check whether the on-prem and S3 granules contain identical data:

In [None]:
#