# Using CMR to View Cloud-Hosted Datasets
### Author: Chris Battisto
### Date Authored: 1-31-22

### Timing

Exercise: 15 minutes

### Overview

This notebook demonstrates how to access cloud-hosted GES DISC granules using the [Common Metadata Repository (CMR) API](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html).

### Prerequisites

This notebook was written using Python 3.8, and requires these libraries:
- requests
- xarray
- s3fs

Note: This notebook **will only run in an environment with <code>us-west-2</code> AWS region access**.

### Import Libraries

In [1]:
import requests
import xarray as xr
import s3fs


### Create a Function for CMR Catalog Requests

In [2]:
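def request_collection(params):
    """Send a GET request to CMR and return the raw response.

    Uses the module-level variable `url`, which is set in the next cell
    before this function is first called.
    """
    response = requests.get(url,
                            params=params,
                            headers={'Accept': 'application/json'})
    return response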

### Search CMR Catalogs and Obtain Data URLs

First, check that the CMR catalog can be accessed:

In [3]:
url = 'https://cmr.earthdata.nasa.gov/search/collections'

# Create our request for finding cloud-hosted granules, and check that we can access CMR
response = request_collection({
                            'cloud_hosted': 'True',
                            'has_granules': 'True'
                            })

if response.status_code == 200:
    print(str(response.status_code) + ", CMR is accessible")
else:
    print(str(response.status_code) + ", CMR is not accessible, check for outages")

200, CMR is accessible
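
For reference, the request above is an ordinary HTTP GET with the search parameters encoded in the query string. The following minimal sketch reconstructs that URL using only the standard library. The helper name `build_cmr_url` is hypothetical and is not part of this notebook or of CMR.

```python
from urllib.parse import urlencode

CMR_COLLECTIONS_URL = 'https://cmr.earthdata.nasa.gov/search/collections'

def build_cmr_url(params, base_url=CMR_COLLECTIONS_URL):
    """Return the full URL for a CMR collection search (hypothetical helper)."""
    # urlencode turns the dict into 'key=value' pairs joined by '&'
    return base_url + '?' + urlencode(params)

full_url = build_cmr_url({'cloud_hosted': 'True', 'has_granules': 'True'})
print(full_url)
```

Pasting the printed URL into a browser can be a quick way to inspect the same search outside of Python.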


Let's see how many cloud-hosted data collections are currently in the GES DISC CMR catalog:

In [4]:

provider = 'GES_DISC'
response = request_collection({
                            'cloud_hosted': 'True',
                            'has_granules': 'True',
                            'provider': provider, # Only look for data hosted by GES-DISC
                            })

# See how many hits are returned
hits = int(response.headers['cmr-hits'])
print(hits)

10
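
The <code>cmr-hits</code> header reports the total number of matches, but CMR returns results one page at a time, so a single response may hold fewer entries than the header reports. The sketch below shows one way to gather every page using CMR's <code>page_size</code> and <code>page_num</code> search parameters. The names `collect_all_entries` and `fetch_page` are assumptions made for this illustration, and the demonstration runs on made-up data rather than a live CMR query.

```python
def collect_all_entries(fetch_page, params, page_size=10):
    """Gather entries from every page of a CMR-style search.

    fetch_page is any callable that accepts a params dict and returns
    (total_hits, entries_on_that_page). It stands in for a thin wrapper
    around request_collection and is an assumption of this sketch.
    """
    entries = []
    page_num = 1
    while True:
        # Copy the caller's params and add the paging parameters
        page_params = dict(params, page_size=page_size, page_num=page_num)
        total_hits, page_entries = fetch_page(page_params)
        entries.extend(page_entries)
        # Stop on an empty page or once every hit has been collected
        if not page_entries or len(entries) >= total_hits:
            break
        page_num += 1
    return entries


# Offline demonstration: 23 made-up entries standing in for a CMR catalog
fake_catalog = [{'dataset_id': 'collection %d' % i} for i in range(23)]

def fake_fetch(p):
    start = (p['page_num'] - 1) * p['page_size']
    return len(fake_catalog), fake_catalog[start:start + p['page_size']]

all_entries = collect_all_entries(fake_fetch, {'provider': 'GES_DISC'})
print(len(all_entries))
```

In this notebook, `fetch_page` could call <code>request_collection</code> and return the integer value of the <code>cmr-hits</code> header together with the <code>['feed']['entry']</code> list from the JSON body.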


Here are the current GES DISC datasets available in the cloud as of March 2022:

In [5]:
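# Parse the JSON once, then loop over the entries actually returned
# (CMR returns results in pages, so this can be fewer than `hits`)
entries = response.json()['feed']['entry']
for entry in entries:
    print(entry['dataset_id'])

merra_dataset_id = entries[0]['id']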

MERRA-2 tavg1_2d_slv_Nx: 2d,1-Hourly,Time-Averaged,Single-Level,Assimilation,Single-Level Diagnostics 0.625 x 0.5 degree V5.12.4 (M2T1NXSLV) at GES DISC
GPM IMERG Final Precipitation L3 Half Hourly 0.1 degree x 0.1 degree V06 (GPM_3IMERGHH) at GES DISC
Sounder SIPS: Sun Synchronous 13:30 orbit Climate Hyperspectral InfraRed Product (CHIRP): Calibrated Radiances from EOS-Aqua, S-NPP, JPSS-1/NOAA-20, V2 (SNDR13CHRP1) at GES DISC
Sounder SIPS: Sun Synchronous 13:30 orbit Climate Hyperspectral InfraRed Product (CHIRP): Calibrated Radiances from EOS-Aqua, V2 (SNDR13CHRP1AQCal) at GES DISC
Sounder SIPS: Sun Synchronous 13:30 orbit Climate Hyperspectral InfraRed Product (CHIRP): Calibrated Radiances from JPSS-1/NOAA-20, V2 (SNDR13CHRP1J1Cal) at GES DISC
Sounder SIPS: Sun Synchronous 13:30 orbit Climate Hyperspectral InfraRed Product (CHIRP): Calibrated Radiances from S-NPP, V2 (SNDR13CHRP1SNCal) at GES DISC
Sounder SIPS: AQUA AIRS IR + MW Level 2 CLIMCAPS: Atmosphere, cloud and surface geophy

S3 URLs currently cannot be obtained through the CMR API. Instead, they are accessed manually through the Earthdata Cloud search tool, or through OPeNDAP links, which will be preserved. A dataset directory can have its parent link switched to S3 (for example, change <code>https://goldsmr4.gesdisc.eosdis.nasa.gov/data/MERRA2/</code> to <code>s3://gesdisc-cumulus-prod-protected/MERRA2/</code>) to switch easily between cloud-hosted and on-prem data. Remember that datasets such as GPM IMERG may organize their files differently, so it is recommended to use the GES DISC subsetting tool, CMR, or Earthdata Search to generate links.

Once we know the date and time of the granule(s) that we want to access, we can replace the on-prem server portion of each link with the <code>s3://</code> bucket prefix by using Python's <code>str.replace</code> method.

In [6]:
# Paste link generated by GES DISC subsetter

merra_opendap_link = 'https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/M2T1NXSLV.5.12.4/2013/05/MERRA2_400.tavg1_2d_slv_Nx.20130531.nc4'
print('OPeNDAP Link:', merra_opendap_link)

# Manually replace the on-prem server link with S3 for file list generation
merra_s3_link = merra_opendap_link.replace('https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/', 
                                 's3://gesdisc-cumulus-prod-protected/')

print('S3 Link:', merra_s3_link)


OPeNDAP Link: https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/M2T1NXSLV.5.12.4/2013/05/MERRA2_400.tavg1_2d_slv_Nx.20130531.nc4
S3 Link: s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2013/05/MERRA2_400.tavg1_2d_slv_Nx.20130531.nc4
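
Note that <code>replace</code> returns the string unchanged when the on-prem prefix is not found, so a link from a different server would pass through silently. The sketch below is a slightly stricter variant that parses the URL and raises an error when the path is not an OPeNDAP path. The function name `opendap_to_s3` is hypothetical, and the bucket name is simply the one used in this notebook; other datasets may require a different mapping.

```python
from urllib.parse import urlparse

OPENDAP_PREFIX = '/opendap/'
S3_BUCKET = 'gesdisc-cumulus-prod-protected'  # bucket used in this notebook

def opendap_to_s3(opendap_url, bucket=S3_BUCKET):
    """Convert an on-prem OPeNDAP link to an S3 link (hypothetical helper)."""
    path = urlparse(opendap_url).path
    if not path.startswith(OPENDAP_PREFIX):
        # Fail loudly instead of returning an unchanged https:// link
        raise ValueError('Not an OPeNDAP link: ' + opendap_url)
    # Keep everything after '/opendap/' and place it under the bucket
    return 's3://' + bucket + '/' + path[len(OPENDAP_PREFIX):]

example_link = ('https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/'
                'M2T1NXSLV.5.12.4/2013/05/MERRA2_400.tavg1_2d_slv_Nx.20130531.nc4')
print(opendap_to_s3(example_link))
```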


### Obtain S3 credentials and bucket links

Remember that the credential token requires a previously generated netrc file, and that it will only last for one hour before needing to be regenerated.

In [7]:
gesdisc_s3 = "https://data.gesdisc.earthdata.nasa.gov/s3credentials"

# Define a function for S3 access credentials

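def begin_s3_direct_access(url: str=gesdisc_s3):
    response = requests.get(url)
    # Stop early with a clear error if the credentials request was rejected
    response.raise_for_status()
    credentials = response.json()
    return s3fs.S3FileSystem(key=credentials['accessKeyId'],
                             secret=credentials['secretAccessKey'],
                             token=credentials['sessionToken'],
                             client_kwargs={'region_name':'us-west-2'})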

fs = begin_s3_direct_access()

# Confirm that an S3FileSystem object was created, which means the credentials endpoint returned a key, secret, and token
# Common causes of rejected S3 credential requests include an incorrect password stored in the netrc file, or a missing netrc file
type(fs)

s3fs.core.S3FileSystem
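
Because the credentials last only one hour, a long-running session needs to request new ones from time to time. The sketch below shows one possible pattern: cache the credentials and fetch a fresh set shortly before they expire. The class name `ExpiringCredentials`, the one-hour lifetime default, and the five-minute safety margin are all assumptions of this sketch. The demonstration uses a fake fetch function and a fake clock, so it makes no network calls.

```python
import time

class ExpiringCredentials:
    """Cache credentials and refetch them shortly before they expire.

    fetch is any callable returning a credentials dict. In this notebook it
    could wrap the request made to the s3credentials endpoint above.
    """
    def __init__(self, fetch, lifetime_seconds=3600, margin_seconds=300,
                 clock=time.time):
        self._fetch = fetch
        self._lifetime = lifetime_seconds
        self._margin = margin_seconds
        self._clock = clock
        self._creds = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        # Refetch on first use, or when inside the safety margin before expiry
        if self._creds is None or now >= self._expires_at - self._margin:
            self._creds = self._fetch()
            self._expires_at = now + self._lifetime
        return self._creds


# Offline demonstration with placeholder values and a controllable clock
calls = []

def fake_fetch():
    calls.append(1)
    return {'accessKeyId': 'EXAMPLE', 'secretAccessKey': 'EXAMPLE',
            'sessionToken': 'EXAMPLE'}

fake_now = [0.0]
cache = ExpiringCredentials(fake_fetch, clock=lambda: fake_now[0])
cache.get()            # first call fetches
cache.get()            # still fresh, served from the cache
fake_now[0] = 3400.0   # within five minutes of the one-hour expiry
cache.get()            # triggers a second fetch
print(len(calls))
```

A new <code>S3FileSystem</code> object would then be built from the refreshed values whenever a fetch occurs.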

Finally, we can open the granule in Xarray:

In [8]:
ds_merra_s3 = xr.open_dataset(fs.open(merra_s3_link))
ds_merra_s3

### Additional Exercise: Compare On-prem and S3 Granules

Xarray's <code>equals()</code> method compares any two Xarray data objects. Here, it checks whether the on-prem and S3 granules contain identical data:

In [9]:
ds_merra_on_prem = xr.open_dataset(merra_opendap_link)

# Always use equals() for checking if Xarray datasets are identical
if ds_merra_s3.equals(ds_merra_on_prem):
    print('The on-prem and S3 datasets are equal and intact!')
else:
    print('The on-prem and S3 datasets differ')

The on-prem and S3 datasets are equal and intact!
