# Using CMR to View Cloud-Hosted Datasets
### Author: Chris Battisto
### Date Authored: 1-31-22

### Timing

Exercise: 15 minutes

### Overview

This notebook demonstrates how to access cloud-hosted GES DISC granules using the [Common Metadata Repository (CMR) API](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html), and then compares them with their equivalent on-prem hosted granules.

### Prerequisites

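This notebook was written using Python 3.8, and requires these libraries and files:
- requests
- xarray
- S3FS
- h5netcdf (used as the Xarray engine for the HDF5 granules)
- A netrc file containing your Earthdata Login credentials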

Note: This notebook **will only run in an environment with <code>us-west-2</code> AWS region access**.

### Import Libraries

In [1]:
import requests
import xarray as xr
import s3fs


### Create a Function for CMR Catalog Requests

In [2]:
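def request_collection(params):
    # Query the CMR endpoint stored in the global `url` (set in the next cell)
    # and ask for a JSON response
    response = requests.get(url,
                            params=params,
                            headers={'Accept': 'application/json'})
    return response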

### Search CMR Catalogs and Obtain Data URLs

First, check that the CMR catalog can be accessed:

In [3]:
url = 'https://cmr.earthdata.nasa.gov/search/collections'

# Create our request for finding cloud-hosted granules, and check that we can access CMR
response = request_collection({
                            'cloud_hosted': 'True',
                            'has_granules': 'True'
                            })

if response.status_code == 200:
    print(str(response.status_code) + ", CMR is accessible")
else:
    print(str(response.status_code) + ", CMR is not accessible, check for outages")

200, CMR is accessible
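
When a request does not return what you expect, it can help to see the exact URL that the query parameters produce so that it can be pasted into a browser. The sketch below is only an illustration: `preview_cmr_url` is a hypothetical helper that is not part of the original notebook, and it uses the standard library to mirror the way simple string parameters are encoded into a query string.

```python
from urllib.parse import urlencode


def preview_cmr_url(base_url, params):
    """Return the URL a GET request with these query parameters would target.

    Hypothetical helper for debugging; it only builds a string and sends nothing.
    """
    return base_url + '?' + urlencode(params)


cmr_collections = 'https://cmr.earthdata.nasa.gov/search/collections'
print(preview_cmr_url(cmr_collections,
                      {'cloud_hosted': 'True', 'has_granules': 'True'}))
```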


Let's see how many cloud-hosted data collections are currently in the GES DISC CMR catalog:

In [4]:

provider = 'GES_DISC'
response = request_collection({
                            'cloud_hosted': 'True',
                            'has_granules': 'True',
                            'provider': provider, # Only look for data hosted by GES-DISC
                            })

# See how many hits are returned
hits = int(response.headers['cmr-hits'])
print(hits)

10


Here are the current GES DISC datasets available in the cloud as of March 2022:

In [6]:
# Parse the JSON once and loop over the entries returned on this page of results
for entry in response.json()['feed']['entry']:
    print(entry['dataset_id'])

MERRA-2 tavg1_2d_slv_Nx: 2d,1-Hourly,Time-Averaged,Single-Level,Assimilation,Single-Level Diagnostics 0.625 x 0.5 degree V5.12.4 (M2T1NXSLV) at GES DISC
GPM IMERG Final Precipitation L3 Half Hourly 0.1 degree x 0.1 degree V06 (GPM_3IMERGHH) at GES DISC
Sounder SIPS: Sun Synchronous 13:30 orbit Climate Hyperspectral InfraRed Product (CHIRP): Calibrated Radiances from EOS-Aqua, S-NPP, JPSS-1/NOAA-20, V2 (SNDR13CHRP1) at GES DISC
Sounder SIPS: Sun Synchronous 13:30 orbit Climate Hyperspectral InfraRed Product (CHIRP): Calibrated Radiances from EOS-Aqua, V2 (SNDR13CHRP1AQCal) at GES DISC
Sounder SIPS: Sun Synchronous 13:30 orbit Climate Hyperspectral InfraRed Product (CHIRP): Calibrated Radiances from JPSS-1/NOAA-20, V2 (SNDR13CHRP1J1Cal) at GES DISC
Sounder SIPS: Sun Synchronous 13:30 orbit Climate Hyperspectral InfraRed Product (CHIRP): Calibrated Radiances from S-NPP, V2 (SNDR13CHRP1SNCal) at GES DISC
Sounder SIPS: AQUA AIRS IR + MW Level 2 CLIMCAPS: Atmosphere, cloud and surface geophy
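
The `cmr-hits` header reports the total number of matching collections, while a single response carries only one page of entries. If the number of hits ever grows beyond one page, the remaining entries have to be requested page by page with the CMR `page_size` and `page_num` parameters. The following is a minimal sketch of how those requests could be prepared; `page_params` is a hypothetical helper, and the page size of 10 is an assumption about the CMR default.

```python
import math


def page_params(base_params, hits, page_size=10):
    """Build one parameter dictionary per page of CMR results.

    Hypothetical helper: `page_size=10` is assumed to match the CMR default,
    and `page_num` is counted from 1.
    """
    n_pages = math.ceil(hits / page_size)
    return [dict(base_params, page_size=page_size, page_num=page)
            for page in range(1, n_pages + 1)]


# 25 hits at 10 entries per page need three requests
for params in page_params({'provider': 'GES_DISC'}, hits=25):
    print(params)
```

Each dictionary could then be passed to `request_collection`, collecting the `dataset_id` values from every page.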

After sifting through the returned JSON response, we can get the on-prem OPeNDAP links for the MERRA-2 and GPM IMERG datasets. Alternatively, data can be subset and then downloaded with wget/cURL by generating OPeNDAP links with the GES DISC subsetting tool.

In [10]:
# First, for MERRA-2
response.json()['feed']['entry'][0]['links'][7]

{'rel': 'http://esipfed.org/ns/fedsearch/1.1/service#',
 'hreflang': 'en-US',
 'href': 'https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/M2T1NXSLV.5.12.4/contents.html'}

In [9]:
# Then for GPM_IMERG
response.json()['feed']['entry'][1]['links'][4]

{'rel': 'http://esipfed.org/ns/fedsearch/1.1/service#',
 'hreflang': 'en-US',
 'href': 'https://gpm1.gesdisc.eosdis.nasa.gov/opendap/GPM_L3/GPM_3IMERGHH.06/contents.html'}
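
The positions of these links (`[7]` and `[4]`) were found by inspecting the response, and they may differ between collections or change over time. A less fragile approach is to search each entry's links for an OPeNDAP address. This is a sketch only: `find_opendap_link` is a hypothetical helper, and matching on the `/opendap/` path segment is an assumption based on the two links shown above.

```python
def find_opendap_link(entry):
    """Return the first link in a CMR entry whose href contains '/opendap/'.

    Hypothetical helper; returns None when no such link is present.
    """
    for link in entry.get('links', []):
        href = link.get('href', '')
        if '/opendap/' in href:
            return href
    return None


# A trimmed-down entry with the same shape as the CMR response above
sample_entry = {
    'dataset_id': 'example',
    'links': [
        {'rel': 'http://esipfed.org/ns/fedsearch/1.1/documentation#',
         'href': 'https://example.org/readme.pdf'},
        {'rel': 'http://esipfed.org/ns/fedsearch/1.1/service#',
         'hreflang': 'en-US',
         'href': 'https://gpm1.gesdisc.eosdis.nasa.gov/opendap/GPM_L3/GPM_3IMERGHH.06/contents.html'},
    ],
}
print(find_opendap_link(sample_entry))
```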

### Obtain S3 credentials and bucket links

Remember that requesting the credential token requires a previously generated netrc file, and that the token lasts for only one hour before it needs to be regenerated.

In [None]:
gesdisc_s3 = "https://data.gesdisc.earthdata.nasa.gov/s3credentials"

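# Define a function for S3 access credentials

def begin_s3_direct_access(url: str=gesdisc_s3):
    # Request temporary credentials; requests reads the login from the netrc file
    response = requests.get(url)
    response.raise_for_status()  # Stop early if the server returns an HTTP error
    creds = response.json()
    return s3fs.S3FileSystem(key=creds['accessKeyId'],
                             secret=creds['secretAccessKey'],
                             token=creds['sessionToken'],
                             client_kwargs={'region_name':'us-west-2'})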

fs = begin_s3_direct_access()

# Check that the file system is an S3FileSystem object, which indicates that credentials were returned
# Common causes of rejected S3 credential requests include an incorrect password stored in the netrc file, or a missing netrc file
type(fs)

s3fs.core.S3FileSystem
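
Because a missing or incomplete netrc file is a common reason for the credential request to fail, it can be worth checking the file before asking for credentials. The sketch below uses Python's standard `netrc` module. `has_earthdata_login` is a hypothetical helper, and it assumes the file lives at `~/.netrc` and stores the login under the Earthdata Login host `urs.earthdata.nasa.gov`.

```python
import netrc
import os

EARTHDATA_HOST = 'urs.earthdata.nasa.gov'  # Earthdata Login host (assumed netrc entry)


def has_earthdata_login(netrc_path=None):
    """Return True if the netrc file holds an entry for Earthdata Login.

    Hypothetical helper; defaults to ~/.netrc when no path is given.
    """
    path = netrc_path or os.path.expanduser('~/.netrc')
    if not os.path.isfile(path):
        return False
    try:
        entry = netrc.netrc(path).authenticators(EARTHDATA_HOST)
    except netrc.NetrcParseError:
        return False
    return entry is not None


if not has_earthdata_login():
    print('No Earthdata Login entry found in the netrc file')
```

This only confirms that an entry exists; it cannot tell whether the stored password is correct.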

S3 URLs currently cannot be obtained through the CMR API; instead, they are found manually through the Earthdata Cloud search tool, or through OPeNDAP, which will be preserved. The parent link of these dataset directories can be switched to S3 (for example, change <code>https://goldsmr4.gesdisc.eosdis.nasa.gov/data/MERRA2/</code> to <code>s3://gesdisc-cumulus-prod-protected/MERRA2/</code>) to switch easily between cloud-hosted and on-prem data. Remember that datasets like GPM IMERG may have different file organization structures, so it is recommended to use the GES DISC subsetting tool, CMR, or Earthdata Search to generate links.
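
The parent-link swap described above can be sketched as a simple prefix replacement. `to_s3_url` is a hypothetical helper, and the two prefixes are taken from the MERRA-2 example in the text; other datasets may be organized differently, so treat the result as a guess that should be verified.

```python
ON_PREM_PREFIX = 'https://goldsmr4.gesdisc.eosdis.nasa.gov/data/'
S3_PREFIX = 's3://gesdisc-cumulus-prod-protected/'


def to_s3_url(on_prem_url, on_prem_prefix=ON_PREM_PREFIX, s3_prefix=S3_PREFIX):
    """Swap the on-prem parent link for the S3 bucket prefix.

    Hypothetical helper; assumes the directory layout below the prefix is the
    same on-prem and in the cloud, as in the MERRA-2 example.
    """
    if not on_prem_url.startswith(on_prem_prefix):
        raise ValueError('URL does not start with ' + on_prem_prefix)
    return s3_prefix + on_prem_url[len(on_prem_prefix):]


print(to_s3_url('https://goldsmr4.gesdisc.eosdis.nasa.gov/data/MERRA2/'
                'M2T1NXSLV.5.12.4/2013/05/MERRA2_400.tavg1_2d_slv_Nx.20130531.nc4'))
```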

Now that all of our links are obtained, we can open them up in Xarray for comparisons.

In [None]:
# S3 GPM IMERG and MERRA-2 bucket datasets, both from 31 May 2013 at separate times

merra_fn = 's3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2013/05/MERRA2_400.tavg1_2d_slv_Nx.20130531.nc4'
imerg_fn = 's3://gesdisc-cumulus-prod-protected/GPM_L3/GPM_3IMERGHH.06/2013/151/3B-HHR.MS.MRG.3IMERG.20130531-S000000-E002959.0000.V06B.HDF5'

ds_merra_s3 = xr.open_dataset(fs.open(merra_fn),
                              decode_cf=True,)

ds_imerg_s3 = xr.open_dataset(fs.open(imerg_fn),
                              decode_cf=True,
                              engine='h5netcdf')

### Check that the datasets are the same

First, check if the MERRA-2 dataset is the same:

In [None]:
ds_merra_on_prem = xr.open_dataset("https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/M2T1NXSLV.5.12.4/2013/05/MERRA2_400.tavg1_2d_slv_Nx.20130531.nc4",
                                  decode_cf=True,)

# Always use equals() for checking if Xarray datasets are identical
if ds_merra_s3.equals(ds_merra_on_prem):
    print('The on-prem and S3 datasets are equal and intact!')

The on-prem and S3 datasets are equal and intact!


Finally, see if the GPM IMERG datasets are the same: 

In [None]:
ds_imerg_on_prem = xr.open_dataset("https://gpm1.gesdisc.eosdis.nasa.gov/opendap/GPM_L3/GPM_3IMERGHH.06/2013/151/3B-HHR.MS.MRG.3IMERG.20130531-S000000-E002959.0000.V06B.HDF5",
                                  decode_cf=True,
                                  engine='h5netcdf')

if ds_imerg_s3.equals(ds_imerg_on_prem):
    print('The on-prem and S3 datasets are equal and intact!')

The on-prem and S3 datasets are equal and intact!
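
If a comparison ever fails, it helps to know what each Xarray method checks: `equals()` compares dimensions, coordinates, and values but ignores attributes, while `identical()` also compares attributes and names. The sketch below illustrates the difference on small in-memory datasets rather than on the granules above; `describe_match` and the example datasets are hypothetical and exist only for illustration.

```python
import numpy as np
import xarray as xr


def describe_match(ds1, ds2):
    """Summarize how closely two datasets agree (hypothetical helper)."""
    if ds1.identical(ds2):
        return 'identical'
    if ds1.equals(ds2):
        return 'equal values, different metadata'
    return 'different'


# A small example dataset and two variations of it
ds_a = xr.Dataset({'t': (('lat', 'lon'), np.zeros((2, 3)))},
                  coords={'lat': [0.0, 1.0], 'lon': [0.0, 1.0, 2.0]},
                  attrs={'source': 's3'})
ds_b = ds_a.copy(deep=True)
ds_b.attrs['source'] = 'on-prem'      # same data, different global attribute
ds_c = ds_a.assign(t=ds_a['t'] + 1.0)  # different data values

print(describe_match(ds_a, ds_b))
print(describe_match(ds_a, ds_c))
```

For a more detailed report of what differs, `xr.testing.assert_equal` raises an `AssertionError` describing the mismatch.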
