This notebook needs updating

CMR (Common Metadata Repository) is the API behind NASA Earthdata searches. More details can be found [here](https://earthdata.nasa.gov/about/science-system-description/eosdis-components/common-metadata-repository). It appears to aim to be the go-to API for all Earth data queries, and it provides [Unified Metadata Model](https://earthdata.nasa.gov/about/science-system-description/eosdis-components/common-metadata-repository/unified-metadata-model-umm) tools for converting between different metadata formats.

In [2]:
# We use the default config file, which may have limitations
from pyCMR.pyCMR import CMR
cmr = CMR("cmr.cfg")

In [3]:
# Search for something generic to get an idea of a sample response
results = cmr.searchCollection(keyword='land')
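
For orientation, the search above can also be expressed as a plain CMR search URL. The sketch below (Python 3, unlike the cells in this notebook) only builds the URL. The root `https://cmr.earthdata.nasa.gov/search` is the public CMR search endpoint; whether `cmr.cfg` points pyCMR at the same endpoint is an assumption, and `build_search_url` is a hypothetical helper, not part of pyCMR.

```python
from urllib.parse import urlencode

# Assumption: the public CMR search root. pyCMR reads its endpoint from cmr.cfg,
# which is not shown in this notebook.
CMR_SEARCH_ROOT = "https://cmr.earthdata.nasa.gov/search"


def build_search_url(concept, **params):
    """Return a CMR search URL for 'collections' or 'granules', asking for JSON."""
    if concept not in ("collections", "granules"):
        raise ValueError("unsupported concept type: %r" % concept)
    # Sort the parameters so the resulting URL is deterministic
    query = urlencode(sorted(params.items()))
    return "%s/%s.json?%s" % (CMR_SEARCH_ROOT, concept, query)


url = build_search_url("collections", keyword="land", page_size=10)
print(url)
```

Note that the `.json` extension requests CMR's JSON response, which is laid out differently from the echo10-derived dictionaries that pyCMR returns below.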

In [4]:
# Let's examine this thing
import json
import re
cln = results[0] # A python API-specific object with methods
cln_json = json.dumps(cln) # A string
cln_dict = json.loads(cln_json) # A dictionary of arrays and dictionaries


In [5]:
# Plug the result of this into jsonformatter.curiousconcept.com
print cln_json



That is a lot, and the format may vary across collections. This one happens to be in the ECHO10 format.

Questions:
- How do we obtain the data granules associated with a collection (these carry more metadata along with the actual data)?
- Do we want to capture the various field names or just the field entries?
- What do we want to do with the URLs in here?
- How do we extract keywords?

In [6]:
# Some useful properties of a data collection:
title = cln_dict['Collection']['DataSetId']
fmt = cln_dict['format']
desc = cln_dict['Collection']['Description']

print 'TITLE: %s\n\nFORMAT: %s\n\nDESC: %s\n' % (title, fmt, desc)

print 'URLS:'
online_ress = cln_dict['Collection']['OnlineResources']['OnlineResource']
for online_res in online_ress:
    print '*', online_res['Type']
    print ' ', online_res['URL']
    


TITLE: FLDAS Noah Land Surface Model L4 monthly 0.1 x 0.1 degree for Eastern Africa (MERRA-2 and CHIRPS) V001 (FLDAS_NOAH01_C_EA_M) at GES DISC

FORMAT: application/echo10+xml


DESC: This simulation was forced by combination of New version of the Modern Era Retrospective-analysis for Research and Applications (MERRA-2) and Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS).

The simulation was initialized on 1 January 1982 using soil moisture and other state fields from a FLDAS/Noah model climatology for that day of the year.

URLS:
* GET SERVICE : OPENDAP DATA
  https://hydro1.gesdisc.eosdis.nasa.gov/opendap/FLDAS/FLDAS_NOAH01_C_EA_M.001/
* VIEW RELATED INFORMATION : USER'S GUIDE
  https://hydro1.gesdisc.eosdis.nasa.gov/data/FLDAS/FLDAS_NOAH01_C_EA_M.001/doc/README_FLDAS.pdf
* PublicationURL : VIEW RELATED INFORMATION
  https://ldas.gsfc.nasa.gov/FLDAS/
* VIEW RELATED INFORMATION : HOW-TO
  https://disc.gsfc.nasa.gov/information/howto
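
The loop above assumes that `OnlineResource` is a list. The granule response later in this notebook stores its single `OnlineAccessURL` as a plain dictionary, which suggests that a container with only one entry may not be wrapped in a list. A minimal sketch (Python 3) of a defensive reader follows; `as_list` and `online_resource_urls` are hypothetical helpers, and the sample dictionary is hand-written for illustration.

```python
def as_list(value):
    """Wrap a lone item in a list, pass lists through, and map None to []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def online_resource_urls(collection):
    """Return (type, url) pairs from a dict shaped like cln_dict['Collection']."""
    container = collection.get('OnlineResources') or {}
    return [(res.get('Type'), res.get('URL'))
            for res in as_list(container.get('OnlineResource'))]


# Hypothetical sample: a single resource stored as a dict rather than a list
sample = {'OnlineResources': {'OnlineResource': {
    'Type': 'VIEW RELATED INFORMATION : HOW-TO',
    'URL': 'https://disc.gsfc.nasa.gov/information/howto'}}}

pairs = online_resource_urls(sample)
print(pairs)
```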


Spatial and temporal extents are available as well.

In [7]:
# NASA organizes science keywords into a hierarchy of levels.
# We will ignore this structure for now.
kw_str = json.dumps(cln_dict['Collection']['ScienceKeywords'])
kws = re.split('[^a-zA-Z]', kw_str)
kws = filter(None, kws)
kws = map(lambda kw: kw.lower(), kws)
kws = filter(lambda kw: 'keyword' not in kw, kws)
not_kws = ['properties', 'variablelevel', 'value', 'content', 'measurements']
kws = set(filter(lambda kw: kw not in not_kws, kws))

# Each dataset has spatial keywords as well
spatial_kws = cln_dict['Collection']['SpatialKeywords'].values()
spatial_kws = map(lambda kw: str(kw.lower()), spatial_kws)

In [8]:
kws.union(spatial_kws)

{'atmosphere',
 'atmospheric',
 'earth',
 'eastern africa',
 'evapotranspiration',
 'flux',
 'heat',
 'humidity',
 'hydrosphere',
 'indicators',
 'land',
 'liquid',
 'longwave',
 'moisture',
 'precipitation',
 'pressure',
 'processes',
 'radiation',
 'rain',
 'runoff',
 'science',
 'shortwave',
 'soil',
 'soils',
 'surface',
 'temperature',
 'terrestrial',
 'thermal',
 'vapor',
 'water',
 'winds'}
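
Splitting the JSON dump also picks up field names, which is why the cell above needs a stoplist. An alternative sketch (Python 3) walks the nested structure and tokenizes only the string values, so keys never enter the result. The sample dictionary is hypothetical: its field names are modelled on the ones filtered out above and are not copied from a real response.

```python
import re


def leaf_strings(node):
    """Yield every string value found in nested dicts and lists, ignoring keys."""
    if isinstance(node, dict):
        for value in node.values():
            yield from leaf_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from leaf_strings(item)
    elif isinstance(node, str):
        yield node


def keyword_tokens(node):
    """Return the set of lowercase alphabetic tokens found in the string values."""
    tokens = set()
    for text in leaf_strings(node):
        tokens.update(t.lower() for t in re.split('[^a-zA-Z]+', text) if t)
    return tokens


# Hypothetical sample shaped like a ScienceKeywords entry
sample = {'ScienceKeyword': [{
    'CategoryKeyword': 'EARTH SCIENCE',
    'TopicKeyword': 'LAND SURFACE',
    'TermKeyword': 'SOILS',
    'VariableLevel1Keyword': {'Value': 'SOIL MOISTURE'}}]}

tokens = keyword_tokens(sample)
print(sorted(tokens))
```

Because only values are visited, no token is dropped for merely resembling a field name; the hierarchy itself is still flattened, as in the cell above.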

We will come back to explore keywords in more depth later. Now: How do we get the actual data/granules?

In [9]:
# The title or "DataSetId" appears to link the metadata with granules
print 'Searching for granules with title:', title
results = cmr.searchGranule(entry_title=title)
print 'Found %d results' % len(results)

Searching for granules with title: FLDAS Noah Land Surface Model L4 monthly 0.1 x 0.1 degree for Eastern Africa (MERRA-2 and CHIRPS) V001 (FLDAS_NOAH01_C_EA_M) at GES DISC
Found 100 results
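
A count of exactly 100 may reflect a page or limit cap rather than the true number of granules. CMR's search API pages its results with the `page_size` and `page_num` parameters; how pyCMR exposes them is not shown in this notebook. The sketch below (Python 3) therefore shows only the paging loop, with `fetch_page` as a hypothetical callable standing in for whatever performs the request.

```python
def iter_all_results(fetch_page, page_size=100):
    """Yield results page by page until a short (or empty) page comes back."""
    page_num = 1
    while True:
        page = fetch_page(page_num, page_size)
        for item in page:
            yield item
        if len(page) < page_size:
            break  # a short page means there is nothing left to fetch
        page_num += 1


# Stand-in for a real request: serves slices of 250 fake granule ids
fake_granules = ['G%d' % i for i in range(250)]


def fake_fetch(page_num, page_size):
    start = (page_num - 1) * page_size
    return fake_granules[start:start + page_size]


all_granules = list(iter_all_results(fake_fetch, page_size=100))
print(len(all_granules))
```

When the total is an exact multiple of the page size, this loop makes one extra request that returns an empty page before stopping.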


In [10]:
# Let's see what a granule holds
granule = results[0]
granule_json = json.dumps(granule)

# Again, plug the result of this into jsonformatter.curiousconcept.com
print granule_json

{"concept-id": "G1235361682-GES_DISC", "collection-concept-id": "C1234511042-GES_DISC", "revision-id": "2", "Granule": {"LastUpdate": "2017-04-27T17:49:28Z", "OnlineAccessURLs": {"OnlineAccessURL": {"URL": "http://hydro1.gesdisc.eosdis.nasa.gov/data/FLDAS/FLDAS_NOAH01_C_EA_M.001/1982/FLDAS_NOAH01_C_EA_M.A198201.001.nc"}}, "Temporal": {"RangeDateTime": {"EndingDateTime": "1982-01-31T23:59:59Z", "BeginningDateTime": "1982-01-01T00:00:00Z"}}, "Collection": {"ShortName": "FLDAS_NOAH01_C_EA_M", "VersionId": "001"}, "MeasuredParameters": {"MeasuredParameter": [{"QAStats": null, "ParameterName": "Evap_tavg:total evapotranspiration [kg m-2 s-1]", "QAFlags": null}, {"QAStats": null, "ParameterName": "LWdown_f_tavg:surface downward longwave radiation [W m-2]", "QAFlags": null}, {"QAStats": null, "ParameterName": "Lwnet_tavg:net downward longwave radiation [W m-2]", "QAFlags": null}, {"QAStats": null, "ParameterName": "Psurf_f_tavg:surface pressure [Pa]", "QAFlags": null}, {"QAStats": null, "Para

Again, we can get the spatial and temporal constraints of the granule. But let's see if we can actually get the data.
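
As a small illustration of the temporal constraint, the sketch below (Python 3) parses the `RangeDateTime` entry seen in the granule output above. The sample dictionary copies only those two timestamps. The time format is an assumption based on this one granule; other granules might carry fractional seconds, which this format string would reject.

```python
from datetime import datetime

# Assumption: timestamps look like the ones in the granule printed above
TIME_FORMAT = '%Y-%m-%dT%H:%M:%SZ'


def granule_time_range(granule):
    """Return (begin, end) datetimes from a granule dict's RangeDateTime entry."""
    rng = granule['Granule']['Temporal']['RangeDateTime']
    begin = datetime.strptime(rng['BeginningDateTime'], TIME_FORMAT)
    end = datetime.strptime(rng['EndingDateTime'], TIME_FORMAT)
    return begin, end


# Only the temporal part of the granule shown above
sample = {'Granule': {'Temporal': {'RangeDateTime': {
    'BeginningDateTime': '1982-01-01T00:00:00Z',
    'EndingDateTime': '1982-01-31T23:59:59Z'}}}}

begin, end = granule_time_range(sample)
print(begin, end)
```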

In [11]:
import os

# First of all get the data's format
data_fmt = granule['Granule']['DataFormat']
data_url = granule['Granule']['OnlineAccessURLs']['OnlineAccessURL']['URL']

print 'RESPONSE FORMAT:', granule['format']
print 'DATA FORMAT:', data_fmt
print 'DATA URL:', data_url

# To actually retrieve the data we must be authenticated
username = os.environ.get('ED_USERNAME')
password = os.environ.get('ED_PASSWORD')
url = data_url

RESPONSE FORMAT: application/echo10+xml
DATA FORMAT: NETCDF
DATA URL: http://hydro1.gesdisc.eosdis.nasa.gov/data/FLDAS/FLDAS_NOAH01_C_EA_M.001/1982/FLDAS_NOAH01_C_EA_M.A198201.001.nc



If I plug the above URL into my browser, the data is retrieved. However, the sample scripts provided for downloading data in Python do not work, probably because of the redirects or the authentication involved.

In [12]:
# TODO: Copy HTTP headers exactly with request in order to get the data
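
One pattern commonly documented for Earthdata Login is a `urllib` opener that combines basic authentication against `https://urs.earthdata.nasa.gov` with a cookie jar, so that the session survives the redirects. The sketch below (Python 3) has not been tested against the live server, and the notebook reports that the sample scripts failed, so treat it as a starting point only. `build_earthdata_opener` and `download` are hypothetical helpers. The credential check is included because the last cell of this notebook prints `None` for `ED_PASSWORD`.

```python
import os
import urllib.request
from http.cookiejar import CookieJar

# Earthdata Login host that data servers redirect to for authentication
URS_URL = 'https://urs.earthdata.nasa.gov'


def build_earthdata_opener(username, password):
    """Return an opener that answers URS basic-auth challenges and keeps cookies."""
    if not username or not password:
        raise ValueError('Earthdata Login credentials are not set')
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, URS_URL, username, password)
    return urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(manager),
        urllib.request.HTTPCookieProcessor(CookieJar()))


def download(opener, url, path):
    """Fetch url through the opener and write the response body to path."""
    with opener.open(url) as response, open(path, 'wb') as out:
        out.write(response.read())


try:
    opener = build_earthdata_opener(os.environ.get('ED_USERNAME'),
                                    os.environ.get('ED_PASSWORD'))
except ValueError as err:
    opener = None
    print('Cannot download:', err)
```

With credentials set, a call such as `download(opener, data_url, 'granule.nc')` would attempt the transfer; the output filename is only an example.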

In [15]:
import os
# Check whether the credential is set in this environment (None means it is not)
print os.environ.get('ED_PASSWORD')


None
