### Introduction to Programmatic Common Metadata Repository Search

This notebook introduces programmatic Common Metadata Repository (CMR) search in Python, using PO.DAAC data as the example. While these tutorials focus on PO.DAAC data, the same strategies and code snippets can be used for other Earthdata collections.


## API Documentation

This tutorial is not meant to replace the official CMR documentation. The API's features are well documented there, and it should be the first place to go for information. It can be found at https://cmr.earthdata.nasa.gov/search. Some users may find it easier to navigate the Earthdata Search interface, find data of interest, and then automate the same search using scripts. For that, we suggest visiting https://search.earthdata.nasa.gov/


## CMR Background information

CMR houses metadata for the 12 different DAACs. This metadata comes in the following forms:

* Collections
* Granules
* Variables
* Services
* Visualizations
* Tools

This tutorial will focus on Collections and Granules. For more information, see the glossary at https://earthdata.nasa.gov/learn/user-resources/glossary

## Collection / Dataset Series

A collection of datasets sharing the same product specification. Collections are synonymous with EO collections. They are called dataset series because they may be mapped to ‘dataset series’ according to the terminology defined in ISO 19113, ISO 19114 and ISO 19115.

## Granule

The smallest aggregation of data which is independently managed (i.e. described, inventoried, retrievable). Granules may be managed as logical granules and/or physical granules. See also Scene.

Note that a granule is often equivalent to a Data Set.

## Data Set

A logically meaningful grouping or collection of similar or related data. Data having all of the same characteristics (source or class of source, processing level, resolution, etc.) but different independent variable ranges and/or responding to a specific need are normally considered part of a single data set. A data set is typically composed of products from several missions, gathered together to respond to the overall coverage or revisit requirements from a specific group of users.

In the context of EO data preservation, a data set consists of the data records of one mission, sensor, and product type and the associated knowledge (information, tools). See collection.

## What does all of this mean?

For the most part, users want to discover collections of interest to them, usually defined by parameter (Sea Surface Temperature, Ocean Winds, Sea Surface Height, etc.), processing level, spatial and temporal coverage, and so on. Let's show an example.

## Find collections by parameter

In [7]:
from urllib import request
import json
import pprint

cmr_url = "https://cmr.earthdata.nasa.gov/search/"

with request.urlopen(cmr_url+"collections.umm_json?science_keywords[0][topic]=OCEANS") as response:
    data = response.read()
    encoding = response.info().get_content_charset('utf-8')
    JSON_object = json.loads(data.decode(encoding))
    pp = pprint.PrettyPrinter(indent=2)
    pp.pprint(JSON_object)

{ 'hits': 10904,
  'items': [ { 'meta': { 'concept-id': 'C1214305813-AU_AADC',
                         'concept-type': 'collection',
                         'deleted': False,
                         'format': 'application/dif10+xml',
                         'granule-count': 0,
                         'has-formats': False,
                         'has-spatial-subsetting': False,
                         'has-temporal-subsetting': False,
                         'has-transforms': False,
                         'has-variables': False,
                         'native-id': 'ASAC_2201_HCL_0.5',
                         'provider-id': 'AU_AADC',
                         'revision-date': '2019-12-12T16:00:14Z',
                         'revision-id': 9,
                         'user-id': 'sritz'},
               'umm': { 'Abstract': 'These results are for the 0.5 hour '
                                    'extraction of HCl.\n'
                                    '\n'
                

There's a lot going on here. First off, the url:
```
https://cmr.earthdata.nasa.gov/search/collections.umm_json?science_keywords[0][topic]=OCEANS
```

The basic premise is this: we are asking for all collections (../search/collections) that fall under the 'OCEANS' science topic as defined by GCMD. We are requesting this in the umm_json format (.umm_json). What we get back is a listing of the collections that match. When last run, this was over 10,900 collections! That's a lot. Let's get that down a bit...
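
The same URL can also be assembled from a dictionary of parameters instead of being typed by hand. The following is a minimal sketch using only the standard library; the helper name `build_search_url` is our own and is not part of CMR.

```python
from urllib.parse import quote, urlencode

CMR_SEARCH_URL = "https://cmr.earthdata.nasa.gov/search/"


def build_search_url(concept, response_format, params):
    """Return a URL of the form <base><concept>.<format>?<query string>.

    Hypothetical helper for illustration only.
    """
    # quote_via=quote encodes spaces as %20; safe="[]" leaves the square
    # brackets used by science_keywords[0][topic] unencoded.
    query = urlencode(params, safe="[]", quote_via=quote)
    return f"{CMR_SEARCH_URL}{concept}.{response_format}?{query}"


url = build_search_url(
    "collections",
    "umm_json",
    {"science_keywords[0][topic]": "OCEANS"},
)
print(url)
```

Keeping the parameters in a dictionary makes it easier to add or remove filters one at a time, which is what the next steps do.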


In [49]:
with request.urlopen(cmr_url+"collections.umm_json?science_keywords[0][topic]=OCEANS&science_keywords[0][term]=Ocean%20Temperature&has_granules_or_cwic=true&page_size=50") as response:
    data = response.read()
    encoding = response.info().get_content_charset('utf-8')
    JSON_object = json.loads(data.decode(encoding))
    pp = pprint.PrettyPrinter(indent=2)
    pp.pprint(JSON_object)

{ 'hits': 483,
  'items': [ { 'meta': { 'concept-id': 'C1597928934-NOAA_NCEI',
                         'concept-type': 'collection',
                         'deleted': False,
                         'format': 'application/iso19115+xml',
                         'granule-count': 0,
                         'has-formats': False,
                         'has-spatial-subsetting': False,
                         'has-temporal-subsetting': False,
                         'has-transforms': False,
                         'has-variables': False,
                         'native-id': 'GHRSST-VIIRS_N20-OSPO-L2P',
                         'provider-id': 'NOAA_NCEI',
                         'revision-date': '2019-08-12T19:50:50Z',
                         'revision-id': 2,
                         'user-id': 'mmorahan'},
               'umm': { 'Abstract': 'NOAA-20 (N20/JPSS-1/J1) is the second '
                                    'satellite in the US NOAA latest '
                          

Here we limit the search to collections with the Oceans topic and the 'Ocean Temperature' term, and we use page_size=50 to return up to 50 results per page. To limit this further, we only search for collections that contain granules (data). We do this by specifying

```
has_granules_or_cwic=true
```
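
Because each refinement only adds another `key=value` pair, the request-and-decode steps repeated in every cell can be wrapped in one small function. This is a sketch, not an official client: `fetch_json` and its `opener` argument are names we made up, and `opener` exists only so the function can be tried without network access.

```python
import json
from urllib import request

cmr_url = "https://cmr.earthdata.nasa.gov/search/"


def fetch_json(url, opener=request.urlopen):
    """Open `url` and return the decoded JSON body as Python objects.

    `opener` is a hypothetical hook; by default it is urllib's urlopen.
    """
    with opener(url) as response:
        encoding = response.info().get_content_charset("utf-8")
        return json.loads(response.read().decode(encoding))


# Each filter is one key=value pair; joining them with "&" gives the query string.
filters = [
    "science_keywords[0][topic]=OCEANS",
    "science_keywords[0][term]=Ocean%20Temperature",
    "has_granules_or_cwic=true",
    "page_size=50",
]
url = cmr_url + "collections.umm_json?" + "&".join(filters)
print(url)

# Uncomment to run the search (requires network access):
# JSON_object = fetch_json(url)
# print(JSON_object["hits"])
```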

So we are closing in on this. Let's add a time range and find some PO.DAAC data:

In [50]:
with request.urlopen(cmr_url+"collections.umm_json?science_keywords[0][topic]=OCEANS&science_keywords[0][term]=Ocean%20Temperature&has_granules_or_cwic=true&temporal=2018-01-01T10:00:00Z,2019-01-01T10:00:00Z&provider_short_name=PODAAC&processing_level_id=4&page_size=50") as response:
    data = response.read()
    encoding = response.info().get_content_charset('utf-8')
    JSON_object = json.loads(data.decode(encoding))
    pp = pprint.PrettyPrinter(indent=2)
    pp.pprint(JSON_object)

{ 'hits': 29,
  'items': [ { 'meta': { 'concept-id': 'C1658476070-PODAAC',
                         'concept-type': 'collection',
                         'deleted': False,
                         'format': 'application/echo10+xml',
                         'granule-count': 0,
                         'has-formats': False,
                         'has-spatial-subsetting': False,
                         'has-temporal-subsetting': False,
                         'has-transforms': False,
                         'has-variables': False,
                         'native-id': 'GHRSST+Level+4+RAMSSA+Australian+Regional+Foundation+Sea+Surface+Temperature+Analysis',
                         'provider-id': 'PODAAC',
                         'revision-date': '2019-11-20T20:39:28Z',
                         'revision-id': 2,
                         'user-id': 'cia001'},
               'umm': { 'Abstract': 'A Group for High Resolution Sea Surface '
                                    'Temperatu

```
temporal=2018-01-01T10:00:00Z,2019-01-01T10:00:00Z&provider_short_name=PODAAC&processing_level_id=4
```

We specified a temporal range covering 2018 (2018-01-01T10:00:00Z to 2019-01-01T10:00:00Z), PODAAC as the provider, and Level 4 data, since it's a bit easier for us to work with.
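
The `temporal` value is a start and an end timestamp separated by a comma. If the dates already live in Python `datetime` objects, the string can be generated rather than typed. This is a minimal sketch; `cmr_temporal` is a hypothetical helper name, and it assumes both inputs are timezone-aware.

```python
from datetime import datetime, timezone


def cmr_temporal(start, end):
    """Format two timezone-aware datetimes as 'START,END' in UTC.

    Hypothetical helper for illustration only.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start_utc = start.astimezone(timezone.utc)
    end_utc = end.astimezone(timezone.utc)
    if start_utc > end_utc:
        raise ValueError("start must not be after end")
    return f"{start_utc.strftime(fmt)},{end_utc.strftime(fmt)}"


calendar_2018 = cmr_temporal(
    datetime(2018, 1, 1, tzinfo=timezone.utc),
    datetime(2018, 12, 31, 23, 59, 59, tzinfo=timezone.utc),
)
print("temporal=" + calendar_2018)
```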

OK, that got us down to 29 collections. Let's use Python to get some information we're interested in.

In [53]:
for i in JSON_object["items"]:
    # Concept ID and title (the native-id uses '+' where the title has spaces)
    print(i['meta']['concept-id'] + " " + i['meta']['native-id'].replace('+', ' '))

    # Start of the first temporal range
    print("\tBeginning Data Time: " + str(i['umm']['TemporalExtents'][0]['RangeDateTimes'][0]['BeginningDateTime']))

    # Bounding box info: a collection may list more than one rectangle
    br_array = i['umm']['SpatialExtent']['HorizontalSpatialDomain']['Geometry']['BoundingRectangles']
    for br in br_array:
        print("\tBounding Rectangle: West: {}, North: {}, East: {}, South: {}".format(
            br['WestBoundingCoordinate'], br['NorthBoundingCoordinate'],
            br['EastBoundingCoordinate'], br['SouthBoundingCoordinate']))

C1658476070-PODAAC GHRSST Level 4 RAMSSA Australian Regional Foundation Sea Surface Temperature Analysis
	Beginning Data Time: 2008-04-01T00:00:00.000Z
	Bounding Rectangle: West: 60.0, North: 20.0, East: 180.0, South: -70.0
	Bounding Rectangle: West: -180.0, North: 20.0, East: -170.0, South: -70.0
C1657548743-PODAAC GHRSST Level 4 GAMSSA Global Foundation Sea Surface Temperature Analysis
	Beginning Data Time: 2008-08-24T00:00:00.000Z
	Bounding Rectangle: West: -180.0, North: 90.0, East: 180.0, South: -90.0
C1652971997-PODAAC GHRSST Level 4 AVHRR_OI Global Blended Sea Surface Temperature Analysis (GDS version 2) from NCEI
	Beginning Data Time: 1981-09-01T00:00:00.000Z
	Bounding Rectangle: West: -180.0, North: 90.0, East: 180.0, South: -90.0
C1652972273-PODAAC GHRSST Level 4 CMC0.1deg Global Foundation Sea Surface Temperature Analysis (GDS version 2)
	Beginning Data Time: 2016-01-01T00:00:00.000Z
	Bounding Rectangle: West: -180.0, North: 90.0, East: 180.0, South: -90.0
C1653649473-PODAAC

We now have the start times, CMR concept ID (the unique collection identifier), title, and bounding rectangles for spatial coverage. This is a lot of information we can use to decide on a dataset, and we can keep adding more information.
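
Indexing straight into the nested dictionaries raises a `KeyError` as soon as one collection lacks a field, for example when it describes its time span or its geometry in some other way. The sketch below collects the same fields defensively with `dict.get`; `summarize_collection` is a hypothetical helper name, and the sample item only reproduces values printed above.

```python
def summarize_collection(item):
    """Collect ID, title, start times and bounding rectangles, tolerating gaps.

    Hypothetical helper for illustration only.
    """
    meta = item.get("meta", {})
    umm = item.get("umm", {})
    # Gather every BeginningDateTime across all temporal extents and ranges.
    begin_times = [
        rng.get("BeginningDateTime")
        for extent in umm.get("TemporalExtents", [])
        for rng in extent.get("RangeDateTimes", [])
    ]
    rectangles = (
        umm.get("SpatialExtent", {})
        .get("HorizontalSpatialDomain", {})
        .get("Geometry", {})
        .get("BoundingRectangles", [])
    )
    return {
        "concept_id": meta.get("concept-id"),
        "title": meta.get("native-id", "").replace("+", " "),
        "begin_times": begin_times,
        "bounding_rectangles": rectangles,
    }


# Sample item rebuilt from the first collection in the output above.
sample = {
    "meta": {
        "concept-id": "C1658476070-PODAAC",
        "native-id": "GHRSST+Level+4+RAMSSA+Australian+Regional+Foundation+Sea+Surface+Temperature+Analysis",
    },
    "umm": {
        "TemporalExtents": [
            {"RangeDateTimes": [{"BeginningDateTime": "2008-04-01T00:00:00.000Z"}]}
        ],
        "SpatialExtent": {"HorizontalSpatialDomain": {"Geometry": {"BoundingRectangles": [
            {"WestBoundingCoordinate": 60.0, "NorthBoundingCoordinate": 20.0,
             "EastBoundingCoordinate": 180.0, "SouthBoundingCoordinate": -70.0},
        ]}}},
    },
}
summary = summarize_collection(sample)
print(summary["concept_id"], summary["title"])
print(summary["begin_times"], len(summary["bounding_rectangles"]))
```

An item with no temporal or spatial information yields empty lists instead of stopping the loop.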

For now, let's choose "C1664741463-PODAAC": GHRSST Level 4 MUR Global Foundation Sea Surface Temperature Analysis (v4.1)

## Granule Search

Using this collection, and more specifically, its concept-ID, we can now search for data we're interested in.


In [58]:
with request.urlopen(cmr_url+"granules.umm_json?concept-id=C1664741463-PODAAC") as response:
    data = response.read()
    encoding = response.info().get_content_charset('utf-8')
    JSON_object = json.loads(data.decode(encoding))
    pp = pprint.PrettyPrinter(indent=2)
    pp.pprint(JSON_object)

{ 'hits': 6497,
  'items': [ { 'meta': { 'concept-id': 'G1664772388-PODAAC',
                         'concept-type': 'granule',
                         'format': 'application/echo10+xml',
                         'native-id': '20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
                         'provider-id': 'PODAAC',
                         'revision-date': '2019-12-06T16:29:51Z',
                         'revision-id': 1},
               'umm': { 'CollectionReference': { 'ShortName': 'MUR-JPL-L4-GLOB-v4.1',
                                                 'Version': '4.1'},
                        'DataGranule': { 'ArchiveAndDistributionInformation': [ { 'Name': 'Not '
                                                                                          'provided',
                                                                                  'Size': 332.35974979400635,
                                                                                  'Si
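
The `hits` value counts every matching granule, but a single response carries only one page of `items`. To walk through all of them, the request has to be repeated page by page. The sketch below only generates the page URLs. `page_size` already appears in the collection queries earlier in this notebook; `page_num` and the limit of 2000 results per page reflect our reading of the CMR documentation and should be checked there. `granule_page_urls` is a hypothetical helper name.

```python
import math

cmr_url = "https://cmr.earthdata.nasa.gov/search/"


def granule_page_urls(collection_concept_id, hits, page_size=2000):
    """Return one granule-search URL per page needed to cover `hits` results.

    Hypothetical helper; page_num and the 2000 ceiling are assumptions.
    """
    pages = math.ceil(hits / page_size)
    base = (
        f"{cmr_url}granules.umm_json"
        f"?concept-id={collection_concept_id}&page_size={page_size}"
    )
    # Pages are numbered starting at 1.
    return [f"{base}&page_num={n}" for n in range(1, pages + 1)]


urls = granule_page_urls("C1664741463-PODAAC", hits=6497)
print(len(urls))
print(urls[0])
```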

Alright, 6497 hits (at the time of this writing). Let's once again use some parsing magic to get some information on these data granules.

In [78]:
for i in JSON_object["items"]:
    # Granule concept ID and file name
    print(i['meta']['concept-id'] + " " + i['meta']['native-id'].replace('+', ' '))

    # Size of the archived file and the start of its time range
    dist_info = i['umm']['DataGranule']['ArchiveAndDistributionInformation'][0]
    print("\tGranule Size: " + "{:.3f}".format(dist_info['Size']) + " " + str(dist_info['SizeUnit']))
    print("\tBeginning Data Time: " + str(i['umm']['TemporalExtent']['RangeDateTime']['BeginningDateTime']))

    # Bounding box info:
    br_array = i['umm']['SpatialExtent']['HorizontalSpatialDomain']['Geometry']['BoundingRectangles']
    for br in br_array:
        print("\tBounding Rectangle: West: {}, North: {}, East: {}, South: {}".format(
            br['WestBoundingCoordinate'], br['NorthBoundingCoordinate'],
            br['EastBoundingCoordinate'], br['SouthBoundingCoordinate']))

    # Access URLs (file download and OPeNDAP) with their descriptions
    related_urls = i['umm']['RelatedUrls']
    for url in related_urls:
        print("\t{} ({})".format(url["URL"], url['Description']))

G1664772388-PODAAC 20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
	Granule Size: 332.360 MB
	Beginning Data Time: 2002-06-01T09:00:00.000Z
	Bounding Rectangle: West: -179.641, North: 53.855, East: 58.885, South: -87.3
	https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc (The HTTP location for the granule.)
	https://podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.html (The OPENDAP location for the granule.)
G1664777267-PODAAC 20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc
	Granule Size: 331.335 MB
	Beginning Data Time: 2002-06-02T09:00:00.000Z
	Bounding Rectangle: West: -179.641, North: 53.855, East: 58.885, South: -87.3
	https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/153/20020602090000-JPL-L4_GHRSST

Using the above information, we can find each granule's size and its URLs, both for downloading the whole file and for OPeNDAP access.
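
To automate a download, it helps to separate the two kinds of links. The sketch below sorts a granule's `RelatedUrls` by checking whether the URL contains "opendap". That is a heuristic based on the output shown above, not a rule defined by CMR, and `split_related_urls` is a hypothetical helper name.

```python
def split_related_urls(granule_item):
    """Split a granule's RelatedUrls into (file downloads, OPeNDAP links).

    Hypothetical helper; the "opendap" substring test is an assumption
    drawn from the URLs printed above.
    """
    downloads, opendap = [], []
    for entry in granule_item.get("umm", {}).get("RelatedUrls", []):
        url = entry.get("URL", "")
        if "opendap" in url.lower():
            opendap.append(url)
        elif url:
            downloads.append(url)
    return downloads, opendap


# Sample rebuilt from the first granule in the output above.
sample_granule = {
    "umm": {
        "RelatedUrls": [
            {"URL": "https://podaac-tools.jpl.nasa.gov/drive/files/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc",
             "Description": "The HTTP location for the granule."},
            {"URL": "https://podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/GDS2/L4/GLOB/JPL/MUR/v4.1/2002/152/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.html",
             "Description": "The OPENDAP location for the granule."},
        ]
    }
}
files, opendap_links = split_related_urls(sample_granule)
print(files)
print(opendap_links)
```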