# Data discovery with NASA's CMR

## Summary

In this notebook, we will walk through how to search for Earthdata data collections and granules. Along the way we will explore the available search parameters, the information returned, and specific constraints when using the CMR API. Our objective is to identify assets that we would download, or access directly in S3, within an analysis workflow.

We will be querying CMR for Harmonized Landsat Sentinel-2 (HLS) collections and granules to identify those assets.

## Requirements

### 1. Earthdata Login

An Earthdata Login account is required to access data, and to discover restricted data, from the NASA Earthdata system. Please visit <https://urs.earthdata.nasa.gov> to register and manage your Earthdata Login account. This account is free to create and only takes a moment to set up.

## Learning Objectives

- understand what CMR and the CMR API are and what they can be used for
- use the `requests` package to search data collections and granules
- parse the results of these searches

## What is CMR
CMR is the Common Metadata Repository.  It catalogs all data for NASA's Earth Observing System Data and Information System (EOSDIS).  It is the backend of [Earthdata Search](https://search.earthdata.nasa.gov/search), the GUI search interface you are probably familiar with.  More information about CMR can be found [here](https://earthdata.nasa.gov/eosdis/science-system-description/eosdis-components/cmr).

Unfortunately, the GUI for Earthdata Search is not accessible from a cloud instance - at least not without some work.  Earthdata Search is also not immediately reproducible.  What I mean by that is if you create a search using the GUI you would have to note the search criteria (date range, search area, collection name, etc), take a screenshot, copy the search url, or save the list of data granules returned by the search, in order to recreate the search.  This information would have to be re-entered each time you or someone else wanted to do the search.  You could make typos or other mistakes.  A cleaner, reproducible solution is to search CMR programmatically using the CMR API.

## What is the CMR API
API stands for Application Programming Interface.  It allows applications (software, services, etc) to send information to each other.  A helpful analogy is a waiter in a restaurant.  The waiter takes your drink or food order that you select from the menu, often translated into short-hand, to the bar or kitchen, and then returns (hopefully) with what you ordered when it is ready.

The CMR API accepts search terms such as collection name, keywords, datetime range, and location, queries the CMR database and returns the results.

---

## Getting Started: How to search CMR from Python
The first step is to import Python packages.  We will use:  
- `requests`, which does most of the work of accessing the CMR API using HTTP methods.
- `pprint`, to _pretty print_ the results of the search.  

A more in-depth tutorial on `requests` is [here](https://realpython.com/python-requests/)

In [1]:
import requests
from pprint import pprint

To conduct a search using the CMR API, `requests` needs the url for the root CMR search endpoint. We'll assign this url to a python variable as a _string_.

In [2]:
CMR_OPS = 'https://cmr.earthdata.nasa.gov/search'

CMR allows search by __collections__, which are datasets, and __granules__, which are files that contain data. Many of the same search parameters can be used for collections and granules but the type of results returned differ. Search parameters can be found in the [API Documentation](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html).  

Whether we search __collections__ or __granules__ is distinguished by adding `"collections"` or `"granules"` to the end of the CMR endpoint URL.  

We are going to search collections first, so we add `"collections"` to the URL. We are using a Python f-string (formatted string literal) in the examples below.

In [3]:
url = f'{CMR_OPS}/{"collections"}'

In this first example, I want to retrieve a list of collections available from the Earthdata Cloud using the `cloud_hosted` parameter in the request.  

We want to get the content in `json` (pronounced "jason") format, so I pass a dictionary to the `headers` keyword argument to say that I want results returned as `json`.

The `.get()` method is used to send this information to the CMR API. `get()` calls the HTTP method __GET__. 

In [4]:
response = requests.get(url,
                        params={
                            'cloud_hosted': 'True',
                            'has_granules': 'True',
                        },
                        headers={
                            'Accept': 'application/json',
                        }
                       )
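Under the hood, `requests.get` encodes the `params` dictionary into a query string and appends it to the URL. The sketch below reproduces that step with only the standard library, so it runs without network access. The names `query` and `full_url` are illustrative and not part of the notebook.

```python
from urllib.parse import urlencode

# Same endpoint and parameters as the request above.
CMR_OPS = 'https://cmr.earthdata.nasa.gov/search'
url = f'{CMR_OPS}/collections'
params = {'cloud_hosted': 'True', 'has_granules': 'True'}

# urlencode turns the dictionary into "key=value" pairs joined by "&".
query = urlencode(params)
full_url = f'{url}?{query}'
print(full_url)
```

With `requests`, the URL that was actually sent is available afterwards as `response.url`, which is handy when a search does not return what you expect.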

The request returns a `Response` object.    

To check that our request was successful we can print the `response` variable we saved the request to.

In [5]:
response

<Response [200]>

A __200__ response is what we want. This means that the request was successful. For more information on HTTP status codes see <https://en.wikipedia.org/wiki/List_of_HTTP_status_codes>.

A more explicit way to check the status code is to use the `status_code` attribute. Both approaches show the HTTP status code.

In [6]:
response.status_code

200
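In a script it is usually safer to check the status code in code rather than by eye. The helper below is a minimal sketch that uses the standard library's `http.HTTPStatus`; `describe_status` is a hypothetical name introduced here for illustration. With `requests`, calling `response.raise_for_status()` is a common alternative that raises an exception for 4xx and 5xx codes.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a readable label for an HTTP status code and whether it is a 2xx success."""
    status = HTTPStatus(code)  # raises ValueError for codes the standard library does not know
    ok = 200 <= status.value < 300
    return f'{status.value} {status.phrase}', ok

# In the notebook you would pass response.status_code instead of the literal 200.
label, ok = describe_status(200)
print(label, ok)
```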

The `Response` object returned by `requests.get` holds the results of the search, along with metadata about those results in its `headers`.  

More information about the `response` object can be found by typing `help(response)`.

`headers` contains useful information in a case-insensitive dictionary. The `json` format we requested above applies to the body of the response, not to the headers. We'll iterate through the headers, looping through each field (`k`) and its associated value (`v`). For more on iterating through dictionary objects click [here](https://realpython.com/iterate-through-dictionary-python/).

In [7]:
for k, v in response.headers.items():
    print(f'{k}: {v}')

Content-Type: application/json;charset=utf-8
Content-Length: 4517
Connection: keep-alive
Date: Mon, 04 Apr 2022 21:04:20 GMT
X-Frame-Options: SAMEORIGIN
Access-Control-Allow-Origin: *
X-XSS-Protection: 1; mode=block
CMR-Request-Id: 6b187d39-c5b8-40f7-9b06-9228ea3c33b2
Strict-Transport-Security: max-age=31536000
CMR-Search-After: [0.0,18400.0,"SENTINEL-1A_OCN","1",1214472977,1250]
CMR-Hits: 1169
Access-Control-Expose-Headers: CMR-Hits, CMR-Request-Id, X-Request-Id, CMR-Scroll-Id, CMR-Search-After, CMR-Timed-Out, CMR-Shapefile-Original-Point-Count, CMR-Shapefile-Simplified-Point-Count
X-Content-Type-Options: nosniff
CMR-Took: 303
X-Request-Id: zbAtpLRlRe2kPJi8PNdomG7u6C22HPSRbAy5E2m23uYIKga7V8DEfg==
Vary: Accept-Encoding, User-Agent
Content-Encoding: gzip
Server: ServerTokens ProductOnly
X-Cache: Miss from cloudfront
Via: 1.1 e9c8cd6cad69627cb7c9d88123e6e2cc.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: HIO50-C2
X-Amz-Cf-Id: zbAtpLRlRe2kPJi8PNdomG7u6C22HPSRbAy5E2m23uYIKga7V8DEfg==


Each item in the dictionary can be accessed in the normal way you access a Python dictionary, except that the keys are case-insensitive. Let's take a look at the commonly used `CMR-Hits` key.



In [8]:
response.headers['CMR-Hits']

'1169'

Note that "cmr-hits" works as well!

In [9]:
response.headers['cmr-hits']

'1169'

In some situations the response to your query can return a very large number of results, some of which may not be relevant. We can add additional [query parameters](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html) to restrict the information returned. We're going to restrict the search by the `provider` parameter.

You can modify the code below to explore all Earthdata data products hosted by the various providers. When searching by provider, use _Cloud Provider_ to search for cloud-hosted datasets and _On-Premises Provider_ to search for datasets archived at the DAACs. A partial list of providers is given below.

DAAC      | Full Name | Cloud Provider | On-Premises Provider
----------|-----------|----------------|---------------------
NSIDC     | National Snow and Ice Data Center | NSIDC_CPRD | NSIDC_ECS
GHRC DAAC | Global Hydrometeorology Resource Center | GHRC_DAAC | GHRC_DAAC
PO DAAC   | Physical Oceanography Distributed Active Archive Center | POCLOUD | PODAAC
ASF       | Alaska Satellite Facility | ASF | ASF
ORNL DAAC | Oak Ridge National Laboratory | ORNL_CLOUD | ORNL_DAAC
LP DAAC   | Land Processes Distributed Active Archive Center | LPCLOUD | LPDAAC_ECS
GES DISC  | NASA Goddard Earth Sciences (GES) Data and Information Services Center (DISC) | GES_DISC | GES_DISC
OB DAAC   | NASA's Ocean Biology Distributed Active Archive Center |   | OB_DAAC
SEDAC     | NASA's Socioeconomic Data and Applications Center |   | SEDAC

We'll assign the provider to a variable as a _string_ and insert the variable into the parameter argument in the request.

In [10]:
provider = 'LPCLOUD'

In [11]:
response = requests.get(url,
                        params={
                            'cloud_hosted': 'True',
                            'has_granules': 'True',
                            'provider': provider,
                        },
                        headers={
                            'Accept': 'application/json'
                        }
                       )
response

<Response [200]>

In [12]:
response.headers['cmr-hits']

'3'

Search results are contained in the __content__ part of the Response object. However, `response.content` returns information in bytes.

In [13]:
response.content

b'{"feed":{"updated":"2022-04-04T21:04:21.524Z","id":"https://cmr.earthdata.nasa.gov:443/search/collections.json?cloud_hosted=True&has_granules=True&provider=LPCLOUD","title":"ECHO dataset metadata","entry":[{"processing_level_id":"3","boxes":["-90 -180 90 180"],"time_start":"2013-04-11T00:00:00.000Z","version_id":"2.0","updated":"2015-12-03T10:57:07.000Z","dataset_id":"HLS Landsat Operational Land Imager Surface Reflectance and TOA Brightness Daily Global 30m v2.0","has_spatial_subsetting":false,"has_transforms":false,"has_variables":false,"data_center":"LPCLOUD","short_name":"HLSL30","organizations":["LP DAAC","NASA/IMPACT"],"title":"HLS Landsat Operational Land Imager Surface Reflectance and TOA Brightness Daily Global 30m v2.0","coordinate_system":"CARTESIAN","summary":"The Harmonized Landsat and Sentinel-2 (HLS) project provides consistent surface reflectance (SR) and top of atmosphere (TOA) brightness data from the Operational Land Imager (OLI) aboard the joint NASA/USGS Landsat 

A more convenient way to work with this information is to parse the `json` formatted data. I'm using pretty print `pprint` to print the data in an easy to read way.    

**Note**
- `response.json()` parses the `json` body of the response into Python dictionaries and lists
- `['feed']['entry']` returns the entries that CMR included in this response, which can be fewer than the total reported by __CMR-Hits__
- `[0]` returns the first entry. Reminder that Python starts indexing at 0, not 1!

In [14]:
pprint(response.json()['feed']['entry'][0])

{'archive_center': 'LP DAAC',
 'boxes': ['-90 -180 90 180'],
 'browse_flag': True,
 'collection_data_type': 'OTHER',
 'consortiums': ['GEOSS', 'EOSDIS'],
 'coordinate_system': 'CARTESIAN',
 'data_center': 'LPCLOUD',
 'dataset_id': 'HLS Landsat Operational Land Imager Surface Reflectance and '
               'TOA Brightness Daily Global 30m v2.0',
 'has_formats': False,
 'has_spatial_subsetting': False,
 'has_temporal_subsetting': False,
 'has_transforms': False,
 'has_variables': False,
 'id': 'C2021957657-LPCLOUD',
 'links': [{'href': 'https://search.earthdata.nasa.gov/search?q=C2021957657-LPCLOUD',
            'hreflang': 'en-US',
            'rel': 'http://esipfed.org/ns/fedsearch/1.1/data#'},
           {'href': 'https://doi.org/10.5067/HLS/HLSL30.002',
            'hreflang': 'en-US',
            'rel': 'http://esipfed.org/ns/fedsearch/1.1/metadata#'},
           {'href': 'https://lpdaac.usgs.gov/',
            'hreflang': 'en-US',
            'rel': 'http://esipfed.org/ns/fedsear

The first entry contains a lot more information than we need. We'll narrow in on a few fields to get a feel for what we have. We'll print the archive center (`archive_center`), the name of the dataset (`dataset_id`), and the concept id (`id`). We can build the print statement with an f-string, as we did above for the `url` variable. 

In [15]:
collections = response.json()['feed']['entry']

In [16]:
for collection in collections:
    print(f'{collection["archive_center"]} | {collection["dataset_id"]} | {collection["id"]}')

LP DAAC | HLS Landsat Operational Land Imager Surface Reflectance and TOA Brightness Daily Global 30m v2.0 | C2021957657-LPCLOUD
LP DAAC | HLS Sentinel-2 Multi-spectral Instrument Surface Reflectance Daily Global 30m v2.0 | C2021957295-LPCLOUD
LP DAAC | ASTER Global Digital Elevation Model V003 | C1711961296-LPCLOUD
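The concept ids printed above are what we need for a granule search, so it can help to collect them programmatically instead of copying them by hand. The sketch below works on a small stand-in list, `sample_collections`, copied from the output above so that it runs without a network call. Filtering on the `HLS` prefix of `dataset_id` is an assumption made for this example.

```python
# Stand-in for response.json()['feed']['entry'], reduced to the two fields used here.
sample_collections = [
    {'dataset_id': 'HLS Landsat Operational Land Imager Surface Reflectance and '
                   'TOA Brightness Daily Global 30m v2.0',
     'id': 'C2021957657-LPCLOUD'},
    {'dataset_id': 'HLS Sentinel-2 Multi-spectral Instrument Surface Reflectance '
                   'Daily Global 30m v2.0',
     'id': 'C2021957295-LPCLOUD'},
    {'dataset_id': 'ASTER Global Digital Elevation Model V003',
     'id': 'C1711961296-LPCLOUD'},
]

# Keep the concept ids of collections whose name starts with "HLS".
hls_ids = [c['id'] for c in sample_collections if c['dataset_id'].startswith('HLS')]
print(hls_ids)
```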


In some situations we may be expecting a certain number of results. Only 10 datasets are returned by default. This can be modified by setting the `page_size` parameter to any value less than or equal to 2000 (2000 is the maximum number of results returned by CMR in a single request). Note that this is different from `CMR-Hits` in the header, which is the total number of entries that matched the search.
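When `CMR-Hits` is larger than the page size, the results have to be requested page by page. The sketch below only plans those requests and does not contact CMR. It assumes the 1-based `page_num` parameter described in the CMR API documentation, and `plan_pages` is a hypothetical helper name.

```python
import math

def plan_pages(hits, page_size=2000):
    """Return one params dict per page needed to retrieve `hits` results."""
    if not 1 <= page_size <= 2000:
        raise ValueError('page_size must be between 1 and 2000')
    # Header values are strings, so convert before doing arithmetic.
    n_pages = math.ceil(int(hits) / page_size)
    return [{'page_size': page_size, 'page_num': n} for n in range(1, n_pages + 1)]

# 1169 hits at 500 per page need three requests.
pages = plan_pages('1169', page_size=500)
print(len(pages), pages[-1])
```

Each dictionary can be merged into the `params` of a `requests.get` call. The `CMR-Search-After` header seen earlier points to another paging mechanism, which is not covered here.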

## Searching for Granules
In NASA speak, granules are files or groups of files. In this example, we will search the two HLS collections found above for a specified region of interest and datetime range.  

We need to change the resource url to look for __granules__ instead of collections.

In [17]:
url = f'{CMR_OPS}/{"granules"}'

We will search by `concept_id`, `temporal`, and `bounding_box`.  Details about these search parameters can be found in the CMR API Documentation.

The formatting of the values for each parameter is quite specific.

__concept_id__ parameter is one or more collection concept ids; several ids can be passed as a list   
__temporal__ parameter is a start and end datetime in ISO 8601 format `yyyy-MM-ddTHH:mm:ssZ`, separated by a comma    
__bounding_box__ parameter is four coordinates in the order: lower left longitude, lower left latitude, upper right longitude, upper right latitude  
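Because these formats are easy to get wrong by hand, one option is to build the strings from Python values. The two helpers below are a minimal sketch. `temporal_range` and `bounding_box` are hypothetical names, and the datetimes are assumed to already be in UTC.

```python
from datetime import datetime

def temporal_range(start, end):
    """Format two UTC datetimes as 'start,end' in the yyyy-MM-ddTHH:mm:ssZ pattern."""
    if start > end:
        raise ValueError('start must not be after end')
    fmt = '%Y-%m-%dT%H:%M:%SZ'
    return f'{start.strftime(fmt)},{end.strftime(fmt)}'

def bounding_box(west, south, east, north):
    """Format corners as lower left lon, lower left lat, upper right lon, upper right lat."""
    if south > north:
        raise ValueError('south must not exceed north')
    return ','.join(str(v) for v in (west, south, east, north))

date_range = temporal_range(datetime(2020, 10, 17), datetime(2020, 11, 18, 23, 59, 59))
bbox = bounding_box(-120.45264628, 34.51050622, -120.40432448, 34.53239876)
print(date_range)
print(bbox)
```

The longitude order is deliberately not checked, since a box that crosses the antimeridian has a western edge with a larger longitude than its eastern edge.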

In [18]:
collection_id = ['C2021957657-LPCLOUD', 'C2021957295-LPCLOUD']
date_range = '2020-10-17T00:00:00Z,2020-11-18T23:59:59Z'
bbox = '-120.45264628,34.51050622,-120.40432448,34.53239876'

In [19]:
response = requests.get(url, 
                        params={
                            'concept_id': collection_id,
                            'temporal': date_range,
                            'bounding_box': bbox,
                            'page_size': 200,
                            },
                        headers={
                            'Accept': 'application/json'
                            }
                       )
print(response.status_code)

200


In [20]:
print(response.headers['CMR-Hits'])

10


In [21]:
granules = response.json()['feed']['entry']
for granule in granules:
    print(f'{granule["data_center"]} | {granule["dataset_id"]} | {granule["id"]}')

LPCLOUD | HLS Sentinel-2 Multi-spectral Instrument Surface Reflectance Daily Global 30m v2.0 | G2167580011-LPCLOUD
LPCLOUD | HLS Landsat Operational Land Imager Surface Reflectance and TOA Brightness Daily Global 30m v2.0 | G2152756095-LPCLOUD
LPCLOUD | HLS Sentinel-2 Multi-spectral Instrument Surface Reflectance Daily Global 30m v2.0 | G2167475882-LPCLOUD
LPCLOUD | HLS Sentinel-2 Multi-spectral Instrument Surface Reflectance Daily Global 30m v2.0 | G2167295721-LPCLOUD
LPCLOUD | HLS Landsat Operational Land Imager Surface Reflectance and TOA Brightness Daily Global 30m v2.0 | G2152682346-LPCLOUD
LPCLOUD | HLS Sentinel-2 Multi-spectral Instrument Surface Reflectance Daily Global 30m v2.0 | G2166867404-LPCLOUD
LPCLOUD | HLS Sentinel-2 Multi-spectral Instrument Surface Reflectance Daily Global 30m v2.0 | G2166768089-LPCLOUD
LPCLOUD | HLS Sentinel-2 Multi-spectral Instrument Surface Reflectance Daily Global 30m v2.0 | G2166693426-LPCLOUD
LPCLOUD | HLS Landsat Operational Land Imager Surfac

In [22]:
pprint(granules[0])

{'browse_flag': True,
 'collection_concept_id': 'C2021957295-LPCLOUD',
 'coordinate_system': 'GEODETIC',
 'data_center': 'LPCLOUD',
 'dataset_id': 'HLS Sentinel-2 Multi-spectral Instrument Surface Reflectance '
               'Daily Global 30m v2.0',
 'day_night_flag': 'DAY',
 'id': 'G2167580011-LPCLOUD',
 'links': [{'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSS30.020/HLS.S30.T10SGD.2020292T184411.v2.0/HLS.S30.T10SGD.2020292T184411.v2.0.B09.tif',
            'hreflang': 'en-US',
            'rel': 'http://esipfed.org/ns/fedsearch/1.1/data#',
            'title': 'Download HLS.S30.T10SGD.2020292T184411.v2.0.B09.tif'},
           {'href': 's3://lp-prod-protected/HLSS30.020/HLS.S30.T10SGD.2020292T184411.v2.0/HLS.S30.T10SGD.2020292T184411.v2.0.B09.tif',
            'hreflang': 'en-US',
            'rel': 'http://esipfed.org/ns/fedsearch/1.1/s3#',
            'title': 'This link provides direct download access via S3 to the '
                     'granule'},
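The `links` list is where the assets we set out to find live: each entry has an `href` and a `rel` that describes the kind of link. The sketch below separates HTTPS download links from S3 links using the two `rel` values visible in the output above. The output is truncated, so other `rel` values may be present in a full response, and `links_by_rel` is a hypothetical helper name.

```python
# Two link entries copied from the granule printed above.
sample_links = [
    {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSS30.020/'
             'HLS.S30.T10SGD.2020292T184411.v2.0/HLS.S30.T10SGD.2020292T184411.v2.0.B09.tif',
     'rel': 'http://esipfed.org/ns/fedsearch/1.1/data#'},
    {'href': 's3://lp-prod-protected/HLSS30.020/'
             'HLS.S30.T10SGD.2020292T184411.v2.0/HLS.S30.T10SGD.2020292T184411.v2.0.B09.tif',
     'rel': 'http://esipfed.org/ns/fedsearch/1.1/s3#'},
]

def links_by_rel(links, suffix):
    """Return the href of every link whose 'rel' ends with the given suffix."""
    return [link['href'] for link in links if link.get('rel', '').endswith(suffix)]

https_urls = links_by_rel(sample_links, '/data#')  # for download over HTTPS
s3_urls = links_by_rel(sample_links, '/s3#')       # for S3 direct access
print(https_urls)
print(s3_urls)
```

In the notebook, `granules[0]['links']` would take the place of `sample_links`.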
  