# NASA Earthdata API Client 🌍

## Overview

> TL;DR: **earthaccess** uses NASA APIs to search, preview and access NASA datasets on-prem and in the cloud with 4 lines of Python.

There are many ways to access NASA datasets: we can use the Earthdata search portal, DAAC-specific portals or tools, or even data.gov!
These web portals are great, but they are not designed for programmatic access and reproducible workflows.
Both are extremely important in the age of the cloud and reproducible open science.

The good news is that NASA also exposes APIs that allow us to search, transform and access data programmatically.
There are already some very useful client libraries for these APIs:

* python-cmr
* eo-metadata-tools
* harmony-py
* Hyrax (OpenDAP)
* cmr-stac
* others

Each of these libraries has amazing features and some similarities:

* [cmr-stac](https://medium.com/pangeo/intake-stac-nasa-4cd78d6246b7) is probably the best option for a streamlined workflow from dataset search and discovery to efficiently loading data using Python libraries like pandas or xarray.
* [*Harmony-py*](https://harmony.earthdata.nasa.gov/) is the more capable client if we want to preprocess the data beforehand (reformat NetCDF to Zarr, reproject, subset). Unfortunately, not all datasets are covered by Harmony yet.

In this context, **earthaccess** aims to be a simple library that can deal with the important parts of the metadata so we can access or download data without having to worry if a given dataset is on-prem or in the cloud.

### NASA EDL and the Auth class

In [28]:
# We import earthaccess and authenticate
import earthaccess

# the core function of auth is to deal with cloud credentials and remote file sessions (fsspec or requests).
# Essentially, anything that requires you to log in to Earthdata.
# Most of this will happen behind-the-scenes for you once you have been authenticated.
auth = earthaccess.login()

In some cases, we can query anonymously without authentication to get basic information about what data is available.

We can query for collections

```python
# An anonymous query to CMR
Query = earthaccess.collection_query().keyword('elevation')
```

or for granules

```python
# An anonymous query to CMR
Query = earthaccess.granule_query().keyword('elevation')
```

## Querying for data collections
The DataCollection client, accessed via `earthaccess.collection_query()`, can query CMR for any collection using all of CMR's query parameters and has built-in accessors for the common ones.
This makes it ideal for one-liners and keeps the notation simple.

In [None]:
# We can now search for collections using a pythonic API client for CMR.
# Query = earthaccess.collection_query().keyword('fire').temporal("2016-01-01", "2020-12-12")
# Query = earthaccess.collection_query().keyword('GEDI').bounding_box(-134.7,58.9,-133.9,59.2)

Query = (
    earthaccess.collection_query()
    .keyword("elevation")
    .bounding_box(-134.7, 58.9, -133.9, 59.2)
)

print(f"Collections found: {Query.hits()}")

# Filter which UMM fields to print; to see the full record, omit the fields filter.
# meta is always included
collections = Query.fields(["ShortName", "Abstract"]).get(3)
# Inspect 3 results printing just the ShortName and Abstract
collections[0:3]

In [None]:
# the results from DataCollections and DataGranules are enhanced python dict objects, we still can get all the fields from CMR
collections[0]["umm"]["ShortName"]

The results of your collection_query (a DataCollections class object) are python dictionaries with some handy methods.

```python 
collection.concept_id() # returns the concept-id, used to search for data granules
collection.abstract() # returns the abstract
collection.landing_page() # returns the landing page if present in the UMM fields
collection.get_data() # returns the portal where data can be accessed.
```

The same results can be obtained using the `dict` syntax:

```python
collection["meta"]["concept-id"] # concept-id
collection["umm"]["RelatedUrls"] # URLs, with GET DATA, LANDING PAGE etc
```
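
To make the `dict` layout concrete, here is a minimal sketch that works on a hand-written record rather than a real CMR response. The record contents, the `Type` values and the helper `urls_of_type` are assumptions for illustration only; real records carry many more fields.

```python
# A hand-written record that mimics the shape of a collection result.
# All values are illustrative, not real CMR output.
sample_collection = {
    "meta": {"concept-id": "C0000000000-EXAMPLE"},
    "umm": {
        "ShortName": "EXAMPLE_DEM",
        "RelatedUrls": [
            {"URL": "https://example.org/landing", "Type": "DATA SET LANDING PAGE"},
            {"URL": "https://example.org/data", "Type": "GET DATA"},
        ],
    },
}


def urls_of_type(collection: dict, url_type: str) -> list[str]:
    """Return the URL of every RelatedUrls entry whose Type equals url_type."""
    related = collection.get("umm", {}).get("RelatedUrls", [])
    return [entry["URL"] for entry in related if entry.get("Type") == url_type]


concept_id = sample_collection["meta"]["concept-id"]
get_data_urls = urls_of_type(sample_collection, "GET DATA")
print(concept_id, get_data_urls)
```

Using `.get()` with defaults keeps the helper safe for records that have no `RelatedUrls` at all.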


In [None]:
# We can also restrict the collection search to a single DAAC.
Query = earthaccess.collection_query().daac("PODAAC")

print(f"Collections found: {Query.hits()}")
collections = Query.fields(["ShortName"]).get(20)
# Printing 3 collections
collections[0:3]

In [None]:
# What if we only want cloud-hosted collections?
Query = earthaccess.collection_query().daac("PODAAC").cloud_hosted(True)

print(f"Collections found: {Query.hits()}")
collections = Query.fields(["ShortName"]).get(20)
# Printing 3 collections
collections[0:3]

In [None]:
# Printing the concept-id for the first 10 collections
[collection.concept_id() for collection in collections[0:10]]

## Querying for data granules

The DataGranules client, accessed via `earthaccess.granule_query()`, provides similar functionality to the collection class. To query for granules reliably, use the collection concept-id as the main key.
You can search data granules using a short name, but that could (and most likely will) return multiple versions of the same data granules.
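
The sketch below illustrates why a short name alone can be ambiguous. It groups hand-written granule records by their parent collection; the record shape (`collection-concept-id` under `meta`, `CollectionReference` under `umm`), the IDs and the helper `granules_per_collection` are assumptions made for this example, not output from a real query.

```python
from collections import defaultdict

# Hand-written records: same short name, two different collection versions.
sample_granules = [
    {"meta": {"concept-id": "G1-EXAMPLE", "collection-concept-id": "C1-EXAMPLE"},
     "umm": {"CollectionReference": {"ShortName": "ATL03", "Version": "005"}}},
    {"meta": {"concept-id": "G2-EXAMPLE", "collection-concept-id": "C2-EXAMPLE"},
     "umm": {"CollectionReference": {"ShortName": "ATL03", "Version": "006"}}},
    {"meta": {"concept-id": "G3-EXAMPLE", "collection-concept-id": "C2-EXAMPLE"},
     "umm": {"CollectionReference": {"ShortName": "ATL03", "Version": "006"}}},
]


def granules_per_collection(granules):
    """Map each collection concept-id to the granule concept-ids it contains."""
    groups = defaultdict(list)
    for granule in granules:
        meta = granule["meta"]
        groups[meta["collection-concept-id"]].append(meta["concept-id"])
    return dict(groups)


groups = granules_per_collection(sample_granules)
versions = sorted({g["umm"]["CollectionReference"]["Version"] for g in sample_granules})
print(groups)    # one short name, two collections
print(versions)  # the versions mixed together in the result
```

Filtering by a single collection concept-id (or by short name plus version) avoids mixing versions like this.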

In this example we're querying for 20 data granules from the ICESat-2 [ATL03](https://nsidc.org/data/ATL03/versions/) version `006` dataset.

In [None]:
Query = (
    earthaccess.granule_query()
    .short_name("ATL03")
    .version("006")
    .bounding_box(-134.7, 58.9, -133.9, 59.2)
)
granules = Query.get(20)
granules[0:2]

### Pretty printing data granules

Since we are in a notebook, we can use the built-in function `display` to see a more user-friendly version of the granules.
This renders a browse image for the granule if one is available, and will eventually look similar to the representation in the Earthdata search portal.

In [None]:
# printing 2 granules using display
for granule in granules[0:2]:
    display(granule)

### Spatiotemporal queries

Our granules and collection classes accept the same spatial and temporal arguments as CMR so we can search for granules that match spatiotemporal criteria.



In [None]:
Query = (
    earthaccess.granule_query()
    .short_name("ATL03")
    .temporal("2020-03-01", "2020-03-30")
    .bounding_box(-134.7, 58.9, -133.9, 59.2)
    .version("006")
)
# Always inspect the hits before retrieving the granule metadata, because the metadata is very verbose.
print(f"Granules found: {Query.hits()}")
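
For orientation, the query above could correspond to a CMR request along the following lines. This is a minimal sketch, not how `earthaccess` builds its requests: the endpoint URL, the parameter names (`short_name`, `version`, `bounding_box`, `temporal`, `page_size`) and the helper `build_granule_search_url` are assumptions for illustration, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: the public CMR granule search endpoint.
CMR_GRANULE_SEARCH = "https://cmr.earthdata.nasa.gov/search/granules.umm_json"


def build_granule_search_url(short_name, version, bbox, start_date, end_date, page_size=10):
    """Encode a spatiotemporal granule search as a CMR-style query string."""
    west, south, east, north = bbox  # lower-left lon/lat, upper-right lon/lat
    params = {
        "short_name": short_name,
        "version": version,
        "bounding_box": f"{west},{south},{east},{north}",
        "temporal": f"{start_date}T00:00:00Z,{end_date}T23:59:59Z",
        "page_size": page_size,
    }
    return f"{CMR_GRANULE_SEARCH}?{urlencode(params)}"


url = build_granule_search_url(
    "ATL03", "006", (-134.7, 58.9, -133.9, 59.2), "2020-03-01", "2020-03-30"
)
# Decode the query string again to check what was encoded.
query = parse_qs(urlsplit(url).query)
print(query["bounding_box"], query["temporal"])
```

The bounding box uses the same west, south, east, north order as the `.bounding_box()` calls in this notebook.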

In [None]:
# Now we can print some info about these granules using the built-in methods
granules = Query.get(4)
data_links = [{"links": g.data_links(), "size (MB):": g.size()} for g in granules]
data_links

In [None]:
# More datasets to try

# C1908348134-LPDAAC_ECS: GEDI L2A Elevation and Height Metrics Data Global Footprint Level V002
# C1968980609-POCLOUD: Sentinel-6A MF Jason-CS L2 P4 Altimeter Low Resolution (LR) STC Ocean Surface Topography
# C1575731655-LPDAAC_ECS: ASTER Global Digital Elevation Model NetCDF V003
# Query = earthaccess.granule_query().concept_id("C1575731655-LPDAAC_ECS").bounding_box(-134.7,58.9,-133.9,59.2)
Query = (
    earthaccess.granule_query()
    .short_name("ATL03")
    .version("006")
    .bounding_box(-134.7, 58.9, -133.9, 59.2)
)

print(f"Granules found: {Query.hits()}")

In [None]:
# Not all granules have data previews. When they do, the granule class shows up to 2 preview images with Jupyter's display() function.
granules = Query.get(10)
for g in granules[0:5]:
    display(g)

In [None]:
# Granules are python dictionaries, with fancy nested key/value notation and some extra built-in methods.
granules[0]["umm"]["TemporalExtent"]["RangeDateTime"]

In [None]:
# Size in MB
data_links = [{"links": g.data_links(), "size (MB):": g.size()} for g in granules]
data_links

## Accessing the data

The cloud is not something magical, but having infrastructure on demand is quite handy for many scientific workflows, especially if the data already lives in "the cloud".
NASA started migrating its data in 2020, and the migration will continue for the foreseeable future. Most, though not all, NASA data will be available on the AWS object storage system (i.e. S3).

To work with this data, the first thing we need to do is get the proper credentials for accessing data in the NASA DAAC S3 buckets. These credentials are issued on a per-DAAC basis and last only 1 hour. In the near future, the Auth class will keep track of this and regenerate the credentials as needed.
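
One way to picture per-DAAC credentials that expire after an hour is a small cache that refetches them shortly before they lapse. The sketch below is an illustration only, not the Auth class: `CredentialCache`, the `fetch` callable, the 5-minute refresh margin and the fake tokens are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

CREDENTIAL_LIFETIME = timedelta(hours=1)  # from the text: credentials last 1 hour
REFRESH_MARGIN = timedelta(minutes=5)     # assumption: refresh a little early


class CredentialCache:
    """Keep one set of credentials per DAAC and refetch them when they get old."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: DAAC name -> credentials dict
        self._store = {}     # DAAC name -> (credentials, issued_at)

    def get(self, daac, now=None):
        now = now or datetime.now(timezone.utc)
        cached = self._store.get(daac)
        if cached is not None:
            credentials, issued_at = cached
            if now - issued_at < CREDENTIAL_LIFETIME - REFRESH_MARGIN:
                return credentials  # still fresh enough to reuse
        credentials = self._fetch(daac)
        self._store[daac] = (credentials, now)
        return credentials


# Usage with a fake fetcher that records every call it receives.
calls = []

def fake_fetch(daac):
    calls.append(daac)
    return {"daac": daac, "token": f"token-{len(calls)}"}

cache = CredentialCache(fake_fetch)
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
first = cache.get("NSIDC", now=t0)
second = cache.get("NSIDC", now=t0 + timedelta(minutes=30))  # reused
third = cache.get("NSIDC", now=t0 + timedelta(minutes=56))   # refetched
other = cache.get("PODAAC", now=t0)                          # separate DAAC
print(calls)
```

Passing `now` explicitly keeps the expiry logic easy to test without waiting an hour.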

With `earthaccess`, a researcher can get files with the same API call regardless of whether they are on-prem or cloud-based. An important consideration is that direct access to data in the cloud requires running the code in the cloud, because some S3 buckets are configured to allow direct access (s3:// links) only if the requester is in the same region, `us-west-2`.
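
The region constraint can be sketched as a simple choice between link types. This is an illustration under stated assumptions, not the library's logic: `choose_links`, the sample links and the way the current region is passed in are all hypothetical.

```python
DIRECT_ACCESS_REGION = "us-west-2"  # from the text: region required for s3:// access


def choose_links(links, current_region):
    """Return s3:// links when running in-region, otherwise the https:// ones."""
    in_region = current_region == DIRECT_ACCESS_REGION
    scheme = "s3://" if in_region else "https://"
    return [link for link in links if link.startswith(scheme)]


# Hypothetical links for a single granule, one per access method.
sample_links = [
    "s3://example-bucket/ATL06/granule.h5",
    "https://example.org/ATL06/granule.h5",
]

in_cloud = choose_links(sample_links, "us-west-2")
on_laptop = choose_links(sample_links, None)
print(in_cloud, on_laptop)
```

Outside `us-west-2` the sketch falls back to HTTPS links, which matches the idea of downloading the files instead of reading them directly from S3.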

## On-prem access  📡

DAAC hosted data

In [None]:
Query = (
    earthaccess.granule_query()
    .short_name("ATL06")
    .bounding_box(-134.7, 54.9, -100.9, 69.2)
    .debug(True)
)
print(f"Granule hits: {Query.hits()}")
# Getting more than 6,000 metadata records for demo purposes would slow us down, so let's get only 10.
granules = Query.get(10)

In [None]:
granules[0]

In [None]:
# Does this granule belong to a cloud-based collection?
granules[0].cloud_hosted

In [None]:
# since the response is an array of dictionaries we can do pythonic things like ordering the granules by size
import operator

granules_by_size = sorted(granules, key=operator.itemgetter("size"))
# Now our list is sorted by size, from smallest to largest. Let's print the first 3.
granules_by_size[0:3]
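
Sorting by size also makes it easy to cap how much we download. The following sketch uses plain dictionaries with a `size` key (in MB) as stand-ins for granules; the sample values, the `name` key and the helper `select_within_budget` are hypothetical.

```python
import operator

# Stand-in records; real granules carry far more metadata.
sample_granules = [
    {"name": "granule-a", "size": 120.5},
    {"name": "granule-b", "size": 8.25},
    {"name": "granule-c", "size": 47.0},
    {"name": "granule-d", "size": 300.0},
]


def select_within_budget(granules, budget_mb):
    """Take the smallest granules first until the next one would exceed the budget."""
    selected, used = [], 0.0
    for granule in sorted(granules, key=operator.itemgetter("size")):
        if used + granule["size"] > budget_mb:
            break
        selected.append(granule)
        used += granule["size"]
    return selected, used


selected, used_mb = select_within_budget(sample_granules, budget_mb=200)
print([g["name"] for g in selected], used_mb)
```

Because the list is sorted smallest first, the loop can stop at the first granule that does not fit.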

In [None]:
%%time
# Accessing on-prem data means downloading it when we work locally, or "uploading" it to our cloud instance when we work in the cloud.
files = earthaccess.download(granules_by_size[0:2], "./data/demo-atl03")

## Cloud access ☁️

Same API, just a different place

In [None]:
Query = (
    earthaccess.granule_query()
    .short_name("ATL06")
    .cloud_hosted(True)
    .bounding_box(-134.7, 54.9, -100.9, 69.2)
)
print(f"Granule hits: {Query.hits()}")
cloud_granules = Query.get(10)
# is this a cloud hosted data granule?
cloud_granules[0].cloud_hosted

In [None]:
# Let's pretty print this
cloud_granules[0]

In [None]:
# Let's order them by size again.
import operator

cloud_granules_by_size = sorted(cloud_granules, key=operator.itemgetter("size"))
# Now our list is sorted by size, from smallest to largest. Let's print the first 3.
cloud_granules_by_size[0:3]

In [None]:
%%time

files = earthaccess.download(cloud_granules_by_size[0:3], local_path="./data/demo")
files

## Recap

```python
import earthaccess 

auth = earthaccess.login()

Query = earthaccess.granule_query().concept_id("C1575731655-LPDAAC_ECS").bounding_box(-134.7,58.9,-133.9,59.2)
granules = Query.get(10)
# preview the data granules
granules 
# get the files
files = earthaccess.download(granules, "data")

```

### Related links

**CMR** API documentation: https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html

**EDL** API documentation: https://urs.earthdata.nasa.gov/documentation

NASA OpenScapes: https://nasa-openscapes.github.io/earthdata-cloud-cookbook/

NSIDC: https://nsidc.org