# Introduction to NASA `earthaccess`

## Summary

This notebook demonstrates how to search for ECOSTRESS version 2 data collections from NASA's Earthdata Cloud using the [`earthaccess`](https://github.com/nsidc/earthaccess) package. `earthaccess` is a Python library for searching, downloading, or streaming NASA Earth science data with just a few lines of code. The library abstracts the [NASA Common Metadata Repository (CMR) API](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html), manages authentication, and enables reproducible programmatic search and access for both DAAC-hosted (on-prem) and cloud-hosted (Earthdata Cloud) data.

A NASA Earthdata Login account ([EDL](https://urs.earthdata.nasa.gov/profile)) is required to download or access data. Earthdata Login accounts are free and can be set up in only a few minutes. Remember your EDL username and password as they are needed for authentication in this and other data access resources.

> **Note** Generally speaking, authentication is not needed to query collections and granules unless the datasets are restricted (for example, to early adopters).

## Requirements  

- A NASA [Earthdata Login](https://urs.earthdata.nasa.gov/) account is required   

## Learning Objectives  

- How to get information about data collections using `earthaccess`
- How to query for data using spatiotemporal parameters
- How to work with `earthaccess` request objects

## Exercise  

Let's start by loading in the needed packages.  

In [1]:
import earthaccess
import os
import geopandas as gp
import hvplot.pandas

`earthaccess` provides 3 different "strategies" to authenticate with NASA EDL.

* **netrc**: If we have a **.netrc** file with our EDL credentials, `earthaccess` can use it. If we don't have one, `earthaccess` lets us type our credentials and persist them to a **.netrc** file.
* **environment**: If our EDL credentials are stored in the following environment variables, `earthaccess` can read them from there:
  * `EDL_USERNAME`
  * `EDL_PASSWORD`
* **interactive**: We will be asked for our EDL credentials, with optional persistence to **.netrc**.
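
As a rough illustration of how the three strategies differ, the sketch below reports which one would find credentials on the current machine. It does not call `earthaccess`. `detect_login_strategy` is a hypothetical helper, and the order in which it checks is an assumption that may not match what `earthaccess.login()` does internally.

```python
import netrc
import os
from pathlib import Path

EDL_HOST = "urs.earthdata.nasa.gov"  # Earthdata Login host used in .netrc entries


def detect_login_strategy(env=None, netrc_path=None):
    """Return 'environment', 'netrc', or 'interactive' (hypothetical helper)."""
    env = os.environ if env is None else env
    # Strategy: environment variables (check order is an assumption)
    if env.get("EDL_USERNAME") and env.get("EDL_PASSWORD"):
        return "environment"
    # Strategy: a .netrc file with an entry for the Earthdata Login host
    path = Path(netrc_path) if netrc_path else Path.home() / ".netrc"
    try:
        if netrc.netrc(str(path)).authenticators(EDL_HOST):
            return "netrc"
    except (OSError, netrc.NetrcParseError):
        pass  # missing, unreadable, or malformed file
    # Nothing stored: credentials would have to be typed in
    return "interactive"


print(detect_login_strategy())
```

If this prints `interactive`, expect a username and password prompt the first time you log in.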

The code below will cycle through the strategies and automatically persist our credentials to a **.netrc** file.

In [2]:
auth = earthaccess.login(persist = True)
# are we authenticated?
print(auth.authenticated)

True


In [3]:
#auth.refresh_tokens()

`earthaccess` creates and uses Earthdata Login tokens to authenticate with NASA systems. These tokens expire after a month, after which they can no longer be used to download or stream data with `earthaccess`. Use the `refresh_tokens()` method to generate a new token.

## Querying for datasets, AKA collections

We need information about the data collection we're interested in before we can find the data granules we would like to process. We'll use the `search_datasets()` function to query for collections that match our input parameters.

In [4]:
collections_req = earthaccess.search_datasets(
    provider='LPCLOUD',    # LPCLOUD is the LP DAAC Archive in Earthdata Cloud
    keyword='ecostress',
    version='002'
)

In [5]:
print(f'collections_req is a {type(collections_req)} of {type(collections_req[0])}')

collections_req is a <class 'list'> of <class 'earthaccess.results.DataCollection'>


Queries return a list of `earthaccess` `DataCollection` objects. A `DataCollection` is an enhanced Python dictionary and, as such, can be interacted with like any other Python dictionary. Let's take a look at the first collection in `collections_req`.

In [6]:
collection = collections_req[0]    # Get the first earthaccess DataCollection in the list
collection

{
  "meta": {
    "revision-id": 68,
    "deleted": false,
    "format": "application/vnd.nasa.cmr.umm+json",
    "provider-id": "LPCLOUD",
    "has-combine": false,
    "user-id": "dnilsen_eros",
    "has-formats": false,
    "associations": {
      "variables": [
        "V3208263018-LPCLOUD",
        "V3208263052-LPCLOUD",
        "V3208263087-LPCLOUD",
        "V3208263133-LPCLOUD",
        "V3208263178-LPCLOUD",
        "V3208263195-LPCLOUD",
        "V3208263215-LPCLOUD",
        "V3208263230-LPCLOUD"
      ],
      "tools": [
        "TL1860232272-LPDAAC_ECS"
      ]
    },
    "s3-links": [
      "s3://lp-prod-protected/ECO_L2T_LSTE.002",
      "s3://lp-prod-public/ECO_L2T_LSTE.002"
    ],
    "has-spatial-subsetting": false,
    "native-id": "ECO_L2T_LSTEV002",
    "has-transforms": false,
    "association-details": {
      "variables": [
        {
          "concept-id": "V3208263018-LPCLOUD"
        },
        {
          "concept-id": "V3208263052-LPCLOUD"
        },
      

In [7]:
collection.keys()

dict_keys(['meta', 'umm'])

In [8]:
collection['umm'].keys()

dict_keys(['TilingIdentificationSystems', 'CollectionCitations', 'AdditionalAttributes', 'SpatialExtent', 'CollectionProgress', 'ScienceKeywords', 'TemporalExtents', 'ProcessingLevel', 'DOI', 'ShortName', 'EntryTitle', 'DirectDistributionInformation', 'AccessConstraints', 'RelatedUrls', 'DataDates', 'Abstract', 'Purpose', 'LocationKeywords', 'MetadataDates', 'VersionDescription', 'Version', 'Projects', 'UseConstraints', 'ContactPersons', 'CollectionDataType', 'DataCenters', 'TemporalKeywords', 'Platforms', 'MetadataSpecification', 'ArchiveAndDistributionInformation'])

In [9]:
collection['umm']['ShortName']

'ECO_L2T_LSTE'

The `DataCollection` class also has some handy helper methods.

```python 
collection.concept_id() # returns the concept-id, used to search for data granules
collection.abstract() # returns the abstract
collection.landing_page() # returns the landing page if present in the UMM fields
collection.get_data() # returns the portal where data can be accessed.
```

The same results can be obtained using the `dict` syntax:

```python
collection["meta"]["concept-id"] # concept-id
collection["umm"]["RelatedUrls"] # URLs, with GET DATA, LANDING PAGE etc
```
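
Because these records behave like dictionaries, nested lookups can be guarded with `.get()` so that a missing field does not raise a `KeyError`. The sketch below uses a small hand-written stand-in for a collection record, not a real CMR response. The `Type` and `URL` keys inside `RelatedUrls` are assumed from the UMM layout, `urls_of_type` is a hypothetical helper, and the landing-page URL is a placeholder.

```python
# Hand-written stand-in shaped like the record above (not a real CMR response).
record = {
    "meta": {"concept-id": "C2076090826-LPCLOUD", "provider-id": "LPCLOUD"},
    "umm": {
        "ShortName": "ECO_L2T_LSTE",
        "Version": "002",
        "RelatedUrls": [
            {"Type": "GET DATA",
             "URL": "https://search.earthdata.nasa.gov/search?q=C2076090826-LPCLOUD"},
            {"Type": "LANDING PAGE", "URL": "https://example.org/landing-page"},  # placeholder
        ],
    },
}


def urls_of_type(record, url_type):
    """Return the URLs of every RelatedUrls entry with the given Type ([] if none)."""
    related = record.get("umm", {}).get("RelatedUrls", [])
    return [e["URL"] for e in related if e.get("Type") == url_type and "URL" in e]


print(record["meta"]["concept-id"])
print(urls_of_type(record, "GET DATA"))
print(urls_of_type(record, "USE SERVICE API"))  # no such entry, so an empty list
```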


Another helpful method is `summary()`. This method prints some of the more common collection metadata information used for additional queries against the individual collection.

In [10]:
collection.summary()

{'short-name': 'ECO_L2T_LSTE',
 'concept-id': 'C2076090826-LPCLOUD',
 'version': '002',
 'file-type': "[{'Format': 'Cloud Optimized GeoTIFF (COG)', 'FormatType': 'Native', 'Media': ['Earthdata Cloud', 'HTTPS'], 'AverageFileSize': 30, 'AverageFileSizeUnit': 'MB', 'TotalCollectionFileSizeBeginDate': '2018-07-09T00:00:00.000Z'}]",
 'get-data': ['https://search.earthdata.nasa.gov/search?q=C2076090826-LPCLOUD',
  'https://appeears.earthdatacloud.nasa.gov/'],
 'cloud-info': {'Region': 'us-west-2',
  'S3BucketAndObjectPrefixNames': ['s3://lp-prod-protected/ECO_L2T_LSTE.002',
   's3://lp-prod-public/ECO_L2T_LSTE.002'],
  'S3CredentialsAPIEndpoint': 'https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials',
  'S3CredentialsAPIDocumentationURL': 'https://data.lpdaac.earthdatacloud.nasa.gov/s3credentialsREADME'}}

The most common way to query for information or metadata related to a collection is to use the **concept-id**, the collection **doi**, or a combination of both the **short-name** and **version**. The `summary()` method gives us the **concept-id**, **short-name**, and **version**. We can use this knowledge to create a list containing this information for later queries against the specific collections.

In [11]:
collections_info = [{n:[c.summary()['short-name'], c.summary()['concept-id'], c.summary()['version']]} for n, c in enumerate(collections_req)]
collections_info

[{0: ['ECO_L2T_LSTE', 'C2076090826-LPCLOUD', '002']},
 {1: ['ECO_L2_LSTE', 'C2076114664-LPCLOUD', '002']},
 {2: ['ECO_L1B_GEO', 'C2076087338-LPCLOUD', '002']},
 {3: ['ECO_L2_CLOUD', 'C2076115306-LPCLOUD', '002']},
 {4: ['ECO_L1CT_RAD', 'C2595678301-LPCLOUD', '002']},
 {5: ['ECO_L4G_ESI', 'C2076110703-LPCLOUD', '002']},
 {6: ['ECO_L4G_ESI_ALEXI', 'C2683457290-LPCLOUD', '002']},
 {7: ['ECO_L4G_WUE', 'C2076109886-LPCLOUD', '002']},
 {8: ['ECO_L4T_ESI', 'C2076104650-LPCLOUD', '002']},
 {9: ['ECO_L4T_ESI_ALEXI', 'C2683463996-LPCLOUD', '002']},
 {10: ['ECO_L4T_WUE', 'C2076102081-LPCLOUD', '002']},
 {11: ['ECO_L3G_ET_ALEXI', 'C2076108728-LPCLOUD', '002']},
 {12: ['ECO_L3G_JET', 'C2076112011-LPCLOUD', '002']},
 {13: ['ECO_L3G_MET', 'C2074897737-LPCLOUD', '002']},
 {14: ['ECO_L3G_SEB', 'C2074855428-LPCLOUD', '002']},
 {15: ['ECO_L3G_SM', 'C2074890845-LPCLOUD', '002']},
 {16: ['ECO_L3T_ET_ALEXI', 'C2076105456-LPCLOUD', '002']},
 {17: ['ECO_L3T_JET', 'C2076106409-LPCLOUD', '002']},
 {18: ['ECO_L3

In [12]:
collections_info[0]

{0: ['ECO_L2T_LSTE', 'C2076090826-LPCLOUD', '002']}
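
A list indexed by position works, but a lookup keyed by short name and version can be easier to use in later queries. The sketch below builds one from summary dictionaries copied from the first three entries above. `index_by_short_name` is a hypothetical helper, not part of `earthaccess`.

```python
# Summaries copied from the first three collections listed above.
summaries = [
    {"short-name": "ECO_L2T_LSTE", "concept-id": "C2076090826-LPCLOUD", "version": "002"},
    {"short-name": "ECO_L2_LSTE", "concept-id": "C2076114664-LPCLOUD", "version": "002"},
    {"short-name": "ECO_L1B_GEO", "concept-id": "C2076087338-LPCLOUD", "version": "002"},
]


def index_by_short_name(summaries):
    """Map (short-name, version) to concept-id (hypothetical helper)."""
    return {(s["short-name"], s["version"]): s["concept-id"] for s in summaries}


lookup = index_by_short_name(summaries)
print(lookup[("ECO_L2T_LSTE", "002")])
```

In the notebook, the input could be built with `[c.summary() for c in collections_req]`, which calls `summary()` once per collection.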

## Querying for data files (granules)

We can use the collection information from above to start finding data files we want to work with. In this example, we will use the **short_name** and the **version** to query for data granules from the [`ECO_L2T_LSTE`](https://doi.org/10.5067/ECOSTRESS/ECO_L2_LSTE.002) version `002` dataset. You can search for data granules using just a **short_name**, but multiple versions of the data collection may then be returned. To query for granules more explicitly, a **concept-id** is the best option.
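
One way to apply that preference is to assemble the search arguments in a small function that uses the concept ID when one is known and falls back to short name and version otherwise. This is only a sketch: `granule_query` is a hypothetical helper, and it assumes `earthaccess.search_data()` accepts a `concept_id` keyword in the same way as the CMR API.

```python
def granule_query(concept_id=None, short_name=None, version=None, **extra):
    """Build keyword arguments for a granule search, preferring concept_id."""
    if concept_id:
        query = {"concept_id": concept_id}
    elif short_name:
        query = {"short_name": short_name}
        if version:
            query["version"] = version
    else:
        raise ValueError("Provide a concept_id or a short_name")
    # Pass through any other search arguments that were actually set
    query.update({k: v for k, v in extra.items() if v is not None})
    return query


query = granule_query(concept_id="C2076090826-LPCLOUD", count=100)
print(query)
# With earthaccess imported and network access available:
# granules = earthaccess.search_data(**query)
```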

In [13]:
# We build our query
granules_request = earthaccess.search_data(
    short_name='ECO_L2T_LSTE',
    version='002',
    provider='LPCLOUD',
    count=100
)

In [14]:
print(f'granules_request is a {type(granules_request)} of {type(granules_request[0])}')

granules_request is a <class 'list'> of <class 'earthaccess.results.DataGranule'>


Again, our query for granules has returned a list of Python dictionaries (`earthaccess.results.DataGranule`). We can therefore access all the keys and values as we usually do with Python dictionaries.

This query returned a lot of granules. Let's refine our results using **bounding box** and **temporal** constraints.  

### Spatiotemporal queries

The `earthaccess.search_data()` and `earthaccess.search_datasets()` functions accept the same spatial and temporal arguments as CMR, so we can search for granules that match spatiotemporal criteria.

#### Specify Spatial Parameters

Search queries can be refined by using spatial parameters. `earthaccess` accepts point and area arguments. For point features, a longitude and latitude coordinate pair must be passed as a tuple to the `point` parameter. For example:

```python
point=(-105.64788824641289,39.98286247719818)
```

For area features, queries can leverage the `bounding_box` -- a tuple containing coordinates in the order lower_left_lon, lower_left_lat, upper_right_lon, upper_right_lat -- or `polygon` -- a list of (lon, lat) tuples -- parameters. We'll use the `bounding_box` parameter in this example.
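
To make the coordinate order concrete, the sketch below turns a bounding box into an equivalent list of (lon, lat) tuples. It closes the ring and orders the points counter-clockwise, which is the convention the CMR polygon parameter expects. `bbox_to_polygon` is a hypothetical helper, the coordinates are illustrative, and the box is assumed not to cross the antimeridian.

```python
def bbox_to_polygon(bbox):
    """Convert (west, south, east, north) into a closed, counter-clockwise ring."""
    west, south, east, north = bbox
    if not (-180 <= west <= east <= 180):
        raise ValueError("expected -180 <= west <= east <= 180")
    if not (-90 <= south <= north <= 90):
        raise ValueError("expected -90 <= south <= north <= 90")
    # Lower-left, lower-right, upper-right, upper-left, back to the start
    return [(west, south), (east, south), (east, north), (west, north), (west, south)]


ring = bbox_to_polygon((-105.6, 40.0, -105.5, 40.1))  # illustrative coordinates
print(ring)
```

Swapping latitude and longitude is a common mistake, and the range checks catch the cases where a longitude such as -105.6 is passed in a latitude slot.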

**Reading a geojson file**

In [15]:
geojson = gp.read_file('../../data/NIWO_box.geojson')

In [16]:
geojson_plot = geojson.hvplot(tiles='ESRI', color='yellow', alpha=0.5, crs='EPSG:4326')
geojson_plot

**Reading a shapefile**

In [17]:
shp = gp.read_file('../../data/NIWO_ShrubDensity.shp')

In [18]:
shp_plot = shp.hvplot(tiles='ESRI', color='blue', alpha=0.5, crs='EPSG:4326')
geojson_plot * shp_plot

We can get the bounding box that encompasses all of the features using the `total_bounds` attribute and pass it to the `bounding_box` parameter in our search query.

In [19]:
bbox = tuple(list(shp.total_bounds))
bbox

(np.float64(-105.58650854045227),
 np.float64(40.05184049311418),
 np.float64(-105.5832945453711),
 np.float64(40.05424331298521))
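
The values above are NumPy scalars, which is why they print as `np.float64(...)`. If plain Python floats are preferred, for example when logging or serializing the query, they can be converted first. The sketch below also allows an optional padding in degrees, since this shape is very small. `to_plain_bbox` is a hypothetical helper, and the input list repeats the bounds printed above.

```python
def to_plain_bbox(bounds, pad=0.0):
    """Return (west, south, east, north) as built-in floats, padded by `pad` degrees."""
    west, south, east, north = (float(v) for v in bounds)
    return (west - pad, south - pad, east + pad, north + pad)


# Bounds copied from the output above
bounds = [-105.58650854045227, 40.05184049311418, -105.5832945453711, 40.05424331298521]
bbox_plain = to_plain_bbox(bounds)
bbox_padded = to_plain_bbox(bounds, pad=0.01)
print(bbox_plain)
```

In the notebook, the same helper could be called as `to_plain_bbox(shp.total_bounds)`.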

#### Specify Temporal Parameters  

To specify the dates we are interested in, we pass a tuple containing the start and end date in the form of yyyy-mm-dd.  

In [20]:
date = ('2023-05-01','2023-09-30')    # tuple containing the start and end date
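
A mistyped date or a reversed range is easier to catch before the query is sent. The sketch below validates two yyyy-mm-dd strings using only the standard library. `temporal_range` is a hypothetical helper, not part of `earthaccess`.

```python
import datetime as dt


def temporal_range(start, end):
    """Validate two yyyy-mm-dd strings and return them as a (start, end) tuple."""
    start_date = dt.date.fromisoformat(start)  # raises ValueError on a bad date
    end_date = dt.date.fromisoformat(end)
    if start_date > end_date:
        raise ValueError(f"start {start} is after end {end}")
    return (start_date.isoformat(), end_date.isoformat())


print(temporal_range("2023-05-01", "2023-09-30"))
```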

In [21]:
granules_request = earthaccess.search_data(
    short_name='ECO_L2T_LSTE',
    version='002',
    provider='LPCLOUD',
    bounding_box=bbox,
    temporal=date,
    count=100
)

If we wanted to perform a query for a point location, we would only need to swap out the **bounding_box** parameter for the **point** parameter. The input for the **point** parameter is a tuple containing a longitude and latitude coordinate pair. For example:

```python
point=(-105.58650854045227,40.05184049311418)
```

Let's look at the first granule from our bounding box query.  

In [22]:
granule = granules_request[0]

In [23]:
granule.keys()

dict_keys(['meta', 'umm', 'size'])

In [24]:
granule['umm'].keys()

dict_keys(['TemporalExtent', 'OrbitCalculatedSpatialDomains', 'GranuleUR', 'AdditionalAttributes', 'MeasuredParameters', 'SpatialExtent', 'ProviderDates', 'CollectionReference', 'PGEVersionClass', 'RelatedUrls', 'DataGranule', 'Platforms', 'MetadataSpecification'])

The `DataGranule` class also has several convenience methods. The `data_links()` method can extract all of the data links associated with each granule.  

In [25]:
granule.data_links()

['https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_27348_005_13TDE_20230503T075020_0711_01/ECOv002_L2T_LSTE_27348_005_13TDE_20230503T075020_0711_01_water.tif',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_27348_005_13TDE_20230503T075020_0711_01/ECOv002_L2T_LSTE_27348_005_13TDE_20230503T075020_0711_01_cloud.tif',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_27348_005_13TDE_20230503T075020_0711_01/ECOv002_L2T_LSTE_27348_005_13TDE_20230503T075020_0711_01_view_zenith.tif',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_27348_005_13TDE_20230503T075020_0711_01/ECOv002_L2T_LSTE_27348_005_13TDE_20230503T075020_0711_01_height.tif',
 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_27348_005_13TDE_20230503T075020_0711_01/ECOv002_L2T_LSTE_27348_005_

Granules for **ECO_L2T_LSTE** are made up of multiple files. This is the case for several of the `LPCLOUD` provider's collections. Other collections may have only a single file per granule.
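
When a granule has many files, it can help to key the links by layer name. In the links above, each file name is the granule directory name followed by an underscore and the layer name. The sketch below relies on that pattern, which is inferred from this output rather than taken from a documented rule. `layer_name` is a hypothetical helper, and the links are rebuilt from the ones shown above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def layer_name(url):
    """Extract the layer name, assuming <granule_id>/<granule_id>_<layer>.tif."""
    path = PurePosixPath(urlparse(url).path)
    prefix = path.parent.name + "_"  # granule ID taken from the directory name
    stem = path.stem                 # file name without the .tif extension
    return stem[len(prefix):] if stem.startswith(prefix) else stem


base = "https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/"
granule_id = "ECOv002_L2T_LSTE_27348_005_13TDE_20230503T075020_0711_01"
links = [f"{base}{granule_id}/{granule_id}_{layer}.tif"
         for layer in ("water", "cloud", "view_zenith", "height")]

by_layer = {layer_name(url): url for url in links}
print(sorted(by_layer))
```

Taking the granule ID from the directory name keeps multi-word layers such as `view_zenith` intact, which splitting on the last underscore would not.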

#### Printing data granules  

Since we are in a notebook, we can use the built-in `display` function to see a more user-friendly version of the granules. This renders a browse image for each granule, if one is available, and will eventually have a representation similar to the one in the Earthdata Search client.

In [26]:
# Display the first two granules
for g in granules_request[0:2]:
    display(g)

## Working with `earthaccess` Data Links  

With `earthaccess`, the same API call retrieves files whether they are hosted on-prem or in the cloud. One important consideration is that to access cloud-hosted data directly (via `s3://` links), the code must run in the cloud. This is because some S3 buckets only allow direct access to requesters in the same AWS region, `us-west-2`.
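
The HTTPS links and the S3 bucket prefixes shown earlier share the same path, so an `s3://` URI can be derived from an HTTPS link by dropping the host name. The sketch below does that for LP DAAC links only. The mapping is an assumption inferred from the output in this notebook, and `https_to_s3` is a hypothetical helper rather than an `earthaccess` function.

```python
from urllib.parse import urlparse

LPDAAC_HTTPS_HOST = "data.lpdaac.earthdatacloud.nasa.gov"


def https_to_s3(url):
    """Derive an s3:// URI from an LP DAAC HTTPS data link (assumed mapping)."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.netloc != LPDAAC_HTTPS_HOST:
        raise ValueError(f"not an LP DAAC HTTPS data link: {url}")
    # The first path component is the bucket name, e.g. lp-prod-protected
    return "s3://" + parsed.path.lstrip("/")


https_link = (
    "https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/"
    "ECOv002_L2T_LSTE_27348_005_13TDE_20230503T075020_0711_01/"
    "ECOv002_L2T_LSTE_27348_005_13TDE_20230503T075020_0711_01_water.tif"
)
print(https_to_s3(https_link))
```

The resulting URI is only usable from compute running in `us-west-2` with temporary S3 credentials. Elsewhere, keep using the HTTPS links.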

### Streaming Data - Reading a Cloud Optimized GeoTIFF (COG) File  

Currently, `earthaccess` doesn't have any helper methods to aid in reading COG files. Fortunately, the `rioxarray` library has great support for reading COG files. NASA, however, requires authentication when accessing NASA data. Below, we create a runtime context (**rio_env**) that will pass along our Earthdata Login credentials from a **.netrc** file stored in our home directory.  

In [27]:
import rioxarray as rxr
import rasterio as rio
import hvplot
import hvplot.xarray

rio_env = rio.Env(GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR',
                  GDAL_HTTP_COOKIEFILE=os.path.expanduser('~/cookies.txt'),
                  GDAL_HTTP_COOKIEJAR=os.path.expanduser('~/cookies.txt'))
rio_env.__enter__()

<rasterio.env.Env at 0x7f6d399a6710>
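
Calling `__enter__()` directly keeps the environment active for the rest of the session, which is convenient in a notebook. In a script, a `with` block scopes the same settings to the code that needs them. The sketch below collects the options in a dictionary so they can be reused either way. `gdal_http_options` is a hypothetical helper, and the commented usage assumes `rio` and `rxr` are imported as in the cell above.

```python
import os


def gdal_http_options(cookie_path="~/cookies.txt"):
    """Return the GDAL options used above, with the cookie path expanded."""
    cookie_file = os.path.expanduser(cookie_path)
    return {
        "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",  # do not list sibling files
        "GDAL_HTTP_COOKIEFILE": cookie_file,          # read session cookies from here
        "GDAL_HTTP_COOKIEJAR": cookie_file,           # write session cookies back here
    }


options = gdal_http_options()
print(options)
# Scoped usage (needs rasterio, rioxarray, and a data URL):
# with rio.Env(**options):
#     da = rxr.open_rasterio(url)
```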

ECO_L2T_LSTE version 2 granules contain multiple files each, in this case multiple Cloud Optimized GeoTIFF (COG) files. Here we are only interested in the **LST** (land surface temperature) files. In an analysis, we would also need to consult the other associated files for quality control. We'll use a Python list comprehension to return only the data files that contain **LST.tif** in their file name.

In [28]:
lst_links = [l for dl in granules_request for l in dl.data_links() if 'LST.tif' in l]
lst_links[0]

'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_27348_005_13TDE_20230503T075020_0711_01/ECOv002_L2T_LSTE_27348_005_13TDE_20230503T075020_0711_01_LST.tif'

Now we'll use `rioxarray` to read the first file in our list. `rioxarray` reads the COG file in as an `xarray` `DataArray`. Normally, when a COG file is read in using `rioxarray`, a **band** coordinate variable is created. In most circumstances this coordinate variable is not needed, so we use the `squeeze` method to remove **band** from our object.

In [29]:
lst_da = rxr.open_rasterio(filename=lst_links[0]).squeeze('band', drop=True)
lst_da

We can do a cursory plot of the data using `hvplot`

In [30]:
size_opts = dict(frame_height=405, frame_width=720, fontscale=2)

lst_da.rio.reproject('EPSG:4326').hvplot.image(x='x', y='y', **size_opts, cmap='inferno', tiles='ESRI', crs='EPSG:4326', rasterize=True) * shp.hvplot(color = '#FF000000', crs='EPSG:4326')

The shape is very small, so you'll have to zoom in to see where it intersects the scene, somewhat near the center.

In [31]:
earthaccess.download(lst_links[0], local_path='../../data')

QUEUEING TASKS | :   0%|          | 0/1 [00:00<?, ?it/s]

PROCESSING TASKS | :   0%|          | 0/1 [00:00<?, ?it/s]

COLLECTING RESULTS | :   0%|          | 0/1 [00:00<?, ?it/s]

['../../data/ECOv002_L2T_LSTE_27348_005_13TDE_20230503T075020_0711_01_LST.tif']

## Contact Information  

**Authors:**  LP DAAC¹  
**Contact:** LPDAAC@usgs.gov  
**Voice:** +1-866-573-3222  
**Organization:** Land Processes Distributed Active Archive Center (LP DAAC)  
**Website:** [https://lpdaac.usgs.gov/](https://lpdaac.usgs.gov/)  

¹Work performed under USGS contract G15PD00467 for LP DAAC under NASA contract NNG14HH33I.  