# Obtain a List of S3 URLs for a GES DISC Collection Using Python
### Authors: Chris Battisto, Alexis Hunzinger
### Date Authored: 1-31-22
### Date Updated: 11-12-24

### Timing

Exercise: 15 minutes

### Overview

This notebook demonstrates how to obtain a list of S3 URLs for cloud-hosted GES DISC granules using the Python libraries `earthaccess` and `python-cmr`.

Two methods are shown, one with `earthaccess` and one with `python-cmr`. Which one is best for you?
- `python-cmr`:
    -  Aids only in **searching** for data, without direct data access capabilities.
    -  Provides methods to preemptively check for invalid input and handle URL encoding required by the [Common Metadata Repository (CMR) API](https://cmr.earthdata.nasa.gov/search/).
    -  Allows extensive customization by accepting all CMR API parameters.
- `earthaccess`:
    - Designed for **searching, downloading or streaming** NASA Earth science data with minimal code.
    - Makes querying the CMR API intuitive and less error-prone, but does not allow for customized querying.
    - Includes functions to access data, optimizing the data source based on your computing environment (cloud or local).
    - Continually under development, with incomplete documentation.

### Prerequisites

This notebook was written using Python 3.8 and requires these libraries:
- earthaccess
- python-cmr
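
Before running the cells, it can help to confirm that both packages are importable. The sketch below is a minimal check, assuming the import names `earthaccess` and `cmr` (the `python-cmr` package is imported as `cmr` later in this notebook). `missing_modules` is a hypothetical helper name and is not part of either library.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names in module_names that cannot be located for import."""
    return [name for name in module_names if find_spec(name) is None]


# `cmr` is the import name used for the python-cmr package.
missing = missing_modules(["earthaccess", "cmr"])
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    print("All required libraries are available.")
```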

Identify your data collection of interest and acquire its shortname or concept ID.
- These can be found on the collection's Dataset Landing Page on the GES DISC website. For example, this is the Dataset Landing Page for Hourly MERRA-2 SLV: https://disc.gsfc.nasa.gov/datasets/M2T1NXSLV_5.12.4/summary  
- Hover your mouse over the "Cloud Enabled" badge to find the collection's Concept ID, or find the collection's shortname in the Product Summary tab.

![](../../images/GESDISC-DSL-Shortname-ConceptID.png)


## Option 1: `earthaccess`

### Import libraries

In [1]:
import earthaccess

### Search for granules using the `search_data()` function

It is helpful to know some identifying information about the collection you're interested in. The table below lists some data collection identifiers and corresponding `earthaccess` parameters.
| Data Collection Identifier | Parameter |
| -------- | ------- |
| Concept ID | `concept_id` |
| Dataset shortname | `short_name` |
| DOI    | `doi`    |
| Version| `version` |

Further customize your search with spatial and temporal bounds (e.g. `bounding_box`, `temporal`). Read more about the `search_data()` function on the [earthaccess Read the Docs](https://earthaccess.readthedocs.io/en/latest/user-reference/api/api/#earthaccess.api.search_data) page.
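
The identifiers and bounds described above can be collected into one dictionary of keyword arguments before searching. The sketch below is one way to do that, assuming the bounding box is ordered as (west, south, east, north), which matches the `(-10,20,10,50)` example in this notebook. `build_search_kwargs` is a hypothetical helper, not an `earthaccess` function, and it makes no network request.

```python
def build_search_kwargs(short_name=None, concept_id=None, version=None,
                        bounding_box=None, temporal=None):
    """Assemble keyword arguments for a granule search (hypothetical helper)."""
    if short_name is None and concept_id is None:
        raise ValueError("Provide a short_name or a concept_id.")
    if bounding_box is not None:
        # Assumed order: (west, south, east, north), in degrees.
        west, south, east, north = bounding_box
        if not (-90 <= south <= north <= 90):
            raise ValueError(
                "Expected (west, south, east, north) with -90 <= south <= north <= 90."
            )
    candidates = {
        "short_name": short_name,
        "concept_id": concept_id,
        "version": version,
        "bounding_box": bounding_box,
        "temporal": temporal,
    }
    # Keep only the parameters that were actually supplied.
    return {key: value for key, value in candidates.items() if value is not None}


search_kwargs = build_search_kwargs(
    short_name="M2T1NXSLV",
    version="5.12.4",
    bounding_box=(-10, 20, 10, 50),
    temporal=("2022-09-25", "2022-09-27"),
)
print(search_kwargs)
# The dictionary can then be unpacked into the search call:
# granules = earthaccess.search_data(cloud_hosted=True, **search_kwargs)
```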

In [22]:
short_name = 'M2T1NXSLV'
version = '5.12.4'
start_time = '2022-09-25'
end_time = '2022-09-27'

granules = earthaccess.search_data(
    short_name = short_name,
    version = version,
    cloud_hosted = True,
    bounding_box = (-10,20,10,50),
    temporal = (start_time,end_time),
)

Granules found: 3


### Identify the S3 URL from each granule response and save to a list

To ensure that each data link in the search response is an S3 URL, specify `access="direct"`, which refers to direct S3 access. Because `data_links()` returns a list of links for each granule, `s3_urls` is a list of lists.

In [19]:
s3_urls = [granule.data_links(access="direct") for granule in granules]
s3_urls

[['s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2022/09/MERRA2_400.tavg1_2d_slv_Nx.20220925.nc4'],
 ['s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2022/09/MERRA2_400.tavg1_2d_slv_Nx.20220926.nc4'],
 ['s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2022/09/MERRA2_400.tavg1_2d_slv_Nx.20220927.nc4']]
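
The result above holds one inner list per granule. If a single flat list of URLs is preferred, the inner lists can be chained together. The sketch below copies the three URLs from the output above into a variable named `nested_urls` (a name chosen here for illustration) so that it runs without a search request.

```python
from itertools import chain

# Copied from the output above: one inner list per granule.
nested_urls = [
    ['s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2022/09/MERRA2_400.tavg1_2d_slv_Nx.20220925.nc4'],
    ['s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2022/09/MERRA2_400.tavg1_2d_slv_Nx.20220926.nc4'],
    ['s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2022/09/MERRA2_400.tavg1_2d_slv_Nx.20220927.nc4'],
]

# Join the inner lists into one flat list of URL strings.
flat_s3_urls = list(chain.from_iterable(nested_urls))
print(flat_s3_urls)
```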

***
## Option 2: `python-cmr`

### Import libraries
The `python-cmr` package provides several query classes that aid in searching the CMR catalog. Here we will only use the `GranuleQuery` class.

In [12]:
from cmr import GranuleQuery

### Search for granules using the `GranuleQuery()` class

It is helpful to know some identifying information about the collection you're interested in. The table below lists some data collection identifiers and corresponding `python-cmr` parameters. 
| Data Collection Identifier | Parameter |
| -------- | ------- |
| Concept ID | `concept_id` |
| Dataset shortname | `short_name` |
| Version| `version` |

Further customize your search with spatial and temporal bounds (e.g. `bounding_box`, `point`, `polygon`, `temporal`). Read more about `GranuleQuery()` and the other query classes on the [python-cmr README](https://github.com/nasa/python_cmr/blob/develop/README.md).

In [20]:
short_name = 'M2T1NXSLV'
version = '5.12.4'
start_time = '2022-09-25T00:00:00Z'
end_time = '2022-09-27T00:00:00Z'

api = GranuleQuery()
granules = api.short_name(short_name).version(version).temporal(start_time,end_time).get()
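
The cell above writes its time bounds as full UTC timestamps, while Option 1 used plain dates. If you start from dates such as `'2022-09-25'`, the sketch below converts them to the timestamp form shown above, assuming midnight UTC is the intended time of day. `to_cmr_timestamp` is a hypothetical helper, not part of `python-cmr`.

```python
from datetime import datetime, timezone


def to_cmr_timestamp(date_string):
    """Convert 'YYYY-MM-DD' to 'YYYY-MM-DDT00:00:00Z' (hypothetical helper)."""
    # Interpret the date as midnight UTC (an assumption of this sketch).
    parsed = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


start_time = to_cmr_timestamp("2022-09-25")
end_time = to_cmr_timestamp("2022-09-27")
print(start_time, end_time)
# prints: 2022-09-25T00:00:00Z 2022-09-27T00:00:00Z
```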

### Identify the S3 URL from each granule response and save to a list

In [21]:
s3_urls = []
for granule in granules:
    for link in granule.get('links',[]):
        # Keep only links that have a relation and a URL and are not inherited from the collection
        if 'rel' in link and 'href' in link and 'inherited' not in link:
            # The 's3#' relation marks a direct S3 link
            if 'http://esipfed.org/ns/fedsearch/1.1/s3#' in link['rel']:
                s3_urls.append(link['href'])

s3_urls

['s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2022/09/MERRA2_400.tavg1_2d_slv_Nx.20220925.nc4',
 's3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2022/09/MERRA2_400.tavg1_2d_slv_Nx.20220926.nc4',
 's3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2022/09/MERRA2_400.tavg1_2d_slv_Nx.20220927.nc4']
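
To see how the link filter above behaves, it can be wrapped in a function and run on a small hand-made record. The sketch below is illustrative only: `extract_s3_urls` is a hypothetical helper, and `sample_granules` is a made-up stand-in for a search response that contains one S3 URL from the output above plus two placeholder links that should be filtered out.

```python
S3_REL = "http://esipfed.org/ns/fedsearch/1.1/s3#"


def extract_s3_urls(granules):
    """Collect the S3 links from granule records, skipping inherited links."""
    urls = []
    for granule in granules:
        for link in granule.get("links", []):
            if "inherited" in link:
                continue  # not specific to this granule
            if S3_REL in link.get("rel", "") and "href" in link:
                urls.append(link["href"])
    return urls


# Hand-made sample record (not real search output).
sample_granules = [
    {
        "links": [
            # Placeholder HTTPS link: its relation is not the S3 one.
            {"rel": "http://esipfed.org/ns/fedsearch/1.1/data#",
             "href": "https://example.com/placeholder.nc4"},
            # S3 link copied from the output above.
            {"rel": S3_REL,
             "href": "s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2022/09/MERRA2_400.tavg1_2d_slv_Nx.20220925.nc4"},
            # Placeholder inherited link: skipped by the filter.
            {"rel": S3_REL,
             "href": "s3://example-bucket/placeholder",
             "inherited": True},
        ]
    }
]

print(extract_s3_urls(sample_granules))
```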