# Obtain a List of S3 URLs for a GES DISC Collection Using Python
### Authors: Chris Battisto, Alexis Hunzinger
### Date Authored: 1-31-22
### Date Updated: 11-6-24

### Timing

Exercise: 15 minutes

### Overview

This notebook demonstrates how to obtain a list of S3 URLs for cloud-hosted GES DISC granules using the Python libraries earthaccess and python-cmr.

### Prerequisites

This notebook was written using Python 3.8 and requires these libraries:
- earthaccess
- python-cmr

Identify your data collection of interest and acquire its shortname or concept ID.
- These can be found on the collection's Dataset Landing Page on the GES DISC website. For example, this is the Dataset Landing Page for Hourly MERRA-2 SLV: https://disc.gsfc.nasa.gov/datasets/M2T1NXSLV_5.12.4/summary  
- Hover your mouse over the "Cloud Enabled" badge to find the collection's Concept ID, or find the collection's shortname in the Product Summary tab.

![](../../images/GESDISC-DSL-Shortname-ConceptID.png)
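The landing-page URL in the example above embeds both the shortname and the version. As a minimal sketch, assuming landing-page URLs follow the `.../datasets/<shortname>_<version>/summary` pattern seen in the MERRA-2 example (the helper name `parse_landing_page_url` is hypothetical and not part of any GES DISC tool), the two values can be split out like this:

```python
from urllib.parse import urlparse

def parse_landing_page_url(url):
    """Split a GES DISC landing-page URL into (shortname, version).

    Assumes the path looks like /datasets/<shortname>_<version>/summary.
    """
    parts = [p for p in urlparse(url).path.split('/') if p]
    # The segment after 'datasets' holds '<shortname>_<version>'
    dataset_id = parts[parts.index('datasets') + 1]
    # Split on the LAST underscore, in case a shortname itself contains underscores
    short_name, version = dataset_id.rsplit('_', 1)
    return short_name, version

short_name, version = parse_landing_page_url(
    'https://disc.gsfc.nasa.gov/datasets/M2T1NXSLV_5.12.4/summary'
)
print(short_name, version)  # M2T1NXSLV 5.12.4
```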


# Option 1: `earthaccess`

## Import libraries

In [1]:
import earthaccess

## Search for granules using the earthaccess function, `search_data()`

It is helpful to know some identifying information about the collection you're interested in. The table below lists some data collection identifiers and the corresponding `earthaccess` parameters.

| Identifier | Parameter |
| -------- | ------- |
| Concept ID | `concept_id` |
| Dataset shortname | `short_name` |
| DOI    | `doi`    |
| Version| `version` |

Further customize your search with spatial and temporal bounds (e.g. the `bounding_box` and `temporal` parameters). Read more about the earthaccess package on its [Read the Docs](https://earthaccess.readthedocs.io/en/latest/) page.
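The identifiers in the table map directly onto keyword arguments. The sketch below assembles them into a dictionary that could be unpacked into `earthaccess.search_data(**kwargs)`. The helper `build_search_kwargs` is a hypothetical convenience for illustration, not part of earthaccess, and it only builds the arguments without contacting CMR.

```python
def build_search_kwargs(short_name=None, version=None, concept_id=None, doi=None,
                        temporal=None, bounding_box=None, count=100):
    """Collect only the search parameters that were actually provided."""
    if not (short_name or concept_id or doi):
        raise ValueError('Provide a short_name, concept_id, or doi')
    candidates = {
        'short_name': short_name,
        'version': version,
        'concept_id': concept_id,
        'doi': doi,
        'temporal': temporal,          # (start, end) date strings
        'bounding_box': bounding_box,  # (west, south, east, north) in degrees
    }
    # Drop anything left as None so it is not sent as a search filter
    kwargs = {key: value for key, value in candidates.items() if value is not None}
    kwargs['count'] = count
    return kwargs

kwargs = build_search_kwargs(
    short_name='M2T1NXSLV',
    version='5.12.4',
    temporal=('2022-09-25', '2022-09-27'),
    bounding_box=(-10, 20, 10, 50),
)
print(kwargs)
# results = earthaccess.search_data(**kwargs)  # requires `import earthaccess`
```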




In [9]:
short_name = 'M2T1NXSLV'
version = '5.12.4'
#concept_id = 'C1276812863-GES_DISC'
start_time = '2022-09-25'
end_time = '2022-09-27'

results = earthaccess.search_data(
    short_name = short_name,
    version = version,
    cloud_hosted = True,
    #bounding_box = (-10,20,10,50),
    temporal = (start_time,end_time),
    count = 100
)

Granules found: 3


In [13]:
# Printing `results` does not show the S3 URLs explicitly.
# Each result is a DataGranule; data_links(access="direct") returns its direct-access (S3) links.
s3_urls = [link for granule in results for link in granule.data_links(access="direct")]
s3_urls

# Option 2: `python_cmr`

## Import libraries
The python_cmr package contains many functions that aid in searching the CMR Catalog. See installation instructions and examples in the [python_cmr README](https://github.com/nasa/python_cmr/tree/develop).

Here we will use the `GranuleQuery` class.

In [2]:
from cmr import GranuleQuery

## Search for granules using the python_cmr class, `GranuleQuery`

It is helpful to know some identifying information about the collection and granules you're interested in, including the dataset shortname and version. Further customize your search with spatial and temporal bounds (e.g. the `bounding_box` and `temporal` methods).

In [12]:
short_name = 'M2T1NXSLV'
version = '5.12.4'
#concept_id = 'C1276812863-GES_DISC'
start_time = '2022-09-25T00:00:00Z'
end_time = '2022-09-27T00:00:00Z'

api = GranuleQuery()
granules = api.short_name(short_name).version(version).temporal(start_time,end_time).get()

s3_urls = []
for granule in granules:
    for link in granule.get('links',[]):
        if 'rel' in link and 'href' in link and 'inherited' not in link:
            if 'http://esipfed.org/ns/fedsearch/1.1/s3#' in link['rel']: # It's an s3 url
                s3_urls.append(link['href'])

s3_urls

['s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2022/09/MERRA2_400.tavg1_2d_slv_Nx.20220925.nc4',
 's3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2022/09/MERRA2_400.tavg1_2d_slv_Nx.20220926.nc4',
 's3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2022/09/MERRA2_400.tavg1_2d_slv_Nx.20220927.nc4']
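The loop above relies on the shape of each granule's `links` list. The sketch below wraps the same filter in a function and runs it on a hand-written sample granule. The sample dictionary, its `example.com` link, and the `example-bucket` names are made up for illustration; only the `rel`, `href`, and `inherited` keys used by the loop above are assumed.

```python
S3_REL = 'http://esipfed.org/ns/fedsearch/1.1/s3#'

def extract_s3_urls(granules):
    """Return the href of every non-inherited link whose rel marks it as S3."""
    urls = []
    for granule in granules:
        for link in granule.get('links', []):
            if 'inherited' in link:
                continue  # skip links flagged as inherited, as the loop above does
            if S3_REL in link.get('rel', '') and 'href' in link:
                urls.append(link['href'])
    return urls

# Made-up granule with one HTTPS link, one S3 link, and one inherited S3 link
sample_granules = [{
    'title': 'sample granule',
    'links': [
        {'rel': 'http://esipfed.org/ns/fedsearch/1.1/data#',
         'href': 'https://example.com/sample.nc4'},
        {'rel': S3_REL, 'href': 's3://example-bucket/sample.nc4'},
        {'rel': S3_REL, 'href': 's3://example-bucket/collection-level', 'inherited': True},
    ],
}]

print(extract_s3_urls(sample_granules))  # ['s3://example-bucket/sample.nc4']
```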

### Search CMR with your desired data collection's shortname or concept ID and desired date range

Using the collection's shortname or concept ID, we can obtain individual granules by querying https://cmr.earthdata.nasa.gov/search/granules. By requesting a JSON response for the granules we want, we can read each granule's S3 URL from its list of links.

Here, we will parse out the S3 URLs for the hourly MERRA-2 SLV collection.
- **Shortname**: M2T1NXSLV
- **Concept ID**: C1276812863-GES_DISC

Our desired date range is September 25, 2022 00:00:00 to September 27, 2022 00:00:00.
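CMR takes the date range as a single `start,end` string of ISO 8601 timestamps. A minimal sketch of building that string from `datetime` objects follows; `cmr_temporal` is a hypothetical helper name, and it assumes timezone-aware datetimes are passed in.

```python
from datetime import datetime, timezone

def cmr_temporal(start, end):
    """Format two timezone-aware datetimes as a 'start,end' temporal string."""
    if end < start:
        raise ValueError('end must not be earlier than start')
    fmt = '%Y-%m-%dT%H:%M:%SZ'
    # Convert to UTC first so the trailing 'Z' is accurate
    return ','.join(d.astimezone(timezone.utc).strftime(fmt) for d in (start, end))

temporal = cmr_temporal(
    datetime(2022, 9, 25, tzinfo=timezone.utc),
    datetime(2022, 9, 27, tzinfo=timezone.utc),
)
print(temporal)  # 2022-09-25T00:00:00Z,2022-09-27T00:00:00Z
```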

In [16]:
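import requests

url = 'https://cmr.earthdata.nasa.gov/search/granules'

short_name = 'M2T1NXSLV'
concept_id = 'C1276812863-GES_DISC'

start_time = '2022-09-25T00:00:00Z'
end_time = '2022-09-27T00:00:00Z'

# Run only the option you need; each one assigns `response`,
# so running both keeps the OPTION 2 result.

# OPTION 1: Using shortname
response = requests.get(url,
                        params={
                            'short_name': short_name,
                            'temporal': start_time+','+end_time,
                            'page_size': 200
                        },
                        headers={'Accept': 'application/json'})

# OPTION 2: Using concept ID
response = requests.get(url,
                        params={
                            'concept_id': concept_id,
                            'temporal': start_time+','+end_time,
                            'page_size': 200
                        },
                        headers={'Accept': 'application/json'})

# The JSON response lists one entry per granule
granules = response.json()['feed']['entry']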
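The request above asks for up to 200 granules in a single page. For longer date ranges the results may span several pages; CMR accepts a `page_num` parameter alongside `page_size`. The sketch below loops over pages until one comes back short. The `fetch_page` callable and the fake result set are hypothetical stand-ins so the loop can be shown without a network call.

```python
def fetch_all_entries(fetch_page, page_size=200, max_pages=50):
    """Call fetch_page(page_num, page_size) until a page has fewer than page_size entries."""
    entries = []
    for page_num in range(1, max_pages + 1):  # page numbers start at 1
        page = fetch_page(page_num, page_size)
        entries.extend(page)
        if len(page) < page_size:
            break  # a short (or empty) page means nothing is left
    return entries

# Stand-in for a real request: serves slices of a fake 5-entry result set
fake_results = [{'id': i} for i in range(5)]

def fake_fetch_page(page_num, page_size):
    start = (page_num - 1) * page_size
    return fake_results[start:start + page_size]

all_entries = fetch_all_entries(fake_fetch_page, page_size=2)
print(len(all_entries))  # 5
```

In practice, `fetch_page` would send the same search parameters with `page_num` and `page_size` added and return the `['feed']['entry']` list from the JSON response.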

### Identify the S3 URL from each granule response and save to a list

Now you can create an empty list and fill it with the S3 URL contained in each granule's response. 

In [17]:
s3_urls = []
for granule in granules:
    s3_urls.append(next((item['href'] for item in granule['links'] if item["href"].startswith("s3://")), None))
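Because `next(..., None)` falls back to `None`, a granule without an S3 link leaves a `None` entry in the list. A small sketch of that behavior and one way to drop such entries, using made-up granule dictionaries (the helper name `first_s3_url` is hypothetical):

```python
def first_s3_url(granule):
    """Return the first link starting with 's3://', or None if the granule has none."""
    return next(
        (item['href'] for item in granule.get('links', [])
         if item.get('href', '').startswith('s3://')),
        None,
    )

sample_granules = [
    {'links': [{'href': 'https://example.com/a.nc4'}, {'href': 's3://example-bucket/a.nc4'}]},
    {'links': [{'href': 'https://example.com/b.nc4'}]},  # no S3 link
]

urls = [first_s3_url(g) for g in sample_granules]
print(urls)  # ['s3://example-bucket/a.nc4', None]

# Keep only the granules that actually had an S3 link
s3_only = [u for u in urls if u is not None]
print(s3_only)  # ['s3://example-bucket/a.nc4']
```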

### View your list of S3 URLs

In [18]:
s3_urls

['s3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2022/09/MERRA2_400.tavg1_2d_slv_Nx.20220925.nc4',
 's3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2022/09/MERRA2_400.tavg1_2d_slv_Nx.20220926.nc4',
 's3://gesdisc-cumulus-prod-protected/MERRA2/M2T1NXSLV.5.12.4/2022/09/MERRA2_400.tavg1_2d_slv_Nx.20220927.nc4']