# In-Cloud Data Access Clinic (June 2024)

<div style="background:#fc9090;border:1px solid #cccccc;padding:5px 10px;"><big><b>Warning:  </b>Because this notebook uses the S3 protocol, <em><strong>it will only run in an environment with <a href="https://disc.gsfc.nasa.gov/information/glossary?title=AWS%20region">us-west-2 AWS access</a></strong></em>.</big></div>

### Overview

This notebook walks through searching, accessing, subsetting, and averaging GPM_3IMERGHH granules using Python and in-cloud methods. It demonstrates how to search for S3 URLs using the `python_cmr` and `earthaccess` libraries, how to open them with the `earthaccess` library, and finally how to search for and access the GPM_3IMERGHH_precipitationCal Zarr store using the S3FS library.

### Prerequisites

This notebook was written using Python 3.9 and requires the following libraries and files:
- A `.netrc` file with valid Earthdata Login credentials
- Xarray
- python_cmr
- earthaccess
- requests
- S3FS
- NumPy

### Other Links:
- [Earthdata Webinar: Analyzing Precipitation Extremes Using Cloud Computing (Zarr)](https://www.earthdata.nasa.gov/learn/webinars-and-tutorials/ges-disc-30-10-2023)
- [`earthaccess` Library How-tos](https://earthaccess.readthedocs.io/en/latest/howto/access-data/)
- [Workshop Slides](https://docs.google.com/presentation/d/129rErCamSIO1bK0A_112pqPmvV82O9Z-a2LJLSRYtIY/edit?usp=sharing)

### 1. Import Modules

In [None]:
import xarray as xr
import s3fs
import earthaccess
import numpy as np
from cmr import VariableQuery
import requests

import warnings
warnings.filterwarnings("ignore")

### 2. Authenticate with `earthaccess`

The `earthaccess` library can automatically generate the `.netrc` file required for accessing S3 buckets, if it is not already present, when called with the `strategy="interactive"` parameter. If the `.netrc` file does not exist, `earthaccess` will prompt you for your credentials; if the file does exist, or once it has been created, `earthaccess` will authenticate your S3 access with a token that expires after one hour.
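
As a minimal sketch of the login step described above, the cell below wraps `earthaccess.login()` in a small helper. The helper name `login_to_earthdata` and the `LOGIN_OPTIONS` dictionary are illustrative, and `persist=True` is an assumption: it asks `earthaccess` to save the entered credentials to `.netrc` so that later sessions do not prompt again.

```python
# Keyword arguments passed to earthaccess.login(); persist=True is an assumption
LOGIN_OPTIONS = {"strategy": "interactive", "persist": True}


def login_to_earthdata(options=LOGIN_OPTIONS):
    """Log in to Earthdata Login and return the earthaccess Auth object."""
    import earthaccess  # imported here so the helper can be defined on its own

    auth = earthaccess.login(**options)
    if not auth.authenticated:
        raise RuntimeError("Earthdata Login failed; check your credentials.")
    return auth


# Usage (prompts for a username and password if no .netrc file exists):
# auth = login_to_earthdata()
```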

### 3. Search and Access Granules Directly from the S3 Bucket using `earthaccess`

In [None]:
# Search for the data



Once granule URLs are found, the `earthaccess.open()` method handles the object-storage setup and authentication so that we can read the granules directly. We can then open all of the granules using `xr.open_mfdataset()`, remembering to pass `group="Grid"`.
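
The search-and-open steps described above might look like the following sketch. The helper names, the one-day window, and `version="06"` are assumptions for illustration; `earthaccess.search_data()`, `earthaccess.open()`, and `xr.open_mfdataset()` are the library calls the text refers to.

```python
def one_day_window(date):
    """Return (start, end) ISO timestamps covering a single UTC day."""
    return (f"{date}T00:00:00", f"{date}T23:59:59")


def open_one_day(date="2020-02-01", short_name="GPM_3IMERGHH", version="06"):
    """Search for one day of half-hourly granules and open them lazily."""
    import earthaccess
    import xarray as xr

    # Find every granule whose time range overlaps the requested day
    results = earthaccess.search_data(
        short_name=short_name,
        version=version,  # assumed collection version
        temporal=one_day_window(date),
    )

    # Wrap the S3 objects in authenticated, file-like handles
    file_objects = earthaccess.open(results)

    # Read the "Grid" group of each file and combine along shared coordinates
    return xr.open_mfdataset(file_objects, group="Grid")


# Usage (requires us-west-2 access and a prior earthaccess.login()):
# ds = open_one_day("2020-02-01")
```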

In [None]:
%%time

# Only open one day; this takes quite some time!



### 4. Subset over Northern California for February 2020

In [None]:
bbox = [-124.295,38.954,-119.989,42.03]
lon_slice = slice(bbox[0], bbox[2])
lat_slice = slice(bbox[1], bbox[3])
year = 2020

start_time = f"{year}-02-01T00:00:00"
end_time = f"{year}-02-29T23:30:00"
time_slice = slice(start_time, end_time)
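
One way to apply these slices is label-based selection with `Dataset.sel()`. The sketch below assumes the opened dataset is named `ds` and uses `lon`, `lat`, and `time` as coordinate names; `subset_region` is a hypothetical helper, not part of the original notebook.

```python
def subset_region(ds, lon_slice, lat_slice, time_slice,
                  lon_name="lon", lat_name="lat", time_name="time"):
    """Select a bounding box and time range by coordinate labels."""
    # .sel() with slice objects keeps every label inside each inclusive range
    return ds.sel({
        lon_name: lon_slice,
        lat_name: lat_slice,
        time_name: time_slice,
    })


# Usage, with the slices defined in the cell above:
# subset = subset_region(ds, lon_slice, lat_slice, time_slice)
```

Passing the coordinate names as parameters keeps the helper usable for datasets whose coordinates are named differently.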


### 5. Calculate Monthly Mean and Plot
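
A minimal sketch of this step, assuming the subset from section 4 is stored in a variable named `subset` and that the variable of interest is `precipitationCal`. If that variable is a precipitation rate, the result is a mean rate rather than an accumulated total. `monthly_mean` is a hypothetical helper.

```python
def monthly_mean(subset, var_name="precipitationCal", time_dim="time"):
    """Average one variable over time, leaving the spatial dimensions."""
    return subset[var_name].mean(dim=time_dim)


# Usage (plotting requires matplotlib):
# mean_field = monthly_mean(subset)
# mean_field.plot(x="lon", y="lat")
```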

### 6. Check Direct/External Links

In [None]:
# Download files directly from S3



In [None]:
# Download files directly from on-premises HTTPS, or from the cloud, if available
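
To compare the two kinds of links, each granule returned by `earthaccess.search_data()` exposes a `data_links()` method. The sketch below assumes `results` holds those granules; `split_links` is a hypothetical helper that simply sorts URLs by scheme, and the download folder is a made-up path.

```python
def split_links(urls):
    """Separate direct S3 links from external HTTPS links by URL scheme."""
    direct = [url for url in urls if url.startswith("s3://")]
    external = [url for url in urls if url.startswith("https://")]
    return direct, external


# Usage with earthaccess granules (assumes `results` from an earlier search):
# s3_links = results[0].data_links(access="direct")
# https_links = results[0].data_links(access="external")
# earthaccess.download(results[:1], local_path="./data")  # hypothetical folder
```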



### 7. Access S3 Zarr Store Using `python_cmr` and `xarray`

Below, we create functions for authenticating and for opening the Zarr store, which lives in a different S3 bucket and uses a different credential endpoint:

In [None]:
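def retrieve_credentials():
    """Make the OAuth calls to authenticate with EDS and return a set of same-region, read-only S3 credentials."""
    response = requests.get("https://api.giovanni.earthdata.nasa.gov/s3credentials")
    response.raise_for_status()  # Fail loudly on HTTP errors
    return response.json()

def open_zarr_store(bucket_path):
    """Return a key-value mapper over the Zarr store at `bucket_path`, using temporary credentials."""
    creds = retrieve_credentials()

    # Build an S3 filesystem from the temporary credentials
    s3 = s3fs.S3FileSystem(key=creds["AccessKeyId"],
                           secret=creds["SecretAccessKey"],
                           token=creds["SessionToken"])

    # The mapper exposes the bucket path as a store that Zarr readers can open
    store = s3.get_mapper(bucket_path)
    return store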

### 8. Search for Zarr Stores using `python_cmr`

Here, we will use the `python_cmr` library to search for Zarr stores by variable, provider, and collection. Note that we have to manually parse out the collection alongside the variable, as there are multiple collections with `precipitationCal` as a variable.

In [None]:
var = "precipitationCal"
provider = "GES_DISC"
collection = "GPM_3IMERGHH_06"

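# Keep only the variable records that advertise a Zarr store
zarr_stores = [record for record in VariableQuery().provider(provider).get_all() if "instance_information" in record]

# Pick the store whose native ID contains the collection name
for item in zarr_stores:
    if collection in item.get('native_id', ''):
        zarr_store_path = item['instance_information']['url']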

zarr_store_path
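
The loop above matches on the collection name only. A variant that also checks the variable name is sketched below. It assumes each record carries a `name` field holding the variable name, which the original notebook does not confirm, and `find_zarr_store` is a hypothetical helper. The sample records and their URL are made up for illustration.

```python
def find_zarr_store(records, collection, var):
    """Return the first Zarr store URL matching both collection and variable."""
    for record in records:
        info = record.get("instance_information")
        if info is None:
            continue  # this variable record has no Zarr store
        in_collection = collection in record.get("native_id", "")
        is_variable = record.get("name") == var  # assumed field name
        if in_collection and is_variable:
            return info["url"]
    return None  # no match found


# Made-up records mimicking the keys used in the cell above
sample = [
    {"native_id": "GPM_3IMERGHH_06_other", "name": "other"},
    {"native_id": "GPM_3IMERGHH_06_precipitationCal", "name": "precipitationCal",
     "instance_information": {"url": "s3://example-bucket/example.zarr"}},
]
print(find_zarr_store(sample, "GPM_3IMERGHH_06", "precipitationCal"))
```

Returning `None` when nothing matches makes a failed search explicit instead of leaving the path variable undefined.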

### 9. Access Zarr Store using `xarray`

Remember, this is only one variable! You will also notice that it is referred to as `variable`, and not `precipitationCal`.
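
Opening the store might look like the sketch below. It assumes the `open_zarr_store()` function from section 7 and the `zarr_store_path` found in section 8. The keyword arguments used in the original cell are not shown, so xarray's defaults are assumed here, and `open_zarr_dataset` is a hypothetical helper.

```python
def open_zarr_dataset(store, **kwargs):
    """Lazily open a Zarr store (for example, an fsspec mapper) with xarray."""
    import xarray as xr

    # Only metadata is read here; chunks are fetched when values are needed
    return xr.open_zarr(store, **kwargs)


# Usage:
# ds_zarr = open_zarr_dataset(open_zarr_store(zarr_store_path))
```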

In [None]:
%%time



# Mask fill values. This is required until this bug is fixed
ds_masked_dropped = ds_zarr.where(ds_zarr["time"] != ds_zarr["time"]._FillValue, drop=True)
ds_masked_dropped

### 10. Subset over Northern California in February 2020

Note that the latitude and longitude coordinates are now "latitude" and "longitude", instead of "lat" and "lon".

In [None]:
bbox = [-124.295,38.954,-119.989,42.03]
lon_slice = slice(bbox[0], bbox[2])
lat_slice = slice(bbox[1], bbox[3])
year = 2020

start_time = f"{year}-02-01T00:00:00"
is_leap = year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)  # Gregorian leap-year rule
end_time = f"{year}-02-{29 if is_leap else 28}T23:30:00"
time_slice = slice(start_time, end_time)



Next, calculate the mean and plot it. This should run much faster than the granule-based approach above.
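
A sketch of the subset-and-average step for the Zarr store, assuming the masked dataset from section 9 (`ds_masked_dropped`), the slices defined above, and the coordinate and variable names mentioned in the text (`latitude`, `longitude`, `variable`). `zarr_monthly_mean` is a hypothetical helper.

```python
def zarr_monthly_mean(ds, lon_slice, lat_slice, time_slice, var_name="variable"):
    """Subset the Zarr-backed variable, then average it over time."""
    subset = ds[var_name].sel(
        longitude=lon_slice,
        latitude=lat_slice,
        time=time_slice,
    )
    return subset.mean(dim="time")


# Usage (plotting requires matplotlib):
# mean_field = zarr_monthly_mean(ds_masked_dropped, lon_slice, lat_slice, time_slice)
# mean_field.plot(x="longitude", y="latitude")
```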

In [None]:
%%time
