# ICESat-2 AWS cloud data access
This notebook ({nb-download}`download <IS2_cloud_data_access.ipynb>`) illustrates the use of icepyx for accessing ICESat-2 data currently available through the AWS (Amazon Web Services) us-west-2 hub s3 data bucket.

## Notes
1. ICESat-2 data became publicly available on the cloud on 29 September 2022. Thus, access methods and example workflows are still being developed by NSIDC, and the underlying code in icepyx will need to be updated now that these data (and the associated metadata) are available. We appreciate your patience and contributions (e.g. reporting bugs, sharing your code, etc.) during this transition!
2. This example and the code it describes are part of ongoing development. Current limitations to using these features are described throughout the example, as appropriate.
3. You **MUST** be working within an AWS instance. Otherwise, you will get a permissions error.
4. Authentication still takes more steps than we'd like. We're working to address this, so let us know if you'd like to join the conversation!

In [None]:
import icepyx as ipx


In [None]:
%pwd
%pip install ../../../
%pip install -e ../../../

In [None]:
%load_ext autoreload
import icepyx as ipx
%autoreload 2

print(ipx.__version__)

In [None]:
import logging 
logging.basicConfig(level=logging.DEBUG)

In [None]:
import earthaccess

Create an icepyx Query object

In [None]:
# bounding box
# "producerGranuleId": "ATL03_20191130221008_09930503_004_01.h5",
short_name = 'ATL03'
spatial_extent = [-45, 58, -35, 75]
date_range = ['2019-11-30','2019-11-30'] ### NOTE THESE PARAMETERS BREAK FOR v006!

In [None]:
reg = ipx.Query(short_name, spatial_extent, date_range, version="005")

## Get the granule s3 urls
You must specify `cloud=True` to get the needed s3 urls.
With `ids=True` and `cloud=True`, this function returns a list containing two lists: the granule IDs and the corresponding s3 urls.

In [None]:
gran_ids = reg.avail_granules(ids=True, cloud=True)
gran_ids

In [None]:
s3urls = gran_ids[1]

In [None]:
s3urls
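Because the return value is a two-element list, it can be unpacked in one step instead of indexing with `[0]` and `[1]`. The following is a minimal sketch that uses a hard-coded stand-in for the result (the granule name and url are the ones that appear later in this notebook), so it runs without an Earthdata login. The names `result`, `granule_ids`, `granule_urls`, and `url_by_id` are hypothetical.

```python
# Stand-in for the value returned by reg.avail_granules(ids=True, cloud=True):
# a list holding [list_of_granule_ids, list_of_s3_urls].
result = [
    ["ATL03_20191130112041_09860505_005_01.h5"],
    ["s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/005/2019/11/30/"
     "ATL03_20191130112041_09860505_005_01.h5"],
]

# Unpack both lists at once rather than indexing into the outer list.
granule_ids, granule_urls = result

# Pair each granule ID with its url for easy lookup.
url_by_id = dict(zip(granule_ids, granule_urls))
for gid, url in url_by_id.items():
    print(gid, "->", url)
```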

## Log in to Earthdata and generate an s3 token
You can use icepyx's existing login functionality to generate your s3 data access token, which will be valid for *one* hour.

We currently do not have this set up to renew automatically, but earthaccess, which icepyx will soon be adopting for authentication, is working on handling the limits imposed by expiring s3 tokens. If you're interested in helping icepyx and NSIDC (and DAACs more broadly) address these challenges, please get in touch or submit a PR. Documentation/example testers are always appreciated (so you don't have to understand the code)!

In [None]:
reg.earthdata_login(s3token=True)
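Since the token is not renewed automatically, one workaround is to note when you logged in and check the token's age before starting a long read. The sketch below is only an illustration: `token_is_fresh`, `TOKEN_LIFETIME`, and the five-minute `margin` are hypothetical names and values, and the one-hour lifetime is taken from the text above.

```python
from datetime import datetime, timedelta, timezone

# Assumption taken from the text above: an s3 token is valid for one hour.
TOKEN_LIFETIME = timedelta(hours=1)


def token_is_fresh(issued_at, now=None, margin=timedelta(minutes=5)):
    """Return True if a token issued at `issued_at` is still safely usable.

    `margin` is a hypothetical safety buffer so that a long read does not
    start right before the token expires.
    """
    now = now or datetime.now(timezone.utc)
    return now - issued_at < TOKEN_LIFETIME - margin


# Record the time right after calling reg.earthdata_login(s3token=True).
issued_at = datetime.now(timezone.utc)

# Later, before a long read, decide whether to log in again.
if not token_is_fresh(issued_at):
    print("Token is (nearly) expired; run the login cell again.")
```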

## Set up your s3 access using your credentials

In [None]:
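# NOTE: earthaccess documents `provider` as a cloud provider name that is not needed
# when `daac` is given; revisit whether passing the credentials dict here is intended.
s3 = earthaccess.get_s3fs_session(daac='NSIDC', provider=reg._s3login_credentials)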

In [None]:
reg._s3login_credentials
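Displaying the credentials dictionary shows the secret values in the notebook output, which is easy to commit by accident. As a hedged sketch, the helpers below mask the values for display and rename the keys to the AWS-style names used in the H5Coro cell of this notebook. `mask`, `to_aws_kwargs`, and `fake_creds` are hypothetical names, and the values are placeholders rather than real credentials. The key names (`accessKeyId`, `secretAccessKey`, `sessionToken`) are the ones this notebook reads from `reg._s3login_credentials`.

```python
def mask(value, keep=4):
    """Hide all but the last `keep` characters of a secret string."""
    value = str(value)
    return "*" * max(len(value) - keep, 0) + value[-keep:]


def to_aws_kwargs(creds):
    """Rename Earthdata-style keys to AWS-style keyword names."""
    return {
        "aws_access_key_id": creds["accessKeyId"],
        "aws_secret_access_key": creds["secretAccessKey"],
        "aws_session_token": creds["sessionToken"],
    }


# Fake placeholder values; real ones would come from reg._s3login_credentials.
fake_creds = {
    "accessKeyId": "AKIAEXAMPLE1234",
    "secretAccessKey": "not-a-real-secret",
    "sessionToken": "not-a-real-token",
}

# Safer to display: only the last few characters of each value are visible.
print({key: mask(val) for key, val in fake_creds.items()})
```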

## H5Coro Playtime

In [None]:
# (1) import
from h5coro import h5coro, s3driver, filedriver

# (2) configure
h5coro.config(errorChecking=True, verbose=False, enableAttributes=False)

In [None]:
s3url = s3urls[0]
print(s3url)

In [None]:
%%time
# what's currently in the variables module

import h5py

# in order to treat the inputs "like" files
fileset = s3.open(s3url)

_avail = []

def visitor_func(name, node):
    if isinstance(node, h5py.Group):
        # node is a Group
        pass
    else:
        # node is a Dataset
        _avail.append(name)

with h5py.File(fileset, "r") as h5f:
    h5f.visititems(visitor_func)
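Once `_avail` is populated, it holds the full path of every dataset in the file. The sketch below shows one way such paths could be grouped by their top-level group (for ATL03, typically a beam such as `gt1l`). The paths in `sample_avail` are a small hand-typed sample for illustration, not output from a real granule, and `by_group` is a hypothetical name.

```python
from collections import defaultdict

# Hand-written sample of the kind of paths visititems() collects.
sample_avail = [
    "gt1l/heights/h_ph",
    "gt1l/heights/dist_ph_along",
    "gt2r/heights/h_ph",
    "orbit_info/rgt",
]

# Map each top-level group to the dataset paths beneath it.
by_group = defaultdict(list)
for path in sample_avail:
    top_level, _, remainder = path.partition("/")
    by_group[top_level].append(remainder)

for group, names in sorted(by_group.items()):
    print(group, names)
```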


In [None]:
# (3) create

my_bucket = 's3'
path_to_hdf5_file = 'nsidc-cumulus-prod-protected/ATLAS/ATL03/005/2019/11/30/ATL03_20191130112041_09860505_005_01.h5'

# h5obj = h5coro.H5Coro(f'{s3url}',
h5obj = h5coro.H5Coro(
    f'{my_bucket}/{path_to_hdf5_file}',
    s3driver.S3Driver,
    credentials={
        "aws_access_key_id": reg._s3login_credentials["accessKeyId"],
        "aws_secret_access_key": reg._s3login_credentials["secretAccessKey"],
        "aws_session_token": reg._s3login_credentials["sessionToken"],
    },
)


In [None]:
# (4) read
datasets = [{'dataset': '/path/to/dataset1', 'startrow': 0, 'numrows': h5coro.ALL_ROWS},
            {'dataset': '/path/to/dataset2', 'startrow': 324, 'numrows': 50}]
h5obj.readDatasets(datasets=datasets, block=True)

# (5) display
for dataset in h5obj:
    print(dataset)
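The dataset paths in the read step above are placeholders. As a rough sketch of how a real request list might be assembled, the helper below builds one entry per beam and variable using the same dictionary keys (`dataset`, `startrow`, `numrows`). `build_requests` and `read_requests` are hypothetical names, the `/{beam}/heights/{var}` layout is an assumption about ATL03, and `ALL_ROWS` is a local stand-in so the example runs without h5coro installed.

```python
# Stand-in for h5coro.ALL_ROWS so this sketch runs without h5coro installed;
# the value is an assumption. In the notebook, use h5coro.ALL_ROWS itself.
ALL_ROWS = -1


def build_requests(beams, variables, group="heights"):
    """Build one read request per beam/variable pair."""
    return [
        {"dataset": f"/{beam}/{group}/{var}", "startrow": 0, "numrows": ALL_ROWS}
        for beam in beams
        for var in variables
    ]


read_requests = build_requests(["gt1l", "gt1r"], ["h_ph", "signal_conf_ph"])
for request in read_requests:
    print(request["dataset"])
```

A list built this way could then be passed as the `datasets` argument in the read step.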

## Try to make data read-in work

In [None]:
# in order to treat the inputs "like" files
fileset = [s3.open(file) for file in s3urls]

In [None]:
fileset

In [None]:
type(fileset[0])

In [None]:
import xarray as xr

In [None]:
ssh_ds = xr.open_mfdataset(fileset,
                           combine='by_coords',
                           mask_and_scale=True,
                           decode_cf=True,
                           chunks='auto')
ssh_ds

In [None]:
# to do: revisit how we want to accept s3 inputs.
# Can users use the filename matching, or should they provide an explicit list of urls?
# Based on that, we will need to either use _check_source_for_pattern or add another
# function that just returns the right file list and handles credentialing.

In [None]:
path_root = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL03/005/2019/11/30/'

In [None]:
pattern = "ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5"
# reader = ipx.Read(path_root, "ATL03", pattern)
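The filename pattern above names each fixed-width field in an ICESat-2 granule name. As a rough illustration of what those fields capture, here is a minimal sketch that parses the same layout with Python's `re` module rather than the pattern syntax used by icepyx. The regular expression and the `parse_granule_name` helper are assumptions written for this example (they also assume every field is numeric); they are not part of icepyx.

```python
import re
from datetime import datetime

# Regular-expression counterpart of the filename pattern above.
FILENAME_RE = re.compile(
    r"ATL(?P<product>\d{2})_(?P<datetime>\d{14})_"
    r"(?P<rgt>\d{4})(?P<cycle>\d{2})(?P<orbitsegment>\d{2})_"
    r"(?P<version>\d{3})_(?P<revision>\d{2})\.h5$"
)


def parse_granule_name(name):
    """Return the filename fields as a dict, or None if the name does not match."""
    match = FILENAME_RE.search(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["datetime"] = datetime.strptime(fields["datetime"], "%Y%m%d%H%M%S")
    return fields


info = parse_granule_name("ATL03_20191130112041_09860505_005_01.h5")
print(info["rgt"], info["cycle"], info["datetime"].isoformat())
```

Because the expression is anchored only at the end of the string, the same helper also accepts a full s3 url that ends in a granule name.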

In [None]:
reader = ipx.Read(fileset[0], "ATL03", pattern)

In [None]:
reader._source_type

In [None]:
# next steps: see also note above

See the notes and warnings within the code, and here, to address the shortcuts taken to reach the load() step and try it, e.g.:

* let source be the s3url list, and then don't require the pattern input (or let it default?)
* update the pattern check function to actually check the urls (and return false otherwise)... lines 391... may need to write a new fnmatch function as for if/elif above
* we can still only load data from one product type at a time, so the pattern check will accomplish this!

17 May 2023:
Some work will be needed on the intake/catalog side to make this work for data read-in and merging.
This is likely a good space for Rachel to focus (perhaps first adopting earthaccess under the hood wherever possible?).
For the purposes of the GeoSMART tutorial and ATL11 cloud read-in, we're going to have to stick with datatree or a more manual approach in the short term...


In [None]:
reader.vars.avail()  # NOTE THIS WAS REALLY SLOW!!
# What is the best approach to letting the user know what's available?
# Should there be per-product default lists for the cloud, since all vars are always available?
# If bloating icepyx itself is a concern, perhaps they could be generated in a separate repo
# and updated (e.g. monthly) by a cron job. Then behind the scenes we can just grab the list
# for the data product of interest, since a user who is on the cloud (and authenticated)
# clearly has internet access...

In [None]:
reader.vars.append(var_list=['dist_ph_along','dist_ph_across','signal_conf_ph','h_ph'])

In [None]:
reader.vars.wanted

In [None]:
reader._source_type

In [None]:
ds = reader.load()

## Select an s3 url and access the data
Data read-in capabilities for cloud data are coming soon in icepyx (targeted Winter 2022-2023). Stay tuned, and we'd love for you to join us and contribute!

**Note: If you get a PermissionDenied Error when trying to read in the data, you may not be sending your request from an AWS hub in us-west-2. We're currently working on how to alert users if they will not be able to access ICESat-2 data in the cloud for this reason.**
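Until such an alert exists in icepyx, one option is to wrap the file-opening call and attach a hint to the error. This is only a sketch under stated assumptions: it assumes the failure surfaces as Python's built-in `PermissionError`, and `open_with_region_hint` and `deny` are hypothetical helpers written for this example. In this notebook the `opener` argument would be the `open` method of your s3 session.

```python
def open_with_region_hint(opener, url, mode="rb"):
    """Call opener(url, mode) and add a region hint if access is denied."""
    try:
        return opener(url, mode)
    except PermissionError as err:
        raise PermissionError(
            f"Could not read {url}. ICESat-2 cloud data can only be read "
            "from within AWS us-west-2; check where this code is running."
        ) from err


# Demonstration with a fake opener that always refuses access.
def deny(url, mode):
    raise PermissionError("Access Denied")


try:
    open_with_region_hint(deny, "s3://example-bucket/example.h5")
except PermissionError as err:
    print(err)
```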

In [None]:
# the first index, [1], gets us into the list of s3 urls
# the second index, [0], gets us the first entry in that list.
s3url = gran_ids[1][0]
# s3url =  's3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2019/11/30/ATL03_20191130221008_09930503_004_01.h5'

In [None]:
import h5py
import numpy as np

In [None]:
# `s3` is the s3fs session created earlier in this notebook
%time f = h5py.File(s3.open(s3url, 'rb'), 'r')

#### Credits
* notebook by: Jessica Scheick
* source material: [is2-nsidc-cloud.py](https://gist.github.com/bradlipovsky/80ab6a7aff3d3524b9616a9fc176065e#file-is2-nsidc-cloud-py-L28) by Brad Lipovsky