<center>
<img src='./img/nsidc_logo.png'/>

# **Using Coiled and h5coro to Produce ICESat-2 Sea Ice Height Time Series**

</center>

---

## **1. Tutorial Introduction/Overview**

This tutorial was designed for the "DAAC data access in the cloud hands-on experience" session at the 2023 NSIDC DAAC User Working Group (UWG) Meeting. It is a copy of the `2_ATL07_timeseries` notebook, adapted for use with Coiled.


TODOs:
* Explain Coiled
* Question for Luis: Why would I use the decorator function (`@coiled.function()`) instead of creating a cluster explicitly?

```
cluster = coiled.Cluster(n_workers=20, region="us-west-2")
client = cluster.get_client()
client
```
* How do we incorporate https://medium.com/coiled-hq/processing-a-250-tb-dataset-with-coiled-dask-and-xarray-574370ba5bde ? 


### Installing the latest versions of earthaccess and coiled

**NOTE**: Restart the kernel and clear the output after running the next cell.

In [1]:
%%capture 

!pip install coiled==0.9.26

!pip uninstall -y earthaccess
!pip install git+https://github.com/nsidc/earthaccess.git@main

Found existing installation: earthaccess 0.5.4
Uninstalling earthaccess-0.5.4:
  Successfully uninstalled earthaccess-0.5.4
Collecting git+https://github.com/nsidc/earthaccess.git@main
  Cloning https://github.com/nsidc/earthaccess.git (to revision main) to /tmp/pip-req-build-1pktmhn1
  Running command git clone --filter=blob:none --quiet https://github.com/nsidc/earthaccess.git /tmp/pip-req-build-1pktmhn1
  Resolved https://github.com/nsidc/earthaccess.git to commit 18f05edad5bce5441ac914804c7189dd0b0d7dde
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: earthaccess
  Building wheel for earthaccess (pyproject.toml) ... done
  Created wheel for earthaccess: filename=earthaccess-0.5.4-py3-none-any.whl size=54732 sha256=8f88cf84a612eab25572c359467bd2bf0b8b17454698e82f49a0819c7ec196ce
  Stored in directory: /tmp/pip-ephem-w

## **2. Tutorial steps**

Resources: each granule is approximately 60-120 MB, and a month of data for the Ross Sea returns 59 granules (~4.6 GB). We should use an instance with, preferably, double the memory of the approximate size of the data we read.
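
The sizing rule above can be sketched as a small calculation. This is only an illustration: the function names, the factor of 2, and the rounding up to a power of two are assumptions made for the example, not Coiled requirements.

```python
import math


def recommended_memory_gb(data_size_gb, safety_factor=2.0):
    """Return a memory target: the data size times a safety factor."""
    return data_size_gb * safety_factor


def round_up_to_power_of_two(value):
    """Round a positive number up to the next power of two."""
    return 2 ** math.ceil(math.log2(value))


# One month of Ross Sea granules is about 4.6 GB
target = recommended_memory_gb(4.6)
instance_memory = round_up_to_power_of_two(target)
print(f"target: {target:.1f} GB, instance size: {instance_memory} GB")
```

For the month of data used here, this lands on the same 16 GB order of magnitude as the `memory="16 GiB"` setting that appears in the commented-out `@coiled.function` calls later in the notebook.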

### **Import Packages**

In [2]:
# For Coiled cloud compute
import coiled

# For searching NASA data
import earthaccess

# For reading data, analysis and plotting
import xarray as xr
import numpy as np
import geopandas as gpd
import pandas as pd
import hvplot.xarray

import pprint
from affine import Affine
from pyproj import CRS

from pqdm.threads import pqdm

print(coiled.__version__)
print(earthaccess.__version__)




0.9.26
0.5.4


### **Authenticate**

In [3]:
auth = earthaccess.login()

EARTHDATA_USERNAME and EARTHDATA_PASSWORD are not set in the current environment, try setting them or use a different strategy (netrc, interactive)
You're now authenticated with NASA Earthdata Login
Using token with expiration date: 11/18/2023
Using .netrc file for EDL


### **Search for ICESat-2 ATL10 data**

The Ross Sea spatial extent comes from the ICESat-2 Hackweek sea ice tutorial (https://icesat-2-2023.hackweek.io/tutorials/sea_ice/1_sea_ice_tutorial.html), which uses the following spatial and temporal ranges:


```
# Spatial extent: Ross Sea, Antarctica
spatial_extent = [-180, -78, -160, -74]

# Time range
date_range = ['2019-09-16','2019-09-16'] # first time period
# date_range = ['2019-11-13','2019-11-13'] # second time period
```

In [4]:
region = "Ross Sea"
ross_sea = (-180, -78, -160, -74)
antarctic = (-180, -90, 180, -60)
this_region = antarctic if region == "Antarctica" else ross_sea

In [6]:
atl10 = {}
total_results = 0

for year in range(2019,2020):
    
    print(f"Searching year {year} ...")
    granules = earthaccess.search_data(
        short_name = 'ATL10',
        version = '006',
        cloud_hosted = True,
        bounding_box = this_region,
        temporal = (f'{year}-09-01',f'{year}-09-30'),
    )
    total_results += len(granules)
    atl10[str(year)] = granules
print(f"Total: {total_results}")

Searching year 2019 ...
Granules found: 59
Total: 59


In [None]:
# Preview the first two granules
for granule in atl10["2019"][0:2]:
    display(granule)

### **Extract freeboard segments**

We now create a GeoPandas GeoDataFrame from our results.

Because ATL10 is not a gridded product, we need to extract coordinates and variables from their groups inside the HDF5 file.

#### Open the files using the `earthaccess.open` method

The auth object created at the start of the notebook is used to provide Earthdata Login authentication and AWS credentials.

In [7]:
file_tree = {}

for year, granules in atl10.items():
    file_tree[year] = earthaccess.open(granules)


 Opening 59 granules, approx size: 4.6 GB
using provider: NSIDC_CPRD



In [8]:
# files[0].f.s3.storage_options
print(file_tree["2019"][0].f.info())

{'ETag': '"e123bb7ed68661d31fea92a0abdf8fc0-1"', 'LastModified': datetime.datetime(2023, 6, 24, 0, 19, 44, tzinfo=tzutc()), 'size': 46754892, 'name': 'nsidc-cumulus-prod-protected/ATLAS/ATL10/006/2019/09/01/ATL10-02_20190901100614_09980401_006_02.h5', 'type': 'file', 'StorageClass': 'STANDARD_IA', 'VersionId': None, 'ContentType': 'binary/octet-stream'}


In [11]:
import h5py

with h5py.File(file_tree["2019"][0],'r') as f:
    obj = f["gt1r"]['freeboard_segment/delta_time']
    for attr, value in obj.attrs.items():
        print(f"{attr}: {value}")
    time = obj[:]
    
time

CLASS: b'DIMENSION_SCALE'
NAME: b'gt1r/freeboard_segment/delta_time'
REFERENCE_LIST: [(<HDF5 object reference>, 0) (<HDF5 object reference>, 0)
 (<HDF5 object reference>, 0) (<HDF5 object reference>, 0)
 (<HDF5 object reference>, 0) (<HDF5 object reference>, 0)
 (<HDF5 object reference>, 0) (<HDF5 object reference>, 0)
 (<HDF5 object reference>, 0) (<HDF5 object reference>, 0)
 (<HDF5 object reference>, 0)]
contentType: b'physicalMeasurement'
coordinates: b'latitude longitude'
description: b'Number of GPS seconds since the ATLAS SDP epoch. The ATLAS Standard Data Products (SDP) epoch offset is defined within /ancillary_data/atlas_sdp_gps_epoch as the number of GPS seconds between the GPS epoch (1980-01-06T00:00:00.000000Z UTC) and the ATLAS SDP epoch. By adding the offset contained within atlas_sdp_gps_epoch to delta time parameters, the time in gps_seconds relative to the GPS epoch can be computed.'
long_name: b'Elapsed GPS seconds'
source: b'Derived via Time Tagging'
standard_name: b

array([52571344.33341975, 52571344.33341975, 52571344.33341975, ...,
       52572219.13913625, 52572219.14612295, 52572219.15575895])
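
The `delta_time` description above can be turned into a calendar time with a few lines of standard-library Python. This is a minimal sketch that assumes the ATLAS SDP epoch corresponds to 2018-01-01T00:00:00 UTC and that no leap seconds have been added since then. The names `ATLAS_SDP_EPOCH` and `delta_time_to_utc` are made up for this example, and in practice the offset should be read from `/ancillary_data/atlas_sdp_gps_epoch`.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the ATLAS SDP epoch is 2018-01-01T00:00:00 UTC
ATLAS_SDP_EPOCH = datetime(2018, 1, 1, tzinfo=timezone.utc)


def delta_time_to_utc(delta_time_seconds):
    """Convert seconds since the ATLAS SDP epoch to a timezone-aware datetime."""
    return ATLAS_SDP_EPOCH + timedelta(seconds=float(delta_time_seconds))


# First value of the delta_time array printed above
first = delta_time_to_utc(52571344.33341975)
print(first.isoformat())
```

The result falls on 2019-09-01, which agrees with the date in the granule file name.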

In [10]:
ds = xr.open_dataset(file_tree["2019"][0], group="gt1r/freeboard_segment/")
ds

### Pre-warming the Coiled instance

Once we run this with Coiled, it would be good to instantiate the cluster beforehand so that it is ready when the real work is submitted.

In [None]:
# @coiled.function(region="us-west-2",
#                  memory="16 GiB")
# def trivial(param):
#     print(param)
#     return param

In [None]:
# trivial("test")

In [12]:
## Based on the READ function from Younghyun Koo for the sea ice tutorial at the IS2 hackweek

# @coiled.function(region="us-west-2",
#                  memory="16 GiB")

# Modifications to streamline
# - helper function for orientation
# - helper function to reformat credentials
# - use datasets to read arrays
# - add data to dictionary

def strong_beams(f):
    """Returns ground track for strong beams based on IS2 orientation"""
    orient  = f['orbit_info/sc_orient'][0]

    if orient == 0:
        return [f"gt{i}l" for i in [1, 2, 3]]
    elif orient == 1:
        return [f"gt{i}r" for i in [1, 2, 3]]
    else:
        raise KeyError("Spacecraft orientation neither forward nor backward")


def get_credentials(file):
    """Returns credentials dict with keys expected by h5coro
    
    TODO: could add as option for earthaccess
    """
    return {
        "aws_access_key_id": file.s3.storage_options["key"],
        "aws_secret_access_key": file.s3.storage_options["secret"],
        "aws_session_token": file.s3.storage_options["token"]
    }
    
    
def read_atl10_local(files, executors):
    """Returns a consolidated GeoPandas dataframe for a set of ATL10 file pointers.
    
    Parameters:
        files (list[S3FSFile]): list of authenticated fsspec file references to ATL10 on S3 (via earthaccess)
        executors (int): number of threads
    
    """
    from h5coro import h5coro, s3driver, filedriver
    from itertools import product
    import geopandas as gpd
    import pandas as pd
    import numpy as np
    import gc
    
    def read_atl10(file):
        # Create a list for saving ATL10 beam track data
        tracks = []
        
        f = h5coro.H5Coro(file.info()["name"], s3driver.S3Driver, credentials=get_credentials(file))
        f.readDatasets(datasets=["orbit_info/sc_orient"], block=True)
        
        # Determine the strong beams from the spacecraft orientation
        beams = strong_beams(f)

        datasets = ["freeboard_segment/latitude",
                    "freeboard_segment/longitude",
                    "freeboard_segment/delta_time",
                    "freeboard_segment/seg_dist_x",
                    "freeboard_segment/heights/height_segment_length_seg",
                    "freeboard_segment/beam_fb_height",
                    "freeboard_segment/heights/height_segment_type"]

        # Build the full dataset paths, e.g. "gt1l/freeboard_segment/latitude"
        ds_list = ["/".join(p) for p in product(beams, datasets)]
        f.readDatasets(datasets=ds_list, block=True)
        
        # GPS epoch; leap seconds are not taken into account
        gps_epoch = pd.to_datetime('1980-01-06 00:00:00')
    
        for beam in strong_beams(f):
            lat = f[f'{beam}/freeboard_segment/latitude'][:]
            lon = f[f'{beam}/freeboard_segment/longitude'][:]
            gps_since_epoch = f[f'{beam}/freeboard_segment/delta_time'][:]
            seg_x = f[f'{beam}/freeboard_segment/seg_dist_x'][:] / 1000 # (m to km)
            seg_len = f[f'{beam}/freeboard_segment/heights/height_segment_length_seg'][:]
            fb = f[f'{beam}/freeboard_segment/beam_fb_height'][:]
            surface_type = f[f'{beam}/freeboard_segment/heights/height_segment_type'][:]
            fb[fb>100] = np.nan
            
            # Approximate ATLAS SDP epoch offset in GPS seconds (see the ATL10 ATBD)
            is2_epoch = 1.1988e+9
            
            date_time = gps_epoch + pd.to_timedelta(gps_since_epoch+is2_epoch, unit='s')

            df = pd.DataFrame({'lat': lat, 'lon': lon, 'time': date_time, 'seg_x': seg_x, 'seg_len': seg_len,
                              'freeboard': fb, 'stype': surface_type})
            df['beam'] = beam
            df = df.dropna().reset_index(drop = True)
            gdf = gpd.GeoDataFrame(
                    df, geometry=gpd.points_from_xy(df.lon, df.lat), crs="EPSG:4326"
            )
            del gdf["lat"]
            del gdf["lon"]

            gc.collect()
            tracks.append(gdf)
        # print(f"Done with {file.info()['name']}")
        return tracks
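    # Read the granules in parallel threads; each result is a list of per-beam GeoDataFrames
    results = pqdm(files, read_atl10, n_jobs=executors)
    # NOTE: only the first strong beam (t[0]) of each granule is kept;
    # non-list results, such as exceptions returned for failed reads, are skipped
    combined = pd.concat([t[0] for t in results if isinstance(t, list)])
    return combined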

The idea is to process each year on its own Dask worker.

In [13]:
%%time
tracks = read_atl10_local(file_tree["2019"], executors=16)


CPU times: user 3min 9s, sys: 55.2 s, total: 4min 4s
Wall time: 2min 21s


In [14]:
tracks

| | time | seg_x | seg_len | freeboard | stype | beam | geometry |
|---|---|---|---|---|---|---|---|
| 0 | 2019-09-01 11:10:03.645610094 | 27207.520458 | 20.290331 | 0.253510 | 1 | gt1l | POINT (11.39429 -64.01932) |
| 1 | 2019-09-01 11:10:03.647197247 | 27207.531553 | 19.569931 | 0.265972 | 1 | gt1l | POINT (11.39427 -64.01942) |
| 2 | 2019-09-01 11:10:03.648262024 | 27207.538993 | 16.069815 | 0.276316 | 1 | gt1l | POINT (11.39426 -64.01948) |
| 3 | 2019-09-01 11:10:03.649195671 | 27207.545516 | 14.670819 | 0.303078 | 1 | gt1l | POINT (11.39424 -64.01954) |
| 4 | 2019-09-01 11:10:03.650213718 | 27207.552629 | 14.671224 | 0.324290 | 1 | gt1l | POINT (11.39423 -64.01960) |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 42404 | 2019-09-29 21:26:48.509730577 | 33553.268957 | 44.508583 | 0.181769 | 7 | gt1r | POINT (24.15736 -59.13850) |
| 42405 | 2019-09-29 21:26:48.512042761 | 33553.285300 | 39.595520 | 0.149934 | 7 | gt1r | POINT (24.15733 -59.13835) |
| 42406 | 2019-09-29 21:26:48.514954329 | 33553.305890 | 36.785496 | 0.146254 | 1 | gt1r | POINT (24.15730 -59.13817) |
| 42407 | 2019-09-29 21:26:48.516793250 | 33553.318902 | 28.306177 | 0.137055 | 1 | gt1r | POINT (24.15727 -59.13805) |


In [15]:
tracks.info(memory_usage='deep')  # 'deep' inspects object columns to report actual memory use

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 5205785 entries, 0 to 42408
Data columns (total 7 columns):
 #   Column     Dtype         
---  ------     -----         
 0   time       datetime64[ns]
 1   seg_x      float64       
 2   seg_len    float32       
 3   freeboard  float32       
 4   stype      int8          
 5   beam       object        
 6   geometry   geometry      
dtypes: datetime64[ns](1), float32(2), float64(1), geometry(1), int8(1), object(1)
memory usage: 506.4 MB


### Save the GeoDataFrame as Parquet for efficient I/O in later steps

In [16]:
tracks.to_parquet("atl10-2019.parquet")

ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 1567336203645610094
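
The `ArrowInvalid` error above occurs because the `time` column carries nanosecond digits that cannot be kept when the timestamps are cast to microseconds. One possible workaround, sketched below, is to floor the column to whole microseconds before writing. The helper name `truncate_to_microseconds` and the one-row stand-in table are assumptions for illustration, and dropping the sub-microsecond digits is a choice, not a requirement.

```python
import pandas as pd


def truncate_to_microseconds(df, column="time"):
    """Return a copy of df with the datetime column floored to whole microseconds."""
    out = df.copy()
    out[column] = out[column].dt.floor("us")
    return out


# Small stand-in for the `tracks` table, using the timestamp from the error message
example = pd.DataFrame({"time": pd.to_datetime([1567336203645610094])})
truncated = truncate_to_microseconds(example)
print(truncated["time"].iloc[0])
```

With the sub-microsecond digits removed, the cast no longer loses data, so a call such as `truncate_to_microseconds(tracks).to_parquet("atl10-2019.parquet")` should be able to proceed. pyarrow's Parquet writer also has `coerce_timestamps` and `allow_truncated_timestamps` options, which may be an alternative if they are passed through by `to_parquet`.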

#### GeoPandas read function

The `read_atl10_local` function above extracts latitude, longitude, time, segment distance, segment length, surface type, and freeboard height. See the [NSIDC's ATL10 User Guide](https://nsidc.org/sites/default/files/documents/user-guide/atl10-v006-userguide.pdf) for more details on these variables.

## Grid track data

This follows the processing steps described in the ATL20 (Gridded Sea Ice Freeboard) ATBD, but grids to an EASE-Grid v2 6.25 km grid. Any projected coordinate system or grid could be chosen. The procedure could be modified with extra QC steps or other changes. **The world is your oyster - or [Aplacophoran](https://antarcticsun.usap.gov/science/4447/).**

The processing steps are:

- remove non-ice and low quality segments 
- bin freeboard segments into grid cells
- calculate aggregate statistics
    + mean segment length
    + segment count
    + length weighted mean freeboard
    + length weighted standard deviation of freeboard
    
#### Grid Cell Mean Segment Length $\bar{L}$

$$
\bar{L}(x, y, D) = \frac{\sum L_i}{N}
$$

where $L_i$ is `gtx/freeboard_segment/heights/height_segment_length_seg`, $N$ is the number of segments in the grid cell, $x$ and $y$ are projected coordinates for grid centers, and $D$ is day.

#### Grid Cell Mean Freeboard $\bar{h}$

$$
\bar{h}(x, y, D) = \frac{\sum L_i h_i}{\sum L_i}
$$

where $h_i$ is `gtx/freeboard_segment/beam_fb_height`.

#### Grid Cell Variance of Freeboard $\sigma^2 (x, y, D)$

$$
\sigma^2 (x, y, D) = \frac{\sum L_i (h_i)^2}{\sum L_i} - \bar{h}^2 (x, y, D)
$$

The length-weighted standard deviation is the square root of $\sigma^2$.
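
The three formulas above can be evaluated for all grid cells at once with NumPy. This is a minimal sketch, not the ATL20 implementation: it assumes every segment has already been assigned a flat integer cell index, and the function name `grid_cell_stats` and its arguments are made up for this example.

```python
import numpy as np


def grid_cell_stats(cell_index, seg_len, freeboard, n_cells):
    """Per-cell segment count, mean segment length, and length-weighted
    mean and variance of freeboard. Empty cells are returned as NaN."""
    cell_index = np.asarray(cell_index)
    L = np.asarray(seg_len, dtype=float)
    h = np.asarray(freeboard, dtype=float)

    # Sums over the segments that fall in each cell
    count = np.bincount(cell_index, minlength=n_cells)
    sum_L = np.bincount(cell_index, weights=L, minlength=n_cells)
    sum_Lh = np.bincount(cell_index, weights=L * h, minlength=n_cells)
    sum_Lh2 = np.bincount(cell_index, weights=L * h**2, minlength=n_cells)

    # Empty cells produce 0/0, which NumPy turns into NaN
    with np.errstate(invalid="ignore", divide="ignore"):
        mean_len = sum_L / count
        mean_fb = sum_Lh / sum_L
        var_fb = sum_Lh2 / sum_L - mean_fb**2
    return count, mean_len, mean_fb, var_fb


# Three segments: two in cell 0, one in cell 1, none in cell 2
count, mean_len, mean_fb, var_fb = grid_cell_stats(
    cell_index=[0, 0, 1],
    seg_len=[1.0, 1.0, 2.0],
    freeboard=[0.2, 0.4, 0.3],
    n_cells=3,
)
print(count, mean_len, mean_fb, var_fb)
```

Using `np.bincount` with weights avoids an explicit loop over cells. A pandas `groupby` on the cell index would be an equally reasonable choice.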

### **Calculate grid indices of segment centers**

Using pyproj and Affine

ModuleNotFoundError: No module named 'pyresample'
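
As a sketch of the index calculation, the row and column of a segment can be obtained by inverting the grid's affine transform by hand. The example assumes the points have already been projected to the grid's coordinate system (for instance with pyproj), and the grid origin and extent below are assumed values for a 6.25 km EASE-Grid 2.0 style grid, not values taken from this notebook. The name `grid_indices` is also hypothetical.

```python
import numpy as np

# Assumed grid definition: upper-left corner and cell size, in metres
X_MIN = -9_000_000.0
Y_MAX = 9_000_000.0
CELL_SIZE = 6_250.0


def grid_indices(x, y, x_min=X_MIN, y_max=Y_MAX, cell_size=CELL_SIZE):
    """Return (row, col) of the grid cell containing each projected point."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Columns increase with x; rows increase downward from the top edge
    col = np.floor((x - x_min) / cell_size).astype(int)
    row = np.floor((y_max - y) / cell_size).astype(int)
    return row, col


# A point just inside the upper-left corner and a point at the grid centre
row, col = grid_indices([-8_999_999.0, 0.0], [8_999_999.0, 0.0])
print(row, col)
```

The same mapping can be expressed with an `Affine` transform and its inverse. The hand-written version is shown only to make the arithmetic explicit.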

### **Assign to grid and calculate grid cell mean**

## **3. Learning outcomes recap (optional)**

Provide a brief summary of the learning outcomes of the tutorial


## **4. Additional resources (optional)**

List some additional resources for users to consult, if applicable/desired.

________

### **When your tutorial is ready for review,  please read our [Contributor Guide](https://github.com/nsidc/NSIDC-Data-Tutorials/blob/main/contributor_guide.md) for next steps.**