<center>
<img src='./img/nsidc_logo.png'/>

# **Using Coiled and h5coro to Produce ICESat-2 Sea Ice Height Time Series**

</center>

---

## **1. Tutorial Introduction/Overview**

This tutorial was designed for the "DAAC data access in the cloud hands-on experience" session at the 2023 NSIDC DAAC User Working Group (UWG) Meeting. It is a copy of the `2_ATL07_timeseries` notebook, adapted for use with Coiled.


TODOs:
* Explain Coiled
* Question for Luis: Why would I use the decorator function (`@coiled.function()`) vs:

```
cluster = coiled.Cluster(n_workers=20, region="us-west-2")
client = cluster.get_client()
client
```
* How do we incorporate https://medium.com/coiled-hq/processing-a-250-tb-dataset-with-coiled-dask-and-xarray-574370ba5bde ? 


### Installing the latest versions of earthaccess and coiled

**NOTE**: Restart the kernel and clear the output after running the next cell.

In [None]:
!pip install coiled==0.9.26

# -y skips the confirmation prompt, which would otherwise block the notebook cell
!pip uninstall -y earthaccess
!pip install git+https://github.com/jrbourbeau/earthaccess.git@pickle-logic

## **2. Tutorial steps**

Resources: each granule is approximately 60-120 MB, and a month of data for the Ross Sea returns 59 granules (~4.6 GB). We should use an instance with preferably double the memory of the approximate data size we process.
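
The sizing rule above can be turned into a quick calculation. This is only a rough sketch: the helper name `estimate_memory_gib` is hypothetical, the granule sizes and count are the ones quoted above, and MB are treated as MiB for simplicity.

```python
def estimate_memory_gib(n_granules, granule_mb=(60, 120), safety_factor=2.0):
    """Return a (low, high) memory estimate in GiB for holding n_granules at once.

    granule_mb is the (min, max) granule size in MB; safety_factor=2.0
    encodes the "double the data size" rule of thumb.
    """
    low_mb, high_mb = granule_mb
    mb_per_gib = 1024  # rough conversion, treating MB as MiB
    low = n_granules * low_mb * safety_factor / mb_per_gib
    high = n_granules * high_mb * safety_factor / mb_per_gib
    return low, high


# One month over the Ross Sea returned 59 granules
low, high = estimate_memory_gib(59)
print(f"Suggested memory: {low:.1f} to {high:.1f} GiB")
```

Under these assumptions, even the upper bound stays below 16 GiB.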

### **Import Packages**

In [None]:
# For Coiled cloud compute
import coiled

# For searching NASA data
import earthaccess

# For reading data, analysis and plotting
import xarray as xr
import numpy as np
import geopandas as gpd
import pandas as pd
import hvplot.xarray

import pprint
from affine import Affine
from pyproj import CRS

from pqdm.threads import pqdm

print(coiled.__version__)
print(earthaccess.__version__)

### **Authenticate**

In [None]:
auth = earthaccess.login()

### **Search for ICESat-2 ATL10 data**

Using spatial/temporal range from https://icesat-2-2023.hackweek.io/tutorials/sea_ice/1_sea_ice_tutorial.html :


```
# Spatial extent: Ross Sea, Antarctica
spatial_extent = [-180, -78, -160, -74]

# Time range
date_range = ['2019-09-16','2019-09-16'] # first time period
# date_range = ['2019-11-13','2019-11-13'] # second time period
```

In [None]:
atl10 = {}
total_results = 0

for year in range(2019,2020):
    
    print(f"Searching year {year} ...")
    granules = earthaccess.search_data(
        short_name = 'ATL10',
        version = '006',
        cloud_hosted = True,
        bounding_box = (-180, -78, -160, -74),
        temporal = (f'{year}-09-01',f'{year}-09-30'),
    )
    total_results += len(granules)
    atl10[str(year)] = granules
print(f"Total: {total_results}")

In [None]:
r = [display(r) for r in atl10["2019"][0:2]]

### **Extract freeboard segments**

We now create a GeoPandas GeoDataFrame from our search results.

Because ATL10 is not a gridded product, we need to extract coordinates and variables from their groups inside the HDF5 file.
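
As a rough illustration of that group layout, the sketch below builds the HDF5 paths for the strong beams of a granule from its `orbit_info/sc_orient` flag. It uses the same orientation-to-beam mapping as the read function later in this notebook. The helper names `strong_beams` and `dataset_paths` are hypothetical, and only three of the variables are listed.

```python
from itertools import product

# A subset of the variables read in this tutorial, relative to each beam group
VARIABLES = [
    "freeboard_segment/latitude",
    "freeboard_segment/longitude",
    "freeboard_segment/beam_fb_height",
]


def strong_beams(sc_orient):
    """Return the strong-beam group names for a spacecraft orientation flag.

    0 selects the left beams and 1 selects the right beams; any other
    value yields no beams.
    """
    if sc_orient == 0:
        return [f"gt{i}l" for i in (1, 2, 3)]
    if sc_orient == 1:
        return [f"gt{i}r" for i in (1, 2, 3)]
    return []


def dataset_paths(sc_orient, variables=VARIABLES):
    """Combine every strong beam with every variable into a full HDF5 path."""
    return ["/".join(p) for p in product(strong_beams(sc_orient), variables)]


paths = dataset_paths(1)
print(paths[0])
```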

#### Open the files using the `earthaccess.open` method

The auth object created at the start of the notebook is used to provide Earthdata Login authentication and AWS credentials.

In [None]:
file_tree = {}

for year, granules in atl10.items():
    file_tree[year] = earthaccess.open(granules)


In [None]:
# files[0].f.s3.storage_options
print(file_tree["2019"][0].f.info())

In [None]:
import h5py

with h5py.File(file_tree["2019"][0],'r') as f:

    time = f["gt1r"]['freeboard_segment/delta_time'][:]
time

In [None]:
ds = xr.open_dataset(file_tree["2019"][0], group="gt1r/freeboard_segment/")
ds

### Pre-warming the Coiled instance

Once this notebook runs on Coiled, it would be good to start the cluster beforehand so that the first real call does not have to wait for it.

In [None]:
# @coiled.function(region="us-west-2",
#                  memory="16 GiB")
# def trivial(param):
#     print(param)
#     return param

In [None]:
# trivial("test")

In [None]:
## Based on the READ function from Younghyun Koo for the sea ice tutorial at the IS2 hackweek

# @coiled.function(region="us-west-2",
#                  memory="16 GiB")
def read_atl10_local(files, executors):
    """Returns a consolidated GeoPandas dataframe for a set of ATL10 file pointers.
    
    Parameters:
        files (list[S3FSFile]): list of authenticated fsspec file references to ATL10 on S3 (via earthaccess)
        executors (int): number of threads
    
    """
    from h5coro import h5coro, s3driver, filedriver
    from itertools import product
    import geopandas as gpd
    import pandas as pd
    import numpy as np
    import gc
    
    def read_atl10(file):
        # Create a list for saving ATL10 beam track data
        tracks = []
        credentials = {"aws_access_key_id": file.s3.storage_options["key"],
                       "aws_secret_access_key": file.s3.storage_options["secret"],
                       "aws_session_token": file.s3.storage_options["token"]}
        
        f = h5coro.H5Coro(file.info()["name"], s3driver.S3Driver, credentials=credentials)
        f.readDatasets(datasets=["orbit_info/sc_orient"], block=True)
        
        # Check the orbit orientation
        orient = f['orbit_info/sc_orient'][0]

        if orient == 0:
            strong_beams = [f"gt{i}l" for i in [1, 2, 3]]
        elif orient == 1:
            strong_beams = [f"gt{i}r" for i in [1, 2, 3]]
        else:
            strong_beams = []
            
            
        datasets = ["freeboard_segment/latitude",
                    "freeboard_segment/longitude",
                    "freeboard_segment/delta_time",
                    "freeboard_segment/seg_dist_x",
                    "freeboard_segment/heights/height_segment_length_seg",
                    "freeboard_segment/beam_fb_height",
                    "freeboard_segment/heights/height_segment_type"]
            
        ds_list = ["/".join(p) for p in list(product(strong_beams, datasets))]
        f.readDatasets(datasets=ds_list, block=True)
        
        # not taking into account 37 leap seconds
        gps_epoch = pd.to_datetime('1980-01-06 00:00:00')
    
        for beam in strong_beams:
            lat = f[f'{beam}/freeboard_segment/latitude'][:]
            lon = f[f'{beam}/freeboard_segment/longitude'][:]
            gps_since_epoch = f[f'{beam}/freeboard_segment/delta_time'][:]
            seg_x = f[f'{beam}/freeboard_segment/seg_dist_x'][:] / 1000 # (m to km)
            seg_len = f[f'{beam}/freeboard_segment/heights/height_segment_length_seg'][:]
            fb = f[f'{beam}/freeboard_segment/beam_fb_height'][:]
            surface_type = f[f'{beam}/freeboard_segment/heights/height_segment_type'][:]
            fb[fb>100] = np.nan
            
            # Offset in seconds from the GPS epoch to the ICESat-2 epoch (2018-01-01); see the ATL10 ATBD
            is2_epoch = 1.1988e+9
            
            date_time = gps_epoch + pd.to_timedelta(gps_since_epoch+is2_epoch, unit='s')

            df = pd.DataFrame({'lat': lat, 'lon': lon, 'time': date_time, 'seg_x': seg_x, 'seg_len': seg_len,
                              'freeboard': fb, 'stype': surface_type})
            df['beam'] = beam
            df = df.dropna().reset_index(drop = True)
            gdf = gpd.GeoDataFrame(
                    df, geometry=gpd.points_from_xy(df.lon, df.lat), crs="EPSG:4326"
            )
            del gdf["lat"]
            del gdf["lon"]

            gc.collect()
            tracks.append(gdf)
        # print(f"Done with {file.info()['name']}")
        return tracks
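    # Read the files in parallel threads; each result is a list of per-beam GeoDataFrames
    results = pqdm(files, read_atl10, n_jobs=executors)
    # Flatten every beam from every file, skipping entries that are not lists (e.g. failed reads)
    combined = pd.concat([gdf for t in results if isinstance(t, list) for gdf in t])
    return combined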

The idea would be to process each year on its own Dask worker.

In [None]:
%%time
tracks = read_atl10_local(file_tree["2019"], executors=16)

In [None]:
tracks

In [None]:
tracks.info(memory_usage='deep')

### For future I/O-efficient operations we save the GeoDataFrame as Parquet

In [None]:
tracks.to_parquet("atl10-2019.parquet")

#### GeoPandas read function

The `read_atl10_local` function above extracts latitude, longitude, time, segment distance, segment length, surface type, and freeboard height. See the [NSIDC's ATL10 User Guide](https://nsidc.org/sites/default/files/documents/user-guide/atl10-v006-userguide.pdf) for more details on these variables.
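
The time handling in the read function may be easier to follow in isolation. In ICESat-2 products, `delta_time` counts seconds from the ATLAS Standard Data Product epoch, 2018-01-01 00:00:00 UTC. The sketch below uses only the standard library, ignores leap seconds, and uses a hypothetical helper name `delta_time_to_utc`. It also checks that adding 1.1988e9 seconds to the GPS epoch, as the read function does, lands on the same instant.

```python
from datetime import datetime, timedelta, timezone

# delta_time counts seconds from this instant
ATLAS_SDP_EPOCH = datetime(2018, 1, 1, tzinfo=timezone.utc)
GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)


def delta_time_to_utc(delta_time_s):
    """Convert an ICESat-2 delta_time value (seconds) to a UTC datetime."""
    return ATLAS_SDP_EPOCH + timedelta(seconds=float(delta_time_s))


# The offset used in the read function maps the GPS epoch onto the ATLAS SDP epoch
offset_matches = GPS_EPOCH + timedelta(seconds=1.1988e9) == ATLAS_SDP_EPOCH

# 2019-09-16 00:00:00 UTC is 623 days after the epoch
example = delta_time_to_utc(623 * 86400)
print(offset_matches, example.isoformat())
```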

### **Calculate grid indices of segment centers**

Using pyproj and Affine

### **Assign to grid and calculate grid cell mean**
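
A minimal sketch of this step, assuming each segment already carries integer `row` and `col` grid indices (hypothetical column names) next to the `freeboard` column produced by the read function. The values below are made up for illustration only.

```python
import pandas as pd

# Toy stand-in for the segment table: grid indices plus freeboard in meters
segments = pd.DataFrame(
    {
        "row": [235, 235, 235, 236],
        "col": [147, 147, 148, 148],
        "freeboard": [0.20, 0.40, 0.30, 0.50],
    }
)

# Group segments by grid cell, then take the mean and the number of segments
cell_stats = (
    segments.groupby(["row", "col"])["freeboard"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "mean_freeboard", "count": "n_segments"})
)
print(cell_stats)
```

Keeping the segment count alongside the mean makes it possible to mask cells with too few observations later on.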

## **3. Learning outcomes recap (optional)**

Provide a brief summary of the learning outcomes of the tutorial


## **4. Additional resources (optional)**

List some additional resources for users to consult, if applicable/desired.

________

### **When your tutorial is ready for review,  please read our [Contributor Guide](https://github.com/nsidc/NSIDC-Data-Tutorials/blob/main/contributor_guide.md) for next steps.**