## Reading ICESat-2 Data
### Basic Data Read-in Example Notebook
This notebook illustrates the use of icepyx for reading ICESat-2 data files and loading them into a data object.
Currently the default data object is an Xarray Dataset, with ongoing work to provide support for other data object types.

### Motivation
Most often, when you open a data file, you must specify the underlying data structure and how you'd like the information to be read in.
A simple example, such as opening a csv or similarly delimited file, is telling the software whether the data contains a header row, what the data type is (string, double, float, boolean, etc.) for each column, what the delimiter is, and which columns or rows you'd like to load.
Many ICESat-2 data readers are quite manual in nature, requiring that you accurately type out a list of string paths to the various data variables.
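
To make the delimited-file case concrete, here is a minimal sketch using only Python's standard library. The sample data and column names (`segment_id`, `h_li`, `quality`) are made up for illustration; the point is that the delimiter, the header row, the columns to keep, and each column's type all have to be spelled out by hand.

```python
import csv
import io

# Made-up, semicolon-delimited sample data with a header row
raw = "segment_id;h_li;quality\n101;12.5;0\n102;13.1;1\n"

# Column name -> type to convert to; columns not listed here are not loaded
wanted_types = {"segment_id": int, "h_li": float}

# DictReader treats the first row as the header; the delimiter must be given explicitly
reader_csv = csv.DictReader(io.StringIO(raw), delimiter=";")
rows = [
    {name: convert(row[name]) for name, convert in wanted_types.items()}
    for row in reader_csv
]
print(rows)
```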

icepyx simplifies this process by relying on its awareness of ICESat-2 specific data file variable storage structure.
Instead of needing to manually iterate through the beam pairs, you can provide a few options to the `Read` object and icepyx will do the heavy lifting for you (as detailed in this notebook).
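
For comparison, the manual approach looks roughly like the sketch below: every variable path is assembled by hand for each of the six beams. The group and variable names follow the ATL06 layout used later in this notebook (`land_ice_segments`, `h_li`, `latitude`, `longitude`); `manual_paths` is just an illustrative name.

```python
# The six ICESat-2 beam groups (three pairs, each with a left and a right beam)
beams = ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]
variables = ["h_li", "latitude", "longitude"]

# Build the full HDF5 path of every variable for every beam by hand
manual_paths = [
    f"{beam}/land_ice_segments/{var}" for beam in beams for var in variables
]

print(len(manual_paths))
print(manual_paths[:3])
```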

### Approach
If you're interested in what's happening under the hood: icepyx turns your instructions into something called a catalog, then uses the Intake library and the catalog to actually load the data into memory. Specifically, icepyx creates an [Intake](https://intake.readthedocs.io/en/latest/) data [catalog](https://intake.readthedocs.io/en/latest/catalog.html) for each requested variable and then merges the read-in data from each of the variables to create a single data object.

Intake catalogs are powerful (and the tool we selected) because they can be saved, shared, modified, and reused to reproducibly read in a set of data files in a consistent way as part of an analysis workflow.
This approach streamlines the transition between data sources (local/downloaded files or, ultimately, cloud/bucket access) and data object types (e.g. [Xarray Dataset](http://xarray.pydata.org/en/stable/generated/xarray.Dataset.html) or [GeoPandas GeoDataFrame](https://geopandas.org/docs/reference/api/geopandas.GeoDataFrame.html)).

#### Credits
* original notebook by: Jessica Scheick
* notebook contributors: 
* templates for default ICESat-2 Intake catalogs from: [Wei Ji]() and [Tian]().


### Import packages, including icepyx

In [None]:
%load_ext autoreload
import icepyx as ipx
%autoreload 2

In [None]:
import os
import fnmatch
import glob
import pathlib
import fsspec
from fsspec.implementations.local import LocalFileSystem

https://github.com/OSOceanAcoustics/echopype/blob/ab5128fb8580f135d875580f0469e5fba3193b84/echopype/utils/io.py
https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=get_map#fsspec.spec.AbstractFileSystem.glob
https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/implementations/local.html
https://github.com/OSOceanAcoustics/echopype/blob/ab5128fb8580f135d875580f0469e5fba3193b84/echopype/convert/api.py#L380
https://echopype.readthedocs.io/en/stable/convert.html


In [None]:
lfs = LocalFileSystem()

In [None]:
lfs

In [None]:
lfs.ls("/")

In [None]:
# NOTE: `path` is defined in the "Set data source path" section below; run that cell first
fsmap = fsspec.get_mapper(str(path))
output_fs = fsmap.fs

In [None]:
# ls() needs a path to list; use the root of the mapper created above
output_fs.ls(fsmap.root)

In [None]:
output_fs

In [None]:
# Adapted from the echopype output-path handling linked above.
# The function name is a placeholder chosen for this notebook.
from pathlib import Path


def resolve_output_path(source_file, file_format, save_path, output_storage_options=None):
    """Return (out_dir, out_path) for writing `source_file` under `save_path`."""
    if not isinstance(save_path, (str, Path)):
        raise TypeError("save_path must be a string or Path")

    fsmap = fsspec.get_mapper(str(save_path), **(output_storage_options or {}))
    output_fs = fsmap.fs

    # Use the full path such as s3://... if it's not local, otherwise use root
    if isinstance(output_fs, LocalFileSystem):
        root = fsmap.root
    else:
        root = save_path
    if Path(root).suffix == "":  # directory
        out_dir = root
        out_path = os.path.join(root, Path(source_file).stem + file_format)
    else:  # file
        out_dir = os.path.dirname(root)
        out_path = os.path.join(out_dir, Path(root).stem + file_format)
    return out_dir, out_path

### Set data source path

Provide a full path to the data to be read in (i.e. opened).
Currently accepted inputs are:
* a directory
* a single file

All files to be read in *must* have a consistent filename pattern.
If a directory is supplied as the data source, all files in any subdirectories that match the filename pattern will be included.
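
As a rough illustration of that directory behaviour (not icepyx's actual implementation), the sketch below walks a directory tree and keeps the files whose names match a glob-style pattern. The helper name `find_matching_files` and the glob pattern are assumptions for this example; icepyx itself takes the `{}`-style pattern described in the next section.

```python
import fnmatch
import pathlib
import tempfile


def find_matching_files(directory, glob_pattern):
    """Return sorted paths of files under `directory` (recursively) whose names match."""
    root = pathlib.Path(directory)
    return sorted(
        str(p) for p in root.rglob("*")
        if p.is_file() and fnmatch.fnmatch(p.name, glob_pattern)
    )


# Demonstrate on a throwaway directory containing empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    sub = pathlib.Path(tmp, "test_subdir")
    sub.mkdir()
    pathlib.Path(tmp, "ATL06-20181214041627-Sample.h5").touch()
    (sub / "ATL06-20190101000000-Sample.h5").touch()
    (sub / "notes.txt").touch()

    matches = find_matching_files(tmp, "ATL06-*-Sample.h5")
    match_names = [pathlib.Path(m).name for m in matches]
    print(match_names)
```

The file in `test_subdir` is picked up alongside the top-level one, while `notes.txt` is skipped because its name does not match.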

S3 bucket data access is currently under development and requires that you be registered with NSIDC as a beta tester for cloud-based ICESat-2 data.
icepyx is working to ensure a smooth transition to working with remote files.
We'd love your help exploring and testing these features as they become available!

In [None]:
urlpath = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2019/11/30/ATL03_20191130221008_09930503_004_01.h5'

In [None]:
filepath = '/Users/jessica/computing/icepyx/test_data/ATL06-20181214041627-Sample.h5'

In [None]:
path = '/Users/jessica/computing/icepyx/test_data/'

In [None]:
filepath2 = '/Users/jessica/computing/icepyx/test_data/test_subdir/ATL03_20191130221008_09930503_004_01.h5'

### Create a filename pattern for your data files

Files provided by NSIDC match the format `"ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5"`, where each pair of curly brackets holds a parameter name (left of the colon) and its character length or format (right of the colon).
Some of this information is used during data opening to help correctly read and label the data within the data structure, particularly when multiple files are opened simultaneously.

By default, icepyx will assume your filenames follow the default format.
However, you can easily read in other ICESat-2 data files by supplying your own filename pattern.
For instance, `pattern="ATL{product:2}-{datetime:%Y%m%d%H%M%S}-Sample.h5"`.
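
To show what information such a pattern carries, here is a small sketch that pulls the same fields out of a default-format filename with a regular expression. This is not how icepyx parses filenames internally; the regex and the `parse_is2_filename` helper are assumptions made for illustration.

```python
import re
from datetime import datetime

# Regex equivalent of the default pattern: each named group mirrors a {name:width} field
DEFAULT_REGEX = re.compile(
    r"ATL(?P<product>\d{2})_(?P<datetime>\d{14})_"
    r"(?P<rgt>\d{4})(?P<cycle>\d{2})(?P<orbitsegment>\d{2})_"
    r"(?P<version>\d{3})_(?P<revision>\d{2})\.h5"
)


def parse_is2_filename(filename):
    """Return the filename fields as a dict, or None if the name does not match."""
    match = DEFAULT_REGEX.fullmatch(filename)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the timestamp using the same format string as the pattern
    fields["datetime"] = datetime.strptime(fields["datetime"], "%Y%m%d%H%M%S")
    return fields


info = parse_is2_filename("ATL03_20191130221008_09930503_004_01.h5")
print(info)
```

A filename that does not follow the default layout, such as `ATL06-20181214041627-Sample.h5`, returns `None` here, which is the situation where you would supply your own pattern.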

In [None]:
pattern = 'ATL06-{datetime:%Y%m%d%H%M%S}-Sample.h5'
# pattern = 'ATL{product:2}-{datetime:%Y%m%d%H%M%S}-Sample.h5'

In [None]:
pattern = "ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5"

### Create an icepyx read object

In [None]:
reader = ipx.Read(path, "ATL06", pattern) # or ipx.Read(filepath, "ATLXX") if your filenames match the default pattern

In [None]:
# DEL
reader._filelist

### Specify variables to be read in

To load your data into memory or prepare it for analysis, icepyx needs to know which variables you'd like to read in.
If you've used icepyx to download data from NSIDC with variable subsetting (which is the default), then you may already be familiar with the icepyx `Variables` module and how to create and modify lists of variables.
We showcase a specific case here, but we encourage you to check out [the subsetting example]() for a thorough tour of how to create and manipulate lists of ICESat-2 variable paths (examples are provided for multiple data products).

You can use a default list that loads commonly used variables for your data product, or create your own list of variables to be read in.
icepyx will determine what variables are available for you to read in by creating a list from one of your source files.
If you have multiple files that you're reading in, icepyx will automatically generate a list of filenames and take the first one to get the list of available variables.

Thus, if you have different variables available across files (even from the same data product), you may run into issues and need to come up with a workaround (we can help you do so!).
We anticipate most users will have the minimum set of variables they are seeking to load available across all data files, so we're not currently developing this feature.
Please get in touch if it would be a helpful feature for you or if you encounter this problem!
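
One possible workaround, sketched here with made-up data, is to compare the variable lists of your files yourself and request only the variables they all share. The per-file sets below are hypothetical stand-ins for the variable paths you would collect from each file; none of the names come from icepyx.

```python
# Hypothetical variable paths found in each file (made up for illustration)
available = {
    "file_a.h5": {
        "gt1l/land_ice_segments/h_li",
        "gt1l/land_ice_segments/latitude",
        "gt1l/land_ice_segments/longitude",
    },
    "file_b.h5": {
        "gt1l/land_ice_segments/h_li",
        "gt1l/land_ice_segments/latitude",
    },
}

# Variables present in every file are safe to request for all of them
common = set.intersection(*available.values())

# Variables missing from each file, relative to everything seen in any file
everything = set.union(*available.values())
missing = {fname: sorted(everything - found) for fname, found in available.items()}

print(sorted(common))
print(missing)
```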

For a basic case, let's say we want to read in height, latitude, and longitude for all beam pairs.
We create our variables list as

In [None]:
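# To read these variables for all beams, omit beam_list:
# reader.vars.append(var_list=['h_li', "latitude", "longitude"])
# Here we request them for a single beam (gt1l) only.
reader.vars.append(beam_list=['gt1l'], var_list=['h_li', "latitude", "longitude"])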

Then we can view a dictionary of the variables we'd like to read in.

In [None]:
reader.vars.wanted
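
If it helps to picture the result, the sketch below mimics the kind of dictionary this returns and groups the requested variables by their parent group. The dictionary contents are an assumption about the general shape (variable name mapped to a list of full paths), written out by hand rather than produced by icepyx; `wanted_example` and `split_path` are illustrative names.

```python
# Hand-written stand-in for a wanted-variables dictionary (assumed shape)
wanted_example = {
    "h_li": ["gt1l/land_ice_segments/h_li"],
    "latitude": ["gt1l/land_ice_segments/latitude"],
    "longitude": ["gt1l/land_ice_segments/longitude"],
}


def split_path(full_path):
    """Split 'group/subgroup/variable' into ('group/subgroup', 'variable')."""
    grp_name, _, var_name = full_path.rpartition("/")
    return grp_name, var_name


# Collect the variables requested from each group
by_group = {}
for path_list in wanted_example.values():
    for full_path in path_list:
        grp_name, var_name = split_path(full_path)
        by_group.setdefault(grp_name, []).append(var_name)

print(by_group)
```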

Don't forget - if you need to start over generating your wanted variables list, it's easy!

In [None]:
reader.vars.remove(all=True)

In [None]:
wanted_groups = ipx.core.variables.list_of_dict_vals(reader.vars.wanted)

In [None]:
vgrp, paths = ipx.core.variables.Variables.parse_var_list(wanted_groups, tiered=True)
print(vgrp)
print(paths)
print(set(paths))

In [None]:
print(list(vgrp.keys()))

In [None]:
# idx = [1 if x == 'ancillary_data' else 0 for i,x in enumerate(paths[0])]
# idx = [i for i,x in enumerate(paths[0]) if x == 'ancillary_data']

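# Pair up the first two path tiers entry by entry before comparing
# (a list cannot be concatenated with '/' directly)
grp_spec_vars = [list(vgrp.keys())[i] for i, (tier0, tier1) in enumerate(zip(paths[0], paths[1])) if f"{tier0}/{tier1}" == 'gt1l/land_ice_segments']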

In [None]:
grp_spec_vars

In [None]:
var_path = 'gt1l/land_ice_segments'

In [None]:
_, paths = ipx.core.variables.Variables.parse_var_list(wanted_groups)

In [None]:
print(_)
print(paths)

In [None]:
reader.vars._iter_vars(reader.vars.wanted, {}, vgrp)

In [None]:
reader.vars._iter_paths(reader.vars.wanted, {}, vgrp, paths[0], paths[1])

In [None]:
reader.vars.avail()

### Loading your data

Now that you've set up all the options, you're ready to read your ICESat-2 data into memory!

In [None]:
ds = reader.load()

In [None]:
ds[0]

In [None]:
# next step (but another PR): xarray extension with icesat-2 aware functions (like "get_strong_beams", etc.)

In [None]:
# ds[0]['crossing_time'].values
ds[0]["rgt"].values

In [None]:
ds[0][0][0].variables

In [None]:
import h5py

In [None]:
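# Beam groups to try; any that are missing from the file are skipped
group = ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]

with h5py.File(filepath, "r") as fi:
    for k in range(len(group)):
        try:
            # Read in variables of interest (more can be added!)
            dac = fi[group[k] + "/land_ice_segments/geophysical/dac"][:]
            lat = fi[group[k] + "/land_ice_segments/latitude"][:]
            lon = fi[group[k] + "/land_ice_segments/longitude"][:]
        except KeyError:
            # h5py raises KeyError when a group or variable is not in the file
            continue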

In [None]:
fi = h5py.File(filepath, "r")

In [None]:
print(fi["gt1l"].attrs.keys())
print(fi["gt1l"].attrs["atlas_spot_number"])
print(fi["gt1l"].attrs["sc_orientation"])

In [None]:
grp = fi["ancillary_data"]

In [None]:
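def print_item(name, obj):
    # Called once for every member of the group; returning None keeps the traversal going
    print(name, obj)

# Recursively visit everything under the ancillary_data group
grp.visititems(print_item)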

### More on Intake catalogs and the read object

As anyone familiar with ICESat-2 HDF5 files knows, one of the challenges to reading in data is looping through all of the beam pairs for each track.
The icepyx read module takes advantage of icepyx's variables module, which has some awareness of ICESat-2 data and uses that to save the user the trouble of having to loop through each beam pair.
The `reader.load()` function does this by automatically creating minimal Intake catalogs for each variable path, reading in the data, and merging each variable into a ready-to-analyze Xarray Dataset.
The Intake savvy user may wish to view the template catalog or use an existing catalog.

#### Viewing the template catalog

You can access the ICESat-2 catalog template as an attribute of the read object.

***NOTE: accessing `reader.is2catalog` creates a template with a placeholder in the 'group' parameter; thus, it will not work to actually read in data***

In [None]:
reader.is2catalog

In [None]:
reader.is2catalog.gui

#### Use an existing catalog
If you already have a catalog for your data, you can supply that when you create the read object.

In [None]:
catpath = '/Users/jessica/computing/icepyx/test_data/test_catalog.yml'
# pass the product as the second argument, as when the reader was first created
reader = ipx.Read(filepath, "ATL06", pattern, catpath)

Then, you can use the catalog you supplied by calling intake's `read` directly to read in the specified data variable.

In [None]:
ds = reader.is2catalog.read()

***NOTE: this means that you will only be able to read in a single data variable!***

To take advantage of icepyx's knowledge of ICESat-2 data nesting of beam pairs and read in multiple related variables at once, you must use the variable approach outlined earlier in this tutorial.

In [None]:
ds = reader.load()
ds

#### More customization options

If you'd like to use the icepyx ICESat-2 Catalog template to create your own customized catalog, we recommend that you access the `build_catalog` function directly, which returns an Intake Catalog instance.

This function accepts keyword arguments (kwargs) in the form of dictionaries with appropriate keys (depending on the Intake driver you are using).
The simplest version of this is specifying the variable parameters and paths of interest.
`var_paths` may contain "variables", each of which must then be further defined by `var_path_params`.
You cannot use glob-like path syntax to access variables (so `var_path = '/*/land_ice_segments'` is NOT VALID).
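
To illustrate the difference between a parameterized path and a glob, the sketch below expands a `{{laser}}`-style template once per allowed value using plain string substitution. It only imitates the idea; Intake performs its own template rendering, and `expand_var_path` is a hypothetical helper, not part of icepyx or Intake.

```python
def expand_var_path(template, name, allowed):
    """Return one concrete path per allowed value of the `{{name}}` parameter."""
    placeholder = "{{" + name + "}}"
    if placeholder not in template:
        raise ValueError(f"template has no {placeholder} parameter")
    return [template.replace(placeholder, value) for value in allowed]


beam_names = ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]
expanded = expand_var_path("/{{laser}}/land_ice_segments", "laser", beam_names)
print(expanded)
```

Each named parameter resolves to exactly one group at a time, whereas a glob such as `/*/land_ice_segments` names no parameter at all, so this helper rejects it with a `ValueError`.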

In [None]:
import icepyx.core.is2cat as is2cat

# build a custom ICESat-2 catalog with a group and parameter
cat = is2cat.build_catalog(var_paths = "/{{laser}}/land_ice_segments",
                     var_path_params = [{"name": "laser",
                                         "description": "Laser Beam Number",
                                         "type": "str",
                                         "default": "gt1l",
                                         "allowed": ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]
                                        }]
                    )

#### Saving your catalog
If you create a highly customized ICESat-2 catalog, you can use Intake's `save` to export it as a .yml file.

Don't forget you can easily use an existing catalog (such as this highly customized one you just made) to read in your data with `reader = ipx.Read(filepath, product, pattern, catalog)` (so it's as easy as re-creating your reader object with your modified catalog).

In [None]:
catpath = '/Users/jessica/computing/icepyx/test_data/test_catalog.yml'
cat.save(catpath)

In [None]:
# DEL
import intake
readcatalog = intake.open_catalog(catpath)