## Reading ICESat-2 Data
### Basic Data Read-in Example Notebook
This notebook illustrates the use of icepyx to read ICESat-2 data files and load them into a data object.
Currently the default data object is an Xarray Dataset, with ongoing work to provide support for other data object types.

To read in the data, icepyx creates an [Intake](https://intake.readthedocs.io/en/latest/) data [catalog](https://intake.readthedocs.io/en/latest/catalog.html), which can be saved, shared, modified, and reused to reproducibly read in a set of data files in a consistent way as part of an analysis workflow.
This approach streamlines the transition between data sources (local/downloaded files or, ultimately, cloud/bucket access) and data object types (e.g. [Xarray Dataset](http://xarray.pydata.org/en/stable/generated/xarray.Dataset.html) or [GeoPandas GeoDataFrame](https://geopandas.org/docs/reference/api/geopandas.GeoDataFrame.html)).

#### Credits
* original notebook by: Jessica Scheick
* notebook contributors: 
* templates for default ICESat-2 Intake catalogs from: [WeiJi]() and [Tian]().


### Import packages, including icepyx

In [None]:
%load_ext autoreload
import icepyx as ipx
%autoreload 2

In [None]:
import os
import fnmatch
import glob
import pathlib
import fsspec
from fsspec.implementations.local import LocalFileSystem

In [None]:
lfs = LocalFileSystem()

In [None]:
lfs

In [None]:
lfs.ls("/")

In [None]:
# `path` is the data directory defined in the "Set data source path" section; run that cell first
fsmap = fsspec.get_mapper(str(path))
output_fs = fsmap.fs

In [None]:
# ls() needs an explicit path; list the root directory of the mapper
output_fs.ls(fsmap.root)

In [None]:
output_fs

### Set data source path

Provide a full path to the data to be read in (i.e. opened).
Currently accepted inputs are:
* a directory
* a single file

All files to be read in *must* have a consistent filename pattern.
If a directory is supplied as the data source, all files in any subdirectories that match the filename pattern will be included.
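
The directory behavior described above can be sketched with the standard library. This is only an illustration of recursive filename matching, not icepyx's own file discovery; the helper name `find_data_files`, the `ATL*.h5` glob, and the ATL06 filename in the demo are assumptions made for this example.

```python
import tempfile
from pathlib import Path


def find_data_files(source, glob_pattern="ATL*.h5"):
    """Return a sorted list of files under `source` whose names match `glob_pattern`.

    `source` may be a single file or a directory; directories are searched
    recursively, so matching files in subdirectories are included.
    """
    source = Path(source)
    if source.is_file():
        return [source] if source.match(glob_pattern) else []
    return sorted(p for p in source.rglob(glob_pattern) if p.is_file())


# Demonstrate on a throwaway directory tree containing empty, made-up files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "test_subdir").mkdir()
    (root / "ATL06_20181214041627_11690105_004_01.h5").touch()
    (root / "test_subdir" / "ATL03_20191130221008_09930503_004_01.h5").touch()
    (root / "notes.txt").touch()  # ignored: does not match the pattern
    found = [p.name for p in find_data_files(root)]

print(found)
```

Both `.h5` files are returned, including the one in the subdirectory, while `notes.txt` is skipped.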

S3 bucket data access is currently under development and requires that you be registered with NSIDC as a beta tester for cloud-based ICESat-2 data.
icepyx is working to ensure a smooth transition to working with remote files.
We'd love your help exploring and testing these features as they become available!

In [None]:
urlpath = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2019/11/30/ATL03_20191130221008_09930503_004_01.h5'

In [None]:
filepath = '/Users/jessica/computing/icepyx/test_data/ATL06-20181214041627-Sample.h5'

In [None]:
path = '/Users/jessica/computing/icepyx/test_data/'

In [None]:
filepath2 = '/Users/jessica/computing/icepyx/test_data/test_subdir/ATL03_20191130221008_09930503_004_01.h5'

### Create a filename pattern for your data files

Files provided by NSIDC match the format `"ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5"`, where each pair of curly brackets contains a parameter name (left of the colon) and its character length or format (right of the colon).
Some of this information is used during data opening to help correctly read and label the data within the data structure, particularly when multiple files are opened simultaneously.
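
To make the default format concrete, here is a minimal sketch that pulls the same fields out of a filename with a regular expression. It mirrors the format string quoted above but is not the parser that icepyx or Intake uses; `DEFAULT_REGEX` and `parse_filename` are names invented for this example.

```python
import re
from datetime import datetime

# Regular expression mirroring the default NSIDC filename format:
# each named group corresponds to one {name:length} field in the pattern.
DEFAULT_REGEX = re.compile(
    r"ATL(?P<product>\d{2})_(?P<datetime>\d{14})_"
    r"(?P<rgt>\d{4})(?P<cycle>\d{2})(?P<orbitsegment>\d{2})_"
    r"(?P<version>\d{3})_(?P<revision>\d{2})\.h5"
)


def parse_filename(filename):
    """Split a default-format ICESat-2 filename into its named fields."""
    match = DEFAULT_REGEX.fullmatch(filename)
    if match is None:
        raise ValueError(f"{filename!r} does not follow the default filename format")
    info = match.groupdict()
    # Convert the 14-digit timestamp using the same format code as the pattern.
    info["datetime"] = datetime.strptime(info["datetime"], "%Y%m%d%H%M%S")
    return info


info = parse_filename("ATL03_20191130221008_09930503_004_01.h5")
print(info)
```

A filename such as `ATL06-20181214041627-Sample.h5` raises a `ValueError` here, which is the situation where you would supply your own pattern.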

By default, icepyx will assume your filenames follow the default format.
However, you can easily read in other ICESat-2 data files by supplying your own filename pattern.
For instance, `pattern="ATL{product:2}-{datetime:%Y%m%d%H%M%S}-Sample.h5"`.

In [None]:
# custom pattern matching the sample file in `filepath`
pattern = 'ATL{product:2}-{datetime:%Y%m%d%H%M%S}-Sample.h5'

In [None]:
# default NSIDC pattern; running this cell overrides any custom pattern set earlier
pattern = "ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5"

### Create an icepyx read object

In [None]:
reader = ipx.Read(filepath, pattern) # or ipx.Read(filepath) if your filenames match the default pattern

### Use a catalog to read-in data

When you open a data file, you must specify the underlying data structure and how you'd like the information to be read in.
A simple example of this, for instance when opening a csv or similarly delimited file, is letting the software know if the data contains a header row, what the data type is (string, double, float, boolean, etc.) for each column, what the delimiter is, and which columns or rows you'd like to be loaded.
Many ICESat-2 data readers are quite manual in nature, requiring that you accurately type out a list of string paths to the various data variables.
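
As a rough illustration of that manual approach, the sketch below spells out every variable path by hand. The variable names (`h_li`, `latitude`, `longitude`) and the helper `manual_variable_paths` are assumptions chosen for this example; check the data dictionary for your product before relying on them.

```python
# The six ICESat-2 beam groups, as listed later in this notebook.
BEAMS = ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]

# Illustrative variable names only; they are assumptions for this sketch.
VARIABLES = ["h_li", "latitude", "longitude"]


def manual_variable_paths(beams, variables, group="land_ice_segments"):
    """Build the full list of string paths a manual reader would need."""
    return [f"/{beam}/{group}/{var}" for beam in beams for var in variables]


paths = manual_variable_paths(BEAMS, VARIABLES)
print(len(paths))
print(paths[:3])
```

Even this small case needs 18 exact strings, and a single typo in any of them would cause that variable to be missed.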


Intake minimizes that effort by allowing you to instead specify string path patterns within your data structure.
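
A minimal sketch of the idea, using the `{{laser}}` placeholder syntax that appears in the catalog examples later in this notebook. Intake performs this substitution itself when a catalog entry is used; the `expand_path_pattern` helper here is a hypothetical stand-in based on plain string replacement, shown only to make the pattern concept concrete.

```python
BEAMS = ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]


def expand_path_pattern(pattern, **params):
    """Fill each {{name}} placeholder in `pattern` with the matching keyword value."""
    expanded = pattern
    for name, value in params.items():
        expanded = expanded.replace("{{" + name + "}}", str(value))
    return expanded


# One pattern stands in for six explicitly typed group paths.
group_paths = [
    expand_path_pattern("/{{laser}}/land_ice_segments", laser=beam) for beam in BEAMS
]
print(group_paths)
```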


icepyx simplifies this process one step further by relying on its awareness of how variables are stored within ICESat-2 data files.
You can use a default list that loads commonly used variables for your data product, or create your own list of variables to be read in.

These instructions for how the software should read in your data are formatted into something called a catalog.

#### Load in an existing catalog
If you already have a catalog for your data, you can supply that when you create the read object.

In [None]:
catpath = '/Users/jessica/computing/icepyx/test_data/test_catalog.yml'
reader = ipx.Read(filepath, pattern, catpath)

#### Quickly build a catalog
Alternatively, you can easily build a default or custom catalog.

In [None]:
# build a default ICESat-2 catalog
reader.build_catalog()

#### More customization options

For users wishing to further customize their Intake catalog, dictionaries with appropriate keys (depending on the Intake driver you are using) may be entered as keyword arguments (kwargs) to `build_catalog`.
The simplest version of this is specifying the variable parameters and paths of interest.
`var_paths` may contain "variables", each of which must then be further defined by `var_path_params`.
You cannot use glob-like path syntax to access variables (so `var_paths = '/*/land_ice_segments'` is NOT VALID).

In [None]:
# build a custom ICESat-2 catalog - specific path
reader.build_catalog(var_paths="/gt3r/land_ice_segments")

In [None]:
# build a custom ICESat-2 catalog - general path
reader.build_catalog(var_paths = "/{{laser}}/land_ice_segments",
                     var_path_params = [{"name": "laser",
                                         "description": "Laser Beam Number",
                                         "type": "str",
                                         "default": "gt1l",
                                         "allowed": ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]
                                        }]
                    )
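
The parameter dictionary above carries a `default` and an `allowed` list. The sketch below shows, under the assumption that a value outside `allowed` should be rejected and that a missing value falls back to `default`, how such a definition constrains what can be substituted into a path. It is not Intake's or icepyx's implementation; `check_var_path` and `resolve_parameter` are hypothetical helpers.

```python
laser_param = {
    "name": "laser",
    "description": "Laser Beam Number",
    "type": "str",
    "default": "gt1l",
    "allowed": ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"],
}


def check_var_path(var_path):
    """Reject glob-like syntax, which is not valid in a variable path."""
    if any(char in var_path for char in "*?["):
        raise ValueError(f"glob-like syntax is not allowed in var_paths: {var_path!r}")
    return var_path


def resolve_parameter(param, value=None):
    """Return `value`, or the parameter's default when no value is given."""
    chosen = param["default"] if value is None else value
    if chosen not in param["allowed"]:
        raise ValueError(f"{chosen!r} is not an allowed value for {param['name']!r}")
    return chosen


print(check_var_path("/{{laser}}/land_ice_segments"))
print(resolve_parameter(laser_param))          # falls back to the default
print(resolve_parameter(laser_param, "gt3r"))  # explicit, allowed value
```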

However, you may also add additional entries or use alternative drivers.
This approach is not recommended for those with limited familiarity with Intake catalogs.
If you find yourself needing additional customization at this point, we recommend creating a default catalog as above, exporting it (see below), modifying the underlying catalog file, and then re-creating your reader object with the modified catalog.

In [None]:
# more custom options for your ICESat-2 catalog
import intake_xarray
from intake.catalog.local import LocalCatalogEntry

engine_kwargs_dict = {
    'engine': "h5netcdf",
    'group': "/gt1l/land_ice_segments" 
}

source_args_dict = {
    'urlpath': filepath,
#     'path_as_pattern': 
    'xarray_kwargs': engine_kwargs_dict
}

sources = ['is2_local', 'is2_s3']

reader.build_catalog(entries={
                        "plugins": {"source": {"module":intake_xarray}}, #don't need this?
                        sources[0]: LocalCatalogEntry(name=sources[0],
                                description="see if this overwrites the defaults",
                                driver=intake_xarray.netcdf.NetCDFSource,
                                args=source_args_dict),
#                          'source2': LocalCatalogEntry(name="is2_test2",
#                                 description="trying to write a 2 dynamic is2 data read in catalog",
#                                 driver=intake_xarray.netcdf.NetCDFSource,
#                                 args=cat_dict)
                          })

#### Viewing your catalog

You can view the catalog you've created or loaded for reading in your `source` data files through the `catalog` attribute of the read object.

In [None]:
reader.catalog

In [None]:
import intake
import yaml


def serialize(self):
    """
    Produce YAML version of this catalog.
    Note that this is not the same as ``.yaml()``, which produces a YAML
    block referring to this catalog.
    """
    output = {"metadata": self.metadata, "sources": {},
              "name": self.name}
    for key, entry in self._entries.items():
        kw = entry._captured_init_kwargs.copy()
        kw.pop('catalog', None)
        kw['parameters'] = {k.name: k.__getstate__()['kwargs'] for k in kw.get('parameters', [])}
        # YAML cannot store a driver class, so replace it with its dotted import path;
        # drivers already given as strings are left untouched
        if isinstance(kw['driver'], type) and issubclass(kw['driver'], intake.source.base.DataSourceBase):
            kw['driver'] = ".".join([kw['driver'].__module__, kw['driver'].__name__])
        output["sources"][key] = kw
    return yaml.dump(output)

In [None]:
serialize(reader.catalog)

Once you have created your catalog, you can use it to explore your data without having to load the entire dataset into memory.
Intake provides a great Graphical User Interface (GUI) for doing so.

In [None]:
reader.catalog.gui

#### Saving your catalog
By saving your catalog as a .yml file, you'll have the exact set of "instructions" you used to read in your data.
This makes it easy to replicate your read-in, make changes to your catalog, or share it with your colleagues.

Don't forget you can easily use an existing catalog to read in your data with `reader = ipx.Read(filepath, pattern, catalog)`

In [None]:
catpath = '/Users/jessica/computing/icepyx/test_data/test_catalog.yml'
reader.catalog.save(catpath)

In [None]:
readcatalog = intake.open_catalog(catpath)

In [None]:
readcatalog["is2_local"].read()

### Loading your data

Once you've set up your catalog and used the GUI above to determine which data source you'd like to read from, you can simply use the `read()` function to create the specified data object with your desired data.
If you created the catalog using `reader.build_catalog()`, the data source is named "is2_local" by default.

In [None]:
ds = reader.catalog['is2_local'].read()
ds

In [None]:
from pathlib import Path


# Scratch helper (the function name is a placeholder); relies on the os, fsspec,
# and LocalFileSystem imports at the top of this notebook.
def build_out_path(source_file, file_format, save_path, output_storage_options=None):
    if not isinstance(save_path, (Path, str)):
        raise TypeError("save_path must be a string or Path")

    fsmap = fsspec.get_mapper(str(save_path), **(output_storage_options or {}))
    output_fs = fsmap.fs

    # Use the full path such as s3://... if it's not local, otherwise use root
    if isinstance(output_fs, LocalFileSystem):
        root = fsmap.root
    else:
        root = save_path
    if Path(root).suffix == "":  # directory
        out_dir = root
        out_path = os.path.join(root, Path(source_file).stem + file_format)
    else:  # file
        out_dir = os.path.dirname(root)
        out_path = os.path.join(out_dir, Path(root).stem + file_format)
    return out_dir, out_path

In [None]:
import intake
import intake_xarray

In [None]:
import h5netcdf

In [None]:
from intake.catalog import Catalog
from intake.catalog.local import LocalCatalogEntry

In [None]:
# notebook demo:
#     X- show that you can enter minimal (or no) inputs to build catalog
#     - show that you can override the defaults if you do more of construction on your end
#     - figure out how multiple var_paths will work...
#     X- show how to view the gui and use it to get the sources and then read in data
#     x (need to finish examples once have them from notebook) - add docstrings for build_catalog
#     x- clean up this notebook and start turning it into an example

In [None]:
reader.catalog['is2_local'].read()

In [None]:
sources = ['is2_local', 'is2_s3']
mycat = Catalog.from_dict(name="IS2-hdf5-intake-catalog",
                          description="a dynamic catalog for creating local ICESat-2 intake entries",
                          metadata={"version":1},
                          entries={
#                         "plugins": {"source": {"module":intake_xarray}}, #don't need this?
                        sources[0]: LocalCatalogEntry(name=sources[0],
                                description="trying to write a dynamic is2 data read in catalog",
                                driver=intake_xarray.netcdf.NetCDFSource,
                                args=source_args_dict),
#                          'source2': LocalCatalogEntry(name="is2_test2",
#                                 description="trying to write a 2 dynamic is2 data read in catalog",
#                                 driver=intake_xarray.netcdf.NetCDFSource,
#                                 args=cat_dict)
                          })


In [None]:
mycat.gui

In [None]:
mycat[sources[0]]

In [None]:
    args:
      urlpath: /Users/lt/Desktop/Intake_TEST/ATL06_RAW/processed_ATL06_*_{{rgt}}{{cycle}}{{orbitsegment}}_003_0*.h5
      path_as_pattern: processed_ATL06_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5
      chunks:
        delta_time: 500
      xarray_kwargs:
        combine: by_coords
        engine: h5netcdf
        group: /{{laser}}/land_ice_segments
        mask_and_scale: true
        parallel: true
    # https://intake.readthedocs.io/en/latest/catalog.html#parameter-definition
    parameters:
        rgt:
          description: ICESat-2 Reference Ground Track number
          type: str
          default: '0598'  # NEED TO BE IMPROVED !!!
          allowed: ['0598', '0095', '0406', '0537', '0659', '0467', '0467', '0156', '0598']
        cycle:
          description: Cycle number
          type: str
          default: "09"
          allowed: ["01","02","03","04","05","06","07","08","09"]
        orbitsegment:
          description: Orbital Segment
          type: str
          default: "11"
          allowed: ["10", "11", "12"]
        laser:
          description: Laser Beam Number
          type: str
          default: gt1l
          allowed: ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]
