## Subsetting ICESat-2 Data with the NSIDC Subsetter
### How to Use the NSIDC Subsetter Example Notebook
This notebook illustrates the use of icepyx for subsetting ICESat-2 data ordered through the NSIDC DAAC. We'll show how to find out what subsetting options are available and how to specify the subsetting for your order.

For more information on using icepyx to find, order, and download data, see our complementary [ICESat-2_DAAC_DataAccess_Example Notebook](https://github.com/icesat2py/icepyx/blob/master/doc/examples/ICESat-2_DAAC_DataAccess_Example.ipynb).

Questions? Be sure to check out the FAQs throughout this notebook, indicated as italic headings.

#### Credits
* notebook by: Jessica Scheick and Zheng Liu
* some source material: [NSIDC Data Access Notebook](https://github.com/ICESAT-2HackWeek/ICESat2_hackweek_tutorials/tree/master/03_NSIDCDataAccess_Steiker) by Amy Steiker and Bruce Wallin

### _What is SUBSETTING anyway?_

Anyone who's worked with geospatial data has probably encountered subsetting. Typically, we search for data wherever it is stored and download the chunks (aka granules, scenes, passes, swaths, etc.) that contain something we are interested in. Then, we have to extract from each chunk the pieces we actually want to analyze. Those pieces might be geospatial (e.g. an area of interest), temporal (e.g. certain months of a time series), and/or certain variables. This process of extracting the data we are going to use is called subsetting.

In the case of ICESat-2 data coming from the NSIDC DAAC, we can do this subsetting step on the data prior to download, reducing our number of data processing steps and resulting in smaller, faster downloads and lower storage requirements.
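To make the three kinds of subsetting concrete, here is a minimal sketch of what the local version of this step looks like. The `records` list and the `subset` helper are hypothetical and use made-up values; they are not part of icepyx or the NSIDC subsetter, and only illustrate spatial, temporal, and variable filtering in plain Python.

```python
from datetime import date

# Hypothetical records standing in for points read from a downloaded granule.
records = [
    {"lon": -52.0, "lat": 69.5, "date": date(2019, 2, 23), "height": 12.1, "quality": 0},
    {"lon": -40.0, "lat": 69.5, "date": date(2019, 2, 23), "height": 15.3, "quality": 0},
    {"lon": -50.0, "lat": 70.0, "date": date(2019, 3, 5), "height": 9.8, "quality": 1},
    {"lon": -49.0, "lat": 68.5, "date": date(2019, 2, 27), "height": 7.4, "quality": 0},
]


def subset(records, bbox, start, end, variables):
    """Keep records inside bbox and [start, end], returning only `variables`."""
    lon_min, lat_min, lon_max, lat_max = bbox
    kept = []
    for rec in records:
        in_space = lon_min <= rec["lon"] <= lon_max and lat_min <= rec["lat"] <= lat_max
        in_time = start <= rec["date"] <= end
        if in_space and in_time:
            # Variable subsetting: copy only the requested fields.
            kept.append({name: rec[name] for name in variables})
    return kept


result = subset(
    records,
    bbox=[-55, 68, -48, 71],
    start=date(2019, 2, 22),
    end=date(2019, 2, 28),
    variables=["lat", "height"],
)
print(result)
```

Doing this on the server before download means the discarded records and fields never have to be transferred or stored.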

### Import packages, including icepyx

In [5]:
import numpy as np
import xarray as xr
import pandas as pd

import h5py
import os
import json
from pprint import pprint

In [6]:
#change working directory
%cd ../

/home/jovyan


In [7]:
%load_ext autoreload
%autoreload 2

from icepyx import icesat2data as ipd

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Create an icesat2data object and log in to Earthdata

For this example, we'll be working with an atmospheric dataset (ATL09) for an area along West Greenland (Disko Bay).

In [9]:
region_a = ipd.Icesat2Data('ATL09', [-55, 68, -48, 71], ['2019-02-22', '2019-02-28'],
                           start_time='00:00:00', end_time='23:59:59')



In [None]:
region_a.earthdata_login('liuzheng','liuzheng@apl.uw.edu')

In [11]:
region_a.earthdata_login('jessica.scheick','jessica.scheick@maine.edu')

Earthdata Login password:  ········


### Discover Subsetting Options

You can see what subsetting options are available for a given dataset by calling `show_custom_options()`. The options are presented as a series of headings followed by available values in square brackets. Headings are:
* **Subsetting Options**: whether or not temporal and spatial subsetting are available for the dataset
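* **Data File Formats (Reformatting Options)**: the file formats the data can be converted to
* **Data File (Reformatting) Options Supporting Reprojection**: the formats that can be combined with reprojection
* **Data File (Reformatting) Options NOT Supporting Reprojection**: the formats that cannot be combined with reprojection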
* **Data Variables (also Subsettable)**: a dictionary of variable name keys and the paths to those variables available in the dataset.
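The variable dictionary maps each variable name to every path where that variable occurs. As a rough sketch of how such a dictionary can be queried, the snippet below uses a few entries copied from the ATL09 output in this notebook; `paths_under` is a hypothetical helper written for illustration, not an icepyx function.

```python
# A few entries copied from the ATL09 variable dictionary shown in this notebook.
variables = {
    "a_m1": ["ancillary_data/atmosphere/a_m1"],
    "aclr_true": [
        "profile_1/high_rate/aclr_true",
        "profile_2/high_rate/aclr_true",
        "profile_3/high_rate/aclr_true",
    ],
    "aclr_use_atlas": ["ancillary_data/atmosphere/aclr_use_atlas"],
}


def paths_under(variables, prefix):
    """Return {name: [paths]} restricted to paths that start with `prefix`/."""
    selected = {}
    for name, paths in variables.items():
        matches = [p for p in paths if p.startswith(prefix + "/")]
        if matches:
            selected[name] = matches
    return selected


# Everything stored under profile_2, keyed by variable name.
profile_2 = paths_under(variables, "profile_2")

# A flat, sorted list of every path in the dictionary.
all_paths = sorted(p for paths in variables.values() for p in paths)

print(profile_2)
print(len(all_paths))
```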

In [12]:
region_a.show_custom_options(dictview=True)

Subsetting options
[{'id': 'ICESAT2',
  'maxGransAsyncRequest': '2000',
  'maxGransSyncRequest': '100',
  'spatialSubsetting': 'true',
  'spatialSubsettingShapefile': 'true',
  'temporalSubsetting': 'true',
  'type': 'both'}]
Data File Formats (Reformatting Options)
['TABULAR_ASCII', 'NetCDF4-CF', 'NetCDF-3']
Reprojection Options
[]
Data File (Reformatting) Options Supporting Reprojection
['TABULAR_ASCII', 'NetCDF4-CF', 'NetCDF-3', 'No reformatting']
Data File (Reformatting) Options NOT Supporting Reprojection
[]
Data Variables (also Subsettable)
{'a_m1': ['ancillary_data/atmosphere/a_m1'],
 'a_m2': ['ancillary_data/atmosphere/a_m2'],
 'aclr_true': ['profile_1/high_rate/aclr_true',
               'profile_2/high_rate/aclr_true',
               'profile_3/high_rate/aclr_true'],
 'aclr_use_atlas': ['ancillary_data/atmosphere/aclr_use_atlas'],
 'alpha_day_pce1': ['ancillary_data/atmosphere/alpha_day_pce1'],
 'alpha_day_pce2': ['ancillary_data/atmosphere/alpha_day_pce2'],
 'alpha_day_pce3'

By default, spatial and temporal subsetting are applied to your order unless you pass `subset=False` to `order_granules()` (support for this in `download_granules()` is still pending in the code). Additional subsetting options must be specified as keyword arguments to the order/download functions.

### _Why not just download all the data and subset locally? What if I need more variables/granules?_



## Other FAQs About Subsetting

### Why do I have to provide spatial bounds even if I don't use them to subset?


### Why does the subsetter say no matching data was found?
Sometimes chunks returned in our initial search end up not containing any useful data.

## About Data Variables in an Icesat2Data object
There are two possible variable parameters associated with each ```icesat2data``` object.
1. ```order_vars```, which is for interacting with variables during data querying, ordering, and downloading activities.
2. ```file_vars```, which is for interacting with variables associated with local files [not yet implemented].

Each variables parameter (which is actually an associated variables class object) has methods to:
* get available variables, either available from the NSIDC or the file (```get_avail()``` method).
* append new variables to a wanted list (```append()``` method), allowing the user to submit a list to the NSIDC subsetter and download a smaller, reproducible dataset and/or work with a subset of the available variables in a provided file.
* remove variables from a wanted list (```remove()``` method, NOT YET IMPLEMENTED), allowing the user to customize the list of variables they want to work with/see at a given time.

Each variables instance also has a set of attributes, including ```avail``` and ```wanted```, which hold, respectively, the list of available variables (immutable, since it is based on the input dataset specifications or files) and the list of variables that the user would like extracted (updatable with the ```append``` and ```remove``` methods). We'll showcase the use of all of these methods and attributes below.
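The relationship between a fixed ```avail``` list and an editable ```wanted``` dictionary can be pictured with a toy class. This is an illustrative stand-in under simplifying assumptions (variables are matched by name only), not the icepyx variables class; `ToyVariables` and its internals are hypothetical.

```python
class ToyVariables:
    """Illustrative stand-in, NOT the icepyx variables class."""

    def __init__(self, avail):
        self._avail = tuple(avail)  # stored as a tuple so it cannot be edited in place
        self.wanted = None          # nothing requested yet

    @property
    def avail(self):
        # Hand back a copy so callers cannot change the stored list.
        return list(self._avail)

    def append(self, var_list):
        """Add every available path whose final component is in var_list."""
        if self.wanted is None:
            self.wanted = {}
        for path in self._avail:
            name = path.split("/")[-1]
            if name in var_list:
                paths = self.wanted.setdefault(name, [])
                if path not in paths:  # avoid duplicate paths
                    paths.append(path)

    def remove(self, var_list=None, all=False):
        """Drop the named variables, or reset the wanted list entirely."""
        if all:
            self.wanted = None
        elif self.wanted is not None:
            for name in var_list or []:
                self.wanted.pop(name, None)


toy = ToyVariables([
    "orbit_info/sc_orient",
    "profile_1/high_rate/latitude",
    "profile_1/low_rate/latitude",
])
toy.append(["latitude"])
toy.append(["latitude"])   # appending twice does not duplicate paths
toy.append(["sc_orient"])
toy.remove(["sc_orient"])
print(toy.wanted)
```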


### Generate the variable dictionary
Calling ```show_custom_options()``` parses the dataset XML information from NSIDC and builds the variable dictionary.

Should you need to access them outside of the methods we provide, the data variables are stored in ```region_a._cust_options['variables']```. 

In [11]:
region_a.order_vars.get_avail()

#### TESTING SECTION:
##### Set up the user-provided variable list to subset variables. Please TRY OUT the tests below.

Options for inputting variables:
1. Use a default list for the dataset (not yet fully implemented across all datasets)
2. Provide a list of variable names, which will return all path-variable combinations (e.g. longitude will return longitude for both beams for all profiles)
3. Provide a list of variable names and/or specific profiles/beams (not yet implemented).
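The tests that follow combine variable names with beam/profile names and path keywords. One plausible reading of that matching, sketched here as an assumption consistent with the outputs in this notebook rather than as icepyx's actual code, is that a path is selected when its final component is a requested variable, its first component is a requested beam, and any of its group components is a requested keyword. The `select_paths` function is hypothetical.

```python
def select_paths(avail, var_list=None, beam_list=None, keyword_list=None):
    """Return the paths in `avail` that satisfy every filter that was given."""
    selected = []
    for path in avail:
        # Split "profile_1/low_rate/latitude" into groups and a variable name.
        *groups, name = path.split("/")
        if var_list is not None and name not in var_list:
            continue
        if beam_list is not None and not (groups and groups[0] in beam_list):
            continue
        if keyword_list is not None and not any(g in keyword_list for g in groups):
            continue
        selected.append(path)
    return selected


avail = [
    "orbit_info/sc_orient",
    "orbit_info/sc_orient_time",
    "profile_1/high_rate/latitude",
    "profile_1/low_rate/latitude",
    "profile_2/high_rate/latitude",
    "profile_2/low_rate/latitude",
]

by_name = select_paths(avail, var_list=["latitude"])
by_beam = select_paths(avail, var_list=["latitude"], beam_list=["profile_2"])
by_keyword = select_paths(avail, var_list=["latitude"], keyword_list=["low_rate"])
by_group = select_paths(avail, keyword_list=["orbit_info"])

print(by_beam)
print(by_keyword)
```

Treating each filter as optional keeps a name-only request (option 2) and a name-plus-beam request (option 3) on the same code path.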

An example of each type of input is below.

#### Test 1:
Add ```latitude``` for profile 1 and 2

In [17]:
region_a.order_vars.avail

['ds_surf_type',
 'ancillary_data/atlas_sdp_gps_epoch',
 'ancillary_data/control',
 'ancillary_data/data_end_utc',
 'ancillary_data/data_start_utc',
 'ancillary_data/end_cycle',
 'ancillary_data/end_delta_time',
 'ancillary_data/end_geoseg',
 'ancillary_data/end_gpssow',
 'ancillary_data/end_gpsweek',
 'ancillary_data/end_orbit',
 'ancillary_data/end_region',
 'ancillary_data/end_rgt',
 'ancillary_data/granule_end_utc',
 'ancillary_data/granule_start_utc',
 'ancillary_data/qa_at_interval',
 'ancillary_data/release',
 'ancillary_data/start_cycle',
 'ancillary_data/start_delta_time',
 'ancillary_data/start_geoseg',
 'ancillary_data/start_gpssow',
 'ancillary_data/start_gpsweek',
 'ancillary_data/start_orbit',
 'ancillary_data/start_region',
 'ancillary_data/start_rgt',
 'ancillary_data/version',
 'ancillary_data/atmosphere/aclr_use_atlas',
 'ancillary_data/atmosphere/alpha',
 'ancillary_data/atmosphere/a_m1',
 'ancillary_data/atmosphere/a_m2',
 'ancillary_data/atmosphere/asr_cal_factor',

In [34]:
#default variables
var_dict = region_a.order_vars.append(beam_list=['profile_1','profile_2'],var_list=['latitude'], inclusive=True)
pprint(region_a.order_vars.wanted)

{'apparent_surf_reflec': ['profile_3/high_rate/apparent_surf_reflec',
                          'profile_1/high_rate/apparent_surf_reflec',
                          'profile_2/high_rate/apparent_surf_reflec'],
 'atlas_sdp_gps_epoch': ['ancillary_data/atlas_sdp_gps_epoch'],
 'bsnow_con': ['profile_3/high_rate/bsnow_con',
               'profile_3/low_rate/bsnow_con',
               'profile_1/high_rate/bsnow_con',
               'profile_1/low_rate/bsnow_con',
               'profile_2/high_rate/bsnow_con',
               'profile_2/low_rate/bsnow_con'],
 'bsnow_dens': ['profile_3/high_rate/bsnow_dens',
                'profile_1/high_rate/bsnow_dens',
                'profile_2/high_rate/bsnow_dens'],
 'bsnow_h': ['profile_3/high_rate/bsnow_h',
             'profile_3/low_rate/bsnow_h',
             'profile_1/high_rate/bsnow_h',
             'profile_1/low_rate/bsnow_h',
             'profile_2/high_rate/bsnow_h',
             'profile_2/low_rate/bsnow_h'],
 'bsnow_od': ['profile_3/h

#### Test 2:
Add ```latitude``` for profile 2 and overwrite

In [96]:
region_a.order_vars.remove(all=True)
pprint(region_a.order_vars.wanted)
region_a.order_vars.append(beam_list=['profile_2'],var_list=['latitude'])
pprint(region_a.order_vars.wanted)

None
{'atlas_sdp_gps_epoch': ['ancillary_data/atlas_sdp_gps_epoch'],
 'data_end_utc': ['ancillary_data/data_end_utc'],
 'data_start_utc': ['ancillary_data/data_start_utc'],
 'end_delta_time': ['ancillary_data/end_delta_time'],
 'granule_end_utc': ['ancillary_data/granule_end_utc'],
 'granule_start_utc': ['ancillary_data/granule_start_utc'],
 'latitude': ['profile_2/high_rate/latitude', 'profile_2/low_rate/latitude'],
 'sc_orient': ['orbit_info/sc_orient'],
 'start_delta_time': ['ancillary_data/start_delta_time']}


#### Test 2B:
Add ```latitude``` for profile 3. Note that the output below shows the profile_3 paths appended to the existing list, so the profile_2 paths are retained rather than overwritten.

In [88]:
region_a.order_vars.append(beam_list=['profile_3'],var_list=['latitude'])
pprint(region_a.order_vars.wanted)

{'atlas_sdp_gps_epoch': ['ancillary_data/atlas_sdp_gps_epoch'],
 'data_end_utc': ['ancillary_data/data_end_utc'],
 'data_start_utc': ['ancillary_data/data_start_utc'],
 'end_delta_time': ['ancillary_data/end_delta_time'],
 'granule_end_utc': ['ancillary_data/granule_end_utc'],
 'granule_start_utc': ['ancillary_data/granule_start_utc'],
 'latitude': ['profile_2/high_rate/latitude',
              'profile_2/low_rate/latitude',
              'profile_3/high_rate/latitude',
              'profile_3/low_rate/latitude'],
 'sc_orient': ['orbit_info/sc_orient'],
 'start_delta_time': ['ancillary_data/start_delta_time']}


#### Test 3:
Add ```latitude``` for all profiles with the keyword ```low_rate```, appending to the existing list

In [89]:
region_a.order_vars.append(var_list=['latitude'],keyword_list=['low_rate'])
pprint(region_a.order_vars.wanted)

{'atlas_sdp_gps_epoch': ['ancillary_data/atlas_sdp_gps_epoch'],
 'data_end_utc': ['ancillary_data/data_end_utc'],
 'data_start_utc': ['ancillary_data/data_start_utc'],
 'end_delta_time': ['ancillary_data/end_delta_time'],
 'granule_end_utc': ['ancillary_data/granule_end_utc'],
 'granule_start_utc': ['ancillary_data/granule_start_utc'],
 'latitude': ['profile_2/high_rate/latitude',
              'profile_2/low_rate/latitude',
              'profile_3/high_rate/latitude',
              'profile_3/low_rate/latitude',
              'profile_1/low_rate/latitude'],
 'sc_orient': ['orbit_info/sc_orient'],
 'start_delta_time': ['ancillary_data/start_delta_time']}


#### Before Test 4:
Go back to the state of Test 2 by removing ```latitude``` for profiles 1 and 3, leaving ```latitude``` for profile 2 only.

In [98]:
region_a.order_vars.remove(beam_list=['profile_1', 'profile_3'], var_list=['latitude'])
pprint(region_a.order_vars.wanted)

{'atlas_sdp_gps_epoch': ['ancillary_data/atlas_sdp_gps_epoch'],
 'data_end_utc': ['ancillary_data/data_end_utc'],
 'data_start_utc': ['ancillary_data/data_start_utc'],
 'end_delta_time': ['ancillary_data/end_delta_time'],
 'granule_end_utc': ['ancillary_data/granule_end_utc'],
 'granule_start_utc': ['ancillary_data/granule_start_utc'],
 'latitude': ['profile_2/high_rate/latitude', 'profile_2/low_rate/latitude'],
 'sc_orient': ['orbit_info/sc_orient'],
 'start_delta_time': ['ancillary_data/start_delta_time']}


#### Test 5:
Append ```latitude``` for profile 3 and ```low_rate``` only

In [91]:
region_a.order_vars.append(beam_list=['profile_3'],var_list=['latitude'],keyword_list=['low_rate'])
pprint(region_a.order_vars.wanted)

{'atlas_sdp_gps_epoch': ['ancillary_data/atlas_sdp_gps_epoch'],
 'data_end_utc': ['ancillary_data/data_end_utc'],
 'data_start_utc': ['ancillary_data/data_start_utc'],
 'end_delta_time': ['ancillary_data/end_delta_time'],
 'granule_end_utc': ['ancillary_data/granule_end_utc'],
 'granule_start_utc': ['ancillary_data/granule_start_utc'],
 'latitude': ['profile_2/high_rate/latitude',
              'profile_2/low_rate/latitude',
              'profile_3/low_rate/latitude'],
 'sc_orient': ['orbit_info/sc_orient'],
 'start_delta_time': ['ancillary_data/start_delta_time']}


#### Test 6:
Add ```sc_orient_time``` under ```orbit_info```.

In [92]:
region_a.order_vars.append(keyword_list=['orbit_info'],var_list=['sc_orient_time'])
pprint(region_a.order_vars.wanted)

{'atlas_sdp_gps_epoch': ['ancillary_data/atlas_sdp_gps_epoch'],
 'data_end_utc': ['ancillary_data/data_end_utc'],
 'data_start_utc': ['ancillary_data/data_start_utc'],
 'end_delta_time': ['ancillary_data/end_delta_time'],
 'granule_end_utc': ['ancillary_data/granule_end_utc'],
 'granule_start_utc': ['ancillary_data/granule_start_utc'],
 'latitude': ['profile_2/high_rate/latitude',
              'profile_2/low_rate/latitude',
              'profile_3/low_rate/latitude'],
 'sc_orient': ['orbit_info/sc_orient'],
 'sc_orient_time': ['orbit_info/sc_orient_time'],
 'start_delta_time': ['ancillary_data/start_delta_time']}


#### Test 7:
Add all variables under ```orbit_info```; the path to ```sc_orient_time``` should not be duplicated.

In [94]:
region_a.order_vars.append(keyword_list=['orbit_info'],inclusive=True)
pprint(region_a.order_vars.wanted)

{'atlas_sdp_gps_epoch': ['ancillary_data/atlas_sdp_gps_epoch'],
 'crossing_time': ['orbit_info/crossing_time'],
 'cycle_number': ['orbit_info/cycle_number'],
 'data_end_utc': ['ancillary_data/data_end_utc'],
 'data_start_utc': ['ancillary_data/data_start_utc'],
 'end_delta_time': ['ancillary_data/end_delta_time'],
 'granule_end_utc': ['ancillary_data/granule_end_utc'],
 'granule_start_utc': ['ancillary_data/granule_start_utc'],
 'lan': ['orbit_info/lan'],
 'latitude': ['profile_2/high_rate/latitude',
              'profile_2/low_rate/latitude',
              'profile_3/low_rate/latitude'],
 'orbit_number': ['orbit_info/orbit_number'],
 'rgt': ['orbit_info/rgt'],
 'sc_orient': ['orbit_info/sc_orient'],
 'sc_orient_time': ['orbit_info/sc_orient_time'],
 'start_delta_time': ['ancillary_data/start_delta_time']}


#### Test 8:
Add all defaults for all beams and all keywords. After this, you have to reinitialize ```region_a``` and regenerate the variable dictionary to run the above tests again (unless you set ```append=False```).

In [95]:
region_a.order_vars.append(defaults=True)
pprint(region_a.order_vars.wanted)

{'apparent_surf_reflec': ['profile_1/high_rate/apparent_surf_reflec',
                          'profile_2/high_rate/apparent_surf_reflec',
                          'profile_3/high_rate/apparent_surf_reflec'],
 'atlas_sdp_gps_epoch': ['ancillary_data/atlas_sdp_gps_epoch'],
 'bsnow_con': ['profile_1/high_rate/bsnow_con',
               'profile_1/low_rate/bsnow_con',
               'profile_2/high_rate/bsnow_con',
               'profile_2/low_rate/bsnow_con',
               'profile_3/high_rate/bsnow_con',
               'profile_3/low_rate/bsnow_con'],
 'bsnow_dens': ['profile_1/high_rate/bsnow_dens',
                'profile_2/high_rate/bsnow_dens',
                'profile_3/high_rate/bsnow_dens'],
 'bsnow_h': ['profile_1/high_rate/bsnow_h',
             'profile_1/low_rate/bsnow_h',
             'profile_2/high_rate/bsnow_h',
             'profile_2/low_rate/bsnow_h',
             'profile_3/high_rate/bsnow_h',
             'profile_3/low_rate/bsnow_h'],
 'bsnow_od': ['profile_1/h

In [None]:
#variable names + beams/profiles
###STILL NEED TO MAKE THE BELOW POSSIBLE IN THE CODE

### Set parameters and download

In [None]:
region_a.build_CMR_params()
region_a.build_reqconfig_params('download')

In [None]:
region_a.build_subset_params(**{'Coverage':var_dict})
region_a.subsetparams

In [None]:
# Identical to the block above, but passes the keyword directly instead of unpacking a dictionary
region_a.build_subset_params(Coverage=var_dict)
region_a.subsetparams
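The two cells above differ only in how the keyword argument is passed. The sketch below shows that equivalence in plain Python with a hypothetical `build_params` stand-in, and then shows one way a variable dictionary could be flattened into a single string. The comma-separated, slash-prefixed format is an assumption about what the subsetter expects, not something this notebook confirms, and `to_coverage_string` is not an icepyx function.

```python
def build_params(**kwargs):
    """Hypothetical stand-in that just returns the keyword arguments it received."""
    return dict(kwargs)


var_dict = {
    "latitude": ["profile_2/high_rate/latitude", "profile_2/low_rate/latitude"],
    "sc_orient": ["orbit_info/sc_orient"],
}

# Unpacking a dictionary and passing the keyword directly give the same call.
unpacked = build_params(**{"Coverage": var_dict})
keyword = build_params(Coverage=var_dict)


def to_coverage_string(var_dict):
    """Join every path into one comma-separated string (assumed format)."""
    paths = [p for plist in var_dict.values() for p in plist]
    return ",".join("/" + p for p in paths)


coverage = to_coverage_string(var_dict)
print(unpacked == keyword)
print(coverage)
```

The dictionary-unpacking style is handy when the subsetting options are assembled programmatically, since keys can be added or dropped before the call.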

In [None]:
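# note: `session` is not defined in the cells shown above; it must hold your authenticated Earthdata session
region_a.order_granules(session, verbose=True)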

In [None]:
region_a.download_granules(session,'.')

### Examine downloaded subset data file 


In [None]:
fn = '166458094/processed_ATL09_20190222003738_08490201_002_01.h5'

#### Check the downloaded dataset
Take ```latitude``` as an example:

In [None]:
varname = 'latitude'
#varname = 'sc_orient'

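varlist = []

def IS2h5walk(vname, h5node):
    # Record the path of every dataset (not group) visited in the file
    if isinstance(h5node, h5py.Dataset):
        varlist.append(vname)

with h5py.File(fn, 'r') as h5pt:
    h5pt.visititems(IS2h5walk)

# Print every dataset path whose final component matches varname
for tvar in varlist:
    vpath, vn = os.path.split(tvar)
    if vn == varname:
        print(tvar)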

#### Compare the variable ```latitude``` in the original data and the subsetted data

In [None]:
region_a.variables['latitude']

In [None]:
[', '.join(x) for x in ['gt1l', 'gt1r']]
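One way to make this comparison concrete is to check the requested paths against the dataset paths actually present in the downloaded file. The sketch below is a minimal illustration with made-up lists; `compare_paths` is a hypothetical helper, and in practice `found` could be the `varlist` collected by the h5py walk above.

```python
def compare_paths(wanted, found):
    """Return (missing, extra): requested paths absent from the file,
    and file paths that were not requested."""
    requested = {p for paths in wanted.values() for p in paths}
    present = set(found)
    return sorted(requested - present), sorted(present - requested)


# Hypothetical request and file contents, for illustration only.
wanted = {
    "latitude": ["profile_2/high_rate/latitude", "profile_2/low_rate/latitude"],
    "sc_orient": ["orbit_info/sc_orient"],
}
found = [
    "orbit_info/sc_orient",
    "profile_2/high_rate/latitude",
    "profile_2/high_rate/delta_time",
]

missing, extra = compare_paths(wanted, found)
print("missing:", missing)
print("extra:", extra)
```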