# Subsetting ICESat-2 Data
This notebook ({nb-download}`download <IS2_data_access2-subsetting.ipynb>`) illustrates the use of icepyx for subsetting ICESat-2 data ordered through the NSIDC DAAC. We'll show how to find out what subsetting options are available and how to specify the subsetting options for your order.

For more information on using icepyx to find, order, and download data, see our complementary [ICESat-2 Data Access Notebook](https://icepyx.readthedocs.io/en/latest/example_notebooks/IS2_data_access.html).

Questions? Be sure to check out the FAQs throughout this notebook, indicated as italic headings.

### _What is SUBSETTING anyway?_

_Anyone who's worked with geospatial data has probably encountered subsetting. Typically, we search for data wherever it is stored and download the chunks (aka granules, scenes, passes, swaths, etc.) that contain something we are interested in. Then, we have to extract from each chunk the pieces we actually want to analyze. Those pieces might be geospatial (e.g. an area of interest), temporal (e.g. certain months of a time series), and/or certain variables. This process of extracting the data we are going to use is called subsetting._

_In the case of ICESat-2 data coming from the NSIDC DAAC, we can do this subsetting step on the data prior to download. This reduces the number of data processing steps and results in smaller, faster downloads that need less storage._
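
As a rough illustration of the three kinds of subsetting described above, the sketch below filters a handful of made-up records in plain Python. The records, field names, and the `subset` helper are hypothetical and are not part of icepyx or the NSIDC services; the real subsetter operates on HDF5 granules on the server side.

```python
from datetime import date

# Hypothetical point records; all values are invented for illustration only.
records = [
    {"lon": -50.0, "lat": 69.5, "date": date(2019, 2, 23), "height": 101.2, "quality": 0},
    {"lon": -40.0, "lat": 69.5, "date": date(2019, 2, 23), "height": 250.7, "quality": 0},
    {"lon": -52.0, "lat": 70.0, "date": date(2019, 3, 15), "height": 88.4, "quality": 1},
]


def subset(records, bbox, start, end, variables):
    """Keep records inside bbox and [start, end], returning only `variables`."""
    lon_min, lat_min, lon_max, lat_max = bbox
    kept = []
    for rec in records:
        in_space = lon_min <= rec["lon"] <= lon_max and lat_min <= rec["lat"] <= lat_max
        in_time = start <= rec["date"] <= end
        if in_space and in_time:
            # Variable subsetting: copy only the requested fields.
            kept.append({name: rec[name] for name in variables})
    return kept


result = subset(
    records,
    bbox=[-55, 68, -48, 71],
    start=date(2019, 2, 22),
    end=date(2019, 2, 28),
    variables=["lat", "height"],
)
print(result)
```

Only the first record survives: the second falls outside the bounding box and the third falls outside the date range.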

In [1]:
%mamba uninstall -y icepyx

Removing specs: ['icepyx']
Transaction

  Prefix: /srv/conda/envs/notebook

  Failure: packages to remove not found in the environment:

  - icepyx

The following packages are missing from the target environment:
  - icepyx


PackagesNotFoundError: The following packages are missing from the target environment:
  - icepyx



Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install git+https://github.com/icesat2py/icepyx.git@harmony-take2

Collecting git+https://github.com/icesat2py/icepyx.git@harmony-take2
  Cloning https://github.com/icesat2py/icepyx.git (to revision harmony-take2) to /tmp/pip-req-build-ly550u3f
  Running command git clone --filter=blob:none --quiet https://github.com/icesat2py/icepyx.git /tmp/pip-req-build-ly550u3f
  Running command git checkout -b harmony-take2 --track origin/harmony-take2
  Switched to a new branch 'harmony-take2'
  branch 'harmony-take2' set up to track 'origin/harmony-take2'.
  Resolved https://github.com/icesat2py/icepyx.git to commit 2581fd80b28c4e3e2f811f34bbe981c1274c1765
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Note: you may need to restart the kernel to use updated packages.


Import packages, including icepyx

In [3]:
import icepyx as ipx

import numpy as np
import xarray as xr
import pandas as pd

import h5py
import json
import os
from pprint import pprint

Create a query object and log in to Earthdata

For this example, we'll be working with an atmospheric product (ATL09) for an area along West Greenland (Disko Bay).

In [4]:
region_a = ipx.Query('ATL09', [-55, 68, -48, 71], ['2019-02-22', '2019-02-28'],
                     start_time='00:00:00', end_time='23:59:59')
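
Because a malformed bounding box or date range is an easy mistake to make, it can help to sanity-check the inputs before building a query. The sketch below is a minimal, standalone check; `check_query_inputs` is a hypothetical helper (not an icepyx function), and it assumes the bounding box is ordered `[lon_min, lat_min, lon_max, lat_max]` in decimal degrees, as in the query above.

```python
from datetime import datetime


def check_query_inputs(bbox, date_range):
    """Validate a bounding box and date range before building a query.

    Assumes bbox is [lon_min, lat_min, lon_max, lat_max] in decimal degrees
    and date_range is ['YYYY-MM-DD', 'YYYY-MM-DD'].
    """
    if len(bbox) != 4:
        raise ValueError("bbox needs exactly four values")
    lon_min, lat_min, lon_max, lat_max = bbox
    # Longitude order is not checked, since a box may cross the antimeridian.
    if not (-180 <= lon_min <= 180 and -180 <= lon_max <= 180):
        raise ValueError("longitudes must be within [-180, 180]")
    if not (-90 <= lat_min < lat_max <= 90):
        raise ValueError("latitudes must satisfy -90 <= lat_min < lat_max <= 90")
    # strptime raises ValueError itself if a date string is malformed.
    start, end = (datetime.strptime(d, "%Y-%m-%d") for d in date_range)
    if start > end:
        raise ValueError("start date is after end date")
    return True


print(check_query_inputs([-55, 68, -48, 71], ["2019-02-22", "2019-02-28"]))
```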

```{admonition} Important Authentication Update
Previously, icepyx required you to explicitly use the `.earthdata_login()` function to log in. Running this function is deprecated and will result in an error, as icepyx will call the login function as needed. The user will still need to provide their credentials.
```

## Discover Subsetting Options

You can see what subsetting options are available for a given product by calling `show_custom_options()`. The options are presented as a series of headings followed by available values in square brackets. Headings are:
* **Subsetting Options**: whether or not temporal and spatial subsetting are available for the data product
* **Data File Formats (Reformatting Options)**: return the data in a format other than the native hdf5 (submitted as a key=value kwarg to `order_granules(format='NetCDF4-CF')`)
* **Data File (Reformatting) Options Supporting Reprojection**: return the data in a reprojected reference frame. These will be available for gridded ICESat-2 L3B data products.
* **Data File (Reformatting) Options NOT Supporting Reprojection**: data file formats that cannot be delivered with reprojection
* **Data Variables (also Subsettable)**: a dictionary of variable name keys and the paths to those variables available in the product

In [5]:
region_a.show_custom_options()

{
  "conceptId": "C2649212495-NSIDC_CPRD",
  "shortName": "ATL09",
  "variableSubset": false,
  "bboxSubset": true,
  "shapeSubset": true,
  "temporalSubset": true,
  "concatenate": false,
  "reproject": false,
  "outputFormats": [
    "application/x-hdf"
  ],
  "services": [
    {
      "name": "sds/trajectory-subsetter",
      "href": "https://cmr.earthdata.nasa.gov/search/concepts/S2836723123-XYZ_PROV",
      "capabilities": {
        "subsetting": {
          "temporal": true,
          "bbox": true,
          "shape": true,
          "variable": true
        },
        "output_formats": [
          "application/x-hdf"
        ]
      }
    }
  ],
  "variables": [],
  "capabilitiesVersion": "2"
}
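
If you have this output available as text, the enabled options can be pulled out programmatically. The sketch below pastes an abbreviated copy of the JSON shown above into a string and lists the top-level flags that are `true`. Whether `show_custom_options()` returns this structure or only prints it is not shown here, so the string literal is an assumption made for the sake of a self-contained example.

```python
import json

# Abbreviated copy of the output shown above.
capabilities_json = """
{
  "shortName": "ATL09",
  "variableSubset": false,
  "bboxSubset": true,
  "shapeSubset": true,
  "temporalSubset": true,
  "concatenate": false,
  "reproject": false,
  "outputFormats": ["application/x-hdf"]
}
"""

capabilities = json.loads(capabilities_json)

# Collect the top-level boolean flags that are switched on.
supported = sorted(key for key, value in capabilities.items() if value is True)
print(supported)
print(capabilities["outputFormats"])
```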


By default, spatial and temporal subsetting based on your initial inputs is applied to your order. To turn this off, pass `subset=False` to `order_granules()` or to `download_granules()` (which calls `order_granules` under the hood if you have not already placed your order).
Additional subsetting options must be specified as keyword arguments to the order/download functions.

Although some file format conversions and reprojections are possible using the `format`, `projection`, and `projection_parameters` keywords, the rest of this tutorial will focus on variable subsetting, which is provided with the `Coverage` keyword.

### _Why do I have to provide spatial bounds to icepyx even if I don't use them to subset my data order?_

_Because they're still needed for the granule level search._
_Spatial inputs are usually required for any data search, on any platform, even if your search parameters cover the entire globe._

_The spatial information you provide is used to search the data repository and determine which granules might contain data over your area of interest._
_When you use that spatial information for subsetting, it's actually asking the NSIDC subsetter to extract the appropriate data from each granule._
_Thus, even if you set `subset=False` and download entire granules, you still need to provide some inputs on what geographic area you'd like data for._
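
The granule-level search can be pictured as a coarse overlap test between your area of interest and each granule's summary footprint. The sketch below is a simplified illustration only: the footprints are invented, `boxes_intersect` is a hypothetical helper, and real repositories typically store more detailed footprints than a plain bounding box. It also ignores boxes that cross the antimeridian.

```python
def boxes_intersect(a, b):
    """Return True if two [lon_min, lat_min, lon_max, lat_max] boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


area_of_interest = [-55, 68, -48, 71]

# Hypothetical granule footprints (summary bounding boxes), invented for illustration.
footprints = {
    "granule_A": [-60, 60, -50, 80],  # long track crossing the area
    "granule_B": [10, 60, 20, 80],    # far to the east
}

# The search step: keep every granule whose footprint touches the area.
candidates = [
    name for name, box in footprints.items() if boxes_intersect(box, area_of_interest)
]
print(candidates)
```

Only `granule_A` is returned by this search step; subsetting would then extract the data inside the area from that granule.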

## About Data Variables in a query object

A given ICESat-2 product may have over 200 variable + path combinations.
icepyx includes a custom `Variables` module that is "aware" of the ATLAS sensor and how the ICESat-2 data products are stored.
The [ICESat-2 Data Variables Example](https://icepyx.readthedocs.io/en/latest/example_notebooks/IS2_data_variables.html) provides a detailed set of examples on how to use icepyx's built-in `Variables` module.

Thus, this notebook uses a default list of wanted variables to showcase subsetting and refers the user to the aforementioned Jupyter Notebook for a more thorough exploration of ICESat-2 product variables.
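
To give a feel for what "variable + path combinations" means, the sketch below groups a few ATL06-style paths by variable name using only the standard library. It is a simplified illustration and not how the icepyx `Variables` module is implemented; `group_paths_by_variable` is a hypothetical helper and the path list is a tiny hand-picked sample.

```python
from collections import defaultdict

# A few ATL06-style variable paths; a real product has many more.
paths = [
    "gt1l/land_ice_segments/h_li",
    "gt1l/land_ice_segments/latitude",
    "gt1r/land_ice_segments/h_li",
    "gt1r/land_ice_segments/latitude",
    "orbit_info/cycle_number",
]


def group_paths_by_variable(paths):
    """Map each variable name (last path component) to all paths that end in it."""
    grouped = defaultdict(list)
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        grouped[name].append(path)
    return dict(grouped)


variables = group_paths_by_variable(paths)
print(sorted(variables))
print(variables["h_li"])
```

The same variable name appears once per beam group, which is why the number of combinations grows quickly.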

### _Why not just download all the data and subset locally? What if I need more variables/granules?_

_Taking advantage of the NSIDC subsetter is a great way to reduce your download size and thus your download time and the amount of storage required, especially if you're storing your data locally during analysis. By downloading your data using icepyx, it is easy to go back and get additional data with the same, similar, or different parameters (e.g. you can keep the same spatial and temporal bounds but change the variable list). Related tools (e.g. [`captoolkit`](https://github.com/fspaolo/captoolkit)) will let you easily merge files if you're uncomfortable merging them during read-in for processing._

In [6]:
short_name = 'ATL06'
spatial_extent = './supporting_files/simple_test_poly.gpkg'
date_range = ['2019-10-01','2019-10-05']

In [7]:
region_a = ipx.Query(short_name, spatial_extent,
                     cycles=['03', '04', '05', '06'], tracks=['0849', '0902'])

print(region_a.product)
print(region_a.product_version)
print(region_a.cycles)
print(region_a.tracks)
print(region_a.spatial_extent)

ATL06
006
['03', '04', '05', '06']
['0849', '0902']
('polygon', [-55.0, 68.0, -55.0, 71.0, -48.0, 71.0, -48.0, 68.0, -55.0, 68.0])
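
The polygon is printed as a flat list of numbers. The sketch below pairs them up into vertices and derives a bounding box, assuming the values alternate longitude, latitude (which is consistent with the West Greenland bounding box used earlier in this notebook, but is an assumption about the printed format).

```python
# Flat coordinate list copied from the printed spatial extent above.
flat = [-55.0, 68.0, -55.0, 71.0, -48.0, 71.0, -48.0, 68.0, -55.0, 68.0]

# Pair up consecutive values as (longitude, latitude) vertices.
vertices = list(zip(flat[0::2], flat[1::2]))

lons = [lon for lon, _ in vertices]
lats = [lat for _, lat in vertices]
bbox = [min(lons), min(lats), max(lons), max(lats)]

# A closed ring repeats its first vertex at the end.
is_closed = vertices[0] == vertices[-1]

print(vertices)
print(is_closed)
print(bbox)
```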


In [8]:
region_a.visualize_spatial_extent()

We can still print a list of available granules for our query:

In [9]:
region_a.avail_granules(cloud=True)

[['s3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2019/05/23/ATL06_20190523204321_08490305_006_02.h5',
  's3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2019/05/27/ATL06_20190527075008_09020303_006_02.h5',
  's3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2019/08/22/ATL06_20190822162310_08490405_006_02.h5',
  's3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2019/08/26/ATL06_20190826032957_09020403_006_02.h5',
  's3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2019/11/21/ATL06_20191121120302_08490505_006_01.h5',
  's3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2019/11/24/ATL06_20191124230949_09020503_006_01.h5',
  's3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2020/02/20/ATL06_20200220074245_08490605_006_01.h5',
  's3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2020/02/23/ATL06_20200223184932_09020603_006_01.h5']]
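
The granule file names encode the acquisition start time, reference ground track, cycle, and region, so you can confirm that the results match the requested tracks and cycles. The sketch below parses one of the URLs listed above with a regular expression. The assumed layout is `PRODUCT_yyyymmddhhmmss_ttttccss_vvv_rr.h5` (track, cycle, region, then version and revision); `parse_granule_name` is a hypothetical helper, not an icepyx function.

```python
import re
from datetime import datetime

# Assumed layout: PRODUCT_yyyymmddhhmmss_ttttccss_vvv_rr.h5
GRANULE_RE = re.compile(
    r"(?P<product>ATL\d{2})_(?P<start>\d{14})_"
    r"(?P<track>\d{4})(?P<cycle>\d{2})(?P<region>\d{2})"
    r"_(?P<version>\d{3})_(?P<revision>\d{2})\.h5$"
)


def parse_granule_name(url):
    """Extract product, start time, track, cycle, and region from a granule URL."""
    match = GRANULE_RE.search(url)
    if match is None:
        raise ValueError(f"unrecognized granule name: {url}")
    info = match.groupdict()
    info["start"] = datetime.strptime(info["start"], "%Y%m%d%H%M%S")
    return info


url = (
    "s3://nsidc-cumulus-prod-protected/ATLAS/ATL06/006/2019/05/23/"
    "ATL06_20190523204321_08490305_006_02.h5"
)
info = parse_granule_name(url)
print(info["track"], info["cycle"], info["region"], info["start"].date())
```

For this granule the parsed track is `0849` and the cycle is `03`, both of which were requested in the query.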

## Applying variable subsetting to your order and download

In order to have your wanted variable list included with your order, you must pass it as a keyword argument to the `subsetparams()` attribute or the `order_granules()` or `download_granules()` (which calls `order_granules` under the hood if you have not already placed your order) functions.

In [10]:
order = region_a.order_granules(subset=True) 
order

Harmony job ID:  a98665dd-abd1-4870-8c9e-e60bae768f18
Initial status of your harmony order request: running


Job ID,Type,Status,Details
a98665dd-abd1-4870-8c9e-e60bae768f18,subset,running,View Details


### Checking an order status

In [11]:
order.status()

{'status': 'running',
 'message': 'The job is being processed',
 'progress': 0,
 'created_at': datetime.datetime(2025, 3, 6, 23, 26, 1, 922000, tzinfo=tzlocal()),
 'updated_at': datetime.datetime(2025, 3, 6, 23, 26, 1, 922000, tzinfo=tzlocal()),
 'created_at_local': '2025-03-06T23:26:01+00:00',
 'updated_at_local': '2025-03-06T23:26:01+00:00',
 'num_input_granules': 8,
 'data_expiration': datetime.datetime(2025, 4, 5, 23, 26, 1, 922000, tzinfo=tzlocal()),
 'data_expiration_local': '2025-04-05T23:26:01+00:00',
 'order_url': 'https://harmony.earthdata.nasa.gov/workflow-ui/a98665dd-abd1-4870-8c9e-e60bae768f18'}
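
Since an order can take a while, you may want to poll the status until the job stops. The sketch below is one possible approach under stated assumptions: `wait_for_order` is a hypothetical helper, the set of stop states is an assumed list of job status strings, and a canned sequence of responses stands in for a real order so the example runs without credentials. It relies only on the `'status'` key seen in the output above.

```python
import time

# Assumed set of job states after which polling should stop.
STOP_STATES = {"successful", "failed", "canceled", "paused", "complete_with_errors"}


def wait_for_order(get_status, poll_seconds=10.0, max_polls=60):
    """Call get_status() until the job reaches a stop state or max_polls is hit."""
    for _ in range(max_polls):
        status = get_status()
        if status["status"] in STOP_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("order did not finish within the polling budget")


# Stand-in for a real status call, returning canned responses for demonstration.
canned = iter([
    {"status": "running", "progress": 0},
    {"status": "running", "progress": 50},
    {"status": "successful", "progress": 100},
])
final = wait_for_order(lambda: next(canned), poll_seconds=0)
print(final)
```

With a real order, something like `wait_for_order(order.status)` should behave the same way, provided the returned dictionary keeps the `'status'` key.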

### Downloading subsetted granules

In [12]:
files = order.download_granules("./data")

Downloading results for harmony job a98665dd-abd1-4870-8c9e-e60bae768f18
data/93451573_ATL06_20190822162310_08490405_006_02_subsetted.h5
data/93451572_ATL06_20190527075008_09020303_006_02_subsetted.h5
data/93451571_ATL06_20190523204321_08490305_006_02_subsetted.h5
data/93451574_ATL06_20190826032957_09020403_006_02_subsetted.h5
data/93451576_ATL06_20191124230949_09020503_006_01_subsetted.h5
data/93451575_ATL06_20191121120302_08490505_006_01_subsetted.h5
data/93451577_ATL06_20200220074245_08490605_006_01_subsetted.h5
data/93451578_ATL06_20200223184932_09020603_006_01_subsetted.h5
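
To confirm that every requested granule came back, you can map each downloaded file to its original granule name. The sketch below assumes the naming pattern visible in the listing above (a numeric prefix and a `_subsetted` suffix around the original name); `original_name` is a hypothetical helper, and only two of the eight files are included to keep the example short.

```python
import re

requested = [
    "ATL06_20190523204321_08490305_006_02.h5",
    "ATL06_20190527075008_09020303_006_02.h5",
]
downloaded = [
    "data/93451572_ATL06_20190527075008_09020303_006_02_subsetted.h5",
    "data/93451571_ATL06_20190523204321_08490305_006_02_subsetted.h5",
]


def original_name(path):
    """Strip the directory, numeric prefix, and '_subsetted' suffix from a result file."""
    filename = path.rsplit("/", 1)[-1]
    filename = re.sub(r"^\d+_", "", filename)
    return filename.replace("_subsetted.h5", ".h5")


returned = {original_name(path) for path in downloaded}
missing = [name for name in requested if name not in returned]
print(missing)
```

An empty list means every requested granule has a matching subsetted file.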


### _Why does the subsetter say no matching data was found?_
_Sometimes, granules ("files") returned in our initial search end up not containing any data in our specified area of interest._
_This is because the initial search is completed using summary metadata for a granule._
_You've likely encountered this before when viewing available imagery online: your spatial search turns up a bunch of images with only a few border or corner pixels, maybe even in no data regions, in your area of interest._
_Thus, when you go to extract the data from the area you want (i.e. spatially subset it), you don't get any usable data from that image._
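
The sketch below illustrates this situation with invented numbers: the bounding box of a hypothetical ground track overlaps the area of interest, so a footprint-based search would return the granule, yet none of the individual points fall inside the area. The track coordinates and the `in_box` helper are made up for illustration.

```python
area_of_interest = [-55, 68, -48, 71]  # [lon_min, lat_min, lon_max, lat_max]

# Hypothetical ground-track points (lon, lat) from one granule; invented values.
# The track's bounding box overlaps the area, but the track itself passes beside it.
track = [(-60.0, 60.0), (-58.0, 66.0), (-56.0, 72.0), (-54.0, 78.0)]


def in_box(point, box):
    """Return True if a (lon, lat) point lies inside the box."""
    lon, lat = point
    return box[0] <= lon <= box[2] and box[1] <= lat <= box[3]


# Summary footprint: the bounding box of all points in the granule.
lons = [p[0] for p in track]
lats = [p[1] for p in track]
footprint = [min(lons), min(lats), max(lons), max(lats)]

footprint_overlaps = (
    footprint[0] <= area_of_interest[2]
    and area_of_interest[0] <= footprint[2]
    and footprint[1] <= area_of_interest[3]
    and area_of_interest[1] <= footprint[3]
)
points_inside = [p for p in track if in_box(p, area_of_interest)]

print(footprint_overlaps, len(points_inside))
```

The search matches the granule, but the spatial subset of it is empty.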

## Handling large orders

By default, the Harmony subsetter does not process a large order (more than 300 granules, as in the 311-granule example below) in one go. Instead, the order is placed into a "previewing" status and only the first 100 granules are processed, which allows users to check that the results look correct. Once the preview has completed, we can resume the order if we are satisfied that our request is correct. The following cells are commented out by default and can be uncommented to test this large-order behavior.

In [13]:
# short_name = 'ATL06'
# spatial_extent = './supporting_files/simple_test_poly.gpkg'
# date_range = ['2018-10-01','2020-02-05']

# region_a = ipx.Query(short_name, spatial_extent, date_range)

# order = region_a.order_granules(subset=True) 
# order

This order includes 311 input granules, and therefore it is automatically placed into a previewing state. We can inspect the status of this order and wait until it moves to a "paused" state, once the initial 100 granules are complete.

In [14]:
# order.status()

If we are satisfied with the order, then we can resume processing:

In [15]:
# order.resume()
# order

## Work with the downloaded data

Now that the subsetted files have been downloaded, we can work with them using the `icepyx` [Read](https://icepyx.readthedocs.io/en/latest/user_guide/documentation/read.html) class. See the [Reading ICESat-2 Data in for Analysis](https://icepyx.readthedocs.io/en/latest/example_notebooks/IS2_data_read-in.html#) notebook for more information.

#### Credits
* notebook contributors: Zheng Liu, Jessica Scheick, and Amy Steiker
* some source material: [NSIDC Data Access Notebook](https://github.com/ICESAT-2HackWeek/ICESat2_hackweek_tutorials/tree/main/03_NSIDCDataAccess_Steiker) by Amy Steiker and Bruce Wallin