
Cryo 165 - update ICESat-2 cloud access with search_data method #56

Merged
merged 12 commits on Aug 1, 2023
11 changes: 7 additions & 4 deletions notebooks/ICESat-2_Cloud_Access/README.md
@@ -4,13 +4,16 @@
This notebook demonstrates searching for cloud-hosted ICESat-2 data and directly accessing Land Ice Height (ATL06) granules from an Amazon Compute Cloud (EC2) instance using the `earthaccess` package. NASA data "in the cloud" are stored in Amazon Web Services (AWS) Simple Storage Service (S3) Buckets. **Direct Access** is an efficient way to work with data stored in an S3 Bucket when you are working in the cloud. Cloud-hosted granules can be opened and loaded into memory without the need to download them first. This allows you to take advantage of the scalability and power of cloud computing.

## Set up
To run the notebook provided in this folder, please see the [NSIDC-Data-Tutorials repository readme](https://github.com/nsidc/NSIDC-Data-Tutorials#readme) for instructions on several ways (using Binder, Docker, or Conda) to do this.

**Note:** If you are running this notebook on your own AWS EC2 instance using the environment set up using the environment.yml file in the NSIDC-Data-Tutorials/notebooks/ICESat-2_Cloud_Access/environment folder, you may need to run the following command before running the notebook to ensure the notebook executes properly:
To run the notebook provided in this folder in the Amazon Web Services (AWS) cloud, there are a couple of options:
* An EC2 instance with the software needed to run a Jupyter notebook already installed, and the environment created from the provided environment.yml file. **Note:** If you are running this notebook on your own AWS EC2 instance with the environment created from the environment.yml file in the NSIDC-Data-Tutorials/notebooks/ICESat-2_Cloud_Access/environment folder, you may need to run the following command before running the notebook to ensure it executes properly:

`jupyter nbextension enable --py widgetsnbextension`

You do NOT need to do this if you are using the environment set up using the environment.yml file from the NSIDC-Data-Tutorials/binder folder.

* Alternatively, if you have access to one, the notebook can be run in a managed cloud-based JupyterHub. Just make sure all the necessary libraries are installed (`earthaccess`, `xarray`, and `hvplot`).

For further details on the prerequisites, see the 'Prerequisites' section in the notebook.

## Key Learning Objectives

Binary file not shown.
@@ -53,7 +53,7 @@
"### **Example of end product** \n",
jroebuck932 marked this conversation as resolved.
"At the end of this tutorial, the following figure will be generated:\n",
"<center>\n",
"<img src='./img/atl06_land_ice_heights_plot.png'/>\n",
"<img src='./img/atl06_example_end_product.png'/>\n",
"</center>\n",
"\n",
"### **Time requirement**\n",
@@ -100,7 +100,8 @@
"\n",
"# For reading data, analysis and plotting\n",
"import xarray as xr\n",
"import hvplot.xarray"
"import hvplot.xarray\n",
"import pprint"
]
},
{
@@ -138,7 +139,9 @@
"\n",
"`earthaccess` leverages the Common Metadata Repository (CMR) API to search for collections and granules. [Earthdata Search](https://search.earthdata.nasa.gov/search) also uses the CMR API.\n",
"\n",
"We can use the `keyword` method for `collection_query` to search for ICESat-2 collections. "
"We can use the `search_datasets` method to search for ICESat-2 collections by setting `keyword='ICESat-2'`.\n",
"\n",
"This will display the number of data collections (data sets) that meet these search criteria."
]
},
{
@@ -150,15 +153,19 @@
},
"outputs": [],
"source": [
"Query = earthaccess.collection_query().keyword('ICESat-2')"
"Query = earthaccess.search_datasets(keyword = 'ICESat-2')"
]
},
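As a hedged aside (not part of this PR), `search_datasets` returns its results as a standard Python list of `DataCollection` objects, as noted later in the notebook, so ordinary list operations give the hit count and subsets. A minimal stand-in sketch with mock entries, not real `DataCollection` objects:

```python
# Stand-in sketch: mock entries emulate the list returned by
# earthaccess.search_datasets (real results are DataCollection objects).
mock_query = [{"short-name": f"collection-{n}"} for n in range(65)]

hits = len(mock_query)       # hit count, replacing the old Query.hits() pattern
first_ten = mock_query[:10]  # subsetting works with ordinary slicing

print(hits)            # 65
print(len(first_ten))  # 10
```

The same `len()` and slicing idioms apply directly to the real `Query` list used in this notebook.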
{
"cell_type": "markdown",
"id": "d3957627",
"metadata": {},
"source": [
"The `hits()` method can be used to find out how many collections (both _DAAC-hosted_ and _cloud-hosted_) we found."
"In this case there are 65 collections that have the keyword ICESat-2.\n",
"\n",
"The `search_datasets` method returns a Python list of `DataCollection` objects. We can view the metadata for each collection in long form by passing a `DataCollection` object to print, or as a summary using the `summary` method. We can also use the `pprint` function to pretty-print each object.\n",
"\n",
"We will do this for the first 10 results (objects)."
]
},
{
@@ -168,42 +175,28 @@
"metadata": {},
"outputs": [],
"source": [
"Query.hits()"
]
},
{
"cell_type": "markdown",
"id": "ea86f3e8",
"metadata": {},
"source": [
"We can see what these data collections are by _getting_ `ShortName` and `Versions`. In this case, we'll just grab the first 10 results and print them out."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "be6bcea5",
"metadata": {},
"outputs": [],
"source": [
"collections = Query.fields(['ShortName', 'Version']).get(10)\n",
"print(collections)"
"for collection in Query[:10]:\n",
" pprint.pprint(collection.summary(), sort_dicts=True, indent=4)\n",
" print('')\n",
" "
]
},
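Since `summary()` returns a plain dictionary per collection, individual fields can be pulled out directly. A hedged sketch with a hypothetical summary dict; the keys follow the fields described in this notebook, but the values are invented:

```python
import pprint

# Hypothetical summary dict: keys mirror the fields described in this
# notebook (concept-id, short-name, version, file-type); values invented.
summary = {
    "concept-id": "C0000000000-NSIDC_CPRD",
    "short-name": "ATL06",
    "version": "006",
    "file-type": "HDF5",
}

pprint.pprint(summary, sort_dicts=True, indent=4)

# The concept-id can later be used to search for granules in the collection;
# the provider-id is embedded in it after the hyphen.
concept_id = summary["concept-id"]
provider = concept_id.split("-")[1]
print(provider)  # NSIDC_CPRD
```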
{
"cell_type": "markdown",
"id": "b88357e5",
"metadata": {},
"source": [
"The `get`, used above, returns a list of python dictionaries containing metadata `meta` and the two requested attribute fields from the Unified Metadata Model (UMM) entry for the collection `umm`. The `meta` entry for each collection contains the following information.\n",
"For each collection, `summary` returns a subset of fields from the collection metadata and the Unified Metadata Model (UMM):\n",
"- `concept-id` is a unique id for the collection. It consists of an alphanumeric code and the provider-id specific to the DAAC (Distributed Active Archive Center). You can use the `concept_id` to search for data granules.\n",
"- `short_name` is a quick way of referring to a collection (instead of using the full title). It can be found on the collection landing page underneath the collection title after 'DATA SET ID'. See the table below for a list of the shortnames for ICESat-2 collections.\n",
"- `version` is the version of each collection.\n",
"- `file-type` gives information about the file format of the collection granules.\n",
"- `get-data` is a collection of URLs that can be used to access the data, collection landing pages and data tools. \n",
"- `cloud-info` is for cloud-hosted data and provides additional information about the location of the S3 bucket that holds the data and where to get temporary AWS S3 credentials for accessing it. `earthaccess` handles these credentials and the links to the S3 buckets, so in general you won't need to worry about this information. \n",
"\n",
"- `concept-id`, which is a unique id for the collection. We'll use the `concept_id` to search for data granules.\n",
"- `granule-count`, the number of data granules in the collection. \n",
"- `provider-id`, the id for DAAC responsible for the collection. This information is also part of the `concept-id`.\n",
"For the ICESat-2 search results, within the `concept-id` there is a provider-id: `NSIDC_ECS` for the _on-prem_ collections and `NSIDC_CPRD` for the _cloud-hosted_ collections. \n",
"\n",
"The `umm` fields are `ShortName` and `Version`. For ICESat-2, `ShortNames` are generally how different products are referred to.\n",
"For ICESat-2, `ShortNames` are generally how different products are referred to.\n",
"\n",
"| ShortName | Product Description |\n",
"|:-----------:|:---------------------|\n",
@@ -234,10 +227,10 @@
"metadata": {},
"outputs": [],
"source": [
"Query = earthaccess.collection_query().keyword('ICESat-2').cloud_hosted(True)\n",
"print(Query.hits())\n",
"collections = Query.fields(['ShortName', 'Version']).get(10)\n",
"print(collections)"
"Query = earthaccess.search_datasets(\n",
" keyword = 'ICESat-2',\n",
" cloud_hosted = True\n",
")"
]
},
{
@@ -247,9 +240,11 @@
"source": [
"## Search a data set using spatial and temporal filters \n",
"\n",
"As an example of a search using spatial and temporal filters, we'll search for ATL06 granules over the Juneau Icefield, AK, for March 2020.\n",
"We can use the `search_data` method to search for granules within a data set by location and time using spatial and temporal filters. In this example, we will search for data granules from the ATL06 version 006 cloud-hosted data set over the Juneau Icefield, AK, for March and April 2020.\n",
"\n",
"The temporal range is identified with standard date strings, and the bounding box is specified by its latitude-longitude corners. Polygons and points, as well as shapefiles, can also be used.\n",
"\n",
"The ATL06 version 005 collection is identified by the `concept_id` C2153572614-NSIDC_CPRD.\n",
"This will display the number of granules that match our search. "
]
},
{
@@ -259,40 +254,25 @@
"metadata": {},
"outputs": [],
"source": [
"Query = earthaccess.granule_query().concept_id(\n",
" 'C2153572614-NSIDC_CPRD'\n",
").temporal(\n",
" \"2020-03-01\", \"2020-03-30\"\n",
").bounding_box(\n",
" -134.7,58.9,-133.9,59.2)"
]
},
{
"cell_type": "markdown",
"id": "49e860c2",
"metadata": {},
"source": [
"As before, we can use `Query.hits()` to find the number of granules that match our search. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b8218347",
"metadata": {},
"outputs": [],
"source": [
"Query.hits()"
"results = earthaccess.search_data(\n",
" short_name = 'ATL06',\n",
" version = '006',\n",
" cloud_hosted = True,\n",
" bounding_box = (-134.7,58.9,-133.9,59.2),\n",
" temporal = ('2020-03-01','2020-04-30'),\n",
")"
]
},
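The `bounding_box` tuple above is ordered (lower-left longitude, lower-left latitude, upper-right longitude, upper-right latitude). A small sanity-check sketch using a hypothetical helper, not part of `earthaccess`:

```python
def check_bounding_box(bbox):
    """Hypothetical helper: validate a (lon_min, lat_min, lon_max, lat_max) tuple."""
    lon_min, lat_min, lon_max, lat_max = bbox
    # Corners must be in range and the lower-left corner must come first.
    return (-180 <= lon_min < lon_max <= 180) and (-90 <= lat_min < lat_max <= 90)

# The Juneau Icefield box used in the search above passes the check.
print(check_bounding_box((-134.7, 58.9, -133.9, 59.2)))  # True

# Swapping the corner order would fail it.
print(check_bounding_box((-133.9, 58.9, -134.7, 59.2)))  # False
```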
{
"cell_type": "markdown",
"id": "a7bc1b37",
"metadata": {},
"source": [
"We'll get metadata for these 4 granules and display it. The rendered metadata shows a download link, granule size and two images of the data.\n",
"To display the rendered metadata, including the download link, granule size and two images, we will use `display`. In the example below, all 4 results are shown. \n",
"\n",
"The download link is `https` and can be used to download the granule to your local machine. This is similar to downloading _DAAC-hosted_ data, but in this case the data come from the Earthdata Cloud. For NASA data in the Earthdata Cloud, there is no charge to the user for egress from AWS Cloud servers. This is not the case for other data in the cloud.\n",
"\n",
"Note that the `[None, None, None, None]` displayed at the end can be ignored; it has no meaning in relation to the metadata."
]
},
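The trailing `[None, None, None, None]` comes from the list comprehension itself: `display` renders each granule as a side effect and returns `None`, and the comprehension collects those return values. A minimal stand-in sketch, using `print` in place of IPython's `display`:

```python
# Stand-in for IPython's display(): renders each item as a side effect
# and returns None, so a list comprehension collects a list of Nones.
def show(item):
    print(item)
    # no return statement -> returns None, just like display()

granule_ids = ["granule-1", "granule-2", "granule-3", "granule-4"]

returned = [show(g) for g in granule_ids]
print(returned)  # [None, None, None, None]

# A plain for-loop produces the same rendering without the stray list:
for g in granule_ids:
    show(g)
```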
{
@@ -302,8 +282,7 @@
"metadata": {},
"outputs": [],
"source": [
"granules = Query.get(4)\n",
"[display(g) for g in granules]"
"[display(r) for r in results]"
]
},
{
@@ -317,9 +296,9 @@
"\n",
"Direct access to data from an S3 bucket is a two-step process. First, the files are opened using the `open` method. The `auth` object created at the start of the notebook is used to provide Earthdata Login authentication and AWS credentials.\n",
"\n",
"The next step is to load the data. In this case, data are loaded into an `xarray.Dataset`. Data could be read into `numpy` arrays or a `pandas.Dataframe`. However, each granule would have to be read using a package that reads HDF5 granules such as `h5py`. `xarray` does this all _under-the-hood_ in a single line but for a single group in the HDF5 granule, in this case land ice heights for the gt1l beam*.\n",
"The next step is to load the data. In this case, data are loaded into an `xarray.Dataset`. Data could be read into `numpy` arrays or a `pandas.Dataframe`. However, each granule would then have to be read using a package that reads HDF5 granules, such as `h5py`. `xarray` does this all _under-the-hood_ in a single line, but only for a single group in the HDF5 granule*.\n",
"\n",
"*ICESat-2 measures photon returns from 3 beam pairs numbered 1, 2 and 3 that each consist of a left and a right beam"
"*ICESat-2 measures photon returns from 3 beam pairs numbered 1, 2 and 3 that each consist of a left and a right beam. In this case, we are interested in the left ground track (gt) of beam pair 1. "
]
},
{
Expand All @@ -329,8 +308,7 @@
"metadata": {},
"outputs": [],
"source": [
"%time\n",
"files = earthaccess.open(granules)\n",
"files = earthaccess.open(results)\n",
"ds = xr.open_dataset(files[1], group='/gt1l/land_ice_segments')"
]
},
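As a sketch of the `h5py` route mentioned above (reading one HDF5 group manually instead of letting `xarray` do it), the example below writes a tiny stand-in file with the same group path and an `h_li` (land-ice height) variable, then reads it back. The file contents are invented for illustration; a real ATL06 granule holds many more variables:

```python
import h5py
import numpy as np

# Write a tiny stand-in HDF5 file mimicking the ATL06 group layout.
# '/gt1l/land_ice_segments' matches the group opened with xarray above;
# the h_li values here are invented.
with h5py.File("demo_atl06.h5", "w") as f:
    grp = f.create_group("/gt1l/land_ice_segments")
    grp.create_dataset("h_li", data=np.array([1500.2, 1501.7, 1499.9]))

# Read the group back manually, as h5py would require for a real granule.
with h5py.File("demo_atl06.h5", "r") as f:
    heights = f["/gt1l/land_ice_segments/h_li"][:]

print(heights.mean())
```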

Large diffs are not rendered by default.
