# Getting started with GEDIPipeline

This notebook demonstrates how to use GEDIPipeline - the unified framework to download, subset, and clip Global Ecosystem Dynamics Investigation (GEDI) data over a specific Region of Interest (ROI).

With this framework, you can streamline the process of acquiring and preparing GEDI data for analysis, enabling more efficient workflows for remote sensing and environmental research.
With this example notebook, you'll learn:

- How to define an ROI and configure the Pipeline.
- Steps to find, download, and process GEDI granules.
- How to use the Pipeline to automate the entire workflow.

## Requirements

1. EarthData Credentials: Ensure you have an active NASA EarthData account. You can create one [here](https://urs.earthdata.nasa.gov/).
2. Python Environment: Install the required Python packages listed in the repository's `requirements.txt`.

Clone the repository to your machine or working environment so that the `pipeline` package can be imported from this notebook.

Repository Link:

https://github.com/leonelluiscorado/GEDI-Pipeline

For more details, consult the repository's [README](https://github.com/leonelluiscorado/GEDI-Pipeline/blob/main/README.md).

---

### Setup

In [9]:
# RUN THIS BEFORE GOING THROUGH NOTEBOOK

import os

example_path = "./example_usage" # Replace with desired notebook's output folder

# Create the output folder if it does not exist yet
os.makedirs(example_path, exist_ok=True)

---

# Defining the example ROI and acquisition dates

In this example, we will use a small ROI in Portugal to demonstrate the execution of GEDIPipeline. In this framework, the ROI is defined as a bounding box whose coordinates *must be* in WGS84 (EPSG:4326), organized in a list as follows:

`[UpperLeft_Latitude, UpperLeft_Longitude, LowerRight_Latitude, LowerRight_Longitude]`

- Example `[40.35, -6.93, 40.32, -6.87]`
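As a quick sanity check before running anything, the sketch below validates a bounding box in the order described above. `validate_roi` is a hypothetical helper and not part of GEDIPipeline; it assumes only that the upper-left corner lies north and west of the lower-right corner, and that the box does not cross the antimeridian.

```python
def validate_roi(roi):
    """Check a [UL_Lat, UL_Lon, LR_Lat, LR_Lon] bounding box (hypothetical helper)."""
    if len(roi) != 4:
        raise ValueError("ROI must have exactly four coordinates")
    ul_lat, ul_lon, lr_lat, lr_lon = roi
    if not all(-90 <= lat <= 90 for lat in (ul_lat, lr_lat)):
        raise ValueError("Latitudes must be within [-90, 90]")
    if not all(-180 <= lon <= 180 for lon in (ul_lon, lr_lon)):
        raise ValueError("Longitudes must be within [-180, 180]")
    if ul_lat <= lr_lat:
        raise ValueError("Upper-left latitude must be greater than lower-right latitude")
    if ul_lon >= lr_lon:
        raise ValueError("Upper-left longitude must be smaller than lower-right longitude")
    return roi


# The ROI used in this notebook passes without raising
validate_roi([40.356011, -6.938200, 40.321162, -6.876083])
```

A box with identical longitudes, or with a missing value, is rejected with a `ValueError` instead of silently producing an empty search.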

For the acquisition dates, we define two variables, `date_start` and `date_end`, which set the start and end of the period for which GEDI data is downloaded.
Each date is a *string* and must be in the format `"YYYY.MM.DD"`.

- Example date `"2024.11.28"`
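To catch a malformed date early, we can parse the strings ourselves before handing them to the framework. This is a minimal sketch using only the standard library; `parse_gedi_date` is a hypothetical helper, and the framework may perform its own validation.

```python
from datetime import datetime


def parse_gedi_date(text):
    """Parse a 'YYYY.MM.DD' string; raises ValueError for e.g. '2020-04-30'."""
    return datetime.strptime(text, "%Y.%m.%d")


start = parse_gedi_date("2020.04.30")
end = parse_gedi_date("2020.10.31")

# The search window only makes sense if it runs forward in time
if start > end:
    raise ValueError("date_start must not be later than date_end")
```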

In [10]:
# Define our ROI and data collection dates

roi = [40.356011, -6.938200, 40.321162, -6.876083]  # Replace with your desired coordinates
date_start = '2020.04.30' # Replace date_start and end with desired acquisition dates
date_end = '2020.10.31'

---

# Using the Finder class

The *Finder* searches NASA's data repository for all available orbits that pass over the ROI and returns a list with the download links for those GEDI orbits.
Before using the Finder, select the GEDI product and version to download.

## Available GEDI Products

- GEDI L1B Geolocated Waveform Data Global Footprint Level - GEDI01_B
- GEDI L2A Elevation and Height Metrics Data Global Footprint Level - GEDI02_A
- GEDI L2B Canopy Cover and Vertical Profile Metrics Data Global Footprint Level - GEDI02_B
- GEDI L4A Footprint Level Aboveground Biomass Density - GEDI04_A

For each GEDI data product, you can specify which version you want to download: version '001' or version '002'.

## Finding granules over the ROI

In [7]:
# Import class
from pipeline.finder import GEDIFinder

In [8]:
help(GEDIFinder) # Describe arguments and example usage

Help on class GEDIFinder in module pipeline.finder:

class GEDIFinder(builtins.object)
 |  GEDIFinder(product='GEDI02_A', version='002', date_start='', date_end='', roi=None)
 |
 |  The Finder :class: exports all the available URLs to download GEDI Data that passes over a given ROI and timestamp.
 |
 |  Args:
 |      product: GEDI Product (without version). Products available are {'GEDI01_B'; 'GEDI02_A'; 'GEDI02_B'; 'GEDI04_A'}
 |      version: Version of the desired GEDI Product. There are only two available versions 001 and 002
 |      date_start: Starting datetime to search for GEDI Data. Must be in format YEAR.month.day (e.g 2020.04.01)
 |      date_end: End datetime to search for GEDI Data. Must be in format YEAR.month.day (e.g 2020.12.31)
 |      roi: Region of Interest to search for granules. Coordinates must be in WG84 EPSG:4326 and organized as follows: [UL_Lat, UL_Lon, LR_Lat, LR_Lon]
 |
 |  Example usage:
 |      finder = GEDIFinder(product='GEDI04_A', version='002', date_st

In [9]:
# Create Finder instance

finder = GEDIFinder(
            product = 'GEDI02_A',
            version = '002',
            date_start = date_start,
            date_end = date_end,
            roi = roi
        )

In [10]:
# Return all available orbits

orbits = finder.find(save_file = True, output_filepath = example_path)

[Finder] Found 31 granules over bbox [-6.9382 40.321162 -6.876083 40.356011]
[Finder] Between dates (2020-04-30 00:00:00) and (2020-10-31 00:00:00) exist 3 granules over bbox [-6.9382 40.321162 -6.876083 40.356011]
[Finder] Estimated download size for select granules : 4.87 GB
[Finder] Saved links to file ./example_usage/GEDI02_A_GranuleList_20241118123155.txt
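Note that the bounding box printed in the log is ordered differently from our `roi` list: the values correspond to `[min_lon, min_lat, max_lon, max_lat]`. The hypothetical helper below reproduces that reordering. It is inferred from the log output only, not taken from the Finder's source code.

```python
def roi_to_bbox(roi):
    """Reorder [UL_Lat, UL_Lon, LR_Lat, LR_Lon] into [min_lon, min_lat, max_lon, max_lat]."""
    ul_lat, ul_lon, lr_lat, lr_lon = roi
    return [ul_lon, lr_lat, lr_lon, ul_lat]


print(roi_to_bbox([40.356011, -6.938200, 40.321162, -6.876083]))
# [-6.9382, 40.321162, -6.876083, 40.356011]
```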


In [11]:
orbits

[('https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/GEDI02_A.002/GEDI02_A_2020203182648_O09106_02_T00028_02_003_01_V002/GEDI02_A_2020203182648_O09106_02_T00028_02_003_01_V002.h5',
  '1681.25'),
 ('https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/GEDI02_A.002/GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002/GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002.h5',
  '1592.68'),
 ('https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/GEDI02_A.002/GEDI02_A_2020282111915_O10326_02_T07143_02_003_02_V002/GEDI02_A_2020282111915_O10326_02_T07143_02_003_02_V002.h5',
  '1600.65')]

Each orbit is a tuple of (download URL, file size). The file size, a string that appears to be expressed in megabytes, is useful for checking a file before downloading it (e.g. if an existing file is incomplete, it is downloaded again).
To access the URL, we simply take the first element of the tuple, like so: `granule[0]`.

In [12]:
# Obtain second URL

orbits[1][0]

'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/GEDI02_A.002/GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002/GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002.h5'
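To illustrate how the file size in each tuple can be used, here is a minimal sketch of a completeness check for an already downloaded file. Both helpers are hypothetical and are not the Downloader's own implementation. The sketch assumes the size is given in megabytes of 1024 × 1024 bytes and that the local file name is the last segment of the URL.

```python
import os

BYTES_PER_MB = 1024 * 1024  # assumption: sizes are reported in MB of 1024 * 1024 bytes


def local_path_for(url, directory):
    """Build the local path for a granule URL (file name = last URL segment)."""
    return os.path.join(directory, url.rsplit("/", 1)[-1])


def is_download_complete(local_path, expected_mb, tolerance_mb=0.01):
    """Return True if the file exists and its size matches expected_mb within tolerance."""
    if not os.path.isfile(local_path):
        return False
    actual_mb = os.path.getsize(local_path) / BYTES_PER_MB
    return abs(actual_mb - float(expected_mb)) <= tolerance_mb


# One (URL, size) tuple copied from the Finder output
granule = (
    "https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/GEDI02_A.002/"
    "GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002/"
    "GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002.h5",
    "1592.68",
)
path = local_path_for(granule[0], "./example_usage")
print(path, is_download_complete(path, granule[1]))
```

The small tolerance absorbs the rounding of the reported size to two decimals.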

After obtaining the intersecting orbits on the desired ROI, we can download these GEDI files with the Downloader class.

---

# Using the Downloader class

Before downloading, the framework asks for your EarthData credentials. After a successful login, it saves them to your user's `.netrc` file if the `persist_login` flag is `True` (it defaults to `False`).
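If you are unsure whether credentials were persisted by an earlier session, you can inspect the `.netrc` file with Python's standard `netrc` module. This is a sketch under two assumptions: the file lives at `~/.netrc` (the usual location on Linux and macOS), and the entry is stored under the host `urs.earthdata.nasa.gov`. `has_earthdata_login` is a hypothetical helper, not part of the framework.

```python
import netrc
import os

EARTHDATA_HOST = "urs.earthdata.nasa.gov"  # assumption: host name used for the stored entry


def has_earthdata_login(netrc_path=None):
    """Return True if the .netrc file contains an entry for the EarthData login host."""
    path = netrc_path or os.path.expanduser("~/.netrc")
    try:
        entry = netrc.netrc(path).authenticators(EARTHDATA_HOST)
    except (OSError, netrc.NetrcParseError):
        # Missing, unreadable, or malformed file: treat as "no stored login"
        return False
    return entry is not None


print(has_earthdata_login())
```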

In [13]:
# Import downloader class

from pipeline.downloader import GEDIDownloader

In [14]:
help(GEDIDownloader)

Help on class GEDIDownloader in module pipeline.downloader:

class GEDIDownloader(builtins.object)
 |  GEDIDownloader(persist_login=False, save_path=None)
 |
 |  The GEDIDownloader :class: implements a downloading mechanism for a given NASA Repository link, while keeping
 |  an authorization session alive.
 |
 |  It implements a file chunk downloading mechanism and a file checking step to skip a download or not.
 |
 |  Args:
 |          persist_login: Choice to persist login and save to a .netrc file. See Earthdata Access API for more info:
 |                                     https://earthaccess.readthedocs.io/en/latest/howto/authenticate/
 |          save_path: Absolute path to save the downloaded files. If None, saves to current working directory (script).
 |
 |  Methods defined here:
 |
 |  __init__(self, persist_login=False, save_path=None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |
 |  download_files(self, files_url)
 |      This function download

For the Downloader module, we simply create an instance and call `download_granule` for a single URL or `download_files` for a list of URLs.

In [15]:
# Create downloader instance

downloader = GEDIDownloader(persist_login = False, save_path = example_path)

Logging in EarthData...


Depending on the GEDI product, each downloaded HDF5 file occupies roughly 1-2 GB. Make sure you have enough free disk space before downloading.
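The free-space check can be scripted from the sizes returned by the Finder. The sketch below uses only the standard library; `enough_disk_space` is a hypothetical helper, the sizes are assumed to be in megabytes, and the 10% safety margin is an arbitrary choice.

```python
import shutil


def enough_disk_space(granules, directory=".", safety_factor=1.1):
    """Compare free space in `directory` with the summed granule sizes (assumed MB)."""
    required_bytes = sum(float(size_mb) for _, size_mb in granules) * 1024 * 1024
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= required_bytes * safety_factor


# Sizes copied from the Finder output; the names are shortened placeholders
sample = [("granule_1.h5", "1681.25"), ("granule_2.h5", "1592.68"), ("granule_3.h5", "1600.65")]
print(enough_disk_space(sample))
```

With the real Finder result, the call would simply be `enough_disk_space(orbits, example_path)`.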

In [16]:
# Download all of the intersecting files

downloader.download_files(orbits)

[Downloader] Downloading granule and saving "./example_usage/GEDI02_A_2020203182648_O09106_02_T00028_02_003_01_V002.h5"...


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1762915809/1762915809 [01:09<00:00, 25510863.15it/s]


[Downloader] Downloading granule and saving "./example_usage/GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002.h5"...


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1670050714/1670050714 [01:03<00:00, 26374529.12it/s]


[Downloader] Downloading granule and saving "./example_usage/GEDI02_A_2020282111915_O10326_02_T07143_02_003_02_V002.h5"...


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1678401705/1678401705 [01:03<00:00, 26266909.98it/s]


[('https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/GEDI02_A.002/GEDI02_A_2020203182648_O09106_02_T00028_02_003_01_V002/GEDI02_A_2020203182648_O09106_02_T00028_02_003_01_V002.h5',
  '1681.25'),
 ('https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/GEDI02_A.002/GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002/GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002.h5',
  '1592.68'),
 ('https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/GEDI02_A.002/GEDI02_A_2020282111915_O10326_02_T07143_02_003_02_V002/GEDI02_A_2020282111915_O10326_02_T07143_02_003_02_V002.h5',
  '1600.65')]

Your .h5 files are now saved in the specified output directory. We can do a quick check:

In [11]:
os.listdir(example_path)

['GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002.h5',
 'GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002.gpkg',
 'GEDI02_A_2020203182648_O09106_02_T00028_02_003_01_V002.gpkg',
 'GEDI02_A_2020203182648_O09106_02_T00028_02_003_01_V002.h5',
 'GEDI02_A_GranuleList_20241118115831.txt',
 'GEDI02_A_GranuleList_20241118130908.txt',
 'GEDI02_A_GranuleList_20241118123155.txt',
 'GEDI02_A_GranuleList_20241118123005.txt',
 'GEDI02_A_2020282111915_O10326_02_T07143_02_003_02_V002.h5',
 'GEDI02_A_2020282111915_O10326_02_T07143_02_003_02_V002.gpkg']
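The granule file names themselves carry useful metadata. Assuming the usual naming pattern of these files, in which the product name is followed by a `YYYYDDDHHMMSS` acquisition timestamp (year, day of year, time), an `O`-prefixed orbit number, and a trailing `V`-prefixed version, the names can be parsed as sketched below. `parse_granule_name` is a hypothetical helper and the regular expression is an assumption based on the names listed above.

```python
import re
from datetime import datetime

# Assumed pattern: <product>_<YYYYDDDHHMMSS>_O<orbit>_..._V<version>.<extension>
GRANULE_RE = re.compile(r"^(GEDI\d{2}_[AB])_(\d{13})_O(\d{5})_.*_V(\d{3})\.\w+$")


def parse_granule_name(filename):
    """Extract product, acquisition time, orbit number, and version from a granule name."""
    match = GRANULE_RE.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized granule name: {filename}")
    product, stamp, orbit, version = match.groups()
    return {
        "product": product,
        "acquired": datetime.strptime(stamp, "%Y%j%H%M%S"),  # %j = day of year
        "orbit": int(orbit),
        "version": version,
    }


info = parse_granule_name("GEDI02_A_2020282111915_O10326_02_T07143_02_003_02_V002.h5")
print(info["acquired"])  # 2020-10-08 11:19:15
```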

The Downloader saves each granule in full to the specified directory, because the provided APIs do not currently support subsetting a granule before download. The listed .h5 files are therefore not clipped to our ROI and contain every variable of the data product. To solve this, we use the Subsetter class, described next.

---

# Using the Subsetter class

This class takes a previously downloaded granule, clips it to the study area, selects the Science Dataset (SDS) variables to extract for each footprint, and outputs the subsetted orbit. It accepts any GEDI HDF5 file downloaded from LP DAAC. Let's start by creating a Subsetter instance.

In [17]:
# Import class
from pipeline.subsetter import GEDISubsetter

In [18]:
help(GEDISubsetter)

Help on class GEDISubsetter in module pipeline.subsetter:

class GEDISubsetter(builtins.object)
 |  GEDISubsetter(roi, product, out_dir, out_format=None, sds=None, beams=None)
 |
 |  The GEDISubsetter :class: clips the granule to the specified ROI and selects all the desired variables for each footprint
 |  Args:
 |      roi: Region of Interest to search for granules. Coordinates must be in WG84 EPSG:4326 and organized as follows: [UL_Lat, UL_Lon, LR_Lat, LR_Lon]
 |           Rectangle polygons ONLY. TODO: Take SHP files as argument and MultiPolygon
 |      sds: Science Dataset variables used to extract from the granule. Check the product Data Dictionary for more info.
 |           If the variable is inside a group except BEAMXXXX/, it must be specified ( e.g '/geolocation/lat_lowestmode' )
 |           If None, extracts the default variables; else it appends to default variables.
 |      beams: Keeps footprints of select BEAMS to extract from the granule. If none, selects all the avai

The Subsetter allows a high level of customization (e.g. selecting BEAMS and variables from the specified data product).
Before selecting SDS variables, check the product's Data Dictionary for the variable names. If a variable is inside a group or subgroup other than `BEAMXXXX/`, its parent group must be specified, for example `/geolocation/lat_lowestmode`. If the `sds` argument is `None`, the default lists in `subsetter.py` are used; otherwise, the given variables are appended to the defaults.

The default lists are:

```python
# Default layers to be subset and exported, see README for information on how to add additional layers
l1b_subset = ['/geolocation/latitude_bin0', '/geolocation/longitude_bin0', '/channel', '/shot_number', '/rx_sample_start_index',
             '/rxwaveform','/rx_sample_count', '/stale_return_flag', '/tx_sample_count', '/txwaveform',
             '/geolocation/degrade', '/geolocation/delta_time', '/geolocation/digital_elevation_model',
              '/geolocation/solar_elevation',  '/geolocation/local_beam_elevation',  '/noise_mean_corrected',
             '/geolocation/elevation_bin0', '/geolocation/elevation_lastbin', '/geolocation/surface_type', '/geolocation/digital_elevation_model_srtm']

l2a_subset = ['/lat_lowestmode', '/lon_lowestmode', '/channel', '/shot_number', '/degrade_flag', '/delta_time', 
             '/digital_elevation_model', '/elev_lowestmode', '/quality_flag', '/rh', '/sensitivity', '/rx_cumulative', '/digital_elevation_model_srtm', 
             '/elevation_bias_flag', '/surface_flag',  '/num_detectedmodes',  '/selected_algorithm',  '/solar_elevation']


l2b_subset = ['/geolocation/lat_lowestmode', '/geolocation/lon_lowestmode', '/channel', '/geolocation/shot_number',
             '/cover', '/cover_z', '/fhd_normal', '/pai', '/pai_z',  '/rhov',  '/rhog',
             '/pavd_z', '/l2a_quality_flag', '/l2b_quality_flag', '/rh100', '/sensitivity',  
             '/stale_return_flag', '/surface_flag', '/geolocation/degrade_flag',  '/geolocation/solar_elevation',
             '/geolocation/delta_time', '/geolocation/digital_elevation_model', '/geolocation/elev_lowestmode', '/pgap_theta']

l4a_subset = [] # TODO: select relevant L4A product variables
```

The user can also select desired BEAMs in a list like so: `['BEAM0000', 'BEAM0001']`. If the user does not specify BEAMS, it defaults to all available beams:

```python
# Default BEAM Subset
beam_subset = ['BEAM0000', 'BEAM0001', 'BEAM0010', 'BEAM0011', 'BEAM0101', 'BEAM0110', 'BEAM1000', 'BEAM1011']
```

### For now, we'll use the default BEAMS and SDS variables for the L2A data product

Let's subset the previously downloaded files:

In [19]:
subsetter = GEDISubsetter(
                roi = roi,              # Desired ROI to clip
                product = 'GEDI02_A',   # Desired data product
                out_dir = example_path, # Output file directory to save the .GPKG files
                sds = None,             # SDS Variables to append to default
                beams = None            # BEAMS to select, None selects all the available BEAMS
            )
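When passing custom `beams` or `sds` lists instead of `None`, a typo in a name is easy to miss. The sketch below checks both lists before they reach the Subsetter. `check_beams` and `check_sds` are hypothetical helpers; the beam names are copied from the default BEAM subset, and the leading-slash rule is an assumption based on how the default SDS lists are written.

```python
# Copied from the default BEAM subset listed earlier
DEFAULT_BEAMS = ['BEAM0000', 'BEAM0001', 'BEAM0010', 'BEAM0011',
                 'BEAM0101', 'BEAM0110', 'BEAM1000', 'BEAM1011']


def check_beams(beams):
    """Return the beam list, or all default beams if None; reject unknown names."""
    if beams is None:
        return list(DEFAULT_BEAMS)
    unknown = [b for b in beams if b not in DEFAULT_BEAMS]
    if unknown:
        raise ValueError(f"Unknown beam name(s): {unknown}")
    return list(beams)


def check_sds(sds):
    """Make sure every SDS path starts with '/', as in the default lists (assumption)."""
    if sds is None:
        return []
    return [path if path.startswith('/') else '/' + path for path in sds]


print(check_beams(['BEAM0000', 'BEAM0001']))
print(check_sds(['geolocation/lat_lowestmode']))

# The checked lists could then be passed on, e.g.:
# GEDISubsetter(roi=roi, product='GEDI02_B', out_dir=example_path,
#               sds=check_sds(['/geolocation/lat_lowestmode']),
#               beams=check_beams(['BEAM0000', 'BEAM0001']))
```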

In [12]:
# Select paths of all the downloaded .H5 files

files = [os.path.join(example_path, f) for f in os.listdir(example_path) if f.endswith('.h5')]

files

['./example_usage/GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002.h5',
 './example_usage/GEDI02_A_2020203182648_O09106_02_T00028_02_003_01_V002.h5',
 './example_usage/GEDI02_A_2020282111915_O10326_02_T07143_02_003_02_V002.h5']

In [25]:
dfs = [] # Dataframes List

# Subset the downloaded granules
for file in files:
    file_df = subsetter.subset(file) # Subset file
    dfs.append(file_df) # Save to all GeoDataFrames

[Subsetter] Processing file: ./example_usage/GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002.h5
[Subsetter] Selecting BEAMS and clipping to ROI ...
[Subsetter] Intersecting shots found. Selecting variables from subset ...
[Subsetter] No intersecting shots found for BEAM0010 for <HDF5 file "GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002.h5" (mode r)>.
[Subsetter] No intersecting shots found for BEAM0011 for <HDF5 file "GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002.h5" (mode r)>.
[Subsetter] No intersecting shots found for BEAM0101 for <HDF5 file "GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002.h5" (mode r)>.
[Subsetter] No intersecting shots found for BEAM0110 for <HDF5 file "GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002.h5" (mode r)>.
[Subsetter] No intersecting shots found for BEAM1000 for <HDF5 file "GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002.h5" (mode r)>.
[Subsetter] No intersecting shots found for BEAM1011 for <HDF5 file "GEDI02_

After subsetting and writing the .gpkg files, the `subset` method returns each dataset as a GeoDataFrame, so we can keep working with them in the notebook. Let's check them out:

In [26]:
example_df = dfs[2]

example_df.head()

| | BEAM | shot_number | Latitude | Longitude | index | degrade_flag | delta_time | digital_elevation_model | digital_elevation_model_srtm | elev_lowestmode | ... | rx_processing_a6_rx_cumulative_97 | rx_processing_a6_rx_cumulative_98 | rx_processing_a6_rx_cumulative_99 | rx_processing_a6_rx_cumulative_100 | selected_algorithm | sensitivity | solar_elevation | surface_flag | geometry | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BEAM0101 | 103260500200154054 | 40.350226 | -6.937722 | 40629 | 0 | 87393380.0 | 919.755676 | 914.721436 | 917.06543 | ... | 309.0 | 306.0 | 302.5 | 295.75 | 2 | 0.984625 | 43.149551 | 1 | POINT (-6.93772 40.35023) | 2020/10/08 |
| 1 | BEAM0101 | 103260500200154055 | 40.350536 | -6.937191 | 40630 | 0 | 87393380.0 | 917.794312 | 922.721619 | 927.580261 | ... | 307.25 | 304.0 | 299.5 | 292.0 | 2 | 0.983251 | 43.149296 | 1 | POINT (-6.93719 40.35054) | 2020/10/08 |
| 2 | BEAM0101 | 103260500200154056 | 40.350847 | -6.936658 | 40631 | 0 | 87393380.0 | 932.977905 | 932.721802 | 935.411133 | ... | 308.0 | 305.25 | 301.75 | 295.0 | 1 | 0.959319 | 43.149036 | 1 | POINT (-6.93666 40.35085) | 2020/10/08 |
| 3 | BEAM0101 | 103260500200154057 | 40.351159 | -6.936124 | 40632 | 0 | 87393380.0 | 942.34613 | 937.721985 | 939.019165 | ... | 308.5 | 304.75 | 300.5 | 293.0 | 1 | 0.960924 | 43.148777 | 1 | POINT (-6.93612 40.35116) | 2020/10/08 |
| 4 | BEAM0101 | 103260500200154058 | 40.351473 | -6.935587 | 40633 | 0 | 87393380.0 | 928.473145 | 934.722168 | 929.192932 | ... | 309.25 | 306.75 | 303.25 | 294.5 | 2 | 0.988238 | 43.148518 | 1 | POINT (-6.93559 40.35147) | 2020/10/08 |
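A common next step is to merge the per-granule frames and keep only good-quality shots. The sketch below uses plain pandas on tiny made-up frames that stand in for the `dfs` list; the values are placeholders, not real GEDI data. It assumes the `quality_flag` and `degrade_flag` columns from the default L2A variable list are present, and that `quality_flag == 1` with `degrade_flag == 0` is the filter you want, so adapt it to your own criteria.

```python
import pandas as pd


def combine_and_filter(frames):
    """Concatenate per-granule frames and keep shots flagged as good quality."""
    combined = pd.concat(frames, ignore_index=True)
    good = (combined["quality_flag"] == 1) & (combined["degrade_flag"] == 0)
    return combined[good].reset_index(drop=True)


# Made-up placeholder frames standing in for the GeoDataFrames in `dfs`
a = pd.DataFrame({"shot_number": [1, 2], "quality_flag": [1, 0], "degrade_flag": [0, 0]})
b = pd.DataFrame({"shot_number": [3], "quality_flag": [1], "degrade_flag": [0]})

print(combine_and_filter([a, b])["shot_number"].tolist())  # [1, 3]
```

With the real data, the call would be `combine_and_filter(dfs)`.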


We have successfully downloaded and processed GEDI orbit(s)! The output can now be processed for the user's research purposes.

# Using GEDIPipeline

Using each module separately is useful for specific applications (e.g. finding only the orbits available over the ROI). However, creating instances of all three classes to run the entire process can be time-consuming.
The entire process described in this notebook can be automated with a single class, GEDIPipeline, which adds a few improvements:

- The Pipeline automatically deletes the original downloaded file after subsetting it, saving disk space.
- More improvements are in progress.
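If you prefer to keep using the separate classes, the clean-up step from the first bullet can be reproduced by hand. This is a minimal sketch of the idea, not the Pipeline's actual implementation; `subset_and_cleanup` is a hypothetical helper that takes any subsetting callable, such as `subsetter.subset`.

```python
import os


def subset_and_cleanup(h5_paths, subset_fn):
    """Subset each file, then delete the original to free disk space."""
    results = []
    for path in h5_paths:
        results.append(subset_fn(path))  # if this raises, the original file is kept
        os.remove(path)                  # delete only after subsetting succeeded
    return results


# Example call with the objects created earlier in this notebook:
# dfs = subset_and_cleanup(files, subsetter.subset)
```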

For this Pipeline, we specify all of the previously described arguments for each class, in one instance creation.

In [4]:
# Import

from pipeline.pipeline import GEDIPipeline

In [5]:
help(GEDIPipeline)

Help on class GEDIPipeline in module pipeline.pipeline:

class GEDIPipeline(builtins.object)
 |  GEDIPipeline(out_directory, product, version, date_start, date_end, roi, sds, beams, persist_login=False)
 |
 |  The GEDIPipeline :class: performs all operations in selecting, downloading and subsetting GEDI Data for a given region of interest
 |  Args:
 |      out
 |
 |  Methods defined here:
 |
 |  __init__(self, out_directory, product, version, date_start, date_end, roi, sds, beams, persist_login=False)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |
 |  run_pipeline(self)
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables
 |
 |  __weakref__
 |      list of weak references to the object



In [6]:
# Pipeline instance

pipeline = GEDIPipeline(
    out_directory = example_path,
    product = 'GEDI02_A',
    version = '002',
    date_start = date_start,
    date_end = date_end,
    roi = roi,
    beams = None,
    sds = None,
    persist_login = False
)

Logging in EarthData...


To run the entire pipeline, we simply call `run_pipeline()` on our pipeline instance.

In [8]:
pipeline.run_pipeline()

[Finder] Found 31 granules over bbox [-6.9382 40.321162 -6.876083 40.356011]
[Finder] Between dates (2020-04-30 00:00:00) and (2020-10-31 00:00:00) exist 3 granules over bbox [-6.9382 40.321162 -6.876083 40.356011]
[Finder] Estimated download size for select granules : 4.87 GB
[Finder] Saved links to file ./example_usage/GEDI02_A_GranuleList_20241118130908.txt
Skipping granule from link ('https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/GEDI02_A.002/GEDI02_A_2020203182648_O09106_02_T00028_02_003_01_V002/GEDI02_A_2020203182648_O09106_02_T00028_02_003_01_V002.h5', '1681.25') as it is already subsetted.
Skipping granule from link ('https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/GEDI02_A.002/GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002/GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002.h5', '1592.68') as it is already subsetted.
Skipping granule from link ('https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/GEDI02_A.002/GEDI02_A_20202821

[('https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/GEDI02_A.002/GEDI02_A_2020203182648_O09106_02_T00028_02_003_01_V002/GEDI02_A_2020203182648_O09106_02_T00028_02_003_01_V002.h5',
  '1681.25'),
 ('https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/GEDI02_A.002/GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002/GEDI02_A_2020278125206_O10265_02_T08566_02_003_02_V002.h5',
  '1592.68'),
 ('https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/GEDI02_A.002/GEDI02_A_2020282111915_O10326_02_T07143_02_003_02_V002/GEDI02_A_2020282111915_O10326_02_T07143_02_003_02_V002.h5',
  '1600.65')]
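To process several products over the same ROI, one option is to prepare one set of keyword arguments per product and run the Pipeline in a loop. The sketch below only builds and validates the arguments. `build_pipeline_kwargs` is a hypothetical helper; the keyword names follow the `GEDIPipeline` signature shown in the `help` output, and the product and version lists are the ones given earlier in this notebook.

```python
VALID_PRODUCTS = ("GEDI01_B", "GEDI02_A", "GEDI02_B", "GEDI04_A")
VALID_VERSIONS = ("001", "002")


def build_pipeline_kwargs(products, out_directory, date_start, date_end, roi, version="002"):
    """Return one keyword-argument dict per product (hypothetical helper)."""
    if version not in VALID_VERSIONS:
        raise ValueError(f"Unknown version: {version}")
    unknown = [p for p in products if p not in VALID_PRODUCTS]
    if unknown:
        raise ValueError(f"Unknown product(s): {unknown}")
    return [
        dict(out_directory=out_directory, product=product, version=version,
             date_start=date_start, date_end=date_end, roi=roi,
             sds=None, beams=None, persist_login=False)
        for product in products
    ]


runs = build_pipeline_kwargs(
    ["GEDI02_A", "GEDI02_B"], "./example_usage",
    "2020.04.30", "2020.10.31",
    [40.356011, -6.938200, 40.321162, -6.876083],
)
print([run["product"] for run in runs])  # ['GEDI02_A', 'GEDI02_B']

# Each dict could then be passed on, e.g.:
# for run in runs:
#     GEDIPipeline(**run).run_pipeline()
```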

---

This notebook showed the basic usage of the Pipeline. Each module can also be used separately for specific purposes.
If you have any questions, open an issue on the GitHub repository or contact us at: leonel.corado@uevora.pt