# Getting started with GEDIPipeline

This notebook demonstrates how to use GEDIPipeline - the unified framework to download, subset, and clip Global Ecosystem Dynamics Investigation (GEDI) data over a specific Region of Interest (ROI).

With this framework, you can streamline the process of acquiring and preparing GEDI data for analysis, enabling more efficient workflows for remote sensing and environmental research.
With this example notebook, you'll learn:

- How to define an ROI and configure the Pipeline.
- Steps to find, download, and process GEDI granules.
- How to use the Pipeline to automate the entire workflow.

## Requirements

1. EarthData Credentials: Ensure you have an active NASA EarthData account. You can create one [here](https://urs.earthdata.nasa.gov/).
2. Python Environment: Install the required Python packages listed in the repository's requirements.txt.

Make sure you have access to the repository by cloning it to your working machine or another working environment.

Repository Link:

https://github.com/leonelluiscorado/GEDI-Pipeline

For more details, consult the repository's [README](https://github.com/leonelluiscorado/GEDI-Pipeline/blob/main/README.md).

---

### Setup

In [None]:
# RUN THIS BEFORE GOING THROUGH NOTEBOOK

import os

example_path = "./example_usage" # Replace with desired notebook's output folder

# Create the output folder (and any missing parent folders) if it does not exist yet
os.makedirs(example_path, exist_ok=True)

---

# Defining the example ROI and acquisition dates

In this example, we use a small ROI in Portugal to demonstrate GEDIPipeline. The framework defines the ROI as a bounding box whose coordinates *must be* in WGS84 (EPSG:4326), given as a list in this order:

`[UpperLeft_Latitude, UpperLeft_Longitude, LowerRight_Latitude, LowerRight_Longitude]`

- Example: `[40.36, -6.94, 40.32, -6.88]`

For the acquisition dates, we define two variables `date_start` and `date_end` which describe the desired start and end dates from which to download GEDI data.
Each date is a *string* and must be in this format `"YYYY.MM.DD"`.

- Example date `"2024.11.28"`
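
Before passing these values to the framework, it can help to check them. The sketch below is not part of GEDIPipeline: `validate_roi` and `parse_gedi_date` are hypothetical helpers that only encode the rules described above (corner order, coordinate ranges, and the `"YYYY.MM.DD"` format). It assumes the ROI does not cross the antimeridian.

```python
from datetime import datetime


def validate_roi(roi):
    """Check a [UL_lat, UL_lon, LR_lat, LR_lon] bounding box given in EPSG:4326."""
    if len(roi) != 4:
        raise ValueError("ROI must contain exactly four coordinates")
    ul_lat, ul_lon, lr_lat, lr_lon = roi
    for lat in (ul_lat, lr_lat):
        if not -90 <= lat <= 90:
            raise ValueError(f"latitude out of range: {lat}")
    for lon in (ul_lon, lr_lon):
        if not -180 <= lon <= 180:
            raise ValueError(f"longitude out of range: {lon}")
    # The upper-left corner must lie north and west of the lower-right corner
    if ul_lat <= lr_lat:
        raise ValueError("upper-left latitude must be greater than lower-right latitude")
    if ul_lon >= lr_lon:
        raise ValueError("upper-left longitude must be smaller than lower-right longitude")
    return True


def parse_gedi_date(text):
    """Parse a 'YYYY.MM.DD' string; raises ValueError if the format does not match."""
    return datetime.strptime(text, "%Y.%m.%d")
```

Calling `validate_roi(roi)` and comparing `parse_gedi_date(date_start) <= parse_gedi_date(date_end)` before creating the Finder catches swapped corners or dates early.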

In [None]:
# Define our ROI and data collection dates

roi = [40.356011, -6.938200, 40.321162, -6.876083]  # Replace with your desired coordinates
date_start = '2020.04.30' # Replace date_start and end with desired acquisition dates
date_end = '2020.10.31'

---

# Using the Finder class

The *Finder* searches NASA's data repository for all available orbits that pass over the ROI and returns a list with the download links for the matching GEDI orbits.
Before using the Finder, select the GEDI product and version you want to download.

## Available GEDI Products

- GEDI L1B Geolocated Waveform Data Global Footprint Level - GEDI01_B
- GEDI L2A Elevation and Height Metrics Data Global Footprint Level - GEDI02_A
- GEDI L2B Canopy Cover and Vertical Profile Metrics Data Global Footprint Level - GEDI02_B
- GEDI L4A Footprint Level Aboveground Biomass Density - GEDI04_A

For each GEDI data product, you can specify which version you want to download: version '001' or version '002'.
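
To avoid typos in the product or version strings, we could check them against the lists above before creating the Finder. This is a minimal sketch, not code from the repository: `GEDI_PRODUCTS`, `GEDI_VERSIONS`, and `check_product` are hypothetical names that simply mirror the products and versions listed in this notebook.

```python
# Product short names and descriptions, copied from the list above
GEDI_PRODUCTS = {
    "GEDI01_B": "L1B Geolocated Waveform Data",
    "GEDI02_A": "L2A Elevation and Height Metrics Data",
    "GEDI02_B": "L2B Canopy Cover and Vertical Profile Metrics Data",
    "GEDI04_A": "L4A Footprint Level Aboveground Biomass Density",
}
GEDI_VERSIONS = ("001", "002")


def check_product(product, version):
    """Return (product, version) if both appear in the lists above, else raise ValueError."""
    if product not in GEDI_PRODUCTS:
        raise ValueError(f"unknown product {product!r}; expected one of {sorted(GEDI_PRODUCTS)}")
    if version not in GEDI_VERSIONS:
        raise ValueError(f"unknown version {version!r}; expected one of {GEDI_VERSIONS}")
    return product, version
```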

## Using the Finder class

In [None]:
# Import class
from pipeline.finder import GEDIFinder

In [None]:
help(GEDIFinder) # Describe arguments and example usage

In [None]:
# Create Finder instance

finder = GEDIFinder(
            product = 'GEDI02_A',
            version = '002',
            date_start = date_start,
            date_end = date_end,
            roi = roi
        )

In [None]:
# Return all available orbits

orbits = finder.find(save_file = True, output_filepath = example_path)

In [None]:
orbits

Each orbit is a tuple of (Download URL, Filesize). The file size is useful for checking files before downloading (e.g. if a local copy is incomplete, download it again).
To access the URL, we simply take the first element of the tuple, like so: `granule[0]`.
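
As an illustration of that file check, the sketch below compares the size of a local file with the size reported for the granule. It is not the Downloader's own logic: `local_name` and `needs_download` are hypothetical helpers, and the comparison assumes the reported file size is expressed in bytes, which this notebook does not confirm.

```python
import os
from urllib.parse import urlparse


def local_name(url):
    """Derive the local file name from the last path segment of a download URL."""
    return os.path.basename(urlparse(url).path)


def needs_download(url, expected_size, directory):
    """Return True if the file is missing or its size differs from expected_size.

    Assumption: expected_size is the granule size in bytes.
    """
    path = os.path.join(directory, local_name(url))
    if not os.path.exists(path):
        return True
    return os.path.getsize(path) != int(expected_size)
```

With the list returned by the Finder, a call could look like `needs_download(orbits[1][0], orbits[1][1], example_path)`.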

In [None]:
# Obtain second URL

orbits[1][0]

After finding the orbits that intersect the ROI, we can download the corresponding GEDI files with the Downloader class.

---

# Using the Downloader class

Before downloading, the framework asks for your EarthData credentials. After a successful login, it saves them to your user's `.netrc` file if the `persist_login` flag is `True` (the default is `False`).
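
If you are unsure whether credentials are already stored, Python's standard `netrc` module can read the file. The sketch below is independent of GEDIPipeline; `has_earthdata_login` is a hypothetical helper, and the host name `urs.earthdata.nasa.gov` is an assumption based on the EarthData login address linked in the Requirements section, so the entry written by the framework may differ.

```python
import netrc

# Assumption: credentials are stored under the EarthData login host
EARTHDATA_HOST = "urs.earthdata.nasa.gov"


def has_earthdata_login(netrc_path):
    """Return True if the netrc file at netrc_path has an entry for EARTHDATA_HOST."""
    try:
        entries = netrc.netrc(netrc_path)
    except (FileNotFoundError, netrc.NetrcParseError):
        # Missing or malformed file: treat as "no stored credentials"
        return False
    return entries.authenticators(EARTHDATA_HOST) is not None
```

On Linux and macOS, the file usually lives at `os.path.expanduser("~/.netrc")`.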

In [None]:
# Import downloader class

from pipeline.downloader import GEDIDownloader

In [None]:
help(GEDIDownloader)

To use the Downloader module, we create an instance and call `download_granule` for a single URL or `download_files` for a list of URLs.

In [None]:
# Create downloader instance

downloader = GEDIDownloader(persist_login = False, save_path = example_path)

Depending on the GEDI Product, each downloaded HDF5 file will occupy ~1-2GB. Be sure that you have enough disk space to download.
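
A quick way to check this is to compare the free space on the output drive with a rough upper bound for the download. The helper below is a sketch using only the standard library; `has_room_for` is a hypothetical name, and the 2 GB per file default is simply the upper end of the estimate above, not a value provided by the framework.

```python
import shutil

GIB = 1024 ** 3  # bytes in one gibibyte


def has_room_for(directory, n_files, gb_per_file=2.0):
    """Return True if `directory` has enough free space for n_files granules.

    Assumption: each granule takes at most gb_per_file gigabytes.
    """
    required = int(n_files * gb_per_file * GIB)
    free = shutil.disk_usage(directory).free
    return free >= required
```

For this notebook, the check would be `has_room_for(example_path, len(orbits))`.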

In [None]:
# Download all of the intersecting files

downloader.download_files(orbits)

Your .H5 files are now saved in the specified output directory. We can do a quick check:

In [None]:
os.listdir(example_path)

The Downloader saves each granule in full to the specified directory, because the available APIs do not currently support subsetting a granule before download. The listed .H5 files are therefore not clipped to our ROI and contain every variable of the data product. The Subsetter class, described next, solves this.

---

# Using the Subsetter class

This class clips each previously downloaded granule to the study area, selects the Science Dataset (SDS) variables to extract for each footprint, and writes the subsetted orbit to disk. It accepts any GEDI HDF5 file downloaded from the LP DAAC. Let's start by creating a Subsetter instance.

In [None]:
# Import class
from pipeline.subsetter import GEDISubsetter

In [None]:
help(GEDISubsetter)

The Subsetter allows for a high level of customization (e.g. select BEAMS and variables from the specified data product).
Before selecting SDS variables, check the Data Product Dictionary for the exact variable names. If a variable sits inside a group or subgroup below `BEAMXXXX/`, include its parent group in the name, for example `/geolocation/lat_lowestmode`. If you leave the SDS argument unset, the Subsetter uses the default lists in `subsetter.py`; any variables you pass are appended to those defaults.
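
To make the "append to the defaults" behaviour concrete, here is a minimal sketch of how such a merge could work. It is not the code in `subsetter.py`: `merge_sds` is a hypothetical helper, and both the leading-slash normalisation and the duplicate check are assumptions made for illustration.

```python
def merge_sds(defaults, extra=None):
    """Return the default SDS names followed by any extra names not already present."""
    merged = list(defaults)
    for name in extra or []:
        # Assumption: SDS names are written as absolute paths inside the beam group
        path = name if name.startswith("/") else "/" + name
        if path not in merged:
            merged.append(path)
    return merged
```

For example, `merge_sds(['/rh', '/sensitivity'], ['/geolocation/lat_lowestmode'])` keeps the two defaults and adds the geolocation variable at the end.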

The default lists are:

```python
# Default layers to be subset and exported, see README for information on how to add additional layers
l1b_subset = ['/geolocation/latitude_bin0', '/geolocation/longitude_bin0', '/channel', '/shot_number', '/rx_sample_start_index',
             '/rxwaveform','/rx_sample_count', '/stale_return_flag', '/tx_sample_count', '/txwaveform',
             '/geolocation/degrade', '/geolocation/delta_time', '/geolocation/digital_elevation_model',
              '/geolocation/solar_elevation',  '/geolocation/local_beam_elevation',  '/noise_mean_corrected',
             '/geolocation/elevation_bin0', '/geolocation/elevation_lastbin', '/geolocation/surface_type', '/geolocation/digital_elevation_model_srtm', '/geolocation/degrade']

l2a_subset = ['/lat_lowestmode', '/lon_lowestmode', '/channel', '/shot_number', '/degrade_flag', '/delta_time', 
             '/digital_elevation_model', '/elev_lowestmode', '/quality_flag', '/rh', '/sensitivity', '/rx_cumulative', '/digital_elevation_model_srtm', 
             '/elevation_bias_flag', '/surface_flag',  '/num_detectedmodes',  '/selected_algorithm',  '/solar_elevation']


l2b_subset = ['/geolocation/lat_lowestmode', '/geolocation/lon_lowestmode', '/channel', '/geolocation/shot_number',
             '/cover', '/cover_z', '/fhd_normal', '/pai', '/pai_z',  '/rhov',  '/rhog',
             '/pavd_z', '/l2a_quality_flag', '/l2b_quality_flag', '/rh100', '/sensitivity',  
             '/stale_return_flag', '/surface_flag', '/geolocation/degrade_flag',  '/geolocation/solar_elevation',
             '/geolocation/delta_time', '/geolocation/digital_elevation_model', '/geolocation/elev_lowestmode', '/pgap_theta']

l4a_subset = [] # TODO: select relevant L4A product variables
```

You can also select specific beams by passing a list such as `['BEAM0000', 'BEAM0001']`. If no beams are specified, the Subsetter uses all available beams:

```python
# Default BEAM Subset
beam_subset = ['BEAM0000', 'BEAM0001', 'BEAM0010', 'BEAM0011', 'BEAM0101', 'BEAM0110', 'BEAM1000', 'BEAM1011']
```

### For now, we'll use the default BEAMS and SDS variables for the L2A data product

Let's subset the previously downloaded files:

In [None]:
subsetter = GEDISubsetter(
                roi = roi,              # Desired ROI to clip
                product = 'GEDI02_A',   # Desired data product
                out_dir = example_path, # Output file directory to save the .GPKG files
                sds = None,             # SDS Variables to append to default
                beams = None            # BEAMS to select, None selects all the available BEAMS
            )

In [None]:
# Select paths of all the downloaded .H5 files

files = [os.path.join(example_path, f) for f in os.listdir(example_path) if f.endswith('.h5')]

files

In [None]:
dfs = [] # Dataframes List

# Subset the downloaded granules
for file in files:
    file_df = subsetter.subset(file) # Subset file
    dfs.append(file_df) # Save to all GeoDataFrames

After subsetting and writing the .GPKG files, the `subset` method returns each dataset as a GeoDataFrame, so we can keep working with them in the notebook. Let's check them out:
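
If a single table is more convenient than a list of per-granule frames, the results can be stacked with `pandas.concat`. This is a sketch, not part of the framework: `combine_frames` is a hypothetical helper, and skipping `None` or empty results is an assumption about what `subset` may return when no footprints fall inside the ROI. The example uses plain pandas; concatenating GeoDataFrames that share the same coordinate reference system should work the same way.

```python
import pandas as pd


def combine_frames(frames):
    """Stack the non-empty frames into one table with a fresh index."""
    kept = [df for df in frames if df is not None and len(df) > 0]
    if not kept:
        return pd.DataFrame()
    return pd.concat(kept, ignore_index=True)
```

In this notebook, the call would be `all_footprints = combine_frames(dfs)`.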

In [None]:
example_df = dfs[0]  # First subsetted granule; any index below len(dfs) works

example_df.head()

We have successfully downloaded and processed GEDI orbit(s)! The output can now be processed for the user's research purposes.

# Using GEDIPipeline

Using each module separately is useful for specific applications (e.g. only finding the orbits available over an ROI). However, creating an instance of each of the three classes and running every step by hand can be time-consuming.
The whole process described in this notebook can be automated with a single class, GEDIPipeline, which adds a few improvements:

- The Pipeline automatically deletes the original downloaded file after subsetting it, saving disk space.
- More improvements are in progress.
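
The same space-saving idea can be applied when using the modules separately. The sketch below is not the Pipeline's implementation: `process_then_delete` is a hypothetical helper that removes each file only after its processing step has finished without raising an error, so a failed step leaves the original file in place.

```python
import os


def process_then_delete(paths, process):
    """Apply `process` to each file and delete the file once processing succeeds."""
    results = []
    for path in paths:
        result = process(path)  # If this raises, the file is kept for a retry
        os.remove(path)
        results.append(result)
    return results
```

With the objects created earlier, this could be called as `process_then_delete(files, subsetter.subset)`.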

For this Pipeline, we specify all of the previously described arguments for each class, in one instance creation.

In [None]:
# Import

from pipeline.pipeline import GEDIPipeline

In [None]:
help(GEDIPipeline)

In [None]:
# Pipeline instance

pipeline = GEDIPipeline(
    out_directory = example_path,
    product = 'GEDI02_A',
    version = '002',
    date_start = date_start,
    date_end = date_end,
    roi = roi,
    beams = None,
    sds = None,
    persist_login = False
)

To run the entire pipeline, we simply call `run_pipeline()` on our pipeline instance.

In [None]:
pipeline.run_pipeline()

---

This notebook showed the basic usage of the Pipeline. Each module can also be used separately for specific purposes.
If you have any questions, open an issue on the GitHub repository or contact us at leonel.corado@uevora.pt.