# How to: Find and Access GEDI Data


This notebook provides a guide to finding and accessing Level 1 (L1) and Level 2 (L2) Global Ecosystem Dynamics Investigation (GEDI) Version 2 (V2) data, which provide high-resolution laser ranging of Earth’s forests and topography from the International Space Station (ISS). Currently, there are two methods of finding and accessing L1 and L2 GEDI V2 data products:

1. [NASA's Earthdata Search](https://search.earthdata.nasa.gov/search)
2. [NASA's CMR API](https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html)

This notebook explains how to programmatically find and access L1 and L2 GEDI V2 data stored in Earthdata Cloud. While you can work directly with NASA's CMR API, the [`earthaccess`](https://github.com/nsidc/earthaccess) Python library used in this notebook provides a user-friendly way to log in, search, and download or stream NASA Earth science data available in the [Common Metadata Repository (CMR)](https://www.earthdata.nasa.gov/eosdis/science-system-description/eosdis-components/cmr). 

Let's start with importing the required packages.

In [None]:
import os, json
import earthaccess
import pandas as pd
import geopandas as gp
import h5py
from pprint import pprint
from shapely import Polygon
from shapely.geometry import Point
from shapely.geometry.polygon import orient
from datetime import datetime

os.chdir('../..')

## Authentication

To access or download NASA Earth data, you need a .netrc file containing your NASA Earthdata Login information. You can create the .netrc file manually, but the `earthaccess` package makes authentication easier. The `earthaccess.login()` function authenticates with the NASA Earthdata Login credentials stored in a .netrc file. If the file does not already exist, the function prompts you for your NASA Earthdata username and password, creates the .netrc file, and then uses your account information for authentication. 



In [None]:
earthaccess.login(persist=True)
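
If you prefer to create the .netrc file yourself, the sketch below writes a single Earthdata Login entry and reads it back with Python's standard `netrc` module. The helper name `write_netrc_entry`, the temporary path, and the placeholder credentials are assumptions made for illustration; `urs.earthdata.nasa.gov` is the Earthdata Login host.

```python
import netrc
import os
import stat
import tempfile

EDL_HOST = "urs.earthdata.nasa.gov"  # Earthdata Login host


def write_netrc_entry(path, username, password, host=EDL_HOST):
    """Write a one-machine .netrc file and restrict it to the owner (hypothetical helper)."""
    with open(path, "w") as fp:
        fp.write(f"machine {host} login {username} password {password}\n")
    # A .netrc file should only be readable and writable by its owner
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return path


# Demonstrate on a temporary file rather than the real ~/.netrc
demo_path = os.path.join(tempfile.mkdtemp(), ".netrc")
write_netrc_entry(demo_path, "my_username", "my_password")

# Read the entry back to confirm that it parses
login, _, password = netrc.netrc(demo_path).authenticators(EDL_HOST)
print(login)
```

To use this for real, point `path` at the `.netrc` file in your home directory and supply your own credentials.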

## Search for GEDI Collections

GEDI Level 1 and Level 2 data products are hosted by the Land Processes Distributed Active Archive Center (LP DAAC), while GEDI L3 and L4 are distributed by the Oak Ridge National Laboratory (ORNL) DAAC. In this example, we will use the cloud-hosted [GEDI L2B Canopy Cover and Vertical Profile Metrics Data Global Footprint Level (GEDI02_B)](https://lpdaac.usgs.gov/products/gedi02_bv002/) to find data, but the same routine can be used to access other products. To find the data, we will use the `earthaccess` Python library. `earthaccess` searches [NASA's Common Metadata Repository (CMR)](https://cmr.earthdata.nasa.gov/search), a metadata system that catalogs Earth science data and associated metadata records. The `collection_query` function from `earthaccess` is used to search for NASA data collections. Various query parameters can be used to search collections and granules by the attributes associated with them in the metadata. More details can be found [here](https://github.com/nsidc/earthaccess/blob/main/notebooks/Demo.ipynb). 
Below, the CMR catalog is searched for collections that match the keyword `gedi`, are managed by the `LPCLOUD` provider, and have a `version` number of `002`. The returned response can be used to retrieve the concept-id for each dataset. 

In [None]:
collections = earthaccess.collection_query().keyword('gedi').version('002').provider('LPCLOUD').get()
pprint(collections[0].summary())
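
For comparison, the same search can be expressed as a direct CMR API request. The sketch below only builds the request URL with the standard library; the helper name `build_collection_query` and the `page_size` value are illustrative assumptions, while `keyword`, `version`, and `provider` mirror the filters used above.

```python
from urllib.parse import urlencode

CMR_COLLECTIONS_URL = "https://cmr.earthdata.nasa.gov/search/collections.json"


def build_collection_query(keyword, version, provider, page_size=50):
    """Return a CMR collection-search URL for the given filters (hypothetical helper)."""
    query = {
        "keyword": keyword,
        "version": version,
        "provider": provider,
        "page_size": page_size,
    }
    return f"{CMR_COLLECTIONS_URL}?{urlencode(query)}"


url = build_collection_query("gedi", "002", "LPCLOUD")
print(url)
# The URL can then be requested with any HTTP client
```

The JSON response lists the matching collections, including the concept ID of each one.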
    

## Search for GEDI Granules

A collection's `concept-id` is needed to search for data. Below, a dictionary is created to store the concept IDs of the GEDI V2 collections distributed by LP DAAC, and `GEDI02_B` is selected. 

In [None]:
gedi_collectionIDs = {}
for c in collections:
    gedi_collectionIDs[c.summary()['short-name']] = c.summary()['concept-id']
gedi_collectionIDs

In [None]:
gedi_products =  ['GEDI02_B']   #['GEDI01_B', 'GEDI02_B', 'GEDI02_A']

conceptID = [gedi_collectionIDs[g] for g in gedi_products]
conceptID

Next, define a temporal range for the query. 
Please note that GEDI was temporarily placed into hibernation in March 2023, upon completion of its first mission, which began in December 2018. 
The instrument returned to its original location on the ISS on April 22, 2024, and has been collecting data since April 23, 2024, but these data have not been publicly distributed yet. More details can be found [here](https://lpdaac.usgs.gov/news/nasa-announces-pause-in-gedi-mission/).
Data are available for the first part of the mission from `2019-04-18` to `2023-03-16`, and newly collected data will become available after validation is complete. 

In [None]:
tempRange = ('2022-04-01', '2022-05-31') 
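
Because data are only distributed for part of the mission, it can help to check a temporal range before submitting a query. The following is a minimal sketch that uses the availability dates stated above; the helper name `check_temporal_range` is an assumption made for illustration.

```python
from datetime import date

# Availability window stated above for the first part of the mission
MISSION_START = date(2019, 4, 18)
MISSION_END = date(2023, 3, 16)


def check_temporal_range(temp_range):
    """Return True if the (start, end) ISO date strings fall inside the availability window."""
    start, end = (date.fromisoformat(d) for d in temp_range)
    if start > end:
        raise ValueError("start date must not be after end date")
    return MISSION_START <= start and end <= MISSION_END


print(check_temporal_range(('2022-04-01', '2022-05-31')))
```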

A GeoJSON file is used to define the spatial region of interest (ROI). For this example, the ROI is an area in Sequoia National Forest, CA.

In [None]:
polygon = gp.read_file('data/sequoia.geojson')
polygon['geometry'][0]


Next, submit the query using the `search_data` function. 

In [None]:
params = {
    "concept_id" : conceptID,
    "temporal": tempRange,
    "polygon": list(polygon['geometry'][0].exterior.coords),
    # bounding_box = bbx,
    "count": 200
}
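
CMR expects polygon points in counter-clockwise order, with the last point repeating the first. The sketch below checks and fixes both conditions in plain Python and also derives a bounding box that could be used in place of the polygon. The helper names and the square are made up for illustration (it is not the Sequoia ROI), and the code assumes two-dimensional `(lon, lat)` coordinates. The `orient` function imported from `shapely` at the top of this notebook can enforce the same ordering on a Polygon object with `orient(geometry, sign=1.0)`.

```python
def signed_area(coords):
    """Shoelace formula: positive for counter-clockwise rings, negative for clockwise rings."""
    return 0.5 * sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(coords, coords[1:])
    )


def prepare_ring(coords):
    """Close the ring if needed and return it in counter-clockwise order."""
    ring = [tuple(c) for c in coords]
    if ring[0] != ring[-1]:
        ring.append(ring[0])
    if signed_area(ring) < 0:
        ring.reverse()
    return ring


def bounding_box(coords):
    """Return (min_lon, min_lat, max_lon, max_lat) for the ring."""
    lons, lats = zip(*coords)
    return (min(lons), min(lats), max(lons), max(lats))


# Made-up clockwise square given as (lon, lat) pairs
clockwise = [(-119.0, 36.0), (-119.0, 37.0), (-118.0, 37.0), (-118.0, 36.0), (-119.0, 36.0)]
ring = prepare_ring(clockwise)
print(signed_area(ring) > 0)
print(bounding_box(ring))
```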

In [None]:
results = earthaccess.search_data(**params)

In [None]:
results[0]

## Accessing GEDI Data
There are two options for accessing NASA Earth science data stored in the [Earthdata Cloud](https://www.earthdata.nasa.gov/technology/cloud-computing). You can download the data using the `HTTPS` links, create your subset, and then work with the data locally. The other option is to load the data into memory and save only a subset. Loading data into memory works both locally (using `HTTPS` links) and in the cloud (using `S3` links). 

If you are working in the Amazon Web Services (AWS) Region us-west-2, you can access and work with the data in a cloud-based environment using the `S3` links and skip the download step. This method is called “Direct Cloud Access” or “Direct Access”. Please note that direct access using the `S3` links is only possible from within us-west-2. If you are working locally, you can still load data using the `HTTPS` links, but the process could be slower. 


### Option 1: Downloading GEDI data using `HTTPS` links 
Below, the `HTTPS` links are printed, but the data can also be downloaded directly using the `download` function from the `earthaccess` package. 

In [None]:
data_links = [granule.data_links(access="external") for granule in results]
data_links
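
Each granule returns its own list of links, so the result above is a list of lists. A small sketch of flattening it into a single list of HDF5 links follows; the URLs are made-up placeholders, and `nested_links` stands in for the `data_links` variable above.

```python
from itertools import chain

# Hypothetical nested result, shaped like one list of links per granule
nested_links = [
    ["https://example.com/GEDI02_B_granule_1.h5"],
    ["https://example.com/GEDI02_B_granule_2.h5", "https://example.com/GEDI02_B_granule_2.h5.xml"],
]

# Flatten one level and keep only the HDF5 files
h5_links = [url for url in chain.from_iterable(nested_links) if url.endswith(".h5")]
print(h5_links)
```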

Below, the first two granules in the results are downloaded. To download all of them, replace `results[0:2]` with `results`. 

In [None]:
# # Only download the first 2 granules
# downloaded_files = earthaccess.download(
#     results[0:2],
#     local_path='data/',
# )

Once your data is downloaded, you can use the **GEDI Subsetter** available in the [GEDI-Data-Resources repository](https://github.com/nasa/GEDI-Data-Resources) to subset GEDI sub-orbit granules to your spatial bounds and layers of interest. It is easiest to [clone the GEDI-Data-Resources repository](https://github.com/nasa/GEDI-Data-Resources?tab=readme-ov-file#getting-started) and run this notebook from it. If you have not cloned it, adjust the path below to point to where `GEDI_Subsetter.py` is stored. 


In [None]:
# !python python/scripts/GEDI_Subsetter/GEDI_Subsetter.py --dir data --roi data/sequoia.geojson --beams BEAM0101 --sds '/beam,/quality_flag,/rh,/pai,/pai_z,/pavd_z,/rh100'

As a final step, you can clean up and delete the downloaded source files to free your local space. 

In [None]:
# downloaded = [i for i in os.listdir('data') if i.endswith('.h5')]
# for file in downloaded: 
#     os.remove(f'data/{file}') 
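
The same clean-up can be written with `pathlib`. The sketch below runs against a throwaway directory so that nothing in the real `data` folder is touched; the helper name `remove_h5_files` and the file names are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def remove_h5_files(folder):
    """Delete every .h5 file directly inside folder and return the removed file names."""
    removed = []
    for path in sorted(Path(folder).glob("*.h5")):
        path.unlink()
        removed.append(path.name)
    return removed


# Demonstrate on a temporary directory with empty placeholder files
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "granule_a.h5").touch()
(demo_dir / "granule_b.h5").touch()
(demo_dir / "sequoia.geojson").touch()

print(remove_h5_files(demo_dir))
print(sorted(p.name for p in demo_dir.iterdir()))
```

Only files ending in `.h5` are removed, so the GeoJSON file in the folder is left in place.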

### Option 2: Accessing GEDI data stored in Earthdata Cloud

The other option is to load GEDI data into memory and save only a subset of the data locally. GEDI V002 data are stored in Earthdata Cloud, which makes it possible to access them without downloading the files first. The `open` function from the `earthaccess` package provides a straightforward way to do this: it returns file-like objects for the granules in our results and automatically handles the authentication and configuration needed when working locally or in the cloud. 


In [None]:
files = earthaccess.open(results)

The `h5py` package is used to read the GEDI HDF5 files.

In [None]:
gedi_ds = h5py.File(files[0],'r')
print(gedi_ds.keys())


In [None]:
gedi_ds['METADATA']['DatasetIdentification'].attrs['shortName']

The available layers (`variables`) and datasets (`datasets`) in a `GEDI_L2B` granule can be accessed (see the commented cell below), but it will take longer for this cell to run. That is why the available GEDI datasets are saved into a JSON file (`GEDI_Datasets.json`) stored in the `data` folder and will be used here. 

To learn more about the available layers, you can view the GEDI Dictionaries provided in [GEDI products' DOI Landing pages](https://lpdaac.usgs.gov/product_search/?query=gedi&status=Operational&view=cards&sort=title). 

In [None]:
# variables = []
# gedi_ds.visit(variables.append)

# datasets = [v.split('/', 1)[-1] for v in variables if isinstance(gedi_ds[v], h5py.Dataset)]
# list(set(datasets))
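
The idea behind the commented cell can be tried quickly on a small file. The sketch below builds a tiny in-memory HDF5 file with a made-up, GEDI-like layout (it is not a real granule) and lists its datasets with `visititems`, which passes each object to the callback along with its name.

```python
import h5py
import numpy as np

dataset_paths = []


def collect(name, obj):
    # visititems calls this for every group and dataset; keep datasets only
    if isinstance(obj, h5py.Dataset):
        dataset_paths.append(name)


# The "core" driver with backing_store=False keeps the file in memory only
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("BEAM0000/pai", data=np.zeros(3))
    f.create_dataset("BEAM0000/geolocation/lat_lowestmode", data=np.zeros(3))
    f.create_dataset("BEAM0101/pai", data=np.zeros(3))
    f.create_group("METADATA")
    f.visititems(collect)

# Drop the leading beam group so layers shared by several beams collapse to one name
layers = sorted({p.split("/", 1)[-1] for p in dataset_paths})
print(layers)
```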

In [None]:
del files, gedi_ds

In [None]:
with open('data/GEDI_Datasets.json', 'r') as fp:
    gedi_var = json.load(fp)

gedi_var.keys()

View the first 20 layers available in the `GEDI_L2B` collection.

In [None]:
gedi_var['GEDI_L2B'][0:20]

First, define a subset of the datasets you are interested in as a list. If you want to keep all the available datasets, use `gedi_var['GEDI_L2B']` as the list. In this example, a subset list is defined for the `GEDI_L2B` product only, but you can subset more L1 and L2 GEDI products here if your query includes more than one product. To adjust the code, define the subset in a separate list for each product (`subset_L2A`, `subset_L2B`, and `subset_L1B`) and create a dictionary of all products and their subsets of layers (see the commented cell below). 

**Note that the `lat_lowestmode` and `lon_lowestmode` for both GEDI L2A and L2B products are used as reference Latitude and Longitude for each shot. For GEDI L1B, `latitude_bin0` and `longitude_bin0` can be used as reference Latitude and Longitude for each shot. The reference Latitude and Longitude for each shot in addition to `beam` and `shot_number` are included in the subset outputs by default.**


In [None]:
subset_L2B = ['geolocation/degrade_flag', 'geolocation/digital_elevation_model', 'geolocation/elev_lowestmode', 'lat_highestreturn', 'geolocation/lon_highestreturn', 'geolocation/elev_highestreturn', 'l2b_quality_flag', 'rh100', 'pai', 'pai_z', 'pavd_z']

subset_data = {'GEDI_L2B': subset_L2B }


In [None]:
# # example of subsetting data for all three GEDI products distributed by LP DAAC
# subset_L2A = gedi_var['GEDI_L2A'] 
# subset_L2B = gedi_var['GEDI_L2B'] 
# subset_L1B = gedi_var['GEDI_L1B'] 

# subset_data = {'GEDI_L2B': subset_L2B, 'GEDI_L2A': subset_L2A, 'GEDI_L1B': subset_L1B }               


Next, the layers in the subset list above are compared with the layers stored in `GEDI_Datasets.json`. This step verifies that the layers are valid and removes those that are not. It also creates the full path for a layer if the user did not provide it. For example, the dataset 'lat_highestreturn' is in our subset list, but its full path is 'geolocation/lat_highestreturn'. 

In [None]:
default = {'GEDI_L1B': ['shot_number', 'beam', 'geolocation/latitude_bin0', 'geolocation/longitude_bin0'],
           'GEDI_L2A': ['shot_number', 'beam', 'lat_lowestmode','lon_lowestmode'],
           'GEDI_L2B': ['geolocation/shot_number', 'beam', 'geolocation/lat_lowestmode','geolocation/lon_lowestmode']}

subset_var = {}

for p in subset_data:
    # Combine the requested layers with the default layers, without duplicates
    subset = []
    for d in subset_data[p] + default[p]:
        if d not in subset:
            subset.append(d)

    datasets_p = []
    for s in subset:
        # Find the layers of this product whose path ends with the requested name
        my_var = [v for v in gedi_var[p] if v.endswith(s)]
        if len(my_var) == 1:
            datasets_p.append(my_var[0])
        elif len(my_var) > 1:
            # Several layers share this suffix: keep those whose path also starts with the name
            my_var = [v for v in my_var if v.startswith(s)]
            for layer in my_var:
                if layer not in datasets_p:
                    datasets_p.append(layer)

    subset_var[p] = datasets_p


In [None]:
subset_var
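
The matching rule can be seen more easily on a small example. The sketch below follows the same idea (match by suffix, then narrow by prefix when several layers match) using a made-up excerpt of a layer list; the helper name `resolve_layers` is an assumption and is not part of the notebook's code.

```python
def resolve_layers(requested, available):
    """Map requested layer names to full paths found in `available`, skipping unknown names."""
    resolved = []
    for name in requested:
        matches = [path for path in available if path.endswith(name)]
        if len(matches) > 1:
            # Several layers share the suffix: keep those that also start with the name
            matches = [path for path in matches if path.startswith(name)]
        for path in matches:
            if path not in resolved:
                resolved.append(path)
    return resolved


# Made-up excerpt of a layer list, for illustration only
available = ['geolocation/lat_highestreturn', 'geolocation/lon_highestreturn', 'pai', 'pai_z', 'rh100']
requested = ['lat_highestreturn', 'pai', 'not_a_layer', 'rh100']
print(resolve_layers(requested, available))
```

The short name `lat_highestreturn` is expanded to its full path, and the unknown name is dropped.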

In addition to layer subsetting, you can subset layers using specific beams. For instance, you can only select Full Power beams ('BEAM0101', 'BEAM0110', 'BEAM1000', 'BEAM1011').

In [None]:
beams = ['BEAM0000', 'BEAM0001', 'BEAM0010', 'BEAM0011', 'BEAM0101', 'BEAM0110', 'BEAM1000', 'BEAM1011'] # ['BEAM0101', 'BEAM0110', 'BEAM1000', 'BEAM1011']
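
If you switch between beam groups often, a small helper can make the choice explicit. This is a sketch only: `select_beams` is a hypothetical helper, and the beams that are not Full Power beams are treated here as the coverage group.

```python
ALL_BEAMS = ['BEAM0000', 'BEAM0001', 'BEAM0010', 'BEAM0011',
             'BEAM0101', 'BEAM0110', 'BEAM1000', 'BEAM1011']
FULL_POWER_BEAMS = {'BEAM0101', 'BEAM0110', 'BEAM1000', 'BEAM1011'}


def select_beams(kind='all'):
    """Return beam names for kind 'all', 'full_power', or 'coverage' (hypothetical helper)."""
    if kind == 'all':
        return list(ALL_BEAMS)
    if kind == 'full_power':
        return [b for b in ALL_BEAMS if b in FULL_POWER_BEAMS]
    if kind == 'coverage':
        return [b for b in ALL_BEAMS if b not in FULL_POWER_BEAMS]
    raise ValueError(f"unknown beam kind: {kind!r}")


print(select_beams('full_power'))
```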


Below, functions are defined to subset the GEDI granules using the beams, datasets, and area of interest. 

In [None]:
def gedi_to_dataframe(granule, beams, variables):
    """
    Read a GEDI HDF5 granule and return the requested beams and variables as a DataFrame.
    `variables` is a dictionary that maps each product short name to its list of datasets.
    All column names are taken from the HDF5 source file.
    """
    # Open the granule as a file-like object
    ds = earthaccess.open([granule])[0]
    beam_frames = []

    with h5py.File(ds, 'r') as gedi_ds:
        # Identify the data product, the file name, and the acquisition date
        product = gedi_ds['METADATA']['DatasetIdentification'].attrs['shortName']
        print(product)
        fileName = gedi_ds['METADATA']['DatasetIdentification'].attrs['fileName']
        date = datetime.strptime(fileName.split('_')[2], '%Y%j%H%M%S').strftime('%Y-%m-%d %H:%M:%S')

        for b in beams:
            # Read the requested datasets for this beam
            data_dic = {v: gedi_ds[f'{b}/{v}'][()].tolist() for v in variables[product]}
            df_b = pd.DataFrame(data_dic)

            # Add product, beam, file name, and date columns for this beam only
            df_b.insert(0, 'product', product)
            df_b.insert(1, 'Beam', b)
            df_b.insert(2, 'fileName', fileName)
            df_b.insert(3, 'date', date)
            beam_frames.append(df_b)

    # Combine all beams; fall back to an empty DataFrame if no beams were requested
    if beam_frames:
        df_beam = pd.concat(beam_frames, ignore_index=True)
    else:
        df_beam = pd.DataFrame(columns=variables[product])

    # Rename the latitude and longitude columns to simplify the DataFrame
    df_beam = df_beam.rename(columns={'geolocation/lat_lowestmode': 'lat', 'geolocation/lon_lowestmode': 'lon',
                                      'geolocation/latitude_bin0': 'lat_bin0', 'geolocation/longitude_bin0': 'lon_bin0',
                                      'geolocation/shot_number': 'shot_number',
                                      'lat_lowestmode': 'lat', 'lon_lowestmode': 'lon'})

    return product, df_beam


def clip_gedi(dataframe, geojson):
    """
    Convert a DataFrame of GEDI shots to a GeoDataFrame and keep the shots that fall
    within the first feature of the GeoJSON region of interest.
    """
    # Read the GeoJSON and use its first feature as the region of interest
    ROI = gp.read_file(geojson).set_crs('EPSG:4326', allow_override=True)
    roi_geom = ROI.geometry.iloc[0]

    # L2A and L2B provide lat/lon (lowest mode); L1B provides lat_bin0/lon_bin0
    if 'lon' in dataframe.columns:
        points = gp.points_from_xy(dataframe['lon'], dataframe['lat'])
    else:
        points = gp.points_from_xy(dataframe['lon_bin0'], dataframe['lat_bin0'])
    dataframe = gp.GeoDataFrame(dataframe, geometry=points, crs='EPSG:4326')

    # Keep only the shots located inside the region of interest
    DF = dataframe[dataframe.within(roi_geom)].reset_index(drop=True)
    return DF
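
The date handling in `gedi_to_dataframe` relies on the third underscore-separated field of the file name, which holds the year, day of year, hour, minute, and second. A small sketch of that parsing step follows; the file name is a made-up example of the pattern, and `acquisition_time` is a hypothetical helper.

```python
from datetime import datetime


def acquisition_time(file_name):
    """Parse the timestamp stored in the third underscore-separated field of a file name."""
    stamp = file_name.split('_')[2]  # year, day of year, hour, minute, second
    return datetime.strptime(stamp, '%Y%j%H%M%S')


# Hypothetical file name that follows the pattern parsed above
name = 'GEDI02_B_2022091123456_O00000_01_T00000_02_003_01_V002.h5'
print(acquisition_time(name).strftime('%Y-%m-%d %H:%M:%S'))
```

Here `%j` reads the three-digit day of year, so day `091` of 2022 becomes April 1.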

A separate empty DataFrame is created for each product. First, each source granule is accessed and a subset of the data by layer is created. Next, the spatial subset is applied and the output GeoDataFrame is returned. Finally, the subsets for all the source files are concatenated into a single GeoDataFrame. 

In [None]:
l2a_df = pd.DataFrame()
l2b_df = pd.DataFrame()
l1b_df = pd.DataFrame()
num = 0
for granule in results:
    # count the processed granules
    num += 1
    # print(granule)
    # subset data by beam and layer
    prod, subset_df = gedi_to_dataframe(granule, beams, subset_var)
    # clip to ROI
    subset_df_clip = clip_gedi(subset_df, 'data/sequoia.geojson')
    if 'L2A' in prod:
        l2a_df = gp.GeoDataFrame(pd.concat([l2a_df, subset_df_clip]))
    elif 'L2B' in prod:
        l2b_df = gp.GeoDataFrame(pd.concat([l2b_df, subset_df_clip]))
    elif 'L1B' in prod:
        l1b_df = gp.GeoDataFrame(pd.concat([l1b_df, subset_df_clip]))

    del subset_df_clip, subset_df, prod

# Reset the indices
l1b_df = l1b_df.reset_index(drop=True)
l2a_df = l2a_df.reset_index(drop=True)
l2b_df = l2b_df.reset_index(drop=True)
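
When many granules are processed, an alternative to concatenating inside the loop is to collect the per-granule frames in a dictionary keyed by product and concatenate once at the end. The sketch below illustrates that pattern with made-up data standing in for the per-granule outputs; it is an alternative, not the approach used in the cell above.

```python
import pandas as pd

# Made-up stand-ins for the (product, dataframe) pairs produced per granule
granule_outputs = [
    ('GEDI_L2B', pd.DataFrame({'shot_number': [1, 2], 'rh100': [10, 20]})),
    ('GEDI_L2B', pd.DataFrame({'shot_number': [3], 'rh100': [30]})),
    ('GEDI_L2A', pd.DataFrame({'shot_number': [4], 'rh100': [40]})),
]

# Group the frames by product
frames = {}
for product, df in granule_outputs:
    frames.setdefault(product, []).append(df)

# Concatenate once per product and renumber the rows
combined = {p: pd.concat(dfs, ignore_index=True) for p, dfs in frames.items()}
print({p: len(df) for p, df in combined.items()})
```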

In [None]:
l2b_df

Data types are adjusted to save the GeoJSON file successfully. In this example, `pai_z` and `pavd_z` data types are updated to string. 

In [None]:
for c in ['pai_z', 'pavd_z']:
    # Convert each row's profile to its own string
    l2b_df[c] = l2b_df[c].astype(str)
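
Whichever conversion is used, each row needs its own string. If the values should remain machine-readable after export, one option is to encode each row as JSON text. The sketch below uses made-up rows with a list-valued column that is similar in shape to `pai_z`.

```python
import json
import pandas as pd

# Made-up rows with a list-valued profile column
df = pd.DataFrame({'shot_number': [1, 2], 'pai_z': [[0.1, 0.2], [0.3, 0.4]]})

# Encode each row's list as its own JSON string
df['pai_z'] = df['pai_z'].apply(json.dumps)
print(df['pai_z'].tolist())

# The original values can be recovered later with json.loads
restored = df['pai_z'].apply(json.loads)
print(restored.iloc[1])
```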


Finally, export the subsets as GeoJSON files using the `GeoDataFrame.to_file` function from the `geopandas` package.

In [None]:
from datetime import date
today = date.today()

if len(l1b_df) != 0:
    l1b_df.to_file(f'data/GEDI_L1B_{today}.geojson', driver='GeoJSON') 

if len(l2a_df) != 0:
    l2a_df.to_file(f'data/GEDI_L2A_{today}.geojson', driver='GeoJSON') 

if len(l2b_df) != 0:
    l2b_df.to_file(f'data/GEDI_L2B_{today}.geojson', driver='GeoJSON')

The GeoJSON files can be viewed in QGIS or programmatically. Below, a simple map is created to visualize the spatial coverage of the data over our ROI.

In [None]:
from geoviews import opts, tile_sources as gvts
import geoviews
geoviews.extension('bokeh')

gvts.EsriImagery * geoviews.Points(l2b_df, vdims=['date']).options(tools=['hover'], 
                                                                   height=900, width=900, size=1, color='yellow', 
                                                                   fontsize={'xticks': 10, 'yticks': 10, 'xlabel':16, 'ylabel': 16})


In [None]:
l2b_df


## Contact Info:  

Email: LPDAAC@usgs.gov  
Voice: +1-866-573-3222  
Organization: Land Processes Distributed Active Archive Center (LP DAAC)¹  
Website: <https://lpdaac.usgs.gov/>  
Date last modified: 02-20-2024  

¹Work performed under USGS contract G15PD00467 for NASA contract NNG14HH33I.  