# H5XRAY

H5XRAY is a visualization and reporting tool for better understanding the structure of an HDF5 file and the requests needed to read it.
It began as a weekend project inspired by the h5cloud project at the 2023 ICESat-2 Hackweek.

__Jonathan Markel__  
3D Geospatial Laboratory  
The University of Texas at Austin  
09/09/2023

#### [Twitter](https://twitter.com/jonm3d) | [GitHub](https://github.com/jonm3d) | [Website](http://j3d.space) | [GoogleScholar](https://scholar.google.com/citations?user=KwxwFgYAAAAJ&hl=en) | [LinkedIn](https://www.linkedin.com/in/j-markel/) 

In [1]:
from h5xray import h5xray
import matplotlib.pyplot as plt
import icepyx as ipx


import os
os.environ['USE_PYGEOS'] = '0'
import geopandas

In a future release, GeoPandas will switch to using Shapely by default. If you are using PyGEOS directly (calling PyGEOS functions on geometries from GeoPandas), this will then stop working and you are encouraged to migrate from PyGEOS to Shapely 2.0 (https://shapely.readthedocs.io/en/latest/migration_pygeos.html).
  import geopandas as gpd


In [2]:
input_file = "data/atl03_4.h5"

In [3]:
# main function for notebook interaction
help(h5xray.analyze)

Help on function analyze in module h5xray.h5xray:

analyze(input_file, request_byte_size=2097152, plotting_options={}, report=True, cost_per_request=0.0004)
    Main function for plotting / reporting details of an HDF5 file.
    
    Args:
        input_file (str): Path to the input HDF5 file.
        request_byte_size (int): The size of each request in bytes. Default is 2MiB (2*1024*1024 bytes).
        plotting_options (dict): A dictionary of plotting options. Refer to the plot_dataframe function for available options.
        report (bool): Whether to print a report about the HDF5 file. Default is True.
        cost_per_request (float): Cost per GET request (default: $0.0004 per request).
    
    Returns:
        None



## ICESat-2 Data on S3
Let's use some sample ICESat-2 H5 files already on S3 (several GB of geolocated photon data). We'll combine several resources to locate ICESat-2 data in the cloud:
- NASA OpenScapes [Data Access Using S3](https://nasa-openscapes.github.io/earthdata-cloud-cookbook/examples/NSIDC/ICESat2-CMR-OnPrem-vs-Cloud.html#data-access-using-aws-s3) Guide 
- [ICESat-2 Cloud Data Access](https://github.com/icesat2py/icepyx/blob/main/doc/source/example_notebooks/IS2_cloud_data_access.ipynb) with icepyx

In [4]:
# !pip3 install https://github.com/nasa/eo-metadata-tools/releases/download/latest-master/eo_metadata_tools_cmr-0.0.1-py3-none-any.whl

In [5]:
import icepyx as ipx
import h5py
import s3fs

# # Earthdata Credentials
# earthdata_uid = 'jonathanmarkel'
# earthdata_pwd = 'update_to_avoid_hardcoding_password'

# Create an icepyx Query Object (this is just to make use of its login capability)
short_name = 'ATL03'
spatial_extent = [-45, 58, -35, 75]  # Dummy values
date_range = ['2019-11-30', '2019-11-30']  # Dummy values

reg = ipx.Query(short_name, spatial_extent, date_range)
# reg.earthdata_login(earthdata_uid, earthdata_pwd, s3token=True)

# Set up S3 Filesystem
s3 = s3fs.S3FileSystem(
    key=reg._s3login_credentials['accessKeyId'],
    secret=reg._s3login_credentials['secretAccessKey'],
    token=reg._s3login_credentials['sessionToken']
)


EARTHDATA_USERNAME and EARTHDATA_PASSWORD are not set in the current environment, try setting them or use a different strategy (netrc, interactive)
You're now authenticated with NASA Earthdata Login
Using token with expiration date: 10/07/2023
Using .netrc file for EDL
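
The login message above names two environment variables, `EARTHDATA_USERNAME` and `EARTHDATA_PASSWORD`. As a minimal sketch of how to avoid hardcoding credentials in a notebook, the snippet below reads them from the environment. The helper `earthdata_credentials_from_env` is hypothetical and is not part of icepyx or h5xray; it only illustrates the lookup.

```python
import os


def earthdata_credentials_from_env(env=os.environ):
    """Return (username, password) from the environment, or None if either is missing.

    Hypothetical helper for illustration; the variable names come from the
    login message printed above.
    """
    uid = env.get("EARTHDATA_USERNAME")
    pwd = env.get("EARTHDATA_PASSWORD")
    if not uid or not pwd:
        return None
    return uid, pwd


creds = earthdata_credentials_from_env()
if creds is None:
    print("Earthdata credentials not set; fall back to .netrc or interactive login")
```

In this notebook the `.netrc` fallback was used, so the helper would have returned `None`.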


In [10]:
h5xray.check_if_aws(verbose=True);

Running on an AWS EC2 instance.
Instance ID: i-0efef6414e518cc33
Instance Type: r5.xlarge
AWS Region: us-west-2


In [7]:
# Specify the S3 URL
s3url = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL03/006/2019/11/30/ATL03_20191130112041_09860505_006_01.h5'

In [8]:
with s3.open(s3url, 'rb') as s3f:
    with h5py.File(s3f, 'r') as f:
        print([key for key in f.keys()])

['METADATA', 'ancillary_data', 'atlas_impulse_response', 'ds_surf_type', 'ds_xyz', 'gt1l', 'gt1r', 'gt2l', 'gt2r', 'gt3l', 'gt3r', 'orbit_info', 'quality_assessment']


In [9]:
# trying s3 utils in h5xray
# h5xray.analyze(s3url) # still broken

Error reading the HDF5 file: [Errno 2] Unable to synchronously open file (unable to open file: name = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL03/006/2019/11/30/ATL03_20191130112041_09860505_006_01.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)


AssertionError: 
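
The error message above suggests that the `s3://` URL was handed to the HDF5 library as if it were a local path. A minimal sketch of how a caller might tell the two cases apart before deciding how to open a file is shown below. `is_s3_url` is a hypothetical helper, not an h5xray function.

```python
from urllib.parse import urlparse


def is_s3_url(path):
    """Return True if path looks like an s3:// URL (hypothetical helper)."""
    return urlparse(path).scheme == "s3"


for path in ("s3://example-bucket/granule.h5", "data/atl03_4.h5"):
    # An S3 URL would need a file-like object (e.g. from s3fs), not a plain path
    kind = "S3 object" if is_s3_url(path) else "local file"
    print(path, "->", kind)
```

An S3 URL could then be opened through `s3.open(...)` and passed to `h5py.File` as a file-like object, as in the earlier cell that listed the file's keys.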

## Default Usage
At its core, h5xray is meant to quickly visualize and report on the structure of an HDF5 file and the requests needed to read it. The barcode plot below shows a block for each dataset within the H5 file. The width of a block represents the dataset's total size in bytes, and its color indicates how many GET requests are needed to read that data (blue means few). For a given request size and colorbar range, more red means more requests, and therefore a higher cost to read from cloud storage.

In [None]:
h5xray.analyze(input_file) # default usage
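
As a rough illustration of what the color encodes, the number of GET requests for a dataset can be estimated by dividing its size by the request size and rounding up. This is a sketch under that assumption only; h5xray's internal calculation may differ (for example, if it accounts for chunk layout). The defaults are taken from the `help` output above, and the 25 MiB dataset size is hypothetical.

```python
def estimate_requests(dataset_bytes, request_byte_size=2 * 1024 * 1024):
    """Estimate GET requests needed to read dataset_bytes (simple ceiling division)."""
    if dataset_bytes <= 0:
        return 0
    return (dataset_bytes + request_byte_size - 1) // request_byte_size


def estimate_cost(n_requests, cost_per_request=0.0004):
    """Estimate the request cost using the same default as h5xray.analyze."""
    return n_requests * cost_per_request


size = 25 * 1024 * 1024  # hypothetical 25 MiB dataset
n = estimate_requests(size)
print(f"{n} requests, estimated cost {estimate_cost(n):.4f}")
```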

For more programmatic uses, the report can be silenced and the plot can be saved to disk.

In [None]:
h5xray.analyze(input_file, report=False, plotting_options={'output_file':'img/barcode.png'}) # simple barcode

## Plot Details
The debug plot option creates more detailed plots, adding the title, colorbar, and labels to identify large datasets.

In [None]:
h5xray.analyze(input_file, report=False, plotting_options={'debug':True, 'output_file':'img/options_labels.png'})


## Request Details
It may be helpful to manually control the size of the GET requests used to read the data. Let's see how larger GET requests (3 MiB here, instead of the default 2 MiB) change the number of requests needed to read all the data, especially for the larger datasets. The largest datasets need fewer requests, and the barcode is lighter / bluer overall.


In [None]:
h5xray.analyze(input_file, request_byte_size=3*1024*1024, plotting_options={'debug':True})
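
The effect of the request size can be sketched with plain arithmetic. The dataset names and sizes below are hypothetical, and the per-dataset ceiling division is an assumption about how requests are counted, not h5xray's confirmed method.

```python
MIB = 1024 * 1024

# Hypothetical dataset sizes in bytes (not taken from any real file)
dataset_sizes = {
    "dataset_a": 40 * MIB,
    "dataset_b": 20 * MIB,
    "dataset_c": 1024,
}


def total_requests(sizes, request_byte_size):
    """Sum the per-dataset request counts, rounding each one up."""
    return sum(-(-size // request_byte_size) for size in sizes.values())


for request_mib in (2, 3):
    n = total_requests(dataset_sizes, request_mib * MIB)
    print(f"{request_mib} MiB requests: {n} GET requests in total")
```

Small datasets still cost one request each regardless of request size, so the savings come almost entirely from the largest datasets.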

## Plot Customization
Minor plot details will likely differ between HDF5 files, including the range of the colorbar, the colormap, the title, and the figure size. The font size of the dataset labels and the threshold (in bytes) required to label a dataset can also be changed for smaller or larger files.

In [None]:
# path to save image
output_file = 'img/options_all.png'
output_file

In [None]:
# try changing these!
plotting_options = {'debug':True, # whether to include the title, colorbar, and labels
                    'cmap': plt.cm.RdYlBu_r, 
                    'byte_threshold':10 * 1024**2, # datasets with more than this get labeled
                    'font_size':9, # font size for dataset labels
                    'figsize':(10, 3),
                    'max_requests': 15, # specify colormap range
                    'title':'DEMO',
                    'output_file':output_file
                   }

h5xray.analyze(input_file, report=True, plotting_options=plotting_options)
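
To make the `byte_threshold` option concrete, the sketch below filters a set of hypothetical dataset sizes with the same 10 MiB threshold used above. It assumes a strict "greater than" comparison, based on the comment in the options dictionary; whether h5xray treats a dataset exactly at the threshold the same way is not confirmed here.

```python
MIB = 1024 ** 2
byte_threshold = 10 * MIB  # same value as in plotting_options above

# Hypothetical dataset sizes in bytes (not taken from any real file)
dataset_sizes = {
    "dataset_a": 48 * MIB,
    "dataset_b": 10 * MIB,
    "dataset_c": 512 * 1024,
}

# Keep only the datasets large enough to receive a label
labeled = [name for name, size in dataset_sizes.items() if size > byte_threshold]
print(labeled)
```

Lowering the threshold labels more datasets, which can help on small files but crowds the plot on large ones.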
