# H5XRAY

How to generate X-ray plots of HDF5 files that indicate cloud friendliness, measured as the number of GET requests needed to read the file in full.

__Jonathan Markel__  
3D Geospatial Laboratory  
The University of Texas at Austin  
Last Updated: 11/06/2023

#### [Twitter](https://twitter.com/jonm3d) | [GitHub](https://github.com/jonm3d) | [Website](http://j3d.space) | [GoogleScholar](https://scholar.google.com/citations?user=KwxwFgYAAAAJ&hl=en) | [LinkedIn](https://www.linkedin.com/in/j-markel/) 

In [None]:
from h5xray import h5xray

In [None]:
input_file = "data/ATL08_icesat2.h5"


In [None]:
# main function for notebook interaction
help(h5xray.analyze)

## Default Usage
At its core, h5xray is meant to quickly visualize and report on the structure of an HDF5 file and the requests needed to read it. The barcode plot below shows one block for each dataset within the file. The width of a block represents the dataset's total size in bytes, and its color indicates how many GET requests are needed to read that data (blue means few). For the same request size and colorbar range, more red means more requests, and therefore a higher cost to read from cloud storage.

In [None]:
h5xray.analyze(input_file) # default usage
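
As a rough intuition for the color scale, the number of requests for a dataset can be approximated by dividing its size by the request size and rounding up. The sketch below assumes each dataset is a single contiguous byte range, which ignores chunking and metadata and is not necessarily how h5xray counts requests. The function `requests_needed` and the example sizes are hypothetical.

```python
import math


def requests_needed(dataset_bytes, request_byte_size):
    """Approximate count of fixed-size GET requests covering one contiguous byte range."""
    if request_byte_size <= 0:
        raise ValueError("request_byte_size must be positive")
    return math.ceil(dataset_bytes / request_byte_size)


# Hypothetical dataset sizes in bytes (not taken from the ATL08 file)
example_sizes = {
    "small": 64 * 1024,
    "medium": 1024**2,
    "large": 25 * 1024**2,
}
request_byte_size = 0.2 * 1024 * 1024  # same request size as the example that saves to disk

for name, size in example_sizes.items():
    print(name, requests_needed(size, request_byte_size))
```

Under this simplified model, a dataset smaller than one request still costs a full request, which is why many small datasets can add up even when each block in the barcode is narrow.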

For more programmatic uses, the report can be silenced and the plot can be saved to disk.

In [None]:
f, report = h5xray.analyze(input_file,
                      request_byte_size=0.2*1024*1024,      # GET request size in bytes (0.2 MB)
                      plotting_options={'output_file':'img/barcode.png'},  # save the plot to disk
                      report_type='str')                    # return the report instead of printing it

# print(report)

## Request Details
It may be helpful to manually control the size of the GET requests when reading in data. Let's see how using larger GET requests changes the number needed to read in all the data, especially for larger datasets. With 0.5 MB requests, the largest datasets need fewer requests than before, and the barcode is lighter and bluer overall.


In [None]:
f, _ = h5xray.analyze(input_file, 
               request_byte_size=0.5*1024*1024,     # 0.5 MB for visualization purposes
               cost_per_request=0.0004e-3,          # cost to read ($0.0004 per 1000 requests for S3)
               plotting_options={'debug':True, # show title, axis labels, colorbar, etc
                                'output_file':'img/request_details.png', # where to write the output image
                                },
               report_type='print')
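
To see why request size matters for cost, the arithmetic can be sketched independently of h5xray. This is a minimal estimate, assuming each dataset is read as one contiguous byte range and that cost scales linearly with the request count. The helper `estimate_read_cost` and the dataset sizes are hypothetical; only the per-request cost is taken from the cell above.

```python
import math


def estimate_read_cost(dataset_sizes, request_byte_size, cost_per_request):
    """Return (total GET requests, estimated cost) for reading every dataset in full."""
    total_requests = sum(
        math.ceil(size / request_byte_size) for size in dataset_sizes
    )
    return total_requests, total_requests * cost_per_request


# Hypothetical dataset sizes in bytes (not taken from the ATL08 file)
sizes = [64 * 1024, 1024**2, 25 * 1024**2]
cost_per_request = 0.0004e-3  # $0.0004 per 1000 requests, as in the cell above

for request_mb in (0.2, 0.5):
    n_requests, cost = estimate_read_cost(
        sizes, request_mb * 1024 * 1024, cost_per_request
    )
    print(f"{request_mb} MB requests: {n_requests} GETs, about ${cost:.7f}")
```

Larger requests reduce the count for big datasets, while small datasets still need at least one request each, so the savings come mostly from the widest blocks in the barcode.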

## Plot Customization
Several plot details will likely need adjusting for different HDF5 files, including the range of the colorbar, the colormap, the title, and the figure size. The font size of the dataset labels and the size threshold (in bytes) above which a dataset is labeled can also be changed to suit smaller or larger files.

In [None]:
import matplotlib.pyplot as plt # for specifying colormap

In [None]:
# try changing these!
plotting_options = {'debug':True, # whether to include the title, colorbar, and labels
                    'cmap': plt.cm.RdYlBu_r, # colormap for the request counts
                    'byte_threshold':10 * 1024**2, # datasets larger than this many bytes get labeled
                    'font_size':9, # font size for dataset labels
                    'figsize':(10, 3),
                    'max_requests': 20, # specify colormap range
                    'title':'DEMO',
                    'output_file':'img/options_all.png'
                   }

h5xray.analyze(input_file, 
               request_byte_size=0.1*1024*1024, # 0.1 MB for small file visualizations
               report=True, plotting_options=plotting_options)
