# earthdata-hashdiff

**Contact:** Owen Littlejohns (owen.m.littlejohns@nasa.gov)

This notebook demonstrates typical workflows for using the [earthdata-hashdiff](https://github.com/nasa/earthdata-hashdiff) Python package.

## What is earthdata-hashdiff?

`earthdata-hashdiff` is a Python package that parses Earth science data file formats (HDF-5, netCDF4 and GeoTIFF) and hashes the contents of those files. These hashes are stored in a JSON object, which can be saved to disk. This enables the easy storage of a smaller artefact for tasks such as regression testing, while omitting metadata and data attributes that may change between test executions (such as timestamps in history attributes). The package also allows for comparison between a binary file (HDF-5, netCDF4 or GeoTIFF) and a JSON file containing previously calculated hashes.

## earthdata-hashdiff installation:

`earthdata-hashdiff` can be installed via pip using the standard command: `pip install earthdata-hashdiff`.
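
To confirm that the installation succeeded, the installed version can be queried with the Python standard library. The following is a minimal sketch: the helper name `installed_version` is hypothetical and is not part of `earthdata-hashdiff`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if the package has not been installed:
print(installed_version('earthdata-hashdiff'))
```

Knowing the version is useful later in this notebook, because GeoTIFF support is only available from version 1.1.0 onwards.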

## Running this notebook:

First create a Python environment to allow for isolated installation of Python packages using either pyenv or conda. If using conda, run the following commands from within the `docs` directory:

```
# Create the new conda environment:
conda create --name earthdata-hashdiff-docs python=3.12 -c conda-forge --override-channels -y

# Activate the environment:
conda activate earthdata-hashdiff-docs

# Install necessary packages:
pip install -r requirements.txt
```

## Downloading sample data:

This notebook uses sample data from the Global Precipitation Measurement (GPM) Integrated Multi-satellitE Retrievals for GPM (IMERG) Final Precipitation collection (half-hourly, 0.1 degree spatial resolution). To execute this notebook, use the links below to download two sample files:

* [3B-HHR.MS.MRG.3IMERG.20250331-S220000-E222959.1320.V07B.HDF5](https://data.gesdisc.earthdata.nasa.gov/data/GPM_L3/GPM_3IMERGHH.07/2025/090/3B-HHR.MS.MRG.3IMERG.20250331-S220000-E222959.1320.V07B.HDF5)
* [3B-HHR.MS.MRG.3IMERG.20250331-S223000-E225959.1350.V07B.HDF5](https://data.gesdisc.earthdata.nasa.gov/data/GPM_L3/GPM_3IMERGHH.07/2025/090/3B-HHR.MS.MRG.3IMERG.20250331-S223000-E225959.1350.V07B.HDF5)

Additionally, for GeoTIFF examples, this notebook uses sample data from the ECOsystem Spaceborne Thermal Radiometer Experiment on Space Station (ECOSTRESS) mission. To run examples with GeoTIFFs, please also download the following sample land surface temperature file:

* [ECOv002_L2T_LSTE_40402_005_13TDE_20250821T104117_0713_01_LST.tif](https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_40402_005_13TDE_20250821T104117_0713_01/ECOv002_L2T_LSTE_40402_005_13TDE_20250821T104117_0713_01_LST.tif)

The notebook will assume that these files are present in the `docs` directory:

In [None]:
gpm_3imerghh_granule_one = (
    '3B-HHR.MS.MRG.3IMERG.20250331-S220000-E222959.1320.V07B.HDF5'
)
gpm_3imerghh_granule_two = (
    '3B-HHR.MS.MRG.3IMERG.20250331-S223000-E225959.1350.V07B.HDF5'
)

ecostress_granule = 'ECOv002_L2T_LSTE_40402_005_13TDE_20250821T104117_0713_01_LST.tif'
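
Before continuing, it can help to check that the sample files really are in the working directory. The sketch below uses only the standard library. The helper name `find_missing_files` is hypothetical, and the file names are repeated so that the block runs on its own.

```python
from pathlib import Path


def find_missing_files(file_names, directory='.'):
    """Return the names in `file_names` that do not exist in `directory`."""
    return [
        file_name
        for file_name in file_names
        if not (Path(directory) / file_name).is_file()
    ]


expected_files = [
    '3B-HHR.MS.MRG.3IMERG.20250331-S220000-E222959.1320.V07B.HDF5',
    '3B-HHR.MS.MRG.3IMERG.20250331-S223000-E225959.1350.V07B.HDF5',
    'ECOv002_L2T_LSTE_40402_005_13TDE_20250821T104117_0713_01_LST.tif',
]

missing_files = find_missing_files(expected_files)

if missing_files:
    print('Download these files before continuing:', missing_files)
else:
    print('All sample files are present.')
```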

## Generating hashes:

`earthdata-hashdiff` has two public functions for generating JSON structures containing hashes of variables and groups within an HDF-5 or netCDF4 file: `create_nc4_hash_file` and `create_h5_hash_file`. Both work in an identical way (and are, in fact, aliases to the same underlying function).

To begin, the function must be imported from the package:

In [None]:
from earthdata_hashdiff import create_h5_hash_file

Now a JSON file can be produced. The command below will create a JSON file: `3B-HHR.MS.MRG.3IMERG.20250331-S220000-E222959.1320.V07B.HDF5.json`

In [None]:
create_h5_hash_file(
    gpm_3imerghh_granule_one,
    f'{gpm_3imerghh_granule_one}.json',
)

The content of the output file is displayed below. The JSON file contains a dictionary of key/value pairs, where the keys are full paths within the file to the group or variable being hashed, and the value is the SHA256 hash for that group or variable.

In [None]:
import json

with open(f'{gpm_3imerghh_granule_one}.json') as file_handler:
    gpm_3imerghh_granule_one_hashes = json.load(file_handler)

print(json.dumps(gpm_3imerghh_granule_one_hashes, indent=2))

The information that is considered when generating the hash value is:

* Metadata attributes on the group or variable (excluding `history` and `history_json` metadata attributes).
* Dimensions for that group or variable.
* (Variable only) the shape and elements of the data array.

To demonstrate this, the second GPM_3IMERGHH granule can also be used to generate a JSON file:

In [None]:
create_h5_hash_file(
    gpm_3imerghh_granule_two,
    f'{gpm_3imerghh_granule_two}.json',
)

with open(f'{gpm_3imerghh_granule_two}.json') as file_handler:
    gpm_3imerghh_granule_two_hashes = json.load(file_handler)

print(json.dumps(gpm_3imerghh_granule_two_hashes, indent=2))

Comparisons between the hashes for the two files show that some of them are the same:

* `/Grid`
* `/Grid/Intermediate`
* `/Grid/lat`
* `/Grid/lat_bnds`
* `/Grid/lon`
* `/Grid/lon_bnds`

These groups and variables are the same in both granules. They represent either groups that are primarily present to offer structure to the granule hierarchy, or the whole-Earth spatial dimensions that are the same for all GPM_3IMERGHH granules.

The other variables differ because their arrays contain different values: the two granules cover different half-hour measurement periods.
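
Rather than comparing the two printed dictionaries by eye, the matching and differing keys can be found programmatically. This is a minimal sketch that operates on plain dictionaries: the helper name `split_matching_keys` is hypothetical, and the short placeholder strings stand in for real hash values.

```python
def split_matching_keys(first_hashes, second_hashes):
    """Split the keys shared by two hash dictionaries into same/different."""
    shared_keys = sorted(set(first_hashes) & set(second_hashes))
    matching = [
        key for key in shared_keys if first_hashes[key] == second_hashes[key]
    ]
    differing = [
        key for key in shared_keys if first_hashes[key] != second_hashes[key]
    ]
    return matching, differing


# Illustrative placeholder values, not real hashes from the sample granules:
example_one = {'/Grid': 'aaa', '/Grid/lat': 'bbb', '/Grid/precipitation': 'ccc'}
example_two = {'/Grid': 'aaa', '/Grid/lat': 'bbb', '/Grid/precipitation': 'ddd'}

matching, differing = split_matching_keys(example_one, example_two)
print('Matching:', matching)
print('Differing:', differing)
```

In this notebook, the same helper could be called with `gpm_3imerghh_granule_one_hashes` and `gpm_3imerghh_granule_two_hashes` once both have been loaded.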

### Additional options for generating hashes:

There are two additional kwargs that can be specified when generating hashes from a file:

* `skipped_metadata_attributes` - this is a set of strings that represent the names of metadata attributes to not include in the calculation of hashes (for all groups and variables). By default `history` and `history_json` are already omitted as they may vary due to transformation operations being applied at different times (e.g., in Harmony regression tests). However, some files may have additional metadata attributes due to on-demand processing with values that will vary based on execution times. Those should also be excluded from the generated hash via this kwarg.
* `xarray_kwargs` - `earthdata-hashdiff` uses `xarray` to read HDF-5 and netCDF4 files, and this dictionary allows control of how `xarray` opens a file. For example, whether times, timedeltas, coordinates or CF Conventions are decoded. The choice to decode, or not, these options will lead to different hashes. By default none of these items are decoded.
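
To see why the set of skipped attributes matters, consider the conceptual sketch below. It is not the implementation used by `earthdata-hashdiff`: the function `sketch_hash`, the JSON serialisation and the sample attribute values are all assumptions made for illustration, using only `hashlib` and `json` from the standard library.

```python
import hashlib
import json

DEFAULT_SKIPPED = frozenset({'history', 'history_json'})


def sketch_hash(attributes, data, skipped_attributes=DEFAULT_SKIPPED):
    """Hash data values plus all metadata attributes that are not skipped."""
    retained = {
        name: value
        for name, value in attributes.items()
        if name not in skipped_attributes
    }
    # Sorting keys makes the serialisation independent of insertion order.
    serialised = json.dumps(
        {'attributes': retained, 'data': data}, sort_keys=True
    )
    return hashlib.sha256(serialised.encode('utf-8')).hexdigest()


first_run = {'units': 'mm/hr', 'history': 'Processed 2025-04-01T00:00:00'}
second_run = {'units': 'mm/hr', 'history': 'Processed 2025-04-02T12:30:00'}
data = [0.0, 1.5, 2.25]

# The differing `history` values are ignored, so this prints True:
print(sketch_hash(first_run, data) == sketch_hash(second_run, data))

# Skipping an extra attribute changes the hashed content, so this prints False:
print(
    sketch_hash(first_run, data)
    == sketch_hash(first_run, data, {'history', 'units'})
)
```

The second comparison shows why a reference file and a later comparison need to agree on which attributes are skipped.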

### Example 1: Skipping metadata attributes:

The example below will generate hashes for the first GPM_3IMERGHH granule, skipping the `GridHeader` metadata attribute in groups and variables. This attribute is only present on the `/Grid` group, and so this is the only hash that should be different:

In [None]:
create_h5_hash_file(
    gpm_3imerghh_granule_one,
    f'{gpm_3imerghh_granule_one}.skip.json',
    skipped_metadata_attributes={'GridHeader'},
)

with open(f'{gpm_3imerghh_granule_one}.skip.json') as file_handler:
    gpm_3imerghh_granule_one_skip_hashes = json.load(file_handler)

print(json.dumps(gpm_3imerghh_granule_one_skip_hashes, indent=2))

### Example 2: Changing `xarray` defaults:

This example will tell `xarray` to decode times when it opens the granule. As a result, the `/Grid/time` and `/Grid/time_bnds` variables will have different hashes, as the array elements will now be decoded into datetimes.

In [None]:
create_h5_hash_file(
    gpm_3imerghh_granule_one,
    f'{gpm_3imerghh_granule_one}.decode.json',
    xarray_kwargs={
        'decode_cf': False,
        'decode_coords': False,
        'decode_timedelta': False,
        'decode_times': True,
    },
)

with open(f'{gpm_3imerghh_granule_one}.decode.json') as file_handler:
    gpm_3imerghh_granule_one_decode_hashes = json.load(file_handler)

print(json.dumps(gpm_3imerghh_granule_one_decode_hashes, indent=2))

## Hashing GeoTIFFs:

From version 1.1.0 onwards, `earthdata-hashdiff` can also calculate a hash for a GeoTIFF input. A single hash is generated for the full file, which accounts for:

* The data array shape and elements.
* GeoTIFF-specific metadata tags.

To remain lightweight, `earthdata-hashdiff` uses the `tifffile` package to parse GeoTIFF files, rather than requiring GDAL to be installed in the local environment.

The cell below shows the usage of hashing functionality for a GeoTIFF. Note that this function also has the optional `skipped_metadata_tags` argument, which is analogous to the `skipped_metadata_attributes` for netCDF4 and HDF-5 files.

In [None]:
from earthdata_hashdiff import create_geotiff_hash_file, get_hash_from_geotiff_file

# Create an in-memory dictionary for the GeoTIFF hash value:
geotiff_hash_dictionary = get_hash_from_geotiff_file(ecostress_granule, set())
print(json.dumps(geotiff_hash_dictionary, indent=2))

# Generate the same hash dictionary and write out to a JSON file:
create_geotiff_hash_file(
    ecostress_granule,
    f'{ecostress_granule}.json',
)

## Performing comparisons:

This package also allows the comparison of a netCDF4 or HDF-5 file to a JSON file containing previously calculated hash values. There are two functions to perform this, both with the same arguments: `nc4_matches_reference_hash_file` and `h5_matches_reference_hash_file`.

These functions compare the following:

* That the binary file and the JSON file containing hashes have the same variables and groups.
* That the hash values match for all variables and groups.

In [None]:
from earthdata_hashdiff import h5_matches_reference_hash_file

These functions return a boolean value, and so are easily used in assertions:

In [None]:
assert h5_matches_reference_hash_file(
    gpm_3imerghh_granule_one,
    f'{gpm_3imerghh_granule_one}.json',
), 'Binary file did not match previously generated hashes.'

Comparing the first GPM_3IMERGHH file to the hashes calculated for the second file will fail, so the assertion below raises an `AssertionError`:

In [None]:
assert h5_matches_reference_hash_file(
    gpm_3imerghh_granule_one,
    f'{gpm_3imerghh_granule_two}.json',
), 'Binary file did not match previously generated hashes.'

### Additional options for comparisons:

There are three kwargs that can be used to refine the behaviour of the comparison functionality:

* `skipped_variables_or_groups` - this set of strings instructs the comparison to not compare the hashes for the groups or variables listed. Note, the comparison will still confirm that all the same variables and groups were in the binary file and JSON file.
* `skipped_metadata_attributes` - this is a set of strings representing metadata attributes in the binary file that will not be used in the generation of hashes for variables or groups. If metadata attributes were specified to be skipped using this mechanism when generating the original JSON file used in the comparison, the same metadata attributes should be specified to be skipped in the comparison, too.
* `xarray_kwargs` - as with generation, this is a dictionary of kwargs used by `xarray` when opening the binary file being compared.
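
The behaviour of `skipped_variables_or_groups` described above can be pictured with plain dictionaries. The sketch below is an assumption about the logic, not the package's own code: `sketch_hashes_match` is a hypothetical helper, and the short strings are placeholders for real hash values.

```python
def sketch_hashes_match(
    actual_hashes, reference_hashes, skipped_variables_or_groups=frozenset()
):
    """Compare two hash dictionaries, ignoring hash values for skipped keys."""
    # Both dictionaries must list the same groups and variables, including
    # any whose hash values are skipped in the comparison below.
    if set(actual_hashes) != set(reference_hashes):
        return False

    return all(
        actual_hashes[key] == reference_hashes[key]
        for key in actual_hashes
        if key not in skipped_variables_or_groups
    )


# Illustrative placeholder values, not real hashes:
actual = {'/Grid': 'aaa', '/Grid/time': 'bbb'}
reference = {'/Grid': 'aaa', '/Grid/time': 'zzz'}

# The `/Grid/time` hashes differ, so the default comparison fails:
print(sketch_hashes_match(actual, reference))

# Skipping `/Grid/time` ignores its hash value, so the comparison passes:
print(sketch_hashes_match(actual, reference, {'/Grid/time'}))

# A variable missing from the reference still fails, even when skipped:
print(sketch_hashes_match(actual, {'/Grid': 'aaa'}, {'/Grid/time'}))
```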

### Example 3: skipping variables or groups:

In example 2 for generating a hash file, the `xarray` option to decode times was enabled. This altered the hash values for `/Grid/time` and `/Grid/time_bnds`. As such, a comparison between the hash file from that example and the first GPM_3IMERGHH file, which uses the defaults for comparison, will fail:

In [None]:
assert h5_matches_reference_hash_file(
    gpm_3imerghh_granule_one,
    f'{gpm_3imerghh_granule_one}.decode.json',
), 'Binary file did not match previously generated hashes.'

However, the `/Grid/time` and `/Grid/time_bnds` variables can be omitted from the comparison, and the assertion will pass:

In [None]:
assert h5_matches_reference_hash_file(
    gpm_3imerghh_granule_one,
    f'{gpm_3imerghh_granule_one}.decode.json',
    skipped_variables_or_groups={'/Grid/time', '/Grid/time_bnds'},
), 'Binary file did not match previously generated hashes.'

### Example 4: skipping metadata attributes:

In example 1 for generating a hash file, the `GridHeader` metadata attribute was not included in the generation of hashes for groups and variables. This altered the hash value for `/Grid`. As such, a comparison between the hash file from that example and the first GPM_3IMERGHH file, which uses the defaults for comparison, will fail:

In [None]:
assert h5_matches_reference_hash_file(
    gpm_3imerghh_granule_one,
    f'{gpm_3imerghh_granule_one}.skip.json',
), 'Binary file did not match previously generated hashes.'

However, if the `GridHeader` attribute is omitted from the comparison, the assertion will pass:

In [None]:
assert h5_matches_reference_hash_file(
    gpm_3imerghh_granule_one,
    f'{gpm_3imerghh_granule_one}.skip.json',
    skipped_metadata_attributes={'GridHeader'},
), 'Binary file did not match previously generated hashes.'

## Comparisons with GeoTIFFs

These work in the same way as the comparisons for netCDF4 and HDF-5 files. The cell below will use the previously generated JSON reference file for the ECOSTRESS granule:

In [None]:
from earthdata_hashdiff import geotiff_matches_reference_hash_file

assert geotiff_matches_reference_hash_file(
    ecostress_granule,
    f'{ecostress_granule}.json',
)

## A single comparison entry point

For convenience, you can use the `matches_reference_hash_file` function for all of the file types previously discussed. Each call accepts the paths to the binary file and JSON hash file, along with any optional kwargs relevant to the file type.

In [None]:
from earthdata_hashdiff import matches_reference_hash_file

# GeoTIFF example
assert matches_reference_hash_file(
    ecostress_granule,
    f'{ecostress_granule}.json',
)

# HDF-5 example
assert matches_reference_hash_file(
    gpm_3imerghh_granule_one,
    f'{gpm_3imerghh_granule_one}.json',
), 'Binary file did not match previously generated hashes.'

# HDF-5 example with kwargs
assert matches_reference_hash_file(
    gpm_3imerghh_granule_one,
    f'{gpm_3imerghh_granule_one}.decode.json',
    skipped_variables_or_groups={'/Grid/time', '/Grid/time_bnds'},
), 'Binary file did not match previously generated hashes.'

## Further questions?

Feel free to reach out either:

* Via a [GitHub issue](https://github.com/nasa/earthdata-hashdiff/issues)
* Via email: owen.m.littlejohns@nasa.gov
* Via the NASA Agency Slack (internal developers)