diff --git a/.gitignore b/.gitignore
index 5027036..f2608b8 100644
--- a/.gitignore
+++ b/.gitignore
@@ -56,3 +56,4 @@ dist/*
 # Documentation ancillary files
 docs/*json
 docs/*HDF5
+docs/*tif
diff --git a/README.md b/README.md
index b03ba56..fa78941 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,7 @@ JSON files that contain SHA 256 hash values for all variables and groups in a
 netCDF4 or HDF-5 file can be generated using either the `create_h5_hash_file`
 or `create_nc4_hash_file`.
 
-```
+```python
 from earthdata_hashdiff import create_nc4_hash_file
 
 
@@ -33,6 +33,19 @@ The functions to create the hash files have two additional optional arguments:
   The default value for this kwarg is to turn off all `xarray` decoding for CF
   Conventions, coordinates, times and time deltas.
 
+A similar JSON file can be created for a GeoTIFF file:
+
+```python
+from earthdata_hashdiff import create_geotiff_hash_file
+
+create_geotiff_hash_file('path/to/geotiff/file.tif', 'path/to/output/hash.json')
+```
+
+This function has one additional optional argument:
+
+* `skipped_metadata_tags` - this is a set of strings. When specified, the
+  hashing functionality will not include GeoTIFF metadata tags with that name.
+
 ### Comparisons against reference files
 
 When a JSON file exists with hashed values, it can be used for comparisons. The
@@ -40,7 +53,7 @@ public API provides `h5_matches_reference_hash_file` and
 `nc4_matches_reference_hash_file`, although these both are aliases for the same
 underlying functionality using `xarray`:
 
-```
+```python
 from earthdata_hashdiff import nc4_matches_reference_hash_file
 
 
@@ -68,6 +81,18 @@ The comparison functions have three optional arguments:
   The default value for this kwarg is to turn off all `xarray` decoding for CF
   Conventions, coordinates, times and time deltas.
 
+The same operation can also be performed for a GeoTIFF file in comparison to an
+appropriate JSON reference file:
+
+```python
+from earthdata_hashdiff import geotiff_matches_reference_hash_file
+
+assert geotiff_matches_reference_hash_file(
+    'path/to/geotiff/file.tif',
+    'path/to/json/with/hash.json',
+)
+```
+
 ## Installing
 
 ### Using pip
@@ -102,7 +127,7 @@ also contains an update to the `earthdata_hashdiff.__about__.py` file.
 
 Prerequisites:
 
-  - Python 3.10+, ideally installed in a virtual environment, such as `pyenv`
+  - Python 3.11+, ideally installed in a virtual environment, such as `pyenv`
     or `conda`.
   - A local copy of this repository.
 
diff --git a/docs/Using_earthdata-hashdiff.ipynb b/docs/Using_earthdata-hashdiff.ipynb
index 17d054d..69a8768 100644
--- a/docs/Using_earthdata-hashdiff.ipynb
+++ b/docs/Using_earthdata-hashdiff.ipynb
@@ -13,7 +13,7 @@
     "\n",
     "## What is earthdata-hashdiff?\n",
     "\n",
-    "`earthdata-hashdiff` is a Python package that parses Earth science data file formats (HDF-5 and netCDF4) and hashes the contents of those files. These hashes are stored in a JSON object, which can be saved to disk. This enables the easy storage of a smaller artefact for tasks such as regression testing, while omitting metadata and data attributes that may change between test executions (such as timestamps in history attributes). The package also allows for comparison between a binary file (HDF-5 or netCDF4) and a JSON file containing previously calculated hashes.\n",
+    "`earthdata-hashdiff` is a Python package that parses Earth science data file formats (HDF-5, netCDF4 and GeoTIFF) and hashes the contents of those files. These hashes are stored in a JSON object, which can be saved to disk. This enables the easy storage of a smaller artefact for tasks such as regression testing, while omitting metadata and data attributes that may change between test executions (such as timestamps in history attributes). The package also allows for comparison between a binary file (HDF-5, netCDF4 or GeoTIFF) and a JSON file containing previously calculated hashes.\n",
     "\n",
     "## earthdata-hashdiff installation:\n",
     "\n",
@@ -41,7 +41,11 @@
     "* [3B-HHR.MS.MRG.3IMERG.20250331-S220000-E222959.1320.V07B.HDF5](https://data.gesdisc.earthdata.nasa.gov/data/GPM_L3/GPM_3IMERGHH.07/2025/090/3B-HHR.MS.MRG.3IMERG.20250331-S220000-E222959.1320.V07B.HDF5)\n",
     "* [3B-HHR.MS.MRG.3IMERG.20250331-S223000-E225959.1350.V07B.HDF5](https://data.gesdisc.earthdata.nasa.gov/data/GPM_L3/GPM_3IMERGHH.07/2025/090/3B-HHR.MS.MRG.3IMERG.20250331-S223000-E225959.1350.V07B.HDF5)\n",
     "\n",
-    "The notebook will assume that these two files are present in the `docs` directory:"
+    "Additionally, for GeoTIFF examples, this notebook uses sample data from the ECOsystem Spaceborne Thermal Radiometer Experiment on Space Station (ECOSTRESS) mission. To run examples with GeoTIFFs, please also download the following sample land surface temperature file:\n",
+    "\n",
+    "* [ECOv002_L2T_LSTE_40402_005_13TDE_20250821T104117_0713_01_LST.tif](https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/ECO_L2T_LSTE.002/ECOv002_L2T_LSTE_40402_005_13TDE_20250821T104117_0713_01/ECOv002_L2T_LSTE_40402_005_13TDE_20250821T104117_0713_01_LST.tif)\n",
+    "* \n",
+    "The notebook will assume that these files are present in the `docs` directory:"
    ]
   },
   {
@@ -56,7 +60,9 @@
     ")\n",
     "gpm_3imerghh_granule_two = (\n",
     "    '3B-HHR.MS.MRG.3IMERG.20250331-S223000-E225959.1350.V07B.HDF5'\n",
-    ")"
+    ")\n",
+    "\n",
+    "ecostress_granule = 'ECOv002_L2T_LSTE_40402_005_13TDE_20250821T104117_0713_01_LST.tif'"
    ]
   },
   {
@@ -240,6 +246,43 @@
     "print(json.dumps(gpm_3imerghh_granule_one_decode_hashes, indent=2))"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "a9d1fc61-f2c4-4f8b-b784-e456753d51d8",
+   "metadata": {},
+   "source": [
+    "## Hashing GeoTIFFs:\n",
+    "\n",
+    "From version 1.1.0 onwards, `earthdata-hashdiff` can also calculate a hash for a GeoTIFF input. A single hash is generated for the full file, which accounts for:\n",
+    "\n",
+    "* The data array shape and elements.\n",
+    "* GeoTIFF-specific metadata tags.\n",
+    "\n",
+    "To remain lightweight, `earthdata-hashdiff` uses the [tifffile package]() to parse GeoTIFF files, rather than requiring GDAL to be installed in the local environment.\n",
+    "\n",
+    "The cell below shows the usage of hashing functionality for a GeoTIFF. Note that this function also has the optional `skipped_metadata_tags` argument, which is analogous to the `skipped_metadata_attributes` for netCDF4 and HDF-5 files."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "655e3e8e-622f-4923-b55e-7f1237382b03",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from earthdata_hashdiff import create_geotiff_hash_file, get_hash_from_geotiff_file\n",
+    "\n",
+    "# Create an in-memory dictionary for the GeoTIFF hash value:\n",
+    "geotiff_hash_dictionary = get_hash_from_geotiff_file(ecostress_granule, set())\n",
+    "print(json.dumps(geotiff_hash_dictionary, indent=2))\n",
+    "\n",
+    "# Generate the same hash dictionary and write out to a JSON file:\n",
+    "create_geotiff_hash_file(\n",
+    "    ecostress_granule,\n",
+    "    f'{ecostress_granule}.json',\n",
+    ")"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "170873bf-39f2-4907-a9c9-78fe49dee330",
@@ -405,6 +448,31 @@
     "), 'Binary file did not match previously generated hashes.'"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "e882be95-bc98-4aeb-82cf-68563e949973",
+   "metadata": {},
+   "source": [
+    "## Comparisons with GeoTIFFs\n",
+    "\n",
+    "These work in the same way as the comparisons for netCDF4 and HDF-5 files. The cell below will use the previously generated JSON reference file for the ECOSTRESS granule:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "48971368-e17a-4267-be2d-f8a2f7680a83",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from earthdata_hashdiff import geotiff_matches_reference_hash_file\n",
+    "\n",
+    "assert geotiff_matches_reference_hash_file(\n",
+    "    ecostress_granule,\n",
+    "    f'{ecostress_granule}.json',\n",
+    ")"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "c61a5f43-2bf2-42f6-8c39-abeef381816f",
diff --git a/docs/requirements.txt b/docs/requirements.txt
index 678c79f..9b4729c 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -1,4 +1,4 @@
 # These packages are required to run the documentation Jupyter notebook.
-earthdata-hashdiff ~= 1.0.1
+earthdata-hashdiff ~= 1.1.0
 notebook ~= 7.4.5
 requests ~= 2.32.4
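
Reviewer note: the README and notebook changes above describe a single hash per GeoTIFF that accounts for the data array shape, the array elements and the metadata tags, with `skipped_metadata_tags` excluding named tags. The sketch below only illustrates that idea and is not the package's implementation. `sketch_geotiff_hash`, the in-memory inputs and the tag names are hypothetical, SHA 256 is assumed by analogy with the netCDF4 and HDF-5 hashing, and only the standard library is used so that it runs without `tifffile` or a real file.

```python
import hashlib
import json


def sketch_geotiff_hash(shape, pixels, tags, skipped_metadata_tags=None):
    """Hypothetical helper: one SHA 256 digest over shape, pixels and tags."""
    skipped = skipped_metadata_tags or set()
    # Drop any tag whose name the caller asked to skip.
    kept_tags = {name: value for name, value in tags.items() if name not in skipped}

    digest = hashlib.sha256()
    digest.update(json.dumps(list(shape)).encode('utf-8'))
    digest.update(json.dumps(list(pixels)).encode('utf-8'))
    # sort_keys makes the serialisation independent of dictionary ordering.
    digest.update(json.dumps(kept_tags, sort_keys=True).encode('utf-8'))
    return digest.hexdigest()


# Illustrative inputs: identical data, but a timestamp tag that differs by run.
shape = (2, 2)
pixels = [290.5, 291.0, 289.75, 292.25]
tags_run_one = {'PixelScale': [70.0, 70.0, 0.0], 'DateTime': '2025:08:21 10:41:17'}
tags_run_two = {'PixelScale': [70.0, 70.0, 0.0], 'DateTime': '2025:08:22 09:00:00'}

# Without skipping, the differing timestamp changes the digest.
hashes_differ = (
    sketch_geotiff_hash(shape, pixels, tags_run_one)
    != sketch_geotiff_hash(shape, pixels, tags_run_two)
)

# Skipping the volatile tag makes both runs hash identically.
hashes_match = (
    sketch_geotiff_hash(shape, pixels, tags_run_one, {'DateTime'})
    == sketch_geotiff_hash(shape, pixels, tags_run_two, {'DateTime'})
)
```

This is the motivation for passing a set of tag names when a regression test should tolerate metadata that changes between executions, such as creation timestamps.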