
Is there any variable information stored in the h5 internal structure that makes the h5 file different on each generation? #1919

Closed
suneeta-mall opened this issue Jul 4, 2021 · 7 comments

Comments

@suneeta-mall

Is there variable information stored in the h5 internal structure that makes the h5 file different on each generation?

I am writing fixed content to an h5 file, but when I read the file back as binary, the checksum is different on each run:

import numpy as np
import h5py
import hashlib

d = np.ones((100,100), dtype=np.int8)

with h5py.File('data.h5', 'w') as hf:
    hf.create_dataset('dataset', data=d)


with open('data.h5', "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()
print(digest)

Every run of this script produces a different digest:

In [1]: # (same script as above)
b1d4035a06358719d48cf89684f681ab

In [2]: # (same script as above)
d97372937056b8a2b5628dba6c770e7a

In [3]: # (same script as above)
ef3258302ab433cbfbd8cfd1d6d9f473

My understanding, without detailed knowledge of h5 internals, is that I should get the same digest as long as the file is produced with the same drivers and runtime. This holds true for plain files, see below:

import hashlib
with open('data.txt', 'w') as f:
    f.write('readme')

with open('data.txt', "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()
print(digest)

When run multiple times, this produces the same digest:

In [2]: # (same script as above)
3905d7917f2b3429490b01cfb60d8f5b

In [3]: # (same script as above)
3905d7917f2b3429490b01cfb60d8f5b

In [4]: # (same script as above)
3905d7917f2b3429490b01cfb60d8f5b

Am I mistaken about the expected behavior of h5? Is it a bug? If not, is there a way to make the output reproducible when the contents added are exactly the same?

My system info is as follows:

$ python -c 'import h5py; print(h5py.version.info)'
Summary of the h5py configuration
---------------------------------

h5py    2.10.0
HDF5    1.10.6
Python  3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:12:38)
[Clang 11.0.1 ]
sys.platform    darwin
sys.maxsize     9223372036854775807
numpy   1.20.3
@suneeta-mall suneeta-mall changed the title Is there variable information stored in the h5 internal structure that makes the h5 file different on each generation? Is there any variable information stored in the h5 internal structure that makes the h5 file different on each generation? Jul 4, 2021
@takluyver
Member

I wasn't aware of this, but yes, it appears that there are timestamps in HDF5 by default.

You can turn off timestamps on datasets using hf.create_dataset(..., track_times=False) (docs). It's not clear to me if there are timestamps by default on groups or datatypes - if there are, I think we'd accept PRs to make it easy to disable those as well.
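A minimal sketch of that workaround, writing the same data twice and comparing digests (assuming only the `track_times` behavior described above; file paths are illustrative):

```python
import hashlib

import h5py
import numpy as np

d = np.ones((100, 100), dtype=np.int8)

digests = []
for path in ("a.h5", "b.h5"):
    # track_times=False stops HDF5 from embedding a modification
    # timestamp in the dataset's object header.
    with h5py.File(path, "w") as hf:
        hf.create_dataset("dataset", data=d, track_times=False)
    with open(path, "rb") as f:
        digests.append(hashlib.md5(f.read()).hexdigest())

# With timestamps disabled, both files should be byte-identical.
print(digests[0] == digests[1])
```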

@suneeta-mall
Author

Ah, thanks for that @takluyver, that worked!
I agree it should be easier to do this; I might give it a crack in the coming days, so I'm leaving this open.

@takluyver
Member

Great!

I was specifically wondering if it needed a track_times parameter for .create_group() as well as .create_dataset(), but it seems that groups don't have a timestamp by default. If I create a file with just a group, it's identical each time I run the script to create it. So that doesn't need any work.

I'm less sure about adding more high-level features to make this more convenient. It doesn't look like the need comes up all that often (previous discussions here and here have attracted only modest interest), and I think the track_times parameter is quite nice and simple, even if it's a bit verbose if you're creating a lot of datasets. It's always possible to create a wrapper function or class if you want to use the same options repeatedly.
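For example, such a wrapper (a hypothetical helper, not part of h5py) could be as small as:

```python
import h5py


def create_dataset_stable(group, name, **kwargs):
    """Hypothetical helper: create_dataset with timestamps off by default."""
    kwargs.setdefault("track_times", False)
    return group.create_dataset(name, **kwargs)


# Usage:
# with h5py.File("data.h5", "w") as hf:
#     create_dataset_stable(hf, "dataset", data=d)
```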

@suneeta-mall
Author

suneeta-mall commented Jul 5, 2021

Actually found some more interesting things:
Depending on the platform (e.g. Linux/macOS), the same example with track_times=False still produces a different checksum. I can confirm that the versions of the h5 libraries are exactly the same, so I'm guessing it's platform specific?

Must be the platform-specific drivers, IMO

@takluyver
Member

I'm surprised by that. I would be less surprised by differences between Unix-y systems and Windows, but I thought HDF5 used mostly the same code on Linux & Mac. It's a pretty complex file format, though, so it's not inconceivable that some minor platform difference affects it.

If you want more details, you might try asking the HDF group, either on help@hdfgroup.org or the HDF forum.

@tacaswell
Member

I had a knee-jerk reaction that this is an endianness issue, but endianness does not apply to 1-byte integers. If it helps, this is the hash I'm getting:

In [61]: import numpy as np
     ... import h5py
     ... import hashlib
     ... 
     ... d = np.ones((100, 100), dtype=np.int8)
     ... 
     ... with h5py.File("data.h5", "w") as hf:
     ...     hf.create_dataset(
     ...         "dataset", data=d, track_order=False, track_times=False, dtype="|i1"
     ...     )
     ... 
     ... 
     ... with open("data.h5", "rb") as f:
     ...     digest = hashlib.md5(f.read()).hexdigest()
     ... print(digest)
826188fd0befb6debbc013c97f2282da

with both

Summary of the h5py configuration
---------------------------------

h5py    2.9.0
HDF5    1.10.4
Python  3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.16.4



and

Summary of the h5py configuration
---------------------------------

h5py    3.3.0
HDF5    1.12.0
Python  3.10.0b3+ (heads/3.10:7a2d2ed133, Jul  2 2021, 13:40:48) [GCC 11.1.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.22.0.dev0+208.g7d8fadac3
cython (built with) 3.0.0a8
numpy (built against) 1.22.0.dev0+208.g7d8fadac3
HDF5 (built against) 1.12.0

I agree with @takluyver this probably should go to the hdf5 group!
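(To illustrate the endianness point: byte order is a no-op for 1-byte integers, so the raw dataset payload itself can't differ between little- and big-endian machines.)

```python
import numpy as np

d = np.ones((100, 100), dtype=np.int8)
# byteswap() reverses the bytes within each element; for a 1-byte
# dtype it changes nothing, so the serialized payload is identical.
assert d.tobytes() == d.byteswap().tobytes()
print("int8 payload is byte-order independent")
```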

@suneeta-mall
Author

@tacaswell The platform issue I was talking about was happening with different content than the simple numpy array in the example above. track_order had no impact on making the checksum platform agnostic.

I suspect your hunch may be right and it could be related to endianness. I'd be interested in the details, but this is not a major problem for me for now. I will raise it with the HDF group shortly.

I will close this in the meantime. Thanks both @takluyver & @tacaswell
