Is there any variable information stored in the h5 internal structure that makes the h5 file different on each generation? #1919
Comments
I wasn't aware of this, but yes, it appears that there are timestamps in HDF5 by default. You can turn off timestamps on datasets using
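For readers landing on this comment: a snippet further down the thread passes `track_times=False` to `create_dataset`. Below is a minimal sketch of checking that this yields a stable digest for fixed content. The helper name `write_and_digest` and the file names are made up for illustration, and the check only covers a single plain dataset, not every kind of HDF5 content.

```python
import hashlib
import os
import tempfile

import h5py
import numpy as np


def write_and_digest(path):
    """Write a fixed dataset with timestamps disabled and return the file's MD5."""
    data = np.ones((100, 100), dtype=np.int8)
    with h5py.File(path, "w") as hf:
        # track_times=False asks HDF5 not to store dataset timestamps.
        hf.create_dataset("dataset", data=data, track_times=False)
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


# Write the same content to two separate files and compare the digests.
with tempfile.TemporaryDirectory() as tmp:
    first = write_and_digest(os.path.join(tmp, "a.h5"))
    second = write_and_digest(os.path.join(tmp, "b.h5"))

print(first == second)
```

If the two digests still differ for your own content, something other than dataset timestamps is varying between runs.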
Ah, thanks for that @takluyver! That worked!
Great! I was specifically wondering if it needed a I'm less sure about adding more high-level features to make this more convenient. It doesn't look like the need comes up all that often (previous discussions here and here have attracted only modest interest), and I think the
Actually found some more interesting things: Must be the platform-specific drivers IMU
I'm surprised by that. I would be less surprised by differences between Unix-y systems and Windows, but I thought HDF5 used mostly the same code on Linux & Mac. It's a pretty complex file format, though, so it's not inconceivable that some minor platform difference affects it. If you want more details, you might try asking the HDF group, either on help@hdfgroup.org or the HDF forum.
I had a knee-jerk reaction that this was an endianness issue, but endianness does not apply to 1-byte integers. If it helps, this is the hash I'm getting:
In [61]: import numpy as np
    ...: import h5py
    ...: import hashlib
    ...:
    ...: d = np.ones((100, 100), dtype=np.int8)
    ...:
    ...: # Write the array with timestamps and creation-order tracking disabled.
    ...: with h5py.File("data.h5", "w") as hf:
    ...:     hf.create_dataset(
    ...:         "dataset", data=d, track_order=False, track_times=False, dtype="|i1"
    ...:     )
    ...:
    ...: # Hash the raw bytes of the file on disk.
    ...: with open("data.h5", "rb") as f:
    ...:     digest = hashlib.md5(f.read()).hexdigest()
    ...: print(digest)
826188fd0befb6debbc013c97f2282da with both
and
I agree with @takluyver, this should probably go to the HDF Group!
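To illustrate the endianness remark above, here is a small NumPy-only sketch. It is not taken from the thread; it just shows that byte order has no effect on 1-byte integers, while it does change the raw bytes of wider integer types.

```python
import numpy as np

# A 1-byte integer has no byte order, so "<i1" and ">i1" are the same dtype.
same_dtype = np.dtype("<i1") == np.dtype(">i1")

one_byte = np.array([1], dtype="i1")
little = np.array([1], dtype="<i4")  # 4-byte little-endian
big = np.array([1], dtype=">i4")     # 4-byte big-endian

print(same_dtype)
print(one_byte.tobytes())
print(little.tobytes(), big.tobytes())
```

The 4-byte arrays hold the same value but serialize to different byte sequences, which is the kind of difference that would change a file digest.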
@tacaswell The platform issue I was talking about happened with different content than the simple NumPy array in the example above. I suspect your hunch may be right and it could be related to endianness. I would be interested in knowing the details, but this is not a major problem for me for now. I will raise it with the HDF Group shortly and close this in the meantime. Thanks both @takluyver & @tacaswell
Is there variable information stored in the h5 internal structure that makes the h5 file different on each generation?
I am writing fixed content to an h5 file, but when I read the binary back, the checksum produced is different on each run:
Every run of this script produces a different digest:
My understanding without much detailed h5 internals knowledge is that I should get the same digest as long as it's produced on the same drivers and runtime. This holds true for simple files, see below:
When run multiple times, this produces the same digest:
Am I mistaken about the expected behavior of h5? Is it a bug? If not, is there a way to make this reproducible when the contents added are exactly the same?
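When two generated files hash differently, it can help to find where they first diverge. The following is a standard-library sketch, not code from this issue; `file_digest` and `first_difference` are hypothetical helper names, and the demo uses small made-up byte strings rather than real HDF5 files.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def first_difference(path_a, path_b):
    """Return the offset of the first differing byte, or None if identical."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # One file may be a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Demo with two made-up files that differ only at byte offset 5.
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "a.bin")
    path_b = os.path.join(tmp, "b.bin")
    with open(path_a, "wb") as f:
        f.write(b"\x00" * 10)
    with open(path_b, "wb") as f:
        f.write(b"\x00" * 5 + b"\x01" + b"\x00" * 4)
    digests_match = file_digest(path_a) == file_digest(path_b)
    diff_offset = first_difference(path_a, path_b)
    same_offset = first_difference(path_a, path_a)

print(digests_match, diff_offset, same_offset)
```

Knowing the offset of the first difference makes it easier to tell whether the variation sits in file metadata or in the stored data itself.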
My system info is as follows: