
Is there any variable information stored in the h5 internal structure that makes the h5 file different on each generation? #1919

Closed
suneeta-mall opened this issue Jul 4, 2021 · 7 comments

Comments

@suneeta-mall

Is there variable information stored in the h5 internal structure that makes the h5 file different on each generation?

I am writing fixed content to an h5 file, but when I read the file back as binary, the checksum is different on each run:

import numpy as np
import h5py
import hashlib

d = np.ones((100,100), dtype=np.int8)

with h5py.File('data.h5', 'w') as hf:
    hf.create_dataset('dataset', data=d)


with open('data.h5', "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()
print(digest)

Every run of this script produces a different digest:

In [1]: # (same script as above)
b1d4035a06358719d48cf89684f681ab

In [2]: # (same script as above)
d97372937056b8a2b5628dba6c770e7a

In [3]: # (same script as above)
ef3258302ab433cbfbd8cfd1d6d9f473

My understanding, without detailed knowledge of h5 internals, is that I should get the same digest as long as the file is produced with the same drivers and runtime. This holds true for plain files, see below:

import hashlib
with open('data.txt', 'w') as f:
    f.write('readme')

with open('data.txt', "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()
print(digest)

When run multiple times, this produces the same digest:

In [2]: # (same script as above)
3905d7917f2b3429490b01cfb60d8f5b

In [3]: # (same script as above)
3905d7917f2b3429490b01cfb60d8f5b

In [4]: # (same script as above)
3905d7917f2b3429490b01cfb60d8f5b

Am I mistaken about the expected behavior of h5? Is it a bug? If not, is there a way to make the output reproducible when the contents added are exactly the same?

My system info is as follows:

$ python -c 'import h5py; print(h5py.version.info)'
Summary of the h5py configuration
---------------------------------

h5py    2.10.0
HDF5    1.10.6
Python  3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:12:38)
[Clang 11.0.1 ]
sys.platform    darwin
sys.maxsize     9223372036854775807
numpy   1.20.3
@suneeta-mall suneeta-mall changed the title Is there variable information stored in the h5 internal structure that makes the h5 file different on each generation? Is there any variable information stored in the h5 internal structure that makes the h5 file different on each generation? Jul 4, 2021
@takluyver
Member

I wasn't aware of this, but yes, it appears that there are timestamps in HDF5 by default.

You can turn off timestamps on datasets using hf.create_dataset(..., track_times=False) (docs). It's not clear to me if there are timestamps by default on groups or datatypes - if there are, I think we'd accept PRs to make it easy to disable those as well.
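A minimal sketch of that workaround, writing the same data twice and comparing digests (assuming only the `track_times` behavior described above; file paths are illustrative):

```python
import hashlib

import h5py
import numpy as np

d = np.ones((100, 100), dtype=np.int8)

digests = []
for path in ("a.h5", "b.h5"):
    # track_times=False stops HDF5 from embedding a modification
    # timestamp in the dataset's object header.
    with h5py.File(path, "w") as hf:
        hf.create_dataset("dataset", data=d, track_times=False)
    with open(path, "rb") as f:
        digests.append(hashlib.md5(f.read()).hexdigest())

# With timestamps disabled, both files should be byte-identical.
print(digests[0] == digests[1])
```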

@suneeta-mall
Author

Ah, thanks for that @takluyver, that worked!
I agree it should be easier to do this; I might give it a crack in the coming days, so I'm leaving this open.

@takluyver
Member

Great!

I was specifically wondering if it needed a track_times parameter for .create_group() as well as .create_dataset(), but it seems that groups don't have a timestamp by default. If I create a file with just a group, it's identical each time I run the script to create it. So that doesn't need any work.

I'm less sure about adding more high-level features to make this more convenient. It doesn't look like the need comes up all that often (previous discussions here and here have attracted only modest interest), and I think the track_times parameter is quite nice and simple, even if it's a bit verbose if you're creating a lot of datasets. It's always possible to create a wrapper function or class if you want to use the same options repeatedly.
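For example, such a wrapper (a hypothetical helper, not part of h5py) could be as small as:

```python
import h5py


def create_dataset_stable(group, name, **kwargs):
    """Hypothetical helper: create_dataset with timestamps off by default."""
    kwargs.setdefault("track_times", False)
    return group.create_dataset(name, **kwargs)


# Usage:
# with h5py.File("data.h5", "w") as hf:
#     create_dataset_stable(hf, "dataset", data=d)
```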

@suneeta-mall
Author

suneeta-mall commented Jul 5, 2021

Actually found some more interesting things:
Depending on the platform (e.g. Linux/macOS), the same example with track_times=False still produces a different checksum. I can confirm that the versions of the h5 libraries are exactly the same, so I'm guessing it's platform specific?

Must be the platform-specific drivers, IMO

@takluyver
Member

I'm surprised by that. I would be less surprised by differences between Unix-y systems and Windows, but I thought HDF5 used mostly the same code on Linux & Mac. It's a pretty complex file format, though, so it's not inconceivable that some minor platform difference affects it.

If you want more details, you might try asking the HDF group, either on help@hdfgroup.org or the HDF forum.

@tacaswell
Member

I had a knee-jerk reaction that this is an endianness issue, but endianness does not apply to 1-byte integers. If it helps, this is the hash I'm getting:

In [61]: import numpy as np
     ... import h5py
     ... import hashlib
     ... 
     ... d = np.ones((100, 100), dtype=np.int8)
     ... 
     ... with h5py.File("data.h5", "w") as hf:
     ...     hf.create_dataset(
     ...         "dataset", data=d, track_order=False, track_times=False, dtype="|i1"
     ...     )
     ... 
     ... 
     ... with open("data.h5", "rb") as f:
     ...     digest = hashlib.md5(f.read()).hexdigest()
     ... print(digest)
826188fd0befb6debbc013c97f2282da

with both

Summary of the h5py configuration
---------------------------------

h5py    2.9.0
HDF5    1.10.4
Python  3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.16.4



and

Summary of the h5py configuration
---------------------------------

h5py    3.3.0
HDF5    1.12.0
Python  3.10.0b3+ (heads/3.10:7a2d2ed133, Jul  2 2021, 13:40:48) [GCC 11.1.0]
sys.platform    linux
sys.maxsize     9223372036854775807
numpy   1.22.0.dev0+208.g7d8fadac3
cython (built with) 3.0.0a8
numpy (built against) 1.22.0.dev0+208.g7d8fadac3
HDF5 (built against) 1.12.0

I agree with @takluyver this probably should go to the hdf5 group!
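(To illustrate the endianness point: byte order is a no-op for 1-byte integers, so the raw dataset payload itself can't differ between little- and big-endian machines.)

```python
import numpy as np

d = np.ones((100, 100), dtype=np.int8)
# byteswap() reverses the bytes within each element; for a 1-byte
# dtype it changes nothing, so the serialized payload is identical.
assert d.tobytes() == d.byteswap().tobytes()
print("int8 payload is byte-order independent")
```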

@suneeta-mall
Author

@tacaswell The platform issue I was talking about was happening with different content than the simple numpy array in the example above. track_order had no impact on making the checksum platform agnostic.

I suspect your hunch may be right and it could be related to endianness. I'd be interested in the details, but this is not a major problem for me for now. I will raise it with the HDF group shortly.

I will close this in the meantime. Thanks both @takluyver & @tacaswell
