Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable writing dataset timestamps by default? #1953

Closed
takluyver opened this issue Aug 26, 2021 · 10 comments · Fixed by #1958
Closed

Disable writing dataset timestamps by default? #1953

takluyver opened this issue Aug 26, 2021 · 10 comments · Fixed by #1958

Comments

@takluyver
Copy link
Member

This is a bit speculative, so I won't push it if there doesn't seem to be a consensus.

The option to disable writing timestamps (group.create_dataset(..., track_times=False)) appears to have been there since 2013 (#271), based on a request from 2011 (#225). It's useful because, in some circumstances, people want running the same code to produce an identical file. There are much bigger challenges for reproducibility, so an option you can simply turn off isn't the end of the world, but it's an extra little annoyance for people to deal with (e.g. on Stackoverflow, and issue #1919).

I just tried to actually inspect some timestamps (to work out how they behave with group.copy()), and it took a while to work out how to do so.

  • I can't find how to make any of the HDF5 command line tools show timestamps (h5ls, h5dump, h5stat).
  • The HDF5 C API favours H5Oget_info as the way to do this, but none of the time fields are exposed in h5py (even in the low-level). The HDF5 docs say that of 4 timestamps, only ctime is implemented.
  • h5py.h5g.get_objinfo().mtime appears to work, although the HDF5 C function is deprecated in favour of H5Oget_info. Only datasets appear to get a timestamp by default, and you'd need the low-level API to enable it for groups, so it's weird to use an H5G API for datasets.

I can't see any issues asking for us to expose the timestamps on h5py.h5o.ObjInfo, or to expose any kind of timestamp in the high-level API. I also did a couple of searches on the HDF forum, without finding much. I get a strong impression that very few people have any use for HDF5 object timestamps.

If more people want timestamps disabled than want to use timestamps, should we disable writing them by default? It would still be possible to enable them by passing track_times=True if someone does want them.

@takluyver
Copy link
Member Author

Just found a couple more threads of people wanting to disable timestamps: April 2017 and July 2017. I still failed to find any threads with people wanting better access to timestamps.

I've also started a thread on the HDF forum to get more feedback.

@tacaswell
Copy link
Member

That seems reasonable to me.

Do we just change this and see who screams or do we want to go through a cycle where we warn if the user does not pas an explicit value for it?

Is tracking times independent of tracking creation order?

@takluyver
Copy link
Member Author

If we believe my estimate that essentially no-one is using the timestamps, I think we should just change the default and see who complains, even though that goes against how we'd normally handle a change like this. It would be weird to show almost everyone a warning and push them to set it explicitly precisely because we believe that people aren't interested in it.

Yes, this is separate from tracking creation order (track_times vs track_order). Order tracking is disabled by default.

@aragilar
Copy link
Member

Seems reasonable to me. If it's near impossible to get the timestamps out, not sure it's that useful to write them then. Maybe if someone has a need for the timestamps, then they can look at making them more accessible also.

@MarDiehl
Copy link

MarDiehl commented Sep 3, 2021

I did not know about this feature and currently add timestamps as a string attributes to every dataset.
So, I would say that timestamps are useful, but only if they can be accessed easily.

Is the automatic creation of timestamps something special to h5py, or is this a feature of the underlying C library?
In the latter case, I would suggest to keep h5py consistent with the rest of the HDF5 ecosystem and write timestamps.

What is the motivation for having bit-wise matching files anyways? I understand that this is a nice feature for a test suite, but h5diff could be used in this case.

@takluyver
Copy link
Member Author

If even people who want a timestamp are manually writing their own, that does suggest that writing the built-in timestamps is not a useful default.

I'm sympathetic to the argument that h5py should follow HDF5's default, but when I went and looked at the HDF5 documentation, the default didn't actually seem to be documented, which implies it's not super important. Maybe if h5py makes the change, HDF5 will follow suit later.

I can see a range of possible motivations for checking exact equality rather than using h5diff. For one thing, it allows you to store the hash of the known good file rather than the file itself. You may also care about performance characteristics like chunking or where data is in a file. I also just like the idea that we can expect running the same code (inc. libraries) gives you the same output.

@MarDiehl
Copy link

MarDiehl commented Sep 5, 2021

I agree that the current implementation is not as useful as it could be. Time stamps are a very useful feature for a self-documenting file format, and time stamping is standard feature of every file system. It would be nice if HDF5 would automatically store the time stamp (it seems it does) and allows easy access (to be done).

From my experience, the use cases where a hash is helpful are rather limited. Most real life application use floating point operations and this introduces bit-wise differences: Hashes will be different if the compiler options are changed, if a different compiler is used or even if any kind of parallelization is employed. For most of my unit tests, I use numpy.allclose/isclose to do the comparison. For some special cases, where no floating point arithmetic is involved, I patch datetime.datetime.

I think for testing purposes where bit-wise equality is possible, setting the time stamp to a fixed value (for the whole library, i.e. via H5 routine) would be the most convenient option. Maybe h5py could take the lead and HDF5 follows later.

@takluyver
Copy link
Member Author

My feeling is still that the balance is in favour of not creating timestamps by default, so I've opened PR #1958 to move it forward. This is only the default, it will still be possible to enable the timestamps. And of course users can also write their own timestamps - which may well be more useful, because then you can control exactly what is being recorded.

I'm still open to hearing that this is the wrong move and we should leave the default alone. But, as I described above, I couldn't find anyone interested in making timestamps more accessible, either in h5py or in HDF5 command line tools. Admittedly there aren't many people who don't want timestamps, but there are a few threads about that over the years.

@takluyver
Copy link
Member Author

The author of rhdf5 (R bindings) said on the HDF forum that he made a similar change, released in May 2020, and so far has not had any complaints.

I think this is a convincing datapoint, because usually breaking something is the most reliable way to get feedback. 😉

@takluyver
Copy link
Member Author

I'll merge the PR in a couple more days if I don't hear anything further. 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants