Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix mtime of sdist archive members #452

Merged
merged 2 commits into from
Sep 1, 2023

Conversation

dnicolodi
Copy link
Member

No description provided.

@dnicolodi dnicolodi force-pushed the sdist-mtime branch 8 times, most recently from 3cd27c3 to 079ac4b Compare August 15, 2023 23:51
@dnicolodi dnicolodi added this to the v0.14.0 milestone Aug 23, 2023
Copy link
Contributor

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dnicolodi!

SOURCE_DATE_EPOCH, if it's set, is actually used for all the files in the wheel - both on main and with this PR. I think that's a useful feature, and it's tested - so we should probably document that.

In addition, SOURCE_DATE_EPOCH was used for the generated PKG_INFO file in main, and that is dropped in this PR. That is perhaps a regression, because with this PR it now always uses the current time - and that makes it impossible for two subsequent invocations of python -m build --sdist to yield the exact same output.

The main change in this PR - changing from time zero (1-1-1970) to the time of the last commit - does seem to me to be correct and an improvement on what this code did before.

@dnicolodi
Copy link
Member Author

In addition, SOURCE_DATE_EPOCH was used for the generated PKG_INFO file in main, and that is dropped in this PR. That is perhaps a regression, because with this PR it now always uses the current time - and that makes it impossible for two subsequent invocations of python -m build --sdist to yield the exact same output.

Why is this useful? The whole reproducible builds thing is about producing reproducible binaries, not source tarballs. What we should maybe do is to use the time of the last commit for PKG_INFO too. This would make the tarballs reproducible, except for the point below. As documented by the reproducible build project, SOURCE_DATE_EPOCH should be seen as a work-around for non-compliant tools. Not a specification to adhere to. And, as I already wrote, SOURCE_DATE_EPOCH is about build artifacts, not source distributions.

The main change in this PR - changing from time zero (1-1-1970) to the time of the last commit - does seem to me to be correct and an improvement on what this code did before.

I think the current implementation is just the best of the compromises. We still copy into the tarball the file content from the source directory instead than the content of the last commit. Thus using the timestamp of the commit as mtime seems a bit off.

I still think that picking up changes not committed to the version control is not useful. But, as we discussed, removing this behavior requires a bit more of analysis of the use cases, thus I left it in place.

@rgommers
Copy link
Contributor

Why is this useful? The whole reproducible builds thing is about producing reproducible binaries, not source tarballs.

https://reproducible-builds.org/docs/archives/ talks about timesteps for archives and how to fix this up if it isn't reproducible with the tar utility. So it looks like they care? It may be about archives of binaries though, not entirely clear.

For SciPy we just got a couple of bug reports from OpenSUSE related to non-reproducible builds. So let's ask the author of those issues. @bmwiedemann does it matter to you whether sdists (source archives) have non-reproducible timestamps, or only wheels (installable packages)? In other words, would you ever run python -m build --sdist twice and compare the resulting artifacts?

@bmwiedemann
Copy link

reproducible builds can be seen as a more general framework for deterministic transformations, including a files->tar transformation. So there is some value in being able to it. Some builds let .tar files end up in packages, so it can matter there for the journey to bit-reproducible results.

@dnicolodi
Copy link
Member Author

@bmwiedemann Thanks for chiming in. I understand that a desirable property of an archiving tool is to provide always binary identical archives for the same input files. However, I would like input on a more nuanced case. meson-python creates source and binary distributions of Python packages. This PR focuses on the creating of source distribution archives. For the source code part of the archive, meson-python uses the timestamp of the last commit to the underlying version control system. However, the Python source distribution format requires to add a generated metadata file to the archive.

We are wondering which modification time we should use to store the generated file in the archive: the same as for the source code files or the current time. Further, we would like to know if honoring $SOURCE_DATE_EPOCH has any value in this context.

I'm ambivalent in choosing between current time and last commit timestamp. I see value in using either. The latter of course results in a reproducible source code distribution archive, nor the first. However, I'm pretty sure that respecting $SOURCE_DATE_EPOCH has no to little value here. I would not know to what timestamp would anyone point that variable to when building the source distribution archive.

@dnicolodi
Copy link
Member Author

@rgommers One of the reasons I went for the current time for PKG_INFO is that other generated files in the archive (added or modified by the scripts registered with meson.add_dist_script()) take the current time as mtime in the archive generated by Meson, unless the package authors take special care (and I'm almost certain that very few do). As we don't do anything to make the mtime of these constant, I thought using the current time for PKG_INFO is also fine. One common example of files added via dist scripts are auto-generated files with version information. On the other hand, PKG_INFO depends only on the metadata of the project, thus I don't see it problematic to give it the mtime of the last commit.

@bmwiedemann
Copy link

The current patch would make it pretty hard to reproduce tarballs (even if it is "only" source tarballs) because gzip.GzipFile will embed the current time in the .gz header.
It is valuable to be able to call git archive $ID and get the same result later / elsewhere.
This gives the stability needed for https://github.com/archlinux/aur/blob/meson/PKGBUILD#L12

And SOURCE_DATE_EPOCH can be a good alternative to current time. It is at least an indicator that someone would prefer builds to be reproducible.

@dnicolodi
Copy link
Member Author

The current patch would make it pretty hard to reproduce tarballs

Sure. But does anyone care? All the source tarballs produced by meson are not reproducible (there is at least one souece of non determinism: meson stores the current time in the gzip header) and I don't know of any complaint.

This gives the stability needed for https://github.com/archlinux/aur/blob/meson/PKGBUILD#L12

I don't understand what you mean. This is about the source distribution for Meson, which is not created using meson-python. It is a check that the SHA1 hash of the tarball downloaded from PyPI is the expected one. The tarball cannot be modified once it is uploaded on PyPI. Thus I don't see what this has to do with the this discussion. Can you please elaborate?

Anyway, the source distribution for meson is not reproducible. Meson 0.29.0 is ancient history, however:

$ git checkout 0.29.0
$ python3.11 -m build -s
$ mv dist/meson-0.29.0.tar.gz dist/meson-0.29.0.tar.gz.A
$ python3.11 -m build -s
$ mv dist/meson-0.29.0.tar.gz dist/meson-0.29.0.tar.gz.B
$ cmp dist/meson-0.29.0.tar.gz.A dist/meson-0.29.0.tar.gz.B
dist/meson-0.29.0.tar.gz.A dist/meson-0.29.0.tar.gz.B differ: char 5, line 1

@bmwiedemann
Copy link

It is a check that the SHA1 hash of the tarball downloaded from PyPI is the expected one. The tarball cannot be modified once it is uploaded on PyPI. Thus I don't see what this has to do with the this discussion. Can you please elaborate?

Not about PyPI, but more general:
Some while ago, there was trouble with tarballs from GitHub that suddenly did not match their previous SHAsum anymore, because they are generated on-the-fly from git repos. Would be nice, if all tarballs could be recreated identically from git, so we would not have to store them all. It saves a lot of disk-space, too.

@dnicolodi
Copy link
Member Author

Would be nice, if all tarballs could be recreated identically from git, so we would not have to store them all. It saves a lot of disk-space, too.

This is a very broad scope project. If this is your goal, I think that you may want to start pressuring more popular tools commonly used to produce source distributions than meson-python. Even focusing only on the Python packaging niche: probably the most commonly used tools is setuptools, which as demonstrated above, does not produce reproducible tarballs.

Even if we embrace the goal of reproducible source distributions, $SOURCE_DATE_EPOCH is not the solution.

Set the mtime of all archive members to the mtime set by meson dist,
also for files that may have been locally modified. Use the current
time for the PKG-INFO file.

Drop support for the $SOURCE_DATE_EPOCH environment variable when
creating the sdist. This environment variable is for meant to be used
as a work-around for build tools that produce non-reproducible binary
packages. It does not have a standardized meaning for tools producing
source distributions.

See https://reproducible-builds.org/docs/source-date-epoch/

Fixes mesonbuild#450.
Copy link
Contributor

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay

One of the reasons I went for the current time for PKG_INFO is that other generated files in the archive (added or modified by the scripts registered with meson.add_dist_script()) take the current time as mtime in the archive generated by Meson, unless the package authors take special care (and I'm almost certain that very few do). As we don't do anything to make the mtime of these constant, I thought using the current time for PKG_INFO is also fine

I'm not full convinced by this argument, because (a) only a small fraction of packages will be using add_dist_script and (b) it didn't seem necessary to make this change, and it's not impossible that someone may need this kind of sdist reproducibility.

However, the main change in this PR is more important and in good shape. So let's get this in in its current form, because it fixes a bug. If someone comes knocking about the change to PKG_INFO, we can always deal with that then - and we'll have learned something.

So in it goes. Thanks again @dnicolodi. And thanks @bmwiedemann for your thoughts on this issue.

@rgommers rgommers merged commit ac8256a into mesonbuild:main Sep 1, 2023
34 checks passed
@rgommers rgommers added the bug Something isn't working label Sep 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants