This repository has been archived by the owner on Mar 29, 2023. It is now read-only.

Gerrit tarballs for the base packages aren't deterministic #84

Closed
spearce opened this issue Dec 2, 2016 · 19 comments

Comments

@spearce
Contributor

spearce commented Dec 2, 2016

Originally reported on Google Code with ID 92

Downloading this URL: https://go.googlesource.com/net/+archive/d75b190.tar.gz?dummy=/golang-net-d75b190.tgz
always produces a different tar archive. Contents are the same, but tar metadata is
different. This prevents using such URLs as fingerprinted distfiles in FreeBSD.

It might be that https://go.googlesource.com uses BSD tar that suffers from this bug:
libarchive/libarchive#623 (in this case the fix is switching to GNU tar).

Reported by None on 2015-12-13 16:27:54

@spearce
Contributor Author

spearce commented Dec 2, 2016

Thanks for filing this bug. It's not Gerrit but Gitiles that generates these (and even
then I think it's JGit's fault). But it's certainly not Gerrit. So I'm going to move
this over to the Gitiles project, where it's more likely to get attention.

Reported by None on 2015-12-14 07:03:42

@spearce spearce closed this as completed Dec 2, 2016
@spearce spearce reopened this Dec 2, 2016
@shahms

shahms commented Dec 27, 2016

FWIW, this makes it somewhat annoying to follow Bazel best practices when using http_archive for external dependencies hosted on googlesource.com. Specifically, the best practices are to use http_archive rather than git_repository and to include a sha256sum.

@spearce
Contributor Author

spearce commented Dec 30, 2016

FYI, this is pretty unlikely to be fixed due to needing to break a public API inside JGit, which requires a major version bump, and that doesn't happen often.

The best practice to get files from Git is to use the Git wire protocol to git clone the repository. If only a single version is needed, use --depth 1 to get a shallow clone.

If Bazel doesn't want to use Git to fetch source files from Git, then best practice should be to export the files as a tarball and store that tarball in another, non-Git persistent location where the exact bytes of that stream are unlikely to change.

Attempting to checksum a dynamically created .tar.bz2 or .tar.gz stream is not a good idea, as the compressor can change over time and produce different compressed stream results that still inflate to the same original files.
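
To illustrate the point above, here is a minimal Python sketch (not taken from JGit or Gitiles) showing that the same input can yield different compressed streams that inflate to identical data. It uses only the standard library's `zlib` and `gzip` modules; the payload and the two compression levels are arbitrary choices made for illustration.

```python
import gzip
import hashlib
import zlib

# Arbitrary, compressible payload standing in for a tar stream.
payload = b"".join(b"line %d of some source file\n" % i for i in range(2000))

# Same input, two compressor settings: a stand-in for "the compressor changed".
fast = zlib.compress(payload, 1)
best = zlib.compress(payload, 9)

# The gzip container also embeds a modification time in its header.
gz_a = gzip.compress(payload, mtime=0)
gz_b = gzip.compress(payload, mtime=1)


def sha256(data: bytes) -> str:
    """Hex SHA-256 of a byte string."""
    return hashlib.sha256(data).hexdigest()


# The compressed bytes (and therefore their checksums) differ...
streams_differ = sha256(fast) != sha256(best)
headers_differ = sha256(gz_a) != sha256(gz_b)

# ...even though every stream inflates back to the same payload.
contents_match = (
    zlib.decompress(fast) == zlib.decompress(best) == payload
    and gzip.decompress(gz_a) == gzip.decompress(gz_b) == payload
)
```

A checksum pinned against any one of these streams would reject the others, although all of them carry the same files.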

@shahms

shahms commented Jan 3, 2017

Bazel can use git directly, but it doesn't support shallow clones and therefore unnecessarily fetches all of the history for a repo. Their suggestion is to use http_archive to fetch a tarball for this use case.

@dborowitz
Contributor

IMHO there should be a feature request against Bazel to support shallow clones. It should be trivial to add a shallow-clone option (git's own flag for this is --depth).

As Shawn says, using a dynamically generated compressed file is still a bad idea for this use case. Even if we fix JGit/Gitiles to generate a deterministic sequence of bytes at a given server version, we have no way to ensure that the given sequence of bytes remains deterministic across server versions. We may depend on the JDK's zlib implementation for compressing objects, and there is no guarantee that that implementation is going to always produce the same byte sequence across JDK versions. Similarly, we use Apache Commons Compress for generating the archives, and we have no guarantee that a given list of archive entries is always going to contain the same bytes of metadata even if the compressed content is the same. The upshot is that callers really should not depend on the sequence of bytes in an archive being stable in the long term, which is what the Bazel use case is asking for.

@hanwen
Contributor

hanwen commented Jan 9, 2017

You could write a custom repository rule that runs a git clone/fetch of a specific revision to implement Shawn's suggestion. Beyond fixing the direct issue, I think that would also be a good direction for Bazel to take, so that Bazel can stop depending on JGit.
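
A rough sketch of the git commands such a rule might run, written as a Python helper. The function names, the `origin` remote name, and the destination directory are hypothetical. Fetching a bare commit ID with `--depth 1` also assumes the server allows that commit to be requested directly, which not every Git host does.

```python
import subprocess
from typing import List


def shallow_fetch_commands(url: str, revision: str, dest: str) -> List[List[str]]:
    """Return the git invocations that fetch one revision without history."""
    return [
        ["git", "init", dest],
        ["git", "-C", dest, "remote", "add", "origin", url],
        # --depth 1 asks for the single commit only (a shallow fetch).
        ["git", "-C", dest, "fetch", "--depth", "1", "origin", revision],
        # FETCH_HEAD points at whatever the previous fetch retrieved.
        ["git", "-C", dest, "checkout", "--detach", "FETCH_HEAD"],
    ]


def shallow_fetch(url: str, revision: str, dest: str) -> None:
    """Run the commands in order, stopping at the first failure."""
    for cmd in shallow_fetch_commands(url, revision, dest):
        subprocess.run(cmd, check=True)
```

For example, `shallow_fetch("https://boringssl.googlesource.com/boringssl", "ae223d6138807a13006342edfeef32e813246b39", "boringssl-src")` would attempt to check out that one commit into a new `boringssl-src` directory.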

dg0yt added a commit to OpenOrienteering/superbuild that referenced this issue Mar 15, 2019
These tarballs are not deterministic due to changing metadata,
cf. google/gitiles#84
dg0yt added a commit to OpenOrienteering/superbuild that referenced this issue Mar 16, 2019
These tarballs are not deterministic due to changing metadata,
cf. google/gitiles#84
@davido
Contributor

davido commented Jun 17, 2019

This is now tracked as: [1]. The change under review is: [2].

@davido
Contributor

davido commented Jun 20, 2019

Thanks @msohn, it is fixed now, as of JGit 5.1.9.

@dborowitz, @jrn, @hanwen Can this be closed?

@jheiss

jheiss commented Jun 6, 2020

Has this been deployed to googlesource.com?

$ curl -s https://boringssl.googlesource.com/boringssl/+archive/ae223d6138807a13006342edfeef32e813246b39.tar.gz | shasum
470f928f1c27777450b35cc6bf7cdce604ffe9af  -

$ curl -s https://boringssl.googlesource.com/boringssl/+archive/ae223d6138807a13006342edfeef32e813246b39.tar.gz | shasum
ec8cd3acabbc7ff12df97064248823be0372a869  -

@vapier
Member

vapier commented Jun 7, 2020

Unfortunately, it has not, and it doesn't seem like it will be :/

@ryandesign

Whom do we need to contact to get that fixed?

@hanwen
Contributor

hanwen commented Sep 24, 2020

googlesource.com runs JGit from master, so if this is still non-deterministic, something else is going on.

@ryandesign

"if this is still non-deterministic"

It is. Note the different Content-Length values in two requests for the same commit:

$ curl -I 'https://chromium.googlesource.com/chromium/tools/depot_tools/+archive/5664586374b9a80af397354523e93b9ef9333f16.tar.gz'
HTTP/1.1 200 OK
Cache-Control: private, max-age=7200, stale-while-revalidate=604800
Content-Disposition: attachment; filename=depot_tools-5664586374b9a80af397354523e93b9ef9333f16.tar.gz
Content-Length: 1669011
Content-Security-Policy-Report-Only: script-src 'nonce-LMJfW5Qngj9T28V+Qzc5dw' 'unsafe-inline' 'strict-dynamic' https: http: 'unsafe-eval';object-src 'none';base-uri 'self';report-uri https://csp.withgoogle.com/csp/gerritcodereview/1
Content-Type: application/x-gzip
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Xss-Protection: 0
Date: Thu, 24 Sep 2020 15:29:01 GMT
Alt-Svc: h3-Q050=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"

$ curl -I 'https://chromium.googlesource.com/chromium/tools/depot_tools/+archive/5664586374b9a80af397354523e93b9ef9333f16.tar.gz'
HTTP/1.1 200 OK
Cache-Control: private, max-age=7200, stale-while-revalidate=604800
Content-Disposition: attachment; filename=depot_tools-5664586374b9a80af397354523e93b9ef9333f16.tar.gz
Content-Length: 1668975
Content-Security-Policy-Report-Only: script-src 'nonce-IbkxLKtQPmSfur5zBvL4lg' 'unsafe-inline' 'strict-dynamic' https: http: 'unsafe-eval';object-src 'none';base-uri 'self';report-uri https://csp.withgoogle.com/csp/gerritcodereview/1
Content-Type: application/x-gzip
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Xss-Protection: 0
Date: Thu, 24 Sep 2020 15:29:05 GMT
Alt-Svc: h3-Q050=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-27=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"

@mohd-akram

This is still happening.

@jrn
Contributor

jrn commented Mar 26, 2021

"Attempting to checksum a dynamically created .tar.bz2 or .tar.gz stream is not a good idea, as the compressor can change over time and produce different compressed stream results that still inflate to the same original files."

This has been true from the start. Unless we

a. Store the tarball when a user downloads it (this is what GitHub does), or

b. Keep around historical versions of commons-compress and record which one was used to produce the tarball

we cannot make a long-term deterministic tarball download. All the requests I have seen are for use cases that require long-term determinism. In that spirit, it would be misleading to pretend we intend to provide that; it is expensive to do and not part of what Gitiles is meant for.

If you don't need determinism, you can use the Gitiles tarball. If you do need determinism, I recommend storing the tarball somewhere (e.g. a cloud storage provider or an ftp host).
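
As a sketch of that recommendation: once a tarball has been exported and stored somewhere stable, a consumer can pin its SHA-256 and verify every download against it. The helper below is illustrative only. The function name and the in-memory stand-in for the stored file are assumptions, and where the file is actually hosted is out of scope.

```python
import hashlib
import hmac


def verify_pinned(data: bytes, expected_sha256: str) -> None:
    """Raise ValueError unless data hashes to the pinned SHA-256 digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest is a constant-time comparison; plain == would also work here.
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(
            f"checksum mismatch: expected {expected_sha256}, got {actual}"
        )


# Stand-in for the bytes of a tarball that was exported once and stored elsewhere.
stored_tarball = b"bytes of the exported archive"

# The digest is computed once at export time and recorded next to the URL.
pinned = hashlib.sha256(stored_tarball).hexdigest()

# Every later download of the stored file is checked against the pinned value.
verify_pinned(stored_tarball, pinned)
```

Because the stored bytes never change, the pinned digest stays valid regardless of which compressor or archive library the server uses later.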

@vapier
Member

vapier commented Mar 28, 2021

(a) Can we make this a hosting config option? I get that storing archives for every project and every commit takes a ton of space and would be pretty wasteful (especially if crawlers fire). I wonder if a middle ground of doing it only for tags would work.

(b) How big of a problem is this approach? Gitiles doesn't seem to change that much (for better or worse). What if we did this? Not entirely unrelated: the gzip project has an rsync option so compressed files are stable and easy to transfer.

ryandesign referenced this issue in macports/macports-ports Apr 1, 2021
primeos added a commit to primeos/nixpkgs that referenced this issue May 4, 2021
We had to use fetchgit so far as the tarballs are generated on demand
and have embedded timestamps which makes their hashes unstable [0][1].
This is a problem for fetchurl but fetchzip extracts the tarballs into
the Nix store and therefore the contents will get normalized and the
hashes remain stable.

[0]: google/gitiles#84
[1]: https://bugs.eclipse.org/bugs/show_bug.cgi?id=548312
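
The idea of hashing extracted contents rather than archive bytes can be sketched in Python. This is not how Nix's fetchzip computes its hash; `make_tar` and `content_digest` are hypothetical helpers, and the file names and timestamps are made up. The digest covers only member names and file data, so per-entry timestamps do not affect it.

```python
import hashlib
import io
import tarfile


def make_tar(files: dict, mtime: int) -> bytes:
    """Build an uncompressed tar in memory, stamping every entry with mtime."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def content_digest(tar_bytes: bytes) -> str:
    """Hash member names and file data only, ignoring tar metadata."""
    digest = hashlib.sha256()
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            # Length-prefix the data so entry boundaries are unambiguous.
            digest.update(member.name.encode("utf-8") + b"\0")
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
    return digest.hexdigest()


files = {"README": b"hello\n", "src/main.go": b"package main\n"}

# Two archives of the same files, generated a few seconds apart.
first = make_tar(files, mtime=1_000_000_000)
second = make_tar(files, mtime=1_000_000_004)

raw_hashes_differ = hashlib.sha256(first).digest() != hashlib.sha256(second).digest()
content_hashes_match = content_digest(first) == content_digest(second)
```

A hash over the raw archive changes with the embedded timestamps, while the content-based digest stays the same for both archives.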
@eighthave

I'm still seeing the timestamp in the tar metadata when downloading from googlesource.com. So this is not yet resolved. It looks like it was already fixed in JGit. I added more info in #217

@ryandesign

Why is this issue closed? The problem was never fixed. Please reopen.

vszakats added a commit to curl/curl-for-win that referenced this issue Jul 6, 2022
…kip]

Also make directory strip level a parameter in live_xt. googlesource.com
tarballs need level 0.

Ref: google/gitiles#84
winterheart added a commit to winterheart/gentoo that referenced this issue Jul 26, 2022
Don't use *.googlesource.com as tarball source, it generates
non-reproducible tarballs (google/gitiles#84).

Closes: https://bugs.gentoo.org/860297
Signed-off-by: Azamat H. Hackimov <azamat.hackimov@gmail.com>
gentoo-bot pushed a commit to gentoo/gentoo that referenced this issue Jul 26, 2022
Don't use *.googlesource.com as tarball source, it generates
non-reproducible tarballs (google/gitiles#84).

Closes: https://bugs.gentoo.org/860297
Signed-off-by: Azamat H. Hackimov <azamat.hackimov@gmail.com>
Closes: #26550
Signed-off-by: Joonas Niilola <juippis@gentoo.org>