SHA256 checksum changed for v0.26.0 #4343

Open
vladimiroltean opened this issue Sep 11, 2017 · 31 comments

Comments

@vladimiroltean

Reproduction steps

I am packaging libgit2 as part of a build system.
I am downloading the v0.26.0 release tarball from here: https://github.com/libgit2/libgit2/archive/v0.26.0.tar.gz.

Expected behavior

Until today, the sha256sum for the tarball was 4ac70a2bbdf7a304ad2a9fb2c53ad3c8694be0dbec4f1fce0f3cd0cda14fb3b9.
I would have expected it not to change.

Actual behavior

Since today, the sha256sum for the tarball is 6a62393e0ceb37d02fe0d5707713f504e7acac9006ef33da1e88960bd78b6eac.
What is happening?

Version of libgit2 (release number or SHA1)

https://github.com/libgit2/libgit2/archive/v0.26.0.tar.gz

@ethomson
Member

I don't know. We don't create these archives ourselves, they're automatically generated by GitHub. Presumably the results of these are cached, but I don't think that there's a guarantee there. That means that they're not stable.

The tar archive itself should be reproducible, but gzip does not have any guarantees of determinism. It uses a timestamp in the header, for example, and of course compression levels will cause wildly different outputs.
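To illustrate the point about gzip, here is a minimal sketch using Python's standard `gzip` module (the `mtime` keyword needs Python 3.8 or later). The payload is a made-up stand-in for a .tar stream, and the timestamp is an arbitrary value chosen for the demonstration. It only shows that the same input can yield different compressed bytes; it says nothing about what GitHub's servers actually run.

```python
import gzip
import hashlib

# Stand-in for the bytes of a .tar stream (hypothetical data).
payload = b"identical tar bytes\n" * 1000

# Same input, but a different header timestamp or compression level.
a = gzip.compress(payload, compresslevel=9, mtime=0)
b = gzip.compress(payload, compresslevel=9, mtime=1505088000)
c = gzip.compress(payload, compresslevel=1, mtime=0)

digests = {hashlib.sha256(x).hexdigest() for x in (a, b, c)}
roundtrips = {gzip.decompress(x) for x in (a, b, c)}

print("distinct .gz checksums:", len(digests))
print("distinct decompressed payloads:", len(roundtrips))
```

The compressed outputs differ even though every one of them decompresses to the same bytes.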

The tag for 0.26.0 is correct, it is (and has always been) 15e119375018fba121cf58e02a9f17fe22df0df8.

If you want to continue downloading the archive, I would encourage you to take the shasum of the .tar, not the .tar.gz. I would expect that to be stable.
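A rough sketch of that suggestion in Python: hash the decompressed .tar stream instead of the .tar.gz file. The function name and chunk size are illustrative choices, not part of any libgit2 or GitHub tooling. This only removes gzip as a source of variation; it cannot help if the tar stream itself changes.

```python
import gzip
import hashlib


def sha256_of_tar_payload(tar_gz_path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the decompressed .tar stream."""
    digest = hashlib.sha256()
    with gzip.open(tar_gz_path, "rb") as stream:
        # Read in chunks so large archives are never held in memory at once.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two downloads that differ only in gzip header or compression settings would produce the same digest here.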

But as to why this changed suddenly, I don't know. I suspect it was an otherwise harmless cache invalidation. But I don't like that this happened. (It would be ideal if GitHub's codeload would allow for repeatable builds here, and strip the timestamp out.)

As for how to make sure this doesn't happen again, I'm not sure yet. I think that we can upload our own artifacts when we do a release. But I'm not sure that disables the automatic .tar.gz and .zip creation. :/

/cc @carlosmn in case he has more insight into codeload and friends

@carlosmn
Member

Tarballs generated by git have never been guaranteed to have the same checksum as any other tarball. git sometimes brings in fixes for path handling, e.g. for Unicode or for compatibility with versions of tar from different operating systems. The output also depends on the exact version of gzip installed on the host.

If your system depends on tarballs that git generates on the fly, it has always been prone to arbitrary checksum mismatches. This bites every project that does it sooner or later. It has bitten the Linux kernel and Homebrew in the past.

The only way to get consistent checksums is to generate the archive once and upload it somewhere. This is where e.g. GitHub's releases come in, which let you upload your own artifacts. I believe the Homebrew project upload their source archives to a third party. Distributions such as Debian also upload tarballs to their own servers rather than rely on the project itself hosting them.

As to why this has happened now, GitHub recently completed an OS upgrade on its fileservers, but the most likely cause here is a bugfix to git so the generated tarballs are compatible with more versions of tar on different OSs. This fix has already been reverted once because it changed the checksums which some projects had been relying on. I believe those projects have now moved to store the archives they want to keep checksums of.

But even if you store this later checksum, it might change depending on GitHub OS or package upgrade schedule, which means that the checksum might change depending on where in the world you are and which machine happens to serve your request.

@jboning

jboning commented Sep 12, 2017

I encountered a new hash for the v0.24.1 release tarball. I compared the contents of a newly-downloaded tarball with the contents of a version I had cached, and found that the release appears to have been updated to include the contents of tests/.

Diff: https://gist.github.com/jboning/cef81b704895e6a224845e0075deabb7

@jboning

jboning commented Sep 12, 2017

Retracted. My cache must not have included the tests/ directory--when I re-extracted the old tarball, the contents were identical.

jboning added a commit to livegrep/livegrep that referenced this issue Sep 12, 2017
I validated that the contents are the same using the tarball in my bazel
cache. Apparently github changed the way they generate release tarballs:
libgit2/libgit2#4343 (comment)
@vladimiroltean
Author

The only way to get consistent checksums is to generate the archive once and upload it somewhere. This is where e.g. GitHub's releases come in, which let you upload your own artifacts.

I take it that although libgit2 emits release tags, the team does not upload their own tar.xz, but instead relies on the ones automatically created by git? Maybe something can be done there?

Thank you for the swift response.

@ethomson
Member

Yes, but again, I'm not sure that disables the automatic .tar.gz and .zip creation.

Worse than having one archive with an unreliable checksum is having two archives with different checksums (one of them unreliable).

lissyx pushed a commit to lissyx/tensorflow that referenced this issue Sep 13, 2017
There was a change on GitHub side:
libgit2/libgit2#4343 (comment)

Let us update the SHA256 for now, and keep working.

Fixes mozilla/DeepSpeech#827
vladimiroltean added a commit to ContainerDroid/termux-packages that referenced this issue Sep 13, 2017
* libgit2/libgit2#4343

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
@ilovezfs

I believe the Homebrew project upload their source archives to a third party.

@carlosmn No that is not the case. This is a very breaking change for us (Homebrew): Homebrew/homebrew-core#18044 and we already have a bunch of support issues coming in as a result.

@ethomson
Member

Right. It's completely critical that we be able to serve up checksums that are reliable from the time we release software and archive it until, well, forever.

As an end user, I simply don't care what Git does here. Ultimately this is a breaking change for GitHub. You're leaking the abstractions now.

I guess we're going to have to create our own archives now when we release to avoid this. But:

  1. We're but one tiny little library. This breaks a lot of people besides us.
  2. It's a lot more work for every release.
  3. Now we're going to have to pay to host these. I'll spin up an Azure CDN, but this is still another thing that I'm paying for out of pocket.
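
For readers wondering what "creating our own archives" could look like, here is a hedged sketch in Python. It is not libgit2's release process; the function name, the metadata choices (zeroed timestamps and owners, normalized modes), and the `prefix` parameter are all assumptions made for illustration. It skips empty directories, and the exact bytes may still vary between zlib versions, which is why generating the archive once and uploading it remains the safe option.

```python
import gzip
import io
import os
import tarfile


def build_reproducible_tar_gz(src_dir, prefix):
    """Return .tar.gz bytes that depend only on file paths, modes, and contents."""
    rel_paths = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            full = os.path.join(root, name)
            rel_paths.append(os.path.relpath(full, src_dir).replace(os.sep, "/"))

    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w", format=tarfile.GNU_FORMAT) as tar:
        # Sort so the member order does not depend on filesystem traversal order.
        for rel in sorted(rel_paths):
            full = os.path.join(src_dir, *rel.split("/"))
            info = tar.gettarinfo(full, arcname=f"{prefix}/{rel}")
            if info is None:  # unsupported file type (e.g. a socket)
                continue
            # Normalize metadata that varies between machines and runs.
            info.mtime = 0
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            info.mode = 0o755 if info.mode & 0o100 else 0o644
            if info.isreg():
                with open(full, "rb") as fh:
                    tar.addfile(info, fh)
            else:
                tar.addfile(info)

    # mtime=0 keeps the gzip header free of the current time.
    return gzip.compress(raw.getvalue(), compresslevel=9, mtime=0)
```

Calling it twice on the same tree, on the same machine, should give byte-identical output even if file timestamps change in between.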

@peff
Member

peff commented Sep 13, 2017

As an end user, I simply don't care what Git does here. Ultimately this is a breaking change for GitHub. You're leaking the abstractions now.

I'm not sure that's entirely true. We've gotten the opposite-direction bug reports, too: why does git archive produce different output than GitHub's tarball generator? But I think that's somewhat beside the point (in both directions). The issue isn't caused by using Git. It's caused by not promising byte stability for auto-generated tarballs.

Now we're going to have to pay to host these.

Is there a reason that uploading a tarball to the releases page isn't a good solution for the project? That is an extra step you have to take, but it's free. I absolutely agree that the extra step is a pain (for your project, and especially for projects like Homebrew which have to rely on the upstream projects to take the step). In an ideal world GitHub would automate that step away by caching the first tarball generated for any tag and keeping it forever, but in the meantime I think it's still possible to get what you want.

@ethomson
Member

Is there a reason that uploading a tarball to the releases page isn't a good solution for the project?

I still haven't gotten an answer: if I upload a .tar.gz will that disable the GitHub automatic generation of things?

@ilovezfs

I still haven't gotten an answer: if I upload a .tar.gz will that disable the GitHub automatic generation of things?

No it will not.

@carlosmn
Member

I still haven't gotten an answer: if I upload a .tar.gz will that disable the GitHub automatic generation of things?

Those are completely different things. The on-the-fly generation isn't about tags, it's the HTTP, better-cacheable version of git archive --remote, which accepts arbitrary inputs.

@ethomson
Member

Not trying to be an asshole: I literally don't know what anything you just said means.

My point is: if I make a release, upload a .tar.gz to it, and GitHub also makes a .tar.gz then how would anybody know which one to use? Which one is the canonical one?

IOW, why would I bother uploading anything to GH for a release?

@basepi

basepi commented Sep 15, 2017

They end up with different paths. Github's auto-generated release artifacts are under archive/. If you add additional release artifacts they go under releases/download/. Definitely identifiable which is which.

@basepi

basepi commented Sep 15, 2017

Here are some examples from hubble:

Auto-generated from Github: https://github.com/hubblestack/hubble/archive/v2.2.1.tar.gz

Our manually-generated package: https://github.com/hubblestack/hubble/releases/download/v2.2.1/hubblestack-2.2.1-1.el7.x86_64.rpm

Note the path differences. They're from the same release.

You definitely don't need to host yourself, but if Github won't guarantee that their release artifacts will checksum the same forever, you'll need to just download their artifact once, and then upload it as your own artifact.
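
The path difference described above can be checked mechanically. The following Python sketch classifies a github.com download URL by its path layout; the function name and return labels are made up for this example, and the layouts are inferred only from the two URLs quoted in this comment.

```python
from urllib.parse import urlparse


def classify_github_download(url):
    """Classify a github.com download URL by its path layout."""
    parts = urlparse(url).path.strip("/").split("/")
    # Assumed layout: <owner>/<repo>/archive/<ref>.tar.gz
    if len(parts) >= 4 and parts[2] == "archive":
        return "auto-generated"
    # Assumed layout: <owner>/<repo>/releases/download/<tag>/<file>
    if len(parts) >= 6 and parts[2:4] == ["releases", "download"]:
        return "uploaded"
    return "unknown"


print(classify_github_download(
    "https://github.com/hubblestack/hubble/archive/v2.2.1.tar.gz"))
print(classify_github_download(
    "https://github.com/hubblestack/hubble/releases/download/v2.2.1/"
    "hubblestack-2.2.1-1.el7.x86_64.rpm"))
```

A packaging tool could use a check like this to warn when a pinned checksum points at an auto-generated archive.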

@ethomson
Member

I didn't mean whether it's identifiable by the URL. I meant whether it's identifiable to the end user looking at the page. It's not clear to me why I would choose one .tar.gz over the other.

@vladimiroltean
Author

If by "end users" you mean people like me who package your software, then I'm fine with this proposed workaround. I see no issue with that, especially if you expressly specify in the release notes to pick the pre-packed .tar.gz instead of the generic "Source code (tar.gz)" to get the consistent checksums. You could even spell out the SHA256 yourself as part of the release notes.
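
As a sketch of what spelling out the SHA256 could involve, the Python below produces a line in the `<hex>  <name>` style that `sha256sum` prints, and checks a local file against such a line. The function names are hypothetical, and nothing here reflects an existing libgit2 release script.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_line(path, display_name):
    """Format a line for release notes: '<hex>  <name>'."""
    return f"{sha256_file(path)}  {display_name}"


def matches_published(path, published_line):
    """Compare a local file against a published '<hex>  <name>' line."""
    expected = published_line.split()[0].lower()
    return sha256_file(path) == expected
```

A maintainer would run `checksum_line` once on the uploaded archive and paste the result into the release notes; packagers could then call `matches_published` on their download.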

@ruslo

ruslo commented Sep 15, 2017

You could even spell out the SHA256 yourself as part of the release notes.

If GitHub will provide hash sums for one *.tar.gz file and will not provide for another, then it will be clear what is stable and what is not.

@daira

daira commented Sep 15, 2017

How about having a project option to disable the auto-generated tarballs, to avoid confusing/misleading end-users? The last time I checked there wasn't any way to do that. I know this has previously caused support problems for some projects, e.g. Tahoe-LAFS.

@ghost

ghost commented Oct 1, 2017

The checksum also changed for 0.25.1. How about moving away from the proprietary github platform?

@ethomson
Member

ethomson commented Oct 1, 2017

The checksum also changed for 0.25.1. How about moving away from the proprietary github platform?

No.

@ilovezfs

ilovezfs commented Oct 1, 2017

The checksum also changed for 0.25.1. How about moving away from the proprietary github platform?

The proximate cause of the checksum changes is that Git was upgraded on the GitHub backend. Whether GitHub itself is "proprietary" is not exactly … relevant.

@Apteryks

Apteryks commented Oct 1, 2017

@ilovezfs the proximate cause of the checksum changes is that Github regenerates archives on the fly instead of pegging them to the tag/commit. Why they would do this I don't understand (it's wasting resources and causing issues like this one). I didn't find the GitHub project where to report that issue; oh, wait, they don't have one!

@ilovezfs

ilovezfs commented Oct 1, 2017

@Apteryks yup Homebrew/homebrew-core#18044 (comment).

@ethomson
Member

ethomson commented Oct 1, 2017

Indeed. I'm going to leave this open until GitHub does one of the following:

  1. Supporting actually reproducible release downloads that are automatically generated.
  2. Clarifying the UI so that when we upload our own release archives, the automatically generated ones are removed.
  3. Allowing us to remove the Releases tab entirely, so that we can point people to our own stable downloads.

Obviously, though, no, we are not going to move off GitHub, one of the two biggest supporters of the project. That someone would be so rude as to suggest such a thing truly boggles my mind.

@ilovezfs

ilovezfs commented Oct 1, 2017

  1. Supporting actually reproducible release downloads that are automatically generated.

Hopefully this.

@lfam

lfam commented Oct 1, 2017

As a downstream packager of libgit2 and many other programs, I can say that people in my position can tell the difference between the snapshots that are automatically generated by GitHub per-tag, and the release archives (tar or zip) that each project's maintainer uploads and distributes via the release page.

When you first visit GitHub as a beginner, it can be confusing to know which one to download, that's true. But you quickly learn that the generically named "Source code" download is something that the project's maintainers can't disable and don't recommend to be used.

In many cases, the "real" release archive includes things beyond what's checked in to revision control, such as generated build scripts (autotools etc), extra documentation, test data, and so on.

So, my project has a policy to always prefer the "real" release over the auto-generated snapshot, for the reasons listed above, as well as the bit-reproducibility problem that motivated this bug report.

@htgoebel

htgoebel commented Oct 9, 2017

@ethomson re #4343 (comment)

I'm afraid of sounding like a smart-arse, nevertheless:

  1. Supporting actually reproducible release downloads that are automatically generated.

This is already possible using Travis CI whenever a new tag is pushed, as documented here (scroll down to "Uploading Multiple Files"). The deploy stage creates the archives (using a before_deploy command) and Travis attaches them to the release.

If you prefer the archives to be uploaded by a developer as part of the release process, a tool like ghrelease can use the GitHub release API. This even allows you to sign the archives.

  2. Clarifying the UI so that when we upload our own release archives, the automatically generated ones are removed.

And IMHO there is no need to remove the automatically generated ones (and AFAIK this is not possible). What matters for projects is having some reproducible source; it does not matter if other, non-deterministic sources exist as well. You may want to have a look at how PyInstaller handles this: we simply attach the archives and the PGP signatures to the release (example).

  3. Allowing us to remove the Releases tab entirely, so that we can point people to our own stable downloads.

IMHO there is no need for this as long as deterministic archives are available.

@htgoebel

htgoebel commented Oct 9, 2017

Off-topic:

Obviously, though, no, we are not going to move off GitHub, one of the two biggest supporters of the project. That someone would be so rude as to suggest such a thing truly boggles my mind

Well, Gitlab has an integrated CI/CD, which has some advantages over the separate GitHub–Travis-CI solution.

@ethomson
Member

ethomson commented Oct 9, 2017

Well, Gitlab has an integrated CI/CD, which has some advantages over the separate GitHub–Travis-CI solution.

So when I said "damn that's rude" you decided to double down on it?

@libgit2 libgit2 locked and limited conversation to collaborators Oct 9, 2017
@ethomson
Member

ethomson commented Oct 9, 2017

Like I said, I am keeping this open since it is an actual issue that hasn't yet been resolved, but obviously there's no need for further discussion on it, so it's now been locked.
