GitHub archive hash stability #46034
-
(Edit 2023-02-21) A blog post with our go-forward plan is now live. Doc updates will come soon. Thank you all for your comments and insights.
Hey folks, I'm the product manager for Git at GitHub. On January 30, 2023, we deployed a change from upstream Git which changed the compression library used to generate the automatic source archives†, altering their checksums. I have a pretty good working knowledge of what you're likely using these hashes for. However, it's a lot more powerful if I can use your words and insights directly when I'm influencing changes at GitHub. So forgive me if some of these questions seem a little elementary — I'm trying to channel my "beginner's mind" and take nothing for granted. Can folks provide input on the following: which community or project you represent, whether you rely on the exact bytes of these archives†, whether you were affected by the change, and whether this has changed your plans going forward?
†These are the auto-generated tarballs/zipballs on the "Releases" page which don't have a filesize and say "Source code (zip)" or "Source code (tar.gz)". They have URLs of the form `https://github.com/<owner>/<repo>/archive/...`.

Being fully transparent: we had not intended to hold these source archives stable (in their byte layout / hash) forever. They're generated by (essentially) running `git archive` on the repository, on the fly.

Intent aside, we've sent mixed signals to the community over the years. This is not the first time we've rolled back a change that would have altered the hashes. If we say "these aren't stable" but then spend a half decade keeping them stable, that's confusing.

This comment is a correct rendering of what a GitHub employee communicated in a support ticket. I've reviewed the entire ticket myself internally, and we (GitHub) had a communication breakdown between what engineering intended and what ultimately got shared with the customer. The names were essentially correct ("Repository release archives" are stable while "Repository code download archives" are not), but then we mistakenly put the auto-generated code download archives in the wrong category.

The above ☝️ is not me telling you our plan going forward. It's an attempt at clarifying why this is even a topic for discussion based on what's happened in the past. This very discussion will help GitHub define, commit to, communicate, and ultimately execute on a plan for the future. That plan will include which hashes are stable and which are not (if any), including permanent documentation. Please be civil, and don't get me punished for (or make me regret 😅) being this open.
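To make the failure mode behind all of this concrete, here is a minimal illustration (not GitHub's actual pipeline, which essentially runs `git archive` server-side): two different gzip encoders can produce equally valid archives of the same payload whose bytes, and therefore checksums, differ, even though the decompressed contents are identical.

```python
import gzip, hashlib, zlib

payload = b"identical source contents\n" * 1000  # stand-in for a `git archive` tar stream

# Two different-but-valid gzip encodings of the same payload:
a = gzip.compress(payload, compresslevel=9, mtime=0)
compressor = zlib.compressobj(level=1, wbits=31)           # wbits=31 -> gzip framing
b = compressor.compress(payload) + compressor.flush()

print(hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest())  # False: archive hashes differ
print(gzip.decompress(a) == gzip.decompress(b))                        # True: contents are identical
```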
-
Related to this, I started a discussion on the 30th on the development list for git itself (not github of course): https://public-inbox.org/git/df7b0b43-efa2-ea04-dc5b-9515e7f1d86f@gmail.com/T/ The objective is, w.r.t. the stability of `git archive` output, to discuss ways for upstream git to offer an actual official guarantee of stability. This could end up meaning that github, in the long run, feels confident documenting that these source archives shall be stable.
-
I am a core committer at @mesonbuild, a modern cross-platform build system for multi-language software, in this case the https://wrapdb.mesonbuild.com component (https://github.com/mesonbuild/wrapdb).
We depend on the exact bytes of source downloads from github, for some projects. For other projects, we depend on the exact bytes of manually uploaded release artifacts, occasionally uploaded to github releases, occasionally uploaded to third-party file storage. We need the exact bytes, because we hash files for security purposes before handing them off to any build tooling.
We were affected by the change. It resulted in a temporary outage for users, one of whom reported a bug against a specific source download. This led to confusion, and one person submitted a bug report to the project whose source download we were using, asking what was up. We have no plans to change our thinking/policies/tools -- we are beholden to whether projects themselves manually upload artifacts, and most do not. In an ideal world, projects would have formalized release management processes including dedicated release artifacts, cryptographic authorship signatures (PGP), various odd extras such as precompiled manpages or generated autotools configure scripts, etc. etc. etc... but that's a lot of work, and GitHub source downloads fill that gap with essentially zero effort on the upstream project's part.
-
I am part of the EasyBuild community, and team lead for research support for the Digital Research Alliance of Canada (the Alliance, for short).
The Alliance supports researchers from across Canada using digital research infrastructure, such as HPC clusters. As part of this function, we install research software on our infrastructure. Since a sine qua non condition of research is reproducibility, we require all scientific software installations to be scripted and reproducible. Since we have limited resources, performance of the compiled applications is key, and therefore we compile almost everything from source with optimizations. To do so, we use EasyBuild, and we require every package installed to have its checksum recorded and verified. This ensures that the code installed today is the same code that will be reinstalled (if need be) two years from now. In an ideal world, every single software package would actually do proper releases, with semantic versioning, release notes, their own tarballs, and their own published checksums. Unfortunately, in the world of research code, that is not going to happen. We therefore need a way to at least know that the code has not changed.
Yes, we were affected. Code which was compiling in a development environment a few hours earlier ended up not compiling when building in a pre-production environment a few hours later, because checksum verification had failed. That raised a lot of questions and made us wonder if we were victims of a supply chain/MitM attack.
It is in fact not critical that checksums don't change. What is critical is that the end user can easily validate that the tarball being downloaded corresponds to the precise tag/commit/version of the code that is intended to be downloaded. It would in fact be better if there were an API/URL that we could call to verify the expected checksum of the file being downloaded. At the moment, the best we can do is download the file, compute its checksum, and if we download it again later, verify that it has not changed since the version we tested before. Even better would be to store and report all historical checksums of a given file, so that one can test whether the tarball downloaded a month ago corresponds to at least one of the checksums that were recorded back then.
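As a rough sketch of the workflow described above (a generic illustration, not the Alliance's or EasyBuild's actual tooling; the `checksums.json` path is a placeholder):

```python
import hashlib
import json
import pathlib
import urllib.request

RECORDS = pathlib.Path("checksums.json")   # placeholder location for recorded hashes

def record_or_verify(url: str) -> bytes:
    """Trust-on-first-download: record the hash the first time, verify it afterwards."""
    known = json.loads(RECORDS.read_text()) if RECORDS.exists() else {}
    with urllib.request.urlopen(url) as response:
        data = response.read()
    digest = hashlib.sha256(data).hexdigest()
    if url not in known:
        known[url] = digest                      # first download: record the checksum
        RECORDS.write_text(json.dumps(known, indent=2))
    elif known[url] != digest:
        raise RuntimeError(f"{url}: expected {known[url]}, got {digest}")
    return data
```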
-
Hi there, @vtbassmatt !
I represent vcpkg, https://github.com/microsoft/vcpkg (and https://github.com/microsoft/vcpkg-tool ), which is a source based package management system for C++. We, the Visual C++ team, did NuGet for C++ back in 2012/2013, and that effort did not work: it turns out doing binary deployment for an ecosystem that tends not to care about ABI stability required an impossible cross product of prebuilt bits to know that the resulting package would work. Instead, we built vcpkg which aims to have the same goals (a collection of libraries that can work together) but operates on sources, so people can rebuild their dependencies if necessary. Along with @JavierMatosD @vicroms @ras0219-msft @dan-shaw @markle11m and @AugP, I am one of the core maintainers and implementers.
We rely on the precise bytes in order to deliver verifiable "no changes means no changes". If you have a vcpkg distribution and install the same bits again 5 years later, we consider it an important guarantee that you do in fact get the same bits you built before. It is somewhat common for folks to change the ref to which a tag is attached, for instance, and we want to be able to detect this. Moreover, we also have, for continuity-of-business or supply chain audit reasons, an asset caching feature that can optionally sit between the build environment and all external systems like github, and which caches all sources that are fetched. The interface of that is user replaceable and effectively amounts to "given this URL and this expected hash, produce the file at this path" (a sketch follows below), because different businesses wanted to be able to supply their own caching mechanisms to which we, the vcpkg team, do not have access or experience. We also prefer to download tarballs rather than perform full git clones.
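A minimal sketch of what such a user-replaceable caching hook could look like (purely illustrative; the cache location and function shape here are hypothetical, not vcpkg's actual asset-cache interface):

```python
import hashlib
import pathlib
import shutil
import urllib.request

CACHE_DIR = pathlib.Path("/srv/asset-cache")   # hypothetical internal mirror location

def fetch_asset(url: str, expected_sha512: str, destination: pathlib.Path) -> None:
    """Serve an asset from the local cache if present, otherwise download and cache it.

    The asset is stored under its hash, so a changed upstream archive can never
    silently replace the bytes an earlier build was verified against.
    """
    cached = CACHE_DIR / expected_sha512
    if not cached.exists():
        with urllib.request.urlopen(url) as response:
            data = response.read()
        if hashlib.sha512(data).hexdigest() != expected_sha512:
            raise RuntimeError(f"hash mismatch downloading {url}")
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        cached.write_bytes(data)
    shutil.copyfile(cached, destination)
```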
We were extremely affected. Of approximately 2000 ports in our catalog, approximately 1500 of them became un-installable if one did not have the sources in their asset cache. Our CI system has an asset cache turned on, so that kept working, but many customers immediately reported problems. See microsoft/vcpkg#29288 plus about 14 duplicates filed in those 12 hours despite pinning that issue. Moreover, even if we updated the SHA to the new value, all existing versions of ports available in the system whose source was hosted on GitHub would have effectively become broken forever. (We have fairly vague plans for how we would insert a version in the 'middle', but we never expected to need to fix ~1500 at the same time.) We experienced similar breakage before when the "pax_global_header" thing was changed, but that was tens of relatively obscure ports that changed, not the entire catalog. I spent most of that day trying to author changes to update all 1500 ports before the revert was announced.
This shows that we are going to need to build some sort of feature that allows hashes to be replaced without invalidating versions. We are also likely to consider fast-tracking proposals that move source references out of the individual port files. I did some experiments comparing archive downloads against git-based fetches, though I don't know whether the negotiation time cost is affected by the number of files or entities in the repos. As a result, we may also reconsider our stance of preferring GitHub source archives and prefer commit SHAs instead, if we can come up with an effective asset caching strategy.
Taking off my vcpkg maintainer hat and speaking personally, I don't 'fault' GitHub for anything that happened here. A change was made to git, you updated git, and this happened. I do have one thing to fault GitHub for, though, which is that the UI for the source bundles makes them look like any other asset on a release page. It is highly surprising that everything else in the assets block is "safe" but the last two entries can randomly change on you. I would really like a statement that, modulo bugs, the formats will be made reproducible, but I understand there are serious technical challenges in delivering that.
If, as a result, you can't guarantee reproducibility, even for named releases like this, I would really like to see the non-reproducible source archives set apart in some way to make clear that they are different in character from the other assets. Thanks to @Neumann-A for letting me know that this thread exists.
-
As a downstream consumer, our build processes across our entire organization were down for upwards of 8 hours, because upstream parts of the Bazel ecosystem rely on these archives.
You've effectively hit it on the head -- the hashes must remain stable, absent a well-communicated, long notice period before any future breaking change is rolled out. And to be quite frank, the reasonable expectation (as they are release/tag artifacts) is that the checksum has always been and should always remain stable. It's supposed to be a stable artifact, so why should the hash change? Granted, the behind-the-scenes technical reasons are in conflict with that expectation. And the problem may be that this mechanism is being used in lieu of proper use of some other package publication system. But large ecosystems are in fact relying upon it.
-
Thanks for opening this discussion, and for the communication throughout the incident.
I'm a core developer of BinaryBuilder (repository on GitHub), a building framework mainly used in the Julia ecosystem, and the main maintainer of Yggdrasil, the largest collection of build recipes for this framework (over one thousand recipes, about 200 of which use GitHub-generated archives).
BinaryBuilder strives to have reproducible builds following well-established practices. With a given build recipe, all users using the same version of BinaryBuilder should be able to produce a bit-by-bit identical output tarball for a given target platform. Build recipes specify different types of sources to build packages, including compressed archives (tarballs, zipballs, etc.) or generic files. Archives and generic files have an accompanying SHA256 checksum, to ensure integrity and security of the download. An unstable checksum makes a build recipe simply not work, as we can't verify the download. Having unstable hashes means that archives generated automatically by GitHub are unreliable, so the recipes we maintain in Yggdrasil can't use them, because they could become unreproducible at a later point in time.
The immediate impact for the BinaryBuilder ecosystem was limited; we discovered the change by following the discussions in other packaging ecosystems. We are now banning the use of automatically generated GitHub archives in new build recipes.
As a JuliaLang/julia committer I'd also like to point out that some GitHub archives, with checksums, are referenced in Julia's build system: JuliaLang/julia#48466. Should past checksums become invalid, it won't be possible to compile older versions of Julia fully from source. As a maintainer of a packaging ecosystem, I'd like clarity on the stability policy (most information I found about this was in some GitHub issues or on Twitter), and broader notice to inform downstream projects of changes that may affect them, giving them time to take appropriate measures.
-
As a consumer, I was disappointed (while also very much understanding why) that these package systems experienced the disruption they did. We should all want the latest and greatest compression methods so long as we still get our bytes back at the end. This is why it saddened me to see that many of these systems were using the hash of the tarball rather than the hash of the file contents. I hope this change is the push they need to switch over their hash calculation.
-
As a consumer of Conan.io's Conan Center recipes, I can vouch that this change affected our organization mildly. Conan relies on sha256 checksums manually entered into each recipe for each version of the package being built. We hit the issue early when attempting to build one of these packages from source. I've started a conversation in the Conan project about future mitigation plans if something like this were to occur again, but at the moment it would require a lot of manual work, similar to what other projects in this thread have described. This just further illustrates how fragile the web ecosystem really is.
-
@vtbassmatt Thanks for asking for feedback, it's greatly appreciated!
I'm a co-maintainer of the Buildroot build system, which helps in building Linux-based firmware for embedded devices (in the spirit of OpenEmbedded/Yocto, OpenWRT...).
Buildroot does depend on the exact bytes of the tarballs that are generated by GitHub. We use their hashes to ensure that the archives that get downloaded are pristine, to avoid various attacks (like MITM). Our trust model (if we can call it that) is based on TOFU: the first time we get a tarball, we assume it is pristine and hash its content. Later downloads must match that hash; if they don't, it is a security risk (either the new archive is malicious, or the original one was).
We were affected in two ways. First, our autobuilder farm started to notice download issues. Our autobuilders continuously run random configurations (amounting to several hundred to a thousand a day), each building a lot of packages downloaded from various locations, with quite a substantial part being on GitHub; we do have a local cache of archives to avoid redownloading the same archives over and over, but that cache is partially pruned (by just 5 archives at a time!) at the start of each run to check that the upstream is still reachable and that the archives have not changed. We had not noticed those failures yet, however, as we only get a daily report of them. In the meantime...

Second, when reviewing submissions, the maintainers started to notice some discrepancy between the hashes in the submissions and those that they were getting locally. At around the same time, some users started reporting download failures due to hash mismatches. After a bit of back-and-forth, we pinpointed it to the autogenerated archives from GitHub. Thank you again for asking for our feedback; that is much appreciated! :-) Regards,
-
I am a user of the Bazel build system, but not a developer of Bazel itself. (Specifically I work on workerd.)
workerd has dependencies on several other projects on GitHub. I'll focus on Cap'n Proto as an interesting case, but this applies to other dependencies as well. In our Bazel WORKSPACE file, we instruct it to download Cap'n Proto using the auto-generated archive for a specific commit. Our team also owns and maintains Cap'n Proto itself, and it is very common that we make a change in Cap'n Proto which we immediately want to use in workerd. As a result, it is impractical for us to rely on release tarballs from Cap'n Proto, as this would mean that any time someone needed to make a Cap'n Proto change, they would be saddled with doing a release before they could use the change in workerd. (Cap'n Proto's release process involves doing extended testing on many different platforms that are not relevant to workerd, and it's quite common for some of the exotic platforms to be broken at Cap'n Proto's git head.) Instead of downloading a tarball, we could instruct Bazel to perform a git clone of the dependency instead. However, in general, this performs much worse, as it downloads history that won't be used, and I imagine it doesn't benefit as much from caching. (In theory shallow clones can be used to avoid downloading more than needed, but they are surprisingly painful to configure and keep updated.) I would imagine that GitHub would actually prefer that CI builds and whatnot download the extremely-cacheable tarballs rather than perform git clones, though if that's not the case, that would be interesting to know!
Our CI builds were broken for several hours. We are considering setting up our own cache for these artifacts which caches them permanently, so that once the bytes have been fetched, they won't change. This will take some effort, though, that we'd love to avoid. It's also not clear how community members outside our core team would interact with the cache securely.
Googlesource randomizes bytes in their tarballs exactly to prevent people from depending on checksums being stable. As a result, we have been forced to use the git-clone approach for these, but we generally don't like doing so (see above). It seems to me that GitHub is stuck keeping hashes consistent for all commits that currently exist. There's just no way to get away with changing them without breaking a huge number of builds. Particularly intractable is the problem of transitive dependencies -- if I depend on foo, and foo depends on bar, and all the hashes change, I am dead in the water until foo updates their build (which they may never do) or I bite the bullet and fork foo (which I really don't want to do). But it's still possible to make changes that apply only to tarballs of future commits, if the change is rolled out before the relevant date cutoff. Not sure if it's worth it, but it seems like an option. Ideally Bazel would support specifying a hash of the (canonicalized, somehow) unpacked content rather than a hash of the compressed tarball.
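For what it's worth, one possible shape of such a canonicalized content hash is sketched below (an illustration only, not an existing Bazel feature; real designs such as Nix's NAR serialization make more careful choices about metadata). The idea is to hash the unpacked entries in sorted order, covering names and file bytes but not timestamps or the compression layer.

```python
import hashlib
import tarfile

def content_hash(archive_path: str) -> str:
    """Hash an archive's unpacked contents in a canonical order.

    Only entry names, types, and file bytes are hashed; timestamps, owners,
    and the compression layer are ignored, so re-compressing the same tree
    yields the same digest.
    """
    digest = hashlib.sha256()
    with tarfile.open(archive_path, mode="r:*") as tar:
        for member in sorted(tar.getmembers(), key=lambda m: m.name):
            digest.update(member.name.encode("utf-8") + b"\0")
            digest.update(member.type)
            if member.isfile():
                digest.update(tar.extractfile(member).read())
    return digest.hexdigest()
```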
-
@vtbassmatt: Thanks for all your work on this, for being so open, and for considering the input of packaging communities!
I'm the original developer and lead for @spack. Spack is AFAIK the most widely used package manager for high performance computing (HPC). It's used broadly in the worldwide HPC community, including U.S. and other national laboratories. It’s known to a lesser extent in the C++, Python, and scientific computing/ML communities. The project has over 1,100 contributors (over its entire lifetime). I'm not part of a formal security committee, but as part of building out infrastructure for Spack, we have to make a lot of security decisions around CI and open source -- both in cloud CI and in more sensitive environments that consume a lot of open source. We have to consider mirroring to air-gapped environments in addition to environments with access to the internet.
Spack builds packages from sources, including source archives, binaries, patches, and other artifact downloads. We rely on stable checksums of those downloads for security and reproducibility. The contribution model in Spack is that contributors add new versions (with their download URLs and checksums) to package recipes, and maintainers review and approve those changes.
Our security model is essentially that users must trust the project maintainers. Users should only be expanding archives or building source that maintainers approved. Likewise, if another user gave them a package recipe, they trust the checksums in that recipe because they trust that user. So, before building from a package recipe, we verify the downloaded sources against the checksums recorded in that recipe.
We also allow contributors to specify source to check out by git commit, but we prefer SHA-256 checksums on archives for a few reasons.
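For readers unfamiliar with Spack, a package recipe pins each version to an archive checksum roughly like this (an illustrative fragment with placeholder names and hashes, not a real package):

```python
# Illustrative Spack-style package recipe; names and hashes below are placeholders.
from spack.package import *

class Example(Package):
    """Example package built from a GitHub-generated source archive."""

    homepage = "https://github.com/example/example"
    url = "https://github.com/example/example/archive/refs/tags/v1.2.0.tar.gz"

    # Each version is pinned to the SHA-256 of the archive a contributor
    # downloaded and verified; the build refuses to proceed if the bytes change.
    version("1.2.0", sha256="0000000000000000000000000000000000000000000000000000000000000000")
    version("1.1.0", sha256="1111111111111111111111111111111111111111111111111111111111111111")

    def install(self, spec, prefix):
        make()
        make("install")
```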
Yes.
We noticed the change because users started seeing checksum failures in their downloads from GitHub. We got a number of complaints on GitHub and in Slack about hashes no longer being valid for certain projects. We maintain a mirror of (nearly) all sources in all packages at https://mirror.spack.io. This is an AWS CloudFront distribution of a big S3 bucket that we update regularly, but not immediately, after contributors add new versions to packages. Thankfully, the mirror mitigated many of the issues and users were, for the most part, not dead in the water due to this change. We currently have a cronjob that updates the mirror once a week or so, so the problem surfaced when users hit download errors for packages that were not yet in the mirror. Spack also caches downloads locally, which would have mitigated the problem further for people rebuilding things (this happens a lot in Spack, e.g., to build with different flags or options).
Some contributors began to submit PRs with the new hashes, and we began to think about how we would re-hash all the archives in Spack. There are around 7,500 GitHub-generated archive URLs with recorded checksums in Spack today. The number of packages and archives in Spack grows a bit every year, so this burden will increase a lot over time. Fortunately, since the change was reverted, we did not have to do the big re-hash, and we hope that we will not have to. We did have to revert the commits from contributors who tried to get on top of this issue early, which is not the greatest incentive for folks who try hard to keep their packages up to date. We have had to re-hash all of Spack in the past — specifically the last time GitHub did this, in 2017 or so. We have been stable since then. There were only 728 tarballs to deal with then, so we’ve grown by more than 10x. You can find the issue where we first discussed this in our issue tracker, along with links to other affected communities.
It has made us think about building reusable tooling for doing hash changes, and it has led us to consider what else we could be doing besides relying on GitHub archive URLs.
So, unfortunately, there are not a lot of good options other than archive URLs. We have to consume what upstream projects provide, and we cannot practically push 7,500 upstreams to do the “right” thing, and we’re not going to compromise on supply chain security, especially since our project is used to build open source in air-gapped environments.
-
Core committer/contributor of EasyBuild and co-maintainer of a national HPC system in Dresden, Germany.
We use EasyBuild at the HPC center to install software. Some packages are downloaded from GitHub via the release URLs. The whole download-and-install process is automated and tested by different people. Only then is a "recipe" (for the automation) included in an EasyBuild release, which is then used to install software on the HPC system using privileged accounts.
Yes: when installing software for which a cached tarball is not available (e.g. when using a recipe/software that was never installed on that system before), the verification step after the download from GitHub failed. This has happened before in other distribution systems too (CRAN checksums of R packages may change, maintainers of projects on GitHub reuse tags, re-releases on PyPI use the same version, ...), so we have a way to specify alternative checksums after verifying that the contents are still the same. But that is cumbersome to do at large scale, and it requires access to the "old" version to check for such differences, which may be impossible. An alternative we are considering is the "NAR checksums" used by Meson, which checksum the extracted contents rather than the archive itself and hence are immune to changes in the compression algorithm used. However, this at least introduces additional complexity.
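A minimal sketch of the "alternative checksums" idea described above, assuming the recipe simply carries every checksum that has been manually verified to correspond to the same contents (a generic illustration, not EasyBuild's actual implementation):

```python
import hashlib

def verify_download(data: bytes, accepted_sha256: list[str]) -> None:
    """Accept a download if it matches any previously verified checksum.

    Every entry in accepted_sha256 must have been checked by a human to
    correspond to identical extracted contents before being added.
    """
    digest = hashlib.sha256(data).hexdigest()
    if digest not in accepted_sha256:
        raise ValueError(
            f"checksum {digest} matches none of the {len(accepted_sha256)} accepted values"
        )
```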
-
Lead maintainer/core committer of the Ada package manager Alire.
No, because we researched the issue when we were deciding how to distribute our tarballs, and found that there was no guarantee on their immutability. However, this would have been our preferred way to distribute tarballs otherwise, and having to upload alternate files to releases introduces some unneeded duplication and complicates the publishing workflow when tarballs are involved (we have alternatives using git commits).
Not that we have found, although some contributors may have disregarded our guidelines for providing source archives and used the ones from GitHub anyway. We have no reports about this yet though.
In the beginning we assumed these files (in the case of releases) would be generated once and thus immutable, until we found otherwise. Just so you know what our initial frame of mind was.
-
First of all, thank you for the clear and honest communication! It's really refreshing to see :)
I work on Bazel (https://bazel.build/), specifically on external dependencies. I'm on the core Bazel team. (cc other team member @meteorcloudy, major community contributor @fmeum, and security committee member @mihaimaruseac)
Bazel itself just asks for a URL and a checksum, and so far a project has had to provide URLs and checksums for its dependencies. The community at large has often used the source downloads (instead of user-uploaded archives) as the URLs, and has thus formed a direct dependency on the checksums being stable. Furthermore, the new Bazel Central Registry (https://bcr.bazel.build/) stores these URLs and checksums centrally (similar to vcpkg).
Yes -- multiple users reported CI breakages. This was especially painful because we had recommended that users rely on these archive downloads in the first place.
Most points have been eloquently explained by other posts in this discussion, so I won't rehash them. I'd just like to be looped into further discussions and announcements. Thanks again!
-
@vtbassmatt Thank you very much for soliciting feedback from the broader community in an open fashion like this, much appreciated!
I am the lead developer/BDFL of EasyBuild, a tool that facilitates the installation of (scientific) software packages (usually from source) on High-Performance Computing (HPC) systems, a.k.a. supercomputers. We have no formal security committee in EasyBuild, which is largely community/volunteer driven, but we try to do what we can, which includes making EasyBuild verify a SHA256 checksum of every source file it downloads/uses before unpacking it.
EasyBuild maintains a local "source cache", so it only downloads a file once (only when it's not in the cache yet). We record a SHA256 checksum for every file that EasyBuild downloads/uses, and verify the checksum before unpacking/using that file. The reason for this is multi-fold: not only is it a security measure, but it also prevents accidentally using incorrect source files (a corrupt file due to a partial download, a source tarball that somehow differs from the one the installation was tested with, etc.).
On Jan 30th, several people started reporting that EasyBuild was producing checksum errors for files it was downloading. This wasn't our first changing-checksum rodeo - this has sadly become an almost weekly occurrence, albeit for a variety of different reasons: re-releases in CRAN (see here or here), release source tarballs in GitHub (uploaded by project maintainers, so not created on-the-fly by GitHub) being changed by the project (see here), or also in other places (see here). This led us to adding support in EasyBuild for listing not just one but multiple valid checksums - if we know that a source tarball was changed in place but with the exact same (code) contents, we accept both the old and the new one.
We have worked through changing source tarballs served by GitHub before, back in Sept'17 for example. Although the effort required to mitigate the "broken" checksums was relatively limited back then, it certainly would be a lot bigger today: while we had roughly 2,000 unique software "titles" (not including different versions) supported in Sept'17, we have about triple that today, and I'm confident that we're downloading a lot more stuff from GitHub today than we were 5 years ago (I don't have the numbers readily available on that, but I can figure that out if it would be useful). Moreover, it would render existing EasyBuild releases (which include the SHA256 checksums for all source files EasyBuild knows about) largely useless, since they would mostly become a game of Russian roulette w.r.t. verifying checksums, and we would need to scramble to update the affected checksums and get fixed releases out to users.
I can guarantee that this whole ordeal on Jan 30th 2023 caused a lot of confusion all around the world, and will have an impact for days/weeks/months to come - people will have downloaded tarballs during the couple of hours that the change was in effect, used them to compute a SHA256 checksum, and included that now-incorrect checksum in a contribution to EasyBuild (we get about 2,500 contributions of this type each year...); others will then report getting a different checksum, leading to another wild goose chase to figure out how on earth that's possible (cosmic rays, maybe?!), etc. TLDR: although it's clear that the impact of this situation was unintended, it has most definitely made a lot of package managers cry.
-
Thanks a lot for this discussion. It's really cool to see that GitHub cares about its users, including those with not-exactly-mainstream requirements.
I'm the creator of rules_ll, an upstream Clang/LLVM based toolchain for C++ and heterogeneous programming.
We use Bazel, so in practice things broke for similar reasons others have already mentioned. Even without Bazel I think things would've broken for us, though. We download the llvm-project at specific commits, overlay custom build files, build it, and then make that toolchain available to end users. We also pull in various smaller dependencies, sometimes at release tags, often at intermediary commits that contain e.g. bug fixes for compatibility with upstream Clang. Such smaller repos are e.g. GitHub-hosted tooling from Nvidia and AMD. So we have several dependencies where we manually, explicitly track commits of git repos and the hashes of their tar archives. We do this to propagate features with minimal delay after their original commit time, which can be months before "stable" releases, so there is no way for us to depend on such stable releases. Building an entire C++ toolchain as part of a build is also quite the CPU-cycle-consuming task. To work around this we depend on caching, which becomes more effective with stronger reproducibility guarantees.
Yes. Our infrastructure was down. We noticed this because we flushed our cache. Our project was unusable and having to change hashes blocked all other development. Initially we were worried about a security breach, but thanks to your quick response in the original discussion we quickly noticed that things were fine. However, we were also quite unsure how to handle updating these hashes. Adjusting them is not too much work, but also not completely negligible. Without knowing how frequently the hashes would change, we were unsure whether we needed just a quick patchup or a completely different way to handle this part of our project. The incident caused us to consider building separate fallback mirroring infrastructure. Maybe unintuitively, this has caused us to increase our efforts in relying on hashes, pinned versions, encapsulation etc. We were actually very happy that we were able to notice the change in the supply chain at all. Now we want this awareness for the few remaining parts we don't have control over yet. In my case this means that I'm now working on encapsulating our project in Nix environments so that we have a reproducible host toolchain to build our actual bazel toolchain. Toolchainception!
Similar to some others in this discussion, it doesn't actually matter to us if hashes change. If we know they'll change frequently, we can build infrastructure to support this. If it's something like an update of git that causes an occasional one-off change, we can absorb that as well. We don't rely on hashes because we never want inputs to change, but because we need to know when they do change. Holding back updates like potentially more efficient compression is the last thing we want. We use these kinds of tools because they enable frequent, potentially aggressive change and provide stability in the form of "rollbackability", not in the form of "we never change anything".
-
This is the perspective of a small open source project.
The superbuild project relies on precise bytes from the packages' source tarballs. The OBS builds occasionally include packaging of (particular versions of) dependencies.
The AZP and OBS builds weren't directly affected, because we didn't happen to have any rebuilds at the time.
Looking at GH releases from the perspective of a software author, it seems fair to me to assume that the tarballs associated with a particular "Release" are handled as precise bytes, similar to the assets uploaded to that "Release", not as an on-the-fly service. It seems redundant to upload a source tarball as an asset when GH has all the information to create this asset, and even appears to provide it as an asset already. In addition, my GH Release workflow makes use of "Drafts". Basically, I want the precise source tarball bits to be frozen when I create a draft release. Then I can create the binary assets from this specific tarball (by updating superbuild and OBS recipes), and publish the "Release" when the assets are validated and uploaded.
-
I represent Conan Center (https://github.com/conan-io/conan-center-index). Conan Center provides C and C++ packages to be used with the Conan package manager. Users have an option to download pre-built binaries provided by us (we currently support a high number of operating systems, architectures and compilers), or alternatively to build them from source from a build recipe we provide.
Whenever our packages are built from source, either by our own CI service or by users, we compare the checksum of the downloaded sources against what was expected for the particular version of a library or package. Not only is traceability important for us and our users (reproducible builds), but the complexities of C++ also mean that even seemingly innocuous changes can alter the observed behavior either at compile time or at runtime. We are also aware that this is important for our enterprise users, in particular those that work in industries with stringent traceability requirements, such as aerospace, automotive and medical.
We were affected by this change, but the impact for our users was limited and indeed mitigated by GitHub’s decision to revert the change. A great number of our users will download binaries provided by us, and as such, do not build libraries from source and were not impacted by this change. For example, we provide 104 compiled binaries for the popular Protobuf library, covering a number of compiler, architecture, and operating system combinations.
We already had a policy of advising recipe authors and contributors to prefer published artifacts from releases, as those are guaranteed to be hash-stable, rather than anything from the automatically generated source archives.
While I believe we can greatly mitigate this by advising our contributors and recipe authors to choose hash-stable files for library releases, this is not always possible. An increasing number of library authors are no longer doing formal, versioned releases and instead encouraging users to “live at head”. Sometimes there isn't even a tag we can refer to - and all we have is a git commit hash. The lack of a formal version number is not a problem per se - the particular git revision can still be unequivocally and uniquely identified by the git hash of the upstream repository. However, the lack of a hash-stable downloadable archive can be a problem. Using git itself to download source code has proven problematic in the past. It tends to take longer and requires special care to avoid cloning the entire history of a repository when only one commit is needed. If memory serves me well, performing a shallow clone from an arbitrary commit hash (rather than a branch or a tag, as some repositories don't provide releases or tags) was not possible until recent versions of the git client, and requires specific capabilities on the git server. Another advantage of using a single file with a hash is that it makes it easy to host mirrors (either official, or local to the organisation).
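For reference, the git-based fallback described above looks roughly like this (a sketch driving plain git commands from Python; it assumes a reasonably recent git client and a server that permits fetching by commit hash):

```python
import subprocess
import tempfile

def fetch_single_commit(repo_url: str, commit: str, dest: str | None = None) -> str:
    """Fetch exactly one commit as a shallow, detached checkout.

    Avoids downloading the repository's full history; requires server-side
    support for fetching arbitrary reachable commits.
    """
    dest = dest or tempfile.mkdtemp(prefix="shallow-")
    subprocess.run(["git", "init", "--quiet", dest], check=True)
    subprocess.run(["git", "-C", dest, "fetch", "--depth", "1", repo_url, commit], check=True)
    subprocess.run(["git", "-C", dest, "checkout", "--quiet", "--detach", "FETCH_HEAD"], check=True)
    return dest
```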
-
That's a complicated question... I represent myself (as an OSS developer), as well as BoostOrg, BFGroup, Bincrafters, CPPAlliance, and Conan (as a package contributor). Yes, I wear many hats.
Yes, although @jcar87 already commented on Conan (https://github.com/orgs/community/discussions/46034#discussioncomment-4968810). But I also maintain an alternate Conan server implementation (https://barbarian.bfgroup.xyz/) that suffers from the same issues with recipes that use the archive hashes.
Not immediately: by the time I was aware of it and had started to think about what I needed to do for the various Conan recipes I maintain, the change had been reverted.
Yes, I've been thinking about this since this post, in particular regarding @BillyONeal's reply (https://github.com/orgs/community/discussions/46034#discussioncomment-4843932) about the safe vs. doom sections of the release assets list. And I had one idea that would help me, as a producer of OSS software and tools: it would be fantastic if I could have a button/link/action to "bake" the "Repository code download archives" into "Repository release archives". Even better would be a setting that makes that baking automatic for labeled releases, and/or an option when creating a release that bakes those archives. In other words, if relying on the code archives is the problem, the solution is to make it easy not to rely on them.
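Until such a button exists, release authors can approximate the "baking" by hand: download the auto-generated tarball once and re-upload those bytes as a regular, hash-stable release asset. A rough sketch against the GitHub REST API (the owner, repo, tag, and token below are placeholders):

```python
import json
import os
import urllib.request

OWNER, REPO, TAG = "example-org", "example-project", "v1.2.3"   # placeholders
TOKEN = os.environ["GITHUB_TOKEN"]
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Accept": "application/vnd.github+json"}

def api(url: str, data: bytes | None = None, content_type: str | None = None):
    headers = dict(HEADERS)
    if content_type:
        headers["Content-Type"] = content_type
    request = urllib.request.Request(url, data=data, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

# 1. Download the auto-generated (non-guaranteed) source archive once.
tarball_url = f"https://github.com/{OWNER}/{REPO}/archive/refs/tags/{TAG}.tar.gz"
with urllib.request.urlopen(tarball_url) as response:
    tarball = response.read()

# 2. Re-upload the same bytes as a regular release asset, which is hash-stable.
release = api(f"https://api.github.com/repos/{OWNER}/{REPO}/releases/tags/{TAG}")
api(
    f"https://uploads.github.com/repos/{OWNER}/{REPO}/releases/{release['id']}/assets"
    f"?name={REPO}-{TAG}-src.tar.gz",
    data=tarball,
    content_type="application/gzip",
)
```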
-
@vtbassmatt I just hit a SHA256 change on an archive: according to the GitHub UI, https://github.com/STMicroelectronics/cmsis-core/archive/refs/tags/v5.4.0_cm4.tar.gz has not been updated since 2019, but the SHA256 changed yesterday from f711074a546bce04426c35e681446d69bc177435cd8f2f1395a52db64f52d100 to 32f226c31d7d1ff4a504404400603e047b99f405cd0c9a8f417f1f250251b829. Is this expected?
I figured it out. It's STM's fault. Sorry about the noise here.
STM apparently renamed their repository, from https://github.com/STMicroelectronics/cmsis_core/ to https://github.com/STMicroelectronics/cmsis-core/. The links all redirect, but the top-level folder in the compressed archive now has a different name (`cmsis-core`, not `cmsis_core`), so the checksum changed, too.