Some packages have duplicated x-revision fields #779

Closed
snoyberg opened this issue Jul 16, 2018 · 13 comments

Comments

@snoyberg
Contributor

I have discovered four cases of package revisions where the x-revision: field within the cabal file does not match the revision specified by Hackage. Note the URLs below (interestingly, all for revision 3), and the fact that all of these examples contain x-revision: 2 in their contents:

It's unclear as a consumer of metadata from Hackage whether to trust the x-revision fields or not. The alternative is to count how many previous cabal files are in the tarball, but I haven't seen any guarantee written down that cabal file contents will not be repeated (though that seems to be the case today).
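Purely as a diagnostic, the mismatch described above could be detected with something like the following Python sketch. It assumes an uncompressed index whose entries are named like `foo/1.0/foo.cabal`, treats a missing `x-revision` field as revision 0, and uses a plain regular expression rather than a real .cabal parser. The function names and the demo data are hypothetical.

```python
import io
import re
import tarfile

# Assumption: a simple line-based match is good enough for a diagnostic.
X_REVISION = re.compile(rb"^x-revision:\s*(\d+)", re.MULTILINE | re.IGNORECASE)


def find_mismatches(index_file):
    """Return (path, positional_revision, declared_revision) for each disagreement."""
    seen = {}
    mismatches = []
    with tarfile.open(fileobj=index_file, mode="r:") as tar:
        for entry in tar:
            if not (entry.isfile() and entry.name.endswith(".cabal")):
                continue
            # Positional revision: how many times this path appeared before.
            positional = seen.get(entry.name, -1) + 1
            seen[entry.name] = positional
            match = X_REVISION.search(tar.extractfile(entry).read())
            declared = int(match.group(1)) if match else 0
            if declared != positional:
                mismatches.append((entry.name, positional, declared))
    return mismatches


def make_index(entries):
    """Build a small uncompressed tar in memory (hypothetical demo data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in entries:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


base = b"name: foo\nversion: 1.0\n"
demo = make_index([
    ("foo/1.0/foo.cabal", base),
    ("foo/1.0/foo.cabal", base + b"x-revision: 1\n"),
    ("foo/1.0/foo.cabal", base + b"x-revision: 2\n"),
    ("foo/1.0/foo.cabal", base + b"x-revision: 2\n"),  # repeated contents
])
mismatches = find_mismatches(demo)
```

In the demo data the fourth entry is revision 3 by position but still declares `x-revision: 2`, which is the shape of the cases reported in this issue.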

snoyberg added a commit to commercialhaskell/stack that referenced this issue Jul 16, 2018
@phadej
Contributor

phadej commented Jul 16, 2018

@phadej
Contributor

phadej commented Jul 16, 2018

An old bug, to be precise:

[Screenshot from 2018-07-16 13-23-37]

@hvr
Member

hvr commented Jul 16, 2018

x-revision fields MUST NOT be trusted; they're merely an internal implementation detail of Hackage. There are no guarantees, and clients shall avoid making assumptions about the x-revision field.

@snoyberg
Contributor Author

Is that official documentation somewhere? If so, I'd love to see it. I'm not sure if this is intended to be the official documented spec right here.

@snoyberg
Contributor Author

And to be a bit more clear: how are clients supposed to determine the revision number? By counting previous files of the same name? Is that guaranteed to be reliable?

@gwils

gwils commented Jul 16, 2018

Why doesn't the x-revision field line up with reality? Perhaps it was added after the revision feature?

@hvr
Member

hvr commented Jul 16, 2018

The only reliable, efficient and sensible way to determine the revision count is in fact to count the occurrences of the tar entry for the respective pkg-id inside the 01-index.tar, not least because you won't be able to parse the future syntax of .cabal files in the index (as if there weren't already enough other reasons not to do this). That's what tooling such as https://matrix.hackage.haskell.org uses, and it's also the approach that cabal will take (the revision display issue pointed out by @phadej is actually a bug in the visualization that I intend to fix).

That being said, the concept of a revision number is primarily for human consumption: it's mostly a concise addressing scheme optimised for a single self-contained package index where tarballs are immutable. In a more general setting, better ways to address a revision are either via --index-state (cheaper to compute, but still not ideal) or content-indexed via e.g. SHA-256 (technically the optimal one, as it provides benefits for nix-style caching logic and works reliably in more general settings, but it's inconvenient for humans as it's long to type and has no obvious ordering that coincides with the evolutionary ordering).

Long story short: to determine the revision count, simply count the occurrences of the pkg-id in the TUF-protected 01-index.tar.

PS: The revision counting can be done incrementally, without rescanning the whole index on each update, because the Hackage index grows monotonically; that's part of the incremental index update of hackage-security. The tricky part is detecting the very unlikely but still possible case that the index got rebased. I'm not sure if hackage-security exposes this information yet, but internally it is able to detect when the index couldn't be updated incrementally. It's probably not worth the added complexity yet for the current size of the index, as you'd still need to handle the decompression incrementally, which is certainly possible but tricky: you'd have to store the compression state and so on... but I don't want to bore you with technical details.
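The counting rule described in this comment can be sketched in a few lines of Python. This is a minimal illustration and not what cabal or Stack actually run: it assumes an uncompressed index whose .cabal entries are named like `foo/1.0/foo.cabal`, and the helper names and demo data are hypothetical.

```python
import io
import tarfile
from collections import defaultdict


def revision_counts(index_file):
    """Map each .cabal entry path to its latest revision number.

    The first occurrence of a path is revision 0; every later occurrence
    of the same path is the next revision.
    """
    seen = defaultdict(int)
    with tarfile.open(fileobj=index_file, mode="r:") as tar:
        for entry in tar:
            if entry.isfile() and entry.name.endswith(".cabal"):
                seen[entry.name] += 1
    return {path: count - 1 for path, count in seen.items()}


def make_index(entries):
    """Build a small uncompressed tar in memory (hypothetical demo data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in entries:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


demo = make_index([
    ("foo/1.0/foo.cabal", b"name: foo\n"),
    ("bar/0.1/bar.cabal", b"name: bar\n"),
    ("foo/1.0/foo.cabal", b"name: foo\nx-revision: 1\n"),
    ("foo/1.0/foo.cabal", b"name: foo\nx-revision: 1\n"),
])
counts = revision_counts(demo)
```

Note that the file contents are never inspected: two entries with byte-identical contents still count as two revisions, which is what makes this approach independent of the x-revision field.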

@snoyberg
Contributor Author

Given that the Hackage interface only provides revision number information to users, it's currently the only reasonable specification format to expect from them. I encourage people to use sha256-based references to revisions, but that's significantly harder to achieve. I'm still trying to understand what guarantees can be expected when processing a 01-index.tar file:

  • Are we guaranteed that a specific revision of a cabal file will appear only once?
  • Is there any possibility that the 01-index.tar file will get rewritten in part or whole in the future, necessitating reindexing? If so, are we guaranteed that the revision number calculated previously will remain the same?

As I see it, the only thing to be done today is what Stack does right now, essentially:

  • Use hackage-security to update the index
  • Recalculate a complete cache of all revision numbers inside the index

This is slow and wasteful (recalculating information which shouldn't have changed). The two alternatives I can think of are (1) parsing the x-revision field and (2) assuming the previous contents of the index haven't changed. It sounds like (1) isn't an option because Hackage is not guaranteeing the veracity of the field it's providing, but I still don't know if (2) is a possibility. If not, it sounds like we're stuck with the much slower approach.

snoyberg added a commit to commercialhaskell/stack that referenced this issue Jul 16, 2018
This will be a new library for storing package information. This first
bit overhauls the Hackage index update code, and stores information in a
SQLite database instead of the old caches. This turns out to be
significantly faster for `stack update` calls.

Fixes #3586

Note that it would be nicer to just resume the caching from where we'd
last left off, or to parse the revision numbers from the cabal files
themselves. See the discussion in haskell/hackage-server#779 to see why
that isn't possible.
@phadej
Contributor

phadej commented Jul 16, 2018

This is my best understanding:

> Are we guaranteed that a specific revision of a cabal file will appear only once?

There are no hard guarantees that the x-revision field is correct. As @hvr said, it's for humans, not machines (it's a cheap compare-and-set protection, as far as I understand).

Also, the exact same .cabal file might appear in the index twice; in fact, some do:

> Is there any possibility that the 01-index.tar file will get rewritten in part or whole in the future, necessitating reindexing? If so, are we guaranteed that the revision number calculated previously will remain the same?

For the main Hackage index the possibility is there, but it would be a very exceptional case. From https://www.well-typed.com/blog/2015/08/hackage-security-beta/:

> So we have also taken the opportunity to dramatically reduce download sizes by allowing clients to update this file incrementally. The index tarball is now extended in an append-only way.

It's safe to assume that, given an index-state (timestamp) from the post-hackage-security era and a more recent 01-index.tar, you can recover the information of the old 01-index.tar. So, to answer your question: it's very unlikely that 01-index.tar will be rewritten. A lot of tooling would break, and I'd expect a widely and loudly distributed PSA if that happens.

If I'd build any cache based on 01-index.tar, which can (should) be updated incrementally, then:

  • I'd also save the index (N) and the hash of the last processed file (Tar.Entry)
  • on incremental update, skip forward to N and check the hash
    • if it matches, do an incremental update for the rest of the new index
    • otherwise, do a full update

Or even checksum the whole contents of tar until N. On my machine SHA512 (which is slow and overkill for this purpose) of 01-index.tar takes 1.5 seconds (I have SSD).
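The first variant above (remember where the last processed entry sits and what it hashes to, then resume from there) might look roughly like this in Python. It is a sketch under several assumptions: the index is held in memory as uncompressed bytes, the saved state is a plain dict, and names such as `update_cache` and `_scan` are made up for illustration. Checking only the last entry is a weaker test than hashing the whole prefix.

```python
import hashlib
import io
import tarfile

END = b"\0" * 1024  # end-of-archive marker: two 512-byte null blocks


def _scan(index_bytes, start, counts):
    """Count .cabal entries from byte offset `start`; return entry bounds."""
    last_start, next_offset = None, start
    with tarfile.open(fileobj=io.BytesIO(index_bytes[start:]), mode="r:") as tar:
        for entry in tar:
            if entry.isfile() and entry.name.endswith(".cabal"):
                counts[entry.name] = counts.get(entry.name, -1) + 1
            last_start = start + entry.offset
            padded = (entry.size + 511) // 512 * 512  # data is padded to 512
            next_offset = start + entry.offset_data + padded
    return last_start, next_offset


def _state(index_bytes, counts, last_start, next_offset):
    chunk = index_bytes[last_start:next_offset]
    return {"counts": counts, "last_start": last_start,
            "next_offset": next_offset,
            "last_hash": hashlib.sha256(chunk).hexdigest()}


def update_cache(index_bytes, state=None):
    """Return (new_state, mode), where mode is 'incremental' or 'full'."""
    if state is not None:
        chunk = index_bytes[state["last_start"]:state["next_offset"]]
        if hashlib.sha256(chunk).hexdigest() == state["last_hash"]:
            counts = dict(state["counts"])
            last, nxt = _scan(index_bytes, state["next_offset"], counts)
            if last is None:  # nothing was appended since last time
                return state, "incremental"
            return _state(index_bytes, counts, last, nxt), "incremental"
    counts = {}
    last, nxt = _scan(index_bytes, 0, counts)
    return _state(index_bytes, counts, last, nxt), "full"


def _entry(name, data):
    """One raw tar entry: 512-byte header plus padded data (demo helper)."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    header = info.tobuf(format=tarfile.USTAR_FORMAT)
    return header + data + b"\0" * (-len(data) % 512)


old = _entry("foo/1.0/foo.cabal", b"v0\n") + _entry("bar/0.1/bar.cabal", b"v0\n")
new = old + _entry("foo/1.0/foo.cabal", b"v1\n")
rebased = _entry("baz/2.0/baz.cabal", b"v0\n") + new

state1, mode1 = update_cache(old + END)
state2, mode2 = update_cache(new + END, state1)
state3, mode3 = update_cache(rebased + END, state1)
```

In the demo, appending an entry is picked up incrementally, while the simulated rebase fails the hash check and falls back to a full scan.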

@snoyberg
Contributor Author

Thanks @phadej, I think that clarifies my concerns. Your proposed workaround for the potential rebase situation should work, I'll see if I can make it happen.

snoyberg added a commit to commercialhaskell/stack that referenced this issue Jul 17, 2018
Thanks to @phadej for the inspiration for this in his comment:
haskell/hackage-server#779 (comment)
@snoyberg
Contributor Author

Alright, that seems to have worked:

commercialhaskell/stack@33ef253

To summarize my understanding:

  • There was at some point a bug that caused an identical cabal file to be inserted into the index twice. This includes the x-revision field. Between that and the unreliability of parsing it in the future, the x-revision field should not be used by anyone except end users, as a convenience which may sometimes be wrong.
  • The Hackage display of a revision number does seem to line up perfectly with "number of previous times the same file name appeared in the tarball." Therefore, doing revision indexing in a tool based on this approach will be consistent with what users see on the Hackage website.
  • We cannot rely on previously viewed parts of the 01-index.tar file remaining unchanged on the filesystem, since there may be a rebase from Hackage which would force a redownload of the entire file. hackage-security does not currently provide a notification for when such a redownload occurs.
  • Therefore, the only way to reliably determine if the file has the same prefix as we previously viewed is to actually inspect the contents. Using a hash of the previously viewed prefix works well for this, as demonstrated by @phadej above and commercialhaskell/stack@33ef253.

Thanks for the help @hvr and @phadej. If my summary above is correct, I believe this issue can be closed. If there's some documentation that would be an appropriate place for this summary, let me know and I'll send a PR.

@snoyberg
Contributor Author

One final note: the last 1024 bytes (two 512-byte blocks) of a tar file contain all null bytes. Each time the tarball is updated, those last two blocks will be overwritten with new data. Therefore, when calculating hashes, you need to ignore the trailing 1024 bytes.
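A small Python sketch of why the trailer has to be excluded when hashing the previously seen prefix. It assumes the archive ends with exactly two null blocks, as described above, and builds the demo archives by hand; `prefix_hash` and `_entry` are hypothetical helper names.

```python
import hashlib
import tarfile

TRAILER = 1024  # end-of-archive marker: two 512-byte blocks of null bytes


def prefix_hash(tar_bytes, length=None):
    """SHA-256 over the entries of an uncompressed tar, excluding the trailer.

    `length` is the total size of a previously seen archive (trailer
    included); by default all of `tar_bytes` is treated as that archive.
    """
    if length is None:
        length = len(tar_bytes)
    return hashlib.sha256(tar_bytes[:length - TRAILER]).hexdigest()


def _entry(name, data):
    """One raw tar entry: 512-byte header plus padded data (demo helper)."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    header = info.tobuf(format=tarfile.USTAR_FORMAT)
    return header + data + b"\0" * (-len(data) % 512)


old = _entry("foo/1.0/foo.cabal", b"v0\n") + b"\0" * TRAILER
# Appending overwrites the old trailer with the new entry, then re-adds it.
new = old[:-TRAILER] + _entry("foo/1.0/foo.cabal", b"v1\n") + b"\0" * TRAILER

# Excluding the trailer: the old prefix is still recognisable in the new file.
same_prefix = prefix_hash(new, len(old)) == prefix_hash(old)

# Hashing the old file in full does not survive an append.
naive_match = hashlib.sha256(new[:len(old)]).digest() == hashlib.sha256(old).digest()
```

Here `same_prefix` comes out true and `naive_match` false, because the bytes where the old trailer used to be now hold the header of the appended entry.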

snoyberg added a commit to commercialhaskell/pantry that referenced this issue Jul 16, 2019
This will be a new library for storing package information. This first
bit overhauls the Hackage index update code, and stores information in a
SQLite database instead of the old caches. This turns out to be
significantly faster for `stack update` calls.

Fixes #3586

Note that it would be nicer to just resume the caching from where we'd
last left off, or to parse the revision numbers from the cabal files
themselves. See the discussion in haskell/hackage-server#779 to see why
that isn't possible.
snoyberg added a commit to commercialhaskell/pantry that referenced this issue Jul 16, 2019
Thanks to @phadej for the inspiration for this in his comment:
haskell/hackage-server#779 (comment)
@gbaz
Contributor

gbaz commented Jun 7, 2021

Based on the discussion above, closing this issue.

5 participants