
Add support for SHA-1 arbitrarily-large objects (AKA Git objects) #1473

Closed
Ericson2314 opened this issue Sep 11, 2023 · 5 comments
Labels
discussion feat New feature or request

Ericson2314 commented Sep 11, 2023

So first of all I think your current design of focusing on Blake3 and its tree hashing for verified streaming of large objects is very good. Unquestionably on technical grounds, this is the way forward.

However, I also think that there is a lot of existing git-content-addressed data out there, and that content-addressing works best and most simply when the same content-addressing format works end-to-end. I am pretty convinced that the best way for IPFS-family stuff to get adoption is to work with this data and its current addressing scheme.

Concretely, any "linear" hash function we can instead also think of as a tree hash, just one that uses really shitty unbalanced binary trees. So the same techniques by which Blake3 hashing's intermediate steps can be looked at as a Merkle DAG, SHA-1's can too.
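To make that framing concrete, here is a toy sketch (all names are hypothetical, and `compress` is a stand-in for SHA-1's real compression function, not SHA-1 itself): every intermediate state of a serial hash commits to the entire prefix before it, so the chain of states is a maximally unbalanced Merkle "tree", and a suffix can be re-verified from a trusted intermediate state without re-reading the prefix.

```python
import hashlib

def compress(state: bytes, block: bytes) -> bytes:
    # Toy stand-in for SHA-1's compression function.
    return hashlib.sha256(state + block).digest()

def chain_states(data: bytes, block_size: int = 64) -> list:
    # Each intermediate state is a node in a maximally unbalanced
    # Merkle chain: node_i = compress(node_{i-1}, block_i).
    state = b"\x00" * 32  # toy IV
    states = [state]
    for i in range(0, len(data), block_size):
        state = compress(state, data[i:i + block_size])
        states.append(state)
    return states

data = b"a" * 300
states = chain_states(data)

# Verified streaming of a suffix: given a trusted state at a block
# boundary (here, after 2 blocks = byte 128), re-hash only the
# remaining bytes and compare against the final state.
resumed = states[2]
for i in range(128, len(data), 64):
    resumed = compress(resumed, data[i:i + 64])
assert resumed == states[-1]
```

Real SHA-1 differs in its compression function, IV, padding, and length-strengthening, but the chain structure is the same.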

(For some background, I have worked on https://www.softwareheritage.org/2022/02/10/building-bridge-to-the-software-heritage-archive/ and https://github.com/ipfs/devgrants/blob/master/open-grants/open-proposal-nix-ipfs.md. The former is completely done; the latter was also completed, but is only more recently getting upstreamed, see https://github.com/NixOS/rfcs/blob/master/rfcs/0133-git-hashing.md and NixOS/nix#8919. The stumbling blocks have always been (1) pipeline latency with vanilla bitswap, and (2) the MTU inducing a max object size. (2) is the more fundamental issue. I have brought up what I am proposing here before, in protocol/beyond-bitswap#30, but lack the ability to make it happen on my own. I have also co-mentored the GSoC project for https://github.com/theupdateframework/tap19-ipfs-poc.)

I get that what I am asking for might sound like "hi, I see you support IPv6, can you also please support IPv4", but I maintain it is not that bad, because SHA-1 cannot zombie onward in perpetuity the way IPv4 can. And likewise, I am not asking for SHA-256, precisely because SHA-256, being much healthier than SHA-1, does have that "zombie onward" potential.

If you are willing to do this, as a token of my gratitude I would gladly do what I can to help convince Nix, Software Heritage, The Update Framework, and even Git to support Blake3 hashing for content-addressing source code. Again, I totally believe that proper balanced tree hashing is the right way forward on technical grounds. I just think people need to see how nice end-to-end content-addressing is in order to overcome all the technical debt to get us there.

@Ericson2314 Ericson2314 changed the title Add support for SHA-1 arbitrary large objects (AKA Git objects) Add support for SHA-1 arbitrarily-large objects (AKA Git objects) Sep 11, 2023
b5 commented Sep 13, 2023

👋 @Ericson2314, thanks so much for checking out Iroh, and for taking the time to provide so many useful links to the work that brings you here. Super into the Nix stuff. I haven't had the chance to sit down with Nix yet, but it looks fantastic.

Also, thank you for your contributions to the IPFS community. Reading through your dev grant proposals & TUF+ IPFS mentoring work, you've clearly been around the community doing great stuff for some time. Thanks again!

Cutting to the chase, we're ride-or-die BLAKE3. Iroh will not support SHA-1 natively, but I don't think that's the end of the story.

If you are willing to do this, as a token of my gratitude I would gladly do what I can to help convince Nix, Software Heritage, The Update Framework, and even Git to support Blake3 hashing for content-addressing source code.

Convincing Linus to come off SHA-1 might just be the steepest known mountain in the nerd alps. Let's save you the pain and do something easier. But there's also a deeper lesson here: if we all have to use the same hash function to interop, we're toast. This is exactly the same as convincing everyone to write in $PROGRAMMING_LANGUAGE. Not going to happen.

I think the only alternative is to lean on PKI-backed trust: calculate the hash for equivalent objects in different systems, then sign the result. You'd end up with something like this:

{
  "hashes": {
    "iroh": "bafkr4ia7uxxfouaxdmumefmah6subqnqnyiel5j2o5ckdycdx56ozdung4",
    "kubo": "QmWVQcAtknigUTYM7iEyQb9im9qf5zLh3rqLv7dNuCTztV",
    "git": "e807615e14aada694fe12e978ad4c1e53036b52d"
  },
  "signer": "lldd7nflnbuy2nbjedc7lxzf5k3vsy4dpihx7cby2eywvwebbhoa",
  "signature": "longBase64SignatureString..."
}

If you can trust that key, then you can trust the translation. If you have both hashing mechanisms available you can verify the hashes yourself. This extremely simplistic statement can form the foundation of all sorts of interop, and is an approach we can actually pull off before the heat death of the universe. Heck, we could even do it in IPLD (DAG-CBOR), seems like a nice fit.
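For what it's worth, the git entry in such a record can be checked locally by anyone holding the bytes, since git's object id is just SHA-1 over a short header plus the content. A minimal sketch (`git_blob_id` is a hypothetical helper name, but the header format is git's real one):

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    # Git hashes "blob <size>\0" + content, not the raw bytes.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Matches `echo hello | git hash-object --stdin`:
assert git_blob_id(b"hello\n") == "ce013625030ba8dba906f756967f9e9ca394464a"
```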

We need to build this for interop with kubo anyway, so I'd be happy to work in SHA-1 support while we're at it. Any interest? Always keep in mind that:

  • You're welcome to build on top of iroh; the above approach certainly does.
  • Hashing things twice really isn't that big a deal. It feels gross until you realize it happens on literally every TLS packet that sends content-addressed data. If we can do it over the wire, an extra rinse through SHA-1 shouldn't be the thing that holds us up.
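As a sketch of that second point (hypothetical helper, not iroh or kubo API): a single read pass can feed any number of digests at once, so the marginal cost of the extra hash is CPU only, while the I/O cost stays fixed.

```python
import hashlib
import io

def multi_digest(stream, algorithms=("sha1", "sha256")):
    # One read pass, several digests: reading the bytes dominates,
    # so the second hash is nearly free in practice.
    hashers = [hashlib.new(name) for name in algorithms]
    for chunk in iter(lambda: stream.read(64 * 1024), b""):
        for h in hashers:
            h.update(chunk)
    return {name: h.hexdigest() for name, h in zip(algorithms, hashers)}

digests = multi_digest(io.BytesIO(b"abc"))
```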

Pushback welcome.

@b5 b5 self-assigned this Sep 13, 2023
@b5 b5 added feat New feature or request discussion labels Sep 13, 2023
Ericson2314 commented Sep 14, 2023

Thank you @b5 for writing a nice and thorough response.

Pushback welcome

Alright :)

But there's also a deeper lesson here: if we all have to use the same hash function to interop, we're toast. This is exactly the same as convincing everyone to write in $PROGRAMMING_LANGUAGE. Not going to happen.

This is a useful metaphor. I agree we can't and shouldn't try to get everyone to use the same programming language. But the costs of polyglot programming, even the most complicated FFI, are nothing like the costs of PKI. Building webs of trust is where p2p projects have floundered and died for decades, and I really want to avoid that, at all costs, for issues where it is not essential.

Hopefully this argument isn't too theoretical at this point: after all, the MTU restriction with Kubo can also be overcome by rehashing (and chunking) and signing. Have we seen large uptake of that for interop between content-addressing systems? Not to my knowledge. Furthermore, Git knows that SHA-1 is busted and has added support for SHA-256, but to my knowledge no one has bothered with this sort of dual-hashing and signed translation, despite it being the only feasible way to gently migrate repos with a transition period. I sincerely wish this sort of PKI thing had better social uptake, because even if we don't need it for interop (example 1) we do need it for crypto-agility (example 2), but experience sadly says otherwise.

The second part of my pitch --- convince more things to use BLAKE3, convince people to actually do crypto-agility --- does indeed sound utopian, like convincing everyone to use the same programming language. But remember that is just the second part. The first part is the opposite: existing systems can hardly be convinced of anything up front, people can't be bothered to deal with PKI to try out new ways of working, and content-addressing systems should support legacy Git objects for the same reason that FFIs fall back to C --- the lingua franca is the conservative outcome of a vulgar popularity contest.

Kubo gives me more hash functions but not large objects. Iroh gives me large objects but no additional hash functions. IMO, the world just needs something that can do both so we can kick off this frictionless, trustless interop. Then we'll move high enough up the hierarchy of needs for converting to tree hashing end-to-end (my step 2) or setting up a PKI (what you propose) to stick --- really, whatever ends up working out is fine with me. But I am sincerely worried that getting to those without doing my step 1 is much harder.

Ericson2314 commented
I looked at the code a bit. I will grant that while incremental validation of arbitrarily large SHA-1 blobs is possible, there is no "outboard" format: all the bytes from some position to the end must be read. That's not a conceptual problem --- serial hashes still work as a shitty tree hash --- but it is a wrinkle in trying to support such a format alongside blake3 with minimal pain and suffering.
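For illustration, Python's `hashlib` already exposes the only incremental primitive a serial hash offers --- copying the internal state mid-stream via `.copy()`. What it cannot give you is a proof for an arbitrary byte range; verification always has to run from a checkpoint through to the end (a minimal sketch, not iroh code):

```python
import hashlib

h = hashlib.sha1()
h.update(b"first chunk of a large object")
checkpoint = h.copy()   # snapshot of the serial hash's internal state
h.update(b"rest of the object")
full = h.hexdigest()

# Resuming from the checkpoint avoids re-reading the prefix, but there
# is no per-range proof: verifying anything still means hashing from
# the checkpoint all the way to the end of the object.
resumed = checkpoint
resumed.update(b"rest of the object")
assert resumed.hexdigest() == full
```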

b5 commented Sep 19, 2023

Kubo gives me more hash functions but not large objects. Iroh gives me large objects but no hash functions.

You're a very skilled writer! But this statement is inaccurate. Iroh gives you exactly one hash function, and the choice to use BLAKE3 is the reason you can have large objects. It's an either-or that you're asking we make a both-and.

We've spoken with a few folks who are interested in this "shitty tree hash" approach for backporting other hash functions, which has given me some time to hone my thoughts a bit. I think two other wins come from using only BLAKE3:

BLAKE3 is designed to do this

It's both a source of speed, and a specified, intentional use of the algorithm. Iroh doesn't use "BLAKE3 plus a few off-brand approaches that make it a tree hash". It's. just. BLAKE3. That makes this system easier for others to adopt in the long run because we're not fighting a spec. Iroh can interop with anything that uses b3sum. Backporting other hash functions requires engagement with "not spec", and as a project focused on delivering real-world utility, this conversation might be doable, but it's just one more thing we have to do to prove utility. We already have plenty of barriers there.

Iroh does not allow hashing configuration

Kubo, by contrast, allows hashing configuration, and there is currently no way to embed that configuration in the resulting CID:

$ kubo add --chunker=rabin-avg my_photo.jpg

Supplying a non-default chunking parameter changes the output hash. If you want the repeatable "same file in, same hash out" property necessary to use IPFS as a checksumming tool, each application must specify the exact configuration required to get repeatable hashes for the same input bytes. In practice most folks work around this by saying "use the defaults", and that "use the defaults" approach has cemented a specific, sub-optimal data structure that the entire IPFS ecosystem must support. To this day all UnixFSv1 blocks are encoded with a broken version of Protobuf that cannot be changed without breaking hashes.
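A toy sketch of why that happens (hypothetical code, not kubo's actual chunker or UnixFS encoding): the chunk boundaries become leaf boundaries of the Merkle tree, so the same bytes chunked differently yield a different root.

```python
import hashlib

def merkle_root(data: bytes, chunk_size: int) -> str:
    # Leaves are hashes of fixed-size chunks; chunk_size is therefore
    # part of the implicit "configuration" baked into the root hash.
    leaves = [hashlib.sha256(data[i:i + chunk_size]).digest()
              for i in range(0, len(data), chunk_size)]
    while len(leaves) > 1:
        if len(leaves) % 2:
            leaves.append(leaves[-1])  # duplicate last node on odd levels
        leaves = [hashlib.sha256(leaves[i] + leaves[i + 1]).digest()
                  for i in range(0, len(leaves), 2)]
    return leaves[0].hex()

data = b"x" * 10_000
# Same input bytes, different chunker setting, different "CID":
assert merkle_root(data, 1024) != merkle_root(data, 2048)
```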

So what does this have to do with hash functions? After all, we have multicodecs, which expressly allow us to embed the hash function configuration in the CID. The point is, kubo allows all sorts of configuration that cannot be expressed in the CID, which breaks that implicit promise and introduces complexity low in the stack that creates very real problems farther up the stack, ending with application developers, who are now forced to configure a hashing scheme to make their application work.

The answer is to not allow configuration at all. Pick a hash function. Move on. If BLAKE3 is broken, we will be forced to release iroh 2.0 immediately & write a migration. That's a risk we've accepted in favor of giving time back to application developers.

Git knows that SHA-1 is busted, and has added support for SHA-256, but no one has to my knowledge bothered with this sort of dual-hashing and signed translation despite it being the only feasible way to gently migrate repos with a transition period.

I think this reinforces my point about configuration. No one has transitioned, because no one wants to configure their hash function. Who cares? The git maintainers have made the case that they really don't "need" the security properties of SHA-1, and instead rely on the uniqueness property. If I want security, I'll sign my commits, which is an example of a proliferating PKI scheme.

The world just needs something that can do both so we can kick off this frictionless, trustless interop

I'd argue there is no such thing as frictionless data in the semantic web sense. There is tooling called "frictionless data", but the idea that we can design systems that semantically understand how to combine two novel datasets is not solved with a hash function, it's solved with ChatGPT, which will happily lie to you while it's at it.

If we want a world where all git objects can interface with all IPFS objects, we should write an application that does that. (psst, folks have done this!) This will force us to contend with why we are doing this and identify stakeholders who need this tool, which will motivate its continued maintenance. I'm so down to work on that, and yes, I'd use PKI to do the heavy lifting there.

I'm going to mark this as "closed, won't support", but only because we're not going to do the exact thing this issue asks for: support SHA-1 in iroh. The broader project of making data interoperable, I'm super down for.

@b5 b5 closed this as completed Sep 19, 2023
Ericson2314 commented
Fair enough, I cannot of course dictate what you all choose to prioritize. And the code is there for me or anyone else to try doing this ourselves (though I will not have time for this anytime soon).

I'll just add that "no hash functions" was a typo, not a rhetorical exaggeration 😅. I fixed it above.
