New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support SHA-2 blobrefs (migrate away from sha1-*) #537

Open
bradfitz opened this Issue Oct 21, 2014 · 20 comments

Comments

Projects
None yet
6 participants
@bradfitz
Contributor

bradfitz commented Oct 21, 2014

Camlistore 0.9 should support something besides just SHA-1, like SHA-3. Should probably
then also make those use base64 blobrefs, instead of hex.
@camlistorebot

This comment has been minimized.

camlistorebot commented Dec 3, 2014

Comment 1 by jefft0:

For base64 blobrefs, it may be useful to look at RFC 6920 "Naming Things with Hashes",
which defines a base64 hash for content-addressable.  For example:
ni:///sha-3;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk
It uses "base64url" which uses '_' and '-' instead of '/' and '+' so that the base64 can
be easily put in a URL, for example if using it to fetch an image for a blog.
http://tools.ietf.org/html/rfc6920
@bradfitz

This comment has been minimized.

Contributor

bradfitz commented Dec 3, 2014

Comment 2:

Yeah, base64url was what I was imagining we'd do.
I don't plan to use the "ni:" scheme, though. We'll stick with Camlistore's blobref
format, though. Moving away from that would be a lot more work, for little gain. We can
also support "ni:///" in other places where necessary.
We need to decide which blobref prefix to use ("sha3-"?) and which SHA-3 output size to
use.
And whether we put the output size into the blobref prefix or not. We could infer it
from the length, but then we can't tell the difference between truncated blobref
strings, if they get truncated at an unfortunate spot.  So we could go with e.g.
"sha3_256-" which is kinda gross. Or "sha3-" for now (implying whatever we pick, e.g.
256), and add "sha3_512-" later if we need a different one.
Thoughts?
@mpl

This comment has been minimized.

Contributor

mpl commented Dec 3, 2014

Comment 3:

If we find a cute way to be implicit all along ("sha3s-" - s for short - for 256,
"sha3l-" - l for long - for 512?) I think it'd be ok, but starting implicit and
switching later to precising the size seems inconsistent to me.
In any case, if I wanted to bikeshedd, I think we should stick with the notation on
wikipedia when being explicit about the size: "sha3-256-" and "sha3-512-".
@bradfitz

This comment has been minimized.

Contributor

bradfitz commented Dec 3, 2014

Comment 4:

My only concern with "sha3-256-" is that it adds a hyphen in the middle of the blobref
type, breaking our usual convention of /^(\w+)-(.+)$/.  I'd be okay with "sha3_256-",
but is a bit uglier and not the normal spelling.
@camlistorebot

This comment has been minimized.

camlistorebot commented Dec 5, 2014

Comment 5 by jefft0:

There are hits if you Google for "sha3256".

@bradfitz bradfitz self-assigned this Dec 5, 2014

@jefft0

This comment has been minimized.

jefft0 commented Dec 29, 2014

Are you planning to use base64 for files and folders in the Camlistore/blobs/sha3 folder? I ask because Windows and OS X filenames are case insensitive. There would be a chance of collision, although very small. Or would you keep using the hex of the sha3 hash for the blob store even if the blobref uses base64?

@bradfitz

This comment has been minimized.

Contributor

bradfitz commented Dec 30, 2014

Good point. We'd probably need to use hex of the hash for the "localdisk" blob store (the one that stores one blob per file).

@jefft0

This comment has been minimized.

jefft0 commented Dec 30, 2014

I estimate the probability of collision for sha3_256 as about one in 2^180 (math below *). When you are about to store a blob and see the same file name, you could re-scan the first few bytes of the file to see if it is really different. Do you already do this when you are about to overwrite a blob file?

(*) Sha3_256 has 43 characters in the base64 file name. For simplicity, assume that the base64 character set has only the 52 letters, ignoring the other 12 symbols. A collision would happen if any of the 43 characters flips between upper and lower case. So there are 2^43 other file names that collide. This is out of 2^256 possible file names. So the chance of one other file colliding is 2^43 / 2^256 or one in 2^213. If over the lifetime of the blob store you generate 2^33 files, then the chance of collision is one in 2^180.

@bradfitz bradfitz modified the milestone: 0.9 Jan 9, 2015

@bradfitz bradfitz removed the release-0.9 label Jan 9, 2015

@jefft0

This comment has been minimized.

jefft0 commented Mar 25, 2015

For case-insensitive file systems (Windows and OS X) you can put a special character after each upper-case letter. For example "sha3-a2B4C6d" becomes "sha3-a2B!4C!6d". It's easy to recover the original base64 by removing all "!". And the file name is still shorter 90% of the time than using only hex characters.

@hullerob

This comment has been minimized.

hullerob commented Mar 27, 2015

How will this affect detection of identical blobs? Will we compute both sha1 and sha3 to be sure (and possibly more in the future) or do we just give up and have same blob multiple times?

@bradfitz bradfitz modified the milestones: 0.10, 0.9 Mar 31, 2015

@bradfitz bradfitz removed the accepted label Apr 25, 2016

@mpl

This comment has been minimized.

Contributor

mpl commented May 31, 2017

@mpl mpl referenced this issue Sep 27, 2017

Open

use Multiformats #959

@bradfitz bradfitz changed the title from Support SHA-3, base64 blobrefs to Support SHA-256 blobrefs (migrate away from sha1-*) Jan 7, 2018

@bradfitz

This comment has been minimized.

Contributor

bradfitz commented Jan 7, 2018

I'm leaning towards SHA-224, and keeping it in hex format, not base64. It's not much longer:

https://play.golang.org/p/zIg749SJou6

sha1	= da39a3ee5e6b4b0d3255bfef95601890afd80709
sha224	= d14a028c2a3a2bc9476102bb288234c415a2b01f828ea62ac5b3e42f
@mpl

This comment has been minimized.

@lindner

This comment has been minimized.

Member

lindner commented Jan 8, 2018

@bradfitz

This comment has been minimized.

Contributor

bradfitz commented Jan 9, 2018

@lindner, we're already effectively using multiformats, in an even more verbose & data-archaeological-proof way. And our format predated that proposal, otherwise we might've considered it. I don't see a compelling advantage to hex encoding our "sha224-" prefix to save a few bytes. Actually they don't even have sha2-224 registered with a unique number.

@bradfitz

This comment has been minimized.

Contributor

bradfitz commented Jan 9, 2018

TODO list for this bug:

  • support sha224 by default, https://camlistore-review.googlesource.com/c/camlistore/+/13066
  • audit/remove all callers of "crypto/sha1" or blob.SHA1FromString or blob.SHA1FromBytes in the code, and use the generic blob.RefFromString etc.
  • add server config to let people control whether sha1 is accepted by servers (off by default, but we'll want to let people turn it on for a bit while clients are updated)
  • update docs & examples to sha224
  • support wholeref queries of multiple hashes, and implement for client

camlistorebot pushed a commit that referenced this issue Jan 9, 2018

pkg/blob, all: support SHA-224 blobrefs, make them the default
Updates #537

Change-Id: I3966697cbdb05ca4b380974be604deebdaa258c2

camlistorebot pushed a commit that referenced this issue Jan 10, 2018

all, testhooks: use sha224 by default, add hook for some tests to use…
… sha-1

Remove the blob.SHA{1,224}From{Bytes,String} constructors too. No
longer used. This adds blob.RefFromBytes which was missing. We had
blob.RefFromString. Now everything uses blob.RefFrom* instead of
specifying a hash function.

Some tests set a flag to force use of SHA-1 because there was too much
golden data to update. We can remove those one-by-one over time as we
fix up tests.

Updates #537

Change-Id: Ibe6428089a6221594c2b751f53f98b03b5a28dc2
@mpl

This comment has been minimized.

@mpl

This comment has been minimized.

camlistorebot pushed a commit that referenced this issue Jan 23, 2018

pkg/index: fix location in corpus to use signer ID instead of blobRef
Follow-up of ec66bcc

Updates #537

Change-Id: Ib4dc62ae6be91c061dc6de4bfdd617785abdaab6
@mpl

This comment has been minimized.

camlistorebot pushed a commit that referenced this issue Jan 25, 2018

pkg/index: fix corpus AppendClaims use of signerFilter
Follow-up of ec66bcc

This change should be the last of its kind, in the corpus at least, and
should make most searches and the web UI usable again, with both kinds
of hashes (sha1 and sha224).

Updates #537

Change-Id: Icfe9e8aaab031313612c555b7601895aeba16a7c

@bradfitz bradfitz changed the title from Support SHA-256 blobrefs (migrate away from sha1-*) to Support SHA-2 blobrefs (migrate away from sha1-*) Jan 31, 2018

@bradfitz bradfitz modified the milestones: 0.10, 1.0 May 2, 2018

@bradfitz

This comment has been minimized.

Contributor

bradfitz commented May 2, 2018

Punting the remaining stuff (config, mostly) to 1.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment