Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to base36 by default in all text output (overriding ipfs/go-ipfs/issues/4143 ) #7357

Closed
ribasushi opened this issue May 25, 2020 · 15 comments
Labels
kind/enhancement A net-new feature or improvement to an existing feature

Comments

@ribasushi
Copy link
Contributor

Now that base36 support is landing into go-ipfs for unrelated reasons I would like to resurrect my pitch from 3 years ago to default to base36 rendering as opposed to base32. The internals of go-ipfs operate over binary CIDs so the proposed change is strictly concerned with "UX".

Let me know your thoughts!

✅ Pros

  • Consistency across the board: libp2p-key and unixfs-entry pointers all look like what the wider public knows as "a CID"
  • Brings "semi-avalanche effect" to identity CIDs. E.g.
Data: sufficiently long payload sufficient1y long payload
B32Cid: bafkqagttovtgm2ldnfsw45dmpeqgy33om4qhaylznrxwczak bafkqagttovtgm2ldnfsw45brpeqgy33om4qhaylznrxwczak
B36Cid: kumq0v1xsl33xy9ospzfmf6uuibi0jxtdsrmhovipx6o0a kumq0v1xsl33xy9ospzfmbp3zrpszmf3tbmvd50ou83pje
  • Increases the usable size of an identity-cid from 34 to 36 bytes of "blocksize" while still usable in subdomains (max 63 characters): 1 + ceil( (4+blocksize) * log(256) / log(base32-or-36) )

❌ Cons

  • Early adopters are already familiar with bafy..., another switch would create further confusion
  • Somewhat slower ( on my macbook I can encode and decode ( 1 * 1000 * 1000 * 1000 ) / 2672 = 374251 b36 CIDs in 1 second ):
go-multibase$ go test -count=1 ./... -cpu=1 -bench . -benchmem
goos: darwin
goarch: amd64
pkg: github.com/multiformats/go-multibase
BenchmarkRoundTrip/base32         	 2190152	       541 ns/op	     256 B/op	       4 allocs/op
BenchmarkRoundTrip/base36         	  464097	      2672 ns/op	     336 B/op	       5 allocs/op
BenchmarkRoundTrip/base58btc      	  564238	      2220 ns/op	     336 B/op	       5 allocs/op
BenchmarkEncode/base32            	 5402474	       234 ns/op	     192 B/op	       3 allocs/op
BenchmarkEncode/base36            	  608466	      1966 ns/op	     192 B/op	       3 allocs/op
BenchmarkEncode/base58btc         	  739843	      1585 ns/op	     192 B/op	       3 allocs/op
BenchmarkDecode/base32            	 4150816	       278 ns/op	      64 B/op	       1 allocs/op
BenchmarkDecode/base36            	 1604271	       742 ns/op	     144 B/op	       2 allocs/op
BenchmarkDecode/base58btc         	 1804045	       643 ns/op	     144 B/op	       2 allocs/op
@ribasushi ribasushi added the kind/enhancement A net-new feature or improvement to an existing feature label May 25, 2020
@RubenKelevra

This comment has been minimized.

@ribasushi

This comment has been minimized.

@Stebalien

This comment has been minimized.

@lidel
Copy link
Member

lidel commented May 25, 2020

There is a performance penalty, but this is mostly about UX.
Having that in mind, I'd like to highlight what @RubenKelevra said here:

If we use the same text to binary transformation it would be nice if the type of data we encode would be visible to a human

I wonder, perhaps there is UX next-positive in having different defaults (and canonical representation in subdomains) for mutable and immutable identifiers:

  • keep base32 for /ipfs/baf...
  • use base36 as the default for PeerIDs/libp2p-keys (/ipns/k2...)
    • we will use it for longer keys such as ed25519 anyway, so we are half-way there

This would make it easy to eyeball identifiers (baf.. vs k2..), potentially reducing confusion between "CIDs" and "PeerIDs".

Any downsides?


I don't think we encounter a system which can't distinguish between lower and upper case

I wish that was the case :-)

Unfortunately the default text representation of CID needs to be case-insensitive for pragmatic reasons, some of them are:

  • DNS is one thing,
  • but actual code in browser vendors (and probably a bunch of libraries/SDKs used in the wild) is forcing all characters in the "authority" component of the URI to be lowercase, even if URI does not start with http* (!)
  • Windows filesystems are case-insensitive (only recently Windows 10 build 17107 introduced experimental opt-in per directory to make them case-sensitive)
  • probably much more

@RubenKelevra

This comment has been minimized.

@lidel
Copy link
Member

lidel commented May 25, 2020

@RubenKelevra the limitation here is that IPFS uses CIDv0/CIDv1 spec, and adding checksum would mean creating CIDv2. I think its worth discussing, but as you noticed its out of scope here, so please fill an issue in https://github.com/multiformats/cid 🙏

@Stebalien
Copy link
Member

This would make it easy to eyeball identifiers (baf.. vs k2..), potentially reducing confusion between "CIDs" and "PeerIDs".

Any downsides?

Well, people will start assuming that baf means CID and k2 means peer ID. Worse, people will start writing tools to match these prefixes. On the other hand, that's likely going to happen anyways.

@bmwiedemann
Copy link
Contributor

Apart from performance and en/decoder-complexity,
another downside of base36 is decreased legibility, because it adds 0, 1 and 8 to the character set that can be mistaken for I, l or O
probably not worth it for the small increase in data density.

@Stebalien
Copy link
Member

Well, people will start assuming that baf means CID and k2 means peer ID. Worse, people will start writing tools to match these prefixes. On the other hand, that's likely going to happen anyways.

To clarify, I wouldn't consider the ability to distinguish between CIDs and peer IDs based on encoding a feature.

I'm slightly leaning towards base36 everywhere:

  • To be consistent with CIDs.
  • To stop users from trying to distinguish between "is a cid" and "is not a cid" by looking at the base encoding. They're all CIDs.

@bmwiedemann

Apart from performance and en/decoder-complexity,
another downside of base36 is decreased legibility, because it adds 0, 1 and 8 to the character set that can be mistaken for I, l or O
probably not worth it for the small increase in data density.

The goal is:

  1. To shave off two bytes from specific IPNS keys so they fit in subdomains.
  2. Be consistent.

@bmwiedemann
Copy link
Contributor

To shave off two bytes from specific IPNS keys so they fit in subdomains.

Maybe soon there will be other keys that need another 2 bytes?

Others have avoided the 63-char limit by allowing a dot:
https://tools.ietf.org/html/rfc6376#section-3.6.2.1

and code complexity is still a concern to me, because it makes alternative implementations harder and thereby reduces ipfs adoption:
https://github.com/bmwiedemann/perl-ipfs/blob/master/multibase.pm#L16
has just ~20 lines to implement encoding+decoding for base32+64+16+2
and no base58 or 36 support in there. Luckily both filestore and CIDv1 default to base32.

@aschmahmann
Copy link
Contributor

TLDR: @Stebalien has accurately stated the goal, @bmwiedemann's solution is explored in #7318 and has serious problems for important use cases.

Maybe soon there will be other keys that need another 2 bytes?

Yes, and we'll likely have to deal with that at some point, however the "just use a dot" suggestion is not as simple as you are advertising. Take a look at #7318 for more context and feel free to add any suggestions there if you have some ideas.

Hopefully we can work to ensure that we don't have to deal with this problem "soon" and that by then some of the issues in #7318 will become manageable. Potentially we could even sidestep DNS name restrictions entirely by relying more heavily on an IPFS protocol handler instead just using a DNS name.

In the meanwhile we'd like to handle #6916 while still supporting resolution in subdomains. While we could do that by using base 36 in some places (e.g. subdomains) and base 32 in other places, this proposal is advocating for consistently outputting base 36.

@ribasushi
Copy link
Contributor Author

@bmwiedemann I contributed a scaffold b36/b58 encoder here as a "oneliner" : bmwiedemann/perl-ipfs@83cb86c#r39496064

Maybe soon there will be other keys that need another 2 bytes?

We are now painfully aware about the hard limit for user-visible things being 40 bytes, and are going to aggressively design to not go over this in the future (case in point: the current situation was the result of an oversight).

@ribasushi
Copy link
Contributor Author

Dropping a reference for when there is a decision on this: one of the main spots to fixup would be https://github.com/ipfs/go-ipfs/blob/79d571f272d640f0d4164f6f9d94c5fb44983822/core/corehttp/hostname.go#L381-L396

@ribasushi
Copy link
Contributor Author

@aschmahmann @lidel seems to me we will never do this in go-ipfs. I will close this later today unless you think there's more to talk about.

@lidel
Copy link
Member

lidel commented Jun 21, 2021

@ribasushi closing this sgtm
(we default CIDv1 under /ipfs to base32 and since #7441 we use base36 for libp2p-keys under /ipns what would not fit in DNS label otherwise)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement A net-new feature or improvement to an existing feature
Projects
None yet
Development

No branches or pull requests

6 participants