Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Canonical encoding rules for ambiguous situations #34

Closed
eth-r opened this issue Jan 8, 2018 · 12 comments
Closed

Canonical encoding rules for ambiguous situations #34

eth-r opened this issue Jan 8, 2018 · 12 comments

Comments

@eth-r
Copy link
Contributor

eth-r commented Jan 8, 2018

In some encodings without existing standards (like RFC4648), there is a potential for ambiguous encodings and this should be handled in some way. For example, the common test case "yes mani !" is usually base-2 encoded to "01111001011001010111001100100000011011010110000101101110011010010010000000100001" in implementations, implying that leading zeros are dropped. However, this means that "\x00yes mani !" also encodes to "01111001011001010111001100100000011011010110000101101110011010010010000000100001" which will absolutely cause problems for someone somewhere at some point.

There are a few different ways to resolve this ambiguity.

One would be to override the existing example and require fixed-length encoding so "yes mani !" would encode to "001111001011001010111001100100000011011010110000101101110011010010010000000100001" and "\x00yes mani !" to "00000000001111001011001010111001100100000011011010110000101101110011010010010000000100001". This would require decoding variable-width encodings as the shortest matching string, for backwards compatibility.

Another would be to permit dropping leading zeros as long as the encoding remains unambiguous. In this case the existing example would still be valid, "\x00yes mani !" would encode to "0001111001011001010111001100100000011011010110000101101110011010010010000000100001" and "\x00\x00yes mani !" to "000000000001111001011001010111001100100000011011010110000101101110011010010010000000100001".

A similar problem applies to base-8 and base-10, with the added complexity of the latter case not even mapping cleanly to bits to begin with. A canonical set of test vectors covering these and other edge cases would be extremely useful for implementers; #24

@Stebalien
Copy link
Member

Stebalien commented Jan 8, 2018

Interesting. I believe we want to vary from the RFCs here. Every byte in the input should appear in the output. Zeros should not be special.

@eth-r
Copy link
Contributor Author

eth-r commented Jan 8, 2018

If that's the desired behavior, the JS, C#, Rust and Python implementations are using incorrect test vectors which drop the leading zero. (Additionally, the Haskell implementation currently lacks tests completely and has some very weird behavior, which I'm working on fixing)

@eth-r
Copy link
Contributor Author

eth-r commented Jan 8, 2018

(Also, to resolve some potential ambiguity: this base2 issue is not related to RFCs but the absence of them; RFC4648-compliant base16/32/64 encodings do not suffer from this problem)

@Stebalien
Copy link
Member

I see.

As far as I know, nothing actually uses the base2 encoding (they mostly just exist for completeness) but we should definitely fix that. I'm guessing these base2 implementations have that problem because base2 is usually only used to encode numbers, not byte strings. Would you mind filing bugs against the affected projects?

@eth-r
Copy link
Contributor Author

eth-r commented Jan 8, 2018

Issues filed against the Rust, Python and JS libraries, the C# one is using its own nonstandard thing which is also an incompatibility but not actually about this specific issue.

I've also prepared test vectors for cases that exercise this issue, and my version of the haskell library now preserves leading zeros in all bases covered by the project, except base-10 which isn't implemented yet, and unary which is its own can of worms.

@eth-r eth-r mentioned this issue Jan 8, 2018
@Stebalien
Copy link
Member

Honestly, we should probably just get rid of unary. It's totally useless.

@SleeplessByte
Copy link
Contributor

SleeplessByte commented Jun 19, 2019

Hi there! Whilst implement the ruby variant (#53) I ran into this.

My current implementation has this working for the test vectors, which at most prepend 2 null bytes, but I can't quite figure out what the idea here was. I don't quite understand why the algorithm is different for power of 2 bases vs. non power of two bases.

In any case, it seems that for non-power-of-2-bases (base 10, base 58) the correct method is:

  • count leading zeros at front
  • drop from input
  • encode input
  • put "first character from encoding" in front * counted times

Decoding is the same in reverse:

  • count "first character from encoding" in front
  • drop from input
  • decode input
  • put 0 bytes at front * counted times

I'll turn this into an RFC, because its actually not there yet in this capacity.


The implementation I have for base 2 (and base 16/32/64) are using a c library, part of the Ruby stdlib and they all pass the test vectors. I have the following working, but this might be a quirk:

Encoding yes mani ! with 0, 1, 2 or 3 \x00 in front:

        171312714403326055632220041
     000171312714403326055632220041
   00000171312714403326055632220041
00000000171312714403326055632220041

I have no clue if the 3 * null byte in front is correct. This is the algo I applied:

  • factor of encoding growth = log(256) / log(base)
  • determine expected length of the encoded output (treat 0 byte as regular input)
  • count leading zeros at front
  • drop from input
  • encode input
  • put "first character from encoding" in front (factor * count).floor
  • pad the output if necessary with extra zeroes

Decoding is similar, but slightly simpler:

  • factor of encoding growth = log(256) / log(base)
  • count leading zeros at front
  • drop from input
  • decode input
  • put "first character from encoding" in front (count / factor).round (rounding takes care of removing padding)

Doing this generates the correct output for base 2 ánd base 8 (without using the c lib).
The algorithm feels convoluted. Is this what I'm supposed to be doing?

@Stebalien
Copy link
Member

In any case, it seems that for non-power-of-2-bases (base 10, base 58) the correct method is:

This is more a rule of thumb, than anything. Many of these codings were defined elsewhere and just happened to pick the same rule (or copied each-other). This is likely because most of these encodings are defined as as "treat the input as a a single variable-width number and change the base". Treating the input as a variable-width number wouldn't usually preserve leading zeros.

For powers of two, this also comes from existing definitions. This time, it comes down to the fact that the input is actually treated as a sequence of fixed-width numbers where the base of each is changed independently. The fixed-width nature of these numbers preserves the zeros.

The tricky part is padding: for powers of two, we expect that the input can be broken into an even number of bytes.

  • For base 2 we currently pad by prepending enough zeros (8 - (length(input) % 8)) to round to an even number of bytes. NOTE: I don't believe that the spec, go, and js currently agree on base 2 and nobody uses it in practice.
  • I don't think we currently specify base 8. I know js implements it using the base-x library.
  • For base 16+, we follow the rules of the well defined encodings (we have multiple encodings, some of which pad, some of which don't).

@vmx anything you can fill in here?

@SleeplessByte
Copy link
Contributor

don't believe that the spec, go, and js currently agree on base 2 and nobody uses it in practice.

It (base 2) seems to match the c implementation of ruby's pack (which means that this is probably something that is specced somewhere). In that matter, it seems that all powers of 2 are easy to define.


I do know that base-x is not matching the multibase spec; the ruby implementation is following @eth-r and therefor the spec files, which basically call for the "weird" rules I've laid out above.

I think it would be great to get some more consensus on this, and I'll write spec documents for it, so a new language will be able to implement this with confidence (for example, on exercism.io, we support 50 languages, meaning 50 different implementations 🗡 )

@vmx
Copy link
Member

vmx commented Jul 16, 2019

@Stebalien I've nothing to add, though I'm also not really a base encoding expert.

@Stebalien
Copy link
Member

@SleeplessByte ❤️ ! Actual specs for all of these would be amazing! Please reach out to me over IRC(freenode)/Matrix on the #ipfs (or #multiformats) channel if you need something looked at and I'm not responding.

@vmx
Copy link
Member

vmx commented Jun 9, 2020

I think this is solved by the most recent specs and test vectors we have. Please re-open if you don't agree.

@vmx vmx closed this as completed Jun 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants