Canonical encoding rules for ambiguous situations #34

eth-r · 2018-01-08T16:58:12Z

In some encodings without existing standards (like RFC4648), there is a potential for ambiguous encodings and this should be handled in some way. For example, the common test case "yes mani !" is usually base-2 encoded to "01111001011001010111001100100000011011010110000101101110011010010010000000100001" in implementations, implying that leading zeros are dropped. However, this means that "\x00yes mani !" also encodes to "01111001011001010111001100100000011011010110000101101110011010010010000000100001" which will absolutely cause problems for someone somewhere at some point.

There are a few different ways to resolve this ambiguity.

One would be to override the existing example and require fixed-length encoding so "yes mani !" would encode to "001111001011001010111001100100000011011010110000101101110011010010010000000100001" and "\x00yes mani !" to "00000000001111001011001010111001100100000011011010110000101101110011010010010000000100001". This would require decoding variable-width encodings as the shortest matching string, for backwards compatibility.

Another would be to permit dropping leading zeros as long as the encoding remains unambiguous. In this case the existing example would still be valid, "\x00yes mani !" would encode to "0001111001011001010111001100100000011011010110000101101110011010010010000000100001" and "\x00\x00yes mani !" to "000000000001111001011001010111001100100000011011010110000101101110011010010010000000100001".

A similar problem applies to base-8 and base-10, with the added complexity of the latter case not even mapping cleanly to bits to begin with. A canonical set of test vectors covering these and other edge cases would be extremely useful for implementers; #24

The text was updated successfully, but these errors were encountered:

Stebalien · 2018-01-08T17:13:25Z

Interesting. I believe we want to vary from the RFCs here. Every byte in the input should appear in the output. Zeros should not be special.

eth-r · 2018-01-08T17:26:23Z

If that's the desired behavior, the JS, C#, Rust and Python implementations are using incorrect test vectors which drop the leading zero. (Additionally, the Haskell implementation currently lacks tests completely and has some very weird behavior, which I'm working on fixing)

eth-r · 2018-01-08T17:29:28Z

(Also, to resolve some potential ambiguity: this base2 issue is not related to RFCs but the absence of them; RFC4648-compliant base16/32/64 encodings do not suffer from this problem)

Stebalien · 2018-01-08T17:39:35Z

I see.

As far as I know, nothing actually uses the base2 encoding (they mostly just exist for completeness) but we should definitely fix that. I'm guessing these base2 implementations have that problem because base2 is usually only used to encode numbers, not byte strings. Would you mind filing bugs against the affected projects?

eth-r · 2018-01-08T22:38:24Z

Issues filed against the Rust, Python and JS libraries, the C# one is using its own nonstandard thing which is also an incompatibility but not actually about this specific issue.

I've also prepared test vectors for cases that exercise this issue, and my version of the haskell library now preserves leading zeros in all bases covered by the project, except base-10 which isn't implemented yet, and unary which is its own can of worms.

Stebalien · 2018-01-08T23:57:00Z

Honestly, we should probably just get rid of unary. It's totally useless.

SleeplessByte · 2019-06-19T18:52:08Z

Hi there! Whilst implement the ruby variant (#53) I ran into this.

My current implementation has this working for the test vectors, which at most prepend 2 null bytes, but I can't quite figure out what the idea here was. I don't quite understand why the algorithm is different for power of 2 bases vs. non power of two bases.

In any case, it seems that for non-power-of-2-bases (base 10, base 58) the correct method is:

count leading zeros at front
drop from input
encode input
put "first character from encoding" in front * counted times

Decoding is the same in reverse:

count "first character from encoding" in front
drop from input
decode input
put 0 bytes at front * counted times

I'll turn this into an RFC, because its actually not there yet in this capacity.

The implementation I have for base 2 (and base 16/32/64) are using a c library, part of the Ruby stdlib and they all pass the test vectors. I have the following working, but this might be a quirk:

Encoding yes mani ! with 0, 1, 2 or 3 \x00 in front:

        171312714403326055632220041
     000171312714403326055632220041
   00000171312714403326055632220041
00000000171312714403326055632220041

I have no clue if the 3 * null byte in front is correct. This is the algo I applied:

factor of encoding growth = log(256) / log(base)
determine expected length of the encoded output (treat 0 byte as regular input)
count leading zeros at front
drop from input
encode input
put "first character from encoding" in front (factor * count).floor
pad the output if necessary with extra zeroes

Decoding is similar, but slightly simpler:

factor of encoding growth = log(256) / log(base)
count leading zeros at front
drop from input
decode input
put "first character from encoding" in front (count / factor).round (rounding takes care of removing padding)

Doing this generates the correct output for base 2 ánd base 8 (without using the c lib).
The algorithm feels convoluted. Is this what I'm supposed to be doing?

Stebalien · 2019-07-09T14:31:34Z

In any case, it seems that for non-power-of-2-bases (base 10, base 58) the correct method is:

This is more a rule of thumb, than anything. Many of these codings were defined elsewhere and just happened to pick the same rule (or copied each-other). This is likely because most of these encodings are defined as as "treat the input as a a single variable-width number and change the base". Treating the input as a variable-width number wouldn't usually preserve leading zeros.

For powers of two, this also comes from existing definitions. This time, it comes down to the fact that the input is actually treated as a sequence of fixed-width numbers where the base of each is changed independently. The fixed-width nature of these numbers preserves the zeros.

The tricky part is padding: for powers of two, we expect that the input can be broken into an even number of bytes.

For base 2 we currently pad by prepending enough zeros (8 - (length(input) % 8)) to round to an even number of bytes. NOTE: I don't believe that the spec, go, and js currently agree on base 2 and nobody uses it in practice.
I don't think we currently specify base 8. I know js implements it using the base-x library.
For base 16+, we follow the rules of the well defined encodings (we have multiple encodings, some of which pad, some of which don't).

@vmx anything you can fill in here?

SleeplessByte · 2019-07-09T17:59:33Z

don't believe that the spec, go, and js currently agree on base 2 and nobody uses it in practice.

It (base 2) seems to match the c implementation of ruby's pack (which means that this is probably something that is specced somewhere). In that matter, it seems that all powers of 2 are easy to define.

I do know that base-x is not matching the multibase spec; the ruby implementation is following @eth-r and therefor the spec files, which basically call for the "weird" rules I've laid out above.

I think it would be great to get some more consensus on this, and I'll write spec documents for it, so a new language will be able to implement this with confidence (for example, on exercism.io, we support 50 languages, meaning 50 different implementations 🗡 )

vmx · 2019-07-16T10:57:44Z

@Stebalien I've nothing to add, though I'm also not really a base encoding expert.

Stebalien · 2019-07-24T21:12:58Z

@SleeplessByte ❤️ ! Actual specs for all of these would be amazing! Please reach out to me over IRC(freenode)/Matrix on the #ipfs (or #multiformats) channel if you need something looked at and I'm not responding.

vmx · 2020-06-09T10:47:35Z

I think this is solved by the most recent specs and test vectors we have. Please re-open if you don't agree.

This was referenced Jan 8, 2018

Encoding ambiguity due to underlying design multiformats/rust-multibase#5

Closed

Ambiguous encoding due to underlying design multiformats/py-multibase#11

Closed

Ambigous encoding due to underlying design multiformats/js-multibase#24

Closed

eth-r mentioned this issue Jan 8, 2018

test vectors #24

Closed

tabrath mentioned this issue Feb 4, 2018

Nonstandard encoding multiformats/cs-multibase#4

Closed

vmx closed this as completed Jun 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Canonical encoding rules for ambiguous situations #34

Canonical encoding rules for ambiguous situations #34

eth-r commented Jan 8, 2018

Stebalien commented Jan 8, 2018 •

edited

eth-r commented Jan 8, 2018

eth-r commented Jan 8, 2018

Stebalien commented Jan 8, 2018

eth-r commented Jan 8, 2018

Stebalien commented Jan 8, 2018

SleeplessByte commented Jun 19, 2019 •

edited

Stebalien commented Jul 9, 2019

SleeplessByte commented Jul 9, 2019

vmx commented Jul 16, 2019

Stebalien commented Jul 24, 2019

vmx commented Jun 9, 2020

Canonical encoding rules for ambiguous situations #34

Canonical encoding rules for ambiguous situations #34

Comments

eth-r commented Jan 8, 2018

Stebalien commented Jan 8, 2018 • edited

eth-r commented Jan 8, 2018

eth-r commented Jan 8, 2018

Stebalien commented Jan 8, 2018

eth-r commented Jan 8, 2018

Stebalien commented Jan 8, 2018

SleeplessByte commented Jun 19, 2019 • edited

Stebalien commented Jul 9, 2019

SleeplessByte commented Jul 9, 2019

vmx commented Jul 16, 2019

Stebalien commented Jul 24, 2019

vmx commented Jun 9, 2020

Stebalien commented Jan 8, 2018 •

edited

SleeplessByte commented Jun 19, 2019 •

edited