Please remove header #3

Closed
zmorris opened this Issue Nov 27, 2016 · 10 comments

Projects

None yet

4 participants

@zmorris
zmorris commented Nov 27, 2016

http://blog.kevinalbs.com/base122#a_minor_note_on_the_last_character

The 1 byte header makes it impossible to encode when the input size is unknown. Please remove the header and put it at the end of the stream, similar to the padding in base64. Otherwise, keep up the good work!

@andersk
andersk commented Nov 28, 2016 edited

A simpler scheme would be to pad with 1, 10, 100, 1000, etc. rather than with 0, 00, 000, 0000, etc. Then the padding is self-delimiting and no flag is needed at all, as long as you guarantee the use of at least one bit of padding. (This is the same padding scheme used by most hash algorithms, such as MD5 and SHA.)

@kevinAlbs
Owner

@zmorris Thanks for bringing this up (and for making the issue). This makes perfect sense.
@andersk I need to think about it a little more, especially in the case when the last two-byte character would need to encode 14 bits.

@andersk
andersk commented Nov 28, 2016 edited

In this scheme, if the last two-byte character would need to encode 14 bits, it is not the last character. You must use at least one bit of padding. So it would be followed by a one-byte character 01000000 encoding seven padding bits 1000000.

@kevinAlbs
Owner

Here is another idea after reading this comment. Since there are only six illegal characters and three bits are used to encode this, there two values left to use. One of these values can be used to signify if the last two-byte character is shortened (<= 7 encoded bits), and the remainder of the two-byte character would have the encoded data.

The nice thing about this is that it removes the header, won't add another byte, and requires only a small change to the implementation (it actually shortens the decoder!)

@GeertBosch

Hi @kevinAlbs, great work and congrats on your HN appearance! +1 on getting rid of the header byte, that just ruins any kind of streaming use. There are many ways around this without a header. Probably the simplest is to use one of the two encodings you have left in the sss field, say the value 111. This value would just indicate "no character", and the final character is in the remaining bits. A nice property of your encoding is that it pretty much has a fixed expansion ratio. The encoded size can be computed from the length of the unencoded string and its last two characters, and similarly for computing the unencoded size.

With more trickery it would be possible to encode 139 possibilities per encoded byte for an encoding density of 8 bytes per 9 characters (12.5% expansion) instead of 7 bytes per 8 (14.3% expansion). However, encoding will be so much more complex and slow that it's surely not worth it.

I'm curious about how you can embed base-122 encoded data in HTML, as unlike JavaScript, that standard seems to forbid control characters.

@kevinAlbs
Owner

@GeertBosch Thank you! I just pushed this commit which implements just that.

That sounds interesting, how would you achieve 139 possibilities per encoded byte?

And that is a good point regarding control characters. I recall looking at this spec and thinking that the only exceptions were the null char and &, but overlooked this part. All of the browsers I tested on (Edge, IE11, FF, Chrome, Safari) all seem to parse the control characters as expected, but I'd rather not rely on that behavior if undefined, even if using this for HTML is experimental. However, I think there is a straightforward way to move these strings to Javascript without degrading performance (and will also free up the &).

@GeertBosch

You can change the 110sss1x 10xxxxxx format back into 110zzzzz 10zzzzzz. Of the 2048 possibilities for the value of z, the first 128 are not allowed. Define the value the first base-139 digit as c1 = 122 + (z - 128) / 112. The second base-139 digit is c2 = (z - 128) % 112 - 1. If c2 == -1 there is no second character (end of text). For value of c2 exceeding 112, use the 3-byte encoding 1110zzzz 10zzzzzz 10zzzzzz instead. We need to use the 65536 - 2048 == 63488 possibilities to encode 139 - 122 == 17 excess characters for c1, 139 -111 == 28 excess characters for c2, leaving us with 63448 / 17 / 28 == 133 possible values for c3. For the final 7 excess values, we can use the 11110zzz 10zzzzzz 10zzzzzz 10zzzzzz format. The 2031616 usable values in that format are sufficient for the 17 * 28 * 7 * 140 == 466480 needed values.

Besides the considerable amount of logic for excess values, there is also a significant computational overhead to convert 9 base-139 characters into 8 base-256 bytes. This was more as an exercise to look at what the boundaries are of the amount of data one can store in a UTF-8 string, when excluding a small set of byte values (6 in this case) from consideration. Eight bytes of binary data per nine UTF-8 bytes with 6 reserved values is surprisingly close to the theoretical maximum.

@kevinAlbs
Owner

@GeertBosch Neat! IIUC this still only occurs when the first digit is > 122, so the additional savings will only occur then?

@GeertBosch

You'd split your binary input into base-139 digits, so savings will apply always. Even if you encounter only values < 122, you encoded log2(139) = 7.12 bits of information per byte. Of course, splitting binary input in base-139 digits (8 bytes become 9 base-139 digits) is yet more work, so not really practical. Your encoding gets quite close (7 bits per byte) and is far more efficient.

@kevinAlbs
Owner

Ah yes I see what you mean. I was thinking in terms of bits, not base-139 digits. So even if the value is < 122, you're encoding the entire base-139 digit in one byte. Very cool!

I suppose I should close this issue since we've strayed a bit and it has been resolved.

@kevinAlbs kevinAlbs closed this Dec 1, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment