Endianness #4

marshallpierce · 2017-09-30T15:24:10Z

From Fig 3 in https://arxiv.org/pdf/1709.08990.pdf, it looks like the data layout is intended to be big-endian. However, in the test data from https://bitbucket.org/marshallpierce/stream-vbyte-rust/commits/ad95ed76e271a10c0c0bb57e23800a4e23d606e9 encoding 0, 100, 200, 300, ..., we have the following hex (format from hexdump -C):

00000000  40 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |@UUUUUUUUUUUUUUU|
00000010  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|

Since the first four numbers are 0, 100, 200, 300 taking 1, 1, 1, 2 bytes respectively, based on the figure's diagram of control bits to encoded ints I would expect the first control byte to be 0b00000001 = 0x01, not 0b01000000 = 0x40.

There are 1250 = 0x4E2 control bytes, so looking at where the encoded numbers start, we see:

000004e0  aa aa 00 64 c8 2c 01 90  01 f4 01 58 02 bc 02 20  |...d.,.....X... |

0 = 0x00, 100 = 0x64, 200 = 0xC8 are single bytes of course, but 300 = 0x012C in big endian, and the sample data has 0x2C01.

Of course, little-endian is just as valid a choice as big-endian! Am I misinterpreting the paper? Should I be letting the user choose which endianness to expect?

The text was updated successfully, but these errors were encountered:

lemire · 2017-09-30T15:53:23Z

Thanks. Obviously, we can change the ordering.

I think it is reasonable to try to define things precisely. Let us do that!

lemire · 2017-10-02T23:38:16Z

Pinging @KWillets who is working on a vectorized encoder.

KWillets · 2017-10-03T00:24:32Z

I did see the ordering in the code -- highest-order bits correspond to the last int. I can change it if needed.

lemire · 2017-10-03T00:40:21Z

I think it would be worth it flipping things around as suggested by @marshallpierce. Maybe.

marshallpierce · 2017-10-03T01:23:37Z

There are two endianness issues -- the ordering of 2-bit pairs in the control byte, and the layout of the bytes in each 32-bit int. If I'm reading the code correctly, the latter is just using the platform's native endian-ness, and we should probably specify that to be either little-endian or big-endian. For SIMD code at least it doesn't matter much which way we pick as we can shuffle the bytes any which way in decoding.

lemire · 2017-10-03T02:40:22Z

@marshallpierce

I don't think you should take the figures in the paper as a format specification. I agree that it is now a good idea to specify byte orders and stuff, that is, specify the format, so that we can write different but interoperable versions. But using the figures in the paper is not the best way. I think I drew these figures and when I did so, I did not have in mind that it was meant to specify an interoperable format. I would have been a lot more careful and specific otherwise!

Because the implementation assumes an x64 processor, then it makes sense to assume little endian encoding. The "standard" VByte format is "little endian" (you start from the least significant bytes). Virtually all hardware that I care about (ARM, x64) can be considered to be little endian.

So unless someone has a compelling argument, I propose to keep things little endian. Though I also propose to openly state that it is so, and the code should probably be safe (that is, do the right thing on big endian processors). The current C code is not portable (won't build on non-x64 platforms), but portable code ought to work on both little endian and big endian machines.

Regarding the control bytes... initially, after reading your comment, I felt strongly that the implementation was backward... but I am no longer so sure. If the sole argument for flipping the order of the 2-bit words is "this is how it looks in the paper", then I don't think it is the best argument. The paper just illustrates the data, it does not specify exactly how it should be physically written.

So I'd like to suggest that we only change the order if we have some kind of case for doing so that goes beyond the interpretation of the figures in the paper.

What I'd like to do, from this point forward, is to precisely specify the ordering of the bytes and the 2-bit words, which can be done with a paragraph and a quick schema.

c.c. @KWillets

marshallpierce · 2017-10-03T02:57:09Z

Making the encoded ints be little-endian makes sense to me. It will also help the scalar implementation on little-endian architectures.

I would weakly argue that the control bytes should have their bits in the same order as the encoded numbers they represent because I think that that requires less explanation and is easier to grasp, but it's not a big deal either way if it's clearly documented. One could argue that since the numbers are encoded little-endian style, the control bits are also little-endian, sorta...

lemire · 2017-10-03T04:15:18Z

@KWillets : Do you have an opinion regarding the order of the 2-bit words?

KWillets · 2017-10-03T04:32:55Z

I have no opinion; I believe the implementation will work the same either way.

lemire · 2017-10-03T15:31:14Z

I sketched a format specification... https://github.com/lemire/streamvbyte/blob/master/README.md#format-specification

Please re-open if you disagree or have further comments.

marshallpierce · 2017-10-05T15:21:32Z

No need to re-open, but this seemed like the best place to put this note.

In case it comes in handy for further compatibility testing, I added a simple "example", in Cargo parlance, to the Rust implementation that can be used to easily generate encoded versions of any sequence of numbers (see the readme). I'm happy to add more features to it if they're handy for people. (Delta-encoding sorted ints will certainly end up there once I've implemented that.)

lemire · 2017-10-05T15:37:56Z

@marshallpierce Fantastic work!

lemire closed this as completed Oct 3, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Endianness #4

Endianness #4

marshallpierce commented Sep 30, 2017

lemire commented Sep 30, 2017

lemire commented Oct 2, 2017

KWillets commented Oct 3, 2017

lemire commented Oct 3, 2017 •

edited

Loading

marshallpierce commented Oct 3, 2017

lemire commented Oct 3, 2017

marshallpierce commented Oct 3, 2017

lemire commented Oct 3, 2017

KWillets commented Oct 3, 2017

lemire commented Oct 3, 2017

marshallpierce commented Oct 5, 2017

lemire commented Oct 5, 2017

Endianness #4

Endianness #4

Comments

marshallpierce commented Sep 30, 2017

lemire commented Sep 30, 2017

lemire commented Oct 2, 2017

KWillets commented Oct 3, 2017

lemire commented Oct 3, 2017 • edited Loading

marshallpierce commented Oct 3, 2017

lemire commented Oct 3, 2017

marshallpierce commented Oct 3, 2017

lemire commented Oct 3, 2017

KWillets commented Oct 3, 2017

lemire commented Oct 3, 2017

marshallpierce commented Oct 5, 2017

lemire commented Oct 5, 2017

lemire commented Oct 3, 2017 •

edited

Loading