Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Endianness #4

Closed
marshallpierce opened this issue Sep 30, 2017 · 12 comments
Closed

Endianness #4

marshallpierce opened this issue Sep 30, 2017 · 12 comments

Comments

@marshallpierce
Copy link

From Fig 3 in https://arxiv.org/pdf/1709.08990.pdf, it looks like the data layout is intended to be big-endian. However, in the test data from https://bitbucket.org/marshallpierce/stream-vbyte-rust/commits/ad95ed76e271a10c0c0bb57e23800a4e23d606e9 encoding 0, 100, 200, 300, ..., we have the following hex (format from hexdump -C):

00000000  40 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |@UUUUUUUUUUUUUUU|
00000010  55 55 55 55 55 55 55 55  55 55 55 55 55 55 55 55  |UUUUUUUUUUUUUUUU|

Since the first four numbers are 0, 100, 200, 300 taking 1, 1, 1, 2 bytes respectively, based on the figure's diagram of control bits to encoded ints I would expect the first control byte to be 0b00000001 = 0x01, not 0b01000000 = 0x40.

There are 1250 = 0x4E2 control bytes, so looking at where the encoded numbers start, we see:

000004e0  aa aa 00 64 c8 2c 01 90  01 f4 01 58 02 bc 02 20  |...d.,.....X... |

0 = 0x00, 100 = 0x64, 200 = 0xC8 are single bytes of course, but 300 = 0x012C in big endian, and the sample data has 0x2C01.

Of course, little-endian is just as valid a choice as big-endian! Am I misinterpreting the paper? Should I be letting the user choose which endianness to expect?

@lemire
Copy link
Owner

lemire commented Sep 30, 2017

Thanks. Obviously, we can change the ordering.

I think it is reasonable to try to define things precisely. Let us do that!

@lemire
Copy link
Owner

lemire commented Oct 2, 2017

Pinging @KWillets who is working on a vectorized encoder.

@KWillets
Copy link
Collaborator

KWillets commented Oct 3, 2017

I did see the ordering in the code -- highest-order bits correspond to the last int. I can change it if needed.

@lemire
Copy link
Owner

lemire commented Oct 3, 2017

I think it would be worth it flipping things around as suggested by @marshallpierce. Maybe.

@marshallpierce
Copy link
Author

There are two endianness issues -- the ordering of 2-bit pairs in the control byte, and the layout of the bytes in each 32-bit int. If I'm reading the code correctly, the latter is just using the platform's native endian-ness, and we should probably specify that to be either little-endian or big-endian. For SIMD code at least it doesn't matter much which way we pick as we can shuffle the bytes any which way in decoding.

@lemire
Copy link
Owner

lemire commented Oct 3, 2017

@marshallpierce

I don't think you should take the figures in the paper as a format specification. I agree that it is now a good idea to specify byte orders and stuff, that is, specify the format, so that we can write different but interoperable versions. But using the figures in the paper is not the best way. I think I drew these figures and when I did so, I did not have in mind that it was meant to specify an interoperable format. I would have been a lot more careful and specific otherwise!

  1. Because the implementation assumes an x64 processor, then it makes sense to assume little endian encoding. The "standard" VByte format is "little endian" (you start from the least significant bytes). Virtually all hardware that I care about (ARM, x64) can be considered to be little endian.

So unless someone has a compelling argument, I propose to keep things little endian. Though I also propose to openly state that it is so, and the code should probably be safe (that is, do the right thing on big endian processors). The current C code is not portable (won't build on non-x64 platforms), but portable code ought to work on both little endian and big endian machines.

  1. Regarding the control bytes... initially, after reading your comment, I felt strongly that the implementation was backward... but I am no longer so sure. If the sole argument for flipping the order of the 2-bit words is "this is how it looks in the paper", then I don't think it is the best argument. The paper just illustrates the data, it does not specify exactly how it should be physically written.

So I'd like to suggest that we only change the order if we have some kind of case for doing so that goes beyond the interpretation of the figures in the paper.

What I'd like to do, from this point forward, is to precisely specify the ordering of the bytes and the 2-bit words, which can be done with a paragraph and a quick schema.

c.c. @KWillets

@marshallpierce
Copy link
Author

Making the encoded ints be little-endian makes sense to me. It will also help the scalar implementation on little-endian architectures.

I would weakly argue that the control bytes should have their bits in the same order as the encoded numbers they represent because I think that that requires less explanation and is easier to grasp, but it's not a big deal either way if it's clearly documented. One could argue that since the numbers are encoded little-endian style, the control bits are also little-endian, sorta...

@lemire
Copy link
Owner

lemire commented Oct 3, 2017

@KWillets : Do you have an opinion regarding the order of the 2-bit words?

@KWillets
Copy link
Collaborator

KWillets commented Oct 3, 2017

I have no opinion; I believe the implementation will work the same either way.

@lemire
Copy link
Owner

lemire commented Oct 3, 2017

I sketched a format specification... https://github.com/lemire/streamvbyte/blob/master/README.md#format-specification

Please re-open if you disagree or have further comments.

@lemire lemire closed this as completed Oct 3, 2017
@marshallpierce
Copy link
Author

No need to re-open, but this seemed like the best place to put this note.

In case it comes in handy for further compatibility testing, I added a simple "example", in Cargo parlance, to the Rust implementation that can be used to easily generate encoded versions of any sequence of numbers (see the readme). I'm happy to add more features to it if they're handy for people. (Delta-encoding sorted ints will certainly end up there once I've implemented that.)

@lemire
Copy link
Owner

lemire commented Oct 5, 2017

@marshallpierce Fantastic work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants