Wrong endianness of bit-sized integers #155
FYI, I have actually implemented big-endian, little-endian, signed and unsigned bit fields (compiler and C++ runtime). I am currently testing my solution; when I am happy with the results I will open a pull request for these.

---
The main problem here is that implementations other than "big-endian" are not very trivial. For example, consider parsing 3 bytes `x[0..2]` as a 1-bit field "a", a 16-bit field "b" and a 7-bit field "c". Interpreting "b" as big-endian is straightforward:

```
b =
  ((x[2] & 0b1000_0000) >> 7) |
  (x[1] << 1) |
  ((x[0] & 0b0111_1111) << 9)
```

But I see at least two possible interpretations of "b" as little-endian here:

```
// Method 1
// Dividing by x byte boundaries, reassembling by them as well
b =
  (x[0] & 0b0111_1111) |
  (x[1] << 7) |
  ((x[2] & 0b1000_0000) << 8) // actually >> 7 << 15

// Method 2
// First, reassemble "big-endian" 2-byte integer
b0 =
  ((x[2] & 0b1000_0000) >> 7) |
  (x[1] << 1) |
  ((x[0] & 0b0111_1111) << 9)
// Second, reverse endianness in it, i.e. reassemble per 2-byte integer byte boundaries
b =
  ((b0 & 0xff00) >> 8) |
  ((b0 & 0x00ff) << 8)
```

For example, for a given
And this is even without touching the subject of bit order within bytes, which is also important, as many compression stream formats rely on it.

---
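As a side note, the three readings of "b" discussed above can be checked with a quick Python sketch, using the sample bytes `4a 9d d0` that appear later in this thread:

```python
# Three readings of the 16-bit field "b" out of 3 bytes x[0..2],
# where "b" occupies bits 1..16 of the 24-bit stream (1+16+7 scheme).
x = [0x4a, 0x9d, 0xd0]

# Big-endian: bits taken MSB-first across byte boundaries
b_be = ((x[2] & 0b1000_0000) >> 7) | (x[1] << 1) | ((x[0] & 0b0111_1111) << 9)

# Method 1: reassemble little-endian along the original byte boundaries
b_le1 = (x[0] & 0b0111_1111) | (x[1] << 7) | ((x[2] & 0b1000_0000) << 8)

# Method 2: read big-endian first, then byte-swap the 16-bit result
b_le2 = ((b_be & 0xff00) >> 8) | ((b_be & 0x00ff) << 8)

print(f"{b_be:04x} {b_le1:04x} {b_le2:04x}")  # 953b ceca 3b95
```

The three interpretations clearly disagree on the same input, which is the crux of this discussion.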
You are right that this is tricky - I have implemented neither of those cases, as LE vs. BE in my use case relates to the byte order only. What I have currently implemented is the case in which bytes are ordered from LS to MS, and bits within bytes are ordered from MS to LS. That means that in your example, in the LE case, the bits 0-23 in the stream are assumed to be in the following order:

so selecting bits 1-16 into a 16-bit integer (bits i15-i14-...-i0 from MSB to LSB) in my case gives the following mapping:

Unlike your example, in this case the selected bits are actually no longer consecutive, due to the byte ordering being different from the bit ordering.

---
For completeness, if I understand correctly, your first method above maps to this layout:

while the second would correspond to:

Are you aware of any formats that actually use these methods? In a third potential interpretation, I could think of the byte boundaries playing no role at all: the bits of the big-endian integer are just taken in reverse order regardless of the byte they come from:
---
While it would not be too hard to add different bit-within-byte orderings to my current bit-field implementation, I guess this is something that also affects the current integer types and would rather have to be handled in a broader sense?

---
I understand that GreyCat is no fan of this, but I propose it anyway, as maybe others can add to the discussion: often, C bit fields are used to describe small values, like this:

The order of the bits is clearly defined, for both big and little endian. If the ksy syntax adopted this somehow, avoiding the current bit-stream syntax, it would be an uncomplicated and non-confusing solution for these kinds of bit fields. The runtime support would need a new bit-read function for this, though: one that gets passed the full size of the value (in bytes or bits) from which to extract the bits, so that it can do this right for little-endian (for big-endian, that value can be ignored, unless it should also throw an error when asked to read more bits than are available).

---
We don't really need the full size of the value. Here's a drop-in replacement procedure:

```ruby
def read_bits_int_le(n)
  bits_needed = n - @bits_left
  if bits_needed > 0
    # 1 bit  => 1 byte
    # 8 bits => 1 byte
    # 9 bits => 2 bytes
    bytes_needed = ((bits_needed - 1) / 8) + 1
    buf = read_bytes(bytes_needed)
    buf.each_byte { |byte|
      @bits |= (byte << @bits_left)
      @bits_left += 8
    }
  end

  # raw mask with required number of 1s, starting from lowest bit
  mask = (1 << n) - 1
  # derive reading result
  res = @bits & mask
  # remove bottom bits that we've just read by shifting
  @bits >>= n
  @bits_left -= n

  res
end
```

Testing it:

```ruby
# Raw bytes: 21 f3
INPUT = [0x21, 0xf3].pack('C*')

# Little-endian, low-to-high byte order
# Expected output: 321 f
stream = Kaitai::Struct::Stream.new(INPUT)
printf "%x\n", stream.read_bits_int_le(12)
printf "%x\n", stream.read_bits_int_le(4)

# Big-endian, high-to-low byte order (traditional)
# Expected output: 21f 3
stream = Kaitai::Struct::Stream.new(INPUT)
printf "%x\n", stream.read_bits_int(12)
printf "%x\n", stream.read_bits_int(4)
```

It works: the actual output matches the expected one.

---
I disagree with your above results being correct. Here's the C code:

Now I can expect in both LE and BE that:

Your code doesn't meet that: you're wrong where you say

---
What does it have to do with this code at all? If you're willing to compare it to C, please at least cast a byte array:

```c
struct T {
  int16_t a : 12;
  int16_t b : 4;
};

char i[] = { 0x12, 0x34 };
struct T *v = (struct T *) i;
printf("%x\n%x\n", v->a, v->b); // output: 412 3
```

I get exactly the same output for

---
But I wrote that the BE result is wrong. The LE result is fine, just as you have now again verified. The BE result should, in this last example, give 234 1, but your read_bits code doesn't do that.

---
@JaapAap I believe your implementation is very close to the one I've just posted? Given that it's the way of packing bits that comes standard in all (?) C implementations, I wonder if we should support it as the default one, and return for more if and when the need for more complicated formats arises.

Returning back to the very first example, parsing 3 bytes = 24 bits of data with a 1+16+7 scheme, for a given

```c
struct T {
  uint32_t a: 1;
  uint32_t b: 16;
  uint32_t c: 7;
};

char x[] = { 0x4a, 0x9d, 0xd0 };
struct T *v = (struct T *) x;
printf("a=%x b=%x c=%x\n", v->a, v->b, v->c);
```

It will result in:

For comparison, the two methods from above yield

---
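The LSB-first packing used by common little-endian C ABIs (which the struct above relies on) can be modeled with a short Python sketch. Note that `unpack_lsb_first` is a hypothetical helper, and actual C results depend on the compiler's implementation-defined bit-field layout:

```python
def unpack_lsb_first(data, widths):
    # Treat the bytes as one little-endian integer, then peel fields off
    # starting from the least significant bit (as common LE C ABIs do).
    val = int.from_bytes(bytes(data), "little")
    fields = []
    for w in widths:
        fields.append(val & ((1 << w) - 1))
        val >>= w
    return fields

a, b, c = unpack_lsb_first([0x4a, 0x9d, 0xd0], [1, 16, 7])
print(f"a={a:x} b={b:x} c={c:x}")  # a=0 b=4ea5 c=68
```

Note that this differs from both little-endian methods discussed earlier, since here the 1-bit field "a" comes from the lowest bit of the first byte rather than the highest.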
In Construct, BitsInteger supports little-endianness, but only for multiples of 8 bits. I had multiple questions over the years about bitwise endianness, and came to the conclusion that the entire topic is garbage. There is no (1) standard, (2) sane way of doing this, because little endian is by definition swapping octets, which is then supposedly applied to non-octet data.

---
So we can create our own one ;)

1. See #76 for non-octet endianness (read it to understand the example!)

---
It seems to me that you keep discussing only endianness, when you actually need to discuss TWO DIFFERENT things:

- byte order
- bit order within bytes

So, what you'll eventually need are two independent settings (or parameters), one for each of these modes. I hope that's clear to everyone.

---
We can generalize it to arbitrary-sized chunks. Arbitrary assumes a single bit, too. See #76, which allows an arbitrary number of independent parameters for notation. In this definition of endianness, bit orders and byte orders are just special cases of endianness.

---
Given that we more or less have consensus on the implementation of little-endian bit reading, here's the suggestion for syntax.

---
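For reference, the syntax that eventually shipped in Kaitai Struct 0.9 uses a `bit-endian` key in `meta` (this is an illustrative sketch of that released syntax, not necessarily the exact proposal made in this comment):

```yaml
meta:
  id: le_bits_example
  bit-endian: le    # bit-sized fields below are read little-endian
seq:
  - id: a
    type: b12       # 12-bit integer
  - id: b
    type: b4        # 4-bit integer
```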
Why should it be done this way? IMHO, here is a case where compatibility can be broken. These numbers are BE even on LE machines, and this prevents their adoption because of compatibility issues. IMHO the behavior should be consistent with

---

I wonder if the cases where bit endianness differs from the global endianness are that frequent. What is usually found in software is the pipeline

---
Here is an impl in Python.

---
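For illustration, a Python equivalent of the little-endian bit-reading procedure shown earlier in Ruby might look like the following. This is a standalone sketch with a hypothetical `LeBitReader` class, not the actual Kaitai Python runtime:

```python
class LeBitReader:
    """Reads little-endian bit fields from a byte string (sketch)."""

    def __init__(self, data):
        self.data = data
        self.pos = 0        # next byte offset in data
        self.bits = 0       # bit accumulator
        self.bits_left = 0  # number of valid bits in the accumulator

    def read_bits_int_le(self, n):
        bits_needed = n - self.bits_left
        if bits_needed > 0:
            # 1 bit => 1 byte, 8 bits => 1 byte, 9 bits => 2 bytes
            bytes_needed = (bits_needed - 1) // 8 + 1
            for byte in self.data[self.pos:self.pos + bytes_needed]:
                # new bytes are stacked on top of the accumulator
                self.bits |= byte << self.bits_left
                self.bits_left += 8
            self.pos += bytes_needed
        mask = (1 << n) - 1  # n ones, starting from the lowest bit
        res = self.bits & mask
        self.bits >>= n      # drop the bits we've just consumed
        self.bits_left -= n
        return res

r = LeBitReader(b"\x21\xf3")
print(hex(r.read_bits_int_le(12)), hex(r.read_bits_int_le(4)))  # 0x321 0xf
```

This reproduces the `321 f` result from the Ruby test earlier in the thread.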
I've finally implemented the little-endian bit reads in all languages except Go (I'm waiting for kaitai-io/kaitai_struct_go_runtime#25 to be reviewed and merged; then it'll be trivial to implement it in Go as well). I added tests.

The check for an illegal LE-BE mix in the middle of a byte is not yet implemented. I tried to stuff it into the

However, now I think it's not correct to check it there. I tried to add an error format into the

---
Here are my thoughts on this: I'm implementing a parser using

So in my opinion, silently changing the behaviour of reading bits doesn't sound like a good idea, and if it really needs to be done, a warning isn't sufficient. Without an explicit

I'm somewhat surprised how easily you talk about breaking backwards compatibility with such an important and in the past heavily requested feature, used as it is now for years. I surely won't be the only one stumbling over this break if you actually do it.

---
As I understand it, there's no intention to break backwards compatibility - the new In an earlier comment (#155 (comment)) I suggested adding a warning for the case where bit fields are used in a context with
I didn't mean that as "we should break backwards compatibility because the current behavior is suboptimal", but rather as "this is the behavior I would choose if there were no backwards compatibility constraints". Because of backwards compatibility, we can't actually change the default behavior. Instead I suggested adding a warning, to let spec writers know that the default behavior may not be what they expect, and to encourage them to explicitly select the behavior that they need.

---
BTW, I have tried to use it on my kaitai impls of IEEE 754 (by replacing the manual LE handling with the simple code from the BE impls and setting

---
Could you add some details like a hex dump, .ksy spec, actual and expected output, so we can work with something?

---
Make sure you're using the correct order of the fields, as I described in the bit layout in #155 (comment).

---
Thanks, I had missed the fact that the new proposal swaps field order, as if

---
Should this issue be closed? IIRC little-endian bit field support has been fully implemented and was released with version 0.9. Or is there anything left to be implemented that I forgot about?

---
Well, the only reason why this issue is still open is that the compile-time check for an illegal LE-BE mix on an unaligned bit is still not implemented, as @GreyCat proposed in #155 (comment):

And I updated my progress in #155 (comment):

But no progress has been made since then, and it isn't really the top thing on my priority list. It would probably make sense to extract this feature into a separate issue, but I'm not sure.

---
For now, bit-sized integers are always big-endian. I think that endianness should also affect them.