
Wrong endianness of bit-sized integers #155

Open
KOLANICH opened this issue Apr 18, 2017 · 52 comments

@KOLANICH

Currently, bit-sized integers are always big-endian. I think endianness should affect them as well.

@JaapAap

JaapAap commented May 16, 2017

FYI, I have actually implemented big-endian, little-endian, signed, and unsigned bit fields (compiler and C++ runtime). I am currently testing my solution; when I am happy with the results, I will open a pull request for these.

@GreyCat
Member

GreyCat commented May 19, 2017

The main problem here is that implementations other than "big-endian" are not trivial. For example, consider this:

x[0]     x[1]     x[2]     
76543210 76543210 76543210
abbbbbbb bbbbbbbb bccccccc

Interpreting "b" as big-endian is straightforward:

b =
  ((x[2] & 0b1000_0000) >> 7) |
  (x[1] << 1) |
  ((x[0] & 0b0111_1111) << 9)

But I see at least two possible interpretations of "b" as little-endian here:

// Method 1
// Dividing by x byte boundaries, reassembling by them as well
b =
  ((x[0] & 0b0111_1111)) |
  (x[1] << 7) |
  ((x[2] & 0b1000_0000) << 8) // actually >> 7 << 15

// Method 2
// First, reassemble "big-endian" 2-byte integer
b0 =
  ((x[2] & 0b1000_0000) >> 7) |
  (x[1] << 1) |
  ((x[0] & 0b0111_1111) << 9)
// Second, reverse endianness in it, i.e. reassemble per 2-byte integer byte boundaries
b =
  ((b0 & 0xff00) >> 8) |
  ((b0 & 0x00ff) << 8)

For example, for a given x = [0x4a, 0x9d, 0xd0]:

  • first method yields b = 0xceca = 0b1100_1110_1100_1010
  • second method yields b = 0x3b95 = 0b0011_1011_1001_0101

And this is even without touching a subject of bit order in bytes — which is also actually important, as many compression stream formats rely on that one.
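
For reference, here is a small self-contained C sketch (mine, not from the thread) that computes both candidate interpretations for x = [0x4a, 0x9d, 0xd0] and reproduces the values quoted above:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint8_t x[] = { 0x4a, 0x9d, 0xd0 };

    /* Method 1: split on source byte boundaries, reassemble low-to-high */
    uint16_t b1 = (uint16_t)((x[0] & 0x7f) | (x[1] << 7) | ((x[2] & 0x80) << 8));

    /* Method 2: reassemble as big-endian first, then swap the two bytes */
    uint16_t b0 = (uint16_t)(((x[2] & 0x80) >> 7) | (x[1] << 1) | ((x[0] & 0x7f) << 9));
    uint16_t b2 = (uint16_t)((b0 >> 8) | (b0 << 8));

    printf("method 1: b=%04x\nmethod 2: b=%04x\n", b1, b2);  /* ceca, 3b95 */
    return 0;
}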

@JaapAap

JaapAap commented May 20, 2017

You are right that this is tricky. I have implemented neither of those cases, as LE vs. BE in my use case relates to the byte order only. What I have currently implemented is the case in which bytes are ordered from LS to MS while bits within bytes are ordered from MS to LS; the endianness is therefore related to the byte order alone. That means that, in your example, in the LE case the bits 0-23 in the stream are assumed to be in the following order:

              x[0]                              x[1]                              x[2]
7   6   5   4   3   2   1   0     7   6   5   4   3   2   1   0     7   6   5   4   3   2   1   0
b7  b6  b5  b4  b3  b2  b1  b0    b15 b14 b13 b12 b11 b10 b9 b8     b23 b22 b21 b20 b19 b18 b17 b16

so selecting bits 1-16 into a 16-bit integer (bits i15-i14-...-i0 from MSB to LSB) in my case gives the following mapping:

              x[0]                              x[1]                              x[2]     
7   6   5   4   3   2   1   0     7   6   5   4   3   2   1   0     7   6   5   4   3   2   1   0
i6  i5  i4  i3  i2  i1  i0  X     i14 i13 i12 i11 i10 i9  i8  i7    X   X   X   X   X   X   X   i15

Unlike your example, in this case the selected bits are actually no longer consecutive due to the byte ordering being different from the bit ordering.
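
A minimal C expression implementing the mapping in the table above (a sketch under my reading of it, not JaapAap's actual code) would be:

#include <stdint.h>
#include <stdio.h>

/* i15..i0 from stream bits 1..16, per the byte-LE / bit-MSB-first layout */
static uint16_t extract_i(const uint8_t x[3]) {
    return (uint16_t)(((x[0] >> 1) & 0x7f)    /* i6..i0  = x[0] bits 7..1 */
                    | (x[1] << 7)             /* i14..i7 = x[1] bits 7..0 */
                    | ((x[2] & 1) << 15));    /* i15     = x[2] bit 0     */
}

int main(void) {
    const uint8_t x[] = { 0x4a, 0x9d, 0xd0 };
    printf("%04x\n", extract_i(x));  /* prints 4ea5 */
    return 0;
}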

@JaapAap

JaapAap commented May 20, 2017

For completeness, if I understand correctly, your first method above maps to this layout:

           x[0]                              x[1]                              x[2]     
7  6   5   4   3   2   1   0     7   6   5   4   3   2   1   0     7   6   5   4   3   2   1   0
a  i6  i5  i4  i3  i2  i1  i0    i14 i13 i12 i11 i10 i9  i8  i7    i15 c   c   c   c   c   c   c

while the second would correspond to:

           x[0]                              x[1]                              x[2]     
7  6   5   4   3   2   1   0    7   6   5   4   3   2   1   0     7   6   5   4   3   2   1   0
a  i7  i6  i5  i4  i3  i2  i1   i0  i15 i14 i13 i12 i11 i10 i9    i8  c   c   c   c   c   c   c

Are you aware of any formats that actually use these methods? In a third potential interpretation I can think of, the byte boundaries play no role at all: the bits of the big-endian integer are just taken in reverse order, regardless of the byte they come from:

           x[0]                              x[1]                              x[2]     
7  6   5   4   3   2   1   0     7   6   5   4   3   2   1   0     7   6   5   4   3   2   1   0
a  i0  i1  i2  i3  i4  i5  i6    i7  i8  i9  i10 i11 i12 i13 i14   i15 c   c   c   c   c   c   c

@JaapAap

JaapAap commented May 21, 2017

While it would not be too hard to add different bit-within-byte orderings to my current bit-field implementation, I guess this is something that also affects the current integer types and would rather have to be handled in a broader sense?

@tempelmann

tempelmann commented Sep 22, 2017

I understand that GreyCat is no fan of this, but I propose it anyway, as maybe others can add to the discussion:

Often, C bit fields are used to describe small values. Like this:

struct x {
  int16_t a : 12;
  int16_t b : 4;
};

The order of the bits is clearly defined, for both big and little endian.

If the ksy syntax adopted this somehow, avoiding the current bit-stream syntax, it would make for an uncomplicated and non-confusing solution for these kinds of bit fields.

The runtime support would need a new bit-read function for this, though: one that gets passed the full size of the value (in bytes or bits) from which to extract the bits, so that it can do this right for little-endian (for big-endian, that value can be ignored, unless it also wants to throw an error when asked to read more bits than are available).

@GreyCat
Member

GreyCat commented Sep 22, 2017

We don't really need the full size of the value. Here's a drop-in replacement for read_bits_int that works in "little-endian, low-to-high bit order" fashion:

  def read_bits_int_le(n)
    bits_needed = n - @bits_left
    if bits_needed > 0
      # 1 bit  => 1 byte
      # 8 bits => 1 byte
      # 9 bits => 2 bytes
      bytes_needed = ((bits_needed - 1) / 8) + 1
      buf = read_bytes(bytes_needed)
      buf.each_byte { |byte|
        @bits |= (byte << @bits_left)
        @bits_left += 8
      }
    end

    # raw mask with required number of 1s, starting from lowest bit
    mask = (1 << n) - 1
    # derive reading result
    res = @bits & mask
    # remove bottom bits that we've just read by shifting
    @bits >>= n
    @bits_left -= n

    res
  end

Testing it:

# Raw bytes: 21 f3
INPUT = [0x21, 0xf3].pack('C*')

# Little-endian, low-to-high bit order
# Expected output: 321 f
stream = Kaitai::Struct::Stream.new(INPUT)
printf "%x\n", stream.read_bits_int_le(12)
printf "%x\n", stream.read_bits_int_le(4)

# Big-endian, high-to-low bit order (traditional)
# Expected output: 21f 3
stream = Kaitai::Struct::Stream.new(INPUT)
printf "%x\n", stream.read_bits_int(12)
printf "%x\n", stream.read_bits_int(4)

It works — actual output matches expected.

@GreyCat
Member

GreyCat commented Sep 22, 2017

Here's a side-by-side comparison of the two algorithms, for reference:

[image: side-by-side comparison of big-endian vs. little-endian bit reading (side-by-side_le)]

@tempelmann

tempelmann commented Sep 22, 2017

I disagree that your results above are correct.

Here's the C code:

struct T {
  int16_t a : 12;
  int16_t b : 4;
};

unsigned short i = 0x1234;
struct T *v = (struct T *) &i;

Now I can expect in both LE and BE that:

v->a == 0x0234;
v->b == 0x0001;

Your code doesn't meet that:

You're wrong where you say # Expected output: 21f 3 - the correct outcome would be 1f3 2 per C's rules. Even changing the order of the read_bits_int() calls does not fix that.

@GreyCat
Member

GreyCat commented Sep 22, 2017

What does that have to do with this code at all? If you want to compare it to C, please at least cast a byte array:

struct T {
  int16_t a : 12;
  int16_t b : 4;
};
char i[] = { 0x12, 0x34 };
struct T *v = (struct T *) i;

printf("%x\n%x\n", v->a, v->b); // output: 412 3

I get exactly the same output from read_bits_int_le.

@tempelmann

But I wrote that the BE result is wrong. The LE one is fine, as you have now verified again. The BE result should, in this last example, give 234 1 - but your read_bits code doesn't do that.

@GreyCat
Member

GreyCat commented Sep 22, 2017

[image: output of the same C program on a big-endian Debian machine (debian_be)]

@GreyCat
Member

GreyCat commented Sep 22, 2017

@JaapAap I believe your implementation is very close to the one I've just posted? Given that it's the way of packing bits that comes standard in all (?) C implementations, I wonder if we should support it as the default, and return for more if and when the need for more complicated formats arises.

Returning to the very first example, parsing 3 bytes = 24 bits of data with the 1+16+7 scheme for a given x = [0x4a, 0x9d, 0xd0] would look something like this in C:

struct T {
  uint32_t a: 1;
  uint32_t b: 16;
  uint32_t c: 7;
};
char x[] = { 0x4a, 0x9d, 0xd0};
struct T *v = (struct T*) x;

printf("a=%x b=%x c=%x\n", v->a, v->b, v->c);

It will result in:

  • On a big-endian machine: a=0 b=953b c=50
  • On a little-endian machine: a=0 b=4ea5 c=68

For comparison, the two methods from above yield b=ceca and b=3b95.
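
As a sanity check (mine, not from the thread): the little-endian-machine result can be derived by hand by loading the three bytes as a 24-bit little-endian integer, 0xd09d4a, and peeling the fields off its low end:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint8_t x[] = { 0x4a, 0x9d, 0xd0 };
    uint32_t v = x[0] | (x[1] << 8) | ((uint32_t)x[2] << 16);  /* 0xd09d4a */
    printf("a=%x b=%x c=%x\n",
           v & 1,              /* bit 0       -> 0    */
           (v >> 1) & 0xffff,  /* bits 1..16  -> 4ea5 */
           (v >> 17) & 0x7f);  /* bits 17..23 -> 68   */
    return 0;
}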

@arekbulski
Member

In Construct, BitsInteger supports little-endianness, but only for multiples of 8 bits. I've had multiple questions over the years about bitwise endianness, and came to the conclusion that the entire topic is garbage. There is no (1) standard, (2) sane way of doing this, because little-endian is by definition swapping octets, which here would supposedly be applied to non-octet data.
https://construct.readthedocs.io/en/latest/api/numerics.html#construct.BitsInteger

@KOLANICH
Author

KOLANICH commented Mar 21, 2018

There is no (1) standard

So we can create our own one ;)

(2) sane way of doing this, because little-endian is by definition swapping octets, which here would supposedly be applied to non-octet data.

1. See #76 for non-octet endianness (read it to understand the example!).
2. There are custom types consuming an even number of chunks; obviously, some endiannesses can be applied to them.
3. For applying endiannesses to types with a non-integer number of chunks, I guess we need a concept of endianness alignment: we pad non-even-sized types to the smallest even size greater than the current size, then transform.

0x1_2_3 (1.5 bytes, an el4-encoded (see #76) number) -> 0x1_2_3_0 (2 bytes, left-aligned) -> 0x0321 (decoded as el4, i.e. endian little with 4-bit chunks)
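
For illustration, the final el4 decode step amounts to reversing the order of the 4-bit chunks; a C sketch of that step (the function name is mine):

#include <stdint.h>

/* Reverse the four 4-bit chunks of a 16-bit value: 0x1230 -> 0x0321 */
static uint16_t el4_decode16(uint16_t v) {
    return (uint16_t)(((v & 0x000f) << 12) | ((v & 0x00f0) << 4) |
                      ((v & 0x0f00) >> 4)  | ((v & 0xf000) >> 12));
}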

@tempelmann

tempelmann commented Mar 21, 2018

It seems to me that you keep discussing only endianness, when you actually have to discuss TWO DIFFERENT things:

  1. Endianness, i.e. the order of the bytes of a multi-byte value
  2. Bit Order, i.e. whether the MSB or LSB is the first bit in a single byte when reading from a bit stream.

So, what you'll eventually need are two independent settings (or parameters), one for each of these modes.

I hope that's clear to everyone.

@KOLANICH
Author

KOLANICH commented Mar 21, 2018

Endianness, i.e. the order of the bytes of a multi-byte value

We can generalize it to arbitrary-sized chunks; "arbitrary" includes a single bit too. See #76, which allows an arbitrary number of independent parameters in the notation. Under this definition of endianness, bit orders and byte orders are just special cases of endiannesses.

@GreyCat
Member

GreyCat commented Feb 1, 2019

Given that we more or less have consensus on implementation of little-endian bit reading, here's the suggestion for syntax.

  • By default, type: bX stays big-endian
  • We add type: bXbe and type: bXle (akin to uXbe and uXle) to force invocation of particular big- or little-endian function
  • We add meta/bit-endian that can be either be or le; setting it to le makes type: bX to behave as type: bXle
    • We might want to add calculated bit-endianness later
  • Combination of bXbe and bXle back to back, if it's not at a byte boundary, is illegal and raises a compile-time error.
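
To make the proposal concrete, a hypothetical .ksy fragment using this syntax might look like this (the format and field names are made up):

meta:
  id: le_bits_example
  bit-endian: le   # makes plain type: bX behave as bXle
seq:
  - id: flags
    type: b4       # little-endian via meta/bit-endian
  - id: counter
    type: b12le    # explicitly little-endian
  - id: checksum
    type: b8be     # explicitly big-endian; legal only because it starts at a byte boundary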

@KOLANICH
Author

KOLANICH commented Feb 1, 2019

By default, type: bX stays big-endian

Why should it be done this way? IMHO this is a case where compatibility can be broken. These numbers are BE even on LE machines, which prevents their adoption because of compatibility issues. IMHO the behavior should be consistent with uX numbers, otherwise there will be confusion and friction. The software is not stable yet, so things can be broken, and consistent behavior is IMHO expected.

We add meta/bit-endian

I wonder if the cases where bit endianness differs from the global endianness are really that frequent. What is usually found in software is this pipeline: read an integer in its native endianness -> transform it into machine endianness with BSWAP (via intrinsics, or idioms translated into intrinsics) or its analogue, if needed -> for bit-sized stuff, do bitwise operations with shifts and masks (see the sketch below). Or, in some cases, just lay raw structs over memory, with the compiler handling bit fields itself. I cannot recall a single case where integers and cross-byte bit-sized ints had different endianness, because such a design requires more code than a single conversion.
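
For illustration, a C sketch of that pipeline (function names are mine): read in the wire endianness, normalize to the host, then extract bit-sized fields with shifts and masks:

#include <stdint.h>

/* Endian-independent read of a little-endian u16 from a byte buffer */
static uint16_t read_u16le(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Bit-sized fields are then plain shifts and masks on the host value */
static unsigned low12(uint16_t v) { return v & 0x0fff; }
static unsigned top4(uint16_t v)  { return v >> 12; }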

@generalmimon
Member

generalmimon commented May 2, 2020

I've finally implemented the little-endian bit reads in all languages except Go (I'm waiting for kaitai-io/kaitai_struct_go_runtime#25 to be reviewed and merged, then it'll be trivial to implement it in Go as well).

I added tests BitsSimpleLe, BitsSeqEndianCombo and DefaultBitEndianMod to test this new feature. BitsSimpleLe is a copy of BitsSimple with bit-endian: le and corrected assertions. DefaultBitEndianMod tests proper propagation of default bit endianness. BitsSeqEndianCombo tests whether the big-endian and little-endian bit integers can follow up on a byte boundary. But it's just a positive (succeeding) test, it doesn't have a negative counterpart yet. I plan to add it together with the compiler implementation.

The check for an illegal LE-BE mix in the middle of a byte is not yet implemented. I tried to stuff it into the compileSeq method (generalmimon/kaitai_struct_compiler@b2f94e5), because that's where the compiler decides whether to insert the alignToByte() call when a byte-aligned field follows a non-aligned one, and I said to myself that it's a similar thing (see my previous comment #155 (comment)).

However, I now think it's not correct to check it there. I tried to add an error format to the formats_err/ folder in the tests repo, and it didn't work because that only runs precompile checks (the compileSeq method is in the ClassCompiler). So it apparently should be a precompile check as well.

@ams-tschoening

A minor thing we can change at any time once we get some responses from people starting to use it in their formats. I think it doesn't really matter whether the initial implementation always has bit-endian: be as the default and emits a warning when endian: le, or sets bit-endian: le directly as the default when endian: le. The BC break IMHO wouldn't hurt much, and given that most formats have the same bit-field endianness as integer endianness, bit-endian: le is a sensible default. We have to state it in the changelog anyway. And AFAIK the compiler currently doesn't "emit" any warnings, so we'd have to somehow build the infrastructure for that.

Here are my thoughts on this: I'm implementing a parser using endian: le ONLY, and I've been using bit fields since they were implemented, so their endianness in my format is be. After reading through this thread, and because of unexpected problems after upgrading the compiler+runtime, I simply changed all bit-related reading functions (and only those!) from *Be to *Le in the generated Java code. The result was that things failed terribly, because the bits I used as flags to switch between structures etc. were simply no longer available where expected.

So in my opinion, silently changing the behaviour of reading bits doesn't sound like a good idea, and if it really needs to be done, a warning isn't sufficient. Without an explicit bit-endian, you should either keep things backwards compatible, or compilation should break until an explicit bit-endian is chosen or the implementation is changed to follow bit-endian: le.

I'm somewhat surprised how easily you talk about breaking backwards compatibility for such an important and heavily requested feature that has been used as-is for years now. I surely won't be the only one stumbling over this break if you actually do it.

@dgelessus
Contributor

dgelessus commented May 11, 2020

@ams-tschoening

So in my opinion, silently changing the behaviour of reading bits doesn't sound like a good idea

As I understand it, there's no intention to break backwards compatibility - the new bit-endian attribute would always default to be (which is equivalent to the current behavior), even in an endian: le context. To get the new behavior, you have to explicitly opt in using bit-endian: le.

In an earlier comment (#155 (comment)) I suggested adding a warning for the case where bit fields are used in a context with endian: le, but without explicit bit-endian. My wording in that comment might have been a bit unclear - I wrote:

This [defaulting to bit-endian: be] makes sense from a backwards compatibility standpoint - currently bX is always considered big-endian, regardless of the current endian setting. Though if backwards compatibility wasn't a concern, I think it would be better to default bit-endian to the same value as endian, [...]

I didn't mean that as "we should break backwards compatibility because the current behavior is suboptimal", but rather as "this is the behavior I would choose if there were no backwards compatibility constraints". Because of backwards compatibility, we can't actually change the default behavior. Instead I suggested adding a warning, to let spec writers know that the default behavior may not be what they expect, and to encourage them to explicitly select the behavior that they need.

@KOLANICH
Author

KOLANICH commented Aug 5, 2020

BTW, I have tried to use it in my kaitai impls of IEEE 754 (by replacing the manual le handling with the simple code from the be impls and setting bit-endian to le) and failed, getting complete gibberish.

@generalmimon
Member

BTW, I have tried to use it in my kaitai impls of IEEE 754 (by replacing the manual le handling with the simple code from the be impls and setting bit-endian to le) and failed, getting complete gibberish.

Could you add some details like a hex dump, .ksy spec, actual and expected output so we can work with something?

@generalmimon
Member

Make sure you're using the correct order of the fields, as I described in the bit layout in #155 (comment).

@KOLANICH
Author

KOLANICH commented Aug 10, 2020

Make sure you're using the correct order of the fields

Thanks, I had missed the fact that the new proposal swaps the field order, as if el1 endianness (in my unified notation) were used.
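
For reference, with bit-endian: le the first field declared lands in the least-significant bits, so e.g. a little-endian IEEE 754 single would be declared fraction-first (a sketch; the field names are mine):

meta:
  bit-endian: le
seq:
  - id: fraction
    type: b23
  - id: exponent
    type: b8
  - id: sign
    type: b1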

@dgelessus
Contributor

Should this issue be closed? IIRC little-endian bit field support has been fully implemented and was released with version 0.9. Or is there anything left to be implemented that I forgot about?

@generalmimon
Member

generalmimon commented Oct 28, 2020

Or is there anything left to be implemented that I forgot about?

Well, the only reason this issue is still open is that the compile-time check for an illegal LE-BE mix on an unaligned bit is still not implemented, as @GreyCat proposed in #155 (comment):

  • Combination of bXbe and bXle back to back, if it's not at a byte boundary, is illegal and raises a compile-time error.

And I updated my progress in #155 (comment):

The check for an illegal LE-BE mix in the middle of a byte is not yet implemented. I tried to stuff it into the compileSeq method (generalmimon/kaitai_struct_compiler@b2f94e5), because that's where the compiler decides whether to insert the alignToByte() call when a byte-aligned field follows a non-aligned one, and I said to myself that it's a similar thing (see my previous comment #155 (comment)).

However, I now think it's not correct to check it there. I tried to add an error format to the formats_err/ folder in the tests repo, and it didn't work because that only runs precompile checks (the compileSeq method is in the ClassCompiler). So it apparently should be a precompile check as well.

But no progress has been made since then, and it isn't really the top thing on my priority list.

It would probably make sense to extract this feature into a separate issue, not sure...
