Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure that the stream state after an EOF error is the same as before #1061

Open
generalmimon opened this issue Jul 28, 2023 · 1 comment
Open

Comments

@generalmimon
Copy link
Member

generalmimon commented Jul 28, 2023

Right now, the state of the stream provided by each of our runtime libraries after a read/write operation fails with an EOF error is effectively implementation-defined. This is bad because it may easily differ between different languages and even stream implementations in one language (i.e. ByteBufferKaitaiStream and RandomAccessFileKaitaiStream in Java) and it cannot be relied upon - it's enough if we change the implementation of some method in the runtime library in something that looks like an unimportant detail (for example, changing the order of operations that don't seem like they have to be done in one particular order), and the behavior changes.

Perhaps users don't usually care about the stream state after an EOF error because in their application it's game over and they won't touch the stream anymore. But even if they don't intend to do any more read/write operations but they only want to check the pos() to diagnose where in the stream the error occurred, for example, they should get stable and well-defined readings from pos(), not affected by the implementation details.

I think that the most understandable and convenient API will be such that after an EOF error, the state of the stream will be the same as before the failed read/write operation was called. This means that if the current pos() is 1 and you call read_bytes(4) that fails with an EOF error, the pos() will still be 1. Likewise, if you call any of the read_bits_int_*() or write_bits_int_*() methods and it fails with an EOF error, not only the pos() will remain the same as before the failed call but also any of the internal bit buffers (traditionally represented by bits_left and bits fields of KaitaiStream classes) will remain untouched.

Known instances of the implementation-specific behavior include:

  • At least in Python and Ruby the read_bytes method first calls the read() method of the underlying I/O stream, which actually doesn't throw any EOF errors itself - it just returns as many bytes are available in the stream. If less bytes were read than requested, our runtime libraries would deliberately throw an EOF error.

    This is fine at first glance, but the problem is that since the underlying read() really reads everything until the end of stream if you wanted more bytes than available, the stream position is implicitly moved at EOF and then the EOF error is raised. This means that the read_bytes permanently mutated the stream state even though it failed with an exception.

    See Ruby implementation of read_bytes for reference (struct.rb:372-381):

      def read_bytes(n)
        r = @_io.read(n)
        if r
          rl = r.bytesize
        else
          rl = 0
        end
        raise EOFError.new("attempted to read #{n} bytes, got only #{rl}") if rl < n
        r
      end
  • In Python since 0.10, the read_bytes actually checks whether a large read request (8 MiB or more) can be satisfied by comparing the requested size with size() - pos(). If it finds out that the requested cannot be satisfied, it doesn't call the read() at all and raises an EOF error straight away (see read_bytes(): add extra length check kaitai_struct_python_runtime#61) - so the pos() will remain where it was before calling read_bytes. However, this check is not done for small reads (< 8 MiB), so if a small read fails, the position will be at EOF. This makes the behavior of read_bytes in Python even more unpredictable (i.e. dependent on a rather arbitrary implementation-specific limit of 8 MiB) than in other languages.

  • In fact all runtime libraries use an identical algorithm for read_bits_int_{be,le}() methods, and at least in the read_bits_int_be variant, the bits_left field is set before read_bytes is called (which might fail, though, and in that case the bits_left is not restored), but bits is set only after the call to read_bytes (so they may be even inconsistent with each other after the failed call). See for example the Python implementation of read_bits_int_be (kaitaistruct.py:233-259):

        def read_bits_int_be(self, n):
            res = 0
    
            bits_needed = n - self.bits_left
            self.bits_left = -bits_needed % 8
    
            if bits_needed > 0:
                # ...
                buf = self.read_bytes(bytes_needed)
                # ...
            else:
                # ...
    
            mask = (1 << self.bits_left) - 1  # `bits_left` is in range 0..7
            self.bits &= mask
    
            return res
@generalmimon
Copy link
Member Author

generalmimon commented Nov 14, 2023

Another instance of this general issue has been found a while ago in the Size() method of the Go runtime: kaitai-io/kaitai_struct_go_runtime#26

This applies to runtimes in other languages too, because their size() methods often use the same approach - it's just that the problem is more visible in Go because of the explicit error handling than in other languages that use exceptions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant