Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data views / substreams support #44

Open
GreyCat opened this issue Nov 7, 2016 · 16 comments
Open

Data views / substreams support #44

GreyCat opened this issue Nov 7, 2016 · 16 comments
Milestone

Comments

@GreyCat
Copy link
Member

GreyCat commented Nov 7, 2016

Currently, all languages use something similar to this code, when it's time to do a substream and parse objects from substream:

seq:
  - id: foo
    size: some_size
    type: foo_class
this._raw_foo = io.readBytes(someSize());
KaitaiStream _io__raw_foo = new KaitaiStream(_raw_foo);
this.foo = new FooClass(_io__raw_foo, this, _root);

This is inefficient, especially for larger data streams - it needs to load everything into memory and then parse it from there. A more efficient approach would use using some sort of substreams in manner of data views, i.e. something like:

KaitaiStream _io__foo = _io.substream(_io.pos(), someSize());
this.foo = new FooClass(_io__foo, this, _root);

or, for instances that have known pos field, something like that:

instances:
  foo:
    pos: some_pos
    size: some_size
    type: foo_class
KaitaiStream _io__foo = _io.substream(somePos(), someSize());
this.foo = new FooClass(_io__foo, this, _root);

The devil, of course, is in the details:

  1. We need to make sure a substream functionality for every target language exists (or implement it there)
  2. Probably at least for some time we'll have to support two different ways: i.e. parsing via reading _raw_* byte arrays and parsing using substreams.
  3. We need to decide what to do in lots of other cases:
  • What do we return when reading a type-less value: i.e. user would expect a byte array or we'll switch to use substream there too? or a substream-that-can-be-used-as-bytearray? or it should be used-choosable?
  • What do to when repeat constructs exist on this field?
  • How would that mix with process - these actually require reading and re-processing the whole byte array in memory. Or should we re-implement them as stream-in => stream-out converters as well?
@arekbulski
Copy link
Member

Construct had a substream class that was used by Prefixed field. I am not familiar enough with Kaitai to provide an implementation, just posting it for reference.
https://github.com/construct/construct/blob/a06d8dcf7dfe9839d340c1cef723c9b398efb17d/construct/lib/bitstream.py#L139

@GreyCat
Copy link
Member Author

GreyCat commented Jan 17, 2018

@arekbulski Yes, that's exactly what we were looking to here :)

@arekbulski
Copy link
Member

arekbulski commented Jan 23, 2018

On some runtimes (those non-statically typed) the feature might actually be counter productive. Especially if field inside has only few bytes. Remember that, on Python at least, you are replacing C-impl bytes copy and C-impl BytesIO with something where each read is wrapped in pure-python red tape. And you do that to avoid C-imple copying of few bytes.

If the subcon is mega-size, well, that changes things. You might consider adding a property to parsed field, so at compile time it emits one code or the other. User would need to opt-in for this feature.

@GreyCat
Copy link
Member Author

GreyCat commented Jan 23, 2018

A valid concern, but I'd suggest that we benchmark these things first and then see if we need to do any workarounds. May be in languages like that implementation of .substream(...) should decide in runtime if it's worth to just do a native BytesIO + copy (if that's below a certain threshold of bytes), or wrap it in our wrapper.

Also, I'd suggest that we'd start investigating if there are any kind of bound substream class implementations available in other languages.

@arekbulski
Copy link
Member

Agreed, we need a benchmark.

@jvisser
Copy link

jvisser commented Sep 27, 2018

Hi,

Maybe this is a separate issue but I was wondering if it is possible to extend this to custom processors?

The input to the processor would be the view as described in the issue. The output a higher level object that exposes the processed content as a new kaitai stream/view with the addition of some state management methods like close() for example when the root object is destroyed. The resulting view can then again be used with the data view as described in this issue for the parsing of the processed content.

In my case I have a filesystem with potentially large encrypted partitions. What I would like to do is decrypt the partition to disk en provide a mmap'ed file to kaitai for further parsing. But I can imagine cases where the view returned by the processor would do the processing on demand (like simple xor that is already available).

I think my case can be solved by using a custom type that will do the decryption and manually call kaitai to parse its content afterwards. State management of the mmaped file would have to be done out of band.

@GreyCat
Copy link
Member Author

GreyCat commented Sep 27, 2018

@jvisser Sure, it makes sense. We just need to think out the API carefully.

DL4PD pushed a commit to DL4PD/kaitai_struct that referenced this issue Mar 13, 2019
DL4PD pushed a commit to DL4PD/kaitai_struct that referenced this issue Mar 13, 2019
…_if_opaque_type_pull

kaitai-io/kaitai_struct_compiler kaitai-io#44: Added a test for the concrete p…
DL4PD pushed a commit to DL4PD/kaitai_struct that referenced this issue Mar 13, 2019
@GreyCat GreyCat added this to the v0.10 milestone Mar 28, 2019
krisutofu pushed a commit to krisutofu/kaitai_struct that referenced this issue Jan 2, 2022
* Preserve http(s) protocol when navigating to *.kaitai.io

* Fix explicit https://kaitai.io/ link

* Fix lastl *.kaitai.io links with explicit protocol
@GreyCat
Copy link
Member Author

GreyCat commented Jul 26, 2022

Hey folks, please take a look, I took a first pass at implementing this for Ruby only:

Primarily, changes in generated code will look like this:

   def _read
     @len1 = @_io.read_u4le
-    @_raw_block1 = @_io.read_bytes(len1)
-    _io__raw_block1 = Kaitai::Struct::Stream.new(@_raw_block1)
-    @block1 = Block.new(_io__raw_block1, self, @_root)
+    _io_block1 = @_io.substream(len1)
+    @block1 = Block.new(_io_block1, self, @_root)

So, 2 lines instead of 3, no pre-reading of bytes in the memory, cleaner API overall. I will launch the official tests now, but judging from the tests locally, tests still work as they were before.

Please tell me if you think it's a good idea and/or if we can make this work for other languages.

@GreyCat
Copy link
Member Author

GreyCat commented Jul 27, 2022

Added similar implementation to Java — see kaitai-io/kaitai_struct_java_runtime@ee61d73 for runtime change and kaitai-io/kaitai_struct_compiler@b529bc6b for compiler. The main difference is Java has two KaitaiStream implementations, and the good news is that one of them (ByteBuffer-based one) already has all the slicing/limits machinery in place, ready for substreams implementation.

Generated code-wise, it's a very similar change:

     private void _read() {
         this.len1 = this._io.readU4le();
-        this._raw_block1 = this._io.readBytes(len1());
-        KaitaiStream _io__raw_block1 = new ByteBufferKaitaiStream(_raw_block1);
-        this.block1 = new Block(_io__raw_block1, this, _root);
+        KaitaiStream _io_block1 = this._io.substream(len1())
+        this.block1 = new Block(_io_block1, this, _root);

The big caveat is that this._raw_block1 will stay null. Shall we drop support for it, or shall we strive to keep compatibility? E.g. we can technically replace this:

    public byte[] _raw_block1() { return _raw_block1; }

with something like

    public byte[] _raw_block1() {
        KaitaiStream io = block1._io();
        long oldPos = io.pos();
        io.seek(0);
        byte[] allBytes = block1.readBytesFull();
        io.seek(oldPos);
        return allBytes;
    }

@Mingun
Copy link

Mingun commented Jul 28, 2022

Please, consider also merging kaitai-io/kaitai_struct_java_runtime#28 and related in order to give tools a chance to correctly visualize substreams.

@generalmimon
Copy link
Member

@GreyCat Note that the adoption of SubIO broke the test ExprIoPos in Ruby (test_out/ruby/ci.json:384-394):

   "ExprIoPos": {
-    "status": "passed",
-    "elapsed": 0.000336907,
+    "status": "failed",
+    "elapsed": 0.000463687,
+    "failure": {
+      "file_name": "./spec/ruby/expr_io_pos_spec.rb",
+      "line_num": 4,
+      "message": "undefined method `size' for #<Kaitai::Struct::SubIO:0x0000000002666050 @parent_io=#<File:src/expr_io_pos.bin>, @parent_start=0, @parent_len=16, @parent_end=16, @pos=10, @closed=false>",
+      "trace": "/home/travis/build/kaitai-io/ci_targets/runtime/ruby/lib/kaitai/struct/struct.rb:145:in `size'\n
+      /home/travis/build/kaitai-io/ci_targets/compiled/ruby/expr_io_pos.rb:30:in `_read'\n
+      /home/travis/build/kaitai-io/ci_targets/compiled/ruby/expr_io_pos.rb:25:in `initialize'\n
+      /home/travis/build/kaitai-io/ci_targets/compiled/ruby/expr_io_pos.rb:17:in `new'\n
+      /home/travis/build/kaitai-io/ci_targets/compiled/ruby/expr_io_pos.rb:17:in `_read'\n
+      /home/travis/build/kaitai-io/ci_targets/compiled/ruby/expr_io_pos.rb:12:in `initialize'\n
+      /home/travis/build/kaitai-io/ci_targets/runtime/ruby/lib/kaitai/struct/struct.rb:26:in `new'\n
+      /home/travis/build/kaitai-io/ci_targets/runtime/ruby/lib/kaitai/struct/struct.rb:26:in `from_file'\n
+      /home/travis/build/kaitai-io/ci_targets/tests/spec/ruby/expr_io_pos_spec.rb:6:in `block (2 levels) in <top (required)>'\n
+      /home/travis/.rvm/gems/ruby-3.0.4/gems/rspec-core-3.11.0/lib/rspec/..."
+    },
     "is_kst": true
   },

@armijnhemel
Copy link

@GreyCat
Copy link
Member Author

GreyCat commented Jul 30, 2022

@generalmimon

Note that the adoption of SubIO broke the test ExprIoPos in Ruby (test_out/ruby/ci.json:384-394):

Great catch, thanks! It's very easy to fix, actually, let me do that.

generalmimon referenced this issue in kaitai-io/kaitai_struct_ruby_runtime Jul 30, 2022
@generalmimon
Copy link
Member

generalmimon commented Jul 30, 2022

Zero-copy substreams are great, but we should consider that they would typically require a seekable stream where seek(), pos() and size() methods are available (unless there is an ingenious way around it without having to call any of them).

So at least there should be a command-line option to bring the traditional _raw_* = read_bytes(n) + _io__raw_* = new KaitaiStream(_raw_*) approach back, otherwise support for non-seekable streams will be lost.

It's basically the same issue as in kaitai-io/kaitai_struct_cpp_stl_runtime#46 (comment).

@GreyCat
Copy link
Member Author

GreyCat commented Jul 31, 2022

Makes perfect sense. We'll be supporting both anyway, e.g. for sake of supporting existing interface for custom processing which is all working exclusively on byte arrays. Let's start with a command line switch to toggle both ways.

@armijnhemel
Copy link

Zero-copy substreams are great, but we should consider that they would typically require a seekable stream where seek(), pos() and size() methods are available (unless there is an ingenious way around it without having to call any of them).

So at least there should be a command-line option to bring the traditional _raw_* = read_bytes(n) + _io__raw_* = new KaitaiStream(_raw_*) approach back, otherwise support for non-seekable streams will be lost.

It's basically the same issue as in kaitai-io/kaitai_struct_cpp_stl_runtime#46 (comment).

but as soon as you have read bytes, then you can just create a seekable stream that you can pass around, right?

@GreyCat GreyCat modified the milestones: v0.10, v0.11 Aug 6, 2022
generalmimon added a commit to kaitai-io/kaitai_struct_tests that referenced this issue Aug 4, 2023
The `_raw_*` fields are no longer available in default compiler mode
using zero-copy substreams (see kaitai-io/kaitai_struct#44),
so this test was actually broken for a while.
generalmimon added a commit to kaitai-io/kaitai_struct_tests that referenced this issue Apr 3, 2024
Checks whether creating a substream that exceeds the remaining stream
size throws an EOF error. This turned out not to be the case in Ruby
with zero-copy substreams
(kaitai-io/kaitai_struct#44), so it's clearly
beneficial to have this covered.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants