
Following a chain of sectors to merge them in a new stream #196

Open · GreyCat opened this issue Jul 2, 2017 · 19 comments

@GreyCat (Member) commented Jul 2, 2017

At least both the FAT filesystem and Microsoft's CFB files follow the same pattern: to specify file contents, one provides the index of a starting sector a0. A parser must then follow the chain of sectors, as specified in a FAT-like table, i.e.:

  • 1st sector = a0
  • 2nd sector = fat[a0]
  • 3rd sector = fat[fat[a0]]
  • etc

until it meets a certain terminator (like -1 or -2) in the FAT table. After that, if we want to do further parsing of the file contents, we should reassemble all these individual sectors into one new stream (and probably trim it to the size specified in a separate field somewhere in the directory entry).
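
In plain application code, the reassembly we ultimately need amounts to something like this (a minimal Python sketch; the 512-byte sector size and the -1 terminator match the KSY model below, but both vary between formats):

def read_chain(data, fat, first_sector, full_size):
    SECTOR_SIZE = 512
    out = bytearray()
    s = first_sector
    while s != -1:                                           # follow links until the terminator
        out += data[s * SECTOR_SIZE:(s + 1) * SECTOR_SIZE]   # grab one sector
        s = fat[s]                                           # next sector index comes from the FAT
    return bytes(out[:full_size])                            # trim to the size from the directory entry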

The following structure can model most of this behavior, but not all:

seq:
  # Offset to the FAT table, which, for simplicity's sake, consists of 4-byte entries
  - id: ofs_fat
    type: u4
  # A pointer to the first sector
  - id: file_first_sector
    type: sector_ptr
  # Full size that we need to trim the file contents to
  - id: file_full_size
    type: u4
types:
  fat:
    seq:
      - id: entries
        type: u4
        repeat: eos
  sector_ptr:
    seq:
      - id: current_ptr
        type: u4
    instances:
      body:
        pos: current_ptr * 512
        size: 512
        if: current_ptr != -1
      next:
        pos: _root.ofs_fat + 4 * current_ptr
        type: sector_ptr
        if: current_ptr != -1

This effectively allows accessing file sectors one by one by using:

parsed.file_first_sector.body # 1st sector contents
parsed.file_first_sector.next.body # 2nd sector contents
parsed.file_first_sector.next.next.body # 3rd sector contents, etc.

However, there is no simple way to unite all these sectors and trim the result to file_full_size so that parsing can continue; it has to be done in the application code, i.e.:

data = ''
s = parsed.file_first_sector
while not s.body.nil?
  data << s.body
  s = s.next
end
data = data[0, parsed.file_full_size] # trim to the declared size

Any ideas on what would be the best syntax to do it?

@koczkatamas (Member)

Should we unify this issue with other substream-related issues, e.g. the PNG one (the data of multiple IDAT chunks should be concatenated before zlib decompression)?

I presume it would be better if we could come up with a more universal solution which may be good enough to support file formats we haven't met yet.

It may also be worth thinking a little about serialization (but really just a little, as it is out of scope right now): if we come up with multiple ideas, we can compare them from this point of view too.

@GreyCat could we collect all the substream-related file formats somewhere (this issue's description, a separate wiki page, etc.)?

Currently this is the list (I'll try to collect them and keep this comment updated):

  • Microsoft CFB
  • FAT filesystem
  • PNG
  • (registry file?)
  • (TCP?)

@GreyCat (Member, Author) commented Jul 5, 2017

Substreams, as in #44, are a somewhat different issue, although maybe it's worth discussing this one after completing #44.

@GreyCat (Member, Author) commented Jul 21, 2017

Another complex example comes from the Ogg specification. Each Ogg page has a list of physical "segments", defined like this:

      - id: len_segments
        type: u1
        repeat: expr
        repeat-expr: num_segments
      - id: segments
        repeat: expr
        repeat-expr: num_segments
        size: len_segments[_index]

Nowadays, with the advent of _index, we can even read them. But the actual software works not on physical "segments" but on logical "packets", which are constructed by joining segments: each 255-byte segment is concatenated with the segments that follow it, until a segment shorter than 255 bytes terminates the packet. That is, a typical Ogg page might contain segment lengths like this:

        [.] 12 = 1       } length 1
        [.] 13 = 1       } length 1
        [.] 14 = 1       } length 1
        [.] 15 = 252     } length 252
        [.] 16 = 255     ⎫
        [.] 17 = 36      ⎭ length 291
        [.] 18 = 255     ⎫
        [.] 19 = 34      ⎭ length 289
        [.] 20 = 255     ⎫
        [.] 21 = 255     ⎪
        [.] 22 = 255     ⎪
        [.] 23 = 61      ⎭ length 826

A packet of exactly 255 bytes is encoded as 2 segments: a 255-byte segment plus a 0-byte segment.

To add insult to injury, technically packets can even be split between different Ogg pages (i.e. higher-level structures). This way, one page might end with a segment of 255 bytes and the next one might start with a segment that continues it (marked with a "continuation" flag).
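
For illustration, the segment-to-packet regrouping described above boils down to something like this (a small Python sketch that ignores the cross-page continuation case):

def packet_lengths(segment_lengths):
    # A 255-byte segment continues the current packet; anything shorter
    # terminates it (hence a 255-byte packet ends with a 0-byte segment).
    packets = []
    current = 0
    for seg_len in segment_lengths:
        current += seg_len
        if seg_len < 255:
            packets.append(current)
            current = 0
    return packets

# For the page above:
# packet_lengths([1, 1, 1, 252, 255, 36, 255, 34, 255, 255, 255, 61])
# => [1, 1, 1, 252, 291, 289, 826]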

@KOLANICH commented Aug 28, 2017

seq:
  .......
chains:
  chain_typed:
    doc: lol
    type: frame # if `chain` is set, `type` is forbidden and inherited from that chain
  chain_stream:
    doc: united stream of bytes ready to be parsed

types:
  ........
  aaaa:
    seq:
      - id: frame # a chain of objects of type `frame`
        chain: _root.chain_typed
      - id: size
        type: u8
      - id: byte_chunk # a chain of raw bytes, may be used for parsing after merging via `_io`
        size: size
        chain: chain_stream

What do you think?

@GreyCat (Member, Author) commented Aug 28, 2017

Would you care to elaborate a little on how that's supposed to work, so I won't be reinventing the whole thing from scratch, trying to guess what you meant here?

@KOLANICH commented Aug 28, 2017

chains is a dictionary of chain identifiers. A chain is by definition a collection. The implementation is to be decided; I guess different languages will have different ones. For C++ I guess we need a chain to be a vector, and adding to the chain would be move-semantics operations like emplace_back. For reference types in GC languages I guess it's just a collection.
A chain is typed. If type is omitted, it is by definition a collection of raw bytes (with array-like and stream interfaces). A chain has an address, as any property has. In fact it is a property, like any field in seq and any instance, so it has an address.

chain binds any property to its chain by the chain's path. Binding a property to a chain means that KSC adds some code that inserts a pointer/reference/the actual content of the recently parsed property into the chain.
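
In other words, for an untyped chain the generated code would boil down to a runtime helper roughly like this (a hypothetical Python sketch of what KSC could emit, not an actual Kaitai runtime API):

import io

class ByteChain:
    # Hypothetical helper: collects raw chunks as bound properties are
    # parsed and exposes the merged result as a stream for further parsing.
    def __init__(self):
        self._chunks = []

    def append(self, chunk):
        # KSC would insert a call like this right after parsing a bound property
        self._chunks.append(chunk)

    def to_io(self):
        # merged, stream-like view of everything appended so far
        return io.BytesIO(b"".join(self._chunks))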

@koczkatamas (Member)

Before we implement anything concrete, I'd like to see how the suggested solution solves the issues mentioned above (Microsoft CFB, FAT filesystem, PNG, registry file?, TCP?, Ogg, the referenced issue).

Finding a solution that fits them all is probably not easy.

@KOLANICH commented Aug 30, 2017

For Ogg, do you mean something like this:

.....
types:
  page:
    chains:
      data: {}

    seq:
      ....
      - id: segments
        repeat: expr
        repeat-expr: num_segments
        size: len_segments[_index]
        doc: Segment content bytes make up the rest of the Ogg page.
        chain: data

The proposed syntax should merge all the page's segments into the data stream belonging to that page.

@pavja2 commented Mar 4, 2019

I'm not sure if this is directly related to this enhancement proposal, but I've encountered a related issue when attempting to build a struct to describe MPEG-TS protocol captures. MPEG-TS consists of lots of small packets (188 bytes each) which have program identifiers and counters in their headers.

Ideally, it would be possible to use Kaitai to not only split a capture into the 188-byte packets, but also merge the payloads of packets belonging to a given program identifier according to their specified order (i.e. demultiplex) and then parse that re-assembled payload with its own Kaitai structure.

Right now, I have to pop out of Kaitai into Python to do this merging process, generate an intermediary binary containing the demultiplexed payloads, and then pop back into Kaitai with a different .ksy to parse this intermediary format. It'd be great to do this all inside Kaitai for cross-language portability and clarity.
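
The merging step done outside Kaitai is roughly along these lines (a simplified Python sketch, not my actual code; it assumes bare 188-byte TS packets with no adaptation fields, which a real implementation has to handle via the adaptation_field_control bits):

from collections import defaultdict

PACKET_SIZE = 188

def demultiplex(capture):
    streams = defaultdict(bytearray)
    for off in range(0, len(capture) - PACKET_SIZE + 1, PACKET_SIZE):
        pkt = capture[off:off + PACKET_SIZE]
        if pkt[0] != 0x47:                       # sync byte check
            continue
        pid = ((pkt[1] & 0x1F) << 8) | pkt[2]    # 13-bit program identifier
        streams[pid] += pkt[4:]                  # 4-byte header, rest is payload
    # each value can then be written out as an intermediary binary and parsed
    # with a second, payload-level .ksy
    return {pid: bytes(data) for pid, data in streams.items()}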

@GreyCat (Member, Author) commented Mar 9, 2019

@pavja2 Could you demonstrate the merging algorithm for our reference here? It indeed looks like another valid use case for this feature.

@kalidasya

@pavja2 I was also about to implement an MPEG-TS parser; do you have your structure file shared somewhere?

@GreyCat (Member, Author) commented May 6, 2019

@kalidasya This person here actually claims that they have developed that, although I'm not sure if they'll be able to open-source it.

@pavja2 commented May 6, 2019

@kalidasya I do have a basic struct file. It's rough around the edges but works well enough for my needs. It'd be awesome if someone made it better! I'm traveling ATM but will post it as soon as I can and let you know when I do.

@GreyCat (Member, Author) commented May 6, 2019

@kalidasya @pavja2 Guys, just a heads up: please consider creating a new issue in the formats repo for that format and moving the discussion there. Otherwise, it will be virtually impossible for others who might be interested to find these and join your cause :)

@kalidasya

@GreyCat thanks for the tip, issue created! @pavja2 when you have time, can you link it to the referred ticket?

@kalidasya commented May 9, 2019

To give some context for this ticket from the MPEG-TS point of view:

  1. every MPEG-TS stream consists of multiple 188-byte-long packets; each packet (among other things) has a packet identifier (PID) and payload data
  2. this chain should work for this use case, but we need to address the data as a dictionary:
.....
types:
  payload:
    chains:
      data: {}

  tspacket:
    seq:
      ....
      - id: payload_unit_start_indicator
        type: b1
     ...
      - id: pid
        type: b13
      ...
      - id: payload
        size-eos: true
        chain: data[pid]

@pavja2 what do you think?

What complicates things in this case is that every TS packet has a payload_unit_start_indicator which flags whether this is the beginning of a new payload (this is the trigger point to start parsing the payload we have accumulated so far). My Kaitai knowledge is not enough to assess whether this is something that can be covered or not.
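
For now, that accumulate-and-flush behaviour has to live in application code, roughly like this (a hypothetical Python sketch; packets is assumed to be an iterable of parsed TS packets exposing pid, payload_unit_start_indicator and payload, names chosen for illustration):

from collections import defaultdict

def payloads(packets):
    buffers = defaultdict(bytearray)
    for pkt in packets:
        if pkt.payload_unit_start_indicator and buffers[pkt.pid]:
            # a new payload starts on this PID, so the accumulated one is complete
            yield pkt.pid, bytes(buffers[pkt.pid])
            buffers[pkt.pid].clear()
        buffers[pkt.pid] += pkt.payload
    for pid, buf in buffers.items():             # flush whatever is left at EOF
        if buf:
            yield pid, bytes(buf)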

@jaroslaw-wieczorek commented Jun 10, 2019

Moved from #555

In the .ksy, is it possible to combine all value fields from data_chunk into a new _io stream? I need to combine (concatenate) the data into one stream for further processing, without the checksums.

types:
  data_chunk:
    seq:
      - id: value
        size: '(_io.size - _io.pos > 17) ? 16 : ((_io.size - _io.pos) - 2)'
      - id: checksum
        size: 2

Update:

Example data_chunks: [screenshot]

Each data_chunk has fields named "value" and "checksum". What I want to do is get one string containing the bytes from all the "value" fields, without the data from the "checksum" fields.

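Until that exists, the workaround in application code is something like this (a minimal Python sketch; parsed.data_chunks is an assumed name for the repeated data_chunk attribute in the enclosing type):

import io

def merged_values(parsed):
    # concatenate every chunk's `value`, dropping the 2-byte checksums
    return io.BytesIO(b"".join(chunk.value for chunk in parsed.data_chunks))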

@tannewt commented Aug 6, 2019

Hi hi! This would be handy for USB as well. For example, with USB mass storage over full-speed USB, packets are 64 bytes at a time, but reads and writes of data are done in 512-byte chunks over multiple low-level (IN) packets. This is also done for things like long descriptor strings.

@tisoft commented Jul 26, 2022

Squashfs (kaitai-io/kaitai_struct_formats#596) would also benefit from this. Metadata is stored in blocks that need to be processed individually and then concatenated before they can be parsed.

I have it working using custom functions, but native support would be great.
