
Following a chain of sectors to merge them in a new stream #196

Open · GreyCat opened this issue Jul 2, 2017 · 19 comments

@GreyCat (Member) commented Jul 2, 2017

At least both the FAT filesystem and Microsoft's CFB files follow the same pattern: to specify file contents, one provides the index of a starting sector a0. A parser must then follow the chain of sectors, as specified in a FAT-like table, i.e.:

  • 1st sector = a0
  • 2nd sector = fat[a0]
  • 3rd sector = fat[fat[a0]]
  • etc

until it meets a certain terminator (like -1 or -2) in the FAT table. After that, if we want to do further parsing of the file contents, we should reassemble all these individual sectors into one new stream (and probably trim it to the size specified in a separate field somewhere in the directory entry).
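
In plain application code, the reassembly we ultimately need amounts to something like this (a minimal Python sketch; the 512-byte sector size and the -1 terminator match the KSY model below, but both vary between formats):

def read_chain(data, fat, first_sector, full_size):
    SECTOR_SIZE = 512
    out = bytearray()
    s = first_sector
    while s != -1:                                           # follow links until the terminator
        out += data[s * SECTOR_SIZE:(s + 1) * SECTOR_SIZE]   # grab one sector
        s = fat[s]                                           # next sector index comes from the FAT
    return bytes(out[:full_size])                            # trim to the size from the directory entry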

The following structure can model most of this behavior, but not all:

seq:
  # Offset to the FAT table, which, for simplicity's sake, consists of 4-byte entries
  - id: ofs_fat
    type: u4
  # A pointer to the first sector
  - id: file_first_sector
    type: sector_ptr
  # Full size that we need to trim the file contents to
  - id: file_full_size
    type: u4
types:
  fat:
    seq:
      - id: entries
        type: u4
        repeat: eos
  sector_ptr:
    seq:
      - id: current_ptr
        type: u4
    instances:
      body:
        pos: current_ptr * 512
        size: 512
        if: current_ptr != -1
      next:
        pos: _root.ofs_fat + 4 * current_ptr
        type: sector_ptr
        if: current_ptr != -1

This effectively allows accessing file sectors one by one by using:

parsed.file_first_sector.body # 1st sector contents
parsed.file_first_sector.next.body # 2nd sector contents
parsed.file_first_sector.next.next.body # 3rd sector contents, etc.

However, there is no simple way to unite all these sectors and trim the result to file_full_size so that parsing can continue; it has to be done in the application code, i.e.:

data = ''
s = parsed.file_first_sector
while not s.body.nil?
  data << s.body
  s = s.next
end
data = data[0, parsed.file_full_size] # trim to the declared size

Any ideas on what would be the best syntax to do it?

@koczkatamas (Member)

Should we unify this issue with other substream-related issues, e.g. the PNG one (the data of multiple IDAT chunks should be concatenated before zlib decompression)?

I presume it would be better if we could come up with a more universal solution which may be good enough to support file formats we haven't met yet.

It may also be worth thinking a little about serialization (but really just a little, as it is out of scope right now): if we come up with multiple ideas, we can compare them from this point of view too.

@GreyCat could we collect all the substream-related file formats somewhere (this issue's description, a separate wiki page, etc.)?

Currently this is the list (I'll try to collect them and keep this comment updated):

  • Microsoft CFB
  • FAT filesystem
  • PNG
  • (registry file?)
  • (TCP?)

@GreyCat (Member, Author) commented Jul 5, 2017

Substreams, as in #44, are a somewhat different issue, although maybe it's worth discussing this one after completing #44.

@GreyCat (Member, Author) commented Jul 21, 2017

Another complex example comes from the Ogg specification. Each Ogg page has a list of physical "segments", defined like this:

      - id: len_segments
        type: u1
        repeat: expr
        repeat-expr: num_segments
      - id: segments
        repeat: expr
        repeat-expr: num_segments
        size: len_segments[_index]

Nowadays, with the advent of _index, we can even read them. But the actual software works not on physical "segments" but on logical "packets", which are constructed by joining segments: each 255-byte segment is concatenated with the segments that follow it, until a segment shorter than 255 bytes terminates the packet. That is, a typical Ogg page might contain segment lengths like this:

        [.] 12 = 1       } length 1
        [.] 13 = 1       } length 1
        [.] 14 = 1       } length 1
        [.] 15 = 252     } length 252
        [.] 16 = 255     ⎫
        [.] 17 = 36      ⎭ length 291
        [.] 18 = 255     ⎫
        [.] 19 = 34      ⎭ length 289
        [.] 20 = 255     ⎫
        [.] 21 = 255     ⎪
        [.] 22 = 255     ⎪
        [.] 23 = 61      ⎭ length 826

A packet of exactly 255 bytes is encoded as 2 segments: a 255-byte segment plus a 0-byte segment.

To add insult to injury, technically packets can even be split between different Ogg pages (i.e. higher-level structures). This way, one page might end with a segment of 255 bytes and the next one might start with a segment that continues it (marked with a "continuation" flag).
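
For illustration, the segment-to-packet regrouping described above boils down to something like this (a small Python sketch that ignores the cross-page continuation case):

def packet_lengths(segment_lengths):
    # A 255-byte segment continues the current packet; anything shorter
    # terminates it (hence a 255-byte packet ends with a 0-byte segment).
    packets = []
    current = 0
    for seg_len in segment_lengths:
        current += seg_len
        if seg_len < 255:
            packets.append(current)
            current = 0
    return packets

# For the page above:
# packet_lengths([1, 1, 1, 252, 255, 36, 255, 34, 255, 255, 255, 61])
# => [1, 1, 1, 252, 291, 289, 826]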

@KOLANICH commented Aug 28, 2017

seq:
  .......
chains:
  chain_typed:
    doc: lol
    type: frame # if `chain` is set, `type` is forbidden and inherited from that chain
  chain_stream:
    doc: united stream of bytes ready to be parsed

types:
  ........
  aaaa:
    seq:
      - id: frame # a chain of objects of type `frame`
        chain: _root.chain_typed
      - id: size
        type: u8
      - id: byte_chunk # a chain of raw bytes, may be used for parsing after merging via `_io`
        size: size
        chain: chain_stream

What do you think?

@GreyCat (Member, Author) commented Aug 28, 2017

Would you care to elaborate a little on how that's supposed to work, so I won't be reinventing the whole thing from scratch, trying to guess what you meant here?

@KOLANICH commented Aug 28, 2017

chains is a dictionary of chain identifiers. A chain is by definition a collection. The implementation is to be decided; I guess different languages will have different ones. For C++ I guess we need a chain to be a vector, and adding to the chain would be move-semantics operations like emplace_back. For reference types in GC languages I guess it's just a collection.
A chain is typed. If type is omitted, it is by definition a collection of raw bytes (with array-like and stream interfaces). A chain has an address, as any property has. In fact it is a property, like any field in seq and any instance, so it has an address.

chain binds any property to its chain by the chain's path. Binding a property to a chain means that KSC adds some code that inserts a pointer/reference/the actual content of the recently parsed property into the chain.
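
In other words, for an untyped chain the generated code would boil down to a runtime helper roughly like this (a hypothetical Python sketch of what KSC could emit, not an actual Kaitai runtime API):

import io

class ByteChain:
    # Hypothetical helper: collects raw chunks as bound properties are
    # parsed and exposes the merged result as a stream for further parsing.
    def __init__(self):
        self._chunks = []

    def append(self, chunk):
        # KSC would insert a call like this right after parsing a bound property
        self._chunks.append(chunk)

    def to_io(self):
        # merged, stream-like view of everything appended so far
        return io.BytesIO(b"".join(self._chunks))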

@koczkatamas (Member)

Before we implement anything concrete, I'd like to see how the suggested solution solves the issues mentioned above (Microsoft CFB, FAT filesystem, PNG, registry file?, TCP?, Ogg, the referenced issue).

Finding a solution that fits them all is probably not easy.

@KOLANICH commented Aug 30, 2017

For Ogg, do you mean something like this:

.....
types:
  page:
    chains:
      data: {}

    seq:
      ....
      - id: segments
        repeat: expr
        repeat-expr: num_segments
        size: len_segments[_index]
        doc: Segment content bytes make up the rest of the Ogg page.
        chain: data

The proposed syntax should merge all the page's segments into the data stream belonging to that page.

@pavja2 commented Mar 4, 2019

I'm not sure if this is directly related to this enhancement proposal, but I've encountered a related issue when attempting to build a struct to describe MPEG-TS protocol captures. MPEG-TS consists of lots of small packets (188 bytes each) which have program identifiers and counters in their headers.

Ideally, it would be possible to use Kaitai to not only split a capture into the 188-byte packets, but also merge the payloads of packets belonging to a given program identifier according to their specified order (i.e. demultiplex) and then parse that re-assembled payload with its own Kaitai structure.

Right now, I have to pop out of Kaitai into Python to do this merging process, generate an intermediary binary containing the demultiplexed payloads, and then pop back into Kaitai with a different .ksy to parse this intermediary format. It'd be great to do this all inside Kaitai for cross-language portability and clarity.
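
The merging step done outside Kaitai is roughly along these lines (a simplified Python sketch, not my actual code; it assumes bare 188-byte TS packets with no adaptation fields, which a real implementation has to handle via the adaptation_field_control bits):

from collections import defaultdict

PACKET_SIZE = 188

def demultiplex(capture):
    streams = defaultdict(bytearray)
    for off in range(0, len(capture) - PACKET_SIZE + 1, PACKET_SIZE):
        pkt = capture[off:off + PACKET_SIZE]
        if pkt[0] != 0x47:                       # sync byte check
            continue
        pid = ((pkt[1] & 0x1F) << 8) | pkt[2]    # 13-bit program identifier
        streams[pid] += pkt[4:]                  # 4-byte header, rest is payload
    # each value can then be written out as an intermediary binary and parsed
    # with a second, payload-level .ksy
    return {pid: bytes(data) for pid, data in streams.items()}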

@GreyCat (Member, Author) commented Mar 9, 2019

@pavja2 Could you demonstrate the merging algorithm for our reference here? It indeed looks like another valid use case for this feature.

@kalidasya

@pavja2 I was also about to implement an MPEG-TS parser; do you have your structure file shared somewhere?

@GreyCat (Member, Author) commented May 6, 2019

@kalidasya This person here actually claims that they have developed that, although I'm not sure if they'll be able to open-source it.

@pavja2 commented May 6, 2019

@kalidasya I do have a basic struct file. It's rough around the edges but works well enough for my needs. It'd be awesome if someone made it better! I'm traveling ATM but will post it as soon as I can and let you know when I do.

@GreyCat (Member, Author) commented May 6, 2019

@kalidasya @pavja2 Guys, just a heads up: please consider creating a new issue in the formats repo for that format and moving the discussion there. Otherwise, it will be virtually impossible for others who might be interested to find these and join your cause :)

@kalidasya

@GreyCat thanks for the tip, issue created! @pavja2 when you have time, can you link it to the referred ticket?

@kalidasya commented May 9, 2019

To give some context for this ticket from the MPEG-TS point of view:

  1. every MPEG-TS stream consists of multiple 188-byte-long packets; each packet (among other things) has a packet identifier (PID) and payload data
  2. this chain should work for this use case, but we need to address the data as a dictionary:
.....
types:
  payload:
    chains:
      data: {}

  tspacket:
    seq:
      ....
      - id: payload_unit_start_indicator
        type: b1
     ...
      - id: pid
        type: b13
      ...
      - id: payload
        size-eos: true
        chain: data[pid]

@pavja2 what do you think?

What complicates things in this case is that every TS packet has a payload_unit_start_indicator which flags whether this is the beginning of a new payload (this is the trigger point to start parsing the payload we have accumulated so far). My Kaitai knowledge is not enough to assess whether this is something that can be covered or not.
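
For now, that accumulate-and-flush behaviour has to live in application code, roughly like this (a hypothetical Python sketch; packets is assumed to be an iterable of parsed TS packets exposing pid, payload_unit_start_indicator and payload, names chosen for illustration):

from collections import defaultdict

def payloads(packets):
    buffers = defaultdict(bytearray)
    for pkt in packets:
        if pkt.payload_unit_start_indicator and buffers[pkt.pid]:
            # a new payload starts on this PID, so the accumulated one is complete
            yield pkt.pid, bytes(buffers[pkt.pid])
            buffers[pkt.pid].clear()
        buffers[pkt.pid] += pkt.payload
    for pid, buf in buffers.items():             # flush whatever is left at EOF
        if buf:
            yield pid, bytes(buf)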

@jaroslaw-wieczorek commented Jun 10, 2019

Moved from #555

In the .ksy, is it possible to combine all value fields from data_chunk into a new _io stream? I need to combine (concatenate) the data into one stream for further processing, without the checksums.

types:
  data_chunk:
    seq:
      - id: value
        size: '(_io.size - _io.pos > 17) ? 16 : ((_io.size - _io.pos) - 2)'
      - id: checksum
        size: 2

Update:

Example data_chunks: [screenshot]

Each data_chunk has fields named "value" and "checksum". What I want to do is get one string containing the bytes from all the "value" fields, without the data from the "checksum" fields.

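Until that exists, the workaround in application code is something like this (a minimal Python sketch; parsed.data_chunks is an assumed name for the repeated data_chunk attribute in the enclosing type):

import io

def merged_values(parsed):
    # concatenate every chunk's `value`, dropping the 2-byte checksums
    return io.BytesIO(b"".join(chunk.value for chunk in parsed.data_chunks))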

@tannewt commented Aug 6, 2019

Hi hi! This would be handy for USB as well. For example, with USB mass storage over full-speed USB, packets are 64 bytes at a time, but reads and writes of data are done in 512-byte chunks over multiple low-level (IN) packets. This is also done for things like long descriptor strings.

@tisoft commented Jul 26, 2022

Squashfs (kaitai-io/kaitai_struct_formats#596) would also benefit from this. Metadata is stored in blocks that need to be processed individually and then concatenated before they can be parsed.

I have it working using custom functions, but native support would be great.
