Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Examples/strategies for dealing with broken file format implementations. #49

Closed
ams-tschoening opened this issue Nov 13, 2016 · 5 comments
Labels

Comments

@ams-tschoening
Copy link

ams-tschoening commented Nov 13, 2016

Parser for EN 13757-3:2012 needed

I need to create a parser for data defined in the standard EN 13757-3:2012 (Communication systems for and remote reading of meters - Part 3: Dedicated application layer) and you can get an example on page 4 of the following PDF. In the end it's like many other binary formats: Some fields of binary data, some with fixed length, some with variable one, all those fields have a special purpose and some implement a bit mask to decide what this purpose is etc.

http://fastforward.ag/downloads/docu/FAST_EnergyCam-Protocol-wirelessMBUS.pdf

I need to decide what/if I build from scratch, where some tools can be of benefit and I like your approach very much. Reading through your documentation, one question came into my mind, though:

How do I deal with the always present exceptions from the standards because of implementation errors of some vendor?

Currently I have already two of those exceptions, where the vendor claims to support EN 13757-3:2012, but actually doesn't in one case by purpose to optimize transferred data and in the other one simply because a firmware bug has been introduced. I'm unsure currently how I would be able to combine default data parsing and recognizing those errors and maybe X more into your approach. In my mind examples for dealing with such things or thinking of strategies how it could be supported at all are of common interest, so I decided to create this issue. If you think otherwise, feel free to close it...

One missing byte

The first problem I have is interesting, because it changes the data format in the first two bytes already: By default the datagrams I need to parse start with 4 bytes, the first containing the length, the second some communication flag and 3/4 are the vendor providing the package. One of my vendors simply removes the second byte with the communication flag, the C-field in the linked PDF with value 44h, to reduce the amount of data transferred, because this byte doesn't ever change in his use cases. But of course I can only know about that case if I read byte 2 and 3 and check if it's that special vendor or not and if so, the parser mode needs to be changed somehow.

With your approach I would create some root type like "header" or such and that would already be different for standard compatible implementation vs. not like I need it.

So would I need to create one of those header types for each implementation I have to deal with? And all of those would need to check something using "if"? And if it's 20 or 50, would all of those be in the same file? Even if I have to heavily document why such a header is needed? Or would be a better approach to generate a different parser for each such case? But what with all the other types then, would I need to redundantly maintain them per file instead of providing some kind of include/link to other types?

In the end, the first example is missing one byte, which influences the position of all the other types created.

Some flipped bits

The second problem is that in the end some arbitrary values are transferred in the datagram, which consist of a triple: One byte to explain the encoding of the value, one byte to explain the unit of the value, one or more bytes for the value itself. It's e.g. "BCD encoded integer" and "hours" and "10" to tell me that "something" last for 10 hours.

Now one of my vendor has a introduced a firmware bug which results in that some datagrams have a flipped bit e.g. for the first of those three bytes. What I need to do is check the vender, the device type and version and then decide that I need to flip some bit in some byte.

Checking for vendor etc. should be easy using "if", but how do I apply the bit flip? Do I need to implement a "process" statement like "zlib", "xor" etc.? Or is using an instance with a "value" calculation the better approach? And what if things get arbitrary complex, like I need to e.g. check a configuration for all versions of devices where this error occurs? Such configuration is not available to your compiler. Additionally, the same question like before applies: Do I combine 50 of such workarounds in the same ksy file?

I would be very thankful for your ideas, because currently I tend to not use your lib and instead implement the parser by hand to have full control to access things like configuration and such.

@GreyCat
Copy link
Member

GreyCat commented Nov 14, 2016

I've took a look at the PDF you're linking in, and the problems you're describing, and, well, I don't think that I can offer any "magic bullet" solutions to these ones. Even if you do it by hand (i.e. writing an imperative parser), both of these hacks would still remain hacks, still ugly and hard to maintain. If more hacks would pile on top of that (i.e. what you mention by "20 or 50"), of course, nothing good will come from that either: would it be .ksy, any other declarative format, or hand-coded algorithm, hacks would still be hacks :(

The best we can do is try to keep it as organized and easy-to-maintain as we can.

For the first problem, i.e. one missing byte from the beginning of the packet, I would opt for something like that:

meta:
  id: energycam_mbus
  endian: le
seq:
  - id: len
    type: u1
    doc: Number of bytes following (without 2 CRCs * 2 bytes)
  - id: b2
    type: u1
  - id: b3
    type: u1
  - id: b4
    type: u1
    if: not is_byte_skipping_vendor_hack
    # Read 3 bytes normally, read only 2 bytes if we have a byte_skipping_vendor
  - id: ident_number
    type: u4
    doc: Part of A-Field, IdentNumber
instances:
  is_byte_skipping_vendor_hack:
    value: b2 == 0x11 and b3 == 0x22 # or whatever ManufacturerID of glitchy vendor would there be
  command:
    value: 'is_byte_skipping_vendor_hack ? 0x44 : b2'
    doc: C-Field, command transmitted
  manufacturer:
    value: 'is_byte_skipping_vendor_hack ? (b3 << 8) + b2 : (b4 << 8) + b3'
    doc: M-Field, ManufacturerID

Basically, you start by reading b2-b3, then call is_byte_skipping_vendor_hack to analyze if that's a byte skipping vendor (you'll have to fix it to your particular ManufacturerID), then decided whether to read b4 or not by that boolean calculated instance. After that, command and manufacturer are all calculated fields, which again depend on is_byte_skipping_vendor_hack and reinterprete b2-b3-b4 based on that boolean assumption.

It seems fairly clean compromise from my point of view, and it generally as effective as it can get (i.e. you'll have to write something similar with such decision point as is_byte_skipping_vendor_hack anyway). is_byte_skipping_vendor_hack is lazy and memoized, thus it would only get calculated once on the first call, and all subsequent calls just return pre-calculated value.

@ams-tschoening
Copy link
Author

Guten Tag Mikhail Yakshin,
am Montag, 14. November 2016 um 09:29 schrieben Sie:

I've took a look at the PDF you're linking in, and the problems
you're describing, and, well, I don't think that I can offer any
"magic bullet" solutions to these ones. Even if you do it by hand
(i.e. writing an imperative parser), both of these hacks would still
remain hacks, still ugly and hard to maintain. If more hacks would
pile on top of that (i.e. what you mention by "20 or 50"), of
course, nothing good will come from that either: would it be .ksy,
any other declarative format, or hand-coded algorithm, hacks would still be hacks :(

Anyway, thanks a lot for your example. In the end all of those hacks
would remain in one KSY file, right? At least as long as I want your
compiler to do the reading work. Or is there any include/module
approach I'm missing?

If not, how easy/maintainable do you think is it to extend your
generated compiler and implement some or even all hacks in one or many
subclasses? That's would I would do if I created a parser from
scratch.

Do you provide protected access to reading methods where I could hook
in or are mostly using private methods with only public getters for
read types?

I still need to have a look at the code you generated today...

Mit freundlichen Grüßen,

Thorsten Schöning

Thorsten Schöning E-Mail: Thorsten.Schoening@AM-SoFT.de
AM-SoFT IT-Systeme http://www.AM-SoFT.de/

Telefon...........05151- 9468- 55
Fax...............05151- 9468- 88
Mobil..............0178-8 9468- 04

AM-SoFT GmbH IT-Systeme, Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB 207 694 - Geschäftsführer: Andreas Muchow

@GreyCat
Copy link
Member

GreyCat commented Nov 14, 2016

Flipping a bit sounds not that scary either, as flipping some flip is essentially a XOR with something like 0b10000000. You don't provide lots of details, but I can guess that you're talking about fields like DataInformationField, ValueInformationField and Value. Probably that would boil down to something like:

meta:
  id: energycam_mbus
  endian: le
# ...
instances:
  is_byte_flipping_hack:
    value: manufacturer == 0x3344 # or any other condition
types:
  energrycam_apl:
    seq:
      - id: control_info
        type: u1
      - id: access_number
        type: u1
      - id: state_of_meter
        type: u1
      # ...
      # This is the "data_block" for normal, standard-compliant devices
      - id: data_block_normal
        type: data_block
        if: not _parent.is_byte_flipping_hack
      # This is the "data_block" for glitchy, byte-flipping devices
      - id: data_block_flipped
        type: data_block
        process: xor(0b10000000)
        if: _parent.is_byte_flipping_hack
      # ...
    instances:
      # This is a easy access helper that will map to either "normal" or "flipped" versions depending on which one exists
      data_block:
        value: _parent.is_byte_flipping_hack ? data_block_flipped : data_block_normal
  data_block:
    seq:
      - id: data_information_field
        type: u1
      - id: value_information_field
        type: u1
      # ...

There we have 3 "layers": energycam (called "DLL" in the spec), energrycam_apl (called "APL" in the spec) and data_block, which is exactly these 3 fields which can have their bits flipped. Again, I can't call this "pretty" solution or anything, but, as far as I can see, all solutions that will introduce hacks and workarounds for these problems would bear some sort of tainting. At least from my point of view, this one tries to be reasonably clean - i.e. it clearly marks "hacks" as such and declares boolean values that state so (is_byte_flipping_hack), it maintains some degree of hiding implementation details from public API (i.e. you can always access foo.apl.data_block.data_information_field and not care if it used any hacks in parsing or anything), etc, etc.

As for "having 50s of them in one file vs having 50 different files" - it's generally a matter of preference. Kaitai Struct allows you to have several different files for different structures and allows you to reference types in different files. If you'll be using ksv, then you can load up several ksy files at once into it like that: ksv your-binary.dat first-file.ksy second-file.ksy third-file.ksy, etc.

So, if you feel like having them all in distinct files - feel free to. If you'd ask for my opinion, given that there are 3 types so far there (DLL, APL and that internal "data_block") - I'd stick them all into one file, at least until more hacks would be needed.

@GreyCat
Copy link
Member

GreyCat commented Nov 14, 2016

In the end all of those hacks would remain in one KSY file, right? At least as long as I want your compiler to do the reading work. Or is there any include/module approach I'm missing?

I guess I'm kind of answered that question, but let me elaborate some more. Right now, it's perfectly fine to have stuff like these:

meta:
  id: top_level
seq:
  - id: child
    type: child

and

meta:
  id: child
seq:
  - id: something
    type: u1 # or anything else

in 2 distinct files. It's not possible to separate a single type in 2 different types (at least now). If you feel like that we need to add some sort of include mechanism - let's discuss that - it shouldn't be that hard to implement. Right now we've implemented somewhat working IP networking stack using this approach alone (i.e. some files referencing types in other files).

If not, how easy/maintainable do you think is it to extend your generated compiler and implement some or even all hacks in one or many subclasses? That's would I would do if I created a parser from scratch.

You've probably meant "generated classes"? That would be perfectly reasonable - that's one of the proposed usage scenarios - you get "generated" class, subclass it and add all the imperative methods that you need.

Do you provide protected access to reading methods where I could hook in or are mostly using private methods with only public getters for read types?

It really depends on particular target language. AFAIR, for Java and C++, we generate private parsing method + private member fields + public getters. It's very easy to fix that and generate protected anything if you need that, of course.

@ams-tschoening
Copy link
Author

I think this can be closed for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants