Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validation of parsed data with assertions #435

Open
GreyCat opened this issue May 15, 2018 · 21 comments
Open

Validation of parsed data with assertions #435

GreyCat opened this issue May 15, 2018 · 21 comments
Milestone

Comments

@GreyCat
Copy link
Member

GreyCat commented May 15, 2018

Extracting easier part of #81, with a concrete proposal on implementation.

Context: we want to have some sort of way to state that certain attributes have some sort of "valid" values. Stating that has two implications:

  1. All values which are "not valid" must trigger an error of parsing
  2. In some cases, we can calculate values automatically during generation of stream

The simplest existing implementation is contents: it is essentially a byte array, which validates that its contents are exactly as specified in ksy file. On parsing, we'll throw an error if byte array contents won't match the expected value. On generation, we can just write the expected value, and not bother the end-user with setting the attribute manually.

Syntax proposal

Add valid key to the attribute spec. Inside it, there could be:

  • A string, integer or boolean => it will be treated as expression that's supposed to be the only valid value of this attribute. For example:
- id: foo
  type: u4le
  valid: 0x42 # expected in stream: 42 00 00 00
- id: bar
  type: strz
  valid: '"abcd"' # expected in stream: 61 62 63 64 00
- id: baz
  type: u2le
  valid: '2 * foo' # expected in stream: 84 00
  • A map, which can feature keys:
    • eq: expected states that attribute value must be equal to given value (i.e. the same as above)
    • min: expected states minimum valid value for this attribute (compared with a >= expected)
    • max: expected states maximum valid value for this attribute (compared with a <= expected)
    • any-of: list-of-expected states that valid value must be equal to one in the given list of expressions
    • expr: expression states that given expression need to be evaluated (substituting _ with an actual attribute value) to be true to treat attribute value as valid.

All expected and expression values are KS expression strings, which can be constants or can depend on other attributes. list-of-expected must be a YAML array of KS expression strings. expected inferred type must be comparable with type of the attribute. expression inferred type must be boolean.

@KOLANICH
Copy link

KOLANICH commented May 16, 2018

You have suggested to check such relations using syntax in YAML. Having this syntax in YAML has both drawbacks and profits. The drawback is that we have to support it alongside with usual expressions. The profit is that it is yaml, any tools working with yaml would be able to parse it.

So shouod we drop our current expr syntax and start using entirely yaml-based?

@GreyCat
Copy link
Member Author

GreyCat commented May 16, 2018

So shouod we drop our current expr syntax

What exactly is "our current expr syntax", and what do you mean by "dropping" it? We had no syntax proposal for validation until now, and I suspect what you imply is covered in valid/expr, for example, checking that value is even:

- id: foo
  type: u4
  valid:
    expr: _ % 2 == 0

@webbnh
Copy link

webbnh commented May 16, 2018

A map, which can feature keys

If the field is supposed to contain an enum value, can I check that by specifying a key of enum, or can any-of take the enum type-name instead of a list?

@GreyCat
Copy link
Member Author

GreyCat commented May 16, 2018

@webbnh Probably for enums we can just do the following:

  1. Ensure that in all languages "invalid" enum values does not trigger a error by default
  2. Add extra key here, something like that:
- id: foo
  type: u4
  enum: bar
  valid:
    enum-key: true

@KOLANICH
Copy link

KOLANICH commented May 16, 2018

What exactly is "our current expr syntax", and what do you mean by "dropping" it?

What you have proposed can be expressed as a logical expression, assumming that we have some facilities.
eq: expected <=>

throw:
  if: _ != expected

min: expected <=> if: _ < expected
max: expected <=> if: _ > expected
any-of, enum-key <=> #434

But we can go from another side. We can eliminate textual expressions and force the developer into using YAML AST.

for example if: a != 0 and b > _io.size would be something like

size:
  and:
    - '!=':
      - a
      - 0
    - '>':
      - b
      - '_io.size'

I know that it is monstrouous space-wasting and unusable, but I'm not sure ;)

@GreyCat
Copy link
Member Author

GreyCat commented May 17, 2018

My point here is actually very simple: if we'll just do it as one expression — which we actually still can do, i.e. stuff like that is legal:

valid:
  expr: _ == 42 # the same as `eq: 42`
valid:
  expr: _ >= 42 # the same as `min: 42`

It's easy to do, but we lose declarativity this way. If it's only an expression, one can't just look at this validation spec and render it as a range input (for example, if a min/max boundaries are known), or as a combo box with a set of choices (if a set of choices is known). Basically, if we only have a validation expression, the only thing we can do (without reversing it and applying some symbolic solver) is checking every particular value for validity, and that's it. Declarative configuration offers more, for example, you can clearly get a list of valid values, or boundaries, or stuff like that. Proposed mechanism is also extensible. For example, if we'll want to add size limits of strings in future, we can do something like

valid:
  max-size: 42

and it could be rendered as string with limited input box, not asking a user to enter arbitrary string and then rejecting it for some mysterious reason.

@KOLANICH
Copy link

KOLANICH commented May 17, 2018

I haven't thought about using that info for metadata for GUI purposes. It's cool idea. In this case

reversing it and applying some symbolic solver

IMHO is the best way to do it. If we have problems with performance we can try to cache solver results.

It's easy to do, but we lose declarativity this way.

I guess we don't lose declarativity, since we still declare "this value must satisfy this logical expression" and all the needed info is there, but we lose some verbosity. What we win, is that we don't make an assumption about programmer using the proposed verbose syntax instead of exprs. So even if a programmer used exprs, if we have a system recovering these info from exprs, we would expect it working correctly, even retroactively, when a new extractor is added and it becomes working even on old specs without any modification

@kalidasya
Copy link

kalidasya commented May 23, 2019

Please take into account bit sized values like:

- id: fixed_10
  type: b2
  contents: [2]

- id: fixed_10
  contents: [true, false]

@GreyCat
Copy link
Member Author

GreyCat commented Jul 6, 2019

I went forward and do some proof-of-concept implementation in current compiler. Namely:

  • valid is now recognized and parsed in its short form => only equals is supported now, but it's really easy to extend it to support ranges, lists, arbitrary expressions, etc.
  • There is a system of relevant ValidationSpec objects created during KSY loading
  • For Java, Ruby, Python, JS: codegen actually generates validation code
  • Added 3 tests valid_*.ksy:
  • Added a more or less formal system of exceptions specific to KS
    • Implemented it in Ruby, Python, Java; JS is on the way
    • Added relevant knowledge of this system of exceptions into ksc

@generalmimon
Copy link
Member

generalmimon commented Oct 7, 2019

The current KSC (0.9-SNAPSHOT) generates incorrect code when valid is used on an optional field (with if). The validation check is executed outside the if clause instead of inside it.

Test code:

meta:
  id: valid_if_test
seq:
  - id: optional
    if: false
    type: u1
    valid: 42

produces (in JavaScript, but it happens in all languages):

ValidIfTest.prototype._read = function() {
  if (false) {
    this.optional = this._io.readU1();
  }
  if (!(this.optional == 42)) {
    throw new KaitaiStream.ValidationNotEqualError(42, this.optional, this._io, "/seq/0");
  }
}

@GreyCat
Copy link
Member Author

GreyCat commented Oct 19, 2019

@generalmimon

The current KSC (0.9-SNAPSHOT) generates incorrect code when valid is used on an optional field (with if). The validation check is executed outside the if clause instead of inside it.

Probably moving it inside if would make sense, good catch! Can I ask you to contribute relevant test to tests repo?

@generalmimon
Copy link
Member

generalmimon commented Oct 19, 2019

Sure.

generalmimon added a commit to generalmimon/kaitai_struct_formats that referenced this issue Nov 24, 2019
There are two types of RIFF structs: with asserts (chunk and
parent_chunk_data) and without asserts (chunk_generic and
parent_chunk_data_generic). Those without asserts won't be
needed when kaitai-io/kaitai_struct#435 (comment) is resolved.
@GreyCat
Copy link
Member Author

GreyCat commented Nov 25, 2019

Recent compiler updates fixed ValidFailInst and ValidNotParsedIf tests by @generalmimon. Thanks for enhancing our test base!

generalmimon added a commit to kaitai-io/kaitai_struct_go_runtime that referenced this issue Sep 11, 2020
arieltorti pushed a commit to arieltorti/kaitai_struct_cpp_stl_runtime that referenced this issue Oct 15, 2020
@sykhro
Copy link

sykhro commented Mar 25, 2021

Given that the keyword is mostly implemented, is it time to add it to the .ksy schema?
This would be beneficial for users (like me) that use a YAML language server for validation while making structures

@krisutofu
Copy link

krisutofu commented Jan 2, 2022

I found out that it will only pay attention to the first valid key of the field but ignore the others.

seq:
  - id: first_byte
    type: u1
    valid:
      eq: 0x00
    # ignores this:
    valid: 0x01
    # ...
    valid:
      expr: _ >= clip_bottom
      # ignores these:
      expr: _ <= clip_top
      expr: _ == 0x7 or _ == 0x17 or _ == 0x23 or _ == 0x68 or _ == 0xB5

A reasonable behaviour would be a logical AND of all valid expressions.

It strictly separates min/max, eq, any-of and expr so that combinations of these are not accepted. The error message might be confusing, stating that the key is unknown.
Writing it like this is not possible:

seq:
  - id: clip_bottom
    type: u1
  - id: clip_top
    type: u1
  - id: byte
    type: u1
    valid:
      min: clip_bottom
      max: clip_top
      any-of: [0x7, 0x17, 0x23, 0x68, 0xB5]

If this would work it would be a fair solution:

    # ...
    valid:
      min: clip_bottom
      max: clip_top
    valid:
      any-of: [0x7, 0x17, 0x23, 0x68, 0xB5]

The only combination, which works at the moment, is one-time min + max. Multiple min or max are ignored. A useful behaviour would be an n-ary min and max. But if these would be ANDed instead, multiple min would validate an infimum and multiple max a supremum.

This is not a pressing issue. There is no functional limitation. One could write the set of validation formulas as a single formula:

    # ...
    valid:
      expr: |
        _ >= clip_bottom and
        _ <= clip_top and
        (_ == 0x7 or _ == 0x17 or _ == 0x23 or _ == 0x68 or _ == 0xB5)

@KOLANICH
Copy link

KOLANICH commented Jan 2, 2022

I found out that it will only pay attention to the first valid key of the field but ignore the others.

.ksy is the file extension, where the letter y stands for YAML. What is placed into ksy files is a subset of YAML. YAML is the language to serialize and deserialize possibly nested data like in JSON.

in JavaScript it'd look like

let field = {};
....
field.valid = {eq: 0x00};
field.valid = 0x01;

As you see, the latter assignments override the former.
When you store data it is clear that this must never happen. Different YAML parsers react differently on that. Some raise an error. Other ones just take one of the values and silently ignore the rest.

So your mental model is wrong, repeated keys don't mean what you assume them to mean.

@krisutofu
Copy link

krisutofu commented Jan 2, 2022

...
in JavaScript it'd look like

let field = {};
....
field.valid = {eq: 0x00};
field.valid = 0x01;

As you see, the latter assignments overide the former. When you store data it is clear that this must never happen. Different YAML parsers react differently on that. Some raise an error. Other ones just take one of values and silently ignore the rest.

So your mental model is wrong, repeated keys don't mean what you assume them to mean.

Sorry, thank you for clarifying.

generalmimon added a commit to kaitai-io/kaitai_struct_compiler that referenced this issue Jun 17, 2022
generalmimon added a commit to kaitai-io/coreldraw_cdr.ksy that referenced this issue Oct 1, 2022
I forgot that the compiler doesn't yet perform type checking on `valid` expression
(see kaitai-io/kaitai_struct#435 (comment)),
so we need to pay attention ourselves.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants