
Serialization #27
Open
KOLANICH opened this issue Sep 13, 2016 · 63 comments

@KOLANICH commented Sep 13, 2016

You have deserialization. How about serialization?

@GreyCat self-assigned this Sep 13, 2016

@GreyCat (Member) commented Sep 13, 2016

We've been slowly discussing this issue, but it's obviously not as easy as it looks. It will probably be a major feature for something like the second major version (v2.0 or so), so it'll eventually be there, but don't hold your breath for it.

@KOLANICH (Author) commented Sep 13, 2016

We've been slowly discussing this issue, but it's obviously not as easy as it looks.

What are the difficulties?

@GreyCat (Member) commented Sep 13, 2016

It is pretty easy for simple fixed (C-style) structures. However, as soon as you start using instances that bind certain values to offsets in the stream, it becomes much more complex. A very simple example:

seq:
  - id: num_files
    type: u4
  - id: files
    type: file
    repeat: expr
    repeat-expr: num_files
types:
  file:
    seq:
      - id: file_ofs
        type: u4
      - id: file_size
        type: u4
    instances:
      body:
        pos: file_ofs
        size: file_size

This .ksy describes a very simple file index structure which consists of num_files (file_ofs, file_size) pairs. Each pair describes a "file", which can be accessed by index using something like:

file_contents = archive.files[42].body

However, adding a file to such an archive is a challenge. Ideally, one would want to do something like:

archive.files[42].body = file_contents

This should automatically set the bound file_size to accommodate the length of the assigned file_contents and, what's much more complex, assign file_ofs somehow. This is not an easy task: KS has no innate knowledge of how to manage unmapped space in the stream, whether it is limited or not, whether you should find some unused spot and reuse it, or just expand the stream and effectively append file_contents to its end, etc.

What's even harder is that in many cases (e.g. archive files, file systems, etc.) you don't want to rewrite the whole file, but just make some changes (e.g. appending a new file to the end of the archive, or reusing some pre-reserved padding, or something like that).

Another, maybe even simpler example: when you read PNGs, you don't care about checksums. When you write PNGs, you have to generate proper checksums for every block, so we need block checksumming algorithms and some way to bind them to a block.
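For instance, PNG's per-chunk CRC-32 covers the chunk type and chunk data, so a generated writer would have to recompute it whenever the data changes. A minimal Java sketch of just the checksum part (a hypothetical helper, not generated code):

```java
import java.util.zip.CRC32;

class PngCrcExample {
    // The CRC-32 of a PNG chunk is computed over the 4-byte chunk type plus the
    // chunk data, and is written as a big-endian u4 right after the data.
    static long pngChunkCrc(byte[] chunkType, byte[] chunkData) {
        CRC32 crc = new CRC32();
        crc.update(chunkType);
        crc.update(chunkData);
        return crc.getValue();
    }
}
```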

@KOLANICH (Author) commented Sep 13, 2016

What's the problem with deserializing them, editing, serializing back and then writing?

@GreyCat (Member) commented Sep 13, 2016

What exactly do you refer to as "them"? The file archive example? PNG checksums?

@KOLANICH (Author) commented Sep 13, 2016

What exactly do you refer to as "them"? The file archive example? PNG checksums?

Almost anything, including files and checksums. In your example,
repeat-expr: num_files
and
id: num_files type: u4
mean that "files has length num_files". When you deserialize, you read num_files, create a files array of length num_files and read num_files structures into it. When you serialize you need the inverse: update num_files with the array length and then write. What if num_files is a complex expression we cannot derive automatically? We shift responsibility to the user: to be able to serialize, they must provide both forward and inverse mappings for such expressions, except in simple cases where they can be derived automatically.

Now some ideas on how to implement this. When processing a KS file, it should build a graph of object dependencies, in your case num_files <-> files[].length. Then we apply rules. In this case "scalar <-> array.length" becomes "scalar <- array.length", which results in num_files <- files[].length. We can say that every field has some absolute and actual strength, that edges can go only in the direction of non-increasing actual strength, and that the actual strength of a node is max(max(actual strength of neighbours), absolute strength). This way we transform the graph into a tree and reduce the number of free variables in the case of equal strength. When you need to serialize, you process the description member by member, launch lazy evaluation according to the graph, and write the members.

If you want to minimize the number of write operations, you store both versions in memory, create a diff, and try not to touch the parts not touched by the diff.

@GreyCat (Member) commented Sep 13, 2016

What if "num_files" is complex expression we cannot derive automatically?

Exactly my thoughts. And actually even that requires us to create some sort of inverse derivation engine. For example, if we have a binding:

    - id: body_size
      type: u4
    - id: body
      size: body_size * 4 + 2

and we update body, we should update body_size, assigning (body.size() - 2) / 4 to it. If there are some irreversible bindings (e.g. modulo, hashes, etc.), then, at the very least, we need to detect that situation and allow some extra syntax to make it possible for the user to set these inverse dependencies manually.

We can say that every field has some absolute and actual strength, that edges can go only in the direction of non-increasing actual strength, and that the actual strength of a node is max(max(actual strength of neighbours), absolute strength).

I'm sorry, but I fail to understand that. Could you rephrase it, provide some examples, or explain why it is helpful, i.e. which problem we're trying to solve here?

If you want to minimize the number of write operations, you store both versions in memory, create a diff, and try not to touch the parts not touched by the diff.

The point is that we need to add some extra syntax (or API, or something) to make it possible to do custom space allocation or stuff like that. For example, if you're editing a file system, you can't just always append stuff to the end of it: a block device usually has a finite capacity, sooner or later you'll exhaust it, and your only choice will be to pick and reuse some free blocks in the middle of the block device.

You seem to have good ideas on implementation; would you like to join and help implement it? Basically anything will help: .ksy syntax ideas, API ideas, tests for serialization, compiler ideas and code, etc.

@KOLANICH (Author) commented Sep 13, 2016

And actually even that requires us to create some sort of inverse derivation engine.

Or take some library. Symbolic evaluation is a rather well-studied area of math with lots of papers written about it. I haven't studied it, so it is possible that some of the ideas I've mentioned are already discussed there.

I'm sorry, but I fail to understand that. Could you rephrase it, provide some examples, or explain why it is helpful, i.e. which problem we're trying to solve here?

Suppose we have an array and a size_t scalar, the number of elements in the array. Which one is primary in this couple? What defines it entirely? The data in the array does; the number is needed only to allow us to read the array correctly. For example, in C strings you don't need a number because the convention is to terminate with \0. Add more data to the array and you'll have to increase the array capacity, which means you will have to increase the number in order to read it (and everything after it) correctly. So let array.length have strength 2, and the scalar strength 1. Then you have the link files[].length <-> num_files. Suppose we also have another array filenames[] whose capacity is num_files - 2, and also foo and bar fields with something in them. Connections: filenames[].length <-> num_files, num_files <-> foo, bar <-> foo and files[].length <-> foo.

Now we start processing. Strengths are shown in brackets, absolute first.
1 files[].length (2,0)
2 files[].length
files[].length (2,2) itself is processed, go to edges
3 files[].length (2,2) <-> num_files(1,0)
4 files[].length (2,2) <-> num_files(1,1)
2 > 1, so remove the reverse edge and set actual strength to 2
5 files[].length (2,2) -> num_files(1,2)
6 files[].length (2,2) <-> foo(1,0)
7 files[].length (2,2) <-> foo(1,1)
8 files[].length (2,2) -> foo(1,1)
9 files[].length (2,2) -> foo(1,2)
10 num_files(1,2) <-> filenames.length (2,0)
11 num_files(1,2) <-> filenames.length (2,2)
equal, so both edges are kept.
12 num_files (1,2) <-> foo(1,2)
13 foo(1,2) <-> bar(1,0)
14 foo(1,2) <-> bar(1,1)
15 foo(1,2) -> bar(1,1)
16 foo(1,2) -> bar(1,2)

Then we need to serialize and read the config:
1 first comes num_files
it has an incoming edge from files[].length; it is the only incoming edge.
2 files[].length is already evaluated; take its value and evaluate num_files.
3 it has 2 bidirectional edges, filenames[].length and foo:
evaluate them the same way and check whether they match the value of num_files.
4 serialize and write
5 continue with the rest of the fields.

explain why it is helpful, i.e. which problem we're trying to solve here?

It determines the order in which we should evaluate expressions and which expressions depend on which, and it helps to find conflicts.

And I think you should really read something about symbolic evaluation (I haven't; the ideas above are just ad-hoc thoughts, and maybe there are better approaches).

You seem to have good ideas on implementation; would you like to join and help implement it?

Sorry, no. Maybe I'll send you some ideas or code later, but I can't be a permanent member of this project.

Basically anything will help: .ksy syntax ideas, API ideas, tests for serialization, compiler ideas and code, etc.

OK.

@GreyCat (Member) commented Sep 13, 2016

Add more data to the array and you'll have to increase the array capacity, which means you will have to increase the number in order to read it (and everything after it) correctly. So let array.length have strength 1, and the scalar 0.

Ok, then what shall we do in case of the following:

seq:
  - id: num_objs
    type: u4
  - id: headers
    type: header
    repeat: expr
    repeat-expr: num_objs
  - id: footers
    type: footer
    repeat: expr
    repeat-expr: num_objs

This implies that headers[] and footers[] shall always have the same number of objects. How are the strengths assigned in this case, and how do we enforce that both arrays have an equal number of objects?

@KOLANICH (Author) commented Sep 13, 2016

1. see the example above
2. by throwing an exception, of course

PS: graph evaluation is done by the KS compiler;
runtime checks are done by the generated code.

And I fixed the priorities to match the trace lines I made, supposing the strengths were 1 and 2.
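For the headers/footers example, such a generated runtime check might look roughly like this (a hedged Java sketch; the _check method and FormatConsistencyError naming only show up later in this thread and are not settled API):

```java
    public void _check() {
        // Enforce that both arrays agree with num_objs before anything is written
        if (this.headers.size() != this.numObjs)
            throw new FormatConsistencyError("headers count", this.headers.size(), this.numObjs);
        if (this.footers.size() != this.numObjs)
            throw new FormatConsistencyError("footers count", this.footers.size(), this.numObjs);
    }
```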

@GreyCat (Member) commented Mar 24, 2017

I've committed very basic PoC code that demonstrates seq serialization in Java. It is available in distinct "serialization" branches. To test it, one will need:

Obviously, only a few tests were converted and, to be frank, only a very basic set of types is supported right now. Even strings are not implemented yet. Testing is very basic too: one can run ./run-java and see that two packages are run: spec contains the "normal" reading tests, and specwrite contains the writing tests.

I'd really love to hear any opinions on the API (both runtime & generated), the Java implementation (it's a real pain in the ass, as ByteBuffer does not grow, so you have to preallocate the array, or probably reimplement everything twice with something that grows), etc.

@GreyCat (Member) commented Mar 25, 2017

Strings work, repetitions work, enums work, things are slowly coming to reality :)

@GreyCat (Member) commented Mar 27, 2017

Serialization progresses slowly. Basic in-stream user types and processing on byte types work, various fancy byte type stuff like terminator and pad-right works, etc, etc.

I've got an idea that might be very simple to implement.

Step 1: manual checks

Generate _read, _write and _check methods. _check runs all the internal format consistency checks to ensure that stuff that will be written will be read back properly. For example:

seq:
  - id: len_of_1
    type: u2
  - id: str1
    type: str
    size: len_of_1 * 2 + 1

would generate:

    public void _read() {
        this.lenOf1 = this._io.readU2le();
        this.str1 = new String(this._io.readBytes(lenOf1() * 2 + 1), Charset.forName("ASCII"));
    }
    public void _write() {
        this._io.writeU2le(this.lenOf1);
        this._io.writeBytes((this.str1).getBytes(Charset.forName("ASCII")));
    }
    public void _check() {
        if (this.str1.bytes().size() != lenOf1() * 2 + 1)
            throw new FormatConsistencyError("str1 size", this.str1.bytes().size(), lenOf1() * 2 + 1);
    }

To use this properly, one must manually set both lenOf1 and str1:

r.setStr1("abcde");
r.setLenOf1(2);
r._check(); // should pass, so we're clean to take off
r._write(); // should write consistent data that's guaranteed to be readable back

Step 2: dependent variables

We declare some fields as "dependent", and mark them up in ksy:

seq:
  - id: len_of_1
    type: u2
    dependent: (str1.to_b("ASCII").size - 1) / 2
  - id: str1
    type: str
    size: len_of_1 * 2 + 1

This means that len_of_1 becomes a read-only variable and the setLenOf1 setter won't be generated. Instead, it would generate a slightly different _write:

    public void _write() {
        this.lenOf1 = (str1().getBytes(Charset.forName("ASCII")).size() - 1) / 2;
        this._io.writeU2le(this.lenOf1);
        this._io.writeBytes((this.str1).getBytes(Charset.forName("ASCII")));
    }

Obviously, using this boils down to a single r.setStr1("abcde");.

Any comments / ideas / etc?

@KOLANICH (Author) commented Mar 28, 2017

Any comments / ideas / etc?

  1. move evaluation of the expression into a separate method
  2. use check and write in a generic form, without explicit expressions in them, but with the method from 1)
  3. Why not use value instead of dependent?
  4. why do we use .to_b("ASCII").size? The string encoding is known, so why not just .size?
  5. these dependent expressions are ugly; it'd be nice to eliminate them, but we need a decision on what kinds of expressions should be resolved automatically. I guess linear ones should be enough. Another question is how to solve them. There is exp4j for parsing and storage, but it'll require some code to build a simple symbolic Gaussian elimination solver over it. If we want to write it in Python, here are the docs for a lib wrapping multiple SMT solvers: https://github.com/angr/angr-doc/blob/master/docs/claripy.md

@GreyCat (Member) commented Mar 28, 2017

  • move evaluation of the expression into a separate method
  • use check and write in a generic form, without explicit expressions in them, but with the method from 1)

You mean, something like this?

    public void _read() {
        this.lenOf1 = this._io.readU2le();
        this.str1 = new String(this._io.readBytes(_sizeStr1()), Charset.forName("ASCII"));
    }
    public void _write() {
        this._io.writeU2le(this.lenOf1);
        this._io.writeBytes((this.str1).getBytes(Charset.forName("ASCII")));
    }
    public void _check() {
        if (this.str1.bytes().size() != _sizeStr1())
            throw new FormatConsistencyError("str1 size", this.str1.bytes().size(), _sizeStr1());
    }
    public int _sizeStr1() {
        return lenOf1() * 2 + 1;
    }

Does it bring any benefits? It won't really simplify generation (probably the contrary), and we'll need to invent tons of names for all those size, if, process, etc. expressions.

  • Why not use value instead of dependent?

Naming is, of course, still to be discussed. One argument I have against value is that value is already used for reading in value instances.

why do we use .to_b("ASCII").size? String encoding is known, so why not just .size?

Using size on a string gives the length of the string in characters. If you put some non-ASCII characters into that string, a proper .to_b("ASCII").size conversion will give you an exception, while just assuming "number of bytes = number of characters" will give you corrupted data.

I guess we could try to do some sort of .bytesize method for strings taken verbatim from the format definition, where the "encoding is known", to save retyping the encoding name. However, it still won't work on modified strings, i.e. it's possible to implement str1.bytesize, but it won't work with (str1 + 'x').bytesize (as the latter is a CalcStrType, which lacks any source encoding info by design).

  • these dependent expressions are ugly; it'd be nice to eliminate them

First of all, you can't really eliminate them completely in any case. Some functions are just irreversible, and in some cases you'll have more free variables than constraints. For example:

seq:
  - id: a
    type: u1
  - id: b
    type: u1
  - id: my_str
    type: str
    size: a + b

Even if you have the byte size of my_str, you can't set both a and b automatically. Reversing stuff automatically would be more of a syntactic sugar feature, just to save typing boring stuff where possible. In fact, I heavily suspect that we'll cover 95% of cases with very crude logic like size: a => a = attr.size.

@KOLANICH (Author) commented Mar 28, 2017

value is already used for reading in value instances.

Yes, that's why I've chosen it. It has the same semantics: deriving the value of a struct member from other ones makes it some kind of instance, just tied to an offset.

we could try to do some sort of .bytesize

The size attribute in a .ksy means size in bytes, so I see no reason for it to mean anything else in the expression language.

First of all, you can't really eliminate them completely in any case.

Yes, we have already discussed this.

Even if you have the byte size of my_str, you can't set both a and b automatically.

Since they are of equal strength, this should cause ksc to emit a warning about undefined behavior.

Reversing stuff automatically would be more of a syntactic sugar feature, just to save typing boring stuff where possible.

Not only that. One expression can contradict another, which can cause nasty errors or can be used as a backdoor. Another issue is that we (humans) don't know the exact expressions ahead of time. So I propose the following:
1. have a syntax to provide a manual expression
2. missing expressions are generated by the compiler and inserted into the ksy as another type of expression
3. a human examines the ksy output for errors and malicious code
4. the verified output is used to generate the actual code
5. we would sometimes want to:

  • regenerate all non-manual expressions in a ksy
  • check that expressions don't contradict each other

In fact, I heavily suspect that we'll cover 95% of cases with very crude logic like size: a => a = attr.size.

Maybe.

@ixe013 commented Oct 5, 2017

I would like to play with the serialization branch...

Is there a pre-built compiler and Java runtime of the serialization branch available? If not, that's OK, I'll set up a Scala build environment.

@GreyCat (Member) commented Oct 5, 2017

@ixe013 There are no pre-built packages for such experimental branches, but it's relatively straightforward to try it yourself. You don't really need anything beyond sbt — it would download all the required stuff (including Scala compiler and libraries of the versions needed) automatically. See dev docs — it's literally one command to run, like sbt compilerJVM/universal:packageBin — et voila.

@glenn-teillet commented Oct 10, 2017

I was able to build a read-write version of my ksy; however, the generated code does not compile, as the reader reads Ints and stores them in enums as Longs (Int->Long), and _write() tries to write the enum Long as an Int (Long->Int causes an error).

I have these errors:
The method writeBitsInt(int, long) is undefined for the type KaitaiStream
The method writeU1(int) in the type KaitaiStream is not applicable for the arguments (long)
The method writeU2be(int) in the type KaitaiStream is not applicable for the arguments (long)

@glenn-teillet commented Oct 10, 2017

I also have errors in every constructor, where it tries to assign the _parent received as a KaitaiStruct in the constructor signature to the _parent field without casting to the more specific type.

Type mismatch: cannot convert from KaitaiStruct to KaitaiStruct.ReadWrite

@GreyCat (Member) commented Oct 10, 2017

@glenn-teillet There's still lots of work to do; it's nowhere near production-ready. Before trying to compile real-life ksy files with lots of complicated details, we should probably spend some time getting the read-write test infrastructure to work and porting that older serialization branch code to the modern codebase.

@KOLANICH (Author) commented Oct 23, 2017

Some suggested definitions for the serialization spec:

  • Serialization.
    Suppose we have a binary format f, a set of sequences of bits FS, its subset of byte sequences forming the valid format FS_f, a set of object-oriented Turing-complete programming languages PL, a set of valid Kaitai Struct definitions KSY, including the subset of definitions for the format f KSY_f, and the KS compiler KSC : PL × KSY → (PSC, SSC), where PSC: FS → O is a set of parsing programs, SSC: O → FS is a set of serializing programs, and ssc_{ksy_f}(psc_{ksy_f}(s)) ≡ s ∀s ∈ FS_f, ∀ksy_f ∈ KSY_f, ∀pl ∈ PL, where KSC(pl, ksy_f) = (psc_{ksy_f}, ssc_{ksy_f}). To be practically usable, there should be a way to create an o = psc_{ksy_f}(s) programmatically without doing any parsing of an actual bit string s.

Serialization is the part of KSC producing serialization programs.

  • Internal representation is the set of objects created by KSC-generated code in the program's runtime.
  • Finalization is the process of transforming the internal representation in such a way that only trivial transformations are left to be done to create a serialized structure.
  • An expression is a mapping from a subset of internal representation variables, called expression arguments, to another variable, called the expression output.
  • A reverse expression is an expression mapping an original expression's output back to its arguments with respect to the current state of the internal representation (including the arguments).
  • Trivial transformations are the ones not involving computing any expressions. Examples of trivial transformations are endianness conversions and bit shifts induced by using bit-sized fields.
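The round-trip requirement buried in the first definition, restated on its own (same notation as above):

```latex
\forall pl \in PL,\ \forall ksy_f \in KSY_f:\quad
  KSC(pl,\, ksy_f) = (psc_{ksy_f},\, ssc_{ksy_f})
  \quad\text{and}\quad
  ssc_{ksy_f}(psc_{ksy_f}(s)) = s \quad \forall s \in FS_f
```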

Comments?

@FSharpCSharp commented Dec 22, 2017

This is a very interesting subject. What is the current status of it? Will it be pursued in the future? It would be very good to be able to use serialization. Especially for bit-based binary formats this would be a great relief. I hope the issue will be pursued further and that we can also benefit from it in the future.

@GreyCat (Member) commented Dec 22, 2017

No updates here; the latest comments like #27 (comment) still reflect the current state of affairs. Help is welcome :)

@fudgepop01 (Member) commented Jun 25, 2019

I think there's something incredibly interesting discussed in [this talk](https://youtu.be/AdNJ3fydeao) that gives me some ideas that may (or may not) have already been brought to the table.

It explains how Svelte (a frontend framework/web compiler) handles truly reactive data. It sort of takes the JavaScript $: syntax (don't ask me what it's supposed to do natively, I don't know either) and generates something reactive from it. Watch the talk to experience some more of what it can do - it's really cool.

I think this is a brilliant way of handling dependencies between variables, and it really helped me visualize what something like that might look like in kaitai struct. Theoretically, could there be a second stage of compilation for the generated files? The first pass is what we have now which generates the parser.

The change would be placing markers with data about values that relate to each other. It doesn't need to be syntactically correct to the native language - that's for the second pass to handle. For now it just generates the parser along with some type of markers. This means that what's generated at this point is essentially just a long string.

The second pass (perhaps a kaitai_struct_linker or something) would ONLY look for these markers and replace them with the proper code to implement the relationships between them.

In my (possibly incredibly naïve) mind, this is good because it separates the concerns of recompilation and serialization into two different phases that could potentially be toggled. If a parser is already in place and variables just need to be "linked" - just run the kaitai_struct_linker. If there is only a need for a parser, then just run the standard compiler that we already have (with an option toggled to leave out the markers).

...Does this idea make any sense?

@KOLANICH (Author) commented Aug 19, 2019

The current processors architecture is in large part incompatible with serialization. Having 2 functions is enough for very dumb processors, but not enough for smart ones like compression algorithms.

  • For compression algorithms we have to introduce a function for getting their arguments from the blob.
  • But these arguments are not the only parameters. In fact, we can split the dictionary from the arguments. This means yet another method to compute an optimal dictionary shared by multiple streams.
  • And both these methods are specific to compression algos only.

So we need another architecture.

Probably we should eliminate process and instead introduce types doing the same, coupled with typed value instances #127, templates #135 and chains #196.

It may look like

seq: 
  - id: dictionary
    size: 256
  - id: compressed_blob0
    size: 100
  - id: compressed_blob1
    size: 500
instances:
  a_compression:
    type: kaitai.compress.algo
  b_compression:
    type: kaitai.compress.algo
  a:
    io: kaitai.compress.algo.uncompressed._io
    type: a

where the computation of the optimal dictionary should be specified in a separate spec, which should be a part of the KSC stdlib.

More precisely

compression_type:
  params:
    - id: compressed
      type: bytes
      generate: compress(args, dict, uncompressed)
    - id: args
      type: any
      generate:
        - get_args(compressed)
        - compute_optimal_args(uncompressed[i])
    - id: dict
      type: bytes
      generate: compute_optimal_dict(uncompressed[], args) # [] informs the compiler that all the other args form the key by which all uncompressed blobs are aggregated using this function. I have thought a bit about Einstein notation, but currently have no idea how much use it has here
    - id: uncompressed
      type: bytes
      generate: decompress(args, dict, compressed)

defines the interface of a processor type

and

kaitai.compress.*: compression_type

defines that the user-implemented types under that namespace are compressors
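A hedged Java sketch of what such a richer processor interface could look like on the runtime side (all names here are illustrative and do not exist in any Kaitai Struct runtime):

```java
import java.util.List;

// Illustrative only: a processor exposing enough operations for both parsing and
// serialization, mirroring the generate: functions listed in the spec sketch above.
public interface SmartProcessor<Args> {
    byte[] decompress(byte[] compressed, Args args, byte[] dictionary);      // decompress(args, dict, compressed)
    byte[] compress(byte[] uncompressed, Args args, byte[] dictionary);      // compress(args, dict, uncompressed)
    Args extractArgs(byte[] compressed);                                      // get_args(compressed)
    Args computeOptimalArgs(byte[] uncompressed);                             // compute_optimal_args(uncompressed[i])
    byte[] computeOptimalDict(List<byte[]> uncompressedStreams, Args args);   // compute_optimal_dict(uncompressed[], args)
}
```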

@jchv commented Sep 8, 2019

FWIW, just to keep conversations up to date, I am working on trying to add support for basic serialization capabilities in various runtimes.

It looks largely like the Java runtime at least for the write functions themselves, which is entirely accidental (I had only just looked to see if there were any major differences.) I plan to continue pushing PRs to as many runtimes as possible, please yell at me if I'm doing it wrong. :)

(Sidenote: I am aware for more advanced write functionality it would make sense to work on the compiler and runtime portions in parallel, but for these really basic operations I don't think it matters very much, and it's low hanging fruit.)

@jchv commented Sep 8, 2019

OK, I think it's time to talk about the interface. I realize there's been some discussion of this already, and a lot of it has focused on reversible expressions. I don't want to get into that; all of that stuff would be very nice but is far in the distance, in my opinion. One only needs to look at the amount of progress made on this issue since it was created to realize that there is simply not enough time for going directly into symbolic execution and whatnot from scratch. Today users would benefit from even the most basic serialization support. I think that's what we should aim to implement first, as long as it doesn't impede the potential future.

First, setters. I think any language that has getters or helper methods today should have parallel methods for setting as well, and some helper methods may be useful even in languages that do not currently have getters or helper methods. There's already an issue tracking this, #566. This is probably the first thing that should be done, before actually trying to implement writing, since it's not very useful otherwise.

  • Add setters for basic primitives. For integers, byte arrays, strings, and almost anything else that compiles down to a language primitive, this should be extremely easy.
  • For some kinds of variables, mutable variations of getters may be needed. For example, for underlying KaitaiStructs.
  • For repeated fields, represented as arrays, it's probably advisable to have helpers for adding new entries as well as getting a mutable underlying value.
  • Users of the generated library would be responsible for making sure everything makes sense. KaitaiStruct could generate assertions in the write method that the expressions evaluate to the expected values before writing and throw an exception if not. (This would, for example, catch mistakes where the user forgot to increment a length prefix.) This has been discussed here, but I think it is worth including even in the early stages.

Secondly, instances. With non-value instances, the incoming KaitaiStream is saved and reused later to lazily grab objects. This works great for reading, but poses problems for writing. In order to support more advanced use cases, modifications to the kaitai struct schema may be needed. For now, I propose pushing some of the complexity back to the user in exchange for allowing the user to do advanced things today:

  • Don't store a KaitaiStream for each KaitaiStruct. Instead, each non-value instance would generate a new Instance class deriving from KaitaiInstance.
  • KaitaiInstance would provide a function for writing itself out to a KaitaiStream. The calling generated code would be in charge of seeking.
  • The KaitaiInstance class should have a constructor that takes in a KaitaiStream, an offset, and a length. This would lazily store the stream and its offset. Attempting to access the value would cause it to be decoded from the stream. If it is never mutated, the decoding step can be skipped when writing out to stream and a direct copy can be made.
  • There should be an instance constructor that takes the type of the field. This would lazily store the type itself.
  • The KaitaiInstance class should have a static convenience function for loading from memory, just storing it as a memory-backed KaitaiStream. (Static function prevents colliding with constructor if the underlying type happens to be the same as the raw binary type, like str in C++.)
  • During KaitaiStruct deserialization, KaitaiInstance objects corresponding to each instance are populated with the values read during parsing. Because the values are read at a different stage, this is a potential compatibility break, but can probably be minimized by making it happen as late as possible.
  • In some cases it may be preferable for all instances to be fetched into memory prior to writing, for example for writing to a file in-place. A new KaitaiStruct method should be provided that recursively loads all instances and converts them to in-memory instances. For KaitaiStructs without instances, this would be a noop (other than calling it for underlying KaitaiStructs.) They would still be direct copied to the output unless attempting to mutate the underlying value.
  • Compatibility can largely be maintained in the interface by providing a separate variable/getter/setter for directly manipulating the instance class versus the data. The setter method for the ordinary type itself can just silently manipulate the instance into a memory instance.
  • If the KaitaiInstance is unset during write, an exception is thrown.

Some of this is very tentative since I am still mulling over how this might even work preliminarily, but I think it is time to start discussing it ASAP.
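To make the lazy-instance idea above more concrete, here is a hedged Java sketch of such a wrapper (all names are illustrative; only KaitaiStream and its seek() method exist in the current runtime, everything else is hypothetical):

```java
import io.kaitai.struct.KaitaiStream;
import java.util.function.Function;

// Illustrative only: an instance that either still lives in the original stream
// (lazy, decoded on first access) or holds an in-memory value set by the user.
public class LazyInstance<T> {
    private KaitaiStream backingIo;            // non-null while the value is still backed by the stream
    private long offset;
    private Function<KaitaiStream, T> parser;
    private T value;
    private boolean loaded;

    public LazyInstance(KaitaiStream io, long offset, Function<KaitaiStream, T> parser) {
        this.backingIo = io;
        this.offset = offset;
        this.parser = parser;
    }

    public static <T> LazyInstance<T> ofValue(T v) {   // in-memory instance, e.g. for newly created files
        LazyInstance<T> inst = new LazyInstance<>(null, 0, null);
        inst.value = v;
        inst.loaded = true;
        return inst;
    }

    public T get() {
        if (!loaded) {                                 // decode lazily on first access
            backingIo.seek(offset);
            value = parser.apply(backingIo);
            loaded = true;
        }
        return value;
    }

    public void set(T v) {                             // mutation detaches the instance from the stream
        value = v;
        loaded = true;
        backingIo = null;
    }
}
```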

Thirdly, the IO object. Once the IO object no longer needs to be stored for each KaitaiStruct, I think that there needs to be the ability to construct an object that is blank. For compatibility, the existing constructor that reads the structural data from a data source should still exist, but it would mostly just serve to call the now-public read method.

I think this needs to be changed, but it may be possible to keep compatibility in most of the APIs anyways. Here's my proposal.

Today:

class struct_t : public kaitai::kstruct {
public:
    struct_t(kaitai::kstream* p__io, kaitai::kstruct* p__parent = nullptr, struct_t* p__root = nullptr);

private:
    void _read();

...
};

Proposed interface:

class struct_t : public kaitai::kstruct {
public:
    // New: A default constructor that sets 'zero values' for everything.
    struct_t();
    // Same signature as before.
    struct_t(kaitai::kistream* p__io, kaitai::kstruct* p__parent = nullptr, struct_t* p__root = nullptr);

    // Now public.
    void _read(kaitai::kistream* p__io, kaitai::kstruct* p__parent = nullptr, struct_t* p__root = nullptr);
    // Write method.
    void _write(kaitai::kostream *p__io, kaitai::kstruct* p__parent = nullptr, struct_t* p__root = nullptr);
...
};

The kistream would only ever be saved in KaitaiInstances.

Again, this is all preliminary proposals. First thing would be to work on setters. But, I do think it is important to start discussing seriously what the first serialization interface should look like.

@generalmimon (Member) commented Nov 17, 2019

@jchv I'm glad that someone is interested in serialization support! However, your comment suggests that we are at the very beginning and that no progress has been made on this issue. That's not true at all. There are Java serialization branches, where most of the basic support has already been implemented.

Let me summarize the implementation progress of this feature in Java (edits from maintainers are welcome):

  • seq fields
    • integers
      • fixed/switchable/inherited endianess
    • bit-sized integers
    • byte arrays
      • terminator/consume/include
      • check if length does not exceed size
    • strings
      • encoding
      • terminator/consume/include
      • check if length equals size
    • enums (kaitai-io/kaitai_struct_compiler#181)
    • repetitions
      • repeat: expr
        • check if array length matches repeat-expr
      • repeat: eos
      • repeat: until
    • user types
      • without size
      • with size (creating a substream)
    • switching
      • basic support
      • type casting
        • case types
          • to user types
          • to primitive types (e.g. int)
            • default case specified (combined type is primitive)
            • default case missing (combined type is nullable (boxed) - unboxing has to be done before actual casting)
        • to byte[] in default case if size is specified and default case is missing
          • with process
  • parse instances
  • testing
    • basic manual creation and running
    • generation from KST
    • automated running and publishing results
  • reverse expressions, autocorrection and other advanced stuff

If you don't understand any of the points, feel free to ask!

Can I ask you, @jchv, why writing parse instances would require such a complicated solution? I was hoping that reading would stay as it is and writing would be similar to writing seq fields, only with some seeking, but I may be naive.

@jchv commented Nov 17, 2019

As far as I know the only way I have deviated from the path taken so far is I’ve attempted to specify what happens with instances. The rest should be pretty similar to the existing branches. Is there anything particularly egregious there?

There are two important use cases:

  1. Editing an existing file. It would be nice if it is possible to say, edit a single instance and write it in-place back to a file, or edit a struct but leave the instances alone. But, it would also be rather important to be able to then go and save to a new file entirely, and pull all of the lazy loaded bits as needed to write the new file. Finally, in some cases you just want to parse the full file and close it, so you can, for example, fully rewrite the input file. All of these cases become complicated with instances today, and in the C++ code this relies on holding a kaitai stream.

  2. Creating new files from scratch. This use case is completely new, but it’s just as important for me as editing existing structures, if not more. Because of my own personal needs, I’d like to push for this to be supported.

Being able to support any editing would, to me, imply adding setters where there are getters, more flexible instances that can still be lazy loaded but also eagerly loaded if needed, and the ability to create new kaitai structs from scratch.

And small note: I don’t know much about the Java generated code or runtime, so admittedly I am not sure how much this applies to Java. I have focused mostly on C++ as it seems to be one of the more challenging ones to support.

@kruton commented Sep 19, 2020

Is switching on enum types also broken due to kaitai-io/kaitai_struct_compiler#181? I use such a construct, but it ends up generating code where the field is typed as an Object instead of another KaitaiStruct.ReadWrite, and the calls to _read() (and _write(...)) fail because of that.

@generalmimon (Member) commented Sep 19, 2020

@kruton:

Is switching on enum types also broken due to kaitai-io/kaitai_struct_compiler#181?

Actually not due to kaitai-io/kaitai_struct_compiler#181, but #204, which has already been fixed, but only on the regular master branch, not serialization. After releasing 0.9, I plan to merge the master branch into serialization again, so the fix will be included in serialization as well.

@Holzhaus commented Dec 14, 2020

Any news on this?

@dullbananas commented Dec 14, 2020

the age of my comment here makes me feel old

@ShadowDancer commented Jan 15, 2021

I would like to contribute some serialization code for C#, but reading this issue I see there is not much push for doing so.

From my perspective, the inability to write files in place or to serialize compression algorithms are niche problems; what 90% of users need (I think) is to write a few strings and ints into an array of bytes, and then maybe set its length.
Maybe let's start with getting some basic functionality across all runtimes, even without supporting complex stuff like functional expressions, and then move on in an iterative fashion?

@DarkShadow44 commented Feb 3, 2021

FWIW, my C implementation of a .ksy compiler at https://github.com/DarkShadow44/UIRibbon-Reversing is capable of (basic) serialization.

Is the branch https://github.com/kaitai-io/kaitai_struct_compiler/tree/serialization the current target for serialization, or what would I need to build on to try and add this?

@KOLANICH (Author) commented Feb 3, 2021

Oh, an independent impl of KSC in Python! Nice, though I doubt it makes much sense to fragment the ecosystem - it just causes effort to be wasted for nothing. Scala, in which the main KSC is written, is a bit esoteric and I don't want to touch it, so I have considered several times starting my own KSC, but rejected that idea every time. You know, seriously developing a KSC impl means feature parity and compatibility, which requires serious continuous effort ... for what, just to be able to extend KSC in Python? To do that (and in any other language that can interface with the JVM easily) we just need architecture improvements in KS itself, the idea of which was unfortunately rejected. If one doesn't seriously involve oneself in a standalone impl ... the effort is just wasted.

Anyway, synalysis2kaitai has some code that can be useful for you.

BTW, I'd like to refactor synalysis2kaitai to have an intermediate representation and multiple passes (currently it is implemented as a single pass, just because I was too lazy to write the glue code to parse the source into an IR, process it there and then dump it. Don't repeat my mistakes: one cannot do such complex tasks in a single pass), but I have no time for that. Also, I noticed that you dump C source manually. It is your code and you should do whatever is most convenient for you, but I feel a bit uncomfortable seeing source dumped directly instead of using an AST; it is again a strong limitation. A better approach is to use an AST - this way it is much easier to modify the logic of code generation. https://github.com/git-afsantos/bonsai may be helpful, though I haven't tried it personally yet.

@DarkShadow44 commented Feb 3, 2021

Well, I mostly made this implementation because I wanted something usable for C, and quickly. It just seemed easier than learning Scala back then, although my goal would still be to implement all the features back into KSC so I can retire my implementation. Thanks for your suggestions, but I doubt I want to continue this implementation now that it does what I needed.

I'd rather try to add serialization for Java or C# to KS. There are quite a few technical problems with my current implementation, but for a basic implementation it should be usable as a reference.

The main concept of my idea for serialization is a multi-pass approach.
Basically it works like this:

  1. Dry-run to calculate all the offsets of all data, filling special fields of the struct with offsets
  2. Give the user the option to alter data by filling offsets into all the places where they're referenced
  3. Actual write run, writing the data into a continuous blob

Part 2) is needed because some formats (like UIRibbon) reference parts of their contents from multiple places. It's not needed for reading the format, but MS's proprietary reader needs it, so we need to be able to do that.

Each pass works according to the following formula:

  1. Write main seq
  2. Write main instances one by one
    2a) Write instance seq
    2b) Write instance instances
    Recurse until everything is written.

That idea is open for discussion, if you want to help review it. This is the only way I see to properly write files.
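A hedged Java sketch of what such a multi-pass writer interface could look like (purely illustrative; none of these names exist in the Kaitai Struct runtime):

```java
import java.io.IOException;
import java.io.OutputStream;

// Illustrative only: the three passes described above expressed as one interface.
public interface MultiPassWritable {
    /** Pass 1: dry run. Compute and record the offset of every piece of data,
     *  starting at startOffset; return the total size so the caller can lay out what follows. */
    long layout(long startOffset);

    /** Pass 2: copy the recorded offsets into every field that references them. */
    void patchOffsets();

    /** Pass 3: actually emit the bytes into one continuous blob. */
    void write(OutputStream out) throws IOException;
}
```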

@sweetgiorni commented Mar 7, 2021

Let me summarize the implementation progress of this feature in Java (edits from maintainers are welcome):

* `seq` fields
  
  * [x]  integers
    
    * [x]  fixed/switchable/inherited endianess
  * [x]  bit-sized integers
  * [x]  byte arrays
    
    * [x]  terminator/consume/include
    * [x]  check if length does not exceed `size`
  * [x]  strings
    
    * [x]  `encoding`
    * [x]  `terminator`/`consume`/`include`
    * [x]  check if length equals `size`
  * [ ]  enums ([kaitai-io/kaitai_struct_compiler#181](https://github.com/kaitai-io/kaitai_struct_compiler/pull/181))
  * [x]  repetitions
    
    * [x]  `repeat: expr`
      
      * [x]  check if array length matches `repeat-expr`
    * [x]  `repeat: eos`
    * [x]  `repeat: until`
      
      * [ ]  check if last element satisfies the `repeat-until` condition ([kaitai-io/kaitai_struct_compiler#183](https://github.com/kaitai-io/kaitai_struct_compiler/pull/183))
  * user types
    
    * [x]  without `size`
    * [x]  with `size` (creating a substream)
  * switching
    
    * [x]  basic support
    * type casting
      
      * case types
        
        * [x]  to user types
        * to primitive types (e.g. int)
          
          * [x]  default case specified (combined type is primitive)
          * [ ]  default case missing (combined type is nullable (boxed) - unboxing has to be done before actual casting)
      * [ ]  to `byte[]` in default case if `size` is specified and default case is missing
        
        * [ ]  with `process`

* [ ]  parse `instances`

* testing
  
  * [x]  basic manual creation and running
  * [ ]  generation from KST
  * [ ]  automated running and publishing results

* [ ]  reverse expressions, autocorrection and other advanced stuff

@KOLANICH @GreyCat Perhaps you could use this as a template to create separate tickets for tracking serialization implementation in each language? Maybe group them under a milestone? It would help keep track of progress and reduce the number of people asking for updates in this issue :)

@mmajor73 commented Aug 19, 2021

Can someone point me to an example of how to use the new serialization features?
