
External processing subroutines #51

Closed
ghost opened this issue Nov 15, 2016 · 34 comments

@ghost

ghost commented Nov 15, 2016

Sometimes data is encrypted with custom algorithms, which forces us to create two .ksy files: one for the file that contains the encrypted data and one for the decrypted data itself. Then, while working with the compiled modules, we have to create an object of the first module, locate the encrypted data, decrypt it, and instantiate an object of the second module with the decrypted data as input. Only then can we work with the decrypted data.
In my opinion, this process can be simplified by two things:

  1. Ability to specify an external script file or library and procedure to call. For example:
encrypted_data:
  seq:
    - id: enc_size
      type: u4
    - id: dec_size
      type: u4
    - id: data
      size: enc_size
      process: external
      file: script.pl
      sub: decrypt(enc_size, dec_size)
  2. In addition to the previous item, the ability to specify the type of the processed data (to eliminate the need for another .ksy file):
    - id: data
      size: enc_size
      process: external
      file: script.pl
      sub: decrypt(enc_size, dec_size)
      type: decrypted_data

  decrypted_data:
    ...

If we want to use an external library or work with non-scripting languages, the library can be specified as file: library.dll. Then we should be able to declare a function prototype: sub: void* __stdcall decrypt(void* data, int enc_size, int dec_size).

Of course, generating the code that calls external library functions for each supported language is the main problem with my proposal, and deallocation of the processed data is another one. What do you think about it all?

@GreyCat
Member

GreyCat commented Nov 15, 2016

First of all, technically, you don't need 2 distinct .ksy files in any case. You can always have:

meta:
  id: main_struct
seq:
  # ...
  - id: encrypted_data
    size: some_size
    # no type here!
  # ...
types:
  encrypted_data_struct:
    seq:
      - id: foo
        type: u4
      - id: bar
        type: u4
      # some internal structure inside encrypted container described here

And then you just glue it all together inside your program:

# start parsing main structure
main = MainStruct.from_file("/path/to/some/file.bin")

# do some fancy decryption on the byte array we've got
decrypted_data = fancy_decrypt(main.encrypted_data)

# parse internal structure from decrypted byte array
internal_struct = MainStruct::EncryptedDataStruct.new(Kaitai::Struct::Stream.new(decrypted_data))

# do stuff with internal struct
puts internal_struct.foo, internal_struct.bar

As for your main proposal, sure. As you've already noted, the main problem is accessing all this external stuff from all possible languages. For example, JavaScript in a browser usually has no means to execute external libraries / binaries. Even if we're talking about more traditional languages that run on a more-or-less standard OS with a notion of a filesystem, processes, etc., there are still tons of available options. For example:

  • Running an external process (more or less straightforward: you pass the encrypted stuff to stdin, and everything you get from stdout is your decrypted stuff; see the sketch after this list)
  • Running a function / class from a library in some interpreted language (like Perl, Ruby, Python, JavaScript) — this involves running an interpreter (which, in turn, might involve locating that interpreter and passing some CLI options), specifying some bootstrap code to load these libraries, generating the function call, and getting the data back. Pretty nontrivial stuff.
  • Running a compiled machine code function from some sort of library (.so / .dll / etc). It's very machine-, ABI- and OS-dependent, and might involve some heavy magic stuff when invoked from scripting languages.
  • Running some Windows COM / OLE code. Obviously, Windows or Wine only.
  • Running some class or method in a JVM.
  • Running some class or method in a .NET CLR.
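
To make the first option concrete, here's a minimal C# sketch (the decryptor executable name and its stdin/stdout contract are hypothetical; for large payloads you'd want to read and write concurrently to avoid pipe deadlocks):

using System.Diagnostics;
using System.IO;

static byte[] DecryptViaExternalProcess(byte[] encrypted)
{
    var psi = new ProcessStartInfo("decryptor")  // hypothetical external tool
    {
        RedirectStandardInput = true,
        RedirectStandardOutput = true,
        UseShellExecute = false
    };
    using (var proc = Process.Start(psi))
    using (var output = new MemoryStream())
    {
        // feed the encrypted bytes to the tool's stdin...
        proc.StandardInput.BaseStream.Write(encrypted, 0, encrypted.Length);
        proc.StandardInput.Close();
        // ...and collect whatever it writes to stdout as the decrypted result
        proc.StandardOutput.BaseStream.CopyTo(output);
        proc.WaitForExit();
        return output.ToArray();
    }
}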

@LogicAndTrick
Collaborator

How about some sort of plugin system? There could be a few "standard" processing plugins that are supported universally (XOR, rotate, zlib) and also the ability to load non-standard ones in the consumer. The non-standard plugins would be maintained by a third party and written in as many languages as they would like.

Example:

encrypted_data:
  seq:
    - id: enc_size
      type: u4
    - id: dec_size
      type: u4
    - id: data
      size: enc_size
      process: custom_thing(enc_size, dec_size)  # <- custom process call

Generated code:

// ctor
{
    EncSize = stream.ReadU4();
    DecSize = stream.ReadU4();
    var _temp_data = stream.ReadBytes(EncSize);
    Data = stream.ExecuteProcess("custom_thing", _temp_data, new object[] { EncSize, DecSize });
}

Client code:

// Register the custom process (this could be an interface, lambda, function pointer, etc. depending on language)
// The list of registered processors would be stored statically and shared between all KS instances
KaitaiStruct.RegisterProcess("custom_thing", (data, args) => {
    var enc_size = (int) args[0];
    var dec_size = (int) args[1];
    return FancyDecrypt(data, enc_size, dec_size);
});

var data = EncryptedData.FromFile("/path/to/some/file.bin");
Console.Write(data.Data); // Writes output from FancyDecrypt

This could also potentially let the zlib processor be an external dependency on platforms that don't have native support (e.g. JavaScript). If the user doesn't want zlib, then they don't need to load the zlib dependency.

@GreyCat
Member

GreyCat commented Nov 16, 2016

That's definitely a possibility, although we need to consider the exact interface carefully. There are a few issues here:

  • Many algorithms require certain initialization / finalization phases, and have some state (I've kind of tackled that issue in Processing with real cryptographic ciphers #45). Thus, a simple context-free function / lambda probably won't be enough.
  • The runtime choose-function-by-string-name step might be slow, and this particular step might be somewhat performance-sensitive (think calling such a process a million times per second). Also, registering a processor into every stream manually, if millions of substreams were created, would be a nuisance (and adds more precious bytes of storage for every stream). Shall we provide some sort of app-wide "registry" singleton object for such stuff?
  • Having such stuff in a ksy would obviously make it impossible to run it without a wrapper, i.e. from any visualizer / IDE. Any ideas on what could be done here?

@ghost
Author

ghost commented Nov 16, 2016

If plugins are loadable (not executed once as processes), it may be possible to specify several procedures in the .ksy which would be compiled into wrappers for the plugin procedures and be accessible from user code. For example:

meta:
  - plugin: decryptor
    path: "/path/to/decryptor.so" # .dll / .class / ...
    sub: init
    sub: fancy_decrypt
    sub: deinit

seq:
  - id: data
    plugin: decryptor
    sub: fancy_decrypt

It may be compiled like this (in pseudocode):

...
handle = null

# Will be called once when importing this module from user code
ctor() {
    handle = LoadLibrary("/path/to/decryptor.so") # Or other language / system mechanism to load the plugin
}

init() {
    init_proc = GetProcAddress(handle, "init")
    init_proc()
}

fancy_decrypt(data) {
    fancy_decrypt_proc = GetProcAddress(handle, "fancy_decrypt")
    fancy_decrypt_proc(data)
}

deinit() {
    deinit_proc = GetProcAddress(handle, "deinit")
    deinit_proc()
}
...

In our program, we must call init() before the data is processed by the runtime. That's an issue, because it's only possible for instances which are parsed on request, as opposed to the main seq block and its sub-blocks.

As you can see, the state of an algorithm (or of a plugin in the generic case) is hidden in the plugin itself; we only call its state-controlling procedures like init and deinit, while fancy_decrypt is called automatically by the runtime.

P.S. @GreyCat, thanks for the explanation about the two .ksy files; I'll use that technique next time.

@GreyCat
Member

GreyCat commented Nov 16, 2016

Note that

sub: a
sub: b
sub: c

is not valid YAML. Probably you've meant something like:

sub:
  - a
  - b
  - c

Even so, I don't really understand how it's supposed to work with the state, even if we're considering your WinAPI example. State has to be stored somewhere. If we're talking about a regular C-like language, it's either the stack (which gets lost on procedure return) or the heap (where you'll have to store a pointer to it somewhere, or it will just be a memory leak). How do you pass this state (or a pointer to the state) from init to fancy_decrypt to deinit?

Also, a finalizer is not always supposed to just be a destructor. In quite a few encryption algorithms, it should close the buffer, dumping its final contents into the stream.

@ghost
Author

ghost commented Nov 16, 2016

How do you pass this state (or pointer to the state) from init to fancy_decrypt, to deinit?

As I mentioned, the state is hidden in the plugin (in static / allocated memory, to be more precise). When we call the plugin's init, it initializes the state. When we call fancy_decrypt, the plugin obtains its state from its memory and decrypts the data. Finally, when deinit is called, the state is freed. As users of the plugin, we don't need to think about the state of the plugin and the algorithms it uses.
Or did I misunderstand your question?

@GreyCat
Member

GreyCat commented Nov 16, 2016

Having state in static (i.e. global) memory is probably a very bad idea. You'll end up with all kinds of threading errors, which would be hard to debug. And, besides, you can't really wrap it into anything; there's no way around that.

Aside from static memory (i.e. in the DLL's address space), you can allocate something on the heap, but then again, you'll somehow need to store that pointer and pass it between init → fancy_decrypt → deinit.
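
For what it's worth, the usual way around this in object-oriented runtimes is to make the state an object: the constructor plays the role of init, a decode call plays the role of fancy_decrypt, and disposal plays the role of deinit, so nothing has to live in static memory. A rough C# sketch (the class name and the XOR "decryption" are just placeholders):

using System;

public sealed class FancyDecryptor : IDisposable
{
    private readonly byte[] _key;   // whatever state init() would have set up
    private bool _disposed;

    public FancyDecryptor(byte[] key)           // "init"
    {
        _key = (byte[]) key.Clone();
    }

    public byte[] Decode(byte[] data)           // "fancy_decrypt"
    {
        if (_disposed) throw new ObjectDisposedException(nameof(FancyDecryptor));
        var result = new byte[data.Length];
        for (int i = 0; i < data.Length; i++)
            result[i] = (byte) (data[i] ^ _key[i % _key.Length]);  // placeholder transform
        return result;
    }

    public void Dispose()                       // "deinit"
    {
        _disposed = true;  // real code would zero keys / release native handles here
    }
}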

@GreyCat
Member

GreyCat commented Jul 5, 2017

I plan to add very simple custom processing loophole:

  - id: buf1
    size: 5
    process: my_custom_fx(7, true, [0x20, 0x30, 0x40])

Anything except the standard library processing routines (i.e. something like my_custom_fx) would result in the generation of a custom processing call. The API may vary slightly, but the general idea is to generate something like:

    @_raw_buf1 = @_io.read_bytes(5)
    _process = MyCustomFx.new(7, true, [32, 48, 64].pack('C*'))
    @buf1 = _process.decode(@_raw_buf1)

i.e. one is expected to implement a class called MyCustomFx, which has:

  • a constructor which accepts 3 parameters (which can be arbitrary KS expression language expressions — i.e. constants, calculated values, etc.),
  • a decode method, which accepts a byte array and is expected to return a byte array (see the sketch after this list).
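
To make the expected contract concrete, a user-side class in, say, C# could look roughly like this (the shape mirrors the Ruby snippet above; the name and the transform are placeholders, not a fixed API):

public class MyCustomFx
{
    private readonly int _shift;
    private readonly bool _flag;
    private readonly byte[] _key;

    // the constructor receives the arguments written in the ksy process: expression
    public MyCustomFx(int shift, bool flag, byte[] key)
    {
        _shift = shift;
        _flag = flag;
        _key = key;
    }

    // decode receives the raw bytes read from the stream and returns the processed bytes
    public byte[] Decode(byte[] src)
    {
        var dst = new byte[src.Length];
        for (int i = 0; i < src.Length; i++)
            dst[i] = (byte) (src[i] ^ _key[i % _key.Length]);  // placeholder transform
        return dst;
    }
}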

@GreyCat
Member

GreyCat commented Jul 5, 2017

The devil is, of course, in the details. Addressing an arbitrary class in a target language is not trivial. Every language has a slightly different idea about namespacing, and especially about importing/requiring/including the classes being used.

Also, the trick is to make all these invocations somehow compatible, so one can use this approach to implement, for example, a crypto library by doing something like:

- id: buf
  size: 1024
  process: io.kaitai.crypto.aes_256_cbc(key, iv)

The simple and straightforward approach is to use the same engine that opaque types use. However, it does not help with the fact that a good "extra processing" library would probably want to follow the rules of the target language — and that would mean something like:

  • io.kaitai.crypto.AlgorithmName in Java,
  • Kaitai.Crypto.AlgorithmName in C#,
  • Kaitai::Crypto::AlgorithmName in Ruby,
  • kaitaicrypto.AlgorithmName in Python

etc. Any ideas on a good approach here?

@GreyCat
Member

GreyCat commented Jul 7, 2017

I was thinking hard about this one, but so far the only thing I've come up with is that we can reserve the kaitai prefix identifier for our standard libraries, which are supposed to be available on all systems and languages.

This would probably mean that the crypto algorithms library would be kaitai.crypto, and in target languages that would be:

  • In C++: kaitai::crypto::algorithm_name_t
  • In C#: Kaitai.Crypto.AlgorithmName
  • In Java: io.kaitai.crypto.AlgorithmName
  • In JavaScript: ???
  • In Perl: IO::KaitaiStruct::AlgorithmName
  • In PHP: Kaitai\Crypto\AlgorithmName
  • In Python: kaitaicrypto.AlgorithmName
  • In Ruby: Kaitai::Crypto::AlgorithmName

@koczkatamas
Member

Disclaimer: sorry, this comment got out of hand and became a bit chaotic, but I have no better idea of how to organize it, so I'm just posting it as-is.

When I created a parsing / serialization library the following architecture proved to be the best for me:

Using interfaces for supplying context / required process methods

Every class got a context object for the parsing / serialization and other operations. The context object contained the input / output stream AND an interface of the required external processing functions.

(In the original solution this interface was a template parameter of the IContext interface but for the sake of simplicity, in the proposal below I supply this interface as a separate object.)

The interface part was really important as I had to use this library across different systems, so I could not use even the most basic .NET built-in methods, because this library was compiled as a .NET Portable Library which cannot depend on a lot of Desktop .NET functionality. I tried to use the most basic types here (eg. byte[], int, bool, etc).

I also had multiple implementations of this interface: one used BouncyCastle for crypto (a fully managed crypto library written in C#, so it could be used on more systems) and the other one used the .NET built-in functionality (which was faster in a desktop environment).

An example of this interface / class structure:

interface IPayloadProcess
{
  byte[] Md5(byte[] input);
}

interface IEncryptedDataProcess: IPayloadProcess
{
  byte[] Aes256CbcPkcs7Decrypt(byte[] input, byte[] key);
}

public class Payload
{
  public IContext Context { get; set; }
  public IPayloadProcess Process { get; set; }

  public UInt32 PayloadLength { get; set; }
  public byte[] Payload { get; set; }
  public byte[] Checksum { get; set; }

  private bool? _isChecksumValid;
  public bool IsChecksumValid => _isChecksumValid.HasValue ? _isChecksumValid.Value : 
      (_isChecksumValid = Process.Md5(Payload).SequenceEqual(Checksum)).Value;
  
  public static Payload Parse(IContext context, IPayloadProcess process)
  {
     var result = new Payload();
     result.Context = context;
     result.Process = process;
     result.PayloadLength = context.reader.ReadUInt32();
     result.Payload = context.reader.Read(result.PayloadLength);
     result.Checksum = context.reader.Read(16);
     return result;
  }
}

class EncryptedData
{
  public IContext Context { get; set; }
  public IEncryptedDataProcess Process { get; set; }

  public UInt32 EncDataLen { get; set; }
  public byte[] EncData { get; set; }
  public byte[] PayloadBytes { get; set; }
  public Payload Payload { get; set; }

  public static EncryptedData Parse(IContext context, IEncryptedDataProcess process)
  {
     var result = new EncryptedData();
     result.Context = context;
     result.Process = process;
     result.EncDataLen = context.reader.ReadUInt32();
     result.EncData = context.reader.Read(result.EncDataLen);
     return result;
  }

  public void Decrypt(byte[] key)
  {
     this.PayloadBytes = Process.Aes256CbcPkcs7Decrypt(this.EncData, key);
     this.Payload = Payload.Parse(Context.SubReader(this.PayloadBytes), Process);
  }
}

And I had these two implementations:

class EveryRequiredProcessDotNet: IEncryptedDataProcess, ISomeOtherClassProcessNotDescribedAbove, ...
{
  byte[] Md5(byte[] input)
  {
    return System.Security.Cryptography.MD5.Create().ComputeHash(input);
  }

  byte[] Aes256CbcPkcs7Decrypt(byte[] input, byte[] key)
  {
    var aes = new System.Security.Cryptography.AesCryptoServiceProvider();
    ...
    return output;
  }
}
class EveryRequiredProcessBouncyCastle: IEncryptedDataProcess, ISomeOtherClassProcessNotDescribedAbove, ...
{
  byte[] Md5(byte[] input)
  {
    return ... new Org.BouncyCastle.Crypto.Digests.MD5Digest() ...;
  }

  byte[] Aes256CbcPkcs7Decrypt(byte[] input, byte[] key)
  {
    return ... new Org.BouncyCastle.Crypto.AesEngine() ...;
  }
}

Other uses of the Context

The context could also be used for logging purposes; for example, if we introduce the guard functionality, the context could decide whether to stop the processing or just, e.g., show a warning if a guard condition is not met.

Dependency Injection (slower performance)

I was thinking about using Dependency Injection, so I could fetch the required method from the Context instead of passing the I*Process interfaces to every instance, and just use:

this.PayloadBytes = Context.GetProcess<IEncryptedDataProcess>().Aes256CbcPkcs7Decrypt(this.EncData, key);

Actually, I dropped this idea: although it could make my classes less dependent, it could cause runtime errors if an interface was not available.

Dynamic context (even slower, hard to implement)

In some places we did not even supply the Context object; instead we fetched it via a static class, which usually used thread-local storage and maintained a different Context per thread, e.g.:

var context = Context.GetCurrent(this);
this.PayloadBytes = context.GetProcess<IEncryptedDataProcess>().Aes256CbcPkcs7Decrypt(this.EncData, key);

This solution could give even more flexibility at the expense of performance and, sometimes, implementation issues.

Problems

I'd like to mention a few problems which may be out of scope for now, as they could complicate things a lot; we cannot solve them all, and currently a less capable solution is better than no solution.

Problem 1: not every input is available

If we want to parse the whole file in one run, then sometimes we cannot supply every required input. Let's consider the key parameter: in our file format we store the following fields one after another: KeyVersion, EncryptedData. So basically we need a custom callback which can fetch the key, but it already needs access to the partially processed file (e.g. the KeyVersion field; in our case it's more complicated, as it needs to access more fields).

So this callback can also be a custom-defined process method, e.g. IEncryptedDataProcess.GetKey; this way the parsing can happen in one run (without calling the Decrypt method):

  public static EncryptedData Parse(IContext context, IEncryptedDataProcess process)
  {
     var result = new EncryptedData();
     result.Context = context;
     result.Process = process;
     result.KeyVersion = context.reader.ReadUInt32();
     result.EncDataLen = context.reader.ReadUInt32();
     result.EncData = context.reader.Read(result.EncDataLen);
     result.Key = process.GetKey(result);
     result.PayloadBytes = process.Aes256CbcPkcs7Decrypt(result.EncData, result.Key);
     result.Payload = Payload.Parse(context.SubReader(result.PayloadBytes), process);
     return result;
  }

Problem 2: parsing streaming data (out-of-scope?)

Of course, sometimes we processed very big files, so we could not store the whole file in memory; thus we had to use streaming decryption, hashing primitives, etc.

We have chunks in our file format, which are like array items, and every chunk has a data part, much like the IDAT chunks in the PNG format.

This meant that we had to initialize the crypto and compression primitives before parsing the file, supply them with data chunk by chunk, and write the result into a file on the fly.
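
As a rough illustration of that chunk-by-chunk style (method and parameter names are made up), the streaming primitive keeps its own state between calls, so each chunk's data part can be fed in as it is parsed and the result written out immediately:

using System;
using System.IO;
using System.Security.Cryptography;

static byte[] HashChunks(Stream input, Stream output, int[] chunkSizes)
{
    using (var md5 = IncrementalHash.CreateHash(HashAlgorithmName.MD5))
    {
        var buffer = new byte[64 * 1024];
        foreach (int chunkSize in chunkSizes)       // one iteration per chunk's data part
        {
            int remaining = chunkSize;
            while (remaining > 0)
            {
                int read = input.Read(buffer, 0, Math.Min(buffer.Length, remaining));
                if (read <= 0) throw new EndOfStreamException();
                md5.AppendData(buffer, 0, read);    // incremental: state survives between calls
                output.Write(buffer, 0, read);      // write the result on the fly
                remaining -= read;
            }
        }
        return md5.GetHashAndReset();               // finalize once, after the last chunk
    }
}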

@GreyCat
Member

GreyCat commented Jul 8, 2017

OK, that's a good lot of ideas. Let me try to break them down into more independent chunks.

Context

"Context" pattern is more or less the fancy word for "we need to pass along N different things, and we don't want to pass each of them individually every time, so we'll bundle them into a class and pass that class along instead". In your example, "context" is a custom object that bundles stream (reader) and some arbitrary decryption functions.

From a classic OOP point of view, this is generally frowned upon, as it creates a container just for the sake of creating a container, one which has no "real world" counterpart. Given that in your example we create a lot of completely custom classes which can't be reused as-is for most other purposes, we might just as well pass the Reader and the processing object along separately.

The need for different implementations of the same processing algorithm

This is indeed a problem that we need to think about. Some languages/platforms have alternatives to choose from when doing routine tasks like encryption, compression, etc. Let's analyze the problem first. These implementation libraries might differ in:

  • Performance (i.e. C wrapper library might be faster than a native one)
  • Portability (i.e. some libraries might not be too portable, but offer other benefits)
  • Maintainability (i.e. C wrapper library might be harder to set up than a native one)
  • Compatibility with existing environment (i.e. we know that this library sucks, but we're already using it in lots of other places in our project, so we'd prefer to use it anyway)

So, it boils down to the fact that we need to provide support for alternative implementations of the same thing, balancing between these two ideas:

  • It should work without a user changing anything to select a particular implementation → we need to have some sort of "default" implementation
  • Provide means for user to choose particular implementation if one wants to do that — ideally, without touching anything in the .ksy file

The basic idea that comes to mind then is another CLI switch that would control choice of implementation.

@koczkatamas
Member

My previous comment was unnecessarily long, but what I wanted to express is the following.

Decouple external libraries

Real-world scenarios can be really complicated. If we publish a library generated by Kaitai, I would be more comfortable if it were not strongly coupled to a library chosen by us (even if it is a built-in system library).

I would use an interface, either passed down to the class or set on a global static object (instead of strong coupling). This would cause only a really minor performance decrease (one additional pointer dereference in most cases), but it would make it possible to replace the implementation, so we could serve the broadest user base with the same code (they wouldn't have to compile their own code, just use the published one from their language's package manager).

Initialization, deinitialization and method calls

This way, the user of the library sets up the interface mentioned above, so they can initialize it before starting the parsing and deinitialize it after use, as it would be the user's responsibility.

We would not have to worry about calling conventions and the like, as the interface would wrap the call from the language-native calling form to the one used by the library.

Default implementation

We could provide a default implementation for most commonly used process dependencies (eg. zlib, crc32, etc). This could be a separate package in the repo, and the users could install it if they want, so they won't have to implement it themselves.

Lazy input parameter resolving

We could use this interface to make it possible to get input parameters lazily, like the key mentioned in the example above: we simply have to generate a GetKey method on this interface (and the .ksy could use process: get_key()).

Summary

So, to summarize, the pros of a replaceable external (interface-based?) dependency solution are:

  • easier to implement for us (don't have to worry about calling conventions and stuff)
  • the user can easily replace the external dependencies, which is IMHO the part that poses the most risk of making Kaitai integration with their systems harder
  • we can deploy Kaitai packages without external dependencies to package stores (so they will be sufficient for a broader user base)

Cons:

  • negligible (IMHO) performance degradation

@GreyCat GreyCat added this to the v0.8 milestone Jul 15, 2017
@GreyCat
Member

GreyCat commented Jul 23, 2017

From what I've read and understood so far, the main proposal is to implement external processing routines as interfaces, not hard-coded classes.

There are obviously some pros and cons for that:

  • The main advantage is, obviously, flexibility. Indeed, one can replace processing components without using ksc to recompile an alternative version of the sources.
  • The main disadvantage is, in my opinion, added complexity. If we're going the "everything is an interface" route, then doing stuff like
- id: buf
  size: 1024
  process: aes(128, some_key, some_iv, mode::cbc)

is not enough to get a working AES decryption. This would compile into some sort of factory create method, like:

CustomDecoder _process = KaitaiStream.getDecoder("aes");
buf = _process.decode(...);

and to get it working, one would need some preconfiguration to be done in the app, like:

KaitaiStream.registerDecoder("aes", AESDecoder.class);

Generally, this is close to this proposal by @LogicAndTrick.

Right now I feel that this adds a good deal of complexity which would be a huge hurdle for a novice user to get over. It will make the job of any visualizer / IDE much harder, as it won't be enough to just load some plugin classes; you would need to actually choose between them and call that register method on the chosen one.

The "one can distribute compiled stuff without any dependencies" argument is kind of self-deception. True, user can download our package and it won't need any hard extra dependencies. However, it won't work out of the box either. One would need to carefully study the documentation and write that registerDecoder-style line to get it to work.

It boils down to a "configuration at compile time" vs "configuration at run time" choice, but configuration itself is inevitable. We can actually do both, but I strongly feel that we should start with something that would solve the majority of cases in the easiest way. After all, if we have several crypto libraries, there's nothing wrong with supplying one of them as the "default" one. Advanced users can always recompile the ksy with a different CLI option and/or a .ksy fix, and get themselves a wrapper for a different crypto library that suits them better.

@koczkatamas
Member

Actually we just need to publish two KaitaiStream packages, one with dependencies (eg. KaitaiStream) and one without them (eg. KaitaiStream.Core). The KaitaiStream would use the Core one but would automatically register every zlib, aes package by default, so no registerDecoder would be required by novice users.

This way we could publish our packages for example in the .NET ecosystem as .NET Standard Class Libraries (eg. Kaitai.Jpeg, KaitaiStream.Core) which are compatible with .NET Framework, Mono and .NET Core. And we could publish the KaitaiStream as eg. .NET Framework Class Library (so it could use the .NET built-in aes, zlib, etc methods).

Additional ideas

Also in C# I would use statically-typed decoders, eg:

IFancyDecoder _process = KaitaiStream.getDecoder<IFancyDecoder>();
buf = _process.decode(_temp_data, EncSize, DecSize);

Or use it directly of course:

buf = KaitaiStream.getDecoder<IFancyDecoder>().decode(_temp_data, EncSize, DecSize);

I would also make getDecoder a non-static method, and would use io.SubStream(...) instead of new KaitaiStream, which would copy the "decoder store" reference to the new instance. This way it would be possible to replace the external subroutine implementation on a per-KaitaiStream basis, which could be a requirement in some cases (a rough sketch follows).
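
A sketch of that last idea (all names here are hypothetical, not the actual runtime API): the stream owns a dictionary of decoder implementations, and SubStream hands the same dictionary to the child, so implementations can be swapped per top-level stream without touching any global state.

using System.Collections.Generic;
using System.IO;

public class KaitaiStreamWithDecoders
{
    private readonly Stream _io;
    private readonly Dictionary<string, object> _decoders;

    public KaitaiStreamWithDecoders(Stream io)
        : this(io, new Dictionary<string, object>()) { }

    private KaitaiStreamWithDecoders(Stream io, Dictionary<string, object> decoders)
    {
        _io = io;
        _decoders = decoders;
    }

    public void RegisterDecoder<T>(T decoder)
    {
        _decoders[typeof(T).FullName] = decoder;
    }

    public T GetDecoder<T>()
    {
        return (T) _decoders[typeof(T).FullName];
    }

    // a substream copies the *reference* to the decoder store, not the store itself,
    // so everything parsed from one top-level stream shares one set of implementations
    public KaitaiStreamWithDecoders SubStream(byte[] bytes)
    {
        return new KaitaiStreamWithDecoders(new MemoryStream(bytes), _decoders);
    }
}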

@LogicAndTrick
Collaborator

The dynamic resolve-at-runtime way is still my preferred solution; the runtimes could ship with some of the common decoders bundled in (zlib, etc.) to ease some of the adoption difficulties. I get that there is a performance hit with dynamic stuff, plus the additional complexity of a plugin system, but I still prefer this method.

I understand the hesitation, so definitely consider other methods as well. But this one still gets my vote :)

I don't mind the C# suggestions from @koczkatamas, except for the statically typed decoders. I wouldn't want to tie the implementation name to the decoder name like that. I'd want to stay with a string-based lookup method:

// KSY file:
process: custom_decoder(maybe, some, parameters)

// KS runtime
interface IDecoder {
    byte[] Decode(byte[] input, params object[] args);
}

// User / plugin code
class CustomDecoderImplementation : IDecoder { ... }
contextOrWhatever.Register("custom_decoder", new CustomDecoderImplementation());

// Generated code
IDecoder decoder = _io.GetDecoder("custom_decoder");
var result = decoder.Decode(...);

The idea of a plugin context is pretty familiar to me and I've used many libraries that make use of them. Usually a default global context is provided for people who don't want/need the additional flexibility:

// Runtime
public class Context {
    public static Context Default { get; }
    static Context() {
        Default = new Context();
        Default.RegisterDefaults();
    }
    public void RegisterDefaults() {
        Register("zlib", ...);
        // etc
    }
    public void Register(string name, IDecoder decoder) { ... }
}

// Generated code
public class GeneratedClass {
    public static GeneratedClass FromFile(string file, Context context = null)
    {
        if (context == null) context = Context.Default;
        // ...
    }
}

// User can register global plugins
Context.Default.Register("thing", new MyThing());

// Use default context
var gc1 = GeneratedClass.FromFile("...");

// Use custom context
var customContext = new Context();
customContext.Register("thing", new MyOtherThing());
var gc2 = GeneratedClass.FromFile("...", context: customContext);

@koczkatamas
Member

Another issue came to my mind: if a program wants to use two generated-from-ksy libraries compiled by third parties, and both of them try to register a process with the same name but different arguments, then it's game over (the user has to recompile one of the libraries). So we either take the risk, or prefix the process name with the root class name for third-party subroutines, or something similar (e.g. in C# a statically typed interface like the one in my example solves this problem).

@LogicAndTrick actually it ties in the interface name, not the implementation. The main advantage of an interface-based solution is that the arguments are type-checked by the C# compiler instead of being passed as an object array. (So the error happens at compile time rather than at runtime, which is one of the reasons I like statically typed languages.)

@LogicAndTrick
Collaborator

Hmm, that would work - but what about namespaces? I wouldn't want to pollute a single namespace with a bunch of interfaces, and adding extra namespace metadata to the ksy file would add a lot of complexity that only a few languages can use.

@GreyCat
Member

GreyCat commented Jul 24, 2017

Actually we just need to publish two KaitaiStream packages, one with dependencies (eg. KaitaiStream) and one without them (eg. KaitaiStream.Core). The KaitaiStream would use the Core one but would automatically register every zlib, aes package by default, so no registerDecoder would be required by novice users.

Unfortunately, it's not that granular. You can't just go with "bare bones" + "all-inclusive" versions. For example, the vast majority of users won't require any crypto algorithms, and those might be a heavy additional dependency. There are hundreds of exotic compression algorithms, mangling schemes, etc. It might look like a solution at first, but you'll quickly hit an unreasonable number of dependencies for that "all-inclusive" variant to be of any real use. Nobody would likely be interested in a Java or C# Ethernet packet decoder (which is like ~10 lines of code) that comes with several thousand dependencies fetching several gigabytes of libraries from the net.

Even for debugging purposes (like in the Web IDE), I doubt that you'd want to preload several thousand libraries.

It gets even worse for languages that don't have a simple way to manage libraries (like C++). Building such a mega-library-that-depends-on-everything is probably tolerable for projects of the scale of Chromium, but I doubt it's the way to go for us. In the C++ world, "depends on everything" quickly becomes "bundles everything".

@koczkatamas
Member

Okay, I can agree that we should only bundle / depend on what is actually used, so if we expect a lot of external subroutines, then an all-inclusive KaitaiStream is a no-go.

So if I understand correctly, if a process requires an external dependency, then we don't want to put it into KaitaiStream; rather, we'd like to make it a dependency of the generated class's package. This means we should remove zlib from KaitaiStream and only make zlib a dependency if a format requires it, eg. the PNG one.

That means that if we want to make "no dependency" packages, we could publish another package, e.g. PNG.Core, which does not depend on anything but expects its dependencies at runtime.

The question is whether we want to go even further and publish packages like Kaitai.Crypto or Kaitai.Compression, which would automatically add similar subroutines to the runtime (the crypto one would add AES, DES, etc., while the compression one would add zlib, deflate, gzip, etc.).

If we go down this road, then the question is whether the "Standard" package should directly depend on the used libraries (e.g. on OpenSSL) and call their methods statically (which requires compiler magic for every language and can be hard to extend), OR it should depend on the Kaitai.Crypto package and a runtime-based plugin system which would be used by the Core packages too, but which would additionally initialise the dependencies automatically (i.e. call registerDecoder automatically).

Note: we should find another name than registerDecoder, as if we introduce serialisation I presume the same registration would also register an encoder interface too (what we currently call processes will basically be two-way conversion routines after we introduce serialisation, like decrypt + encrypt for AES, compress / decompress for zlib, etc.).

@GreyCat
Member

GreyCat commented Jul 25, 2017

I suspect that we are overengineering way too much here. There is no "no dependencies" approach per se. What you call "no dependencies" is actually an approach with "soft" dependencies, i.e. dependencies that are not enforced (and thus not checked or helped to be established) by the language. I'd prefer that we go from "easy" to "advanced" solutions, and only move to this "advanced" stuff like soft dependencies if and when the need arises.

Right now I actually don't see a lot of real-life use cases that would benefit from such a flexible scheme. I know one true use case where one would want to choose between alternative libraries — for example, choosing between the standard javax.crypto implementation and something like Bouncy Castle Crypto. However, I believe it's unlikely that a given project would ever need to use both at once and switch between them at runtime. Given that javax.crypto is clearly a standard and widespread solution, I'd just go with compiling our hypothetical distributed packages with hard dependencies on io.kaitai.crypto.javax, which, in turn, would have a hard dependency on javax.crypto.

This means we should remove zlib from KaitaiStream and only make zlib a dependency if a format requires it, eg. the PNG one.

Yeah, we might do that to make the API more straightforward, but, technically, we can always say that zlib is a legacy exception (and, besides, it really is built in almost everywhere).

The question is whether we want to go even further and publish packages like Kaitai.Crypto or Kaitai.Compression, which would automatically add similar subroutines to the runtime (the crypto one would add AES, DES, etc., while the compression one would add zlib, deflate, gzip, etc.).

To be frank, I don't quite understand you. My idea was (and still is) to do it like this:

  • hadoop_file.ksy uses process: kaitai.compression.snappy(args)
  • kaitai.compression.snappy maps to a io.kaitai.compression.Snappy class in Java
  • There are (in theory) several packages which provide io.kaitai.compression.Snappy class implementation, for example:
    • groupId = io.kaitai, artifactId = kaitai-compression-snappy-xerial (JNI port which tunnels calls into C++ library)
      • in turn, it depends on groupId = org.xerial.snappy, artifactId = snappy-java
    • groupId = io.kaitai, artifactId = kaitai-compression-snappy-dain (pure Java port)
      • in turn, it depends on groupId = org.iq80.snappy, artifactId = snappy

No literal "addition of subroutines to runtime" happens. Just extra dependent packages. Resolving these dependencies takes literally 1-2 clicks in any modern IDE, which would search central maven repo for it and propose the solutions — one just needs to choose one.

If we go down this road, then the question is whether the "Standard" package should directly depend on the used libraries (e.g. on OpenSSL) and call their methods statically (which requires compiler magic for every language and can be hard to extend), OR it should depend on the Kaitai.Crypto package and a runtime-based plugin system which would be used by the Core packages too, but which would additionally initialise the dependencies automatically (i.e. call registerDecoder automatically).

I don't understand what a "Standard" package and "Kaitai.Crypto" package here stand for.

Note: we should find another name than registerDecoder, as if we introduce serialisation I presume the same registration would also register an encoder interface too

Yeah, but it shouldn't be a general rule. I believe that registerDecoder, registerEncoder and registerDecoderEncoder (or something like registerBoth) would be OK. Sometimes you would have totally different and independent classes / packages (for example, due to licensing / patent reasons).

@koczkatamas
Member

koczkatamas commented Jul 26, 2017

Okay, overall I think we can go the way you proposed; if the need arises, we can change it later.

I don't really know the Java dependency system; I used NuGet as a reference, and maybe the two differ.

I don't understand what a "Standard" package and "Kaitai.Crypto" package here stand for.

Yeah, I never explained what I meant by the "Standard" package. What I meant is that it is the ready-to-use package which depends on a library chosen by us. For example, the "Standard" package called PNG depends on the .NET built-in zlib library.

The opposite of the Standard package is the Core package (e.g. PNG.Core), which does not have any NuGet dependencies. But if you want to use it, you have to provide some zlib implementation. Let's assume that in your use case you cannot use the .NET implementation, but you can use the BouncyCastle one. Then you can install the Kaitai.BouncyCastle package, which depends on the BouncyCastle library and adds support for every subroutine implemented in BC, including the zlib one. Kaitai.Crypto is not the best name, but it was meant to be the previously mentioned "chosen by us" default implementation (e.g. the .NET built-in one in the example), which can be installed next to the Core one so that you would basically get the "Standard" version.

Edit: sorry, the close was a misclick on mobile.

So, overall: you are probably right; most of the things I am talking about are advanced use cases we can think about later if the need arises.

@GreyCat
Member

GreyCat commented Jul 27, 2017

Actually, could you ever publish a NuGet package that depends on any classes not provided by its defined dependencies? As far as I understand, for example, it is not really possible with Maven central, as the build process would fail → no luck in deploying it.

@koczkatamas
Member

No, you cannot. That's where dynamic dependency resolution comes into the picture.

@GreyCat
Member

GreyCat commented Aug 29, 2017

TWIMC: I've committed support for C# and Python.

@GreyCat
Member

GreyCat commented Sep 6, 2017

JavaScript custom processing works now as well. I probably need to debug why C# doesn't work on the CI, and maybe we should add Lua support (actually, @adrianherrera has already committed the compiler fix — so the only thing missing is a test port), and then we could call it done.

@adrianherrera
Member

adrianherrera commented Sep 6, 2017 via email

@adrianherrera
Member

adrianherrera commented Sep 6, 2017 via email

@GreyCat
Member

GreyCat commented Sep 8, 2017

Hmm, I somehow get this error in TestProcessCustom for Lua:

2) TestProcessCustom.test_process_custom
spec/lua/extra/my_custom_fx.lua:16: attempt to get length of a nil value (local 'data')
stack traceback:
        compiled/lua/process_custom.lua:22: in method '_read'
        compiled/lua/process_custom.lua:16: in local 'init'

Maybe it's just my local box or something. Will try it on CI...

@adrianherrera
Member

adrianherrera commented Sep 9, 2017 via email

@GreyCat
Member

GreyCat commented Sep 9, 2017

Alas, even after all these fixes, there's still the same error (both on the CI and on my box):

stack traceback:
 	compiled/lua/process_custom.lua:22: in method '_read'
 	compiled/lua/process_custom.lua:16: in local 'init'
 	../runtime/lua/class.lua:70: in function <../runtime/lua/class.lua:66> 	(...tail calls...)
 	spec/lua/test_process_custom.lua:8: in upvalue 'TestProcessCustom.test_process_custom'
--

@adrianherrera
Member

adrianherrera commented Sep 9, 2017 via email

@adrianherrera
Member

Can you try now? I think I forgot to do something in the kaitai_struct repo with the compiler submodule!

@GreyCat
Member

GreyCat commented Sep 11, 2017

Works perfectly now, thanks! I guess I'm closing this issue — ironically, the previous Perl maintainer who opened it seems to have left GitHub, Perl remains the only major language that doesn't have this implemented, and I don't feel like taking it on by myself :(
