Design new Normalizer runner #1136

praseodym · 2023-06-09T09:19:52Z

Design the new Normalizer runner

Considerations:

Runtime
Image distribution
I/O protocol specification
Configuration within KAT

underdarknl · 2023-06-09T09:59:19Z

Initially our design called for Normalizers as AWS Lambda-like functions. This would make it possible to run them in micro-VMs or micro-containers and to distribute them as small code packages (e.g. code + requirements) targeting a specific pre-built Python (or other interpreter) container running on, for example, Firecracker.

This has a few advantages:

The code of normalizers runs sandboxed.
The input can be a single raw file (easily testable)
They can be run in parallel.
The output is easily tested by testing the returned objects for value and schema-validity. (https://python-jsonschema.readthedocs.io/en/stable/validate/)
The whole normalizer can be hashed and as such we can keep track of what we did with which code/input/output.
Crashes can be caught at the runtime level and reported without boilerplate inside the normalizer
Support for multiple languages can be added
Normalizers can carry conflicting dependencies without issue.
Easily packaged (zip or OCI container, though the latter might be overkill).
Separation of runner code (e.g. Python 3.10 with a set of reasonable modules) and app code (e.g. the main method doing the heavy lifting).
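
To make the hashing and crash-handling points concrete, here is a minimal sketch of a runner. It assumes a plain child Python process as a stand-in for a real micro-VM or container (a subprocess is not a sandbox), and the names `RunRecord` and `run_normalizer` are hypothetical, not part of the existing code base. The normalizer reads the raw file on stdin and writes its output to stdout.

```python
import hashlib
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RunRecord:
    """Hypothetical audit record: which code ran on which input, and what came out."""
    code_sha256: str
    input_sha256: str
    output_sha256: Optional[str]
    ok: bool
    error: Optional[str]


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def run_normalizer(code: bytes, raw: bytes, timeout: float = 30.0) -> Tuple[RunRecord, bytes]:
    """Run normalizer source in a child interpreter and report crashes at the runner level."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code.decode("utf-8")],  # -I: isolated mode
            input=raw,
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return RunRecord(sha256(code), sha256(raw), None, False, "timeout"), b""

    if proc.returncode != 0:
        # The normalizer needs no error-handling boilerplate: the runner
        # reports the last line of stderr (the exception message) itself.
        lines = proc.stderr.decode("utf-8", errors="replace").strip().splitlines()
        error = lines[-1] if lines else f"exit code {proc.returncode}"
        return RunRecord(sha256(code), sha256(raw), None, False, error), b""

    record = RunRecord(sha256(code), sha256(raw), sha256(proc.stdout), True, None)
    return record, proc.stdout
```

Because each run is a separate process with a single input blob, several of these calls could be dispatched in parallel, and the three hashes give a simple trail of what was done with which code, input and output.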

This also has a few requirements:

Normalizers do not interact with the outside world (already met, except for one normalizer that contacts Octopoes).
Normalizers list their requirements (already met)
The input and output are text or binary blobs (currently the output is a Python object holding data that mirrors the Octopoes model).
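
As an illustration of the last requirement, the sketch below turns Python objects into a JSON blob and back. The `Hostname` class, the `object_type` field and both function names are assumptions made up for this example, not the actual Octopoes model.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Hostname:
    """Hypothetical stand-in for a model object produced by a normalizer."""
    name: str


def encode_output(objects: list) -> bytes:
    """Serialize dataclass instances to a UTF-8 JSON blob, tagging each with its type."""
    payload = [{"object_type": type(obj).__name__, **asdict(obj)} for obj in objects]
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def decode_output(blob: bytes) -> list:
    """Parse the blob on the runner side into plain dictionaries."""
    return json.loads(blob.decode("utf-8"))
```

On the runner side the result is plain data, so it can be checked for values and schema validity without importing any normalizer code.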

This also has a few drawbacks (some of which we can minimize):

Startup time for a sandboxed normalizer is longer than for a direct method call.
Not all envisioned functionality can be captured in a sandboxed normalizer, which has no I/O options other than the initial raw file plus job metadata and the resulting output.
Interrelated objects in the output stream are harder to relate to each other than with Python's references (maybe solvable by using something akin to https://json-schema.org/understanding-json-schema/structuring.html#ref).
One-shot return of data, as the runner processes all output only once the container has returned.
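
One possible way to relate objects in a flat output stream is sketched below. It assumes every object carries a unique `id` and that a reference is written as `{"$ref": "<id>"}`. This format is only loosely inspired by JSON Schema's `$ref` and is a hypothetical convention, not something the project has defined.

```python
def resolve_refs(objects: list) -> dict:
    """Replace {"$ref": id} placeholders with the referenced object.

    Returns a mapping from id to (shallow-copied) object; the input is left untouched.
    """
    by_id = {obj["id"]: dict(obj) for obj in objects}
    for obj in by_id.values():
        for key, value in list(obj.items()):
            if isinstance(value, dict) and set(value) == {"$ref"}:
                target = value["$ref"]
                if target not in by_id:
                    raise KeyError(f"dangling reference: {target}")
                obj[key] = by_id[target]
    return by_id
```

After resolution, related objects point at each other much like Python references do, while the serialized form stays plain JSON.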

Options that this gives us:

Output can be JSON, optionally with versioned schemas.
Run various separate runner environments (e.g. Python 3.8, 3.9, PHP 8, PHP 7); this requires the requirements to be set in the normalizer manifest.
Cache normalizer dependencies.
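
The manifest-driven options could look roughly like this. The manifest keys (`runtime`, `requirements`), the image names in `RUNNER_IMAGES` and both function names are assumptions for illustration only.

```python
import hashlib
import json

# Hypothetical mapping from a manifest's runtime to a pre-built runner image.
RUNNER_IMAGES = {
    "python3.8": "runner-python:3.8",
    "python3.9": "runner-python:3.9",
    "php8": "runner-php:8",
}


def select_runner(manifest: dict) -> str:
    """Pick the runner image for the runtime a normalizer declares."""
    runtime = manifest.get("runtime")
    if runtime not in RUNNER_IMAGES:
        raise ValueError(f"unsupported runtime: {runtime!r}")
    return RUNNER_IMAGES[runtime]


def dependency_cache_key(manifest: dict) -> str:
    """Derive a cache key from runtime and requirements, ignoring their order."""
    material = json.dumps(
        {
            "runtime": manifest["runtime"],
            "requirements": sorted(manifest.get("requirements", [])),
        },
        sort_keys=True,
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Two normalizers that declare the same runtime and the same set of requirements would then share one cached dependency layer.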

praseodym added the boefjes label (Issues related to boefjes) Jun 9, 2023

praseodym self-assigned this Jun 9, 2023

underdarknl changed the title ~~Design new Boefjes runner~~ Design new Normalizer runner Jun 9, 2023

This was referenced Aug 1, 2023

Design new Boefjes runner #1522

Open

Add first version of new normalisers runner design #1538

Merged

dekkers closed this as completed in #1538 Aug 17, 2023
