Design new Normalizer runner #1136

Closed
praseodym opened this issue Jun 9, 2023 · 1 comment · Fixed by #1538

praseodym commented Jun 9, 2023

Design the new Normalizer runner

Considerations:

  • Runtime
  • Image distribution
  • I/O protocol specification
  • Configuration within KAT
@praseodym praseodym added the boefjes Issues related to boefjes label Jun 9, 2023
@praseodym praseodym self-assigned this Jun 9, 2023
@underdarknl underdarknl changed the title from "Design new Boefjes runner" to "Design new Normalizer runner" Jun 9, 2023
@underdarknl
Contributor

Initially our design called for Normalizers as (AWS) Lambda-like functions. This would make it possible to run them in micro-VMs/micro-containers and distribute them as small code packages (e.g. code + requirements) targeting a specific pre-built Python (or other interpreter) container running on, for example, Firecracker.

This has a few advantages:

  • The code of normalizers runs sandboxed.
  • The input can be a single raw file (easily testable)
  • They can be run in parallel.
  • The output is easily tested by testing the returned objects for value and schema-validity. (https://python-jsonschema.readthedocs.io/en/stable/validate/)
  • The whole normalizer can be hashed and as such we can keep track of what we did with which code/input/output.
  • Crashes can be caught at the runtime level and reported without boilerplate inside the normalizer
  • Support for multiple languages can be added
  • Normalizers can carry conflicting dependencies without issue.
  • Easily packaged (zip or OCI container, of which the latter might be overkill).
  • Separation of runner code (e.g. Python 3.10 with a set of reasonable modules) and app code (e.g. the main method doing the heavy lifting).
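
A rough sketch of how a runner could get a few of these properties (isolation from the runner process, crash reporting without boilerplate in the normalizer, and hashes of code/input/output). This is only an illustration under assumptions: a plain subprocess stands in for a micro-VM or container and is not a real sandbox, and the names `run_normalizer` and `NORMALIZER_CODE` are made up for this example.

```python
import hashlib
import subprocess
import sys


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a blob."""
    return hashlib.sha256(data).hexdigest()


def run_normalizer(code: bytes, raw: bytes, timeout: float = 10.0) -> dict:
    """Run normalizer code in a separate interpreter and record what ran on what.

    Assumption: a subprocess is used here only as a stand-in for a real sandbox.
    """
    record = {"code_sha256": sha256_hex(code), "input_sha256": sha256_hex(raw)}
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code.decode("utf-8")],
            input=raw,            # the single raw file goes in on stdin
            capture_output=True,  # the output blob comes back on stdout
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {**record, "ok": False, "error": "timeout"}
    if proc.returncode != 0:
        # The crash is caught and reported by the runner, not by the normalizer.
        error = proc.stderr.decode("utf-8", errors="replace")
        return {**record, "ok": False, "error": error}
    return {
        **record,
        "ok": True,
        "output": proc.stdout,
        "output_sha256": sha256_hex(proc.stdout),
    }


# Hypothetical stand-in for a packaged normalizer: uppercases its input.
NORMALIZER_CODE = b"import sys; sys.stdout.write(sys.stdin.read().upper())"

good = run_normalizer(NORMALIZER_CODE, b"example.com")
bad = run_normalizer(b"raise RuntimeError('boom')", b"")
print(good["ok"], good["output"])
print(bad["ok"])
```

Because each run is independent and keyed by its hashes, several such calls could be dispatched in parallel, and a repeated run of the same code on the same input can be recognised from the record.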

This also has a few requirements:

  • Normalizers do not interact with the outside world (already met, except for one normalizer that contacts Octopoes).
  • Normalizers list their requirements (already met).
  • The input and output are text or binary blobs (currently the output is a Python object holding data mirroring the Octopoes model).
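
To make the blob-in/blob-out requirement concrete, here is a minimal sketch of what a normalizer entry point might look like. The function name `normalize`, the `job_meta` keys and the `object_type`/`name` fields are assumptions for illustration, not the existing KAT interface.

```python
import json


def normalize(raw: bytes, job_meta: dict) -> bytes:
    """Hypothetical normalizer: one hostname per input line, JSON blob out.

    No network access and no Python objects cross the boundary,
    only the raw bytes in and the serialized bytes out.
    """
    objects = [
        {"object_type": "Hostname", "name": line.strip()}
        for line in raw.decode("utf-8").splitlines()
        if line.strip()
    ]
    payload = {"job_id": job_meta.get("id"), "objects": objects}
    return json.dumps(payload).encode("utf-8")


result = normalize(b"example.com\n\nexample.org\n", {"id": "job-1"})
print(result.decode("utf-8"))
```

A function of this shape can be tested with nothing more than a fixture file and an expected JSON document.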

This also has a few drawbacks (some of which we can minimize):

  • Startup time for a sandboxed normalizer is longer than for a direct method call.
  • Not all envisioned functionality can be captured in a sandboxed normalizer that has no I/O options other than the initial raw file + job meta and the resulting output.
  • Inter-related objects in the output stream are 'harder' to relate to each other than with Python's references (maybe solvable by using something akin to https://json-schema.org/understanding-json-schema/structuring.html#ref ).
  • One-shot return of data, as the runner only processes all output once the container has returned.
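
As an illustration of the reference idea for inter-related objects, the sketch below lets objects in the output point at each other by a local id and has the runner turn those back into Python references after parsing. This is a simplified scheme that only borrows the `$ref` key name; it is not the JSON Schema `$ref` mechanism itself, and the object types and field names are invented for the example.

```python
import json

# Hypothetical normalizer output: objects keyed by a local id,
# with relations expressed as {"$ref": "<local id>"} placeholders.
output = json.loads("""
{
  "objects": {
    "host-1": {"object_type": "Hostname", "name": "example.com"},
    "rec-1": {
      "object_type": "DNSARecord",
      "hostname": {"$ref": "host-1"},
      "value": "192.0.2.1"
    }
  }
}
""")


def resolve_refs(objects: dict) -> dict:
    """Replace {"$ref": id} placeholders with the referenced object (in place)."""
    for obj in objects.values():
        for key, value in list(obj.items()):
            if isinstance(value, dict) and set(value) == {"$ref"}:
                # A KeyError here signals a dangling reference in the output.
                obj[key] = objects[value["$ref"]]
    return objects


resolved = resolve_refs(output["objects"])
print(resolved["rec-1"]["hostname"]["name"])
```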

Options that this gives us:

  • Output can be JSON, optionally with versioned schemas.
  • Run various separate runner environments (e.g. Python 3.8, 3.9, PHP 8, PHP 7); this needs the requirements to be set in the normalizer manifest.
  • Cache normalizer dependencies.
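
One possible shape for the manifest-driven options above, purely as a sketch: the manifest fields (`runtime`, `output_schema_version`, `requirements`), the image names and both helper functions are hypothetical and not an existing KAT format.

```python
import hashlib
import json

# Hypothetical registry mapping a runtime name to a pre-built runner image.
RUNNER_IMAGES = {
    "python3.8": "runner-python:3.8",
    "python3.9": "runner-python:3.9",
    "php8": "runner-php:8",
}

# Hypothetical normalizer manifest.
manifest = json.loads("""
{
  "name": "example-normalizer",
  "runtime": "python3.9",
  "output_schema_version": "1",
  "requirements": ["dnspython", "attrs"]
}
""")


def select_runner(manifest: dict) -> str:
    """Pick the runner environment the manifest asks for."""
    runtime = manifest["runtime"]
    try:
        return RUNNER_IMAGES[runtime]
    except KeyError:
        raise ValueError(f"no runner available for runtime {runtime!r}") from None


def dependency_cache_key(manifest: dict) -> str:
    """Key for caching installed dependencies: same runtime + requirements, same key."""
    material = json.dumps(
        {"runtime": manifest["runtime"], "requirements": sorted(manifest["requirements"])},
        sort_keys=True,
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


print(select_runner(manifest))
print(dependency_cache_key(manifest))
```

Normalizers that share a runtime and requirement set would then map to the same cache entry regardless of their name or code.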
