
JQ dependency is too heavy for some setups #162

Open
spbnick opened this issue Feb 16, 2021 · 9 comments

Comments

@spbnick
Collaborator

spbnick commented Feb 16, 2021

Depending on the jq library makes it difficult to install kcidb in restricted environments, particularly in AWS Lambda, which e.g. Tuxsuite uses.

Consider other options, e.g.:

  • Replace jq with a custom pure-Python implementation. Shouldn't be too hard, as the JSON specification is simple and well-defined, but might be rather slow on big datasets.
  • Create separate tools with streaming support (e.g. kcidb-submit-stream, kcidb-db-dump-stream, kcidb-db-load-stream, and so on), and move them to a separate package, along with jq dependency.
  • Use a library able to print out / parse JSON incrementally, and input/output complete JSON objects instead.
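For the first option, the standard library alone may already get close: `json.JSONDecoder.raw_decode` can pull one complete value off the front of a buffer. A minimal sketch (the function name and interface here are hypothetical, not part of kcidb) of a streaming parser for concatenated JSON objects:

```python
import io
import json

def parse_json_stream(stream, chunk_size=4096):
    """Yield complete top-level JSON values from a stream of
    concatenated JSON texts, reading chunk_size characters at a
    time instead of loading the whole input into memory."""
    decoder = json.JSONDecoder()
    buf = ""
    while chunk := stream.read(chunk_size):
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # object incomplete: read another chunk
            yield obj
            buf = buf[end:]
    if buf.strip():
        raise ValueError("truncated or malformed trailing JSON")

# Two concatenated objects, as in a report stream:
objs = list(parse_json_stream(io.StringIO('{"version": 1} {"checkouts": []}')))
print(objs)  # [{'version': 1}, {'checkouts': []}]
```

Note this only works cleanly because our top-level values are objects, whose end is unambiguous; a top-level bare number could be split across a chunk boundary and mis-parsed, and the sketch doesn't distinguish "incomplete" from "malformed" until EOF.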
@spbnick
Collaborator Author

spbnick commented Feb 16, 2021

Scratch the last option. We won't be able to interleave object types this way, unfortunately. Unless we change the schema, embedding type data into objects themselves, that is.

@spbnick
Collaborator Author

spbnick commented Sep 2, 2021

Also, check if the stock PyPI package is easy to install in AWS Lambda. If it is, try again to get stream-parsing support merged upstream. Consider implementing parsing of file objects instead of iterators/generators, as the maintainer requested.
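The file-object interface the maintainer asked for can be reconciled with generator-based code via a small adapter. A hypothetical standard-library sketch (`IterStream` is not an existing kcidb or jq.py name):

```python
import io

class IterStream(io.RawIOBase):
    """Adapt an iterator/generator of byte chunks into a readable
    file object, so code that yields chunks can feed a parser that
    expects to read() from a file."""

    def __init__(self, iterable):
        self._iter = iter(iterable)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, b):
        try:
            chunk = self._leftover or next(self._iter)
        except StopIteration:
            return 0  # EOF
        n = min(len(b), len(chunk))
        b[:n] = chunk[:n]
        self._leftover = chunk[n:]  # keep what didn't fit
        return n

def gen():
    yield b'{"version"'
    yield b': 1}'

stream = io.BufferedReader(IterStream(gen()))
print(stream.read())  # b'{"version": 1}'
```

Wrapping the raw stream in `io.BufferedReader` gives the usual `read()`/`readline()` surface without the adapter having to implement buffering itself.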

@mrbazzan
Contributor

mrbazzan commented Jan 1, 2022

>   • Replace jq with a custom pure-Python implementation. Shouldn't be too hard, as the JSON specification is simple and well-defined, but might be rather slow on big datasets.

@spbnick
I'm thinking of picking this issue up, specifically using this option.

Is there extra information I need to know?

@spbnick
Collaborator Author

spbnick commented Jan 3, 2022

@mrbazzan, that would be fun to do indeed! If you want to do that, here are some of the requirements:

  • Do not require loading the whole source JSON text into memory at once for parsing; read only chunks of limited (and optionally-specified) size, e.g. 4KB-4MB.
  • Be no more than about 3x slower than JQ at parsing our current loads: gigabytes of JSON, split into objects with, say, 10K report objects each. I can provide a sample to run tests on. The tests for this would need to run continuously alongside development, starting from the moment parsing our input becomes possible. And yes, "3x" is picked based on a feeling of what could be acceptable; there are no other real requirements at the moment.

You can work on that in your own repo, making your own package, etc., but I would need to review the solution before using it and accepting the dependency. The likelihood of success would be increased with early feedback, though.
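The 3x budget could be tracked with a rough harness like the one below. The object shape, names, and the `parse_objects` stand-in are all assumptions for illustration; the jq baseline would be measured separately by running the equivalent jq invocation on the same input and comparing wall-clock times.

```python
import io
import json
import time

def parse_objects(stream, chunk_size=65536):
    """Minimal stand-in streaming parser; the candidate
    implementation under test would be dropped in here."""
    decoder = json.JSONDecoder()
    buf = ""
    while chunk := stream.read(chunk_size):
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # need more input
            yield obj
            buf = buf[end:]

def benchmark(n_objects=10_000):
    # Synthetic load, loosely shaped like small report objects.
    text = "".join(json.dumps({"id": f"rep:{i}", "status": "PASS"})
                   for i in range(n_objects))
    start = time.perf_counter()
    count = sum(1 for _ in parse_objects(io.StringIO(text)))
    elapsed = time.perf_counter() - start
    return count, elapsed

count, secs = benchmark()
print(f"parsed {count} objects in {secs:.3f} s")
```

Running this continuously alongside development, against the provided sample once available, would catch performance regressions early rather than at review time.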

@mrbazzan
Contributor

mrbazzan commented Jan 13, 2022

> @mrbazzan, that would be fun to do indeed! If you want to do that, here are some of the requirements:
>
>   • Do not require loading the whole source JSON text into memory at once for parsing; read only chunks of limited (and optionally-specified) size, e.g. 4KB-4MB.
>   • Be no more than about 3x slower than JQ at parsing our current loads: gigabytes of JSON, split into objects with, say, 10K report objects each. I can provide a sample to run tests on. The tests for this would need to run continuously alongside development, starting from the moment parsing our input becomes possible. And yes, "3x" is picked based on a feeling of what could be acceptable; there are no other real requirements at the moment.
>
> You can work on that in your own repo, making your own package, etc., but I would need to review the solution before using it and accepting the dependency. The likelihood of success would be increased with early feedback, though.

@spbnick Okay.

A pure-Python implementation for binding jq, rather than having to use jq.py, right?

Also, please kindly provide sample data to run tests on.

@mrbazzan
Contributor

Also, I'm still pretty confused. I went through the jq.py repo and ...

I'm really interested in this project, and I would appreciate further guidance.

@spbnick
Collaborator Author

spbnick commented Jan 14, 2022

> A pure-Python implementation for binding jq, rather than having to use jq.py, right?

We can't really use a pure-Python implementation for binding jq, since it's written in C.

So we might need to write a pure-Python (standard-library-only) parser for JSON object streams. That's all we need from JQ: the ability to parse a sequence of JSON objects without loading the whole file into memory.

Another option may be to work further with upstream on incorporating our changes (I got to the point of the author ignoring me 😬), or to make our own binding for jq, just for stream parsing. In either case we would need a compiled package on PyPI, and verification that e.g. AWS Lambda can handle it.

Here's a release tag of our fork of jq.py, if you're interested in that: https://github.com/kernelci/jq.py/tree/1.2.1.post1

> Also, please kindly provide sample data to run tests on.

You can start with the sample I already provided, just cat all the files together, and once you get to performance tests, I can give you a larger one.

@mrbazzan
Contributor

mrbazzan commented Jan 16, 2022

> We can't really use a pure-Python implementation for binding jq, since it's written in C.
>
> So we might need to write a pure-Python (standard-library-only) parser for JSON object streams. That's all we need from JQ: the ability to parse a sequence of JSON objects without loading the whole file into memory.

Oh... I think I have a better understanding of the problem now. We want a package that offers the parsing ability of JQ (without loading the whole file into memory) but built with only standard Python packages, so as to make it easy to install kcidb in environments like AWS Lambda, right?

@spbnick
Collaborator Author

spbnick commented Jan 17, 2022

> Oh... I think I have a better understanding of the problem now. We want a package that offers the parsing ability of JQ (without loading the whole file into memory) but built with only standard Python packages, so as to make it easy to install kcidb in environments like AWS Lambda, right?

Yep.

Or, if compiled PyPI packages work in AWS after all, either work with upstream to integrate our changes, or make our own PyPI package binding jq just for parsing.
