
JQ dependency is too heavy for some setups #162

Open
spbnick opened this issue Feb 16, 2021 · 9 comments

Comments

@spbnick
Collaborator

spbnick commented Feb 16, 2021

Depending on the jq library makes it difficult to install kcidb in restricted environments, particularly in AWS Lambda, which e.g. Tuxsuite uses.

Consider other options, e.g.:

  • Replace jq with a custom pure-Python implementation. Shouldn't be too hard, as the JSON specification is simple and well-defined, but might be rather slow on big datasets.
  • Create separate tools with streaming support (e.g. kcidb-submit-stream, kcidb-db-dump-stream, kcidb-db-load-stream, and so on), and move them to a separate package, along with jq dependency.
  • Use a library able to print out / parse JSON incrementally, and input/output complete JSON objects instead.
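For the first option, the standard library alone may already get close: `json.JSONDecoder.raw_decode` can pull one complete value off the front of a buffer. A minimal sketch (the function name and interface here are hypothetical, not part of kcidb) of a streaming parser for concatenated JSON objects:

```python
import io
import json

def parse_json_stream(stream, chunk_size=4096):
    """Yield complete top-level JSON values from a stream of
    concatenated JSON texts, reading chunk_size characters at a
    time instead of loading the whole input into memory."""
    decoder = json.JSONDecoder()
    buf = ""
    while chunk := stream.read(chunk_size):
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # object incomplete: read another chunk
            yield obj
            buf = buf[end:]
    if buf.strip():
        raise ValueError("truncated or malformed trailing JSON")

# Two concatenated objects, as in a report stream:
objs = list(parse_json_stream(io.StringIO('{"version": 1} {"checkouts": []}')))
print(objs)  # [{'version': 1}, {'checkouts': []}]
```

Note this only works cleanly because our top-level values are objects, whose end is unambiguous; a top-level bare number could be split across a chunk boundary and mis-parsed, and the sketch doesn't distinguish "incomplete" from "malformed" until EOF.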
@spbnick
Collaborator Author

spbnick commented Feb 16, 2021

Scratch the last option. We won't be able to interleave object types this way, unfortunately. Unless we change the schema, embedding type data into objects themselves, that is.

@spbnick
Collaborator Author

spbnick commented Sep 2, 2021

Also, check if the stock PyPI package is easy to install in AWS Lambda. If it is, try again to get stream-parsing support merged upstream. Consider implementing parsing of file objects instead of iterators/generators, as the maintainer requested.
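The file-object interface the maintainer asked for can be reconciled with generator-based code via a small adapter. A hypothetical standard-library sketch (`IterStream` is not an existing kcidb or jq.py name):

```python
import io

class IterStream(io.RawIOBase):
    """Adapt an iterator/generator of byte chunks into a readable
    file object, so code that yields chunks can feed a parser that
    expects to read() from a file."""

    def __init__(self, iterable):
        self._iter = iter(iterable)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, b):
        try:
            chunk = self._leftover or next(self._iter)
        except StopIteration:
            return 0  # EOF
        n = min(len(b), len(chunk))
        b[:n] = chunk[:n]
        self._leftover = chunk[n:]  # keep what didn't fit
        return n

def gen():
    yield b'{"version"'
    yield b': 1}'

stream = io.BufferedReader(IterStream(gen()))
print(stream.read())  # b'{"version": 1}'
```

Wrapping the raw stream in `io.BufferedReader` gives the usual `read()`/`readline()` surface without the adapter having to implement buffering itself.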

@mrbazzan
Contributor

mrbazzan commented Jan 1, 2022

>   • Replace jq with a custom pure-Python implementation. Shouldn't be too hard, as the JSON specification is simple and well-defined, but might be rather slow on big datasets.

@spbnick
I'm thinking of picking this issue up, specifically using this option.

Is there extra information I need to know?

@spbnick
Collaborator Author

spbnick commented Jan 3, 2022

@mrbazzan, that would be fun to do indeed! If you want to do that, here are some of the requirements:

  • Do not require loading the whole source JSON text into memory at once for parsing; read only chunks of limited (and optionally-specified) size, e.g. 4KB-4MB.
  • Be no more than about 3x slower than JQ at parsing our current loads: gigabytes of JSON, split into objects with, say, 10K report objects each. I can provide a sample to run tests on. The tests for this would need to run continuously alongside development, starting from the moment parsing our input becomes possible. And yes, "3x" is picked based on a feeling of what could be acceptable; there are no other real requirements at the moment.

You can work on that in your own repo, making your own package, etc., but I would need to review the solution before using it and accepting the dependency. The likelihood of success would be increased with early feedback, though.
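The 3x budget could be tracked with a rough harness like the one below. The object shape, names, and the `parse_objects` stand-in are all assumptions for illustration; the jq baseline would be measured separately by running the equivalent jq invocation on the same input and comparing wall-clock times.

```python
import io
import json
import time

def parse_objects(stream, chunk_size=65536):
    """Minimal stand-in streaming parser; the candidate
    implementation under test would be dropped in here."""
    decoder = json.JSONDecoder()
    buf = ""
    while chunk := stream.read(chunk_size):
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # need more input
            yield obj
            buf = buf[end:]

def benchmark(n_objects=10_000):
    # Synthetic load, loosely shaped like small report objects.
    text = "".join(json.dumps({"id": f"rep:{i}", "status": "PASS"})
                   for i in range(n_objects))
    start = time.perf_counter()
    count = sum(1 for _ in parse_objects(io.StringIO(text)))
    elapsed = time.perf_counter() - start
    return count, elapsed

count, secs = benchmark()
print(f"parsed {count} objects in {secs:.3f} s")
```

Running this continuously alongside development, against the provided sample once available, would catch performance regressions early rather than at review time.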

@mrbazzan
Contributor

mrbazzan commented Jan 13, 2022

> @mrbazzan, that would be fun to do indeed! If you want to do that, here are some of the requirements:
>
>   • Do not require loading the whole source JSON text into memory at once for parsing; read only chunks of limited (and optionally-specified) size, e.g. 4KB-4MB.
>   • Be no more than about 3x slower than JQ at parsing our current loads: gigabytes of JSON, split into objects with, say, 10K report objects each. I can provide a sample to run tests on. The tests for this would need to run continuously alongside development, starting from the moment parsing our input becomes possible. And yes, "3x" is picked based on a feeling of what could be acceptable; there are no other real requirements at the moment.
>
> You can work on that in your own repo, making your own package, etc., but I would need to review the solution before using it and accepting the dependency. The likelihood of success would be increased with early feedback, though.

@spbnick Okay.

A pure-Python implementation for binding jq, rather than having to use jq.py, right?

Also, please kindly provide sample data to run tests on.

@mrbazzan
Contributor

Also, I'm still pretty confused. I went through the jq.py repo and ...

I'm really interested in this project, and I would appreciate further guidance.

@spbnick
Collaborator Author

spbnick commented Jan 14, 2022

> A pure-Python implementation for binding jq, rather than having to use jq.py, right?

We can't really use a pure-Python implementation for binding jq, since it's written in C.

So we might need to write a pure-Python (standard-library-only) parser for JSON object streams. That's all we need from JQ: the ability to parse a sequence of JSON objects without loading the whole file into memory.

Another option may be to work further with upstream on incorporating our changes (I got to the point of the author ignoring me 😬), or to make our own binding for jq, just for stream parsing. In either case we would need a compiled package on PyPI, and verification that e.g. AWS Lambda can handle it.

Here's a release tag of our fork of jq.py, if you're interested in that: https://github.com/kernelci/jq.py/tree/1.2.1.post1

> Also, please kindly provide sample data to run tests on.

You can start with the sample I already provided, just cat all the files together, and once you get to performance tests, I can give you a larger one.

@mrbazzan
Contributor

mrbazzan commented Jan 16, 2022

> We can't really use a pure-Python implementation for binding jq, since it's written in C.
>
> So we might need to write a pure-Python (standard-library-only) parser for JSON object streams. That's all we need from JQ: the ability to parse a sequence of JSON objects without loading the whole file into memory.

Oh... I think I have a better understanding of the problem now. We want a package that offers the parsing ability of JQ (without loading the whole file into memory) but built with only standard Python packages, so as to make it easy to install kcidb in environments like AWS Lambda, right?

@spbnick
Collaborator Author

spbnick commented Jan 17, 2022

> Oh... I think I have a better understanding of the problem now. We want a package that offers the parsing ability of JQ (without loading the whole file into memory) but built with only standard Python packages, so as to make it easy to install kcidb in environments like AWS Lambda, right?

Yep.

Or, if compiled PyPI packages work in AWS after all, either work with upstream to integrate our changes, or make our own PyPI package binding jq just for parsing.
