-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
jq wordcount is slow #61
Comments
Thanks for an excellent bug-report! You were right, there was a silly n^2 in the middle. But it was an unusually subtle one :). It's fixed in master now, so if you build from source it'll run much faster. It turns out the slow bit was the The There's an important optimisation here, though. jq uses refcounting internally (which is valid since it's intentionally impossible to create a cyclic data structure in jq). If Unfortunately, the previous code for |
That should be reasonably fast now, and I think But if you're building from master, you can also use the somewhat inscrutable (and undocumented)
|
Thank you for the fix and the explanation! I can now use I look forward to reading the documentation for I'd really like to use jq for some big-data type things, but it's still about 10X slower than awk for my task. Have you considered 1.) implementing |
It occurs to me now that A hashing version of There are a lot of speedups that could be made in the jq internals. I think the first two I'd do are a simple peephole optimiser for the jq bytecode, and an inliner/constant-propagator (lots of jq guts are implemented as builtins in the jq language itself, this makes everything a bit slower but it means that optimisations speed everything up at once). Going for a machine-code/LLVM backend would be quite fiddly - backtracking is an integral part of jq, which can be implemented quite efficiently when I have control over the stack, and which doesn't fit very neatly into LLVM's C-like stack model. So far, I have explicitly avoided doing any serious performance work on jq. I could sink a lot of time into it, and with very little performance effort it already seems fast enough for most purposes. I'm definitely going to make it go really fast at some stage, but I think there are other things to do with it first. If you want to make it go faster right now there's an easy cheat: hack the Makefile and compile jq with |
You're right that there are higher priorities than performance, and I imagine most people aren't doing really performance-intensive stuff anyway. I trust you'd know if LLVM wasn't the right tool for the job. I tried Thanks again for a great tool and rapid improvements. |
First, I really like jq; thank you. I'm trying to use it for some large-ish data analysis, but running into a few snags. This is probably more my misunderstanding than an issue.
I'm trying to do a wordcount: just count how many times each line occurs in a text file. So for an input file like this:
Produce output like this (order not important):
My strategy is to use
group_by()
(please let me know if there's a better approach). Sincegroup_by()
only works on arrays (is there a reason it can't take a filter?), first I need to make my raw text into a JSON array. I couldn't get a single filter to work, so I ended up withjq -R
followed byjq -s
(as combining both gives me what I don't want):Now I can use
group_by()
:which works. For small inputs. If I try 10,000 random strings generated like so:
The above filter runs in 1.6 seconds. 100,000 strings takes 143 seconds. 1 hour later, I'm still waiting for 1 million strings -- and I'd like to do a lot more than that! For comparison, the equivalent awk:
counts 100,000 lines in 0.2 seconds.
I gather
group_by()
works by sorting, butsort
itself seems pretty fast:So, I'm probably doing something dumb and triggering some N^2 behavior or something. Ideas?
The text was updated successfully, but these errors were encountered: