jq wordcount is slow #61

jkleint · 2012-12-31T22:15:51Z

First, I really like jq; thank you. I'm trying to use it for some large-ish data analysis, but running into a few snags. This is probably more my misunderstanding than an issue.

I'm trying to do a wordcount: just count how many times each line occurs in a text file. So for an input file like this:

one
two
three
two
three
three

Produce output like this (order not important):

one 1
three 3
two 2

My strategy is to use group_by() (please let me know if there's a better approach). Since group_by() only works on arrays (is there a reason it can't take a filter?), I first need to turn my raw text into a JSON array. I couldn't get a single filter to work, so I ended up with jq -R followed by jq -s (combining both flags in one invocation doesn't give me what I want):

$ < test.txt jq -R '.' | jq -s '.' 
[
  "one",
  "two",
  "three",
  "two",
  "three",
  "three"
]

Now I can use group_by():

< test.txt jq -R '.' | time jq -s -r 'group_by(.)[] | "\(.[0]) \(length)"'

which works. For small inputs. If I try 10,000 random strings generated like so:

$ tr -cd '\n[:alnum:]' < /dev/urandom | head -10000 > test.txt

The above filter runs in 1.6 seconds. 100,000 strings takes 143 seconds. 1 hour later, I'm still waiting for 1 million strings -- and I'd like to do a lot more than that! For comparison, the equivalent awk:

awk '{h[$0]++} END {for (line in h) {print line " " h[line]}}'

counts 100,000 lines in 0.2 seconds.

I gather group_by() works by sorting, but sort itself seems pretty fast:

$ < test-1M.txt jq -R '.' | time jq -s -r 'sort[]' > /dev/null
7.96user 0.15system 0:10.55elapsed 76%CPU (0avgtext+0avgdata 751648maxresident)k

So, I'm probably doing something dumb and triggering some N^2 behavior or something. Ideas?
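A quick check on the numbers above: going from 10,000 to 100,000 strings is 10x the input, and 1.6 s to 143 s is roughly 89x the time. A sort-based grouping should scale as n log n (about 12.5x for this step), while quadratic behaviour predicts 100x, so the measurements sit much closer to N^2. Below is a minimal Python sketch of what a sort-then-group wordcount is expected to do. The name wordcount_sorted is hypothetical, and this is only an illustration of the general technique, not jq's implementation of group_by.

```python
from itertools import groupby


def wordcount_sorted(lines):
    """Count occurrences of each line by sorting, then grouping equal neighbours.

    Sorting costs O(n log n); the grouping pass afterwards is a single O(n) scan.
    """
    return [(word, sum(1 for _ in run)) for word, run in groupby(sorted(lines))]


if __name__ == "__main__":
    sample = ["one", "two", "three", "two", "three", "three"]
    for word, count in wordcount_sorted(sample):
        print(word, count)
```

On the six-line sample file this prints the counts in sorted key order, the same ordering the group_by pipeline produces.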


stedolan · 2013-01-03T17:38:39Z

Thanks for an excellent bug report! You were right, there was a silly n^2 in the middle. But it was an unusually subtle one :). It's fixed in master now, so if you build from source it'll run much faster.

It turns out the slow bit was the [foo] array collection syntax, which is used internally by group_by to collect the key for each element. You get the same n^2 behaviour with just [.[]]. Here's a bit of a rant about why:

The [foo] syntax works by creating an empty array, and appending each result of foo to the array. jq's internal JSON library is functional and referentially transparent - all functions work "by value", and there are no side-effects (this is necessary to implement jq's backtracking behaviour in a relatively sane manner). So, the jv_array_append function, in the general case, copies the entire array to append the next element.

There's an important optimisation here, though. jq uses refcounting internally (which is valid since it's intentionally impossible to create a cyclic data structure in jq). If jv_array_append detects that the refcount of the array is 1, and so there are no remaining references to the old array, then it will update the array in-place. This is what was meant to happen for [foo].

Unfortunately, the previous code for [foo] kept an extra reference to the old array hanging around for too long, and forced jv_array_append into the slow full-copy path every time.
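To make the effect concrete, here is a toy Python model of the behaviour described above. It is a sketch under stated assumptions, not jq's C code: RcArray, append and collect are hypothetical names, the reference count is tracked by hand, and the keep_stale_reference flag simply simulates an extra reference to the old array being held during each append.

```python
class RcArray:
    """Toy array value with an explicit, manually managed reference count."""

    def __init__(self, items=()):
        self.items = list(items)
        self.refcount = 1


def append(arr, value, stats):
    """Consume one reference to arr and return an array with value appended."""
    if arr.refcount == 1:
        # Sole owner: nobody else can observe the change, so update in place.
        arr.items.append(value)
        return arr
    # Shared: other holders must keep seeing the old contents, so copy everything.
    stats["elements_copied"] += len(arr.items)
    arr.refcount -= 1
    return RcArray(arr.items + [value])


def collect(values, keep_stale_reference):
    """Build an array one element at a time, like the [foo] collection syntax."""
    stats = {"elements_copied": 0}
    acc = RcArray()
    for v in values:
        if keep_stale_reference:
            acc.refcount += 1  # simulated lingering reference to the old array
        acc = append(acc, v, stats)
    return acc.items, stats["elements_copied"]


if __name__ == "__main__":
    for stale in (False, True):
        _, copied = collect(range(1000), stale)
        print("stale reference:", stale, "elements copied:", copied)
```

In this model, collecting n elements copies nothing when the accumulator is uniquely owned, and n(n-1)/2 elements when a second reference is alive at every append, which is the quadratic pattern.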

stedolan · 2013-01-03T17:41:06Z

That should be reasonably fast now, and I think group_by is the "correct" way of solving this.

But if you're building from master, you can also use the somewhat inscrutable (and undocumented) fold syntax, which might (or might not) be slightly faster:

jq -R '.' | jq -s 'fold {} as $wc (.[] as $word | $wc | .[$word] += 1)'
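Read as a left fold, the expression above appears to start from an empty object and, for each word, increment that word's counter in the accumulator. Here is a rough Python analogue, assuming that reading of the undocumented syntax is right. count_words and step are illustrative names only.

```python
from functools import reduce


def count_words(words):
    """Fold a sequence of words into a {word: count} mapping."""

    def step(wc, word):
        # Mirrors `.[$word] += 1`: a missing key is treated as 0 before adding.
        wc[word] = wc.get(word, 0) + 1
        return wc

    return reduce(step, words, {})


if __name__ == "__main__":
    print(count_words(["one", "two", "three", "two", "three", "three"]))
```

Unlike the group_by version, no sort is involved, so the result comes out in first-seen order rather than sorted order.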

jkleint · 2013-01-03T19:51:41Z

Thank you for the fix and the explanation! I can now use group_by with 1M strings in 21 seconds, and 10M strings in 239 seconds.

I look forward to reading the documentation for fold. I tried your solution with 1M strings, but now it's the one still going after an hour.... ;)

I'd really like to use jq for some big-data type things, but it's still about 10X slower than awk for my task. Have you considered 1.) implementing group_by with hashing, yielding unsorted output (since you can always sort yourself if needed) and 2.) an LLVM backend? I see you have experience with that sort of thing.... ;)

stedolan · 2013-01-03T21:03:53Z

It occurs to me now that fold currently has exactly the same N^2 problem that [foo] used to have :)

A hashing version of group_by is a good idea.
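As a sketch of that idea, here is a hash-based grouping in Python. It is only an illustration: group_by_hash is a hypothetical name, and it assumes the keys are hashable (Python strings and numbers are; a real version for arbitrary JSON values would need a hash defined over arrays and objects too).

```python
def group_by_hash(items, key=lambda x: x):
    """Group items by key in one pass, without sorting.

    Groups come out in first-seen key order, not sorted order.
    """
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    return list(groups.values())


if __name__ == "__main__":
    sample = ["one", "two", "three", "two", "three", "three"]
    for group in group_by_hash(sample):
        print(group[0], len(group))
```

The trade-off is the one raised above: expected linear time instead of n log n, at the price of unsorted output, which the caller can still sort afterwards if needed.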

There are a lot of speedups that could be made in the jq internals. I think the first two I'd do are a simple peephole optimiser for the jq bytecode, and an inliner/constant-propagator (lots of jq guts are implemented as builtins in the jq language itself; this makes everything a bit slower, but it means that optimisations speed everything up at once). Going for a machine-code/LLVM backend would be quite fiddly: backtracking is an integral part of jq, which can be implemented quite efficiently when I have control over the stack, and which doesn't fit very neatly into LLVM's C-like stack model.

So far, I have explicitly avoided doing any serious performance work on jq. I could sink a lot of time into it, and with very little performance effort it already seems fast enough for most purposes. I'm definitely going to make it go really fast at some stage, but I think there are other things to do with it first.

If you want to make it go faster right now there's an easy cheat: hack the Makefile and compile jq with -O2 -DNDEBUG. It's currently compiled at -O1 with assertions turned on (mainly because I can handle an "assertion failed on line foo" bug report better than a "random segfault" bug report), but it spends quite a bit of time doing redundant checks. There are also a lot of small out-of-line C functions that are called regularly, so you'd get quite a perf win by either a) figuring out GCC or Clang's link-time optimisation, or b) cat-ing all of the C files into one giant file before compiling. Maybe try adding -flto -fwhole-program and see what happens.

jkleint · 2013-01-04T04:53:56Z

You're right that there are higher priorities than performance, and I imagine most people aren't doing really performance-intensive stuff anyway. I trust you'd know if LLVM wasn't the right tool for the job.

I tried gcc -O3 -DNDEBUG and it cut the time to about 2/3 what it was. clang -O4 -DNDEBUG (which implies -flto) got it down to half what it was. Thanks for the tip! That brings it to around 3X the time of a sorting-based Python solution, and if you say there is plenty of room for optimization in the future, I can live with that. :) Besides, I have an optimization idea to cut memory usage, with the likely side effect of bringing performance up.

Thanks again for a great tool and rapid improvements.

stedolan closed this as completed in c013b55 Jan 3, 2013

dtolnay added bug performance labels Jul 27, 2015
