
jq startup has gotten slow due to growth in builtins and linking inefficiencies #1411

Open
h2suzuki opened this issue May 17, 2017 · 6 comments


@h2suzuki

Thank you for this great command jq!

Recently, I compiled jq-1.5 (master branch HEAD) on a Raspberry Pi 3 and found a significant slowdown compared to jq-1.4.

Here is the information I found:

jq-1.4

Note that ./configure was run without any options. Within the build directory, I ran the following commands.

# ./jq --version
jq-1.4

# time echo '{}' | strace -wc ./jq .
{}
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 17.64    0.000960          69        14           mmap2
 16.66    0.000907         907         1           execve
 12.36    0.000673          61        11           read
 10.73    0.000584          65         9           mprotect
  8.49    0.000462          77         6         1 open
  6.14    0.000334          48         7           fstat64
  6.01    0.000327         109         3           munmap
  4.80    0.000261          44         6           close
  4.74    0.000258          65         4         3 access
  4.65    0.000253          42         6           lseek
  2.70    0.000147          49         3           brk
  1.47    0.000080          80         1           write
  1.10    0.000060          60         1           ioctl
  0.94    0.000051          51         1           cacheflush
  0.83    0.000045          45         1           uname
  0.75    0.000041          41         1           set_tls
------ ----------- ----------- --------- --------- ----------------
100.00    0.005443                    75         4 total

real    0m0.038s
user    0m0.010s
sys     0m0.010s

jq-1.5 HEAD

For 1.5, ./configure --without-oniguruma was used.

# ./jq --version
jq-1.5rc2-237-gcb3d5af

# time echo '{}' | strace -wc ./jq .
{}
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 15.36    0.000950          68        14           mmap2
 14.82    0.000917         102         9           brk
 14.57    0.000901         901         1           execve
  9.75    0.000603          55        11           read
  9.41    0.000582          65         9           mprotect
  7.44    0.000460          77         6         1 open
  5.38    0.000333          48         7           fstat64
  5.04    0.000312         104         3           munmap
  4.30    0.000266          44         6           close
  4.28    0.000265          66         4         3 access
  4.17    0.000258          43         6           lseek
  1.39    0.000086          86         1           write
  0.97    0.000060          60         1           ioctl
  0.86    0.000053          53         1           getcwd
  0.84    0.000052          52         1           cacheflush
  0.79    0.000049          49         1           uname
  0.63    0.000039          39         1           set_tls
------ ----------- ----------- --------- --------- ----------------
100.00    0.006186                    82         4 total

real    0m0.895s
user    0m0.830s
sys     0m0.010s

You can see the elapsed time becomes almost 1 second for this minimal JSON input, roughly 23.5 times longer than jq-1.4, even though the total syscall time reported by strace barely changed.
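(A quick way to confirm that the cost is in jq's startup and program compilation, rather than in reading the input, is to time an invocation that takes no input at all via the standard -n flag. This check is not part of the original report; if it is still slow, the input is clearly not the culprit.)

# time ./jq -n '.'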

Here is another trace of jq-1.5.

# time echo '{}' | strace -r ./jq .
     0.000000 execve("./jq", ["./jq", "."], [/* 47 vars */]) = 0
     0.001837 brk(0)                    = 0x12f5000
     0.000298 uname({sys="Linux", node="raspberry", ...}) = 0
     0.000201 access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
     0.000242 mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76fa1000
     0.000226 access("/etc/ld.so.preload", R_OK) = 0
     0.000201 open("/etc/ld.so.preload", O_RDONLY|O_CLOEXEC) = 3
     0.000222 fstat64(3, {st_mode=S_IFREG|0644, st_size=42, ...}) = 0
     0.000219 mmap2(NULL, 42, PROT_READ|PROT_WRITE, MAP_PRIVATE, 3, 0) = 0x76fa0000
     0.000214 close(3)                  = 0
     0.000240 open("/usr/lib/arm-linux-gnueabihf/libarmmem.so", O_RDONLY|O_CLOEXEC) = 3

... (omitted) ...

     0.000451 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f9f000
     0.000328 set_tls(0x76f9f4c0, 0x76f9fba8, 0x76fa4050, 0x76f9f4c0, 0x76fa4050) = 0
     0.000473 mprotect(0x76ec0000, 8192, PROT_READ) = 0
     0.000299 mprotect(0x76f3f000, 4096, PROT_READ) = 0
     0.000240 mprotect(0x76f5f000, 20480, PROT_READ|PROT_WRITE) = 0
     0.000300 mprotect(0x76f5f000, 20480, PROT_READ|PROT_EXEC) = 0
     0.000257 cacheflush(0x76f5f000, 0x76f64000, 0, 0x15, 0x7ea53dd0) = 0
     0.000313 mprotect(0x76fa3000, 4096, PROT_READ) = 0
     0.000226 munmap(0x76f41000, 120366) = 0
     0.001042 brk(0)                    = 0x12f5000
     0.000234 brk(0x1316000)            = 0x1316000
     0.000392 ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 opost isig icanon echo ...}) = 0
     0.000752 getcwd("/root/Downloads/jq", 4096) = 19
     0.000663 open("/root/.jq", O_RDONLY) = -1 ENOENT (No such file or directory)
     0.032603 brk(0x1337000)            = 0x1337000
     0.018625 brk(0x1358000)            = 0x1358000
     0.041019 brk(0x1379000)            = 0x1379000
     0.808403 brk(0x139a000)            = 0x139a000
     0.389545 brk(0x1399000)            = 0x1399000
     0.001031 brk(0x131e000)            = 0x131e000
     0.000708 fstat64(0, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
     0.000251 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f9e000
     0.000291 read(0, "{}\n", 4096)     = 3
     0.000443 fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 0), ...}) = 0
     0.000235 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f9d000
     0.000268 write(1, "\33[1;39m{}\33[0m\n", 14{}
) = 14
     0.000272 read(0, "", 4096)         = 0
     0.000215 close(1)                  = 0
     0.000164 munmap(0x76f9d000, 4096)  = 0
     0.000430 exit_group(0)             = ?
     0.000826 +++ exited with 0 +++

real    0m1.445s
user    0m0.990s
sys     0m0.020s

The major difference comes from brk() system calls after open("/root/.jq", O_RDONLY).

     0.032603 brk(0x1337000)            = 0x1337000
     0.018625 brk(0x1358000)            = 0x1358000
     0.041019 brk(0x1379000)            = 0x1379000
     0.808403 brk(0x139a000)            = 0x139a000
     0.389545 brk(0x1399000)            = 0x1399000
     0.001031 brk(0x131e000)            = 0x131e000

The 0.808403 and 0.389545 seconds are not spent inside brk() itself, as the strace -wc summary above shows; since strace -r reports the time elapsed since the previous system call, these gaps imply that something heavy is going on in user space between the calls.
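(One way to double-check that, beyond what is shown in the report, is strace's -T option, which prints the time spent inside each system call in angle brackets; the brk() calls themselves should come back in microseconds while the gaps between them stay large.)

# echo '{}' | strace -r -T ./jq .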

jq-1.4 does not make these brk() calls at all; the problematic behavior was introduced in jq-1.5. Below is the strace output of jq-1.4 around open("/root/.jq", O_RDONLY).

... (omitted) ...

     0.000165 munmap(0x76ead000, 120366) = 0
     0.000882 brk(0)                    = 0x1213000
     0.000163 brk(0x1234000)            = 0x1234000
     0.000590 open("/root/.jq", O_RDONLY) = -1 ENOENT (No such file or directory)
     0.008392 fstat64(0, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
     0.000301 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f0a000
     0.000215 read(0, "{}\n", 4096)     = 3
     0.000333 ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, {B9600 opost isig icanon echo ...}) = 0
     0.000217 fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 0), ...}) = 0
     0.000280 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f09000
     0.000241 write(1, "\33[37m{}\33[0m\n", 12{}
) = 12
     0.000217 read(0, "", 4096)         = 0
     0.000171 close(0)                  = 0
     0.000135 munmap(0x76f0a000, 4096)  = 0
     0.000497 exit_group(0)             = ?
     0.000587 +++ exited with 0 +++

real    0m0.043s
user    0m0.020s
sys     0m0.010s

Just to be clear, the machine has plenty of free memory, so the issue is not caused by a memory shortage.

# free -h
             total       used       free     shared    buffers     cached
Mem:          733M       616M       116M        55M        86M       340M
-/+ buffers/cache:       189M       543M
Swap:         370M        36M       334M

Thank you in advance!

@nicowilliams
Contributor

Perhaps something in the JSON parser has started doing more allocations than it should. That's all I can think of right now. Thanks for this report!

@wtlangford
Contributor

So, I ran jq 1.5 (the release branch) through callgrind, and it looks like we're spending most of our time in... block_bind_subblock! Though this isn't particularly surprising (see also #1380).

For posterity, I've attached the call graph generated by callgrind (and kcachegrind).
[callgraph image attached]

As such, I don't think this is an issue, so much as it is more reason to actually deal with #1380.

As for the extra calls to brk(), I'd expect them to be because we have so many more builtins now and they all get compiled/bound/stripped every time we compile a jq program.
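(As a rough illustration of that growth: on a build recent enough to ship the builtins/0 function, which may or may not include yours, the number of builtins can be counted directly. This is a suggested check, not something run in this thread.)

./jq -n 'builtins | length'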

@nicowilliams
Contributor

@wtlangford Good job. Yeah, we need to optimize binding!

@nicowilliams
Contributor

@wtlangford It would be useful to know how much more time we spend in block_bind_subblock() called from yyparse() versus from linking. See my commentary on #1380.

@wtlangford
Contributor

It's hard to say... kcachegrind is not my favorite tool, and this is all too fast overall for something like gprof (which I'm a bit better with).

Notes about these times:

  • callgrind doesn't actually measure time. It instead measures things like "instructions fetched" and "data read". I'm using "instructions fetched" as a reference point for time. Obviously that's a bit fuzzy, but it should do.
  • Percentages are relative to the caller (unless otherwise stated).

This is all based off of an invocation of jq that looks like this:

echo '{}' | valgrind --tool=callgrind --dump-instr=yes --simulate-cache=yes --collect-jumps=yes ./jq .
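
(If kcachegrind is awkward, a plain-text view of the same data can be produced with callgrind_annotate; the output file name below is just a placeholder and these options are one reasonable choice, not the only one.)

callgrind_annotate --inclusive=yes --tree=both callgrind.out.<pid>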

Onwards to data!

Almost all of our time (99.84% of the total process time) is spent in jq_compile_args() -> builtins_bind()
Here's a breakdown of that time:

  • 26.08% in builtins_bind_one() (in jq_parse_library())
  • 46.43% in gen_cbinding() (in repeated calls to block_bind())
  • 25.56% in block_drop_unreferenced() (in block_count_refs())

All of the time spent in jq_parse_library() is spent in yyparse(), which makes sense given that the program we're compiling is just '.'. I'd expect a more complex program to spend more time in block_bind().

Looking at the callers of block_bind_subblock() yields the following breakdown of time spent in block_bind_subblock():

  • 31.17% comes from yyparse()
  • 64.77% comes from gen_cbinding()

Lots of numbers here, and I'm not sure how much I trust them, but the expectation is that spending less time binding (or binding more efficiently) will speed things up.

@nicowilliams
Contributor

Thanks @wtlangford, that confirms my thinking that just going from O(N_insts * M_builtins) to O(N_insts * log(M_builtins)) would be a big win, especially for tiny programs like '.'. Most jq programs aren't that large anyway, so it should help broadly. Precompiling builtins might be an even bigger win, though that would be a significant complication.

@nicowilliams nicowilliams changed the title jq 1.5 becomes very heavy and slow jq startup has gotten slow due to growth in builtins and linking inefficiencies May 26, 2017