Faster unpackm0 #5961

enebo · 2019-11-06T20:08:20Z

Rails uses unpack1 'm0' to decode base64 cookies. Recent work had improved the performance of unpack1 but I was somewhat disturbed that MRI did so much better once the decode string got larger. This PR fast paths m0 and will also use a version which does not use ByteBuffer.

This benchmark looks at some larger strings (note: largest one is a bit bigger than the max size of a cookie but it is close enough to 4096 to get an idea of the cost at that size:

MRI 2.7
Warming up --------------------------------------
  unpack1 m len: 288     2.089k i/100ms
  unpack1 m len: 576     1.330k i/100ms
 unpack1 m len: 1152   858.000  i/100ms
 unpack1 m len: 2304   469.000  i/100ms
 unpack1 m len: 4608   249.000  i/100ms
Calculating -------------------------------------
  unpack1 m len: 288     22.723k (± 1.8%) i/s -    114.895k in   5.057910s
  unpack1 m len: 576     14.569k (± 2.3%) i/s -     73.150k in   5.023969s
 unpack1 m len: 1152      8.632k (± 2.2%) i/s -     43.758k in   5.071798s
 unpack1 m len: 2304      4.716k (± 2.4%) i/s -     23.919k in   5.074664s
 unpack1 m len: 4608      2.508k (± 2.5%) i/s -     12.699k in   5.066543s

9.2.9.0
Warming up --------------------------------------
  unpack1 m len: 288     1.168k i/100ms
  unpack1 m len: 576   633.000  i/100ms
 unpack1 m len: 1152   315.000  i/100ms
 unpack1 m len: 2304   162.000  i/100ms
 unpack1 m len: 4608    83.000  i/100ms
Calculating -------------------------------------
  unpack1 m len: 288     12.171k (± 2.3%) i/s -     61.904k in   5.089356s
  unpack1 m len: 576      6.316k (± 2.5%) i/s -     31.650k in   5.014613s
 unpack1 m len: 1152      3.282k (± 1.6%) i/s -     16.695k in   5.088612s
 unpack1 m len: 2304      1.657k (± 1.5%) i/s -      8.424k in   5.085228s
 unpack1 m len: 4608    841.896  (± 1.5%) i/s -      4.233k in   5.029165s

after
Warming up --------------------------------------
  unpack1 m len: 288     1.895k i/100ms
  unpack1 m len: 576     1.057k i/100ms
 unpack1 m len: 1152   534.000  i/100ms
 unpack1 m len: 2304   274.000  i/100ms
 unpack1 m len: 4608   142.000  i/100ms
Calculating -------------------------------------
  unpack1 m len: 288     19.856k (± 2.8%) i/s -    100.435k in   5.062359s
  unpack1 m len: 576     10.702k (± 2.8%) i/s -     53.907k in   5.041445s
 unpack1 m len: 1152      5.448k (± 1.9%) i/s -     27.234k in   5.001111s
 unpack1 m len: 2304      2.735k (± 3.1%) i/s -     13.700k in   5.014877s
 unpack1 m len: 4608      1.413k (± 3.3%) i/s -      7.100k in   5.032026s

So strangely we do not really fully catch on these larger base64 strings but the new version is quite a bit faster. I really wonder if there is a way to tweak this new version of unpack that can eliminate bounds checking. Nonetheless, these larger sizes are not what I am seeing in a simple Rails app. In that case I took the 4 decodes which happen and benched those:

system ~/work/jruby master 1328% mri27 ../snippets/unpack1.rb 
Warming up --------------------------------------
  unpack1 m len: 288     2.235k i/100ms
   unpack1 m len: 16     7.086k i/100ms
   unpack1 m len: 24     6.759k i/100ms
  unpack1 m len: 148     2.846k i/100ms
Calculating -------------------------------------
  unpack1 m len: 288     21.187k (± 7.2%) i/s -    107.280k in   5.092817s
   unpack1 m len: 16     72.179k (± 3.0%) i/s -    361.386k in   5.011767s
   unpack1 m len: 24     68.537k (± 2.1%) i/s -    344.709k in   5.031934s
  unpack1 m len: 148     32.150k (± 2.9%) i/s -    162.222k in   5.050202s

9.2.9.0
Warming up --------------------------------------
  unpack1 m len: 288     1.187k i/100ms
   unpack1 m len: 16     8.225k i/100ms
   unpack1 m len: 24     6.841k i/100ms
  unpack1 m len: 148     1.906k i/100ms
Calculating -------------------------------------
  unpack1 m len: 288     10.701k (± 1.9%) i/s -     54.602k in   5.104411s
   unpack1 m len: 16     83.846k (± 1.9%) i/s -    419.475k in   5.004838s
   unpack1 m len: 24     71.789k (± 3.4%) i/s -    362.573k in   5.056873s
  unpack1 m len: 148     19.899k (± 2.5%) i/s -    101.018k in   5.079880s

after
Warming up --------------------------------------
  unpack1 m len: 288     1.801k i/100ms
   unpack1 m len: 16     7.937k i/100ms
   unpack1 m len: 24     6.929k i/100ms
  unpack1 m len: 148     3.267k i/100ms
Calculating -------------------------------------
  unpack1 m len: 288     17.733k (± 6.9%) i/s -     90.050k in   5.102682s
   unpack1 m len: 16     98.730k (± 6.6%) i/s -    492.094k in   5.009752s
   unpack1 m len: 24     94.109k (± 2.5%) i/s -    471.172k in   5.009911s
  unpack1 m len: 148     33.853k (± 2.8%) i/s -    169.884k in   5.022907s

(note: indy makes these numbers a little bit better but the vast bulk of performance is all in Java)

We can see here that we are winning in 3 out of 4 unpacks and we are now close on the 4th.

My long term view is that ByteBuffer is impairing our unpack performance and when we rewrite unpack impls we should move to all byte[] versions. This fast path probably will not be needed once that is done.

I would like to understand why we are not catching perf with MRI here for longer strings but this is a positive step forward.

A really wildly unrealistic but not difficult idea would be to split the decode string and process pieces across multiple threads.

enebo · 2019-11-06T20:09:24Z

LOL...extra commit in there for a missing add-opens as a "special" bonus unrelated to this PR :)

enebo · 2019-11-06T20:18:02Z

For fun here is 9.2.7.0 before some work went into unpack:

Warming up --------------------------------------
  unpack1 m len: 288   577.000  i/100ms
  unpack1 m len: 576   384.000  i/100ms
 unpack1 m len: 1152   257.000  i/100ms
 unpack1 m len: 2304   146.000  i/100ms
 unpack1 m len: 4608    78.000  i/100ms
Calculating -------------------------------------
  unpack1 m len: 288      6.557k (± 3.4%) i/s -     32.889k in   5.022262s
  unpack1 m len: 576      4.345k (± 3.7%) i/s -     21.888k in   5.045440s
 unpack1 m len: 1152      2.651k (± 2.6%) i/s -     13.364k in   5.045537s
 unpack1 m len: 2304      1.467k (± 2.4%) i/s -      7.446k in   5.078940s
 unpack1 m len: 4608    775.922  (± 2.4%) i/s -      3.900k in   5.029380s
system ~/work/jruby (9.2.7.0) 1405% jruby ../snippets/unpack1.rb 
Warming up --------------------------------------
  unpack1 m len: 288   605.000  i/100ms
   unpack1 m len: 16     1.108k i/100ms
   unpack1 m len: 24     1.074k i/100ms
  unpack1 m len: 148   767.000  i/100ms
  unpack1 m len: 288   598.000  i/100ms
Calculating -------------------------------------
  unpack1 m len: 288      5.986k (± 2.1%) i/s -     30.250k in   5.056054s
   unpack1 m len: 16     11.218k (± 2.1%) i/s -     56.508k in   5.039890s
   unpack1 m len: 24     10.705k (± 6.5%) i/s -     53.700k in   5.042451s
  unpack1 m len: 148      7.779k (± 3.8%) i/s -     39.117k in   5.036783s
  unpack1 m len: 288      6.053k (± 1.7%) i/s -     30.498k in   5.039550s

…will be positive 7bit values.

enebo added 2 commits November 6, 2019 10:40

add missing module open (noticed running spec:ruby:fast)

6ff7989

Fast path m0 unpack as it is used for cookie decoding in Rails

24abca7

enebo added this to the JRuby 9.2.10.0 milestone Nov 6, 2019

enebo added 2 commits November 6, 2019 15:16

Missing too much string to strict unpack check

e7edfa4

Fix last error on CI? And also remove unneeded &255 since all values …

60298fb

…will be positive 7bit values.

enebo merged commit c56caa3 into jruby:master Nov 7, 2019

ahorek mentioned this pull request Nov 28, 2019

fix strict_base64 for shared strings #5985

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster unpackm0 #5961

Faster unpackm0 #5961

enebo commented Nov 6, 2019

enebo commented Nov 6, 2019

enebo commented Nov 6, 2019

Faster unpackm0 #5961

Faster unpackm0 #5961

Conversation

enebo commented Nov 6, 2019

enebo commented Nov 6, 2019

enebo commented Nov 6, 2019