Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Even JRuby 9.2.8.0 terribly slow #5857

Open
rovf opened this issue Sep 3, 2019 · 20 comments

Comments

@rovf
Copy link

commented Sep 3, 2019

Environment

Provide at least:
JRuby 9.2.8.0, Java 1.8, Windows 10

When stepping up from JRuby 1.7.27 to 9.2.8, I hoped to see some performance increasements, but on contrary, 9.2.8 turned out to be slower than 1.7.27:

One CPU-bound application, which runs constantly in about 31 minutes under 1.7, takes 36 minutes with 9.2. I am aware that I have to take into account a longer warm-up time with JRuby 9.x, but I don't think that this will have a noticeable influence, if the total running time is more than half an hour.

In a different application, which is using the caller method frequently in its logging package, I had expected JRuby 9.2 to be much faster, because it is said that this has improved the speed of this method. What I see instead, is:

  • Running in JRuby 1.7.27 WITHOUT using caller(): 90 Seconds
  • Running in JRuby 1.7.27 WITH using caller in many places, but otherwise processing the same data: 29 minutes (this shows that caller is really a bottleneck)
  • Running in JRuby 9.2.8 WITHOUT caller: 2 minutes
  • Running in JRuby 9.2.8 WITH logging (i.e. caller): 31 minutes

I also considered the heap size (-J-Xmx4g), but monitoring the applications shows that both actually use less than 70% of the heap.

I also tried to enable frozen string literals (because we have many string literals) and "invoke dynamic", but both did not have any noticeable effect.

Any idea why the new JRuby is so much slower than the old one?

@enebo

This comment has been minimized.

Copy link
Member

commented Sep 4, 2019

@rovf so the bulk of your second application essentially is time to perform caller with 90s-120s (depending on version) doing things which is not caller? If so the next step would be understanding what the non-caller part of the application is doing. We have generally beat performance in 9.x over 1.7.x for any reasonably warmed up app. Our main issue for 9.x has been short-lived scripts which take extra time.

Since you are on windows that also makes me wonder about IO. In 1.7 we used pure-Java IO but in 9 we switched to native IO for all but Windows. It is possible we have introduced some extra overhead on IO for windows perhaps (guessing)?

In both apps perhaps there is something in common which explains why both lose to 1.7?

@rovf

This comment has been minimized.

Copy link
Author

commented Sep 5, 2019

Yes. We have a logging module which produces among the data to be written to a log file, also the name of the calling method, the file nam, and the line number, and if this package is active, execution time in JRuby 1.7 went upt from 90 s to 29 min, and in JRuby 9 from 120 s to 31 min. The non-caller part really does not matter: For tests, I removed invocation of caller, printing nothing instead, and things went really fast. It was indeed caller eating most of the time. I understood this for JRuby 1.7, because it was documented that caller was slow, but I hoped improvement on JRuby 9.

I also suspected Windows-I/O. That's why I did the first test with most I/O turned off. Actually, the task is initially reading the input (which takes less than 1 minute), then processes the data. In a normal usage, it would also produce various reports, but since I can configure which reports to write, I simply told it to not write any. Hence most of the time was spending without IO.

Both apps were similar in task, only that in the second one, I used much less input, because I wanted to do the test with logging, and logging a large set of input data is not feasible, as the app would run for days.

Other characteristics of the app, which I find worth mentioning: They create a huge amount of small objects (certainly several millions just for storing the input data in internal form), many do ... end blocks (often accessing variables from outer scopes, so maybe closures must be constructed), and plenty of applications of eval, where a typical one looks like this: eval(s) where s is something like _["x"] && (_["y"] || !_["z"]), and _ has been set to hash in the method surrounding the eval.

@headius

This comment has been minimized.

Copy link
Member

commented Sep 5, 2019

Ok wow, that's a severe degradation. Something is definitely broken.

The first thing I would do would be to run JRuby using the JVM sampling profiler. That flag is --sample, may need to be in env JRUBY_OPTS depending on how you are launching. It will print out an aggregate profile on exit.

If caller is indeed heavily hit, we will need to find a better way. Running on Java 9 or higher might help if you don't need the whole stack, but this is a very heavy operation.

@headius

This comment has been minimized.

Copy link
Member

commented Sep 5, 2019

FWIW caller may be a bit slower because our total JVM stack depth may be larger in JRuby 9k, and the overhead is linear with number of stack frames.

If you don't need the whole stack, caller can be invoked differently (pass a second argument for number of frames you want) or we can pass a JVM flag to limit how deep the stack trace can get.

@rovf

This comment has been minimized.

Copy link
Author

commented Sep 5, 2019

Now I understand Enobo's caller question better. We nearly always invoke caller with giving an explicit stack frame number - typically 2 or 3, in rare cases as high as 5. Do you still need a sampling profile? How can I send it to you?

@enebo

This comment has been minimized.

Copy link
Member

commented Sep 5, 2019

@rovf two things which would be cool to try:

  1. Java 11 (9+ should work though) with application. caller should be noticeably cheaper
  2. Hack IO to use 'b' (e.g. 'rb' or 'wb'). When full transcoding came crlf handling ended up getting more expensive on the non-native side.
@headius

This comment has been minimized.

Copy link
Member

commented Sep 5, 2019

@rovf One thing I want to clarify...the parameter I'm talking about is the second param to caller, which specifies how many frames to return. The first parameter just says how many frames back (from current) to start gathering.

As an example here's a benchmark that recurses 50 times and then requests caller(1) versus caller(1,1). The second form isn't available on JRuby 1.7, but you can see how the two-argument form of caller improves dramatically for JRuby 9k on Java 11.

require 'benchmark/ips'

def recurse1(n)
  n == 0 ? caller(1) : recurse1(n - 1)
end

def recurse2(n)
  n == 0 ? caller(1, 1) : recurse2(n - 1)
end

Benchmark.ips do |bm|
  bm.report("no length") { recurse1(50) }
  bm.report("w/ length") { recurse2(50) }
end

Results:

[] ~/projects/jruby $ pickjdk 3 ; jruby bench_caller.rb 
New JDK: jdk-08-u202-b08-hotspot
Warming up --------------------------------------
           no length   326.000  i/100ms
           w/ length   451.000  i/100ms
Calculating -------------------------------------
           no length      4.075k (± 4.5%) i/s -     20.538k in   5.051646s
           w/ length      4.794k (± 1.4%) i/s -     24.354k in   5.080983s

[] ~/projects/jruby $ pickjdk 7 ; jruby bench_caller.rb 
New JDK: jdk-11.0.2+9-hotspot
Warming up --------------------------------------
           no length    94.000  i/100ms
           w/ length     2.511k i/100ms
Calculating -------------------------------------
           no length      1.168k (± 2.3%) i/s -      5.922k in   5.071694s
           w/ length     26.588k (± 2.0%) i/s -    133.083k in   5.007422s

[] ~/projects/jruby $ pickjdk 3 ; rvm jruby-1.7.27 do jruby bench_caller.rb 
New JDK: jdk-08-u202-b08-hotspot
Warming up --------------------------------------
           no length   416.000  i/100ms
Calculating -------------------------------------
           no length      4.512k (± 3.4%) i/s -     22.880k

The degradation for the single-argument form is probably due to the new abstraction I added around stack-gathering; I have ways to reduce the overhead of that abstraction, but ultimately the single-arg form will always be extremely heavy-weight.

@rovf

This comment has been minimized.

Copy link
Author

commented Sep 6, 2019

I see! I was not aware of the two-argument form of caller, because out main platform is JRuby 1.7, an we are evaluating whether it is worth to upgrade to JRuby 9.x.

Actually, our logging package uses caller without arguments, in the form caller()[n], with n specifying how many stackframes to go back. I now realize that this, of course, has to first fetch everything, no matter which frame I need. It also means that the second parameter in a JRuby 9.x environment is empty. I guess that this means that this parameter is implicitly taken to be 1.
Aside from this, I think it might be worth to change caller()[n] into caller(n) in any case.

@rovf

This comment has been minimized.

Copy link
Author

commented Sep 6, 2019

I just notice that the idea to use caller(n) instead of caller()[n] does not really work, because caller(n) does not return a single frame, but everything from the n-th frame. Therefore I would have to write caller(n),first in JRuby 1.7, and I could try caller(n,1) in JRuby 9.x if this would yield better performance, although it degrades maintenance to have two versions of the code, in dependence of the JRuby version.

@rovf

This comment has been minimized.

Copy link
Author

commented Sep 6, 2019

@rovf two things which would be cool to try:

1. Java 11 (9+ should work though) with application.  caller should be noticeably cheaper

I find interesting the benchmarks posted here by headius: While caller indeed works best if the two-argument form is used and we run on jdk 11, the old JRuby 1.7 (which has only 1-element form of caller) is still faster then the one-element-caller in JRuby 9 on jdk 9, and about the same speed as the two-element-caller on that platform. Since there is no way to use jdk 11 in our project, and it affects "only" the logging package, I think we have to live with long logging times.

What is still an issue is that longer execution times of JRuby 9 vs. 1.7 for the "normal" tasks which do not involve I/O. I wonder in what part JRuby 9 has improved and how I could make use of it, to get shorter running times....

@headius

This comment has been minimized.

Copy link
Member

commented Sep 9, 2019

I think we can eliminate the difference in full stack caller.

The gain from Java 9+ logic generally only shows up for deeper stacks...note that mine recurses 50 times before testing. We may need to fall back on full trace when the stack is short because this new mechanism has more overhead per-frame than an all at once stack.

@headius

This comment has been minimized.

Copy link
Member

commented Sep 11, 2019

Updated version of benchmark and some improvements:

require 'benchmark/ips'

def recurse(n, &b)
  n == 0 ? yield : recurse(n - 1, &b)
end

recurse(50) do
  Benchmark.ips do |bm|
    bm.report("no length") { caller(1) }
    bm.report("w/ length") { caller(1, 1) } if caller(1, 1) rescue false
  end
end

I have a patch locally that will always use the old style stack trace logic when no "length" parameter is passed. This skips using the Java 9+ StackWalker API and gets us closer to Java 8 performance for this case.

[] ~/projects/jruby $ (pickjdk 3 ; jruby bench_caller.rb)
New JDK: jdk-08-u202-b08-hotspot
Warming up --------------------------------------
           no length   375.000  i/100ms
           w/ length   462.000  i/100ms
Calculating -------------------------------------
           no length      4.136k (± 2.6%) i/s -     21.000k in   5.080767s
           w/ length      4.753k (± 2.1%) i/s -     24.024k in   5.056414s

[] ~/projects/jruby $ (pickjdk 7 ; jruby bench_caller.rb)
New JDK: jdk-11.0.2+9-hotspot
Warming up --------------------------------------
           no length   163.000  i/100ms
           w/ length     1.388k i/100ms
Calculating -------------------------------------
           no length      2.136k (± 2.8%) i/s -     10.758k in   5.040130s
           w/ length     15.850k (± 1.6%) i/s -     80.504k in   5.080391s

[] ~/projects/jruby $ (pickjdk 3 ; rvm jruby-1.7.27 do ruby bench_caller.rb)
New JDK: jdk-08-u202-b08-hotspot
Warming up --------------------------------------
           no length   413.000  i/100ms
Calculating -------------------------------------
           no length      4.557k (± 1.6%) i/s -     23.128k

This brings Java 8 + JRuby 9k full trace performance back up to JRuby 1.7.x and improves Java 11 + JRuby 9k full trace performance to about half that of Java 8.

I think that will resolve the caller portion of your bug but I'm unclear if there are performance issues outside of caller we need to resolve.

Can you clarify what remains slow in JRuby 9k, assuming we have fixed the full-stack caller performance?

headius added a commit to headius/jruby that referenced this issue Sep 11, 2019
Use Java 8 stack walker when full trace is requested.
This is faster than using the Java 9+ stack walker API to generate
a full trace and resolves at least part of jrubyGH-5857.
@headius

This comment has been minimized.

Copy link
Member

commented Sep 11, 2019

The stack trace performance issues on Java 11 may be resolves as of 11.0.3: https://bugs.openjdk.java.net/browse/JDK-8151751

I am in-flight and unable to download much, but I'll try to confirm this soon.

@headius

This comment has been minimized.

Copy link
Member

commented Sep 11, 2019

Ok, well this is good news on the caller front: Java 11.0.4 does indeed appear to have restored the performance of the full-stack getStackTrace logic. The StackWalker performance is still poor, however, so we'll use getStackTrace when a full trace is requested and consider options for partial traces (e.g. if you request a trace length more than N, where N is the tipping point for performance, we fall back on a full trace).

@enebo Thoughts on the partial traces? StackWalker is a clear win when inside a deep trace, but for a shallow trace getStackTrace will be faster. I'm not sure how we can detect that, though.

Here's the numbers on Java 11.0.2, 11.0.4, and 8 (with my patch):

[] ~/projects/jruby $ pickjdk 8 ; jruby -v bench_caller.rb
New JDK: jdk-11.0.2+9-hotspot
jruby 9.2.9.0-SNAPSHOT (2.5.3) 2019-09-11 f8afae4 OpenJDK 64-Bit Server VM 11.0.2+9 on 11.0.2+9 +jit [darwin-x86_64]
Warming up --------------------------------------
           no length   176.000  i/100ms
           w/ length     1.125k i/100ms
Calculating -------------------------------------
           no length      2.116k (± 3.5%) i/s -     10.560k in   4.998225s
           w/ length     14.195k (± 1.8%) i/s -     72.000k in   5.074148s

[] ~/projects/jruby $ pickjdk 1 ; jruby -v bench_caller.rb
New JDK: adoptopenjdk-11.jdk
jruby 9.2.9.0-SNAPSHOT (2.5.3) 2019-09-11 f8afae4 OpenJDK 64-Bit Server VM 11.0.4+11 on 11.0.4+11 +jit [darwin-x86_64]
Warming up --------------------------------------
           no length   365.000  i/100ms
           w/ length     1.840k i/100ms
Calculating -------------------------------------
           no length      4.987k (± 2.1%) i/s -     25.185k in   5.052564s
           w/ length     21.266k (± 1.3%) i/s -    106.720k in   5.019261s

[] ~/projects/jruby $ pickjdk 4 ; jruby -v bench_caller.rb
New JDK: jdk-08-u202-b08-hotspot
jruby 9.2.9.0-SNAPSHOT (2.5.3) 2019-09-11 f8afae4 OpenJDK 64-Bit Server VM 25.202-b08 on 1.8.0_202-b08 +jit [darwin-x86_64]
Warming up --------------------------------------
           no length   359.000  i/100ms
           w/ length   455.000  i/100ms
Calculating -------------------------------------
           no length      4.051k (± 3.3%) i/s -     20.463k in   5.056607s
           w/ length      4.651k (± 1.4%) i/s -     23.660k in   5.088465s
@headius

This comment has been minimized.

Copy link
Member

commented Sep 11, 2019

And here is Java 11.0.2, 11.0.4, and 8 using StackWalker to do the full ("no length") trace:

[] ~/projects/jruby $ pickjdk 8 ; jruby -v bench_caller.rb
New JDK: jdk-11.0.2+9-hotspot
jruby 9.2.9.0-SNAPSHOT (2.5.3) 2019-09-11 aed509f OpenJDK 64-Bit Server VM 11.0.2+9 on 11.0.2+9 +jit [darwin-x86_64]
Warming up --------------------------------------
           no length    83.000  i/100ms
           w/ length     1.277k i/100ms
Calculating -------------------------------------
           no length      1.102k (± 4.4%) i/s -      5.561k in   5.055783s
           w/ length     14.611k (± 2.4%) i/s -     74.066k in   5.072277s

[] ~/projects/jruby $ pickjdk 1 ; jruby -v bench_caller.rb
New JDK: adoptopenjdk-11.jdk
jruby 9.2.9.0-SNAPSHOT (2.5.3) 2019-09-11 aed509f OpenJDK 64-Bit Server VM 11.0.4+11 on 11.0.4+11 +jit [darwin-x86_64]
Warming up --------------------------------------
           no length   130.000  i/100ms
           w/ length     1.961k i/100ms
Calculating -------------------------------------
           no length      1.749k (± 2.4%) i/s -      8.840k in   5.055868s
           w/ length     21.604k (± 2.6%) i/s -    109.816k in   5.086689s

[] ~/projects/jruby $ pickjdk 4 ; jruby -v bench_caller.rb
New JDK: jdk-08-u202-b08-hotspot
jruby 9.2.9.0-SNAPSHOT (2.5.3) 2019-09-11 aed509f OpenJDK 64-Bit Server VM 25.202-b08 on 1.8.0_202-b08 +jit [darwin-x86_64]
Warming up --------------------------------------
           no length   338.000  i/100ms
           w/ length   455.000  i/100ms
Calculating -------------------------------------
           no length      3.971k (± 3.5%) i/s -     19.942k in   5.028479s
           w/ length      4.513k (± 3.8%) i/s -     22.750k in   5.050499s

Performance has improved from 11.0.2 to 11.0.4, but it's still well behind Java 8.

@headius

This comment has been minimized.

Copy link
Member

commented Sep 11, 2019

(Correction: the above results are only using StackWalker on 11)

@rovf

This comment has been minimized.

Copy link
Author

commented Sep 11, 2019

Can you clarify what remains slow in JRuby 9k, assuming we have fixed the full-stack caller performance?

For the time being, I can not pinpoint a single issue. I just see that the (pretty huge and complex) application as a whole does not perform that well, but I don't know yet whether there are special heavily used language features which makes it slower.

@headius

This comment has been minimized.

Copy link
Member

commented Sep 11, 2019

Your original report only talks about the performance of caller, which I believe should be fixed with my pull request. That is unrelated to any other performance things we've discussed.

In #5857 (comment) you mention looking at a couple other things:

  • Windows IO performance. This may indeed be slower than 1.7 for your case, but it is unrelated to the performance of caller. This is due to the binmode/textmode changes in JRuby 9k. We have not fixed this yet; forcing the IO to use binary mode works around it but does not normalize line endings.
  • Many small objects and various Ruby language constructs in use:

Other characteristics of the app, which I find worth mentioning: They create a huge amount of small objects (certainly several millions just for storing the input data in internal form), many do ... end blocks (often accessing variables from outer scopes, so maybe closures must be constructed), and plenty of applications of eval, where a typical one looks like this: eval(s) where s is something like ["x"] && (["y"] || !_["z"]), and _ has been set to hash in the method surrounding the eval.

Addressing these items individually:

  • do/end, closures, captured variables: this should be faster if anything. The new JRuby 9k compiler is better about eliminating state for such cases. If you can pin down any specific subset of code that's slower, I would very much like to see it.
  • lots of evals at runtime: this is likely to be slower, due to the extra overhead of compiling to our internal representation before beginning execution. We have not spent much time optimizing eval because it's generally discouraged to do it in hot-path code at runtime. I would like to understand why you need to use eval in hot-path code at runtime.

So I think that's where we stand. A next step would be to run profiling (I saw and closed your --sample issue) so we can get an idea of where the time is going. However if you profile with heavy caller use, that's going to massively dominate the profile results and we won't get much actionable data. Please try to profile without caller in play.

@rovf

This comment has been minimized.

Copy link
Author

commented Sep 11, 2019

Just a short comment to answer your questions, because, as you said, they don't really belong to the 'caller' theme:

We don't need to use eval and currently have it in our code for performance reason. We have strings in our applications, which are syntactically a (very restricted) subset of Ruby expressions. I also wrote my own evaluator, and used eval for comparision, and it turned out that eval was faster. I will redo these benchmarks with JRuby 9, maybe the result is differntly.

Aside from this, I do my measurements with logging (i.e. caller) turned off, knowing well that this would dominate the overall timing. Still, for the whole application, JRuby 9 is slower for us, but I understand that we need to find certain constructs in order to investigate it.

Unfortunately, as I wrote in the other thread which you have closed, I can't get profiling to work :-(

@headius

This comment has been minimized.

Copy link
Member

commented Sep 11, 2019

@rovf Ok thank you for clarifying!

Regarding eval... it does indeed appear to be slower, though it's not actually as much as I thought:

[] ~/projects/jruby $ jruby -e "loop { t = Time.now; i = 0; while i < 100_000; eval 'a = 1; b = 2; c = 1 + 2; d = c * 5'; i+=1; end; puts Time.now - t }"
1.503821
0.775883
0.737421
0.75464
0.738151
0.726151
0.7366469999999999
0.698106
0.7074779999999999
0.7104969999999999
0.6973309999999999
0.7078979999999999
0.708916
^C
[] ~/projects/jruby $ rvm jruby-1.7.27 do jruby -e "loop { t = Time.now; i = 0; while i < 100_000; eval 'a = 1; b = 2; c = 1 + 2; d = c * 5'; i+=1; end; puts Time.now - t }"
0.72
0.477
0.447
0.46
0.471
0.45
0.448
0.453
^C

I will reopen the sampling issue since it seems there's still some issues getting it to work with your app.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.