[perf] Integer#chr(Encoding::UTF_8) is considerably slower than CRuby #2316

postmodern · 2021-04-07T18:26:30Z

I discovered this performance issue when debugging an issue with calling Integer#chr on invalid multi-bytes UTF-8 words taking considerably longer on TruffleRuby than CRuby. The issue does not seem to be related to raising/rescuing the RangeError exception.

Example

require 'benchmark'

BYTES = (0..255).to_a * (1024 * 100)
WORDS = BYTES.each_slice(4).map do |bytes|
  (bytes[0] << 24) | (bytes[1] << 16) | (bytes[2] << 8) | bytes[3]
end

class Integer
  def bad_chr
    raise(RangeError,"test")
  end
end

N = WORDS.length

Benchmark.bm(40) do |b|
  b.report("Integer#chr(Encoding::UTF_8) (N=#{N})") do
    WORDS.each do |word|
      begin
        word.chr(Encoding::UTF_8)
      rescue RangeError
      end
    end
  end

  b.report("raise/rescue RangeError (N=#{N})") do
    N.times do |word|
      begin
        word.bad_chr
      rescue RangeError
      end
    end
  end
end

ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]

                                               user     system      total        real
Integer#chr(Encoding::UTF_8) (N=6553600)   7.705194   0.000000   7.705194 (  7.758757)
raise/rescue RangeError (N=6553600)        5.416638   0.000000   5.416638 (  5.441655)

truffleruby 21.0.0, like ruby 2.7.2, GraalVM CE Native [x86_64-linux]

                                               user     system      total        real
Integer#chr(Encoding::UTF_8) (N=6553600)  26.992242   0.047611  27.039853 ( 26.941786)
raise/rescue RangeError (N=6553600)        0.370205   0.038572   0.408777 (  0.176519)

The text was updated successfully, but these errors were encountered:

headius · 2021-04-07T18:31:03Z

I would be willing to bet the issue is the raise of RangeError. As I noted in jruby/jruby#6652 (comment), your raise benchmark here is eliminating the backtrace for the RangeError. I do not believe this is valid behavior, since between the call to chr or bad_chr and the rescue, the RangeError exception might have been removed, and the const_missing that would result should be able to see the backtrace.

JRuby can eliminate the backtrace but only in the case of a "simple" rescue that does not do side-effecty things like looking up constants. If you were to do a bare rescue, both cases would be fast on JRuby.

headius · 2021-04-07T19:00:53Z

A modified version of your bench that only does the direct raise, with and without the UNPRINTABLE constant lookup, shows that it is indeed the exception handling (edit: the backtrace generation), with the backtrace only eliminated along the non-UNPRINTABLE path:

require 'benchmark/ips'

UNPRINTABLE = ??

class Integer
  def bad_chr
    raise(RangeError,"test")
  end
end


Benchmark.ips do |b|
  b.report("RangeError with const lookup") do
      begin
        1.bad_chr
      rescue RangeError
        UNPRINTABLE
      end
  end

  b.report("RangeError no const lookup") do
      begin
        1.bad_chr
      rescue RangeError
      end
  end
end

Warming up --------------------------------------
RangeError with const lookup
                       681.000  i/100ms
RangeError no const lookup
                       141.587M i/100ms
Calculating -------------------------------------
RangeError with const lookup
                          7.064k (± 5.6%) i/s -     35.412k in   5.031707s
RangeError no const lookup
                          1.610B (± 4.9%) i/s -      8.070B in   5.028192s

chrisseaton · 2021-04-07T19:13:15Z

This second case is really a no-op:

  begin
    raise RangeError
  rescue RangeError
  end

We can see at compile-time that the class of the exception is the same as the class in the rescue clause, so it becomes basically like this:

  goto foo
foo:

Then the goto is removed because it just jumps to the next line anyway. We do actually have a few 'guards' (checks) left in the code, but it's still just straight-line pure code with no operations - the key thing is there's one red line in this diagram!

Your code has a method between the raise and rescue, but that of course gets inlined so it doesn't make any practical difference.

The second case should be the same thing, right? String#chr raises, and the rescue catches it, both of the same class, so what's the problem? I'm afraid it's just a case of this code-path not being quite as nicely inlined yet. The Ruby code is all inlined:

[engine] Inline start Object#foo                                                  |call diff     0.00|Recursion Depth      0|Explore/inline ratio     1.00|IR Nodes    756|Frequency     1.00|Truffle Callees      2|Forced false|Depth      0
[engine] Inlined        Integer#chr                                               |call diff    -5.00|Recursion Depth      0|Explore/inline ratio      NaN|IR Nodes    273|Frequency     1.00|Truffle Callees      3|Forced false|Depth      1
[engine] Inlined          BasicObject#!= <split-7f8b39f8>                         |call diff    -2.00|Recursion Depth      0|Explore/inline ratio      NaN|IR Nodes     68|Frequency     1.00|Truffle Callees      1|Forced false|Depth      2
[engine] Inlined            Integer#==                                            |call diff    -1.00|Recursion Depth      0|Explore/inline ratio      NaN|IR Nodes     67|Frequency     1.00|Truffle Callees      0|Forced false|Depth      3
[engine] Inlined          Truffle::Type.coerce_to_encoding                        |call diff    -1.00|Recursion Depth      0|Explore/inline ratio      NaN|IR Nodes     37|Frequency     1.00|Truffle Callees      0|Forced false|Depth      2
[engine] Inlined          String.from_codepoint                                   |call diff    -1.00|Recursion Depth      0|Explore/inline ratio      NaN|IR Nodes    186|Frequency     1.00|Truffle Callees      0|Forced false|Depth      2
[engine] Inlined        Module#===                                                |call diff    -0.98|Recursion Depth      0|Explore/inline ratio      NaN|IR Nodes     75|Frequency     0.98|Truffle Callees      0|Forced false|Depth      1
[engine] Inline done  Object#foo                                                  |call diff     0.00|Recursion Depth      0|Explore/inline ratio     1.00|IR Nodes    756|Frequency     1.00|Truffle Callees      2|Forced false|Depth      0

But I'm afraid we actually manually not inline the key primitive at the end of the chain at the moment:

truffleruby/src/main/java/org/truffleruby/core/string/StringNodes.java

Lines 3980 to 4024 in d37fbf7

    
               @TruffleBoundary 
        
               @Specialization(guards = { "!isSimple(code, rubyEncoding)", "isCodepoint(code)" }) 
        
               protected RubyString stringFromCodepoint(long code, RubyEncoding rubyEncoding, 
        
                       @Cached RopeNodes.CalculateCharacterLengthNode calculateCharacterLengthNode) { 
        
                   final Encoding encoding = rubyEncoding.encoding; 
        
                   final int length; 
        
                   try { 
        
                       length = encoding.codeToMbcLength((int) code); 
        
                   } catch (EncodingException e) { 
        
                       throw new RaiseException(getContext(), coreExceptions().rangeError(code, rubyEncoding, this)); 
        
                   } 
        
                   if (length <= 0) { 
        
                       throw new RaiseException(getContext(), coreExceptions().rangeError(code, rubyEncoding, this)); 
        
                   } 
        
                   final byte[] bytes = new byte[length]; 
        
                   final int codeToMbc = encoding.codeToMbc((int) code, bytes, 0); 
        
                   if (codeToMbc < 0) { 
        
                       throw new RaiseException(getContext(), coreExceptions().rangeError(code, rubyEncoding, this)); 
        
                   } 
        
                   final Bytes bytesObject = new Bytes(bytes, 0, length); 
        
                   if (calculateCharacterLengthNode.characterLength(encoding, CR_UNKNOWN, bytesObject) != length) { 
        
                       throw new RaiseException(getContext(), coreExceptions().rangeError(code, rubyEncoding, this)); 
        
                   } 
        
                   return makeStringNode.executeMake(bytes, encoding, CodeRange.CR_VALID); 
        
               } 
        
               protected boolean isCodepoint(long code) { 
        
                   // Fits in an unsigned int 
        
                   return code >= 0 && code < (1L << 32); 
        
               } 
        
               protected boolean isSimple(long code, RubyEncoding encoding) { 
        
                   final Encoding enc = encoding.encoding; 
        
                   return (enc.isAsciiCompatible() && code >= 0x00 && code < 0x80) || 
        
                           (enc == ASCIIEncoding.INSTANCE && code >= 0x00 && code <= 0xFF); 
        
               } 
        
           }

This is because we considered it for some reason to be a 'slow path' more unusual or !isSimple case. Maybe that's wrong and we should refine that?

We could probably fix that if you think this should be a fast operation? Possible we could just move the exception throwing outside the non-inlined method, even if the rest of the logic isn't inlined.

We may find after fixing that primitive to make sure it is inlined that we find something else isn't inlined that also interferes with turning the exception into a goto as we'd like to happen.

headius · 2021-04-07T19:20:29Z

@chrisseaton But the RangeError constant might be removed in another thread. How does your optimization handle that case, when the rescue exception lookup would call into a const_missing that might need to see the backtrace?

chrisseaton · 2021-04-07T19:25:41Z

But the RangeError constant might be removed in another thread

If another thread wants to redefine a constant which this machine code uses, such as RangeError, it must first stop all threads using this machine code, and transfer them to the interpreter where the constant is re-checked every time.

As there are no side effects between the raise and rescue, and no significant computation either, no opportunity is provided for that interruption to happen between them and it is not possible to have created the exception and not yet have rescued it - the raise and rescue become atomic and any redefinition must happen either before or after the rescue.

If another thread repeatedly redefines the constant, we will stop applying this optimisation for this constant so that we do not keep deoptimising.

headius · 2021-04-07T19:35:20Z

As there are no side effects between the raise and rescue, and no significant computation either, no opportunity is provided for that interruption to happen between them

@chrisseaton This would be a bug (or at least behavior inconsistent with CRuby), would it not? Were this constant being checked every time, the update would be visible between the raise and the rescue, and the backtrace would need to be in hand before proceeding to the constant lookup. You are optimizing it assuming that the exception constant can't be removed between the raise and the rescue, but that could happen on CRuby.

chrisseaton · 2021-04-07T19:40:47Z

This would be a bug (or at least behavior inconsistent with CRuby), would it not?

No, absolutely not.

What happens is that the thread which wants to redefine the constant has to wait before it is allowed to redefine the constant. Conceptually, it isn't allowed to redefine when the other thread is in between this rescue and raise pair. This is because there is no safepoint between the raise and rescue, in the same way that there is no safepoint 'between' the multiple calls to #+ 14 + 2 + 3 when you have constant folded that to 16, which is I guess an optimisation that JRuby does.

the update would be visible between the raise and the rescue

How would you trigger the update so that it must appear between the raise and rescue? What causality rule are you using for that? Show me an interleaving.

chrisseaton · 2021-04-07T19:42:19Z

@postmodern I think I fixed it by moving the exception raise outside the non-inlined code. Is this what you wanted to see? The baseline is slower because I'm running a development build on the JVM, but the point is I reduced the exception cost.

                                               user     system      total        real
Integer#chr(Encoding::UTF_8) (N=6553600)   9.825050   0.234797  10.059847 (  7.179656)
raise/rescue RangeError (N=6553600)        3.713082   0.050273   3.763355 (  0.602319)

diff --git a/src/main/java/org/truffleruby/core/exception/CoreExceptions.java b/src/main/java/org/truffleruby/core/exception/CoreExceptions.java
index 4312c5f340..30060eade3 100644
--- a/src/main/java/org/truffleruby/core/exception/CoreExceptions.java
+++ b/src/main/java/org/truffleruby/core/exception/CoreExceptions.java
@@ -1034,10 +1034,15 @@ public class CoreExceptions {
     @TruffleBoundary
     public RubyException rangeError(long code, RubyEncoding encoding, Node currentNode) {
         return rangeError(
-                StringUtils.format("invalid codepoint %x in %s", code, encoding.encoding),
+                rangeErrorMessage(code, encoding.encoding),
                 currentNode);
     }
 
+    @TruffleBoundary
+    private String rangeErrorMessage(long code, Encoding encoding) {
+        return StringUtils.format("invalid codepoint %x in %s", code, encoding);
+    }
+
     @TruffleBoundary
     public RubyException rangeError(RubyIntRange range, Node currentNode) {
         return rangeError(StringUtils.format(
diff --git a/src/main/java/org/truffleruby/core/string/StringNodes.java b/src/main/java/org/truffleruby/core/string/StringNodes.java
index 907c13b9fd..6cf699daea 100644
--- a/src/main/java/org/truffleruby/core/string/StringNodes.java
+++ b/src/main/java/org/truffleruby/core/string/StringNodes.java
@@ -3977,26 +3977,18 @@ public abstract class StringNodes {
             return makeStringNode.fromRope(rope);
         }
 
-        @TruffleBoundary
         @Specialization(guards = { "!isSimple(code, rubyEncoding)", "isCodepoint(code)" })
         protected RubyString stringFromCodepoint(long code, RubyEncoding rubyEncoding,
                 @Cached RopeNodes.CalculateCharacterLengthNode calculateCharacterLengthNode) {
             final Encoding encoding = rubyEncoding.encoding;
-            final int length;
-
-            try {
-                length = encoding.codeToMbcLength((int) code);
-            } catch (EncodingException e) {
-                throw new RaiseException(getContext(), coreExceptions().rangeError(code, rubyEncoding, this));
-            }
 
+            final int length = codeToMbcLength(encoding, (int) code);
             if (length <= 0) {
                 throw new RaiseException(getContext(), coreExceptions().rangeError(code, rubyEncoding, this));
             }
 
             final byte[] bytes = new byte[length];
-
-            final int codeToMbc = encoding.codeToMbc((int) code, bytes, 0);
+            final int codeToMbc = codeToMbc(encoding, (int) code, bytes, 0);
             if (codeToMbc < 0) {
                 throw new RaiseException(getContext(), coreExceptions().rangeError(code, rubyEncoding, this));
             }
@@ -4009,6 +4001,20 @@ public abstract class StringNodes {
             return makeStringNode.executeMake(bytes, encoding, CodeRange.CR_VALID);
         }
 
+        @TruffleBoundary
+        private int codeToMbcLength(Encoding encoding, int code) {
+            try {
+                return encoding.codeToMbcLength(code);
+            } catch (EncodingException e) {
+                return -1;
+            }
+        }
+
+        @TruffleBoundary
+        private int codeToMbc(Encoding encoding, int code, byte[] bytes, int p) {
+            return encoding.codeToMbc(code, bytes, p);
+        }
+
         protected boolean isCodepoint(long code) {
             // Fits in an unsigned int
             return code >= 0 && code < (1L << 32);

eregon · 2021-04-07T19:49:09Z

@chrisseaton That looks good, could you make a PR with that? :)

headius · 2021-04-07T19:51:28Z

I will preface this by saying that I do not believe the update necessarily should be visible between the raise and the rescue, but a strict interpretation of rescue behavior implies that this is so.

How would you trigger the update so that it must appear between the raise and rescue?

The behavior of rescue with a constant is that it performs that constant lookup at the time of the rescue and invokes === against it with the incoming exception. By that definition, an update of the constant after the raise and before the rescue most definitely should be visible.

You are making the window between an inlined raise and rescue into a critical section that prevents another thread from performing its constant update until after the window has closed. That is arguably an OK optimization but it does not fit a strict interpretation of rescue semantics in the presence of threads.

chrisseaton · 2021-04-07T19:59:33Z

@postmodern this is the same configuration as you're running

                                               user     system      total        real
Integer#chr(Encoding::UTF_8) (N=6553600)  14.439912   0.034855  14.474767 ( 14.292199)
raise/rescue RangeError (N=6553600)        0.402699   0.004560   0.407259 (  0.127988)

So chr is still slower - it still actually has to do the encoding work! But we definitely knocked it down.

I do fundamentally agree with @headius on the point the ultimately it'd be better not to optimise away the exception and to use a return code instead for fast-path operations.

chrisseaton · 2021-04-07T20:01:56Z

an update of the constant after the raise and before the rescue

I would argue - how do you tell the difference between an update which by chance never happens to be observed between them but could be, and one that will never be observed between them?

You are making the window between an inlined raise and rescue into a critical section that prevents another thread from performing its constant update until after the window has closed.

Yes - exactly the same as when 1 + 2 + 3 is constant folded to 6 - you now can't interrupt after the first #+ and before the second. That window is limited by both side effects (never contains side effects) and by time (never contains unbounded computation.) I think that these practically mean the optimisation is not falsifiable - you can never prove that you didn't interrupt between them because of the optimisation, or you never interrupted between them just by natural chance.

headius · 2021-04-07T20:09:56Z

@chrisseaton A simpler way to say what I am saying:

In the unoptimized case, another thread's constant update could be visible between the raise and the rescue.
In the optimized case, it would not. That is a behavioral change.

Of course, that behavior change might be fine.

Your example of 1 + 2 + 3 suffers from the same issue... a change visible between adds in the interpreter would not be visible once the code is optimized.

I think part of my issue here is that CRuby can context switch potentially on any method call boundary, and potentially also on other cache boundaries like constants (I have not checked in a while). By sweeping away those boundaries you avoid unnecessary guards, but also change behavior.

chrisseaton · 2021-04-07T20:12:34Z

I can agree that you may be able to detect a statistical difference in the locations where interrupts are observed when optimisations like this are applied.

We do have this disclaimer in our compatibility claims.

https://github.com/oracle/truffleruby/blob/master/doc/user/compatibility.md#threads-detect-interrupts-at-different-points

In the past (I think Sidekiq?) we actually had a test failure because we were detecting them in more places than MRI. I've never see a test failure due to the other way around.

headius · 2021-04-07T20:18:13Z

@chrisseaton We have had issues filed in the past about our interrupt checks being too coarse-grained, but I think they were all in a long-lost bug tracker before we moved to Github. I don't recall that they were real-world examples... more like someone attempting to prove a weird concurrent edge case differed from CRuby.

This was during a time when we were the only parallel-executing Ruby and were still figuring out how fine-grained our interrupts needed to be.

In any case this is mostly academic since there's no predictable way to make the constant update in one thread and know another thread will see it mid-rescue without introducing other side-effecty code. We will expand JRuby's backtrace elimination logic to cover explicit rescues like this case.

@postmodern ...but we really should avoid the exception anyway since one false move and it will be back with a vengeance.

postmodern · 2021-04-07T20:43:29Z

@postmodern ...but we really should avoid the exception anyway since one false move and it will be back with a vengeance.

Yes, in the long-run it would be nice if there was a way to directly query an Encoding to ask if a multi-byte Integer is a valid word before attempting to call Integer#chr. I remember CRuby did add String#scrub as a way to workaround invalid byte sequences, so I imagine it must be possible to make an Integer#chr that doesn't raise an exception.

eregon · 2021-04-12T14:33:01Z

Fixed by #2318, the performance is then similar to CRuby for Integer#chr(Encoding::UTF_8).

eregon added the performance label Apr 7, 2021

chrisseaton mentioned this issue Apr 7, 2021

Shift boundaries for String#chr #2318

Merged

eregon assigned eregon and chrisseaton Apr 12, 2021

eregon added this to the 21.2.0 milestone Apr 12, 2021

eregon closed this as completed Apr 12, 2021

eregon linked a pull request Apr 13, 2021 that will close this issue

Shift boundaries for String#chr #2318

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[perf] Integer#chr(Encoding::UTF_8) is considerably slower than CRuby #2316

[perf] Integer#chr(Encoding::UTF_8) is considerably slower than CRuby #2316

postmodern commented Apr 7, 2021

headius commented Apr 7, 2021 •

edited

headius commented Apr 7, 2021 •

edited

chrisseaton commented Apr 7, 2021

headius commented Apr 7, 2021

chrisseaton commented Apr 7, 2021

headius commented Apr 7, 2021

chrisseaton commented Apr 7, 2021

chrisseaton commented Apr 7, 2021 •

edited by eregon

eregon commented Apr 7, 2021

headius commented Apr 7, 2021

chrisseaton commented Apr 7, 2021

chrisseaton commented Apr 7, 2021

headius commented Apr 7, 2021

chrisseaton commented Apr 7, 2021

headius commented Apr 7, 2021 •

edited

postmodern commented Apr 7, 2021

eregon commented Apr 12, 2021 •

edited

[perf] Integer#chr(Encoding::UTF_8) is considerably slower than CRuby #2316

[perf] Integer#chr(Encoding::UTF_8) is considerably slower than CRuby #2316

Comments

postmodern commented Apr 7, 2021

Example

ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]

truffleruby 21.0.0, like ruby 2.7.2, GraalVM CE Native [x86_64-linux]

headius commented Apr 7, 2021 • edited

headius commented Apr 7, 2021 • edited

chrisseaton commented Apr 7, 2021

headius commented Apr 7, 2021

chrisseaton commented Apr 7, 2021

headius commented Apr 7, 2021

chrisseaton commented Apr 7, 2021

chrisseaton commented Apr 7, 2021 • edited by eregon

eregon commented Apr 7, 2021

headius commented Apr 7, 2021

chrisseaton commented Apr 7, 2021

chrisseaton commented Apr 7, 2021

headius commented Apr 7, 2021

chrisseaton commented Apr 7, 2021

headius commented Apr 7, 2021 • edited

postmodern commented Apr 7, 2021

eregon commented Apr 12, 2021 • edited

headius commented Apr 7, 2021 •

edited

headius commented Apr 7, 2021 •

edited

chrisseaton commented Apr 7, 2021 •

edited by eregon

headius commented Apr 7, 2021 •

edited

eregon commented Apr 12, 2021 •

edited