Java::JavaLang::StackOverflowError with gsub(RegExp, Hash) #4001

Closed
reidmorrison opened this Issue Jul 8, 2016 · 12 comments

Comments

Projects
None yet
4 participants
@reidmorrison

Moving from JRuby 1.7.23 to JRuby 9.1.2.0 causes gsub(RegExp, Hash) to fail with a Java::JavaLang::StackOverflowError. The code works fine for several minutes and only fails after significant concurrent load.

Environment

  • jruby 9.1.2.0 (2.3.0) 2016-05-26 7357c8f Java HotSpot(TM) 64-Bit Server VM 25.66-b17 on 1.8.0_66-b17 +jit [linux-x86_64]
  • Linux tparelapp1.clarity.net 2.6.32-504.3.3.el6.x86_64 #1 SMP Wed Dec 17 01:55:02 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
JAVA_OPTS_GENERAL="-server -Dfile.encoding=UTF-8"
JAVA_OPTS_JMX="-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=7100 -Dcom.sun.management.jmxremote.ssl=false -Djmx.remote.x.server.connection.timeout=30000"
JAVA_OPTS_MEM="-Xms8192m -Xmx8192m -Xss2048k -XX:G1HeapRegionSize=8m -XX:ReservedCodeCacheSize=512m"
JAVA_OPTS_GC="-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+UseCodeCacheFlushing -XX:+ParallelRefProcEnabled"
JAVA_OPTS_GC_LOG="-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintAdaptiveSizePolicy  -Xloggc:log/jvm_gc_${INSTANCE}.log"
JAVA_OPTS_JRUBY="-Djruby.thread.pool.enabled=false -Djruby.compile.invokedynamic=false -Djruby.global.require.lock=false"
export JAVA_OPTS="${JAVA_OPTS_GENERAL} ${JAVA_OPTS_JMX} ${JAVA_OPTS_MEM} ${JAVA_OPTS_GC} ${JAVA_OPTS_GC_LOG} ${JAVA_OPTS_JRUBY}"

Applicable gems:

  • Haml 4.0.7
  • Rails 4.1.15

Expected Behavior

# Extract of the code that fails in HAML on the gsub line
# haml-4.0.7/lib/helpers.rb:539
HTML_ESCAPE = {'&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;', "'" => '&#039;', }
HTML_ESCAPE_REGEX = /[\"><&]/

def html_escape(text)
  text = text.to_s
  text.gsub(HTML_ESCAPE_REGEX, HTML_ESCAPE)
end

# Extract of code that fails in Rails on the gsub line
# activesupport-4.1.15/lib/active_support/core_ext/string/output_safety.rb:25
HTML_ESCAPE = { '&' => '&amp;',  '>' => '&gt;',   '<' => '&lt;', '"' => '&quot;', "'" => '&#39;' }
HTML_ESCAPE_REGEXP = /[&"'><]/
def html_escape(s)
  s = s.to_s
  if s.html_safe?
    s
  else
    s.gsub(HTML_ESCAPE_REGEXP, HTML_ESCAPE).html_safe
  end
end

Under MRI and prior to JRuby optimization of the code, the html_escape completes successfully in both cases.

Works without issue currently in production on JRuby 1.7.23, under high concurrent load.

Actual Behavior

  • After several minutes of automated testing the application repeatedly throws Java::JavaLang::StackOverflowError when rendering HTML pages. The top line of each stack trace is one of the 2 above sub lines.
  • In moving from JRuby 1.7.23 to JRuby 9.1.2.0 no application code changes were made. Only Jruby was upgraded.
  • In both cases the gsub is receiving a Hash as the second parameter.
  • The complete stack trace: https://gist.github.com/reidmorrison/b8ce8d5916690e6011435f79d413d930

Workaround

  • By replacing the gsub(RegExp, Hash) with a gsub(RegExp, &block) we can work around the problem.

When patching the above code with an alternative call to gsub it works without issue:

def html_escape(text)
  text = text.to_s
  text.gsub(HTML_ESCAPE_REGEX) do |match|
    case match
    when '&'
      '&amp;'
    when '<'
      '&lt;'
    when '>'
      '&gt;'
    when '"'
      '&quot;'
    when "'"
      '&#039;'
    end
  end
end

So far we have not been able to get a standalone reproducible example of this failure.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Jul 11, 2016

Member

Huh, weird one. I would have thought the block version would be the one with a problem.

Member

headius commented Jul 11, 2016

Huh, weird one. I would have thought the block version would be the one with a problem.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Jul 11, 2016

Member

The stack overflow output does not give me a lot to go on...and weirdly it doesn't contain gsub at all.

You say heavy concurrent load...but I wouldn't expect concurrency to play here unless something is modifying the Hash you pass in (it's possible to get cycles in a Hash's internal graph if modified concurrently). But even for that I'd expect to see a stuck thread, not a stack overflow.

I wonder if you can reproduce this error, or see some oddity in the resulting escaped text, using a smaller reproduction case that just tries to hit string.gsub(Regexp, Hash) heavily across several threads.

I'll eyeball the code in the meantime and see if I can find anything obvious.

Member

headius commented Jul 11, 2016

The stack overflow output does not give me a lot to go on...and weirdly it doesn't contain gsub at all.

You say heavy concurrent load...but I wouldn't expect concurrency to play here unless something is modifying the Hash you pass in (it's possible to get cycles in a Hash's internal graph if modified concurrently). But even for that I'd expect to see a stuck thread, not a stack overflow.

I wonder if you can reproduce this error, or see some oddity in the resulting escaped text, using a smaller reproduction case that just tries to hit string.gsub(Regexp, Hash) heavily across several threads.

I'll eyeball the code in the meantime and see if I can find anything obvious.

@reidmorrison

This comment has been minimized.

Show comment
Hide comment
@reidmorrison

reidmorrison Jul 12, 2016

It appears that switching to the above workarounds just delays the StackOverflow condition. We also increased the stack size to 3m. It now results in a Stack Overflow on the gsub line in the workaround code.

We are actively working on trying to create standalone code to reproduce the problem.

It appears that switching to the above workarounds just delays the StackOverflow condition. We also increased the stack size to 3m. It now results in a Stack Overflow on the gsub line in the workaround code.

We are actively working on trying to create standalone code to reproduce the problem.

@reidmorrison

This comment has been minimized.

Show comment
Hide comment
@reidmorrison

reidmorrison Jul 21, 2016

We replaced the 2 above patches with following code and are able to workaround the above error:

require 'active_support/core_ext/string/output_safety'
class ERB
  module Util
    def html_escape(s)
      s = s.to_s
      if s.html_safe?
        s
      else
        s = s.dup if s.frozen?
        s.gsub!('&', '&amp;')
        s.gsub!('<', '&lt;')
        s.gsub!('>', '&gt;')
        s.gsub!('"', '&quot;')
        s.gsub!("'", '&#039;')
        s.html_safe
      end
    end

    # Aliasing twice issues a warning "discarding old...". Remove first to avoid it.
    remove_method(:h)
    alias h html_escape

    module_function :h

    singleton_class.send(:remove_method, :html_escape)
    module_function :html_escape
  end
end
Rails.application.config.after_initialize do |app|
  Haml::Helpers
  module Haml::Helpers
    def html_escape_without_haml_xss(text)
      text = text.to_s
      text = text.dup if text.frozen?
      text.gsub!('&', '&amp;')
      text.gsub!('<', '&lt;')
      text.gsub!('>', '&gt;')
      text.gsub!('"', '&quot;')
      text.gsub!("'", '&#039;')
      text
    end

    def html_escape_with_haml_xss(text)
      str = text.to_s
      return text if str.html_safe?
      Haml::Util.html_safe(html_escape_without_haml_xss(str))
    end

    def html_escape(text)
      str = text.to_s
      return text if str.html_safe?
      Haml::Util.html_safe(html_escape_without_haml_xss(str))
    end
  end
end

As a result we have been able to progress beyond the Stack Overflow Error.
@bpardee also pointed out that the Stack Overflow may be a symptom and not the root cause. The Stack overflow may be as a result of trying to render the failure page due to some other exception.

Also, there are several other gsub calls in rails and haml that use a block as the second parameter and none of them are failing at the moment.

We are now seeing ActionView::Template::Error·can not use Context instance already associated with some thread when the above pages are rendered after a significant number of calls.

It appears that haml is using coffee-script to invoke the Java Script compiler, but the coffee-script gem is not thread-safe: https://github.com/rails/ruby-coffee-script/blob/master/lib/coffee_script.rb#L50

Our front-end developers are currently removing all remaining JavaScript from all haml pages to avoid the runtime JavaScript compilation.

Once these changes have been completed we will remove the 2 patches above and see if they are still needed.

We replaced the 2 above patches with following code and are able to workaround the above error:

require 'active_support/core_ext/string/output_safety'
class ERB
  module Util
    def html_escape(s)
      s = s.to_s
      if s.html_safe?
        s
      else
        s = s.dup if s.frozen?
        s.gsub!('&', '&amp;')
        s.gsub!('<', '&lt;')
        s.gsub!('>', '&gt;')
        s.gsub!('"', '&quot;')
        s.gsub!("'", '&#039;')
        s.html_safe
      end
    end

    # Aliasing twice issues a warning "discarding old...". Remove first to avoid it.
    remove_method(:h)
    alias h html_escape

    module_function :h

    singleton_class.send(:remove_method, :html_escape)
    module_function :html_escape
  end
end
Rails.application.config.after_initialize do |app|
  Haml::Helpers
  module Haml::Helpers
    def html_escape_without_haml_xss(text)
      text = text.to_s
      text = text.dup if text.frozen?
      text.gsub!('&', '&amp;')
      text.gsub!('<', '&lt;')
      text.gsub!('>', '&gt;')
      text.gsub!('"', '&quot;')
      text.gsub!("'", '&#039;')
      text
    end

    def html_escape_with_haml_xss(text)
      str = text.to_s
      return text if str.html_safe?
      Haml::Util.html_safe(html_escape_without_haml_xss(str))
    end

    def html_escape(text)
      str = text.to_s
      return text if str.html_safe?
      Haml::Util.html_safe(html_escape_without_haml_xss(str))
    end
  end
end

As a result we have been able to progress beyond the Stack Overflow Error.
@bpardee also pointed out that the Stack Overflow may be a symptom and not the root cause. The Stack overflow may be as a result of trying to render the failure page due to some other exception.

Also, there are several other gsub calls in rails and haml that use a block as the second parameter and none of them are failing at the moment.

We are now seeing ActionView::Template::Error·can not use Context instance already associated with some thread when the above pages are rendered after a significant number of calls.

It appears that haml is using coffee-script to invoke the Java Script compiler, but the coffee-script gem is not thread-safe: https://github.com/rails/ruby-coffee-script/blob/master/lib/coffee_script.rb#L50

Our front-end developers are currently removing all remaining JavaScript from all haml pages to avoid the runtime JavaScript compilation.

Once these changes have been completed we will remove the 2 patches above and see if they are still needed.

@reidmorrison

This comment has been minimized.

Show comment
Hide comment
@reidmorrison

reidmorrison Aug 4, 2016

Update: After removing the JavaScript from all haml pages the above context error has not re-occurred in the test environment.

As soon as the test environment is available again for testing we will update this issue on what happens after removing the above gsub patches / workarounds.

reidmorrison commented Aug 4, 2016

Update: After removing the JavaScript from all haml pages the above context error has not re-occurred in the test environment.

As soon as the test environment is available again for testing we will update this issue on what happens after removing the above gsub patches / workarounds.

@reidmorrison

This comment has been minimized.

Show comment
Hide comment
@reidmorrison

reidmorrison Aug 4, 2016

@headius we can confirm that even with the changes to remove embedded JavaScript that the Stack Overflow problem does re-occur without the above gsub patches. It took about 2 hours to re-occur under load.

Wish we could have somehow come up with a standalone test script for this scenario.

Since we have a workaround, we are proceeding with the production upgrade from JRuby 1.7 to JRuby 9.1.2.0. 🎉

reidmorrison commented Aug 4, 2016

@headius we can confirm that even with the changes to remove embedded JavaScript that the Stack Overflow problem does re-occur without the above gsub patches. It took about 2 hours to re-occur under load.

Wish we could have somehow come up with a standalone test script for this scenario.

Since we have a workaround, we are proceeding with the production upgrade from JRuby 1.7 to JRuby 9.1.2.0. 🎉

@mattm404

This comment has been minimized.

Show comment
Hide comment
@mattm404

mattm404 Aug 9, 2016

I am also having this issue with:
jruby 9.1.2.0 (2.3.0) 2016-05-26 7357c8f OpenJDK 64-Bit Server VM 25.91-b14 on 1.8.0_91-b14 +jit [linux-x86_64]

I can reproduce it in just over a minute by running this code:

require 'haml'

loop do
  haml_engine = Haml::Engine.new('%script{ :src => "/js/site.js" }')
  output = haml_engine.render
  puts output
end

mattm404 commented Aug 9, 2016

I am also having this issue with:
jruby 9.1.2.0 (2.3.0) 2016-05-26 7357c8f OpenJDK 64-Bit Server VM 25.91-b14 on 1.8.0_91-b14 +jit [linux-x86_64]

I can reproduce it in just over a minute by running this code:

require 'haml'

loop do
  haml_engine = Haml::Engine.new('%script{ :src => "/js/site.js" }')
  output = haml_engine.render
  puts output
end

@reidmorrison

This comment has been minimized.

Show comment
Hide comment
@reidmorrison

reidmorrison Aug 9, 2016

Well done @mattm404 🎉
That script easily reproduces theJava::JavaLang::StackOverflowError: from an irb console.

I can also confirm that when running in a rails console, with the above patches applied, the script ran fine.

Well done @mattm404 🎉
That script easily reproduces theJava::JavaLang::StackOverflowError: from an irb console.

I can also confirm that when running in a rails console, with the above patches applied, the script ran fine.

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Aug 10, 2016

Member

@mattm404 @reidmorrison I have looped running this repro on master for many minutes now with no crash. We did fix an overflow do to constant caching since 9.1.2.0 came out. http://ci.jruby.org will have a nightly. Can you see if this is still failing now? I will try 91.2.0 next to make sure it does fail on my machine (MacOS, Java 8u92 for my env).

Member

enebo commented Aug 10, 2016

@mattm404 @reidmorrison I have looped running this repro on master for many minutes now with no crash. We did fix an overflow do to constant caching since 9.1.2.0 came out. http://ci.jruby.org will have a nightly. Can you see if this is still failing now? I will try 91.2.0 next to make sure it does fail on my machine (MacOS, Java 8u92 for my env).

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Aug 10, 2016

Member

@mattm404 After having written that I realize I have no site.js file. I take it that is not empty? If so can I get a copy. I also ran this outside of Rails. So heh...maybe my WFM is a bit premature (although I would like you to try a nightly still).

Member

enebo commented Aug 10, 2016

@mattm404 After having written that I realize I have no site.js file. I take it that is not empty? If so can I get a copy. I also ran this outside of Rails. So heh...maybe my WFM is a bit premature (although I would like you to try a nightly still).

@reidmorrison

This comment has been minimized.

Show comment
Hide comment
@reidmorrison

reidmorrison Aug 10, 2016

@enebo I just tried @mattm404's script above that fails when I use JRuby 9.1.2.0 and can confirm that it works when using jruby-head.

Well Done 🎉

@enebo I just tried @mattm404's script above that fails when I use JRuby 9.1.2.0 and can confirm that it works when using jruby-head.

Well Done 🎉

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Aug 10, 2016

Member

@reidmorrison @mattm404 yeah I can confirm this only runs a short time on 9.1.2.0. So at least @mattm404 reproduction seems to be fixed. I will close but please reopen if your larger script is still crashing.

Member

enebo commented Aug 10, 2016

@reidmorrison @mattm404 yeah I can confirm this only runs a short time on 9.1.2.0. So at least @mattm404 reproduction seems to be fixed. I will close but please reopen if your larger script is still crashing.

@enebo enebo closed this Aug 10, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment