Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'Cannot cast org.jruby.gen.RubyObject7 to org.jruby.gen.RubyObject4' errors #5910

Closed
jordan-thoms opened this issue Oct 10, 2019 · 25 comments
Closed
Milestone

Comments

@jordan-thoms
Copy link

@jordan-thoms jordan-thoms commented Oct 10, 2019

Environment

Provide at least:

  • JRuby version (jruby -v) and command line (flags, JRUBY_OPTS, etc)
    jruby 9.2.8.0 (2.5.3) 2019-08-12 a1ac7ff OpenJDK 64-Bit Server VM 25.222-b10 on 1.8.0_222-8u222-b10-1ubuntu1~18.04.1-b10 +jit [linux-x86_64]

java -XX:+UseCompressedOops -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -Dfile.encoding=UTF-8 -Xmx7000m -Xss512k -Djffi.boot.library.path=/opt/jruby-9.2.8.0/lib/jni -Djava.security.egd=file:/dev/urandom -Xbootclasspath/a:/opt/jruby-9.2.8.0/lib/jruby.jar -classpath : -Djruby.home=/opt/jruby-9.2.8.0 -Djruby.lib=/opt/jruby-9.2.8.0/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main /opt/jruby-9.2.8.0/bin/bundle exec sidekiq -q pdf -q pdf_auto_sync -c 80 -r ./jworker.rb

  • Operating system and platform (e.g. uname -a)
    Linux kami-jworker-5cd6444676-ll7kx 4.14.137+ #1 SMP Thu Aug 8 02:47:02 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux

Other relevant info you may wish to add:

  • Installed or activated gems
  • Application/framework version (e.g. Rails, Sinatra)
  • Environment variables

Expected Behavior

JRuby should not have internal exceptions

Actual Behavior

On rare occasions (we run ~80 k8s pods with this workload at peak each day, and this situation happens for a single one of those processes about once a week), suddenly all requests we attempt to make fail with a ClassCastException (stacktrace below). Interestingly, whenever this happens, it's the stacktrace seems consistent - it's always relating to 'gems/http-cookie-1.0.3/lib/http/cookie_jar/abstract_store.rb:45 '. It seems to also always be specifically failing trying to cast a 'org.jruby.gen.RubyObject7' to 'org.jruby.gen.RubyObject4' object - as in those same class names every time - I'm not sure what significance that has.

I'm guessing this is some sort of race condition under heavy load, as this usually happens to processes which are under heavy load. Once this condition starts, all requests will fail with the same exception until we kill the process.

This has been happening for a while now across a range of jruby versions, I'm not sure exactly when it started but I may be able to get some info on that.

MethodHandleImpl.java:361:in `java.lang.invoke.MethodHandleImpl.newClassCastException': Cannot cast org.jruby.gen.RubyObject7 to org.jruby.gen.RubyObject4 (Java::JavaLang::ClassCastException)
    from MethodHandleImpl.java:356:in `java.lang.invoke.MethodHandleImpl.castReference'
    from FieldVariableAccessor.java:100:in `org.jruby.runtime.ivars.FieldVariableAccessor.set'
    from IRRuntimeHelpers.java:1831:in `org.jruby.ir.runtime.IRRuntimeHelpers.setVariableWithAccessor'
    from /opt/jruby-9.2.8.0/lib/ruby/stdlib/monitor.rb:254:in `opt.jruby_minus_9_dot_2_dot_8_dot_0.lib.ruby.stdlib.monitor.RUBY$method$mon_initialize$0'
    from CompiledIRMethod.java:104:in `org.jruby.internal.runtime.methods.CompiledIRMethod.call'
    from MixedModeIRMethod.java:123:in `org.jruby.internal.runtime.methods.MixedModeIRMethod.call'
    from DynamicMethod.java:192:in `org.jruby.internal.runtime.methods.DynamicMethod.call'
    from CachingCallSite.java:141:in `org.jruby.runtime.callsite.CachingCallSite.call'
    from /opt/jruby-9.2.8.0/lib/ruby/stdlib/monitor.rb:248:in `opt.jruby_minus_9_dot_2_dot_8_dot_0.lib.ruby.stdlib.monitor.invokeOther1:mon_initialize'
    from /opt/jruby-9.2.8.0/lib/ruby/stdlib/monitor.rb:248:in `opt.jruby_minus_9_dot_2_dot_8_dot_0.lib.ruby.stdlib.monitor.RUBY$method$initialize$0'
    from CompiledIRMethod.java:91:in `org.jruby.internal.runtime.methods.CompiledIRMethod.call'
    from MixedModeIRMethod.java:90:in `org.jruby.internal.runtime.methods.MixedModeIRMethod.call'
    from IRRuntimeHelpers.java:1156:in `org.jruby.ir.runtime.IRRuntimeHelpers.instanceSuper'
    from gems/http-cookie-1.0.3/lib/http/cookie_jar/abstract_store.rb:45:in `opt.jruby_minus_9_dot_2_dot_8_dot_0.lib.ruby.gems.shared.gems.http_minus_cookie_minus_1_dot_0_dot_3.lib.http.cookie_jar.abstract_store.invokeSuper4:initialize'
    from gems/http-cookie-1.0.3/lib/http/cookie_jar/abstract_store.rb:45:in `opt.jruby_minus_9_dot_2_dot_8_dot_0.lib.ruby.gems.shared.gems.http_minus_cookie_minus_1_dot_0_dot_3.lib.http.cookie_jar.abstract_store.RUBY$method$initialize$0'
    from CompiledIRMethod.java:91:in `org.jruby.internal.runtime.methods.CompiledIRMethod.call'
    from MixedModeIRMethod.java:90:in `org.jruby.internal.runtime.methods.MixedModeIRMethod.call'
    from IRRuntimeHelpers.java:1156:in `org.jruby.ir.runtime.IRRuntimeHelpers.instanceSuper'
    from gems/http-cookie-1.0.3/lib/http/cookie_jar/hash_store.rb:33:in `opt.jruby_minus_9_dot_2_dot_8_dot_0.lib.ruby.gems.shared.gems.http_minus_cookie_minus_1_dot_0_dot_3.lib.http.cookie_jar.hash_store.invokeSuper0:initialize'
...
@headius

This comment has been minimized.

Copy link
Member

@headius headius commented Oct 15, 2019

Hmm that is peculiar indeed.

@headius

This comment has been minimized.

Copy link
Member

@headius headius commented Oct 15, 2019

A bit more info here...

RubyObject7 and RubyObject4 are generated objects made to be "right-sized" to either 7 or 4 instance variables. In this case it appears to be for the JVM object backing up some class which gets some of its instance variable names (and an initialize method for those vars) from the MonitorMixin module.

That module appears to only define three instance variables: @mon_owner, @mon_count, @mon_mutex, and when I do a simple construction locally it does use the 3-wide object:

$ jruby -rmonitor -e 'p JRuby.ref(Monitor.new).getClass'
class org.jruby.gen.RubyObject3

It appears that the actual class being constructed is (from the http-cookie gem) HTTP::CookieJar::HashStore which extends HTTP::CookieJar::AbstractStore. The latter adds a fourth variable @logger, and the former adds three more variables @jar, @gc_index, and @gc_threshold. So in this case it seems that the RubyObject7 is the right choice.

It's not clear to me why, then, it's trying to cast to RubyObject4. If AbstractStore were instantiated on its own, it would indeed have four variables, but that should not apply to subclasses. We calculate the appropriate backing object anew for each type.

@headius

This comment has been minimized.

Copy link
Member

@headius headius commented Oct 15, 2019

Well one possible improvement might be to synchronize here:

public Set<String> discoverInstanceVariables() {

This logic walks up the class hieararchy looking for instance variables, eventually choosing the appropriate size of JVM object. I have not come up with a way to force a race, though; the first one to make this calculation should win forever, which would preclude the possibility of having two threads see different-sized object types.

@headius

This comment has been minimized.

Copy link
Member

@headius headius commented Oct 15, 2019

If there's a race, it would be somewhere in here where it's setting the new object type for the Ruby class:

public static ObjectAllocator specializeForVariables(RubyClass klass, Set<String> foundVariables) {
int size = foundVariables.size();
// clamp to max object width (jruby/jruby#
size = Math.min(size, Options.REIFY_VARIABLES_MAX.load());
ClassAndAllocator cna = null;
String className = null;
if (Options.REIFY_VARIABLES_NAME.load()) {
className = klass.getName();
if (className.startsWith("#")) {
className = "Anonymous" + Integer.toHexString(System.identityHashCode(klass));
} else {
className = className.replace("::", "/");
}
} else {
// Generate class for specified size
cna = getClassForSize(size);
if (cna == null) {
className = "RubyObject" + size;
}
}
// if we have a className, proceed to generate
if (className != null) {
final String clsPath = "org/jruby/gen/" + className;
synchronized (LOADER) {
Class specialized;
try {
// try loading class without generating
specialized = LOADER.loadClass(clsPath.replace('/', '.'));
} catch (ClassNotFoundException cnfe) {
// generate specialized class
specialized = generateInternal(size, clsPath);
}
try {
ObjectAllocator allocator = (ObjectAllocator) specialized.getDeclaredClasses()[0].newInstance();
cna = new ClassAndAllocator(specialized, allocator);
if (!Options.REIFY_VARIABLES_NAME.load()) {
SPECIALIZED_CLASSES.put(size, cna);
}
} catch (Throwable t) {
throw new RuntimeException(t);
}
}
}
// Pre-initialize variable table with field accessors for size
try {
int offset = 0;
// TODO: this just ends up reifying the first N variables it finds, which may not be the most valuable
for (String name : foundVariables) {
klass.getVariableTableManager().getVariableAccessorForVar(
name,
LOOKUP.findGetter(cna.cls, "var" + offset, Object.class),
LOOKUP.findSetter(cna.cls, "var" + offset, Object.class));
offset++;
if (offset >= size) break;
}
} catch (NoSuchFieldException | IllegalAccessException e) {
throw new RuntimeException(e);
}
klass.setReifiedClass(cna.cls);
klass.setAllocator(cna.allocator);
return cna.allocator;
}

@jordan-thoms

This comment has been minimized.

Copy link
Author

@jordan-thoms jordan-thoms commented Oct 15, 2019

Hmm... I wonder what the best way to debug this more would be?

We saw this happen again this morning, same stacktrace. For now I'm going to setup a mechanism to kill our workers if they hit this error to at reduce the impact of this for us, but we will still see the initial error pop up so if there's a new version you'd like us to test to see if it fixes things I'd be happy to do that. Not sure what the best way to track down a super low probability race condition like this is...

@headius

This comment has been minimized.

Copy link
Member

@headius headius commented Oct 16, 2019

@jordan-thoms Yeah it's definitely something we need to fix on our end.

If I were trying to investigate this myself I'd wrap the lines of code where it happens in another rescue block that inspects the actual object that was wrongly cast, to see why it has a FieldVariableAccessor that points at the wrong type.

A workaround for you, which may result in slightly reduced performance (because we won't "right-size" these objects) would be to set -Xreify.variables=false or pass -Djruby.reify.variables=false to the JVM.

headius added a commit to headius/jruby that referenced this issue Oct 16, 2019
This is likely at least a partial fix for jruby#5910, where it appears
that multiple threads all trigger reificatin of a class's
variables at the same time, leading to some of those threads
choosing a different reified type and ultimately class casting
exceptions.

The change here synchronizes against the class's "real" class,
which should be the one on which we attach the allocated. This
should force multiple threads all allocating the first instance
at once to wait for each other, eventually falling back on the
new allocator that does not re-reify the type.

This will eventually be moot once individual object instances hold
a reference to their own layout managers.
@headius

This comment has been minimized.

Copy link
Member

@headius headius commented Oct 16, 2019

I have pushed a possible fix for this in #5919 that should guarantee once-only reification of variables. This will make sure only one thread gets to decide what the right-sized shape of object should be.

As noted there and in the commit, this will eventually be moot once individual objects hold a reference to their own variable layout manager, but for now I think this is necessary.

@jordan-thoms

This comment has been minimized.

Copy link
Author

@jordan-thoms jordan-thoms commented Oct 16, 2019

Wow, speedy turnaround! Thanks for pointing out the config change, I've deployed that now.

headius added a commit to headius/jruby that referenced this issue Oct 16, 2019
This is likely at least a partial fix for jruby#5910, where it appears
that multiple threads all trigger reificatin of a class's
variables at the same time, leading to some of those threads
choosing a different reified type and ultimately class casting
exceptions.

The change here synchronizes against the class's "real" class,
which should be the one on which we attach the allocated. This
should force multiple threads all allocating the first instance
at once to wait for each other, eventually falling back on the
new allocator that does not re-reify the type.

This will eventually be moot once individual object instances hold
a reference to their own layout managers.
@enebo

This comment has been minimized.

Copy link
Member

@enebo enebo commented Oct 16, 2019

@jordan-thoms I am happy to see that you can test against this PR. We have a small dilemma here in that it happens to you about once a week but we would also like to put out 9.2.9 soon. To help us understand things better you run some set of processes (~80) for some time each day and once a week you see one fail? So if it were possible to run 560 of these do you think you would see this in one day?

@jordan-thoms

This comment has been minimized.

Copy link
Author

@jordan-thoms jordan-thoms commented Oct 17, 2019

@enebo Unfortunately I don't think that would work - the processes are running sidekiq processing a jobs as they come in so if we ran more processes they would be under lower load and probably less likely to hit this error.

It might be possible to reproduce this error more quickly though - if you had a bunch of threads which were doing nothing but sending requests using the http_client gem ( the requests in our sidekiq job are using that through the google-apis gem, e.g. https://github.com/googleapis/google-api-ruby-client/blob/master/lib/google/apis/core/base_service.rb#L401 ), or maybe even directly calling the cookie_jar gem in the same way that that code path does, that could make the race condition more likely to be hit, as while we run sidekiq with concurrency 80 our sidekiq jobs spend a lot of time doing other things than making requests. I probably don't have time to build that right now though.

@enebo

This comment has been minimized.

Copy link
Member

@enebo enebo commented Oct 17, 2019

[deleted comment since I mention a gem which was not being used]

@enebo

This comment has been minimized.

Copy link
Member

@enebo enebo commented Oct 17, 2019

Ok reproduction (~30% of the time on my machine):

jruby -X-C ../snippets/web1.rb
require 'http-cookie'

200.times do
  Thread.new do
    HTTP::CookieJar::HashStore.new
  end
end

Will see how PR does now.

@enebo

This comment has been minimized.

Copy link
Member

@enebo enebo commented Oct 17, 2019

So tried the PR and I could not get a fail to happen after like 40 runs...not 100% proof but extremely likely to have fixed the issue.

@enebo

This comment has been minimized.

Copy link
Member

@enebo enebo commented Oct 17, 2019

Also notice I made sure the app ran in the interpreter which slowed down execution which I believe made the race more possible. compiling main script seemed to never trip over this.

@headius

This comment has been minimized.

Copy link
Member

@headius headius commented Oct 17, 2019

I have not been able to reproduce without using http-cookie, but I'll merge the PR in and we'll iterate on a repro after the fact.

@headius

This comment has been minimized.

Copy link
Member

@headius headius commented Oct 17, 2019

Aha, I think we figured out why this happens.

While trying to come up with a better repro (and explain why this fails in the first place) I remembered that we have a debug flag for outputting the shape of objects as they're encountered: -Xdump.variables. With that output, we see something very interesting for @enebo's script:

...
HTTP::CookieJar::HashStore;[@jar, @logger, @gc_threshold, @gc_index, @mon_mutex, @mon_owner, @mon_count]
HTTP::CookieJar::HashStore;[@jar, @logger, @gc_threshold, @gc_index, @mon_mutex, @mon_owner, @mon_count]
HTTP::CookieJar::HashStore;[@jar, @logger, @gc_threshold, @gc_index, @mon_mutex, @mon_owner, @mon_count]
HTTP::CookieJar::HashStore;[@jar, @logger, @gc_threshold, @gc_index, @mon_mutex, @mon_owner, @mon_count]
HTTP::CookieJar::HashStore;[@logger, @mon_mutex, @mon_owner, @mon_count]
HTTP::CookieJar::HashStore;[@jar, @logger, @gc_threshold, @gc_index, @mon_mutex, @mon_owner, @mon_count]
HTTP::CookieJar::HashStore;[@logger, @mon_mutex, @mon_owner, @mon_count]
HTTP::CookieJar::HashStore;[@jar, @logger, @gc_threshold, @gc_index, @mon_mutex, @mon_owner, @mon_count]
HTTP::CookieJar::HashStore;[@jar, @logger, @gc_threshold, @gc_index, @mon_mutex, @mon_owner, @mon_count]
HTTP::CookieJar::HashStore;[@jar, @logger, @gc_threshold, @gc_index, @mon_mutex, @mon_owner, @mon_count]
HTTP::CookieJar::HashStore;[@jar, @logger, @gc_threshold, @gc_index, @mon_mutex, @mon_owner, @mon_count]
HTTP::CookieJar::HashStore;[@jar, @logger, @gc_threshold, @gc_index, @mon_mutex, @mon_owner, @mon_count]
HTTP::CookieJar::HashStore;[@jar, @logger, @gc_threshold, @gc_index, @mon_mutex, @mon_owner, @mon_count]
... errors start here

So there are at least two different views of this object, making it clear that a race could result if one steps on the other's work and picks a different shape.

Then we figured out why this is happening: we lazily compile methods. When first parsed and turned into an IRScope, we don't immediately compile the subtree for a given def. We defer that until the method is actually called. So in this case, none of the methods in HashStore had been called by the time we construct the first one, so the threads attempting to get a list of variable names don't get any because there's no IR to walk.

Once a first instance has been instantiated and initialize and other methods are called, they will have IR and can be inspected to show a list of variable names.

If that happens at the same time the first thread is doing the initial allocation, we can end up with a 7-wide object being cast as a 4-wide object, or similar races.

I am working on a fix to properly walk the AST if the IR has not been generated yet, so we always have an accurate listing of variables.

@headius

This comment has been minimized.

Copy link
Member

@headius headius commented Oct 18, 2019

I've discovered a bug in http-cookie that is aggravating this problem!

So in AbstractStore, (and also in AbstractSaver I believe) you have code like this:

...
  class << self
    @@class_map = {}

    # Gets an implementation class by the name, optionally trying to
    # load "http/cookie_jar/*_store" if not found.  If loading fails,
    # IndexError is raised.
    def implementation(symbol)
      @@class_map.fetch(symbol)
    rescue IndexError
      begin
        require 'http/cookie_jar/%s_store' % symbol
        @@class_map.fetch(symbol)
      rescue LoadError, IndexError
        raise IndexError, 'cookie store unavailable: %s' % symbol.inspect
      end
    end

    def inherited(subclass) # :nodoc:
      @@class_map[class_to_symbol(subclass)] = subclass
    end

    def class_to_symbol(klass) # :nodoc:
      klass.name[/[^:]+?(?=Store$|$)/].downcase.to_sym
    end
  end
...

This is not thread-safe in any implementation and is the true cause of our variable object widths.

The problem here is that the inherited hook is called immediately once the HashStore and other classes extend AbstractStore. In other words, the class goes into this @@class_map before any of its methods have been defined! If there's another thread accessing the map at the same time, it will see a partially-initialized class.

This is the reason we see anywhere from 4 to 7 variables; depending on when the second thread starts inspecting the class, it will see none, some, or all of HashStore's methods defined, giving it a different view of instance variables.

It actually gets worse, though, because it's possible for a second thread to start instantiating HashStore before it is fully defined, which causes AbstractStore's initialize to call the default version of default_options that returns nil, and you will get errors like this:

...
NoMethodError: undefined method `each_pair' for nil:NilClass
  initialize at /Users/headius/projects/jruby/lib/ruby/gems/shared/gems/http-cookie-1.0.3/lib/http/cookie_jar/abstract_store.rb:52
    repro.rb at repro.rb:5

Note also that this is from @enebo's reproduction script that constructs HashStore instances...but HashStore#initialize is not in this stack trace because it hasn't been defined yet!

This new issue is nothing we can fix; the flaw is in the use of inherited to cache classes before they've been fully defined. This will need to be fixed in the http-cookie gem.

However this did lead us to improve how we're doing this method-walking.

@enebo I realized after we chatted today that we actually do force methods to compile in order to walk their variables. The odd results were due to this bug in http-cookie described above. So I have proceeded with the logic that walks AST, so that simply instantiating a class does not force all methods on all ancestors to compile!

@headius

This comment has been minimized.

Copy link
Member

@headius headius commented Oct 18, 2019

I filed sparklemotion/http-cookie#27.

There appears to be other thread-safety issues in http-cookie like sparklemotion/http-cookie#6. At least one person there pointed out that eager-loading all the related classes avoids problems, and I believe that would also avoid the issue reported above.

@headius headius added this to the JRuby 9.2.9.0 milestone Oct 18, 2019
@jordan-thoms

This comment has been minimized.

Copy link
Author

@jordan-thoms jordan-thoms commented Oct 18, 2019

Ah, yes we have also seen occasional "undefined method `implementation' for HTTP::CookieJar::AbstractStore:Class" errors like what was reported in sparklemotion/http-cookie#6 . Http-cookie and http-client don't seem to have been updated since 2016 so probably not much chance of those issues getting fixed, unfortunately the google-api-ruby-client gem we use is hardcoded to use them...

@kares

This comment has been minimized.

Copy link
Member

@kares kares commented Oct 18, 2019

@jordan-thoms maybe raise the issue on google-api-ruby-client that the libraries they use have issues
... some of which are long-term - not being fixed (and if possible suggest alternatives for them to use)

@jsvd

This comment has been minimized.

Copy link

@jsvd jsvd commented Oct 18, 2019

Hi friends, we're also getting this in Logstash in a few places, for example elastic/logstash#11196

@enebo

This comment has been minimized.

Copy link
Member

@enebo enebo commented Oct 18, 2019

@headius ah yeah this is a good find. Also I am glad we talked about the lazy stuff because it made us realize we do not need to eagerly load those methods just to look for ivars. I actually expect a good memory win because of that (possibly also a change in startup time).

Their code naively looks reasonable until you think of it in the context of concurrency. So one question for anyone watching is what is the appropriate idiom for capturing the subclasses but not actually using them before they are finished being defined? I feel like whatever solution we do come up with should be in our cookbook of Ruby problems. The answer of forcing all types which do this to be eager before the threads spin up will just end up being issues reports we see. I half feel like this will be a library and not some textual explanation (note: based on this I think inherited will end up going into jruby-lint).

@headius

This comment has been minimized.

Copy link
Member

@headius headius commented Oct 18, 2019

@jsvd The fix for this should fix your issue as well. I could not tell if http-cookie is involved...can you confirm? (On mobile currently)

@kares

This comment has been minimized.

Copy link
Member

@kares kares commented Oct 19, 2019

no http-cookie involved there ... rather a Header object (using bindata gem as its base)

there's a lot going on (in Bindata::Struct) and the object's shape might change based on data it gets.
some of that class' (initialization) methods could possible stay cold till a given payload arrives ...

@enebo

This comment has been minimized.

Copy link
Member

@enebo enebo commented Oct 21, 2019

Barring any reply from @jsvd or anyone else who reported this I am resolving this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.