Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DynamicScope sizeError #4516

Closed
snowp opened this issue Mar 1, 2017 · 9 comments
Closed

DynamicScope sizeError #4516

snowp opened this issue Mar 1, 2017 · 9 comments
Milestone

Comments

@snowp
Copy link
Contributor

@snowp snowp commented Mar 1, 2017

  • JRuby version: 9.1.7.0
  • Linux 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Saw this appear in our exception monitoring: org.jruby.runtime.scopes.DynamicScope1 only supports scopes with 1 variables

The code in question that fails looks something like this:

items.each do |item|
  foo = item.foo
  some_helper_method foo
end 

where items is a Java collection. Failure happens while trying to evaluate the block (JavaLang.each is the first Java line in the call stack).

From the logs it looks like the same call site will keep failing with that exception until the app is restarted, so guessing either the call site or DynamicScope1 is in a bad state. I'll note that this is happening in the same jobs that caused me to file an issue about another race condition during DynamicScope generation last year, so this might be related to that.

I'm gonna be digging into this issue, but would love to see if anyone has any ideas.

@enebo enebo added this to the JRuby 9.1.8.0 milestone Mar 1, 2017
@enebo
Copy link
Member

@enebo enebo commented Mar 1, 2017

@snowp wow we want to get this one fixed. Any chance you can get a snippet we can see this fail locally?

@snowp
Copy link
Contributor Author

@snowp snowp commented Mar 1, 2017

The code that triggered it uses a job framework in our in-house Java service container, so most of this code can't be shared. I'll try to simulate what it does in some code, but might be tricky.

I'll also note that this happened today for the first time on one of our staging boxes today, and we've probably done around 30-40 deploys a day with 9.1.7.0 since it came out, making this a pretty rare occurrence. That said it's a pretty gnarly bug when it happens so it's still pretty serious.

@headius
Copy link
Member

@headius headius commented Mar 2, 2017

Can you get us the entire method surrounding that block?

The error you're getting basically means that we allocated a single-slot scope for the method+closure and then it turned out to need more than one slot. Normally I'd suspect our compiler, but with Java integration in the mix it's possible something there is messing it up as well.

Like @enebo says, we very much want to fix this for our 9.1.8.0 release next week. High prioritiy, so any information you can provide would be a great help.

@snowp
Copy link
Contributor Author

@snowp snowp commented Mar 2, 2017

Looks like I was slightly off what line the error seemed to occur on:

 def remove_deleted_fields(builder, field_name)
      field = builder.getDescriptorForType.findFieldByName(field_name)
      builder.getField(field).tap do |values| 
        builder.clearField(field) # FAILS ON THIS LINE (according to stack trace)

        values.each do |value|
          builder.addRepeatedField(field, value) unless SomeRubyModel.deleted_field?(value)
        end
      end
    end 

This is basically what I referred to as some_helper_method in the above code snippet. For reference, the builder is a protobuf builder from protoc, so the return value of buidler.getField in this case is of type java.util.List

Let me know if theres anything else I can provide.

@headius
Copy link
Member

@headius headius commented Mar 3, 2017

The full stack trace (or at least the top 10-20 frames of it) would also be very helpful. You may need -d or -Xbacktrace.style=full or some combination.

I'm leaning toward this being a peculiar IR bug, but I don't see anything in this code that should trip up the compiler.

@headius
Copy link
Member

@headius headius commented Mar 3, 2017

Other notes:

  • There are two nested closures here. In the interpreter, these all would need a DynamicScope. In the JIT, the innermost would not need one.
  • The outermost scope needs three variables (builder, field_name, and field), the first closure needs one additional (values), and the innermost needs one more (value).
  • This could very likely be related to jitting blocks, which we have a fix for in this release. If one of the blocks jits at the same time some other code is running, it's possible it would choose the wrong size DynamicScope to allocate. This is also in line with it being difficult to reproduce, since it would involve a tricky confluence of concurrent behaviors to trigger.

We might be able to sit on our hands and hope this is fixed by @enebo's work in 9.1.8.0 to fix block jitting concurrency issues. If you are able to reprouduce it in any way, you might try doing so on a snapshot of 9.1.8.0 tomorrow or over the weekend.

At this point, unless it's already fixed, this probably won't make 9.1.8.0.

@headius
Copy link
Member

@headius headius commented Mar 3, 2017

Ok, let's optimistically assume this was fixed by @enebo's block jitting work in 9.1.8.0.

@snowp If you see this again after updating to 9.1.8.0, reopen and we'll take a closer look.

@headius headius closed this Mar 3, 2017
@snowp
Copy link
Contributor Author

@snowp snowp commented Mar 3, 2017

That sounds good to me. I haven't been able to reproduce it since that one time, but I'll give it a shot this weekend.

@original-brownbear
Copy link
Contributor

@original-brownbear original-brownbear commented Aug 15, 2017

@headius we still have an occurrence of this with close to 100% likelihood.
I only have a somewhat tricky reproducer in elastic/logstash#7999
actual error of this kind in (but it seems correlated with the method not found error in the parent issue): elastic/logstash#7999 (comment)

I will try to create a less complex reproducer for this, admittedly the one I have in that issue isn't very straight forward yet.
But this is not fixed I fear, we've seen this repeatedly in various spots of LS recently using 9.1.12.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants
You can’t perform that action at this time.