New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DynamicScope sizeError #4516

Closed
snowp opened this Issue Mar 1, 2017 · 9 comments

Comments

Projects
None yet
4 participants
@snowp
Contributor

snowp commented Mar 1, 2017

  • JRuby version: 9.1.7.0
  • Linux 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Saw this appear in our exception monitoring: org.jruby.runtime.scopes.DynamicScope1 only supports scopes with 1 variables

The code in question that fails looks something like this:

items.each do |item|
  foo = item.foo
  some_helper_method foo
end 

where items is a Java collection. Failure happens while trying to evaluate the block (JavaLang.each is the first Java line in the call stack).

From the logs it looks like the same call site will keep failing with that exception until the app is restarted, so guessing either the call site or DynamicScope1 is in a bad state. I'll note that this is happening in the same jobs that caused me to file an issue about another race condition during DynamicScope generation last year, so this might be related to that.

I'm gonna be digging into this issue, but would love to see if anyone has any ideas.

@enebo enebo added this to the JRuby 9.1.8.0 milestone Mar 1, 2017

@enebo

This comment has been minimized.

Show comment
Hide comment
@enebo

enebo Mar 1, 2017

Member

@snowp wow we want to get this one fixed. Any chance you can get a snippet we can see this fail locally?

Member

enebo commented Mar 1, 2017

@snowp wow we want to get this one fixed. Any chance you can get a snippet we can see this fail locally?

@snowp

This comment has been minimized.

Show comment
Hide comment
@snowp

snowp Mar 1, 2017

Contributor

The code that triggered it uses a job framework in our in-house Java service container, so most of this code can't be shared. I'll try to simulate what it does in some code, but might be tricky.

I'll also note that this happened today for the first time on one of our staging boxes today, and we've probably done around 30-40 deploys a day with 9.1.7.0 since it came out, making this a pretty rare occurrence. That said it's a pretty gnarly bug when it happens so it's still pretty serious.

Contributor

snowp commented Mar 1, 2017

The code that triggered it uses a job framework in our in-house Java service container, so most of this code can't be shared. I'll try to simulate what it does in some code, but might be tricky.

I'll also note that this happened today for the first time on one of our staging boxes today, and we've probably done around 30-40 deploys a day with 9.1.7.0 since it came out, making this a pretty rare occurrence. That said it's a pretty gnarly bug when it happens so it's still pretty serious.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Mar 2, 2017

Member

Can you get us the entire method surrounding that block?

The error you're getting basically means that we allocated a single-slot scope for the method+closure and then it turned out to need more than one slot. Normally I'd suspect our compiler, but with Java integration in the mix it's possible something there is messing it up as well.

Like @enebo says, we very much want to fix this for our 9.1.8.0 release next week. High prioritiy, so any information you can provide would be a great help.

Member

headius commented Mar 2, 2017

Can you get us the entire method surrounding that block?

The error you're getting basically means that we allocated a single-slot scope for the method+closure and then it turned out to need more than one slot. Normally I'd suspect our compiler, but with Java integration in the mix it's possible something there is messing it up as well.

Like @enebo says, we very much want to fix this for our 9.1.8.0 release next week. High prioritiy, so any information you can provide would be a great help.

@snowp

This comment has been minimized.

Show comment
Hide comment
@snowp

snowp Mar 2, 2017

Contributor

Looks like I was slightly off what line the error seemed to occur on:

 def remove_deleted_fields(builder, field_name)
      field = builder.getDescriptorForType.findFieldByName(field_name)
      builder.getField(field).tap do |values| 
        builder.clearField(field) # FAILS ON THIS LINE (according to stack trace)

        values.each do |value|
          builder.addRepeatedField(field, value) unless SomeRubyModel.deleted_field?(value)
        end
      end
    end 

This is basically what I referred to as some_helper_method in the above code snippet. For reference, the builder is a protobuf builder from protoc, so the return value of buidler.getField in this case is of type java.util.List

Let me know if theres anything else I can provide.

Contributor

snowp commented Mar 2, 2017

Looks like I was slightly off what line the error seemed to occur on:

 def remove_deleted_fields(builder, field_name)
      field = builder.getDescriptorForType.findFieldByName(field_name)
      builder.getField(field).tap do |values| 
        builder.clearField(field) # FAILS ON THIS LINE (according to stack trace)

        values.each do |value|
          builder.addRepeatedField(field, value) unless SomeRubyModel.deleted_field?(value)
        end
      end
    end 

This is basically what I referred to as some_helper_method in the above code snippet. For reference, the builder is a protobuf builder from protoc, so the return value of buidler.getField in this case is of type java.util.List

Let me know if theres anything else I can provide.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Mar 3, 2017

Member

The full stack trace (or at least the top 10-20 frames of it) would also be very helpful. You may need -d or -Xbacktrace.style=full or some combination.

I'm leaning toward this being a peculiar IR bug, but I don't see anything in this code that should trip up the compiler.

Member

headius commented Mar 3, 2017

The full stack trace (or at least the top 10-20 frames of it) would also be very helpful. You may need -d or -Xbacktrace.style=full or some combination.

I'm leaning toward this being a peculiar IR bug, but I don't see anything in this code that should trip up the compiler.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Mar 3, 2017

Member

Other notes:

  • There are two nested closures here. In the interpreter, these all would need a DynamicScope. In the JIT, the innermost would not need one.
  • The outermost scope needs three variables (builder, field_name, and field), the first closure needs one additional (values), and the innermost needs one more (value).
  • This could very likely be related to jitting blocks, which we have a fix for in this release. If one of the blocks jits at the same time some other code is running, it's possible it would choose the wrong size DynamicScope to allocate. This is also in line with it being difficult to reproduce, since it would involve a tricky confluence of concurrent behaviors to trigger.

We might be able to sit on our hands and hope this is fixed by @enebo's work in 9.1.8.0 to fix block jitting concurrency issues. If you are able to reprouduce it in any way, you might try doing so on a snapshot of 9.1.8.0 tomorrow or over the weekend.

At this point, unless it's already fixed, this probably won't make 9.1.8.0.

Member

headius commented Mar 3, 2017

Other notes:

  • There are two nested closures here. In the interpreter, these all would need a DynamicScope. In the JIT, the innermost would not need one.
  • The outermost scope needs three variables (builder, field_name, and field), the first closure needs one additional (values), and the innermost needs one more (value).
  • This could very likely be related to jitting blocks, which we have a fix for in this release. If one of the blocks jits at the same time some other code is running, it's possible it would choose the wrong size DynamicScope to allocate. This is also in line with it being difficult to reproduce, since it would involve a tricky confluence of concurrent behaviors to trigger.

We might be able to sit on our hands and hope this is fixed by @enebo's work in 9.1.8.0 to fix block jitting concurrency issues. If you are able to reprouduce it in any way, you might try doing so on a snapshot of 9.1.8.0 tomorrow or over the weekend.

At this point, unless it's already fixed, this probably won't make 9.1.8.0.

@headius

This comment has been minimized.

Show comment
Hide comment
@headius

headius Mar 3, 2017

Member

Ok, let's optimistically assume this was fixed by @enebo's block jitting work in 9.1.8.0.

@snowp If you see this again after updating to 9.1.8.0, reopen and we'll take a closer look.

Member

headius commented Mar 3, 2017

Ok, let's optimistically assume this was fixed by @enebo's block jitting work in 9.1.8.0.

@snowp If you see this again after updating to 9.1.8.0, reopen and we'll take a closer look.

@headius headius closed this Mar 3, 2017

@snowp

This comment has been minimized.

Show comment
Hide comment
@snowp

snowp Mar 3, 2017

Contributor

That sounds good to me. I haven't been able to reproduce it since that one time, but I'll give it a shot this weekend.

Contributor

snowp commented Mar 3, 2017

That sounds good to me. I haven't been able to reproduce it since that one time, but I'll give it a shot this weekend.

@original-brownbear

This comment has been minimized.

Show comment
Hide comment
@original-brownbear

original-brownbear Aug 15, 2017

Contributor

@headius we still have an occurrence of this with close to 100% likelihood.
I only have a somewhat tricky reproducer in elastic/logstash#7999
actual error of this kind in (but it seems correlated with the method not found error in the parent issue): elastic/logstash#7999 (comment)

I will try to create a less complex reproducer for this, admittedly the one I have in that issue isn't very straight forward yet.
But this is not fixed I fear, we've seen this repeatedly in various spots of LS recently using 9.1.12.0.

Contributor

original-brownbear commented Aug 15, 2017

@headius we still have an occurrence of this with close to 100% likelihood.
I only have a somewhat tricky reproducer in elastic/logstash#7999
actual error of this kind in (but it seems correlated with the method not found error in the parent issue): elastic/logstash#7999 (comment)

I will try to create a less complex reproducer for this, admittedly the one I have in that issue isn't very straight forward yet.
But this is not fixed I fear, we've seen this repeatedly in various spots of LS recently using 9.1.12.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment