JRuby counterpart for RubyVM.stat #4384

Open
dgutov opened this Issue Dec 14, 2016 · 19 comments

Projects

None yet

3 participants

@dgutov
dgutov commented Dec 14, 2016 edited

Environment

jruby 1.7.26 (1.9.3p551) 2016-08-26 69763b8 on OpenJDK 64-Bit Server VM 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14 +jit [linux-amd64]

Expected Behavior

We want to have a fast way to determine whether the set of modules, classes and methods in the current runtime didn't change since the last time a certain method was invoked. One can do that in MRI 2.1 and newer by comparing the return values of RubyVM.stat. Is there a JRuby-specific way to do that, aside from doing a full ObjectSpace scan?

Here's the current use case: pry/pry#1583

@headius
Member
headius commented Dec 14, 2016

Looks like RubyVM.stat produces a hash of three values: global_method_state, global_constant_state, and class_serial.

On JRuby there's no equivalent for global_method_state since we do not invalidate methods globally. However, this could be emulated by simply adding another global serial number; we would not use it internally, but it would indicate how many invalidation events had occurred.

Similarly there's no one global_constant_state because we invalidate constants on a per-name basis (MRI may do something similar, so I'm curious what that value means for them). This could also be simulated by adding an additional serial number.

And I'm guessing here, but I think class_serial is basically just a monotonically-increasing class ID, so you can see if any new classes have been added. This exists in JRuby, but unfortunately it is not exposed right now. The best you'd be able to do would be to define a new class and confirm its ID is one higher than the previous value.

Note also that what does exist is all accessible from Ruby today, since JRuby can reflect back into itself:

p JRuby.runtime.alloc_module_id # n
Class.new # two classes created
p JRuby.runtime.alloc_module_id # n + 3
@dgutov
dgutov commented Dec 14, 2016 edited

However, this could be emulated by simply adding another global serial number; we would not use it internally, but it would indicate how many invalidation events had occurred.

It might be helpful, but after experimenting a little, it seems like in MRI's case the class serial is the one most helpful for us. Take this class:

class C
  def foo
  end
end

Both of the following snippets increase class_serial but not global_method_state:

class C
  def bar
  end
end
def C.baz
  123
end

The best you'd be able to do would be to define a new class and confirm its ID is one higher than the previous value.

This is equivalent to checking JRuby.runtime.alloc_module_id, right? Which is a good recommendation, thanks.

But while it might be "good enough" for our use case of "all methods cache" invalidation, it doesn't work like class_serial: the snippets above don't affect it.

irb(main):011:0> JRuby.runtime.allocModuleId
=> 1605
irb(main):012:0> class C
irb(main):013:1> def bar
irb(main):014:2> end
irb(main):015:1> end
=> nil
irb(main):016:0> JRuby.runtime.alloc_module_id
=> 1608
irb(main):017:0> class C
irb(main):018:1> def bar
irb(main):019:2> 3
irb(main):020:2> end
irb(main):021:1> end
=> nil
irb(main):022:0> JRuby.runtime.alloc_module_id
=> 1609
...
irb(main):035:0> JRuby.runtime.alloc_module_id
=> 1622
irb(main):036:0> def C.baz
irb(main):037:1> asdasd
irb(main):038:1> end
=> nil
irb(main):039:0> JRuby.runtime.alloc_module_id
=> 1623
@headius
Member
headius commented Jan 6, 2017

There's a simple reason for this: in order to reduce cache thrashing on singleton objects, we do not assign a new ID to a singleton class until it defines a method. The justification here is that the singleton class is indistinguishable from the parent class for method-caching purposes (the main reason for this ID) so there's no need to invalidate call sites that have seen the parent.

And not to put too fine an edge on it, but you're depending on very internal behavior of both implementations. You're lucky we still even have the class ID...we don't need it these days.

The down side of incrementing like MRI is not clear. I don't know how common it is to create a new singleton object but never define methods or include modules into it. Your case here is pretty narrow.

@headius
Member
headius commented Jan 6, 2017

FWIW a this point we could add something like RubyVM.stat that just exposes class_id, but I'm very reluctant to define RubyVM at all, since I know there's code out there using to detect it's running on MRI, including parts of stdlib itself.

@dgutov
dgutov commented Jan 6, 2017

There's a simple reason for this: in order to reduce cache thrashing on singleton objects, we do not assign a new ID to a singleton class until it defines a method.

Are you referring to the def C.baz example? But it defines a method (I imagine, the first method in the singleton class). And yet, alloc_module_id doesn't change (or rather only changes by 1 because I call it directly).

In the first example, the problem is that redefining a method doesn't change the alloc id. Neither does this happen if I reopen an existing class and define a new method.

I don't know how common it is to create a new singleton object but never define methods or include modules into it. Your case here is pretty narrow.

Sorry, which of the examples are you referring to?

And not to put too fine an edge on it, but you're depending on very internal behavior of both implementations.

The only thing we're depending on is that RubyVM.stat returns a value that changes if some new class or method or added or changed. That's a pretty clear idea, I think.

We're not depending on the internal structure of the returned value.

@dgutov
dgutov commented Jan 6, 2017

FWIW a this point we could add something like RubyVM.stat that just exposes class_id, but I'm very reluctant to define RubyVM at all, since I know there's code out there using to detect it's running on MRI, including parts of stdlib itself.

Why not add a JRubyVM.stat instead?

@headius
Member
headius commented Jan 7, 2017

Are you referring to the def C.baz example? But it defines a method (I imagine, the first method in the singleton class). And yet, alloc_module_id doesn't change (or rather only changes by 1 because I call it directly).

You know what, you're right, and that is odd. I would have expected it to change. For some reason I read your example like class << C or something, which would not cause a new ID if no method table changes were present.

The only thing we're depending on is that RubyVM.stat returns a value that changes if some new class or method or added or changed. That's a pretty clear idea, I think.

Well, you're depending on that change happening for the same events across runtimes. JRuby goes much farther than MRI in optimizing some of these areas, and we won't invalidate method caches nearly as often. That's why it only seems feasible to me to provide the class ID, since we already have that concept for cache identities.

If you're ok with the limitation that only visibly different classes will trigger a new ID, then what we have today should work fine (after I figure out your singleton example).

Why not add a JRubyVM.stat instead?

Indeed, that's an alternative. Of course, you managed to get at the ID without us adding anything...that's by design.

What I'd prefer is a standard API that both JRuby and MRI implement. Until then, I'm willing to add blessed interfaces for some of these VM stats to avoid having you bind to internal APIs. Probably under JRuby::VM.

@headius
Member
headius commented Jan 7, 2017

You know what, you're right, and that is odd. I would have expected it to change. For some reason I read your example like class << C or something, which would not cause a new ID if no method table changes were present.

I took a second look at the code and I have an explanation: we eagerly create the class's singleton class at construction time. So in this case, I believe it's working properly: no new class has been created. MRI may defer singleton class creation until it is needed.

@dgutov
dgutov commented Jan 7, 2017

Well, you're depending on that change happening for the same events across runtimes.

Rather, I'm depending on it on recent MRI, and hoping other runtimes will adopt something similar. RubyVM is strictly MRI-specific.

JRuby goes much farther than MRI in optimizing some of these areas, and we won't invalidate method caches nearly as often.

When a class has new method added, or a method definition is changed, don't you have to invalidate some cache anyway?

If you're ok with the limitation that only visibly different classes will trigger a new ID

What do you mean by "visibly different"? I think both of my examples result in visibly different classes. Here's another one:

irb(main):001:0> JRuby.runtime.alloc_module_id
=> 1599
irb(main):002:0> class C
irb(main):003:1> def foo
irb(main):004:2> end
irb(main):005:1> end
=> nil
irb(main):006:0> JRuby.runtime.alloc_module_id
=> 1602
irb(main):007:0> class C
irb(main):008:1> def bar
irb(main):009:2> end
irb(main):010:1> end
=> nil
irb(main):011:0> JRuby.runtime.alloc_module_id
=> 1603

The class C got "visibly different" (new method added), yet the counter did not increase.

Of course, you managed to get at the ID without us adding anything...that's by design.

That might be good enough if the counter got incremented more often. In particular, I'd like to be confident that it would happen when a user types out a new method and re-loads the source file containing it.

Having any slight edit of existing methods result in counter increment as well would be ideal, but not necessary so far, for me.

What I'd prefer is a standard API that both JRuby and MRI implement.

The MRI team might be reluctant to standardize the stat structure anyway. The doc says it can change in future releases.

@headius
Member
headius commented Jan 7, 2017

What do you mean by "visibly different"? I think both of my examples result in visibly different classes. Here's another one:

Visibly difference from its parent class. We assign a new ID to a class exactly once, when it begins to differ from its parent. From then on it lives as its own class for caching purposes.

Adding bar to C doesn't change class identity.

I think some clarification is needed here.

JRuby uses this module ID only to uniquely identify classes. On rare occasions, two different class objects will have the same ID if they do not yet differ. Once one of them differs, it will get its own ID exactly once.

This ID is used for type-checking: to see if the incoming receiver object is the same as it was last time (or visibly indistinguishable from what we saw last time). In your case, adding bar to the class does not change the type identity of existing C objects.

When a class has new method added, or a method definition is changed, don't you have to invalidate some cache anyway?

We also have a separate mechanism that tracks method tables. Each class has its own, separate serial number. This is an indication of the class's current "revision" if you will. Adding bar to C in your example will cause this number to be incremented. Each class has its own number, unlike in MRI where there's only one global serial number.

So in order to make all the cases you want visible, we'd need to inspect every class in the system and derive from that whether anything had changed. Or add in one more serial number that we don't use, but which you want because it's how MRI does things in an internal, unofficial API.

And that's what I mean by depending on implementation-specific details.

Rather, I'm depending on it on recent MRI, and hoping other runtimes will adopt something similar. RubyVM is strictly MRI-specific.

Ok, we're on the same page about having our own JRuby-specific VM namespace for these sorts of internal details. However, like RubyVM, I'm not really comfortable blessing such an interface as a supported API, since our internal details have and will change.

I think we can come up with something appropriate for your use...but I don't feel like I understand all the cases you want to cover yet. Just asking for something "like MRI" is a bit vague, because we are similar in some ways and very, very different in others. I'm not sure the stat values MRI exposes directly map to anything in JRuby, but parts of them map to parts of us.

@dgutov
dgutov commented Jan 7, 2017

We assign a new ID to a class exactly once

That seems to be the main point. Thanks.

This ID is used for type-checking: to see if the incoming receiver object is the same as it was last time (or visibly indistinguishable from what we saw last time). In your case, adding bar to the class does not change the type identity of existing C objects.

I imagine you also have to check there that the receiver's method table hasn't changed. So JRuby checks two counters (or does something similar), while MRI gets away with checking just one. I suppose it's a tradeoff between having a quicker validation check and invalidating fewer callsites when a method is (re)defined.

So in order to make all the cases you want visible, we'd need to inspect every class in the system and derive from that whether anything had changed. Or add in one more serial number that we don't use, but which you want because it's how MRI does things in an internal, unofficial API.

One more serial number would be ideal, but exposing method cache ids on each module might also work fast enough (I'd have to test), as well as provide a more correct invalidation information.

I'd need to know a good way to compute a cache key from them, though. Would summing them up be fast enough? Or if the method cache ids were to be made always-increasing and unique across all modules, I could just take a max.

And that's what I mean by depending on implementation-specific details.

I disagree with that characterization because, even if JRuby undergoes architecture changes in the future, it should still be possible to provide the stat information that would be useful for our purposes. Even if the contents of the hash will have to change somehow.

@dgutov
dgutov commented Jan 7, 2017

I think we can come up with something appropriate for your use...but I don't feel like I understand all the cases you want to cover yet. Just asking for something "like MRI" is a bit vague, because we are similar in some ways and very, very different in others.

Allow me to copy-paste from the first message in this issue:

We want to have a fast way to determine whether the set of modules, classes and methods in the current runtime didn't change since the last time a certain method was invoked.

That's basically the only requirement. So we'll have a certain structure that is derived from all methods in all classes defined in the current runtime. And we want to know when the said structure needs to be rebuilt to accurately reflect the current methods and classes.

We'll use it to make code completion faster in Pry (in the "call target is unknown" case). I also anticipate using it in another, non-REPL package, also for code completion.

I'm not sure if there are any uses of RubyVM.stat out there, aside from diagnosing performance problems on MRI. If something comes up, we could add those as additional requirements. All I've found so far as "MRI internals" articles and libs like https://github.com/simeonwillbanks/busted.

@headius
Member
headius commented Jan 9, 2017

I imagine you also have to check there that the receiver's method table hasn't changed. So JRuby checks two counters (or does something similar), while MRI gets away with checking just one. I suppose it's a tradeoff between having a quicker validation check and invalidating fewer callsites when a method is (re)defined.

By saying we check these values I was oversimplifying a bit. In both cases, the check is not actively done; it works with what's called a "safepoint" in the JVM. Normally, if nothing changes, there's no overhead. If something changes, the code gets invalidated and never run again, and code already running has to back off. So while we have finer-grained tracking of system changes, it actually translates into much less overhead than MRI.

I disagree with that characterization because, even if JRuby undergoes architecture changes in the future, it should still be possible to provide the stat information that would be useful for our purposes. Even if the contents of the hash will have to change somehow.

Sure, but that's exactly what I mean by impl-specific. The contents of that hash reflect how our runtimes work internally. If you're fine with keeping up with our internal changes, and having that hash potentially change across JRuby versions, then it's fine.

At this point I think having a single global serial number is probably the best way to go. All places where we tickle fine-grained invalidation state would also want to bump this other value.

@enebo You have any thoughts on this?

@dgutov
dgutov commented Jan 10, 2017 edited

So while we have finer-grained tracking of system changes, it actually translates into much less overhead than MRI.

Thanks. That got me re-reading your Invokedynamic blog entry.

If you're fine with keeping up with our internal changes, and having that hash potentially change across JRuby versions, then it's fine.

Yes, I'm fine with it. But I'm hoping that even with the contents of the hash changing between versions of MRI and JRuby, I won't have to change my code.

As long as the hash as a whole can serve as a change canary.

@enebo
Member
enebo commented Jan 10, 2017

@dgutov @headius can we get a summary on the plan? Basically we are just adding an additional counter which if it is seen to change it only means some type in the system has changed. It does not have any additional connection to the class ids or does it?

@dgutov
dgutov commented Jan 10, 2017 edited

Basically we are just adding an additional counter which if it is seen to change it only means some type in the system has changed.

And a (EDIT:) JRuby::VM.stat method which returns a hash with this counter as the sole key-value pair (for now).

It does not have any additional connection to the class ids or does it?

I'd just expect a certain relation: if a new class id is allocated, the counter must change as well.

@enebo
Member
enebo commented Jan 10, 2017

If this gives some value and is pseudo compatible I am ok with this. It should not be hard to add this behavior.

@headius
Member
headius commented Jan 10, 2017

Ok, so if we're willing to go forward with this...it's design time. Do we want to just add one new counter? Or should we take advantage of the fact that we don't actually use the counter values to add a few different ones. I'm thinking method table modifications, constant table modifications, current highest-known class ID at least. If we're going to add counters wouldn't it be better to expose different changes?

@enebo
Member
enebo commented Jan 10, 2017

@headius so long as the design supports adding new values we can start simple with obvious things but is this compatible codewise with how MRI supports it? If not should we even use the same API. I guess I am interested in somewhat working without people have to have 2 code paths but at the same time if we expose different stats perhaps we should expose them in a clean way and not try to be as similar to MRI API? I am just asking questions...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment