8273107: RunThese24H times out with "java.lang.management.ThreadInfo.getLockName()" is null #25
Conversation
@dcubed-ojdk The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
I have logging in place for all of the RunThese configs to gather a little data. I have learned thru my own test jobs that a RunThese24H job doesn't reproduce this failure mode reliably.
The proposed solution in this patch is to call deflate_idle_monitors() from VM_ThreadDump::doit() and have it populate an ObjectMonitorsHashtable that maps each JavaThread to the ObjectMonitors it owns. Also, the usual work done by deflate_idle_monitors() and deflate_monitor_list() shrinks the in-use list as a side effect of the same pass.
This fix has been tested with Mach5 Tier[1-8]. I've also done some targeted testing. It would be good to hear from @dholmes-ora, @coleenp, @fisk and @robehn, who worked on the JDK-8253064 monitor list simplifications and getting rid of TSM. Update: Mach5 Tier[1-3] testing looks good.
This looks good to me. You solved both the problem of the very long monitor list and the per-thread iteration too. I like it!
@coleenp - Thanks for the draft review! It's now officially ready...
Webrevs
@@ -57,6 +57,44 @@
#include "utilities/events.hpp"
#include "utilities/preserveException.hpp"

class CleanupObjectMonitorsHashtable: StackObj {
 public:
  bool do_entry(JavaThread*& key, ObjectMonitorsHashtable::PtrList*& list) {
I don't see this being called by anything ???
See L71 below: _ptrs->unlink(&cleanup); // cleanup the LinkedLists
That passes the cleanup object to the ResourceHashtable::unlink() function, which calls the do_entry() function on the cleanup object.
Ah I see (well, I don't really see, as this is very obscure). I would have expected CleanupObjectMonitorsHashtable to be a subtype of something that I could then see is used by the ResourceHashtable. I'm going to guess that there is some template usage here that ensures the passed-in object has the right API even if there is no subtyping relationship. It's hard to see these connections at a distance.
To be clear, this is not a review, as I don't understand our Hashtables enough to make much sense of what is actually happening here.
It would be a nice place for a lambda (on the list of future RFEs). You could also put this class in scope of the function that calls it which might help. (optional)
ResourceHashtable::unlink() is definitely one of those template things that I don't really like, but I have to use it... All that stuff is in src/hotspot/share/utilities/resourceHash.hpp, but I don't see any precedent for putting the CleanupObjectMonitorsHashtable definition in there. I think I'll leave it where it is.
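For anyone else tripping over the same connection: here is a minimal, self-contained sketch of the pattern (generic names, not the actual resourceHash.hpp code), showing how a template member like unlink() can accept any object that provides a matching do_entry(), with no subtyping relationship:

#include <cstdio>

// A tiny table whose unlink() is duck-typed the same way as
// ResourceHashtable::unlink(): ITER can be any type that has
// bool do_entry(K&, V&); returning true removes the entry.
template <typename K, typename V>
class TinyTable {
  struct Node { K key; V value; Node* next; };
  Node* _head = nullptr;
 public:
  void put(K k, V v) { _head = new Node{k, v, _head}; }

  template <typename ITER>
  void unlink(ITER* iter) {
    Node** p = &_head;
    while (*p != nullptr) {
      if (iter->do_entry((*p)->key, (*p)->value)) {
        Node* dead = *p;
        *p = dead->next;
        delete dead;
      } else {
        p = &(*p)->next;
      }
    }
  }
};

// No base class needed: the name and signature are the whole contract.
struct CleanupAll {
  bool do_entry(int& key, const char*& value) {
    printf("cleaning %d -> %s\n", key, value);
    return true;  // unlink every entry
  }
};

int main() {
  TinyTable<int, const char*> table;
  table.put(1, "one");
  table.put(2, "two");
  CleanupAll cleanup;
  table.unlink(&cleanup);  // calls cleanup.do_entry() for each entry
  return 0;
}

The compiler checks the contract at the unlink() instantiation site, which is exactly why the connection is invisible until you find the template.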
@dholmes-ora - Thanks for reviewing!
@@ -279,6 +279,18 @@ void VM_ThreadDump::doit() {
    concurrent_locks.dump_at_safepoint();
  }

  ObjectMonitorsHashtable table;
  ObjectMonitorsHashtable* tablep = nullptr;
It looks like you can remove this pointer.
ThreadStackTrace::dump_stack_at_safepoint also looks at "with_locked_monitors"; if false, it ignores the table.
So there should not be any need to pass in null.
Right now ThreadStackTrace::dump_stack_at_safepoint() is flexible in that it does not require ObjectMonitorsHashtable* table to be non-null when _with_locked_monitors is true.
From this call site, when _with_locked_monitors is true we always pass a non-nullptr ObjectMonitorsHashtable* table, and we do that because we're at a safepoint and likely have to deal with a lot of threads, if not all the threads in the system.
For other ThreadStackTrace::dump_stack_at_safepoint() call sites, I want things to be more flexible, mostly because I haven't done detailed analysis of them to justify changing their behavior.
// Get the ObjectMonitors locked by this thread (if any):
ObjectMonitorsHashtable::PtrList* list = table->get_entry(_thread);
if (list != nullptr) {
  ObjectSynchronizer::monitors_iterate(&imc, list, _thread);
The InflatedMonitorsClosure walks the stack until the object is found, with this method:
ThreadStackTrace::is_owned_monitor_on_stack(oop object)
for every object...
If you instead just collect all locks on the stack in one pass, you don't have to walk the stack over and over, which should be a major speedup.
Sounds like an interesting RFE, but I'd prefer that be investigated
separately from this bug.
Clarification: ThreadStackTrace::is_owned_monitor_on_stack() is not actually
walking the (JavaThread's) stack, but it is walking the collection of
StackFrameInfo objects that was gathered earlier in
ThreadStackTrace::dump_stack_at_safepoint(). Adding a gather-locked-monitor
helper object to the "add_stack_frame()" call would provide an easy hook and
then InflatedMonitorsClosure could use the new helper object. Something like
that anyway...
Yes, sorry, we loop over the stack frame info for every object, i.e. a loop in a loop, which is bad for time-complexity.
If we created a sorted list from the frame info, we could binary search it instead.
Since a thread often has at most a few locks but could have hundreds of frames, just being able to skip looping over all the frames that hold no locks would be very beneficial.
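A standalone sketch of that suggestion (illustrative names, not the HotSpot code): gather the locked objects from the frame info in one pass, sort them, and answer each is-owned-monitor-on-stack query with a binary search instead of re-walking every frame:

#include <algorithm>
#include <vector>

struct FrameInfo {
  std::vector<const void*> locked_objects;  // objects locked in this frame
};

// One pass over the frames builds a sorted index of every locked object,
// replacing the per-monitor frame walk with a log-time lookup.
std::vector<const void*> collect_locked_objects(const std::vector<FrameInfo>& frames) {
  std::vector<const void*> locked;
  for (const FrameInfo& f : frames) {
    locked.insert(locked.end(), f.locked_objects.begin(), f.locked_objects.end());
  }
  std::sort(locked.begin(), locked.end());
  return locked;
}

// O(log n) membership test; no repeated walk over all the frames.
bool is_owned_monitor_on_stack(const std::vector<const void*>& locked,
                               const void* object) {
  return std::binary_search(locked.begin(), locked.end(), object);
}

Since most threads hold few or no locks, the sorted index stays tiny and the empty case costs almost nothing.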
@dcubed-ojdk This change now passes all automated pre-integration checks. After integration, the commit message for the final commit will be:
8273107: RunThese24H times out with "java.lang.management.ThreadInfo.getLockName()" is null
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 17 new commits pushed to the master branch.
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
Mach5 Tier[4-7] testing looks good. Update: Mach5 Tier8 testing for v0 finished and looks good.
ObjectMonitorsHashtable::PtrList* list = get_entry(jt);
if (list != nullptr) {
  // Add to existing list from the hash table:
  list->add(om);
These two lines are executed in both the if and the else clause. Consider putting them after instead.
I'll take a look at doing that. Thanks for the suggestion.
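A standalone sketch of the suggested cleanup (the types are stand-ins for ObjectMonitorsHashtable, and the content of the duplicated lines is a guess, since the diff above shows only the if side): branch only on creating the list, then run the common add exactly once afterwards:

#include <unordered_map>
#include <vector>

struct JavaThread {};
struct ObjectMonitor {};

class MonitorTable {
  std::unordered_map<JavaThread*, std::vector<ObjectMonitor*>*> _map;
 public:
  void add_monitor(JavaThread* jt, ObjectMonitor* om) {
    std::vector<ObjectMonitor*>*& list = _map[jt];
    if (list == nullptr) {
      // Only the "no list yet" branch differs: create and register it.
      list = new std::vector<ObjectMonitor*>();
    }
    // Common tail hoisted out of the if/else, executed exactly once:
    list->push_back(om);
  }
};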
                                          ObjectMonitorsHashtable::PtrList* list,
                                          JavaThread* thread) {
  typedef LinkedListIterator<ObjectMonitor*> ObjectMonitorIterator;
  ObjectMonitorIterator iter = ObjectMonitorIterator(list->head());
This seems to create an object and then copy it to iter. I think the more obvious syntax is ObjectMonitorIterator iter(list->head());
Nice catch. I'll fix that.
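For readers less steeped in C++, a tiny self-contained illustration of the difference (generic std types, not the HotSpot iterator):

#include <vector>

int main() {
  std::vector<int> v{1, 2, 3};
  typedef std::vector<int>::iterator Iter;

  // Copy-initialization: reads as "build a temporary, then copy it into a"
  // (compilers elide the copy, but the syntax is roundabout):
  Iter a = Iter(v.begin());

  // Direct-initialization: the more obvious syntax with the same effect:
  Iter b(v.begin());

  return (a == b) ? 0 : 1;
}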
// unowned ObjectMonitors that cannot be deflated because of a waiter. Since
// deflate_idle_monitors() and deflate_monitor_list() can be called more than
// once, we have to make sure the entry has not already been added.
JavaThread* jt = (JavaThread*)mid->owner();
Hmmmm. I'm not sure this is correct. The owner of an ObjectMonitor is a JavaThread most of the time, but not always. I think that if the owner started off with a stack lock, and another thread inflated the lock due to contention, then the ObjectMonitor will get the stack lock as the owner, as the random other thread is too lazy to look up which thread stack the stack lock belongs to.
Because of that, I think we might miss locks that really are owned by a thread, as the thread key now will not map to such locks. Some kind of ownership lookup probably needs to happen here.
Nice catch on an oddity of ObjectMonitor ownership!
The baseline code has the same "issue":
void ObjectSynchronizer::monitors_iterate(MonitorClosure* closure, JavaThread* thread) {
  <snip>
  ObjectMonitor* mid = iter.next();
  if (mid->owner() != thread) {
    continue;
  }
Please notice that I said "issue" and not "bug". This isn't a bug and you really
won't like the reason that it's not a bug. I know that I don't like it...
Because of this ownership check, monitors_iterate() won't apply the closure (InflatedMonitorsClosure in this case) to an ObjectMonitor that is marked as owned by a stack lock address, because it will never match the target JavaThread*.
Here's the closure:
// Iterate through monitor cache to find JNI locked monitors
class InflatedMonitorsClosure: public MonitorClosure {
 private:
  ThreadStackTrace* _stack_trace;
 public:
  InflatedMonitorsClosure(ThreadStackTrace* st) {
    _stack_trace = st;
  }
  void do_monitor(ObjectMonitor* mid) {
    oop object = mid->object();
    if (!_stack_trace->is_owned_monitor_on_stack(object)) {
      _stack_trace->add_jni_locked_monitor(object);
    }
  }
};
This particular monitors_iterate() call uses InflatedMonitorsClosure to
gather just the JNI locked monitors owned by the target JavaThread*.
A JNI locked monitor will never have a stack lock address as its owner
marking. It will always have a JavaThread*. So the baseline code isn't
"broken" because it will find the JNI locked monitors just fine. Similarly
the new code isn't "broken" because it will also find the JNI locked
monitors just fine.
I definitely need to clarify the baseline comments and the new code
comments to make this subtlety much more clear.
@fisk - Thanks for the review!
I made changes to address @fisk's review. I've also included comment clarifications. Tested this latest patch with Mach5 Tier1. Mach5 Tier[2-3] are running now. Update: Mach5 Tier[1-5] look good.
@coleenp pointed out off-line that I had a bit of type confusion in my new code. My most recent push to this PR cleans up the type confusion. Update: Also ran Mach5 Tier6 and Tier7 on this latest patch and those results look good. Update: Tier8 results also look good.
When the owner field of an ObjectMonitor is the address of a stack lock instead of a thread, it doesn't mean that it is a stack lock. It means that it was a stack lock when it was acquired by that thread, but has since then been upgraded to a full ObjectMonitor (by another thread), due to contention. But the fact that it was a stack lock is irrelevant in this context; we care about what it is now. And it is an inflated lock that just happens to have a slightly obscured owner field, as the owner thread was not easily available when it was inflated. In other words, I still believe this is a bug. Such a monitor that gets upgraded from a stack lock to a full ObjectMonitor can - unlike actual stack locks - be involved in a deadlock that we are trying to find. If we don't associate such ObjectMonitor objects with their owner thread, then we will not be able to find the deadlock as we are supposed to with this code.
Re: both Robbin's and Erik's comments. I think Robbin's should be an RFE and Erik's should be another bug report and fixed separately. This change fixes the timeout that's been observed multiple times and preserves existing behavior (with existing bugs). I think some redesign is in order to make this correct all the time.
I'm okay with fixing the other bug in a separate RFR.
For the record, Threads::owning_thread_from_monitor_owner seems to be the function to look up who really owns an ObjectMonitor, based on the value in the owner field, to deal with the stuff I talked about.
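The idea behind such a lookup can be sketched standalone (an illustration of the technique, not HotSpot's actual Threads::owning_thread_from_monitor_owner()): when the owner field holds a stack lock address instead of a thread, the real owner is the thread whose stack spans that address:

#include <cstdint>
#include <vector>

struct ThreadInfo {
  uintptr_t stack_base;  // highest address of this thread's stack
  uintptr_t stack_end;   // lowest address of this thread's stack
};

// owner is either a ThreadInfo* or the address of a stack lock (a BasicLock
// slot) somewhere on the owning thread's stack.
ThreadInfo* owning_thread_from_owner(const std::vector<ThreadInfo*>& threads,
                                     uintptr_t owner) {
  for (ThreadInfo* t : threads) {
    if (owner == reinterpret_cast<uintptr_t>(t)) {
      return t;  // owner field holds the thread itself
    }
    if (owner >= t->stack_end && owner < t->stack_base) {
      return t;  // owner field holds a stack lock address on t's stack
    }
  }
  return nullptr;  // unowned, or the owner has already exited
}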
I need to address these comments from @fisk:
What @fisk says here is correct and I'm probably guilty of over-simplifying what happens. @fisk, please notice that I don't say that this is a stack lock. I say this is an ObjectMonitor that is marked as owned by a stack lock address, and that is the "bug" in your opinion. The baseline code and the new code behave the same way here; I think we are in strong agreement about what the attributes of the "bug" are.
You are missing part of the algorithm in the code that calls monitors_iterate(). If the closure does not find the ObjectMonitor's object among the JavaThread's stack frames, the object is added as a JNI locked monitor. Because an ObjectMonitor that is marked as owned by a stack lock address won't match the target JavaThread*, the closure is never applied to it. I have tried to clarify this in the comments that I added as part of this patch.
I've filed the following RFE for @robehn's improvement: JDK-8278968 InflatedMonitorsClosure repeats walks of StackFrameInfo and should cache instead
/integrate |
Going to push as commit a5f5d60.
Your commit was automatically rebased without conflicts.
@dcubed-ojdk Pushed as commit a5f5d60.
I guess you can recursively JNI lock an ObjectMonitor that has been promoted from a stack lock. Then it's a lock held by JNI and we won't report it here.
But this is JVMTI code, so there is probably some other random check somewhere else that checks for that, so I'm just going to assume this works somehow.
Ship it!
If you recursively JNI lock an ObjectMonitor that has been promoted from a stack lock, then ObjectMonitor::enter() will recognize the stack lock address as being owned by the current thread and will switch the owner field from the stack lock address to the JavaThread*.
@fisk - Thanks for closing the loop on this thread.
RunThese24H sometimes times out with a couple of error msgs:
The '"java.lang.management.ThreadInfo.getLockName()" is null' error msg was
due to RunThese's use of an older JCK test suite which has since been fixed.
The 'ForkJoin common pool thread stuck' failure mode is likely due to a thread
spending a lot of time in ObjectSynchronizer::monitors_iterate() due to a
VM_ThreadDump::doit() call. I say "likely" because I've never been able to
reproduce this failure mode in testing outside of Mach5. With the Mach5
sightings, all we have are thread dumps and core files and not a live process.
The VM_ThreadDump::doit() call is trying to gather owned monitor information
for all threads in the system. I've seen sightings of this failure mode with > 2000
threads. I've also seen passing runs with > 1.7 million monitors on the in-use list.
Imagine searching a larger in-use list for > 2000 threads. It just doesn't scale.
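The shape of the fix, as a standalone sketch (illustrative types; the real patch populates an ObjectMonitorsHashtable during deflate_idle_monitors()): one walk of the in-use list buckets every monitor by its owner, so each thread's dump then consults only its own bucket instead of searching the whole list:

#include <unordered_map>
#include <vector>

struct ObjectMonitor { void* owner; };

// One O(monitors) pass replaces O(threads * monitors) repeated searches.
std::unordered_map<void*, std::vector<ObjectMonitor*>>
bucket_by_owner(const std::vector<ObjectMonitor*>& in_use_list) {
  std::unordered_map<void*, std::vector<ObjectMonitor*>> table;
  for (ObjectMonitor* om : in_use_list) {
    if (om->owner != nullptr) {
      table[om->owner].push_back(om);  // keyed by owning thread
    }
  }
  return table;
}

// Per-thread dump: touch just this thread's monitors, if any.
const std::vector<ObjectMonitor*>* monitors_for(
    const std::unordered_map<void*, std::vector<ObjectMonitor*>>& table,
    void* thread) {
  auto it = table.find(thread);
  return (it == table.end()) ? nullptr : &it->second;
}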
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.java.net/jdk18 pull/25/head:pull/25
$ git checkout pull/25
Update a local copy of the PR:
$ git checkout pull/25
$ git pull https://git.openjdk.java.net/jdk18 pull/25/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 25
View PR using the GUI difftool:
$ git pr show -t 25
Using diff file
Download this PR as a diff file:
https://git.openjdk.java.net/jdk18/pull/25.diff