-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8298992: runtime/NMT/SummarySanityCheck.java failed with "Total committed (MMMMMM) did not match the summarized committed (NNNNNN)" #15306
Conversation
…tted (MMMMMM) did not match the summarized committed (NNNNNN)"
👋 Welcome back azafari! A progress list of the required criteria for merging this PR into |
@afshin-zafari The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
for (int index = 0; index < mt_number_of_types; index ++) { | ||
s->_malloc[index] = _malloc[index]; | ||
} | ||
size_t total_mallocs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename from total_mallocs
to total_size
? When I looked at the code at first I thought we were counting here the number of malloc operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
The loop in I am a bit uncomfortable with a loop here that in theory might never terminate even if in practice your testing reveals no issue. Just wanted to raise this issue, not 100% sure it needs to be addressed. |
I agree with you. The loop is limited now. Thanks. |
Webrevs
|
total_size += _malloc[index].malloc_size(); | ||
} | ||
} while(s->_all_mallocs.size() != total_size && ++loop_counter < loop_limit); | ||
assert(s->_all_mallocs.size() == total_size, "Total != sum of parts"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preexisting, but could you please move this to the cpp file? It should not live in the header.
Also I'm not sure the assert is really useful. Won't that be just a source for very rare intermittent errors? What would be the action we take if that assert fires, increase the loop count?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We copy MemoryCounter
by value, not reference:
s->_all_mallocs = _all_mallocs;
where _all_mallocs
is MemoryCounter
, not *MemoryCounter
, so that assert can never trigger, right?
I mean, even if more memory gets allocated between when we exit the loop and assert, _all_mallocs.size()
might change, but s->_all_mallocs.size()
is frozen in time, isn't it?
So I agree that the assert is not helpful here, but it's because it will never trigger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the point was to compare the accumulated sum with the running total kept in _all_mallocs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that we make a snapshot of 2 moving parts: _all_mallocs
and _malloc[nmt_type]
The problem is that after we snapshot _all_mallocs
, _malloc[nmt_type]
can change, so that:
s._all_mallocs.size() != s._malloc[nmt_type1] + s._malloc[nmt_type2] + ...
so we keep trying in a loop until we catch the system in a quiet enough period where:
s._all_mallocs.size() == s._malloc[nmt_type1] + s._malloc[nmt_type2] + ...
so when we come out of the loop then the assert is guaranteed never to trigger, because we made copies that are not going to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we keep trying in a loop until we catch the system in a quiet enough period where:
s._all_mallocs.size() == s._malloc[nmt_type1] + s._malloc[nmt_type2] + ...
Either that or until the loop count is exhausted. That's what I meant - the assert could only fire if 100 tries were not enough to get a consistent picture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now what you meant. On one hand it would be nice to know if the loop was unable to get consistent copy, on the other hand if we do get an assert, the only thing we can do is increase the count.
total_size += _malloc[index].malloc_size(); | ||
} | ||
} while(s->_all_mallocs.size() != total_size && ++loop_counter < loop_limit); | ||
assert(s->_all_mallocs.size() == total_size, "Total != sum of parts"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we agree then that the assert on line 205 is not needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue here was that during copying malloc measures in the loop, some new allocations happen that change the copied items. This results in a mismatch of Total and the sum of items.
The ThreadCritical
in the code was supposed to block other threads' allocations while copying. But it did not work as expected, since the ThreadCritical
is used in a few deallocations in the code.
Therefore the while loop is written here to make sure that the malloc items that copied are consistent, i.e.
After Gerard's comment, the while-loop is upper limited to some iterations (loop_limit = 100
) rather than be an infinite loop.
So if after loop_limit
no of loops, the items are still not consistent then it is better to raise it here rather than to let this mismatch propagates up to the reports.
It is expected that replacing ThreadCritical
with mutex for NMT, will resolve the issue and no while-loop is needed anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I see the reaching loop counter issue now.
Are we going to replace ThreadCritical with NMT mutex here or is that going to be a different follow up issue?
I don't understand what kind of concurrency problem we're reaching here. If the
I don't expect that to be true, ThreadCritical is a mutex. |
If we remove the while loop and keep the assert, then in some rare tests in tiers 4 and 7 ( 3-4 cases in 700+ tests) that the assertion raised. I had used other asserts/logs in the code to find the roots of mismatching Total and Sum. |
I am referring to this issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with the change (with or without assert). Thanks for fixing.
The comment in
but I don't see I see atomic operations used to protect the individual memory counters, but I don't see anywhere where we protect the _all_mallocs + _malloc chunks. If anyone of them changes, then copying:
might at the end not to agree on:
Afshin proposed a simple fix, which even though not ideal, will get job done. An ideal fix would make sure that any operation involving changing Did I get this right? |
Some facts about these NMT metrics are:
The proposed small change here in this PR tries to be somewhere between fully-consistent metrics and never-consistent ones. |
OK, this fix looks good to me then as is. Thank you for fixing this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still good.
Thank you @tstuefe and @gerard-ziemski, for comments and review. |
@afshin-zafari This pull request has not yet been marked as ready for integration. |
/integrate |
@afshin-zafari This pull request has not yet been marked as ready for integration. |
/integrate |
@afshin-zafari Your integration request cannot be fulfilled at this time, as the status check |
@afshin-zafari This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 272 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the |
/integrate |
Going to push as commit bf63945.
Your commit was automatically rebased without conflicts. |
@afshin-zafari Pushed as commit bf63945. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
During exhaustive tests, it is observed that during taking snapshot of NMT metrics it is possible that new allocations happen concurrently, although a
ThreadCritical
is used during copying current metrics to the snapshot.A loop is surrounding the copying and checks whether the copied and original are the same.
Progress
Issue
Reviewers
Reviewing
Using
git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/15306/head:pull/15306
$ git checkout pull/15306
Update a local copy of the PR:
$ git checkout pull/15306
$ git pull https://git.openjdk.org/jdk.git pull/15306/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 15306
View PR using the GUI difftool:
$ git pr show -t 15306
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/15306.diff
Webrev
Link to Webrev Comment