
NULL and movement check in process_edge #1032

Merged: wks merged 11 commits into mmtk:master from feature/early-null-check on Dec 20, 2023
Conversation

@wks (Collaborator) commented Nov 30, 2023

This PR does two things:

  1. ProcessEdgesWork::process_edge will skip slots that are not holding references (i.e. if Edge::load() returns ObjectReference::NULL).
  2. ProcessEdgesWork::process_edge will skip the Edge::store() if the object is not moved.

Doing (1) removes unnecessary invocations of trace_object() as well as the subsequent Edge::store(). It also allows slots to hold non-reference tagged values. In that case, the VM binding can return ObjectReference::NULL in Edge::load() so that mmtk-core will simply skip the slot, fixing #1031.

Doing (2) removes unnecessary Edge::store() operations in the case where the objects are not moved during trace_object. It reduces the STW time in most cases, fixing #574.

Fixes: #1031
Fixes: #574
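The two checks can be sketched as follows. This is a simplified model, not the actual mmtk-core code: the real `process_edge` is generic over the VM binding and `trace_object` is a plan method, while here a slot is just a `usize` word and object movement is simulated by a flag.

```rust
// Simplified model of the two checks added to ProcessEdgesWork::process_edge.
// A slot holds a raw word; 0 models ObjectReference::NULL (e.g. a slot that
// actually holds a tagged non-reference value).

// Stand-in for the plan's trace_object: returns a new address iff the
// (simulated) GC moved the object.
fn trace_object(object: usize, moved: bool) -> usize {
    if moved { object + 0x100 } else { object }
}

/// Returns true if the slot was written back.
fn process_edge(slot: &mut usize, moved: bool) -> bool {
    let object = *slot;
    // Check (1): skip slots that do not hold a reference at all.
    if object == 0 {
        return false;
    }
    let new_object = trace_object(object, moved);
    // Check (2): only store back if trace_object actually moved the object.
    if new_object != object {
        *slot = new_object;
        return true;
    }
    false
}

fn main() {
    let mut null_slot = 0usize;
    assert!(!process_edge(&mut null_slot, true)); // (1): skipped entirely

    let mut unmoved = 0x1000usize;
    assert!(!process_edge(&mut unmoved, false)); // (2): no write-back
    assert_eq!(unmoved, 0x1000);

    let mut moved = 0x1000usize;
    assert!(process_edge(&mut moved, true)); // moved: slot updated
    assert_eq!(moved, 0x1100);
    println!("ok");
}
```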

@wks mentioned this pull request Dec 7, 2023

Commit message: This allows the VM binding to report slots that hold tagged non-reference values as `ObjectReference::NULL` so that mmtk-core will not try to trace them. Also update the doc comments of `Edge::load` and `Edge::store` to mention tagged references.
@wks (Collaborator, Author) commented Dec 8, 2023

@k-sareen I took your advice and added a check after trace_object to skip the store() if the object is not moved. It fixes #574

I ran some experiments.

| Build | build1 | build2 | build3 |
| --- | --- | --- | --- |
| Check NULL before `trace_object`? | no | yes | yes |
| Check if object is moved after `trace_object`? | no | no | yes |
  • Build1 is the master branch.
  • Build2 is currently this PR. It checks if the return value of slot.load() is NULL and, if it is, skips the slot completely.
  • Build3 is here. It further adds, on top of build2, a check of whether the object is moved and, if it is not moved, skips the store operation.

I tested with OpenJDK (unmodified). I ran lusearch of DaCapo Chopin on bobcat.moma, 40 invocations, 2.5x min heap size, no PGO.

Results (normalized to build1)
(histogram omitted; see the Plotty link below)

Plotty: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|bobcat-2023-12-08-Fri-072543&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.stw&|10&iteration^1^4|20&1^invocation|30&1&mmtk_gc&build;build1|41&Histogram%20(with%20CI)^build^mmtk_gc&

The effect of this PR (build1 vs build2) is uncertain. It slows down GenCopy, Immix and StickyImmix, but speeds up GenImmix and SemiSpace. The difference in Immix (0.7%) is likely the result of more GCs being triggered for some reason. The differences for the other plans are less than 0.3% and within the confidence intervals.

The effect of checking for object movement after trace_object is significant. It speeds up all plans except SemiSpace, which moves every object in every GC. One possible fix is adding a predicate `always_move_objects()` in addition to `may_move_objects()` so that SemiSpace can return true and omit the check.
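A minimal sketch of what such a predicate pair could look like, using associated constants so the comparison compiles away. The trait and names here are hypothetical, not the actual mmtk-core API:

```rust
// Hypothetical plan constraints (names assumed, not real mmtk-core items).
trait PlanConstraints {
    /// Whether trace_object may return a different address at all.
    const MAY_MOVE_OBJECTS: bool;
    /// Whether trace_object *always* returns a different address for live
    /// objects (true for a plan like SemiSpace that copies everything).
    const ALWAYS_MOVE_OBJECTS: bool = false;
}

struct SemiSpace;
impl PlanConstraints for SemiSpace {
    const MAY_MOVE_OBJECTS: bool = true;
    const ALWAYS_MOVE_OBJECTS: bool = true;
}

struct Immix;
impl PlanConstraints for Immix {
    const MAY_MOVE_OBJECTS: bool = true; // default: ALWAYS_MOVE_OBJECTS = false
}

// Only compare `new_object != object` when movement is uncertain; since
// both operands are constants per monomorphized instance, the compiler
// resolves this at compile time.
fn needs_movement_check<P: PlanConstraints>() -> bool {
    P::MAY_MOVE_OBJECTS && !P::ALWAYS_MOVE_OBJECTS
}

fn main() {
    assert!(!needs_movement_check::<SemiSpace>()); // always store, no check
    assert!(needs_movement_check::<Immix>()); // check, then maybe store
    println!("ok");
}
```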

@k-sareen (Collaborator) commented Dec 11, 2023

Hmm. That's a bit odd, to be honest. Also, even with SemiSpace we may not actually forward an object if the object is in the LOS or a non-moving space. I get that it's an extra check, but I would have thought that in principle it would reduce unnecessary memory traffic and so be faster.

@k-sareen (Collaborator) commented

Oh sorry, I misread the graphs (could I suggest giving them more semantic names instead of "build{1,2,3}"?). Yes, skipping the object write for unmoved objects is a win, which is good. I'll note that the percentage differences are minuscule (< 1% in the graphs) and the error bars are large, so either way this PR does not hurt performance. I would recommend running a bigger experiment with all the benchmarks to see the impact across the entire suite.

@wks (Collaborator, Author) commented Dec 11, 2023

This is my favorite test: tracing over a static binary search tree. Each run of the test program triggered 100 GCs (plus 10 warm-ups) using System.gc() (which I am sure always triggers a GC), and recorded the time of each GC. All GCs ran with only 1 GC worker thread to exaggerate the tracing time. The three builds were the exact same binaries as in my last post.

In the following figure, each sub-figure is one Plan*Build combination. Each vertical violin is one execution of the test program, each dot is one GC, and the y-axis is the time of that GC. The violin shape shows the clustering; the colored horizontal bar in the middle is the median, and the black horizontal bar is the mean.

(violin plots omitted)

Obviously, SemiSpace manifests a bi-modal distribution because the cost of dispatching the object to copyspace0 and copyspace1 differs. (We discussed the bi-modal distribution before; see #952 (comment).) The `new_object != object` test systematically adds a cost, as the entire "violin" shifts upwards. But its effect is even smaller than that of testing which space an object is in (i.e. `if copyspace0.in_space(object) { ... } else if copyspace1.in_space(object) { ... } else ...`). I will not worry about the SemiSpace slowdown due to this PR, because if we refactor the dispatching we will get a more significant improvement in the future.

And the effect on Immix is obvious. The speed-up is consistent and significant, probably because of the reduced memory traffic.

The effect on StickyImmix is a bit mixed. It reduced the variation in GC time and reduced the maximum GC time. Although the median (colored bar in the middle) is higher, the GCs below the median are significantly faster, so the mean (black bar) barely changes. Overall, the effect on StickyImmix is not significant.

I think the conclusion is that this PR does not have significant impact on performance. I'll test it with more benchmarks.

@k-sareen (Collaborator) commented

Could you remove the couple of outliers we have for Gen{Copy,Immix}? I'm just interested to see the actual trend and it's currently being dominated by the outliers. Also just to confirm my understanding, each violin plot (1-5) for each subfigure is the same configuration but a different invocation?

@wks (Collaborator, Author) commented Dec 11, 2023

Could you remove the couple of outliers we have for Gen{Copy,Immix}? I'm just interested to see the actual trend and it's currently being dominated by the outliers.

The following plots have the data points with z-score > 2 removed (removing only z-score > 3 would still keep many outliers).

(figure omitted)

Also just to confirm my understanding, each violin plot (1-5) for each subfigure is the same configuration but a different invocation?

Yes. The five violins in each subfigure have exactly the same configuration. They are just five different invocations.

@k-sareen (Collaborator) commented Dec 11, 2023

So avoiding the write is a performance win for all GCs except SS, which is juuust slightly worse?

@wks (Collaborator, Author) commented Dec 11, 2023

So avoiding the write is a performance win for all GCs except SS which is juuust slightly worse?

Yes. That's expected. The added check itself has overhead, but if it avoids more overhead from the write, it is still a net win. SemiSpace should be the worst case because it copies every single object, and it is only slightly worse. Immix profits from this check the most because Immix is deliberately designed to avoid copying. I guess that if we enabled the "stress GC" options for ImmixSpace it would be slower too, because it would then also move every single object.

@k-sareen (Collaborator) commented

Okay, I just ran into a correctness bug because we overwrote a forwarding pointer in a slot. In ART, a slot can be the start of an object, since the class pointer is stored at the start of the object (as opposed to OpenJDK, where the start of the object is the mark word). In this case the object was already forwarded, but slot.load() returned the object with the forwarding status bits set due to a previous concurrency issue. We then overwrote the forwarding pointer with the address of the to-space object, corrupting the class pointer of the from-space object. I'd like to fast-track this PR; I'm happy to review it.

@wks (Collaborator, Author) commented Dec 17, 2023

This time I ran the three builds (the same binaries as in the previous post) against more benchmarks: 20 invocations, 3x min heap size w.r.t. "vanilla" OpenJDK with G1. (Note that the min heap size of some MMTk plans may sometimes be higher than G1's, so the effective factor may be less than 3x for some benchmarks.)

I omitted batik, eclipse, h2, jme and zxing because the min heap sizes of those benchmarks increase as the number of iterations increases, which could indicate a memory leak, possibly due to the lack of (soft/weak/phantom) reference processing. Tradebeans, tradesoap and tomcat were omitted, too, due to unexpected crashes and hangs during execution. I am not sure of the reason, but it may be some anomaly in the communication between the benchmark threads and the network server running during the benchmark.

(figure omitted)

Detailed results: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|vole-2023-12-13-Wed-173235&benchmark^build^invocation^iteration^mmtk_gc&GC^time^time.other^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark^mmtk_gc&build;build1|40&Histogram%20(with%20CI)^build^benchmark&

From the data, we see that

  1. (build1 -> build2) Skipping slots that contain NULL references has no visible effect on STW time.
  2. (build2 -> build3) Checking whether the object has moved before storing reduces the STW time significantly for most plans. SemiSpace is an exception: the check has no obvious effect there, for the reasons we discussed before.

Some benchmarks, including biojava, graphchi, jython, luindex and lusearch, benefit from the object-movement check when using generational plans. Others are not as sensitive.

From the data, I think the check for object movement after trace_object is profitable. I will add it to this PR, too, and it will fix #574. The check for NULL before trace_object is necessary for correctness, so I'll combine both checks into this PR.

@wks wks changed the title Skip slots where Edge::load() returns NULL NULL and movement check in process_edge Dec 18, 2023
@wks wks marked this pull request as ready for review December 18, 2023 06:54
```diff
 let new_object = self.trace_object(object);
-if Self::OVERWRITE_REFERENCE {
 debug_assert!(!new_object.is_null());
+if Self::OVERWRITE_REFERENCE && new_object != object {
```
Collaborator (review comment):

We don't need OVERWRITE_REFERENCE anymore. It's always true anyway.

Collaborator, Author:

SanityGCProcessEdges sets that constant to false.

Member:

There is no harm in checking it, and we should check it unless we remove OVERWRITE_REFERENCE. Otherwise, if a new implementation of ProcessEdgesWork sets OVERWRITE_REFERENCE to false, we will have a bug here.

Collaborator:

OVERWRITE_REFERENCE was a hack to implement sanity GC. Sanity GC does not move objects anyway, so the check could be obviated now. I'm guessing the compiler is smart enough to figure out that the constant is always true for real GC work packets and always false for sanity GC, so there is no performance overhead from the check. So at the end of the day, it's fine to keep it; I just think it's a bit redundant.
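The discussion above hinges on `OVERWRITE_REFERENCE` being an associated constant. A simplified sketch (not the real mmtk-core trait definitions) of why the branch is resolved per monomorphized instance and so costs nothing at runtime:

```rust
// Simplified stand-in for the ProcessEdgesWork trait (assumed shape).
trait ProcessEdgesWork {
    const OVERWRITE_REFERENCE: bool = true;
}

struct PlanProcessEdges;
impl ProcessEdgesWork for PlanProcessEdges {} // real GC work: keeps the default

struct SanityGCProcessEdges;
impl ProcessEdgesWork for SanityGCProcessEdges {
    // Sanity GC only traverses the heap; it never writes slots back.
    const OVERWRITE_REFERENCE: bool = false;
}

fn store_needed<W: ProcessEdgesWork>(object: usize, new_object: usize) -> bool {
    // The constant is known per monomorphized instance, so this whole
    // condition folds to `new_object != object` or `false` at compile time.
    W::OVERWRITE_REFERENCE && new_object != object
}

fn main() {
    assert!(store_needed::<PlanProcessEdges>(0x1000, 0x1100));
    assert!(!store_needed::<PlanProcessEdges>(0x1000, 0x1000));
    assert!(!store_needed::<SanityGCProcessEdges>(0x1000, 0x1100));
    println!("ok");
}
```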

@wks wks added the PR-testing Run binding tests for the pull request (deprecated: use PR-extended-testing instead) label Dec 18, 2023
@k-sareen (Collaborator) commented

@wks MarkCompact is failing the null object debug check in trace_object. Could you look into it?

@wks (Collaborator, Author) commented Dec 18, 2023

The OpenJDK binding test failed; the error is in the reference processor. trace_object now always asserts that its argument is not NULL. We need to update the documentation, too, wherever trace_object is exposed to the VM binding (such as in Scanning::process_weak_refs), and make it clear that the caller of trace_object is responsible for ensuring the object reference is not NULL.

@k-sareen (Collaborator) commented

You need to add the null checks here as well (in the entire file):

```rust
) -> ObjectReference {
```

@wks (Collaborator, Author) commented Dec 18, 2023

You need to add the null checks here as well (in the entire file):

) -> ObjectReference {

Yes. And I think this is one more reason why eliminating ObjectReference::NULL would improve the clarity of the API. See #1043.

@k-sareen (Collaborator) commented

I don't quite understand the relevance of the bug to that issue. With this change, we need to do a null check before we do trace_object (i.e. we've moved the check up one level).

@wks (Collaborator, Author) commented Dec 18, 2023

I don't quite understand the relevance of the bug to that issue.

The code:

```rust
let old_referent = <E::VM as VMBinding>::VMReferenceGlue::get_referent(reference);
let new_referent = ReferenceProcessor::get_forwarded_referent(trace, old_referent);
```

The reference processor called get_forwarded_referent(..., old_referent) without checking whether old_referent is NULL. If we make ObjectReference non-nullable, get_referent will have to return None if the referent has already been cleared, and Some(referent) if not. Since trace_object(object) always takes a non-NULL ObjectReference as a parameter, get_forwarded_referent, which is a wrapper of trace_object, will also have to require its old_referent argument to be a non-NULL ObjectReference. That will force the programmer to check whether old_referent is None before calling get_forwarded_referent, and the bug in the failed OpenJDK test would not appear in the first place.
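Under the proposed non-null design, the compiler itself would enforce that check. A minimal sketch with assumed, simplified signatures (the real API in #1043 may differ); `Option<NonZeroUsize>` has the same size as `usize`, so the null case costs no extra space:

```rust
use std::num::NonZeroUsize;

// A non-nullable object reference, as argued for in #1043 (sketch).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct ObjectReference(NonZeroUsize);

// Returns None once the referent is cleared (the raw slot word is 0),
// so the type system forces the check the buggy code omitted.
fn get_referent(raw_slot: usize) -> Option<ObjectReference> {
    NonZeroUsize::new(raw_slot).map(ObjectReference)
}

// Wrapper around trace_object: by construction it can only be called
// with a non-null reference (identity stand-in here).
fn get_forwarded_referent(referent: ObjectReference) -> ObjectReference {
    referent
}

fn main() {
    // Cleared referent: cannot even be passed to get_forwarded_referent.
    assert!(get_referent(0).is_none());

    // Live referent: the null check has necessarily happened already.
    if let Some(r) = get_referent(0x1000) {
        let _new = get_forwarded_referent(r);
    }
    println!("ok");
}
```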

With this change, we need to do a null check before we do trace_object (i.e. we've moved the check up one level).

Yes. With #1043 implemented, we will need to do null check before calling trace_object. And yes. It is moved one level above.

For the Ruby binding, this check always exists. Because a Ruby slot can hold a tagged non-reference value, the idiom is:

```c
if (SPECIAL_CONST_P(*field)) { /* Check if it holds a "special value". */
    continue;
}
rb_gc_mark_movable(*field); /* This calls `trace_object` underneath. */
```

`true`, `false`, `nil` and small integers are all "special values". If a slot holds a special value, the binding skips calling trace_object.

For bindings that do not have NULL references, no check is needed because there is no NULL. So I would argue that the check of whether a usize value represents NULL is VM-specific, and should be pushed up to the VM binding.

@wks (Collaborator, Author) commented Dec 19, 2023

... With this change, we need to do a null check before we do trace_object (i.e. we've moved the check up one level).

The commit 9648aed introduced the method ReferenceGlue::is_referent_cleared. The intention was that not all VMs clear referents to ObjectReference::NULL. Specifically, the Julia binding replaces the reference with a pointer to jl_nothing, a special Julia object representing the value Nothing. This means the reference processor must check whether the referent is cleared using ReferenceGlue::is_referent_cleared instead of comparing against ObjectReference::NULL. We can see the following idiom in the ReferenceProcessor:

```rust
let referent = <E::VM as VMBinding>::VMReferenceGlue::get_referent(*reference);
if !<E::VM as VMBinding>::VMReferenceGlue::is_referent_cleared(referent) { // cleared = NULL or special values
    Self::keep_referent_alive(trace, referent); // calls trace_object
}
```

This also reflects the fact that ObjectReference::NULL is not generally applicable to all VMs (not even most VMs, because Julia, Ruby and Python all represent Nothing, nil and None using values that are not ObjectReference::NULL).
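A hedged sketch of such a binding-defined cleared-referent test, modeled on the Julia case; the sentinel address and the raw-`usize` representation are made up for illustration:

```rust
// Hypothetical sentinel: the address of a VM's jl_nothing-like object.
// In a real binding this would come from the VM, not a constant.
const JL_NOTHING: usize = 0x1000;

// Binding-defined "is this referent cleared?" test: the binding, not
// mmtk-core, decides which raw values mean "cleared".
fn is_referent_cleared(referent: usize) -> bool {
    referent == 0 || referent == JL_NOTHING // NULL or the VM's sentinel
}

fn main() {
    assert!(is_referent_cleared(0)); // NULL-clearing VMs (e.g. OpenJDK)
    assert!(is_referent_cleared(JL_NOTHING)); // sentinel-clearing VMs (e.g. Julia)
    assert!(!is_referent_cleared(0x2000)); // a live referent
    println!("ok");
}
```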

@k-sareen (Collaborator) commented

My understanding is that ReferenceGlue is deprecated and slated for removal. The new API doesn't have this issue since everything is done by the VM/binding.

Re: #1043, yes I see. It would be a better API if ObjectReference was non-null.

@k-sareen (Collaborator) reviewed and approved:

LGTM. Thanks!

@k-sareen k-sareen added this pull request to the merge queue Dec 20, 2023
@wks wks removed this pull request from the merge queue due to a manual request Dec 20, 2023
@qinsoon (Member) reviewed and approved:

LGTM

@wks wks added this pull request to the merge queue Dec 20, 2023
Merged via the queue into mmtk:master with commit e2951ad Dec 20, 2023
20 checks passed
@wks wks deleted the feature/early-null-check branch December 20, 2023 07:16
k-sareen pushed a commit to k-sareen/mmtk-core that referenced this pull request Apr 11, 2024