Generalise weak reference processing for other languages #700
Conversation
... so that it works for other languages, too.
These work bucket stages are intended to replace the Java-specific stages.
- Improved documentation
- Tell the VM binding whether it is a nursery collection
- Fixed a bug where a newly opened bucket that is empty but has a boss would forget the boss and continue opening subsequent buckets
- Added Collection::vm_prepare to pair with vm_release
- Added a tls parameter to vm_release and process_weak_refs
- Fixed a typo
- The boss reschedules itself to VMRefForwarding when forwarding
- VM-side reference handling is always enabled regardless of the MMTK_NO_{FINALIZER,REFERENCE_TYPE} options
... so that when we add more information, bindings that don't use that extra info won't need to change their function signatures.
... so that ProcessWeakRefsContext can be a plain struct.
I think it is worthwhile to run some performance measurements (maybe two different runs):

1. Make sure that leaving the VM-side weak ref processing on (but with an empty implementation) does not cost anything (this will tell whether we need a switch for it).
2. Make sure that moving the current reference processor to the binding side does not cost anything (this will tell us the efficiency of the new API).

The latter is more important.
See my inline-comments for a few other issues.
- Split processing and forwarding of weak refs
- Renamed QueuingTracerFactory to ObjectTracerContext
- Reused ProcessEdgesWorkTracerContext in ScanObjects
- Added more comments
It happens at the same time as stop_all_mutators, and currently I am only using it for assertions. I'll consider adding it back later if there is a specific need.
This is a bit strange. For example, for

Indeed. That's a problem. build3-null is still treating weak references as strong if
I constructed a micro benchmark that creates 100,000 weak references. 50,000 of them point to live objects, and the rest point to dead objects, and all of them have a ReferenceQueue to enqueue to when they die.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;

public class RefProcTest {
    public static Probe probe = new RustMMTkProbe();
    public static int unused;

    public static void doTest(int total, int live) {
        // Java doesn't like `WeakReference<Object>[total]`.
        var refs = new ArrayList<WeakReference<Object>>(total);
        var liveObjs = new ArrayList<Object>(live);
        var refQueue = new ReferenceQueue<Object>();
        for (int i = 0; i < total; i++) {
            var obj = new Object();
            var ref = new WeakReference<Object>(obj, refQueue);
            refs.add(ref);
            if (i < live) {
                liveObjs.add(obj);
            }
        }
        System.gc(); // I know this always triggers GC with MMTk. It is not generally true, though.
        unused = refs.size() + liveObjs.size(); // just use them, in case GC treats them as garbage.
    }

    public static void main(String[] args) {
        final int total;
        final int live;
        final int warmups;
        final int iterations;
        // ... parse them from args
        for (int i = 0; i < warmups; i++) {
            doTest(total, live);
        }
        probe.begin("RefProcTest", 0, false);
        for (int i = 0; i < iterations; i++) {
            doTest(total, live);
        }
        probe.end("RefProcTest", 0, false);
    }
}
```

I ran them locally with the settings of build1 (master+master) and build3 (mmtk-core has the new API, and mmtk-openjdk uses the new API), with
p.s. (1) Other work packets took very little time. (2) The execution time varies a lot. Vanilla HotSpot with its default G1 GC has large variation, too, probably because of the nondeterministic nature of weak references themselves. Despite that, the data above is quite representative of a typical execution. Among the work packets,
From that table, I can identify two sources of inefficiency.
I looked into the code, and I found the difference between my API and the existing implementation. The existing implementation flushes the `ProcessEdgesWork` at the end:

```rust
impl<E: ProcessEdgesWork> GCWork<E::VM> for WeakRefProcessing<E> {
    fn do_work(&mut self, worker: &mut GCWorker<E::VM>, mmtk: &'static MMTK<E::VM>) {
        let mut w = E::new(vec![], false, mmtk);
        w.set_worker(worker);
        mmtk.reference_processors.scan_weak_refs(&mut w, mmtk);
        w.flush(); // here
    }
}
```

On the other hand, my API attempted to wrap the `ProcessEdgesWork` and split the resulting nodes manually:

```rust
fn with_tracer<R, F>(&self, worker: &mut GCWorker<E::VM>, func: F) -> R
where
    F: FnOnce(&mut Self::TracerType) -> R,
{
    let mmtk = worker.mmtk;
    // Prepare the underlying ProcessEdgesWork.
    let mut process_edges_work = E::new(vec![], false, mmtk);
    // FIXME: This line allows us to omit the borrowing lifetime of worker.
    // We should refactor ProcessEdgesWork so that it uses `worker` locally, not as a member.
    process_edges_work.set_worker(worker);
    // Create the tracer.
    let mut tracer = ProcessEdgesWorkTracer { process_edges_work };
    // The caller can use the tracer here.
    let result = func(&mut tracer);
    // Flush the queued nodes.
    let ProcessEdgesWorkTracer {
        mut process_edges_work,
    } = tracer;
    // ============ BEGIN: PROBLEMATIC CHUNK ======================
    let next_nodes = process_edges_work.pop_nodes();
    if !next_nodes.is_empty() {
        // Divide the resulting nodes into appropriately sized packets.
        let work_packets = next_nodes
            .chunks(E::CAPACITY)
            .map(|chunk| {
                Box::new(process_edges_work.create_scan_work(chunk.into(), false)) as _
            })
            .collect::<Vec<_>>();
        worker.scheduler().work_buckets[self.stage].bulk_add(work_packets);
    }
    // ============ END: PROBLEMATIC CHUNK ======================
    result
}
```

And that "PROBLEMATIC CHUNK" is the culprit. I replaced it with

This managed to make

I'll try to run lusearch and zxing again to see if it makes a difference.
I was wrong again.

I copied the code from

I then added

So the culprit is the hot

I managed to implement auto-chunking so that it will flush and create a work packet as soon as the nodes buffer in

In the following test, the
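The auto-chunking described above can be sketched as follows. This is a simplified, stand-alone model, not the actual mmtk-core code: `ObjectReference` is a stand-in type, `CAPACITY` stands in for `ProcessEdgesWork::CAPACITY`, and `flush` models handing a packet of nodes to the scheduler.

```rust
/// Stand-in for an object reference.
type ObjectReference = usize;

/// Stand-in for ProcessEdgesWork::CAPACITY.
const CAPACITY: usize = 4096;

struct AutoFlushTracer {
    buffer: Vec<ObjectReference>,
    /// Work packets created so far; each inner Vec models one packet.
    packets: Vec<Vec<ObjectReference>>,
}

impl AutoFlushTracer {
    fn new() -> Self {
        AutoFlushTracer { buffer: Vec::new(), packets: Vec::new() }
    }

    /// Trace one object. Instead of letting the buffer grow without bound
    /// (the "PROBLEMATIC CHUNK" split it only once, at the very end), we
    /// flush a full-sized packet as soon as the buffer reaches CAPACITY.
    fn trace_object(&mut self, object: ObjectReference) -> ObjectReference {
        self.buffer.push(object);
        if self.buffer.len() >= CAPACITY {
            self.flush();
        }
        object
    }

    /// Turn the current buffer into one work packet and hand it to the
    /// scheduler (modelled here by just collecting it).
    fn flush(&mut self) {
        if !self.buffer.is_empty() {
            let nodes = std::mem::take(&mut self.buffer);
            self.packets.push(nodes);
        }
    }
}

fn main() {
    let mut tracer = AutoFlushTracer::new();
    for i in 0..10_000 {
        tracer.trace_object(i);
    }
    tracer.flush(); // flush the final partial packet
    // 10,000 nodes => two full packets of 4096 and one of 1808.
    assert_eq!(tracer.packets.len(), 3);
    assert_eq!(tracer.packets[0].len(), CAPACITY);
    assert_eq!(tracer.packets[2].len(), 10_000 - 2 * CAPACITY);
}
```

The point of the design is that packets become available to other GC workers while the tracer is still running, instead of all at once at the very end.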
[Plots: STW time and total time, normalised to build1-null]

In both lusearch and zxing, the STW time of build5 is very close to that of build1 and build2 when reference processing is enabled. When testing with my synthetic micro-benchmark, build5 consistently outperforms build1, especially in terms of time consumed in

Here are the plots for build1, build2 and build5 for more tests in the DaCapo Chopin benchmark suite.

[Plots: STW time (normalised to build1-null), STW time (normalised to build1-1), total time (normalised to build1-1), number of GCs (normalised to build1-1)]

From the plots, we see that both the STW time and the total time of build5 are almost as good as build1 on average. As for the STW time of zxing, build5-1 still has a noticeable overhead compared to build1-1 and build2-1, and the error is small. It could be due to not parallelising Release and RefEnqueue, but other possibilities exist. The behaviour of some benchmarks changed significantly after turning on reference processing.
We probably need to chunk work packets. @wks, didn't you look into the size of work packets a while ago and find there were packets >>> 4096 (the default size)?
Using

```rust
tracer_context.with_tracer(worker, |tracer| {
    self.reference_processors
        .scan_weak_refs(|o| tracer.trace_object(o)); // This inner lambda is not inlined.
});
```

We know how hot

In the following plot, build1 and build2 are like before. Both build3 and build4 applied the auto-flushing mechanism of enqueued nodes in mmtk-core. The difference is that build4 applied the two optimisations above in the mmtk-openjdk repo.
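The comment in the snippet above flags that the inner lambda is not inlined. The actual mmtk-openjdk optimisations are elided above, but as a generic illustration of the underlying issue: when a hot callback is taken as `&mut dyn FnMut`, every call goes through a vtable, whereas taking it as a generic `impl FnMut` parameter lets the compiler monomorphise and inline it. The function names below are illustrative, not real mmtk APIs.

```rust
// Dynamic dispatch: every call to `trace` is an indirect call through a
// fat pointer; the compiler generally cannot inline it.
fn scan_weak_refs_dyn(refs: &[usize], trace: &mut dyn FnMut(usize) -> usize) -> usize {
    refs.iter().map(|&r| trace(r)).sum()
}

// Static dispatch: the closure type is a generic parameter, so this
// function is monomorphised per closure and the call can be inlined.
#[inline]
fn scan_weak_refs_generic<F: FnMut(usize) -> usize>(refs: &[usize], mut trace: F) -> usize {
    refs.iter().map(|&r| trace(r)).sum()
}

fn main() {
    let refs: Vec<usize> = (0..1000).collect();
    // Both compute the same result; only the dispatch mechanism differs.
    let a = scan_weak_refs_dyn(&refs, &mut |o| o + 1);
    let b = scan_weak_refs_generic(&refs, |o| o + 1);
    assert_eq!(a, b);
}
```

On a loop executed millions of times per GC, the difference between an inlined call and an indirect call can be visible in work-packet timings, which matches the build3-vs-build4 observations below.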
[Plots: STW time (normalised to build1-1) and total time (normalised to build1-1)]

From the plots, we can see that the STW time and total time of build4 (with the two optimisations above) are better than build3 (without the optimisations), and are on par with build1 (master+master). Since the performance differences are a result of inlining details in

In the future, we should deprecate the reference processor and finalisation processor in mmtk-core, and update the reference/finaliser processing mechanism in mmtk-openjdk. The current ref processor in mmtk-core has a serious bottleneck, that is, the mutex lock on the

Update: I re-ran the same build1 and build4 with MarkCompact. Benchmarks show that the API allows reference processing to be implemented at least as efficiently as mmtk-core's built-in ref processor on MarkCompact, too.

[Plots: STW time and total time when using the MarkCompact GC, normalised to build1-1]

It is interesting that enabling ref processing makes lusearch run faster with MarkCompact.
Yes. I once found that many work packets are much smaller than 4096. Some of them have fewer than 10 items (objects/edges). The problem is that we currently have no way to merge work packets. Packets only get smaller, unless some objects have a large fan-out, i.e. one object points to many other objects. It would be helpful if the new API automatically chunked the enqueued nodes into properly sized work packets, and I have implemented that successfully. However, for my microbenchmark, packet size is not the problem. The mutex lock on the
Rename vm_release to post_forwarding, and run it in a dedicated work packet in the Release bucket.
I renamed

The following experiment investigates the performance impact of parallelising the reference-enqueuing work (done in
[Plots: STW time and total time, normalised to build1-1]

The experiment shows build4 performs the best among build{2,3,4}, but the difference is still within the confidence interval. I think the reason is that the earlier a work packet is scheduled, the earlier a GC worker can start doing that work.
@wks I feel you have understood and optimized enough on this. Let me know when you push all your changes, and I will take a look at this PR again.
LGTM. I suggest we run the binding tests before merging once the merge conflict is resolved.
///
/// NOTE: This will replace `RefEnqueue` in the future.
///
/// NOTE: Although this work packet runs in parallel with the `Release` work packet, it does not
We cannot guarantee this. The binding may access the plan in their implementation of `Collection::post_forwarding()`.
Indeed. While `Collection::post_forwarding()` doesn't expose the `Plan` to the binding, and the `Plan` field is declared as `pub(crate)`, many functions in the `memory_manager` module access the `plan` instance.

The timing of `RefEnqueue` (now `VMPostForwarding`) should be after all references are forwarded. That includes the `RefForwarding` and the `FinalizerForwarding` buckets (both subsumed by the new `VMRefForwarding` bucket). And it doesn't need to access the MMTk instance. However, it seems impossible to prevent the binding from accessing the plan in our current program structure. As long as the binding has a `&'static MMTk`, it can call functions in `memory_manager` and indirectly access the plan.

I think what we can do is tell the VM binding not to access the `plan` indirectly. But even that sounds like a bad idea, because we haven't exposed `Plan` anyway, and the binding has no way to know which API function indirectly accesses the `Plan`. So maybe we can provide the `Collection::post_forwarding()` hook like this to make sure our ref-processing code in mmtk-core can be implemented in the binding, but refactor mmtk-core in the future to solve the "`plan` vs `plan_mut`" problem from the root, so we can actually prevent the binding from accidentally accessing the plan.
This commit adds a language-neutral API to let the VM binding handle weak references and finalisers. Added VMRefClosure and VMRefForwarding work bucket stages to replace the Java-specific stages. Added a "sentinel" mechanism to execute a work packet when a bucket is drained, and prevent the GC from entering the next stage, making it possible to expand the transitive closure multiple times. It replaces GCWorkScheduler::closure_end. Extended the Collection::process_weak_refs method to allow the binding to trace objects during weak reference processing. Renamed Collection::vm_release to Collection::post_forwarding, and it is now executed in a dedicated work packet.
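The sentinel mechanism described above can be sketched as follows. This is a simplified stand-alone model, not mmtk-core code: `Bucket`, the closure-based packets, and the counting sentinel are all illustrative stand-ins. The sentinel plays the role of `VMProcessWeakRefs` calling `Collection::process_weak_refs`: when the bucket drains, it runs, and a `true` return means "I enqueued more work, expand the transitive closure again" rather than letting GC advance to the next stage.

```rust
/// A work bucket with an optional sentinel that runs when it drains.
struct Bucket {
    packets: Vec<Box<dyn FnMut()>>,
    sentinel: Option<Box<dyn FnMut() -> bool>>,
}

impl Bucket {
    /// Drain all packets; whenever the bucket becomes empty, run the
    /// sentinel. A `true` return means the sentinel needs another pass,
    /// so we loop instead of closing the bucket.
    fn drain(&mut self) -> usize {
        let mut passes = 0;
        loop {
            while let Some(mut p) = self.packets.pop() {
                p();
            }
            passes += 1;
            let again = match self.sentinel.as_mut() {
                Some(s) => s(),
                None => false,
            };
            if !again {
                return passes;
            }
        }
    }
}

fn main() {
    // A sentinel that asks for one extra pass, e.g. to process a second
    // strength level after the first closure is complete.
    let mut remaining = 2;
    let packets: Vec<Box<dyn FnMut()>> =
        vec![Box::new(|| println!("closure packet"))];
    let sentinel: Option<Box<dyn FnMut() -> bool>> = Some(Box::new(move || {
        remaining -= 1;
        remaining > 0 // true => reopen the bucket for another pass
    }));
    let mut bucket = Bucket { packets, sentinel };
    assert_eq!(bucket.drain(), 2);
}
```

Looping until the sentinel declines is what makes it possible to expand the transitive closure multiple times, which `closure_end` could not express as generally.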
Draft: I still need to make it parallelisable (it currently wraps `ProcessEdgesWork`) so that it can be parallelised.

Related repos:

- mmtk-openjdk: I moved the `ReferenceProcessor` and `FinalizableProcessor` from mmtk-core to mmtk-openjdk and managed to get them running. However, it is a proof of concept only, and shall not be merged.
- mmtk-ruby: this mechanism handles `obj_free`, finalisers and global weak tables in Ruby. As a by-product, this PR also enables mmtk-ruby to fully support Ruby finalisers (`ObjectSpace::define_finalize`). I'll merge it for mmtk-ruby after this PR is merged.

This PR adds a mechanism to let the VM binding handle weak references and finalisers.
Changes made in this PR include:

- Added the `VMRefClosure` work bucket stage: expand the transitive closure by letting the VM binding trace and/or forward edges in weak references and/or tables of finalisable objects. The decision of which edge to trace and which edge to clear is made by the binding.
- Added the `VMRefForwarding` work bucket stage: as part of the "compaction" step in MarkCompact, it lets the VM binding forward those edges.
- Added a "sentinel" mechanism. It replaces the `Scheduler::closure_end` callback and acts as a more general mechanism. The `VMProcessWeakRefs` work packet is the sentinel for `VMRefClosure` and `VMRefForwarding`. It calls `Collection::process_weak_refs`.
- Extended the `Collection::process_weak_refs` trait method to provide extra parameters:
    - `worker`: The current GC worker.
    - `context`: Provides more context, such as whether the current GC is a nursery GC, and whether the current trace is for marking or forwarding.
    - `tracer_factory`: It can instantiate `ObjectTracer` and call `ObjectTracer::trace_object`. It basically wraps `ProcessEdgesWork` and its initialisation and flushing, so that the VM binding can call `trace_object` without intimate knowledge of the `ProcessEdgesWork` trait.
- The binding can request that `Collection::process_weak_refs` be called again when the transitive closure is finished again. This allows multiple strength levels and ephemerons to be implemented.

This PR basically resurrects and extends the mechanism introduced in 2b55f89. This PR kept the following mechanisms:
- `Collection::vm_release`: This can be used to do what `RefEnqueue` used to do.
- Added `Collection::vm_prepare` to make the API symmetric.

Differences are:

- The `VMRefClosure` and `VMRefForwarding` buckets: Weak reference processing is done in those dedicated buckets instead of `Closure`.
- Replaced `closure_end` with "sentinels" for `VMRefClosure` and `VMRefForwarding`. `Collection::process_weak_refs` returns a `bool` and has similar semantics as the return value of `closure_end`.

Rationale
Why do we introduce another mechanism while MMTk core has ReferenceProcessor and FinalizableProcessor?

- Ruby's `obj_free` is like Java's `finalize()` but doesn't resurrect objects;
- Ruby's `ObjectSpace.define_finalize` is like `PhantomReference` but can be unregistered;
- Ruby's `finalizer_table`, `generic_iv_tbl_`, `id_to_obj_tbl` and `obj_to_id_tbl` are weak tables.

Old discussions
Known issue: can we parallelise `Collection::process_weak_refs`?

(Update: I added a "TracerFactory" so that different GC workers can instantiate `ObjectTracer` and call `trace_object`.)

In this PR, `Collection::process_weak_refs` is called in one work packet running on one GC worker. This means `process_weak_refs` itself is executed sequentially. However, after it calls `trace_object` multiple times, the underlying `ProcessEdgesWork` may split the node list into multiple work packets. One question will be whether we should make `process_weak_refs` parallel, too, so we can process weak references in multiple work packets simultaneously.

Currently, the `ReferenceProcessor` and `FinalizableProcessor` in mmtk-core do not parallelise their `ReferenceProcessor::scan_{soft,weak,phantom}_refs` and `FinalizableProcessor::scan` methods. So when we reimplement our current `ReferenceProcessor` and `FinalizableProcessor` in the binding on top of the new mechanism in this PR (see 1), they have the same level of parallelism as before.

But from my observation, the lxr branch is parallelising weak reference processing by creating multiple work packets (see 2) that have intimate knowledge about the `ProcessEdgesWork` work packet.

I hesitate to expose `ProcessEdgesWork` to the VM binding, because I think it is an implementation detail of mmtk-core, and GC algorithms can actually implement node-enqueuing tracing instead of edge-enqueuing tracing. All the VM binding needs is the ability to call `trace_object` (and I defined the `ProcessWeakRefsTracer` trait this way. See 3).

The problem is, for our current mmtk-core, we can only implement `ProcessWeakRefsTracer` via `ProcessEdgesWork`. In this PR, I implemented it by wrapping a reference to `ProcessEdgesWork` in it, calling `set_worker` before calling `Collection::process_weak_refs`, and getting its nodes list after it. See 4.

But if we wrap a `ProcessEdgesWork` instance inside a `ProcessWeakRefsTracer`, we will have to expose `set_worker` and the flushing to the VM binding, and I think it is not very elegant.
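The trait shape argued for above, exposing only the ability to call `trace_object` while hiding `ProcessEdgesWork`, its `set_worker` call, and its flushing, might be sketched like this. The trait names follow the PR (`ObjectTracer`, `ObjectTracerContext`); everything else (`ObjectReference`, the toy types, the fixed forwarding offset) is a simplified stand-in.

```rust
/// Stand-in for an object reference.
type ObjectReference = usize;

/// The only capability the binding needs: trace an object and get back
/// its (possibly forwarded) address.
trait ObjectTracer {
    fn trace_object(&mut self, object: ObjectReference) -> ObjectReference;
}

/// Created by mmtk-core. The scoped `with_tracer` callback hides the
/// tracer's setup and flushing, so the binding never touches them.
trait ObjectTracerContext {
    type TracerType: ObjectTracer;
    fn with_tracer<R, F>(&self, func: F) -> R
    where
        F: FnOnce(&mut Self::TracerType) -> R;
}

/// A toy tracer that "forwards" objects by adding a fixed offset,
/// standing in for the real ProcessEdgesWork-backed tracer.
struct ToyTracer {
    offset: usize,
}

impl ObjectTracer for ToyTracer {
    fn trace_object(&mut self, object: ObjectReference) -> ObjectReference {
        object + self.offset
    }
}

struct ToyContext;

impl ObjectTracerContext for ToyContext {
    type TracerType = ToyTracer;
    fn with_tracer<R, F>(&self, func: F) -> R
    where
        F: FnOnce(&mut Self::TracerType) -> R,
    {
        // "Prepare" step (the real impl would create a ProcessEdgesWork
        // and call set_worker here).
        let mut tracer = ToyTracer { offset: 16 };
        let result = func(&mut tracer);
        // "Flush" step would go here in the real implementation.
        result
    }
}

fn main() {
    // What a binding's process_weak_refs might do: trace referents and
    // update its weak table with the forwarded addresses.
    let weak_table = vec![100, 200, 300];
    let forwarded = ToyContext.with_tracer(|tracer| {
        weak_table
            .iter()
            .map(|&o| tracer.trace_object(o))
            .collect::<Vec<_>>()
    });
    assert_eq!(forwarded, vec![116, 216, 316]);
}
```

Because the setup and flushing live inside `with_tracer`, the binding cannot forget them, and mmtk-core stays free to swap the tracer's backing implementation (edge-enqueuing or node-enqueuing) later.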