
Generalise weak reference processing for other languages #700

Merged (29 commits), Jan 16, 2023

Conversation

@wks (Collaborator) commented Nov 14, 2022

Draft: I still need to make it parallelisable

  • Non-MarkCompact for OpenJDK
  • MarkCompact for OpenJDK
  • MarkSweep and Immix for Ruby
  • Let VM binding instantiate tracers (wrapping ProcessEdgesWork) so that it can be parallelised.

Related repos:

  • https://github.com/wks/mmtk-openjdk/tree/gen-weakref-api: I ported the ReferenceProcessor and FinalizableProcessor from mmtk-core to mmtk-openjdk and managed to get them running. However, it is a proof of concept only and shall not be merged.
  • https://github.com/wks/mmtk-ruby/tree/gen-weakref-api: With this PR, I am able to handle obj_free, finalisers and global weak tables in Ruby. As a by-product, this PR also enables mmtk-ruby to fully support Ruby finalisers (ObjectSpace.define_finalizer). I'll merge it for mmtk-ruby after this PR is merged.

This PR adds a mechanism to let the VM binding handle weak references and finalisers.

Changes made in this PR include:

  • Add two work bucket stages:
    • VMRefClosure: expand the transitive closure by letting the VM binding trace and/or forward edges in weak references and/or tables of finalisable objects. The decision of which edge to trace and which edge to clear is made by the binding.
    • VMRefForwarding: as part of the "compaction" step in MarkCompact, it lets the VM binding forward those edges.
  • Introduce "sentinel", a work packet to be executed after a work bucket is drained, but before the work bucket is considered to be finished.
    • The "sentinel" can add packets to the current bucket, and set another "sentinel", so that the GC stays in the same bucket stage. This allows transitive closures to be expanded multiple times, and code to be executed after each transitive closure expansion finishes.
    • Sentinels replace the Scheduler::closure_end callback, and act as a more general mechanism.
    • The VMProcessWeakRefs work packet is the sentinel for VMRefClosure and VMRefForwarding. It calls Collection::process_weak_refs.
  • Extend the Collection::process_weak_refs trait method with extra parameters (a sketch of a binding-side implementation follows this list):
    • worker: the current GC worker.
    • context: provides more context, such as whether the current GC is a nursery GC, and whether the current trace is for marking or forwarding.
    • tracer_factory: can instantiate an ObjectTracer and call ObjectTracer::trace_object. It basically wraps ProcessEdgesWork, including its initialisation and flushing, so that the VM binding can call trace_object without intimate knowledge of the ProcessEdgesWork trait.
    • The return value lets mmtk-core call Collection::process_weak_refs again after the expanded transitive closure completes. This allows multiple strength levels and ephemerons to be implemented.
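To make the shape of the new API concrete, here is a minimal sketch of a binding-side implementation. The trait and parameter names (ProcessWeakRefsContext, ObjectTracerContext, with_tracer) follow this PR's discussion, but the exact signature, the FooVM types, and the weak-table helpers are hypothetical, for illustration only.

impl Collection<FooVM> for FooCollection {
    fn process_weak_refs(
        worker: &mut GCWorker<FooVM>,
        context: ProcessWeakRefsContext, // e.g. is this a nursery GC? marking or forwarding?
        tracer_context: impl ObjectTracerContext<FooVM>,
    ) -> bool {
        tracer_context.with_tracer(worker, |tracer| {
            // Iterate over the VM's weak slots (hypothetical helpers).
            for slot in foo_vm::weak_table_slots() {
                let referent = slot.load();
                if foo_vm::is_live(referent) {
                    // Keep the edge; trace_object also returns the forwarded address.
                    slot.store(tracer.trace_object(referent));
                } else {
                    // The binding decides to clear this edge.
                    slot.clear();
                }
            }
        });
        // Returning true asks mmtk-core to call process_weak_refs again after the
        // newly expanded transitive closure completes (for ephemerons or multiple
        // strength levels).  We are done after one round here.
        false
    }
}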

This PR basically resurrects and extends the mechanism introduced in 2b55f89. It keeps the following mechanisms:

  • Collection::vm_release: This can be used to do what RefEnqueue used to do.
    • Also added Collection::vm_prepare to make the API symmetric.

Differences are:

  1. Dedicated VMRefClosure and VMRefForwarding buckets: Weak reference processing is done in those dedicated buckets instead of Closure.
  2. Replaced closure_end with "sentinels"
    • Sentinel can be added to any bucket. Currently we use sentinel for VMRefClosure and VMRefForwarding.
    • Collection::process_weak_refs returns a bool, with semantics similar to the return value of closure_end.

Rationale

Why do we introduce another mechanism while MMTk core has ReferenceProcessor and FinalizableProcessor?

  1. Those mechanisms are Java-centric and do not match other languages/VMs well. For example,
    • V8 has ephemerons and 8 different kinds of weak references;
    • Ruby's obj_free is like Java's finalize() but doesn't resurrect objects;
      Ruby's ObjectSpace.define_finalizer is like PhantomReference but can be unregistered;
      Ruby's finalizer_table, generic_iv_tbl_, id_to_obj_tbl and obj_to_id_tbl are weak tables.
  2. Even for Java, our current reference/finalizable processors were brought over from JikesRVM. Although they are compatible with OpenJDK and other JVMs, they are implemented differently, and their performance and semantics differ subtly. For this reason, if we compare MMTk with the vanilla GC in a JVM (such as G1 in OpenJDK), the difference in performance will reflect not only the difference in GC mechanisms but also the difference in reference/finalizable processors, which is undesirable. We should be able to make apples-to-apples comparisons between, for example, a G1 plan in MMTk and the default G1 GC in OpenJDK, using the same reference/finalizable processors.

Old discussions

Known issue: can we parallelise Collection::process_weak_refs?

(Update: I added a "TracerFactory" so that different GC workers can instantiate ObjectTracer and call trace_object.)

In this PR, Collection::process_weak_refs is called in one work packet running on one GC worker. This means process_weak_refs itself is executed sequentially. However, after it calls trace_object multiple times, the underlying ProcessEdgesWork may split the node list into multiple work packets. One question will be whether we should make process_weak_refs parallel, too, so we can process weak references in multiple work packets simultaneously.

Currently, the ReferenceProcessor and FinalizableProcessor in mmtk-core do not parallelise their ReferenceProcessor::scan_{soft,weak,phantom}_refs and FinalizableProcessor::scan methods. So when we reimplement our current ReferenceProcessor and FinalizableProcessor in the binding on top of the new mechanism in this PR (see 1), they will have the same level of parallelism as before.

But from my observation, the lxr branch is parallelising weak reference processing by creating multiple work packets (See 2) that have intimate knowledge about the ProcessEdgesWork work packet.

I hesitate to expose ProcessEdgesWork to the VM binding, because I think it is an implementation detail of mmtk-core, and GC algorithms can actually implement node-enqueuing tracing instead of edge-enqueuing tracing. All the VM binding needs is the ability to call trace_object (and I defined the ProcessWeakRefsTracer trait this way. See 3).

The problem is, with our current mmtk-core, we can only implement ProcessWeakRefsTracer via ProcessEdgesWork. In this PR, I implemented it by wrapping a reference to a ProcessEdgesWork inside it, calling set_worker before calling Collection::process_weak_refs, and collecting its nodes list afterwards. See 4.

But if we wrap a ProcessEdgesWork instance inside a ProcessWeakRefsTracer, we will have to expose set_worker and the flushing to the VM binding, and I think it is not very elegant.

impl Collection for FooVMBinding {
    fn process_weak_refs(worker: &mut GCWorker, tracer: impl ProcessWeakRefsTracer) -> bool {
        tracer.set_worker(worker); // This line may confuse VM binding writers.
        let work = make_another_work_packet(tracer, get_some_weak_refs());
        worker.scheduler().work_buckets[...].add(work)
    }
}

struct AnotherWorkPacket<T: ProcessWeakRefsTracer> {
    tracer: T,
}
impl<T: ProcessWeakRefsTracer> GCWork<FooVMBinding> for AnotherWorkPacket<T> {
    fn do_work(&mut self, worker: &mut GCWorker<FooVMBinding>, mmtk: &'static MMTK<FooVMBinding>) {
        self.tracer.set_worker(worker); // This is confusing, too.
        self.tracer.trace_object(...);
        self.tracer.flush(); // This is confusing, too, but can be moved to `drop()`.
    }
}

Commit messages from subsequent pushes:

  • ... so that it works for other languages, too. These work bucket stages intend to replace the Java-specific stages.
  • Improved documentation
  • Tell the VM binding if it is a nursery collection
  • Fixed a bug where a newly opened bucket that is empty but has a boss forgets the boss and continues opening subsequent buckets.
  • Added Collection::vm_prepare to pair with vm_release.
  • Added a tls parameter to vm_release and process_weak_refs.
  • Fixed a typo
  • The boss reschedules itself to VMRefForwarding when forwarding.
  • VM-side reference handling is always enabled, regardless of the MMTK_NO_{FINALIZER,REFERENCE_TYPE} options.
  • ... so that when we add more information, bindings that don't use the extra info won't need to change their function signatures.
  • ... so that ProcessWeakRefsContext can be a plain struct.
@wks marked this pull request as ready for review on December 19, 2022.
@qinsoon (Member) left a comment:

I think it is worthwhile to run some performance measurements (maybe two different runs):

  1. Make sure that leaving the VM-side weak ref processing on (but with an empty implementation) does not cost anything (this will tell whether we need a switch for it).
  2. Make sure that moving the current reference processor to the binding side does not cost anything (this will tell us the efficiency of the new API). The latter is more important.

See my inline comments for a few other issues.

Commit messages from subsequent pushes:

  • Split process and forwarding weak refs.
  • Renamed QueuingTracerFactory to ObjectTracerContext.
  • Reuse ProcessEdgesWorkTracerContext in ScanObjects.
  • Added more comments
  • It happens at the same time as stop_all_mutators, and currently I am only using it for assertion. I'll consider adding it back later if there is a specific need.
@qinsoon (Member) commented Jan 2, 2023

> The difference between build1-1 (blue) and build3-null (purple) shows the difference between implementing reference processing in mmtk-core as in master vs implementing it in the binding using the new API. The plot shows the STW time is usually shorter (except h2) when implementing reference processing using the new API. But it is hard to explain why, because I copied and pasted the reference processor and finaliser processor from mmtk-core to mmtk-openjdk, with minor changes.

This is a bit strange. For example, for zxing, there seems to be a lot of work done for ref processing, as the reference processor in mmtk-core slows down the STW a lot. It is strange that the cost just magically disappeared by moving the reference processor to the binding. Can you double check that build3-null actually processes the weak references properly? For example, you will still need to set no_reference_types=false for build3-null, because of these lines in the binding:
https://github.com/mmtk/mmtk-openjdk/blob/b01e1cb235027b5441e46ef4e3c1969875ad2bff/mmtk/src/object_scanning.rs#L116
https://github.com/mmtk/mmtk-openjdk/blob/b01e1cb235027b5441e46ef4e3c1969875ad2bff/mmtk/src/object_scanning.rs#L139

@wks (Collaborator, Author) commented Jan 3, 2023

> This is a bit strange. For example, for zxing, there seems to be a lot of work done for ref processing, as the reference processor in mmtk-core slows down the STW a lot. It is strange that the cost just magically disappeared by moving the reference processor to the binding. Can you double check that build3-null actually processes the weak references properly? For example, you will still need to set no_reference_types=false for build3-null, because of these lines in the binding: https://github.com/mmtk/mmtk-openjdk/blob/b01e1cb235027b5441e46ef4e3c1969875ad2bff/mmtk/src/object_scanning.rs#L116 https://github.com/mmtk/mmtk-openjdk/blob/b01e1cb235027b5441e46ef4e3c1969875ad2bff/mmtk/src/object_scanning.rs#L139

Indeed. That's a problem. build3-null is still treating weak references as strong if MMTK_NO_REFERENCE_TYPES is not set to false. I'll re-run zxing and lusearch with and without rt on all builds.

@wks (Collaborator, Author) commented Jan 3, 2023

The results: https://squirrel.anu.edu.au/plotty/wks/noproject/#0|boar-2023-01-03-Tue-054343&benchmark^buildstring^invocation^iteration&GC^time^time.stw&|10&iteration^1^4|20&1^invocation|30&1&benchmark&buildstring;build1.ms.s.c2.mmtk_gc-Immix.tph.probes_cp.probes.dacapochopin-f480064|41&Histogram%20(with%20CI)^buildstring^benchmark&

I re-ran lusearch (233M heap size) and zxing (672M heap size). This time, I applied rt to all three builds.

STW time (normalised to build1-null): [figure]

Time (normalised to build1-null): [figure]

Number of GCs (not normalised) (FYI: zxing consistently triggers GC 5 times in all invocations): [figure]

The results show that the STW time is significantly greater when reference processing is enabled (MMTK_NO_REFERENCE_TYPES=false). The work of processing weak references doesn't simply disappear when we move reference processing to the binding side.

The STW time without reference processing is similar across builds, and the STW time with reference processing is also similar across builds. For lusearch, enabling reference processing increases the STW time by approx. 30%; for zxing, 60%. I tried decreasing the heap size. The increase in STW time is similar when the heap is smaller (160M for lusearch and 463M for zxing). zxing doesn't run with smaller heap sizes.

The STW time takes up a small fraction of the total time, but the difference is still visible when turning rt on and off.

There is an observable overhead in build3-rt. The following plot is an un-aggregated plot that shows the STW time of each invocation separately. We can see that the increase is consistent, and not due to outliers.

[figure]

In build3-rt, reference processing is implemented on the binding side. I am not sure whether the overhead comes from inefficiency in the new API or from slight differences between the reference processor implementations in mmtk-core and my mmtk-openjdk branch. But we probably need to reimplement the reference processor anyway, using OpenJDK's own mechanism instead of JikesRVM's, likely using the code in the lxr branch.

@wks (Collaborator, Author) commented Jan 4, 2023

Added build4: mmtk-core is gen-weakref-api, but without adding any of the existing reference/finalizer-processing work packets at all. mmtk-openjdk is also gen-weakref-api, i.e. ref/finalizer processors implemented in the binding, the same as build3. This removes the interference of mmtk-core's existing reference/finalizer work packets when MMTK_NO_REFERENCE_TYPES=false.

I ran lusearch and zxing, 40 iterations.

STW time (normalised to build1-null): [figure]

Total time (normalised to build1-null): [figure]

Note that the order of the bars is not 1, 2, 3 and 4.

The STW time for build4-1 (black) is slightly lower than build3-1 (blue), but still noticeably higher than build2-1 and build1-1. This indicates that the influence of unused/no-op work packets is small, but there is another source of overhead to be discovered.

RefEnqueue and Release

One difference I noticed is that

  • With mmtk-core's existing reference processors, the RefEnqueue work packet is added to the WorkBucketStage::Release bucket.
  • With this PR, the Release work packet will
    1. first call VMCollection::vm_release which the VM can implement to do what RefEnqueue does, and
    2. after VMCollection::vm_release returns, do the rest of the work in Release, such as calling plan_mut.release.

We can see that the old method allows RefEnqueue to be executed in parallel with the Release work packet, but the new API forces VMCollection::vm_release to complete before running the rest of Release. If both VMCollection::vm_release and plan.release are time-consuming, this sequential constraint will likely introduce a non-trivial overhead.

An obvious solution is adding a VMRelease work packet to the WorkBucketStage::Release bucket so that VMCollection::vm_release can be executed in a different work packet than the Release packet.

Alternatively, the VM can spawn a new work packet in VMCollection::vm_release so as not to block any subsequent work. I prefer this solution because it still allows the VM to do "release" work that is guaranteed to happen before plan_mut.release(), although I don't know of a VM that needs this.
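A minimal sketch of this alternative (hypothetical binding code; the FooVM types, the packet contents, and the simplified vm_release signature are assumptions):

struct EnqueueClearedRefs; // hypothetical packet holding the ref-enqueuing work

impl GCWork<FooVM> for EnqueueClearedRefs {
    fn do_work(&mut self, _worker: &mut GCWorker<FooVM>, _mmtk: &'static MMTK<FooVM>) {
        // Hand cleared references back to the VM here (what RefEnqueue used to do).
    }
}

impl Collection<FooVM> for FooCollection {
    fn vm_release(worker: &mut GCWorker<FooVM>) {
        // Schedule the packet and return immediately, so the enqueuing can run
        // in parallel with the rest of the Release work.
        worker.scheduler().work_buckets[WorkBucketStage::Release].add(EnqueueClearedRefs);
    }
}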

It is easy to check. I'll run the benchmark again with a build5 that parallelises VMCollection::vm_release and Plan::release.

There is one risk, though, because we expect the Release work packet to be the only work packet that accesses the Plan instance. Parallelising the work may introduce a chance of races, but the VM is not supposed to access Plan directly anyway.

@wks (Collaborator, Author) commented Jan 5, 2023

In build5, mmtk-core is the same as in build2 and build3 (with both the existing ref-processing implementation and the new API), but the binding side creates another work packet in vm_release, making the work parallelisable. However, the experiment shows the performance is still worse than build1 and build2 (both using mmtk-core's built-in ref processors).

STW time (normalised to build1-null): [figure]

Total time (normalised to build1-null): [figure]

It is also worth noting that the total execution time has greater variance in build4 (existing ref-processing work packets removed) and build5 (ref-enqueue and plan-release parallelised).

I'll look into the details.

@wks (Collaborator, Author) commented Jan 5, 2023

I constructed a micro-benchmark that creates 100000 weak references. 50000 of them point to live objects and the rest point to dead objects, and all of them have a ReferenceQueue to enqueue to when they die.

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;

public class RefProcTest {
    public static Probe probe = new RustMMTkProbe();
    public static int unused;

    public static void doTest(int total, int live) {
        // Java doesn't like `WeakReference<Object>[total]`.
        var refs = new ArrayList<WeakReference<Object>>(total);
        var liveObjs = new ArrayList<Object>(live);
        var refQueue = new ReferenceQueue<Object>();

        for (int i = 0; i < total; i++) {
            var obj = new Object();
            var ref = new WeakReference<Object>(obj, refQueue);
            refs.add(ref);
            if (i < live) {
                liveObjs.add(obj);
            }
        }

        System.gc(); // I know this always triggers GC with MMTk.  It is not generally true, though.

        unused = refs.size() + liveObjs.size(); // just use them, in case the GC treats them as garbage.
    }

    public static void main(String[] args) {
        final int total;
        final int live;
        final int warmups;
        final int iterations;

        // ... parse them from args

        for (int i = 0; i < warmups; i++) {
            doTest(total, live);
        }

        probe.begin("RefProcTest", 0, false);
        for (int i = 0; i < iterations; i++) {
            doTest(total, live);
        }
        probe.end("RefProcTest", 0, false);
    }
}

I ran it locally with the settings of build1 (master+master) and build3 (mmtk-core has the new API, and mmtk-openjdk uses the new API), with total=100000, live=50000, warmups=10 and iterations=100. MMTK_PLAN=Immix, MMTK_NO_REFERENCE_TYPES=false. The heap is large enough that it never triggers GC unless I call System.gc(). The following table shows the result. The time unit is ms.

packet                     build1        build3         diff
SoftRefProcessing        0.998000      0.105000    -0.893000
WeakRefProcessing      890.801000      0.043000  -890.758000
Finalization             0.525000      0.381000    -0.144000
PhantomRefProcessing     0.175000      0.047000    -0.128000
VMProcessWeakRefs        0.038000      5.626000     5.588000
RefEnqueue              45.841000      0.090000   -45.751000
Release                  3.873000     37.334000    33.461000
PlanProcessEdges     36934.860000  39979.562000  3044.702000
ProcessWeakRefsWork           n/a   1008.507000  1008.507000

p.s. (1) Other work packets took very little time. (2) The execution time varies a lot. Vanilla HotSpot with its default G1 GC has large variation, too, probably because of the nondeterministic nature of weak references themselves. Despite that, the data above is quite representative of a typical execution.

Among the work packets, VMProcessWeakRefs is introduced by the new API, and it should do all the work of {Soft,Weak,Phantom}RefProcessing and Finalization. The work of RefEnqueue is moved into Release.

ProcessWeakRefsWork is special. When doing the experiment, I intentionally let the binding create a separate work packet (ProcessWeakRefsWork) when processing weak references (not soft, final or phantom), just to test whether my new API works if the binding splits the work into multiple work packets. It is not necessary, because I could just process weak references in VMProcessWeakRefs itself.

From that table, I can identify three sources of inefficiency.

  1. (minor) ProcessWeakRefsWork is not necessary. I can just let VMProcessWeakRefs handle weak references and eliminate one work packet creation. But the impact of this is small.
  2. (minor) We could have a separate work packet that does the equivalent of RefEnqueue. But as we can see from the table, the impact is also small.
  3. (major) The time consumed by PlanProcessEdges is dramatically increased. The impact of this is greater than anything else.

I looked into the code, and I found that the difference between my API and the existing ReferenceProcessor is in how we flush the ProcessEdgesWork.

The existing ReferenceProcessor simply calls ProcessEdgesWork::flush():

impl<E: ProcessEdgesWork> GCWork<E::VM> for WeakRefProcessing<E> {
    fn do_work(&mut self, worker: &mut GCWorker<E::VM>, mmtk: &'static MMTK<E::VM>) {
        let mut w = E::new(vec![], false, mmtk);
        w.set_worker(worker);
        mmtk.reference_processors.scan_weak_refs(&mut w, mmtk);
        w.flush();  // here
    }
}

On the other hand, my API attempts to wrap ProcessEdgesWork in a trait, and helps divide the enqueued nodes into chunks.

    fn with_tracer<R, F>(&self, worker: &mut GCWorker<E::VM>, func: F) -> R
    where
        F: FnOnce(&mut Self::TracerType) -> R,
    {
        let mmtk = worker.mmtk;

        // Prepare the underlying ProcessEdgesWork
        let mut process_edges_work = E::new(vec![], false, mmtk);
        // FIXME: This line allows us to omit the borrowing lifetime of worker.
        // We should refactor ProcessEdgesWork so that it uses `worker` locally, not as a member.
        process_edges_work.set_worker(worker);

        // Create the tracer.
        let mut tracer = ProcessEdgesWorkTracer { process_edges_work };

        // The caller can use the tracer here.
        let result = func(&mut tracer);

        // Flush the queued nodes.
        let ProcessEdgesWorkTracer {
            mut process_edges_work,
        } = tracer;

        // ============ BEGIN: PROBLEMATIC CHUNK ======================
        let next_nodes = process_edges_work.pop_nodes();
        if !next_nodes.is_empty() {
            // Divide the resulting nodes into appropriately sized packets.
            let work_packets = next_nodes
                .chunks(E::CAPACITY)
                .map(|chunk| {
                    Box::new(process_edges_work.create_scan_work(chunk.into(), false)) as _
                })
                .collect::<Vec<_>>();
            worker.scheduler().work_buckets[self.stage].bulk_add(work_packets);
        }
        // ============ END: PROBLEMATIC CHUNK ======================

        result
    }

And that "PROBLEMATIC CHUNK" is the culprit. I replaced it with process_edges_work.flush() and it became faster.
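For reference, the fix collapses the whole chunk above into the same call the built-in WeakRefProcessing packet makes:

        // Flush the queued nodes the same way the built-in processor does.
        process_edges_work.flush();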

packet                     build1        build3         diff
SoftRefProcessing        0.965000      0.095000    -0.870000
WeakRefProcessing      841.921000      0.052000  -841.869000
Finalization             0.462000      0.347000    -0.115000
PhantomRefProcessing     0.167000      0.044000    -0.123000
VMProcessWeakRefs        0.037000      1.488000     1.451000
RefEnqueue              46.418000      0.094000   -46.324000
Release                 13.944000     36.644000    22.700000
PlanProcessEdges     37451.898000  36654.279000  -797.619000
ProcessWeakRefsWork           n/a    929.995000   929.995000

This made PlanProcessEdges faster than in build1, although ProcessWeakRefsWork still takes more time than WeakRefProcessing for some reason. (I tried merging the ProcessWeakRefsWork packet back into VMProcessWeakRefs, but the improvement was not significant.)

I'll try to run lusearch and zxing again to see if it makes a difference.

@wks (Collaborator, Author) commented Jan 5, 2023

I was wrong again.

  • Replacing the "PROBLEMATIC CHUNK" with process_edges_work.flush() makes it faster: true.
  • Removing the chunking logic and creating one single work packet makes it faster: false.

I copied the code from ProcessEdgesWork::flush to this place, changed its work bucket stage to self.stage, and the performance increase disappeared. How could this happen?

I then added #[inline(always)] to ProcessEdgesWorkTracer::trace_object, keeping the chunking logic, and the performance improved.

So the culprit is the hot trace_object method not being inlined. Replacing the code with process_edges_work.flush() somehow changed the inlining decision of the Rust compiler. Sigh...
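The fix is a one-attribute change on the wrapper. The struct and trait names below are the ones used in this PR, but the surrounding impl is reconstructed for illustration:

impl<E: ProcessEdgesWork> ObjectTracer for ProcessEdgesWorkTracer<E> {
    // Force-inline the hot wrapper so each call compiles down to the
    // underlying ProcessEdgesWork::trace_object.
    #[inline(always)]
    fn trace_object(&mut self, object: ObjectReference) -> ObjectReference {
        self.process_edges_work.trace_object(object)
    }
}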

@wks (Collaborator, Author) commented Jan 6, 2023

The experiments show that both inlining and chunking affect performance.

I kept build{1,2,3} from the previous experiments. The new build4 forces inlining of the trace_object method. build5 additionally removes the chunking logic.

Feature                                 build1  build2  build3  build4  build5
mmtk-core has the new API?              No      Yes     Yes     Yes     Yes
mmtk-openjdk uses the new API?          No      No      Yes     Yes     Yes
force inlining trace_object?            n/a     n/a     No      Yes     Yes
split the enqueued nodes into chunks?   No      No      Yes     Yes     No

Note: build1 and build2 use the built-in ref processor in mmtk-core, which directly calls ProcessEdgesWork::trace_object, so "force inlining" is not relevant. build{3,4,5} use a wrapper around the E::trace_object method, so performance suffers if the wrapper is not inlined.

Note 2: build1 and build2 use the built-in ref processor in mmtk-core, which calls ProcessEdgesWork::flush() to create ScanObjects work packets and doesn't do any chunking. build{3,4,5} use my wrapper, which "helps" the user split the vec of enqueued nodes into smaller chunks and create multiple work packets in order to increase parallelism.

Plotty page:

https://squirrel.anu.edu.au/plotty/wks/noproject/#0|boar-2023-01-05-Thu-143457&benchmark^build^invocation^iteration^rt&GC^time^time.stw&|60&build&rt|10&iteration^1^4|20&1^invocation|30&1&benchmark&build-rt;build1-null|40&Histogram%20(with%20CI)^build-rt^benchmark&

STW time (normalised to build1-null): [figure]

Total time (normalised to build1-null): [figure]

build5 matches the STW time and total time of build1 and build2.

The total time of build4 is shorter than the other builds for zxing, but worse than the other builds for lusearch.

I still believe a good API should hide the "ideal packet size" from its user and automatically flush the ProcessEdgesWork when needed. The problem should be the way I do the chunking, i.e. splitting the list after it is fully populated instead of flushing as soon as it becomes full.

@wks (Collaborator, Author) commented Jan 6, 2023

I managed to implement auto-chunking so that it flushes and creates a work packet as soon as the nodes buffer in ProcessEdgesWork is full (i.e. the number of elements reaches ProcessEdgesWork::CAPACITY, which is 4096 at the moment).
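A sketch of the auto-flushing trace_object (the nodes field and the flush helper are assumed from the snippets above):

fn trace_object(&mut self, object: ObjectReference) -> ObjectReference {
    let result = self.process_edges_work.trace_object(object);
    // Flush eagerly: as soon as CAPACITY (currently 4096) nodes are enqueued,
    // turn the buffer into a scan-work packet instead of chunking it afterwards.
    if self.process_edges_work.nodes.len() >= E::CAPACITY {
        self.flush();
    }
    result
}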

In the following test, the trace_object method is always inlined in all builds.

  • build{1,2} are the same as before.
  • In build{3,4,5}, mmtk-core provides the new API and mmtk-openjdk uses the new API.
  • build3 chunks the nodes buffer after it is filled (if there are 100000 weak references, it will try to chunk a 100000-element vector into many 4096-element vectors);
  • build4 does not do the chunking (it creates one PlanScanObjects work packet with 100000 elements);
  • build5 checks the number of elements in the nodes buffer in trace_object, and creates a PlanScanObjects packet every time 4096 objects are enqueued.

STW time (normalised to build1-null): [figure]

Total time (normalised to build1-null): [figure]

In both lusearch and zxing, the STW time of build5 is very close to that of build1 and build2 when reference-processing is enabled.

When testing with my synthetic micro-benchmark, build5 consistently outperforms build1, especially in terms of time consumed in PlanProcessEdges work packets.

@wks (Collaborator, Author) commented Jan 9, 2023

Here is the plot for build1, build2 and build5 for more tests in the DaCapo Chopin benchmark suite.

STW time (normalised to build1-null): [figure]

STW time (normalised to build1-1): [figure]

Total time (normalised to build1-1): [figure]

Number of GCs (normalised to build1-1): [figure]

From the plot, we see both the STW time and the total time of build5 are almost as good as build1 on average.

As for the STW time of zxing, build5-1 still has a noticeable overhead compared to build1-1 and build2-1, and the error bars are small. It could be due to not parallelising Release and RefEnqueue, but other possibilities exist.

The behaviour of some benchmarks changed significantly after turning on reference processing.

  • lusearch: STW time is greatly increased after enabling ref processing, and total time also increased greatly.
    • sunflow, xalan: like lusearch, but milder.
  • zxing: STW time is greatly increased after enabling ref processing, but total time changed mildly.
  • cassandra, fop, jython, luindex, tomcat: STW time increased, but total time doesn't change.
  • h2o, pmd: No obvious change before or after enabling ref processing.
  • h2: STW time is greatly decreased after enabling ref processing, and total time reduced noticeably.

@k-sareen (Collaborator) commented Jan 9, 2023

We probably need to be chunking work packets. @wks didn't you look into the size of work packets a while ago and find there were packets >>> 4096 (the default size)?

@wks (Collaborator, Author) commented Jan 9, 2023

Using perf to analyse my microbenchmark, I found two problems in my mmtk-openjdk repo.

  1. In my mmtk-openjdk branch, add_weak_reference attempts to borrow the WEAK_PROCESSOR instance immutably using AtomicRefCell every time it is called. The reference processor in mmtk-core doesn't have this problem because the ref processor is a field of MMTK, and an &MMTK is passed to add_weak_reference as a parameter.
    This function is called during object scanning every time a reachable weak reference object is discovered. Although an AtomicRefCell is relatively cheap to borrow, it still involves an atomic operation, and when it is on the hot path the cost accumulates. A wiser way would be to let each GC worker cache the borrow of WEAK_PROCESSOR in its thread-local state, or to change the synchronisation mechanism in some other way to eliminate this borrow. For now, I just replaced it with an unsafe operation. Also note that the main bottleneck of reference processing (in our current implementation) is the mutex lock on the HashMap.
  2. The closure that contains trace_object is not inlined. Concretely,
                    tracer_context.with_tracer(worker, |tracer| {
                        self.reference_processors
                            .scan_weak_refs(|o| tracer.trace_object(o)); // This inner lambda is not inlined.
                    });

We know how hot trace_object is. To force inlining, I have to use the unusual syntax #[inline(always)] |o| tracer.trace_object(o) to annotate the lambda. Again, the ref processor in mmtk-core doesn't have this problem because it passes a &ProcessEdgesWork directly to the reference processor.
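For clarity, this is the same call site with the annotation applied. (As far as I can tell, the attribute-on-closure syntax relies on an unstable Rust feature, which works on the nightly toolchain mmtk-core builds with.)

tracer_context.with_tracer(worker, |tracer| {
    self.reference_processors
        .scan_weak_refs(#[inline(always)] |o| tracer.trace_object(o)); // now inlined
});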

In the following plots, build1 and build2 are as before. Both build3 and build4 apply the auto-flushing mechanism for enqueued nodes in mmtk-core. The difference is that build4 also applies the two optimisations above in the mmtk-openjdk repo.

Feature                                      build1  build2  build3  build4
mmtk-core has the new API?                   No      Yes     Yes     Yes
mmtk-openjdk uses the new API?               No      No      Yes     Yes
force inlining trace_object?                 n/a     n/a     Yes     Yes
flush enqueued nodes whenever full (4096)?   No      No      Yes     Yes
workaround sync cost of AtomicRefCell?       No      No      No      Yes
inline the hot lambda?                       No      No      No      Yes

STW time (normalised to build1-1): [figure]

Total time (normalised to build1-1): [figure]

From the plots, we can see that the STW time and total time of build4 (with the two optimisations above) are better than those of build3 (without the optimisations), and are on par with build1 (master+master).

Since the performance differences are a result of inlining details in mmtk-openjdk, I think the new weak reference API itself is efficient enough for general use.

In the future, we should deprecate the reference processor and finalisation processor in mmtk-core, and update the reference/finaliser processing mechanism in mmtk-openjdk. The current ref processor in mmtk-core has a serious bottleneck: the mutex lock on the HashMap that holds the discovered weak references. This makes reference processing slower when there are more GC worker threads.

Update: I re-ran the same build1 and build4 with MarkCompact. Benchmarks show that the API allows reference processing to be implemented at least as efficiently as mmtk-core's built-in ref processor on MarkCompact, too.

STW time when using MarkCompact GC (normalised to build1-1): [figure]

Total time when using MarkCompact GC (normalised to build1-1): [figure]

It is interesting that enabling ref processing makes lusearch run faster with MarkCompact.

@wks (Collaborator, Author) commented Jan 9, 2023

> We probably need to be chunking work packets. @wks didn't you look into the size of work packets a while ago and find there were packets >>> 4096 (the default size)?

Yes. I once found that many work packets are much smaller than 4096; some have fewer than 10 items (objects/edges). The problem is that we currently have no way to merge work packets. Packets only get smaller, unless some objects have a large fan-out, i.e. one object points to many other objects.

It should help if the new API automatically chunks the enqueued nodes into properly sized work packets, and I have implemented that successfully.

However, for my microbenchmark, packet size is not a problem. The mutex lock on the HashMap of discovered {Soft,Weak,Phantom}References effectively makes reference processing single-threaded when discovering references, and single-threaded when iterating over discovered references. A WeakReference instance either points to a live object or gets cleared, so no objects should be enqueued when tracing a WeakReference. However, trace_object still needs to be called to determine whether the object the WeakReference points to is still alive, so whether trace_object is inlined still affects performance. SoftReference may be a different story, because a SoftReference can keep otherwise-dead objects alive.

Commit message from a subsequent push:

  • Rename vm_release to post_forwarding, and run it in a dedicated work packet in the Release bucket.
@wks (Collaborator, Author) commented Jan 12, 2023

I renamed Collection::vm_release to Collection::post_forwarding, and created a VMPostForwarding work packet for it.

The following experiment investigates the performance impact of parallelising the reference-enqueuing work (done in post_forwarding, formerly vm_release).

  • build1: master + master
  • build{2,3,4} use the new API. The differences are:
    • build2: the Release work packet calls vm_release and then plan_mut.release, with no parallelism at all.
    • build3: the Release work packet calls vm_release and then plan_mut.release, but vm_release creates a work packet that does the ref-enqueuing.
    • build4: schedule_common_work schedules a VMPostForwarding work packet in the Release bucket; the Release packet doesn't call post_forwarding (see the sketch after this list).
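A sketch of the build4 scheduling (the surrounding schedule_common_work code is assumed; VMPostForwarding is the packet introduced by the rename above):

// Both packets go into the Release bucket, so a GC worker can run
// VMPostForwarding in parallel with the Release packet.
scheduler.work_buckets[WorkBucketStage::Release].add(VMPostForwarding::<VM>::default());
scheduler.work_buckets[WorkBucketStage::Release].add(Release::new(plan));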

STW time (normalised to build1-1): [figure]

Total time (normalised to build1-1): [figure]

The experiment shows that build4 performs the best among build{2,3,4}, but the differences are still within the confidence intervals. I think the reason is that the earlier a work packet is scheduled, the earlier a GC worker can start doing that work.

@qinsoon (Member) commented Jan 12, 2023

@wks I feel you have understood and optimised this enough. Let me know when you have pushed all your changes, and I will take a look at this PR again.

@qinsoon (Member) left a comment:

LGTM. I suggest we run the binding tests before merging once the merge conflict is resolved.

///
/// NOTE: This will replace `RefEnqueue` in the future.
///
/// NOTE: Although this work packet runs in parallel with the `Release` work packet, it does not
@qinsoon (Member) commented:

We cannot guarantee this. The binding may access the plan in their implementation of Collection::post_forwarding().

@wks (Collaborator, Author) replied:

Indeed. While Collection::post_forwarding() doesn't expose the Plan to the binding, and the Plan field is declared as pub(crate), many functions in the memory_manager module access the plan instance.

The timing of RefEnqueue (now VMPostForwarding) should be after all references are forwarded, which includes the RefForwarding and FinalizerForwarding buckets (both subsumed by the new VMRefForwarding bucket). And it doesn't need to access the MMTk instance. However, it seems impossible to prevent the binding from accessing the plan in our current program structure: as long as the binding has a &'static MMTK, it can call functions in memory_manager and indirectly access the plan.

I think what we can do is tell the VM binding not to access the plan indirectly. But even that sounds like a bad idea, because we haven't exposed Plan anyway, and the binding has no way to know which API functions indirectly access the Plan. So maybe we can provide the Collection::post_forwarding() hook like this, to make sure our ref-processing code in mmtk-core can be implemented in the binding, but refactor mmtk-core in the future to solve the "plan vs plan_mut" problem at the root, so that we can actually prevent the binding from accidentally accessing the plan.

@qinsoon added the PR-testing label (run binding tests for the pull request) on Jan 16, 2023.
@wks merged commit 844aa0e into mmtk:master on Jan 16, 2023.
wenyuzhao pushed a commit to wenyuzhao/mmtk-core that referenced this pull request on Mar 20, 2023:
This commit adds a language-neutral API to let the VM binding handle weak
references and finalisers.

Added VMRefClosure and VMRefForwarding work bucket stages to replace the
Java-specific stages.

Added a "sentinel" mechanism to execute a work packet when a bucket is drained,
and prevent the GC from entering the next stage, making it possible to expand
the transitive closure multiple times. It replaces GCWorkScheduler::closure_end.

Extended the Collection::process_weak_refs method to allow the binding to trace
objects during weak reference processing.

Renamed Collection::vm_release to Collection::post_forwarding; it is now
executed in a dedicated work packet.