Let the coordinator thread open buckets #782

Merged · 11 commits · Apr 12, 2023

Conversation

@wks (Collaborator) commented Mar 16, 2023

Note: This PR is not intended to do any full-scale refactoring of the work packet framework as described in #774. Instead, this PR tries to fix #778 cleanly while not changing the overall structure.

The main problem is that Condvar::wait is allowed to spuriously wake up at any time even without another thread notifying it. Currently, work buckets are opened and closed under the assumption that all GC workers have parked. Due to spurious wake-ups, any GC worker can wake up at any time, and break the assumption.

This PR makes the following changes:

  1. Only the coordinator can open more buckets (with the exception of Prepare; see the "Won't fix" section below).

  2. When all workers have parked, they notify the coordinator that all workers have parked, and wait for the coordinator to open buckets.

    • Because of this, workers no longer report the "GC finished" event to the coordinator. Now it is the coordinator that determines whether GC has finished.
  3. When all workers have parked, a boolean flag WorkerMonitor::group_sleep is set. No GC workers can unpark until this flag is explicitly cleared by the coordinator.

    • This flag makes the GC workers robust against spurious wake-ups. Even if any worker spuriously wakes up, it will find the group_sleep flag still set, and will wait again.
    • Concretely, the flag is cleared if the coordinator adds more work packets to open buckets, or opens new buckets. If no more work is available, GC finishes.
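
The following is a minimal sketch of the waiting pattern described above (illustrative only, not the PR's actual code): the permission to resume is carried by a boolean guarded by a mutex, so a spurious wake-up from the condition variable simply re-checks the flag and goes back to waiting.

```rust
use std::sync::{Condvar, Mutex};

/// Worker side: block until the coordinator clears the group-sleep flag.
/// Condvar::wait may return spuriously, so the flag is re-checked in a loop.
fn wait_until_released(group_sleep: &Mutex<bool>, cond: &Condvar) {
    let mut asleep = group_sleep.lock().unwrap();
    *asleep = true; // enter the group-sleep state
    while *asleep {
        asleep = cond.wait(asleep).unwrap();
    }
}

/// Coordinator side: after adding packets to open buckets or opening new
/// buckets, clear the flag and wake all workers.
fn release_workers(group_sleep: &Mutex<bool>, cond: &Condvar) {
    *group_sleep.lock().unwrap() = false;
    cond.notify_all();
}
```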

To implement those changes, some data structures are modified.

  1. The worker-to-coordinator channel is changed from mpsc::Channel to a custom scheduler::controller::channel::Channel.
    • It is not a strict FIFO channel. The coordinator always executes all pending coordinator work packets before handling the "all workers have parked" condition.
  2. Introduced a WorkerMonitor struct as the anything-to-workers notification channel.
    • It encapsulates the existing Mutex and Condvar so that the coordinator or other workers can notify workers.
    • It includes the parked_workers counter. WorkerGroup is no longer responsible for recording parked workers.
    • It includes the group_sleep boolean flag.
  3. Removed the pending_coordinator_packets counter.
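
A structural sketch of how these pieces might fit together (the field names parked_workers and group_sleep follow the description above; the other names and the method shown are assumptions): the parked-worker count and the group-sleep flag live under the same mutex, so the last worker to park can atomically decide to notify the coordinator and put the whole group to sleep.

```rust
use std::sync::{Condvar, Mutex};

struct WorkerMonitorSync {
    parked_workers: usize, // plain usize; the mutex already protects it
    group_sleep: bool,
}

struct WorkerMonitor {
    sync: Mutex<WorkerMonitorSync>,
    cond: Condvar,
    worker_count: usize,
}

impl WorkerMonitor {
    /// Called when a worker runs out of work. Returns true if this worker is
    /// the last one to park; that worker then tells the coordinator that all
    /// workers have parked (via the worker-to-coordinator channel) and the
    /// whole group enters group sleep.
    fn park(&self) -> bool {
        let mut sync = self.sync.lock().unwrap();
        sync.parked_workers += 1;
        let all_parked = sync.parked_workers == self.worker_count;
        if all_parked {
            sync.group_sleep = true;
        }
        all_parked
    }
}
```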

Known issues

  1. There are two boolean flags that record the "all workers have parked" condition.

    • One is controller::channel::ChannelSync::all_workers_parked.
    • The other is WorkerMonitorSync::group_sleep.

    Currently, the former is used to notify the coordinator, while the latter is used to wake up workers. It may be possible to merge them.

    However, if we merge them, both the coordinator and the workers will be waiting on the same mutex that guards that flag. That would mean replacing the worker-to-coordinator channel and the anything-to-workers notifier with one single synchronisation entity guarded by one mutex (but possibly with multiple condvars). That would be a larger-scale change. In contrast, both the worker monitor and the worker-to-coordinator channel already existed, except that the former was implemented as a tuple of (Mutex<()>, Condvar) while the latter was an mpsc::Channel. Therefore, it is easier (and requires fewer changes) to keep them separate.

  2. Requiring the coordinator to open new buckets adds a tiny context-switching overhead, but this overhead is small compared to the load balance problem.

    Update:

    • On machines with many cores, the synchronisation overhead may be greater.

    • This problem mainly affects work buckets containing a single small work packet. One example is the VMRefClosure bucket. It has one packet that calls Scanning::process_weak_refs, which does nothing by default. Previously, executing such a work packet had no observable cost. With this PR, however, it involves a context switch from the coordinator to a worker that executes this no-op packet, and another context switch from the worker back to the coordinator to open more buckets. This mainly affects VMs that don't use the VMRefClosure bucket (such as OpenJDK, which uses the legacy {Soft,Weak,Phantom}Closure buckets).

    This implies that the cost is not negligible at scale, and this PR should only be a temporary workaround before we do a full-scale redesign of the work packet system.

  3. Is there a better name for "group sleeping"? It should describe a state such that, once the group of threads has entered it, they cannot leave it by themselves and must be released by an external agent (the coordinator).

Won't fix

The Prepare bucket is opened by the StopMutators work packet, which may be executed by a worker (if the binding is configured so).

In theory, we can just let the coordinator open the first STW bucket (Prepare). From the log, the GCWorkScheduler::notify_mutators_paused method is called near the end of the StopMutators work packet. Even in the best case, the Prepare work packet only starts several microseconds before the StopMutators work packet finishes, so the synchronisation overhead should be small.

I (Kunshan) tried implementing this. When all mutators stopped, the StopMutators packet sent a message to the coordinator. The coordinator, upon receiving that message, opened the Prepare bucket. However, this introduced another synchronisation bug.

If the coordinator and the workers are executed in the following order, the coordinator will observe a strange state that all workers parked, but some open buckets still contain packets.

  1. A worker executed the StopMutators packet. When the mutators paused, it sent a "mutators stopped" message to the coordinator.
  2. The same worker finished executing the StopMutators packet, and found no more packets in open buckets (Prepare was not open yet). The worker parked, and it found it was the last parked worker. Just at this moment, the OS performed a context switch.
  3. The coordinator started to execute. It received the "mutators stopped" message and opened the Prepare bucket. The coordinator tried to clear the "group-sleep" flag and notify the workers, but the workers were not in the "group-sleep" state yet (the worker mentioned above was about to wait, but had not started waiting). The clearing and notifying had no effect.
  4. The OS did another context switch and the worker continued. The worker thought no packets were available in open buckets (which was no longer true, because the coordinator had opened the Prepare bucket during the context switch). The worker then set the "group-sleep" flag, sent an "all workers parked" message to the coordinator, and started waiting for the "group-sleep" flag to be cleared.
  5. The coordinator received the "all workers parked" message. It performed an assertion, assert!(self.scheduler.all_activated_buckets_are_empty()), which failed because the Prepare bucket was not empty.

So is that assertion unreasonable? I feel reluctant to remove it because it is quite a fundamental condition for opening new buckets: all buckets (except Prepare) can only be opened when all active buckets are drained.

In another execution sequence, the coordinator will observe "all workers are parked" when all workers are actually active.

  1. Same as above
  2. The same worker finished executing the StopMutators packet, and found no more packets in open buckets. It parked and found it was the last parked worker. Then it set the "group-sleep" flag, sent an "all workers parked" message to the coordinator, and started waiting for the "group-sleep" flag to be cleared.
  3. The coordinator received the "mutators stopped" message and opened the Prepare bucket. It cleared the "group-sleep" flag and notified the workers.
  4. Workers woke up and started executing packets in the Prepare bucket.
  5. The coordinator continued to process the "workers parked" event. But now the workers were not actually parked, because the coordinator had already cleared the "group-sleep" flag when notifying the workers after handling "mutators stopped".

I think the fundamental problem is that buckets can be opened without holding any lock. The WorkBucket::activate() method works by setting an AtomicBool flag (WorkBucket::active) to true. Because of this, one thread (the coordinator) can open a bucket while another thread (a worker) peeks into the bucket to see if it is open and has any packets.
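
The following sketch illustrates that race (the names active and activate() come from the paragraph above; the packet queue and the emptiness check are assumptions): nothing stops the coordinator's store from interleaving with a worker's emptiness check.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

struct WorkBucket {
    /// WorkBucket::activate() just sets this flag; no lock is shared with
    /// the workers that inspect the bucket.
    active: AtomicBool,
    /// Assumed representation of the bucket's packet queue.
    packets: Mutex<VecDeque<Box<dyn FnOnce() + Send>>>,
}

impl WorkBucket {
    fn activate(&self) {
        // The coordinator can do this at any time...
        self.active.store(true, Ordering::SeqCst);
    }

    fn contributes_no_work(&self) -> bool {
        // ...while a worker concurrently evaluates a check like this, so the
        // worker may conclude "all open buckets are empty" just as a bucket
        // becomes open and non-empty.
        !self.active.load(Ordering::SeqCst) || self.packets.lock().unwrap().is_empty()
    }
}
```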

A thorough fix should employ the technique described in #546, i.e. let workers poll packets from one single queue, while buckets are just temporary places to hold packets when the buckets are not open.

I think I should not fix it in this PR because this PR aims to be a small-scale fix. The current code, i.e. letting the StopMutators packet call GCWorkScheduler::notify_mutators_paused and open the Prepare bucket directly, works fine without synchronisation issues.

@wks (Collaborator, Author) commented Mar 17, 2023

I ran lusearch from DaCapo Chopin (without PGO). With this PR, the STW time is 1.013 (-2.76%, +2.83%) times that of the master branch, and the total time is 1.002 (-0.89%, +0.90%) times that of master. This data shows the difference is small, but it doesn't show whether there is any structural difference in the scheduling of work packets (for example, load balance).

wks added 5 commits April 6, 2023 18:48

  • That counter is used for synchronization, so we move it to WorkerMonitorSync and make it a plain usize instead of AtomicUsize.
  • Use pub instead of pub(crate) for the methods if the struct is already pub(crate).
@wks wks marked this pull request as ready for review April 11, 2023 14:01
@wks wks requested review from qinsoon and wenyuzhao April 11, 2023 14:02
@wenyuzhao (Member) commented

Currently, work buckets are opened and closed under the assumption that all GC workers have parked. Due to spurious wake-ups, any GC worker can wake up at any time, and break the assumption.

Work buckets are opened based on only one assumption: all the previous buckets are empty, including all the worker-local queues. The only reason we want all the workers to reach the parked state is to check whether all the local queues are empty. So it is correct to open more buckets as long as all the workers have reached the "all parked" state once. It probably doesn't matter that some workers spuriously wake up before a new bucket is opened.

The same applies to the work bucket close condition: as long as all the buckets are empty and all the workers have reached the "all parked" state once, it is correct to finish the current GC and close all the buckets.

This makes me wonder whether this assertion is strictly correct. I think it's okay for workers to poll for more packets, or even try to open more buckets, while the coordinator thread is doing other jobs -- for example, during StopMutators, a worker is totally safe to do other things, like scanning some stacks before all mutators have yielded. Do these spurious wake-ups actually cause any correctness issues or crashes without this assertion?

@qinsoon (Member) commented Apr 11, 2023

Work buckets are opened based on only one assumption: all the previous buckets are empty, including all the worker-local queues.

This does not sound 100% right. Let's clarify one thing: if a worker is executing a packet, the packet is not in any bucket, and it is not in any local queue, right? Under this definition, if all the active buckets are empty but a worker is executing a work packet, we still cannot open new buckets, as the worker that is executing a work packet may generate more work packets.

So it is correct to open more buckets as long as all the workers have reached the "all parked" state once.

This makes sense if all the packets are executed by the workers. But what about the coordinator? A coordinator work also belongs to a bucket. If the coordinator is executing a packet, we should not open new buckets, right? The coordinator may generate new work as well.

@wenyuzhao (Member) commented

Under this definition, if all the active buckets are empty but a worker is executing a work packet, we still cannot open new buckets, as the worker that is executing a work packet may generate more work packets.

For this case, workers will not reach the "all parked" state. The above assumption still holds.

A coordinator work also belongs to a bucket.

No, I don't think this is true. Coordinator work never goes to a bucket's queue. There is a separate message channel for that.

If the coordinator is executing a packet, we should not open new buckets, right?

This is probably not true either. For example, when the coordinator thread is stopping mutators, there are some stacks that can already be scanned before all the mutators have yielded. We have not implemented this feature yet, but in theory some other workers should be able to scan these stacks while the coordinator is still stopping mutators.

Another example is that when stopping mutators, workers can already start zeroing mark tables. It's totally safe and correct to do so.

The coordinator may generate new work as well.

Yes. But given your stronger assumption that we can't open buckets while the coordinator is doing some job, this new work never has a chance to be executed early in parallel.

@wenyuzhao (Member) commented

I'm curious whether there are any real correctness issues or crashes after removing these strong assertions. If so, that means the current worker synchronization mechanism is broken. Otherwise, it is probably less efficient to enforce a strong sequential order between coordinator packets and normal work packets.

@qinsoon (Member) commented Apr 11, 2023

I'm curious whether there are any real correctness issues or crashes after removing these strong assertions. If so, that means the current worker synchronization mechanism is broken. Otherwise, it is probably less efficient to enforce a strong sequential order between coordinator packets and normal work packets.

What exactly do you mean by 'strong assertions'?

If you meant 'we cannot open new buckets if we have pending_coordinator_packets', I think this was used to prevent an issue in ScheduleCollection. When the coordinator is executing ScheduleCollection, most of the GC packets are not yet added to the buckets, and a worker may see the active buckets as empty, and start opening subsequent buckets.

@wenyuzhao (Member) commented Apr 11, 2023

What exactly do you mean by 'strong assertions'?

I was talking about this one [1], but probably there are others.

(I just noticed one more below)

When the coordinator is executing ScheduleCollection, most of the GC packets are not yet added to the buckets, and a worker may see the active buckets as empty, and start opening subsequent buckets.

When doing ScheduleCollection, no buckets should be active. I remember the first bucket is opened at the end of StopMutators [2].

Even if the workers see that all the active buckets are empty, they still cannot open buckets, because we already have a check here [3] that prevents buckets from opening before the pending coordinator work count goes to zero.

  • I don't think this check [3] is necessary either. It should still be correct if we remove it, because workers cannot open the first STW bucket [4].

[1] https://github.com/mmtk/mmtk-core/blob/master/src/scheduler/scheduler.rs#L242-L245
[2] https://github.com/mmtk/mmtk-core/blob/master/src/scheduler/scheduler.rs#L482-L489
[3] https://github.com/mmtk/mmtk-core/blob/master/src/scheduler/scheduler.rs#L425-L432
[4] https://github.com/mmtk/mmtk-core/blob/master/src/scheduler/scheduler.rs#L81

@qinsoon (Member) commented Apr 12, 2023


If we don't have checks for pending_coordinator_packets when opening buckets, then while the coordinator executes ScheduleCollection, StopMutators could be added to the Unconstrained bucket and start execution in parallel with ScheduleCollection. In that case, if StopMutators finishes before ScheduleCollection, a worker may open any bucket if there is no work in them (ScheduleCollection hasn't yet pushed work to them). That is the reason we would need such a check.

The next problem is that even with the check for pending_coordinator_packets before opening buckets, the assertion you mentioned (https://github.com/mmtk/mmtk-core/blob/master/src/scheduler/scheduler.rs#L242-L245) may fail. That could still cause the problem I discussed above.

Generally, I think it is troublesome to allow every worker to open new buckets while also requiring them to coordinate with the coordinator. It is way simpler to just let the coordinator do this kind of global work, like opening new buckets.

@wenyuzhao (Member) commented

In that case, if StopMutators finishes before ScheduleCollection, a worker may open any bucket if there is no work in them (ScheduleCollection hasn't yet pushed work to them). That is the reason we would need such a check.

StopMutators has two versions. One is executed by the worker, the other one is a coordinator work. The first one is just used to create the coordinator StopMutators.

The first STW bucket can only be opened at the end of the coordinator StopMutators. Workers cannot open the first STW bucket. If the first STW bucket is not open, no worker can open any other buckets. So no bucket can be opened before the coordinator StopMutators is finished.

After the coordinator StopMutators is finished, no matter whether ScheduleCollection is finished or not, the first STW bucket will be activated, and it should not be empty. There must be some root-scanning packets there.

Also, under the current implementation, the coordinator StopMutators cannot actually be executed before ScheduleCollection is finished. So this shouldn't be a problem at the very beginning.

The next problem is that even with the check for pending_coordinator_packets before opening buckets, the assertion you mentioned (https://github.com/mmtk/mmtk-core/blob/master/src/scheduler/scheduler.rs#L242-L245) may fail.

As mentioned before, this assertion is probably incorrect. The assumption in this assertion is too strong. The pending_coordinator_packets check is probably not necessary as well.

Yeah, you can argue that not having a sequential order between coordinator and worker packets can make things complicated. But my guess is that relaxing the order like this will probably not bring more real correctness issues. We already have a lot of phase-like buckets and synchronization barriers, and making the coordinator and workers non-overlapping will bring more single-threaded "coordinator" phases.

@wenyuzhao (Member) commented

Well the current work packet system is problematic. But in the original implementation, the coordinator thread can only receive four signals:

  1. GC is triggered (and it will call ScheduleCollection to initiate the GC)
  2. Stop the mutators (because OpenJDK expects the same thread to stop and wake the mutators up, so this cannot be done by an arbitrary worker)
  3. CoordinatorMessage::AllWorkerParked or CoordinatorMessage::BucketDrained to open more buckets
  4. CoordinatorMessage::Finish to finish a GC and wake the mutators up

Note that signal (3) is part of what this PR wants to do -- don't make workers do too many things other than executing packets. This was removed later in the work-stealing PR because, on a machine with more cores (say 32 cores), it is already expensive to do a synchronization between buckets. It will be even more expensive to communicate with one more thread and do one more synchronization to open buckets.

Also, I believe signal (2) can now be moved back to the normal workers as well. We have a VM companion thread in the openjdk binding to trigger safepoints now.

If we remove (2) and (3) from the coordinator, the coordinator can be greatly simplified: it only needs to initiate a GC and finish a GC. We don't need to care about most of the synchronization issues between the coordinator and workers anymore.

Given that (2) and (3) can be removed, the coordinator message queue and pending work counter can be removed as well. We only need the coordinator to have three states: it triggers a GC when the state transitions from Not In GC to GC In Progress, and finishes the GC when the state transitions back to Not In GC.
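
A rough sketch of the simplified coordinator proposed here (hypothetical; the comment names only the Not In GC and GC In Progress states, so the third variant below is a placeholder, and the struct and methods are assumptions):

```rust
/// Hypothetical coordinator states; only NotInGc and GcInProgress are named
/// in the comment above, the third variant is a placeholder.
#[derive(Clone, Copy)]
enum CoordinatorState {
    NotInGc,
    GcRequested, // placeholder for the unnamed third state
    GcInProgress,
}

struct Coordinator {
    state: CoordinatorState,
}

impl Coordinator {
    fn transition(&mut self, new_state: CoordinatorState) {
        let old = self.state;
        self.state = new_state;
        match (old, new_state) {
            // Initiate a GC when entering GC In Progress...
            (CoordinatorState::NotInGc, CoordinatorState::GcInProgress)
            | (CoordinatorState::GcRequested, CoordinatorState::GcInProgress) => {
                self.initiate_gc()
            }
            // ...and finish it when returning to Not In GC.
            (CoordinatorState::GcInProgress, CoordinatorState::NotInGc) => self.finish_gc(),
            _ => {}
        }
    }

    fn initiate_gc(&mut self) { /* e.g. create the ScheduleCollection work */ }
    fn finish_gc(&mut self) { /* e.g. wake the mutators up */ }
}
```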

@qinsoon (Member) commented Apr 12, 2023

In that case, if StopMutators finishes before ScheduleCollection, a worker may open any bucket if there is no work in them (ScheduleCollection hasn't yet pushed work to them). That is the reason we would need such a check.

StopMutators has two versions. One is executed by the worker, the other one is a coordinator work. The first one is just used to create the coordinator StopMutators.

This is not right. If COORDINATOR_ONLY_STW is false, a worker will execute the work packet. And we may see the problem.

if <E::VM as VMBinding>::VMCollection::COORDINATOR_ONLY_STW && !worker.is_coordinator() {
    mmtk.scheduler
        .add_coordinator_work(StopMutators::<E>::new(), worker);
    return;
}

@wenyuzhao (Member) commented

If COORDINATOR_ONLY_STW is false, a worker will execute the work packet. And we may see the problem.

This is still not a problem. No worker can open the first stw bucket. Thus no worker can open any buckets, until stop mutators work is finished.

@qinsoon (Member) commented Apr 12, 2023

If COORDINATOR_ONLY_STW is false, a worker will execute the work packet. And we may see the problem.

This is still not a problem. No worker can open the first stw bucket. Thus no worker can open any buckets, until stop mutators work is finished.

During the execution of StopMutators, the first stw bucket is activated.

mmtk.scheduler.notify_mutators_paused(mmtk);

After this, a worker can open new buckets if our defined requirement is met. At this point, the coordinator may still be executing ScheduleCollection. This is the problem I talked about -- we need some sort of synchronisation between workers and coordinator.

Checking pending_coordinator_packets before opening buckets attempts to solve the problem. I don't think we can just remove the assertions or the check. However, our implementation with pending_coordinator_packets still has issues. That is why we have this PR to simplify the synchronization.

@wenyuzhao (Member) commented

After this, a worker can open new buckets if our defined requirement is met. At this point, the coordinator may still be executing ScheduleCollection. This is the problem I talked about -- we need some sort of synchronisation between workers and coordinator.

With the current pending_coordinator_packets checks we have, this will never happen. The counter will not go to zero before ScheduleCollection finishes.

If you are really worried about this issue, a simpler solution would be to take ScheduleCollection out of the coordinator thread and treat it as a normal work packet executed by a worker thread.

Checking pending_coordinator_packets before opening buckets attempts to solve the problem. I don't think we can just remove the assertions or the check. However, our implementation with pending_coordinator_packets still has issues. That is why we have this PR to simplify the synchronization.

See my other proposal above. I believe all of the jobs can be moved out of the coordinator, apart from initiating a GC and ending a GC. If you think pending_coordinator_packets is not perfect for solving the problem, you can completely remove it now :)

@k-sareen changed the title from "Let the coordintator thread open buckets" to "Let the coordinator thread open buckets" on Apr 12, 2023
@qinsoon (Member) left a comment:

LGTM. Once Wenyu approves this PR, we can merge it.

use super::*;

/// A one-way channel for workers to send coordinator packets and notifications to the controller.
struct Channel<VM: VMBinding> {
A reviewer (Member) commented on this code:

How is this different from a standard multi-producer single-consumer channel (https://doc.rust-lang.org/std/sync/mpsc/fn.channel.html)?

@wks (Collaborator, Author) replied:

The differences are

  1. The message "all workers have parked" is idempotent, i.e. setting the all_workers_parked flag twice is the same as setting it once. It avoids the previous problem of sending multiple CoordinatorMessage::Finish messages in a row, which forced the coordinator to "consume" extraneous Finish messages after GC.
  2. The coordinator always processes all pending coordinator packets before processing the all_workers_parked flag. In this way, the coordinator is sure that no "pending coordinator packets" exist when opening new buckets.
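
A simplified sketch of the behaviour described in these two points (the types, fields, and methods here are assumptions; the real scheduler::controller::channel::Channel is generic over the VM binding and carries real coordinator work packets):

```rust
use std::collections::VecDeque;
use std::sync::{Condvar, Mutex};

// Stand-in for a coordinator work packet.
type CoordinatorPacket = Box<dyn FnOnce() + Send>;

enum Event {
    Packet(CoordinatorPacket),
    AllWorkersParked,
}

struct ChannelSync {
    coordinator_packets: VecDeque<CoordinatorPacket>,
    /// Idempotent: setting this twice has the same effect as setting it once,
    /// unlike pushing repeated messages into an mpsc queue.
    all_workers_parked: bool,
}

struct Channel {
    sync: Mutex<ChannelSync>,
    cond: Condvar,
}

impl Channel {
    /// Coordinator side: always drain pending coordinator packets before
    /// reacting to the "all workers parked" flag.
    fn poll(&self) -> Event {
        let mut sync = self.sync.lock().unwrap();
        loop {
            if let Some(packet) = sync.coordinator_packets.pop_front() {
                return Event::Packet(packet);
            }
            if sync.all_workers_parked {
                sync.all_workers_parked = false;
                return Event::AllWorkersParked;
            }
            sync = self.cond.wait(sync).unwrap();
        }
    }

    /// Worker side: report that all workers have parked.
    fn notify_all_workers_parked(&self) {
        self.sync.lock().unwrap().all_workers_parked = true;
        self.cond.notify_one();
    }
}
```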

@wenyuzhao (Member) left a comment:

lgtm

@wks merged commit 8dd1b73 into mmtk:master on Apr 12, 2023 (13 of 14 checks passed).
@wks mentioned this pull request on Apr 24, 2023.
wenyuzhao added a commit to wenyuzhao/mmtk-core that referenced this pull request Aug 1, 2023
Successfully merging this pull request may close these issues.

Still trying to open buckets when there are pending coordinator packets