Fix corruption when remarking a pool in another domain and that domain allocates #466

Open
ctk21 opened this issue Jan 28, 2021 · 4 comments



ctk21 commented Jan 28, 2021

We have had a number of CI failures recently, particularly with the bytecode test for parallel/domain_parallel_spawn_burn. The segfaults are non-deterministic and quite horrible to reproduce in rr (very time-consuming); however, you can get the crash in gdb with some patience.

Probing around, I have isolated a problem in the way we remark pools after a mark stack overflow. The particular issue is:

  • one domain overflows its mark stack and determines that a pool belonging to another domain needs remarking
  • while the original domain is remarking the pool belonging to the other domain, the other domain decides to allocate from that pool
  • if you get unlucky, the remarking domain sees the newly allocated object as needing marking, but the newly allocated object may have garbage in its fields; this leads to a crash (see the sketch after this list)
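
To make the race concrete, here's a minimal C sketch of the two racing code paths. All types, fields and function names below are illustrative simplifications, not the actual runtime code:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, much-simplified pool layout. */
typedef struct { uint8_t marked; void *fields[2]; } object;
typedef struct { object slots[64]; size_t next_free; } pool;

/* Domain A: remarking a pool owned by domain B after a mark stack overflow. */
void remark_pool(pool *p) {
  for (size_t i = 0; i < p->next_free; i++) {
    object *o = &p->slots[i];
    if (!o->marked) {
      /* If domain B has only just allocated o, its header already reads
       * "unmarked", but o->fields still hold garbage; following those
       * fields here is the crash seen in CI. */
    }
  }
}

/* Domain B: allocation fast path racing with the scan above. */
object *alloc_from_pool(pool *p) {
  object *o = &p->slots[p->next_free++]; /* slot becomes visible to the scan */
  o->marked = 0;                         /* ...before the caller initialises */
  return o;                              /* the object's fields.             */
}
```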

We have identified a few ways to fix this:

  • lock the pools; highly undesirable as it inserts a locking operation into the allocation path
  • use some epoch-based synchronization of garbage collection over the top of remarked pools
  • pass pools for remarking back to the pool owner, and alter the concurrent major garbage collector barriers to check that all remarking queues are serviced before progressing the major cycle
ctk21 commented Feb 11, 2021

After some back and forth with the possible fixes, I've got one that might work, based around passing the pool back to its owner.

Care is needed to handle:

  • conservation of the work to do as a pool is passed back to the domain that owns it (i.e. we need to make sure all remarking work and all marking work is done before a GC phase transition)
  • domain termination: a domain needs to be able to exit, and other domains need to be able to push remark work somewhere while that domain is terminating (or has already terminated)
  • the remark work for orphaned pools being tracked and done somewhere

With these in mind, my current thinking is that this would work (there's a code sketch after the list):

  • each domain owns a remark queue which can 'accept' work or be 'closed' (because the domain is terminating); this lets the posting domain know either that the work will be done by the receiver, or that it needs to post the work to the global queue
  • there is a global remark queue which holds work for any orphaned pools, or work which could not be enqueued onto a domain because its queue was not accepting
  • when a domain has drained its mark stack, it services its own remark queue and then the global remark queue
  • when checking for phase transitions, all mark stacks must be drained and all remark queues (including the global one) must be drained
  • when dequeuing from the global remark queue (or our own remark queue):
    • if the pool is owned by another domain, we attempt to push the work to the owner (or requeue it to the global queue if that domain is not accepting)
    • if we own the pool, we push the remarking work locally
    • if the pool is orphaned, we lock the orphaned pool list (to avoid the pool being handed out to a domain) and push the remarking work locally
  • care will be taken in the termination and marking code to ensure that marking progresses to drain the remark queues even when the mark stack is already empty; this ensures that the GC is always moving forwards
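
Assuming lock-protected queues with a try-push that can fail (see the sketch further down this thread), the servicing loop for a domain that has drained its mark stack might look roughly like this. Every name here is hypothetical, not the real runtime API:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types and primitives used by this sketch. */
typedef struct pool pool;
typedef struct domain domain;

extern pool   *remark_queue_pop(domain *d);               /* NULL when empty */
extern bool    remark_queue_try_push(domain *d, pool *p); /* false if closed */
extern pool   *global_remark_pop(void);
extern void    global_remark_push(pool *p);               /* always accepts  */
extern domain *pool_owner(pool *p);                       /* NULL = orphaned */
extern void    lock_orphaned_pool_list(void);
extern void    unlock_orphaned_pool_list(void);
extern void    push_pool_to_local_mark_work(pool *p);

/* Called by domain `self` once its own mark stack is drained. */
void service_remark_work(domain *self) {
  pool *p;
  /* Own remark queue first, then the global one. */
  while ((p = remark_queue_pop(self)) != NULL ||
         (p = global_remark_pop()) != NULL) {
    domain *owner = pool_owner(p);
    if (owner == self) {
      push_pool_to_local_mark_work(p);  /* we own it: remark it ourselves */
    } else if (owner != NULL) {
      if (!remark_queue_try_push(owner, p))
        global_remark_push(p);          /* owner not accepting: requeue   */
    } else {
      /* Orphaned pool: hold the orphaned pool list lock so the pool
       * cannot be handed to a domain while we remark it. */
      lock_orphaned_pool_list();
      push_pool_to_local_mark_work(p);
      unlock_orphaned_pool_list();
    }
  }
}
```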

kayceesrk commented

This looks reasonable to me. Given that remarking is rare, this queue could just be a lock-protected linked list, couldn't it?

How does work end up in the global remark queue?


ctk21 commented Feb 12, 2021

> This looks reasonable to me. Given that remarking is rare, this queue could just be a lock-protected linked list, couldn't it?

Absolutely - I was assuming the queues would be a lock-protected linked list.
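
For concreteness, a lock-protected linked list with an 'accepting' flag could be as small as the sketch below (names and layout are mine, not a proposed patch):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct pool pool;

typedef struct remark_item {
  pool *p;
  struct remark_item *next;
} remark_item;

typedef struct {
  pthread_mutex_t lock;
  remark_item *head;
  bool accepting;   /* set to false once the owning domain is terminating */
} remark_queue;

/* Returns false if the queue is closed; the caller must then fall back to
 * the global remark queue. (Allocation failure handling omitted.) */
bool remark_queue_try_push(remark_queue *q, pool *p) {
  bool ok;
  pthread_mutex_lock(&q->lock);
  ok = q->accepting;
  if (ok) {
    remark_item *it = malloc(sizeof *it);
    it->p = p;
    it->next = q->head;
    q->head = it;
  }
  pthread_mutex_unlock(&q->lock);
  return ok;
}

/* Called by the owning domain; returns NULL when the queue is empty. */
pool *remark_queue_pop(remark_queue *q) {
  pool *p = NULL;
  pthread_mutex_lock(&q->lock);
  remark_item *it = q->head;
  if (it) { q->head = it->next; p = it->p; free(it); }
  pthread_mutex_unlock(&q->lock);
  return p;
}
```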

> How does work end up in the global remark queue?

If, when trying to push work to a target domain, we discover that the domain is not accepting work (because it is terminating), we push the work to the global remark queue.

I was thinking we could do away with the global remark queue and requeue to your own remark queue when the target domain is terminating, but you could end up with a deadlock as follows:

  • domain A has a pool that needs remarking by domain B; domain A is terminating and so not accepting
  • domain B has a pool that needs remarking by domain A; domain B is also terminating and so not accepting
  • neither domain A nor domain B can drain its remark queue, and so neither can terminate

The global remark queue ensures that the domain-local remark queues can always be drained, independent of the state of any other domain, and so avoids the deadlocks that could occur with terminating domains.
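
Concretely, the push path could look like this (hypothetical names again); the fact that the global queue never refuses work is what breaks the A/B cycle above:

```c
#include <stdbool.h>

typedef struct pool pool;
typedef struct domain domain;

extern bool remark_queue_try_push(domain *d, pool *p); /* false if closed */
extern void global_remark_push(pool *p);               /* never refuses   */

/* Hand remark work for pool `p` to its owner, falling back to the global
 * remark queue if the owner is terminating. Because the global queue always
 * accepts, the pushing domain can still drain its own queue and terminate. */
void send_remark_work(domain *owner, pool *p) {
  if (!remark_queue_try_push(owner, p))
    global_remark_push(p);
}
```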

kayceesrk commented

Ok. Global remark queue sounds reasonable.
