Fix corruption when remarking a pool in another domain and that domain allocates #466

Open
ctk21 opened this issue Jan 28, 2021 · 4 comments



ctk21 commented Jan 28, 2021

We have had a number of CI failures recently, particularly with the bytecode test for parallel/domain_parallel_spawn_burn. The segfaults are non-deterministic and quite horrible to reproduce in rr (very time-consuming); however, you can get the crash in gdb with some patience.

Probing around, I have isolated a problem in the way we remark pools after a mark stack overflow. The particular issue is:

  • one domain overflows its mark stack and determines that a pool belonging to another domain needs remarking
  • while the original domain is remarking the pool belonging to the other domain, the other domain decides to allocate from that pool
  • if you get unlucky, the remarking domain sees the newly allocated object as needing marking, but the newly allocated object may have garbage in its fields; this leads to a crash (see the sketch after this list)
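
To make the race concrete, here's a minimal C sketch of the two racing code paths. All types, fields and function names below are illustrative simplifications, not the actual runtime code:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, much-simplified pool layout. */
typedef struct { uint8_t marked; void *fields[2]; } object;
typedef struct { object slots[64]; size_t next_free; } pool;

/* Domain A: remarking a pool owned by domain B after a mark stack overflow. */
void remark_pool(pool *p) {
  for (size_t i = 0; i < p->next_free; i++) {
    object *o = &p->slots[i];
    if (!o->marked) {
      /* If domain B has only just allocated o, its header already reads
       * "unmarked", but o->fields still hold garbage; following those
       * fields here is the crash seen in CI. */
    }
  }
}

/* Domain B: allocation fast path racing with the scan above. */
object *alloc_from_pool(pool *p) {
  object *o = &p->slots[p->next_free++]; /* slot becomes visible to the scan */
  o->marked = 0;                         /* ...before the caller initialises */
  return o;                              /* the object's fields.             */
}
```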

We have identified a few ways to fix this:

  • lock the pools; highly undesirable as it inserts a locking operation into the allocation path
  • use some epoch-based synchronization of garbage collection over the top of remarked pools
  • pass pools for remarking back to the pool owner, and alter the concurrent major garbage collector barriers to check that all remarking queues are serviced before progressing the major cycle
ctk21 commented Feb 11, 2021

After some back and forth with the possible fixes, I've got one that might work, based around passing the pool back to its owner.

Care is needed to handle:

  • conservation of the work to do as a pool is passed back to the domain that owns it (i.e. we need to make sure all remarking work and all marking work is done before a GC phase transition)
  • domain termination: a domain needs to be able to exit, and other domains need to be able to push remark work somewhere while that domain is terminating (or has already terminated)
  • the remark work for orphaned pools being tracked and done somewhere

With these in mind, my current thinking is that this would work (there's a code sketch after the list):

  • each domain owns a remark queue which can 'accept' work or be 'closed' (because the domain is terminating); this lets the posting domain know either that the work will be done by the receiver, or that it needs to post the work to the global queue
  • there is a global remark queue which holds work for any orphaned pools, or work which could not be enqueued onto a domain because its queue was not accepting
  • when a domain has drained its mark stack, it services its own remark queue and then the global remark queue
  • when checking for phase transitions, all mark stacks must be drained and all remark queues (including the global one) must be drained
  • when dequeuing from the global remark queue (or our own remark queue):
    • if the pool is owned by another domain, we attempt to push the work to the owner (or requeue it to the global queue if that domain is not accepting)
    • if we own the pool, we push the remarking work locally
    • if the pool is orphaned, we lock the orphaned pool list (to avoid the pool being handed out to a domain) and push the remarking work locally
  • care will be taken in the termination and marking code to ensure that marking progresses to drain the remark queues even when the mark stack is already empty; this ensures that the GC is always moving forwards
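
Assuming lock-protected queues with a try-push that can fail (see the sketch further down this thread), the servicing loop for a domain that has drained its mark stack might look roughly like this. Every name here is hypothetical, not the real runtime API:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types and primitives used by this sketch. */
typedef struct pool pool;
typedef struct domain domain;

extern pool   *remark_queue_pop(domain *d);               /* NULL when empty */
extern bool    remark_queue_try_push(domain *d, pool *p); /* false if closed */
extern pool   *global_remark_pop(void);
extern void    global_remark_push(pool *p);               /* always accepts  */
extern domain *pool_owner(pool *p);                       /* NULL = orphaned */
extern void    lock_orphaned_pool_list(void);
extern void    unlock_orphaned_pool_list(void);
extern void    push_pool_to_local_mark_work(pool *p);

/* Called by domain `self` once its own mark stack is drained. */
void service_remark_work(domain *self) {
  pool *p;
  /* Own remark queue first, then the global one. */
  while ((p = remark_queue_pop(self)) != NULL ||
         (p = global_remark_pop()) != NULL) {
    domain *owner = pool_owner(p);
    if (owner == self) {
      push_pool_to_local_mark_work(p);  /* we own it: remark it ourselves */
    } else if (owner != NULL) {
      if (!remark_queue_try_push(owner, p))
        global_remark_push(p);          /* owner not accepting: requeue   */
    } else {
      /* Orphaned pool: hold the orphaned pool list lock so the pool
       * cannot be handed to a domain while we remark it. */
      lock_orphaned_pool_list();
      push_pool_to_local_mark_work(p);
      unlock_orphaned_pool_list();
    }
  }
}
```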

kayceesrk commented

This looks reasonable to me. Given that remarking is rare, this queue could just be a lock-protected linked list, couldn't it?

How does work end up in the global remark queue?


ctk21 commented Feb 12, 2021

> This looks reasonable to me. Given that remarking is rare, this queue could just be a lock-protected linked list, couldn't it?

Absolutely - I was assuming the queues would be a lock-protected linked list.
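
For concreteness, a lock-protected linked list with an 'accepting' flag could be as small as the sketch below (names and layout are mine, not a proposed patch):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct pool pool;

typedef struct remark_item {
  pool *p;
  struct remark_item *next;
} remark_item;

typedef struct {
  pthread_mutex_t lock;
  remark_item *head;
  bool accepting;   /* set to false once the owning domain is terminating */
} remark_queue;

/* Returns false if the queue is closed; the caller must then fall back to
 * the global remark queue. (Allocation failure handling omitted.) */
bool remark_queue_try_push(remark_queue *q, pool *p) {
  bool ok;
  pthread_mutex_lock(&q->lock);
  ok = q->accepting;
  if (ok) {
    remark_item *it = malloc(sizeof *it);
    it->p = p;
    it->next = q->head;
    q->head = it;
  }
  pthread_mutex_unlock(&q->lock);
  return ok;
}

/* Called by the owning domain; returns NULL when the queue is empty. */
pool *remark_queue_pop(remark_queue *q) {
  pool *p = NULL;
  pthread_mutex_lock(&q->lock);
  remark_item *it = q->head;
  if (it) { q->head = it->next; p = it->p; free(it); }
  pthread_mutex_unlock(&q->lock);
  return p;
}
```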

> How does work end up in the global remark queue?

If, when trying to push work to a target domain, we discover that the domain is not accepting work (because it is terminating), we push the work to the global remark queue.

I was thinking we could do away with the global remark queue and requeue to your own remark queue when the target domain is terminating, but you could end up with a deadlock as follows:

  • domain A has a pool that needs remarking by domain B; domain A is terminating and so not accepting
  • domain B has a pool that needs remarking by domain A; domain B is also terminating and so not accepting
  • neither domain A nor domain B can drain its remark queue, and so neither can terminate

The global remark queue ensures that the domain-local remark queues can always be drained, independent of the state of any other domain, and so avoids the deadlocks that could occur with terminating domains.
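
Concretely, the push path could look like this (hypothetical names again); the fact that the global queue never refuses work is what breaks the A/B cycle above:

```c
#include <stdbool.h>

typedef struct pool pool;
typedef struct domain domain;

extern bool remark_queue_try_push(domain *d, pool *p); /* false if closed */
extern void global_remark_push(pool *p);               /* never refuses   */

/* Hand remark work for pool `p` to its owner, falling back to the global
 * remark queue if the owner is terminating. Because the global queue always
 * accepts, the pushing domain can still drain its own queue and terminate. */
void send_remark_work(domain *owner, pool *p) {
  if (!remark_queue_try_push(owner, p))
    global_remark_push(p);
}
```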

kayceesrk commented

Ok. Global remark queue sounds reasonable.
