
Segmentation fault in Task.bind task #2853

Closed
Kha opened this issue Nov 9, 2023 · 1 comment · Fixed by #2959
Labels
bug Something isn't working

Comments

Kha (Member) commented Nov 9, 2023

Executing the last portion of a Task.bind can crash because the closure for it has already been nulled.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007ff709a56a7c in lean_ptr_tag (o=0x0) at /home/sebastian/lean/lean/build/debug/stage1/include/lean/lean.h:386
386	    return o->m_tag;
[Current thread is 1 (Thread 0x7ff6feffd6c0 (LWP 1784522))]
(gdb) up
...
#5  0x00007ff709a6498c in lean::task_manager::run_task (this=0x55b73faa1380, lock=..., t=0x7ff6fdc4c098) at /home/sebastian/lean/lean/src/runtime/object.cpp:718
718	            add_dep(lean_to_task(closure_arg_cptr(t->m_imp->m_closure)[0]), t);

To reproduce:
Not easily. This is a somewhat rare failure in my new snapshot types branch.

Analysis:
This appears to be a plain race condition in the following code:

if (t->m_imp->m_deleted) {
    lock.unlock();
    if (v) lean_dec(v);
    free_task(t);
    lock.lock();
} else if (v != nullptr) {
    lean_assert(t->m_imp->m_closure == nullptr);
    resolve_core(t, v);
} else {
    // `bind` task has not finished yet, re-add as dependency of nested task
    lock.unlock();
    add_dep(lean_to_task(closure_arg_cptr(t->m_imp->m_closure)[0]), t);

We check that the task is not deleted; then, in the last branch, we drop the lock and access the closure. But a concurrent thread can empty the closure field in deactivate_task_core in between! I've confirmed that retrieving the closure while still holding the lock, as is done earlier in the same function, fixes my test failures: Kha@994a088. But it is not actually clear to me why, or whether, this is sufficient for either of these closure accesses: who is actually keeping the closure alive after we've dropped the lock? deactivate_task_core will dec it as soon as the task is unreferenced, and I don't see or remember a counteracting inc.
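For illustration, a minimal sketch of that fix, with the last branch reading the field while the lock is still held (hypothetical; it mirrors the snippet above rather than quoting the actual commit):

} else {
    // `bind` task has not finished yet, re-add as dependency of nested task.
    // Read m_closure before unlock(), so that a concurrent
    // deactivate_task_core cannot null the field between unlock and use.
    object * closure = t->m_imp->m_closure;
    lock.unlock();
    add_dep(lean_to_task(closure_arg_cptr(closure)[0]), t);

Note that this only avoids reading a nulled field; it does not by itself answer the lifetime question above, i.e. what keeps the closure object alive once the lock is dropped.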

@leodemoura In general I am concerned about the cost/benefit ratio of the following two task_object features that are involved in this bug:

  • the deactivated state that eagerly clears specific fields (such as m_closure) even while the task is queued or running. If deactivation merely set m_canceled and left all cleanup to the point where the task_manager is completely done with the task, there would be no interference with other task functions (a sketch follows below).
  • reusing the task object in task_bind_fn1 for a second logical task instead of allocating a new task object, which requires the special-case branch in task_manager that broke above.

As far as I can see, both of these provide marginal memory-use benefits while significantly complicating the task_object invariants and thus their maintainability. Both parts have needed fixes before.
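To make the first point concrete, here is a rough sketch of what a flag-only deactivation could look like (hypothetical code; the real deactivate_task_core signature and surrounding bookkeeping are assumed, not quoted):

void task_manager::deactivate_task_core(lean_task_object * t) {
    // Only request cancellation; do not eagerly free m_closure or other
    // fields, since run_task may still touch them after dropping the lock.
    t->m_imp->m_canceled = true;
    // All field cleanup is deferred to the point where the task manager
    // is completely done with the task (e.g. free_task).
}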

Kha added the bug label Nov 9, 2023
leodemoura (Member) commented:
As far as I can see, both of these provide marginal memory-use benefits while significantly complicating the task_object invariants and thus their maintainability. Both parts have needed fixes before.

Agreed. Please feel free to remove them.

github-merge-queue bot pushed a commit that referenced this issue on Dec 12, 2023 (#2959):

Fixes #2853, unblocking my work before I get to refactoring this part of the task manager.

Kha closed this as completed in #2959 on Dec 12, 2023