Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Access violation when using winrt::resume_on_signal #1329

Closed
fereeh opened this issue Jul 4, 2023 · 22 comments
Closed

Bug: Access violation when using winrt::resume_on_signal #1329

fereeh opened this issue Jul 4, 2023 · 22 comments

Comments

@fereeh
Copy link

fereeh commented Jul 4, 2023

Version

No response

Summary

We ran into an access violation when using winrt::resume_on_signal. It was caused by a race condition in signal_awaiter::create_threadpool_wait. The method accesses *this after scheduling a "transfer" of the coroutine handle across threads using a threadpool wait. By the time m_state is accessed, a worker thread may have already resumed the coroutine, destroying the signal_awaiter and eventually the coroutine state.

Reproducible example

No response

Expected behavior

No response

Actual behavior

No response

Additional comments

No response

@kennykerr
Copy link
Collaborator

Keep in mind that a co_await expression is about async synchronization and not about async ownership. Its still the coroutine's responsibility to stabilize any references (including this) prior to suspension.

@sylveon
Copy link
Contributor

sylveon commented Jul 5, 2023

The coroutine code in this case is correct - it's just co_await winrt::resume_on_signal(handle) (which is done in sample code mind you) which will only destroy the signal_awaiter once the wait is over.

The issue is that if the threadpool wait immediately fires, the coroutine execution resumes and destroys the signal_awaiter before reaching this code (which seems to assume the threadpool wait won't immediately fire):

state expected = state::idle;
if (!m_state.compare_exchange_strong(expected, state::pending, std::memory_order_release))
{
fire_immediately();
}

This seems like it could be a problem with the other awaiters that rely on SetThreadpoolXXX functions. Why not move this assignment to before SetThreadpoolXXX is called? I think @oldnewthing wrote this originally, he probably knows why it was done that way.

@fereeh
Copy link
Author

fereeh commented Jul 5, 2023

Quoting cppreference for additional context:

Note that because the coroutine is fully suspended before entering awaiter.await_suspend(), that function is free to transfer the coroutine handle across threads, with no additional synchronization. For example, it can put it inside a callback, scheduled to run on a threadpool when async I/O operation completes. In that case, since the current coroutine may have been resumed and thus executed the awaiter object's destructor, all concurrently as await_suspend() continues its execution on the current thread, await_suspend() should treat *this as destroyed and not access it after the handle was published to other threads.

They have some sample code right below the quoted text that further illustrates the problem.

@github-actions
Copy link

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@sylveon
Copy link
Contributor

sylveon commented Jul 16, 2023

What's the opinions of maintainers on this? Is moving the assignment before SetThreadpoolXxx being called an acceptable solution?

@kennykerr
Copy link
Collaborator

It was simpler when I originally wrote it, and I'm not comfortable messing with it now. Unless @oldnewthing is available, I suggest you write your own resume_on_signal and move on. It's not hard to get it right if you keep things simple. 😊

@oldnewthing
Copy link
Member

I'm busy right now but I'll try to find time to look at this in the next few weeks.

@github-actions
Copy link

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@sylveon
Copy link
Contributor

sylveon commented Jul 28, 2023

Still a problem

@github-actions
Copy link

github-actions bot commented Aug 8, 2023

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@sylveon
Copy link
Contributor

sylveon commented Aug 11, 2023

The easiest way to reproduce this is to insert a sleep between SetThreadpoolX and the atomic access, to simulate a case where the threadpool thread fires before the atomic access runs. This is true of all resume_ functions

The result is that m_state is accessed after the temporary awaitable object is destroyed. This is easily confirmed by ASAN, and also gives a runtime crash in debug mode (because the object is memset'd to 0xDD). It "works" in release because the 0xDD memset doesn't happen.

I believe the solution would be to set m_state to pending before doing the threadpool wait. Something like this:

state expected = state::idle;
if (m_state.compare_exchange_strong(expected, state::pending, std::memory_order_release))
{
    WINRT_IMPL_SetThreadpoolWaitEx(m_wait.get(), m_handle, file_time, nullptr);
}
else
{
    // fire the callback immediately
    int64_t now = 0;
    WINRT_IMPL_SetThreadpoolWaitEx(m_wait.get(), WINRT_IMPL_GetCurrentProcess(), &now, nullptr);
}

I can PR it if this sounds like a good solution.

This use-after-free could also happen with winrt::impl::await_adapter, as it does suspending.exchange() after a registering a callback to async.Completed (some theorical IAsyncAction implementation could short-circuit and immediately call a callback being registered if the coroutine is already completed).

@oldnewthing
Copy link
Member

I agree that this is a problem. Now I have to reverse-engineer how "cancellation v2 (#1246)" works.

@sylveon
Copy link
Contributor

sylveon commented Aug 18, 2023

I have a solution that I'm about to open a PR for. This problem wasn't introduced by #1246 as far as I know, as the code was the same before. It got extracted into its own type by the PR, hence why it shows up in the git blame.

sylveon added a commit to sylveon/cppwinrt that referenced this issue Aug 18, 2023
Fixes microsoft#1329

This fixes the issue by using m_state before suspending, rather than after. The issue occurs because the callback could fire before our thread of execution resumes, causing the timespan_awaiter/signal_awaiter to be destroyed inside the coroutine frame before m_state is accessed.

As a drive-by improvement, currently if await_suspend is called with a non-idle state, the threadpool object is closed (cancelling the timer/wait), the existing coroutine handle is just dropped, and resume on the new handle is fired immediately. This would cause the existing pending coroutine to hang forever. Instead, avoid doing anything and throw an exception when the awaiter is not idle. This is a very unlikely event, the test does some gymnastics (reused awaiter) to achieve this state, but better safe than sorry.
@sylveon
Copy link
Contributor

sylveon commented Aug 18, 2023

Opened #1342

@github-actions
Copy link

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@sylveon
Copy link
Contributor

sylveon commented Aug 29, 2023

I'm gonna look at the fixing the PR this week

@github-actions
Copy link

github-actions bot commented Sep 9, 2023

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions
Copy link

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions
Copy link

github-actions bot commented Oct 8, 2023

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@fereeh
Copy link
Author

fereeh commented Oct 12, 2023

still a problem

@sylveon
Copy link
Contributor

sylveon commented Oct 12, 2023

If you're able to build cppwinrt yourself, you could try #1342 to make sure it solves your problem

@github-actions
Copy link

This issue is stale because it has been open 10 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants