Skip to content

Conversation

@oldnewthing
Copy link
Member

@oldnewthing oldnewthing commented Sep 27, 2021

If we were unable to switch to a target apartment (either by an explicit co_await apartment_context or implicitly at the completion of co_await IAsyncInfo), we now report the exception in the co_await'ing coroutine, so it can be caught and handled.

Note that when this error occurs, we are stuck in limbo. We left the original apartment for the MTA, and now we can't get from the MTA to the target apartment. We have no choice but to raise the exception on the MTA, even though the coroutine may not have expected to be running there.

IAsyncAction Trapped()
{
    Object o; // some object with a destructor
    co_await destroyed_ui_context; // throws an exception
}

The destructor for Object will run on the MTA, even though the Trapped coroutine never runs on the MTA under normal conditions.

This is making the best of a bad situation. We need to raise the exception into the coroutine so that the coroutine can enter a failure state and allow the coroutine chain to proceed.

Since the object o already had to be prepared for destruction on the original apartment or the switched-to apartment (in case an exception occurs in either apartment), we know that the object cannot have apartment affinity anyway. So we're probably okay to destruct it on the MTA.

There is still a gotcha: If the co_await completes successfully, but is unable to return to the original apartment, the completed value of the co_await is lost, since the success is transformed to an apartment-switching exception. If the awaited coroutine's return value requires special cleanup, that cleanup will not occur. Example:

IAsyncAction Trapped()
{
    auto lamp = co_await TurnOnAsync();
    auto ensure_off = wil::scope_exit([&] { lamp.TurnOff(); });
    /* do more stuff with the lamp */
}

If the TurnOnAsync succeeds, but we are unable to return to the original apartment, the lamp is not returned by co_await, which means it never gets turned off. There's not really much we can do about this since co_await cannot both return a value and throw an exception.

One workaround is to do the co_await on the MTA:

IAsyncAction Trapped()
{
    apartment_context original_apartment;
    auto lampAction = TurnOnAsync();
    co_await resume_background();
    auto lamp = co_await lampAction; // await on the MTA
    auto ensure_off = wil::scope_exit([&] { lamp.TurnOff(); });
    co_await original_apartment;
    /* do more stuff with the lamp */
}

Another is to clean up the lamp in an exception handler.

IAsyncAction Trapped()
{
    auto lampAction = TurnOnAsync();
    Lamp lamp{ nullptr };
    try
    {
        lamp = co_await lampAction;
    }
    catch (...)
    {
        if (lampAction.Status() == AsyncStatus::Completed) lamp = lampAction.GetResults().TurnOff();
        throw;
    }
    auto ensure_off = wil:;scope_exit([&] { lamp.TurnOff(); });
    /* do more stuff with the lamp */
}

Implementation notes

The old code raised the exception on whatever thread happened to notice that something bad happened. Sometimes that's okay, for example if you do a co_await destroyed_ui_context from an MTA thread, in which case the exception is thrown into the coroutine and can be caught, or left uncaught and send the coroutine into an error state.

However, all of the other cases weren't quite so happy. To avoid deadlocks, the apartment switch is routed through the threadpool, and that means that the failure is raised on a threadpool thread. There's nobody around to catch that exception, so the process fails fast.

We solve the problem by having an hresult in the awaiter, which is initialized to 0 (S_OK) and transitions to error when we are about to resume the coroutine on the wrong apartment, so that the error can be propagated during await_resume(), which happens in the context of the awaiting coroutine and therefore can be caught by the coroutine (or sent to unhandled_exception).

This requires apartment_context to have a separate awaiter, rather than being its own awaiter, because we need a place to communicate the apartment switch failure. We can't store it in the apartment_context because we support awaiting a const apartment_context.

If we were unable to switch to a target apartment
(either by an explicit `co_await apartment_context`
or implicitly at the completion of ` co_await IAsyncInfo`),
we now report the exception in the `co_await`'ing coroutine,
so it can be caught and handled.

Note that when this error occurs, we are stuck in limbo.
We left the original apartment for the MTA, and now we can't
get from the MTA to the target apartment. We have no choice but
to raise the exception on the MTA, even though the coroutine
may not have expected to be running there.

```cpp
IAsyncAction Trapped()
{
    Object o; // some object with a destructor
    co_await destroyed_ui_context; // throws an exception
}
```

The destructor for `Object` will run on the MTA, even though
the `Trapped` coroutine never runs on the MTA under normal
conditions.

This is making the best of a bad situation. We need to raise
the exception into the coroutine so that the coroutine can
enter a failure state and allow the coroutine chain to proceed.

Since the object `o` already had to be prepared for destruction
on the original apartment or the switched-to apartment (in case
an exception occurs in either apartment), we know that the
object cannot have apartment affinity anyway. So we're probably
okay to destruct it on the MTA.

The old code raised the exception on whatever thread happened
to notice that something bad happened. Sometimes that's okay,
for example if you do a `co_await destroyed_ui_context` from
an MTA thread, in which case the exception is thrown into the
coroutine and can be caught, or left uncaught and send the
coroutine into an error state.

However, all of the other cases weren't quite so happy.
To avoid deadlocks, the apartment switch is routed through
the threadpool, and that means that the failure is raised
on a threadpool thread. There's nobody around to catch that
exception, so the process fails fast.

We solve the problem by having an atomic hresult in the
awaiter, which is initialized to 0 (`S_OK`) and transitions
to error when we are about to resume the coroutine on the wrong
apartment, so that the error can be propagated during
`await_resume()`, which happens in the context of the awaiting
coroutine and therefore can be caught by the coroutine
(or sent to `unhandled_exception`).

The hresult in which we record the failure needs to be atomic
to avoid data races in case multiple threads detect failure
simultaneously. However, the access to it can be
relaxed, because it is set and consumed by the same thread.

The current implementation pulls a sneaky trick: To avoid
having to create an `operator co_await` for the `apartment_context`
object, we continue to let it be its own awaiter. This means
that if the apartment switch fails, we record the failure in
the `apartment_context` itself. We are making the assumption
that once an apartment context goes bad, it is unrecoverably
broken.
This also removes the need for an atomic `int32_t`,
and removes the assumption that context failure is
permanent.
Should be initializing the int32_t, not the pointer
@antmor
Copy link
Contributor

antmor commented Sep 28, 2021

nit: the description's scope_exit have typos, :;scope_exit instead of ::scope_exit

@antmor
Copy link
Contributor

antmor commented Sep 28, 2021

So the only drawback here is that, if you have a STA-bound object, the Release and possibly the destructor of the object is not guaranteed to be freed on a STA if the co_await fails. We should at least make sure that in our documentation, we say that objects crossing a co_await need to be agile enough to not fail out if destruction doesn't happen happen on the source STA, if the thread can't be guaranteed by the developer.

@oldnewthing
Copy link
Member Author

@antmor If you have an STA-bound object, you're already in trouble, because it might destruct on the wrong apartment even in the absence of the MTA weirdness.

fire_and_forget Victim()
{
   StaObject object;
   DoSomething1();
   co_await resume_foreground(dispatcher);
   DoSomething2();
}

If DoSomething1() or resume_foreground() raise an exception, then the StaObject destructs on the old apartment. But if the co_await or the DoSomething2() raise an exception, then the StaObject destructs on the new apartment.

So the StaObject needs to be prepared to be destructed in either the old apartment or the new one. In other words, it must already be agile.

@kennykerr kennykerr merged commit ae411cc into microsoft:master Sep 30, 2021
@oldnewthing oldnewthing deleted the failed-apartment-switch branch September 30, 2021 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants