Conversation

@oldnewthing (Member) commented:
In PR #662, we changed the way we switch from STA to NA by bouncing through the threadpool so that we don't run the NA on an STA thread.

We extend this behavior to switches from STA to STA, to avoid RO_E_BLOCKED_CROSS_ASTA_CALL exceptions, which have been identified as the source of some crashes.

Interestingly, we add this support by deleting code. We would normally add tests for STA and MAINSTA to the NA test, but since MTA was ruled out by the previous test, and NA+STA+MAINSTA completely cover all the non-MTA possibilities, there's no need to perform any test at all. The fact that we got here means that the destination is NA/STA/MAINSTA.

Note also that the first test in this block checks whether we are resuming in the same apartment that we are currently in. That test is important because it prevents us from bouncing through the threadpool just to return to our own thread, which would needlessly burn a threadpool thread.
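
To make the decision order concrete, here is a minimal, self-contained toy sketch of the control flow just described. Everything in it (the `kind` enum, `apartment_ref`, `resume_on`) is a hypothetical model of the tests, not the actual cppwinrt implementation:

```cpp
#include <cstdio>

// Toy model only, not cppwinrt code: apartment kinds plus an id, so that
// two distinct STA threads compare as different apartments.
enum class kind { sta, main_sta, mta, nta };
struct apartment_ref { kind k; int id; };

void resume_on(apartment_ref current, apartment_ref destination)
{
    if (destination.id == current.id)
    {
        // First test: resuming in the apartment we are already in, so do
        // it inline rather than burn a threadpool thread just to come
        // back to our own thread.
        std::puts("resume inline");
    }
    else if (destination.k == kind::mta)
    {
        // An earlier test takes care of the MTA.
        std::puts("resume in the MTA");
    }
    else
    {
        // Whatever remains is NA/STA/MAINSTA, which together cover every
        // non-MTA possibility, so no further test is needed: always
        // bounce through the threadpool before entering the destination.
        std::puts("bounce through the threadpool");
    }
}

int main()
{
    resume_on({ kind::sta, 1 }, { kind::sta, 1 });      // same apartment
    resume_on({ kind::sta, 1 }, { kind::mta, 2 });      // MTA path
    resume_on({ kind::sta, 1 }, { kind::main_sta, 3 }); // threadpool bounce
}
```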

Added a new unit test to verify that STA-to-STA transfers release the source STA. The test failed before the change, and passes after.
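
For illustration, here is a rough sketch of the shape of such a test; it is not the actual test added by this PR, and the names (`hop_to`, `context_a`, `b_released`) are hypothetical. It assumes C++/WinRT's `apartment_context` and `fire_and_forget`, plus a plain Win32 event with a 1-second timeout so a regression fails rather than hangs:

```cpp
#include <windows.h>
#include <winrt/Windows.Foundation.h>
#include <optional>
#include <thread>

// The hop itself: with this change, co_await on another STA's context
// bounces through the threadpool, so the calling thread is released
// immediately instead of blocking until the destination STA responds.
winrt::fire_and_forget hop_to(winrt::apartment_context destination)
{
    co_await destination;
}

int main()
{
    winrt::handle captured{ CreateEventW(nullptr, TRUE, FALSE, nullptr) };
    winrt::handle b_released{ CreateEventW(nullptr, TRUE, FALSE, nullptr) };
    std::optional<winrt::apartment_context> context_a;

    std::thread a([&]
    {
        winrt::init_apartment(winrt::apartment_type::single_threaded);
        context_a.emplace();        // capture STA thread "A"'s context
        SetEvent(captured.get());
        Sleep(2000);                // simulate a hung STA (not pumping)
        MSG msg;                    // wake up and pump so the hop completes
        while (PeekMessageW(&msg, nullptr, 0, 0, PM_REMOVE))
        {
            DispatchMessageW(&msg);
        }
    });

    std::thread b([&]
    {
        winrt::init_apartment(winrt::apartment_type::single_threaded);
        WaitForSingleObject(captured.get(), INFINITE);
        hop_to(*context_a);         // returns at the first suspension point
        SetEvent(b_released.get()); // proves "B" was not blocked by the hop
    });

    // Before the change, "B" hung inside the hop and this wait timed out;
    // after the change, "B" signals well within the 1-second budget.
    bool const released =
        WaitForSingleObject(b_released.get(), 1000) == WAIT_OBJECT_0;
    a.join();
    b.join();
    return released ? 0 : 1;
}
```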

Note however that this introduces a possibly undesirable behavior change: Suppose STA thread "A" hangs and STA thread "B" has many tasks, some of which want to resume on STA thread "A".

Under the old behavior, STA thread "B" would run the first task, and then that task would try to switch to STA thread "A" and hang. STA thread "B" would remain hung and be unavailable to process tasks. When STA thread "A" finally wakes up, STA thread "B" also wakes up and resumes doing work. The number of hung threads is capped at the number of STA threads.

Under the new behavior, STA thread "B" would run the first task, and then ask the threadpool to resume on STA thread "A". The threadpool thread hangs instead of STA thread "B". STA thread "B" is now available to do other work (yay), but at the cost of a hung threadpool thread. One of the things STA thread "B" might do is create more work items that want to switch to STA thread "A", so we slowly (or maybe quickly) exhaust the threadpool.
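
Assuming the hypothetical `hop_to` and `context_a` from the test sketch above, this worst case boils down to a loop like the following: each call returns promptly on STA thread "B", but parks one more threadpool thread inside the blocked hop into apartment "A".

```cpp
// Hypothetical worst case, reusing hop_to from the sketch above: "B" keeps
// generating work items that resume on the hung "A". Every iteration frees
// "B" right away but leaves one threadpool thread blocked, so the pool
// drains one thread per work item.
for (int i = 0; i < 1000; ++i)
{
    hop_to(*context_a);
}
```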

A mitigating factor for this worst-case scenario is that we sort of had that problem anyway: If any task on a threadpool thread wants to switch to STA thread "A", it will hang with or without these changes. Since it is common to pass through a threadpool thread at some point, you were likely to exhaust the threadpool even under the old algorithm.

A review comment was left on this snippet of the new unit test:

```cpp
// processes work, it will unstick the first STA thread.
// This test verifies that the second STA thread does become available
// to do work and is not hung waiting for the first STA thread.
co_await context1;
```

Contributor:
So if this were broken (i.e., before your changes), this co_await would block forever if not for the 1000ms timeout on line 85. Is that right?

@oldnewthing (Member, Author) replied:
Right. My initial version had INFINITE, and the test hung. We don't like it when tests hang, so I had it detect that it had been stuck for 1 second (should be plenty) and declare failure if so.

@kennykerr (Collaborator) left a comment:
Looks good - thanks Raymond!

@kennykerr merged commit 0835afb into microsoft:master on Feb 11, 2021.
@oldnewthing deleted the sta-to-sta branch on Feb 16, 2021.