Fixes for C# concurrency bugs in 7492 #7529

Merged: 43 commits merged into master on Jul 22, 2021

Conversation

@t0yv0 (Member) commented on Jul 14, 2021

Description

  • reproduces the problem in a failing test
  • tentative fix that passes the test
  • fix existing test failures caused by exceptions being wrapped in AggregateException
  • add a test that early-termination behavior on exceptions is preserved
  • add a test that description logging behavior is preserved

Fixes #7492 and #6377

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • Yes, there are changes in this PR that warrant bumping the Pulumi Service API version

@t0yv0 (Member Author) commented on Jul 14, 2021

@orionstudt early preview, I still have some tests to write here.

@mikhailshilkov (Member) commented:

Probably also fixes #6377?

t0yv0 marked this pull request as ready for review on July 16, 2021 17:48
@t0yv0 (Member Author) commented on Jul 19, 2021

Removed the AggregateException wrapper. It seems that the special type checks on Exception in LogExceptionToErrorStream (formerly HandleExceptionAsync) would not detect what they are meant to detect if they were handed an AggregateException originating from the new code, so preserving the old behavior seems like a good idea. Let me run the CI test suites a few times and see how this latest version does on the test suite and on determinism.
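
For reference, this is the kind of unwrapping that would otherwise be needed to keep `is`-style type checks working once an exception has crossed a task boundary wrapped in AggregateException. This is only a minimal sketch: the helper below is illustrative and is not part of the SDK; the PR instead avoids producing the wrapper in the first place.

    // Sketch: unwrap a single-exception AggregateException so that downstream
    // "exception is SomeSpecificException" checks still match. Illustrative only.
    using System;

    static class ExceptionUnwrapExample
    {
        public static Exception Unwrap(Exception exception)
        {
            if (exception is AggregateException aggregate)
            {
                var flat = aggregate.Flatten();
                if (flat.InnerExceptions.Count == 1)
                {
                    // Exactly one inner exception: surface it directly.
                    return flat.InnerExceptions[0];
                }
                return flat;
            }
            return exception;
        }
    }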

@t0yv0 (Member Author) commented on Jul 20, 2021

OK folks, finally a green checkmark from CI (though you need to restart it a few times to get past the Python etc. flakes...). Ready for final review!

Resolved review threads (outdated): sdk/dotnet/Pulumi/Deployment/Deployment.Runner.cs, sdk/dotnet/Pulumi/Deployment/TaskMonitoringHelper.cs
Review thread on the new test code:

        runner.RegisterTask($"task{i}", Task.Delay(100 + i));
    }

    this.RunnerResult = runner.RunAsync<EmptyStack>();
Member:
I'm curious why we even allow deploying two stacks.

What does this test prove exactly? It seems that RegisterTask could be a no-op and it would still pass. Do we have a test that makes sure that all the tasks are awaited appropriately?

Member Author:

This test failed on master. It is our first test to reproduce the original problem. The failure was non-deterministic, but N=100 was enough to trigger it pretty consistently. Consider it a regression test? I can add a comment about that.

The RunAsync<EmptyStack>() call indirectly invokes WhileRunningAsync(), which is where the bug was. Should I instead make that method internal and invoke it directly?

Member Author:

OK, this is pretty interesting; I investigated a bit further.

The RunAsync<EmptyStack> call is essential to the repro. In fact it seems that I can only reproduce the exception when more than one stack is being deployed against the same Runner object (and the same Deployment object). So to reproduce the issue we need a race between two concurrent WhileRunningAsync() calls on the same runner. Even with the changes here, we will not see exceptions in that case, but we might still have races that mis-attribute exceptions (or cause spurious waits): if a single runner is servicing two stacks, it is going to mix up their exceptions.

So, taking a big step back: do we believe it is an error to run two stacks simultaneously against a single Deployment? If it is an error, how can we best prevent it from happening?

@orionstudt (Contributor) commented on Jul 21, 2021:

I missed this in my review. To @mikhailshilkov's point, I didn't even know this was possible. I don't think a consumer would be able to do this, because the access modifiers prevent them from accessing the Runner. So to answer your question @t0yv0, I don't think you need to prevent it from happening, because a consumer currently cannot do this. Their only entrypoint into Pulumi execution is the static Deployment.RunAsync(...) methods, which ensure that a new Deployment is instantiated.

Contributor:

Conceptually I think it is an error. Currently a Deployment is an instance of a Pulumi update action against a single stack, and the Runner is an implementation detail of that Deployment; they are 1:1. I don't think it would ever make sense, the way this is currently architected, to have two update actions executing against a single Deployment.

@orionstudt (Contributor) commented on Jul 21, 2021:

So the issue is that we need to find a way to reproduce the original problem, which was a flaky race condition caused by access to the inner task dictionary of the Runner? It's always difficult to reproduce a flaky issue, but maybe we could instantiate a Runner for testing purposes and just throw N tasks at it (roughly as sketched below)? Maybe you don't need a Deployment to repro, since this issue was isolated to the Runner.
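
A stress test along those lines might look roughly like the sketch below. This is only a sketch: IStressRunner, EmptyStack, and CreateTestRunner here are stand-ins for the SDK's internal runner and the tests' existing plumbing, not the real API, and the actual WorksUnderStress test added by the PR may be constructed differently.

    // Sketch only: the runner interface, EmptyStack, and the factory are illustrative.
    using System.Threading.Tasks;
    using Xunit;

    public interface IStressRunner
    {
        void RegisterTask(string description, Task task);
        Task<int> RunAsync<TStack>() where TStack : new();
    }

    public class EmptyStack { }

    public class RunnerStressTestSketch
    {
        // Hypothetical factory standing in for the SDK's internal test plumbing.
        private static IStressRunner CreateTestRunner() =>
            throw new System.NotImplementedException();

        [Fact]
        public async Task ManyRegisteredTasksCompleteCleanly()
        {
            var runner = CreateTestRunner();

            // Register many short-lived tasks to maximize the chance of hitting a race
            // on the runner's internal task dictionary (N=100 per the thread above).
            for (var i = 0; i < 100; i++)
            {
                runner.RegisterTask($"task{i}", Task.Delay(100 + i));
            }

            // On the buggy version this sporadically failed ("key not found");
            // with the fix it should complete deterministically.
            var exitCode = await runner.RunAsync<EmptyStack>();
            Assert.Equal(0, exitCode);
        }
    }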

Member Author:

Hrm. That could just be a bad test case. Maybe this situation is unrealistic, but the same condition might arise in other situations as well. I'm leaning a little toward accepting the changes here, as they make the locking a little more foolproof.

Let me have one more look at the CI failures that manifested the exception we're fixing here.

Also, we could put a little check into WhileRunningAsync to make sure it's never invoked concurrently (one instance at a time), and fail loudly if it is. If we believe it should never happen, but we somehow manage to do it anyway, that check would catch it.
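
Such a guard can be as small as an Interlocked flag at the top of the method. The sketch below is illustrative only (the field name and the method body are placeholders, not the SDK's actual code); the real defensive check went into the follow-up PR #7597.

    // Sketch of a "never run concurrently" guard; names and body are placeholders.
    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class WhileRunningGuardExample
    {
        private int _whileRunningActive; // 0 = idle, 1 = an instance is already running

        public async Task WhileRunningAsync()
        {
            if (Interlocked.CompareExchange(ref _whileRunningActive, 1, 0) != 0)
            {
                // Fail loudly: a second concurrent instance indicates an SDK bug.
                throw new InvalidOperationException(
                    "WhileRunningAsync was invoked while another instance was still running.");
            }
            try
            {
                // ... existing monitoring loop over the registered tasks ...
                await Task.CompletedTask;
            }
            finally
            {
                Interlocked.Exchange(ref _whileRunningActive, 0);
            }
        }
    }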

Contributor:

Probably wouldn't hurt anything. Yeah, when I made the original Automation API PR I specifically refactored the core SDK so that all Pulumi execution enters right here; even the Automation API enters there. And you'll see the first thing it does is instantiate a Deployment.

Member Author:

I've put up a PR with the defensive check: #7597. Locally it does not seem to detect anything; let's see on CI... I also sporadically get "key not found" locally in the tests without triggering the concurrent WhileRunningAsync. So basically I have a poor repro test here; there must be other ways to reproduce the situation.

Member Author:

Yeah that PR basically passes! So far so good.

Two more resolved review threads on sdk/dotnet/Pulumi/Deployment/TaskMonitoringHelper.cs (outdated)
@t0yv0 (Member Author) commented on Jul 21, 2021

Ah, the duplication of exceptions is still present in CI (and I can't for the life of me reproduce it locally). OK, I need to work through this.


   [xUnit.net 00:00:02.08]     Pulumi.Tests.DeploymentRunnerTests.TerminatesEarlyOnException [FAIL]
  X Pulumi.Tests.DeploymentRunnerTests.TerminatesEarlyOnException [47ms]
  Error Message:
   The collection was expected to contain a single element, but it contained more than one element.
  Stack Trace:
     at Pulumi.Tests.DeploymentRunnerTests.TerminatesEarlyOnExceptionStack.<>c__DisplayClass4_0.<.ctor>b__0(Task`1 t) in


   [xUnit.net 00:00:17.76]     Pulumi.Tests.DeploymentRunnerTests.TerminatesEarlyOnException [FAIL]
  Failed Pulumi.Tests.DeploymentRunnerTests.TerminatesEarlyOnException [90 ms]
  Error Message:
   Assert.IsType() Failure
Expected: Pulumi.RunException
Actual:   System.AggregateException
  Stack Trace:
     at Pulumi.Tests.DeploymentRunnerTests.TerminatesEarlyOnException() in /Users/runner/work/pulumi/pulumi/sdk/dotnet/Pulumi.Tests/Deployment/DeploymentRunnerTests.cs:line 47
--- End of stack trace from previous location where exception was thrown ---


@t0yv0 (Member Author) commented on Jul 21, 2021

Alright folks, I've played with timeouts until I could reliably reproduce the exception duplication locally, and it's an artifact of a badly written test. Here's what was happening exactly:

  1. The stack had outputs, so it participated in void IDeploymentInternal.RegisterResourceOutputs(Resource resource, Output<IDictionary<string, object?>> outputs), which is also a task registered with the runner.

  2. The RegisterResourceOutputs task was failing with "Deliberate test error", so the error was reported via the runner twice: once from the inner task and once from RegisterResourceOutputs.

  3. Normally, error deduplication works just fine to remove the duplicate report here (see the sketch after this list); however, since the test was using EmptyStack, it introduced a race with two WhileRunningAsync instances active on the same runner. Under certain circumstances the two instances each grabbed one copy of the exception, so deduplication did not apply.

  4. Based on the conversation here and the auxiliary PR, we believe WhileRunningAsync never runs more than one copy, so this was a broken test.

  5. The latest version of the test does not have this problem.
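
For context, the deduplication mentioned in point 3 is conceptually along these lines. This is only a sketch built from the commit messages ("Lock access to _exceptions list", "Dedup exceptions before reporting"); the class and method names are illustrative and may not match the real TaskMonitoringHelper.

    // Sketch of dedup-before-reporting with a locked exception list; names are illustrative.
    using System;
    using System.Collections.Generic;
    using System.Linq;

    class ExceptionCollectorExample
    {
        private readonly object _lockObject = new object();
        private readonly List<Exception> _exceptions = new List<Exception>();

        public void RecordException(Exception exception)
        {
            lock (_lockObject)
            {
                _exceptions.Add(exception);
            }
        }

        public IReadOnlyList<Exception> TakeDistinctExceptions()
        {
            lock (_lockObject)
            {
                // The same exception instance can be observed via more than one task
                // (an inner task and RegisterResourceOutputs), so dedup by reference
                // before reporting.
                var distinct = _exceptions.Distinct().ToList();
                _exceptions.Clear();
                return distinct;
            }
        }
    }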

@t0yv0 (Member Author) commented on Jul 21, 2021

Alright, sorry for the many detours here. My repro was bad, but I think this change still helps: on master I get sporadic test failures, while with the change the tests are deterministic for me.

@mikhailshilkov please take another look? Appreciate it.

@mikhailshilkov (Member) left a review:

Your changes LGTM. Just one question about the test.

t0yv0 merged commit b0f51a6 into master on Jul 22, 2021
pulumi-bot deleted the t0yv0/7492 branch on July 22, 2021 16:49
abhinav pushed a commit to pulumi/pulumi-dotnet that referenced this pull request Jan 11, 2023
* Reproduce the issue in a failing test

* Fix

* Tentative fix

* Update sdk/dotnet/Pulumi/Deployment/TaskMonitoringHelper.cs

Co-authored-by: Justin Van Patten <jvp@justinvp.com>

* Update sdk/dotnet/Pulumi/Deployment/TaskMonitoringHelper.cs

Co-authored-by: Justin Van Patten <jvp@justinvp.com>

* Update sdk/dotnet/Pulumi/Deployment/TaskMonitoringHelper.cs

Co-authored-by: Justin Van Patten <jvp@justinvp.com>

* Update sdk/dotnet/Pulumi/Deployment/Deployment.Runner.cs

Co-authored-by: Justin Van Patten <jvp@justinvp.com>

* Do not allocate TaskCompletionSource when not needed

* Update sdk/dotnet/Pulumi/Deployment/Deployment.Runner.cs

Co-authored-by: Josh Studt <32800478+orionstudt@users.noreply.github.com>

* Fix warning

* Cache delegate

* Simplify with named tuples

* Test early exception termination

* Test logging

* Remove the smelly method of suppressing engine exceptions

* Update sdk/dotnet/Pulumi/Deployment/TaskMonitoringHelper.cs

Co-authored-by: Josh Studt <32800478+orionstudt@users.noreply.github.com>

* Fix typo; check in xml docs

* Try CI again

* Add CHANGELOG entry

* Dedup exceptions before reporting

* Lock access to _exceptions list

* Fix typos

* Version of HandleExceptionsAsync that accepts N exceptions

* Do not aggregate exceptions prematurely

* Rename private members

* Formatting

* Summary markers

* Short-circuit return

* Stylistic fixes

* Strengthen test

* Check that we have only 1 exception

* Remove defensive clause about AggregateException from the test

* Simplify TerminatesEarly test

* Remove EmptyStack

* Notes on the regression nature of the WorksUnderStress test

* Remove race condition repro as it is a poor repro, impossible to trigger from user code

* Brace style

Co-authored-by: Justin Van Patten <jvp@justinvp.com>
Co-authored-by: Josh Studt <32800478+orionstudt@users.noreply.github.com>

Successfully merging this pull request may close these issues.

[auto/dotnet] - Flakey Automation API tests for C#