Fixes for C# concurrency bugs in 7492 #7529

Merged: 43 commits merged into master on Jul 22, 2021

Conversation

@t0yv0 (Member) commented on Jul 14, 2021

Description

  • reproduces the problem in a failing test
  • tentative fix that passes the test
  • fix existing test failures caused by exceptions being wrapped in AggregateException
  • add a test that early-termination behavior on exceptions is preserved
  • add a test that description logging behavior is preserved

Fixes #7492 and #6377

Checklist

  • I have added tests that prove my fix is effective or that my feature works
  • Yes, there are changes in this PR that warrant bumping the Pulumi Service API version

@t0yv0 (Member Author) commented on Jul 14, 2021

@orionstudt early preview, I still have some tests to write here.

@mikhailshilkov (Member) commented:

Probably also fixes #6377?

t0yv0 marked this pull request as ready for review on July 16, 2021 17:48
@t0yv0 (Member Author) commented on Jul 19, 2021

Removed the AggregateException wrapper. It seems that the special type checks on Exception in LogExceptionToErrorStream (formerly HandleExceptionAsync) would not detect what they are meant to detect if they were handed an AggregateException originating from the new code, so preserving the old behavior seems like a good idea. Let me run the CI test suites a few times and see how this latest version does on the test suite and on determinism.
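
For reference, this is the kind of unwrapping that would otherwise be needed to keep `is`-style type checks working once an exception has crossed a task boundary wrapped in AggregateException. This is only a minimal sketch: the helper below is illustrative and is not part of the SDK; the PR instead avoids producing the wrapper in the first place.

    // Sketch: unwrap a single-exception AggregateException so that downstream
    // "exception is SomeSpecificException" checks still match. Illustrative only.
    using System;

    static class ExceptionUnwrapExample
    {
        public static Exception Unwrap(Exception exception)
        {
            if (exception is AggregateException aggregate)
            {
                var flat = aggregate.Flatten();
                if (flat.InnerExceptions.Count == 1)
                {
                    // Exactly one inner exception: surface it directly.
                    return flat.InnerExceptions[0];
                }
                return flat;
            }
            return exception;
        }
    }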

@t0yv0 (Member Author) commented on Jul 20, 2021

OK folks, finally a green checkmark from CI (though you need to restart it a few times to get past the Python etc. flakes...). Ready for final review!

Resolved review threads (outdated): sdk/dotnet/Pulumi/Deployment/Deployment.Runner.cs, sdk/dotnet/Pulumi/Deployment/TaskMonitoringHelper.cs
Review thread on the new test code:

        runner.RegisterTask($"task{i}", Task.Delay(100 + i));
    }

    this.RunnerResult = runner.RunAsync<EmptyStack>();
Member:
I'm curious why we even allow deploying two stacks.

What does this test prove exactly? It seems that RegisterTask could be a no-op and it would still pass. Do we have a test that makes sure that all the tasks are awaited appropriately?

Member Author:

This test failed on master. It is our first test to reproduce the original problem. The failure was non-deterministic, but N=100 was enough to trigger it pretty consistently. Consider it a regression test? I can add a comment about that.

The RunAsync<EmptyStack>() call indirectly invokes WhileRunningAsync(), which is where the bug was. Should I instead make that method internal and invoke it directly?

Member Author:

OK, this is pretty interesting; I investigated a bit further.

The RunAsync<EmptyStack> call is essential to the repro. In fact it seems that I can only reproduce the exception when more than one stack is being deployed against the same Runner object (and the same Deployment object). So to reproduce the issue we need a race between two concurrent WhileRunningAsync() calls on the same runner. Even with the changes here, we will not see exceptions in that case, but we might still have races that mis-attribute exceptions (or cause spurious waits): if a single runner is servicing two stacks, it is going to mix up their exceptions.

So, taking a big step back: do we believe it is an error to run two stacks simultaneously against a single Deployment? If it is an error, how can we best prevent it from happening?

@orionstudt (Contributor) commented on Jul 21, 2021:

I missed this in my review. To @mikhailshilkov's point, I didn't even know this was possible. I don't think a consumer would be able to do this, because the access modifiers prevent them from accessing the Runner. So to answer your question @t0yv0, I don't think you need to prevent it from happening, because a consumer currently cannot do this. Their only entrypoint into Pulumi execution is the static Deployment.RunAsync(...) methods, which ensure that a new Deployment is instantiated.

Contributor:

Conceptually I think it is an error. Currently a Deployment is an instance of a Pulumi update action against a single stack, and the Runner is an implementation detail of that Deployment; they are 1:1. I don't think it would ever make sense, the way this is currently architected, to have two update actions executing against a single Deployment.

@orionstudt (Contributor) commented on Jul 21, 2021:

So the issue is that we need to find a way to reproduce the original problem, which was a flaky race condition caused by access to the inner task dictionary of the Runner? It's always difficult to reproduce a flaky issue, but maybe we could instantiate a Runner for testing purposes and just throw N tasks at it (roughly as sketched below)? Maybe you don't need a Deployment to repro, since this issue was isolated to the Runner.
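
A stress test along those lines might look roughly like the sketch below. This is only a sketch: IStressRunner, EmptyStack, and CreateTestRunner here are stand-ins for the SDK's internal runner and the tests' existing plumbing, not the real API, and the actual WorksUnderStress test added by the PR may be constructed differently.

    // Sketch only: the runner interface, EmptyStack, and the factory are illustrative.
    using System.Threading.Tasks;
    using Xunit;

    public interface IStressRunner
    {
        void RegisterTask(string description, Task task);
        Task<int> RunAsync<TStack>() where TStack : new();
    }

    public class EmptyStack { }

    public class RunnerStressTestSketch
    {
        // Hypothetical factory standing in for the SDK's internal test plumbing.
        private static IStressRunner CreateTestRunner() =>
            throw new System.NotImplementedException();

        [Fact]
        public async Task ManyRegisteredTasksCompleteCleanly()
        {
            var runner = CreateTestRunner();

            // Register many short-lived tasks to maximize the chance of hitting a race
            // on the runner's internal task dictionary (N=100 per the thread above).
            for (var i = 0; i < 100; i++)
            {
                runner.RegisterTask($"task{i}", Task.Delay(100 + i));
            }

            // On the buggy version this sporadically failed ("key not found");
            // with the fix it should complete deterministically.
            var exitCode = await runner.RunAsync<EmptyStack>();
            Assert.Equal(0, exitCode);
        }
    }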

Member Author:

Hrm. That could just be a bad test case. Maybe this situation is unrealistic, but the same condition might arise in other situations as well. I'm leaning a little toward accepting the changes here, as they make the locking a little more foolproof.

Let me have one more look at the CI failures that manifested the exception we're fixing here.

Also, we could put a little check into WhileRunningAsync to make sure it's never invoked concurrently (one instance at a time), and fail loudly if it is. If we believe it should never happen, but we somehow manage to do it anyway, that check would catch it.
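
Such a guard can be as small as an Interlocked flag at the top of the method. The sketch below is illustrative only (the field name and the method body are placeholders, not the SDK's actual code); the real defensive check went into the follow-up PR #7597.

    // Sketch of a "never run concurrently" guard; names and body are placeholders.
    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class WhileRunningGuardExample
    {
        private int _whileRunningActive; // 0 = idle, 1 = an instance is already running

        public async Task WhileRunningAsync()
        {
            if (Interlocked.CompareExchange(ref _whileRunningActive, 1, 0) != 0)
            {
                // Fail loudly: a second concurrent instance indicates an SDK bug.
                throw new InvalidOperationException(
                    "WhileRunningAsync was invoked while another instance was still running.");
            }
            try
            {
                // ... existing monitoring loop over the registered tasks ...
                await Task.CompletedTask;
            }
            finally
            {
                Interlocked.Exchange(ref _whileRunningActive, 0);
            }
        }
    }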

Contributor:

Probably wouldn't hurt anything. Yeah, when I made the original Automation API PR I specifically refactored the core SDK so that all Pulumi execution enters right here; even the Automation API enters there. And you'll see the first thing it does is instantiate a Deployment.

Member Author:

I've put up a PR with the defensive check: #7597. Locally it does not seem to detect anything; let's see on CI... I also sporadically get "key not found" locally in the tests without triggering the concurrent WhileRunningAsync. So basically I have a poor repro test here; there must be other ways to reproduce the situation.

Member Author:

Yeah that PR basically passes! So far so good.

Two more resolved review threads on sdk/dotnet/Pulumi/Deployment/TaskMonitoringHelper.cs (outdated)
@t0yv0 (Member Author) commented on Jul 21, 2021

Ah, the duplication of exceptions is still present in CI (and I can't for the life of me reproduce it locally). OK, I need to work through this.


   [xUnit.net 00:00:02.08]     Pulumi.Tests.DeploymentRunnerTests.TerminatesEarlyOnException [FAIL]
  X Pulumi.Tests.DeploymentRunnerTests.TerminatesEarlyOnException [47ms]
  Error Message:
   The collection was expected to contain a single element, but it contained more than one element.
  Stack Trace:
     at Pulumi.Tests.DeploymentRunnerTests.TerminatesEarlyOnExceptionStack.<>c__DisplayClass4_0.<.ctor>b__0(Task`1 t) in


   [xUnit.net 00:00:17.76]     Pulumi.Tests.DeploymentRunnerTests.TerminatesEarlyOnException [FAIL]
  Failed Pulumi.Tests.DeploymentRunnerTests.TerminatesEarlyOnException [90 ms]
  Error Message:
   Assert.IsType() Failure
Expected: Pulumi.RunException
Actual:   System.AggregateException
  Stack Trace:
     at Pulumi.Tests.DeploymentRunnerTests.TerminatesEarlyOnException() in /Users/runner/work/pulumi/pulumi/sdk/dotnet/Pulumi.Tests/Deployment/DeploymentRunnerTests.cs:line 47
--- End of stack trace from previous location where exception was thrown ---


@t0yv0 (Member Author) commented on Jul 21, 2021

Alright folks, I've played with timeouts until I could reliably reproduce the exception duplication locally, and it's an artifact of a badly written test. Here's what was happening exactly:

  1. The stack had outputs, so it participated in void IDeploymentInternal.RegisterResourceOutputs(Resource resource, Output<IDictionary<string, object?>> outputs), which is also a task registered with the runner.

  2. The RegisterResourceOutputs task was failing with "Deliberate test error", so the error was reported via the runner twice: once from the inner task and once from RegisterResourceOutputs.

  3. Normally, error deduplication works just fine to remove the duplicate report here (see the sketch after this list); however, since the test was using EmptyStack, it introduced a race with two WhileRunningAsync instances active on the same runner. Under certain circumstances the two instances each grabbed one copy of the exception, so deduplication did not apply.

  4. Based on the conversation here and the auxiliary PR, we believe WhileRunningAsync never runs more than one copy, so this was a broken test.

  5. The latest version of the test does not have this problem.
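
For context, the deduplication mentioned in point 3 is conceptually along these lines. This is only a sketch built from the commit messages ("Lock access to _exceptions list", "Dedup exceptions before reporting"); the class and method names are illustrative and may not match the real TaskMonitoringHelper.

    // Sketch of dedup-before-reporting with a locked exception list; names are illustrative.
    using System;
    using System.Collections.Generic;
    using System.Linq;

    class ExceptionCollectorExample
    {
        private readonly object _lockObject = new object();
        private readonly List<Exception> _exceptions = new List<Exception>();

        public void RecordException(Exception exception)
        {
            lock (_lockObject)
            {
                _exceptions.Add(exception);
            }
        }

        public IReadOnlyList<Exception> TakeDistinctExceptions()
        {
            lock (_lockObject)
            {
                // The same exception instance can be observed via more than one task
                // (an inner task and RegisterResourceOutputs), so dedup by reference
                // before reporting.
                var distinct = _exceptions.Distinct().ToList();
                _exceptions.Clear();
                return distinct;
            }
        }
    }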

@t0yv0 (Member Author) commented on Jul 21, 2021

Alright, sorry for the many detours here. My repro was bad, but I think this change still helps: on master I get sporadic test failures, while with the change the tests are deterministic for me.

@mikhailshilkov please take another look? Appreciate it.

@mikhailshilkov (Member) left a review:

Your changes LGTM. Just one question about the test.

t0yv0 merged commit b0f51a6 into master on Jul 22, 2021
pulumi-bot deleted the t0yv0/7492 branch on July 22, 2021 16:49
abhinav pushed a commit to pulumi/pulumi-dotnet that referenced this pull request Jan 11, 2023
* Reproduce the issue in a failing test

* Fix

* Tentative fix

* Update sdk/dotnet/Pulumi/Deployment/TaskMonitoringHelper.cs

Co-authored-by: Justin Van Patten <jvp@justinvp.com>

* Update sdk/dotnet/Pulumi/Deployment/TaskMonitoringHelper.cs

Co-authored-by: Justin Van Patten <jvp@justinvp.com>

* Update sdk/dotnet/Pulumi/Deployment/TaskMonitoringHelper.cs

Co-authored-by: Justin Van Patten <jvp@justinvp.com>

* Update sdk/dotnet/Pulumi/Deployment/Deployment.Runner.cs

Co-authored-by: Justin Van Patten <jvp@justinvp.com>

* Do not allocate TaskCompletionSource when not needed

* Update sdk/dotnet/Pulumi/Deployment/Deployment.Runner.cs

Co-authored-by: Josh Studt <32800478+orionstudt@users.noreply.github.com>

* Fix warning

* Cache delegate

* Simplify with named tuples

* Test early exception termination

* Test logging

* Remove the smelly method of suppressing engine exceptions

* Update sdk/dotnet/Pulumi/Deployment/TaskMonitoringHelper.cs

Co-authored-by: Josh Studt <32800478+orionstudt@users.noreply.github.com>

* Fix typo; check in xml docs

* Try CI again

* Add CHANGELOG entry

* Dedup exceptions before reporting

* Lock access to _exceptions list

* Fix typos

* Version of HandleExceptionsAsync that accepts N exceptions

* Do not aggregate exceptions prematurely

* Rename private members

* Formatting

* Summary markers

* Short-circuit return

* Stylistic fixes

* Strengthen test

* Check that we have only 1 exception

* Remove defensive clause about AggregateException from the test

* Simplify TerminatesEarly test

* Remove EmptyStack

* Notes on the regression nature of the WorksUnderStress test

* Remove race condition repro as it is a poor repro, impossible to trigger from user code

* Brace style

Co-authored-by: Justin Van Patten <jvp@justinvp.com>
Co-authored-by: Josh Studt <32800478+orionstudt@users.noreply.github.com>

Successfully merging this pull request may close these issues.

[auto/dotnet] - Flakey Automation API tests for C#