-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes for C# concurrency bugs in 7492 #7529
Conversation
@orionstudt early preview, I still have some tests to write here. |
Probably, also fixes #6377 ? |
Co-authored-by: Justin Van Patten <jvp@justinvp.com>
Co-authored-by: Justin Van Patten <jvp@justinvp.com>
Co-authored-by: Justin Van Patten <jvp@justinvp.com>
Co-authored-by: Justin Van Patten <jvp@justinvp.com>
Co-authored-by: Josh Studt <32800478+orionstudt@users.noreply.github.com>
Co-authored-by: Josh Studt <32800478+orionstudt@users.noreply.github.com>
Removed the AggregatedException wrapper, it would seem that the special fancy type checks on Exception would not be able to detect what they want to detect if seeing AggregateException that originated from the new code, in LogExceptionToErrorStream (formerly HandleExceptionAsync). Seems like a good idea to preserve the behavior. Let me try to run the CI test suites a few times and see how this last version does on the test suite and determinism. |
OK folks finally a green checkmark from CI (though you need to restart it a few times to get past Python etc flakes...). Ready for final review! |
runner.RegisterTask($"task{i}", Task.Delay(100 + i)); | ||
} | ||
|
||
this.RunnerResult = runner.RunAsync<EmptyStack>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious why we even allow deploying two stacks.
What does this test prove exactly? It seems that RegisterTask
could be a no-op and it would still pass. Do we have a test that makes sure that all the tasks are awaited appropriately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test failed on master. This is our first test to reproduce the original problem. It was non-deterministic but N=100 was enough to trigger it pretty consistently. Consider it a regression test? I an add a comment about that.
The RunAsync<EmptyStack>()
clause indirectly invokes WhileRunningAsync()
which is where the bug was. Should I instead make that internal and invoke that directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok this is pretty interesting I investigated a bit further here.
The RunAsync<EmptyStack>
is essential to the repro. In fact it seems that I can only repro the exception when more than one stack is being deployed against the same Runner object (and same Deployment object). So to reproduce the issue we need a race of two concurrent WhileRunningAsync()
threads on the same runner. In fact even with these changes here, we will not see exceptions, but we might have race conditions in this case of mis-attributing exceptions (or spurious waits). If the single runner is servicing two stacks, it's going to mixup their exceptions.
So taking a big step back. Do we believe it's an error to be running two stacks simultaneously against the single Deployment? If it's an error, how can we best prevent that from happening?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed this on my review. To @mikhailshilkov 's point I didn't even know this was possible. I don't think a consumer would be able to do this because the access modifiers would prevent them from accessing the Runner. So to answer your question @t0yv0 I don't think you need to prevent it from happening because a consumer currently cannot do this. Their only entrypoint into pulumi execution is the static Deployment.RunAsync(...)
methods which ensures that a new Deployment is instantiated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conceptually I think it is an error. Currently a Deployment is an instance of a pulumi update action against a single stack, and the Runner is an implementation detail of that Deployment - and it is 1:1. I don't think it would ever make sense the way this it is currently architected to have 2 update actions executing against a single Deployment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the issue is that we need to find a way to repro the original issue which was a flakey race condition caused by access to the inner task dictionary of the Runner? It's always difficult to try to repro a flakey issue but maybe we could instantiate a Runner for testing purposes and just throw N tasks at it? Maybe you don't need a Deployment to repro since this issue was isolated to the Runner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm. That could just be a bad test case. Maybe this situation is unrealistic but under some other situations this condition can arise as well. I'm leaning a little to accepting changes here as they make the locking a little more foolproof?
Let me have one more look at CI failures that manifested with the exception we're fixing here.
Also. We could put a little check into WhileRunningAsync to make sure it's never invoked concurrently (one instance at a time), and fail loudly if it does. If we believe it should never happen, but if we manage to do it somehow anyway, that'd catch it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably wouldn't hurt anything. Yea when I made that original Automation API PR I specifically refactored the core SDK so that all pulumi execution enters right here, even the Automation API enters there. And you'll see the first thing it does is instantiate a Deployment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've put a PR here with the defensive check #7597 - locally it does not seem to detect it, let's see on CI... Also I locally sporadically get "key not found" in the tests, without triggering the concurrent WRA. So basically I have a poor repro test here, there must be other ways to repro the situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that PR basically passes! So far so good.
Ah the duplication of exceptions is still present in CI (and can't for the life of me get it locally). OK I need to work through this.
|
Allright folks I've played with timeouts until I could reliably reproduce exception duplication locally, and it's an artifact of badly written test. Here's what was happening exactly:
|
Allright, sorry for many detours here.. So my repro was bad, but I think this change still helps. On master I get sporadic test failure. With the change the tests are deterministic for me. @mikhailshilkov please take another look? Appreciate it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your changes LGTM. Just one question about the test.
* Reproduce the issue in a failing test * Fix * Tentative fix * Update sdk/dotnet/Pulumi/Deployment/TaskMonitoringHelper.cs Co-authored-by: Justin Van Patten <jvp@justinvp.com> * Update sdk/dotnet/Pulumi/Deployment/TaskMonitoringHelper.cs Co-authored-by: Justin Van Patten <jvp@justinvp.com> * Update sdk/dotnet/Pulumi/Deployment/TaskMonitoringHelper.cs Co-authored-by: Justin Van Patten <jvp@justinvp.com> * Update sdk/dotnet/Pulumi/Deployment/Deployment.Runner.cs Co-authored-by: Justin Van Patten <jvp@justinvp.com> * Do not allocate TaskCompletionSource when not needed * Update sdk/dotnet/Pulumi/Deployment/Deployment.Runner.cs Co-authored-by: Josh Studt <32800478+orionstudt@users.noreply.github.com> * Fix warning * Cache delegate * Simplify with named tuples * Test early exception termination * Test logging * Remove the smelly method of suppressing engine exceptions * Update sdk/dotnet/Pulumi/Deployment/TaskMonitoringHelper.cs Co-authored-by: Josh Studt <32800478+orionstudt@users.noreply.github.com> * Fix typo; check in xml docs * Try CI again * Add CHANGELOG entry * Dedup exceptions before reporting * Lock access to _exceptions list * Fix typos * Version of HandleExceptionsAsync that accepts N exceptions * Do not aggregate exceptions prematurely * Rename private members * Formatting * Summary markers * Short-circuit return * Stylistic fixes * Strengthen test * Check that we have only 1 exception * Remove defensive clause about AggregateException from the test * Simplify TerminatesEarly test * Remove EmptyStack * Notes on the regression nature of the WorksUnderStress test * Remove race condition repro as it is a poor repro, impossible to trigger from user code * Brace style Co-authored-by: Justin Van Patten <jvp@justinvp.com> Co-authored-by: Josh Studt <32800478+orionstudt@users.noreply.github.com>
Description
Fixes #7492 and #6377
Checklist