# NodeTestIntegrationTest.test_test_simple flaky in CI #6621
Does it always happen with this particular test, or does it fail in other tests in the same manner?
@baroquebobcat Three times in a row on the same shard, Node each time... but I can't absolutely claim it was the same exact test; I don't recall that level of detail.
That's fair.
Went green on try 4, FWIW. That's a 3/5 failure rate in this case.
@cosmicexplorer: Any thoughts on why the new logging isn't triggering here?
Also, I saw this test flake a million times in this same way while working on #6552, except that the fatal error logging was working most or all of the time. I don't have logs of that because I didn't realize that Travis reuses the build-log link when you restart a build (I thought it was a permalink). I messaged @stuhood on an internal Slack at 7:25pm California time on Tuesday, Oct 2nd about such a log, but the log doesn't show the error anymore, so I might check commits from #6552 around then and see if I removed anything obvious.
...it looks like the first commit made after that time was
There also seem to be a lot of things going on here.
The above alone (the number 4) might hide real fatal errors in the window between when the pantsd-runner is invoked and when the pantsd process is terminated. We could, of course, just wait until we have received the PID and PGRP chunks before triggering the exception catching mentioned above, which would avoid some of the issues with this situation. I'm also not at all sure how to interpret not having sent the PID and PGRP chunks over yet.
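To make that suggestion concrete, here is a minimal sketch of a client that consumes chunks until both the PID and PGRP chunks have arrived, and only then enables the error handling keyed on the remote process. The chunk-type names and the `chunk_reader` interface are assumptions for illustration, not the actual pailgun client API:

```python
# Hypothetical sketch: defer fatal-error handling until both the PID
# and PGRP chunks have been received from the pantsd-runner.
EXPECTED_PROCESS_CHUNKS = {"PID", "PGRP"}


def read_until_process_info(chunk_reader):
    """Consume chunks until we know the remote process's pid and pgrp.

    `chunk_reader` is assumed to yield (chunk_type, payload) tuples.
    Returns a dict mapping chunk type to its integer payload.
    """
    seen = {}
    for chunk_type, payload in chunk_reader:
        if chunk_type in EXPECTED_PROCESS_CHUNKS:
            seen[chunk_type] = int(payload)
        if EXPECTED_PROCESS_CHUNKS <= seen.keys():
            # Only now would it be safe to install signal forwarding /
            # exception catching keyed on the remote pid and pgrp.
            return seen
    raise RuntimeError("stream ended before PID/PGRP chunks were received")
```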
@cosmicexplorer maybe relevant?: #6626 |
Probably very relevant; tracking that one now too. I've been able to repro this Node integration test error locally on OSX before, so if it happens again I can try to investigate.
### Problem

As described in #6565, I had a fundamental misunderstanding of what was legal during a fork, which resulted in intermittent hangs when `PantsService` threads (usually `FSEventService`, but potentially the `StoreGCService` threads as well) were blocked on locks while attempting to acquire resources pre-fork.

As mentioned on the ticket, the `fork_lock` used to act as a honeypot to block any threads that might be attempting to interact with the Engine. This mostly had the right effect, but it was possible to forget to wrap some other lock access in the `fork_lock` (because, in effect, _all_ shared lock access needed to be protected by it). Worse, after the fork the `fork_lock` would be poisoned. This was a challenging line to walk: we needed to protect all shared locks, but we also needed to drop that protection post-fork in order to avoid deadlock.

### Solution

Rather than using one lock to approximate all of the other locks, ensure that all service threads that might interact with any non-private locks are "paused" at well-known locations ("safe points") while we fork. Since we would like more, and more fine-grained, locks over time, while keeping the number of un-pooled threads relatively constrained, adding this bookkeeping to service threads seems like the right tradeoff. While we continue to move toward a world (via `--v2` and `@console_rule`) where forking is not necessary, we'd also like to be able to incrementally gain benefit from the daemon by porting portions of `--v1` pipelines pre-fork.

### Result

Fixes #6565 and fixes #6621.
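A minimal sketch of the "safe point" pausing described above, using a condition-variable handshake; the class and method names here are hypothetical and this is not the actual `PantsService` implementation:

```python
import threading


class PausableServiceThread(threading.Thread):
    """Sketch of a service thread that parks at a "safe point" pre-fork."""

    def __init__(self, name):
        super().__init__(name=name, daemon=True)
        self._cond = threading.Condition()
        self._should_pause = False
        self._paused = False
        self._terminated = False

    def _maybe_pause(self):
        # Safe point: by construction the thread holds no shared locks
        # here, so it is safe to leave it parked across a fork.
        with self._cond:
            self._paused = True
            self._cond.notify_all()
            while self._should_pause and not self._terminated:
                self._cond.wait()
            self._paused = False

    def pause(self):
        # Called by the forking thread: blocks until the service thread
        # has parked itself at its safe point.
        with self._cond:
            self._should_pause = True
            while not self._paused:
                self._cond.wait()

    def resume(self):
        # Called post-fork (in the parent) to let the thread continue.
        with self._cond:
            self._should_pause = False
            self._cond.notify_all()

    def run(self):
        while not self._terminated:
            self._maybe_pause()
            self._do_one_unit_of_work()

    def _do_one_unit_of_work(self):
        pass  # work that may touch shared locks goes here
```

The key property is that `pause()` returns only once the service thread is parked at a point where it holds no shared locks, so a fork performed afterwards cannot inherit a lock that was held mid-operation, and no post-fork poisoning is needed.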
I've hit this three times in a row on the 'Py3 - Python contrib tests' shard.
Looks like: