NodeTestIntegrationTest.test_test_simple flaky in CI #6621

Closed
jsirois opened this issue Oct 11, 2018 · 11 comments · Fixed by #6671
jsirois commented Oct 11, 2018

I've hit this three times in a row on the 'Py3 - Python contrib tests' shard.

Looks like:

                    ____ NodeTestIntegrationTest.test_test_simple ____
                     
                     self = <pants_test.contrib.node.tasks.test_node_test_integration.NodeTestIntegrationTest testMethod=test_test_simple>
                     args = (), kwargs = {}, enable_daemon = 'true'
                     subprocess_dir = '/tmp/tmpcnr7t5r1'
                     env = {'HERMETIC_ENV': 'PANTS_ENABLE_PANTSD,PANTS_ENABLE_V2_ENGINE,PANTS_SUBPROCESSDIR', 'PANTS_ENABLE_PANTSD': 'true', 'PANTS_ENABLE_V2_ENGINE': 'true', 'PANTS_SUBPROCESSDIR': '/tmp/tmpcnr7t5r1'}
                     
                         def wrapper(self, *args, **kwargs):
                           for enable_daemon in ('false', 'true',):
                             with temporary_dir() as subprocess_dir:
                               env = {
                                   'HERMETIC_ENV': 'PANTS_ENABLE_PANTSD,PANTS_ENABLE_V2_ENGINE,PANTS_SUBPROCESSDIR',
                                   'PANTS_ENABLE_PANTSD': enable_daemon,
                                   'PANTS_ENABLE_V2_ENGINE': enable_daemon,
                                   'PANTS_SUBPROCESSDIR': subprocess_dir,
                                 }
                               with environment_as(**env):
                                 try:
                                   f(self, *args, **kwargs)
                                 finally:
                                   if enable_daemon:
                     >               self.assert_success(self.run_pants(['kill-pantsd']))
                     
                     .pants.d/pyprep/sources/6f7d4414b9dea6fdf4a197230582e66e9e215ca7/pants_test/pants_run_integration_test.py:117: 
                     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
                     .pants.d/pyprep/sources/6f7d4414b9dea6fdf4a197230582e66e9e215ca7/pants_test/pants_run_integration_test.py:408: in assert_success
                         self.assert_result(pants_run, self.PANTS_SUCCESS_CODE, expected=True, msg=msg)
                     .pants.d/pyprep/sources/6f7d4414b9dea6fdf4a197230582e66e9e215ca7/pants_test/pants_run_integration_test.py:429: in assert_result
                         assertion(value, pants_run.returncode, error_msg)
                     E   AssertionError: 0 != 1 : /home/travis/build/pantsbuild/pants/pants --no-pantsrc --pants-workdir=/home/travis/build/pantsbuild/pants/.pants.d/tmp/tmplefzzqct.pants.d --kill-nailguns --print-exception-stacktrace=True kill-pantsd
                     E   returncode: 1
                     E   stdout:
                     E   	
                     E   stderr:
                     E   	WARN] Not watching pantsd pidfile because subprocessdir is outside of buildroot. Having subprocessdir be a child of buildroot (as it is by default) may help avoid stray pantsd processes.
                     E   	No handlers could be found for logger "pants.pantsd.pants_daemon"
                     E   	CRITICAL] 
                     E   	CRITICAL] lost active connection to pantsd!
                     E   	timestamp: 2018-10-11T19:06:48.188044
                     E   	Exception caught: (pants.bin.remote_pants_runner.Terminated)
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/pants_loader.py", line 75, in <module>
                     E   	    main()
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/pants_loader.py", line 71, in main
                     E   	    PantsLoader.run()
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/pants_loader.py", line 67, in run
                     E   	    cls.load_and_execute(entrypoint)
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/pants_loader.py", line 60, in load_and_execute
                     E   	    entrypoint_main()
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/pants_exe.py", line 39, in main
                     E   	    PantsRunner(exiter, start_time=start_time).run()
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/pants_runner.py", line 48, in run
                     E   	    return RemotePantsRunner(self._exiter, self._args, self._env, bootstrap_options).run()
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/remote_pants_runner.py", line 189, in run
                     E   	    self._run_pants_with_retry(pantsd_handle)
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/remote_pants_runner.py", line 134, in _run_pants_with_retry
                     E   	    raise_with_traceback(self._extract_remote_exception(pantsd_handle.pid, e))
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/remote_pants_runner.py", line 113, in _run_pants_with_retry
                     E   	    return self._connect_and_execute(pantsd_handle.port)
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/remote_pants_runner.py", line 154, in _connect_and_execute
                     E   	    result = client.execute(self.PANTS_COMMAND, *self._args, **modified_env)
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/java/nailgun_client.py", line 276, in execute
                     E   	    traceback=sys.exc_info()[2]
                     E   	
                     E   	Exception message: abruptly lost active connection to pantsd runner: NailgunError(u"Problem talking to nailgun server (address: 127.0.0.1:46339, remote_pid=<remote PID chunk not yet received!>, remote_pgrp=<remote PGRP chunk not yet received!>): error(104, 'Connection reset by peer')", error(104, 'Connection reset by peer'))
                      generated xml file: /home/travis/build/pantsbuild/pants/.pants.d/test/pytest/contrib.node.tests.python.pants_test.contrib.node.tasks.node_test_integration/junitxml/TEST-contrib.node.tests.python.pants_test.contrib.node.tasks.node_test_integration.xml 
                     ============ slowest 3 test durations ============
                     32.86s call     ../contrib/node/tests/python/pants_test/contrib/node/tasks/test_node_test_integration.py::NodeTestIntegrationTest::test_test_multiple_targets
                     27.53s call     ../contrib/node/tests/python/pants_test/contrib/node/tasks/test_node_test_integration.py::NodeTestIntegrationTest::test_test_target_with_non_default_script_name
                     21.37s call     ../contrib/node/tests/python/pants_test/contrib/node/tasks/test_node_test_integration.py::NodeTestIntegrationTest::test_test_simple
                     ====== 1 failed, 4 passed in 100.86 seconds ======
jsirois added the bug label Oct 11, 2018
baroquebobcat commented:

Does it always happen with this particular test, or does it fail in other tests in the same manner?

jsirois commented Oct 11, 2018

@baroquebobcat 3 times in a row on the same shard, Node each time ... but I can't absolutely claim it was the exact same test; I don't recall that level of detail.

baroquebobcat commented:

That's fair.

jsirois commented Oct 11, 2018

Went green on try 4 fwiw. This was a 3/5 failure rate in this case.

stuhood commented Oct 11, 2018

@cosmicexplorer: Any thoughts on why the new logging isn't triggering here?

cosmicexplorer commented Oct 12, 2018

`remote_pid=<remote PID chunk not yet received!>, remote_pgrp=<remote PGRP chunk not yet received!>` is why. #6579 partially addresses this by at least ensuring both chunks are received before trying to receive anything different (so we would have had a clearer idea of where the error occurred).
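
For illustration, that ordering constraint amounts to something like the sketch below (hypothetical chunk tags and a hypothetical `read_chunk()` helper, not the real NailgunProtocol API):

```python
# Illustrative sketch only: hypothetical chunk tags and a hypothetical
# read_chunk() callable stand in for the real protocol. The point is the
# ordering constraint: insist on the PID and PGRP chunks arriving before any
# other chunk type, so a later connection error can at least report which
# remote process we were talking to.
PID_CHUNK = 'P'    # hypothetical tag for the remote pid chunk
PGRP_CHUNK = 'G'   # hypothetical tag for the remote process-group chunk


def read_process_coordinates(read_chunk):
  """Consume chunks until both PID and PGRP have arrived, rejecting anything else.

  `read_chunk` is assumed to return (chunk_type, payload) tuples.
  """
  remote_pid = None
  remote_pgrp = None
  while remote_pid is None or remote_pgrp is None:
    chunk_type, payload = read_chunk()
    if chunk_type == PID_CHUNK:
      remote_pid = int(payload)
    elif chunk_type == PGRP_CHUNK:
      remote_pgrp = int(payload)
    else:
      raise Exception('expected PID/PGRP chunks first, got {!r}'.format(chunk_type))
  return remote_pid, remote_pgrp
```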

cosmicexplorer commented Oct 12, 2018

Also, I saw this test flake a million times in this same way when working on #6552, except the fatal error logging was working most or all of the time. I don't have logs of that because I didn't realize that Travis reuses the build log link when you restart a build (I thought it was a permalink). I messaged @stuhood on an internal Slack at 7:25pm California time on Tuesday, Oct 2nd about such a log, but that log no longer shows the error, so I might check commits from #6552 around then and see if I removed anything obvious.

cosmicexplorer commented Oct 12, 2018

...it looks like the first commit made after that time was f70e7c0b7d55a389d31b9b3b706d0901bc9f655e (in catch-everything on my fork), which reverts a lot of things to try to cut down the PR. I will see if I made any mistakes in that one which could have caused it to not get the chunks correctly.

cosmicexplorer commented Oct 12, 2018

There also seem to be a lot of things going on here.

  1. We're losing the traceback to the actual method called within execute() that's throwing, which would shed more light on this than any supposition I can make. Is raise_with_traceback() not raising the traceback? Are we clobbering our desired traceback with some intermediate exception which overwrites sys.exc_info()?
  2. Both of these messages seem concerning and like they should be fixed:
E   	WARN] Not watching pantsd pidfile because subprocessdir is outside of buildroot. Having subprocessdir be a child of buildroot (as it is by default) may help avoid stray pantsd processes.
E   	No handlers could be found for logger "pants.pantsd.pants_daemon"

> Any thoughts on why the new logging isn't triggering here?

E   	CRITICAL] 
E   	CRITICAL] lost active connection to pantsd!

The above logger.fatal() calls only occur in RemotePantsRunner#_run_pants_with_retry(), and only if we can't connect to pantsd (therefore obtaining a pid). So we're a little out of luck unless we want to scrape every single pid's log.

  3. But of course, we don't need to scrape any logs, considering this is occurring during a kill-pantsd invocation, where killing pantsd is not a fatal error -- there should be some sort of workaround for the case where we literally know in advance we are going to kill another process (premeditated murder) and therefore may experience error 104 on the RemotePantsRunner side, as in this log. It could be indicative of a bug (I don't know whether it continues to occur with #6579 ([WIP] split NailgunProtocol into PailgunProtocol versions of everything, add PGRP chunk) in place, for example), or it could not be -- either way, I don't think it's ridiculous to try to catch that socket error specifically in the kill-pantsd case, although I haven't thought about how best to do that yet. EDIT: We can get the goals by running the ArgSplitter over the RemotePantsRunner process's sys.argv before we start looping through pailgun chunks, and scan them for kill-pantsd; if that goal is in the parsed command line, we can catch any error(104, 'Connection reset by peer') and just log it instead of blowing up (see the sketch after this comment).

The above alone (item 3) might hide real fatal errors in the window between when the pantsd-runner is invoked and when the pantsd process is terminated (we could of course just wait until we've received the PID and PGRP chunks before enabling the exception catching mentioned above, which would avoid some of the issues with this situation). I'm also not at all sure how to interpret the PID and PGRP chunks not having been sent yet, since DaemonPantsRunner sends those before it starts the LocalPantsRunner -- unless kill-pantsd isn't executed by a pantsd-runner? If core_tasks/ gets evaluated specially, that would explain it, but I feel like that's not true, although I can't prove it yet.
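
Roughly, the EDIT in item 3 could look like the following sketch (the goal scan and error handling here are illustrative assumptions, not the actual RemotePantsRunner code):

```python
# Illustrative sketch only: is_kill_pantsd_invocation() is a cheap stand-in
# for running the ArgSplitter over sys.argv, and the error handling here is
# an assumption, not the actual RemotePantsRunner implementation.
import errno
import socket
import sys


def is_kill_pantsd_invocation(argv):
  # Stand-in for parsing argv with the ArgSplitter and checking whether
  # 'kill-pantsd' is among the parsed goals.
  return 'kill-pantsd' in argv


def execute_remote(execute, argv=None):
  """Run the remote-execution callable, tolerating a reset connection during kill-pantsd."""
  argv = sys.argv if argv is None else argv
  try:
    return execute()
  except socket.error as e:
    # If we asked pantsd to die, a connection reset is the expected outcome,
    # not a fatal error: log it and treat the run as successful.
    if e.errno == errno.ECONNRESET and is_kill_pantsd_invocation(argv):
      sys.stderr.write('pantsd connection closed during kill-pantsd (expected).\n')
      return 0
    raise
```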

jsirois commented Oct 12, 2018

@cosmicexplorer maybe relevant?: #6626

cosmicexplorer commented:

Probably very relevant; tracking that one now too. I have been able to repro this node integration test error locally on OSX before, so if that happens again I can try to investigate.

stuhood self-assigned this Oct 29, 2018
stuhood added a commit to twitter/pants that referenced this issue Oct 29, 2018
stuhood pushed a commit that referenced this issue Oct 29, 2018
### Problem

As described in #6565, I had a fundamental misunderstanding of what was legal during a fork, which resulted in intermittent hangs when a `PantsService` thread (usually `FSEventService`, but potentially the `StoreGCService` threads as well) were blocked on locks while attempting to acquire resources pre-fork.

As mentioned on the ticket, the `fork_lock` used to act as a honeypot to block any threads that might be attempting to interact with the Engine. This mostly had the right effect, but it was possible to forget to wrap some other lock access in the fork lock (because, in effect, _all_ shared lock access needed to be protected by the `fork_lock`). But also, after the fork the `fork_lock` would be poisoned. This represented a challenging line to walk, where we needed to protect all shared locks, but we also needed to avoid that protection post fork in order to avoid deadlock.

### Solution

Rather than using a lock to approximate all of the other locks, instead ensure that all Service threads that might interact with any non-private locks are "paused" in well-known locations ("safe points"?) while we fork. Since we would like to have more (and finer-grained) locks over time, while keeping the number of un-pooled threads relatively constrained, adding this bookkeeping to service threads seems like the right tradeoff.

While we continue to move toward a world (via `--v2` and `@console_rule`) where forking is not necessary, we'd also like to be able to incrementally gain benefit from the daemon by porting portions of `--v1` pipelines pre-fork.
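
As a rough illustration of the pausing mechanics (a hypothetical `PausableService`, not the actual `PantsService` API):

```python
# Illustrative sketch of the "safe point" idea; PausableService is hypothetical
# and not the actual pants PantsService code.
import threading


class PausableService(threading.Thread):
  """A service thread that parks at a well-known safe point whenever asked to pause."""

  def __init__(self):
    super(PausableService, self).__init__()
    self.daemon = True
    self._resumed = threading.Event()
    self._resumed.set()                 # running by default
    self._parked = threading.Event()
    self._stopped = threading.Event()

  def pause(self):
    """Ask the thread to park, and block until it has done so (i.e. holds no shared locks)."""
    self._parked.clear()
    self._resumed.clear()
    self._parked.wait()

  def resume(self):
    self._resumed.set()

  def stop(self):
    self._stopped.set()
    self._resumed.set()                 # wake a parked thread so it can exit

  def run(self):
    while not self._stopped.is_set():
      if not self._resumed.is_set():
        self._parked.set()              # safe point: no shared state is held here
        self._resumed.wait()
        continue
      self.run_one_iteration()

  def run_one_iteration(self):
    """Service-specific work, e.g. polling for filesystem events."""
    pass
```

Pre-fork, the daemon can `pause()` each service (blocking until it has parked), perform the fork, and then `resume()` the services in the parent.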

### Result

Fixes #6565 and fixes #6621.