NodeTestIntegrationTest.test_test_simple flaky in CI #6621

Closed
jsirois opened this issue Oct 11, 2018 · 11 comments · Fixed by #6671
jsirois commented Oct 11, 2018

I've hit this three times in a row on the 'Py3 - Python contrib tests' shard.

Looks like:

                    ____ NodeTestIntegrationTest.test_test_simple ____
                     
                     self = <pants_test.contrib.node.tasks.test_node_test_integration.NodeTestIntegrationTest testMethod=test_test_simple>
                     args = (), kwargs = {}, enable_daemon = 'true'
                     subprocess_dir = '/tmp/tmpcnr7t5r1'
                     env = {'HERMETIC_ENV': 'PANTS_ENABLE_PANTSD,PANTS_ENABLE_V2_ENGINE,PANTS_SUBPROCESSDIR', 'PANTS_ENABLE_PANTSD': 'true', 'PANTS_ENABLE_V2_ENGINE': 'true', 'PANTS_SUBPROCESSDIR': '/tmp/tmpcnr7t5r1'}
                     
                         def wrapper(self, *args, **kwargs):
                           for enable_daemon in ('false', 'true',):
                             with temporary_dir() as subprocess_dir:
                               env = {
                                   'HERMETIC_ENV': 'PANTS_ENABLE_PANTSD,PANTS_ENABLE_V2_ENGINE,PANTS_SUBPROCESSDIR',
                                   'PANTS_ENABLE_PANTSD': enable_daemon,
                                   'PANTS_ENABLE_V2_ENGINE': enable_daemon,
                                   'PANTS_SUBPROCESSDIR': subprocess_dir,
                                 }
                               with environment_as(**env):
                                 try:
                                   f(self, *args, **kwargs)
                                 finally:
                                   if enable_daemon:
                     >               self.assert_success(self.run_pants(['kill-pantsd']))
                     
                     .pants.d/pyprep/sources/6f7d4414b9dea6fdf4a197230582e66e9e215ca7/pants_test/pants_run_integration_test.py:117: 
                     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
                     .pants.d/pyprep/sources/6f7d4414b9dea6fdf4a197230582e66e9e215ca7/pants_test/pants_run_integration_test.py:408: in assert_success
                         self.assert_result(pants_run, self.PANTS_SUCCESS_CODE, expected=True, msg=msg)
                     .pants.d/pyprep/sources/6f7d4414b9dea6fdf4a197230582e66e9e215ca7/pants_test/pants_run_integration_test.py:429: in assert_result
                         assertion(value, pants_run.returncode, error_msg)
                     E   AssertionError: 0 != 1 : /home/travis/build/pantsbuild/pants/pants --no-pantsrc --pants-workdir=/home/travis/build/pantsbuild/pants/.pants.d/tmp/tmplefzzqct.pants.d --kill-nailguns --print-exception-stacktrace=True kill-pantsd
                     E   returncode: 1
                     E   stdout:
                     E   	
                     E   stderr:
                     E   	WARN] Not watching pantsd pidfile because subprocessdir is outside of buildroot. Having subprocessdir be a child of buildroot (as it is by default) may help avoid stray pantsd processes.
                     E   	No handlers could be found for logger "pants.pantsd.pants_daemon"
                     E   	CRITICAL] 
                     E   	CRITICAL] lost active connection to pantsd!
                     E   	timestamp: 2018-10-11T19:06:48.188044
                     E   	Exception caught: (pants.bin.remote_pants_runner.Terminated)
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/pants_loader.py", line 75, in <module>
                     E   	    main()
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/pants_loader.py", line 71, in main
                     E   	    PantsLoader.run()
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/pants_loader.py", line 67, in run
                     E   	    cls.load_and_execute(entrypoint)
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/pants_loader.py", line 60, in load_and_execute
                     E   	    entrypoint_main()
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/pants_exe.py", line 39, in main
                     E   	    PantsRunner(exiter, start_time=start_time).run()
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/pants_runner.py", line 48, in run
                     E   	    return RemotePantsRunner(self._exiter, self._args, self._env, bootstrap_options).run()
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/remote_pants_runner.py", line 189, in run
                     E   	    self._run_pants_with_retry(pantsd_handle)
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/remote_pants_runner.py", line 134, in _run_pants_with_retry
                     E   	    raise_with_traceback(self._extract_remote_exception(pantsd_handle.pid, e))
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/remote_pants_runner.py", line 113, in _run_pants_with_retry
                     E   	    return self._connect_and_execute(pantsd_handle.port)
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/bin/remote_pants_runner.py", line 154, in _connect_and_execute
                     E   	    result = client.execute(self.PANTS_COMMAND, *self._args, **modified_env)
                     E   	  File "/home/travis/build/pantsbuild/pants/src/python/pants/java/nailgun_client.py", line 276, in execute
                     E   	    traceback=sys.exc_info()[2]
                     E   	
                     E   	Exception message: abruptly lost active connection to pantsd runner: NailgunError(u"Problem talking to nailgun server (address: 127.0.0.1:46339, remote_pid=<remote PID chunk not yet received!>, remote_pgrp=<remote PGRP chunk not yet received!>): error(104, 'Connection reset by peer')", error(104, 'Connection reset by peer'))
                      generated xml file: /home/travis/build/pantsbuild/pants/.pants.d/test/pytest/contrib.node.tests.python.pants_test.contrib.node.tasks.node_test_integration/junitxml/TEST-contrib.node.tests.python.pants_test.contrib.node.tasks.node_test_integration.xml 
                     ============ slowest 3 test durations ============
                     32.86s call     ../contrib/node/tests/python/pants_test/contrib/node/tasks/test_node_test_integration.py::NodeTestIntegrationTest::test_test_multiple_targets
                     27.53s call     ../contrib/node/tests/python/pants_test/contrib/node/tasks/test_node_test_integration.py::NodeTestIntegrationTest::test_test_target_with_non_default_script_name
                     21.37s call     ../contrib/node/tests/python/pants_test/contrib/node/tasks/test_node_test_integration.py::NodeTestIntegrationTest::test_test_simple
                     ====== 1 failed, 4 passed in 100.86 seconds ======
jsirois added the bug label Oct 11, 2018
baroquebobcat commented:

Does it always happen with this particular test, or does it fail in other tests in the same manner?

jsirois commented Oct 11, 2018

@baroquebobcat 3 times in a row on the same shard, Node each time ... but I can't absolutely claim it was the exact same test; I don't recall that level of detail.

baroquebobcat commented:

That's fair.

jsirois commented Oct 11, 2018

Went green on try 4 fwiw. This was a 3/5 failure rate in this case.

stuhood commented Oct 11, 2018

@cosmicexplorer: Any thoughts on why the new logging isn't triggering here?

cosmicexplorer commented Oct 12, 2018

`remote_pid=<remote PID chunk not yet received!>, remote_pgrp=<remote PGRP chunk not yet received!>` is why. #6579 partially addresses this by at least ensuring both chunks are received before trying to receive anything different (so we would have had a clearer idea of where the error occurred).
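
For illustration, that ordering constraint amounts to something like the sketch below (hypothetical chunk tags and a hypothetical `read_chunk()` helper, not the real NailgunProtocol API):

```python
# Illustrative sketch only: hypothetical chunk tags and a hypothetical
# read_chunk() callable stand in for the real protocol. The point is the
# ordering constraint: insist on the PID and PGRP chunks arriving before any
# other chunk type, so a later connection error can at least report which
# remote process we were talking to.
PID_CHUNK = 'P'    # hypothetical tag for the remote pid chunk
PGRP_CHUNK = 'G'   # hypothetical tag for the remote process-group chunk


def read_process_coordinates(read_chunk):
  """Consume chunks until both PID and PGRP have arrived, rejecting anything else.

  `read_chunk` is assumed to return (chunk_type, payload) tuples.
  """
  remote_pid = None
  remote_pgrp = None
  while remote_pid is None or remote_pgrp is None:
    chunk_type, payload = read_chunk()
    if chunk_type == PID_CHUNK:
      remote_pid = int(payload)
    elif chunk_type == PGRP_CHUNK:
      remote_pgrp = int(payload)
    else:
      raise Exception('expected PID/PGRP chunks first, got {!r}'.format(chunk_type))
  return remote_pid, remote_pgrp
```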

cosmicexplorer commented Oct 12, 2018

Also, I saw this test flake a million times in this same way when working on #6552, except the fatal error logging was working most or all of the time. I don't have logs of that because I didn't realize that Travis reuses the build log link when you restart a build (I thought it was a permalink). I messaged @stuhood on an internal Slack at 7:25pm California time on Tuesday, Oct 2nd about such a log, but that log no longer shows the error, so I might check commits from #6552 around then and see if I removed anything obvious.

cosmicexplorer commented Oct 12, 2018

...it looks like the first commit made after that time was f70e7c0b7d55a389d31b9b3b706d0901bc9f655e (in catch-everything on my fork), which reverts a lot of things to try to cut down the PR. I will see if I made any mistakes in that one which could have caused it to not get the chunks correctly.

cosmicexplorer commented Oct 12, 2018

There also seem to be a lot of things going on here.

  1. We're losing the traceback to the actual method called within execute() that's throwing, which would shed more light on this than any supposition I can make. Is raise_with_traceback() not raising the traceback? Are we clobbering our desired traceback with some intermediate exception which overwrites sys.exc_info()?
  2. Both of these messages seem concerning and like they should be fixed:
E   	WARN] Not watching pantsd pidfile because subprocessdir is outside of buildroot. Having subprocessdir be a child of buildroot (as it is by default) may help avoid stray pantsd processes.
E   	No handlers could be found for logger "pants.pantsd.pants_daemon"

> Any thoughts on why the new logging isn't triggering here?

E   	CRITICAL] 
E   	CRITICAL] lost active connection to pantsd!

The above logger.fatal() calls only occur in RemotePantsRunner#_run_pants_with_retry(), and only if we can't connect to pantsd (therefore obtaining a pid). So we're a little out of luck unless we want to scrape every single pid's log.

  3. But of course, we don't need to scrape any logs, considering this is occurring during a kill-pantsd invocation, where killing pantsd is not a fatal error -- there should be some sort of workaround for the case where we literally know in advance we are going to kill another process (premeditated murder) and therefore may experience error 104 on the RemotePantsRunner side, as in this log. It could be indicative of a bug (I don't know whether it continues to occur with #6579 ([WIP] split NailgunProtocol into PailgunProtocol versions of everything, add PGRP chunk) in place, for example), or it could not be -- either way, I don't think it's ridiculous to try to catch that socket error specifically in the kill-pantsd case, although I haven't thought about how best to do that yet. EDIT: We can get the goals by running the ArgSplitter over the RemotePantsRunner process's sys.argv before we start looping through pailgun chunks, and scan them for kill-pantsd; if that goal is in the parsed command line, we can catch any error(104, 'Connection reset by peer') and just log it instead of blowing up (see the sketch after this comment).

The above alone (item 3) might hide real fatal errors in the window between when the pantsd-runner is invoked and when the pantsd process is terminated (we could of course just wait until we've received the PID and PGRP chunks before enabling the exception catching mentioned above, which would avoid some of the issues with this situation). I'm also not at all sure how to interpret the PID and PGRP chunks not having been sent yet, since DaemonPantsRunner sends those before it starts the LocalPantsRunner -- unless kill-pantsd isn't executed by a pantsd-runner? If core_tasks/ gets evaluated specially, that would explain it, but I feel like that's not true, although I can't prove it yet.
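
Roughly, the EDIT in item 3 could look like the following sketch (the goal scan and error handling here are illustrative assumptions, not the actual RemotePantsRunner code):

```python
# Illustrative sketch only: is_kill_pantsd_invocation() is a cheap stand-in
# for running the ArgSplitter over sys.argv, and the error handling here is
# an assumption, not the actual RemotePantsRunner implementation.
import errno
import socket
import sys


def is_kill_pantsd_invocation(argv):
  # Stand-in for parsing argv with the ArgSplitter and checking whether
  # 'kill-pantsd' is among the parsed goals.
  return 'kill-pantsd' in argv


def execute_remote(execute, argv=None):
  """Run the remote-execution callable, tolerating a reset connection during kill-pantsd."""
  argv = sys.argv if argv is None else argv
  try:
    return execute()
  except socket.error as e:
    # If we asked pantsd to die, a connection reset is the expected outcome,
    # not a fatal error: log it and treat the run as successful.
    if e.errno == errno.ECONNRESET and is_kill_pantsd_invocation(argv):
      sys.stderr.write('pantsd connection closed during kill-pantsd (expected).\n')
      return 0
    raise
```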

jsirois commented Oct 12, 2018

@cosmicexplorer maybe relevant?: #6626

cosmicexplorer commented:

Probably very relevant; tracking that one now too. I have been able to repro this node integration test error locally on OSX before, so if that happens again I can try to investigate.

stuhood self-assigned this Oct 29, 2018
stuhood added a commit to twitter/pants that referenced this issue Oct 29, 2018
stuhood pushed a commit that referenced this issue Oct 29, 2018
### Problem

As described in #6565, I had a fundamental misunderstanding of what was legal during a fork, which resulted in intermittent hangs when a `PantsService` thread (usually `FSEventService`, but potentially the `StoreGCService` threads as well) were blocked on locks while attempting to acquire resources pre-fork.

As mentioned on the ticket, the `fork_lock` used to act as a honeypot to block any threads that might be attempting to interact with the Engine. This mostly had the right effect, but it was possible to forget to wrap some other lock access in the fork lock (because, in effect, _all_ shared lock access needed to be protected by the `fork_lock`). But also, after the fork the `fork_lock` would be poisoned. This represented a challenging line to walk, where we needed to protect all shared locks, but we also needed to avoid that protection post fork in order to avoid deadlock.

### Solution

Rather than using a lock to approximate all of the other locks, instead ensure that all Service threads that might interact with any non-private locks are "paused" in well-known locations ("safe points"?) while we fork. Since we would like to have more (and finer-grained) locks over time, while keeping the number of un-pooled threads relatively constrained, adding this bookkeeping to service threads seems like the right tradeoff.

While we continue to move toward a world (via `--v2` and `@console_rule`) where forking is not necessary, we'd also like to be able to incrementally gain benefit from the daemon by porting portions of `--v1` pipelines pre-fork.
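
As a rough illustration of the pausing mechanics (a hypothetical `PausableService`, not the actual `PantsService` API):

```python
# Illustrative sketch of the "safe point" idea; PausableService is hypothetical
# and not the actual pants PantsService code.
import threading


class PausableService(threading.Thread):
  """A service thread that parks at a well-known safe point whenever asked to pause."""

  def __init__(self):
    super(PausableService, self).__init__()
    self.daemon = True
    self._resumed = threading.Event()
    self._resumed.set()                 # running by default
    self._parked = threading.Event()
    self._stopped = threading.Event()

  def pause(self):
    """Ask the thread to park, and block until it has done so (i.e. holds no shared locks)."""
    self._parked.clear()
    self._resumed.clear()
    self._parked.wait()

  def resume(self):
    self._resumed.set()

  def stop(self):
    self._stopped.set()
    self._resumed.set()                 # wake a parked thread so it can exit

  def run(self):
    while not self._stopped.is_set():
      if not self._resumed.is_set():
        self._parked.set()              # safe point: no shared state is held here
        self._resumed.wait()
        continue
      self.run_one_iteration()

  def run_one_iteration(self):
    """Service-specific work, e.g. polling for filesystem events."""
    pass
```

Pre-fork, the daemon can `pause()` each service (blocking until it has parked), perform the fork, and then `resume()` the services in the parent.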

### Result

Fixes #6565 and fixes #6621.