Mitogen intermittent hangs on "Connection timed out" target #598
First, thanks for the awesome project! I saw an 11x performance improvement after switching to Mitogen. However, I am facing the problem below and wondering if you can take a look.
Submit Ansible playbook jobs in a for loop 20 times (96 hosts)
and it will intermittently hang on (Observed output)
If it doesn't hang it looks like below (Expected output)
if I take out
Below is the content of ping.yml playbook
more log with "MITOGEN_DUMP_THREAD_STACKS=10"
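For anyone trying to reproduce this, here is a minimal sketch of how the repeated submissions could be scripted so that a hung run is recorded instead of blocking the loop forever. The helper name `run_repeatedly`, the 600-second timeout, and the inventory file name in the usage note are assumptions for illustration; only `ping.yml` and `MITOGEN_DUMP_THREAD_STACKS=10` come from the report above.

```python
import os
import subprocess
import sys


def run_repeatedly(cmd, runs=20, timeout=600, extra_env=None):
    """Run `cmd` `runs` times and return the 1-based indices of runs that timed out.

    `run_repeatedly` is a hypothetical helper, not part of Mitogen or Ansible.
    """
    # Copy the current environment and layer any extra variables on top,
    # e.g. {"MITOGEN_DUMP_THREAD_STACKS": "10"} to get periodic stack dumps.
    env = dict(os.environ)
    env.update(extra_env or {})

    hung = []
    for i in range(1, runs + 1):
        try:
            # subprocess.run() kills the direct child if the timeout expires,
            # then raises TimeoutExpired. Grandchildren may need separate cleanup.
            subprocess.run(cmd, env=env, timeout=timeout, check=False)
        except subprocess.TimeoutExpired:
            print("run %d did not finish within %ss" % (i, timeout), file=sys.stderr)
            hung.append(i)
    return hung
```

It could be called as `run_repeatedly(["ansible-playbook", "-i", "hosts.ini", "ping.yml"], extra_env={"MITOGEN_DUMP_THREAD_STACKS": "10"})`, where `hosts.ini` is a placeholder for the real inventory. The timeout should be well above the duration of a normal run so that slow runs are not counted as hangs.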
Sorry for the late acknowledgement, I've been busy elsewhere. :) Thanks for an amazing bug report! It sounds like you might be hitting a deadlock early during startup.
I will aim to set up a reproduction 'real soon'. The current master branch needs a soak test before release, will try running it against 100 nodes to see if we can flush this one out.
There are some forking-related deadlocks on current master that might explain this. I have not changed anything relating to forking in recent releases, and the last soak I did was fine.
How is the quality of the networking? I don't think it is a network issue, but it's worth asking just in case.
I reproduced your issue using 96 Google Cloud nodes (https://github.com/dw/mitogen/blob/master/tests/ansible/gcloud/mitogen-load-testing.tf); 0.2.7 fails very quickly.
I have found and fixed 2 deadlocks: one during startup in the target that looks like it could have impacted a lot of people (769a8b2), and one in the master (f78a5f0). After 120 runs (11,520 connections) I can no longer reproduce your issue.
Thanks for reporting this, and apologies for the delay in responding.
This is now on the master branch and will make it into the next release. To be notified when a new release is made, subscribe to https://networkgenomics.com/mail/mitogen-announce/
* origin/dmw: issue #613: must await 'exit' and 'disconnect' in wait=False test Import LGTM config to disable some stuff Fix up another handful of LGTM errors. tests: work around AnsibleModule.run_command() race. docs: mention another __main__ safeguard docs: tweaks formatting error docs: make Sphinx install soft fail on Python 2. issue #598: allow disabling preempt in terraform issue #598: update Changelog.
* origin/v028: (383 commits) Bump version for release. docs: update Changelog for 0.2.8. issue #627: add test and tweak Reaper behaviour. docs: lots more changelog concision docs: changelog concision docs: more changelog tweaks docs: reorder chapters docs: versionless <title> docs: update supported Ansible version, mention unsupported features docs: changelog fixes/tweaks issue #590: update Changelog. issue #621: send ADD_ROUTE earlier and add test for early logging. issue #590: whoops, import missing test modules issue #590: rework ParentEnumerationMethod to recursively handle bad modules issue #627: reduce the default pool size in a child to 2. tests: add a few extra service tests. docs: some more hyperlink joy docs: more hyperlinks docs: add domainrefs plugin to make link aliases everywhere \o/ docs: link IS_DEAD in changelog docs: tweaks to better explain changelog race issue #533: update routing to account for DEL_ROUTE propagation race tests: use defer_sync() Rather than defer() + ancient sync_with_broker() tests: one case from doas_test was invoking su tests: hide memory-mapped files from lsof output issue #615: remove meaningless test issue #625: ignore SIGINT within MuxProcess issue #625: use exec() instead of subprocess in mitogen_ansible_playbook issue #615: regression test issue #615: update Changelog. issue #615: ensure 4GB max_message_size is configured for task workers. issue #615: update Changelog. issue #615: route a dead message to recipients when no reply is expected issue #615: fetch_file() might be called with AnsibleUnicode. issue #615: redirect 'fetch' action to 'mitogen_fetch'. issue #615: extricate slurp brainwrong from mitogen_fetch issue #615: ansible: import Ansible fetch.py action plug-in issue #533: include object identity of Stream in repr() docs: lots more changelog issue #595: add buildah to docs and changelog. 
docs: a few more internals.rst additions ci: update to Ansible 2.8.3 tests: another random string changed in 2.8.3 tests: fix sudo_flags_failure for Ansible 2.8.3 ci: fix procps command line format warning Whoops, merge together lgtm.yml and .lgtm.yml issue #440: log Python version during bootstrap. docs: update changelog issue #558: disable test on OSX to cope with boundless mediocrity issue #558, #582: preserve remote tmpdir if caller did not supply one issue #613: must await 'exit' and 'disconnect' in wait=False test Import LGTM config to disable some stuff Fix up another handful of LGTM errors. tests: work around AnsibleModule.run_command() race. docs: mention another __main__ safeguard docs: tweaks formatting error docs: make Sphinx install soft fail on Python 2. issue #598: allow disabling preempt in terraform issue #598: update Changelog. issue #605: update Changelog. issue #605: ansible: share a sem_t instead of a pthread_mutex_t issue #613: add tests for all the weird shutdown methods Add mitogen.core.now() and use it everywhere; closes #614. docs: move decorator docs into core.py and use autodecorator preamble_size: make it work on Python 3. docs: upgrade Sphinx to 2.1.2, require Python 3 to build docs. docs: fix Sphinx warnings, add LogHandler, more docstrings docs: tidy up some Changelog text issue #615: fix up FileService tests for new logic issue #615: another Py3x fix. issue #615: Py3x fix. issue #615: update Changelog. issue #615: use FileService for target->controll file transfers issue #482: another Py3 fix ci: try removing exclude: to make Azure jobs work again compat: fix Py2.4 SyntaxError issue #482: remove 'ssh' from checked processes ci: Py3 fix issue #279: add one more test for max_message_size issue #482: ci: add stray process checks to all jobs tests: fix format string error core: MitogenProtocol.is_privileged was not set in children issue #482: tests: fail DockerMixin tests if stray processes exist docs: update Changelog. 
issue #586: update Changelog. docs: update Changelog. [security] core: undirectional routing wasn't respected in some cases docs: tidy up Select.all() issue #612: update Changelog. master: fix TypeError pkgutil: fix Python3 compatibility parent: use protocol for getting remote_id docs: merge signals.rst into internals.rst os_fork: do not attempt to cork the active thread. parent: fix get_log_level() for split out loggers. issue #547: fix service_test failures. issue #547: update Changelog. issue #547: core/service: race/deadlock-free service pool init docs: update Changelog. ...
First and foremost: thanks for the great work!
Perhaps this might be of interest to you:
This is a very minimal playbook which shows the problem for roughly 5 or more random hosts.
- hosts: all
  strategy: mitogen_linear
  gather_facts: no
  tasks:
    - setup:
If I reset
Any advice would be appreciated.
EDIT: From time to time this issue shows up even when I limit to a single machine.