Description
We had a specific Ansible task fail due to a stream disconnect.
TASK [*** : Pull image from registry]
:
{"changed": false, "msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected", "unreachable": true}
We hit this on our test environment of 19 hosts, during a task where all hosts pull a roughly 1GB Docker image from a registry running on the 1st host. This nearly saturates the 1st host's outgoing 1Gbps NIC for about 70 seconds. On each attempt a few of the 19 hosts failed, and the set of failing hosts changed between attempts.
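For reference, the failing task boils down to something like the following (reconstructed from the run_module kwargs in the debug log below; the role name is redacted and the real playbook has more around it):

    - name: Pull image from registry
      # The debug log shows the command module invoking docker pull against the registry on host1
      command: docker pull host1:5000/exabeam-dl-kafka:LMS-i32.6_4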
Debug messages around the error are below (IPs were replaced):
[mux 8488] 19:40:32.178381 D mitogen.ctx.ssh.host18:22.sudo.root: mitogen: Importer().find_module('ansible.module_utils.common.')
[mux 8488] 19:40:32.178750 D mitogen.ctx.ssh.host18:22.sudo.root: mitogen: Importer(): master doesn't know u'ansible.module_utils.common.'
[mux 8488] 19:40:32.179137 D mitogen.ctx.ssh.host18:22.sudo.root: mitogen: Importer().find_module('ansible.module_utils.parsing.')
[mux 8488] 19:40:32.179546 D mitogen.ctx.ssh.host18:22.sudo.root: mitogen: Importer(): master doesn't know u'ansible.module_utils.parsing.'
[mux 8488] 19:40:32.180691 D mitogen.ctx.ssh.host18:22.sudo.root: mitogen: Importer().find_module('ansible.module_utils.six.')
[mux 8488] 19:40:32.181083 D mitogen.ctx.ssh.host18:22.sudo.root: mitogen: Importer(): master doesn't know u'ansible.module_utils.six.'
[mux 8488] 19:40:36.941557 D mitogen: mitogen.ssh.Stream('ssh.host4:22').on_disconnect()
[mux 8488] 19:40:36.942205 D mitogen: RouteMonitor(): mitogen.ssh.Stream('ssh.host4:22') is gone; propagating DEL_ROUTE for set([14040, 9, 32039, 14039])
[mux 8488] 19:40:36.942600 D mitogen: Router(Broker(0x7fdb3bbe6210)).del_route(14040)
[mux 8488] 19:40:36.943055 I ansible_mitogen.services: ContextService(): Forgetting Context(14040, u'ssh.host4:22.sudo.root') due to stream disconnect
[task 9534] 19:40:36.943212 D mitogen: RouteMonitor(): firing local disconnect for Context(14040, u'ssh.host4:22.sudo.root')
[mux 8488] 19:40:36.943448 D mitogen: Router(Broker(0x7fdb3bbe6210)).del_route(9)
[mux 8488] 19:40:36.943781 I ansible_mitogen.services: ContextService(): Forgetting Context(9, u'ssh.host4:22') due to stream disconnect
[task 9534] 19:40:36.943906 D mitogen: RouteMonitor(): deleting route to 14040 via mitogen.core.Stream('unix_listener.8488')
[mux 8488] 19:40:36.944137 D mitogen: Router(Broker(0x7fdb3bbe6210)).del_route(32039)
[task 9534] 19:40:36.944366 D mitogen: Router(Broker(0x7fdb3a4293d0)).del_route(14040)
[mux 8488] 19:40:36.944434 D mitogen: Router(Broker(0x7fdb3bbe6210)).del_route(14039)
[task 9534] 19:40:36.944889 D mitogen: CallChain(Context(14040, u'ssh.host4:22.sudo.root')).call_no_reply(): mitogen.core.Dispatcher.forget_chain('psr-ex3000-168-24-9534-7fdb4c9d3740-58a0bf10be265')
[mux 8488] 19:40:36.944966 D mitogen: mitogen.parent.DiagLogStream(fd=128, 'ssh.host4:22').on_disconnect()
[mux 8488] 19:40:36.945471 D mitogen: mitogen.ssh.Stream('ssh.host4:22'): PID 8610 exited with return code 255
[task 9534] 19:40:36.945478 D mitogen: Context(0, None).call_service_async(u'ansible_mitogen.services.ContextService', u'reset', {'context': Context(14040, u'ssh.host4:22.sudo.root')})
[task 9534] 19:40:36.945922 D mitogen: Context(0, None).send_async(Message(0, 40102, 40102, 110, 1004, "\x80\x02X'\x00\x00\x00ansible_mitogen.services.ContextServiceq\x01X\x05"..186))
[mux 8488] 19:40:36.946478 D mitogen: Router(Broker(0x7fdb3bbe6210)): Message(14040, 40102, 0, 101, 0, '\x80\x02(NX\x0c\x00\x00\x00mitogen.coreX\n\x00\x00\x00Dispatcherq\x01X\x0c\x00\x00\x00forget_'..143) is dead: 'no route to 14040, my ID is 0'
[mux 8488] 19:40:36.947071 D ansible_mitogen.services: ContextService().reset(Context(14040, u'ssh.host4:22.sudo.root'))
[mux 8488] 19:40:36.947425 I ansible_mitogen.services: ContextService()._shutdown_unlocked(): shutting down Context(14040, u'ssh.host4:22.sudo.root')
[mux 8488] 19:40:36.947690 D mitogen: Context(14040, u'ssh.host4:22.sudo.root').shutdown() sending SHUTDOWN
[mux 8488] 19:40:36.948014 D ansible_mitogen.services: ContextService(): attempt to forget unknown Context(14040, u'ssh.host4:22.sudo.root')
[mux 8488] 19:40:36.948399 D mitogen: Router(Broker(0x7fdb3bbe6210)): Message(14040, 0, 0, 106, None, ''..0) is dead: 'no route to 14040, my ID is 0'
[task 9534] 19:40:36.949069 D mitogen: Broker(0x7fdb3a4293d0).shutdown()
[task 9534] 19:40:36.949775 D mitogen: mitogen.core.Stream('unix_listener.8488').on_disconnect()
[mux 8488] 19:40:36.950449 D mitogen: mitogen.core.Stream('unix_client.9534').on_disconnect()
[task 9534] 19:40:36.950419 D mitogen: Context(0, None).on_disconnect()
[mux 8488] 19:40:36.950894 D mitogen: Context(40102, None).on_disconnect()
[task 9534] 19:40:36.950986 D mitogen: Waker(Broker(0x7fdb3a4293d0) rfd=30, wfd=31).on_shutdown()
[task 9534] 19:40:36.951540 D mitogen: Waker(Broker(0x7fdb3a4293d0) rfd=30, wfd=31).on_disconnect()
[task 9534] 19:40:36.952548 D mitogen: Router(Broker(0x7fdb3a4293d0)): stats: 0 module requests in 0 ms, 0 sent (0 ms minify time), 0 negative responses. Sent 0.0 kb total, 0.0 kb avg.
[task 9534] 19:40:36.954727 D ansible_mitogen.connection: Call took 60033 ms: ansible_mitogen.target.run_module(kwargs={u'module_map': {u'builtin': [u'ansible.module_utils._text', u'ansible.module_utils.basic', u'ansible.module_utils.common', u'ansible.module_utils.common._collections_compat', u'ansible.module_utils.common.collections', u'ansible.module_utils.common.file', u'ansible.module_utils.common.process', u'ansible.module_utils.parsing', u'ansible.module_utils.parsing.convert_bool', u'ansible.module_utils.pycompat24', u'ansible.module_utils.six'], u'custom': []}, u'emulate_tty': True, u'good_temp_dir': u'/root/.ansible/tmp', u'module': u'command', u'cwd': None, u'service_context': Context(0, None), u'extra_env': None, u'is_python': None, u'env': {}, u'path': u'/opt/exabeam_installer/.env/lib/python2.7/site-packages/ansible/modules/commands/command.py', u'runner_name': u'NewStyleRunner', u'interpreter_fragment': None, u'json_args': '{"_ansible_version": "2.7.10", "warn": true, "_ansible_selinux_special_fs": ["fuse", "nfs", "vboxsf", "ramfs", "9p"], "_ansible_no_log": false, "_ansible_module_name": "command", "_raw_params": "docker pull host1:5000/exabeam-dl-kafka:LMS-i32.6_4", "_ansible_verbosity": 3, "_ansible_keep_remote_files": false, "_ansible_syslog_facility": "LOG_USER", "_ansible_socket": null, "_ansible_diff": false, "_ansible_remote_tmp": "/root/.ansible/tmp", "_ansible_shell_executable": "/bin/sh", "_ansible_check_mode": false, "_ansible_tmpdir": null, "_ansible_debug": false}'})
fatal: [host4]: UNREACHABLE! => {
"changed": false,
"msg": "Mitogen was disconnected from the remote environment while a call was in-progress. If you feel this is in error, please file a bug. Original error was: the respondent Context has disconnected",
"unreachable": true
}
[mux 8488] 19:40:37.332524 D mitogen: mitogen.ssh.Stream('ssh.host9:22').on_disconnect()
[mux 8488] 19:40:37.333122 D mitogen: RouteMonitor(): mitogen.ssh.Stream('ssh.host9:22') is gone; propagating DEL_ROUTE for set([11040, 18, 23039, 11039])
[mux 8488] 19:40:37.333499 D mitogen: Router(Broker(0x7fdb3bbe6210)).del_route(11040)
[mux 8488] 19:40:37.333945 I ansible_mitogen.services: ContextService(): Forgetting Context(11040, u'ssh.host9:22.sudo.root') due to stream disconnect
[task 9552] 19:40:37.334135 D mitogen: RouteMonitor(): firing local disconnect for Context(11040, u'ssh.host9:22.sudo.root')
[mux 8488] 19:40:37.334357 D mitogen: Router(Broker(0x7fdb3bbe6210)).del_route(18)
[mux 8488] 19:40:37.334701 I ansible_mitogen.services: ContextService(): Forgetting Context(18, u'ssh.host9:22') due to stream disconnect
[task 9552] 19:40:37.334695 D mitogen: RouteMonitor(): deleting route to 11040 via mitogen.core.Stream('unix_listener.8488')
[task 9552] 19:40:37.335039 D mitogen: Router(Broker(0x7fdb3a2c0410)).del_route(11040)
[mux 8488] 19:40:37.335079 D mitogen: Router(Broker(0x7fdb3bbe6210)).del_route(23039)
[mux 8488] 19:40:37.335408 D mitogen: Router(Broker(0x7fdb3bbe6210)).del_route(11039)
[task 9552] 19:40:37.335502 D mitogen: CallChain(Context(11040, u'ssh.host9:22.sudo.root')).call_no_reply(): mitogen.core.Dispatcher.forget_chain('psr-ex3000-168-24-9552-7fdb4c9d3740-58a0bf111c129')
[mux 8488] 19:40:37.335901 D mitogen: mitogen.parent.DiagLogStream(fd=143, 'ssh.host9:22').on_disconnect()
[task 9552] 19:40:37.336038 D mitogen: Context(0, None).call_service_async(u'ansible_mitogen.services.ContextService', u'reset', {'context': Context(11040, u'ssh.host9:22.sudo.root')})
[mux 8488] 19:40:37.336375 D mitogen: mitogen.ssh.Stream('ssh.host9:22'): PID 8612 exited with return code 255
[task 9552] 19:40:37.336455 D mitogen: Context(0, None).send_async(Message(0, 40107, 40107, 110, 1004, "\x80\x02X'\x00\x00\x00ansible_mitogen.services.ContextServiceq\x01X\x05"..186))
One suspicious thing is that this error always arrived together with the message ansible_mitogen.connection: Call took 60XXX ms, i.e. the reported duration was always right around 60 seconds. It looks like some timeout is firing; we wonder if there is a parameter we are not aware of that triggers this.
We tried adding ServerAliveInterval, ServerAliveCountMax and also TCPKeepAlive=yes to ssh_args. They were recognized, as mitogen_get_stack showed one of those combinations (output below). While trying these ssh_args the task did pass for a couple of attempts, but no variation of the arguments made it pass several times in a row.
$ ANSIBLE_STRATEGY=mitogen_linear ansible host1 -i ../inventory -m mitogen_get_stack -b
host1 | CHANGED => {
"changed": true,
"result": [
{
"kwargs": {
"check_host_keys": "ignore",
"compression": true,
"connect_timeout": 90,
"hostname": "<IP>",
"identities_only": false,
"identity_file": "<key>",
"password": null,
"port": 22,
"python_path": [
"/usr/bin/python"
],
"remote_name": null,
"ssh_args": [
"-o",
"ControlMaster=auto",
"-o",
"ControlPersist=18000s",
"-o",
"ForwardAgent=yes",
"-o",
"ServerAliveInterval=15",
"-o",
"ServerAliveCountMax=12"
],
"ssh_debug_level": null,
"ssh_path": "ssh",
"username": "exabeam"
},
"method": "ssh"
},
{
"enable_lru": true,
"kwargs": {
"connect_timeout": 90,
"password": null,
"python_path": [
"/usr/bin/python"
],
"remote_name": null,
"sudo_args": [
"-H",
"-S",
"-n"
],
"sudo_path": null,
"username": "root"
},
"method": "sudo"
}
]
}
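For completeness, this is roughly how we pass those options — a minimal sketch of the [ssh_connection] section of ansible.cfg, assuming the options are set there rather than via group vars (our real file has more settings):

    [ssh_connection]
    # Extra options appended to the ssh command line; Mitogen picks them up as shown in the stack output above
    ssh_args = -o ControlMaster=auto -o ControlPersist=18000s -o ForwardAgent=yes -o ServerAliveInterval=15 -o ServerAliveCountMax=12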
Some more background info:
- Which version of Ansible are you running? We tested the following combinations, and all had the issue:
  - Ansible 2.7.10 + Mitogen 0.2.6
  - Ansible 2.7.10 + Mitogen 0.2.7
  - Ansible 2.7.11 + Mitogen 0.2.7
  - Ansible 2.7.11 + Mitogen master
- Is your version of Ansible patched in any way? - No
- Are you running with any custom modules, or module_utils loaded? - No
- Cluster set-up:
  - 19 on-prem hosts with 1Gbps NICs
  - CentOS 7 + Python 2.7.5
  - Note: We suspected bandwidth, since host1's interface was maxed out for upload during the task (all the other 18 hosts pull the Docker image from host1). However, the issue did not reproduce on AWS or GCP (20/30 VMs), even when limiting bandwidth to 1Gbps or 100Mbps using wondershaper (a sketch of that is after this list).
- Have you tried the latest master version from Git? - Yes
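The bandwidth limiting in the cloud tests was done roughly like this (a sketch using the classic wondershaper syntax; interface name and limits are illustrative, and newer wondershaper versions use -a/-d/-u flags instead):

    # Cap eth0 at ~1Gbit/s in both directions (values are in kbit/s)
    wondershaper eth0 1024000 1024000
    # Remove the cap again after the test
    wondershaper clear eth0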
We really liked the speed-up with Mitogen in our cloud tests and hope to get this solved so we can use it on-prem as well. Any suggestions for configs or things to look at would be appreciated.