
Ansible Copy module stops (no subsequent task) #351

Closed
berenddeschouwer opened this issue Sep 6, 2018 · 13 comments

Comments


@berenddeschouwer berenddeschouwer commented Sep 6, 2018

  • Have you tried the latest master version from Git? yes, and 0.2.2
  • Fedora 28, x86_64
  • python 2.7.
  • I'll attach -vvv and MITOGEN_DUMP_THREAD_STACKS=1

Ansible "hangs" -- it waits forever to move to the next task. This always seems to happen on a copy task if/when an SSH process closes unexpectedly. The remaining hosts finish the copy task, but no subsequent task starts, and the playbook never completes.

There's a traceback in mitogen:

ERROR! [pid 5883] 13:15:07.091804 E mitogen: mitogen.service.Pool(0x7f4906caf850, size=16,th='mitogen.service.Pool.7f4906caf850.worker-1'): while invoking u'get' of u'ansible_mitogen.services.ContextService'
Traceback (most recent call last):
  File "/home/berend/Source/mitogen/mitogen/service.py", line 519, in _on_service_call
     return invoker.invoke(method_name, kwargs, msg)
  File "/home/berend/Source/mitogen/mitogen/service.py", line 253, in invoke
    response = self._invoke(method_name, kwargs, msg)
  File "/home/berend/Source/mitogen/mitogen/service.py", line 239, in _invoke
    ret = method(**kwargs)
  File "/home/berend/Source/mitogen/ansible_mitogen/services.py", line 408, in get
    reraise(*result)
  File "/home/berend/Source/mitogen/ansible_mitogen/services.py", line 372, in _wait_or_start
    response = self._connect(key, spec, via=via)
  File "/home/berend/Source/mitogen/ansible_mitogen/services.py", line 336, in _connect
    candidate_temp_dirs=self._get_candidate_temp_dirs(),
  File "/home/berend/Source/mitogen/mitogen/parent.py", line 1121, in call
    return receiver.get().unpickle(throw_dead=False)
  File "/home/berend/Source/mitogen/mitogen/core.py", line 681, in get
    raise ChannelError(ChannelError.local_msg)
ChannelError: Channel closed by local end.

and then later Ansible notices it:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ansible/executor/task_executor.py", line 139, in run
    res = self._execute()
  File "/usr/lib/python2.7/site-packages/ansible/executor/task_executor.py", line 577, in _execute
    result = self._handler.run(task_vars=variables)
  File "/home/berend/Source/mitogen/ansible_mitogen/mixins.py", line 116, in run
    return super(ActionModuleMixin, self).run(tmp, task_vars)
  File "/usr/lib/python2.7/site-packages/ansible/plugins/action/copy.py", line 387, in run
    result = super(ActionModule, self).run(tmp, task_vars)
  File "/usr/lib/python2.7/site-packages/ansible/plugins/action/__init__.py", line 97, in run
    self._make_tmp_path()
  File "/home/berend/Source/mitogen/ansible_mitogen/mixins.py", line 190, in _make_tmp_path
    self._connection._shell.tmpdir = self._connection.get_temp_dir()
  File "/home/berend/Source/mitogen/ansible_mitogen/connection.py", line 654, in get_temp_dir
    self._connect()
  File "/home/berend/Source/mitogen/ansible_mitogen/connection.py", line 673, in _connect
    self._connect_stack(stack)
  File "/home/berend/Source/mitogen/ansible_mitogen/connection.py", line 635, in _connect_stack
    stack=mitogen.utils.cast(list(stack)),
  File "/home/berend/Source/mitogen/mitogen/core.py", line 1238, in call_service
    return recv.get().unpickle()
  File "/home/berend/Source/mitogen/mitogen/core.py", line 595, in unpickle
    raise obj
CallError: mitogen.core.ChannelError: Channel closed by local end.
  File "/home/berend/Source/mitogen/mitogen/service.py", line 519, in _on_service_call
    return invoker.invoke(method_name, kwargs, msg)
  File "/home/berend/Source/mitogen/mitogen/service.py", line 253, in invoke
    response = self._invoke(method_name, kwargs, msg)
  File "/home/berend/Source/mitogen/mitogen/service.py", line 239, in _invoke
    ret = method(**kwargs)
  File "/home/berend/Source/mitogen/ansible_mitogen/services.py", line 408, in get
    reraise(*result)
  File "/home/berend/Source/mitogen/ansible_mitogen/services.py", line 372, in _wait_or_start
    response = self._connect(key, spec, via=via)
  File "/home/berend/Source/mitogen/ansible_mitogen/services.py", line 336, in _connect
    candidate_temp_dirs=self._get_candidate_temp_dirs(),
  File "/home/berend/Source/mitogen/mitogen/parent.py", line 1121, in call
    return receiver.get().unpickle(throw_dead=False)
  File "/home/berend/Source/mitogen/mitogen/core.py", line 681, in get
    raise ChannelError(ChannelError.local_msg)

@berenddeschouwer berenddeschouwer commented Sep 6, 2018

I tried the obvious patch, attached here, but it doesn't work: it removes the traceback, but Ansible still stops dead.

diff --git a/ansible_mitogen/services.py b/ansible_mitogen/services.py
index 952e991..da0d303 100644
--- a/ansible_mitogen/services.py
+++ b/ansible_mitogen/services.py
@@ -414,6 +414,13 @@ class ContextService(mitogen.service.Service):
                     'method_name': spec['method'],
                     'msg': str(e),
                 }
+            except mitogen.core.ChannelError as e:
+                return {
+                    'context': None,
+                    'init_child_result': None,
+                    'method_name': spec['method'],
+                    'msg': str(e),
+                }
 
         return result
 

@berenddeschouwer berenddeschouwer commented Sep 6, 2018

Attaching the thread stacks. It keeps looping with "no change since last time".
threadstacks.log


@berenddeschouwer berenddeschouwer commented Sep 6, 2018

When this happens, no ssh processes are running any longer, but Ansible still has 300+ open file descriptors to various sockets and pipes.
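A quick way to confirm the leaked descriptors is to count the entries under /proc for the hung process (a diagnostic sketch, not from the original report; on Linux only):

```shell
# Count the open file descriptors of a process via /proc.
# $$ (this shell's own PID) is used so the snippet is self-contained;
# substitute the PID of the hung ansible-playbook process instead.
pid=$$
ls "/proc/$pid/fd" | wc -l
```

The same count can also be obtained with `lsof -p <pid> | wc -l` where lsof is installed.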


@dw dw commented Sep 6, 2018

This one looks exciting! threadstacks.log shows worker-0, 1, 2, 3 and 4 in the process of connection attempts, but you say there are no SSH processes?

Regarding the "if/when SSH disconnects", is this usual for your environment?

How many hosts were you targeting in this run before the hang?

Thanks for reporting!


@dw dw commented Sep 6, 2018

This may at least in part be a manifestation of the missing downstream propagation from https://github.com/dw/mitogen/issues/76


@dw dw commented Sep 6, 2018

Am I right in saying the inline tracebacks are from Git, but the threadstacks.log is from 0.2.2?


@berenddeschouwer berenddeschouwer commented Sep 7, 2018

All the logs are from Git, but they may have been taken while I was patching it, so I'll try to redo them.


@berenddeschouwer berenddeschouwer commented Sep 7, 2018

I was targeting about 150 hosts.

It is normal for a small percentage of ssh connections to disconnect. I've so far found the following reasons for that:

  • This is running in various countries in Africa, and the networks aren't always reliable.
  • There's also an OpenSSH bug when combining multiplexing and jump hosts, so multiplexing is switched off.
  • I've also upped the number of open file descriptors with ulimit to avoid similar silent failures.
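For reference, the descriptor limit mentioned in the last bullet can be raised per shell before launching the playbook (the value below is illustrative, not the one used in the original report):

```shell
# Inspect the current soft and hard limits on open file descriptors
ulimit -Sn
ulimit -Hn
# Raise the soft limit for this shell and its children; it cannot
# exceed the hard limit shown above (illustrative value)
ulimit -n 4096 || echo "hard limit too low; raise it in limits.conf first"
```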

@berenddeschouwer berenddeschouwer commented Sep 7, 2018

Including a threadstack that's confirmed from git master (18685da)
threadstack.log


@dw dw commented Sep 7, 2018

This sounds like you're hitting the same bug as #352: in that one, an EC2 instance went down for retirement and the code didn't notice. I hope to have this fixed in 0.2.3, but if not 0.2.3, then a 0.2.4 that'll follow shortly afterwards, as 0.2.3 is huge already :)


@berenddeschouwer berenddeschouwer commented Sep 8, 2018

Sounds like it's the same problem to me too: some connection errors aren't processed correctly. Looking forward to testing 0.2.3 and 0.2.4.


@berenddeschouwer berenddeschouwer commented Sep 11, 2018

If you're looking for an easy way to test:

in ansible.cfg, add

[ssh_connection]
ssh_executable = /path/to/timessh

and have /path/to/timessh deliberately kill ssh after a timeout:

#!/bin/sh
exec /usr/bin/timeout 3 /usr/bin/ssh "${@}"

then any task in a playbook that takes longer than 3 seconds will trigger the bug.
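For instance, any playbook containing a task like this hypothetical one (not from the original report; any task over three seconds works) would hit the wrapped timeout:

```yaml
# Minimal playbook sketch: the sleep outlives the 3-second timeout
# wrapper, so ssh exits mid-task and the hang reproduces.
- hosts: all
  tasks:
    - name: Outlive the wrapped ssh
      command: sleep 10
```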


@dw dw commented Nov 1, 2018

Hi Berend,

Sorry for such a huge delay on this one -- I forgot there were related bugs. Your attached test case is indeed a reproducer for the class of issues resolved in #76. The fix is now on the master branch and will make it into the next release. Thanks so much for reporting this, and most of all, thanks a ton for a reproducer that let this be fixed as part of another ticket. :)

To be notified when a new release is made, subscribe to https://networkgenomics.com/mail/mitogen-announce/

@dw dw added the os:wsl label Nov 2, 2018