Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve debugging on lxc exec failure #373

Closed
candlerb opened this issue Sep 18, 2018 · 4 comments
Closed

Improve debugging on lxc exec failure #373

candlerb opened this issue Sep 18, 2018 · 4 comments
Labels

Comments

@candlerb
Copy link
Contributor

@candlerb candlerb commented Sep 18, 2018

  • mitogen master plus #372 patch
  • ubuntu 16.04.5, lxd 2.0.11
  • python 2.7.9
  • ansible 2.5.9

This was a really hard-to-debug issue - and even though the underlying issue turned out to be simple, I wonder if output can be improved to help others in future.

Here's the test case:

==> test.yml <==
- hosts: gold-master
  gather_facts: false
  tasks:
    - raw: echo hello world

==> hosts.yml <==
all:
  hosts:
    gold-master:
      ansible_connection: lxd

==> ansible.cfg <==
[defaults]
inventory = hosts.yml

#strategy_plugins = /usr/local/share/mitogen/ansible_mitogen/plugins/strategy
#strategy = mitogen_linear

"gold-master" is a fresh ubuntu 16.04 container.

With mitogen commented out as shown, it works. With mitogen uncommented, it aborts like this:

$ ansible-playbook test.yml

PLAY [gold-master] ******************************************************************************************************************

TASK [raw] **************************************************************************************************************************
fatal: [gold-master]: UNREACHABLE! => {"changed": false, "msg": "EOF on stream; last 300 bytes received: u''", "unreachable": true}
	to retry, use: --limit @/home/ubuntu/ansible/misc/test.retry

PLAY RECAP **************************************************************************************************************************
gold-master                : ok=0    changed=0    unreachable=1    failed=0

Running with MITOGEN_ROUTER_DEBUG=1 shows

09:45:08 D mitogen: mitogen.lxd.Stream(u'default').connect()
09:45:08 D mitogen: create_child() child 2161 fd 63, parent 2139, cmd: lxc exec --mode=noninteractive gold-master -- /usr/bin/python
-c "import codecs,os,sys;_=codecs.decode;exec(_(_(\"eNqFkMFLwzAUxs/rX9HbS1hY0+4gFgLKDuJBhCLuoEPaJtVgl4S0XZ1/vW+tsHYePOX9+L6X7/FlbCtss
3LaKUIDz/oJ6SpEqKz/JDQNFjjLziWEs5hzeuaMTcmjGo9c1rZRJJuCn8J2Cj0CBjZHjK/zFlP3oRAhyNz32kCYGzmI6kuVXZsXtRrkqGt8VGgTuWP7YQ3gnYsL21IMiwflG2
3NS7reDbHKHLRHhtvs7pnDTszXRg9iTeYCm+MSyF639l2ZtCs603Y345Mm8fqaAg3wm97rVpGYwcP90yPn/NUAxpdWYss02Ig3cupZWqcMtgu+ALryKpckTq6wTAbf2uFPlRN
n35ZBX8Cp+sr9BmyGeazzwt3/5/57ZTy78gfndK2h\".encode(),\"base64\"),\"zip\"))"
09:45:08 D mitogen: mitogen.lxd.Stream(u'local.2161').connect(): child process stdin/stdout=63
09:45:08 D mitogen: mitogen.lxd.Stream(u'local.2161'): child process exit status was 65280
09:45:08 D mitogen: mitogen.core.Stream(u'unix_client.2158').on_disconnect()

Running with strace on top of that shows little more, but narrows it down to lxc exec ... terminating with exit code 255.

The real clue is buried in /var/log/lxd/gold-master/lxc.log:

lxc 20180918094508.427 ERROR    lxc_attach - attach.c:lxc_attach_run_command:1292 - No such file or directory - Failed to exec "/usr/bin/python".

And indeed it turns out I can reproduce this at the command line:

$ lxc exec gold-master -- /usr/bin/python -c 'print("hello")'
$ echo $?
255

Sadly, lxc exec just aborts with 255 and no error message at all, if the target command does not exist.

The reason that started all this is that I have been using ansible to prepare containers, and I have been using ansible's "raw" module to install python2 in the container, before continuing with real ansible commands:

- name: check python is installed in container
  raw: dpkg -s python
  register: python_install_check
  failed_when: python_install_check.rc not in [0, 1]
  changed_when: false

- name: install python in container
  raw: apt-get update && apt-get install -y python
  when: python_install_check.rc == 1

# After this we are OK to use ansible

This works with normal ansible, but fails with mitogen because mitogen depends on python already being installed in the container, even for "raw" actions. Of course it's obvious when you think about it.

I think this uncovers several issues.

  1. It would be really helpful if mitogen could generate some output when 'lxc exec' fails - even if it just said lxc exec failed with exit code 255, there may be more information in /var/log/lxd/<container>/lxc.log

  2. lxc exec itself is arguably broken by not generating any output on stderr, and returning an invalid exit code (should be 0 to 127; 128+ indicates that the process terminated with a signal). I'll raise that separately with lxd project. If lxc exec did generate stderr output, it should be passed through.

  3. Obviously I still have my underlying issue with bootstrapping containers. Probably the best long-term solution would be to use ansible_python_interpreter=/usr/bin/python3, since the container already has python3, as long as mitogen supports python3 (which I believe it does). Otherwise I should be able to disable mitogen while bootstrapping:

- hosts: gold-master
  gather_facts: false
  strategy: linear
  tasks:
    - raw: echo hello world
@dw
Copy link
Member

@dw dw commented Oct 2, 2018

I think the best thing we can do here, except record any future version of LXC's output, is to include a hint as part of the log? Say appending something like:

EOF on stream; last 300 bytes received: u''

Note: many versions of LXC do not report program execution failure. Please check
/foo/bar/baz to log if the above message was not helpful.
@candlerb
Copy link
Contributor Author

@candlerb candlerb commented Oct 3, 2018

How about:

  1. adding the exit code (or only if it's non zero)
  2. in addition, if the stream is empty (u'') and the exit code is 255 or 127, then also add the more verbose description

The internally logged message child process exit status was 65280 suggests that there is an endianness problem (65280 = 0xff00). So maybe this can be fixed and then the exit code brought forward to display.

EOF on stream; last 300 bytes received: u''; exit code 255.  Note: ... etc
dw added a commit that referenced this issue Oct 3, 2018
Don't log the raw waitpid() result, convert it to a useful string first.
@dw
Copy link
Member

@dw dw commented Oct 3, 2018

Your suggestion was blindingly obvious :) Glad I asked!

The endianness issue was actually because it was logging the full wait status, I've replaced that with a neat string. All that code should eventually revert to using the subprocess package, but it's a relatively disruptive change so it's been perpetually delayed.

@dw dw added the user-reported label Oct 23, 2018
dw added a commit that referenced this issue Oct 30, 2018
@dw dw closed this in 71f9e84 Oct 30, 2018
dw added a commit that referenced this issue Oct 30, 2018
- poller tests, event versioning
- issue #373
@dw
Copy link
Member

@dw dw commented Oct 30, 2018

I punted on including the exit status, it requires too much refactoring for now. In future we will be pulling all the process management code out into a new class of its own, so error reporting should improve by then Thanks for reporting!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants