
Failed to create a PTY. Operation not permitted [on CentOS7] #462

Closed
yhafri opened this issue Dec 23, 2018 · 22 comments

@yhafri commented Dec 23, 2018

I'm getting the following error when running my playbook to update a remote CentOS7 machine
(the same playbook works perfectly well on all my Ubuntu instances).

  ____________
< PLAY [all] >
 ------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

 ________________________
< TASK [Gathering Facts] >
 ------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

fatal: [centos7]: FAILED! => {"msg": "error occurred on host xxx.xxx.xxx.xxx: Failed to create a PTY: [Errno 1] Operation not permitted. It is likely the maximum number of PTYs has been reached. Consider increasing the 'kern.tty.ptmx_max' sysctl on OS X, the 'kernel.pty.max' sysctl on Linux, or modifying your configuration to avoid PTY use."}
        to retry, use: --limit @/Users/xxx/ansible/script/update.retry
 ____________
< PLAY RECAP >
 ------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

centos7                 : ok=0    changed=0    unreachable=0    failed=1

My CentOS7 instance has enough PTYs as you can see:

# sysctl kernel.pty.max
kernel.pty.max = 5120

Moreover, it's an idle machine. No service is running on it except the default ones.
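For completeness, the number of PTYs actually in use can be checked against that limit (a quick sketch; kernel.pty.nr is the standard Linux counter for allocated PTYs):

# PTYs currently allocated vs. the configured maximum:
sysctl kernel.pty.nr kernel.pty.max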

My host config is:

  1. macOS High Sierra 10.13.6
  2. Mitogen from GIT master (ec05604)
  3. Python
$ python
Python 2.7.15 (default, Oct  2 2018, 11:47:18)
[GCC 4.2.1 Compatible Apple LLVM 10.0.0 (clang-1000.11.45.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
  4. Ansible from brew
$ ansible --version 
ansible 2.7.5
  config file = /Users/xxx/.ansible.cfg
  configured module search path = ['/Users/xxx/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/Cellar/ansible/2.7.5/libexec/lib/python3.7/site-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.7.1 (default, Nov 28 2018, 11:51:54) [Clang 10.0.0 (clang-1000.11.45.5)]
  5. My ~/.ansible.cfg is:
[ssh_connection]
retries=3
pipelining = True

[defaults]
timeout = 20
host_key_checking = False
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ServerAliveInterval=60
strategy_plugins = ~/.ansible/plugins/mitogen/ansible_mitogen/plugins/strategy
strategy = mitogen_linear

[ IMPORTANT ]
If I disable Mitogen, the playbook works fine on CentOS7!

@yhafri yhafri changed the title Failed to create a PTY on CentOS7 Failed to create a PTY. Operation not permitted [on CentOS7] Dec 23, 2018
@sboisson commented Jan 4, 2019

I have the exact same problem, with a host running on macOS Mojave 10.14.2 and a specific server running CentOS Linux release 7.5.1804 (Core).
But I do not have the problem with other servers running the same OS, or with CentOS Linux 6 or Ubuntu.

On both CentOS 7 servers I have: kernel.pty.max = 4096

> ansible --version
ansible 2.7.0
  config file = /Users/sboisson/.ansible.cfg
  configured module search path = ['/Users/sboisson/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.7/site-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 3.7.1 (default, Nov 23 2018, 18:08:03) [Clang 10.0.0 (clang-1000.11.45.5)]

I've attached the logs I get when I run ansible with the -vvvv flag. The issue seems to be related to sudo, but I couldn't find any difference in /etc/sudoers or /etc/ssh/sshd_config between the two servers.
ansible.log

@dw (Member) commented Jan 17, 2019

Sorry for the extreme delay replying to this one. Can either of you confirm whether SELinux is enabled? Is there anything else about your configuration that might be considered 'non-default' -- say, are you logging in with an LDAP account or suchlike that might affect group memberships?

@dw (Member) commented Jan 17, 2019

Would it be possible to include the output of these commands executed from the SSH account that is failing to sudo:

echo pty perms:
ls -ld /dev/ptmx /dev/pts `tty`
echo

echo devpts modes:
mount | grep dev/pts
echo

echo user account perms, selinux context:
id
echo

echo real, effective, saved ids:
python -c 'import os; print(os.getresuid())'
python -c 'import os; print(os.getresgid())'
echo

echo 'auxiliary groups:'
python -c 'import os; print(os.getgroups())'
echo

echo selinux:
getenforce
echo

Also, if you can think of any security features that might be enabled on these servers but not elsewhere -- a different install script or admin team, perhaps?

If strace is available on these hosts, it would be immensely useful to get the output of an Ansible run including -e ansible_python_interpreter=/path/to/script, with the script below saved to the remote server's disk and marked executable:

#!/bin/bash
strace -o /tmp/whybroken -ff -- python "$@"

This produces a set of /tmp/whybroken.* files, one of which, if you grep for it, will contain a reference to /dev/ptmx. A copy of that file would be absolutely ideal.

Thanks again!

@yhafri (Author) commented Jan 18, 2019

Here you are:

pty perms:
crw-rw-rw- 1 root   tty    5, 2 Jan 18 03:10 /dev/ptmx
drwxr-xr-x 2 root   root      0 Dec 23 11:01 /dev/pts
crw--w---- 1 younes tty  136, 0 Jan 18 03:10 /dev/pts/0

devpts modes:
devpts on /dev/pts type devpts (rw,relatime,mode=600,ptmxmode=000)
devpts on /dev/pts type devpts (rw,relatime,mode=600,ptmxmode=000)
devpts on /dev/pts type devpts (rw,relatime,mode=600,ptmxmode=000)

user account perms, selinux context:
uid=1000(younes) gid=1000(younes) groups=1000(younes)

real, effective, saveds ids:
(1000, 1000, 1000)
(1000, 1000, 1000)

auxiliary groups:
[1000]

selinux:
Disabled

As explained above, everything works as expected if I use Ansible without Mitogen. Thus, it's unlikely that the issue is related to any security feature that might be enabled.

The strace test:

## on remote node
$ cd /tmp; chmod +x ansible-strace.sh; cat ansible-strace.sh
#!/bin/bash
exec strace -o /tmp/whybroken -ff -- python "$@"

## on local machine where Ansible is installed
$ cat upgrade-strace.sh
#!/bin/bash
exec ansible-playbook -e ansible_python_interpreter=/tmp/ansible-strace.sh -i ~/.ansible/hosts ${PWD}/upgrade.yml

$ ./upgrade-strace.sh

## output on remote machine
$ ls -lah /tmp/whybroken.*
-rw-rw-r--  1 younes younes 404K Jan 18 03:23 whybroken.10011
-rw-rw-r--  1 younes younes 1.9K Jan 18 03:23 whybroken.10014
-rw-rw-r--  1 younes younes  16K Jan 18 03:23 whybroken.10015
-rw-rw-r--  1 younes younes 5.3K Jan 18 03:23 whybroken.10017
-rw-rw-r--  1 younes younes 1.8K Jan 18 03:23 whybroken.10018

$ grep -i "ptmx" /tmp/whybroken.*
/tmp/whybroken.10011:open("/dev/ptmx", O_RDWR)               = 17
@dw (Member) commented Jan 18, 2019

This is amazing, thanks so much -- it's ruled out one problem (ioctl() against ptmx may fail if rUID/eUID do not match).

Re the strace log: would it be possible to get a copy of that log, or at least the lines starting at /dev/ptmx and running to what I assume is a line ending with "= -1 EPERM (Operation not permitted)"?
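For example, something along these lines would capture that window (a sketch; the -A context count is a guess, and the file name is taken from the earlier listing):

# Print the open() of /dev/ptmx plus the following 40 lines of the trace:
grep -A40 'open("/dev/ptmx"' /tmp/whybroken.10011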

@yhafri (Author) commented Jan 18, 2019

Sure thing. Attached whybroken.10011
whybroken.10011.zip

@dw (Member) commented Jan 18, 2019

The devpts mount has no 'gid=' option, so when a new PTY is allocated, the kernel gives it the GID of the allocating process (1000). Something in Python or the C library does not like that and wants to change the group to 5 (probably the 'tty' group, so that "wall" and similar commands function), and that is the source of the error.
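A quick way to check for this condition (a sketch; findmnt ships with util-linux on CentOS 7):

# A healthy host shows gid=5,mode=620 among the devpts options:
findmnt -no OPTIONS /dev/pts
# rw,relatime,gid=5,mode=620,ptmxmode=000   <- expected
# rw,relatime,mode=600,ptmxmode=000         <- broken, as in the output above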

From looking at a uname() line in the log, and also the presence of multiple devpts mounts in the previous output, am I right in saying this is a container of some sort? What kind of container runtime created it?

Mitogen does not require the group to be changed on its PTYs, so this can definitely be worked around; I assume overzealous library code is causing the problem. This issue in rkt looks like a very similar problem -- the strace is basically the same: rkt/rkt#2152

As for why it does not manifest with regular Ansible: in that case, IIRC, the whole SSH connection is started interactively, which means the SSH daemon sets up the PTY. A PTY puts severe constraints on non-interactive use of the channel (e.g. its buffer sizes are tiny), so Mitogen allocates PTYs only as needed and (ideally) only in cases where they are definitely required.

Does your Ansible configuration include a become password? If not, then that is a separate bug -- Mitogen should not be allocating a PTY at all in this case.

@dw (Member) commented Jan 18, 2019

The failing library code is grantpt() in the C library, called by openpty() in the C library, called in turn by os.openpty() in Python.
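That chain can be exercised directly to reproduce the failure (a sketch; run as a non-root user on the affected host):

# glibc's grantpt() tries to chown the slave to the 'tty' group and
# fails with EPERM when devpts was mounted without gid=5:
python -c 'import os; print(os.openpty())'
# OSError: [Errno 1] Operation not permitted   <- on the broken host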

@yhafri (Author) commented Jan 18, 2019

Thanks for the detailed explanations.

From looking at a uname() line in the log, and also the presence of multiple devpts mounts in the previous output, am I right in saying this is a container of some sort? What kind of container runtime created it?

Yes, that's a custom kernel from a cloud provider.

Does your Ansible configuration include a become password? If not, then that is a separate bug -- Mitogen should not be allocating a PTY at all in this case.

Nope. There is no become password in my config.

---
- hosts:
    - all
  become: true
  tasks:
    - name: update cache (Ubuntu)
      apt: update_cache=yes
      when: ansible_distribution == 'Ubuntu'

    - name: upgrade packages (Ubuntu)
      apt: upgrade=dist
      when: ansible_distribution == 'Ubuntu'

    - name: update packages (CentOS/RedHat)
      yum: name=* state=latest
      when: ansible_distribution == 'CentOS' or ansible_distribution == 'Red Hat Enterprise Linux'

    - name: Check if a reboot is required (Ubuntu)
      register: reboot_required_file
      stat: path=/var/run/reboot-required get_md5=no
      when: ansible_distribution == 'Ubuntu'

    - name: Check if a reboot is required (CentOS/RedHat)
      copy: src=needs-restarting.py dest=/root/needs-restarting.py mode=0700
      when: ansible_distribution == 'CentOS' or ansible_distribution == 'Red Hat Enterprise Linux'
@yhafri (Author) commented Jan 18, 2019

Can you think of any solution/workaround?

@dw (Member) commented Jan 18, 2019

When you say it is a custom kernel, does that mean they have replaced the OVH-supplied kernel? CentOS 7 should have shipped with Linux 3.10, and unfortunately for us, the fix mentioned in the LWN article did not arrive until Linux 4.8.

If I understand things correctly, I think you are SSHing into a container whose magic /dev/pts filesystem is not configured correctly. This problem does not manifest on my local CentOS 7 VM -- indeed, only one /dev/pts mount is visible on that VM -- so those extra mounts must be coming from some piece of software. I assume a container manager like LXC, LXD, or Docker, but perhaps there is another explanation we are missing so far.

Thanks for all your help so far. I will need to set up a reproduction for this in order to test a workaround. TTY handling code is very fragile, and if we are short-circuiting the C library to set up a TTY ourselves, it would be nice to limit that to this particular case. This may take some more time over the weekend.

I remembered the reason sudo is always called with a TTY -- it is because many sudo configurations ship with an incredibly pointless requiretty option enabled. It is pure security theatre. Without a TTY, no password can be typed, and even when no password is required, sudo will still refuse to run. It might be possible to detect requiretty's existence, but not without creating authentication failures in the system log, so it is better that we try to fix PTY allocation instead.
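For anyone curious whether requiretty is in play on their hosts, a read-only check (a sketch; fragments under /etc/sudoers.d can also set it):

# Look for a "Defaults requiretty" line in the sudo configuration:
sudo grep -Rn 'requiretty' /etc/sudoers /etc/sudoers.d/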

Thanks again for your help

@yhafri (Author) commented Jan 18, 2019

When you say it is a custom kernel, does that mean they have replaced the OVH-supplied kernel? CentOS 7 should have shipped with Linux 3.10, and unfortunately for us, the fix mentioned in the LWN article did not arrive until Linux 4.8.

Kernel version is 4.9.58, so the issue should be fixed (>4.8).

$ uname -r
4.9.58-xxxx-std-ipv6-64

No rush @dw. Tell me what I can do to help if/when needed.

@dw (Member) commented Jan 18, 2019

I managed to reproduce the problem on CentOS 7.5.1810 easily enough, though not in a satisfactory manner. On a clean install, it is sufficient to run mount -t devpts none /dev/pts -o remount to cause the gid option to be unset, at which point Python behaves as documented in this bug.

When the gid parameter is incorrect, programs running as root, such as SSHd, will be able to call openpty() successfully, but non-root programs like Python will fail.
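For anyone who just needs their host working again, a hedged repair sketch (run as root; assumes the standard 'tty' group with GID 5, as on stock CentOS 7):

# Restore the options Dracut would normally have set at boot:
mount -o remount,gid=5,mode=620 /dev/pts
# Verify the gid= option is now present:
mount | grep 'dev/pts'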

On investigating: Dracut, the initramfs framework in CentOS 7.5, is responsible for setting up the host's /dev/pts, and it does this with the correct options hard-wired. So some extra package or configuration on these machines is causing the problem after the initramfs has already run.

I have scoured Red Hat's Bugzilla and could find no decisive reference to a single package that could cause it, but I did find some interesting hints. In this ticket, the user had modified /etc/fstab on Fedora 19 by removing the /dev/pts line, causing it to be mounted with default arguments. However, on my CentOS 7.5 install, /dev/pts mounting is explicitly handled in the initramfs, and therefore no /dev/pts line should exist in /etc/fstab.

Even if the initramfs is setting things up correctly, it is possible that a redundant line in /etc/fstab is causing the options to be reset later during boot. Does your /etc/fstab have an entry for /dev/pts? It should not, but perhaps one was placed there by OVH's automated installer (it loves to customize machines, annoyingly).
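A quick check (no devpts line should appear on a stock CentOS 7 install):

# Any output here means /etc/fstab is overriding the initramfs mount:
grep -n devpts /etc/fstab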

Finally, the presence of multiple /dev/pts entries in the output of mount is a strong signal that something is messing with these after boot, something that may not be part of the core system. The only software I know of that needs to do this is container runtimes like Docker, LXC, LXD, systemd-nspawn, or rkt -- is anything like that running on these machines?

While it is possible to add code to Mitogen to handle this case, the more I investigate the issue, the more it feels like a configuration error somewhere. I would like to understand the root cause of that configuration error: if it lies in a common package or a common misconfiguration, then it makes sense to add the code. But if it is rare, a better option would be to fix the error and update Mitogen's failure message to explain what is likely broken, as the extra code involved is not quite trivial.

Thanks again! This is a fun one :)

@dw (Member) commented Jan 18, 2019

After more Google searches, I'm sufficiently convinced there are enough broken systems around that this merits a code change.

I have created a new branch containing a fix. Would it be possible to run Mitogen from the issue462 branch to ensure both issues have abated? I'm pretty sure this fixes @yhafri's case, but it would be nice to confirm @sboisson's issue is identical.

git clone -b issue462 https://github.com/dw/mitogen/
dw added a commit that referenced this issue Jan 18, 2019
@yhafri (Author) commented Jan 18, 2019

Awesome -- the fix in the issue462 branch worked.
I only got these two errors in the middle of the output:

ERROR! [pid 68929] 22:58:54.282173 E mitogen.ctx.ssh.91.121.153.46:51789: mitogen: RouteMonitor(): received DEL_ROUTE for 2007 from mitogen.sudo.Stream(u'sudo.root'), expected mitogen.core.Stream('parent')
ERROR! [pid 68929] 22:58:54.283527 E mitogen.ctx.ssh.91.121.153.46:51789.sudo.root: mitogen: RouteMonitor(): received DEL_ROUTE for 2007 from mitogen.fork.Stream(u'fork.25829'), expected mitogen.core.Stream('parent')
@yhafri (Author) commented Jan 18, 2019

After Googling a little bit, I found this.

Switching from ignore_errors: True to failed_when: False resolved the two errors, which IMHO are unrelated to Mitogen.

@dw (Member) commented Jan 18, 2019

Those errors should have been fixed already :) They will definitely be gone in 0.2.4; there is another ticket open for them. You can safely ignore them.

@yhafri (Author) commented Jan 18, 2019

Thanks a million and have a great weekend.
Long live Mitogen!!!

@dw (Member) commented Jan 18, 2019

Thanks so much for your help :) I will keep the ticket open until @sboisson can confirm the fix.

@dw dw closed this in a4c7a98 Jan 19, 2019
dw added a commit that referenced this issue Jan 19, 2019
* origin/issue462:
  issue #462: docs: update Changelog.
  parent: cope with broken /dev/pts on Linux; closes #462.
@dw (Member) commented Jan 19, 2019

I merged this down to master in order to work on other things -- @sboisson please reopen if you're still having the problem.


This is now on the master branch and will make it into the next release. To be updated when a new release is made, subscribe to https://networkgenomics.com/mail/mitogen-announce/

Thanks for reporting this!

@sboisson commented Jan 31, 2019
Sorry for the delay -- I was away on holiday. I confirm the change works for me too! :-) (tested with the master branch)
