Connection hangs on RHEL 8.4 server #1152

Closed
byron-hawkins opened this issue Sep 29, 2021 · 8 comments


byron-hawkins commented Sep 29, 2021

On a fresh install of Oracle Linux (based on RHEL 8.4), any attempt to connect with mosh over a direct Ethernet connection prints two blank lines and then hangs. If I start mosh-server manually on the server, copy the MOSH_KEY to the client, and connect with mosh-client on port 60001, the session behaves normally in every respect. Running mosh --port=60001, however, produces the same two blank lines.
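
The manual procedure described above can be sketched roughly as follows. This is an illustration rather than something taken from the report: it assumes mosh-server prints its usual "MOSH CONNECT <port> <key>" line on startup, and the helper name parse_mosh_connect and the sample key used to try it out are made up.

```shell
#!/bin/bash
# Hypothetical helper: extract "<port> <key>" from mosh-server's startup output.
# Reads the output on stdin and prints the fields of the first MOSH CONNECT line.
parse_mosh_connect() {
    awk '$1 == "MOSH" && $2 == "CONNECT" { print $3, $4; exit }'
}

# Manual connection that bypasses the mosh wrapper script (not run here):
#   server$ mosh-server new -p 60001
#   client$ MOSH_KEY=<key> mosh-client <server-ip> 60001
```

Feeding the helper a line such as "MOSH CONNECT 60001 EXAMPLEKEYEXAMPLEKEY00" should print the port and key separated by a space, which are the two values the client needs.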

According to gdb, the two server processes are blocked in read() and select(), while the client is blocked in wait(). The perl command remains active in the client's terminal title bar. With --no-ssh-pty the two blank lines are not printed on the connection attempt, but nothing else changes.

[root@localhost ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux release 8.4 (Ootpa)
[root@localhost ~]# uname -a
Linux localhost.localdomain 5.4.17-2102.205.7.3.el8uek.x86_64 #2 SMP Fri Sep 17 17:54:52 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost ~]# mosh --version
mosh 1.3.2 [build mosh 1.3.2]

keithw (Member) commented Oct 5, 2021

Hmm, does this problem happen on actual RHEL 8.4, or only on Oracle Linux? Does it happen with the EPEL package, or only building from source? Does it happen with the Mosh 1.3.2 release, or only tip-of-tree from the Git repository?

njaard (Contributor) commented Oct 27, 2021

I am seeing this on Debian bullseye. The problem occurs sporadically and only started about two weeks ago, probably coinciding with system updates. I first suspected a new ssh, but there has been no new version of ssh since March.

It also seems to correlate with the server being under load.

njaard (Contributor) commented Oct 28, 2021

I found that mosh opens the ssh connection and runs mosh-server new over it. That program then exits, after which ssh stalls indefinitely, so mosh cannot proceed past the wait().

I'm still investigating.

If I run the ssh command by hand, ssh hostname -- mosh-server [many parameters], ssh always completes correctly.

This problem occurs even with --experimental-remote-ip=local.
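
One way to tell a stall apart from a normal exit when reproducing this is to put a time limit on the command. The sketch below is only an illustration: the helper name stall_check is made up, it relies on timeout(1) from GNU coreutils, and the 10-second limit is an arbitrary assumption.

```shell
#!/bin/bash
# Hypothetical helper: run a command with a time limit and report the outcome.
# Usage: stall_check SECONDS COMMAND [ARGS...]
stall_check() {
    local limit=$1 status=0
    shift
    # Discard the command's output; only the exit status matters here.
    timeout "$limit" "$@" > /dev/null 2>&1 || status=$?
    if [ "$status" -eq 124 ]; then
        # timeout(1) exits with 124 when the limit was reached.
        echo "stalled"
    else
        echo "completed (exit $status)"
    fi
}

# Intended use against a real host (not run here; "hostname" is a placeholder):
#   stall_check 10 ssh hostname -- mosh-server [many parameters]
```

Since the hand-run ssh command reportedly always completes, this check is mainly useful for comparing it with the invocation that mosh itself issues.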

njaard (Contributor) commented Oct 28, 2021

So it seems like an OpenSSH bug, but the bug only seems to occur when it is mosh that starts the ssh client. I'm stumped at this point.

A workaround that seems to work, but that is going to make you feel slightly ill, is to create a mosh-server wrapper script on your servers:

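#!/bin/bash
# Remember the parent PID, which is this session's sshd.
export B_PP=$PPID
# In the background, kill that sshd after 2 seconds so the stalled ssh client returns.
nohup bash -c "sleep 2; kill $B_PP" > /dev/null 2>&1 &
# Start the real mosh-server with the arguments mosh passed in.
mosh-server "$@"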

$PPID here represents the session's sshd.

Then run mosh as such:
/usr/bin/mosh --server=mosh-wrapper HOSTNAME

(of course mosh-wrapper must be executable and in $PATH on the server)
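
A slightly more cautious variant of the wrapper could confirm that the parent really is sshd before scheduling the kill. This is a sketch under assumptions, not part of the original workaround: the function names are made up, it reads the Linux-specific /proc/<pid>/comm file, and the process name "sshd" is assumed and may differ on your system.

```shell
#!/bin/bash
# Hypothetical helper: return 0 if the process with PID $1 has command name $2.
# Uses /proc/<pid>/comm, so this is Linux-only.
pid_has_name() {
    local pid=$1 expected=$2 actual
    actual=$(cat "/proc/$pid/comm" 2>/dev/null) || return 1
    [ "$actual" = "$expected" ]
}

# Hypothetical guarded wrapper body; call it as: guarded_wrapper "$@"
guarded_wrapper() {
    local parent=$PPID
    if pid_has_name "$parent" sshd; then
        # Same delayed kill as the workaround above, but only if the parent is sshd.
        nohup bash -c "sleep 2; kill $parent" > /dev/null 2>&1 &
    fi
    mosh-server "$@"
}
```

The guard only avoids killing an unrelated parent when the wrapper is started by something other than sshd, for example from an interactive shell while testing.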

njaard (Contributor) commented Oct 28, 2021

Here's the strace output of the sshd server process, starting at the point where it received SIGCHLD because the mosh-server task completed (and forked off).

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=36204, si_uid=1001, si_status=0, si_utime=0, si_stime=0} ---
write(9, "\0", 1)                       = 1
rt_sigreturn({mask=[]})                 = 0
write(4, "[encrypted]"..., 36) = 36
select(13, [4 7 12], [], NULL, {tv_sec=0, tv_usec=100000}) = 2 (in [7 12], left {tv_sec=0, tv_usec=99997})
read(7, "\0", 1)                        = 1
read(7, 0x7ffe37d7a55f, 1)              = -1 EAGAIN (Resource temporarily unavailable)
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 36204
getpid()                                = 36203
write(5, "\0\0\0\17\36", 5)             = 5
write(5, "\0\0\0\n/dev/pts/7", 14)      = 14
close(13)                               = 0
wait4(-1, 0x7ffe37d7a3cc, WNOHANG, NULL) = -1 ECHILD (No child processes)
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
read(12, "There is NO WARRANTY, to the ext"..., 16384) = 94
getpid()                                = 36203
select(13, [4 7 12], [4], NULL, NULL)   = 2 (in [12], out [4])
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
read(12, 0x7ffe37d763b0, 16384)         = -1 EIO (Input/output error)
close(12)                               = 0
write(4, "[encrypted]"..., 184) = 184
getpid()                                = 36203
getpid()                                = 36203
select(13, [4 7], [4], NULL, NULL)      = 1 (out [4])
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
write(4, "[encrypted]"..., 72) = 72
select(13, [4 7], [], NULL, NULL
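
Reading the trace above, the final select() has no return value, which suggests that this is where sshd stays blocked. A small filter can pick out such an unfinished call from a saved trace. This is a sketch: the helper name last_unfinished_call is made up, and it assumes the plain strace line format shown above, where completed calls end in ") = <result>".

```shell
#!/bin/bash
# Hypothetical helper: print the last non-empty line of an strace log (stdin)
# if it lacks a ") = <result>" suffix, meaning the call had not returned
# when the trace was cut off. Prints nothing if the last call completed.
last_unfinished_call() {
    awk 'NF { last = $0 }
         END { if (last != "" && last !~ /\) += /) print last }'
}

# A trace like the one above can be captured by attaching to the running
# sshd process (not run here; <pid> is a placeholder):
#   strace -f -p <pid> -o sshd.trace
#   last_unfinished_call < sshd.trace
```

Signal lines such as the SIGCHLD entry also lack a result, so the filter would report one of those too if it happened to be the last line.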

njaard (Contributor) commented Oct 28, 2021

@byron-hawkins you can trivially patch your mosh perl script according to the change I made.

eminence (Member) commented

I think @njaard tracked this issue to an openssh bug and provided a nice fix for us in #1160, so closing this issue.

SyamGadde commented

I know this is closed, but I just wanted to report that the simple perl script change from #1160 does not work for me on RHEL 8.5 (server) and Fedora 31 (client), whereas the mosh-wrapper hack does work. It is the same exact issue: two blank lines and a hang when I use mosh, but it works perfectly if I run mosh-server and mosh-client directly.
