mosh sometimes hangs on connect #1273

Open
falsifian opened this issue May 22, 2023 · 13 comments

@falsifian

Sometimes, when I try to use mosh to connect to a VM running on my computer, mosh just prints a couple of blank lines and does nothing else:

moth ~ $ mosh h4 


(h4 is the VM.) Ctrl-C and Ctrl-\ do nothing. There's no mosh TUI with a blue banner or anything; it's as if I had typed echo;echo;cat>/dev/null and somehow disabled Ctrl-C and Ctrl-\. pgrep shows there are three mosh-server processes on the VM and no mosh-client processes on the host.

The host machine (i.e. mosh client) runs OpenBSD-current amd64. The VM runs OpenBSD 7.3. Both have mosh 1.4.0.

Here is an excerpt from pstree output on the host where I ran mosh h4:

 |       |-+= 40796 falsifia /usr/X11R6/bin/xterm
 |       | \-+= 25909 falsifia ksh
 |       |   \-+= 36986 falsifia perl /usr/local/bin/mosh h4
 |       |     \-+- 11626 falsifia ssh -n -tt -S none -o ProxyCommand='/usr/local/bin/mosh' '--family=prefer-inet' --fake-proxy -- %h %p h4 -- mosh-server 'new' '-c' '8' '-s' '-l' 'LANG=en_CA.UTF-8' '-l' 'LC_TIME=C' '-l' 'LC_COLLATE=C'
 |       |       \-+- 36023 falsifia perl /usr/local/bin/mosh --family=prefer-inet --fake-proxy -- h4 22
 |       |         \--- 54912 falsifia perl /usr/local/bin/mosh --family=prefer-inet --fake-proxy -- h4 22
 |       |--= 15846 falsifia emacs (emacs-28.2)

This bug isn't consistent: usually mosh h4 successfully sets up a remote session on h4.

I don't think I've observed this when connecting to a machine that isn't a VM on the same computer. (I wonder if it has something to do with the network to the VM being very low-latency.)

I saw #114, which looks quite similar, but it's old and closed, and I'm running mosh 1.4.0 on OpenBSD.

@achernya
Collaborator

There's no mosh-client in your pstree, so it seems the hang is purely in the mosh perl script. Can you strace the perl script and figure out where it's hanging?

@falsifian
Author

Here are the last few lines of a ktrace (OpenBSD equivalent of strace?), captured with ktrace -d mosh h4 and displayed with kdump -T:

 64743 perl     1684788563.430282 CALL  kbind(0x7d90e3381848,24,0xc35e0b3254818a
 64743 perl     1684788563.430378 RET   kbind 0
 64743 perl     1684788563.430389 CALL  kbind(0x7d90e33817f8,24,0xc35e0b3254818a
 64743 perl     1684788563.430480 RET   kbind 0
 64743 perl     1684788563.430528 CALL  kbind(0x7d90e33816f8,24,0xc35e0b3254818a
 64743 perl     1684788563.430616 RET   kbind 0
 64743 perl     1684788563.430623 CALL  write(1,0xeefb1ab8000,0x3)
 64743 perl     1684788563.430644 GIO   fd 1 wrote 3 bytes
       "\r\r
       "
 64743 perl     1684788563.430649 RET   write 3
 64743 perl     1684788563.430666 CALL  read(3,0xeef4b095000,0x2000)
 64743 perl     1684788563.430676 GIO   fd 3 read 2 bytes
       "\r
       "
 64743 perl     1684788563.430686 RET   read 2
 64743 perl     1684788563.430698 CALL  write(1,0xeefb1ab8000,0x2)
 64743 perl     1684788563.430709 GIO   fd 1 wrote 2 bytes
       "\r
       "
 64743 perl     1684788563.430713 RET   write 2
 64743 perl     1684788563.430721 CALL  read(3,0xeef4b095000,0x2000)
 64743 perl     1684788563.431092 GIO   fd 3 read 43 bytes
       "MOSH CONNECT 60001 IdI1SCGUzJ3uli7Ovethwg\r
       "
 64743 perl     1684788563.431109 RET   read 43/0x2b
 64743 perl     1684788563.431269 CALL  close(3)
 64743 perl     1684788563.431440 RET   close 0
 64743 perl     1684788563.431455 CALL  wait4(2918,0x7d90e3381868,0<>,0)
 64743 perl     1684788592.311392 PSIG  SIGTERM SIG_DFL

I think the SIGTERM at the end is from me manually running the kill command after around 30 seconds. So, it's stuck on a call to wait4.

@achernya
Collaborator

If I'm reading that correctly, it's read the MOSH CONNECT line from ssh successfully, but is then waiting for a process (presumably the ssh client) to terminate with the wait4 syscall. So presumably the hang is at https://github.com/mobile-shell/mosh/blob/master/scripts/mosh.pl#L443; do you know what the status of the awaited-for child process was during this?

@falsifian
Author

Here's the end of a more complete ktrace output, which includes the ssh and perl child processes (captured with ktrace -i). The main mosh process is apparently PID 75476. I've replaced some text that looks like binary data with (deleted), because I don't know for sure that it's not sensitive; I'm guessing it's the encrypted ssh network traffic.

Based on the complete ktrace output, I found these PIDs:

  • 75476: main /usr/local/bin/mosh process
  • 77519: ssh, forked from main
  • 85637: perl fake proxy, forked from ssh (write from ssh to network?)
  • 78646: perl fake proxy, forked from previous (read from network to ssh?)

It looks like ssh is stuck on poll (maybe reading from 78646?) and 78646 is stuck on read. Did mosh-server forget to close a connection? I could try some debugging on the server side next.

@achernya
Collaborator

mosh-server forks and detaches. It shouldn't be responsible for closing the connection per se

@falsifian
Author

falsifian commented May 23, 2023

If I run pkill mosh-server on the VM, then a few seconds later the mosh h4 command that was stuck on the host shows a blue banner (mosh: Nothing received from server on UDP port 60001. [To quit: Ctrl-^ .]) and then exits a while after that.

@achernya
Collaborator

Yes that's expected. Mosh requires a running mosh-server process to talk to.

@falsifian
Author

I added some printf debugging (to a log file) to mosh-server. It seems like it's getting stuck on tcdrain( STDOUT_FILENO ), currently on line 473 of mosh-server.cc.

My printf-laden code looks like this (xxx_s is the log file):

    if ( isatty( STDOUT_FILENO ) ) {
      fprintf( xxx_s, "XXX yes, parent tcdraining\n" ); 
      fflush(xxx_s);
      tcdrain( STDOUT_FILENO );
    }
    fprintf( xxx_s, "XXX parent exiting\n" );   
    fflush(xxx_s);

and the end of the log is XXX yes, parent tcdraining. In cases where mosh doesn't hang, the XXX parent exiting line is seen, as expected.

@achernya
Collaborator

Is there any way to see which bytes are not flushed? I'm not super familiar with debugging on the BSDs to know how to do that offhand.

@falsifian
Author

I don't know a reliable way, but I found some evidence the unflushed bytes are "[mosh server detached..." and "Warning: termios IUTF8 flag...".

Details:

I ktraced sshd's children including mosh-server. In the output I see a pattern of mosh-server writing something and an sshd process reading it, like so:

 19928 mosh-server 1684875867.961988 CALL  write(1,0xee5cb224000,0x2a)
 19928 mosh-server 1684875867.962006 GIO   fd 1 wrote 42 bytes
       "MOSH CONNECT 60001 YfMH/f2uNxLhq7lAQe4Puw
       "
 19928 mosh-server 1684875867.962010 RET   write 42/0x2a
 17754 sshd     1684875867.962021 STRU  struct pollfd [3] { fd=4, events=0x1<POL
 17754 sshd     1684875867.962026 RET   ppoll 1
 17754 sshd     1684875867.962037 CALL  sigprocmask(SIG_UNBLOCK,0x80000<SIGCHLD>
 17754 sshd     1684875867.962042 RET   sigprocmask 0x80000<SIGCHLD>
 17754 sshd     1684875867.962054 CALL  read(8,0x6809cc32000,0x8000)
 17754 sshd     1684875867.962065 GIO   fd 8 read 43 bytes
       "MOSH CONNECT 60001 YfMH/f2uNxLhq7lAQe4Puw\r
       "
 17754 sshd     1684875867.962069 RET   read 43/0x2b

However, these final two messages don't have matching reads:

 19928 mosh-server 1684875867.966139 CALL  write(2,0x702fa5307340,0x24)
 19928 mosh-server 1684875867.966153 GIO   fd 2 wrote 36 bytes
       "[mosh-server detached, pid = 82611]
       "
 19928 mosh-server 1684875867.966156 RET   write 36/0x24
 19928 mosh-server 1684875867.966166 CALL  write(2,0xee31bda4717,0x8c)
 19928 mosh-server 1684875867.966178 GIO   fd 2 wrote 140 bytes
       "
        Warning: termios IUTF8 flag not defined.
        Character-erase of multibyte character sequence
        probably does not work properly on this platform.
       "
 19928 mosh-server 1684875867.966182 RET   write 140/0x8c
 19928 mosh-server 1684875867.966192 CALL  fcntl(1,F_ISATTY)
 19928 mosh-server 1684875867.966198 RET   fcntl 1
 19928 mosh-server 1684875867.966206 CALL  ioctl(1,TIOCDRAIN,0)

That last ioctl is the last I see of 19928.

I guess something is causing sshd to decide to stop reading?

(I've been trying to reproduce the tcdrain hang with a simple program that forks; the child closes stdout, and the parent sleep(1)s, writes to stdout, then calls tcdrain(1). I ssh to my VM using that program as the command to run. But so far I haven't been able to reproduce it, i.e., I can't get tcdrain to hang.)
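
Roughly, the test program looks like this (a sketch, not the exact program I ran; the sleep durations and names are arbitrary, and details like how long the child lingers are guesses):

    /* Reproduction attempt: the child closes stdout, the parent sleep(1)s,
     * writes to stdout, then calls tcdrain() on fd 1.  Run it as the remote
     * command, e.g. ssh h4 ./tcdrain-test (name is a placeholder).
     * So far this does not reproduce the hang. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <termios.h>
    #include <unistd.h>

    int main(void)
    {
      pid_t pid = fork();
      if (pid == -1) {
        perror("fork");
        return 1;
      }
      if (pid == 0) {
        /* child: close stdout, then linger so it's still alive while the
         * parent drains */
        close(STDOUT_FILENO);
        sleep(5);
        return 0;
      }
      /* parent: wait a moment, write to stdout, then drain the tty */
      sleep(1);
      const char msg[] = "parent writing before tcdrain\n";
      write(STDOUT_FILENO, msg, sizeof(msg) - 1);
      if (isatty(STDOUT_FILENO)) {
        tcdrain(STDOUT_FILENO);
      }
      return 0;
    }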

@falsifian
Author

falsifian commented May 23, 2023

Here's a piece of the ktrace output where I think sshd is closing the connection (EDIT: I mean, the connection to the mosh-server process).

It begins with the sshd process calling ppoll(...,3,...): listening on 3 fds.

Then one last message is written by mosh-server and read by sshd on fd 8.

Then the sshd process reads what I guess is a network packet.

Then it closes fd 8, and the next call to ppoll lists 2 fds instead of 3.

@achernya
Collaborator

Is there any evidence this is not a bug in ssh?

@falsifian
Author

I don't know. I wish I knew the exact sequence of events that causes tcdrain to hang; maybe then I would either have a clean bug report for ssh or an argument that mosh is doing the wrong thing.

If I spend more time on this, I think my next step will be to remove things from mosh.pl and mosh-server.cc until I have some sort of minimal reproducible example.
