Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ssh connection occasionally aborted by dropbear while establishing the connection (regression) #173

Closed
m5jt opened this issue Jun 13, 2022 · 6 comments

Comments

@m5jt
Copy link

m5jt commented Jun 13, 2022

We use dropbear on the server side, to create an ssh session with a command-line application.
An ssh client connects to it. Sometimes the connection fails.
We repeat the connection sequence a large number of times in an automated test.

The problem started after upgrading from 2018-76 to 2022-82. We believe that the regression appeared in 2022-82 because it seems related to issue #85 (https://github.com/mkj/dropbear/pull/85/files)

We found the failure mechanism, as well as two potential fixes

Failure mechanism :

  • In function session_loop() (common-session.c) :
    • The problem starts when select() returns 2 while chansess.pid is not yet initialized. When this happens :
      • ses.channel_signal_pending is set to 1 (because there is something in ses.signal_pipe[0])
      • process_packet() is also called (because there is something in ses.sock_in) and initializes ses.channels[0]
    • From this point, the problem happens everytime.
  • Then, in function channelio() (common-channel.c)
    • The « for » loop is executed once because sess.channel[0] is initialized
    • Because ses.channel_signal_pending is set, do_check_close is set to 1, causing check_close() to be called
  • Then, sesscheckclose() (in srv-chansession.c) is called by check_close()
    • At that moment, chansess->pid is not initialized yet.
    • This causes sesscheckclose() to return 1, and the session is closed .

Potential fixes :

  • We found 2 « hacks » that make our test pass. But we do not know if they address the root cause of the problem or if they cause other issues.
  • The first possible hack is in session_loop(). Add condition « if (ses.channel_signal_pending == 0) » around the code block « /* process session socket's incoming data */ ».
  • The second possible hack is to modify the return statement in sesscheckclose() to remove the « chansess->pid == 0 » condition
@mkj
Copy link
Owner

mkj commented Jun 14, 2022

Thanks for the report. I think the problem might occur when SIGCHLD arrives after loophandler() gets called, but before channelio()'s channel_signal_pending test. (loophandler() is the callback svr_chansess_checksignal())

If it's easy for you to currently reproduce, would you be able to try putting a call svr_chansess_checksignal(); at the start of sesscheckclose() ? I'll try add a reproducible test case for myself and see if there's a less bodgy fix.

@m5jt
Copy link
Author

m5jt commented Jun 14, 2022

Hello mkj! Thanks for the response.

Our automated test can reproduce the problem rather quickly.
I tried the patch that you proposed in the previous comment. It did not fix the problem.
To make sure the setup is consistent with past observations, I also ran the same test :

  • with no mod at all – the failure is present
  • with our hack in session_loop() – the failure is absent

Note that for the tests today, I removed the debug traces we previously added, so they don’t interfere. These traces tend to increase the occurrence of the issue.

@mkj
Copy link
Owner

mkj commented Jun 14, 2022

Ah yes, my theory doesn't make sense - I'd forgotten that the signal handler is no longer asynchronous (noticed some other things to clean up there too).

I'm a bit puzzled why ses.channel_signal_pending is set if chansess.pid isn't initialized. A sigchld should only occur after the pid has been filled out, unless another child process (unrelated to that channel) has exited? Could you give me some more details of your test setup so I can try reproduce it (or if you want to paste/email me debug traces I have a look at them)

@m5jt
Copy link
Author

m5jt commented Jun 14, 2022

Deleted.

@m5jt
Copy link
Author

m5jt commented Jun 16, 2022

Hello Matt. Thanks for the support.

Your last comment helped us understand how the problem happens. We previously made a modification in srv_auth.c (function recv_msg_userauth_request()) where we fork a child process to perform user/password validation.

This fork causes a call to sesssigchild_handler() (in srv-chansession.c) , which writes into ses.signal_pipe. This then causes session_loop() to set ses.channel_signal_pending to 1 before there is a pid in ChanSess.pid.

The reason why this worked in the past is probably because the older versions of dropbear did not nave the « chansess->pid==0 » condition in the return() statement of sesscheckclose().

And the reason why it doesn’t fail all the time is because most of the time, there is nothing to read from ses.sock_in in the same loop iteration.

I’m trying to find a way to deal with that. Do you have suggestions?
I can imagine a few potential ideas, but nothing solid yet.

  • Find a way to make sesssigchild_handler() ignore the fork made for password validation (don’t know if this is possible)
  • Avoid calling read_packet() in the same loop iteration when ses.channel_signal_pending == 1 (have not enough insight to foresee side effects)

@m5jt
Copy link
Author

m5jt commented Jun 21, 2022

I close this issue. Thanks for the hints, this was very helpful
The problem was caused by a modification that we made that forks a child process, causing the SIGCHLD handler to be called and thus setting ses.channel_signal_pending to 1. It was solved by exiting the sigchld handler immediately at the begining in this situation.

@m5jt m5jt closed this as completed Jun 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants