Skip to content

fix(ssh): resolve goodbye-phase deadlock + load ~/.ssh/config in russh path#4154

Merged
oferchen merged 2 commits into
masterfrom
fix/ssh-deadlock-and-config
May 16, 2026
Merged

fix(ssh): resolve goodbye-phase deadlock + load ~/.ssh/config in russh path#4154
oferchen merged 2 commits into
masterfrom
fix/ssh-deadlock-and-config

Conversation

@oferchen
Copy link
Copy Markdown
Owner

Summary

Two SSH transport fixes, both required to unblock v0.6.2 SSH push on aarch64/Debian and to make the russh transport actually usable in real environments.

1. Subprocess SSH goodbye-phase deadlock (primary)

After the SSH child exits, the parent's stderr drain thread could remain parked in `read(2)` on the socketpair/pipe endpoint because the kernel had not yet propagated EOF — a re-execed ssh helper occasionally inherits the write end and keeps it open for a brief window after the main `ssh` is reaped. `wait_with_stderr()` then blocked forever in `JoinHandle::join`, even though every protocol byte had already been written and the SSH child had exited cleanly.

Confirmed via gdb on an aarch64 Debian 13 / kernel 6.18 reproducer: both threads parked in `futex_do_wait`, drain thread (`ssh-stderr-drai`) blocked in `syscall()` on the parent end of the socketpair, main thread waiting in `pthread_join`. On CI (x86_64 Ubuntu 24.04) the same race resolved soon enough that the benchmark merely recorded the no-change push at 1.5x upstream rather than hanging outright.

Fix. Extend `StderrAuxChannel` with a `shutdown_read()` hook and call it from both `SshConnection::wait*` and `SshChildHandle::wait*` once the child has been reaped. `SocketpairStderrChannel` holds a `try_clone()` of the parent endpoint and shuts it down via `UnixStream::shutdown(Both)` — the pending `read(2)` returns zero immediately. `PipeStderrChannel` has no safe out-of-band wake-up, so `join()` is bounded by a 50 ms timeout and the drain thread is abandoned if it doesn't exit in time; it terminates naturally on process teardown. All changes stay within `#![deny(unsafe_code)]`.

2. russh path now reads `~/.ssh/config`

The embedded transport never consulted the user's SSH config, so `Host` aliases, `Hostname` rewrites, `User`, `Port`, `IdentityFile`, `IdentitiesOnly`, and `IdentityAgent` directives were silently ignored. That is why the 3-way SSH benchmark's russh runs returned exit 5 on the GitHub runner.

`SshConfig::apply_ssh_config(host_alias)` (plus an `apply_ssh_config_from(path, alias)` variant for tests) parses `~/.ssh/config` and applies the directives above. URL-supplied fields win over file directives — except `Hostname`, which always overrides the alias to match OpenSSH semantics. `from_url()` calls this automatically. Missing/unreadable/malformed configs are treated as empty so a broken file never blocks a transfer.

Verification (aarch64 container)

Reproducer: `oc-rsync -av --timeout=10 -e ssh /tmp/src/ localhost:/tmp/dst/` against the rsync-profile container's local sshd.

Case Before (v0.6.2) After Upstream rsync 3.4.1
1-file push 10.0s + exit 124 (hang) 0.214s exit 0 0.21s
10K-file initial push (177 MB) hung indefinitely 1.12s 1.02s
10K-file no-change push hung indefinitely 0.36s 0.19s

Test plan

  • Subprocess path: deadlock no longer reproducible in the container reproducer; transfers complete in upstream-comparable wall-time
  • ssh_config parser unit tests cover missing file, simple Host block, wildcards, negation, comment stripping, equals separator, first-match-wins, IdentitiesOnly, invalid port
  • CI nextest stable / beta / nightly
  • Linux musl matrix
  • Windows ACL/xattr
  • Interop validation (subprocess and embedded paths)

…h path

Two SSH transport fixes:

1. Subprocess path deadlock (primary). After the SSH child exits, the
   parent's stderr drain thread could remain parked in `read(2)` on the
   socketpair/pipe endpoint because the kernel had not yet propagated
   EOF -- a re-execed ssh helper occasionally inherits the write end and
   keeps it open for a brief window after the main `ssh` process is
   reaped. `wait_with_stderr()` then blocked forever in
   `JoinHandle::join`, which prevented the rsync process from returning
   to the user even though every protocol byte had already been written
   and the SSH child had exited cleanly.

   Confirmed via gdb on an aarch64 Debian 13 / kernel 6.18 reproducer:
   both threads parked in `futex_do_wait`, drain thread named
   `ssh-stderr-drai` blocked in `syscall()` on the parent end of the
   socketpair while the main thread waited in `pthread_join`. On CI
   (x86_64 Ubuntu 24.04) the same race resolved soon enough that the
   benchmark merely recorded the no-change push at 1.5x upstream rather
   than hanging outright.

   Fix: extend `StderrAuxChannel` with a `shutdown_read()` hook, and
   call it from both `SshConnection::wait*` and
   `SshChildHandle::wait*` once the child has been reaped.
   `SocketpairStderrChannel` holds a `try_clone()` of the parent
   endpoint and shuts it down via `UnixStream::shutdown(Both)`, which
   forces the pending `read(2)` to return zero immediately.
   `PipeStderrChannel` has no safe out-of-band wake-up, so `join()` is
   bounded by `DRAIN_JOIN_TIMEOUT` (50 ms) and the drain thread is
   abandoned if it does not exit in time; it terminates naturally on
   process teardown. All changes stay within the `#![deny(unsafe_code)]`
   policy for `rsync_io`.

   Before (aarch64 reproducer): 10.0 s wall, exit 124 (timeout-killed).
   After: 0.214 s, exit 0 (matches upstream `rsync 3.4.1` at 0.21 s).
   10K-file dataset (177 MB, push to localhost) drops from infinite
   hang to 1.12 s initial / 0.36 s no-change vs upstream 1.02 / 0.19 s.

2. russh path missing `~/.ssh/config`. The embedded transport never
   consulted the user's SSH config, so `Host` aliases, `Hostname`
   rewrites, `User`, `Port`, `IdentityFile`, `IdentitiesOnly`, and
   `IdentityAgent` directives were silently ignored. This is why the
   3-way SSH benchmark's russh runs returned exit 5 on the
   GitHub-hosted runner: russh attempted to connect to `localhost` with
   no matching identity even though the runner had keys configured for
   the alias the test typed.

   Adds `SshConfig::apply_ssh_config(host_alias)` (and a
   `apply_ssh_config_from(path, alias)` variant for tests) plus a
   minimal `~/.ssh/config` parser that handles `Host` patterns with
   `*`/`?`/`!` wildcards and applies the directives above. URL-supplied
   fields win over file directives, with the exception of `Hostname`
   which always overrides the URL alias to match OpenSSH semantics.
   `from_url()` now calls `apply_ssh_config(raw_host)` so the russh
   transport picks up the config automatically.

   Missing, unreadable, or malformed config files are treated as empty
   so a broken config never blocks a transfer.
@github-actions github-actions Bot added the bug Something isn't working label May 16, 2026
@oferchen oferchen merged commit c99bbbc into master May 16, 2026
39 checks passed
@oferchen oferchen deleted the fix/ssh-deadlock-and-config branch May 16, 2026 14:47
oferchen added a commit that referenced this pull request May 16, 2026
)

Verify that PR #4154 (SSH socketpair stderr drain shutdown + russh
~/.ssh/config support) has restored SSH push and daemon push out of the
pre-fix harness timeout, and that no other transfer mode regressed.

Data source is benchmark workflow run 25964839057 on tag v0.6.2 at SHA
c99bbbc (the PR #4154 commit itself). SSH push 1.29x slower than
upstream on initial sync, daemon push 1.33x slower. Both modes recovered
from the prior 120 s / 30 s harness timeouts seen in the v0.6.1 release
benchmark appendix; the remaining gap to the "SSH on par" and
"daemon 2x faster" targets is pre-existing and unrelated to this fix.
oferchen added a commit that referenced this pull request May 18, 2026
…h path (#4154)

* fix(ssh): resolve goodbye-phase deadlock + load ~/.ssh/config in russh path

Two SSH transport fixes:

1. Subprocess path deadlock (primary). After the SSH child exits, the
   parent's stderr drain thread could remain parked in `read(2)` on the
   socketpair/pipe endpoint because the kernel had not yet propagated
   EOF -- a re-execed ssh helper occasionally inherits the write end and
   keeps it open for a brief window after the main `ssh` process is
   reaped. `wait_with_stderr()` then blocked forever in
   `JoinHandle::join`, which prevented the rsync process from returning
   to the user even though every protocol byte had already been written
   and the SSH child had exited cleanly.

   Confirmed via gdb on an aarch64 Debian 13 / kernel 6.18 reproducer:
   both threads parked in `futex_do_wait`, drain thread named
   `ssh-stderr-drai` blocked in `syscall()` on the parent end of the
   socketpair while the main thread waited in `pthread_join`. On CI
   (x86_64 Ubuntu 24.04) the same race resolved soon enough that the
   benchmark merely recorded the no-change push at 1.5x upstream rather
   than hanging outright.

   Fix: extend `StderrAuxChannel` with a `shutdown_read()` hook, and
   call it from both `SshConnection::wait*` and
   `SshChildHandle::wait*` once the child has been reaped.
   `SocketpairStderrChannel` holds a `try_clone()` of the parent
   endpoint and shuts it down via `UnixStream::shutdown(Both)`, which
   forces the pending `read(2)` to return zero immediately.
   `PipeStderrChannel` has no safe out-of-band wake-up, so `join()` is
   bounded by `DRAIN_JOIN_TIMEOUT` (50 ms) and the drain thread is
   abandoned if it does not exit in time; it terminates naturally on
   process teardown. All changes stay within the `#![deny(unsafe_code)]`
   policy for `rsync_io`.

   Before (aarch64 reproducer): 10.0 s wall, exit 124 (timeout-killed).
   After: 0.214 s, exit 0 (matches upstream `rsync 3.4.1` at 0.21 s).
   10K-file dataset (177 MB, push to localhost) drops from infinite
   hang to 1.12 s initial / 0.36 s no-change vs upstream 1.02 / 0.19 s.

2. russh path missing `~/.ssh/config`. The embedded transport never
   consulted the user's SSH config, so `Host` aliases, `Hostname`
   rewrites, `User`, `Port`, `IdentityFile`, `IdentitiesOnly`, and
   `IdentityAgent` directives were silently ignored. This is why the
   3-way SSH benchmark's russh runs returned exit 5 on the
   GitHub-hosted runner: russh attempted to connect to `localhost` with
   no matching identity even though the runner had keys configured for
   the alias the test typed.

   Adds `SshConfig::apply_ssh_config(host_alias)` (and a
   `apply_ssh_config_from(path, alias)` variant for tests) plus a
   minimal `~/.ssh/config` parser that handles `Host` patterns with
   `*`/`?`/`!` wildcards and applies the directives above. URL-supplied
   fields win over file directives, with the exception of `Hostname`
   which always overrides the URL alias to match OpenSSH semantics.
   `from_url()` now calls `apply_ssh_config(raw_host)` so the russh
   transport picks up the config automatically.

   Missing, unreadable, or malformed config files are treated as empty
   so a broken config never blocks a transfer.

* style: remove duplicate blank line after default_ssh_config_path
oferchen added a commit that referenced this pull request May 18, 2026
)

Verify that PR #4154 (SSH socketpair stderr drain shutdown + russh
~/.ssh/config support) has restored SSH push and daemon push out of the
pre-fix harness timeout, and that no other transfer mode regressed.

Data source is benchmark workflow run 25964839057 on tag v0.6.2 at SHA
f58e6df (the PR #4154 commit itself). SSH push 1.29x slower than
upstream on initial sync, daemon push 1.33x slower. Both modes recovered
from the prior 120 s / 30 s harness timeouts seen in the v0.6.1 release
benchmark appendix; the remaining gap to the "SSH on par" and
"daemon 2x faster" targets is pre-existing and unrelated to this fix.
oferchen added a commit that referenced this pull request May 18, 2026
…h path (#4154)

* fix(ssh): resolve goodbye-phase deadlock + load ~/.ssh/config in russh path

Two SSH transport fixes:

1. Subprocess path deadlock (primary). After the SSH child exits, the
   parent's stderr drain thread could remain parked in `read(2)` on the
   socketpair/pipe endpoint because the kernel had not yet propagated
   EOF -- a re-execed ssh helper occasionally inherits the write end and
   keeps it open for a brief window after the main `ssh` process is
   reaped. `wait_with_stderr()` then blocked forever in
   `JoinHandle::join`, which prevented the rsync process from returning
   to the user even though every protocol byte had already been written
   and the SSH child had exited cleanly.

   Confirmed via gdb on an aarch64 Debian 13 / kernel 6.18 reproducer:
   both threads parked in `futex_do_wait`, drain thread named
   `ssh-stderr-drai` blocked in `syscall()` on the parent end of the
   socketpair while the main thread waited in `pthread_join`. On CI
   (x86_64 Ubuntu 24.04) the same race resolved soon enough that the
   benchmark merely recorded the no-change push at 1.5x upstream rather
   than hanging outright.

   Fix: extend `StderrAuxChannel` with a `shutdown_read()` hook, and
   call it from both `SshConnection::wait*` and
   `SshChildHandle::wait*` once the child has been reaped.
   `SocketpairStderrChannel` holds a `try_clone()` of the parent
   endpoint and shuts it down via `UnixStream::shutdown(Both)`, which
   forces the pending `read(2)` to return zero immediately.
   `PipeStderrChannel` has no safe out-of-band wake-up, so `join()` is
   bounded by `DRAIN_JOIN_TIMEOUT` (50 ms) and the drain thread is
   abandoned if it does not exit in time; it terminates naturally on
   process teardown. All changes stay within the `#![deny(unsafe_code)]`
   policy for `rsync_io`.

   Before (aarch64 reproducer): 10.0 s wall, exit 124 (timeout-killed).
   After: 0.214 s, exit 0 (matches upstream `rsync 3.4.1` at 0.21 s).
   10K-file dataset (177 MB, push to localhost) drops from infinite
   hang to 1.12 s initial / 0.36 s no-change vs upstream 1.02 / 0.19 s.

2. russh path missing `~/.ssh/config`. The embedded transport never
   consulted the user's SSH config, so `Host` aliases, `Hostname`
   rewrites, `User`, `Port`, `IdentityFile`, `IdentitiesOnly`, and
   `IdentityAgent` directives were silently ignored. This is why the
   3-way SSH benchmark's russh runs returned exit 5 on the
   GitHub-hosted runner: russh attempted to connect to `localhost` with
   no matching identity even though the runner had keys configured for
   the alias the test typed.

   Adds `SshConfig::apply_ssh_config(host_alias)` (and a
   `apply_ssh_config_from(path, alias)` variant for tests) plus a
   minimal `~/.ssh/config` parser that handles `Host` patterns with
   `*`/`?`/`!` wildcards and applies the directives above. URL-supplied
   fields win over file directives, with the exception of `Hostname`
   which always overrides the URL alias to match OpenSSH semantics.
   `from_url()` now calls `apply_ssh_config(raw_host)` so the russh
   transport picks up the config automatically.

   Missing, unreadable, or malformed config files are treated as empty
   so a broken config never blocks a transfer.

* style: remove duplicate blank line after default_ssh_config_path
oferchen added a commit that referenced this pull request May 18, 2026
)

Verify that PR #4154 (SSH socketpair stderr drain shutdown + russh
~/.ssh/config support) has restored SSH push and daemon push out of the
pre-fix harness timeout, and that no other transfer mode regressed.

Data source is benchmark workflow run 25964839057 on tag v0.6.2 at SHA
d8e0e70 (the PR #4154 commit itself). SSH push 1.29x slower than
upstream on initial sync, daemon push 1.33x slower. Both modes recovered
from the prior 120 s / 30 s harness timeouts seen in the v0.6.1 release
benchmark appendix; the remaining gap to the "SSH on par" and
"daemon 2x faster" targets is pre-existing and unrelated to this fix.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant