fix(ssh): resolve goodbye-phase deadlock + load ~/.ssh/config in russh path#4154
Merged
Conversation
…h path Two SSH transport fixes: 1. Subprocess path deadlock (primary). After the SSH child exits, the parent's stderr drain thread could remain parked in `read(2)` on the socketpair/pipe endpoint because the kernel had not yet propagated EOF -- a re-execed ssh helper occasionally inherits the write end and keeps it open for a brief window after the main `ssh` process is reaped. `wait_with_stderr()` then blocked forever in `JoinHandle::join`, which prevented the rsync process from returning to the user even though every protocol byte had already been written and the SSH child had exited cleanly. Confirmed via gdb on an aarch64 Debian 13 / kernel 6.18 reproducer: both threads parked in `futex_do_wait`, drain thread named `ssh-stderr-drai` blocked in `syscall()` on the parent end of the socketpair while the main thread waited in `pthread_join`. On CI (x86_64 Ubuntu 24.04) the same race resolved soon enough that the benchmark merely recorded the no-change push at 1.5x upstream rather than hanging outright. Fix: extend `StderrAuxChannel` with a `shutdown_read()` hook, and call it from both `SshConnection::wait*` and `SshChildHandle::wait*` once the child has been reaped. `SocketpairStderrChannel` holds a `try_clone()` of the parent endpoint and shuts it down via `UnixStream::shutdown(Both)`, which forces the pending `read(2)` to return zero immediately. `PipeStderrChannel` has no safe out-of-band wake-up, so `join()` is bounded by `DRAIN_JOIN_TIMEOUT` (50 ms) and the drain thread is abandoned if it does not exit in time; it terminates naturally on process teardown. All changes stay within the `#![deny(unsafe_code)]` policy for `rsync_io`. Before (aarch64 reproducer): 10.0 s wall, exit 124 (timeout-killed). After: 0.214 s, exit 0 (matches upstream `rsync 3.4.1` at 0.21 s). 10K-file dataset (177 MB, push to localhost) drops from infinite hang to 1.12 s initial / 0.36 s no-change vs upstream 1.02 / 0.19 s. 2. russh path missing `~/.ssh/config`. The embedded transport never consulted the user's SSH config, so `Host` aliases, `Hostname` rewrites, `User`, `Port`, `IdentityFile`, `IdentitiesOnly`, and `IdentityAgent` directives were silently ignored. This is why the 3-way SSH benchmark's russh runs returned exit 5 on the GitHub-hosted runner: russh attempted to connect to `localhost` with no matching identity even though the runner had keys configured for the alias the test typed. Adds `SshConfig::apply_ssh_config(host_alias)` (and a `apply_ssh_config_from(path, alias)` variant for tests) plus a minimal `~/.ssh/config` parser that handles `Host` patterns with `*`/`?`/`!` wildcards and applies the directives above. URL-supplied fields win over file directives, with the exception of `Hostname` which always overrides the URL alias to match OpenSSH semantics. `from_url()` now calls `apply_ssh_config(raw_host)` so the russh transport picks up the config automatically. Missing, unreadable, or malformed config files are treated as empty so a broken config never blocks a transfer.
4 tasks
oferchen
added a commit
that referenced
this pull request
May 16, 2026
) Verify that PR #4154 (SSH socketpair stderr drain shutdown + russh ~/.ssh/config support) has restored SSH push and daemon push out of the pre-fix harness timeout, and that no other transfer mode regressed. Data source is benchmark workflow run 25964839057 on tag v0.6.2 at SHA c99bbbc (the PR #4154 commit itself). SSH push 1.29x slower than upstream on initial sync, daemon push 1.33x slower. Both modes recovered from the prior 120 s / 30 s harness timeouts seen in the v0.6.1 release benchmark appendix; the remaining gap to the "SSH on par" and "daemon 2x faster" targets is pre-existing and unrelated to this fix.
4 tasks
oferchen
added a commit
that referenced
this pull request
May 18, 2026
…h path (#4154) * fix(ssh): resolve goodbye-phase deadlock + load ~/.ssh/config in russh path Two SSH transport fixes: 1. Subprocess path deadlock (primary). After the SSH child exits, the parent's stderr drain thread could remain parked in `read(2)` on the socketpair/pipe endpoint because the kernel had not yet propagated EOF -- a re-execed ssh helper occasionally inherits the write end and keeps it open for a brief window after the main `ssh` process is reaped. `wait_with_stderr()` then blocked forever in `JoinHandle::join`, which prevented the rsync process from returning to the user even though every protocol byte had already been written and the SSH child had exited cleanly. Confirmed via gdb on an aarch64 Debian 13 / kernel 6.18 reproducer: both threads parked in `futex_do_wait`, drain thread named `ssh-stderr-drai` blocked in `syscall()` on the parent end of the socketpair while the main thread waited in `pthread_join`. On CI (x86_64 Ubuntu 24.04) the same race resolved soon enough that the benchmark merely recorded the no-change push at 1.5x upstream rather than hanging outright. Fix: extend `StderrAuxChannel` with a `shutdown_read()` hook, and call it from both `SshConnection::wait*` and `SshChildHandle::wait*` once the child has been reaped. `SocketpairStderrChannel` holds a `try_clone()` of the parent endpoint and shuts it down via `UnixStream::shutdown(Both)`, which forces the pending `read(2)` to return zero immediately. `PipeStderrChannel` has no safe out-of-band wake-up, so `join()` is bounded by `DRAIN_JOIN_TIMEOUT` (50 ms) and the drain thread is abandoned if it does not exit in time; it terminates naturally on process teardown. All changes stay within the `#![deny(unsafe_code)]` policy for `rsync_io`. Before (aarch64 reproducer): 10.0 s wall, exit 124 (timeout-killed). After: 0.214 s, exit 0 (matches upstream `rsync 3.4.1` at 0.21 s). 10K-file dataset (177 MB, push to localhost) drops from infinite hang to 1.12 s initial / 0.36 s no-change vs upstream 1.02 / 0.19 s. 2. russh path missing `~/.ssh/config`. The embedded transport never consulted the user's SSH config, so `Host` aliases, `Hostname` rewrites, `User`, `Port`, `IdentityFile`, `IdentitiesOnly`, and `IdentityAgent` directives were silently ignored. This is why the 3-way SSH benchmark's russh runs returned exit 5 on the GitHub-hosted runner: russh attempted to connect to `localhost` with no matching identity even though the runner had keys configured for the alias the test typed. Adds `SshConfig::apply_ssh_config(host_alias)` (and a `apply_ssh_config_from(path, alias)` variant for tests) plus a minimal `~/.ssh/config` parser that handles `Host` patterns with `*`/`?`/`!` wildcards and applies the directives above. URL-supplied fields win over file directives, with the exception of `Hostname` which always overrides the URL alias to match OpenSSH semantics. `from_url()` now calls `apply_ssh_config(raw_host)` so the russh transport picks up the config automatically. Missing, unreadable, or malformed config files are treated as empty so a broken config never blocks a transfer. * style: remove duplicate blank line after default_ssh_config_path
oferchen
added a commit
that referenced
this pull request
May 18, 2026
) Verify that PR #4154 (SSH socketpair stderr drain shutdown + russh ~/.ssh/config support) has restored SSH push and daemon push out of the pre-fix harness timeout, and that no other transfer mode regressed. Data source is benchmark workflow run 25964839057 on tag v0.6.2 at SHA f58e6df (the PR #4154 commit itself). SSH push 1.29x slower than upstream on initial sync, daemon push 1.33x slower. Both modes recovered from the prior 120 s / 30 s harness timeouts seen in the v0.6.1 release benchmark appendix; the remaining gap to the "SSH on par" and "daemon 2x faster" targets is pre-existing and unrelated to this fix.
oferchen
added a commit
that referenced
this pull request
May 18, 2026
…h path (#4154) * fix(ssh): resolve goodbye-phase deadlock + load ~/.ssh/config in russh path Two SSH transport fixes: 1. Subprocess path deadlock (primary). After the SSH child exits, the parent's stderr drain thread could remain parked in `read(2)` on the socketpair/pipe endpoint because the kernel had not yet propagated EOF -- a re-execed ssh helper occasionally inherits the write end and keeps it open for a brief window after the main `ssh` process is reaped. `wait_with_stderr()` then blocked forever in `JoinHandle::join`, which prevented the rsync process from returning to the user even though every protocol byte had already been written and the SSH child had exited cleanly. Confirmed via gdb on an aarch64 Debian 13 / kernel 6.18 reproducer: both threads parked in `futex_do_wait`, drain thread named `ssh-stderr-drai` blocked in `syscall()` on the parent end of the socketpair while the main thread waited in `pthread_join`. On CI (x86_64 Ubuntu 24.04) the same race resolved soon enough that the benchmark merely recorded the no-change push at 1.5x upstream rather than hanging outright. Fix: extend `StderrAuxChannel` with a `shutdown_read()` hook, and call it from both `SshConnection::wait*` and `SshChildHandle::wait*` once the child has been reaped. `SocketpairStderrChannel` holds a `try_clone()` of the parent endpoint and shuts it down via `UnixStream::shutdown(Both)`, which forces the pending `read(2)` to return zero immediately. `PipeStderrChannel` has no safe out-of-band wake-up, so `join()` is bounded by `DRAIN_JOIN_TIMEOUT` (50 ms) and the drain thread is abandoned if it does not exit in time; it terminates naturally on process teardown. All changes stay within the `#![deny(unsafe_code)]` policy for `rsync_io`. Before (aarch64 reproducer): 10.0 s wall, exit 124 (timeout-killed). After: 0.214 s, exit 0 (matches upstream `rsync 3.4.1` at 0.21 s). 10K-file dataset (177 MB, push to localhost) drops from infinite hang to 1.12 s initial / 0.36 s no-change vs upstream 1.02 / 0.19 s. 2. russh path missing `~/.ssh/config`. The embedded transport never consulted the user's SSH config, so `Host` aliases, `Hostname` rewrites, `User`, `Port`, `IdentityFile`, `IdentitiesOnly`, and `IdentityAgent` directives were silently ignored. This is why the 3-way SSH benchmark's russh runs returned exit 5 on the GitHub-hosted runner: russh attempted to connect to `localhost` with no matching identity even though the runner had keys configured for the alias the test typed. Adds `SshConfig::apply_ssh_config(host_alias)` (and a `apply_ssh_config_from(path, alias)` variant for tests) plus a minimal `~/.ssh/config` parser that handles `Host` patterns with `*`/`?`/`!` wildcards and applies the directives above. URL-supplied fields win over file directives, with the exception of `Hostname` which always overrides the URL alias to match OpenSSH semantics. `from_url()` now calls `apply_ssh_config(raw_host)` so the russh transport picks up the config automatically. Missing, unreadable, or malformed config files are treated as empty so a broken config never blocks a transfer. * style: remove duplicate blank line after default_ssh_config_path
oferchen
added a commit
that referenced
this pull request
May 18, 2026
) Verify that PR #4154 (SSH socketpair stderr drain shutdown + russh ~/.ssh/config support) has restored SSH push and daemon push out of the pre-fix harness timeout, and that no other transfer mode regressed. Data source is benchmark workflow run 25964839057 on tag v0.6.2 at SHA d8e0e70 (the PR #4154 commit itself). SSH push 1.29x slower than upstream on initial sync, daemon push 1.33x slower. Both modes recovered from the prior 120 s / 30 s harness timeouts seen in the v0.6.1 release benchmark appendix; the remaining gap to the "SSH on par" and "daemon 2x faster" targets is pre-existing and unrelated to this fix.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two SSH transport fixes, both required to unblock v0.6.2 SSH push on aarch64/Debian and to make the russh transport actually usable in real environments.
1. Subprocess SSH goodbye-phase deadlock (primary)
After the SSH child exits, the parent's stderr drain thread could remain parked in `read(2)` on the socketpair/pipe endpoint because the kernel had not yet propagated EOF — a re-execed ssh helper occasionally inherits the write end and keeps it open for a brief window after the main `ssh` is reaped. `wait_with_stderr()` then blocked forever in `JoinHandle::join`, even though every protocol byte had already been written and the SSH child had exited cleanly.
Confirmed via gdb on an aarch64 Debian 13 / kernel 6.18 reproducer: both threads parked in `futex_do_wait`, drain thread (`ssh-stderr-drai`) blocked in `syscall()` on the parent end of the socketpair, main thread waiting in `pthread_join`. On CI (x86_64 Ubuntu 24.04) the same race resolved soon enough that the benchmark merely recorded the no-change push at 1.5x upstream rather than hanging outright.
Fix. Extend `StderrAuxChannel` with a `shutdown_read()` hook and call it from both `SshConnection::wait*` and `SshChildHandle::wait*` once the child has been reaped. `SocketpairStderrChannel` holds a `try_clone()` of the parent endpoint and shuts it down via `UnixStream::shutdown(Both)` — the pending `read(2)` returns zero immediately. `PipeStderrChannel` has no safe out-of-band wake-up, so `join()` is bounded by a 50 ms timeout and the drain thread is abandoned if it doesn't exit in time; it terminates naturally on process teardown. All changes stay within `#![deny(unsafe_code)]`.
2. russh path now reads `~/.ssh/config`
The embedded transport never consulted the user's SSH config, so `Host` aliases, `Hostname` rewrites, `User`, `Port`, `IdentityFile`, `IdentitiesOnly`, and `IdentityAgent` directives were silently ignored. That is why the 3-way SSH benchmark's russh runs returned exit 5 on the GitHub runner.
`SshConfig::apply_ssh_config(host_alias)` (plus an `apply_ssh_config_from(path, alias)` variant for tests) parses `~/.ssh/config` and applies the directives above. URL-supplied fields win over file directives — except `Hostname`, which always overrides the alias to match OpenSSH semantics. `from_url()` calls this automatically. Missing/unreadable/malformed configs are treated as empty so a broken file never blocks a transfer.
Verification (aarch64 container)
Reproducer: `oc-rsync -av --timeout=10 -e ssh /tmp/src/ localhost:/tmp/dst/` against the rsync-profile container's local sshd.
Test plan