## Summary
Under identical hardware and workload, `bssh-server` sustains roughly half the SFTP throughput of OpenSSH's `sftp-server`. The gap does not appear to come from the cryptographic primitives themselves (`bssh-russh` uses `aws-lc-rs` for AEAD ciphers by default), but from per-SSH-packet framework and protocol overhead around the crypto path.

This matters because a realistic goal for `bssh-server` in Backend.AI is to replace dedicated SFTP agents (which currently rely on OpenSSH). On slower enterprise-grade CPUs, the current overhead effectively halves single-connection SFTP throughput.
## Environment
- bssh / bssh-server: v2.1.1 (`linux-x86_64-musl` build)
- Client: OpenSSH `sftp` (Ubuntu 22.04)
- Transfer: 1 GiB random file via `sftp put` (SFTP subsystem)
- Agent hosts tested:
  - "slower" host: Intel Xeon Silver 4214 @ 2.20 GHz, AES-NI, 1 Gbps internal
  - "faster" host: AMD EPYC 7742 @ 2.25 GHz (boost 3.4 GHz), AES-NI + VAES, 1 Gbps internal
- Client and servers are in the same datacenter; no network bottleneck observed for OpenSSH.
## Measurements (1 GiB upload, `sftp put`, average of 3 runs)
| Server | CPU | Container cores | Cipher | Throughput |
|---|---|---|---|---|
| OpenSSH (`sftp-server`) | Xeon Silver 4214 | 1 | chacha20-poly1305 | 101 MiB/s (NIC-limited) |
| bssh-server | Xeon Silver 4214 | 1 | chacha20-poly1305 | 57 MiB/s |
| bssh-server | Xeon Silver 4214 | 1 | aes256-gcm@openssh.com | 57 MiB/s |
| bssh-server | Xeon Silver 4214 | 1 | aes256-ctr | 41 MiB/s |
| bssh-server | Xeon Silver 4214 | 1 | aes128-ctr | 48 MiB/s |
| bssh-server | Xeon Silver 4214 | 2 | chacha20-poly1305 | 60 MiB/s (≈ +5%) |
| bssh-server | EPYC 7742 | 1 | chacha20-poly1305 | 94 MiB/s |
| bssh-server | EPYC 7742 | 2 | chacha20-poly1305 | 100 MiB/s |
Observations:
- AEAD ciphers that route through aws-lc-rs (AES-GCM, ChaCha20-Poly1305) cap at the same ~57 MiB/s on Xeon Silver, even though the underlying primitives can do several GB/s. This strongly suggests the bottleneck is outside the AEAD primitive.
- AES-CTR modes are additionally slowed by going through pure-Rust `aes`/`ctr` (see `crates/bssh-russh/src/cipher/block.rs`) plus a separate HMAC pass.
- Adding a second core yields only ~5%; a single SSH connection is inherently mostly sequential, so the gain is expected to be small. The CPU architecture (Zen 2 + VAES vs. Cascade Lake) explains most of the host-to-host spread.
- CPU profile during transfer: one tokio worker thread pegs ~90% CPU while the other tokio workers stay near 0%. The process is CPU-bound on this single hot thread.
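The single-hot-thread observation above can be reproduced without a profiler, using plain `ps` while a transfer is in flight (here listing the hottest threads system-wide; filter on the `bssh-server` command name as needed):

```shell
# Show per-thread CPU usage, hottest threads first.
# During an SFTP upload, one tokio worker of bssh-server should sit near
# the top of this list while its sibling workers stay near 0%.
ps -eLo pcpu,tid,comm --sort=-pcpu | head -n 10
```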
## Why the gap is interesting
`bssh-russh` already uses `aws-lc-rs` for AEAD ciphers (`crates/bssh-russh/src/cipher/gcm.rs`, `crates/bssh-russh/src/cipher/chacha20poly1305.rs`), which is the same family of assembly-optimised code that OpenSSL/OpenSSH use. The crypto primitive therefore cannot be twice as slow as OpenSSH's, yet the end-to-end SFTP throughput is; the extra cycles are almost certainly spent in the code around the primitive.
## Hypotheses for the per-packet overhead
Unverified; posted for discussion / profiling.
- Per-packet AEAD invocation cost. SSH packets default to ~32 KiB, so a 1 GiB transfer issues ~32k encrypt/decrypt calls. Each call goes through `BoundKey` + `NonceSequence` construction and crosses the aws-lc-rs FFI boundary. OpenSSH amortises this with long-lived cipher contexts and direct function calls.
- Extra buffer copies. Network buffer → decrypt buffer → russh channel → russh-sftp layer → file write: each stage potentially memcpys the full 32 KiB. OpenSSH's `sftp-server` uses `sendfile`/`splice` where possible and keeps data in fewer buffers.
- Tokio channel/task overhead per SSH packet. Small-packet async pipelines are known to have non-trivial cost per message (channel send + await + wake), which becomes dominant when the crypto itself is fast.
- russh-sftp protocol layer. Worth profiling to see whether SFTP request/response handling is request-blocked rather than pipelined.
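To gauge how much per-packet overhead the first hypothesis would have to explain, a back-of-envelope calculation from the Xeon Silver numbers in the table (the ~32 KiB packet payload is the assumption stated above, not a measured value):

```rust
// Back-of-envelope budget for the per-packet-overhead hypothesis.
// Throughput figures come from the measurements table; the per-packet
// numbers below are derived, not measured.
fn main() {
    let total_mib = 1024.0_f64; // 1 GiB transfer
    let packet_kib = 32.0; // assumed SSH packet payload (~32 KiB)
    let packets = total_mib * 1024.0 / packet_kib; // ~32k encrypt calls

    let openssh_secs = total_mib / 101.0; // OpenSSH: 101 MiB/s
    let bssh_secs = total_mib / 57.0; // bssh-server: 57 MiB/s
    let extra_us_per_packet = (bssh_secs - openssh_secs) * 1e6 / packets;

    println!("packets: {packets:.0}"); // 32768
    println!("extra time per packet: {extra_us_per_packet:.0} us");
}
```

If the gap is really per-packet, each SSH packet carries a few hundred microseconds of extra cost, which is far more than a 32 KiB AEAD call itself should take, consistent with the framing/copy/async hypotheses above.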
## Suggested investigations / directions
- Profile with `cargo flamegraph` on a 1 GiB SFTP upload. This should immediately show whether time is spent in aws-lc-rs, in russh packet framing, in russh-sftp, or in tokio.
- Increase SSH/SFTP packet size. Larger packets amortise per-call overhead; a quick experiment with the server's `max_packet_size` / SFTP buffer would confirm whether per-call overhead is the dominant factor.
- Zero-copy file I/O on the server side. Teach russh-sftp to use `sendfile`/`splice` when the destination is a regular file.
- Route AES-CTR through aws-lc-rs (or drop CTR ciphers from the default list). Currently `crates/bssh-russh/src/cipher/block.rs` uses pure-Rust `aes`/`ctr`, which is a secondary but real contributor for users who negotiate CTR.
- Build-time optimisations. Confirm LTO / `codegen-units = 1` / `target-cpu` settings used for release binaries; PGO can be meaningful for crypto+I/O hot paths.
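For the build-time item, this is the kind of release profile worth confirming (a sketch only; the values bssh actually ships with need to be checked in its own `Cargo.toml` / release CI):

```toml
# Cargo.toml — settings to confirm for the release binaries
[profile.release]
lto = "fat"        # cross-crate inlining into the per-packet hot path
codegen-units = 1  # better optimisation at the cost of build time
```

`target-cpu` is not a profile setting; it would go through `RUSTFLAGS` (e.g. `-C target-cpu=x86-64-v3`), with the caveat that a portable musl build may deliberately stay at the x86-64 baseline.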
## Reproduction
```bash
# Server (inside a container on the slower host)
/tmp/bssh-server gen-host-key --output /tmp/bssh_host_key -t ed25519
/tmp/bssh-server run -b 0.0.0.0 -p 2200 -k /tmp/bssh_host_key -D

# Client
dd if=/dev/urandom of=/tmp/testfile_1G bs=1M count=1024
echo 'put /tmp/testfile_1G testfile_1G' | \
  sftp -i id_container -P 2200 \
    -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
    work@<server>
```
Cross-check with a stock OpenSSH `sftp-server` on the same host/container/user for a baseline.
## Related