0.13.0

@github-actions github-actions released this 11 Jun 12:48
· 55 commits to develop since this release
1c6e67f

This release focuses mainly on bug fixes and performance improvements.

Important Changes

  • Breaking: rename https_redirection_port to public_https_port. The option now represents the client-visible HTTPS/H3 port used by both HTTP->HTTPS redirects and HTTP/3 Alt-Svc advertisement. Existing configs using https_redirection_port must be updated. If clients already reach the same port as listen_port_tls, this option is still unnecessary.
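A minimal sketch of how a single client-visible port can feed both the redirect target and the Alt-Svc value, as the renamed option is described above. The function names and the `max_age_secs` parameter are hypothetical and are not taken from rpxy's code; only the fallback to listen_port_tls comes from the note.

```rust
/// Hypothetical helper: the port clients use to reach HTTPS/H3.
/// Falls back to the TLS listen port when no public port is configured.
fn client_visible_port(public_https_port: Option<u16>, listen_port_tls: u16) -> u16 {
    public_https_port.unwrap_or(listen_port_tls)
}

/// Build the Location value for an HTTP->HTTPS redirect.
/// The port is omitted when it is the HTTPS default (443).
fn redirect_location(host: &str, port: u16, path_and_query: &str) -> String {
    if port == 443 {
        format!("https://{host}{path_and_query}")
    } else {
        format!("https://{host}:{port}{path_and_query}")
    }
}

/// Build an Alt-Svc value advertising HTTP/3 on the same port.
/// `max_age_secs` is an illustrative parameter, not an rpxy option.
fn alt_svc_h3(port: u16, max_age_secs: u32) -> String {
    format!("h3=\":{port}\"; ma={max_age_secs}")
}
```

With a public port of 8443, both values point clients at 8443; with no public port set and listen_port_tls = 443, the redirect carries no explicit port.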

Bugfix

  • Fix: the file-cache bookkeeping no longer stalls cache publication - or leaks files - under sustained store churn (cache feature). The count of committed file-cache objects was kept behind a read-write lock, and file I/O was performed while holding it: evicting a displaced cache file held the lock exclusively across the file unlink, every newly stored object had to take the same exclusive lock to be counted before its metadata was published, and serving a file-backed hit held a shared guard across the cache-file open. Under sustained store-and-evict pressure (many concurrent cacheable misses for distinct URIs), one slow unlink made every in-flight store queue behind it: publication stopped within seconds, lock waits grew into tens of seconds, and committed-but-never-published cache files accumulated on disk without bound (a synthetic stress test left >135k orphaned files after one minute against an LRU capacity of 10). The count is now a lock-free atomic and unlinks/opens run without holding any lock, so publication can never queue behind another task's file I/O and eviction degrades gracefully at filesystem speed instead of collapsing. Counting semantics (best-effort, saturating, count-before-publish ordering), eviction tolerance for already-missing files, and the integrity-check behavior are unchanged.
  • Fix: HTTP/1.1 responses to slow clients no longer buffer the entire response body in memory. rpxy enabled hyper's experimental pipeline_flush option on HTTP/1.1 server connections (since the initial implementation), which - besides aggregating flushes for pipelined requests - bypasses hyper's per-connection write-buffer cap (~400 KB) and forces the flattened (copying) write strategy. A client reading more slowly than the upstream or the cache produced the body therefore caused the whole response body, however large, to be copied into that connection's write buffer: a handful of deliberately slow readers of a large response could grow resident memory by hundreds of megabytes, on any response path (proxied or cached, cleartext or TLS). The option is now left at hyper's default (disabled): the write buffer is capped again, backpressure propagates from the client socket to the upstream read or cache file read, and the write strategy returns to hyper's default (zero-copy queueing with vectored writes where the transport supports them). Pipelined HTTP/1.1 clients still receive correct responses and only lose the flush batching (HTTP/1.1 pipelining is effectively unused by real clients); no throughput change was measured for normal keep-alive traffic.
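The lock-free counting described in the file-cache fix above can be sketched with a standard-library atomic. This is an illustration under assumptions, not rpxy's implementation: the `FileCount` type and `evict` function are hypothetical names, and whether the real code decrements the count for an already-missing file is assumed here.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Best-effort count of committed file-cache objects (hypothetical type).
pub struct FileCount(AtomicUsize);

impl FileCount {
    pub fn new() -> Self {
        Self(AtomicUsize::new(0))
    }

    /// Count an object before its metadata is published; saturates at usize::MAX.
    pub fn increment(&self) {
        let _ = self
            .0
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| Some(n.saturating_add(1)));
    }

    /// Uncount an evicted object; saturates at zero instead of wrapping.
    pub fn decrement(&self) {
        let _ = self
            .0
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| Some(n.saturating_sub(1)));
    }

    pub fn get(&self) -> usize {
        self.0.load(Ordering::Acquire)
    }
}

/// Evict a cache file without holding any lock across the unlink.
/// An already-missing file is tolerated (assumed to still be uncounted).
pub fn evict(path: &Path, count: &FileCount) -> io::Result<()> {
    match fs::remove_file(path) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }
    count.decrement();
    Ok(())
}
```

Because neither the counter update nor the unlink takes a lock, a slow `remove_file` in one task delays only that task.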

Improvement

  • Fix HTTP/3 Alt-Svc advertisement for HTTPS-only deployments. rpxy now advertises HTTP/3 on secure non-mTLS responses when HTTP/3 is enabled, independent of per-app HTTP redirect settings. Plain HTTP responses and mTLS apps do not advertise HTTP/3.
  • Reduce per-request allocations in the forwarding-header path. Building the outgoing X-Forwarded-* / Proxy headers no longer re-validates or re-allocates values that are already known: the constant headers (X-Forwarded-Proto, X-Forwarded-Ssl, Proxy) are written via HeaderValue::from_static, X-Forwarded-Port via HeaderValue::from(u16), and X-Real-IP reuses the IP string already computed for X-Forwarded-For and is handed to HeaderValue without an extra copy. The immediate-peer forwarding entry is also no longer built twice per request, and request host parsing no longer constructs an error value on the success path. The forwarding/trust-boundary logic and every emitted header value are byte-for-byte unchanged. This trims roughly ten heap allocations per request on the common path; it is a CPU/allocation cleanup, not a measured throughput change (no throughput difference was observed on a loopback benchmark).
  • Reduce per-request allocations in path routing and request-URI rebuilding. Longest-prefix route matching now compares the request path bytes directly instead of allocating a PathName per request, and rebuilding the outgoing request URI now reuses the original path-and-query via a shallow clone (instead of copying it into a Vec and re-validating it) when no replace_path is configured. Routing decisions and the rewritten URI are byte-for-byte unchanged. Like the forwarding-header change above, this is a CPU/allocation cleanup rather than a measured throughput change.
  • Avoid cloning the whole request header map on the sticky-cookie path (sticky-cookie feature). When a request reaches a sticky-session (StickyRoundRobin) upstream group, extracting the sticky cookie no longer clones the entire HeaderMap; it reads the Cookie header(s) directly and only re-materializes the cookie tokens. Which cookie is consumed versus forwarded upstream, the recovered backend id, and all reject/ignore paths are unchanged. CPU/allocation cleanup only.
  • Read file-cache hits in larger chunks (cache feature). Serving a cache hit stored on disk previously read the file into a zero-capacity BytesMut, which read_buf grows only ~64 bytes at a time — so an 8 KB object was read in ~128 tiny iterations, each allocating a buffer, a copied Bytes, and a body frame. The read now fills a 64 KiB buffer and hands each chunk downstream without an extra copy, collapsing a hit from hundreds of allocations to a handful. Large objects still stream a chunk at a time, and the integrity hash check plus eviction on mismatch are unchanged. CPU/allocation cleanup only.
  • Skip the per-hit re-hash of on-memory cache objects (cache feature). Serving a cache hit held in memory previously recomputed a full SHA-256 over the whole object on every hit and compared it to the stored hash. Unlike a file-backed object — which lives on disk as an external, mutable resource and is therefore still hash-verified on every read — an on-memory object is an immutable Bytes held inside the same cache entry as its hash and is never mutated after insertion, so re-hashing it on each hit only guarded against in-process RAM corruption (which the stored hash itself equally suffers) at the cost of a full hash per hit. On-memory hits now return the stored object directly. The file-cache integrity check is unchanged. CPU/allocation cleanup only.
  • Stream the file-cache store path to disk and bound its memory (cache feature). Storing a cacheable response previously buffered the entire body in memory, hashed it, and only then wrote a file-backed object to disk, so a file-cache object up to max_cache_each_size was held in full in RAM before spilling to disk. The store path now hashes incrementally while streaming: a body that crosses the on-memory threshold spills to a temp file and subsequent bytes are written straight to disk, capping the store-path buffer regardless of how large max_cache_each_size is configured. The temp file is created with create_new and atomically renamed to a generation-unique final path, and the cache metadata is published only after the file is fully written. This closes a window where a concurrent reader could see metadata pointing at a not-yet-written file, and concurrent stores of the same URI no longer clobber each other's file. Any cache-side failure (too-large body, upstream error, or file I/O error) still forwards the full response to the client and simply skips caching; the file/on-memory selection threshold and the file-cache integrity check are unchanged. This is a memory-bound and correctness cleanup, material mainly when max_cache_each_size is configured large, not a measured throughput change at default settings.
  • Raise the default on-memory cache threshold from 4 KiB to 64 KiB (cache feature). max_cache_each_size_on_memory now defaults to the same value as max_cache_each_size (65,535 bytes), so by default every cacheable object is served from memory; the file-backed tier engages only when max_cache_each_size is raised beyond it. Rationale: serving a hit from memory is several times faster than the file-backed path, which opens and reads the cache file on every hit (measured on loopback: ~150k req/s on-memory vs ~36k req/s file-backed for an 8 KiB object), and typical HTML/API responses fall in the 4-64 KiB range that the old default sent to disk. The trade-off is a larger worst-case cache memory footprint: max_cache_entry (default 1,000) x this threshold ≈ 64 MB at defaults, versus ~4 MB before — deployments that prefer the old behavior can set max_cache_each_size_on_memory = 4096 explicitly. Explicitly configured values are unaffected.
  • Drop the second copy when resolving the request host name. Resolving a request's server name parsed the Host header / request-URI host into an owned, port-stripped byte buffer and then lowercased it into a second freshly allocated buffer. The conversion now lowercases the already-owned buffer in place, removing one allocation and copy per request on the always-on path. The resulting server name bytes are identical for every input, so routing and the SNI consistency check are unchanged. CPU/allocation cleanup only.
  • Precompute the sticky-cookie AEAD AAD at config build time (sticky-cookie feature). The additional authenticated data binding a sticky cookie to its app (name/domain/path) was re-validated and re-assembled on every request that opens or seals a sticky cookie, even though its inputs are fixed when the backend is built. It is now validated and computed once per load-balancer configuration and reused per request; as a side effect, an invalid component (e.g. a NUL byte in a configured path) is rejected at startup/config reload with a proper error instead of failing every request at runtime. The AAD bytes are unchanged, so cookies sealed before this change still open after it, including across a hot config reload. CPU/allocation cleanup only; no behavior change for valid configurations.
  • Bound the cache streaming channels so a slow client no longer queues unbounded response data in memory (cache feature). Serving a file-cache hit and storing a cacheable miss previously relayed body frames to the client over an unbounded in-memory channel: the producer (the disk read, or the upstream response) ran at full speed regardless of how fast the client consumed, so a slow-reading client could queue an entire large cached object — or an entire upstream response — in memory per request. Both paths now relay over a small bounded channel and the producer waits when it is full, capping per-stream queued memory at a few frames (on the order of a few hundred KiB worst case for file-backed hits) and propagating flow control to the file read and to the upstream connection, as the non-cache forwarding path already does. Cache hit/miss decisions, the stored bytes, the integrity check, and every failure-handling path are unchanged; a cache-side failure still never cuts the response to the client. With a fast consumer the channel never fills and behavior is unchanged apart from two extra small allocations per request of channel bookkeeping — this is a memory-robustness improvement for slow-consumer scenarios, not a throughput change. Note: a related, pre-existing slow-client buffering point below this layer (the HTTP/1.1 connection write buffer growing without bound when request pipelining support is enabled, unrelated to the cache) was identified during verification and is fixed in this release (see the Bugfix above).
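For the routing item above, longest-prefix matching on raw path bytes might look like the sketch below. The segment-boundary rule (a prefix matches only at a `/` boundary or when it ends in `/`) is an assumption made for illustration; rpxy's exact matching rules are not stated in these notes, and `longest_prefix_match` is a hypothetical name.

```rust
/// Return the longest configured prefix that matches `path`,
/// comparing bytes directly and allocating nothing per request.
fn longest_prefix_match<'a>(routes: &[&'a str], path: &str) -> Option<&'a str> {
    let path = path.as_bytes();
    routes
        .iter()
        .copied()
        .filter(|route| {
            let r = route.as_bytes();
            // Assumed rule: the prefix must end at a path-segment boundary.
            path.starts_with(r)
                && (r.ends_with(b"/") || path.len() == r.len() || path[r.len()] == b'/')
        })
        .max_by_key(|route| route.len())
}
```

With routes `/`, `/api`, and `/api/v1`, a request for `/api/v1/users` selects `/api/v1`, while `/apix` falls back to `/`.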
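The chunked file-cache read described above can be illustrated with the standard library alone. This sketch uses blocking `std::io::Read` and `Vec<u8>` as stand-ins for the async reader and `Bytes` that rpxy presumably uses; `read_in_chunks` and `CHUNK_SIZE` are hypothetical names.

```rust
use std::io::{self, Read};

/// Chunk size matching the 64 KiB figure in the note above.
const CHUNK_SIZE: usize = 64 * 1024;

/// Read `src` in chunks of up to 64 KiB, handing each chunk downstream by value.
/// Returns the number of chunks produced.
fn read_in_chunks<R: Read>(mut src: R, mut on_chunk: impl FnMut(Vec<u8>)) -> io::Result<usize> {
    let mut chunks = 0;
    loop {
        let mut buf = vec![0u8; CHUNK_SIZE];
        // A read may return fewer bytes than requested; zero means end of input.
        let n = src.read(&mut buf)?;
        if n == 0 {
            return Ok(chunks);
        }
        buf.truncate(n);
        on_chunk(buf); // ownership moves downstream, so no extra copy is made
        chunks += 1;
    }
}
```

Against an in-memory source, an 8 KiB object arrives as a single chunk rather than the roughly 128 small reads described above.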
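The streaming store path described above (spill past the on-memory threshold, create the temp file exclusively, rename on completion) can be sketched as follows. This is a simplified blocking illustration under assumptions: incremental hashing is omitted, `name` stands in for the generation-unique final name, and `Stored` and `store_streaming` are hypothetical names rather than rpxy's types.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Read, Write};
use std::path::{Path, PathBuf};

/// Where a stored body ended up (hypothetical type).
pub enum Stored {
    InMemory(Vec<u8>),
    OnDisk(PathBuf),
}

/// Stream `body` into memory, spilling to a temp file once it exceeds `mem_threshold`.
pub fn store_streaming<R: Read>(
    mut body: R,
    dir: &Path,
    name: &str,
    mem_threshold: usize,
) -> io::Result<Stored> {
    let mut mem: Vec<u8> = Vec::new();
    let mut spill: Option<(File, PathBuf)> = None;
    let mut buf = [0u8; 8192];
    loop {
        let n = body.read(&mut buf)?;
        if n == 0 {
            break;
        }
        if spill.is_none() && mem.len() + n > mem_threshold {
            // Crossing the threshold: create the temp file exclusively
            // and move the bytes buffered so far onto disk.
            let tmp = dir.join(format!("{name}.tmp"));
            let mut f = OpenOptions::new().write(true).create_new(true).open(&tmp)?;
            f.write_all(&mem)?;
            mem = Vec::new();
            spill = Some((f, tmp));
        }
        match spill.as_mut() {
            Some((f, _)) => f.write_all(&buf[..n])?,
            None => mem.extend_from_slice(&buf[..n]),
        }
    }
    match spill {
        None => Ok(Stored::InMemory(mem)),
        Some((mut f, tmp)) => {
            f.flush()?;
            drop(f);
            // Rename only after the file is fully written; the caller would
            // publish cache metadata after this returns.
            let final_path = dir.join(name);
            fs::rename(&tmp, &final_path)?;
            Ok(Stored::OnDisk(final_path))
        }
    }
}
```

The in-memory buffer never grows past the threshold, and a reader can only observe the final path once the rename has happened.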
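The in-place host normalization described above can be sketched with `make_ascii_lowercase` on an owned buffer. The port-stripping rule shown here (including the handling of bracketed IPv6 literals) is an assumption for illustration, and `normalize_host` is a hypothetical name.

```rust
/// Strip an optional port and lowercase the host in place,
/// reusing the already-owned buffer instead of allocating a second one.
fn normalize_host(mut host: Vec<u8>) -> Vec<u8> {
    // For bracketed IPv6 literals, only a colon after ']' separates the port.
    let search_from = host.iter().rposition(|&b| b == b']').unwrap_or(0);
    let port_sep = host[search_from..].iter().rposition(|&b| b == b':');
    if let Some(pos) = port_sep {
        host.truncate(search_from + pos);
    }
    host.make_ascii_lowercase();
    host
}
```

For example, `Example.COM:8443` becomes `example.com` and `[::1]:443` becomes `[::1]`, each without a second buffer.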
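The backpressure effect of a bounded channel, as described in the last item above, can be demonstrated with the standard library's `sync_channel`. This is a stand-in for whatever async channel rpxy uses; the function names and the 1 KiB frame size are illustrative assumptions.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

/// Count how many frames a bounded channel accepts while nobody is consuming.
fn fill_until_full(capacity: usize, frames: usize) -> usize {
    // `_rx` is kept alive so the channel stays connected.
    let (tx, _rx) = sync_channel::<Vec<u8>>(capacity);
    let mut accepted = 0;
    for _ in 0..frames {
        match tx.try_send(vec![0u8; 1024]) {
            Ok(()) => accepted += 1,
            // A blocking `send` would wait here until the consumer catches up.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}

/// Relay frames through a bounded channel; the producer blocks whenever it is full.
fn relay(frames: Vec<Vec<u8>>, capacity: usize) -> usize {
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);
    let producer = thread::spawn(move || {
        for frame in frames {
            if tx.send(frame).is_err() {
                break; // consumer went away
            }
        }
    });
    // The iterator ends once the producer finishes and drops `tx`.
    let total: usize = rx.iter().map(|frame| frame.len()).sum();
    producer.join().unwrap();
    total
}
```

With a stalled consumer, queued data is capped at `capacity` frames; with an active consumer, all data is still delivered.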

Internal

  • Add an off-by-default dhat-heap feature for developer heap profiling. Building rpxy with --features dhat-heap swaps the global allocator for the dhat heap profiler and writes a dhat-heap.json (viewable with dhat/dh_view.html) on a Ctrl-C graceful shutdown, so per-request allocation call-sites can be measured before micro-optimizing the request hot path. The feature is off by default and not built into release binaries: normal builds keep mimalloc (and the system allocator on illumos) and are unchanged in both behavior and dependencies. This is a development aid only; it is not a runtime or configuration change.

What's Changed

  • feat(bin): add dhat-heap feature for developer heap profiling by @junkurihara in #587
  • perf(forwarding): cut per-request allocations in forwarding-header path by @junkurihara in #588
  • perf(routing): avoid per-request allocations in path match and URI rebuild by @junkurihara in #589
  • perf(sticky-cookie): avoid cloning the whole HeaderMap in cookie takeout by @junkurihara in #590
  • perf(cache): read file-cache hits in 64 KiB chunks by @junkurihara in #591
  • perf(cache): skip per-hit re-hash of on-memory cache objects by @junkurihara in #592
  • perf(cache): stream the file-cache store path to disk by @junkurihara in #593
  • fix(alt-svc)!: advertise HTTP/3 on secure responses, decoupled from redirect config by @junkurihara in #594
  • feat(cache): bound the cache streaming channels (backpressure) by @junkurihara in #595
  • fix(proxy): stop enabling hyper's pipeline_flush on h1 connections by @junkurihara in #596
  • fix(cache): make the file-store count lock-free to stop store-churn stalls by @junkurihara in #597
  • feat(sticky-cookie): precompute the AEAD AAD at config build time by @junkurihara in #598
  • feat(routing): lowercase the parsed request host in place by @junkurihara in #599
  • feat(cache): raise the default on-memory cache threshold to 64 KiB by @junkurihara in #600
  • 0.13.0 by @junkurihara in #601

Full Changelog: 0.12.1...0.13.0