http: add per-cluster upstream client codec factory seam#3
Open
ivpr wants to merge 11 commits into
Open
Conversation
This was referenced Jun 11, 2026
22b605f to
dc1fddb
Compare
ivpr
pushed a commit
that referenced
this pull request
Jun 15, 2026
…envoyproxy#45324) ### Description This PR resolves a use-after-free (UAF) / pure virtual method call crash that occurs during `ConnectivityGrid` teardown when connection pools (HTTP/3, HTTP/2, or HTTP/1) are destroyed while active connection attempts are still ongoing in the background. --- ### 💥 The Bug & Crash Callstack When the `ConnectivityGrid` is deleted, its destructor destroys its underlying connection pools in reverse order. If a pool has active connection attempts, deleting the pool synchronously cancels those attempts. In C++, once a derived class destructor finishes, the object's vtable pointer transitions to point to the base class vtable. Because `destructAllConnections()` is invoked in `~HttpConnPoolImplBase()` (the base class destructor), the connection cancellation callback wrappers execute `onConnectionAttemptFailed()` after the derived pool object has already been destructed. When `onConnectionAttemptFailed()` attempts to write a log trace calling `describePool(attempt->pool())` (which executes `attempt->pool().protocolDescription()`), virtual dispatch invokes a **pure virtual function call** on the base class `HttpConnPoolImplBase`, crashing Envoy instantly: ```log Program received signal SIGSEGV, Segmentation fault. 0x000056453663656e in Envoy::Http::ConnectivityGrid::WrapperCallbacks::onConnectionAttemptFailed() Backtrace: #0 0x000056453663656e in Envoy::Http::ConnectivityGrid::WrapperCallbacks::onConnectionAttemptFailed() #1 0x000056453663606d in Envoy::Http::ConnectivityGrid::WrapperCallbacks::ConnectionAttemptCallbacks::onPoolFailure() #2 0x000056453664e7f1 in Envoy::Http::HttpConnPoolImplBase::onPoolFailure() #3 0x00005645366cb447 in Envoy::ConnectionPool::ConnPoolImplBase::purgePendingStreams() #4 0x00005645366c9d6e in Envoy::ConnectionPool::ConnPoolImplBase::onConnectionEvent() #5 0x000056453666394d in Envoy::Tcp::ActiveTcpClient::onEvent() envoyproxy#6 0x000056453709b0e9 in Envoy::Network::MockConnectionBase::raiseEvent() ... envoyproxy#24 0x00005645366be287 in Envoy::ConnectionPool::ConnPoolImplBase::destructAllConnections() envoyproxy#25 0x00005645366766b8 in Envoy::Http::HttpConnPoolImplBase::~HttpConnPoolImplBase() envoyproxy#26 0x000056453664e70f in Envoy::Http::HttpConnPoolImplMixed::~HttpConnPoolImplMixed() envoyproxy#27 0x000056453664e739 in Envoy::Http::HttpConnPoolImplMixed::~HttpConnPoolImplMixed() envoyproxy#28 0x000056453647099c in std::__1::default_delete<>::operator()() envoyproxy#29 0x000056453647091c in std::__1::unique_ptr<>::reset() envoyproxy#30 0x000056453663a3ad in Envoy::Http::ConnectivityGrid::~ConnectivityGrid() envoyproxy#31 0x000056453663a4f9 in Envoy::Http::ConnectivityGrid::~ConnectivityGrid() envoyproxy#32 0x00005645365de65c in std::__1::default_delete<>::operator()() envoyproxy#33 0x000056453656b55c in std::__1::unique_ptr<>::reset() envoyproxy#34 0x00005645363174fa in ConnectivityGridTest_DestroyGridWithActiveH2Attempts_Test::TestBody() ``` --- ### 🛠️ The Fix Introduced a reloadable feature flag guard `envoy.reloadable_features.conn_pool_grid_early_return_on_teardown` (enabled by default) to shield both H3 and H2/H1 pools during destruction: * **Early Return**: Inside `WrapperCallbacks::onConnectionAttemptFailed()`, we check if `delete_started_` is `true`. If the flag is enabled and the wrapper is already being deleted/torn down, we return immediately, completely skipping the unsafe virtual method lookups on the half-destructed pool object. * **ALPN & Protocol Safety**: This safely protects the system regardless of which pool is being torn down (H3 or H2/H1 mixed pools). --- ### 🧪 Testing & Verification * Added the `DestroyGridWithActiveH2Attempts` unit test to verify that tearing down the `ConnectivityGrid` while an active H2 connection attempt is running does crash without the fix and not with the fix. Risk Level: low, only exposed with log level trace Testing: new unit test Docs Changes: N/A Release Notes: Y Platform Specific Features: N/A Runtime guard: envoy.reloadable_features.conn_pool_grid_early_return_on_teardown Signed-off-by: Dan Zhang <danzh@google.com> Co-authored-by: Dan Zhang <danzh@google.com>
Introduce Http::ClientCodecFactory, an optional per-cluster extension point for building the upstream (client) HTTP codec. ClusterInfo exposes it via upstreamHttpClientCodecFactory(); ClusterInfoImpl recovers it from typed_extension_protocol_options by sidecast (an options object that also implements ClientCodecFactory is the factory). CodecClientProd routes codec construction through the factory when present, passing a thunk that builds the stock codec so the factory can either decorate it or return a fresh codec. This is a generic, opt-in seam: with no factory configured, behavior is unchanged and the stock codec is used. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
dc1fddb to
eb48580
Compare
added 10 commits
June 16, 2026 08:00
Add RELEASE_ASSERT checks to guard against calling create_default multiple times and returning a null codec from custom upstream client codec factories. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Avoid eager stock codec construction when an upstream codec factory is configured, make factory selection deterministic by sorting extension option names, and add explicit pure.h include for PURE usage. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Replace the create_default callback with a nullptr-return contract so CodecClientProd builds the stock codec via a single switch only when no custom codec is used (avoids eager stock construction side effects). Discover the per-cluster factory via an explicit ProtocolOptionsConfig::upstreamHttpClientCodecFactory() hook instead of a dynamic_cast probe, and enforce at-most-one factory per cluster. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Avoid the non-dictionary word flagged by the pedantic spell check. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Add the connection's transport socket options to ClientCodecFactory::Context so a custom codec factory can replicate or decorate the stock HTTP/1 codec, which inspects http11ProxyInfo() to detect a proxied connection. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Note that upstreamHttpClientCodecFactory() must return a factory owned by the options object, since ClusterInfoImpl pins it via a shared_ptr aliasing the options object. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Add HTTP/2 factory-used and factory-returns-nullptr cases, assert the transport socket options are forwarded to the factory, and verify that configuring multiple upstream client codec factories on one cluster is rejected. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Note that a factory returning a non-null codec owns the full construction the stock path would do, including QUIC session initialization for HTTP/3, and that returning nullptr defers to the stock codec for unsupported types. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Verify the factory is consulted for HTTP/3 and that returning a codec skips the stock QUIC construction path (running against a mock connection without crashing proves the dynamic_cast/Initialize path is not taken). Signed-off-by: Prasad I V <prasad.iv@databricks.com>
Per the style guide, include the headers defining the types used in ClientCodecFactory::Context rather than forward-declaring them. No dependency cycle: upstream_interface only forward-declares Http::ClientCodecFactory. Signed-off-by: Prasad I V <prasad.iv@databricks.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation: why reverse tunnels need this seam
This seam is generic and opt-in, but it exists to unblock a concrete problem Envoy has no clean
hook for today: graceful, codec-level drain of reverse-tunnel connections.
Reverse tunnels invert the HTTP roles. The connection is established by the downstream dialing
out to the upstream, but HTTP requests then flow from the upstream to the downstream over that
tunnel -- so the upstream is the HTTP/2 client and the downstream is the HTTP/2 server, the
opposite of a normal hop. Graceful drain of a reverse tunnel therefore needs cooperation in both
directions, and both needs land on the upstream (client) codec:
the downstream dials a replacement tunnel to a healthy upstream, while in-flight requests
finish on the old tunnel. The upstream is the HTTP/2 client here, and the stock client codec has
no way to initiate a graceful drain toward the server.
first tell the upstream to stop sending new requests over this tunnel and let the downstream
dial a replacement, again finishing in-flight requests on the old tunnel. The upstream client
codec must observe that drain signal and drain its connection pool gracefully rather than
resetting in-flight streams.
Neither behavior is expressible with Envoy's stock upstream codec, and Envoy has no per-cluster
hook to substitute the upstream (client) codec --
CodecClientProdhard-codes it by protocol. Theonly alternatives are forking
codec_client.ccor intervening at a coarser layer (connection pool /load balancer), which is far more invasive and not per-cluster-configurable.
This PR adds the minimal seam: an optional, per-cluster
Http::ClientCodecFactorydiscovered fromtyped_extension_protocol_options, invoked byCodecClientProdwith a thunk that builds the stockcodec -- so an extension can decorate the stock codec or return a fresh one. It is scoped (only
clusters that configure such options get it) and zero-impact when unused (no factory -> stock codec,
behavior identical).
How it is used (reverse tunnels)
ClientCodecFactorytosignal a drain toward the downstream and to drain its pool gracefully when the downstream drains.
when the downstream itself rotates), keeping in-flight requests on the old tunnel.
Summary
Http::ClientCodecFactory, an optional per-cluster extension point for building theupstream (client) HTTP codec.
ClusterInfoexposes it viaupstreamHttpClientCodecFactory();ClusterInfoImplrecovers itfrom
typed_extension_protocol_optionsby sidecast.CodecClientProdroutes codec construction through the factory when present, passing a thunkthat builds the stock codec (factory may decorate it or return a fresh codec).
Stack
Base of a 3-PR stack (#3 <- #4 <- #5). Targets
main.Test plan
bazel build //source/exe:envoy-static