
visibility into client-side queuing and flow control #11114

Open
benjaminp opened this issue Apr 19, 2024 · 7 comments

@benjaminp
Contributor

Outgoing messages in a call can be internally queued by gRPC due to HTTP/2 flow control. Client calls can additionally be delayed from even starting due to the underlying HTTP/2 connection hitting its MAX_CONCURRENT_STREAMS limit. There should be a public API that can observe gRPC-internal queuing of outgoing messages. This could be useful when debugging a DEADLINE_EXCEEDED RPC; it's very informative to know if the request never left the client's queues.

Basically, I'd like to revive #8945. To recapitulate the findings on that issue:

  • StreamTracer.outboundMessageSent is called after a message is sent to the transport, not when the message hits the network.
  • onReady callbacks are strongly correlated with messages being released onto the network. However, unary calls do not receive onReady callbacks.

One way to resolve this issue would be to change StreamTracer.outboundMessageSent to be called after the message is actually sent to the network. Alternatively a new StreamTracer callback—outboundMessageTransmitted?—could serve that purpose.
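For concreteness, here is a minimal sketch against the public io.grpc.ClientStreamTracer API of the observation point that exists today and where the suggested callback would slot in; outboundMessageTransmitted is the hypothetical name proposed above, not an existing method.

```java
import io.grpc.ClientStreamTracer;
import io.grpc.Metadata;

// Per-call tracer, attachable via
// CallOptions.DEFAULT.withStreamTracerFactory(new QueueTracer.Factory()).
final class QueueTracer extends ClientStreamTracer {
  private final long startNanos = System.nanoTime();

  @Override
  public void outboundMessageSent(int seqNo, long wireSize, long uncompressedSize) {
    // Today this fires when the message is handed to the transport; it may still
    // sit behind HTTP/2 flow control or MAX_CONCURRENT_STREAMS after this point.
    System.out.printf("message %d handed to transport after %d us%n",
        seqNo, (System.nanoTime() - startNanos) / 1_000);
  }

  // A callback such as outboundMessageTransmitted(int seqNo) (the hypothetical
  // addition proposed above, not part of the current API) would instead fire
  // once the bytes have actually been written to the network.

  static final class Factory extends ClientStreamTracer.Factory {
    @Override
    public ClientStreamTracer newClientStreamTracer(
        ClientStreamTracer.StreamInfo info, Metadata headers) {
      return new QueueTracer();
    }
  }
}
```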

@ejona86
Member

ejona86 commented Apr 19, 2024

This could be useful when debugging a DEADLINE_EXCEEDED RPC

We already package up that information into the DEADLINE_EXCEEDED status message. And we have infrastructure where we could add more, if something is missing.

What are you trying to use this information for? Are you wanting code that reacts dynamically based on the results or are you wanting to gather more details for a human to debug?

@benjaminp
Contributor Author

This could be useful when debugging a DEADLINE_EXCEEDED RPC

We already package up that information into the DEADLINE_EXCEEDED status message.

Where in DEADLINE_EXCEEDED? I see the "delayed" transport can add information to a DEADLINE_EXCEEDED. But the flow control and MAX_CONCURRENT_STREAMS buffering I'm worried about happens within Netty.

Improving the DEADLINE_EXCEEDED message is of course good. I'm also interested in obtaining this information for successful but "slow" calls and observing the buffering of outbound messages in streaming calls. I want to build a trace of all the queues a message goes through.

What are you trying to use this information for? Are you wanting code that reacts dynamically based on the results or are you wanting to gather more details for a human to debug?

At the moment, debugging and understanding performance.

(I suppose in the future, it's possible to explore automatically opening another channel if calls are blocked too long on MAX_CONCURRENT_STREAMS.)
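As a point of reference for the streaming-call case, the onReady handler mentioned above is the closest existing proxy for "the transport is draining my outbound messages". A minimal sketch using the public ClientResponseObserver API (type parameters are generic placeholders):

```java
import io.grpc.stub.ClientCallStreamObserver;
import io.grpc.stub.ClientResponseObserver;

// Passed as the response observer to an async streaming stub method.
// Unary calls never receive these notifications, as noted earlier in the thread.
final class OnReadyProbe<ReqT, RespT> implements ClientResponseObserver<ReqT, RespT> {
  @Override
  public void beforeStart(ClientCallStreamObserver<ReqT> requestStream) {
    // Runs whenever the transport's outbound buffer drains below its threshold,
    // which correlates with queued messages having been flushed toward the network.
    requestStream.setOnReadyHandler(() ->
        System.out.println("transport ready at " + System.nanoTime()));
  }

  @Override public void onNext(RespT value) {}
  @Override public void onError(Throwable t) {}
  @Override public void onCompleted() {}
}
```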

@ejona86
Member

ejona86 commented Apr 23, 2024

Oh, I missed that you wanted messages. There are some bits you can infer from traces, but in general, talking about when a message is sent is complicated.

MAX_CONCURRENT_STREAMS would fit into our current stream.appendTimeoutInsight() flow for DEADLINE_EXCEEDED, and Netty can supplement it. It might be a bit hard to plumb this into Netty, and even harder to support this in the new HTTP/2 Netty API. But if we figure out the Netty part, the rest of it is easy.

We have support for OpenCensus tracing today, and we're adding OpenTelemetry tracing. There's also Perfmark tracing, which shows more about thread interactions and wouldn't be as good for watching a single RPC you were suspicious about after-the-fact. The current stuff is lacking MAX_CONCURRENT_STREAMS visibility, and flow control is hard to express.

There has been talk (and a private design) of opening new connections when hitting MAX_CONCURRENT_STREAMS.

@sergiitk sergiitk added the Waiting on reporter label Apr 25, 2024
@benjaminp
Contributor Author

I realize the most detailed efforts would require Netty plumbing or forking. But what about my suggestion in the OP to notify the stream tracer in NettyClientHandler.TransportState.onWriteFrameData? That would at least provide better information than we have today.

@benjaminp benjaminp changed the title visiiblity into client-side queuing and flow control visiblity into client-side queuing and flow control Apr 29, 2024
@benjaminp benjaminp changed the title visiblity into client-side queuing and flow control visibility into client-side queuing and flow control Apr 29, 2024
@sergiitk sergiitk removed the Waiting on reporter label May 2, 2024
@ejona86
Member

ejona86 commented May 16, 2024

CC @YifeiZhuang

@YifeiZhuang
Contributor

I think adding a StreamTracer.outboundMessageTransmitted callback after flow control, and adding it to OTel tracing, seems like a good idea and low-hanging fruit. We probably need consensus from the other languages. And when to support it is another question. @ejona86

@YifeiZhuang
Contributor

We talked about this in the cross-language meeting, and we had consensus that having a notification when the last bytes are written to the wire sounds fair. @benjaminp do you want to send a PR for this?

cc @ejona
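For reference, a rough sketch of what the discussed addition might look like; the name, parameter, and exact placement are placeholders to be settled in the PR, not existing API:

```java
// Hypothetical addition to io.grpc.StreamTracer (shape to be decided in the PR):

/**
 * A message has been fully written to the socket, i.e. it has cleared HTTP/2
 * flow control and, for the first message of a call, any MAX_CONCURRENT_STREAMS
 * queuing in the transport.
 *
 * @param seqNo sequence number of the message, matching outboundMessageSent
 */
public void outboundMessageTransmitted(int seqNo) {}
```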

@ejona86 ejona86 added this to the Unscheduled milestone Jun 20, 2024