
close request receiving channel on errors immediately #416

Conversation

@altkdf commented Jan 11, 2024

Fixes the problem described in #415.

google-cla bot commented Jan 11, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@altkdf (Author) commented Jan 11, 2024

I signed the CLA but don't know how to rerun the CLA job.

@altkdf (Author) commented Jan 11, 2024

Clicking "rescan" gives me a "400. That’s an error. That’s all we know."

@@ -352,6 +352,7 @@ where
let _entered = span.enter();
tracing::info!("ReceiveError");
}
self.pending_requests_mut().close();
@tikue (Collaborator) commented Jan 21, 2024

What's confusing me is that pending requests should already close once the request dispatch is dropped. Can you show me your code that polls the request dispatch? What does the code do after RequestDispatch::poll returns an error?

@altkdf (Author)

According to the tokio documentation, it is possible for messages to remain in the channel after the Receiver is dropped. This does not happen on every run, but maybe once in 10 or 100 runs. My test currently produces the following log.

2024-01-23T22:14:29.394867Z  INFO tarpc::client: dropping request dispatch
2024-01-23T22:14:29.394871Z  INFO RPC{rpc.deadline=2024-01-23T22:14:39.394769825Z otel.kind="client" otel.name="TarpcCspVault.idkg_gen_dealing_encryption_key_pair" rpc.trace_id=00}: tarpc::client: sending request to dispatch
2024-01-23T22:14:29.394973Z TRACE tarpc::server: Expired requests: Closed, Inbound: Closed
2024-01-23T22:14:29.395005Z TRACE tarpc::server: poll_flush
2024-01-23T22:14:29.395019Z TRACE tokio_util::codec::framed_impl: flushing framed transport
2024-01-23T22:14:29.395032Z TRACE tokio_util::codec::framed_impl: framed transport flushed
2024-01-23T22:14:29.395025Z  WARN tarpc::client: Connection broken: could not read from the transport

Caused by:
    frame size too big
2024-01-23T22:14:29.395086Z  INFO RPC{rpc.deadline=2024-01-23T22:14:39.394769825Z otel.kind="client" otel.name="TarpcCspVault.idkg_gen_dealing_encryption_key_pair" rpc.trace_id=00}: tarpc::client: sending request to dispatch done; waiting for response
test rpc_connection::should_unfortunately_be_dead_after_response_from_server_cannot_be_received_by_client_because_too_large has been running for over 60 seconds

Our code uses the tarpc Client as a black box, calling its generated methods with tokio's block_on. Here's an example.
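(The linked example is not reproduced here. A rough sketch of that calling pattern, using a hypothetical Vault service whose names are not the project's actual code, might look like the following.)

use tarpc::{client::RpcError, context};

// Hypothetical service definition, for illustration only; the real trait
// lives in the project that uses tarpc.
#[tarpc::service]
pub trait Vault {
    async fn idkg_gen_dealing_encryption_key_pair() -> Vec<u8>;
}

// The macro above generates VaultClient with one async method per RPC.
// The handle is a tokio::runtime::Handle owned by the synchronous caller.
pub fn gen_key_blocking(
    handle: &tokio::runtime::Handle,
    client: &VaultClient,
) -> Result<Vec<u8>, RpcError> {
    // block_on drives the async RPC to completion from synchronous code.
    // If the dispatch task dies without draining its request channel, this
    // call can hang forever, which is the behavior described in #415.
    handle.block_on(client.idkg_gen_dealing_encryption_key_pair(context::current()))
}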

@tikue (Collaborator)

Thanks! Will review your code and get back to you.

(It shouldn't be a problem if there are messages in the channel when the channel is dropped — the messages in the channel contain the sender side of oneshot channels used to send individual RPC responses. When the channel is dropped, the senders are also dropped, and that should cause the receiver side to receive an error.)
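(As a standalone illustration of the oneshot behavior being relied on here, independent of tarpc:)

use tokio::sync::oneshot;

#[tokio::main]
async fn main() {
    let (response_tx, response_rx) = oneshot::channel::<String>();

    // Dropping the sender half, e.g. because the dispatch message holding it
    // was dropped together with the request channel, wakes the waiting
    // receiver with a RecvError instead of leaving it pending forever.
    drop(response_tx);

    assert!(response_rx.await.is_err());
}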

@altkdf (Author)

Thank you!

I'm not sure the channel itself is actually dropped. The only thing that is dropped (IIUC) is the receiver half of the dispatch channel, together with some of the messages in it. The channel itself, again IIUC, is just a synchronized queue shared by both sender and receiver. The sender half is not dropped until we drop the client; it only switches to an errored state that prevents sending further messages into the channel, right? My hypothesis so far is:

  1. A dispatch request (which contains the sender half of the response channel) is sent to the request dispatcher.
  2. The caller waits on the receiver end of the response channel.
  3. The dispatcher's receiver is dropped, but the channel is not properly cleared, so the dispatch request, along with its response sender handle, is not dropped.
  4. The client waits indefinitely.

WDYT @tikue?
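(For illustration only: a contrived model of this hypothesis built from plain tokio channels rather than tarpc internals. If the message carrying the response sender stays buffered and is never dropped, the caller's await never completes.)

use std::time::Duration;
use tokio::sync::{mpsc, oneshot};

#[tokio::main]
async fn main() {
    // The "dispatch channel": each message carries the sender half of a
    // per-request response channel, as in step 1 above.
    let (request_tx, mut request_rx) = mpsc::channel::<oneshot::Sender<&'static str>>(8);

    // Steps 1 and 2: enqueue a request and hold on to its response receiver.
    let (response_tx, response_rx) = oneshot::channel();
    request_tx.send(response_tx).await.unwrap();

    // Step 3, modeled: the dispatcher stops accepting requests, but the
    // buffered message, and with it the response sender, stays alive.
    request_rx.close();

    // Step 4: the caller waits on a response whose sender is stuck in the
    // buffer; without a drain, this wait never ends.
    let waited = tokio::time::timeout(Duration::from_millis(100), response_rx).await;
    assert!(waited.is_err(), "caller is still waiting after the dispatcher went away");
}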

@tikue (Collaborator)

Ah, that's really interesting and makes a lot of sense, thanks!

@@ -352,6 +352,7 @@ where
let _entered = span.enter();
tracing::info!("ReceiveError");
}
self.pending_requests_mut().close();
@tikue (Collaborator)

I think there could be pending requests after closing the receiver. Maybe those requests should be completed as well, similar to what's done with in_flight_requests (immediately above)? Something like this: https://github.com/google/tarpc/blob/876fd4724b4ff051d8631be8bbc892023bd98e71/tarpc/src/client/in_flight_requests.rs#L102C19-L102C69

@altkdf (Author)

I pushed some commits, PTAL.

Comment on lines +352 to +353
// channel is open or a spurious failure
Poll::Pending => continue,
@tikue (Collaborator)

Oh, actually, the channel returns Pending when it's empty, I think. https://docs.rs/tokio/latest/tokio/sync/mpsc/struct.Receiver.html#method.poll_recv

So this loop can break on Poll::Pending | Poll::Ready(None).
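(A sketch of the drain shape being discussed, not the PR's actual code, assuming the dispatch side holds a bounded tokio mpsc::Receiver:)

use std::task::{Context, Poll};
use tokio::sync::mpsc;

// Drop every request still buffered in an already-closed receiver, stopping
// on either Pending (nothing buffered right now) or Ready(None) (fully drained).
fn drain_pending_requests<T>(rx: &mut mpsc::Receiver<T>, cx: &mut Context<'_>) {
    loop {
        match rx.poll_recv(cx) {
            // Dropping the request drops its response sender, so the waiting
            // caller observes an error instead of hanging.
            Poll::Ready(Some(request)) => drop(request),
            Poll::Pending | Poll::Ready(None) => break,
        }
    }
}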

@tikue (Collaborator)

Oh, I see now why you did it this way — basically, the receiver is never supposed to return Poll::Pending after closure. Interesting — I guess this is good then!

@tikue (Collaborator) commented Jan 31, 2024

Actually, I think there's some subtlety here: reading the docs for close, it looks like outstanding Permits are still allowed to send into the channel. But there's no guarantee when the send will happen, so this loop is actually a blocking call. Since this code path is nonblocking, I think that's a problem we have to solve.

What about spawning a task that drains the channel?

On a separate note, I think there are other errors that could occur besides the error in pump_read, and I hadn't made any attempt to drain the requests for those paths. And likewise, errors in other areas (like in pump_write) will not result in the receiver being drained. What if all of this draining was moved into a drop impl for RequestDispatch?

(BTW, if this is more than you want to sign up for, I'm happy to take a look at fixing this sometime in the next couple of weeks)
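(A rough sketch of the "spawn a task that drains the channel" idea, not the PR's code; the receiver type and names here are assumptions:)

use tokio::sync::mpsc;

// Move the already-closed dispatch receiver into a background task so the
// error path itself stays nonblocking.
fn drain_in_background<T: Send + 'static>(mut rx: mpsc::Receiver<T>) {
    tokio::spawn(async move {
        // recv() still yields messages that were already buffered or that are
        // sent through outstanding Permits, then resolves to None once the
        // closed channel is empty.
        while let Some(request) = rx.recv().await {
            // Dropping each request drops its response sender, failing the
            // corresponding caller fast instead of letting it hang.
            drop(request);
        }
    });
}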

@tikue (Collaborator) commented Feb 3, 2024

Obsoleted by #423. Thanks so much for identifying this problem as well as the solution!

@tikue tikue closed this Feb 3, 2024