
close request receiving channel on errors immediately #416

Conversation

@altkdf commented Jan 11, 2024

Fixes the problem described in #415.

google-cla bot commented Jan 11, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@altkdf (Author) commented Jan 11, 2024

I signed the CLA but don't know how to rerun the CLA job.

@altkdf (Author) commented Jan 11, 2024

Clicking "rescan" gives me a "400. That’s an error. That’s all we know."

@@ -352,6 +352,7 @@ where
let _entered = span.enter();
tracing::info!("ReceiveError");
}
self.pending_requests_mut().close();
@tikue (Collaborator) commented Jan 21, 2024

What's confusing me is that pending requests should already close once the request dispatch is dropped. Can you show me your code that polls the request dispatch? What does the code do after RequestDispatch::poll returns an error?

@altkdf (Author)

According to the tokio documentation, it is possible for messages to remain in the channel after the Receiver is dropped. This does not happen on every run, but maybe once in 10 or 100 runs. My test currently produces the following log.

2024-01-23T22:14:29.394867Z  INFO tarpc::client: dropping request dispatch
2024-01-23T22:14:29.394871Z  INFO RPC{rpc.deadline=2024-01-23T22:14:39.394769825Z otel.kind="client" otel.name="TarpcCspVault.idkg_gen_dealing_encryption_key_pair" rpc.trace_id=00}: tarpc::client: sending request to dispatch
2024-01-23T22:14:29.394973Z TRACE tarpc::server: Expired requests: Closed, Inbound: Closed
2024-01-23T22:14:29.395005Z TRACE tarpc::server: poll_flush
2024-01-23T22:14:29.395019Z TRACE tokio_util::codec::framed_impl: flushing framed transport
2024-01-23T22:14:29.395032Z TRACE tokio_util::codec::framed_impl: framed transport flushed
2024-01-23T22:14:29.395025Z  WARN tarpc::client: Connection broken: could not read from the transport

Caused by:
    frame size too big
2024-01-23T22:14:29.395086Z  INFO RPC{rpc.deadline=2024-01-23T22:14:39.394769825Z otel.kind="client" otel.name="TarpcCspVault.idkg_gen_dealing_encryption_key_pair" rpc.trace_id=00}: tarpc::client: sending request to dispatch done; waiting for response
test rpc_connection::should_unfortunately_be_dead_after_response_from_server_cannot_be_received_by_client_because_too_large has been running for over 60 seconds

Our code uses the tarpc Client as a black box, calling its generated methods with tokio's block_on. Here's an example.
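(The linked example is not reproduced here. A rough sketch of that calling pattern, using a hypothetical Vault service whose names are not the project's actual code, might look like the following.)

use tarpc::{client::RpcError, context};

// Hypothetical service definition, for illustration only; the real trait
// lives in the project that uses tarpc.
#[tarpc::service]
pub trait Vault {
    async fn idkg_gen_dealing_encryption_key_pair() -> Vec<u8>;
}

// The macro above generates VaultClient with one async method per RPC.
// The handle is a tokio::runtime::Handle owned by the synchronous caller.
pub fn gen_key_blocking(
    handle: &tokio::runtime::Handle,
    client: &VaultClient,
) -> Result<Vec<u8>, RpcError> {
    // block_on drives the async RPC to completion from synchronous code.
    // If the dispatch task dies without draining its request channel, this
    // call can hang forever, which is the behavior described in #415.
    handle.block_on(client.idkg_gen_dealing_encryption_key_pair(context::current()))
}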

@tikue (Collaborator)

Thanks! Will review your code and get back to you.

(It shouldn't be a problem if there are messages in the channel when the channel is dropped — the messages in the channel contain the sender side of oneshot channels used to send individual RPC responses. When the channel is dropped, the senders are also dropped, and that should cause the receiver side to receive an error.)
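(As a standalone illustration of the oneshot behavior being relied on here, independent of tarpc:)

use tokio::sync::oneshot;

#[tokio::main]
async fn main() {
    let (response_tx, response_rx) = oneshot::channel::<String>();

    // Dropping the sender half, e.g. because the dispatch message holding it
    // was dropped together with the request channel, wakes the waiting
    // receiver with a RecvError instead of leaving it pending forever.
    drop(response_tx);

    assert!(response_rx.await.is_err());
}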

@altkdf (Author)

Thank you!

I'm not sure the channel itself is actually dropped. The only thing that is dropped (IIUC) is the receiver half of the dispatch channel, together with some of the messages in it. The channel itself, again IIUC, is just a synchronized queue shared by both sender and receiver. The sender half is not dropped until we drop the client; it only switches to an errored state that prevents sending further messages into the channel, right? My hypothesis so far is:

  1. A dispatch request (which contains the sender half of the response channel) is sent to the request dispatcher.
  2. The caller waits on the receiver end of the response channel.
  3. The dispatcher's receiver is dropped, but the channel is not properly cleared, so the dispatch request, along with its response sender handle, is not dropped.
  4. The client waits indefinitely.

WDYT @tikue?
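(For illustration only: a contrived model of this hypothesis built from plain tokio channels rather than tarpc internals. If the message carrying the response sender stays buffered and is never dropped, the caller's await never completes.)

use std::time::Duration;
use tokio::sync::{mpsc, oneshot};

#[tokio::main]
async fn main() {
    // The "dispatch channel": each message carries the sender half of a
    // per-request response channel, as in step 1 above.
    let (request_tx, mut request_rx) = mpsc::channel::<oneshot::Sender<&'static str>>(8);

    // Steps 1 and 2: enqueue a request and hold on to its response receiver.
    let (response_tx, response_rx) = oneshot::channel();
    request_tx.send(response_tx).await.unwrap();

    // Step 3, modeled: the dispatcher stops accepting requests, but the
    // buffered message, and with it the response sender, stays alive.
    request_rx.close();

    // Step 4: the caller waits on a response whose sender is stuck in the
    // buffer; without a drain, this wait never ends.
    let waited = tokio::time::timeout(Duration::from_millis(100), response_rx).await;
    assert!(waited.is_err(), "caller is still waiting after the dispatcher went away");
}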

@tikue (Collaborator)

Ah, that's really interesting and makes a lot of sense, thanks!

@@ -352,6 +352,7 @@ where
let _entered = span.enter();
tracing::info!("ReceiveError");
}
self.pending_requests_mut().close();
@tikue (Collaborator)

I think there could be pending requests after closing the receiver. Maybe those requests should be completed as well, similar to what's done with in_flight_requests (immediately above)? Something like this: https://github.com/google/tarpc/blob/876fd4724b4ff051d8631be8bbc892023bd98e71/tarpc/src/client/in_flight_requests.rs#L102C19-L102C69

@altkdf (Author)

I pushed some commits, PTAL.

Comment on lines +352 to +353
// channel is open or a spurious failure
Poll::Pending => continue,
@tikue (Collaborator)

Oh, actually, the channel returns Pending when it's empty, I think. https://docs.rs/tokio/latest/tokio/sync/mpsc/struct.Receiver.html#method.poll_recv

So this loop can break on Poll::Pending | Poll::Ready(None).
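(A sketch of the drain shape being discussed, not the PR's actual code, assuming the dispatch side holds a bounded tokio mpsc::Receiver:)

use std::task::{Context, Poll};
use tokio::sync::mpsc;

// Drop every request still buffered in an already-closed receiver, stopping
// on either Pending (nothing buffered right now) or Ready(None) (fully drained).
fn drain_pending_requests<T>(rx: &mut mpsc::Receiver<T>, cx: &mut Context<'_>) {
    loop {
        match rx.poll_recv(cx) {
            // Dropping the request drops its response sender, so the waiting
            // caller observes an error instead of hanging.
            Poll::Ready(Some(request)) => drop(request),
            Poll::Pending | Poll::Ready(None) => break,
        }
    }
}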

@tikue (Collaborator)

Oh, I see now why you did it this way — basically, the receiver is never supposed to return Poll::Pending after closure. Interesting — I guess this is good then!

@tikue (Collaborator) commented Jan 31, 2024

Actually, I think there's some subtlety here: reading the docs for close, it looks like outstanding Permits are still allowed to send into the channel. But there's no guarantee when the send will happen, so this loop is actually a blocking call. Since this code path is nonblocking, I think that's a problem we have to solve.

What about spawning a task that drains the channel?

On a separate note, I think there are other errors that could occur besides the error in pump_read, and I hadn't made any attempt to drain the requests for those paths. And likewise, errors in other areas (like in pump_write) will not result in the receiver being drained. What if all of this draining was moved into a drop impl for RequestDispatch?

(BTW, if this is more than you want to sign up for, I'm happy to take a look at fixing this sometime in the next couple of weeks)
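(A rough sketch of the "spawn a task that drains the channel" idea, not the PR's code; the receiver type and names here are assumptions:)

use tokio::sync::mpsc;

// Move the already-closed dispatch receiver into a background task so the
// error path itself stays nonblocking.
fn drain_in_background<T: Send + 'static>(mut rx: mpsc::Receiver<T>) {
    tokio::spawn(async move {
        // recv() still yields messages that were already buffered or that are
        // sent through outstanding Permits, then resolves to None once the
        // closed channel is empty.
        while let Some(request) = rx.recv().await {
            // Dropping each request drops its response sender, failing the
            // corresponding caller fast instead of letting it hang.
            drop(request);
        }
    });
}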

@tikue (Collaborator) commented Feb 3, 2024

Obsoleted by #423. Thanks so much for identifying this problem as well as the solution!

@tikue tikue closed this Feb 3, 2024