
On client's read errors quick subsequent requests may wait for response indefinitely #415

Closed
altkdf opened this issue Jan 10, 2024 · 9 comments · Fixed by #423

Comments

@altkdf

altkdf commented Jan 10, 2024

If I understand correctly, the dispatch task is terminated on a receive error in the transport (from `pump_read()`), and subsequent requests then return an error in

        self.to_dispatch
            .send(DispatchRequest {
                ctx,
                span,
                request_id,
                request,
                response_completion,
            })
            .await
            .map_err(|mpsc::error::SendError(_)| RpcError::Shutdown)?;

But for me it sometimes happens that, if a request is issued immediately after the error, then

        self.to_dispatch
            .send

does not return an error and instead waits indefinitely in `response_guard.response().await`. I'm not sure why this happens, since I would expect all `DispatchRequest`s to be dropped and the waiters to then notice that and return an error. It seems there is a race condition?

Unfortunately, I couldn't produce a minimal working example. A small example "just works".
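
To at least illustrate the mechanism I suspect (this is not a reproduction of the bug, and it uses a plain tokio mpsc channel rather than tarpc's types; all names here are mine):

```rust
use std::time::Duration;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Case 1: the dispatch task's receiver is dropped. `send` fails, which is
    // what the client maps to RpcError::Shutdown in the snippet above.
    let (tx, rx) = mpsc::channel::<u32>(1);
    drop(rx);
    assert!(tx.send(1).await.is_err());

    // Case 2: the dispatch task has hit an error and stopped completing
    // requests, but its receiver is still alive. A `send` in this window
    // succeeds, so the caller ends up awaiting a response that never comes.
    let (tx, mut rx) = mpsc::channel::<u32>(1);
    let stalled_dispatch = tokio::spawn(async move {
        // Simulates the errored dispatch task: it neither drains nor closes
        // the channel; keeping `rx` alive keeps the channel open.
        tokio::time::sleep(Duration::from_secs(3600)).await;
        rx.close();
    });
    assert!(tx.send(2).await.is_ok()); // no error, but nobody will answer
    stalled_dispatch.abort();
}
```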

If I add `self.pending_requests_mut().close();` to `pump_read()`, I can no longer reproduce the issue. I'm not sure whether that is an actual fix or whether it just masks the issue.

    fn pump_read(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
    ) -> Poll<Option<Result<(), ChannelError<C::Error>>>> {
        self.transport_pin_mut()
            .poll_next(cx)
            .map_err(|e| {
                let e = Arc::new(e);
                for span in self
                    .in_flight_requests()
                    .complete_all_requests(|| Err(RpcError::Receive(e.clone())))
                {
                    let _entered = span.enter();
                    tracing::info!("ReceiveError");
                }
                // (added) close the channel so new dispatch requests fail fast
                self.pending_requests_mut().close();
                ChannelError::Read(e)
            })
            .map_ok(|response| {
                self.complete(response);
            })
    }
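
For what it's worth, here is the behavior the added `close()` relies on, checked against a plain tokio mpsc channel (not tarpc's internal pending-requests channel): once the receiving half is closed, new sends fail immediately instead of queueing, although sends that already reserved a permit can still go through.

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(1);
    rx.close();
    // With the receiving half closed, `send` returns Err(SendError) right
    // away instead of handing the message to a task that will never answer.
    assert!(tx.send(7).await.is_err());
}
```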
@tikue
Collaborator

tikue commented Jan 10, 2024

Thanks for raising this issue! I'll try to take a closer look this week, but also, feel free to send a PR for review.

@tikue
Collaborator

tikue commented Jan 18, 2024

Could I see your code where the problem occurs? I'm curious how the RequestDispatch is being polled. When it hits an error, is it dropped?

@tikue
Collaborator

tikue commented Feb 1, 2024

BTW, I was able to refactor an existing test to consistently trigger this race: master...request_dispatch_race

@tikue
Collaborator

tikue commented Feb 2, 2024

Can you see if this branch fixes the problem for you? https://github.com/tikue/tarpc/tree/request_dispatch_race

@altkdf
Author

altkdf commented Feb 2, 2024

Can you see if this branch fixes the problem for you? https://github.com/tikue/tarpc/tree/request_dispatch_race

Thanks a lot for implementing it @tikue! I tried it out and, unfortunately, it still results in the same problem. But your code LGTM in general. So I suspect this may be due to the use of `ready!` here, which may return early on spurious failures and, I think, returns something other than the desired `return Poll::Ready(Err(e));`. Also note that the PR I created consistently does not show this problem.

@tikue
Collaborator

tikue commented Feb 2, 2024

Hm, but the `ready!` macro only returns `Poll::Pending` if the underlying future being polled also returns `Pending`. The dispatch future should continue to be polled after that, right? If it's a spurious failure, then the mpsc receiver should have already arranged for a wakeup.

The problem with polling in a loop even when `Pending` is returned is that it blocks inside a nonblocking function. For example, a sender may have reserved a permit and then gone to sleep for an hour. We don't want to block a Tokio thread for an hour waiting for that client to wake up, as there could be other async tasks that need to run in the meantime.
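
As a toy illustration of that contract (plain std/tokio primitives, nothing tarpc-specific): `ready!` only short-circuits with `Pending` when the inner poll is `Pending`, and by that point `poll_recv` has registered the waker, so the task is simply re-polled later rather than spinning.

```rust
use std::future::poll_fn;
use std::task::{ready, Poll};
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel::<String>();
    tx.send("ping".to_string()).unwrap();

    // `ready!` returns Pending from this closure only if `poll_recv` is
    // Pending; in that case the receiver has already stored the waker from
    // `cx`, so the future is woken and polled again later. No busy loop, so
    // no executor thread is blocked inside a non-blocking poll function.
    let next = poll_fn(|cx| {
        let maybe_request = ready!(rx.poll_recv(cx));
        Poll::Ready(maybe_request)
    })
    .await;

    assert_eq!(next.as_deref(), Some("ping"));
}
```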

@altkdf
Author

altkdf commented Feb 2, 2024

Oh, true, you're right - the code should work just fine like that!

Also, I just noticed something strange in my tests. Although you changed the propagated error a little, my tests still passed. But now, when I try to reproduce that, I need to adjust the expected error message. Maybe I or cargo did something wrong; let me test it again.

@altkdf
Author

altkdf commented Feb 2, 2024

So the test has now been running repeatedly for more than 20 minutes with no errors so far. Sorry for causing confusion earlier, and thanks again for implementing the fix @tikue!

@tikue
Collaborator

tikue commented Feb 2, 2024

That's great news, thanks for confirming! Yeah, I changed the client errors a little since more types of channel errors are propagated now. I might still revisit them a bit. (benefits of being perpetually pre-1.0...)

@tikue tikue closed this as completed in #423 Feb 3, 2024
gitlab-dfinity pushed a commit to dfinity/ic that referenced this issue Feb 5, 2024
chore(crypto): CRP-2380 bump `tarpc` version to `0.34`

To test a fix for one of our tests (see MR !17088 and the [github issue](google/tarpc#415) for `tarpc`), the CSP code had to be adapted to the newer `tarpc` version. Since this work was already done, we can bump the version now; it is likely that `0.34` will get a minor update containing the fix that will work without further adjustments to our code.

See merge request dfinity-lab/public/ic!17478