
On client's read errors quick subsequent requests may wait for response indefinitely #415

Closed
altkdf opened this issue Jan 10, 2024 · 9 comments · Fixed by #423

Comments

@altkdf

altkdf commented Jan 10, 2024

If I understand correctly, the dispatch task is terminated on a receive error in the transport (from `pump_read()`), and subsequent requests then return an error in

        self.to_dispatch
            .send(DispatchRequest {
                ctx,
                span,
                request_id,
                request,
                response_completion,
            })
            .await
            .map_err(|mpsc::error::SendError(_)| RpcError::Shutdown)?;

But for me it sometimes happens that, if a request is issued immediately after the error, then

        self.to_dispatch
            .send

does not return an error and instead waits indefinitely in `response_guard.response().await`. I'm not sure why this happens, since I would expect all `DispatchRequest`s to be dropped and the waiters to then notice that and return an error. It seems there is a race condition?

Unfortunately, I couldn't produce a minimal working example. A small example "just works".
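
To at least illustrate the mechanism I suspect (this is not a reproduction of the bug, and it uses a plain tokio mpsc channel rather than tarpc's types; all names here are mine):

```rust
use std::time::Duration;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Case 1: the dispatch task's receiver is dropped. `send` fails, which is
    // what the client maps to RpcError::Shutdown in the snippet above.
    let (tx, rx) = mpsc::channel::<u32>(1);
    drop(rx);
    assert!(tx.send(1).await.is_err());

    // Case 2: the dispatch task has hit an error and stopped completing
    // requests, but its receiver is still alive. A `send` in this window
    // succeeds, so the caller ends up awaiting a response that never comes.
    let (tx, mut rx) = mpsc::channel::<u32>(1);
    let stalled_dispatch = tokio::spawn(async move {
        // Simulates the errored dispatch task: it neither drains nor closes
        // the channel; keeping `rx` alive keeps the channel open.
        tokio::time::sleep(Duration::from_secs(3600)).await;
        rx.close();
    });
    assert!(tx.send(2).await.is_ok()); // no error, but nobody will answer
    stalled_dispatch.abort();
}
```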

If I add `self.pending_requests_mut().close();` to `pump_read()`, I can no longer reproduce the issue. I'm not sure whether that is an actual fix or whether it just masks the issue.

    fn pump_read(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
    ) -> Poll<Option<Result<(), ChannelError<C::Error>>>> {
        self.transport_pin_mut()
            .poll_next(cx)
            .map_err(|e| {
                let e = Arc::new(e);
                for span in self
                    .in_flight_requests()
                    .complete_all_requests(|| Err(RpcError::Receive(e.clone())))
                {
                    let _entered = span.enter();
                    tracing::info!("ReceiveError");
                }
                // (added) close the channel so new dispatch requests fail fast
                self.pending_requests_mut().close();
                ChannelError::Read(e)
            })
            .map_ok(|response| {
                self.complete(response);
            })
    }
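
For what it's worth, here is the behavior the added `close()` relies on, checked against a plain tokio mpsc channel (not tarpc's internal pending-requests channel): once the receiving half is closed, new sends fail immediately instead of queueing, although sends that already reserved a permit can still go through.

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(1);
    rx.close();
    // With the receiving half closed, `send` returns Err(SendError) right
    // away instead of handing the message to a task that will never answer.
    assert!(tx.send(7).await.is_err());
}
```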
@tikue
Collaborator

tikue commented Jan 10, 2024

Thanks for raising this issue! I'll try to take a closer look this week, but also, feel free to send a PR for review.

@tikue
Collaborator

tikue commented Jan 18, 2024

Could I see your code where the problem occurs? I'm curious how the RequestDispatch is being polled. When it hits an error, is it dropped?

@tikue
Collaborator

tikue commented Feb 1, 2024

BTW, I was able to refactor an existing test to consistently trigger this race: master...request_dispatch_race

@tikue
Collaborator

tikue commented Feb 2, 2024

Can you see if this branch fixes the problem for you? https://github.com/tikue/tarpc/tree/request_dispatch_race

@altkdf
Author

altkdf commented Feb 2, 2024

Can you see if this branch fixes the problem for you? https://github.com/tikue/tarpc/tree/request_dispatch_race

Thanks a lot for implementing it @tikue! I tried it out and, unfortunately, it still results in the same problem. But your code LGTM in general. So I suspect this may be due to the use of `ready!` here, which may return early on spurious failures and, I think, returns something other than the desired `return Poll::Ready(Err(e));`. Also note that the PR I created consistently does not show this problem.

@tikue
Collaborator

tikue commented Feb 2, 2024

Hm, but the `ready!` macro only returns `Poll::Pending` if the underlying future being polled also returns `Pending`. The dispatch future should continue to be polled after that, right? If it's a spurious failure, then the mpsc receiver should have already arranged for a wakeup.

The problem with polling in a loop even when `Pending` is returned is that it blocks inside a nonblocking function. For example, a sender may have reserved a permit and then gone to sleep for an hour. We don't want to block a Tokio thread for an hour waiting for that client to wake up, as there could be other async tasks that need to run in the meantime.
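
As a toy illustration of that contract (plain std/tokio primitives, nothing tarpc-specific): `ready!` only short-circuits with `Pending` when the inner poll is `Pending`, and by that point `poll_recv` has registered the waker, so the task is simply re-polled later rather than spinning.

```rust
use std::future::poll_fn;
use std::task::{ready, Poll};
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel::<String>();
    tx.send("ping".to_string()).unwrap();

    // `ready!` returns Pending from this closure only if `poll_recv` is
    // Pending; in that case the receiver has already stored the waker from
    // `cx`, so the future is woken and polled again later. No busy loop, so
    // no executor thread is blocked inside a non-blocking poll function.
    let next = poll_fn(|cx| {
        let maybe_request = ready!(rx.poll_recv(cx));
        Poll::Ready(maybe_request)
    })
    .await;

    assert_eq!(next.as_deref(), Some("ping"));
}
```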

@altkdf
Author

altkdf commented Feb 2, 2024

Oh, true, you're right - the code should work just fine like that!

Also, I just noticed something strange in my tests. Although you changed the propagated error a little, my tests still passed. But now, when I try to reproduce that, I need to adjust the expected error message. Maybe I or cargo did something wrong; let me test it again.

@altkdf
Author

altkdf commented Feb 2, 2024

So the test has now been running repeatedly for more than 20 minutes with no errors so far. Sorry for causing confusion earlier, and thanks again for implementing the fix @tikue!

@tikue
Collaborator

tikue commented Feb 2, 2024

That's great news, thanks for confirming! Yeah, I changed the client errors a little since more types of channel errors are propagated now. I might still revisit them a bit. (benefits of being perpetually pre-1.0...)

@tikue tikue closed this as completed in #423 Feb 3, 2024
gitlab-dfinity pushed a commit to dfinity/ic that referenced this issue Feb 5, 2024
chore(crypto): CRP-2380 bump `tarpc` version to `0.34`

To test a fix for one of our tests (see MR !17088 and the [github issue](google/tarpc#415) for `tarpc`), the CSP code had to be adapted to the newer `tarpc` version. Since this work was already done, we can bump the version now; it is likely that `0.34` will get a minor update containing the fix that will work without further adjustments to our code.

See merge request dfinity-lab/public/ic!17478