Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client goes out of sync when pgbouncer returns ReadyForQuery for CancelRequest #141

Closed
cpatru-zz opened this issue Aug 11, 2016 · 2 comments

Comments

@cpatru-zz
Copy link

cpatru-zz commented Aug 11, 2016

What I observed is that sometimes clients get out of sync and start reading the result of the previous query. This doesn't happen deterministically but it's roughly this sequence:

    1. A query will timeout.
    2. A cancel request is sent due to timeout.
    3. The query eventually completes.
    4. A new query is send and the response is a plain `ReadyForQuery` (result is not there).
    5. The next query receives the result of the previous query.

I wrote this java console app that can reproduce the issue. However, I couldn't get it to fail with local pgbouncer and postgres. It's only reproducing when pgbouncer and postgres are on separate boxes with arround 0.3ms latency.

This is what the client, using pgjdbc, is reading off the buffer (left). You can see the ReadyForQuery being received in the middle of a query being executed, as a result of the CancelRequest (sent on a separate connection). On the right, is what pgbouncer is observing.

pgbouncer_readyforquery

This only happening when hitting pgbouncer. It didn't repro when connecting to postgres directly.

CancelRequests are sent through a separate connection on the client. Pgbouncer handles them by trying to find the original connection that the CancelRequest is referring to and then it might decide to send a ReadyForQuery message on that connection. It only does this if it considers that client connection not linked.

I find the behaviour pgbouncer exhibits for CancelRequest not inline with the postgres protocol documentation and may be breaking the correctness of the protocol.

I've commented this line in pgbouncer and the problem went away.

So it is a race condition that's making pgbouncer think that the client connection is not linked as shown in this diagram.

pgbouncer_cancel_request_flow

It became more clear that this has been a problem in the past when I saw this fix. But now I realise that it only fixed the problem for a subset of cases where the clients were lucky enough to have their message received by pgbouncer. All other cases where the request has been sent by the client but not received the pgbouncer are still broken. And I think this problem cannot be fixed.

The question is, why was this behaviour introduced and what is it trying to optimise?

@cpatru-zz
Copy link
Author

Here's the PR with the fix. @markokr

#147

@PJMODOS
Copy link
Contributor

PJMODOS commented Nov 7, 2016

Merged the patch, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants