Handle race condition between cancel requests and server reuse #717

Merged
JelteF merged 1 commit into pgbouncer:master on Nov 17, 2022

Conversation

@JelteF JelteF (Member) commented May 12, 2022

Cancel requests happen out of band with regular queries. Because
PgBouncer reuses server connections across clients, it's possible to get
race conditions here if you're not careful.

Without this patch there were two possibilities for a race condition
where a cancel request for client A could cancel a query that was sent
by client B. These race conditions happened when:

  1. A cancel request from client A was forwarded to the server only after
    the server was reused by another client.
  2. A cancel request for client A was already forwarded to the server,
    but not yet handled by the server, and the server was then reused for
    client B.

The first race condition was especially likely if connection
establishment takes a while, for example because of TLS round trips,
since a new connection needs to be established for every forwarded
cancellation request. While this connection is being established, the
query running on the server might finish on its own, and the server
could then be reused by another client that was already waiting.

To resolve these race conditions, the following two fixes are implemented
(a minimal sketch of fix 1 follows the list):

  1. We now only forward the cancel request if the server connection over
    which the query was sent hasn't become idle or disconnected in the
    meantime. This is done by doubly linking cancel requests and the
    original server connection that was used for the query.
  2. Once a cancel request is forwarded to the server, we do not allow reuse of
    the server until the cancel request completes. This required tracking
    the server's response to cancel requests too, instead of closing the
    connection right after forwarding the cancel.
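
A minimal sketch of the linking behind fix 1, using illustrative names rather than PgBouncer's actual structs (`struct Connection` and its fields are assumptions made for this sketch):

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative, not PgBouncer's real layout: a pending cancel request
 * and the server connection running the query to be canceled point at
 * each other. */
struct Connection {
	struct Connection *canceled_server; /* set on the cancel request */
	struct Connection *cancel_request;  /* set on the linked server */
	bool active;                        /* server still runs the query */
};

/* When the server goes idle or disconnects, break the link so the
 * cancel request gets dropped instead of forwarded. */
static void unlink_cancel(struct Connection *server)
{
	if (server->cancel_request) {
		server->cancel_request->canceled_server = NULL;
		server->cancel_request = NULL;
	}
}

/* Fix 1: only forward the cancel while the link is still intact. */
static bool may_forward_cancel(const struct Connection *cancel)
{
	return cancel->canceled_server != NULL
		&& cancel->canceled_server->active;
}
```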

Fixes #544

In passing, this also adds more comprehensive comments to the different
StatLists in the PgPool struct, because understanding how the state
machine worked was the part of this PR that took me the longest.

@JelteF JelteF force-pushed the fix-cancel-race-condition branch 3 times, most recently from 1fa16c4 to 3891b64 on May 12, 2022
@JelteF JelteF force-pushed the fix-cancel-race-condition branch 4 times, most recently from 293ff31 to bec7e42 on June 3, 2022
@JelteF JelteF force-pushed the fix-cancel-race-condition branch from bec7e42 to d6bc244 on June 15, 2022
@x4m x4m commented Jul 18, 2022

I'm not really familiar with PgBouncer's code, but in Odyssey we generate a cancellation key for each client, so a client cannot cancel someone else's server connection. Does this resolve the issue?

Anyway, the behavior of cancels is somewhat problematic. Sometimes the server's CPU is starving due to a large number of fork()s. Clients start to cancel their queries, but cancels produce even more fork()s! And ISTM this PR addresses that issue, doesn't it?

@JelteF JelteF (Member, Author) commented Jul 18, 2022

@x4m PgBouncer also generates unique cancellation tokens per client, so that alone won't protect you against the same issue in Odyssey.

To clarify the issue: these unique cancellation tokens, when received by the connection pooler, are handled by sending a cancellation token to the actual Postgres server. This token should be the one that matches the Postgres connection on which the client's query is currently running. The problem is that while this cancellation token is being sent to Postgres, the query on that connection might finish, and the Postgres connection might be reused for a query from another client. If that's possible in Odyssey, then you almost certainly have the same bug.

One easy thing to try is to run the test I wrote for this PR against Odyssey and see what happens.

> Anyway, the behavior of cancels is somewhat problematic. Sometimes the server's CPU is starving due to a large number of fork()s. Clients start to cancel their queries, but cancels produce even more fork()s! And ISTM this PR addresses that issue, doesn't it?

I'm not sure why you think this PR addresses that issue. It's definitely not the purpose of the PR, but it might reduce cancellation load as a side effect.

@JelteF JelteF force-pushed the fix-cancel-race-condition branch from d6bc244 to 7d4ba39 on July 25, 2022
Review comments on include/bouncer.h and src/objects.c (resolved)
@sumedhpathak sumedhpathak commented

The other comment we discussed was to add this behind a feature flag, so we could toggle it off in case there is an issue.

@JelteF JelteF force-pushed the fix-cancel-race-condition branch 6 times, most recently from 1b6bf02 to a24d8bd on August 15, 2022
@sumedhpathak sumedhpathak left a comment

LGTM!

@JelteF JelteF force-pushed the fix-cancel-race-condition branch 2 times, most recently from b6a23af to 414028d on August 31, 2022
@JelteF JelteF merged commit ea4fb5f into pgbouncer:master on Nov 17, 2022
sxd pushed a commit to PascalBourdier/cloudnative-pg that referenced this pull request Dec 20, 2022
according pgbouncer/pgbouncer#717

Signed-off-by: Pascal Bourdier <pascal.bourdier@gmail.com>
JelteF added a commit to JelteF/pgbouncer that referenced this pull request Mar 13, 2023
JelteF added a commit that referenced this pull request Mar 24, 2023
Fixes a bug introduced by #717

Related to #801 (and hopefully fixes it)
JelteF added a commit to JelteF/pgbouncer that referenced this pull request Apr 21, 2023
Cancellations and regular queries are quite different, so it makes sense
to have different (default) timeouts for cancellation requests than for
queries. While I already think our default query_wait_timeout of 120
seconds is excessively long, it makes no sense at all to have such a
long timeout by default for cancellation requests that are not sent. So
this introduces a new `cancel_wait_timeout`, which is set to 10 seconds
by default. If the cancellation request cannot be forwarded within that
time, it's probably better to drop it.

One reason why it's important to close unhandled cancellation requests
relatively quickly is that libpq only allows the client to send a cancel
in a blocking way. So there's no way to bail out after a certain number
of seconds on the client side if PgBouncer keeps the connection to the
client open. Both Marco and I noticed this when testing my PgBouncer
peering PR (pgbouncer#666), when an entry in the `[peers]` list pointed
to an unreachable host. That meant that PgBouncer would keep the cancel
connection open, and even though the query had finished, `psql` would
not display the result because it was stuck waiting for a response to
the cancel request.

To avoid race conditions, cancellation requests that have already been
forwarded but not yet answered are not timed out. See pgbouncer#717 for
details on such race conditions.
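
For reference, the new setting would live in the `[pgbouncer]` section of pgbouncer.ini; the value shown is the default named in this commit message, and the snippet is a sketch rather than canonical documentation:

```ini
[pgbouncer]
; Drop a cancellation request if it cannot be forwarded to the
; server within this many seconds.
cancel_wait_timeout = 10
```
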
JelteF added a commit to JelteF/pgbouncer that referenced this pull request Apr 24, 2023
JelteF added a commit that referenced this pull request Apr 25, 2023
JelteF added a commit to JelteF/pgbouncer that referenced this pull request May 25, 2023
In most cases the Postgres server will not send anything back in
response to a cancel message; it simply closes the connection. It won't
even report an error if the cancellation key is incorrect or the PID is
unknown. However, it is possible that it sends some other type of error
message, such as an OOM error. Libpq actually ignores those errors and
won't show them to the user, but they might still be sent by the server.
Because of the changes in pgbouncer#717 this could cause a FATAL error.
This fixes that by handling `SV_ACTIVE_CANCEL` the same as `SV_ACTIVE`
in `server_proto`.
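
A minimal sketch of the fix, assuming simplified names; only the `SV_ACTIVE`/`SV_ACTIVE_CANCEL` case labels and the `server_proto` behavior come from the commit message:

```c
#include <stdbool.h>

enum SocketState { SV_IDLE, SV_ACTIVE, SV_ACTIVE_CANCEL, SV_BEING_CANCELED };

/* After the fix, a server that is busy sending a cancel request is
 * treated like any other active server, so an unexpected error packet
 * no longer triggers a FATAL "server in bad state". */
static bool server_packet_expected(enum SocketState state)
{
	switch (state) {
	case SV_ACTIVE:
	case SV_ACTIVE_CANCEL:	/* the fix: same handling as SV_ACTIVE */
		return true;
	default:
		return false;
	}
}
```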

Fixes pgbouncer#847

Related to pgbouncer#717, pgbouncer#801, and pgbouncer#815
JelteF added a commit that referenced this pull request May 30, 2023
JelteF added a commit to JelteF/pgbouncer that referenced this pull request May 30, 2023
JelteF added a commit that referenced this pull request May 31, 2023
JelteF added a commit to JelteF/pgbouncer that referenced this pull request Aug 17, 2023
The cancel request protocol is very simple (a sketch of the packet
format follows the list):
1. The client sends a cancel key
2. The server cancels the query running on the matching backend
3. The server closes the socket to indicate the cancel request was
   handled
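
For context, the cancel key travels in a PostgreSQL CancelRequest packet, which is simply four big-endian 32-bit integers; the helper below is a hypothetical sketch of building one:

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Build a CancelRequest packet: length, magic request code, then the
 * PID/secret cancel key handed out at connection start (in PgBouncer's
 * case, the per-client key it generated itself). */
static size_t build_cancel_request(uint8_t buf[16],
                                   uint32_t backend_pid,
                                   uint32_t secret_key)
{
	uint32_t fields[4] = {
		htonl(16),          /* total packet length, including itself */
		htonl(80877102),    /* CancelRequest code: 1234 << 16 | 5678 */
		htonl(backend_pid),
		htonl(secret_key),
	};
	memcpy(buf, fields, sizeof(fields));
	return sizeof(fields);
}
```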

At no point after sending the cancel key is the client required to send
any data, and if it does, it won't get a response anyway because the
server would close the connection. So most clients don't; specifically,
no libpq-based clients do.

The Go client, however, sends a Terminate ('X') packet before closing the
cancel socket from its own side (usually due to a 10-second timeout).
This would cause a crash in PgBouncer, because PgBouncer would still
listen on the client socket, but on receiving any data we would report a
fatal error. This PR fixes that fatal error by no longer listening on the
client socket once we've received the cancel key (i.e. when we change
the client state to CL_WAITING_CANCEL).
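
A sketch of the fix under simplified, assumed names (`CL_WAITING_CANCEL` is from the PR; `struct Client` and `stop_listening` are made up for illustration):

```c
/* Illustrative types; PgBouncer's real client objects differ. */
struct Client {
	int state;
	int fd;
};

enum { CL_WAITING_CANCEL = 1 };

/* Hypothetical stub: a real event loop would drop fd from its read set. */
static void stop_listening(int fd)
{
	(void)fd;
}

/* Once the full cancel key has been read, nothing the client sends
 * matters anymore, so stop watching the socket instead of treating
 * later packets (like Go's Terminate) as a fatal protocol error. */
static void received_cancel_key(struct Client *client)
{
	client->state = CL_WAITING_CANCEL;
	stop_listening(client->fd);
}
```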

I was able to reproduce both of the errors reported by @raymasson and
confirmed that after this change the errors would not occur again.

Reproducing this with libpq-based clients is not possible, except by
manually attaching to the client using `gdb` and manually running `send`
on the cancel request's socket. So no test is added, since our test
suite uses psycopg, which uses libpq under the hood.

Fixes pgbouncer#904

Related to pgbouncer#717
JelteF added a commit that referenced this pull request Aug 18, 2023
JelteF added a commit that referenced this pull request Aug 18, 2023
Fixes #822
Related to #904 and #717

In #904, @tnt-dev reported still hitting the issue from #822 on 1.19.1 (where
it was expected to be fixed).

```
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Main process exited, code=exited, status=1/FAILURE
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Failed with result 'exit-code'.
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Consumed 1h 51min 3.308s CPU time.
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 pgbouncer[4920]: started sending cancel request
```

Looking again, I now understand that the switch in question was missing
the SV_BEING_CANCELED case. I have now added it to the same case list
that SV_IDLE was part of, since SV_BEING_CANCELED is effectively the
same state as SV_IDLE (except that we don't allow reuse of the server yet).

I did the same in a second switch statement; for that one, only warnings
would have been shown instead of a fatal error being caused, but it still
seems good to avoid unnecessary warnings.
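
A sketch of the change with simplified names; the case labels match the description above, everything else is illustrative:

```c
#include <stdbool.h>

enum SocketState { SV_IDLE, SV_ACTIVE, SV_BEING_CANCELED };

/* After the fix, a server whose query finished while its cancel
 * request is still in flight is handled like an idle server, instead
 * of hitting the "server in bad state" fatal error. */
static bool idle_like(enum SocketState state)
{
	switch (state) {
	case SV_IDLE:
	case SV_BEING_CANCELED:	/* added: idle, but not reusable yet */
		return true;
	default:
		return false;
	}
}
```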

One example where this fatal error might occur is when PgBouncer is
still waiting for a response to the cancel request it sent to the
server, but the query that's being canceled completes before that
happens. This puts the server in SV_BEING_CANCELED. Then, for whatever
reason, the server sends a NOTICE or ERROR message (still before the
cancel request receives a response from Postgres).

I was able to reproduce this (and confirm that the patch resolves it) by
adding some fake latency between PgBouncer and Postgres, and stopping
Postgres while the server was in the SV_BEING_CANCELED state.