
FATAL @src/client.c:1092 in function client_proto(): bad client state: 6 #904

Closed
raymasson opened this issue Jul 25, 2023 · 6 comments · Fixed by #928
Labels: bug (something is broken)

Comments

@raymasson

Hello,

I have pgbouncer v1.19.1 in front of 10 DBs.
When I add a client that connects to one of the DBs, I start getting one of the following errors and pgbouncer.service exits:
FATAL @src/client.c:1092 in function client_proto(): bad client state: 6
FATAL @src/client.c:1092 in function client_proto(): bad client state: 7

The new client (a Golang service) just connects to one of the DBs through pgbouncer but does not run any queries.
The client states 6 and 7 are CL_WAITING_CANCEL and CL_ACTIVE_CANCEL.
How can a client end up in one of these states, and why would it make the pgbouncer systemd service crash?
Is it just a matter of pgbouncer config (increasing max_client_conn and the file descriptor ulimit)?

Do you have an idea of what could be causing this issue?

Thank you very much

JelteF added the bug label on Jul 25, 2023
JelteF (Member) commented Jul 25, 2023

Hmm, that's not good. That seems like the 4th crash of this type caused by the introduction of these states in #717.

These states are for query cancelation requests, and in theory you don't need to have sent a query to be able to send a cancelation request. In Go those usually get sent when the cancel function of a context.Context is called (and that ctx is passed to the Postgres driver).
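
For illustration, a minimal Go sketch (not from the original report; the DSN, driver choice, and query are placeholders) of how such a cancel request gets triggered without the application ever sending one explicitly:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // any Postgres driver that honours context cancellation
)

func main() {
	// Placeholder DSN; in the reported setup this would point at PgBouncer.
	db, err := sql.Open("postgres", "host=127.0.0.1 port=5432 dbname=mydb user=myuser sslmode=require")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Cancel the context while a query is still in flight.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	// When the timeout fires (or cancel() is called), the driver opens a
	// separate connection and sends a cancel request for this query; the
	// application code never issues one itself.
	if _, err := db.ExecContext(ctx, "SELECT pg_sleep(10)"); err != nil {
		log.Println("query ended:", err)
	}
}
```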

I took a quick look at the file and line number mentioned in the error, but I'm honestly not sure how one of these states could end up there:

  1. Could you share a bit more info on how this occurs?
  2. Do you have an easy way of reproducing them?
  3. Could you share a bit of logs of the stuff that happened before these errors?
  4. Is it possible to reproduce this in a staging environment and set verbose=3 in the pgbouncer config, to get much more detailed logs?

raymasson (Author) commented Jul 26, 2023

  1. It happens only on PROD unfortunately. I have the same pgbouncer config on STG and INT but it does not happen on these environments:
```
[pgbouncer]
logfile = /var/log/postgresql/pgbouncer.log
pidfile = /var/run/postgresql/pgbouncer.pid

listen_addr = *
listen_port = 5432

unix_socket_dir = /var/run/postgresql

auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
auth_query = SELECT * FROM pgbouncer.get_auth($1)

admin_users = postgres
stats_users = pgbouncer

pool_mode = transaction
application_name_add_host = 1

server_reset_query = DISCARD ALL

max_client_conn = 500
default_pool_size = 20
min_pool_size = 0
reserve_pool_size = 5
reserve_pool_timeout = 5.0
server_idle_timeout = 30
dns_max_ttl = 15.0
dns_zone_check_period = 0
dns_nxdomain_ttl = 15.0

client_tls_key_file = /etc/vpcloud/certs/sa-patroni-main-eufh-fr1.prd.ops.vptech.eu_key.pem
client_tls_cert_file = /etc/vpcloud/certs/sa-patroni-main-eufh-fr1.prd.ops.vptech.eu_crt.pem
client_tls_sslmode = require
client_tls_protocols = secure
client_tls_ciphers = fast
server_tls_sslmode = verify-full
server_tls_ca_file = /etc/ssl/certs/ca-certificates.crt
server_tls_protocols = secure
server_tls_ciphers = fast

ignore_startup_parameters = extra_float_digits,statement_timeout

%include /etc/pgbouncer/pgbouncer.database.ini
```

Recently a new DB was added on all environments, and a Golang app that connects to this new DB was also deployed on all envs.
Since then, the pgbouncer systemd service started to crash on PROD only, and only during business hours; there were no crashes during the weekend, for example. That makes me think it is related to the traffic we have on PROD, because it does not crash on the test envs, nor did it on PROD during the weekend.
I have uninstalled the new app that connects to the new database from PROD and pgbouncer does not crash anymore.

  2. It is kind of hard to reproduce as it happens only on PROD.
  3. Here are more logs:
Jul 21 13:10:54 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu pgbouncer[15255]: S-0x55a94c06df40: supplier_return/admin_sa@xx.xx.xx.xx:xxxx new connection to server (from xx.xx.xx.xx:xxxx)
Jul 21 13:10:54 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu pgbouncer[15255]: started sending cancel request
Jul 21 13:10:54 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu pgbouncer[15255]: C-0x55a94d174310: supplier_return/(nouser)@xx.xx.xx.xx:xxxx closing because: client unexpected eof (age=0s)
Jul 21 13:10:54 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu pgbouncer[15255]: @src/client.c:1092 in function client_proto(): bad client state: 7
Jul 21 13:10:54 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu systemd[1]: pgbouncer.service: Main process exited, code=exited, status=1/FAILURE
Jul 21 13:10:54 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu systemd[1]: pgbouncer.service: Failed with result 'exit-code'.
Jul 21 13:27:06 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu systemd[1]: Starting connection pooler for PostgreSQL...
Jul 21 13:27:06 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu pgbouncer[2293]: kernel file descriptor limit: 1024 (hard: 524288); max_client_conn: 500, max expected fd use: 712
Jul 21 13:27:06 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu pgbouncer[2293]: listening on 0.0.0.0:5432
Jul 21 13:27:06 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu pgbouncer[2293]: listening on [::]:5432
Jul 21 06:17:27 sa-patroni-main-eufh-fr1 pgbouncer[25114]: S-0x556f382cee80: importer/admin_sa@xx.xx.xx.xx:xxxx SSL established: TLSv1.3/TLS_AES_256_GCM_SHA384/ECDH=prime256v1
Jul 21 06:17:29 sa-patroni-main-eufh-fr1 pgbouncer[25114]: C-0x556f37231cc0: pgbouncer/pgbouncer@xx.xx.xx.xx:xxxx login attempt: db=pgbouncer user=pgbouncer tls=TLSv1.3/TLS_AES_256_GCM_SHA384
Jul 21 06:17:29 sa-patroni-main-eufh-fr1 pgbouncer[25114]: C-0x556f37231cc0: pgbouncer/pgbouncer@xx.xx.xx.xx:xxxx closing because: client close request (age=0s)
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 pgbouncer[25114]: @src/client.c:1092 in function client_proto(): bad client state: 6
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 patroni[7992]: 2023-07-21 06:17:31.487 UTC [3973] LOG:  could not receive data from client: Connection reset by peer
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 systemd[1]: pgbouncer.service: Main process exited, code=exited, status=1/FAILURE
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 patroni[7992]: 2023-07-21 06:17:31.487 UTC [3972] LOG:  could not receive data from client: Connection reset by peer
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 systemd[1]: pgbouncer.service: Failed with result 'exit-code'.
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 patroni[7992]: 2023-07-21 06:17:31.487 UTC [3971] LOG:  could not receive data from client: Connection reset by peer
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 patroni[7992]: 2023-07-21 06:17:31.487 UTC [3970] LOG:  could not receive data from client: Connection reset by peer
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 patroni[7992]: 2023-07-21 06:17:31.487 UTC [3876] LOG:  could not receive data from client: Connection reset by peer
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 patroni[7992]: 2023-07-21 06:17:31.487 UTC [3877] LOG:  could not receive data from client: Connection reset by peer
  4. I have simulated a lot of traffic on STAGING and could not reproduce the issue.

Thank you

tnt-dev commented Aug 16, 2023

We ran into a similar problem on PgBouncer 1.19.1 a couple of days ago:

2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Main process exited, code=exited, status=1/FAILURE
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Failed with result 'exit-code'.
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Consumed 1h 51min 3.308s CPU time.
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 pgbouncer[4920]: started sending cancel request

It looks like #822, which has been marked as a duplicate of another (already fixed) issue.

JelteF added a commit to JelteF/pgbouncer that referenced this issue Aug 16, 2023
Fixes pgbouncer#822
Related to pgbouncer#904

In pgbouncer#904 @tnt-dev reported still hitting the issue from pgbouncer#822 on
1.19.1 (where it was expected to be fixed).

```
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Main process exited, code=exited, status=1/FAILURE
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Failed with result 'exit-code'.
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Consumed 1h 51min 3.308s CPU time.
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 pgbouncer[4920]: started sending cancel request
```

Looking again, I now understand that the switch in question was missing
the SV_BEING_CANCELED case. I added this case to the same case list
that SV_IDLE was part of, since SV_BEING_CANCELED is effectively the
same state as SV_IDLE (except we don't allow reuse of the server yet).

I did the same in a second switch; for that one only warnings would
have been shown instead of a fatal error, but it still seems good to
avoid unnecessary warnings.
JelteF (Member) commented Aug 16, 2023

Thanks for the report @tnt-dev. This is a different issue from the one reported by @raymasson; I created a PR to fix it in #927. For the issue created by @raymasson I still don't know the exact cause or right fix.

JelteF (Member) commented Aug 16, 2023

> For the issue created by @raymasson I still don't know the exact cause or right fix.

Actually I think I figured out what's happening and also how to correctly fix it. I'll try to reproduce tomorrow with a small Go program and confirm that the fix I have in mind works.

JelteF added a commit to JelteF/pgbouncer that referenced this issue Aug 17, 2023
The cancel request protocol is very simple:
1. Clients send a cancel key
2. Server cancels the query running on the matching backend
3. The server closes the socket to indicate the cancel request was
   handled

At no point after sending the cancel key is the client required to send
any data, and if it does, it won't get a response anyway because the
server would close the connection. So most clients don't; specifically,
all libpq-based clients don't.

The Go client, however, sends a Terminate ('X') packet before closing the
cancel socket from its own side (usually due to a 10 second timeout).
This would cause a crash in PgBouncer, because PgBouncer would still
listen on the client socket, but on receiving any data we would report a
fatal error. This PR fixes that fatal error by no longer listening on the
client socket once we've received the cancel key (i.e. when we change
the client state to CL_WAITING_CANCEL).

I was able to reproduce both of the errors reported by @raymasson and
confirmed that after this change the errors would not occur again.

Reproducing this with libpq-based clients is not possible, except by
manually attaching to the client using `gdb` and manually calling `send`
on the cancel request's socket. So no test is added, since our test
suite uses psycopg, which uses libpq under the hood.

Fixes pgbouncer#904

Related to pgbouncer#717
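
To make the cancel request protocol described in the commit message above concrete, here is a minimal Go sketch of the wire exchange over a raw connection (assumptions: plain TCP with no TLS, and placeholder address, pid, and secret key; in a real client those last two come from the BackendKeyData message received at startup):

```go
package main

import (
	"encoding/binary"
	"io"
	"log"
	"net"
)

// sendCancelRequest follows the three protocol steps: send the cancel key,
// let the server cancel the matching backend, then wait for the server to
// close the socket.
func sendCancelRequest(addr string, pid, secretKey uint32) error {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()

	// CancelRequest message: Int32 length (16), Int32 request code 80877102,
	// Int32 backend pid, Int32 secret key.
	buf := make([]byte, 16)
	binary.BigEndian.PutUint32(buf[0:4], 16)
	binary.BigEndian.PutUint32(buf[4:8], 80877102)
	binary.BigEndian.PutUint32(buf[8:12], pid)
	binary.BigEndian.PutUint32(buf[12:16], secretKey)
	if _, err := conn.Write(buf); err != nil {
		return err
	}

	// No response is ever sent; the server (or PgBouncer) signals completion
	// by closing the connection, so we only wait for EOF here. A client that
	// instead writes a Terminate ('X') packet before closing its side is the
	// behaviour that triggered the FATAL reported in this issue.
	_, err = io.Copy(io.Discard, conn)
	return err
}

func main() {
	// Address and cancel key values are for illustration only.
	if err := sendCancelRequest("127.0.0.1:5432", 12345, 67890); err != nil {
		log.Fatal(err)
	}
}
```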
JelteF added a commit that referenced this issue Aug 18, 2023
Fixes #822
Related to #904 and #717

In #904 @tnt-dev reported still hitting the issue from #822 on 1.19.1 (where
it was expected to be fixed).

```
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Main process exited, code=exited, status=1/FAILURE
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Failed with result 'exit-code'.
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Consumed 1h 51min 3.308s CPU time.
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 pgbouncer[4920]: started sending cancel request
```

Looking again, I now understand that the switch in question was missing
the SV_BEING_CANCELED case. I now added it to the same case list
that SV_IDLE was part of, since SV_BEING_CANCELED is effectively the
same state as SV_IDLE (except we don't allow reuse of the server yet).

I did the same in a second switch; for that one only warnings would
have been shown instead of a fatal error, but it still seems good to
avoid unnecessary warnings.

One example where this fatal error might occur: PgBouncer is still
waiting for a response to the cancel request it sent to the server,
but the query that's being canceled completes before that happens.
This puts the server in SV_BEING_CANCELED. Then, for whatever
reason, the server sends a NOTICE or ERROR message (still before
the cancel request receives a response from Postgres).

I was able to reproduce this (and confirm that the patch resolves it) by adding
some fake latency between PgBouncer and Postgres and then stopping Postgres
while the server was in the SV_BEING_CANCELED state.
JelteF (Member) commented Aug 18, 2023

@tnt-dev @raymasson I merged both fixes. If you're able to, please try out the current master branch to confirm that it resolves the issue for you.
