
FATAL @src/client.c:1092 in function client_proto(): bad client state: 6 #904

Closed
raymasson opened this issue Jul 25, 2023 · 6 comments · Fixed by #928
Labels: bug (something is broken)

Comments

@raymasson

Hello,

I have pgbouncer v1.19.1 in front of 10 DBs.
When I add a client that connects to one of the DBs, I start getting one of the following errors and pgbouncer.service exits:
FATAL @src/client.c:1092 in function client_proto(): bad client state: 6
FATAL @src/client.c:1092 in function client_proto(): bad client state: 7

The new client (a Golang service) just connects to one of the DBs through pgbouncer but does not run any queries.
The client states 6 and 7 are CL_WAITING_CANCEL and CL_ACTIVE_CANCEL.
How can a client end up in one of these states, and why would it make the pgbouncer systemd service crash?
Is it just a matter of pgbouncer config (increasing max_client_conn and the file descriptor ulimit)?

Do you have an idea of what could be causing this issue?

Thank you very much

JelteF added the bug label on Jul 25, 2023
JelteF (Member) commented Jul 25, 2023

Hmm, that's not good. That seems like the 4th crash of this type caused by the introduction of these states in #717.

These states are for query cancelation requests, and in theory you don't need to have sent a query to be able to send a cancelation request. In Go those usually get sent when the cancel function of a context.Context is called (and that ctx is passed to the Postgres driver).
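
For illustration, a minimal Go sketch (not from the original report; the DSN, driver choice, and query are placeholders) of how such a cancel request gets triggered without the application ever sending one explicitly:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // any Postgres driver that honours context cancellation
)

func main() {
	// Placeholder DSN; in the reported setup this would point at PgBouncer.
	db, err := sql.Open("postgres", "host=127.0.0.1 port=5432 dbname=mydb user=myuser sslmode=require")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Cancel the context while a query is still in flight.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	// When the timeout fires (or cancel() is called), the driver opens a
	// separate connection and sends a cancel request for this query; the
	// application code never issues one itself.
	if _, err := db.ExecContext(ctx, "SELECT pg_sleep(10)"); err != nil {
		log.Println("query ended:", err)
	}
}
```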

I took a quick look at the file and line number mentioned in the error, but I'm honestly not sure how one of these states could end up there:

  1. Could you share a bit more info on how this occurs?
  2. Do you have an easy way of reproducing them?
  3. Could you share a bit of logs of the stuff that happened before these errors?
  4. Is it possible to reproduce this in a staging environment and set verbose=3 in the pgbouncer config, to get much more detailed logs?

raymasson (Author) commented Jul 26, 2023

  1. It happens only on PROD unfortunately. I have the same pgbouncer config on STG and INT but it does not happen on these environments:
```
[pgbouncer]
logfile = /var/log/postgresql/pgbouncer.log
pidfile = /var/run/postgresql/pgbouncer.pid

listen_addr = *
listen_port = 5432

unix_socket_dir = /var/run/postgresql

auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
auth_query = SELECT * FROM pgbouncer.get_auth($1)

admin_users = postgres
stats_users = pgbouncer

pool_mode = transaction
application_name_add_host = 1

server_reset_query = DISCARD ALL

max_client_conn = 500
default_pool_size = 20
min_pool_size = 0
reserve_pool_size = 5
reserve_pool_timeout = 5.0
server_idle_timeout = 30
dns_max_ttl = 15.0
dns_zone_check_period = 0
dns_nxdomain_ttl = 15.0

client_tls_key_file = /etc/vpcloud/certs/sa-patroni-main-eufh-fr1.prd.ops.vptech.eu_key.pem
client_tls_cert_file = /etc/vpcloud/certs/sa-patroni-main-eufh-fr1.prd.ops.vptech.eu_crt.pem
client_tls_sslmode = require
client_tls_protocols = secure
client_tls_ciphers = fast
server_tls_sslmode = verify-full
server_tls_ca_file = /etc/ssl/certs/ca-certificates.crt
server_tls_protocols = secure
server_tls_ciphers = fast

ignore_startup_parameters = extra_float_digits,statement_timeout

%include /etc/pgbouncer/pgbouncer.database.ini
```

Recently a new DB was added on all environments, and a Golang app that connects to this new DB was also deployed on all envs.
Since then, the pgbouncer systemd service started to crash on PROD only, and only during business hours; there were no crashes during the weekend, for example. That makes me think it is related to the traffic we have on PROD, because it does not crash on the test envs, nor did it on PROD during the weekend.
I have uninstalled the new app that connects to the new database from PROD and pgbouncer does not crash anymore.

  2. It is kind of hard to reproduce as it happens only on PROD.
  3. Here are more logs:
Jul 21 13:10:54 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu pgbouncer[15255]: S-0x55a94c06df40: supplier_return/admin_sa@xx.xx.xx.xx:xxxx new connection to server (from xx.xx.xx.xx:xxxx)
Jul 21 13:10:54 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu pgbouncer[15255]: started sending cancel request
Jul 21 13:10:54 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu pgbouncer[15255]: C-0x55a94d174310: supplier_return/(nouser)@xx.xx.xx.xx:xxxx closing because: client unexpected eof (age=0s)
Jul 21 13:10:54 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu pgbouncer[15255]: @src/client.c:1092 in function client_proto(): bad client state: 7
Jul 21 13:10:54 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu systemd[1]: pgbouncer.service: Main process exited, code=exited, status=1/FAILURE
Jul 21 13:10:54 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu systemd[1]: pgbouncer.service: Failed with result 'exit-code'.
Jul 21 13:27:06 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu systemd[1]: Starting connection pooler for PostgreSQL...
Jul 21 13:27:06 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu pgbouncer[2293]: kernel file descriptor limit: 1024 (hard: 524288); max_client_conn: 500, max expected fd use: 712
Jul 21 13:27:06 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu pgbouncer[2293]: listening on 0.0.0.0:5432
Jul 21 13:27:06 sa-patroni-main-eufh-fr1.prd.ops.vptech.eu pgbouncer[2293]: listening on [::]:5432
Jul 21 06:17:27 sa-patroni-main-eufh-fr1 pgbouncer[25114]: S-0x556f382cee80: importer/admin_sa@xx.xx.xx.xx:xxxx SSL established: TLSv1.3/TLS_AES_256_GCM_SHA384/ECDH=prime256v1
Jul 21 06:17:29 sa-patroni-main-eufh-fr1 pgbouncer[25114]: C-0x556f37231cc0: pgbouncer/pgbouncer@xx.xx.xx.xx:xxxx login attempt: db=pgbouncer user=pgbouncer tls=TLSv1.3/TLS_AES_256_GCM_SHA384
Jul 21 06:17:29 sa-patroni-main-eufh-fr1 pgbouncer[25114]: C-0x556f37231cc0: pgbouncer/pgbouncer@xx.xx.xx.xx:xxxx closing because: client close request (age=0s)
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 pgbouncer[25114]: @src/client.c:1092 in function client_proto(): bad client state: 6
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 patroni[7992]: 2023-07-21 06:17:31.487 UTC [3973] LOG:  could not receive data from client: Connection reset by peer
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 systemd[1]: pgbouncer.service: Main process exited, code=exited, status=1/FAILURE
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 patroni[7992]: 2023-07-21 06:17:31.487 UTC [3972] LOG:  could not receive data from client: Connection reset by peer
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 systemd[1]: pgbouncer.service: Failed with result 'exit-code'.
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 patroni[7992]: 2023-07-21 06:17:31.487 UTC [3971] LOG:  could not receive data from client: Connection reset by peer
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 patroni[7992]: 2023-07-21 06:17:31.487 UTC [3970] LOG:  could not receive data from client: Connection reset by peer
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 patroni[7992]: 2023-07-21 06:17:31.487 UTC [3876] LOG:  could not receive data from client: Connection reset by peer
Jul 21 06:17:31 sa-patroni-main-eufh-fr1 patroni[7992]: 2023-07-21 06:17:31.487 UTC [3877] LOG:  could not receive data from client: Connection reset by peer
  4. I have simulated a lot of traffic on STAGING and could not reproduce the issue.

Thank you

tnt-dev commented Aug 16, 2023

We ran into a similar problem on PgBouncer 1.19.1 a couple of days ago:

2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Main process exited, code=exited, status=1/FAILURE
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Failed with result 'exit-code'.
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Consumed 1h 51min 3.308s CPU time.
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 pgbouncer[4920]: started sending cancel request

It looks like #822, which has been marked as a duplicate of another (already fixed) issue.

JelteF added a commit to JelteF/pgbouncer that referenced this issue Aug 16, 2023
Fixes pgbouncer#822
Related to pgbouncer#904

In pgbouncer#904 @tnt-dev reported still hitting the issue from pgbouncer#822 on
1.19.1 (where it was expected to be fixed).

```
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Main process exited, code=exited, status=1/FAILURE
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Failed with result 'exit-code'.
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Consumed 1h 51min 3.308s CPU time.
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 pgbouncer[4920]: started sending cancel request
```

Looking again, I now understand that the switch in question was missing
the SV_BEING_CANCELED case. I added this case to the same case list
that SV_IDLE was part of, since SV_BEING_CANCELED is effectively the
same state as SV_IDLE (except we don't allow reuse of the server yet).

I did the same in a second switch; for that one only warnings would
have been shown instead of a fatal error, but it still seems good to
avoid unnecessary warnings.
JelteF (Member) commented Aug 16, 2023

Thanks for the report @tnt-dev. This is a different issue from the one reported by @raymasson; I created a PR to fix it in #927. For the issue created by @raymasson I still don't know the exact cause or right fix.

JelteF (Member) commented Aug 16, 2023

> For the issue created by @raymasson I still don't know the exact cause or right fix.

Actually I think I figured out what's happening and also how to correctly fix it. I'll try to reproduce tomorrow with a small Go program and confirm that the fix I have in mind works.

JelteF added a commit to JelteF/pgbouncer that referenced this issue Aug 17, 2023
The cancel request protocol is very simple:
1. Clients send a cancel key
2. Server cancels the query running on the matching backend
3. The server closes the socket to indicate the cancel request was
   handled

At no point after sending the cancel key is the client required to send
any data, and if it does, it won't get a response anyway because the
server would close the connection. So most clients don't; specifically,
all libpq-based clients don't.

The Go client, however, sends a Terminate ('X') packet before closing the
cancel socket from its own side (usually due to a 10 second timeout).
This would cause a crash in PgBouncer, because PgBouncer would still
listen on the client socket, but on receiving any data we would report a
fatal error. This PR fixes that fatal error by no longer listening on the
client socket once we've received the cancel key (i.e. when we change
the client state to CL_WAITING_CANCEL).

I was able to reproduce both of the errors reported by @raymasson and
confirmed that after this change the errors would not occur again.

Reproducing this with libpq-based clients is not possible, except by
manually attaching to the client using `gdb` and manually calling `send`
on the cancel request's socket. So no test is added, since our test
suite uses psycopg, which uses libpq under the hood.

Fixes pgbouncer#904

Related to pgbouncer#717
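
To make the cancel request protocol described in the commit message above concrete, here is a minimal Go sketch of the wire exchange over a raw connection (assumptions: plain TCP with no TLS, and placeholder address, pid, and secret key; in a real client those last two come from the BackendKeyData message received at startup):

```go
package main

import (
	"encoding/binary"
	"io"
	"log"
	"net"
)

// sendCancelRequest follows the three protocol steps: send the cancel key,
// let the server cancel the matching backend, then wait for the server to
// close the socket.
func sendCancelRequest(addr string, pid, secretKey uint32) error {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()

	// CancelRequest message: Int32 length (16), Int32 request code 80877102,
	// Int32 backend pid, Int32 secret key.
	buf := make([]byte, 16)
	binary.BigEndian.PutUint32(buf[0:4], 16)
	binary.BigEndian.PutUint32(buf[4:8], 80877102)
	binary.BigEndian.PutUint32(buf[8:12], pid)
	binary.BigEndian.PutUint32(buf[12:16], secretKey)
	if _, err := conn.Write(buf); err != nil {
		return err
	}

	// No response is ever sent; the server (or PgBouncer) signals completion
	// by closing the connection, so we only wait for EOF here. A client that
	// instead writes a Terminate ('X') packet before closing its side is the
	// behaviour that triggered the FATAL reported in this issue.
	_, err = io.Copy(io.Discard, conn)
	return err
}

func main() {
	// Address and cancel key values are for illustration only.
	if err := sendCancelRequest("127.0.0.1:5432", 12345, 67890); err != nil {
		log.Fatal(err)
	}
}
```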
JelteF added a commit that referenced this issue Aug 18, 2023
Fixes #822
Related to #904 and #717

In #904 @tnt-dev reported still hitting the issue from #822 on 1.19.1 (where
it was expected to be fixed).

```
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Main process exited, code=exited, status=1/FAILURE
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Failed with result 'exit-code'.
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Consumed 1h 51min 3.308s CPU time.
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 pgbouncer[4920]: started sending cancel request
```

Looking again, I now understand that the switch in question was missing
the SV_BEING_CANCELED case. I now added it to the same case list
that SV_IDLE was part of, since SV_BEING_CANCELED is effectively the
same state as SV_IDLE (except we don't allow reuse of the server yet).

I did the same in a second switch; for that one only warnings would
have been shown instead of a fatal error, but it still seems good to
avoid unnecessary warnings.

One example where this fatal error might occur: PgBouncer is still
waiting for a response to the cancel request it sent to the server,
but the query that's being canceled completes before that happens.
This puts the server in SV_BEING_CANCELED. Then, for whatever
reason, the server sends a NOTICE or ERROR message (still before
the cancel request receives a response from Postgres).

I was able to reproduce this (and confirm that the patch resolves it) by adding
some fake latency between PgBouncer and Postgres and then stopping Postgres
while the server was in the SV_BEING_CANCELED state.
JelteF (Member) commented Aug 18, 2023

@tnt-dev @raymasson I merged both fixes. If you're able to, please try out the current master branch to confirm that it resolves the issue for you.
