Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FATAL @src/server.c:583 in function server_proto(): server_proto: server in bad state: 11 #822

Closed
guybrowne opened this issue Apr 3, 2023 · 3 comments · Fixed by #927
Closed

Comments

@guybrowne
Copy link

guybrowne commented Apr 3, 2023

PGBouncer upgraded from 1.17 to 1.18

When server under high load and application starts cancelling queries pgbouncer process crashes.

#pgbouncer6432.log
2023-04-03 15:14:20.360 AEST [9558] LOG C-0x55a94736e390: (nodb)/(nouser)@10.240.250.24:45874 closing because: failed cancel request (age=0s)
2023-04-03 15:14:20.365 AEST [9558] LOG C-0x55a94736e390: (nodb)/(nouser)@10.240.251.248:47398 closing because: failed cancel request (age=0s)
2023-04-03 15:14:20.371 AEST [9558] LOG C-0x55a94736e390: (nodb)/(nouser)@10.240.251.200:59238 closing because: failed cancel request (age=0s)
2023-04-03 15:14:20.381 AEST [9558] LOG C-0x55a94735a430: dbname/username1@10.240.250.241:34984 closing because: client unexpected eof (age=59s)
2023-04-03 15:14:20.381 AEST [9558] LOG S-0x55a94735ddf0: dbname/username1@127.0.0.1:5433 closing because: client disconnect while server was not ready (age=1625s)
2023-04-03 15:14:20.385 AEST [9558] LOG C-0x55a94735adb0: dbname/username1@10.240.250.241:48764 closing because: client unexpected eof (age=68s)
2023-04-03 15:14:20.385 AEST [9558] LOG S-0x55a94735d6d0: dbname/username1@127.0.0.1:5433 closing because: client disconnect while server was not ready (age=1678s)
2023-04-03 15:14:20.397 AEST [9558] LOG C-0x55a94735adb0: dbname/username1@10.240.250.241:35114 login attempt: db=dbname user=username1 tls=no
2023-04-03 15:14:20.459 AEST [9558] LOG C-0x55a94735a430: dbname/username1@10.240.250.241:35118 login attempt: db=dbname user=username1 tls=no
2023-04-03 15:14:20.800 AEST [9558] LOG C-0x55a94736e390: dbname/username2@10.240.251.218:52302 login attempt: db=dbname user=username2 tls=no
2023-04-03 15:14:20.894 AEST [9558] LOG S-0x55a94735d6d0: dbname/username1@127.0.0.1:5433 new connection to server (from 127.0.0.1:43194)
2023-04-03 15:14:20.894 AEST [9558] LOG started sending cancel request
2023-04-03 15:14:20.896 AEST [9558] LOG S-0x55a94735d6d0: dbname/username1@127.0.0.1:5433 closing because: successfully sent cancel request (age=0s)
2023-04-03 15:14:20.896 AEST [9558] LOG C-0x55a947359ab0: dbname/(nouser)@10.240.250.241:35122 closing because: successfully sent cancel request (age=0s)
2023-04-03 15:14:20.903 AEST [9558] LOG C-0x55a947367f00: dbname/username1@10.240.251.248:43174 closing because: client unexpected eof (age=0s)
2023-04-03 15:14:20.904 AEST [9558] LOG C-0x55a94735adb0: dbname/username1@10.240.250.241:35114 closing because: client unexpected eof (age=0s)
2023-04-03 15:14:20.904 AEST [9558] WARNING C-0x55a94735a430: dbname/username1@10.240.250.241:35118 FIXME: query end, but query_start == 0
2023-04-03 15:14:20.904 AEST [9558] WARNING C-0x55a94735a430: dbname/username1@10.240.250.241:35118 FIXME: transaction end, but xact_start == 0
2023-04-03 15:14:20.905 AEST [9558] LOG C-0x55a94735adb0: dbname/username1@10.240.250.241:35126 login attempt: db=dbname user=username1 tls=no
2023-04-03 15:14:20.908 AEST [9558] LOG C-0x55a94735a430: dbname/username1@10.240.250.241:35118 closing because: client unexpected eof (age=0s)
2023-04-03 15:14:21.068 AEST [9558] LOG C-0x55a94735a430: (nodb)/(nouser)@10.240.251.200:36974 closing because: failed cancel request (age=0s)
2023-04-03 15:14:21.295 AEST [9558] LOG C-0x55a94735a430: (nodb)/(nouser)@10.240.250.241:48918 closing because: failed cancel request (age=0s)
2023-04-03 15:14:22.047 AEST [9558] LOG C-0x55a94736eab0: dbname/username1@10.240.250.241:48798 closing because: client unexpected eof (age=47s)
2023-04-03 15:14:22.047 AEST [9558] LOG S-0x55a94735ec30: dbname/username1@127.0.0.1:5433 closing because: client disconnect while server was not ready (age=1354s)
2023-04-03 15:14:22.088 AEST [9558] LOG C-0x55a94736eab0: dbname/username1@10.240.250.24:45884 login attempt: db=dbname user=username1 tls=no
2023-04-03 15:14:22.363 AEST [9558] LOG C-0x55a94735a430: dbname/username1@10.240.250.241:48926 login attempt: db=dbname user=username1 tls=no
2023-04-03 15:14:22.372 AEST [9558] LOG C-0x55a947367f00: dbname/username1@10.240.251.248:43188 login attempt: db=dbname user=username1 tls=no
2023-04-03 15:14:22.561 AEST [9558] LOG C-0x55a947364600: dbname/username1@10.240.251.200:36844 closing because: client unexpected eof (age=47s)
2023-04-03 15:14:22.561 AEST [9558] LOG S-0x55a947360d70: dbname/username1@127.0.0.1:5433 closing because: client disconnect while server was not ready (age=1659s)
2023-04-03 15:14:23.791 AEST [9558] LOG C-0x55a94735adb0: dbname/username1@10.240.250.241:35126 closing because: client unexpected eof (age=2s)
2023-04-03 15:14:23.794 AEST [9558] LOG C-0x55a94735adb0: dbname/username1@10.240.250.241:48932 login attempt: db=dbname user=username1 tls=no
2023-04-03 15:14:24.438 AEST [9558] LOG C-0x55a9473587b0: dbname/username2@10.240.250.239:52832 closing because: client close request (age=22s)
2023-04-03 15:14:24.823 AEST [9558] LOG C-0x55a9473587b0: dbname/username1@10.240.250.24:33552 login attempt: db=dbname user=username1 tls=no
2023-04-03 15:14:24.834 AEST [9558] LOG C-0x55a947364600: dbname/username1@10.240.251.200:36992 login attempt: db=dbname user=username1 tls=no
2023-04-03 15:14:24.866 AEST [9558] LOG C-0x55a947359ab0: dbname/username2@10.240.250.239:46544 login attempt: db=dbname user=username2 tls=no
2023-04-03 15:14:25.676 AEST [9558] LOG C-0x55a947368fa0: dbname/username1@10.240.251.200:36922 closing because: client unexpected eof (age=19s)
2023-04-03 15:14:25.676 AEST [9558] LOG C-0x55a94736f8f0: dbname/username1@10.240.251.200:59266 login attempt: db=dbname user=username1 tls=no
2023-04-03 15:14:25.678 AEST [9558] LOG C-0x55a94735a430: dbname/username1@10.240.250.241:48926 closing because: client unexpected eof (age=3s)
2023-04-03 15:14:25.680 AEST [9558] LOG C-0x55a94735a430: dbname/username1@10.240.250.241:35148 login attempt: db=dbname user=username1 tls=no
2023-04-03 15:14:25.682 AEST [9558] LOG C-0x55a947370990: dbname/username1@10.240.251.248:47076 closing because: client unexpected eof (age=72s)
2023-04-03 15:14:26.416 AEST [9558] LOG C-0x55a947370990: dbname/username2@10.240.249.134:52806 login attempt: db=dbname user=username2 tls=no
2023-04-03 15:14:27.091 AEST [9558] LOG C-0x55a94736fdb0: dbname/username2@10.240.249.189:38010 closing because: client close request (age=10s)
2023-04-03 15:14:27.091 AEST [9558] LOG C-0x55a9473670c0: dbname/username2@10.240.249.189:37838 closing because: client close request (age=20s)
2023-04-03 15:14:27.469 AEST [9558] LOG C-0x55a9473670c0: dbname/username2@10.240.251.218:52366 login attempt: db=dbname user=username2 tls=no
2023-04-03 15:14:27.469 AEST [9558] LOG C-0x55a94736fdb0: dbname/username2@10.240.251.218:58098 login attempt: db=dbname user=username2 tls=no
2023-04-03 15:14:29.495 AEST [9558] LOG C-0x55a947368fa0: dbname/username1@10.240.250.24:33558 login attempt: db=dbname user=username1 tls=no
2023-04-03 15:14:29.737 AEST [9558] LOG C-0x55a94736a040: dbname/username2@10.240.249.189:38172 login attempt: db=dbname user=username2 tls=no
2023-04-03 15:14:31.228 AEST [9558] WARNING S-0x55a9473603f0: dbname/username1@127.0.0.1:5433 got packet '1' from server when not linked
2023-04-03 15:14:31.228 AEST [9558] WARNING S-0x55a9473603f0: dbname/username1@127.0.0.1:5433 got packet '2' from server when not linked
2023-04-03 15:14:31.228 AEST [9558] WARNING S-0x55a9473603f0: dbname/username1@127.0.0.1:5433 got packet 'T' from server when not linked
2023-04-03 15:14:31.228 AEST [9558] WARNING S-0x55a9473603f0: dbname/username1@127.0.0.1:5433 got packet 'D' from server when not linked
2023-04-03 15:14:31.228 AEST [9558] WARNING S-0x55a9473603f0: dbname/username1@127.0.0.1:5433 got packet 'D' from server when not linked
<...snip...>
2023-04-03 15:15:38.992 AEST [9558] LOG C-0x55a947368d40: dbname/username1@10.240.251.248:47514 closing because: client unexpected eof (age=1s)
2023-04-03 15:15:38.992 AEST [9558] LOG C-0x55a947369200: dbname/username1@10.240.250.24:33592 closing because: client unexpected eof (age=32s)
2023-04-03 15:15:38.992 AEST [9558] LOG C-0x55a947366c00: dbname/username1@10.240.251.200:37102 closing because: client unexpected eof (age=1s)
2023-04-03 15:15:38.994 AEST [9558] LOG C-0x55a947366c00: dbname/username1@10.240.251.200:59390 login attempt: db=dbname user=username1 tls=no
2023-04-03 15:15:39.012 AEST [9558] LOG C-0x55a947366c00: dbname/username1@10.240.251.200:59390 closing because: client unexpected eof (age=0s)
2023-04-03 15:15:39.012 AEST [9558] WARNING S-0x55a94735d210: dbname/username1@127.0.0.1:5433 got packet '1' from server when not linked
2023-04-03 15:15:39.012 AEST [9558] WARNING S-0x55a94735d210: dbname/username1@127.0.0.1:5433 got packet '2' from server when not linked
2023-04-03 15:15:39.012 AEST [9558] WARNING S-0x55a94735d210: dbname/username1@127.0.0.1:5433 got packet 'T' from server when not linked
2023-04-03 15:15:39.012 AEST [9558] WARNING S-0x55a94735d210: dbname/username1@127.0.0.1:5433 got packet 'D' from server when not linked
2023-04-03 15:15:39.012 AEST [9558] WARNING S-0x55a94735d210: dbname/username1@127.0.0.1:5433 got packet 'C' from server when not linked
2023-04-03 15:15:39.012 AEST [9558] WARNING S-0x55a94735d210: dbname/username1@127.0.0.1:5433 got packet 'Z' from server when not linked
2023-04-03 15:15:41.501 AEST [9558] LOG C-0x55a9473549f0: dbname/username2@10.240.251.218:58656 closing because: client close request (age=3s)
2023-04-03 15:15:43.313 AEST [9558] LOG C-0x55a9473549f0: dbname/username2@10.240.251.218:58706 login attempt: db=dbname user=username2 tls=no
2023-04-03 15:15:47.326 AEST [9558] LOG S-0x55a94735d470: dbname/username1@127.0.0.1:5433 new connection to server (from 127.0.0.1:43378)
2023-04-03 15:15:47.326 AEST [9558] LOG started sending cancel request
2023-04-03 15:15:47.329 AEST [9558] FATAL @src/server.c:583 in function server_proto(): server_proto: server in bad state: 11
2023-04-03 16:43:35.893 AEST [3406134] LOG kernel file descriptor limit: 8192 (hard: 8192); max_client_conn: 500, max expected fd use: 572

#journalctl
Apr 03 15:15:39 pgdb.server pgbouncer[9558]: S-0x55a94735d210: dbname/username1@127.0.0.1:5433 got packet '1' from server when not linked
Apr 03 15:15:39 pgdb.server pgbouncer[9558]: S-0x55a94735d210: dbname/username1@127.0.0.1:5433 got packet '2' from server when not linked
Apr 03 15:15:39 pgdb.server pgbouncer[9558]: S-0x55a94735d210: dbname/username1@127.0.0.1:5433 got packet 'T' from server when not linked
Apr 03 15:15:39 pgdb.server pgbouncer[9558]: S-0x55a94735d210: dbname/username1@127.0.0.1:5433 got packet 'D' from server when not linked
Apr 03 15:15:39 pgdb.server pgbouncer[9558]: S-0x55a94735d210: dbname/username1@127.0.0.1:5433 got packet 'C' from server when not linked
Apr 03 15:15:39 pgdb.server pgbouncer[9558]: S-0x55a94735d210: dbname/username1@127.0.0.1:5433 got packet 'Z' from server when not linked
Apr 03 15:15:41 pgdb.server pgbouncer[9558]: C-0x55a9473549f0: dbname/username2@10.240.251.218:58656 closing because: client close request (age=3s)
Apr 03 15:15:43 pgdb.server pgbouncer[9558]: C-0x55a9473549f0: dbname/username2@10.240.251.218:58706 login attempt: db=dbname user=username2 tls=no
Apr 03 15:15:47 pgdb.server pgbouncer[9558]: S-0x55a94735d470: dbname/username1@127.0.0.1:5433 new connection to server (from 127.0.0.1:43378)
Apr 03 15:15:47 pgdb.server pgbouncer[9558]: started sending cancel request
Apr 03 15:15:47 pgdb.server pgbouncer[9558]: @src/server.c:583 in function server_proto(): server_proto: server in bad state: 11
Apr 03 15:15:47 pgdb.server systemd[1]: pgbouncer6432.service: Main process exited, code=exited, status=1/FAILURE
Apr 03 15:15:47 pgdb.server systemd[1]: pgbouncer6432.service: Failed with result 'exit-code'.
Apr 03 15:16:00 pgdb.server systemd[1]: Starting system activity accounting tool...
Apr 03 15:16:00 pgdb.server systemd[1]: sysstat-collect.service: Succeeded.
Apr 03 15:16:00 pgdb.server systemd[1]: Started system activity accounting tool.

pgbouncer6432.ini

[databases]
dbname = host=127.0.0.1 port=5433 auth_user=pgbouncer pool_size=30
"*" = host=127.0.0.1 port=5433 auth_user=pgbouncer
[users]
[pgbouncer]
logfile = /var/log/pgbouncer/pgbouncer6432.log
pidfile = /var/run/pgbouncer/pgbouncer6432.pid
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
auth_user = pgbouncer
auth_query = SELECT * FROM pgbouncer.get_auth($1)
admin_users = postgres
stats_users = stats, postgres, pgbouncer
pool_mode = transaction
server_reset_query = DISCARD ALL
ignore_startup_parameters = extra_float_digits,search_path
server_check_query = select 1
server_fast_close = 1
max_client_conn = 500
default_pool_size = 30
max_db_connections = 250
query_wait_timeout = 60
tcp_keepalive = 1
tcp_keepcnt = 6
tcp_keepidle = 1
tcp_keepintvl = 10

@JelteF
Copy link
Member

JelteF commented Apr 3, 2023

Thanks for the report. But afaict this is a duplicate of #801, which was fixed recently on the master branch.

@JelteF
Copy link
Member

JelteF commented May 4, 2023

Closing this as a duplicate of #801 (for which a fix has now been released).

@JelteF JelteF closed this as not planned Won't fix, can't repro, duplicate, stale May 4, 2023
@JelteF
Copy link
Member

JelteF commented Aug 16, 2023

Reopening after report from tnt-dev on #904

@JelteF JelteF reopened this Aug 16, 2023
JelteF added a commit to JelteF/pgbouncer that referenced this issue Aug 16, 2023
Fixes pgbouncer#822
Related to pgbouncer#904

In pgbouncer#904 @tnt-dev reported to still get the issue in pgbouncer#822 on 1.19.1 (on
which it was expected to be fixed).

```
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Main process exited, code=exited, status=1/FAILURE
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Failed with result 'exit-code'.
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Consumed 1h 51min 3.308s CPU time.
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 pgbouncer[4920]: started sending cancel request
```

Looking again, I now understand that the switch in question was missing
the SV_BEING_CANCELED case. I added it to this case to the same list
that SV_IDLE was part of. Since SV_BEING_CANCELED is effectively the
same state as SV_IDLE (except we don't allow reuse of the server yet).

I did the same in a second switch case, for that one only warnings would
have been shown instead of causing a fatal error. But still it seems
good to avoid unnecessary warnings.
JelteF added a commit that referenced this issue Aug 18, 2023
Fixes #822
Related to #904 and #717

In #904 @tnt-dev reported to still get the issue in #822 on 1.19.1 (on
which it was expected to be fixed).

```
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Main process exited, code=exited, status=1/FAILURE
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Failed with result 'exit-code'.
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Consumed 1h 51min 3.308s CPU time.
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 pgbouncer[4920]: started sending cancel request
```

Looking again, I now understand that the switch in question was missing
the SV_BEING_CANCELED case. I now added it to the same case list
that SV_IDLE was part of. Since SV_BEING_CANCELED is effectively the
same state as SV_IDLE (except we don't allow reuse of the server yet).

I did the same in a second switch case, for that one only warnings would
have been shown instead of causing a fatal error. But still it seems
good to avoid unnecessary warnings.

One example where this fatal error might occur is when PgBouncer is
still waiting for a response from the cancel request it sent to the server.
But the query that's being canceled completes before that happens.
This puts the server in SV_BEING_CANCELED. Then for whatever
reason the server sends a NOTICE or ERROR message (still before
the cancel request receives a response from Postgres).

I was able to reproduce (and confirm that the patch resolves it), by adding
some fake latency between PgBouncer and Postgres. And stopping Postgres
while the server was in the SV_BEING_CANCELED state.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants