Skip to content

Fix FATAL in function server_proto(): server in bad state: 11 #927

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 18, 2023

Conversation

JelteF
Copy link
Member

@JelteF JelteF commented Aug 16, 2023

Fixes #822
Related to #904 and #717

In #904 @tnt-dev reported to still get the issue in #822 on 1.19.1 (on
which it was expected to be fixed).

2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Main process exited, code=exited, status=1/FAILURE
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Failed with result 'exit-code'.
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Consumed 1h 51min 3.308s CPU time.
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 pgbouncer[4920]: started sending cancel request

Looking again, I now understand that the switch in question was missing
the SV_BEING_CANCELED case. I now added it to the same case list
that SV_IDLE was part of. Since SV_BEING_CANCELED is effectively the
same state as SV_IDLE (except we don't allow reuse of the server yet).

I did the same in a second switch case, for that one only warnings would
have been shown instead of causing a fatal error. But still it seems
good to avoid unnecessary warnings.

One example where this fatal error might occur is when PgBouncer is
still waiting for a response from the cancel request it sent to the server.
But the query that's being canceled completes before that happens.
This puts the server in SV_BEING_CANCELED. Then for whatever
reason the server sends a NOTICE or ERROR message (still before
the cancel request receives a response from postgres).

I was able to reproduce (and confirm that the patch resolves it), by adding
some fake latency between pgbouncer and postgres. And stopping postgres
while the server was in the SV_BEING_CANCELED state.

Fixes pgbouncer#822
Related to pgbouncer#904

In pgbouncer#904 @tnt-dev reported to still get the issue in pgbouncer#822 on 1.19.1 (on
which it was expected to be fixed).

```
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Main process exited, code=exited, status=1/FAILURE
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Failed with result 'exit-code'.
2023-08-12T16:34:07+00:00 systemd[1]: pgbouncer.service: Consumed 1h 51min 3.308s CPU time.
2023-08-12T16:34:07+00:00 pgbouncer[4920]: @src/server.c:591 in function server_proto(): server_proto: server in bad state: 11
2023-08-12T16:34:07+00:00 pgbouncer[4920]: started sending cancel request
```

Looking again, I now understand that the switch in question was missing
the SV_BEING_CANCELED case. I added it to this case to the same list
that SV_IDLE was part of. Since SV_BEING_CANCELED is effectively the
same state as SV_IDLE (except we don't allow reuse of the server yet).

I did the same in a second switch case, for that one only warnings would
have been shown instead of causing a fatal error. But still it seems
good to avoid unnecessary warnings.
@eulerto
Copy link
Member

eulerto commented Aug 17, 2023

One example where this fatal error might occur is when PgBouncer is
still waiting for a response from the cancel request it sent to the server.
But the query that's being canceled completes before that happens.
This puts the server in SV_BEING_CANCELED. Then for whatever
reason the server sends a NOTICE or ERROR message (still before
the cancel request receives a response from postgres).

Did you manage to reproduce this case? I couldn't. I'm wondering if this logic can handle any type of packet it receives while executing the function handle_server_work().

Does it executes the following code path?

pgbouncer/src/server.c

Lines 421 to 428 in 07239d3

} else {
if (server->state != SV_TESTED) {
slog_warning(server,
"got packet '%c' from server when not linked",
pkt_desc(pkt));
}
sbuf_prepare_skip(sbuf, pkt->len);
}

If so, this message does not provide a proper reason for the warning and we should create a debug message for this case.

@JelteF
Copy link
Member Author

JelteF commented Aug 17, 2023

Did you manage to reproduce this case? I couldn't. I'm wondering if this logic can handle any type of packet it receives while executing the function handle_server_work().

Yes, I was able to reproduce (and confirm that the patch resolves it), by adding some fake latency between pgbouncer and postgres. And stopping postgres while the server was in the SV_BEING_CANCELED state.

That indeed caused the code path that you linked to, to be executed.

2023-08-17 23:15:32.404 CEST [539954] WARNING S-0x56061eb25530: postgres/jelte@127.0.0.1:9700 got packet 'E' from server when not linked
2023-08-17 23:15:32.404 CEST [539954] LOG S-0x56061eb25530: postgres/jelte@127.0.0.1:9700 closing because: server conn crashed? (age=50s)

The same message occurs when killing postgres while a server connection is in the SV_IDLE state. I agree that it's not a very useful warning. But I think it's a separate issue from the one that this PR is trying to fix I opened #931 to track it. Going from a full process crash to a confusing warning is definitely a step in the right direction. And this confusing warning has existed for quite some time for SV_IDLE connections, which is a much more common state than SV_BEING_CANCELED.

@eulerto
Copy link
Member

eulerto commented Aug 17, 2023

Yes, I was able to reproduce (and confirm that the patch resolves it), by adding some fake latency between pgbouncer and postgres. And stopping postgres while the server was in the SV_BEING_CANCELED state.

Cool.

The same message occurs when killing postgres while a server connection is in the SV_IDLE state. I agree that it's not a very useful warning. But I think it's a separate issue from the one that this PR is trying to fix I opened #931 to track it. Going from a full process crash to a confusing warning is definitely a step in the right direction. And this confusing warning has existed for quite some time for SV_IDLE connections, which is a much more common state than SV_BEING_CANCELED.

Yeah. My comment is more in this direction: provide accurate information to the user. I'm fine leaving this as-is for this PR.

@JelteF JelteF merged commit 8d30ebc into pgbouncer:master Aug 18, 2023
@JelteF JelteF deleted the fix-crash-SV_BEING_CANCELED branch August 18, 2023 06:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FATAL @src/server.c:583 in function server_proto(): server_proto: server in bad state: 11
2 participants