
Only use positive numbers in the PID part of the cancel key #945


Merged
@JelteF merged 1 commit into pgbouncer:master from positive-pid-in-cancel-key on Oct 5, 2023

Conversation

@JelteF (Member) commented Sep 6, 2023

We were using the full 64 bits of the BackendKeyData message as random
bytes. This turned out to be arguably incorrect, because the first 32
bits are used by PostgreSQL as a Process ID (and this is part of the
protocol documentation too). Actual PIDs are always positive, but we
also put a random bit in the sign bit, so we were setting it to negative
numbers half of the time. In most cases this does not matter, but it
turned out that pg_basebackup relied on the PID part actually being
positive.

While pg_basebackup is now fixed to support negative Process IDs, it
still seems good to adhere to this implicit requirement on positive
numbers in case other clients also depend on it.
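
As a minimal sketch of what the fix amounts to (this is not PgBouncer's actual code; `fill_cancel_key` and the use of `rand()` are placeholder choices for illustration), forcing the PID part to be positive just means clearing one bit:

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch: fill an 8-byte cancel key with random bytes, then clear the
 * sign bit of the 32-bit PID part so it always reads as positive. */
static void fill_cancel_key(uint8_t key[8])
{
	for (int i = 0; i < 8; i++)
		key[i] = (uint8_t) (rand() & 0xff);

	/* The PID is the first 4 bytes in network (big-endian) order,
	 * so its sign bit is the top bit of key[0]. Force it to 0. */
	key[0] &= 0x7f;
}
```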

Since this change moves the bytes of the cancel key in which we encode
the peer_id, it breaks cancellations in peered clusters running
different PgBouncer versions. This seems like a minor enough problem to
accept. In practice it should only happen during a rolling upgrade,
which we currently don't support well anyway (see #902 for improvements
on that). And even if we did, breaking cancellations for a few minutes
in this transitional stage doesn't seem like a huge deal.

@petere (Member) commented Oct 4, 2023

While pg_basebackup is now fixed to support negative Process IDs, it still seems good to adhere to this implicit requirement on positive numbers in case other clients also depend on it.

That makes sense.

Since this change moves the bytes of the cancel key in which we encode the peer_id, it breaks cancellations in peered clusters running different PgBouncer versions. This seems like a minor enough problem to accept. In practice it should only happen during a rolling upgrade, which we currently don't support well anyway (see #902 for improvements on that). And even if we did, breaking cancellations for a few minutes in this transitional stage doesn't seem like a huge deal.

I don't think I follow this. While temporary breakage during rolling upgrades might be a minor problem, what about different pgbouncer versions on different hosts that are peered together? We don't want to force all of those to be upgraded together.

Maybe a compatibility break is necessary, but then we should just say so.

What would happen if we left the peer ID as is (in the first bytes)?

@JelteF (Member, Author) commented Oct 4, 2023

Maybe a compatibility break is necessary, but then we should just say so.

Fair enough. I'll update the docs in the PR and note that cross-version peering is not supported across this version change. (And also include it in the changelog.)

What would happen if we left the peer ID as is (in the first bytes)?

Then we'd cut the usable peer_id space in half, because the sign bit always needs to be 0. The maximum value of peer_id (16383) is also documented, so changing that would be a breaking change too. Since we're breaking compatibility either way, I'd rather do it the way this PR does, so we can keep the same number of peer_ids.
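
To make the arithmetic concrete: 16383 is 2^14 − 1, so a peer_id occupies 14 bits. A hypothetical sketch (the field layout and function name are invented for illustration, not taken from PgBouncer) of why keeping the peer ID in the first bytes would halve the ID space:

```c
#include <stdint.h>

#define PEER_ID_BITS 14
#define MAX_PEER_ID ((1u << PEER_ID_BITS) - 1)	/* documented maximum: 16383 */

static uint32_t pid_part_keeping_old_layout(uint16_t peer_id, uint32_t random_bits)
{
	peer_id &= MAX_PEER_ID;

	/* peer_id in the top 14 bits, randomness in the low 18 bits */
	uint32_t pid = ((uint32_t) peer_id << (32 - PEER_ID_BITS))
		| (random_bits & ((1u << (32 - PEER_ID_BITS)) - 1));

	/* Forcing the sign bit (bit 31) to 0 clobbers bit 13 of peer_id,
	 * leaving only 13 usable bits. */
	return pid & 0x7fffffffu;
}
```

For example, peer_id 8192 (only bit 13 set) would decode back as 0 after the mask, so only IDs 0–8191 would remain usable.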

@JelteF force-pushed the positive-pid-in-cancel-key branch from 8330df4 to 9182a18 on October 5, 2023 07:48
We were using the full 64 bits of the BackendKeyData message as random
bytes. This turned out to be arguably incorrect, because the first 32
bits are used by PostgreSQL as a Process ID (and this is part of the
[protocol documentation too][1]). Actual PIDs are always positive, but
we also put a random bit in the sign bit, so we were setting it to
negative numbers half of the time. In most cases this does not matter,
but it turned out that [`pg_basebackup` relies on the PID part actually
being positive][2].

While it seems semi-likely that `pg_basebackup` will be fixed to support
negative Process IDs, it still seems good to adhere to this implicit
requirement on positive numbers in case other clients also depend on it.

Since this change moves the bytes of the cancel key in which we encode
the `peer_id`, it breaks cancellations in peered clusters running
different PgBouncer versions. This seems like a minor enough problem to
accept. In practice it should only happen during a rolling upgrade,
which we currently don't support well anyway (see pgbouncer#902 for
improvements on that). And even if we did, breaking cancellations for a
few minutes in this transitional stage doesn't seem like a huge deal.

[1]: https://www.postgresql.org/docs/current/protocol-message-formats.html
[2]: https://www.postgresql.org/message-id/flat/CAGECzQQOGvYfp8ziF4fWQ_o8s2K7ppaoWBQnTmdakn3s-4Z%3D5g%40mail.gmail.com
@JelteF force-pushed the positive-pid-in-cancel-key branch from 9182a18 to 7e68480 on October 5, 2023 07:50
@JelteF (Member, Author) commented Oct 5, 2023

I updated the docs to include the cross-version breakage.

@JelteF merged commit 5065deb into pgbouncer:master on Oct 5, 2023
@JelteF deleted the positive-pid-in-cancel-key branch on October 5, 2023 13:32
JelteF added a commit that referenced this pull request May 6, 2024
In session pooling mode PgBouncer is pretty much a transparent proxy,
i.e. the client normally does not even need to know that PgBouncer is in
the middle. This allows things like load balancing and failovers without
the client needing to know about them at all. But as soon as replication
connections were needed, this was no longer possible, because PgBouncer
would reject them instead of proxying them to the right server.

This PR fixes that by also proxying replication connections. They are
handled quite differently from normal connections though. A client and
server replication connection form a strong pair: as soon as one is
closed, the other is closed too. So there's no caching of server
replication connections, like there is for regular connections. Reusing
replication connections comes with a ton of gotchas; Postgres will throw
errors in many cases when trying to do so. So simply not doing it seems
like a good tradeoff for ease of implementation, especially because
replication connections are pretty much always very long-lived, so
reusing them would gain pretty much no performance benefit. A sketch of
the pairing follows below.

Fixes #382

Depends on #945
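
A rough sketch of that 1:1 pairing (all names here are hypothetical simplifications for illustration, not PgBouncer's actual internals):

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for PgBouncer's socket objects. */
struct connection {
	int fd;
};

static void connection_close(struct connection *conn)
{
	printf("closing fd %d\n", conn->fd);
	free(conn);
}

/* A replication client is bound 1:1 to a dedicated server connection.
 * Closing the pair tears down both sides at once; the server side is
 * never returned to a pool the way regular server connections are. */
struct repl_pair {
	struct connection *client;
	struct connection *server;
};

static void repl_pair_close(struct repl_pair *pair)
{
	connection_close(pair->client);
	connection_close(pair->server);
	free(pair);
}
```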