Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only use positive numbers in the PID part of the cancel key #945

Merged
merged 1 commit into from Oct 5, 2023

Conversation

JelteF
Copy link
Member

@JelteF JelteF commented Sep 6, 2023

We were using the full 64 bits of the BackendKeyData message as random
bytes. This turned out to be arguably incorrect, because the first 32
bits are used by PostgreSQL as a Process ID (and this is part of the
protocol documentation too). Actual PIDs are always positive, but
we also put a random bit in the sign bit so were setting it to negative
numbers half of the time. For most cases this does not matter, but it
turned out that pg_basebackup relied on the PID part to actually be
positive
.

While pg_basebackup is now fixed to support negative Process IDs, it
still seems good to adhere to this implicit requirement on positive
numbers in case other clients also depend on it.

Since this change requires changing the bytes of the cancel key in which
we encode the peer_id, this change breaks cancelations in peered
clusters of different PgBouncer versions. This seems like a minor
enough problem that we should not care about this. In practice this
should only happen during a rolling upgrade, which we currently don't
support well anyway (see #902 for improvements on that). And even if we
did, breaking cancellations for a few minutes in this transitional stage
doesn't seem like a huge deal.

@petere
Copy link
Member

petere commented Oct 4, 2023

While pg_basebackup is now fixed to support negative Process IDs, it still seems good to adhere to this implicit requirement on positive numbers in case other clients also depend on it.

That makes sense.

Since this change requires changing the bytes of the cancel key in which we encode the peer_id, this change breaks cancelations in peered clusters of different PgBouncer versions. This seems like a minor enough problem that we should not care about this. In practice this should only happen during a rolling upgrade, which we currently don't support well anyway (see #902 for improvements on that). And even if we did, breaking cancellations for a few minutes in this transitional stage doesn't seem like a huge deal.

I don't think I follow this. While temporary breakage during rolling upgrades might be a minor problem, what about different pgbouncer versions on different hosts that are peered together? We don't want to force that all of those are upgraded together.

Maybe a compatibility break is necessary, but then we should just say so.

What would happen if we left the peer ID as is (in the first bytes)?

@JelteF
Copy link
Member Author

JelteF commented Oct 4, 2023

Maybe a compatibility break is necessary, but then we should just say so.

Fair enough. I'll update the docs in the PR and note that cross version peering is not supported across this version change. (and also include it in the changelog)

What would happen if we left the peer ID as is (in the first bytes)?

Then we'd cut the usable peer id space in half, because the sign bit always needs to be 0. The maximum value of peer_id (16383) is also documented, so changing that would also be a breaking change. Since we're breaking compatibility either way, I'd rather do it in the way that the PR is doing now so we can keep the same amount of peer_ids.

We were using the full 64 bits of the BackendKeyData message as random
bytes. This turned out to be arguably incorrect, because the first 32
bits are used by PostgreSQL as a Process ID (and this is part of the
[protocol documentation too][1]). Actual PIDs are always positive, but
we also put a random bit in the sign bit so were setting it to negative
numbers half of the time. For most cases this does not matter, but it
turned out that [`pg_basebackup` relies on the PID part to actually be
positive][2].

While it seems semi-likely that `pg_basebackup` will be fixed to support
negative Process IDs, it still seems good to adhere to this implicit
requirement on positive numbers in case other clients also depend on it.

Since this change requires changing the bytes of the cancel key in which
we encode the `peer_id`, this change breaks cancelations in peered
clusters of different PgBouncer versions. This seems like a minor
enough problem that we should not care about this. In practice this
should only happen during a rolling upgrade, which we currently don't
support well anyway (see pgbouncer#902 for improvements on that). And even if we
did, breaking cancellations for a few minutes in this transitional stage
doesn't seem like a huge deal.

[1]: https://www.postgresql.org/docs/current/protocol-message-formats.html
[2]: https://www.postgresql.org/message-id/flat/CAGECzQQOGvYfp8ziF4fWQ_o8s2K7ppaoWBQnTmdakn3s-4Z%3D5g%40mail.gmail.com
@JelteF
Copy link
Member Author

JelteF commented Oct 5, 2023

I updated the docs to include the cross-version breakage.

@JelteF JelteF merged commit 5065deb into pgbouncer:master Oct 5, 2023
7 checks passed
@JelteF JelteF deleted the positive-pid-in-cancel-key branch October 5, 2023 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants