Add PgBouncer peering support #666
Conversation
Worth noting that the peer ID is encoded into the first 16 bits of the cancellation key sent to the client (which is normally a random 64-bit number), and the last bit is used to indicate whether the cancellation was forwarded. This reduces entropy somewhat, but it is still a lot higher than for a regular postgres server which uses the first 32 bits to encode the (quite predictable, or known) PID.
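Purely as an illustration of the layout described above, here is a minimal C sketch. The helper names and exact masking are made up for this comment, not PgBouncer's actual code; only the bit positions (peer ID in the top 16 bits, forwarded flag in the last bit) come from the description.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical layout matching the description above:
 * bits 63..48: peer_id of the pgbouncer that issued the key
 * bit       0: set once the cancel request has been forwarded
 * the rest   : random entropy
 */
#define PEER_ID_SHIFT 48
#define FORWARDED_BIT ((uint64_t)1)

uint64_t make_cancel_key(uint16_t peer_id, uint64_t random_bits)
{
	uint64_t key = random_bits;
	key &= ~((uint64_t)0xFFFF << PEER_ID_SHIFT);  /* clear the top 16 bits */
	key |= (uint64_t)peer_id << PEER_ID_SHIFT;    /* store the peer id there */
	key &= ~FORWARDED_BIT;                        /* not forwarded yet */
	return key;
}

uint16_t cancel_key_peer_id(uint64_t key)
{
	return (uint16_t)(key >> PEER_ID_SHIFT);
}

bool cancel_key_forwarded(uint64_t key)
{
	return (key & FORWARDED_BIT) != 0;
}
```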
I updated various comments to make the code and the reasoning clearer, including one about the reduced entropy.
This changes our testing infrastructure to use Python instead of Bash. Both `test/test.sh` and `test/ssl/test.sh` have been completely converted to Python.

The main reason I want to introduce this change is that Bash is one of the worst languages to write tests in. One of the main goals of tests is to find unexpected errors, and Bash makes it very easy not to notice such errors. This made it quite hard for me to write correct tests and to review newly introduced tests in PRs for correctness.

Another big reason I want this change is that writing tests for #666 requires running multiple pgbouncer processes at the same time. The bash infrastructure did not allow for this; the new Python testing infrastructure does.

Finally, there are some other advantages that this new test infrastructure brings:
1. Tests can be run in parallel, which makes them go **much** faster. The duration of the test step in CI is now ~1 minute for non-valgrind runs and ~2 minutes for valgrind runs. With the bash testing infrastructure these timings were 5.5 minutes and ~8 minutes respectively.
2. Running tests with `valgrind` locally is now as simple as `ENABLE_VALGRIND=1 pytest -n auto`
3. Some tests would not run on some OSes (e.g. ssl tests on Windows). I made sure almost all are now fully cross-platform (for the platforms we support in CI).
4. `USE_SUDO=1` tests are now run in CI on MacOS and FreeBSD. (I confirmed `USE_SUDO=1` works with Linux on my local machine, but was unable to get it working in CI due to `iptables` not being allowed in the CI containers.)

How to run tests can be found in the updated `test/README.md` file.

PS. I haven't tested with OpenBSD, since I don't have access to such a machine. I'm fairly confident that the tests will run fine though, since they pass on all CI platforms (including MacOS and FreeBSD). But if anyone else has access to an OpenBSD box, it would be great if you could test that tests pass there as well.
Is there any downside when users include the peer itself? (to keep config consistent)
free(peer->host);
peer->host = host;
peer->port = port;
peer->pool_size = pool_size;
are there cases in which a pool_size > 1 is desirable? (should that be disallowed?)
Yes, I added user-facing docs (which were partially missing) to explain why. If there are bursts of cancel requests you likely don't want to send those sequentially to the other process, so pool_size should almost always be larger than 1.
I took another look at this, and it seems pretty solid. The only issue I ran into is that if the peer is unreachable then cancellation can take up to query_wait_timeout (2 minutes by default) to return, which seems a little unexpected, though also not specific to the peering problem.
so_reuseport=1
unix_socket_dir=/tmp/pgbouncer1
peer_id=1
I also had to add a pidfile to make the example work
I double checked and it works fine without pidfile for me when running the two configs in two different shells like this:
./pgbouncer pgbouncer-peering1.ini
./pgbouncer pgbouncer-peering2.ini
I think it might have had to do with the fact that you didn't have /tmp/pgbouncer1 or /tmp/pgbouncer2 directories. I think that should be a hard error instead of a warning when using so_reuseport=1. But that's a separate issue that I'll open a new PR for.
For both performance and availability reasons it can be useful to load balance across multiple PgBouncer processes, possibly on different machines. It's very easy to do, by simply putting a generic TCP load balancer in front of these pgbouncer processes. The `so_reuseport` option is an example of such a load balancer, which uses the OS kernel to do the load balancing.

However, there is a problem with this simple approach: query cancellations will not work most of the time. The reason is that a cancellation will open another TCP connection, which might be handled by a different pgbouncer process than the one that is processing the query that should be cancelled. I talked about this problem in more detail at the Uptime conference: https://www.youtube.com/watch?v=M585FfbboNA

This change adds support for pgbouncer peering. To enable peering, two things need to be added to the configuration of each pgbouncer:
1. A `peer_id` that's unique for each pgbouncer that is peered with
2. A new `[peers]` section that should get an entry for each of the other peers

When pgbouncer is configured like this, then whenever it receives a cancellation request for an unknown session, it will forward it to the peer that owns this session.

To configure three pgbouncers in peering mode using `so_reuseport`, you can use these example configs.

pgbouncer1.ini
```ini
[databases]
postgres = host=localhost dbname=postgres

[pgbouncer]
peer_id=1
pool_mode=transaction
listen_addr=127.0.0.1
auth_type=trust
admin_users=jelte
auth_file=auth_file.conf
so_reuseport=1
unix_socket_dir=/tmp/pgbouncer1

[peers]
2 = host=/tmp/pgbouncer2
3 = host=/tmp/pgbouncer3
```

pgbouncer2.ini
```ini
[databases]
postgres = host=localhost dbname=postgres

[pgbouncer]
peer_id=2
pool_mode=transaction
listen_addr=127.0.0.1
auth_type=trust
admin_users=jelte
auth_file=auth_file.conf
so_reuseport=1
unix_socket_dir=/tmp/pgbouncer2

[peers]
1 = host=/tmp/pgbouncer1
3 = host=/tmp/pgbouncer3
```

pgbouncer3.ini
```ini
[databases]
postgres = host=localhost dbname=postgres

[pgbouncer]
peer_id=3
pool_mode=transaction
listen_addr=127.0.0.1
auth_type=trust
admin_users=jelte
auth_file=auth_file.conf
so_reuseport=1
unix_socket_dir=/tmp/pgbouncer3

[peers]
1 = host=/tmp/pgbouncer1
2 = host=/tmp/pgbouncer2
```

To then try out the cancellation forwarding, simply run the following in `psql`:
```sql
select pg_sleep(20);
```
And then press Ctrl+C.

Related to pgbouncer#345

Co-authored-by: Marco Slot <marco.slot@gmail.com>
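To make the forwarding rule above concrete, here is a hedged, self-contained C sketch of the decision a receiving pgbouncer could make for an incoming cancel key. All names, the drop-on-second-hop behaviour, and everything beyond what the discussion above states (peer ID in the top 16 bits, forwarded flag in the last bit) are assumptions for illustration, not the actual implementation.

```c
#include <stdint.h>
#include <stdio.h>

/* All names and constants here are made up for illustration; this is a
 * conceptual sketch of the routing decision, not PgBouncer source code. */

#define OWN_PEER_ID   ((uint16_t)1)   /* would come from peer_id in the config */
#define FORWARDED_BIT ((uint64_t)1)   /* lowest bit marks an already-forwarded cancel */

static void handle_cancel_request(uint64_t cancel_key)
{
	uint16_t owner = (uint16_t)(cancel_key >> 48);  /* peer id lives in the top 16 bits */

	if (owner == OWN_PEER_ID) {
		/* We issued this key: cancel the query on the matching server connection. */
		printf("cancel locally      (key=%016llx)\n", (unsigned long long)cancel_key);
	} else if (!(cancel_key & FORWARDED_BIT)) {
		/* Issued by another peer and not forwarded yet: forward it once, with the
		 * forwarded bit set so it cannot bounce between peers indefinitely. */
		printf("forward to peer %-3u (key=%016llx)\n", (unsigned)owner,
		       (unsigned long long)(cancel_key | FORWARDED_BIT));
	} else {
		/* Unknown here and already forwarded once: drop it. */
		printf("drop                (key=%016llx)\n", (unsigned long long)cancel_key);
	}
}

int main(void)
{
	handle_cancel_request(((uint64_t)1 << 48) | 0x1234abcd5678ULL);  /* owned by us */
	handle_cancel_request(((uint64_t)2 << 48) | 0x1234abcd5678ULL);  /* owned by peer 2 */
	handle_cancel_request(((uint64_t)2 << 48) | 0x1234abcd5679ULL);  /* already forwarded */
	return 0;
}
```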
This is something that was forgotten when the docs were added in pgbouncer#666.
Cancellations and regular queries are quite different, so it makes sense to have different (default) timeouts for cancellation requests than for queries. While I already think the default query_wait_timeout of 120 seconds is excessively long, it makes no sense at all to have such a long default timeout for cancellation requests that cannot be sent. So this introduces a new `cancel_wait_timeout`, which is set to 10 seconds by default. If the cancellation request cannot be forwarded within that time, it's probably better to drop it.

One reason why it's important to close unhandled cancellation requests relatively quickly is that libpq only allows the client to send a cancel in a blocking way. So there's no possibility to bail out after a certain number of seconds on the client side if PgBouncer keeps the connection to the client open. Both Marco and I noticed this when testing my PgBouncer peering PR (pgbouncer#666), when an entry in the `[peers]` list pointed to an unreachable host. That meant that pgbouncer would keep the cancel connection open, and even though the query had finished, `psql` would not display the result because it was stuck waiting for a response to the cancel request.

To avoid race conditions, cancellation requests that have already been forwarded but not yet answered are not timed out. See pgbouncer#717 for details on such race conditions.
This log line was introduced in pgbouncer#666, because that PR significantly changed our cancellation logic. But when a client sends many cancellations, this floods the logs with many identical log lines. To avoid that, this lowers the log level to debug. Also start using `slog_xxx` functions instead of `log_xxx` functions for easier debugging in `forward_cancel_request`. Fixes pgbouncer#901
@JelteF I am not sure I understand why this is needed. In your example pgbouncer config, all three pgbouncer instances talk to the same postgres host. In that case, the same postgres host will always receive the cancel request. I saw that your slides show two different postgres instances each with a pgbouncer in front, and that setup makes sense to me. But as far as I can tell, this is not useful when a bunch of different pgbouncer instances all talk to the same postgres instance. Did I misunderstand this?
@achanda Yes, you misunderstood. This peering feature is useful if there's a load balancer in front of these different pgbouncer instances (or if you're using `so_reuseport`).

The client and server cancellation tokens cannot be the same, because when the client connects to PgBouncer, no server connection is assigned to it yet. And even if a connection were assigned instantly, in transaction pooling mode the client will likely use another server connection for the next transaction, and the protocol does not allow updating the cancellation token. The cancellation token is static for the duration of the connection and is only sent during the start-up phase of the connection.
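To make the two-token bookkeeping described above a bit more concrete, here is a minimal, hypothetical C sketch. The struct and field names are invented for this explanation, not PgBouncer's actual data structures.

```c
#include <stdint.h>

/* Invented names, for illustration only. A pooler hands the client its own
 * cancel key during connection startup (in BackendKeyData) and separately
 * remembers the real cancel key of whichever server connection is currently
 * executing a query for that client. */
typedef struct PoolClient {
	uint64_t client_cancel_key;  /* fixed for the lifetime of the client connection */
	uint64_t server_cancel_key;  /* changes whenever a different server is assigned */
	int      linked_to_server;   /* in transaction pooling, often 0 between transactions */
} PoolClient;

/* When a CancelRequest carrying client_cancel_key arrives, the pooler looks up
 * the matching PoolClient and, if it is currently linked to a server, sends
 * server_cancel_key to that backend so the right query gets cancelled. */
```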
Thanks for explaining, makes sense 👍