Skip to content

Add load_balance_hosts parameter to db config#736

Merged
JelteF merged 19 commits into
pgbouncer:masterfrom
dpirotte:host-strategy
Nov 2, 2024
Merged

Add load_balance_hosts parameter to db config#736
JelteF merged 19 commits into
pgbouncer:masterfrom
dpirotte:host-strategy

Conversation

@dpirotte

@dpirotte dpirotte commented Jul 5, 2022

Copy link
Copy Markdown
Contributor

PGBouncer's default behavior is to round-robin between comma-separated
hosts on each new connection attempt. This is desirable when all hosts
are similar, such as load balancing over multiple replicas, but
undesirable when only one host out of the list is expected to
successfully connect and login. In this scenario, every other connection
attempt would hit an invalid host and fail.

The host_strategy parameter controls this host selection behavior. The
default behavior is labeled round_robin and a new strategy called
last_successful is introduced. This new strategy instructs PGBouncer
to prefer the host with most recent successful connections. When that
host fails, new connections will round-robin through the list until a
successful connection occurs.

Comment thread src/loader.c Outdated
int max_db_connections = -1;
int dbname_ofs;
int pool_mode = POOL_INHERIT;
int host_strategy = ROUND_ROBIN;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int host_strategy = ROUND_ROBIN;
enum HostStrategy host_strategy = ROUND_ROBIN;

@JelteF JelteF left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution, it looks useful, postgres connection strings work in a similar way so it makes sense to support that as well.

I left some comments. And can you please rebase/merge this PR because there's a merge conflict.

Comment thread src/loader.c Outdated
res_pool_size = atoi(val);
} else if (strcmp("max_db_connections", key) == 0) {
max_db_connections = atoi(val);
} else if (strcmp("host_strategy", key) == 0) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs an entry in config.md to describe its usage.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation is in. Feedback welcome.

Comment thread src/objects.c Outdated
Comment on lines +1072 to +1077
if (!server->pool->last_connect_failed && server->pool->db->host_strategy == LAST_SUCCESSFUL)
server->pool->rrcounter = server->pool->last_successful_rrcounter;

@JelteF JelteF Aug 11, 2022

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current PR seems to have the desired effect based on the tests you wrote, but I have the feeling the same could probably be achieved without the two new fields you added: server->rrcounter and pool->last_successful_rrcounter.

How about we simply increase the pool->rr_counter whenever a connection attempt fails.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I simplified the logic as you suggest and it looks better to me.

@dpirotte

Copy link
Copy Markdown
Contributor Author

@JelteF Touching base. Happy to incorporate any additional feedback on the new implementation.

Comment thread src/loader.c Outdated
db->res_pool_size = res_pool_size;
db->pool_mode = pool_mode;
db->max_db_connections = max_db_connections;
db->host_strategy = host_strategy;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the host_strategy of a database can change, should we tag it as changed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good call. Yes, I think we should tag it so behavior changes after a reload. I will make that change. Thanks for the feedback!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

host_strategy is now reloadable.

show databases/show pools also now include host_strategy in their output, which makes this testable and is useful on its own.

@dpirotte

dpirotte commented Nov 7, 2022

Copy link
Copy Markdown
Contributor Author

@petere @JelteF I rebased this PR and made @goksgie's suggested change above.

Please let me know if anything else needs to happen to get this into a mergeable state.

@JelteF

JelteF commented Nov 17, 2022

Copy link
Copy Markdown
Member

I think the code in the PR is fine now (apart from the merge conflict that occured due to another merged PR). But after playing with these changes a bit I found one other issue, and I'm wondering how useful this feature is without also addressing this issue:
server_login_retry creates a significant pause anytime a single node cannot be connected to.

@dpirotte How are you using this change, that it is useful in its current form?

@JelteF

JelteF commented Jun 5, 2023

Copy link
Copy Markdown
Member

Hi @dpirotte, github notified me that you rebased this on a branch of your fork. I think a rebased version of this can be merged pretty easily. Could you still explain how you use this in practice? Because when I played with it, another bug in pgbouncer was causing serious issues for me.

@dpirotte

dpirotte commented Jun 5, 2023

Copy link
Copy Markdown
Contributor Author

Hi @JelteF. Yeah, I’m working on rebasing this and another patch on top of 1.19. The code works but I haven’t updated to your new (much improved!) test framework.

The purpose here is to support faster failovers in environments where 1/N hosts in the list will be the primary and (N-1)/N hosts will be replicas and where DNS updates are potentially too slow. RDS/Aurora is one such environment, and I believe some Patroni configurations are as well.

Given the following configurations:

; pg-primary is a CNAME to whichever of pg-0 or pg-1 is the primary
current = host=pg-primary.cluster.local connect_query=‘select disconnect_self() if pg_is_in_recovery()’…
proposed = host=pg-0.cluster.local,pg-1.cluster.local host_strategy=last_successful connect_query=‘select disconnect_self() if pg_is_in_recovery()’…

After a failover, current may take some time to point to the correct new primary, i.e. due to DNS caching. To speed up failovers, proposed takes the full list of potential primaries so pgbouncer can iterate through until it finds the primary. (And, to your earlier comment, this does encourage a low value for server_login_retry.)

A key part here is the connect_query, which asserts that pgbouncer is only ever talking to a writable database. Without host_strategy, using the full list of potential primaries means that (N-1)/N new connections will fail and wait for server_login_retry before trying another host. With host_strategy=last_successful, we only fail connections until a new primary is found, and direct all new connections there until the next failure. This minimizes failed connection noise and time waiting for server_login_retry.

I have another patch that adds lightweight target_session_attrs support albeit only on PG 14+ hosts where the server informs the client of its status as a primary or standby. This replaces the connect_query above, but is not a replacement for host_strategy. One could imagine having three hosts (primary and two replicas) where you want host_strategy=last_successful for the primary db connection string but still want to load balance on the replica connection string with host_strategy=round_robin.

Another option would be to keep the round-robin behavior and make target_session_attrs loop through hosts and gracefully disconnect until it finds one matching the requested state. I prefer having both configuration options so that target_session_attrs=primary host_strategy=last_successful is opening/closing as few connections on replicas as possible.

I’ll bump once the tests are ported to pytest and we’ll go from there.

@dpirotte

dpirotte commented Jun 7, 2023

Copy link
Copy Markdown
Contributor Author

@JelteF Rebased host_strategy with pytests if you want to take a look while I sort out the macOS build.

Comment thread doc/config.md Outdated
Set the pool mode specific to this database. If not set,
the default `pool_mode` is used.

### host_strategy

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to change the name of the option to load_balance_hosts. This option got added to libpq for PG16 (by me). It seems nice to use consistently named options across projects. The last_successful option name should then be changed to disable. And lets change round_robin to round-robin, to be consistent with other arguments like verify-full and read-write.

PS. I especially think consistency is nice here, because you mentioned you also want to add target_session_attrs to PgBouncer. Since target_session_attrs and load_balance_hosts can be used nicely together, I think it's especially useful to have the same names for both Postgres and PgBouncer.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, adding a random as an option would be a nice future improvement to make sure PgBouncer supports the same as libpq. But that should definitely be another follow up PR. We can merge this PR easily without the random functionality.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to change the name of the option to load_balance_hosts. This option got added to libpq for PG16 (by me).

Ha, good call. I noticed that you contributed this feature in the release notes. +100 to consistent naming with libpq. I'll take care of that.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the initial naming. It seems clear. load_balance_hosts seems to be a misnomer, since no "load" and no "balancing" is involved?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the initial naming. It seems clear.

I think the initial name was a fine name too. But I think being in line with naming of libpq, jdbc and npgsql is a big advantage of the load_balance_hosts naming.

load_balance_hosts seems to be a misnomer, since no "load" and no "balancing" is involved?

The "load" is the number of connection that are made by pgbouncer to the database, and those connections are "balanced" across the different servers in a round-robin fashion (or random fashion when implemented in the future). And you can enable this balancing by using disable which will select a working host and send all traffic there.

IMHO, the old naming describes how pgbouncer chooses hosts. The new naming describes what the user will see happening to their setup by changing the setting. I think I personally like names that describe the effects instead of the method.

@dpirotte dpirotte changed the title Add host_strategy parameter to db config Add load_balance_hosts parameter to db config Jun 8, 2023
@dpirotte

dpirotte commented Jun 8, 2023

Copy link
Copy Markdown
Contributor Author

@JelteF Updated naming per discussion above.

The macOS build passes now. The mingw{32,64} builds failed a couple times but seem to pass now, so those might have been random failures. The most recent build failures look like Cirrus issues: Failed to start an instance: INTERNAL: API response null(0): An error has occurred. Failed to connect to /34.122.100.68:443.

Comment thread test/test.ini
hostlist2 = port=6666 host=127.0.0.1,127.0.0.1 dbname=p0 user=bouncer
hostlist_good_first = port=6666 host=127.0.0.1,127.0.0.3 dbname=p0 user=bouncer load_balance_hosts=disable
hostlist_bad_first = port=6666 host=unresolvable-hostname,127.0.0.1 dbname=p0 user=bouncer load_balance_hosts=disable
load_balance_hosts_update = port=6666 host=127.0.0.1,127.0.0.3 dbname=p0 user=bouncer load_balance_hosts=disable

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The load_balance_hosts_update database isn't used (afaict). Did you have a test in mind for it?

Comment thread doc/config.md
Comment on lines +1065 to +1243
When a comma-separated list is specified in `host`, `load_balance_hosts` controls
which entry is chosen for a new connection.

@JelteF JelteF Jun 14, 2023

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if this same logic would be used when receiving multiple IP adresses from DNS. The easiest way to test this is to create lines in /etc/host. Libpq its load_balance_hosts version does this too. So if we don't implement this, we should at least make it explicit in the docs that this setting currently doesn't impact DNS based load balancing (but that this might change in the future).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I'm wondering if this feature also includes #448 or at least to have it in mind.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I didn't see that PR before (I guess I didn't go more than 2 years into PR history when joining the project). Right now #448 is completely distinct from this PR:

  1. This PR does not implement the disable mode for DNS load balancing. It still uses round robin even if disable is used.
  2. This PR does not implement the random mode. PR add server_shuffle_hosts #448 implements random mode for DNS load balancing only, not for host lists (certainly because host lists were not available in PgBouncer in the time that PR was written).

Th design of the config option would allow for implementing both 1 and 2 though. I think 2 isn't required to implement for this PR. A completely missing option isn't very confusing. But I think 1 is quite confusing because the existing option does not apply to all the places that you would expect, so preferably this PR would implement 1 too. If that's hard for some reason, then I think a note in the docs is also sufficient.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @JelteF,

I work with the author and he's been a bit too busy to finish responding to your feedback so I'd like to take a stab at it.

I don't quite understand 1. Maybe I don't understand pgbouncer internals well enough yet but can you give me some pointers on implementing the disable mode for DNS load balancing? Sorry for such a vague ask for clarification but I do want to complete this feature as soon as I can.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to clarify: PgBouncer does round-robin load balancing in two ways:

  1. If you have multiple hostnames in the host key, then it will round robin across these.
  2. If a single DNS entry returns multiple IPs, then it load balanaces using round-robin across these IPs: https://www.pgbouncer.org/faq.html#how-to-load-balance-queries-between-several-servers

This same thing is true for libpq and load_balance_hosts can be used to control the behaviour for both of these there. The implementation of load_balance_hosts in this PR only controls the behaviour of 1, not 2. And I think it should control both.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was mistaken. Looks like getent hosts XX returns all IPs listed in /etc/hosts for XX.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK then the only thing that makes this hard @JelteF is my own understanding. Is this where the DNS rr needs to be disabled? https://github.com/cosgroveb/pgbouncer/blob/host-strategy-cos-rebased/src/objects.c#L1647-L1649

@cosgroveb cosgroveb Sep 17, 2024

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay while getent hosts XX returns all the IPs listed in /etc/hosts should I expect to see all the IPs in /etc/hosts for my database host? I put together this demonstration:

def test_load_balance_hosts_disable_with_dns(bouncer, pg):
    bouncer.default_db = "dns_load_balance_hosts_disable"
    hosts = f"""
127.0.0.53       dnsdbhost
127.0.0.54       dnsdbhost
127.0.0.55       dnsdbhost
    """

    # I have modified utils.py to set listen_addresses='*' for this
    #  demonstrationsbut would use pg.configure/pg.reload for a real test
    with bouncer.run_with_appended_etc_hosts(hosts):
        subprocess_result = capture(
            ["getent", "hosts", "dnsdbhost"],
        )
        getent_result = subprocess_result.split("\n")
        assert "127.0.0.53      dnsdbhost" == getent_result[0]
        assert "127.0.0.54      dnsdbhost" == getent_result[1]
        assert "127.0.0.55      dnsdbhost" == getent_result[2]

        bouncer.sql(query="SELECT 1", user="bouncer", password="zzzz", dbname="dns_load_balance_hosts_disable")
        with bouncer.admin_runner.cur() as cur:
            results = cur.execute("SHOW DNS_HOSTS").fetchall()
            result = [r for r in results if r[0] == 'dnsdbhost'][0]

            assert result[2] != "127.0.0.53:0"
______________________________________ test_load_balance_hosts_disable_with_dns _______________________________________
/home/admin/code/pgbouncer/test/test_load_balance_hosts.py:70: in test_load_balance_hosts_disable_with_dns
    assert result[2] != "127.0.0.53:0"
E   AssertionError: assert '127.0.0.53:0' != '127.0.0.53:0'
        bouncer    = <test.utils.Bouncer object at 0x7f0499fbb2e0>
        cur        = <psycopg.Cursor [closed] [BAD] at 0x7f049b445460>
        getent_result = ['127.0.0.53      dnsdbhost', '127.0.0.54      dnsdbhost', '127.0.0.55      dnsdbhost', '127.0.0.53      dnsdbhost', '127.0.0.54      dnsdbhost', '127.0.0.55      dnsdbhost', ...]
        hosts      = '\n127.0.0.53       dnsdbhost\n127.0.0.54       dnsdbhost\n127.0.0.55       dnsdbhost\n    '
        pg         = <test.utils.Postgres object at 0x7f0499fbbac0>
        result     = ('dnsdbhost', 14, '127.0.0.53:0')
        results    = [('dnsdbhost', 14, '127.0.0.53:0')]
        subprocess_result = '127.0.0.53      dnsdbhost\n127.0.0.54      dnsdbhost\n127.0.0.55      dnsdbhost\n127.0.0.53      dnsdbhost\n127.0.0.54      dnsdbhost\n127.0.0.55      dnsdbhost\n'

Because the work on the related feature, target_session_attrs is substantially complete and in need of review I do favor merging this branch as soon as possible. As you can see, the testing situation is a bit icky at the moment for the DNS sub-feature of load_balance_hosts but with some time I can complete it. It would be super beneficial to me if we can start working on getting target_session_attrs into pgbouncer in parallel and I think it would be to other users as well.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable to me.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll get push access to @dpirotte's fork, push up a more recent rebase, and address your docs feedback below it is recommended to set server_login_retry lower than the default

Comment thread doc/config.md
@cosgroveb

Copy link
Copy Markdown
Contributor

@JelteF I rebased this and addressed the docs feedback above. Let me know if you run into any issues merging this!

dpirotte and others added 10 commits September 18, 2024 12:21
PGBouncer's default behavior is to round-robin between comma-separated
hosts on each new connection attempt. This is desirable when all hosts
are similar, such as load balancing over multiple replicas, but
undesirable when only one host out of the list is expected to
successfully connect and login. In this scenario, every other connection
attempt would hit an invalid host and fail.

The `host_strategy` parameter controls this host selection behavior. The
default behavior is labeled `round_robin` and a new strategy called
`last_successful` is introduced. This new strategy instructs PGBouncer
to prefer the host with most recent successful connections. When that
host fails, new connections will round-robin through the list until a
successful connection occurs.
In order to maintain behavior that connections always start from the
left-most host in the host list, increment rrcounter either before
choosing a host (in the case of last_successful) or after choosing a
host (in the case of round_robin).

(This incorporates PR feedback from @JelteF.)
Previously, host_strategy could only be set at pgbouncer startup, which
is unintuitive and inconsistent with other per-database configuration.

Also, add host_strategy to both `show databases` and `show pools` output
for visibility into the current setting, and to facilitate testing that
the configuration properly reloads.
Previously, the host_strategy tests used 127.0.0.2,127.0.0.1 had a "bad
IP" (127.0.0.2) as the first entry to force the first backend server
connection to fail and verify that `host_strategy=last_successful`
directs new connections to the second "good" entry.

MacOS doesn't treat 127.0.0.2 in the same manner as Linux by default, so
we need a different way to fail the first host in the list. Switching to
an unresolvable DNS hostname works fine on both Linux and MacOS.
I incorrectly ported these tests from bash to pytest and so they were
not actually verifying that subsequent connections were reusing the last
successful host in the list. The tests now verify behavior correctly.
This feature does almost the same thing as the upcoming libpq param
called `load_balance_hosts`, so use that name for consistency and rename
the configuration options accordingly (`last_successful` => `disable`)
Lost in translation from bash tests to pytest

@JelteF JelteF left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need to give this another round of testing, but from what I remember the functionality worked as advertised in the past.

I left a few small comments on the documentation. Also formatting check is still failing.

Comment thread doc/config.md
Comment thread doc/usage.md Outdated
@cosgroveb

cosgroveb commented Sep 18, 2024

Copy link
Copy Markdown
Contributor

Thanks @JelteF. will keep an eye on CI.

@cosgroveb

Copy link
Copy Markdown
Contributor

Hey @JelteF did you run into any trouble testing this? Just checking as I'm rebasing and prepping the target_session_attrs PR.

@JelteF

JelteF commented Oct 28, 2024

Copy link
Copy Markdown
Member

Sorry for the delay here. I'll try to play around with this soonish.

Aside from that, since today there's now a discord channel to discuss PgBouncer development. It would be nice if you could join that: https://discordapp.com/channels/1258108670710124574/1300532992304742481

@cosgroveb

Copy link
Copy Markdown
Contributor

@JelteF can you link the Discord server? I wonder if I need to join the server first. Your channel link takes me here:

image

@cosgroveb

Copy link
Copy Markdown
Contributor

Aha it must be PostgreSQL Hackers https://lnkd.in/g8n3dZfx

@JelteF JelteF merged commit 6ec2c65 into pgbouncer:master Nov 2, 2024
@JelteF JelteF mentioned this pull request Dec 3, 2024
rajaryanece pushed a commit to rajaryanece/pgbouncer that referenced this pull request Jan 16, 2025
PGBouncer's default behavior is to round-robin between comma-separated
hosts on each new connection attempt. This is desirable when all hosts
are similar, such as load balancing over multiple replicas, but
undesirable when only one host out of the list is expected to
successfully connect and login. In this scenario, every other connection
attempt would hit an invalid host and fail.

The `host_strategy` parameter controls this host selection behavior. The
default behavior is labeled `round_robin` and a new strategy called
`last_successful` is introduced. This new strategy instructs PGBouncer
to prefer the host with most recent successful connections. When that
host fails, new connections will round-robin through the list until a
successful connection occurs.

---------

Co-authored-by: Brian Cosgrove <cosgroveb@gmail.com>
cosgroveb added a commit to dpirotte/pgbouncer that referenced this pull request Mar 4, 2025
cosgroveb added a commit to cosgroveb/pgbouncer that referenced this pull request Mar 4, 2025
JelteF pushed a commit that referenced this pull request Mar 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants