New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: improve multihost connection for preferSlave case (verify expired hosts before connecting to cached master) #844

Merged
merged 10 commits into from Jan 3, 2018

Conversation

Projects
None yet
6 participants
@ChenHuajun
Contributor

ChenHuajun commented Jun 13, 2017

Hi,all

The current MultiHostChooser implementation in the following two scenarios is not very reasonable.

  1. jdbc:postgresql://slave1,master1/accounting?targetServerType=master

Each time the cached host state expires(control by hostRecheckSeconds), need to try 2 times to connect to the correct server.

improvement:
Adjust the connecting order, first try to connect master1

currently: slave1 -> master1
after : master1 -> slave1

  1. jdbc:postgresql://master1,slave1/accounting?targetServerType=preferSlave

The first connection or the first connection after cached host state expires, will be connected to the master1.

But that's not what we expected. Reasonable result is only in the absence of any available slave, will be connected to the Master.

improvement:
When targetServerType is preferSlave, internal attempts to make connection twice.
The first time internally set targetServerType to 'slave' and connect. If the connecting fails, then set the targetServerType to 'any' and do a second attempt.

This PR include the following major changes, please take a look

  1. Modify the above two problems
  2. Simplify the implementation of loadBalance
  3. modify test cases
    3.1 Adjust test case for above modification
    3.2 Add test cases for 1 master + 2 slave
    3.3 Add test cases for targetServerType=preferSlave&loadBalanceHosts=true

@ChenHuajun ChenHuajun changed the title from The current MultiHostChooser implementation in the following two scen… to improvement of MultiHostChooser implementation Jun 13, 2017

@codecov-io

This comment has been minimized.

codecov-io commented Jun 13, 2017

Codecov Report

Merging #844 into master will decrease coverage by 3.93%.
The diff coverage is 46.83%.

@@             Coverage Diff              @@
##             master     #844      +/-   ##
============================================
- Coverage     66.66%   62.73%   -3.94%     
+ Complexity     3659     3325     -334     
============================================
  Files           170      169       -1     
  Lines         15788    14886     -902     
  Branches       2582     2406     -176     
============================================
- Hits          10525     9338    -1187     
- Misses         4061     4442     +381     
+ Partials       1202     1106      -96
@davecramer

This comment has been minimized.

Member

davecramer commented Jun 16, 2017

@ChenHuajun

This comment has been minimized.

Contributor

ChenHuajun commented Jun 19, 2017

I have solved these problems checkstyle found, please check again.

And i added two 1 master + 2 slaves test cases in this patch, so the test environment also needs to add an additional slave with port number 5434.

@davecramer

This comment has been minimized.

Member

davecramer commented Jun 19, 2017

Thanks! I am wondering if we can get someone else besides @vlsi to review this ?

@vlsi vlsi self-requested a review Aug 8, 2017

@vlsi vlsi added the needs-review label Sep 25, 2017

try {
targetServerType =
requestedTargetServerType =
HostRequirement.valueOf(info.getProperty("targetServerType", HostRequirement.any.name()));

This comment has been minimized.

@kato-sho

kato-sho Sep 29, 2017

You should use targetServerTypeStr like the master branch.

This comment has been minimized.

@ChenHuajun

ChenHuajun Oct 10, 2017

Contributor

I had rebased this pacth and used targetServerTypeStr.

@ChenHuajun

This comment has been minimized.

Contributor

ChenHuajun commented Nov 12, 2017

@kato-sho @vlsi @davecramer Can anyone review this patch?

@davecramer

This comment has been minimized.

Member

davecramer commented Nov 13, 2017

@ChenHuajun does this change the current behaviour or just fix it ?

Would it be possible to elaborate some more on the documentation https://jdbc.postgresql.org/documentation/head/connect.html#connection-parameters search for Connection failover

@vlsi

This comment has been minimized.

Member

vlsi commented Nov 13, 2017

@ChenHuajun , can you please clarify how "3.2 Add test cases for 1 master + 2 slave" works in Travis environment?
It looks like it requires to have a secondary slave URL, and I do not see where that slave is created.

@vlsi

This comment has been minimized.

Member

vlsi commented Nov 13, 2017

As far as I can see, MultiHostsConnectionTest is included via MultiHostTestSuite, and the test just gets ignored at Travis. Is it something that can be tested?

@ChenHuajun

This comment has been minimized.

Contributor

ChenHuajun commented Nov 15, 2017

@davecramer This patch just a fix , does not change the behaviour described on the documentation : https://jdbc.postgresql.org/documentation/head/connect.html#connection-parameters

@ChenHuajun

This comment has been minimized.

Contributor

ChenHuajun commented Nov 15, 2017

@vlsi
MultiHostsConnectionTest will be executed only when slaveServer and slaveServer2 had been setted in build.local.properties or build.properties.

for example:

port=5432
database=postgres
username=test
password=test
slaveServer=192.168.107.211
slavePort=5433
slaveServer2=192.168.107.211
slavePort2=5434

Of course it's needed to prepared two slave servers first. Could you configure build.local.properties on Travis

@davecramer

This comment has been minimized.

Member

davecramer commented Nov 18, 2017

@ChenHuajun the problem is having two servers on travis. not configuring it.

@ChenHuajun

This comment has been minimized.

Contributor

ChenHuajun commented Nov 20, 2017

@davecramer @vlsi
I am not familiar with Travis before . Do you mean should modify Travis's script to create two hot standby servers ?
If so, i think the two extra servers could be created and MultiHostTestSuite would be run when the environment variable REPLICATION=Y . If you agree with that, i will modified this patch.

@davecramer

This comment has been minimized.

Member

davecramer commented Nov 21, 2017

@ChenHuajun yes, I agree, please modify so that the test can run. Thanks!

@ChenHuajun

This comment has been minimized.

Contributor

ChenHuajun commented Nov 28, 2017

@vlsi the MultiHostsConnectionTest has been added to Travis, Please check again.

sudo su postgres -c "echo primary_conninfo = \'host=localhost port=5432 user=test password=test\' >>${PG_SLAVE1_DATADIR}/recovery.conf"
sudo su postgres -c "echo recovery_target_timeline = \'latest\' >>${PG_SLAVE1_DATADIR}/recovery.conf"
if [ "x${PG_VERSION}" <> "xHEAD" ]

This comment has been minimized.

@vlsi

vlsi Nov 28, 2017

Member

Does this work?
The code gets executed for HEAD as well: https://travis-ci.org/pgjdbc/pgjdbc/jobs/307997910#L5427

Would you please try use [[ .. ]]?

This comment has been minimized.

@ChenHuajun

ChenHuajun Dec 2, 2017

Contributor

@vlsi I had fix this problem, Please check again.

@vlsi

This comment has been minimized.

Member

vlsi commented Nov 28, 2017

Just a note: slaves are configured for REPLICATION=Y jobs, and multihost tests get executed

Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.185 sec - in org.postgresql.test.hostchooser.MultiHostTestSuite
@@ -116,21 +116,35 @@ public QueryExecutor openConnectionImpl(HostSpec[] hostSpecs, String user, Strin
int connectTimeout = PGProperty.CONNECT_TIMEOUT.getInt(info) * 1000;
// - the targetServerType setting
HostRequirement targetServerType;
HostRequirement requestedTargetServerType;

This comment has been minimized.

@AlexElin

AlexElin Dec 5, 2017

Contributor

I think, one can extract the getting of 'requestedTargetServerType' (lines 119-127) into private method

This comment has been minimized.

@ChenHuajun

ChenHuajun Dec 12, 2017

Contributor

@AlexElin Had changed

@vlsi

I was reviewing this, and I don't quite like the implementation yet.

As far as I can see, this change does not get status expiration right.

}
SocketFactory socketFactory = SocketFactoryFactory.getSocketFactory(info);
HostChooser hostChooser =
HostChooserFactory.createHostChooser(hostSpecs, targetServerType, info);
Iterator<HostSpec> hostIter = hostChooser.iterator();
if (!hostIter.hasNext() && nextTargetServerType != null) {

This comment has been minimized.

@vlsi

vlsi Dec 19, 2017

Member

This code gets repeated multiple times and that is a bad sign as that somewhat duplicate the concept of iterator() that has been there.

This comment has been minimized.

@ChenHuajun

ChenHuajun Dec 24, 2017

Contributor

i had refactor code to avoid repeated those code, please review it again.

hostInfo = new HostSpecStatus(hostSpec, null, Long.MAX_VALUE);
}
// candidates are nodes we do not know about and the nodes with correct type
if (hostInfo.status == null || targetServerType.allowConnectingTo(hostInfo.status)) {
if (hostInfo.status == null
|| hostInfo.lastUpdated < latestAllowedUpdate

This comment has been minimized.

@vlsi

vlsi Dec 19, 2017

Member

The thing is due to "host recheck", it should forget known master/slave status of the host. So it ends up with (host1==master, host2==slave, host3==unknown, host4==unknown) kind of states.
In that case, for loadBalance=true; connect=preferSlave it should try connecting to unknown hosts and it should try its best finding a slave there (between unknowns and known-slaves).

As far as I can see, the change was "never reset status", and I do not see why it is a good change.

This comment has been minimized.

@ChenHuajun

ChenHuajun Dec 21, 2017

Contributor

This change just move "hostInfo.lastUpdated < latestAllowedUpdate" from line 66 to line 70, when "loadBalance=true&targetServerType=preferSlave", the host whose status is expired or unknown do have the chance to be chosen.
In other words, the behavior is same as before when "loadBalance=true".

But when "loadBalance=false", this change made the hosts with the right matched status(whether expired or not) have higher priority than others to be chosen.
That can avoid some useless connecting retry.

This comment has been minimized.

@vlsi

vlsi Dec 21, 2017

Member

the host whose status is expired or unknown do have the chance to be chosen.

They do not unless the requested status matches server state.
For instance:

  1. requested is "master"
  2. once the server was identified as "slave"
  3. then pgjdbc would never try to connect to that server since it remembers the status as slave.

Am I missing something?

This comment has been minimized.

@ChenHuajun

ChenHuajun Dec 21, 2017

Contributor
  1. requested is "master"
  2. once the server was identified as "slave"
  3. then pgjdbc would never try to connect to that server since it remembers the status as slave.

That's true only when the server's status is not expired(default is 10s)

when status was expired, the server will be add to the candidates list.

GlobalHostStatusTracker.java

if (hostInfo.status == null
   || hostInfo.lastUpdated < latestAllowedUpdate
   || targetServerType.allowConnectingTo(hostInfo.status)) {
       candidates.add(hostInfo);
    }

and then have chance to be chosen.

MultiHostChooser.java

    if (!loadBalance) {
      sortCandidates(candidates);
    } else {
      shuffleCandidates(candidates);
    }
The current MultiHostChooser implementation in the following two scen…
…arios is not very reasonable.

1. jdbc:postgresql://slave1,master1/accounting?targetServerType=master
Each time the cached host state expires(control by hostRecheckSeconds), need to try 2 times to connect to the correct server.

improvement:
Adjust the connecting order, first try to connect master1
currently: slave1 -> master1
after    : master1 -> slave1

2. jdbc:postgresql://master1,slave1/accounting?targetServerType=preferSlave
The first connection or the first connection after cached host state expires, will be connected to the master1.
But that's not what we expected. Reasonable result is only in the absence of any available slave, will be connected to the Master.

improvement:
When targetServerType is preferSlave, internal attempts to make connection twice.
The first time internally set targetServerType to 'slave' and connect. If the connecting fails, then set the targetServerType to 'any' and do a second attempt.
@ChenHuajun

This comment has been minimized.

Contributor

ChenHuajun commented Dec 24, 2017

@vlsi I had rebased this patch to resolve confict with #1028 and refactor code to avoid too many repeated code in ConnectionFactoryImpl.java as you pointed, Please check again.

vlsi added some commits Dec 31, 2017

refactor: replace sort with append(slaves, any) in MultiHostChooser
This makes logic for "preferSlave" much cleaner
fix: re-introduce "try all hosts in case no good candidates known" be…
…havior

Technically speaking, hostRecheckSeconds is 10 seconds by default, so it should cache "connection state" by default
fix: avoid use of GlobalHostStatusTracker.getKnownStatus in Connectio…
…nFactoryImpl

This is required for "no good known hosts available" case. In that case the system
tries all the hosts in order, so it should not consider status in the GlobalHostStatusTracker
@vlsi

vlsi approved these changes Jan 3, 2018

@vlsi vlsi removed the needs-review label Jan 3, 2018

@vlsi vlsi added this to the 42.2.0 milestone Jan 3, 2018

@vlsi vlsi changed the title from improvement of MultiHostChooser implementation to fix: improve multihost connection for preferSlave case (verify expired hosts before connecting to cached master) Jan 3, 2018

@vlsi vlsi merged commit c6fec34 into pgjdbc:master Jan 3, 2018

1 of 2 checks passed

codecov/project 62.73% (-3.94%) compared to 3e0491a
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@vlsi

This comment has been minimized.

Member

vlsi commented Jan 3, 2018

@ChenHuajun, many thanks for the fix and tests!

rhavermans added a commit to bolcom/pgjdbc that referenced this pull request Jul 13, 2018

fix: improve multihost connection for preferSlave case (verify expire…
…d hosts before connecting to cached master) (pgjdbc#844)

The notable behavior change is related with targetServerType=preferSlave.
In that case, it requires to check all the possible slaves first to find the right server.
The state of some servers might be expired due to hostRecheckSeconds,
so pgjdbc should try connecting to all the "known slaves"
and "expired" servers in attempt to find a slave among them, and only then
it should try connecting to master.

That logic was not there, and preferSlave could easily connect to master even
at times when a slave was available.

Note: hostRecheckSeconds is still in place (default 10 seconds), and pgdjbc
trusts that status.

However, there's an exception: in case no servers match the criteria (e.g.
all the servers are marked with "can't connect" in cache), then pgjdbc
would still try connecting to all the hosts in the connection URL in order.

rhavermans added a commit to bolcom/pgjdbc that referenced this pull request Jul 13, 2018

fix: improve multihost connection for preferSlave case (verify expire…
…d hosts before connecting to cached master) (pgjdbc#844)

The notable behavior change is related with targetServerType=preferSlave.
In that case, it requires to check all the possible slaves first to find the right server.
The state of some servers might be expired due to hostRecheckSeconds,
so pgjdbc should try connecting to all the "known slaves"
and "expired" servers in attempt to find a slave among them, and only then
it should try connecting to master.

That logic was not there, and preferSlave could easily connect to master even
at times when a slave was available.

Note: hostRecheckSeconds is still in place (default 10 seconds), and pgdjbc
trusts that status.

However, there's an exception: in case no servers match the criteria (e.g.
all the servers are marked with "can't connect" in cache), then pgjdbc
would still try connecting to all the hosts in the connection URL in order.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment