Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: advance lastReceiveLSN on keepalive messages #1038

Merged
merged 1 commit into from Jan 13, 2018

Conversation

@jklukas
Copy link
Contributor

@jklukas jklukas commented Dec 19, 2017

Fix a bug with V3PGReplicationStream where the server may never
be able to send a CommandComplete message to the client because
the client's status update messages do not advance the received LSN.
This behavior can be reliably reproduced on a server where no
data changes are happening, so the only messages being sent from
the client are status updates and the only messages coming from
the server are keepalives. The LSN reported from the server keepalives
continues to advance, but the client reports a static
lastReceiveLSN, so the server never considers the client caught
up and never sends CommandComplete to trigger the client to shut down.
By updating lastReceiveLSN on keepalive messages, the expected
server CommandComplete and client Terminate messages are exchanged,
allowing the server to gracefully shut down.

@@ -222,6 +222,9 @@ private void updateStatusInternal(

private boolean processKeepAliveMessage(ByteBuffer buffer) {
lastServerLSN = LogSequenceNumber.valueOf(buffer.getLong());
if (lastServerLSN.asLong() > lastReceiveLSN.asLong()) {
lastReceiveLSN = lastServerLSN;
}

This comment has been minimized.

@jklukas

jklukas Dec 19, 2017
Author Contributor

This mimics the keepalive logic in pg_recvlogical where output_written_lsn is set to Max(walEnd, output_written_lsn) and walEnd is the first long of the server keepalive message.

It's unclear to me whether this logic also makes sense for physical replication, or if different logic is needed there.

This comment has been minimized.

@davecramer

davecramer Dec 19, 2017
Member

I seriously doubt anyone uses the physical replication facility in JDBC

Fix a bug with V3PGReplicationStream where the server may never
be able to send a CommandComplete message to the client because
the client's status update messages do not advance the received LSN.
This behavior can be reliably reproduced on a server where no
data changes are happening, so the only messages being sent from
the client are status updates and the only messages coming from
the server are keepalives. The LSN reported from the server keepalives
continues to advance, but the client reports a static
lastReceiveLSN, so the server never considers the client caught
up and never sends CommandComplete to trigger the client to shut down.
By updating lastReceiveLSN on keepalive messages, the expected
server CommandComplete and client Terminate messages are exchanged,
allowing the server to gracefully shut down.
@jklukas jklukas force-pushed the jklukas:advance-on-keepalive branch from 414edc4 to 5c802a3 Dec 19, 2017
@jklukas
Copy link
Contributor Author

@jklukas jklukas commented Dec 19, 2017

cc @Gordiychuk if you have some time to comment on this approach.

@codecov-io
Copy link

@codecov-io codecov-io commented Dec 19, 2017

Codecov Report

No coverage uploaded for pull request base (master@e6a1ecc). Click here to learn what that means.
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master    #1038   +/-   ##
=========================================
  Coverage          ?   67.18%           
  Complexity        ?     3700           
=========================================
  Files             ?      170           
  Lines             ?    15699           
  Branches          ?     2537           
=========================================
  Hits              ?    10547           
  Misses            ?     3958           
  Partials          ?     1194
@davecramer
Copy link
Member

@davecramer davecramer commented Jan 12, 2018

@jklukas can you reproduce the bug in a test case ? I'd like to get this in ASAP.

@jklukas
Copy link
Contributor Author

@jklukas jklukas commented Jan 12, 2018

Good to know getting it in sounds doable. I will take a look at tests and see if I can figure out a way to represent this.

@jklukas
Copy link
Contributor Author

@jklukas jklukas commented Jan 12, 2018

I've added a test now that is able to demonstrate this new behavior for me locally.

The problem is that it's not clear to me how the PostgreSQL server decides when it's going to send keepalive messages. When the client is only sending status update messages, I'm seeing keepalives back from the server about every 20 seconds. This test currently spins for 30 seconds in order to have a high probability of getting a keepalive back.

I think we need a better guarantee for this test than "the server will probably send a keepalive if we wait 30 seconds", and adding 30 seconds to the test suite seems silly too.

Any thoughts on whether we there's something we can do in the test or in the server configuration to increase the rate of keepalives?

@davecramer
Copy link
Member

@davecramer davecramer commented Jan 12, 2018

Not off the top of my head and I would really like to see this get in before we release 42.0.0. I'm fairly confident this will work as it appears to have just replicated the code from pg_recvlogical

@jklukas
Copy link
Contributor Author

@jklukas jklukas commented Jan 12, 2018

I'm fairly confident this will work as it appears to have just replicated the code from pg_recvlogical

I agree. I've been successfully testing an application with this change (although that's just one data point).

I can't guarantee I'll have more time in the near future to put into trying to get a nice test put together. Do you think we should remove the test here for the time being?

@davecramer
Copy link
Member

@davecramer davecramer commented Jan 12, 2018

@jklukas jklukas force-pushed the jklukas:advance-on-keepalive branch from 39ee5ea to 5c802a3 Jan 12, 2018
@jklukas
Copy link
Contributor Author

@jklukas jklukas commented Jan 12, 2018

Reverted this branch back to just the code change, without the test.

@davecramer davecramer merged commit 1be8a9e into pgjdbc:master Jan 13, 2018
2 checks passed
2 checks passed
codecov/project Absolute coverage decreased by -<.01% but relative coverage increased by +33.6% compared to ed27c5b
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@vlsi vlsi added this to the 42.2.0 milestone Jan 13, 2018
rhavermans added a commit to bolcom/pgjdbc that referenced this pull request Jul 13, 2018
Fix a bug with V3PGReplicationStream where the server may never
be able to send a CommandComplete message to the client because
the client's status update messages do not advance the received LSN.
This behavior can be reliably reproduced on a server where no
data changes are happening, so the only messages being sent from
the client are status updates and the only messages coming from
the server are keepalives. The LSN reported from the server keepalives
continues to advance, but the client reports a static
lastReceiveLSN, so the server never considers the client caught
up and never sends CommandComplete to trigger the client to shut down.
By updating lastReceiveLSN on keepalive messages, the expected
server CommandComplete and client Terminate messages are exchanged,
allowing the server to gracefully shut down.
rhavermans added a commit to bolcom/pgjdbc that referenced this pull request Jul 13, 2018
Fix a bug with V3PGReplicationStream where the server may never
be able to send a CommandComplete message to the client because
the client's status update messages do not advance the received LSN.
This behavior can be reliably reproduced on a server where no
data changes are happening, so the only messages being sent from
the client are status updates and the only messages coming from
the server are keepalives. The LSN reported from the server keepalives
continues to advance, but the client reports a static
lastReceiveLSN, so the server never considers the client caught
up and never sends CommandComplete to trigger the client to shut down.
By updating lastReceiveLSN on keepalive messages, the expected
server CommandComplete and client Terminate messages are exchanged,
allowing the server to gracefully shut down.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.