fix issue #834 setting statusIntervalUpdate causes high CPU load \ #835

Merged

merged 2 commits into pgjdbc:master from davecramer:fixupdatestatusinterval on Jun 8, 2017

Conversation

davecramer (Member) commented Jun 1, 2017

check for statusInterval of 0 and return false when checking if it is time to update

fix issue #834 setting statusIntervalUpdate causes high CPU load \
check for statusInterval of 0 and return false when checking if it is time to update
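For context, the fix described above amounts to an early return in the check that decides whether a status update is due in V3PGReplicationStream. A minimal sketch of such a guard, assuming a method named isTimeUpdate and a lastStatusUpdate timestamp field (only the updateInterval field is confirmed by the code quoted further down):

    // Sketch only: treat an updateInterval (statusInterval) of 0 as
    // "no time-based status updates".
    private boolean isTimeUpdate() {
      if (updateInterval == 0) {
        // statusInterval of 0 disables automatic updates, so it is never
        // "time to update".
        return false;
      }
      long sinceLastUpdate = System.currentTimeMillis() - lastStatusUpdate;
      return sinceLastUpdate >= updateInterval;
    }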

codecov-io commented Jun 1, 2017

Codecov Report

Merging #835 into master will decrease coverage by <.01%.
The diff coverage is 0%.

@@             Coverage Diff              @@
##             master     #835      +/-   ##
============================================
- Coverage     65.56%   65.55%   -0.01%     
  Complexity     3545     3545              
============================================
  Files           166      166              
  Lines         15250    15252       +2     
  Branches       2475     2475              
============================================
  Hits           9998     9998              
- Misses         4074     4075       +1     
- Partials       1178     1179       +1

DuncanSands commented Jun 1, 2017

While this change seems OK and is needed, it isn't enough. I tested it and I see this:
A keepalive packet is received, so we hit this case:
https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/core/v3/replication/V3PGReplicationStream.java#L134

    case 'k': //KeepAlive message
      updateStatusRequired = processKeepAliveMessage(buffer);
      updateStatusRequired |= updateInterval == 0;
      break;

Because updateInterval is zero, updateStatusRequired is set to true. This triggers a call to timeUpdateStatus, but that call causes the other end to send another keepalive packet. So we end up in a tight loop, endlessly receiving keepalive packets and sending status updates that trigger additional keepalives.

If I disable setting updateStatusRequired then only one keep alive packet is received (so far at least) and the tight loop is avoided.
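To illustrate, the workaround described in the previous paragraph, dropping the unconditional update for updateInterval == 0 and replying only when processKeepAliveMessage reports that the server asked for one, would look roughly like this (a sketch, not the merged change):

    case 'k': // KeepAlive message
      // Reply only when the keepalive actually requests one; do not force a
      // status update just because updateInterval == 0, otherwise each reply
      // provokes another keepalive and the tight loop described above.
      updateStatusRequired = processKeepAliveMessage(buffer);
      break;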


davecramer (Member) commented Jun 1, 2017

@Gordiychuk

What was your intention with the

updateStatusRequired |= updateInterval == 0;

line ?


Gordiychuk (Contributor) commented Jun 2, 2017

The initial idea behind this line was to avoid being disconnected from the server because of network lag: when the server sends us a keepalive with the reply-required flag set, our reply can reach it too late. I ran into such disconnects when the server was located in another datacenter. My plan was to use zero as a marker meaning "reply to every keepalive, regardless of whether the reply-required flag is set in the message" (but without the reply-required flag handling fixed, the current version is not correct).

Maybe the initial idea was not correct, because the network-lag problem should really be solved via server configuration. On the other hand, zero as statusInterval lets you start using logical replication without worrying about network lag (only the reply-required flag handling needs fixing) and without even knowing which interval is configured on the server side.


DuncanSands commented Jun 5, 2017


Gordiychuk (Contributor) commented Jun 6, 2017

> how does this happen exactly in practice, only when the server is very busy?
> I've also seen timeouts with postgres 9.6, on a fast local network, when the
> server is busy, so I'm wondering if it's maybe got nothing to do with network
> lag, and instead is due to the server failing to promptly read from the client
> when the server is busy sending data.

I ran into this problem only during testing, when the database was not under load. In production I set statusInterval to half of the server setting wal_sender_timeout.

> I'm guessing that this was the cause of the keepalive storm I was seeing. Is
> this what you mean by "current version not correct"?

Oh, sorry, that was my mistake. If we don't set the reply-required flag, it's not a bug.

> If the network lag is large enough then won't it still fail even if you send a
> reply to every keepalive you receive? I.e. is this really a reliable approach,
> or does it just happen to work because your network lag, while large, isn't too
> large?

Yes, when the network lag is big enough it will not work. Zero as statusInterval is a good default that lets you start using logical replication, but for production I recommend setting statusInterval manually to a value smaller than the server setting wal_sender_timeout.
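To make that recommendation concrete, statusInterval can be set explicitly when the stream is opened via pgjdbc's replication API. A rough example (URL, credentials, and slot name are placeholders, the slot is assumed to already exist, and 30 seconds is chosen only to stay under a typical wal_sender_timeout of 60 seconds):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;
    import java.util.concurrent.TimeUnit;
    import org.postgresql.PGConnection;
    import org.postgresql.PGProperty;
    import org.postgresql.replication.PGReplicationStream;

    public class StatusIntervalExample {
      public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        PGProperty.USER.set(props, "postgres");
        PGProperty.PASSWORD.set(props, "secret");
        // A replication connection is required for the replication API.
        PGProperty.REPLICATION.set(props, "database");
        PGProperty.ASSUME_MIN_SERVER_VERSION.set(props, "9.4");
        PGProperty.PREFER_QUERY_MODE.set(props, "simple");

        try (Connection con =
                 DriverManager.getConnection("jdbc:postgresql://localhost/postgres", props)) {
          PGConnection pgCon = con.unwrap(PGConnection.class);

          // Explicit status interval instead of relying on the 0 default:
          // well below the server's wal_sender_timeout.
          PGReplicationStream stream = pgCon.getReplicationAPI()
              .replicationStream()
              .logical()
              .withSlotName("demo_slot")
              .withStatusInterval(30, TimeUnit.SECONDS)
              .start();

          // ... read and acknowledge changes from the stream, then:
          stream.close();
        }
      }
    }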

@davecramer merged commit 59236b7 into pgjdbc:master Jun 8, 2017

1 of 2 checks passed

codecov/project 65.55% (-0.01%) compared to 2d0bfce
continuous-integration/travis-ci/pr The Travis CI build passed

@davecramer deleted the davecramer:fixupdatestatusinterval branch Jun 8, 2017

@vlsi added this to the 42.1.2 milestone Jun 8, 2017

ahachete added a commit to ahachete/pgjdbc that referenced this pull request Jun 11, 2017

fix issue pgjdbc#834 setting statusIntervalUpdate causes high CPU load \ (pgjdbc#835)

* fix issue pgjdbc#834 setting statusIntervalUpdate causes high CPU load \
check for statusInterval of 0 and return false when checking if it is time to update

davecramer added a commit to davecramer/pgjdbc that referenced this pull request Sep 19, 2017

fix issue pgjdbc#834 setting statusIntervalUpdate causes high CPU load \ (pgjdbc#835)

* fix issue pgjdbc#834 setting statusIntervalUpdate causes high CPU load \
check for statusInterval of 0 and return false when checking if it is time to update