Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Not valid calculate lastReceiveLSN for logical replication #801

Conversation

@Gordiychuk
Copy link
Contributor

@Gordiychuk Gordiychuk commented Apr 4, 2017

lastReceiveLSN calculates not valid for logical replication. Right now it calculates like
startLSN + payloadSize from XLogData it's correct statement for physical replication
where payload it's raw data from WAL, and we can easily estimate position of next message [1].
But it's does't work for logical replication, because raw data from WAL changes in output plugin
as a result size of output payload not equal to size of payload from WAL. For deal with this problem
logical replication for commit message include to startLSN transaction end lsn + 1 [2]. So we can use
startLSN as is for lastReceiveLSN.

Thanks Stefan Smith for report this problem [3].

[1] #550 (comment)
[2] https://github.com/postgres/postgres/blob/master/src/backend/replication/logical/logical.c#L643
[3] https://www.postgresql.org/message-id/CAHHbV7V4XvdHGw_jpR9Xyq3fz%3Df%2BO4oa%2B73sbizGTv_AvmDXhQ%40mail.gmail.com

@Gordiychuk Gordiychuk changed the title [WIP] Bug: Not valid reciveLSN that lead to lost parallel tx Bug: Not valid calculate lastReceiveLSN for logical replication Apr 5, 2017
@vlsi vlsi added this to the 42.1.0 milestone Apr 5, 2017
@Gordiychuk
Copy link
Contributor Author

@Gordiychuk Gordiychuk commented Apr 5, 2017

@davecramer @vlsi I think we should mark 42 version as unstable for use logical replication. This problem affect all users. And users can start lost his data after restart replication.

@codecov-io
Copy link

@codecov-io codecov-io commented Apr 5, 2017

Codecov Report

Merging #801 into master will decrease coverage by <.01%.
The diff coverage is 93.33%.

@@             Coverage Diff              @@
##             master     #801      +/-   ##
============================================
- Coverage     65.25%   65.25%   -0.01%     
  Complexity     3517     3517              
============================================
  Files           165      166       +1     
  Lines         15217    15226       +9     
  Branches       2465     2466       +1     
============================================
+ Hits           9930     9935       +5     
- Misses         4100     4101       +1     
- Partials       1187     1190       +3

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3883a84...44c808b. Read the comment docs.

@davecramer
Copy link
Member

@davecramer davecramer commented Apr 5, 2017

@Gordiychuk this is also failing due to lack of sudo in .travis.yml, can you add it and try again ?

@davecramer
Copy link
Member

@davecramer davecramer commented Apr 6, 2017

@Gordiychuk yes, agreed

@Gordiychuk
Copy link
Contributor Author

@Gordiychuk Gordiychuk commented Apr 6, 2017

Looks like false alarm from script that find core dumps.

@davecramer
Copy link
Member

@davecramer davecramer commented Apr 6, 2017

agreed, strangely only happens on certain JVM's

@Gordiychuk
Copy link
Contributor Author

@Gordiychuk Gordiychuk commented Apr 6, 2017

Fix on false alarms #806

Gordiychuk added 2 commits Apr 4, 2017
Algorithm of calculate lastReceiveLSN are different
for logical and physical replication. For physical
replication it startLsn from XLogData plus payload
size. For logical replication as lastReceiveLSN
should be use startLSN from XLogData. Add to result
payload size not available, because from WAL reads not
RAW data - as a result logical decoding message size
can change and we get lsn from the future random
future transaction.
@Gordiychuk Gordiychuk force-pushed the Gordiychuk:bug/logical_replication_restart_for_parallel_tx branch from 4b4a521 to 44c808b Apr 7, 2017
@davecramer davecramer merged commit 170d9c2 into pgjdbc:master Apr 7, 2017
2 checks passed
2 checks passed
codecov/project Absolute coverage decreased by -<.01% but relative coverage increased by +28.07% compared to 3883a84
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@davecramer
Copy link
Member

@davecramer davecramer commented Apr 7, 2017

@Gordiychuk I hope I didn't completely obsfucate your commit message ;)

@ringerc
Copy link
Member

@ringerc ringerc commented Apr 11, 2017

davecramer added a commit to davecramer/pgjdbc that referenced this pull request Sep 19, 2017
…bc#801)

* Bug: Not valid receiveLSN that lead to lost parallel transactions

Add test that reproduce issue from
https://www.postgresql.org/message-id/CAHHbV7V4XvdHGw_jpR9Xyq3fz%3Df%2BO4oa%2B73sbizGTv_AvmDXhQ%40mail.gmail.com

* bug: lastReceiveLSN not valid for logical replication

Logical and Physical replication use different algorithms
to calculate the lastReceiveLSN. 
For physical replication the calculation is:
startLsn from XLogData plus the payloadsize; this is correct as we have the raw data. 
For logical replication the lastReceiveLSN uses startLSN from XLogData without the payload size as payload size is not available as a result logical decoding message size
can change and we get LSN from the future random future transaction.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.