Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected packet type during stream replication #1466

Closed
1 of 2 tasks
omgrr opened this issue Apr 10, 2019 · 67 comments
Closed
1 of 2 tasks

Unexpected packet type during stream replication #1466

omgrr opened this issue Apr 10, 2019 · 67 comments

Comments

@omgrr
Copy link

@omgrr omgrr commented Apr 10, 2019

I'm submitting a ...

  • bug report
  • feature request

Describe the issue
When using PGReplicationStream to get data off of a replication slot, we occasionally encounter an Unexpected packet type while processing copy data. The control character that we are actually receiving is a lowercase ‘w’ which, when looking through other parts of the code, appears to be the control character that we’d receive when PostgreSQL is telling pgjdbc to expect xlog data.

99% of the time, this is fine since the replication is resilient and just kinda picks up where it left off after the exception is raised. We have, however, encountered times when, no matter what we did, reading from that stream always failed at some point during a transaction and the replication lag got rather out of hand. At first, we thought this had to do with the size of the transactions that were failing but it does not seem to make a difference how big these transactions are.

Driver Version?
RE42.2.5

Java Version?
openjdk version "1.8.0_212"

OS Version?
debian 9.8

PostgreSQL Version?
We've seen this on PostgreSQL 9.6.6 and PostgresSQL 11 (Latest)

To Reproduce
We prepared a toy app and docker compose file that should, at some point, cause the error.

https://github.com/omgrr/streamProblems

We usually see the error within 15 minutes, but occasionally it has taken an hour. There is an example log file in the repo to see what this error typically looks like.

Expected behaviour
This seems to be happening when pgjdbc requests a status update. We would expect to see the correct response from postgres, but instead the next byte on the buffer indicates more xlog data.

Logs

The full compressed logs can be found on in the in example repo, in the example_packet_error.log.gz log file.

pgjdbc logs around the exception -

Apr 09, 2019 7:18:18 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults                                           
FINEST:  <=BE CopyData                                                                                                        
Apr 09, 2019 7:18:18 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData                              
FINEST:   <=BE XLogData(currWal: 0/1DE4D18, lastServerWal: 0/1DE4D18, clock: 0)                                               
Apr 09, 2019 7:18:18 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults                                           
FINEST:  <=BE CopyData                                                                                                        
Apr 09, 2019 7:18:18 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData                              
FINEST:   <=BE XLogData(currWal: 0/1DE4D18, lastServerWal: 0/1DE4D18, clock: 0)                                               
Apr 09, 2019 7:18:18 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults                                           
FINEST:  <=BE CopyData                                                                                                        
Apr 09, 2019 7:18:18 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData                              
FINEST:   <=BE XLogData(currWal: 0/1DE4D18, lastServerWal: 0/1DE4D18, clock: 0)                                               
Apr 09, 2019 7:18Exception in thread "main" org.postgresql.util.PSQLException: Database connection failed when writing to copy
 at org.postgresql.core.v3.QueryExecutorImpl.writeToCopy(QueryExecutorImpl.java:1000)                                         
 at org.postgresql.core.v3.CopyDualImpl.writeToCopy(CopyDualImpl.java:19)                                                     
 at org.postgresql.core.v3.replication.V3PGReplicationStream.updateStatusInternal(V3PGReplicationStream.java:189)             
 at org.postgresql.core.v3.replication.V3PGReplicationStream.timeUpdateStatus(V3PGReplicationStream.java:181)                 
 at org.postgresql.core.v3.replication.V3PGReplicationStream.readInternal(V3PGReplicationStream.java:121)                     
 at org.postgresql.core.v3.replication.V3PGReplicationStream.readPending(V3PGReplicationStream.java:78)                       
 at streamProblem.main(streamProblem.java:43)                                                                                 
Caused by: java.io.IOException: Unexpected packet type during copy: 119                                                       
 at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1236)                                  
 at org.postgresql.core.v3.QueryExecutorImpl.writeToCopy(QueryExecutorImpl.java:998)                                          
 ... 6 more                                                                                                                   
:18 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults                                                            
FINEST:  <=BE CopyData                                                                                                        
Apr 09, 2019 7:18:18 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData                              
FINEST:   <=BE XLogData(currWal: 0/1DE4D18, lastServerWal: 0/1DE4D18, clock: 0)                                               
Apr 09, 2019 7:18:18 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults                                           
FINEST:  <=BE CopyData                                                                                                        
Apr 09, 2019 7:18:18 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData                              
FINEST:   <=BE XLogData(currWal: 0/1DE4D18, lastServerWal: 0/1DE4D18, clock: 0)                                               
Apr 09, 2019 7:18:18 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults                                           
FINEST:  <=BE CopyData                                                                                                        
Apr 09, 2019 7:18:18 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData                              
FINEST:   <=BE XLogData(currWal: 0/1DE6968, lastServerWal: 0/1DE6968, clock: 0)                                               
Apr 09, 2019 7:18:18 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults                                           
FINEST:  <=BE CopyData                                                                                                        
@davecramer
Copy link
Member

@davecramer davecramer commented Apr 10, 2019

I just tried to build this and got
Source option 5 is no longer supported. Use 6 or later.
[ERROR] Target option 1.5 is no longer supported. Use 1.6 or later.

@omgrr
Copy link
Author

@omgrr omgrr commented Apr 10, 2019

@davecramer Thanks for taking a look! If you would like to try again, I updated the example so that the docker container builds the jar.

You should just need to docker-compose build && docker-compose up now.

@davecramer
Copy link
Member

@davecramer davecramer commented Apr 11, 2019

@omgrr can you try #1467

I quickly tried it on your test and was unable to get it to fail.

@omgrr
Copy link
Author

@omgrr omgrr commented Apr 11, 2019

Unfortunately I got the error after about 13 minutes. I updated the streamProblems repo to include the new 42.2.6 jar. Looks like the same issue -

Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 3/61DF9FD8, lastServerWal: 3/61DF9FD8, clock: 0)
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 3/61DFBC28, lastServerWal: 3/61DFBC28, clock: 0)
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 3/61DFBC28, lastServerWal: 3/61DFBC28, clock: 0)
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.QueryExecutorImplException in thread "main" org.postgresql.util.PSQLException: Database connection failed when writing to copy
        at org.postgresql.core.v3.QueryExecutorImpl.writeToCopy(QueryExecutorImpl.java:1028)
        at org.postgresql.core.v3.CopyDualImpl.writeToCopy(CopyDualImpl.java:19)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.updateStatusInternal(V3PGReplicationStream.java:189)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.timeUpdateStatus(V3PGReplicationStream.java:181)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.readInternal(V3PGReplicationStream.java:121)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.readPending(V3PGReplicationStream.java:78)
        at streamProblem.main(streamProblem.java:43)
Caused by: java.io.IOException: Unexpected packet type during copy: 119
        at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1264)
        at org.postgresql.core.v3.QueryExecutorImpl.writeToCopy(QueryExecutorImpl.java:1026)
        ... 6 more
 processCopyResults
FINEST:  <=BE CopyData
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 3/61DFBC28, lastServerWal: 3/61DFBC28, clock: 0)
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 3/61DFC878, lastServerWal: 3/61DFC878, clock: 0)
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 3/61DFC878, lastServerWal: 3/61DFC878, clock: 0)
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 3/61DFC878, lastServerWal: 3/61DFC878, clock: 0)
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 3/61DFC878, lastServerWal: 3/61DFC878, clock: 0)
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 3/61DFD8B8, lastServerWal: 3/61DFD8B8, clock: 0)
Apr 11, 2019 3:20:38 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData
Apr 11, 2019 3:20:39 PM org.postgresql.core.v3.replication.V3PGReplicationStream prepareUpdateStatus
FINEST:  FE=> StandbyStatusUpdate(received: 3/61DFD8B8, flushed: 3/61DFD8B8, applied: 3/61DFD8B8, clock: 4/11/19 3:20 PM)
Apr 11, 2019 3:20:39 PM org.postgresql.core.v3.QueryExecutorImpl writeToCopy
FINEST:  FE=> CopyData(34)

@davecramer
Copy link
Member

@davecramer davecramer commented Apr 11, 2019

@omgrr
Copy link
Author

@omgrr omgrr commented Apr 11, 2019

Right, and thats exactly what I did.

I'm a bit out of my wheelhouse when it comes to java and maven so I don't know if there was a better way to do this, but

Now that I had my jar

Appreciate the continued help!

@davecramer
Copy link
Member

@davecramer davecramer commented Apr 11, 2019

@davecramer
Copy link
Member

@davecramer davecramer commented Apr 12, 2019

OK, I can recreate it, interestingly the error I got was unexpected character 120 which is an x, which is not even a protocol message.

@omgrr
Copy link
Author

@omgrr omgrr commented Apr 12, 2019

Very interesting, I actually haven't seen that in testing, we've always gotten a w.

In other environments we've gotten other characters. My thinking was that those "other characters" are part of a message. Which now seems very likely, given that you've gotten an x. The INSERTS that the test app is making is just abunch of x's into a text column.

@davecramer
Copy link
Member

@davecramer davecramer commented Apr 16, 2019

OK, I've been looking at this for the last couple days. It is very repeatable.

Many times the stream pgstream appears to be reset just before the problem as the index is very close to 0

see #1470. There is at least one bug fixed but apparently that wasn't it.

What appears to happen is that we miss reading the 'd' and get an invalid character. So either the code is reading too far in the previous message or we are mis-handling the buffer..

I'm off for a few days so won't be back to this probably until after Easter

@omgrr
Copy link
Author

@omgrr omgrr commented May 2, 2019

Hey @davecramer, I don't suppose you've had a chance to revisit this?

@davecramer
Copy link
Member

@davecramer davecramer commented May 3, 2019

I've been moving. I will get back to this next week

@davecramer
Copy link
Member

@davecramer davecramer commented May 21, 2019

@omgrr Ok, I pushed another commit that has made it not possible for me to replicate the problem on my test bed. This doesn't fix the problem but it may lead to the solution.

see #1483 for a WIP can you see if you can still replicate the problem with this PR ?

@omgrr
Copy link
Author

@omgrr omgrr commented May 23, 2019

@davecramer Yup! Can still replicate it -

May 23, 2019 3:02:21 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 0/1D385B8, lastServerWal: 0/1D385B8, clock: 0)
May 23, 2019 3:02:21 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData length 1,156
May 23, 2019 3:02:21 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 0/1D385B8, lastServerWal: 0/1D385B8, clock: 0)
May 23, 2019 3:02:21 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData length 1,156
May 23, 2019 3:02:21 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 0/1D385BException in thread "main" org.postgresql.util.PSQLException: Database connection failed when writing to copy
        at org.postgresql.core.v3.QueryExecutorImpl.writeToCopy(QueryExecutorImpl.java:1028)
        at org.postgresql.core.v3.CopyDualImpl.writeToCopy(CopyDualImpl.java:19)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.updateStatusInternal(V3PGReplicationStream.java:189)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.timeUpdateStatus(V3PGReplicationStream.java:181)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.readInternal(V3PGReplicationStream.java:121)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.readPending(V3PGReplicationStream.java:78)
        at streamProblem.main(streamProblem.java:43)
Caused by: java.io.IOException: Unexpected packet type during copy: 119
        at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1265)
        at org.postgresql.core.v3.QueryExecutorImpl.writeToCopy(QueryExecutorImpl.java:1026)
        ... 6 more
8, lastServerWal: 0/1D385B8, clock: 0)
May 23, 2019 3:02:21 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData length 1,156
May 23, 2019 3:02:21 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 0/1D385B8, lastServerWal: 0/1D385B8, clock: 0)
May 23, 2019 3:02:21 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData length 1,156
May 23, 2019 3:02:21 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 0/1D3A208, lastServerWal: 0/1D3A208, clock: 0)
May 23, 2019 3:02:21 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData length 1,156
May 23, 2019 3:02:21 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 0/1D3A208, lastServerWal: 0/1D3A208, clock: 0)
May 23, 2019 3:02:21 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData length 1,156
May 23, 2019 3:02:21 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData

@davecramer
Copy link
Member

@davecramer davecramer commented May 23, 2019

dang, this is painful. I ran my test 3 times (it blows up my server) and it passed.
OK, thanks for checking

@davecramer
Copy link
Member

@davecramer davecramer commented May 24, 2019

@omgrr can you try this one ?

@omgrr
Copy link
Author

@omgrr omgrr commented May 24, 2019

😞 Nope, looks like the same error

May 24, 2019 4:30:49 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData length 1,155
May 24, 2019 4:30:49 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 3/B2382950, lastServerWal: 3/B2382950, clock: 0)
May 24, 2019 4:30:49 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData length 1,155
May 24, 2019 4:30:49 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogDatException in thread "main" org.postgresql.util.PSQLException: Database connection failed when writing to copy
        at org.postgresql.core.v3.QueryExecutorImpl.writeToCopy(QueryExecutorImpl.java:1028)
        at org.postgresql.core.v3.CopyDualImpl.writeToCopy(CopyDualImpl.java:19)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.updateStatusInternal(V3PGReplicationStream.java:189)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.timeUpdateStatus(V3PGReplicationStream.java:181)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.readInternal(V3PGReplicationStream.java:121)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.readPending(V3PGReplicationStream.java:78)
        at streamProblem.main(streamProblem.java:43)
Caused by: java.io.IOException: Unexpected packet type during copy: 119
        at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1265)
        at org.postgresql.core.v3.QueryExecutorImpl.writeToCopy(QueryExecutorImpl.java:1026)
        ... 6 more
a(currWal: 3/B2382950, lastServerWal: 3/B2382950, clock: 0)
May 24, 2019 4:30:49 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData length 1,155
May 24, 2019 4:30:49 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 3/B2382950, lastServerWal: 3/B2382950, clock: 0)
May 24, 2019 4:30:49 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData length 1,155

@omgrr
Copy link
Author

@omgrr omgrr commented May 24, 2019

I don't think I've mentioned this, but FWIW I'm running my little test repo on an m5.4xlarge AWS instance (16cpu 64gb of mem). Not sure if that helps 😕.

@davecramer
Copy link
Member

@davecramer davecramer commented May 24, 2019

@omgrr
Copy link
Author

@omgrr omgrr commented May 24, 2019

This was the last commit right? davecramer@a9e00c1

I pulled that down and added some additional logging (just to be sure I'm correctly capturing changes), and still getting the same error, and I see my new logging. So pretty confident I'm getting the latest stuff.

It sounds like it is working for you?

@davecramer
Copy link
Member

@davecramer davecramer commented May 24, 2019

@omgrr
Copy link
Author

@omgrr omgrr commented May 24, 2019

Darn, same thing. Got another error.

@davecramer
Copy link
Member

@davecramer davecramer commented May 24, 2019

@bokken
Copy link
Member

@bokken bokken commented May 27, 2019

@omgrr, I think we might have found the issue. Can you give @davecramer latest commit a try.

@evz
Copy link

@evz evz commented May 30, 2019

Hey there, @omgrr has been away so I've gone ahead and run our test app and ran into the same issue.

May 30, 2019 7:59:16 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 0/362A880, lastServerWal: 0/362A880, clock: 0)
May 30, 2019 7:59:16 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData length 1,156
May 30, 2019 7:59:16 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=Hi there!
May 30, 2019 7:59:16 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 0/362C4D0, lastServerWal: 0/362C4D0, clock: 0)
May 30, 2019 7:59:16 PM org.postgresql.core.v3.QueryExecutoException in thread "main" org.postgresql.util.PSQLException: Database connection failed when writing to copy
        at org.postgresql.core.v3.QueryExecutorImpl.writeToCopy(QueryExecutorImpl.java:1028)
        at org.postgresql.core.v3.CopyDualImpl.writeToCopy(CopyDualImpl.java:19)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.updateStatusInternal(V3PGReplicationStream.java:189)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.timeUpdateStatus(V3PGReplicationStream.java:181)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.readInternal(V3PGReplicationStream.java:121)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.readPending(V3PGReplicationStream.java:78)
        at streamProblem.main(streamProblem.java:43)
Caused by: java.io.IOException: Unexpected packet type during copy: 119
        at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1266)
        at org.postgresql.core.v3.QueryExecutorImpl.writeToCopy(QueryExecutorImpl.java:1026)
        ... 6 more
rImpl processCopyResults
FINEST:  <=BE CopyData length 1,156
May 30, 2019 7:59:16 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults

@davecramer
Copy link
Member

@davecramer davecramer commented May 31, 2019

@davecramer
Copy link
Member

@davecramer davecramer commented May 31, 2019

@bokken any thoughts on how to debug this ? I'm inclined to try to replace the custom class we have now with a standard class to see if that solves the problem or if the problem is somewhere else.

@bokken
Copy link
Member

@bokken bokken commented May 31, 2019

@davecramer
Copy link
Member

@davecramer davecramer commented May 31, 2019

@davecramer
Copy link
Member

@davecramer davecramer commented Aug 6, 2019

Question for everyone. Which output plugin are you using?

@teej
Copy link

@teej teej commented Aug 6, 2019

wal2json

@omgrr
Copy link
Author

@omgrr omgrr commented Aug 6, 2019

wal2json

Yup, same.

edit: Well, in production we use wal2json, but in our repro case we are not.
https://github.com/omgrr/streamProblems/blob/master/docker/stream-problem/docker-entrypoint.sh#L21

@davecramer
Copy link
Member

@davecramer davecramer commented Aug 6, 2019

@davecramer
Copy link
Member

@davecramer davecramer commented Aug 6, 2019

Ok guys, can everyone test b9ecbb0

@davecramer
Copy link
Member

@davecramer davecramer commented Aug 7, 2019

I have another PR which is cleaner which I believe works. #1467

@davecramer
Copy link
Member

@davecramer davecramer commented Aug 7, 2019

@omgrr @teej , @ivanobulo So yes, please test this one. I believe this solves the issue!

@ivanobulo
Copy link

@ivanobulo ivanobulo commented Aug 7, 2019

thanks @davecramer ! I'll give it a try if encounter this issue again.

@teej
Copy link

@teej teej commented Aug 14, 2019

I built this branch and deployed it to my pipeline using 9.6.x and it seems to work. I haven't had an opportunity to test on 11+.

@bokken
Copy link
Member

@bokken bokken commented Aug 14, 2019

@davecramer
Copy link
Member

@davecramer davecramer commented Aug 15, 2019

Painfully simple after staring at it long enough. Thx

@evz
Copy link

@evz evz commented Aug 15, 2019

Unfortunately, I was able to reproduce this once again using the toy app that @omgrr and I put together. I tried both PostgreSQL 9.6 and PostgreSQL 11.

Aug 15, 2019 4:01:36 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData
Aug 15, 2019 4:01:36 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 0/1CB9638, lastServerWal: 0/1CB9638, clock: 0)
Aug 15, 2019 4:01:36 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData
Aug 15, 2Exception in thread "main" org.postgresql.util.PSQLException: Database connection failed when reading from copy
        at org.postgresql.core.v3.QueryExecutorImpl.readFromCopy(QueryExecutorImpl.java:1063)
        at org.postgresql.core.v3.CopyDualImpl.readFromCopy(CopyDualImpl.java:37)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.receiveNextData(V3PGReplicationStream.java:158)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.readInternal(V3PGReplicationStream.java:123)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.readPending(V3PGReplicationStream.java:80)
        at streamProblem.main(streamProblem.java:43)
Caused by: java.io.IOException: Unexpected packet type during copy: 119
        at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1265)
        at org.postgresql.core.v3.QueryExecutorImpl.readFromCopy(QueryExecutorImpl.java:1061)
        ... 5 more
019 4:01:36 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 0/1CB9638, lastServerWal: 0/1CB9638, clock: 0)
Aug 15, 2019 4:01:36 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData
Aug 15, 2019 4:01:36 PM org.postgresql.core.v3.replication.V3PGReplicationStream prepareUpdateStatus
FINEST:  FE=> StandbyStatusUpdate(received: 0/1CB9638, flushed: 0/1CB9638, applied: 0/1CB9638, clock: 1/1/70 12:14 AM)
Aug 15, 2019 4:01:36 PM org.postgresql.core.v3.QueryExecutorImpl writeToCopy
FINEST:  FE=> CopyData(34)

@teej Do you have any code you can share? It'd be great to see if we could triangulate a bit on this based upon different uses.

@davecramer
Copy link
Member

@davecramer davecramer commented Aug 15, 2019

And this is with my branch ?

@evz
Copy link

@evz evz commented Aug 15, 2019

Yup. With the 42.2.7-SNAPSHOT jar file that it builds.

@davecramer
Copy link
Member

@davecramer davecramer commented Aug 15, 2019

@davecramer
Copy link
Member

@davecramer davecramer commented Aug 16, 2019

OK, I think I'm going to put a lock in process results to make sure it is only called once.

@davecramer
Copy link
Member

@davecramer davecramer commented Aug 26, 2019

I'm unable to replicate this now. I've added a lock in processCopyResults and this should print the date I built this in the exception to make sure we are using the correct jar. Please test atain
#1467

@evz
Copy link

@evz evz commented Aug 26, 2019

Weirdly, I ran this twice and was unable to get into that block (with the date) but did get into this block

Aug 26, 2019 7:26:50 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 0/2088110, lastServerWal: 0/2088110, clock: 0)
Aug 26, 2019 7:26:50 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData
Aug 26, 2019 7:26:50 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 0/2088110, lastServerWal: 0/2088110, clock: 0)
Aug 26, 2019 7:26:50 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData
Aug 26, 2019 7:26:50 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 0/2088110, lastServerWal: 0/2088110, clock: 0)
Aug 26, 2019 7:26:50 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData
Aug 26, 2019 7:26:50 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 0/2088110, lastServerWal: 0/2088110, clock: 0)
Aug 26, 2019 7:26:50 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData
Aug 26, 2019 7:26:50 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST: Exception in thread "main" org.postgresql.util.PSQLException: Debug should not be here
        at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1085)
        at org.postgresql.core.v3.QueryExecutorImpl.readFromCopy(QueryExecutorImpl.java:1062)
        at org.postgresql.core.v3.CopyDualImpl.readFromCopy(CopyDualImpl.java:37)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.receiveNextData(V3PGReplicationStream.java:158)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.readInternal(V3PGReplicationStream.java:123)
        at org.postgresql.core.v3.replication.V3PGReplicationStream.readPending(V3PGReplicationStream.java:80)
        at streamProblem.main(streamProblem.java:43)
  <=BE XLogData(currWal: 0/2088110, lastServerWal: 0/2088110, clock: 0)
Aug 26, 2019 7:26:50 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults
FINEST:  <=BE CopyData
Aug 26, 2019 7:26:50 PM org.postgresql.core.v3.replication.V3PGReplicationStream processXLogData
FINEST:   <=BE XLogData(currWal: 0/2088110, lastServerWal: 0/2088110, clock: 0)
Aug 26, 2019 7:26:50 PM org.postgresql.core.v3.QueryExecutorImpl processCopyResults

@davecramer
Copy link
Member

@davecramer davecramer commented Aug 26, 2019

@evz thx man, Not so weird the lock should stop it from ever getting to the date block.

@davecramer
Copy link
Member

@davecramer davecramer commented Aug 26, 2019

This is actually very interesting. The fact that the lock was violated by the calling code means that another thread (which I had no idea existed) was reading the copy results.

@davecramer
Copy link
Member

@davecramer davecramer commented Aug 26, 2019

So I am unable to replicate this on a virtual machine. It seems strange that you are able to easily using docker...

@davecramer
Copy link
Member

@davecramer davecramer commented Aug 27, 2019

@omgrr @evz can you try again, with the latest code. At this point I am just going to ignore concurrent requests. I can't replicate this locally.

@evz
Copy link

@evz evz commented Aug 27, 2019

Welp, I'm happy to report that I was able to run our test app for 6 hours today (filling about 1.9 TB of disk space with lowercase "x"s) and did not run into any exceptions. So, I think you've done it @davecramer!

@davecramer
Copy link
Member

@davecramer davecramer commented Aug 27, 2019

@evz it's patched, not fixed. But at this point I'll take it. Thanks for testing it.

@davecramer
Copy link
Member

@davecramer davecramer commented Aug 28, 2019

Fixed in #1467 Thanks everyone for finding and testing this!

@davecramer davecramer closed this Sep 3, 2019
@davecramer
Copy link
Member

@davecramer davecramer commented Nov 8, 2019

@omgrr any chance you guys can test #1603 ? I was never happy with the hack I did to fix this. SocketTimeout looks like a more reasonable solution.

@omgrr
Copy link
Author

@omgrr omgrr commented Nov 8, 2019

Well I've had it running for 2 hours and no problems thus far @davecramer!

@davecramer
Copy link
Member

@davecramer davecramer commented Nov 11, 2019

Sadly I can't run it for 2 minutes. I'm curious how it works so well for you?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants