"FATAL: invalid frontend message type 137" #117

Closed
ocharles opened this Issue Jul 16, 2014 · 11 comments


ocharles (Contributor) commented Jul 16, 2014

We're seeing this almost daily in our server logs, but unfortunately I have no idea how to reliably reproduce it. It seems to happen more frequently under higher load. #postgresql suggests it could be a concurrency problem, and our application does indeed have quite a lot of concurrency - I don't believe we are sharing connections between threads, but I'm going to verify that claim now.
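For what it's worth, a minimal sketch of how one could rule out concurrent use while verifying this, by serializing all access to a shared connection (the GuardedConn wrapper and withConn helper here are hypothetical, not part of postgresql-simple):

import Control.Concurrent.MVar (MVar, newMVar, withMVar)
import Database.PostgreSQL.Simple (Connection)

-- Hypothetical wrapper: the MVar guarantees that only one thread at a
-- time can touch the underlying connection, so concurrent writes can
-- never interleave on the same socket.
newtype GuardedConn = GuardedConn (MVar Connection)

mkGuardedConn :: Connection -> IO GuardedConn
mkGuardedConn conn = GuardedConn <$> newMVar conn

withConn :: GuardedConn -> (Connection -> IO a) -> IO a
withConn (GuardedConn m) = withMVar m

-- Usage: withConn gconn $ \c -> query_ c "select 1"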

lpsmith (Owner) commented Jul 16, 2014

Well, it should be safe to share postgresql-simple sockets between threads without locking... but there could well be bugs. Anyway, keep me informed when you find out more info.

lpsmith (Owner) commented Jul 16, 2014

Another idea: you could run a packet capture on a postgres connection until the issue occurs, then go back and examine the capture afterwards. Wireshark has a Postgres protocol dissector, which would be quite useful for analyzing the capture.
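For example (assuming the server listens on the default port 5432 and tcpdump is available), something like

    tcpdump -i any -w pg-capture.pcap port 5432

should capture all postgres traffic to a file that Wireshark can open.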

lpsmith (Owner) commented Jul 16, 2014

Hmm, what version of postgresql-simple are you running? This didn't happen to start with postgresql-simple-0.4.3.0, did it? Because looking at your #postgresql conversation, the resource vanished exception message from Internal.exec should only be possible with 0.4.3.0. So it's not necessarily listen/notify causing the problem.

ocharles (Contributor) commented Jul 16, 2014

I'm not sure when it started, but I've certainly seen the invalid frontend message for months - so it's probably not a 0.4.3.0 thing. The error just happened again, and I can confirm the server is compiled against 0.4.3.0. I've only reproduced it on the server so far, but I've seen it locally too - let me get the database imported locally and I'll see if I can Wireshark it.

ocharles (Contributor) commented Jul 17, 2014

I have managed to get a minimal Wireshark capture. Here's the packet dump for just the connection that fails:

https://ocharles.org.uk/tmp/invalid-frontend-message.pcapng

Packet 3944 is the response "Invalid frontend message type 137"

You should be able to import this into Wireshark. I also have a much larger dump of the entire application, if you want to see what the other connections were doing.

I'll do some reading on this dump now and see if I can spot anything suspect, but I wanted to get this shared as soon as possible (and also to ensure I don't lose it!)

ocharles (Contributor) commented Jul 17, 2014

It appears that "we" are somehow sending the following bytes:

89 04 70 69 6e 67                ..ping

I have no idea where this is coming from, but 0x89 is 137 in decimal, and that packet occurs immediately before PostgreSQL gets confused and closes the connection - so it suggests this is the problem.

lpsmith (Owner) commented Jul 17, 2014

Well, I don't know of anything in postgresql-simple that would violate the libpq abstraction by sending that sort of data. So either it's a "bug" in libpq, or perhaps you have some C code somewhere that's writing junk to the wrong file descriptor? (Just spitballing here.)

I suppose it's possible to send junk on the file descriptor from your application without dropping down to C, via the Simple.Internal module. Are you using this module? And in particular, are you ever calling LibPQ.socket yourself?
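For reference, a sketch of what that would look like - this assumes postgresql-simple's Database.PostgreSQL.Simple.Internal.withConnection and postgresql-libpq's LibPQ.socket; any code that writes to the resulting Fd directly bypasses libpq entirely:

import qualified Database.PostgreSQL.LibPQ as LibPQ
import Database.PostgreSQL.Simple (Connection)
import Database.PostgreSQL.Simple.Internal (withConnection)
import System.Posix.Types (Fd)

-- Sketch: extracting the raw socket Fd from a postgresql-simple
-- Connection. Writing to this Fd directly (or holding on to it after
-- the connection closes) can inject garbage into the protocol stream.
rawSocket :: Connection -> IO (Maybe Fd)
rawSocket conn = withConnection conn LibPQ.socket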

Actually, connecting via SSL should be a decent way of diagnosing whether or not this string is being sent from libpq, or by another bit of code that should not be touching that file descriptor in that way.

If that string originates from something other than libpq, turning on SSL should result in an SSL error, which would be good evidence that the data is not being sent by pqsecure_write, an internal libpq function. That would strongly suggest something is accidentally using the Fd returned by PQ.socket, or treating some random integer as a file descriptor in some C code somewhere.

On the other hand, if you still get the invalid frontend message after turning on SSL, then this is strong evidence that that string is passing through pqsecure_write for unknown reasons.
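A minimal way to run that experiment, assuming a libpq-style connection string (the host and dbname here are placeholders):

{-# LANGUAGE OverloadedStrings #-}
import Database.PostgreSQL.Simple (connectPostgreSQL)

-- sslmode=require makes libpq insist on an SSL connection, so
-- non-SSL bytes injected onto the socket by another writer should
-- surface as an SSL error rather than a server-side protocol error.
main :: IO ()
main = do
  _conn <- connectPostgreSQL "host=localhost dbname=mydb sslmode=require"
  return ()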

(I do bring this up because, for a variety of reasons, I've thought about implementing the frontend/backend protocol in Haskell off and on since before the first release of postgresql-simple... but I really don't want to put the effort into supporting all the connection and authentication options. So I've wondered about using libpq to connect and authenticate and then basically take over the connection myself. Thanks in part to @rjmac's contributions to #25, I know how to do this now, and have written some (very preliminary and mostly unusable) code to that effect already.)

ocharles (Contributor) commented Jul 17, 2014

It seems that it's probably websockets-snap causing this problem. 89 04 ping corresponds exactly to a websockets ping, and we are using websockets. So I'm going to close this, as it doesn't appear to be a problem in your code.

My guess is that the "ping" thread in websockets-snap (which fires every 15 seconds) fires after the websocket connection has closed, but a new socket has been opened in the meantime on a different thread and received the same fd. I'm not entirely sure why that write would still succeed, but for now I've disabled that bit of code and will watch the logs for a year.

Thanks for teaching me about the pgsql protocol dissector in Wireshark, though - I don't think I could have solved this without it!
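For reference, those six bytes decode exactly as a WebSocket ping frame under RFC 6455:

89          -- FIN=1, opcode 0x9 (ping)
04          -- mask bit 0, payload length 4
70 69 6e 67 -- payload: "ping" in ASCII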

dzhus (Contributor) commented Sep 22, 2014

Do you think the same websockets-snap pinger issue could cause "no connection to the server" errors as well?

Sep 22 10:34:47 carma1 carma[40706]: PostgresCRUD/insert {"tid":"ThreadId 8427462","msg":"hushed exception","exn":"SqlError {sqlState = \"\", sqlExecStatus = FatalError, sqlErrorMsg = \"no connection to the server\\n\", sqlErrorDetail = \"\", sqlErrorHint = \"\"}"}
Sep 22 10:47:03 carma1 carma[40706]: PostgresCRUD/insertUpdate {"tid":"ThreadId 8477284","msg":"hushed exception","exn":"SqlError {sqlState = \"\", sqlExecStatus = FatalError, sqlErrorMsg = \"no connection to the server\\n\", sqlErrorDetail = \"\", sqlErrorHint = \"\"}"}
Sep 22 11:29:02 carma1 carma[40706]: PostgresCRUD/insertUpdate {"tid":"ThreadId 8661478","msg":"hushed exception","exn":"SqlError {sqlState = \"\", sqlExecStatus = FatalError, sqlErrorMsg = \"no connection to the server\\n\", sqlErrorDetail = \"\", sqlErrorHint = \"\"}"}
Sep 22 12:02:32 carma1 carma[40706]: PostgresCRUD/insertUpdate {"tid":"ThreadId 8775273","msg":"hushed exception","exn":"SqlError {sqlState = \"\", sqlExecStatus = FatalError, sqlErrorMsg = \"no connection to the server\\n\", sqlErrorDetail = \"\", sqlErrorHint = \"\"}"}
Sep 22 12:21:11 carma1 carma[40706]: PostgresCRUD/insertUpdate {"tid":"ThreadId 8827198","msg":"hushed exception","exn":"SqlError {sqlState = \"\", sqlExecStatus = FatalError, sqlErrorMsg = \"no connection to the server\\n\", sqlErrorDetail = \"\", sqlErrorHint = \"\"}"}
Sep 22 12:54:05 carma1 carma[40706]: PostgresCRUD/insertUpdate {"tid":"ThreadId 8949791","msg":"hushed exception","exn":"SqlError {sqlState = \"\", sqlExecStatus = FatalError, sqlErrorMsg = \"no connection to the server\\n\", sqlErrorDetail = \"\", sqlErrorHint = \"\"}"}
Sep 22 14:17:32 carma1 carma[40706]: PostgresCRUD/insertUpdate {"tid":"ThreadId 9244298","msg":"hushed exception","exn":"SqlError {sqlState = \"\", sqlExecStatus = FatalError, sqlErrorMsg = \"no connection to the server\\n\", sqlErrorDetail = \"\", sqlErrorHint = \"\"}"}
Sep 22 14:48:28 carma1 carma[40706]: PostgresCRUD/insertUpdate {"tid":"ThreadId 9350457","msg":"hushed exception","exn":"SqlError {sqlState = \"\", sqlExecStatus = FatalError, sqlErrorMsg = \"no connection to the server\\n\", sqlErrorDetail = \"\", sqlErrorHint = \"\"}"}

I'm seeing lots of these in the logs of our application, which uses postgresql-simple and snaplet-postgresql-simple for PG connection pools; websockets-snap is also used. It looks almost as if the resource pool is handing out connections in a non-OK status (http://doxygen.postgresql.org/fe-exec_8c.html#af9a6a3980b0142764b437eabc6da1a3c).
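If that's the suspicion, here is a quick check one could wire into the pool - a sketch assuming postgresql-simple's Database.PostgreSQL.Simple.Internal.withConnection and postgresql-libpq's LibPQ.status:

import qualified Database.PostgreSQL.LibPQ as LibPQ
import Database.PostgreSQL.Simple (Connection)
import Database.PostgreSQL.Simple.Internal (withConnection)

-- Sketch: test whether a pooled connection is still usable before
-- handing it to application code; anything other than ConnectionOk
-- means libpq considers the connection bad.
isConnectionOk :: Connection -> IO Bool
isConnectionOk conn = withConnection conn $ \c -> do
  st <- LibPQ.status c
  return (st == LibPQ.ConnectionOk)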

lpsmith (Owner) commented Sep 22, 2014

@dzhus, are you aware of this issue? mightybyte/snaplet-postgresql-simple#6

If you suspect something like what ocharles observed, I would try to take a packet capture of database connections until you catch this error in action.

dzhus (Contributor) commented Sep 24, 2014

@lpsmith Thanks for pointing this out, it seems to be at least tangential to the issue we're having.
