Fails to detect broken connections in some cases #263

NightTsarina · 2014-09-22T02:09:36Z

Hi,

I have been trying to chase down a bug I am experiencing: sometimes the connection to the database would go away, and neither psycopg2 nor sqlalchemy detect the condition, and thus, the connection is not removed from the connection pool.

It seems to be the same issue as #196 , which is marked as solved, but I can reproduce this behaviour with psycopg 2.5.3, and problems other than the server restarting. In my case it seems that connections are timing out (the error is SSL SYSCALL error: EOF detected), but closing manually the filedescriptor triggers the same issue):

$ python
Python 2.7.3 (default, Mar 13 2014, 11:03:55) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import psycopg2
>>> conn = psycopg2.connect('postgresql://XXX:YYY@localhost/testsrunner')
>>> cur = conn.cursor()
>>> cur.execute("SELECT 1")
>>> cur.fetchall()
[(1,)]
>>> conn.closed
0
>>> conn.poll()
0
>>> conn.closed
0
>>> 
[1]+  Stopped                 python
$ ps
  PID TTY          TIME CMD
14434 pts/3    00:00:00 bash
18299 pts/3    00:00:00 python
18336 pts/3    00:00:00 ps
$ lsof -p 18299|grep TCP
python  18299 tincho    3u  IPv4 13795332      0t0     TCP localhost:46295->localhost:postgresql (ESTABLISHED)
$ fg
python


>>> import os
>>> os.close(3)
>>> cur = conn.cursor()
>>> cur.execute("SELECT 1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
psycopg2.OperationalError: SSL SYSCALL error: Bad file descriptor

>>> conn.closed
0
>>> conn.poll()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
psycopg2.OperationalError: SSL SYSCALL error: Bad file descriptor

>>> conn.closed
0
>>> psycopg2.__version__
'2.5.3 (dt dec mx pq3 ext)'

It is worth noting that this does not seem to be an SSL-related issue, as I can reproduce the same problem with a UNIX domain socket:

>>> conn = psycopg2.connect('postgresql://')
[..]
$ lsof -p 18427|grep socket
python  18427 postgres    3u  unix 0xffff880000ea43c0      0t0 13797662 socket
$ fg
python


>>> import os
>>> os.close(3)
>>> cur = conn.cursor()
>>> cur.execute("SELECT 1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
psycopg2.OperationalError: could not receive data from server: Bad file descriptor

>>> conn.closed
0
>>> conn.poll()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
psycopg2.OperationalError: could not receive data from server: Bad file descriptor

>>> conn.closed
0

The text was updated successfully, but these errors were encountered:

dvarrazzo · 2014-09-22T05:32:54Z

I can reproduce the issue, e.g.:

In [2]: cnn = psycopg2.connect('')

In [3]: os.close(cnn.fileno())

In [4]: cnn.poll()
[21750] conn_poll: status = 1
[21750] conn_poll: async_status = ASYNC_DONE
[21750] pq_is_busy: consuming input
[21750] pq_is_busy: PQconsumeInput() failed
[21750] conn_poll: poll reading
---------------------------------------------------------------------------
OperationalError                          Traceback (most recent call last)
/home/piro/dev/psycopg2/<ipython-input-4-2d99764f7d0a> in <module>()
----> 1 cnn.poll()

OperationalError: could not receive data from server: Bad file descriptor


In [5]: cnn.closed, cnn.status
Out[5]: (0, 1)

However the libpq doesn't consider the connection broken either:

In [6]: cnn.get_transaction_status()
Out[6]: 0

When we find an error, e.g. in poll() or in execute() we ask an opinion to the libpq about the status of the connection. It seems like it says it's still fine, so we don't close it. I'll have to wrap up a minimal C program reproducing the issue and ask postgres developers what to do: if it's ok to be proactive and close the connection on error regardless of PQstatus.

dvarrazzo · 2014-09-22T06:14:22Z

A C test confirms that the libpq informative functions return the connection in good state after the fd is closed. Let's see what can we do about that...

NightTsarina · 2014-09-23T03:10:12Z

On 22/09/14 06:32, Daniele Varrazzo wrote:

However the libpq doesn't consider the connection broken either:

|In [6]: cnn.get_transaction_status()
Out[6]: 0

Hah, this bug keeps getting deeper :-)

When we find an error, e.g. in |poll()| or in |execute()| we ask an
opinion to the libpq
https://github.com/psycopg/psycopg2/blob/2_5_4/psycopg/pqpath.c#L806
about the status of the connection. It seems like it says it's still
fine, so we don't close it. I'll have to wrap up a minimal C program
reproducing the issue and ask postgres developers what to do: if it's ok
to be proactive and close the connection on error regardless of |PQstatus|.

Thanks for looking into this!

Martín Ferrari (Tincho)

dvarrazzo · 2014-09-23T10:19:05Z

After Tom Lane's message I had a quick run of testing with kill -9 of the backend. It seems that in that case we are mostly doing the right thing (only once triggered a case where the connection was dead but reported open, but I can't reproduce it anymore, and on poll sometimes we raise an exception with an empty message: I'll look better into these).

However, I don't think psycopg should do much different than what is doing: when we receive an error we ask the libpq if the connection is still ok: if it says no we close it (usually with closed = 2). We cannot be more aggressive than that or we risk closing connections that are still ok.

Can I please ask you to repeat your tests? Please check the closed, status and get_transaction_status() on the connections you find in error. If closed = 0 and get_transaction_status = 4 it's a psycopg bug because in some code path we fell out of sync with the libpq connection state, and I'm interested in knowing the steps to reproduce it. But if you find a different transaction status (values are in psycopg2.extensions) and the connection is still open then it is the libpq that could be not handling correctly some error message value on the socket, in which case you will have to raise a bug to the Postgres guys.

Also note that the connection is not reported broken if not after an attempt to read: in psycopg this happens both on cursor.execute() and connection.poll() which are two different code paths and may be affected by different bugs.

Thank you.

NightTsarina · 2014-09-23T12:09:34Z

Hi Daniele,

On 23/09/14 11:19, Daniele Varrazzo wrote:

After Tom Lane's message
http://www.postgresql.org/message-id/12708.1411394055@sss.pgh.pa.us I
had a quick run of testing with kill -9 of the backend. It seems that in
that case we are mostly doing the right thing (only once triggered a
case where the connection was dead but reported open, but I can't
reproduce it anymore, and on poll sometimes we raise an exception with
an empty message: I'll look better into these).

I expected that, I had also made other tests where psycopg did the right
thing. This condition is pretty tricky to reproduce..

However, I don't think psycopg should do much different than what is
doing: when we receive an error we ask the libpq if the connection is
still ok: if it says no we close it (usually with closed = 2). We cannot
be more aggressive than that or we risk closing connections that are
still ok.

Can I please ask you to repeat your tests? Please check the |closed|,
|status| and |get_transaction_status()| on the connections you find in
error. If closed = 0 and get_transaction_status = 4 it's a psycopg bug
because in some code path we fell out of sync with the libpq connection
state, and I'm interested in knowing the steps to reproduce it. But if
you find a different transaction status (values are in
|psycopg2.extensions|) and the connection is still open then it is the
libpq that could be not handling correctly some error message value on
the socket, in which case you will have to raise a bug to the Postgres guys.

I have added extra logging to my code to see what is the status of these
three values when the exact error happens again.

I have created a minimal test case, which breaks the connection in two
different ways: by closing the fd directly, and by closing the
connection through psycopg, but in a separate process (so the fd is
open, but the connection is unusable). None of these are the exact
problem I am getting, but are still valid error conditions.

I have read Tom Lane's message, and I understand his point about this
not being the normal way a connection dies, but it is still a write and
a read returning -1, so if the error is ignored I think it is still a
bug (albeit maybe a different one).

Also note that the connection is not reported broken if not after an
attempt to read: in psycopg this happens both on |cursor.execute()| and
|connection.poll()| which are two different code paths and may be
affected by different bugs.

In the test case I tried execute, poll right after failed execute, and
poll just after the connection is broken; with the two ways to break the
connection:

$ python testpg.py
Testing with closing the connection on another process.

exception in execute: 'SSL connection has been closed unexpectedly\n'
closed: 0, status: 2, get_transaction_status: 4
exception in poll after execute: 'connection not open\n'
closed: 2, status: 2, get_transaction_status: 4
exception in poll: 'SSL connection has been closed unexpectedly\n'
closed: 2, status: 1, get_transaction_status: 4

Testing with closing the underlying file descriptor.

exception in execute: 'SSL SYSCALL error: Bad file descriptor\n'
closed: 0, status: 2, get_transaction_status: 2
exception in poll after execute: 'SSL SYSCALL error: Bad file descriptor\n'
closed: 0, status: 2, get_transaction_status: 2
exception in poll: 'SSL SYSCALL error: Bad file descriptor\n'
closed: 0, status: 1, get_transaction_status: 0

As you can see, the behaviour is pretty different. If I close the
connection in a fork, psycopg seems to be doing the right thing,
although the closed attribute lags a bit (and therefore sqlalchemy will
fail a second time before realising the connection is dead, unless they
add a poll call after every exception).

In the second case, although I don't know the semantics of the values,
it seems to me that libpg is not marking the connection as dead at all.
I suspect my bug in production must be in the same code path (at least
it has the same 'SSL SYSCALL error' prefix), and as soon as I have data
from the changes I have just pushed, I will report back.

The code for the tests is this:

$ cat testpg.py
#!/usr/bin/env python
# vim:ts=2:sw=2:et:ai:sts=2


import os

import psycopg2

def try_report(msg, connection, call, *args):
  try:
    call(*args)
  except psycopg2.OperationalError as e:
    print 'exception in %s: %r' % (msg, str(e))
    print 'closed: %d, status: %d, get_transaction_status: %s' % (
        connection.closed, connection.status,
        connection.get_transaction_status())

def test(uri, fail_func, desc):
  print 'Testing with %s.\n' % desc
  conn = psycopg2.connect(uri)
  cur = conn.cursor()
  cur.execute('SELECT 1')

  fail_func(conn)
  cur = conn.cursor()
  try_report('execute', conn, cur.execute, 'SELECT 1')
  try_report('poll after execute', conn, conn.poll)

  conn = psycopg2.connect(uri)
  fail_func(conn)
  try_report('poll', conn, conn.poll)

  print

def fail_fork(conn):
  pid = os.fork()
  if pid:
    os.wait()
  else:
    conn.close()
    os._exit(0)

def fail_closefd(conn):
  os.close(conn.fileno())

if __name__ == '__main__':
  uri = os.environ['DATABASE_URI']
  test(uri, fail_fork, 'closing the connection on another process')
  test(uri, fail_closefd, 'closing the underlying file descriptor')

Thanks!!

Martín Ferrari (Tincho)

dvarrazzo · 2014-09-23T12:27:01Z

Thank you for the extensive testing: I'll try to reproduce and fix the case of

exception in execute: 'SSL connection has been closed unexpectedly\n'
closed: 0, status: 2, get_transaction_status: 4

for the other ones I think you'll have to talk with PG developers: they could be handling some SSL error value inappropriately.

zzzeek · 2014-09-24T13:32:52Z

However, I don't think psycopg should do much different than what is doing: when we receive an error we ask the libpq if the connection is still ok: if it says no we close it (usually with closed = 2). We cannot be more aggressive than that or we risk closing connections that are still ok.

Well, the fact that the error you got was "SSL SYSCALL error: Bad file descriptor" would tell you that the connection is not OK. But I assume psycopg2 doesn't want to get into the business of heuristics based on exceptions, which is fine! we already do it, and based on this thread it seems like we really can't rely on connection.closed very much for our purposes at all, so I'll keep adding more strings as users report them.

NightTsarina · 2014-09-24T15:10:02Z

On 24/09/14 14:32, mike bayer wrote:

However, I don't think psycopg should do much different than what is
doing: when we receive an error we ask the libpq if the connection
is still ok: if it says no we close it (usually with closed = 2). We
cannot be more aggressive than that or we risk closing connections
that are still ok.
Well, the fact that the error you got was "SSL SYSCALL error: Bad file
descriptor" would tell you that the connection is not OK. But I assume
psycopg2 doesn't want to get into the business of heuristics based on
exceptions, which is fine! we already do it
https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/dialects/postgresql/psycopg2.py#L512,
and based on this thread it seems like we really can't rely on
connection.closed very much for our purposes at all, so I'll keep adding
more strings as users report them.

It seems that you would need to do more than check closed, at least
calling poll() first, or checking get_transaction_status. From what I
saw, .closed is never a good indicator.

Martín Ferrari (Tincho)

dvarrazzo · 2014-09-24T15:20:12Z

get_transaction_status() goes to 4 always after a network operation that we have initiated: if closed is not a good indicator we might be able to fix it. We have to identify the code path where this happens. If you can give me a hand, e.g. running your test that produces the inconsistent state you have found with debug enabled we may get a better picture. Otherwise I'll eventually take a look at it but it may take me a few days.

About parsing the error messages: you are right, we never take any decision based on them. They are free to change and subject to L10N too. But the libpq has the exact errno of the problem so they may be failing to handle properly some of them and improving that would be beneficial.

NightTsarina · 2014-09-26T11:21:02Z

I would be happy to help. But I am not sure I understand what do you need me to do... The error we get in the production app is not reproducible yet (I still don't know the cause), so I can only add some debugging when I detect it.. Is there anything else I can do there?

rkonda · 2015-12-13T09:48:01Z

what is the resolution for this issue?

dvarrazzo · 2015-12-14T12:28:03Z

That psycopg won't do anything actively such a poll to check the detection is broken. If your application finds a connection in broken/inconsistent state it is your responsibility to dispose of it and open a new one. If you use the connection pool it should do that for you (but on putconn, not on getconn) and if it doesn't it is our bug to fix.

I have recently improved some internal handling quirknesses so some of the issues here could have been solved, but more along the line of returning a better error message, not automatic reconnection or anything too fancy.

dvarrazzo · 2016-07-05T06:49:05Z

Fixed with #443

We decided to add pre_ping=True . Without it we were running into bug: psycopg/psycopg2#263.

amn mentioned this issue Apr 27, 2015

Connection object closed still zero after the [TCP/IP] connection is terminated #311

Closed

keturn mentioned this issue Aug 19, 2015

database unavailable after worker timed out, but worker still accepts tasks celery/celery#2772

Closed

stas mentioned this issue Aug 21, 2015

Read-only slave connections are not closed coleifer/peewee#689

Closed

svnv mentioned this issue Oct 22, 2015

Unable to recover connection to remote host after SSL SYSCALL error FSX/momoko#126

Closed

deem0n mentioned this issue Nov 11, 2015

Database.PostgreSQL.Simple.Internal.exec: resource vanished (failed to fetch file descriptor) lpsmith/postgresql-simple#168

Open

dvarrazzo mentioned this issue Jun 3, 2016

Possible wrong status after problems in connection. #435

Closed

gward mentioned this issue Jun 20, 2016

Inconsistent value of connection closed when using Unix-domain socket #443

Closed

leplatrem mentioned this issue Oct 17, 2016

[investigation] Connection reset error Kinto/kinto#867

Closed

m-novikov mentioned this issue Dec 20, 2016

Momoko Pool trying to use dead connections FSX/momoko#149

Closed

dvarrazzo closed this as completed Dec 27, 2016

kramarz added a commit to kramarz/sqlalchemy-postgres-autocommit that referenced this issue Aug 23, 2017

Add kwargs to create engine.

fa75686

We decided to add pre_ping=True . Without it we were running into bug: psycopg/psycopg2#263.

kramarz mentioned this issue Aug 23, 2017

Add kwargs to create engine. socialwifi/sqlalchemy-postgres-autocommit#2

Merged

ajmaddaford mentioned this issue Oct 6, 2017

500 errors on EQ pre-prod ONSdigital/eq-survey-runner#1288

Closed

shaib mentioned this issue Nov 8, 2017

Redundant connections opened and held with 2.7.3 #612

Closed

sqlalchemy-bot mentioned this issue Nov 27, 2018

SSL EOF not detected as disconnect in psycopg2 sqlalchemy/sqlalchemy#3021

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fails to detect broken connections in some cases #263

Fails to detect broken connections in some cases #263

NightTsarina commented Sep 22, 2014

dvarrazzo commented Sep 22, 2014

dvarrazzo commented Sep 22, 2014

NightTsarina commented Sep 23, 2014

dvarrazzo commented Sep 23, 2014

NightTsarina commented Sep 23, 2014

dvarrazzo commented Sep 23, 2014

zzzeek commented Sep 24, 2014

NightTsarina commented Sep 24, 2014

dvarrazzo commented Sep 24, 2014

NightTsarina commented Sep 26, 2014

rkonda commented Dec 13, 2015

dvarrazzo commented Dec 14, 2015

dvarrazzo commented Jul 5, 2016

Fails to detect broken connections in some cases #263

Fails to detect broken connections in some cases #263

Comments

NightTsarina commented Sep 22, 2014

dvarrazzo commented Sep 22, 2014

dvarrazzo commented Sep 22, 2014

NightTsarina commented Sep 23, 2014

dvarrazzo commented Sep 23, 2014

NightTsarina commented Sep 23, 2014

dvarrazzo commented Sep 23, 2014

zzzeek commented Sep 24, 2014

NightTsarina commented Sep 24, 2014

dvarrazzo commented Sep 24, 2014

NightTsarina commented Sep 26, 2014

rkonda commented Dec 13, 2015

dvarrazzo commented Dec 14, 2015

dvarrazzo commented Jul 5, 2016