Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fails to detect broken connections in some cases #263

Closed
NightTsarina opened this issue Sep 22, 2014 · 13 comments
Closed

Fails to detect broken connections in some cases #263

NightTsarina opened this issue Sep 22, 2014 · 13 comments

Comments

@NightTsarina
Copy link

Hi,

I have been trying to chase down a bug I am experiencing: sometimes the connection to the database would go away, and neither psycopg2 nor sqlalchemy detect the condition, and thus, the connection is not removed from the connection pool.

It seems to be the same issue as #196 , which is marked as solved, but I can reproduce this behaviour with psycopg 2.5.3, and problems other than the server restarting. In my case it seems that connections are timing out (the error is SSL SYSCALL error: EOF detected), but closing manually the filedescriptor triggers the same issue):

$ python
Python 2.7.3 (default, Mar 13 2014, 11:03:55) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import psycopg2
>>> conn = psycopg2.connect('postgresql://XXX:YYY@localhost/testsrunner')
>>> cur = conn.cursor()
>>> cur.execute("SELECT 1")
>>> cur.fetchall()
[(1,)]
>>> conn.closed
0
>>> conn.poll()
0
>>> conn.closed
0
>>> 
[1]+  Stopped                 python
$ ps
  PID TTY          TIME CMD
14434 pts/3    00:00:00 bash
18299 pts/3    00:00:00 python
18336 pts/3    00:00:00 ps
$ lsof -p 18299|grep TCP
python  18299 tincho    3u  IPv4 13795332      0t0     TCP localhost:46295->localhost:postgresql (ESTABLISHED)
$ fg
python


>>> import os
>>> os.close(3)
>>> cur = conn.cursor()
>>> cur.execute("SELECT 1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
psycopg2.OperationalError: SSL SYSCALL error: Bad file descriptor

>>> conn.closed
0
>>> conn.poll()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
psycopg2.OperationalError: SSL SYSCALL error: Bad file descriptor

>>> conn.closed
0
>>> psycopg2.__version__
'2.5.3 (dt dec mx pq3 ext)'

It is worth noting that this does not seem to be an SSL-related issue, as I can reproduce the same problem with a UNIX domain socket:

>>> conn = psycopg2.connect('postgresql://')
[..]
$ lsof -p 18427|grep socket
python  18427 postgres    3u  unix 0xffff880000ea43c0      0t0 13797662 socket
$ fg
python


>>> import os
>>> os.close(3)
>>> cur = conn.cursor()
>>> cur.execute("SELECT 1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
psycopg2.OperationalError: could not receive data from server: Bad file descriptor

>>> conn.closed
0
>>> conn.poll()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
psycopg2.OperationalError: could not receive data from server: Bad file descriptor

>>> conn.closed
0
@dvarrazzo
Copy link
Member

I can reproduce the issue, e.g.:

In [2]: cnn = psycopg2.connect('')

In [3]: os.close(cnn.fileno())

In [4]: cnn.poll()
[21750] conn_poll: status = 1
[21750] conn_poll: async_status = ASYNC_DONE
[21750] pq_is_busy: consuming input
[21750] pq_is_busy: PQconsumeInput() failed
[21750] conn_poll: poll reading
---------------------------------------------------------------------------
OperationalError                          Traceback (most recent call last)
/home/piro/dev/psycopg2/<ipython-input-4-2d99764f7d0a> in <module>()
----> 1 cnn.poll()

OperationalError: could not receive data from server: Bad file descriptor


In [5]: cnn.closed, cnn.status
Out[5]: (0, 1)

However the libpq doesn't consider the connection broken either:

In [6]: cnn.get_transaction_status()
Out[6]: 0

When we find an error, e.g. in poll() or in execute() we ask an opinion to the libpq about the status of the connection. It seems like it says it's still fine, so we don't close it. I'll have to wrap up a minimal C program reproducing the issue and ask postgres developers what to do: if it's ok to be proactive and close the connection on error regardless of PQstatus.

@dvarrazzo
Copy link
Member

A C test confirms that the libpq informative functions return the connection in good state after the fd is closed. Let's see what can we do about that...

@NightTsarina
Copy link
Author

On 22/09/14 06:32, Daniele Varrazzo wrote:

However the libpq doesn't consider the connection broken either:

|In [6]: cnn.get_transaction_status()
Out[6]: 0

Hah, this bug keeps getting deeper :-)

When we find an error, e.g. in |poll()| or in |execute()| we ask an
opinion to the libpq
https://github.com/psycopg/psycopg2/blob/2_5_4/psycopg/pqpath.c#L806
about the status of the connection. It seems like it says it's still
fine, so we don't close it. I'll have to wrap up a minimal C program
reproducing the issue and ask postgres developers what to do: if it's ok
to be proactive and close the connection on error regardless of |PQstatus|.

Thanks for looking into this!

Martín Ferrari (Tincho)

@dvarrazzo
Copy link
Member

After Tom Lane's message I had a quick run of testing with kill -9 of the backend. It seems that in that case we are mostly doing the right thing (only once triggered a case where the connection was dead but reported open, but I can't reproduce it anymore, and on poll sometimes we raise an exception with an empty message: I'll look better into these).

However, I don't think psycopg should do much different than what is doing: when we receive an error we ask the libpq if the connection is still ok: if it says no we close it (usually with closed = 2). We cannot be more aggressive than that or we risk closing connections that are still ok.

Can I please ask you to repeat your tests? Please check the closed, status and get_transaction_status() on the connections you find in error. If closed = 0 and get_transaction_status = 4 it's a psycopg bug because in some code path we fell out of sync with the libpq connection state, and I'm interested in knowing the steps to reproduce it. But if you find a different transaction status (values are in psycopg2.extensions) and the connection is still open then it is the libpq that could be not handling correctly some error message value on the socket, in which case you will have to raise a bug to the Postgres guys.

Also note that the connection is not reported broken if not after an attempt to read: in psycopg this happens both on cursor.execute() and connection.poll() which are two different code paths and may be affected by different bugs.

Thank you.

@NightTsarina
Copy link
Author

Hi Daniele,

On 23/09/14 11:19, Daniele Varrazzo wrote:

After Tom Lane's message
http://www.postgresql.org/message-id/12708.1411394055@sss.pgh.pa.us I
had a quick run of testing with kill -9 of the backend. It seems that in
that case we are mostly doing the right thing (only once triggered a
case where the connection was dead but reported open, but I can't
reproduce it anymore, and on poll sometimes we raise an exception with
an empty message: I'll look better into these).

I expected that, I had also made other tests where psycopg did the right
thing. This condition is pretty tricky to reproduce..

However, I don't think psycopg should do much different than what is
doing: when we receive an error we ask the libpq if the connection is
still ok: if it says no we close it (usually with closed = 2). We cannot
be more aggressive than that or we risk closing connections that are
still ok.

Can I please ask you to repeat your tests? Please check the |closed|,
|status| and |get_transaction_status()| on the connections you find in
error. If closed = 0 and get_transaction_status = 4 it's a psycopg bug
because in some code path we fell out of sync with the libpq connection
state, and I'm interested in knowing the steps to reproduce it. But if
you find a different transaction status (values are in
|psycopg2.extensions|) and the connection is still open then it is the
libpq that could be not handling correctly some error message value on
the socket, in which case you will have to raise a bug to the Postgres guys.

I have added extra logging to my code to see what is the status of these
three values when the exact error happens again.

I have created a minimal test case, which breaks the connection in two
different ways: by closing the fd directly, and by closing the
connection through psycopg, but in a separate process (so the fd is
open, but the connection is unusable). None of these are the exact
problem I am getting, but are still valid error conditions.

I have read Tom Lane's message, and I understand his point about this
not being the normal way a connection dies, but it is still a write and
a read returning -1, so if the error is ignored I think it is still a
bug (albeit maybe a different one).

Also note that the connection is not reported broken if not after an
attempt to read: in psycopg this happens both on |cursor.execute()| and
|connection.poll()| which are two different code paths and may be
affected by different bugs.

In the test case I tried execute, poll right after failed execute, and
poll just after the connection is broken; with the two ways to break the
connection:

$ python testpg.py
Testing with closing the connection on another process.

exception in execute: 'SSL connection has been closed unexpectedly\n'
closed: 0, status: 2, get_transaction_status: 4
exception in poll after execute: 'connection not open\n'
closed: 2, status: 2, get_transaction_status: 4
exception in poll: 'SSL connection has been closed unexpectedly\n'
closed: 2, status: 1, get_transaction_status: 4

Testing with closing the underlying file descriptor.

exception in execute: 'SSL SYSCALL error: Bad file descriptor\n'
closed: 0, status: 2, get_transaction_status: 2
exception in poll after execute: 'SSL SYSCALL error: Bad file descriptor\n'
closed: 0, status: 2, get_transaction_status: 2
exception in poll: 'SSL SYSCALL error: Bad file descriptor\n'
closed: 0, status: 1, get_transaction_status: 0

As you can see, the behaviour is pretty different. If I close the
connection in a fork, psycopg seems to be doing the right thing,
although the closed attribute lags a bit (and therefore sqlalchemy will
fail a second time before realising the connection is dead, unless they
add a poll call after every exception).

In the second case, although I don't know the semantics of the values,
it seems to me that libpg is not marking the connection as dead at all.
I suspect my bug in production must be in the same code path (at least
it has the same 'SSL SYSCALL error' prefix), and as soon as I have data
from the changes I have just pushed, I will report back.

The code for the tests is this:

$ cat testpg.py
#!/usr/bin/env python
# vim:ts=2:sw=2:et:ai:sts=2


import os

import psycopg2

def try_report(msg, connection, call, *args):
  try:
    call(*args)
  except psycopg2.OperationalError as e:
    print 'exception in %s: %r' % (msg, str(e))
    print 'closed: %d, status: %d, get_transaction_status: %s' % (
        connection.closed, connection.status,
        connection.get_transaction_status())

def test(uri, fail_func, desc):
  print 'Testing with %s.\n' % desc
  conn = psycopg2.connect(uri)
  cur = conn.cursor()
  cur.execute('SELECT 1')

  fail_func(conn)
  cur = conn.cursor()
  try_report('execute', conn, cur.execute, 'SELECT 1')
  try_report('poll after execute', conn, conn.poll)

  conn = psycopg2.connect(uri)
  fail_func(conn)
  try_report('poll', conn, conn.poll)

  print

def fail_fork(conn):
  pid = os.fork()
  if pid:
    os.wait()
  else:
    conn.close()
    os._exit(0)

def fail_closefd(conn):
  os.close(conn.fileno())

if __name__ == '__main__':
  uri = os.environ['DATABASE_URI']
  test(uri, fail_fork, 'closing the connection on another process')
  test(uri, fail_closefd, 'closing the underlying file descriptor')

Thanks!!

Martín Ferrari (Tincho)

@dvarrazzo
Copy link
Member

Thank you for the extensive testing: I'll try to reproduce and fix the case of

exception in execute: 'SSL connection has been closed unexpectedly\n'
closed: 0, status: 2, get_transaction_status: 4

for the other ones I think you'll have to talk with PG developers: they could be handling some SSL error value inappropriately.

@zzzeek
Copy link

zzzeek commented Sep 24, 2014

However, I don't think psycopg should do much different than what is doing: when we receive an error we ask the libpq if the connection is still ok: if it says no we close it (usually with closed = 2). We cannot be more aggressive than that or we risk closing connections that are still ok.

Well, the fact that the error you got was "SSL SYSCALL error: Bad file descriptor" would tell you that the connection is not OK. But I assume psycopg2 doesn't want to get into the business of heuristics based on exceptions, which is fine! we already do it, and based on this thread it seems like we really can't rely on connection.closed very much for our purposes at all, so I'll keep adding more strings as users report them.

@NightTsarina
Copy link
Author

On 24/09/14 14:32, mike bayer wrote:

However, I don't think psycopg should do much different than what is
doing: when we receive an error we ask the libpq if the connection
is still ok: if it says no we close it (usually with closed = 2). We
cannot be more aggressive than that or we risk closing connections
that are still ok.

Well, the fact that the error you got was "SSL SYSCALL error: Bad file
descriptor" would tell you that the connection is not OK. But I assume
psycopg2 doesn't want to get into the business of heuristics based on
exceptions, which is fine! we already do it
https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/dialects/postgresql/psycopg2.py#L512,
and based on this thread it seems like we really can't rely on
connection.closed very much for our purposes at all, so I'll keep adding
more strings as users report them.

It seems that you would need to do more than check closed, at least
calling poll() first, or checking get_transaction_status. From what I
saw, .closed is never a good indicator.

Martín Ferrari (Tincho)

@dvarrazzo
Copy link
Member

get_transaction_status() goes to 4 always after a network operation that we have initiated: if closed is not a good indicator we might be able to fix it. We have to identify the code path where this happens. If you can give me a hand, e.g. running your test that produces the inconsistent state you have found with debug enabled we may get a better picture. Otherwise I'll eventually take a look at it but it may take me a few days.

About parsing the error messages: you are right, we never take any decision based on them. They are free to change and subject to L10N too. But the libpq has the exact errno of the problem so they may be failing to handle properly some of them and improving that would be beneficial.

@NightTsarina
Copy link
Author

I would be happy to help. But I am not sure I understand what do you need me to do... The error we get in the production app is not reproducible yet (I still don't know the cause), so I can only add some debugging when I detect it.. Is there anything else I can do there?

@rkonda
Copy link

rkonda commented Dec 13, 2015

what is the resolution for this issue?

@dvarrazzo
Copy link
Member

That psycopg won't do anything actively such a poll to check the detection is broken. If your application finds a connection in broken/inconsistent state it is your responsibility to dispose of it and open a new one. If you use the connection pool it should do that for you (but on putconn, not on getconn) and if it doesn't it is our bug to fix.

I have recently improved some internal handling quirknesses so some of the issues here could have been solved, but more along the line of returning a better error message, not automatic reconnection or anything too fancy.

@dvarrazzo
Copy link
Member

Fixed with #443

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants