poll() blocking with queries that raise notices #856

Closed
dpage opened this Issue Mar 5, 2019 · 8 comments

@dpage

commented Mar 5, 2019

Given the function:

CREATE OR REPLACE FUNCTION public.fails_after(s integer)
    RETURNS text
    LANGUAGE 'plpgsql'
AS $BODY$
begin
    raise notice 'Entered';
    perform pg_sleep(s);
    raise '!!!';
end;
$BODY$;

And the following script:

import psycopg2, select
  
def wait(conn):
    while 1:
        print("Calling poll()")
        state = conn.poll()
        if state == psycopg2.extensions.POLL_OK:
            break
        elif state == psycopg2.extensions.POLL_WRITE:
            print("Got a POLL_WRITE response")
            select.select([], [conn.fileno()], [], 2)
        elif state == psycopg2.extensions.POLL_READ:
            print("Got a POLL_READ response")
            select.select([conn.fileno()], [], [], 2)
        else:
            print("Got a POLL_WTF response")
            raise psycopg2.OperationalError("poll() returned %s" % state)


aconn = psycopg2.connect(database='postgres', async_=1)
wait(aconn)
acurs = aconn.cursor()

# The first query will block for 30 seconds on the conn.poll() call in wait().
# The second query will time out in the select() calls, but conn.poll() doesn't block.

acurs.execute("SELECT 1; SELECT fails_after(30);")
# acurs.execute("SELECT 1; SELECT pg_sleep(30);")

wait(acurs.connection)

As noted in the comment, the first query blocks at the poll() call, whilst the second does not.

In pgAdmin (where this behaviour was found), the result of running the first query in the Query Tool is that the entire application hangs (for all users in multi-user mode), whilst running the second query allows the application to continue to respond as expected.

Queries such as the first cause Ajax requests in the browser to build up in a pending state (because the pgAdmin UI polls the backend Python server for query results, dashboard graph updates, etc.), and after a minute or so (if the query runs for more than 30 seconds, as shown above) the browser starts timing out and everything falls apart!

@dvarrazzo

Member

commented Mar 5, 2019

Huh, I was just about to release 2.8... Should I investigate this first?

@dvarrazzo

Member

commented Mar 5, 2019

Before I go through this, shouldn't you use two connections to send two commands? Cancel that, I misread the issue.

@dvarrazzo

Member

commented Mar 5, 2019

I think there's some problem with the notice handling; running in debug mode:

In [30]: cur.execute("select 1; select fails_after(10)")
[25087] psyco_curs_execute: starting execution of new query
[25087] pq_execute: pg connection at 0x5565bab32360 OK
[25087] pq_begin_locked: pgconn = 0x5565bab32360, autocommit = 1, status = 1
[25087] pq_begin_locked: transaction in progress
[25087] pq_execute: executing ASYNC query: pgconn = 0x5565bab32360
[25087]     select 1; select fails_after(10)
[25087] pq_execute: async query sent to backend
[25087] pq_execute: query got flushed immediately
[25087] psyco_curs_execute: res = 0, pgres = (nil)

In [31]: wait(cur.connection)
Calling poll()
[25087] conn_poll: status = 1
[25087] conn_poll: async_status = ASYNC_READ
[25087] pq_is_busy: consuming input
[25087] conn_poll: poll reading
[25087] conn_poll: async_status -> ASYNC_DONE
[25087] pq_get_last_result: got result PGRES_TUPLES_OK
[25087] conn_notice_callback: NOTICE:  Entered

# blocked here

[25087] pq_get_last_result: got result PGRES_FATAL_ERROR
[25087] pq_fetch: pgstatus = PGRES_FATAL_ERROR
[25087] pq_fetch: uh-oh, something FAILED: status = 7 pgconn = 0x7f92c4131188
[25087] pq_raise: PQresultErrorMessage: err=ERROR:  !!!
CONTEXT:  PL/pgSQL function fails_after(integer) line 5 at RAISE

[25087] pq_raise: err2=!!!
CONTEXT:  PL/pgSQL function fails_after(integer) line 5 at RAISE

[25087] pq_fetch: fetching done; check for critical errors
---------------------------------------------------------------------------
RaiseException                            Traceback (most recent call last)
<ipython-input-31-4b9d9d3cf541> in <module>()
----> 1 wait(cur.connection)

<ipython-input-16-c96fe8409369> in wait(conn)
      2     while 1:
      3         print("Calling poll()")
----> 4         state = conn.poll()
      5         if state == psycopg2.extensions.POLL_OK:
      6             break

RaiseException: !!!
CONTEXT:  PL/pgSQL function fails_after(integer) line 5 at RAISE

If the raise notice 'Entered'; is commented out of the function, it behaves as expected:

In [32]: cur.execute("select 1; select fails_after(10)")
[25087] psyco_curs_execute: starting execution of new query
[25087] pq_execute: pg connection at 0x5565bab32360 OK
[25087] pq_begin_locked: pgconn = 0x5565bab32360, autocommit = 1, status = 1
[25087] pq_begin_locked: transaction in progress
[25087] pq_execute: executing ASYNC query: pgconn = 0x5565bab32360
[25087]     select 1; select fails_after(10)
[25087] pq_execute: async query sent to backend
[25087] pq_execute: query got flushed immediately
[25087] psyco_curs_execute: res = 0, pgres = (nil)

In [33]: wait(cur.connection)
Calling poll()
[25087] conn_poll: status = 1
[25087] conn_poll: async_status = ASYNC_READ
[25087] pq_is_busy: consuming input
[25087] conn_notice_process: NOTICE:  Entered

[25087] conn_poll: poll reading
Got a POLL_READ response
Calling poll()
[25087] conn_poll: status = 1
[25087] conn_poll: async_status = ASYNC_READ
[25087] pq_is_busy: consuming input
[25087] conn_poll: poll reading
Got a POLL_READ response
Calling poll()
[25087] conn_poll: status = 1
[25087] conn_poll: async_status = ASYNC_READ
[25087] pq_is_busy: consuming input
[25087] conn_poll: poll reading
Got a POLL_READ response
Calling poll()
[25087] conn_poll: status = 1
[25087] conn_poll: async_status = ASYNC_READ
[25087] pq_is_busy: consuming input
[25087] conn_poll: poll reading
Got a POLL_READ response
Calling poll()
[25087] conn_poll: status = 1
[25087] conn_poll: async_status = ASYNC_READ
[25087] pq_is_busy: consuming input
[25087] conn_poll: poll reading
Got a POLL_READ response
Calling poll()
[25087] conn_poll: status = 1
[25087] conn_poll: async_status = ASYNC_READ
[25087] pq_is_busy: consuming input
[25087] conn_poll: poll reading
[25087] conn_poll: async_status -> ASYNC_DONE
[25087] pq_get_last_result: got result PGRES_TUPLES_OK
[25087] pq_get_last_result: got result PGRES_FATAL_ERROR
[25087] pq_fetch: pgstatus = PGRES_FATAL_ERROR
[25087] pq_fetch: uh-oh, something FAILED: status = 7 pgconn = 0x7f92c4131188
[25087] pq_raise: PQresultErrorMessage: err=ERROR:  !!!
CONTEXT:  PL/pgSQL function fails_after(integer) line 5 at RAISE

[25087] pq_raise: err2=!!!
CONTEXT:  PL/pgSQL function fails_after(integer) line 5 at RAISE

[25087] pq_fetch: fetching done; check for critical errors
---------------------------------------------------------------------------
RaiseException                            Traceback (most recent call last)
<ipython-input-33-4b9d9d3cf541> in <module>()
----> 1 wait(cur.connection)

<ipython-input-16-c96fe8409369> in wait(conn)
      2     while 1:
      3         print("Calling poll()")
----> 4         state = conn.poll()
      5         if state == psycopg2.extensions.POLL_OK:
      6             break

RaiseException: !!!
CONTEXT:  PL/pgSQL function fails_after(integer) line 5 at RAISE
@dvarrazzo

Member

commented Mar 6, 2019

ISTM that in the working scenario the results of the two queries are received in the same packet:

[25087] conn_poll: async_status -> ASYNC_DONE
[25087] pq_get_last_result: got result PGRES_TUPLES_OK
[25087] pq_get_last_result: got result PGRES_FATAL_ERROR
[25087] pq_fetch: pgstatus = PGRES_FATAL_ERROR

In the failing scenario, the notice probably flushes the socket:

[25087] pq_get_last_result: got result PGRES_TUPLES_OK
[25087] conn_notice_callback: NOTICE:  Entered

# blocked here

[25087] pq_get_last_result: got result PGRES_FATAL_ERROR
[25087] pq_fetch: pgstatus = PGRES_FATAL_ERROR

I think we get bitten by the same wrong assumption described in #801, and that pq_get_last_result() should be reworked as per #802.
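The wrong assumption can be illustrated with a small pure-Python mock (a simulation of the control flow only, not psycopg2's actual C internals; MockConn and both drain functions are invented names for illustration): when the second result hasn't arrived yet, a drain loop that goes straight to the blocking PQgetResult() hangs mid-sequence, while a loop that checks PQisBusy() first can hand control back to the caller's select() loop.

```python
class MockConn:
    """Toy stand-in for a libpq connection where only the first of two
    results has arrived (as when the NOTICE split the response in two)."""
    def __init__(self, arrived):
        self.arrived = list(arrived)  # results already in the input buffer
        self.blocked = False          # marks where real code would hang

    def is_busy(self):
        # like PQisBusy(): True if fetching would have to wait on the socket
        return not self.arrived

    def get_result(self):
        # like PQgetResult(): blocks on a socket read if nothing is buffered
        if not self.arrived:
            self.blocked = True       # real libpq would block right here
            return None
        return self.arrived.pop(0)

def drain_blocking(conn):
    """pq_get_last_result-style: call get_result() until it returns None."""
    last, res = None, conn.get_result()
    while res is not None:
        last, res = res, conn.get_result()
    return last

def drain_busy_aware(conn):
    """Reworked loop: stop as soon as the next fetch would block."""
    fetched = []
    while not conn.is_busy():
        res = conn.get_result()
        if res is None:
            return fetched, False     # whole sequence consumed
        fetched.append(res)
    return fetched, True              # would block: go back to POLL_READ

conn = MockConn(["PGRES_TUPLES_OK"])  # PGRES_FATAL_ERROR still in flight
print(drain_busy_aware(conn))         # (['PGRES_TUPLES_OK'], True), no hang

conn = MockConn(["PGRES_TUPLES_OK"])
drain_blocking(conn)
print(conn.blocked)                   # True: this is where poll() hung
```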

dvarrazzo added a commit that referenced this issue Mar 6, 2019

Try avoiding pq_get_last_result reading query result [skip ci]
As per bug #856, we may end up blocking if we execute more than one
query.

Work in progress: the current implementation times out on every poll
(testing with `tests.test_async.AsyncTests.test_async_iter`).
@dpage

Author

commented Mar 6, 2019

Thanks for investigating @dvarrazzo. Are you planning to work on the #802 changes in the foreseeable future?

@dvarrazzo

Member

commented Mar 6, 2019

Yeah, I started looking into it. 2.8 is the right moment to have this fixed, as it's a deep enough change that it may cause some side effects: I'll have to reorganise the libpq PQconsumeInput/PQisBusy/PQgetResult dance around psycopg's async state keeping.

I was really hoping to release 2.8 soon: I'll try to do the work in the next evenings.
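The reorganised dance can be pictured with a hedged Python sketch (FakeConn and conn_poll_step are invented for illustration; the real code is C): results are fetched only while PQisBusy() says they are already buffered, and the poll step returns POLL_READ whenever the rest of the statement sequence is still in flight, so the caller can select() instead of blocking.

```python
POLL_READ, POLL_OK = "POLL_READ", "POLL_OK"

class FakeConn:
    """Toy connection delivering results in batches, one per roundtrip
    (mimicking PGRES_TUPLES_OK arriving with the NOTICE, the error later)."""
    def __init__(self, batches):
        self.batches = list(batches)
        self.buffer = []              # libpq's input buffer
        self.results = []             # results handed to the application

    def consume_input(self):          # like PQconsumeInput()
        if self.batches:
            self.buffer.extend(self.batches.pop(0))

    def is_busy(self):                # like PQisBusy()
        return not self.buffer and bool(self.batches)

    def get_result(self):             # like PQgetResult(), safe when not busy
        return self.buffer.pop(0) if self.buffer else None

def conn_poll_step(conn):
    conn.consume_input()              # read whatever bytes are on the wire
    while not conn.is_busy():         # only fetch when it cannot block
        res = conn.get_result()
        if res is None:
            return POLL_OK            # whole statement sequence done
        conn.results.append(res)
    return POLL_READ                  # more results in flight: select() again

conn = FakeConn([["PGRES_TUPLES_OK"], ["PGRES_FATAL_ERROR"]])
states = []
while True:
    state = conn_poll_step(conn)      # in real code, select() between steps
    states.append(state)
    if state == POLL_OK:
        break
print(states)                         # ['POLL_READ', 'POLL_OK']
print(conn.results)
```

With this shape the failing scenario simply produces an extra POLL_READ instead of a hang.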

@dpage

Author

commented Mar 6, 2019

Great, thanks.

@dvarrazzo

Member

commented Mar 6, 2019

The above changeset was mostly a trial to figure out the right calling order. Maybe a less invasive change would be to make pq_get_last_result() able to return a "would block" result, in which case we go back into "wait for reading" mode.
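The less invasive option could look roughly like this (a hedged Python sketch with invented names, StubConn and WOULD_BLOCK among them; the real function is C): get_last_result() keeps its "return the last result of the sequence" contract, but hands back a sentinel instead of blocking, and the poll code maps that sentinel to POLL_READ.

```python
WOULD_BLOCK = object()                # sentinel: "come back after select()"

class StubConn:
    """Minimal stand-in: first result buffered, second still in flight."""
    def __init__(self, arrived, in_flight):
        self.arrived = list(arrived)
        self.in_flight = list(in_flight)

    def is_busy(self):                # like PQisBusy()
        return not self.arrived and bool(self.in_flight)

    def get_result(self):             # like PQgetResult(), called when safe
        return self.arrived.pop(0) if self.arrived else None

    def deliver(self):                # the socket became readable: data arrived
        self.arrived.extend(self.in_flight)
        self.in_flight.clear()

def get_last_result(conn):
    """Return the last result of the sequence, or WOULD_BLOCK instead of
    hanging when the next result hasn't arrived yet.  Earlier results are
    discarded, matching the 'last result wins' semantics described above."""
    last = None
    while not conn.is_busy():
        res = conn.get_result()
        if res is None:
            return last               # sequence finished
        last = res
    return WOULD_BLOCK                # poll code should return POLL_READ

conn = StubConn(["PGRES_TUPLES_OK"], ["PGRES_FATAL_ERROR"])
assert get_last_result(conn) is WOULD_BLOCK   # no hang, wait on the socket
conn.deliver()                                # select() reported readable
print(get_last_result(conn))                  # PGRES_FATAL_ERROR
```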

dvarrazzo added a commit that referenced this issue Mar 8, 2019

pq_is_busy() replaced by pq_get_result_async() [skip ci]
The new function keeps together PQconsumeInput() with PQisBusy(), in
order to handle the condition in which not all the results of a sequence
of statements arrive in the same roundtrip.

Added test to verify issue #856 is fixed.

Work in progress: not all the tests pass (especially green ones).

dvarrazzo added a commit that referenced this issue Mar 8, 2019

dropped pq_get_last_result() [skip ci]
The last usage was with execute on green cursor, affected by #856 as
well. Added regression test to prove it.

Close #802.

Work in progress: tests don't pass yet.

dvarrazzo closed this in 734845b Mar 17, 2019
