Python exceptions raised in a wait_callback do not always propagate #410

Closed
dhagrow opened this Issue Feb 16, 2016 · 17 comments


dhagrow commented Feb 16, 2016

I haven't been able to put together a minimal example of this; unfortunately, the problem only manifests within the scope of a project I'm working on. I'll keep trying to produce an example, but in the meantime the best I can offer is some context.

The project uses SQLAlchemy on top of psycopg2, with psycogreen providing this gevent wait_callback:

import psycopg2
from psycopg2 import extensions
from gevent.socket import wait_read, wait_write

def gevent_wait_callback(conn, timeout=None):
    """A wait callback useful to allow gevent to work with Psycopg."""
    while 1:
        state = conn.poll()
        if state == extensions.POLL_OK:
            break
        elif state == extensions.POLL_READ:
            wait_read(conn.fileno(), timeout=timeout)
        elif state == extensions.POLL_WRITE:
            wait_write(conn.fileno(), timeout=timeout)
        else:
            raise psycopg2.OperationalError(
                "Bad result from poll: %r" % state)

The problem I'm seeing is that an exception raised within this callback is sometimes not propagated to the Python code that calls psycopg2 via SQLAlchemy. Instead, I get an OperationalError('unknown error').

To be specific, I am killing greenlets, which raises a GreenletExit exception in the greenlet when its execution resumes. In some cases, GreenletExit gets raised inside the wait_callback. Usually, GreenletExit is then also raised in the code that makes the SQLAlchemy call (e.g. session.query(Model).filter().one()). The problem is that sometimes I get the OperationalError instead.
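For context on why ordinary error handlers don't intercept this: gevent's GreenletExit inherits from BaseException rather than Exception. A pure-Python sketch (using a stand-in class, not the real gevent one) shows the behaviour:

```python
# Stand-in class: gevent's GreenletExit likewise derives from BaseException,
# not Exception, so plain `except Exception` handlers do not swallow it.
class GreenletExit(BaseException):
    pass

def resumed_greenlet():
    try:
        raise GreenletExit("killed")   # injected when execution resumes
    except Exception:
        return "swallowed"             # never reached for GreenletExit

try:
    resumed_greenlet()
    result = "no exception"
except GreenletExit:
    result = "propagated"              # the path actually taken
```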

I've toyed around in the C layer of psycopg2, and I can see that PyErr_Occurred() reports GreenletExit after the callback is called in green.c:psyco_wait(). In some cases that error is somehow getting lost before the C layer returns control to Python, and execution ends up in pqpath.c:pq_complete_error(), which raises OperationalError('unknown error').

This seems to me like a bug along some specific code path, but I haven't been able to trace exactly where it goes astray and the exception gets lost.
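As a rough model of what I think is happening (a pure-Python sketch with invented names; it only mimics the C-level flow described above and is not psycopg2's actual code):

```python
# A pure-Python analogue of the lost-exception path (all names here are
# invented for illustration; the real logic lives in psycopg2's C sources).
class OperationalError(Exception):
    pass

class GreenletExit(BaseException):
    """Stand-in for gevent's GreenletExit."""

def psyco_wait(conn, callback, state):
    """Analogue of green.c:psyco_wait: run the callback, noting any error."""
    try:
        callback(conn)
    except BaseException as exc:
        state["pending"] = exc               # analogue of the pending PyErr

def pq_complete_error(state, lose_error=False):
    """Analogue of pqpath.c:pq_complete_error."""
    if not lose_error and state.get("pending") is not None:
        raise state["pending"]               # expected: re-raise GreenletExit
    raise OperationalError("unknown error")  # buggy path: the error is lost

def killed_callback(conn):
    raise GreenletExit("greenlet killed while waiting")

state = {}
psyco_wait(None, killed_callback, state)

try:
    pq_complete_error(state)                 # healthy path
    healthy = "no exception"
except GreenletExit:
    healthy = "GreenletExit"

try:
    pq_complete_error(state, lose_error=True)  # buggy path
    buggy = "no exception"
except OperationalError as exc:
    buggy = str(exc)
```

On the healthy path the pending GreenletExit reaches the caller; on the buggy path it is replaced by the generic OperationalError('unknown error'), which matches what I observe.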

dvarrazzo (Member) commented Feb 18, 2016

Hi,

Can you please run your tests against the current master or maint_2_6 branches? I've done some cleanup of error propagation to fix #281, so it may have changed something relevant to what you're investigating.

If you manage to put together a test rig that reproduces the issue, let us know.

dhagrow commented Feb 18, 2016

No, I'm afraid neither branch solves the issue.

Still no luck with a test script, but I'll post one when I can.

underyx commented Apr 3, 2017

Seems like we're still bumping into the same issue with

gevent==1.2.1
greenlet==0.4.12
psycogreen==1.0
psycopg2==2.6.2
SQLAlchemy==1.1.6

underyx commented Apr 4, 2017

We've tried bumping the versions of all these dependencies and still had the issue with

gevent==1.2.1
greenlet==0.4.12
psycogreen==1.0
psycopg2==2.7.1
SQLAlchemy==1.1.8

dvarrazzo (Member) commented Apr 4, 2017

@underyx I don't have a way to reproduce this issue, I am still waiting for one.

underyx commented Apr 4, 2017

@dvarrazzo Right, I was working on that, and I came up with a rather silly script that does reproduce the issue, though it depends a bit on random chance.

Install the following packages: pip install gevent==1.2.1 greenlet==0.4.12 psycogreen==1.0 psycopg2==2.7.1 SQLAlchemy==1.1.8

Run this code:

from gevent import monkey; monkey.patch_all()
from psycogreen.gevent import patch_psycopg; patch_psycopg()

from gevent import spawn
from gevent.pool import Pool
import psycopg2
from sqlalchemy import create_engine
import time

engine = create_engine('postgresql://localhost')


def sleepy():
    try:
        conn = engine.raw_connection()
        cur = conn.cursor()
        cur.execute('SELECT pg_sleep(0.1)')
        cur.close()
        conn.close()
    except BaseException:  # also catches GreenletExit, which is not an Exception
        import traceback; traceback.print_exc()


def killy(greenlets):
    for greenlet in greenlets:
        time.sleep(0.001)
        greenlet.kill()


def main():
    pool = Pool()
    for _ in range(100):
        greenlets = [pool.apply_async(sleepy) for _ in range(100)]
        spawn(killy, greenlets)
        pool.join()


if __name__ == '__main__':
    main()

Out of 10000 queries this results in around 80 unknown errors for me. Use python repro.py 2>&1 | grep unknown -C 10 to filter out all the other exceptions.

dvarrazzo (Member) commented Apr 4, 2017

@underyx I'll give it a run, thank you very much.

dvarrazzo (Member) commented Apr 4, 2017

@underyx I've tried running your script, then I tried running it with 10000 instead of 100 in the outer loop. Twice. So it ran >2M queries, and I didn't see a single unknown error :(

Maybe a source of the difference is the libpq version? I tried mine with 9.6.1:

$ pip freeze
SQLAlchemy==1.1.8
argparse==1.2.1
gevent==1.2.1
greenlet==0.4.12
psycogreen==1.0
psycopg2==2.7.1
wsgiref==0.1.2

$ python -c 'import psycopg2; print psycopg2.extensions.libpq_version()'
90601

$ time python test-410.py 2>&1 | grep unknown -C 10

real	23m16.487s
user	5m28.916s
sys	1m23.760s

For completeness, I ran my test on Python 2.7 on Ubuntu 16.04, 64 bits. I tested on 2.7.1 rather than on master to check whether this is a case of #539, but even on 2.7.1 I don't seem able to reproduce it.

underyx commented Apr 4, 2017

That's fascinating. The three environments where I've seen this error are:

  • my dev machine with macOS 10.12.4, python 3.5, libpq 90504
  • our production instance with Ubuntu 16.04, python 2.7, libpq 90602
  • the below Docker image with Alpine 3.4, python 3.6, libpq 90506

I've published this repo with a Dockerfile and a docker-compose.yml you can use to launch a database and run the script in a way that triggers the failure. I got 500 unknown errors with it out of 10000 queries.

PS: I'm sure you didn't make this mistake, but just in case: if time there resolves to the binary instead of the shell function, grep will never get a chance to see the script's output.

PPS: I timed my script's execution and it came out to 20 seconds for 100 loops. That would be about 33 minutes for 10000 loops, which is a lot more than it took for you. Since the timing is important here (GreenletExit has to be raised at the right moment), maybe your connection to localhost is faster than mine and you simply don't hit the window.

dvarrazzo (Member) commented Apr 4, 2017

That's exactly what I would have asked for: a Docker setup I could test. Thank you very much for providing it; I appreciate the effort and will look into this again.

I used time only the second time I ran the script, to get a ballpark measure of how long it takes, not the first time, so I don't think that was the problem. It's likely the difference comes down to timing: my setup seems similar to your production one, in which case the Docker test may fail for me too... I'll try to run it on my laptops, on my Arduino...

dvarrazzo (Member) commented Apr 4, 2017

maybe if your connection to localhost or something

OK, that got me thinking about the difference between network and Unix-socket connections... I tested more carefully, and my run of the script was simply failing to connect because of a missing password :\

Fixed pg_hba.conf, and now the script spits out a handful of unknown errors in the 10K run too.
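(In case anyone trips over the same thing: a local trust entry along these lines in pg_hba.conf, adjusted to your own setup, allows passwordless connections from localhost.)

```
# TYPE  DATABASE  USER  ADDRESS       METHOD
host    all       all   127.0.0.1/32  trust
```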

I'll try to look a bit better if I can understand what causes them.

dvarrazzo added this to the psycopg 2.7.2 milestone Apr 4, 2017

underyx commented Apr 4, 2017

the script was just failing to connect because missing password

At least I'll be able to sleep properly tonight! :D

dvarrazzo added a commit that referenced this issue Apr 5, 2017

Added test to verify #410
The 'unknown error' happens on query.

dvarrazzo (Member) commented Apr 5, 2017

OK, nailed it. I managed to set up a test that doesn't rely on concurrency and added it to the test suite.


underyx commented Apr 5, 2017

Wow, that's amazing, thank you! We'll test this commit in our production environment to confirm the bug is fixed.

dvarrazzo closed this in a66c34a Apr 5, 2017

dvarrazzo (Member) commented Apr 5, 2017

That would be great: let us know, thank you very much.

underyx commented Apr 5, 2017

Yesterday we got 66 of these errors over the span of around an hour. We just ran for 40 minutes with psycopg2 at 4b4d279 and we got

  • 0 unknown errors
  • 55 of a different exception (which must be what was masked yesterday)

So, I can confirm that this issue is fixed! Thanks again 🙃

dvarrazzo (Member) commented Apr 5, 2017

Perfect, thank you for testing!
