more pool fixes #1211

Merged
abonander merged 1 commit into master from ab/more-pool-fixes on May 18, 2021

Conversation

abonander
Collaborator


* a task that is marked woken but didn't actually wake before being cancelled will instead wake the next task in the queue (see the conceptual sketch at the end of this description)

* a task that wakes but doesn't get a connection will put itself back in the queue instead of waiting until it times out with no way to be woken

* the idle reaper now won't run if there are tasks waiting for a connection, and also uses
the proper `SharedPool::release()` to return validated connections to the pool so waiting tasks get woken

closes #622, #1210

(hopefully for good this time)

Signed-off-by: Austin Bonander <austin@launchbadge.com>
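A conceptual sketch of the handoff described in the first bullet, not the actual sqlx internals: a waiter that was marked woken but is cancelled before it can act passes its wakeup along to the next waiter in the queue, so the wakeup is never lost. All types and names here are illustrative only.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::task::Waker;

/// Illustrative only: a queue of tasks waiting for a pool connection.
struct WaitQueue {
    waiters: Mutex<VecDeque<Waker>>,
}

impl WaitQueue {
    /// Wake the task at the front of the queue, if there is one.
    fn wake_one(&self) {
        if let Some(waker) = self.waiters.lock().unwrap().pop_front() {
            waker.wake();
        }
    }
}

/// Held by a waiter that has been marked as woken. If the waiting future is
/// cancelled (dropped) before it consumes the wakeup, `Drop` forwards the
/// wakeup to the next waiter instead of letting it be lost.
struct WokenGuard<'a> {
    queue: &'a WaitQueue,
    consumed: bool,
}

impl Drop for WokenGuard<'_> {
    fn drop(&mut self) {
        if !self.consumed {
            self.queue.wake_one();
        }
    }
}
```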
@abonander
Collaborator Author

@D1plo1d @bbqsrc if you would, please try this branch to see if it fixes your issues.

@D1plo1d
Contributor

D1plo1d commented May 7, 2021

Has something changed with the Executor trait? I'm seeing lots of errors in my functions that take an executor (e.g. to receive either a Transaction or a Pool<Sqlite>), like this:

error[E0277]: the trait bound `&mut Transaction<'_, Sqlite>: sqlx_core::executor::Executor<'_>` is not satisfied
   --> crates/auth/src/user/resolvers/user_mutation_resolvers.rs:106:21
    |
106 |         user.remove(&mut tx, false).await?;
    |                     ^^^^^^^ the trait `sqlx_core::executor::Executor<'_>` is not implemented for `&mut Transaction<'_, Sqlite>`

@abonander
Collaborator Author

@D1plo1d is there more context to that error? That impl exists: https://docs.rs/sqlx/0.5.2/sqlx/struct.Transaction.html#impl-Executor%3C%27t%3E-4
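For reference, a minimal sketch (not code from this thread) of the kind of function being described: generic over `Executor`, so it accepts either a `&mut Transaction<'_, Sqlite>` or a `&Pool<Sqlite>`. The function, table, and parameter names are placeholders.

```rust
use sqlx::{Executor, Sqlite};

// Placeholder function: any SQLite `Executor` works here, including
// `&mut Transaction<'_, Sqlite>` and `&Pool<Sqlite>`.
async fn remove_user<'e, E>(executor: E, id: i64) -> Result<(), sqlx::Error>
where
    E: Executor<'e, Database = Sqlite>,
{
    sqlx::query("DELETE FROM users WHERE id = ?")
        .bind(id)
        .execute(executor)
        .await?;
    Ok(())
}
```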

@bbqsrc

bbqsrc commented May 8, 2021

@abonander I will be back at work on Tuesday and will test it then. If you don't hear from me, please remind me with a ping.

@D1plo1d
Contributor

D1plo1d commented May 8, 2021

@abonander I had another look - it turned out I had missed a few instances of async-graphql when I was switching my monorepo to this git branch. With that fixed, everything looks like it's working now, so I'd say this PR fixed the issue. Cheers!

@D1plo1d
Contributor

D1plo1d commented May 8, 2021

@abonander Upon testing with --release I'm seeing more errors:

pool timed out while waiting for an open connection

Sadly it looks like the relative slowness of my debug builds was what was preventing the race conditions earlier :(

@abonander
Collaborator Author

@D1plo1d does setting a higher connect_timeout not fix it? The acquire timeout is an expected result at high load.
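A sketch of what raising the timeout looks like, assuming the sqlx 0.5 `PoolOptions` API; the pool size and database URL are placeholders.

```rust
use std::time::Duration;
use sqlx::sqlite::{Sqlite, SqlitePoolOptions};
use sqlx::Pool;

// `connect_timeout` is the timeout that `Pool::acquire()` obeys in sqlx 0.5;
// raising it lets bursty load wait for a free connection instead of erroring
// with "pool timed out while waiting for an open connection".
async fn make_pool(url: &str) -> Result<Pool<Sqlite>, sqlx::Error> {
    SqlitePoolOptions::new()
        .max_connections(20)
        .connect_timeout(Duration::from_secs(60))
        .connect(url)
        .await
}
```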

@athei

athei commented May 10, 2021

I also have timeouts with highly parallel accesses to the connection pool, using the postgres backend. I am trying out this PR right now, and my application has run for a day without timeouts so far. I am cautiously optimistic that this fixes my problem.

@D1plo1d
Contributor

D1plo1d commented May 10, 2021

> @D1plo1d does setting a higher connect_timeout not fix it? The acquire timeout is an expected result at high load.

It shouldn't be under high load, if I understand what throughput would qualify as that. This is a dev instance of my server running locally, with a few transactions per second normally and occasional bursts of a hundred or so on a GraphQL request.

This is not optimized, but it worked perfectly fine up until this release, and it still does when I roll back sqlx.

I suspect the contention has more to do with many queries starting at approximately the same time than with overall throughput.

I will try increasing the connect_timeout. If that is the cause, is there any reason connection acquisition would be substantially slower in the new version? Again, this worked previously.

@D1plo1d
Contributor

D1plo1d commented May 10, 2021

@abonander I don't have time today but I'll reverify my claims here and also test out connect_timeout tomorrow or Wednesday.

@athei

athei commented May 10, 2021

I had the same issue, and increasing the connect_timeout to basically infinity did not help. Every task just got stuck waiting for an SQL connection. It seemed to be some kind of busy waiting, where the processes were at full load while not getting a connection. However, for me it seems to be fixed by this PR.

@D1plo1d
Contributor

D1plo1d commented May 13, 2021

After re-testing I've only been able to reproduce the problem on 0.5.2 today - all my tests in this branch look good.

I think that my issues on this branch were most likely some kind of PEBKAC on my end. My apologies for the confusion.

Thanks a bunch for patching this @abonander !

@abonander abonander merged commit 8f1d8c7 into master May 18, 2021
@abonander abonander deleted the ab/more-pool-fixes branch May 18, 2021 02:24
@jstasiak

I also had the original issue with sqlx 0.5.2 and this PR fixed it – a release with this change would be greatly appreciated.

Thank you for your work!

@D1plo1d
Contributor

D1plo1d commented May 26, 2021

Just a heads up @abonander - I've managed to observe the issue on 0.5.3.

This is with my connect_timeout set to 60 minutes.

It's more intermittent than before, but the pool is still deadlocking on something for me. I've downgraded back to 0.5.1 again and re-confirmed that it still resolves the issue for me (no regression in my own codebase on that front). Not sure if there's any more information I can provide that would help debug this one - let me know.

D1plo1d added a commit to PrintSpool/PrintSpool that referenced this pull request May 26, 2021
@abonander
Collaborator Author

A solid repro case would really, really help here, or at least a good characterization of the load that triggers the issue.

I know that's gonna be really hard to come up with though since this is ultimately a concurrency bug.

A simple project like a one-route web API that reliably deadlocks under a synthetic stress test would be great.

@D1plo1d
Contributor

D1plo1d commented May 26, 2021

Yeah, I wish I knew what specifically in my application was triggering it. I noticed that it seems to occur only when I'm using sqlx transactions. Could it be related to my use of transactions somehow?

@abonander
Collaborator Author

You mean like only the routes that call Pool::begin()?

@D1plo1d
Contributor

D1plo1d commented May 26, 2021

Yeah, Pool::begin(). Although looking closer at this I don't think the transactions I was initially thinking about should be running given the application state atm. I'll keep looking at this though.
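For context, a minimal sketch (not code from this thread) of the `Pool::begin()` pattern being discussed; the function, table, and column names are placeholders. The point relevant to pool exhaustion is that the transaction holds one pooled connection until it is committed, rolled back, or dropped.

```rust
use sqlx::{Pool, Sqlite};

// Placeholder example of the begin/commit pattern: the checked-out connection
// returns to the pool only when the transaction ends.
async fn deactivate_user(pool: &Pool<Sqlite>, user_id: i64) -> Result<(), sqlx::Error> {
    let mut tx = pool.begin().await?;
    sqlx::query("UPDATE users SET active = 0 WHERE id = ?")
        .bind(user_id)
        .execute(&mut tx)
        .await?;
    tx.commit().await?; // the connection goes back to the pool here
    Ok(())
}
```

Holding a transaction open across long-running work keeps that connection checked out for the whole time, which shrinks the pool available to other tasks.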
