
[needs testing] fix Pool to reduce possibility of "leaking" connections #84

Merged
merged 6 commits into master from ab/pool-fixes on Feb 1, 2020

Conversation

@abonander (Collaborator) commented Jan 22, 2020

Now uses RAII guards to control SharedPool::size so that it cannot fall out-of-sync.

Also added smoke tests for PgPool and MySqlPool that stress the pool a little and drop acquire() futures partway through, to check whether cancellation can trigger bugs.

closes #83
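To make the cancellation scenario concrete, here is a minimal sketch of the pattern the smoke tests exercise (illustrative only: `fake_acquire` is a stand-in for `Pool::acquire`, and the timing values are arbitrary). Racing the future against a shorter timeout drops it mid-flight, which is exactly the situation that could previously leave SharedPool::size out of sync.

```rust
use std::time::Duration;

use async_std::future::timeout;
use async_std::task;

async fn fake_acquire() {
    // Stand-in for Pool::acquire().await: pretend the pool has to wait a bit.
    task::sleep(Duration::from_millis(100)).await;
}

fn main() {
    task::block_on(async {
        // The 10 ms timeout elapses first, so the acquire-like future is
        // dropped partway through. With RAII guards, the pool's size
        // bookkeeping still unwinds correctly in exactly this situation.
        let result = timeout(Duration::from_millis(10), fake_acquire()).await;
        assert!(result.is_err()); // timed out => the inner future was cancelled
    });
}
```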

@abonander marked this pull request as ready for review on January 22, 2020 20:52

@abonander (Collaborator, Author)

@ianthetechie do you mind trying this branch with your testing setup to see if it still leaks connections?

@abonander changed the title from “WIP fix Pool to reduce possibility of "leaking" connections” to “[needs testing] fix Pool to reduce possibility of "leaking" connections” on Jan 22, 2020
@abonander force-pushed the ab/pool-fixes branch 6 times, most recently from 252da8c to 90cc737 on January 23, 2020

@ianthetechie

@abonander thanks a bunch! I will test it out this afternoon (KST).

@ianthetechie

Unfortunately, it doesn't yet pass my own test with wrk. I'll try to figure out how to improve on your smoke test in the next day or so. I do like the general direction of this PR so far, though!

@abonander (Collaborator, Author)

Can you post your test code? I'm kinda flying blind otherwise.

@ianthetechie

I can't post all of it, unfortunately, due to licensing, but I'll try to extract an MVP that reproduces the bug for you.

@abonander (Collaborator, Author)

A minimal reproduction would be very helpful, thank you.

@abonander (Collaborator, Author)

@ianthetechie any luck with that repro?

@ianthetechie

Hey @abonander, sorry it took me a bit to get around to this as Friday was the Lunar New Year holiday here. Here's a repository with instructions for reproducing the issue: https://github.com/ianthetechie/sqlx-leak-repro.

I will also add that this PR does improve things. It appears to address at least one of the data race/state issues, but there seem to be more. It's just harder to reproduce now.

@abonander (Collaborator, Author)

No worries, I just wanted to make sure you didn't forget.

@abonander (Collaborator, Author) commented Jan 28, 2020

I had a look through the repro project and I'm not sure what it's doing with sqlx that my smoke test doesn't. I can change the query it uses to select pg_sleep(0.5) and it still completes.

Are we certain that it's not Hyper leaking the futures for abandoned connections? That might explain why we didn't see the same behaviors with Tide and siege. I could also try increasing the number of spawned tasks.

@abonander (Collaborator, Author)

I was able to reproduce this with your project, and I confirmed it's not Hyper leaking futures by adding an RAII guard to the handler that counts leaked handlers in a static AtomicUsize: after wrk is killed and connections are timing out, the count is 0. The pool also reports 0 idle connections, so it still thinks they're alive somewhere. I just have to track that down now.
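For readers following along, here is a minimal sketch of that kind of leak-counting guard (an approximation of the technique described above, not the exact code used). If the counter returns to zero after load stops, the handler futures are being completed or dropped rather than leaked.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Incremented when a handler starts, decremented when its guard is dropped.
// A persistently non-zero value after load stops would mean handler futures
// are being leaked.
static LEAK_COUNT: AtomicUsize = AtomicUsize::new(0);

struct LeakGuard;

impl LeakGuard {
    fn new() -> Self {
        LEAK_COUNT.fetch_add(1, Ordering::SeqCst);
        LeakGuard
    }
}

impl Drop for LeakGuard {
    fn drop(&mut self) {
        // Runs on normal completion *and* when the handler future is dropped.
        LEAK_COUNT.fetch_sub(1, Ordering::SeqCst);
    }
}

fn main() {
    {
        let _guard = LeakGuard::new();
        // ... handler body would run here ...
    } // guard dropped here
    assert_eq!(LEAK_COUNT.load(Ordering::SeqCst), 0);
}
```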

@abonander (Collaborator, Author)

I think I've nailed down the source of this bug but I'll have to play with some refactors to see if my guess is correct.

  • In the connection queue we actually have two separate queues: one for idle connections and the other for futures::channel::oneshot::Senders, which represent clients waiting to acquire a connection.
  • When popping from the queue, if there are no connections in the idle queue, we create a new oneshot channel, place the sender on the waiter queue, and await the receiver with a timeout.
  • If a connection is sent just as the timeout elapses, we get a race: the channel contains a connection after the acquire has timed out, and the receiver will never be polled again but has not been dropped yet, so the connection is stranded outside the pool's bookkeeping.

The fix, I think, would be to always push connections to the idle queue, and then try to find a waiter to wake up. That way the connection always remains tracked by the pool.

Though it would not completely fix this race, it also seems suboptimal to me that neither async-std nor Tokio polls the future one last time after the timeout elapses, even if the result would be immediately available.

Additionally, we don't have a way to wake a waiter if the size drops below the max, which means connections can time out waiting on the queue even though they have the right-of-way to open a new connection. I'll be fixing both of these.
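A rough sketch of the shape of that fix (illustrative only; the type and method names here are assumptions, not the PR's actual code):

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::task::Waker;

use crossbeam_queue::{ArrayQueue, SegQueue};

struct SharedPool<C> {
    idle: ArrayQueue<C>,      // idle, pool-owned connections
    waiters: SegQueue<Waker>, // tasks currently waiting in acquire()
    size: AtomicU32,          // total number of open connections
}

impl<C> SharedPool<C> {
    /// Return a connection to the pool: park it on the idle queue *first*,
    /// then wake a waiter. Even if the woken task has already timed out,
    /// the connection stays tracked by the pool instead of being stranded
    /// in a oneshot channel that nobody will poll again.
    fn release(&self, conn: C) {
        let _ = self.idle.push(conn);
        if let Some(waker) = self.waiters.pop() {
            waker.wake();
        }
    }

    /// Close a connection entirely: decrement the size, then wake a waiter,
    /// which now has the right-of-way to open a brand new connection.
    fn decrement_size(&self) {
        self.size.fetch_sub(1, Ordering::AcqRel);
        if let Some(waker) = self.waiters.pop() {
            waker.wake();
        }
    }
}
```

Pushing to the idle queue before waking anyone means a missed or timed-out waiter can't strand the connection; the next acquire() simply finds it sitting in the queue.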

Review comment on sqlx-core/src/pool/size.rs (outdated, resolved)
@abonander (Collaborator, Author) commented Jan 31, 2020

[screenshot: three terminal windows, described below]

@ianthetechie I believe I have fixed the problem. At the beginning of the far-right terminal window you can see the results of starting and stopping wrk bringing the pool to a halt; it was clear there was a problem, as the throughput decreased every time I started and stopped wrk until it hit 0. At the point of the selected command I had my fixed implementation running, and you can see it maintains throughput the entire time. The far-left window is my modification of your repro, showing the idle connections returning to the queue when the "leak count" (which is really just the current number of running request handlers, since Hyper isn't leaking handlers) is 0.

The middle command completes in normal time if I'm not running wrk, but it still times out while wrk is running, which makes sense as I'm effectively DoSing the server. There's not much that can be done about that except implementing load balancing at the application level, which is out of scope for this fix.

If you don't mind, give this branch a try again and see if you get the same results.

@mehcode (Member) commented Feb 1, 2020

Great job @abonander. I played with it myself and I don't see any obvious issues. Merging, as it's certainly an improvement, and I love that we have a couple of good smoke tests in-crate now. In the future I'd love to set up a benchmark to validate some of our assumptions about performance.

@mehcode merged commit eff7c9e into master on Feb 1, 2020
@mehcode deleted the ab/pool-fixes branch on February 1, 2020 07:33
@ianthetechie

Hey guys! Thanks so much for working through this with me! I admit the repro and description weren't as clear-cut as I would have liked...

I've tested it on both my repro project and our "real world" codebase and it looks like it's working as intended. Incidentally it seems to perform a bit better.

Also the other issue I noted in my repro project (some sort of incorrect recycling causing bad behavior when connections are dropped and not tested by the pool) seems to be fixed.

pub(in crate::pool) struct DecrementSizeGuard<'a> {
    size: &'a AtomicU32,          // the pool's shared size counter
    waiters: &'a SegQueue<Waker>, // wakers for tasks waiting to acquire
    dropped: bool,
}


I love this solution :) But is a dropped flag really necessary? I didn't think it was possible to drop something twice in Rust (you can't call the Drop::drop method manually, and the prelude drop function takes ownership of its input, so attempts to reference the value afterwards won't compile).
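For context, one common reason a guard carries a flag like dropped (this may or may not be the rationale in this branch): the guard can also be consumed explicitly, for example when the decrement should be skipped, and Drop then needs to know it has nothing left to do. A minimal sketch of that pattern, with hypothetical names:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

struct DecrementGuard<'a> {
    size: &'a AtomicU32,
    dropped: bool,
}

impl<'a> DecrementGuard<'a> {
    fn new(size: &'a AtomicU32) -> Self {
        DecrementGuard { size, dropped: false }
    }

    /// Consume the guard without decrementing, e.g. because the "slot" it
    /// protects was successfully handed off elsewhere.
    fn cancel(mut self) {
        self.dropped = true;
        // `self` is still dropped at the end of this method, but Drop sees
        // the flag and becomes a no-op.
    }
}

impl<'a> Drop for DecrementGuard<'a> {
    fn drop(&mut self) {
        if !self.dropped {
            self.size.fetch_sub(1, Ordering::AcqRel);
        }
    }
}
```

So the flag typically isn't about guarding against a double drop; it lets other code paths opt out of the Drop behavior without resorting to std::mem::forget.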

Closes: Pool can "leak" connections under certain conditions (#83)