Pool::acquire times out #622

Closed
skuzins opened this issue Aug 10, 2020 · 20 comments · Fixed by #629 or #1211

@skuzins commented Aug 10, 2020

I'm using a connection pool with a large number of tokio tasks; see the example below. The tasks run a small query, do some work without holding on to a connection, then run another query. In the example, the second call to acquire quickly starts failing with PoolTimedOut, or it takes too long, roughly connect_timeout seconds.

Any ideas would be appreciated.

use sqlx::mysql::*;
async fn run_queries(i: u32, pool: &sqlx::Pool<MySql>) {
    {
        let mut conn = pool.acquire().await.unwrap();

        let _ = sqlx::query("SELECT 1").fetch_one(&mut conn).await.unwrap();

        // conn gets dropped here and should be returned to the pool
    }

    // (do some other work here without holding on to a connection)

    {
        let start = std::time::Instant::now();

        // this often fails (PoolTimedOut) or takes connect_timeout seconds
        let mut _conn = pool.acquire().await.unwrap();

        println!("{} pool.acquire() took {:?}", i, std::time::Instant::now() - start);

        // (run another query...)

    }
}

#[tokio::main]
async fn main() {
    let pool = MySqlPoolOptions::new()
        .max_connections(1) // also fails with higher counts, e.g. 5
        .connect(&std::env::var("DATABASE_URL").unwrap())
        .await
        .unwrap();

    let mut handles = vec![];
    for i in 0..100 {
        let pool = pool.clone();

        handles.push(tokio::spawn(async move {
            run_queries(i, &pool).await;
        }));
    }

    futures::future::join_all(handles).await;
}
Cargo.toml:

[dependencies]
sqlx = { version = "0.4.0-beta.1", default-features = false, features = [ "runtime-tokio", "mysql" ] }
tokio = "0.2.22"
futures = "0.3.5"

Output snippets:

thread 'tokio-runtime-worker' panicked at 'called `Result::unwrap()` on an `Err` value: PoolTimedOut'
pool.acquire() took 30.001958452s

MySQL Engine version 8.0.17. Tested with stable-x86_64-unknown-linux-gnu and stable-x86_64-pc-windows-msvc.

@cetra3 commented Aug 12, 2020

I've run into the same issue. I am not seeing connections recycled when they are dropped, which is causing issues when multiple queries happen on one web request/future.

I have worked around it by using deadpool for connection pooling, which appears to be more reliable in my limited tests.

Here's my (very simple) implementation. (also note: I needed to use DerefMut to coax the compiler, as I am not sure what combination of * and & is needed here.)

use std::ops::DerefMut;
use sqlx::{Connect, Connection, Error as SqlxError, Executor, PgConnection};

use async_trait::async_trait;
use deadpool::managed::{Manager, RecycleResult};
use log::*;
type Pool = deadpool::managed::Pool<PgConnection, SqlxError>;

struct DbPool {
    url: String,
}

impl DbPool {
    fn new(url: String, size: usize) -> Pool {
        Pool::new(DbPool { url }, size)
    }
}

#[async_trait]
impl Manager<PgConnection, SqlxError> for DbPool {
    async fn create(&self) -> Result<PgConnection, SqlxError> {
        PgConnection::connect(&self.url).await
    }
    async fn recycle(&self, obj: &mut PgConnection) -> RecycleResult<SqlxError> {
        Ok(obj.ping().await?)
    }
}



async fn main() {
    // ...
    let pool = DbPool::new(url, 16);
    // ...
    sqlx::query_as("select * from users where email = $1")
        .bind(&email)
        .fetch_one(pool.get().await?.deref_mut())
        .await?;
    // ...
}

@skuzins (Author) commented Aug 12, 2020

Thanks for sharing your workaround! deadpool looks great.

@abonander (Collaborator) commented Aug 12, 2020

Currently the pool doesn't behave well if the acquire() future is cancelled: it fails to remove the waiter from the list, which holds up the queue. This can have a knock-on effect, with other acquire() calls timing out due to the same problem. Each time a connection is returned to the pool, it only attempts to wake the task at the head of the queue, so if there are more dead tasks than connections in the queue, the pool will stall.
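
To illustrate the failure mode, here's a deliberately simplified sketch of the idea (hypothetical types, not sqlx's actual internals): releasing a connection only wakes the waiter at the head of the queue, so if that waiter belongs to a cancelled acquire(), the connection sits idle while every other waiter runs into its timeout.

use std::collections::VecDeque;
use std::task::Waker;

struct Connection;

// Hypothetical, simplified pool state for illustration only.
struct PoolState {
    idle: Vec<Connection>,
    // Cancelled acquire() futures currently leave stale wakers behind here.
    waiters: VecDeque<Waker>,
}

impl PoolState {
    fn release(&mut self, conn: Connection) {
        self.idle.push(conn);
        // Only the head of the queue is woken. If that waker belongs to a
        // task that has already been cancelled, nobody picks up the idle
        // connection and every remaining waiter eventually times out.
        if let Some(waker) = self.waiters.pop_front() {
            waker.wake();
        }
    }
}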

@skuzins in your example, could you please try debug-printing the pool after the second .acquire() call times out, instead of panicking:

if let Err(e) = pool.acquire().await {
    println!("acquire error: {}, pool state: {:?}", e, pool);
}

@skuzins (Author) commented Aug 13, 2020

I'm not sure any future would/should normally be cancelled in this example - I believe they should all just run to completion.

Here's the debug message, please let me know if you need more information.

acquire error: pool timed out while waiting for an open connection, pool state: 
Pool { size: 1, num_idle: 0, is_closed: false, options: PoolOptions { 
max_connections: 1, min_connections: 0, connect_timeout: 30s, 
max_lifetime: Some(1800s), idle_timeout: Some(600s), test_before_acquire: true } }

@abonander (Collaborator)

The timeout itself cancels the internal future that's actually waiting in the queue.

However, I'd expect to see num_idle: 1 if that was the case. How about with .max_connections(5)?

@cetra3 commented Aug 13, 2020

I am not seeing cancellations in my code and the example posted doesn't have them either. It feels more like an issue with waking up after the pool has stopped using a connection.

@skuzins (Author) commented Aug 13, 2020

@abonander The thing is, there shouldn't be any timeouts. The task acquires the connection, performs a short query, drops the connection and does the same thing once more. The whole program should finish within milliseconds without any timeouts.

Here's the output with .max_connections(5):

acquire error: pool timed out while waiting for an open connection, pool state: Pool 
{ size: 5, num_idle: 0, is_closed: false, options: PoolOptions { 
max_connections: 5, min_connections: 0, connect_timeout: 30s, 
max_lifetime: Some(1800s), idle_timeout: Some(600s), test_before_acquire: true } }

abonander added a commit that referenced this issue Aug 13, 2020
fixes #622

Signed-off-by: Austin Bonander <austin@launchbadge.com>
abonander added a commit that referenced this issue Aug 16, 2020
fixes #622

Signed-off-by: Austin Bonander <austin@launchbadge.com>
@jgrund (Contributor) commented Aug 18, 2020

FWIW, I think I'm hitting this (or something like it) when using try_acquire with Postgres.

I have a loop which waits 5 seconds and calls try_acquire to see if any connections are available. The loop also does some work with the pool in a spawned task.

Once the 5 seconds have elapsed, the pool shows all connections as idle, but try_acquire still returns None.
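
A minimal sketch of the kind of loop described above (the names, the interval, and the use of tokio 1.x's sleep are illustrative; try_acquire returns Option<PoolConnection<_>>):

use std::time::Duration;
use sqlx::postgres::PgPool;

async fn poll_for_connection(pool: &PgPool) {
    loop {
        tokio::time::sleep(Duration::from_secs(5)).await;

        match pool.try_acquire() {
            Some(_conn) => {
                // A connection was free; do some work with it here.
            }
            None => {
                // Observed symptom: the pool reports idle connections,
                // yet try_acquire still returns None.
                println!("no connection available, pool state: {:?}", pool);
            }
        }
    }
}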

jgrund added a commit to whamcloud/integrated-manager-for-lustre that referenced this issue Aug 18, 2020
We're hitting an issue that looks similar to
launchbadge/sqlx#622

I was able to work around it by disabling fairness, but we can just
avoid the lookup entirely if active workers is >= the pool size.

Signed-off-by: Joe Grund <jgrund@whamcloud.io>
@bbqsrc commented May 5, 2021

I'm not certain this bug has been fixed, as I've had to use the deadpool workaround, which has entirely solved the problems we've been having. In seemingly simple cases, PgPool seems to break down even under very basic load.

@abonander (Collaborator)

Yeah I think adding this periodic wake was a mistake: https://github.com/launchbadge/sqlx/blob/master/sqlx-core/src/pool/inner.rs#L180

After a single wait period it effectively makes the pool unfair again as it becomes dependent on the timing of when tasks wake and poll the future.

Something I realized in 0.6 that I want to backport: if a task is "woken" (as in, its woken flag is set) but it doesn't actually try to acquire, then on drop of the Waiter it needs to wake the next task in the list so that task can attempt to acquire. Without that, I think there's a knock-on effect where a connection sits in the idle queue with tasks waiting, but none of them get woken, because the first task was woken but never polled (either because it timed out or was cancelled).

The idle reaper might also be unintentionally preempting waiting tasks, I need to look into that.
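
A rough sketch of the wake-on-drop idea (hypothetical, simplified types, not the pool's actual internals): when a waiter that was flagged as woken is dropped without having acquired, it hands the wakeup to the next waiter in line.

use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::task::Waker;

struct WaiterState {
    woken: bool,
    waker: Option<Waker>,
}

struct WaitQueue {
    waiters: Mutex<VecDeque<Arc<Mutex<WaiterState>>>>,
}

struct Waiter {
    state: Arc<Mutex<WaiterState>>,
    queue: Arc<WaitQueue>,
}

impl Drop for Waiter {
    fn drop(&mut self) {
        // If this task was handed a wakeup but never used it (it timed out
        // or was cancelled before polling again), forward the wakeup so an
        // idle connection doesn't go unclaimed.
        if self.state.lock().unwrap().woken {
            if let Some(next) = self.queue.waiters.lock().unwrap().pop_front() {
                let mut next_state = next.lock().unwrap();
                next_state.woken = true;
                if let Some(waker) = next_state.waker.take() {
                    waker.wake();
                }
            }
        }
    }
}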

@abonander (Collaborator)

Just spotted another bug, too; since waking a waiting task involves popping it from the SegQueue, if the task wakes and doesn't get a connection, it'll just... go back to sleep without putting itself back in the queue.

Using SegQueue was probably a bad idea anyway since waiting tasks can't retain their spot in line and have to get returned to the back of the queue. In 0.6 we're switching to an intrusive linked list where waiting tasks aren't removed until acquire() completes or is cancelled, so that should avoid this issue.

The idle reaper is also definitely preempting some waiting tasks since it directly pops connections from the idle queue without checking if any tasks are waiting.
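
And a sketch of the "put itself back in the queue" fix under the current SegQueue design (again hypothetical and simplified; it assumes crossbeam-queue 0.3, where SegQueue::pop returns Option, and glosses over the race between checking for an idle connection and re-registering):

use crossbeam_queue::SegQueue;
use std::task::{Context, Poll, Waker};

struct Connection;

struct SharedPool {
    idle: SegQueue<Connection>,
    waiters: SegQueue<Waker>,
}

fn poll_acquire(pool: &SharedPool, cx: &mut Context<'_>) -> Poll<Connection> {
    if let Some(conn) = pool.idle.pop() {
        return Poll::Ready(conn);
    }
    // This task's waker was already popped from `waiters` when it was woken,
    // so if it loses the race for the connection it must push itself back
    // onto the queue; otherwise it sleeps until the acquire timeout with no
    // way to be woken again.
    pool.waiters.push(cx.waker().clone());
    Poll::Pending
}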

@abonander reopened this May 6, 2021
abonander added a commit that referenced this issue May 6, 2021
* a task that is marked woken but didn't actually wake will instead wake the next task
in the queue

* a task that wakes but doesn't get a connection will put itself back in the queue instead of waiting until it times out with no way to be woken

* the idle reaper won't run if there are tasks waiting for a connection, and also uses
the proper `SharedPool::release()` to return validated connections to the pool so waiting tasks get woken

closes #622, #1210

(hopefully for good this time)

Signed-off-by: Austin Bonander <austin@launchbadge.com>
abonander added a commit that referenced this issue May 6, 2021
* a task that is marked woken but didn't actually wake before being cancelled will instead wake the next task in the queue

* a task that wakes but doesn't get a connection will put itself back in the queue instead of waiting until it times out with no way to be woken

* the idle reaper now won't run if there are tasks waiting for a connection, and also uses
the proper `SharedPool::release()` to return validated connections to the pool so waiting tasks get woken

closes #622, #1210

(hopefully for good this time)

Signed-off-by: Austin Bonander <austin@launchbadge.com>
@slhmy commented May 6, 2021

I've hit a similar problem, using MySqlPool in an Actix-web 4.0-beta server.

The code is in #1199 (comment).

I make async queries by using the pool directly for each small query, and run queries over 512 connections concurrently. Then I hit the PoolTimedOut error. The pool shouldn't get blocked, since every query on its own runs successfully.

@abonander (Collaborator)

@slhmy if you would, please try #1211 to see if that fixes the issue.

@slhmy commented May 6, 2021

@slhmy if you would, please try #1211 to see if that fixes the issue.

😍 Fixed; I checked on the Tokio runtime.

@abonander (Collaborator)

That's awesome to hear, thanks!

@slckl commented May 11, 2021

I'm also running into this issue, with a somewhat silly use case of hundreds to low thousands of tokio tasks running small write queries against a PgPool. It works for a couple of minutes, and then buckles with a PoolTimedOut.
#1211, unfortunately, does not seem to have any effect in my case.
I will try the deadpool version suggested above.

Edit: the deadpool version works just fine.

@altanozlu (Contributor)

#1211 fixed my issue, thanks @abonander.

abonander added a commit that referenced this issue May 18, 2021
glommer pushed a commit to chiselstrike/chiselstrike that referenced this issue Jan 31, 2022
Our external cloud database doesn't seem to be able to make do
with just one. It does seem that there is a sqlx bug making this worse:
launchbadge/sqlx#622

But not a lot of options for now, and a large connection pool is good
anyway.
@fjoanis-legion commented May 20, 2022

I have a minimal repro of the timeout issue in async code. This might be incorrect usage on my part; please tell me if that's the case!

Cargo.toml

[package]
name = "repro_sqlx_timeout"
version = "0.1.0"
edition = "2021"

[dependencies]
sqlx = { version = "0.5", features = [
    "sqlite",
    "any",
    "runtime-tokio-native-tls",
] }

tokio = { version = "1", features = ["full"] }
futures = "0.3"

[dev-dependencies]
tempfile = "3.2.0"

main.rs

fn main() {}

#[cfg(test)]
mod tests {
    use futures::future::join_all;
    use sqlx::{migrate::MigrateDatabase, Executor};
    use std::{thread, time::Duration};

    #[tokio::test]
    async fn sqlx_timeout() {
        let dir = tempfile::tempdir().unwrap();
        let path = dir.path();
        let db_path = path.join("database.db3");
        let database_uri = format!("sqlite://{}", db_path.to_str().unwrap().replace('\\', "/"));

        sqlx::Any::create_database(&database_uri).await.unwrap();

        let connection = sqlx::any::AnyPoolOptions::new()
            .max_connections(1)
            .connect_timeout(Duration::from_millis(300))
            .connect(&database_uri)
            .await
            .unwrap();

        connection
            .execute("CREATE TABLE some_table(some_number BIGINT);")
            .await
            .unwrap();

        let mut vec_future = Vec::new();

        for _ in 1..3 {
            vec_future.push(async {
                connection
                    .execute(sqlx::query("SELECT * FROM some_table"))
                    .await
                    .unwrap(); // thread 'tests::sqlx_timeout' panicked at 'called `Result::unwrap()` on an `Err` value: PoolTimedOut'
                thread::sleep(Duration::from_secs(1));
            });
        }

        let _results = join_all(vec_future).await;
    }
}

@abonander (Collaborator)

@fjoanis-legion That thread::sleep() blocks the Tokio runtime and doesn't let the connection cleanup tasks execute. You want tokio::time::sleep(Duration::from_secs(1)).await; instead.
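
For example, the loop in the repro could become the following (only the sleep call changes, reusing the imports from the test above):

for _ in 1..3 {
    vec_future.push(async {
        connection
            .execute(sqlx::query("SELECT * FROM some_table"))
            .await
            .unwrap();
        // Yield back to the runtime instead of blocking the worker thread,
        // so dropped connections can actually be returned to the pool.
        tokio::time::sleep(Duration::from_secs(1)).await;
    });
}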

@abonander (Collaborator)

Locking this issue to prevent further necroing. If anyone is encountering sporadic timeouts on SQLx 0.6.0, please open a new issue.

launchbadge locked this issue as resolved and limited conversation to collaborators on Jun 16, 2022