transaction statements out of order #2805
Comments
This query executes on a different connection since you pass the pool:

```rust
let data: String = sqlx::query_scalar("select 'hello world from pg'")
    .fetch_one(&pool)
    .await
    .map_err(internal_error)?;
```

I suspect this may be due to cancellation: when a connection is closed mid-request, axum will cancel the handler future, and the cancellation could be leaving database connections in weird states.
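For context, when the client connection closes mid-request, axum drops the handler future, and the in-flight work is cancelled at its next await point. A minimal sketch of one way to shield transactional work from that kind of cancellation, assuming a handler with a `PgPool` in scope (an illustration of the mechanics, not a fix proposed in this thread):

```rust
use sqlx::PgPool;

// Sketch: run the transaction on a spawned task so it runs to completion
// even if the calling handler future is dropped on client disconnect.
async fn shielded(pool: PgPool) -> Result<String, sqlx::Error> {
    tokio::spawn(async move {
        let mut tx = pool.begin().await?;
        let data: String = sqlx::query_scalar("select 'hello world from pg'")
            .fetch_one(&mut *tx)
            .await?;
        tx.commit().await?;
        Ok(data)
    })
    .await
    .expect("transaction task panicked")
}
```

Dropping the `JoinHandle` returned by `tokio::spawn` does not cancel the spawned task, which is what shields the transaction here.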
Sorry, silly mistake in the repro. It should be

```rust
let data: String = sqlx::query_scalar("select 'hello world from pg'")
    .fetch_one(&mut *transaction)
    .await
    .map_err(internal_error)?;
```

of course. Unfortunately the repro does not seem to trigger it anymore in its current config. But it is still happening in my "real" code, where I also use the created transaction (and not the pool) for all of the queries.
I could further reduce the repro to this: https://github.com/pythoneer/sqlx_test/blob/main/src/main1.rs. Looking at the Postgres logs (I enabled statement logging), the statement ordering does not look correct, which is what I suspected in the beginning.
This seems to be an issue with cancellation safety of transactions in sqlx. I found a self-contained way of reproducing the issue.
`Cargo.toml`:

```toml
[package]
name = "sqlx_test"
version = "0.1.0"
edition = "2021"

[dependencies]
anyhow = "1"
rand = "0.8"
clap = { version = "4", features = ["derive"] }
tokio = { version = "1", features = ["macros", "time", "rt-multi-thread"] }
tracing = "0.1"
tracing-subscriber = "0.3.17"
sqlx = { version = "=0.7.2", default-features = false, features = ["postgres", "runtime-tokio-rustls"] }
```
`src/main.rs`:

```rust
use clap::Parser;
use rand::{thread_rng, Rng};
use sqlx::PgPool;
use std::sync::Arc;
use std::time::Duration;
use tokio::task::JoinSet;
use tokio::time::Instant;
use tracing::{error, info, info_span, Instrument, Level};

#[derive(Parser, Debug)]
struct CommandlineArguments {
    #[arg(long, default_value_t = 100)]
    tasks: usize,
    #[arg(long, default_value_t = 5)]
    runtime_seconds: u64,
    #[arg(long, default_value_t = 100)]
    max_keep_transaction_milliseconds: u64,
    #[arg(long, default_value_t = 100)]
    max_drop_after_milliseconds: u64,
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let arguments = CommandlineArguments::parse();
    let runtime = Duration::from_secs(arguments.runtime_seconds);
    let max_keep_transaction = Duration::from_millis(arguments.max_keep_transaction_milliseconds);
    let max_drop_after = Duration::from_millis(arguments.max_drop_after_milliseconds);
    tracing_subscriber::fmt().with_max_level(Level::INFO).init();
    info!(?arguments, "Commandline");

    let pool =
        Arc::new(PgPool::connect("postgres://postgres:postgres@localhost:51337/postgres").await?);

    // Spawn many concurrent tasks that all hammer the pool with transactions.
    let mut join_set = JoinSet::new();
    for index in 0..arguments.tasks {
        let span = info_span!("task", index);
        join_set.spawn(
            task(pool.clone(), max_keep_transaction, max_drop_after, runtime).instrument(span),
        );
    }
    while let Some(res) = join_set.join_next().await {
        if let Err(err) = res {
            error!("Task failed: {:?}", err);
        }
    }
    Ok(())
}

async fn task(
    pool: Arc<PgPool>,
    max_keep_transaction: Duration,
    max_drop_after: Duration,
    runtime: Duration,
) -> anyhow::Result<()> {
    let start = Instant::now();
    while start.elapsed() < runtime {
        let operation = Operation::random(pool.clone(), max_keep_transaction, max_drop_after);
        let span = info_span!(
            "operation",
            keep_transaction_for = operation.keep_transaction_for.as_millis(),
            drop_after = operation.drop_after.as_millis()
        );
        operation.run().instrument(span).await?;
    }
    Ok(())
}

struct Operation {
    pool: Arc<PgPool>,
    keep_transaction_for: Duration,
    drop_after: Duration,
}

impl Operation {
    fn random(pool: Arc<PgPool>, max_keep_transaction: Duration, max_drop_after: Duration) -> Self {
        let mut rng = thread_rng();
        Self {
            pool,
            keep_transaction_for: rng.gen_range(Duration::ZERO..=max_keep_transaction),
            drop_after: rng.gen_range(Duration::ZERO..=max_drop_after),
        }
    }

    // Race the transaction against a timer; if the timer wins, the in-flight
    // transaction future is dropped (i.e. cancelled).
    async fn run(self) -> anyhow::Result<()> {
        tokio::select! {
            result = self.start_transaction_and_commit() => {
                result
            }
            _ = tokio::time::sleep(self.drop_after) => {
                // drops the open transaction from the other branch
                Ok(())
            }
        }
    }

    async fn start_transaction_and_commit(&self) -> anyhow::Result<()> {
        let begin_span = info_span!("beginning transaction");
        let transaction = self.pool.begin().instrument(begin_span).await?;
        let inside_transaction_span = info_span!("inside transaction");
        tokio::time::sleep(self.keep_transaction_for)
            .instrument(inside_transaction_span)
            .await;
        let commit_span = info_span!("committing transaction");
        transaction.commit().instrument(commit_span).await?;
        Ok(())
    }
}
```
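The repro spawns `tasks` concurrent workers that each repeatedly race a `begin`/`sleep`/`commit` sequence against a random timer via `tokio::select!`, so some transaction futures get dropped mid-flight. The race window can be widened or narrowed from the command line, e.g. `cargo run --release -- --tasks 200 --max-drop-after-milliseconds 50` (these values are illustrative, not from the original report).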
`docker-compose.yml`:

```yaml
version: "3.2"
services:
  db:
    image: postgres:15-alpine
    container_name: sqlx_repro_db
    restart: unless-stopped
    tty: true
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    ports:
      - "51337:5432"
```

This is what the output looks like (with …):
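The statement interleaving only becomes visible in the Postgres logs with statement logging enabled (as mentioned earlier in the thread). One way to turn it on with the official image is a command override in the compose file above; a sketch:

```yaml
services:
  db:
    # Log every statement so the BEGIN/COMMIT ordering is visible via
    # `docker logs sqlx_repro_db`.
    command: ["postgres", "-c", "log_statement=all"]
```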
Note that in the repro I'm just simulating a dropping connection by using `tokio::select!` to drop the open transaction future after a random timeout.
Not very familiar with the code base, but it looks like this part (and the sibling functions) is not safely increasing or decreasing the transaction depth (sqlx/sqlx-postgres/src/transaction.rs, lines 16 to 25 at b138705). What happens if the future is dropped partway through?
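To illustrate the hazard in the abstract, here is a minimal, hypothetical sketch of the pattern (`Conn` and its fields are invented for illustration; this is not sqlx's actual code): an await point between sending `BEGIN` and updating the client-side depth counter means a drop in between leaves the server inside a transaction that the driver's bookkeeping does not reflect, so the pool can hand the connection out as if it were idle.

```rust
use std::time::Duration;

// Hypothetical connection type with client-side transaction bookkeeping.
struct Conn {
    transaction_depth: usize,
}

impl Conn {
    async fn send(&mut self, _sql: &str) {
        // Stand-in for writing the statement to the socket and flushing it.
        tokio::time::sleep(Duration::from_micros(100)).await;
    }

    async fn begin(&mut self) {
        // If this future is dropped at the await point below -- which is
        // exactly what `tokio::select!` or a disconnecting HTTP client
        // causes -- the server may already have received the BEGIN...
        self.send("BEGIN").await;
        // ...while this line never runs, so `transaction_depth` stays 0 and
        // the connection looks idle even though a transaction is open.
        self.transaction_depth += 1;
    }
}

#[tokio::main]
async fn main() {
    let mut conn = Conn { transaction_depth: 0 };
    // Race `begin` against a timer, as the repro does; when the timer wins,
    // the `begin` future is dropped, possibly after `BEGIN` was sent.
    tokio::select! {
        _ = conn.begin() => {}
        _ = tokio::time::sleep(Duration::from_micros(50)) => {}
    }
    println!("client-side depth: {}", conn.transaction_depth);
}
```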
Seemingly related: #2054
Bug Description
Not really sure what exactly the bug is, but I am seeing strange behavior that makes it look like SQL statements are not executed in the correct order. I have a reproduction here (https://github.com/pythoneer/sqlx_test), but it is not really minimal. It uses axum and some strange `wrk` stuff to trigger it, because that is where I originally encountered the behavior; I'm not sure if it could be further reduced or whether it has something to do with how axum operates specifically. I am just trying to do this in a handler:

https://github.com/pythoneer/sqlx_test/blob/c220fda7c5f60fc0ef22f7ba64f3436fac7063e6/src/main2.rs#L41-L59

The `SET TRANSACTION ...` and the `select 'hello ...'` can also be removed; the problem already triggers with just the `begin`/`commit` pair, but they help demonstrate the problem.
By running the server and then `wrk` to hit the endpoint, we can sporadically see warnings from Postgres which I can't really explain. The only way I can think of is that the statements sent to the database are not in order. Normally we would see `BEGIN; COMMIT;` pairs sent in parallel on each connection, right? But the warnings make me assume that sometimes `BEGIN; BEGIN; COMMIT;` or `BEGIN; COMMIT; COMMIT;` happens (in psql, a `BEGIN` inside an already-open transaction warns "there is already a transaction in progress", and a `COMMIT` without one warns "there is no transaction in progress").

With the additional `SET TRANSACTION ...` and `select 'hello ...'` we can also see panics that I create in the handler, and the same problem shows there: I can only explain it (just playing around in psql) if the `select 'hello ...'` is executed before the `SET TRANSACTION ...`.

sqlx.mp4 (video attachment)
Also, the values (like the sleep, all the `wrk` parameters, and the connection pool size) might depend on the specific machine; I tuned them to work best on my system, and I think this can change from system to system. It seemed to work best when the endpoint delivered around 2500 req/s. The reason I am doing the "funny" stuff in `parallel_run.sh` is that I noticed the problem is somehow triggered specifically at the beginning or the end of a `wrk` run; I'm not really sure why or what is happening in detail. Maybe `wrk` "just" kills connections in the middle and axum reacts strangely to killed connections while a handler is running. But regardless of what axum is doing, I would not expect any of the observed behavior. You can trigger this without `parallel_run.sh`, but you might need to wait and potentially start and stop the `wrk` command manually in fast succession; that is basically what `parallel_run.sh` does.

Minimal Reproduction
https://github.com/pythoneer/sqlx_test

Needs `wrk` installed for `parallel_run.sh`.
Info

SQLx features enabled: `"postgres"`, `"runtime-tokio-rustls"`, `"macros"`, `"migrate"`, `"chrono"`, `"json"`, `"uuid"`

`rustc --version`: rustc 1.72.0 (5680fa18f 2023-08-23)