Implement asynchronous pipe for communication with walredo process #3368
Conversation
bench_walredo (microbenchmark):
  async_pipe: …
  shmempipe_rs: …
Ketteq Q1: …

I get similar results for the microbenches.
this PR: …
shmempipe_rs: …
I think this looks good. I'm having a hard time reading those index-into-VecDeque operations, but this is certainly a less-unsafe solution than the shmempipe, with some added OLAP performance, and it passes the tests right away.

The difference between this solution and shmempipe is not so big in terms of performance. However, this is much simpler than shmempipe. Let's go with this one!
OLTP workload (pgbench):
  shmempipe_rs: …
  main: …
  async_pipe: …

So, on OLAP queries with parallel seqscan and prefetch:
  main: …
  async_pipe: …

My hypothesis is that the large number of concurrent tasks compensates for this difference in page reconstruction time.
Reviewed the core logic in this PR, especially the business around detecting a restarted walredo process while other `apply_wal_records` calls are still going on.
I wrote down my current understanding as comments on this PR's code, in this branch: #3389
Please review that I understood things correctly, then pull it in.
I'm obviously pleased with the perf results, but I think this implementation is very brittle.
I think we can do better than this.
Specifically, how about we introduce a new struct `WalRedoProcess` and move the `apply_wal_records` method there. When we launch the walredo process, we heap-allocate the `WalRedoProcess` struct using `Arc`. `PostgresRedoManager` stores the current walredo process in a `Mutex<Option<Arc<WalRedoProcess>>>`. `apply_batch_postgres` grabs that mutex briefly to check whether there is already a walredo process. If there is, `Arc::clone` it, drop the mutex, and call `apply_wal_records`.
If `apply_wal_records` comes back with an error:
- re-acquire the mutex
- if it's still the same `WalRedoProcess`, `.take()` it out, drop the mutex, and `kill_and_wait` it
Advantages of this:
- overall more idiomatic (my opinion)
- we don't need to rely on libc not recycling file descriptors. We can definitely rely on that, but I think to most programmers it's a non-obvious interaction, whereas `Mutex<Option<..>>` is straightforward.
- Lifetimes of the stdio file descriptors are idiomatic and straightforward. The `WalRedoProcess` will hold all the `ChildStd{in,out,err}`. And `apply_wal_records` can use `as_raw_fd()` on them safely, since it's a method on the `WalRedoProcess`, so it's guaranteed that the `ChildStd{in,out,err}` won't be dropped while `apply_wal_records` is still running.
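A minimal sketch of that proposal, using the names from this comment; the method bodies, `launch`, and the error-type choices are placeholders, not the actual pageserver code:

```rust
use std::io;
use std::process::{Child, ChildStdin};
use std::sync::{Arc, Mutex};

// Sketch only: the real struct would also own ChildStdout/ChildStderr
// and the async-pipe machinery from this PR.
struct WalRedoProcess {
    child: Mutex<Child>,
    stdin: Mutex<ChildStdin>,
}

impl WalRedoProcess {
    fn launch() -> io::Result<Arc<Self>> {
        // Placeholder: spawn the walredo child process and wire up stdio.
        unimplemented!()
    }

    fn apply_wal_records(&self) -> io::Result<Vec<u8>> {
        // Placeholder: write the request to stdin, read the reconstructed page.
        unimplemented!()
    }

    fn kill_and_wait(&self) {
        // Best-effort: the process may already be gone.
        let mut child = self.child.lock().unwrap();
        let _ = child.kill();
        let _ = child.wait();
    }
}

struct PostgresRedoManager {
    process: Mutex<Option<Arc<WalRedoProcess>>>,
}

impl PostgresRedoManager {
    fn apply_batch_postgres(&self) -> io::Result<Vec<u8>> {
        let proc = {
            let mut guard = self.process.lock().unwrap();
            if guard.is_none() {
                // No process yet (first call, or the previous one was killed).
                *guard = Some(WalRedoProcess::launch()?);
            }
            Arc::clone(guard.as_ref().unwrap())
        }; // mutex dropped here; apply_wal_records runs without holding it

        let result = proc.apply_wal_records();

        if result.is_err() {
            // Re-acquire the mutex; only .take() the process out if it is
            // still the same one we used, since another caller may already
            // have replaced it after hitting its own error.
            let mut guard = self.process.lock().unwrap();
            if guard.as_ref().is_some_and(|cur| Arc::ptr_eq(cur, &proc)) {
                guard.take();
            }
            drop(guard); // don't hold the mutex across the blocking kill/wait
            proc.kill_and_wait();
        }
        result
    }
}
```

Here the `Arc::ptr_eq` identity check plays the role that fd-number comparison plays in the current implementation: it detects whether the process has already been restarted.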
I can do the impl myself, but I'm busy with other stuff. So, I'd ask you to spend some time pursuing this, or another less brittle approach.
Question on Slack: …

No, you would drop the …
We don't need …
Thinking about something like this: …
Also, frankly speaking, I do not consider handling of walredo process crashes a very important task. There are two different aspects of walredo crashes: …
@problame, to me your request looks like a follow-up refactoring, because at this moment it is incorporated into the existing solution. I don't think we need to mix these two changes.
@knizhnik I had already forgotten that there was a …
All I'm saying is that I find the dance with the file descriptors brittle.
So, what I proposed was to replace the pretty subtle reliance on FD number reuse to detect walredo process restarts with a more robust & idiomatic mechanism. That mechanism is what I outlined in comment #3368 (review) and the subsequent comment.
If you don't want to go that extra mile, simply pull in the comments from #3389 that document the brittleness, and I'll get out of the way here.
I do not see anything criminal in the "dance with the file descriptors": it seems quite obvious that a file descriptor cannot be reused until it is closed, so it can be used as a "generation".
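To spell out the fd-as-generation idea under discussion, a hypothetical illustration (the function name and shape are invented here, not taken from the PR):

```rust
use std::os::unix::io::{AsRawFd, RawFd};
use std::process::ChildStdin;

// As long as the old descriptor stays open while calls are in flight,
// the kernel cannot hand its number to a replacement process. So if the
// fd number changed since we started our call, someone else has already
// restarted walredo, and we must not kill the fresh process.
fn safe_to_kill_after_error(stdin: &ChildStdin, fd_observed_before_call: RawFd) -> bool {
    stdin.as_raw_fd() == fd_observed_before_call
}
```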
@problame can we convert your suggestion into follow up task? |
You're referring to …
It very likely will not be the bottleneck, and if it turns out that it is, we can use a mechanism other than Mutex that exploits the read-mostly nature of the data.
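One read-mostly mechanism of that kind, as a sketch (my illustration only; nothing this concrete was proposed in the thread): an RwLock lets the many concurrent apply calls share a cheap read lock, with the exclusive lock taken only on launch or restart.

```rust
use std::sync::{Arc, RwLock};

struct WalRedoProcess; // stub standing in for the type sketched earlier

struct PostgresRedoManager {
    // Readers on the hot path take a shared lock; exclusive access is
    // needed only when launching or killing the process.
    process: RwLock<Option<Arc<WalRedoProcess>>>,
}

impl PostgresRedoManager {
    fn current_process(&self) -> Option<Arc<WalRedoProcess>> {
        // Read lock held only long enough to clone the Arc.
        self.process.read().unwrap().as_ref().map(Arc::clone)
    }
}
```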
Then just have separate directories for each WalRedoProcess. No big deal.
It seems like I can't convince you there.
@vadim2404, sure, although I fear it won't have priority, will die in the backlog, and at some point in the future someone to whom the file descriptor dance isn't obvious will introduce a subtle bug. Anyways, @knizhnik, what do you think of the suggested compromise?
Yes, I agree that most likely it will not be a bottleneck. But I still prefer to minimize the number of sync calls.
It is not a big deal, but also not so trivial. How are you going to guarantee uniqueness of the directory?
But the main problem is that we really need to perform cleanup. It can happen that after abnormal pageserver termination an old walredo directory is left on disk. We cannot leave it there: it is just a waste of space (although not so much space). Actually, correctly addressing all these "not a big deal" issues can significantly complicate the code and make it even more error-prone.
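For illustration only, one hypothetical way to answer both questions: PID-based directory names for uniqueness, plus a startup sweep for cleanup. Neither is what the thread settled on, and the PID-recycling caveat in the comments shows why knizhnik calls this non-trivial.

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Hypothetical: derive the temp dir name from the child PID, so
// uniqueness among live walredo processes comes from the kernel.
fn walredo_dir(base: &Path, child_pid: u32) -> PathBuf {
    base.join(format!("walredo-{child_pid}"))
}

// Hypothetical startup sweep: remove leftover dirs from a previous
// (possibly crashed) pageserver run whose owning process is gone.
// Caveat: PIDs can be recycled, so a real implementation would need a
// stronger liveness check than /proc existence.
fn cleanup_stale_walredo_dirs(base: &Path) -> std::io::Result<()> {
    for entry in fs::read_dir(base)? {
        let entry = entry?;
        let name = entry.file_name();
        let Some(pid) = name
            .to_str()
            .and_then(|n| n.strip_prefix("walredo-"))
            .and_then(|p| p.parse::<u32>().ok())
        else {
            continue;
        };
        if !Path::new(&format!("/proc/{pid}")).exists() {
            fs::remove_dir_all(entry.path())?;
        }
    }
    Ok(())
}
```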
I approved this PR. Will you merge it yourself, or do you want me to do it?
Will rebase it and merge it.
Valid concern that it would take some time.
It's a different kind of danger, though. If someone messes up the file descriptors, it might result in information leakage across tenants. VERY BAD. Whereas a bug that fails to clean up a stale walredo temp dir results in an out-of-disk-space outage at worst, and can be trivially mitigated. Not so bad.
Retracting my 'Request changes' vote to unblock this.
Created follow-up ticket: #3459
… and receiving response (#3368)
Co-authored-by: Christian Schwarz <christian@neon.tech>