Implement asynchronous pipe for communication with walredo process #3368

Merged
merged 5 commits into from
Jan 27, 2023

Conversation

knizhnik
Contributor

No description provided.

@knizhnik knizhnik requested review from a team as code owners January 17, 2023 16:07
@knizhnik knizhnik requested review from funbringer and SomeoneToIgnore and removed request for a team January 17, 2023 16:07
@knizhnik
Contributor Author

bench_walredo:
main:

short/short/1           time:   [11.805 µs 11.987 µs 12.199 µs]
short/short/2           time:   [41.705 µs 42.227 µs 42.769 µs]
short/short/4           time:   [87.145 µs 87.749 µs 88.359 µs]
short/short/8           time:   [197.73 µs 198.69 µs 199.76 µs]
short/short/16          time:   [433.69 µs 435.78 µs 438.27 µs]

medium/medium/1         time:   [78.474 µs 80.196 µs 82.222 µs]
medium/medium/2         time:   [236.36 µs 243.76 µs 251.36 µs]
medium/medium/4         time:   [492.11 µs 506.39 µs 520.77 µs]
medium/medium/8         time:   [1.4873 ms 1.5360 ms 1.5864 ms]
medium/medium/16        time:   [2.9318 ms 3.0007 ms 3.0802 ms]

async_pipe:

short/short/1           time:   [12.906 µs 13.087 µs 13.298 µs]
short/short/2           time:   [30.255 µs 30.772 µs 31.294 µs]
short/short/4           time:   [58.226 µs 58.738 µs 59.269 µs]
short/short/8           time:   [121.36 µs 122.26 µs 123.25 µs]
short/short/16          time:   [261.18 µs 262.33 µs 263.68 µs]

medium/medium/1         time:   [81.226 µs 83.568 µs 86.293 µs]
medium/medium/2         time:   [226.95 µs 234.89 µs 242.94 µs]
medium/medium/4         time:   [458.42 µs 476.58 µs 494.95 µs]
medium/medium/8         time:   [1.1723 ms 1.1832 ms 1.1941 ms]
medium/medium/16        time:   [2.3780 ms 2.4040 ms 2.4304 ms]

shmempipe_rs:

short/short/1           time:   [5.0877 µs 5.1564 µs 5.2430 µs]
short/short/2           time:   [24.983 µs 25.335 µs 25.697 µs]
short/short/4           time:   [46.858 µs 47.352 µs 47.844 µs]
short/short/8           time:   [85.792 µs 86.275 µs 86.800 µs]
short/short/16          time:   [188.34 µs 190.89 µs 193.76 µs]

medium/medium/1         time:   [88.635 µs 89.152 µs 89.934 µs]
medium/medium/2         time:   [186.07 µs 193.70 µs 202.51 µs]
medium/medium/4         time:   [450.33 µs 477.85 µs 506.00 µs]
medium/medium/8         time:   [1.1779 ms 1.1894 ms 1.2007 ms]
medium/medium/16        time:   [2.3209 ms 2.3447 ms 2.3691 ms]

Ketteq: Q1

main: 20372.712 ms
async_pipe: 11989.555 ms
shmempipe_rs:  10473.423 ms

@koivunej
Contributor

I get similar results for microbenches.

this PR:

short/short/1           time:   [7.6638 µs 7.7110 µs 7.7622 µs]
short/short/2           time:   [29.826 µs 30.194 µs 30.535 µs]
short/short/4           time:   [42.652 µs 43.155 µs 43.700 µs]
short/short/8           time:   [110.86 µs 112.77 µs 114.62 µs]
short/short/16          time:   [256.57 µs 257.44 µs 258.32 µs]

medium/medium/1         time:   [66.224 µs 66.545 µs 66.866 µs]
medium/medium/2         time:   [163.19 µs 163.64 µs 164.09 µs]
medium/medium/4         time:   [284.57 µs 284.94 µs 285.33 µs]
medium/medium/8         time:   [547.03 µs 549.37 µs 551.78 µs]
medium/medium/16        time:   [1.0768 ms 1.0839 ms 1.0913 ms]

shmempipe_rs:

short/short/1           time:   [4.6481 µs 4.6662 µs 4.6929 µs]
short/short/2           time:   [20.736 µs 20.812 µs 20.890 µs]
short/short/4           time:   [32.380 µs 32.569 µs 32.781 µs]
short/short/8           time:   [71.833 µs 72.212 µs 72.559 µs]
short/short/16          time:   [148.82 µs 149.77 µs 150.72 µs]

medium/medium/1         time:   [64.319 µs 64.424 µs 64.532 µs]
medium/medium/2         time:   [140.93 µs 141.34 µs 141.82 µs]
medium/medium/4         time:   [266.33 µs 266.82 µs 267.34 µs]
medium/medium/8         time:   [509.48 µs 511.22 µs 513.18 µs]
medium/medium/16        time:   [978.29 µs 986.37 µs 994.89 µs]

@koivunej koivunej left a comment
Contributor

I think this looks good. I am having a hard time reading those index-into-VecDeque operations, but this is certainly a less-unsafe solution than shmempipe, with some added OLAP performance, and it passes the tests right away.

@vadim2404
Contributor

The difference between this solution and Shmempipe is not so big in terms of performance. However, this is much simpler than Shmempipe. Let’s go with this one!

@knizhnik
Contributor Author

knizhnik commented Jan 17, 2023

OLTP workload (pgbench).
I used the following script:

pgbench -i -s 100 "host=localhost port=55432 user=cloud_admin dbname=postgres"
pgbench -c 10 -T 600 -N -M prepared "host=localhost port=55432 user=cloud_admin dbname=postgres"
sleep 60
pgbench -c 10 -T 100 -S -M prepared "host=localhost port=55432 user=cloud_admin dbname=postgres"
pgbench -c 1 -T 100 -S -M prepared "host=localhost port=55432 user=cloud_admin dbname=postgres"

shmempipe_rs:

pgbench -i -s 100 
done in 50.40 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 7.44 s, vacuum 24.66 s, primary keys 18.29 s)
pgbench -c 10 -T 600 -N -M prepared 
tps = 4691
pgbench -c 10 -T 100 -S -M prepared
tps = 18021
pgbench -c 1 -T 100 -S -M prepared
tps = 3885

main:

pgbench -i -s 100 
done in 68.74 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 7.90 s, vacuum 34.07 s, primary keys 26.76 s).
pgbench -c 10 -T 600 -N -M prepared 
tps = 4185
pgbench -c 10 -T 100 -S -M prepared
tps =  17912
pgbench -c 1 -T 100 -S -M prepared 
tps = 3886

async_pipe:

pgbench -i -s 100 
done in 67.62 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 7.37 s, vacuum 33.95 s, primary keys 26.29 s).
pgbench -c 10 -T 600 -N -M prepared 
tps = 4471
pgbench -c 10 -T 100 -S -M prepared
tps = 18197
pgbench -c 1 -T 100 -S -M prepared 
tps = 4303

@knizhnik
Contributor Author

knizhnik commented Jan 17, 2023

So, on OLAP queries with parallel seqscan and prefetch, async_pipe provides roughly a 2x speedup compared with main, as does shmem_pipe.
On the OLTP workload there is almost no difference from main, although according to the metrics WAL redo takes significantly less time.
For a 100-second read-only pgbench iteration with 10 clients, the metrics are the following:

main:

page_reconstruction_time: 242
wal_redo_time: 58
wal_redo_wait_time: 175

async_pipe:

page_reconstruction_time: 189
wal_redo_time: 90
wal_redo_wait_time: 92

My hypothesis is that the large number of concurrent tasks compensates for this difference in page reconstruction time:
if some backend is waiting for a response from the pageserver, there are still other tasks the system can perform, so the size of the delay doesn't have a big influence on throughput.
And with a single client there are no concurrent requests to walredo, so the old pipe works at the same speed as async_pipe.

@problame problame left a comment
Contributor

Reviewed the core logic in this PR, especially the business around detecting a restarted walredo process while other apply_wal_records calls are still going on.

I wrote down my current understanding as comments on this PR's branch: #3389
Please review that I understood things correctly, then pull it in.

I'm obviously pleased with the perf results, but I think this implementation is very brittle.
I think we can do better than this.
Specifically, how about we introduce a new struct WalRedoProcess and move the apply_wal_records method there.
When we launch the walredo process, we heap-allocate the WalRedoProcess struct using Arc.
PostgresRedoManager stores the current walredo process in a Mutex<Option<Arc<WalRedoProcess>>>.
apply_batch_postgres grabs that mutex briefly to check whether there is already a walredo process.
If there is, Arc::clone it, drop mutex, call apply_wal_records.
If apply_wal_records comes back with an error:

  1. re-acquire the mutex
  2. if it's still the same WalRedoProcess, .take() it out, drop mutex, and kill_and_wait it

Advantage with this:

  1. overall more idiomatic (my opinion)
  2. we don't need to rely on libc not recycling file descriptors. We can definitely rely on that, but I think to most programmers, it's a non-obvious interaction, whereas Mutex<Option<..>> is straight-forward.
  3. Lifetimes of the file stdio filedescriptors is idiomatic and straight-forward. The WalRedoProcess will hold all the ChildStd{in,out,err}. And apply_wal_records can use as_raw_fd() on them safely, since it's a method on the WalRedoProcess, so, it's guaranteed that the ChildStd{in,out,err} won't be dropped while apply_wal_records is still running.

I can do the impl myself, but I'm busy with other stuff. So, I'd ask you to spend some time pursuing this, or another less brittle approach.
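The restart-detection flow proposed above can be sketched in a few lines. This is a hypothetical, self-contained illustration, not the PR's code: WalRedoProcess here is a stub whose apply_wal_records always fails so the error path is exercised, and only the Mutex<Option<Arc<...>>> dance is the point.

```rust
use std::sync::{Arc, Mutex};

// Stub standing in for the real walredo process wrapper.
struct WalRedoProcess;

impl WalRedoProcess {
    fn launch() -> Arc<Self> {
        Arc::new(WalRedoProcess)
    }
    // In the real code this would kill the child and reap it.
    fn kill_and_wait(&self) {}
    // Stubbed to always fail, simulating a crashed walredo process.
    fn apply_wal_records(&self) -> Result<(), ()> {
        Err(())
    }
}

struct PostgresRedoManager {
    redo_process: Mutex<Option<Arc<WalRedoProcess>>>,
}

impl PostgresRedoManager {
    fn apply_batch(&self) -> Result<(), ()> {
        // Hold the mutex only briefly: launch if absent, clone the Arc, drop the lock.
        let redo_proc = {
            let mut guard = self.redo_process.lock().unwrap();
            Arc::clone(guard.get_or_insert_with(WalRedoProcess::launch))
        };

        let result = redo_proc.apply_wal_records();

        if result.is_err() {
            // Re-acquire the mutex; remove the process only if it is still the
            // same instance (a concurrent caller may have replaced it already).
            let mut guard = self.redo_process.lock().unwrap();
            let same = guard.as_ref().map_or(false, |cur| Arc::ptr_eq(cur, &redo_proc));
            if same {
                let stale = guard.take().unwrap();
                drop(guard); // don't hold the lock while killing
                stale.kill_and_wait();
            }
        }
        result
    }
}

fn main() {
    let mgr = PostgresRedoManager { redo_process: Mutex::new(None) };
    assert!(mgr.apply_batch().is_err());
    // The failed process was taken out, so a later call would launch a fresh one.
    assert!(mgr.redo_process.lock().unwrap().is_none());
}
```

Arc::ptr_eq is what makes "still the same WalRedoProcess" cheap to check without any file-descriptor comparison.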

@problame
Contributor

Question on Slack:

I am a bit unsure how can it be done with Mutex<Option<Arc> .... would you periodically take this one mutex to access fd's or ... what? (edited)

No, you would drop the Mutex after Arc::clone'ing the WalRedoProcess.

We don't need ProcessInput and ProcessOutput with RawFd's anymore.

Thinking about something like this:

pub struct PostgresRedoManager {
    tenant_id: TenantId,
    conf: &'static PageServerConf,
    redo_process: Mutex<Option<Arc<WalRedoProcess>>>,
}

struct WalRedoProcess {
    child: NoLeakChild,

    // stderr is used by both tx and rx. Rule is: must hold either the tx or rx mutex before this one.
    stderr: Mutex<ChildStderr>,

    tx_mutex: Mutex<ChildStdin>,

    // rx-side state, guarded by a single mutex.
    rx_mutex: Mutex<RxState>,
}

struct RxState {
    stdout: ChildStdout,
    pending_responses: VecDeque<Option<Bytes>>,
    n_processed_responses: usize,
}

@knizhnik
Contributor Author

There was a WalRedoProcess before this PR. Then I split it into ProcessInput and ProcessOutput.
Frankly speaking, I do not understand the motivation to resurrect it once again and use Arc for it.

Also, frankly speaking, I do not consider handling of walredo process crashes a very important task. There are two different aspects of walredo crashes:

  1. If they are caused by a bug in Neon, then it seems to indicate some very serious problem. It is not enough to just restart the process and pretend nothing happened. It would be better to provide all the information necessary to understand the actual source of the problem. There were several crashes of the walredo process, mostly caused by incorrect WAL record sequences sent to it (some records were missed because of bugs in the layer map).
  2. walredo is considered an untrusted process. Some malicious WAL record could crack this process and take full control over it. seccomp restricts the set of allowed syscalls and makes the process less harmful. But it can still send arbitrary garbage to the pageserver, crash, etc. We should guarantee that this will not affect other tenants served by the pageserver.

@vadim2404
Contributor

@problame, to me your request looks like follow-up refactoring, since the functionality has already been incorporated into the existing solution. I don't think we need to mix these two changes.

@problame
Contributor

problame commented Jan 23, 2023

@knizhnik I had already forgotten that there was a WalRedoProcess in the past.
And I think, because I reused that name, you misread my proposal.

All I'm saying is that I find the dance with the as_raw_fd()s and ProcessInput & ProcessOutput relatively brittle.
The comments I added in #3389 should make that clear.

So, what I proposed was to replace the pretty subtle reliance on FD number reuse to detect walredo process restarts with a more robust & idiomatic mechanism. That mechanism is what I outlined in comment #3368 (review) and the subsequent comment.

If you don't want to go that extra mile, simply pull in the comments from #3389 that document the brittleness and I'll get out of the way here.

@knizhnik
Contributor Author

I do not see anything criminal in the "dance with the file descriptors": it seems quite obvious that a file descriptor cannot be reused until it is closed, so it can be used as a "generation".
What I do not like in your proposal:

  1. The original apply_batch_postgres locks one mutex, my implementation locks two, and yours locks three. I am not sure this is a bottleneck, but I still prefer to minimize the number of sync calls.
  2. As far as I understand, in your case it will be possible to have two instances of the walredo process at the same time (because of Arc). It may cause unexpected effects in the walredo process, because they will share the same working directory and one of them may try to remove it while the other continues to access files. Once again, I am not sure that it will really cause problems, because one of them is expected to be in a zombie state at that moment. But still...
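For what it's worth, both halves of the debate, the "an open descriptor's number cannot be handed out again" property and the recycling-after-close that the proposal worries about, can be demonstrated in a few lines (illustration only; Unix-specific, with /dev/null standing in for the walredo pipes):

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn main() {
    // Treat the raw fd number as the "generation".
    let first = File::open("/dev/null").unwrap();
    let gen1 = first.as_raw_fd();

    // While `first` is still open, its number cannot be handed out again:
    let second = File::open("/dev/null").unwrap();
    assert_ne!(gen1, second.as_raw_fd());

    // But as soon as it is closed, POSIX open() returns the lowest free
    // descriptor, so the number is recycled immediately:
    drop(first);
    let third = File::open("/dev/null").unwrap();
    assert_eq!(gen1, third.as_raw_fd());
}
```

So the scheme is sound exactly as long as the ChildStd{in,out,err} handles are kept alive for the comparison, which is the subtle invariant under discussion.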

@vadim2404
Contributor

@problame can we convert your suggestion into follow up task?

@problame
Contributor

The original apply_batch_postgres locks one mutex, my implementation locks two, and yours locks three. I am not sure this is a bottleneck, but I still prefer to minimize the number of sync calls.

You're referring to

    redo_process: Mutex<Option<Arc<WalRedoProcess>>>,

It very likely will not be the bottleneck, and if it turns out that it is, we can use another mechanism than Mutex that exploits the read-mostly nature of the data.
Maybe RwLock, maybe some fancy atomic pointer swap.

As far as I understand, in your case it will be possible to have two instances of the walredo process at the same time (because of Arc).

Then just have separate directories for each WalRedoProcess. No big deal.

I do not see anything criminal in the "dance with the file descriptors": it seems quite obvious that a file descriptor cannot be reused until it is closed, so it can be used as a "generation".

It seems like I can't convince you there.
So, all we can do is agree to disagree.
I think pulling in my comments from #3389 is the minimal compromise here.

@problame can we convert your suggestion into follow up task?

@vadim2404, sure, although I fear it won't have priority, will die in the backlog, and at some point in the future someone to whom the file descriptor dance isn't obvious will introduce a subtle bug.
IMO it's better to get such things right the first time.

Anyways, @knizhnik , what do you think of the suggested compromise?

@knizhnik
Contributor Author

It very likely will not be the bottleneck, and if it turns out that it is, we can use another mechanism than Mutex that exploits the read-mostly nature of the data. Maybe RwLock, maybe some fancy atomic pointer swap.

Yes, I agree that most likely it will not be a bottleneck. But I still prefer to minimize the number of sync calls.
Concerning the suggestion to replace the mutex with RwLock: it may reduce the probability of contention, but not the overhead of the syscall itself.

Then just have separate directories for each WalRedoProcess. No big deal.

It is not a big deal, but also not so trivial. How are you going to ensure uniqueness of the directory?

  • Use some auto-incremented counter to generate the directory name? Where will it be stored?
  • Include a timestamp in the directory name? With what resolution?
  • Probe whether the directory already exists?
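One hypothetical answer to the uniqueness question, sketched only to show the shape of the trade-off (none of these names exist in the PR): combine the pageserver PID with a process-wide atomic counter. The counter distinguishes concurrent walredo processes within one pageserver; the embedded PID lets a startup sweep identify directories left behind by a dead pageserver.

```rust
use std::path::PathBuf;
use std::process;
use std::sync::atomic::{AtomicUsize, Ordering};

// Process-wide counter distinguishing concurrent walredo processes.
static WALREDO_DIR_COUNTER: AtomicUsize = AtomicUsize::new(0);

// Hypothetical helper: "<base>/walredo-<pageserver pid>-<counter>".
// A cleanup pass at startup could remove any walredo-* directory whose
// embedded PID is not the current pageserver's.
fn next_walredo_dir(base: &str) -> PathBuf {
    let n = WALREDO_DIR_COUNTER.fetch_add(1, Ordering::Relaxed);
    PathBuf::from(base).join(format!("walredo-{}-{}", process::id(), n))
}

fn main() {
    let a = next_walredo_dir("/tmp/pageserver");
    let b = next_walredo_dir("/tmp/pageserver");
    assert_ne!(a, b); // the counter keeps concurrent processes' dirs distinct
    assert!(a.starts_with("/tmp/pageserver"));
}
```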

But the main problem is that we really need to perform cleanup. It can happen that after abnormal pageserver termination an old walredo directory is left on disk. We cannot leave it there: it is just a waste of space (although not so much space).
But when can we delete it? Right now it is done at walredo process start. But if there is more than one walredo process running at the same time (even if one of them is a zombie), that can be dangerous.

Actually, correctly addressing all these "not a big deal" issues can significantly complicate the code and make it even more error-prone.
This is why I do not like the proposed solution. I think the "file descriptor dance" is actually less dangerous.

I think pulling in my comments from #3389 is the minimal compromise here.

I approved this PR. Will you merge them yourself or you want me to do it?

@problame
Contributor

I approved this PR. Will you merge them yourself or you want me to do it?

Will rebase it and merge it.

Actually, correctly addressing all these "not a big deal" issues can significantly complicate the code and make it even more error-prone.

Valid concern that it would take some time.
Let's punt it into a follow-up issue, then.

This is why I do not like the proposed solution. I think that "file descriptor dance" is actually less dangerous.

It's different kinds of danger, though.

If someone messes up with the file descriptors, it might result in information leakage across tenants. VERY BAD.

Whereas, a bug that fails to clean up a stale walredo temp dir results in out-of-disk space outage at worst, and can be trivially mitigated. Not so bad.


@problame problame dismissed their stale review January 26, 2023 14:05

Retracting my 'Request changes' vote to unblock this.

@problame
Contributor

Created follow-up ticket: #3459
