Implement asynchronous pipe for communication with walredo process #3368

Merged
merged 5 commits into from
Jan 27, 2023

Conversation

knizhnik
Contributor

No description provided.

@knizhnik knizhnik requested review from a team as code owners January 17, 2023 16:07
@knizhnik knizhnik requested review from funbringer and SomeoneToIgnore and removed request for a team January 17, 2023 16:07
@knizhnik
Contributor Author

bench_walredo:
main:

short/short/1           time:   [11.805 µs 11.987 µs 12.199 µs]
short/short/2           time:   [41.705 µs 42.227 µs 42.769 µs]
short/short/4           time:   [87.145 µs 87.749 µs 88.359 µs]
short/short/8           time:   [197.73 µs 198.69 µs 199.76 µs]
short/short/16          time:   [433.69 µs 435.78 µs 438.27 µs]

medium/medium/1         time:   [78.474 µs 80.196 µs 82.222 µs]
medium/medium/2         time:   [236.36 µs 243.76 µs 251.36 µs]
medium/medium/4         time:   [492.11 µs 506.39 µs 520.77 µs]
medium/medium/8         time:   [1.4873 ms 1.5360 ms 1.5864 ms]
medium/medium/16        time:   [2.9318 ms 3.0007 ms 3.0802 ms]

async_pipe:

short/short/1           time:   [12.906 µs 13.087 µs 13.298 µs]
short/short/2           time:   [30.255 µs 30.772 µs 31.294 µs]
short/short/4           time:   [58.226 µs 58.738 µs 59.269 µs]
short/short/8           time:   [121.36 µs 122.26 µs 123.25 µs]
short/short/16          time:   [261.18 µs 262.33 µs 263.68 µs]

medium/medium/1         time:   [81.226 µs 83.568 µs 86.293 µs]
medium/medium/2         time:   [226.95 µs 234.89 µs 242.94 µs]
medium/medium/4         time:   [458.42 µs 476.58 µs 494.95 µs]
medium/medium/8         time:   [1.1723 ms 1.1832 ms 1.1941 ms]
medium/medium/16        time:   [2.3780 ms 2.4040 ms 2.4304 ms]

shmempipe_rs:

short/short/1           time:   [5.0877 µs 5.1564 µs 5.2430 µs]
short/short/2           time:   [24.983 µs 25.335 µs 25.697 µs]
short/short/4           time:   [46.858 µs 47.352 µs 47.844 µs]
short/short/8           time:   [85.792 µs 86.275 µs 86.800 µs]
short/short/16          time:   [188.34 µs 190.89 µs 193.76 µs]

medium/medium/1         time:   [88.635 µs 89.152 µs 89.934 µs]
medium/medium/2         time:   [186.07 µs 193.70 µs 202.51 µs]
medium/medium/4         time:   [450.33 µs 477.85 µs 506.00 µs]
medium/medium/8         time:   [1.1779 ms 1.1894 ms 1.2007 ms]
medium/medium/16        time:   [2.3209 ms 2.3447 ms 2.3691 ms]

Ketteq: Q1

main: 20372.712 ms
async_pipe: 11989.555 ms
shmempipe_rs:  10473.423 ms

@koivunej
Contributor

I get similar results for microbenches.

this PR:

short/short/1           time:   [7.6638 µs 7.7110 µs 7.7622 µs]
short/short/2           time:   [29.826 µs 30.194 µs 30.535 µs]
short/short/4           time:   [42.652 µs 43.155 µs 43.700 µs]
short/short/8           time:   [110.86 µs 112.77 µs 114.62 µs]
short/short/16          time:   [256.57 µs 257.44 µs 258.32 µs]

medium/medium/1         time:   [66.224 µs 66.545 µs 66.866 µs]
medium/medium/2         time:   [163.19 µs 163.64 µs 164.09 µs]
medium/medium/4         time:   [284.57 µs 284.94 µs 285.33 µs]
medium/medium/8         time:   [547.03 µs 549.37 µs 551.78 µs]
medium/medium/16        time:   [1.0768 ms 1.0839 ms 1.0913 ms]

shmempipe_rs:

short/short/1           time:   [4.6481 µs 4.6662 µs 4.6929 µs]
short/short/2           time:   [20.736 µs 20.812 µs 20.890 µs]
short/short/4           time:   [32.380 µs 32.569 µs 32.781 µs]
short/short/8           time:   [71.833 µs 72.212 µs 72.559 µs]
short/short/16          time:   [148.82 µs 149.77 µs 150.72 µs]

medium/medium/1         time:   [64.319 µs 64.424 µs 64.532 µs]
medium/medium/2         time:   [140.93 µs 141.34 µs 141.82 µs]
medium/medium/4         time:   [266.33 µs 266.82 µs 267.34 µs]
medium/medium/8         time:   [509.48 µs 511.22 µs 513.18 µs]
medium/medium/16        time:   [978.29 µs 986.37 µs 994.89 µs]

@koivunej koivunej left a comment
Contributor

I think this looks good. I am having a hard time reading those index-into-VecDeque operations, but this is certainly a less-unsafe solution than shmempipe, with some added OLAP performance, and it passes the tests right away.

@vadim2404
Contributor

The difference between this solution and Shmempipe is not so big in terms of performance. However, this is much simpler than Shmempipe. Let’s go with this one!

@knizhnik
Contributor Author

knizhnik commented Jan 17, 2023

OLTP workload (pgbench).
I used the following script:

pgbench -i -s 100 "host=localhost port=55432 user=cloud_admin dbname=postgres"
pgbench -c 10 -T 600 -N -M prepared "host=localhost port=55432 user=cloud_admin dbname=postgres"
sleep 60
pgbench -c 10 -T 100 -S -M prepared "host=localhost port=55432 user=cloud_admin dbname=postgres"
pgbench -c 1 -T 100 -S -M prepared "host=localhost port=55432 user=cloud_admin dbname=postgres"

shmempipe_rs:

pgbench -i -s 100 
done in 50.40 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 7.44 s, vacuum 24.66 s, primary keys 18.29 s)
pgbench -c 10 -T 600 -N -M prepared 
tps = 4691
pgbench -c 10 -T 100 -S -M prepared
tps = 18021
pgbench -c 1 -T 100 -S -M prepared
tps = 3885

main:

pgbench -i -s 100 
done in 68.74 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 7.90 s, vacuum 34.07 s, primary keys 26.76 s).
pgbench -c 10 -T 600 -N -M prepared 
tps = 4185
pgbench -c 10 -T 100 -S -M prepared
tps =  17912
pgbench -c 1 -T 100 -S -M prepared 
tps = 3886

async_pipe:

pgbench -i -s 100 
done in 67.62 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 7.37 s, vacuum 33.95 s, primary keys 26.29 s).
pgbench -c 10 -T 600 -N -M prepared 
tps = 4471
pgbench -c 10 -T 100 -S -M prepared
tps = 18197
pgbench -c 1 -T 100 -S -M prepared 
tps = 4303

@knizhnik
Contributor Author

knizhnik commented Jan 17, 2023

So, on OLAP queries with parallel seqscan and prefetch, async_pipe provides roughly a 2x speedup compared with main, as does shmem_pipe.
On the OLTP workload there is almost no difference from main, although according to the metrics WAL redo takes significantly less time.
For a 100-second read-only pgbench iteration with 10 clients, the metrics are the following:

main:

page_reconstruction_time: 242
wal_redo_time: 58
wal_redo_wait_time: 175

async_pipe:

page_reconstruction_time: 189
wal_redo_time: 90
wal_redo_wait_time: 92

My hypothesis is that the large number of concurrent tasks compensates for this difference in page reconstruction time:
if some backend is waiting for a response from the pageserver, there are still other tasks the system can perform, so the size of the delay doesn't have a big influence on throughput.
And with a single client there are no concurrent requests to walredo, so the old pipe works at the same speed as async_pipe.

@problame problame left a comment
Contributor

Reviewed the core logic in this PR, especially the business around detecting a restarted walredo process while other apply_wal_records calls are still going on.

I wrote down my current understanding as comments on this PR's branch: #3389
Please review that I understood things correctly, then pull it in.

I'm obviously pleased with the perf results, but I think this implementation is very brittle.
I think we can do better than this.
Specifically, how about we introduce a new struct WalRedoProcess and move the apply_wal_records method there.
When we launch the walredo process, we heap-allocate the WalRedoProcess struct using Arc.
PostgresRedoManager stores the current walredo process in a Mutex<Option<Arc<WalRedoProcess>>>.
apply_batch_postgres grabs that mutex briefly to check whether there is already a walredo process.
If there is, Arc::clone it, drop mutex, call apply_wal_records.
If apply_wal_records comes back with an error:

  1. re-acquire the mutex
  2. if it's still the same WalRedoProcess, .take() it out, drop mutex, and kill_and_wait it

Advantage with this:

  1. overall more idiomatic (my opinion)
  2. we don't need to rely on libc not recycling file descriptors. We can definitely rely on that, but I think to most programmers, it's a non-obvious interaction, whereas Mutex<Option<..>> is straight-forward.
  3. Lifetimes of the file stdio filedescriptors is idiomatic and straight-forward. The WalRedoProcess will hold all the ChildStd{in,out,err}. And apply_wal_records can use as_raw_fd() on them safely, since it's a method on the WalRedoProcess, so, it's guaranteed that the ChildStd{in,out,err} won't be dropped while apply_wal_records is still running.

I can do the impl myself, but I'm busy with other stuff. So, I'd ask you to spend some time pursuing this, or another less brittle approach.
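The restart-detection flow proposed above can be sketched in a few lines. This is a hypothetical, self-contained illustration, not the PR's code: WalRedoProcess here is a stub whose apply_wal_records always fails so the error path is exercised, and only the Mutex<Option<Arc<...>>> dance is the point.

```rust
use std::sync::{Arc, Mutex};

// Stub standing in for the real walredo process wrapper.
struct WalRedoProcess;

impl WalRedoProcess {
    fn launch() -> Arc<Self> {
        Arc::new(WalRedoProcess)
    }
    // In the real code this would kill the child and reap it.
    fn kill_and_wait(&self) {}
    // Stubbed to always fail, simulating a crashed walredo process.
    fn apply_wal_records(&self) -> Result<(), ()> {
        Err(())
    }
}

struct PostgresRedoManager {
    redo_process: Mutex<Option<Arc<WalRedoProcess>>>,
}

impl PostgresRedoManager {
    fn apply_batch(&self) -> Result<(), ()> {
        // Hold the mutex only briefly: launch if absent, clone the Arc, drop the lock.
        let redo_proc = {
            let mut guard = self.redo_process.lock().unwrap();
            Arc::clone(guard.get_or_insert_with(WalRedoProcess::launch))
        };

        let result = redo_proc.apply_wal_records();

        if result.is_err() {
            // Re-acquire the mutex; remove the process only if it is still the
            // same instance (a concurrent caller may have replaced it already).
            let mut guard = self.redo_process.lock().unwrap();
            let same = guard.as_ref().map_or(false, |cur| Arc::ptr_eq(cur, &redo_proc));
            if same {
                let stale = guard.take().unwrap();
                drop(guard); // don't hold the lock while killing
                stale.kill_and_wait();
            }
        }
        result
    }
}

fn main() {
    let mgr = PostgresRedoManager { redo_process: Mutex::new(None) };
    assert!(mgr.apply_batch().is_err());
    // The failed process was taken out, so a later call would launch a fresh one.
    assert!(mgr.redo_process.lock().unwrap().is_none());
}
```

Arc::ptr_eq is what makes "still the same WalRedoProcess" cheap to check without any file-descriptor comparison.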

@problame
Contributor

Question on Slack:

I am a bit unsure how can it be done with Mutex<Option<Arc> .... would you periodically take this one mutex to access fd's or ... what? (edited)

No, you would drop the Mutex after Arc::clone'ing the WalRedoProcess.

We don't need ProcessInput and ProcessOutput with RawFd's anymore.

Thinking about something like this:

pub struct PostgresRedoManager {
    tenant_id: TenantId,
    conf: &'static PageServerConf,
    redo_process: Mutex<Option<Arc<WalRedoProcess>>>,
}

struct WalRedoProcess {
    child: NoLeakChild,

    // stderr is used by both tx and rx. Rule is: must hold either the tx or rx mutex before this one.
    stderr: Mutex<ChildStderr>,

    tx_mutex: Mutex<ChildStdin>,

    // rx-side state, guarded by a single mutex.
    rx_mutex: Mutex<RxState>,
}

struct RxState {
    stdout: ChildStdout,
    pending_responses: VecDeque<Option<Bytes>>,
    n_processed_responses: usize,
}

@knizhnik
Contributor Author

There was a WalRedoProcess before this PR. Then I split it into ProcessInput and ProcessOutput.
Frankly speaking, I do not understand the motivation to resurrect it once again and use Arc for it.

Also, frankly speaking, I do not consider handling of walredo process crashes a very important task. There are two different aspects of walredo crashes:

  1. If they are caused by a bug in Neon, then it seems to indicate some very serious problem. It is not enough to just restart the process and pretend nothing happened. It would be better to provide all the information necessary to understand the actual source of the problem. There were several crashes of the walredo process, mostly caused by incorrect WAL record sequences sent to it (some records were missed because of bugs in the layer map).
  2. walredo is considered an untrusted process. Some malicious WAL record could crack this process and take full control over it. seccomp restricts the set of allowed syscalls and makes the process less harmful. But it can still send arbitrary garbage to the pageserver, crash, etc. We should guarantee that this will not affect other tenants served by the pageserver.

@vadim2404
Contributor

@problame, to me your request looks like follow-up refactoring, since the functionality has already been incorporated into the existing solution. I don't think we need to mix these two changes.

@problame
Contributor

problame commented Jan 23, 2023

@knizhnik I had already forgotten that there was a WalRedoProcess in the past.
And I think, because I reused that name, you misread my proposal.

All I'm saying is that I find the dance with the as_raw_fd()s and ProcessInput & ProcessOutput relatively brittle.
The comments I added in #3389 should make that clear.

So, what I proposed was to replace the pretty subtle reliance on FD number reuse to detect walredo process restarts with a more robust & idiomatic mechanism. That mechanism is what I outlined in comment #3368 (review) and the subsequent comment.

If you don't want to go that extra mile, simply pull in the comments from #3389 that document the brittleness and I'll get out of the way here.

@knizhnik
Contributor Author

I do not see anything criminal in the "dance with the file descriptors": it seems quite obvious that a file descriptor cannot be reused until it is closed, so it can be used as a "generation".
What I do not like in your proposal:

  1. The original apply_batch_postgres locks one mutex, my implementation locks two, and yours locks three. I am not sure this is a bottleneck, but I still prefer to minimize the number of sync calls.
  2. As far as I understand, in your case it will be possible to have two instances of the walredo process at the same time (because of Arc). It may cause unexpected effects in the walredo process, because they will share the same working directory and one of them may try to remove it while the other continues to access files. Once again, I am not sure that it will really cause problems, because one of them is expected to be in a zombie state at that moment. But still...
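For what it's worth, both halves of the debate, the "an open descriptor's number cannot be handed out again" property and the recycling-after-close that the proposal worries about, can be demonstrated in a few lines (illustration only; Unix-specific, with /dev/null standing in for the walredo pipes):

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn main() {
    // Treat the raw fd number as the "generation".
    let first = File::open("/dev/null").unwrap();
    let gen1 = first.as_raw_fd();

    // While `first` is still open, its number cannot be handed out again:
    let second = File::open("/dev/null").unwrap();
    assert_ne!(gen1, second.as_raw_fd());

    // But as soon as it is closed, POSIX open() returns the lowest free
    // descriptor, so the number is recycled immediately:
    drop(first);
    let third = File::open("/dev/null").unwrap();
    assert_eq!(gen1, third.as_raw_fd());
}
```

So the scheme is sound exactly as long as the ChildStd{in,out,err} handles are kept alive for the comparison, which is the subtle invariant under discussion.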

@vadim2404
Contributor

@problame can we convert your suggestion into follow up task?

@problame
Contributor

The original apply_batch_postgres locks one mutex, my implementation locks two, and yours locks three. I am not sure this is a bottleneck, but I still prefer to minimize the number of sync calls.

You're referring to

    redo_process: Mutex<Option<Arc<WalRedoProcess>>>,

It very likely will not be the bottleneck, and if it turns out that it is, we can use another mechanism than Mutex that exploits the read-mostly nature of the data.
Maybe RwLock, maybe some fancy atomic pointer swap.

As far as I understand, in your case it will be possible to have two instances of the walredo process at the same time (because of Arc).

Then just have separate directories for each WalRedoProcess. No big deal.

I do not see anything criminal in the "dance with the file descriptors": it seems quite obvious that a file descriptor cannot be reused until it is closed, so it can be used as a "generation".

It seems like I can't convince you there.
So, all we can do is agree to disagree.
I think pulling in my comments from #3389 is the minimal compromise here.

@problame can we convert your suggestion into follow up task?

@vadim2404, sure, although I fear it won't have priority, will die in the backlog, and at some point in the future someone to whom the file descriptor dance isn't obvious will introduce a subtle bug.
IMO it's better to get such things right the first time.

Anyways, @knizhnik , what do you think of the suggested compromise?

@knizhnik
Contributor Author

It very likely will not be the bottleneck, and if it turns out that it is, we can use another mechanism than Mutex that exploits the read-mostly nature of the data. Maybe RwLock, maybe some fancy atomic pointer swap.

Yes, I agree that most likely it will not be a bottleneck. But I still prefer to minimize the number of sync calls.
Concerning the suggestion to replace the mutex with RwLock: it may reduce the probability of contention, but not the overhead of the syscall itself.

Then just have separate directories for each WalRedoProcess. No big deal.

It is not a big deal, but also not so trivial. How are you going to ensure uniqueness of the directory?

  • Use some auto-incremented counter to generate the directory name? Where will it be stored?
  • Include a timestamp in the directory name? With what resolution?
  • Probe whether the directory already exists?
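One hypothetical answer to the uniqueness question, sketched only to show the shape of the trade-off (none of these names exist in the PR): combine the pageserver PID with a process-wide atomic counter. The counter distinguishes concurrent walredo processes within one pageserver; the embedded PID lets a startup sweep identify directories left behind by a dead pageserver.

```rust
use std::path::PathBuf;
use std::process;
use std::sync::atomic::{AtomicUsize, Ordering};

// Process-wide counter distinguishing concurrent walredo processes.
static WALREDO_DIR_COUNTER: AtomicUsize = AtomicUsize::new(0);

// Hypothetical helper: "<base>/walredo-<pageserver pid>-<counter>".
// A cleanup pass at startup could remove any walredo-* directory whose
// embedded PID is not the current pageserver's.
fn next_walredo_dir(base: &str) -> PathBuf {
    let n = WALREDO_DIR_COUNTER.fetch_add(1, Ordering::Relaxed);
    PathBuf::from(base).join(format!("walredo-{}-{}", process::id(), n))
}

fn main() {
    let a = next_walredo_dir("/tmp/pageserver");
    let b = next_walredo_dir("/tmp/pageserver");
    assert_ne!(a, b); // the counter keeps concurrent processes' dirs distinct
    assert!(a.starts_with("/tmp/pageserver"));
}
```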

But the main problem is that we really need to perform cleanup. It can happen that after abnormal pageserver termination an old walredo directory is left on disk. We cannot leave it there: it is just a waste of space (although not so much space).
But when can we delete it? Right now it is done at walredo process start. But if there is more than one walredo process running at the same time (even if one of them is a zombie), that can be dangerous.

Actually, correctly addressing all these "not a big deal" issues can significantly complicate the code and make it even more error-prone.
This is why I do not like the proposed solution. I think the "file descriptor dance" is actually less dangerous.

I think pulling in my comments from #3389 is the minimal compromise here.

I approved this PR. Will you merge them yourself or you want me to do it?

@problame
Contributor

I approved this PR. Will you merge them yourself or you want me to do it?

Will rebase it and merge it.

Actually, correctly addressing all these "not a big deal" issues can significantly complicate the code and make it even more error-prone.

Valid concern that it would take some time.
Let's punt it into a follow-up issue, then.

This is why I do not like the proposed solution. I think that "file descriptor dance" is actually less dangerous.

It's different kinds of danger, though.

If someone messes up with the file descriptors, it might result in information leakage across tenants. VERY BAD.

Whereas, a bug that fails to clean up a stale walredo temp dir results in out-of-disk space outage at worst, and can be trivially mitigated. Not so bad.


@problame problame dismissed their stale review January 26, 2023 14:05

Retracting my 'Request changes' vote to unblock this.

@problame
Contributor

Created follow-up ticket: #3459
