shmempipe in rust #3163
Conversation
Thinking about the security aspects of this a little: the postgres process is untrusted, so we have to assume that it could write arbitrary malicious data to the shared memory area.
Those parts look OK to me.
I have to say that my original C implementation of shmpipe also suffers from this problem. It was intended that the walredo process calls just two functions from libshmpipe.a to receive a request and send a response. But in any case we have some region of memory shared between the walredo and pageserver processes. A cracked walredo process can place any data in this segment, which means that the pageserver should treat the content of this region as absolutely untrusted. Any manipulation of pipe positions or sync primitives should be preceded by verification of the content of these structs. That is not so hard to do with the ring buffer positions, but how can I validate, for example, the content of a pthread_mutex and guarantee that a call to pthread_mutex_lock will not cause SIGSEGV? I am afraid that the same problem exists in the Rust version of shmpipe, especially if it is using some external crate.
Ouch, I hadn't thought of that! Perhaps we should use eventfd(2) for the notifications instead of pthreads. Not sure how the performance of that compares with pthread cond variables. And I guess eventfd is not available on macOS, but we could emulate it with a regular pipe there.
It's just a few array bounds checks in the right places, so I'm not worried about the performance here. But I'm afraid it rules out using any existing crates, and we really need to design the thing carefully. We should keep the unsafe primitives as small as possible so that they can be reviewed carefully, and build higher level abstractions on top of that.
Updated the description. This is actually slower than Konstantin's work, around 10us.
I guess lockups and thread starvation would be a better angle, but this is a good point :)
Even trying to find out what is supported and available on macOS (not to mention working) is tedious work. I was thinking we could just keep the old stdio-based implementation for compatibility.
Why does it rule out other crates? The index manipulations are here: https://docs.rs/ringbuf/latest/src/ringbuf/ring_buffer/base.rs.html#110-144, and if we look closer, this is all safe code. (Lately, from what I remember reading, removing bounds checks from Rust with unsafe has on many occasions made performance worse, so I wasn't even thinking that someone would do that. That might be because so many people use iterators, which allow better codegen than doing indexed accesses all over.)
If this is related to how I use unsafe all over with the
Lockups and starvation may affect only one tenant. Frankly speaking, I do not see an acceptable safe solution at this moment. Verification of the other shmpipe state variables (positions in the buffers) also has to be done with special care: since the content of the memory can be changed at any time (no matter whether we are in a critical section or not), it is necessary either to use atomic CAS, or to copy the state to local variables and work with those local copies.
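The local-copy idea above can be sketched as follows. This is a minimal hypothetical example (the `SharedIndices` layout and `CAPACITY` are assumptions, not the PR's actual code): the untrusted indices are loaded exactly once into locals, validated, and all later logic uses only that validated snapshot, never re-reading shared memory.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Capacity of the hypothetical ring buffer.
const CAPACITY: usize = 4096;

// Head/tail indices as they might live in the shared, untrusted segment.
struct SharedIndices {
    head: AtomicUsize,
    tail: AtomicUsize,
}

// Copy the untrusted indices once into locals, then validate the copies.
// The shared values can change at any time, so no decision may be based
// on re-reading them after validation.
fn snapshot_readable(shared: &SharedIndices) -> Option<usize> {
    let head = shared.head.load(Ordering::Acquire);
    let tail = shared.tail.load(Ordering::Acquire);

    // Reject anything a hostile writer could have planted out of range.
    if head >= CAPACITY || tail >= CAPACITY {
        return None;
    }
    // Distance within the ring; cannot overflow after the range check.
    Some((head + CAPACITY - tail) % CAPACITY)
}
```

The key property is that a torn or hostile update to the shared words after the two loads cannot affect the computation, because only the locals are used.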
We do have a global pool of threads, however, so starving any tokio worker would decrease their number; after 8, it would lock up page request handling.
That particular code looks safe. I'm still scared though. Look at
The untrusted process can set 'head' and 'tail' to anything, and with suitable values this integer arithmetic can overflow. Integer overflow panics in debug builds. A panic won't lead to arbitrary code execution, but it's still unpleasant. I'm not sure if there are more serious cases than that; maybe not. But the bottom line is that the crate wasn't really designed with such a hostile environment in mind, so it's on us to inspect it and determine which functions are safe to use and what kinds of inconsistencies are possible. For example, with suitable values of 'head' and 'tail', it's possible that
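The overflow concern can be made concrete with a small sketch (a hypothetical helper, not ringbuf's actual code, assuming free-running counters as some ring buffer crates use): `head - tail` panics on overflow in debug builds if a hostile writer sets `tail > head`, while `wrapping_sub` plus a clamp never does.

```rust
// Compute the occupied length from possibly hostile, free-running
// head/tail counters without ever panicking or returning a length
// that could index out of bounds.
fn occupied_len(head: usize, tail: usize, capacity: usize) -> usize {
    // Plain `head - tail` would panic on overflow in debug builds when
    // a hostile writer plants tail > head; wrapping_sub never panics.
    let len = head.wrapping_sub(tail);
    // Any value larger than capacity is inconsistent state planted by
    // the untrusted side; clamp it instead of trusting it.
    len.min(capacity)
}
```

A caller that then uses the result only for bounds-checked slicing cannot be driven out of the buffer, whatever values the other side writes.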
Microbenchmarking results.
I think the delta between solutions has to be within the value at neon/pgxn/neon_walredo/walredoproc.c, line 447 (commit 7ff591f).
Thinking about this more, it probably doesn't happen. However, my hacky attempt at optimistically slicing did not yield any measurable difference: https://gist.github.com/koivunej/4b31df6f8e222019c96ff56a1839398c. Trying next with full message reading with in-band length prefixing.
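In-band length prefixing can be sketched roughly like this (a hypothetical framing helper, not the PR's code, assuming a little-endian u32 prefix): the reader can consume a message without knowing its size up front, and because the length arrives from the untrusted side, it is validated against a maximum before use.

```rust
// Write one frame: little-endian u32 length prefix, then the payload.
fn write_frame(out: &mut Vec<u8>, payload: &[u8]) {
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
}

// Return the payload and the number of bytes consumed, or None if the
// buffer does not yet hold a complete frame or the length is hostile.
fn read_frame(buf: &[u8], max_len: usize) -> Option<(&[u8], usize)> {
    let len_bytes: [u8; 4] = buf.get(..4)?.try_into().ok()?;
    let len = u32::from_le_bytes(len_bytes) as usize;
    if len > max_len {
        return None; // corrupt or hostile length prefix
    }
    let payload = buf.get(4..4 + len)?;
    Some((payload, 4 + len))
}
```

All accesses go through `get`, so a planted length can at worst make the reader report "incomplete frame", never index out of bounds.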
There seems to be one more possible vulnerability in both the C and Rust versions of shmpipe. I am afraid it can be a showstopper; I will be happy to be wrong. But what can prevent a cracked walredo process from mapping the pipe segment belonging to another tenant? It just needs to somehow guess its tenant ID... After that it will be able to monitor all walredo traffic of the other tenant, i.e. be aware of all updates it performs. It just needs to decode the WAL records and store the data in its own database and voilà: the data is stolen.
seccomp prevents that. After calling seccomp(), that walredo process cannot open any files. |
Nice to hear. I had not realized that we prohibit open.
The Makefile is busted, however; the dependency on target/release/libshmempipe.a cannot be at the highest level, because it doesn't necessarily cause anything to be rebuilt or relinked even if the
In a hurry, have to push more ugly commits. As an episodic update, there's now the eventfd usage. Most of the time during xmas has been spent understanding the memory corruption caused by using a postponed publisher. I do not know why it happens, but it does happen, leading to a "stall" as too many bytes are read. Another interesting one was the completely buggy impl of UnparkInorder, which is now heap based. It is faster in microbenchmarks, saving some time with a large number of threads waiting. I still don't know how to make this into a 7us version. Currently on my machine with taskset and the performance governor, boost disabled, it's consistently 9-10us. Also noting that memcpy only
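The idea behind an in-order unpark primitive can be sketched with only std parking (a minimal hypothetical version, not the PR's heap-based UnparkInorder): waiters register in a FIFO queue and the waker unparks the longest-waiting thread first.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::thread;

// Wake blocked threads in arrival order using std parking primitives.
#[derive(Default)]
struct UnparkInOrder {
    waiters: Mutex<VecDeque<thread::Thread>>,
}

impl UnparkInOrder {
    // Register the current thread and park until someone unparks us.
    // park() may wake spuriously; a real implementation would loop on
    // an actual condition, omitted here for brevity.
    fn wait(&self) {
        self.waiters.lock().unwrap().push_back(thread::current());
        thread::park();
    }

    // Wake the longest-waiting thread, if any.
    fn wake_one(&self) {
        if let Some(t) = self.waiters.lock().unwrap().pop_front() {
            t.unpark();
        }
    }
}
```

Because `unpark` called before `park` makes the next `park` return immediately, there is no lost-wakeup race between registering and parking; a heap instead of a queue would let wakeups follow a priority rather than strict arrival order.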
tracked by #3184
Force-pushed from 40df63c to bc9d6f6 (compare)
this is probably still not good enough; what happens if we don't drop in the parent process but the child is then killed? Will anything of value remain after a robust unlock? Probably not.
sadly the tempfile crate is not currently supported. To run the tests with Miri, just do: cargo +nightly miri test -p shmempipe. Next steps: - flesh out a proper two-thread test case
this is most likely a bogus thing to do ... added a longer FIXME about how it could be avoided. It probably never matters what is done over at the C side; it is wrong that in Rust we just cast it to a mut slice, but there are no really safe alternatives.
I had earlier misprofiled, thinking the waits would be much smaller than they end up being. Busy looping for 100us seems to work much better and is on par with shmem_pipe, but I will be calibrating a loop count for this instead of polling the Instant.
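The calibrated loop count idea can be sketched like this (a rough hypothetical version, not the PR's code): calibrate once how many spin iterations roughly cover the target duration, then the hot path spins on a plain counter without touching the clock.

```rust
use std::hint;
use std::time::{Duration, Instant};

// One-time calibration: count spin iterations until `target` elapses.
// The Instant polling here inflates the per-iteration cost slightly, so
// the result is approximate; a real version would repeat the run, take
// a median, and re-check under frequency scaling.
fn calibrate_spins(target: Duration) -> u64 {
    let mut spins: u64 = 0;
    let start = Instant::now();
    while start.elapsed() < target {
        hint::spin_loop();
        spins += 1;
    }
    spins.max(1)
}

// Hot path: busy-wait for approximately the calibrated duration
// without any clock reads.
fn spin_for(spins: u64) {
    for _ in 0..spins {
        hint::spin_loop();
    }
}
```

`std::hint::spin_loop` emits the platform's pause instruction, so the waiting core yields pipeline resources without giving up the timeslice, which is the point of busy looping below the scheduler's wakeup latency.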
as part of trying out whether this makes sense in a world where we start by having two different implementations: shmempipe for linux, stdio for !linux. So far no regressions, perhaps a slight one on oltp select-after-update. Continuing by trying without allocating the whole message.
instead, feed small chunks at a time. This might leave less time to sleep and thus interleave the start of redoing with the request-sending operation.
earlier as in right after writing the first slice (the first message), in case it was sleeping.
Force-pushed from 9e53fcb to c06c07a (compare)
Maybe I am wrong, but this function is not used now by our code (including Consumer::len()/Producer::len()).
Sorry, I found a place where it is used: in a debug assertion checking the current position in advance.
Yeah, the crate contains many traits implementing different parts. For our purposes, I think we should remove many of the abstractions and all the code that we don't need. That makes it easier to see what's going on, and makes it easier to verify that the unsafe code is sound.
it's clearer that we might need to restart these requests with a slice of bytes instead of a clonable iterator or something.
this also relaxes the bound from &[Bytes] to &[impl AsRef<[u8]>] so that we can keep using Vec and slices. Perhaps a compromise would be bytes::Buf? Still, this makes the bytes dependency optional again.
Closing this PR in favor of async WAL redo.
Very much an early draft; still has crashes and a lot of refactoring to do. Port/rewrite of the earlier #2947.
Creating a PR to solicit feedback, especially on:
And to manage TODOs:
short/short/1: around 10us; Shmem pipe #2947 is 7us, main is 12us