Parallel access to Buffer can trigger segfaults #11279
Comments
A nerd-sniping bug for sure. The current code for add_string:
let add_string b s =
  let len = String.length s in
  let new_position = b.position + len in
  if new_position > b.length then resize b len;
  Bytes.unsafe_blit_string s 0 b.buffer b.position len;
  b.position <- new_position
Changing it to:
let add_string b s =
  let len = String.length s in
  let {position; buffer} = b in
  if position + len <= b.length then
    Bytes.unsafe_blit_string s 0 buffer position len
  else begin
    resize b len;
    Bytes.blit_string s 0 b.buffer b.position len;
  end;
  b.position <- b.position + len
In this version, we work on the underlying buffer (whose size cannot be mutated by concurrent programs), so the unsafe_ reasoning is safe. (One invariant I rely on is that
I think that this change is simple enough that we could consider applying it to all
Is this what you had in mind with your plan (3), or something simpler? |
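As a rough illustration of the "read the buffer once" idea, here is a minimal sketch on a stand-in record type. The type `t`, `create`, `resize` and `contents` are hypothetical simplifications, not the stdlib definitions, and the fast-path test compares against `Bytes.length buffer` (the local snapshot) rather than the cached `b.length`, which is a variation on the version quoted above rather than what was posted.

```ocaml
(* Hypothetical stand-in for the Buffer record; not the stdlib type. *)
type t = {
  mutable buffer : bytes;
  mutable position : int;
  mutable length : int;   (* cached length of [buffer] *)
}

let create n =
  let n = max 1 n in
  { buffer = Bytes.create n; position = 0; length = n }

(* Simplified growth: double until [more] extra bytes fit. *)
let resize b more =
  let new_len = ref b.length in
  while b.position + more > !new_len do new_len := 2 * !new_len done;
  let new_buffer = Bytes.create !new_len in
  Bytes.blit b.buffer 0 new_buffer 0 b.position;  (* checked blit *)
  b.buffer <- new_buffer;
  b.length <- !new_len

let add_string b s =
  let len = String.length s in
  (* Read both fields once; the check and the blit use the same snapshot. *)
  let { position; buffer; _ } = b in
  if position + len <= Bytes.length buffer then
    Bytes.unsafe_blit_string s 0 buffer position len
  else begin
    resize b len;
    (* Slow path: keep the bounds-checked blit. *)
    Bytes.blit_string s 0 b.buffer b.position len
  end;
  b.position <- b.position + len

let contents b = Bytes.sub_string b.buffer 0 b.position
```

The unchecked blit is only reached after a bounds test on the very `bytes` value it writes to, so a concurrent `resize` cannot shrink the target between the test and the write. It assumes `position` is never negative, which holds here because only non-negative values are ever stored.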
This invariant doesn't always hold. If you look at the code of One possibility would be to introduce one level of indirection: instead of Overall, I tend to prefer @dra27's solution 1 over solution 2. I suspect that the expensive part of the bounds check is not the comparisons, but the length computation. Although it would be nice to see benchmarks. |
I agree that the invariant does not hold -- I guess I should be more clear. I wrote it this way to try, and the test passes, but I don't think that version is correct. I tried some benchmarks but it appears difficult to obtain robust results. |
So I wrote microbenchmarks and the results are interesting! Benchmarking code:
Early benchmarks
I microbenchmarked
So we see from the benchmarks that... the shortcut versions are noticeably faster than the current code!
Spilling
The difference comes from spilling decisions made by the compiler. Compare the stdlib implementation and the 'test' version:
let add_char_std b c =
  let pos = b.position in
  if pos >= b.length then resize b 1;
  Bytes.unsafe_set b.buffer pos c;
  b.position <- pos + 1
let add_char_test b c =
  let {position; buffer} = b in
  if position < Bytes.length buffer then
    Bytes.unsafe_set buffer position c
  else begin
    resize b 1;
    Bytes.set b.buffer b.position c
  end;
  b.position <- b.position + 1
The standard version should be faster, because it has one less call to Bytes.length in the hot path and otherwise the same (hot path) logic. It is slower because of spilling: the call to (Use I'm not sure why there is this difference in spilling decisions, but I think that it is related to the fact that, after the join point after the conditional, It is of course possible to write "tuned" versions of
let add_char_std_nospill b c =
  let pos = b.position in
  if pos >= b.length then begin
    resize b 1;
    Bytes.unsafe_set b.buffer pos c;
  end else
    Bytes.unsafe_set b.buffer pos c;
  b.position <- pos + 1
let add_char_safe_nospill b c =
  let pos = b.position in
  if pos >= b.length then
    (resize b 1;
     Bytes.set b.buffer pos c)
  else
    (Bytes.set b.buffer pos c);
  b.position <- pos + 1
and indeed those versions are much faster than the previous versions.
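For readers who want to try the comparison themselves, a small timing harness along these lines might do. It is only a sketch: the record type, `create`, `resize`, `fill` and `time` are hypothetical helpers (not the benchmarking code referred to above), and `Sys.time` is a coarse clock, so the numbers will vary with compiler version and machine.

```ocaml
(* Hypothetical stand-in for the Buffer record; not the stdlib type. *)
type t = {
  mutable buffer : bytes;
  mutable position : int;
  mutable length : int;
}

let create n =
  let n = max 1 n in
  { buffer = Bytes.create n; position = 0; length = n }

let resize b more =
  let new_len = ref b.length in
  while b.position + more > !new_len do new_len := 2 * !new_len done;
  let new_buffer = Bytes.create !new_len in
  Bytes.blit b.buffer 0 new_buffer 0 b.position;
  b.buffer <- new_buffer;
  b.length <- !new_len

(* The two variants quoted in the thread. *)
let add_char_std b c =
  let pos = b.position in
  if pos >= b.length then resize b 1;
  Bytes.unsafe_set b.buffer pos c;
  b.position <- pos + 1

let add_char_std_nospill b c =
  let pos = b.position in
  if pos >= b.length then begin
    resize b 1;
    Bytes.unsafe_set b.buffer pos c
  end else
    Bytes.unsafe_set b.buffer pos c;
  b.position <- pos + 1

(* Append [n] lowercase letters using the given [add] function. *)
let fill add n =
  let b = create 16 in
  for i = 0 to n - 1 do add b (Char.chr (97 + i mod 26)) done;
  b

(* Print the processor time spent running [f]. *)
let time name f =
  let t0 = Sys.time () in
  f ();
  Printf.printf "%s: %.3fs\n" name (Sys.time () -. t0)

let () =
  let n = 1_000_000 in
  time "std" (fun () -> ignore (fill add_char_std n));
  time "nospill" (fun () -> ignore (fill add_char_std_nospill n))
```

A single run like this is noisy; repeating each measurement several times and comparing medians gives a more trustworthy picture.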
add_char vs. add_string
The cost of spilling is also noticeable in a microbenchmark of
The 'data' approach
In my test, the 'data' approach, where the bytes buffer and its length are packed in an immutable tuple to avoid an extra call to
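To make the 'data' approach concrete, here is one possible shape for it. This is a sketch under assumptions: the field name `data`, the pair layout and the helper functions are all hypothetical and are not taken from the benchmarked code. The point is that the buffer and its cached length are replaced together, so a reader always sees a matching pair.

```ocaml
(* Hypothetical layout: the bytes and their length live in one immutable
   pair, so they can never be observed out of sync with each other. *)
type t = {
  mutable data : bytes * int;
  mutable position : int;
}

let create n =
  let n = max 1 n in
  { data = (Bytes.create n, n); position = 0 }

let resize b more =
  let (old_buffer, old_len) = b.data in
  let new_len = ref old_len in
  while b.position + more > !new_len do new_len := 2 * !new_len done;
  let new_buffer = Bytes.create !new_len in
  Bytes.blit old_buffer 0 new_buffer 0 b.position;  (* checked blit *)
  (* A single field write publishes the buffer and its length together. *)
  b.data <- (new_buffer, !new_len)

let add_char b c =
  let pos = b.position in
  let (buffer, len) = b.data in
  if pos < len then
    (* [len] is the length of this very [buffer], so the write is in bounds. *)
    Bytes.unsafe_set buffer pos c
  else begin
    resize b 1;
    Bytes.set (fst b.data) pos c  (* checked write on the slow path *)
  end;
  b.position <- pos + 1

let contents b = Bytes.sub_string (fst b.data) 0 b.position
```

The price is one extra indirection on every access, which is the cost the benchmarks in this comment set out to measure.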
Summary
There was a trivial micro-optimization missing from the Buffer module, making the code about 20-25% slower than it should be. The cost of adding the bounds checking (a trivial change to get Multicore safety) is about the same! So if we do the two changes at once, we have a safer version with comparable performance.
(* before *)
let add_char_std b c =
  let pos = b.position in
  if pos >= b.length then resize b 1;
  Bytes.unsafe_set b.buffer pos c;
  b.position <- pos + 1
(* after *)
let add_char_safe_nospill b c =
  let pos = b.position in
  if pos >= b.length then
    (resize b 1;
     Bytes.set b.buffer pos c)
  else
    (Bytes.set b.buffer pos c);
  b.position <- pos + 1
If we want, there is also a slightly more optimized and slightly more complex version, which is slightly faster than the safe version.
let add_char_test b c =
  let {position; buffer} = b in
  if position < Bytes.length buffer then
    Bytes.unsafe_set buffer position c
  else begin
    resize b 1;
    Bytes.set b.buffer b.position c
  end;
  b.position <- b.position + 1
|
I haven't looked myself at the output of It's possible that a different register allocator (or just a different spilling algorithm) could have yielded the performance of the (By the way, I wouldn't consider this micro-optimisation "trivial". I don't even understand why you're keeping the update of |
Yes, that's a good summary of what's going on. Optimal insertion of spills and reloads is a hard problem (as in NP-hard), just like register allocation. So, don't expect too much cleverness from your favorite compiler. |
I wouldn't call that an obvious solution, but it's definitely good to know about this strategy. |
Oh yeah 🤓
I'd started down the line of enforcing the invariant that I hadn't got as far as (your lovely!) benchmarks, so I was at the faulty heuristics stage - my assumption was that boxing the buffer and its length in a pair would be dreadful, and also that keeping the spatial locality of having |
@dra27 what's your current plan? I could send a PR that proposes the "slightly more optimized" versions for Buffer, but I would be happy to let you send a PR, as I don't want to duplicate work. |
Sorry, I had managed to forget about this. I disappeared down another slight rabbit hole with this - in particular, on my machine while I see the speed-up in the nospill benchmarks for |
I am curious, could a (private) version of Like:
let add_char_std b c =
  let pos = b.position in
  let buffer = if pos >= b.length then resize_and_return b 1 else b.buffer in
  Bytes.unsafe_set buffer pos c;
  b.position <- pos + 1
or even the updated version of @gasche:
let add_char_test b c =
  let {position; buffer} = b in
  if position < Bytes.length buffer then
    Bytes.unsafe_set buffer position c
  else begin
    let buffer = resize_and_return b 1 in
    Bytes.unsafe_set buffer b.position c
  end;
  b.position <- b.position + 1
|
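The suggestion above relies on a resize_and_return helper that is not defined in the thread. A minimal guess at what it could look like, on a hypothetical stand-in record, is sketched below. To keep the sketch memory-safe without assuming any invariant between `b.length` and `b.buffer`, it uses the checked `Bytes.set`; whether the unchecked write is sound depends on invariants that this sketch does not establish.

```ocaml
(* Hypothetical stand-in for the Buffer record; not the stdlib type. *)
type t = {
  mutable buffer : bytes;
  mutable position : int;
  mutable length : int;
}

let create n =
  let n = max 1 n in
  { buffer = Bytes.create n; position = 0; length = n }

(* Hypothetical helper: grow the buffer and hand back the new bytes,
   so the caller does not have to reload [b.buffer] after the call. *)
let resize_and_return b more =
  let new_len = ref b.length in
  while b.position + more > !new_len do new_len := 2 * !new_len done;
  let new_buffer = Bytes.create !new_len in
  Bytes.blit b.buffer 0 new_buffer 0 b.position;
  b.buffer <- new_buffer;
  b.length <- !new_len;
  new_buffer

let add_char b c =
  let pos = b.position in
  let buffer =
    if pos >= b.length then resize_and_return b 1 else b.buffer in
  (* Checked write: stays safe even if [b.length] and [buffer] disagree. *)
  Bytes.set buffer pos c;
  b.position <- pos + 1

let contents b = Bytes.sub_string b.buffer 0 b.position
```

Both branches of the conditional now produce the `bytes` to write into, so there is a single write after the join point instead of one per branch.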
OK, I've opened the simplest PR just to ensure that there's something - it's up to you if you want to have a go at the spill optimisations, @gasche? |
@c-cube - I'd been mucking around with ideas based around potentially blitting to the wrong buffer, yes... just running out of time 🙂 |
The segfaults should be fixed with #11742, even if there is still room for (micro-)optimizations at a later point for interested people. |
There are various uses of _unsafe functions in the implementation of Buffer. In 4.x, it's impossible[1] to write OCaml code which can invalidate the checks done before these _unsafe functions are used, but in 5.x it's relatively easy, e.g.:
Obviously Buffers should not be accessed in parallel without being guarded by some kind of lock, but parallel accesses may happen by mistake and these should never cause the running program to segfault.
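For reference, the "guarded by some kind of lock" discipline mentioned above could look roughly like the following. This is a sketch assuming OCaml 5, where `Domain` and `Mutex` are part of the standard library; the `locked_buffer` type and its helper functions are hypothetical names introduced for illustration.

```ocaml
(* Hypothetical wrapper: a Buffer.t that is only touched under a mutex. *)
type locked_buffer = { lock : Mutex.t; buf : Buffer.t }

let create n = { lock = Mutex.create (); buf = Buffer.create n }

(* Run [f] on the underlying buffer while holding the lock; the lock is
   released even if [f] raises. *)
let with_lock lb f =
  Mutex.lock lb.lock;
  Fun.protect ~finally:(fun () -> Mutex.unlock lb.lock) (fun () -> f lb.buf)

let add_string lb s = with_lock lb (fun b -> Buffer.add_string b s)
let contents lb = with_lock lb Buffer.contents

(* Two domains append to the same buffer; with the lock, no append is lost. *)
let demo () =
  let lb = create 16 in
  let worker () = for _ = 1 to 1000 do add_string lb "x" done in
  let d1 = Domain.spawn worker in
  let d2 = Domain.spawn worker in
  Domain.join d1;
  Domain.join d2;
  String.length (contents lb)
```

With the lock in place, `demo ()` should return 2000. The issue is about what happens when such a lock is forgotten: the result may then be wrong, but the program should still not segfault.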
There are three possible solutions:
_unsafe functions entirely. This will impose a measurable penalty on correct single-domain use of Buffer (which is why effort was put into switching to the _unsafe functions before)
bytes value. In particular, relaxing the invariant on the cached length field and the position fields should yield a buffer which has a very similar fast path for the adding functions and only one additional check required for the less-frequent retrieval functions (contents, etc.)
I have a partial implementation of 3, but it's clearly not for 5.0 and, if we're going to do something that crazy, it will need benchmarks to justify it.
2 is straightforward, and I intend to open a PR for it - I'm just making sure there's a tracking issue, as I'm on vacation next week 🙂
(credit to @jmid's property testing work and @jonludlam for spinning the repro case out from it)
Footnotes
[1] Well, almost impossible: I think it's possible to engineer a program which might just manage to execute a parallel access to a Buffer which interleaves the checks with a reset (using either signals or some Gc evil), but the only reason this would happen is because you were actually writing a program which tried to do this, so it's much less relevant than in 5.x where such programs could be written by mistake.