
Parallel access to Buffer can trigger segfaults #11279

Closed
dra27 opened this issue May 26, 2022 · 15 comments
@dra27 (Member) commented May 26, 2022

There are various uses of _unsafe functions in the implementation of Buffer. In 4.x, it's impossible¹ to write OCaml code which can invalidate the checks done before these _unsafe functions are called, but in 5.x it's relatively easy, e.g.:

let worker b n () =
  let old_size = ref (Buffer.length b) in
  let s = String.make 500 'x' in
  while true do
    let l = Buffer.length b in
    if !old_size <> l then begin
      Format.eprintf "%d size: %d\n%!" n l;
      old_size := l;
    end;
    let () = Buffer.reset b in
    try
      Buffer.add_string b s
    with e -> Printf.eprintf "%s\n%!" (Printexc.to_string e)
  done

let _ =
  let buffer = Buffer.create 1024 in
  let _ = Domain.spawn (worker buffer 1) in
  let _ = Domain.spawn (worker buffer 2) in
  let _ = Domain.spawn (worker buffer 3) in
  let _ = Domain.spawn (worker buffer 4) in
  let _ = Domain.spawn (worker buffer 5) in
  let _ = Domain.spawn (worker buffer 6) in
  let _ = Domain.spawn (worker buffer 7) in
  let _ = Domain.spawn (worker buffer 8) in
  while true do () done

Obviously, Buffers should not be accessed in parallel without being guarded by some kind of lock, but parallel accesses may happen by mistake, and these should never cause the running program to segfault.

There are three possible solutions:

  1. The simplest is to cease using the _unsafe functions entirely. This will impose a measurable penalty on correct single-domain use of Buffer (which is why effort was put into switching to the _unsafe functions in the first place).
  2. Introduce additional bytes primitives which assume the obvious checks on the arguments (i.e. positive indexes, etc.) but repeat only the bounds check. For blit, this eliminates 3 comparisons which, even taking into account the unpredictability of hardware, is likely to be slightly faster. It's also very easy to reason about.
  3. Do something slightly more clever with cached copies of the fields of the buffer to ensure that while the parallel calls will result in a sequentially invalid buffer, that blits will always take place to an actual bytes value. In particular, relaxing the invariant on the cached length field and the position fields should yield a buffer which has a very similar fast path for the adding functions and only one additional check required for the less-frequent retrieval functions (contents, etc.)
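For concreteness, option 1 boils down to swapping Bytes.unsafe_blit_string for the bounds-checked Bytes.blit_string. Below is a minimal standalone sketch; the record type, resize and contents are simplified stand-ins for the stdlib internals, not the actual Buffer code:

```ocaml
(* Simplified stand-in for the Buffer record; field names follow the
   stdlib, but resize is a minimal reimplementation for illustration. *)
type t = {
  mutable buffer : Bytes.t;
  mutable position : int;
  mutable length : int;
}

let create n = { buffer = Bytes.create n; position = 0; length = n }

let resize b more =
  let new_length = max (2 * b.length) (b.position + more) in
  let new_buffer = Bytes.create new_length in
  Bytes.blit b.buffer 0 new_buffer 0 b.position;
  b.buffer <- new_buffer;
  b.length <- new_length

(* Option 1: use the bounds-checked Bytes.blit_string.  A racy
   interleaving can still leave the buffer logically inconsistent, but
   an out-of-range blit now raises Invalid_argument instead of writing
   past the end of the allocation. *)
let add_string b s =
  let len = String.length s in
  let new_position = b.position + len in
  if new_position > b.length then resize b len;
  Bytes.blit_string s 0 b.buffer b.position len;
  b.position <- new_position

let contents b = Bytes.sub_string b.buffer 0 b.position
```

The only change from the stdlib shape is the checked blit; everything else is as before, which is what makes this the "simplest" option.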

I have a partial implementation of 3, but it's clearly not for 5.0 and, if we're going to do something that crazy, it will need benchmarks to justify it.

2 is straightforward, and I intend to open a PR for it - I'm just making sure there's a tracking issue, as I'm on vacation next week 🙂

(credit to @jmid's property testing work and @jonludlam for spinning the repro case out from it)

Footnotes

  1. well, almost impossible: I think it's possible to engineer a program which might just manage to execute a parallel access to a Buffer which interleaves the checks with a reset (using either signals or some Gc evil), but the only reason this would happen is because you were actually writing a program which tried to do this, so it's much less relevant than in 5.x where such programs could be written by mistake.

@dra27 dra27 added the bug label May 26, 2022
@dra27 dra27 added this to the 5.0.0 milestone May 26, 2022
@dra27 dra27 self-assigned this May 26, 2022
@gasche (Member) commented May 27, 2022

A nerd-sniping bug for sure.

The current code for Buffer.add_string is as follows:

let add_string b s =
  let len = String.length s in
  let new_position = b.position + len in
  if new_position > b.length then resize b len;
  Bytes.unsafe_blit_string s 0 b.buffer b.position len;
  b.position <- new_position

Changing unsafe_blit_string to blit_string makes the segfault in your repro case go away. But this variation also does:

let add_string b s =
  let len = String.length s in
  let {position; buffer} = b in
  if position + len <= b.length then
    Bytes.unsafe_blit_string s 0 buffer position len
  else begin
    resize b len;
    Bytes.blit_string s 0 b.buffer b.position len;
  end;
  b.position <- b.position + len

In this version, we work on the underlying buffer (whose size cannot be mutated by concurrent programs), so the unsafe_ reasoning is safe. (One invariant I rely on is that b.length <= Bytes.length b.buffer.)

I think that this change is simple enough that we could consider applying it to all unsafe_* uses guarded by resize.

Is this what you had in mind with your plan (3), or something simpler?

@lthls (Contributor) commented May 27, 2022

(One invariant I rely on is that b.length <= Bytes.length b.buffer.)

This invariant doesn't always hold. If you look at the code of reset, between b.buffer <- b.initial_buffer and b.length <- Bytes.length b.buffer the invariant may be broken.
And of course, the invariant is a bit meaningless as even if it holds at any time t, there is no way in OCaml to get both b.length and b.buffer at the same time (let { length; buffer } = b in ... compiles to sequential accesses). So you actually rely on a stronger invariant, which is that if you read b.buffer (as buf) at time t, then at any time t' >= t the inequality b.length <= Bytes.length buf holds. And this clearly isn't the case.

One possibility would be to introduce one level of indirection: instead of mutable buffer : bytes; mutable length : int; we use mutable buffer_with_length : (bytes * int);. But this could end up being even worse for performance than switching to safe blits. I could also think of various hacks that encode the length into the first bytes of the buffer, which would remove the extra indirection; that's a lot of added complexity and potential mistakes though (but might be a better alternative to proposition 3 ?).

Overall, I tend to prefer @dra27's solution 1 over solution 2. I suspect that the expensive part of the bounds check is not the comparisons, but the length computation. Although it would be nice to see benchmarks.
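The one-level-of-indirection idea can be sketched as follows; the buffer_with_length field name and the resize policy are hypothetical, not stdlib code. Because the bytes and its length are read together from a single immutable pair, a checked index can never disagree with the bytes value it is used on:

```ocaml
(* Hypothetical sketch of the indirection proposal: the bytes and its
   length travel together in one immutable pair. *)
type t = {
  mutable buffer_with_length : Bytes.t * int;
  mutable position : int;
}

let create n = { buffer_with_length = (Bytes.create n, n); position = 0 }

let add_char b c =
  let (buffer, length) = b.buffer_with_length in
  let pos = b.position in
  if pos < length then begin
    (* pos was checked against the length of this very bytes value,
       and a bytes never shrinks, so the unsafe write stays in bounds
       even if another domain races with us. *)
    Bytes.unsafe_set buffer pos c;
    b.position <- pos + 1
  end else begin
    (* Slow path: grow into a fresh pair, then write with the checked
       Bytes.set as a belt-and-braces measure. *)
    let new_length = max (2 * length) (pos + 1) in
    let new_buffer = Bytes.create new_length in
    Bytes.blit buffer 0 new_buffer 0 (min pos length);
    b.buffer_with_length <- (new_buffer, new_length);
    Bytes.set new_buffer pos c;
    b.position <- pos + 1
  end

let contents b =
  let (buffer, _) = b.buffer_with_length in
  Bytes.sub_string buffer 0 b.position
```

The cost, as noted above, is one extra memory indirection (and an allocation per resize for the pair), which is exactly what the benchmarks below measure under the name data_safe.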

@gasche (Member) commented May 27, 2022

I agree that the invariant does not hold -- I should have been clearer: I wrote that version as an experiment, and the test passes, but I don't think it is correct. I tried some benchmarks, but it appears difficult to obtain robust results.

@gasche (Member) commented May 31, 2022

So I wrote microbenchmarks and the results are interesting!

Benchmarking code:
https://gitlab.com/gasche-snippets/buffer-bench/-/blob/trunk/bench_buffer.ml

Early benchmarks

I microbenchmarked Buffer.add_char to compare several approaches:

  • std: the current code with unconditional unsafe_set
  • safe: the easy version with just set instead (expectation: noticeably slower)
  • "shortcut" versions that use unsafe_set in the path were no resizing is necessary:
    • test_wrong"the version I demonstrated above, it assumes that b.length and b.buffer are synchronized and is unsafe in presence of races
    • test: calls Bytes.length in the hot path (Vincent assumed that this was the main cost with the safe versions)
    • data_safe: version changes the representation of buffers to store an immutable bytes * int pair of buffer and length, to be able to correctly assume that they remain synchronized (at the cost of an extra memory indirection)
bench        time (best of 3)   instructions
std          3.75s              36.45G
safe         4.02s              43.80G
test         3.23s              36.56G
test_wrong   2.93s              31.34G
data_safe    3.35s              33.54G

So we see from the benchmarks that... the shortcut versions are noticeably faster than the current code!

Spilling

The difference comes from spilling decisions made by the compiler.

Compare the stdlib implementation and the 'test' version:

let add_char_std b c =
  let pos = b.position in
  if pos >= b.length then resize b 1;
  Bytes.unsafe_set b.buffer pos c;
  b.position <- pos + 1

let add_char_test b c =
  let {position; buffer} = b in
  if position < Bytes.length buffer then
    Bytes.unsafe_set buffer position c
  else begin
    resize b 1;
    Bytes.set b.buffer b.position c
  end;
  b.position <- b.position + 1

The standard version should be faster: it makes one less call to Bytes.length in the hot path and otherwise has the same hot-path logic. It is slower because of spilling: the call to resize b 1 (present in both versions) requires spilling the registers holding b and c, but in add_char_test this spilling occurs only in the else branch (the slow path), while in add_char_std it occurs at top level, so it is also present in the hot path. This difference largely dominates the performance gap between the two versions on my machine.

(Use ocamlopt -dspill to observe the spilling differences.)

I'm not sure why there is this difference in spilling decisions, but I think that it is related to the fact that, after the join point after the conditional, b and c are still needed in add_char_std while they are not needed anymore in add_char_test.

It is of course possible to write "tuned" versions of add_char_std or add_char_safe to avoid the spilling in the same way:

let add_char_std_nospill b c =
  let pos = b.position in
  if pos >= b.length then begin
    resize b 1;
    Bytes.unsafe_set b.buffer pos c;
  end else
    Bytes.unsafe_set b.buffer pos c;
  b.position <- pos + 1

let add_char_safe_nospill b c =
  let pos = b.position in
  if pos >= b.length then
    (resize b 1;
     Bytes.set b.buffer pos c) 
  else
    (Bytes.set b.buffer pos c);
  b.position <- pos + 1

and indeed those versions are much faster than the previous versions.

bench          time (best of 3)   instructions
std            3.75s              36.45G
std_nospill    3.00s              30.23G
safe           4.02s              43.80G
safe_nospill   3.53s              37.58G
test           3.23s              36.56G

add_char vs. add_string

The cost of spilling is also noticeable in a microbenchmark of add_string on small strings (of length 4).

bench            time (best of 3)   instructions
string           8.96s              75.04G
string_nospill   8.84s              70.91G

the 'data' approach

In my test, the 'data' approach, where the bytes buffer and its length are packed in an immutable tuple to avoid an extra call to Bytes.length, has basically the same performance as the test version which does the extra call. It has a lower instruction count, but some of these instructions are memory loads that take longer.

bench          time (best of 3)   instructions
std_nospill    3.00s              30.23G
safe_nospill   3.53s              37.58G
test           3.23s              36.56G
data_safe      3.64s              33.54G

Summary

There was a trivial micro-optimization missing from the Buffer module, making the code about 20-25% slower than it should be. The cost of adding the bounds checking (a trivial change to get Multicore safety) is about the same! So if we make the two changes at once, we get a safer version with comparable performance.

(* before *)
let add_char_std b c =
  let pos = b.position in
  if pos >= b.length then resize b 1;
  Bytes.unsafe_set b.buffer pos c;
  b.position <- pos + 1


(* after *)
let add_char_safe_nospill b c =
  let pos = b.position in
  if pos >= b.length then
    (resize b 1;
     Bytes.set b.buffer pos c) 
  else
    (Bytes.set b.buffer pos c);
  b.position <- pos + 1

If we want, there is also a slightly more optimized and slightly more complex version, which is slightly faster than the safe version.

let add_char_test b c =
  let {position; buffer} = b in
  if position < Bytes.length buffer then
    Bytes.unsafe_set buffer position c
  else begin
    resize b 1;
    Bytes.set b.buffer b.position c
  end;
  b.position <- b.position + 1

@lthls (Contributor) commented May 31, 2022

I'm not sure why there is this difference in spilling decisions, but I think that it is related to the fact that, after the join point after the conditional, b and c are still needed in add_char_std while they are not needed anymore in add_char_test.

I haven't looked myself at the output of -dspill, but I suspect the explanation is simpler: the reload instructions are inserted before the first use of the variable after a spill. In the original version, that means after the join (so in the hot path too). With your new versions, the variables that need to be reloaded after the join (b and pos) are also needed in the cold path after the call to resize, so the reload is done at that point, and no reload needs to be done in the hot path (neither in the hot branch nor in the shared code after the join).

It's possible that a different register allocator (or just a different spilling algorithm) could have yielded the performance of the nospill versions even with the original code, although of course that would not come without a cost elsewhere.

(By the way, I wouldn't consider this micro-optimisation "trivial". I don't even understand why you're keeping the update of b.position in common; the obvious solution would have been to keep the hot and cold path fully separate to ensure the hot path can't be influenced by the cold path.)
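The fully separated shape alluded to above might look like this sketch (with a simplified stand-in record and resize, not the stdlib code). Duplicating the tail of the function into both branches means the resize call's spills and reloads stay confined to the cold branch:

```ocaml
(* Simplified stand-in record and resize, for illustration only. *)
type t = {
  mutable buffer : Bytes.t;
  mutable position : int;
  mutable length : int;
}

let create n = { buffer = Bytes.create n; position = 0; length = n }

let resize b more =
  let new_length = max (2 * b.length) (b.position + more) in
  let new_buffer = Bytes.create new_length in
  Bytes.blit b.buffer 0 new_buffer 0 b.position;
  b.buffer <- new_buffer;
  b.length <- new_length

let add_char_split b c =
  let pos = b.position in
  if pos < b.length then begin
    (* Hot path: no code after a join point can force a reload here. *)
    Bytes.set b.buffer pos c;
    b.position <- pos + 1
  end else begin
    (* Cold path: spills and reloads caused by the call to resize stay
       confined to this branch. *)
    resize b 1;
    Bytes.set b.buffer pos c;
    b.position <- pos + 1
  end
```

The duplicated `b.position <- pos + 1` is the price paid for guaranteeing that nothing in the cold path can influence register allocation in the hot path.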

@xavierleroy (Contributor) commented:

The reload instructions are inserted before the first use of the variable after a spill.

Yes, that's a good summary of what's going on.

Optimal insertion of spills and reloads is a hard problem (as in NP-hard), just like register allocation. So, don't expect too much cleverness from your favorite compiler.

@bluddy (Contributor) commented Jun 1, 2022

(By the way, I wouldn't consider this micro-optimisation "trivial". I don't even understand why you're keeping the update of b.position in common; the obvious solution would have been to keep the hot and cold path fully separate to ensure the hot path can't be influenced by the cold path.)

I wouldn't call that an obvious solution, but it's definitely good to know about this strategy.

@dra27 (Member, Author) commented Jun 6, 2022

A nerd-sniping bug for sure.

Oh yeah 🤓

Is this what you had in mind with your plan (3), or something simpler?

I'd started down the line of enforcing the invariant that b.length <= Bytes.length b.buffer. resize always makes b.buffer larger, so as long as b.length is updated after b.buffer, the invariant holds; reset needs to do the opposite, updating b.length first.

I hadn't got as far as (your lovely!) benchmarks, so I was at the faulty heuristics stage - my assumption was that boxing the buffer and its length in a pair would be dreadful, and also that keeping the spatial locality of having b.length (and not resorting to Bytes.length b.buffer) was "likely" to matter for very large buffers. In the code I was already hacking, I'd made the add functions recursively call themselves after resizing (in order to reload the fields), so it's possible that I'd also achieved the spill optimisation, I just hadn't checked yet!
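The recursive-retry shape described above could be sketched as follows (again with a simplified record and resize, not the stdlib code). After resize, the recursive call re-reads every field from scratch, and the in-bounds check is made against the length of the very bytes value that is written to:

```ocaml
type t = {
  mutable buffer : Bytes.t;
  mutable position : int;
  mutable length : int;
}

let create n = { buffer = Bytes.create n; position = 0; length = n }

let resize b more =
  let new_length = max (2 * b.length) (b.position + more) in
  let new_buffer = Bytes.create new_length in
  Bytes.blit b.buffer 0 new_buffer 0 b.position;
  b.buffer <- new_buffer;
  b.length <- new_length

(* Recursive retry: if the fast path does not apply, grow and start
   over, reloading buffer and position from scratch. *)
let rec add_char b c =
  let buffer = b.buffer in
  let pos = b.position in
  if pos < Bytes.length buffer then begin
    (* pos is checked against the length of the loaded bytes itself,
       so the unsafe write stays in bounds even under a race. *)
    Bytes.unsafe_set buffer pos c;
    b.position <- pos + 1
  end else begin
    resize b 1;
    add_char b c
  end
```

In sequential use the retry happens at most once per call, since resize guarantees room for the pending character; whether this shape also picks up the spill optimisation is exactly the question left open above.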

@gasche (Member) commented Jun 9, 2022

@dra27 what's your current plan? I could send a PR that proposes the "slightly more optimized" versions for Buffer, but I would be happy to let you send a PR instead - I don't want to duplicate work.

@dra27 (Member, Author) commented Jul 14, 2022

Sorry, I had managed to forget about this. I disappeared down another slight rabbit hole with it - in particular, on my machine, while I see the speed-up in the nospill benchmark for add_char, add_string is consistently slower in the nospill version. It's up to you - I'm happy to put together the simple PR for now, reverting the use of the _unsafe functions, and allow this to be revisited. Equally, if you'd like to open a PR with your versions, we can run some more benchmarking and see!

@c-cube (Contributor) commented Sep 29, 2022

I am curious: could a (private) version of resize that also returns the new, larger buffer be used to keep the unsafe_set/unsafe_blit calls? In case of a race condition, you might set/blit into the wrong buffer, but not get a segfault?

Like:

let add_char_std b c =
  let pos = b.position in
  let buffer = if pos >= b.length then resize_and_return b 1 else b.buffer in
  Bytes.unsafe_set buffer pos c;
  b.position <- pos + 1

or even the updated version of @gasche:

let add_char_test b c =
  let {position; buffer} = b in
  if position < Bytes.length buffer then
    Bytes.unsafe_set buffer position c
  else begin
    let buffer = resize_and_return b 1 in
    Bytes.unsafe_set buffer b.position c
  end;
  b.position <- b.position + 1
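resize_and_return is not an existing stdlib function; a hypothetical version, together with a simplified stand-in for the buffer record, might look like the sketch below. Whether the unsafe write is truly segfault-free under every racy interleaving is precisely the question being asked here; the sketch only shows the sequential behaviour:

```ocaml
type t = {
  mutable buffer : Bytes.t;
  mutable position : int;
  mutable length : int;
}

let create n = { buffer = Bytes.create n; position = 0; length = n }

(* Hypothetical primitive: like resize, but also returns the freshly
   allocated bytes, so the caller writes into a buffer it has a direct
   reference to rather than re-reading the mutable field. *)
let resize_and_return b more =
  let new_length = max (2 * b.length) (b.position + more) in
  let new_buffer = Bytes.create new_length in
  Bytes.blit b.buffer 0 new_buffer 0 b.position;
  b.buffer <- new_buffer;
  b.length <- new_length;
  new_buffer

let add_char b c =
  let pos = b.position in
  let buffer = if pos >= b.length then resize_and_return b 1 else b.buffer in
  (* Sequentially in bounds: either pos < b.length <= Bytes.length
     buffer, or buffer was just sized to hold at least pos + 1. *)
  Bytes.unsafe_set buffer pos c;
  b.position <- pos + 1
```

The design intent is that the write always targets a bytes value the caller holds directly; a racing reset could still make the result a write into the "wrong" (stale) buffer, which is the trade-off described above.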

@dra27 (Member, Author) commented Sep 29, 2022

OK, I've opened the simplest PR just to ensure that there's something - it's up to you if you want to have a go at the spill optimisations, @gasche?

@dra27 (Member, Author) commented Sep 29, 2022

@c-cube - I'd been mucking around with ideas based around potentially blitting to the wrong buffer, yes... just running out of time 🙂

@dra27 dra27 linked a pull request Oct 3, 2022 that will close this issue
@Octachron (Member) commented:
@dra27 how bad was the slowdown for @gasche's nospill version on your machine?

@Octachron (Member) commented:
The segfaults should be fixed by #11742, even if there is still room for (micro-)optimizations at a later point for interested people.
