Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize some buffer operations (implements #6148) #8596

Merged
merged 5 commits into from Apr 8, 2019

Conversation

damiendoligez
Copy link
Member

Buffer.add_string and Buffer.add_substring are the two remaining functions from #6148 that were not optimized when we switched to safe strings.

Copy link
Member

@gasche gasche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we have an assert at the end of resize that the postcondition b.position + more <= Bytes.length b.buffer holds?

stdlib/bytesLabels.mli Outdated Show resolved Hide resolved
@gasche
Copy link
Member

gasche commented Apr 8, 2019

When I reviewed the file to ensure that these unsafe_ are correct (it's a global condition: for example we assume that b.position is not negative and this is not checked explicitly), I noticed that the add_channel and add_channel_rec functions are a bit suspicious. add_channel_rec implicitly relies on the invariant that b.position + len <= b.length checked by the wrapper function add_channel, and :

  1. It would be nicer to have add_channel_rec a local function within add_channel, to avoid exposing an unsafe function (if its preconditions are not checked) to the rest of the module
  2. I could not convince myself that this invariant always holds during recursive calls to the add_channel_rec function, in the event for example that another OCaml thread would be awaken by the I/O operations and perform operations on the same OCaml buffer.

I would prefer if add_channel_rec was a local function, and if it checked/asserted b.position + len >= b.length before (or after?) each call to the input function.

@damiendoligez
Copy link
Member Author

@gasche

Could we have an assert at the end of resize that the postcondition b.position + more <= Bytes.length b.buffer holds?

That should be easy to prove from the code of resize and it's not really related to this PR, is it?

@gasche
Copy link
Member

gasche commented Apr 8, 2019

To review a change from a safe function to an unsafe function, I try to convince myself ("mentally prove") that the check being removed is redundant at the call site. If the proof is obvious or easy, the change is safe, is the proof is subtle or hard or impossible-without-writing-stuff-down, the change is not necessarily a good idea, or at least this suggests a better documentation of the various invariants in the codebase. In the case of this PR, ensuring that the change is correct requires reviewing resize, and having this assert at the end would make the verification trivial -- for the b.position + len < b.length part, not for b.position >= 0 which is a global invariant which requires a more tiresome verification.

@gasche
Copy link
Member

gasche commented Apr 8, 2019

My approval of the PR extends to the to_seq{,i} functions which I had already looked at. One slightly tricky (but correct) thing about them is that they are careful to recalculate b.position each time the next element is accessed, so they remain safe even if the underlying buffer is reset or truncated in the meantime. (@damiendoligez, you could consider adding a comment to make this subtlety explicit in the code.)

@damiendoligez
Copy link
Member Author

Are buffers supposed to be thread-safe? The documentation doesn't say. Even if they currently are because of implicit critical sections, this implementation certainly won't be when we switch to multicore OCaml.

For add_channel, for your first point I would simply rename add_channel_rec into unsafe_add_channel (even though it is "safe"). For your second point, input is a safe function, so there may be some data lost, some uninitialized bytes in the buffer, or an Invalid_argument exception, but (I think) no segfault.

@damiendoligez
Copy link
Member Author

I've added the assertion and the comments.

@alainfrisch
Copy link
Contributor

Are buffers supposed to be thread-safe? The documentation doesn't say.

We should indeed document that their are not thread-safe (and also do it for Hashtbl, etc).

@gasche
Copy link
Member

gasche commented Apr 8, 2019

Are buffers supposed to be thread-safe?

I make a distinction between "this module may behave incorrect if used in multi-threaded environments" and breaking the memory-safety guarantees of the language. In particular, I believe that it is incorrect to make single-threading assumptions to remove bound checks -- even if we make the assumption for correctness.

For your second point, input is a safe function, so there may be some data lost, some uninitialized bytes in the buffer, or an Invalid_argument exception, but (I think) no segfault.

I agree with this analysis, but still the b.position <- b.position + n write will occur with the value n set to the number of bytes read in the old b.buffer value. If some concurrent code called Buffer.reset on the same buffer b, you can end with b.position > b.length. Most functions will just handle this as a position-at-the-end event and resize the buffer if you call them, so they are fine, but if I read the code of blit correctly I think that I can break stuff by calling Buffer.blit with this invariant-breaking b as the source buffer.

@alainfrisch
Copy link
Contributor

In particular, I believe that it is incorrect to make single-threading assumptions to remove bound checks -- even if we make the assumption for correctness.

Wouln't that exclude most of those unsafe functions in Buffer? Even in add_char, it's not clear to me that unsafe_set is ok.

@gasche
Copy link
Member

gasche commented Apr 8, 2019

let add_char b c =
  let pos = b.position in
  if pos >= b.length then resize b 1;
  Bytes.unsafe_set b.buffer pos c;
  b.position <- pos + 1

I believe that the invariant that b.length = Bytes.length b.buffer is preserved, even under multi-threaded usage (or finalization functions being called at allocation points, etc). If resizing doesn't happen we have pos <= b.length b so we know that the write is safe. If resizing does happen, pos may be a stale value so indeed it is not obvious; but I believe that the new length after resizing is always greater than the length before resizing, so that the write is safe. (The code would be easier to check if b.position was used consistently instead of pos.)

@alainfrisch
Copy link
Contributor

Ok, this works under the asumptions that "sequences of instructions" without allocations cannot be interleaved with other threads. Does this still hold under Multicore?

@gasche
Copy link
Member

gasche commented Apr 8, 2019

Yes, I believe it does. But note that if we had doubts, we should move back to safe writes, instead of giving up on memory-safety in the multi-threaded case.

@damiendoligez
Copy link
Member Author

so that the write is safe

I don't think so. If pos is much greater than b.length, then resize will fail to make the buffer large enough and we write to arbitrary memory.

@gasche
Copy link
Member

gasche commented Apr 8, 2019

But pos was the initial value of b.position before, so (except for the add_channel_rec issue I pointed above) we can assume that it is at most equal to the initial value of b.length. If my claim that the value of b.length is only increased by resize is correct, there shouldn't be an issue there -- even if changes to b.position and b.length happen during the Bytes.create call within resize.

@gasche
Copy link
Member

gasche commented Apr 8, 2019

Maybe we should rewrite buffer.ml in WhyML.

@damiendoligez
Copy link
Member Author

(except for the add_channel_rec issue I pointed above)

Ah, sorry. I was implicitly assuming that we were in the case where add_channel_rec went wrong.

@gasche
Copy link
Member

gasche commented Apr 8, 2019

My current understanding is summarized as follows:

  • The patch currently being proposed, touching add_{sub,}string and to_seq{,i}, is fine.

  • I believe that there is a reentrancy issue in add_channel_rec that could be fixed by either:

    • adding assert (b.position + len <= b.length) after the input call in add_channel_rec
    • or just using really_input instead of input and have a sensibly simpler implementation
  • I think that the use of unsafe_get in to_seq{,i} is correct assuming that 0 <= b.position <= b.length = Bytes.length b always hold (which may be broken by add_channel_rec above)

  • I believe that the unsafe_set in the add_* functions are correct if we furthermore assume that when resize b more returns, we have <position> + more <= b.length for both the old and the new value of b.position. (The new assert guarantees this for the new value, and I believe it may also hold for the old value.)

@damiendoligez
Copy link
Member Author

You can't use really_input because, when it encounters end-of-file, it will lose the information of how many bytes were read, which add_channel explicitly preserves.

@damiendoligez
Copy link
Member Author

See #6719 and #7136

@gasche
Copy link
Member

gasche commented Apr 8, 2019

Indeed, thanks.

Where do we stand with the current PR? It could be merged as-is -- do you agree? Or are you convinced that there is a small issue with add_channel_rec, and are you willing to address it?

@damiendoligez
Copy link
Member Author

I'm convinced there is a problem with add_channel_rec but I'm not willing to address it now. Could you open a bug report?

@gasche gasche merged commit d216d36 into ocaml:trunk Apr 8, 2019
gasche added a commit to gasche/ocaml that referenced this pull request Apr 8, 2019
gasche added a commit to gasche/ocaml that referenced this pull request Apr 8, 2019
@gasche gasche mentioned this pull request Apr 8, 2019
@damiendoligez damiendoligez deleted the optimize-buffer-add-string branch April 9, 2019 09:41
gasche added a commit to gasche/ocaml that referenced this pull request Apr 16, 2019
gasche added a commit to gasche/ocaml that referenced this pull request Apr 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants