Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize some buffer operations (implements #6148) #8596

Merged
merged 5 commits into from Apr 8, 2019

Conversation

@damiendoligez
Copy link
Member

commented Apr 8, 2019

Buffer.add_string and Buffer.add_substring are the two remaining functions from #6148 that were not optimized when we switched to safe strings.

@gasche
gasche approved these changes Apr 8, 2019
Copy link
Member

left a comment

Could we have an assert at the end of resize that the postcondition b.position + more <= Bytes.length b.buffer holds?

stdlib/bytesLabels.mli Outdated Show resolved Hide resolved
@gasche

This comment has been minimized.

Copy link
Member

commented Apr 8, 2019

When I reviewed the file to ensure that these unsafe_ are correct (it's a global condition: for example we assume that b.position is not negative and this is not checked explicitly), I noticed that the add_channel and add_channel_rec functions are a bit suspicious. add_channel_rec implicitly relies on the invariant that b.position + len <= b.length checked by the wrapper function add_channel, and :

  1. It would be nicer to have add_channel_rec a local function within add_channel, to avoid exposing an unsafe function (if its preconditions are not checked) to the rest of the module
  2. I could not convince myself that this invariant always holds during recursive calls to the add_channel_rec function, in the event for example that another OCaml thread would be awaken by the I/O operations and perform operations on the same OCaml buffer.

I would prefer if add_channel_rec was a local function, and if it checked/asserted b.position + len >= b.length before (or after?) each call to the input function.

@damiendoligez

This comment has been minimized.

Copy link
Member Author

commented Apr 8, 2019

@gasche

Could we have an assert at the end of resize that the postcondition b.position + more <= Bytes.length b.buffer holds?

That should be easy to prove from the code of resize and it's not really related to this PR, is it?

@gasche

This comment has been minimized.

Copy link
Member

commented Apr 8, 2019

To review a change from a safe function to an unsafe function, I try to convince myself ("mentally prove") that the check being removed is redundant at the call site. If the proof is obvious or easy, the change is safe, is the proof is subtle or hard or impossible-without-writing-stuff-down, the change is not necessarily a good idea, or at least this suggests a better documentation of the various invariants in the codebase. In the case of this PR, ensuring that the change is correct requires reviewing resize, and having this assert at the end would make the verification trivial -- for the b.position + len < b.length part, not for b.position >= 0 which is a global invariant which requires a more tiresome verification.

@gasche

This comment has been minimized.

Copy link
Member

commented Apr 8, 2019

My approval of the PR extends to the to_seq{,i} functions which I had already looked at. One slightly tricky (but correct) thing about them is that they are careful to recalculate b.position each time the next element is accessed, so they remain safe even if the underlying buffer is reset or truncated in the meantime. (@damiendoligez, you could consider adding a comment to make this subtlety explicit in the code.)

@damiendoligez

This comment has been minimized.

Copy link
Member Author

commented Apr 8, 2019

Are buffers supposed to be thread-safe? The documentation doesn't say. Even if they currently are because of implicit critical sections, this implementation certainly won't be when we switch to multicore OCaml.

For add_channel, for your first point I would simply rename add_channel_rec into unsafe_add_channel (even though it is "safe"). For your second point, input is a safe function, so there may be some data lost, some uninitialized bytes in the buffer, or an Invalid_argument exception, but (I think) no segfault.

@damiendoligez

This comment has been minimized.

Copy link
Member Author

commented Apr 8, 2019

I've added the assertion and the comments.

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Apr 8, 2019

Are buffers supposed to be thread-safe? The documentation doesn't say.

We should indeed document that their are not thread-safe (and also do it for Hashtbl, etc).

@gasche

This comment has been minimized.

Copy link
Member

commented Apr 8, 2019

Are buffers supposed to be thread-safe?

I make a distinction between "this module may behave incorrect if used in multi-threaded environments" and breaking the memory-safety guarantees of the language. In particular, I believe that it is incorrect to make single-threading assumptions to remove bound checks -- even if we make the assumption for correctness.

For your second point, input is a safe function, so there may be some data lost, some uninitialized bytes in the buffer, or an Invalid_argument exception, but (I think) no segfault.

I agree with this analysis, but still the b.position <- b.position + n write will occur with the value n set to the number of bytes read in the old b.buffer value. If some concurrent code called Buffer.reset on the same buffer b, you can end with b.position > b.length. Most functions will just handle this as a position-at-the-end event and resize the buffer if you call them, so they are fine, but if I read the code of blit correctly I think that I can break stuff by calling Buffer.blit with this invariant-breaking b as the source buffer.

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Apr 8, 2019

In particular, I believe that it is incorrect to make single-threading assumptions to remove bound checks -- even if we make the assumption for correctness.

Wouln't that exclude most of those unsafe functions in Buffer? Even in add_char, it's not clear to me that unsafe_set is ok.

@gasche

This comment has been minimized.

Copy link
Member

commented Apr 8, 2019

let add_char b c =
  let pos = b.position in
  if pos >= b.length then resize b 1;
  Bytes.unsafe_set b.buffer pos c;
  b.position <- pos + 1

I believe that the invariant that b.length = Bytes.length b.buffer is preserved, even under multi-threaded usage (or finalization functions being called at allocation points, etc). If resizing doesn't happen we have pos <= b.length b so we know that the write is safe. If resizing does happen, pos may be a stale value so indeed it is not obvious; but I believe that the new length after resizing is always greater than the length before resizing, so that the write is safe. (The code would be easier to check if b.position was used consistently instead of pos.)

@alainfrisch

This comment has been minimized.

Copy link
Contributor

commented Apr 8, 2019

Ok, this works under the asumptions that "sequences of instructions" without allocations cannot be interleaved with other threads. Does this still hold under Multicore?

@gasche

This comment has been minimized.

Copy link
Member

commented Apr 8, 2019

Yes, I believe it does. But note that if we had doubts, we should move back to safe writes, instead of giving up on memory-safety in the multi-threaded case.

@damiendoligez

This comment has been minimized.

Copy link
Member Author

commented Apr 8, 2019

so that the write is safe

I don't think so. If pos is much greater than b.length, then resize will fail to make the buffer large enough and we write to arbitrary memory.

@gasche

This comment has been minimized.

Copy link
Member

commented Apr 8, 2019

But pos was the initial value of b.position before, so (except for the add_channel_rec issue I pointed above) we can assume that it is at most equal to the initial value of b.length. If my claim that the value of b.length is only increased by resize is correct, there shouldn't be an issue there -- even if changes to b.position and b.length happen during the Bytes.create call within resize.

@gasche

This comment has been minimized.

Copy link
Member

commented Apr 8, 2019

Maybe we should rewrite buffer.ml in WhyML.

@damiendoligez

This comment has been minimized.

Copy link
Member Author

commented Apr 8, 2019

(except for the add_channel_rec issue I pointed above)

Ah, sorry. I was implicitly assuming that we were in the case where add_channel_rec went wrong.

@gasche

This comment has been minimized.

Copy link
Member

commented Apr 8, 2019

My current understanding is summarized as follows:

  • The patch currently being proposed, touching add_{sub,}string and to_seq{,i}, is fine.

  • I believe that there is a reentrancy issue in add_channel_rec that could be fixed by either:

    • adding assert (b.position + len <= b.length) after the input call in add_channel_rec
    • or just using really_input instead of input and have a sensibly simpler implementation
  • I think that the use of unsafe_get in to_seq{,i} is correct assuming that 0 <= b.position <= b.length = Bytes.length b always hold (which may be broken by add_channel_rec above)

  • I believe that the unsafe_set in the add_* functions are correct if we furthermore assume that when resize b more returns, we have <position> + more <= b.length for both the old and the new value of b.position. (The new assert guarantees this for the new value, and I believe it may also hold for the old value.)

@damiendoligez

This comment has been minimized.

Copy link
Member Author

commented Apr 8, 2019

You can't use really_input because, when it encounters end-of-file, it will lose the information of how many bytes were read, which add_channel explicitly preserves.

@damiendoligez

This comment has been minimized.

Copy link
Member Author

commented Apr 8, 2019

See #6719 and #7136

@gasche

This comment has been minimized.

Copy link
Member

commented Apr 8, 2019

Indeed, thanks.

Where do we stand with the current PR? It could be merged as-is -- do you agree? Or are you convinced that there is a small issue with add_channel_rec, and are you willing to address it?

@damiendoligez

This comment has been minimized.

Copy link
Member Author

commented Apr 8, 2019

I'm convinced there is a problem with add_channel_rec but I'm not willing to address it now. Could you open a bug report?

@gasche gasche merged commit d216d36 into ocaml:trunk Apr 8, 2019
2 checks passed
2 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
gasche added a commit to gasche/ocaml that referenced this pull request Apr 8, 2019
gasche added a commit to gasche/ocaml that referenced this pull request Apr 8, 2019
@gasche gasche referenced this pull request Apr 8, 2019
@damiendoligez damiendoligez deleted the damiendoligez:optimize-buffer-add-string branch Apr 9, 2019
gasche added a commit to gasche/ocaml that referenced this pull request Apr 16, 2019
gasche added a commit to gasche/ocaml that referenced this pull request Apr 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.