Read to separator fixes #2961

nirvdrum · 2023-03-22T02:21:32Z

This PR fixes IO::EachReader#read_to_separator and IO::EachReader#read_to_separator_with_limit with multi-byte delimiters. If the delimiter appears in the IO object being read from but it split over multiple reads, the old search logic would fail to find the delimiter. Confusingly, if the delimiter appeared in full later in the IO object, that second occurrence's byte position would be returned.

I feel quite confident about the IO::EachReader#read_to_separator fix. This original issue was detected when working on getting the ruby-lsp test suite passing on TruffleRuby. That test suite uses a pipe to communicate with a subprocess running the Ruby LSP server. With this PR in place those tests now pass. The new gets spec was based on the observed problem with that test suite. Our first attempt was insufficient because it failed to re-insert any bytes past the delimiter back into the read buffer. The test accounts for this situation.

I'm less confident about the IO::EachReader#read_to_separator_with_limit changes. I based them on the changes for read_to_separator and wrote a new gets spec that initially failed, but now passes. However, I don't have any real life code using this call so I can't test against that. Unfortunately, the logic is a bit more involved than simply finding the delimiter or reading the first N bytes because Ruby will read beyond N bytes to ensure only complete codepoints are returned.

Beyond that, I switched some internal calls over to primitives. And I did a minor code clean up of IO::EachReader. I think that code in general could benefit from a more comprehensive pass. I've left that for future work. I was aiming to keep this PR as bug fixes only. Each commit is logically standalone so we can selectively apply what's needed for now.

Fixes #2938: IO reads with a delimiter broken for multi-byte delimiters.

/cc @vinistock

oracle-contributor-agreement · 2023-03-22T02:21:37Z

Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

vinistock@users.noreply.github.com

To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.

When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.

If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.

eregon

Thank you for the fix, this looks quite complicated.
@andrykonchin Could you review this PR as well, especially the bug fixes themselves, and integrate this PR?

I wonder why s.prepend(str) vs appending to str (already there before). It seems to make it much harder to reason about it.

spec/ruby/core/io/gets_spec.rb

src/main/ruby/truffleruby/core/io.rb

eregon · 2023-03-24T12:51:56Z

This area indeed feels like it could benefit from a big refactor or even rewriting it from scratch, it's particularly hard to follow the logic and likely it optimizes performance for flat strings but since we have ropes we have different performance concerns.
The logic for gets(sep) should ideally just be read a chunk from IO and append it to str, if we find sep in there yield that and slice off the yielded part, if we don't keep reading. And then indeed optimize to avoid searching what we already did by keeping a start offset.

eregon · 2023-03-24T12:52:52Z

@nirvdrum Could you add a changelog entry, under 22.3 Bug fixes?
@andrykonchin I think we should backport this fix.

andrykonchin · 2023-03-24T13:15:01Z

@andrykonchin Could you review this PR as well, especially the bug fixes themselves, and integrate this PR?

Yes, will do 👌.

nirvdrum · 2023-03-24T16:05:24Z

Thank you for the fix, this looks quite complicated. @andrykonchin Could you review this PR as well, especially the bug fixes themselves, and integrate this PR?

I wonder why s.prepend(str) vs appending to str (already there before). It seems to make it much harder to reason about it.

We can revisit that. I remember doing that years ago to avoid string allocations in the most common case. I think this most often gets called as part of readline and we usually find the \n in the first buffer fill so it didn't make sense to read the byte array, turn that into a string (s in the code), that append that to an empty string (str), just to then yield str.

The prepend was also optimized for our old ropes implementation. IIRC, the concatenation code would determine if one side was empty and just return the other. We had a situation where we had complex rope shapes when a simple leaf rope would have been sufficient, and ultimately faster. It was a big regression we had at the time when we switched to ropes.

eregon · 2023-03-24T17:04:20Z

^ ah OK, then it'd be useful if you can add a comment explaining the rationale there.

andrykonchin

All changes looks good 👍 (except related to read_to_separator_with_limit). I've left some comments.

src/main/ruby/truffleruby/core/io.rb

andrykonchin · 2023-03-27T16:01:51Z

src/main/ruby/truffleruby/core/io.rb

            wanted = limit
+
+            next


I suppose in this case (separator isn't found in @buffer) we still should check content of str and a case when separator is partially in str and partially in @buffer.

We cannot just yield str content.

andrykonchin · 2023-03-27T16:09:37Z

src/main/ruby/truffleruby/core/io.rb

+            next
+          else
+            str << s
+          end


After str << s we should probably adjust wanted as well, e.g. wanted -= bytes_to_read as far as wanted = @limit - str.bytesize is supposed to always be true.

What if str is longer than @limit now?

andrykonchin · 2023-03-27T16:17:08Z

src/main/ruby/truffleruby/core/io.rb

+            str.clear
+            last_scan_start = 0
+
+            yield res


Could res be longer than @limit here?

Yes. It's confusing and contrary to the documentation for gets with a limit value. Ruby will read enough points to complete a codepoint. That's why this method uses @buffer.read_to_char_boundary. I didn't change any of that logic.

Yeah, I see. I meant that we ignore wanted value and return all bytes in str before the separator.

andrykonchin · 2023-03-27T16:18:55Z

src/main/ruby/truffleruby/core/io.rb

+            # will have the correct bytes.
+            if offset < str.bytesize
+              @buffer.put_back(str.byteslice(offset, str.bytesize - offset))
+            end


After @buffer.put_back we should probably adjust wanted as well I suppose. I mean wanted += str.bytesize - offset.

I think you're right that wanted should be adjusted, but I think it needs to be reset to limit since we've found a match. Putting the bytes back affects @buffer, but that's state separately managed. I overlooked this with gets because the method is never re-entered after the first string is yielded. I'll look at a spec for readlines with a limit.

nirvdrum · 2023-03-29T02:06:09Z

I looked into it and TruffleString optimizes for empty strings in an append/concat situation for either a or b in a + b. We had that at one point for ropes, but I think we removed it because it often made call sites polymorphic. I'm sure the Truffle team has looked at this though, so we can simplify some of that read implementation.

andrykonchin · 2023-03-29T09:42:01Z

I've noticed 2 more issues related to the put_back method:

we accumulate in str all the bytes until a separator is met so str might be longer (or much longer) than existing @buffer (in EachReader) that is 32kb long. @storage.prepend(str) increases the buffer capacity and it will never decreased until an io object is garbage collected. So it's kind of memory leak. It's probably OK but we should be aware of it and the issue may deserve to be described in a comment in the code.
it seems to me that the put_back method is supposed to add to a buffer only just read (with shift method, or discarded) bytes. And it works incorrectly when we try to return back more bytes than fit in (0..@start) @buffer region. It just prepends new returned back bytes to a @storage and ignores @start index. So all the already read bytes (located in (0...@start) region) might be present in the buffer twice (in prepended bytes and in @storage itself). So I suppose put_back method should be adjusted as well to handle this case correctly.

    def put_back(str)
      length = str.bytesize
      # A simple case, which is common and can be done efficiently
      if @start >= length
        # ...
      else
        @storage = @storage.prepend(str)
        @total += length
        @start = 0
        @used += length
      end
    end

nirvdrum · 2023-03-29T16:23:10Z

I've noticed 2 more issues related to the put_back method:

* we accumulate in `str` all the bytes until a separator is met so `str` might be longer (or much longer) than existing `@buffer` (in `EachReader`) that is 32kb long. `@storage.prepend(str)` increases the buffer capacity and it will never decreased until an io object is _garbage collected_. So it's kind of _memory leak_. It's probably OK but we should be aware of it and the issue may deserve to be described in a comment in the code.
* it seems to me that the `put_back` method is supposed to add to a buffer only just read (with `shift` method, or discarded) bytes. And it works incorrectly when we try to return back more bytes than fit in (0..`@start`) `@buffer` region. It just prepends new returned back bytes to a `@storage` and ignores `@start` index. So all the already read bytes (located in `(0...@start)` region) might be present in the buffer twice (in prepended bytes and in `@storage` itself). So I suppose `put_back` method should be adjusted as well to handle this case correctly.

Do we have any callers doing this? The only cases of put_back that I've used should never put back more bytes than were just read. If you're saying you think the code is doing otherwise, please let me know.

If it's just a general design note, let's start a new issue for that discussion. A lot of this code only works under specific circumstances, but it's all internal-only. A lot of it came from Rubinius and hasn't been touched much. Some of it is written weirdly to optimize because we had very IO performance in the past. Just to keep this PR focused, I'd like to keep the comments focused on things to change in this PR.

The previous code expected the entire delimiter to be present in any given read. If the separator was present in the IO source, but the bytes were split between multiple reads, the search for the separator would fail. Co-authored-by: Vinicius Stock <vinicius.stock@shopify.com>

…t's only used internally by TruffleRuby.

…iters.

nirvdrum · 2023-03-30T15:08:23Z

I've removed the commit that was intended to fix reading with a multi-byte separator and a limit. @andrykonchin found a couple of issues, which indicated the existing specs weren't as extensive as I thought. I tried to write new specs. I'm pretty sure I fixed the problems, but while tracing the specs they didn't perform the reads and writes in the same order every time. That's a problem because the specs would pass if the read occurs after two writes, but fail if a read came between two writes. I'm not sure why, but the Thread.pass until t.stop? trick I used for the IO#gets tests wouldn't work with IO#each.

While I think I fixed the issues, I don't have confidence without a reliable test suite. Moreover, we haven't received any bug reports about multi-byte delimiters with a limit. I only attempted to fix it because the code is similar to multi-byte delimiters without a limit. Due to the risk involved and how difficult it is to follow, I think we should defer the work with a delimiter, while retaining the work without a delimiter. That can be addressed in a more comprehensive cleanup and hopefully be much easier to reason about.

andrykonchin · 2023-03-31T14:48:39Z

The only cases of put_back that I've used should never put back more bytes than were just read

Agree, my bad.

oracle-contributor-agreement bot added the OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. label Mar 22, 2023

nirvdrum added bug shopify Pull requests from Shopify labels Mar 22, 2023

nirvdrum requested a review from eregon March 22, 2023 13:28

nirvdrum force-pushed the read-to-separator-fixes branch from 1a125e6 to 097bee0 Compare March 23, 2023 14:30

oracle-contributor-agreement bot added OCA Verified All contributors have signed the Oracle Contributor Agreement. and removed OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. labels Mar 23, 2023

eregon approved these changes Mar 24, 2023

View reviewed changes

spec/ruby/core/io/gets_spec.rb Outdated Show resolved Hide resolved

src/main/ruby/truffleruby/core/io.rb Outdated Show resolved Hide resolved

src/main/ruby/truffleruby/core/io.rb Show resolved Hide resolved

eregon requested a review from andrykonchin March 24, 2023 12:43

eregon assigned andrykonchin Mar 24, 2023

andrykonchin reviewed Mar 27, 2023

View reviewed changes

nirvdrum force-pushed the read-to-separator-fixes branch 2 times, most recently from c2e4d6c to ce1577b Compare March 30, 2023 02:54

nirvdrum and others added 4 commits March 30, 2023 10:51

Turn StringFromByteArrayPrimitiveNode into a real primitive since i…

ca54fc6

…t's only used internally by TruffleRuby.

Turn ByteArrayNodes.LocateNode into a primitive for efficiency.

6ae852f

Add a CHANGELOG entry for the fix for IO#gets with multi-byte delim…

18e4ae0

…iters.

nirvdrum force-pushed the read-to-separator-fixes branch from ce1577b to 18e4ae0 Compare March 30, 2023 14:51

Clean up some code duplication in IO::EachReader.

a1040f0

andrykonchin added this to the 23.0.0 Release (April 18, 2023) milestone Mar 31, 2023

graalvmbot merged commit 8f0cc2e into oracle:master Apr 4, 2023

Read to separator fixes #2961

Read to separator fixes #2961

Uh oh!

Conversation

nirvdrum commented Mar 22, 2023

Uh oh!

oracle-contributor-agreement bot commented Mar 22, 2023

Uh oh!

eregon left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

eregon commented Mar 24, 2023

Uh oh!

eregon commented Mar 24, 2023

Uh oh!

andrykonchin commented Mar 24, 2023

Uh oh!

nirvdrum commented Mar 24, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eregon commented Mar 24, 2023

Uh oh!

andrykonchin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

andrykonchin Mar 27, 2023

Choose a reason for hiding this comment

Uh oh!

andrykonchin Mar 27, 2023

Choose a reason for hiding this comment

Uh oh!

andrykonchin Mar 27, 2023

Choose a reason for hiding this comment

Uh oh!

nirvdrum Mar 29, 2023

Choose a reason for hiding this comment

Uh oh!

andrykonchin Mar 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

andrykonchin Mar 27, 2023

Choose a reason for hiding this comment

Uh oh!

nirvdrum Mar 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nirvdrum commented Mar 29, 2023

Uh oh!

andrykonchin commented Mar 29, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nirvdrum commented Mar 29, 2023

Uh oh!

nirvdrum commented Mar 30, 2023

Uh oh!

andrykonchin commented Mar 31, 2023

Uh oh!

Uh oh!

nirvdrum commented Mar 24, 2023 •

edited

Loading

andrykonchin Mar 29, 2023 •

edited

Loading

nirvdrum Mar 29, 2023 •

edited

Loading

andrykonchin commented Mar 29, 2023 •

edited

Loading