Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read to separator fixes #2961

Merged
merged 5 commits into from
Apr 4, 2023
Merged

Conversation

nirvdrum
Copy link
Collaborator

This PR fixes IO::EachReader#read_to_separator and IO::EachReader#read_to_separator_with_limit with multi-byte delimiters. If the delimiter appears in the IO object being read from but it split over multiple reads, the old search logic would fail to find the delimiter. Confusingly, if the delimiter appeared in full later in the IO object, that second occurrence's byte position would be returned.

I feel quite confident about the IO::EachReader#read_to_separator fix. This original issue was detected when working on getting the ruby-lsp test suite passing on TruffleRuby. That test suite uses a pipe to communicate with a subprocess running the Ruby LSP server. With this PR in place those tests now pass. The new gets spec was based on the observed problem with that test suite. Our first attempt was insufficient because it failed to re-insert any bytes past the delimiter back into the read buffer. The test accounts for this situation.

I'm less confident about the IO::EachReader#read_to_separator_with_limit changes. I based them on the changes for read_to_separator and wrote a new gets spec that initially failed, but now passes. However, I don't have any real life code using this call so I can't test against that. Unfortunately, the logic is a bit more involved than simply finding the delimiter or reading the first N bytes because Ruby will read beyond N bytes to ensure only complete codepoints are returned.

Beyond that, I switched some internal calls over to primitives. And I did a minor code clean up of IO::EachReader. I think that code in general could benefit from a more comprehensive pass. I've left that for future work. I was aiming to keep this PR as bug fixes only. Each commit is logically standalone so we can selectively apply what's needed for now.

Fixes #2938: IO reads with a delimiter broken for multi-byte delimiters.

/cc @vinistock

@oracle-contributor-agreement
Copy link

Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA).
The following contributors of this PR have not signed the OCA:

To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application.

When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated.

If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. label Mar 22, 2023
@nirvdrum nirvdrum requested a review from eregon March 22, 2023 13:28
@oracle-contributor-agreement oracle-contributor-agreement bot added OCA Verified All contributors have signed the Oracle Contributor Agreement. and removed OCA Required At least one contributor does not have an approved Oracle Contributor Agreement. labels Mar 23, 2023
Copy link
Member

@eregon eregon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix, this looks quite complicated.
@andrykonchin Could you review this PR as well, especially the bug fixes themselves, and integrate this PR?

I wonder why s.prepend(str) vs appending to str (already there before). It seems to make it much harder to reason about it.

spec/ruby/core/io/gets_spec.rb Outdated Show resolved Hide resolved
src/main/ruby/truffleruby/core/io.rb Outdated Show resolved Hide resolved
src/main/ruby/truffleruby/core/io.rb Show resolved Hide resolved
@eregon
Copy link
Member

eregon commented Mar 24, 2023

This area indeed feels like it could benefit from a big refactor or even rewriting it from scratch, it's particularly hard to follow the logic and likely it optimizes performance for flat strings but since we have ropes we have different performance concerns.
The logic for gets(sep) should ideally just be read a chunk from IO and append it to str, if we find sep in there yield that and slice off the yielded part, if we don't keep reading. And then indeed optimize to avoid searching what we already did by keeping a start offset.

@eregon
Copy link
Member

eregon commented Mar 24, 2023

@nirvdrum Could you add a changelog entry, under 22.3 Bug fixes?
@andrykonchin I think we should backport this fix.

@andrykonchin
Copy link
Member

@andrykonchin Could you review this PR as well, especially the bug fixes themselves, and integrate this PR?

Yes, will do 👌.

@nirvdrum
Copy link
Collaborator Author

nirvdrum commented Mar 24, 2023

Thank you for the fix, this looks quite complicated. @andrykonchin Could you review this PR as well, especially the bug fixes themselves, and integrate this PR?

I wonder why s.prepend(str) vs appending to str (already there before). It seems to make it much harder to reason about it.

We can revisit that. I remember doing that years ago to avoid string allocations in the most common case. I think this most often gets called as part of readline and we usually find the \n in the first buffer fill so it didn't make sense to read the byte array, turn that into a string (s in the code), that append that to an empty string (str), just to then yield str.

The prepend was also optimized for our old ropes implementation. IIRC, the concatenation code would determine if one side was empty and just return the other. We had a situation where we had complex rope shapes when a simple leaf rope would have been sufficient, and ultimately faster. It was a big regression we had at the time when we switched to ropes.

@eregon
Copy link
Member

eregon commented Mar 24, 2023

^ ah OK, then it'd be useful if you can add a comment explaining the rationale there.

Copy link
Member

@andrykonchin andrykonchin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All changes looks good 👍 (except related to read_to_separator_with_limit). I've left some comments.

src/main/ruby/truffleruby/core/io.rb Outdated Show resolved Hide resolved
wanted = limit

next
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose in this case (separator isn't found in @buffer) we still should check content of str and a case when separator is partially in str and partially in @buffer.

We cannot just yield str content.

next
else
str << s
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After str << s we should probably adjust wanted as well, e.g. wanted -= bytes_to_read as far as wanted = @limit - str.bytesize is supposed to always be true.

What if str is longer than @limit now?

str.clear
last_scan_start = 0

yield res
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could res be longer than @limit here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It's confusing and contrary to the documentation for gets with a limit value. Ruby will read enough points to complete a codepoint. That's why this method uses @buffer.read_to_char_boundary. I didn't change any of that logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I see. I meant that we ignore wanted value and return all bytes in str before the separator.

# will have the correct bytes.
if offset < str.bytesize
@buffer.put_back(str.byteslice(offset, str.bytesize - offset))
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After @buffer.put_back we should probably adjust wanted as well I suppose. I mean wanted += str.bytesize - offset.

Copy link
Collaborator Author

@nirvdrum nirvdrum Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right that wanted should be adjusted, but I think it needs to be reset to limit since we've found a match. Putting the bytes back affects @buffer, but that's state separately managed. I overlooked this with gets because the method is never re-entered after the first string is yielded. I'll look at a spec for readlines with a limit.

@nirvdrum
Copy link
Collaborator Author

I looked into it and TruffleString optimizes for empty strings in an append/concat situation for either a or b in a + b. We had that at one point for ropes, but I think we removed it because it often made call sites polymorphic. I'm sure the Truffle team has looked at this though, so we can simplify some of that read implementation.

@andrykonchin
Copy link
Member

andrykonchin commented Mar 29, 2023

I've noticed 2 more issues related to the put_back method:

  • we accumulate in str all the bytes until a separator is met so str might be longer (or much longer) than existing @buffer (in EachReader) that is 32kb long. @storage.prepend(str) increases the buffer capacity and it will never decreased until an io object is garbage collected. So it's kind of memory leak. It's probably OK but we should be aware of it and the issue may deserve to be described in a comment in the code.
  • it seems to me that the put_back method is supposed to add to a buffer only just read (with shift method, or discarded) bytes. And it works incorrectly when we try to return back more bytes than fit in (0..@start) @buffer region. It just prepends new returned back bytes to a @storage and ignores @start index. So all the already read bytes (located in (0...@start) region) might be present in the buffer twice (in prepended bytes and in @storage itself). So I suppose put_back method should be adjusted as well to handle this case correctly.
    def put_back(str)
      length = str.bytesize
      # A simple case, which is common and can be done efficiently
      if @start >= length
        # ...
      else
        @storage = @storage.prepend(str)
        @total += length
        @start = 0
        @used += length
      end
    end

@nirvdrum
Copy link
Collaborator Author

I've noticed 2 more issues related to the put_back method:

* we accumulate in `str` all the bytes until a separator is met so `str` might be longer (or much longer) than existing `@buffer` (in `EachReader`) that is 32kb long. `@storage.prepend(str)` increases the buffer capacity and it will never decreased until an io object is _garbage collected_. So it's kind of _memory leak_. It's probably OK but we should be aware of it and the issue may deserve to be described in a comment in the code.
* it seems to me that the `put_back` method is supposed to add to a buffer only just read (with `shift` method, or discarded) bytes. And it works incorrectly when we try to return back more bytes than fit in (0..`@start`) `@buffer` region. It just prepends new returned back bytes to a `@storage` and ignores `@start` index. So all the already read bytes (located in `(0...@start)` region) might be present in the buffer twice (in prepended bytes and in `@storage` itself). So I suppose `put_back` method should be adjusted as well to handle this case correctly.

Do we have any callers doing this? The only cases of put_back that I've used should never put back more bytes than were just read. If you're saying you think the code is doing otherwise, please let me know.

If it's just a general design note, let's start a new issue for that discussion. A lot of this code only works under specific circumstances, but it's all internal-only. A lot of it came from Rubinius and hasn't been touched much. Some of it is written weirdly to optimize because we had very IO performance in the past. Just to keep this PR focused, I'd like to keep the comments focused on things to change in this PR.

@nirvdrum nirvdrum force-pushed the read-to-separator-fixes branch 2 times, most recently from c2e4d6c to ce1577b Compare March 30, 2023 02:54
nirvdrum and others added 4 commits March 30, 2023 10:51
The previous code expected the entire delimiter to be present in any given read. If the separator was present in the IO source, but the bytes were split between multiple reads, the search for the separator would fail.

Co-authored-by: Vinicius Stock <vinicius.stock@shopify.com>
@nirvdrum
Copy link
Collaborator Author

I've removed the commit that was intended to fix reading with a multi-byte separator and a limit. @andrykonchin found a couple of issues, which indicated the existing specs weren't as extensive as I thought. I tried to write new specs. I'm pretty sure I fixed the problems, but while tracing the specs they didn't perform the reads and writes in the same order every time. That's a problem because the specs would pass if the read occurs after two writes, but fail if a read came between two writes. I'm not sure why, but the Thread.pass until t.stop? trick I used for the IO#gets tests wouldn't work with IO#each.

While I think I fixed the issues, I don't have confidence without a reliable test suite. Moreover, we haven't received any bug reports about multi-byte delimiters with a limit. I only attempted to fix it because the code is similar to multi-byte delimiters without a limit. Due to the risk involved and how difficult it is to follow, I think we should defer the work with a delimiter, while retaining the work without a delimiter. That can be addressed in a more comprehensive cleanup and hopefully be much easier to reason about.

@andrykonchin
Copy link
Member

The only cases of put_back that I've used should never put back more bytes than were just read

Agree, my bad.

@graalvmbot graalvmbot merged commit 8f0cc2e into oracle:master Apr 4, 2023
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug OCA Verified All contributors have signed the Oracle Contributor Agreement. shopify
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IO reads with a delimiter broken for multi-byte delimiters
4 participants