Why `RawStream` contains an `Iterator` of `Vec<u8>` instead of `std::io::Read`? #7017

Kixunil · 2022-11-05T19:26:53Z

Question

`RawStream` is an iterator of heap-allocated chunks of bytes rather than a `std::io::Read`, which may harm performance in some cases. The only reason that comes to my mind is having a different error type, but that could've been implemented with a custom `Read` trait that's a copy of `std::io::Read` except for the error type.

Is there any other reason I don't see? If not, how hard would it be to refactor it?
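To make the idea of a custom `Read` trait concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `ReadBytes`, `IoReader`, `StreamError`, and `count_bytes` are hypothetical names, and `StreamError` only stands in for nushell's `ShellError`.

```rust
use std::io;

/// Hypothetical stand-in for nushell's `ShellError`; not the real type.
#[derive(Debug, PartialEq)]
pub enum StreamError {
    Io(io::ErrorKind),
}

/// Same shape as `std::io::Read::read`, except that the implementor picks the error type.
pub trait ReadBytes {
    type Error;

    /// Writes up to `buf.len()` bytes into `buf` and returns how many were written.
    /// `Ok(0)` means end of stream when `buf` is non-empty.
    fn read(&mut self, buf: &mut [u8]) -> Result<usize, Self::Error>;
}

/// Adapter so that any `std::io::Read` can be used as a `ReadBytes`.
pub struct IoReader<R>(pub R);

impl<R: io::Read> ReadBytes for IoReader<R> {
    type Error = StreamError;

    fn read(&mut self, buf: &mut [u8]) -> Result<usize, StreamError> {
        // Delegate to the wrapped reader and convert only the error.
        io::Read::read(&mut self.0, buf).map_err(|e| StreamError::Io(e.kind()))
    }
}

/// Drains a reader through one caller-owned buffer and counts the bytes seen.
/// No allocation happens per chunk: the same `buf` is reused for every read.
pub fn count_bytes<R: ReadBytes>(reader: &mut R, buf: &mut [u8]) -> Result<usize, R::Error> {
    let mut total = 0;
    loop {
        match reader.read(buf)? {
            0 => return Ok(total),
            n => total += n,
        }
    }
}
```

A consumer would call something like `count_bytes(&mut IoReader(file), &mut buf)` with a stack buffer. The associated `Error` type is the only difference from `std::io::Read` in this sketch.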

Additional context and details

I started working on optimizing JSON object parsing so that it does not collect the whole stream into a string, as mentioned in #6979 (comment), and assumed that the stream implements `Read`, which is not the case. I even implemented a bunch of optimizations aimed at significantly reducing heap allocations. 🤦‍♂️ I'm now wondering how to proceed.


rgwood · 2022-12-02T00:07:37Z

Hi, I looked at this a bit and I'm not sure why we did it that way, sorry! If you want to try adding a new custom Read trait like that, feel free.

nushell/crates/nu-protocol/src/value/stream.rs

Line 11 in cf0a18b

    
           pub stream: Box<dyn Iterator<Item = Result<Vec<u8>, ShellError>> + Send + 'static>,
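For code that expects `std::io::Read`, one possible stopgap is an adapter over an iterator of this shape. The sketch below is an assumption, not code from nushell: `ChunkReader` is a hypothetical name, and `String` stands in for `ShellError` as the error type.

```rust
use std::io::{self, Read};

/// Hypothetical adapter: exposes an iterator of byte chunks through `std::io::Read`.
pub struct ChunkReader<I> {
    chunks: I,
    current: Vec<u8>, // chunk currently being drained
    pos: usize,       // number of bytes of `current` already handed out
}

impl<I> ChunkReader<I> {
    pub fn new(chunks: I) -> Self {
        Self { chunks, current: Vec::new(), pos: 0 }
    }
}

impl<I> Read for ChunkReader<I>
where
    // `String` is a stand-in error type for this sketch.
    I: Iterator<Item = Result<Vec<u8>, String>>,
{
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Pull chunks until one has unread bytes; empty chunks are skipped.
        while self.pos == self.current.len() {
            match self.chunks.next() {
                Some(Ok(chunk)) => {
                    self.current = chunk;
                    self.pos = 0;
                }
                Some(Err(msg)) => return Err(io::Error::new(io::ErrorKind::Other, msg)),
                None => return Ok(0), // iterator exhausted: end of stream
            }
        }
        // Copy as much of the current chunk as fits into the caller's buffer.
        let n = buf.len().min(self.current.len() - self.pos);
        buf[..n].copy_from_slice(&self.current[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n)
    }
}
```

This only changes the interface seen by consumers. The producer still allocates one `Vec<u8>` per chunk, and the adapter adds a copy, so it does not remove the cost raised in the question.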

# Description

This PR introduces a `ByteStream` type which is a `Read`-able stream of bytes. Internally, it has an enum over three different byte stream sources:

```rust
pub enum ByteStreamSource {
    Read(Box<dyn Read + Send + 'static>),
    File(File),
    Child(ChildProcess),
}
```

This is in comparison to the current `RawStream` type, which is an `Iterator<Item = Vec<u8>>` and has to allocate for each read chunk.

Currently, `PipelineData::ExternalStream` serves a weird dual role where it is either external command output or a wrapper around `RawStream`. `ByteStream` makes this distinction more clear (via `ByteStreamSource`) and replaces `PipelineData::ExternalStream` in this PR:

```rust
pub enum PipelineData {
    Empty,
    Value(Value, Option<PipelineMetadata>),
    ListStream(ListStream, Option<PipelineMetadata>),
    ByteStream(ByteStream, Option<PipelineMetadata>),
}
```

The PR is relatively large, but a decent amount of it is just repetitive changes.

This PR fixes #7017, fixes #10763, and fixes #12369.

This PR also improves performance when piping external commands. Nushell should, in most cases, have competitive pipeline throughput compared to, e.g., bash.

| Command | Before (MB/s) | After (MB/s) | Bash (MB/s) |
| ----------------------------------------------- | ------------: | -----------: | ----------: |
| `throughput \| rg 'x'` | 3059 | 3744 | 3739 |
| `throughput \| nu --testbin relay o> /dev/null` | 3508 | 8087 | 8136 |

# User-Facing Changes

- This is a breaking change for the plugin communication protocol, because the `ExternalStreamInfo` was replaced with `ByteStreamInfo`. Plugins now only have to deal with a single input stream, as opposed to the previous three streams: stdout, stderr, and exit code.
- The output of `describe` has been changed for external/byte streams.
- Temporary breaking change: `bytes starts-with` no longer works with byte streams. This is to keep the PR smaller, and `bytes ends-with` already does not work on byte streams.
- If a process core dumped, then instead of having a `Value::Error` in the `exit_code` column of the output returned from `complete`, it now is a `Value::Int` with the negation of the signal number.

# After Submitting

- Update docs and book as necessary
- Release notes (e.g., plugin protocol changes)
- Adapt/convert commands to work with byte streams (high priority is `str length`, `bytes starts-with`, and maybe `bytes ends-with`).
- Refactor the `tee` code, Devyn has already done some work on this.

---------

Co-authored-by: Devyn Cairns <devyn.cairns@gmail.com>
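As a rough illustration of how an enum over byte sources can present a single `Read` interface, here is a simplified sketch. It is an assumption and not the PR's actual implementation: `Source` and `drain` are hypothetical names, and the `Child` variant is left out because `ChildProcess` is a nushell type.

```rust
use std::fs::File;
use std::io::{self, Read};

/// Simplified, hypothetical mirror of `ByteStreamSource` (no `Child` variant).
pub enum Source {
    Read(Box<dyn Read + Send + 'static>),
    File(File),
}

impl Read for Source {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Each variant already implements `Read`, so the enum just delegates.
        match self {
            Source::Read(reader) => reader.read(buf),
            Source::File(file) => file.read(buf),
        }
    }
}

/// Copies a source into `out` through one fixed-size buffer reused for every read,
/// instead of receiving a freshly allocated `Vec<u8>` per chunk.
pub fn drain(mut source: Source, out: &mut Vec<u8>) -> io::Result<usize> {
    let mut buf = [0u8; 8192];
    let mut total = 0;
    loop {
        let n = source.read(&mut buf)?;
        if n == 0 {
            return Ok(total);
        }
        out.extend_from_slice(&buf[..n]);
        total += n;
    }
}
```

Keeping `File` as its own variant, rather than boxing it as a `dyn Read`, leaves room for a consumer to match on the variant and treat files specially. Whether and how the PR does that is not shown here.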


Kixunil added the `question` label (the issue author asks something) Nov 5, 2022

Kixunil mentioned this issue Nov 5, 2022

nu freezes with 100% CPU usage when dealing with a huge structure #6979

Open

rgwood added the `streaming` label (issues related to streaming data, or collecting data when it should be streamed) Mar 1, 2023

IanManske mentioned this issue May 11, 2024

Replace ExternalStream with new ByteStream type #12774

Merged

stormasm closed this as completed in #12774 May 16, 2024

hustcer added this to the v0.94.0 milestone May 17, 2024
