Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix input_offset resetting in stream #25

Merged
merged 2 commits into from
Jul 10, 2023

Conversation

myl7
Copy link
Contributor

@myl7 myl7 commented Jul 3, 2023

Close #22, close #23

The incorrect input_offset resetting causes that, while the stream outputs are still decompress-able, when the input is long enough, the output would become incorrect (means decompression can not output the original input). The PR fixes it.

@pimterry
Copy link
Member

pimterry commented Jul 3, 2023

Amazing @myl7! Thanks for investigating this.

Do you know how long the input has to be to trigger this issue? Is it possible to write a test to reproduce it, just to check this fix and to ensure we don't run into the same issue again in future?

@myl7
Copy link
Contributor Author

myl7 commented Jul 3, 2023

A ~570KB file is enough. How about to add a test with a 600KB file?

@pimterry
Copy link
Member

pimterry commented Jul 3, 2023

Yes, that sounds good to me. I'm happy with either committing a 600KB file or just generating it at runtime (we can use getRandomValues in a loop for 65K at a time) assuming that's fairly quick.

We need to remember to confirm that any new test for this does definitely fail in the previous implementation without this fix 😄

@pimterry pimterry closed this in #26 Jul 3, 2023
@pimterry pimterry reopened this Jul 3, 2023
@pimterry
Copy link
Member

pimterry commented Jul 3, 2023

Oops, sorry, that was closed automatically due to merging #26.

@myl7
Copy link
Contributor Author

myl7 commented Jul 4, 2023

My fault😂. I mistakenly used the term like "the fix # 25" in that.

I checked the processing of input_offset.
The problem looks more like a design choice rather than logical error.

When the instance requires more input and returns NeedsMoreInput, input_offset is previously set to input.len(), which is fair since for this chunk all bytes have been consumed.
However, since in the TransformStream we use do {} while {} and go to the next chunk immediately after getting NeedMoreInput, input_offset should be 0 for the next chunk but incorrectly kept input.len() which is for the previous chunk.
It causes the problem.

In the tests since input_offset is manually managed so the problem does occur.
The above mistakes seem to cause the instance keeps some chunks so the output is still decompressable but can not be recovered to the original input.

Adding an extra step in the TransformStream helper to manually reset input_offset seems also OK.
But I personally slightly prefer resetting input_offset in the library like the fix because it is easier for users to use.
How about your preference?

@pimterry
Copy link
Member

pimterry commented Jul 5, 2023

I see, thanks for the explanation, I think I understand the issue much more clearly.

What's really going on here is that within brotli-wasm we're tracking state that tightly depends on how it's used in the stream wrapper - it's not really part of our Brotli streaming implementation at all. It's breaking here not because it's wrong, but because it doesn't match details of how that other implementation works. The fixed logic in this PR to update input_offset makes sense in the current suggested implementation, but might still be wrong in other implementations - for example reading from a Buffer that is being appended to elsewhere (i.e. where the input offset doesn't change between chunks), something like this:

const buffer = /* ... some fixed buffer that's being written to elsewhere */ 
let inputOffset = 0;

while (true) {
    const nextInput = buffer.slice(inputOffset, endOfBuffer);
    output.append(this.stream.compress(nextInput, outputSize));

    if (this.stream.result() === brotliWasm.BrotliStreamResult.NeedsMoreInput) {
        // Wait for data from elsewhere to update endOfBuffer
        await waitForMoreData();

        // Last_input_offset will often be wrong here - it really should be endOfBuffer
        // now, but in fact it will be 0, because it's assuming that buffer is reset.
    }
}

(Very rough code, but you get the idea)

Even when it's correct, it's a bit 'too clever' and I think moving that into the stream wrapper (but making it easy to implement there) would be simpler & clearer.

What would happen if we remove last_input_offset completely, and instead make compress and decompress return the details of what was consumed each time. Then it's up to the stream wrapper to handle that and track input offset itself, depending on how the input actually works. In the web stream case, the stream implementation would be something like this:

let inputOffset = 0;
do {
    const input = chunk.slice(inputOffset);
    const { output, consumedInputLength } = this.stream.compress(input, this.outputSize);
    inputOffset += consumedInputLength;
    controller.enqueue(output);
} while (this.stream.result() === brotliWasm.BrotliStreamResult.NeedsMoreOutput)

What do you think?

myl7 added 2 commits July 6, 2023 21:48
Other than keep them in the stream.
New returned struct is added.

Also change the error type to JsError.
See rustwasm/wasm-bindgen#2710 .
@myl7 myl7 force-pushed the fix-stream-input-offset branch from 700a90d to e983db1 Compare July 6, 2023 14:19
@myl7
Copy link
Contributor Author

myl7 commented Jul 6, 2023

True. Returning input_offset other than keeping it in the stream looks better.

I have updated the stream Rust code with the tests.
If OK, I will then also update the stream decompression example PR #27 .

@pimterry pimterry merged commit 4fa04d6 into httptoolkit:main Jul 10, 2023
7 checks passed
@pimterry
Copy link
Member

Looks great, thanks @myl7! I think this is definitely better and should make stream usage generally a little clearer, and fix the original bug en route too.

I won't merge #27 for now, if you could add the matching changes in there that would be great, thank you! I'll do a release once that's all done.

@myl7 myl7 deleted the fix-stream-input-offset branch July 11, 2023 00:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Leaking memory when used streaming Streaming output not the same as input
2 participants