Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lazy dropEnd and friends #306

Closed
Bodigrim opened this issue Oct 18, 2020 · 4 comments · Fixed by #395
Closed

Lazy dropEnd and friends #306

Bodigrim opened this issue Oct 18, 2020 · 4 comments · Fixed by #395
Milestone

Comments

@Bodigrim
Copy link
Contributor

There are no dropEnd, takeEnd, dropWhileEnd, takeWhileEnd, spanEnd and breakEnd operating on lazy bytestrings. However, it is not clear for me which degree of laziness is desired here.

Imagine a lazy bytestring of several chunks of length 100, and dropEnd 80.

One strategy is as follows: we look at the first chunk only, without forcing the rest of the string. It has 100 chars, so we can return a chunk containing the first 20 chars immediately, and put remaning 80 in a buffer. Then we force the tail and observe that the next chunk is also 100-chars-long, so we can flush the buffer, and also a chunk of 20 chars, etc. The benefit is that we operate as lazy as possible and do not trigger hidden effects (lazy IO) unnecessary. But the cost is that the result has twice as many chunks as the input: 20 chars, 80 chars, 20 chars, 80 chars...

Another strategy is more eager. Evaluate first two chunks immediately: if the length of the second is > 80, return the first one unchanged. Evaluate the third chunk: if it is longer than 80 chars, return the second chunk unchanged, etc. In this case the output string retain original chunks of length 100, except the last one, which is good because of less fragmentation and pointers involved. The downside is that we may cause more lazy IO than strictly necessary.

@vdukhovni @sjakobi @hsyl20 what do you think?

@vdukhovni
Copy link
Contributor

The second strategy is correct. It avoids lots of copying that's likely more expensive than any benefit from forcing an I/O early. With the first strategy you end up copying every buffer in the stream, with the second, all but the last are returned verbatim.

I should also mention that you're assuming that lazy bytestrings have pending lazy I/O operations, but this is not true in general, often they're just pure octet-strings stored as multiple chunks. And, for example, with takeWhileEnd you can't return anything until you've reached the actual end, because there could always be some later octets that don't meed the predicate.

It is perhaps worth noting that takeEnd and dropEnd can be performed in a streaming manner, within a finite memory budget, but dropWhileEnd and takeWhileEnd are not compatible with streaming in general. Though of course with dropWhileEnd, if any octet of a chunk fails the predicate, one can immediately release all prior chunks and also that one if the octet in question is the last one. The worst behaviour of the lot is takeWhileEnd, nothing can be output until the end has been seen. Don't do that with large data streams that involve lazy I/O.

FWIW, none of these operations are available in Streaming.ByteString, but that's not too surprising, since the idea was to build a "better lazy ByteString", with as much of the same API as made sense...

@Bodigrim
Copy link
Contributor Author

Yeah, I'm also in favor of the second strategy.

@Bodigrim
Copy link
Contributor Author

@sjakobi @hsyl20 opinions?

@sjakobi
Copy link
Member

sjakobi commented Nov 1, 2020

The second strategy sounds preferable to me. Thanks for explaining the trade-off so clearly!

@Bodigrim Bodigrim added this to the 0.11.2.0 milestone Jul 28, 2021
Bodigrim pushed a commit that referenced this issue Jul 28, 2021
* Add lazy dropEnd and friends

Fixes #306.
Utilizes an eager strategy as described in the original issue. Chunks
are eagerly evaluated but the overall structure and pointers are kept
intact.

* Fix `since` version number

* Use `foldrChunks` instead of `length` based checks

* Add review changes

- Pass Int64 to accumulators
- Clean up `breakEnd`
- Implement lazy version of `dropWhileEnd`
- Add dropWhileEnd lazy test

* Fix `breakEnd` and `spanEnd` lazyness

* Make `dropEnd` lazier

* Formatting

* Fix lazy `dropEnd`

* Add `Deque` module

`dropEnd` now uses `Deque` for handling the accumulated bytestrigs

* Normalize function names

* Return `Maybe (S.ByteString, Deque)` from pops

Plus all other review suggestions

* Add review changes

* Rename `dropElements` to `dropEndBytes`

* Add examples + style fixes

* Update Data/ByteString/Lazy/Internal/Deque.hs

Co-authored-by: Simon Jakobi <simon.jakobi@gmail.com>

* Replace `elemLength` with `byteLength`

* Upadate examples

Co-authored-by: Simon Jakobi <simon.jakobi@gmail.com>
Bodigrim pushed a commit that referenced this issue Aug 1, 2021
* Add lazy dropEnd and friends

Fixes #306.
Utilizes an eager strategy as described in the original issue. Chunks
are eagerly evaluated but the overall structure and pointers are kept
intact.

* Fix `since` version number

* Use `foldrChunks` instead of `length` based checks

* Add review changes

- Pass Int64 to accumulators
- Clean up `breakEnd`
- Implement lazy version of `dropWhileEnd`
- Add dropWhileEnd lazy test

* Fix `breakEnd` and `spanEnd` lazyness

* Make `dropEnd` lazier

* Formatting

* Fix lazy `dropEnd`

* Add `Deque` module

`dropEnd` now uses `Deque` for handling the accumulated bytestrigs

* Normalize function names

* Return `Maybe (S.ByteString, Deque)` from pops

Plus all other review suggestions

* Add review changes

* Rename `dropElements` to `dropEndBytes`

* Add examples + style fixes

* Update Data/ByteString/Lazy/Internal/Deque.hs

Co-authored-by: Simon Jakobi <simon.jakobi@gmail.com>

* Replace `elemLength` with `byteLength`

* Upadate examples

Co-authored-by: Simon Jakobi <simon.jakobi@gmail.com>
noughtmare pushed a commit to noughtmare/bytestring that referenced this issue Dec 12, 2021
* Add lazy dropEnd and friends

Fixes haskell#306.
Utilizes an eager strategy as described in the original issue. Chunks
are eagerly evaluated but the overall structure and pointers are kept
intact.

* Fix `since` version number

* Use `foldrChunks` instead of `length` based checks

* Add review changes

- Pass Int64 to accumulators
- Clean up `breakEnd`
- Implement lazy version of `dropWhileEnd`
- Add dropWhileEnd lazy test

* Fix `breakEnd` and `spanEnd` lazyness

* Make `dropEnd` lazier

* Formatting

* Fix lazy `dropEnd`

* Add `Deque` module

`dropEnd` now uses `Deque` for handling the accumulated bytestrigs

* Normalize function names

* Return `Maybe (S.ByteString, Deque)` from pops

Plus all other review suggestions

* Add review changes

* Rename `dropElements` to `dropEndBytes`

* Add examples + style fixes

* Update Data/ByteString/Lazy/Internal/Deque.hs

Co-authored-by: Simon Jakobi <simon.jakobi@gmail.com>

* Replace `elemLength` with `byteLength`

* Upadate examples

Co-authored-by: Simon Jakobi <simon.jakobi@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants