
Straighten folds and scans. #364

Merged: 15 commits into haskell:master on Aug 19, 2021

Conversation

@kindaro (contributor, author) commented on Feb 20, 2021

Resolve #373.
Resolve #372.

  • foldr'
  • foldr1'
  • scanl1
  • scanr
  • scanr1
  • Systematic strictness checks.

@kindaro (contributor, author) commented on Feb 20, 2021

I am going to keep this pull request updated so that fellow developers may be apprised of my progress. Feel free to comment!

@Bodigrim (contributor) commented:

(Windows test is flaky, do not pay attention to it; I'll rerun)

@kindaro (contributor, author) commented on Feb 20, 2021

@Bodigrim, I noticed something suspicious.

To get a feel for the ballpark my benchmarks should be in, I added a benchmark for the strict and the lazy foldl'. (Both are already in master; I did not touch them.) What I see is that the lazy foldl' runs ten times as fast as the strict one. How is that possible? Maybe my benchmarks are broken?

This is what I see: [benchmark screenshot omitted]

Do you by chance have an explanation?

@Bodigrim (contributor) commented on Feb 20, 2021

> Do you by chance have an explanation?

Yes, it is a known quirk of benchmarks. See #345, #329, #23.

Note that the benchmarks have the form nf (foldl' arg1 arg2) bs. Now compare the definitions of the strict and the lazy foldl':

-- the strict module, Data.ByteString:
foldl' :: (a -> Word8 -> a) -> a -> ByteString -> a
foldl' f v (BS fp len) = ...

-- the lazy module, Data.ByteString.Lazy:
foldl' :: (a -> Word8 -> a) -> a -> ByteString -> a
foldl' f = go

The application of the strict foldl' needs three arguments to become saturated and suitable for inlining. But under nf only two arguments are supplied, so it remains un-inlined, with a corresponding runtime penalty. The lazy foldl', by contrast, needs only one argument to be saturated, so it inlines perfectly even under nf.

I think the real-world consequences are not as severe as the benchmarks suggest: in the majority of client code foldl' has all three arguments (or is expanded by GHC to have them; see the examples with foldl' (+) 0 . id in #23). Still, it would be nice to fix this quirk, but #345 uncovered more issues with inlining (basically, it inlines better without {-# INLINE #-} at all), and I have not had time to experiment with it properly.
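For illustration, here is roughly the shape of the lazy definition (a sketch using the internal Empty and Chunk constructors, not the verbatim library source). Since foldl' f reduces at once to the local worker go, a single argument saturates it, and it can inline even under nf:

{-# LANGUAGE BangPatterns #-}

import Prelude hiding (foldl')
import qualified Data.ByteString as S
import Data.ByteString.Lazy.Internal (ByteString (..))
import Data.Word (Word8)

-- Sketch: saturated after one argument, because `foldl' f`
-- immediately reduces to the worker `go`.
foldl' :: (a -> Word8 -> a) -> a -> ByteString -> a
foldl' f = go
  where
    -- Fold each strict chunk with the strict S.foldl', threading
    -- the accumulator strictly across chunk boundaries.
    go !acc Empty        = acc
    go !acc (Chunk c cs) = go (S.foldl' f acc c) cs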

@kindaro (contributor, author) commented on Feb 23, 2021

Curious, I would never have thought that inlining could give a tenfold performance boost.

@Bodigrim I added some strictness checks, and it looks to me as though the functions I defined behave the way their names suggest. (It is all quite intricate, so a second look would be good.) However, I found out that I do not really know whether it is appropriate to call them right folds. A lazy byte string is essentially a list of lists, and there is no textbook definition of a left or right fold over that sort of thing… Does this look like a right fold to you?

foldr' f a (Chunk c cs) = S.foldr' f (foldr' f a cs) c
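Spelled out with a base case (the Empty equation below is my assumption of the obvious one, not a quote from the patch), the definition would read:

import qualified Data.ByteString as S
import Data.ByteString.Lazy.Internal (ByteString (..))
import Data.Word (Word8)

-- A right fold over a lazy ByteString: fold each strict chunk with
-- the strict S.foldr', seeding it with the fold of the later chunks.
foldr' :: (Word8 -> a -> a) -> a -> ByteString -> a
foldr' f a Empty        = a
foldr' f a (Chunk c cs) = S.foldr' f (foldr' f a cs) c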

I would like to make sure this piece of code is up to our quality standards before moving on to other functions, so your review would be appreciated!

@kindaro (contributor, author) commented on Mar 1, 2021

We have comments like this:

-- | 'scanl' is similar to 'foldl', but returns a list of successive
-- reduced values from the left. This function will fuse.

I suppose it would be good if I could claim something similar for the functions I add, like scanl1. But I do not know how.

How do we know that a given function will fuse? What does this even mean? What are the laws? Where is it documented and checked?

@kindaro changed the title from "Iron out API discrepancies" to "Straighten folds and scans." on Mar 1, 2021
@kindaro (contributor, author) commented on Mar 1, 2021

I think this is a good place to take a break and get things merged. Property checks tell us that the behaviour of the new functions is identical to that of their strict analogues.

@kindaro marked this pull request as ready for review on March 1, 2021
@Bodigrim (contributor) commented on Mar 1, 2021

> How do we know that a given function will fuse? What does this even mean? What are the laws? Where is it documented and checked?

Thanks for noticing this. The comment is 15 years old: the fusion framework in ByteString was replaced by a stub 13 years ago and removed entirely 10 years ago, but the comment has persisted. I'm afraid there is nothing we can call "fusion" in ByteString nowadays. Could you please clean up the surviving instances of this comment?

@kindaro (contributor, author) commented on Mar 1, 2021

> I'm afraid there is nothing we can call "fusion" in ByteString nowadays. Could you please clean up the surviving instances of this comment?

I opened an issue #374 to track this.

@kindaro force-pushed the iron-out-api-discrepancies branch from d7097a4 to b3ac740 on March 1, 2021
@kindaro requested a review from @Bodigrim on March 1, 2021
Review thread on the proposed scanr (diff excerpt):

-> ByteString
-- ^ input of length n
-> ByteString
-- ^ output of length n+1
scanr f z = pack . fmap (foldr f z) . tails
Contributor:

This is a bit unfortunate, because there is no sharing between tails, so you end up folding each tail with f independently, O(n^2) operations in total.
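For intuition (my arithmetic, not from the thread): an input of length n has n+1 tails, of lengths n, n-1, …, 0, and each is folded independently, so the total work is n + (n-1) + … + 0 = n(n+1)/2, i.e. O(n^2), for an output of only n+1 bytes.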

Author:

Yes, I would like to do something about it. I need to meditate on this. Can I maybe reverse the thing and then fold it from the start? It is forced all along anyway.

I am not sure how to think about this sort of thing.

Contributor:

I have not given it much thought, but maybe we can start from a high-level picture? Let's write lazy chunk-wise accumulating maps:

mapAccumLChunks :: (acc -> Strict.ByteString -> (acc, Strict.ByteString)) -> acc -> Lazy.ByteString -> (acc, Lazy.ByteString)

mapAccumRChunks :: (acc -> Strict.ByteString -> (acc, Strict.ByteString)) -> acc -> Lazy.ByteString -> (acc, Lazy.ByteString)

Then reuse them in the definitions of mapAccum{L,R}, and scan{l,r} as well (scans are just a special case of mapAccum).
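For concreteness, a minimal sketch of the left-to-right version (my own illustration of the proposal, using the internal constructors; not code from the PR):

import qualified Data.ByteString as Strict
import Data.ByteString.Lazy.Internal (ByteString (..))

-- Sketch: thread an accumulator through the chunks from the left,
-- rebuilding the lazy spine as we go. The tail stays a thunk, so
-- the transformed stream can be consumed chunk by chunk.
mapAccumLChunks :: (acc -> Strict.ByteString -> (acc, Strict.ByteString))
                -> acc -> ByteString -> (acc, ByteString)
mapAccumLChunks f = go
  where
    go acc Empty        = (acc, Empty)
    go acc (Chunk c cs) =
      let (acc' , c' ) = f acc c
          (acc'', cs') = go acc' cs
      in  (acc'', Chunk c' cs')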

@kindaro (author) commented on Mar 2, 2021:

My understanding is that, since we output a byte string about as long as the input, time and space complexity are bounded below by Ω(n). This means, to my mind, that reversing and then using scanl is as efficient as it gets. Is there anything I am getting wrong? What is the mark I am aiming at?

Also, should I write some benchmarks?

Member:

I think we can do a bit better than reversing the entire input. I think we can do one pass to reverse the order of the chunks and determine the total length. Then do a second pass over the reversed sequence of chunks and write the output.

Benchmarks would be useful to check that we're roughly in the right ballpark.
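For concreteness, a sketch of that first pass (a hypothetical helper, just to make the plan tangible):

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString as Strict
import Data.ByteString.Lazy.Internal (ByteString (..))

-- Hypothetical first pass: one traversal collects the chunks in
-- reverse order and the total length. No bytes are copied; only
-- the spine is rebuilt.
reverseChunks :: ByteString -> (Int, [Strict.ByteString])
reverseChunks = go 0 []
  where
    go !len acc Empty        = (len, acc)
    go !len acc (Chunk c cs) = go (len + Strict.length c) (c : acc) cs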

Author:

Alright, I get it like this:

  • We agree that linear time and space are lower bounds.
  • We still want to avoid some expensive operations, like reversing all chunks.

Sounds about right?

Member:

Yep, sounds right to me! :)

@kindaro (contributor, author) commented on Mar 5, 2021

I ran some benchmarks and performance is orders of magnitude behind expectations. I am now going to follow the advice above and write everything in terms of mapAccumLChunks and mapAccumRChunks. It will take a few days.

@kindaro (contributor, author) commented on Apr 18, 2021

How about this? It turns out we can use standard recursion schemes. I am not sure whether there would be a performance drop on older compilers, but I did not notice any in my usual setup.
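For instance (a sketch of the idea under discussion, not necessarily the exact code in the commit), a left scan falls out of mapAccumL by emitting the running accumulator at each byte and appending the final one:

import Prelude hiding (scanl)
import Data.Word (Word8)
import qualified Data.ByteString.Lazy as L

-- scanl as a special case of mapAccumL: the output at each position
-- is the accumulator so far; the final accumulator is appended last.
scanl :: (Word8 -> Word8 -> Word8) -> Word8 -> L.ByteString -> L.ByteString
scanl f z bs = L.snoc body final
  where
    (final, body) = L.mapAccumL (\acc x -> (f acc x, acc)) z bs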

@kindaro marked this pull request as a draft on August 15, 2021
@kindaro (contributor, author) commented on Aug 15, 2021

Sorry I disappeared.

I had been trying to add checks for other functions, and it was hard. So I took a month-long creative break. It was fun, and now I am back with new ideas. This is the plan:

  1. Property check the evaluation of strict subjects against strict oracles, and the other way around (lazy against lazy).

    This should pass, but it could also pass because the check is not fine-grained enough, or because the functions are actually the same.

  2. Property check the evaluation of strict subjects against lazy oracles, and the other way around.

    Here we expect failure. If these checks fail, we know that they are fine-grained enough to detect the difference in laziness, and that the difference actually is there.

So our checks are going to check themselves, and also verify that our definitions of functions with different strictness are truly different.
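As a sketch of the kind of check involved, here is a laziness property like the one that later lands in the test suite (the definition of explosiveTail is my assumption):

import Data.Word (Word8)
import qualified Data.ByteString.Lazy as L
import Test.QuickCheck

-- Hypothetical helper: a byte stream whose tail explodes when forced.
explosiveTail :: L.ByteString -> L.ByteString
explosiveTail = (<> error "explosion")

-- A lazy subject against a pure oracle, on an input with an explosive
-- tail: a lazy scanl never forces the tail, while a strict one would
-- fail here, which is exactly the discriminating power the plan asks for.
prop_scanlIsLazy :: [Word8] -> Property
prop_scanlIsLazy ws =
  let xs = L.pack ws
  in  L.take (L.length xs + 1)
             (L.scanl (+) 0 (explosiveTail (xs <> L.singleton 1)))
        === (L.pack . fmap (L.foldr (+) 0) . L.inits) xs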

@sjakobi (member) commented on Aug 16, 2021

@kindaro I'm kind of wary of large, long-running PRs and shifting goal-posts. PRs like this tend to attract merge conflicts and ultimately consume a disproportionate amount of developers' and reviewers' resources.

Are there parts in this PR that could be merged already? Even if some parts are not "fully" ready, we could merge them and possibly consider not exposing them in the next release if we think they might not be safe for users yet.

@kindaro (contributor, author) commented on Aug 16, 2021

Yes, we can throw out the automated strictness checks and trust our own judgement as to whether the functions in question have the right strictness properties. Then we can merge tomorrow.

However, the way I see it, the whole point of automated strictness checks is to reduce the involvement of reviewers. So optimizing for merging sooner and optimizing for spending fewer reviewers' resources are actually in conflict. (I am fine with spending more developers' resources, since I am the only developer and I optimize for quality.)

We can also merge the library code tomorrow, then merge the checks when they are ready, and quickly fix the library code if any faults turn up.

Whatever the maintainers say, I will do.

@Bodigrim (contributor) commented:

I'm happy to spend additional reviewers' resources here whenever you mark this PR as ready for review. While I deeply appreciate your efforts on automatic strictness checks, it feels like their volume and dependency footprint warrant a separate PR (and potentially a separate package).

@kindaro marked this pull request as ready for review on August 18, 2021
@kindaro (contributor, author) commented on Aug 18, 2021

Yo, I took away the strictness checks and responded to the pending comments from previous reviews.

@Bodigrim (contributor) left a review:

Thanks! A couple of minor suggestions, but it looks good overall.

Review thread on the new laziness tests (excerpt):

L.take (L.length xs + 1) (L.scanl (+) 0 (explosiveTail (xs <> L.singleton 1))) === (L.pack . fmap (L.foldr (+) 0) . L.inits) xs
, testProperty "scanl1 is lazy" $ \ xs -> L.length xs > 0 ==>
L.take (L.length xs) (L.scanl1 (+) (explosiveTail (xs <> L.singleton 1))) === (L.pack . fmap (L.foldr1 (+)) . tail . L.inits) xs
]
Contributor:

I was wondering why scanr is less lazy than scanl. The thing is that its output starts from the accumulator, and Data.ByteString.mapAccumR is too strict in this respect:

go src dst = mapAccumR_ acc (len-1)
  where
    mapAccumR_ !s (-1) = return s
    mapAccumR_ !s !n = do
      x <- peekByteOff src n
      let (s', y) = f s x
      pokeByteOff dst n y
      mapAccumR_ s' (n-1)
acc' <- unsafeWithForeignPtr gp (go a)
return (acc', BS gp len)

I think this is fine: there are no particular expectations about strictness of scanr (there is no scanr' in Prelude).

@kindaro (author) commented on Aug 19, 2021:

Not sure I follow. Is there a specific proposition you are reasoning towards? Or a question I may answer? For example:

Proposition. Data.ByteString.Lazy.scanr cannot be lazy.

Proof.

As you noted, the output of a scanr starts from the end, so this is the sort of laziness we can have:

λ take 2 . reverse $ Prelude.scanr (+) 0 [undefined, 1]
[0,1]

So, first the spine of the input list is evaluated to the end, then elements are evaluated from the end backwards. (Whether the accumulator is evaluated before or after the first element depends on the order of evaluation of +.) Similarly, the byte stream's spine would have to be evaluated first. But the spine of the byte stream is strict in the leaf:

-- | A space-efficient representation of a 'Word8' vector, supporting many
-- efficient operations.
--
-- A lazy 'ByteString' contains 8-bit bytes, or by using the operations
-- from "Data.ByteString.Lazy.Char8" it can be interpreted as containing
-- 8-bit characters.
--
data ByteString = Empty | Chunk {-# UNPACK #-} !S.ByteString ByteString
deriving (Typeable, TH.Lift)
-- See 'invariant' function later in this module for internal invariants.

The leaf itself is a byte array and therefore also strict throughout. So, once we force the spine, every byte is also forced. There is no lazy scanr for byte streams. ∎

Something like this?

Contributor:

This was just a remark for myself and @sjakobi and anyone else who is puzzled why we have laziness properties for scanl, but not for scanr.

It's not like you cannot make Data.ByteString.Lazy.scanr a bit lazier. E.g., for the proposed implementation:

> Data.ByteString.Lazy.head $ Data.ByteString.Lazy.scanr const 42 ("foo" <> undefined)
*** Exception: Prelude.undefined
CallStack (from HasCallStack):
  error, called at libraries/base/GHC/Err.hs:79:14 in base:GHC.Err
  undefined, called at <interactive>:11:75 in interactive:Ghci1

However, if we are ready to sacrifice performance, one can define

scanr f z bs = cons hd tl
  where
    (_, tl) = mapAccumR (\x y -> (f y x, x)) z bs
    (hd, _) = List.mapAccumR (\x y -> (f y x, x)) z (unpack bs)

for which

> Data.ByteString.Lazy.head $ Data.ByteString.Lazy.scanr const 42 ("foo" <> undefined)
102

You can define an even lazier (and slower) version, capable of returning the first few chunks of the byte string, as long as f is very lazy (e.g., f = const).

My point is that this is a rare use case, which does not justify performance sacrifices, especially given that there is no general expectation of how lazy scanr should be. I'm fine with your implementation; no action required.

@Bodigrim requested a review from @sjakobi on August 18, 2021
@sjakobi (member) left a review:

I'm happy to see that this PR is close to being merged now! :)

Commit: Throw away `mapAccum[LR]Chunks`. Turns out we do not really need it: we thought we needed it to implement `scan[lr]`, but `mapAccum[LR]` is enough.
@sjakobi (member) left a review:

Thank you, @kindaro! :)

@Bodigrim added this to the 0.11.2.0 milestone on Aug 19, 2021
@Bodigrim merged commit 99b7ff6 into haskell:master on Aug 19, 2021
@Bodigrim (contributor) commented:

Great stuff, @kindaro! Thanks!

Bodigrim pushed a commit to Bodigrim/bytestring that referenced this pull request Aug 19, 2021
* Add strict right folds.

* Add property checks.

* Add benchmarks.

* Inline strictness checks.

* Straighten scans.

* Fix whitespace.

* Use `===` for equality.

* Use infix operator for brevity.

* Add bench marks for lazy scans.

* Use standard recursion schemes.

* Dodge import conflicts on older GHC versions.

* Final considerations according to the last review.

* Final considerations according to one more last review.

* Add bench mark for lazy accumulating maps.

* Throw away `mapAccum[LR]Chunks`.

Turns out we do not really need it. We thought we needed it to implement `scan[lr]`, but actually `mapAccum[LR]` is enough.