Conversation

@treeowl (Contributor) commented Apr 6, 2020

Make (<*) for Data.Sequence incrementally asymptotically optimal.
This finally completes the task, begun in December 2014, of making all
the Applicative methods for sequences asymptotically optimal
even when their results are consumed incrementally.

Closes #314

@treeowl (Contributor, Author) commented Apr 6, 2020

@int-e, @oisdk, and @sjakobi: I'd really appreciate a review of this one. I wrote it, but can't claim to understand it all properly. Some of the names are bloody awful, and I find it really hard to keep track of what things go on the left and what things go on the right. I haven't written nearly enough comments, and that's largely because any I write are likely to be at least partially wrong.

@treeowl treeowl force-pushed the seq-before branch 2 times, most recently from 07f0ec1 to 8af84b4 Compare April 6, 2020 08:47
@treeowl treeowl requested review from oisdk, dmwit and sjakobi April 6, 2020 08:54
@sjakobi (Member) commented Apr 6, 2020

Impressive!

Have you benchmarked whether this is in fact faster in practice?

BTW I'm wondering under what circumstances you'd even use <* on Seq. Any ideas?

@jwaldmann (Contributor) commented Apr 6, 2020

> BTW I'm wondering under what circumstances you'd even use <* on Seq. Any ideas?

I was wondering the same thing. Perhaps it appears in the result of an Applicative-Do transform?

I am not seeing it in the examples https://gitlab.haskell.org/ghc/ghc/-/wikis/applicative-do but perhaps there are further optimisations.

[EDIT] Section 8 of: https://doi.org/10.1145/2976002.2976007 https://simonmar.github.io/bib/papers/applicativedo.pdf says "Our desugaring does not currently exploit the *> or <* combinators that Applicative provides, and in certain cases using these operators instead of <*> can result in performance benefits."
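For concreteness, here is what `<*` computes on sequences — a small illustration of mine, not code from the PR (`example` is a name I made up):

```haskell
import Data.Sequence (Seq, fromList)

-- xs <* ys keeps each element of xs, repeated (length ys) times, in
-- order; the elements of ys themselves are discarded.
example :: Seq Char
example = fromList "ab" <* fromList [1, 2, 3 :: Int]
-- example == fromList "aaabbb"
```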

@treeowl (Contributor, Author) commented Apr 6, 2020

> Impressive!
>
> Have you benchmarked whether this is in fact faster in practice?
>
> BTW I'm wondering under what circumstances you'd even use <* on Seq. Any ideas?

No formal benchmarks, but informal tests are mighty convincing. The allocation and residency drops are enormous. I have no idea why you'd want this operation, sadly.

```haskell
--
-- '<*' takes time and space proportional to the product of the length
-- of its first argument and the logarithm of the length of its second
-- argument.
```
Review comment (Contributor):

Huh. Why do the asymptotics for <* and *> differ? Is that really right?

Reply (Contributor, Author):

Yes, but it needs more explanation. *> gets to reuse the original structure of the second sequence quite heavily, so that's correct for additional allocation and residency, but doesn't count the fact that it will retain large portions of the original structure of the second sequence as well. Please feel free to suggest a better blurb. For <*, we need to build an all new copy of a structure like the second sequence for every element of the first sequence.

Reply (Contributor, Author):

For *>, once we've turned the second sequence inside out (in time logarithmic in its length), we just stick that in a bunch of places unchanged. We can't get that kind of repetition in <*.

@sjakobi (Member) left a comment:

I admittedly haven't looked very closely at the code, but since the property test indicates that it's correct and you're happy with the performance, I don't see a reason not to merge this.

Once we know of a use case, we can still invest more effort.

We could also consider exporting beforeSeq' as e.g. replicateEach :: Int -> Seq a -> Seq a if that would be useful to anybody.

@treeowl (Contributor, Author) commented Apr 6, 2020

@sjakobi, I'm really looking for ideas to make the code more readable/understandable more than performance thoughts. I've thrown much more and larger QuickCheck runs at this on my computer than we have Travis do for us, so I'm pretty well convinced it's correct.

@sjakobi (Member) commented Apr 6, 2020

Thanks for the clarification @treeowl! I'm currently trying to focus on other projects, but I'll try to do an in-depth review if I find myself idle!

@treeowl (Contributor, Author) commented Apr 6, 2020

@Lysxia, you might find this interesting.

@treeowl (Contributor, Author) commented Apr 7, 2020

One way to go from here is to integrate the rreplicate machinery into beforeSeq' and raptyMiddle, fusing away the explicit intermediate spine representation and getting rid of most or all of the extra datatypes. That should improve immediate-indexing performance slightly (which doesn't matter much). What's less obvious to me is whether it will make the whole implementation easier to understand or even harder. It'll probably get shorter, but that only goes so far.

Make `(<*)` for `Data.Sequence` incrementally asymptotically optimal.
This finally completes the task, begun in December 2014, of making all
the `Applicative` methods for sequences asymptotically optimal
even when their results are consumed incrementally.
@treeowl (Contributor, Author) commented Apr 7, 2020

I added a commit that removes the intermediate structure. I'm not so sure which version is less unclear. Do you have an opinion?

@Lysxia (Contributor) commented Apr 7, 2020

I like the new version better.

One idea to improve the documentation is to write down the invariants of the various helpers. Once they are known, reviewing the code becomes pretty easy. It's good to remove intermediate structures as you did to make the code more compact, because there are fewer invariants to write.

@treeowl (Contributor, Author) commented Apr 7, 2020

@Lysxia, yes, that would be quite helpful.... I'll just need to figure out what the invariants are. That'll have the side benefit of helping work out whether there's a simpler or more efficient way to manage all the size arithmetic. Would you like to take a crack at it?

@Lysxia (Contributor) commented Apr 7, 2020

I can give it a try this weekend.

Remove the explicit intermediate spine structure from `(<*)`.
I don't really know if this is a good or a bad thing from a
clarity standpoint.
@treeowl (Contributor, Author) commented Apr 8, 2020

Many thanks, @Lysxia. Suggestions for better function and variable names will certainly be appreciated as well. Also, the way the nine arguments to raptyMiddle are currently bundled and arranged seems pretty ... arbitrary. Let me know if you have a better idea for that.

@int-e (Contributor) commented Apr 11, 2020

I think I mostly understand what's going on now... but those names :-/

For starters I'd suggest replicateEach as a name for beforeSeq'.

While trying to wrap my head around this problem (and code), I have implemented a version based on rigidify, but less aggressively optimized; see the gist. Some of the optimization should be recovered by inlining... but I have not done any benchmarks. Even if it's not competitive, the code may be a source for names.

@treeowl (Contributor, Author) commented Apr 11, 2020

@int-e, have you looked at the PR @Lysxia opened against my branch a couple hours ago? Some of the names get much less bad there, and there's some documentation!

```haskell
beforeSeq :: Seq a -> Seq b -> Seq a
beforeSeq xs ys = beforeSeq' xs (length ys)

beforeSeq' :: Seq a -> Int -> Seq a
```
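As a reading aid, here is a naive specification the helper must satisfy — a sketch of mine, usable as a QuickCheck oracle (`beforeSeqSpec` is a name I made up, not in the PR):

```haskell
import Data.Sequence (Seq)
import qualified Data.Sequence as Seq

-- Naive reference semantics for beforeSeq': every element of xs is
-- repeated n times, in order. The optimized version must agree with
-- this while building its result incrementally.
beforeSeqSpec :: Seq a -> Int -> Seq a
beforeSeqSpec xs n = xs >>= Seq.replicate n
```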
Review comment (Contributor):

It would probably be a good idea to comment on the basic ideas somewhere around here... from a 2-3 tree point of view, the result is obtained by fixing a 2-3 tree shape of size lenys >= 2 (which can be constructed in logarithmic time and space) that is then planted at every leaf of xs.

The shape is defined recursively in a bottom-up fashion; at each level there is an initial node, a middle node that has to be replicated a number of times, and a final node: (l, n × m, r). For n = 0 or n = 1, a single node is enough to accomplish this; for n = 2, two nodes are required. In all other cases, we combine l with one or two m's into a single node and, symmetrically, one or two m's with r for the last node, such that the remaining number of m's (between n-4 and n-2) is divisible by 3. For the middle part, we can then combine the m's in groups of 3 and recurse.

The code is complicated by the fact that the first and last elements of the initial list are put on the fringes of the finger tree while the shape is being fixed. This involves quite a bit of bookkeeping (partly captured by RCountMid above).

Reply (Contributor, Author):

Yes, an explanation of the basic ideas is indeed in order. I don't yet understand the second paragraph in your comment above: specifically what you mean about "a middle node that has to be replicated a number of times". I was thinking about the general process as using a (virtual) rigid tree for the second argument (the one where we only care about the length) in which every node that's not on the spine is a 3-node. To create the corresponding 2-3 tree, we need to create three things: the squashed-up left spine, the squashed-up right spine, and (in case we have a Single at the bottom) a full 3-tree for the center. Fortunately, that full 3-tree is produced "for free" out of the process of building the other parts, since we need such trees at each level to fill in the internal nodes. But I know my thinking about this is rather fuzzy in general; I mostly worked things out by trial and error starting with the geometrical intuition. So it's very likely possible to explain it better some other way. But yours remains mysterious to me.

Reply (Contributor):

Well, your perspective is different from mine. In my mind, I build the full 2-3 tree first. The finger tree is then obtained by peeling off digits from the left and the right, leaving leftover digits of size 1 or 2. (This can be done on the fly as the 2-3 tree is being built.) Btw, my n corresponds to deep_count in your code.

Anyway, let's see an example. To build a full 2-3 tree with 14 nodes, we can combine them as follows:

```
abbbbbbbbbbbbc                  -- 14 nodes: combine triples in the middle, and pairs or triples at the ends
(ab)(bbb)(bbb)(bbb)(bbc)        -- 5 nodes: ditto
((ab)(bbb))((bbb)(bbb)(bbc))    -- 2 nodes: base case
(((ab)(bbb))((bbb)(bbb)(bbc)))
```

The digits that can be peeled off are a,b and (bbb) from the left, and b,b,c and (bbb),(bbb) from the right.
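The combining arithmetic in this example can be sketched as a hypothetical helper (mine, not the PR's actual code, which interleaves this with the fringe bookkeeping and size annotations):

```haskell
-- With an initial node l, n >= 2 replicated middle nodes m, and a final
-- node r, absorb one or two m's into each end so that the leftover m's
-- split into exact triples.
-- Returns (m's grouped with l, standalone triples, m's grouped with r).
chunk :: Int -> (Int, Int, Int)
chunk n = case n `mod` 3 of
  0 -> (1, (n - 3) `div` 3, 2)  -- e.g. n = 12: (ab)(bbb)(bbb)(bbb)(bbc)
  1 -> (2, (n - 4) `div` 3, 2)  -- e.g. n = 4:  (abb)(bbc)
  _ -> (1, (n - 2) `div` 3, 1)  -- e.g. n = 2:  (ab)(bc)
```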

@int-e (Contributor) commented Apr 11, 2020

@treeowl no, I missed it. Too tired to look now. (Maybe tomorrow...)

P.S. actually I'm not even sure how I could've discovered it, except by proactively looking at your repo and pull requests. A comment here might have helped. (Link: treeowl#2)

@treeowl (Contributor, Author) commented Apr 12, 2020

@int-e, indeed, discovering such things is not nearly as easy as it should be. I'm expecting to merge it to my branch tomorrow. When it came in, I was out in the woods. Now I'm helping put away a massive grocery order.

* Document the invariants of `raptyMiddle`.
* Simplify some of the arithmetic.
* Reorganize the arguments for clarity.
@treeowl (Contributor, Author) commented Apr 13, 2020

Any opinions on whether we need to use that lazy pattern match and build the 2-3 trees top-down rather than bottom-up? My intuition suggests we probably don't, and that we therefore probably shouldn't, since doing so adds four thunks to each level of each 2-3 tree and the trees are shallow enough that building them lazily probably doesn't improve locality.
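A toy model of the two construction strategies under discussion (mine, illustrative only; the real code builds size-annotated 2-3 trees):

```haskell
{-# LANGUAGE BangPatterns #-}

data Tree a = Leaf a | Node3 (Tree a) (Tree a) (Tree a)

-- Bottom-up, strict: forcing the result builds every level at once;
-- no thunks remain inside the structure.
fullStrict :: Int -> a -> Tree a
fullStrict 0 x = Leaf x
fullStrict d x = let !t = fullStrict (d - 1) x in Node3 t t t

-- Top-down, lazy: the root is available immediately; each deeper level
-- sits behind a shared thunk until someone matches on it.
fullLazy :: Int -> a -> Tree a
fullLazy 0 x = Leaf x
fullLazy d x = Node3 t t t  where t = fullLazy (d - 1) x
```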

@treeowl treeowl force-pushed the seq-before branch 2 times, most recently from 99d20e7 to fbb516a Compare April 13, 2020 19:33
```haskell
(RCountMid pr deep_count sf) -- deep_count > 1
  = case deep_count `quotRem` 3 of
      (q, 0)
        -> Deep (deep_count * sizec + lenys * (size midxs + 1))
```
Review comment (Contributor):

Observations:

- This can also be computed as lenys * (size midxs + 2) - ssf - spr.
- The expression is the same for all cases, so one could factor out the Deep (...) part.
- The value of lenys * (size midxs + 1) (or ... + 2) never changes; is it worth floating out?
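A sanity check of the first observation (mine, under an assumption I'm inferring from the identity, not one stated in the PR): the two formulas coincide exactly when lenys = spr + deep_count * sizec + ssf, i.e. the second sequence's length is prefix + replicated middles + suffix. `sizeMid` stands in for `size midxs`.

```haskell
-- Hypothetical check: under the assumed invariant on lenys, the two
-- size expressions are equal for all inputs.
checkSizes :: Int -> Int -> Int -> Int -> Int -> Bool
checkSizes spr ssf deep_count sizec sizeMid =
    deep_count * sizec + lenys * (sizeMid + 1)
      == lenys * (sizeMid + 2) - ssf - spr
  where
    lenys = spr + deep_count * sizec + ssf
```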

Reply (Contributor, Author):

Yes.... do you have an opinion about which calculation is better? Yes, we should factor out the Deep part; I'll try to remember to get that done. I don't know if it's worth floating out size midxs. Maybe. I haven't made any efforts at micro-optimizations like that because they require proper benchmarking, which is a bit fussy. I'm inclined not to bother for the moment.

@int-e (Contributor) commented Apr 14, 2020

I highly doubt the lazy pattern match will ever be worth it. Just look at its type, which is a nesting of Nodes (and adds another level on top), so we provably need O(log(n)) calls to reach that type. So to make an asymptotic difference, we'd have to look at several nodes on that level, but only inspect the top part of each... that's highly artificial.

Adding in that all this laziness isn't exactly free, I think the pattern match should be strict.

@treeowl (Contributor, Author) commented Apr 14, 2020

> I highly doubt the lazy pattern match will ever be worth it. Just look at its type, which is a nesting of Nodes (and adds another level on top), so we provably need O(log(n)) calls to reach that type. So to make an asymptotic difference, we'd have to look at several nodes on that level, but only inspect the top part of each... that's highly artificial.
>
> Adding in that all this laziness isn't exactly free, I think the pattern match should be strict.

That was my thought... but now I'm beginning to doubt myself. When we index into or split a sequence, we have to touch just the root of a number of nodes along the way, to find the right spot. So maybe we really can't do this after all?

@int-e (Contributor) commented Apr 14, 2020

> That was my thought... but now I'm beginning to doubt myself. When we index into or split a sequence, we have to touch just the root of a number of nodes along the way, to find the right spot. So maybe we really can't do this after all?

Uhm, but you won't be touching most of the roots anyway, because they'll be further up the tree? Most of the decisions will be made based on structure produced by mapMulFT. So you'll home in on just one or two of the trees produced by the fill23 functions, and you will almost certainly get to the bottom of those.

@Lysxia (Contributor) commented Apr 14, 2020

> When we index into or split a sequence, we have to touch just the root of a number of nodes along the way, to find the right spot.

The node that matters for this is the one constructed by the fill23_final functions. I think those are the ones that should be lazy so the size is immediately available, if you want to avoid forcing one or two trees needlessly. AFAICT as soon as a function starts to look under the node constructed by fill23_final, it is already about to go all the way down, that's when the whole thing will be forced anyway, that's why we might as well make lift_fill23 strict.

A microscopic concern is that the strict version consumes slightly more stack space than the lazy one, but only by an amount logarithmic in lenys. There's probably a proper tail-recursive version out there, but that seems too much trouble for the moment.

@treeowl (Contributor, Author) commented Apr 14, 2020

Aha! Yes, I think that's right. We can and should construct the root lazily and the rest strictly, and indeed that seems very natural in this context.

@treeowl (Contributor, Author) commented Apr 14, 2020

And yes, I think you're right, we probably could make this tail recursive. I'm not sure it's worth the trouble, but I'll think on it at some point.

@treeowl treeowl force-pushed the seq-before branch 2 times, most recently from 35b613a to 1af5a64 Compare April 14, 2020 19:26
@treeowl (Contributor, Author) commented Apr 14, 2020

Hmm... A tail recursive version seems a bit tricky, thanks to the polymorphic recursion. I think making it tail recursive essentially means reassociating function compositions from f . (g . (h . i)) to ((f . g) . h) . i. That presumably requires an explicit representation of the composition chain. If that's right, I doubt it's worth the trouble.
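The "explicit representation of the composition chain" mentioned above might look like the following sketch (mine, not code from the PR; `Chain` and `runChain` are names I made up):

```haskell
{-# LANGUAGE GADTs #-}

-- A type-aligned list of functions, stored first-applied-first.
-- Consuming it this way effectively reassociates f . (g . (h . i))
-- into ((f . g) . h) . i, and runChain is tail recursive.
data Chain a b where
  Done :: Chain a a
  (:>) :: (a -> b) -> Chain b c -> Chain a c
infixr 5 :>

runChain :: Chain a b -> a -> b
runChain Done      x = x
runChain (f :> fs) x = runChain fs (f x)  -- tail call at every step
```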

* Rename and clarify some functions based on discussions with
  @int-e and @Lysxia.
* Make `mapMulFT` unconditionally strict in its multiplier to avoid
  one silly box.
* Update changelog with credits.
* Build the 2-3 trees more eagerly to reduce heap allocation.
@treeowl treeowl merged commit 0061862 into haskell:master Apr 15, 2020
Successfully merging this pull request may close these issues: Optimize <* for sequences

6 participants