
Conversation

Rufflewind
Member

It works similarly to alter, but is more general (it allows an arbitrary Functor instead of just Identity). For an arbitrary combined lookup + insert/delete in a monad, this might be a little faster than doing the operations separately.
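
For instance, here is a sketch of the kind of combined operation this enables, using the at added in this patch (at :: (Functor f, Ord k) => k -> (Maybe a -> f (Maybe a)) -> Map k a -> f (Map k a)); the helper name and the IO logging are just illustrative:

import qualified Data.Map as Map

-- Look up a counter, report whether it was present, and bump or create it,
-- all in a single traversal of the map.
bump :: String -> Map.Map String Int -> IO (Map.Map String Int)
bump k = Map.at k $ \mv -> do
  putStrLn (k ++ " was " ++ maybe "absent" (const "present") mv)
  pure (Just (maybe 1 (+ 1) mv))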

I did some performance benchmarks with GHC 7.10.3, and as far as I can tell it performs just as well as alter and lookup given the right choice of Functor.

(It is slower by 50% if you try to use (,) () instead of Identity. I have not figured out why this is the case -- I guess it is harder to optimize away than a trivial wrapper.)

@foxik
Contributor

foxik commented Mar 28, 2016

Hi,

thanks for the patch, including benchmarks and tests!

I am not totally convinced about the performance of at. Consider the results of your benchmarks:

"lookup absent"                 3.993e-4
"lookup present"                3.176e-4
"at lookup absent"              3.937e-4
"at lookup present"             3.437e-4
"insert absent"                 1.504e-3
"insert present"                1.038e-3
"update absent"                 1.278e-3
"update present"                1.018e-3
"delete absent"                 1.279e-3
"delete present"                1.283e-3
"at alter absent"               1.397e-3
"at alter insert"               1.674e-3
"at alter update"               1.114e-3
"at alter delete"               1.463e-3

There are four "patterns" of using at, all of which can be composed from existing functions:

  • at alter absent vs lookup absent: 1.39ms vs 0.39ms, +250%
  • at alter insert vs lookup absent+insert absent: 1.67ms vs 1.89ms, -15%
  • at alter update vs lookup present+update present: 1.11ms vs 1.33ms -17%
  • at alter delete vs lookup present+delete present: 1.46ms vs 1.6ms -9%

I agree that this is a minor speedup in three of the four cases. Nevertheless, that is for the trivial functor; for non-trivial ones the results will be much worse, as you comment in the log.
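
For reference, a sketch of what those compositions look like with the existing public API (the helper names are just illustrative):

import qualified Data.Map as Map

-- "lookup + insert": the "alter insert" pattern done in two passes.
alterInsert :: Ord k => (Maybe a -> a) -> k -> Map.Map k a -> Map.Map k a
alterInsert f k m = Map.insert k (f (Map.lookup k m)) m

-- "lookup + delete": the "alter delete" pattern done in two passes.
alterDelete :: Ord k => (a -> Bool) -> k -> Map.Map k a -> Map.Map k a
alterDelete p k m = case Map.lookup k m of
  Just v | p v -> Map.delete k m
  _            -> m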

I think it is harmful to wrap each recursion of go in the functor, because for non-trivial functors like (,) (), GHC probably allocates on every level of recursion. Therefore, I think your at will be slower than the At instance from lens for non-trivial functors, because the former "embeds" O(log n) computations in the functor, while the latter embeds only two.
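
To make the concern concrete, the patch has roughly this shape (a sketch, not the exact code, written against Data.Map's internal Tip/Bin/balance/glue): every recursive step rebuilds its subtree inside f, so a non-trivial functor pays an allocation per level.

at :: (Functor f, Ord k) => k -> (Maybe a -> f (Maybe a)) -> Map k a -> f (Map k a)
at k f = go
  where
    go Tip = maybe Tip (singleton k) <$> f Nothing
    go (Bin sx kx x l r) = case compare k kx of
      LT -> (\l' -> balance kx x l' r) <$> go l   -- fmap on every level
      GT -> (\r' -> balance kx x l r') <$> go r
      EQ -> maybe (glue l r) (\x' -> Bin sx kx x' l r) <$> f (Just x)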

If you want to pursue this, we need to make sure the performance is reasonable. Therefore, please add additional benchmarks -- take the implementation of At.at from lens and your at, run both on the four cases (even-even, even-odd, odd-even, odd-odd), and use some non-trivial functor (either (,) (), or Maybe, or []).
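
For reference, lens's At.at for Map is essentially a single lookup followed by an insert or delete, so only two values are ever built in the functor; roughly (a sketch, not the exact lens code):

import qualified Data.Map as Map

atLens :: (Functor f, Ord k) => k -> (Maybe a -> f (Maybe a)) -> Map.Map k a -> f (Map.Map k a)
atLens k f m = fmap reinsert (f mv)
  where
    mv = Map.lookup k m
    reinsert Nothing   = maybe m (const (Map.delete k m)) mv
    reinsert (Just v') = Map.insert k v' m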

Also, if we are going to add at, we should add it to IntMap as well, because we try to provide the closest possible set of operations for Map and IntMap.

@foxik
Contributor

foxik commented Mar 28, 2016

BTW, the benchmarks may of course show that I am not right and that your at is already better than the At instance from lens :-)

@Rufflewind
Member Author

I've added comparisons with the existing lens implementation.

at alter absent vs lookup absent: 1.39ms vs 0.39ms, +250%

I would compare at alter absent with alter absent, and at lookup absent with lookup absent. The first version uses the Identity functor, while the second one uses the Const functor and is therefore much faster. On my machine, the difference between at lookup absent and lookup absent is negligible.
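
To make that concrete (a sketch, assuming the at from this patch), the choice of functor determines how much of the map gets rebuilt:

import Control.Applicative (Const(..))
import Data.Functor.Identity (Identity(..))
import qualified Data.Map as Map

-- With Const the map is never rebuilt, so this should cost about as much as lookup.
lookupViaAt :: Ord k => k -> Map.Map k a -> Maybe a
lookupViaAt k = getConst . Map.at k Const

-- With Identity the spine along the path is reconstructed, as alter does.
alterViaAt :: Ord k => (Maybe a -> Maybe a) -> k -> Map.Map k a -> Map.Map k a
alterViaAt f k = runIdentity . Map.at k (Identity . f)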

Comparing against the existing lens implementation of at, all of the tests are either neutral or favor the new implementation with modest speedups, except insert absent, in which the new implementation is slower by ~8%:

benchmarking insert absent
time                 672.4 μs   (671.1 μs .. 674.4 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 673.9 μs   (672.3 μs .. 677.3 μs)
std dev              6.970 μs   (3.826 μs .. 13.03 μs)

benchmarking at insert absent
time                 725.4 μs   (722.6 μs .. 731.2 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 724.1 μs   (722.3 μs .. 727.9 μs)
std dev              8.809 μs   (4.650 μs .. 16.05 μs)

benchmarking atLens insert absent
time                 681.2 μs   (679.9 μs .. 682.6 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 684.4 μs   (682.8 μs .. 687.2 μs)
std dev              6.973 μs   (5.176 μs .. 9.875 μs)

I'm not entirely sure why this is the case. In contrast, at insert present is slightly faster than insert present and atLens insert present.

It'd be nice if there were some clever way to avoid the repeated fmap applications.

@foxik
Contributor

foxik commented Mar 30, 2016

Thanks for the benchmarks. On my machine (GHC 7.6, I use Debian Stable), the benchmarks from 584ce4b give the following results:

  • using runIdentity (i.e., as in your benchmark):
                    at       atLens
"* lookup absent"   0.393ms  0.364ms
"* lookup present"  0.335ms  0.306ms
"* insert absent"   1.600ms  1.567ms
"* insert present"  1.209ms  1.074ms
"* alter absent"    1.349ms  0.391ms
"* alter insert"    1.633ms  1.520ms
"* alter update"    1.064ms  1.584ms
"* alter delete"    1.393ms  1.629ms

Here atLens is faster, except for alter update and alter delete.

  • using fromJust instead of runIdentity:
                    at       atLens
"* lookup absent"   0.375ms  0.374ms
"* lookup present"  0.327ms  0.317ms
"* insert absent"   2.468ms  1.560ms
"* insert present"  1.769ms  1.067ms
"* alter absent"    2.116ms  0.407ms
"* alter insert"    2.468ms  1.538ms
"* alter update"    1.805ms  1.586ms
"* alter delete"    1.948ms  1.646ms

Now atLens is faster in all cases; note the big difference for alter absent and also alter insert.

  • using head instead of runIdentity:
                    at       atLens
"* lookup absent"   0.376ms  0.373ms
"* lookup present"  0.319ms  0.319ms
"* insert absent"   3.027ms  1.581ms
"* insert present"  2.234ms  1.093ms
"* alter absent"    3.173ms  0.398ms
"* alter insert"    3.096ms  1.532ms
"* alter update"    2.276ms  1.578ms
"* alter delete"    2.447ms  1.624ms

Once again, atLens is always faster, and the difference for alter absent and alter insert is even bigger.

Altogether, you can see that the atLens implementation is largely independent of the choice of functor, while at gets slower the "bigger" the functor is, which is probably caused by the increased number of allocations.

With this performance profile, atLens is probably the preferred implementation.

@Rufflewind
Member Author

I think the Identity functor in transformers is probably not defined in the most efficient way. In base-4.8 Identity makes use of coerce, so that is probably why yours favors atLens.

I personally think (at least based on my benchmarks) it's kind of a mixed bag: better in some situations but worse in others.

I can't really think of a way to improve it – if only there were some way to capture the call stack efficiently and do an fmap at the very end…

@Rufflewind Rufflewind closed this Mar 30, 2016
@foxik
Contributor

foxik commented Mar 30, 2016

The problem is that the new implementation is better only for trivial functors -- but I would expect the use case to be some nontrivial functor (IO, lists, Maybe, ...). Therefore, the cost of traversing the path to the element in question twice is better than allocating each modified subtree in the functor. Note that even if we traverse the path twice, the first is a "read-only" lookup, which is an order of magnitude faster than the second, where we rebuild the tree if required.

@Rufflewind
Member Author

Therefore, the cost of traversing the path to the element in question twice is better than allocating each modified subtree in the functor.

That's certainly true and believable for integers. But what about data types with nontrivial Ord instances?

@foxik
Contributor

foxik commented Mar 30, 2016

That is of course right. However, I would hope that for types with a costly Ord instance, a HashMap would be used.

Generally, I prefer fewer allocations when choosing between allocations and time, but that is more of a personal opinion.

It is unfortunate though that we cannot have "the best" of both implementations -- traverse the tree only once, and allocate only two values in the functor (the user function's value and the complete result). If you have an idea how to achieve that, do not hesitate to tell me :-) But we would need to use the stack for representing the traversed path; using some kind of explicit zipper would be bad for performance.

@treeowl treeowl reopened this Apr 28, 2016
@treeowl
Contributor

treeowl commented Apr 28, 2016

@Rufflewind, do you think you could benchmark this with a Coyoneda improvement? That is, write your current at term with the type signature

atCoyoneda :: Ord k =>
      k -> (Maybe a -> Coyoneda f (Maybe a)) -> Map k a -> Coyoneda f (Map k a)

and write at to use atCoyoneda with liftCoyoneda and lowerCoyoneda? I'd be interested to see if this would be fast enough to be useful. Prefer INLINABLE to SPECIALIZE, I think.
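
For what it's worth, the wiring would look roughly like this (a sketch, assuming atCoyoneda has the signature above; liftCoyoneda and lowerCoyoneda come from Data.Functor.Coyoneda in the kan-extensions package):

import Data.Functor.Coyoneda (liftCoyoneda, lowerCoyoneda)

at :: (Functor f, Ord k) => k -> (Maybe a -> f (Maybe a)) -> Map k a -> f (Map k a)
at k f m = lowerCoyoneda (atCoyoneda k (liftCoyoneda . f) m)
{-# INLINABLE at #-}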

@treeowl
Contributor

treeowl commented Apr 28, 2016

Actually... Map is strict in its structure, so use Yoneda instead of Coyoneda, I think.
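
The intuition, for reference (a sketch of the type from Data.Functor.Yoneda in kan-extensions; requires RankNTypes): fmap on Yoneda only composes plain functions, so the O(log n) fmaps in the recursion never touch the underlying functor, and the single real fmap happens when lifting.

{-# LANGUAGE RankNTypes #-}

newtype Yoneda f a = Yoneda { runYoneda :: forall b. (a -> b) -> f b }

instance Functor (Yoneda f) where
  fmap g (Yoneda k) = Yoneda (\h -> k (h . g))  -- no fmap on f here

liftYoneda :: Functor f => f a -> Yoneda f a
liftYoneda fa = Yoneda (\h -> fmap h fa)        -- the only fmap on f

lowerYoneda :: Yoneda f a -> f a
lowerYoneda (Yoneda k) = k id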

@Rufflewind
Member Author

Assuming I used Yoneda properly, here are the benchmarks for f7bb3ab

benchmarking lookup absent
time                 119.6 μs   (119.2 μs .. 119.8 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 120.6 μs   (120.6 μs .. 120.7 μs)
std dev              94.72 ns   (80.77 ns .. 117.2 ns)

benchmarking at lookup absent
time                 147.6 μs   (147.5 μs .. 147.6 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 147.6 μs   (147.5 μs .. 147.6 μs)
std dev              101.9 ns   (32.49 ns .. 176.3 ns)

benchmarking atLens lookup absent
time                 120.2 μs   (120.1 μs .. 120.2 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 120.1 μs   (120.0 μs .. 120.1 μs)
std dev              148.2 ns   (127.6 ns .. 185.6 ns)

benchmarking lookup present
time                 103.9 μs   (103.8 μs .. 103.9 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 103.8 μs   (103.8 μs .. 103.9 μs)
std dev              61.96 ns   (51.46 ns .. 76.27 ns)

benchmarking at lookup present
time                 128.0 μs   (128.0 μs .. 128.0 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 128.0 μs   (128.0 μs .. 128.0 μs)
std dev              24.95 ns   (20.47 ns .. 33.91 ns)

benchmarking atLens lookup present
time                 104.1 μs   (104.0 μs .. 104.1 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 104.1 μs   (104.0 μs .. 104.1 μs)
std dev              97.35 ns   (74.08 ns .. 159.0 ns)

benchmarking insert absent
time                 683.1 μs   (682.4 μs .. 684.0 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 683.1 μs   (681.9 μs .. 684.2 μs)
std dev              3.639 μs   (2.867 μs .. 5.151 μs)

benchmarking at insert absent
time                 1.225 ms   (1.223 ms .. 1.227 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.225 ms   (1.223 ms .. 1.228 ms)
std dev              8.203 μs   (6.727 μs .. 10.13 μs)

benchmarking atLens insert absent
time                 683.2 μs   (682.4 μs .. 684.0 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 683.2 μs   (682.0 μs .. 684.4 μs)
std dev              4.125 μs   (3.093 μs .. 5.908 μs)

benchmarking insert present
time                 499.4 μs   (498.7 μs .. 500.0 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 500.1 μs   (499.4 μs .. 500.9 μs)
std dev              2.449 μs   (1.920 μs .. 3.405 μs)

benchmarking at insert present
time                 849.4 μs   (848.1 μs .. 850.7 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 849.9 μs   (848.8 μs .. 851.1 μs)
std dev              4.178 μs   (3.464 μs .. 5.206 μs)

benchmarking atLens insert present
time                 499.3 μs   (498.4 μs .. 500.4 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 500.0 μs   (499.3 μs .. 500.6 μs)
std dev              2.126 μs   (1.820 μs .. 2.580 μs)

benchmarking alter absent
time                 511.3 μs   (510.8 μs .. 511.9 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 511.3 μs   (510.4 μs .. 512.1 μs)
std dev              2.876 μs   (2.303 μs .. 3.750 μs)

benchmarking at alter absent
time                 951.1 μs   (949.4 μs .. 952.8 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 951.3 μs   (949.5 μs .. 953.0 μs)
std dev              5.699 μs   (4.259 μs .. 7.794 μs)

benchmarking atLens alter absent
time                 131.6 μs   (131.6 μs .. 131.7 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 131.6 μs   (131.5 μs .. 131.6 μs)
std dev              80.66 ns   (67.07 ns .. 98.03 ns)

benchmarking alter insert
time                 712.2 μs   (711.2 μs .. 713.1 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 711.8 μs   (710.7 μs .. 713.0 μs)
std dev              3.865 μs   (2.949 μs .. 5.534 μs)

benchmarking at alter insert
time                 1.204 ms   (1.203 ms .. 1.205 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.204 ms   (1.202 ms .. 1.207 ms)
std dev              9.006 μs   (6.225 μs .. 12.46 μs)

benchmarking atLens alter insert
time                 687.3 μs   (686.0 μs .. 688.5 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 685.1 μs   (683.3 μs .. 686.5 μs)
std dev              5.149 μs   (4.070 μs .. 6.961 μs)

benchmarking alter update
time                 455.5 μs   (455.0 μs .. 456.2 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 455.7 μs   (454.7 μs .. 456.9 μs)
std dev              3.377 μs   (2.570 μs .. 4.836 μs)

benchmarking at alter update
time                 859.3 μs   (858.3 μs .. 860.4 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 859.2 μs   (857.9 μs .. 860.8 μs)
std dev              4.927 μs   (4.124 μs .. 6.263 μs)

benchmarking atLens alter update
time                 612.3 μs   (611.5 μs .. 613.1 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 612.2 μs   (610.9 μs .. 613.3 μs)
std dev              3.888 μs   (3.079 μs .. 5.166 μs)

benchmarking alter delete
time                 483.8 μs   (483.2 μs .. 484.3 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 483.7 μs   (483.1 μs .. 484.3 μs)
std dev              1.956 μs   (1.646 μs .. 2.478 μs)

benchmarking at alter delete
time                 905.4 μs   (904.2 μs .. 906.6 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 905.8 μs   (904.4 μs .. 907.3 μs)
std dev              4.996 μs   (3.911 μs .. 6.898 μs)

benchmarking atLens alter delete
time                 564.6 μs   (564.0 μs .. 565.2 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 565.2 μs   (564.4 μs .. 566.4 μs)
std dev              3.334 μs   (2.422 μs .. 5.184 μs)

@Rufflewind
Member Author

Rufflewind commented Apr 28, 2016

Assuming I did it right, the results unfortunately aren't too promising. I didn't do any tests with a nontrivial Functor, though, so it might still win in those situations.

I think I actually tried this experiment a while back, but by directly passing the fmapped function around and combining with (.), which -- if my understanding is correct -- is what Yoneda does under the hood. (I didn't post the results because it was quite slow.)
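
The shape of that experiment is roughly the following (a sketch against the internal constructors, not the exact code): the rebuilding function is accumulated with (.) on the way down and applied with a single fmap at the end, but each level still allocates a closure.

at :: (Functor f, Ord k) => k -> (Maybe a -> f (Maybe a)) -> Map k a -> f (Map k a)
at k f = go id
  where
    go rebuild Tip = rebuild . maybe Tip (singleton k) <$> f Nothing
    go rebuild (Bin sx kx x l r) = case compare k kx of
      LT -> go (rebuild . (\l' -> balance kx x l' r)) l  -- one closure per level
      GT -> go (rebuild . (\r' -> balance kx x l r')) r
      EQ -> rebuild . maybe (glue l r) (\x' -> Bin sx kx x' l r) <$> f (Just x)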

From a low-level perspective, composing functions with (.) (or indirectly via Yoneda) probably requires an allocation each time, so in effect we are manually reconstructing the call stack as a linked-list-like data structure. It would be more efficient if there were a way (like a GHC primitive) to capture a portion of the call stack as a function:

-- Hypothetical primitive: run the given computation until it calls the
-- supplied continuation, then return the value passed to that continuation
-- together with a function representing "the rest of the computation".
setjmp :: ((r -> a) -> b) -> (r, a -> b)

at :: (Functor f, Ord k) => k -> (Maybe a -> f (Maybe a)) -> Map k a -> f (Map k a)
at k f m = g `fmap` f r
  where
    -- r is the Maybe value found at the key; g rebuilds the map from the
    -- user's answer, reusing the captured portion of go's call stack.
    (r, g) = setjmp $ \longjmp -> go longjmp k m

    go longjmp !k Tip = case longjmp Nothing of
        Nothing -> Tip
        Just x  -> singleton k x

    go longjmp !k (Bin sx kx x l r) = case compare k kx of
        LT -> balance kx x (go longjmp k l) r
        GT -> balance kx x l (go longjmp k r)
        EQ -> case longjmp (Just x) of
            Just x' -> Bin sx kx x' l r
            Nothing -> glue l r

It's a rather unsafe combinator though, because the "function" it returns can only be called once – how would one enforce that?

@treeowl
Contributor

treeowl commented May 1, 2016

@Rufflewind, I'd very much like to have some implementation of this operation to deal with keys that are somewhat more expensive than Int in situations where only Ord is available or where anything better turns out to be too complicated. One option might be to perform the lookup and record the path taken. We maintain an invariant that size l <= 3 * size r, and the other way around, at each node. Is that tight enough to use 63 bits to represent the lefts and rights, with a single bit to reveal the path length? My arithmetic brain isn't on today. If so, we'd have just a couple extra bitops and a fast Word64 equality test per branch, with new insertUsingPath and deleteUsingPath for the cases other than Nothing-Nothing.
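
Roughly this (a very rough sketch against the internal constructors; insertUsingPath, deleteUsingPath, and handling of paths longer than 63 bits are omitted):

{-# LANGUAGE BangPatterns #-}

import Data.Bits (shiftL, (.|.))
import Data.Word (Word64)

-- Walk down once, recording one bit per branch; the leading 1 bit marks
-- where the path starts, so its position encodes the path length.
lookupTrace :: Ord k => k -> Map k a -> (Maybe a, Word64)
lookupTrace k = go 1
  where
    go !path Tip = (Nothing, path)
    go !path (Bin _ kx x l r) = case compare k kx of
      LT -> go (path `shiftL` 1) l           -- 0 bit: went left
      GT -> go ((path `shiftL` 1) .|. 1) r   -- 1 bit: went right
      EQ -> (Just x, path)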

@treeowl
Contributor

treeowl commented May 1, 2016

Hrmm. No, one word is not enough for the biggest maps, but it is big enough for fairly large ones. Either we could use a two-word implementation (perhaps based on ideas in data-dword), or we could use different implementations depending on map size. I can try to code this up tomorrow, maybe, if no one beats me to it.

@treeowl
Contributor

treeowl commented May 3, 2016

@Rufflewind, I fleshed out the bit queue idea in #215. It seems to give fairly decent performance. That said, I am a bit-twiddling novice and my horrible code could probably be made substantially faster by someone who knows more about such things. Do you think you could take a look?

@treeowl treeowl mentioned this pull request May 4, 2016
@Rufflewind
Member Author

If we have a W-bit address space (2^W addressable bytes), and each node requires 3W/8 bytes, then there can be at most N=2^W/(3W/8) nodes.

We maintain an invariant that size l <= 3 * size r

According to this, the largest and most imbalanced tree would have 3N/4 nodes on the left and N/4 nodes on the right. Applying this recursively, one obtains the approximate equation

N (3/4)^H = 3

where H is the height (following the larger subtree until roughly 3 nodes remain). Hence:

H = log(2^W/(9W/8))/log(4/3)

which is ≈140 for 64-bit and ≈65 for 32-bit. (The ratio H/W stays between 2 and 1/log2(4/3) ≈ 2.41.)
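
A quick sanity check of those numbers (a sketch; W is the address width in bits, with the 3W/8-bytes-per-node assumption above):

worstHeight :: Double -> Double
worstHeight w = logBase (4/3) (2 ** w / (9 * w / 8))

-- worstHeight 64 ≈ 139.3, worstHeight 32 ≈ 64.7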

@treeowl
Contributor

treeowl commented May 4, 2016

Even 64-bit machines don't have 64-bit address spaces (I think maybe 48 or 52 bits these days?), so I don't think we have to worry about maps quite so large. Also, I believe 3 words substantially underestimates the size of a node. There's a key pointer, a value pointer, two children, and I think another couple of words of overhead. The key pointers all point to distinct heap objects, which themselves need memory. Of course, address spaces will continue to grow, but I'm hopeful that Haskell will support SSE2 instructions, making a larger queue practical, by the time that happens. It'd be worth seeing how much a safety check on the size would cost.

@treeowl treeowl closed this May 6, 2016