Add vector version of mapAccumL that behaves like the list version #228

Open · newhoggy opened this issue Nov 8, 2018 · 23 comments

newhoggy commented Nov 8, 2018

I've implemented a hand-rolled version, and another two versions based on a combination of mapM and the lazy and strict versions of the State monad.

haskell-works/hw-prim#38

The benchmarks show that the hand-rolled version runs two times faster than the lazy State monad version and 16 times faster than the strict State monad version.

I found the slow performance of the strict State monad version the most surprising.

I'm aware that the version using mapM might enable fusion; however, it is a fair bit slower than a hand-rolled version that defeats fusion.

I would love to have a fusion-enabled version that runs as fast as the hand rolled version. Would that be possible?
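
For reference, here is a rough sketch of the two shapes being compared. This is a hypothetical reconstruction, not the actual code from haskell-works/hw-prim#38: a hand-rolled loop over a storable vector, and the same function expressed via DVS.mapM in the strict State monad.

{-# LANGUAGE BangPatterns #-}

import qualified Data.Vector.Storable as DVS
import qualified Data.Vector.Storable.Mutable as DVSM
import Control.Monad.ST (runST)
import Control.Monad.State.Strict (runState, state)
import Data.Tuple (swap)
import Foreign.Storable (Storable)

-- Hand-rolled: allocate the result vector up front and thread the
-- accumulator through an explicit, tight loop over the input.
mapAccumL :: (Storable a, Storable b)
          => (s -> a -> (s, b)) -> s -> DVS.Vector a -> (s, DVS.Vector b)
mapAccumL f s0 v = runST $ do
  let n = DVS.length v
  mv <- DVSM.unsafeNew n
  let go !i !s
        | i >= n    = return s
        | otherwise = do
            let (s', y) = f s (DVS.unsafeIndex v i)
            DVSM.unsafeWrite mv i y
            go (i + 1) s'
  sN <- go 0 s0
  w  <- DVS.unsafeFreeze mv
  return (sN, w)

-- Via mapM and State: thread the accumulator through the State monad,
-- relying on DVS.mapM for the traversal.
mapAccumLViaState :: (Storable a, Storable b)
                  => (s -> a -> (s, b)) -> s -> DVS.Vector a -> (s, DVS.Vector b)
mapAccumLViaState f s0 v =
  swap (runState (DVS.mapM (\x -> state (\s -> swap (f s x))) v) s0)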

cartazio (Contributor) commented Nov 9, 2018

Well, let's work it out and measure!

https://hackage.haskell.org/package/vector-0.12.0.1/docs/Data-Vector-Fusion-Stream-Monadic.html

Write a mapAccumL for the Stream type, and then it's super easy to have that be the internal implementation for the rest :)
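
Something like this minimal sketch (mapAccumLS is a made-up name; it assumes the Stream and Step constructors exported by Data.Vector.Fusion.Stream.Monadic): the accumulator rides along in the stream state, so the step function stays non-recursive. Note that, unlike the list mapAccumL, this shape drops the final accumulator when the stream finishes.

import Data.Vector.Fusion.Stream.Monadic (Stream(..), Step(..))

-- Thread the accumulator through the stream state; the step function
-- itself stays non-recursive, which is what fusion wants.
mapAccumLS :: Monad m => (acc -> a -> (acc, b)) -> acc -> Stream m a -> Stream m b
mapAccumLS f acc0 (Stream step s0) = Stream step' (acc0, s0)
  where
    step' (acc, s) = do
      r <- step s
      return $ case r of
        Yield x s' -> let (acc', y) = f acc x in Yield y (acc', s')
        Skip  s'   -> Skip (acc, s')
        Done       -> Done
    {-# INLINE step' #-}
{-# INLINE mapAccumLS #-}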

cartazio (Contributor) commented Nov 9, 2018

(Modulo a few subtleties: I think you need to make sure your stream step function is non-recursive, but that should be fine. Plus there's some other stuff, but I'm here to help, and to trick someone into learning stream fusion :) )

newhoggy (Author) commented Nov 9, 2018

I've updated my PR with an implementation of mapAccumL for streams. I couldn't see how I could do it without StateT. As soon as I use StateT, I pay a hefty 30x performance penalty.

newhoggy (Author) commented Nov 9, 2018

See haskell-works/hw-prim#38

Specifically, these benchmark results, where mapAccumLFusable is my unsuccessful attempt:

$ ./project.sh bench
hw-prim-0.6.2.19: benchmarks
Running 1 benchmarks...
Benchmark bench: RUNNING...
benchmarking medium.csv/mapAccumL               for DVS.Vector Word64
time                 2.084 ms   (2.042 ms .. 2.149 ms)
                     0.997 R²   (0.994 R² .. 0.999 R²)
mean                 2.133 ms   (2.113 ms .. 2.160 ms)
std dev              79.03 μs   (60.33 μs .. 100.7 μs)
variance introduced by outliers: 22% (moderately inflated)

benchmarking medium.csv/mapAccumLViaStrictState for DVS.Vector Word64
time                 98.55 ms   (97.85 ms .. 99.32 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 99.12 ms   (98.70 ms .. 100.4 ms)
std dev              1.053 ms   (239.3 μs .. 1.709 ms)

benchmarking medium.csv/mapAccumLViaLazyState   for DVS.Vector Word64
time                 61.51 ms   (60.62 ms .. 62.18 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 62.98 ms   (62.47 ms .. 63.64 ms)
std dev              1.086 ms   (797.4 μs .. 1.478 ms)

benchmarking medium.csv/mapAccumLFusable        for DVS.Vector Word64
time                 61.69 ms   (61.28 ms .. 61.94 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 62.22 ms   (61.86 ms .. 62.65 ms)
std dev              696.0 μs   (420.0 μs .. 1.077 ms)

cartazio (Contributor) commented Nov 9, 2018

OK, so we need to do some digging, it seems.

Also, perhaps some comparing of the Core generated for each of these... hrmm.

cartazio (Contributor) commented Nov 9, 2018

(also thanks for getting the ball rolling! )

newhoggy (Author) commented Nov 9, 2018

What are the best options for generating Core in this situation?

My prior attempts at generating Core have produced quite an unintelligible monstrosity (at least for me).

cartazio (Contributor) commented Nov 9, 2018

I don't remember the precise spelling for the flags, but ghc --show-options | grep $FOO, where FOO is a word like "dump", "simpl", "suppress", or "file", should give you the correct spellings of -ddump-simpl, -dsuppress-all, and -ddump-to-file, however they're spelt.

cartazio (Contributor) commented Nov 9, 2018

Also: adding those [0] and [1] annotations is usually driven by a reason!

(We could try to do some sort of rewrite rule to write back out to the optimized version when we can't fuse, but let's profile using -g3 and perf tools or something to understand what's happening for each of these, or something.)
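
For context on what those numbers mean: GHC's simplifier phases count down (2, 1, 0 by default), and {-# INLINE [k] f #-} keeps f from being inlined before phase k, so rewrite RULES that mention f can still match in the earlier phases; vector hides its own choices behind the INLINE_FUSED and INLINE_INNER CPP macros. A toy, self-contained illustration (all names here are made up, nothing from vector):

-- Not inlined before phase 1, so the rule below can still see literal calls
-- to 'produce' and 'consume' during the earlier phase.
produce :: Int -> [Int]
produce n = replicate n 1
{-# INLINE [1] produce #-}

consume :: [Int] -> Int
consume = length
{-# INLINE [1] consume #-}

-- This rule can only fire while 'consume' and 'produce' are still visible,
-- i.e. before they are inlined away in the later phases.
{-# RULES "consume/produce" forall n. consume (produce n) = max 0 n #-}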

newhoggy (Author) commented Nov 9, 2018

As for those annotations, I've tried to keep them the same as what the vector package does.

I've read a little bit about rewrite rules and how the ordering of inlining and rewriting matters; however, I haven't got a good way to debug these.

newhoggy (Author) commented Nov 9, 2018

BTW, thanks for jumping on and helping with this 😀

newhoggy (Author) commented Nov 9, 2018

Both were generated with ./project.sh bench --ghc-options "-ddump-simpl -dsuppress-all"

cartazio (Contributor) commented Nov 9, 2018 via email

cartazio (Contributor) commented Nov 9, 2018 via email

newhoggy (Author) commented Nov 9, 2018

Fast hand-rolled version:
MapAccumL.dump-simpl.gz

Slower lazy StateT-based version:
MapAccumLFusable.dump-simpl.gz

newhoggy (Author) commented Nov 9, 2018

After some discussion with mpickering on the GHC channel on IRC, I found that $s$wfoldrM_loop_sgzz produces a list, which $s$wfoldlM'_loop_sgzs immediately consumes.

Data.Vector.Storable.mapM calls Data.Vector.Generic.mapM, which in turn calls Data.Vector.Generic.unstreamM.

And unstreamM is implemented as:

unstreamM :: (Monad m, Vector v a) => MBundle m u a -> m (v a)
{-# INLINE_FUSED unstreamM #-}
unstreamM s = do
                xs <- MBundle.toList s
                return $ unstream $ Bundle.unsafeFromList (MBundle.size s) xs

newhoggy (Author) commented Nov 9, 2018

Could conversion to list be the cause of the performance issue?

cartazio (Contributor) commented Nov 9, 2018 via email

newhoggy (Author) commented

I tried writing an alternate mapM implementation that works on Bundle directly, but I'm stuck on how to deal with the effect m when trying to implement fc.

mapM :: forall m v a b . Monad m => (a -> m b) -> Bundle m v a -> Bundle m v b
{-# INLINE_FUSED mapM #-}
mapM f Bundle { sElems = s, sChunks = c, sVector = v, sSize = n} = Bundle
  { sElems  = S.mapM f s
  , sChunks = S.mapM fc c
  , sVector = Nothing
  , sSize   = n
  }
  where fc :: Chunk v a -> m (Chunk v b)
        fc = undefined
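
For what it's worth, the sticking point seems to be the shape of the chunk stream. If I'm reading the vector internals correctly (this is a paraphrase from memory, not a verbatim quote; check the vector source for the real definition), a Chunk only exposes its elements through a writer that is polymorphic in a PrimMonad:

{-# LANGUAGE RankNTypes #-}

import Control.Monad.Primitive (PrimMonad, PrimState)
import Data.Vector.Generic (Vector, Mutable)

-- Approximate shape of Chunk from Data.Vector.Fusion.Bundle.Monadic:
-- a length plus a filler that writes the elements into a mutable vector.
data Chunk v a = Chunk Int (forall n. (PrimMonad n, Vector v a)
                              => Mutable v (PrimState n) a -> n ())

The element type a only shows up underneath that polymorphic continuation, so a function of type a -> m b with an arbitrary outer effect m has nothing to apply itself to, which matches the stuck fc above.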

treeowl (Contributor) commented Nov 14, 2018

One question: how does producing the final seed affect fusion on the other side? Another question: how do the general problems fusing unstreamM play here, and is there any way to work around those?

cartazio (Contributor) commented

This fell off the wagon but should be revisited at some point.

newhoggy (Author) commented

Yes please 😁

newhoggy (Author) commented Feb 3, 2020

Anything I can do to help move this forward?
