Add vector version of mapAccumL that behaves like the list version #228

Open · newhoggy opened this issue Nov 8, 2018 · 23 comments

newhoggy commented Nov 8, 2018

I've implemented a hand-rolled version, and another two versions based on a combination of mapM and the lazy and strict versions of the State monad.

haskell-works/hw-prim#38

The benchmarks show that the hand-rolled version runs two times faster than the lazy State monad version and 16 times faster than the strict State monad version.

I found the slow performance of the strict State monad version the most surprising.

I'm aware that the version using mapM might enable fusion; however, it is a fair bit slower than a hand-rolled version that defeats fusion.

I would love to have a fusion-enabled version that runs as fast as the hand rolled version. Would that be possible?
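
For reference, here is a rough sketch of the two shapes being compared. This is a hypothetical reconstruction, not the actual code from haskell-works/hw-prim#38: a hand-rolled loop over a storable vector, and the same function expressed via DVS.mapM in the strict State monad.

{-# LANGUAGE BangPatterns #-}

import qualified Data.Vector.Storable as DVS
import qualified Data.Vector.Storable.Mutable as DVSM
import Control.Monad.ST (runST)
import Control.Monad.State.Strict (runState, state)
import Data.Tuple (swap)
import Foreign.Storable (Storable)

-- Hand-rolled: allocate the result vector up front and thread the
-- accumulator through an explicit, tight loop over the input.
mapAccumL :: (Storable a, Storable b)
          => (s -> a -> (s, b)) -> s -> DVS.Vector a -> (s, DVS.Vector b)
mapAccumL f s0 v = runST $ do
  let n = DVS.length v
  mv <- DVSM.unsafeNew n
  let go !i !s
        | i >= n    = return s
        | otherwise = do
            let (s', y) = f s (DVS.unsafeIndex v i)
            DVSM.unsafeWrite mv i y
            go (i + 1) s'
  sN <- go 0 s0
  w  <- DVS.unsafeFreeze mv
  return (sN, w)

-- Via mapM and State: thread the accumulator through the State monad,
-- relying on DVS.mapM for the traversal.
mapAccumLViaState :: (Storable a, Storable b)
                  => (s -> a -> (s, b)) -> s -> DVS.Vector a -> (s, DVS.Vector b)
mapAccumLViaState f s0 v =
  swap (runState (DVS.mapM (\x -> state (\s -> swap (f s x))) v) s0)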

cartazio (Contributor) commented Nov 9, 2018

Well, let's work it out and measure!

https://hackage.haskell.org/package/vector-0.12.0.1/docs/Data-Vector-Fusion-Stream-Monadic.html

Write a mapAccumL for the Stream type, and then it's super easy to have that be the internal implementation for the rest :)
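
Something like this minimal sketch (mapAccumLS is a made-up name; it assumes the Stream and Step constructors exported by Data.Vector.Fusion.Stream.Monadic): the accumulator rides along in the stream state, so the step function stays non-recursive. Note that, unlike the list mapAccumL, this shape drops the final accumulator when the stream finishes.

import Data.Vector.Fusion.Stream.Monadic (Stream(..), Step(..))

-- Thread the accumulator through the stream state; the step function
-- itself stays non-recursive, which is what fusion wants.
mapAccumLS :: Monad m => (acc -> a -> (acc, b)) -> acc -> Stream m a -> Stream m b
mapAccumLS f acc0 (Stream step s0) = Stream step' (acc0, s0)
  where
    step' (acc, s) = do
      r <- step s
      return $ case r of
        Yield x s' -> let (acc', y) = f acc x in Yield y (acc', s')
        Skip  s'   -> Skip (acc, s')
        Done       -> Done
    {-# INLINE step' #-}
{-# INLINE mapAccumLS #-}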

cartazio (Contributor) commented Nov 9, 2018

(Modulo a few subtleties: I think you need to make sure your stream step function is non-recursive, but that should be fine. Plus there's some other stuff, but I'm here to help, and to trick someone into learning stream fusion :) )

newhoggy (Author) commented Nov 9, 2018

I've updated my PR with an implementation of mapAccumL for streams. I couldn't see how I could do it without StateT. As soon as I use StateT, I pay a hefty 30x performance penalty.

newhoggy (Author) commented Nov 9, 2018

See haskell-works/hw-prim#38

Specifically, these benchmark results, where mapAccumLFusable is my unsuccessful attempt:

$ ./project.sh bench
hw-prim-0.6.2.19: benchmarks
Running 1 benchmarks...
Benchmark bench: RUNNING...
benchmarking medium.csv/mapAccumL               for DVS.Vector Word64
time                 2.084 ms   (2.042 ms .. 2.149 ms)
                     0.997 R²   (0.994 R² .. 0.999 R²)
mean                 2.133 ms   (2.113 ms .. 2.160 ms)
std dev              79.03 μs   (60.33 μs .. 100.7 μs)
variance introduced by outliers: 22% (moderately inflated)

benchmarking medium.csv/mapAccumLViaStrictState for DVS.Vector Word64
time                 98.55 ms   (97.85 ms .. 99.32 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 99.12 ms   (98.70 ms .. 100.4 ms)
std dev              1.053 ms   (239.3 μs .. 1.709 ms)

benchmarking medium.csv/mapAccumLViaLazyState   for DVS.Vector Word64
time                 61.51 ms   (60.62 ms .. 62.18 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 62.98 ms   (62.47 ms .. 63.64 ms)
std dev              1.086 ms   (797.4 μs .. 1.478 ms)

benchmarking medium.csv/mapAccumLFusable        for DVS.Vector Word64
time                 61.69 ms   (61.28 ms .. 61.94 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 62.22 ms   (61.86 ms .. 62.65 ms)
std dev              696.0 μs   (420.0 μs .. 1.077 ms)

cartazio (Contributor) commented Nov 9, 2018

OK, so we need to do some digging, it seems.

Also, perhaps some comparing of the Core generated for each of these... hrmm.

cartazio (Contributor) commented Nov 9, 2018

(also thanks for getting the ball rolling! )

newhoggy (Author) commented Nov 9, 2018

What are the best options for generating Core in this situation?

My prior attempts at generating Core have produced quite an unintelligible monstrosity (at least for me).

cartazio (Contributor) commented Nov 9, 2018

I don't remember the precise spelling for the flags, but ghc --show-options | grep $FOO, where FOO is a word like "dump", "simpl", "suppress", or "file", should give you the correct spellings of -ddump-simpl, -dsuppress-all, and -ddump-to-file, however they're spelt.

cartazio (Contributor) commented Nov 9, 2018

Also: adding those [0] and [1] annotations is usually driven by a reason!

(We could try to do some sort of rewrite rule to write back out to the optimized version when we can't fuse, but let's profile using -g3 and perf tools or something to understand what's happening for each of these, or something.)
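
For context on what those numbers mean: GHC's simplifier phases count down (2, 1, 0 by default), and {-# INLINE [k] f #-} keeps f from being inlined before phase k, so rewrite RULES that mention f can still match in the earlier phases; vector hides its own choices behind the INLINE_FUSED and INLINE_INNER CPP macros. A toy, self-contained illustration (all names here are made up, nothing from vector):

-- Not inlined before phase 1, so the rule below can still see literal calls
-- to 'produce' and 'consume' during the earlier phase.
produce :: Int -> [Int]
produce n = replicate n 1
{-# INLINE [1] produce #-}

consume :: [Int] -> Int
consume = length
{-# INLINE [1] consume #-}

-- This rule can only fire while 'consume' and 'produce' are still visible,
-- i.e. before they are inlined away in the later phases.
{-# RULES "consume/produce" forall n. consume (produce n) = max 0 n #-}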

newhoggy (Author) commented Nov 9, 2018

As for those annotations, I've tried to keep them the same as what the vector package does.

I've read a little bit about rewrite rules and how the ordering of inlining and rewriting matters; however, I haven't got a good way to debug these.

newhoggy (Author) commented Nov 9, 2018

BTW, thanks for jumping on and helping with this 😀

newhoggy (Author) commented Nov 9, 2018

Both were generated with ./project.sh bench --ghc-options "-ddump-simpl -dsuppress-all"

cartazio (Contributor) commented Nov 9, 2018 via email

cartazio (Contributor) commented Nov 9, 2018 via email

newhoggy (Author) commented Nov 9, 2018

Fast hand-rolled version:
MapAccumL.dump-simpl.gz

Slower lazy StateT-based version:
MapAccumLFusable.dump-simpl.gz

newhoggy (Author) commented Nov 9, 2018

After some discussion with mpickering on the GHC channel on IRC, I found that $s$wfoldrM_loop_sgzz produces a list, which $s$wfoldlM'_loop_sgzs immediately consumes.

Data.Vector.Storable.mapM calls Data.Vector.Generic.mapM, which in turn calls Data.Vector.Generic.unstreamM.

And unstreamM is implemented as:

unstreamM :: (Monad m, Vector v a) => MBundle m u a -> m (v a)
{-# INLINE_FUSED unstreamM #-}
unstreamM s = do
                xs <- MBundle.toList s
                return $ unstream $ Bundle.unsafeFromList (MBundle.size s) xs

newhoggy (Author) commented Nov 9, 2018

Could conversion to list be the cause of the performance issue?

cartazio (Contributor) commented Nov 9, 2018 via email

newhoggy (Author) commented

I tried writing an alternate mapM implementation that works on Bundle directly, but I'm stuck on how to deal with the effect m when trying to implement fc.

mapM :: forall m v a b . Monad m => (a -> m b) -> Bundle m v a -> Bundle m v b
{-# INLINE_FUSED mapM #-}
mapM f Bundle { sElems = s, sChunks = c, sVector = v, sSize = n} = Bundle
  { sElems  = S.mapM f s
  , sChunks = S.mapM fc c
  , sVector = Nothing
  , sSize   = n
  }
  where fc :: Chunk v a -> m (Chunk v b)
        fc = undefined
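
For what it's worth, the sticking point seems to be the shape of the chunk stream. If I'm reading the vector internals correctly (this is a paraphrase from memory, not a verbatim quote; check the vector source for the real definition), a Chunk only exposes its elements through a writer that is polymorphic in a PrimMonad:

{-# LANGUAGE RankNTypes #-}

import Control.Monad.Primitive (PrimMonad, PrimState)
import Data.Vector.Generic (Vector, Mutable)

-- Approximate shape of Chunk from Data.Vector.Fusion.Bundle.Monadic:
-- a length plus a filler that writes the elements into a mutable vector.
data Chunk v a = Chunk Int (forall n. (PrimMonad n, Vector v a)
                              => Mutable v (PrimState n) a -> n ())

The element type a only shows up underneath that polymorphic continuation, so a function of type a -> m b with an arbitrary outer effect m has nothing to apply itself to, which matches the stuck fc above.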

treeowl (Contributor) commented Nov 14, 2018

One question: how does producing the final seed affect fusion on the other side? Another question: how do the general problems fusing unstreamM play here, and is there any way to work around those?

cartazio (Contributor) commented

This fell off the wagon but should be revisited at some point.

newhoggy (Author) commented

Yes please 😁

newhoggy (Author) commented Feb 3, 2020

Anything I can do to help move this forward?
