
Optimize generic version of basicSet.#196

Merged
lehins merged 1 commit into haskell:master from OlivierSohn:feature/195-optimize-basicSet
Jul 11, 2020

Conversation

@OlivierSohn
Contributor

@OlivierSohn OlivierSohn commented Jan 2, 2018

Closes #195: instead of doing N reads on the vector, where N is the length of the vector, we now do zero reads. The order of writes is unchanged.

I looked at other places and saw no other potential optimizations of that sort.

I also updated the commented implementation in Mutable.hs to keep the two in sync.

EDIT: It seems to me that the CI errors are not related to these changes.

EDIT 2: Added a benchmark function; performance is 5% to 15% better with the new version.

EDIT 3: Updated some bounds in order to compile with stack lts-10.2.
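For illustration, here is a self-contained sketch of the write-only approach described above, using the public `Data.Vector` API rather than the internal `basicSet` (the name `fillLoop` is hypothetical, chosen for this sketch):

```haskell
import Control.Monad.ST (runST)
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as M

-- Write-only fill: set every slot to x without ever reading the
-- vector's existing contents (the shape of the optimized basicSet).
fillLoop :: a -> V.Vector a -> V.Vector a
fillLoop x v = runST $ do
  mv <- V.thaw v
  let n = M.length mv
      go i | i < n     = M.unsafeWrite mv i x >> go (i + 1)
           | otherwise = return ()
  go 0
  V.unsafeFreeze mv

main :: IO ()
main = print (fillLoop (0 :: Int) (V.fromList [1, 2, 3, 4]))
```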

@OlivierSohn OlivierSohn force-pushed the feature/195-optimize-basicSet branch 2 times, most recently from b45ac4a to 15e72d2 Compare January 2, 2018 13:13
@cartazio
Contributor

cartazio commented Jan 2, 2018 via email

@OlivierSohn
Contributor Author

OlivierSohn commented Jan 2, 2018

Yes, at some point I compared the two versions using the benchmark function in the PR; here are two consecutive runs:

benchmarking mutableSetOld
time                 103.6 ms   (97.81 ms .. 107.5 ms)
                     0.998 R²   (0.995 R² .. 1.000 R²)
mean                 103.5 ms   (101.7 ms .. 106.1 ms)
std dev              3.250 ms   (1.684 ms .. 4.659 ms)

benchmarking mutableSetNew
time                 95.17 ms   (90.92 ms .. 99.08 ms)
                     0.997 R²   (0.990 R² .. 0.999 R²)
mean                 96.94 ms   (94.97 ms .. 99.06 ms)
std dev              3.260 ms   (2.510 ms .. 4.339 ms)
benchmarking mutableSetOld
time                 104.0 ms   (101.7 ms .. 105.9 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 102.1 ms   (101.2 ms .. 103.0 ms)
std dev              1.402 ms   (1.010 ms .. 1.885 ms)

benchmarking mutableSetNew
time                 97.97 ms   (94.92 ms .. 101.2 ms)
                     0.998 R²   (0.994 R² .. 1.000 R²)
mean                 97.73 ms   (96.36 ms .. 99.58 ms)
std dev              2.470 ms   (1.749 ms .. 3.207 ms)

@cartazio
Contributor

cartazio commented Jan 3, 2018 via email

@OlivierSohn
Contributor Author

OlivierSohn commented Jan 3, 2018

Well, in that particular case, if you look at the code as it was before, you'll see that the implementation makes little sense, and you'll understand that the new implementation has to be faster, whatever the size of the input.

I explain in #195 why the new implementation is faster: in the old implementation, if N is the length of the vector, we were doing N reads and N writes. But reading the vector makes no sense for this algorithm and just adds more work. Now we do N writes and zero reads of the vector. Again, look at the implementation as it was before and you'll see that it's not the way it should have been done.

And to answer your question regarding the number of elements: in that case it's 100000 elements.

@cartazio
Contributor

cartazio commented Jan 3, 2018 via email

@OlivierSohn
Contributor Author

Indeed, a linear access pattern is what this implementation uses, and that was not the case in the previous one (well, it was for the writes, but they were interleaved with reads).

Sorry, I won't have time to do these tests, as I don't see how they would help in this case.

Honestly, once you understand the previous and current implementations, you can see that the new one cannot be slower than the old one, so writing these tests would be like testing that calling a function twice is slower than calling it once :) I don't write that kind of test :)

@cartazio
Contributor

cartazio commented Jan 4, 2018 via email

@OlivierSohn
Contributor Author

OlivierSohn commented Jan 4, 2018

On testing: when I compared the two, I wrote the old and the new implementation in the same version of the code (basicSetOld, basicSetNew), so that I didn't have to switch branches and recompile everything every time I wanted to do a comparison. But even this way, every time the test code changed, every package used in the tests that depends on vector was recompiled (5 minutes or so). I don't know if that's because I was compiling with stack... If you do the test and compile with cabal, I'd be interested to know how it goes in this respect!

@cartazio
Contributor

cartazio commented Jan 4, 2018 via email

@OlivierSohn
Contributor Author

Thanks for pointing that out; cabal new-build seems more flexible and would have made that easier!

@cartazio
Contributor

cartazio commented Jan 4, 2018 via email

@cartazio
Contributor

cartazio commented Jan 4, 2018 via email

@OlivierSohn
Contributor Author

Thanks, I had found it in the meantime and deleted the comment.

@cartazio
Contributor

Did you update your comparison / reference code? :)

@cartazio
Contributor

Also, there are some failures on CI, though other folks may understand the errors better than I do.

@OlivierSohn
Contributor Author

OlivierSohn commented Jan 21, 2018

@cartazio I explained in the discussion why the test is sufficient as-is, imho. Feel free to do more tests if you need confirmation of the performance boost.

@cartazio
Contributor

I'm so sorry for being slow to follow up on this. What I'd like to do (in a few weeks, after other stuff stabilizes) is a detailed benchmark that replicates the timing comparison you did, but using a number of different memory sizes, to understand the impact at various points in the memory hierarchy and on various CPUs. This sort of optimization can be great if it's robust across sizes, but sometimes tricks that work well at, say, L1/L2 sizes fall over at L3 or a full RAM round trip.

@OlivierSohn
Contributor Author

@cartazio Your comment makes me doubt that you took the time to read the changes and understand them.
You'll see that what I do is not a "trick" at all. It's just that the initial implementation was unnecessarily convoluted and cannot be faster than what I do.

@cartazio
Contributor

cartazio commented Jan 30, 2020 via email

@Bodigrim
Contributor

Bodigrim commented Feb 3, 2020

@OlivierSohn I agree that the original version reads N bytes and writes N bytes, while your code reads 0 bytes and writes N bytes, which is strictly better. And your code is more straightforward (it is basically as straightforward as possible ;).

However, I cannot fully agree that basicSet "cannot be faster than what [you] do". The reason is that log N calls to basicUnsafeCopy can be faster than N calls to basicUnsafeWrite, even though they transfer the same amount of data. E.g., could you please check that your patch does not make Storable vectors of Int8 slower? They use the copyArray routine, which presumably just calls memcpy.

I think it is reasonable to apply your patch to Generic vectors, but to copy the recursive implementation of basicSet to Storable ones.
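Bodigrim's point can be seen in the recursive strategy: write one element, then keep copying the initialized prefix onto the next region of equal size, doubling the filled length each step, so only O(log N) copy calls are issued. A minimal standalone sketch using the public mutable-vector API (`fillDoubling` is a hypothetical name; the real code lives inside `basicSet`):

```haskell
import Control.Monad.ST (runST)
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as M

-- Doubling fill: O(log n) bulk copies instead of n single writes.
fillDoubling :: a -> Int -> V.Vector a
fillDoubling x n
  | n <= 0    = V.empty
  | otherwise = runST $ do
      mv <- M.new n
      M.write mv 0 x
      let go i
            | 2 * i < n = do
                -- copy the initialized prefix [0, i) onto [i, 2i)
                M.copy (M.slice i i mv) (M.slice 0 i mv)
                go (2 * i)
            | otherwise =
                -- final partial copy for the remaining n - i slots
                M.copy (M.slice i (n - i) mv) (M.slice 0 (n - i) mv)
      go 1
      V.unsafeFreeze mv

main :: IO ()
main = print (fillDoubling (7 :: Int) 5)
```

Each `M.copy` is a single bulk operation, which is where an underlying memcpy-style routine can beat an element-by-element loop.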

@cartazio
Contributor

cartazio commented Feb 3, 2020

My main point is that modern computers have a pretty nuanced hierarchical memory model that sometimes behaves differently than PDP-8-era flat memory, and it can be surprising.

@cartazio
Contributor

cartazio commented Feb 3, 2020

and that those constant factors from the memory model vary as one changes the size of the working set.

@OlivierSohn
Contributor Author

@Bodigrim, @cartazio I won't have time to work on it (I opened this PR 2 years ago; I now work on different projects).

If you feel like taking the lead on doing these tests / merging this PR, go ahead!

@cartazio
Contributor

cartazio commented Feb 4, 2020 via email

@lehins lehins force-pushed the feature/195-optimize-basicSet branch from eea7b93 to 90649f0 Compare July 11, 2020 17:04
@lehins lehins force-pushed the feature/195-optimize-basicSet branch from 90649f0 to 804a598 Compare July 11, 2020 17:48
@lehins
Contributor

lehins commented Jul 11, 2020

There was really no reason for such a simple PR to hang for so long. Sorry about that, @OlivierSohn.

That being said, this PR introduces a regression, not an improvement. The reason you didn't see it in your benchmarks is that you benchmarked an unboxed vector, while the default implementation of basicSet only affects the boxed vector. The 5%~15% performance gain was probably due to some other nuances, but they are gone now. The Storable and Primitive versions have their own optimized basicSet implementations, which are very hard to beat. And Unboxed vectors fall back on the underlying primitive implementation for each individual type; where they don't, the current implementation on master will be faster anyway.

I just tried both versions on boxed vectors. Here is yours:

      do_set i | i < n = do
                           basicUnsafeWrite v i x
                           do_set (i+1)
               | otherwise = return ()

with runtime:

benchmarking mutableSet
time                 286.4 ms   (265.8 ms .. 304.0 ms)
                     0.998 R²   (0.991 R² .. 1.000 R²)
mean                 291.7 ms   (287.3 ms .. 297.2 ms)
std dev              6.385 ms   (3.795 ms .. 7.964 ms)
variance introduced by outliers: 16% (moderately inflated)

and current master:

      do_set i | 2*i < n = do basicUnsafeCopy (basicUnsafeSlice i i v)
                                              (basicUnsafeSlice 0 i v)
                              do_set (2*i)
               | otherwise = basicUnsafeCopy (basicUnsafeSlice i (n-i) v)
                                             (basicUnsafeSlice 0 (n-i) v)

with runtime almost twice as fast.

benchmarking mutableSet
time                 143.7 ms   (140.8 ms .. 146.2 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 143.2 ms   (141.9 ms .. 144.8 ms)
std dev              2.016 ms   (1.438 ms .. 2.784 ms)
variance introduced by outliers: 12% (moderately inflated)

The reason for this performance improvement with a more convoluted approach is that it uses an optimized version of copyMutableArray underneath, which doesn't require iterating over each element.

With all that in mind, this PR still adds something useful, namely a benchmark for setting a vector to the same value.

If you feel like taking the lead on doing these tests / merging this PR, go ahead!

I followed the above suggestion and took it upon myself to fix this PR. I reverted implementation of basicSet and switched benchmark to boxed vector.

@Shimuuar
Contributor

The reason for this performance improvement with a more convoluted approach is that it uses an optimized version of copyMutableArray underneath, which doesn't require iterating over each element.

@lehins I think it would be worthwhile to add a comment explaining why the convoluted approach is taken and that it's optimized for boxed vectors.

@lehins
Contributor

lehins commented Jul 11, 2020

@Shimuuar I am not the original implementer of basicSet; Roman wrote it almost a decade ago. I personally cannot explain 100% why it is faster. All I know is that the naive approach suggested in this PR was 2x slower, and that is why I reverted it.

@Shimuuar
Contributor

Just a note that this implementation is optimized for boxed vectors, and that unboxed/primitive/storable have specialized ones, would be helpful, I think.

@Bodigrim
Contributor

It's not specifically optimized for boxed vectors only; it is optimized for all vectors backed by a contiguous chunk of memory. In this case basicUnsafeCopy basically becomes memcpy, which employs CPU vectorization capabilities to copy data in blocks of 128-512 bytes.

Storable and Primitive vectors can go further by relying on memset, which can be even faster than memcpy.
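At the user-facing level this fill operation is exposed as `set`; a minimal sketch of filling a Storable vector of Int8 through it (whether this actually lowers to a single memset depends on the element type and the library's implementation):

```haskell
import Data.Int (Int8)
import qualified Data.Vector.Storable as S
import qualified Data.Vector.Storable.Mutable as SM

main :: IO ()
main = do
  mv <- SM.new 8 :: IO (SM.IOVector Int8)
  -- set fills every slot with the same value; for Storable vectors
  -- this is where a memset-style fast path can apply
  SM.set mv 1
  v  <- S.freeze mv
  print v
```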

@OlivierSohn
Contributor Author

Just a note that this implementation is optimized for boxed vectors, and that unboxed/primitive/storable have specialized ones, would be helpful, I think.

I agree that adding a comment would be nice: without it, it is hard to guess why one would implement it like this (and whether it is a beginner's mistake or an expert's trick).

@lehins
Contributor

lehins commented Jul 11, 2020

Thanks @Bodigrim, that is a pretty good explanation. @Shimuuar, if you really feel like this #196 (comment) deserves to be in the codebase, by all means add it as a separate commit. I personally don't care much about comments like that directly near the implementation.

Especially because it requires going into quite a bit of detail. For example, even though what @Bodigrim said is true ("It's not specifically optimized for boxed vectors only"), it is not being used by any implementation except boxed vectors. At the same time, it might be used by custom Unbox instances that don't supply an implementation for basicSet. I'd rather concentrate our efforts on cleaning up the backlog and improving user-facing documentation than dwell on commenting the code.

In any case, this PR has been boiling here for over two years, I don't see any reason for holding it up any longer. Merging.

@lehins lehins merged commit a6c2db5 into haskell:master Jul 11, 2020
@lehins
Contributor

lehins commented Jul 11, 2020

@OlivierSohn by all means. If you or anyone else feels like researching the current implementation and describing what happens for each particular memory representation, with notes on complexity and runtimes, it will make a great PR. I promise it will not hang there for another two years ;) I personally don't feel it is a good use of my time.



Development

Successfully merging this pull request may close these issues.

Data.Vector.Generic.Mutable.Base.basicSet could be optimized

5 participants