
Optimize generic version of basicSet.#196

Merged
lehins merged 1 commit into haskell:master from OlivierSohn:feature/195-optimize-basicSet
Jul 11, 2020

Conversation

@OlivierSohn
Contributor

@OlivierSohn OlivierSohn commented Jan 2, 2018

Closes #195: instead of doing N reads on the vector, where N is the length of the vector, we now do zero reads. The order of writes is unchanged.

I looked at other places and saw no other potential optimizations of that sort.

I also updated the commented implementation in Mutable.hs to keep the two in sync.

EDIT: It seems to me that the CI errors are not related to these changes.

EDIT 2: Added a benchmark function; performance is 5% to 15% better with the new version.

EDIT 3: Updated some bounds in order to compile with stack lts-10.2.
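For illustration, here is a self-contained sketch of the write-only approach described above, using the public `Data.Vector` API rather than the internal `basicSet` (the name `fillLoop` is hypothetical, chosen for this sketch):

```haskell
import Control.Monad.ST (runST)
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as M

-- Write-only fill: set every slot to x without ever reading the
-- vector's existing contents (the shape of the optimized basicSet).
fillLoop :: a -> V.Vector a -> V.Vector a
fillLoop x v = runST $ do
  mv <- V.thaw v
  let n = M.length mv
      go i | i < n     = M.unsafeWrite mv i x >> go (i + 1)
           | otherwise = return ()
  go 0
  V.unsafeFreeze mv

main :: IO ()
main = print (fillLoop (0 :: Int) (V.fromList [1, 2, 3, 4]))
```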

@OlivierSohn OlivierSohn force-pushed the feature/195-optimize-basicSet branch 2 times, most recently from b45ac4a to 15e72d2 Compare January 2, 2018 13:13
@cartazio
Contributor

cartazio commented Jan 2, 2018 via email

@OlivierSohn
Contributor Author

OlivierSohn commented Jan 2, 2018

Yes, at some point I compared the two versions using the benchmark function in the PR; here are two consecutive runs:

benchmarking mutableSetOld
time                 103.6 ms   (97.81 ms .. 107.5 ms)
                     0.998 R²   (0.995 R² .. 1.000 R²)
mean                 103.5 ms   (101.7 ms .. 106.1 ms)
std dev              3.250 ms   (1.684 ms .. 4.659 ms)

benchmarking mutableSetNew
time                 95.17 ms   (90.92 ms .. 99.08 ms)
                     0.997 R²   (0.990 R² .. 0.999 R²)
mean                 96.94 ms   (94.97 ms .. 99.06 ms)
std dev              3.260 ms   (2.510 ms .. 4.339 ms)
benchmarking mutableSetOld
time                 104.0 ms   (101.7 ms .. 105.9 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 102.1 ms   (101.2 ms .. 103.0 ms)
std dev              1.402 ms   (1.010 ms .. 1.885 ms)

benchmarking mutableSetNew
time                 97.97 ms   (94.92 ms .. 101.2 ms)
                     0.998 R²   (0.994 R² .. 1.000 R²)
mean                 97.73 ms   (96.36 ms .. 99.58 ms)
std dev              2.470 ms   (1.749 ms .. 3.207 ms)

@cartazio
Contributor

cartazio commented Jan 3, 2018 via email

@OlivierSohn
Contributor Author

OlivierSohn commented Jan 3, 2018

Well, in that particular case, if you look at the code as it was before, you'll see that the implementation makes little sense, and you'll understand that the new implementation has to be faster, whatever the size of the input.

I explain in #195 why the new implementation is faster: in the old implementation, if N is the length of the vector, we were doing N reads and N writes. But reading the vector makes no sense for this algorithm and just adds more work. Now we do N writes and zero reads of the vector. Again, look at the implementation as it was before and you'll see that it's not the way it should have been done.

And to answer your question regarding the number of elements: in that case it's 100000 elements.

@cartazio
Contributor

cartazio commented Jan 3, 2018 via email

@OlivierSohn
Contributor Author

Indeed, a linear access pattern is what this implementation uses, and that was not the case in the previous one (well, it was for the writes, but they were interleaved with reads).

Sorry, I won't have time to do these tests, as I don't see how they would help in this case.

Honestly, once you understand the previous and current implementations, you can see that the new one cannot be slower than the old one, so writing these tests would be like testing that calling a function twice is slower than calling it once :) I don't write that kind of test :)

@cartazio
Contributor

cartazio commented Jan 4, 2018 via email

@OlivierSohn
Contributor Author

OlivierSohn commented Jan 4, 2018

On testing: when I compared the two, I wrote the old and the new implementation in the same version of the code (basicSetOld, basicSetNew), so that I didn't have to switch branches and recompile everything every time I wanted to do a comparison. But even this way, every time the test code changed, every package used in the tests that depends on vector was recompiled (5 minutes or so). I don't know if that's because I was compiling with stack... If you do the test and compile with cabal, I'd be interested to know how it goes in this respect!

@cartazio
Contributor

cartazio commented Jan 4, 2018 via email

@OlivierSohn
Contributor Author

Thanks for pointing that out; cabal new-build seems more flexible and would have made that easier!

@cartazio
Contributor

cartazio commented Jan 4, 2018 via email

@cartazio
Contributor

cartazio commented Jan 4, 2018 via email

@OlivierSohn
Contributor Author

Thanks, I had found it in the meantime and deleted the comment.

@cartazio
Contributor

Did you update your comparison / reference code? :)

@cartazio
Contributor

Also, there are some failures on CI, though other folks may understand the errors better than I do.

@OlivierSohn
Contributor Author

OlivierSohn commented Jan 21, 2018

@cartazio I explained in the discussion why the test is sufficient as-is, imho. Feel free to do more tests if you need confirmation of the performance boost.

@cartazio
Contributor

I'm so sorry for being slow to follow up on this. What I'd like to do (in a few weeks, after other stuff stabilizes) is a detailed benchmark that replicates the timing comparison you did, but using a number of different memory sizes, to understand the impact at various points in the memory hierarchy and on various CPUs. This sort of optimization can be great if it's robust across sizes, but sometimes tricks that work well at, say, L1/L2 sizes fall over at L3 or a full RAM round trip.

@OlivierSohn
Contributor Author

@cartazio Your comment makes me doubt that you took the time to read the changes and understand them.
You'll see that what I do is not a "trick" at all. It's just that the initial implementation was unnecessarily convoluted and cannot be faster than what I do.

@cartazio
Contributor

cartazio commented Jan 30, 2020 via email

@Bodigrim
Contributor

Bodigrim commented Feb 3, 2020

@OlivierSohn I agree that the original version reads N bytes and writes N bytes, while your code reads 0 bytes and writes N bytes, which is strictly better. And your code is more straightforward (it is basically as straightforward as possible ;).

However, I cannot fully agree that basicSet "cannot be faster than what [you] do". The reason is that log N calls to basicUnsafeCopy can be faster than N calls to basicUnsafeWrite, even though they transfer the same amount of data. E.g., could you please check that your patch does not make Storable vectors of Int8 slower? They use the copyArray routine, which presumably just calls memcpy.

I think it is reasonable to apply your patch to Generic vectors, but to copy the recursive implementation of basicSet to Storable ones.
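Bodigrim's point can be seen in the recursive strategy: write one element, then keep copying the initialized prefix onto the next region of equal size, doubling the filled length each step, so only O(log N) copy calls are issued. A minimal standalone sketch using the public mutable-vector API (`fillDoubling` is a hypothetical name; the real code lives inside `basicSet`):

```haskell
import Control.Monad.ST (runST)
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as M

-- Doubling fill: O(log n) bulk copies instead of n single writes.
fillDoubling :: a -> Int -> V.Vector a
fillDoubling x n
  | n <= 0    = V.empty
  | otherwise = runST $ do
      mv <- M.new n
      M.write mv 0 x
      let go i
            | 2 * i < n = do
                -- copy the initialized prefix [0, i) onto [i, 2i)
                M.copy (M.slice i i mv) (M.slice 0 i mv)
                go (2 * i)
            | otherwise =
                -- final partial copy for the remaining n - i slots
                M.copy (M.slice i (n - i) mv) (M.slice 0 (n - i) mv)
      go 1
      V.unsafeFreeze mv

main :: IO ()
main = print (fillDoubling (7 :: Int) 5)
```

Each `M.copy` is a single bulk operation, which is where an underlying memcpy-style routine can beat an element-by-element loop.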

@cartazio
Contributor

cartazio commented Feb 3, 2020

My main point is that modern computers have a pretty nuanced hierarchical memory model that sometimes behaves differently than PDP-8-era flat memory, and it can be surprising.

@cartazio
Contributor

cartazio commented Feb 3, 2020

and that those constant factors from the memory model vary as one changes the size of the working set.

@OlivierSohn
Contributor Author

@Bodigrim, @cartazio I won't have time to work on it (I opened this PR 2 years ago; I now work on different projects).

If you feel like taking the lead on doing these tests / merging this PR, go ahead!

@cartazio
Contributor

cartazio commented Feb 4, 2020 via email

@lehins lehins force-pushed the feature/195-optimize-basicSet branch from eea7b93 to 90649f0 Compare July 11, 2020 17:04
@lehins lehins force-pushed the feature/195-optimize-basicSet branch from 90649f0 to 804a598 Compare July 11, 2020 17:48
@lehins
Contributor

lehins commented Jul 11, 2020

There was really no reason for such a simple PR to hang for so long. Sorry about that, @OlivierSohn.

That being said, this PR introduces a regression, not an improvement. The reason you didn't see it in your benchmarks is that you benchmarked an unboxed vector, while the default implementation of basicSet only affects the boxed vector. The 5%~15% performance gain was probably due to some other nuances, but they are gone now. The Storable and Primitive versions have their own optimized basicSet implementations, which are very hard to beat. And Unboxed vectors fall back on the underlying primitive implementation for each individual type; where they don't, the current implementation on master will be faster anyway.

I just tried both versions on boxed vectors. Here is yours:

      do_set i | i < n = do
                           basicUnsafeWrite v i x
                           do_set (i+1)
               | otherwise = return ()

with runtime:

benchmarking mutableSet
time                 286.4 ms   (265.8 ms .. 304.0 ms)
                     0.998 R²   (0.991 R² .. 1.000 R²)
mean                 291.7 ms   (287.3 ms .. 297.2 ms)
std dev              6.385 ms   (3.795 ms .. 7.964 ms)
variance introduced by outliers: 16% (moderately inflated)

and current master:

      do_set i | 2*i < n = do basicUnsafeCopy (basicUnsafeSlice i i v)
                                              (basicUnsafeSlice 0 i v)
                              do_set (2*i)
               | otherwise = basicUnsafeCopy (basicUnsafeSlice i (n-i) v)
                                             (basicUnsafeSlice 0 (n-i) v)

with runtime almost twice as fast.

benchmarking mutableSet
time                 143.7 ms   (140.8 ms .. 146.2 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 143.2 ms   (141.9 ms .. 144.8 ms)
std dev              2.016 ms   (1.438 ms .. 2.784 ms)
variance introduced by outliers: 12% (moderately inflated)

The reason for this performance improvement with a more convoluted approach is that it uses an optimized version of copyMutableArray underneath, which doesn't require iterating over each element.

With all that in mind, this PR still adds something useful, namely a benchmark for setting a vector to the same value.

If you feel like taking the lead on doing these tests / merging this PR, go ahead!

I followed the above suggestion and took it upon myself to fix this PR. I reverted implementation of basicSet and switched benchmark to boxed vector.

@Shimuuar
Contributor

The reason for this performance improvement with a more convoluted approach is that it uses an optimized version of copyMutableArray underneath, which doesn't require iterating over each element.

@lehins I think it would be worthwhile to add a comment explaining why the convoluted approach is taken and that it's optimized for boxed vectors.

@lehins
Contributor

lehins commented Jul 11, 2020

@Shimuuar I am not the original implementer of basicSet; Roman wrote it almost a decade ago. I personally cannot explain 100% why it is faster. All I know is that the naive approach suggested in this PR was 2x slower, and that is why I reverted it.

@Shimuuar
Contributor

Just a note that this implementation is optimized for boxed vectors, and that unboxed/primitive/storable have specialized ones, would be helpful, I think.

@Bodigrim
Contributor

It's not specifically optimized for boxed vectors only; it is optimized for all vectors backed by a contiguous chunk of memory. In this case basicUnsafeCopy basically becomes memcpy, which employs CPU vectorization capabilities to copy data in blocks of 128-512 bytes.

Storable and Primitive vectors can go further by relying on memset, which can be even faster than memcpy.
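At the user-facing level this fill operation is exposed as `set`; a minimal sketch of filling a Storable vector of Int8 through it (whether this actually lowers to a single memset depends on the element type and the library's implementation):

```haskell
import Data.Int (Int8)
import qualified Data.Vector.Storable as S
import qualified Data.Vector.Storable.Mutable as SM

main :: IO ()
main = do
  mv <- SM.new 8 :: IO (SM.IOVector Int8)
  -- set fills every slot with the same value; for Storable vectors
  -- this is where a memset-style fast path can apply
  SM.set mv 1
  v  <- S.freeze mv
  print v
```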

@OlivierSohn
Contributor Author

Just a note that this implementation is optimized for boxed vectors, and that unboxed/primitive/storable have specialized ones, would be helpful, I think.

I agree that adding a comment would be nice: without it, it is hard to guess why one would implement it like this (and whether it is a beginner's mistake or an expert's trick).

@lehins
Contributor

lehins commented Jul 11, 2020

Thanks @Bodigrim, that is a pretty good explanation. @Shimuuar, if you really feel like this #196 (comment) deserves to be in the codebase, by all means add it as a separate commit. I personally don't care much about comments like that directly near the implementation.

Especially because it requires going into quite a bit of detail. For example, even though what @Bodigrim said is true ("It's not specifically optimized for boxed vectors only"), it is not being used by any implementation except boxed vectors. At the same time, it might be used by custom Unbox instances that don't supply an implementation for basicSet. I'd rather concentrate our efforts on cleaning up the backlog and improving user-facing documentation than dwell on commenting the code.

In any case, this PR has been boiling here for over two years, I don't see any reason for holding it up any longer. Merging.

@lehins lehins merged commit a6c2db5 into haskell:master Jul 11, 2020
@lehins
Contributor

lehins commented Jul 11, 2020

@OlivierSohn by all means. If you or anyone else feels like researching the current implementation and describing what happens for each particular memory representation, with notes on complexity and runtimes, it will make a great PR. I promise it will not hang there for another two years ;) I personally don't feel it is a good use of my time.



Development

Successfully merging this pull request may close these issues.

Data.Vector.Generic.Mutable.Base.basicSet could be optimized

5 participants