Optimize generic version of basicSet. #196
Conversation
Force-pushed from b45ac4a to 15e72d2.
Do we have benchmarks to validate the perf improvement? I agree that it sounds like it should be an improvement, but benchmarks help a lot :)
Yes, at some point I compared the two versions using the benchmark function in the PR; here are two consecutive runs:

```
benchmarking mutableSetOld
time 103.6 ms (97.81 ms .. 107.5 ms)
0.998 R² (0.995 R² .. 1.000 R²)
mean 103.5 ms (101.7 ms .. 106.1 ms)
std dev 3.250 ms (1.684 ms .. 4.659 ms)
benchmarking mutableSetNew
time 95.17 ms (90.92 ms .. 99.08 ms)
0.997 R² (0.990 R² .. 0.999 R²)
mean 96.94 ms (94.97 ms .. 99.06 ms)
std dev 3.260 ms (2.510 ms .. 4.339 ms)
benchmarking mutableSetOld
time 104.0 ms (101.7 ms .. 105.9 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 102.1 ms (101.2 ms .. 103.0 ms)
std dev 1.402 ms (1.010 ms .. 1.885 ms)
benchmarking mutableSetNew
time 97.97 ms (94.92 ms .. 101.2 ms)
0.998 R² (0.994 R² .. 1.000 R²)
mean 97.73 ms (96.36 ms .. 99.58 ms)
std dev 2.470 ms (1.749 ms .. 3.207 ms)
```
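For context, here is a minimal sketch of how such a pairwise comparison could be driven with criterion. The `fillWith` helper is hypothetical, and `mutableSetOld`/`mutableSetNew` would each dispatch to one of the two `basicSet` implementations compiled side by side; in any single build of the sketch below, both resolve to the same `M.set`:

```haskell
import Criterion.Main (bench, defaultMain, whnfIO)
import qualified Data.Vector.Unboxed.Mutable as M

-- One run: allocate a 100000-element vector (matching the runs above)
-- and fill it with a constant value.
fillWith :: (M.IOVector Int -> Int -> IO ()) -> IO ()
fillWith setImpl = do
  v <- M.new 100000
  setImpl v 42

main :: IO ()
main = defaultMain
  [ bench "mutableSetOld" $ whnfIO (fillWith M.set) -- would use the old basicSet
  , bench "mutableSetNew" $ whnfIO (fillWith M.set) -- would use the new basicSet
  ]
```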
Over what input sizes? I guess my main concern is that for this sort of optimization we want to validate that it's a win on both small and large updates :)
Well, in that particular case, if you look at the code before the change, you'll see that the implementation made little sense, and you'll understand that the new implementation has to be faster, whatever the size of the input. I explain in #195 why: in the old implementation, if N is the length of the vector, we were doing N reads and N writes. But reading the vector makes no sense at all for this algorithm and just adds work, so now we do N writes and zero reads. Again, look at the implementation as it was before and you'll understand that it's not the way it should have been done. And to answer your question about the number of elements: in this case it's 100000 elements.
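For concreteness, here is a minimal sketch of the two access patterns at issue. The write-only loop matches the code quoted later in this thread; the read-then-write loop is paraphrased from the N-reads/N-writes description above and is hypothetical, not copied from the old source:

```haskell
import Control.Monad.Primitive (PrimMonad, PrimState)
import Data.Vector.Generic.Mutable.Base (MVector (..))

-- Old pattern (paraphrased): one read and one write per element,
-- where the read result is never used.
setWithReads :: (PrimMonad m, MVector v a) => v (PrimState m) a -> a -> m ()
setWithReads v x = go 0
  where
    n = basicLength v
    go i
      | i < n = do
          _ <- basicUnsafeRead v i   -- wasted read
          basicUnsafeWrite v i x
          go (i + 1)
      | otherwise = return ()

-- New pattern: N writes, zero reads, same write order.
setWritesOnly :: (PrimMonad m, MVector v a) => v (PrimState m) a -> a -> m ()
setWritesOnly v x = go 0
  where
    n = basicLength v
    go i
      | i < n     = basicUnsafeWrite v i x >> go (i + 1)
      | otherwise = return ()
```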
I agree, algorithmically, once you're larger than some small power-of-two number of elements. But in e.g. the case of unboxed or storable arrays, modern CPUs are very good at memory prefetch for linear access patterns, and that sort of effect is hard to analyze non-empirically. If you could humor me and give us a criterion suite run with tests for a range of small and larger inputs, that would be super appreciated. If you're strapped for time I'll try to find some time soon; I'm just a tad swamped with a few work things this week/month, like mentoring a new hire ;)
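A hypothetical shape for such a size sweep, with working sets ranging from cache-resident to RAM-sized; the sizes and group name are illustrative, not from the PR:

```haskell
import Criterion.Main (bench, bgroup, defaultMain, whnfIO)
import qualified Data.Vector.Unboxed.Mutable as M

-- Benchmark filling vectors of several sizes; note that in this simple
-- sketch the allocation is measured together with the fill.
main :: IO ()
main = defaultMain
  [ bgroup "set"
      [ bench (show n) $ whnfIO $ do
          v <- M.new n
          M.set v (0 :: Int)
      | n <- [2 ^ 6, 2 ^ 10, 2 ^ 14, 2 ^ 18, 2 ^ 22 :: Int]
      ]
  ]
```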
Indeed, a linear access pattern is what I do in this implementation, and it was not the case in the previous one (well, it was for the writes, but they were interleaved with reads). Sorry, I won't have time to do these tests, as I don't see how they would help in this case. Honestly, once you understand the previous and the current implementations, you can see that it cannot be slower now than it was before, so writing these tests would be like testing that calling a function twice is slower than calling it once :) I don't write that kind of test :)
You're likely correct. I'm just being a humbug ;) I'll try to find some procrastination time to compare the two, using your test or whatever as a starting point.
On testing: when I compared the two, I wrote the old implementation and the new one in the same version of the code (basicSetOld, basicSetNew), so that I didn't have to change branch and recompile everything every time I wanted to do a comparison. But even this way, every time the test code was changed, every package used in the tests and depending on vector was recompiled (5 minutes or so). I don't know whether it's because I was compiling with stack or not... If you do the test and compile with cabal, I'd be interested to know how it goes on this front!
Cabal new-build works way better; you should try it out. Also, it'd be easy to compare the two in userland.
Thanks for pointing that out, cabal new-build seems more flexible and would have made that easier!
You need to add a cabal.project file in the root that points to the relative paths of the other cabal files.
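For illustration, a hypothetical cabal.project for that layout; the `bench-basicSet` directory name is made up:

```
-- cabal.project at the root of the vector checkout
packages: ./
          ./bench-basicSet/
```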
The new cabal manual has some decent expository material on this.
Thanks, I had found it in the meantime, and deleted my comment.
Did you update your comparison / reference code? :)
Also, on the CI there are some failures, though other folks may understand the errors better than I do.
@cartazio I explained in the discussion why the test is sufficient as-is, imho. Feel free to do more tests if you need confirmation of the performance boost.
I'm so sorry for being slow to follow up on this. What I'd like to do, in a few weeks after other stuff stabilizes, is a detailed benchmark that replicates the timing comparison you did, but over a number of different memory sizes, to understand the impact at various points in the memory hierarchy and on various CPUs. This sort of optimization can be great if it's robust over sizes, but sometimes tricks that work great at, say, L1/L2 size fall over at L3 or a full RAM round trip.
@cartazio Your comment makes me doubt that you took the time to read the changes and understand them. You'll see that what I do is not a "trick" at all. It's just that the initial implementation was unnecessarily convoluted, and cannot be faster than what I do.
Quite possibly! The point I was communicating is that I'll look again closely and soon.
@OlivierSohn I agree that the original version reads N bytes and writes N bytes, while your code reads 0 bytes and writes N bytes, which is strictly better. And your code is more straightforward (it is basically as straightforward as possible ;). However, I cannot fully agree: I think it is reasonable to apply your patch for …
My main point is that modern computers have a pretty nuanced hierarchical memory model that sometimes behaves differently from PDP-8-era flat memory, and it can be surprising.
And those constant factors from the memory model vary as one changes the size of the working set.
Thanks for the original work. We do plan to make use of it and do some careful experiments (finite time, etc.).
Force-pushed from eea7b93 to 90649f0, then from 90649f0 to 804a598.
There was really no reason for such a simple PR to hang for so long; sorry about that, @OlivierSohn. That being said, this PR introduces a regression, not an improvement. The reason you didn't see it in your benchmarks is that you benchmarked unboxed vectors, while that default implementation of … I just tried both versions on boxed vectors. Here is yours:

```haskell
do_set i | i < n = do
             basicUnsafeWrite v i x
             do_set (i+1)
         | otherwise = return ()
```

with runtime: …

and current master:

```haskell
do_set i | 2*i < n = do basicUnsafeCopy (basicUnsafeSlice i i v)
                                        (basicUnsafeSlice 0 i v)
                        do_set (2*i)
         | otherwise = basicUnsafeCopy (basicUnsafeSlice i (n-i) v)
                                       (basicUnsafeSlice 0 (n-i) v)
```

with runtime almost twice as fast. The reason this more convoluted approach performs better is that it uses an optimized version of … With all that in mind, this PR still adds something useful, namely a benchmark for setting a vector to the same value. I followed the above suggestion and took it upon myself to fix this PR. I reverted the implementation of …
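To make the logic of that doubling loop explicit, here is a self-contained, commented restatement; the `setByDoubling` wrapper and the initial write of element 0 are inferred from how `do_set` is driven, so treat the framing as an assumption rather than a quote of the source:

```haskell
import Control.Monad.Primitive (PrimMonad, PrimState)
import Data.Vector.Generic.Mutable.Base (MVector (..))

-- Fill v with x by doubling an already-filled prefix: after element 0 is
-- written, [0, i) holds x, and each step block-copies that prefix onto the
-- next i slots.  Only O(log n) basicUnsafeCopy calls are made, and for
-- contiguous representations each call is a memcpy-like bulk copy, which
-- is why this can beat an element-by-element write loop there.
setByDoubling :: (PrimMonad m, MVector v a) => v (PrimState m) a -> a -> m ()
setByDoubling v x
  | n == 0    = return ()
  | otherwise = do
      basicUnsafeWrite v 0 x
      do_set 1
  where
    n = basicLength v
    do_set i
      | 2*i < n = do
          basicUnsafeCopy (basicUnsafeSlice i i v)     -- dst: [i, 2i)
                          (basicUnsafeSlice 0 i v)     -- src: filled prefix
          do_set (2*i)
      | otherwise =
          basicUnsafeCopy (basicUnsafeSlice i (n-i) v) -- final partial block
                          (basicUnsafeSlice 0 (n-i) v)
```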
@lehins I think it would be worthwhile to add a comment explaining why the convoluted approach is taken and that it's optimized for boxed vectors.
@Shimuuar I am not the original implementor of the …
Just a note that this implementation is optimized for boxed vectors, and that unboxed/primitive/storable have specialized ones, would be helpful I think.
It's not specifically optimized for boxed vectors only; it is optimized for all vectors backed by a contiguous chunk of memory. In this case …
I agree that adding a comment would be nice: without it, it is hard to guess why one would implement it like this (and whether it is a beginner's mistake or an expert trick).
Thanks @Bodigrim, that is a pretty good explanation. @Shimuuar, if you really feel like this #196 (comment) deserves to be in the codebase, by all means add it as a separate commit. I personally don't care much about comments like that directly near the implementation, especially because it requires going into quite a bit of detail. For example, while what @Bodigrim said is true ("It's not specifically optimized for boxed vectors only"), it is not being used by any other implementation except boxed vectors. At the same time, it might be used by custom Unbox instances that don't supply an implementation for … In any case, this PR has been boiling here for over two years; I don't see any reason to hold it up any longer. Merging.
@OlivierSohn by all means. If you or anyone else feels like researching the current implementation and describing what happens for each particular memory representation, with notes on complexity and runtimes, it will make a great PR. I promise it will not hang there for another two years ;) I personally don't feel it is a good use of my time.
Closes #195: instead of doing N reads on the vector, where N is the length of the vector, we do zero reads. The order of writes is unchanged.
I looked at other places; I saw no other potential optimizations of that sort.
I also updated the commented implementation in Mutable.hs to keep the two in sync.
EDIT: It seems to me that the CI errors are not related to these changes.
EDIT 2: Added a benchmark function; performance is 5% to 15% better with the new version.
EDIT 3: Updated some bounds in order to compile with stack lts-10.2.