Hello,
While reading the code of basicSet I found that it could be optimized. And the optimization will both simplify the code, and improve runtime performance:
Currently, if N is the length of the vector (assuming it's a power of 2), basicSet does
N/2 + N/4 + N/8 + .... = N reads and N writes of the vector, with a recursive logic.
EDIT 1: Note that if N is not a power of two, the number of reads and writes is also N
I propose a non recursive implementation where we do 0 reads from the vector (we have the value as argument) and N writes. It will improve the performance because we'll do less reads. Maybe also because the cache / prefetcher will be more efficient as we don't interleave writes with reads at another location.
EDIT 2 : After implementing it in #196, I copy the benchmark result, on unboxed vectors of length 100000 for 2 consecutive runs. We see the expected performance improvement:
benchmarking mutableSetOld
time 103.6 ms (97.81 ms .. 107.5 ms)
0.998 R² (0.995 R² .. 1.000 R²)
mean 103.5 ms (101.7 ms .. 106.1 ms)
std dev 3.250 ms (1.684 ms .. 4.659 ms)
benchmarking mutableSetNew
time 95.17 ms (90.92 ms .. 99.08 ms)
0.997 R² (0.990 R² .. 0.999 R²)
mean 96.94 ms (94.97 ms .. 99.06 ms)
std dev 3.260 ms (2.510 ms .. 4.339 ms)
benchmarking mutableSetOld
time 104.0 ms (101.7 ms .. 105.9 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 102.1 ms (101.2 ms .. 103.0 ms)
std dev 1.402 ms (1.010 ms .. 1.885 ms)
benchmarking mutableSetNew
time 97.97 ms (94.92 ms .. 101.2 ms)
0.998 R² (0.994 R² .. 1.000 R²)
mean 97.73 ms (96.36 ms .. 99.58 ms)
std dev 2.470 ms (1.749 ms .. 3.207 ms)
vector/Data/Vector/Generic/Mutable/Base.hs
Line 104 in 90bc4ee
Hello,
While reading the code of
basicSetI found that it could be optimized. And the optimization will both simplify the code, and improve runtime performance:Currently, if N is the length of the vector (assuming it's a power of 2),
basicSetdoesN/2 + N/4 + N/8 + .... = N reads and N writes of the vector, with a recursive logic.
EDIT 1: Note that if N is not a power of two, the number of reads and writes is also N
I propose a non recursive implementation where we do 0 reads from the vector (we have the value as argument) and N writes. It will improve the performance because we'll do less reads. Maybe also because the cache / prefetcher will be more efficient as we don't interleave writes with reads at another location.
EDIT 2 : After implementing it in #196, I copy the benchmark result, on unboxed vectors of length 100000 for 2 consecutive runs. We see the expected performance improvement: