runtime: support SSE2 memmove on older amd64 hardware #38512
Compared with a similar C++/glibc implementation of the same operation, the glibc version was about twice as fast as the Go implementation on my hardware (a Xeon E5 v1, Sandy Bridge).
It turns out that while the Xeon E5 v1s have AVX instructions, they have a slower per-cycle data path. That explains why Go 1.14 has the following code restriction for Intel "Bridge families":
This hardware limitation persisted until the Haswell Xeons (E5 v3), which happily (and quickly) use AVX.
glibc concurs. In a
Which is interesting -- because Go doesn't have the equivalent.
I went ahead and implemented it; but the README said to file an issue for inclusion first, so here we are (my first issue!)
Forcing AVX on Sandy Bridge worked and varied in speed, but it was slower than expected (and slower than the SSE2 paths), especially once the non-temporal moves got involved.
The biggest win came in the backwards path alone.
So I implemented only the backwards path on my branch of the Go runtime, and here are some preliminary highlights:
I figured compiling and benchmarking other packages from the wild with my patch would be interesting, and sure enough...
Using bbolt's B-Tree write benchmark:
@barakmich I have been looking into a new memmove implementation (https://go-review.googlesource.com/c/go/+/228820) that tries to slim down the function size and also uses SSE2 (even for forward). Unfortunately I have not been able to produce a clear net win for all amd64 CPUs while playing with new memmove implementations. Any specialization usually increases icache misses and runtime overhead.
We have to take into account the overall performance of Sandy Bridge, Haswell, Skylake, Ryzen, ... CPUs when modifying the memmove implementation, as other CPUs can regress in performance. REP MOVSQ, REP MOVSB, SSE2 and AVX can differ in performance across platforms and depending on alignment and size. In general, making the implementation smaller is an opportunity to save cycles everywhere by avoiding icache misses.
Do you have a link to your modified version?
As a first step just adding simple Backwards memmove benchmarks would be a nice commit.
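For reference, a backwards memmove benchmark could look roughly like the sketch below. It is not the benchmark from either branch: the helper names (`shiftUp`, `benchmarkBackward`), the 16-byte shift and the sizes are all made up for illustration. The idea is to copy within a single buffer so that the destination starts above the source and the two ranges overlap, which is the case where a memmove cannot simply copy front to back. As far as I understand the amd64 assembly, small sizes are handled by load-everything-then-store code that has no direction, so the larger sizes are the interesting ones.

```go
// Save as e.g. memmove_backward_test.go; the package name is arbitrary.
package main

import "testing"

// shiftUp copies buf[:n] to buf[shift:shift+n] inside the same slice.
// The destination starts above the source and overlaps it (when shift < n),
// so a plain front-to-back copy would overwrite source bytes before reading them.
// Go's built-in copy is specified to handle overlapping slices correctly.
func shiftUp(buf []byte, n, shift int) {
	copy(buf[shift:shift+n], buf[:n])
}

// benchmarkBackward times repeated overlapping upward copies of n bytes.
func benchmarkBackward(b *testing.B, n int) {
	const shift = 16 // hypothetical overlap offset, chosen for illustration
	buf := make([]byte, n+shift)
	b.SetBytes(int64(n)) // lets "go test" report throughput in MB/s
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		shiftUp(buf, n, shift)
	}
}

// Sizes are illustrative; a real commit would likely sweep more of them.
func BenchmarkMemmoveBackward256(b *testing.B)  { benchmarkBackward(b, 256) }
func BenchmarkMemmoveBackward4096(b *testing.B) { benchmarkBackward(b, 4096) }
func BenchmarkMemmoveBackward1M(b *testing.B)   { benchmarkBackward(b, 1<<20) }
```

Run with `go test -bench=MemmoveBackward` and compare results before and after a runtime change, for example with `benchstat`.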