Memprof: optimize random sampling #9466
Conversation
What part of #8371 are you referring to? There were several things in this PR and a complex discussion, but regarding the backtrace-construction optimization, I understand that there were two changes (two commits):
Are you talking about this second change, or another optimization? |
I am talking about that second change, yes. |
Thanks. (@stedolan mentioned a 10-15% difference, so that would shave maybe 0.3-0.4 seconds off your benchmark.) |
Force-pushed from 0d2f014 to fe137c9
I like the idea, because I use it with callstack_size = 0, and I often do very little work in the callbacks, in which case the benefit is even greater; the higher a sampling rate I can choose without affecting performance, the better. The software you mention (https://github.com/jhjourdan/SIMD-math-prims/) says that there is a loss of precision. Is this relevant here? If so, can you give us an idea of how this skews the distribution? (e.g., the impact on the number of samples expected after N allocations, for N very large) |
I'm reminded of the famous Quake III inverse square root function, but we need fewer comments and more profanity to rise to the Quake III level :-) I confirm that the code vectorizes perfectly with SSE, AVX2, and all the way to AVX-512. Impressive! As to accuracy, the code claims 10^-6 accuracy. I can't say whether that's enough or not, but note that single-precision FP gives 10^-7 accuracy at most. So, if single-precision FP is good enough, the proposed code is probably good enough. |
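For readers following along, the trick under discussion can be sketched like this. This is my own illustrative version, not the code in this PR: the cubic coefficients below come from naive interpolation of log2 at four points, whereas SIMD-math-prims uses a more accurate higher-degree fit.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of a polynomial log approximation (illustrative, not the PR's
   code): decompose x = m * 2^e with m in [1,2), approximate log2(m)
   with a low-degree polynomial, and recombine.  Assumes x is a
   positive, normal float. */
static float log_approx(float x) {
  uint32_t bits;
  memcpy(&bits, &x, sizeof bits);            /* type-pun without UB */
  int e = (int)(bits >> 23) - 127;           /* unbiased exponent */
  bits = (bits & 0x007FFFFFu) | 0x3F800000u; /* force exponent to 0 */
  float m;                                   /* mantissa, in [1,2) */
  memcpy(&m, &bits, sizeof m);
  /* Cubic interpolant of log2(m) on [1,2]; error about 6e-4 */
  float p = ((0.1539707f * m - 1.0348749f) * m + 3.0268302f) * m
            - 2.1459259f;
  return ((float)e + p) * 0.69314718f;       /* log(x) = log2(x)*ln(2) */
}
```

The inner loop over the mantissa is branch-free and uses only multiplies and adds, which is why compilers can vectorize it; the real code's higher-degree fit works the same way with smaller error.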
A couple of stylistic comments.
Maybe, maybe not; and maybe it was already too low before, and my use of memprof reminds me of a famous bug in a weapon that had to be rebooted every eight hours to remain accurate. Is it possible to convert the figure into an error range for the sampling rate? Then I can run some calculations to tell you whether it has an impact. If it does, at worst I guess it is possible to switch to an even greater precision for very low values of the sampling rate, where the performance loss should not be noticeable. |
Don't expect too much, though: compilers will not do so unless they are given the right options (e.g., -march), which won't happen with OCaml's Makefile.
10^-6 was too optimistic. The actual precision is about 1.5e-5. But that is still enough for what we are doing. |
Force-pushed from fe137c9 to 92228b9
Thanks for the clarifications. |
About accuracy: I ran some tests to determine the actual impact on the sampling rate. There are only 2^32 values that can be uniformly generated by the Mersenne Twister, and, thanks to vectorization, it is easy to compute the actual sampling rate obtained when requesting a given one by averaging over all 2^32 values. So, here are the results:
This means that, for example, with lambda = 0.99, one would need to allocate about 2.29143e+12 words in order to actually be able to observe the bias. Hence, the loss in precision has only a negligible impact on the behavior of statmemprof. Of course, this reasoning relies on the actual uniformity of the Mersenne Twister. |
We could do that, but low values of the sampling rate are exactly the values for which the loss in precision is even less noticeable, since there are even fewer samples... |
Thanks a lot for working out those numbers (I do not understand everything, but it looks fun). |
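The back-of-the-envelope behind figures like 2.29143e+12 can be reconstructed as follows. This is my own sketch, using a one-sigma detectability criterion; the exact criterion used in the measurement above may differ.

```c
/* Reconstruction of the "words needed to observe the bias" estimate.
   Over N allocated words, the sample count has standard deviation
   sqrt(N * lambda * (1 - lambda)); a bias b in lambda shifts its mean
   by N * b.  The bias becomes observable (at one sigma, my assumed
   criterion) once N * b >= sqrt(N * lambda * (1 - lambda)), i.e.
   N >= lambda * (1 - lambda) / b^2. */
static double words_to_observe_bias(double lambda, double bias) {
  return lambda * (1.0 - lambda) / (bias * bias);
}
```

For example, lambda = 0.5 with a bias of 1e-3 in the realized rate gives 2.5e5 words; the tiny biases produced by the polynomial approximation push N into the 10^12 range.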
LGTM as far as C programming style is concerned. Is anyone willing and able to review the math too, or shall we just trust @jhjourdan ?
I'm happy to give the maths a read. (I actually wrote one of these a while ago but never got around to making a PR, so it will be fun to compare) |
Force-pushed from b1200cd to 6391fd5
Looks good to me! There are a few more games you can play here if you like, but moving away from libm's logf is the main win.

There have in the past existed machines with different endianness for floats and integers. Are we confident these are all gone?

The polynomial is accurate to about 10^-5 on [1,2], which seems plenty. In this application, we don't really care about the accuracy of a single sample, but rather about the accuracy of a sum of samples, as this is what determines the drift in the sample rate. In other words, we (unusually) care about the mean error rather than the max absolute error. In this respect, the polynomial approximation is even better, with an average error of about 10^-6 on [1,2].

If you want to optimise further, there are much smaller and faster RNGs than the Mersenne Twister. I'd suggest either PCG or xoroshiro128+. (The former is better in ways that are unlikely to matter much here; the latter will be very fast even on 32-bit systems.) |
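For reference, xoroshiro128+ really is tiny. Here is a sketch following Blackman and Vigna's published reference code; the seed below is purely illustrative (the only constraint is that the state must never be all zeros).

```c
#include <stdint.h>

/* xoroshiro128+ (Blackman & Vigna), one of the RNGs suggested above.
   The seed is illustrative; state must never be all zeros. */
static uint64_t s[2] = { 1, 2 };

static inline uint64_t rotl(uint64_t x, int k) {
  return (x << k) | (x >> (64 - k));
}

static uint64_t xoroshiro128plus_next(void) {
  uint64_t s0 = s[0], s1 = s[1];
  uint64_t result = s0 + s1;          /* the "+" in xoroshiro128+ */
  s1 ^= s0;
  s[0] = rotl(s0, 24) ^ s1 ^ (s1 << 16);
  s[1] = rotl(s1, 37);
  return result;
}

/* A uniform float in [0,1) from the top 24 bits of the output. */
static float uniform_float(void) {
  return (xoroshiro128plus_next() >> 40) * (1.0f / 16777216.0f);
}
```

The state is two 64-bit words and each step is a handful of shifts and xors, which is why it beats the Mersenne Twister's 2.5 KB state on cache behavior alone.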
In case anyone's curious: my version of this was a much cruder approximation of log by a quadratic. It's based on the unique quadratic that approximates log2 on [1,2] in a way that's continuous and differentiable when extended and has zero average error. It ends up with a large worst-case error, about 10^-2, but as I said above the sample rate is affected only by average error. The code is here; it's neither slow nor readable.

There's also von Neumann's delightful algorithm for generating exponential samples that requires no logs at all. In fact, it requires no arithmetic beyond increments! There's an implementation here. (Thanks to @lpw25 for help digging that one up.) (Sadly, I did benchmark this, and it works out slower than polynomial approximations of log.) |
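A sketch of the von Neumann exponential sampler being referred to, as I understand it from Devroye's presentation (my reconstruction, not the implementation linked above; the embedded LCG is only there to keep the example self-contained). Draw uniforms U1, U2, ... and let K be the length of the initial non-increasing run; if K is odd, accept int_part + U1, otherwise increment int_part and retry. Only comparisons are needed, no logarithm.

```c
#include <stdint.h>

static uint64_t lcg_state = 0x0123456789abcdefULL; /* illustrative seed */

/* 64-bit LCG (Knuth's MMIX constants); top 53 bits -> [0,1) */
static double unif01(void) {
  lcg_state = lcg_state * 6364136223846793005ULL
              + 1442695040888963407ULL;
  return (lcg_state >> 11) * (1.0 / 9007199254740992.0);
}

/* Exp(1) sample via von Neumann's comparison-only method:
   P(run length K odd | U1 = u) sums to e^{-u}, so accepted U1 has
   density proportional to e^{-u} on [0,1]; each rejected round shifts
   the integer part by 1, yielding density e^{-x} on [n, n+1]. */
static double exp_von_neumann(void) {
  double int_part = 0.0;
  for (;;) {
    double u1 = unif01(), prev = u1;
    int k = 1;
    for (;;) {
      double u = unif01();
      if (u > prev) break;     /* non-increasing run of length k ended */
      prev = u;
      k++;
    }
    if (k % 2 == 1) return int_part + u1;  /* K odd: accept */
    int_part += 1.0;                       /* K even: reject, shift */
  }
}
```

Each sample consumes about e uniforms per round, which is part of why it loses to the polynomial log in practice despite the delightfully simple arithmetic.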
The only example I encountered was one of the ARM soft-FP emulations, which used a mixed-endian representation for float64. But float32 had the same endianness as integers. Following Wikipedia pointers, the PDP-11 might have had mixed-endian float32 and float64, and some of that nonsense may have carried over to the VAX.
Yes. |
If anyone is interested, I think I originally found that algorithm on page 12 of this book: Techniques for Efficient Monte Carlo Simulation, Volume II, written in 1975 for the "Radiation Shielding Information Center", where it appears, without explanation, in early Fortran. Stephen and I eventually traced its origins to this 1951 paper by von Neumann: Various Techniques Used in Connection With Random Digits. I am genuinely disappointed that this lost knowledge of the ancients did not turn out to be faster than the approximations of log. |
Thanks, @stedolan, for the pointers. Please wait before merging; I would like to give your proposals a chance. |
Force-pushed from 5fc7fee to 6b23698
Instead of using the stdlib logf function for computing logarithms, we use a faster polynomial-based approximation. We use the xoshiro PRNG instead of the Mersenne Twister, which is slower and more complex. We generate samples in batches so that compilers can optimize the generation loops using SIMD instructions when possible.
Force-pushed from 6b23698 to 05fd7ab
I changed the code, following some of @stedolan's proposals: |
... and what's the impact of those proposals on performance? It always feels good to read about that. |
The impact on performance is positive: the time spent on the example discussed above with the callstack deactivated is reduced by 10%. However, this seems mostly related to cache effects (the Mersenne Twister used quite a bit of memory), since the time spent in the sampling functions seems mostly unchanged. In any case, I think the change of PRNG is a good one, since it also simplifies the code. The change in the |
Very nice! And I'm glad to see Mersenne Twister replaced by a simpler PRNG. Are you satisfied, @jhjourdan? Shall we merge? |
I am satisfied. We could merge. But since the change in this PR is substantial, @stedolan might have objections. |
No objections here! New version looks good to me. |
It was an honor to merge such clever code. |
Instead of using the stdlib logf function for computing logarithms, we use a faster polynomial-based approximation. We use the xoshiro PRNG instead of the Mersenne Twister. xoshiro is simpler and faster. We generate samples by batches so that compilers can vectorize the generation loops using SIMD instructions when possible.
This PR includes two optimizations to improve the performance of random sampling in Memprof:
This uses the same ideas I have already used in another project: https://github.com/jhjourdan/SIMD-math-prims/. The code is portable enough (the least portable part is that it depends on the relative endianness of float and uint32_t, which I think is the same on all architectures OCaml supports).
I tested (using Godbolt) several versions of Clang (>= 4.0.0) and GCC (>= 4.7), and they all produce vectorized code. Of course, other compilers (e.g., MSVC, ICC) will not perform these optimizations, but even then the code is still more efficient than the current one.
I benchmarked with a micro-benchmark adapted from @stedolan's #8731
Without these optimizations, on my machine, this code takes about 3.6 seconds. With the new code but without vectorization, it takes 3.4 seconds, and with vectorization, it takes 3.1 seconds. Given that the runtime of this program is about 0.7 seconds when statmemprof is disabled, this means that this patch reduces the overhead of statmemprof by about 17%.
This is an improvement, but I am still not completely sure it is worth it, given, in particular, that it requires some changes in configure to detect the specific attributes we need to pass to GCC to tell it to optimize. What do people think?

Note that @stedolan proposed in #8731 an optimization that should improve the performance of caml_next_frame_descriptor, which is now the main bottleneck.