Memprof: optimize random sampling #9466

jhjourdan · 2020-04-18T13:01:40Z

This PR includes two optimizations to improve the performance of random sampling in Memprof:

Instead of using the stdlib logf function for computing logarithms, we use a faster polynomial-based approximation.

This is using the same ideas I have already used in another project : https://github.com/jhjourdan/SIMD-math-prims/. The code is portable enough (the least portable part is that it depends on the relative endianness of float and uint32_t, which I think is always the same in OCaml architecture).

We generate samples by batches so that we let compilers optimize the generation loops using SIMD instructions when possible.

I tested (using godbolt) on several versions of CLang (>4.0.0), GCC (> 4.7), and they all produce vectorized code. Of course, other compilers (e.g., MSVC, ICC) will not perform these optimizations, but the code is then still more efficient than the current one.

I benchmarked with a micro-benchmark adapted from @stedolan's #8731

let samples = ref 0

let rec recur n f = match n with 0 -> f () | n -> recur (n-1) f + 1

let () =
  Gc.Memprof.start
    ~sampling_rate:0.1
    ~callstack_size:10
    { Gc.Memprof.null_tracker with
      alloc_minor = fun _ -> incr samples; None };

  for i = 1 to 10_000_000 do
    let d = recur 20 (fun () ->
      Array.make 20 0 |> ignore; ref 42 |> Sys.opaque_identity |> ignore; 0) in
    assert (d = 20)
  done;

  Printf.printf "%d\n" !samples

Without these optimizations, on my machine, this code takes about 3.6 seconds. With this new code but without vectorization, it takes 3.4 seconds, and with vectorization, it takes 3.1 seconds. Given that the runtime of this program is about 0.7 seconds when statmemprof is disabled, this mean that this patch improves by about 17% the overhead of statmemprof.

This is an improvement, but I am still not completely sure this is worth it, given, in particular, that it requires some changes in configure to detect the specific attributes we need to pass to GCC to tell it to optimize. What do people think?

Note that @stedolan proposed in #8731 and optimization that should improve the performances of caml_next_frame_descriptor, which is now the main bottleneck.

gasche · 2020-04-18T13:36:43Z

Note that @stedolan proposed in #8731 and optimization that should improve the performances of caml_next_frame_descriptor, which is now the main bottleneck.

What part of #8371 are you referring to? There were several things in this PR and a complex discussion, but regarding backtrace construction optimization I understand that there were two changes (two commits):

one to avoid rescanning the stack twice that was already merged as Avoid scanning the stack twice when collecting callstacks in Memprof. #9279 (so I guess it is active in your benchmark results), and
another that uses registers for locals that, if I understand, was never integrated

Are you talking about this second change, or another optimization?

jhjourdan · 2020-04-18T13:46:19Z

Are you talking about this second change, or another optimization?

I am talking about that second change, yes.

gasche · 2020-04-18T13:51:43Z

Thanks. (@stedolan mentioned a 10-15% difference, so that would shave maybe 0.3-0.4 seconds off your benchmark.)

gadmm · 2020-04-19T20:08:00Z

This is an improvement, but I am still not completely sure this is worth it

I like the idea because I use it with callstack_size = 0, and I often do very little work in the callbacks, in which case the benefit is even greater, and the higher I can chose a sampling rate without affecting performance the better.

The software you mention (https://github.com/jhjourdan/SIMD-math-prims/) says that there is a loss of precision. Is this relevant here? If so can you give us an idea of how this skews the distribution? (e.g. the impact on the number of samples expected after N allocations for N very large).

xavierleroy · 2020-04-20T09:21:54Z

I'm reminded of the famous Quake III inverse square root function, but we need fewer comments and more profanity to rise to the Quake III level :-)

I confirm that the code vectorizes perfectly with SSE, AVX2, and all the way to AVX512. Impressive!

As to accuracy, the code claims 10^-6 accuracy/ I can't say whether it's enough or not, but note that single-precision FP gives 10^-7 accuracy at most. So, if single FP is good enough, the proposed code is probably good enough.

xavierleroy

A couple of stylistic comments.

runtime/memprof.c

gadmm · 2020-04-20T13:08:04Z

As to accuracy, the code claims 10^-6 accuracy/ I can't say whether it's enough or not, but note that single-precision FP gives 10^-7 accuracy at most. So, if single FP is good enough, the proposed code is probably good enough.

Maybe, maybe not, and maybe it was already too low before and my use of memprof reminds of a famous bug in a weapon that had to be rebooted every eight hours to remain accurate. Is it possible to convert the figure into an error range for the sampling rate? Then I can run some calculation to tell you if it has an impact.

If it does, at worst I guess it is possible to switch to an even greater precision for very low values of the sampling rate, where the performance loss should not be noticeable.

jhjourdan · 2020-04-20T14:58:55Z

I confirm that the code vectorizes perfectly with SSE, AVX2, and all the way to AVX512. Impressive!

Don't expect too much, though: compilers will not do so, except if they are given the right e.g., -march options, which won't happen with OCaml's Makefile.

As to accuracy, the code claims 10^-6 accuracy/ I can't say whether it's enough or not, but note that single-precision FP gives 10^-7 accuracy at most. So, if single FP is good enough, the proposed code is probably good enough.

10^-6 was too optimistic. The actual precision is about 1.5e-5. But still enough for what we are doing.

gadmm · 2020-04-20T16:57:15Z

But still enough for what we are doing.

Thanks for the precisions.

jhjourdan · 2020-04-20T18:06:47Z

About accuracy: I ran some tests in order to determine the actual impact on sampling rate. It turns out that there are only 2^32 values that can be uniformly generated by the Mersenne Twister. And it turns out that, thanks to vectorization, it is particularly easy to compute the actual sampling rate when requesting a given sampling rate by averaging over all of the 2^32 values.

So, here are the results:

lambda = 0.99 error =  6.57301e-08 words = 2.29143e+12
lambda = 0.9 error =  1.33628e-07 words = 5.04016e+12
lambda = 0.2 error =  9.36631e-08 words = 1.82383e+13
lambda = 0.1 error =  6.6075e-08 words = 2.06143e+13
lambda = 0.03 error =  3.19408e-08 words = 2.85234e+13
lambda = 0.01 error =  1.0935e-08 words = 8.27931e+13
lambda = 0.003 error =  3.25604e-09 words = 2.82121e+14
lambda = 0.001 error =  1.15338e-09 words = 7.50972e+14
lambda = 0.0003 error =  3.34931e-10 words = 2.6735e+15
lambda = 0.0001 error =  1.13561e-10 words = 7.75351e+15
lambda = 3e-05 error =  3.28782e-11 words = 2.77518e+16
lambda = 1e-05 error =  1.13485e-11 words = 7.76465e+16
lambda = 3e-06 error =  3.49937e-12 words = 2.44986e+17
lambda = 1e-06 error =  1.13515e-12 words = 7.76062e+17
lambda = 3e-07 error =  3.32747e-13 words = 2.70952e+18
lambda = 1e-07 error =  1.0686e-13 words = 8.75735e+18
lambda = 3e-08 error =  3.20678e-14 words = 2.9173e+19
lambda = 1e-08 error =  1.09474e-14 words = 8.34409e+19
lambda = 3e-09 error =  3.37326e-15 words = 2.63647e+20
lambda = 1e-09 error =  1.09219e-15 words = 8.38314e+20

This means that, for example, with lambda = 0.99, one would need about to allocate about 2.29143e+12 words in order to actually be able to observe the bias. Hence, the loss in precision only has negligible impact on the behavior of statmemprof.

Of course, this reasoning rely on the actual uniformity of the Mersenne Twister.

jhjourdan · 2020-04-20T18:09:58Z

If it does, at worst I guess it is possible to switch to an even greater precision for very low values of the sampling rate, where the performance loss should not be noticeable.

We could do that, but low values of the sampling rate are values for which the loss in precision is even less noticeable, since there are even less samplings...

gadmm · 2020-04-20T21:54:01Z

Thanks a lot for looking for those numbers (I do not understand everything, but it looks fun).

xavierleroy

LGTM as far as C programming style is concerned. Is anyone willing and able to review the math too, or shall we just trust @jhjourdan ?

stedolan · 2020-04-22T11:06:47Z

I'm happy to give the maths a read. (I actually wrote one of these a while ago but never got around to making a PR, so it will be fun to compare)

stedolan · 2020-04-28T15:54:23Z

Looks good to me! There are a few more games you can play here if you like, but moving from libm's logf to a polynomial approximation is the most important part.

There have in the past existed machines with different endianness for floats and integers. Are we confident these are all gone?

The polynomial is accurate to about 10^-5 on [1,2], which seems plenty. In this application, we don't really care about accuracy of a single sample, but rather about accuracy of a sum of samples as this is what determines the drift in sample rate. In other words, we (unusually) care about the mean error, rather than the max absolute error. In this respect, the polynomial approximation is even better, with average error about 10^-6 on [1,2].

If you want to optimise further, there are much smaller and faster RNGs than Mersenne Twister. I'd suggest either PCG or xoroshiro128+. (The former is better in ways that are unlikely to matter much here, the latter will be very fast even on 32-bit systems).

stedolan · 2020-04-28T16:03:41Z

In case anyone's curious: my version of this was a much cruder approximation of log by a quadratic. It's based on the unique quadratic that approximates log2 on [1,2] in a way that's continuous and differentiable when extended and has zero average error. It ends up with a large worst-case error, about 10^-2, but as I said above the sample rate is affected only by average error. The code is here, it's neither slow nor readable.

There's also von Neumann's delightful algorithm for generating exponential samples that requires no logs at all. In fact, it requires no arithmetic beyond increments! There's an implementation here. (thanks to @lpw25 for help digging that one up) (Sadly, I did benchmark this and it works out slower than polynomial approximations of log)

xavierleroy · 2020-04-28T16:24:44Z

There have in the past existed machines with different endianness for floats and integers.

The only example I encountered was one of the ARM soft FP emulation which used a mixed-endian representation for float64. But float32 had the same endianness as integers.

Following Wikipedia pointers, the PDP-11 might have had mixed-endian float32 and float64, and some of that nonsense may have carried over to the VAX.

Are we confident these are all gone?

Yes.

lpw25 · 2020-04-28T17:33:26Z

There's also von Neumann's delightful algorithm for generating exponential samples that requires no logs at all. In fact, it requires no arithmetic beyond increments! There's an implementation here. (thanks to @lpw25 for help digging that one up) (Sadly, I did benchmark this and it works out slower than polynomial approximations of log)

If anyone is interested, I think I originally found that algorithm on page 12 of this book:

Techniques for efficient Monte Carlo Simulation, Volume II

written in 1975 from the "Radiation Shielding Information Center". Where it occurs with no explanation in early Fortran.

Stephen and I eventually traced its origins to this 1951 paper by Von Neumann:

Various Techniques Used in Connection With Random Digits

I am genuinely disappointed that this lost knowledge of the ancients did not turn out to be faster than the approximations of log.

jhjourdan · 2020-04-29T13:50:13Z

Thanks, @stedolan, for the pointers. Please wait before the merge, I would like to give a chance to your proposals.

Instead of using the stdlib logf function for computing logarithms, we use a faster polynomial-based approximation. We use the xoshiro PRNG instead of the Mersenne Twister, which is less efficient and more complex. We generate samples by batches so that we let compilers optimize the generation loops using SIMD instructions when possible.

jhjourdan · 2020-05-18T16:03:11Z

I changed the code, by following some of @stedolan's proposals:

The current version now use 64 parallel xoshiro samplers instead of the Mersenne Twister
The approximation of the logarithm no uses a degree 4 polynomial. @stedolan's proposal did not work out of the box, because then log_approx(x) sometimes returns positive values for large x, which breaks assertions in the rest of the code. The fourth degree gives us a little more accuracy, which is appreciable for large values of lambda.

gasche · 2020-05-18T17:35:27Z

... and what's the impact of those proposals on performance? It always feels good to read about that.

jhjourdan · 2020-05-19T15:23:58Z

The impact on performance is positive: the time spent on the example discussed above with callstack deactivated is reduced by 10%.

However, this seems mostly related to cache effects (the Mersenne Twister used quite a bit of memory), since the time spent in sampling functions seems mostly unchanged.

In any case, I think the change in the PRNG is a good change since it also simplifies the code. The change in the log_approx function is more arguable, because I failed to measure the gain in performance, and the code is not really simpler. But it is not worse either.

xavierleroy · 2020-05-19T17:26:14Z

The impact on performance is positive: the time spent on the example discussed above with callstack deactivated is reduced by 10%.

Very nice! And I'm glad to see Mersenne Twister replaced by a simpler PRNG.

Are you satisfied, @jhjourdan? Shall we merge?

jhjourdan · 2020-05-20T08:45:37Z

I am satisfied. We could merge. But since the change in the PR is substential, @stedolan could have an objection.

stedolan · 2020-05-20T13:55:50Z

No objections here! New version looks good to me.

xavierleroy · 2020-05-25T07:51:31Z

It was an honor to merge such clever code.

Instead of using the stdlib logf function for computing logarithms, we use a faster polynomial-based approximation. We use the xoshiro PRNG instead of the Mersenne Twister. xoshiro is simpler and faster. We generate samples by batches so that compilers can vectorize the generation loops using SIMD instructions when possible.

jhjourdan force-pushed the memprof_batch_sampling branch 3 times, most recently from 0d2f014 to fe137c9 Compare April 19, 2020 12:26

xavierleroy reviewed Apr 20, 2020

View reviewed changes

runtime/memprof.c Outdated Show resolved Hide resolved

runtime/memprof.c Outdated Show resolved Hide resolved

jhjourdan force-pushed the memprof_batch_sampling branch from fe137c9 to 92228b9 Compare April 20, 2020 15:58

xavierleroy approved these changes Apr 22, 2020

View reviewed changes

jhjourdan force-pushed the memprof_batch_sampling branch 3 times, most recently from b1200cd to 6391fd5 Compare April 27, 2020 17:24

stedolan approved these changes Apr 28, 2020

View reviewed changes

jhjourdan force-pushed the memprof_batch_sampling branch 2 times, most recently from 5fc7fee to 6b23698 Compare May 11, 2020 13:51

jhjourdan force-pushed the memprof_batch_sampling branch from 6b23698 to 05fd7ab Compare May 18, 2020 15:55

Merge branch 'trunk' into memprof_batch_sampling

21f2244

xavierleroy merged commit 3d63a10 into ocaml:trunk May 25, 2020

jhjourdan mentioned this pull request May 25, 2020

Memprof: Fix duplicate declarations for static variable rand_pos. #9598

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memprof: optimize random sampling #9466

Memprof: optimize random sampling #9466

jhjourdan commented Apr 18, 2020

gasche commented Apr 18, 2020

jhjourdan commented Apr 18, 2020

gasche commented Apr 18, 2020

gadmm commented Apr 19, 2020

xavierleroy commented Apr 20, 2020

xavierleroy left a comment

gadmm commented Apr 20, 2020

jhjourdan commented Apr 20, 2020

gadmm commented Apr 20, 2020

jhjourdan commented Apr 20, 2020

jhjourdan commented Apr 20, 2020

gadmm commented Apr 20, 2020

xavierleroy left a comment

stedolan commented Apr 22, 2020

stedolan commented Apr 28, 2020

stedolan commented Apr 28, 2020

xavierleroy commented Apr 28, 2020

lpw25 commented Apr 28, 2020

jhjourdan commented Apr 29, 2020

jhjourdan commented May 18, 2020

gasche commented May 18, 2020

jhjourdan commented May 19, 2020

xavierleroy commented May 19, 2020

jhjourdan commented May 20, 2020

stedolan commented May 20, 2020

xavierleroy commented May 25, 2020

Memprof: optimize random sampling #9466

Memprof: optimize random sampling #9466

Conversation

jhjourdan commented Apr 18, 2020

gasche commented Apr 18, 2020

jhjourdan commented Apr 18, 2020

gasche commented Apr 18, 2020

gadmm commented Apr 19, 2020

xavierleroy commented Apr 20, 2020

xavierleroy left a comment

Choose a reason for hiding this comment

gadmm commented Apr 20, 2020

jhjourdan commented Apr 20, 2020

gadmm commented Apr 20, 2020

jhjourdan commented Apr 20, 2020

jhjourdan commented Apr 20, 2020

gadmm commented Apr 20, 2020

xavierleroy left a comment

Choose a reason for hiding this comment

stedolan commented Apr 22, 2020

stedolan commented Apr 28, 2020

stedolan commented Apr 28, 2020

xavierleroy commented Apr 28, 2020

lpw25 commented Apr 28, 2020

jhjourdan commented Apr 29, 2020

jhjourdan commented May 18, 2020

gasche commented May 18, 2020

jhjourdan commented May 19, 2020

xavierleroy commented May 19, 2020

jhjourdan commented May 20, 2020

stedolan commented May 20, 2020

xavierleroy commented May 25, 2020