Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memprof: optimize random sampling #9466

Merged
merged 2 commits into from
May 25, 2020

Conversation

jhjourdan
Copy link
Contributor

This PR includes two optimizations to improve the performance of random sampling in Memprof:

  1. Instead of using the stdlib logf function for computing logarithms, we use a faster polynomial-based approximation.

This is using the same ideas I have already used in another project : https://github.com/jhjourdan/SIMD-math-prims/. The code is portable enough (the least portable part is that it depends on the relative endianness of float and uint32_t, which I think is always the same in OCaml architecture).

  1. We generate samples by batches so that we let compilers optimize the generation loops using SIMD instructions when possible.

I tested (using godbolt) on several versions of CLang (>4.0.0), GCC (> 4.7), and they all produce vectorized code. Of course, other compilers (e.g., MSVC, ICC) will not perform these optimizations, but the code is then still more efficient than the current one.

I benchmarked with a micro-benchmark adapted from @stedolan's #8731

let samples = ref 0

let rec recur n f = match n with 0 -> f () | n -> recur (n-1) f + 1

let () =
  Gc.Memprof.start
    ~sampling_rate:0.1
    ~callstack_size:10
    { Gc.Memprof.null_tracker with
      alloc_minor = fun _ -> incr samples; None };

  for i = 1 to 10_000_000 do
    let d = recur 20 (fun () ->
      Array.make 20 0 |> ignore; ref 42 |> Sys.opaque_identity |> ignore; 0) in
    assert (d = 20)
  done;

  Printf.printf "%d\n" !samples

Without these optimizations, on my machine, this code takes about 3.6 seconds. With this new code but without vectorization, it takes 3.4 seconds, and with vectorization, it takes 3.1 seconds. Given that the runtime of this program is about 0.7 seconds when statmemprof is disabled, this mean that this patch improves by about 17% the overhead of statmemprof.

This is an improvement, but I am still not completely sure this is worth it, given, in particular, that it requires some changes in configure to detect the specific attributes we need to pass to GCC to tell it to optimize. What do people think?

Note that @stedolan proposed in #8731 and optimization that should improve the performances of caml_next_frame_descriptor, which is now the main bottleneck.

@gasche
Copy link
Member

gasche commented Apr 18, 2020

Note that @stedolan proposed in #8731 and optimization that should improve the performances of caml_next_frame_descriptor, which is now the main bottleneck.

What part of #8371 are you referring to? There were several things in this PR and a complex discussion, but regarding backtrace construction optimization I understand that there were two changes (two commits):

  1. one to avoid rescanning the stack twice that was already merged as Avoid scanning the stack twice when collecting callstacks in Memprof. #9279 (so I guess it is active in your benchmark results), and
  2. another that uses registers for locals that, if I understand, was never integrated

Are you talking about this second change, or another optimization?

@jhjourdan
Copy link
Contributor Author

Are you talking about this second change, or another optimization?

I am talking about that second change, yes.

@gasche
Copy link
Member

gasche commented Apr 18, 2020

Thanks. (@stedolan mentioned a 10-15% difference, so that would shave maybe 0.3-0.4 seconds off your benchmark.)

@jhjourdan jhjourdan force-pushed the memprof_batch_sampling branch 3 times, most recently from 0d2f014 to fe137c9 Compare April 19, 2020 12:26
@gadmm
Copy link
Contributor

gadmm commented Apr 19, 2020

This is an improvement, but I am still not completely sure this is worth it

I like the idea because I use it with callstack_size = 0, and I often do very little work in the callbacks, in which case the benefit is even greater, and the higher I can chose a sampling rate without affecting performance the better.

The software you mention (https://github.com/jhjourdan/SIMD-math-prims/) says that there is a loss of precision. Is this relevant here? If so can you give us an idea of how this skews the distribution? (e.g. the impact on the number of samples expected after N allocations for N very large).

@xavierleroy
Copy link
Contributor

I'm reminded of the famous Quake III inverse square root function, but we need fewer comments and more profanity to rise to the Quake III level :-)

I confirm that the code vectorizes perfectly with SSE, AVX2, and all the way to AVX512. Impressive!

As to accuracy, the code claims 10^-6 accuracy/ I can't say whether it's enough or not, but note that single-precision FP gives 10^-7 accuracy at most. So, if single FP is good enough, the proposed code is probably good enough.

Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of stylistic comments.

runtime/memprof.c Outdated Show resolved Hide resolved
runtime/memprof.c Outdated Show resolved Hide resolved
@gadmm
Copy link
Contributor

gadmm commented Apr 20, 2020

As to accuracy, the code claims 10^-6 accuracy/ I can't say whether it's enough or not, but note that single-precision FP gives 10^-7 accuracy at most. So, if single FP is good enough, the proposed code is probably good enough.

Maybe, maybe not, and maybe it was already too low before and my use of memprof reminds of a famous bug in a weapon that had to be rebooted every eight hours to remain accurate. Is it possible to convert the figure into an error range for the sampling rate? Then I can run some calculation to tell you if it has an impact.

If it does, at worst I guess it is possible to switch to an even greater precision for very low values of the sampling rate, where the performance loss should not be noticeable.

@jhjourdan
Copy link
Contributor Author

I confirm that the code vectorizes perfectly with SSE, AVX2, and all the way to AVX512. Impressive!

Don't expect too much, though: compilers will not do so, except if they are given the right e.g., -march options, which won't happen with OCaml's Makefile.

As to accuracy, the code claims 10^-6 accuracy/ I can't say whether it's enough or not, but note that single-precision FP gives 10^-7 accuracy at most. So, if single FP is good enough, the proposed code is probably good enough.

10^-6 was too optimistic. The actual precision is about 1.5e-5. But still enough for what we are doing.

@gadmm
Copy link
Contributor

gadmm commented Apr 20, 2020

But still enough for what we are doing.

Thanks for the precisions.

@jhjourdan
Copy link
Contributor Author

About accuracy: I ran some tests in order to determine the actual impact on sampling rate. It turns out that there are only 2^32 values that can be uniformly generated by the Mersenne Twister. And it turns out that, thanks to vectorization, it is particularly easy to compute the actual sampling rate when requesting a given sampling rate by averaging over all of the 2^32 values.

So, here are the results:

lambda = 0.99 error =  6.57301e-08 words = 2.29143e+12
lambda = 0.9 error =  1.33628e-07 words = 5.04016e+12
lambda = 0.2 error =  9.36631e-08 words = 1.82383e+13
lambda = 0.1 error =  6.6075e-08 words = 2.06143e+13
lambda = 0.03 error =  3.19408e-08 words = 2.85234e+13
lambda = 0.01 error =  1.0935e-08 words = 8.27931e+13
lambda = 0.003 error =  3.25604e-09 words = 2.82121e+14
lambda = 0.001 error =  1.15338e-09 words = 7.50972e+14
lambda = 0.0003 error =  3.34931e-10 words = 2.6735e+15
lambda = 0.0001 error =  1.13561e-10 words = 7.75351e+15
lambda = 3e-05 error =  3.28782e-11 words = 2.77518e+16
lambda = 1e-05 error =  1.13485e-11 words = 7.76465e+16
lambda = 3e-06 error =  3.49937e-12 words = 2.44986e+17
lambda = 1e-06 error =  1.13515e-12 words = 7.76062e+17
lambda = 3e-07 error =  3.32747e-13 words = 2.70952e+18
lambda = 1e-07 error =  1.0686e-13 words = 8.75735e+18
lambda = 3e-08 error =  3.20678e-14 words = 2.9173e+19
lambda = 1e-08 error =  1.09474e-14 words = 8.34409e+19
lambda = 3e-09 error =  3.37326e-15 words = 2.63647e+20
lambda = 1e-09 error =  1.09219e-15 words = 8.38314e+20

This means that, for example, with lambda = 0.99, one would need about to allocate about 2.29143e+12 words in order to actually be able to observe the bias. Hence, the loss in precision only has negligible impact on the behavior of statmemprof.

Of course, this reasoning rely on the actual uniformity of the Mersenne Twister.

@jhjourdan
Copy link
Contributor Author

If it does, at worst I guess it is possible to switch to an even greater precision for very low values of the sampling rate, where the performance loss should not be noticeable.

We could do that, but low values of the sampling rate are values for which the loss in precision is even less noticeable, since there are even less samplings...

@gadmm
Copy link
Contributor

gadmm commented Apr 20, 2020

Thanks a lot for looking for those numbers (I do not understand everything, but it looks fun).

Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as far as C programming style is concerned. Is anyone willing and able to review the math too, or shall we just trust @jhjourdan ?

@stedolan
Copy link
Contributor

I'm happy to give the maths a read. (I actually wrote one of these a while ago but never got around to making a PR, so it will be fun to compare)

@jhjourdan jhjourdan force-pushed the memprof_batch_sampling branch 3 times, most recently from b1200cd to 6391fd5 Compare April 27, 2020 17:24
@stedolan
Copy link
Contributor

Looks good to me! There are a few more games you can play here if you like, but moving from libm's logf to a polynomial approximation is the most important part.

There have in the past existed machines with different endianness for floats and integers. Are we confident these are all gone?

The polynomial is accurate to about 10^-5 on [1,2], which seems plenty. In this application, we don't really care about accuracy of a single sample, but rather about accuracy of a sum of samples as this is what determines the drift in sample rate. In other words, we (unusually) care about the mean error, rather than the max absolute error. In this respect, the polynomial approximation is even better, with average error about 10^-6 on [1,2].

If you want to optimise further, there are much smaller and faster RNGs than Mersenne Twister. I'd suggest either PCG or xoroshiro128+. (The former is better in ways that are unlikely to matter much here, the latter will be very fast even on 32-bit systems).

@stedolan
Copy link
Contributor

In case anyone's curious: my version of this was a much cruder approximation of log by a quadratic. It's based on the unique quadratic that approximates log2 on [1,2] in a way that's continuous and differentiable when extended and has zero average error. It ends up with a large worst-case error, about 10^-2, but as I said above the sample rate is affected only by average error. The code is here, it's neither slow nor readable.

There's also von Neumann's delightful algorithm for generating exponential samples that requires no logs at all. In fact, it requires no arithmetic beyond increments! There's an implementation here. (thanks to @lpw25 for help digging that one up) (Sadly, I did benchmark this and it works out slower than polynomial approximations of log)

@xavierleroy
Copy link
Contributor

There have in the past existed machines with different endianness for floats and integers.

The only example I encountered was one of the ARM soft FP emulation which used a mixed-endian representation for float64. But float32 had the same endianness as integers.

Following Wikipedia pointers, the PDP-11 might have had mixed-endian float32 and float64, and some of that nonsense may have carried over to the VAX.

Are we confident these are all gone?

Yes.

@lpw25
Copy link
Contributor

lpw25 commented Apr 28, 2020

There's also von Neumann's delightful algorithm for generating exponential samples that requires no logs at all. In fact, it requires no arithmetic beyond increments! There's an implementation here. (thanks to @lpw25 for help digging that one up) (Sadly, I did benchmark this and it works out slower than polynomial approximations of log)

If anyone is interested, I think I originally found that algorithm on page 12 of this book:

Techniques for efficient Monte Carlo Simulation, Volume II

written in 1975 from the "Radiation Shielding Information Center". Where it occurs with no explanation in early Fortran.

Stephen and I eventually traced its origins to this 1951 paper by Von Neumann:

Various Techniques Used in Connection With Random Digits

I am genuinely disappointed that this lost knowledge of the ancients did not turn out to be faster than the approximations of log.

@jhjourdan
Copy link
Contributor Author

Thanks, @stedolan, for the pointers. Please wait before the merge, I would like to give a chance to your proposals.

@jhjourdan jhjourdan force-pushed the memprof_batch_sampling branch 2 times, most recently from 5fc7fee to 6b23698 Compare May 11, 2020 13:51
Instead of using the stdlib logf function for computing logarithms, we
use a faster polynomial-based approximation.

We use the xoshiro PRNG instead of the Mersenne Twister, which is less
efficient and more complex.

We generate samples by batches so that we let compilers optimize the
generation loops using SIMD instructions when possible.
@jhjourdan
Copy link
Contributor Author

I changed the code, by following some of @stedolan's proposals:

  • The current version now use 64 parallel xoshiro samplers instead of the Mersenne Twister
  • The approximation of the logarithm no uses a degree 4 polynomial. @stedolan's proposal did not work out of the box, because then log_approx(x) sometimes returns positive values for large x, which breaks assertions in the rest of the code. The fourth degree gives us a little more accuracy, which is appreciable for large values of lambda.

@gasche
Copy link
Member

gasche commented May 18, 2020

... and what's the impact of those proposals on performance? It always feels good to read about that.

@jhjourdan
Copy link
Contributor Author

The impact on performance is positive: the time spent on the example discussed above with callstack deactivated is reduced by 10%.

However, this seems mostly related to cache effects (the Mersenne Twister used quite a bit of memory), since the time spent in sampling functions seems mostly unchanged.

In any case, I think the change in the PRNG is a good change since it also simplifies the code. The change in the log_approx function is more arguable, because I failed to measure the gain in performance, and the code is not really simpler. But it is not worse either.

@xavierleroy
Copy link
Contributor

The impact on performance is positive: the time spent on the example discussed above with callstack deactivated is reduced by 10%.

Very nice! And I'm glad to see Mersenne Twister replaced by a simpler PRNG.

Are you satisfied, @jhjourdan? Shall we merge?

@jhjourdan
Copy link
Contributor Author

I am satisfied. We could merge. But since the change in the PR is substential, @stedolan could have an objection.

@stedolan
Copy link
Contributor

No objections here! New version looks good to me.

@xavierleroy xavierleroy merged commit 3d63a10 into ocaml:trunk May 25, 2020
@xavierleroy
Copy link
Contributor

It was an honor to merge such clever code.

xavierleroy pushed a commit to xavierleroy/ocaml that referenced this pull request Jun 1, 2020
Instead of using the stdlib logf function for computing logarithms, we
use a faster polynomial-based approximation.

We use the xoshiro PRNG instead of the Mersenne Twister.  xoshiro is
simpler and faster.

We generate samples by batches so that compilers can vectorize the
generation loops using SIMD instructions when possible.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants