Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement decay-based unused dirty page purging. #325

Closed
jasone opened this issue Feb 5, 2016 · 2 comments
Closed

Implement decay-based unused dirty page purging. #325

jasone opened this issue Feb 5, 2016 · 2 comments
Milestone

Comments

@jasone
Copy link
Member

jasone commented Feb 5, 2016

The current ratio-based mechanism for unused dirty page purging is effective at controlling accumulation of unused dirty pages, except that its conservatism can cause extreme performance degradation if the application's memory usage constantly fluctuates more than e.g. 1/9 with the default lg_dirty_mult:3 setting. In fact, we would prefer a much more stringent ratio, perhaps lg_dirty_mult:6, but an unacceptable fraction of applications' performance would suffer with such a setting.

Part of the problem is that our internal clock can only measure passage of time in terms of allocation event ticks. Introduce a wall clock and track how long unused dirty pages have existed, so that unused dirty page count decays toward zero over time.

@jasone
Copy link
Member Author

jasone commented Feb 5, 2016

A year ago I did some design work for this feature as part of my Applicative conference talk, and this week I wrote some simple simulation code to see how the decay algorithms behaved.

Specifically I simulated various malloc/free/noop traces and tracked the moving maximum active memory over a fixed-size trailing window of ticks. Based on the moving maximum, at each tick I determined how much unused dirty memory to purge such that active + unused memory would intersect the moving maximum after a fixed number of ticks. This algorithm requires no memory other than the moving maximum, which conveniently avoids tracking ages of unused dirty pages. Overall the algorithm worked well. An oversight jumped out at me though. Since the amount of memory to purge is discretely computed at each tick (rather than being based on a pre-computed purging schedule), the decay approximates a piecewise curve that asymptotically approaches the moving maximum. I had imagined this decay as linear, but it is not. As indicated in my Applicative presentation, the ideal decay curve is sigmoidal, and I was intending a piecewise linear approximation, but this algorithm is effectively a delayed exponential curve.

I went back to the drawing board to find a way to implement true sigmoidal decay curves. I have an algorithm for this now, and simulations indicate it behaves as desired. The basic idea is to keep a trailing record of how many unused dirty pages were created at each of the past N time slices, and apply a sigmoidal decay curve independently to each time slice. For example, if we want to assure that newly generated unused dirty pages persist no longer than 5 seconds, we record how many are generated during the current time slice, and then during each subsequent time slice for the next 5 seconds, we multiply the sigmoidal decay curve factor (which decays from 1.0 down to 0.0) with the original number of unused dirty pages. If the current number of unused dirty pages exceeds the limit set by the decay curve, we purge enough pages to drop below the limit.

Each time we make a purging decision, we must compute the sum of per time slice limits on unused dirty pages. The number of required computations is linear in the number of time slices required to cover the decay time, e.g. 100 for a 5 second decay and 50 ms time slices. Through a combination of precomputed tables, fixed point integer math, and the magic of strength reduction, we can optimize this into a series of efficient integer operations that should have inconsequential performance impact in the context of how infrequently we make purging decisions (per arena, maximum once per time slice).

Regarding the sigmoidal decay curve, we want a curve that actually hits 1.0 and 0.0, rather than merely asymptotically approaching them. The smoothstep family of curves fits this need precisely. I simulated with all three (smoothstep, smootherstep, and smootheststep), and although any of the three would suffice, I'm conceptually leaning toward smootherstep because it makes sense to me that we would want the first and second derivatives (akin to velocity and acceleration) to start at zero. In practice it might not matter because of quantization effects: we will purge the LRU unused dirty page run as a whole, and we only purge once the entirety of that run is above the decay curve (i.e. we level-trigger purging at post-purge dirty page count rather than pre-purge dirty page count).

@jasone jasone closed this as completed in 243f7a0 Feb 20, 2016
@jasone
Copy link
Member Author

jasone commented Feb 20, 2016

Note that 243f7a0 depends on over a dozen preceding commits, especially smoothstep table generation (8e82af1).

gunnarku pushed a commit to facebook/mysql-5.6 that referenced this issue Mar 8, 2016
Summary:
Enabling this functionality based on the results from a variety of performance tests. Please see jemalloc/jemalloc#325 to learn more about the functionality this change enables.

Supported ways to enable this functionality are either using `malloc_conf` or `MALLOC_CONF` environment variable. Global variable takes precedence over the environment variable.

To remove enabling this option, just revert these changes or delete the variable declaration.

Test Plan:
  - LinkBench
  - SqlBench
  - SysBench

Reviewers: jtolmer

Reviewed By: jtolmer

Subscribers: webscalesql-eng

Differential Revision: https://reviews.facebook.net/D55119
ronawho added a commit to ronawho/chapel that referenced this issue May 27, 2016
Use jemalloc's decay-based purging instead of the default ratio-based purging.
Decay-based purging was a major optimization added with jemalloc 4.1.0, but it
won't be the default until at least 5.0. I've seen some pretty nice performance
speedups from using decay-based purging, so make it our default.

For binary trees (a benchmark that really measures allocator speed) this
results in a ~10% speedup on the shootout-like machine.

From the 4.1.0 release notes:

    Implement decay-based unused dirty page purging, a major optimization with
    mallctl API impact. This is an alternative to the existing ratio-based
    unused dirty page purging, and is intended to eventually become the sole
    purging mechanism.

See jemalloc/jemalloc#325 for more info on
decay-based purging
ronawho added a commit to chapel-lang/chapel that referenced this issue May 31, 2016
Enable jemalloc's decay-based purging

Use jemalloc's decay-based purging instead of the default ratio-based purging.
Decay-based purging was a major optimization added with jemalloc 4.1.0, but it
won't be the default until at least 5.0. I've seen some pretty nice performance
speedups from using decay-based purging, so make it our default.

For binary trees (a benchmark that really measures allocator speed) this
results in a ~10% speedup on the shootout-like machine.

From the 4.1.0 release notes:

    Implement decay-based unused dirty page purging, a major optimization with
    mallctl API impact. This is an alternative to the existing ratio-based
    unused dirty page purging, and is intended to eventually become the sole
    purging mechanism.

See jemalloc/jemalloc#325 for more info on
decay-based purging
hermanlee pushed a commit to facebook/mysql-5.6 that referenced this issue Jan 31, 2017
Summary:
Enabling this functionality based on the results from a variety of performance tests. Please see jemalloc/jemalloc#325 to learn more about the functionality this change enables.

Supported ways to enable this functionality are either using `malloc_conf` or `MALLOC_CONF` environment variable. Global variable takes precedence over the environment variable.

To remove enabling this option, just revert these changes or delete the variable declaration.

Test Plan:
  - LinkBench
  - SqlBench
  - SysBench

Reviewers: jtolmer

Reviewed By: jtolmer

Subscribers: webscalesql-eng

Differential Revision: https://reviews.facebook.net/D55119
VitaliyLi pushed a commit to VitaliyLi/mysql-5.6 that referenced this issue Feb 9, 2017
Summary:
Enabling this functionality based on the results from a variety of performance tests. Please see jemalloc/jemalloc#325 to learn more about the functionality this change enables.

Supported ways to enable this functionality are either using `malloc_conf` or `MALLOC_CONF` environment variable. Global variable takes precedence over the environment variable.

To remove enabling this option, just revert these changes or delete the variable declaration.

Test Plan:
  - LinkBench
  - SqlBench
  - SysBench

Reviewers: jtolmer

Reviewed By: jtolmer

Subscribers: webscalesql-eng

Differential Revision: https://reviews.facebook.net/D55119
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant