Reconsidering "Reconsidering Custom Memory Allocation"

Nicolas van Kempen, Emery D. Berger

In this work, we revisit the Reconsidering Custom Memory Allocation paper [1]. That widely cited paper (selected as a Most Influential Paper [2]) demonstrated that custom memory allocators often do not provide significant performance benefits over a good general-purpose memory allocator.

That paper compared the original programs (with custom allocators) with versions that used general-purpose allocators (either directly or via a "shim" that redirected individual object allocations to malloc/free). However, their approach started with a clean-slate heap (no prior allocations); this does not account for the case of long-running applications, where heap fragmentation is inevitable. We show that, for fragmented heaps, custom memory allocators can significantly outperform modern general-purpose allocators.

Littering

To simulate the effect of long-running execution on the heap, we present a novel methodology we call littering. Littering allows us to measure the benefits of custom allocators without unfairly favoring malloc/free: unlike starting from a clean-slate heap, it simulates the conditions of a long-running program, where fragmentation is inevitable. Littering works by running a short pre-conditioning step, which allocates and frees objects, before the program starts.

More precisely, littering works in two phases:

  1. Detection: We run the program as usual, but with a detector library (LD_PRELOAD=libdetector.so <program>). The detector keeps track of every allocated object's size, binning sizes to obtain a rough size distribution of the objects used by the program, which it writes out to be consumed in the littering phase. The detector also records several allocation statistics, including the mean, min/max, and, most importantly, MaxLiveAllocations.

As an example, here is the size class distribution recorded by the detector for boxed-sim.

[Figure: allocation size distribution recorded for boxed-sim]

  2. Littering: With the statistics and histogram in hand, we next run the program under littering (LD_PRELOAD=liblitterer.so <program>). Littering allocates LITTER_MULTIPLIER * MaxLiveAllocations objects following the recorded size distribution, (optionally) shuffles the addresses of the allocated objects, and frees a 1 - LITTER_OCCUPANCY fraction of them. Because we use the same malloc as the program, the program starts with its heap in a littered state (a sketch of both phases follows this list). A few tunable parameters can be provided as environment variables:
    • LITTER_SEED: Seed to be used for the random number generator. Random by default.
    • LITTER_OCCUPANCY: Fraction of objects to keep on the heap. Value must be between 0 and 1.
    • LITTER_NO_SHUFFLE: Set to 1 to disable shuffling; the most recently allocated objects are then the ones freed.
    • LITTER_SLEEP: Sleep for the given number of seconds after littering, before starting the program. Disabled by default.
    • LITTER_MULTIPLIER: Multiplier of number of objects to allocate. Default is 20.
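
For concreteness, here is a minimal sketch of a detector-style interposer. This is our own illustration, not the actual libdetector.so: it only interposes malloc and free (ignoring calloc and realloc), forwards to glibc's __libc_malloc/__libc_free, and is not thread-safe (fine for these single-threaded benchmarks); the bin width and output file name are assumptions.

```c
/* Sketch of a detector-style interposer (not the actual libdetector.so). */
#include <stdio.h>
#include <stdlib.h>

extern void *__libc_malloc(size_t size);
extern void __libc_free(void *ptr);

#define NUM_BINS 64
#define BIN_WIDTH 16                     /* bin width is an assumption */

static size_t bins[NUM_BINS];            /* histogram of allocation sizes */
static long live, max_live;              /* current / maximum live objects */

void *malloc(size_t size) {
    size_t bin = size / BIN_WIDTH;
    bins[bin < NUM_BINS ? bin : NUM_BINS - 1]++;
    if (++live > max_live) max_live = live;
    return __libc_malloc(size);
}

void free(void *ptr) {
    if (ptr) live--;
    __libc_free(ptr);
}

/* Dump the statistics at program exit, for the littering phase to consume. */
__attribute__((destructor)) static void dump(void) {
    FILE *out = fopen("detector.out", "w");   /* output file name is an assumption */
    if (!out) return;
    fprintf(out, "MaxLiveAllocations %ld\n", max_live);
    for (int i = 0; i < NUM_BINS; i++)
        fprintf(out, "bin %d %zu\n", i, bins[i]);
    fclose(out);
}
```

And here is a corresponding sketch of the littering phase. Again, this is an illustration rather than the actual liblitterer.so: the size array and MaxLiveAllocations value are placeholders for what the real tool reads from the detector's output (which it samples weighted by recorded frequency), and the default occupancy used here is an assumption.

```c
/* Sketch of the littering phase (not the actual liblitterer.so). Runs before
 * main() via a constructor, using the same malloc the program will use. */
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static double env_or(const char *name, double fallback) {
    const char *value = getenv(name);
    return value ? atof(value) : fallback;
}

__attribute__((constructor)) static void litter(void) {
    static const size_t sizes[] = { 16, 32, 48, 64, 128 };  /* placeholder distribution */
    const size_t nsizes = sizeof sizes / sizeof sizes[0];
    const size_t max_live = 100000;                         /* placeholder MaxLiveAllocations */

    double multiplier = env_or("LITTER_MULTIPLIER", 20);
    double occupancy  = env_or("LITTER_OCCUPANCY", 0.5);    /* default here is an assumption */
    int shuffle       = getenv("LITTER_NO_SHUFFLE") == NULL;
    srand(getenv("LITTER_SEED") ? (unsigned) atoi(getenv("LITTER_SEED"))
                                : (unsigned) time(NULL));

    size_t n = (size_t)(multiplier * max_live);
    void **objects = malloc(n * sizeof *objects);
    if (!objects) return;

    /* 1. Allocate n objects, drawing sizes from the recorded distribution. */
    for (size_t i = 0; i < n; i++)
        objects[i] = malloc(sizes[(size_t) rand() % nsizes]);

    /* 2. Optionally shuffle (Fisher-Yates) so the freed objects are scattered
     *    across the heap rather than clustered at the end. */
    if (shuffle && n > 1)
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t) rand() % (i + 1);
            void *tmp = objects[i]; objects[i] = objects[j]; objects[j] = tmp;
        }

    /* 3. Keep an `occupancy` fraction; free the rest, leaving holes behind. */
    for (size_t i = (size_t)(occupancy * (double) n); i < n; i++)
        free(objects[i]);
    free(objects);

    /* 4. Optionally pause before the program starts (LITTER_SLEEP). */
    if (getenv("LITTER_SLEEP"))
        sleep((unsigned) atoi(getenv("LITTER_SLEEP")));
}
```

Both sketches would be built as shared objects (e.g., gcc -shared -fPIC) and injected with LD_PRELOAD as shown above.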

The diagram below illustrates the effect of littering on the heap. With a blank, fresh heap, the allocator is usually able to pack allocations into contiguous memory, yielding much better locality and cache performance throughout the program. Fragmentation, however, whether natural or induced by littering, forces the allocator to return non-contiguous addresses for successive allocations, which tends to degrade performance.

[Figure: heap layout with a fresh heap versus a littered (fragmented) heap]
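
The following small program (our own illustration, not part of the benchmark suite) makes this concrete: it prints the gaps between consecutive 64-byte allocations on a fresh heap, then litters the heap by freeing every other object out of a large batch, and prints the gaps again. The exact numbers depend on the allocator, but the second set of allocations typically lands in scattered holes rather than in a contiguous run.

```c
/* Illustration only: compare allocation placement on a fresh heap versus a
 * heap with litter-style holes. Exact gaps depend on the allocator. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    enum { N = 8, LITTER = 1024, SIZE = 64 };

    /* Fresh heap: consecutive allocations usually sit at a small, constant stride. */
    char *fresh[N];
    for (int i = 0; i < N; i++) fresh[i] = malloc(SIZE);
    for (int i = 1; i < N; i++)
        printf("fresh    gap: %td bytes\n", fresh[i] - fresh[i - 1]);

    /* Litter the heap: allocate many objects, then free every other one. */
    char *litter[LITTER];
    for (int i = 0; i < LITTER; i++) litter[i] = malloc(SIZE);
    for (int i = 0; i < LITTER; i += 2) free(litter[i]);

    /* The same requests now land in the holes, scattered across the heap. */
    char *littered[N];
    for (int i = 0; i < N; i++) littered[i] = malloc(SIZE);
    for (int i = 1; i < N; i++)
        printf("littered gap: %td bytes\n", littered[i] - littered[i - 1]);
    return 0;
}
```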

Experimental Platform

All data was obtained on a ThinkPad P15s Gen 2 with an Intel i7-1165G7 processor. All results below were obtained with LITTER_MULTIPLIER = 1, varying the occupancy. A multiplier of 1 is approximately equivalent to letting the program run once, keeping a random fraction of the objects it allocated on the heap, and then running it again.

Benchmarks

We evaluate littering on the four benchmarks from the original paper we were able to obtain and run.

| Benchmark | Custom allocation method | Description | Input |
| --- | --- | --- | --- |
| 197.parser | custom | English sentence parser from SPEC2000 | ref.in |
| boxed-sim | freelist | Polyhedral approximated balls bouncing in a box | -n 31 -s 311317 |
| mudlle | region | MUD interpreter | time.mud |
| 175.vpr | region | FPGA placement & routing from SPEC2000 | train placement |

Note that because we run on modern hardware, a number of these benchmarks run very quickly. Where possible, we increase the input size so that benchmarks run longer, allowing us to take more accurate measurements. We were not able to increase mudlle's input; that benchmark typically runs in under a second.

Details on custom allocation methods

  • 197.parser: This benchmark uses a custom stack-like allocator: it initially allocates a large chunk of memory and then serves each malloc call with a bump pointer. When an object is freed, it is marked, and the pointer is moved back to the end of the last unmarked object. In the shim, we replace xalloc with the usual malloc.

  • boxed-sim: This benchmark uses a free list for each type of object. In the shim, we simply disable the free lists and replace each check for a reusable object with a standard malloc call (see the free-list sketch after this list).

  • mudlle: A typical arena allocator. Citing calloc-heap.c:

    Based on the concept of blocks: you allocate memory from a block, and can free the whole block at one go. Individual deallocations are not possible.

    We shim this system by keeping allocated objects on a linked list and calling free on each of them when the block is freed (see the shim sketch after this list).

  • 175.vpr: A mix of a standard arena allocator and a "stack-like" allocator similar to the one used in 197.parser. "Chunks" are allocated, and allocations are served from the current chunk (or a new one is created when it fills up). Objects can be freed all at once by saving the chunk pointer before starting bulk allocations and passing it back when freeing.
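
To make these shims concrete, here are two minimal sketches. Both are our own illustrations under assumed names, not the code shipped with the benchmarks. The first shows the per-type free-list pattern boxed-sim uses: freed objects are pushed onto a list and reused before falling back to malloc (the shim simply removes the list and always calls malloc/free). The struct ball type and its fields are hypothetical.

```c
/* Sketch of a per-type free list: freed objects are kept on a list and
 * reused; the shim version deletes the list and always calls malloc/free. */
#include <stdlib.h>

struct ball {                          /* hypothetical object type and fields */
    double position[3];
    double velocity[3];
};

static struct ball *ball_freelist;     /* links reuse each freed object's first word */

struct ball *ball_alloc(void) {
    if (ball_freelist) {               /* reuse a previously freed object */
        struct ball *b = ball_freelist;
        ball_freelist = *(struct ball **) b;
        return b;
    }
    return malloc(sizeof(struct ball));   /* otherwise fall back to malloc */
}

void ball_release(struct ball *b) {
    *(struct ball **) b = ball_freelist;  /* push onto the free list */
    ball_freelist = b;
}
```

The second shows the idea behind the mudlle and 175.vpr shims: each allocation becomes an individual malloc, and the block keeps a linked list of its objects so that "free the whole block at one go" still works.

```c
/* Sketch of a region-allocator shim: individual mallocs plus a per-block
 * linked list, so freeing the whole block remains possible. */
#include <stddef.h>
#include <stdlib.h>

/* Header stored in front of every object; the union keeps the pointer we
 * hand back aligned for any type. */
union shim_node {
    union shim_node *next;
    max_align_t align;
};

struct shim_block {
    union shim_node *objects;    /* all allocations made from this block */
};

struct shim_block *block_new(void) {
    return calloc(1, sizeof(struct shim_block));
}

/* Replaces "bump-allocate from the block": one malloc per object. */
void *block_alloc(struct shim_block *block, size_t size) {
    union shim_node *node = malloc(sizeof *node + size);
    if (!node) return NULL;
    node->next = block->objects;
    block->objects = node;
    return node + 1;             /* user memory starts right after the header */
}

/* Replaces "free the whole block": walk the list and free every object. */
void block_free(struct shim_block *block) {
    union shim_node *node = block->objects;
    while (node) {
        union shim_node *next = node->next;
        free(node);
        node = next;
    }
    free(block);
}
```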

Results

We present a series of graphs for each benchmark, using the following versions of glibc (the Linux default allocator), jemalloc, and mimalloc.

| Allocator | Version |
| --- | --- |
| glibc | 2.31 |
| jemalloc | 5.3.0 |
| mimalloc | 2.0.6 |

For each benchmark, the elapsed-time graph is normalized to the elapsed time of the original program (running with its custom allocator) without littering. We also generate graphs that separate the events counted with perf into those occurring inside and outside of malloc/free, based on which shared object (library) each sample comes from (e.g., libjemalloc.so).

Without Littering

We run each benchmark without littering, both with its custom allocator enabled and with the shim.

| Benchmark | custom | glibc malloc | jemalloc | mimalloc |
| --- | --- | --- | --- | --- |
| 197.parser | 37.62s | 49.97s (1.33x) | 43.94s (1.17x) | 40.41s (1.07x) |
| boxed-sim | 17.14s | 17.78s (1.04x) | 17.19s (1.00x) | 17.95s (1.05x) |
| mudlle | 0.0900s | 0.1226s (1.36x) | 0.1022s (1.14x) | 0.0960s (1.07x) |
| 175.vpr | 33.30s | 33.86s (1.02x) | 33.27s (1.00x) | 35.23s (1.06x) |

197.parser

[Figures for 197.parser: elapsed time, LLC miss rate, and instructions by occupancy; cycles, LLC misses, and instructions separated inside/outside malloc]

boxed-sim

[Figures for boxed-sim: elapsed time, LLC miss rate, and instructions by occupancy; cycles, LLC misses, and instructions separated inside/outside malloc]

mudlle

[Figures for mudlle: elapsed time, LLC miss rate, and instructions by occupancy; cycles, LLC misses, and instructions separated inside/outside malloc]

175.vpr

[Figures for 175.vpr: elapsed time, LLC miss rate, and instructions by occupancy; cycles, LLC misses, and instructions separated inside/outside malloc]

Footnotes

  1. Emery D. Berger, Benjamin G. Zorn, and Kathryn S. McKinley. 2002. Reconsidering custom memory allocation. In Proceedings of the 17th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications (OOPSLA '02). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/582419.582421

  2. Most Influential OOPSLA Paper Award. https://www.sigplan.org/Awards/OOPSLA