SMAL is a simple mark-sweep allocator for small objects. It is designed to be as minimal, performant and modular as possible.
- It is not a drop-in replacement for `malloc()` and `free()`.
- It uses only `malloc()`, `free()`, `mmap()`, and `munmap()`.
. - It has no default collection scheduling policy; users must explicitly start a collection.
- It has no default root marking policy.
- It is neither conservative nor type-safe; users can configure it for either scenario.
- It should work in cooperation with any `malloc()`/`free()` implementation.
- Most advanced options are supported on Linux and OS X.
- Threads.
- Explicit, exact root declarations of globals and stack variables.
- Implicit, conservative root scanning of stack variables and register sets.
- Weak references.
- Reference queues.
- Finalizers.
- Remembered Sets for Mostly Unchanging Objects (see below).
- `smal_type_for_desc(smal_type_descriptor *desc)` locates or creates a `smal_type` object for objects of a particular size and particular mark and free functions.
- `smal_alloc(smal_type *type)` allocates a new object of the given `smal_type`.
- `smal_mark_ptr(void * ptr)` marks a potential pointer. This is called from within a `smal_type`'s `mark_func`.
- `smal_collect()` starts a collection.
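A hypothetical end-to-end sketch of this API (the `smal_type_descriptor` field names here are assumptions based on the description above; `my_cons` and `my_cons_mark` come from the mark-function example later in this document):

```
smal_type_descriptor desc = { 0 };
desc.object_size = sizeof(my_cons);   /* assumed field: fixed object size */
desc.mark_func   = my_cons_mark;      /* assumed field: mark callback */
desc.free_func   = 0;                 /* assumed field: no free callback */

smal_type *cons_type = smal_type_for_desc(&desc);
my_cons *c = smal_alloc(cons_type);   /* allocated by type, not size */

/* ... at a point of the user's choosing ... */
smal_collect();
```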
`smal_type` represents a user-data type of a fixed size, with mark and free callbacks.

`smal_page` represents an `mmap()` region of `smal_page_size` bytes, aligned to `smal_page_size`.

`smal_buffer` represents the state of objects in a `smal_page`. It handles parceling and bookkeeping for all objects allocated from its `smal_page`.

A `smal_type` defines the size of objects, and other characteristics, that will be allocated from one or more `smal_buffer`s. `mark_func` and `free_func` callbacks can be registered for each `smal_type`.
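A rough sketch of how these structures might relate, using only the fields named in this document (the real definitions live in the SMAL headers and carry more bookkeeping):

```
typedef struct smal_type {
  size_t object_size;                /* fixed size of every object of this type */
  void *(*mark_func)(void *ptr);     /* marks an object's referents; see below */
  void  (*free_func)(void *ptr);     /* called for each object when swept */
} smal_type;

typedef struct smal_buffer {
  smal_type *type;                   /* the single type parceled from this page */
  size_t page_id;                    /* this page's page_id (see addressing below) */
  void *begin_ptr;                   /* start of the parcelable region */
  void *end_ptr;                     /* end of the parcelable region */
  void *alloc_ptr;                   /* next unparceled address; bumped by object_size */
  void *free_list;                   /* objects recovered by sweep or smal_free() */
  /* ... mark bitmap, free bitmap, other bookkeeping ... */
} smal_buffer;
```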
Each `smal_page` is allocated from the OS via `mmap()` and is guaranteed to be aligned to `smal_page_size`. The reasons for this aggressive alignment are discussed below. Alignment is achieved by first attempting to allocate `smal_page_size` bytes with a `MAP_ANON` `mmap()`. If this buffer is not aligned, SMAL `munmap()`s the attempted buffer, `mmap()`s `smal_page_size * 2` bytes, and subsequently `munmap()`s the unaligned portion(s).
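A simplified sketch of that alignment dance, assuming `smal_page_size` is a power of two (the real code may differ in detail):

```
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Allocate smal_page_size bytes aligned to smal_page_size (sketch). */
static void *smal_page_alloc(size_t smal_page_size) {
  void *p = mmap(0, smal_page_size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANON, -1, 0);
  if (p == MAP_FAILED)
    return 0;
  if (((uintptr_t) p & (smal_page_size - 1)) == 0)
    return p;                                  /* already aligned */

  /* Not aligned: give it back, over-allocate by 2x, trim the excess. */
  munmap(p, smal_page_size);
  p = mmap(0, smal_page_size * 2, PROT_READ | PROT_WRITE,
           MAP_PRIVATE | MAP_ANON, -1, 0);
  if (p == MAP_FAILED)
    return 0;

  uintptr_t base    = (uintptr_t) p;
  uintptr_t aligned = (base + smal_page_size - 1) & ~ (uintptr_t) (smal_page_size - 1);
  if (aligned > base)                          /* unaligned head */
    munmap(p, aligned - base);
  size_t tail = (size_t) ((base + smal_page_size * 2) - (aligned + smal_page_size));
  if (tail)                                    /* unaligned tail */
    munmap((void *) (aligned + smal_page_size), tail);
  return (void *) aligned;
}
```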
SMAL supports two `smal_buffer` allocation options:

- Use a portion at the head of each page for the page's `smal_buffer` structure.
- Allocate a `smal_buffer` structure using `malloc()` and store a pointer to it in the first word of the page.

With the former, a `smal_page*` and its `smal_buffer*` are the same pointer, but bookkeeping operations will dirty pages during allocation and collection; this can cause unnecessary copy-on-write effort in forked processes. The latter avoids page mutations during allocation and collection but incurs an additional pointer dereference when mapping `smal_page*` pointers to `smal_buffer*` pointers.
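The difference looks roughly like this in code (sketch):

```
void *page = /* a smal_page_size-aligned page from mmap() */ 0;

/* Option 1: the smal_buffer lives at the head of its page, so the
   smal_page* and the smal_buffer* are the same address. */
smal_buffer *buf1 = (smal_buffer *) page;

/* Option 2: the smal_buffer is malloc()ed elsewhere and the page's
   first word points to it -- one extra dereference, but the page
   itself stays clean during allocation and collection. */
smal_buffer *buf2 = *(smal_buffer **) page;
```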
Objects are allocated by `smal_alloc(smal_type*)`. Note: objects are allocated by type, not size. Objects are parceled on demand from a `smal_buffer` by incrementing `alloc_ptr` by the `smal_type`'s fixed object size, `object_size`, starting at `begin_ptr` and terminating at `end_ptr`.
Each `smal_buffer` keeps its own `free_list`. A free bitmap is also maintained to avoid double-frees during sweep. `smal_buffer`s without active objects are `munmap()`ed and returned to the OS after `smal_collect()`. Objects can be explicitly freed by `smal_free(void *ptr)`.
Since SMAL does not compact or relocate objects during collection, it attempts to allocate from the `smal_buffer` with the fewest available objects (either unparceled or on the free list) associated with the requested `smal_type`.
SMAL has no default collection scheduling algorithm or policy. Users must decide when to call `smal_collect()`.
The alignment restrictions on `smal_buffer`s ensure very minimal `mmap()`/OS-level fragmentation. They also allow for efficient mapping from arbitrary pointers to `smal_buffer*`, making SMAL suitable for both conservative and non-conservative collection.

Every potential pointer maps to a unique `page_id`, computed by dividing the pointer address by `smal_page_size`. Since every `mmap()`ed page is always aligned to `smal_page_size`, it is trivial to map a potential object pointer to a potential `page_id` and its `smal_buffer*`.
A non-colliding hash table mapping `page_id`s to `smal_buffer*`s is maintained. The lack of a table entry for a given `page_id` means that no potential pointer within the aligned `smal_page_size` region associated with that `page_id` points to an object allocated from a `smal_buffer`.

Because the table is guaranteed to be non-colliding, mapping an arbitrary pointer to a `smal_buffer` can be done in O(1) time, using a divide (or shift), a modulo, and a vector index.
This operation takes 6 x86 instructions.
If a `smal_buffer*` can be found for a potential pointer, the pointer must also lie within the `smal_buffer`'s region between `begin_ptr` and `alloc_ptr`.
This operation takes 8 x86 instructions.
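A sketch of that mapping; `buffer_table`, `buffer_table_size`, and the `page_id` field are named here as assumptions consistent with the description above:

```
#include <stdint.h>

extern size_t smal_page_size;        /* assumed globals for this sketch */
extern smal_buffer **buffer_table;   /* non-colliding page_id -> buffer table */
extern size_t buffer_table_size;

/* Map an arbitrary potential pointer to its smal_buffer, or 0 (sketch). */
static smal_buffer *ptr_to_buffer(void *ptr) {
  size_t page_id = (size_t) ((uintptr_t) ptr / smal_page_size);  /* divide (or shift) */
  smal_buffer *buf = buffer_table[page_id % buffer_table_size];  /* modulo + index */
  if (! buf || buf->page_id != page_id)
    return 0;                        /* no smal_page covers this address */
  /* The pointer must also fall within the parceled region. */
  if ((char *) ptr <  (char *) buf->begin_ptr ||
      (char *) ptr >= (char *) buf->alloc_ptr)
    return 0;
  return buf;
}
```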
Preliminary size overhead is approximately 4.5% for 10,000,000 16-byte objects.
- SMAL cannot service allocations larger than `smal_page_size - sizeof(smal_buffer) - object_alignment` or `smal_page_size - sizeof(void*) - object_alignment`, depending on the configuration.
- The `page_id` hash table may become large and sparse when `mmap()` tends not to allocate `smal_buffer`s in a mostly contiguous fashion -- other active allocators (e.g. `malloc()`) may cause holes in the address space `mmap()`ed by SMAL.
Each `smal_buffer` maintains a mark bitmap and a free bitmap, indexable by every object parceled from the `smal_buffer`. Placing mark bits in a separate bitmap reduces memory overhead, especially for `fork()`ed processes -- it protects copy-on-write pages from full mutation due to mark-bit manipulation.
An object's mark bitmap index is computed by dividing the difference between its pointer and its `smal_buffer`'s `begin_ptr` by the size of the objects (`object_size`) allocated from the `smal_buffer`.
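In code (assuming, as in the sketches above, that the buffer can reach its type's `object_size`):

```
/* Mark bitmap index of an object ptr parceled from buf (sketch). */
static size_t mark_index(smal_buffer *buf, void *ptr) {
  return ((char *) ptr - (char *) buf->begin_ptr) / buf->type->object_size;
}
```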
If an object pointer is known to be allocated and aligned, `smal_mark_ptr()` can:

- test the mark bit in 14 x86 instructions,
- set the mark bit in 5 x86 instructions, and
- recurse into the `mark_func` in 3 x86 instructions.
Each `smal_buffer` maintains a free list. Sweeping unused objects into the free list will cause page mutations. There will be an option to use only the free bitmap to find free objects for allocation, at the expense of allocation speed.
SMAL supports an optional mark queue to avoid C stack recursion in `mark_func`. This is slightly slower than C recursion but will reduce stack overflows for threads with small C stacks. The mark queue uses `malloc()`/`free()`.
The mark function returns a `void*`. A non-zero return value will be continued for marking; this allows the mark function to avoid tail recursion into the mark engine. For example:

```
typedef void *my_oop;
typedef struct my_cons { my_oop car, cdr; } my_cons;

static void *my_cons_mark (void *ptr) {
  smal_mark_ptr(ptr, ((my_cons *) ptr)->car);
  return ((my_cons *) ptr)->cdr;  /* returned pointer is marked next, avoiding tail recursion */
}
```
The algorithms/data structures are as follows (a sketch combining the marking steps follows this list):

- Map a potential object pointer to a potential `page_id`.
- Map a potential `page_id` to a potential `smal_buffer*` using `buffer_table`.
- Determine if the object pointer is within the `smal_buffer`'s `smal_page` region where object allocations took place.
- Determine the mark bitmap offset.
- If the object is not already marked:
  - mark the object in the bitmap, and
  - call the `smal_type`'s `mark_func()` to continue marking other objects.
- For each `smal_buffer`:
  - MORE HERE.
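A sketch of the marking steps; `ptr_to_buffer` and `mark_index` are the sketches above, and `bitmap_test`/`bitmap_set` and the `mark_bitmap` field are hypothetical helpers:

```
/* Mark ptr, then keep marking through mark_func's return values (sketch). */
static void mark_step(void *ptr) {
  while (ptr) {
    smal_buffer *buf = ptr_to_buffer(ptr);   /* page_id, buffer_table, range check */
    if (! buf)
      return;                                /* not a pointer into any smal_buffer */
    size_t i = mark_index(buf, ptr);         /* mark bitmap offset */
    if (bitmap_test(buf->mark_bitmap, i))
      return;                                /* already marked; stop */
    bitmap_set(buf->mark_bitmap, i);
    ptr = buf->type->mark_func(ptr);         /* non-zero return: continue marking */
  }
}
```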
SMAL supports designating a `smal_type` as containing mostly unchanging objects. The `smal_buffer`s for these objects are tracked for mutations using a per-buffer write barrier, and can be scanned for pointers less frequently by keeping a remembered set of the references that point outside the buffer. This is functional on Linux and OS X.
Types can be created that will sweep objects only every N collections. This should only be used for object types that are unlikely to ever become unreferenced during normal collection life-cycles. This can be used in conjunction with the "Mostly Unchanging Objects" feature.
SMAL implements a write barrier to determine if a `smal_buffer` has been mutated since the last collection. This is used to invalidate remembered sets within a buffer. The write barrier uses `mprotect()`. Write-barrier faults are trapped with POSIX `sigaction` on Linux and Mach exception ports on OS X.
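A minimal Linux-only sketch of such a barrier; `mark_buffer_dirty` is hypothetical, and a production handler must also cope with faults unrelated to the barrier:

```
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

static size_t smal_page_size = 4096;       /* assumed for this sketch */

/* Hypothetical: record that the buffer's remembered set is invalid. */
static void mark_buffer_dirty(void *page);

static void barrier_handler(int sig, siginfo_t *si, void *ctx) {
  /* Round the faulting address down to its smal_page_size-aligned page. */
  void *page = (void *) ((uintptr_t) si->si_addr & ~ (uintptr_t) (smal_page_size - 1));
  mark_buffer_dirty(page);
  /* Re-enable writes so the faulting store proceeds when the handler returns. */
  mprotect(page, smal_page_size, PROT_READ | PROT_WRITE);
}

static void barrier_install(void) {
  struct sigaction sa;
  memset(&sa, 0, sizeof sa);
  sa.sa_sigaction = barrier_handler;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGSEGV, &sa, 0);
}

/* Arm the barrier: the next write to the page faults exactly once. */
static void barrier_protect(void *page) {
  mprotect(page, smal_page_size, PROT_READ);
}
```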
SMAL relies on `MAP_ANON` `mmap()`/`munmap()` and `malloc()`/`free()`, and thus can co-exist with any other `mmap()`-based `malloc()`/`free()` library. SMAL probably cannot co-exist with allocators that use `sbrk()`.
The user must provide the following functions:

```
void smal_collect_inner_before(void *top_of_stack);
void smal_collect_before_mark();
void smal_collect_after_mark();
void smal_collect_before_sweep();
void smal_collect_after_sweep();
void smal_collect_mark_roots();
```

In particular, the user must provide a `smal_collect_mark_roots()` function.
This allows the user to configure and optimize for different environments:
- single-threaded vs. multi-threaded
- conservative vs. type-safe
SMAL provides a `smal_roots` package to explicitly declare root pointers on the C stack or in global variables.
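For a single-threaded, exact-roots configuration, these callbacks might be stubbed out roughly as follows (`my_mark_global_roots` is a hypothetical user function; a conservative configuration would instead scan the stack from `top_of_stack`):

```
void smal_collect_inner_before(void *top_of_stack) { /* no stack scanning needed */ }
void smal_collect_before_mark()  { }
void smal_collect_after_mark()   { }
void smal_collect_before_sweep() { }
void smal_collect_after_sweep()  { }

void smal_collect_mark_roots() {
  /* Mark explicitly declared globals and any smal_roots-declared
     stack variables (hypothetical helper). */
  my_mark_global_roots();
}
```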
SMAL supports a simple `smal_thread` abstraction around `pthreads`. The `smal_roots`, finalization, and reference packages are thread-aware and thread-safe. See include/smal/thread.h for more info.
The rest of the SMAL allocator is not completely thread-safe yet. SMAL does not yet have portable mechanisms to:
- discover all active threads,
- pause all threads,
- collect their stacks, roots and registers,
- resume all threads.
When built with threading support, SMAL attempts to use many small mutex and read/write-lock regions to avoid long or global locks. There is a single allocation read/write lock that is held to allow disabling of allocation from each buffer before collection starts. The multi-threaded version is approximately 2 to 3 times slower than the single-threaded version when running only one thread. Obviously there is room for improvement.
SMAL supports global object enumeration (e.g. Ruby's `ObjectSpace.each_object`).
SMAL contains an optional, thread-safe weak reference and reference queue implementation. See include/smal/reference.h.
SMAL contains an optional, thread-safe, lazy finalizer implementation. See include/smal/finalizer.h.
MIT License
- Thread Support:
  - Complete thread-safety.
  - Thread-safe stop-world.
  - Thread-safe stack and register discovery.
  - Active thread discovery.
- Concurrent Sweeping.