Permalink
Find file
d83f542 Apr 13, 2015
@nrc @Liamsi @asmaloney
331 lines (265 sloc) 15.5 KB

Graphs and arena allocation

(Note you can run the examples in this chapter by downloading this directory and running cargo run).

Graphs are a bit awkward to construct in Rust because of Rust's stringent lifetime and mutability requirements. Graphs of objects are very common in OO programming. In this tutorial I'm going to go over a few different approaches to implementation. My preferred approach uses arena allocation and makes slightly advanced use of explicit lifetimes. I'll finish up by discussing a few potential Rust features which would make using such an approach easier.

A graph is a collection of nodes with edges between some of those nodes. Graphs are a generalisation of lists and trees. Each node can have multiple children and multiple parents (we usually talk about edges into and out of a node, rather than parents/children). Graphs can be represented by adjacency lists or adjacency matrices. The former is basically a node object for each node in the graph, where each node object keeps a list of its adjacent nodes. An adjacency matrix is a matrix of booleans indicating whether there is an edge from the row node to the column node. We'll only cover the adjacency list representation, adjacency matrices have very different issues which are less Rust-specific.

There are essentially two orthogonal problems: how to handle the lifetime of the graph and how to handle it's mutability.

The first problem essentially boils down to what kind of pointer to use to point to other nodes in the graph. Since graph-like data structures are recursive (the types are recursive, even if the data is not) we are forced to use pointers of some kind rather than have a totally value-based structure. Since graphs can be cyclic, and ownership in Rust cannot be cyclic, we cannot use Box<Node> as our pointer type (as we might do for tree-like data structures or linked lists).

No graph is truly immutable. Because there may be cycles, the graph cannot be created in a single statement. Thus, at the very least, the graph must be mutable during its initialisation phase. The usual invariant in Rust is that all pointers must either be unique or immutable. Graph edges must be mutable (at least during initialisation) and there can be more than one edge into any node, thus no edges are guaranteed to be unique. So we're going to have to do something a little bit advanced to handle mutability.

One solution is to use mutable raw pointers (*mut Node). This is the most flexible approach, but also the most dangerous. You must handle all the lifetime management yourself without any help from the type system. You can make very flexible and efficient data structures this way, but you must be very careful. This approach handles both the lifetime and mutability issues in one fell swoop. But it handles them by essentially ignoring all the benefits of Rust - you will get no help from the compiler here (it's also not particularly ergonomic since raw pointers don't automatically (de-)reference). Since a graph using raw pointers is not much different from a graph in C++, I'm not going to cover that option here.

The options you have for lifetime management are reference counting (shared ownership, using Rc<...>) or arena allocation (all nodes have the same lifetime, managed by an arena; using borrowed references &...). The former is more flexible (you can have references from outside the graph to individual nodes with any lifetime), the latter is better in every other way.

For managing mutability, you can either use RefCell, i.e., make use of Rust's facility for dynamic, interior mutability, or you can manage the mutability yourself (in this case you have to use UnsafeCell to communicate the interior mutability to the compiler). The former is safer, the latter is more efficient. Neither is particularly ergonomic.

Note that if your graph might have cycles, then if you use Rc, further action is required to break the cycles and not leak memory. Since Rust has no cycle collection of Rc pointers, if there is a cycle in your graph, the ref counts will never fall to zero, and the graph will never be deallocated. You can solve this by using Weak pointers in your graph or by manually breaking cycles when you know the graph should be destroyed. The former is more reliable. We don't cover either here, in our examples we just leak memory. The approach using borrowed references and arena allocation does not have this issue and is thus superior in that respect.

To compare the different approaches I'll use a pretty simple example. We'll just have a Node object to represent a node in the graph, this will hold some string data (representative of some more complex data payload) and a Vec of adjacent nodes (edges). We'll have an init function to create a simple graph of nodes, and a traverse function which does a pre-order, depth-first traversal of the graph. We'll use this to print the payload of each node in the graph. Finally, we'll have a Node::first method which returns a reference to the first adjacent node to the self node and a function foo which prints the payload of an individual node. These functions stand in for more complex operations involving manipulation of a node interior to the graph.

To try and be as informative as possible without boring you, I'll cover two combinations of possibilities: ref counting and RefCell, and arena allocation and UnsafeCell. I'll leave the other two combinations as an exercise.

Rc<RefCell<Node>>

See full example.

This is the safer option because there is no unsafe code. It is also the least efficient and least ergonomic option. It is pretty flexible though, nodes of the graph can be easily reused outside the graph since they are ref-counted. I would recommend this approach if you need a fully mutable graph, or need your nodes to exist independently of the graph.

The node structure looks like

struct Node {
    datum: &'static str,
    edges: Vec<Rc<RefCell<Node>>>,
}

Creating a new node is not too bad: Rc::new(RefCell::new(Node { ... })). To add an edge during initialisation, you have to borrow the start node as mutable, and clone the end node into the Vec of edges (this clones the pointer, incrementing the reference count, not the actual node). E.g.,

let mut mut_root = root.borrow_mut();
mut_root.edges.push(b.clone());

The RefCell dynamically ensures that we are not already reading or writing the node when we write it.

Whenever you access a node, you have to use .borrow() to borrow the RefCell. Our first method has to return a ref-counted pointer, rather than a borrowed reference, so callers of first also have to borrow:

fn first(&self) -> Rc<RefCell<Node>> {
    self.edges[0].clone()
}

pub fn main() {
    let g = ...;
    let f = g.first();
    foo(&*f.borrow());
}

&Node and UnsafeCell

See full example.

In this approach we use borrowed references as edges. This is nice and ergonomic and lets us use our nodes with 'regular' Rust libraries which primarily operate with borrowed references (note that one nice thing about ref counted objects in Rust is that they play nicely with the lifetime system. We can create a borrowed reference into the Rc to directly (and safely) reference the data. In the previous example, the RefCell prevents us doing this, but an Rc/UnsafeCell approach should allow it).

Destruction is correctly handled too - the only constraint is that all the nodes must be destroyed at the same time. Destruction and allocation of nodes is handled using an arena.

On the other hand, we do need to use quite a few explicit lifetimes. Unfortunately we don't benefit from lifetime elision here. At the end of the section I'll discuss some future directions for the language which could make things better.

During construction we will mutate our nodes which might be multiply referenced. This is not possible in safe Rust code, so we must initialise inside an unsafe block. Since our nodes are mutable and multiply referenced, we must use an UnsafeCell to communicate to the Rust compiler that it cannot rely on its usual invariants.

When is this approach feasible? The graph must only be mutated during initialisation. In addition, we require that all nodes in the graph have the same lifetime (we could relax these constraints somewhat to allow adding nodes later as long as they can all be destroyed at the same time). Similarly, we could rely on more complicated invariants for when the nodes can be mutated, but it pays to keep things simple, since the programmer is responsible for safety in those respects.

Arena allocation is a memory management technique where a set of objects have the same lifetime and can be deallocated at the same time. An arena is an object responsible for allocating and deallocating the memory. Since large chunks of memory are allocated and deallocated at once (rather than allocating individual objects), arena allocation is very efficient. Usually, all the objects are allocated from a contiguous chunk of memory, that improves cache coherency when you are traversing the graph.

In Rust, arena allocation is supported by the libarena crate and is used throughout the compiler. There are two kinds of arenas - typed and untyped. The former is more efficient and easier to use, but can only allocate objects of a single type. The latter is more flexible and can allocate any object. Arena allocated objects all have the same lifetime, which is a parameter of the arena object. The type system ensures references to arena allocated objects cannot live longer than the arena itself.

Our node struct must now include the lifetime of the graph, 'a. We wrap our Vec of adjacent nodes in an UnsafeCell to indicate that we will mutate it even when it should be immutable:

struct Node<'a> {
    datum: &'static str,
    edges: UnsafeCell<Vec<&'a Node<'a>>>,
}

Our new function must also use this lifetime and must take as an argument the arena which will do the allocation:

fn new<'a>(datum: &'static str, arena: &'a TypedArena<Node<'a>>) -> &'a Node<'a> {
    arena.alloc(Node {
        datum: datum,
        edges: UnsafeCell::new(Vec::new()),
    })
}

We use the arena to allocate the node. The lifetime of the graph is derived from the lifetime of the reference to the arena, so the arena must be passed in from the scope which covers the graph's lifetime. For our examples, that means we pass it into the init method. (One could imagine an extension to the type system which allows creating values at scopes outside their lexical scope, but there are no plans to add such a thing any time soon). When the arena goes out of scope, the whole graph is destroyed (Rust's type system ensures that we can't keep references to the graph beyond that point).

Adding an edge is a bit different looking:

(*root.edges.get()).push(b);

We're essentially doing the obvious root.edges.push(b) to push a node (b) on to the list of edges. However, since edges is wrapped in an UnsafeCell, we have to call get() on it. That gives us a mutable raw pointer to edges (*mut Vec<&Node>), which allows us to mutate edges. However, it also requires us to manually dereference the pointer (raw pointers do not auto-deref), thus the (*...) construction. Finally, dereferencing a raw pointer is unsafe, so the whole lot has to be wrapped up in an unsafe block.

The interesting part of traverse is:

for n in &(*self.edges.get()) {
    n.traverse(f, seen);
}

We follow the previous pattern for getting at the edges list, which requires an unsafe block. In this case we know it is in fact safe because we must be post- initialisation and thus there will be no mutation.

Again, the first method follows the same pattern for getting at the edges list. And again must be in an unsafe block. However, in contrast to the graph using Rc<RefCell<_>>, we can return a straightforward borrowed reference to the node. That is very convenient. We can reason that the unsafe block is safe because we do no mutation and we are post-initialisation.

fn first(&'a self) -> &'a Node<'a> {
    unsafe {
        (*self.edges.get())[0]
    }
}

Future language improvements for this approach

I believe that arena allocation and using borrowed references are an important pattern in Rust. We should do more in the language to make these patterns safer and easier to use. I hope use of arenas becomes more ergonomic with the ongoing work on allocators. There are three other improvements I see:

Safe initialisation

There has been lots of research in the OO world on mechanisms for ensuring mutability only during initialisation. How exactly this would work in Rust is an open research question, but it seems that we need to represent a pointer which is mutable and not unique, but restricted in scope. Outside that scope any existing pointers would become normal borrowed references, i.e., immutable or unique.

The advantage of such a scheme is that we have a way to represent the common pattern of mutable during initialisation, then immutable. It also relies on the invariant that, while individual objects are multiply owned, the aggregate (in this case a graph) is uniquely owned. We should then be able to adopt the reference and UnsafeCell approach, without the UnsafeCells and the unsafe blocks, making that approach more ergonomic and more safer.

Alex Summers and Julian Viereck at ETH Zurich are investigating this further.

Generic modules

The 'lifetime of the graph' is constant for any particular graph. Repeating the lifetime is just boilerplate. One way to make this more ergonomic would be to allow the graph module to be parameterised by the lifetime, so it would not need to be added to every struct, impl, and function. The lifetime of the graph would still need to be specified from outside the module, but hopefully inference would take care of most uses (as it does today for function calls).

See ref_graph_generic_mod.rs for how that might look. (We should also be able to use safe initialisation (proposed above) to remove the unsafe code).

See also this RFC issue.

This feature would vastly reduce the syntactic overhead of the reference and UnsafeCell approach.

Lifetime elision

We currently allow the programmer to elide some lifetimes in function signatures to improve ergonomics. One reason the &Node approach to graphs is a bit ugly is because it doesn't benefit from any of the lifetime elision rules.

A common pattern in Rust is data structures with a common lifetime. References into such data structures give rise to types like &'a Foo<'a>, for example &'a Node<'a> in the graph example. It would be nice to have an elision rule that helps in this case. I'm not really sure how it should work though.

Looking at the example with generic modules, it doesn't look like we need to extend the lifetime elision rules very much (I'm not actually sure if Node::new would work without the given lifetimes, but it seems like a fairly trivial extension to make it work if it doesn't). We might want to add some new rule to allow elision of module-generic lifetimes if they are the only ones in scope (other than 'static), but I'm not sure how that would work with multiple in- scope lifetimes (see the foo and init functions, for example).

If we don't add generic modules, we might still be able to add an elision rule specifically to target &'a Node<'a>, not sure how though.