
Improving in-memory performance


This document explores ideas for improving the in-memory performance of the fractal tree library. At the time of this writing, we are about to ship TokuMX 1.0.3 and TokuDB 7.0.3.

Currently, our main bottleneck on in-memory performance seems to be the following: the OMTs used to store basement nodes and the OMTs used to index the message buffers of internal nodes exhibit poor caching behavior. Although the two issues are very similar, let's look at them individually. We will start with the basement nodes.

Current Behavior

As of July 19, 2013, on master, the in-memory representation of basement nodes is as follows. At a high level, the data is stored in a mempool, and the OMT references pointers that happen to point into the mempool. When a basement node is read into memory, the data is laid out, in key order, in a mempool. An OMT, whose values are pointers into the mempool, is then built. So, suppose we have a mempool that is addressed at 0xa0000000 (I do not know or care whether that can be a real address), and suppose each leafentry is exactly 16 bytes. Then the OMT built for the basement node will have the values (0xa0000000, 0xa0000010, 0xa0000020, ...).
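To make that layout concrete, here is a minimal sketch, using hypothetical names rather than the real ft-index types, of a basement node as described above: leafentries packed into a mempool, plus an array-format OMT whose values are raw pointers into that pool.

```cpp
// Hypothetical sketch of the current basement node representation.
#include <cstdint>
#include <cstring>
#include <vector>

struct MempoolSketch {
    std::vector<uint8_t> buf;   // contiguous storage for the leafentries (keys + vals)
    size_t used = 0;

    // Append a blob and return its address inside the pool, or nullptr if full.
    void *append(const void *data, size_t len) {
        if (used + len > buf.size()) return nullptr;   // caller must compact/grow first
        void *p = buf.data() + used;
        std::memcpy(p, data, len);
        used += len;
        return p;
    }
};

struct BasementNodeSketch {
    MempoolSketch mp;
    std::vector<void *> omt;    // "array format" OMT: pointers into mp, in key order
};
```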

OMTs currently have two formats. One is the "array" format. The array format is what it sounds like: basically just an array of OMTVALUEs. This format is very space efficient, and it makes it simple to create OMTs from arrays, which we do in multiple places, such as the transaction/MVCC code. The other format is the "tree" format. In the tree format, the internal representation is still an array, but OMT nodes store pointers to their children. This format is a little less space efficient.
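As an illustration only (field names are made up, not the real OMT internals), the two formats might look something like this:

```cpp
#include <cstdint>

// "Array" format: just a sorted array of OMTVALUEs.
struct OmtArrayFormatSketch {
    void   **values;      // values[i] is the i-th OMTVALUE in key order
    uint32_t num_values;
};

// "Tree" format: still backed by an array, but each node also records its
// children, so inserts and deletes in the middle don't shift every element.
struct OmtTreeNodeSketch {
    void    *value;
    uint32_t left;        // index of left child, or a "none" sentinel
    uint32_t right;       // index of right child, or a "none" sentinel
    uint32_t weight;      // subtree size, used to keep the tree balanced
};
```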

OMTs start in array format. As soon as an OMTVALUE is inserted or deleted, the OMT converts to the tree format [Leif: inserts/deletes at the end do not cause this, only inserts/deletes in the middle]. So, a basement node that is brought into memory and only ever queried should always remain in array format. When the OMT is rebalanced, it is rebalanced into the array format [Leif: only if the root is rebalanced. rebalancing another node makes it a fully balanced tree, not an array.].

Additionally, when a basement node's mempool has all of its memory used up, a new mempool is allocated, and the elements of the OMT are packed into the new mempool in sequential order. This process is called compacting the OMT. Note that we cannot do a simple realloc, because the OMT stores the actual pointers into the mempool, not offsets into the mempool.
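A rough sketch of that compaction step, assuming (purely for illustration) that each leafentry begins with a 4-byte total length:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed for this sketch only: a leafentry's first 4 bytes hold its total size.
static size_t leafentry_size(const void *le) {
    uint32_t len;
    std::memcpy(&len, le, sizeof len);
    return len;
}

// Copy every entry into a fresh pool (already sized large enough), in OMT order,
// rewriting every OMT value along the way. This is why a plain realloc is not
// enough today: the OMT holds raw addresses into the old pool.
static void compact(std::vector<void *> &omt_values, std::vector<uint8_t> &new_pool) {
    size_t used = 0;
    for (void *&v : omt_values) {
        size_t len = leafentry_size(v);
        std::memcpy(new_pool.data() + used, v, len);
        v = new_pool.data() + used;    // each pointer must be updated
        used += len;
    }
}
```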

Now that we understand how a basement node is manipulated, let's discuss the inefficiencies, in no particular order.

Inefficiencies

I see three inefficiencies, in order of importance; the first is by far the biggest. Right now, searching a basement node means searching within the OMT and dereferencing keys that are sparsely located within the mempool. I say sparsely because the keys are packed together with leafentries that contain vals, and the vals may be quite big, hundreds of bytes. So, each key may reside in its own L1 cacheline, practically guaranteeing a cache miss per level of the search. The second inefficiency is that the OMT itself does not fit in a single cacheline, so searching the OMT WITHOUT dereferencing the keys will probably also incur a bunch of cache misses. The third inefficiency is that when we want to grow the mempool, we have to iterate over the contents of the OMT and pack them into a new space, because the OMT stores direct addresses inside the mempool instead of offsets into the mempool.

Proposal

Solving the third inefficiency is simple. Instead of storing an eight-byte address in the OMT, we can store a four-byte offset. On pure insertion workloads, we can then do a simple realloc when the mempool runs out of space, and compact the key space only when we notice the mempool becoming heavily fragmented by deletes. I prototyped this once upon a time in trac ticket #4135, but there was too much other stuff happening for this change to make it in. Solving this will also enable dealing with the first two inefficiencies.
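A minimal sketch of the offset idea, with illustrative names: the OMT stores 4-byte offsets, so the pool's backing memory can be regrown without touching the OMT at all.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

struct OffsetMempoolSketch {
    std::vector<uint8_t> buf;
    size_t used = 0;

    // Returns the entry's offset. Growing buf may move the underlying memory,
    // but offsets handed out earlier stay valid -- no repacking needed.
    uint32_t append(const void *data, uint32_t len) {
        if (used + len > buf.size())
            buf.resize(std::max(buf.size() * 2, used + (size_t)len));
        std::memcpy(buf.data() + used, data, len);
        uint32_t off = (uint32_t)used;
        used += len;
        return off;
    }

    // The OMT (or anyone else) resolves an offset to an address on demand.
    void *deref(uint32_t off) { return buf.data() + off; }
};
```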

To solve the first two inefficiencies, we can do something along the following lines.

First, let's inline the keys inside the OMT, so that when we examine an OMT node, we don't induce a cache miss to look at the key. Keys are variable size, so we need the ability to inline variable-length keys. I think we can get away with using two bytes to store the length [Leif: how about VLQ?]. This also means never having the array format when searching basement nodes. Given the caching behavior of the current array-format basement node OMT, I think I am OK with that, but I would like to hear feedback. Note that by giving up the array format, we introduce 10 bytes of overhead per OMT node: two for the length of the key and eight for the child pointers.
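For illustration, a tree-format OMT node with the key inlined might look like the sketch below (the layout and names are my own, not a committed design); the two child indices plus the 2-byte key length account for the ~10 bytes of per-node overhead mentioned above.

```cpp
#include <cstdint>

struct InlineKeyOmtNodeSketch {
    uint32_t left;        // index of left child, or UINT32_MAX for none
    uint32_t right;       // index of right child, or UINT32_MAX for none
    uint32_t le_offset;   // offset of the full leafentry (key + vals) in the mempool
    uint16_t keylen;      // length of the key bytes stored inline...
    // ...followed immediately by keylen bytes of key data, so a search can
    // compare keys without chasing a pointer into the mempool.
};
```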

Second, we should try to lay out the nodes in a smarter order than sequential, or whatever it is we have now. If we use something like a van Emde Boas layout (google it, please), we should get better caching behavior. As the OMT is modified, the layout will stray from the van Emde Boas layout (which is a static layout), but we can repack it into the VEB layout on rebalance, or after N operations (for some N). This needs some design and experimentation work. Honestly, I think the high-order bit here is that we should lay out the data in SOMETHING smarter. It does not have to be VEB, and we need to design the specifics, but I think there are gains to be had here. VEB is just one good option, and it is cache-oblivious.
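To make the van Emde Boas idea concrete, here is a small self-contained sketch (not tied to the OMT code) that computes the VEB ordering of a complete binary tree whose nodes are identified by their 1-based BFS/heap index: lay out the top half of the levels first, then each bottom subtree, recursively.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Append, to `order`, the VEB ordering of the `levels`-level complete subtree
// rooted at BFS (heap) index `root`.
static void veb_order(uint64_t root, int levels, std::vector<uint64_t> &order) {
    if (levels <= 0) return;
    if (levels == 1) { order.push_back(root); return; }
    int top = levels / 2;            // levels in the top recursive subtree
    int bottom = levels - top;       // levels in each bottom subtree
    veb_order(root, top, order);     // lay out the top subtree first
    // Then lay out each bottom subtree, left to right. Their roots are the
    // BFS descendants of `root` exactly `top` levels down.
    uint64_t first = root << top;    // leftmost descendant at that depth
    uint64_t count = 1ULL << top;    // number of bottom subtrees
    for (uint64_t i = 0; i < count; i++)
        veb_order(first + i, bottom, order);
}

int main() {
    std::vector<uint64_t> order;
    veb_order(1, 4, order);          // 4 levels => 15 nodes
    for (uint64_t idx : order) std::cout << idx << ' ';
    std::cout << '\n';               // prints: 1 2 3 4 8 9 5 10 11 6 12 13 7 14 15
}
```

With this ordering, any root-to-leaf walk touches a small number of contiguous regions regardless of the cacheline or block size, which is what makes the layout cache-oblivious.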

Plan of attack

Here is a plan of attack for getting this work done:

  • one person needs to decouple the keys from the leafentries. That includes handling inefficiency #3. Right now, keys are embedded in ULEs and leafentries; this coupling should not exist. We can take benchmarks here if we want (we probably should, just for our sanity).
  • in parallel if possible, one person can work on an OMT that allows OMT nodes to store variable-length values. This first step does NOT need a smarter layout of nodes like the VEB layout.
  • with keys decoupled from leafentries and the first-pass OMT work done, we can replace the basement node OMTs with these new OMTs that inline keys. At this point, we should see SOME gains in performance, because cache misses will be reduced. We should also run perf just to see what changes. Hopefully, we see fewer cache misses.
  • as an aside, at this point, we may want to take the opportunity to change the on-disk format of leaf nodes to lay out the keys first, and then the vals. This is not strictly necessary, but it may prove easier than dealing with the existing on-disk format, and it may help compression a little bit.
  • at this point, if we want to (and I hope we do), we can try to lay out the nodes of the OMT in some smarter way, like VEB.

Also, a note here: when talking about this new OMT, I don't want to do away with the old OMT. The old OMT is used in other places and works well, and there is no need to change that code. Perhaps we just want to add a new OMT alongside it?

Some high level challenges and thoughts on whatever we do:

  • getting serialization and deserialization to work right
  • getting rebalancing of leaf nodes on checkpoint clone to work right. We should also be careful about adding work to this path, as it may increase checkpoint variability.
  • Assuming we want to use a different OMT, I don't want to change the existing OMT, at least not immediately. That is well-tested, working code used everywhere, and there is no need to risk issues in other areas when optimizing this one. [Leif: hopefully we can get rid of the complexity of the subtree marking aspect in one of the implementations]
  • Do range queries suffer?

As a final point, internal nodes and the OMTs that index their message buffers suffer from the same issues. We can apply the same logic and solve the same problems there. This should help insertion speed and mixed read/write workloads, such as TPCC. I don't think any design-level discussion needs to happen on that. [Leif: in addition, as we do this work, it would be nice to try to abstract away the on-disk and in-memory serialization formats of leaf nodes and internal nodes a little bit better, to make efforts like this one easier now and in the future]