
Improving in memory query performance Design


Purpose:

The purpose of this document is to explain the design and implementation for Improving-in-memory-performance.

Prior behavior

See Improving-in-memory-performance#current-behavior

Removing technical debt/Preparing

There was some technical debt that prevented us from easily designing and implementing this project. As of Dec 10, 2013, the fixes for this technical debt are on master in Tokutek/ft-index.

Refactoring basement nodes into a class

Prior to this effort, basement nodes weren't really a data structure (they were implemented by a set of macros) and leafentries included keys. Single leafentries (or groups of them) were accessed or modified in many different places. We created a class bn_data to contain all leafentries and keys in a basement node:

  • A mempool to hold the actual leafentries
  • An OMT to store pointers into the mempool in sorted order

bn_data hides most of the complexity of basement nodes (see ft/bndata.h and ft/bndata.cc).
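
For orientation, here is a minimal sketch of the shape of this class; the names and layout are simplified and hypothetical, and the real interface lives in ft/bndata.h:

    // Hypothetical, simplified sketch of the basement-node container:
    // a mempool owns the leafentry bytes, and a sorted array of pointers
    // into that pool stands in for the OMT.
    #include <vector>

    struct leafentry;   // opaque here; defined elsewhere in the tree

    class bn_data_sketch {
        std::vector<char> mempool;       // backing store for the leafentries
        std::vector<leafentry *> omt;    // sorted pointers into the mempool
    public:
        // Callers go through methods like these instead of touching
        // leafentries directly (names are illustrative only):
        //   void insert(const void *key, uint32_t keylen, const leafentry *le);
        //   leafentry *fetch(uint32_t idx);
        //   void delete_at(uint32_t idx);
    };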

Removing keys from leafentries

For simplicity, LEAFENTRY refers to the old leafentry that included a key, and leafentry refers to the new leafentry that does not include a key. The next step in our refactoring was to create the concept of KL_PAIRs (key/leafentry pairs). All methods for accessing LEAFENTRYs were refactored into one of three cases:

  • Those that needed access to just the key
  • Those that needed access to just the leafentry
  • Those that needed both

We wrote the basement node class such that its API treated keys and leafentries separately. We didn't yet have a dmt, so we made an intermediate structure: in memory, the key and leafentry were stored next to each other.

[4 bytes key_size] [key_size bytes] [rest of leafentry]

For example, leafentry with a key of "foo":

[4 byte integer: 4] [f o o \0] [rest of leafentry]

A "klpair" is the keylen, followed by the key bytes, followed by the leafentry. During deserialization we split each old KEY/leafentry pair into its key and leafentry, then wrote them into the mempool next to each other, so key/leafentry pairs were stored contiguously (4 bytes for the keylen, the key itself, then the leafentry).
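
In code, accessing such a pair looks roughly like this (an illustrative sketch of the layout above, not the tree's actual klpair accessors):

    // Sketch of accessors over the contiguous [keylen][key][leafentry]
    // layout described above (illustrative only).
    #include <cstdint>
    #include <cstring>

    static inline const char *klpair_key(const char *klpair) {
        return klpair + sizeof(uint32_t);             // key follows the 4-byte length
    }

    static inline const char *klpair_leafentry(const char *klpair) {
        uint32_t keylen;
        memcpy(&keylen, klpair, sizeof keylen);       // read the stored key length
        return klpair + sizeof(uint32_t) + keylen;    // leafentry follows the key
    }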

Using templated omt for type safety and speed

bn_data used an OMT to hold pointers into the mempool. The OMT has since turned into a wrapper around the templated omt (omt<void*>). We replaced the OMT in bn_data with a templated omt (omt<kl_pair_struct*>). This allowed the compiler to ensure additional type safety (instead of remembering to cast void*s back and forth) and also improved performance by a small amount (~5% on ydb-level tests).
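
The difference can be illustrated with a toy stand-in for the two containers (this is just the idea, not the real omt implementation):

    // Minimal illustration of why the templated container helps.
    #include <cstdint>
    #include <vector>

    struct kl_pair_struct;   // opaque

    // Untyped container: callers must cast void* back to the element type.
    struct untyped_omt {
        std::vector<void *> vals;
        void *fetch(uint32_t i) const { return vals[i]; }
    };

    // Typed container: the compiler enforces the element type, and calls can
    // be inlined instead of going through an untyped interface.
    template <typename T>
    struct typed_omt {
        std::vector<T> vals;
        T fetch(uint32_t i) const { return vals[i]; }
    };
    // typed_omt<kl_pair_struct *> plays the role of omt<kl_pair_struct*> in bn_data.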

Main effort

See Improving-in-memory-performance#inefficiencies and Improving-in-memory-performance#proposal. This details the progress of the development branch (see Tokutek/ft-index#46) as of Dec 10, 2013.

OMT in basement node stores offsets into mempool instead of pointers

We replaced 8-byte pointers with 4-byte offsets into the mempool. This saves 4 bytes per key/leafentry pair in a basement node, and also opens up future optimizations (see Remaining Work/Future work at the end). As long as there isn't wasted space in a mempool (e.g. in the serial insertion case), increasing the size of a mempool can be a simple realloc instead of a bunch of memcpys followed by fixing up pointers in the omt.
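
The following sketch (with hypothetical names) shows why offsets make growing the mempool cheap: nothing stored in the container needs fixing up when the backing buffer moves.

    // Illustrative sketch: with 4-byte offsets, growing the mempool never
    // invalidates what the container stores.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct bn_offsets_sketch {
        std::vector<char> mempool;        // leafentry storage
        std::vector<uint32_t> offsets;    // 4-byte offsets instead of 8-byte pointers

        void grow(size_t new_capacity) {
            mempool.reserve(new_capacity);          // the buffer may move...
            // ...but the stored offsets stay valid: nothing to fix up.
        }

        char *leafentry_at(uint32_t i) {
            return mempool.data() + offsets[i];     // resolve the offset on access
        }
    };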

Inlined keys into omt

We inlined the keys (4 bytes for the keylen plus the key itself) into the omt, so the omt now needed to store the 4-byte keylen, the key itself, and a 4-byte offset into the mempool. Keys are not necessarily constant length, so we needed a new data structure.

DMT (Dynamic OMT)

We created a new data structure very similar to the omt that supports dynamically sized elements. The dmt is a weight-balanced search tree with dynamically sized values. Like the omt, the dmt has an array form for faster lookups. The dmt only supports array format when all its values are the same size (for example, when all keys in a fractal tree are the same length). In array format we store neither node metadata nor a per-value length (the value length is stored only once), to save memory. As with the omt, when a delete or insert happens in the middle of the array, we convert to the dynamic tree form.
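
Roughly, the two formats look like this (an illustrative sketch, not the real dmt layout):

    // Illustrative sketch of the dmt's two in-memory formats.
    #include <cstdint>

    // Array format: only valid when every value is the same size, so the
    // length is stored once and the values are packed back to back with no
    // per-value metadata.
    struct dmt_array_sketch {
        uint32_t value_length;    // stored once for all values
        uint32_t num_values;
        char    *values;          // num_values * value_length bytes, in sorted order
    };

    // Tree format: a weight-balanced search tree whose per-node metadata
    // includes the value's length, so values may differ in size.
    struct dmt_tree_node_sketch {
        uint32_t left, right;       // child node indices
        uint32_t subtree_weight;    // used for weight balancing
        uint32_t value_length;      // this value's size
        // the value's bytes follow the node metadata
    };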

The algorithm for the dmt is virtually identical (conceptually) to the omt with several exceptions:

  • The dmt does not try to convert a dynamic tree back to array format.
    • Conversion back to array format can still happen if a node is serialized to disk and read back (and everything is constant sized).
    • Note: the dmt will remain in tree format in the case of serial insertions.
  • Where the omt sometimes moves values between nodes, the dmt instead always moves the node metadata.
    • This ensures that when switching nodes we do not have to allocate additional memory.
    • Note that this is not technically necessary when all keys are constant length, but keys can get much larger than the metadata for a node.

The API for the dmt is mostly like the omt's, except:

  • Inserts use functors
  • Optimized creation of dmts has a different api (may use a builder object)

The dmt is currently only used in basement nodes.
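
A hypothetical usage sketch of the functor style follows; the names and exact signatures here are assumptions, not the actual dmt interface:

    // Hypothetical "writer" functor: it tells the dmt how large the new
    // value is and how to materialize it into the slot the dmt carves out.
    #include <cstdint>
    #include <cstring>

    struct key_writer {
        const char *key;
        uint32_t    keylen;
        uint32_t size() const { return sizeof(uint32_t) + keylen; }
        void write(void *dest) const {
            memcpy(dest, &keylen, sizeof keylen);
            memcpy(static_cast<char *>(dest) + sizeof keylen, key, keylen);
        }
    };

    // dmt.insert_at(key_writer{key, keylen}, idx);  // the dmt would allocate
    //                                               // size() bytes, then call write()
    // For optimized creation, a builder object is filled in sorted order and
    // then converted into a dmt in one step.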

On-Disk format changes

We introduce on-disk format 25, which changes the on-disk format for partitions (basement nodes) only.

(Old) format 24 was:

  • 1 byte magic number
  • 4 byte num_entries
  • All LEAFENTRYs in sorted order

Format 25 is:

  • 1 byte magic number
  • 4 byte num_entries
  • A new basement-node header:
    • 4 byte key_data_size (total memory for all keys combined, including an extra 4 bytes per key)
      • Note: The extra 4 bytes is used either for the keylen (variable-length keys), or for the offset of that key's leafentry (for fixed-length keys)
    • 4 byte val_data_size (total memory for all leafentries combined)
    • 4 byte fixed_key_length (0 if keys are variable length)
    • 1 byte all_keys_same_length (0 or 1)
    • 1 byte keys_vals_separate (0 or 1)
      • Note: Currently keys_vals_separate must equal all_keys_same_length. In the future we may change the on disk format for variable sized keys.
  • if keys_vals_separate == 0 (which means variable sized keys)
    • All LEAFENTRYs in sorted order (this part is exactly the same as in version 24)
  • else (fixed sized keys)
    • for each key:
      • 4 byte le_offset (offset of leafentry in mempool)
      • fixed_key_length bytes: the actual key
    • The leafentry mempool (equivalently, all leafentries in sorted order)
      • Note: Before serialization the mempool is compacted (converted to sorted order with no fragmentation)
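
Expressed as a struct, the new header looks roughly like this (field names are paraphrased from the list above; illustrative only):

    // Sketch of the new basement-node header in format 25.
    #include <cstdint>

    struct bn_header_v25_sketch {
        uint32_t key_data_size;         // all keys plus 4 extra bytes per key
        uint32_t val_data_size;         // all leafentries combined
        uint32_t fixed_key_length;      // 0 if keys are variable length
        uint8_t  all_keys_same_length;  // 0 or 1
        uint8_t  keys_vals_separate;    // 0 or 1 (currently equal to the above)
    };

    // keys_vals_separate == 0: the LEAFENTRYs follow in sorted order, as in
    //   format 24.
    // keys_vals_separate == 1: [4-byte le_offset][fixed_key_length bytes of key]
    //   per entry in sorted order, followed by the compacted leafentry mempool.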

Deserialization of variable-sized keys and upgrading from version 24 are nearly identical. If the version is 24 we don't expect the additional header, and we overallocate memory (and then compact after loading). If the version is 25 we do the same, but we don't need to overallocate memory. Thus the upgrade is almost trivial.

Deserialization for fixed-size keys is nearly trivial. The memory for the keys is already in the dmt array format (unless the key length is not a multiple of 4, in which case we allocate extra memory and align it). Reading the leafentries is a memcpy.
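
A rough sketch of the fixed-key read path (a hypothetical helper that ignores the non-multiple-of-4 alignment case and all error handling):

    // Rough sketch of reading a format-25 partition with fixed-size keys.
    #include <cstdint>
    #include <vector>

    struct bn_fixed_keys_sketch {
        std::vector<char> keys;      // already in dmt array format
        std::vector<char> mempool;   // leafentries, sorted and compacted
    };

    static void read_fixed_keys_sketch(const char *buf, uint32_t key_data_size,
                                       uint32_t val_data_size,
                                       bn_fixed_keys_sketch *out) {
        // The key region is laid out as [le_offset][key] per entry, which is
        // the dmt's array format, so it can be copied wholesale.
        out->keys.assign(buf, buf + key_data_size);
        // Reading the leafentries is a single copy of the mempool.
        out->mempool.assign(buf + key_data_size,
                            buf + key_data_size + val_data_size);
    }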

Remaining Work/Future work/Open issues

  • DONE: Serial insertion optimization for basement nodes.
    • When there is no wasted space in a mempool, we can do a simple realloc to move leafentries from a small mempool to a larger one (instead of looking at each leafentry and copying them one at a time).
  • Code review
  • dmt
    • DONE: (Required) Additional tests
    • (Optional) fixing comments (some still refer to omt)
    • (Optional) adding comments (describe in dmt.h what a dmt is)
  • (optional) Remove technical debt by combining omt and dmt into single data structure.
    • Possibly make templated omt be a wrapper around the templated dmt.
  • (related, but can be done in the future) Tokutek/ft-index#108
    • The toku_memory_footprint calculation is wrong: it can be too large in common cases (and too small in pathological cases)
    • Modify memory calculations for omt and dmt to properly use toku_memory_footprint. (Work is wasted effort until/unless 108 is done)
      • Non-trivial, due to using extra space in the mempool for a temporary array.
  • (optional, probably not worth it) New disk format for variable sized keys.
  • (Future performance) Further reduce cache misses:
    • Change the dmt to have a third format (or possibly replace array format).
    • Use a van Emde Boas-like layout to further reduce cache misses during searches
      • This requires some additional research/design (probably with Michael)
      • Risk:
        • May end up costing more in computation (to access the layout) for point queries than we save in cache misses, and may not give any performance improvement.
        • (Low) risk of slightly lower performance for range queries even if point queries are faster
        • Risk of slower (de)serialization
      • Potential performance improvement is higher for smaller keys.
  • Performance:
    • The dmt has two forms: array and dynamic tree
      • We can add a third form: a fixed-key tree (inlined keys, all the same size). It would automatically turn into a dynamic tree if necessary.
        • This might improve performance slightly and reduce memory usage under certain circumstances.
        • Does not help at all for read-only workloads
    • DONE: An intermediate third form: (LOW EFFORT COMPARED TO OTHERS)
      • Keep the dynamic tree, but just keep a boolean that says "so far, everything is same size"
      • This would allow for the nice serialization path. Deletes never clear this bit; inserts can, if the inserted value is a different size.
    • We can try to improve (de)serialization time for fixed-length keys that are not a multiple of 4 bytes
    • We can try to improve serialization time (maybe not re-sort & compact under all situations).
      • e.g. skip compaction if there is no wasted space in mempool