
High Concurrency B-Tree


An implementation of Lehman and Yao's high-concurrency B-Tree in C, with Jaluta's balanced B-link tree operations and a locking enhancement for increased concurrency.

The methods presented by Jaluta compromise concurrency. As an update search proceeds down the B-Tree, U latches on the child nodes are obtained, the child nodes are brought into memory from disk, and structure modifications are possibly issued, all while the parent node's U latch is held. Only one update scan at a time is supported into any area of the Btree under a parent node whose latch is held across disk accesses. The latch protocol and btree methods presented here address this compromise. In addition, the delay of structure modifications until the next update search requires extra complications in the method implementations. For comparison, an exact reference implementation of Jaluta's balanced B-tree algorithms is included on the download page; it is about 50% more lines of code.

Instead of the hierarchical S-U-X BTree latch modes, there are three independent page latches. This is accomplished using advisory file locking from the underlying O/S over three independent sub-segments of each node’s address space:

  1. A sharable AccessIntent lock and an exclusive NodeDelete lock.
  2. The classic sharable ReadLock and exclusive WriteLock for the node.
  3. An exclusive ParentModification lock, held while a node's fence value is posted to or removed from its parent node (SMO) during a split or consolidation. It is obtained with lock-coupling over the page's WriteLock. This ensures that these structure modifications are serialized with respect to one another, while allowing release of the WriteLock on the node for increased concurrency when there is only one SMO in progress for a given node. (A sketch of the three lock sets using advisory byte-range locks follows this list.)
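
As a rough illustration of how advisory locking from the underlying O/S can carve a node's lock space into these three sub-segments, the sketch below uses POSIX fcntl byte-range locks with one byte per lock set per page. The offsets, helper names, and layout are assumptions for illustration, not the project's actual lock layout.

```c
#include <fcntl.h>
#include <unistd.h>

/* One byte per lock set per node: AccessIntent/NodeDelete share the first
   byte, ReadLock/WriteLock the second, ParentModification the third.
   This layout is illustrative only. */
typedef enum { AccessSet, PageSet, ParentSet } BtLockSetExample;

/* Block until the advisory lock is granted: F_RDLCK for the sharable
   modes (AccessIntent, ReadLock), F_WRLCK for the exclusive modes
   (NodeDelete, WriteLock, ParentModification). */
static int node_lock (int fd, unsigned long long page_no,
                      BtLockSetExample set, int sharable)
{
    struct flock lk = { 0 };

    lk.l_type = sharable ? F_RDLCK : F_WRLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start = page_no * 3 + set;   /* the node's sub-segment for this set */
    lk.l_len = 1;

    return fcntl (fd, F_SETLKW, &lk); /* wait until compatible */
}

static int node_unlock (int fd, unsigned long long page_no, BtLockSetExample set)
{
    struct flock lk = { 0 };

    lk.l_type = F_UNLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start = page_no * 3 + set;
    lk.l_len = 1;

    return fcntl (fd, F_SETLK, &lk);
}
```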

This enhancement allows symmetric Blink node consolidation and reuse with relative simplicity compared to previous approaches.

An additional improvement over Jaluta's Blink node layout is the introduction of a delete bit on each key value. This permits deleting a node's fence key without having to physically remove it from the node, thus obviating the need to maintain a second copy of fence values in the nodes.


Node Consolidation

Empty nodes are removed in two stages. First, a half-merge is performed by moving the right node's contents into the empty left node. Second, the left node's fence key is removed from its parent node and the right node's fence key is updated with the left node's pointer. Finally, any remaining readers of the right node are drained by obtaining its NodeDelete and WriteLock, and the deleted right node is placed on the free list as an available node. When a fence key in an upper level node is deleted, the levels above it are updated with the node's new fence key; this is not required when the fence key of a leaf page is deleted. When the root node is reduced to a single entry, a level is removed from the tree by promoting the single child as the new root node.

The desired effect is enabling and simplifying empty node consolidation and reuse at minimal cost while maintaining the level of concurrency of the original BLink method.
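
To make the two stages concrete, here is a toy, in-memory sketch of the half-merge and the free-list step; the struct fields and helpers are stand-ins and omit the latching detailed under Implementation Details below.

```c
#include <string.h>

/* Toy node with only the fields needed to show consolidation;
   not the project's page layout. */
typedef struct ToyNode {
    struct ToyNode *right;      /* Blink right-sibling pointer   */
    struct ToyNode *free_next;  /* free-list chain once recycled */
    int kill;                   /* node is being deleted         */
    int cnt;                    /* number of keys                */
    const char *keys[64];       /* key pointers (toy)            */
} ToyNode;

static ToyNode *free_list;

/* Stage one: half-merge. Move the right node's contents into the empty
   left node and point the right node at the left, so any reader still
   arriving at the right node is redirected. */
static void half_merge (ToyNode *left, ToyNode *right)
{
    memcpy (left->keys, right->keys, right->cnt * sizeof right->keys[0]);
    left->cnt = right->cnt;
    left->right = right->right;
    right->right = left;
    right->kill = 1;
}

/* Stage two (the parent fence-key fix-up and reader drain are elided):
   once unreachable, the right node is chained onto the free list. */
static void recycle (ToyNode *right)
{
    right->free_next = free_list;
    free_list = right;
}
```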

Key Size and Node Layout

The node size is specifiable in bits, from 9 (512 bytes) thru 20 (1MB), and nodes store variable sized keys, each from 1 thru 255 bytes in length. Each key has a deleted bit which allows keeping the fence key value on the page after it is deleted. The space occupied by deleted keys is reclaimed during node cleanup.
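
A layout along these lines could be expressed as below; the field names and bit widths are illustrative (a 20-bit offset is enough to address keys anywhere in the largest 1MB page), not the actual on-page format.

```c
#include <stdint.h>

/* Illustrative slot entry: the key's offset within the page plus the
   per-key "dead" (deleted) bit that keeps a deleted fence key readable. */
typedef struct {
    uint32_t off  : 20;   /* key offset within the page (covers 1MB pages) */
    uint32_t dead : 1;    /* key logically deleted, space reclaimed later  */
    uint32_t rsvd : 11;   /* spare bits                                    */
} BtSlotExample;

/* Illustrative key image: a one-byte length (1..255) followed by the bytes. */
typedef struct {
    unsigned char len;
    unsigned char key[1];   /* first key byte; the rest follow on the page */
} BtKeyExample;
```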

Buffer Pool Manager

A buffer pool manager for memory mapped segments of pages is included that utilizes the underlying O/S file mapping interface. Up to 65536 segments are managed (this is a linux and WIN32 system limitation). The Buffer Manager will also run without a pool or file mapping, using standard file system reads and writes when the pool size is set to zero segments. Segment size is set to the number of desired pages in bits (a setting of 4 means 16 pages). With a large enough segment size it is possible to entirely map the Btree into virtual memory. Note that this file mapping approach is not supported over network file systems.
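
For illustration, mapping and addressing one pool segment might look like the sketch below (POSIX mmap shown; the WIN32 builds would use the equivalent file-mapping calls). The helper names and parameter handling are assumptions.

```c
#include <sys/mman.h>

/* Map one pool segment: 2^seg_bits pages of 2^page_bits bytes, starting
   at segment number seg_no within the btree file. */
static void *map_segment (int fd, unsigned page_bits, unsigned seg_bits,
                          unsigned long long seg_no)
{
    size_t seg_size = (size_t)1 << (page_bits + seg_bits);
    off_t offset = (off_t)(seg_no << (page_bits + seg_bits));

    void *mem = mmap (0, seg_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    return mem == MAP_FAILED ? 0 : mem;
}

/* A page is then located by masking its low page-number bits into the segment. */
static void *page_in_segment (void *seg, unsigned page_bits, unsigned seg_bits,
                              unsigned long long page_no)
{
    unsigned long long mask = ((unsigned long long)1 << seg_bits) - 1;
    return (unsigned char *)seg + ((page_no & mask) << page_bits);
}
```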

In the single process/multi-threaded implementations the buffer pool manager is shared by multiple threads and utilizes a combination of "lock-striping" and GCLOCK in the buffer pool manager's hash table to reduce contention for the manager's resources.
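
The GCLOCK piece can be pictured as below: each pool slot carries a reference count that is raised on a hit, and the clock hand decays counts until it finds an unpinned slot at zero to evict. This is a generic sketch of the policy, not the manager's actual data structures.

```c
/* Generic GCLOCK victim selection over an array of pool slots. */
typedef struct {
    int ref;   /* bumped on each access ("hit")  */
    int pin;   /* slot is in active use, skip it */
} ClockSlotExample;

static int clock_victim (ClockSlotExample *slot, int nslots, int *hand)
{
    for (;;) {                       /* assumes at least one unpinned slot */
        int i = *hand;
        *hand = (*hand + 1) % nslots;

        if (slot[i].pin)
            continue;
        if (slot[i].ref > 0) {
            slot[i].ref--;           /* give recently used slots another lap */
            continue;
        }
        return i;                    /* ref count exhausted: evict this slot */
    }
}
```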

A more traditional buffer pool manager was written beginning with btree2u which uses the first n pages of the btree file to support an n page buffer pool in a multi-process/single-thread environment. As pages are evicted from the buffer pool they are written to the btree using file I/O. In the tst.dbsuper results shown below, a buffer pool of 4GB was configured.

Recommended Parameter Values

From experimental trials, when there is enough memory to fit the Btree into physical memory, the programs run best with the buffer pool manager set to 64000 segments, the page size set to 12 bits, and the segment size set according to the maximum file size. The 28.7 million key, 304MB distinct_1 dataset from www.naskitis.com indexes in 1:11 on a linux64 system with a 1GB page buffer pool (64000 segments with 2^3 pages per segment). A search of the distinct_1 index using the skew1_1 dataset, 177 million keys in a 1GB file, requires 5:13.

After the size of the Btree exceeds the size of physical memory, the time spent accessing the disk for leaf nodes begins to dwarf these considerations, and the btree2u traditional approach of file I/O from the buffer pool to the backing Btree is more efficient.

64 Bit Operational Notes

When run on a 64 bit operating system, it is possible to configure the buffer pool manager to bring the entire B-tree into the address space, thus obviating the need to unmap and remap file segments to cover all of the Btree pages. The buffer pool segment size is configured by the number of pages per segment in bits. Thus a setting of 8 for this parameter together with a page size of 12 bits means that each of the 64000 buffer pool segments will be 1MB in size for a total of 64GB. Larger values up to 20 will mean that a Btree size of 256TB can be managed without the overhead of repeatedly mapping and remapping segments.
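
The arithmetic behind those figures can be checked directly (page bits of 12, 64000 pool segments); this is just a worked sizing example, matching the rounded figures quoted above.

```c
#include <stdio.h>

int main (void)
{
    unsigned long long seg8  = 1ULL << (12 + 8);   /* 2^8 pages  of 4KB = 1MB/segment */
    unsigned long long seg20 = 1ULL << (12 + 20);  /* 2^20 pages of 4KB = 4GB/segment */

    printf ("64000 x 1MB segments = %llu GB\n", 64000 * seg8  >> 30);  /* ~62GB, quoted as 64GB   */
    printf ("64000 x 4GB segments = %llu TB\n", 64000 * seg20 >> 40);  /* ~250TB, quoted as 256TB */
    return 0;
}
```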

The traditional btree2u approach of managing individual pages in a buffer pool requires only enough pages in the buffer pool to hold the upper, non-leaf levels of the Btree.

Multiple Process/Multiple Thread Version

The multi-process/multi-thread versions are available on the download page. The operating-system advisory locking calls for the B-tree page latches have been replaced by much faster test-and-set (__sync_lock_test_and_set) gcc built-ins on linux and similar calls on Windows. The -D STANDALONE command line interface has been re-ordered to allow multiple input file names, which are processed by independent threads in parallel. The page latch sets are located on the first few B-tree pages to permit multi-process applications to share access to the B-tree and its latches. The resulting latch manager does not guarantee fairness to contending threads.
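
A minimal exclusive latch built on that gcc builtin looks roughly like the sketch below; the project's latch manager is more elaborate (it also provides sharable read latches), and, as noted above, neither it nor this sketch is fair.

```c
/* Minimal test-and-set spin latch using the gcc atomic builtins. */
typedef volatile unsigned char BtSpinExample;

static void spin_lock (BtSpinExample *latch)
{
    while (__sync_lock_test_and_set (latch, 1))   /* returns previous value */
        while (*latch)                            /* spin on plain reads    */
            ;
}

static void spin_unlock (BtSpinExample *latch)
{
    __sync_lock_release (latch);                  /* store 0 with release   */
}
```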

Note: linux version 2.6 serializes mmap calls among threads, which hobbles the buffer pool manager in multi-threaded applications when it is configured with inadequate resources. Ensure the buffer pool manager is configured with page size, pool segment size, and pool slots such that remapping is not required during operation. Btree2u is not affected by this problem since only its buffer pool is mmaped into memory.

Latch Manager

To enable multi-processing, a latch manager was written beginning with threads2h and btree2t, which uses the first few pages of the btree file to host a fixed number of latches. Threads2h uses sharable pthreads read/write locks for pages, while btree2t uses smaller test & set latches. The default value for BT_latchtable is 128. In btree2u the latch table is incorporated into the buffer pool.
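
One possible shape for such a latch table, hosted on the first pages of the file so that cooperating processes can map and share it, is sketched below; the structure and the hash are examples, not the actual layout.

```c
/* Illustrative latch table hosted in the btree file's first pages and
   shared by mapping them; field names and the hash are examples only. */
#define BT_LATCH_TABLE 128

typedef struct {
    volatile unsigned char readwr;   /* ReadLock / WriteLock latch      */
    volatile unsigned char access;   /* AccessIntent / NodeDelete latch */
    volatile unsigned char parent;   /* ParentModification latch        */
    unsigned char busy;              /* slot assignment latch           */
    unsigned long long page_no;      /* page currently using this set   */
} BtLatchSetExample;

/* Hash a page number onto one of the fixed latch-set slots. */
static unsigned latch_hash (unsigned long long page_no)
{
    return (unsigned)(page_no % BT_LATCH_TABLE);
}
```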

Detailed Benchmarks

On a 64 bit linux system with 4 cpu cores, 16GB RAM, and a Nimble raid storage array, the following benchmark results were obtained using the skew1_1 thru skew5_1 datasets from www.naskitis.com. These files contain 752495240 keys, of which 1401774 are unique. Some of the tests listed below append a line number to the keys, which makes all 752M keys unique.

threads2h tst.db w 12 16384 1 0 skew1_1 skew2_1 skew3_1 skew4_1 skew5_1

| program | buffer params | btree size | real/user/system | notes |
|---|---|---|---|---|
| threads2h | 12 16384 1 | 52MB | 6:20/21:03/01:25 | 5 threads |
| threads2i | 12 16384 1 | 52MB | 5:34/19:56/00:42 | 5 threads |
| threads2j | 12 16384 1 | 52MB | 6:04/20:30/00:41 | 5 threads |
| btree2s | (file-io) | 54MB | 45:51/13:22/155:96 | 5 processes |
| btree2t | (file-io) | 54MB | 17:07/13:56/34:33 | 5 processes |
| btree2u | 12 20000 | 43.7MB | 4:44/16:30/01:30 | 5 processes |
| btree2v | 12 20000 | 43.7MB | 5:39/17:30/01:20 | 5 processes |
| jaluta2 | 14 4096 0 | 51MB | 88:00/50:00/225:00 | 5 processes |

Line numbers were appended to the skew1_1 thru skew5_1 dataset lines and the resulting aggregate of 743107942 distinct keys was inserted in parallel by 5 threads. Note that the total btree size of 36.9 GB exceeds memory capacity.

threads2h tst.dbsuper w 12 38000 8 1 skew1_1 skew2_1 skew3_1 skew4_1 skew5_1

| program | buffer params | btree | real/user/system | pool/page/mmap/notes |
|---|---|---|---|---|
| threads2h | 12 38000 8 | 36.9GB | 30:57/56:38/10:27 | 38GB/4K/1MB/5 threads |
| threads2i | 12 38000 8 | 36.9GB | 35:25/54:21/10:17 | 38GB/4K/1MB/5 threads |
| threads2j | 12 38000 8 | 36.9GB | 30:12/54:55/9:55 | 38GB/4K/1MB/5 threads |
| btree2t | 12 38000 8 | 36.9GB | 24:50/47:19/8:27 | 38GB/4K/1MB/5 processes |
| btree2s | 12 38000 8 | 36.9GB | 80:18/44:15/237:10 | 38GB/4K/1MB/5 processes |
| btree2u | 12 1000000 | 31.3GB | 21:57/47:29/8:37 | 4GB/4K/na/5 processes |

The keys from this dataset were then deleted from the btree to return the tree to an empty condition with no keys, and all deleted nodes chained together on a free list.

threads2h tst.dbsuper d 12 38000 8 1 skew1_1 skew2_1 skew3_1 skew4_1 skew5_1

| program | buffer params | btree | real/user/system | pool/page/mmap/notes |
|---|---|---|---|---|
| threads2h | 12 38000 8 | 36.9GB | 38:26/53:12/5:35 | 38GB/4K/1M/5 threads |
| threads2i | 12 38000 8 | 36.9GB | 47:36/52:04/8:36 | 38GB/4K/1M/5 threads |
| threads2j | 12 38000 8 | 36.9GB | 50:19/47:23/4:20 | 38GB/4K/1M/5 threads |

Next, the 28772169 distinct keys in the distinct_1 dataset, with unique line numbers appended across 5 insert passes for a total of 143860845 unique keys, were inserted in parallel by 5 threads with the following results:

threads2h tst.dis w 12 8192 8 100000000 distinct_1 distinct_1 distinct_1 distinct_1 distinct_1

| program | buffer params | btree size | elapsed/user/system | notes |
|---|---|---|---|---|
| btree2t | 12 8192 8 | 7.7GB | 6:32/9:35/3:20 | 8GB pool, 4K pages |
| threads2h | 12 8192 8 | 7.7GB | 6:45/14:11/4:41 | 8GB pool, 4K pages |
| btree2u | 12 128000 | 6.4GB | 7:07/10:50/8:45 | 512MB pool, 4K pages |

To test the raw reading metrics, the distinct_1 data set was indexed with no key modifications, and the resulting btree was searched with the skew data sets. Construction times are given here:

threads2h tst.dis w 14 8192 4 0 distinct_1
| program | buffer params | btree | elapsed | user | system | pool | page | mmap | threads |
|---|---|---|---|---|---|---|---|---|---|
| threads2i | 14 8192 4 | 1.1GB | 1:40 | 1:33 | 00:03 | 2GB | 16K | 262K | 1 |
| threads2h | 14 8192 4 | 1.1GB | 1:46 | 1:41 | 00:03 | 2GB | 16K | 262K | 1 |
| btree2u | 14 8192 | 920MB | 1:47 | 1:13 | 00:33 | 128MB | 4K | | 1 |

with search times given here:

threads2h tst.dis f 14 8192 4 0 skew1_1 skew2_1 skew3_1 skew4_1 skew5_1
| program | buffer params | btree | elapsed | user | system | pool | page | mmap | threads |
|---|---|---|---|---|---|---|---|---|---|
| threads2h | 14 8K 4 | 1.1GB | 9:14 | 32:20 | 1:16 | 2GB | 16K | 262K | 5 |
| btree2u | 1.2M | 920MB | 7:38 | 25:40 | 2:40 | 128MB | 4K | | 5 |

Finally, a penny sort file of 100M unique keys was split into 10 equal pieces of 1GB each, and indexed by 10 threads:

for file in penny[0-9]; do btree2u tst.dbpenny $file w 12 180000 & done
| program | buffer params | btree | elapsed | user | system | pool | page | mmap | threads |
|---|---|---|---|---|---|---|---|---|---|
| threads2h | 1.2M 8 | 17.4GB | 53:37 | 25:42 | 32:06 | 18GB | 4K | 1MB | 10 |
| btree2u | 11.6M | 17.3GB | 56:56 | 30:14 | 52:02 | 720MB | 4K | | 10 processes |

Contact Information

Please address any bugs, suggestions, or other concerns and feedback to the author, Karl Malbrain, malbrain@cal.berkeley.edu.

Acknowledgement

Goetz Graefe reviewed earlier versions of this work and made several observations which are incorporated.

Implementation Details

The basic idea is that there are five lock types for each node in three independent sets:

| Name | Set | Type | Incompatible With | Description |
|---|---|---|---|---|
| AccessIntent | 1 | Sharable | NodeDelete | The node page number has been obtained and the node will be read. |
| NodeDelete | 1 | Exclusive | AccessIntent | The node page number has been removed from its parent page. Wait until all readers have relinquished AccessIntent. |
| ReadLock | 2 | Sharable | WriteLock | Lock the node for reading. |
| WriteLock | 2 | Exclusive | ReadLock, WriteLock | Lock the node for modification. |
| ParentModification | 3 | Exclusive | ParentModification | Change the parent's keys for the node. |

  • Search: A ReadLock is obtained for the root node. While descending the tree, a ReadLock on a parent is latch-coupled with obtaining AccessIntent on the child node. A ReadLock (or WriteLock for insert/delete at the target level) on the child node is latch-coupled with AccessIntent. The search ends with either a ReadLock or WriteLock on the target node.

  • Insert: A key is added to a node under a WriteLock obtained during Search. The same insert module is used for both leaf and branch nodes.

  • Split: The ParentModification lock is obtained over the existing WriteLock (which stalls if a previous split or consolidation for this node is still underway), half the contents are split into a new node (which is not locked), and the WriteLock on the node is released. After the fence keys for the two siblings have been posted to the parent node(s), the ParentModification lock is released.

  • Delete: A key is removed from a node under a WriteLock obtained during Search. The same delete module is used for both leaf and branch nodes.

  • Consolidation of an empty node: A WriteLock is obtained for the right sibling node, its contents are copied into the empty left node under the existing WriteLock, and the right sibling's kill bit is set. The right sibling's link field is set to point to the left node, which will direct subsequent searches to the left node. A ParentModification lock is latch-coupled with the WriteLock for the left node (which will stall if another ParentModification is underway), and the WriteLock on the right node is also released. The old left node fence key is removed from its parent, and the old right node fence key is updated in its parent to point to the left node. The ParentModification lock on the left node is released. A NodeDelete lock and WriteLock are obtained for the right node. There are now no pointers to and no readers of the right node, and it is placed onto a freelist for use in subsequent splits. The NodeDelete lock and WriteLock are then released on the now free node.

The bottom line is that structure modifications to the upper tree levels are serialized, while the node itself remains unlocked for subsequent reads and writes as the structure modification proceeds. Except for consolidation (and during latch-coupling), only one node is locked at a time.

Latch Coupling

At several points in the implementation, latch coupling is used to acquire a new lock on a node and release another. During splits and consolidations this overlap is extended to include additional node processing before the previous lock is released.
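
A descent step under this protocol might be coded as below; the lock helpers are hypothetical stubs (declared but not implemented here), and the point is only the ordering: the new lock is always acquired before the old one is released.

```c
/* Hypothetical helpers standing in for the real latch calls. */
typedef unsigned long long PageNo;

void lock_access_intent (PageNo page);
void unlock_access_intent (PageNo page);
void lock_read (PageNo page);
void unlock_read (PageNo page);
PageNo find_child (PageNo parent, const unsigned char *key, unsigned len);

/* One latch-coupled step of a read-only descent: the parent arrives here
   already held under ReadLock and is released only after the child is
   protected, so there is never an unlatched window between levels. */
static PageNo descend_one_level (PageNo parent, const unsigned char *key, unsigned len)
{
    PageNo child = find_child (parent, key, len);

    lock_access_intent (child);    /* couple: AccessIntent on the child ...  */
    unlock_read (parent);          /* ... before the parent's ReadLock goes  */

    lock_read (child);             /* couple again: ReadLock on the child ...*/
    unlock_access_intent (child);  /* ... then drop the AccessIntent         */

    return child;
}
```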

Lock Sets

The three lock sets each cover only a part of each node. Thus it is possible to obtain a ParentModification lock over a WriteLock and release the WriteLock later while continuing to hold the ParentModification lock. The locks in each set are independent of the locks in the other two sets on a given node. They are not exactly equivalent to traditional "lock modes", because it is possible to hold two locks on a node simultaneously, one from each set.

Compatibility Matrix

A yes entry means that two locks of the types specified by the row and column labels can be held simultaneously on a node by two different btree operations; in other words, the two lock types do not conflict. A no entry means that the two lock types cannot be held concurrently on a node by distinct operations; the lock types conflict.

| | AI | ND | RL | WL | PM |
|---|---|---|---|---|---|
| AccessIntent | Yes | No | Yes | Yes | Yes |
| NodeDelete | No | No | Yes | Yes | Yes |
| ReadLock | Yes | Yes | Yes | No | Yes |
| WriteLock | Yes | Yes | No | No | Yes |
| ParentModification | Yes | Yes | Yes | Yes | No |
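
For reference, the same matrix expressed as a C table that a latch manager could consult (1 = compatible, 0 = conflict); this encoding is illustrative.

```c
/* Lock compatibility: 1 = the two lock types may be held simultaneously
   by different operations, 0 = they conflict. Order: AI, ND, RL, WL, PM. */
enum { AI, ND, RL, WL, PM, LOCK_TYPES };

static const unsigned char lock_compatible[LOCK_TYPES][LOCK_TYPES] = {
              /*  AI  ND  RL  WL  PM */
    /* AI */   {  1,  0,  1,  1,  1 },
    /* ND */   {  0,  0,  1,  1,  1 },
    /* RL */   {  1,  1,  1,  0,  1 },
    /* WL */   {  1,  1,  0,  0,  1 },
    /* PM */   {  1,  1,  1,  1,  0 },
};
```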

Costs and Mitigations

The extra cost in this protocol is the need to obtain two node locks at each level below the root during Searches, instead of the one lock in Jaluta's method. This is mitigated by not having to perform latch upgrades during inserts or deletes. Note that the multiple-thread versions, and the latch manager installed from btree2t onwards, reduce latch costs to virtually zero.

Code API

BtDb *bt_open (char *name, uint mode, uint bits,  uint map_pool_segments, uint pool_segment_size)

Create the btree object BtDb and open the file under the given access mode (BT_ro for read only, BT_rw for read and write, BT_fl to flush writes to disk). bits is the btree page size in bits (usually 12 to 16). map_pool_segments is the number of segments to be cached under memory mapping, or zero if file-io is to be used instead of mmap(). pool_segment_size is the number of pages in a pool segment, given in bits (powers of 2). Note that bits will be taken from an existing btree file, with the passed parameter ignored.

void bt_close (BtDb *bt)

Close the btree file and release the btree object.

BTERR bt_deletekey (BtDb *bt, uchar *key, uint len, uint lvl)

Delete the specified key from the btree at the specified level (lvl = 0 for leaf key). This function is called recursively to remove keys from levels higher in the tree than the leaves. Returns BTERR_ok for success, or else an error.

uint bt_findkey (BtDb *bt, uchar *key, uint len)

Find the given key in the btree, cache the btree page into bt->cursor, and return the page slot number of the key, or zero if it is not in the btree. Zero is also returned if an error occurs, with the error code in bt->err.

BTERR bt_insertkey (BtDb *bt, uchar *key, uint len, uint lvl, uid id, uint tod)

Insert/Replace the given key in the btree at the given lvl (normally zero) with the given row-id and time-of-day. This function is called recursively to insert fence values into upper levels of the btree.

uint bt_startkey (BtDb *bt, uchar *key, uint len)

Find the first key greater than or equal to the given key, cache the btree page into bt->cursor, and return the slot number, or zero if an error occurs.

uint bt_nextkey (BtDb *bt, uint slot)

Return the next slot in the cached page at bt->cursor, or move to the next page to the right. Return zero if an error occurs. Note that the highest key in the btree is the two byte value 0xffff.

BtKey bt_key (BtDb *bt, uint slot)

Return the BtKey structure pointer for the given key slot on the cached bt->cursor page.

uid bt_uid (BtDb *bt, uint slot)

Return the 48 bit row-id for the given key slot on the cached bt->cursor page.

uint bt_tod (BtDb *bt, uint slot)

Return the tod value for the given key slot on the cached bt->cursor page.
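
A minimal usage sketch of the calls above, assuming the project's header and types are available (the header name btree.h is a placeholder) and with error handling abbreviated:

```c
#include <stdio.h>
#include <string.h>
#include "btree.h"   /* placeholder for the project header declaring BtDb and the bt_* calls */

int main (void)
{
    /* 4KB pages (12 bits), 8192 mapped segments of 2^4 = 16 pages each */
    BtDb *bt = bt_open ("tst.db", BT_rw, 12, 8192, 4);
    unsigned char key[] = "sample";

    if (!bt)
        return 1;

    /* insert at the leaf level (lvl 0) with an arbitrary row-id and time-of-day */
    if (bt_insertkey (bt, key, strlen ((char *)key), 0, 1, 0))
        fprintf (stderr, "insert error %d\n", bt->err);

    if (bt_findkey (bt, key, strlen ((char *)key)))
        printf ("found %s\n", key);

    bt_close (bt);
    return 0;
}
```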