// Copyright 2014 The Cockroach Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
// implied. See the License for the specific language governing
// permissions and limitations under the License.
/*
Package engine provides low-level storage. It interacts with storage
backends (e.g. LevelDB, RocksDB, etc.) via the Engine interface. At
one level higher, MVCC provides multi-version concurrency control
capability on top of an Engine instance.

The Engine interface provides an API for key-value stores. InMem
implements an in-memory engine using a sorted map. RocksDB implements
an engine for data stored to local disk using RocksDB, a fork of
LevelDB.

MVCC provides a multi-version concurrency control system on top of an
engine. MVCC is the basis for Cockroach's support for distributed
transactions. It is intended for direct use from storage.Range
objects.

Notes on MVCC architecture

Each MVCC value contains a metadata key/value pair and one or more
version key/value pairs. The MVCC metadata key is the actual key for
the value, using the util/encoding.EncodeBytes scheme. The MVCC
metadata value is of type MVCCMetadata and contains the most recent
version timestamp and an optional roachpb.Transaction message. If the
transaction is set, the most recent version of the MVCC value is a
transactional "intent". The metadata also records the size of the most
recent version's key and value for efficient stat counter
computations. Note that it is not necessary to explicitly store the
MVCC metadata, as its contents can be reconstructed from the most
recent versioned value as long as an intent is not present. The
implementation takes advantage of this and deletes the MVCC metadata
when possible.

Each MVCC version key/value pair has a key which is also
binary-encoded, but is suffixed with a decreasing, big-endian encoding
of the timestamp: eight bytes for the nanosecond wall time, followed
by four bytes for the logical time. Metadata key/value pairs carry no
such suffix; their timestamp is implicit. The MVCC version value is
a message of type roachpb.Value. A deletion is indicated by an
empty value. Note that an empty roachpb.Value will encode to
a non-empty byte slice. The decreasing encoding on the timestamp sorts
the most recent version directly after the metadata key, which is
treated specially by the RocksDB comparator (by making the zero
timestamp sort first). This increases the likelihood that an
Engine.Get() of the MVCC metadata will load the same block that holds
the most recent version, even if there are many versions. We rely on
getting the MVCC metadata key/value and then using it to directly get
the MVCC version using the metadata's most recent version timestamp.
This avoids using an expensive merge iterator to scan the most recent
version. It also allows us to leverage RocksDB's bloom filters.
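
The ordering described above can be illustrated with a minimal sketch.
It is not the engine's actual encoder: encodeVersionSuffix is a
hypothetical helper, the wall time is treated as an unsigned integer
for simplicity, and "decreasing" is assumed to mean that every bit of
the big-endian encoding is inverted. Plain bytewise comparison stands
in for the RocksDB comparator.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// encodeVersionSuffix is a hypothetical helper for this sketch. It appends
// the wall time (8 bytes) and the logical time (4 bytes) in big-endian
// order with all bits inverted, so larger timestamps yield smaller bytes.
func encodeVersionSuffix(key []byte, wallTime uint64, logical uint32) []byte {
	var buf [12]byte
	binary.BigEndian.PutUint64(buf[:8], ^wallTime)
	binary.BigEndian.PutUint32(buf[8:], ^logical)
	out := make([]byte, 0, len(key)+len(buf))
	out = append(out, key...)
	return append(out, buf[:]...)
}

func main() {
	// An already-encoded key; it doubles as the metadata key (no suffix).
	key := []byte("k\x00\x01")
	older := encodeVersionSuffix(key, 100, 0)
	newer := encodeVersionSuffix(key, 200, 0)

	// The metadata key sorts before every version of the key.
	fmt.Println(bytes.Compare(key, newer) < 0)
	// The newest version sorts directly after it, ahead of older versions.
	fmt.Println(bytes.Compare(newer, older) < 0)
}
```

Because the newest version is the first entry after the metadata key,
a point lookup of the metadata followed by a point lookup of that
version tends to touch neighbouring data.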

The binary encoding used on the MVCC keys allows arbitrary keys to be
stored in the map (no restrictions on intermediate nul bytes, for
example), while still sorting lexicographically and guaranteeing that
all timestamp-suffixed MVCC version keys sort consecutively with the
metadata key. We use an escape-based encoding which transforms all nul
("\x00") characters in the key and is terminated with the sequence
"\x00\x01", which is guaranteed to not occur elsewhere in the encoded
value. See util/encoding/encoding.go for more details.
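
As a rough illustration of the escaping scheme, here is a simplified
sketch. It is not the util/encoding implementation: encodeBytes is a
stand-in written for this example, and the choice of "\x00\xff" as the
replacement for each nul byte is an assumption (the text above only
fixes the "\x00\x01" terminator).

```go
package main

import (
	"bytes"
	"fmt"
)

// encodeBytes is a simplified sketch of an escape-based key encoding.
// Assumption: each "\x00" in the input becomes "\x00\xff". The result is
// terminated with "\x00\x01", which therefore cannot occur earlier in it.
func encodeBytes(key []byte) []byte {
	out := make([]byte, 0, len(key)+2)
	for _, b := range key {
		out = append(out, b)
		if b == 0x00 {
			out = append(out, 0xff)
		}
	}
	return append(out, 0x00, 0x01)
}

func main() {
	a := encodeBytes([]byte("a"))
	a0 := encodeBytes([]byte("a\x00"))
	ab := encodeBytes([]byte("ab"))
	fmt.Printf("%q %q %q\n", a, a0, ab)

	// The order of the raw keys ("a" < "a\x00" < "ab") is preserved.
	fmt.Println(bytes.Compare(a, a0) < 0, bytes.Compare(a0, ab) < 0)

	// No encoded key is a prefix of another, so suffixed versions of "a"
	// cannot interleave with the keys of "a\x00".
	fmt.Println(bytes.HasPrefix(a0, a))
}
```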

We considered inlining the most recent MVCC version in the
MVCCMetadata. This would reduce the storage overhead of storing the
same key twice (which is small due to block compression), and the
runtime overhead of two separate DB lookups. On the other hand, all
writes that create a new version of an existing key would incur a
double write as the previous value is moved out of the MVCCMetadata
into its versioned key. Preliminary benchmarks have not shown enough
performance improvement to justify this change, although we may
revisit this decision if it turns out that multiple versions of the
same key are rare in practice.

However, we do allow inlining in order to use the MVCC interface to
store non-versioned values. Not everything that Cockroach needs to
store can be stored efficiently, or at all, using MVCC versioning.
Examples include transaction records, response cache entries, stats
counters, time series data, and system-local config values. Supporting
a mix of encodings would add considerable complexity, so Cockroach
treats an MVCC timestamp of zero to mean an
inlined, non-versioned value. These values are replaced if they exist
on a Put operation and are cleared from the engine on a delete.
Importantly, zero-timestamped MVCC values may be merged, as is
necessary for stats counters and time series data.
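
The difference between inlined and versioned writes can be sketched
with a toy in-memory model. Everything here (timestamp, entry, store,
put, del) is hypothetical and exists only for illustration. The sketch
assumes a given key is either inline or versioned, never both, and it
does not model merging.

```go
package main

import "fmt"

// timestamp, entry and store are hypothetical types for this sketch only.
type timestamp struct {
	wallTime int64
	logical  int32
}

type entry struct {
	inline   []byte               // value written at timestamp zero
	versions map[timestamp][]byte // versioned values, keyed by timestamp
}

type store map[string]*entry

// put writes an inline value when ts is zero and a new version otherwise.
func (s store) put(key string, ts timestamp, value []byte) {
	e := s[key]
	if e == nil {
		e = &entry{versions: map[timestamp][]byte{}}
		s[key] = e
	}
	if ts == (timestamp{}) {
		e.inline = value // replaces any existing inline value
		return
	}
	e.versions[ts] = value // older versions are kept
}

// del clears an inline value outright; for a versioned key it writes an
// empty value as a deletion marker instead.
func (s store) del(key string, ts timestamp) {
	if ts == (timestamp{}) {
		delete(s, key)
		return
	}
	s.put(key, ts, nil)
}

func main() {
	s := store{}
	zero := timestamp{}
	s.put("stats", zero, []byte("1"))
	s.put("stats", zero, []byte("2")) // overwrites "1"
	s.put("user", timestamp{wallTime: 10}, []byte("a"))
	s.put("user", timestamp{wallTime: 20}, []byte("b")) // keeps "a"
	fmt.Printf("%s %d\n", s["stats"].inline, len(s["user"].versions))

	s.del("stats", zero)
	s.del("user", timestamp{wallTime: 30})
	_, ok := s["stats"]
	fmt.Println(ok, len(s["user"].versions))
}
```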
*/
package engine