forked from BohuTANG/nessDB
-
Notifications
You must be signed in to change notification settings - Fork 0
/
nessDB.txt
278 lines (237 loc) · 11.4 KB
/
nessDB.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
================================================================
nessDB with LSM-Tree
Copyright (C) 2012 BohuTANG________________
_____________________________ __ \__ __ )
__ __ \ _ \_ ___/_ ___/_ / / /_ __ |
_ / / / __/(__ )_(__ )_ /_/ /_ /_/ /
/_/ /_/\___//____/ /____/ /_____/ /_____/
================================================================
nessDB is a fast Key-Value database, supports Redis-Protocol(PING,SET,MSET,GET,MGET,DEL,EXISTS,INFO).
Which is written in ANSI C with BSD LICENSE and works in most POSIX systems without external dependencies.
OVERALL LAYOUT
==============
+------------------------------------+
| nessDB architecture |
+------------------------------------+
+--------------------------------+ +---------------------------------+
| memtable | | background-merge |
|--------------------------------|------>|---------------------------------|
| key-sorted memory table | |background merge detached thread |
+--------------+-----------------+ +---------------------------------+
|
|
+--------------+-----------------+
| meta list |
|--------------------------------|
| key-range informtation(memory) |
+--------------+-----------------+
|
|
+--------------+-----------------+
| on-disk sst |
|--------------------------------|
| key-sorted disk index files |
+--------------+-----------------+
|
|
+--------------+------------------+
| on-disk db |
|---------------------------------|
| disk database file |
+---------------------------------+
+--------------+------------------+
| on-disk log |
|---------------------------------|
| disk log file |
+---------------------------------+
nessDB has three kinds file on disk, there are: '*.sst', '*.db' and '*.log'
'*.sst' is index file, all keys are sorted and all indexes file don't contain overlapping keys
'*.db' is data file
'*.log' is data recovery file
1) memtable(in memory)
======================
It is a key sorted 'LSM-Tree' in memory and active for ADD/UPDATE/DELETE operations
(skiplist.c)
'memtable' structured:
+--------+--------------+-------------+
| key | value-offset | operation |
+--------+--------------+-------------+
'key' is entry key(is distint,or will be covered)
'value-offset' is the offset of one entry data which stored in the 'db' file
'operation' is 'ADD' or 'DEL' action.
2) meta(in memory)
==================
'meta' is key-range lists in memory, meta lists are formated as:
+---------------+-----------+---------------+--------------------+
| begin key | end key | sst file name | sequential number |
+---------------+-----------+---------------+--------------------+
| ... all the other items ... |
+---------------+-----------+---------------+--------------------+
3) sst(on-disk)
================
'*.sst' is on-disk index file(key is sorted), the max number of one '*.sst' file is limited by 'SST_MAX' macro,
which is defined in 'config.h', default count is 25000.
a '*.sst' index file layout:
+------------------------------------------+
| sst block 1 |
+------------------------------------------+
| sst block 2 |
+------------------------------------------+
| ... more sst blocks ... |
+------------------------------------------+
| sst block N |
+------------------------------------------+
| FOOTER |
+------------------------------------------+
'sst block' structured:
+--------+--------------------------+
| key | value-offset(big-endian) |
+--------+--------------------------+
'key' is entry key(is distint,or will be covered)
'value-offset' is the offset of one entry data which stored in the 'db' file
NOTICE: all 'sst block' operations are 'ADD'('DEL' operation entries are filtered when written to index file)
'FOOTER' structured:
+-------------+-------+------+
| last-key | count | crc |
+-------------+-------+------+
'last-key' is the last one key of current sorted index file, due to as meta lists end-key.
'count' is all entries's count of current sorted index file
'crc' is used to verify the current index file
4) db(on-disk)
=============
It is the 'value' storage file.
'db' structured as follow:
+-----------------------------------+
| magic number |
+-----------------+-----------------+
| value 1 length | value 1 |
+-----------------+-----------------+
| ... all the others ... |
+-----------------+-----------------+
| value N length | value N |
+-----------------+-----------------+
'magic number' is an integer for datas not start from '0' offset
5) log(on-disk)
===============
'*.log' is prepared for data recovery when storage engine shutdown than expected.
Each memtable is mapping one log file on disk. Actually, there are at most two 'log' files exists:
one is merging memtable's(immutable, read only), and another is current active memtable's.
When background detached-thread is done, the merging memtable's log will be deleted immediately.
'log' structured:
+---------------+----------+-------------+------------+
| key 1 length | key 1 | data offset | operation |
+---------------+----------+-------------+------------+
| ... all the others ... |
+---------------+----------+-------------+------------+
| key N length | key N | data offset | operation |
+---------------+----------+-------------+------------+
When nessDB reopen again, it checks is some '*.log' files exists
if exists, recovery it(read log file and add entry to current active memtable)
6) background-merge
===================
+-------------------------+ +---------------------------+
| active memtable is full |----->| become immutable memtable |
+----------+--------------+ +------------+--------------+
|
|
v
+---------------------------------+
| start background merging thread |
+---------------------------------+
This is a detached-thread to merge immutable memtable(full) entries to on-disk file(*.sst)
All of the above is about nessDB's architecture and main structure layout.
Now, let's show the processes of 'WRITE' and 'READ', 'DB-OPEN'.
WRITE PROCESS
=============
+----------------------+ +-------------------------------+ +--------------------------+
---->| write to log first |---->| write to active memtable |-->| create new memtable |---> add to the memtable
+----------------------+ +-------------------------------+ +--------------------------+
|
| if active is full(memtable becomes immutable)
v
+-------------------------------+
| create one thread to do merge |
+-------------------------------+
|
|
v
+-------------------------------+
|delete merging memtable's log |
+-------------------------------+
|
v
+-------------------------------+
| exit the merging thread |
+-------------------------------+
Summarize:
since merging is on background thread, so can't block the front-end writing,
and nessDB has read/write-locking, so concurrent writing/reading is allowed on the same index file.
READ PROCESS
=============
+----------------------------+
+----->| lookup from active metable |
^ +-----------+----------------+
| |
| exists | not exists
| +------+------------------------+
| | |
| v v
| +------------------------+ +-----------------------+
| | read data from db file | | get data from LLRU |
| +------------------------+ +-----------------------+
| exists | not exists
|<-------------------------------------------+------------------+
| |
| +---------------v--------------------+
| | get index file info from meta list |
| +---------------+--------------------+
| |
| |
| +---------------v--------------------+
| | a)read data offset from index file |
| +---------------+--------------------+
| |
| |
| +---------------v--------------------+
| | b)read data length from index file |
| +---------------+--------------------+
| |
| |
| +---------------v--------------------+
| | c)read data from db file |
| +---------------+--------------------+
| |
\---------------------------------------------------------------/
DB-OPEN PROCESS
===============
+---------------+
| db open |
+-------+-------+
|
|
v
+---------------------------------+
| read footer info from sst files |
+--------------+------------------+
|
|
v
+---------------------------------+
| create meta list |
+--------------+------------------+
|
|
v
+---------------------------------+
| recovery data from log file |
+--------------+------------------+
|
|
v
+----------------------------------+
| recovery log entry to memtable |
+----------------------------------+
nessDB?
=======
I think coding this database storage engine is without end, this is the reason why it is named 'nessDB'.
It is ness.