I Am Legend:
Items are sorted by priority (highest on top).
o a pending TODO item (for the current release)
. a pending TODO item (for future releases)
x a finished TODO item
==============================================================================
------------- hamsterdb pro 2.1.9 ------------------------------------
x documentation rewrite
x use github wiki, remove old cruft
x then create the initial repository:
https://github.com/blog/699-making-github-more-open-git-backed-wikis
x /introduction
x /evaluate & benchmark
x /faq (merge w/ performance)
x /tutorial
... (existing stuff)
x /pro
x /hola
x API pages
overview (description, status)
installation
usage (compiling, linking)
api functions (link to doxygen)
samples
x c
x c++
x java
x dotnet
x erlang
x python
x protobuf
x have the hamsterdb/documentation directory "link" to the git project
x get rid of HAM_API_REVISION, move HAM_VERSION_* to the header file, or
move version.h to include/ham
x refactoring: the ByteArray should be non-copyable, or at least
clearly define the copying semantics regarding ownership. Also
review its uses, try to figure out how it can be improved. Should it be a
DynamicArray<uint8_t>?
x Zint32: still has a few bugs when running unittests because
index->used_size is miscalculated
gdb --args ./test Zint*
b zint32.cpp:66
b hamsterdb::Zint32::Zint32KeyList::copy_to
x then once more check the monster tests
x remove dependency to malloc.h, use stdlib.h instead
x add patches from Thomas Fähnle
x refactoring: clean up the whole Database code; it's too much and too
complex; there must be an absolutely clear border between Btree and
Transaction related code.
-> should we move the transactional implementations to their own
database class, i.e. TransactionalDatabase?
x LocalDatabase::cursor_get_duplicate_position: delegate to Cursor class
x when to call cursor->set_lastop(Cursor::kLookupOrInsert)?
x LocalDatabase::cursor_insert -> calls LocalDatabase::insert
x remove BtreeCursor::insert, it does not have relevant code
x LocalDatabase::cursor_erase -> calls LocalDatabase::erase
x remove BtreeCursor::erase
x remove TransactionCursor::erase
x LocalDatabase::cursor_find -> calls LocalDatabase::find
x update BtreeCursor::find
x a lot of functions have to clear the changeset when they leave;
use ON_EXIT_SCOPE macro for this!
x run performance tests
x run monster tests
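The ON_EXIT_SCOPE idea above — clearing the changeset on every exit path — is a classic scope guard. A minimal sketch of such a macro (hypothetical names, not hamsterdb's actual implementation) could look like this:

```cpp
#include <utility>

// Minimal scope guard: runs the stored callable when the guard leaves
// scope, on both normal return and exception unwinding.
template<typename F>
class ScopeGuard {
  F fn_;
  bool active_;
 public:
  explicit ScopeGuard(F fn) : fn_(std::move(fn)), active_(true) {}
  ScopeGuard(ScopeGuard&& o) : fn_(std::move(o.fn_)), active_(o.active_) {
    o.active_ = false; // a moved-from guard must not fire
  }
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;
  ~ScopeGuard() { if (active_) fn_(); }
};

template<typename F>
ScopeGuard<F> make_scope_guard(F fn) { return ScopeGuard<F>(std::move(fn)); }

// Token pasting gives every guard in a scope a unique variable name.
#define ON_EXIT_CONCAT2(a, b) a##b
#define ON_EXIT_CONCAT(a, b) ON_EXIT_CONCAT2(a, b)
#define ON_EXIT_SCOPE(code) \
  auto ON_EXIT_CONCAT(_on_exit_, __LINE__) = make_scope_guard([&] { code; })
```

Because the cleanup also runs during stack unwinding, every early `return` and every thrown exception clears the changeset without repeating the call.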
x #42: approx. matching: cursor returns wrong key when using transactions
x create a unittest to reproduce this
x the fix should be that ham_cursor_find calls ham_db_find(..., cursor)
x cherry-pick the following commits on the master branch
1447ba4eb217532e8fb49c4a84a0dc3b982a3ffe
6a8dd20ec9bd2ec718d1136db7667e0e58911003
x Thomas Fähnle: create new parameter for fadvise configuration
x implement for file-based (w and w/o memory mapping) devices
x need to check for the function in configure.ac
x return flag with ham_env_get_parameter (not documented)
x create a unittest to make sure that the parameter is not persisted
x create a new parameter for ham_bench
x A few things to refactor in the btree
x the BtreeAction classes do not need a Transaction pointer as an argument!
x actually NO part of the Btree should EVER access a Transaction...
x btree_visitor.h has a structure called ScanVisitor, and BtreeVisitor is
in btree_index.h. Can both be combined?
x Introduce new option for record numbers (HAM_RECORD_NUMBER32,
HAM_RECORD_NUMBER64), deprecate the old one. Use 32bit whenever possible!
x introduce a new flag; the old one uses 64bit. all tests must work
x implement (local)
x implement (server)
x implement (remote)
x verify with ham_dump/ham_info
x add unittests
x re-use existing tests
x also for remote!
x check for overflows!
x add to ham_bench
x fix documentation
x samples
x header file
x implement Cursor::get_record_size for remote access
x issue43: fix segfault
it seems that a node is removed which is not in the rb-tree. the node
used to be there, but got dropped when another node was removed.
x Remove requirement for "delete stability" when erasing keys
x Introduce generic BtreeUpdateAction which handles merges and splits
x when erasing a key: also expect splits; in case of a split the
BtreeEraseAction has to retry the operation - either with the old
or the new node!
x hamsterdb-erlang: add support for HAM_RECORD_NUMBER32
x Zint32: add compression based on bit-packing
-> https://github.com/lemire/simdcomp/blob/master/include/simdintegratedbitpacking.h
x add parameter to header file
x support in ham_bench
x update IndexFactory for the new KeyList
x move block/index related functionality to a base class BlockKeyList
(w/ template parameter for the block structure)
x integrate Daniel's sources as a library
x add a unittest to compress/decompress
x implement the compression
x don't use index->used_size, it's not required
x don't use index->block_size; it's always static
x use delta-encoding to have better compression
x fill in the implementation
x add unittests (similar to zint32)
x add monster tests (similar to zint32/varbyte)
x don't calculate bitwidth if index->bits is already maxed out (32)
x there is no vacuumize prior to a split. The vacuumize is cancelled
immediately because it is always "internal".
x don't split blocks in the middle if the new key is appended
x don't split blocks in the middle if the new key is prepended
x test with bulk-erase and deltas of size 1, no records
-> no delete stability!
x erase: only calculate bits of the new delta, not of the whole block
x fix the remaining TODOs
x improve vacuumize_impl() performance by allocating multiple
blocks at once
x rewrite simdmaxbitsd1() to accept a "length" parameter
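The Zint32 items above combine delta encoding with bit-packing: sorted keys are turned into small gaps, and a block then only needs as many bits per gap as the largest gap requires. A simplified sketch of the two halves (hypothetical helper names, not the actual Zint32KeyList code):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Delta-encode a sorted run of 32-bit keys: store the first key verbatim
// and only the gaps between neighbours. Gaps need far fewer bits than
// absolute values, which is what makes bit-packing effective.
std::vector<uint32_t> delta_encode(const std::vector<uint32_t>& keys) {
  std::vector<uint32_t> deltas(keys.size());
  for (size_t i = 0; i < keys.size(); i++)
    deltas[i] = (i == 0) ? keys[0] : keys[i] - keys[i - 1];
  return deltas;
}

// Smallest bit width that can hold every delta (the "index->bits" idea):
// packing n deltas then costs n * width bits instead of n * 32.
int required_bits(const std::vector<uint32_t>& deltas) {
  uint32_t maxv = 0;
  for (size_t i = 1; i < deltas.size(); i++) // skip the stored first key
    maxv = std::max(maxv, deltas[i]);
  int bits = 0;
  while (maxv) { bits++; maxv >>= 1; }
  return bits;
}
```

This also shows why the "erase: only calculate bits of the new delta" item pays off: a single erase changes one gap, so recomputing the width of the whole block is usually unnecessary.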
x Mail from Joel: approx. matching returns wrong key
ham_cursor_find(... HAM_FIND_GEQ_MATCH)
keys "aa", "bb", "cc", find "b": returns "cc" instead of "bb"
-> use existing sources as the basis
-> insert keys as suggested
x Improve test coverage of approx. matching w/ EQC
x need erlang function for approx. matching (ham_db_find w/ flags
-> returns status AND key)
x for all key types (incl. fixed length keys of size 5, 32, 48 and
variable length keys)
x generate keys based on key type/size
x store inserted keys
x verify the result of the find operation
x specify key flags for approx. matching and verify the result
x lt_match
x leq_match
x gt_match
x geq_match
x add inline records of max. size to increase splits
x only btree
x Update wiki documentation/web documentation
x ham_env_create: links to doxygen-reference is broken
x verify other links
x update doxygen documentation to 2.1.9
x improve HAM_PARAM_JOURNAL_SWITCH_THRESHOLD
x implement for ham_env_open
x implement for ham_env_get_parameters
x add unittest for ham_env_create and ham_env_open
x issue45: crash when env is closed
Journal::recover -> __abort_uncommitted_txns -> ham_txn_abort
-> env->get_txn_manager()->abort
crashes because env->m_txn_manager == 0
x reproduce in a unittest
x don't call ham_txn_abort but call txn->abort instead
x Mail from Joel: approx. matching returns wrong key
x increase test coverage A LOT!
x rewrite comparison logic
x reproduce with his test, fix the bug
x fix the regressions (see Joel's mail)
x reproduce crash with a unittest
btree: insert "A"
txn: erase "A"
txn: search "A" w/ GEQ
find_txn then moves to the next node, which is null
segfault when accessing *node
x PRO: add Group Varint-based integer encoding
x add parameter to header file
x support in ham_bench
x update IndexFactory for the new KeyList
x integrate Daniel's sources as a library
x implement the compression
x add unittests (similar to zint32)
x add monster tests (similar to zint32/varbyte)
x don't split blocks in the middle if the new key is appended
x don't split blocks in the middle if the new key is prepended
x tests fail if kMaxGroupVarintsPerBlock is set to 4
x also try larger values
x reduce index size by setting the correct bit widths
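For reference, the Group Varint scheme mentioned above packs four 32-bit integers behind a single tag byte whose four 2-bit fields each encode one value's byte length, so the decoder learns all four lengths from one load. A little-endian sketch (an illustration of the technique, not Daniel's actual library code):

```cpp
#include <cstdint>
#include <cstring>

// Encode four values; returns the number of bytes written (1 tag byte
// plus 1..4 payload bytes per value). Assumes a little-endian target.
size_t group_varint_encode(const uint32_t in[4], uint8_t* out) {
  uint8_t tag = 0;
  size_t pos = 1;
  for (int i = 0; i < 4; i++) {
    uint32_t v = in[i];
    int len = v < (1u << 8) ? 1 : v < (1u << 16) ? 2 : v < (1u << 24) ? 3 : 4;
    tag |= (uint8_t)((len - 1) << (i * 2)); // 2 bits per value
    std::memcpy(out + pos, &v, len);
    pos += len;
  }
  out[0] = tag;
  return pos;
}

// Decode four values; returns the number of bytes consumed.
size_t group_varint_decode(const uint8_t* in, uint32_t out[4]) {
  uint8_t tag = in[0];
  size_t pos = 1;
  for (int i = 0; i < 4; i++) {
    int len = ((tag >> (i * 2)) & 3) + 1;
    uint32_t v = 0;
    std::memcpy(&v, in + pos, len);
    out[i] = v;
    pos += len;
  }
  return pos;
}
```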
o Crash when opening env and Journal is applied
starting w/ key 1104669304804000000, the blob 729496 (and all following
60 blobs on page 720896) are missing (= their blob headers are empty).
-> index already has a key, therefore the key is overwritten during
recovery; but the assigned record is bogus (blob_header is not
initialized)
-> the journal has no changeset
x is the page of that blob created or modified during recovery?
=> no!
-> it's not modified in a changeset
-> the page is modified during recovery, but the affected record
is not (it's nulled out from the beginning)
x was this key the last one that was inserted?
-> no, there are more keys following; about 60 without blob, then
more with blobs
o which steps or failures could lead to this scenario?
1) a bug in the recovery process - no, because the file is already
broken prior to recovery
2) might be a crash during recovery which resulted in an invalid state
x Remove the changeset from the Environment, and see how it goes
x review: every public method in Environment, Database and
TransactionManager creates a context
x remove ALL functions that are not used in hamsterdb.cc from
the public interface!!
-> only 1 left: make txn_manager() private
x LocalDatabase: flushes/clears Context
x improve separation between logic and state; make state immutable and
thread-safe/exception safe. Ultimately, there should be a separation
between...
- mutable database state
btree nodes, blobs...
- immutable database state
static parameters
- operation state (parameters)
insert, erase...
- the code
Immutable state can be accessed concurrently, w/o locking and blocking.
Mutable state is synchronized on the lowest level, with very short locks
x Rewrite the code
- remove dependencies to upper layers, i.e. to the Environment
- change function names, if necessary
- move implementation to Impl namespace and separate file
- separate state from the code
1. Struct Impl::XXX -> has State
2. Struct XXX : public Impl::XXX -> hides State (private)
3. Struct XXXTest(XXX *) -> state() returns State
x PageManager
x Move lsn handling to its own object, clean up
x Cache
x Changeset
x Journal
x remove Journal::set_switch_threshold
x Environment - not virtual, has pimpl for local and remote
implementation
x Environment: has common code, takes care of exceptions etc
x *context_data -> move to LocalEnvironment
x move Mutex handling to env.cc (as much as possible)
x create separate EnvironmentTest class
x RemoteEnvironment
x make all other functions private
x LocalEnvironment
x clean up
x add Test interface
x remove "friend" declaration
x Database
x RemoteDatabase
x LocalDatabase
x rebase against v2!!
x Concurrency: move "purge cache" to background
-> Reorganize the hash table. All buckets are in contiguous memory. This requires
a few allocations, but reduces cache misses because there's no
"jumping" from page to page. Such a bucket can be replaced
atomically with CAS (hazard pointer!) and appended w/o locking. And
it avoids frequent calls to remove_from_list/insert_to_list to push
the used pages to the front. As a consequence, a new purge mechanism
is required to identify "old" pages. These pages then could even be
sorted before they are flushed. (This was tested before w/o any effect,
but maybe back then the databases simply were too small?)
-> The new page "list" will support concurrent reads and writes. This
means there can be multiple threads performing I/O without
blocking each other.
x on a new branch! topic/thread
x Create a generic "PageCollection" class with atomic updates
x use static memory as default, but replace w/ dynamic memory if
capacity is not enough
x need a very fast spinlock for writing
x whenever the array grows:
1) grow into a copy
2) perform CAS with the old pointer
3) old pointer can be discarded after 1 second
x assert that there is only 1 writer at a time (i.e. use a mutex)
x when inserting: set the pointer, then atomically increment the counter
x when deleting: simply set to zero
x never use a lock when reading
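The grow steps above (grow into a copy, CAS the old pointer, discard it only after a grace period) can be sketched roughly as follows, assuming a single writer; all names are hypothetical and the retirement step is only indicated by a comment:

```cpp
#include <atomic>
#include <cstddef>

// Readers only ever load g_current; the single writer builds an enlarged
// copy and publishes it atomically.
struct Array {
  size_t size;
  int* data;
};

std::atomic<Array*> g_current{nullptr};

// Writer side (assumed to hold the writer mutex, as in the notes above).
Array* grow(Array* old, size_t new_size) {
  Array* bigger = new Array{new_size, new int[new_size]()};
  for (size_t i = 0; old && i < old->size; i++)
    bigger->data[i] = old->data[i];        // 1) grow into a copy
  Array* expected = old;
  // 2) With a single writer the CAS cannot fail; it still provides the
  //    release ordering readers need to observe fully initialized data.
  g_current.compare_exchange_strong(expected, bigger);
  // 3) `old` must NOT be deleted here -- in-flight readers may still
  //    hold it. The notes retire it after a grace period (hazard pointer
  //    / "discarded after 1 second").
  return bigger;
}
```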
x rewrite PageCollection
x use a single linked list (intrusive, just like now)
x use a Spinlock for reads *and* writes
. in the future, we can still add a bloom filter to check whether
a page exists or not
x Use PageCollection in the Changeset
x Make sure that extracting the pages is an atomic operation (locked!)
-> PageCollection::for_each(Visitor &visitor);
x Use this in the Changeset
x Use PageCollection for the Page list
x Use PageCollection for the Cache buckets
x must keep track of the *tail* as well!
x if the Cache is thread safe - should the PageCollection also be
thread safe?? - no, not required
x fix the TODOs
x Cache::m_cur_elements no longer required (use totallist.size)
x Cache must be thread-safe
x the PageManager must be thread-safe - no, it's enough if
PageManager::purge_cache is thread-safe (that's the case)
x Create a background thread (use boost::thread)
x make sure that configure.ac checks for boost::thread
x Join thread when closing the Environment
x thread should sleep till it's woken up, either by a new
message or by shutdown
x need a fast double-linked message queue
x wake up thread when new messages arrive
x thread skips all non-mandatory messages when shutting down
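The background-thread items above (sleep until a message arrives or shutdown is requested, wake on new messages, skip pending work on shutdown, join on close) map onto a standard condition-variable worker. A sketch using std::thread rather than the boost::thread mentioned in the list:

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

class Worker {
  std::mutex m_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  bool stop_ = false;
  std::thread thread_; // declared last: started after the members above

 public:
  Worker() : thread_([this] { run(); }) {}

  ~Worker() { // "join thread when closing the Environment"
    { std::lock_guard<std::mutex> l(m_); stop_ = true; }
    cv_.notify_one();
    thread_.join();
  }

  void post(std::function<void()> msg) {
    { std::lock_guard<std::mutex> l(m_); queue_.push_back(std::move(msg)); }
    cv_.notify_one(); // wake the thread when new messages arrive
  }

 private:
  void run() {
    std::unique_lock<std::mutex> l(m_);
    for (;;) {
      // sleeps until woken by a new message or by shutdown
      cv_.wait(l, [this] { return stop_ || !queue_.empty(); });
      if (stop_) return; // skip remaining non-mandatory messages
      auto msg = std::move(queue_.front());
      queue_.pop_front();
      l.unlock();
      msg(); // run the message without holding the lock
      l.lock();
    }
  }
};
```

The "some messages should only exist once" item further down would add a de-duplication check in post() before enqueueing.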
x Run performance tests
The cache evicts too fast, i.e. 2.1.9 page_count_fetched 0 -> 9998088
x --seed=12345 --stop-ops=10000000 --key=uint16
x --seed=12345 --stop-ops=50000 --key=uint16
x --seed=1380279291 --stop-ops=1000000 --distribution=ascending
x --seed=1380279291 --stop-ops=1000000 --distribution=ascending
x --seed=1380279291 --stop-ops=1000000 --key=uint16
x move page flush in background
-> A page is in use when it's added to a Changeset. However, pages that
are fetched read-only are never added. And if recovery is disabled
then it's not possible to figure out when a page is no longer used.
x wrap Changeset::flush and Changeset::clear into
LocalEnvironment::finalize_operation(abort/commit)
x make sure this is called after every operation (exception safe!)
-> use BOOST_SCOPE_EXIT
x every operation stores its state in its own Changeset
x ham_db_find/ham_cursor_find
x unify Database::find and Database::cursor_find,
remove cursor_find
x move creation of temp. txn to Database::find, then call
(virtual) LocalDatabase::find_impl()
x find_impl should not have cursor-specific logic
x LocalDatabase::find will then take care of the changeset
and call finalize()
x when creating a local txn: does it have to be stored in the
cursor?
x merge remote requests
x merge ham_db_erase/ham_cursor_erase
x merge ham_db_insert/ham_cursor_insert
x always add a page to the Changeset, even if in-memory or
recovery is disabled
x when going out of scope: clear the changeset or flush it
(cont below...)
x merge topic/perfect-world with v2
x then remove the threading code from v2; so far no thread is required
x PRO: rebase to v2
x regression: ./ham_bench --use-berkeleydb=true --reopen=true --use-berkeleydb=true --reopen=true --use-transactions=tmp --use-cursors --duplicate=last --fullcheck=find ext_052.tst
x ... then compare PRO vs APL sources and make sure they don't diverge
too much!
x simdcomp-tests sometimes fail. valgrind has a huge amount of errors
x PRO: add more simdcomp related functions for blocks < 128 integers
x update Daniel's code - the repository has new patches (create a fork)
git submodule add git@github.com:cruppstahl/simdcomp.git
x add 'export "C"' directives for c++
x add Makefile.am for Automake
x new function for calculating the delta
x add tests
x send pull requests to Daniel
x PRO: retrieve better statistics regarding compression/overhead/payload
x for each node retrieve these statistics and accumulate them
as min, max, avg:
- number of keys stored
- actually used storage for the keys
- static overhead per page
- overhead per index
- number of indices
- unused storage
x each node has a fill_metrics() method
x extend ham_info
x by default, no new stats are printed
x otherwise call fill_metrics() and print them
x PRO: simdcomp should use binary search for decompressed blocks
(currently uses linear search)
-> see mail from Daniel for more information, code
x Simdcomp: find (line 152)
x Grouped Varint: find (line 157)
x Simdcomp: insert (line 589)
x Grouped Varint: insert (line 570)
x PRO: integrate Nathan's sources, or start a new KeyList
x add parameter to header file
x add sources to 3rdparty
x implement the encoding
x add unittests
x add support for ham_bench
x add monster tests (pro)
x make sure the library and 3rdparty are compiled with -O3 by default
x various issues for refactoring...
x merge cache_impl with cache.h
x PageManagerFactory is no longer required
x PageManager::alloc calls Page::allocate (rename to "alloc"!), which
calls DeviceDisk::read_page. Why? it's enough to assign the address
(unless the storage is mmapped)
x move 4worker/worker.h to 2worker/worker.h
x blob performance improvements
x there's no need to copy the BlobHeader; this can be returned as a pointer
instead (verify! the pointer must not be modified - make it const!)
-> or is it possible that the header is spread over two pages??
x If mmap is enabled then keys and records don't have to be copied as long
as they point into the mapped area!
x Concurrency: move "purge cache" to background (cont...)
x Right now only the Cache uses a mutex. But we need to know
whether a page is in use (= in a changeset) or if it's purged.
x add a spinlock to the page
x PageManager: lock the page after adding it to the ChangeSet
x afterwards: fetch the page if pers == 0
x how to unlock in case of an exception?
x PageManager: unlock the page when removing it from the Changeset
x when writing to disk: acquire the lock; after flushing: set the
pers-pointer to null
x make sure that ONLY page->pers is modified by the separate
thread, not the next/prev pointers!
x periodically remove empty stubs from the list
(or remove them immediately after the page is purged, and the
lock is re-acquired again)
x fix TODO in cache_impl.h
x wake up thread if cache is full
x fix all regressions
x main thread must remove empty pages
x profile mutex contention with
http://0pointer.de/blog/projects/mutrace.html
The cache mutex is a problem. Verify that it's not held while pages
are flushed!
#id 5462574 288644 222691 147887.835 0.027 59.274
x some messages should only exist once, i.e. purging the cache.
no need to add them to the queue multiple times
x use valgrind to check for race conditions etc
x use Spinlock instead of Mutex for the Page
x have a special build mode where Mutex is used instead of Spinlock
(HAM_ENABLE_HELGRIND)
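A spinlock with a helgrind-friendly build mode, as described in the two items above, could be sketched like this (the HAM_ENABLE_HELGRIND macro name is taken from the note; the rest is hypothetical):

```cpp
#include <atomic>
#include <mutex>
#include <thread>

#ifdef HAM_ENABLE_HELGRIND
// Degrade to a plain mutex so helgrind/DRD can track the lock.
using Spinlock = std::mutex;
#else
class Spinlock {
  std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
 public:
  void lock() {
    while (flag_.test_and_set(std::memory_order_acquire))
      std::this_thread::yield(); // back off instead of burning the core
  }
  void unlock() { flag_.clear(std::memory_order_release); }
};
#endif
```

Acquire/release ordering on the flag is what makes data protected by the lock visible to the next owner; a relaxed flag would compile but be wrong.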
x tests with excessive use of pages are terminated by the OOM-killer
because purges are not fast enough
x ./ham_bench --use-berkeleydb=true --reopen=true --key=binary --keysize=64 --pagesize=1024 --recsize=0 --bulk-erase --distribution=ascending --extkey-threshold=20 --recsize=0 --stop-ops=500000
x find_first_reverse always returns the same page!
x check if there's a race condition between worker thread
and PageManager::close_database()
x run monster tests
x run performance tests
x Change copyright to 2015
x all sources
x webpage footer
x also for PRO!
x zint32: stream vbyte: use avx only if it's supported by the cpu!
(note that os.cc currently returns wrong value if __AVX__ is not set; but
we want to check the runtime capabilities and not the compile time setting)
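Checking the runtime capability rather than the compile-time __AVX__ macro, as the item above demands, means asking cpuid — and also confirming via XCR0 that the OS actually saves the YMM registers. A GCC/Clang x86 sketch (not the actual os.cc code):

```cpp
#include <cstdint>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

// Read XCR0 via the xgetbv instruction: bits 1 and 2 confirm the OS
// preserves XMM and YMM state on context switches.
static uint64_t read_xcr0() {
  uint32_t lo, hi;
  __asm__ __volatile__("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
  return ((uint64_t)hi << 32) | lo;
}

// Runtime AVX check: cpuid leaf 1, ECX bit 28 = AVX, bit 27 = OSXSAVE.
// xgetbv is only executed after OSXSAVE is confirmed (it faults otherwise).
bool os_has_avx() {
  unsigned eax, ebx, ecx, edx;
  if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
    return false;
  if (!(ecx & (1u << 28)) || !(ecx & (1u << 27)))
    return false;
  return (read_xcr0() & 0x6) == 0x6;
}
#else
bool os_has_avx() { return false; } // non-x86 or non-GNU toolchain
#endif
```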
x zint32: investigate performance problems of memmove (w/o compression!)
x look into assembler code of memmove; is it using a loop?
- yes it is (sse w/ prefetch)
x track statistics of memmove; how many bytes are moved? also track the
time to get a total MB/SEC figure
- 13.2 gb/sec
x write a micro-benchmark
- no, actually memmove is faster than memcpy. wtf?? check again
with Vtune. Is this a profiling error with valgrind/callgrind? -> no
x zint32: new KeyList for Masked VByte encoding (use StreamVbyte as a
template and compress/uncompress blockwise)
x add unittests
x add monster tests
x Erlang: investigate test failures
x in ham_eqc2 - no longer reproducible
x in test.erl
x PRO: rebase to base/v2
x PRO: zint32 currently does not work if pagesize != 16kb. Document this
limitation and return an error if a different pagesize is used
x PRO: zint32: would it make sense to have a block-based "compression" without
compression?
x PRO: check recovery.pl - extended_tests are failing on PRO because the
error inducer hits too early???
x Sometimes the performance peaks when truncating/growing the file. The device
should therefore allocate more space than required. If the file size is
small (i.e. < 100 pages) then continue as of now. Otherwise truncate
the file for multiple pages (the number should depend on the file size),
and use this spare buffer whenever required.
x when shutting down: remove the spare overhead!
x Refactoring: is Device::alloc(len) really used? - yes
x Refactoring: rewrite more classes to improve the whole structure
x BlobManager: remove dependency to env
pass Config, PageManager, Device during initialization
x HeaderPage - can it be treated like any other page, and moved
to the PageManager/Cache?? - yes, but not yet now
x transform to struct instead of class
x EnvironmentHeader - allow concurrent access from multiple databases
- postponed
x Move find_leaf from btree_index to btree_find
x btree_index.h: rename release() to drop()
x refactoring: the Cursor should be a top-level entity
x introduce a virtual base class, a LocalCursor and a RemoteCursor
x the RemoteCursor manages the remote handle
x rewrite ham_cursor_create
x ham_cursor_clone
x ham_cursor_close
x ham_cursor_get_duplicate_position
x ham_cursor_get_duplicate_count
x ham_cursor_get_record_size
x ham_cursor_overwrite
x fix the constructor call and pass a Database, not a LocalDatabase
(move m_db to derived class?)
x remove "evil cast"
x Currently, the code uses linear search over small ranges. This is not
efficient. If the range is in the cache then binary search is still faster.
x rename find_exact() and find_leaf() to find()
x rename find_child() to find_lower_bound()
x BtreeImplBase simply calls a method of the keylist (find/find_lower_bound)
-> implement it in BaseKeyList, overwrite in PAX list and in the
compressed lists
x PAX: replace lower bound implementation with std::lower_bound
x PAX: replace exact search implementation with std::lower_bound
x run performance tests - not changed
x rebase PRO
x remove linear search and move mixed binary search/linear search to
the PRO PaxKeyList
x run performance comparison
x fix a regression
./ham_bench --use-berkeleydb=true --reopen=true --key=binary --keysize=64 --recsize=0 --bulk-erase --distribution=random --extkey-threshold=20 --recsize=0 --stop-ops=500000 --seed=1430998263
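The search refactoring above boils down to std::lower_bound over a fixed-width key array: find_lower_bound() returns the slot of the first key >= needle, and an exact find() just verifies that slot afterwards. A sketch with hypothetical signatures (the real KeyList interfaces differ):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Slot of the first key >= needle; equals count when all keys are smaller.
// This is also the semantics needed by HAM_FIND_GEQ_MATCH.
int find_lower_bound(const uint32_t* keys, size_t count, uint32_t needle) {
  return (int)(std::lower_bound(keys, keys + count, needle) - keys);
}

// Exact match: lower_bound plus one equality check.
int find_exact(const uint32_t* keys, size_t count, uint32_t needle) {
  int slot = find_lower_bound(keys, count, needle);
  if (slot < (int)count && keys[slot] == needle)
    return slot;
  return -1; // not found
}
```

On a cached range the O(log n) probe beats linear search even for small node sizes, which is the point of the item above.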
o There are performance issues related to cache flushes.
1. if multiple pages are flushed then their locks are held for a long time
2. even if a single page is flushed then it's possible that there's a
significant latency spike when fetching a locked page
A strategy to avoid these spikes would be to create a copy of the page
as soon as a page is fetched but the lock is already acquired.
The original page could be hazard-collected after a while (10 sec), or
forwarded to the worker thread for cleanup (as long as there's just one
thread this will work fine).
What if this (previous) page is in other linked lists?
-> We need a way to swap the data of a page even while it is locked and/or
flushed
x combine mutex, is_dirty and the data pointer to a separate structure
x try to avoid an extra memory allocation
x the mutex only protects this structure!
x the worker thread does not work on pages but on these structures
x don't forget to free the structure, but only if it was allocated
x ... and therefore the main thread can swap them if the page is locked!
x compare performance - looks good, peaks are gone
x fix unittests
===> implemented, but currently disabled. Causes test failures when
running i.e. 1/45.tst with --cache=50 vs berkeleydb
o run unittests with posix mutexes
o run with helgrind
o run monster tests
o set breakpoint if spinlock spins more than 100 times; identify the
bottleneck, try to improve
o release v2 as 2.1.11
o master: reset --hard v2, tag, push
o rebase topic/next to v2; reset v2 to topic/next;
o rebase topic/txn to v2; reset topic/next to topic/txn; remove topic/txn
o rebase pro/topic/next to base/v2
o Concurrency: flush the changeset asynchronously. As soon as the log
is written (and fsync'd), the pages can be flushed to disk in the
background. All pages are locked and forwarded to the worker thread.
As soon as they are unlocked, they can be re-used again.
o we HAVE to make sure that NO OTHER entity is allowed to flush pages,
only the worker thread! otherwise we might have race conditions
between the purger and the changeset-flush
o compare performance (before/after)
o Refactoring: completely decouple the Page from the Device; pass the Device as
a parameter to fetch(), flush()
- or vice versa, call device->flush(page), device->fetch(page)
o Make Pages resizable with variable length
Pages can have different sizes, and are always aligned to the "native"
page size of a Database (with the exception of header pages).
The page IDs are now compressible and have variable length. The first
2 bits are specifying the compression:
00 - Page ID embedded in flags and the following byte,
(2b) Page Size is always 1
max index size: 14 bits - 268 mib
01 - Page ID embedded in flags and the following 2 bytes,
(4b) Page Size: 1 byte
max index size: 22 bits ~ 68 gib
10 - Page ID embedded in flags and the following 3 bytes,
(6b) Page Size: 2 byte
max index size: 30 bits ~ 17 tib
Blob IDs have a similar scheme, but also specify the offset
of the blob in this page.
=====================
TODO these values are wrong!
=====================
00 - Page ID embedded in flags
(3b) Page Size is always 1 16 kib
Blob Offset is 12 bits (4 byte-aligned) 16 kib
max index size: 10 bits 16 mib
01 - Page ID embedded in flags and the following 2 bytes,
(5b) Page Size: 1 byte 4 mib
Blob Offset is 1.5 bytes (4 byte-aligned) 4 mib
max index size: 18 bits 4,2 gib
10 - Page ID embedded in flags and the following 3 bytes,
(6b) Page Size: 1 byte 4 mib
Blob Offset is 1.5 bytes (4 byte-aligned) 4 mib
max index size: 26 bits 1 tib
o deprecate the HAM_PARAM_PAGE_SIZE parameter for the Environment
o use HAM_PARAM_PAGE_SIZE parameter for the Database
o each page needs to be aware of its own size
o create code to read and write the page ID
o page IDs are addressed as uint8_t *, not uint64_t
o create code to read and write the blob ID
o blob IDs are addressed as uint8_t *, not uint64_t
o The InternalRecordList and the DefaultRecordLists need to use blocks
to efficiently manage variable-length IDs
o run monster tests
o if no page size is specified: come up with good default values for
blob pages and indices and leafs
o make blob pages larger than normal pages
o run monster tests
o run performance tests
Instead of splitting pages, they can now be resized easily. This will help
when applying BtreeUpdate. Instead of requiring a split (and a recursive
descent of the tree), the BtreeUpdates are applied to a "bigger" page. The
original page will be obsolete, pointing to the new page, but still contain
valid data (including the BtreeUpdate). As soon as the parent pointer
and the sibling pointers are updated, the original page can be
garbage-collected.
The actual update flow is like this:
1. (the original page is read-locked)
2. a new, larger page is allocated
3. BUs are merged into the new page
4. the old page's state is "obsolete", with a pointer to the new page
5. (the old page's read-lock is released)
All other changes can be handled separately (updating pointers etc).
o a page needs to know if it's obsolete (and by whom); if the PageManager
discovers an obsolete page then it needs to return the new one
o BtreeUpdate/BtreeIndex: the "callers" then have to fix their links!
o as soon as all links are fixed the page is moved to the freelist
- but how do we know if all links are fixed? would need some kind
of reference counting in the old leaf...
o add unittest
o instead of splitting pages: resize them (and append to the file), update
all references if the pages are cached, and discard the old page.
if other pages are not cached: set the overflow pointer of the old
page to the new page, flush it.
o also have the old Page* directly point to the new Page* to speed
up cache accesses
o but perform the split synchronously (if possible) if a certain
watermark is reached
o checkpoint - all tests need to work
o always use constant sizes for the header page(s)
o 1kb for the EnvironmentHeader
o 4kb for the PageManager state
o 11kb for the list of databases
o improve logic in ham_env_open
o remove the dependency on the header page in the PageManager; it now has
its own header page!
o clean up EnvironmentHeader, header page in LocalEnvironment
o clean up BtreeHeader etc
o Do not add read-only pages to the Changeset
There's no need to add pages to the Changeset if they are not modified.
My first thought was that the lock should be replaced with a RW-lock,
and such operations (i.e. get_record_size(), find(), purger etc) acquire a
read-lock. But the current implementation does not require this level of
concurrency. Even if concurrency is implemented on a database-level
then there's no need for RW-locks in the pages (as long as blob pages
are NOT shared between databases!)
o clean up BtreeIndex class; it has too many responsibilities, i.e.
managing configuration, persisting configuration, btree traversal,
btree actions etc
o the configuration of all btrees is moved to a separate class
o the BtreeIndexFactory can create or open btree indices; it manages
the page with the persistent configurations
o the BtreeIndex has a persistent configuration, a runtime configuration
(DatabaseConfiguration) and entry points for btree actions
o create a common root class for BtreeActions with common functions,
i.e. traversal
o BtreeUpdate managed in the BtreeNode
BtreeUpdates are attached to a node. A background thread can merge them.
At runtime they are consolidated, and queries (and other operations)
can work directly on the lists.
A DeltaUpdate can be part of a Transaction.
The existing TransactionOperation structure is replaced. The whole
Transaction handling will be rewritten. Instead of using separate
TransactionTrees, the BtreeUpdates are attached to the node.
The BtreeUpdates form a sorted (!) linked list. Prior to merging, those
BtreeUpdates that delete keys are applied first.
The DeltaUpdate structure should be parameterized, just like the Btree
(i.e. use the identical type and compare function).
Pages are not merged or split as long as BUs are attached. Instead of a
split the page is grown and moved.
The DeltaUpdate structure has a pointer to its Transaction. If the
Transaction is committed or aborted, then the DeltaUpdate state is set to
"committed" (or "aborted"). When merging, aborted BtreeUpdates are removed.
The Transaction pointer is only followed if the state of the Transaction is
not yet clear (i.e. neither committed nor aborted).
For recovery, only a few changes are required. All BtreeUpdates are
written to the journal. Each has its own lsn, and each page stores the
lsn of the last merged BU. Whenever a transaction is committed
(and flushed), all affected pages are flushed. When that is done,
the journal file can be switched just as it is done now.
During recovery, BUs are appended to their pages. No flushes/commits
are necessary. It is also not necessary to re-open the journal.
o create a DeltaUpdate struct
o template parameters for the key and the compare function
o key, record, ...
o the state (committed, aborted)
o pointer to the Transaction
o single "next" pointer for the linked list
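A rough sketch of such a struct, based purely on the field list above (all
names are hypothetical):

```cpp
#include <cstdint>
#include <functional>

struct Transaction; // forward declaration; the real class lives elsewhere

enum DeltaState { kPending, kCommitted, kAborted };

// Hypothetical sketch of the DeltaUpdate structure described above.
// |Key| and |Compare| are template parameters, mirroring the btree.
template <typename Key, typename Compare>
struct DeltaUpdate {
  Key key;
  uint64_t record_id = 0;      // or an inline record; simplified here
  DeltaState state = kPending; // set on commit/abort of the transaction
  Transaction *txn = nullptr;  // only followed while |state| is kPending
  DeltaUpdate *next = nullptr; // single "next" pointer of the sorted list
};
```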
o define an auto-merge strategy
o in the beginning, BUs will always be merged when the node is
accessed for reading or most other operations. This will change
as soon as data can be consolidated on the fly.
o what if a split is required during auto-merge?
- first vacuumize (this happens automatically)
- either call ham_db_insert() to traverse the tree
- or split into a sibling page, insert the page in the leaf but add
the parent link later on (queries must make sure that they check
the sibling node), then fix the parent node later
- or don't split at all, but resize the page. Use logical page IDs
and simply switch the page address of the original page,
then split the page asynchronously
- if the Database is closed: write all BUs to a log and re-apply
them when the database is opened
o Each BtreeNodeProxy manages a (single) linked list of BtreeUpdate
o but also keep track of the tail for fast append operations
o This linked list is always sorted
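A minimal sketch of a sorted singly linked list with a tail pointer, so
that appends of ascending keys (the common case) stay O(1); all names are
hypothetical:

```cpp
#include <cassert>

// Hypothetical sketch: a sorted singly linked list of updates with a tail
// pointer for fast appends of ascending keys.
struct UpdateNode {
  int key = 0;
  UpdateNode *next = nullptr;
};

struct UpdateList {
  UpdateNode *head = nullptr;
  UpdateNode *tail = nullptr;

  void insert(UpdateNode *node) {
    if (!head) {
      head = tail = node;
      return;
    }
    if (node->key >= tail->key) { // fast path: append at the tail
      tail->next = node;
      tail = node;
      return;
    }
    // slow path: walk the list to keep it sorted
    UpdateNode **cur = &head;
    while (*cur && (*cur)->key < node->key)
      cur = &(*cur)->next;
    node->next = *cur;
    *cur = node;
  }
};
```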
o Add a flag to force-flush immediately (disable queueing) - per database!
-> this is a counter which specifies the queue length before the queue
is merged; default is max<int>, 0 disables queueing
o support this in ham_bench
o Merge the updates (naive implementation: apply one by one)
o first apply "delete" operations
o then all others
o ... before flushing the page
o ... before all other operations
o When splitting the page: re-distribute the updates to the
new node
o Don't merge pages as long as delta-updates are attached
o implementation checkpoint - all tests must run
o unittests
o monster tests
o performance tests
o Then consolidate the updates piece by piece
o for lookups
o for inserts
o for erase
o for traversal
o for cursors
o for scans
o ... and with duplicates
o Replace the TransactionOperation with the BtreeUpdate
o requires a doubly linked list of all BUs within this txn?
o txn is committed: set the state of all BUs to "committed"
o txn is aborted: set the state of all BUs to "aborted"
o move the consolidation away from the txn layer towards the BU layer
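The commit/abort propagation could look roughly like this (a sketch with
stand-in types, not the real Transaction class):

```cpp
#include <cassert>
#include <vector>

enum UpdateState { kPending, kCommitted, kAborted };

struct Update {
  UpdateState state = kPending;
};

// Hypothetical sketch: on commit/abort, a transaction walks its attached
// updates and flips their state, so readers never need to consult the
// transaction again once it is resolved.
struct Txn {
  std::vector<Update *> updates; // stand-in for the linked list of BUs

  void commit() {
    for (Update *u : updates)
      u->state = kCommitted;
  }
  void abort() {
    for (Update *u : updates)
      u->state = kAborted;
  }
};
```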
o Refactoring: rewrite the Transaction layer
o remove the TransactionNode structure, it's no longer required
o remove the TransactionTree structure, it's no longer required
o temporary txns should no longer exist as a structure; their
BUs are automatically set to "committed"
o move duplicate key consolidation down to the BU layer
o the remaining Transaction layer should only manage txn-id and
txn name; nothing else!
o the Transactions should no longer form a linked list
o temporary Transactions should be created on the stack, without any costs
o Refactoring: rewrite the whole cursor layer
o clean up the public interface
o remove the Transaction cursor, merge BtreeCursor with LocalCursor
o there should be 3 states:
- nil
- coupled to btree
- coupled to txn
o state() - returns the state
o set_state() - changes the state
o key() returns the current (coupled) key
o record() returns the current (coupled) record
o get rid of the DuplicateCache and the duplicate index in the cursor; if
the DeltaUpdates are correctly sorted, then the DuplicateCache should not be
necessary (unless the duplicate position is explicitly required via
ham_cursor_get_duplicate_position())
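The three cursor states and the state()/set_state() accessors could be
sketched like this (names other than state()/set_state() are hypothetical):

```cpp
#include <cassert>

// Hypothetical sketch of the simplified cursor state machine above.
enum CursorState {
  kNil,             // not attached to anything
  kCoupledToBtree,  // points at a key in a btree node
  kCoupledToTxn     // points at a pending (delta) update
};

class SketchCursor {
 public:
  CursorState state() const { return state_; }
  void set_state(CursorState state) { state_ = state; }

 private:
  CursorState state_ = kNil;
};
```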
. More things to refactor in the btree
o EraseAction uses duplicate_index + 1, InsertAction uses duplicate_index
-> use a common behaviour/indexing
o EraseAction line 71: if the node is empty then it should be merged and
moved to the freelist!
. when splitting and HAM_HINT_APPEND is set, the new page is appended.
do the same for prepend!
. Refactoring: all unittest fixtures should derive from a BaseFixture,
which creates an Environment, creates a list of databases (w/ parameters),
and if required also a cursor, a transaction and a context
o include additional management functions like lenv(), ldb(), ltxn(),
page_manager(), cache(), context()...
o what else?
o then reorganize the tests
- public API
- internal modules
. The PageManager state is currently stored in a compressed encoding, but
it is less efficient than the standard varbyte encoding because
pages > 15 * page_size have to be split. Use a standard vbyte encoding
instead (it will anyway be required later on).
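For reference, the standard varbyte encoding mentioned here (7 payload bits
per byte, high bit as continuation flag) is straightforward; this is a
generic sketch, not the actual PageManager code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <vector>

// Standard varbyte encoding: 7 payload bits per byte, high bit set on all
// bytes except the last. Small values cost 1 byte; any uint64_t fits in 10.
static void varbyte_encode(uint64_t value, std::vector<uint8_t> *out) {
  while (value >= 0x80) {
    out->push_back(static_cast<uint8_t>(value) | 0x80);
    value >>= 7;
  }
  out->push_back(static_cast<uint8_t>(value));
}

// Returns the number of bytes consumed.
static size_t varbyte_decode(const uint8_t *in, uint64_t *value) {
  uint64_t result = 0;
  int shift = 0;
  size_t i = 0;
  while (in[i] & 0x80) {
    result |= static_cast<uint64_t>(in[i] & 0x7f) << shift;
    shift += 7;
    i++;
  }
  result |= static_cast<uint64_t>(in[i]) << shift;
  *value = result;
  return i + 1;
}
```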
o Concurrency: merge BtreeUpdates in the background
o Refactoring: would it make sense if each Database has its own BlobManager?
Then ham_env_erase_db would be much faster, and if each Database has its
own lock then the blob pages would not block each other
o are there other modules that should be "per database" and not
"per environment"?
- PageManager
- Device
- ...?
o LocalEnvironment::open creates a "fakepage" to find out the page_size;
just assume the default page size of 16kb and read the first page, then
discard it if the real page_size is a different one
o deal with databases < 16kb?
o migrate to libuv 1.0, it has a stable API
http://docs.libuv.org/en/latest/migration_010_100.html
o also for windows!
o improve the webpage documentation
o document the various btree formats on the webpage, with images
o variable length keys (w/ extended keys)
o POD keys
o default records
o inline records
o fixed-length records
o duplicates (w/ overflow tables)
o PRO: compressed keys
o PRO: compressed records
. investigate "pointer swizzling": internal btree nodes should store "hints"
to the actual Page, not the Page IDs, as soon as the Page was loaded for
the first time. This could circumvent the buffer cache and increase
performance.
How to invalidate those hints or discover that a page was evicted from
cache?
- Eviction could only free the persistent part of a page, and not the
stub.
- Could also use reference counting for the page
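A common way to implement such hints is a tagged child reference: reuse an
unused high bit of the page ID to mark a reference as swizzled. This is a
generic sketch, not hamsterdb code:

```cpp
#include <cassert>
#include <cstdint>

struct CachedPage {}; // stand-in for the in-memory page representation

// Hypothetical sketch of a "swizzled" child reference in an internal node:
// once the child was loaded, the raw page ID is replaced by a tagged
// direct pointer, skipping the cache lookup on later traversals.
struct ChildRef {
  uint64_t value = 0;

  // File offsets never use the topmost bit, so it can serve as the tag.
  static const uint64_t kSwizzledBit = 1ull << 63;

  bool is_swizzled() const { return (value & kSwizzledBit) != 0; }

  void swizzle(CachedPage *page) {
    value = reinterpret_cast<uint64_t>(page) | kSwizzledBit;
  }

  void unswizzle(uint64_t page_id) { // e.g. when the page is evicted
    value = page_id;
  }

  CachedPage *page() const {
    return reinterpret_cast<CachedPage *>(value & ~kSwizzledBit);
  }
};
```

On eviction the cache would have to walk back to the referencing node and
unswizzle the pointer (or pin the stub via reference counting, as noted
above).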
o PRO: Group Varint-related improvements
o will currently not work with pages > 16kb
o vacuumize (internal == false): merge two blocks if they're underfilled?
we have to reduce the number of indices
o also for varbyte?
o PRO: allow compression of 32bit record numbers
o PRO: use zint32 compression for internal nodes
-> requires 32bit page IDs
o PRO: prefix compression for variable-length keys
use an indirection for the prefixes and suffixes; store each
part in a slot. The keys themselves then have a fixed length (2 slot IDs)
==> supports efficient binary search!
==> is efficient for random reads/writes AND linear scans
However, it's very complex to figure out how to optimally split the
strings into prefixes and suffixes
==> prefixes and suffixes can be stored as extended keys if they become
too large
see indexcompression2009.pdf - Efficient index compression in DB2 LUW
o look for more research papers
o PRO: look for a better compression for DefaultRecordList, i.e.
- Each group is a GroupedVarInt w/ 4 bits per entry; a 64bit
number can then hold flags for 16 numbers
-> (but maybe increase this to hold at least 32 or 64 numbers, to
reduce the overhead ratio)
o create a GroupedVarInt<Max, T> class, where |Max| is the maximum number
of elements that are grouped, and T is the type of these elements
(i.e. uint64_t)
-> memory is managed by the caller
-> the state (i.e. used block size etc) is stored externally, and
managed by the caller
o append a key
o prepend a key
o insert a key in the middle
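The grouping idea above (4-bit descriptors, one uint64_t covering 16
values) can be sketched as follows; this is a simplified standalone
version, not the proposed GroupedVarInt<Max, T> interface with externally
managed state:

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>
#include <cstring>
#include <vector>

// Each value gets a 4-bit descriptor (payload length in bytes minus 1,
// i.e. 1..8 bytes), so one uint64_t of descriptors covers 16 values;
// the payload bytes follow the descriptor word.
static size_t byte_length(uint64_t v) {
  size_t n = 1;
  while (v >= 0x100) { v >>= 8; n++; }
  return n;
}

static void encode_group(const uint64_t *values, size_t count /* <= 16 */,
                         std::vector<uint8_t> *out) {
  uint64_t descriptors = 0;
  for (size_t i = 0; i < count; i++)
    descriptors |= static_cast<uint64_t>(byte_length(values[i]) - 1) << (i * 4);
  out->resize(out->size() + 8);
  std::memcpy(out->data() + out->size() - 8, &descriptors, 8);
  for (size_t i = 0; i < count; i++) {
    uint64_t v = values[i];
    for (size_t b = 0; b < byte_length(values[i]); b++) {
      out->push_back(static_cast<uint8_t>(v));
      v >>= 8;
    }
  }
}

static void decode_group(const uint8_t *in, size_t count,
                         std::vector<uint64_t> *out) {
  uint64_t descriptors;
  std::memcpy(&descriptors, in, 8);
  const uint8_t *p = in + 8;
  for (size_t i = 0; i < count; i++) {
    size_t len = ((descriptors >> (i * 4)) & 0xf) + 1;
    uint64_t v = 0;
    for (size_t b = 0; b < len; b++)
      v |= static_cast<uint64_t>(p[b]) << (b * 8);
    p += len;
    out->push_back(v);
  }
}
```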