===================
BlueStore Internals
===================


Small write strategies
----------------------

* *Normal*: The normal write path writes to unused space, waits for IO to flush, then commits the metadata.

  - write to new blob
  - kv commit

* *A*: Vanilla WAL overwrite: commit intent to overwrite, then overwrite asynchronously. This matches legacy bluestore.

  - kv commit
  - wal overwrite

* *B*: Do a read up front to complete a full csum or comp block, then (wal) overwrite.

  - read (some surrounding data)
  - kv commit
  - wal overwrite (of larger extent)

* *C*: Vanilla WAL read/modify/write (like legacy bluestore).

  - kv commit
  - wal read/modify/write

* *D*: Fragment lextent space by writing a small piece of data into a piecemeal blob (that collects random, noncontiguous bits of data
  we need to write).

  - write to a piecemeal blob (min_alloc_size or larger, but we use just one block of it)
  - kv commit

* *E*: Copy-on-write wal event: read from location A, combine with new write, write to location B.

  - kv commit
  - wal read from immutable blob A, modify, write to blob B

* *F*: Copy-on-write wal event: read from location A, combine with new write, write to location B. Update csum/comp metadata.

  - kv commit
  - wal read from immutable blob A, verify csum, modify, [csum and/or compress+allocate,] write to blob B
  - kv update csum/comp/alloc metadata
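
The step lists above differ mainly in what happens before and after the kv commit. A minimal sketch that transcribes them into data and splits each strategy around its commit point; the names ``STRATEGY_STEPS``, ``split_around_commit`` and ``has_deferred_work`` are hypothetical and exist only for this illustration, not in BlueStore:

```python
KV_COMMIT = "kv commit"

# Ordered steps per strategy, transcribed from the list above.
# (Illustrative data only; this is not BlueStore code.)
STRATEGY_STEPS = {
    "Normal": ["write to new blob", KV_COMMIT],
    "A": [KV_COMMIT, "wal overwrite"],
    "B": ["read (some surrounding data)", KV_COMMIT,
          "wal overwrite (of larger extent)"],
    "C": [KV_COMMIT, "wal read/modify/write"],
    "D": ["write to a piecemeal blob", KV_COMMIT],
    "E": [KV_COMMIT,
          "wal read from immutable blob A, modify, write to blob B"],
    "F": [KV_COMMIT,
          "wal read from immutable blob A, verify csum, modify, "
          "[csum and/or compress+allocate,] write to blob B",
          "kv update csum/comp/alloc metadata"],
}


def split_around_commit(strategy):
    """Return (steps before the kv commit, steps after the kv commit)."""
    steps = STRATEGY_STEPS[strategy]
    i = steps.index(KV_COMMIT)
    return steps[:i], steps[i + 1:]


def has_deferred_work(strategy):
    """True if the strategy still has work to do after the kv commit."""
    return bool(split_around_commit(strategy)[1])


if __name__ == "__main__":
    for name in STRATEGY_STEPS:
        before, after = split_around_commit(name)
        print(f"{name}: before commit={before}, after commit={after}")
```

Under this reading, only *Normal* and *D* finish all of their data IO before the commit; every other strategy leaves WAL work to be done afterwards.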

+----------------------+--------+--------------+-------------+--------------+---------------+
|                      | raw    | raw (cached) | csum (4 KB) | csum (16 KB) | comp (128 KB) |
+----------------------+--------+--------------+-------------+--------------+---------------+
| 4 KB overwrite       | A      | A            | A           | B            | B or D        |
+----------------------+--------+--------------+-------------+--------------+---------------+
| 100 byte overwrite   | C      | A            | B           | B            | B or D        |
+----------------------+--------+--------------+-------------+--------------+---------------+
| 100 byte append      | C      | A            | B           | B            | B or D        |
+----------------------+--------+--------------+-------------+--------------+---------------+
| 4 KB clone overwrite | D or E |              | D or F      | D or F       | D or F        |
+----------------------+--------+--------------+-------------+--------------+---------------+
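
The table above can be read as a lookup from (write type, blob type) to candidate strategies. A small sketch under that assumption; ``BLOB_KINDS``, ``_ROWS`` and ``candidate_strategies`` are hypothetical names for illustration only, and the empty cell is treated as "no strategy listed":

```python
# Column order of the table above.
BLOB_KINDS = ["raw", "raw (cached)", "csum (4 KB)", "csum (16 KB)", "comp (128 KB)"]

# One entry per table row; cells are copied verbatim, "" marks the empty cell.
_ROWS = {
    "4 KB overwrite":       ["A", "A", "A", "B", "B or D"],
    "100 byte overwrite":   ["C", "A", "B", "B", "B or D"],
    "100 byte append":      ["C", "A", "B", "B", "B or D"],
    "4 KB clone overwrite": ["D or E", "", "D or F", "D or F", "D or F"],
}


def candidate_strategies(write_kind, blob_kind):
    """Return the list of strategy letters the table gives for this cell."""
    cell = _ROWS[write_kind][BLOB_KINDS.index(blob_kind)]
    # An empty cell means the table lists no strategy for that combination.
    return cell.split(" or ") if cell else []


if __name__ == "__main__":
    print(candidate_strategies("4 KB overwrite", "csum (16 KB)"))
    print(candidate_strategies("4 KB clone overwrite", "raw"))
```

The table does not say how to choose between the two letters in a "B or D" style cell, so the sketch returns both and leaves that decision to the caller.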
c7cb768
I suppose E and F are pretty much the same, since you need to update the lextent/blob maps in KV anyway (alloc info, num_refs, blob ids, etc.).
I also have some doubts about the validity of B, as it probably has consistency issues; see my email on the dev list.
Imagine you have two WAL records that both did the up-front read of the same block. I suppose the second one is invalid, as it doesn't contain the data updated by the first write.
I think the simple solution is that if there are any reads happening in do_write, we o->flush(), just like all the read methods do. This can be optimized later with ranges if we decide it matters.
OK. Sounds good
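
The hazard raised in this thread for strategy B, and the flush-before-read fix, can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `ToyObject`, `small_write` and `run` are hypothetical names, each WAL record is modelled as a full-block image, and `flush()` merely stands in for the `o->flush()` call mentioned above. It is not BlueStore code.

```python
class ToyObject:
    """Toy stand-in for an object with deferred (WAL) overwrites."""

    def __init__(self, block: bytes):
        self.disk = bytearray(block)  # block contents currently on disk
        self.pending = []             # queued WAL records (full-block images)

    def flush(self):
        """Apply all queued WAL overwrites to disk, in order."""
        for image in self.pending:
            self.disk[:] = image
        self.pending.clear()

    def small_write(self, offset: int, data: bytes, flush_before_read: bool):
        """Strategy B: read the block up front, merge, queue a WAL overwrite."""
        if flush_before_read:
            self.flush()                      # make earlier WAL writes visible
        image = bytearray(self.disk)          # up-front read of the whole block
        image[offset:offset + len(data)] = data
        self.pending.append(bytes(image))     # deferred full-block overwrite


def run(flush_before_read: bool) -> bytes:
    obj = ToyObject(b"--------")
    obj.small_write(0, b"AA", flush_before_read)
    obj.small_write(4, b"BB", flush_before_read)
    obj.flush()
    return bytes(obj.disk)


if __name__ == "__main__":
    print(run(flush_before_read=False))  # second record was built from stale data
    print(run(flush_before_read=True))   # both writes survive
```

Without the flush, the second record is built from a read that predates the first write, so replaying both records loses the first write's bytes. Flushing before the read removes that window, at the cost of waiting for pending WAL work on the object.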