bluestore write path notes
liewegas committed May 12, 2016
1 parent d37bc22 commit c7cb768
Showing 1 changed file with 58 additions and 0 deletions.
58 changes: 58 additions & 0 deletions doc/dev/bluestore.rst
@@ -0,0 +1,58 @@
===================
BlueStore Internals
===================


Small write strategies
----------------------

* *Normal*: The normal write path writes to unused space, waits for IO to flush, then commits the metadata.

  - write to new blob
  - kv commit

* *A*: Vanilla WAL overwrite: commit the intent to overwrite, then overwrite asynchronously. This matches legacy BlueStore.

  - kv commit
  - wal overwrite

* *B*: Read up front to complete a full checksum (csum) or compression (comp) block, then do a (WAL) overwrite.

  - read (some surrounding data)
  - kv commit
  - wal overwrite (of larger extent)

* *C*: Vanilla WAL read/modify/write (like legacy BlueStore).

  - kv commit
  - wal read/modify/write

* *D*: Fragment lextent space by writing a small piece of data into a piecemeal blob (one that collects the random, noncontiguous bits of data we need to write).

  - write to a piecemeal blob (min_alloc_size or larger, but we use just one block of it)
  - kv commit

* *E*: Copy-on-write WAL event: read from location A, combine with the new write, write to location B.

  - kv commit
  - wal read from immutable blob A, modify, write to blob B

* *F*: Copy-on-write WAL event: read from location A, combine with the new write, write to location B. Update csum/comp metadata.

  - kv commit
  - wal read from immutable blob A, verify csum, modify, [csum and/or compress+allocate,] write to blob B
  - kv update csum/comp/alloc metadata
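Strategies B and F both depend on widening a small write to whole checksum blocks: read the surrounding data, (for F) verify it, merge in the new bytes, and recompute the checksum. The following is a minimal Python sketch of that arithmetic, not BlueStore code. The names ``csum_aligned_extent`` and ``merge_small_write`` are hypothetical, ``zlib.crc32`` is only a stand-in for whatever checksum the blob uses, and the blob length is assumed to be a multiple of the checksum block size.

```python
import zlib


def csum_aligned_extent(offset, length, csum_block):
    """Smallest csum-block-aligned [start, end) range covering the write."""
    start = (offset // csum_block) * csum_block
    end = ((offset + length + csum_block - 1) // csum_block) * csum_block
    return start, end


def merge_small_write(blob, csums, offset, data, csum_block):
    """Read the aligned extent, verify it, merge the new data, re-checksum.

    Returns (start, merged_bytes, new_csums) and leaves `blob` untouched,
    so the caller can write the result in place (B) or elsewhere (F).
    """
    start, end = csum_aligned_extent(offset, len(data), csum_block)
    old = bytes(blob[start:end])  # read (some surrounding data)

    # Verify each checksum block that the read touched.
    first = start // csum_block
    for i in range(0, end - start, csum_block):
        if zlib.crc32(old[i:i + csum_block]) != csums[first + i // csum_block]:
            raise ValueError("checksum mismatch in block %d" % (first + i // csum_block))

    # Modify: overlay the new bytes at their position inside the extent.
    merged = bytearray(old)
    rel = offset - start
    merged[rel:rel + len(data)] = data

    # Recompute one checksum per block of the merged extent.
    new_csums = [
        zlib.crc32(bytes(merged[i:i + csum_block]))
        for i in range(0, len(merged), csum_block)
    ]
    return start, bytes(merged), new_csums
```

For example, with a 16 KB checksum block, a 4 KB write at offset 20 KB widens to the extent ``(16384, 32768)``, so 12 KB of surrounding data has to be read first; a 100 byte write that straddles the 16 KB boundary widens to two full blocks.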

+----------------------+--------+--------------+-------------+--------------+---------------+
|                      | raw    | raw (cached) | csum (4 KB) | csum (16 KB) | comp (128 KB) |
+======================+========+==============+=============+==============+===============+
| 4 KB overwrite       | A      | A            | A           | B            | B or D        |
+----------------------+--------+--------------+-------------+--------------+---------------+
| 100 byte overwrite   | C      | A            | B           | B            | B or D        |
+----------------------+--------+--------------+-------------+--------------+---------------+
| 100 byte append      | C      | A            | B           | B            | B or D        |
+----------------------+--------+--------------+-------------+--------------+---------------+
| 4 KB clone overwrite | D or E |              | D or F      | D or F       | D or F        |
+----------------------+--------+--------------+-------------+--------------+---------------+
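The table can be read as a lookup from (write type, blob type) to the candidate strategies. A small Python sketch of that lookup follows; it only transcribes the table, the names ``BLOB_KINDS``, ``ROWS`` and ``candidate_strategies`` are hypothetical, and the empty cell is kept empty rather than guessed.

```python
# Column order of the table above.
BLOB_KINDS = ("raw", "raw (cached)", "csum (4 KB)", "csum (16 KB)", "comp (128 KB)")

# One entry per table row; "B D" means "B or D", "" is the empty cell.
ROWS = {
    "4 KB overwrite":       ("A",   "A", "A",   "B",   "B D"),
    "100 byte overwrite":   ("C",   "A", "B",   "B",   "B D"),
    "100 byte append":      ("C",   "A", "B",   "B",   "B D"),
    "4 KB clone overwrite": ("D E", "",  "D F", "D F", "D F"),
}


def candidate_strategies(write_kind, blob_kind):
    """Return the strategy letters listed in the table cell, as a tuple."""
    cell = ROWS[write_kind][BLOB_KINDS.index(blob_kind)]
    return tuple(cell.split())
```

``candidate_strategies("4 KB overwrite", "csum (16 KB)")`` gives ``("B",)``: a 4 KB write does not cover a full 16 KB checksum block, so the surrounding data has to be read first.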

7 comments on commit c7cb768

@ifed01 commented on c7cb768 May 12, 2016

I suppose E and F are pretty much the same, since you need to update the lextent/blob maps in KV anyway (alloc info, num_refs, blob ids, etc.).

@ifed01 commented on c7cb768 May 12, 2016

And I have some doubts about B's validity, as it probably has consistency issues; see my email on the dev list...

@liewegas (owner, author) commented on c7cb768 May 12, 2016 via email

@liewegas (owner, author) commented on c7cb768 May 12, 2016 via email

@ifed01 commented on c7cb768 May 12, 2016

Imagine you have two WAL records that both did the up-front read for the same block. I suppose the second one is invalid, since it doesn't contain the data updated by the first write.

@liewegas (owner, author)

I think the simple solution is that if there are any reads happening in do_write, we o->flush(), just like all the read methods do. This can be optimized later with ranges if we decide it matters.
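
A toy Python model of the hazard and of the flush-before-read fix discussed here. This is only a sketch under stated assumptions, not BlueStore code: ``ToyObject``, ``small_write`` and the ``flush_before_read`` switch are hypothetical, ``flush`` mirrors ``o->flush()`` in spirit only, and each write is assumed to stay within one block.

```python
class ToyObject:
    """An object with on-disk bytes and a queue of pending WAL overwrites."""

    def __init__(self, size):
        self.disk = bytearray(size)
        self.pending = []  # queued WAL overwrites: (offset, bytes)

    def flush(self):
        """Apply all queued WAL overwrites to disk, in order."""
        for off, data in self.pending:
            self.disk[off:off + len(data)] = data
        self.pending.clear()

    def small_write(self, offset, data, block, flush_before_read=True):
        """Strategy-B style write: read the whole block up front, merge,
        then queue a WAL overwrite of the full block."""
        if flush_before_read:
            self.flush()  # make earlier WAL overwrites visible to the read
        start = (offset // block) * block
        merged = bytearray(self.disk[start:start + block])  # up-front read
        rel = offset - start
        merged[rel:rel + len(data)] = data
        self.pending.append((start, bytes(merged)))
```

With ``flush_before_read=False``, writing ``b"AA"`` at offset 0 and then ``b"BB"`` at offset 2 of the same 4-byte block leaves ``b"\x00\x00BB"`` on disk after the final flush, because the second record was built from a stale read and overwrites the first. With the flush in place the block ends up as ``b"AABB"``.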

@ifed01 commented on c7cb768 May 12, 2016

OK. Sounds good
