Join GitHub today
GitHub is home to over 36 million developers working together to host and review code, manage projects, and build software together.Sign up
db: blob storage #112
The WiscKey paper showed the value of segregating storage for large values in order to reduce write amplification. Badger is a Go KV store implementing that design. RocksDB has support for storing blobs (binary large objects) outside of the main LSM. Unfortunately, this blob-db is incompatible with the normal RocksDB API, making it unusable for us as-is. This issue outlines how I think blob storage should be added to Pebble.
The base WiscKey idea is that instead of storing large values inline in the LSM, they are stored in a separate value-log. The LSM merely contains a value-pointer. Storing large values outside of the LSM significantly reduces the size of the LSM which improves caching and reduces the overhead of compactions.
The preliminary blob support in RocksDB has outlined the basic structure for how this could work in Pebble. A new
(Note that RocksDB also provides support for a TTL on blobs, but I don't think this is usable for CockroachDB and therefore not needed in Pebble).
Blobs are stored in a series of blob logs. A background process periodically walks over the blob logs, reading the entries and checking to see if the value is still "live" in the LSM. This process is called value GC. If the value is still "live", it is written to the active blob log, and the LSM is updated.
An LSM does not allow mutating existing entries. In order to change the value for a key, a new value needs to be written which masks the old value. In Pebble/RocksDB/Badger, each version of a key/value pair has an associated sequence number (timestamp in Badger). RocksDB/Pebble take advantage of these sequence numbers to provide immutable snapshots. But the immutability of existing values presents a problem for blob storage. How do we update the LSM when a value is moved from one blob log to another? AFAICT, RocksDB's blob-db punts on this issue and just updates the current version. Badger at least attempts to deal with the problem:
The way this logical move is implemented in badger, is to rewrite the key as
The Blob Level
Blob pointers uniquely identify a blob. Blob pointers need to be immutable after they are created, because due to LSM immutability, we can't update the blob pointer without violating a core tenant of an LSM. A key observation: we don't have to update the blob pointer when a blob moves, we just need to be able to find a blob given the original blob pointer even if the blob's actual location changes. A blob pointer is composed of
Putting this together, blobs are stored in a series of logs and sstables. When first created, there is a single log that is being appended to (
As blobs get written to this log, they are given blob pointers with
A background process periodically loops over these files and rewrites them, only retaining blobs that are still "live" in the LSM. The notable difference from Badger/blob-db, is that rewritten blobs are stored in an sstable. For example, if
This sstable+log structure can be viewed as a single level in an LSM. Just as for a normal level, the metadata for this blob level is
When should GC be triggered? Simply walking through the files in the blob level in order is one approach, but we can do better. We should GC a blob file when it has a significant fraction of "dead" data. We can keep an exact count of this dead data by tracking when blob pointers are dropped during compactions (because a tombstone covering the blob has reached the bottommost level). A compaction would keep a
A background process would periodically wake up and look at the blob file metadata, looking for contiguous runs of files that will "compact" down to much smaller size. Note that we'll be able to accurately predict both the space savings and the size of the new file.
Interesting. I always wondered how the blob index would be updated after GC. Some thoughts:
Which sorting are you referring to? The sorting that takes place of the blob pointers after a compaction?
I'm not familiar with TitanDB's design. Do you have a link to something I can read?
When blob files are rewritten as sstables, we'd automatically get the compression benefits of compressing blocks instead of individual blobs.
I don't know. Perhaps the WiscKey paper has thoughts about this. We can reduce GC-induced write amplification by increasing space amplification. I'll think about this.
The TitanDB design is interesting. It looks like every sstable has an associate set of blob files (which seem to have an sstable-style structure). Blobs are indexed by
One advantage to writing the blobs to the WAL/memtable and then segregating them at flush, is that doing so removes a sync. If the blobs are segregated at commit time, then we need to sync the blob log and the WAL. I think that is worthwhile. That also opens the possibility of removing blob logs. Perhaps they are useful as a performance win (seeking to an offset is faster than a lookup in an sstable), but flushing could easily just create blob sstables.
Right. As you mentioned in Titan the blobs are indexed by
I'm still thinking about this too and don't have the intuition why GC is fundamentally better than compaction. The WiscKey paper didn't help as their analytic calculations omitted GC and their write-amp experiments were for insertion-only DBs. This was more helpful (http://smalldatum.blogspot.com/2018/07/indexlog-alternative-to-lsm.html) but I'm still not sure why it'd have better effects than reducing fanout (
We can't generally use tiered compaction unless we support multiple column families. I suppose blobs could be considered as a specialized versions of column families that are enabled "automatically". The GC of the blob level is essentially the same as universal compaction (I think, my understanding of the terminology is sometimes confused). I think it is interesting that the blob level allows trading of space amplification for write amplification. Level-based compaction trades of space+read amplification for write amplification. My suspicion is that segregated blob storage is a useful enabler of the design space, but I'm still searching for that better intuitive understanding of how to tune the rate of GC.
32 bytes does seem small to me as well. Are these blobs stored in the block cache like normal data blocks? How does that work?
Compacting blogs down back into ssts is interesting. @ajkr's question about "How do we reason about write-amplification incurred by GC vs. the write-amplification that would've happened without blob separation?" sums up my uncertainty too. Perhaps naively, I would expect that GCing blobs back into an SST would undo a lot of the benefit we get from segregated value storage.
When I first read this proposal I thought you were suggesting that we introduce an extra level of indirection through an sst-backed indexing structure pointing into the blog files. I wonder if there's something to explore there.
Also, I couldn't help but think of https://www.flickr.com/photos/k-80/4663400017 while reading through this.
I missed the issue of moving values and the LSM. Thanks for making that clear to me.
I expect that cache-amp for index+log is worse than you mention above when trying to achieve at most one IO per point query -- but maybe that constraint is less important to you. If value log GC/compaction requires index queries to determine whether a key is live then that is also likely to need a cached index or GC might be slow/inefficient.
My understanding of value log management that you propose is:
This requires a persistent mapping to get from .blog name to current container (.blog or *.sst). Assume this is an map from *.blog to current file. It would be nice if there were a way to eventually remove entries from that map that are no longer needed. If a *.blog file is 64 MB then there are 2^25 *.blog files per PB written. Questions about how many PB might be written over the lifetime of database are why I am curious about being able to GC that map.
You don't owe me an answer but if you are building this index+log from scratch, are you sure that an LSM is the best choice for the index given the complexity above (the inability to update the index with the new location of the value, the extra search required to find the value, the size of the map, GC for the map, the need to cache the LSM to make GC queries efficient)?
Note that row cache works better with value log entries when key is stable
Yep, this all looks accurate.
The mapping you're describing is simply an array of
This isn't actively being built. I wrote the above after trying to understand how other index+log systems work. I'm not convinced it is worthwhile and the expense of GC queries is definitely a concern. The inability to update the location of the value in the LSM is quite the oddity I discovered in the WiscKey approach. But the reason to use an LSM is if the application is already using an LSM and wants blob storage for some reason. One place this could be useful in CockroachDB's usage is for Raft log entries which are written once and read almost never. On the other hand, there might be an even more bespoke structure that is better for Raft log storage (note that in CockroachDB there are 10s of thousands of Raft logs per node, so it isn't feasible to actually use a separate file per Raft log).