Optimize, simplify filter block building (fix regression)
Summary: This is in part a refactoring / simplification to set up for
"decoupled" partitioned filters and in part to fix an intentional
regression for a correctness fix in facebook#12872. Basically, we are taking out
some complexity of the filter block builders, and pushing part of it
(simultaneous de-duplication of prefixes and whole keys) into the filter
bits builders, where it is more efficient by operating on hashes (rather
than copied keys).
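
In other words (an illustrative contrast with stand-in types, not code from
this PR), de-duplication moves from comparing a copied byte string to
comparing a cached 64-bit hash that the filter must compute anyway:

```
#include <cstdint>
#include <string>
#include <string_view>

// Illustrative only: RocksDB uses Slice and its own 64-bit hash function.
// Before: the block builder kept a copied std::string of the last entry.
bool IsDupByCopy(const std::string& last_copy, std::string_view entry) {
  return last_copy == entry;  // byte comparison, after paying for the copy
}

// After: the bits builder caches just the previous 64-bit hash.
bool IsDupByHash(uint64_t prev_hash, uint64_t entry_hash) {
  return entry_hash == prev_hash;  // no copy, single integer compare
}
```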

Previously, the FullFilterBlockBuilder had a somewhat fragile and
confusing set of conditions under which it would keep a copy of the most
recent prefix and most recent whole key, along with some other state
that is essentially redundant. Now we just track (always) the previous
prefix in the PartitionedFilterBlockBuilder, to deal with the boundary
prefix Seek filtering problem. (Btw, the next PR will optimize this
away, since BlockBasedTableBuilder already tracks the previous key.)
And to deal with the problem of de-duplicating both whole keys and
prefixes going into a single filter, we add a new function to
FilterBitsBuilder that has that extra de-duplication capability, which
is relatively efficient because we only have to cache an extra 64-bit
hash, not a copied key or prefix. (The API of this new function is
somewhat awkward to avoid a small CPU regression in some cases.)
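
Concretely, the new function looks like this (a condensed sketch of the
XXPH3FilterBitsBuilder::AddKeyAndAlt added in the filter_policy.cc diff
below, with member names abbreviated):

```
void AddKeyAndAlt(const Slice& key, const Slice& alt) {
  uint64_t key_hash = GetSliceHash64(key);
  uint64_t alt_hash = GetSliceHash64(alt);
  // The previous key's hash is already available as the last buffered
  // entry; only the previous alt hash needs dedicated storage.
  std::optional<uint64_t> prev_key_hash;
  std::optional<uint64_t> prev_alt_hash = saved_prev_alt_hash_;
  if (!entries_.empty()) {
    prev_key_hash = entries_.back();
  }
  // Add alt first so that entries_.back() remains the previous *key*
  if (alt_hash != prev_alt_hash && alt_hash != key_hash &&
      alt_hash != prev_key_hash) {
    entries_.push_back(alt_hash);
  }
  saved_prev_alt_hash_ = alt_hash;
  if (key_hash != prev_key_hash && key_hash != prev_alt_hash) {
    entries_.push_back(key_hash);
  }
}
```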

Also previously, there was awkward logic split between
FullFilterBlockBuilder and PartitionedFilterBlockBuilder to deal
with some things specific to partitioning, along with confusing names
like Add vs. AddKey. FullFilterBlockBuilder is much cleaner and simpler now.
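
With de-duplication handled by the bits builder, the Add logic reduces to
essentially the following (taken from the full_filter_block.cc diff below):

```
void FullFilterBlockBuilder::Add(const Slice& key_without_ts) {
  if (prefix_extractor_ && prefix_extractor_->InDomain(key_without_ts)) {
    Slice prefix = prefix_extractor_->Transform(key_without_ts);
    if (whole_key_filtering_) {
      // De-duplicates keys and prefixes together, on hashes
      filter_bits_builder_->AddKeyAndAlt(key_without_ts, prefix);
    } else {
      filter_bits_builder_->AddKey(prefix);
    }
  } else if (whole_key_filtering_) {
    filter_bits_builder_->AddKey(key_without_ts);
  }
}
```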

The splitting of PartitionedFilterBlockBuilder::MaybeCutAFilterBlock
into DecideCutAFilterBlock and CutAFilterBlock is to address what would
have been a slight performance regression in some cases. The split
allows for more instruction-level parallelism by reducing unnecessary
control dependencies.
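
To illustrate the idea (a hypothetical sketch, not this PR's actual
PartitionedFilterBlockBuilder code, whose diff is not shown in this view):
the cut decision reads only pre-existing state, so it no longer waits on
work done for the current entry.

```
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: splitting "decide" from "act" removes the control
// dependency between processing the current entry and deciding to cut.
class PartitionSketch {
 public:
  void Add(uint64_t hash) {
    // Decision depends only on state from before this entry, so the CPU
    // can overlap it with the work of adding the entry itself.
    if (DecideCutAFilterBlock()) {
      CutAFilterBlock();  // finalize the current partition
    }
    entries_.push_back(hash);
  }

 private:
  bool DecideCutAFilterBlock() const { return entries_.size() >= kTarget; }
  void CutAFilterBlock() {
    finished_.push_back(std::move(entries_));
    entries_.clear();
  }
  static constexpr size_t kTarget = 1024;  // arbitrary for illustration
  std::vector<uint64_t> entries_;
  std::vector<std::vector<uint64_t>> finished_;
};
```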

Test Plan: existing tests (with some minor updates)

Also manually ported over the pre-broken regression test described in
facebook#12870 and ran it (passed).

Performance:
Here we validate that an entire series of recent related PRs are a net
improvement in aggregate. "Before" is with these PRs reverted: facebook#12872
 facebook#12911 facebook#12874 facebook#12867 facebook#12903 facebook#12904. "After" includes this PR (and all
of those, with base revision 16c21af). Simultaneous test script designed
to maximally depend on SST construction efficiency:

```
for PF in 0 1; do for PS in 0 8; do for WK in 0 1; do [ "$PS" == "$WK" ] || (for I in `seq 1 20`; do TEST_TMPDIR=/dev/shm/rocksdb2 ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -memtablerep=vector -allow_concurrent_memtable_write=0 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK 2>&1 | grep micros/op; done) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; echo "Was -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK"; done; done; done) | tee results
```

Showing average ops/sec of "after" vs. "before":

```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
935586 vs. 928176 (+0.79%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
930171 vs. 926801 (+0.36%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
910727 vs. 894397 (+1.8%)
-partition_index_and_filters=1 -prefix_size=0 -whole_key_filtering=1
929795 vs. 922007 (+0.84%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
921924 vs. 917285 (+0.51%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
903393 vs. 887340 (+1.8%)
```

As one would predict, the most improvement is seen in cases where we
have optimized away copying the whole key.
pdillinger committed Aug 12, 2024
1 parent 6727f0f commit 9f02578
Showing 9 changed files with 214 additions and 168 deletions.
db/db_bloom_filter_test.cc: 9 changes (7 additions & 2 deletions)
@@ -692,15 +692,20 @@ namespace {
 
 class AlwaysTrueBitsBuilder : public FilterBitsBuilder {
  public:
-  void AddKey(const Slice&) override {}
-  size_t EstimateEntriesAdded() override { return 0U; }
+  void AddKey(const Slice&) override { ++count_; }
+  void AddKeyAndAlt(const Slice&, const Slice&) override { count_ += 2; }
+  size_t EstimateEntriesAdded() override { return count_; }
   Slice Finish(std::unique_ptr<const char[]>* /* buf */) override {
+    count_ = 0;
     // Interpreted as "always true" filter (0 probes over 1 byte of
     // payload, 5 bytes metadata)
     return Slice("\0\0\0\0\0\0", 6);
   }
   using FilterBitsBuilder::Finish;
   size_t ApproximateNumEntries(size_t) override { return SIZE_MAX; }
+
+ private:
+  size_t count_ = 0;
 };
 
 class AlwaysTrueFilterPolicy : public ReadOnlyBuiltinFilterPolicy {
table/block_based/filter_policy.cc: 106 changes (89 additions & 17 deletions)
@@ -82,21 +82,32 @@ class XXPH3FilterBitsBuilder : public BuiltinFilterBitsBuilder {
     // requirements.
     if (hash_entries_info_.entries.empty() ||
         hash != hash_entries_info_.entries.back()) {
-      if (detect_filter_construct_corruption_) {
-        hash_entries_info_.xor_checksum ^= hash;
-      }
-      hash_entries_info_.entries.push_back(hash);
-      if (cache_res_mgr_ &&
-          // Traditional rounding to whole bucket size
-          ((hash_entries_info_.entries.size() %
-            kUint64tHashEntryCacheResBucketSize) ==
-           kUint64tHashEntryCacheResBucketSize / 2)) {
-        hash_entries_info_.cache_res_bucket_handles.emplace_back(nullptr);
-        Status s = cache_res_mgr_->MakeCacheReservation(
-            kUint64tHashEntryCacheResBucketSize * sizeof(hash),
-            &hash_entries_info_.cache_res_bucket_handles.back());
-        s.PermitUncheckedError();
-      }
+      AddHash(hash);
     }
   }
 
+  void AddKeyAndAlt(const Slice& key, const Slice& alt) override {
+    uint64_t key_hash = GetSliceHash64(key);
+    uint64_t alt_hash = GetSliceHash64(alt);
+    std::optional<uint64_t> prev_key_hash;
+    std::optional<uint64_t> prev_alt_hash = hash_entries_info_.prev_alt_hash;
+    if (!hash_entries_info_.entries.empty()) {
+      prev_key_hash = hash_entries_info_.entries.back();
+    }
+    // Add alt first, so that entries.back() always contains previous key
+    // ASSUMING a change from one alt to the next implies a change to
+    // corresponding key
+    if (alt_hash != prev_alt_hash && alt_hash != key_hash &&
+        alt_hash != prev_key_hash) {
+      AddHash(alt_hash);
+    }
+    // Overwrite prev_alt_hash for cases like alt_hash == key_hash
+    hash_entries_info_.prev_alt_hash = alt_hash;
+    // NOTE: checking key_hash != prev_alt_hash for cases like
+    // key == prefix(key) at the end of a prefix grouping as in reverse
+    // byte-wise comparator
+    if (key_hash != prev_key_hash && key_hash != prev_alt_hash) {
+      AddHash(key_hash);
+    }
+  }

@@ -116,6 +127,24 @@ class XXPH3FilterBitsBuilder : public BuiltinFilterBitsBuilder {
           CacheEntryRole::kFilterConstruction>::GetDummyEntrySize() /
       sizeof(uint64_t);
 
+  void AddHash(uint64_t hash) {
+    if (detect_filter_construct_corruption_) {
+      hash_entries_info_.xor_checksum ^= hash;
+    }
+    hash_entries_info_.entries.push_back(hash);
+    if (cache_res_mgr_ &&
+        // Traditional rounding to whole bucket size
+        ((hash_entries_info_.entries.size() %
+          kUint64tHashEntryCacheResBucketSize) ==
+         kUint64tHashEntryCacheResBucketSize / 2)) {
+      hash_entries_info_.cache_res_bucket_handles.emplace_back(nullptr);
+      Status s = cache_res_mgr_->MakeCacheReservation(
+          kUint64tHashEntryCacheResBucketSize * sizeof(hash),
+          &hash_entries_info_.cache_res_bucket_handles.back());
+      s.PermitUncheckedError();
+    }
+  }
+
   // For delegating between XXPH3FilterBitsBuilders
   void SwapEntriesWith(XXPH3FilterBitsBuilder* other) {
     assert(other != nullptr);
@@ -282,17 +311,22 @@ class XXPH3FilterBitsBuilder : public BuiltinFilterBitsBuilder {
     // Otherwise, it is 0.
     uint64_t xor_checksum = 0;
 
+    // A single-element cache to help AddKeyAndAlt
+    std::optional<uint64_t> prev_alt_hash;
+
     void Swap(HashEntriesInfo* other) {
       assert(other != nullptr);
       std::swap(entries, other->entries);
       std::swap(cache_res_bucket_handles, other->cache_res_bucket_handles);
       std::swap(xor_checksum, other->xor_checksum);
+      std::swap(prev_alt_hash, other->prev_alt_hash);
     }
 
     void Reset() {
       entries.clear();
       cache_res_bucket_handles.clear();
       xor_checksum = 0;
+      prev_alt_hash = {};
     }
   };

@@ -331,6 +365,14 @@ class FastLocalBloomBitsBuilder : public XXPH3FilterBitsBuilder {
 
   Slice Finish(std::unique_ptr<const char[]>* buf, Status* status) override {
     size_t num_entries = hash_entries_info_.entries.size();
+    if (num_entries == 0) {
+      // This case migrated from FullFilterBlockBuilder::Finish
+      if (status) {
+        *status = Status::OK();
+      }
+      return FinishAlwaysFalse(buf);
+    }
+
     size_t len_with_metadata = CalculateSpace(num_entries);
 
     std::unique_ptr<char[]> mutable_buf;
@@ -1023,6 +1065,7 @@ class LegacyBloomBitsBuilder : public BuiltinFilterBitsBuilder {
   ~LegacyBloomBitsBuilder() override;
 
   void AddKey(const Slice& key) override;
+  void AddKeyAndAlt(const Slice& key, const Slice& alt) override;
 
   size_t EstimateEntriesAdded() override { return hash_entries_.size(); }
 
@@ -1050,6 +1093,9 @@
   int bits_per_key_;
   int num_probes_;
   std::vector<uint32_t> hash_entries_;
+  // A single-element cache to help AddKeyAndAlt. (-1 == empty)
+  int64_t prev_alt_hash_ = -1;
+
   Logger* info_log_;
 
   // Get totalbits that optimized for cpu cache line
@@ -1079,14 +1125,39 @@ LegacyBloomBitsBuilder::~LegacyBloomBitsBuilder() = default;
 
 void LegacyBloomBitsBuilder::AddKey(const Slice& key) {
   uint32_t hash = BloomHash(key);
-  if (hash_entries_.size() == 0 || hash != hash_entries_.back()) {
+  if (hash_entries_.empty() || hash_entries_.back() != hash) {
     hash_entries_.push_back(hash);
   }
 }
 
+void LegacyBloomBitsBuilder::AddKeyAndAlt(const Slice& key, const Slice& alt) {
+  // Modified from XXPH3FilterBitsBuilder::AddKeyAndAlt
+  uint32_t key_hash = BloomHash(key);
+  uint32_t alt_hash = BloomHash(alt);
+
+  int64_t prev_key_hash = -1;
+  int64_t prev_alt_hash = prev_alt_hash_;
+  if (!hash_entries_.empty()) {
+    prev_key_hash = hash_entries_.back();
+  }
+  if (alt_hash != prev_alt_hash && alt_hash != key_hash &&
+      alt_hash != prev_key_hash) {
+    hash_entries_.push_back(alt_hash);
+  }
+  prev_alt_hash_ = alt_hash;
+  if (key_hash != prev_key_hash && key_hash != prev_alt_hash) {
+    hash_entries_.push_back(key_hash);
+  }
+}
+
 Slice LegacyBloomBitsBuilder::Finish(std::unique_ptr<const char[]>* buf) {
-  uint32_t total_bits, num_lines;
   size_t num_entries = hash_entries_.size();
+  if (num_entries == 0) {
+    // This case migrated from FullFilterBlockBuilder::Finish
+    return FinishAlwaysFalse(buf);
+  }
+
+  uint32_t total_bits, num_lines;
   char* data =
       ReserveSpace(static_cast<int>(num_entries), &total_bits, &num_lines);
   assert(data);
@@ -1127,6 +1198,7 @@ Slice LegacyBloomBitsBuilder::Finish(std::unique_ptr<const char[]>* buf) {
   const char* const_data = data;
   buf->reset(const_data);
   hash_entries_.clear();
+  prev_alt_hash_ = -1;
 
   return Slice(data, total_bits / 8 + kMetadataLen);
 }
table/block_based/filter_policy_internal.h: 12 changes (10 additions & 2 deletions)
@@ -31,10 +31,18 @@ class FilterBitsBuilder {
   // added.
   virtual void AddKey(const Slice& key) = 0;
 
+  // Add two entries to the filter, typically a key and, as the alternate,
+  // its prefix. This differs from AddKey(key); AddKey(alt); in that there
+  // is extra state for de-duplicating successive `alt` entries, as well
+  // as successive `key` entries. And there is usually de-duplication between
+  // `key` and `alt` entries, such that it's typically OK for an `alt` without
+  // a corresponding `key` to use AddKey().
+  virtual void AddKeyAndAlt(const Slice& key, const Slice& alt) = 0;
+
   // Called by RocksDB before Finish to populate
   // TableProperties::num_filter_entries, so should represent the
-  // number of unique keys (and/or prefixes) added, but does not have
-  // to be exact. `return 0;` may be used to conspicuously indicate "unknown".
+  // number of unique keys (and/or prefixes) added. MUST return 0
+  // if and only if none have been added, but otherwise can be estimated.
   virtual size_t EstimateEntriesAdded() = 0;
 
   // Generate the filter using the keys that are added
table/block_based/full_filter_block.cc: 97 changes (11 additions & 86 deletions)
@@ -20,13 +20,8 @@ namespace ROCKSDB_NAMESPACE {
 FullFilterBlockBuilder::FullFilterBlockBuilder(
     const SliceTransform* _prefix_extractor, bool whole_key_filtering,
     FilterBitsBuilder* filter_bits_builder)
-    : need_last_prefix_(whole_key_filtering && _prefix_extractor != nullptr),
-      prefix_extractor_(_prefix_extractor),
-      whole_key_filtering_(whole_key_filtering),
-      last_whole_key_recorded_(false),
-      last_prefix_recorded_(false),
-      last_key_in_domain_(false),
-      any_added_(false) {
+    : prefix_extractor_(_prefix_extractor),
+      whole_key_filtering_(whole_key_filtering) {
   assert(filter_bits_builder != nullptr);
   filter_bits_builder_.reset(filter_bits_builder);
 }
@@ -36,94 +31,24 @@ size_t FullFilterBlockBuilder::EstimateEntriesAdded() {
 }
 
 void FullFilterBlockBuilder::Add(const Slice& key_without_ts) {
-  const bool add_prefix =
-      prefix_extractor_ && prefix_extractor_->InDomain(key_without_ts);
-
-  if (need_last_prefix_ && !last_prefix_recorded_ && last_key_in_domain_) {
-    // We can reach here when a new filter partition starts in partitioned
-    // filter. The last prefix in the previous partition should be added if
-    // necessary regardless of key_without_ts, to support prefix SeekForPrev.
-    AddKey(last_prefix_str_);
-    last_prefix_recorded_ = true;
-  }
-
-  if (whole_key_filtering_) {
-    if (!add_prefix) {
-      AddKey(key_without_ts);
+  if (prefix_extractor_ && prefix_extractor_->InDomain(key_without_ts)) {
+    Slice prefix = prefix_extractor_->Transform(key_without_ts);
+    if (whole_key_filtering_) {
+      filter_bits_builder_->AddKeyAndAlt(key_without_ts, prefix);
     } else {
-      // if both whole_key and prefix are added to bloom then we will have whole
-      // key_without_ts and prefix addition being interleaved and thus cannot
-      // rely on the bits builder to properly detect the duplicates by comparing
-      // with the last item.
-      Slice last_whole_key = Slice(last_whole_key_str_);
-      if (!last_whole_key_recorded_ ||
-          last_whole_key.compare(key_without_ts) != 0) {
-        AddKey(key_without_ts);
-        last_whole_key_recorded_ = true;
-        last_whole_key_str_.assign(key_without_ts.data(),
-                                   key_without_ts.size());
-      }
-    }
-  }
-  if (add_prefix) {
-    last_key_in_domain_ = true;
-    AddPrefix(key_without_ts);
-  } else {
-    last_key_in_domain_ = false;
-  }
-}
-
-// Add key to filter if needed
-inline void FullFilterBlockBuilder::AddKey(const Slice& key) {
-  filter_bits_builder_->AddKey(key);
-  any_added_ = true;
-}
-
-// Add prefix to filter if needed
-void FullFilterBlockBuilder::AddPrefix(const Slice& key) {
-  assert(prefix_extractor_ && prefix_extractor_->InDomain(key));
-  Slice prefix = prefix_extractor_->Transform(key);
-  if (need_last_prefix_) {
-    // WART/FIXME: Because last_prefix_str_ is needed above to make
-    // SeekForPrev work with partitioned + prefix filters, we are currently
-    // use this inefficient code in that case (in addition to prefix+whole
-    // key). Hopefully this can be optimized with some refactoring up the call
-    // chain to BlockBasedTableBuilder. Even in PartitionedFilterBlockBuilder,
-    // we don't currently have access to the previous key/prefix by the time we
-    // know we are starting a new partition.
-
-    // if both whole_key and prefix are added to bloom then we will have whole
-    // key and prefix addition being interleaved and thus cannot rely on the
-    // bits builder to properly detect the duplicates by comparing with the last
-    // item.
-    Slice last_prefix = Slice(last_prefix_str_);
-    if (!last_prefix_recorded_ || last_prefix.compare(prefix) != 0) {
-      AddKey(prefix);
-      last_prefix_recorded_ = true;
-      last_prefix_str_.assign(prefix.data(), prefix.size());
+      filter_bits_builder_->AddKey(prefix);
     }
-  } else {
-    AddKey(prefix);
+  } else if (whole_key_filtering_) {
+    filter_bits_builder_->AddKey(key_without_ts);
   }
 }
 
-void FullFilterBlockBuilder::Reset() {
-  last_whole_key_recorded_ = false;
-  last_prefix_recorded_ = false;
-}
-
 Status FullFilterBlockBuilder::Finish(
     const BlockHandle& /*last_partition_block_handle*/, Slice* filter,
     std::unique_ptr<const char[]>* filter_owner) {
-  Reset();
   Status s = Status::OK();
-  if (any_added_) {
-    any_added_ = false;
-    *filter = filter_bits_builder_->Finish(
-        filter_owner ? filter_owner : &filter_data_, &s);
-  } else {
-    *filter = Slice{};
-  }
+  *filter = filter_bits_builder_->Finish(
+      filter_owner ? filter_owner : &filter_data_, &s);
   return s;
 }
 
table/block_based/full_filter_block.h: 27 changes (8 additions & 19 deletions)
@@ -50,7 +50,9 @@ class FullFilterBlockBuilder : public FilterBlockBuilder {
   ~FullFilterBlockBuilder() {}
 
   void Add(const Slice& key_without_ts) override;
-  bool IsEmpty() const override { return !any_added_; }
+  bool IsEmpty() const override {
+    return filter_bits_builder_->EstimateEntriesAdded() == 0;
+  }
   size_t EstimateEntriesAdded() override;
   Status Finish(const BlockHandle& last_partition_block_handle, Slice* filter,
                 std::unique_ptr<const char[]>* filter_owner = nullptr) override;
@@ -63,30 +65,17 @@ class FullFilterBlockBuilder : public FilterBlockBuilder {
   }
 
  protected:
-  virtual void AddKey(const Slice& key);
+  const SliceTransform* prefix_extractor() const { return prefix_extractor_; }
+  bool whole_key_filtering() const { return whole_key_filtering_; }
+
   std::unique_ptr<FilterBitsBuilder> filter_bits_builder_;
-  virtual void Reset();
-  void AddPrefix(const Slice& key);
-  const SliceTransform* prefix_extractor() { return prefix_extractor_; }
-  const std::string& last_prefix_str() const { return last_prefix_str_; }
-  bool need_last_prefix_;
 
  private:
   // important: all of these might point to invalid addresses
   // at the time of destruction of this filter block. destructor
   // should NOT dereference them.
-  const SliceTransform* prefix_extractor_;
-  bool whole_key_filtering_;
-  bool last_whole_key_recorded_;
-  std::string last_whole_key_str_;
-  bool last_prefix_recorded_;
-  std::string last_prefix_str_;
-  // Whether prefix_extractor_->InDomain(last_whole_key_) is true.
-  // Used in partitioned filters so that the last prefix from the previous
-  // filter partition will be added to the current partition if
-  // last_key_in_domain_ is true, regardless of the current key.
-  bool last_key_in_domain_;
-  bool any_added_;
+  const SliceTransform* const prefix_extractor_;
+  const bool whole_key_filtering_;
   std::unique_ptr<const char[]> filter_data_;
 };
 
(Diffs for the remaining 4 changed files are not shown in this view.)
