perf: new layout for positions and new algo for phrase query #6203
Conversation
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
PR Review: varint for doc/freq

This PR introduces varint-delta encoding for posting list tails and a new shared position stream layout (Lucene-style packed deltas), replacing the fixed-size encoding. Good improvements for space efficiency, with solid backward-compat handling via codec metadata detection.

Issues:
- P1: Compat test no longer covers the old FTS index format.
- P1: Duplicated metadata construction in
  fn build_index_metadata(&self, partitions: &[u64]) -> Result<HashMap<String, String>> { ... }
- P1: Duplicate varint encoder functions in
- Minor:

Overall, the encoding/decoding logic looks correct, and the backward-compat codec detection is handled well.
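For readers unfamiliar with the technique the review refers to, varint-delta encoding of a sorted posting-list tail can be sketched as below. This is a minimal LEB128-style illustration under assumptions; the function names (write_varint, encode_doc_id_tail) are hypothetical and not the PR's actual API.

```rust
/// Append `value` to `buf` as a little-endian base-128 varint:
/// 7 payload bits per byte, high bit set on every byte but the last.
fn write_varint(buf: &mut Vec<u8>, mut value: u32) {
    while value >= 0x80 {
        buf.push((value as u8 & 0x7F) | 0x80);
        value >>= 7;
    }
    buf.push(value as u8);
}

/// Read one varint starting at `pos`; returns (value, next position).
fn read_varint(buf: &[u8], mut pos: usize) -> (u32, usize) {
    let mut value = 0u32;
    let mut shift = 0;
    loop {
        let byte = buf[pos];
        pos += 1;
        value |= ((byte & 0x7F) as u32) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    (value, pos)
}

/// Delta-encode a sorted doc-id tail: store gaps instead of absolute
/// ids, so most values fit in a single varint byte.
fn encode_doc_id_tail(doc_ids: &[u32]) -> Vec<u8> {
    let mut buf = Vec::new();
    let mut prev = 0;
    for &id in doc_ids {
        write_varint(&mut buf, id - prev);
        prev = id;
    }
    buf
}
```

The space win comes from the delta step: for dense posting lists the gaps are small, so most tail entries shrink from a fixed 4 bytes to 1 byte.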
let compressed = list.value(self.idx);
let positions = decompress_positions(compressed.as_binary());
Box::new(positions.into_iter()) as Box<dyn Iterator<Item = u32>>
I'm not sure if I understand this correctly. It looks like we have cached the entire POSITION_COL in LegacyPerDoc:
PositionsLayout::LegacyPerDoc => {
let batch = self
.reader
.read_range(self.posting_list_range(token_id), Some(&[POSITION_COL]))
.await
.map_err(|e| match e {
Error::Schema { .. } => Error::invalid_input("position is not found but required for phrase queries, try recreating the index with position".to_owned()),
e => e,
})?;
CompressedPositionStorage::LegacyPerDoc(batch[POSITION_COL].as_list::<i32>().clone())

Do we still need to do this?
It caches only the rows that were hit, not the entire column.
These lines read the cached rows and decompress them.
Xuanwo left a comment:
I'm good with this PR once all CI passes.
Hold it because of a compatibility issue.
Looks like the #[serde(rename = "memory_limit", skip_serializing, ...)] memory limit is still not being respected and falls back to 2 GiB.
Skipping serialization of this field is expected, because we don't want to add it to the index params; it should be a one-time param used only for building. I just tested this param and it works.
Just leaving for release notes: env var
This stores all positions of a single posting list in a flat array, and applies compression to the positions:
- for positions of a single document, store delta positions
- divide the positions of a single posting list into blocks (128 by default) and apply bitpacking to each block
- store the remaining positions as varints

This also stores the remaining doc ids / frequencies as varints.

These are breaking changes.

On a 1M random dataset:
- build is 9.35x faster (34.269s -> 3.664s)
- index is 8.87x smaller (955,002,412 bytes -> 107,603,319 bytes)
- phrase query avg latency is 3.41x faster (21.631ms -> 6.334ms) and p95 latency is 3.17x faster (63.735ms -> 20.100ms)

The improvement can be more significant on larger datasets.

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
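The three compression steps in the commit message (per-document deltas, bit-packed blocks of 128, varint remainder) can be sketched as follows. This is a minimal illustration under assumptions: the names (compress_positions, pack_block) and the exact byte layout are hypothetical, not Lance's actual on-disk format.

```rust
const BLOCK_LEN: usize = 128;

/// Minimal number of bits needed to represent `v` (at least 1).
fn bit_width(v: u32) -> u32 {
    if v == 0 { 1 } else { 32 - v.leading_zeros() }
}

/// Little-endian base-128 varint, used for the trailing deltas.
fn write_varint(buf: &mut Vec<u8>, mut v: u32) {
    while v >= 0x80 {
        buf.push((v as u8 & 0x7F) | 0x80);
        v >>= 7;
    }
    buf.push(v as u8);
}

/// Bit-pack one full block at the block's minimal width; the width
/// is written first so a reader knows how to unpack the block.
fn pack_block(deltas: &[u32], out: &mut Vec<u8>) {
    let width = deltas.iter().map(|&d| bit_width(d)).max().unwrap() as u64;
    out.push(width as u8);
    let (mut acc, mut nbits): (u64, u64) = (0, 0);
    for &d in deltas {
        acc |= (d as u64) << nbits;
        nbits += width;
        while nbits >= 8 {
            out.push(acc as u8); // flush the low byte
            acc >>= 8;
            nbits -= 8;
        }
    }
    if nbits > 0 {
        out.push(acc as u8); // flush any leftover bits
    }
}

/// Compress the flat position array of one posting list:
/// delta-encode within each document, bit-pack full 128-delta
/// blocks, and store the remaining deltas as varints.
fn compress_positions(per_doc_positions: &[Vec<u32>]) -> Vec<u8> {
    // 1. delta-encode within each document, then flatten
    let mut deltas = Vec::new();
    for positions in per_doc_positions {
        let mut prev = 0;
        for &p in positions {
            deltas.push(p - prev);
            prev = p;
        }
    }
    // 2. bit-pack full blocks, varint the remainder
    let mut out = Vec::new();
    let mut chunks = deltas.chunks_exact(BLOCK_LEN);
    for block in &mut chunks {
        pack_block(block, &mut out);
    }
    for &d in chunks.remainder() {
        write_varint(&mut out, d);
    }
    out
}
```

The block size of 128 matches the default the commit message mentions; picking one bit width per block keeps unpacking branch-free while letting dense blocks (small deltas) shrink to one or two bits per position.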