
feat: implement disk-based inverted index #2643

Merged: 25 commits merged into lancedb:main on Aug 1, 2024

Conversation

BubbleCal (Contributor) commented Jul 26, 2024

  • read posting list from reader of inverted file
  • cache posting list
  • switch to v2 format
  • flatten the inverted file to reduce the number of IO operations
  • use binary search to speed up WAND
  • reorganize files: split inverted.rs into builder.rs, index.rs and wand.rs

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
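The "binary search to speed up WAND" item can be illustrated with a minimal, dependency-free sketch (the `PostingIter` type and `seek` method are illustrative stand-ins, not the PR's actual API): a WAND-style iterator jumps to the first document whose row id is at least the pivot target via `partition_point`, instead of scanning the posting list linearly.

```rust
// Illustrative sketch only: a posting-list iterator that seeks by binary search.
struct PostingIter {
    row_ids: Vec<u64>, // row ids sorted ascending, as in a posting list
    pos: usize,        // current cursor position
}

impl PostingIter {
    // Advance to the first row id >= target and return it, if any.
    fn seek(&mut self, target: u64) -> Option<u64> {
        // partition_point binary-searches the remaining suffix in O(log n).
        let offset = self.row_ids[self.pos..].partition_point(|&id| id < target);
        self.pos += offset;
        self.row_ids.get(self.pos).copied()
    }
}

fn main() {
    let mut it = PostingIter { row_ids: vec![2, 5, 9, 14, 21], pos: 0 };
    assert_eq!(it.seek(6), Some(9));  // skips past 2 and 5 in one step
    assert_eq!(it.seek(9), Some(9));  // seeking to the current id is a no-op
    assert_eq!(it.seek(22), None);    // past the end of the list
    println!("ok");
}
```

Because WAND repeatedly seeks each term's iterator past a pivot document, replacing the linear scan with this kind of skip is where the speedup comes from.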
@github-actions github-actions bot added the enhancement New feature or request label Jul 26, 2024
BubbleCal (Contributor, Author):

These are all moved from inverted.rs.

BubbleCal (Contributor, Author):

Most of them are moved from inverted.rs.

let frequency_col =
Float32Array::from_iter_values(indices.iter().map(|&i| self.frequencies[i]));

let block_head_indices = block_head_indices(self.len());
BubbleCal (Contributor, Author):

Read the first element of each block and store them right before the posting list.
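A minimal sketch of that block-head idea, using hypothetical names (`BLOCK_SIZE`, `block_heads`, `locate_block` are not from the PR): storing the first row id of each fixed-size block ahead of the posting list lets a seek binary-search the heads and decode only the single block that can contain the target.

```rust
// Illustrative sketch only; the real block size and layout differ.
const BLOCK_SIZE: usize = 4;

// Collect the first element ("head") of each fixed-size block.
fn block_heads(row_ids: &[u64]) -> Vec<u64> {
    row_ids.chunks(BLOCK_SIZE).map(|block| block[0]).collect()
}

// Index of the only block that can contain `target`:
// the last block whose head is <= target.
fn locate_block(heads: &[u64], target: u64) -> usize {
    heads.partition_point(|&h| h <= target).saturating_sub(1)
}

fn main() {
    let ids: Vec<u64> = (0..10).map(|i| i * 3).collect(); // 0, 3, ..., 27
    let heads = block_heads(&ids);
    assert_eq!(heads, vec![0, 12, 24]);
    // 13 can only live in the block starting at 12.
    assert_eq!(locate_block(&heads, 13), 1);
    println!("ok");
}
```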

codecov-commenter commented Jul 26, 2024

Codecov Report

Attention: Patch coverage is 63.28438% with 275 lines in your changes missing coverage. Please review.

Project coverage is 79.68%. Comparing base (72ff096) to head (221a264).

Files Patch % Lines
rust/lance-index/src/scalar/inverted/index.rs 64.86% 114 Missing and 16 partials ⚠️
rust/lance-index/src/scalar/inverted/builder.rs 66.66% 63 Missing and 14 partials ⚠️
rust/lance-index/src/scalar/lance_format.rs 47.77% 38 Missing and 9 partials ⚠️
rust/lance-index/src/scalar/inverted/wand.rs 73.52% 8 Missing and 1 partial ⚠️
rust/lance-index/src/scalar/inverted.rs 0.00% 6 Missing ⚠️
rust/lance-index/src/scalar.rs 20.00% 4 Missing ⚠️
rust/lance-arrow/src/lib.rs 83.33% 1 Missing ⚠️
rust/lance/src/io/exec/fts.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2643      +/-   ##
==========================================
- Coverage   79.81%   79.68%   -0.14%     
==========================================
  Files         224      226       +2     
  Lines       65871    65981     +110     
  Branches    65871    65981     +110     
==========================================
+ Hits        52572    52574       +2     
- Misses      10231    10325      +94     
- Partials     3068     3082      +14     
Flag Coverage Δ
unittests 79.68% <63.28%> (-0.14%) ⬇️


BubbleCal mentioned this pull request on Jul 26, 2024 (7 tasks)
@@ -347,14 +346,3 @@ pub async fn train_bitmap_index(

do_train_bitmap_index(batches_source, dictionary, index_store).await
}

pub async fn train_inverted_index(
BubbleCal (Contributor, Author):

Moved to inverted.rs.

let batch = RecordBatch::try_new(
arrow_schema::Schema::new(vec![
arrow_schema::Field::new("doc", doc_col.data_type().to_owned(), false),
arrow_schema::Field::new(super::ROW_ID, arrow_schema::DataType::UInt64, false),
Reviewer (Contributor):

FYI there is a ROW_ID_FIELD constant we can use for this instead.

let mut tokenizer = TOKENIZER.clone();
let mut stream = new_data;
while let Some(batch) = stream.try_next().await? {
let doc_col = batch.column(0).as_string::<Offset>();
Reviewer (Contributor):

IMO it doesn't seem worth making this whole method generic over the offsets. Could we instead write a function that does:

fn iter_str_array(array: &dyn Array) -> Box<dyn Iterator<Item = Option<&str>> + '_>

And iterate over that boxed iterator?
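The reviewer's pattern can be sketched without arrow-rs as follows; the toy `StringArray`/`LargeStringArray` types here merely stand in for the real arrow arrays (which differ only in their offset width), and the point is the boxed iterator that erases the offset generic at the call site.

```rust
// Dependency-free sketch of the boxed-iterator pattern; the real function
// would downcast a &dyn arrow::array::Array instead of &dyn Any.
use std::any::Any;

struct StringArray(Vec<Option<String>>);      // stand-in for i32-offset strings
struct LargeStringArray(Vec<Option<String>>); // stand-in for i64-offset strings

fn iter_str_array(array: &dyn Any) -> Box<dyn Iterator<Item = Option<&str>> + '_> {
    // Downcast once, then hand back a uniform iterator: callers never see
    // the offset type, so the calling method need not be generic.
    if let Some(a) = array.downcast_ref::<StringArray>() {
        Box::new(a.0.iter().map(|v| v.as_deref()))
    } else if let Some(a) = array.downcast_ref::<LargeStringArray>() {
        Box::new(a.0.iter().map(|v| v.as_deref()))
    } else {
        panic!("not a string array")
    }
}

fn main() {
    let arr = LargeStringArray(vec![Some("doc".to_string()), None]);
    let collected: Vec<Option<&str>> = iter_str_array(&arr).collect();
    assert_eq!(collected, vec![Some("doc"), None]);
    println!("ok");
}
```

The cost is one dynamic dispatch per element, which is usually negligible next to tokenization.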

Comment on lines 206 to 207
let row_ids = row_ids.as_primitive::<UInt64Type>().values().to_vec();
let frequencies = frequencies.as_primitive::<Float32Type>().values().to_vec();
Reviewer (Contributor):

It seems like you are doing O(n_tokens) memory copies, which seems very inefficient. Have you considered making PostingList store the arrays directly, or a ScalarBuffer?

BubbleCal (Contributor, Author):

Here I don't store them as arrays because I saw that's much slower than Vec; I can try ScalarBuffer and switch to it if there's no obvious perf impact.
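The trade-off under discussion can be illustrated with a dependency-free sketch, using `Arc<[u64]>` as a stand-in for arrow's ScalarBuffer: `to_vec()` copies every value out of the decoded batch, while cloning a shared buffer only bumps a reference count.

```rust
// Illustrative only: Arc<[u64]> models a refcounted, immutable value buffer.
use std::sync::Arc;

fn main() {
    let source: Arc<[u64]> = (0..1_000u64).collect();

    // O(n) copy per posting list, as `values().to_vec()` does.
    let copied: Vec<u64> = source.to_vec();

    // O(1) zero-copy share, as cloning a ScalarBuffer would be.
    let shared: Arc<[u64]> = Arc::clone(&source);

    assert_eq!(copied.len(), shared.len());
    assert_eq!(Arc::strong_count(&source), 2); // source + shared
    println!("ok");
}
```

Whether the zero-copy form is actually faster end-to-end depends on the access pattern, hence the author's plan to benchmark before switching.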

westonpace (Contributor) left a review:

A few suggestions, no big concerns.

@@ -64,14 +68,21 @@ pub trait IndexStore: std::fmt::Debug + Send + Sync + DeepSizeOf {
/// Create a new file and return a writer to store data in the file
async fn new_index_file(&self, name: &str, schema: Arc<Schema>)
-> Result<Box<dyn IndexWriter>>;
async fn new_index_file_v2(
Reviewer (Contributor):

Let's make a GH issue to move all scalar indices to V2 so we can get rid of these methods eventually.

BubbleCal (Contributor, Author):

Here it is; I will work on that once this gets merged.

Reviewer (Contributor):

No rush to work on it. It will probably be a bit tricky because we will need to make sure old indices continue to function.


// WAND parameters
lazy_static! {
static ref FACTOR: f32 = std::env::var("WAND_FACTOR")
Reviewer (Contributor):

Why an env variable? Is it because we don't have a good way of specifying scalar config or because we doubt users will ever need to change this?

If this is changed, does it mean a retrain is needed?

BubbleCal (Contributor, Author):

Going to move it into a query parameter; changing this doesn't require retraining because it only impacts the recall/performance of the query.
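One possible shape for the query-parameter approach described above; every name here (`FtsQueryParams`, `wand_factor`) is hypothetical and not the PR's actual API. The factor travels with each query and defaults to exact search, so no process-wide environment variable is consulted.

```rust
// Hypothetical sketch: a per-query knob with a safe default, replacing
// the WAND_FACTOR environment variable.
#[derive(Debug, Clone)]
struct FtsQueryParams {
    // 1.0 = exact top-k scoring; values below 1.0 let WAND prune more
    // aggressively, trading recall for speed. No index retrain needed.
    wand_factor: f32,
}

impl Default for FtsQueryParams {
    fn default() -> Self {
        Self { wand_factor: 1.0 }
    }
}

fn main() {
    let exact = FtsQueryParams::default();
    let tuned = FtsQueryParams { wand_factor: 0.8 };
    assert_eq!(exact.wand_factor, 1.0);
    assert!(tuned.wand_factor < exact.wand_factor);
    println!("ok");
}
```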


#[async_trait]
impl IndexReader for v2::reader::FileReader {
async fn read_record_batch(&self, offset: u32) -> Result<RecordBatch> {
Reviewer (Contributor):

Might be better to just panic with unimplemented!. Reading a batch of size u32::MAX into memory isn't going to end well.

Comment on lines 265 to 297
let path = self.index_dir.child(name);

let other_store = dest_store.as_any().downcast_ref::<Self>();
if let Some(dest_lance_store) = other_store {
// If both this store and the destination are lance stores we can use object_store's copy
// This does blindly assume that both stores are using the same underlying object_store
// but there is no easy way to verify this and it happens to always be true at the moment
let dest_path = dest_lance_store.index_dir.child(name);
self.object_store.copy(&path, &dest_path).await
} else {
let reader = self.open_index_file_v2(name).await?;
let num_batches = reader.num_batches().await;
if num_batches == 0 {
return Err(Error::Internal {
message:
"Cannot copy an empty index file because the schema cannot be determined"
.into(),
location: location!(),
});
}
let first_batch = reader.read_record_batch(0).await?;
let schema = first_batch.schema();
let mut writer = dest_store.new_index_file_v2(name, schema).await?;
writer.write_record_batch(first_batch).await?;
for batch_index in 1..num_batches {
writer
.write_record_batch(reader.read_record_batch(batch_index).await?)
.await?;
}
writer.finish().await?;
Ok(())
}
}
Reviewer (Contributor):

Can we make a free function and avoid the repetition? This method is identical to the v1 counterpart right?

BubbleCal (Contributor, Author):

It's not the same after making read_record_batch() unimplemented, and a free function would require exposing index_dir and object_store.

Comment on lines 66 to 87
let token_set_batch = self.tokens.to_batch()?;
let mut token_set_writer = dest_store
.new_index_file_v2(TOKENS_FILE, token_set_batch.schema())
.await?;
token_set_writer.write_record_batch(token_set_batch).await?;
token_set_writer.finish().await?;

let invert_list_batch = self.invert_list.to_batch()?;
let mut invert_list_writer = dest_store
.new_index_file_v2(INVERT_LIST_FILE, invert_list_batch.schema())
.await?;
invert_list_writer
.write_record_batch(invert_list_batch)
.await?;
invert_list_writer.finish().await?;

let docs_batch = self.docs.to_batch()?;
let mut docs_writer = dest_store
.new_index_file_v2(DOCS_FILE, docs_batch.schema())
.await?;
docs_writer.write_record_batch(docs_batch).await?;
docs_writer.finish().await?;
Reviewer (Contributor):

Can we put this in a helper function? It looks very similar to the logic in remap (and maybe the train function?)
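The helper the reviewer asks for might look like this sketch, with toy stand-ins for the store/writer traits (none of these names are from the PR; a "batch" here is just a string): one free function that writes a single batch to a fresh file and finishes it, replacing the three copy-pasted blocks above.

```rust
// Illustrative only: VecWriter stands in for a real IndexWriter that
// writes RecordBatches to an index file.
trait IndexWriter {
    fn write(&mut self, batch: &str);
    fn finish(self: Box<Self>) -> Vec<String>;
}

struct VecWriter(Vec<String>);

impl IndexWriter for VecWriter {
    fn write(&mut self, batch: &str) {
        self.0.push(batch.to_string());
    }
    fn finish(self: Box<Self>) -> Vec<String> {
        self.0
    }
}

// The shared helper: the tokens/invert-list/docs files above all follow
// this same "one batch, then finish" pattern.
fn write_single_batch(mut writer: Box<dyn IndexWriter>, batch: &str) -> Vec<String> {
    writer.write(batch);
    writer.finish()
}

fn main() {
    let written = write_single_batch(Box::new(VecWriter(Vec::new())), "tokens");
    assert_eq!(written, vec!["tokens".to_string()]);
    println!("ok");
}
```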

wjones127 (Contributor) left a review:

Thanks. Nice work here.

BubbleCal merged commit 9e7d97f into lancedb:main on Aug 1, 2024. 21 of 22 checks passed.
eddyxu pushed a commit that referenced this pull request Aug 1, 2024
Labels: enhancement (New feature or request)
4 participants