-
Notifications
You must be signed in to change notification settings - Fork 207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement disk-based inverted index #2643
Conversation
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are all moved from inverted.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of them are moved from inverted.rs
let frequency_col = | ||
Float32Array::from_iter_values(indices.iter().map(|&i| self.frequencies[i])); | ||
|
||
let block_head_indices = block_head_indices(self.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
read first element of each block, and store them right before the posting list
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2643 +/- ##
==========================================
- Coverage 79.81% 79.68% -0.14%
==========================================
Files 224 226 +2
Lines 65871 65981 +110
Branches 65871 65981 +110
==========================================
+ Hits 52572 52574 +2
- Misses 10231 10325 +94
- Partials 3068 3082 +14
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@@ -347,14 +346,3 @@ pub async fn train_bitmap_index( | |||
|
|||
do_train_bitmap_index(batches_source, dictionary, index_store).await | |||
} | |||
|
|||
pub async fn train_inverted_index( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move to inverted.rs
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
let batch = RecordBatch::try_new( | ||
arrow_schema::Schema::new(vec![ | ||
arrow_schema::Field::new("doc", doc_col.data_type().to_owned(), false), | ||
arrow_schema::Field::new(super::ROW_ID, arrow_schema::DataType::UInt64, false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI there is a ROW_ID_FIELD
constant we can use for this instead.
let mut tokenizer = TOKENIZER.clone(); | ||
let mut stream = new_data; | ||
while let Some(batch) = stream.try_next().await? { | ||
let doc_col = batch.column(0).as_string::<Offset>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it doesn't seem worth making this whole method generic over the offsets. Could we instead write a function that does:
fn iter_str_array(array: &dyn Array) -> Box<dyn Iterator<Item = Option<&str>> + '_>
And iterate over that boxed iterator?
let row_ids = row_ids.as_primitive::<UInt64Type>().values().to_vec(); | ||
let frequencies = frequencies.as_primitive::<Float32Type>().values().to_vec(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like you are doing O(n_tokens)
mem copies, which seems very inefficient. Have you considered making the PostingList
storing the arrays directly? Or a ScalarBuffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I don't store them as array because i saw it's much slower than Vec
, can try scalar buffer and switch to it if there's no obvious perf impact
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few suggestions, no big concerns.
@@ -64,14 +68,21 @@ pub trait IndexStore: std::fmt::Debug + Send + Sync + DeepSizeOf { | |||
/// Create a new file and return a writer to store data in the file | |||
async fn new_index_file(&self, name: &str, schema: Arc<Schema>) | |||
-> Result<Box<dyn IndexWriter>>; | |||
async fn new_index_file_v2( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make a GH issue to move all scalar indices to V2 so we can get rid of these methods eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here it is, will work on that once this gets merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No rush to work on it. It will probably be a bit tricky because we will need to make sure old indices continue to function.
|
||
// WAND parameters | ||
lazy_static! { | ||
static ref FACTOR: f32 = std::env::var("WAND_FACTOR") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why an env variable? Is it because we don't have a good way of specifying scalar config or because we doubt users will ever need to change this?
If this is changed does it mean retrain is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to move it into query parameter, changing this doesn't require retrain cause it only impacts the recall/performance for query
|
||
#[async_trait] | ||
impl IndexReader for v2::reader::FileReader { | ||
async fn read_record_batch(&self, offset: u32) -> Result<RecordBatch> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be better to just panic with unimplemented!
. Reading a batch of size u32::MAX
into memory isn't going to end well.
let path = self.index_dir.child(name); | ||
|
||
let other_store = dest_store.as_any().downcast_ref::<Self>(); | ||
if let Some(dest_lance_store) = other_store { | ||
// If both this store and the destination are lance stores we can use object_store's copy | ||
// This does blindly assume that both stores are using the same underlying object_store | ||
// but there is no easy way to verify this and it happens to always be true at the moment | ||
let dest_path = dest_lance_store.index_dir.child(name); | ||
self.object_store.copy(&path, &dest_path).await | ||
} else { | ||
let reader = self.open_index_file_v2(name).await?; | ||
let num_batches = reader.num_batches().await; | ||
if num_batches == 0 { | ||
return Err(Error::Internal { | ||
message: | ||
"Cannot copy an empty index file because the schema cannot be determined" | ||
.into(), | ||
location: location!(), | ||
}); | ||
} | ||
let first_batch = reader.read_record_batch(0).await?; | ||
let schema = first_batch.schema(); | ||
let mut writer = dest_store.new_index_file_v2(name, schema).await?; | ||
writer.write_record_batch(first_batch).await?; | ||
for batch_index in 1..num_batches { | ||
writer | ||
.write_record_batch(reader.read_record_batch(batch_index).await?) | ||
.await?; | ||
} | ||
writer.finish().await?; | ||
Ok(()) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make a free function and avoid the repetition? This method is identical to the v1 counterpart right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's not the same after make that read_record_batch()
unimplemented, and this requires to expose the index_dir
and object_store
.
let token_set_batch = self.tokens.to_batch()?; | ||
let mut token_set_writer = dest_store | ||
.new_index_file_v2(TOKENS_FILE, token_set_batch.schema()) | ||
.await?; | ||
token_set_writer.write_record_batch(token_set_batch).await?; | ||
token_set_writer.finish().await?; | ||
|
||
let invert_list_batch = self.invert_list.to_batch()?; | ||
let mut invert_list_writer = dest_store | ||
.new_index_file_v2(INVERT_LIST_FILE, invert_list_batch.schema()) | ||
.await?; | ||
invert_list_writer | ||
.write_record_batch(invert_list_batch) | ||
.await?; | ||
invert_list_writer.finish().await?; | ||
|
||
let docs_batch = self.docs.to_batch()?; | ||
let mut docs_writer = dest_store | ||
.new_index_file_v2(DOCS_FILE, docs_batch.schema()) | ||
.await?; | ||
docs_writer.write_record_batch(docs_batch).await?; | ||
docs_writer.finish().await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put this in a helper function? It looks very similar to the logic in remap
(and maybe the train function?)
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Nice work here.
- read posting list from reader of inverted file - cache posting list - switch to v2 format - flatten the inverted file to reduce the number of IO operations - use binary search to speed up wand - reorg files, divide the `inverted.rs` into `builder.rs`, `index.rs` and `wand.rs` --------- Signed-off-by: BubbleCal <bubble-cal@outlook.com>
inverted.rs
intobuilder.rs
,index.rs
andwand.rs