feat: support remapping for IVF_FLAT, IVF_PQ and IVF_SQ#2708
Conversation
prepare for supporting remap for new vector index format, HNSW remap not supported because simply mapping the row ids could break the connectivity of graph Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2708 +/- ##
==========================================
+ Coverage 78.80% 79.00% +0.19%
==========================================
Files 246 246
Lines 86637 86900 +263
Branches 86637 86900 +263
==========================================
+ Hits 68278 68655 +377
+ Misses 15529 15378 -151
- Partials 2830 2867 +37
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
|
@BubbleCal I've marked this as draft, since I'm assuming it is not ready for review. (There are no unit tests.) Mark it as ready for review when it is ready. |
…ctor-index Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
| lance_io::ReadBatchParams::RangeFull, | ||
| 4096, | ||
| 16, | ||
| projection, |
There was a problem hiding this comment.
we don't need the part_id in batch, just don't read it to save resources
| } | ||
|
|
||
| fn remap(&self, _: &HashMap<u64, Option<u64>>) -> Result<Self> { | ||
| Ok(self.clone()) |
There was a problem hiding this comment.
nit: let's add a warning log here?
There was a problem hiding this comment.
oh w8, we should remap sub index here no?
There was a problem hiding this comment.
for v3 we need to remap the subindex & vector storage. flat index doesn't contain anything so it simply returns itself here
There was a problem hiding this comment.
flat map is still { origin_vector: row_id }? if row id changes during compaction, we need to remap them ?
There was a problem hiding this comment.
remap on an vector index (v3) is:
- remap the sub index
- remap the storage
for IVF_FLAT, the sub index isFLATand storage isFlatStorage. FLAT sub index doesn't contain any data so no need to do anything here. the remapping happens onFlatStorage
| let batch = concat_batches(self.schema(), batches.iter())?; | ||
| Self::try_from_batch(batch, self.distance_type()) |
There was a problem hiding this comment.
I guess remap is already slow so it probably doesn't matter but it seems odd we would need to concat here.
There was a problem hiding this comment.
yeah it's because try_from_batch is not trivial, e.g. for PQ storage, it would transpose the pq codes
| let element_type = get_vector_element_type(dataset, column)?; | ||
| match element_type { | ||
| DataType::Float16 | DataType::Float32 | DataType::Float64 => { | ||
| IvfIndexBuilder::<FlatIndex, FlatQuantizer>::new( | ||
| dataset.clone(), | ||
| column.to_owned(), | ||
| dataset.indices_dir().child(uuid), | ||
| params.metric_type, | ||
| Box::new(shuffler), | ||
| Some(ivf_params.clone()), | ||
| Some(()), | ||
| (), | ||
| )? | ||
| .build() | ||
| .await?; | ||
| } | ||
| DataType::UInt8 => { | ||
| IvfIndexBuilder::<FlatIndex, FlatBinQuantizer>::new( | ||
| dataset.clone(), | ||
| column.to_owned(), | ||
| dataset.indices_dir().child(uuid), | ||
| params.metric_type, | ||
| Box::new(shuffler), | ||
| Some(ivf_params.clone()), | ||
| Some(()), | ||
| (), | ||
| )? | ||
| .build() | ||
| .await?; | ||
| } |
There was a problem hiding this comment.
I noticed there are many lines are doing the same thing: get the vector data type / value type and check it.
so just made the function get_vector_element_type to do this
| // async fn append(&self, batches: Vec<RecordBatch>) -> Result<()> { | ||
| // IvfIndexBuilder::new( | ||
| // dataset, | ||
| // column, | ||
| // index_dir, | ||
| // distance_type, | ||
| // shuffler, | ||
| // ivf_params, | ||
| // sub_index_params, | ||
| // quantizer_params, | ||
| // ) | ||
| // } | ||
|
|
There was a problem hiding this comment.
yeah these lines are commented and not used, so just removed them
| async fn write_batches( | ||
| path: Path, | ||
| batches: impl Iterator<Item = RecordBatch>, | ||
| schema: Schema, | ||
| ) -> Result<usize> { | ||
| let object_store = ObjectStore::local(); | ||
| let writer = object_store.create(&path).await?; | ||
| let mut writer = FileWriter::try_new(writer, schema, Default::default())?; | ||
| for batch in batches { | ||
| writer.write_batch(&batch).await?; | ||
| } | ||
| Ok(writer.finish().await? as usize) | ||
| } |
There was a problem hiding this comment.
Doesn't have to be part of this PR but it might be nice to have this as a static method on FileWriter.
| ) -> Result<()> { | ||
| let index_dir = dataset.indices_dir().child(new_uuid); | ||
| let element_type = get_vector_element_type(dataset, &column)?; | ||
| match index.index_type() { |
There was a problem hiding this comment.
Would it be possible to add a remap method to the VectorIndex trait instead of using a match statement here?
There was a problem hiding this comment.
yeah just tried, it can work!
| } | ||
| } | ||
|
|
||
| async fn test_remap_impl<T: ArrowPrimitiveType>( |
There was a problem hiding this comment.
Does this only test the case where rows are deleted or does it also test the case where fragments are combined and row ids are changed?
| .open_vector_index(q.column.as_str(), &index.uuid.to_string()) | ||
| .await?; | ||
| let mut q = q.clone(); | ||
| q.metric_type = idx.metric_type(); |
There was a problem hiding this comment.
this fixes a bug that with unindexed data, the flat search may compute the distances in a different distance type
westonpace
left a comment
There was a problem hiding this comment.
The cargo bump triggered a substrait update which is causing the MSRV failure. I'll make a PR to bump our MSRV (probably the easiest fix and 1.80 has been out for six months). No strong opinion on whether you wait for that PR or just merge and break CI.
not support IVF_HNSW_* index yet