feat: support remapping for IVF_FLAT, IVF_PQ and IVF_SQ by BubbleCal · Pull Request #2708 · lance-format/lance

BubbleCal · 2024-08-08T11:25:55Z

IVF_HNSW_* indices are not supported yet.

This prepares for supporting remap in the new vector index format. HNSW remap is not supported because simply mapping the row ids could break the connectivity of the graph.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

codecov-commenter · 2024-08-08T12:27:53Z

Codecov Report

Attention: Patch coverage is 77.77778% with 76 lines in your changes missing coverage. Please review.

Project coverage is 79.00%. Comparing base (2b29487) to head (c18b4dc).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/index/vector/builder.rs | 76.29% | 1 Missing and 31 partials ⚠️ |
| rust/lance-index/src/vector/storage.rs | 64.70% | 7 Missing and 5 partials ⚠️ |
| rust/lance/src/index/vector/ivf/v2.rs | 89.15% | 8 Missing and 1 partial ⚠️ |
| rust/lance/src/index/vector/utils.rs | 64.28% | 4 Missing and 1 partial ⚠️ |
| rust/lance-file/src/v2/writer.rs | 69.23% | 0 Missing and 4 partials ⚠️ |
| rust/lance-index/src/vector.rs | 0.00% | 3 Missing ⚠️ |
| rust/lance-index/src/vector/hnsw/builder.rs | 0.00% | 2 Missing ⚠️ |
| rust/lance-index/src/vector/v3/shuffler.rs | 90.90% | 0 Missing and 2 partials ⚠️ |
| rust/lance/src/dataset/scanner.rs | 77.77% | 0 Missing and 2 partials ⚠️ |
| rust/lance/src/index/vector/ivf.rs | 85.71% | 1 Missing and 1 partial ⚠️ |

... and 3 more

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2708      +/-   ##
==========================================
+ Coverage   78.80%   79.00%   +0.19%     
==========================================
  Files         246      246              
  Lines       86637    86900     +263     
  Branches    86637    86900     +263     
==========================================
+ Hits        68278    68655     +377     
+ Misses      15529    15378     -151     
- Partials     2830     2867      +37

| Flag | Coverage Δ |
|---|---|
| unittests | `79.00% <77.77%> (+0.19%)` ⬆️ |

Flags with carried forward coverage won't be shown.

wjones127 · 2024-08-13T18:17:06Z

@BubbleCal I've marked this as draft, since I'm assuming it is not ready for review. (There are no unit tests.) Mark it as ready for review when it is ready.

BubbleCal · 2024-12-10T10:55:11Z

                lance_io::ReadBatchParams::RangeFull,
                4096,
                16,
+                projection,


We don't need the part_id in the batch, so we skip reading it to save resources.

chebbyChefNEQ · 2024-12-12T20:02:33Z

    }

+    fn remap(&self, _: &HashMap<u64, Option<u64>>) -> Result<Self> {
+        Ok(self.clone())


nit: let's add a warning log here?

Oh wait, shouldn't we remap the sub index here?

For v3 we need to remap the sub index and the vector storage. The flat index doesn't contain anything, so it simply returns itself here.

Isn't the flat map still { origin_vector: row_id }? If row ids change during compaction, don't we need to remap them?

Remapping a vector index (v3) consists of:

1. remap the sub index
2. remap the storage

For IVF_FLAT, the sub index is FLAT and the storage is FlatStorage. The FLAT sub index doesn't contain any data, so there is nothing to do here; the remapping happens in FlatStorage.
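
The two-step remap described in this thread can be sketched roughly as follows. This is a minimal, self-contained illustration and not Lance's actual code: `FlatSubIndex`, `FlatStorageSketch`, `RowIdMapping`, and `remap_partition` are hypothetical names, and only the mapping type `HashMap<u64, Option<u64>>` is taken from the diff. It assumes that `Some(new)` means the row moved, `None` means the row was deleted, and an id missing from the map is unchanged.

```rust
use std::collections::HashMap;

/// Assumed semantics: Some(new) = row moved, None = row deleted,
/// id absent from the map = row unchanged.
type RowIdMapping = HashMap<u64, Option<u64>>;

/// Hypothetical stand-in for a FLAT sub index: it holds no data.
#[derive(Clone, Debug, PartialEq)]
struct FlatSubIndex;

impl FlatSubIndex {
    /// Nothing to remap, so return a copy of itself.
    fn remap(&self, _mapping: &RowIdMapping) -> Self {
        self.clone()
    }
}

/// Hypothetical stand-in for flat storage: row ids stored next to vectors.
#[derive(Clone, Debug, PartialEq)]
struct FlatStorageSketch {
    row_ids: Vec<u64>,
    vectors: Vec<Vec<f32>>,
}

impl FlatStorageSketch {
    /// Rewrite moved row ids, drop deleted rows, keep everything else.
    fn remap(&self, mapping: &RowIdMapping) -> Self {
        let mut row_ids = Vec::new();
        let mut vectors = Vec::new();
        for (id, vector) in self.row_ids.iter().zip(&self.vectors) {
            match mapping.get(id) {
                Some(Some(new_id)) => {
                    row_ids.push(*new_id);
                    vectors.push(vector.clone());
                }
                Some(None) => {} // deleted row: drop it
                None => {
                    row_ids.push(*id);
                    vectors.push(vector.clone());
                }
            }
        }
        Self { row_ids, vectors }
    }
}

/// Step 1: remap the sub index. Step 2: remap the storage.
fn remap_partition(
    sub_index: &FlatSubIndex,
    storage: &FlatStorageSketch,
    mapping: &RowIdMapping,
) -> (FlatSubIndex, FlatStorageSketch) {
    (sub_index.remap(mapping), storage.remap(mapping))
}
```

With a mapping that moves row 1 to 10 and deletes row 2, a storage holding rows 1, 2, 3 would come back holding rows 10 and 3, while the sub index is returned unchanged.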

westonpace · 2024-12-18T14:43:56Z

+        let batch = concat_batches(self.schema(), batches.iter())?;
+        Self::try_from_batch(batch, self.distance_type())


I guess remap is already slow so it probably doesn't matter but it seems odd we would need to concat here.

Yeah, it's because try_from_batch is not trivial; e.g., for PQ storage, it transposes the PQ codes.
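
To illustrate why building storage from a batch is more than a cheap wrap, here is a rough sketch of a PQ code transposition. It is an assumption-laden example, not Lance's implementation: `transpose_codes` is a hypothetical helper, and the row-major input and column-major output layouts are chosen only for illustration. A transposition like this works over all rows at once, which may be why a single concatenated batch is convenient.

```rust
/// Transpose PQ codes from row-major (one run of `num_sub_vectors` codes per
/// vector) to column-major (one run of `num_vectors` codes per sub-vector).
/// Hypothetical helper; the layouts are assumptions for illustration.
fn transpose_codes(codes: &[u8], num_vectors: usize, num_sub_vectors: usize) -> Vec<u8> {
    assert_eq!(codes.len(), num_vectors * num_sub_vectors);
    let mut transposed = vec![0u8; codes.len()];
    for v in 0..num_vectors {
        for s in 0..num_sub_vectors {
            // Element (v, s) in the input becomes element (s, v) in the output.
            transposed[s * num_vectors + v] = codes[v * num_sub_vectors + s];
        }
    }
    transposed
}
```

For two vectors with three sub-vector codes each, the input `[1, 2, 3, 4, 5, 6]` is rearranged so that the codes for each sub-vector position sit next to each other.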

westonpace · 2024-12-18T14:45:41Z

+        let element_type = get_vector_element_type(dataset, column)?;
+        match element_type {
+            DataType::Float16 | DataType::Float32 | DataType::Float64 => {
+                IvfIndexBuilder::<FlatIndex, FlatQuantizer>::new(
+                    dataset.clone(),
+                    column.to_owned(),
+                    dataset.indices_dir().child(uuid),
+                    params.metric_type,
+                    Box::new(shuffler),
+                    Some(ivf_params.clone()),
+                    Some(()),
+                    (),
+                )?
+                .build()
+                .await?;
+            }
+            DataType::UInt8 => {
+                IvfIndexBuilder::<FlatIndex, FlatBinQuantizer>::new(
+                    dataset.clone(),
+                    column.to_owned(),
+                    dataset.indices_dir().child(uuid),
+                    params.metric_type,
+                    Box::new(shuffler),
+                    Some(ivf_params.clone()),
+                    Some(()),
+                    (),
+                )?
+                .build()
+                .await?;
+            }


Why did this change?

I noticed that many lines were doing the same thing: getting the vector data type / value type and checking it.
So I added the function get_vector_element_type to do this in one place.
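
The dispatch in the diff above can be pictured with a small stand-alone sketch. It only mirrors the shape of the match: `ElementType` is a hypothetical stand-in for Arrow's `DataType`, and `QuantizerKind` and `quantizer_for` are invented names. The pairing of float types with the flat quantizer and `UInt8` with the binary flat quantizer follows the diff; the error for other types is an assumption.

```rust
/// Hypothetical stand-in for the Arrow element type of a vector column.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ElementType {
    Float16,
    Float32,
    Float64,
    UInt8,
    Int32, // present only to exercise the unsupported path
}

/// Hypothetical label for the quantizer chosen in the diff's match arms.
#[derive(Debug, PartialEq)]
enum QuantizerKind {
    Flat,    // corresponds to FlatQuantizer in the diff
    FlatBin, // corresponds to FlatBinQuantizer in the diff
}

/// Resolve the quantizer from the element type in one place,
/// instead of repeating the type check at every call site.
fn quantizer_for(element_type: ElementType) -> Result<QuantizerKind, String> {
    match element_type {
        ElementType::Float16 | ElementType::Float32 | ElementType::Float64 => {
            Ok(QuantizerKind::Flat)
        }
        ElementType::UInt8 => Ok(QuantizerKind::FlatBin),
        other => Err(format!("unsupported vector element type: {:?}", other)),
    }
}
```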

westonpace · 2024-12-18T14:46:41Z

-    // async fn append(&self, batches: Vec<RecordBatch>) -> Result<()> {
-    //     IvfIndexBuilder::new(
-    //         dataset,
-    //         column,
-    //         index_dir,
-    //         distance_type,
-    //         shuffler,
-    //         ivf_params,
-    //         sub_index_params,
-    //         quantizer_params,
-    //     )
-    // }
-


Yeah, these lines were commented out and unused, so I just removed them.

westonpace · 2024-12-18T14:52:08Z

+    async fn write_batches(
+        path: Path,
+        batches: impl Iterator<Item = RecordBatch>,
+        schema: Schema,
+    ) -> Result<usize> {
+        let object_store = ObjectStore::local();
+        let writer = object_store.create(&path).await?;
+        let mut writer = FileWriter::try_new(writer, schema, Default::default())?;
+        for batch in batches {
+            writer.write_batch(&batch).await?;
+        }
+        Ok(writer.finish().await? as usize)
+    }


Doesn't have to be part of this PR but it might be nice to have this as a static method on FileWriter.

westonpace · 2024-12-18T14:53:38Z

+) -> Result<()> {
+    let index_dir = dataset.indices_dir().child(new_uuid);
+    let element_type = get_vector_element_type(dataset, &column)?;
+    match index.index_type() {


Would it be possible to add a remap method to the VectorIndex trait instead of using a match statement here?

Yeah, I just tried it, and it works!
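
The suggestion of a trait method instead of a match on the index type might look roughly like this. It is a sketch under stated assumptions, not the actual `VectorIndex` trait: `RemappableIndex`, `IvfFlatSketch`, `IvfHnswSketch`, and `remap_any` are hypothetical names, and the return type is simplified to a list of row ids. The unsupported HNSW case reflects the PR description.

```rust
use std::collections::HashMap;

/// Assumed semantics: Some(new) = row moved, None = row deleted,
/// id absent from the map = row unchanged.
type RowIdMapping = HashMap<u64, Option<u64>>;

/// Hypothetical trait: each index type decides how (or whether) it remaps.
trait RemappableIndex {
    fn remap(&self, mapping: &RowIdMapping) -> Result<Vec<u64>, String>;
}

/// Hypothetical IVF_FLAT-like index that only tracks row ids.
struct IvfFlatSketch {
    row_ids: Vec<u64>,
}

/// Hypothetical IVF_HNSW-like index; remapping is not supported.
struct IvfHnswSketch;

impl RemappableIndex for IvfFlatSketch {
    fn remap(&self, mapping: &RowIdMapping) -> Result<Vec<u64>, String> {
        Ok(self
            .row_ids
            .iter()
            .filter_map(|id| match mapping.get(id) {
                Some(new_id) => *new_id, // moved (Some) or deleted (None)
                None => Some(*id),       // unchanged
            })
            .collect())
    }
}

impl RemappableIndex for IvfHnswSketch {
    fn remap(&self, _mapping: &RowIdMapping) -> Result<Vec<u64>, String> {
        Err("remapping IVF_HNSW_* indices is not supported yet".to_string())
    }
}

/// The caller no longer needs a match on the index type.
fn remap_any(index: &dyn RemappableIndex, mapping: &RowIdMapping) -> Result<Vec<u64>, String> {
    index.remap(mapping)
}
```

The design benefit is that adding a new index type only requires a new `impl`, rather than another arm in a central match statement.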

westonpace · 2024-12-18T14:54:30Z

+        }
+    }
+
+    async fn test_remap_impl<T: ArrowPrimitiveType>(


Does this only test the case where rows are deleted or does it also test the case where fragments are combined and row ids are changed?

BubbleCal · 2024-12-20T07:31:10Z

+                .open_vector_index(q.column.as_str(), &index.uuid.to_string())
+                .await?;
+            let mut q = q.clone();
+            q.metric_type = idx.metric_type();


This fixes a bug where, with unindexed data, the flat search may compute distances using a different distance type than the index.
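
A small sketch of the idea behind this fix: before running the flat search over unindexed rows, the query is cloned and its metric is overwritten with the index's metric, so both code paths measure distance the same way. `Metric`, `Query`, `distance`, and `align_query_metric` are hypothetical names, and the distance formulas (squared L2, and one minus the dot product) are illustrative definitions rather than Lance's.

```rust
/// Hypothetical distance types for illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Metric {
    L2,
    Dot,
}

/// Hypothetical query: a vector plus the metric requested by the caller.
#[derive(Clone, Debug)]
struct Query {
    vector: Vec<f32>,
    metric: Metric,
}

/// Illustrative distance definitions (squared L2, and 1 - dot product).
fn distance(metric: Metric, a: &[f32], b: &[f32]) -> f32 {
    match metric {
        Metric::L2 => a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>(),
        Metric::Dot => 1.0 - a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>(),
    }
}

/// Mirror of the diff: clone the query and take the metric from the index,
/// so the flat search over unindexed data matches the indexed search.
fn align_query_metric(query: &Query, index_metric: Metric) -> Query {
    let mut aligned = query.clone();
    aligned.metric = index_metric;
    aligned
}
```

Without the alignment step, a query that asks for L2 against an index built with a dot-product metric would mix two incomparable distance scales in one result set.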

westonpace

The cargo bump triggered a substrait update which is causing the MSRV failure. I'll make a PR to bump our MSRV (probably the easiest fix and 1.80 has been out for six months). No strong opinion on whether you wait for that PR or just merge and break CI.

feat: support remapping vector storage and flat index

e24cbed

prepare for supporting remap for new vector index format, HNSW remap not supported because simply mapping the row ids could break the connectivity of graph Signed-off-by: BubbleCal <bubble-cal@outlook.com>

github-actions Bot added the enhancement label Aug 8, 2024

BubbleCal added 2 commits August 8, 2024 19:34

fix

9d6e8f1

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix

b5b6252

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

wjones127 marked this pull request as draft August 13, 2024 18:16

BubbleCal added 3 commits December 3, 2024 14:18

Merge branch 'main' of https://github.com/lancedb/lance into remap-ve…

4c9ea83

…ctor-index Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix

0d34f8b

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fmt

dfa1663

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal mentioned this pull request Dec 4, 2024

feat: support binary vector and hamming distance #3199

Closed

2 tasks

Merge branch 'main' of https://github.com/lancedb/lance into remap-ve…

36959a6

…ctor-index

github-actions Bot added the python label Dec 5, 2024

BubbleCal added 6 commits December 5, 2024 17:20

resolve conflicts

deb0b25

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

remap

3f4b06d

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

Merge branch 'main' of https://github.com/lancedb/lance into remap-ve…

c26696b

…ctor-index

update Cargo.lock

77a6cd5

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

add tests

78cdb17

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix

17c973d

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal commented Dec 10, 2024

fix

7f5fae7

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal changed the title from "feat: support remapping vector storage and flat index" to "feat: support remapping for IVF_FLAT, IVF_PQ and IVF_SQ" Dec 10, 2024

BubbleCal marked this pull request as ready for review December 10, 2024 11:24

BubbleCal requested review from eddyxu, westonpace and wjones127 December 10, 2024 11:24

chebbyChefNEQ reviewed Dec 12, 2024

BubbleCal requested a review from chebbyChefNEQ December 13, 2024 04:38

BubbleCal mentioned this pull request Dec 18, 2024

feat: support IVF_FLAT, binary vectors and hamming distance lancedb/lancedb#1955

Merged

westonpace reviewed Dec 18, 2024

BubbleCal added 3 commits December 20, 2024 13:52

Merge branch 'main' of https://github.com/lancedb/lance into remap-ve…

c7cb809

…ctor-index

fix

8ed78c8

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix

1873581

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal commented Dec 20, 2024

BubbleCal added 2 commits December 20, 2024 15:38

fix

e3175e6

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fmt

c18b4dc

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

westonpace approved these changes Dec 20, 2024

BubbleCal merged commit 72ae355 into lance-format:main Dec 20, 2024

6 participants
