feat: expose index cache size #1587

rok · 2023-11-11T06:04:20Z

This is to enable lancedb/lancedb#641.

rust/lance/src/index/cache.rs

wjones127

Correct me if I'm wrong @eddyxu, but I think you mean exposing a way to set the index cache size, right? This only reads it.

eddyxu · 2023-11-13T17:11:40Z

@wjones127 is right. This is so that lancedb users can customize the size of the index cache.

rok · 2023-11-13T17:50:26Z

Got it. Would we still want the read interface?

wjones127 · 2023-11-13T17:54:28Z

> Would we still want the read interface?

I don't think it's critical, but it could be useful for testing.

rok · 2023-11-13T17:58:30Z

I'll leave it in for testing for now.

eddyxu · 2023-11-13T17:59:10Z

Let's reduce the API surface and put it behind dataset.stats?

rok · 2023-11-15T02:53:52Z

@eddyxu I moved index_cache_size under stats. I think this is now ready for review.

python/python/lance/dataset.py

python/python/tests/test_vector_index.py

rust/lance/src/dataset.rs

rok · 2023-11-16T02:17:49Z

rust/lance/src/index/cache.rs

+    pub(crate) fn get_size(&self) -> usize {
+        self.scalar_cache.sync();
+        self.vector_cache.sync();
+        self.scalar_cache.entry_count() as usize + self.vector_cache.entry_count() as usize


@wjones127 do you feel just summing this is ok?

TBH, I’m not sure I care about the entry count. As a user, I would much rather set the limit in terms of bytes and get the total bytes consumed in the statistics. So this is fine for now but long term I think we ought to consider switching to evicting based on in-memory size.

IIUC we can use weighted_size together with a weigher to do that. (I hope moka is better at cache invalidation than naming.)
I can give that a quick try.

Actually let's make that a separate issue to not stall this one.

> I hope moka is better at cache invalidation than naming

🤣

> Actually let's make that a separate issue to not stall this one.

Sounds good!

Opened #1613
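The byte-based limit discussed above (and deferred to #1613) could look roughly like the toy sketch below. It is not the Lance or moka implementation: `WeightedLruCache` and its constructor arguments are hypothetical, and only the names `weigher`, `weighted_size` and `entry_count` are borrowed from the thread. The idea is that each entry is charged its size in bytes, and least recently used entries are evicted until the total fits the budget.

```python
from collections import OrderedDict


class WeightedLruCache:
    """Toy LRU cache bounded by total weight (e.g. bytes), not entry count.

    Hypothetical illustration only; this is not the moka or Lance API.
    """

    def __init__(self, max_weight, weigher):
        self.max_weight = max_weight
        self.weigher = weigher  # assumed to return an entry's size in bytes
        self._entries = OrderedDict()  # key -> (value, weight), oldest first
        self._weighted_size = 0

    def insert(self, key, value):
        weight = self.weigher(key, value)
        # Replacing an existing key first releases its old weight.
        if key in self._entries:
            self._weighted_size -= self._entries.pop(key)[1]
        if weight > self.max_weight:
            return  # an entry larger than the whole budget is never admitted
        self._entries[key] = (value, weight)
        self._weighted_size += weight
        # Evict least recently used entries until the budget is met.
        while self._weighted_size > self.max_weight:
            _, (_, evicted_weight) = self._entries.popitem(last=False)
            self._weighted_size -= evicted_weight

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

    def entry_count(self):
        return len(self._entries)

    def weighted_size(self):
        return self._weighted_size


cache = WeightedLruCache(max_weight=100, weigher=lambda _key, value: len(value))
cache.insert("a", b"x" * 60)
cache.insert("b", b"x" * 30)
cache.insert("c", b"x" * 40)  # total would be 130 bytes, so "a" is evicted
```

With a limit like this, the statistic reported to users would be `weighted_size()` (bytes held) rather than `entry_count()`.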

I think summing is fine. We might just merge these into one field someday.

However, I think there is one potential problem. The user might set the index cache size to X and then, if they have both scalar and vector indices, they might see an entry count of 2 * X. Still, the best long term solution is probably to do bytes so let's stick with summing for now.
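The 2 * X concern can be reproduced with a small model. This sketch assumes, as the comment above implies, that the scalar and vector caches are each created with the same entry capacity X; `BoundedCache` and `IndexCache` are hypothetical stand-ins, and only `get_size` and `entry_count` mirror names from the diff.

```python
class BoundedCache:
    """Toy cache that holds at most `capacity` entries (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = {}

    def insert(self, key, value):
        if key not in self._entries and len(self._entries) >= self.capacity:
            # Drop the oldest inserted entry to make room for the new one.
            self._entries.pop(next(iter(self._entries)))
        self._entries[key] = value

    def entry_count(self):
        return len(self._entries)


class IndexCache:
    """Two independent caches, each sized with the same capacity (assumption)."""

    def __init__(self, capacity):
        self.scalar_cache = BoundedCache(capacity)
        self.vector_cache = BoundedCache(capacity)

    def get_size(self):
        # Same summing approach as in the diff above.
        return self.scalar_cache.entry_count() + self.vector_cache.entry_count()


index_cache = IndexCache(capacity=4)
for i in range(10):
    index_cache.scalar_cache.insert(f"scalar-{i}", object())
    index_cache.vector_cache.insert(f"vector-{i}", object())

reported = index_cache.get_size()
```

Here the user asked for a cache size of 4, but the summed entry count is 8 once both caches fill up.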

rok · 2023-11-16T03:00:07Z

@wjones127 any other comments here?

python/python/tests/test_vector_index.py

westonpace

Some questions but nothing blocking, looks good.

westonpace · 2023-11-17T16:26:08Z

python/python/lance/__init__.py

@@ -89,7 +89,13 @@ def dataset(
        Approximately, ``n = Total Rows / number of IVF partitions``.
        ``pq = number of PQ sub-vectors``.
    """
-    ds = LanceDataset(uri, version, block_size, commit_lock=commit_lock)
+    ds = LanceDataset(


Does the index cache have a TTL (I'm looking at the comment above which, to be fair, looks like it isn't part of this PR)? I don't think it does.

westonpace · 2023-11-17T16:27:38Z

python/python/lance/dataset.py

@@ -839,6 +839,7 @@ def create_index(
        ivf_centroids: Optional[Union[np.ndarray, pa.FixedSizeListArray]] = None,
        num_sub_vectors: Optional[int] = None,
        accelerator: Optional[Union[str, "torch.Device"]] = None,
+        index_cache_size: Optional[int] = None,


I wonder if it would be better to change create_index so that it modifies the dataset in-place instead of returning a new dataset. Though that would be a breaking change so perhaps that ship has sailed.

I was intending to do that at some point. I had done this on the Rust side earlier since it was a source of bugs: #1118

That does seem like the better design to me.

westonpace · 2023-11-17T16:32:35Z

rust/lance/src/index/cache.rs

+    pub(crate) fn get_size(&self) -> usize {
+        self.scalar_cache.sync();
+        self.vector_cache.sync();
+        self.scalar_cache.entry_count() as usize + self.vector_cache.entry_count() as usize


I think summing is fine. We might just merge these into one field someday.

However, I think there is one potential problem. The user might set the index cache size to X and then, if they have both scalar and vector indices, they might see an entry count of 2 * X. Still, the best long term solution is probably to do bytes so let's stick with summing for now.

rok force-pushed the expose_index_cache_size branch from d2499c0 to f94addd Compare November 11, 2023 20:09

rok requested review from eddyxu and wjones127 November 11, 2023 20:20

wjones127 reviewed Nov 11, 2023

rok force-pushed the expose_index_cache_size branch from b3ad2cc to 3f3c5a8 Compare November 15, 2023 02:51

rok requested a review from wjones127 November 15, 2023 11:28

wjones127 changed the title from "perf: expose index cache size" to "feat: expose index cache size" Nov 16, 2023

wjones127 requested changes Nov 16, 2023

rok force-pushed the expose_index_cache_size branch from 4c81b7d to e598332 Compare November 16, 2023 02:18

rok requested a review from wjones127 November 16, 2023 02:59

wjones127 reviewed Nov 16, 2023

rok force-pushed the expose_index_cache_size branch from 6758931 to bc53b9b Compare November 16, 2023 14:58

rok mentioned this pull request Nov 16, 2023

feat: index cache size should be settable and readable in bytes #1613

Open

rok force-pushed the expose_index_cache_size branch 2 times, most recently from b44691a to a41104f Compare November 17, 2023 00:23

rok requested a review from wjones127 November 17, 2023 00:34

rok added 5 commits November 17, 2023 02:26

perf: expose index cache size

9896fae

Better python test

0b80b9d

Change test

924d55c

lint and add index_cache_size to create_index

422d3bd

Move under .stats

b86634a

rok added 4 commits November 17, 2023 02:28

Fix test and lint

523438b

Review feefback

5053f6d

Simpler rng setup

8b43897

Minor change

be000ca

rok force-pushed the expose_index_cache_size branch from a41104f to c47de30 Compare November 17, 2023 01:29

Post commit fix

effed41

rok force-pushed the expose_index_cache_size branch from c47de30 to effed41 Compare November 17, 2023 01:41

docstring

edcf13d

rok mentioned this pull request Nov 17, 2023

feat(python): expose index cache size lancedb/lancedb#655

Merged

westonpace approved these changes Nov 17, 2023

wjones127 approved these changes Nov 17, 2023

rok merged commit 0083f2d into lancedb:main Nov 17, 2023
17 checks passed

eddyxu pushed a commit that referenced this pull request Nov 17, 2023

feat: expose index cache size (#1587)

a721776

This is to enable lancedb/lancedb#641.

eddyxu pushed a commit that referenced this pull request Nov 17, 2023

feat: expose index cache size (#1587)

4e52e2d

This is to enable lancedb/lancedb#641.

wjones127 pushed a commit to lancedb/lancedb that referenced this pull request Nov 18, 2023

feat(python): expose index cache size (#655)

d8e3e54

This is to enable #641. Should be merged after lancedb/lance#1587 is released.

rok mentioned this pull request Nov 19, 2023

feat: index cache size should be settable and readable in bytes #1638

Closed

raghavdixit99 pushed a commit to raghavdixit99/lancedb that referenced this pull request Apr 5, 2024

feat(python): expose index cache size (lancedb#655)

51850e3

This is to enable lancedb#641. Should be merged after lancedb/lance#1587 is released.

westonpace pushed a commit to lancedb/lancedb that referenced this pull request Apr 5, 2024

feat(python): expose index cache size (#655)

78ab906

This is to enable #641. Should be merged after lancedb/lance#1587 is released.
