feat: expose index cache size #1587

Merged Nov 17, 2023 (11 commits)

Conversation

rok
Contributor

@rok rok commented Nov 11, 2023

This is to enable lancedb/lancedb#641.

rust/lance/src/index/cache.rs (outdated, resolved)
Contributor

@wjones127 wjones127 left a comment

Correct me if I'm wrong @eddyxu, but I think you mean expose setting index cache size, right? This is just reading.

@eddyxu
Contributor

eddyxu commented Nov 13, 2023

@wjones127 is right. This is so that lancedb users can customize the size of the index cache.

@rok
Contributor Author

rok commented Nov 13, 2023

Got it. Would we still want the read interface?

@wjones127
Contributor

> Would we still want the read interface?

I don't think it's critical, but it could be useful for testing.

@rok
Contributor Author

rok commented Nov 13, 2023

I'll leave it in for testing for now.

@eddyxu
Contributor

eddyxu commented Nov 13, 2023

Let's reduce the API surface by putting it behind dataset.stats?

@rok
Contributor Author

rok commented Nov 15, 2023

@eddyxu I moved index_cache_size under stats. I think this is now ready for review.

@rok rok requested a review from wjones127 November 15, 2023 11:28
@wjones127 wjones127 changed the title from "perf: expose index cache size" to "feat: expose index cache size" Nov 16, 2023
python/python/lance/dataset.py (outdated, resolved)
python/python/tests/test_vector_index.py (outdated, resolved)
rust/lance/src/dataset.rs (outdated, resolved)
pub(crate) fn get_size(&self) -> usize {
    self.scalar_cache.sync();
    self.vector_cache.sync();
    self.scalar_cache.entry_count() as usize + self.vector_cache.entry_count() as usize
}
Contributor Author

@wjones127 do you feel just summing this is ok?

Contributor

TBH, I’m not sure I care about the entry count. As a user, I would much rather set the limit in terms of bytes and get the total bytes consumed in the statistics. So this is fine for now but long term I think we ought to consider switching to evicting based on in-memory size.

Contributor Author

IIUC we can use moka's weighted_size together with a weigher to do that. (I hope moka is better at cache invalidation than naming.)
I can give that a quick try.
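
For readers following the thread, here is a minimal, self-contained sketch of what byte-based eviction with a weigher means. It is a toy model in plain Python, not moka's API and not Lance's implementation: `WeightedCache`, `max_bytes`, and the lambda weigher are all hypothetical names chosen for illustration.

```python
from collections import OrderedDict


class WeightedCache:
    """Toy LRU cache bounded by total weight (hypothetical, for illustration only)."""

    def __init__(self, max_bytes, weigher):
        self.max_bytes = max_bytes
        self.weigher = weigher          # callable (key, value) -> weight in bytes
        self._entries = OrderedDict()   # key -> (value, weight), oldest first
        self._weighted_size = 0

    def insert(self, key, value):
        weight = self.weigher(key, value)
        # Replacing an existing key: drop its old weight first.
        if key in self._entries:
            self._weighted_size -= self._entries.pop(key)[1]
        # An entry larger than the whole budget can never fit, so skip it.
        if weight > self.max_bytes:
            return
        self._entries[key] = (value, weight)
        self._weighted_size += weight
        # Evict the oldest entries until the total weight fits the budget.
        while self._weighted_size > self.max_bytes:
            _, (_, evicted_weight) = self._entries.popitem(last=False)
            self._weighted_size -= evicted_weight

    def entry_count(self):
        return len(self._entries)

    def weighted_size(self):
        return self._weighted_size


# Assumption: the weight of an entry is simply the length of its payload.
cache = WeightedCache(max_bytes=100, weigher=lambda key, value: len(value))
cache.insert("a", bytes(40))
cache.insert("b", bytes(40))
cache.insert("c", bytes(40))  # total would be 120 bytes, so "a" is evicted
```

With a byte budget, the limit a user sets and the statistic they read back are in the same unit, which is the property asked for above.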

Contributor Author

Actually let's make that a separate issue to not stall this one.

Contributor

> I hope moka is better at cache invalidation than naming

🤣

> Actually let's make that a separate issue to not stall this one.

Sounds good!

Contributor Author

Opened #1613

Contributor

I think summing is fine. We might just merge these into one field someday.

However, I think there is one potential problem. The user might set the index cache size to X and then, if they have both scalar and vector indices, they might see an entry count of 2 * X. Still, the best long term solution is probably to do bytes so let's stick with summing for now.
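
The 2 * X concern can be reproduced with a small toy model. The sketch below is plain Python and only mirrors the shape of `get_size` quoted above. It assumes that the scalar and vector caches are each created with the same user-supplied capacity; `EntryCountCache` and `IndexCache` are hypothetical stand-ins, not Lance's types.

```python
from collections import OrderedDict


class EntryCountCache:
    """Toy LRU cache bounded by number of entries (hypothetical stand-in)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def insert(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        # Evict least recently inserted entries beyond the capacity.
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def entry_count(self):
        return len(self._entries)


class IndexCache:
    """Two independent caches that share one configured size (assumption)."""

    def __init__(self, capacity):
        self.scalar_cache = EntryCountCache(capacity)
        self.vector_cache = EntryCountCache(capacity)

    def get_size(self):
        # Same summing approach as in the Rust snippet under review.
        return self.scalar_cache.entry_count() + self.vector_cache.entry_count()


cache = IndexCache(capacity=4)
for i in range(10):
    cache.scalar_cache.insert(f"scalar-{i}", object())
    cache.vector_cache.insert(f"vector-{i}", object())

reported = cache.get_size()
print(reported)
```

Under that assumption, a user who configured a size of 4 can see a reported entry count of 8 once both caches are full.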

@rok
Contributor Author

rok commented Nov 16, 2023

@wjones127 any other comments here?

Contributor

@westonpace westonpace left a comment

Some questions but nothing blocking, looks good.

@@ -89,7 +89,13 @@ def dataset(
Approximately, ``n = Total Rows / number of IVF partitions``.
``pq = number of PQ sub-vectors``.
"""
- ds = LanceDataset(uri, version, block_size, commit_lock=commit_lock)
+ ds = LanceDataset(
Contributor

Does the index cache have a TTL (I'm looking at the comment above which, to be fair, looks like it isn't part of this PR)? I don't think it does.

@@ -839,6 +839,7 @@ def create_index(
ivf_centroids: Optional[Union[np.ndarray, pa.FixedSizeListArray]] = None,
num_sub_vectors: Optional[int] = None,
accelerator: Optional[Union[str, "torch.Device"]] = None,
+ index_cache_size: Optional[int] = None,
Contributor

I wonder if it would be better to change create_index so that it modifies the dataset in-place instead of returning a new dataset. Though that would be a breaking change so perhaps that ship has sailed.

Contributor

I was intending to do that at some point. I had done this on the Rust side earlier since it was a source of bugs: #1118

Contributor Author

That does seem like the better design to me.
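
To illustrate why the return-a-new-dataset style can be a source of bugs, here is a toy sketch in plain Python. `ToyDataset` and both method names are hypothetical and do not reflect the actual `create_index` implementation. The sketch only contrasts the two designs discussed in this thread.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset handle that tracks its indices."""

    def __init__(self, indices=()):
        self.indices = list(indices)

    def create_index_returning_new(self, column):
        # Design A: the caller must keep the returned handle.
        return ToyDataset(self.indices + [column])

    def create_index_in_place(self, column):
        # Design B: the existing handle is updated, so it cannot go stale.
        self.indices.append(column)
        return self


ds = ToyDataset()

# Easy mistake with design A: the result is discarded, so `ds` is stale.
ds.create_index_returning_new("vector")
stale = list(ds.indices)

# With design B the same call pattern leaves `ds` up to date.
ds.create_index_in_place("vector")
fresh = list(ds.indices)
```

The trade-off noted above still applies: switching an existing public method from design A to design B changes behavior for callers who rely on the returned handle.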

@rok rok merged commit 0083f2d into lancedb:main Nov 17, 2023
17 checks passed
eddyxu pushed a commit that referenced this pull request Nov 17, 2023
eddyxu pushed a commit that referenced this pull request Nov 17, 2023
wjones127 pushed a commit to lancedb/lancedb that referenced this pull request Nov 18, 2023
This is to enable #641.
Should be merged after lancedb/lance#1587 is
released.
raghavdixit99 pushed a commit to raghavdixit99/lancedb that referenced this pull request Apr 5, 2024
This is to enable lancedb#641.
Should be merged after lancedb/lance#1587 is
released.
westonpace pushed a commit to lancedb/lancedb that referenced this pull request Apr 5, 2024
This is to enable #641.
Should be merged after lancedb/lance#1587 is
released.