feat: expose index cache size #1587
Changes from all commits: 9896fae, 0b80b9d, 924d55c, 422d3bd, b86634a, 523438b, 5053f6d, 8b43897, be000ca, effed41, edcf13d
@@ -839,6 +839,7 @@ def create_index(
     ivf_centroids: Optional[Union[np.ndarray, pa.FixedSizeListArray]] = None,
     num_sub_vectors: Optional[int] = None,
     accelerator: Optional[Union[str, "torch.Device"]] = None,
+    index_cache_size: Optional[int] = None,
     **kwargs,
 ) -> LanceDataset:
     """Create index on column.

Review comment: I wonder if it would be better to change […]
Review comment: I was intending to do that at some point. I had done this on the Rust side earlier, since it was a source of bugs: #1118
Review comment: That does seem like the better design to me.
@@ -870,6 +871,8 @@ def create_index(
         If set, use an accelerator to speed up the training process.
         Accepted accelerator: "cuda" (Nvidia GPU) and "mps" (Apple Silicon GPU).
         If not set, use the CPU.
+    index_cache_size : int, optional
+        The size of the index cache in number of entries. Default value is 256.
     kwargs :
         Parameters passed to the index building process.
@@ -1015,7 +1018,7 @@ def create_index(
             kwargs["ivf_centroids"] = ivf_centroids_batch

         self._ds.create_index(column, index_type, name, replace, kwargs)
-        return LanceDataset(self.uri)
+        return LanceDataset(self.uri, index_cache_size=index_cache_size)

     @staticmethod
     def _commit(
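The hunk above has `create_index` return a fresh dataset handle constructed with the requested cache size, rather than mutating the current handle. A minimal Python sketch of that pattern, assuming the documented default of 256 entries; `LanceDatasetSketch` is a hypothetical stand-in, not the real pylance class:

```python
# Illustrative stand-in for LanceDataset (hypothetical, not the real class).
class LanceDatasetSketch:
    def __init__(self, uri, index_cache_size=None):
        self.uri = uri
        # Fall back to the documented default of 256 cache entries.
        self.index_cache_size = 256 if index_cache_size is None else index_cache_size

    def create_index(self, column, index_cache_size=None):
        # ... native index build would happen here ...
        # Re-open the dataset so the new handle carries the cache size.
        return LanceDatasetSketch(self.uri, index_cache_size=index_cache_size)

ds = LanceDatasetSketch("s3://bucket/data.lance")
indexed = ds.create_index("vector", index_cache_size=512)
print(indexed.index_cache_size)  # 512
```

Returning a new handle keeps the cache-size setting tied to dataset construction, which matches the `LanceDataset(self.uri, index_cache_size=...)` call in the diff.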
@@ -1807,6 +1810,7 @@ def index_stats(self, index_name: str) -> Dict[str, Any]:
         index_stats = json.loads(self._ds.index_statistics(index_name))
         index_stats["num_indexed_rows"] = self._ds.count_indexed_rows(index_name)
         index_stats["num_unindexed_rows"] = self._ds.count_unindexed_rows(index_name)
+        index_stats["index_cache_entry_count"] = self._ds.index_cache_entry_count()
         return index_stats
@@ -39,6 +39,12 @@ impl IndexCache {
         self.vector_cache.entry_count() as usize
     }

+    pub(crate) fn get_size(&self) -> usize {
+        self.scalar_cache.sync();
+        self.vector_cache.sync();
+        self.scalar_cache.entry_count() as usize + self.vector_cache.entry_count() as usize
+    }

Review comment: @wjones127 do you feel just summing this is ok?
Review comment: TBH, I'm not sure I care about the entry count. As a user, I would much rather set the limit in terms of bytes and get the total bytes consumed in the statistics. So this is fine for now, but long term I think we ought to consider switching to evicting based on in-memory size.
Review comment: IIUC we can use […]
Review comment: Actually, let's make that a separate issue so it doesn't stall this one.
Review comment: Sounds good!
Review comment: Opened #1613
Review comment: I think summing is fine. We might just merge these into one field someday. However, I think there is one potential problem: the user might set the index cache size to X and then, if they have both scalar and vector indices, they might see an entry count of […]
    /// Get an Index if present. Otherwise returns [None].
    pub(crate) fn get_scalar(&self, key: &str) -> Option<Arc<dyn ScalarIndex>> {
        self.scalar_cache.get(key)

Review comment: Does the index cache have a TTL? (I'm looking at the comment above which, to be fair, looks like it isn't part of this PR.) I don't think it does.
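To make the reviewers' concern about summing concrete: because `get_size` adds the entry counts of two independently bounded caches, a user who sets the limit to X can observe up to 2X entries. A small stdlib-only Python simulation of that behavior (not the real Rust caches):

```python
from collections import OrderedDict

# Simple LRU cache with a per-cache entry limit, standing in for the
# scalar and vector index caches (illustrative only).
class LruCache:
    def __init__(self, limit):
        self.limit = limit
        self.entries = OrderedDict()

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.limit:
            self.entries.popitem(last=False)  # evict the oldest entry

    def entry_count(self):
        return len(self.entries)

scalar = LruCache(limit=256)
vector = LruCache(limit=256)
for i in range(300):  # overfill both caches
    scalar.put(f"s{i}", i)
    vector.put(f"v{i}", i)

# get_size sums both caches, so the reported total can reach twice the
# per-cache limit -- the 2X surprise raised in the review thread.
total = scalar.entry_count() + vector.entry_count()
print(total)  # 512
```

This is also why the thread leans toward a single byte-based budget long term: one shared limit would make the reported size match what the user configured.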