Skip to content

Vector search should support appending new rows#593

Merged
changhiskhan merged 15 commits intomainfrom
changhiskhan/max-fragment-id
Apr 5, 2023
Merged

Vector search should support appending new rows#593
changhiskhan merged 15 commits intomainfrom
changhiskhan/max-fragment-id

Conversation

@changhiskhan
Copy link
Copy Markdown
Contributor

next PR will join flat search for new fragments

@changhiskhan changhiskhan requested a review from eddyxu February 17, 2023 06:56
Comment thread protos/format.proto Outdated
@changhiskhan changhiskhan force-pushed the changhiskhan/max-fragment-id branch from cec42e4 to 03c5543 Compare March 31, 2023 23:02


@pytest.mark.skipif(
(platform.system() == "Darwin") and (platform.machine() != "arm64"),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the intel mac should fallback to the AVX version? not sure why need to skip on mac GHA

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we added this awhile back because the index related tests on intel mac runners would take forever on github

Comment thread rust/src/dataset.rs
Self::checkout_manifest(object_store, base_path, &manifest_file).await
}

pub async fn checkout_version(&self, version: u64) -> Result<Self> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checkout ? Also, pls add some doc to the public API

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkout is already a static method so this needs a new name. Will add docstring

Comment thread rust/src/dataset.rs Outdated
// If overwrite, invalidate index
manifest.index_section = None;
}
manifest.version = dataset.as_ref().map_or(1, |d| d.manifest.version + 1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be the latest version + 1?
For example, if you check out version 2 of [1, 2, 3], and write again, it will overwrite the version 3, iiuc

Comment thread rust/src/dataset.rs
manifest.version = dataset.as_ref().map_or(1, |d| d.manifest.version + 1);
let indices = if matches!(params.mode, WriteMode::Append) {
if let Some(d) = dataset.as_ref() {
Some(d.load_indices().await?)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this means inherent the index?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, i'll add code comments as well

Comment thread rust/src/dataset/scanner.rs Outdated
// Check if we've created new versions since the index
let version = index.dataset_version;
if version != self.dataset.version().version {
// If we've added more rows, then we'll have new fragments
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you extract this piece to a separate function?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread rust/src/dataset/scanner.rs Outdated
.manifest
.max_fragment_id()
.ok_or_else(|| Error::IO("No fragments in index version".to_string()))?;
let max_fragment_id_ds = self
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not allow to reclaim / recycle fragment IDs? Could you add a comment?

In case of a recycle Ids, it seems need to re-mapping all fragment ids.

Comment thread rust/src/format/manifest.rs Outdated
self.fragments.iter().map(|f| f.id).max()
}

pub fn fragments_since(&self, since: Manifest) -> Vec<Fragment> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since: &Manifest

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment thread rust/src/format/manifest.rs Outdated

pub fn fragments_since(&self, since: Manifest) -> Vec<Fragment> {
let mut fragments = vec![];
let mut fragment_map = HashMap::new();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we check since version is ealier than self.version?

Feel it is more reliable that fragment has a "version" attach to it, so that this function just compares the version number.

To handle case like, deletion / insert and etc

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i can add an assert check for the version.

Comment thread rust/src/io/exec/knn.rs Outdated
/// Returns an error if the preconditions are not met.
pub fn try_new(input: Arc<dyn ExecutionPlan>, query: Query) -> Result<Self> {
let schema = input.schema();
pub fn try_new(inputs: Vec<Arc<dyn ExecutionPlan>>, query: Query) -> Result<Self> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the case where KNNFlatExec has multiple children execute plan? Should it be its child to take care the multiple sources are?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you'd need to add a concat node? feels like unnecessary complexity for now? what use case are you anticipating with multiple child nodes

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh my question was that i dont see a case that KNNFlatExec has multiple inputs, so wanted to know why change the signature from input to inputs: Vec<_>.

Because the number of children is kind important for the execution plan contract. It feels wrong that KNNFlatExec needs to handle multiple children.

Comment thread protos/format.proto Outdated
// Index name. Must be unique within one dataset version.
string name = 3;

// The latest version of the dataset this index covers.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the version of the dataset this index was built on?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@changhiskhan changhiskhan changed the title Inherit index on Append Vector search should support appending new rows Apr 5, 2023
@changhiskhan changhiskhan force-pushed the changhiskhan/max-fragment-id branch from 88591f9 to d8a3b96 Compare April 5, 2023 05:34
@changhiskhan changhiskhan merged commit 25eb36f into main Apr 5, 2023
@changhiskhan changhiskhan deleted the changhiskhan/max-fragment-id branch April 5, 2023 06:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants