Skip to content

Missing column when both nearest and filter are applied#686

Merged
changhiskhan merged 3 commits intomainfrom
changhiskhan/nearest-filter
Mar 16, 2023
Merged

Missing column when both nearest and filter are applied#686
changhiskhan merged 3 commits intomainfrom
changhiskhan/nearest-filter

Conversation

@changhiskhan
Copy link
Copy Markdown
Contributor

@changhiskhan changhiskhan commented Mar 15, 2023

Addresses #685

  • reproduce issue
  • RCA => score column is missing due to incorrect projection when filters are present for ANN.
  • Fix

@eddyxu the fix is hacky due to the lack of formal contracts on the io exec nodes. That's why you see the weird ann_schema variable being passed to the local take node. Happy to walk through the logic with you over zoom if that's quicker.

@changhiskhan changhiskhan requested a review from eddyxu March 16, 2023 04:05
@changhiskhan changhiskhan marked this pull request as ready for review March 16, 2023 04:05
Comment thread rust/src/dataset/scanner.rs Outdated
let score_schema = ArrowSchema::new(vec![score]);

let merged = self
.projections
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the differnence of vector_schema w/ self.projections? is one a superset of the other?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

=> vector_schema is just the vector and score.
=> self.projections is the user supplied projection (e.g., dataset.to_table(columns=<>, ...), if None then all columns of the dataset is the projection).

Neither is the superset of the other.

}
}

fn scanner_output_schema(&self) -> Result<Arc<Schema>> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this just be Scanner::schema?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the API aspect, Scanner::schema should just return the schema of output. So the contract can be built with other system components.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scanner::schema is the ArrowSchema and should not be used in Lance reader internals because that's why the field id's are messed up.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just make Scanner::schema call the other

Comment thread rust/src/dataset/scanner.rs Outdated
let score_schema = ArrowSchema::new(vec![column, score]);
let vector_search_columns = &Schema::try_from(&score_schema)?;
let merged = self.projections.merge(vector_search_columns);
let score_schema = ArrowSchema::new(vec![score]);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use self.vector_search_schema()?

filter_node,
self.dataset.clone(),
Arc::new(self.projections.clone()),
output_schema,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is leaking the abstraction of ann to take?

what is in output_schema nad ann_schema tho? can we just use one?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

output_schema includes user supplied projections. ann_schema includes the vector/score columns.

Comment thread rust/src/io/exec/take.rs
input: SendableRecordBatchStream,
dataset: Arc<Dataset>,
schema: Arc<Schema>,
ann_schema: Option<Arc<Schema>>, // TODO add input/output schema contract to exec nodes and remove this
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File a ticket to remove this leaking abstraction?

Copy link
Copy Markdown
Member

@eddyxu eddyxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending a new ticket to track the issue of cleaning ann_schema.

@changhiskhan changhiskhan merged commit 9a3364c into main Mar 16, 2023
@changhiskhan changhiskhan deleted the changhiskhan/nearest-filter branch March 16, 2023 05:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants