Missing column when both nearest and filter are applied #686
changhiskhan merged 3 commits into main
    let score_schema = ArrowSchema::new(vec![score]);

    let merged = self
        .projections
What is the difference between vector_schema and self.projections? Is one a superset of the other?
=> vector_schema is just the vector and score columns.
=> self.projections is the user-supplied projection (e.g., dataset.to_table(columns=<>, ...); if None, all columns of the dataset are projected).
Neither is a superset of the other.
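To make the "neither is a superset" point concrete, here is a minimal sketch of what merging the two schemas amounts to. `ToySchema`, its `merge` and `is_superset_of` methods, and the column names `id` and `category` are all hypothetical stand-ins for illustration; this is not Lance's `Schema` type or its actual `merge` implementation.

```rust
// Simplified stand-in for a schema: an ordered list of column names.
// This is NOT Lance's Schema type; it only models the set relationship.
#[derive(Debug, Clone, PartialEq)]
struct ToySchema {
    columns: Vec<String>,
}

impl ToySchema {
    fn new(columns: &[&str]) -> Self {
        ToySchema {
            columns: columns.iter().map(|c| c.to_string()).collect(),
        }
    }

    /// Union of both schemas: keep our columns in order, then append
    /// every column from `other` that we do not already have.
    fn merge(&self, other: &ToySchema) -> ToySchema {
        let mut columns = self.columns.clone();
        for c in &other.columns {
            if !columns.contains(c) {
                columns.push(c.clone());
            }
        }
        ToySchema { columns }
    }

    /// True when every column of `other` also appears in `self`.
    fn is_superset_of(&self, other: &ToySchema) -> bool {
        other.columns.iter().all(|c| self.columns.contains(c))
    }
}
```

With a user projection of `["id", "category"]` and a vector schema of `["vector", "score"]`, neither side contains the other, so the scan output has to be the merged set of all four columns.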
        }
    }

    fn scanner_output_schema(&self) -> Result<Arc<Schema>> {
Should this just be Scanner::schema?
From an API perspective, Scanner::schema should just return the output schema, so that other system components can build a contract on it.
Scanner::schema returns the ArrowSchema and should not be used in Lance reader internals; using it there is why the field ids get messed up.
I'll just make Scanner::schema call the other one.
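A rough sketch of the delegation agreed on here, where the public `schema` is derived from the internal output schema. All types below (`Field`, `Schema`, `ArrowSchema`, `Scanner`) are hypothetical stand-ins, the `String` error type is a placeholder, and the idea that the Arrow view carries only names while the internal schema carries field ids is an assumption drawn from the comment above, not a statement about the real Lance or Arrow APIs.

```rust
use std::sync::Arc;

// Hypothetical stand-ins; not the real Lance or Arrow types.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    id: i32, // internal field id (assumed to exist only on the internal schema)
    name: String,
}

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
}

// Public-facing view: names only, no field ids (assumption).
#[derive(Debug, Clone, PartialEq)]
struct ArrowSchema {
    names: Vec<String>,
}

impl From<&Schema> for ArrowSchema {
    fn from(s: &Schema) -> Self {
        ArrowSchema {
            names: s.fields.iter().map(|f| f.name.clone()).collect(),
        }
    }
}

struct Scanner {
    projections: Schema,
}

impl Scanner {
    /// Internal schema with field ids intact; reader internals use this.
    fn scanner_output_schema(&self) -> Result<Arc<Schema>, String> {
        Ok(Arc::new(self.projections.clone()))
    }

    /// Public API: derived from the internal schema, so the two cannot drift.
    fn schema(&self) -> Result<ArrowSchema, String> {
        let internal = self.scanner_output_schema()?;
        Ok(ArrowSchema::from(internal.as_ref()))
    }
}
```

The design point is that there is a single source of truth: callers outside the reader get the Arrow view, while internal code keeps working with the schema that still has its field ids.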
    let score_schema = ArrowSchema::new(vec![column, score]);
    let vector_search_columns = &Schema::try_from(&score_schema)?;
    let merged = self.projections.merge(vector_search_columns);
    let score_schema = ArrowSchema::new(vec![score]);
Use self.vector_search_schema()?
    filter_node,
    self.dataset.clone(),
    Arc::new(self.projections.clone()),
    output_schema,
Is this leaking the ANN abstraction into take?
What is in output_schema and ann_schema, though? Can we just use one?
output_schema includes the user-supplied projections; ann_schema includes the vector/score columns.
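One plausible reading of why the take node receives both schemas, sketched under stated assumptions: the node emits the user projection and, when an ANN search ran upstream, also carries the vector/score columns through. `ColumnSet`, `TakeSketch`, and `output_columns` are hypothetical names, and this is a guess at the shape of the logic, not the code in this PR.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a schema: just an ordered list of column names.
#[derive(Debug, Clone, PartialEq)]
struct ColumnSet {
    columns: Vec<String>,
}

// Hypothetical model of a take node that is told about both schemas.
struct TakeSketch {
    schema: Arc<ColumnSet>,             // user-supplied projection
    ann_schema: Option<Arc<ColumnSet>>, // vector/score columns, if ANN ran upstream
}

impl TakeSketch {
    /// Columns the node would emit: the projection, plus any ANN columns
    /// that are not already part of it (assumed behavior).
    fn output_columns(&self) -> Vec<String> {
        let mut out = self.schema.columns.clone();
        if let Some(ann) = &self.ann_schema {
            for c in &ann.columns {
                if !out.contains(c) {
                    out.push(c.clone());
                }
            }
        }
        out
    }
}
```

Under this reading, passing `None` for `ann_schema` reduces the node to a plain projection, which is why the extra parameter reads as a leak: the take node only needs it because upstream nodes do not declare their own output schema.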
    input: SendableRecordBatchStream,
    dataset: Arc<Dataset>,
    schema: Arc<Schema>,
    ann_schema: Option<Arc<Schema>>, // TODO add input/output schema contract to exec nodes and remove this
File a ticket to remove this leaky abstraction?
eddyxu left a comment
Pending a new ticket to track the cleanup of ann_schema.
Addresses #685
@eddyxu the fix is hacky due to the lack of formal contracts on the IO exec nodes. That's why you see the weird ann_schema variable being passed to the local take node. Happy to walk through the logic with you over Zoom if that's quicker.