| } | ||
|
|
||
| impl DiskANNParams { | ||
| pub(crate) fn byte_length(&self) -> usize { |
There was a problem hiding this comment.
should this be num_sub_vectors * num_bits?
There was a problem hiding this comment.
This seems "accidentally" correct only because we happen to use 1-byte PQ codes.
There was a problem hiding this comment.
This depends on how PQ is refactored later. For example, if nbits=4, one byte can hold two PQ codes, so it is a bit difficult to hard-code logic like ceiling(nbits / 8) here.
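To make the two layouts under discussion concrete, here is a minimal sketch of how the byte length could be computed in each case. Both helper names are hypothetical and not part of this PR; the PR itself has not settled on either layout.

```rust
/// Bytes for one PQ code when sub-vector codes are tightly bit-packed,
/// e.g. two 4-bit codes share one byte. (Hypothetical helper.)
pub fn pq_code_bytes_packed(num_sub_vectors: usize, num_bits: usize) -> usize {
    // Round the total bit count up to whole bytes.
    (num_sub_vectors * num_bits + 7) / 8
}

/// Bytes for one PQ code when every sub-vector code is padded up to
/// a whole number of bytes on its own. (Hypothetical helper.)
pub fn pq_code_bytes_padded(num_sub_vectors: usize, num_bits: usize) -> usize {
    // Round each code up to whole bytes, then multiply.
    num_sub_vectors * ((num_bits + 7) / 8)
}
```

With 8-bit codes both functions return `num_sub_vectors`, which is why the current implementation happens to be correct. They diverge for other widths: with 16 sub-vectors and 4 bits, the packed layout needs 8 bytes and the padded layout needs 16.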
| pub use persisted::*; | ||
|
|
||
| /// Graph build parameters. | ||
| pub trait GraphParams {} |
There was a problem hiding this comment.
what are we going to use this for?
| return Err(Error::Index("Invalid graph".to_string())); | ||
| } | ||
| let binary_size = graph.vertex(0).byte_length(); | ||
| let binary_size = graph.vertex(0).byte_length(graph_params); |
There was a problem hiding this comment.
Would it make more sense to have Vertex/Node factories that just takes a graph_params?
That way instead of passing it explicitly in a bunch of places, you give the GraphParams to like the GraphBuilder and then forget about it? And the GraphBuilder is responsible for making the nodes/vertices?
There was a problem hiding this comment.
Made a new VertexSerDe trait.
| self.nodes[id].neighbors = neighbors.into(); | ||
| } | ||
|
|
||
| pub fn add_edge(&mut self, from: usize, to: usize) { |
There was a problem hiding this comment.
Nitpick on naming: why is it set_neighbors but add_edge?
| Arc::new(dataset) | ||
| } | ||
|
|
||
| // #[tokio::test] |
| .row(idx) | ||
| .ok_or(Error::Index("Invalid row index".to_string()))?; | ||
|
|
||
| let dists = l2_distance(query, vector, vectors.num_columns())?; |
There was a problem hiding this comment.
TODO: support other distance metrics?
| Ok(()) | ||
| } | ||
|
|
||
| /// Write index file. |
There was a problem hiding this comment.
so right now nothing is actually writing the index to disk?
There was a problem hiding this comment.
For now it only writes graphs to disk (as *.lance); the metadata (protobuf) will be written in follow-up PRs.
| .ok_or_else(|| Error::Index("Cannot find the medoid of an empty matrix".to_string()))?; | ||
|
|
||
| let dist_func = metric_type.func(); | ||
| // Find the closest vertex to the centroid. |
There was a problem hiding this comment.
I don't think this is the definition of the medoid, is it? I thought the medoid is the element in a cluster whose total distance to all of the other elements is minimized.
There was a problem hiding this comment.
They seem to have started using "closest one to the centroid" since the FreshDiskANN paper, maybe due to the time complexity of finding the exact medoid.
I can change the function name, though.
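The difference between the two definitions can be sketched as follows. This is an illustrative sketch only: the function names, the `Vec<Vec<f32>>` representation, and the plain Euclidean distance are assumptions for the example, not the code in this PR.

```rust
/// Euclidean (L2) distance between two equal-length vectors.
fn l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Index of the smallest cost; the first one wins on ties.
fn argmin(costs: &[f32]) -> Option<usize> {
    costs
        .iter()
        .enumerate()
        .min_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

/// Exact medoid: the point whose summed distance to all other points
/// is minimal. Needs O(n^2) distance computations.
pub fn exact_medoid(points: &[Vec<f32>]) -> Option<usize> {
    let costs: Vec<f32> = points
        .iter()
        .map(|p| points.iter().map(|q| l2(p, q)).sum::<f32>())
        .collect();
    argmin(&costs)
}

/// Approximation discussed above: the point closest to the centroid.
/// Needs only O(n) distance computations.
pub fn closest_to_centroid(points: &[Vec<f32>]) -> Option<usize> {
    let first = points.first()?;
    let mut centroid = vec![0.0f32; first.len()];
    for p in points {
        for (c, x) in centroid.iter_mut().zip(p) {
            *c += x;
        }
    }
    let n = points.len() as f32;
    for c in centroid.iter_mut() {
        *c /= n;
    }
    let costs: Vec<f32> = points.iter().map(|p| l2(p, &centroid)).collect();
    argmin(&costs)
}
```

The two can disagree when there are outliers. For the one-dimensional points 0, 1, 2, 3, 100, the exact medoid is the point 2, while the point closest to the centroid (21.2) is the point 3.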
| let mut candidates: BTreeMap<OrderedFloat<f32>, usize> = BTreeMap::new(); | ||
| let mut heap: BinaryHeap<VertexWithDistance> = BinaryHeap::new(); | ||
| let dist = distance_to(vectors, query, start)?; | ||
| heap.push(VertexWithDistance { |
There was a problem hiding this comment.
the heap is never more than one element?
There was a problem hiding this comment.
so why use the heap at all?
There was a problem hiding this comment.
This got lost across a few commits. Refactored and fixed.
| } | ||
| } | ||
|
|
||
| Ok(( |
There was a problem hiding this comment.
so currently this is searching through all neighbors of the starting vertex and returning the K closest?
| pub fn func(&self) -> Arc<dyn Fn(&[f32], &[f32]) -> f32> { | ||
| match self { | ||
| Self::L2 => Arc::new(l2_distance), | ||
| Self::Cosine => todo!("cosine distance"), |
There was a problem hiding this comment.
when does this get used vs batch? does this mean we can't choose cosine distance for index?
There was a problem hiding this comment.
I am still refactoring cosine into the linalg module, similar to what was done for L2 in #781. After that, a cosine function will be returned here.
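For illustration, a cosine function with the same `Fn(&[f32], &[f32]) -> f32` shape could slot into `func()` roughly like this. Everything below is a stand-in written for this example: the enum, `l2_distance`, and `cosine_distance` are not the Lance implementations, and returning 1.0 for a zero-norm input is an assumption.

```rust
use std::sync::Arc;

/// Stand-in for the metric enum in the diff above.
pub enum MetricType {
    L2,
    Cosine,
}

/// Stand-in Euclidean distance (not the linalg implementation).
pub fn l2_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f32>()
        .sqrt()
}

/// Cosine distance: 1 - dot(a, b) / (|a| * |b|).
/// Assumption: a zero-norm input yields the maximum distance of 1.0.
pub fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 1.0;
    }
    1.0 - dot / (norm_a * norm_b)
}

impl MetricType {
    /// Return the single-pair distance function for this metric.
    pub fn func(&self) -> Arc<dyn Fn(&[f32], &[f32]) -> f32> {
        match self {
            Self::L2 => Arc::new(l2_distance),
            Self::Cosine => Arc::new(cosine_distance),
        }
    }
}
```

Keeping both metrics behind the same function type means callers such as the graph search would not need to know which metric the index was built with.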
| impl VertexSerDe<PQVertex> for PQVertexSerDe { | ||
| #[inline] | ||
| fn size(&self) -> usize { | ||
| 8 /* row_id*/ + 8 /* Length of pq */ + self.num_sub_vectors * 1 /* only 8 bits now */ |
There was a problem hiding this comment.
- only 8 bits now => isn't it parameterized above as pq_num_bits?
- what's "length of pq" (8)?
There was a problem hiding this comment.
only 8 bits now => isn't it parameterized above as pq_num_bits?
We have not finalized how to materialize the PQ bits yet, e.g., for 4 bits or 12 bits, whether each code is rounded up to whole bytes or several PQ codes are packed into shared bytes.
what's "length of pq"
The length of the PQ code in the vector, i.e., the number of sub-vectors.
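The layout implied by `size()` (an 8-byte row id, an 8-byte length, then one byte per sub-vector) can be sketched as below. The struct names, the little-endian byte order, and the `serialize`/`deserialize` methods are assumptions made for this example; only the `size()` arithmetic comes from the diff.

```rust
use std::convert::TryInto;

/// Hypothetical in-memory vertex: a row id plus its PQ code.
pub struct PqVertex {
    pub row_id: u64,
    pub pq: Vec<u8>,
}

/// Hypothetical fixed-size serializer for `PqVertex`.
pub struct PqVertexSerDe {
    pub num_sub_vectors: usize,
}

impl PqVertexSerDe {
    /// 8 bytes row id + 8 bytes PQ length + 1 byte per sub-vector
    /// (valid only while each PQ code is exactly 8 bits).
    pub fn size(&self) -> usize {
        8 + 8 + self.num_sub_vectors
    }

    /// Write the vertex as `size()` bytes (little-endian, by assumption).
    pub fn serialize(&self, v: &PqVertex) -> Vec<u8> {
        assert_eq!(v.pq.len(), self.num_sub_vectors);
        let mut out = Vec::with_capacity(self.size());
        out.extend_from_slice(&v.row_id.to_le_bytes());
        out.extend_from_slice(&(v.pq.len() as u64).to_le_bytes());
        out.extend_from_slice(&v.pq);
        out
    }

    /// Read a vertex back; returns None if the buffer does not match.
    pub fn deserialize(&self, data: &[u8]) -> Option<PqVertex> {
        if data.len() != self.size() {
            return None;
        }
        let row_id = u64::from_le_bytes(data[0..8].try_into().ok()?);
        let len = u64::from_le_bytes(data[8..16].try_into().ok()?) as usize;
        if len != self.num_sub_vectors {
            return None;
        }
        Some(PqVertex { row_id, pq: data[16..].to_vec() })
    }
}
```

Because every vertex has the same serialized size, a reader can seek directly to vertex `i` at byte offset `i * size()`.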
| fn distance(&self, a: usize, b: usize) -> Result<f32> { | ||
| let vector_a = self.data.row(a).ok_or_else(|| { | ||
| Error::Index(format!( | ||
| "Attempt to access row {} in a matrix with {} rows", |
There was a problem hiding this comment.
It means the vector id is out of bounds (of what is held in the matrix view). Any suggestions for rewording?
| pub fn add_edge(&mut self, from: usize, to: usize) { | ||
| self.nodes[from].neighbors.push(to as u32); | ||
| /// Set neighbors of a node. | ||
| pub fn set_neighbors(&mut self, id: usize, neighbors: impl Into<Vec<u32>>) { |
There was a problem hiding this comment.
should this be vertex as well?
There was a problem hiding this comment.
vertex? or Graph?
The graph interface will be shared with the persisted graph as well, so greedy_search will be shared between the index and the graph builder.