Skip to content

[Rust] Build DiskANN index#763

Merged
eddyxu merged 12 commits intomainfrom
lei/diskann
Apr 18, 2023
Merged

[Rust] Build DiskANN index#763
eddyxu merged 12 commits intomainfrom
lei/diskann

Conversation

@eddyxu
Copy link
Copy Markdown
Member

@eddyxu eddyxu commented Apr 7, 2023

No description provided.

@eddyxu eddyxu force-pushed the lei/diskann branch 3 times, most recently from 0158d72 to db61c17 Compare April 15, 2023 17:28
@eddyxu eddyxu marked this pull request as ready for review April 15, 2023 18:46
@eddyxu eddyxu requested a review from changhiskhan April 15, 2023 18:46
Comment thread rust/src/index/vector/diskann.rs Outdated
}

impl DiskANNParams {
pub(crate) fn byte_length(&self) -> usize {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be num_sub_vectors * num_bits?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems "accidentally" correct just because we happen to use 1 byte pq code

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be depended on how PQ is refactored later, i.e,. if nbits=4, one byte can be two PQ code, it is bit difficult to put the logic like ceiling(nbits / 8) hard coded here.

Comment thread rust/src/index/vector/graph.rs Outdated
pub use persisted::*;

/// Graph build parameters.
pub trait GraphParams {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are we going to use this for?

return Err(Error::Index("Invalid graph".to_string()));
}
let binary_size = graph.vertex(0).byte_length();
let binary_size = graph.vertex(0).byte_length(graph_params);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make more sense to have Vertex/Node factories that just takes a graph_params?
That way instead of passing it explicitly in a bunch of places, you give the GraphParams to like the GraphBuilder and then forget about it? And the GraphBuilder is responsible for making the nodes/vertices?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a new VertexSerDe trait.

Comment thread rust/src/index/vector/graph/builder.rs Outdated
self.nodes[id].neighbors = neighbors.into();
}

pub fn add_edge(&mut self, from: usize, to: usize) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit pick on naming: why is set_neighbors but add_edge?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Arc::new(dataset)
}

// #[tokio::test]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why commented out?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

.row(idx)
.ok_or(Error::Index("Invalid row index".to_string()))?;

let dists = l2_distance(query, vector, vectors.num_columns())?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: support other distance metrics?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Ok(())
}

/// Write index file.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so right now nothing is actually writing the index to disk?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it only write graphs to disk (as *.lance), the metadata (protobuf) will be written in the following PRs.

.ok_or_else(|| Error::Index("Cannot find the medoid of an empty matrix".to_string()))?;

let dist_func = metric_type.func();
// Find the closest vertex to the centroid.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think this is the definition of the medoid is it? I thought medoid is the element in a cluster whose distance to all of the other elements is minimized.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They seems to start to use "closest one to centroid" since the FreshDiskANN paper, maybe due to the time complexity of finding medoid.

I can change the function name tho.

let mut candidates: BTreeMap<OrderedFloat<f32>, usize> = BTreeMap::new();
let mut heap: BinaryHeap<VertexWithDistance> = BinaryHeap::new();
let dist = distance_to(vectors, query, start)?;
heap.push(VertexWithDistance {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the heap is never more than one element?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so why use the heap at all?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got lost of a few coommits. Refactored and fixed.

}
}

Ok((
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so currently this is searching through all neighbors of the starting vertex and returning the K closest?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes.

@eddyxu eddyxu changed the title [Rust] Write diskann index [Rust] Build DiskANN index Apr 17, 2023
Copy link
Copy Markdown
Contributor

@changhiskhan changhiskhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple of nits

Comment thread rust/src/index/vector.rs
pub fn func(&self) -> Arc<dyn Fn(&[f32], &[f32]) -> f32> {
match self {
Self::L2 => Arc::new(l2_distance),
Self::Cosine => todo!("cosine distance"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when does this get used vs batch? does this mean we can't choose cosine distance for index?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am still refactoring the cosine to linalg module, similar to what does to l2 in #781 , after that, will have a cosine function returned here.

impl VertexSerDe<PQVertex> for PQVertexSerDe {
#[inline]
fn size(&self) -> usize {
8 /* row_id*/ + 8 /* Length of pq */ + self.num_sub_vectors * 1 /* only 8 bits now */
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. only 8 bits now => isn't it parameterized above as pq_num_bits?
  2. what's "length of pq" (8) ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only 8 bits now => isn't it parameterized above as pq_num_bits?

we have not finalized how to materialize pq bits yet. i.e., 4bits and 12 bits, whether it takes ceiling or compress a few pq into the shared bytes.

what's "length of pq"

Length of the PQ code in the vector, i.e., sub_vectors.

Comment thread rust/src/index/vector/graph/builder.rs Outdated
fn distance(&self, a: usize, b: usize) -> Result<f32> {
let vector_a = self.data.row(a).ok_or_else(|| {
Error::Index(format!(
"Attempt to access row {} in a matrix with {} rows",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wht does this mean?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Means the vector id is out of bound (of what are hold in the matrix view). Suggestion for rewording?

pub fn add_edge(&mut self, from: usize, to: usize) {
self.nodes[from].neighbors.push(to as u32);
/// Set neighbors of a node.
pub fn set_neighbors(&mut self, id: usize, neighbors: impl Into<Vec<u32>>) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be vertex as well?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vertex? or Graph?

The graph interface will be shared with persisted graph as well, so the greedy_search will shared between index and graph builder.

@eddyxu eddyxu merged commit 220dc8f into main Apr 18, 2023
@eddyxu eddyxu deleted the lei/diskann branch April 18, 2023 02:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants