[Rust] Persist graph using lance file format. #756
mod persisted;

/// Vertex (metadata). It does not include the actual data.
pub trait Vertex: Sized {
what do the bytes represent? Or are they just arbitrary bytes at the trait level?
They are just the serialized vertex, so that different graph implementations can share the same on-disk graph design.
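To make the "serialized vertex" idea concrete, here is a minimal sketch of a vertex that round-trips through a fixed number of bytes. The trait shape (`BYTE_LENGTH`, `to_bytes`, `from_bytes`) and the `RowVertex` type are hypothetical and only for illustration; the `Vertex` trait in this PR may look different.

```rust
use std::convert::TryFrom;

/// Hypothetical trait shape, NOT the trait from this PR.
/// Every vertex serializes to exactly `BYTE_LENGTH` bytes.
pub trait Vertex: Sized {
    /// Fixed size, in bytes, of a serialized vertex.
    const BYTE_LENGTH: usize;

    /// Serialize the vertex metadata into a byte buffer.
    fn to_bytes(&self) -> Vec<u8>;

    /// Rebuild a vertex from bytes; `None` if the length does not match.
    fn from_bytes(data: &[u8]) -> Option<Self>;
}

/// Illustrative vertex that only stores the row id of the original vector.
#[derive(Debug, PartialEq)]
pub struct RowVertex {
    pub row_id: u64,
}

impl Vertex for RowVertex {
    const BYTE_LENGTH: usize = 8;

    fn to_bytes(&self) -> Vec<u8> {
        // Little-endian is an arbitrary choice made for this sketch.
        self.row_id.to_le_bytes().to_vec()
    }

    fn from_bytes(data: &[u8]) -> Option<Self> {
        // Conversion fails unless the slice is exactly 8 bytes long.
        let raw = <[u8; 8]>::try_from(data).ok()?;
        Some(Self {
            row_id: u64::from_le_bytes(raw),
        })
    }
}
```

Because the on-disk layer only sees opaque fixed-size bytes, each graph implementation is free to choose its own vertex contents.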
#[derive(Debug)]
pub(crate) struct Node<V: Vertex> {
    pub(crate) vertex: V,
    pub(crate) neighbors: Vec<u32>,
Are the neighbors other Nodes, other Vertexes, or row ids? If they are row ids, should they be u64?
This neighbor id / vertex id is the location in the graph index file, which is in a different order from the "row id" that points to the original vector in the dataset.
u32 can support up to 4B vectors per index. I feel that beyond that point we would need to apply IVF first. The main consideration is that, compared with u64, it cuts the I/O and memory footprint of the neighbor lists in half.
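A quick back-of-the-envelope check of the "half" claim, as a sketch. The vertex count (1,000,000) and the 64 neighbors per vertex mentioned below are made-up illustration values, not numbers from this PR, and `neighbor_bytes` is a hypothetical helper.

```rust
/// Bytes needed to store every neighbor list when ids have type `Id`.
/// Hypothetical helper for illustration only.
pub fn neighbor_bytes<Id>(num_vertices: u64, degree: u64) -> u64 {
    // One id of `size_of::<Id>()` bytes per (vertex, neighbor) pair.
    num_vertices * degree * (std::mem::size_of::<Id>() as u64)
}

/// Largest vertex id that fits in a u32, i.e. the per-index limit
/// implied by using u32 neighbor ids.
pub const MAX_U32_ID: u64 = u32::MAX as u64;
```

With 1,000,000 vertices and 64 neighbors each, `neighbor_bytes::<u32>` gives 256,000,000 bytes and `neighbor_bytes::<u64>` gives 512,000,000 bytes, so the neighbor column is exactly twice as large with u64 ids.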
let schema = Schema::try_from(arrow_schema.as_ref())?;

let mut writer = FileWriter::try_new(object_store, path, &schema).await?;
for nodes in graph.nodes.as_slice().chunks(params.batch_size) {
Also, is there any other graph metadata that could be written separately and would be useful without reading the graph?
Yes, there will be other metadata stored as a protobuf index in the index directory as well.
This PR just serializes the graph itself.
Likely, there will be a directory structure like:
/dataset/_index/1234-5678/index.pb <- same metadata pb struct
/dataset/_index/1234-5679/graph.lance <- graph file
In the case of HNSW-ish layered graphs, the directory layout will look like:
/dataset/_index/1234-5678/index.pb <- same metadata pb struct
/dataset/_index/1234-5679/layer0.lance
/dataset/_index/1234-5679/layer1.lance
...
/dataset/_index/1234-5679/layerN.lance
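As a small sketch of the layered naming scheme above: the helper name `layer_path` is hypothetical, and the layout itself is only described as "likely" in this comment, so treat this as an illustration and not as the final on-disk format.

```rust
/// Build the path of one layer file inside an index directory.
/// Hypothetical helper; follows the `layer{N}.lance` naming sketched above.
pub fn layer_path(index_dir: &str, layer: usize) -> String {
    // Drop any trailing slash so the separator is not doubled.
    format!("{}/layer{}.lance", index_dir.trim_end_matches('/'), layer)
}
```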
changhiskhan
left a comment
Please add some comments indicating what the vertex bytes and neighbor integers represent. I feel they will easily be forgotten without some explicit explanation.
Done
We are going to use lance to persist the graph, because lance already offers good random access, which is a good fit for graph access.
This PR writes a graph as a two-column table, with one column for the serialized vertex and one for its neighbors.
It requires the vertex struct to be serializable to fixed-size binary.
It will be the GraphBuilder's responsibility to reorganize the graph vertices in a way that preserves some locality.
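The design above can be sketched with a plain in-memory stand-in: one fixed-width byte column for vertices and one variable-length list column for neighbor ids, where a vertex id is simply the row position. This does not use the Lance or Arrow APIs; `GraphTable` and its methods are hypothetical names chosen for illustration.

```rust
/// In-memory stand-in for the two-column graph table (not the Lance API).
pub struct GraphTable {
    /// Width in bytes of every serialized vertex.
    vertex_width: usize,
    /// Fixed-size binary column: row `i` lives at `[i * w, (i + 1) * w)`.
    vertices: Vec<u8>,
    /// List offsets: neighbors of row `i` are `neighbors[offsets[i]..offsets[i + 1]]`.
    offsets: Vec<usize>,
    /// Flattened neighbor ids; each id is a row position in this same table.
    neighbors: Vec<u32>,
}

impl GraphTable {
    pub fn new(vertex_width: usize) -> Self {
        Self {
            vertex_width,
            vertices: Vec::new(),
            offsets: vec![0],
            neighbors: Vec::new(),
        }
    }

    /// Number of rows (vertices) stored so far.
    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Append one row and return its vertex id (its row position).
    pub fn push(&mut self, vertex: &[u8], neighbors: &[u32]) -> Result<u32, String> {
        if vertex.len() != self.vertex_width {
            return Err(format!(
                "expected {} vertex bytes, got {}",
                self.vertex_width,
                vertex.len()
            ));
        }
        let id = u32::try_from(self.len()).map_err(|_| "too many vertices for u32 ids".to_string())?;
        self.vertices.extend_from_slice(vertex);
        self.neighbors.extend_from_slice(neighbors);
        self.offsets.push(self.neighbors.len());
        Ok(id)
    }

    /// Random access to the serialized vertex of row `id`.
    pub fn vertex(&self, id: u32) -> Option<&[u8]> {
        let i = id as usize;
        if i >= self.len() {
            return None;
        }
        Some(&self.vertices[i * self.vertex_width..(i + 1) * self.vertex_width])
    }

    /// Random access to the neighbor ids of row `id`.
    pub fn neighbors(&self, id: u32) -> Option<&[u32]> {
        let i = id as usize;
        if i >= self.len() {
            return None;
        }
        Some(&self.neighbors[self.offsets[i]..self.offsets[i + 1]])
    }
}
```

Because vertices have a fixed width, looking up row `id` is a single offset computation. Following a neighbor id is just another row lookup, so a builder that places related vertices in nearby rows keeps traversal reads close together.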