[Rust] Persist graph using lance file format. #756

Merged
eddyxu merged 9 commits into main from lei/graph_index on Apr 6, 2023

@eddyxu eddyxu commented Apr 6, 2023

We are going to use Lance to persist the graph, because Lance already offers efficient random access, which suits graph traversal.

This PR writes a graph as a two-column table:

struct<
   vertex: FixedSizeBinary(...),
   neighbours: list<u32>,
>

It requires the vertex struct to be serializable to a fixed-size binary value.

It will be the GraphBuilder's responsibility to reorganize the graph vertices in a way that preserves a certain degree of locality.
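To make the layout concrete, here is a minimal in-memory sketch of the two-column table, using only the standard library. `GraphTable`, `push`, and `get` are hypothetical names for illustration, not types from this PR, and the sketch assumes a row's position doubles as its vertex id.

```rust
use std::convert::TryFrom;

/// In-memory stand-in for the two-column table (hypothetical, not the PR's type).
pub struct GraphTable {
    /// Width in bytes of every serialized vertex (the FixedSizeBinary width).
    vertex_width: usize,
    /// Column 1: all serialized vertices, packed back to back.
    vertices: Vec<u8>,
    /// Column 2: one neighbour list per row, holding row positions.
    neighbours: Vec<Vec<u32>>,
}

impl GraphTable {
    pub fn new(vertex_width: usize) -> Self {
        assert!(vertex_width > 0, "vertex width must be positive");
        GraphTable { vertex_width, vertices: Vec::new(), neighbours: Vec::new() }
    }

    /// Append one row and return its position, which serves as the vertex id.
    pub fn push(&mut self, vertex: &[u8], neighbours: Vec<u32>) -> Result<u32, String> {
        if vertex.len() != self.vertex_width {
            return Err(format!("expected {} bytes, got {}", self.vertex_width, vertex.len()));
        }
        let id = u32::try_from(self.neighbours.len())
            .map_err(|_| "too many rows for u32 ids".to_string())?;
        self.vertices.extend_from_slice(vertex);
        self.neighbours.push(neighbours);
        Ok(id)
    }

    /// Random access: row `id` sits at a fixed byte offset, so no scan is needed.
    pub fn get(&self, id: u32) -> Option<(&[u8], &[u32])> {
        let i = id as usize;
        let nbrs = self.neighbours.get(i)?;
        let start = i * self.vertex_width;
        Some((&self.vertices[start..start + self.vertex_width], nbrs.as_slice()))
    }
}
```

Because every vertex has the same width, looking up row `id` is a multiplication plus a slice, which is the property the fixed-size requirement buys.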

@eddyxu eddyxu self-assigned this Apr 6, 2023
mod persisted;

/// Vertex (metadata). It does not include the actual data.
pub trait Vertex: Sized {
Contributor

What do the bytes represent? Or are they just arbitrary bytes at the trait level?

Member Author

These are just the serialized vertex bytes, so that different graph implementations can share the same on-disk graph design.
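As an illustration of "serialized vertex", the sketch below shows one way a vertex type could round-trip through a fixed number of bytes. The trait `FixedSizeVertex`, its methods, and the `RowVertex` fields are all hypothetical; the methods of the actual `Vertex` trait are not shown in this conversation.

```rust
/// Hypothetical stand-in for a vertex that serializes to a fixed number of bytes.
pub trait FixedSizeVertex: Sized {
    /// Serialized width in bytes; identical for every value of the type.
    const WIDTH: usize;
    fn to_bytes(&self) -> Vec<u8>;
    fn from_bytes(bytes: &[u8]) -> Option<Self>;
}

/// Example vertex metadata (assumed fields): a dataset row id plus a cluster id.
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct RowVertex {
    pub row_id: u64,
    pub cluster: u32,
}

impl FixedSizeVertex for RowVertex {
    // 8 bytes for the u64 row id + 4 bytes for the u32 cluster id.
    const WIDTH: usize = 12;

    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(Self::WIDTH);
        out.extend_from_slice(&self.row_id.to_le_bytes());
        out.extend_from_slice(&self.cluster.to_le_bytes());
        out
    }

    fn from_bytes(bytes: &[u8]) -> Option<Self> {
        if bytes.len() != Self::WIDTH {
            return None;
        }
        let mut row = [0u8; 8];
        row.copy_from_slice(&bytes[..8]);
        let mut cluster = [0u8; 4];
        cluster.copy_from_slice(&bytes[8..]);
        Some(RowVertex {
            row_id: u64::from_le_bytes(row),
            cluster: u32::from_le_bytes(cluster),
        })
    }
}
```

A different graph implementation could use another vertex type with another width, and the storage layer would still only see opaque fixed-width byte strings.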

#[derive(Debug)]
pub(crate) struct Node<V: Vertex> {
pub(crate) vertex: V,
pub(crate) neighbors: Vec<u32>,
Contributor

Are the neighbors other Nodes, other Vertexes, or row ids? If they are row ids, should they be u64?

Member Author

This neighbor id / vertex id is the vertex's position in the graph index file, which will be in a different order from the "row id" that points to the original vector in the dataset.

u32 can support up to 4B vectors per index. I feel that beyond that point, we would need to apply IVF first anyway. The main consideration is that, compared with u64, it cuts the I/O and memory footprint of the neighbor lists in half.
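The size argument can be checked with a little arithmetic. The helper below is a hypothetical sketch that counts only the raw neighbor ids and ignores list offsets and any file-format overhead; the graph size and degree used in the usage note are made-up illustrative numbers.

```rust
use std::mem::size_of;

/// Bytes needed for the raw neighbor ids of a graph with `num_vertices`
/// vertices and `degree` neighbors each, when ids have type `Id`.
/// Offsets and file-format overhead are deliberately not counted.
pub fn neighbour_bytes<Id>(num_vertices: u64, degree: u64) -> u64 {
    num_vertices * degree * size_of::<Id>() as u64
}

/// Number of vertices addressable by a u32 id (ids 0 through u32::MAX).
pub const MAX_U32_VERTICES: u64 = u32::MAX as u64 + 1;
```

For example, with 1,000,000 vertices and 64 neighbors each, `neighbour_bytes::<u32>` gives 256,000,000 bytes while `neighbour_bytes::<u64>` gives exactly twice that, and `MAX_U32_VERTICES` is 4,294,967,296, which is the "4B" limit mentioned above.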

let schema = Schema::try_from(arrow_schema.as_ref())?;

let mut writer = FileWriter::try_new(object_store, path, &schema).await?;
for nodes in graph.nodes.as_slice().chunks(params.batch_size) {
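The loop above writes the graph one batch of nodes at a time. As a rough, std-only sketch of what turning each chunk into two columns could look like, the code below packs the vertex bytes and flattens the neighbor lists into offsets plus values, the way Arrow's List type does with 32-bit offsets. `Batch` and `to_batches` are hypothetical names, and nodes are modeled as plain `(vertex_bytes, neighbors)` tuples rather than the PR's `Node<V>`.

```rust
/// One write batch in columnar form (hypothetical, Arrow-like layout).
pub struct Batch {
    /// Fixed-width serialized vertices, packed back to back.
    pub vertex_bytes: Vec<u8>,
    /// List offsets into `values`; length is number of rows + 1.
    pub offsets: Vec<i32>,
    /// All neighbor ids of the batch, concatenated.
    pub values: Vec<u32>,
}

/// Split `nodes` into batches of at most `batch_size` rows.
pub fn to_batches(nodes: &[(Vec<u8>, Vec<u32>)], batch_size: usize) -> Vec<Batch> {
    // slice::chunks panics on a zero chunk size, so fail early with a clear message.
    assert!(batch_size > 0, "batch_size must be positive");
    nodes
        .chunks(batch_size)
        .map(|chunk| {
            let mut batch = Batch {
                vertex_bytes: Vec::new(),
                offsets: vec![0],
                values: Vec::new(),
            };
            for (vertex, neighbours) in chunk {
                batch.vertex_bytes.extend_from_slice(vertex);
                batch.values.extend_from_slice(neighbours);
                batch.offsets.push(batch.values.len() as i32);
            }
            batch
        })
        .collect()
}
```

Note that `chunks` yields a shorter final batch when the node count is not a multiple of `batch_size`, so a writer has to accept batches of varying length.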
Contributor

Also, is there any other graph metadata that could be written separately and would be useful without reading the graph?

Member Author

Yes, there will be other metadata, stored as a protobuf index in the index directory as well.

This PR just serializes the graph itself.

Most likely, there will be a directory structure like:

/dataset/_index/1234-5678/index.pb <- same metadata pb struct
/dataset/_index/1234-5679/graph.lance <- graph file

In the case of HNSW-ish layered graphs, the directory layout will look like:

/dataset/_index/1234-5678/index.pb <- same metadata pb struct
/dataset/_index/1234-5679/layer0.lance
/dataset/_index/1234-5679/layer1.lance
...
/dataset/_index/1234-5679/layerN.lance
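A small sketch of how the per-layer file names in such a layout could be generated. `layer_paths` is a hypothetical helper, not part of this PR, and the layout itself is described above only as likely, so treat the naming as an assumption.

```rust
use std::path::{Path, PathBuf};

/// Paths of the per-layer graph files under one index directory,
/// following the `layer0.lance` to `layerN.lance` naming sketched above.
/// `num_layers` is the number of layers, so the last file is layer `num_layers - 1`.
pub fn layer_paths(index_dir: &Path, num_layers: usize) -> Vec<PathBuf> {
    (0..num_layers)
        .map(|i| index_dir.join(format!("layer{}.lance", i)))
        .collect()
}
```

Calling `layer_paths(Path::new("/dataset/_index/1234-5679"), 3)` yields paths ending in `layer0.lance`, `layer1.lance`, and `layer2.lance`.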

@changhiskhan changhiskhan left a comment

Please add some comments indicating what the vertex bytes and neighbor integers represent. I feel they will easily be forgotten without some explicit documentation.

eddyxu commented Apr 6, 2023

> Please add some comments indicating what the vertex bytes and neighbors integers indicate. I feel they will easily be forgotten without some explicit instructions

Done

@eddyxu eddyxu merged commit 02ec8e7 into main Apr 6, 2023
@eddyxu eddyxu deleted the lei/graph_index branch April 6, 2023 22:02