Strategy for Inserting 100 Million Documents into Vector Database #32754
Unanswered
MartinMashalov asked this question in Q&A
Replies: 2 comments · 1 reply
-
We recommend batch insert for Milvus.
-
Thank you for your response! How can I do this batch insert (from a JSON file, or through some Python integration)? Any reference to documentation would be super helpful.
Thank you so much!
Martin
…
On May 2, 2024, at 04:42, groot ***@***.***> wrote:
We recommend batch insert for Milvus.
The RPC transfer size limit is 64 MB for each insert call, so it is better to insert data batch by batch, with each batch between 20 and 40 MB.
Each dimension is a float32 value (4 bytes), so a 1536-dim embedding takes about 6 KB and you can insert roughly 3,000 to 7,000 rows per batch. If other metadata accompanies the embeddings, especially long strings, you might reduce the count to 1,000 to 2,000 rows per batch.
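The sizing advice above can be sketched in Python with pymilvus. This is a minimal sketch, not a definitive implementation: the server URI, the collection name `docs`, and the row layout are placeholders not taken from the thread, and the collection is assumed to already exist with a matching schema.

```python
# Minimal sketch of batched insertion into Milvus with pymilvus.
# Assumptions (not from the thread): a Milvus server at localhost:19530,
# an existing collection named "docs", and rows loaded as dicts from JSON.

BATCH_SIZE = 5000  # 1536 dims x 4 bytes ~ 6 KB/vector, so ~30 MB per batch

def batches(rows, size=BATCH_SIZE):
    """Yield successive fixed-size slices of a list of row dicts."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def insert_all(rows, uri="http://localhost:19530", collection="docs"):
    """Insert all rows batch by batch, keeping each RPC under the 64 MB limit."""
    from pymilvus import MilvusClient  # pip install pymilvus
    client = MilvusClient(uri=uri)
    for batch in batches(rows):
        # Each row is a dict matching the collection schema, e.g.
        # {"id": 0, "vector": [...1536 floats...], "text": "..."}
        client.insert(collection_name=collection, data=batch)
```

For 100 million rows at 5,000 rows per call this is about 20,000 insert calls, which can be parallelized across workers if a single writer is too slow.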
-
Hi there!
I am trying to insert 100 million documents into the Milvus vector database, using OpenAI embeddings for the documents. Since I am new to Milvus, I am wondering what the best strategy would be to insert all of these documents efficiently and as quickly as possible. Would batch inserting be a good idea? Has anyone else encountered a similar problem and figured out how to overcome it? I would appreciate any guidance on this issue.
Thank you.