Milvus Introduction, Best Practices, and Cheat Sheet Tutorial

Milvus Introduction

🐦 Milvus is an open-source (Apache License 2.0) vector database. It's a powerful tool to store, index, and manage unstructured data as embedding vectors generated by deep neural networks and other machine learning (ML) models. Unstructured data includes web pages, text files, PDFs, videos, images, and audio files.

Zilliz Cloud is a proprietary, managed service for Milvus.

  1. Successful AI applications require good use of data. Embedding models are the state-of-the-art tool for understanding and retrieving unstructured data.

  2. Unstructured data is embedded into vectors, and Milvus is a purpose-built database for vector data. 🤖 The AI jargon term for this is vector database.

  3. Vector retrieval is at the heart of many AI applications such as Retrieval Augmented Generation.

Integrations for Zilliz include AWS, GCP, and Azure clouds. The Milvus philosophy is to be a low-level "shovel" in the AI stack. 🦙✨🤗 You should be able to choose for yourself which embedding, fusion, LLM, or generation models you want. 🦜⛓️ Milvus is also agnostic to the choice of RAG framework, such as LlamaIndex or LangChain.

Models and tools in the AI space are changing rapidly! As a vector database company, we will have our opinions, but you should be free to choose the latest, best AI tools for your use case.

Quick Start

💡Zilliz Pipelines is a quick way to try out Milvus. It's also integrated into LlamaIndex. It has built-in:

  • Open-source embedding models bge-large-en-v1.5 and bge-large-zh-v1.5
  • Good out-of-box retrieval quality backed by our research on doc parsing and chunking strategy
  • AUTOINDEX, Zilliz proprietary
  • Metadata filtering capability

Architecture

Milvus uses a shared-storage architecture with 4 layers, which are mutually independent for scaling or disaster recovery: 1) access layer, 2) coordinator service, 3) worker nodes, and 4) storage. Milvus also includes data sharding, logs-as-data persistence, and streaming data ingestion.

Documentation & Releases

  • Open-source Milvus documentation
  • Open-source Milvus Client documentation (no-schema wrapper around Milvus collection)
  • Commercial-source Zilliz documentation
  • Zilliz release notes
  • Zilliz serverless free tier
    • Max 1 cluster
    • Max 2 collections per cluster
    • Max 1 million vectors per collection
    • To upgrade to beta, contact Zilliz support for help; they will need your cluster's ID
  • Zilliz enterprise tier for production use cases:
    • 99.9% uptime availability
    • Multiple availability zones
    • Enterprise-grade encryption in transit and at rest
    • SOC 2 Type 2 compliant
    • RBAC (Role-Based Access Control) at org and project levels
    • Resource monitors and alert notifications
    • Self-upgrade to beta: click the "try beta" button to upgrade a cluster
    • 24/7/365 email and Discord support with response time SLAs (Urgent: 1 hour; High: 4 hours; Normal: 1 business day)
    • Zilliz bring-your-own-cloud is planned for Q2 2024

Getting Started with Milvus Tutorial

The following are best practices for getting your data into Milvus so you can start developing AI applications.

  1. Start up Milvus server and connect.
  • 💡👉🏼The easiest way is to use Zilliz serverless free tier. No need to worry about a sufficient .wait() before connecting; it's always there!

  • Milvus can run locally. Milvus flavors include lite, docker, or k8s.

  • Zilliz runs Milvus in the cloud. Zilliz flavors include free tier (serverless) or paid (managed AWS, Google Cloud, Azure).

  • See Example connection notebook at the bottom of this page.

2. Choose your chunking strategy based on the type of data. Unstructured data needs to be chunked, embedded (converted into vectors), and the vectors stored as tensors, which are vectors tied to specific compute hardware (CPU, GPU, TPU, etc). Tensors are the lingua franca of AI.

  • Good backgrounder on chunking strategies.

  • Most general NLP tasks work best with a chunk size of 512 and 10-15% overlap.

  • Web page data performs best with a chunking strategy that adds headers to chunks. Since headers are short, the added context per chunk is usually worth it.
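
The chunking guidance above can be sketched in a few lines of plain Python. This is a minimal illustration, not the method used by any particular framework: `chunk_tokens` and `add_header` are hypothetical helper names, and a simple list of token strings stands in for a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=512, overlap_ratio=0.1):
    """Split a token list into fixed-size chunks that overlap by a fraction."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)  # tokens shared by neighbors
    step = chunk_size - overlap                # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


def add_header(chunk_text, headers):
    """Prefix a chunk with its page headers (the web page strategy)."""
    return " > ".join(headers) + "\n" + chunk_text


# Assumption: 1200 placeholder tokens instead of real tokenizer output.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=512, overlap_ratio=0.1)
print([len(c) for c in chunks])
print(add_header("Milvus supports several index types.", ["Docs", "Indexes"]))
```

With a 10% overlap, each chunk repeats the last 51 tokens of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.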

3. Use 1 embedding model per collection. The collection's vector space often comes from the next-to-last hidden layer of a deep neural network model. The model up to this layer acts as a transformation function that maps your input unstructured data to a vector of numbers (often 1024 dimensions). In order for vector similarity to work, all the data, including the questions, needs to be embedded (mapped from inputs to vectors) into the same vector space. That way concepts in that space can be searched. For this reason, it is best practice to use just 1 embedding model per collection.
  • 💡👉🏼Open source embedding models perform on par with commercial embedding models. OSS models have the benefits of high recall and free access to your own data. For example, open the MTEB leaderboard and sort descending by the column "Retrieval Average". Notice UAE-Large-V1 is ranked 4th best and takes only 1.34 GB of memory. Compare this to OpenAI's ada-002, which is ranked 25th. (Website accessed on Dec 30, 2023.)

  • Fine-tune your embedding model using your data and your task for potentially 10-15% improved retrieval. Only OSS embedding models can be tuned.

4. Create a collection. A collection is like a database table. Each collection has a name, index, schema, and consistency-level.

  • 💡👉🏼The easiest approach is to use the no-schema Milvus Client. Milvus Client is a wrapper around the Milvus collection object that uses a flexible JSON key:value format to allow collection creation without needing a schema up front. This is the least error-prone approach for getting started. See Example search notebook at the bottom of this page.
from pymilvus import MilvusClient

COLLECTION_NAME = "MilvusDocs"
EMBEDDING_LENGTH = 1024

M = 16  # HNSW: max number of graph connections per node
INDEX_PARAMS = {
    "M": M,
    "efConstruction": M * 2}
index_params = {
    "index_type": "HNSW", 
    "metric_type": "COSINE", 
    "params": INDEX_PARAMS
    }

# Use the no-schema Milvus client, which uses a flexible JSON key:value format.
mc = MilvusClient(
    uri=CLUSTER_ENDPOINT,
    # API key or a colon-separated cluster username and password
    token=TOKEN)

# Check if the collection already exists; if so, drop it.
if COLLECTION_NAME in mc.list_collections():
    mc.drop_collection(COLLECTION_NAME)

# Create the collection.
mc.create_collection(COLLECTION_NAME, 
                     EMBEDDING_LENGTH,
                     consistency_level="Eventually", 
                     auto_id=True,  
                     overwrite=True,
                     # skip setting params below, if using AUTOINDEX
                     params=index_params
                    )
print(mc.describe_collection(COLLECTION_NAME))
  • Metadata limit:  64 fields per row. These are the extra fields besides "pk" and "vector".

  • If you define your schema up front, check docs for schema types.

    • primary key (usually called "pk"), default type INT64 (Note: LangChain expects "pk" to be type string.)
    • embeddings (usually called "vector"), type list of numpy.ndarray of numpy.float32 numbers
    • Strings, type VARCHAR, Max Length 65535 characters.  Best practice: Use max length in schema. Actual data won't use that much space.
from pymilvus import DataType, FieldSchema

EMBEDDING_LENGTH = 1024
MAX_LENGTH = 65535
fields = [
  FieldSchema("pk", DataType.INT64, is_primary=True, auto_id=True), 
  FieldSchema("vector", DataType.FLOAT_VECTOR, dim=EMBEDDING_LENGTH),
  FieldSchema(name='url', dtype=DataType.VARCHAR, max_length=MAX_LENGTH),
]
  • Partitions are meant to isolate entities in different physical paths to restrict search scope.

  • Milvus supports 2 types of partitions. Both types are equally fast, so the choice is up to you!

    a) MANUAL - only use this if you can ensure approximately equal 20-100K rows per partition. Users specify which entity belongs to which partition. Partitions can be added or deleted at any time. Partition name needs to be included as a search parameter.

    b) AUTOMATIC - Milvus automatically distributes entities into different partitions. No need to specify partition name when searching; Milvus will automatically translate your metadata filter expression to find data from partitions.

  • Partitioning Tips:

    • 💡👉🏼 Best practice is to leave it to Milvus to automatically partition data and translate metadata filters into search mappings.
    • For now, RBAC is only at the collection or project level, so it is not possible to control visibility of partitions to different users.
    • For manual partitions, 20-100K rows per partition is recommended, otherwise search speed will be slower than with automatic partitions.
    • The max number of partitions in a collection is 4096.

5. Build an index (i.e. search algorithm used to find nearest-neighbors across tensors). Data is saved in data structures according to the particular search algorithm index - hashes, trees, or graphs.

  • Blog: Choosing the right index for your project.

  • 💡👉🏼With Milvus Client, you will need to define your own HNSW index. Otherwise search might be slow (Milvus Client uses IVF_Flat index by default).

  • HNSW best practice params: start with M in the range 4-64, using a larger M for larger data and larger embedding lengths. Then set ef = efConstruction = M * 2.

  • Pro tip: Use AUTOINDEX, except if you are using Milvus Client. AUTOINDEX defaults to HNSW in Milvus. In Zilliz, AUTOINDEX will choose the best index automatically based on your data and type of compute running on the cluster.
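
The HNSW rule of thumb above (pick M, then set ef = efConstruction = M * 2) can be captured in a small helper. This is a sketch only: `hnsw_params` is a hypothetical name, and the `limit` floor on `ef` reflects an assumption that Milvus expects the search-time `ef` to be at least the number of results requested, which is worth checking against the index documentation for your version.

```python
def hnsw_params(m=16, metric_type="COSINE", limit=10):
    """Build index and search parameter dicts from the M * 2 rule of thumb."""
    if not 4 <= m <= 64:
        raise ValueError("M is expected to be in the range 4-64")
    ef = m * 2
    index_params = {
        "index_type": "HNSW",
        "metric_type": metric_type,
        "params": {"M": m, "efConstruction": ef},
    }
    # Assumption: search-time ef should not be smaller than the top-k limit.
    search_params = {"ef": max(ef, limit)}
    return index_params, search_params


index_params, search_params = hnsw_params(m=16)
print(index_params)
print(search_params)
```

The two dictionaries have the same shape as the `index_params` and `SEARCH_PARAMS` used elsewhere on this page, so they can be passed in the same places.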

6. Choose the distance metric.

  • 💡👉🏼"COSINE" works best for most use cases.

  • Most search algorithms work best with normalized embeddings data. With normalized vectors, the L2 metric adds nothing (since all vectors have the same length, L2 ranks results in the same order as "COSINE"). "IP" (inner product) and "COSINE" are equivalent when the vectors are normalized.

  • Only choose metric="L2" if you plan to keep your embeddings unnormalized.

  • For more speed, fine tune your search index parameters.

  • For more speed with big data, choose an index with vector compression, search 'Quantization-based index' on the index doc page.
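
To see why the metrics collapse for normalized embeddings, here is a small numeric check in plain Python. It is only an illustration with made-up 3-dimensional vectors: for unit-length vectors the inner product equals the cosine similarity, and the squared L2 distance equals 2 - 2 * cosine, so all three metrics produce the same ranking.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up example vectors, normalized to unit length.
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

ip = inner_product(a, b)
cos = cosine(a, b)
l2sq = l2_squared(a, b)

print(f"IP={ip:.4f}  COSINE={cos:.4f}  L2^2={l2sq:.4f}  2-2*cos={2 - 2 * cos:.4f}")
```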

7. Choose the consistency level.

  • 💡👉🏼For typical usage (e.g. tables updated every 30 minutes or longer), use "Eventually" for fastest performance.

  • The 4 available levels of consistency:

    • Strong - Real-time everyone sees the same thing.
    • Eventually - Soon everyone sees the same thing.
    • Session - Per session, data is up to date with all writes within session.
    • Bounded - Within a shorter amount of time than eventually, everyone sees the same thing.
  • You specify consistency in 2 places:

    • In collection.create_collection() - Set the default value.
    • In collection.search() - Possible to override the default value.

8. Insert data into the collection.

  • Milvus supports loading data from:

    • pandas dataframes, or
    • list of dictionaries
  • 💡👉🏼Milvus Client wrapper can only handle loading data from a list of dictionaries.

import time

# `batch` is assumed to be a pandas DataFrame of rows to insert.
# Convert it to a list of dictionaries, one per row.
dict_list = batch.to_dict(orient="records")

print("Start inserting entities")
start_time = time.time()
insert_result = mc.insert(
    COLLECTION_NAME,
    data=dict_list,
    progress_bar=True)
end_time = time.time()
print(f"Milvus insert time for {batch.shape[0]} vectors: {end_time - start_time} seconds")
# After final entity is inserted, call flush to stop growing segments left in memory.
mc.flush(COLLECTION_NAME)

9. Search across all your data. The Milvus search default is semantic search across tensors using approximate nearest neighbor distances in vector space, sometimes described as stochastic fuzzy search. The search algorithm used is based on the index you chose when you set up your collection.

# Embed the question using the same encoder.
query_embeddings = _utils.embed_query(encoder, [SAMPLE_QUESTION])

# Return top k results with HNSW index.
SEARCH_PARAMS = dict({
    "ef": INDEX_PARAMS['efConstruction']
    })

# Define output fields to return.
OUTPUT_FIELDS = ["h1", "h2", "source", "chunk"]

# Run semantic vector search using your query and the vector database.
start_time = time.time()
results = mc.search(
    COLLECTION_NAME,
    data=query_embeddings, 
    search_params=SEARCH_PARAMS,
    output_fields=OUTPUT_FIELDS, 
    # Milvus can utilize metadata in boolean expressions to filter search.
    # filter="pk >= 0",
    limit=3,  # Default top_k = 10
    consistency_level="Eventually"
    )
elapsed_time = time.time() - start_time
print(f"Milvus Client search time for {len(chunk_list)} vectors: {elapsed_time} seconds")

# Inspect search result.
print(f"type: {type(results[0])}, count: {len(results[0])}")
  • Similar in concept to SQL databases, scalar (metadata) filtering can be specified using boolean expressions, in addition to vector search.

    • "filter": "boolean_expression"
    "filter": "email == 'tom@zilliz.com' "
    • Surround any string literals with single quotes (').
    • Chain together boolean_expressions using && (and) or || (or).
    • String match only works on anchored strings.
    "filter":"((DatePublished >= 2000) && (RatingValue > 6.8)) || (MovieName != 'Deepsea Challenge%')"
    • String prefix match using "like" is supported with anchored strings; "in" tests membership in a list of values.
      • "my_string like 'prefix%'"
      • "my_string in ['value1', 'value2']"
    • Array metadata supported >= Milvus v2.3
      • A in ["str1", "str2"]
  • For manual control of semantic search, use range (specific vector distances) search.

  • When dealing with large datasets that do not fit in memory, Milvus offers DiskANN.
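
The boolean filter syntax described above is just a string, so it can be assembled with a few small helpers. This is a sketch with hypothetical helper names (`eq`, `starts_with`, `all_of`, `any_of`); it only builds the expression text and does not validate it against Milvus.

```python
def eq(field, value):
    """Equality test; string literals are wrapped in single quotes."""
    if isinstance(value, str):
        return f"{field} == '{value}'"
    return f"{field} == {value}"


def starts_with(field, prefix):
    """Anchored prefix match using like and a trailing % wildcard."""
    return f"{field} like '{prefix}%'"


def all_of(*exprs):
    """Chain expressions with && (and)."""
    return " && ".join(f"({e})" for e in exprs)


def any_of(*exprs):
    """Chain expressions with || (or)."""
    return " || ".join(f"({e})" for e in exprs)


expr = any_of(
    all_of("DatePublished >= 2000", "RatingValue > 6.8"),
    starts_with("MovieName", "Deepsea"),
)
print(expr)
print(eq("email", "tom@zilliz.com"))
```

The resulting string would be passed as the `filter` argument in the search call shown earlier.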

10. Update data using "upsert" operation. Either insert a new vector if it does not already exist or update data that already exists in the database.
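
The upsert semantics can be illustrated with an in-memory dictionary keyed by primary key. This is not the Milvus API and says nothing about how Milvus implements upsert internally; the `upsert` function and the `store` dictionary are hypothetical stand-ins that only show the insert-or-update behavior.

```python
def upsert(store, rows, pk_field="pk"):
    """Insert rows whose key is new; overwrite rows whose key already exists."""
    inserted, updated = 0, 0
    for row in rows:
        key = row[pk_field]
        if key in store:
            updated += 1
        else:
            inserted += 1
        store[key] = row  # same assignment covers both cases
    return inserted, updated


store = {}
first = upsert(store, [{"pk": 1, "vector": [0.1, 0.2], "url": "a"}])
second = upsert(store, [
    {"pk": 1, "vector": [0.3, 0.4], "url": "b"},  # existing key: update
    {"pk": 2, "vector": [0.5, 0.6], "url": "c"},  # new key: insert
])
print(first, second, store[1]["url"])
```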

11. "Query" operation does not use fuzzy search (semantic search).

  • Example: you want to see whether a certain ProductID already exists. res = collection.query(expr = "ProductID == 100") If len(res) is 0, then no item has a product ID of 100.
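
The existence check can be wrapped in a small function. In this sketch, `product_exists` is a hypothetical helper, and `FakeCollection` is a stand-in that mimics only the `query(expr=...)` call so the example runs without a Milvus server; with a real collection object you would pass that instead.

```python
def product_exists(collection, product_id):
    """Return True if any entity has the given ProductID (exact match)."""
    res = collection.query(expr=f"ProductID == {product_id}")
    return len(res) > 0


class FakeCollection:
    """Illustrative stand-in for a collection; supports 'field == int' only."""

    def __init__(self, rows):
        self.rows = rows

    def query(self, expr):
        field, _, value = expr.partition(" == ")
        return [row for row in self.rows if row.get(field) == int(value)]


collection = FakeCollection([{"ProductID": 100}, {"ProductID": 101}])
print(product_exists(collection, 100))
print(product_exists(collection, 7))
```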

Example Notebooks

  1. Getting started connecting to Milvus: https://github.com/milvus-io/bootcamp/blob/master/bootcamp/milvus_connect.ipynb
  2. Loading and searching IMDB Movie data with Milvus Client: https://github.com/milvus-io/bootcamp/blob/master/bootcamp/Retrieval/imdb_milvus_client.ipynb
  3. Building a RAG Chatbot on website data using open source LLMs (& also using OpenAI): https://github.com/milvus-io/bootcamp/blob/master/bootcamp/RAG/readthedocs_zilliz_langchain.ipynb
  4. Evaluating RAG using Ragas and OpenAI: https://github.com/milvus-io/bootcamp/blob/master/evaluation/evaluate_fiqa_customized_RAG.ipynb
  5. Building an OpenAI agent using LlamaIndex: https://github.com/milvus-io/bootcamp/blob/master/bootcamp/OpenAIAssistants/milvus_agent_llamaindex.ipynb

Learning Resources

Community & Help