<a href="https://colab.research.google.com/github/milvus-io/bootcamp/blob/master/integration/build_RAG_with_milvus_and_contextual_ai_parser.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>   <a href="https://github.com/milvus-io/bootcamp/blob/master/integration/build_RAG_with_milvus_and_contextual_ai_parser.ipynb" target="_blank">
    <img src="https://img.shields.io/badge/View%20on%20GitHub-555555?style=flat&logo=github&logoColor=white" alt="GitHub Repository"/>
</a>


# Build RAG with Milvus and Contextual AI

**Versions used:**
- Milvus version `2.6.3`
- Contextual AI client `0.9.0`

[Contextual AI Parser](https://docs.contextual.ai/api-reference/parse/parse-file?utm_campaign=Parse-api-integration&utm_source=milvus&utm_medium=github&utm_content=notebook) is a cloud-based document parsing service that excels at extracting structured information from PDFs, DOC/DOCX, and PPT/PPTX files. It provides high-quality markdown extraction with document hierarchy preservation and advanced table extraction, making it ideal for RAG applications.

In this tutorial, we'll show you how to build a Retrieval-Augmented Generation (RAG) pipeline using Milvus and Contextual AI Parser. The pipeline integrates Contextual AI Parser for document parsing, Milvus for vector storage, and OpenAI for generating insightful, context-aware responses.


## Preparation

### Dependencies and Environment

To start, install the required dependencies by running the following command:


In [None]:
! pip install --upgrade "pymilvus[milvus_lite]" contextual-client openai requests rich nest_asyncio tqdm


> If you are using Google Colab, you may need to **restart the runtime** for the newly installed dependencies to take effect (open the "Runtime" menu at the top of the screen and select "Restart session").


### Setting Up API Keys

We will use Contextual AI for document parsing and OpenAI as the LLM in this example. You should prepare the [CONTEXTUAL_API_KEY](https://docs.contextual.ai/user-guides/beginner-guide?utm_campaign=Parse-api-integration&utm_source=milvus&utm_medium=github&utm_content=notebook) and [OPENAI_API_KEY](https://platform.openai.com/docs/quickstart) as environment variables.

If you're running this notebook in Google Colab, you can add your API keys as secrets. The code below dynamically handles both Colab secrets and environment variables.


In [2]:
import os

# API key variable names
contextual_api_key_var = "CONTEXTUAL_API_KEY"
openai_api_key_var = "OPENAI_API_KEY"

# Fetch API keys
try:
    # If running in Colab, fetch API keys from Secrets
    import google.colab
    from google.colab import userdata
    contextual_api_key = userdata.get(contextual_api_key_var)
    openai_api_key = userdata.get(openai_api_key_var)

    if not contextual_api_key:
        raise ValueError(f"Secret '{contextual_api_key_var}' not found in Colab secrets.")
    if not openai_api_key:
        raise ValueError(f"Secret '{openai_api_key_var}' not found in Colab secrets.")
except ImportError:
    # If not running in Colab, fetch API keys from environment variables
    contextual_api_key = os.getenv(contextual_api_key_var)
    openai_api_key = os.getenv(openai_api_key_var)

    if not contextual_api_key:
        raise EnvironmentError(
            f"Environment variable '{contextual_api_key_var}' is not set. "
            "Please define it before running this script."
        )
    if not openai_api_key:
        raise EnvironmentError(
            f"Environment variable '{openai_api_key_var}' is not set. "
            "Please define it before running this script."
        )

os.environ["CONTEXTUAL_API_KEY"] = contextual_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key


### Prepare the LLM and Embedding Model

We initialize the OpenAI client, which handles both embeddings and answer generation, and the Contextual AI client, which handles document parsing.


In [3]:
from openai import OpenAI
from contextual import ContextualAI

openai_client = OpenAI()
contextual_client = ContextualAI(api_key=contextual_api_key)


Define a function that generates text embeddings with the OpenAI client. We use the [text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings) model as an example.


In [4]:
def emb_text(text):
    return (
        openai_client.embeddings.create(input=text, model="text-embedding-3-small")
        .data[0]
        .embedding
    )


Generate a test embedding and print its dimension and first few elements.


In [5]:
test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])


1536
[0.009889289736747742, -0.005578675772994757, 0.00683477520942688, -0.03805781528353691, -0.01824733428657055, -0.04121600463986397, -0.007636285852640867, 0.03225184231996536, 0.018949154764413834, 9.352207416668534e-05]


## Process Data Using Contextual AI Parser

Contextual AI Parser can parse various document formats into structured markdown with document hierarchy preservation. The parser handles complex documents with images, tables, and hierarchical structures, providing multiple output formats including:
- `markdown-document`: Single concatenated markdown output
- `markdown-per-page`: Page-by-page markdown output
- `blocks-per-page`: Structured JSON with document hierarchy

For a full list of supported input and output formats, please refer to [the official documentation](https://docs.contextual.ai/api-reference/parse/parse-file?utm_campaign=Parse-api-integration&utm_source=milvus&utm_medium=github&utm_content=notebook).

In this tutorial, we will parse two distinct document types: a research paper and a table-rich document. We'll use the `blocks-per-page` format to extract structured chunks suitable for downstream RAG tasks.


In [6]:
import requests
import asyncio
import nest_asyncio

# Documents to parse with Contextual AI
documents = [
    {
        "url": "https://arxiv.org/pdf/1706.03762",
        "title": "Attention Is All You Need",
        "type": "research_paper",
        "description": "Seminal transformer architecture paper that introduced self-attention mechanisms"
    },
    {
        "url": "https://raw.githubusercontent.com/ContextualAI/examples/refs/heads/main/03-standalone-api/04-parse/data/omnidocbench-text.pdf",
        "title": "OmniDocBench Dataset Documentation",
        "type": "table_rich_document",
        "description": "Dataset documentation with large tables demonstrating table extraction capabilities"
    }
]

job_data = []

# Submit parse jobs
for doc in documents:
    print(f"Submitting parse job for: {doc['title']}")

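    download = requests.get(doc["url"], timeout=60)
    download.raise_for_status()  # Fail early if the download did not succeed
    file_content = download.content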
    with open("temp_file.pdf", "wb") as f:
        f.write(file_content)

    with open("temp_file.pdf", "rb") as fp:
        response = contextual_client.parse.create(
            raw_file=fp,
            parse_mode="standard",
            enable_document_hierarchy=True,
            enable_split_tables=False,
            figure_caption_mode="concise"
        )

    job_data.append({"document": doc, "job_id": response.job_id})

print(f"Submitted {len(job_data)} parse jobs. Monitoring status...")

async def wait_for_jobs_async(job_data, max_attempts=20, interval=30.0):
    completed_jobs = set()
    for attempt in range(max_attempts):
        if len(completed_jobs) >= len(job_data):
            return completed_jobs

        for idx, job_info in enumerate(job_data, start=1):
            job_id = job_info["job_id"]
            if job_id in completed_jobs:
                continue

            status = await asyncio.to_thread(contextual_client.parse.job_status, job_id)
            doc_title = job_info["document"]["title"]
            print(f"Job {idx}/{len(job_data)} ({doc_title}): {status.status}")

            if status.status == "completed":
                completed_jobs.add(job_id)
            elif status.status == "failed":
                raise RuntimeError(f"Parse job failed for {doc_title}")

        # Return as soon as every job is done, including on the final attempt
        if len(completed_jobs) >= len(job_data):
            return completed_jobs

        print("Waiting for remaining jobs to complete...")
        await asyncio.sleep(interval)

    raise TimeoutError("Timed out waiting for parse jobs to complete.")

nest_asyncio.apply()
completed_jobs = asyncio.run(wait_for_jobs_async(job_data))
print("All parse jobs completed!\n")

# Retrieve results and extract text chunks
texts = []
for job_info in job_data:
    job_id = job_info["job_id"]
    doc_title = job_info["document"]["title"]

    results = contextual_client.parse.job_results(job_id, output_types=["blocks-per-page"])
    block_count = 0
    for page in results.pages:
        for block in page.blocks:
            if getattr(block, "markdown", None):
                texts.append(block.markdown)
                block_count += 1

    print(f"Extracted {block_count} text blocks from {doc_title}")

print(f"\nTotal chunks extracted: {len(texts)}")

Submitting parse job for: Attention Is All You Need
Submitting parse job for: OmniDocBench Dataset Documentation
Submitted 2 parse jobs. Monitoring status...
Job 1/2 (Attention Is All You Need): processing
Job 2/2 (OmniDocBench Dataset Documentation): processing
Waiting for remaining jobs to complete...
Job 1/2 (Attention Is All You Need): processing
Job 2/2 (OmniDocBench Dataset Documentation): completed
Waiting for remaining jobs to complete...
Job 1/2 (Attention Is All You Need): processing
Waiting for remaining jobs to complete...
Job 1/2 (Attention Is All You Need): completed
All parse jobs completed!

Extracted 134 text blocks from Attention Is All You Need
Extracted 3 text blocks from OmniDocBench Dataset Documentation

Total chunks extracted: 137
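
The loop above keeps only each block's markdown. If you also want to record where a chunk came from (useful for citing sources in answers), a minimal sketch follows. It relies only on the attributes already used above (`pages`, `page.blocks`, `block.markdown`). The helper name `collect_chunks`, the `min_chars` filter, and the output keys are hypothetical, the page number is assumed to be the position of the page in the list, and `SimpleNamespace` objects stand in for a real parse result.

```python
from types import SimpleNamespace


def collect_chunks(pages, doc_title, min_chars=1):
    """Flatten parsed pages into chunk dicts that keep their provenance.

    `pages` is any iterable of objects with a `.blocks` list whose items may
    carry a `.markdown` string, mirroring how the results are read above.
    """
    chunks = []
    for page_number, page in enumerate(pages, start=1):
        for block in page.blocks:
            markdown = getattr(block, "markdown", None)
            # Skip blocks with no markdown or with too little text to be useful
            if not markdown or len(markdown.strip()) < min_chars:
                continue
            chunks.append(
                {"text": markdown, "source": doc_title, "page": page_number}
            )
    return chunks


# Stand-in data shaped like a parse result (not real parser output)
fake_pages = [
    SimpleNamespace(
        blocks=[SimpleNamespace(markdown="# Title"), SimpleNamespace(markdown=None)]
    ),
    SimpleNamespace(blocks=[SimpleNamespace(markdown="Body text on page two.")]),
]
chunks = collect_chunks(fake_pages, "Example Doc")
print(chunks)
```

If such dicts are later inserted together with `id` and `vector`, the extra `source` and `page` keys would be stored as non-schema-defined fields, since this tutorial creates the collection without an explicit schema.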


## Load Data into Milvus

### Create the collection


In [7]:
from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"


> A note on the `MilvusClient` arguments:
> - Setting `uri` to a local file, e.g. `./milvus.db`, is the most convenient method, as it automatically uses [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in this file.
> - If you have a large amount of data, you can set up a more performant Milvus server on [Docker or Kubernetes](https://milvus.io/docs/quickstart.md). In this setup, use the server URI, e.g. `http://localhost:19530`, as your `uri`.
> - If you want to use [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) in Zilliz Cloud.
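
To switch between these three setups without editing the notebook, the connection arguments can be read from the environment. This is only a sketch: `milvus_connection_kwargs` is a hypothetical helper, and `MILVUS_URI` and `MILVUS_TOKEN` are variable names invented for this example, not settings that pymilvus reads on its own.

```python
import os


def milvus_connection_kwargs(env=os.environ):
    """Pick MilvusClient keyword arguments from environment variables.

    MILVUS_URI and MILVUS_TOKEN are hypothetical names chosen for this
    sketch. Without them, the local Milvus Lite file is used.
    """
    kwargs = {"uri": env.get("MILVUS_URI", "./milvus_demo.db")}
    token = env.get("MILVUS_TOKEN")
    if token:
        # Only authenticated deployments (such as Zilliz Cloud) need a token
        kwargs["token"] = token
    return kwargs


print(milvus_connection_kwargs({}))
print(milvus_connection_kwargs({"MILVUS_URI": "http://localhost:19530"}))
```

The result can then be unpacked into the client, as in `MilvusClient(**milvus_connection_kwargs())`.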


Check if the collection already exists and drop it if it does.


In [8]:
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)


Create a new collection with specified parameters.

If we don't specify any field information, Milvus automatically creates a default `id` field as the primary key and a `vector` field to store the vector data. A reserved JSON field stores any fields that are not defined in the schema, along with their values.


In [9]:
milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    # Strong consistency waits for all loads to complete, adding latency with large datasets
    # consistency_level="Strong",  # Supported values are (`"Strong"`, `"Session"`, `"Bounded"`, `"Eventually"`). See https://milvus.io/docs/consistency.md#Consistency-Level for more details.
)
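
With `metric_type="IP"`, a higher score means a closer match. For vectors of unit length, the inner product equals cosine similarity, and OpenAI describes its embeddings as normalized to length 1, so the two rankings should agree for this tutorial's vectors. The pure-Python sketch below illustrates that relationship on toy vectors. The helper names are hypothetical and nothing in it calls Milvus.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Inner product divided by the product of the two vector lengths."""
    return inner_product(a, b) / math.sqrt(inner_product(a, a) * inner_product(b, b))


# Toy vectors: after normalization their inner product matches the cosine
# similarity of the original, unnormalized vectors.
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
ip = inner_product(a, b)
cos = cosine_similarity([3.0, 4.0], [4.0, 3.0])
print(round(ip, 4), round(cos, 4))
```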


### Insert data

In [10]:
from tqdm import tqdm

data = []

for i, chunk in enumerate(tqdm(texts, desc="Processing chunks")):
    embedding = emb_text(chunk)
    data.append({"id": i, "vector": embedding, "text": chunk})

milvus_client.insert(collection_name=collection_name, data=data)


Processing chunks: 100%|██████████| 137/137 [01:00<00:00,  2.27it/s]


{'insert_count': 137, 'ids': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136], 'cost': 0}
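
The loop above sends one embedding request per chunk, which took about a minute for 137 chunks. The OpenAI embeddings endpoint also accepts a list of strings as `input`, so requests can be batched. The sketch below shows only the batching logic: `embed_in_batches` and `batch_size` are hypothetical, and `embed_batch` is any callable you supply that maps a list of texts to a list of vectors. A stand-in function is used so the example runs without an API key.

```python
def embed_in_batches(texts, embed_batch, batch_size=64):
    """Embed `texts` in order, calling `embed_batch` once per slice."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start : start + batch_size]
        batch_vectors = embed_batch(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embed_batch must return one vector per input text")
        vectors.extend(batch_vectors)
    return vectors


calls = []


def fake_embed_batch(batch):
    # Stand-in for a real embedding call; returns a 1-dimensional "vector"
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


sample_texts = ["a", "bb", "ccc", "dddd", "eeeee"]
vectors = embed_in_batches(sample_texts, fake_embed_batch, batch_size=2)

# Build rows in the same shape the tutorial inserts into Milvus
rows = [
    {"id": i, "vector": v, "text": t}
    for i, (v, t) in enumerate(zip(vectors, sample_texts))
]
print(calls, rows[0])
```

With the real client, `embed_batch` could wrap a single `openai_client.embeddings.create(input=batch, model="text-embedding-3-small")` call and return the `.embedding` of each item in `.data`. Check the API's per-request input limits before choosing `batch_size`.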

## Build RAG

### Retrieve data for a query

Let's specify a query question about the parsed documents.


In [11]:
question = "What is the transformer architecture and how does self-attention work?"


Search the collection for the question and retrieve the top 3 semantic matches.


In [12]:
search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question)],
    limit=3,
    search_params={"metric_type": "IP", "params": {}},
    output_fields=["text"],
)


Let's take a look at the search results for the query.


In [13]:
import json

retrieved_lines_with_distances = [
    (res["entity"]["text"], res["distance"]) for res in search_res[0]
]
print(json.dumps(retrieved_lines_with_distances, indent=4))


[
    [
        "To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence- aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].",
        0.7215487360954285
    ],
    [
        "The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.",
        0.7109684944152832
    ],
    [
        "In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.",
        0.6668375730514526
    ]
]


### Use LLM to get a RAG response

Convert the retrieved documents into a string format.


In [14]:
context = "\n".join(
    [line_with_distance[0] for line_with_distance in retrieved_lines_with_distances]
)
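
Joining every retrieved passage works for three short hits, but with a larger `limit` the context can grow beyond what you want to send to the model. A minimal sketch of a size-capped join follows. `build_context` and `max_chars` are hypothetical, and the character budget is a crude stand-in for a token limit. The input has the same `(text, distance)` shape as `retrieved_lines_with_distances`.

```python
def build_context(hits, max_chars=2000, separator="\n"):
    """Join passages in ranked order until the character budget is reached."""
    parts = []
    used = 0
    for text, _distance in hits:
        # Count the separator only when something precedes this passage
        extra = len(text) + (len(separator) if parts else 0)
        if used + extra > max_chars:
            break
        parts.append(text)
        used += extra
    return separator.join(parts)


sample_hits = [("first passage", 0.72), ("second passage", 0.71), ("third", 0.66)]
print(build_context(sample_hits, max_chars=30))
```

Because the hits arrive sorted by score, stopping at the first passage that does not fit keeps the highest-ranked material.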


Define the system and user prompts for the language model. The user prompt is assembled from the documents retrieved from Milvus.


In [15]:
SYSTEM_PROMPT = """
Human: You are an AI assistant. You are able to find answers to the questions from the contextual passage snippets provided.
"""
USER_PROMPT = f"""
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""


Use the OpenAI `gpt-5` chat model to generate a response based on the prompts.


In [17]:
response = openai_client.chat.completions.create(
    model="gpt-5",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)
print(response.choices[0].message.content)


- Transformer architecture: It is a sequence transduction (encoder–decoder) model that relies entirely on attention. Both the encoder and decoder are built from stacked self-attention layers followed by point-wise, fully connected (feed-forward) layers. Unlike earlier encoder–decoder models, it does not use sequence-aligned RNNs or convolutions.

- How self-attention works (as described here): The model computes representations of the input and output using attention alone; the recurrent layers are replaced with multi-headed self-attention.
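
Because `USER_PROMPT` is an f-string, `context` and `question` are substituted once, when that cell runs, so asking a new question means re-running the retrieval and prompt cells. One way to make the flow reusable is sketched below. `build_user_prompt` and `answer_question` are hypothetical helpers, the template text mirrors the one used above, and the `retrieve` and `generate` callables are stand-ins so the example runs without Milvus or an API key.

```python
def build_user_prompt(context, question):
    """Fill the prompt template with freshly retrieved context and a question."""
    return (
        "Use the following pieces of information enclosed in <context> tags "
        "to provide an answer to the question enclosed in <question> tags.\n"
        f"<context>\n{context}\n</context>\n"
        f"<question>\n{question}\n</question>\n"
    )


def answer_question(question, retrieve, generate):
    """Run retrieve -> prompt -> generate for one question.

    `retrieve(question)` returns a list of passage strings and
    `generate(prompt)` returns the model's text; both come from the caller.
    """
    passages = retrieve(question)
    prompt = build_user_prompt("\n".join(passages), question)
    return generate(prompt)


def fake_retrieve(question):
    # Stand-in for the Milvus search step
    return ["Passage A", "Passage B"]


def fake_generate(prompt):
    # Stand-in for the chat completion call
    return f"(answer based on {prompt.count('Passage')} passages)"


prompt = build_user_prompt("Passage A\nPassage B", "What is self-attention?")
result = answer_question("What is self-attention?", fake_retrieve, fake_generate)
print(result)
```

In the notebook, `retrieve` could wrap the `milvus_client.search` call and `generate` could wrap `openai_client.chat.completions.create`, both as shown earlier.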
