<a href="https://colab.research.google.com/github/milvus-io/bootcamp/blob/master/integration/build_RAG_with_milvus_and_contextual_ai_glm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>   <a href="https://github.com/milvus-io/bootcamp/blob/master/integration/build_RAG_with_milvus_and_contextual_ai_glm.ipynb" target="_blank">
    <img src="https://img.shields.io/badge/View%20on%20GitHub-555555?style=flat&logo=github&logoColor=white" alt="GitHub Repository"/>
</a>


# Build RAG with Milvus and Contextual AI GLM

**Versions used:**
- Milvus version `1.3.4`
- Contextual AI client `0.9.0`

[Contextual AI's Grounded Language Model (GLM)](https://contextual.ai/blog/introducing-grounded-language-model?utm_campaign=GLM-integration&utm_source=milvus&utm_medium=github&utm_content=notebook), which Contextual AI describes as the most grounded language model in the world, is built for RAG and agentic use cases where minimizing hallucinations is critical. Unlike traditional LLMs that rely heavily on parametric knowledge, GLM prioritizes the knowledge you explicitly provide, so responses stay grounded in your specific data.

In this tutorial, we'll show you how to build a Retrieval-Augmented Generation (RAG) pipeline using Milvus and Contextual AI's GLM. The pipeline uses Milvus for vector storage, OpenAI for embeddings, and GLM for grounded response generation that minimizes hallucinations.

## Preparation
### Dependencies and Environment

To start, install the required dependencies by running the following command:


In [None]:
! pip install --upgrade "pymilvus[milvus_lite]" contextual-client openai requests tqdm


> If you are using Google Colab, you may need to **restart the runtime** for the newly installed dependencies to take effect (open the "Runtime" menu at the top of the screen and select "Restart session").
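
After installing (and restarting, if needed), it can help to confirm which package versions the runtime actually picked up. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and the package names are the distribution names from the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as passed to pip in the install command.
for name in ("pymilvus", "contextual-client", "openai", "requests", "tqdm"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```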


### Setting Up API Keys

We will use Contextual AI for GLM access and OpenAI for text embeddings in this example. You should prepare the [CONTEXTUAL_API_KEY](https://docs.contextual.ai/user-guides/beginner-guide?utm_campaign=GLM-integration&utm_source=milvus&utm_medium=github&utm_content=notebook) and [OPENAI_API_KEY](https://platform.openai.com/docs/quickstart) as environment variables.

If you're running this notebook in Google Colab, you can add your API keys as secrets. The code below dynamically handles both Colab secrets and environment variables.


In [2]:
import os

# API key variable names
contextual_api_key_var = "CONTEXTUAL_API_KEY"
openai_api_key_var = "OPENAI_API_KEY"

# Fetch API keys
try:
    # If running in Colab, fetch API keys from Secrets
    import google.colab
    from google.colab import userdata
    contextual_api_key = userdata.get(contextual_api_key_var)
    openai_api_key = userdata.get(openai_api_key_var)

    if not contextual_api_key:
        raise ValueError(f"Secret '{contextual_api_key_var}' not found in Colab secrets.")
    if not openai_api_key:
        raise ValueError(f"Secret '{openai_api_key_var}' not found in Colab secrets.")
except ImportError:
    # If not running in Colab, fetch API keys from environment variables
    contextual_api_key = os.getenv(contextual_api_key_var)
    openai_api_key = os.getenv(openai_api_key_var)

    if not contextual_api_key:
        raise EnvironmentError(
            f"Environment variable '{contextual_api_key_var}' is not set. "
            "Please define it before running this script."
        )
    if not openai_api_key:
        raise EnvironmentError(
            f"Environment variable '{openai_api_key_var}' is not set. "
            "Please define it before running this script."
        )

os.environ["CONTEXTUAL_API_KEY"] = contextual_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key


### Prepare the data

We'll use a sample enterprise knowledge base to demonstrate GLM's grounding capabilities:


In [3]:
# Sample enterprise knowledge base
enterprise_docs = [
    "Our company's Q3 revenue reached $2.4 billion, representing a 15% year-over-year growth. The growth was primarily driven by our cloud services division, which saw a 28% increase in subscription revenue.",
    "The new AI-powered customer service platform reduced average response time by 40% and improved customer satisfaction scores by 23%. The platform processes over 10,000 customer inquiries daily.",
    "Our sustainability initiatives have resulted in a 35% reduction in carbon emissions compared to 2022. We've invested $50 million in renewable energy projects and achieved carbon neutrality in 12 facilities.",
    "The recent product launch of our enterprise security suite generated $180 million in revenue within the first quarter. The suite includes advanced threat detection, automated incident response, and compliance management features.",
    "Employee satisfaction surveys show an 18% improvement in workplace satisfaction, with 89% of employees reporting high job satisfaction. Our remote work policies and professional development programs were cited as key factors.",
    "Our research and development investment increased by 22% this year, totaling $340 million. This investment has led to 15 new patent applications and the development of three breakthrough technologies in AI and quantum computing.",
    "The company's market share in the enterprise software sector grew from 12% to 16% this year, driven by strategic partnerships with major cloud providers and enhanced product capabilities.",
    "Customer retention rates improved to 94% this quarter, up from 87% last year. The improvement is attributed to our enhanced customer success program and 24/7 technical support availability."
]


### Prepare the LLM and Embedding Model

We initialize the OpenAI client for embeddings and the Contextual AI client for GLM.


In [4]:
from openai import OpenAI
from contextual import ContextualAI

openai_client = OpenAI()
contextual_client = ContextualAI(api_key=contextual_api_key)


Define a function to generate text embeddings using the OpenAI client. We use the [text-embedding-3-small](https://platform.openai.com/docs/guides/embeddings) model as an example.


In [5]:
def emb_text(text):
    return (
        openai_client.embeddings.create(input=text, model="text-embedding-3-small")
        .data[0]
        .embedding
    )


Generate a test embedding and print its dimension and first few elements.


In [6]:
test_embedding = emb_text("This is a test")
embedding_dim = len(test_embedding)
print(embedding_dim)
print(test_embedding[:10])


1536
[0.009889289736747742, -0.005578675772994757, 0.00683477520942688, -0.03805781528353691, -0.01824733428657055, -0.04121600463986397, -0.007636285852640867, 0.03225184231996536, 0.018949154764413834, 9.352207416668534e-05]


## Load Data into Milvus

### Create the Collection


In [7]:
from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./milvus_demo.db")
collection_name = "my_rag_collection"


> A note on the `uri` argument of `MilvusClient`:
> - Setting `uri` to a local file, e.g. `./milvus.db`, is the most convenient option, as it automatically uses [Milvus Lite](https://milvus.io/docs/milvus_lite.md) to store all data in that file.
> - If you have a large amount of data, you can set up a more performant Milvus server on [Docker or Kubernetes](https://milvus.io/docs/quickstart.md). In this setup, use the server URI, e.g. `http://localhost:19530`, as your `uri`.
> - If you want to use [Zilliz Cloud](https://zilliz.com/cloud), the fully managed cloud service for Milvus, adjust the `uri` and `token`, which correspond to the [Public Endpoint and API key](https://docs.zilliz.com/docs/on-zilliz-cloud-console#free-cluster-details) in Zilliz Cloud.
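
One way to switch between these deployment options without editing the notebook is to read the connection settings from the environment. The sketch below is only an illustration: `milvus_connection_kwargs` is a hypothetical helper, and `MILVUS_URI` and `MILVUS_TOKEN` are variable names assumed for this example, not names that Milvus itself reads.

```python
import os


def milvus_connection_kwargs(env=None):
    """Build keyword arguments for MilvusClient from environment-style settings.

    MILVUS_URI and MILVUS_TOKEN are assumed names chosen for this sketch.
    Falls back to the local Milvus Lite file used in this notebook.
    """
    env = os.environ if env is None else env
    kwargs = {"uri": env.get("MILVUS_URI", "./milvus_demo.db")}
    token = env.get("MILVUS_TOKEN")
    if token:
        # Only pass a token when one is configured (e.g. for Zilliz Cloud).
        kwargs["token"] = token
    return kwargs


print(milvus_connection_kwargs({}))
print(milvus_connection_kwargs({"MILVUS_URI": "http://localhost:19530"}))
```

The result can then be passed on as `MilvusClient(**milvus_connection_kwargs())`.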


Check if the collection already exists and drop it if it does.


In [8]:
if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)


Create a new collection with specified parameters.

If we don't specify any field information, Milvus automatically creates a default `id` field for the primary key and a `vector` field to store the vector data. A reserved JSON field stores any fields that are not defined in the schema, along with their values.


In [9]:
milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product: a higher score means a closer match
    # Strong consistency waits for all loads to complete, adding latency with large datasets
    # consistency_level="Strong",  # Supported values are (`"Strong"`, `"Session"`, `"Bounded"`, `"Eventually"`). See https://milvus.io/docs/consistency.md#Consistency-Level for more details.
)
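
With `metric_type="IP"`, results are ranked by inner product, where a larger value means a closer match. OpenAI documents its embeddings as normalized to length 1, and for unit-length vectors the inner product equals cosine similarity. The following dependency-free sketch illustrates that equivalence; the two vectors are made up for the example.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of any length."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


raw_a, raw_b = [3.0, 4.0], [1.0, 2.0]  # illustrative vectors only
unit_a, unit_b = normalize(raw_a), normalize(raw_b)

# For unit-length vectors, the two measures agree (up to floating-point error).
print(inner_product(unit_a, unit_b))
print(cosine_similarity(raw_a, raw_b))
```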


### Insert data


In [10]:
from tqdm import tqdm

data = []

for i, chunk in enumerate(tqdm(enterprise_docs, desc="Processing chunks")):
    embedding = emb_text(chunk)
    data.append({"id": i, "vector": embedding, "text": chunk})

milvus_client.insert(collection_name=collection_name, data=data)


Processing chunks: 100%|██████████| 8/8 [00:08<00:00,  1.08s/it]


{'insert_count': 8, 'ids': [0, 1, 2, 3, 4, 5, 6, 7], 'cost': 0}
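
The loop above sends one embedding request per document. The OpenAI embeddings endpoint also accepts a list of inputs, so a small knowledge base like this one can be embedded in a single request. The sketch below assumes a client shaped like `openai_client`; `emb_texts` and `build_rows` are hypothetical helper names that are not part of the original notebook.

```python
def emb_texts(client, texts, model="text-embedding-3-small"):
    """Embed a list of texts in one request and return one vector per text."""
    response = client.embeddings.create(input=texts, model=model)
    # Each returned item carries the index of its input; sort to make ordering explicit.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


def build_rows(texts, vectors):
    """Pair each text with its vector in the row format used for insertion."""
    return [
        {"id": i, "vector": vector, "text": text}
        for i, (text, vector) in enumerate(zip(texts, vectors))
    ]


# Usage in this notebook (requires a valid OPENAI_API_KEY):
# data = build_rows(enterprise_docs, emb_texts(openai_client, enterprise_docs))
# milvus_client.insert(collection_name=collection_name, data=data)
```

For much larger collections, the inputs would likely need to be split into several requests, since the endpoint limits how much text a single call can carry.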

## Build RAG

Contextual AI's GLM (Grounded Language Model) prioritizes provided knowledge over parametric knowledge, ensuring responses are grounded in your specific data. Let's create a RAG pipeline class to demonstrate GLM's capabilities:


In [17]:
from typing import List, Dict

class GLMRAGPipeline:
    def __init__(self, collection_name: str = "my_rag_collection"):
        self.milvus_client = milvus_client
        self.openai_client = openai_client
        self.contextual_client = contextual_client
        self.collection_name = collection_name

    def emb_text(self, text: str) -> List[float]:
        """Generate embeddings using OpenAI"""
        return (
            self.openai_client.embeddings.create(
                input=text, model="text-embedding-3-small"
            )
            .data[0]
            .embedding
        )

    def retrieve_context(self, query: str, top_k: int = 3) -> List[str]:
        """Retrieve relevant context from Milvus"""
        search_res = self.milvus_client.search(
            collection_name=self.collection_name,
            data=[self.emb_text(query)],
            limit=top_k,
            search_params={"metric_type": "IP", "params": {}},
            output_fields=["text"],
        )

        return [res["entity"]["text"] for res in search_res[0]]

    def generate_response(self, query: str, conversation_history: List[Dict] = None, top_k: int = 3) -> str:
        """Generate grounded response using GLM"""
        # Retrieve relevant context
        knowledge = self.retrieve_context(query, top_k)

        # Prepare conversation messages (create a copy to avoid modifying the original)
        messages = (conversation_history or []).copy()
        messages.append({"role": "user", "content": query})

        # Generate response using GLM
        response = self.contextual_client.generate.create(
            model="v2",
            messages=messages,
            knowledge=knowledge,
            avoid_commentary=False,
            max_new_tokens=1024,
            temperature=0.1,
            top_p=0.9
        )

        return response.response

# Initialize RAG pipeline
rag = GLMRAGPipeline()


### Retrieve data for a query

Let's define a question about the enterprise knowledge base.


In [18]:
question = "What was our Q3 revenue and what drove the growth?"


Search the collection for the question and retrieve the top 3 semantic matches.


In [19]:
context = rag.retrieve_context(question, top_k=3)

# Display retrieved context
print("Retrieved context:")
for i, doc in enumerate(context, 1):
    print(f"{i}. {doc[:100]}...")


Retrieved context:
1. Our company's Q3 revenue reached $2.4 billion, representing a 15% year-over-year growth. The growth ...
2. Our research and development investment increased by 22% this year, totaling $340 million. This inve...
3. The recent product launch of our enterprise security suite generated $180 million in revenue within ...
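
Only the first retrieved passage directly answers the question; the other two are simply the next-closest vectors. Each hit returned by `MilvusClient.search` also carries a `distance` field, which with `metric_type="IP"` is the inner product score, so weak matches can be dropped before they are passed on as knowledge. The sketch below is a hypothetical helper; the hits and the threshold value are made up for illustration and would need tuning on real data.

```python
def filter_hits(hits, min_score):
    """Return the text of hits whose similarity score is at least `min_score`.

    Assumes the inner product metric, where a larger "distance" value
    means a closer match.
    """
    return [hit["entity"]["text"] for hit in hits if hit["distance"] >= min_score]


# Made-up hits in the shape of search_res[0]; scores are illustrative only.
example_hits = [
    {"id": 0, "distance": 0.62, "entity": {"text": "Q3 revenue ..."}},
    {"id": 5, "distance": 0.31, "entity": {"text": "R&D investment ..."}},
]
print(filter_hits(example_hits, min_score=0.4))
```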


### Use LLM to get a RAG response

Generate a response using GLM with the retrieved context.


In [20]:
response = rag.generate_response(question)
print(f"Question: {question}")
print(f"\nResponse: {response}")


Question: What was our Q3 revenue and what drove the growth?

Response: <fact>Our company achieved Q3 revenue of $2.4 billion, showing a 15% year-over-year growth.[0]()</fact>

<fact>The primary growth driver was our cloud services division, which experienced a 28% increase in subscription revenue.[0]()</fact>
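
The response above wraps grounded statements in `<fact>` tags and appends citation markers such as `[0]()`. To display plain text, these can be stripped with a regular expression. This is a minimal sketch whose patterns are inferred only from the sample output in this notebook, not from a documented response format; `to_plain_text` is a hypothetical helper.

```python
import re

# Patterns inferred from the sample output above (an assumption, not a spec).
FACT_TAG = re.compile(r"</?fact>")
CITATION = re.compile(r"\[\d+\]\([^)]*\)")


def to_plain_text(response: str) -> str:
    """Remove <fact> tags and citation markers from a GLM response."""
    text = FACT_TAG.sub("", response)
    text = CITATION.sub("", text)
    return text.strip()


sample = (
    "<fact>Our company achieved Q3 revenue of $2.4 billion, "
    "showing a 15% year-over-year growth.[0]()</fact>"
)
print(to_plain_text(sample))
```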


## Advanced Features

GLM supports advanced features like multi-turn conversations with context preservation. Let's demonstrate this capability:


### Multi-Turn Conversation


In [21]:
def multi_turn_conversation(rag_pipeline, initial_query: str, follow_up_queries: List[str]):
    """Demonstrate multi-turn conversation with GLM"""
    conversation_history = []

    # First turn
    response = rag_pipeline.generate_response(initial_query, conversation_history)
    conversation_history.extend([
        {"role": "user", "content": initial_query},
        {"role": "assistant", "content": response}
    ])

    print(f"User: {initial_query}")
    print(f"Assistant: {response}\n")

    # Follow-up turns
    for query in follow_up_queries:
        response = rag_pipeline.generate_response(query, conversation_history)
        conversation_history.extend([
            {"role": "user", "content": query},
            {"role": "assistant", "content": response}
        ])
        print(f"User: {query}")
        print(f"Assistant: {response}\n")

    return conversation_history

# Example multi-turn conversation
initial_query = "Tell me about our company's recent performance"
follow_ups = [
    "What about our sustainability efforts?",
    "How are our employees responding to these changes?"
]

conversation = multi_turn_conversation(rag, initial_query, follow_ups)


User: Tell me about our company's recent performance
Assistant: <fact>Our company's Q3 revenue reached $2.4 billion, showing a 15% year-over-year growth, with the cloud services division specifically experiencing a 28% increase in subscription revenue.[0]()</fact>

<fact>Key performance indicators show positive trends in customer relationships, with customer retention rates improving to 94% this quarter, up from 87% last year. This improvement is specifically attributed to our enhanced customer success program and 24/7 technical support availability.[0]()</fact>

<fact>In terms of market position, the company has expanded its market share in the enterprise software sector from 12% to 16% this year. This growth is attributed to strategic partnerships with major cloud providers and enhanced product capabilities.[0]()</fact>

User: What about our sustainability efforts?
Assistant: <fact>Our company has made significant progress in sustainability, achieving a 35% reduction in carbon emissi
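
Note that `generate_response` retrieves context using only the latest query, so a follow-up such as "How are our employees responding to these changes?" is embedded without the earlier turns. One possible refinement, sketched below under that assumption, is to build the retrieval query from the most recent user turns as well. `build_retrieval_query` is a hypothetical helper, not part of the pipeline class above.

```python
def build_retrieval_query(conversation_history, query, max_user_turns=2):
    """Combine recent user turns with the new query for retrieval.

    `conversation_history` is a list of {"role": ..., "content": ...} dicts,
    as built in `multi_turn_conversation`.
    """
    previous = [m["content"] for m in conversation_history if m["role"] == "user"]
    # Guard against max_user_turns == 0, since previous[-0:] would keep everything.
    recent = previous[-max_user_turns:] if max_user_turns > 0 else []
    return " ".join(recent + [query])


history = [
    {"role": "user", "content": "What about our sustainability efforts?"},
    {"role": "assistant", "content": "..."},
]
print(build_retrieval_query(history, "How are our employees responding to these changes?"))
```

The combined string would be passed to `retrieve_context`, while the original query is still sent to GLM as the user message.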

### Dynamic Knowledge Injection

You can also append extra knowledge to the retrieved context at query time, for example facts that are not yet stored in the collection:


In [22]:
def add_dynamic_knowledge(rag_pipeline, query: str, additional_knowledge: List[str], top_k: int = 3):
    """Add dynamic knowledge to retrieved context"""
    retrieved_context = rag_pipeline.retrieve_context(query, top_k)
    combined_knowledge = retrieved_context + additional_knowledge

    messages = [{"role": "user", "content": query}]

    response = rag_pipeline.contextual_client.generate.create(
        model="v2",
        messages=messages,
        knowledge=combined_knowledge,
        avoid_commentary=False,
        max_new_tokens=1024,
        temperature=0.1
    )

    return response.response

# Example with additional knowledge
query = "What are our current sustainability metrics?"
additional_knowledge = [
    "Our latest sustainability report shows we've achieved 100% renewable energy usage in our European operations.",
    "We've partnered with three major environmental organizations to plant 50,000 trees this year."
]

response = add_dynamic_knowledge(rag, query, additional_knowledge)
print(f"Query: {query}")
print(f"Response: {response}")


Query: What are our current sustainability metrics?
Response: <fact>Our key sustainability metrics include:
- 35% reduction in carbon emissions compared to 2022[0]() 
- $50 million investment in renewable energy projects[0]() 
- 100% renewable energy usage achieved in European operations[0]()</fact>

<fact>Additional environmental initiatives include a partnership with three major environmental organizations for planting 50,000 trees this year.[0]()</fact>

<fact>Supporting operational metrics show:
- 94% customer retention rate (up from 87% last year)[0]() 
- 89% of employees report high job satisfaction, with an 18% improvement in workplace satisfaction overall[0]()</fact>
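
When retrieved and injected passages come from overlapping sources, the combined list can contain near-identical entries. The sketch below shows one way to merge the two lists while dropping duplicates and keeping the original order; `combine_knowledge` is a hypothetical helper, and comparing passages by lowercased, whitespace-normalized text is an assumption made for this example.

```python
def combine_knowledge(retrieved, additional, max_items=None):
    """Merge two passage lists, dropping duplicates and preserving order.

    Passages are compared after collapsing whitespace and lowercasing.
    If `max_items` is set, passages at the end of the list are dropped first.
    """
    seen = set()
    combined = []
    for passage in list(retrieved) + list(additional):
        key = " ".join(passage.split()).lower()
        if key and key not in seen:
            seen.add(key)
            combined.append(passage)
    return combined if max_items is None else combined[:max_items]


print(combine_knowledge(
    ["Carbon emissions fell 35%.", "Revenue grew 15%."],
    ["carbon emissions  fell 35%.", "We planted 50,000 trees."],
))
```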


Great! We have successfully built a RAG pipeline with Milvus and Contextual AI GLM. GLM keeps responses grounded by prioritizing the knowledge you provide over its parametric knowledge.


## Summary

This guide demonstrated how to build a RAG system with Milvus and Contextual AI's GLM. The points below summarize what the combination provides.

### Key Advantages of GLM + Milvus:

- **Grounded Responses**: GLM prioritizes provided knowledge over parametric knowledge, which reduces hallucinations
- **Enterprise-Grade Reliability**: Designed for production RAG applications with minimal tuning
- **Multi-Turn Conversations**: Full conversation context preservation across interactions
- **Scalable Vector Search**: Milvus enables efficient similarity search across large document collections
- **Dynamic Knowledge Management**: Easy integration of diverse knowledge sources

### Technical Implementation:

- **Complete RAG Pipeline**: Milvus vector store + OpenAI embeddings + GLM generation
- **Multi-Turn Support**: Conversation history preservation for context-aware responses
- **Advanced Features**: Dynamic knowledge injection on top of the retrieved context
- **Production Ready**: Robust architecture suitable for enterprise deployment

### Next Steps:

- Experiment with different knowledge base structures for your specific domain
- Implement custom response validation and quality metrics
- Scale to larger document collections with Milvus clustering
- Integrate with your existing enterprise systems and workflows
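
As a starting point for the response-validation step listed above, the sketch below flags numbers that appear in a response but in none of the knowledge passages. It is a crude heuristic rather than a real grounding metric; `unsupported_numbers` is a hypothetical helper, and the citation-marker pattern is inferred from the sample outputs in this notebook.

```python
import re

# Matches integers, thousands-separated numbers, and decimals (e.g. 15, 10,000, 2.4).
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")
# Citation markers such as [0](), as seen in the sample outputs (an assumption).
CITATION = re.compile(r"\[\d+\]\([^)]*\)")


def unsupported_numbers(response, knowledge):
    """Return numbers found in `response` that occur in no knowledge passage."""
    known = set()
    for passage in knowledge:
        known.update(NUMBER.findall(passage))
    # Strip citation markers first so their indices are not treated as figures.
    cleaned = CITATION.sub("", response)
    return [n for n in NUMBER.findall(cleaned) if n not in known]


knowledge = [
    "Our company's Q3 revenue reached $2.4 billion, representing a 15% year-over-year growth."
]
print(unsupported_numbers("<fact>Revenue was $2.4 billion, up 15%.[0]()</fact>", knowledge))
print(unsupported_numbers("<fact>Revenue was $3.1 billion, up 15%.[0]()</fact>", knowledge))
```

An empty list means every figure in the response was found in the provided knowledge; it does not prove the figure is used in the right context.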

This integration represents a significant advancement in RAG technology, providing the reliability and control needed for enterprise applications while maintaining the flexibility and scalability of modern vector databases.

---

**Ready to get started?** This notebook provides a complete, production-ready example of integrating Contextual AI's GLM with Milvus for enterprise-grade RAG applications. The combination of GLM's grounding capabilities and Milvus's powerful vector search features creates a robust foundation for reliable, scalable AI applications.
