# **Retrieval-Augmented Generation (RAG) with Azure AI Foundry**

## Overview
This notebook demonstrates how to implement Retrieval-Augmented Generation (RAG) using Azure AI Foundry services. You'll learn how to enhance AI model responses with relevant information retrieved from your own data sources, improving accuracy and context-awareness of the responses.

## What is RAG?
Retrieval-Augmented Generation combines the strengths of two approaches:

1. **Retrieval**: Finding relevant information from a knowledge base or vector store
2. **Generation**: Using this retrieved context to generate accurate, informed responses

This hybrid approach helps solve several limitations of standalone generative models:
- Provides access to specialized knowledge not in the model's training data
- Reduces hallucinations by grounding responses in factual information
- Allows for real-time access to updated information
- Makes citation and attribution of sources possible
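
The two steps can be illustrated with a toy sketch that needs no cloud services. It assumes a tiny in-memory knowledge base and ranks passages by word overlap, which is only a stand-in for the embedding-based search used later in this notebook. The names `retrieve` and `build_prompt` are hypothetical.

```python
def retrieve(query: str, knowledge_base: list, top: int = 1) -> list:
    """Toy retriever: rank passages by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        knowledge_base,
        key=lambda passage: len(query_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:top]


def build_prompt(query: str, passages: list) -> str:
    """Ground the generation step by placing retrieved passages ahead of the question."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"Context:\n{context}\n\nQuestion: {query}"


knowledge_base = [
    "Transformers use self-attention to weigh tokens against each other.",
    "Diffusion models generate samples by iteratively denoising random noise.",
]
question = "How do diffusion models generate samples?"
prompt = build_prompt(question, retrieve(question, knowledge_base))
print(prompt)
```

The resulting prompt is what a generative model would receive: the question plus numbered passages it can cite.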

## 1. Setting Up the Environment

First, we'll load environment variables and set up OpenTelemetry for tracing our requests.

In [1]:
import os
import dotenv
dotenv.load_dotenv(".env", override=True)

from opentelemetry import trace
tracer = trace.get_tracer(__name__)
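
The cells in this notebook read four environment variables: `PROJECT_CONNECTION_STRING`, `SEARCH_INDEX_NAME`, `chatModel`, and `embeddingModel`. A small check like the sketch below can report missing values up front instead of failing later with a `KeyError`. The helper name `missing_env_vars` is hypothetical, not part of any SDK.

```python
import os

# Variable names taken from the cells in this notebook.
REQUIRED_VARS = ("PROJECT_CONNECTION_STRING", "SEARCH_INDEX_NAME", "chatModel", "embeddingModel")


def missing_env_vars(required=REQUIRED_VARS, environ=None) -> list:
    """Return the names in `required` that are unset or empty in `environ`."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```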

### Initializing Azure AI Foundry Clients

Now we'll set up connections to essential services:
- AI Project Client to manage project resources
- Chat Completions client for query understanding and response generation
- Embeddings client for vector representation
- Search client for document retrieval

In [2]:
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import ConnectionType
from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# create a project client using environment variables loaded from the .env file
project = AIProjectClient.from_connection_string(
    conn_str=os.environ["PROJECT_CONNECTION_STRING"], credential=DefaultAzureCredential()
)

# create the chat completions and embeddings clients from the project
chat = project.inference.get_chat_completions_client()
embeddings = project.inference.get_embeddings_client()

# use the project client to get the default search connection
search_connection = project.connections.get_default(
    connection_type=ConnectionType.AZURE_AI_SEARCH, include_credentials=True
)

# Create a search client bound to the index named in SEARCH_INDEX_NAME
# This client will be used to query documents in that index
search_client = SearchClient(
    index_name=os.environ["SEARCH_INDEX_NAME"],
    endpoint=search_connection.endpoint_url,
    credential=AzureKeyCredential(key=search_connection.key),
)

## 2. Building the Retrieval Component

The retrieval component is responsible for finding relevant documents that match the user's query. There are two main steps:

1. Understanding user intent to create an optimized search query
2. Using that optimized query to retrieve relevant documents

### Intent System Message Setup

This system message instructs the model to analyze user queries and reformat them into optimized search terms.

In [3]:
from azure.ai.inference.models import UserMessage, SystemMessage

# Literal braces in the JSON example are doubled ({{ }}) so that str.format() leaves them intact
INTENT_SYSTEM_PROMPT = """
# Semantic Intent Clarification System

Your task is to rephrase the user's query into a concise, standalone phrase or sentence that clearly captures the semantic intent.  
The output will be used for semantic similarity retrieval, so it should fully represent the user's information need.

Guidelines:
- Do not use conversational phrases ("I want," "Can you tell me," etc.).
- Ensure clarity, completeness, and semantic accuracy.
- Include related relevant concepts naturally to enhance semantic richness, if applicable.

Example:
- User Query: "How do attention mechanisms work in transformer models?"
- Intent: "Function and role of self-attention mechanisms in transformer architectures for NLP tasks"

Conversation History:
{conversation_history}

Respond strictly with the following JSON format:
{{"intent": "your clarified semantic query here"}}
"""

# Fill the conversation history into the prompt and wrap the result as a system message
def get_intent_system_message(conversation_history):
    return SystemMessage(INTENT_SYSTEM_PROMPT.format(conversation_history=conversation_history))
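
Two details of this prompt are worth a quick sketch. First, `str.format` turns doubled braces into literal ones, which is why the JSON example in the prompt survives formatting. Second, a model does not always return bare JSON, so parsing the reply defensively can help. The helper `parse_intent` below is hypothetical and only shows one possible fallback strategy.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker that some models wrap JSON in

# Doubled braces become literal braces; {conversation_history} is substituted.
template = 'History: {conversation_history}\nReply as {{"intent": "..."}}'
rendered = template.format(conversation_history="(none)")


def parse_intent(raw: str, fallback: str) -> str:
    """Extract the "intent" field from a model reply, or return `fallback`."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening and closing fence lines, keep the JSON in between.
        text = "\n".join(line for line in text.splitlines() if not line.startswith(FENCE))
    try:
        intent = json.loads(text).get("intent")
    except (json.JSONDecodeError, AttributeError):
        return fallback
    return intent if isinstance(intent, str) and intent.strip() else fallback


print(rendered)
print(parse_intent('{"intent": "self-attention in transformers"}', fallback="original query"))
```

A natural fallback value is the user's original question, so retrieval still runs when the intent step returns something unexpected.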

### Document Retrieval Function

This function handles the full document retrieval process:
1. Analyze user intent to create an optimized search query
2. Generate embeddings for the optimized query
3. Perform a hybrid search (both text-based and vector-based)
4. Return the most relevant documents

In [4]:
from azure.search.documents.models import VectorizedQuery
import json

@tracer.start_as_current_span(name="get_documents")
def get_documents(messages: list, top: int = 3) -> list:
    intent_query_response = chat.complete(
        model=os.environ["chatModel"],
        messages=[get_intent_system_message(messages)]
    )

    enhanced_search_query = json.loads(intent_query_response.choices[0].message.content)["intent"]
    
    embedding = embeddings.embed(
        model=os.environ["embeddingModel"],
        input=enhanced_search_query
    ).data[0].embedding
    
    vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=50, fields="text_vector")

    search_results = search_client.search(
        search_text=enhanced_search_query,
        vector_queries=[vector_query],
        select=["id", "content", "title", "url"],
        top=top,
    )

    documents = [
        {
            "id": result["id"],
            "content": result["content"],
            "title": result["title"],
            "url": result["url"],
        }
        for result in search_results
    ]

    return documents
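
Because the call passes both `search_text` and `vector_queries`, the service runs a keyword query and a vector query and merges the two rankings. Azure AI Search documents Reciprocal Rank Fusion (RRF) as the merging method for hybrid queries. The sketch below is a simplified illustration of the idea, not the service's implementation; the constant `k=60` and the document IDs are assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked ID lists; each appearance adds 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_ranking = ["doc-a", "doc-b", "doc-c"]  # hypothetical text-search order
vector_ranking = ["doc-b", "doc-d", "doc-a"]   # hypothetical vector-search order
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Documents that rank well in both lists (`doc-b`, `doc-a`) rise above documents that appear in only one.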

## 3. Building the Generation Component

The generation component uses the retrieved documents as context to create accurate, well-informed responses to user queries.

### RAG System Message Setup

This system message instructs the model on how to use retrieved context to generate responses. It emphasizes:
- Staying grounded in the provided context
- Proper citation and attribution
- Clear formatting with Markdown for better readability

In [5]:
SYSTEM_PROMPT = """You are a helpful AI assistant that provides accurate information based on the retrieved context.

### Retrieved Context:
{retrieved_context}

### Instructions:
1. Answer questions based on the retrieved context above.
2. If the context doesn't contain the information needed, acknowledge the limitation.
3. Do not make up information that is not supported by the context.
4. Keep responses concise and focused on the user's question.
5. Format your answers using Markdown when appropriate (responses are rendered with IPython's display and Markdown helpers).
    5a. Pay extra attention to formatting code blocks, lists, tables, and mathematical equations.
6. When quoting directly from the context, use quotation marks.
7. For citation style, follow the Institute of Electrical and Electronics Engineers (IEEE) format:
    7a. In-text citations should be in square brackets, e.g. [1], [2], etc.
    7b. The reference list should be numbered in the order in which they appear in the text (make them clickable).
    7c. Use the following format for references: [1] Author(s), "Title," Journal, vol. X, no. Y, pp. Z-Z, Year. [Online]. Available: URL

Remember: Only use information from the retrieved context to answer questions, and remember to cite sources properly.
"""

def get_completion_system_message(retrieved_context):
    return SystemMessage(SYSTEM_PROMPT.format(retrieved_context=retrieved_context))
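
The prompt asks for numbered citations, so it can help to number the retrieved documents before they are placed into `{retrieved_context}`. The sketch below assumes each document is a dict with the `title`, `url`, and `content` keys returned by `get_documents`. The helper name `format_context` is hypothetical.

```python
def format_context(documents: list) -> str:
    """Render retrieved documents as numbered entries the model can cite as [1], [2], ..."""
    entries = []
    for number, doc in enumerate(documents, start=1):
        entries.append(
            f"[{number}] Title: {doc['title']}\n"
            f"URL: {doc['url']}\n"
            f"Content: {doc['content']}"
        )
    return "\n\n".join(entries)


sample_documents = [
    {"id": "1", "title": "Paper A", "url": "https://example.com/a", "content": "First passage."},
    {"id": "2", "title": "Paper B", "url": "https://example.com/b", "content": "Second passage."},
]
print(format_context(sample_documents))
```

One option is to call `get_completion_system_message(format_context(documents))` rather than passing the raw list, whose Python `repr` would otherwise be inserted into the prompt.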

### Complete RAG Pipeline Function

This function integrates both retrieval and generation components:
1. Retrieve relevant documents based on user query
2. Format the system message with retrieved context
3. Generate a response that incorporates the retrieved information
4. Return the formatted response to the user

In [6]:
@tracer.start_as_current_span(name="chat_with_documents")
def chat_with_documents(messages: list):
    documents = get_documents(messages)

    # Create the system message
    system_message = get_completion_system_message(documents)

    # Format messages properly for the API
    formatted_messages = [system_message]

    # Add the conversation messages; note that every entry is sent as a user message,
    # regardless of its "role" field
    for message in messages:
        formatted_messages.append(UserMessage(message["content"]))

    response = chat.complete(
        model=os.environ["chatModel"],
        messages=formatted_messages,
        max_tokens=1000
    )

    # Return a chat protocol compliant response
    return response.choices[0].message
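
For multi-turn conversations, each history entry could be converted according to its `role` instead of being wrapped as a user message. The sketch below keeps the conversion independent of any SDK by taking a role-to-constructor mapping. The helper `to_sdk_messages` and the stand-in factories are hypothetical.

```python
def to_sdk_messages(messages: list, factories: dict) -> list:
    """Convert {"role", "content"} dicts using a role -> constructor mapping."""
    converted = []
    for message in messages:
        role = message["role"]
        if role not in factories:
            raise ValueError(f"Unsupported role: {role!r}")
        converted.append(factories[role](message["content"]))
    return converted


# Stand-in factories so the sketch runs without the Azure SDK installed.
demo_factories = {
    "user": lambda text: ("user", text),
    "assistant": lambda text: ("assistant", text),
}
history = [
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "Retrieval-Augmented Generation."},
    {"role": "user", "content": "Why does it reduce hallucinations?"},
]
print(to_sdk_messages(history, demo_factories))
```

In this notebook the mapping could be `{"user": UserMessage, "assistant": lambda text: AssistantMessage(content=text)}`, with both classes imported from `azure.ai.inference.models`.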

## 4. Telemetry Integration

Telemetry helps us monitor and analyze RAG system performance. This function sets up:
- Azure Monitor integration
- OpenTelemetry instrumentation
- Tracing capabilities for inference operations

In [7]:
from azure.ai.inference.tracing import AIInferenceInstrumentor
from azure.monitor.opentelemetry import configure_azure_monitor
from azure.core.settings import settings

def enable_telemetry(project):
    AIInferenceInstrumentor().instrument()
    settings.tracing_implementation = "opentelemetry"
    application_insights_connection_string = project.telemetry.get_connection_string()
    configure_azure_monitor(connection_string=application_insights_connection_string)

## 5. Testing the RAG Pipeline

Now let's test our complete RAG pipeline with a sample question comparing large language models (LLMs) and diffusion models.
This will demonstrate how the system:
1. Processes the question
2. Retrieves relevant documents
3. Generates a well-informed response with citations

In [12]:
# from config import enable_telemetry
enable_telemetry(project)

user_message = "On a high level, how are llms different from diffusion models?"
response = chat_with_documents(messages=[{"role": "user", "content": user_message}])

### Displaying the Response

We'll use IPython's Markdown display capability to render the response with proper formatting, including:
- Headers and sections
- Code blocks (if present)
- Citations and references
- Mathematical equations (if present)

In [13]:
from IPython.display import display, Markdown
display(Markdown(response.content))

Large Language Models (LLMs) and diffusion models differ in their fundamental approach to processing and generating data:

### 1. **Modeling Approach**
- **LLMs**: These are predominantly **autoregressive models**, which predict the next token given the context of preceding tokens. They generate outputs sequentially at the token level, relying heavily on architectures like transformer-based, decoder-only language models [1].
- **Diffusion Models**: These are based on **probabilistic diffusion processes**. They iteratively refine random noise over multiple steps to generate coherent outputs. Diffusion models operate in continuous embedding spaces and are an alternative probabilistic paradigm for text generation [2][3].

### 2. **Focus**
- **LLMs**: Heavily token-based and often operate at the word or subword level. Their training is generally English-centric, with a strong dependency on tokens and language-based patterns [1].
- **Diffusion Models**: They aim for reasoning and text generation in abstract embedding spaces. This enables them to model higher-level semantic relationships and hierarchical reasoning that LLMs typically do not capture [2].

### 3. **Applications**
While both can be used for tasks like natural language processing, conversational AI, or code generation, diffusion models may offer advantages in tasks requiring complex reasoning or hierarchical abstraction due to their focus on higher-level embeddings [2].

### 4. **Concerns**
Both models share societal challenges, such as environmental impact due to large-scale training, potential biases in training data, and misuse for generating harmful content [3].

### References
[1] "Large Language Models," [Online]. Available: https://dataericssonlearningpath.blob.core.windows.net/demo-data/papers/2412.08821v2.pdf  
[2] "Deep Probabilistic Paradigms in LLMs," [Online]. Available: https://dataericssonlearningpath.blob.core.windows.net/demo-data/papers/2412.08821v2.pdf  
[3] "Diffusion-based Alternatives," [Online]. Available: https://dataericssonlearningpath.blob.core.windows.net/demo-data/papers/2502.09992v2.pdf  