# **Retrieval-Augmented Generation (RAG) with Azure AI Foundry**

## Overview
This notebook demonstrates how to implement Retrieval-Augmented Generation (RAG) using Azure AI Foundry services. You'll learn how to enhance AI model responses with relevant information retrieved from your own data sources, which improves the accuracy and context-awareness of those responses.

## What is RAG?
Retrieval-Augmented Generation combines the strengths of two approaches:

1. **Retrieval**: Finding relevant information from a knowledge base or vector store
2. **Generation**: Using this retrieved context to generate accurate, informed responses

This hybrid approach helps solve several limitations of standalone generative models:
- Provides access to specialized knowledge not in the model's training data
- Reduces hallucinations by grounding responses in factual information
- Allows for real-time access to updated information
- Makes citation and attribution of sources possible
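
The two steps above can be sketched without any Azure services. The snippet below is a toy illustration only: the in-memory `KNOWLEDGE_BASE`, the word-overlap scoring, and the helper names `retrieve` and `build_prompt` are all hypothetical stand-ins for the vector search and chat completion calls used later in this notebook.

```python
# Toy illustration of retrieve-then-generate; no Azure services are involved.
# KNOWLEDGE_BASE, retrieve and build_prompt are hypothetical names for this sketch.
KNOWLEDGE_BASE = [
    {"id": "doc1", "content": "Self-attention relates different positions of a single sequence."},
    {"id": "doc2", "content": "Feed-forward networks are applied to each position separately."},
    {"id": "doc3", "content": "Byte-pair encoding builds a shared source-target vocabulary."},
]

def retrieve(query: str, top: int = 2) -> list:
    """Rank documents by how many query words they contain (stand-in for vector search)."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(doc["content"].lower().replace(".", "").split())), doc)
        for doc in KNOWLEDGE_BASE
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only documents that share at least one word with the query
    return [doc for score, doc in scored[:top] if score > 0]

def build_prompt(query: str, documents: list) -> str:
    """Place the retrieved text ahead of the question so the model can ground its answer."""
    context = "\n".join(f"[{i}] {doc['content']}" for i, doc in enumerate(documents, start=1))
    return f"Context:\n{context}\n\nQuestion: {query}"

retrieved = retrieve("how does self-attention work")
prompt = build_prompt("how does self-attention work", retrieved)
print(prompt)
```

In a real pipeline the prompt would then be sent to a chat model; here it is only printed so the grounding step is visible.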

## Key Components in This Notebook

### **1. Intent-Based Query Processing**
- Converting user queries into optimized search queries
- Enhancing search relevance through intent understanding

### **2. Vector Search**
- Using embeddings to find semantically similar content
- Combining traditional keyword search with vector similarity
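
One common way to combine a keyword ranking with a vector ranking is Reciprocal Rank Fusion (RRF), which Azure AI Search documents as its method for merging hybrid query results. The sketch below is a simplified, self-contained illustration of the idea, not the service's implementation; the function name, the document ids, and the choice of `k = 60` are assumptions made for the example.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of document ids into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical rankings from a keyword search and a vector search
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear in both lists (`doc_a`, `doc_b`) outrank those found by only one retriever, which is the behavior hybrid search relies on.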

### **3. Context-Aware Completion**
- Providing retrieved documents as context to the language model
- Formatting responses with proper citations

### **4. Telemetry Integration**
- Tracking query performance with Azure Monitor
- Using OpenTelemetry for observability

## Learning Objectives
- Set up a complete RAG pipeline using Azure AI Foundry
- Implement intelligent query formulation
- Configure vector search for semantic retrieval
- Provide grounded, well-cited responses from an AI model
- Monitor and analyze RAG system performance

## Prerequisites
- An Azure account with access to Azure AI Foundry
- An active Azure AI Search index (created in the `01-create-index.ipynb` notebook)
- Appropriate environment variables in a `.env` file:
  - `PROJECT_CONNECTION_STRING`: Connection string for your Azure AI Foundry project
  - `SEARCH_INDEX_NAME`: Name of your Azure AI Search index
  - `chatModel`: The model to use for chat completions (e.g., GPT-4)
  - `embeddingModel`: The model to use for embeddings (e.g., text-embedding-ada-002)
  - `APPLICATIONINSIGHTS_CONNECTION_STRING`: Connection string for telemetry
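
Before running the cells below, it can help to confirm that these variables are actually set. The following is a minimal sketch using only the standard library; the helper name `find_missing_variables` is hypothetical, and the check assumes the `.env` file has already been loaded into the process environment.

```python
import os

# Variable names taken from the prerequisites list above
REQUIRED_VARIABLES = [
    "PROJECT_CONNECTION_STRING",
    "SEARCH_INDEX_NAME",
    "chatModel",
    "embeddingModel",
    "APPLICATIONINSIGHTS_CONNECTION_STRING",
]

def find_missing_variables(names: list, environ=os.environ) -> list:
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]

missing = find_missing_variables(REQUIRED_VARIABLES)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```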

Follow along step by step to build your own RAG solution with Azure AI Foundry!

**Imports**

In [1]:
import os
import dotenv
dotenv.load_dotenv(".env")

from opentelemetry import trace
tracer = trace.get_tracer(__name__)

In [2]:
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import ConnectionType
from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# create a project client using environment variables loaded from the .env file
project = AIProjectClient.from_connection_string(
    conn_str=os.environ["PROJECT_CONNECTION_STRING"], credential=DefaultAzureCredential()
)

# create the chat completions client and the embeddings client from the project
chat = project.inference.get_chat_completions_client()
embeddings = project.inference.get_embeddings_client()

# use the project client to get the default search connection
search_connection = project.connections.get_default(
    connection_type=ConnectionType.AZURE_AI_SEARCH, include_credentials=True
)

# Create a search client for the existing index using the search connection
# This client will be used to query the index for relevant documents
search_client = SearchClient(
    index_name=os.environ["SEARCH_INDEX_NAME"],
    endpoint=search_connection.endpoint_url,
    credential=AzureKeyCredential(key=search_connection.key),
)

### **RETRIEVAL**

**INTENT SYSTEM MESSAGE**

In [3]:
from azure.ai.inference.models import UserMessage, SystemMessage

# Intent prompt template. The doubled braces {{ }} are escaped so that
# str.format() emits them as the literal braces of the JSON example.
INTENT_SYSTEM_PROMPT = """
    # Intent Mapping System

    Your task is to understand the user's query and map it to a search intent.
    
    For example, if a user asks about "attention mechanisms in transformers", 
    create a search query like "attention mechanism transformer architecture neural networks".
    
    Avoid phrases like "I want" or "tell me about". Just provide keywords.
    
    The user's conversation history is:
    {conversation_history}
    
    Return only the search query, nothing else. Use the format: 
    {{"intent": "your search query here"}}
"""

# Fill the conversation history into the template and wrap it as a system message
def get_intent_system_message(conversation_history):
    return SystemMessage(INTENT_SYSTEM_PROMPT.format(conversation_history=conversation_history)) 
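
To see why the braces in the prompt are doubled, here is a small standalone illustration of `str.format()` using a shortened, hypothetical template. Single braces mark a placeholder, doubled braces become literal braces, and braces inside the substituted value are left alone.

```python
# Shortened stand-in for INTENT_SYSTEM_PROMPT (hypothetical, for illustration only)
template = 'History: {conversation_history}\nUse the format: {{"intent": "your search query here"}}'

# The notebook passes a list of message dicts; format() inserts its str() form
history = [{"role": "user", "content": "what is multi-head attention?"}]
rendered = template.format(conversation_history=history)
print(rendered)
```

The rendered text ends with `{"intent": "your search query here"}`, while the braces that come from the history list pass through unchanged because only the template itself is parsed for placeholders.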

**RETRIEVE DOCUMENTS**

In [10]:
from azure.search.documents.models import VectorizedQuery
import json

@tracer.start_as_current_span(name="get_documents")
def get_documents(messages: list, top: int=3) -> list:
    # Ask the chat model to rewrite the conversation as a keyword search query
    intent_query_response = chat.complete(
        model=os.environ["chatModel"],
        messages=[get_intent_system_message(messages)]
    )

    # The prompt requests {"intent": "..."}; json.loads raises if the reply is not valid JSON
    enhanced_search_query = json.loads(intent_query_response.choices[0].message.content)["intent"]

    # Embed the search query; the vector field name must match the index schema
    embedding = embeddings.embed(model=os.environ["embeddingModel"], input=enhanced_search_query)
    search_vector = embedding.data[0].embedding
    vector_query = VectorizedQuery(vector=search_vector, k_nearest_neighbors=50, fields="text_vector")

    # Hybrid search: keyword text and vector query are sent in the same request
    search_results = search_client.search(
        search_text=enhanced_search_query,
        vector_queries=[vector_query],
        select=["id", "content", "title", "url"],
        top=top,
    )

    documents = [
        {
            "id": result["id"],
            "content": result["content"],
            "title": result["title"],
            "url": result["url"],
        }
        for result in search_results
    ]

    return documents
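
The `json.loads` call in `get_documents` assumes the model always returns clean JSON. A more defensive variant could fall back to the user's own text when parsing fails. The helper below is a hedged sketch: `parse_intent` and its `fallback_query` parameter are hypothetical names, and the handling of a Markdown-fenced reply is an assumption about how some models format JSON, not something this notebook's output shows.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker

def parse_intent(raw_response: str, fallback_query: str) -> str:
    """Extract the "intent" value from the model reply, or return the fallback query."""
    text = raw_response.strip()
    # Assumption: some models wrap JSON in a Markdown fence; strip it if present
    if text.startswith(FENCE):
        text = text.strip("`")
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        intent = json.loads(text).get("intent")
    except (json.JSONDecodeError, AttributeError):
        # Not JSON at all, or JSON that is not an object
        return fallback_query
    return intent if isinstance(intent, str) and intent.strip() else fallback_query

print(parse_intent('{"intent": "attention feed forward transformer"}', "original question"))
print(parse_intent("Sorry, I cannot help with that.", "original question"))
```

Inside `get_documents`, the fallback could be the content of the latest user message, so that retrieval still runs when the intent step returns something unexpected.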

### **COMPLETION**

**RAG SYSTEM MESSAGE**

In [11]:
SYSTEM_PROMPT = """You are a helpful AI assistant that provides accurate information based on the retrieved context.

### Retrieved Context:
{retrieved_context}

### Instructions:
1. Answer questions based on the retrieved context above.
2. If the context doesn't contain the information needed, acknowledge the limitation.
3. Do not make up information that is not supported by the context.
4. Keep responses concise and focused on the user's question.
5. Format your answers using Markdown when appropriate (so they can be rendered with IPython.display.Markdown).
    5a. Pay extra attention to formatting code blocks, lists, tables, and mathematical equations.
6. When quoting directly from the context, use quotation marks.
7. For citation style, follow the Institute of Electrical and Electronics Engineers (IEEE) format:
    7a. In-text citations should be in square brackets, e.g. [1], [2], etc.
    7b. The reference list should be numbered in the order in which they appear in the text (make them clickable).
    7c. Use the following format for references: [1] Author(s), "Title," Journal, vol. X, no. Y, pp. Z-Z, Year. [Online]. Available: URL

Remember: Only use information from the retrieved context to answer questions.
"""

def get_completion_system_message(retrieved_context):
    return SystemMessage(SYSTEM_PROMPT.format(retrieved_context=retrieved_context))
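
`SYSTEM_PROMPT.format(...)` inserts whatever it receives as text, so a list of dictionaries appears in the prompt in its Python `repr` form. Since the prompt asks for numbered citations, one option is to number the documents before inserting them. The helper below is a sketch under that assumption; `format_context` is a hypothetical name, and the sample document is made up for illustration.

```python
def format_context(documents: list) -> str:
    """Render retrieved documents as numbered entries the model can cite as [1], [2], ..."""
    entries = []
    for number, doc in enumerate(documents, start=1):
        entries.append(
            f"[{number}] Title: {doc['title']}\n"
            f"URL: {doc['url']}\n"
            f"Content: {doc['content']}"
        )
    return "\n\n".join(entries)

# Made-up document with the same keys that get_documents returns
sample_documents = [
    {"id": "1", "title": "Example title", "url": "https://example.com/a.pdf", "content": "Example passage."},
]
formatted = format_context(sample_documents)
print(formatted)
```

The result could be passed to `get_completion_system_message` in place of the raw list, which gives the model explicit numbers and URLs to use in its reference list.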

In [12]:
@tracer.start_as_current_span(name="chat_with_attentionIsAllYouNeed")
def chat_with_documents(messages: list):
    # Retrieve documents relevant to the conversation
    documents = get_documents(messages)

    # Create the system message
    system_message = get_completion_system_message(documents)

    # Format messages properly for the API
    formatted_messages = [system_message]

    # Add the conversation messages (each entry is sent as a user message, whatever its role)
    for message in messages:
        print(f"message: {message}")
        formatted_messages.append(UserMessage(message["content"]))

    response = chat.complete(
        model=os.environ["chatModel"],
        messages=formatted_messages,
        max_tokens=1000
    )

    # Return the assistant message from the first choice
    return response.choices[0].message

In [13]:
from azure.ai.inference.tracing import AIInferenceInstrumentor
from azure.monitor.opentelemetry import configure_azure_monitor
from azure.core.settings import settings

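def enable_telemetry(project):
    # Instrument azure-ai-inference calls so they emit OpenTelemetry spans
    AIInferenceInstrumentor().instrument()
    # Route Azure SDK tracing through OpenTelemetry
    settings.tracing_implementation = "opentelemetry"
    # Read the Application Insights connection string attached to the project
    application_insights_connection_string = project.telemetry.get_connection_string()
    # Export the collected traces to Azure Monitor
    configure_azure_monitor(connection_string=application_insights_connection_string)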

In [14]:
enable_telemetry(project)

user_message = "how does attention relate to feed forward networks?"
response = chat_with_documents(messages=[{"role": "user", "content": user_message}])

message: {'role': 'user', 'content': 'how does attention relate to feed forward networks?'}


In [15]:
from IPython.display import display, Markdown
display(Markdown(response.content))

In the Transformer model architecture, attention mechanisms are integral to both the encoder and decoder, and work alongside feed-forward networks. Here is how they relate:

1. **Structure**: The Transformer uses stacked self-attention and point-wise, fully connected layers (feed-forward networks) for both the encoder and decoder. This can be seen in the diagram from the paper, which alternates between multi-head attention layers and feed-forward layers [1].

2. **Complementary Functions**:
    - **Self-Attention**: Enables the model to consider and weigh the relevance of different positions in the input sequence for generating the output, without regard to their distance.
    - **Feed-Forward Networks**: Applied independently to each position, adding non-linearity and enhancing the model's ability to learn complex patterns.

The Transformer architecture can thus be viewed as an interplay between attention mechanisms capturing dependencies and feed-forward networks processing these representations locally.

### Reference:
[1] "The Transformer - model architecture," papers/1706.03762v7.pdf. [Online]. Available: https://dataericssonlearningpath.blob.core.windows.net/demo-data/papers/1706.03762v7.pdf