# Llama Stack RAG Capabilities Demo

This notebook demonstrates the Retrieval-Augmented Generation (RAG) capabilities of Llama Stack using the built-in Agent API with the `file_search` tool.

## How RAG Works in Llama Stack

Llama Stack provides native RAG support through its **Agent API** with the `file_search` tool:

1. **Vector Store Creation**: Documents are uploaded and chunked into smaller pieces
2. **Embedding Generation**: Each chunk is converted to a vector embedding using an embedding model
3. **Vector Search**: When a query is made, it's embedded and compared against stored chunks using:
   - **Semantic search**: Vector similarity (cosine/dot product)
   - **Keyword search**: BM25 algorithm
   - **Hybrid search**: Combination of both (configurable weights)
4. **Retrieval**: Top matching chunks are retrieved based on relevance scores
5. **Generation**: The LLM uses retrieved context to generate answers, citing sources

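The hybrid step in item 3 can be pictured as a weighted blend of two per-chunk scores. The sketch below is illustrative only: the function names, the min-max normalization, and the sample scores are assumptions made for this example, and a given Llama Stack provider may fuse rankings differently (for instance with reciprocal rank fusion).

```python
def min_max(scores):
    """Rescale scores to [0, 1] so BM25 and cosine values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def hybrid_rank(bm25_scores, semantic_scores, bm25_weight=0.5, semantic_weight=0.5):
    """Blend normalized keyword and semantic scores; return chunk ids, best first."""
    bm25_n, sem_n = min_max(bm25_scores), min_max(semantic_scores)
    chunk_ids = set(bm25_n) | set(sem_n)
    combined = {
        cid: bm25_weight * bm25_n.get(cid, 0.0) + semantic_weight * sem_n.get(cid, 0.0)
        for cid in chunk_ids
    }
    # Sort by descending score, breaking ties by chunk id for determinism.
    return sorted(combined, key=lambda cid: (-combined[cid], cid))


# Hypothetical scores for three chunks.
bm25 = {"c1": 7.2, "c2": 1.1, "c3": 0.0}
semantic = {"c1": 0.62, "c2": 0.88, "c3": 0.15}
ranking = hybrid_rank(bm25, semantic)
print(ranking)
```

With equal weights, `c1` ranks first because it scores well on both signals. Setting `bm25_weight=0.0` and `semantic_weight=1.0` moves `c2` to the top.
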
### Architecture

```
User Query → Embedding → Vector Search → Top K Chunks → LLM + Context → Answer
                              ↓
                        Vector Store
                     (Embeddings + Metadata)
```

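The flow in the diagram can be traced with a toy, dependency-free sketch. Everything here is an assumption made for illustration: a bag-of-words count vector stands in for the real embedding model, a Python list stands in for the vector store, and the LLM call is replaced by assembling the prompt it would receive.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a real system uses a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, context_chunks):
    """Assemble the text that would be sent to the LLM."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Made-up chunks standing in for the vector store.
store = [
    "Employees receive a gold watch after long service.",
    "The cafeteria serves lunch from noon.",
    "Retirement benefits include a pension plan.",
]
top = retrieve("When do I get my gold watch?", store, k=1)
print(build_prompt("When do I get my gold watch?", top))
```
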
## Prerequisites

Before running this notebook:

1. Install required Python packages:
   ```bash
   pip install llama-stack-client python-dotenv requests
   ```

2. Start your Llama Stack server (if not already running)

3. Configure your environment variables (`LLAMA_STACK_BASE_URL` and `INFERENCE_MODEL`) in a `.env` file

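Before connecting, it can help to sanity-check those two settings. The helper below is a sketch under stated assumptions: `load_settings` is a hypothetical name, not part of any library, and it only mirrors the defaults that the configuration cell in this notebook uses.

```python
import os
from urllib.parse import urlparse


def load_settings(env=os.environ):
    """Read the two settings this notebook uses, falling back to its defaults."""
    base_url = env.get("LLAMA_STACK_BASE_URL", "http://localhost:8321")
    model = env.get("INFERENCE_MODEL", "vllm/qwen3-14b-gaudi")
    parsed = urlparse(base_url)
    # Reject values such as "localhost:8321" that lack an http(s) scheme.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"LLAMA_STACK_BASE_URL is not a valid http(s) URL: {base_url!r}")
    return {"base_url": base_url.rstrip("/"), "model": model}


print(load_settings({"LLAMA_STACK_BASE_URL": "http://localhost:8321/"}))
```

Passing a plain dictionary, as in the last line, makes the check easy to try without touching the real environment.
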
## Configuration

Set up the connection to your Llama Stack server and configure the inference model.

In [None]:
import os
from dotenv import load_dotenv
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger
import requests
from io import BytesIO
import time

# Load environment variables
load_dotenv()

# Get configuration from environment
LLAMA_STACK_BASE_URL = os.getenv("LLAMA_STACK_BASE_URL", "http://localhost:8321")
INFERENCE_MODEL = os.getenv("INFERENCE_MODEL", "vllm/qwen3-14b-gaudi")

print(f"Llama Stack URL: {LLAMA_STACK_BASE_URL}")
print(f"Inference Model: {INFERENCE_MODEL}")

## Initialize Llama Stack Client

In [None]:
# Initialize client
client = LlamaStackClient(base_url=LLAMA_STACK_BASE_URL)

print("✓ Llama Stack client initialized")

## 1. Create Vector Store

First, we'll create a vector store with hybrid search capabilities (combining semantic and keyword search). We'll download an HR benefits document and ingest it into the vector store.

In [None]:
# Create vector store with embedding model configuration and hybrid search
vector_store_name = "hr-benefits-hybrid"

vs = client.vector_stores.create(
    name=vector_store_name,
    extra_body={
        "embedding_model": "sentence-transformers/nomic-ai/nomic-embed-text-v1.5",
        "embedding_dimension": 768,
        "search_mode": "hybrid",  # Enable hybrid search (keyword + semantic)
        "bm25_weight": 0.5,  # Weight for keyword search (BM25)
        "semantic_weight": 0.5,  # Weight for semantic search
    }
)

print(f"✓ Vector store created: {vs.id}")
print(f"  Name: {vs.name}")
print("  Search Mode: hybrid (BM25 + semantic)")

### Download and Ingest Document

We'll download the FantaCo HR Benefits document and upload it to the vector store.

In [None]:
# Download clean text file
url = "https://raw.githubusercontent.com/burrsutter/fantaco-redhat-one-2026/refs/heads/main/basic-rag-llama-stack/source_docs/FantaCoFabulousHRBenefits_clean.txt"
print(f"Downloading text file from {url}...")
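response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail fast on HTTP errors instead of ingesting an error page
text_content = response.text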

print(f"✓ Downloaded {len(text_content)} characters of text")

# Save the text to source_docs folder for inspection
source_docs_path = os.path.join("source_docs", "FantaCoFabulousHRBenefits_clean.txt")
os.makedirs(os.path.dirname(source_docs_path), exist_ok=True)
with open(source_docs_path, 'w', encoding='utf-8') as f:
    f.write(text_content)
print(f"✓ Saved text to: {source_docs_path}")

In [None]:
# Upload as text file
text_buffer = BytesIO(text_content.encode('utf-8'))
text_buffer.name = "hr-benefits-clean.txt"

uploaded_file = client.files.create(
    file=text_buffer,
    purpose="assistants"
)

print(f"✓ File uploaded: {uploaded_file.id}")

In [None]:
# Attach file to vector store with custom chunking strategy
client.vector_stores.files.create(
    vector_store_id=vs.id,
    file_id=uploaded_file.id,
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 100,
            "chunk_overlap_tokens": 10
        }
    }
)

print(f"✓ File {uploaded_file.id} added to vector store")
print("  Chunking: 100 tokens per chunk, 10 token overlap")

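To see what a 100-token window with a 10-token overlap means in practice, here is a minimal sketch of fixed-size chunking. It is an illustration, not the server's implementation: `chunk_tokens` is a hypothetical helper, and generated placeholder strings stand in for real tokenizer output. Each window starts 90 tokens after the previous one, so neighbouring chunks share their boundary tokens.

```python
def chunk_tokens(tokens, max_chunk_size=100, overlap=10):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be non-negative and smaller than max_chunk_size")
    step = max_chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_size])
        if start + max_chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Placeholder "tokens" stand in for real tokenizer output.
tokens = [f"t{i}" for i in range(250)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # → [100, 100, 70]
```
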
In [None]:
# Check file status
time.sleep(2)
files = client.vector_stores.files.list(vector_store_id=vs.id)
for f in files:
    print(f"File status: {f.status}")
    if f.status == "completed":
        print("✓ File processing completed successfully")
    elif f.status == "failed":
        print("✗ File processing failed")

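A fixed two-second pause may not be long enough for larger documents. One alternative is to poll until processing reaches a terminal state. The sketch below is generic and makes some assumptions: `wait_for_status` is a hypothetical helper, and the status source is a stand-in function rather than a real server call.

```python
import time


def wait_for_status(get_status, done=("completed", "failed"), timeout=60.0, interval=1.0):
    """Call get_status() until it returns a terminal value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in done:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still '{status}' after {timeout} seconds")
        time.sleep(interval)


# Stand-in for a real status lookup: reports "in_progress" twice, then "completed".
fake_statuses = iter(["in_progress", "in_progress", "completed"])
print(wait_for_status(lambda: next(fake_statuses), interval=0.01))
```

In this notebook, `get_status` could wrap the `client.vector_stores.files.list(vector_store_id=vs.id)` call from the cell above and return the `status` of the uploaded file.
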
## 2. List Available Vector Stores

Let's see all the vector stores we have available.

In [None]:
# List all vector stores (materialized once so the empty check below
# does not have to iterate the result a second time)
vector_stores = list(client.vector_stores.list())

print("Available Vector Stores:")
print("-" * 80)

for vs_item in vector_stores:
    print(f"ID: {vs_item.id}")
    print(f"Name: {vs_item.name}")
    print(f"Created: {vs_item.created_at}")
    
    # List files in this vector store
    files = client.vector_stores.files.list(vector_store_id=vs_item.id)
    file_count = len(list(files))
    print(f"Files: {file_count}")
    print("-" * 80)

if not vector_stores:
    print("No vector stores found.")

## 3. Test RAG Queries

Now let's test the RAG system with various queries about HR benefits.

### Helper Functions

First, we'll define some helper functions to make it easier to query and display results.

In [None]:
def get_latest_vector_store(name_pattern="hr-benefits-hybrid"):
    """Get the most recent vector store matching the name pattern."""
    vector_stores = list(client.vector_stores.list())
    matching_stores = [vs for vs in vector_stores if name_pattern in vs.name]
    if matching_stores:
        return max(matching_stores, key=lambda vs: vs.created_at)
    return None

def query_rag_agent(query, vector_store_id, model=INFERENCE_MODEL, stream=True):
    """Query the RAG agent with a question."""
    agent = Agent(
        client,
        model=model,
        instructions="You MUST use the file_search tool to answer ALL questions by searching the provided documents.",
        tools=[
            {
                "type": "file_search",
                "vector_store_ids": [vector_store_id],
            }
        ],
    )
    
    # Unique session name; hash(query) is salted per process and can be negative
    session_id = agent.create_session(f"query-{time.time_ns()}")
    response = agent.create_turn(
        messages=[{"role": "user", "content": query}],
        session_id=session_id,
        stream=stream,
    )
    
    return response

def print_agent_response(response):
    """Print the agent's streaming response."""
    for log in AgentEventLogger().log(response):
        print(log, end="")
    print()  # Add newline at the end

### Get Vector Store for Queries

In [None]:
# Get the vector store (use the most recent one with matching name)
vector_store = get_latest_vector_store("hr-benefits-hybrid")

if not vector_store:
    # Stop here: the query cells below need vector_store.id
    raise RuntimeError("Vector store 'hr-benefits-hybrid' not found. Please run the create vector store section first.")

print(f"Using vector store: {vector_store.id}")
print(f"Using model: {INFERENCE_MODEL}")
print("-" * 80)

### Query 1: General Retirement Benefits

Ask a general question about retirement benefits.

In [None]:
query1 = "What do I receive when I retire?"

print(f"Query: {query1}\n")
print("Agent Response:")
print("-" * 80)

response1 = query_rag_agent(query1, vector_store.id)
print_agent_response(response1)

### Query 2: Specific Question About Gold Watch

Test retrieval with a more specific query.

In [None]:
query2 = "When do I get my gold watch?"

print(f"Query: {query2}\n")
print("Agent Response:")
print("-" * 80)

response2 = query_rag_agent(query2, vector_store.id)
print_agent_response(response2)

### Query 3: Multiple Unique Terms

Test retrieval quality with queries containing unique terms from the document.

In [None]:
queries_unique = [
    "Tell me about the chocolate statue and personal bard",
    "What do I get instead of a gold watch when I retire",
    "Tell me about the 401k and astrological alignment",
]

for query in queries_unique:
    print(f"\nQuery: {query}")
    print("-" * 80)
    
    response = query_rag_agent(query, vector_store.id)
    print_agent_response(response)
    print()

## 4. Debug Vector Search

Let's inspect the raw vector search results to see what chunks are being retrieved.

This helps us understand:
- What content is being found for different queries
- Relevance scores for retrieved chunks
- Whether the hybrid search is working effectively

In [None]:
# Test different queries and see what gets retrieved
debug_queries = [
    "gold watch retirement",
    "chocolate statue",
    "401k astrological alignment",
    "personal bard",
    "retirement benefits",
]

for query in debug_queries:
    print(f"\nQuery: '{query}'")
    print("-" * 80)
    
    try:
        # Direct vector store search
        results = client.vector_stores.search(
            vector_store_id=vector_store.id,
            query=query
        )
        
        print(f"Search results type: {type(results)}")
        
        # Try to access data if available
        if hasattr(results, 'data'):
            if results.data:
                print(f"Found {len(results.data)} results")
                for i, result in enumerate(results.data[:3], 1):  # Show top 3
                    print(f"\n  Result {i}:")
                    if hasattr(result, 'score'):
                        print(f"    Score: {result.score}")
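                    if hasattr(result, 'content'):
                        # content may be a plain string or a list of parts with a .text field
                        parts = result.content if isinstance(result.content, list) else [result.content]
                        text = " ".join(getattr(p, 'text', str(p)) for p in parts)
                        suffix = "..." if len(text) > 200 else ""  # Only mark real truncation
                        print(f"    Content: {text[:200]}{suffix}")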
                    else:
                        print(f"    Data: {result}")
            else:
                print("  No results returned")
        else:
            print(f"  Response attributes: {dir(results)}")
    
    except Exception as e:
        print(f"  Error: {e}")
    
    print()

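Reading raw results by eye works for a handful of queries. For a slightly more systematic check, one option is to measure how many expected terms show up in the retrieved text. This is a rough sketch under stated assumptions: `keyword_hit_rate` is a hypothetical helper, the sample chunks are made up rather than taken from the HR document, and substring matching is only a crude proxy for relevance.

```python
def keyword_hit_rate(retrieved_texts, expected_terms):
    """Fraction of expected terms that appear in at least one retrieved chunk."""
    if not expected_terms:
        return 0.0
    haystack = " ".join(retrieved_texts).lower()
    hits = [term for term in expected_terms if term.lower() in haystack]
    return len(hits) / len(expected_terms)


# Made-up retrieved chunks for the query "gold watch retirement".
retrieved = [
    "A gold watch is presented at the ceremony.",
    "Parking permits are renewed each year.",
]
print(keyword_hit_rate(retrieved, ["gold watch", "retirement"]))
```

A low rate for a query whose terms clearly exist in the document suggests that the chunk size or the search weights deserve another look.
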
## 5. Cleanup: Delete Vector Stores

When you're done testing, you can clean up by deleting vector stores.

**Warning**: This will permanently delete the vector store and all its associated data.

In [None]:
def list_vector_stores_detailed():
    """List all vector stores with details."""
    vector_stores = list(client.vector_stores.list())
    
    if not vector_stores:
        print("No vector stores found.")
        return []
    
    print(f"Found {len(vector_stores)} vector store(s):")
    print("-" * 80)
    for i, vs in enumerate(vector_stores, 1):
        print(f"{i}. Name: {vs.name}")
        print(f"   ID: {vs.id}")
        print(f"   Created: {vs.created_at}")
        print()
    
    return vector_stores

def delete_vector_store_by_name(name_pattern):
    """Delete vector stores matching a name pattern."""
    vector_stores = list(client.vector_stores.list())
    stores_to_delete = [vs for vs in vector_stores if name_pattern in vs.name]
    
    if not stores_to_delete:
        print(f"No vector stores match pattern '{name_pattern}'")
        return
    
    print(f"\nDeleting {len(stores_to_delete)} vector store(s)...")
    print("-" * 80)
    
    for vs in stores_to_delete:
        try:
            client.vector_stores.delete(vector_store_id=vs.id)
            print(f"✓ Deleted: {vs.name} ({vs.id})")
        except Exception as e:
            print(f"✗ Failed to delete {vs.name}: {e}")
    
    print("\nDone!")

In [None]:
# List all vector stores
list_vector_stores_detailed()

In [None]:
# UNCOMMENT TO DELETE: Delete vector stores matching "hr-benefits"
# delete_vector_store_by_name("hr-benefits")

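Because `delete_vector_store_by_name` matches on a substring, a short pattern can select more stores than intended. A dry run over the names makes the selection visible before anything is deleted. In this sketch, `select_by_pattern` is a hypothetical helper, and every store name except `hr-benefits-hybrid` is invented for the example.

```python
def select_by_pattern(store_names, name_pattern):
    """Return the names that a substring match on name_pattern would select."""
    return [name for name in store_names if name_pattern in name]


# Only "hr-benefits-hybrid" comes from this notebook; the others are made up.
names = ["hr-benefits-hybrid", "hr-benefits-semantic", "legal-docs"]
print(select_by_pattern(names, "hr-benefits"))
```

Here the pattern `"hr-benefits"` selects two of the three names, so a more specific pattern is the safer choice when only one store should go.
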
## Summary

In this notebook, we demonstrated:

1. **Vector Store Creation** - Created a vector store with hybrid search (semantic + keyword)
2. **Document Ingestion** - Downloaded and uploaded an HR benefits document with custom chunking
3. **RAG Queries** - Asked various questions and got contextual answers from the documents
4. **Debug Tools** - Inspected raw vector search results to understand retrieval quality
5. **Cleanup** - Learned how to delete vector stores when done

### Key Takeaways

- Llama Stack's built-in `file_search` tool lets you build a question-answering system without writing your own retrieval pipeline
- Hybrid search combines keyword matching (BM25), which favors exact terms, with semantic search, which matches by meaning
- The Agent API handles tool calling and context management for you
- Chunk size, chunk overlap, and the choice of embedding model all affect retrieval quality

### Next Steps

- Experiment with different chunking strategies
- Try different embedding models
- Adjust the hybrid search weights (`bm25_weight` vs. `semantic_weight`)
- Integrate with LangGraph for more complex agentic workflows

## References

- [Llama Stack Documentation](https://llama-stack.readthedocs.io/)
- [Llama Stack Client Python SDK](https://github.com/meta-llama/llama-stack-client-python)
- [Project README](./README.md)