## LlamaStack + CrewAI Integration Tutorial

This notebook guides you through integrating **LlamaStack** with **CrewAI** to build a complete Retrieval-Augmented Generation (RAG) system.

### Overview

- **LlamaStack**: Provides the infrastructure for running LLMs and vector stores.
- **CrewAI**: Offers a framework for orchestrating agents and tasks.
- **Integration**: Leverages LlamaStack's OpenAI-compatible API with CrewAI.

### What You Will Learn

1.  How to set up and start the LlamaStack server using the Together AI provider.
2.  How to create and manage vector stores within LlamaStack.
3.  How to build a RAG tool with CrewAI on top of the LlamaStack server.
4.  How to query the RAG tool for effective information retrieval and generation.

### Prerequisites

A Together AI API key is required to run the examples in this notebook.

---

### 1. Installation and Setup
#### Install Required Dependencies

Install the packages needed for the CrewAI integration. The cell reads `TOGETHER_API_KEY` from the environment (or from Colab user data) and prompts for it if it is not set.

In [1]:
!pip install uv
!uv tool install crewai
import os
import getpass

try:
    from google.colab import userdata
    os.environ['TOGETHER_API_KEY'] = userdata.get('TOGETHER_API_KEY')
except ImportError:
    print("Not in Google Colab environment")

for key in ['TOGETHER_API_KEY']:
    try:
        api_key = os.environ[key]
        if not api_key:
            raise ValueError(f"{key} environment variable is empty")
    except KeyError:
        api_key = getpass.getpass(f"{key} environment variable is not set. Please enter your API key: ")
        os.environ[key] = api_key

`crewai` is already installed
Not in Google Colab environment


TOGETHER_API_KEY environment variable is not set. Please enter your API key:  ········


### 2. LlamaStack Server Setup

#### Build and Start LlamaStack Server

This section sets up the LlamaStack server with:
- **Together AI** as the inference provider
- **FAISS** as the vector database
- **Sentence Transformers** for embeddings

The server runs on `localhost:8321` and provides OpenAI-compatible endpoints.

In [2]:
import os
import subprocess
import time

# Remove UV_SYSTEM_PYTHON to ensure uv creates a proper virtual environment
# instead of trying to use system Python globally, which could cause permission issues
# and package conflicts with the system's Python installation
if "UV_SYSTEM_PYTHON" in os.environ:
    del os.environ["UV_SYSTEM_PYTHON"]

def run_llama_stack_server_background():
    """Build and run LlamaStack server in one step using --run flag"""
    log_file = open("llama_stack_server.log", "w")
    process = subprocess.Popen(
        "uv run --with llama-stack llama stack build --distro starter --image-type venv --run",
        shell=True,
        stdout=log_file,
        stderr=log_file,
        text=True,
    )

    print(f"Building and starting Llama Stack server with PID: {process.pid}")
    return process


def wait_for_server_to_start():
    import requests
    from requests.exceptions import ConnectionError

    url = "http://0.0.0.0:8321/v1/health"
    max_retries = 30
    retry_interval = 2

    print("Waiting for server to start", end="")
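    for _ in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                print("\nServer is ready!")
                return True
        except ConnectionError:
            pass  # server is not accepting connections yet
        # Not ready (connection refused or non-200 status): wait, then retry
        print(".", end="", flush=True)
        time.sleep(retry_interval)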

    print("\nServer failed to start after", max_retries * retry_interval, "seconds")
    return False


def kill_llama_stack_server():
    # Kill any existing llama stack server processes using pkill command
    os.system("pkill -f llama_stack.core.server.server")

In [3]:
server_process = run_llama_stack_server_background()
assert wait_for_server_to_start()

Building and starting Llama Stack server with PID: 52433
Waiting for server to start........
Server is ready!


### 3. Initialize LlamaStack Client

Create a client connection to the LlamaStack server, passing the Together AI API key as provider data.



In [4]:
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://0.0.0.0:8321",
    provider_data={"together_api_key": os.environ["TOGETHER_API_KEY"]},
)

#### Explore Available Models 

Check what models are available through your LlamaStack instance.

In [5]:
print("Available models:")
for m in client.models.list():
    print(f"- {m.identifier}")

print("----")

INFO:httpx:HTTP Request: GET http://0.0.0.0:8321/v1/models "HTTP/1.1 200 OK"


Available models:
- bedrock/meta.llama3-1-8b-instruct-v1:0
- bedrock/meta.llama3-1-70b-instruct-v1:0
- bedrock/meta.llama3-1-405b-instruct-v1:0
- sentence-transformers/all-MiniLM-L6-v2
- together/Alibaba-NLP/gte-modernbert-base
- together/arcee-ai/AFM-4.5B
- together/arcee-ai/coder-large
- together/arcee-ai/maestro-reasoning
- together/arcee-ai/virtuoso-large
- together/arcee_ai/arcee-spotlight
- together/arize-ai/qwen-2-1.5b-instruct
- together/BAAI/bge-base-en-v1.5
- together/BAAI/bge-large-en-v1.5
- together/black-forest-labs/FLUX.1-dev
- together/black-forest-labs/FLUX.1-dev-lora
- together/black-forest-labs/FLUX.1-kontext-dev
- together/black-forest-labs/FLUX.1-kontext-max
- together/black-forest-labs/FLUX.1-kontext-pro
- together/black-forest-labs/FLUX.1-krea-dev
- together/black-forest-labs/FLUX.1-pro
- together/black-forest-labs/FLUX.1-schnell
- together/black-forest-labs/FLUX.1-schnell-Free
- together/black-forest-labs/FLUX.1.1-pro
- together/cartesia/sonic
- together/cartesia
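
The list spans several providers and model types, so it can help to group identifiers by the provider prefix (the text before the first `/`). The following is a minimal sketch; `group_by_provider` and the short `sample_ids` list are names chosen for this illustration. With a live client you could pass `[m.identifier for m in client.models.list()]` instead.

```python
from collections import defaultdict

def group_by_provider(identifiers):
    """Group model identifiers by the prefix before the first '/'."""
    groups = defaultdict(list)
    for identifier in identifiers:
        provider, _, _ = identifier.partition("/")
        groups[provider].append(identifier)
    return dict(groups)

# A few identifiers copied from the output above
sample_ids = [
    "bedrock/meta.llama3-1-8b-instruct-v1:0",
    "sentence-transformers/all-MiniLM-L6-v2",
    "together/BAAI/bge-base-en-v1.5",
    "together/arcee-ai/coder-large",
]

for provider, ids in group_by_provider(sample_ids).items():
    print(f"{provider}: {len(ids)} model(s)")
```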

### 4. Vector Store Setup

#### Create a Vector Store with File Upload

Create a vector store using the OpenAI-compatible vector stores API:

- **Vector Store**: OpenAI-compatible vector store for document storage
- **File Upload**: Automatic chunking and embedding of uploaded files
- **Embedding Model**: Sentence Transformers model for text embeddings
- **Dimensions**: 384-dimensional embeddings

In [6]:
from io import BytesIO

docs = [
    ("Acme ships globally in 3-5 business days.", {"title": "Shipping Policy"}),
    ("Returns are accepted within 30 days of purchase.", {"title": "Returns Policy"}),
    ("Support is available 24/7 via chat and email.", {"title": "Support"}),
]

file_ids = []
for content, metadata in docs:
  with BytesIO(content.encode()) as file_buffer:
      file_buffer.name = f"{metadata['title'].replace(' ', '_').lower()}.txt"
      create_file_response = client.files.create(file=file_buffer, purpose="assistants")
      print(create_file_response)
      file_ids.append(create_file_response.id)

# Create vector store with files
vector_store = client.vector_stores.create(
  name="acme_docs",
  file_ids=file_ids,
  embedding_model="sentence-transformers/all-MiniLM-L6-v2",
  embedding_dimension=384,
  provider_id="faiss"
)

INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/files "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/files "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/files "HTTP/1.1 200 OK"


File(id='file-489db9aae0424745960e3408ff0f477f', bytes=41, created_at=1757540912, expires_at=1789076912, filename='shipping_policy.txt', object='file', purpose='assistants')
File(id='file-b2f38b0e164347f5a2b6bbe211e33ff3', bytes=48, created_at=1757540912, expires_at=1789076912, filename='returns_policy.txt', object='file', purpose='assistants')
File(id='file-6f6f157d165a4078b4abef66a095ccd6', bytes=45, created_at=1757540912, expires_at=1789076912, filename='support.txt', object='file', purpose='assistants')


INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/vector_stores "HTTP/1.1 200 OK"


#### Test Vector Search

Query the vector store to verify it's working correctly. This performs semantic search to find relevant documents based on the query.

In [7]:
search_response = client.vector_stores.search(
  vector_store_id=vector_store.id,
  query="How long does shipping take?",
  max_num_results=2
)
for result in search_response.data:
  content = result.content[0].text
  print(content)

INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/vector_stores/vs_dab05212-db05-402c-91ef-57e41797406b/search "HTTP/1.1 200 OK"


Acme ships globally in 3-5 business days.
Returns are accepted within 30 days of purchase.
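
To make the idea of semantic search concrete, the toy sketch below ranks documents by cosine similarity between a query vector and document vectors. The 3-dimensional vectors, the function names, and the choice of cosine similarity are illustrative assumptions; they are not the 384-dimensional embeddings or the exact scoring that the server computes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs, top_k=2):
    """Return the names of the top_k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# Hand-written toy "embeddings" (hypothetical values)
toy_docs = {
    "shipping_policy": [0.9, 0.1, 0.0],
    "returns_policy": [0.4, 0.8, 0.1],
    "support": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

print(rank(toy_query, toy_docs))
```

As in the real search above, `top_k=2` returns the closest match first and a weaker second match, which is why the returns policy appears even though the question is about shipping.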


### 5. CrewAI Integration

#### Configure CrewAI with LlamaStack

Set up CrewAI to use LlamaStack's OpenAI-compatible API:

- **Base URL**: Points to LlamaStack's OpenAI endpoint
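- **API key**: The code passes a placeholder (`"dummy"` unless `OPENAI_API_KEY` is set); the Together AI key comes from the `TOGETHER_API_KEY` environment variable that the server process inherited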
- **Model**: Use Meta Llama 3.3 70B model via Together AI

In [8]:
import os
from crewai.llm import LLM

# Point CrewAI's LLM class at the LlamaStack server
llamastack_llm = LLM(
    # The "openai/" prefix marks the endpoint as OpenAI-API compatible
    model="openai/together/meta-llama/Llama-3.3-70B-Instruct-Turbo",
    base_url="http://localhost:8321/v1/openai/v1",
    api_key=os.getenv("OPENAI_API_KEY", "dummy"),
)

INFO:httpx:HTTP Request: GET https://raw.githubusercontent.com/BerriAI/litellm/main/model_prices_and_context_window.json "HTTP/1.1 200 OK"
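
The model string has two parts: the `openai/` prefix, which selects the OpenAI-compatible provider, and the LlamaStack model identifier that is sent to the server. A minimal sketch of how the two configuration strings are composed follows; `to_litellm_model` and `openai_base_url` are hypothetical helper names introduced for illustration, not part of CrewAI or LlamaStack.

```python
def to_litellm_model(llamastack_model_id: str) -> str:
    """Prefix a LlamaStack model identifier so it is treated as OpenAI-compatible."""
    return f"openai/{llamastack_model_id}"

def openai_base_url(host: str = "localhost", port: int = 8321) -> str:
    """Build the OpenAI-compatible base URL used in this notebook."""
    return f"http://{host}:{port}/v1/openai/v1"

print(to_litellm_model("together/meta-llama/Llama-3.3-70B-Instruct-Turbo"))
print(openai_base_url())
```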


#### Test LLM Connection

Verify that CrewAI LLM can successfully communicate with the LlamaStack server.

In [9]:
# Test llm with simple message
messages = [
    {"role": "system", "content": "You are a friendly assistant."},
    {"role": "user", "content": "Write a two-sentence poem about llama."},
]
llamastack_llm.call(messages)

14:49:56 - LiteLLM:INFO: utils.py:3258 - 
LiteLLM completion() model= together/meta-llama/Llama-3.3-70B-Instruct-Turbo; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= together/meta-llama/Llama-3.3-70B-Instruct-Turbo; provider = openai
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/openai/v1/chat/completions "HTTP/1.1 200 OK"
14:50:01 - LiteLLM:INFO: utils.py:1260 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler


"In the Andes' gentle breeze, a llama's soft eyes gaze with peaceful ease, its fur a warm and fuzzy tease. With steps both gentle and serene, the llama roams, a symbol of calm, its beauty pure and supreme."

#### Create CrewAI Custom Tool

Define a custom CrewAI tool, `LlamaStackVectorStoreRAGTool`, to encapsulate the logic for querying the LlamaStack vector store. The CrewAI agent uses this tool to perform retrieval during the RAG process.

-   **Input Schema**: Defines the expected input parameters for the tool, such as the user query, the vector store ID, and optional parameters like `top_k`.
-   **Tool Logic**: Implements the `_run` method, which takes the user query and vector store ID, calls the LlamaStack client's `vector_stores.search` method, and formats the retrieved documents into a human-readable string for the LLM to use as context.

In [16]:
from crewai.tools import BaseTool
from typing import Any, List, Optional, Type
from pydantic import BaseModel, Field

# ---------- 1. Input schema ----------
class VectorStoreRAGToolInput(BaseModel):
    """Input schema for LlamaStackVectorStoreRAGTool."""
    query: str = Field(..., description="The user query for RAG search")
    vector_store_id: str = Field(...,
        description="ID of the vector store to search inside the Llama-Stack server",
    )
    top_k: Optional[int] = Field(
        default=5,
        description="How many documents to return",
    )
    score_threshold: Optional[float] = Field(
        default=None,
        description="Optional similarity score cut-off (0-1).",
    )

# ---------- 2. The tool ----------
class LlamaStackVectorStoreRAGTool(BaseTool):
    name: str = "Llama Stack Vector Store RAG tool"
    description: str = (
        "This tool calls a Llama-Stack endpoint for retrieval-augmented generation using a vector store. "
        "It takes a natural-language query and returns the most relevant documents."
    )
    args_schema: Type[BaseModel] = VectorStoreRAGToolInput
    client: Any
    vector_store_id: str = ""
    top_k: int = 5

    def _run(self, **kwargs: Any) -> str:
        # 1. Resolve parameters (fall back to instance defaults when not supplied)
        query: str = kwargs.get("query")  # required; the schema enforces presence
        vector_store_id: str = kwargs.get("vector_store_id") or self.vector_store_id
        top_k: int = kwargs.get("top_k") or self.top_k
        if not vector_store_id:
            # Report the real problem to the agent instead of an empty result
            return "vector_store_id is empty; specify which vector store to search."
        # 2. Issue request to Llama-Stack
        response = self.client.vector_stores.search(
            vector_store_id=vector_store_id,
            query=query,
            max_num_results=top_k,
        )

        # 3. Massage results into a single human-readable string
        if not response or not response.data:
            return "No documents found."

        docs: List[str] = []
        for result in response.data:
            content = result.content[0].text if result.content else "No content"
            filename = result.filename if result.filename else "unknown"
            docs.append(f"filename: {filename}, content: {content}")
        return "\n".join(docs)
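
The input schema declares `score_threshold`, but `_run` does not read it. One way it could be applied is as a client-side filter over the search results, sketched below. This assumes each result carries a numeric `score` attribute where higher means more similar; verify that against the response objects your server returns. `format_results` is a hypothetical helper, and `SimpleNamespace` stubs stand in for the search response so the sketch runs without a server.

```python
from types import SimpleNamespace

def format_results(results, score_threshold=None):
    """Format search results, optionally dropping those below score_threshold."""
    lines = []
    for result in results:
        score = getattr(result, "score", None)  # assumption: results expose `score`
        if score_threshold is not None and score is not None and score < score_threshold:
            continue  # below the cut-off, skip this document
        content = result.content[0].text if result.content else "No content"
        filename = result.filename or "unknown"
        lines.append(f"filename: {filename}, content: {content}")
    return "\n".join(lines) if lines else "No documents found."

# Stub results with made-up scores, mimicking the shape used in `_run`
fake_results = [
    SimpleNamespace(
        filename="shipping_policy.txt",
        score=0.82,
        content=[SimpleNamespace(text="Acme ships globally in 3-5 business days.")],
    ),
    SimpleNamespace(
        filename="returns_policy.txt",
        score=0.31,
        content=[SimpleNamespace(text="Returns are accepted within 30 days of purchase.")],
    ),
]

print(format_results(fake_results, score_threshold=0.5))
```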


### 6. Building the RAG tool

#### Create a Complete RAG Pipeline

Construct a CrewAI pipeline that orchestrates the RAG process. This pipeline includes:

1.  **Agent Definition**: Defining a CrewAI agent with a role (`RAG assistant`), a goal, a backstory, the LlamaStack LLM, and the custom RAG tool.
2.  **Task Definition**: Defining a CrewAI task for the agent to perform. The task description includes placeholders for the user query and vector store ID, which will be provided during execution. The task's expected output is an answer to the question based on the retrieved context.
3.  **Crew Definition**: Creating a CrewAI `Crew` object with the defined task and agent. This crew represents the complete RAG pipeline.

**CrewAI workflow**:
`User Query → CrewAI Task → Agent invokes LlamaStackVectorStoreRAGTool → LlamaStack Vector Search → Retrieved Context → Agent uses Context + Question → LLM Generation → Final Response`

In [17]:
from crewai import Agent, Crew, Task, Process

# ---- 3. Define the agent -----------------------------------------
agent = Agent(
    role="RAG assistant",
    goal="Answer user's question with provided context",
    backstory="You are an experienced search assistant specializing in finding relevant information from documentation and vector_db to answer user questions accurately.",
    allow_delegation=False,
    llm=llamastack_llm,
    tools=[LlamaStackVectorStoreRAGTool(client=client)])
# ---- 4. Wrap everything in a Crew task ---------------------------
task = Task(
    description="Answer the following questions: {query}, using the RAG_tool to search the provided vector_store_id {vector_store_id} if needed",
    expected_output="An answer to the question with provided context",
    agent=agent,
)
crew = Crew(tasks=[task], verbose=True)
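
Stripped of the framework, the workflow amounts to three steps: fill the task placeholders, retrieve context, and generate an answer from context plus question. The sketch below mimics those steps in plain Python. `fake_retrieve` and `fake_generate` are hypothetical stand-ins for the vector search and the LLM call, and the template is a shortened version of the task description; in the real crew the agent decides for itself whether and how to call the tool.

```python
TASK_TEMPLATE = "Answer: {query} (search vector store {vector_store_id} if needed)"

def fill_template(template: str, inputs: dict) -> str:
    """Substitute {placeholders} with the values passed as inputs."""
    return template.format(**inputs)

def run_rag(query, vector_store_id, retrieve, generate):
    """Retrieve context, build a prompt, and hand it to the generator."""
    context = retrieve(vector_store_id, query)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)

def fake_retrieve(vector_store_id, query):
    # Stand-in for the vector store search
    return "Acme ships globally in 3-5 business days."

def fake_generate(prompt):
    # Stand-in for the LLM: echo the first context line as the answer
    return prompt.splitlines()[1]

inputs = {"query": "How long does shipping take?", "vector_store_id": "vs_demo"}
print(fill_template(TASK_TEMPLATE, inputs))
print(run_rag(inputs["query"], inputs["vector_store_id"], fake_retrieve, fake_generate))
```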


### 7. Testing the RAG System

#### Example 1: Shipping Query

In [18]:
query = "How long does shipping take?"
response = crew.kickoff(inputs={"query": query,"vector_store_id": vector_store.id})
print("❓", query)
print("💡", response)

14:55:09 - LiteLLM:INFO: utils.py:3258 - 
LiteLLM completion() model= together/meta-llama/Llama-3.3-70B-Instruct-Turbo; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= together/meta-llama/Llama-3.3-70B-Instruct-Turbo; provider = openai
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/openai/v1/chat/completions "HTTP/1.1 200 OK"
14:55:11 - LiteLLM:INFO: utils.py:1260 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler


INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/vector_stores/vs_dab05212-db05-402c-91ef-57e41797406b/search "HTTP/1.1 200 OK"
14:55:11 - LiteLLM:INFO: utils.py:3258 - 
LiteLLM completion() model= together/meta-llama/Llama-3.3-70B-Instruct-Turbo; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= together/meta-llama/Llama-3.3-70B-Instruct-Turbo; provider = openai
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/openai/v1/chat/completions "HTTP/1.1 200 OK"
14:55:12 - LiteLLM:INFO: utils.py:1260 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler


❓ How long does shipping take?
💡 Acme ships globally in 3-5 business days.


#### Example 2: Returns Policy Query

In [19]:
query = "Can I return a product after 40 days?"
response = crew.kickoff(inputs={"query": query,"vector_store_id": vector_store.id})
print("❓", query)
print("💡", response)

14:55:19 - LiteLLM:INFO: utils.py:3258 - 
LiteLLM completion() model= together/meta-llama/Llama-3.3-70B-Instruct-Turbo; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= together/meta-llama/Llama-3.3-70B-Instruct-Turbo; provider = openai
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/openai/v1/chat/completions "HTTP/1.1 200 OK"
14:55:21 - LiteLLM:INFO: utils.py:1260 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler


INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/openai/v1/vector_stores/vs_dab05212-db05-402c-91ef-57e41797406b/search "HTTP/1.1 200 OK"
14:55:22 - LiteLLM:INFO: utils.py:3258 - 
LiteLLM completion() model= together/meta-llama/Llama-3.3-70B-Instruct-Turbo; provider = openai
INFO:LiteLLM:
LiteLLM completion() model= together/meta-llama/Llama-3.3-70B-Instruct-Turbo; provider = openai
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/openai/v1/chat/completions "HTTP/1.1 200 OK"
14:55:22 - LiteLLM:INFO: utils.py:1260 - Wrapper: Completed Call, calling success_handler
INFO:LiteLLM:Wrapper: Completed Call, calling success_handler


❓ Can I return a product after 40 days?
💡 Returns are accepted within 30 days of purchase.


---

We have successfully built a RAG system that combines:

-   **LlamaStack** for infrastructure (LLM serving + vector store)
-   **CrewAI** for orchestration (agents, tasks, and tools)
-   **Together AI** for high-quality language models

### Key Benefits

1.  **Unified Infrastructure**: A single server for LLMs and vector stores simplifies deployment and management.
2.  **OpenAI Compatibility**: Enables easy integration with existing libraries and frameworks that support the OpenAI API standard, such as CrewAI.
3.  **Multi-Provider Support**: Offers the flexibility to switch between different LLM and embedding providers without altering the core application logic.
4.  **Production Ready**: LlamaStack includes features designed for production environments, such as built-in safety shields and monitoring capabilities.


### 🔧 Cleanup

Remember to stop the LlamaStack server process when you are finished to free up resources. You can use the `kill_llama_stack_server()` helper function defined earlier in the notebook.
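
The `kill_llama_stack_server()` helper matches processes by name via `pkill`. As a complementary sketch, a `subprocess.Popen` handle can also be stopped directly: ask the process to terminate, wait briefly, and escalate to a kill if it does not exit. `stop_process` is a hypothetical helper name, and the demonstration uses a harmless stand-in process rather than the real server. Because the notebook launches the server with `shell=True`, `server_process` may refer to an intermediate shell rather than the server itself, so calling the `pkill` helper as well is the safer cleanup.

```python
import subprocess
import sys

def stop_process(process: subprocess.Popen, timeout: float = 10.0) -> int:
    """Terminate a child process, escalating to kill if it does not exit in time."""
    if process.poll() is None:  # still running
        process.terminate()
        try:
            process.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            process.kill()
            process.wait()
    return process.returncode

# Demonstration with a stand-in process that would otherwise sleep for a minute
demo = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print("exit code:", stop_process(demo))
```

In the notebook this would look like `stop_process(server_process)` followed by `kill_llama_stack_server()`.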