# Docling as MCP tool with Llama Stack

---


### Technologies We'll Use

Building on our previous labs, we'll add:

1. **[Docling](https://docling-project.github.io/docling/):** An open-source toolkit used to parse and convert documents.
2. **[MCP](https://modelcontextprotocol.io)**: The Model Context Protocol, used here to expose Docling as a tool.
3. **[Llama Stack](https://llama-stack.readthedocs.io/)**: Framework for building generative AI applications.
4. **Agentic RAG**: Uses reasoning and tools for an enhanced RAG flow.


---

## Prerequisites

Before we begin, ensure you have:
- Completed Lab 1 (or equivalent Docling knowledge)
- Python >=3.10 installed
- Ollama installed
- Podman installed (or Docker)

---

## Installation and Setup


In [None]:
! pip install \
    llama-stack-client==0.2.0 \
    pydantic \
    pydantic_settings \
    rich


Now let's import the essential modules:

In [None]:
import uuid
import logging

from llama_stack_client import LlamaStackClient
from pydantic import NonNegativeFloat
from pydantic_settings import BaseSettings, SettingsConfigDict

# pretty print of the results returned from the model/agent
from rich.console import Console
from rich.pretty import pprint

### Logging

To see detailed information about the document processing and chunking operations, we'll set the log level to INFO.

NOTE: It is okay to skip running this cell if you prefer less verbose output.

In [None]:
console = Console()

logger = logging.getLogger(__name__)
if not logger.hasHandlers():  
    logger.setLevel(logging.INFO)
    stream_handler = logging.StreamHandler()
    stream_handler.setLevel(logging.INFO)
    formatter = logging.Formatter('%(message)s')
    stream_handler.setFormatter(formatter)
    logger.addHandler(stream_handler)

### Setting up

In the following blocks we set up the environment needed to connect to Llama Stack and use it as an agentic framework.

In [None]:
class Settings(BaseSettings):
    base_url: str

    vdb_provider: str
    vdb_embedding: str
    vdb_embedding_dimension: int
    vdb_embedding_window: int

    inference_model_id: str
    max_tokens: int
    temperature: NonNegativeFloat
    top_p: float
    stream: bool

    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

In [None]:
settings = Settings(
    base_url="http://localhost:8321",
    inference_model_id="meta-llama/Llama-3.2-3B-Instruct",
    max_tokens=4096,
    temperature=0.0,
    top_p=0.95,
    stream=True,
    vdb_provider="faiss",
    vdb_embedding="all-MiniLM-L6-v2",
    vdb_embedding_dimension=384,
    vdb_embedding_window=256,
)
print(settings)

In [None]:
if settings.temperature > 0.0:
    strategy = {
        "type": "top_p",
        "temperature": settings.temperature,
        "top_p": settings.top_p,
    }
else:
    strategy = {"type": "greedy"}

# sampling_params will later be used to pass the parameters to Llama Stack Agents/Inference APIs
sampling_params = {
    "strategy": strategy,
    "max_tokens": settings.max_tokens,
}

print(sampling_params)
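
The branching above can also be wrapped in a small function, which makes it easier to try different settings. This is a minimal sketch that mirrors the dictionary built in the previous cell; the function name `build_sampling_params` and the explicit check for negative temperatures are illustrative additions, not part of the Llama Stack API.

```python
from typing import Any


def build_sampling_params(temperature: float, top_p: float, max_tokens: int) -> dict[str, Any]:
    """Return a sampling configuration shaped like the dictionary built above."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature > 0.0:
        # Non-zero temperature: use the "top_p" strategy with both values.
        strategy: dict[str, Any] = {
            "type": "top_p",
            "temperature": temperature,
            "top_p": top_p,
        }
    else:
        # Zero temperature: fall back to greedy decoding.
        strategy = {"type": "greedy"}
    return {"strategy": strategy, "max_tokens": max_tokens}


print(build_sampling_params(temperature=0.0, top_p=0.95, max_tokens=4096))
print(build_sampling_params(temperature=0.7, top_p=0.95, max_tokens=4096))
```

With the lab's default `temperature=0.0`, the greedy branch is selected and `top_p` is ignored.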

---

## Launch Llama Stack

In this lab we will interact with a Llama Stack backend. We have chosen the Ollama distribution, which makes it easy to get started in a local environment.

### Fetch the models

In a terminal window, use the following commands to fetch and load the model required for this lab.

```bash
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"

# ollama names this model differently, and we must use the ollama name when loading the model
export OLLAMA_INFERENCE_MODEL="llama3.2:3b-instruct-fp16"
ollama run $OLLAMA_INFERENCE_MODEL --keepalive 60m
```


### Start the Llama Stack container

In a new terminal window use the following command to run the Llama Stack server.

```bash
# make a working directory which will be used by the container
mkdir -p ~/.llama

# launch llama stack
export LLAMA_STACK_PORT=8321
podman run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  llamastack/distribution-ollama \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://host.containers.internal:11434
```


Next, we can use the `LlamaStackClient` within this notebook to validate the connection.

In [None]:
client = LlamaStackClient(base_url=settings.base_url)
print(f"Connected to Llama Stack server @ {client.base_url}")
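
Constructing a client object does not necessarily prove that the server is reachable. If a later call fails, a plain TCP check can help tell "server not running" apart from other problems. The sketch below uses only the standard library; the helper names `split_host_port` and `is_reachable` are hypothetical and not part of `llama_stack_client`.

```python
import socket
from urllib.parse import urlparse


def split_host_port(base_url: str) -> tuple[str, int]:
    """Extract host and port from a URL, using the scheme's default port if none is given."""
    parsed = urlparse(base_url)
    if parsed.hostname is None:
        raise ValueError(f"no host in URL: {base_url!r}")
    default_port = 443 if parsed.scheme == "https" else 80
    return parsed.hostname, parsed.port or default_port


def is_reachable(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port can be opened."""
    host, port = split_host_port(base_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


print(split_host_port("http://localhost:8321"))
print(is_reachable("http://localhost:8321"))
```

A `True` result only shows that something is listening on the port; it does not confirm that the listener is a Llama Stack server.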

---

## Launch the MCP tool

MCP makes it possible to connect custom tools (like Docling) to an agentic framework. In this lab we will use an MCP tool that can:
1. Convert documents using Docling
2. Ingest them into a Llama Stack vector DB instance.

_You can inspect how a tool is created by looking at the file [Docling_Lab4_tool.py](./Docling_Lab4_tool.py)_

We already packaged the tool into a working container image which is ready for you to try out.

**Launch the Docling Llama Stack MCP tool** by running the following command in a new terminal window.

```bash
podman run \
  -it \
  --pull always \
  -p 8000:8000 \
  quay.io/docling-project/lab-demo-docling-llamstack-mcp \
  --env DOCLING_MCP_LLAMA_STACK_URL=http://host.containers.internal:8321
```

---

### Validate tools available in our llama-stack instance

When an instance of llama-stack is redeployed, your tools need to be re-registered. Conversely, if a tool is already registered with a llama-stack instance and you try to register another one with the same `toolgroup_id`, llama-stack will raise an error.

For this reason, it is recommended to include some code that validates your tools and toolgroups. This is where the `mcp_url` comes into play. The following code checks whether the `mcp::docling-llamastack` toolgroup is registered and, if it is not, registers it directly from the MCP URL.

If you are running the MCP server from source, the default value for this is: `http://localhost:8000/sse`.

If you are running the MCP server from a container, the default value for this is: `http://host.containers.internal:8000/sse`.

Make sure to pass the corresponding MCP URL for the server you are trying to register/validate tools for.
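The two defaults mentioned above differ only in the host name, so they can be derived from a single flag. This is a small sketch; the function name `docling_mcp_sse_url` and its parameters are illustrative, and the port `8000` is taken from the `podman run` command shown earlier.

```python
def docling_mcp_sse_url(server_in_container: bool, port: int = 8000) -> str:
    """Build the SSE endpoint URL for the Docling MCP server."""
    # Container deployments are addressed through the host alias used in this lab;
    # a server started from source is addressed through localhost.
    host = "host.containers.internal" if server_in_container else "localhost"
    return f"http://{host}:{port}/sse"


print(docling_mcp_sse_url(server_in_container=True))
print(docling_mcp_sse_url(server_in_container=False))
```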

In [None]:
docling_mcp_url = "http://host.containers.internal:8000/sse"

registered_tools = client.tools.list()
registered_toolgroups = [t.toolgroup_id for t in registered_tools]

if "mcp::docling-llamastack" not in registered_toolgroups:
    client.toolgroups.register(
        toolgroup_id="mcp::docling-llamastack",
        provider_id="model-context-protocol",
        mcp_endpoint={"uri":docling_mcp_url},
    )

registered_tools = client.tools.list()
registered_toolgroups = [t.toolgroup_id for t in registered_tools]
logger.info(f"Your Llama Stack server is already registered with the following tool groups @ {set(registered_toolgroups)} \n")
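
The check-then-register pattern used in this cell can be tried out without a running server. The sketch below replaces the server with a hypothetical in-memory class, `InMemoryToolgroupRegistry`, which rejects duplicate IDs in the way the text describes; it is a stand-in for illustration and does not reproduce the Llama Stack client API.

```python
class InMemoryToolgroupRegistry:
    """Hypothetical stand-in for a server-side toolgroup registry."""

    def __init__(self) -> None:
        self._endpoints: dict[str, str] = {}

    def list_ids(self) -> set[str]:
        return set(self._endpoints)

    def register(self, toolgroup_id: str, mcp_uri: str) -> None:
        # Mimic the behaviour described above: duplicate IDs are an error.
        if toolgroup_id in self._endpoints:
            raise ValueError(f"toolgroup {toolgroup_id!r} is already registered")
        self._endpoints[toolgroup_id] = mcp_uri


def ensure_registered(registry: InMemoryToolgroupRegistry, toolgroup_id: str, mcp_uri: str) -> bool:
    """Register the toolgroup only if it is missing; return True if it was added."""
    if toolgroup_id in registry.list_ids():
        return False
    registry.register(toolgroup_id, mcp_uri)
    return True


registry = InMemoryToolgroupRegistry()
print(ensure_registered(registry, "mcp::docling-llamastack", "http://localhost:8000/sse"))
print(ensure_registered(registry, "mcp::docling-llamastack", "http://localhost:8000/sse"))
```

The first call adds the toolgroup and the second call is a no-op, so the cell can be re-run safely after a notebook restart.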

## Ingest + RAG-aware agent

- Initialize the collection in the vectordb
- Initialize the agent with the required tools:
    - Docling Ingest handles instructions like "Ingest the document https://arxiv.org/pdf/2503.11576".
    - RAG/Knowledge search will respond to user queries by running RAG on the documents ingested in the vectordb.
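
Both tools are passed to the agent as plain dictionaries that point at the same collection. Since the same pair is needed again for the ReAct agent later in this lab, it can be convenient to build it in one place. This is a sketch only; `build_tool_specs` is a hypothetical helper name, while the tool names and argument keys are the ones used in this notebook.

```python
from typing import Any


def build_tool_specs(vector_db_id: str) -> list[dict[str, Any]]:
    """Return the ingest and search tool entries bound to one vector DB collection."""
    return [
        # The Docling ingest tool writes into a single collection.
        {"name": "mcp::docling-llamastack", "args": {"vector_db_id": vector_db_id}},
        # The knowledge search tool accepts a list of collections to query.
        {"name": "builtin::rag/knowledge_search", "args": {"vector_db_ids": [vector_db_id]}},
    ]


for spec in build_tool_specs("test_vector_db_example"):
    print(spec)
```

Note the difference in argument names: the ingest tool takes one `vector_db_id`, whereas the search tool takes a list under `vector_db_ids`.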

In [None]:
from llama_stack_client import Agent, AgentEventLogger
from llama_stack_client.lib.agents.event_logger import EventLogger


In [None]:
# define the name of the vectordb collection to use
vector_db_id = f"test_vector_db_{uuid.uuid4()}"

# define and register the document collection to be used
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model=settings.vdb_embedding,
    embedding_dimension=settings.vdb_embedding_dimension,
    provider_id=settings.vdb_provider,
)


agent = Agent(
    client,
    model=settings.inference_model_id,
    instructions="You are a helpful assistant.",
    sampling_params=sampling_params,
    tools=[
        dict(
            name="mcp::docling-llamastack",
            args={
                "vector_db_id": vector_db_id,
            },
        ),
        dict(
            name="builtin::rag/knowledge_search",
            args={
                "vector_db_ids": [vector_db_id],  # list of IDs of document collections to consider during retrieval
            },
        )
    ],

)

In [None]:
v=client.vector_dbs.retrieve(vector_db_id)
client.vector_io.query(vector_db_id=vector_db_id, query="docling")

## Executing ingest and RAG queries

- For each prompt, initialize a new agent session, execute a turn during which a retrieval call may be requested, and output the reply received from the agent.

In [None]:
queries = [
    "Ingest the document https://arxiv.org/pdf/2503.11576",
    "Lookup the documents to answer the question: How does the system compare to humans when analyzing the layout?",
]

for prompt in queries:
    console.print(f"\n[cyan]User> {prompt}[/cyan]")
    
    # create a new turn with a new session ID for each prompt
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=agent.create_session(f"rag-session_{uuid.uuid4()}"),
        stream=settings.stream,
    )
    
    # print the response, including tool calls output
    if settings.stream:
        for log in EventLogger().log(response):
            log.print()
    else:
        pprint(response.steps)

## What happened?

The code above executed a chat interaction with an agent.

With the first message, we instruct the agent to ingest the document. The model reasons about the request and decides to call the Docling tool to convert the document.

With the second message, we ask the agent to search the ingested content.
Note how the model decides on its own which query to use when searching for relevant chunks in the vector database. Unlike in the previous labs, retrieval is not performed with the exact user query: the model interprets and reformulates it first.

---

# ReAct Agent

In the following section we use the reasoning agent `ReActAgent`. In this scenario, the model that orchestrates tool execution reasons about the sequence of tools to run in order to perform the task.

This makes it possible for a single user query to trigger multiple independent steps, e.g.

1. Ingest the documents
2. Run a search query on the documents
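
To make the idea of one request driving several tool calls concrete, here is a toy loop in plain Python. It is a sketch under strong assumptions: the model is replaced by a hypothetical `scripted_policy` function with hard-coded decisions, and `ingest` and `search` are stand-ins that only manipulate an in-memory dictionary. It does not reflect how `ReActAgent` is implemented.

```python
from typing import Callable

# Toy in-memory "vector database": maps a document URL to placeholder text.
store: dict[str, str] = {}


def ingest(url: str) -> str:
    """Stand-in for the Docling ingest tool."""
    store[url] = f"placeholder text for {url}"
    return f"ingested {url}"


def search(query: str) -> str:
    """Stand-in for knowledge search; it counts stored documents without ranking them."""
    return f"{len(store)} document(s) available for query '{query}'"


TOOLS: dict[str, Callable[[str], str]] = {"ingest": ingest, "search": search}


def scripted_policy(task: dict[str, str], observations: list[str]) -> tuple[str, str]:
    """Hypothetical replacement for the model: pick the next action from past observations."""
    if len(observations) == 0:
        return ("ingest", task["url"])
    if len(observations) == 1:
        return ("search", task["question"])
    return ("answer", observations[-1])


def run_react(task: dict[str, str], max_steps: int = 5) -> list[str]:
    """Alternate between choosing an action and observing its result; return the action trace."""
    observations: list[str] = []
    trace: list[str] = []
    for _ in range(max_steps):
        action, argument = scripted_policy(task, observations)
        trace.append(action)
        if action == "answer":
            break
        # Execute the chosen tool and feed its output back into the next decision.
        observations.append(TOOLS[action](argument))
    return trace


task = {"url": "https://arxiv.org/pdf/2503.11576", "question": "layout analysis vs. humans"}
print(run_react(task))
```

In the real agent, the decision at each step comes from the model rather than from fixed rules, which is why the number and order of tool calls can vary between runs.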


In [None]:
from llama_stack_client.lib.agents.react.agent import ReActAgent
from llama_stack_client.lib.agents.react.tool_parser import ReActOutput


In [None]:
vector_db_id = f"test_vector_db_{uuid.uuid4()}"

# define and register the document collection to be used
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model=settings.vdb_embedding,
    embedding_dimension=settings.vdb_embedding_dimension,
    provider_id=settings.vdb_provider,
)



In [None]:

agent = ReActAgent(
    client=client,
    model=settings.inference_model_id,
    tools=[
        dict(
            name="mcp::docling-llamastack",
            args={
                "vector_db_id": vector_db_id,
            },
        ),
        dict(
            name="builtin::rag/knowledge_search",
            args={
                "vector_db_ids": [vector_db_id],  # list of IDs of document collections to consider during retrieval
            },
        ),
    ],
    response_format={
        "type": "json_schema",
        "json_schema": ReActOutput.model_json_schema(),
    },
    sampling_params=sampling_params,
)

user_prompts = [
    "I would like to summarize the statements of the authors of https://arxiv.org/pdf/2503.11576 on how does SmolDocling compare to humans when analyzing the layout."
]

for prompt in user_prompts:
    print("\n"+"="*50)
    console.print(f"[cyan]Processing user query: {prompt}[/cyan]")
    print("="*50)
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=agent.create_session(f"rag-session_{uuid.uuid4()}"),
        stream=settings.stream
    )
    if settings.stream:
        for log in EventLogger().log(response):
            log.print()
    else:
        pprint(response.steps) # print the steps of an agent's response in a formatted way. 

## What happened?

Compared to the previous agent, here the model used explicit reasoning to create a plan of the actions needed to complete the task.

---

# Summary and Next Steps

### What You've Accomplished

Congratulations! You've successfully used Docling in an agentic framework. Here's what you've learned:

- **Lab 1**: Document structure preservation enables everything else
- **Lab 2**: Intelligent chunking optimizes retrieval quality
- **Lab 3**: Visual grounding transforms RAG into transparent AI
- **Lab 4**: Run Docling as MCP tool with Llama Stack


## Next Steps: Where to Go from Here

### Immediate actions

1. **Experiment with your documents**
   - Try documents with complex layouts
   - Test with technical diagrams and charts
   - Process multi-page reports with mixed content

2. **Connect more agents**
   - Try connecting more tools
   - Search the documents to ingest via metadata
   - Search the web for relevant documents
   - Extract information from the documents

3. **More ways to interact with tools**
   - Use the Llama Stack playground UI for chatting with the agents
   - Use other frameworks and ecosystems like Claude Desktop, BeeAI, etc.

---

## Resources for Continued Learning

### Official Documentation
- **[Docling Documentation](https://github.com/docling-project/docling)**: Latest features and updates

### Community Resources
- Join the Docling community on GitHub
- Share your implementations
- Contribute improvements back to the project

### Related Topics to Explore
- Document Layout Analysis
- Multimodal Embeddings
- Visual Question Answering
- Explainable AI Systems

---

## Final Thoughts

You've completed a journey from basic document conversion to building a sophisticated, transparent AI system. Combining Docling's document understanding with AI frameworks like LangChain, Llama Stack and MCP makes it possible to build powerful applications.
