# RAGdoll Key Modules Demo

Hands-on walkthrough for ingestion, chunking, embeddings, storage layers, and orchestration. Every example reads from the sample assets under `tests/test_data`, so no external data is required.

> **Note:** Sections 7 and 9 call OpenAI's GPT endpoints via `get_llm_caller`. Export `OPENAI_API_KEY` (or add it to `.env`) before running them.


## What you'll see

1. **Ingestion**: `DocumentLoaderService` from [`docs/ingestion.md`](../docs/ingestion.md).
2. **Chunking**: `ragdoll.chunkers` helpers from [`docs/chunking.md`](../docs/chunking.md).
3. **Embeddings**: provider factory from [`docs/embeddings.md`](../docs/embeddings.md) using the fake backend for speed.
4. **Vector stores**: `vector_store_from_config` customization from [`docs/vector_stores.md`](../docs/vector_stores.md).
5. **Graph stores**: `get_graph_store` JSON persistence from [`docs/graph_stores.md`](../docs/graph_stores.md).
6. **Graph retrievers**: `GraphPersistenceService` simple/Neo4j backends from [`docs/graph_stores.md`](../docs/graph_stores.md).
7. **LLMs**: `get_llm_caller`/`call_llm_sync` bridge described in [`docs/llm_integration.md`](../docs/llm_integration.md), calling your real OpenAI model.
8. **Pipeline**: `IngestionPipeline` snapshot from [`docs/architecture.md`](../docs/architecture.md).
9. **Ragdoll**: orchestrator entry point tying everything together.

Each cell builds on the previous ones so you can treat this as a scratchpad for experimenting with new loaders or configuration overrides.


In [None]:
from pathlib import Path
from pprint import pprint
import shutil
import time

from langchain_core.documents import Document

from ragdoll import Ragdoll
from ragdoll.app_config import bootstrap_app
from ragdoll.ingestion import DocumentLoaderService
from ragdoll.chunkers import get_text_splitter, split_documents
from ragdoll.embeddings import get_embedding_model
from ragdoll.vector_stores import vector_store_from_config
from ragdoll.config.base_config import VectorStoreConfig
from ragdoll.graph_stores import get_graph_store
from ragdoll.entity_extraction.models import Graph, GraphNode, GraphEdge
from ragdoll.entity_extraction.graph_persistence import GraphPersistenceService
from ragdoll.llms import get_llm_caller
from ragdoll.llms.callers import call_llm_sync
from ragdoll.pipeline import IngestionPipeline, IngestionOptions

DATA_DIR = Path('../tests/test_data').resolve()
STATE_DIR = Path('demo_state').resolve()
STATE_DIR.mkdir(exist_ok=True)

#SAMPLE_TXT = DATA_DIR / 'test_txt.txt'
SAMPLE_TXT = DATA_DIR / '*'

app_config = bootstrap_app(
    overrides={
        'monitor': {'enabled': False, 'collect_metrics': False},
    }
)


def normalize_documents(raw_docs):
    docs = []
    for entry in raw_docs:
        if isinstance(entry, Document):
            docs.append(entry)
        elif isinstance(entry, dict):
            docs.append(
                Document(
                    page_content=str(entry.get('page_content', '')),
                    metadata=entry.get('metadata', {}) or {},
                )
            )
        else:
            docs.append(Document(page_content=str(entry), metadata={}))
    return docs


def reset_subdir(name: str) -> Path:
    path = STATE_DIR / name
    if path.exists():
        for attempt in range(5):
            try:
                shutil.rmtree(path)
                break
            except PermissionError:
                time.sleep(0.5)
        else:
            timestamped = STATE_DIR / f"{name}_{int(time.time())}"
            timestamped.mkdir(parents=True, exist_ok=True)
            print(f"Warning: {path} was locked, using {timestamped} instead.")
            return timestamped
    path.mkdir(parents=True, exist_ok=True)
    return path

## 1. Load sample data
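`DocumentLoaderService` fans out across the loader registry defined in `ragdoll/config/default_config.yaml`. As written, `SAMPLE_TXT` is the wildcard `tests/test_data/*`, so every fixture in that directory is loaded (777 documents in the recorded run). To stay on the lightweight TXT fixture and avoid optional dependencies, switch to the commented-out `test_txt.txt` assignment in the setup cell.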
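
The fan-out idea can be sketched without RAGdoll: expand each glob pattern, then hand every file to the loader registered for its extension. The `LOADERS` dict, `load_text`, and `load_sources` below are hypothetical stand-ins for illustration, not the library's registry or API.

```python
import glob
from pathlib import Path
from typing import Callable, Dict, List


def load_text(path: Path) -> str:
    """Read a plain-text file as UTF-8."""
    return path.read_text(encoding="utf-8")


# Hypothetical registry: file extension -> loader callable.
# RAGdoll keeps its real registry in YAML config; this dict is only a stand-in.
LOADERS: Dict[str, Callable[[Path], str]] = {
    ".txt": load_text,
    ".md": load_text,
}


def load_sources(patterns: List[str]) -> List[dict]:
    """Expand glob patterns and dispatch each file to the loader for its extension."""
    docs = []
    for pattern in patterns:
        for match in sorted(glob.glob(pattern)):
            path = Path(match)
            if not path.is_file():
                continue
            loader = LOADERS.get(path.suffix.lower())
            if loader is None:
                continue  # no loader registered for this extension
            docs.append(
                {"page_content": loader(path), "metadata": {"source": str(path)}}
            )
    return docs
```

A wildcard pattern therefore yields one entry per file that has a registered loader, which is why a pattern such as `tests/test_data/*` can return far more documents than a single file path.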


In [None]:
loader = DocumentLoaderService(
    app_config=app_config,
    use_cache=False,
    collect_metrics=False,
)

raw_documents = loader.ingest_documents([str(SAMPLE_TXT)])
documents = normalize_documents(raw_documents)

print(f"Loaded {len(documents)} document(s) from {SAMPLE_TXT.name}")
print('Metadata sample:')
pprint(documents[0].metadata)
print('Preview:')
print(documents[0].page_content[:400])


Loaded 777 document(s) from *
Metadata sample:
{'author': 'Nathan Sasto',
 'content_type': 'document_full',
 'conversion_success': True,
 'created': '2025-04-08 13:25:00+00:00',
 'file_name': 'test_docx.docx',
 'file_size': 328097,
 'last_modified_by': 'Nathan Sasto',
 'modified': '2025-05-28 19:59:00+00:00',
 'revision': '3',
 'source': 'C:\\dev\\RAGdoll\\tests\\test_data\\test_docx.docx',
 'success': True}
Preview:
Lorem ipsum

# Large Language Models

![](data:image/x-emf;base64...)

Large language models (LLMs) have transformed natural language processing by leveraging massive datasets and computational power to achieve remarkable performance in tasks like text generation, summarization, and sentiment analysis. These models rely on deep learning architectures, particularly transformers, which allow them to


## 2. Chunk documents
`ragdoll.chunkers.get_text_splitter` mirrors the strategies in [`docs/chunking.md`](../docs/chunking.md). Reusing the splitter instance keeps experiments consistent when you tweak chunk sizes/overlap.
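
To see how chunk size and overlap interact, here is a minimal sketch of a fixed-window splitter. It is deliberately simpler than the `recursive` strategy (it ignores separators such as paragraphs and sentences), and `split_with_overlap` is a hypothetical helper, not part of `ragdoll.chunkers`.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghij" * 10  # 100 characters
windows = split_with_overlap(sample, chunk_size=40, chunk_overlap=10)
```

Each window advances by `chunk_size - chunk_overlap` characters, so raising the overlap increases the number of chunks for the same input.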


In [None]:
splitter = get_text_splitter(
    splitter_type='recursive',
    chunk_size=250,
    chunk_overlap=40,
    app_config=app_config,
)
chunks = split_documents(documents, text_splitter=splitter)

print(f"Created {len(chunks)} chunk(s)")
for idx, chunk in enumerate(chunks[:3], start=1):
    preview = chunk.page_content[:180].replace('\n', ' ')
    print(f"Chunk {idx} metadata: {chunk.metadata}")
    print(preview)
    print('---')


Created 8985 chunk(s)
Chunk 1 metadata: {'source': 'C:\\dev\\RAGdoll\\tests\\test_data\\test_docx.docx', 'success': True, 'conversion_success': True, 'file_name': 'test_docx.docx', 'file_size': 328097, 'author': 'Nathan Sasto', 'created': '2025-04-08 13:25:00+00:00', 'modified': '2025-05-28 19:59:00+00:00', 'last_modified_by': 'Nathan Sasto', 'revision': '3', 'content_type': 'document_full'}
Lorem ipsum  # Large Language Models  ![](data:image/x-emf;base64...)
---
Chunk 2 metadata: {'source': 'C:\\dev\\RAGdoll\\tests\\test_data\\test_docx.docx', 'success': True, 'conversion_success': True, 'file_name': 'test_docx.docx', 'file_size': 328097, 'author': 'Nathan Sasto', 'created': '2025-04-08 13:25:00+00:00', 'modified': '2025-05-28 19:59:00+00:00', 'last_modified_by': 'Nathan Sasto', 'revision': '3', 'content_type': 'document_full'}
Large language models (LLMs) have transforme

## 3. Create embeddings
`ragdoll.embeddings.get_embedding_model` instantiates providers dynamically. Passing `provider="fake"` gives deterministic vectors without hitting OpenAI/HuggingFace, but the rest of the flow matches production usage.
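
One way a fake backend can be deterministic is to seed a random generator from a hash of the input text. The sketch below illustrates that idea only; `fake_embed` is a hypothetical helper, and the actual `fake` provider may generate its vectors differently.

```python
import hashlib
import random
from typing import List


def fake_embed(text: str, size: int = 256) -> List[float]:
    """Return a pseudo-random vector that depends only on the input text."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")  # stable across runs and machines
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]
```

Vectors built this way are repeatable, which keeps tests stable, but they carry no semantic signal, so similarity scores computed from them should not be read as relevance.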


In [None]:
embedding_inputs = [chunk.page_content for chunk in chunks[:3]]
if not embedding_inputs:
    embedding_inputs = [documents[0].page_content]

fake_embeddings = get_embedding_model(provider='fake', size=256)
vectors = fake_embeddings.embed_documents(embedding_inputs)

print(f"Generated {len(vectors)} embedding vector(s) with dimension {len(vectors[0])}")
print('First vector slice:', vectors[0][:8])


Generated 3 embedding vector(s) with dimension 256
First vector slice: [np.float64(-1.1767431451548127), np.float64(2.5051321687172563), np.float64(0.47355094119428875), np.float64(-0.5549040791814097), np.float64(1.1380425553720102), np.float64(-0.8713958213782483), np.float64(-2.5380340171633797), np.float64(0.7849101963595468)]


## 4. Build a vector store
`vector_store_from_config` consumes a `VectorStoreConfig`, so you can swap FAISS/Chroma/etc. on demand. This cell provisions a Chroma collection under `demo_state` and runs a quick similarity query.
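
Conceptually, a similarity query scores the query vector against every stored vector and keeps the best `k`. Here is a minimal in-memory sketch using cosine similarity. The `cosine` and `top_k` helpers are hypothetical, and the metric a real store uses depends on its configuration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int = 2
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k stored vectors most similar to the query."""
    scored = [(idx, cosine(query, vec)) for idx, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.1], stored, k=2)
```

With the fake embeddings used in this notebook, the ranking exercises the plumbing rather than retrieval quality.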


In [None]:
core_store_dir = reset_subdir('chroma_core_demo')
vector_config = VectorStoreConfig(
    enabled=True,
    store_type='chroma',
    params={
        'collection_name': 'ragdoll_core_demo',
        'persist_directory': str(core_store_dir),
    },
)

demo_vector_store = vector_store_from_config(
    vector_config,
    embedding=fake_embeddings,
)

demo_vector_store.add_documents(chunks)
question = 'What content lives in the txt sample?'
results = demo_vector_store.similarity_search(question, k=2)
for idx, doc in enumerate(results, start=1):
    snippet = doc.page_content[:160].replace('\n', ' ')
    print(f"Result {idx} (source={doc.metadata.get('source')}) -> {snippet}")


PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\dev\\RAGdoll\\examples\\demo_state\\chroma_core_demo\\chroma.sqlite3'

## 5. Persist a tiny graph
`get_graph_store` supports JSON, NetworkX, and Neo4j backends. To keep things simple we build a two-node graph (document -> chunk) and write it to JSON for inspection.
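
The shape of a node/edge graph serialized to JSON can be sketched with plain dataclasses. `Node`, `Edge`, and `graph_to_json` are hypothetical stand-ins for illustration, not RAGdoll's `Graph`, `GraphNode`, or `GraphEdge` models.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    type: str
    metadata: dict = field(default_factory=dict)
    # Each node gets a random UUID so edges can reference it by id.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class Edge:
    source: str  # id of the node the edge starts from
    target: str  # id of the node the edge points to
    type: str
    metadata: dict = field(default_factory=dict)


def graph_to_json(nodes: List[Node], edges: List[Edge]) -> str:
    """Serialize nodes and edges into a single JSON document."""
    payload = {
        "nodes": [asdict(node) for node in nodes],
        "edges": [asdict(edge) for edge in edges],
    }
    return json.dumps(payload, indent=2)
```

Because edges store node ids rather than object references, the file can be reloaded and the links rebuilt with a simple id lookup.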


In [None]:
doc_node = GraphNode(
    name='Sample Text File',
    type='Document',
    metadata={'path': str(SAMPLE_TXT)},
)
chunk_node = GraphNode(
    name='Chunk 0',
    type='Chunk',
    metadata={'chunk_index': 0, 'preview': chunks[0].page_content[:80]},
)
graph = Graph(
    nodes=[doc_node, chunk_node],
    edges=[
        GraphEdge(
            source=doc_node.id,
            target=chunk_node.id,
            type='CONTAINS',
            metadata={'similarity': 1.0},
        )
    ],
)

graph_path = STATE_DIR / 'graph_demo.json'
graph_store = get_graph_store(
    store_type='json',
    graph=graph,
    graph_config={'output_file': str(graph_path)},
)
print(f"Persisted graph via {type(graph_store).__name__} -> {graph_path}")
if graph_path.exists():
    print(graph_path.read_text()[:400])
else:
    print('Graph file was not created yet.')


Persisted graph via GraphStoreWrapper -> C:\dev\RAGdoll\examples\demo_state\graph_demo.json
{
  "nodes": [
    {
      "id": "443a09d5-3977-4480-a055-5a6d712154dc",
      "type": "Document",
      "name": "Sample Text File",
      "metadata": {
        "path": "C:\\dev\\RAGdoll\\tests\\test_data\\*"
      }
    },
    {
      "id": "9c274e4a-cf3f-443e-8d80-2bbf353ed186",
      "type": "Chunk",
      "name": "Chunk 0",
      "metadata": {
        "chunk_index": 0,
        "preview": "Lore


## 6. Query the graph with a LangChain retriever
`GraphPersistenceService` can materialize a LangChain-compatible retriever from the last saved graph. In a full ingestion run you would enable `entity_extraction.graph_retriever.enabled`, but here we reuse the toy graph above to show how the **simple** backend answers questions.
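
As a rough mental model, a simple graph retriever can rank nodes by keyword overlap with the question and attach each hit's neighbors. The sketch below assumes that behavior for illustration only. `tokenize` and `retrieve_nodes` are hypothetical, and the real **simple** backend may score nodes differently.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_nodes(
    question: str,
    nodes: List[Dict],
    edges: List[Dict],
    top_k: int = 3,
    include_edges: bool = True,
) -> List[Dict]:
    """Rank nodes by how many question tokens appear in their name or type."""
    query_tokens = tokenize(question)
    scored = []
    for node in nodes:
        overlap = len(query_tokens & tokenize(f"{node['name']} {node['type']}"))
        if overlap:
            scored.append((overlap, node))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties

    hits = []
    for score, node in scored[:top_k]:
        hit = {
            "node_id": node["id"],
            "page_content": f"{node['type']}: {node['name']}",
            "score": score,
        }
        if include_edges:
            # Collect the ids on the other end of every edge touching this node.
            hit["connected_to"] = sorted(
                edge["target"] if edge["source"] == node["id"] else edge["source"]
                for edge in edges
                if node["id"] in (edge["source"], edge["target"])
            )
        hits.append(hit)
    return hits
```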


In [None]:
graph_persistence = GraphPersistenceService(
    output_format='custom_graph_object',
    retriever_backend='simple',
    retriever_config={'top_k': 3, 'include_edges': True},
)

_ = graph_persistence.save(graph)
graph_retriever = graph_persistence.create_retriever()

retriever_question = 'Which graph nodes reference the document and its chunk?'
retriever_hits = graph_retriever.get_relevant_documents(retriever_question)

print(f"Graph retriever returned {len(retriever_hits)} document(s)")
for doc in retriever_hits:
    print(f"- {doc.page_content} => {doc.metadata}")


## 7. Wire up an LLM caller
`get_llm_caller` now instantiates the OpenAI chat model defined in `ragdoll/config/default_config.yaml` (defaults to `gpt-4o-mini`). Make sure `OPENAI_API_KEY` is available before running the next cell.


In [None]:
import os

if not os.getenv('OPENAI_API_KEY'):
    raise EnvironmentError('Set OPENAI_API_KEY before calling the real OpenAI demo cell.')

openai_llm_caller = get_llm_caller(app_config=app_config)
prompt = 'Pretend you read the txt sample and summarize it in one sentence.'
llm_reply = call_llm_sync(openai_llm_caller, prompt)
print('OpenAI response:', llm_reply)


OpenAI response: Sure! Please provide the text sample you'd like me to summarize.


## 8. Run the ingestion pipeline (async)
`IngestionPipeline` stitches together the loader, chunker, embeddings, vector store, and optional graph/entity stages. We disable entity extraction to keep the run lightweight and await the coroutine directly inside the notebook.
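
The effect of `batch_size` can be sketched as an async loop that embeds and stores chunks a few at a time. Everything below (`batched`, `embed_and_store`, `ingest`) is a hypothetical stand-in that only illustrates the batching idea, not the pipeline's internals.

```python
import asyncio
from typing import Dict, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> List[Sequence[str]]:
    """Split items into consecutive groups of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


async def embed_and_store(batch: Sequence[str]) -> int:
    """Stand-in for an embed-then-write step; returns how many chunks it handled."""
    await asyncio.sleep(0)  # yield control, as real I/O would
    return len(batch)


async def ingest(chunks: Sequence[str], batch_size: int = 2) -> Dict[str, int]:
    """Process chunks batch by batch and report simple statistics."""
    batches = batched(chunks, batch_size)
    stored = 0
    for batch in batches:
        stored += await embed_and_store(batch)
    return {"batches": len(batches), "chunks_stored": stored}
```

Inside a notebook the coroutine can be awaited directly (`await ingest(...)`), while a plain script would wrap the call in `asyncio.run(...)`.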


In [None]:
pipeline_store_dir = reset_subdir('chroma_pipeline_demo')
pipeline_vector_config = VectorStoreConfig(
    enabled=True,
    store_type='chroma',
    params={
        'collection_name': 'ragdoll_pipeline_demo',
        'persist_directory': str(pipeline_store_dir),
    },
)
pipeline_vector_store = vector_store_from_config(
    pipeline_vector_config,
    embedding=fake_embeddings,
)

pipeline = IngestionPipeline(
    app_config=app_config,
    content_extraction_service=DocumentLoaderService(
        app_config=app_config,
        use_cache=False,
        collect_metrics=False,
    ),
    embedding_model=fake_embeddings,
    vector_store=pipeline_vector_store,
    options=IngestionOptions(
        batch_size=2,
        extract_entities=False,
        skip_graph_store=True,
        chunking_options={'chunk_size': 300, 'chunk_overlap': 60, 'splitter_type': 'recursive'},
    ),
)

pipeline_stats = await pipeline.ingest([str(SAMPLE_TXT)])
pipeline_stats


## 9. Use the Ragdoll orchestrator
Finally, plug the fake embeddings/vector store plus the real OpenAI LLM caller into `ragdoll.Ragdoll` so you can see how ingestion and `query` behave from the package's public API.


In [None]:
import os

if not os.getenv('OPENAI_API_KEY'):
    raise EnvironmentError('Set OPENAI_API_KEY before running the orchestrator demo.')

rag_store_dir = reset_subdir('chroma_ragdoll_demo')
rag_vector_config = VectorStoreConfig(
    enabled=True,
    store_type='chroma',
    params={
        'collection_name': 'ragdoll_orchestrator_demo',
        'persist_directory': str(rag_store_dir),
    },
)
rag_vector_store = vector_store_from_config(
    rag_vector_config,
    embedding=fake_embeddings,
)

rag_llm_caller = get_llm_caller(app_config=app_config)

rag = Ragdoll(
    app_config=app_config,
    ingestion_service=DocumentLoaderService(
        app_config=app_config,
        use_cache=False,
        collect_metrics=False,
    ),
    embedding_model=fake_embeddings,
    vector_store=rag_vector_store,
    llm_caller=rag_llm_caller,
)

question = 'What does the sample document discuss?'
ingested = rag.ingest_data([str(SAMPLE_TXT)])
print(f"Ragdoll indexed {len(ingested)} LangChain document(s).")
rag_response = rag.query(question)
print('LLM answer:', rag_response['answer'])
for idx, doc in enumerate(rag_response['documents'], start=1):
    snippet = doc.page_content[:140].replace('\n', ' ')
    print(f"Doc {idx} (source={doc.metadata.get('source')}): {snippet}")

graph_docs = (
    graph_retriever.get_relevant_documents(question)
    if 'graph_retriever' in globals()
    else []
)
if graph_docs:
    print('\nGraph retriever context:')
    for doc in graph_docs:
        print(f"- {doc.metadata.get('node_type')} {doc.metadata.get('node_id')}: {doc.page_content}")

    vector_context = '\n\n'.join(
        f"Doc {idx}: {doc.page_content[:200]}" for idx, doc in enumerate(rag_response['documents'], start=1)
    ) or 'No vector hits available.'
    graph_context = '\n'.join(
        f"Node {doc.metadata.get('node_id')}: {doc.page_content} (neighbors={doc.metadata.get('connected_to')})"
        for doc in graph_docs
    )
    hybrid_prompt = (
        "You answer using both vector chunks and graph nodes.\n"
        f"Vector context:\n{vector_context}\n\n"
        f"Graph context:\n{graph_context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    hybrid_answer = call_llm_sync(rag_llm_caller, hybrid_prompt)
    print('\nHybrid graph + vector answer:', hybrid_answer)
else:
    print('\nGraph retriever not initialized yet; run Section 6 to build it before rerunning this cell.')


---
Feel free to duplicate this notebook and swap inputs (PDFs, DOCX, loaders, vector stores, etc.) to explore other combinations covered throughout `docs/`.
