# Split documents into chunks for RAG

Break PDFs and documents into searchable chunks for retrieval-augmented generation (RAG) pipelines.


## Problem

You have PDF documents or text files that you want to use for retrieval-augmented generation (RAG). Before you can search them, you need to:

1. Split documents into smaller chunks
2. Generate embeddings for each chunk
3. Store everything in a searchable index

| Document | Size | Chunks needed |
|----------|------|---------------|
| annual_report.pdf | 50 pages | ~100 chunks |
| user_manual.pdf | 20 pages | ~40 chunks |
| research_paper.pdf | 10 pages | ~20 chunks |
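
The estimates in the table work out to roughly two chunks per page. As a back-of-the-envelope sketch, assuming that ratio holds (the real count depends on the separators and token limit you choose), the arithmetic looks like this. The helper name `estimate_chunks` is hypothetical and not part of Pixeltable:

```python
import math

def estimate_chunks(pages: int, chunks_per_page: float = 2.0) -> int:
    """Rough chunk count, assuming a fixed number of chunks per page."""
    return math.ceil(pages * chunks_per_page)

# Page counts taken from the table above.
page_counts = {
    'annual_report.pdf': 50,
    'user_manual.pdf': 20,
    'research_paper.pdf': 10,
}
estimates = {name: estimate_chunks(pages) for name, pages in page_counts.items()}
print(estimates)
```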


## Solution

**What's in this recipe:**
- Split PDFs into paragraphs or sentences
- Control chunk size with token limits
- Add embeddings for semantic search

You create a view with a `DocumentSplitter` iterator that automatically breaks documents into chunks, then add an embedding index to that view for semantic search.


### Setup


In [1]:
%pip install -qU pixeltable sentence-transformers tiktoken


Note: you may need to restart the kernel to use updated packages.


In [2]:
import pixeltable as pxt
from pixeltable.iterators import DocumentSplitter
from pixeltable.functions.huggingface import sentence_transformer


### Load documents


In [3]:
# Create a fresh directory
pxt.drop_dir('rag_demo', force=True)
pxt.create_dir('rag_demo')


Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory 'rag_demo'.


<pixeltable.catalog.dir.Dir at 0x17f33f1d0>

In [4]:
# Create table for documents
docs = pxt.create_table('rag_demo.documents', {'document': pxt.Document})


Created table 'documents'.


In [5]:
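# Insert a sample PDF. If this URL returns a 404, as in the output below,
# substitute any reachable PDF URL or a local file path.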
docs.insert([
    {'document': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/Argus-Market-Digest-June-2024.pdf'}
])


Error: Failed to download https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/Argus-Market-Digest-June-2024.pdf: HTTP Error 404: Not Found

### Split into chunks

Create a view that splits each document into paragraphs with a token limit:


In [None]:
# Create a view that splits documents into chunks
chunks = pxt.create_view(
    'rag_demo.chunks',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.document,
        separators='paragraph,token_limit',  # Split by paragraph, then enforce the token limit
        limit=300  # Max 300 tokens per chunk
    )
)


In [None]:
# View the chunks
chunks.select(chunks.text).head(5)
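
To make the idea of paragraph splitting with a token cap concrete, here is a minimal pure-Python sketch. It is not how `DocumentSplitter` is implemented: it assumes paragraphs are separated by blank lines and treats whitespace-separated words as tokens, whereas a real tokenizer counts differently. The function name `split_paragraphs` is hypothetical:

```python
import re

def split_paragraphs(text: str, limit: int) -> list[str]:
    """Split on blank lines, then cap each piece at `limit` whitespace tokens."""
    pieces = []
    for para in re.split(r'\n\s*\n', text.strip()):
        tokens = para.split()
        # A paragraph longer than the limit is cut into consecutive windows.
        for start in range(0, len(tokens), limit):
            pieces.append(' '.join(tokens[start:start + limit]))
    return pieces

sample = (
    "First paragraph with five words.\n\n"
    "Second paragraph is a bit longer than the first one."
)
pieces = split_paragraphs(sample, limit=6)
for piece in pieces:
    print(repr(piece))
```

The short paragraph stays whole, while the ten-word paragraph is cut into a six-word piece and a four-word piece.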


### Add semantic search

Create an embedding index on the chunks for similarity search:


In [None]:
# Add embedding index for semantic search
chunks.add_embedding_index(
    column='text',
    string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2')
)


### Search your documents

Use similarity search to find relevant chunks:


In [None]:
# Search for relevant chunks
query = "market trends"
sim = chunks.text.similarity(query)

results = (
    chunks
    .order_by(sim, asc=False)
    .select(chunks.text, score=sim)
    .limit(3)
)
results.collect()
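
The query above ranks chunks by similarity to the query string and keeps the top three. The sketch below illustrates that ranking step in plain Python. It assumes cosine similarity as the metric and uses word counts as a toy stand-in for a real embedding model, so the scores are not comparable to those from `all-MiniLM-L6-v2`. The names `embed`, `cosine`, and `top_k` are hypothetical:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query: str, texts: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Score every text against the query and return the k best, highest first."""
    q = embed(query)
    scored = [(cosine(q, embed(t)), t) for t in texts]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

texts = [
    "market trends point upward",
    "the weather is mild today",
    "bond market outlook",
]
ranked = top_k("market trends", texts, k=2)
for score, text in ranked:
    print(f"{score:.3f}  {text}")
```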


## Explanation

**Separator options:**

| Separator | Description |
|-----------|-------------|
| `paragraph` | Split on paragraph breaks |
| `sentence` | Split on sentence boundaries |
| `heading` | Split on document headings |
| `page` | Split on page breaks |
| `token_limit` | Split at token count only |

You can combine separators in a single comma-separated string, for example `separators='paragraph,token_limit'`.
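
As an illustration of the comma-separated format, the following sketch parses such a string and checks each name against the options in the table above. It is not Pixeltable's own parser, and the set of valid names is copied from this recipe's table only. `parse_separators` is a hypothetical helper:

```python
# Separator names as listed in the table above (an assumption of this sketch).
VALID_SEPARATORS = {'paragraph', 'sentence', 'heading', 'page', 'token_limit'}

def parse_separators(spec: str) -> list[str]:
    """Split a comma-separated separator string and reject unknown names."""
    names = [part.strip() for part in spec.split(',') if part.strip()]
    unknown = [name for name in names if name not in VALID_SEPARATORS]
    if unknown:
        raise ValueError(f'unknown separators: {unknown}')
    return names

print(parse_separators('paragraph,token_limit'))
```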

**Chunk sizing:**

- `limit`: Maximum tokens per chunk; applies when `token_limit` is among the separators
- `overlap`: Number of tokens shared between consecutive chunks
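
To see how a limit and an overlap interact, here is a small sliding-window sketch over a token list. It assumes each window starts `limit - overlap` tokens after the previous one. How Pixeltable applies overlap internally may differ, and `window_tokens` is a hypothetical name:

```python
def window_tokens(tokens: list, limit: int, overlap: int = 0) -> list[list]:
    """Cut `tokens` into windows of at most `limit` items that share `overlap` items."""
    if limit <= 0:
        raise ValueError('limit must be positive')
    if not 0 <= overlap < limit:
        raise ValueError('overlap must be in [0, limit)')
    step = limit - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + limit])
        # Stop once a window reaches the end, to avoid a trailing fragment
        # that only repeats overlapped tokens.
        if start + limit >= len(tokens):
            break
    return windows

tokens = list(range(10))
windows = window_tokens(tokens, limit=4, overlap=1)
for window in windows:
    print(window)
```

With `limit=4` and `overlap=1`, each window repeats the last token of the previous one, which helps keep a sentence that straddles a boundary retrievable from either chunk.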

**New documents are processed automatically:**

When you insert new documents into the `documents` table, the view generates their chunks and the index computes their embeddings automatically, with no extra code.


## See also

- [Iterators documentation](https://docs.pixeltable.com/datastore/iterators)
- [RAG demo notebook](https://docs.pixeltable.com/notebooks/use-cases/rag-demo)
