# Split documents into chunks for RAG

Break PDFs and other documents into searchable chunks for retrieval-augmented generation (RAG) pipelines.

## Problem

You have PDF documents or text files that you want to use for retrieval-augmented generation (RAG). Before you can search them, you need to:

1. Split documents into smaller chunks
1. Generate embeddings for each chunk
1. Store everything in a searchable index

| Document | Size | Chunks needed |
|----------|------|---------------|
| annual_report.pdf | 50 pages | ~100 chunks |
| user_manual.pdf | 20 pages | ~40 chunks |
| research_paper.pdf | 10 pages | ~20 chunks |

## Solution

**What's in this recipe:**

- Split PDFs into sentence-level chunks
- Control chunk size with token limits
- Add embeddings for semantic search

You create a view with a `document_splitter` iterator that automatically breaks documents into chunks. Then you add an embedding index for semantic search.

### Setup

In [None]:
%pip install -qU pixeltable sentence-transformers spacy tiktoken
!python -m spacy download en_core_web_sm -q

In [None]:
import pixeltable as pxt
from pixeltable.functions.document import document_splitter
from pixeltable.functions.huggingface import sentence_transformer

### Load documents

In [12]:
# Create a fresh directory
pxt.drop_dir('rag_demo', force=True)
pxt.create_dir('rag_demo')

Created directory 'rag_demo'.


<pixeltable.catalog.dir.Dir at 0x3d8e31710>

In [13]:
# Create table for documents
docs = pxt.create_table('rag_demo.documents', {'document': pxt.Document})

Created table 'documents'.


In [14]:
# Insert a sample PDF
docs.insert([
    {'document': 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Argus-Market-Digest-June-2024.pdf'}
])

Inserting rows into `documents`: 1 rows [00:00, 775.86 rows/s]
Inserted 1 row with 0 errors.


1 row inserted, 2 values computed.

### Split into chunks

Create a view that splits each document on sentence boundaries and caps every chunk at 300 tokens:

In [None]:
# Create a view that splits documents into chunks
chunks = pxt.create_view(
    'rag_demo.chunks',
    docs,
    iterator=document_splitter(
        docs.document,
        separators='sentence,token_limit',  # Split by sentence with token limit
        limit=300  # Max 300 tokens per chunk
    )
)

Inserting rows into `chunks`: 217 rows [00:00, 42111.88 rows/s]


In [16]:
# View the chunks
chunks.select(chunks.text).head(5)

| text |
|------|
| MARKET DIGEST |
| - 1 - |
| FRIDAY, JUNE 21, 2024 |
| JUNE 20, DJIA: 39,134.76 UP 299.90 |
| Independent Equity Research Since 1934 ARGUS |


### Add semantic search

Create an embedding index on the chunks for similarity search:

In [17]:
# Add embedding index for semantic search
chunks.add_embedding_index(
    column='text',
    string_embed=sentence_transformer.using(model_id='all-MiniLM-L6-v2')
)

### Search your documents

Use similarity search to find relevant chunks:

In [18]:
# Search for relevant chunks
query = "market trends"
sim = chunks.text.similarity(string=query)

results = (
    chunks
    .order_by(sim, asc=False)
    .select(chunks.text, score=sim)
    .limit(3)
)
results.collect()

| text | score |
|------|-------|
| MARKET REVIEW: | 0.558 |
| This is the Market Digest for Friday, June 21, 2024, with analysis of the financial markets and comments on Accenture plc. | 0.489 |
| MARKET DIGEST | 0.479 |
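
To build intuition for what the `score` column represents, here is a minimal pure-Python sketch of similarity ranking. It assumes cosine similarity as the metric, which is a common choice for sentence embeddings; the recipe itself does not state which metric the index uses. The three-dimensional vectors and the helper names `cosine_similarity` and `top_k` are made up for illustration and are not part of Pixeltable.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Score every chunk against the query and keep the k highest scores."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values, not real model output)
toy_vectors = {
    "market review": [0.9, 0.1, 0.0],
    "weather report": [0.0, 0.2, 0.9],
    "equity research": [0.7, 0.6, 0.1],
}
query_vector = [1.0, 0.2, 0.0]

for text, score in top_k(query_vector, toy_vectors, k=2):
    print(f"{score:.3f}  {text}")
```

The query above follows the same pattern of scoring, sorting in descending order, and limiting; the embedding index exists so that this does not require comparing the query against every chunk by hand.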


## Explanation

**Separator options:**

| Separator | Description |
|-----------|-------------|
| `sentence` | Split on sentence boundaries |
| `heading` | Split on document headings |
| `page` | Split on page breaks |
| `token_limit` | Split wherever a chunk would exceed `limit` tokens |

You can combine separators: `separators='sentence,token_limit'`
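
As a rough illustration of what combining `sentence` and `token_limit` means, the sketch below splits on sentence boundaries first and then cuts any sentence that is still too long. It is not Pixeltable's implementation: it assumes a naive regex sentence splitter in place of spaCy and whitespace-separated words in place of real tokenizer tokens, and the function names are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_by_token_limit(sentence: str, limit: int) -> list[str]:
    """Cut one sentence into pieces of at most `limit` whitespace-separated tokens."""
    tokens = sentence.split()
    return [" ".join(tokens[i:i + limit]) for i in range(0, len(tokens), limit)]


def sentence_token_chunks(text: str, limit: int) -> list[str]:
    """Split on sentence boundaries first, then enforce the token limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks: list[str] = []
    for sentence in split_sentences(text):
        chunks.extend(split_by_token_limit(sentence, limit))
    return chunks


sample = "Markets rose on Friday. The Dow gained nearly three hundred points in heavy trading."
for chunk in sentence_token_chunks(sample, limit=6):
    print(chunk)
```

In this sketch a short sentence stays a chunk of its own, which is consistent with the short chunks in the output above, such as `MARKET DIGEST`.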

**Chunk sizing:**

- `limit`: Maximum tokens per chunk (default: 500)
- `overlap`: Tokens to overlap between chunks (default: 0)
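
The interaction between `limit` and `overlap` can be pictured as a sliding window over the token sequence. The sketch below shows the common interpretation, in which each window starts `limit - overlap` tokens after the previous one. How Pixeltable applies `overlap` internally is an assumption here, and `token_windows` is a hypothetical helper that works on whitespace tokens.

```python
def token_windows(tokens: list[str], limit: int, overlap: int = 0) -> list[list[str]]:
    """Return windows of at most `limit` tokens, each sharing `overlap` tokens with the previous one."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    if not 0 <= overlap < limit:
        raise ValueError("overlap must satisfy 0 <= overlap < limit")
    step = limit - overlap  # how far the window advances each time
    windows: list[list[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + limit])
        if start + limit >= len(tokens):
            break  # this window already reached the end of the text
    return windows


tokens = "a b c d e f g h i j".split()
for window in token_windows(tokens, limit=4, overlap=1):
    print(" ".join(window))
```

A nonzero overlap repeats context across chunk boundaries, so a sentence that straddles two chunks is less likely to be lost to retrieval, at the cost of storing and embedding more chunks.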

**New documents are processed automatically:**

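Because `chunks` is a view over `documents` and the embedding index is attached to that view, inserting new rows into `documents` generates their chunks and embeddings without extra code.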

## See also

- [Iterators documentation](https://docs.pixeltable.com/platform/iterators)
- [RAG demo notebook](https://docs.pixeltable.com/howto/use-cases/rag-demo)