# Extract text from PowerPoint, Word, and Excel files

Transform office documents into searchable, analyzable text data.

**What's in this recipe:**
- Extract text from PPTX, DOCX, and XLSX files
- Split documents by headings, paragraphs, or custom limits
- Preserve document structure and metadata for analysis


## Problem

You have office documents—presentations, reports, spreadsheets—that contain valuable text data. You need to extract this text to analyze content, search across documents, or feed into AI models.

Manual extraction means opening each file, copying text, and losing structural information like headings and page boundaries. You need an automated way to process hundreds or thousands of office files while preserving their organization.


## Solution

You extract text from office documents using Pixeltable's document type with Microsoft's MarkItDown library. This converts PowerPoint, Word, and Excel files to structured text automatically.

You can iterate on document processing before adding transformations to your table. Use `.select()` with `.collect()` to preview results on sample documents—nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()`. Once you're satisfied, use `DocumentSplitter` to split documents by headings, paragraphs, or token limits.

### Setup


In [None]:
# For testing with local changes, run from repo root: uv pip install -e .
%pip install -qU pixeltable "markitdown[pptx,docx,xlsx]"

In [None]:
import pixeltable as pxt
from pixeltable.iterators.document import DocumentSplitter


In [None]:
# Sample file from Pixeltable repo (Excel spreadsheet)
# Replace with your own PPTX, DOCX, or XLSX files
sample_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/rag-demo/Q-A-Rag.xlsx'


### Load office documents


In [None]:
# Create a fresh directory (drop existing if present)
pxt.drop_dir('office_docs', force=True)
pxt.create_dir('office_docs')


In [None]:
# Create table for office documents
docs = pxt.create_table('office_docs.documents', {'doc': pxt.Document})


In [None]:
# Insert office files
docs.insert([{'doc': sample_url}])
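
To load many files at once, one option is to build the list of row dictionaries from a folder. The sketch below is only an illustration: `build_rows`, `OFFICE_EXTENSIONS`, and the folder layout are hypothetical names, not part of Pixeltable. It uses the standard library alone and produces rows in the same `{'doc': ...}` shape as the insert above.

```python
from pathlib import Path

# Assumption: these match the MarkItDown extras installed in Setup; adjust as needed
OFFICE_EXTENSIONS = {'.pptx', '.docx', '.xlsx'}


def build_rows(folder: str) -> list[dict[str, str]]:
    """Return one {'doc': path} row per office file found under `folder`."""
    paths = sorted(
        p for p in Path(folder).rglob('*')
        if p.is_file() and p.suffix.lower() in OFFICE_EXTENSIONS
    )
    return [{'doc': str(p)} for p in paths]
```

Assuming local paths are accepted in the `doc` column, the result could then be passed to `docs.insert(...)` in place of the single-URL list.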

### Extract full document text


#### Test extraction with a query

Create a view that keeps each document as a single chunk, then use `.select()` with `.head(1)` to preview the extracted text from one document.


In [None]:
# Create a view to extract full document text
chunks = pxt.create_view(
    'office_docs.full_text',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='',  # No splitting - extract full document
    )
)


In [None]:
# Preview extracted text from first document
chunks.select(chunks.doc, chunks.text).head(1)


### Split documents by headings


#### Test heading split with a query

Create a view that splits on headings, then use `.select()` with `.collect()` to preview the resulting chunks.


In [None]:
# Create view that splits by headings
chunks = pxt.create_view(
    'office_docs.by_heading',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='heading',
        metadata='heading',  # Preserve heading structure
    )
)


In [None]:
# View chunks with their headings
# Each section under a heading becomes a separate chunk
chunks.select(chunks.doc, chunks.heading, chunks.text).collect()


### Split by token limit for AI models


#### Test token limit with a query

Create a view that splits on headings and a token limit, then use `.select()` with `.head(3)` to preview the first few token-sized chunks.


In [None]:
# Create view with token-based splitting
chunks = pxt.create_view(
    'office_docs.by_tokens',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='heading,token_limit',  # Split by heading first, then by tokens
        limit=512,  # Maximum tokens per chunk
        overlap=50,  # Overlap between chunks to preserve context
        metadata='heading',
    )
)


In [None]:
# Preview first few chunks with token limits
# Each chunk is 512 tokens or fewer
chunks.select(chunks.doc, chunks.heading, chunks.text).head(3)


### Search across all documents


In [None]:
# Find chunks containing specific keywords
# `chunks` now refers to the token-limited view, so this searches its chunks across all office documents
chunks.where(chunks.text.contains('test')).select(chunks.doc, chunks.text).head(3)


## Explanation

**Supported formats:**
- PowerPoint: `.pptx`, `.ppt`
- Word: `.docx`, `.doc`
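- Excel: `.xlsx`, `.xls`

The Setup cell installs only the `pptx`, `docx`, and `xlsx` MarkItDown extras, so verify that legacy `.ppt`, `.doc`, and `.xls` files convert in your environment before relying on them.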

**Separator options:**
- `heading` - Split by document headings (preserves structure)
- `paragraph` - Split by paragraphs
- `sentence` - Split by sentences
- `token_limit` - Split by token count (requires `limit` parameter)
- `char_limit` - Split by character count (requires `limit` parameter)
- Combine separators with commas: `'heading,token_limit'` splits by heading first, then splits any section that still exceeds the token limit
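
The combined `'heading,token_limit'` behavior can be pictured as a two-stage split. The following is a simplified, standalone sketch of that idea, not Pixeltable's implementation. It assumes Markdown-style `#` headings and approximates tokens with whitespace-separated words, so a real tokenizer would count differently. The helper name `split_by_heading_then_limit` is hypothetical.

```python
import re


def split_by_heading_then_limit(markdown: str, limit: int) -> list[str]:
    """Split at Markdown headings, then cap each section at `limit` words."""
    # Stage 1: start a new section at every line that begins with 1-6 '#' characters
    sections = re.split(r'(?m)^(?=#{1,6} )', markdown)
    chunks = []
    for section in sections:
        words = section.split()
        # Stage 2: break oversized sections into pieces of at most `limit` words
        for start in range(0, len(words), limit):
            chunks.append(' '.join(words[start:start + limit]))
    return chunks
```

Because the heading split runs first, a chunk never spans two sections in this sketch; the limit only subdivides within a section.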

**Metadata fields:**
- `heading` - Hierarchical heading structure (e.g., `{'h1': 'Introduction', 'h2': 'Overview'}`)
- `title` - Document title
- `sourceline` - Source line number (HTML and Markdown documents)
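
To make the shape of the `heading` metadata concrete, here is a small standalone sketch that tracks the current heading at each level while walking through Markdown text. It is an illustration under assumptions (Markdown `#` headings, a hypothetical `heading_context` helper), not the library's code.

```python
import re


def heading_context(markdown: str) -> list[tuple[dict[str, str], str]]:
    """Pair each non-heading line with the headings that enclose it."""
    current: dict[int, str] = {}
    result = []
    for line in markdown.splitlines():
        match = re.match(r'^(#{1,6})\s+(.*)$', line)
        if match:
            level = len(match.group(1))
            # A new heading replaces its own level and clears all deeper levels
            current = {k: v for k, v in current.items() if k < level}
            current[level] = match.group(2).strip()
        elif line.strip():
            headings = {f'h{k}': v for k, v in sorted(current.items())}
            result.append((headings, line.strip()))
    return result
```

Keeping the full hierarchy with each chunk lets a later query filter or group chunks by section without re-reading the source document.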

**Token overlap:**
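The `overlap` parameter makes adjacent chunks share text at their boundaries. It does not stop a sentence from being cut at a chunk boundary, but the repeated text means the cut sentence is more likely to appear whole in one of the two chunks, which helps when feeding chunks to AI models.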
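
A rough way to picture overlap is a sliding window whose start advances by `limit - overlap` tokens. The sketch below assumes whitespace-separated words as stand-ins for tokens and uses a hypothetical `window_chunks` helper. Pixeltable's actual tokenization and boundary handling may differ.

```python
def window_chunks(tokens: list[str], limit: int, overlap: int) -> list[list[str]]:
    """Split `tokens` into windows of at most `limit`, sharing `overlap` tokens."""
    if not 0 <= overlap < limit:
        raise ValueError('overlap must be non-negative and smaller than limit')
    step = limit - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + limit])
        if start + limit >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks
```

With `limit=4` and `overlap=1`, each window begins with the final token of the previous one, so text near a boundary appears in both chunks.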


## See also

- [Get fast feedback on transformations](./dev-iterative-workflow.ipynb)
- [Pixeltable Document API](https://docs.pixeltable.com/api/pixeltable/#pixeltable.Document)
