# Extract text from PowerPoint, Word, and Excel files

Transform office documents into searchable, analyzable text data.

**What's in this recipe:**
- Extract text from PPTX, DOCX, and XLSX files
- Split documents by headings, paragraphs, or custom limits
- Preserve document structure and metadata for analysis


## Problem

You have office documents (presentations, reports, spreadsheets) that contain valuable text data. You need to extract this text to analyze content, search across documents, or feed into AI models.

Manual extraction means opening each file, copying text, and losing structural information like headings and page boundaries. You need an automated way to process hundreds or thousands of office files while preserving their organization.


## Solution

You extract text from office documents using Pixeltable's document type with Microsoft's MarkItDown library. This converts PowerPoint, Word, and Excel files to structured text automatically.

You use `DocumentSplitter` to split documents by headings, paragraphs, or token limits. Each splitting strategy is defined as a view in which every row is one chunk of a document, together with its metadata.

### Setup


In [None]:
%pip install -qU pixeltable 'markitdown[pptx,docx,xlsx]'

In [None]:
import pixeltable as pxt
from pixeltable.iterators.document import DocumentSplitter

### Load office documents


In [None]:
# Create a fresh directory (drop existing if present)
pxt.drop_dir('office_docs', force=True)
pxt.create_dir('office_docs')

Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory 'office_docs'.


<pixeltable.catalog.dir.Dir at 0x146c24c10>

In [None]:
# Create table for office documents
docs = pxt.create_table('office_docs.documents', {'doc': pxt.Document})

Created table 'documents'.


In [5]:
# Sample PowerPoint from Pixeltable repo
# Replace with your own PPTX, DOCX, or XLSX files
sample_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/calpy.pptx'

docs.insert([{'doc': sample_url}])

Inserting rows into `documents`: 1 rows [00:00, 57.40 rows/s]
Inserted 1 row with 0 errors.


1 row inserted, 2 values computed.
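
To load many files at once, you can build the list of rows from a local folder before calling `insert`. The following is a minimal sketch, not part of the original recipe: the helper name `collect_office_rows` and the folder path are hypothetical, and the suffix set simply mirrors the MarkItDown extras installed in the setup step.

```python
from pathlib import Path

# Suffixes matching the MarkItDown extras installed in Setup (an assumption
# of this sketch; adjust to the formats you actually need).
OFFICE_SUFFIXES = {'.pptx', '.docx', '.xlsx'}


def collect_office_rows(root: str) -> list[dict[str, str]]:
    """Return one {'doc': path} row per office file found under `root`."""
    paths = sorted(
        p for p in Path(root).rglob('*')
        if p.is_file() and p.suffix.lower() in OFFICE_SUFFIXES
    )
    return [{'doc': str(p)} for p in paths]


# Hypothetical usage with the `docs` table created above:
# docs.insert(collect_office_rows('/path/to/reports'))
```

Each row has the same shape as the single-URL insert above, so the views defined later in this recipe pick up the new documents without changes.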

### Extract full document text

You create a view with `DocumentSplitter` to extract text. Setting `separators=''` extracts the full document without splitting.


In [6]:
# Create a view to extract full document text
full_text = pxt.create_view(
    'office_docs.full_text',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='',  # No splitting - extract full document
    )
)


Inserting rows into `full_text`: 1 rows [00:00, 196.50 rows/s]


In [7]:
# Preview extracted text
full_text.select(full_text.doc, full_text.text).head(1)

doc,text
/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx,"November 6 2025 Open-Source Data Infrastructure for Multimodal AI Marcel Kornacker Notes: About me Co-founder & CTO, Pixeltable UC Berkeley: PhD in Database Systems (advisor: Joe Hellerstein) Google (2003-2010): Tech lead for F1 database, worked on scalable data infrastructure Cloudera: Co-creator of Apache Parquet Created Apache Impala (first database to use LLVM for runtime code generation) ‹#› Notes: The problem with AI development today ‹#› Notes: ""I want to make a searchable collection ...... tic Propagation ================================================================================ ‹#› Notes: Your one stop shop for developing AI-based data products Complete - capture all the data you need, doesn't limit what you do with the data Store of record - don't need separate place [ ] - express any transformation or other application logic → Complete - real production is multi user → Complete - real AI use cases require captures all the data types → Complete - augment it ‹#› Notes:"


### Split documents by headings

You split documents by headings to preserve their logical structure. Each section under a heading becomes a separate chunk.


In [8]:
# Create view that splits by headings
by_heading = pxt.create_view(
    'office_docs.by_heading',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='heading',
        metadata='heading',  # Preserve heading structure
    )
)


Inserting rows into `by_heading`: 87 rows [00:00, 10359.54 rows/s]


In [9]:
# View chunks with their headings
by_heading.select(by_heading.heading, by_heading.text).head(5)

heading,text
{},November 6 2025
"{""h1"": ""Open-Source Data Infrastructure for Multimodal AI""}",Open-Source Data Infrastructure for Multimodal AI Marcel Kornacker
"{""h1"": ""Open-Source Data Infrastructure for Multimodal AI"", ""h3"": ""Notes:""}",Notes:
"{""h1"": ""About me""}","About me Co-founder & CTO, Pixeltable UC Berkeley: PhD in Database Systems (advisor: Joe Hellerstein) Google (2003-2010): Tech lead for F1 database, worked on scalable data infrastructure Cloudera: Co-creator of Apache Parquet Created Apache Impala (first database to use LLVM for runtime code generation) ‹#›"
"{""h1"": ""About me"", ""h3"": ""Notes:""}",Notes:


### Split by token limit for AI models

You split documents by token count when feeding chunks to AI models, which typically limit input size. The `overlap` parameter makes adjacent chunks share some text, so context is not lost at chunk boundaries.


In [10]:
# Create view with token-based splitting
by_tokens = pxt.create_view(
    'office_docs.by_tokens',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='heading,token_limit',  # Split by heading first, then by tokens
        limit=512,  # Maximum tokens per chunk
        overlap=50,  # Overlap between chunks to preserve context
        metadata='heading',
    )
)


Inserting rows into `by_tokens`: 2369 rows [00:00, 9212.05 rows/s]


In [11]:
# Preview chunks with token limits
by_tokens.select(by_tokens.doc, by_tokens.heading, by_tokens.text).head(3)

doc,heading,text
/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx,{},November 6 2025
/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx,{},6 2025
/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx,{},6 2025


### Search across documents

You search across all document chunks using standard Pixeltable queries.


In [12]:
# Find chunks containing specific keywords
by_tokens.where(by_tokens.text.contains('Python')).select(by_tokens.doc, by_tokens.text).head(3)

doc,text
/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx,"Storage 🗄️ Orchestration ⚙️ What you get: Videos in S3, embeddings in Pinecone, metadata in Postgres… Data loaded into memory, exported to file formats File formats that don't support media data Manual tracking of what lives where What you miss: Transactions: Models fail halfway → data stays inconsistent Concurrency: Multiple users → can't work on same data simultaneously Persistence: Work happens in memory → doesn't map to traditional database schemas OLTP capabilities: Built for batch → ca ...... g tools together Cron jobs and Python scripts for every step Manually handling rate limits, retries, chasing API errors Wild goose chase when requirements change What you miss: Dependency tracking: Transforms happen in scripts → hard to trace what depends on what Low latency/high throughput: Hard to parallelize external API calls → poor performance Failure handling: Something fails somewhere → rerun Operational integrity: Inconsistent models for indexing and querying → contaminated index ‹#›"
/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx,"Storage 🗄️ Orchestration ⚙️ What you get: Videos in S3, embeddings in Pinecone, metadata in Postgres… Data loaded into memory, exported to file formats File formats that don't support media data Manual tracking of what lives where What you miss: Transactions: Models fail halfway → data stays inconsistent Concurrency: Multiple users → can't work on same data simultaneously Persistence: Work happens in memory → doesn't map to traditional database schemas OLTP capabilities: Built for batch → ca ...... g tools together Cron jobs and Python scripts for every step Manually handling rate limits, retries, chasing API errors Wild goose chase when requirements change What you miss: Dependency tracking: Transforms happen in scripts → hard to trace what depends on what Low latency/high throughput: Hard to parallelize external API calls → poor performance Failure handling: Something fails somewhere → rerun Operational integrity: Inconsistent models for indexing and querying → contaminated index ‹#›"
/Users/pjlb/.pixeltable/file_cache/8140cdce326a47cd98fe484d6fb1fabe_0_a6ed56e20a649393988cdc8d8ccc90207e77d323369e6cd389edc9d755c92b95.pptx,"Storage 🗄️ Orchestration ⚙️ What you get: Videos in S3, embeddings in Pinecone, metadata in Postgres… Data loaded into memory, exported to file formats File formats that don't support media data Manual tracking of what lives where What you miss: Transactions: Models fail halfway → data stays inconsistent Concurrency: Multiple users → can't work on same data simultaneously Persistence: Work happens in memory → doesn't map to traditional database schemas OLTP capabilities: Built for batch → ca ...... g tools together Cron jobs and Python scripts for every step Manually handling rate limits, retries, chasing API errors Wild goose chase when requirements change What you miss: Dependency tracking: Transforms happen in scripts → hard to trace what depends on what Low latency/high throughput: Hard to parallelize external API calls → poor performance Failure handling: Something fails somewhere → rerun Operational integrity: Inconsistent models for indexing and querying → contaminated index ‹#›"


## Explanation

**Supported formats:**
- PowerPoint: `.pptx`, `.ppt`
- Word: `.docx`, `.doc`
- Excel: `.xlsx`, `.xls`

**Separator options:**
- `heading` - Split by document headings (preserves structure)
- `paragraph` - Split by paragraphs
- `sentence` - Split by sentences
- `token_limit` - Split by token count (requires `limit` parameter)
- `char_limit` - Split by character count (requires `limit` parameter)
- Multiple separators work together: `'heading,token_limit'` splits by heading first, then ensures that no chunk exceeds the token limit
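
The way two separators compose can be illustrated in plain Python. This is only a conceptual sketch, not Pixeltable's implementation: it assumes Markdown-style `#` headings, counts whitespace-separated words as a stand-in for tokens, and all function names are hypothetical.

```python
import re


def split_by_heading(markdown: str) -> list[str]:
    """Start a new section at every line that begins with a '#' heading."""
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        if re.match(r'#{1,6}\s', line) and current:
            sections.append('\n'.join(current))
            current = []
        current.append(line)
    if current:
        sections.append('\n'.join(current))
    return sections


def cap_words(section: str, limit: int) -> list[str]:
    """Cut a section into pieces of at most `limit` words (words stand in for tokens)."""
    words = section.split()
    return [' '.join(words[i:i + limit]) for i in range(0, len(words), limit)]


def split_heading_then_limit(markdown: str, limit: int) -> list[str]:
    """First split by heading, then enforce the size limit inside each section."""
    return [
        chunk
        for section in split_by_heading(markdown)
        for chunk in cap_words(section, limit)
    ]


sample = "# Intro\none two three four five\n# Details\nsix seven"
chunks = split_heading_then_limit(sample, limit=4)
# chunks == ['# Intro one two', 'three four five', '# Details six seven']
```

No chunk crosses a heading boundary, and a long section is cut further only when it exceeds the limit.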

**Metadata fields:**
- `heading` - Hierarchical heading structure (e.g., `{'h1': 'Introduction', 'h2': 'Overview'}`)
- `title` - Document title
- `sourceline` - Source line number (HTML and Markdown documents)
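
Once chunks are retrieved, the `heading` value can be turned into a readable breadcrumb. The sketch below assumes the value arrives as a Python dict keyed `h1`, `h2`, and so on, as in the output shown earlier; the helper name `heading_path` is hypothetical.

```python
def heading_path(heading: dict[str, str], sep: str = ' > ') -> str:
    """Join heading levels in order (h1, h2, ...), skipping levels that are absent."""
    levels = sorted(heading, key=lambda key: int(key[1:]))
    return sep.join(heading[key] for key in levels)


print(heading_path({'h1': 'About me', 'h3': 'Notes:'}))  # About me > Notes:
print(repr(heading_path({})))  # ''
```

A breadcrumb like this is handy as a prefix when you pass a chunk to a model, since the chunk text alone may not say which section it came from.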

**Token overlap:**
The `overlap` parameter makes adjacent chunks share some text at their boundaries. A passage that is cut at a chunk boundary then still appears with part of its surrounding context in a neighboring chunk, which helps when feeding chunks to AI models.
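
The effect of overlap can be shown with a generic sliding window. This is a simplified sketch under stated assumptions: it treats each list item as one token, the function name `chunk_with_overlap` is hypothetical, and Pixeltable's exact windowing may differ.

```python
def chunk_with_overlap(tokens: list[str], limit: int, overlap: int) -> list[list[str]]:
    """Slide a window of `limit` tokens, stepping by `limit - overlap` each time."""
    if not 0 <= overlap < limit:
        raise ValueError('overlap must be non-negative and smaller than limit')
    step = limit - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + limit])
        if start + limit >= len(tokens):
            break  # the window already reached the end of the input
    return chunks


tokens = 'a b c d e f g h i j'.split()
chunks = chunk_with_overlap(tokens, limit=4, overlap=1)
# chunks == [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]
```

With `overlap=1`, the last token of each chunk is repeated as the first token of the next, so text near a boundary appears in both neighbors. A larger overlap increases shared context at the cost of more chunks.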


## See also

- [Get fast feedback on transformations](https://docs.pixeltable.com/howto/cookbooks/core/dev-iterative-workflow)
- [Pixeltable Document API](https://docs.pixeltable.com/sdk/latest/document)
