# Extract text from PowerPoint, Word, and Excel files

Transform office documents into searchable, analyzable text data.

**What's in this recipe:**
- Extract text from PPTX, DOCX, and XLSX files
- Split documents by headings, paragraphs, or custom limits
- Preserve document structure and metadata for analysis


## Problem

You have office documents (presentations, reports, spreadsheets) that contain valuable text data. You need to extract this text to analyze content, search across documents, or feed into AI models.

Manual extraction means opening each file, copying text, and losing structural information like headings and page boundaries. You need an automated way to process hundreds or thousands of office files while preserving their organization.


## Solution

You extract text from office documents using Pixeltable's document type with Microsoft's MarkItDown library. This converts PowerPoint, Word, and Excel files to structured text automatically.

You can iterate on document processing before adding transformations to your table. Use `.select()` with `.collect()` to preview results on sample documents; nothing is stored in your table. If you want to collect only the first few rows, use `.head(n)` instead of `.collect()`. Once you're satisfied, use `DocumentSplitter` to split documents by headings, paragraphs, or token limits.

### Setup


In [None]:
# For testing with local changes, run from repo root: uv pip install -e .
# Quote the extras so shells such as zsh do not treat the brackets as a glob pattern
%pip install -qU pixeltable 'markitdown[pptx,docx,xlsx]'

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pixeltable as pxt
from pixeltable.iterators.document import DocumentSplitter


In [None]:
# Download a sample PowerPoint file for demonstration
import urllib.request
import tempfile
import os

# Sample test file from Pixeltable repo
url = 'https://github.com/pixeltable/pixeltable/raw/main/tests/data/documents/test_presentation.pptx'
sample_pptx = os.path.join(tempfile.gettempdir(), 'test_presentation.pptx')
urllib.request.urlretrieve(url, sample_pptx)
print(f"Downloaded sample presentation: {sample_pptx}")




### Load office documents


In [4]:
# Create a fresh directory (drop existing if present)
pxt.drop_dir('office_docs', force=True)
pxt.create_dir('office_docs')


Creating a Pixeltable instance at: /Users/pjlb/.pixeltable
Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata
Created directory 'office_docs'.


<pixeltable.catalog.dir.Dir at 0x169f2f490>

In [5]:
# Create table for office documents
docs = pxt.create_table('office_docs.documents', {'doc': pxt.Document})


Created table 'documents'.


In [None]:
# Insert office files
# Using the sample presentation downloaded above
# Replace with your own PPTX, DOCX, or XLSX files
docs.insert([
    {'doc': sample_pptx},
])


### Extract full document text


#### Test extraction with a query

Create a view that extracts the full text of each document, then use `.select()` with `.head(1)` to preview the extracted text for one document.


In [None]:
# Create a view to extract full document text
chunks = pxt.create_view(
    'office_docs.full_text',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='',  # No splitting - extract full document
    )
)


Inserting rows into `full_text`: 1 rows [00:00, 287.69 rows/s]


In [None]:
# Preview extracted text from first document
chunks.select(chunks.doc, chunks.text).head(1)


doc,text
/Users/pierre/pixeltable/docs/resources/calpy.pptx,"November 6 2025 Open-Source Data Infrastructure for Multimodal AI Marcel Kornacker Notes: About me Co-founder & CTO, Pixeltable UC Berkeley: PhD in Database Systems (advisor: Joe Hellerstein) Google (2003-2010): Tech lead for F1 database, worked on scalable data infrastructure Cloudera: Co-creator of Apache Parquet Created Apache Impala (first database to use LLVM for runtime code generation) ‹#› Notes: The problem with AI development today ‹#› Notes: ""I want to make a searchable collection ...... tic Propagation ================================================================================ ‹#› Notes: Your one stop shop for developing AI-based data products Complete - capture all the data you need, doesn't limit what you do with the data Store of record - don't need separate place [ ] - express any transformation or other application logic → Complete - real production is multi user → Complete - real AI use cases require captures all the data types → Complete - augment it ‹#› Notes:"


### Split documents by headings


#### Test heading split with a query

Create a view that splits by headings, then use `.select()` with `.collect()` to preview how the documents were split.


In [None]:
# Create view that splits by headings
chunks = pxt.create_view(
    'office_docs.by_heading',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='heading',
        metadata='heading',  # Preserve heading structure
    )
)


Inserting rows into `by_heading`: 87 rows [00:00, 13241.81 rows/s]


In [None]:
# View chunks with their headings
# Each section under a heading becomes a separate chunk
chunks.select(chunks.doc, chunks.heading, chunks.text).collect()


doc,heading,text
/Users/pierre/pixeltable/docs/resources/calpy.pptx,{},November 6 2025
/Users/pierre/pixeltable/docs/resources/calpy.pptx,"{""h1"": ""Open-Source Data Infrastructure for Multimodal AI""}",Open-Source Data Infrastructure for Multimodal AI Marcel Kornacker
/Users/pierre/pixeltable/docs/resources/calpy.pptx,"{""h1"": ""Open-Source Data Infrastructure for Multimodal AI"", ""h3"": ""Notes:""}",Notes:
/Users/pierre/pixeltable/docs/resources/calpy.pptx,"{""h1"": ""About me""}","About me Co-founder & CTO, Pixeltable UC Berkeley: PhD in Database Systems (advisor: Joe Hellerstein) Google (2003-2010): Tech lead for F1 database, worked on scalable data infrastructure Cloudera: Co-creator of Apache Parquet Created Apache Impala (first database to use LLVM for runtime code generation) ‹#›"
/Users/pierre/pixeltable/docs/resources/calpy.pptx,"{""h1"": ""About me"", ""h3"": ""Notes:""}",Notes:
/Users/pierre/pixeltable/docs/resources/calpy.pptx,"{""h1"": ""The problem with AI development today""}",The problem with AI development today ‹#›
/Users/pierre/pixeltable/docs/resources/calpy.pptx,"{""h1"": ""The problem with AI development today"", ""h3"": ""Notes:""}",Notes:
/Users/pierre/pixeltable/docs/resources/calpy.pptx,"{""h1"": ""\u201cI want to make a searchable collection of videos\u201d""}","""I want to make a searchable collection of videos"" Example stack: Upload videos to S3 for storage Extract frames with FFmpeg/OpenCV Send frames to OpenAI Vision API, handle retries Parse responses, validate JSON Generate embeddings from responses Store embeddings in Pinecone/LanceDB (for search) + results in PostgreSQL (for queries) JOIN data across S3, Pinecone/LanceDB, and PostgreSQL with foreign keys and correlation IDs Handle failures across all of this... somehow? → 1000+ lines of glue code and you are still trying to figure out after that how to version, get observability, lineage, scalability, parallelization... ‹#›"
/Users/pierre/pixeltable/docs/resources/calpy.pptx,"{""h1"": ""\u201cI want to make a searchable collection of videos\u201d"", ""h3"": ""Notes:""}",Notes:
/Users/pierre/pixeltable/docs/resources/calpy.pptx,"{""h1"": ""AI development today is""}",AI development today is mostly plumbing & pipelines ~ 80-90% of effort ‹#›


### Split by token limit for AI models


#### Test token limit with a query

Create a view that splits by headings and then by token limit, then use `.select()` with `.head(3)` to preview the first token-sized chunks.


In [None]:
# Create view with token-based splitting
chunks = pxt.create_view(
    'office_docs.by_tokens',
    docs,
    iterator=DocumentSplitter.create(
        document=docs.doc,
        separators='heading,token_limit',  # Split by heading first, then by tokens
        limit=512,  # Maximum tokens per chunk
        overlap=50,  # Overlap between chunks to preserve context
        metadata='heading',
    )
)


Inserting rows into `by_tokens`: 2369 rows [00:00, 26816.76 rows/s]


In [None]:
# Preview first few chunks with token limits
# Each chunk contains at most 512 tokens
chunks.select(chunks.doc, chunks.heading, chunks.text).head(3)


doc,heading,text
/Users/pierre/pixeltable/docs/resources/calpy.pptx,{},November 6 2025
/Users/pierre/pixeltable/docs/resources/calpy.pptx,{},6 2025
/Users/pierre/pixeltable/docs/resources/calpy.pptx,{},6 2025


### Search across all documents


In [None]:
# Find chunks containing specific keywords
# This searches across all office documents
chunks.where(chunks.text.contains('test')).select(chunks.doc, chunks.text).head(3)


doc,text
/Users/pierre/pixeltable/docs/resources/calpy.pptx,"About me Co-founder & CTO, Pixeltable UC Berkeley: PhD in Database Systems (advisor: Joe Hellerstein) Google (2003-2010): Tech lead for F1 database, worked on scalable data infrastructure Cloudera: Co-creator of Apache Parquet Created Apache Impala (first database to use LLVM for runtime code generation) ‹#›"
/Users/pierre/pixeltable/docs/resources/calpy.pptx,"Google (2003-2010): Tech lead for F1 database, worked on scalable data infrastructure Cloudera: Co-creator of Apache Parquet Created Apache Impala (first database to use LLVM for runtime code generation) ‹#›"
/Users/pierre/pixeltable/docs/resources/calpy.pptx,"(2003-2010): Tech lead for F1 database, worked on scalable data infrastructure Cloudera: Co-creator of Apache Parquet Created Apache Impala (first database to use LLVM for runtime code generation) ‹#›"


## Explanation

**Supported formats:**
- PowerPoint: `.pptx`, `.ppt`
- Word: `.docx`, `.doc`
- Excel: `.xlsx`, `.xls`
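
To process a whole folder rather than a single sample file, you can gather paths by extension and turn them into rows for `docs.insert()`. The following is a minimal sketch, not part of Pixeltable: `collect_office_rows` and `OFFICE_SUFFIXES` are hypothetical names, and the suffix set is deliberately limited to the three formats this recipe demonstrates.

```python
from pathlib import Path

# Assumption: restrict to the three formats demonstrated in this recipe.
OFFICE_SUFFIXES = {'.pptx', '.docx', '.xlsx'}


def collect_office_rows(root):
    """Return insert-ready rows ({'doc': path}) for office files under `root`."""
    paths = sorted(
        p for p in Path(root).rglob('*')
        if p.is_file()
        and p.suffix.lower() in OFFICE_SUFFIXES
        # Skip the "~$..." owner/lock files Office creates next to open documents.
        and not p.name.startswith('~$')
    )
    return [{'doc': str(p)} for p in paths]
```

The returned list has the same shape as the rows passed to `docs.insert([...])` earlier in this recipe, so it could be handed to that call directly, for example `docs.insert(collect_office_rows('/path/to/your/files'))`.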

**Separator options:**
- `heading` - Split by document headings (preserves structure)
- `paragraph` - Split by paragraphs
- `sentence` - Split by sentences
- `token_limit` - Split by token count (requires `limit` parameter)
- `char_limit` - Split by character count (requires `limit` parameter)
- Multiple separators work together: `'heading,token_limit'` splits by heading first, then ensures no chunk exceeds token limit
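
The way two separators compose can be illustrated without Pixeltable. The sketch below is a simplified stand-in, not `DocumentSplitter`'s implementation: it assumes Markdown-style `#` headings and counts whitespace-separated words instead of real tokens. All function names are hypothetical.

```python
import re


def split_by_heading(markdown_text):
    """Split Markdown into sections, starting a new section at each heading line."""
    sections, current = [], []
    for line in markdown_text.splitlines():
        if re.match(r'#{1,6}\s', line) and current:
            sections.append('\n'.join(current))
            current = []
        current.append(line)
    if current:
        sections.append('\n'.join(current))
    return sections


def enforce_limit(section, limit):
    """Cut one section into pieces of at most `limit` whitespace-separated words."""
    words = section.split()
    return [' '.join(words[i:i + limit]) for i in range(0, len(words), limit)]


def split_heading_then_limit(markdown_text, limit):
    """First split by heading, then make sure no piece exceeds the limit."""
    chunks = []
    for section in split_by_heading(markdown_text):
        chunks.extend(enforce_limit(section, limit))
    return chunks


sample = "# Intro\none two three four five\n# Details\nsix seven"
chunks = split_heading_then_limit(sample, limit=4)
# chunks == ['# Intro one two', 'three four five', '# Details six seven']
```

No chunk crosses a heading boundary: the long first section is cut in two, while the short second section stays whole.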

**Metadata fields:**
- `heading` - Hierarchical heading structure (e.g., `{'h1': 'Introduction', 'h2': 'Overview'}`)
- `title` - Document title
- `sourceline` - Source line number (HTML and Markdown documents)
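
The `heading` values in the output above (for example `{"h1": "About me", "h3": "Notes:"}` followed by a fresh `{"h1": ...}` on the next slide) are consistent with tracking the headings currently in scope. The sketch below shows one way such a dictionary could be maintained over Markdown lines. It is an illustration inferred from that output, not Pixeltable's code, and `heading_path` is a hypothetical helper.

```python
import re


def heading_path(markdown_lines):
    """Yield (heading_dict, line) pairs, tracking the headings currently in scope."""
    current = {}
    for line in markdown_lines:
        m = re.match(r'(#{1,6})\s+(.*)', line)
        if m:
            level = len(m.group(1))
            # A new heading replaces its own level and clears all deeper levels.
            current = {k: v for k, v in current.items() if int(k[1:]) < level}
            current[f'h{level}'] = m.group(2).strip()
        yield dict(current), line


lines = ['# About me', 'Co-founder & CTO', '### Notes:', '# The problem with AI development today']
for meta, line in heading_path(lines):
    print(meta, '|', line)
```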

**Token overlap:**
The `overlap` parameter makes consecutive chunks share text at their boundaries. A sentence cut at a chunk boundary then still appears with some of its surrounding context in the neighboring chunk, which helps when feeding chunks to AI models.
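
To make the effect of overlap concrete, here is a small sliding-window sketch. It reflects one common interpretation of `limit` and `overlap` (each window starts `limit - overlap` tokens after the previous one). `DocumentSplitter` may differ in its details. The helper `chunk_with_overlap` is hypothetical, and plain integers stand in for tokens.

```python
def chunk_with_overlap(tokens, limit, overlap):
    """Split `tokens` into windows of at most `limit` items sharing `overlap` items."""
    if limit <= 0:
        raise ValueError('limit must be positive')
    if not 0 <= overlap < limit:
        raise ValueError('overlap must satisfy 0 <= overlap < limit')
    step = limit - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + limit])
        # Stop once a window reaches the end, to avoid a trailing duplicate.
        if start + limit >= len(tokens):
            break
    return chunks


tokens = list(range(10))
chunks = chunk_with_overlap(tokens, limit=4, overlap=1)
# chunks == [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each window repeats the last token of the previous one, so content near a boundary is visible from both sides. A larger overlap increases shared context at the cost of more, and more redundant, chunks.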


## See also

- [Get fast feedback on transformations](./dev-iterative-workflow.ipynb)
- [Pixeltable Document API](https://docs.pixeltable.com/api/pixeltable/#pixeltable.Document)
