# pgEdge Vectorizer

A PostgreSQL extension for asynchronous text chunking and vector embedding generation.

pgEdge Vectorizer automatically chunks text content and generates vector embeddings using background workers. It supports multiple embedding providers (OpenAI, Voyage AI, and Ollama) and provides a simple SQL interface for enabling vectorization on any table.
## Features

- Automatic Chunking: Intelligently splits text into chunks with configurable strategies
- Async Processing: Background workers process embeddings without blocking your application
- Multiple Providers: Support for OpenAI, Voyage AI, and Ollama (local embeddings)
- Configurable: Extensive GUC parameters for fine-tuning behavior
- Batching: Efficient batch processing of embeddings
- Retry Logic: Automatic retry with exponential backoff for failed embeddings
- Monitoring: Built-in views and functions for monitoring queue status
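The retry schedule listed above can be sketched as follows. Only the doubling behavior and the retry cap come from the extension's documented behavior; the base delay of one second is an assumption for illustration:

```python
# Illustrative sketch of exponential backoff: the delay doubles on each
# retry attempt, up to max_retries attempts (default 3). The base delay
# is a hypothetical value, not one taken from the extension.
def backoff_delays(max_retries=3, base_seconds=1.0):
    """Return the wait time (in seconds) before each retry attempt."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]

print(backoff_delays())  # [1.0, 2.0, 4.0]
```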
## Requirements

- PostgreSQL 14 or later
- pgvector extension
- libcurl development files
- API key (for OpenAI or Voyage AI providers; not needed for Ollama)
## Installation

### Install Dependencies

Ubuntu/Debian:

```sh
sudo apt-get install postgresql-server-dev-all libcurl4-openssl-dev
```

macOS (Homebrew):

```sh
brew install postgresql curl
```

Follow the pgvector installation instructions.
### Build and Install

```sh
# Clone the repository
git clone https://github.com/pgEdge/pgedge-vectorizer.git
cd pgedge-vectorizer

# Build
make

# Install (may require sudo)
sudo make install
```

### Configure PostgreSQL

Add to postgresql.conf:

```ini
shared_preload_libraries = 'pgedge_vectorizer'

# Provider configuration
pgedge_vectorizer.provider = 'openai'
pgedge_vectorizer.api_key_file = '/path/to/your/api_key_file'
pgedge_vectorizer.model = 'text-embedding-3-small'

# Worker configuration (optional)
pgedge_vectorizer.num_workers = 2
pgedge_vectorizer.batch_size = 10

# Chunking configuration (optional)
pgedge_vectorizer.default_chunk_size = 400
pgedge_vectorizer.default_chunk_overlap = 50
```

### Set Up the API Key

Create a file containing only your API key:

```sh
echo "your-openai-api-key-here" > ~/.pgedge-vectorizer-llm-api-key
chmod 600 ~/.pgedge-vectorizer-llm-api-key
```

### Restart PostgreSQL

```sh
sudo systemctl restart postgresql
# or
pg_ctl restart
```

## Quick Start

Enable the required extensions:

```sql
CREATE EXTENSION vector;
```
```sql
CREATE EXTENSION pgedge_vectorizer;

-- Create a documents table
CREATE TABLE articles (
    id BIGSERIAL PRIMARY KEY,
    title TEXT,
    content TEXT,
    url TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Enable vectorization on the content column
SELECT pgedge_vectorizer.enable_vectorization(
    source_table := 'articles',
    source_column := 'content',
    chunk_strategy := 'token_based',
    chunk_size := 400,
    chunk_overlap := 50
);

-- Insert a document - it will be automatically chunked and vectorized
INSERT INTO articles (title, content, url)
VALUES (
    'Introduction to PostgreSQL',
    'PostgreSQL is a powerful, open source object-relational database system...',
    'https://example.com/postgres-intro'
);

-- Check queue status
SELECT * FROM pgedge_vectorizer.queue_status;

-- Wait for background workers to process...

-- Query for similar content
SELECT
    a.title,
    c.content,
    c.embedding <=> '[0.1, 0.2, ...]'::vector AS distance
FROM articles a
JOIN articles_chunks c ON a.id = c.source_id
ORDER BY distance
LIMIT 5;
```

## Configuration

All configuration parameters can be set in postgresql.conf or via ALTER SYSTEM.
### Provider Settings

| Parameter | Type | Default | Description |
|---|---|---|---|
| `pgedge_vectorizer.provider` | string | 'openai' | Embedding provider (openai, voyage, ollama) |
| `pgedge_vectorizer.api_key_file` | string | '~/.pgedge-vectorizer-llm-api-key' | Path to API key file |
| `pgedge_vectorizer.api_url` | string | 'https://api.openai.com/v1' | API endpoint URL |
| `pgedge_vectorizer.model` | string | 'text-embedding-3-small' | Embedding model name |
### Worker Settings

| Parameter | Type | Default | Description |
|---|---|---|---|
| `pgedge_vectorizer.num_workers` | integer | 2 | Number of background workers (requires restart) |
| `pgedge_vectorizer.batch_size` | integer | 10 | Batch size for embedding generation |
| `pgedge_vectorizer.max_retries` | integer | 3 | Maximum retry attempts for failed embeddings |
| `pgedge_vectorizer.worker_poll_interval` | integer | 1000 | Worker polling interval in milliseconds |
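As a hedged sketch of the batching behavior described above: a worker collects pending queue items into groups of `batch_size` so each provider API call embeds several chunks at once. The real worker is C code inside the extension; `make_batches` is purely illustrative:

```python
# Group pending queue items into batches of batch_size, as a worker might
# before issuing one embedding API call per batch. Illustrative only.
def make_batches(items, batch_size=10):
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

pending = list(range(25))  # stand-ins for 25 pending queue item IDs
batches = make_batches(pending, batch_size=10)
print([len(b) for b in batches])  # [10, 10, 5]
```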
### Chunking Settings

| Parameter | Type | Default | Description |
|---|---|---|---|
| `pgedge_vectorizer.auto_chunk` | boolean | true | Enable automatic chunking |
| `pgedge_vectorizer.default_chunk_strategy` | string | 'token_based' | Default chunking strategy |
| `pgedge_vectorizer.default_chunk_size` | integer | 400 | Default chunk size in tokens |
| `pgedge_vectorizer.default_chunk_overlap` | integer | 50 | Default chunk overlap in tokens |
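A rough sketch of the token_based defaults above: overlapping windows of `default_chunk_size` tokens, each sharing `default_chunk_overlap` tokens with its predecessor. Real token boundaries depend on the model's tokenizer; this sketch assumes a pre-tokenized list:

```python
# Sliding-window chunking over a token list: each chunk is chunk_size tokens,
# and consecutive chunks overlap by `overlap` tokens (so the window advances
# by chunk_size - overlap). Illustrative, not the extension's implementation.
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = ["tok"] * 1000
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [400, 400, 300]
```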
### Cleanup Settings

| Parameter | Type | Default | Description |
|---|---|---|---|
| `pgedge_vectorizer.auto_cleanup_hours` | integer | 24 | Automatically delete completed queue items older than this many hours. Set to 0 to disable automatic cleanup. |
## Functions

### enable_vectorization()

Enable automatic vectorization for a table column.

```sql
SELECT pgedge_vectorizer.enable_vectorization(
    source_table REGCLASS,
    source_column NAME,
    chunk_strategy TEXT DEFAULT NULL,
    chunk_size INT DEFAULT NULL,
    chunk_overlap INT DEFAULT NULL,
    embedding_dimension INT DEFAULT 1536,
    chunk_table_name TEXT DEFAULT NULL
);
```

### disable_vectorization()

Disable vectorization for a table.

```sql
SELECT pgedge_vectorizer.disable_vectorization(
    source_table REGCLASS,
    drop_chunk_table BOOLEAN DEFAULT FALSE
);
```

### chunk_text()

Manually chunk text.

```sql
SELECT pgedge_vectorizer.chunk_text(
    content TEXT,
    strategy TEXT DEFAULT NULL,
    chunk_size INT DEFAULT NULL,
    overlap INT DEFAULT NULL
);
```

### retry_failed()

Retry failed queue items.

```sql
SELECT pgedge_vectorizer.retry_failed(
    max_age_hours INT DEFAULT 24
);
```

### clear_completed()

Remove old completed items from the queue.

```sql
SELECT pgedge_vectorizer.clear_completed(
    older_than_hours INT DEFAULT 24
);
```

Note: Workers automatically clean up completed items based on `pgedge_vectorizer.auto_cleanup_hours` (default 24 hours). Manual cleanup is only needed if you want to clean up more frequently, or if automatic cleanup is disabled (set to 0).
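The cleanup rule above can be illustrated with a small sketch: a completed item becomes eligible for deletion once it is older than `auto_cleanup_hours`, and a setting of 0 disables cleanup entirely. The function and field names here are hypothetical, not the extension's internals:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sketch of the auto-cleanup eligibility check: completed queue
# items older than auto_cleanup_hours are deleted; 0 disables cleanup.
def is_cleanup_eligible(completed_at, now, auto_cleanup_hours=24):
    if auto_cleanup_hours == 0:
        return False
    return now - completed_at > timedelta(hours=auto_cleanup_hours)

now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)
old = datetime(2025, 1, 1, 11, 0, tzinfo=timezone.utc)  # completed 25 hours ago
print(is_cleanup_eligible(old, now))  # True
```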
### show_config()

Show all configuration settings.

```sql
SELECT * FROM pgedge_vectorizer.show_config();
```

## Monitoring Views

### queue_status

Summary of queue items by table and status.

```sql
SELECT * FROM pgedge_vectorizer.queue_status;
```

### failed_items

Failed queue items with error details.

```sql
SELECT * FROM pgedge_vectorizer.failed_items;
```

### pending_count

Count of pending items.

```sql
SELECT * FROM pgedge_vectorizer.pending_count;
```

## Architecture

- Triggers: Automatically detect changes to configured columns
- Chunking Engine: Splits text into optimal-sized chunks
- Queue Table: Stores pending embedding tasks
- Background Workers: Process queue items asynchronously
- Provider Interface: Abstraction layer for different embedding APIs
- Chunk Tables: Store chunks and their embeddings

```
INSERT/UPDATE → Trigger → Chunk Text → Insert Chunks → Queue Items
                                                           ↓
                                                  Background Worker
                                                           ↓
                                                 Generate Embeddings
                                                           ↓
                                                 Update Chunk Tables
```
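The flow above can be sketched as a minimal in-memory simulation. Every name here is illustrative; the real pipeline is SQL triggers plus C background workers calling an embedding provider:

```python
# Toy simulation of the pipeline: an insert fires a "trigger" that chunks the
# text and enqueues items; a "worker" drains the queue, generates stand-in
# embeddings, and updates the chunk table. All names are illustrative.
queue = []
chunk_table = {}  # chunk_id -> {"content": ..., "embedding": ...}

def on_insert(doc_id, content, chunk_size=20):
    # Trigger + chunking engine: split text and queue each chunk.
    for n, i in enumerate(range(0, len(content), chunk_size)):
        chunk_id = (doc_id, n)
        chunk_table[chunk_id] = {"content": content[i:i + chunk_size], "embedding": None}
        queue.append(chunk_id)

def fake_embed(text):
    return [float(len(text))]  # stand-in for a provider API call

def run_worker():
    # Background worker: process queue items and store embeddings.
    while queue:
        chunk_id = queue.pop(0)
        chunk_table[chunk_id]["embedding"] = fake_embed(chunk_table[chunk_id]["content"])

on_insert(1, "PostgreSQL is a powerful, open source database system.")
run_worker()
print(len(chunk_table), all(c["embedding"] for c in chunk_table.values()))  # 3 True
```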
## Monitoring

```sql
-- Overall status
SELECT * FROM pgedge_vectorizer.queue_status;

-- Pending items
SELECT * FROM pgedge_vectorizer.pending_count;

-- Failed items
SELECT * FROM pgedge_vectorizer.failed_items;
```

Inspect the current configuration:

```sql
SELECT * FROM pgedge_vectorizer.show_config();
```

Examine failed queue items in detail:

```sql
SELECT id, chunk_table, status, attempts, error_message, created_at
FROM pgedge_vectorizer.queue
WHERE status = 'failed'
ORDER BY created_at DESC
LIMIT 10;
```

## Troubleshooting

### Workers not processing

Check PostgreSQL logs:

```sh
tail -f /var/log/postgresql/postgresql-*.log
```

Verify shared_preload_libraries:

```sql
SHOW shared_preload_libraries;
```

### Embeddings failing

- Check that the API key file exists and is readable
- Verify the provider configuration
- Check the queue for errors:

```sql
SELECT * FROM pgedge_vectorizer.failed_items;
```

### Slow processing

- Increase the number of workers:

```sql
ALTER SYSTEM SET pgedge_vectorizer.num_workers = 4;
-- Restart required
```

- Increase the batch size:

```sql
ALTER SYSTEM SET pgedge_vectorizer.batch_size = 20;
SELECT pg_reload_conf();
```

## Performance Tuning

- Batch Size: Larger batches (10-50) are more efficient for API calls
- Worker Count: Match to your API rate limits and server capacity
- Chunk Size: 200-500 tokens is optimal for most use cases
- Overlap: 10-20% overlap provides good context without too much duplication
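A quick back-of-the-envelope check of the overlap advice above: with chunk size S and overlap O, a document of T tokens yields roughly ceil((T − O) / (S − O)) chunks, so a 10-20% overlap adds only modest duplication. A small sketch (illustrative, not part of the extension):

```python
import math

# Approximate chunk count for a document of total_tokens, given a sliding
# window of chunk_size tokens advancing by (chunk_size - overlap) each step.
def approx_chunk_count(total_tokens, chunk_size, overlap):
    return math.ceil((total_tokens - overlap) / (chunk_size - overlap))

print(approx_chunk_count(10_000, 400, 50))  # 29
```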
## Development

### Building from Source

```sh
make clean
make
make install
```

### Running Tests

The extension includes a comprehensive test suite with 9 test files covering all functionality:

```sh
make installcheck
```

Tests cover:

- Extension installation and configuration
- Text chunking with various strategies
- Queue management and monitoring views
- Vectorization enable/disable
- Multi-column vectorization
- Maintenance functions (reprocess, recreate)
- Edge cases (empty, NULL, whitespace handling)
- Worker configuration

All tests must pass on PostgreSQL 14-18 before merging changes.

### Debugging

Enable debug logging:

```sql
SET client_min_messages = DEBUG1;
```

## Roadmap

- Support for Voyage AI embedding models
- Support for Ollama (local models)
- Semantic chunking strategy
- Markdown-aware chunking
- Sentence-based chunking
- Integration with tiktoken for accurate token counting
- Cost tracking and quotas
- Multi-database support
- Custom embedding dimensions
- Webhook notifications
## License

PostgreSQL License (see LICENSE.md)

## Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Quick checklist:

- All tests must pass (`make installcheck`)
- Code follows PostgreSQL conventions
- New features include tests
- Documentation is updated

## Support

For issues and questions:
- GitHub: https://github.com/pgEdge/pgedge-vectorizer
- Issues: https://github.com/pgEdge/pgedge-vectorizer/issues
- Documentation: https://github.com/pgEdge/pgedge-vectorizer/blob/main/docs/index.md
Developed by pgEdge, Inc.
Portions copyright (c) 2025, pgEdge, Inc.