High-performance document conversion engine for AI/LLM embeddings
Transmutation is a pure Rust document conversion engine designed to transform various file formats into optimized text and image outputs suitable for LLM processing and vector embeddings. Built as a core component of the HiveLLM Vectorizer ecosystem, Transmutation is a high-performance alternative to Docling, offering superior speed, lower memory usage, and zero runtime dependencies.
- Pure Rust implementation - No Python dependencies, maximum performance
- Convert documents to LLM-friendly formats (Markdown, Images, JSON)
- Optimize output for embedding generation (text and multimodal)
- Maintain maximum quality with minimum size
- Competitor to Docling - 98x faster, more efficient, and easier to deploy
- Seamless integration with HiveLLM Vectorizer
Transmutation vs Docling (Fast Mode - Pure Rust):
Metric | Paper 1 (15 pages) | Paper 2 (25 pages) | Average |
---|---|---|---|
Similarity | 76.36% | 84.44% | 80.40% |
Speed | 108x faster | 88x faster | 98x faster |
Time (Docling) | 31.36s | 40.56s | ~36s |
Time (Transmutation) | 0.29s | 0.46s | ~0.37s |
- ✅ 80% similarity - Acceptable for most use cases
- ✅ 98x faster - Near-instant conversion
- ✅ Pure Rust - No Python/ML dependencies
- ✅ Low memory - 50 MB footprint
- 🎯 Goal: 95% similarity (Precision Mode with C++ FFI - in development)
See BENCHMARK_COMPARISON.md for detailed results.
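The speed figures in the table above follow from dividing the Docling time by the Transmutation time for each paper. A minimal sketch of that arithmetic, where the helper names `speedup` and `mean` are illustrative and not part of the Transmutation API:

```rust
/// Speedup of a candidate over a baseline, given wall-clock times in seconds.
fn speedup(baseline_secs: f64, candidate_secs: f64) -> f64 {
    baseline_secs / candidate_secs
}

/// Arithmetic mean of a non-empty slice of values.
fn mean(values: &[f64]) -> f64 {
    values.iter().sum::<f64>() / values.len() as f64
}
```

With the table's timings, `speedup(31.36, 0.29)` is roughly 108 and `speedup(40.56, 0.46)` is roughly 88; the mean of the two is roughly 98, which is where the "98x faster" headline comes from.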
Input Format | Output Options | Status | Modes |
---|---|---|---|
PDF | Image per page, Markdown (per page/full), JSON | ✅ Production | Fast, Precision, FFI |
DOCX | Image per page, Markdown (per page/full), JSON | ✅ Production | Pure Rust + LibreOffice |
XLSX | Markdown tables, CSV, JSON | ✅ Production | Pure Rust (148 pg/s) |
PPTX | Image per slide, Markdown per slide | ✅ Production | Pure Rust (1639 pg/s) |
HTML | Markdown, JSON | ✅ Production | Pure Rust (2110 pg/s) |
XML | Markdown, JSON | ✅ Production | Pure Rust (2353 pg/s) |
TXT | Markdown, JSON | ✅ Production | Pure Rust (2805 pg/s) |
CSV/TSV | Markdown tables, JSON | ✅ Production | Pure Rust (2647 pg/s) |
RTF | Markdown, JSON | Beta | Pure Rust (simplified parser) |
ODT | Markdown, JSON | Beta | Pure Rust (ZIP + XML) |
MD | Markdown (normalized), JSON | Planned | - |
Input Format | Output Options | OCR Engine | Status |
---|---|---|---|
JPG/JPEG | Markdown (OCR), JSON | Tesseract | ✅ Production |
PNG | Markdown (OCR), JSON | Tesseract | ✅ Production |
TIFF/TIF | Markdown (OCR), JSON | Tesseract | ✅ Production |
BMP | Markdown (OCR), JSON | Tesseract | ✅ Production |
GIF | Markdown (OCR), JSON | Tesseract | ✅ Production |
WEBP | Markdown (OCR), JSON | Tesseract | ✅ Production |
Input Format | Output Options | Engine | Status |
---|---|---|---|
MP3 | Markdown (transcription), JSON | Whisper | ✅ Production |
WAV | Markdown (transcription), JSON | Whisper | ✅ Production |
M4A | Markdown (transcription), JSON | Whisper | ✅ Production |
FLAC | Markdown (transcription), JSON | Whisper | ✅ Production |
OGG | Markdown (transcription), JSON | Whisper | ✅ Production |
MP4 | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
AVI | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
MKV | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
MOV | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
WEBM | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
Input Format | Output Options | Status | Performance |
---|---|---|---|
ZIP | File listing, statistics, Markdown index, JSON | ✅ Production | Pure Rust (1864 pg/s) |
TAR/GZ | Extract and process contents | Planned | - |
7Z | Extract and process contents | Planned | - |
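The format tables above imply a dispatch step that maps an input file to a converter family. The following is a rough sketch of extension-based classification covering the formats listed as Production or Beta. The `InputKind` enum and `classify` function are hypothetical names for illustration only; the crate's real detection logic (the tree lists `utils/file_detect.rs`) may work differently, for example by inspecting file contents.

```rust
use std::path::Path;

/// Converter families, mirroring the four format tables (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum InputKind {
    Document,
    Image,
    Audio,
    Video,
    Archive,
}

/// Classify a path by its final extension, case-insensitively.
/// Returns `None` for unknown or not-yet-supported formats.
fn classify(path: &str) -> Option<InputKind> {
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    let kind = match ext.as_str() {
        "pdf" | "docx" | "xlsx" | "pptx" | "html" | "xml" | "txt" | "csv" | "tsv" | "rtf"
        | "odt" => InputKind::Document,
        "jpg" | "jpeg" | "png" | "tiff" | "tif" | "bmp" | "gif" | "webp" => InputKind::Image,
        "mp3" | "wav" | "m4a" | "flac" | "ogg" => InputKind::Audio,
        "mp4" | "avi" | "mkv" | "mov" | "webm" => InputKind::Video,
        "zip" => InputKind::Archive,
        // TAR/GZ, 7Z and MD are listed as planned, so they are not matched here.
        _ => return None,
    };
    Some(kind)
}
```

Matching on the extension alone is cheap but easy to fool with a renamed file, so a production detector would likely also check magic bytes.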
transmutation/
├── src/
│   ├── lib.rs  # Main library entry
│   ├── converters/
│   │   ├── mod.rs  # Converter registry
│   │   ├── pdf.rs  # PDF conversion (pure Rust)
│   │   ├── docx.rs  # DOCX conversion
│   │   ├── pptx.rs  # PPTX conversion
│   │   ├── xlsx.rs  # XLSX conversion
│   │   ├── html.rs  # HTML conversion
│   │   ├── xml.rs  # XML conversion
│   │   ├── image.rs  # Image OCR (Tesseract)
│   │   ├── audio.rs  # Audio transcription (pure Rust ASR)
│   │   ├── video.rs  # Video processing (FFmpeg)
│   │   └── archive.rs  # Archive extraction
│   ├── output/
│   │   ├── mod.rs  # Output format handlers
│   │   ├── markdown.rs  # Markdown generation
│   │   ├── image.rs  # Image generation/optimization
│   │   ├── json.rs  # JSON serialization
│   │   └── csv.rs  # CSV generation
│   ├── engines/
│   │   ├── mod.rs  # Engine abstractions
│   │   ├── pdf_parser.rs  # Pure Rust PDF parsing
│   │   ├── tesseract.rs  # Tesseract OCR wrapper
│   │   ├── audio_asr.rs  # Pure Rust audio transcription
│   │   └── ffmpeg.rs  # FFmpeg wrapper
│   ├── optimization/
│   │   ├── mod.rs  # Optimization strategies
│   │   ├── text.rs  # Text compression/cleanup
│   │   ├── image.rs  # Image compression
│   │   └── quality.rs  # Quality metrics
│   ├── integration/
│   │   ├── mod.rs  # Integration layer
│   │   ├── vectorizer.rs  # Vectorizer integration
│   │   ├── langchain.rs  # LangChain integration
│   │   ├── llamaindex.rs  # LlamaIndex integration
│   │   └── haystack.rs  # Haystack integration
│   ├── utils/
│   │   ├── mod.rs  # Utilities
│   │   ├── file_detect.rs  # File type detection
│   │   ├── metadata.rs  # Metadata extraction
│   │   └── cache.rs  # Conversion cache
│   └── error.rs  # Error types
├── src/bin/
│   └── transmutation.rs  # CLI application (included in main crate)
├── bindings/
│   ├── python/  # Python bindings (PyO3) - Future
│   ├── node/  # Node.js bindings (Neon) - Future
│   └── wasm/  # WebAssembly bindings - Future
├── examples/
│   ├── basic_conversion.rs
│   ├── batch_processing.rs
│   ├── vectorizer_integration.rs
│   └── custom_pipeline.rs
├── benches/
│   └── conversion_benchmarks.rs
├── tests/
│   ├── integration/
│   └── fixtures/
├── Cargo.toml
├── README.md
├── LICENSE
├── ROADMAP.md
├── ARCHITECTURE.md
└── CONTRIBUTING.md
Windows MSI Installer:
# Download from releases or build:
.\build-msi.ps1
msiexec /i target\wix\transmutation-0.1.1-x86_64.msi
See docs/MSI_BUILD.md for details.
Cargo:
# Add to Cargo.toml
[dependencies]
transmutation = "0.1"
# Core features (always enabled, no flags needed):
# - PDF, HTML, XML, ZIP, TXT, CSV, TSV, RTF, ODT
# With Office formats (default)
[dependencies.transmutation]
version = "0.1"
features = ["office"] # DOCX, XLSX, PPTX
# With optional features (requires external tools)
features = ["office", "pdf-to-image", "tesseract", "audio"]
Transmutation is mostly pure Rust, with core features requiring ZERO dependencies:
Feature | Requires | Status |
---|---|---|
Core (PDF, HTML, XML, ZIP, TXT, CSV, TSV, RTF, ODT) | ✅ None | Always enabled |
office (DOCX, XLSX, PPTX - Text) | ✅ None | Pure Rust (default) |
pdf-to-image | poppler-utils (pdftoppm) | Optional |
office + images | LibreOffice | Optional |
image-ocr | Tesseract | Optional |
audio | Whisper | Optional |
video | FFmpeg + Whisper | Optional |
archives-extended (TAR, GZ, 7Z) | | Optional |
During compilation, build.rs automatically detects missing dependencies and prints installation instructions:
cargo build --features "pdf-to-image"
# If pdftoppm is missing, you'll see:
⚠️ Optional External Dependencies Missing
❌ pdftoppm (poppler-utils): PDF → Image conversion
Install: sudo apt-get install poppler-utils
Quick install (all dependencies):
./install/install-deps-linux.sh
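As an illustration of the build-time check described above, a build script could look for an external tool by scanning the directories in `PATH`. This is a sketch under assumptions, not the project's actual build.rs: the function names `find_in_path` and `missing_tool_hint` are hypothetical, and the lookup ignores Windows details such as the `.exe` suffix.

```rust
use std::env;
use std::path::PathBuf;

/// Search each directory of a PATH-style string for a file named `tool`.
fn find_in_path(tool: &str, path_var: &str) -> Option<PathBuf> {
    env::split_paths(path_var)
        .map(|dir| dir.join(tool))
        .find(|candidate| candidate.is_file())
}

/// Build a one-line hint for a missing tool (message wording is illustrative).
fn missing_tool_hint(tool: &str, package: &str) -> String {
    format!("{tool} not found on PATH. Install: sudo apt-get install {package}")
}
```

In a real build script the PATH string would come from `std::env::var("PATH")`, and the hint could be surfaced with Cargo's `println!("cargo:warning=...")` directive so that it shows up in the build output without failing the build.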
Installation scripts are provided for all platforms:
- Linux: ./install/install-deps-linux.sh
- macOS: ./install/install-deps-macos.sh
- Windows: .\install\install-deps-windows.ps1 (or .bat)
See install/README.md for detailed instructions.
use transmutation::{Converter, OutputFormat, ConversionOptions};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize converter
    let converter = Converter::new()?;
    // Convert PDF to Markdown
    let result = converter
        .convert("document.pdf")
        .to(OutputFormat::Markdown)
        .with_options(ConversionOptions {
            split_pages: true,
            optimize_for_llm: true,
            ..Default::default()
        })
        .execute()
        .await?;
    // Save output
    result.save("output/document.md").await?;
    println!("Converted {} pages", result.page_count());
    Ok(())
}
use transmutation::{Converter, BatchProcessor, OutputFormat};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let converter = Converter::new()?;
    let batch = BatchProcessor::new(converter);
    // Process multiple files
    let results = batch
        .add_files(&["doc1.pdf", "doc2.docx", "doc3.pptx"])
        .to(OutputFormat::Markdown)
        .parallel(4)
        .execute()
        .await?;
    for (file, result) in results {
        println!("{}: {} -> {}", file, result.input_size(), result.output_size());
    }
    Ok(())
}
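To make the idea behind `.parallel(4)` concrete, here is a standalone sketch of bounded parallel batch processing. It deliberately swaps in `std::thread::scope` for the Tokio-based concurrency the project reports using, so that it runs with the standard library alone. `convert_stub` and `process_batch` are hypothetical names, and the stub returns the file name length in place of a real conversion result.

```rust
use std::thread;

/// Stand-in for a real conversion: returns the name length as a fake output size.
fn convert_stub(name: &str) -> usize {
    name.len()
}

/// Split `files` into at most `workers` contiguous chunks and process each
/// chunk on its own scoped thread. Results keep the input order.
fn process_batch(files: &[&str], workers: usize) -> Vec<(String, usize)> {
    let workers = workers.max(1);
    // Ceiling division; at least 1 so `chunks` never receives zero.
    let chunk_len = ((files.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        let handles: Vec<_> = files
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|f| (f.to_string(), convert_stub(f)))
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        // Join in spawn order so the output order matches the input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Static chunking like this is simple but can leave threads idle when file sizes vary a lot; a work queue or an async task pool balances load better for mixed batches.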
use transmutation::{Converter, OutputFormat};
use vectorizer::VectorizerClient;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let converter = Converter::new()?;
    let vectorizer = VectorizerClient::new("http://localhost:15002").await?;
    // Convert and embed in one pipeline
    let result = converter
        .convert("document.pdf")
        .to(OutputFormat::EmbeddingReady)
        .pipe_to(&vectorizer)
        .execute()
        .await?;
    println!("Embedded {} chunks", result.chunk_count());
    Ok(())
}
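The pipeline above reports a chunk count, which implies the converted text is split before embedding. How Transmutation actually chunks is not described here, so the following is only a minimal sketch that uses whitespace-separated words as a crude stand-in for tokens. `chunk_by_words` is a hypothetical helper, not a crate function.

```rust
/// Split text into chunks of at most `max_words` whitespace-separated words.
/// Words are only a rough proxy for model tokens.
fn chunk_by_words(text: &str, max_words: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    words
        .chunks(max_words.max(1)) // guard against a zero chunk size
        .map(|chunk| chunk.join(" "))
        .collect()
}
```

For example, `chunk_by_words("a b c d e", 2)` yields three chunks: `"a b"`, `"c d"` and `"e"`. A real embedding pipeline would typically count tokens with the target model's tokenizer and prefer to break at sentence or paragraph boundaries.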
pub struct ConversionOptions {
    // Output control
    pub split_pages: bool,            // Split output by pages
    pub optimize_for_llm: bool,       // Optimize for LLM processing
    pub max_chunk_size: usize,        // Maximum chunk size (tokens)
    // Quality settings
    pub image_quality: ImageQuality,  // High, Medium, Low
    pub dpi: u32,                     // DPI for image output (default: 150)
    pub ocr_language: String,         // OCR language (default: "eng")
    // Processing options
    pub preserve_layout: bool,        // Preserve document layout
    pub extract_tables: bool,         // Extract tables separately
    pub extract_images: bool,         // Extract embedded images
    pub include_metadata: bool,       // Include document metadata
    // Optimization
    pub compression_level: u8,        // 0-9 for output compression
    pub remove_headers_footers: bool,
    pub remove_watermarks: bool,
    pub normalize_whitespace: bool,
}
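The quick-start example relies on `..Default::default()`, which only works if the options struct implements `Default`. The sketch below mirrors the struct locally to show that pattern. Only the `dpi` and `ocr_language` defaults (150 and "eng") come from the comments above; every other default value here is an assumption made for illustration, and `llm_options` is a hypothetical helper.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ImageQuality {
    High,
    Medium,
    Low,
}

#[derive(Debug, Clone)]
struct ConversionOptions {
    split_pages: bool,
    optimize_for_llm: bool,
    max_chunk_size: usize,
    image_quality: ImageQuality,
    dpi: u32,
    ocr_language: String,
    preserve_layout: bool,
    extract_tables: bool,
    extract_images: bool,
    include_metadata: bool,
    compression_level: u8,
    remove_headers_footers: bool,
    remove_watermarks: bool,
    normalize_whitespace: bool,
}

impl Default for ConversionOptions {
    fn default() -> Self {
        Self {
            split_pages: false,                  // assumed
            optimize_for_llm: false,             // assumed
            max_chunk_size: 512,                 // assumed
            image_quality: ImageQuality::Medium, // assumed
            dpi: 150,                            // documented default
            ocr_language: "eng".to_string(),     // documented default
            preserve_layout: false,              // assumed
            extract_tables: false,               // assumed
            extract_images: false,               // assumed
            include_metadata: true,              // assumed
            compression_level: 6,                // assumed
            remove_headers_footers: false,       // assumed
            remove_watermarks: false,            // assumed
            normalize_whitespace: true,          // assumed
        }
    }
}

/// Override two fields and take the rest from `Default`.
fn llm_options() -> ConversionOptions {
    ConversionOptions {
        split_pages: true,
        optimize_for_llm: true,
        ..Default::default()
    }
}
```

Struct update syntax keeps call sites short and means that adding a new option field later does not break existing callers.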
Feature | Transmutation | Docling |
---|---|---|
Language | 100% Rust | Python |
Performance | ✅ 250x faster | Baseline |
Memory Usage | ✅ ~20MB | ~2-3GB |
Dependencies | ✅ Zero runtime deps | Python + ML models |
Deployment | ✅ Single binary (~5MB) | Python env + models (~2GB) |
Startup Time | ✅ <100ms | ~5-10s |
Platform Support | ✅ Windows/Mac/Linux | Requires Python |
- LangChain: Document loaders and text splitters
- LlamaIndex: Document readers and node parsers
- Haystack: Document converters and preprocessors
- DSPy: Optimized document processing
Test Document: Attention Is All You Need (arXiv:1706.03762v7.pdf)
Size: 2.22 MB, 15 pages
Metric | Transmutation | Docling | Improvement |
---|---|---|---|
Conversion Time | 0.21s | 52.68s | ✅ 250x faster |
Processing Speed | 71 pages/sec | 0.28 pages/sec | ✅ 254x faster |
Memory Usage | ~20MB | ~2-3GB | ✅ 100-150x less |
Startup Time | <0.1s | ~6s | ✅ 60x faster |
Output Quality (Fast) | 71.8% similarity | 100% (reference) | - |
Output Quality (Precision) | 77.3% similarity | 100% (reference) | - |
Operation | Input Size | Time | Throughput |
---|---|---|---|
PDF → Markdown | 2.2MB (15 pages) | 0.21s | 71 pages/s ✅ |
PDF → Markdown | 10MB (100 pages) | ~1.4s | 71 pages/s |
Batch (1,000 PDFs) | 2.2GB (15,000 pages) | ~4 min | 3,750 pages/min |
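The throughput column can be reproduced from page counts and timings. A small sketch of that arithmetic, with illustrative helper names that are not part of the crate:

```rust
/// Pages processed per second for a given page count and wall-clock time.
fn pages_per_second(pages: u32, secs: f64) -> f64 {
    pages as f64 / secs
}

/// Estimated wall-clock seconds for a page count at a given throughput.
fn estimated_seconds(pages: u32, pages_per_sec: f64) -> f64 {
    pages as f64 / pages_per_sec
}
```

Here `pages_per_second(15, 0.21)` is about 71, `estimated_seconds(100, 71.0)` is about 1.4, and 15,000 pages in 4 minutes (240 s) works out to 62.5 pages/s, which is the 3,750 pages/min shown in the batch row.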
- Base: ~20MB (pure Rust, no Python runtime) ✅
- Per conversion: Minimal (streaming processing)
- No ML models required (unlike Docling's 2-3GB)
Fast Mode (default) - 71.8% similarity:
- ✅ 250x faster than Docling
- ✅ Pure Rust with basic text heuristics
- ✅ Works on any PDF without training
- ✅ Zero runtime dependencies
Precision Mode (--precision) - 77.3% similarity:
- ✅ 250x faster than Docling (same speed as fast mode)
- ✅ Enhanced text processing with space correction
- ✅ 5.5 percentage points better than fast mode
- ✅ No hardcoded rules, all generic heuristics
Why not 95%+ similarity?
Docling uses:
- docling-parse (C++ library) - Extracts text with precise coordinates, fonts, and layout info
- LayoutModel (ML) - Deep learning to detect block types (headings, paragraphs, tables) visually
- ReadingOrderModel (ML) - ML-based reading order determination
Transmutation provides three modes, two available today and one in development:
1. Fast Mode (default):
- Pure Rust text extraction (pdf-extract)
- Generic heuristics (no ML)
- 71.8% similarity, 250x faster
2. Precision Mode (--precision):
- Enhanced text processing
- Generic heuristics + space correction
- 77.3% similarity, 250x faster
3. C++ FFI Mode (future) - Direct integration with docling-parse (no Python):
- Will use C++ library via FFI for 95%+ similarity
- No Python dependency, pure Rust + C++ shared library
- In development
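To give a feel for what "generic heuristics" can mean for text extracted from a PDF, the sketch below re-joins hard-wrapped lines into paragraphs and undoes end-of-line hyphenation. It is a hypothetical illustration, not Transmutation's actual algorithm, and `join_paragraphs` is an invented name.

```rust
/// Re-join hard-wrapped lines into paragraphs.
/// Blank lines separate paragraphs; a trailing '-' is treated as a
/// hyphenated line break and removed when the next line is appended.
fn join_paragraphs(raw: &str) -> String {
    let mut paragraphs: Vec<String> = Vec::new();
    let mut current = String::new();
    for line in raw.lines() {
        let line = line.trim();
        if line.is_empty() {
            // A blank line closes the paragraph collected so far.
            if !current.is_empty() {
                paragraphs.push(std::mem::take(&mut current));
            }
            continue;
        }
        if current.is_empty() {
            current.push_str(line);
        } else if current.ends_with('-') {
            // Naive de-hyphenation: this also merges genuine compounds
            // that happen to break at a line end.
            current.pop();
            current.push_str(line);
        } else {
            current.push(' ');
            current.push_str(line);
        }
    }
    if !current.is_empty() {
        paragraphs.push(current);
    }
    paragraphs.join("\n\n")
}
```

Rules like these need no training data and run in a single pass, which fits the speed figures quoted for the pure Rust modes. They cannot recover reading order or block types the way layout models can, which is consistent with the similarity gap described above.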
Mode | Similarity | Speed | Memory | Dependencies |
---|---|---|---|---|
Fast | 71.8% | 250x | 50 MB | None (pure Rust) |
Precision | 77.3% | 250x | 50 MB | None (pure Rust) |
FFI (future) | 95%+ | ~50x | 100 MB | C++ shared lib only |
See ROADMAP.md for detailed development plan.
- ✅ Project structure and architecture
- ✅ Core converter interfaces
- ✅ PDF conversion (pure Rust - pdf-extract)
- ✅ Advanced Markdown output with intelligent paragraph joining
- ✅ 98x faster than Docling benchmark achieved (97 papers tested)
- ✅ Windows MSI installer with dependency management
- ✅ Custom icons and professional branding
- ✅ Multi-platform installation scripts (5 variants)
- ✅ Build-time dependency detection
- ✅ Comprehensive documentation
- ✅ DOCX conversion (Markdown + Images - Pure Rust)
- ✅ XLSX conversion (Markdown/CSV/JSON - Pure Rust, 148 pg/s)
- ✅ PPTX conversion (Markdown/Images - Pure Rust, 1639 pg/s)
- ✅ HTML/XML conversion (Pure Rust, 2110-2353 pg/s)
- ✅ Text formats (TXT, CSV, TSV, RTF, ODT - Pure Rust)
- ✅ 11 formats total (8 production, 2 beta)
- ✅ Core formats always enabled (no feature flags)
- ✅ Simplified API and user experience
- ✅ Faster compilation
- ✅ Archive handling (ZIP, TAR, TAR.GZ - 1864 pg/s)
- ✅ Batch processing (Concurrent with Tokio - 4,627 pg/s)
- ✅ Image OCR (Tesseract - 6 formats, 88x faster than Docling)
- Performance optimizations
- Quality improvements (RTF, ODT)
- Memory optimizations
- v1.0.0 Release
See CONTRIBUTING.md for guidelines.
MIT License - see LICENSE for details.
See CHANGELOG.md for detailed version history and release notes.
Current Version: 0.1.1 (October 13, 2025)
- GitHub: https://github.com/hivellm/transmutation
- Documentation: https://docs.hivellm.org/transmutation
- Changelog: CHANGELOG.md
- Docling Project: https://github.com/docling-project
- HiveLLM Vectorizer: https://github.com/hivellm/vectorizer
Built with ❤️ by the HiveLLM Team
Pure Rust implementation - No Python, no ML model dependencies
Powered by:
- lopdf - Pure Rust PDF parsing
- docx-rs - Pure Rust DOCX parsing
- Tesseract - OCR engine (optional)
- FFmpeg - Multimedia processing (optional)
Inspired by Docling, but built to be faster, lighter, and easier to deploy.
Status: ✅ v0.1.1 - Production Ready with Professional Distribution Tools
Latest Updates (v0.1.1):
- 🪟 Windows MSI Installer with dependency management
- 🎨 Custom icons and branding
- 📦 Multi-platform installation scripts
- 🔧 Automated build and distribution tools