Transmutation is a Rust-based document conversion module designed to transform various file formats into optimized text and image outputs suitable for LLM processing and vector embeddings. Built as a core component of the HiveLLM Vectorizer ecosystem, it leverages [Docling](https://github.com/docling-project) for advanced document understanding.

hivellm/transmutation

Transmutation

High-performance document conversion engine for AI/LLM embeddings

Transmutation is a pure Rust document conversion engine designed to transform various file formats into optimized text and image outputs suitable for LLM processing and vector embeddings. Built as a core component of the HiveLLM Vectorizer ecosystem, Transmutation is a high-performance alternative to Docling, offering superior speed, lower memory usage, and zero runtime dependencies.

🎯 Project Goals

  • Pure Rust implementation - No Python dependencies, maximum performance
  • Convert documents to LLM-friendly formats (Markdown, Images, JSON)
  • Optimize output for embedding generation (text and multimodal)
  • Maintain maximum quality with minimum size
  • Competitor to Docling - 98x faster, more efficient, and easier to deploy
  • Seamless integration with HiveLLM Vectorizer

📊 Benchmark Results

Transmutation vs Docling (Fast Mode - Pure Rust):

| Metric | Paper 1 (15 pages) | Paper 2 (25 pages) | Average |
| --- | --- | --- | --- |
| Similarity | 76.36% | 84.44% | 80.40% |
| Speed | 108x faster | 88x faster | 98x faster |
| Time (Docling) | 31.36s | 40.56s | ~35s |
| Time (Transmutation) | 0.29s | 0.46s | ~0.37s |

  • ✅ 80% similarity - Acceptable for most use cases
  • ✅ 98x faster - Near-instant conversion
  • ✅ Pure Rust - No Python/ML dependencies
  • ✅ Low memory - 50 MB footprint
  • 🎯 Goal: 95% similarity (Precision Mode with C++ FFI - in development)

See BENCHMARK_COMPARISON.md for detailed results.
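
The speed figures in the table can be re-derived from the listed times. The sketch below is only an arithmetic check, not part of the Transmutation API; the function names `speedup`, `mean`, and `table_speedups` are hypothetical.

```rust
/// Speedup factor: how many times faster the candidate is than the baseline.
fn speedup(baseline_secs: f64, candidate_secs: f64) -> f64 {
    baseline_secs / candidate_secs
}

/// Arithmetic mean of a slice; returns 0.0 for an empty slice.
fn mean(values: &[f64]) -> f64 {
    if values.is_empty() {
        return 0.0;
    }
    values.iter().sum::<f64>() / values.len() as f64
}

/// Recomputes the per-paper and average speedups from the times in the table.
fn table_speedups() -> (f64, f64, f64) {
    let paper1 = speedup(31.36, 0.29); // Paper 1 (15 pages)
    let paper2 = speedup(40.56, 0.46); // Paper 2 (25 pages)
    (paper1, paper2, mean(&[paper1, paper2]))
}
```

Rounded, this gives 108x, 88x, and 98x, so the reported average is the mean of the two per-paper ratios. Dividing the summed Docling time by the summed Transmutation time instead gives roughly 96x.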

📋 Supported Formats

Document Formats

| Input Format | Output Options | Status | Modes |
| --- | --- | --- | --- |
| PDF | Image per page, Markdown (per page/full), JSON | ✅ Production | Fast, Precision, FFI |
| DOCX | Image per page, Markdown (per page/full), JSON | ✅ Production | Pure Rust + LibreOffice |
| XLSX | Markdown tables, CSV, JSON | ✅ Production | Pure Rust (148 pg/s) |
| PPTX | Image per slide, Markdown per slide | ✅ Production | Pure Rust (1639 pg/s) |
| HTML | Markdown, JSON | ✅ Production | Pure Rust (2110 pg/s) |
| XML | Markdown, JSON | ✅ Production | Pure Rust (2353 pg/s) |
| TXT | Markdown, JSON | ✅ Production | Pure Rust (2805 pg/s) |
| CSV/TSV | Markdown tables, JSON | ✅ Production | Pure Rust (2647 pg/s) |
| RTF | Markdown, JSON | ⚠️ Beta | Pure Rust (simplified parser) |
| ODT | Markdown, JSON | ⚠️ Beta | Pure Rust (ZIP + XML) |
| MD | Markdown (normalized), JSON | 🔄 Planned | - |
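
To illustrate what a "Markdown tables" output for CSV/TSV input might look like, here is a minimal sketch. It is not Transmutation's converter: the function `delimited_to_markdown` is hypothetical, and it assumes simple input with no quoted fields or embedded delimiters.

```rust
/// Converts simple delimiter-separated text into a Markdown pipe table.
/// Assumption: no quoted fields and no delimiters inside cells.
fn delimited_to_markdown(input: &str, delimiter: char) -> String {
    let rows: Vec<Vec<String>> = input
        .lines()
        .filter(|line| !line.trim().is_empty())
        .map(|line| {
            line.split(delimiter)
                // Escape pipes so cell text cannot break the table layout.
                .map(|cell| cell.trim().replace('|', "\\|"))
                .collect()
        })
        .collect();

    // The first non-empty line is treated as the header row.
    let Some(header) = rows.first() else {
        return String::new();
    };
    let width = header.len();

    let mut out = String::new();
    out.push_str(&format!("| {} |\n", header.join(" | ")));
    out.push_str(&format!("|{}\n", " --- |".repeat(width)));
    for row in &rows[1..] {
        // Pad short rows and truncate long ones to the header width.
        let cells: Vec<&str> = (0..width)
            .map(|i| row.get(i).map(String::as_str).unwrap_or(""))
            .collect();
        out.push_str(&format!("| {} |\n", cells.join(" | ")));
    }
    out
}
```

Calling `delimited_to_markdown(text, ',')` handles CSV and `delimited_to_markdown(text, '\t')` handles TSV. A production converter would need a real CSV parser for quoting and escaping rules.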

Image Formats (OCR)

| Input Format | Output Options | OCR Engine | Status |
| --- | --- | --- | --- |
| JPG/JPEG | Markdown (OCR), JSON | Tesseract | ✅ Production |
| PNG | Markdown (OCR), JSON | Tesseract | ✅ Production |
| TIFF/TIF | Markdown (OCR), JSON | Tesseract | ✅ Production |
| BMP | Markdown (OCR), JSON | Tesseract | ✅ Production |
| GIF | Markdown (OCR), JSON | Tesseract | ✅ Production |
| WEBP | Markdown (OCR), JSON | Tesseract | ✅ Production |

Audio/Video Formats

| Input Format | Output Options | Engine | Status |
| --- | --- | --- | --- |
| MP3 | Markdown (transcription), JSON | Whisper | ✅ Production |
| WAV | Markdown (transcription), JSON | Whisper | ✅ Production |
| M4A | Markdown (transcription), JSON | Whisper | ✅ Production |
| FLAC | Markdown (transcription), JSON | Whisper | ✅ Production |
| OGG | Markdown (transcription), JSON | Whisper | ✅ Production |
| MP4 | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| AVI | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| MKV | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| MOV | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |
| WEBM | Markdown (transcription), JSON | FFmpeg + Whisper | ✅ Production |

Archive Formats

| Input Format | Output Options | Status | Performance |
| --- | --- | --- | --- |
| ZIP | File listing, statistics, Markdown index, JSON | ✅ Production | Pure Rust (1864 pg/s) |
| TAR/GZ | Extract and process contents | 🔄 Planned | - |
| 7Z | Extract and process contents | 🔄 Planned | - |

🏗️ Architecture

transmutation/
├── src/
│   ├── lib.rs                  # Main library entry
│   ├── converters/
│   │   ├── mod.rs              # Converter registry
│   │   ├── pdf.rs              # PDF conversion (pure Rust)
│   │   ├── docx.rs             # DOCX conversion
│   │   ├── pptx.rs             # PPTX conversion
│   │   ├── xlsx.rs             # XLSX conversion
│   │   ├── html.rs             # HTML conversion
│   │   ├── xml.rs              # XML conversion
│   │   ├── image.rs            # Image OCR (Tesseract)
│   │   ├── audio.rs            # Audio transcription (pure Rust ASR)
│   │   ├── video.rs            # Video processing (FFmpeg)
│   │   └── archive.rs          # Archive extraction
│   ├── output/
│   │   ├── mod.rs              # Output format handlers
│   │   ├── markdown.rs         # Markdown generation
│   │   ├── image.rs            # Image generation/optimization
│   │   ├── json.rs             # JSON serialization
│   │   └── csv.rs              # CSV generation
│   ├── engines/
│   │   ├── mod.rs              # Engine abstractions
│   │   ├── pdf_parser.rs       # Pure Rust PDF parsing
│   │   ├── tesseract.rs        # Tesseract OCR wrapper
│   │   ├── audio_asr.rs        # Pure Rust audio transcription
│   │   └── ffmpeg.rs           # FFmpeg wrapper
│   ├── optimization/
│   │   ├── mod.rs              # Optimization strategies
│   │   ├── text.rs             # Text compression/cleanup
│   │   ├── image.rs            # Image compression
│   │   └── quality.rs          # Quality metrics
│   ├── integration/
│   │   ├── mod.rs              # Integration layer
│   │   ├── vectorizer.rs       # Vectorizer integration
│   │   ├── langchain.rs        # LangChain integration
│   │   ├── llamaindex.rs       # LlamaIndex integration
│   │   └── haystack.rs         # Haystack integration
│   ├── utils/
│   │   ├── mod.rs              # Utilities
│   │   ├── file_detect.rs      # File type detection
│   │   ├── metadata.rs         # Metadata extraction
│   │   └── cache.rs            # Conversion cache
│   └── error.rs                # Error types
├── src/bin/
│   └── transmutation.rs        # CLI application (included in main crate)
├── bindings/
│   ├── python/                 # Python bindings (PyO3) - Future
│   ├── node/                   # Node.js bindings (Neon) - Future
│   └── wasm/                   # WebAssembly bindings - Future
├── examples/
│   ├── basic_conversion.rs
│   ├── batch_processing.rs
│   ├── vectorizer_integration.rs
│   └── custom_pipeline.rs
├── benches/
│   └── conversion_benchmarks.rs
├── tests/
│   ├── integration/
│   └── fixtures/
├── Cargo.toml
├── README.md
├── LICENSE
├── ROADMAP.md
├── ARCHITECTURE.md
└── CONTRIBUTING.md
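
The tree lists a file type detection module (utils/file_detect.rs) without showing how it works. As a rough illustration only, the sketch below classifies a path by extension into the categories used in the format tables. The `InputKind` enum and `detect_kind` function are hypothetical names, and a real detector may well inspect file contents rather than trust the extension.

```rust
use std::path::Path;

/// Broad input categories, mirroring the format tables in this README.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum InputKind {
    Document,
    Image,
    Audio,
    Video,
    Archive,
    Unknown,
}

/// Classifies a path by its extension, ignoring ASCII case.
fn detect_kind(path: &str) -> InputKind {
    let ext = Path::new(path)
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_ascii_lowercase())
        .unwrap_or_default();

    match ext.as_str() {
        "pdf" | "docx" | "xlsx" | "pptx" | "html" | "xml" | "txt" | "csv" | "tsv" | "rtf"
        | "odt" | "md" => InputKind::Document,
        "jpg" | "jpeg" | "png" | "tiff" | "tif" | "bmp" | "gif" | "webp" => InputKind::Image,
        "mp3" | "wav" | "m4a" | "flac" | "ogg" => InputKind::Audio,
        "mp4" | "avi" | "mkv" | "mov" | "webm" => InputKind::Video,
        "zip" | "tar" | "gz" | "7z" => InputKind::Archive,
        // Files without a recognized extension fall through here.
        _ => InputKind::Unknown,
    }
}
```

For example, `detect_kind("report.PDF")` yields `InputKind::Document`, while a file with no extension yields `InputKind::Unknown`.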

🚀 Quick Start

Installation

Windows MSI Installer:

# Download from releases or build:
.\build-msi.ps1
msiexec /i target\wix\transmutation-0.1.1-x86_64.msi

See docs/MSI_BUILD.md for details.

Cargo:

# Add to Cargo.toml
[dependencies]
transmutation = "0.1"

# Core features (always enabled, no flags needed):
# - PDF, HTML, XML, ZIP, TXT, CSV, TSV, RTF, ODT

# Alternative: with Office formats (default)
# [dependencies.transmutation]
# version = "0.1"
# features = ["office"]  # DOCX, XLSX, PPTX

# Alternative: with optional features (requires external tools)
# [dependencies.transmutation]
# version = "0.1"
# features = ["office", "pdf-to-image", "tesseract", "audio"]

External Dependencies

Transmutation is mostly pure Rust, and the core features require no external tools:

| Feature | Requires | Status |
| --- | --- | --- |
| Core (PDF, HTML, XML, ZIP, TXT, CSV, TSV, RTF, ODT) | ✅ None | Always enabled |
| office (DOCX, XLSX, PPTX - Text) | ✅ None | Pure Rust (default) |
| pdf-to-image | ⚠️ poppler-utils | Optional |
| office + images | ⚠️ LibreOffice | Optional |
| image-ocr | ⚠️ Tesseract OCR | Optional |
| audio | ⚠️ Whisper CLI | Optional |
| video | ⚠️ FFmpeg + Whisper | Optional |
| archives-extended (TAR, GZ, 7Z) | ⚠️ tar, flate2 crates | Optional |

During compilation, build.rs will automatically detect missing dependencies and provide installation instructions:

cargo build --features "pdf-to-image"

# If pdftoppm is missing, you'll see:
⚠️  Optional External Dependencies Missing

  ❌ pdftoppm (poppler-utils): PDF → Image conversion
     Install: sudo apt-get install poppler-utils

📖 Quick install (all dependencies):
   ./install/install-deps-linux.sh
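
The project's actual build.rs is not reproduced here. As a hedged sketch of how build-time detection of this kind can work, the code below searches `PATH` for a tool and formats a `cargo:warning=` line (the standard Cargo build-script directive) for each one that is missing. The helper names `find_in_path_value`, `find_in_path`, and `missing_tool_warnings` are hypothetical.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Searches the directories of a PATH-style value for an executable name.
fn find_in_path_value(tool: &str, path_value: &OsStr) -> Option<PathBuf> {
    env::split_paths(path_value)
        .flat_map(|dir| {
            // On Windows, executables usually carry an .exe suffix.
            [dir.join(tool), dir.join(format!("{tool}.exe"))]
        })
        .find(|candidate| candidate.is_file())
}

/// Looks a tool up on the current PATH; None if PATH is unset or the tool is absent.
fn find_in_path(tool: &str) -> Option<PathBuf> {
    let path_value = env::var_os("PATH")?;
    find_in_path_value(tool, &path_value)
}

/// Builds one Cargo warning line per missing tool, given (binary, install hint) pairs.
fn missing_tool_warnings(tools: &[(&str, &str)]) -> Vec<String> {
    tools
        .iter()
        .filter(|(bin, _)| find_in_path(bin).is_none())
        .map(|(bin, hint)| format!("cargo:warning={bin} not found. Install: {hint}"))
        .collect()
}
```

In a build script, printing each returned line with `println!` would surface it as a compiler warning, for instance after calling `missing_tool_warnings(&[("pdftoppm", "sudo apt-get install poppler-utils")])`.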

Installation scripts are provided for all platforms:

  • Linux: ./install/install-deps-linux.sh
  • macOS: ./install/install-deps-macos.sh
  • Windows: .\install\install-deps-windows.ps1 (or .bat)

See install/README.md for detailed instructions.

Basic Usage

use transmutation::{Converter, OutputFormat, ConversionOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize converter
    let converter = Converter::new()?;
    
    // Convert PDF to Markdown
    let result = converter
        .convert("document.pdf")
        .to(OutputFormat::Markdown)
        .with_options(ConversionOptions {
            split_pages: true,
            optimize_for_llm: true,
            ..Default::default()
        })
        .execute()
        .await?;
    
    // Save output
    result.save("output/document.md").await?;
    
    println!("Converted {} pages", result.page_count());
    Ok(())
}

Batch Processing

use transmutation::{Converter, BatchProcessor, OutputFormat};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let converter = Converter::new()?;
    let batch = BatchProcessor::new(converter);
    
    // Process multiple files
    let results = batch
        .add_files(&["doc1.pdf", "doc2.docx", "doc3.pptx"])
        .to(OutputFormat::Markdown)
        .parallel(4)
        .execute()
        .await?;
    
    for (file, result) in results {
        println!("{}: {} -> {}", file, result.input_size(), result.output_size());
    }
    
    Ok(())
}

Vectorizer Integration

use transmutation::{Converter, OutputFormat};
use vectorizer::VectorizerClient;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let converter = Converter::new()?;
    let vectorizer = VectorizerClient::new("http://localhost:15002").await?;
    
    // Convert and embed in one pipeline
    let result = converter
        .convert("document.pdf")
        .to(OutputFormat::EmbeddingReady)
        .pipe_to(&vectorizer)
        .execute()
        .await?;
    
    println!("Embedded {} chunks", result.chunk_count());
    Ok(())
}

🔧 Configuration

Conversion Options

pub struct ConversionOptions {
    // Output control
    pub split_pages: bool,           // Split output by pages
    pub optimize_for_llm: bool,      // Optimize for LLM processing
    pub max_chunk_size: usize,       // Maximum chunk size (tokens)
    
    // Quality settings
    pub image_quality: ImageQuality, // High, Medium, Low
    pub dpi: u32,                    // DPI for image output (default: 150)
    pub ocr_language: String,        // OCR language (default: "eng")
    
    // Processing options
    pub preserve_layout: bool,       // Preserve document layout
    pub extract_tables: bool,        // Extract tables separately
    pub extract_images: bool,        // Extract embedded images
    pub include_metadata: bool,      // Include document metadata
    
    // Optimization
    pub compression_level: u8,       // 0-9 for output compression
    pub remove_headers_footers: bool,
    pub remove_watermarks: bool,
    pub normalize_whitespace: bool,
}
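
The `max_chunk_size` option limits chunk size in tokens, but the chunking strategy itself is not described. The sketch below shows one simple possibility under a loud assumption: a "token" is approximated by a whitespace-separated word, which real tokenizers do not guarantee. `chunk_by_words` is a hypothetical helper, not a Transmutation function.

```rust
/// Splits text into chunks of at most `max_tokens` whitespace-separated words.
/// Assumption: one word stands in for one token; real tokenizers count differently.
fn chunk_by_words(text: &str, max_tokens: usize) -> Vec<String> {
    // `chunks(0)` would panic, so treat a zero limit as "no output".
    if max_tokens == 0 {
        return Vec::new();
    }
    let words: Vec<&str> = text.split_whitespace().collect();
    words
        .chunks(max_tokens)
        .map(|chunk| chunk.join(" "))
        .collect()
}
```

Splitting on word boundaries keeps every chunk under the limit but ignores sentence and paragraph structure, which an embedding-oriented pipeline would probably want to respect.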

🆚 Why Transmutation vs Docling?

| Feature | Transmutation | Docling |
| --- | --- | --- |
| Language | 100% Rust | Python |
| Performance | ✅ 250x faster | Baseline |
| Memory Usage | ✅ ~20MB | ~2-3GB |
| Dependencies | ✅ Zero runtime deps | Python + ML models |
| Deployment | ✅ Single binary (~5MB) | Python env + models (~2GB) |
| Startup Time | ✅ <100ms | ~5-10s |
| Platform Support | ✅ Windows/Mac/Linux | Requires Python |

LLM Framework Integrations

  • LangChain: Document loaders and text splitters
  • LlamaIndex: Document readers and node parsers
  • Haystack: Document converters and preprocessors
  • DSPy: Optimized document processing

📊 Performance

Real-World Benchmarks ✅

Test Document: Attention Is All You Need (arXiv:1706.03762v7.pdf)
Size: 2.22 MB, 15 pages

| Metric | Transmutation | Docling | Improvement |
| --- | --- | --- | --- |
| Conversion Time | 0.21s | 52.68s | ✅ 250x faster |
| Processing Speed | 71 pages/sec | 0.28 pages/sec | ✅ 254x faster |
| Memory Usage | ~20MB | ~2-3GB | ✅ 100-150x less |
| Startup Time | <0.1s | ~6s | ✅ 60x faster |
| Output Quality (Fast) | 71.8% similarity | 100% (reference) | ⚠️ Trade-off |
| Output Quality (Precision) | 77.3% similarity | 100% (reference) | ⚠️ +5.5% better |

Projected Performance

| Operation | Input Size | Time | Throughput |
| --- | --- | --- | --- |
| PDF → Markdown | 2.2MB (15 pages) | 0.21s | 71 pages/s ✅ |
| PDF → Markdown | 10MB (100 pages) | ~1.4s | 71 pages/s |
| Batch (1,000 PDFs) | 2.2GB (15,000 pages) | ~4 min | 3,750 pages/min |
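
The projected rows appear to extrapolate the measured 71 pages/s linearly. The sketch below reproduces that arithmetic, assuming throughput stays constant regardless of document size or batch length; the helper names `throughput` and `projected_seconds` are hypothetical.

```rust
/// Measured throughput in pages per second.
fn throughput(pages: u64, seconds: f64) -> f64 {
    pages as f64 / seconds
}

/// Estimated seconds to convert `pages` at a constant throughput.
/// Assumption: throughput does not change with input size.
fn projected_seconds(pages: u64, pages_per_sec: f64) -> f64 {
    pages as f64 / pages_per_sec
}
```

With the measured run, `throughput(15, 0.21)` is about 71.4 pages/s. At 71 pages/s, `projected_seconds(100, 71.0)` is about 1.4 s, and 15,000 pages take about 211 s (roughly 3.5 minutes), which the table rounds up to ~4 min before deriving 3,750 pages/min.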

Memory Footprint

  • Base: ~20MB (pure Rust, no Python runtime) ✅
  • Per conversion: Minimal (streaming processing)
  • No ML models required (unlike Docling's 2-3GB)

Precision vs Performance Trade-off

Fast Mode (default) - 71.8% similarity:

  • ✅ 250x faster than Docling
  • ✅ Pure Rust with basic text heuristics
  • ✅ Works on any PDF without training
  • ✅ Zero runtime dependencies

Precision Mode (--precision) - 77.3% similarity:

  • ✅ 250x faster than Docling (same speed as fast mode)
  • ✅ Enhanced text processing with space correction
  • ✅ +5.5% better than fast mode
  • ✅ No hardcoded rules, all generic heuristics

Why not 95%+ similarity?

Docling uses:

  1. docling-parse (C++ library) - Extracts text with precise coordinates, fonts, and layout info
  2. LayoutModel (ML) - Deep learning to detect block types (headings, paragraphs, tables) visually
  3. ReadingOrderModel (ML) - ML-based reading order determination

Transmutation provides two modes today, with a third planned:

1. Fast Mode (default):

  • Pure Rust text extraction (pdf-extract)
  • Generic heuristics (no ML)
  • 71.8% similarity, 250x faster

2. Precision Mode (--precision):

  • Enhanced text processing
  • Generic heuristics + space correction
  • 77.3% similarity, 250x faster

3. C++ FFI Mode (future) - Direct integration with docling-parse (no Python):

  • Will use C++ library via FFI for 95%+ similarity
  • No Python dependency, pure Rust + C++ shared library
  • In development

| Mode | Similarity | Speed | Memory | Dependencies |
| --- | --- | --- | --- | --- |
| Fast | 71.8% | 250x | 50 MB | None (pure Rust) |
| Precision | 77.3% | 250x | 50 MB | None (pure Rust) |
| FFI (future) | 95%+ | ~50x | 100 MB | C++ shared lib only |
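
The "generic heuristics" behind Fast and Precision mode are not spelled out in this README. Purely as an illustration of what such a rule can look like, the sketch below joins hard-wrapped lines from extracted PDF text into paragraphs. The function `join_paragraphs` is hypothetical and is not taken from Transmutation's source.

```rust
/// Joins hard-wrapped lines into paragraphs.
/// Blank lines separate paragraphs; a trailing hyphen is treated as a split word.
fn join_paragraphs(raw: &str) -> Vec<String> {
    let mut paragraphs = Vec::new();
    let mut current = String::new();

    for line in raw.lines().map(str::trim) {
        if line.is_empty() {
            // A blank line closes the paragraph collected so far.
            if !current.is_empty() {
                paragraphs.push(std::mem::take(&mut current));
            }
            continue;
        }
        if current.ends_with('-') {
            // Re-join a word that was hyphenated across the line break.
            current.pop();
        } else if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(line);
    }
    if !current.is_empty() {
        paragraphs.push(current);
    }
    paragraphs
}
```

A rule this simple also merges genuine compounds that happen to break at a hyphen, which is one reason heuristic output diverges from a layout-aware parser.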

🛣️ Roadmap

See ROADMAP.md for detailed development plan.

Phase 1: Foundation (Q1 2025) ✅ COMPLETE

  • ✅ Project structure and architecture
  • ✅ Core converter interfaces
  • ✅ PDF conversion (pure Rust - pdf-extract)
  • ✅ Advanced Markdown output with intelligent paragraph joining
  • ✅ 98x faster than Docling benchmark achieved (97 papers tested)

Phase 1.5: Distribution & Tooling (Oct 2025) ✅ COMPLETE

  • ✅ Windows MSI installer with dependency management
  • ✅ Custom icons and professional branding
  • ✅ Multi-platform installation scripts (5 variants)
  • ✅ Build-time dependency detection
  • ✅ Comprehensive documentation

Phase 2: Core Formats (Q2 2025) ✅ 100% COMPLETE

  • ✅ DOCX conversion (Markdown + Images - Pure Rust)
  • ✅ XLSX conversion (Markdown/CSV/JSON - Pure Rust, 148 pg/s)
  • ✅ PPTX conversion (Markdown/Images - Pure Rust, 1639 pg/s)
  • ✅ HTML/XML conversion (Pure Rust, 2110-2353 pg/s)
  • ✅ Text formats (TXT, CSV, TSV, RTF, ODT - Pure Rust)
  • ✅ 11 formats total (8 production, 2 beta)

Phase 2.5: Core Features Architecture ✅ COMPLETE

  • ✅ Core formats always enabled (no feature flags)
  • ✅ Simplified API and user experience
  • ✅ Faster compilation

Phase 3: Advanced Features (Q3 2025) ✅ COMPLETE

  • ✅ Archive handling (ZIP, TAR, TAR.GZ - 1864 pg/s)
  • ✅ Batch processing (Concurrent with Tokio - 4,627 pg/s)
  • ✅ Image OCR (Tesseract - 6 formats, 88x faster than Docling)

Phase 4: Advanced Optimizations

  • 📝 Performance optimizations
  • 📝 Quality improvements (RTF, ODT)
  • 📝 Memory optimizations
  • 📝 v1.0.0 Release

🤝 Contributing

See CONTRIBUTING.md for guidelines.

📝 License

MIT License - see LICENSE for details.

📝 Changelog

See CHANGELOG.md for detailed version history and release notes.

Current Version: 0.1.1 (October 13, 2025)

🏆 Credits

Built with ❤️ by the HiveLLM Team

Pure Rust implementation - No Python, no ML model dependencies

Powered by:

  • lopdf - Pure Rust PDF parsing
  • docx-rs - Pure Rust DOCX parsing
  • Tesseract - OCR engine (optional)
  • FFmpeg - Multimedia processing (optional)

Inspired by Docling, but built to be faster, lighter, and easier to deploy.


Status: ✅ v0.1.1 - Production Ready with Professional Distribution Tools

Latest Updates (v0.1.1):

  • 🪟 Windows MSI Installer with dependency management
  • 🎨 Custom icons and branding
  • 📦 Multi-platform installation scripts
  • 🔧 Automated build and distribution tools
