# bun-pageindex

Bun-native, vectorless, reasoning-based RAG for document understanding. A TypeScript port of PageIndex, optimized for the Bun runtime.

## Features
- Vectorless RAG: Uses LLM reasoning to build hierarchical document indices without vector databases
- PDF Support: Extract structure and content from PDF documents
- OCR Mode: Process scanned PDFs with the GLM-OCR vision model (not in the original PageIndex!)
- Markdown Support: Convert markdown documents to tree structures
- LLM Agnostic: Works with OpenAI, LM Studio, Ollama, or any OpenAI-compatible API
- Bun Native: Optimized for Bun runtime with minimal dependencies
- CLI & API: Use as a library or command-line tool
## Installation

```sh
bun add bun-pageindex
```

OCR mode requires Poppler to be installed on your system:

```sh
# macOS
brew install poppler

# Ubuntu/Debian
sudo apt-get install poppler-utils

# Windows
# Download from https://github.com/oschwartz10612/poppler-windows/releases
```

## Quick Start

```ts
import { PageIndex, indexPdf, mdToTree } from "bun-pageindex";

// Process a PDF with OpenAI
const result = await indexPdf("document.pdf", {
  apiKey: process.env.OPENAI_API_KEY,
  model: "gpt-4o",
});

console.log(result.structure);

// Or use the PageIndex class for more control
const pageIndex = new PageIndex({
  model: "gpt-4o",
  addNodeSummary: true,
  addDocDescription: true,
});
const pdfResult = await pageIndex.fromPdf("document.pdf");

// Process markdown
const mdResult = await mdToTree("document.md", {
  addNodeSummary: true,
  thinning: true,
  thinningThreshold: 5000,
});
```

## Using with LM Studio

```ts
import { PageIndex } from "bun-pageindex";

const pageIndex = new PageIndex({
  model: "local-model", // Your LM Studio model name
}).useLMStudio();

const result = await pageIndex.fromPdf("document.pdf");
```

## Using with Ollama

```ts
import { PageIndex } from "bun-pageindex";

const pageIndex = new PageIndex({
  model: "llama3",
}).useOllama();

const result = await pageIndex.fromPdf("document.pdf");
```

## OCR Mode

OCR mode converts PDF pages to images, uses a vision model (such as GLM-OCR) to extract text, and then processes the result with a reasoning model.
```ts
import { PageIndex, indexPdfWithOcr, indexPdfWithLMStudioOcr } from "bun-pageindex";

// Using OpenAI
const result = await indexPdfWithOcr("scanned-document.pdf", {
  apiKey: process.env.OPENAI_API_KEY,
  reasoningModel: "gpt-4o",
  ocrModel: "gpt-4o", // OpenAI vision model
});

// Using LM Studio with local models
const lmResult = await indexPdfWithLMStudioOcr(
  "scanned-document.pdf",
  "qwen/qwen3-vl-30b", // Reasoning model
  "mlx-community/GLM-OCR-bf16" // OCR vision model
);

// Or use the PageIndex class directly
const pageIndex = new PageIndex({
  model: "qwen/qwen3-vl-30b",
  extractionMode: "ocr",
  ocrModel: "mlx-community/GLM-OCR-bf16",
  imageDpi: 150,
}).useLMStudio();

const ocrResult = await pageIndex.fromPdf("scanned-document.pdf");
```

## CLI Usage

```sh
# Process a PDF
bun-pageindex --pdf document.pdf

# Process with LM Studio
bun-pageindex --pdf document.pdf --lmstudio --model llama3

# Process scanned PDF with OCR
bun-pageindex --pdf scanned.pdf --ocr --lmstudio --model qwen/qwen3-vl-30b

# Process markdown with options
bun-pageindex --md README.md --add-doc-description --thinning

# See all options
bun-pageindex --help
```

## API Reference

### `PageIndex`

```ts
const pageIndex = new PageIndex(options);
```

Options:
- `model`: LLM model to use (default: `"gpt-4o-2024-11-20"`)
- `apiKey`: OpenAI API key (default: from the `OPENAI_API_KEY` env var)
- `baseUrl`: Custom API base URL (for LM Studio, Ollama, etc.)
- `tocCheckPageNum`: Pages to check for a TOC (default: `20`)
- `maxPageNumEachNode`: Max pages per node (default: `10`)
- `maxTokenNumEachNode`: Max tokens per node (default: `20000`)
- `addNodeId`: Add node IDs (default: `true`)
- `addNodeSummary`: Generate summaries (default: `true`)
- `addDocDescription`: Add a document description (default: `false`)
- `addNodeText`: Include raw text (default: `false`)
OCR Options:

- `extractionMode`: `"text"` (default) or `"ocr"` for scanned PDFs
- `ocrModel`: Vision model for OCR (default: `"mlx-community/GLM-OCR-bf16"`)
- `ocrPromptType`: `"text"`, `"formula"`, or `"table"` (default: `"text"`)
- `imageDpi`: DPI for PDF-to-image conversion (default: `150`)
- `imageFormat`: `"png"` or `"jpeg"` (default: `"png"`)
- `ocrConcurrency`: Concurrent OCR requests (default: `3`)
Methods:

- `fromPdf(input)`: Process a PDF file or buffer
- `useLMStudio()`: Configure for LM Studio
- `useOllama()`: Configure for Ollama
- `useOcrMode(ocrModel?)`: Enable OCR mode
- `setBaseUrl(url)`: Set a custom API base URL
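`fromPdf` resolves to a `PageIndexResult` whose `structure` field is a tree of `TreeNode` objects (the full shapes are listed under Result Types below). As a sketch of consuming that tree, here is a small standalone walker over a hand-built example — the `TreeNode` mirror and the sample data are illustrative, not real library output:

```ts
// Minimal mirror of the library's TreeNode shape (illustrative only).
interface TreeNode {
  title: string;
  nodeId?: string;
  summary?: string;
  nodes?: TreeNode[];
}

// Flatten a tree into "nodeId title" lines, indented by depth.
function outline(nodes: TreeNode[], depth = 0): string[] {
  return nodes.flatMap((n) => [
    `${"  ".repeat(depth)}${n.nodeId ?? "?"} ${n.title}`,
    ...outline(n.nodes ?? [], depth + 1),
  ]);
}

// Hand-built example tree (not real library output).
const structure: TreeNode[] = [
  {
    title: "Introduction",
    nodeId: "0001",
    nodes: [{ title: "Background", nodeId: "0002" }],
  },
  { title: "Methods", nodeId: "0003" },
];

console.log(outline(structure).join("\n"));
```

Printed as an indented outline, this is a quick way to eyeball what the reasoning step extracted.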
### `mdToTree`

```ts
const result = await mdToTree(path, options);
```

Additional options:

- `thinning`: Apply tree thinning (default: `false`)
- `thinningThreshold`: Minimum tokens for thinning (default: `5000`)
- `summaryTokenThreshold`: Token threshold for summaries (default: `200`)
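The library's thinning algorithm isn't spelled out here, but the general idea — when a subtree falls under the token threshold, collapse its children's text into the parent so tiny sections don't become their own nodes — can be sketched as follows. This is a simplified illustration with an assumed ~4-characters-per-token estimate, not the actual implementation:

```ts
interface MdNode {
  title: string;
  text?: string;
  nodes?: MdNode[];
}

// Crude token estimate: roughly 4 characters per token (assumption).
const estimateTokens = (s: string) => Math.ceil(s.length / 4);

// Total estimated tokens in a subtree.
function subtreeTokens(node: MdNode): number {
  return (
    estimateTokens(node.text ?? "") +
    (node.nodes ?? []).reduce((sum, n) => sum + subtreeTokens(n), 0)
  );
}

// Collapse children's text into the parent when the subtree is small.
function thin(node: MdNode, threshold: number): MdNode {
  if (!node.nodes?.length) return node;
  if (subtreeTokens(node) < threshold) {
    const merged = [
      node.text ?? "",
      ...node.nodes.map((n) => thin(n, threshold)).map((n) => n.text ?? ""),
    ]
      .filter(Boolean)
      .join("\n");
    return { title: node.title, text: merged };
  }
  return { ...node, nodes: node.nodes.map((n) => thin(n, threshold)) };
}

// A tiny document collapses into a single node.
const doc: MdNode = {
  title: "Root",
  text: "intro",
  nodes: [{ title: "A", text: "aaa", nodes: [{ title: "A1", text: "deep" }] }],
};
const thinned = thin(doc, 5000);
```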
### Result Types

```ts
interface PageIndexResult {
  docName: string;
  docDescription?: string;
  structure: TreeNode[];
}

interface TreeNode {
  title: string;
  nodeId?: string;
  startIndex?: number;
  endIndex?: number;
  summary?: string;
  prefixSummary?: string;
  text?: string;
  lineNum?: number;
  nodes?: TreeNode[];
}
```

## Benchmarks

Run benchmarks comparing the Bun and Python implementations:
```sh
# Requires LM Studio running on localhost:1234
bun run benchmark
```

## Development

```sh
# Install dependencies
bun install

# Run tests
bun test

# Build
bun run build
```

## How It Works

PageIndex uses LLM reasoning to:
1. Detect Table of Contents: Scans the initial pages for a TOC
2. Extract Structure: Parses the TOC, or generates a structure directly from the content
3. Map Page Numbers: Associates logical page numbers with physical pages
4. Build Tree: Creates a hierarchical tree structure
5. Generate Summaries: Creates a summary for each node (optional)
This approach provides human-like document understanding without the limitations of vector-based retrieval.
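The page-mapping step has a subtlety worth spelling out: the page numbers printed in a TOC rarely match physical page indices, because covers and front matter shift everything by a constant amount. A minimal sketch of an offset-based mapping (purely illustrative; the helper name and types are hypothetical, not the library's code):

```ts
interface TocEntry {
  title: string;
  printedPage: number; // page number as printed in the TOC
}

// Given one anchor where both numbers are known (e.g. "Chapter 1" is
// physically page 5 while the TOC prints page 1), derive the offset
// and map every entry to its physical page.
function mapToPhysical(
  entries: TocEntry[],
  anchorPrinted: number,
  anchorPhysical: number
): { title: string; physicalPage: number }[] {
  const offset = anchorPhysical - anchorPrinted;
  return entries.map((e) => ({
    title: e.title,
    physicalPage: e.printedPage + offset,
  }));
}

const mapped = mapToPhysical(
  [
    { title: "Chapter 1", printedPage: 1 },
    { title: "Chapter 2", printedPage: 10 },
  ],
  1,
  5
);
```

In practice the anchor itself comes from LLM reasoning (spotting where a TOC entry actually begins), which is what makes the mapping step reasoning-based rather than purely mechanical.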
For scanned PDFs, OCR mode adds additional steps:
1. Convert PDF to Images: Uses Poppler to render each page as an image
2. OCR Extraction: Uses a vision model (GLM-OCR) to extract text from the images
3. Standard Processing: Continues with the same reasoning-based indexing
This enables processing of scanned documents that the original Python PageIndex cannot handle.
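The rendering step relies on Poppler's `pdftoppm` utility, whose `-r` flag sets the DPI and `-png`/`-jpeg` the output format. A sketch of assembling that command line (the helper is hypothetical, and whether the library invokes `pdftoppm` with exactly these flags is an assumption):

```ts
// Build a pdftoppm argument list: `-r` sets the DPI, `-png`/`-jpeg`
// the output format; pages are written as <prefix>-1.png, <prefix>-2.png, ...
function pdftoppmArgs(
  pdfPath: string,
  outPrefix: string,
  dpi: number,
  format: "png" | "jpeg"
): string[] {
  return ["-r", String(dpi), `-${format}`, pdfPath, outPrefix];
}

// With Bun, the conversion could then be run roughly as:
// await Bun.spawn(["pdftoppm", ...pdftoppmArgs("scan.pdf", "page", 150, "png")]).exited;
```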
## Credits

This is a Bun/TypeScript port of PageIndex by VectifyAI.

## License

MIT

## Author

Antonio Oliveira <antonio@oakoliver.com> (oakoliver.com)