bun-pageindex

Bun-native vectorless, reasoning-based RAG for document understanding. A TypeScript port of PageIndex optimized for the Bun runtime.

Features

Vectorless RAG: Uses LLM reasoning to build hierarchical document indices without vector databases
PDF Support: Extract structure and content from PDF documents
OCR Mode: Process scanned PDFs using the GLM-OCR vision model (not available in the original PageIndex)
Markdown Support: Convert markdown documents to tree structures
LLM Agnostic: Works with OpenAI, LM Studio, Ollama, or any OpenAI-compatible API
Bun Native: Optimized for the Bun runtime with minimal dependencies
CLI & API: Use as a library or command-line tool

Installation

bun add bun-pageindex

For OCR Mode (Scanned PDFs)

OCR mode requires Poppler to be installed on your system:

# macOS
brew install poppler

# Ubuntu/Debian
sudo apt-get install poppler-utils

# Windows
# Download from https://github.com/oschwartz10612/poppler-windows/releases

Quick Start

As a Library

import { PageIndex, indexPdf, mdToTree } from "bun-pageindex";

// Process a PDF with OpenAI
const result = await indexPdf("document.pdf", {
  apiKey: process.env.OPENAI_API_KEY,
  model: "gpt-4o",
});

console.log(result.structure);

// Or use the PageIndex class for more control
const pageIndex = new PageIndex({
  model: "gpt-4o",
  addNodeSummary: true,
  addDocDescription: true,
});

const pdfResult = await pageIndex.fromPdf("document.pdf");

// Process markdown
const mdResult = await mdToTree("document.md", {
  addNodeSummary: true,
  thinning: true,
  thinningThreshold: 5000,
});

Using LM Studio (Local LLMs)

import { PageIndex } from "bun-pageindex";

const pageIndex = new PageIndex({
  model: "local-model", // Your LM Studio model name
}).useLMStudio();

const result = await pageIndex.fromPdf("document.pdf");

Using Ollama

import { PageIndex } from "bun-pageindex";

const pageIndex = new PageIndex({
  model: "llama3",
}).useOllama();

const result = await pageIndex.fromPdf("document.pdf");

OCR Mode for Scanned PDFs

OCR mode converts PDF pages to images, uses a vision model (such as GLM-OCR) to extract the text, and then indexes the extracted text with a reasoning model.

import { PageIndex, indexPdfWithOcr, indexPdfWithLMStudioOcr } from "bun-pageindex";

// Using OpenAI
const result = await indexPdfWithOcr("scanned-document.pdf", {
  apiKey: process.env.OPENAI_API_KEY,
  reasoningModel: "gpt-4o",
  ocrModel: "gpt-4o", // OpenAI vision model
});

// Using LM Studio with local models
const lmStudioResult = await indexPdfWithLMStudioOcr(
  "scanned-document.pdf",
  "qwen/qwen3-vl-30b",           // Reasoning model
  "mlx-community/GLM-OCR-bf16"   // OCR vision model
);

// Or use the PageIndex class directly
const pageIndex = new PageIndex({
  model: "qwen/qwen3-vl-30b",
  extractionMode: "ocr",
  ocrModel: "mlx-community/GLM-OCR-bf16",
  imageDpi: 150,
}).useLMStudio();

const classResult = await pageIndex.fromPdf("scanned-document.pdf");

CLI Usage

# Process a PDF
bun-pageindex --pdf document.pdf

# Process with LM Studio
bun-pageindex --pdf document.pdf --lmstudio --model llama3

# Process scanned PDF with OCR
bun-pageindex --pdf scanned.pdf --ocr --lmstudio --model qwen/qwen3-vl-30b

# Process markdown with options
bun-pageindex --md README.md --add-doc-description --thinning

# See all options
bun-pageindex --help

API Reference

PageIndex Class

const pageIndex = new PageIndex(options);

Options:

model: LLM model to use (default: "gpt-4o-2024-11-20")
apiKey: OpenAI API key (default: from OPENAI_API_KEY env var)
baseUrl: Custom API base URL (for LM Studio, Ollama, etc.)
tocCheckPageNum: Number of initial pages to scan for a table of contents (default: 20)
maxPageNumEachNode: Max pages per node (default: 10)
maxTokenNumEachNode: Max tokens per node (default: 20000)
addNodeId: Add node IDs (default: true)
addNodeSummary: Generate summaries (default: true)
addDocDescription: Add document description (default: false)
addNodeText: Include raw text (default: false)

OCR Options:

extractionMode: "text" (default) or "ocr" for scanned PDFs
ocrModel: Vision model for OCR (default: "mlx-community/GLM-OCR-bf16")
ocrPromptType: "text", "formula", or "table" (default: "text")
imageDpi: DPI for PDF to image conversion (default: 150)
imageFormat: "png" or "jpeg" (default: "png")
ocrConcurrency: Concurrent OCR requests (default: 3)

Methods:

fromPdf(input): Process a PDF file or buffer
useLMStudio(): Configure for LM Studio
useOllama(): Configure for Ollama
useOcrMode(ocrModel?): Enable OCR mode
setBaseUrl(url): Set custom API base URL

mdToTree Function

const result = await mdToTree(path, options);

Additional Options:

thinning: Apply tree thinning (default: false)
thinningThreshold: Min tokens for thinning (default: 5000)
summaryTokenThreshold: Token threshold for summaries (default: 200)
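
To make the idea of converting markdown to a tree concrete, here is a minimal sketch of nesting headings by level with a stack. It is an illustration only, not the library's implementation: SketchNode and headingsToTree are hypothetical names, and the real mdToTree also handles summaries and thinning, which this sketch ignores.

```typescript
// Hypothetical sketch: NOT the bun-pageindex implementation.
interface SketchNode {
  title: string;
  lineNum: number; // 1-based line number of the heading
  nodes: SketchNode[];
}

function headingsToTree(markdown: string): SketchNode[] {
  const roots: SketchNode[] = [];
  // Stack of currently open ancestors, each with its heading level (1-6).
  const stack: { level: number; node: SketchNode }[] = [];
  let inFence = false;

  markdown.split("\n").forEach((line, i) => {
    // Skip fenced code so "# comment" lines are not mistaken for headings.
    if (/^\s*(```|~~~)/.test(line)) {
      inFence = !inFence;
      return;
    }
    if (inFence) return;

    const match = /^(#{1,6})\s+(.+?)\s*$/.exec(line);
    if (!match) return;

    const level = match[1].length;
    const node: SketchNode = { title: match[2], lineNum: i + 1, nodes: [] };

    // Close every open heading at the same or a deeper level.
    while (stack.length > 0 && stack[stack.length - 1].level >= level) {
      stack.pop();
    }
    const parent = stack[stack.length - 1];
    (parent ? parent.node.nodes : roots).push(node);
    stack.push({ level, node });
  });

  return roots;
}
```

Each heading becomes a child of the nearest preceding heading with a smaller level, which is why a single pass with a stack is enough.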

Result Structure

interface PageIndexResult {
  docName: string;
  docDescription?: string;
  structure: TreeNode[];
}

interface TreeNode {
  title: string;
  nodeId?: string;
  startIndex?: number;
  endIndex?: number;
  summary?: string;
  prefixSummary?: string;
  text?: string;
  lineNum?: number;
  nodes?: TreeNode[];
}
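
Because the result is a plain recursive structure, it can be consumed with ordinary tree traversal. The sketch below redeclares only the TreeNode fields it needs so that it is self-contained. The helper names outline and findNode are hypothetical and not part of the package, and treating startIndex and endIndex as a page range is an assumption.

```typescript
// Subset of the TreeNode shape documented above, redeclared for a standalone example.
interface TreeNode {
  title: string;
  nodeId?: string;
  startIndex?: number;
  endIndex?: number;
  nodes?: TreeNode[];
}

// Depth-first walk that returns one indented outline line per node.
// Assumption: startIndex/endIndex describe the range a node covers.
function outline(nodes: TreeNode[], depth = 0): string[] {
  const lines: string[] = [];
  for (const node of nodes) {
    const range =
      node.startIndex !== undefined && node.endIndex !== undefined
        ? ` [${node.startIndex}-${node.endIndex}]`
        : "";
    lines.push(`${"  ".repeat(depth)}${node.title}${range}`);
    if (node.nodes) lines.push(...outline(node.nodes, depth + 1));
  }
  return lines;
}

// Look up a node by nodeId anywhere in the tree.
function findNode(nodes: TreeNode[], nodeId: string): TreeNode | undefined {
  for (const node of nodes) {
    if (node.nodeId === nodeId) return node;
    const hit = node.nodes ? findNode(node.nodes, nodeId) : undefined;
    if (hit) return hit;
  }
  return undefined;
}
```

With a result from indexPdf or fromPdf, something like console.log(outline(result.structure).join("\n")) would print a readable table of contents.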

Benchmarks

Run benchmarks comparing Bun vs Python implementations:

# Requires LM Studio running on localhost:1234
bun run benchmark

Development

# Install dependencies
bun install

# Run tests
bun test

# Build
bun run build

How It Works

PageIndex uses LLM reasoning to:

Detect Table of Contents: Scans initial pages for TOC
Extract Structure: Parses TOC or generates structure from content
Map Page Numbers: Associates logical page numbers with physical pages
Build Tree: Creates hierarchical tree structure
Generate Summaries: Creates summaries for each node (optional)

This approach provides human-like document understanding without the limitations of vector-based retrieval.
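
The "Map Page Numbers" step can be pictured as estimating an offset between the page numbers printed in a table of contents and the physical page positions in the PDF. The sketch below is one plausible way to do that, assuming a few title/page pairs have already been verified. It is not taken from the bun-pageindex source, and TocEntry, MatchedPair, estimateOffset and applyOffset are hypothetical names.

```typescript
// Hypothetical sketch: NOT the bun-pageindex implementation.
interface TocEntry {
  title: string;
  logicalPage: number; // page number as printed in the TOC
}

interface MatchedPair {
  logicalPage: number;
  physicalIndex: number; // position of that page in the PDF file
}

// Return the most common (physical - logical) difference among verified pairs,
// or undefined when there are no pairs to learn from.
function estimateOffset(pairs: MatchedPair[]): number | undefined {
  const counts = new Map<number, number>();
  for (const p of pairs) {
    const diff = p.physicalIndex - p.logicalPage;
    counts.set(diff, (counts.get(diff) ?? 0) + 1);
  }
  let best: number | undefined;
  let bestCount = 0;
  counts.forEach((count, diff) => {
    if (count > bestCount) {
      best = diff;
      bestCount = count;
    }
  });
  return best;
}

// Shift every TOC entry by the estimated offset to get physical start positions.
function applyOffset(entries: TocEntry[], offset: number) {
  return entries.map((e) => ({
    title: e.title,
    startIndex: e.logicalPage + offset,
  }));
}
```

Taking the most common difference rather than the first one makes the estimate tolerant of an occasional mismatched pair.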

OCR Mode (New in bun-pageindex)

For scanned PDFs, OCR mode adds a text-extraction stage before the standard indexing:

Convert PDF to Images: Uses Poppler to render each page as an image
OCR Extraction: Uses a vision model (GLM-OCR) to extract text from images
Standard Processing: Continues with the same reasoning-based indexing

This enables processing of scanned documents that the original Python PageIndex cannot handle.
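
For orientation, the page-rendering step corresponds to what Poppler's pdftoppm command does. The sketch below only builds the argument list, using the real pdftoppm flags -r (resolution in DPI), -png and -jpeg. The function name pdftoppmArgs is hypothetical, and whether bun-pageindex invokes Poppler this way internally is an assumption.

```typescript
// Hypothetical helper: shows one way the imageDpi and imageFormat options
// could translate into a pdftoppm invocation. Not taken from the package source.
type ImageFormat = "png" | "jpeg";

function pdftoppmArgs(
  pdfPath: string,
  outPrefix: string,
  dpi = 150,
  format: ImageFormat = "png",
): string[] {
  if (!Number.isFinite(dpi) || dpi <= 0) {
    throw new RangeError(`dpi must be a positive number, got ${dpi}`);
  }
  // pdftoppm [options] <PDF-file> <output-root>
  // It writes one image per page, named from the output root plus the page number.
  return ["-r", String(dpi), `-${format}`, pdfPath, outPrefix];
}

// In Bun, the command could be run roughly like this (requires Poppler on PATH):
//   const proc = Bun.spawn(["pdftoppm", ...pdftoppmArgs("scanned.pdf", "/tmp/page")]);
//   await proc.exited;
```

Higher DPI values generally give the vision model more detail to work with, at the cost of larger images and slower requests.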

Credits

This is a Bun/TypeScript port of PageIndex by VectifyAI.

License

MIT

Author

Antonio Oliveira <antonio@oakoliver.com> (oakoliver.com)
