PageIndex

PageIndex is a Go library and command-line tool that turns PDF and Markdown files into a hierarchical document tree (outline / TOC semantics). Structure is derived from layout and headings; optional LLM calls refine the TOC, attach summaries, and power tree-scoped search. There is no vector index and no embedding step, which is useful when you want an interpretable skeleton of the document rather than chunk-and-embed RAG. The project name reflects the same structural-indexing idea explored in other open work; this repository is maintained on its own schedule with a distinct Go API. [1]

Module: github.com/neurondb/pageindex · see go.mod
Source: github.com/neurondb/PageIndex
CLI: pageindex → JSON tree on disk (cmd/pageindex)
Stable API: BuildFromPDF, BuildFromMarkdown, Document, Node, With*, TreeSearch* (details)


Capabilities

| Area | Behavior |
| --- | --- |
| PDF | Text via go-fitz (MuPDF). TOC detection, verification/repair, tree assembly. |
| Markdown | Heading hierarchy; optional thinning for very large trees (token-aware). |
| LLM | OpenAI and Anthropic clients with retries; model chosen per call / config. |
| Tokens | tiktoken-go for budgets and thinning thresholds. |
| Search | TreeSearch, TreeSearchWithPreference run over an already-built Document. |
| Output | Document / JSON serialization; CLI writes *_structure.json under --output-dir. |
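
The Markdown row describes deriving the tree from the heading hierarchy. One common way to do this is to keep a stack of open ancestor headings and attach each new heading to the nearest shallower one. The sketch below illustrates that idea only; it is not taken from this repository. `Node` and `buildOutline` are hypothetical names, and the sketch ignores Setext headings and fenced code blocks.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical stand-in for one outline entry; it is not the
// library's Node type.
type Node struct {
	Title    string
	Level    int
	Children []*Node
}

// buildOutline nests ATX headings ("#" through "######") by level, using a
// stack whose top is always the most recently opened heading.
func buildOutline(markdown string) []*Node {
	root := &Node{Level: 0}
	stack := []*Node{root}
	for _, line := range strings.Split(markdown, "\n") {
		trimmed := strings.TrimSpace(line)
		level := 0
		for level < len(trimmed) && trimmed[level] == '#' {
			level++
		}
		// Skip non-headings: no '#', too many '#', no title, or no space after the marker.
		if level == 0 || level > 6 || level == len(trimmed) || trimmed[level] != ' ' {
			continue
		}
		node := &Node{Title: strings.TrimSpace(trimmed[level:]), Level: level}
		// Pop until the top of the stack is shallower than the new heading.
		for stack[len(stack)-1].Level >= level {
			stack = stack[:len(stack)-1]
		}
		parent := stack[len(stack)-1]
		parent.Children = append(parent.Children, node)
		stack = append(stack, node)
	}
	return root.Children
}

func main() {
	doc := "# Intro\n## Scope\n## Terms\n# Methods\n### Sampling\n"
	for _, n := range buildOutline(doc) {
		fmt.Printf("%s (%d direct children)\n", n.Title, len(n.Children))
	}
}
```

A heading that skips a level (for example `###` directly under `#`) is attached to the nearest shallower heading rather than rejected, which keeps the tree usable for loosely structured documents.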

Note

Semantic versioning applies to the symbols listed in docs/api.md and pkg/pageindex/doc.go. Prefer the package entrypoints over lower-level builders unless you need them.


Requirements

| Prerequisite | Detail |
| --- | --- |
| Go | 1.26.1+ (toolchain in go.mod). |
| LLM | Set OPENAI_API_KEY and/or ANTHROPIC_API_KEY when using models; provider is inferred from the model id. |
| MuPDF at runtime | Default builds link MuPDF for PDF I/O. |
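
The LLM row says the provider is inferred from the model id. The exact rule is not documented here, so the following is a minimal sketch under the assumption that ids are matched by prefix (`gpt-` for OpenAI, `claude-` for Anthropic). `providerFor` is a hypothetical helper, not part of the package API; only the two environment variable names come from this README.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// providerFor maps a model id to a provider name and the environment variable
// expected to hold its API key. The prefix rules are an assumption made for
// illustration; the library's actual inference may differ.
func providerFor(model string) (provider, keyVar string, ok bool) {
	switch {
	case strings.HasPrefix(model, "gpt-"):
		return "openai", "OPENAI_API_KEY", true
	case strings.HasPrefix(model, "claude-"):
		return "anthropic", "ANTHROPIC_API_KEY", true
	}
	return "", "", false
}

func main() {
	model := "gpt-4o-2024-11-20"
	provider, keyVar, ok := providerFor(model)
	if !ok {
		fmt.Fprintf(os.Stderr, "unrecognized model id %q\n", model)
		os.Exit(1)
	}
	// Fail early with a clear message instead of at the first LLM call.
	if os.Getenv(keyVar) == "" {
		fmt.Fprintf(os.Stderr, "%s must be set to use %s models\n", keyVar, provider)
		os.Exit(1)
	}
	fmt.Printf("%s -> %s (key read from %s)\n", model, provider, keyVar)
}
```

A preflight check like this can be useful in scripts that switch between models, since a missing key is reported before any document processing starts.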

Important

Builds with -tags=nocgo (e.g. some shared-library / c-shared workflows) expect libmupdf.so on the loader path (LD_LIBRARY_PATH or system dirs). Without it, PDF paths fail at runtime even if the binary links.


Install

As a dependency

go get github.com/neurondb/pageindex@latest

import "github.com/neurondb/pageindex/pkg/pageindex"

CLI (install from module)

go install github.com/neurondb/pageindex/cmd/pageindex@latest

From a git clone

git clone https://github.com/neurondb/PageIndex.git
cd PageIndex
make build   # ./bin/pageindex

Version string

The library exposes pageindex.Version. The binary prints it with -version.

./bin/pageindex -version

Release artifacts (when published): Releases.


Usage

CLI — one of --pdf_path or --md_path; optional flags mirror common With* options (see cmd/pageindex/main.go).

export OPENAI_API_KEY=""   # set to your OpenAI API key before running
./bin/pageindex \
  --pdf_path ./report.pdf \
  --model gpt-4o-2024-11-20 \
  --toc-check-pages 20 \
  --if-add-node-summary yes \
  --output-dir ./results

Writes results/<stem>_structure.json.
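
Scripts that post-process the output need to know where the file lands. The sketch below computes that path, assuming `<stem>` means the input file name without its directory or extension. `structurePath` is a hypothetical helper written for this example, not a function exported by the package.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// structurePath returns the expected output location for an input document,
// assuming the stem is the base file name with its extension removed.
func structurePath(input, outputDir string) string {
	base := filepath.Base(input)
	stem := strings.TrimSuffix(base, filepath.Ext(base))
	return filepath.Join(outputDir, stem+"_structure.json")
}

func main() {
	fmt.Println(structurePath("./report.pdf", "./results"))
}
```

`filepath.Join` also cleans the result, so a leading `./` in the output directory is dropped from the returned path.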

Library

doc, err := pageindex.BuildFromPDF("report.pdf",
	pageindex.WithModel("gpt-4o-2024-11-20"),
	pageindex.WithAddNodeSummary(true),
)
if err != nil {
	// handle: I/O, PDF, TOC, LLM, config
}
_ = doc.Structure // root nodes
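
Once the root nodes are available, a typical next step is a depth-first walk in document order. The sketch below uses a local stand-in type so that it runs on its own. The field names `Title` and `Children` are assumptions, and the real Node type may expose different ones (see docs/api.md).

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical stand-in for a tree entry; field names are assumed.
type node struct {
	Title    string
	Children []node
}

// walk visits every node depth-first, parents before children, and passes the
// nesting depth (0 for roots) to the callback.
func walk(nodes []node, depth int, visit func(n node, depth int)) {
	for _, n := range nodes {
		visit(n, depth)
		walk(n.Children, depth+1, visit)
	}
}

func main() {
	structure := []node{
		{Title: "Introduction", Children: []node{{Title: "Scope"}}},
		{Title: "Results"},
	}
	// Print an indented outline, two spaces per level.
	walk(structure, 0, func(n node, depth int) {
		fmt.Printf("%s%s\n", strings.Repeat("  ", depth), n.Title)
	})
}
```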

Further patterns: docs/examples.md, examples/.


Configuration

| Layer | Mechanism |
| --- | --- |
| Library | Functional options (pageindex.With…) override defaults loaded from env (PAGEINDEX_*) and optional YAML; see docs/configuration.md. |
| CLI | Flags apply per invocation; same semantics as the high-level options. |
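
The precedence described for the library layer (built-in defaults, then environment, then functional options) can be sketched as follows. This illustrates the functional-options pattern in general, not the package's actual code: `Config`, `resolve`, the variable name `PAGEINDEX_MODEL`, and the default model value are all assumptions, and the YAML layer is omitted for brevity.

```go
package main

import (
	"fmt"
	"os"
)

// Config is a hypothetical settings struct used only in this sketch.
type Config struct {
	Model          string
	AddNodeSummary bool
}

// Option mutates a Config; options are applied last, so they win.
type Option func(*Config)

func WithModel(m string) Option         { return func(c *Config) { c.Model = m } }
func WithAddNodeSummary(on bool) Option { return func(c *Config) { c.AddNodeSummary = on } }

// resolve layers settings in increasing priority: defaults, environment,
// then explicit options. lookup has the same shape as os.LookupEnv so the
// environment can be faked in tests.
func resolve(lookup func(string) (string, bool), opts ...Option) Config {
	cfg := Config{Model: "gpt-4o-2024-11-20"} // assumed default for the example
	if v, ok := lookup("PAGEINDEX_MODEL"); ok && v != "" { // assumed variable name
		cfg.Model = v
	}
	for _, opt := range opts {
		opt(&cfg)
	}
	return cfg
}

func main() {
	cfg := resolve(os.LookupEnv, WithAddNodeSummary(true))
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the precedence logic testable without mutating the process environment.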

Repository layout

| Path | Role |
| --- | --- |
| pkg/pageindex | Public API and types |
| internal/pdf, internal/markdown, internal/toc | Parsers and TOC logic |
| internal/llm | Provider clients |
| internal/config | Defaults and config load |
| cmd/pageindex | CLI entrypoint |
| docs/ | Long-form guides |

Data flow

flowchart LR
  subgraph in["Input"]
    P[PDF file]
    M[Markdown file]
  end

  subgraph api["pkg/pageindex"]
    E[BuildFromPDF / BuildFromMarkdown]
  end

  subgraph core["Processing"]
    CFG[config]
    PDF[pdf + toc]
    MD[markdown]
    TOK[tokenizer]
    LLM[llm]
  end

  subgraph out["Output"]
    D[Document tree]
    J[JSON / files]
  end

  P --> E
  M --> E
  E --> CFG
  E --> PDF
  E --> MD
  E --> TOK
  E --> LLM
  E --> D
  D --> J

Narrative breakdown: docs/architecture.md.


Documentation

| Document | Purpose |
| --- | --- |
| Getting started | Environment, first build, first run |
| Configuration | Env vars, YAML, option reference |
| API reference | Exported symbols and stability |
| Architecture | Package boundaries and pipelines |
| Examples | Recipes |
| Contributing | PR workflow |

Meta: CHANGELOG · Support · Security · Code of conduct


Development

make build          # ./bin/pageindex
make test           # go test ./...
make test-coverage  # coverage.out + HTML report
make fmt && make vet && make test

Optional: make install-git-hooks registers scripts/git-hooks/commit-msg (strips accidental Made-with: trailer lines).


Contributing

Read CONTRIBUTING.md and docs/contributing.md. Report security-sensitive issues through the process described in SECURITY.md, not in public issues.


License

Distributed under the MIT License.


Copyright © 2024–2026 PageIndex contributors.

Footnotes

  1. Example lineage: VectifyAI/PageIndex. Not affiliated; implementation and versioning here are independent.
