A Go library and command-line tool that turns PDF and Markdown files into a hierarchical document tree (outline / TOC semantics). Structure is derived from layout and headings; optional LLM calls refine the TOC, attach summaries, and power tree-scoped search. There is no vector index and no embedding step, which is useful when you want an interpretable skeleton of the document rather than chunk-and-embed RAG. The project name reflects the same structural-indexing idea explored in other open work; this repository is maintained on its own schedule with a distinct Go API.[^1]
| Module | github.com/neurondb/pageindex · see go.mod |
| Source | github.com/neurondb/PageIndex |
| CLI | pageindex → JSON tree on disk (cmd/pageindex) |
| Stable API | BuildFromPDF, BuildFromMarkdown, Document, Node, With*, TreeSearch* (details) |
- Capabilities
- Requirements
- Install
- Usage
- Configuration
- Repository layout
- Data flow
- Documentation
- Development
- Contributing
- License
| Area | Behavior |
|---|---|
| PDF | Text via go-fitz (MuPDF). TOC detection, verification/repair, tree assembly. |
| Markdown | Heading hierarchy; optional thinning for very large trees (token-aware). |
| LLM | OpenAI and Anthropic clients with retries; model chosen per call / config. |
| Tokens | tiktoken-go for budgets and thinning thresholds. |
| Search | TreeSearch, TreeSearchWithPreference run over an already-built Document. |
| Output | Document / JSON serialization; CLI writes *_structure.json under --output-dir. |
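The Markdown row above says the tree follows the heading hierarchy. As a rough illustration of that idea only, the sketch below builds a nested outline from ATX headings with a stack. The `Node` type and `buildHeadingTree` function are hypothetical names for this example; they are not the library's `Node` or its parser, and token-aware thinning is not shown.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical tree node for this sketch; it is NOT the library's Node type.
type Node struct {
	Title    string
	Level    int
	Children []*Node
}

// buildHeadingTree turns ATX headings ("#", "##", ...) into a forest of nodes.
// Lines inside fenced code blocks are skipped so that "# comment" lines in code
// are not mistaken for headings. Headings with no text are ignored in this sketch.
func buildHeadingTree(md string) []*Node {
	fence := strings.Repeat("`", 3)
	root := &Node{Level: 0}
	stack := []*Node{root} // chain of currently open ancestors
	inFence := false
	for _, line := range strings.Split(md, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, fence) {
			inFence = !inFence
			continue
		}
		if inFence || !strings.HasPrefix(trimmed, "#") {
			continue
		}
		level := len(trimmed) - len(strings.TrimLeft(trimmed, "#"))
		if level > 6 || len(trimmed) == level || trimmed[level] != ' ' {
			continue // not a heading this sketch recognizes
		}
		n := &Node{Title: strings.TrimSpace(trimmed[level:]), Level: level}
		// Pop until the top of the stack is a strict ancestor of the new heading.
		for stack[len(stack)-1].Level >= level {
			stack = stack[:len(stack)-1]
		}
		parent := stack[len(stack)-1]
		parent.Children = append(parent.Children, n)
		stack = append(stack, n)
	}
	return root.Children
}

// printTree writes the outline with two spaces of indentation per depth.
func printTree(nodes []*Node, depth int) {
	for _, n := range nodes {
		fmt.Printf("%s%s\n", strings.Repeat("  ", depth), n.Title)
		printTree(n.Children, depth+1)
	}
}

func main() {
	md := "# Intro\n## Scope\n### Details\n## Terms\n# Methods\n"
	printTree(buildHeadingTree(md), 0)
}
```

A heading that skips a level (for example `###` directly under `#`) simply becomes a child of the nearest shallower heading in this sketch.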
Note
Semantic versioning applies to the symbols listed in docs/api.md and pkg/pageindex/doc.go. Prefer the package entrypoints over lower-level builders unless you need them.
| Prerequisite | Detail |
|---|---|
| Go | 1.26.1+ (toolchain in go.mod). |
| LLM | Set OPENAI_API_KEY and/or ANTHROPIC_API_KEY when using models; provider is inferred from the model id. |
| MuPDF at runtime | Default builds link MuPDF for PDF I/O. |
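The LLM row above notes that the provider is inferred from the model id. The exact rules live in the repository and are not reproduced here; the sketch below only illustrates the general shape, assuming a simple prefix match (`gpt-` for OpenAI, `claude-` for Anthropic). `providerFor` and `checkKey` are hypothetical helpers, not part of the package.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// providerFor guesses a provider and its API-key variable from a model id.
// Assumption: prefix matching is illustrative; the library's real rules may differ.
func providerFor(model string) (provider, envKey string, err error) {
	m := strings.ToLower(model)
	switch {
	case strings.HasPrefix(m, "gpt-"):
		return "openai", "OPENAI_API_KEY", nil
	case strings.HasPrefix(m, "claude-"):
		return "anthropic", "ANTHROPIC_API_KEY", nil
	}
	return "", "", fmt.Errorf("cannot infer provider from model id %q", model)
}

// checkKey returns an error when the key for the model's provider is unset.
func checkKey(model string) error {
	_, key, err := providerFor(model)
	if err != nil {
		return err
	}
	if os.Getenv(key) == "" {
		return fmt.Errorf("%s is not set (needed for model %q)", key, model)
	}
	return nil
}

func main() {
	for _, m := range []string{"gpt-4o-2024-11-20", "claude-3-5-sonnet-latest", "mystery"} {
		if err := checkKey(m); err != nil {
			fmt.Println("not ready:", err)
			continue
		}
		fmt.Println("ready:", m)
	}
}
```

Running a check like this before a build can surface a missing key early instead of partway through a long document.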
Important
Builds with -tags=nocgo (e.g. some shared-library / c-shared workflows) expect libmupdf.so on the loader path (LD_LIBRARY_PATH or system dirs). Without it, PDF paths fail at runtime even if the binary links.
As a dependency
go get github.com/neurondb/pageindex@latest
import "github.com/neurondb/pageindex/pkg/pageindex"
CLI (install from module)
go install github.com/neurondb/pageindex/cmd/pageindex@latest
From a git clone
git clone https://github.com/neurondb/PageIndex.git
cd PageIndex
make build # ./bin/pageindex
Version string
The library exposes pageindex.Version. The binary prints it with -version.
./bin/pageindex -version
Release artifacts (when published): Releases.
CLI: pass one of --pdf_path or --md_path; optional flags mirror common With* options (see cmd/pageindex/main.go).
export OPENAI_API_KEY="…"
./bin/pageindex \
--pdf_path ./report.pdf \
--model gpt-4o-2024-11-20 \
--toc-check-pages 20 \
--if-add-node-summary yes \
--output-dir ./results
Writes results/<stem>_structure.json.
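If a script needs to locate the file the CLI writes, the `<stem>_structure.json` naming above can be reproduced with the standard library. This is a sketch under the assumption that `<stem>` is the input's base name without its final extension; `structurePath` is a hypothetical helper, not something the package exports.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// structurePath returns <outputDir>/<stem>_structure.json for an input file.
// Assumption: <stem> is the base name with only the last extension removed.
func structurePath(input, outputDir string) string {
	base := filepath.Base(input)
	stem := strings.TrimSuffix(base, filepath.Ext(base))
	return filepath.Join(outputDir, stem+"_structure.json")
}

func main() {
	fmt.Println(structurePath("./report.pdf", "./results"))
	fmt.Println(structurePath("docs/notes.v2.md", "out"))
}
```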
Library
doc, err := pageindex.BuildFromPDF("report.pdf",
pageindex.WithModel("gpt-4o-2024-11-20"),
pageindex.WithAddNodeSummary(true),
)
if err != nil {
// handle: I/O, PDF, TOC, LLM, config
}
_ = doc.Structure // root nodes
Further patterns: docs/examples.md, examples/.
| Layer | Mechanism |
|---|---|
| Library | Functional options (pageindex.With…) override defaults loaded from env (PAGEINDEX_*) and optional YAML; see docs/configuration.md. |
| CLI | Flags apply per invocation; same semantics as the high-level options. |
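The table above describes a precedence order: built-in defaults, then environment, then explicit options. The following self-contained sketch shows how functional options can implement that layering. Everything in it is illustrative: `Config`, `resolve`, the default model value, and the variable name `PAGEINDEX_MODEL` are assumptions (the source only states the `PAGEINDEX_*` prefix), and the two `With…` functions are local stand-ins that merely share names with the library's options. YAML loading is omitted.

```go
package main

import (
	"fmt"
	"os"
)

// Config is a hypothetical stand-in for the library's internal settings.
type Config struct {
	Model          string
	AddNodeSummary bool
}

// Option mutates a Config; this mirrors the usual functional-options pattern.
type Option func(*Config)

// WithModel and WithAddNodeSummary are local re-implementations for this sketch.
func WithModel(m string) Option         { return func(c *Config) { c.Model = m } }
func WithAddNodeSummary(on bool) Option { return func(c *Config) { c.AddNodeSummary = on } }

// resolve applies precedence: built-in default < environment < explicit option.
// getenv is injected so the lookup can be faked in tests.
func resolve(getenv func(string) string, opts ...Option) Config {
	cfg := Config{Model: "gpt-4o-2024-11-20"} // assumed default, for illustration
	if v := getenv("PAGEINDEX_MODEL"); v != "" { // hypothetical variable name
		cfg.Model = v
	}
	for _, apply := range opts {
		apply(&cfg)
	}
	return cfg
}

func main() {
	fmt.Printf("%+v\n", resolve(os.Getenv))
	fmt.Printf("%+v\n", resolve(os.Getenv,
		WithModel("claude-3-5-sonnet-latest"),
		WithAddNodeSummary(true),
	))
}
```

Applying options last is what lets a single call site override whatever the environment provides without touching process-wide state.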
| Path | Role |
|---|---|
| pkg/pageindex | Public API and types |
| internal/pdf, internal/markdown, internal/toc | Parsers and TOC logic |
| internal/llm | Provider clients |
| internal/config | Defaults and config load |
| cmd/pageindex | CLI entrypoint |
| docs/ | Long-form guides |
flowchart LR
subgraph in["Input"]
P[PDF file]
M[Markdown file]
end
subgraph api["pkg/pageindex"]
E[BuildFromPDF / BuildFromMarkdown]
end
subgraph core["Processing"]
CFG[config]
PDF[pdf + toc]
MD[markdown]
TOK[tokenizer]
LLM[llm]
end
subgraph out["Output"]
D[Document tree]
J[JSON / files]
end
P --> E
M --> E
E --> CFG
E --> PDF
E --> MD
E --> TOK
E --> LLM
E --> D
D --> J
Narrative breakdown: docs/architecture.md.
| Document | Purpose |
|---|---|
| Getting started | Environment, first build, first run |
| Configuration | Env vars, YAML, option reference |
| API reference | Exported symbols and stability |
| Architecture | Package boundaries and pipelines |
| Examples | Recipes |
| Contributing | PR workflow |
Meta: CHANGELOG · Support · Security · Code of conduct
make build # ./bin/pageindex
make test # go test ./...
make test-coverage # coverage.out + HTML report
make fmt && make vet && make test
Optional: make install-git-hooks registers scripts/git-hooks/commit-msg (strips accidental Made-with: trailer lines).
Read CONTRIBUTING.md and docs/contributing.md. Report security-sensitive issues through the process in SECURITY.md, not in public issues.
Distributed under the MIT License.
Copyright © 2024–2026 PageIndex contributors.
[^1]: Example lineage: VectifyAI/PageIndex. Not affiliated; implementation and versioning here are independent.