feat(doc/page): add pdf_doc_summary + pdf_pages_summary + summary() by billdenney · Pull Request #36 · humanpred/rpdfium

billdenney · 2026-05-21T14:34:41Z

First post-v0.1.0 feature PR. Three related additions for the
"give me everything at once" workflow.

Summary

* pdf_doc_summary(doc): single-row tibble aggregating the
  most-asked-for facts about a PDF: path, page count, Info dict
  (with parsed POSIXct dates), structural feature flags (tagged,
  encrypted, xref-valid), counts for every per-feature list
  (bookmarks, attachments, signatures, form fields, JavaScript,
  named destinations), and the file-ID tuple.
* pdf_pages_summary(doc): per-page sibling. One row per
  page with page_num, width, height (PDF user-space points,
  pre-rotation), rotation (0/90/180/270), and label. Uses the
  fast by-index PDFium readers; no per-page pdf_page_load().
* summary.pdfium_doc(): S3 method dispatching to
  pdf_doc_summary. Matches the R idiom of print() for a quick
  "what is this" string and summary() for the deep dive.
* vignettes/comparison.Rmd: the migration table points
  pdftools::pdf_info / pdftools::pdf_pagesize users at the new
  helpers.

Test plan

* 24 new tests cover column shape, types, both input forms,
  error cases, multi-page docs, and the S3 dispatch.
* Full suite 2113/2113 pass.
* R coverage 100% (2814/2814 lines).
* 0 lints; pkgdown reference check passes.

Notes

* file_id_hex_or_na is a small internal helper hoisted out of
  pdf_doc_summary so its two branches can be unit-tested without
  needing a fixture with an /ID trailer entry (none of the
  shipped fixtures has one).
* summary.pdfium_doc lives in R/doc.R (next to its dependency
  pdf_doc_summary) rather than R/classes.R, so lintr's
  per-file object-usage analysis can see it.
* DESCRIPTION version was initially bumped to 0.1.0.9000 to mark the
  post-CRAN-submission development cycle; a later commit in this PR
  switches it back to 0.0.9000 because v0.1.0 has not yet shipped
  to CRAN.

🤖 Generated with Claude Code

Returns a single-row tibble that aggregates the most-asked-for facts about a PDF document: file path, page count, Info-dictionary metadata, structural feature flags (forms, attachments, bookmarks, signatures, JavaScript, tagged-PDF), counts for each feature group, encryption state, xref validity, and the file-ID tuple. Designed to replace the eight-or-so individual reader calls users typically chain together when triaging an unfamiliar PDF.

27 columns aggregated from existing readers:

* `pdf_doc_info()`: page count, file version, Info-dict text + dates (both raw PDF strings and POSIXct parses)
* `pdf_doc_is_tagged()`, `pdf_doc_security()`, `pdf_doc_xref_valid()`: structural / encryption flags
* `pdf_doc_bookmarks()`, `pdf_attachments()`, `pdf_signatures()`, `pdf_form_fields()`, `pdf_doc_javascript()`, `pdf_doc_named_dests()`: `length()` over each list
* `pdf_page_labels()`: boolean "has labelled pages?"
* `pdf_doc_file_id()`: hex-encoded as character (NA when absent)

Accepts both a `pdfium_doc` and a character path, mirroring the two-input-form convention `pdf_doc_info()` already uses. The path form opens + closes internally.

The `file_id` columns required a small helper (`file_id_hex_or_na`, internal) because `pdf_doc_file_id()` returns a `raw(0)` for the common case of PDFs without an `/ID` trailer entry; letting that go into a tibble column recycles the whole tibble to zero rows. The helper is hoisted to module scope so both branches can be unit-tested without a fixture that has `/ID` set (none of the shipped fixtures do).

12 new tests in `test-doc-summary.R` cover column shape, types, counts, path / raw-bytes / doc input forms, error cases, and both branches of the file-ID helper. Full suite 2081/2081 pass; R coverage 100% (2782/2782 lines); 0 lints.

DESCRIPTION version bumped to 0.1.0.9000 to mark the start of the post-CRAN-submission development cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
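The zero-length guard described in this commit could look roughly like the sketch below. This is not the package's `file_id_hex_or_na`; the name `hex_or_na_sketch` is hypothetical, and it assumes the input is a single raw vector, which may not match the real return shape of `pdf_doc_file_id()`.

```r
# Hypothetical sketch: turn a raw vector into one hex string, or NA when
# the vector is empty, so the result is always length 1 and safe to place
# in a single-row tibble column.
hex_or_na_sketch <- function(id) {
  stopifnot(is.raw(id))
  if (length(id) == 0L) {
    return(NA_character_)
  }
  # as.character() on a raw vector yields two-digit lowercase hex per byte.
  paste(as.character(id), collapse = "")
}

absent  <- hex_or_na_sketch(raw(0))
present <- hex_or_na_sketch(as.raw(c(0xde, 0xad, 0xbe, 0xef)))
```

Both branches return a length-one character vector, which is the property that keeps a zero-length value out of the summary row.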

Returns a tibble with one row per page covering the cheap by-index metadata: width, height (PDF user-space points, pre-rotation), rotation in degrees (0/90/180/270), and the page label (or NA when absent). All four columns use the fast by-index PDFium readers (FPDF_GetPageSizeByIndexF + FPDF_GetPageRotation + FPDF_GetPageLabel), so the function does not load any page objects and scales linearly on long documents.

Designed as the per-page sibling of pdf_doc_summary(): the same "give me everything cheap in one call" shape, parallel to pdftools::pdf_pagesize() but with rotation + label columns added.

Accepts both a pdfium_doc and a character path; the path form opens + closes internally. Surfaces empty-string labels as NA for a cleaner "no label here" signal (PDFium can return "" for pages omitted from a partial /PageLabels array).

11 new tests in test-pages-summary.R cover column shape + types, the page_num sequence, dimensions sanity, agreement with pdf_page_size() + pdf_page_rotation() on a per-page basis, both input forms (path + doc), the multi-page case, password forwarding, closed-doc rejection, bad-input rejection, the empty-pages-summary helper, and the label-empty-to-NA contract. Defensive guards (missing-/PageLabels and zero-page-doc) marked # nocov; both are unreachable from the shipped fixture set.

Full suite 2111/2111 pass; R coverage 100% (2813/2813); 0 lints; pkgdown reference check passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
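The label-empty-to-NA contract mentioned in this commit can be illustrated with a small base-R sketch. The function name `labels_to_na_sketch` is hypothetical and the package's own implementation may differ.

```r
# Hypothetical sketch: map "" page labels to NA so "no label here" is
# distinguishable from a real label in the per-page table.
labels_to_na_sketch <- function(labels) {
  labels <- as.character(labels)
  # nzchar("") is FALSE; existing NA values are left untouched.
  labels[!is.na(labels) & !nzchar(labels)] <- NA_character_
  labels
}

normalised <- labels_to_na_sketch(c("i", "", "3"))
```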

Updates the "Switching from pdftools" table in vignettes/comparison.Rmd to point users at the post-v0.1.0 summary helpers:

* `pdftools::pdf_info(path)` -> mention pdf_doc_summary as the richer alternative for one-call triage.
* `pdftools::pdf_pagesize(path)` -> point at pdf_pages_summary rather than pdf_page_size, since pdf_pagesize is vectorised over pages and pdf_page_size is per-page. pdf_pages_summary matches the vectorised shape and adds rotation + label columns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Calling summary() on a pdfium_doc now dispatches to pdf_doc_summary(), matching the standard R idiom of print() for a quick "what is this" one-line string and summary() for the deep-dive tibble.

The method lives in R/doc.R rather than R/classes.R so lintr's per-file object-usage check can see the pdf_doc_summary call in the same file.

Two new tests confirm dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
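The dispatch pattern in this commit is ordinary S3. The sketch below shows the mechanism with stand-ins only: the class `pdfium_doc_stub`, the function `pdf_doc_summary_stub`, and the list fields are all hypothetical, and a plain data frame stands in for the tibble.

```r
# Hypothetical stand-in for the real summary function.
pdf_doc_summary_stub <- function(doc) {
  data.frame(path = doc$path, page_count = doc$n_pages,
             stringsAsFactors = FALSE)
}

# S3 method: summary() on the stub class forwards to the helper.
summary.pdfium_doc_stub <- function(object, ...) {
  pdf_doc_summary_stub(object)
}

doc <- structure(list(path = "example.pdf", n_pages = 3L),
                 class = "pdfium_doc_stub")
res <- summary(doc)
```

Keeping the method a one-line forwarder means `summary(doc)` and a direct call to the helper cannot drift apart.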

v0.1.0 hasn't shipped to CRAN yet; the version string was aspirational. Switch back to the conventional pre-release development version (0.0.9000) until devtools::release() actually runs.

NEWS.md: collapse the "(development version)" block I had bolted on top into the existing planned-0.1.0 section. The new pdf_doc_summary / pdf_pages_summary / summary.pdfium_doc entries join the v0.1.0 surface they were always going to ship with.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Calling summary() on a pdfium_page now returns a single-row tibble combining the cheap by-index columns (page_num, width, height, rotation, label; the same shape pdf_pages_summary returns per row) with the per-page counts that the loaded page makes available: annotation_count, obj_count, text_run_count, link_count.

Two-tier shape: the doc-wide pdf_pages_summary stays cheap (no page loads, no per-row counts); the page-level summary trades one already-loaded page for richer information. Users picking which one to call don't have to think about the cost; the loaded-page overload just exists, exposed through the standard R idiom.

5 new tests in test-pages-summary.R cover shape, columns, agreement with the underlying readers, error on closed page, and the real-label path against outline.pdf (the only shipped fixture with non-empty /PageLabels entries). Full suite 2124/2124 pass; R coverage 100% (2835/2835 lines); 0 lints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Convenience wrapper around pdf_doc_open(source = ...) that fetches the bytes of a URL (http://, https://, ftp://, or file://) via base R's url() + readBin() and loads through PDFium's in-memory path. No temporary file is created; the bytes live in R memory for the document's lifetime. The returned pdfium_doc's $path field is the URL string itself, so print() and pdf_doc_summary() surface the source even though no local path exists.

Closes the most common user-facing convenience gap: today, users fetching a PDF from a URL have to chain download.file() + tempfile() + pdf_doc_open() themselves. One call is shorter, doesn't leave temp files on disk, and handles cleanup via existing pdfium_doc finalizers.

8 new tests in test-doc-open-url.R cover the file:// happy path, the URL-stored-as-path contract, password/readwrite forwarding, input-shape rejection (non-URL strings, bad types), connection errors (file:// to non-existent path, http(s) to unreachable hosts, with suppressWarnings so the unreachable-host warning doesn't pollute test output), and a pdf_doc_summary() round-trip.

Full suite 2140/2140 pass; R coverage 100% (2849/2849); 0 lints; pkgdown reference check passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
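The url() + readBin() fetch that this commit describes can be sketched in base R as below. The helper name `read_url_bytes_sketch` and the chunk size are assumptions for illustration; the sketch stops at returning the raw bytes and does not show how the package hands them to PDFium.

```r
# Hypothetical sketch: read all bytes behind a URL into a raw vector
# without writing a temporary file.
read_url_bytes_sketch <- function(u, chunk_size = 65536L) {
  con <- url(u, open = "rb")
  on.exit(close(con), add = TRUE)
  chunks <- list()
  repeat {
    # readBin() returns fewer than n bytes near the end, then raw(0).
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    chunks[[length(chunks) + 1L]] <- chunk
  }
  if (length(chunks) == 0L) raw(0) else do.call(c, chunks)
}

# Usage against a local file:// URL so the example needs no network.
tmp <- tempfile(fileext = ".bin")
payload <- as.raw(c(0x25, 0x50, 0x44, 0x46, 0x2d))  # the bytes of "%PDF-"
writeBin(payload, tmp)
u <- paste0("file://", normalizePath(tmp, winslash = "/"))
bytes <- read_url_bytes_sketch(u)
```

Reading in chunks avoids having to know the content length up front, which a URL connection does not reliably expose.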

CI flagged R CMD check WARNING on every platform:

    pdf_doc_open_url.Rd: print.pdfium_doc
    Please provide package anchors for all Rd \link{} targets not in the package itself and the base packages.

The pdf_doc_open_url docstring had `[print()][print.pdfium_doc]`, which is markdown for "render `print()` as link to topic `print.pdfium_doc`". But `print.pdfium_doc` is an internal S3 method without its own Rd page, so the link can't resolve.

Two changes:

1. Replace the bracketed cross-reference with plain `print()` inline code so the function name still renders as code but doesn't generate a broken link. Mirrors the same fix pattern used on PR #32's `[is_open()]` issue.
2. New pre-commit hook `rd-xref-check` (entry: tools/check-rd-xrefs.R) that runs the same internal R function `R CMD check` uses for its cross-reference step (tools:::.check_Rd_xrefs). Catches this class of WARNING on the developer machine before push.

The script needs nothing more than the source tree (no install, no compile, no C++ build), so it's cheap enough to run on every commit. It is listed under `repo: local` next to the existing `pkgdown-reference-check` hook, with the same diagnostic-then-exit-1 pattern. Trigger files: any change to R/*.R, man/*.Rd, or DESCRIPTION.

Manual verification: injecting the original broken link reproduces the CI failure shape:

    [check-rd-xrefs] pdf_doc_open_url.Rd: unresolved \link{} target 'print.pdfium_doc'

Restoring the fix exits 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@export

pkgdown's `build_reference_index()` errored on PR #36:

    In _pkgdown.yml, 2 topics missing from index: "summary.pdfium_doc" and "summary.pdfium_page"

Both S3 methods have their own @export and so their own Rd page (summary.pdfium_doc.Rd, summary.pdfium_page.Rd). pkgdown enforces that every non-internal Rd topic appears in the reference index.

Two changes:

1. Add both to _pkgdown.yml next to their `pdf_*_summary()` companions.
2. Rewrite tools/check-pkgdown-reference.R to enumerate man/*.Rd files directly rather than reconstruct the topic set from NAMESPACE's `export()` + `S3method()` entries. The old design only flagged missing topics for NAMESPACE `export()` entries, so `S3method()`-only entries (the path that produces summary.*.Rd) slipped through.

The new design:

* filters out Rd files marked `\keyword{internal}` (matches the only existing internal Rd, pdfium-package.Rd)
* computes the topic set as { Rd basename } ∪ { every \alias{} entry inside }, so @rdname-collapsed methods (e.g. each pdfium_*_code paired with its _name in one Rd) still count as valid YAML entries
* flags topics missing from YAML and YAML entries missing from man/, same diagnostic shape as before

Manual verification: removing the summary.* entries from _pkgdown.yml reproduces the CI failure shape:

    [check-pkgdown-reference] Documented but not in _pkgdown.yml reference index: summary.pdfium_doc, summary.pdfium_page

Restoring exits 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
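The topic-set computation described in this commit can be sketched with base R alone. This is not the contents of tools/check-pkgdown-reference.R; the function names `rd_topics_sketch` and `check_reference_index_sketch` are hypothetical, the Rd files are scanned line by line with regular expressions rather than parsed, and the YAML entries are assumed to arrive as a plain character vector.

```r
# Hypothetical sketch: topic set = { Rd basename } union { \alias{} entries },
# skipping any Rd file marked \keyword{internal}.
rd_topics_sketch <- function(man_dir) {
  rd_files <- list.files(man_dir, pattern = "\\.Rd$", full.names = TRUE)
  topics <- character(0)
  for (f in rd_files) {
    lines <- readLines(f, warn = FALSE)
    if (any(grepl("\\keyword{internal}", lines, fixed = TRUE))) next
    alias_lines <- grep("^\\\\alias\\{", lines, value = TRUE)
    aliases <- sub("^\\\\alias\\{(.*)\\}\\s*$", "\\1", alias_lines)
    topics <- c(topics, sub("\\.Rd$", "", basename(f)), aliases)
  }
  sort(unique(topics))
}

# Compare documented topics against the reference-index entries, in both
# directions, mirroring the two diagnostics the commit message lists.
check_reference_index_sketch <- function(man_dir, yaml_topics) {
  documented <- rd_topics_sketch(man_dir)
  list(
    missing_from_yaml = setdiff(documented, yaml_topics),
    missing_from_man  = setdiff(yaml_topics, documented)
  )
}
```

Enumerating man/ directly means any Rd page counts, however it came to exist, which is what closes the `S3method()`-only gap the commit describes.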

Scans a directory for PDF files and returns a tibble with one row per file in the pdf_doc_summary() column shape. The natural replacement for the standard "loop over a folder of PDFs and find the ones with forms / attachments / encryption" triage workflow.

* Recursive descent via `recursive = TRUE`.
* Case-insensitive `pattern = "\\.pdf$"` by default, which picks up both `.pdf` and `.PDF`.
* Optional shared `password` applied to every file.
* `errors` argument selects how broken / non-PDF files are handled:
  * "warn" (default): surface a warning per failure and skip
  * "skip": silently skip
  * "stop": abort on the first failure

Internal pdf_doc_summary_empty() helper hoisted to module scope so its zero-row template can be tested without exercising the no-files-in-directory branch through the full file scan.

14 new tests in test-dir-summary.R cover row-per-PDF count, column shape parity with pdf_doc_summary, path preservation, recursive descent, the empty-directory case, the empty-tibble helper itself, case-insensitive .PDF matching, all three errors modes, the zero-rows-when-everything-fails case, password forwarding, input-shape rejection, and custom patterns.

Full suite 2166/2166 pass; R coverage 100% (2907/2907 lines); 0 lints; pkgdown reference check passes (the hardened hook from the previous commit caught my omission of pdf_dir_summary from _pkgdown.yml during this commit's development, which verifies that the hook does what it should).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
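The scan-and-collect loop with the three `errors` modes listed in this commit can be sketched as below. It is an illustration under assumptions, not the package's pdf_dir_summary(): the name `dir_summary_sketch` is hypothetical, the per-file summariser is passed in as `summarise_one` (a hypothetical argument), and plain data frames stand in for tibbles.

```r
# Hypothetical sketch: one row per matching file, with selectable handling
# of files whose summariser call fails.
dir_summary_sketch <- function(dir, summarise_one, pattern = "\\.pdf$",
                               recursive = FALSE,
                               errors = c("warn", "skip", "stop")) {
  errors <- match.arg(errors)
  files <- list.files(dir, pattern = pattern, recursive = recursive,
                      full.names = TRUE, ignore.case = TRUE)
  rows <- list()
  for (f in files) {
    row <- tryCatch(summarise_one(f), error = function(e) {
      if (errors == "stop") stop(e)  # re-raise on the first failure
      if (errors == "warn") {
        warning(sprintf("skipping %s: %s", f, conditionMessage(e)),
                call. = FALSE)
      }
      NULL  # "warn" and "skip" both drop the file
    })
    if (!is.null(row)) rows[[length(rows) + 1L]] <- row
  }
  # Zero-row result when nothing matched or everything failed.
  if (length(rows) == 0L) {
    return(data.frame(path = character(0), stringsAsFactors = FALSE))
  }
  do.call(rbind, rows)
}
```

Passing `ignore.case = TRUE` to list.files() is what lets a single lowercase pattern match both `.pdf` and `.PDF` in this sketch.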

billdenney and others added 4 commits May 21, 2026 14:20

billdenney changed the title ~~feat(doc/page): add pdf_doc_summary() + pdf_pages_summary() helpers~~ feat(doc/page): add pdf_doc_summary + pdf_pages_summary + summary() May 21, 2026

billdenney and others added 6 commits May 21, 2026 14:37

billdenney merged commit 0b1b5df into main May 21, 2026
13 checks passed

billdenney deleted the claude/pdf-doc-summary branch May 21, 2026 18:33

billdenney mentioned this pull request May 21, 2026

feat(authoring): ship image embedding + custom-font loading (closes v0.1.0 wrapping gaps) #38

Merged

6 tasks
