feat(doc/page): add pdf_doc_summary + pdf_pages_summary + summary() #36

Merged
billdenney merged 10 commits into main from claude/pdf-doc-summary
May 21, 2026

@billdenney billdenney commented May 21, 2026

First post-v0.1.0 feature PR. Three related additions for the
"give me everything at once" workflow.

Summary

  • pdf_doc_summary(doc) — single-row tibble aggregating the
    most-asked-for facts about a PDF: path, page count, Info dict
    (with parsed POSIXct dates), structural feature flags (tagged,
    encrypted, xref-valid), counts for every per-feature list
    (bookmarks, attachments, signatures, form fields, JavaScript,
    named destinations), and the file-ID tuple.
  • pdf_pages_summary(doc) — per-page sibling. One row per
    page with page_num, width, height (PDF user-space points,
    pre-rotation), rotation (0/90/180/270), and label. Uses the
    fast by-index PDFium readers; no per-page pdf_page_load().
  • summary.pdfium_doc() — S3 method dispatching to
    pdf_doc_summary. Matches the R idiom of print() for a quick
    "what is this" string and summary() for the deep dive.
  • vignettes/comparison.Rmd migration table points
    pdftools::pdf_info / pdftools::pdf_pagesize users at the new
    helpers.

Test plan

  • 24 new tests cover column shape, types, both input forms,
    error cases, multi-page docs, and the S3 dispatch.
  • Full suite 2113/2113 pass.
  • R coverage 100% (2814/2814 lines).
  • 0 lints; pkgdown reference check passes.

Notes

  • file_id_hex_or_na is a small internal helper hoisted out of
    pdf_doc_summary so its two branches can be unit-tested without
    needing a fixture with an /ID trailer entry (none of the
    shipped fixtures has one).
  • summary.pdfium_doc lives in R/doc.R (next to its dependency
    pdf_doc_summary) rather than R/classes.R, so lintr's
    per-file object-usage analysis can see it.
  • DESCRIPTION version bumped to 0.1.0.9000 to mark the
    post-CRAN-submission development cycle.

🤖 Generated with Claude Code

billdenney and others added 4 commits May 21, 2026 14:20
Returns a single-row tibble that aggregates the most-asked-for
facts about a PDF document: file path, page count, Info-dictionary
metadata, structural feature flags (forms, attachments, bookmarks,
signatures, JavaScript, tagged-PDF), counts for each feature group,
encryption state, xref validity, and the file-ID tuple. Designed
to replace the eight-or-so individual reader calls users typically
chain together when triaging an unfamiliar PDF.

27 columns aggregated from existing readers:
* `pdf_doc_info()` — page count, file version, Info-dict text +
  dates (both raw PDF strings and POSIXct parses)
* `pdf_doc_is_tagged()`, `pdf_doc_security()`,
  `pdf_doc_xref_valid()` — structural / encryption flags
* `pdf_doc_bookmarks()`, `pdf_attachments()`, `pdf_signatures()`,
  `pdf_form_fields()`, `pdf_doc_javascript()`,
  `pdf_doc_named_dests()` — `length()` over each list
* `pdf_page_labels()` — boolean "has labelled pages?"
* `pdf_doc_file_id()` — hex-encoded as character (NA when absent)

Accepts both a `pdfium_doc` and a character path, mirroring the
two-input-form convention `pdf_doc_info()` already uses. The path
form opens + closes internally.

The `file_id` columns required a small helper (`file_id_hex_or_na`,
internal) because `pdf_doc_file_id()` returns a `raw(0)` for the
common case of PDFs without an `/ID` trailer entry — letting that
go into a tibble column recycles the whole tibble to zero rows.
The helper is hoisted to module scope so both branches can be unit-
tested without a fixture that has `/ID` set (none of the shipped
fixtures do).
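
A minimal sketch of what such a helper might look like, in base R only. The name `file_id_hex_or_na_sketch` and the body are assumptions made for illustration; the actual internal helper is not shown in this PR text. The point is that both branches always return a length-one character value, which is safe to place in a single-row table.

```r
# Hypothetical stand-in for the internal helper described above.
# Input: a raw vector (possibly raw(0) when the PDF has no /ID entry).
# Output: always a length-1 character vector.
file_id_hex_or_na_sketch <- function(id) {
  if (length(id) == 0L) {
    # Absent ID: return a typed NA so the column keeps exactly one row.
    return(NA_character_)
  }
  # Present ID: two lowercase hex digits per byte, joined into one string.
  paste(sprintf("%02x", as.integer(id)), collapse = "")
}

absent  <- file_id_hex_or_na_sketch(raw(0))
present <- file_id_hex_or_na_sketch(as.raw(c(0x0a, 0xff, 0x00)))
```

Because each branch takes only a raw vector, both can be exercised directly in a unit test without any PDF fixture.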

12 new tests in `test-doc-summary.R` cover column shape, types,
counts, path / raw-bytes / doc input forms, error cases, and both
branches of the file-ID helper. Full suite 2081/2081 pass; R
coverage 100% (2782/2782 lines); 0 lints.

DESCRIPTION version bumped to 0.1.0.9000 to mark the start of the
post-CRAN-submission development cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Returns a tibble with one row per page covering the cheap by-index
metadata: width, height (PDF user-space points, pre-rotation),
rotation in degrees (0/90/180/270), and the page label (or NA when
absent). All four columns use the fast by-index PDFium readers
(FPDF_GetPageSizeByIndexF + FPDF_GetPageRotation +
FPDF_GetPageLabel), so the function does not load any page objects
and scales linearly on long documents.

Designed as the per-page sibling of pdf_doc_summary() — the same
"give me everything cheap in one call" shape, parallel to
pdftools::pdf_pagesize() but with rotation + label columns added.

Accepts both a pdfium_doc and a character path; the path form opens
+ closes internally. Surfaces empty-string labels as NA for a
cleaner "no label here" signal (PDFium can return "" for pages
omitted from a partial /PageLabels array).
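
The empty-label-to-NA contract can be sketched in base R as follows. The function name `labels_or_na_sketch` is hypothetical, and the sketch assumes labels arrive as a plain character vector; it is not the package's implementation.

```r
# Hypothetical sketch: map "" to NA_character_, leave everything else alone.
labels_or_na_sketch <- function(labels) {
  stopifnot(is.character(labels))
  # nzchar() is FALSE only for empty strings; existing NA values stay NA.
  labels[!nzchar(labels)] <- NA_character_
  labels
}

normalised <- labels_or_na_sketch(c("i", "", "1", NA))
```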

11 new tests in test-pages-summary.R cover column shape + types,
the page_num sequence, dimensions sanity, agreement with
pdf_page_size() + pdf_page_rotation() on a per-page basis, both
input forms (path + doc), the multi-page case, password forwarding,
closed-doc rejection, bad-input rejection, the empty-pages-summary
helper, and the label-empty-to-NA contract.

Defensive guards (missing-/PageLabels and zero-page-doc) are marked
# nocov; both are unreachable from the shipped fixture set.

Full suite 2111/2111 pass; R coverage 100% (2813/2813); 0 lints;
pkgdown reference check passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates the "Switching from pdftools" table in
vignettes/comparison.Rmd to point users at the post-v0.1.0 summary
helpers:

* `pdftools::pdf_info(path)` -> mention pdf_doc_summary as the
  richer alternative for one-call triage.
* `pdftools::pdf_pagesize(path)` -> point at pdf_pages_summary
  rather than pdf_page_size, since pdf_pagesize is vectorised over
  pages and pdf_page_size is per-page. pdf_pages_summary matches
  the vectorised shape and adds rotation + label columns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Calling summary() on a pdfium_doc now dispatches to
pdf_doc_summary(), matching the standard R idiom of print() for
a quick "what is this" one-line string and summary() for the
deep-dive tibble.

The method lives in R/doc.R rather than R/classes.R so lintr's
per-file object-usage check can see the pdf_doc_summary call in
the same file. Two new tests confirm dispatch.
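
The print()/summary() split can be illustrated with a self-contained mock class. Everything here (`mock_doc`, `new_mock_doc`, `mock_doc_summary`) is a hypothetical stand-in so the sketch runs without the package; only the dispatch pattern mirrors what the commit describes.

```r
# Hypothetical constructor for a stand-in document object.
new_mock_doc <- function(path, page_count) {
  structure(list(path = path, page_count = page_count), class = "mock_doc")
}

# Stand-in for the "deep dive" helper: one row of facts about the object.
mock_doc_summary <- function(doc) {
  data.frame(path = doc$path, page_count = doc$page_count,
             stringsAsFactors = FALSE)
}

# print(): a quick one-line "what is this" string.
print.mock_doc <- function(x, ...) {
  cat("<mock_doc> ", x$path, " (", x$page_count, " pages)\n", sep = "")
  invisible(x)
}

# summary(): the S3 method simply forwards to the summary helper.
summary.mock_doc <- function(object, ...) {
  mock_doc_summary(object)
}

doc <- new_mock_doc("example.pdf", 3L)
doc_summary <- summary(doc)
```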

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@billdenney billdenney changed the title from "feat(doc/page): add pdf_doc_summary() + pdf_pages_summary() helpers" to "feat(doc/page): add pdf_doc_summary + pdf_pages_summary + summary()" May 21, 2026
billdenney and others added 6 commits May 21, 2026 14:37
v0.1.0 hasn't shipped to CRAN yet — the version string was
aspirational. Switch back to the conventional pre-release
development version (0.0.9000) until devtools::release() actually
runs.

NEWS.md: collapse the "(development version)" block I had bolted
on top into the existing planned-0.1.0 section. The new
pdf_doc_summary / pdf_pages_summary / summary.pdfium_doc entries
join the v0.1.0 surface they were always going to ship with.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Calling summary() on a pdfium_page now returns a single-row tibble
combining the cheap by-index columns (page_num, width, height,
rotation, label — same shape pdf_pages_summary returns per row)
with the per-page counts that the loaded page makes available:
annotation_count, obj_count, text_run_count, link_count.

Two-tier shape: the doc-wide pdf_pages_summary stays cheap (no
page loads, no per-row counts), while the page-level summary uses a
page the caller has already loaded to report richer information.
Users do not have to weigh the cost when choosing between them: the
loaded-page method is simply available through the standard R
summary() idiom.

5 new tests in test-pages-summary.R cover shape, columns,
agreement with the underlying readers, error on closed page, and
the real-label path against outline.pdf (the only shipped fixture
with non-empty /PageLabels entries).

Full suite 2124/2124 pass; R coverage 100% (2835/2835 lines);
0 lints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convenience wrapper around pdf_doc_open(source = ...) that fetches
the bytes of a URL (http://, https://, ftp://, or file://) via
base R's url() + readBin() and loads through PDFium's in-memory
path. No temporary file is created — the bytes live in R memory
for the document's lifetime.

The returned pdfium_doc's $path field is the URL string itself,
so print() and pdf_doc_summary() surface the source even though
no local path exists.

Closes the most common user-facing convenience gap: today, users
fetching a PDF from a URL have to chain download.file() +
tempfile() + pdf_doc_open() themselves. One call is shorter,
doesn't leave temp files on disk, and handles cleanup via
existing pdfium_doc finalizers.
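
The url() + readBin() fetch step could look roughly like the sketch below. The function name `read_url_bytes`, the chunk size, and the scheme check are assumptions for illustration; how the package hands the bytes to PDFium is not shown here. The demonstration uses a local file:// URL so that it needs no network access.

```r
# Hypothetical sketch: read all bytes behind a URL into an R raw vector,
# without writing a temporary file.
read_url_bytes <- function(u, chunk_size = 65536L) {
  stopifnot(is.character(u), length(u) == 1L,
            grepl("^(https?|ftp|file)://", u))
  con <- url(u, open = "rb")
  on.exit(close(con), add = TRUE)
  chunks <- list()
  repeat {
    # The total size is unknown up front, so read until nothing comes back.
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    chunks[[length(chunks) + 1L]] <- chunk
  }
  if (length(chunks) == 0L) raw(0) else do.call(c, chunks)
}

# Local demonstration. On Unix-alikes an absolute path starts with "/",
# so the pasted result has the file:///... form.
tmp <- tempfile(fileext = ".pdf")
writeBin(charToRaw("%PDF-1.7\n"), tmp)
bytes <- read_url_bytes(paste0("file://", normalizePath(tmp, winslash = "/")))
```

Closing the connection in on.exit() keeps the sketch tidy even when the read fails partway through.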

8 new tests in test-doc-open-url.R cover the file:// happy path,
the URL-stored-as-path contract, password/readwrite forwarding,
input-shape rejection (non-URL strings, bad types), connection
errors (file:// to non-existent path, http(s) to unreachable
hosts — suppressWarnings so the unreachable-host warning doesn't
pollute test output), and a pdf_doc_summary() round-trip.

Full suite 2140/2140 pass; R coverage 100% (2849/2849);
0 lints; pkgdown reference check passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI flagged R CMD check WARNING on every platform:

  pdf_doc_open_url.Rd: print.pdfium_doc
  Please provide package anchors for all Rd \link{} targets not in
  the package itself and the base packages.

The pdf_doc_open_url docstring had `[print()][print.pdfium_doc]` —
markdown for "render `print()` as link to topic `print.pdfium_doc`".
But `print.pdfium_doc` is an internal S3 method without its own Rd
page, so the link can't resolve.

Two changes:

1. Replace the bracketed cross-reference with plain `print()`
   inline code so the function name still renders as code but
   doesn't generate a broken link. Mirrors the same fix pattern
   used on PR #32's `[is_open()]` issue.

2. New pre-commit hook `rd-xref-check` (entry:
   tools/check-rd-xrefs.R) that runs the same internal R function
   `R CMD check` uses for its cross-reference step
   (tools:::.check_Rd_xrefs). Catches this class of WARNING on the
   developer machine before push.

   The script needs nothing more than the source tree (no install,
   no compile, no C++ build), so it's cheap enough to run on
   every commit — listed under `repo: local` next to the existing
   `pkgdown-reference-check` hook, with the same diagnostic-then-
   exit-1 pattern. Trigger files: any change to R/*.R, man/*.Rd,
   or DESCRIPTION.

   Manual verification: injecting the original broken link
   reproduces the CI failure shape:

     [check-rd-xrefs] pdf_doc_open_url.Rd: unresolved \link{}
       target 'print.pdfium_doc'

   Restoring the fix exits 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pkgdown's `build_reference_index()` errored on PR #36:

  In _pkgdown.yml, 2 topics missing from index:
    "summary.pdfium_doc" and "summary.pdfium_page"

Both S3 methods have their own @export and so their own Rd page
(summary.pdfium_doc.Rd, summary.pdfium_page.Rd). pkgdown enforces
that every non-internal Rd topic appears in the reference index.

Two changes:

1. Add both to _pkgdown.yml next to their `pdf_*_summary()`
   companions.

2. Rewrite tools/check-pkgdown-reference.R to enumerate man/*.Rd
   files directly rather than reconstruct the topic set from
   NAMESPACE's `export()` + `S3method()` entries. The old design
   only flagged missing topics for NAMESPACE `export()` entries —
   `S3method()`-only entries (the path that produces summary.*.Rd)
   slipped through. The new design:
     * filters out Rd files marked `\keyword{internal}` (matches
       the only existing internal Rd, pdfium-package.Rd)
     * computes the topic set as { Rd basename } ∪ { every \alias{}
       entry inside }, so @rdname-collapsed methods (e.g. each
       pdfium_*_code paired with its _name in one Rd) still count
       as valid YAML entries
     * flags topics missing from YAML and YAML entries missing from
       man/, same diagnostic shape as before

   Manual verification: removing the summary.* entries from
   _pkgdown.yml reproduces the CI failure shape:

     [check-pkgdown-reference] Documented but not in _pkgdown.yml
       reference index: summary.pdfium_doc, summary.pdfium_page

   Restoring exits 0.
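
The topic-set logic described in change 2 can be sketched in base R. This is an illustration under stated assumptions, not the contents of tools/check-pkgdown-reference.R: the function names are hypothetical, the YAML entries are passed in as a plain character vector rather than parsed from _pkgdown.yml, and an Rd file is treated as covered when any of its names (basename or alias) appears in the index.

```r
# Hypothetical sketch: one topic set per non-internal Rd file,
# built as { Rd basename } union { every \alias{} entry inside }.
rd_topic_sets <- function(man_dir) {
  rd_files <- list.files(man_dir, pattern = "\\.Rd$", full.names = TRUE)
  sets <- list()
  for (f in rd_files) {
    lines <- readLines(f, warn = FALSE)
    # Skip Rd files marked \keyword{internal}.
    if (any(grepl("\\keyword{internal}", lines, fixed = TRUE))) next
    alias_lines <- grep("^\\\\alias\\{", lines, value = TRUE)
    aliases <- sub("^\\\\alias\\{(.*)\\}[[:space:]]*$", "\\1", alias_lines)
    name <- sub("\\.Rd$", "", basename(f))
    sets[[name]] <- unique(c(name, aliases))
  }
  sets
}

# Compare the Rd-derived topics with the reference-index entries.
check_reference_index <- function(man_dir, yaml_topics) {
  sets <- rd_topic_sets(man_dir)
  covered <- vapply(sets, function(s) any(s %in% yaml_topics), logical(1))
  list(
    missing_from_yaml = as.character(names(sets)[!covered]),
    missing_from_man  = setdiff(yaml_topics, unlist(sets, use.names = FALSE))
  )
}
```

Enumerating man/ directly means an S3 method that has its own Rd page is picked up regardless of how it is registered in NAMESPACE.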

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scans a directory for PDF files and returns a tibble with one row
per file in the pdf_doc_summary() column shape. The natural
replacement for the standard "loop over a folder of PDFs and find
the ones with forms / attachments / encryption" triage workflow.

* Recursive descent via `recursive = TRUE`.
* Case-insensitive `pattern = "\\.pdf$"` by default — picks up
  both `.pdf` and `.PDF`.
* Optional shared `password` applied to every file.
* `errors` argument selects how broken / non-PDF files are handled:
  * "warn" (default) — surface a warning per failure and skip
  * "skip" — silently skip
  * "stop" — abort on the first failure

Internal pdf_doc_summary_empty() helper hoisted to module scope so
its zero-row template can be tested without exercising the
no-files-in-directory branch through the full file scan.
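
The scan-and-collect loop with its three `errors` modes might be structured like the base R sketch below. The names `summarise_stub` and `dir_summary_sketch` are hypothetical, and the stub only checks for a `%PDF-` header instead of building the real summary row, so this shows the control flow rather than the package's implementation.

```r
# Hypothetical per-file summariser: fails on anything without a PDF header.
summarise_stub <- function(path) {
  header <- readBin(path, what = "raw", n = 5L)
  if (!identical(header, charToRaw("%PDF-"))) {
    stop("not a PDF: ", path, call. = FALSE)
  }
  data.frame(path = path, size = file.size(path), stringsAsFactors = FALSE)
}

dir_summary_sketch <- function(dir, pattern = "\\.pdf$", recursive = FALSE,
                               errors = c("warn", "skip", "stop")) {
  errors <- match.arg(errors)
  # ignore.case = TRUE picks up both .pdf and .PDF.
  files <- list.files(dir, pattern = pattern, recursive = recursive,
                      full.names = TRUE, ignore.case = TRUE)
  rows <- list()
  for (f in files) {
    row <- tryCatch(summarise_stub(f), error = function(e) {
      if (errors == "stop") stop(e)                    # abort on first failure
      if (errors == "warn") warning(conditionMessage(e), call. = FALSE)
      NULL                                             # "warn" and "skip" drop the file
    })
    if (!is.null(row)) rows[[length(rows) + 1L]] <- row
  }
  if (length(rows) == 0L) {
    # Zero-row template with the same columns as a populated result.
    return(data.frame(path = character(0), size = numeric(0),
                      stringsAsFactors = FALSE))
  }
  do.call(rbind, rows)
}
```

Returning a zero-row template with the full column set keeps downstream code working when a directory contains no readable PDFs.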

14 new tests in test-dir-summary.R cover row-per-PDF count, column
shape parity with pdf_doc_summary, path preservation, recursive
descent, the empty-directory case, the empty-tibble helper itself,
case-insensitive .PDF matching, all three errors modes, the
zero-rows-when-everything-fails case, password forwarding,
input-shape rejection, and custom patterns.

Full suite 2166/2166 pass; R coverage 100% (2907/2907 lines);
0 lints; pkgdown reference check passes (the hardened hook from
the previous commit caught my omission of pdf_dir_summary from
_pkgdown.yml during this commit's development — verifying the
hook does what it should).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@billdenney billdenney merged commit 0b1b5df into main May 21, 2026
13 checks passed
@billdenney billdenney deleted the claude/pdf-doc-summary branch May 21, 2026 18:33