feat(doc/page): add pdf_doc_summary + pdf_pages_summary + summary() #36
Returns a single-row tibble that aggregates the most-asked-for facts about a PDF document: file path, page count, Info-dictionary metadata, structural feature flags (forms, attachments, bookmarks, signatures, JavaScript, tagged-PDF), counts for each feature group, encryption state, xref validity, and the file-ID tuple. Designed to replace the eight-or-so individual reader calls users typically chain together when triaging an unfamiliar PDF.

27 columns aggregated from existing readers:

* `pdf_doc_info()`: page count, file version, Info-dict text + dates (both raw PDF strings and POSIXct parses)
* `pdf_doc_is_tagged()`, `pdf_doc_security()`, `pdf_doc_xref_valid()`: structural / encryption flags
* `pdf_doc_bookmarks()`, `pdf_attachments()`, `pdf_signatures()`, `pdf_form_fields()`, `pdf_doc_javascript()`, `pdf_doc_named_dests()`: `length()` over each list
* `pdf_page_labels()`: boolean "has labelled pages?"
* `pdf_doc_file_id()`: hex-encoded as character (NA when absent)

Accepts both a `pdfium_doc` and a character path, mirroring the two-input-form convention `pdf_doc_info()` already uses. The path form opens + closes internally.

The `file_id` columns required a small helper (`file_id_hex_or_na`, internal) because `pdf_doc_file_id()` returns a `raw(0)` for the common case of PDFs without an `/ID` trailer entry; letting that go into a tibble column recycles the whole tibble to zero rows. The helper is hoisted to module scope so both branches can be unit-tested without a fixture that has `/ID` set (none of the shipped fixtures do).

12 new tests in `test-doc-summary.R` cover column shape, types, counts, path / raw-bytes / doc input forms, error cases, and both branches of the file-ID helper. Full suite 2081/2081 pass; R coverage 100% (2782/2782 lines); 0 lints.

DESCRIPTION version bumped to 0.1.0.9000 to mark the start of the post-CRAN-submission development cycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
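The file-ID handling described in this commit message can be illustrated with a minimal base-R sketch. The function name `file_id_hex_or_na_sketch` and its single-raw-vector input are assumptions made for illustration; the package's internal `file_id_hex_or_na` is not shown in this PR page and may differ.

```r
# Hypothetical sketch, not the package's actual helper.
# Maps a raw vector to a length-one character value so it can sit in a
# one-row table without changing the row count.
file_id_hex_or_na_sketch <- function(id) {
  stopifnot(is.raw(id))
  if (length(id) == 0L) {
    # Absent /ID: return a length-one NA rather than a zero-length value.
    return(NA_character_)
  }
  # Present /ID: two lowercase hex digits per byte, joined into one string.
  paste(sprintf("%02x", as.integer(id)), collapse = "")
}

absent_id <- file_id_hex_or_na_sketch(raw(0))
present_id <- file_id_hex_or_na_sketch(as.raw(c(0x0a, 0xff, 0x10)))
```

Both branches return a vector of length one, which is the property that keeps the surrounding row intact.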
Returns a tibble with one row per page covering the cheap by-index metadata: width, height (PDF user-space points, pre-rotation), rotation in degrees (0/90/180/270), and the page label (or NA when absent). All four columns use the fast by-index PDFium readers (FPDF_GetPageSizeByIndexF + FPDF_GetPageRotation + FPDF_GetPageLabel), so the function does not load any page objects and scales linearly on long documents.

Designed as the per-page sibling of pdf_doc_summary(): the same "give me everything cheap in one call" shape, parallel to pdftools::pdf_pagesize() but with rotation + label columns added.

Accepts both a pdfium_doc and a character path; the path form opens + closes internally. Surfaces empty-string labels as NA for a cleaner "no label here" signal (PDFium can return "" for pages omitted from a partial /PageLabels array).

11 new tests in test-pages-summary.R cover column shape + types, the page_num sequence, dimensions sanity, agreement with pdf_page_size() + pdf_page_rotation() on a per-page basis, both input forms (path + doc), the multi-page case, password forwarding, closed-doc rejection, bad-input rejection, the empty-pages-summary helper, and the label-empty-to-NA contract. Defensive guards (missing-/PageLabels and zero-page-doc) marked # nocov; both are unreachable from the shipped fixture set.

Full suite 2111/2111 pass; R coverage 100% (2813/2813); 0 lints; pkgdown reference check passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
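The label-empty-to-NA contract mentioned above could look roughly like the following base-R sketch. `empty_label_to_na` is a hypothetical name, and the input is assumed to be a plain character vector of labels, one per page.

```r
# Hypothetical sketch of the "empty label becomes NA" normalisation.
empty_label_to_na <- function(labels) {
  stopifnot(is.character(labels))
  # Replace "" with NA; existing NA entries and real labels are untouched.
  labels[!is.na(labels) & !nzchar(labels)] <- NA_character_
  labels
}

normalised <- empty_label_to_na(c("i", "", "1"))
```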
Updates the "Switching from pdftools" table in vignettes/comparison.Rmd to point users at the post-v0.1.0 summary helpers:

* `pdftools::pdf_info(path)` -> mention pdf_doc_summary as the richer alternative for one-call triage.
* `pdftools::pdf_pagesize(path)` -> point at pdf_pages_summary rather than pdf_page_size, since pdf_pagesize is vectorised over pages and pdf_page_size is per-page. pdf_pages_summary matches the vectorised shape and adds rotation + label columns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Calling summary() on a pdfium_doc now dispatches to pdf_doc_summary(), matching the standard R idiom of print() for a quick "what is this" one-line string and summary() for the deep-dive tibble.

The method lives in R/doc.R rather than R/classes.R so lintr's per-file object-usage check can see the pdf_doc_summary call in the same file.

Two new tests confirm dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
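For readers unfamiliar with S3 dispatch, the shape of such a method can be sketched in base R. Everything below is a stand-in: `pdf_doc_summary_stub` and the list-based `pdfium_doc` object are assumptions for illustration, and a base data frame is used instead of a tibble to keep the sketch dependency-free.

```r
# Stand-in for the real reader (assumption): returns a one-row data frame.
pdf_doc_summary_stub <- function(doc, ...) {
  data.frame(path = doc$path, pages = doc$pages, stringsAsFactors = FALSE)
}

# S3 method: summary() on an object of class "pdfium_doc" forwards to the
# reader. The argument is named `object` to match the summary() generic.
summary.pdfium_doc <- function(object, ...) {
  pdf_doc_summary_stub(object, ...)
}

# A fake document object carrying only the class attribute and two fields.
doc <- structure(list(path = "example.pdf", pages = 3L), class = "pdfium_doc")
res <- summary(doc)
```

In a package, the method would additionally need to be registered (for example through a roxygen2 `@export` tag) so that dispatch works outside the defining environment.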
v0.1.0 hasn't shipped to CRAN yet; the version string was aspirational. Switch back to the conventional pre-release development version (0.0.9000) until devtools::release() actually runs.

NEWS.md: collapse the "(development version)" block I had bolted on top into the existing planned-0.1.0 section. The new pdf_doc_summary / pdf_pages_summary / summary.pdfium_doc entries join the v0.1.0 surface they were always going to ship with.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Calling summary() on a pdfium_page now returns a single-row tibble combining the cheap by-index columns (page_num, width, height, rotation, label: the same shape pdf_pages_summary returns per row) with the per-page counts that the loaded page makes available: annotation_count, obj_count, text_run_count, link_count.

Two-tier shape: the doc-wide pdf_pages_summary stays cheap (no page loads, no per-row counts); the page-level summary trades one already-loaded page for richer information. Users picking which one to call don't have to think about the cost; the loaded-page overload just exists, exposed through the standard R idiom.

5 new tests in test-pages-summary.R cover shape, columns, agreement with the underlying readers, error on closed page, and the real-label path against outline.pdf (the only shipped fixture with non-empty /PageLabels entries).

Full suite 2124/2124 pass; R coverage 100% (2835/2835 lines); 0 lints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Convenience wrapper around pdf_doc_open(source = ...) that fetches the bytes of a URL (http://, https://, ftp://, or file://) via base R's url() + readBin() and loads through PDFium's in-memory path. No temporary file is created; the bytes live in R memory for the document's lifetime.

The returned pdfium_doc's $path field is the URL string itself, so print() and pdf_doc_summary() surface the source even though no local path exists.

Closes the most common user-facing convenience gap: today, users fetching a PDF from a URL have to chain download.file() + tempfile() + pdf_doc_open() themselves. One call is shorter, doesn't leave temp files on disk, and handles cleanup via existing pdfium_doc finalizers.

8 new tests in test-doc-open-url.R cover the file:// happy path, the URL-stored-as-path contract, password/readwrite forwarding, input-shape rejection (non-URL strings, bad types), connection errors (file:// to non-existent path, http(s) to unreachable hosts, wrapped in suppressWarnings so the unreachable-host warning doesn't pollute test output), and a pdf_doc_summary() round-trip.

Full suite 2140/2140 pass; R coverage 100% (2849/2849); 0 lints; pkgdown reference check passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
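The url() + readBin() fetch described above can be sketched in base R as follows. `read_url_bytes` and its `chunk_size` parameter are hypothetical names; the sketch only covers reading bytes into memory and does not call into the package. The `file://` demonstration assumes a Unix-style absolute path.

```r
# Hypothetical sketch: read all bytes behind a URL into a raw vector
# without writing a temporary file.
read_url_bytes <- function(u, chunk_size = 65536L) {
  stopifnot(is.character(u), length(u) == 1L, !is.na(u))
  if (!grepl("^(https?|ftp|file)://", u)) {
    stop("`u` must be an http://, https://, ftp://, or file:// URL")
  }
  con <- url(u, open = "rb")
  on.exit(close(con), add = TRUE)
  chunks <- list()
  repeat {
    # readBin() returns a zero-length vector once the stream is exhausted.
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    chunks[[length(chunks) + 1L]] <- chunk
  }
  if (length(chunks) == 0L) raw(0) else do.call(c, chunks)
}

# Demonstration against a local file:// URL.
tmp <- tempfile(fileext = ".pdf")
writeBin(charToRaw("%PDF-1.7\n"), tmp)
bytes <- read_url_bytes(paste0("file://", normalizePath(tmp, winslash = "/")))
```

Reading in chunks avoids having to know the content length up front, at the cost of one final concatenation.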
CI flagged R CMD check WARNING on every platform:
pdf_doc_open_url.Rd: print.pdfium_doc
Please provide package anchors for all Rd \link{} targets not in
the package itself and the base packages.
The pdf_doc_open_url docstring had `[print()][print.pdfium_doc]` —
markdown for "render `print()` as link to topic `print.pdfium_doc`".
But `print.pdfium_doc` is an internal S3 method without its own Rd
page, so the link can't resolve.
Two changes:
1. Replace the bracketed cross-reference with plain `print()`
inline code so the function name still renders as code but
doesn't generate a broken link. Mirrors the same fix pattern
used on PR #32's `[is_open()]` issue.
2. New pre-commit hook `rd-xref-check` (entry:
tools/check-rd-xrefs.R) that runs the same internal R function
`R CMD check` uses for its cross-reference step
(tools:::.check_Rd_xrefs). Catches this class of WARNING on the
developer machine before push.
The script needs nothing more than the source tree (no install,
no compile, no C++ build), so it's cheap enough to run on
every commit — listed under `repo: local` next to the existing
`pkgdown-reference-check` hook, with the same diagnostic-then-
exit-1 pattern. Trigger files: any change to R/*.R, man/*.Rd,
or DESCRIPTION.
Manual verification: injecting the original broken link
reproduces the CI failure shape:
[check-rd-xrefs] pdf_doc_open_url.Rd: unresolved \link{}
target 'print.pdfium_doc'
Restoring the fix exits 0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pkgdown's `build_reference_index()` errored on PR #36:

In _pkgdown.yml, 2 topics missing from index: "summary.pdfium_doc" and "summary.pdfium_page"

Both S3 methods have their own @export and so their own Rd page (summary.pdfium_doc.Rd, summary.pdfium_page.Rd). pkgdown enforces that every non-internal Rd topic appears in the reference index.

Two changes:

1. Add both to _pkgdown.yml next to their `pdf_*_summary()` companions.
2. Rewrite tools/check-pkgdown-reference.R to enumerate man/*.Rd files directly rather than reconstruct the topic set from NAMESPACE's `export()` + `S3method()` entries. The old design only flagged missing topics for NAMESPACE `export()` entries; `S3method()`-only entries (the path that produces summary.*.Rd) slipped through.

The new design:

* filters out Rd files marked `\keyword{internal}` (matches the only existing internal Rd, pdfium-package.Rd)
* computes the topic set as { Rd basename } ∪ { every \alias{} entry inside }, so @rdname-collapsed methods (e.g. each pdfium_*_code paired with its _name in one Rd) still count as valid YAML entries
* flags topics missing from YAML and YAML entries missing from man/, same diagnostic shape as before

Manual verification: removing the summary.* entries from _pkgdown.yml reproduces the CI failure shape:

[check-pkgdown-reference] Documented but not in _pkgdown.yml reference index: summary.pdfium_doc, summary.pdfium_page

Restoring exits 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
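The topic-set computation described in this commit message can be approximated with a short base-R sketch. This is not the contents of tools/check-pkgdown-reference.R: `rd_topics`, the line-based regex parsing, and the fake Rd files (including the `example_code` / `example_name` aliases) are all assumptions made for illustration.

```r
# Hypothetical sketch: collect documented topics from a man/ directory.
rd_topics <- function(man_dir) {
  rd_files <- list.files(man_dir, pattern = "\\.Rd$", full.names = TRUE)
  topics <- character(0)
  for (f in rd_files) {
    lines <- readLines(f, warn = FALSE)
    # Skip Rd files marked internal.
    if (any(grepl("\\keyword{internal}", lines, fixed = TRUE))) next
    # Topic set = Rd basename plus every \alias{} entry inside the file.
    alias_lines <- grep("^\\\\alias\\{", lines, value = TRUE)
    aliases <- sub("^\\\\alias\\{(.*)\\}[[:space:]]*$", "\\1", alias_lines)
    topics <- c(topics, sub("\\.Rd$", "", basename(f)), aliases)
  }
  unique(topics)
}

# Fake man/ directory with one S3 method page, one collapsed page with two
# aliases, and one internal page.
man_dir <- tempfile("man")
dir.create(man_dir)
writeLines(c("\\name{summary.pdfium_doc}", "\\alias{summary.pdfium_doc}"),
           file.path(man_dir, "summary.pdfium_doc.Rd"))
writeLines(c("\\name{example_code}", "\\alias{example_code}",
             "\\alias{example_name}"),
           file.path(man_dir, "example_code.Rd"))
writeLines(c("\\name{pdfium-package}", "\\keyword{internal}"),
           file.path(man_dir, "pdfium-package.Rd"))

documented <- rd_topics(man_dir)
# Stand-in for the entries listed in the reference index (assumption).
indexed <- c("example_code", "example_name")
missing_from_index <- setdiff(documented, indexed)
```

Here `missing_from_index` holds the topics a check of this kind would report; comparing in the other direction with `setdiff(indexed, documented)` would give index entries that have no Rd page.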
Scans a directory for PDF files and returns a tibble with one row per file in the pdf_doc_summary() column shape. The natural replacement for the standard "loop over a folder of PDFs and find the ones with forms / attachments / encryption" triage workflow.

* Recursive descent via `recursive = TRUE`.
* Case-insensitive `pattern = "\\.pdf$"` by default; picks up both `.pdf` and `.PDF`.
* Optional shared `password` applied to every file.
* `errors` argument selects how broken / non-PDF files are handled:
  * "warn" (default): surface a warning per failure and skip
  * "skip": silently skip
  * "stop": abort on the first failure

Internal pdf_doc_summary_empty() helper hoisted to module scope so its zero-row template can be tested without exercising the no-files-in-directory branch through the full file scan.

14 new tests in test-dir-summary.R cover row-per-PDF count, column shape parity with pdf_doc_summary, path preservation, recursive descent, the empty-directory case, the empty-tibble helper itself, case-insensitive .PDF matching, all three errors modes, the zero-rows-when-everything-fails case, password forwarding, input-shape rejection, and custom patterns.

Full suite 2166/2166 pass; R coverage 100% (2907/2907 lines); 0 lints; pkgdown reference check passes (the hardened hook from the previous commit caught my omission of pdf_dir_summary from _pkgdown.yml during this commit's development, verifying the hook does what it should).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
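The three `errors` modes can be illustrated with a base-R sketch built around `tryCatch()`. This is not the package's `pdf_dir_summary()`: `dir_summary_sketch`, the `summarise_one` callback, the `empty` zero-row template argument, and the stub summariser that only checks for a `%PDF-` header are all assumptions for illustration.

```r
# Hypothetical sketch: one row per matching file, with selectable handling
# of files that fail to summarise.
dir_summary_sketch <- function(dir, summarise_one, empty,
                               pattern = "\\.pdf$", recursive = FALSE,
                               errors = c("warn", "skip", "stop")) {
  errors <- match.arg(errors)
  paths <- list.files(dir, pattern = pattern, recursive = recursive,
                      full.names = TRUE, ignore.case = TRUE)
  rows <- list()
  for (p in paths) {
    row <- tryCatch(summarise_one(p), error = function(e) e)
    if (inherits(row, "error")) {
      if (errors == "stop") stop(row)
      if (errors == "warn") {
        warning(sprintf("skipping %s: %s", p, conditionMessage(row)),
                call. = FALSE)
      }
      next
    }
    rows[[length(rows) + 1L]] <- row
  }
  # Zero-row template when nothing was found or everything failed.
  if (length(rows) == 0L) return(empty)
  do.call(rbind, rows)
}

# Stub summariser (assumption): accepts only files starting with "%PDF-".
summarise_stub <- function(path) {
  header <- readBin(path, what = "raw", n = 5L)
  if (!identical(header, charToRaw("%PDF-"))) stop("not a PDF")
  data.frame(path = path, size = file.size(path), stringsAsFactors = FALSE)
}
empty_stub <- data.frame(path = character(0), size = numeric(0),
                         stringsAsFactors = FALSE)

d <- tempfile("pdfdir")
dir.create(d)
writeBin(charToRaw("%PDF-1.7\n"), file.path(d, "good.pdf"))
writeBin(charToRaw("hello"), file.path(d, "BAD.PDF"))

res <- dir_summary_sketch(d, summarise_stub, empty_stub, errors = "skip")
```

With `errors = "skip"` the upper-case `BAD.PDF` is matched by the case-insensitive pattern, fails the header check, and is silently dropped; `"warn"` would additionally emit one warning for it, and `"stop"` would re-signal the error.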
First post-v0.1.0 feature PR. Three related additions for the
"give me everything at once" workflow.
Summary
* `pdf_doc_summary(doc)`: single-row tibble aggregating the most-asked-for facts about a PDF: path, page count, Info dict (with parsed POSIXct dates), structural feature flags (tagged, encrypted, xref-valid), counts for every per-feature list (bookmarks, attachments, signatures, form fields, JavaScript, named destinations), and the file-ID tuple.
* `pdf_pages_summary(doc)`: per-page sibling. One row per page with `page_num`, `width`, `height` (PDF user-space points, pre-rotation), `rotation` (0/90/180/270), and `label`. Uses the fast by-index PDFium readers; no per-page `pdf_page_load()`.
* `summary.pdfium_doc()`: S3 method dispatching to `pdf_doc_summary`. Matches the R idiom of `print()` for a quick "what is this" string and `summary()` for the deep dive.
* `vignettes/comparison.Rmd` migration table points `pdftools::pdf_info` / `pdftools::pdf_pagesize` users at the new helpers.
Test plan
error cases, multi-page docs, and the S3 dispatch.
Notes
* `file_id_hex_or_na` is a small internal helper hoisted out of `pdf_doc_summary` so its two branches can be unit-tested without needing a fixture with an `/ID` trailer entry (none of the shipped fixtures has one).
* `summary.pdfium_doc` lives in `R/doc.R` (next to its dependency `pdf_doc_summary`) rather than `R/classes.R`, so lintr's per-file object-usage analysis can see it.
* `DESCRIPTION` version bumped to `0.1.0.9000` to mark the post-CRAN-submission development cycle.
🤖 Generated with Claude Code