PDFCLOUD-5523 Direct extract text to Pydantic models by datalogics-kam · Pull Request #9 · pdfrest/pdfrest-python

datalogics-kam · 2026-01-09T18:39:03Z

Please review/merge #23 first

Summary

This PR makes JSON text extraction a first-class SDK experience by adding typed
extract_pdf_text APIs (sync + async), rich response models, and full test/live
coverage. The goal is to let consumers work directly with structured text, word
coordinates, and style metadata without manual JSON parsing.

Intent and Outcomes

Why

Today, extracted-text workflows are harder than they need to be when callers want
structured word/page data. They must rely on lower-level handling or file-based
flows.

What this enables

Callers can now request extracted text as a typed Python object directly.
Word-level metadata (coordinates/style) is easier to consume safely.
The feature is validated across sync/async unit tests and live integration tests.
A runnable example demonstrates real-world usage and output rendering.

Key Changes

Client API

Added extract_pdf_text(...) -> ExtractedTextDocument to:
- PdfRestClient
- AsyncPdfRestClient
Supports options for:
- pages
- full_text (off, by_page, document)
- preserve_line_breaks
- word_style
- word_coordinates
- request customization (extra_query, extra_headers, extra_body, timeout)
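As a rough illustration of how the options above might map onto a request, the sketch below validates them and builds a payload dictionary. The helper name `build_extract_text_payload`, the `file_id` argument, the wire-level key names, and the default values are all assumptions made for illustration; they are not taken from the SDK.

```python
from __future__ import annotations

from typing import Any, Literal, get_args

# The three full_text modes named in the option list above.
FullTextMode = Literal["off", "by_page", "document"]


def build_extract_text_payload(
    file_id: str,  # hypothetical: how the SDK identifies the input file is not shown here
    *,
    pages: str | None = None,
    full_text: FullTextMode = "document",  # assumed default
    preserve_line_breaks: bool = False,
    word_style: bool = False,
    word_coordinates: bool = False,
) -> dict[str, Any]:
    """Hypothetical helper: validate options and build a request payload."""
    if full_text not in get_args(FullTextMode):
        raise ValueError(f"full_text must be one of {get_args(FullTextMode)}")
    payload: dict[str, Any] = {
        "id": file_id,  # assumed key name
        "full_text": full_text,
        "preserve_line_breaks": preserve_line_breaks,
        "word_style": word_style,
        "word_coordinates": word_coordinates,
    }
    # Include `pages` only when the caller supplied it.
    if pages is not None:
        payload["pages"] = pages
    return payload
```

Calling `build_extract_text_payload("abc", pages="1-3")` yields a dictionary containing a `pages` key, while omitting `pages` leaves that key out entirely.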

Models

Introduced public extracted-text model hierarchy (exported via pdfrest.models)
for structured parsing and safer downstream usage, including:
- document-level payload
- full-text document/page representations
- word, coordinates, font, and color structures
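To make the shape of such a hierarchy concrete, here is a minimal sketch that uses stdlib dataclasses as a stand-in for the Pydantic models, so it runs without dependencies. The `*Sketch` class names and the `text`, `x`, and `y` fields are hypothetical; only `words`, `style`, `coordinates`, and the font `name`/`size` fields are named in this PR.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class FontSketch:
    name: str
    size: float


@dataclass(frozen=True)
class CoordinatesSketch:
    x: float  # hypothetical field names
    y: float


@dataclass(frozen=True)
class WordSketch:
    text: str  # hypothetical field name
    style: FontSketch | None = None  # optional; assumed absent unless requested
    coordinates: CoordinatesSketch | None = None  # optional, same assumption

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> WordSketch:
        style = raw.get("style")
        coords = raw.get("coordinates")
        return cls(
            text=raw["text"],
            style=FontSketch(**style) if style else None,
            coordinates=CoordinatesSketch(**coords) if coords else None,
        )


@dataclass(frozen=True)
class DocumentSketch:
    words: tuple[WordSketch, ...] | None = None  # None when no word data is present

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> DocumentSketch:
        words = raw.get("words")
        if words is None:
            return cls()
        return cls(words=tuple(WordSketch.from_dict(w) for w in words))


# Made-up sample payload, not real service output.
raw = {
    "words": [
        {
            "text": "Hello",
            "style": {"name": "Helvetica", "size": 12.0},
            "coordinates": {"x": 72.0, "y": 700.0},
        },
        {"text": "world"},
    ]
}
doc = DocumentSketch.from_dict(raw)
```

With typed objects, a caller reads `doc.words[0].coordinates.x` instead of indexing nested dictionaries, and `DocumentSketch.from_dict(asdict(doc)) == doc` shows the kind of round-trip check the model tests describe.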

Testing

Added dedicated unit coverage for extract_pdf_text:
- sync + async success cases
- paths with pages omitted and with pages provided
- request customization propagation
- invalid option/path validation
- server error translation
Added dedicated live tests under tests/live/:
- sync + async success matrices across supported option combinations
- negative invalid-pages cases that reach the server
Added model-focused tests for extracted-text schema behavior and round-trip
serialization.

Examples & Developer Experience

Added examples/extract_text/extract_pdf_text_example.py showing practical usage
with word coordinates/style rendered in a Rich table.
Updated examples README with the new extraction example.
Added rich as a dev dependency for example output formatting.

Repo Hygiene

.gitignore updated to ignore /coverage.json.

Behavior Notes

No breaking changes to existing endpoints.
This extends the API surface with a more ergonomic typed pathway for extracted
text JSON results.

Validation

Unit tests added for client + model behavior.
Live tests added for sync/async parity and server-side error cases.
Lint/type checks were used during iteration (ruff, basedpyright) and
targeted test modules were run while fixing coverage branches.

CI / Workflow Impact

No workflow file changes in this PR.
Existing CI behavior remains:
- pre-commit workflow
- Test and Publish matrix on Python 3.10–3.14
This PR is expected to run within the existing matrix and checks.

- Introduced new models for enhanced text extraction support:
  - `ExtractedTextDocument`: Represents structured JSON output from text extraction.
  - `ExtractedTextWord`, `ExtractedTextWordStyle`, `ExtractedTextWordColor`, and related classes: Support word-level extraction with font, color, coordinates, and style data.
  - `ExtractedTextFullText` and `ExtractedTextFullTextPages`: Enable handling of extracted text in both document-level and page-level formats.
- Updated `__init__.py` to include the new models in `__all__`.
- Improved property methods for `ExtractedTextFullText` to provide convenient access to text representations (`document_text`, `pages`).

Assisted-by: Codex
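The convenience properties mentioned in this commit could look roughly like the sketch below, which again uses stdlib dataclasses rather than Pydantic. The class names, the `value` field, and the choice to return `None` from the property that does not match the stored mode are assumptions; the actual behavior of `document_text` and `pages` is not described in this PR.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PageTextSketch:
    page: int  # hypothetical field names
    text: str


@dataclass(frozen=True)
class FullTextSketch:
    # Either one string for the whole document or one entry per page.
    value: str | tuple[PageTextSketch, ...]

    @property
    def document_text(self) -> str | None:
        """Return the document-level text, or None in by-page mode (assumed)."""
        return self.value if isinstance(self.value, str) else None

    @property
    def pages(self) -> tuple[PageTextSketch, ...] | None:
        """Return the per-page entries, or None in document mode (assumed)."""
        return None if isinstance(self.value, str) else self.value


whole = FullTextSketch("All of the text")
paged = FullTextSketch((PageTextSketch(1, "First"), PageTextSketch(2, "Second")))
```

A caller can then branch on `whole.document_text is not None` rather than inspecting the raw payload type.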

…ization

- Introduced tests to validate `ExtractedTextDocument` in different modes:
  - Document mode with full text payload.
  - Page mode with structured words and page-level text.
  - Mode without words or full text.
  - Word payload variations including minimal, styled, and coordinate data.
- Ensured validation of `model_validate` and `model_dump` methods.
- Verified handling of optional fields like `words`, `style`, and `coordinates`.

Assisted-by: Codex

- Introduced `extract_pdf_text` to extract text content from PDFs with options like `full_text`, `preserve_line_breaks`, `word_style`, and `word_coordinates`.
- Added both sync and async implementations for the `extract_pdf_text` method.
- Validated payloads using `ExtractTextPayload` and model validation.
- Returns structured `ExtractedTextDocument` with extracted text and metadata.

tests: Add unit and live tests for extract_pdf_text

- Added comprehensive unit tests to validate different payload combinations.
- Introduced live tests to ensure end-to-end functionality of API integration.
- Verified error handling for invalid pages, payloads, and server responses.

Assisted-by: Codex

- Added `extract_pdf_text_example.py` to demonstrate text extraction, including word-level coordinates and style metadata from `examples/resources/report.pdf`.
- Rendered extracted data as a Rich table for structured visualization.
- Updated `README.md` to document the available examples.

Assisted-by: Codex

- Changed `name` and `size` in the font style model from `Optional` to required fields.
- Simplified downstream usage by ensuring these fields are always defined.

examples/extract_text: Simplify font formatting logic

- Updated `_format_font` to remove conditional checks for `None` values, as `name` and `size` are now guaranteed to be present in the font style model.

Assisted-by: Codex
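The effect of making `name` and `size` required can be sketched as a before/after pair. Both functions and the `FontStyleSketch` class are hypothetical stand-ins written for this illustration; the example's real `_format_font` and its output format are not shown in this PR.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FontStyleSketch:
    name: str  # required
    size: float  # required


def format_font_optional(name: str | None, size: float | None) -> str:
    """Before: either field may be None, so each one needs a guard."""
    parts = []
    if name is not None:
        parts.append(name)
    if size is not None:
        parts.append(f"{size:g}pt")
    return " ".join(parts) or "unknown"


def format_font(font: FontStyleSketch) -> str:
    """After: both fields are always present, so no guards are needed."""
    return f"{font.name} {font.size:g}pt"
```

The guarded version needs a fallback for missing data, while the simplified version is a single expression because the model guarantees both fields.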

- Introduced `PAGES_OPTION_SETS` to validate extraction with and without page ranges.
- Updated `test_async_extract_pdf_text_success` to include `pages` parameter, covering scenarios with specific page ranges.
- Enhanced payload handling to conditionally include `pages` when provided.

Assisted-by: Codex

- Excluded `coverage.json` from version control to avoid tracking local coverage report artifacts.

Assisted-by: Codex

…df_text

- Replaced redundant string comparisons for `word_style` and `word_coordinates` options with direct truthy value checks.
- Ensured consistent logic for validating `response.words`.

Assisted-by: Codex

- Introduced `test_async_extract_pdf_text_request_customization` to validate async behavior with customized query parameters, headers, body data, and timeout configurations.
- Ensured payload consistency and timeout accuracy within the test.
- Mocked API transport to emulate request handling and validate expected responses.

Assisted-by: Codex

- Introduced tests for async `extract_pdf_text` to validate various error scenarios:
  - `test_async_extract_pdf_text_multi_file_guard`: Ensures validation for single file restriction.
  - `test_async_extract_pdf_text_invalid_pages`: Verifies validation for improper page ranges.
  - `test_async_extract_pdf_text_server_error`: Tests handling of server error responses.
  - `test_async_extract_pdf_text_invalid_option_values`: Checks validation for invalid option values.
- Mocked API transport to simulate responses and validate exception handling.

Assisted-by: Codex

datalogics-kam mentioned this pull request Jan 9, 2026

PDFCLOUD-5464 Add additional pdfRest tools #6

Merged

datalogics-kam force-pushed the pdfcloud-5523-extract-text branch from 5502a52 to f0d943f Compare January 12, 2026 17:42

datalogics-cgreen mentioned this pull request Jan 16, 2026

PDFCLOUD-5547 Add PDF encryption and restriction client methods #11

Merged

datalogics-kam force-pushed the pdfcloud-5523-extract-text branch 2 times, most recently from a89851f to de3eb43 Compare February 6, 2026 17:38

datalogics-kam marked this pull request as ready for review February 6, 2026 19:56

datalogics-kam requested a review from datalogics-cgreen February 6, 2026 20:06

datalogics-kam assigned datalogics-cgreen Feb 6, 2026

datalogics-kam force-pushed the pdfcloud-5523-extract-text branch from b8880f6 to 78f526e Compare February 10, 2026 00:41

datalogics-kam added 11 commits February 10, 2026 10:23

uv: Add the rich package to the dev group

d7bc027

gitignore: Add coverage.json to ignore list

46c765d

tests/live: Simplify word-related conditionals in test_live_extract_p…

9fd96c3

datalogics-kam force-pushed the pdfcloud-5523-extract-text branch from 78f526e to d64fb28 Compare February 10, 2026 16:23

datalogics-cgreen approved these changes Feb 11, 2026

datalogics-cgreen merged commit 3d1c459 into pdfrest:main Feb 11, 2026
14 checks passed

datalogics-kam deleted the pdfcloud-5523-extract-text branch February 11, 2026 21:19

datalogics-cgreen merged 11 commits into pdfrest:main from datalogics-kam:pdfcloud-5523-extract-text
