PDFCLOUD-5523 Direct extract text to Pydantic models #9
Merged
datalogics-cgreen merged 11 commits into pdfrest:main on Feb 11, 2026
- Introduced new models for enhanced text extraction support:
  - `ExtractedTextDocument`: Represents structured JSON output from text
    extraction.
  - `ExtractedTextWord`, `ExtractedTextWordStyle`,
    `ExtractedTextWordColor`, and related classes: Support word-level
    extraction with font, color, coordinates, and style data.
  - `ExtractedTextFullText` and `ExtractedTextFullTextPages`: Enable
    handling of extracted text in both document-level and page-level
    formats.
- Updated `__init__.py` to include the new models in `__all__`.
- Improved property methods for `ExtractedTextFullText` to provide
  convenient access to text representations (`document_text`, `pages`).
Assisted-by: Codex
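To make the shape of these models concrete, here is a minimal sketch of how word-level extraction output with optional style and coordinate data could be modeled in Pydantic v2. The class names (`WordStyle`, `WordBox`, `Word`, `TextDocument`) and every field name are hypothetical stand-ins chosen for illustration; they are not the SDK's actual `ExtractedText*` schema.

```python
from typing import List, Optional

from pydantic import BaseModel


class WordStyle(BaseModel):
    """Hypothetical stand-in for a word style model (field names assumed)."""

    font_name: str
    font_size: float


class WordBox(BaseModel):
    """Hypothetical bounding box; the real coordinate fields may differ."""

    x: float
    y: float
    width: float
    height: float


class Word(BaseModel):
    """One extracted word; style and coordinates are optional extras."""

    text: str
    page: int
    style: Optional[WordStyle] = None
    coordinates: Optional[WordBox] = None


class TextDocument(BaseModel):
    """Hypothetical top-level document with optional words and full text."""

    words: Optional[List[Word]] = None
    full_text: Optional[str] = None


# Assumed JSON payload: one minimal word, one with style and coordinates.
raw = {
    "words": [
        {"text": "Hello", "page": 1},
        {
            "text": "world",
            "page": 1,
            "style": {"font_name": "Helvetica", "font_size": 12.0},
            "coordinates": {"x": 72.0, "y": 700.5, "width": 30.0, "height": 12.0},
        },
    ]
}

doc = TextDocument.model_validate(raw)
plain, styled = doc.words

# Dropping None-valued fields round-trips back to the input payload.
dumped = doc.model_dump(exclude_none=True)
```

Typed attribute access (`styled.style.font_name`) replaces manual dictionary lookups, and `model_dump(exclude_none=True)` keeps the optional fields out of the serialized form when they were never supplied.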
…ization
- Introduced tests to validate `ExtractedTextDocument` in different modes:
  - Document mode with full text payload.
  - Page mode with structured words and page-level text.
  - Mode without words or full text.
  - Word payload variations including minimal, styled, and coordinate data.
- Ensured validation of `model_validate` and `model_dump` methods.
- Verified handling of optional fields like `words`, `style`, and `coordinates`.
Assisted-by: Codex
- Introduced `extract_pdf_text` to extract text content from PDFs with options like `full_text`, `preserve_line_breaks`, `word_style`, and `word_coordinates`.
- Added both sync and async implementations for the `extract_pdf_text` method.
- Validated payloads using `ExtractTextPayload` and model validation.
- Returns structured `ExtractedTextDocument` with extracted text and metadata.
tests: Add unit and live tests for extract_pdf_text
- Added comprehensive unit tests to validate different payload combinations.
- Introduced live tests to ensure end-to-end functionality of API integration.
- Verified error handling for invalid pages, payloads, and server responses.
Assisted-by: Codex
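As an illustration of the payload validation step described above, the following sketch shows one way extraction options could be checked with a Pydantic model before a request is sent. `TextOptions` is a hypothetical stand-in, not the SDK's `ExtractTextPayload`; the boolean option types and the defaults are assumptions, and only the three `full_text` values (`off`, `by_page`, `document`) come from the PR description.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError


class TextOptions(BaseModel):
    """Hypothetical options model; types and defaults are assumed."""

    full_text: Literal["off", "by_page", "document"] = "document"
    preserve_line_breaks: bool = False
    word_style: bool = False
    word_coordinates: bool = False
    pages: Optional[str] = None  # e.g. "1-3"; the real page syntax may differ


# A valid combination passes validation and exposes typed attributes.
options = TextOptions(full_text="by_page", word_style=True)

# An unsupported full_text value is rejected before any request is made.
try:
    TextOptions(full_text="everything")
    rejected = False
except ValidationError:
    rejected = True
```

Validating on the client side turns a misspelled option into an immediate, local error instead of a failed API call.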
- Added `extract_pdf_text_example.py` to demonstrate text extraction, including word-level coordinates and style metadata from `examples/resources/report.pdf`.
- Rendered extracted data as a Rich table for structured visualization.
- Updated `README.md` to document the available examples.
Assisted-by: Codex
- Changed `name` and `size` in the font style model from `Optional` to required fields.
- Simplified downstream usage by ensuring these fields are always defined.
examples/extract_text: Simplify font formatting logic
- Updated `_format_font` to remove conditional checks for `None` values, as `name` and `size` are now guaranteed to be present in the font style model.
Assisted-by: Codex
- Introduced `PAGES_OPTION_SETS` to validate extraction with and without page ranges.
- Updated `test_async_extract_pdf_text_success` to include `pages` parameter, covering scenarios with specific page ranges.
- Enhanced payload handling to conditionally include `pages` when provided.
Assisted-by: Codex
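The conditional handling of `pages` can be sketched roughly as follows. `build_payload`, the `id` key, and `PAGE_CASES` are hypothetical names used only for illustration; the actual payload keys and the contents of `PAGES_OPTION_SETS` are not shown in this PR page.

```python
from typing import Any, Dict, List, Optional


def build_payload(
    file_id: str, pages: Optional[str] = None, **options: Any
) -> Dict[str, Any]:
    """Build a request payload, adding `pages` only when a range is given."""
    payload: Dict[str, Any] = {"id": file_id, **options}
    if pages is not None:
        payload["pages"] = pages
    return payload


# Assumed test cases: one without a page range, one with a specific range.
PAGE_CASES: List[Optional[str]] = [None, "1-3"]

payloads = [build_payload("abc", pages=case, full_text="document") for case in PAGE_CASES]
```

Leaving the key out entirely, rather than sending `None`, keeps the request identical to the no-range case, which is what lets a single test body cover both paths.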
- Excluded `coverage.json` from version control to avoid tracking local coverage report artifacts.
Assisted-by: Codex
…df_text
- Replaced redundant string comparisons for `word_style` and `word_coordinates` options with direct truthy value checks.
- Ensured consistent logic for validating `response.words`.
Assisted-by: Codex
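A truthy check of this kind might look like the sketch below. It assumes that both options are plain booleans at this point and that words are returned exactly when at least one word-level option is enabled; neither assumption is confirmed by the PR page, and `words_are_consistent` is a hypothetical helper name.

```python
from typing import Optional, Sequence


def words_are_consistent(
    words: Optional[Sequence[object]],
    word_style: bool,
    word_coordinates: bool,
) -> bool:
    """Check that words are present only when a word-level option is on."""
    # Direct truthiness check instead of comparing each option to a string.
    if word_style or word_coordinates:
        return words is not None
    return words is None
```

Note that a truthy check is only equivalent to the old comparisons if the options are real booleans; a non-empty string such as `"off"` would also count as truthy.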
- Introduced `test_async_extract_pdf_text_request_customization` to validate async behavior with customized query parameters, headers, body data, and timeout configurations.
- Ensured payload consistency and timeout accuracy within the test.
- Mocked API transport to emulate request handling and validate expected responses.
Assisted-by: Codex
- Introduced tests for async `extract_pdf_text` to validate various error
  scenarios:
  - `test_async_extract_pdf_text_multi_file_guard`: Ensures validation for
    single file restriction.
  - `test_async_extract_pdf_text_invalid_pages`: Verifies validation for
    improper page ranges.
  - `test_async_extract_pdf_text_server_error`: Tests handling of server error
    responses.
  - `test_async_extract_pdf_text_invalid_option_values`: Checks validation for
    invalid option values.
- Mocked API transport to simulate responses and validate exception handling.
Assisted-by: Codex
datalogics-cgreen
approved these changes
Feb 11, 2026
PDFCLOUD-5523
Please review/merge #23 first
Summary
This PR makes JSON text extraction a first-class SDK experience by adding typed
`extract_pdf_text` APIs (sync + async), rich response models, and full test/live
coverage. The goal is to let consumers work directly with structured text, word
coordinates, and style metadata without manual JSON parsing.
Intent and Outcomes
Why
Today, extracted-text workflows are harder than they need to be when callers want
structured word/page data. They must rely on lower-level handling or file-based
flows.
What this enables
Key Changes
Client API
- Added `extract_pdf_text(...) -> ExtractedTextDocument` to:
  - `PdfRestClient`
  - `AsyncPdfRestClient`
- Options: `pages`, `full_text` (`off`, `by_page`, `document`), `preserve_line_breaks`, `word_style`, `word_coordinates`
- Request customization (`extra_query`, `extra_headers`, `extra_body`, `timeout`)
Models
- New models (`pdfrest.models`) for structured parsing and safer downstream usage, including:
Testing
- `extract_pdf_text`:
  - `pages` omitted/provided paths
- `tests/live/`:
  - … serialization.
Examples & Developer Experience
- `examples/extract_text/extract_pdf_text_example.py` showing practical usage with word coordinates/style rendered in a Rich table.
- `rich` as a dev dependency for example output formatting.
Repo Hygiene
- `.gitignore` updated to ignore `/coverage.json`.
Behavior Notes
- … text JSON results.
Validation
- … (`ruff`, `basedpyright`) and targeted test modules were run while fixing coverage branches.
CI / Workflow Impact
- `pre-commit` workflow
- `Test and Publish` matrix on Python 3.10–3.14