PDFCLOUD-5523 Direct extract text to Pydantic models#9

Merged
datalogics-cgreen merged 11 commits into pdfrest:main from
datalogics-kam:pdfcloud-5523-extract-text
Feb 11, 2026

Conversation

@datalogics-kam
Contributor

@datalogics-kam datalogics-kam commented Jan 9, 2026

PDFCLOUD-5523

Please review/merge #23 first

Summary

This PR makes JSON text extraction a first-class SDK experience by adding typed
extract_pdf_text APIs (sync + async), rich response models, and full test/live
coverage. The goal is to let consumers work directly with structured text, word
coordinates, and style metadata without manual JSON parsing.

Intent and Outcomes

Why

Today, extracted-text workflows are harder than they need to be when callers want
structured word/page data. They must rely on lower-level handling or file-based
flows.

What this enables

  • Callers can now request extracted text as a typed Python object directly.
  • Word-level metadata (coordinates/style) is easier to consume safely.
  • The feature is validated across sync/async unit tests and live integration tests.
  • A runnable example demonstrates real-world usage and output rendering.

Key Changes

Client API

  • Added extract_pdf_text(...) -> ExtractedTextDocument to:
    • PdfRestClient
    • AsyncPdfRestClient
  • Supports options for:
    • pages
    • full_text (off, by_page, document)
    • preserve_line_breaks
    • word_style
    • word_coordinates
    • request customization (extra_query, extra_headers, extra_body, timeout)
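
The option list above can be illustrated with a small, self-contained sketch. It does not call the SDK: `build_extract_text_payload` is a hypothetical helper, and the payload key names and default values are assumptions made for illustration only. It shows the two behaviors the PR describes, namely restricting `full_text` to the three listed modes and including `pages` only when the caller supplies it.

```python
from typing import Any, Literal, Optional, get_args

# The three full_text modes listed in the PR description.
FullTextMode = Literal["off", "by_page", "document"]


def build_extract_text_payload(
    file_id: str,
    *,
    pages: Optional[str] = None,
    full_text: str = "document",  # assumed default, not confirmed by the PR
    preserve_line_breaks: bool = False,
    word_style: bool = False,
    word_coordinates: bool = False,
) -> dict[str, Any]:
    """Hypothetical helper: assemble a request body from the listed options."""
    allowed = get_args(FullTextMode)
    if full_text not in allowed:
        raise ValueError(f"full_text must be one of {allowed}, got {full_text!r}")

    # Key names below are assumptions, chosen to mirror the option names.
    payload: dict[str, Any] = {
        "id": file_id,
        "full_text": full_text,
        "preserve_line_breaks": preserve_line_breaks,
        "word_style": word_style,
        "word_coordinates": word_coordinates,
    }
    # Only send `pages` when the caller asked for a specific range.
    if pages is not None:
        payload["pages"] = pages
    return payload
```

Rejecting an unknown mode before any request is built keeps this class of mistake on the client side, which is the kind of invalid-option validation the unit tests in this PR exercise.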

Models

  • Introduced public extracted-text model hierarchy (exported via pdfrest.models)
    for structured parsing and safer downstream usage, including:
    • document-level payload
    • full-text document/page representations
    • word, coordinates, font, and color structures
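
To make the shape of the word-level structures concrete, here is a minimal sketch that uses stdlib dataclasses as a stand-in for the Pydantic models. The class names (`SketchWord`, `SketchFont`, `SketchCoordinates`), the `parse_word` helper, and every field name except the font's `name` and `size` are assumptions; the real models in `pdfrest.models` may differ, and color is omitted here.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class SketchCoordinates:
    # Assumed bounding-box layout; the real coordinate fields may differ.
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class SketchFont:
    # The PR makes `name` and `size` required on the font style model.
    name: str
    size: float


@dataclass(frozen=True)
class SketchWord:
    text: str
    page: int
    font: Optional[SketchFont] = None  # only present when style is requested
    coordinates: Optional[SketchCoordinates] = None  # only when requested


def parse_word(raw: dict[str, Any]) -> SketchWord:
    """Turn one raw JSON word entry into a typed object."""
    font = raw.get("font")
    coords = raw.get("coordinates")
    return SketchWord(
        text=raw["text"],
        page=raw["page"],
        font=SketchFont(**font) if font is not None else None,
        coordinates=SketchCoordinates(**coords) if coords is not None else None,
    )
```

With typed objects, a missing optional block shows up as `None` on a known attribute rather than as a `KeyError` deep inside hand-written JSON handling.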

Testing

  • Added dedicated unit coverage for extract_pdf_text:
    • sync + async success cases
    • both the pages-omitted and pages-provided paths
    • request customization propagation
    • invalid option/path validation
    • server error translation
  • Added dedicated live tests under tests/live/:
    • sync + async success matrices across supported option combinations
    • negative invalid-pages cases that reach the server
  • Added model-focused tests for extracted-text schema behavior and round-trip
    serialization.

Examples & Developer Experience

  • Added examples/extract_text/extract_pdf_text_example.py showing practical usage
    with word coordinates/style rendered in a Rich table.
  • Updated examples README with the new extraction example.
  • Added rich as a dev dependency for example output formatting.

Repo Hygiene

  • .gitignore updated to ignore /coverage.json.

Behavior Notes

  • No breaking changes to existing endpoints.
  • This extends the API surface with a more ergonomic typed pathway for extracted
    text JSON results.

Validation

  • Unit tests added for client + model behavior.
  • Live tests added for sync/async parity and server-side error cases.
  • Lint/type checks were used during iteration (ruff, basedpyright) and
    targeted test modules were run while fixing coverage branches.

CI / Workflow Impact

  • No workflow file changes in this PR.
  • Existing CI behavior remains:
    • pre-commit workflow
    • Test and Publish matrix on Python 3.10–3.14
  • This PR is expected to run within the existing matrix and checks.

- Introduced new models for enhanced text extraction support:
  - `ExtractedTextDocument`: Represents structured JSON output from text
    extraction.
  - `ExtractedTextWord`, `ExtractedTextWordStyle`,
    `ExtractedTextWordColor`, and related classes: Support word-level
    extraction with font, color, coordinates, and style data.
  - `ExtractedTextFullText` and `ExtractedTextFullTextPages`: Enable
    handling of extracted text in both document-level and page-level
    formats.
- Updated `__init__.py` to include the new models in `__all__`.
- Improved property methods for `ExtractedTextFullText` to provide
  convenient access to text representations (`document_text`, `pages`).

Assisted-by: Codex
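
The convenience properties mentioned in the commit above (`document_text`, `pages`) could look roughly like the following sketch. It assumes the full-text payload is either a single string (document mode) or a list of per-page entries (by-page mode); that shape, and the `SketchFullText` and `SketchPageText` names, are assumptions rather than the SDK's actual definitions.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class SketchPageText:
    page: int
    text: str


@dataclass
class SketchFullText:
    # Assumed shape: one string for the whole document, or one entry per page.
    value: Union[str, list[SketchPageText]]

    @property
    def document_text(self) -> Optional[str]:
        """The whole-document text, or None in by-page mode."""
        return self.value if isinstance(self.value, str) else None

    @property
    def pages(self) -> Optional[list[SketchPageText]]:
        """The per-page entries, or None in document mode."""
        return self.value if isinstance(self.value, list) else None
```

Callers can then branch on whichever property is not `None` instead of inspecting the raw payload type themselves.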
…ization

- Introduced tests to validate `ExtractedTextDocument` in different modes:
  - Document mode with full text payload.
  - Page mode with structured words and page-level text.
  - Mode without words or full text.
  - Word payload variations including minimal, styled, and coordinate data.
- Ensured validation of `model_validate` and `model_dump` methods.
- Verified handling of optional fields like `words`, `style`, and `coordinates`.

Assisted-by: Codex
- Introduced `extract_pdf_text` to extract text content from PDFs with options
  like `full_text`, `preserve_line_breaks`, `word_style`, and `word_coordinates`.
- Added both sync and async implementations for the `extract_pdf_text` method.
- Validated payloads using `ExtractTextPayload` and model validation.
- Returns structured `ExtractedTextDocument` with extracted text and metadata.

tests: Add unit and live tests for extract_pdf_text

- Added comprehensive unit tests to validate different payload combinations.
- Introduced live tests to ensure end-to-end functionality of API integration.
- Verified error handling for invalid pages, payloads, and server responses.

Assisted-by: Codex
- Added `extract_pdf_text_example.py` to demonstrate text extraction, including
  word-level coordinates and style metadata from `examples/resources/report.pdf`.
- Rendered extracted data as a Rich table for structured visualization.
- Updated `README.md` to document the available examples.

Assisted-by: Codex
- Changed `name` and `size` in the font style model from `Optional` to
  required fields.
- Simplified downstream usage by ensuring these fields are always
  defined.

examples/extract_text: Simplify font formatting logic

- Updated `_format_font` to remove conditional checks for `None` values,
  as `name` and `size` are now guaranteed to be present in the font
  style model.

Assisted-by: Codex
- Introduced `PAGES_OPTION_SETS` to validate extraction with and without
  page ranges.
- Updated `test_async_extract_pdf_text_success` to include `pages`
  parameter, covering scenarios with specific page ranges.
- Enhanced payload handling to conditionally include `pages` when
  provided.

Assisted-by: Codex
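
A test matrix in the spirit of `PAGES_OPTION_SETS` can be sketched as follows. Only the name `PAGES_OPTION_SETS` comes from the commit above; its contents here, the other option lists, and the `option_matrix` helper are assumptions for illustration.

```python
from itertools import product
from typing import Any, Iterator, Optional

# Name taken from the commit message; the values are assumed.
PAGES_OPTION_SETS: list[Optional[str]] = [None, "1-2"]
FULL_TEXT_MODES = ["off", "by_page", "document"]
BOOLEAN_CHOICES = [False, True]


def option_matrix() -> Iterator[dict[str, Any]]:
    """Yield one keyword-argument dict per option combination."""
    for pages, full_text, style, coords in product(
        PAGES_OPTION_SETS, FULL_TEXT_MODES, BOOLEAN_CHOICES, BOOLEAN_CHOICES
    ):
        options: dict[str, Any] = {
            "full_text": full_text,
            "word_style": style,
            "word_coordinates": coords,
        }
        # Mirror the conditional handling: omit `pages` unless provided.
        if pages is not None:
            options["pages"] = pages
        yield options
```

Each yielded dict could be fed to a parametrized test, so the omitted-pages and provided-pages paths are covered for every other option combination.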
- Excluded `coverage.json` from version control to avoid tracking
  local coverage report artifacts.

Assisted-by: Codex
…df_text

- Replaced redundant string comparisons for `word_style` and
  `word_coordinates` options with direct truthy value checks.
- Ensured consistent logic for validating `response.words`.

Assisted-by: Codex
- Introduced `test_async_extract_pdf_text_request_customization` to validate
  async behavior with customized query parameters, headers, body data, and
  timeout configurations.
- Ensured payload consistency and timeout accuracy within the test.
- Mocked API transport to emulate request handling and validate expected
  responses.

Assisted-by: Codex
- Introduced tests for async `extract_pdf_text` to validate various error
  scenarios:
  - `test_async_extract_pdf_text_multi_file_guard`: Ensures validation for
    single file restriction.
  - `test_async_extract_pdf_text_invalid_pages`: Verifies validation for
    improper page ranges.
  - `test_async_extract_pdf_text_server_error`: Tests handling of server error
    responses.
  - `test_async_extract_pdf_text_invalid_option_values`: Checks validation for
    invalid option values.
- Mocked API transport to simulate responses and validate exception handling.

Assisted-by: Codex
@datalogics-kam datalogics-kam force-pushed the pdfcloud-5523-extract-text branch from 78f526e to d64fb28 on February 10, 2026 16:23
@datalogics-cgreen datalogics-cgreen merged commit 3d1c459 into pdfrest:main Feb 11, 2026
14 checks passed
@datalogics-kam datalogics-kam deleted the pdfcloud-5523-extract-text branch February 11, 2026 21:19

2 participants