Skip to content

v0.1.0: Words + ExtractText (Phase 1.3.B)#1

Merged
hallelx2 merged 1 commit into
mainfrom
feat/words-and-text-extract
May 26, 2026
Merged

v0.1.0: Words + ExtractText (Phase 1.3.B)#1
hallelx2 merged 1 commit into
mainfrom
feat/words-and-text-extract

Conversation

@hallelx2
Copy link
Copy Markdown
Owner

@hallelx2 hallelx2 commented May 26, 2026

Summary

Port pdfplumber's word + text extraction algorithms into Go. Three new methods on Page:

  • Words(WordOpts) ([]Word, error)
  • ExtractText(TextOpts) (string, error)
  • ExtractTextSimple(xTolerance, yTolerance float64) (string, error)

Plus supporting machinery: BBox helpers (geometry.go), 1-D clustering primitives (clustering.go), and the WordExtractor algorithm (text.go).

API additions

type Page interface {
    // v0.0.1 surface unchanged…
    Number() int
    Width() float64
    Height() float64
    Chars() ([]Char, error)
    Lines() ([]Line, error)
    Rects() ([]Rect, error)
    Curves() ([]Curve, error)
    Objects() (Objects, error)

    // New in v0.1.0.
    Words(opts WordOpts) ([]Word, error)
    ExtractText(opts TextOpts) (string, error)
    ExtractTextSimple(xTolerance, yTolerance float64) (string, error)
}

type Word struct {
    Text      string
    X0,Y0,X1,Y1 float64
    Upright   bool
    Direction string // "ltr" | "rtl" | "ttb" | "btt"
    FontName  string
    FontSize  float64
    Chars     []Char // populated when WordOpts.KeepChars=true
}

type WordOpts struct {
    XTolerance, YTolerance float64 // default 3, 3
    KeepBlankChars         bool
    UseTextFlow            bool
    HorizontalLTR          bool   // default true
    VerticalTTB            bool   // default true
    ExtraAttrs             []string
    SplitAtPunctuation     bool
    Expand                 bool   // ligature expansion, default true
    KeepChars              bool
}

type TextOpts struct {
    XTolerance, YTolerance       float64
    Layout                       bool
    LayoutWidthChars             int
    LayoutHeightChars            int
    XDensity, YDensity           float64 // default 7.25, 13
    UseTextFlow                  bool
    HorizontalLTR                bool
    VerticalTTB                  bool
    ExtraAttrs                   []string
    Expand                       bool
}

// Defaults match pdfplumber.
func DefaultWordOpts() WordOpts
func DefaultTextOpts() TextOpts

Parity with pdfplumber

Matches exactly:

  • Word text content (byte-equal).
  • Word count and ordering (line cluster → sort within line by direction).
  • Word direction handling (ltr/rtl/ttb/btt mapping from upright + flags).
  • Ligature expansion table (fi→fi, fl→fl, etc.).
  • Default tolerances (XTolerance=3, YTolerance=3) and switches
    (KeepBlankChars, UseTextFlow, SplitAtPunctuation, Expand).
  • dedupeChars semantics (text + position + extra_attrs equality).

Intentionally differs:

  • Word bbox positions drift by up to ~10 PDF points on standard-14 fonts because the AFM metric tables aren't yet bundled. We use a default-width fallback whereas pdfplumber pulls precise widths from pdfminer.six's AFM bundle. Word text + count + order match exactly; the drift is purely geometric. Golden tests assert text parity exactly and position parity within a 15-point envelope. AFM bundling is a v0.2.x goal.
  • Layout=true produces a structurally similar fixed-width grid but is not byte-equal to pdfplumber's extract_text(layout=True) (pdfplumber's layout output has its own version-to-version drift).

Not yet ported (documented as future work):

  • extract_text_lines (regex-based line extraction).
  • TextMap.search (regex over assembled page text with char-back-references).
  • extra_attrs beyond fontname and size.

Tests

  • 30+ unit tests across geometry_test.go, clustering_test.go, text_test.go.
  • 3 golden-file parity tests against pdfplumber output: hello.pdf, rules.pdf, simple1.pdf (in testdata/golden/). Regenerable via python scripts/gen_golden.py.
  • All v0.0.1 tests continue to pass.
  • go test ./... runs in ~4 seconds.

Test plan

  • go build ./...
  • go vet ./...
  • go test -count=1 ./... clean (final two-line output: ok github.com/hallelx2/pdftable 0.803s / ok github.com/hallelx2/pdftable/internal/pdf 2.975s).
  • Golden tests pass (go test -run TestGolden ./...).
  • No breaking changes — v0.0.1 callers compile against the extended Page interface as-is.

Summary by Sourcery

Add word-level and text extraction APIs to pages and align their behaviour with pdfplumber, including clustering and geometry helpers plus golden parity tests.

New Features:

  • Introduce Page.Words, Page.ExtractText, and Page.ExtractTextSimple APIs for word and text extraction.
  • Add Word, WordOpts, and TextOpts types with defaults mirroring pdfplumber’s behaviour.
  • Document text extraction workflows, pdfplumber parity details, and usage examples in the README.

Enhancements:

  • Implement word and text extraction pipeline, including clustering and geometry helpers for layout analysis.
  • Add extensive unit tests and golden-file parity tests to validate behaviour against pdfplumber output.
  • Update project roadmap and changelog for the v0.1.0 release, noting known limitations and future AFM metric work.

Documentation:

  • Update README with v0.1.0 status, installation snippet, API reference for new text APIs, and parity notes against pdfplumber.

Tests:

  • Add unit tests for text extraction, clustering, and geometry helpers, plus golden-file tests comparing output to pdfplumber.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added text extraction APIs: Words() extracts positioned text runs, ExtractText() provides layout-preserving text output, and ExtractTextSimple() offers a streamlined alternative.
    • Added BBox geometry type with spatial operations for bounding box calculations and containment checks.
  • Tests

    • Added comprehensive golden-file tests validating extraction parity with pdfplumber outputs.

Review Change Stack

Port pdfplumber's WordExtractor and extract_text into Go. Three new
methods on the Page interface:

- Page.Words(WordOpts)             → []Word, error
- Page.ExtractText(TextOpts)       → string, error
- Page.ExtractTextSimple(xt, yt)   → string, error

Each Word carries its bbox, font name/size, upright flag, and
direction (ltr/rtl/ttb/btt), with an optional Chars slice when
KeepChars=true.

Supporting infrastructure:

- geometry.go — BBox value type with Union/Intersect/Contains/Snap
  and MergeBBoxes helpers.
- clustering.go — 1-D agglomerative clustering primitives
  (clusterFloat1D, clusterObjects[T], groupObjectsByAttr[T,K],
  dedupeChars). Ports of pdfplumber/utils/clustering.py.
- text.go — Word + WordExtractor algorithm, dense and layout-
  preserving ExtractText paths, ligature expansion table.

The Page interface is additive: v0.0.1 callers that only use
Chars/Lines/Rects/Curves continue to compile and work unchanged.

Tests:

- geometry_test.go, clustering_test.go, text_test.go — table-
  driven unit tests for each primitive and each public entry
  point.
- golden_test.go — parity tests against pdfplumber output on three
  fixture PDFs (hello, rules, simple1). Expected outputs in
  testdata/golden/*.expected.json, regenerable via
  scripts/gen_golden.py.

Parity notes:

- Word text, count, order, and direction match pdfplumber exactly.
- Word bbox positions drift by up to ~10 PDF points on standard-14
  fonts because the AFM metrics aren't yet bundled (planned for
  v0.2.x). The golden test tolerance is 15 points to absorb this.
Copilot AI review requested due to automatic review settings May 26, 2026 22:52
@sourcery-ai
Copy link
Copy Markdown

sourcery-ai Bot commented May 26, 2026

Reviewer's Guide

Implements pdfplumber-parity word and text extraction in Go by extending the Page API with Words/ExtractText/ExtractTextSimple, adding clustering and geometry primitives to support word grouping and layout-preserving extraction, and wiring up golden tests and documentation for v0.1.0.

Sequence diagram for Page.ExtractText workflow

sequenceDiagram
    actor Client
    participant Page
    participant page
    participant clustering as clusterObjects
    participant text as extractTextFromChars
    participant layout as extractTextWithLayout

    Client->>Page: ExtractText(opts TextOpts)
    Page->>page: ExtractText(opts TextOpts)
    page->>page: applyTextOptDefaults(opts)
    page->>page: Chars() []Char

    alt [opts.Layout]
        page->>layout: extractTextWithLayout(chars, Width(), Height(), opts)
        layout->>page: text string
    else [!opts.Layout]
        page->>text: extractTextFromChars(chars, opts)
        text->>clustering: clusterObjects(words, keyFn, opts.YTolerance, false)
        clustering-->>text: lines of words
        text-->>page: text string
    end

    page-->>Client: text string
Loading

File-Level Changes

Change Details Files
Extend the public API and docs to expose word and text extraction on Page with configurable options and defaults aligned to pdfplumber.
  • Add Words, ExtractText, and ExtractTextSimple methods to the Page interface with detailed behavior docs and non-breaking semantics for existing callers
  • Introduce Word, WordOpts, and TextOpts types plus DefaultWordOpts and DefaultTextOpts constructors wired to pdfplumber-matching defaults
  • Update README with new API surface, usage examples, parity/limitations section, architecture diagram, and roadmap for v0.1.x and later
  • Bump install instructions and status text in README from v0.0.1 to v0.1.0 and document new text-extraction capabilities
  • Record v0.1.0 additions and limitations in CHANGELOG, including AFM-metrics TODO and unported text features
README.md
CHANGELOG.md
page.go
pdftable.go
Implement pdfplumber-like word extraction and text extraction (dense and layout-preserving) as methods on the concrete page implementation.
  • Add text.go with a Go port of pdfplumber’s WordExtractor and extract_text*, including Word construction, ligature expansion, punctuation splitting, direction handling, and conversion between TextOpts and WordOpts
  • Implement page.Words to fetch chars, apply defaults, and run the core word-extraction pipeline over Char slices
  • Implement page.ExtractText to choose between dense and layout-preserving paths, using page dimensions for layout grid sizing when needed
  • Implement page.ExtractTextSimple as a light-weight, heuristic-based text extractor that clusters chars into lines and joins with gap-based spacing
  • Handle coordinate-system differences (pdfplumber image space vs PDF user space) in clustering and layout to maintain expected reading order
text.go
page.go
Add reusable geometry and clustering primitives that underpin word grouping, layout reconstruction, and de-duplication of chars.
  • Introduce BBox type with helpers (NewBBox, Width/Height/Area, Union, Intersect, Contains, ContainsPoint, Snap, MergeBBoxes, BBoxOfChar, BBoxOfChars) to normalize and operate on bounding boxes
  • Port pdfplumber’s 1-D clustering utilities into clusterFloat1D and makeClusterDict, plus generic clusterObjects and groupObjectsByAttr for tolerance-based and exact-key grouping
  • Implement dedupeChars to collapse near-duplicate Char instances based on text, position, and optional extra attributes such as fontname and size
  • Expose a small float64Bits helper to keep float-key encoding centralized for clustering/deduplication logic
geometry.go
clustering.go
Add comprehensive unit and golden tests plus helper tooling to validate parity with pdfplumber and ensure geometry/clustering correctness.
  • Add text_test.go covering word grouping basics, multi-line handling, ligature expansion, punctuation splitting, KeepChars behavior, dense and layout extraction, and integration tests using a synthetic Hello, world! PDF
  • Add clustering_test.go validating 1-D clustering, clustering dictionaries, grouping semantics with and without preserveOrder, and dedupeChars behavior under different extra_attrs configurations
  • Add geometry_test.go validating BBox normalization, dimensions, union/intersection semantics (including edge-touching cases), containment, snapping, and aggregation helpers
  • Add golden_test.go (in the external pdftable_test package) that compares Words and ExtractText output to pdfplumber-generated JSON golden files with strict text/direction parity and tolerant bbox checks
  • Introduce scripts/gen_golden.py to regenerate expected JSON from pdfplumber for each fixture PDF in testdata/golden, and add initial golden JSON fixtures for hello.pdf, rules.pdf, and simple1.pdf
text_test.go
clustering_test.go
geometry_test.go
golden_test.go
scripts/gen_golden.py
testdata/golden/hello.expected.json
testdata/golden/rules.expected.json
testdata/golden/simple1.expected.json

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 26, 2026

📝 Walkthrough

Walkthrough

This PR introduces v0.1.0 text and word extraction capabilities. It adds geometry primitives (BBox type with spatial operations), clustering algorithms for character grouping, a complete text extraction pipeline supporting dense and layout-preserving modes, comprehensive test coverage, golden parity fixtures against pdfplumber, and updated documentation.

Changes

Word and Text Extraction Pipeline

Layer / File(s) Summary
Documentation and release notes
CHANGELOG.md, README.md, pdftable.go
CHANGELOG.md documents v0.1.0 release with text extraction APIs and known limitations. README expands API surface, adds quickstart examples showing Words and ExtractText, new "Text extraction" usage section, updated pdfplumber comparison, and architecture diagram. Package comments reflect text extraction inclusion.
Bounding box and spatial geometry
geometry.go, geometry_test.go
BBox type with float coordinates, normalization constructor, convenience methods (width/height/area), spatial operations (union, intersection, containment checks), snapping, and slice helpers for merging and deriving bboxes from characters. Comprehensive test coverage validates geometry across all operations.
Clustering and character deduplication
clustering.go, clustering_test.go
Tolerance-based float clustering, generic object clustering with optional order preservation, consecutive grouping by exact attribute equality, and character deduplication by position with configurable font/size attributes. Tests verify clustering semantics, deduplication with text/font sensitivity, and edge cases.
Page interface extensions
page.go
Adds three new methods to Page interface: Words(opts WordOpts) for positioned text runs, ExtractText(opts TextOpts) for full-page text with optional layout, and ExtractTextSimple(xTolerance, yTolerance) as baseline extraction.
Text extraction algorithms and types
text.go
Word type with position/direction/font metadata and Chars list. WordOpts and TextOpts structs with default constructors. Core algorithms: char deduplication/grouping, line/word clustering using direction-aware tolerance, word assembly with optional ligature expansion and punctuation splitting, dense text joining (spaces/newlines), and layout-preserving mode using fixed character grids with XDensity/YDensity. Page-level entry points with error propagation.
Text extraction tests
text_test.go
Hand-crafted unit tests for word grouping (multi-line, blank char handling, ligature expansion, punctuation splitting, char retention) and dense/layout text modes. Integration tests against inlined hello-world PDF fixture for Words, ExtractText, and ExtractTextSimple. Direction inference and default option validation. Helper functions for deterministic PDF construction and numeric comparison.
Golden fixture generation and parity testing
scripts/gen_golden.py, testdata/golden/*.expected.json, golden_test.go
Python script to generate golden fixtures from PDFs using pdfplumber with coordinate conversion (image-space → PDF user-space). Three test documents (hello, rules, simple1) with expected extraction outputs including text, word coordinates, direction, and upright flags. Golden test suite enumerates expected fixtures, validates page dimensions and word extraction with ±15pt position tolerance and exact text matching, plus validates dense text output by whitespace-splitting word sequences.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Through PDFs we hop and bound,
Words and boxes all around!
Clustering chars in perfect rows,
Golden fixtures verify our flows—
Text extraction complete at last! 📖

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 52.38% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main feature addition: v0.1.0 release with Words and ExtractText APIs implementing Phase 1.3.B text extraction.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/words-and-text-extract

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@sourcery-ai sourcery-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 5 issues, and left some high level feedback:

  • The option defaulting logic for tolerances (e.g., applyWordOptDefaults, applyTextOptDefaults, and textOptsToWordOpts) treats 0 as "use default", which makes it impossible for callers to intentionally request a true zero-tolerance behaviour (even though lower-level clustering supports a special tolerance == 0 semantics); consider either using a separate sentinel (e.g., negative) or a *float64/nil-means-default pattern so 0 can be expressed explicitly.
  • The key-building helpers in extractWordsFromChars/dedupeChars (keyOf functions) allocate fresh []byte and strings on every call and re-run even inside tight loops (e.g., the sort.SliceStable comparator in dedupeChars recomputes keys repeatedly); if profiling shows this as hot, consider caching keys per element or precomputing them once to reduce per-char allocations.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The option defaulting logic for tolerances (e.g., `applyWordOptDefaults`, `applyTextOptDefaults`, and `textOptsToWordOpts`) treats `0` as "use default", which makes it impossible for callers to intentionally request a true zero-tolerance behaviour (even though lower-level clustering supports a special `tolerance == 0` semantics); consider either using a separate sentinel (e.g., negative) or a `*float64`/`nil`-means-default pattern so `0` can be expressed explicitly.
- The key-building helpers in `extractWordsFromChars`/`dedupeChars` (`keyOf` functions) allocate fresh `[]byte` and strings on every call and re-run even inside tight loops (e.g., the `sort.SliceStable` comparator in `dedupeChars` recomputes keys repeatedly); if profiling shows this as hot, consider caching keys per element or precomputing them once to reduce per-char allocations.

## Individual Comments

### Comment 1
<location path="clustering.go" line_range="286" />
<code_context>
+// Float64bits dependency so other tests/files don't have to import
+// "math" just to compare floats. Left as a package-private helper for
+// now — only dedupeChars's key construction uses it.
+func float64Bits(f float64) uint64 { return math.Float64bits(f) }
</code_context>
<issue_to_address>
**nitpick:** float64Bits helper is currently unused and adds unnecessary surface area.

Given it isn’t used, this is effectively dead code and expands the package surface unnecessarily. Unless there’s a concrete near-term use, consider removing it and calling `math.Float64bits` directly where needed instead.
</issue_to_address>

### Comment 2
<location path="text_test.go" line_range="342" />
<code_context>
+	}
+}
+
+func TestDirectionFor(t *testing.T) {
+	tests := []struct {
+		upright, ltr, ttb bool
</code_context>
<issue_to_address>
**suggestion (testing):** Add tests that exercise vertical and rotated text flows (ttb/btt) end-to-end

Existing tests only cover upright horizontal (LTR) text. Since mergeLineIntoWords/charBeginsNewWord/sortCharsByDir have direction-specific branches for rotated text (Upright=false, TTB/BTT), please add tests that construct rotated Char slices and verify resulting Words’ Direction and grouping. For instance, use a synthetic top-to-bottom column and assert word order and behavior when toggling VerticalTTB/HorizontalLTR, so the vertical/rotated paths are exercised and protected against regressions.
</issue_to_address>

### Comment 3
<location path="text_test.go" line_range="325" />
<code_context>
+	}
+}
+
+func TestPageExtractTextSimple(t *testing.T) {
+	doc, err := openHelloWorldDoc()
+	if err != nil {
</code_context>
<issue_to_address>
**suggestion (testing):** Strengthen coverage of ExtractTextSimple with unit-style Char slices and edge cases

ExtractTextSimple has bespoke behavior (dropping empty-text chars, gap-based space insertion, defaulting tolerances, handling empty pages), but it’s only tested via a full PDF and substring checks. Please add unit-style tests that exercise the core logic directly (e.g., via a helper or by factoring it out) with hand-crafted Char slices, and assert exact output for key cases: no chars, empty Text chars, explicit spaces vs gap-inferred spaces, and xTolerance/yTolerance = 0 vs non-zero. This will better lock in the intended semantics and protect future refactors.

Suggested implementation:

```golang
func TestExtractTextSimple_EmptyChars(t *testing.T) {
	// Empty input should yield empty output without error.
	got := extractTextSimple(nil, DefaultTextOpts())
	if got != "" {
		t.Fatalf("extractTextSimple(nil) = %q, want empty string", got)
	}

	got = extractTextSimple([]Char{}, DefaultTextOpts())
	if got != "" {
		t.Fatalf("extractTextSimple([]Char{}) = %q, want empty string", got)
	}
}

func TestExtractTextSimple_DropsEmptyTextChars(t *testing.T) {
	opts := DefaultTextOpts()

	chars := []Char{
		{Text: ""},
		{Text: "A"},
		{Text: ""},
		{Text: "B"},
		{Text: ""},
	}

	got := extractTextSimple(chars, opts)
	want := "AB"
	if got != want {
		t.Fatalf("extractTextSimple dropped-empty-text: got %q, want %q", got, want)
	}
}

func TestExtractTextSimple_ExplicitSpacesVsGapSpaces(t *testing.T) {
	// This test assumes ExtractTextSimple:
	//   * keeps explicit space characters as-is
	//   * inserts spaces when the gap between consecutive chars exceeds xTolerance
	opts := DefaultTextOpts()
	opts.XTolerance = 2.0

	chars := []Char{
		// "A B" with explicit space
		{Text: "A", X: 0, Y: 0},
		{Text: " ", X: 1, Y: 0},
		{Text: "B", X: 2, Y: 0},

		// large gap should cause an inserted space between C and D
		{Text: "C", X: 10, Y: 0},
		{Text: "D", X: 20, Y: 0},
	}

	got := extractTextSimple(chars, opts)

	// We expect both the explicit space (between A and B) and the gap-based space (between C and D).
	if !strings.Contains(got, "A B") {
		t.Fatalf("extractTextSimple explicit-space: got %q, want to contain %q", got, "A B")
	}
	if !strings.Contains(got, "C D") {
		t.Fatalf("extractTextSimple gap-space: got %q, want to contain %q", got, "C D")
	}
}

func TestExtractTextSimple_XToleranceZero(t *testing.T) {
	// With xTolerance = 0, only explicit spaces should appear; no gap-based insertion.
	opts := DefaultTextOpts()
	opts.XTolerance = 0

	chars := []Char{
		{Text: "A", X: 0, Y: 0},
		{Text: "B", X: 10, Y: 0}, // large gap, but xTolerance == 0
	}

	got := extractTextSimple(chars, opts)
	want := "AB" // no gap-based space
	if got != want {
		t.Fatalf("extractTextSimple xTolerance=0: got %q, want %q", got, want)
	}
}

func TestExtractTextSimple_YToleranceGrouping(t *testing.T) {
	// This test exercises how yTolerance affects line grouping.
	opts := DefaultTextOpts()
	opts.YTolerance = 1.0

	chars := []Char{
		// "AB" on first line
		{Text: "A", X: 0, Y: 0},
		{Text: "B", X: 1, Y: 0},

		// "CD" on a second line outside yTolerance
		{Text: "C", X: 0, Y: 3},
		{Text: "D", X: 1, Y: 3},
	}

	got := extractTextSimple(chars, opts)

	// We expect a line break between "AB" and "CD".
	if !strings.Contains(got, "AB") || !strings.Contains(got, "CD") {
		t.Fatalf("extractTextSimple yTolerance: got %q, want to contain both %q and %q", got, "AB", "CD")
	}
	if !strings.Contains(got, "AB\nCD") && !strings.Contains(got, "AB\r\nCD") {
		t.Fatalf("extractTextSimple yTolerance: got %q, want line break between lines", got)
	}
}

```

I assumed there is an internal helper `extractTextSimple(chars []Char, opts TextOptions) string` and a `Char` type with at least `Text string`, `X float64`, and `Y float64` fields, plus `DefaultTextOpts()` returning a struct that includes `XTolerance` and `YTolerance`. If the actual API differs, you will need to:

1. Adjust the helper calls to the real function that implements ExtractTextSimple’s core logic (or factor such a helper out from the method under test into the package).
2. Update the `Char` field initializations in the tests to match the real `Char` struct (e.g., use `BBox`/`Point`/`Pos` fields instead of `X`/`Y` if necessary).
3. If `extractTextSimple` returns `(string, error)` instead of just `string`, capture and assert `err == nil` before checking `got`.
4. Place these tests near your existing `TestPageExtractTextSimple` (or update/replace that test if you prefer these unit-style cases instead of the current PDF-based one).
</issue_to_address>

### Comment 4
<location path="clustering_test.go" line_range="89" />
<code_context>
+
+// TestClusterObjects exercises both preserveOrder modes on a tiny set
+// of struct-valued inputs.
+func TestClusterObjects(t *testing.T) {
+	type pt struct {
+		x   float64
</code_context>
<issue_to_address>
**suggestion (testing):** Add tests for clusterObjects and dedupeChars with zero/negative tolerances

One remaining gap is how clusterObjects and dedupeChars behave when tolerance is 0 (or negative) in their own APIs. Please add:
- A test for clusterObjects with tolerance==0 that asserts the expected grouping (identical vs near-equal values).
- A test for dedupeChars with tolerance==0 that confirms whether only exact coordinate matches are removed.
This will lock in the intended contract and protect against subtle regressions around tolerance handling.

Suggested implementation:

```golang
 // TestClusterObjects exercises both preserveOrder modes on a tiny set
 // of struct-valued inputs.
 func TestClusterObjects(t *testing.T) {
 	type pt struct {
 		x   float64
 		tag string
 	}
 	xs := []pt{
 		{x: 1, tag: "a"},
 		{x: 10, tag: "b"},
 		{x: 2, tag: "c"},
 		{x: 11, tag: "d"},
 	}

 	// tolerance == 0: identical keys should cluster, distinct keys should not.
 	zeroTol := 0.0
 	xsZeroTol := []pt{
 		{x: 1, tag: "a1"},
 		{x: 1, tag: "a2"},        // identical to a1, should be in same cluster at tol=0
 		{x: 1.000001, tag: "a3"}, // near-equal but not identical, should be in a different cluster at tol=0
 	}
 	clustersZero := clusterObjects(xsZeroTol, func(p pt) float64 { return p.x }, zeroTol, false)

 	if len(clustersZero) != len(xsZeroTol) {
 		t.Fatalf("clusterObjects(tol=0) returned %d labels for %d inputs", len(clustersZero), len(xsZeroTol))
 	}
 	if clustersZero[0] != clustersZero[1] {
 		t.Errorf("expected identical keys (indices 0 and 1) to be in same cluster at tol=0, got %d vs %d", clustersZero[0], clustersZero[1])
 	}
 	if clustersZero[0] == clustersZero[2] {
 		t.Errorf("expected near-equal but non-identical keys (indices 0 and 2) to be in different clusters at tol=0, both got %d", clustersZero[0])
 	}

 	// preserveOrder=false: clusters sorted by key, items sorted within

```

To fully implement your suggestion, you’ll also want to add a dedicated test for `dedupeChars` with `tolerance == 0`. I can’t see the definition/signature of `dedupeChars` or the concrete character type it operates on, so the exact code needs to be aligned with your existing types, but the intended structure is:

1. Identify the character type used by `dedupeChars` (for example, `char`, `glyph`, or similar), including whatever fields carry the coordinates used for deduping (e.g. `x`, `y`, or a `pt`/`pos` field).
2. Add a new test function, e.g. `func TestDedupeCharsZeroTolerance(t *testing.T)`, in `clustering_test.go`.
3. In that test:
   - Construct an input slice like:
     - Two characters with exactly identical coordinates and other relevant fields.
     - One or more characters with very close but not bitwise-identical coordinates.
   - Call `dedupeChars` with `tolerance == 0`.
   - Assert that:
     - Only one of the exactly identical characters remains.
     - None of the near-equal (but not identical) characters are removed.
4. Place this new test after your existing `TestDedupeChars` (if present) to keep related tests grouped.

Conceptually, the test body will look like:

```go
func TestDedupeCharsZeroTolerance(t *testing.T) {
    chars := []YourCharType{
        {/* coord: (10, 20), rune: 'a' */},
        {/* coord: (10, 20), rune: 'a' */},       // exact duplicate
        {/* coord: (10.000001, 20), rune: 'a' */}, // near-duplicate
    }

    got := dedupeChars(chars, 0)

    // Assert that only one exact duplicate remains, and
    // the near-duplicate is still present.
}
```

You’ll need to replace `YourCharType` and the field initializers with whatever is actually used in your package and wired into `dedupeChars`.
</issue_to_address>

### Comment 5
<location path="clustering.go" line_range="200" />
<code_context>
+// The output is in the SAME ORDER as the input — the first occurrence
+// of each cluster is kept and subsequent duplicates are dropped. This
+// preserves content-stream order, which downstream code may rely on.
+func dedupeChars(chars []Char, tolerance float64, extraAttrs []string) []Char {
+	if len(chars) == 0 {
+		return nil
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the `dedupeChars` implementation by precomputing per-char keys, using a lightweight local coordinate-clustering helper instead of the generic clustering pipeline, and removing the unused `float64Bits` wrapper.

You can simplify the hot path in `dedupeChars` without changing behavior, and also drop some indirection that’s not paying for itself.

### 1. Avoid recomputing the string key for each comparison

Right now `keyOf` is called repeatedly:

- Inside the `sort.SliceStable` comparator.
- Inside the outer `for` that walks runs of equal keys.

That makes the grouping logic harder to follow and less efficient. Precompute the key once per char and carry it around:

```go
func dedupeChars(chars []Char, tolerance float64, extraAttrs []string) []Char {
	if len(chars) == 0 {
		return nil
	}

	buildKey := func(c Char) string {
		buf := make([]byte, 0, 32+len(c.Text)+len(c.FontName))
		if c.Upright {
			buf = append(buf, 'U')
		} else {
			buf = append(buf, 'u')
		}
		buf = append(buf, '\x00')
		buf = append(buf, c.Text...)
		for _, attr := range extraAttrs {
			buf = append(buf, '\x00')
			switch attr {
			case "fontname":
				buf = append(buf, c.FontName...)
			case "size":
				bits := math.Float64bits(c.FontSize)
				for i := 7; i >= 0; i-- {
					buf = append(buf, byte(bits>>(i*8)))
				}
			}
		}
		return string(buf)
	}

	type indexed struct {
		c   Char
		idx int
		key string
	}
	sorted := make([]indexed, len(chars))
	for i, c := range chars {
		sorted[i] = indexed{c: c, idx: i, key: buildKey(c)}
	}

	// Sort once by the precomputed key.
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].key < sorted[j].key
	})

	keepIdx := make(map[int]struct{}, len(chars))

	// Walk equal-key runs without recomputing keys.
	for i := 0; i < len(sorted); {
		j := i + 1
		k := sorted[i].key
		for j < len(sorted) && sorted[j].key == k {
			j++
		}
		run := sorted[i:j]

		// ... inner clustering on run (see next section) ...

		i = j
	}

	// unchanged: build output using keepIdx
}
```

This makes the equivalence relation explicit (`key` field) and the logic easier to audit.

### 2. Use a simpler clustering helper for dedupe (avoid `makeClusterDict` indirection)

Within each equal-key run, you don’t need the full `clusterObjects` pipeline (which goes through `makeClusterDict` + map + extra allocation). For dedupe, a straightforward “sort by coord + linear scan” per dimension is enough and simpler to read.

You can keep the generic helper *local* to this file and use it only from `dedupeChars`, while leaving `clusterObjects` untouched for other callsites:

```go
// clusterByCoord clusters a slice by a float coordinate extracted by keyFn.
// It sorts by the coordinate and then groups consecutive items whose
// coordinate differs by <= tol.
func clusterByCoord[T any](xs []T, keyFn func(T) float64, tol float64) [][]T {
	if len(xs) == 0 {
		return nil
	}
	sort.Slice(xs, func(i, j int) bool {
		return keyFn(xs[i]) < keyFn(xs[j])
	})
	var out [][]T
	current := []T{xs[0]}
	last := keyFn(xs[0])
	for _, v := range xs[1:] {
		k := keyFn(v)
		if k <= last+tol {
			current = append(current, v)
		} else {
			out = append(out, current)
			current = []T{v}
		}
		last = k
	}
	out = append(out, current)
	return out
}
```

Then use it in `dedupeChars` instead of nested `clusterObjects`:

```go
// inside dedupeChars, after computing `run := sorted[i:j]`:

yClusters := clusterByCoord(run, func(e indexed) float64 { return e.c.Y0 }, tolerance)
for _, yc := range yClusters {
	xClusters := clusterByCoord(yc, func(e indexed) float64 { return e.c.X0 }, tolerance)
	for _, xc := range xClusters {
		minIdx := xc[0].idx
		for _, e := range xc[1:] {
			if e.idx < minIdx {
				minIdx = e.idx
			}
		}
		keepIdx[minIdx] = struct{}{}
	}
}
```

Behaviorally this matches the existing logic:

- `clusterByCoord` sorts by Y0/X0 and does a simple `delta <= tolerance` agglomeration (same as `clusterFloat1D`).
- You still choose the smallest original index within each (Y, X) bucket, preserving “first occurrence wins”.

But the dedupe path no longer depends on:

- `makeClusterDict`
- `clusterObjects`
- `clusterFloat1D`/map rebuilds for each run

which significantly reduces the conceptual depth of this function.

### 3. Remove the unused `float64Bits` wrapper

`float64Bits` is defined but not used; it adds a name without simplifying anything:

```go
// currently:
func float64Bits(f float64) uint64 { return math.Float64bits(f) }
```

You can safely delete this function. If you later find multiple callsites wanting stable float serialization for keys, you could instead introduce a more descriptive helper (e.g. the `buildKey` logic above already shows the pattern).
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread clustering.go
// Float64bits dependency so other tests/files don't have to import
// "math" just to compare floats. Left as a package-private helper for
// now — only dedupeChars's key construction uses it.
func float64Bits(f float64) uint64 { return math.Float64bits(f) }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: float64Bits helper is currently unused and adds unnecessary surface area.

Given it isn’t used, this is effectively dead code and expands the package surface unnecessarily. Unless there’s a concrete near-term use, consider removing it and calling math.Float64bits directly where needed instead.

Comment thread text_test.go
}
}

func TestDirectionFor(t *testing.T) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add tests that exercise vertical and rotated text flows (ttb/btt) end-to-end

Existing tests only cover upright horizontal (LTR) text. Since mergeLineIntoWords/charBeginsNewWord/sortCharsByDir have direction-specific branches for rotated text (Upright=false, TTB/BTT), please add tests that construct rotated Char slices and verify resulting Words’ Direction and grouping. For instance, use a synthetic top-to-bottom column and assert word order and behavior when toggling VerticalTTB/HorizontalLTR, so the vertical/rotated paths are exercised and protected against regressions.

Comment thread text_test.go
}
}

func TestPageExtractTextSimple(t *testing.T) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Strengthen coverage of ExtractTextSimple with unit-style Char slices and edge cases

ExtractTextSimple has bespoke behavior (dropping empty-text chars, gap-based space insertion, defaulting tolerances, handling empty pages), but it’s only tested via a full PDF and substring checks. Please add unit-style tests that exercise the core logic directly (e.g., via a helper or by factoring it out) with hand-crafted Char slices, and assert exact output for key cases: no chars, empty Text chars, explicit spaces vs gap-inferred spaces, and xTolerance/yTolerance = 0 vs non-zero. This will better lock in the intended semantics and protect future refactors.

Suggested implementation:

func TestExtractTextSimple_EmptyChars(t *testing.T) {
	// Empty input should yield empty output without error.
	got := extractTextSimple(nil, DefaultTextOpts())
	if got != "" {
		t.Fatalf("extractTextSimple(nil) = %q, want empty string", got)
	}

	got = extractTextSimple([]Char{}, DefaultTextOpts())
	if got != "" {
		t.Fatalf("extractTextSimple([]Char{}) = %q, want empty string", got)
	}
}

func TestExtractTextSimple_DropsEmptyTextChars(t *testing.T) {
	opts := DefaultTextOpts()

	chars := []Char{
		{Text: ""},
		{Text: "A"},
		{Text: ""},
		{Text: "B"},
		{Text: ""},
	}

	got := extractTextSimple(chars, opts)
	want := "AB"
	if got != want {
		t.Fatalf("extractTextSimple dropped-empty-text: got %q, want %q", got, want)
	}
}

func TestExtractTextSimple_ExplicitSpacesVsGapSpaces(t *testing.T) {
	// This test assumes ExtractTextSimple:
	//   * keeps explicit space characters as-is
	//   * inserts spaces when the gap between consecutive chars exceeds xTolerance
	opts := DefaultTextOpts()
	opts.XTolerance = 2.0

	chars := []Char{
		// "A B" with explicit space
		{Text: "A", X: 0, Y: 0},
		{Text: " ", X: 1, Y: 0},
		{Text: "B", X: 2, Y: 0},

		// large gap should cause an inserted space between C and D
		{Text: "C", X: 10, Y: 0},
		{Text: "D", X: 20, Y: 0},
	}

	got := extractTextSimple(chars, opts)

	// We expect both the explicit space (between A and B) and the gap-based space (between C and D).
	if !strings.Contains(got, "A B") {
		t.Fatalf("extractTextSimple explicit-space: got %q, want to contain %q", got, "A B")
	}
	if !strings.Contains(got, "C D") {
		t.Fatalf("extractTextSimple gap-space: got %q, want to contain %q", got, "C D")
	}
}

func TestExtractTextSimple_XToleranceZero(t *testing.T) {
	// With xTolerance = 0, only explicit spaces should appear; no gap-based insertion.
	opts := DefaultTextOpts()
	opts.XTolerance = 0

	chars := []Char{
		{Text: "A", X: 0, Y: 0},
		{Text: "B", X: 10, Y: 0}, // large gap, but xTolerance == 0
	}

	got := extractTextSimple(chars, opts)
	want := "AB" // no gap-based space
	if got != want {
		t.Fatalf("extractTextSimple xTolerance=0: got %q, want %q", got, want)
	}
}

func TestExtractTextSimple_YToleranceGrouping(t *testing.T) {
	// This test exercises how yTolerance affects line grouping.
	opts := DefaultTextOpts()
	opts.YTolerance = 1.0

	chars := []Char{
		// "AB" on first line
		{Text: "A", X: 0, Y: 0},
		{Text: "B", X: 1, Y: 0},

		// "CD" on a second line outside yTolerance
		{Text: "C", X: 0, Y: 3},
		{Text: "D", X: 1, Y: 3},
	}

	got := extractTextSimple(chars, opts)

	// We expect a line break between "AB" and "CD".
	if !strings.Contains(got, "AB") || !strings.Contains(got, "CD") {
		t.Fatalf("extractTextSimple yTolerance: got %q, want to contain both %q and %q", got, "AB", "CD")
	}
	if !strings.Contains(got, "AB\nCD") && !strings.Contains(got, "AB\r\nCD") {
		t.Fatalf("extractTextSimple yTolerance: got %q, want line break between lines", got)
	}
}

I assumed there is an internal helper extractTextSimple(chars []Char, opts TextOptions) string and a Char type with at least Text string, X float64, and Y float64 fields, plus DefaultTextOpts() returning a struct that includes XTolerance and YTolerance. If the actual API differs, you will need to:

  1. Adjust the helper calls to the real function that implements ExtractTextSimple’s core logic (or factor such a helper out from the method under test into the package).
  2. Update the Char field initializations in the tests to match the real Char struct (e.g., use BBox/Point/Pos fields instead of X/Y if necessary).
  3. If extractTextSimple returns (string, error) instead of just string, capture and assert err == nil before checking got.
  4. Place these tests near your existing TestPageExtractTextSimple (or update/replace that test if you prefer these unit-style cases instead of the current PDF-based one).

Comment thread clustering_test.go

// TestClusterObjects exercises both preserveOrder modes on a tiny set
// of struct-valued inputs.
func TestClusterObjects(t *testing.T) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (testing): Add tests for clusterObjects and dedupeChars with zero/negative tolerances

One remaining gap is how clusterObjects and dedupeChars behave when tolerance is 0 (or negative) in their own APIs. Please add:

  • A test for clusterObjects with tolerance==0 that asserts the expected grouping (identical vs near-equal values).
  • A test for dedupeChars with tolerance==0 that confirms whether only exact coordinate matches are removed.
    This will lock in the intended contract and protect against subtle regressions around tolerance handling.

Suggested implementation:

 // TestClusterObjects exercises both preserveOrder modes on a tiny set
 // of struct-valued inputs.
 func TestClusterObjects(t *testing.T) {
 	type pt struct {
 		x   float64
 		tag string
 	}
 	xs := []pt{
 		{x: 1, tag: "a"},
 		{x: 10, tag: "b"},
 		{x: 2, tag: "c"},
 		{x: 11, tag: "d"},
 	}

 	// tolerance == 0: identical keys should cluster, distinct keys should not.
 	zeroTol := 0.0
 	xsZeroTol := []pt{
 		{x: 1, tag: "a1"},
 		{x: 1, tag: "a2"},        // identical to a1, should be in same cluster at tol=0
 		{x: 1.000001, tag: "a3"}, // near-equal but not identical, should be in a different cluster at tol=0
 	}
 	clustersZero := clusterObjects(xsZeroTol, func(p pt) float64 { return p.x }, zeroTol, false)

 	if len(clustersZero) != len(xsZeroTol) {
 		t.Fatalf("clusterObjects(tol=0) returned %d labels for %d inputs", len(clustersZero), len(xsZeroTol))
 	}
 	if clustersZero[0] != clustersZero[1] {
 		t.Errorf("expected identical keys (indices 0 and 1) to be in same cluster at tol=0, got %d vs %d", clustersZero[0], clustersZero[1])
 	}
 	if clustersZero[0] == clustersZero[2] {
 		t.Errorf("expected near-equal but non-identical keys (indices 0 and 2) to be in different clusters at tol=0, both got %d", clustersZero[0])
 	}

 	// preserveOrder=false: clusters sorted by key, items sorted within

To fully implement your suggestion, you’ll also want to add a dedicated test for dedupeChars with tolerance == 0. I can’t see the definition/signature of dedupeChars or the concrete character type it operates on, so the exact code needs to be aligned with your existing types, but the intended structure is:

  1. Identify the character type used by dedupeChars (for example, char, glyph, or similar), including whatever fields carry the coordinates used for deduping (e.g. x, y, or a pt/pos field).
  2. Add a new test function, e.g. func TestDedupeCharsZeroTolerance(t *testing.T), in clustering_test.go.
  3. In that test:
    • Construct an input slice like:
      • Two characters with exactly identical coordinates and other relevant fields.
      • One or more characters with very close but not bitwise-identical coordinates.
    • Call dedupeChars with tolerance == 0.
    • Assert that:
      • Only one of the exactly identical characters remains.
      • None of the near-equal (but not identical) characters are removed.
  4. Place this new test after your existing TestDedupeChars (if present) to keep related tests grouped.

Conceptually, the test body will look like:

func TestDedupeCharsZeroTolerance(t *testing.T) {
    chars := []YourCharType{
        {/* coord: (10, 20), rune: 'a' */},
        {/* coord: (10, 20), rune: 'a' */},       // exact duplicate
        {/* coord: (10.000001, 20), rune: 'a' */}, // near-duplicate
    }

    got := dedupeChars(chars, 0)

    // Assert that only one exact duplicate remains, and
    // the near-duplicate is still present.
}

You’ll need to replace YourCharType and the field initializers with whatever is actually used in your package and wired into dedupeChars.

Comment thread clustering.go
// The output is in the SAME ORDER as the input — the first occurrence
// of each cluster is kept and subsequent duplicates are dropped. This
// preserves content-stream order, which downstream code may rely on.
func dedupeChars(chars []Char, tolerance float64, extraAttrs []string) []Char {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (complexity): Consider simplifying the dedupeChars implementation by precomputing per-char keys, using a lightweight local coordinate-clustering helper instead of the generic clustering pipeline, and removing the unused float64Bits wrapper.

You can simplify the hot path in dedupeChars without changing behavior, and also drop some indirection that’s not paying for itself.

1. Avoid recomputing the string key for each comparison

Right now keyOf is called repeatedly:

  • Inside the sort.SliceStable comparator.
  • Inside the outer for that walks runs of equal keys.

That makes the grouping logic harder to follow and less efficient. Precompute the key once per char and carry it around:

func dedupeChars(chars []Char, tolerance float64, extraAttrs []string) []Char {
	if len(chars) == 0 {
		return nil
	}

	buildKey := func(c Char) string {
		buf := make([]byte, 0, 32+len(c.Text)+len(c.FontName))
		if c.Upright {
			buf = append(buf, 'U')
		} else {
			buf = append(buf, 'u')
		}
		buf = append(buf, '\x00')
		buf = append(buf, c.Text...)
		for _, attr := range extraAttrs {
			buf = append(buf, '\x00')
			switch attr {
			case "fontname":
				buf = append(buf, c.FontName...)
			case "size":
				bits := math.Float64bits(c.FontSize)
				for i := 7; i >= 0; i-- {
					buf = append(buf, byte(bits>>(i*8)))
				}
			}
		}
		return string(buf)
	}

	type indexed struct {
		c   Char
		idx int
		key string
	}
	sorted := make([]indexed, len(chars))
	for i, c := range chars {
		sorted[i] = indexed{c: c, idx: i, key: buildKey(c)}
	}

	// Sort once by the precomputed key.
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].key < sorted[j].key
	})

	keepIdx := make(map[int]struct{}, len(chars))

	// Walk equal-key runs without recomputing keys.
	for i := 0; i < len(sorted); {
		j := i + 1
		k := sorted[i].key
		for j < len(sorted) && sorted[j].key == k {
			j++
		}
		run := sorted[i:j]

		// ... inner clustering on run (see next section) ...

		i = j
	}

	// unchanged: build output using keepIdx
}

This makes the equivalence relation explicit (key field) and the logic easier to audit.

2. Use a simpler clustering helper for dedupe (avoid makeClusterDict indirection)

Within each equal-key run, you don’t need the full clusterObjects pipeline (which goes through makeClusterDict + map + extra allocation). For dedupe, a straightforward “sort by coord + linear scan” per dimension is enough and simpler to read.

You can keep the generic helper local to this file and use it only from dedupeChars, while leaving clusterObjects untouched for other callsites:

// clusterByCoord clusters a slice by a float coordinate extracted by keyFn.
// It sorts by the coordinate and then groups consecutive items whose
// coordinate differs by <= tol.
func clusterByCoord[T any](xs []T, keyFn func(T) float64, tol float64) [][]T {
	if len(xs) == 0 {
		return nil
	}
	sort.Slice(xs, func(i, j int) bool {
		return keyFn(xs[i]) < keyFn(xs[j])
	})
	var out [][]T
	current := []T{xs[0]}
	last := keyFn(xs[0])
	for _, v := range xs[1:] {
		k := keyFn(v)
		if k <= last+tol {
			current = append(current, v)
		} else {
			out = append(out, current)
			current = []T{v}
		}
		last = k
	}
	out = append(out, current)
	return out
}

Then use it in dedupeChars instead of nested clusterObjects:

// inside dedupeChars, after computing `run := sorted[i:j]`:

yClusters := clusterByCoord(run, func(e indexed) float64 { return e.c.Y0 }, tolerance)
for _, yc := range yClusters {
	xClusters := clusterByCoord(yc, func(e indexed) float64 { return e.c.X0 }, tolerance)
	for _, xc := range xClusters {
		minIdx := xc[0].idx
		for _, e := range xc[1:] {
			if e.idx < minIdx {
				minIdx = e.idx
			}
		}
		keepIdx[minIdx] = struct{}{}
	}
}

Behaviorally this matches the existing logic:

  • clusterByCoord sorts by Y0/X0 and does a simple delta <= tolerance agglomeration (same as clusterFloat1D).
  • You still choose the smallest original index within each (Y, X) bucket, preserving “first occurrence wins”.

But the dedupe path no longer depends on:

  • makeClusterDict
  • clusterObjects
  • clusterFloat1D/map rebuilds for each run

which significantly reduces the conceptual depth of this function.

3. Remove the unused float64Bits wrapper

float64Bits is defined but not used; it adds a name without simplifying anything:

// currently:
func float64Bits(f float64) uint64 { return math.Float64bits(f) }

You can safely delete this function. If you later find multiple callsites wanting stable float serialization for keys, you could instead introduce a more descriptive helper (e.g. the buildKey logic above already shows the pattern).

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
README.md (1)

217-237: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Define the must helper or remove it.

Line 236 uses an undefined must helper function. Users copying this example will encounter a compilation error. Either define must (e.g., func must(s string, _ error) string { return s }) in the example, or replace it with explicit error handling as shown in the quickstart above.

📝 Proposed fix: remove the must helper
-fmt.Println(must(page.ExtractText(pdftable.DefaultTextOpts())))
+text, _ := page.ExtractText(pdftable.DefaultTextOpts())
+fmt.Println(text)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 217 - 237, The README example calls an undefined
helper must around page.ExtractText(pdftable.DefaultTextOpts()), causing a
compile error; fix by either adding a small helper func must(s string, err
error) string that returns s or by replacing the must call with explicit error
handling: call page.ExtractText(pdftable.DefaultTextOpts()), check the returned
error, handle/log/exit on error and then print the returned string. Update the
example to reference the chosen approach and ensure the symbol names (must,
page.ExtractText, pdftable.DefaultTextOpts) are used correctly.
page.go (1)

31-31: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid source-breaking by expanding Page

  • page.go expands the exported Page interface with Words, ExtractText, and ExtractTextSimple; in Go, adding methods to a public interface is source-breaking for any downstream type/mocks that already implement pdftable.Page.
  • The repo’s “Page interface is additive” wording covers callers that only use the interface, not external implementers. If third-party implementations are expected, keep Page stable and add a separate extended interface (e.g., TextPage embedding Page).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@page.go` at line 31, The change expands the exported Page interface by adding
Words, ExtractText, and ExtractTextSimple which is source-breaking for external
implementers; instead, keep the existing Page unchanged and introduce a new
extended interface (e.g., TextPage) that embeds Page and declares Words,
ExtractText, and ExtractTextSimple, then update internal code paths that need
text features to use TextPage while leaving Page consumers/implementers intact.
🧹 Nitpick comments (2)
text.go (2)

795-799: 💤 Low value

Dead code: no-op space assignment.

This block checks if a position is already a space, then sets it to a space—a no-op. The comment suggests separator insertion intent, but since the grid is initialized with spaces, this code has no effect.

♻️ Remove the dead code
             col++
         }
-        // Insert a separator space if there's room — but only if
-        // the next position isn't already non-blank.
-        if col < widthChars && rows[row][col] == ' ' {
-            rows[row][col] = ' '
-        }
     }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@text.go` around lines 795 - 799, Remove the no-op separator assignment: the
if block that checks "if col < widthChars && rows[row][col] == ' ' {
rows[row][col] = ' ' }" is dead code (grid is initialized with spaces) and
should be deleted; locate the code using the identifiers rows, row, col, and
widthChars and remove that conditional and its body (or, if the intended
behavior was to insert a separator when the cell is not blank, change the
condition to rows[row][col] != ' ' and set it to ' ' instead).

282-305: 💤 Low value

Ineffectual assignment flagged by static analysis.

The assignment filtered := chars at line 283 is never used—both branches of the if/else overwrite it. This can be simplified into a single filtering loop.

♻️ Suggested simplification
 // Step 1/2: filter blanks (unless KeepBlankChars) and empties.
-filtered := chars
-if !opts.KeepBlankChars {
-    out := make([]Char, 0, len(chars))
-    for _, c := range chars {
-        if c.Text == "" {
-            continue
-        }
-        if isAllSpace(c.Text) {
-            continue
-        }
-        out = append(out, c)
-    }
-    filtered = out
-} else {
-    out := make([]Char, 0, len(chars))
-    for _, c := range chars {
-        if c.Text == "" {
-            continue
-        }
-        out = append(out, c)
-    }
-    filtered = out
-}
+filtered := make([]Char, 0, len(chars))
+for _, c := range chars {
+    if c.Text == "" {
+        continue
+    }
+    if !opts.KeepBlankChars && isAllSpace(c.Text) {
+        continue
+    }
+    filtered = append(filtered, c)
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@text.go` around lines 282 - 305, The preallocated assignment filtered :=
chars is ineffectual because both branches replace it; remove that assignment
and replace the duplicated branches with a single loop that builds out :=
make([]Char, 0, len(chars)) iterating over chars and applying the two checks:
always skip empty Text (c.Text == ""), and conditionally skip all-space tokens
by calling isAllSpace(c.Text) only when opts.KeepBlankChars is false; assign
filtered = out at the end (use the existing types Char, variable chars, and
option opts.KeepBlankChars).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@clustering.go`:
- Around line 282-286: The helper function float64Bits is unused (lint error);
remove the unused function float64Bits (the small wrapper around
math.Float64bits) from the file, or if intended to be used, replace direct calls
to math.Float64bits in dedupeChars's key construction with float64Bits to make
it referenced; prefer deleting the float64Bits function to satisfy static
analysis if no call sites are added.

In `@geometry.go`:
- Around line 75-78: The comment above the Intersect logic incorrectly states
that shared-edge (zero-area) cases are treated as non-overlap while the
Intersect function and tests treat edge-touch (w>0,h=0 or w=0,h>0) as
overlapping; update the documentation comment for the Intersect function in
geometry.go to accurately describe the implemented behavior (edge-touch where
one dimension is zero and the other >0 is considered an intersection) and
mention the specific condition used (w>0 || h>0) so callers are not misled.

In `@text_test.go`:
- Line 260: Replace the naked defer doc.Close() calls with a deferred closure
that checks and surfaces the error returned by Document.Close; for each
occurrence (the doc.Close() calls in text_test.go) change to defer func() { if
err := doc.Close(); err != nil { t.Fatalf("closing document: %v", err) } } (or
t.Errorf if you prefer non-fatal) so cleanup failures are not silently discarded
— locate the doc.Close() usages and wrap them in the deferred error-checking
closure.
- Line 261: The test is discarding the error returned by Document.Page; change
all occurrences like p, _ := doc.Page(1) to capture the error (p, err :=
doc.Page(1)) and immediately check it (fail/assert) so lookup/out-of-range
failures surface; update the instances in text_test.go (lines around
261/283/314/331) and page_test.go (around line 159) to assert err == nil (or
t.Fatalf/t.Helper with the error) before using p.

---

Outside diff comments:
In `@page.go`:
- Line 31: The change expands the exported Page interface by adding Words,
ExtractText, and ExtractTextSimple which is source-breaking for external
implementers; instead, keep the existing Page unchanged and introduce a new
extended interface (e.g., TextPage) that embeds Page and declares Words,
ExtractText, and ExtractTextSimple, then update internal code paths that need
text features to use TextPage while leaving Page consumers/implementers intact.

In `@README.md`:
- Around line 217-237: The README example calls an undefined helper must around
page.ExtractText(pdftable.DefaultTextOpts()), causing a compile error; fix by
either adding a small helper func must(s string, err error) string that returns
s or by replacing the must call with explicit error handling: call
page.ExtractText(pdftable.DefaultTextOpts()), check the returned error,
handle/log/exit on error and then print the returned string. Update the example
to reference the chosen approach and ensure the symbol names (must,
page.ExtractText, pdftable.DefaultTextOpts) are used correctly.

---

Nitpick comments:
In `@text.go`:
- Around line 795-799: Remove the no-op separator assignment: the if block that
checks "if col < widthChars && rows[row][col] == ' ' { rows[row][col] = ' ' }"
is dead code (grid is initialized with spaces) and should be deleted; locate the
code using the identifiers rows, row, col, and widthChars and remove that
conditional and its body (or, if the intended behavior was to insert a separator
when the cell is not blank, change the condition to rows[row][col] != ' ' and
set it to ' ' instead).
- Around line 282-305: The preallocated assignment filtered := chars is
ineffectual because both branches replace it; remove that assignment and replace
the duplicated branches with a single loop that builds out := make([]Char, 0,
len(chars)) iterating over chars and applying the two checks: always skip empty
Text (c.Text == ""), and conditionally skip all-space tokens by calling
isAllSpace(c.Text) only when opts.KeepBlankChars is false; assign filtered = out
at the end (use the existing types Char, variable chars, and option
opts.KeepBlankChars).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 21f89b40-896e-4cd7-9005-ea55c3209473

📥 Commits

Reviewing files that changed from the base of the PR and between 3ad9fdc and a431cf4.

⛔ Files ignored due to path filters (3)
  • testdata/golden/hello.pdf is excluded by !**/*.pdf
  • testdata/golden/rules.pdf is excluded by !**/*.pdf
  • testdata/golden/simple1.pdf is excluded by !**/*.pdf
📒 Files selected for processing (15)
  • CHANGELOG.md
  • README.md
  • clustering.go
  • clustering_test.go
  • geometry.go
  • geometry_test.go
  • golden_test.go
  • page.go
  • pdftable.go
  • scripts/gen_golden.py
  • testdata/golden/hello.expected.json
  • testdata/golden/rules.expected.json
  • testdata/golden/simple1.expected.json
  • text.go
  • text_test.go

Comment thread clustering.go
Comment on lines +282 to +286
// float64Bits is a tiny re-export point that consolidates the math.
// Float64bits dependency so other tests/files don't have to import
// "math" just to compare floats. Left as a package-private helper for
// now — only dedupeChars's key construction uses it.
func float64Bits(f float64) uint64 { return math.Float64bits(f) }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove unused float64Bits helper (lint error).

Line 286 is unused and already flagged by static analysis. Please remove it (or use it consistently) to keep CI/lint clean.

Proposed cleanup
-// float64Bits is a tiny re-export point that consolidates the math.
-// Float64bits dependency so other tests/files don't have to import
-// "math" just to compare floats. Left as a package-private helper for
-// now — only dedupeChars's key construction uses it.
-func float64Bits(f float64) uint64 { return math.Float64bits(f) }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// float64Bits is a tiny re-export point that consolidates the math.
// Float64bits dependency so other tests/files don't have to import
// "math" just to compare floats. Left as a package-private helper for
// now — only dedupeChars's key construction uses it.
func float64Bits(f float64) uint64 { return math.Float64bits(f) }
🧰 Tools
🪛 golangci-lint (2.12.2)

[error] 286-286: func float64Bits is unused

(unused)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@clustering.go` around lines 282 - 286, The helper function float64Bits is
unused (lint error); remove the unused function float64Bits (the small wrapper
around math.Float64bits) from the file, or if intended to be used, replace
direct calls to math.Float64bits in dedupeChars's key construction with
float64Bits to make it referenced; prefer deleting the float64Bits function to
satisfy static analysis if no call sites are added.

Comment thread geometry.go
Comment on lines +75 to +78
// We treat touching-but-not-overlapping (shared edge, zero area) as
// non-overlap, matching pdfplumber's `o_height + o_width > 0` check —
// a single-line ruler that grazes a word's bbox should not be reported
// as "intersecting" the word.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix Intersect doc semantics for edge-touch cases.

Line 75 currently says shared-edge, zero-area intersections are treated as non-overlap, but Line 89 logic and tests treat edge-touch (w>0,h=0 or w=0,h>0) as overlap. Please align the comment to avoid misleading callers.

Proposed doc fix
-// We treat touching-but-not-overlapping (shared edge, zero area) as
-// non-overlap, matching pdfplumber's `o_height + o_width > 0` check —
-// a single-line ruler that grazes a word's bbox should not be reported
-// as "intersecting" the word.
+// We treat point-touch (w=0,h=0) as non-overlap, matching
+// pdfplumber's `o_height + o_width > 0` check. Edge-touch
+// intersections (w=0,h>0 or w>0,h=0) are counted as overlap.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@geometry.go` around lines 75 - 78, The comment above the Intersect logic
incorrectly states that shared-edge (zero-area) cases are treated as non-overlap
while the Intersect function and tests treat edge-touch (w>0,h=0 or w=0,h>0) as
overlapping; update the documentation comment for the Intersect function in
geometry.go to accurately describe the implemented behavior (edge-touch where
one dimension is zero and the other >0 is considered an intersection) and
mention the specific condition used (w>0 || h>0) so callers are not misled.

Comment thread text_test.go
if err != nil {
t.Fatalf("open: %v", err)
}
defer doc.Close()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle Document.Close errors instead of discarding them.

Line 260, Line 282, Line 313, and Line 330 use defer doc.Close() without checking the returned error, which is currently tripping errcheck and can hide cleanup failures.

Suggested fix pattern
-	defer doc.Close()
+	t.Cleanup(func() {
+		if cerr := doc.Close(); cerr != nil {
+			t.Errorf("close: %v", cerr)
+		}
+	})

Also applies to: 282-282, 313-313, 330-330

🧰 Tools
🪛 golangci-lint (2.12.2)

[error] 260-260: Error return value of doc.Close is not checked

(errcheck)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@text_test.go` at line 260, Replace the naked defer doc.Close() calls with a
deferred closure that checks and surfaces the error returned by Document.Close;
for each occurrence (the doc.Close() calls in text_test.go) change to defer
func() { if err := doc.Close(); err != nil { t.Fatalf("closing document: %v",
err) } } (or t.Errorf if you prefer non-fatal) so cleanup failures are not
silently discarded — locate the doc.Close() usages and wrap them in the deferred
error-checking closure.

Comment thread text_test.go
t.Fatalf("open: %v", err)
}
defer doc.Close()
p, _ := doc.Page(1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Inspect Document/Page signatures:"
rg -nP --type=go -C2 '\btype\s+Document\b|\bfunc\s*\([^)]*\)\s*Page\s*\('

echo
echo "Find call sites that ignore the second return value:"
rg -nP --type=go -C2 '\bp\s*,\s*_\s*:=\s*doc\.Page\s*\('

Repository: hallelx2/pdftable

Length of output: 1610


Don’t discard the error from doc.Page(1).

Document.Page returns (Page, error) (pdf.go), so p, _ := doc.Page(1) can hide lookup/out-of-range failures and make later assertions fail unclearly.

In text_test.go lines 261/283/314/331 (and also page_test.go line 159), replace _ with an err check, e.g. p, err := doc.Page(1) and fail/assert on err.

p, _ := doc.Page(1)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@text_test.go` at line 261, The test is discarding the error returned by
Document.Page; change all occurrences like p, _ := doc.Page(1) to capture the
error (p, err := doc.Page(1)) and immediately check it (fail/assert) so
lookup/out-of-range failures surface; update the instances in text_test.go
(lines around 261/283/314/331) and page_test.go (around line 159) to assert err
== nil (or t.Fatalf/t.Helper with the error) before using p.

@hallelx2 hallelx2 merged commit bb561ce into main May 26, 2026
5 of 6 checks passed
@hallelx2 hallelx2 deleted the feat/words-and-text-extract branch May 26, 2026 23:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants