fix(convert): preserve vector drawings in PDF→PPTX by nelsonduarte · Pull Request #41 · nelsonduarte/PDFApps

nelsonduarte · 2026-05-07T09:46:23Z

Problem

The user compared the PPTX output of their slide-builder PDF against the original (sample shared in the conversation: dark theme, blue header bars, code blocks, card backgrounds). Text positions were correct after PR #40, but the slides looked "naked": every visual element that wasn't text or a raster image was missing:

Dark blue header bar at the top of each slide (the "06 Estrutura de um Log" banner)
Thin colored separators below the headers
Dark code-block backgrounds
Light gray card backgrounds
Colored top borders on cards

Root cause

Those elements are vector drawings in the PDF (filled rectangles, stroked lines, paths), not text or images. PyMuPDF's page.get_text("dict") only returns text + image blocks. Drawings live behind a separate API: page.get_drawings(). The converter never called it.
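
The gap between the two APIs can be seen by counting what each one returns for a page. The sketch below is illustrative only: `summarize_page` is a hypothetical helper, not code from this PR, and it assumes `page` behaves like a PyMuPDF page, where `get_text("dict")` yields blocks with `type` 0 for text and 1 for images, and `get_drawings()` yields one dict per vector drawing.

```python
def summarize_page(page):
    """Count text blocks, image blocks, and vector drawings on one page.

    `page` is assumed to expose PyMuPDF's get_text("dict") and
    get_drawings() methods; any object with the same shape works.
    """
    blocks = page.get_text("dict")["blocks"]
    # In the "dict" output, block type 0 is text and type 1 is an image.
    text_blocks = sum(1 for b in blocks if b.get("type") == 0)
    image_blocks = sum(1 for b in blocks if b.get("type") == 1)
    # Vector drawings never appear in the block list; they need their own call.
    drawings = len(page.get_drawings())
    return {"text": text_blocks, "image": image_blocks, "drawings": drawings}


# Possible usage with PyMuPDF installed (file name is a placeholder):
#   import fitz
#   doc = fitz.open("slides.pdf")
#   print(summarize_page(doc[0]))
```

A page that reports a non-zero `drawings` count alongside its text blocks is the kind of page the converter was previously rendering without its backgrounds.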

Fix

Add a Phase-1 pass over page.get_drawings() before the text/image extraction. Each drawing maps to a PowerPoint shape:

| Drawing kind | Mapped to |
| --- | --- |
| Filled rect (`items ⊆ {re, l}` + fill) | `MSO_SHAPE.RECTANGLE` with solid fill at bbox |
| Stroked thin line (no fill, has stroke, bbox aspect consistent with ≤ ~2pt thick separator) | Same `MSO_SHAPE.RECTANGLE` in stroke color (most reliable way to get a visible 1-2pt line; PowerPoint's `shape.line` thickness model is finicky) |
| Curves, complex paths, gradients | Skipped (rare in slide-builder PDFs) |

Drawings are added first so PowerPoint's z-order puts them in the background; text and image phases stack on top, mirroring the PDF's draw order.
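
The mapping described above can be sketched as a small pure-Python classifier. This is a sketch under stated assumptions, not the converter's actual code: `classify_drawing` and `THIN_PT` are hypothetical names, the 2pt cutoff is read off the "~2pt" in the description, and each drawing is assumed to be a dict with the `items`, `fill`, `color`, and `rect` keys that PyMuPDF puts in `get_drawings()` entries.

```python
THIN_PT = 2.0  # assumed cutoff, taken from the "~2pt" wording in the description


def classify_drawing(drawing, thin_pt=THIN_PT):
    """Return (shape_name, color) for a drawing dict, or None to skip it."""
    items = drawing.get("items") or []
    kinds = {item[0] for item in items}  # first element of each item is the op name
    if not kinds or not kinds <= {"re", "l"}:
        return None  # curves and complex paths are skipped

    fill = drawing.get("fill")
    if fill is not None:
        return ("RECTANGLE", fill)  # filled rect: solid fill at the bbox

    stroke = drawing.get("color")  # the stroke color lives under "color"
    x0, y0, x1, y1 = drawing["rect"]
    short_side = min(abs(x1 - x0), abs(y1 - y0))
    if stroke is not None and short_side <= thin_pt:
        return ("RECTANGLE", stroke)  # thin separator drawn as a filled rect
    return None  # stroked outline that is not thin: nothing to draw


separator = {
    "items": [("l", (0, 70), (720, 70))],
    "fill": None,
    "color": (0.2, 0.5, 0.9),
    "rect": (0, 70, 720, 70),
}
print(classify_drawing(separator))
```

Returning the color alongside the shape name keeps the stroke-versus-fill decision in one place, so the code that creates the PowerPoint shape only ever has to apply a solid fill.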

Also added an _rgb() helper that converts fitz's 0..1 float color tuples to RGBColor(0..255) and silently rejects malformed colors, so a single broken drawing doesn't abort the whole slide.
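
A helper of this kind might look like the sketch below. It is not the PR's `_rgb()`: the name `rgb_from_fitz` is hypothetical, it returns a plain tuple of ints rather than an `RGBColor` so it runs without python-pptx, and treating anything other than a three-channel color as malformed is an assumption.

```python
def rgb_from_fitz(color):
    """Convert a 0..1 float color tuple to 0..255 ints, or None if malformed."""
    try:
        r, g, b = color  # None or a wrong-length tuple fails here
        channels = tuple(round(float(c) * 255) for c in (r, g, b))
    except (TypeError, ValueError, OverflowError):
        # Covers None, wrong arity, non-numeric values, NaN, and infinities.
        return None
    if any(c < 0 or c > 255 for c in channels):
        return None  # out-of-range input is rejected rather than clamped
    return channels


print(rgb_from_fitz((0.2, 0.4, 0.6)))
print(rgb_from_fitz(None))
```

With python-pptx available, a non-`None` result could be passed on as `RGBColor(*channels)`; a `None` result lets the caller skip just that one drawing.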

Test plan

Smoke test on Ubuntu 26.04 + Py3.14.4: 1-page PDF with a header bar / separator / code block / card background + 2 text lines now produces a PPTX with 4 rectangle shapes (fill colors 0C1933 / 337FE5 / 142138 / F2F2F2 — matching the PDF source) and 2 text shapes. Before the fix the same input produced 2 text shapes against a blank white slide.
Live test in PowerPoint with the user's original problematic PDF: header bars and card backgrounds visible; text still selectable / editable; visual fidelity close to PDF (won't be 100% — gradients and curves still skipped).
Spot-check a scanned/image-only PDF: image blocks still embed correctly, no extra empty rectangles get generated.

🤖 Generated with Claude Code

… backgrounds, separators)

User compared the PPTX output against the original PDF (slide-builder presentation, sample attached as "Programar_Scripts_de_Normalização_e_Filtragem_de_Logs.pdf"): text positions were correct after the previous fix, but the slides looked "naked". The dark blue header bars at the top of every slide, the colored thin-line separators, the dark code-block backgrounds, the light gray card backgrounds, and the colored top borders on each card were all missing.

Root cause: those visual elements are vector drawings in the PDF (filled rectangles, stroked lines, paths), not text or images. PyMuPDF's `page.get_text("dict")` only returns text and image blocks; drawings live behind a separate API, `page.get_drawings()`, which the converter never called.

Add a Phase-1 pass over `page.get_drawings()` BEFORE the text / image extraction. Each drawing is mapped to a PowerPoint shape:

- Filled rectangle (`items` are only `re` and/or `l` ops with a fill color) → `MSO_SHAPE.RECTANGLE` with solid fill at the drawing's bbox.
- Stroked thin line (no fill, has stroke, bbox aspect ratio consistent with a separator < ~2pt thick on the short axis) → same `MSO_SHAPE.RECTANGLE` drawn in the stroke color (PowerPoint's shape line thickness model is finicky; a thin filled rect is the most reliable way to get a visible 1-2pt separator).
- Anything else (curves, complex paths, gradient fills) → skipped. These are rare in slide-builder PDFs and would need a much more involved path-translation pass.

Drawings are added FIRST so PowerPoint's z-order puts them in the background; text and image phases stack on top, exactly mirroring the PDF's draw order.

Also added an `_rgb()` helper that converts fitz's 0..1 float tuples to `RGBColor(0..255)` and silently rejects malformed colors, so a single broken drawing doesn't abort the slide.

Smoke test on Ubuntu 26.04 + Py3.14.4: a 1-page PDF with a header bar, a separator, a code block, and a card background, plus two text lines, now produces a PPTX with 4 rectangle shapes (colors 0C1933 / 337FE5 / 142138 / F2F2F2, matching the PDF fills) and 2 text shapes. Before this fix the PPTX had only the two text shapes against a blank white slide.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…PDF→PPTX (#42)

User confirmed running source (py pdfapps.py) on main, so the v1.13.8 vector-drawings fix from #41 was active but their slides still came out without the card backgrounds.

Root cause: the strict filter `kinds <= {"re", "l"}` accepted only sharp filled rectangles. Modern slide-builder PDFs (Genially, Canva, the user's UFCD 1492 deck) draw rounded-corner cards with `c` (cubic Bezier) ops at the four corners; those drawings have `kinds = {"l", "c"}` and were silently skipped.

Loosen the filter:

- Any drawing with a fill color and at least one path item is rendered at its bbox. Drawings whose path contains `c` or `qu` curves map to `MSO_SHAPE.ROUNDED_RECTANGLE`; sharp paths (`re`/`l` only) stay `MSO_SHAPE.RECTANGLE`.
- Stroked-only thin shapes still become a thin filled rect in the stroke colour.
- Larger thin-shape tolerance (4× max dimension instead of 2×) so separator lines drawn as 3-4pt strokes survive.

Trade-off: complex non-rectangular vector drawings (e.g. icon glyphs drawn as paths) will also render as rounded rectangles at their bbox, visually a "colored blob" instead of nothing. That's better than the previous behaviour (silent skip → naked slide) but not perfect; full path translation is out of scope.

Smoke test on Ubuntu 26.04 + Py3.14.4: a 1-page PDF with a sharp header bar + a rounded card + a stroked separator line + two text elements now produces a PPTX with 1 RECTANGLE (header), 1 ROUNDED_RECTANGLE (card), and 2 textboxes. The previous strict filter rejected the card.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
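
The loosened rule for filled drawings in the #42 commit message can be sketched as follows. This is a hedged illustration rather than the merged code: `pick_shape` is a hypothetical name, it covers only the filled branch (the stroked thin-shape branch and its tolerance are left out), and it returns shape names as strings instead of `MSO_SHAPE` members so it runs without python-pptx.

```python
def pick_shape(drawing):
    """Pick a shape name for a filled drawing dict, or None if not applicable."""
    items = drawing.get("items") or []
    if drawing.get("fill") is None or not items:
        return None  # needs a fill color and at least one path item
    kinds = {item[0] for item in items}
    if kinds & {"c", "qu"}:
        # Any curve op in the path: treat the drawing as a rounded card.
        return "ROUNDED_RECTANGLE"
    return "RECTANGLE"  # sharp paths made of "re" / "l" ops only


# A rounded card as described in the commit: lines plus Bezier corners.
card = {
    "items": [("l", (10, 0), (90, 0)), ("c", (90, 0), (95, 0), (100, 5), (100, 10))],
    "fill": (0.95, 0.95, 0.95),
}
print(pick_shape(card))
```

Because the check looks only at which op kinds are present, an icon glyph built from curves gets the same answer as a rounded card, which is the "colored blob" trade-off the commit message describes.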

nelsonduarte merged commit daa2d61 into main May 7, 2026
3 checks passed

nelsonduarte deleted the fix/pptx-extract-vector-drawings branch May 7, 2026 09:48

nelsonduarte mentioned this pull request May 7, 2026

fix(convert): accept Bezier-corner drawings as rounded rectangles in PDF→PPTX #42

Merged

2 tasks
