fix(convert): preserve vector drawings in PDF→PPTX#41

Merged
nelsonduarte merged 1 commit into main from fix/pptx-extract-vector-drawings
May 7, 2026
@nelsonduarte
Owner

Problem

A user compared the PPTX output against the original slide-builder PDF (sample shared in the conversation: dark theme, blue header bars, code blocks, card backgrounds). Text positions were correct after PR #40, but the slides looked "naked": every visual element that wasn't text or a raster image was missing:

  • Dark blue header bar at top of each slide (the 06 Estrutura de um Log banner)
  • Thin colored separators below the headers
  • Dark code-block backgrounds
  • Light gray card backgrounds
  • Colored top borders on cards

Root cause

Those elements are vector drawings in the PDF (filled rectangles, stroked lines, paths), not text or images. PyMuPDF's page.get_text("dict") only returns text + image blocks. Drawings live behind a separate API: page.get_drawings(). The converter never called it.
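
To illustrate the split between the two APIs, here is a minimal sketch, not the converter's actual code. `split_page_content` is a hypothetical name, and `page` is assumed to be any object exposing PyMuPDF's `Page.get_text("dict")` and `Page.get_drawings()`. In the dict output, block type 0 is text and type 1 is an image; vector drawings only appear in the second call.

```python
def split_page_content(page):
    """Return (text_blocks, image_blocks, drawings) for one page.

    `page` is assumed to behave like a PyMuPDF Page object.
    """
    blocks = page.get_text("dict")["blocks"]
    # Block type 0 = text, type 1 = image; there is no drawing block type.
    text_blocks = [b for b in blocks if b.get("type") == 0]
    image_blocks = [b for b in blocks if b.get("type") == 1]
    # Filled rectangles, lines and paths come from a separate call.
    drawings = page.get_drawings()
    return text_blocks, image_blocks, drawings
```

With PyMuPDF installed, this would typically be called on each page obtained from `fitz.open(path)`.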

Fix

Add a Phase-1 pass over page.get_drawings() before the text/image extraction. Each drawing maps to a PowerPoint shape:

| Drawing kind | Mapped to |
| --- | --- |
| Filled rect (items ⊆ {re, l} + fill) | MSO_SHAPE.RECTANGLE with solid fill at bbox |
| Stroked thin line (no fill, has stroke, bbox aspect consistent with ≤ ~2pt thick separator) | Same MSO_SHAPE.RECTANGLE in stroke color (most reliable way to get a visible 1-2pt line; PowerPoint's shape.line thickness model is finicky) |
| Curves, complex paths, gradients | Skipped (rare in slide-builder PDFs) |
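
The mapping above can be sketched as a small classifier over the dicts that `page.get_drawings()` returns. This is an illustrative sketch under stated assumptions, not the merged code: `classify_drawing` and `THIN_PT` are hypothetical names, the 2pt threshold is taken from the "~2pt" wording, and the drawing's `rect` is assumed to unpack into four coordinates (a `fitz.Rect` does, and so does a plain tuple).

```python
from typing import Optional

THIN_PT = 2.0  # assumed cut-off for "thin separator", in PDF points


def classify_drawing(drawing: dict) -> Optional[str]:
    """Return "filled_rect", "thin_line", or None (skip)."""
    items = drawing.get("items") or []
    if not items:
        return None
    # First element of each path item is its op: "re", "l", "c", "qu".
    kinds = {item[0] for item in items}
    fill = drawing.get("fill")      # fill color tuple or None
    stroke = drawing.get("color")   # stroke color tuple or None

    # Filled shape built only from rectangle / line ops.
    if fill is not None and kinds <= {"re", "l"}:
        return "filled_rect"

    # Stroke-only shape whose short side is at most THIN_PT.
    x0, y0, x1, y1 = drawing["rect"]
    short_side = min(abs(x1 - x0), abs(y1 - y0))
    if fill is None and stroke is not None and short_side <= THIN_PT:
        return "thin_line"

    # Curves, complex paths, gradients: not handled in this pass.
    return None
```

Both non-`None` results would be drawn as an `MSO_SHAPE.RECTANGLE` at the drawing's bbox, filled with the fill color or the stroke color respectively.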

Drawings are added first so PowerPoint's z-order puts them in the background; text and image phases stack on top, mirroring the PDF's draw order.

Also added an _rgb() helper that converts fitz's 0..1 float color tuples to RGBColor(0..255) and silently rejects malformed colors, so a single broken drawing doesn't abort the whole slide.
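
A helper along these lines might look like the sketch below. It is an assumption about the shape of `_rgb()`, not its actual source: `rgb_255` is a hypothetical name, and it returns a plain tuple of ints so that it runs without python-pptx. The real helper would wrap the three values as `RGBColor(r, g, b)`.

```python
from typing import Optional, Tuple


def rgb_255(color) -> Optional[Tuple[int, int, int]]:
    """Convert a 0..1 float RGB triple to 0..255 ints, or None if malformed."""
    try:
        r, g, b = color  # fails for None or a wrong number of channels
        channels = (float(r), float(g), float(b))
    except (TypeError, ValueError):
        return None
    # Reject out-of-range values (this also rejects NaN).
    if not all(0.0 <= c <= 1.0 for c in channels):
        return None
    return tuple(round(c * 255) for c in channels)
```

Returning `None` instead of raising lets the caller skip one bad drawing and continue with the rest of the slide.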

Test plan

  • Smoke test on Ubuntu 26.04 + Py3.14.4: 1-page PDF with a header bar / separator / code block / card background + 2 text lines now produces a PPTX with 4 rectangle shapes (fill colors 0C1933 / 337FE5 / 142138 / F2F2F2 — matching the PDF source) and 2 text shapes. Before the fix the same input produced 2 text shapes against a blank white slide.
  • Live test in PowerPoint with the user's original problematic PDF: header bars and card backgrounds visible; text still selectable / editable; visual fidelity close to PDF (won't be 100% — gradients and curves still skipped).
  • Spot-check a scanned/image-only PDF: image blocks still embed correctly, no extra empty rectangles get generated.

🤖 Generated with Claude Code

… backgrounds, separators)

User compared the PPTX output against the original PDF
(slide-builder presentation, sample attached as
"Programar_Scripts_de_Normalização_e_Filtragem_de_Logs.pdf"):
text positions were correct after the previous fix, but the slides
looked "naked" — the dark blue header bars at the top of every
slide, the colored thin-line separators, the dark code-block
backgrounds, the light gray card backgrounds, and the colored top
borders on each card were all missing.

Root cause: those visual elements are vector drawings in the PDF
(filled rectangles, stroked lines, paths), not text or images.
PyMuPDF's `page.get_text("dict")` only returns text and image
blocks — drawings live behind a separate API, `page.get_drawings()`,
which the converter never called.

Add a Phase-1 pass over `page.get_drawings()` BEFORE the text /
image extraction. Each drawing is mapped to a PowerPoint shape:
- Filled rectangle (`items` are only `re` and/or `l` ops with a
  fill color) → `MSO_SHAPE.RECTANGLE` with solid fill at the
  drawing's bbox.
- Stroked thin line (no fill, has stroke, bbox aspect ratio
  consistent with a separator < ~2pt thick on the short axis) →
  same MSO_SHAPE.RECTANGLE drawn in the stroke color (PowerPoint's
  shape line thickness model is finicky; a thin filled rect is the
  most reliable way to get a visible 1-2pt separator).
- Anything else (curves, complex paths, gradient fills) → skipped.
  These are rare in slide-builder PDFs and would need a much more
  involved path-translation pass.

Drawings are added FIRST so PowerPoint's z-order puts them in the
background; text and image phases stack on top, exactly mirroring
the PDF's draw order.

Also added an `_rgb()` helper that converts fitz's 0..1 float
tuples to `RGBColor(0..255)` and silently rejects malformed colors,
so a single broken drawing doesn't abort the slide.

Smoke test on Ubuntu 26.04 + Py3.14.4: a 1-page PDF with a header
bar, a separator, a code block, and a card background, plus two
text lines, now produces a PPTX with 4 rectangle shapes (colors
0C1933 / 337FE5 / 142138 / F2F2F2 — matching the PDF fills) and 2
text shapes. Before this fix the PPTX had only the two text
shapes against a blank white slide.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nelsonduarte nelsonduarte merged commit daa2d61 into main May 7, 2026
3 checks passed
@nelsonduarte nelsonduarte deleted the fix/pptx-extract-vector-drawings branch May 7, 2026 09:48
nelsonduarte added a commit that referenced this pull request May 7, 2026
…PDF→PPTX (#42)

User confirmed running source (py pdfapps.py) on main, so the
v1.13.8 vector-drawings fix from #41 was active but their slides
still came out without the card backgrounds. Root cause: the strict
filter `kinds <= {"re", "l"}` accepted only sharp filled rectangles.
Modern slide-builder PDFs (Genially, Canva, the user's UFCD 1492
deck) draw rounded-corner cards with `c` (cubic Bezier) ops at the
four corners — those drawings have `kinds = {"l", "c"}` and were
silently skipped.

Loosen the filter:
- Any drawing with a fill color and at least one path item is
  rendered at its bbox. Drawings whose path contains `c` or `qu`
  curves map to MSO_SHAPE.ROUNDED_RECTANGLE; sharp paths
  (`re`/`l` only) stay MSO_SHAPE.RECTANGLE.
- Stroked-only thin shapes still become a thin filled rect in the
  stroke colour.
- Larger thin-shape tolerance (4× max dimension instead of 2×) so
  separator lines drawn as 3-4pt strokes survive.
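
The loosened rules can be sketched as follows. This is a hedged illustration rather than the committed code: `choose_shape` and `THIN_PT` are hypothetical names, the 4pt cut-off is an assumption read off the "3-4pt strokes" wording, and the returned strings are the names of `MSO_SHAPE` members that the caller would look up in python-pptx.

```python
from typing import Optional, Tuple

THIN_PT = 4.0  # assumed short-side limit for stroke-only separators


def choose_shape(drawing: dict) -> Optional[Tuple[str, tuple]]:
    """Return (MSO_SHAPE member name, color) or None to skip."""
    items = drawing.get("items") or []
    if not items:
        return None
    kinds = {item[0] for item in items}
    fill = drawing.get("fill")
    stroke = drawing.get("color")

    if fill is not None:
        # Any filled path is rendered at its bbox; "c"/"qu" ops select
        # the rounded variant, "re"/"l"-only paths stay sharp.
        if kinds & {"c", "qu"}:
            return ("ROUNDED_RECTANGLE", fill)
        return ("RECTANGLE", fill)

    if stroke is not None:
        x0, y0, x1, y1 = drawing["rect"]
        if min(abs(x1 - x0), abs(y1 - y0)) <= THIN_PT:
            # Thin stroke-only shape: thin filled rect in the stroke colour.
            return ("RECTANGLE", stroke)
    return None
```

A rounded card with `kinds = {"l", "c"}`, which the strict `kinds <= {"re", "l"}` filter rejected, now maps to `ROUNDED_RECTANGLE`.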

Trade-off: complex non-rectangular vector drawings (e.g. icon
glyphs drawn as paths) will also render as rounded rectangles at
their bbox — visually a "colored blob" instead of nothing. That's
better than the previous behaviour (silent skip → naked slide) but
not perfect; full path translation is out of scope.

Smoke test on Ubuntu 26.04 + Py3.14.4: a 1-page PDF with a sharp
header bar + a rounded card + a stroked separator line + two text
elements now produces a PPTX with 1 RECTANGLE (header), 1
ROUNDED_RECTANGLE (card), and 2 textboxes. The previous strict
filter rejected the card.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>