fix(convert): preserve vector drawings in PDF→PPTX #41
Merged
… backgrounds, separators)
User compared the PPTX output against the original PDF
(slide-builder presentation, sample attached as
"Programar_Scripts_de_Normalização_e_Filtragem_de_Logs.pdf"):
text positions were correct after the previous fix, but the slides
looked "naked" — the dark blue header bars at the top of every
slide, the colored thin-line separators, the dark code-block
backgrounds, the light gray card backgrounds, and the colored top
borders on each card were all missing.
Root cause: those visual elements are vector drawings in the PDF
(filled rectangles, stroked lines, paths), not text or images.
PyMuPDF's `page.get_text("dict")` only returns text and image
blocks — drawings live behind a separate API, `page.get_drawings()`,
which the converter never called.
Add a Phase-1 pass over `page.get_drawings()` BEFORE the text /
image extraction. Each drawing is mapped to a PowerPoint shape:
- Filled rectangle (`items` are only `re` and/or `l` ops with a
fill color) → `MSO_SHAPE.RECTANGLE` with solid fill at the
drawing's bbox.
- Stroked thin line (no fill, has stroke, bbox aspect ratio
consistent with a separator < ~2pt thick on the short axis) →
same MSO_SHAPE.RECTANGLE drawn in the stroke color (PowerPoint's
shape line thickness model is finicky; a thin filled rect is the
most reliable way to get a visible 1-2pt separator).
- Anything else (curves, complex paths, gradient fills) → skipped.
These are rare in slide-builder PDFs and would need a much more
involved path-translation pass.
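
The mapping rules above can be sketched as a small pure-Python classifier. This is an illustration under stated assumptions, not the converter's actual code: the function name `classify_drawing` and the `thin_max_pt` parameter are hypothetical, and the input is a plain dict that mimics the `items`, `fill`, `color`, and `rect` keys of a `page.get_drawings()` entry (with `rect` given as a 4-tuple so the sketch runs without PyMuPDF).

```python
from typing import Optional


def classify_drawing(drawing: dict, thin_max_pt: float = 2.0) -> Optional[tuple]:
    """Return (shape_kind, color, bbox), or None when the drawing is skipped.

    Hypothetical helper: `drawing` mimics one entry of page.get_drawings().
    """
    # Each path item starts with its op code, e.g. "re", "l", "c".
    kinds = {item[0] for item in drawing.get("items", [])}
    fill = drawing.get("fill")      # fill color as 0..1 floats, or None
    stroke = drawing.get("color")   # stroke color as 0..1 floats, or None
    x0, y0, x1, y1 = drawing["rect"]
    bbox = (x0, y0, x1, y1)

    # Filled rectangle: only "re"/"l" ops and a fill color.
    if fill is not None and kinds and kinds <= {"re", "l"}:
        return ("RECTANGLE", fill, bbox)

    # Stroked thin line: no fill, has a stroke, short axis under the threshold.
    short_axis = min(abs(x1 - x0), abs(y1 - y0))
    if fill is None and stroke is not None and short_axis < thin_max_pt:
        return ("RECTANGLE", stroke, bbox)

    # Curves, complex paths, gradient fills: skipped.
    return None


header = {"items": [("re", (0, 0, 720, 60))], "fill": (0.05, 0.1, 0.2),
          "color": None, "rect": (0, 0, 720, 60)}
separator = {"items": [("l", (40, 100), (680, 100))], "fill": None,
             "color": (0.2, 0.5, 0.9), "rect": (40, 100, 680, 101)}
print(classify_drawing(header))
print(classify_drawing(separator))
```

A real converter would also have to give a zero-height line bbox some visible thickness, for example from the drawing's `width` key; that step is left out here.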
Drawings are added FIRST so PowerPoint's z-order puts them in the
background; text and image phases stack on top, exactly mirroring
the PDF's draw order.
Also added an `_rgb()` helper that converts fitz's 0..1 float
tuples to `RGBColor(0..255)` and silently rejects malformed colors,
so a single broken drawing doesn't abort the slide.
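
A minimal sketch of what such a color helper might look like, assuming truncation to integers and a `None` return for anything malformed. The name `rgb_from_unit_floats` is hypothetical, and the sketch returns a plain tuple instead of a python-pptx `RGBColor` so that it runs without third-party packages.

```python
from typing import Optional, Tuple


def rgb_from_unit_floats(color) -> Optional[Tuple[int, int, int]]:
    """Convert a 0..1 float RGB triple to 0..255 ints, or None if malformed.

    Hypothetical stand-in for the `_rgb()` helper described above.
    """
    try:
        r, g, b = color  # raises on None, wrong length, non-iterables
        channels = (float(r), float(g), float(b))
    except (TypeError, ValueError):
        return None
    # Reject out-of-range values (this also rejects NaN).
    if not all(0.0 <= c <= 1.0 for c in channels):
        return None
    # Assumption: truncate rather than round.
    return tuple(int(c * 255) for c in channels)


print(rgb_from_unit_floats((0.0, 0.5, 1.0)))
print(rgb_from_unit_floats(None))
```

With python-pptx installed, a non-`None` result could be passed on as `RGBColor(*rgb)`; a `None` result lets the caller skip that one drawing and continue with the rest of the slide.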
Smoke test on Ubuntu 26.04 + Py3.14.4: a 1-page PDF with a header
bar, a separator, a code block, and a card background, plus two
text lines, now produces a PPTX with 4 rectangle shapes (colors
0C1933 / 337FE5 / 142138 / F2F2F2 — matching the PDF fills) and 2
text shapes. Before this fix the PPTX had only the two text
shapes against a blank white slide.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nelsonduarte added a commit that referenced this pull request on May 7, 2026
…PDF→PPTX (#42)

User confirmed running source (`py pdfapps.py`) on main, so the v1.13.8 vector-drawings fix from #41 was active but their slides still came out without the card backgrounds.

Root cause: the strict filter `kinds <= {"re", "l"}` accepted only sharp filled rectangles. Modern slide-builder PDFs (Genially, Canva, the user's UFCD 1492 deck) draw rounded-corner cards with `c` (cubic Bezier) ops at the four corners; those drawings have `kinds = {"l", "c"}` and were silently skipped.

Loosen the filter:
- Any drawing with a fill color and at least one path item is rendered at its bbox. Drawings whose path contains `c` or `qu` curves map to MSO_SHAPE.ROUNDED_RECTANGLE; sharp paths (`re`/`l` only) stay MSO_SHAPE.RECTANGLE.
- Stroked-only thin shapes still become a thin filled rect in the stroke colour.
- Larger thin-shape tolerance (4× max dimension instead of 2×) so separator lines drawn as 3-4pt strokes survive.

Trade-off: complex non-rectangular vector drawings (e.g. icon glyphs drawn as paths) will also render as rounded rectangles at their bbox, visually a "colored blob" instead of nothing. That's better than the previous behaviour (silent skip → naked slide) but not perfect; full path translation is out of scope.

Smoke test on Ubuntu 26.04 + Py3.14.4: a 1-page PDF with a sharp header bar + a rounded card + a stroked separator line + two text elements now produces a PPTX with 1 RECTANGLE (header), 1 ROUNDED_RECTANGLE (card), and 2 textboxes. The previous strict filter rejected the card.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
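
The loosened filter described in this commit could look roughly like the sketch below. It is an illustration only: `classify_drawing_loose` and `thin_max_pt` are hypothetical names, the 4pt default is one reading of the "larger thin-shape tolerance" wording, and the input is a plain dict mimicking the `items`, `fill`, `color`, and `rect` keys of a `page.get_drawings()` entry.

```python
from typing import Optional

# Ops treated as "curved" by the commit message. Note that PyMuPDF documents
# "qu" as a quad item; it is grouped with "c" here only to follow the commit.
CURVE_OPS = {"c", "qu"}


def classify_drawing_loose(drawing: dict, thin_max_pt: float = 4.0) -> Optional[tuple]:
    """Return (shape_kind, color, bbox), or None when the drawing is skipped."""
    items = drawing.get("items") or []
    kinds = {item[0] for item in items}
    fill = drawing.get("fill")
    stroke = drawing.get("color")
    x0, y0, x1, y1 = drawing["rect"]
    bbox = (x0, y0, x1, y1)

    # Any filled drawing with at least one path item is rendered at its bbox.
    if fill is not None and items:
        shape = "ROUNDED_RECTANGLE" if kinds & CURVE_OPS else "RECTANGLE"
        return (shape, fill, bbox)

    # Stroked-only thin shapes become a thin filled rect in the stroke colour.
    # Assumption: "thin" means the short axis is at most thin_max_pt points.
    short_axis = min(abs(x1 - x0), abs(y1 - y0))
    if fill is None and stroke is not None and short_axis <= thin_max_pt:
        return ("RECTANGLE", stroke, bbox)

    return None


card = {"items": [("l",), ("c",), ("l",), ("c",)], "fill": (0.95, 0.95, 0.95),
        "color": None, "rect": (40, 120, 340, 300)}
print(classify_drawing_loose(card))
```

The trade-off noted above is visible in the first branch: an icon glyph drawn with `c` ops and a fill takes the same path as a rounded card and comes out as a rounded rectangle at its bbox.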
Problem
User compared the PPTX output of his slide-builder PDF against the original (sample shared in the conversation: dark theme, blue header bars, code blocks, card backgrounds). Text positions were correct after PR #40, but the slides looked "naked": every visual element that wasn't text or a raster image was missing:
- … `06 Estrutura de um Log` banner)

Root cause
Those elements are vector drawings in the PDF (filled rectangles, stroked lines, paths), not text or images. PyMuPDF's `page.get_text("dict")` only returns text + image blocks. Drawings live behind a separate API: `page.get_drawings()`. The converter never called it.

Fix
Add a Phase-1 pass over `page.get_drawings()` before the text/image extraction. Each drawing maps to a PowerPoint shape:
- Filled rectangle (`items ⊆ {re, l}` + fill) → `MSO_SHAPE.RECTANGLE` with solid fill at bbox
- Stroked thin line → thin filled rectangle in the stroke color (PowerPoint's `shape.line` thickness model is finicky)

Drawings are added first so PowerPoint's z-order puts them in the background; text and image phases stack on top, mirroring the PDF's draw order.

Also added an `_rgb()` helper that converts fitz's 0..1 float color tuples to `RGBColor(0..255)` and silently rejects malformed colors, so a single broken drawing doesn't abort the whole slide.

Test plan
- … `0C1933` / `337FE5` / `142138` / `F2F2F2`, matching the PDF source) and 2 text shapes. Before the fix the same input produced 2 text shapes against a blank white slide.

🤖 Generated with Claude Code