fix(convert): preserve PDF→PPTX line layout and font names #40

Merged
nelsonduarte merged 1 commit into main from fix/pptx-preserve-layout
May 7, 2026

@nelsonduarte
Owner

Problem

User report after v1.13.8: "it recognizes the text now, but everything ends up unformatted".

The previous fix made PPTX output editable but used one textbox per block, with each line of the block as a paragraph inside. PowerPoint's textframe re-flowed those paragraphs based on font metrics, ignoring the per-line y-coordinates from the PDF. Multi-line blocks collapsed visually — headings + body that were spaced apart in the PDF stacked together in PowerPoint.

Fix

Switch from per-block to per-line textboxes:

  • Each line from page.get_text("dict") gets its own add_textbox at the line's bbox, so PowerPoint has nothing to re-flow vertically and every line stays at its original PDF y position.
  • Spans within a line stay as runs in a single paragraph — inline formatting (bold/italic mid-sentence) is preserved.
  • word_wrap = False per textbox since each holds exactly one line; this stops PowerPoint from breaking a line across two visual lines if its glyph width metrics differ from the PDF's by a hair.
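The per-line approach above can be sketched in plain Python. This is an illustrative sketch, not the code from this PR: `plan_line_textboxes` and the spec keys are hypothetical names, and it assumes the slide has the same dimensions as the PDF page, so PDF points map directly to EMU with no scaling. The input is a dict shaped like the result of `page.get_text("dict")`.

```python
EMU_PER_POINT = 12700  # 914400 EMU per inch / 72 points per inch


def plan_line_textboxes(page_dict):
    """Return one textbox spec per text line (hypothetical helper).

    Assumes the slide is the same size as the PDF page, so coordinates
    are converted from points to EMU without any scaling.
    """
    boxes = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            x0, y0, x1, y1 = line["bbox"]
            runs = [
                {
                    "text": span["text"],
                    "size_pt": span["size"],
                    "italic": bool(span["flags"] & 2),   # bit 1 of the span flags
                    "bold": bool(span["flags"] & 16),    # bit 4 of the span flags
                }
                for span in line["spans"]
                if span["text"]
            ]
            if not runs:
                continue  # nothing visible on this line
            boxes.append({
                "left": round(x0 * EMU_PER_POINT),
                "top": round(y0 * EMU_PER_POINT),
                "width": round((x1 - x0) * EMU_PER_POINT),
                "height": round((y1 - y0) * EMU_PER_POINT),
                "word_wrap": False,  # one line per box, so never wrap
                "runs": runs,        # spans become runs of a single paragraph
            })
    return boxes
```

With python-pptx, each spec would presumably be passed to `slide.shapes.add_textbox(left, top, width, height)`, with `text_frame.word_wrap = False` and one run added per entry in `runs`.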

Also propagate the PDF span's font name to the run, stripping the 6-letter subset prefix that fitz reports (e.g. ABCDEF+Arial → Arial). PowerPoint substitutes a similar face if the exact font isn't installed, but at least the canonical name lets the substitution be reasonable instead of always landing on the theme default Calibri.
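A minimal sketch of the prefix stripping, assuming the usual subset-font naming of six uppercase letters followed by `+`. The helper name `strip_subset_prefix` is hypothetical and not necessarily what the PR uses.

```python
import re

# Subset-embedded fonts are named like "ABCDEF+Arial":
# six uppercase letters, a plus sign, then the real font name.
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def strip_subset_prefix(font_name: str) -> str:
    """Return the font name without a leading subset tag, if present."""
    return _SUBSET_PREFIX.sub("", font_name)


print(strip_subset_prefix("ABCDEF+Arial"))  # Arial
print(strip_subset_prefix("Helvetica"))     # Helvetica
```

Anchoring the pattern at the start of the string and requiring exactly six uppercase letters keeps names that merely contain a `+` untouched.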

Test plan

  • Smoke test on Ubuntu 26.04 + Py3.14.4: a 1-page PDF with text at y=100/300/700, sizes 22/14/10, produces 3 separate textboxes with vertical order preserved, sizes preserved exactly, font name = Helvetica:
    'Title at top'    y=76.35  size=22.0  name=Helvetica
    'Middle line'     y=284.95 size=14.0  name=Helvetica
    'Bottom footnote' y=689.25 size=10.0  name=Helvetica
    
  • Live test in PowerPoint with the user's original problematic PDF: lines should land at their original positions; selection / editing still works on each line independently.

🤖 Generated with Claude Code

After v1.13.8 made PPTX output editable, the user reported that the
formatting was broken — text appeared but lines collapsed together
and were no longer visually positioned at their original PDF
coordinates.

Root cause: the previous fix used one textbox per *block*, with each
line as a paragraph inside. PowerPoint's textframe layout engine
then re-flowed the lines based on font metrics, ignoring the
original per-line y-coordinates from the PDF. Multi-line blocks
(headings + body in the same block, lists, etc.) got compressed.

Fix: switch from per-block to per-LINE textboxes. Each `line` from
`page.get_text("dict")` becomes its own textbox at the line's bbox,
so PowerPoint has nothing to re-flow vertically — every line stays
exactly where it was in the PDF. Spans within a line stay as runs
in a single paragraph, preserving inline formatting (bold/italic
mid-sentence). Word-wrap is disabled per textbox since each holds
exactly one line; this also stops PowerPoint from breaking a single
line across two visual lines if its glyph metrics differ from the
PDF's by a hair.

Also propagate the PDF span's font name to the run, stripping the
6-letter subset prefix that fitz reports (e.g. "ABCDEF+Arial" →
"Arial"). PowerPoint substitutes a similar face if the exact font
isn't installed, but at least the canonical name lets the
substitution be reasonable instead of always landing on the theme
default Calibri.

Smoke test on Ubuntu 26.04 + Py3.14.4 with a 1-page PDF that has
text at y=100/300/700 with sizes 22/14/10 confirms each line is in
its own textbox at the matching y position with the correct font
size, in vertical order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nelsonduarte merged commit 3c168cb into main on May 7, 2026
3 checks passed
nelsonduarte deleted the fix/pptx-preserve-layout branch on May 7, 2026 09:30