fix(convert): preserve PDF→PPTX line layout and font names #40
Merged
After v1.13.8 made PPTX output editable, the user reported that the
formatting was broken — text appeared but lines collapsed together
and were no longer visually positioned at their original PDF
coordinates.
Root cause: the previous fix used one textbox per *block*, with each
line as a paragraph inside. PowerPoint's textframe layout engine
then re-flowed the lines based on font metrics, ignoring the
original per-line y-coordinates from the PDF. Multi-line blocks
(headings + body in the same block, lists, etc.) got compressed.
Fix: switch from per-block to per-LINE textboxes. Each `line` from
`page.get_text("dict")` becomes its own textbox at the line's bbox,
so PowerPoint has nothing to re-flow vertically — every line stays
exactly where it was in the PDF. Spans within a line stay as runs
in a single paragraph, preserving inline formatting (bold/italic
mid-sentence). Word-wrap is disabled per textbox since each holds
exactly one line; this also stops PowerPoint from breaking a single
line across two visual lines if its glyph metrics differ from the
PDF's by a hair.
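A minimal sketch of the per-line walk described above, not the code from this PR. It assumes the usual `page.get_text("dict")` shape (blocks → lines → spans, with `bbox` in PDF points) and a slide sized to match the PDF page; `LineBox` and `line_boxes` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass

EMU_PER_POINT = 12700  # 914400 EMU per inch / 72 points per inch


@dataclass
class LineBox:
    """Placement for one PDF line, in EMU (hypothetical helper type)."""
    left: int
    top: int
    width: int
    height: int
    runs: list  # one (text, font, size_pt, bold, italic) tuple per span


def line_boxes(page_dict):
    """Yield one LineBox per text line of a get_text("dict")-shaped mapping."""
    for block in page_dict.get("blocks", []):
        if block.get("type", 0) != 0:  # type 0 is text; skip image blocks
            continue
        for line in block.get("lines", []):
            x0, y0, x1, y1 = line["bbox"]
            runs = []
            for span in line.get("spans", []):
                if not span.get("text"):
                    continue
                flags = span.get("flags", 0)
                # PyMuPDF span flags: bit 4 (16) is bold, bit 1 (2) is italic.
                runs.append((span["text"], span.get("font"), span.get("size"),
                             bool(flags & 16), bool(flags & 2)))
            if not runs:
                continue
            yield LineBox(
                left=round(x0 * EMU_PER_POINT),
                top=round(y0 * EMU_PER_POINT),
                width=round((x1 - x0) * EMU_PER_POINT),
                height=round((y1 - y0) * EMU_PER_POINT),
                runs=runs,
            )
```

With python-pptx, each `LineBox` would then map to one `slide.shapes.add_textbox(left, top, width, height)` call with `text_frame.word_wrap = False`, and each tuple in `runs` to one run in that textbox's single paragraph.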
Also propagate the PDF span's font name to the run, stripping the
6-letter subset prefix that fitz reports (e.g. "ABCDEF+Arial" →
"Arial"). PowerPoint substitutes a similar face if the exact font
isn't installed, but at least the canonical name lets the
substitution be reasonable instead of always landing on the theme
default Calibri.
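The prefix stripping can be sketched as follows. This is an illustration under the assumption that a subset prefix is always six uppercase ASCII letters followed by `+`; `canonical_font_name` is a hypothetical name, not necessarily the helper used in this PR.

```python
import re

# Subset-embedded fonts carry a tag such as "ABCDEF+" in front of the real name.
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def canonical_font_name(name):
    """Return the font name without a leading six-letter subset tag."""
    return _SUBSET_PREFIX.sub("", name)
```

Names without such a prefix pass through unchanged, so the function can be applied to every span's `font` value unconditionally.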
Smoke test on Ubuntu 26.04 + Py3.14.4 with a 1-page PDF that has
text at y=100/300/700 with sizes 22/14/10 confirms each line is in
its own textbox at the matching y position with the correct font
size, in vertical order.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Problem
User report after v1.13.8: "it recognizes the text now, but the formatting is all broken".
The previous fix made PPTX output editable but used one textbox per block, with each line of the block as a paragraph inside. PowerPoint's textframe re-flowed those paragraphs based on font metrics, ignoring the per-line y-coordinates from the PDF. Multi-line blocks collapsed visually — headings + body that were spaced apart in the PDF stacked together in PowerPoint.
Fix
Switch from per-block to per-line textboxes:
- Each `line` from `page.get_text("dict")` gets its own `add_textbox` at the line's bbox, so PowerPoint has nothing to re-flow vertically and every line stays at its original PDF y position.
- `word_wrap = False` per textbox since each holds exactly one line; this stops PowerPoint from breaking a line across two visual lines if its glyph width metrics differ from the PDF's by a hair.

Also propagate the PDF span's font name to the run, stripping the 6-letter subset prefix that fitz reports (e.g. `ABCDEF+Arial` → `Arial`). PowerPoint substitutes a similar face if the exact font isn't installed, but at least the canonical name lets the substitution be reasonable instead of always landing on the theme default Calibri.
Test plan
🤖 Generated with Claude Code