Summary
PptxConverter.convert performs unguarded + concatenation on shape.text and notes_frame.text. When python-pptx yields None for a text frame (e.g. an <a:r> with no <a:t>, or certain third-party-generated decks), the whole conversion fails with:
PptxConverter threw TypeError with message: can only concatenate str (not "NoneType") to str
One malformed shape fails the entire file.
Affected lines (v0.1.5, also present on main)
packages/markitdown/src/markitdown/converters/_pptx_converter.py:
- L137: md_content += "# " + shape.text.lstrip() + "\n"
- L139: md_content += shape.text + "\n"
- L154: md_content += notes_frame.text
Reproduction
import pptx.text.text
_orig = pptx.text.text.TextFrame.text.fget
pptx.text.text.TextFrame.text = property(
lambda self: None if _orig(self) == "World" else _orig(self)
)
from markitdown import MarkItDown
MarkItDown().convert("sample.pptx") # any pptx containing the text "World"
# markitdown._exceptions.FileConversionException:
# - PptxConverter threw TypeError with message: unsupported operand type(s) for +: 'NoneType' and 'str'
Real-world trigger: a .pptx from a third-party generator with a text-frame run missing its <a:t> child, or a chart/SmartArt element whose title text is None.
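The exact exception depends on which of the three affected lines meets the None value first, which may explain why the message quoted in the Summary differs from the one in the reproduction output. A minimal sketch in plain Python, with no python-pptx involved: `failure_for` is a hypothetical helper used only for illustration, and `text = None` stands in for `shape.text` / `notes_frame.text`.

```python
def failure_for(expr):
    """Run expr() and return (exception class name, message), or None on success."""
    try:
        expr()
    except Exception as exc:
        return type(exc).__name__, str(exc)
    return None


text = None  # stand-in for a text property that returned None


def heading_line():
    # Shape of L137: the method call on None fails before any concatenation.
    return "# " + text.lstrip() + "\n"


def body_line():
    # Shape of L139: None is the left operand of +.
    md_content = ""
    md_content += text + "\n"
    return md_content


def notes_line():
    # Shape of L154: None is the right operand of str +=.
    md_content = ""
    md_content += text
    return md_content


for name, fn in [("L137", heading_line), ("L139", body_line), ("L154", notes_line)]:
    print(name, failure_for(fn))
```

In this sketch the L139 pattern produces the "unsupported operand type(s)" TypeError and the L154 pattern produces the "can only concatenate str" TypeError. The L137 pattern raises AttributeError rather than TypeError, because `.lstrip()` is called on None first.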
Expected
Missing / None text should be treated as empty string. One bad shape shouldn't fail the entire file.
Proposed fix
md_content += "# " + (shape.text or "").lstrip() + "\n"
md_content += (shape.text or "") + "\n"
md_content += (notes_frame.text or "")
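To check the guarded form without a real deck, the sketch below applies the same `or ""` coalescing to stand-in objects. `render_shapes` and the `SimpleNamespace` shapes are hypothetical and only mimic a `.text` attribute. This is not the converter's actual loop.

```python
from types import SimpleNamespace


def render_shapes(shapes, title=None):
    """Concatenate shape text as Markdown, treating None text as empty."""
    md_content = ""
    for shape in shapes:
        text = shape.text or ""  # None becomes "", as in the proposed fix
        if shape is title:
            md_content += "# " + text.lstrip() + "\n"
        else:
            md_content += text + "\n"
    return md_content


title = SimpleNamespace(text="  Hello")
shapes = [title, SimpleNamespace(text=None), SimpleNamespace(text="World")]
print(repr(render_shapes(shapes, title=title)))
```

With this guard a shape whose text is None still contributes an empty line. Whether such shapes should be skipped entirely is a separate choice.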
Happy to open a PR.
Environment
- markitdown 0.1.5 (markitdown[all])
- python-pptx 1.0.2
- Python 3.12, Ubuntu 24.04