fix: preserve angle brackets in code when exporting Markdown#786

Merged
mishig25 merged 1 commit into main from fix/strip-html-preserve-code-brackets
Apr 28, 2026
mishig25 commented Apr 28, 2026

Summary

  • The View as Markdown button (and the .md companion files) were silently dropping <...> content. Two failure modes:
    • Placeholders eaten: literal placeholders inside fenced code (e.g. <YOUR_SUBSCRIPTION_ID>) were stripped because strip_remaining_html matched them as HTML tags.
    • Snippets nuked: a stray < in code (e.g. if idx < n:) plus any > later in the page (e.g. mask_img > 0) caused the regex <[^>]+> to eat everything between them across line boundaries — wiping entire code blocks and surrounding prose. (Reported by @juanju and @Mishig.)
  • Fix in src/doc_builder/utils.py::strip_remaining_html:
    • Stash fenced (``` ... ```) and inline (`...`) code into placeholders before any HTML stripping; restore them afterward, so brackets inside code survive verbatim.
    • Tighten the generic-tag fallback from <[^>]+> to <[a-zA-Z/!][^>\n]*>, which requires a tag-like character right after < and forbids matches across newlines.
  • Added regression tests in tests/test_utils.py covering both screenshots' scenarios plus a positive test that real tags (<Tip>, <div>) are still stripped.
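The stash-and-restore approach described above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the function name `strip_html_preserving_code`, the placeholder format, and the two code-matching patterns are assumptions, and the real `strip_remaining_html` may handle more cases. Only the tightened tag pattern is taken from the summary.

```python
import re

# Assumed patterns for locating code; the actual implementation may differ.
FENCED_CODE = re.compile(r"```.*?```", re.DOTALL)
INLINE_CODE = re.compile(r"`[^`\n]+`")
# Tightened generic-tag fallback, as quoted in the summary.
GENERIC_TAG = re.compile(r"<[a-zA-Z/!][^>\n]*>")
# Hypothetical placeholder format: NUL bytes are unlikely to occur in docs.
PLACEHOLDER = re.compile(r"\x00CODE(\d+)\x00")


def strip_html_preserving_code(text):
    """Strip tag-like spans outside code; leave code spans untouched."""
    stash = []

    def _stash(match):
        # Remember the code span and leave an indexed placeholder behind.
        stash.append(match.group(0))
        return f"\x00CODE{len(stash) - 1}\x00"

    # Fenced blocks first, so inline matching never sees their backticks.
    text = FENCED_CODE.sub(_stash, text)
    text = INLINE_CODE.sub(_stash, text)

    # Only now strip HTML-like tags from the remaining prose.
    text = GENERIC_TAG.sub("", text)

    # Put the original code spans back verbatim.
    return PLACEHOLDER.sub(lambda m: stash[int(m.group(1))], text)


fence = "`" * 3  # built at runtime to keep this example's own fence intact
sample = (
    "<Tip>Use `<YOUR_TOKEN>` here.</Tip>\n"
    f"{fence}python\nif idx < n:\n    keep = mask_img > 0\n{fence}\n"
)
print(strip_html_preserving_code(sample))
```

Stashing fenced blocks before inline spans matters in this sketch: otherwise the inline pattern could pair backticks that belong to a fence.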

Test plan

  • uv run pytest tests/test_utils.py -v — all 5 tests pass (3 new + 2 existing)
  • uv run ruff check src/doc_builder/utils.py tests/test_utils.py — clean
  • uv run ruff format src/doc_builder/utils.py tests/test_utils.py — clean
  • Manually verify on the affected pages (Microsoft Foundry / Meta SAM 3 docs) once deployed

🤖 Generated with Claude Code

cc @alvarobartt

…arkdown

The "View as Markdown" / .md export was running a generic `<[^>]+>` strip
that ate any `<...>` segment, including literal placeholders like
`<YOUR_TOKEN>` inside fenced code and even spans across many lines when
a stray `<` (e.g. `if idx < n:`) was followed later by a `>`
(e.g. `mask_img > 0`), nuking entire snippets.
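The cross-line failure can be reproduced with the two patterns quoted in this PR. The snippet below is an illustrative sketch: the sample strings and variable names are made up, and it applies the regexes directly rather than going through the project's function.

```python
import re

OLD_TAG = re.compile(r"<[^>]+>")              # original generic strip
NEW_TAG = re.compile(r"<[a-zA-Z/!][^>\n]*>")  # tightened fallback

# A stray "<" on one line and a ">" two lines later.
snippet = "if idx < n:\n    total += 1\nkeep = mask_img > 0\n"

# [^>] also matches newlines, so the old pattern removes everything
# from the "<" through the later ">", including the middle line.
old_result = OLD_TAG.sub("", snippet)

# "<" followed by a space is not tag-like, so nothing is removed.
new_result = NEW_TAG.sub("", snippet)

# The tightened pattern alone still treats a placeholder as a tag,
# which is why code is stashed before stripping.
placeholder = "export SUBSCRIPTION=<YOUR_SUBSCRIPTION_ID>"
placeholder_result = NEW_TAG.sub("", placeholder)

print(repr(old_result))
print(repr(new_result))
print(repr(placeholder_result))
```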

Fix: stash fenced and inline code before HTML stripping and restore
afterward; tighten the generic-tag fallback to require a tag-like
character after `<` and to forbid newline crossings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mishig25 merged commit 8991858 into main on Apr 28, 2026
4 checks passed
mishig25 deleted the fix/strip-html-preserve-code-brackets branch on April 28, 2026 12:28