v3.4.0
html-to-markdown 3.4.0 — high-performance HTML to Markdown converter with a Rust core and polyglot bindings (Python, Node/TypeScript, Ruby, PHP, Go, Java, C#, Elixir, R, WebAssembly, C FFI).
Install
| Language | Command |
|---|---|
| Rust | cargo add html-to-markdown-rs |
| Python | pip install html-to-markdown |
| Node / TS | npm install @kreuzberg/html-to-markdown |
| WASM | npm install @kreuzberg/html-to-markdown-wasm |
| Ruby | gem install html-to-markdown |
| PHP | pie install kreuzberg-dev/html-to-markdown-rs |
| Go | go get github.com/kreuzberg-dev/html-to-markdown/packages/go/v3 |
| Java | Maven Central: dev.kreuzberg:html-to-markdown:3.4.0 |
| C# | dotnet add package KreuzbergDev.HtmlToMarkdown |
| Elixir | {:html_to_markdown, "~> 3.4"} in mix.exs |
| R | install.packages("htmltomarkdown") |
| Homebrew (CLI) | brew install kreuzberg-dev/tap/html-to-markdown |
| Homebrew (lib) | brew install kreuzberg-dev/tap/libhtml-to-markdown |
Added
- Homebrew distribution for `html-to-markdown` (CLI) and `libhtml-to-markdown` (FFI library + headers + pkg-config + CMake configs). Pre-built tarballs for macOS arm64/x86_64 and Linux arm64/x86_64; install with `brew install kreuzberg-dev/tap/html-to-markdown`.
- WASM bundles for all four wasm-pack targets (`web`, `bundler`, `nodejs`, `deno`) under `@kreuzberg/html-to-markdown-wasm`.
- C# NuGet package `KreuzbergDev.HtmlToMarkdown` with native runtimes for linux-x64, linux-arm64, osx-x64, osx-arm64, win-x64, win-arm64.
- Java Maven Central package `dev.kreuzberg:html-to-markdown` bundling native libraries for the same six platforms via `META-INF/native/<rid>/`.
- Elixir Hex package with `rustler_precompiled` NIFs for Linux + macOS (NIF 2.16/2.17 × 3 platforms); released artifacts download at first run.
- PHP PIE pre-built archives for PHP 8.2/8.3/8.4/8.5 × 6 platforms: `pie install kreuzberg-dev/html-to-markdown-rs` no longer requires building from source.
- CLI panic guard: conversion failures inside the CLI now surface as actionable errors via `panic::catch_unwind` instead of partial output + Rust backtrace.
- `HtmlVisitor` parity across all bindings: Python, Node/TypeScript, Ruby, PHP, Go, Java, C#, Elixir, R, and WASM all expose the visitor interface with `visit_element_start` / `visit_text` / `visit_element_end` and `VisitResult::{Continue, Skip, Custom}` semantics matching the Rust core.
- Polyglot codegen via alef: bindings, e2e tests, and READMEs for all 11 target languages are generated from a single `alef.toml` + Rust source of truth, eliminating drift across the polyglot surface.
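
For readers unfamiliar with the panic-guard pattern mentioned above, here is a minimal sketch of how a panic can be turned into an ordinary error with `std::panic::catch_unwind`. The function name `guarded` and the error string format are hypothetical and chosen for illustration; this is not the CLI's actual code.

```rust
use std::panic;

/// Hypothetical wrapper (illustration only): runs `convert` and maps a panic
/// to an `Err` carrying the panic message instead of unwinding further.
fn guarded<F>(convert: F) -> Result<String, String>
where
    F: FnOnce() -> String + panic::UnwindSafe,
{
    panic::catch_unwind(convert).map_err(|payload| {
        // A panic payload is usually a &'static str or a String.
        let msg = payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| payload.downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "unknown panic".to_string());
        format!("conversion failed: {}", msg)
    })
}
```

Note that the default panic hook still prints its message to stderr unless it is replaced with `std::panic::set_hook`, and `catch_unwind` only works when the binary is built with unwinding panics rather than `panic = "abort"`.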
Fixed
- #348: `OutputFormat::Plain` ignored `HtmlVisitor` callbacks. The plain-text walker (`crates/html-to-markdown/src/converter/plain_text.rs`) ran the markdown pipeline first, then discarded its output and re-traversed the DOM via a visitor-less `walk_plain`, so `VisitResult::Custom` / `Skip` returned from `visit_element_end` / `visit_text` was silently dropped for `Plain`. Threaded a `WalkState` carrying the visitor through the plain walker so element/text hooks fire and their results are honoured.
- #347: `<img src>` URLs not escaped, breaking CommonMark round-trip. `crates/html-to-markdown/src/converter/handlers/image.rs` emitted `src` raw, while `<a href>` already wrapped spaces/parens in angle brackets. Image renderer now uses the same three-branch escaping as links: empty → `<>`, contains space/newline → `<URL>`, unbalanced parens → `\(` / `\)` escaping.
- #336: large MS Word HTML truncated when `<td><p class='MsoNormal'>…</td>` appears as the leading cell. The `tl` parser absorbs subsequent `<td>` and document content into the unclosed `<p>`, nesting the rest of the DOM inside the first table cell. Extended `has_inline_block_misnest` in `converter/preprocessing_helpers.rs` with a `has_p_ancestor` check that detects `td` / `tr` / `th` under `<p>` (structurally impossible in valid HTML) and triggers the existing html5ever repair path.
- Split closing tags `</tagname\n>` corrupted DOM and dropped content. JSX-style HTML (closing-tag `>` on the next line) caused the `tl` parser to leave elements unclosed, which silently absorbed siblings and dropped entire sections, affecting #127 (MW841 product headings missing from multilingual page), #143 (word-wrap merging nested link list items), and #121 (SPA menu nesting). New `normalize_split_closing_tags` preprocessing pass collapses such patterns to `</tagname>` before parsing, wired into all four preprocessing branches in `converter/main.rs`.
- Tables now emit padded, aligned columns. Each cell is padded to the widest cell in its column; the separator row uses `max(3, col_width)` dashes per column. `*` and `_` are escaped in table cells regardless of `escape_misc`. Fixes the gh-140 fixture parity and produces CommonMark-conformant tables out of the box.
- #339: bogus HTML comment endings dropped following content. The `astral-tl` parser silently discarded every byte after `<!-- /// --->` or any `--[-]+>` comment terminator. New `normalize_bogus_comment_endings` preprocessing pass rewrites such sequences to `-->` before parsing; wired into the html5ever-repair and inline-block-misnest fallback paths too.
- #340: npm pre-release versions clobbered the `latest` dist-tag. Pre-release versions (matching `-(rc|beta|alpha|pre|dev)`) now publish under the `next` dist-tag, so `npm install @kreuzberg/html-to-markdown-node` no longer pulls a 3.4.0-rc over a stable 3.3.x.
- #337: `from html_to_markdown import HeadingStyle` raised `TypeError`. The package now re-exports the native PyO3 enums directly from `_html_to_markdown` and adds uppercase aliases (`HeadingStyle.ATX`, `CodeBlockStyle.BACKTICKS`) so both naming conventions satisfy `ConversionOptions(heading_style=…)`.
- #334: Ruby `HtmlToMarkdown.convert(html, options)` raised `TypeError` on every call with options. The wrapper passed a `ConversionOptions` object to the FFI, but the generated Rust function expects `Option<String>` JSON. Wrapper now serialises the options hash to JSON before crossing the FFI boundary.
- #332: `default-features = false` Rust build broken. Bare `#[serde(...)]` and `#[derive(Serialize, Deserialize)]` on core types in `src/types/{document,tables,result,warnings}.rs` and `src/options/conversion.rs` are now feature-gated behind `#[cfg_attr(feature = "serde", ...)]`. CI now runs a `cargo check --no-default-features` matrix to prevent regressions.
- #331: visitor `element_start` / `element_end` events mispaired for hyphenated/namespaced custom tags. The `repair_with_html5ever` fallback re-parsed under HTML5 semantics, which discard XML-style self-closing on unknown elements. The repair path now pre-expands XML self-closing tags on non-void elements to explicit open+close pairs before the HTML5 parse.
- PHP visitor marshaling: visitor callbacks now correctly marshal arguments and handle array return values; `setVisitor()` method added to `ConversionOptions`.
- Elixir metadata serialization: metadata maps now serialize as JSON instead of Elixir debug format.
- WASM Vitest environment: WASM module loading now correctly handles Node.js module format in Vitest test environments.
- R e2e result wrapping: `result_is_r_list` configured to suppress `jsonlite` double-wrapping of conversion results.
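
The three-branch destination escaping described under #347 can be sketched roughly as follows. The names `escape_destination` and `parens_balanced` are hypothetical, and the crate's real implementation may order or refine the branches differently (for example, this sketch does not handle `<` or `>` inside the URL).

```rust
/// Hypothetical helper (illustration only): escapes a link or image
/// destination using the three branches named in the release notes.
fn escape_destination(url: &str) -> String {
    // Branch 1: an empty destination is written as `<>`.
    if url.is_empty() {
        return "<>".to_string();
    }
    // Branch 2: spaces or newlines require the angle-bracket form.
    if url.chars().any(|c| c == ' ' || c == '\n') {
        return format!("<{}>", url);
    }
    // Branch 3: unbalanced parentheses are backslash-escaped.
    if !parens_balanced(url) {
        return url.replace('(', "\\(").replace(')', "\\)");
    }
    url.to_string()
}

/// Returns true when every `)` closes an earlier `(` and none stay open.
fn parens_balanced(s: &str) -> bool {
    let mut depth: i32 = 0;
    for c in s.chars() {
        match c {
            '(' => depth += 1,
            ')' => {
                depth -= 1;
                if depth < 0 {
                    return false;
                }
            }
            _ => {}
        }
    }
    depth == 0
}
```

Balanced parentheses are left untouched because CommonMark permits them in a bare link destination.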
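
To illustrate the split-closing-tag fix, the sketch below collapses `</tagname`, followed by whitespace and `>`, into `</tagname>`. It is a simplified stand-in written for these notes, not the crate's `normalize_split_closing_tags`: the function name carries a `_sketch` suffix to mark it as hypothetical, and it makes no attempt to skip `<script>`, `<pre>`, or comment content.

```rust
/// Hypothetical, simplified version of the preprocessing pass: removes
/// whitespace between a closing tag's name and its `>`.
fn normalize_split_closing_tags_sketch(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut rest = html;
    while let Some(pos) = rest.find("</") {
        // Copy everything up to and including the "</" marker.
        out.push_str(&rest[..pos + 2]);
        rest = &rest[pos + 2..];
        // Assumed tag-name alphabet: ASCII alphanumerics, '-' and ':'.
        let name_len = rest
            .find(|c: char| !(c.is_ascii_alphanumeric() || c == '-' || c == ':'))
            .unwrap_or(rest.len());
        out.push_str(&rest[..name_len]);
        rest = &rest[name_len..];
        if name_len == 0 {
            continue; // "</" not followed by a tag name; leave it alone
        }
        // Drop a whitespace run only when it is directly followed by '>'.
        let trimmed = rest.trim_start();
        if trimmed.len() != rest.len() && trimmed.starts_with('>') {
            rest = trimmed;
        }
    }
    out.push_str(rest);
    out
}
```

Well-formed closing tags pass through unchanged, so the pass is safe to run unconditionally on input that has no split tags.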
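
The padded-table behaviour can be approximated with the sketch below. `render_table` is a hypothetical function that treats the first row as the header. As an assumption made here to keep columns aligned, cells are padded to the same `max(3, col_width)` width as the separator row, and widths are measured after `*` and `_` are escaped; the crate may measure or pad differently.

```rust
/// Hypothetical renderer (illustration only): first row is the header.
fn render_table(rows: &[Vec<&str>]) -> String {
    let cols = rows.iter().map(|r| r.len()).max().unwrap_or(0);
    // Escape `*` and `_` in every cell before measuring widths.
    let escaped: Vec<Vec<String>> = rows
        .iter()
        .map(|r| {
            r.iter()
                .map(|c| c.replace('*', "\\*").replace('_', "\\_"))
                .collect()
        })
        .collect();
    // Column width = max(3, widest cell), counted in characters.
    let mut widths = vec![3usize; cols];
    for row in &escaped {
        for (i, cell) in row.iter().enumerate() {
            widths[i] = widths[i].max(cell.chars().count());
        }
    }
    let mut out = String::new();
    for (n, row) in escaped.iter().enumerate() {
        out.push('|');
        for i in 0..cols {
            // Missing trailing cells are rendered as empty, padded cells.
            let cell = row.get(i).map(String::as_str).unwrap_or("");
            out.push_str(&format!(" {:<width$} |", cell, width = widths[i]));
        }
        out.push('\n');
        if n == 0 {
            // Separator row directly after the header.
            out.push('|');
            for w in &widths {
                out.push_str(&format!(" {} |", "-".repeat(*w)));
            }
            out.push('\n');
        }
    }
    out
}
```

Counting characters rather than bytes keeps non-ASCII cells aligned in most cases, although wide (for example CJK) glyphs would still need a display-width calculation.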
Changed
- pnpm v11: migrated from pnpm v10 to v11; `pnpm-workspace.yaml` declares `onlyBuiltDependencies: [esbuild]` and `ignoredBuiltDependencies: [wasm-pack]` for the new opt-in build script policy.
- Cross-language dependency bumps: `org.jetbrains:annotations` 26.0.0 → 26.1.0, plus updates across all language toolchains via `task upgrade`.