arXiv LaTeXML wraps display equations in <table class="ltx_equation ltx_eqn_table"> elements containing <math alttext="..."> with MathML and a <annotation encoding="application/x-tex"> LaTeX fallback.
Defuddle has detection for these (the Turndown table rule checks classList.contains('ltx_equation'), and handleNestedEquations queries math[alttext]), but standardizeContent() strips both class and alttext before Turndown runs, so neither check matches. The equations fall through to general table processing and appear as raw HTML.
Example
https://arxiv.org/html/1706.03762v7
The equation in section 3.2.1 has this HTML:
<table class="ltx_equation ltx_eqn_table">
<tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
<td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
<td class="ltx_eqn_cell ltx_align_center">
<math alttext="\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(\frac{QK^{T}}{\sqrt{d_{k}}})V"
class="ltx_Math" display="block">
<semantics>
<mrow>...</mrow>
<annotation encoding="application/x-tex">\mathrm{Attention}(Q,K,V)=...</annotation>
</semantics>
</math>
</td>
<td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
</tr></tbody>
</table>
Expected:
$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(\frac{QK^{T}}{\sqrt{d_{k}}})V$$
Actual:
<table><tbody><tr><td></td><td><math><semantics><mrow><mrow><mi>Attention</mi>...
Raw HTML table with MathML.
Root cause
stripUnwantedAttributes() removes class (not in ALLOWED_ATTRIBUTES) and alttext before Turndown runs. The equation table classList check and the math[alttext] query in handleNestedEquations both fail.
Performance impact
Without the equation table shortcut, every equation table goes through general table processing (cell iteration, recursive turndown() on MathML subtrees). On math-heavy papers this becomes catastrophically slow under JSDOM.
Tested with defuddle/node (v0.9.0) + JSDOM:
~35x slower for ~8x more math elements, consistent with O(n^2) behavior in the general table path. The browser playground handles 2510.08814 in seconds, so this specifically affects the Node.js bundle with JSDOM.
Fix direction
Process table.ltx_equation / table.ltx_eqn_table elements in standardize.ts before attribute stripping: extract the LaTeX from alttext or <annotation encoding="application/x-tex">, then replace the table with a <math display="block" data-latex="..."> element that the existing Turndown math rule can convert. This removes the equation tables before they reach the general table path.
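A minimal sketch of such a pre-pass, not Defuddle's actual code. The names hoistEquationTables and extractLatex and the small structural interfaces are hypothetical; the interfaces exist only so the sketch runs without a DOM, and a real implementation would work on Element objects inside standardize.ts. As a simplification it reuses the existing <math> node instead of creating a new one, and it skips any table that does not contain exactly one <math>, on the assumption that multi-equation tables need separate handling.

```typescript
// Hypothetical structural types: just the DOM methods this sketch calls.
interface MathLike {
  getAttribute(name: string): string | null;
  setAttribute(name: string, value: string): void;
  querySelector(selector: string): { textContent: string | null } | null;
}

interface EquationTableLike {
  querySelectorAll(selector: string): ArrayLike<MathLike>;
  replaceWith(node: MathLike): void;
}

interface RootLike {
  querySelectorAll(selector: string): ArrayLike<EquationTableLike>;
}

// Prefer alttext; fall back to the TeX annotation inside <semantics>.
function extractLatex(math: MathLike): string | null {
  const alt = math.getAttribute('alttext');
  if (alt && alt.trim()) return alt.trim();
  const annotation = math.querySelector('annotation[encoding="application/x-tex"]');
  const text = annotation?.textContent?.trim();
  return text ? text : null;
}

// Must run BEFORE attribute stripping, while class and alttext still exist.
// Returns the number of equation tables that were replaced.
function hoistEquationTables(root: RootLike): number {
  // Snapshot the matches so replacing nodes does not disturb iteration.
  const tables = Array.from(
    root.querySelectorAll('table.ltx_equation, table.ltx_eqn_table')
  );
  let converted = 0;
  for (const table of tables) {
    const maths = table.querySelectorAll('math');
    // Assumption: only single-equation tables are safe to collapse.
    if (maths.length !== 1) continue;
    const math = maths[0];
    const latex = extractLatex(math);
    if (!latex) continue; // Leave the table alone if no LaTeX is recoverable.
    math.setAttribute('display', 'block');
    math.setAttribute('data-latex', latex);
    table.replaceWith(math); // The <math> node takes the table's place.
    converted++;
  }
  return converted;
}
```

This assumes display and data-latex survive the later attribute stripping, which the fix direction implies but the sketch does not verify.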