arXiv equation tables rendered as raw HTML instead of LaTeX #141

@MaxWolf-01

Description

arXiv LaTeXML wraps display equations in <table class="ltx_equation ltx_eqn_table"> elements containing <math alttext="..."> with MathML and a <annotation encoding="application/x-tex"> LaTeX fallback.

Defuddle has detection for these (the Turndown table rule checks classList.contains('ltx_equation'), and handleNestedEquations queries math[alttext]), but standardizeContent() strips both class and alttext before Turndown runs, so neither check matches. The equations fall through to general table processing and appear as raw HTML.

Example

https://arxiv.org/html/1706.03762v7

The equation in section 3.2.1 has this HTML:

<table class="ltx_equation ltx_eqn_table">
  <tbody><tr class="ltx_equation ltx_eqn_row ltx_align_baseline">
    <td class="ltx_eqn_cell ltx_eqn_center_padleft"></td>
    <td class="ltx_eqn_cell ltx_align_center">
      <math alttext="\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(\frac{QK^{T}}{\sqrt{d_{k}}})V"
            class="ltx_Math" display="block">
        <semantics>
          <mrow>...</mrow>
          <annotation encoding="application/x-tex">\mathrm{Attention}(Q,K,V)=...</annotation>
        </semantics>
      </math>
    </td>
    <td class="ltx_eqn_cell ltx_eqn_center_padright"></td>
  </tr></tbody>
</table>

Expected:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(\frac{QK^{T}}{\sqrt{d_{k}}})V$$

Actual:

<table><tbody><tr><td></td><td><math><semantics><mrow><mrow><mi>Attention</mi>...

Raw HTML table with MathML.

Root cause

stripUnwantedAttributes() removes class (not in ALLOWED_ATTRIBUTES) and alttext before Turndown runs. The equation table classList check and the math[alttext] query in handleNestedEquations both fail.

Performance impact

Without the equation table shortcut, every equation table goes through general table processing (cell iteration, recursive turndown() on MathML subtrees). On math-heavy papers this becomes catastrophically slow under JSDOM.

Tested with defuddle/node (v0.9.0) + JSDOM:

~35x slower for ~8x more math elements, consistent with O(n^2) behavior in the general table path. The browser playground handles 2510.08814 in seconds, so this specifically affects the Node.js bundle with JSDOM.
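As a rough check on the scaling claim, take the two rounded ratios above at face value and assume a power law $t \propto n^{k}$ (an assumption for illustration, not something measured here):

$$k = \frac{\ln 35}{\ln 8} \approx \frac{3.56}{2.08} \approx 1.7$$

That is clearly superlinear and closer to quadratic than to linear, though two data points cannot pin down the exponent.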

Fix direction

Process table.ltx_equation / table.ltx_eqn_table elements in standardize.ts before attribute stripping: extract the LaTeX from alttext or <annotation encoding="application/x-tex">, then replace the table with a <math display="block" data-latex="..."> element that the existing Turndown math rule can convert. This removes the equation tables before they reach the general table path.
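
A minimal sketch of what such a pre-pass might look like, not Defuddle's actual code. The names `EqElement`, `EqDocument`, `extractEquationLatex` and `replaceEquationTables` are hypothetical, and the small interfaces stand in for the DOM so the sketch runs without JSDOM. It also assumes that a plain `createElement('math')` is enough for the existing math rule to match; a real patch might need a namespaced element instead.

```typescript
// Hypothetical structural types standing in for DOM Element / Document.
interface EqElement {
  textContent: string | null;
  getAttribute(name: string): string | null;
  setAttribute(name: string, value: string): void;
  querySelector(selector: string): EqElement | null;
  replaceWith(node: EqElement): void;
}

interface EqDocument {
  createElement(tagName: string): EqElement;
  querySelectorAll(selector: string): ArrayLike<EqElement>;
}

const TEX_ANNOTATION = 'annotation[encoding="application/x-tex"]';

// Prefer the alttext attribute; fall back to the TeX annotation.
// Returns null when the table holds no usable LaTeX.
function extractEquationLatex(table: EqElement): string | null {
  const math = table.querySelector('math');
  if (!math) return null;
  const alttext = math.getAttribute('alttext')?.trim();
  if (alttext) return alttext;
  const annotation = math.querySelector(TEX_ANNOTATION)?.textContent?.trim();
  return annotation || null;
}

// Intended to run before attribute stripping, while class and alttext
// are still present. Returns how many tables were replaced.
function replaceEquationTables(doc: EqDocument): number {
  // Snapshot the matches so replacing nodes cannot disturb the iteration.
  const tables = Array.from(
    doc.querySelectorAll('table.ltx_equation, table.ltx_eqn_table')
  );
  let replaced = 0;
  for (const table of tables) {
    const latex = extractEquationLatex(table);
    if (latex === null) continue; // leave tables without LaTeX untouched
    const math = doc.createElement('math');
    math.setAttribute('display', 'block');
    math.setAttribute('data-latex', latex);
    table.replaceWith(math);
    replaced++;
  }
  return replaced;
}
```

Tables with no recoverable LaTeX are deliberately left in place, so they still go through the general table path rather than being dropped.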
