arXiv footnote marks: doubled superscripts and "footnotemark:" text leaking
Author affiliations and other footnote marks in arXiv HTML produce garbled output — each mark appears twice and the literal string "footnotemark: N" leaks into the markdown.
Example
https://arxiv.org/html/2305.18290v2
The author list has this HTML structure for each footnote mark:
Rafael Rafailov<span class="ltx_note ltx_role_footnotemark">
  <sup class="ltx_note_mark">2</sup>
  <span class="ltx_note_outer">
    <span class="ltx_note_content">
      <sup class="ltx_note_mark">2</sup>
      <span class="ltx_note_type">footnotemark: </span>
      <span class="ltx_tag ltx_tag_note">2</span>
    </span>
  </span>
</span>
On arxiv.org, ltx_note_outer is display: none — only the first <sup>2</sup> is visible.
Expected:
Rafael Rafailov <sup>2</sup>
Actual:
Rafael Rafailov <sup>2</sup> <sup>2</sup> footnotemark: 2
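The hidden content leaks because class attributes are stripped before the CSS-hidden elements can be identified, and JSDOM doesn't compute styles, so by the time the markdown is generated nothing marks the ltx_note_outer span as hidden.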
Fix direction
Same pattern as #141: process ltx_note_outer or ltx_role_footnotemark elements during standardization before class stripping. Remove the ltx_note_outer span (always hidden on arXiv), keeping only the first <sup> mark.
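A minimal sketch of that fix direction, assuming the standardization step has a DOM root it can query before classes are stripped. The function name `stripHiddenArxivNotes` and its call site are hypothetical, not existing project code. It uses only standard DOM methods (`querySelectorAll`, `Element.remove`), and it is scoped to `ltx_role_footnotemark` so that ordinary footnote bodies are left for separate handling.

```javascript
// Hypothetical helper: the name and where it is called from are assumptions.
// Must run BEFORE class attributes are stripped, since it selects by class.
function stripHiddenArxivNotes(root) {
  // arXiv hides .ltx_note_outer with display: none. Inside a footnotemark
  // wrapper it only repeats the mark and the literal "footnotemark: N" text.
  const hidden = Array.from(
    root.querySelectorAll('.ltx_role_footnotemark .ltx_note_outer')
  );

  // Snapshot first (Array.from), then remove, so iteration is not disturbed.
  for (const el of hidden) {
    el.remove();
  }

  // Return the count so callers can log or assert on it.
  return hidden.length;
}
```

After this runs on the example above, the wrapper span holds only the first `<sup class="ltx_note_mark">2</sup>`, which matches the expected output.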