arXiv footnote marks: doubled superscripts and "footnotemark:" text leaking #144

@MaxWolf-01

Author affiliations and other footnote marks in arXiv HTML produce garbled output: each mark appears twice, and the literal string "footnotemark: N" leaks into the markdown.

Example

https://arxiv.org/html/2305.18290v2

The author list has this HTML structure for each footnote mark:

Rafael Rafailov<span class="ltx_note ltx_role_footnotemark">
  <sup class="ltx_note_mark">2</sup>
  <span class="ltx_note_outer">
    <span class="ltx_note_content">
      <sup class="ltx_note_mark">2</sup>
      <span class="ltx_note_type">footnotemark: </span>
      <span class="ltx_tag ltx_tag_note">2</span>
    </span>
  </span>
</span>

On arxiv.org, ltx_note_outer is set to display: none, so only the first <sup>2</sup> is visible.

Expected:

Rafael Rafailov <sup>2</sup>

Actual:

Rafael Rafailov <sup>2</sup> <sup>2</sup> footnotemark: 2

The hidden content leaks because class attributes are stripped before the CSS-hidden elements can be identified, and JSDOM doesn't compute styles, so the display: none rule never takes effect during conversion.

Fix direction

Same pattern as #141: process ltx_note_outer or ltx_role_footnotemark elements during standardization before class stripping. Remove the ltx_note_outer span (always hidden on arXiv), keeping only the first <sup> mark.
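
A minimal sketch of what that standardization step might look like. The function name, the constant, and the two small interfaces are hypothetical and not taken from the project's code. The selector is also an assumption: it targets only ltx_note_outer spans that sit directly inside an ltx_role_footnotemark wrapper, as in the example above, so that other note types are left alone.

```typescript
// Hypothetical helper, not part of the project's actual code.
// Structural types so it accepts a browser or JSDOM Document/Element
// without depending on DOM typings.
interface Removable {
  remove(): void;
}

interface Queryable {
  querySelectorAll(selectors: string): ArrayLike<Removable>;
}

// Matches the hidden duplicate inside each footnote mark
// (class names taken from the arXiv HTML quoted above).
const HIDDEN_FOOTNOTE_SELECTOR = ".ltx_role_footnotemark > .ltx_note_outer";

// Removes every hidden ltx_note_outer span and returns how many were removed.
// It matches on class names, so it has to run before classes are stripped.
function removeHiddenFootnoteMarks(root: Queryable): number {
  // Copy into an array so the loop works on any array-like result.
  const hidden = Array.from(root.querySelectorAll(HIDDEN_FOOTNOTE_SELECTOR));
  for (const node of hidden) {
    node.remove();
  }
  return hidden.length;
}
```

Called with the parsed document (for example removeHiddenFootnoteMarks(document)) before class stripping, this should leave only the first <sup> mark in each footnote wrapper. Whether regular footnotes need the same treatment would have to be checked separately.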
