# arXiv HTML to Markdown Converter

Converts arXiv papers to markdown by fetching their LaTeXML-rendered HTML. The converter tries **ar5iv.org** first, then falls back to **arxiv.org/html** (useful for brand-new papers not yet on ar5iv). Both sources share the same LaTeXML HTML structure, so one conversion path handles either. Math is recovered from the `alttext` attribute on `<math>` elements, which holds the LaTeX source of each formula.
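
The delimiter handling behind the `alttext` recovery can be sketched on plain strings. This is a minimal illustration rather than the class's own code path: `alttext_to_markdown` is a hypothetical helper name, and it assumes the attribute value has already been read from the `<math>` element.

```python
import re

def alttext_to_markdown(alttext: str, display: str = "inline") -> str:
    """Wrap LaTeX taken from a <math> alttext attribute in Markdown math delimiters."""
    latex = re.sub(r'^\\displaystyle\s*', '', alttext)  # drop a leading \displaystyle
    latex = re.sub(r'^\\\[|\\\]$', '', latex)            # drop surrounding \[ ... \]
    latex = re.sub(r'^\\\(|\\\)$', '', latex)            # drop surrounding \( ... \)
    # Block-level math gets $$...$$, everything else gets inline $...$
    return f"$${latex}$$" if display == "block" else f"${latex}$"

print(alttext_to_markdown(r"\alpha"))                                   # $\alpha$
print(alttext_to_markdown(r"\displaystyle E=mc^{2}", display="block"))  # $$E=mc^{2}$$
```

Stripping existing delimiters first avoids doubled markers such as `$\(x\)$` when the `alttext` already carries its own.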

## Quick Start (via Tools)

The recommended way to use this is via the arXiv tools (see the "arXiv Paper Tools" section below). These support **multiple papers** in a single session with **automatic session persistence**:

1. **Fetch papers:** `arxiv_fetch("1706.03762")` — adds TOC as a note, stores paper for later
2. **List loaded papers:** `arxiv_list()` — shows all papers with abstracts for disambiguation
3. **Get sections:** `arxiv_section("3", "1706.03762")` — adds section as a note
4. **Get all sections:** `arxiv_all_sections("1706.03762")` — adds all sections as notes
5. **Unload paper:** `arxiv_remove("1706.03762")` — removes from store

## Session Persistence

Paper IDs are saved to `.arxiv_session.json`. After a kernel restart:
- **`arxiv_list()`** — silently reloads all papers from the session file
- **`arxiv_section()`** / **`arxiv_all_sections()`** — auto-refetches if needed, with a "(Re-fetched ... after kernel restart)" message
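
The persistence scheme amounts to a JSON list of IDs on disk. The sketch below shows that round trip in isolation, assuming hypothetical helper names (`save_ids`, `load_ids`) and a temporary directory in place of the working folder; the two IDs are only sample values.

```python
import json
import tempfile
from pathlib import Path

def save_ids(path: Path, paper_ids) -> None:
    """Write the paper IDs to disk as a JSON list."""
    path.write_text(json.dumps(sorted(paper_ids)))

def load_ids(path: Path) -> set:
    """Read the paper IDs back; a missing file means an empty session."""
    if path.exists():
        return set(json.loads(path.read_text()))
    return set()

with tempfile.TemporaryDirectory() as tmp:
    session_file = Path(tmp) / ".arxiv_session.json"
    before = load_ids(session_file)                       # no file yet
    save_ids(session_file, {"1706.03762", "1810.04805"})
    restored = load_ids(session_file)                     # what a fresh kernel would read
```

Because only IDs are written, the paper content itself is not cached: a restarted kernel still has to fetch each paper again before it can serve sections.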

## Manual Usage (Direct Class)

```python
# Create converter from arXiv ID or URL
conv = Ar5ivToMarkdown("1706.03762").fetch()

# Check which source was used (ar5iv.org or arxiv.org)
print(f"Fetched from: {conv.source_base}")

# Browse the table of contents
print(conv.toc())

# Access sections programmatically
conv.sections  # List of (index, num, title, content)

# Get full markdown
md = conv.convert()

# Add sections as dialog notes (these methods are async, so await them)
await conv.add_section_by_num(3)          # Section 3
await conv.add_section_by_num("A.1")      # Appendix A.1
await conv.add_section_by_num("Abstract") # By title
await conv.add_all_sections()             # All sections
```

## Dependencies
`httpx`, `beautifulsoup4`, `lxml`, `html2text`

## Current Limitations
- Complex tables may not render perfectly
- Author parsing assumes `&` or `\AND` separators
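
The author limitation is easiest to see on sample text. The sketch below mirrors the splitting logic in simplified form; `split_authors` is a hypothetical name and the author blocks are made-up inputs, not real LaTeXML output.

```python
def split_authors(text: str) -> list:
    """Split an author block on '&' after mapping \\AND to '&'."""
    text = text.replace(r"\AND", "&")
    authors = []
    for chunk in text.split("&"):
        lines = [ln.strip() for ln in chunk.split("\n") if ln.strip()]
        if lines:
            authors.append({
                "name": lines[0],
                "affiliation": lines[1] if len(lines) > 1 else "",
                "email": lines[2] if len(lines) > 2 else "",
            })
    return authors

ampersand_block = "Ada Example\nExample University\nada@example.org & Bo Sample\nSample Labs"
comma_block = "Ada Example, Bo Sample\nExample University"

parsed = split_authors(ampersand_block)  # two separate author records
merged = split_authors(comma_block)      # one record: the comma is not a separator
```

With the comma-separated block, both names end up in a single `name` field. An affiliation that itself contains `&` would be split the other way, into two bogus records.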

# Core Code

In [None]:
from dialoghelper import add_msg
import re
import httpx
import html2text
from bs4 import BeautifulSoup
from collections import defaultdict

class Ar5ivToMarkdown:
    """Convert ar5iv HTML papers to markdown"""
    
    def __init__(self, arxiv_id):
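        if 'arxiv.org' in arxiv_id or 'ar5iv.org' in arxiv_id:
            # Pull a new-style ID (e.g. 1706.03762) out of the URL
            m = re.search(r'(\d+\.\d+)', arxiv_id)
            if not m: raise ValueError(f"No arXiv ID found in URL: {arxiv_id}")
            arxiv_id = m.group(1)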
        self.arxiv_id = arxiv_id
        self.source_base = "https://ar5iv.org"
        self.soup = None
        self._sections = None
        
    def fetch(self):
        """Fetch and parse - tries ar5iv first, falls back to arxiv.org/html"""
        for base in ["https://ar5iv.org", "https://arxiv.org"]:
            try:
                response = httpx.get(f"{base}/html/{self.arxiv_id}", follow_redirects=True)
            except httpx.HTTPError:
                continue  # network failure on this source; try the next one
            if response.status_code == 200 and '<article' in response.text:
                self.soup = BeautifulSoup(response.text, 'lxml')
                self.source_base = base
                return self
        raise ValueError(f"Could not fetch {self.arxiv_id} from ar5iv or arxiv")
    
    def _parse_authors(self, authors_div):
        text = authors_div.get_text()
        text = re.sub(r'\\AND', '&', text)
        text = re.sub(r'\d*footnotemark:?\s*\d*', '', text)
        authors = []
        for a in text.split('&'):
            lines = [l.strip() for l in a.strip().split('\n') if l.strip()]
            if lines:
                authors.append({
                    'name': lines[0], 
                    'affiliation': lines[1] if len(lines) > 1 else '', 
                    'email': lines[2] if len(lines) > 2 else ''
                })
        return authors
    
    def _format_authors(self, authors):
        by_affil = defaultdict(list)
        for a in authors:
            affil = a['affiliation'] if a['affiliation'] and '@' not in a['affiliation'] else 'Independent'
            by_affil[affil].append(a['name'])
        lines = ["**Authors:**\n"]
        for affil, names in by_affil.items():
            lines.append(f"*{affil}:* {', '.join(names)}")
        return '\n'.join(lines)
    
    def _preprocess(self, article):
        # Normalize image alt text to avoid bracket issues
        for img in article.find_all('img'):
            alt = img.get('alt', '')
            if alt:
                img['alt'] = alt.strip('[]')

        # Convert span-based tables to real tables (LaTeXML sometimes uses spans)
        for span_table in article.find_all('span', class_='ltx_tabular'):
            span_table.name = 'table'
            for row in span_table.find_all('span', class_='ltx_tr'):
                row.name = 'tr'
            for cell in span_table.find_all('span', class_='ltx_td'):
                cell.name = 'td'

        # Handle equation tables before the individual <math> tags they contain
        for table in article.find_all('table', class_='ltx_equationgroup'):
            rows = table.find_all('tr', class_='ltx_equation')
            result_lines = []
            for row in rows:
                latex_parts = []
                for math in row.find_all('math'):
                    alt = math.get('alttext', '')
                    alt = re.sub(r'^\\displaystyle\s*', '', alt)
                    alt = re.sub(r'^\\\[|\\\]$', '', alt)  # Remove \[ and \]
                    alt = re.sub(r'^\\\(|\\\)$', '', alt)  # Remove \( and \)
                    if alt:
                        latex_parts.append(alt)
                eqno_cell = row.find('td', class_='ltx_eqn_eqno')
                eqno = eqno_cell.get_text(strip=True) if eqno_cell else ''
                if latex_parts:
                    latex = ' '.join(latex_parts)
                    if eqno and (m := re.match(r'\((\d+)\)', eqno)):
                        latex += f" \\tag{{{m.group(1)}}}"
                    result_lines.append(f"$${latex}$$")
            table.replace_with(BeautifulSoup(f'<p>{chr(10).join(result_lines)}</p>', 'lxml'))
        
        authors_div = article.find('div', class_='ltx_authors')
        if authors_div:
            md = self._format_authors(self._parse_authors(authors_div))
            authors_div.replace_with(BeautifulSoup(f'<p>{md}</p>', 'lxml'))
        
        abstract = article.find('div', class_='ltx_abstract')
        if abstract and (h6 := abstract.find('h6')):
            h6.name = 'h2'
        
        for math in list(article.find_all('math')):
            alt = math.get('alttext', '')
            if not alt: continue  # skip if no LaTeX available
            # Strip any existing delimiters from alttext
            alt = re.sub(r'^\\displaystyle\s*', '', alt)
            alt = re.sub(r'^\\\[|\\\]$', '', alt)  # Remove \[ and \]
            alt = re.sub(r'^\\\(|\\\)$', '', alt)  # Remove \( and \)
            repl = f'$${alt}$$' if math.get('display') == 'block' else f'${alt}$'
            math.replace_with(repl)
    
    def convert(self):
        """Convert paper to markdown"""
        if not self.soup: self.fetch()
        article = self.soup.find('article')
        self._preprocess(article)
        
        h = html2text.HTML2Text()
        h.body_width = 0
        md = h.handle(str(article))
        
        # Fix any remaining \[...\] or \(...\) delimiters that slipped through
        md = re.sub(r'\\\[', '$$', md)
        md = re.sub(r'\\\]', '$$', md)
        md = re.sub(r'\\\(', '$', md)
        md = re.sub(r'\\\)', '$', md)
        
        md = re.sub(r'<(https?://[^>]+)>', r'[\1](\1)', md)
        md = re.sub(r'!\[([^\]]*)\]\((?!http)([^)]+)\)', rf'![\1]({self.source_base}\2)', md)
        md = md.replace('\\\\', '\\')
        md = md.replace('\\eqqcolon', '=:')
        # Fix \bigg{[} → \bigg[ (KaTeX doesn't support braced delimiters)
        # The '[' inside the character class is escaped to avoid a nested-set warning
        md = re.sub(r'\\([Bb]ig+)\{([\[\]()])\}', r'\\\1\2', md)
        
        return md
    
    @property
    def title(self):
        if not self.soup: self.fetch()
        h1 = self.soup.find('h1', class_='ltx_title')
        if h1:
            return h1.get_text(strip=True)
        bold = self.soup.find('span', class_='ltx_font_bold')
        if bold:
            return bold.get_text(strip=True)
        return "Untitled"
    
    def _section_title(self, section):
        if m := re.match(r'^##\s*(.+?)$', section, re.MULTILINE):
            raw = m.group(1).strip()
            if num_match := re.match(r'^(\d+(?:\.\d+)*|[A-Z](?:\.\d+)*)\s+(.+)$', raw):
                return num_match.group(1), num_match.group(2)
            return None, raw
        return None, "Preamble"
    
    @property 
    def sections(self):
        """Get list of (index, section_num, title, content) for each section"""
        if self._sections is None:
            md = self.convert()
            parts = re.split(r'(?=^## )', md, flags=re.MULTILINE)
            self._sections = [(i, *self._section_title(s), s.strip()) 
                             for i, s in enumerate(parts) if s.strip()]
        return self._sections
    
    def get_section(self, num_or_title):
        """Get section by number (e.g., '3', 'A.1') or title (e.g., 'Abstract')"""
        key = str(num_or_title)  # accept ints such as 3 as well as strings
        for sec in self.sections:
            if sec[1] == key:  # match by number
                return sec
        # fallback: match by title (case-insensitive substring)
        for sec in self.sections:
            if key.lower() in sec[2].lower():
                return sec
        return None
    
    def toc(self):
        """Return table of contents as markdown string"""
        lines = [f"# {self.title}", f"**arXiv:** {self.arxiv_id}\n"]
        for i, num, title, _ in self.sections:
            prefix = f"{num}." if num else "-"
            lines.append(f"{prefix} {title}")
        return '\n'.join(lines)
    
    async def add_toc_note(self):
        """Add TOC as a note message"""
        return await add_msg(content=self.toc(), placement="at_end")
    
    async def add_section_note(self, index):
        """Add a single section by list index"""
        _, _, _, content = self.sections[index]
        return await add_msg(content=content, placement="at_end")
    
    async def add_section_by_num(self, num):
        """Add section by paper number (e.g., '3', 'A.1') or title (e.g., 'Abstract')"""
        sec = self.get_section(num)
        if sec: return await add_msg(content=sec[3], placement="at_end")
        return None
    
    async def add_all_sections(self):
        "Add all sections as separate note messages"
        for _, _, _, content in self.sections: await add_msg(content=content, placement="at_end")

async def arxiv_toc(url_or_id: str) -> Ar5ivToMarkdown:
    """Quick TOC from URL or ID — adds note and returns converter for further use"""
    conv = Ar5ivToMarkdown(url_or_id).fetch()
    await conv.add_toc_note()
    return conv

## How the Ar5iv Converter Works

The `Ar5ivToMarkdown` class converts arXiv papers to markdown by leveraging LaTeXML-rendered HTML. It tries **ar5iv.org** first, then falls back to **arxiv.org/html** for papers not yet available on ar5iv (e.g., brand-new submissions).
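
The fallback order can be sketched without touching the network. In this illustration `fetch_first` and `fake_get` are hypothetical names, `get` stands in for an HTTP client call, and `2401.00001` is a placeholder ID; only the `<article` check is taken from the class.

```python
from typing import Callable, Tuple

BASES = ["https://ar5iv.org", "https://arxiv.org"]

def fetch_first(arxiv_id: str, get: Callable[[str], Tuple[int, str]]) -> Tuple[str, str]:
    """Return (base, html) from the first base that serves a page with an <article>.

    `get` is a stand-in for an HTTP client: it takes a URL and returns
    (status_code, body_text).
    """
    for base in BASES:
        status, body = get(f"{base}/html/{arxiv_id}")
        if status == 200 and "<article" in body:
            return base, body
    raise ValueError(f"Could not fetch {arxiv_id} from ar5iv or arxiv")

def fake_get(url: str) -> Tuple[int, str]:
    """Simulate a paper that only arxiv.org has rendered so far."""
    if url.startswith("https://arxiv.org/"):
        return 200, "<html><article>paper body</article></html>"
    return 404, "not found"

base, html = fetch_first("2401.00001", fake_get)  # placeholder ID
```

Returning the winning base alongside the HTML is what lets later steps build image URLs against the source that actually served the page.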

### Key Steps

1. **Fetch**: Tries `ar5iv.org/html/{arxiv_id}` first; if unavailable, falls back to `arxiv.org/html/{arxiv_id}`. Stores which source was used in `self.source_base` for correct image URL handling.

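2. **Preprocess**: Before conversion, it:
   - Normalizes image alt text and converts span-based LaTeXML tables (`ltx_tabular`) into real `<table>` elements
   - Collapses equation groups (`ltx_equationgroup`) into `$$...$$` blocks, carrying equation numbers over as `\tag{n}`
   - Extracts author info and reformats it cleanly (grouping by affiliation)
   - Promotes the abstract heading from `<h6>` to `<h2>`
   - Converts the remaining `<math>` tags back to LaTeX using their `alttext` attribute (e.g., `<math alttext="\alpha">` → `$\alpha$`)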

3. **Convert**: Uses `html2text` to transform the preprocessed HTML into markdown, then fixes relative image URLs

4. **Parse sections**: Splits the markdown at `## ` headers, extracts section numbers (like "3.1" or "A.2") and titles
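
The section-parsing step can be tried on a small markdown string. The sketch below reuses the same two regular expressions in a standalone function; `split_sections` is a hypothetical name and the sample document is invented for the illustration.

```python
import re

def split_sections(md: str) -> list:
    """Split markdown at level-2 headers into (index, num, title, content) tuples."""
    parts = re.split(r'(?=^## )', md, flags=re.MULTILINE)
    sections = []
    for i, part in enumerate(parts):
        if not part.strip():
            continue
        num, title = None, "Preamble"  # text before the first '## ' header
        if m := re.match(r'^##\s*(.+?)$', part, re.MULTILINE):
            raw = m.group(1).strip()
            # A leading "3", "3.1", "A" or "A.2" is treated as the section number
            numbered = re.match(r'^(\d+(?:\.\d+)*|[A-Z](?:\.\d+)*)\s+(.+)$', raw)
            num, title = (numbered.group(1), numbered.group(2)) if numbered else (None, raw)
        sections.append((i, num, title, part.strip()))
    return sections

sample = (
    "# Demo Paper\n\n"
    "## Abstract\n\nShort summary.\n\n"
    "## 1 Introduction\n\nText.\n\n"
    "## A.1 Proofs\n\nMore text.\n"
)
toc = [(num, title) for _, num, title, _ in split_sections(sample)]
```

One consequence of the number pattern: an unnumbered heading that starts with a single capital letter and a space, such as "A Survey", is read as section "A" with the title "Survey".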

### Helper Functions

- `toc()` — Returns a table of contents
- `get_section(num_or_title)` — Retrieves a section by its paper number, falling back to a case-insensitive title match
- `add_*` methods — Create note messages in the dialog

The `arxiv_toc()` convenience function does fetch + add TOC in one call, returning the converter for further use.
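
Section lookup works on the `(index, num, title, content)` tuples. The sketch below shows the lookup order on hand-written data; `find_section` is a hypothetical standalone version of the method and the section list is invented.

```python
from typing import Optional, Tuple

Section = Tuple[int, Optional[str], str, str]  # (index, num, title, content)

def find_section(sections, num_or_title) -> Optional[Section]:
    """Exact match on section number first, then case-insensitive title substring."""
    key = str(num_or_title)
    for sec in sections:
        if sec[1] == key:
            return sec
    for sec in sections:
        if key.lower() in sec[2].lower():
            return sec
    return None

sections = [
    (0, None, "Preamble", "# Demo Paper"),
    (1, None, "Abstract", "## Abstract\n\nShort summary."),
    (2, "3", "Model Architecture", "## 3 Model Architecture\n\nText."),
    (3, "A", "Attention Visualizations", "## A Attention Visualizations\n\nFigures."),
]

by_number = find_section(sections, 3)           # int or str both work
by_title = find_section(sections, "abstract")   # falls through to the title match
appendix = find_section(sections, "A")          # number match wins over titles containing "a"
missing = find_section(sections, "9")           # no number or title matches
```

Checking numbers before titles keeps a query like `"A"` from landing on the first title that happens to contain that letter.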

# Tool Definitions

In [None]:
import json
from pathlib import Path

_arxiv_papers = {}  # {arxiv_id: Ar5ivToMarkdown}
_SESSION_FILE = Path(".arxiv_session.json")

def _save_session():
    """Save current paper IDs to disk"""
    _SESSION_FILE.write_text(json.dumps(list(_arxiv_papers.keys())))

def _load_session():
    """Load paper IDs from disk"""
    if _SESSION_FILE.exists():
        return set(json.loads(_SESSION_FILE.read_text()))
    return set()

def _ensure_paper(paper_id):
    """Load paper from memory, or re-fetch if in session file. Returns (conv, refetched_msg)"""
    if paper_id in _arxiv_papers:
        return _arxiv_papers[paper_id], ""
    
    if paper_id in _load_session():
        conv = Ar5ivToMarkdown(paper_id).fetch()
        _arxiv_papers[paper_id] = conv
        return conv, f" (Re-fetched {paper_id} after kernel restart)"
    
    return None, ""

async def arxiv_fetch(url_or_id: str) -> str:
    """Fetch an arXiv paper and return its table of contents.
    
    Args:
        url_or_id: arXiv URL (arxiv.org or ar5iv.org) or just the ID (e.g., '1706.03762')
    
    Returns:
        Table of contents with title, arXiv ID, and section listing
    """
    conv = Ar5ivToMarkdown(url_or_id).fetch()
    _arxiv_papers[conv.arxiv_id] = conv
    _save_session()
    await add_msg(content=conv.toc(), placement="at_end")
    return conv.toc()

async def arxiv_list() -> str:
    """List all currently loaded arXiv papers with abstracts for disambiguation.
    
    Returns:
        Formatted list of paper IDs, titles, and abstract snippets
    """
    # If no papers in memory, try to reload from session file
    if not _arxiv_papers:
        session_ids = _load_session()
        for paper_id in session_ids:
            conv = Ar5ivToMarkdown(paper_id).fetch()
            _arxiv_papers[paper_id] = conv
    
    if not _arxiv_papers:
        return "No papers loaded. Use arxiv_fetch first."
    
    lines = ["**Loaded papers:**"]
    for pid, conv in _arxiv_papers.items():
        abstract = conv.get_section("Abstract")
        abstract_text = abstract[3][:300] + "..." if abstract else "(no abstract)"
        lines.append(f"- **{pid}**: {conv.title}\n  {abstract_text}")
    return '\n'.join(lines)

async def arxiv_section(num: str, paper_id: str) -> str:
    """Get a specific section from a loaded arXiv paper.
    
    Args:
        num: Section number (e.g., '3', '4.1', 'A.1') or title (e.g., 'Abstract')
        paper_id: arXiv ID of the paper (e.g., '1706.03762')
    
    Returns:
        The section content as markdown
    """
    conv, refetch_msg = _ensure_paper(paper_id)
    if not conv:
        return f"Paper '{paper_id}' not loaded. Use arxiv_fetch first."
    
    sec = conv.get_section(num)
    if sec:
        await add_msg(content=sec[3], placement="at_end")
        return f"Added section {num}: {sec[2]}{refetch_msg}"
    return f"Section '{num}' not found in {paper_id}"

async def arxiv_all_sections(paper_id: str) -> str:
    """Get all sections from a loaded arXiv paper.
    
    Args:
        paper_id: arXiv ID of the paper (e.g., '1706.03762')
    
    Returns:
        Confirmation of sections added
    """
    conv, refetch_msg = _ensure_paper(paper_id)
    if not conv:
        return f"Paper '{paper_id}' not loaded. Use arxiv_fetch first."
    
    for _, _, _, content in conv.sections: await add_msg(content=content, placement="at_end")
    return f"Added {len(conv.sections)} sections from {paper_id}{refetch_msg}"

def arxiv_remove(paper_id: str) -> str:
    """Remove a paper from the loaded papers store.
    
    Args:
        paper_id: arXiv ID of the paper to remove (e.g., '1706.03762')
    
    Returns:
        Confirmation of removal
    """
    if paper_id not in _arxiv_papers:
        return f"Paper '{paper_id}' not loaded."
    
    title = _arxiv_papers[paper_id].title
    del _arxiv_papers[paper_id]
    _save_session()
    return f"Removed {paper_id}: {title}"

## arXiv Paper Tools

Use these tools to fetch and read arXiv papers (the fetch and section tools add notes to the dialog; `arxiv_list` and `arxiv_remove` only return text):

- &`arxiv_fetch` — Fetch a paper by URL or ID, adds TOC as a note, stores paper for later access
- &`arxiv_list` — List all loaded papers with IDs, titles, and abstract snippets (auto-reloads from session file after kernel restart)
- &`arxiv_section` — Add a specific section by number (e.g., "3", "A.1") or title (e.g., "Abstract"); requires `paper_id`
- &`arxiv_all_sections` — Add all sections as separate notes; requires `paper_id`
- &`arxiv_remove` — Unload a paper from the store (also removes from session file)

**Session persistence:** Paper IDs are saved to `.arxiv_session.json` in the current working directory (i.e., per-folder, not at the CRAFT root), so each project folder keeps its own paper session. Only the IDs are stored: after a kernel restart, papers are re-fetched on demand rather than read from disk.

**Example workflow:** Fetch one or more papers, use `arxiv_list` to see what's loaded, then retrieve specific sections by paper ID.

## Version History

### v1.2 — 2026-02-21
- Fixed missing `await` on all `add_msg` calls in `Ar5ivToMarkdown` methods and tool functions; marked all affected methods/functions `async`

### v1.1 — 2026-02-21
- Fixed section ordering in `arxiv_all_sections`, `arxiv_section`, `arxiv_fetch`, and all `Ar5ivToMarkdown.add_*` methods: changed `add_msg` calls from default `placement="add_after"` (which caused reverse insertion order) to `placement="at_end"` so notes appear in paper section order.

### v1.0 — 2026-01-31
- Initial implementation of `Ar5ivToMarkdown` with ar5iv → arxiv.org fallback
- LaTeX math reconstruction from `alttext` attributes
- Author grouping by affiliation
- Section parsing and TOC generation
- `arxiv_fetch`, `arxiv_list`, `arxiv_section`, `arxiv_all_sections`, `arxiv_remove` tool functions
- Session persistence via `.arxiv_session.json`

# Example Manual Usage

In [None]:
# Quick TOC from URL
#conv = await arxiv_toc("https://arxiv.org/abs/1706.03762")

# Add a specific section (by its section number in the TOC)
#await conv.add_section_by_num(4)

# Or add all sections at once
#await conv.add_all_sections()