feat: add llms.txt and per-page markdown endpoints for AI agents by molok0aleks99 · Pull Request #890 · lidofinance/docs

molok0aleks99 · 2026-05-14T08:19:09Z

Please, go through these steps before you request a review:

📝 Describe your changes

- Generate an llmstxt.org-compliant /llms.txt index and a concatenated /llms-full.txt corpus at build time
- Expose every doc page as a raw markdown endpoint (append .md to any docs URL)
- Add two custom Docusaurus plugins (llms-txt, markdown-source) with a shared markdown utilities module
- Exclude the new .md endpoints from search engine indexing via robots.txt
- Document the new artifacts and plugin setup in the root README and src/plugins/README.md

🔎 Attach a source of truth or evidence that allows reviewers to confirm the changes independently

Copilot

Pull request overview

Adds build-time generation of AI-friendly artifacts: an llmstxt.org-compliant index (/llms.txt), a concatenated full corpus (/llms-full.txt), and per-page raw markdown endpoints (<page>.md). Two custom Docusaurus postBuild plugins (llms-txt, markdown-source) share a regex-based MDX sanitizer in _shared/markdown-utils.js. Robots and READMEs are updated accordingly.

Changes:

- New llms-txt and markdown-source plugins wired into docusaurus.config.js for three doc collections (docs, run-on-lido, earn).
- Shared markdown utilities: directory walk, gray-matter frontmatter parsing, permalink derivation, title/description extraction, and MDX-to-plain-markdown regex sanitization.
- static/robots.txt disallows *.md URLs; root README and src/plugins/README.md document the new outputs and plugin behavior; gray-matter added as a dependency.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `docusaurus.config.js` | Registers the two new `postBuild` plugins with per-collection options. |
| `src/plugins/_shared/markdown-utils.js` | New utilities: file walk, doc loading, permalink/title/description extraction, MDX sanitization. |
| `src/plugins/llms-txt/index.js` | Emits `llms.txt` index and `llms-full.txt` concatenated corpus. |
| `src/plugins/markdown-source/index.js` | Writes one `.md` file per doc page mirroring the route tree. |
| `src/plugins/README.md` | Comprehensive plugin documentation. |
| `README.md` | Adds an "AI-friendly outputs" section pointing at the plugin README. |
| `static/robots.txt` | New file disallowing crawlers from `*.md` URLs. |
| `package.json` / `package-lock.json` | Adds `gray-matter ^4.0.3`; many lockfile entries gain a `"peer": true` flag (likely an npm-version side-effect). |

Comments suppressed due to low confidence (8)

src/plugins/_shared/markdown-utils.js:161

sanitizeMdx runs all of its regex transforms over the entire body without first masking fenced code blocks or inline code. As a result, any markdown code sample that contains JSX/HTML examples (e.g. a jsx block showing <MyComponent className="foo" onClick={...}> or <div>...</div>) will have its attributes stripped, its tags removed, and its {/* ... */} comments deleted in the published .md files and in llms-full.txt. This silently corrupts code examples that are likely common in this repo (Lido docs include many React/JSX snippets). Consider extracting and re-inserting fenced/inline code segments around the regex pipeline so they are passed through verbatim.

function sanitizeMdx(content) {
  let out = content

  out = out.replace(/^\s*import\s+[^\n]+from\s+['"][^'"]+['"];?\s*$/gm, '')
  out = out.replace(/^\s*import\s+['"][^'"]+['"];?\s*$/gm, '')
  out = out.replace(/^\s*export\s+[^\n]+$/gm, '')

  out = out.replace(/<PdfViewer\s+[^/>]*pdfUrl=["']([^"']+)["'][^/>]*\/?>/g, (_, src) => {
    return `[PDF](${src})`
  })

  out = out.replace(/<Link\s+[^>]*to=["']([^"']+)["'][^>]*>([\s\S]*?)<\/Link>/g, (_, href, text) => {
    const clean = text.replace(/<[^>]+>/g, '').trim()
    return `[${clean || href}](${href})`
  })

  out = stripJsxAttributes(out)

  out = out.replace(/<([A-Z][A-Za-z0-9]*)\b[^>]*\/>/g, '')
  out = out.replace(/<([A-Z][A-Za-z0-9]*)\b[^>]*>([\s\S]*?)<\/\1>/g, '$2')

  out = out.replace(/<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi, (_, level, text) => {
    const hashes = '#'.repeat(Number(level))
    return `\n\n${hashes} ${text.replace(/\s+/g, ' ').trim()}\n\n`
  })

  out = out.replace(/<br\s*\/?>/gi, '\n')
  out = out.replace(/<(div|span|section|article|aside|header|footer)\b[^>]*>/gi, '')
  out = out.replace(/<\/(div|span|section|article|aside|header|footer)>/gi, '')

  out = out.replace(/\{\/\*[\s\S]*?\*\/\}/g, '')

  out = out
    .split('\n')
    .map((line) => (line.trim() === '' ? '' : line))
    .join('\n')
  out = out.replace(/\n{3,}/g, '\n\n').trim()

  return out
}
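
A minimal sketch of the masking approach suggested in this comment, not the PR's code. The helper name `withCodeMasked` and the NUL-delimited placeholder format are assumptions made for illustration; the idea is to swap fenced blocks and inline code spans for placeholders, run the regex pipeline, then put the original code back verbatim.

```javascript
// Hypothetical helper (not part of the PR): hide code from a transform,
// run the transform on the remaining prose, then restore the code verbatim.
function withCodeMasked(content, transform) {
  const saved = []
  // First alternative: a fenced block opened by ``` or ~~~ and closed by the
  // same fence on its own line. Second alternative: a single-line inline span.
  const codePattern = /^(`{3,}|~{3,})[^\n]*\n[\s\S]*?^\1[ \t]*$|`[^`\n]+`/gm

  const masked = content.replace(codePattern, (match) => {
    saved.push(match)
    // NUL characters are assumed never to occur in the docs source.
    return `\u0000CODE${saved.length - 1}\u0000`
  })

  // A replacer function avoids "$" in the saved code being read as a pattern.
  return transform(masked).replace(/\u0000CODE(\d+)\u0000/g, (_, index) => saved[Number(index)])
}
```

Under these assumptions the existing body of sanitizeMdx would become the `transform` argument, so none of its regexes ever see code samples. Unclosed fences and indented code blocks are not handled by this sketch.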

src/plugins/markdown-source/index.js:67

The "is the title already present?" check uses ``head.includes(`# ${doc.title}`)``, which is a plain substring match. Any heading whose level is greater than 1 (e.g. `## Lido`, `### Lido`) contains the substring `# Lido` and will be treated as if an H1 already exists, so the canonical `# Title` is never inserted. Anchor the check to a heading line, for example via a regex like `^#\s+${escapedTitle}\s*$` evaluated per line. The same bug exists in `src/plugins/llms-txt/index.js` (`hasTitleAlready` on line 88).

function hasHeading(body, title) {
  const head = body.split('\n').slice(0, 20).join('\n')
  if (/^#\s+/m.test(head)) return true
  if (title && head.includes(`# ${title}`)) return true
  return false
}
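
One way the anchored check could look, as a sketch rather than the fix that was actually committed. `escapeRegExp` and `hasTitleHeading` are hypothetical names; the 20-line window mirrors the quoted `hasHeading`.

```javascript
// Hypothetical helper: escape regex metacharacters in a literal string.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Returns true only when one of the first 20 lines is exactly "# <title>".
// Deeper headings such as "## <title>" no longer count as a match.
function hasTitleHeading(body, title) {
  if (!title) return false
  const pattern = new RegExp(`^#\\s+${escapeRegExp(title)}\\s*$`)
  return body
    .split('\n')
    .slice(0, 20)
    .some((line) => pattern.test(line))
}
```

Testing line by line keeps the anchors simple and avoids the multiline flag; the trailing `\s*` also tolerates a stray carriage return.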

src/plugins/_shared/markdown-utils.js:173

stripJsxAttributes matches every opening tag of the form <name ...>, including lowercase HTML tags like <a>, <img>, <input> that are perfectly valid in markdown. For such tags it strips className=, style={...}, onClick={...}, and any other attribute={...} expression. On plain HTML this is mostly benign: the className/style/event-handler stripping does little harm, and the generic attribute={...} rule is unlikely to fire on non-JSX HTML. The broader concern is that this regex is also applied unconditionally to text inside fenced code blocks (see related comment on sanitizeMdx), which means code examples that intentionally show JSX attributes will be mutilated in the output.

function stripJsxAttributes(content) {
  return content.replace(/<([a-zA-Z][a-zA-Z0-9]*)([^>]*)>/g, (match, tag, attrs) => {
    if (!attrs) return `<${tag}>`
    let cleaned = attrs
    cleaned = cleaned.replace(/\s+className=(?:"[^"]*"|'[^']*'|\{[^}]*\})/g, '')
    cleaned = cleaned.replace(/\s+style=(?:"[^"]*"|'[^']*'|\{\{[^}]*\}\}|\{[^}]*\})/g, '')
    cleaned = cleaned.replace(/\s+on[A-Z][a-zA-Z]*=\{[^}]*\}/g, '')
    cleaned = cleaned.replace(/\s+[a-zA-Z][a-zA-Z0-9-]*=\{[^}]*\}/g, '')
    return `<${tag}${cleaned}>`
  })
}

src/plugins/_shared/markdown-utils.js:116

extractDescription falls back to the first "meaningful" paragraph but uses very coarse filters (startsWith('<'), startsWith('>'), startsWith(':::'), etc.). A paragraph that begins with an admonition (e.g. :::info ... :::), a JSX block, or any HTML wrapper will be skipped, but a paragraph whose first character happens to be one of these for unrelated reasons (e.g. a sentence accidentally starting with < inside text after MDX normalization, or a Docusaurus admonition tail line that didn't end with :::) can pass through. Consider parsing the markdown more structurally (e.g. via remark) or at minimum documenting that this heuristic is best-effort.

function extractDescription(fm, body) {
  if (fm && typeof fm.description === 'string' && fm.description.trim()) {
    return fm.description.trim()
  }
  const cleaned = stripCodeFences(body)
  const paragraph = cleaned
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .find((p) => {
      if (!p) return false
      if (p.startsWith('#')) return false
      if (p.startsWith(':::')) return false
      if (p.startsWith('import')) return false
      if (p.startsWith('export')) return false
      if (p.startsWith('>')) return false
      if (p.startsWith('<')) return false
      if (/^[-*+]\s/.test(p)) return false
      if (/^\d+\.\s/.test(p)) return false
      if (/^---+$/.test(p)) return false
      return true
    })
  if (!paragraph) return ''
  const text = sanitizeMdx(paragraph).replace(/\s+/g, ' ').trim()
  return text.length > 200 ? `${text.slice(0, 197)}...` : text
}

src/plugins/_shared/markdown-utils.js:79

The plugin's permalink derivation duplicates Docusaurus's own routing logic (frontmatter slug, index.md collapsing, routeBasePath) but ignores other things Docusaurus actually respects, including id frontmatter, sidebar slug overrides, numberPrefixParser-style filename prefixes, and _category_.json paths. For a page whose real URL differs from what buildPermalink produces, the generated .md will be written to a location that does not match the canonical HTML URL, and the  /  comments will point to a non-existent page. Consider sourcing permalinks from Docusaurus's already-resolved routes (e.g. via loadedRoutes/content from the docs plugin) instead of recomputing them.

function buildPermalink(absPath, collectionPath, routeBasePath, frontmatter) {
  const base = normalizeBase(routeBasePath)

  if (frontmatter && frontmatter.slug) {
    const slug = frontmatter.slug.startsWith('/') ? frontmatter.slug : `/${frontmatter.slug}`
    return joinUrl(base, slug)
  }

  const rel = path.relative(collectionPath, absPath)
  const noExt = rel.replace(/\.(md|mdx)$/i, '')
  const parts = noExt.split(path.sep).filter(Boolean)

  if (parts.length && parts[parts.length - 1].toLowerCase() === 'index') {
    parts.pop()
  }

  const route = parts.length ? `/${parts.join('/')}` : '/'
  return joinUrl(base, route)
}

function normalizeBase(routeBasePath) {
  if (!routeBasePath || routeBasePath === '/') return ''
  return routeBasePath.startsWith('/') ? routeBasePath : `/${routeBasePath}`
}

function joinUrl(base, route) {
  if (!base) return route
  if (route === '/') return base
  return `${base}${route}`
}
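
Short of replacing buildPermalink, a cheaper guard is to cross-check its output against the routes Docusaurus actually emitted. The sketch below assumes the `routesPaths` array that Docusaurus passes to `postBuild`; `normalizeRoute` and `findUnroutedPermalinks` are hypothetical names, and any `baseUrl` prefix would need to be added to the derived permalinks before comparing.

```javascript
// Treat '/docs/intro/' and '/docs/intro' as the same route.
function normalizeRoute(route) {
  const trimmed = route.replace(/\/+$/, '')
  return trimmed === '' ? '/' : trimmed
}

// Returns the derived permalinks that have no matching built route, i.e.
// pages whose .md file would land at a path with no HTML counterpart.
function findUnroutedPermalinks(permalinks, routesPaths) {
  const known = new Set(routesPaths.map(normalizeRoute))
  return permalinks.filter((permalink) => !known.has(normalizeRoute(permalink)))
}
```

A plugin could log or fail on a non-empty result, which would surface the numbered-prefix and `id` mismatches described above at build time instead of in production.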

src/plugins/_shared/markdown-utils.js:178

escapeMarkdown is named like it escapes markdown special characters but actually only collapses newlines and trims. The description it produces is interpolated directly into a list item as - [title](url): ${desc}, so a description containing characters such as backticks, brackets, or pipes is emitted unescaped. Either rename to something like inlineDescription / collapseWhitespace, or actually escape characters that could break the list-item / link rendering.

function escapeMarkdown(text) {
  if (!text) return ''
  return text.replace(/\r/g, '').replace(/\n+/g, ' ').trim()
}
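
If the second option is chosen, an escaping variant might look like the sketch below. The name `escapeInlineMarkdown` and the exact set of escaped characters are assumptions; the set covers the backticks, brackets, and pipes mentioned above plus a few other inline markers.

```javascript
// Hypothetical replacement: collapse the text onto one line, then
// backslash-escape characters that could alter list-item or link rendering.
function escapeInlineMarkdown(text) {
  if (!text) return ''
  return text
    .replace(/\r/g, '')
    .replace(/\n+/g, ' ')
    .replace(/[\\`*_[\]<>|]/g, '\\$&')
    .trim()
}
```

Escaping the backslash itself keeps already-escaped input from being reinterpreted.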

src/plugins/_shared/markdown-utils.js:89

The emoji-stripping range [\u{1F300}-\u{1FAFF}\u{2600}-\u{27BF}] misses several code points that appear in emoji used in titles: regional indicators (1F1E6–1F1FF, the building blocks of flag sequences), ZWJ joiners (200D), and variation selectors (FE0F). Supplemental symbols (1F900–1F9FF) and skin-tone modifiers (1F3FB–1F3FF) are covered by 1F300–1FAFF. Titles that contain 🇺🇸 will keep the flag, and compound emoji will end up with leftover joiner characters. Consider using a more complete regex (e.g. \p{Extended_Pictographic} plus \uFE0F/\u200D) or a small library.

    .replace(/\{#[^}]+\}/g, '')
    .replace(/[\u{1F300}-\u{1FAFF}\u{2600}-\u{27BF}]/gu, '')
    .replace(/\s+/g, ' ')
    .trim()
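
A sketch of the broader pattern suggested here, assuming a runtime whose regex engine supports the `Extended_Pictographic` Unicode property (current Node.js releases do). `stripEmoji` is a hypothetical standalone helper, not the function in the PR. Regional indicators and skin-tone modifiers are listed explicitly on the assumption that the property does not cover them.

```javascript
// Pictographic symbols, regional indicators (flags), skin-tone modifiers,
// the zero-width joiner, and variation selector-16.
const EMOJI_PATTERN =
  /[\p{Extended_Pictographic}\u{1F1E6}-\u{1F1FF}\u{1F3FB}-\u{1F3FF}\u200D\uFE0F]/gu

// Hypothetical helper: drop emoji, then tidy the whitespace left behind.
function stripEmoji(title) {
  return title.replace(EMOJI_PATTERN, '').replace(/\s+/g, ' ').trim()
}
```

`Extended_Pictographic` also matches a few text symbols such as © and ™, so titles containing those would lose them; keycap sequences are not handled.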

src/plugins/markdown-source/index.js:29

When two collections produce the same resolved file path (collision), the warning message says "overwriting" but the code does not actually overwrite atomically — it just writes the second doc and discards the first. More importantly, the collision is detected only after both docs have been loaded sequentially, so the first doc's fs.writeFileSync has already happened. The map check is correct for the warning, but the order means the later file always wins regardless of which collection should be authoritative. Consider failing the build (or at least clearly logging both source paths and the winning one) on collision instead of silently picking the last writer.

          const targetPath = resolveTargetPath(outDir, doc.permalink)
          if (seen.has(targetPath)) {
            console.warn(
              `[markdown-source] permalink collision at ${doc.permalink}: ${seen.get(targetPath)} vs ${doc.relPath} (overwriting)`,
            )
          }
          seen.set(targetPath, doc.relPath)
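
A sketch of the fail-fast alternative, assuming `seen` remains a Map from target path to source path as in the snippet above. `claimTarget` is a hypothetical name.

```javascript
// Hypothetical helper: record which source doc owns an output path and
// abort when a different doc resolves to the same path.
function claimTarget(seen, targetPath, relPath) {
  const previous = seen.get(targetPath)
  if (previous !== undefined && previous !== relPath) {
    throw new Error(
      `[markdown-source] permalink collision at ${targetPath}: ${previous} vs ${relPath}`,
    )
  }
  seen.set(targetPath, relPath)
}
```

Calling this before the file is written means the first doc's output is never silently replaced; an error thrown inside `postBuild` should fail the build.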

feat: add llms.txt and per-page markdown endpoints for AI agents

f061d2b

molok0aleks99 requested a review from a team as a code owner May 14, 2026 08:19

karinamaulitova requested a review from Copilot May 14, 2026 09:39

Copilot started reviewing on behalf of karinamaulitova May 14, 2026 09:39

Copilot AI reviewed May 14, 2026

Comment thread src/plugins/markdown-source/index.js Outdated

Comment thread docusaurus.config.js

Comment thread static/robots.txt

Comment thread src/plugins/_shared/markdown-utils.js

fix: fixed comments

a9c6139

karinamaulitova approved these changes May 14, 2026

tamtamchik mentioned this pull request May 15, 2026

fix AI markdown artifacts #894

Closed

TheDZhon approved these changes May 18, 2026

karinamaulitova merged commit 92ab99a into main May 18, 2026
1 check passed

karinamaulitova deleted the feature/we-2291-add-ai-into-docs branch May 18, 2026 08:15

karinamaulitova mentioned this pull request May 18, 2026

add docusaurus-plugin-copy-page-button #896

Closed

karinamaulitova merged 2 commits into main from feature/we-2291-add-ai-into-docs

4 participants
