
feat: add llms.txt and per-page markdown endpoints for AI agents #890

Merged
karinamaulitova merged 2 commits into main from feature/we-2291-add-ai-into-docs on May 18, 2026

Conversation

@molok0aleks99
Contributor

Please, go through these steps before you request a review:

📝 Describe your changes

  1. Generate an llmstxt.org-compliant /llms.txt index and a concatenated /llms-full.txt corpus at build time
  2. Expose every doc page as a raw markdown endpoint (append .md to any docs URL)
  3. Add two custom Docusaurus plugins (llms-txt, markdown-source) with a shared markdown utilities module
  4. Exclude the new .md endpoints from search engine indexing via robots.txt
  5. Document the new artifacts and plugin setup in the root README and src/plugins/README.md

🔎 Attach a source of truth or evidence that allows reviewers to confirm the changes independently

Contributor

Copilot AI left a comment


Pull request overview

Adds build-time generation of AI-friendly artifacts: an llmstxt.org-compliant index (/llms.txt), a concatenated full corpus (/llms-full.txt), and per-page raw markdown endpoints (<page>.md). Two custom Docusaurus postBuild plugins (llms-txt, markdown-source) share a regex-based MDX sanitizer in _shared/markdown-utils.js. Robots and READMEs are updated accordingly.

Changes:

  • New llms-txt and markdown-source plugins wired into docusaurus.config.js for three doc collections (docs, run-on-lido, earn).
  • Shared markdown utilities: directory walk, gray-matter frontmatter parsing, permalink derivation, title/description extraction, and MDX-to-plain-markdown regex sanitization.
  • static/robots.txt disallows *.md URLs; root README and src/plugins/README.md document the new outputs and plugin behavior; gray-matter added as a dependency.
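
For readers unfamiliar with the llms.txt layout, the index described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: `renderLlmsTxt` and the shape of its input are hypothetical, and linking each entry to the page's `.md` endpoint is an assumption. The only grounded parts are the llmstxt.org structure (H1 title, blockquote summary, H2 sections) and the `- [title](url): description` list-item form.

```javascript
// Hypothetical renderer for an llms.txt-style index (not the plugin's code).
// Input shape is an assumption: { title, summary, sections: [{ name, pages }] }.
function renderLlmsTxt({ title, summary, sections }) {
  const lines = [`# ${title}`, '']

  // Optional one-line summary, emitted as a blockquote.
  if (summary) lines.push(`> ${summary}`, '')

  for (const section of sections) {
    lines.push(`## ${section.name}`, '')
    for (const page of section.pages) {
      // The description is optional; omit the colon when it is missing.
      const suffix = page.description ? `: ${page.description}` : ''
      lines.push(`- [${page.title}](${page.url})${suffix}`)
    }
    lines.push('')
  }

  // Normalize to exactly one trailing newline.
  return lines.join('\n').trimEnd() + '\n'
}
```

Keeping the renderer a pure function of already-loaded doc metadata would make it easy to test without running a full Docusaurus build.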

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file

| File | Description |
| --- | --- |
| docusaurus.config.js | Registers the two new postBuild plugins with per-collection options. |
| src/plugins/_shared/markdown-utils.js | New utilities: file walk, doc loading, permalink/title/description extraction, MDX sanitization. |
| src/plugins/llms-txt/index.js | Emits llms.txt index and llms-full.txt concatenated corpus. |
| src/plugins/markdown-source/index.js | Writes one .md file per doc page mirroring the route tree. |
| src/plugins/README.md | Comprehensive plugin documentation. |
| README.md | Adds an "AI-friendly outputs" section pointing at the plugin README. |
| static/robots.txt | New file disallowing crawlers from *.md URLs. |
| package.json / package-lock.json | Adds gray-matter ^4.0.3; many lockfile entries gain a "peer": true flag (likely an npm-version side-effect). |

Comments suppressed due to low confidence (8)

src/plugins/_shared/markdown-utils.js:161

  • sanitizeMdx runs all of its regex transforms over the entire body without first masking fenced code blocks or inline code. As a result, any markdown code sample that contains JSX/HTML examples (e.g. a jsx block showing <MyComponent className="foo" onClick={...}> or <div>...</div>) will have its attributes stripped, its tags removed, and its {/* ... */} comments deleted in the published .md files and in llms-full.txt. This silently corrupts code examples that are likely common in this repo (Lido docs include many React/JSX snippets). Consider extracting and re-inserting fenced/inline code segments around the regex pipeline so they are passed through verbatim.
function sanitizeMdx(content) {
  let out = content

  out = out.replace(/^\s*import\s+[^\n]+from\s+['"][^'"]+['"];?\s*$/gm, '')
  out = out.replace(/^\s*import\s+['"][^'"]+['"];?\s*$/gm, '')
  out = out.replace(/^\s*export\s+[^\n]+$/gm, '')

  out = out.replace(/<PdfViewer\s+[^/>]*pdfUrl=["']([^"']+)["'][^/>]*\/?>/g, (_, src) => {
    return `[PDF](${src})`
  })

  out = out.replace(/<Link\s+[^>]*to=["']([^"']+)["'][^>]*>([\s\S]*?)<\/Link>/g, (_, href, text) => {
    const clean = text.replace(/<[^>]+>/g, '').trim()
    return `[${clean || href}](${href})`
  })

  out = stripJsxAttributes(out)

  out = out.replace(/<([A-Z][A-Za-z0-9]*)\b[^>]*\/>/g, '')
  out = out.replace(/<([A-Z][A-Za-z0-9]*)\b[^>]*>([\s\S]*?)<\/\1>/g, '$2')

  out = out.replace(/<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi, (_, level, text) => {
    const hashes = '#'.repeat(Number(level))
    return `\n\n${hashes} ${text.replace(/\s+/g, ' ').trim()}\n\n`
  })

  out = out.replace(/<br\s*\/?>/gi, '\n')
  out = out.replace(/<(div|span|section|article|aside|header|footer)\b[^>]*>/gi, '')
  out = out.replace(/<\/(div|span|section|article|aside|header|footer)>/gi, '')

  out = out.replace(/\{\/\*[\s\S]*?\*\/\}/g, '')

  out = out
    .split('\n')
    .map((line) => (line.trim() === '' ? '' : line))
    .join('\n')
  out = out.replace(/\n{3,}/g, '\n\n').trim()

  return out
}
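
One possible shape for the suggested extract-and-reinsert step is sketched below. It is only an illustration under stated assumptions: `withCodeMasked` is a hypothetical helper, the `transform` argument stands in for the existing regex pipeline, and the NUL-delimited placeholder assumes that character never appears in the docs. It does not handle indented code blocks or inline spans delimited by multiple backticks.

```javascript
// Hypothetical helper (not part of the PR): hides fenced and inline code from
// `transform`, runs the transform, then puts the code back verbatim.
function withCodeMasked(content, transform) {
  const stash = []

  // Swap a code segment for a placeholder and remember the original text.
  const mask = (segment) => {
    stash.push(segment)
    return `\u0000CODE${stash.length - 1}\u0000`
  }

  const masked = content
    // Fenced blocks first: a backtick or tilde fence on its own line, through
    // the matching closing fence.
    .replace(/^(`{3,}|~{3,})[^\n]*\n(?:[\s\S]*?\n)?\1[ \t]*$/gm, mask)
    // Then single-backtick inline code spans.
    .replace(/`[^`\n]+`/g, mask)

  // Restore every stashed segment after the transform has run.
  return transform(masked).replace(/\u0000CODE(\d+)\u0000/g, (_, i) => stash[Number(i)])
}
```

With a wrapper like this, the call site could become something like `withCodeMasked(body, sanitizeMdx)`, leaving the individual regex rules unchanged.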

src/plugins/markdown-source/index.js:67

  • The "is the title already present?" check uses head.includes(`# ${doc.title}`), which is a plain substring match. Any heading whose level is greater than 1 (e.g. ## Lido, ### Lido) contains the substring # Lido and will be treated as if an H1 already exists, so the canonical # Title is never inserted. Anchor the check to a heading line, for example via a regex like ^#\s+${escapedTitle}\s*$ evaluated per line. The same bug exists in src/plugins/llms-txt/index.js (hasTitleAlready on line 88).
function hasHeading(body, title) {
  const head = body.split('\n').slice(0, 20).join('\n')
  if (/^#\s+/m.test(head)) return true
  if (title && head.includes(`# ${title}`)) return true
  return false
}
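
A line-anchored variant of the title check might look like the sketch below. Both `escapeRegExp` and `hasH1WithTitle` are hypothetical names introduced here for illustration; the 20-line window mirrors the existing `slice(0, 20)`.

```javascript
// Escape regex metacharacters so the title is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Hypothetical replacement for the substring check: true only when one of the
// first `maxLines` lines is exactly an H1 carrying this title.
function hasH1WithTitle(body, title, maxLines = 20) {
  if (!title) return false
  const pattern = new RegExp(`^#\\s+${escapeRegExp(title)}\\s*$`)
  return body
    .split('\n')
    .slice(0, maxLines)
    .some((line) => pattern.test(line))
}
```

Because the pattern requires whitespace right after the first `#`, headings such as `## Lido` no longer count as a match for the title `Lido`.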

src/plugins/_shared/markdown-utils.js:173

  • stripJsxAttributes matches every opening tag of the form <name ...>, including lowercase HTML tags like <a>, <img>, <input> that are perfectly valid in markdown. For such tags it strips className=, style={...}, onClick={...}, and generically any attribute={...} expression. Stripping className, style, and event handlers is benign for plain HTML, and the generic attribute={...} rule is unlikely to fire on non-JSX HTML. The broader concern is that this regex is also applied unconditionally to text inside fenced code blocks (see related comment on sanitizeMdx), which means code examples that intentionally show JSX attributes will be mutilated in the output.
function stripJsxAttributes(content) {
  return content.replace(/<([a-zA-Z][a-zA-Z0-9]*)([^>]*)>/g, (match, tag, attrs) => {
    if (!attrs) return `<${tag}>`
    let cleaned = attrs
    cleaned = cleaned.replace(/\s+className=(?:"[^"]*"|'[^']*'|\{[^}]*\})/g, '')
    cleaned = cleaned.replace(/\s+style=(?:"[^"]*"|'[^']*'|\{\{[^}]*\}\}|\{[^}]*\})/g, '')
    cleaned = cleaned.replace(/\s+on[A-Z][a-zA-Z]*=\{[^}]*\}/g, '')
    cleaned = cleaned.replace(/\s+[a-zA-Z][a-zA-Z0-9-]*=\{[^}]*\}/g, '')
    return `<${tag}${cleaned}>`
  })
}

src/plugins/_shared/markdown-utils.js:116

  • extractDescription falls back to the first "meaningful" paragraph but uses very coarse filters (startsWith('<'), startsWith('>'), startsWith(':::'), etc.). A paragraph that begins with an admonition (e.g. :::info ... :::), a JSX block, or any HTML wrapper will be skipped, but a paragraph whose first character happens to be one of these for unrelated reasons (e.g. a sentence accidentally starting with < inside text after MDX normalization, or a Docusaurus admonition tail line that didn't end with :::) can pass through. Consider parsing the markdown more structurally (e.g. via remark) or at minimum documenting that this heuristic is best-effort.
function extractDescription(fm, body) {
  if (fm && typeof fm.description === 'string' && fm.description.trim()) {
    return fm.description.trim()
  }
  const cleaned = stripCodeFences(body)
  const paragraph = cleaned
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .find((p) => {
      if (!p) return false
      if (p.startsWith('#')) return false
      if (p.startsWith(':::')) return false
      if (p.startsWith('import')) return false
      if (p.startsWith('export')) return false
      if (p.startsWith('>')) return false
      if (p.startsWith('<')) return false
      if (/^[-*+]\s/.test(p)) return false
      if (/^\d+\.\s/.test(p)) return false
      if (/^---+$/.test(p)) return false
      return true
    })
  if (!paragraph) return ''
  const text = sanitizeMdx(paragraph).replace(/\s+/g, ' ').trim()
  return text.length > 200 ? `${text.slice(0, 197)}...` : text
}

src/plugins/_shared/markdown-utils.js:79

  • The plugin's permalink derivation duplicates Docusaurus's own routing logic (frontmatter slug, index.md collapsing, routeBasePath) but ignores other things Docusaurus actually respects, including id frontmatter, sidebar slug overrides, numberPrefixParser-style filename prefixes, and _category_.json paths. For a page whose real URL differs from what buildPermalink produces, the generated .md will be written to a location that does not match the canonical HTML URL, and the <!-- canonical: --> / <!-- source: --> comments will point to a non-existent page. Consider sourcing permalinks from Docusaurus's already-resolved routes (e.g. via loadedRoutes/content from the docs plugin) instead of recomputing them.
function buildPermalink(absPath, collectionPath, routeBasePath, frontmatter) {
  const base = normalizeBase(routeBasePath)

  if (frontmatter && frontmatter.slug) {
    const slug = frontmatter.slug.startsWith('/') ? frontmatter.slug : `/${frontmatter.slug}`
    return joinUrl(base, slug)
  }

  const rel = path.relative(collectionPath, absPath)
  const noExt = rel.replace(/\.(md|mdx)$/i, '')
  const parts = noExt.split(path.sep).filter(Boolean)

  if (parts.length && parts[parts.length - 1].toLowerCase() === 'index') {
    parts.pop()
  }

  const route = parts.length ? `/${parts.join('/')}` : '/'
  return joinUrl(base, route)
}

function normalizeBase(routeBasePath) {
  if (!routeBasePath || routeBasePath === '/') return ''
  return routeBasePath.startsWith('/') ? routeBasePath : `/${routeBasePath}`
}

function joinUrl(base, route) {
  if (!base) return route
  if (route === '/') return base
  return `${base}${route}`
}

src/plugins/_shared/markdown-utils.js:178

  • escapeMarkdown is named like it escapes markdown special characters but actually only collapses newlines and trims. The description it produces is interpolated directly into a list item as - [title](url): ${desc}, so a description containing characters such as backticks, brackets, or pipes is emitted unescaped. Either rename to something like inlineDescription / collapseWhitespace, or actually escape characters that could break the list-item / link rendering.
function escapeMarkdown(text) {
  if (!text) return ''
  return text.replace(/\r/g, '').replace(/\n+/g, ' ').trim()
}
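
If the second option (actually escaping) is preferred, a minimal sketch could look like this. `escapeInlineMarkdown` is a hypothetical name, and the set of escaped characters is an assumption about what is likely to break a `- [title](url): description` list item rather than an exhaustive list.

```javascript
// Hypothetical helper: collapses the text onto one line, then backslash-escapes
// characters that could break list-item or link rendering.
function escapeInlineMarkdown(text) {
  if (!text) return ''
  return text
    .replace(/\r/g, '')
    .replace(/\n+/g, ' ')
    .trim()
    // Escapes: backslash, backtick, asterisk, underscore, brackets, angle brackets, pipe.
    .replace(/[\\`*_[\]<>|]/g, '\\$&')
}
```

Escaping changes how descriptions appear to consumers that read llms.txt as plain text, so renaming the existing function may still be the lower-risk fix.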

src/plugins/_shared/markdown-utils.js:89

  • The emoji-stripping range [\u{1F300}-\u{1FAFF}\u{2600}-\u{27BF}] misses several code points that commonly appear in emoji titles: regional indicators (1F1E6–1F1FF, used for flags), ZWJ joiners (200D), and variation selectors (FE0F). Supplemental symbols 1F900–1F9FF and skin-tone modifiers 1F3FB–1F3FF fall inside 1F300–1FAFF and are covered. Titles that contain 🇺🇸 will keep the flag, and compound emoji will leave joiner characters behind. Consider using a more complete regex (e.g. \p{Extended_Pictographic} plus \uFE0F/\u200D) or a small library.
    .replace(/\{#[^}]+\}/g, '')
    .replace(/[\u{1F300}-\u{1FAFF}\u{2600}-\u{27BF}]/gu, '')
    .replace(/\s+/g, ' ')
    .trim()
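
A broader emoji filter along the lines suggested above could be sketched as follows. `stripEmoji` is a hypothetical helper, not the code in the PR. Note that `\p{Extended_Pictographic}` also covers some non-emoji symbols (for example © and ™), so this trades completeness for the occasional over-removal.

```javascript
// Hypothetical helper: removes pictographic emoji plus the code points that
// glue emoji sequences together, then collapses the leftover whitespace.
function stripEmoji(text) {
  return text
    .replace(
      // Pictographs, regional indicators (flags), skin-tone modifiers,
      // zero-width joiner, and variation selector-16.
      /[\p{Extended_Pictographic}\u{1F1E6}-\u{1F1FF}\u{1F3FB}-\u{1F3FF}\u200D\uFE0F]/gu,
      '',
    )
    .replace(/\s+/g, ' ')
    .trim()
}
```

Unicode property escapes require the `u` flag and a reasonably recent Node.js runtime, which is worth checking against the version used for the docs build.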

src/plugins/markdown-source/index.js:29

  • When two collections produce the same resolved file path (collision), the warning message says "overwriting" but the code does not actually overwrite atomically — it just writes the second doc and discards the first. More importantly, the collision is detected only after both docs have been loaded sequentially, so the first doc's fs.writeFileSync has already happened. The map check is correct for the warning, but the order means the later file always wins regardless of which collection should be authoritative. Consider failing the build (or at least clearly logging both source paths and the winning one) on collision instead of silently picking the last writer.
          const targetPath = resolveTargetPath(outDir, doc.permalink)
          if (seen.has(targetPath)) {
            console.warn(
              `[markdown-source] permalink collision at ${doc.permalink}: ${seen.get(targetPath)} vs ${doc.relPath} (overwriting)`,
            )
          }
          seen.set(targetPath, doc.relPath)
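
Failing the build on collision could be done with a small guard like the one sketched here. `createCollisionGuard` and `claim` are hypothetical names; the idea is to call `claim` before any write so that the first file is never silently replaced.

```javascript
// Hypothetical helper: registers each output path once and throws on a
// duplicate, naming both sources, so the build fails instead of overwriting.
function createCollisionGuard(pluginName) {
  const seen = new Map()
  return function claim(targetPath, sourcePath) {
    if (seen.has(targetPath)) {
      throw new Error(
        `[${pluginName}] output collision at ${targetPath}: ` +
          `${seen.get(targetPath)} and ${sourcePath} resolve to the same file`,
      )
    }
    seen.set(targetPath, sourcePath)
  }
}
```

In the plugin this might be used as `const claim = createCollisionGuard('markdown-source')` followed by `claim(targetPath, doc.relPath)` ahead of the write, assuming a thrown error in postBuild is an acceptable way to stop the build.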


Comment thread src/plugins/markdown-source/index.js Outdated
Comment thread docusaurus.config.js
Comment thread static/robots.txt
Comment thread src/plugins/_shared/markdown-utils.js
@karinamaulitova karinamaulitova merged commit 92ab99a into main May 18, 2026
1 check passed
@karinamaulitova karinamaulitova deleted the feature/we-2291-add-ai-into-docs branch May 18, 2026 08:15
4 participants