feat: add llms.txt and per-page markdown endpoints for AI agents #890
Pull request overview
Adds build-time generation of AI-friendly artifacts: an llmstxt.org-compliant index (`/llms.txt`), a concatenated full corpus (`/llms-full.txt`), and per-page raw markdown endpoints (`<page>.md`). Two custom Docusaurus postBuild plugins (`llms-txt`, `markdown-source`) share a regex-based MDX sanitizer in `_shared/markdown-utils.js`. `robots.txt` and the READMEs are updated accordingly.
Changes:
- New `llms-txt` and `markdown-source` plugins wired into `docusaurus.config.js` for three doc collections (`docs`, `run-on-lido`, `earn`).
- Shared markdown utilities: directory walk, gray-matter frontmatter parsing, permalink derivation, title/description extraction, and MDX-to-plain-markdown regex sanitization.
- `static/robots.txt` disallows `*.md` URLs; root README and `src/plugins/README.md` document the new outputs and plugin behavior; `gray-matter` added as a dependency.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `docusaurus.config.js` | Registers the two new postBuild plugins with per-collection options. |
| `src/plugins/_shared/markdown-utils.js` | New utilities: file walk, doc loading, permalink/title/description extraction, MDX sanitization. |
| `src/plugins/llms-txt/index.js` | Emits `llms.txt` index and `llms-full.txt` concatenated corpus. |
| `src/plugins/markdown-source/index.js` | Writes one `.md` file per doc page mirroring the route tree. |
| `src/plugins/README.md` | Comprehensive plugin documentation. |
| `README.md` | Adds an "AI-friendly outputs" section pointing at the plugin README. |
| `static/robots.txt` | New file disallowing crawlers from `*.md` URLs. |
| `package.json` / `package-lock.json` | Adds `gray-matter` `^4.0.3`; many lockfile entries gain a `"peer": true` flag (likely an npm-version side-effect). |
Comments suppressed due to low confidence (8)
src/plugins/_shared/markdown-utils.js:161
`sanitizeMdx` runs all of its regex transforms over the entire body without first masking fenced code blocks or inline code. As a result, any markdown code sample that contains JSX/HTML examples (e.g. a `jsx` block showing `<MyComponent className="foo" onClick={...}>` or `<div>...</div>`) will have its attributes stripped, its tags removed, and its `{/* ... */}` comments deleted in the published `.md` files and in `llms-full.txt`. This silently corrupts code examples that are likely common in this repo (Lido docs include many React/JSX snippets). Consider extracting and re-inserting fenced/inline code segments around the regex pipeline so they are passed through verbatim.
function sanitizeMdx(content) {
let out = content
out = out.replace(/^\s*import\s+[^\n]+from\s+['"][^'"]+['"];?\s*$/gm, '')
out = out.replace(/^\s*import\s+['"][^'"]+['"];?\s*$/gm, '')
out = out.replace(/^\s*export\s+[^\n]+$/gm, '')
out = out.replace(/<PdfViewer\s+[^/>]*pdfUrl=["']([^"']+)["'][^/>]*\/?>/g, (_, src) => {
return `[PDF](${src})`
})
out = out.replace(/<Link\s+[^>]*to=["']([^"']+)["'][^>]*>([\s\S]*?)<\/Link>/g, (_, href, text) => {
const clean = text.replace(/<[^>]+>/g, '').trim()
return `[${clean || href}](${href})`
})
out = stripJsxAttributes(out)
out = out.replace(/<([A-Z][A-Za-z0-9]*)\b[^>]*\/>/g, '')
out = out.replace(/<([A-Z][A-Za-z0-9]*)\b[^>]*>([\s\S]*?)<\/\1>/g, '$2')
out = out.replace(/<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi, (_, level, text) => {
const hashes = '#'.repeat(Number(level))
return `\n\n${hashes} ${text.replace(/\s+/g, ' ').trim()}\n\n`
})
out = out.replace(/<br\s*\/?>/gi, '\n')
out = out.replace(/<(div|span|section|article|aside|header|footer)\b[^>]*>/gi, '')
out = out.replace(/<\/(div|span|section|article|aside|header|footer)>/gi, '')
out = out.replace(/\{\/\*[\s\S]*?\*\/\}/g, '')
out = out
.split('\n')
.map((line) => (line.trim() === '' ? '' : line))
.join('\n')
out = out.replace(/\n{3,}/g, '\n\n').trim()
return out
}
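A minimal sketch of the masking approach suggested above. The helper name `withCodeMasked` and the NUL-delimited placeholder format are hypothetical and not part of the PR; the fence detection is deliberately simple (fences at the start of a line, single-backtick inline spans) and would need hardening for indented or nested fences.

```javascript
// Hypothetical helper (not in the PR): hide code from a transform, then restore it.
function withCodeMasked(content, transform) {
  const stash = []
  // First alternative: a fenced block opened by ``` or ~~~ at the start of a line
  // and closed by the same fence on its own line. Second: an inline code span.
  const pattern = /^(`{3,}|~{3,})[^\n]*\n(?:[\s\S]*?\n)??\1[ \t]*$|`[^`\n]+`/gm
  const masked = content.replace(pattern, (segment) => {
    stash.push(segment)
    // NUL characters are assumed never to occur in real docs content.
    return `\u0000CODE${stash.length - 1}\u0000`
  })
  // Run the regex pipeline on prose only, then put the code back verbatim.
  return transform(masked).replace(/\u0000CODE(\d+)\u0000/g, (_, i) => stash[Number(i)])
}
```

Under these assumptions the existing pipeline could be wrapped as `withCodeMasked(body, sanitizeMdx)`, leaving `sanitizeMdx` itself unchanged.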
src/plugins/markdown-source/index.js:67
The "is the title already present?" check uses ``head.includes(`# ${doc.title}`)``, which is a plain substring match. Any heading whose level is greater than 1 (e.g. `## Lido`, `### Lido`) contains the substring `# Lido` and will be treated as if an H1 already exists, so the canonical `# Title` is never inserted. Anchor the check to a heading line, for example via a regex like `^#\s+${escapedTitle}\s*$` evaluated per line. The same bug exists in `src/plugins/llms-txt/index.js` (`hasTitleAlready` on line 88).
function hasHeading(body, title) {
const head = body.split('\n').slice(0, 20).join('\n')
if (/^#\s+/m.test(head)) return true
if (title && head.includes(`# ${title}`)) return true
return false
}
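The anchored check could look roughly like the sketch below. `escapeRegExp` and `hasTitleHeading` are hypothetical names; the sketch covers only the title comparison and keeps the 20-line window used by `hasHeading`.

```javascript
// Escape regex metacharacters so a title like "FAQ (v2)" is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Hypothetical replacement for the substring check: true only when one of the
// first 20 lines is exactly "# <title>" (an H1), never "## <title>".
function hasTitleHeading(body, title) {
  if (!title) return false
  const pattern = new RegExp(`^#\\s+${escapeRegExp(title)}\\s*$`)
  return body
    .split('\n')
    .slice(0, 20)
    .some((line) => pattern.test(line))
}
```

Testing line by line avoids the multiline flag and makes it explicit that `##` and deeper headings never satisfy the pattern.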
src/plugins/_shared/markdown-utils.js:173
`stripJsxAttributes` matches every opening tag of the form `<name ...>`, including lowercase HTML tags like `<a>`, `<img>`, `<input>` that are perfectly valid in markdown. For such tags it strips `className=`, `style={...}`, `onClick={...}`, and generically any `attribute={...}` expression. The className/style/event-handler stripping is benign for plain HTML, and the generic `attribute={...}` rule is unlikely to fire on non-JSX HTML. The broader concern is that this regex is also applied unconditionally to text inside fenced code blocks (see related comment on `sanitizeMdx`), which means code examples that intentionally show JSX attributes will be mutilated in the output.
function stripJsxAttributes(content) {
return content.replace(/<([a-zA-Z][a-zA-Z0-9]*)([^>]*)>/g, (match, tag, attrs) => {
if (!attrs) return `<${tag}>`
let cleaned = attrs
cleaned = cleaned.replace(/\s+className=(?:"[^"]*"|'[^']*'|\{[^}]*\})/g, '')
cleaned = cleaned.replace(/\s+style=(?:"[^"]*"|'[^']*'|\{\{[^}]*\}\}|\{[^}]*\})/g, '')
cleaned = cleaned.replace(/\s+on[A-Z][a-zA-Z]*=\{[^}]*\}/g, '')
cleaned = cleaned.replace(/\s+[a-zA-Z][a-zA-Z0-9-]*=\{[^}]*\}/g, '')
return `<${tag}${cleaned}>`
})
}
src/plugins/_shared/markdown-utils.js:116
`extractDescription` falls back to the first "meaningful" paragraph but uses very coarse filters (`startsWith('<')`, `startsWith('>')`, `startsWith(':::')`, etc.). A paragraph that begins with an admonition (e.g. `:::info ... :::`), a JSX block, or any HTML wrapper will be skipped, but a paragraph whose first character happens to be one of these for unrelated reasons (e.g. a sentence accidentally starting with `<` inside text after MDX normalization, or a Docusaurus admonition tail line that didn't end with `:::`) can pass through. Consider parsing the markdown more structurally (e.g. via `remark`) or at minimum documenting that this heuristic is best-effort.
function extractDescription(fm, body) {
if (fm && typeof fm.description === 'string' && fm.description.trim()) {
return fm.description.trim()
}
const cleaned = stripCodeFences(body)
const paragraph = cleaned
.split(/\n{2,}/)
.map((p) => p.trim())
.find((p) => {
if (!p) return false
if (p.startsWith('#')) return false
if (p.startsWith(':::')) return false
if (p.startsWith('import')) return false
if (p.startsWith('export')) return false
if (p.startsWith('>')) return false
if (p.startsWith('<')) return false
if (/^[-*+]\s/.test(p)) return false
if (/^\d+\.\s/.test(p)) return false
if (/^---+$/.test(p)) return false
return true
})
if (!paragraph) return ''
const text = sanitizeMdx(paragraph).replace(/\s+/g, ' ').trim()
return text.length > 200 ? `${text.slice(0, 197)}...` : text
}
src/plugins/_shared/markdown-utils.js:79
The plugin's permalink derivation duplicates Docusaurus's own routing logic (frontmatter `slug`, `index.md` collapsing, `routeBasePath`) but ignores other things Docusaurus actually respects, including `id` frontmatter, sidebar `slug` overrides, `numberPrefixParser`-style filename prefixes, and `_category_.json` paths. For a page whose real URL differs from what `buildPermalink` produces, the generated `.md` will be written to a location that does not match the canonical HTML URL, and the `<!-- canonical: -->` / `<!-- source: -->` comments will point to a non-existent page. Consider sourcing permalinks from Docusaurus's already-resolved routes (e.g. via `loadedRoutes` / content from the docs plugin) instead of recomputing them.
function buildPermalink(absPath, collectionPath, routeBasePath, frontmatter) {
const base = normalizeBase(routeBasePath)
if (frontmatter && frontmatter.slug) {
const slug = frontmatter.slug.startsWith('/') ? frontmatter.slug : `/${frontmatter.slug}`
return joinUrl(base, slug)
}
const rel = path.relative(collectionPath, absPath)
const noExt = rel.replace(/\.(md|mdx)$/i, '')
const parts = noExt.split(path.sep).filter(Boolean)
if (parts.length && parts[parts.length - 1].toLowerCase() === 'index') {
parts.pop()
}
const route = parts.length ? `/${parts.join('/')}` : '/'
return joinUrl(base, route)
}
function normalizeBase(routeBasePath) {
if (!routeBasePath || routeBasePath === '/') return ''
return routeBasePath.startsWith('/') ? routeBasePath : `/${routeBasePath}`
}
function joinUrl(base, route) {
if (!base) return route
if (route === '/') return base
return `${base}${route}`
}
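One way to read permalinks from the docs plugin instead of recomputing them is sketched below. It rests on an assumption that should be checked against the installed Docusaurus version: that the `plugins` array passed to `postBuild` exposes each docs instance's loaded content as `content.loadedVersions[].docs[]`, with `source` and `permalink` on every doc. `collectResolvedPermalinks` is a hypothetical helper name.

```javascript
// ASSUMPTION: `plugins` is the array handed to postBuild, and docs plugin
// instances carry content.loadedVersions[].docs[] with `source` and `permalink`.
// This data shape is not confirmed by the PR; verify it before relying on it.
function collectResolvedPermalinks(plugins) {
  const bySource = new Map()
  for (const plugin of plugins || []) {
    if (plugin.name !== 'docusaurus-plugin-content-docs') continue
    const versions = (plugin.content && plugin.content.loadedVersions) || []
    for (const version of versions) {
      for (const doc of version.docs || []) {
        // Key by source file so each .md output can look up its canonical URL.
        bySource.set(doc.source, doc.permalink)
      }
    }
  }
  return bySource
}
```

If the shape holds, `buildPermalink` could become a fallback used only when a source file is missing from the map.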
src/plugins/_shared/markdown-utils.js:178
`escapeMarkdown` is named like it escapes markdown special characters but actually only collapses newlines and trims. The description it produces is interpolated directly into a list item as `- [title](url): ${desc}`, so a description containing characters such as backticks, brackets, or pipes is emitted unescaped. Either rename to something like `inlineDescription` / `collapseWhitespace`, or actually escape characters that could break the list-item / link rendering.
function escapeMarkdown(text) {
if (!text) return ''
return text.replace(/\r/g, '').replace(/\n+/g, ' ').trim()
}
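If the second option is preferred, a sketch of real escaping might look like this. `escapeInlineMarkdown` is a hypothetical name, and the set of escaped characters is an assumption about what matters for the list-item format, not an exhaustive CommonMark list.

```javascript
// Hypothetical helper: collapse whitespace as before, then backslash-escape
// characters that could break a "- [title](url): desc" list item.
function escapeInlineMarkdown(text) {
  if (!text) return ''
  return text
    .replace(/\r/g, '')
    .replace(/\n+/g, ' ')
    .trim()
    .replace(/[\\`*_\[\]<>|]/g, '\\$&')
}
```

Escaping runs last so the backslashes it introduces are not touched by the whitespace handling.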
src/plugins/_shared/markdown-utils.js:89
The emoji-stripping range `[\u{1F300}-\u{1FAFF}\u{2600}-\u{27BF}]` misses several code points that commonly appear in emoji used in titles. Supplemental symbols `1F900–1F9FF` and skin-tone modifiers `1F3FB–1F3FF` fall inside `1F300–1FAFF` and are covered, but regional indicators `1F1E6–1F1FF` (the building blocks of flag sequences), symbols outside `2600–27BF`, ZWJ joiners `200D`, and variation selectors `FE0F` are not. A title that contains 🇺🇸 keeps the flag, and compound emoji leave joiner or variation-selector characters behind. Consider using a more complete regex (e.g. `\p{Extended_Pictographic}` plus `\uFE0F`/`\u200D`) or a small library.
.replace(/\{#[^}]+\}/g, '')
.replace(/[\u{1F300}-\u{1FAFF}\u{2600}-\u{27BF}]/gu, '')
.replace(/\s+/g, ' ')
.trim()
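A sketch of the property-escape variant follows. `EMOJI_PATTERN` and `stripEmoji` are hypothetical names. The character class is an assumption about what should be removed: `\p{Extended_Pictographic}` also matches a few non-emoji symbols such as © and ®, so titles that rely on those would need an exception.

```javascript
// Unicode property escapes require the `u` flag.
// Covers pictographs, flag building blocks, skin-tone modifiers, ZWJ and VS16.
const EMOJI_PATTERN =
  /[\p{Extended_Pictographic}\p{Regional_Indicator}\p{Emoji_Modifier}\u200D\uFE0F]/gu

// Hypothetical helper: drop emoji, then collapse the whitespace left behind.
function stripEmoji(title) {
  return title.replace(EMOJI_PATTERN, '').replace(/\s+/g, ' ').trim()
}
```

Listing the joiner and variation selector explicitly means compound sequences are removed as a whole rather than leaving invisible characters in the title.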
src/plugins/markdown-source/index.js:29
When two collections produce the same resolved file path (collision), the warning message says "overwriting", but the code does not actually overwrite atomically; it just writes the second doc and discards the first. More importantly, the collision is detected only after both docs have been loaded sequentially, so the first doc's `fs.writeFileSync` has already happened. The map check is correct for the warning, but the order means the later file always wins regardless of which collection should be authoritative. Consider failing the build (or at least clearly logging both source paths and the winning one) on collision instead of silently picking the last writer.
const targetPath = resolveTargetPath(outDir, doc.permalink)
if (seen.has(targetPath)) {
console.warn(
`[markdown-source] permalink collision at ${doc.permalink}: ${seen.get(targetPath)} vs ${doc.relPath} (overwriting)`,
)
}
seen.set(targetPath, doc.relPath)
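Failing the build on collision could be as small as the sketch below. `claimTargetPath` is a hypothetical helper; it assumes `seen` is the same `Map` of target path to source path used in the snippet above and that it is called before anything is written.

```javascript
// Hypothetical helper: record a target path, or throw if another doc already
// claimed it. Throwing from postBuild is assumed to abort the build.
function claimTargetPath(seen, targetPath, relPath) {
  if (seen.has(targetPath)) {
    throw new Error(
      `[markdown-source] permalink collision at ${targetPath}: ` +
        `${seen.get(targetPath)} and ${relPath} resolve to the same file`,
    )
  }
  seen.set(targetPath, relPath)
}
```

Naming both source paths in the error lets the author decide which collection should own the URL instead of relying on load order.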
Please, go through these steps before you request a review:
📝 Describe your changes
- … `/llms.txt` index and a concatenated `/llms-full.txt` corpus at build time
- … `.md` to any docs URL)
- … `llms-txt`, `markdown-source`) with a shared markdown utilities module
- … `.md` endpoints from search engine indexing via `robots.txt`
- … `src/plugins/README.md`

🔎 Attach a source of truth or evidence that allows reviewers to confirm the changes independently