Overhaul LLMs (docs changes) #969
Conversation
…ded pages, adds initial batch of ai files
…ew llms files (still need category files)
🔍 Documentation URL Checker
This PR modifies documentation files in ways that could potentially create broken links.
Renamed/Moved files: 🚨 Please review these changes carefully 🚨
If not handled properly, broken links (404 errors) could appear. To maintain a smooth user experience, consider:
…precated with these changes
Pull Request Overview
This PR overhauls the LLM-friendly file generation system for Polkadot Docs, transforming the output from basic text files to structured Markdown and JSON formats optimized for AI model consumption.
- Replaces single llms-full.txt file with structured site-index.json and llms-full.jsonl files containing metadata, content chunks, and indexing information
- Generates resolved Markdown pages in /.ai/ directory with snippets/variables processed and HTML comments stripped for clean AI consumption
- Implements category-based bundle generation with configurable llms-config.json for flexible content organization
Reviewed Changes
Copilot reviewed 76 out of 188 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| .ai/pages/ | New resolved Markdown pages for AI consumption with processed content and metadata |
| Various .ai/pages files | Individual documentation pages converted to clean Markdown format without snippets/variables |
Hi @dawnkelly09, just checking in: can all the content under /.ai/ be easily regenerated if needed? I'm wondering because we're about to refactor the whole documentation architecture, and most of those pages will change. So, that shouldn't be an issue, right? Once the new architecture is ready, we can just run the LLMs script again, and all the content under /.ai/ will be re-populated.
1️⃣ Only reviewed the AI-Ready page.
Can you please create an .ai/README.md file that summarizes what's going on in this directory and goes over the different file types?
2️⃣ Reviewed everything except for each script in the scripts directory and the resulting output in the .ai directory.
Correct, Nico! This process is meant to update the /.ai/ files each time content is updated, so no worries about architecture or content changes, outside of anything we might end up needing to update in the config (like new categories, maybe?). Thank you for asking!
…to download via browser like other /.ai/ files)
LGTM! One minor comment but after that this should be good to go! Thank youuu!

Goes with: papermoonio/polkadot-mkdocs#125
This PR makes the following changes to overhaul LLM-friendly file generation for Polkadot Docs:
- **`llms-config.json` changes**: the scripts that generate the LLM files are now fully configured in this file; no values are changed directly in the generation scripts.
- **Output file format changes**: the `llms.txt` in the root of the docs repo is the only remaining `.txt`-format output file. All LLM-facing outputs are now Markdown or JSON/JSONL to help models parse and index info more efficiently and correctly. A single resolved Markdown page is output to the `/.ai/` directory for each content page on the deployed docs site. Resolved Markdown = snippets and variables replaced with their content, HTML comments stripped, etc. The `/.ai/` directory prevents these md pages from being converted to HTML at site build, so they can be served from their GitHub raw URL for copy, download, and (in a future PR) view-page-in-Markdown features.
- **`llms.txt` file improvements**: now serves raw GH URLs for the resolved Markdown pages organized by category, and includes fields for source repos and additional resources.
- **Full-site content changes**: `llms-full.txt` is replaced with the new output files described below.
**`/.ai/site-index.json`**: array of one object per page with:

- `id`: unique slug to identify the page
- `title`: page title from front matter (H1)
- `slug`: relative file path with dashes instead of slashes
- `categories`: the categories set up in `llms-config.json` and added to front matter
- `raw_md_url`: URL for the raw Markdown with the snippets and variables rendered
- `html_url`: URL for the deployed page online (so the model has a human-friendly version to share with the user)
- `preview`: the first X characters of content that isn't inside a code fence, admonition, or part of a bullet list, to give the model context for the contents of the page
- `outline`: lists `depth` (h1, h2, h3), `title`, and `anchor` (heading anchor slug) for the page, to aid in chunking and indexing
- `stats`: provides character, word, heading, and estimated token counts to aid in chunking (the `token_estimator` value can be customized via optional CLI flags to generate model-specific counts in the future if desired)

NOTE: In testing, this file was approx. 401 KB, so it is a very lightweight site index that is also model-friendly (`llms-full.txt` is currently 2.1 MB).
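For illustration only, a single `site-index.json` entry might look roughly like the sketch below. The top-level field names come from the list above; all values are placeholders, and the exact key names inside `outline` and `stats` are assumptions rather than confirmed output.

```json
{
  "id": "develop-example-page",
  "title": "Example Page",
  "slug": "develop-example-page",
  "categories": ["Basics"],
  "raw_md_url": "https://raw.githubusercontent.com/<org>/<repo>/<branch>/.ai/pages/develop-example-page.md",
  "html_url": "https://docs.polkadot.com/<path-to-page>/",
  "preview": "First N characters of prose outside code fences, admonitions, and bullet lists...",
  "outline": [
    { "depth": 1, "title": "Example Page", "anchor": "example-page" },
    { "depth": 2, "title": "Get Started", "anchor": "get-started" }
  ],
  "stats": {
    "characters": 12850,
    "words": 2140,
    "headings": 12,
    "estimated_tokens": 3200,
    "token_estimator": "heuristic"
  }
}
```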
**`/.ai/llms-full.jsonl`**: (default behavior) JSON lines of one object per H2/H3 section with:

- `page_id`: unique slug for the page
- `page_title`: title from page front matter
- `index`: position of this heading on the given page
- `depth`: 2 for H2 and 3 for H3 headings
- `title`: heading text
- `anchor`: heading anchor slug
- `start_char`: character count for the start of the section
- `end_char`: character count for the end of the section
- `estimated_token_count`: estimated token count for the content of this section, to aid in chunking
- `token_estimator`: defaults to a heuristic method; can be configured via an optional CLI flag
- `text`: plain text content of the section

NOTE: In testing, this file (which contains full site content plus metadata) was 3.2 MB, while `llms-full.txt` is currently 2.1 MB. Though this file is bigger, it is ready for easy chunking and/or model indexing right out of the box compared to the `.txt` version.
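As a rough sketch (placeholder values; the field names are taken from the list above), a single line of the JSONL output would be one JSON object along these lines:

```json
{"page_id": "develop-example-page", "page_title": "Example Page", "index": 3, "depth": 2, "title": "Get Started", "anchor": "get-started", "start_char": 4210, "end_char": 6054, "estimated_token_count": 460, "token_estimator": "heuristic", "text": "Plain-text content of this H2 section..."}
```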
**`/.ai/categories/<category-slug>.bundle.md`**: a single concatenated Markdown file for a category, with page boundaries and titles.

- The category script checks for categories in the `llms-config.json` file. If no categories are found, bundle generation is skipped, to allow for projects with a single product or focus.
- The category script also outputs JSON and JSONL versions, but I am not wiring them up to any UI yet as I'm deciding how/if I want to use them.
- Retains the existing behavior where basics and reference content gets included in categories for model context purposes.
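For context, a hypothetical `categories` block in `llms-config.json` could look something like the sketch below; the actual schema may differ, and the key names here are assumptions. It is only meant to illustrate the skip-if-empty behavior described above (no `categories` entries, no bundles).

```json
{
  "categories": [
    { "name": "Basics", "slug": "basics" },
    { "name": "Reference", "slug": "reference" }
  ]
}
```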
**UI/UX changes**: wired up the footer and the AI Ready Docs page to use these new files.
Note: you still generate new LLM files via `python3 scripts/generate_llms.py` in the terminal. Automation/workflows and centralization will come in future PR(s).

I know this thing is a monster, so let me know if you want to hop on a call for a walk-through, etc.