Overhaul LLMs (docs changes) #969

dawnkelly09 · 2025-09-08T17:05:12Z

Goes with: papermoonio/polkadot-mkdocs#125

This PR makes the following changes to overhaul LLM-friendly file generation for Polkadot Docs:

llms-config.json changes: the scripts that generate the LLM files are now fully configured in this file; no values are changed directly in the generation scripts
Output file format changes: the llms.txt in the root of the docs repo is the only remaining .txt format output file. All other LLM-facing outputs are now in Markdown or JSON/JSONL to help models parse and index info more efficiently and correctly
Resolved Markdown pages: a single resolved Markdown page is output to the /.ai/ directory for each content page on the deployed docs site. "Resolved" means snippets and variables are replaced with their content, HTML comments are stripped, etc. Keeping these pages in /.ai/ prevents them from being converted to HTML at site build, so they can be served from their GitHub raw URLs for the copy, download, and (in a future PR) view-page-in-Markdown features
llms.txt file improvements: now serves raw GH URLs for the resolved markdown pages organized by category, includes fields for source repos and additional resources
Full-site content changes: llms-full.txt replaced with new output files:

--> /.ai/site-index.json: array of one object per page with:
- id: unique slug that identifies the page
- title: page title from front matter (H1)
- slug: relative file path with dashes instead of slashes
- categories: the categories set up in llms_config.json and added to front matter
- raw_md_url: URL for the raw markdown with the snippets & variables rendered
- html_url: URL for the deployed page online (so model has a human friendly version to share with the user)
- preview: The first X characters of content that isn't inside a code fence, admonition, or part of a bullet list to give the model context for the contents of the page
- outline: lists the depth (h1, h2, h3), title, and anchor (heading anchor slug) of each heading on the page to aid in chunking and indexing
- stats: provides character, word, heading, and estimated token counts to aid in chunking (token_estimator value can be customized via optional CLI flags to generate model-specific counts in the future if desired)
- NOTE: In testing, this file was approx 401 KB, so it is a very lightweight site index that is also model-friendly (llms-full.txt is currently 2.1 MB)
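
To make the shape of one site-index.json entry concrete, here is a minimal sketch, not the code in scripts/generate_llms.py. The field names follow the list above; the helper names, the keys inside stats, the base URLs, the anchor style, and the four-characters-per-token heuristic are all assumptions for illustration.

```python
import re

# Matches H1-H3 headings; a simplification that does not skip headings inside code fences.
HEADING_RE = re.compile(r"^(#{1,3})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def anchor_slug(title: str) -> str:
    # Assumed anchor style: lowercase, runs of other characters collapsed to "-".
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def preview_text(body: str, limit: int = 200) -> str:
    # Keep plain paragraph lines only: skip code fences, headings, bullets,
    # and admonition openers (indented admonition bodies are not handled here).
    kept, in_fence = [], False
    for line in body.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence or not stripped:
            continue
        if stripped.startswith(("#", "-", "*", "!!!")):
            continue
        kept.append(stripped)
    return " ".join(kept)[:limit]


def build_index_entry(rel_path, title, categories, body, raw_base, site_base):
    stem = rel_path.removesuffix(".md").strip("/")
    slug = stem.replace("/", "-")  # relative path with dashes instead of slashes
    outline = [
        {"depth": len(m.group(1)), "title": m.group(2), "anchor": anchor_slug(m.group(2))}
        for m in HEADING_RE.finditer(body)
    ]
    return {
        "id": slug,  # assumption: the id reuses the slug
        "title": title,
        "slug": slug,
        "categories": categories,
        "raw_md_url": f"{raw_base.rstrip('/')}/{slug}.md",
        "html_url": f"{site_base.rstrip('/')}/{stem}/",
        "preview": preview_text(body),
        "outline": outline,
        "stats": {  # hypothetical key names
            "chars": len(body),
            "words": len(body.split()),
            "headings": len(outline),
            "estimated_tokens": len(body) // 4,  # assumed heuristic
        },
    }
```

Calling build_index_entry for develop/interoperability/send-messages.md would yield the slug develop-interoperability-send-messages, which matches the page file name that appears under .ai/pages/ in this PR.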

--> /.ai/llms-full.jsonl: (default behavior) JSON lines of one object per H2/H3 section with:
- page_id: unique slug of the page the section belongs to
- page_title: title from page front matter
- index: position of this heading on the given page
- depth: 2 for H2 and 3 for H3 headings
- title: heading text
- anchor: heading anchor slug
- start_char: character count for start of section
- end_char: character count for end of section
- estimated_token_count: estimated token count for the content of this section to aid in chunking
- token_estimator: defaults to a heuristic method, can configure via optional CLI flag
- text: plain text content of the section
- NOTE: In testing, this file (which contains full site content plus metadata) was 3.2 MB while llms-full.txt is currently 2.1 MB. Though this file size is bigger, it is ready for easy chunking and/or model indexing right out of the box compared to the .txt version.
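
As a rough illustration of the per-section records described above, the sketch below splits one resolved Markdown page on its H2/H3 headings and serializes the result as JSON lines. It is not the actual generator: the function names, the anchor style, the token_estimator label, and the characters-divided-by-four estimate are assumptions, and headings inside code fences are not skipped.

```python
import json
import re

# H2 and H3 headings only, one per line.
SECTION_RE = re.compile(r"^(#{2,3})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def anchor_slug(title: str) -> str:
    # Assumed anchor style: lowercase, runs of other characters collapsed to "-".
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def split_sections(page_id: str, page_title: str, body: str) -> list:
    matches = list(SECTION_RE.finditer(body))
    records = []
    for index, match in enumerate(matches):
        start = match.start()
        # A section runs until the next H2/H3 heading or the end of the page.
        end = matches[index + 1].start() if index + 1 < len(matches) else len(body)
        text = body[match.end():end].strip()
        records.append({
            "page_id": page_id,
            "page_title": page_title,
            "index": index,
            "depth": len(match.group(1)),
            "title": match.group(2),
            "anchor": anchor_slug(match.group(2)),
            "start_char": start,
            "end_char": end,
            "estimated_token_count": len(text) // 4,  # assumed heuristic
            "token_estimator": "heuristic-chars-div-4",  # hypothetical label
            "text": text,
        })
    return records


def to_jsonl(records: list) -> str:
    # One JSON object per line, as in llms-full.jsonl.
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
```

Because each line is a complete JSON object, a consumer can stream the file and filter on estimated_token_count without loading the full site into memory.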

Category file changes: llms-*.txt files replaced with /.ai/categories/<category-slug>.bundle.md: a single concatenated Markdown file for a category with page boundaries and titles.
--> This category script is designed to check for categories in the llms-config.json file. If no categories are found, bundle generation is skipped to allow for projects with a single product or focus.
--> The category script also outputs JSON and JSONL versions, but I am not wiring them up to any UI yet, as I'm still deciding whether and how I want to use them
--> Retains the existing behavior where basics and reference content is included in the category files for model context purposes
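
The category behavior above could be sketched roughly as follows. This is an illustration under assumptions rather than the real script: the config keys categories and shared_categories, the page dictionary shape, and the page-boundary format are hypothetical.

```python
def category_slug(name: str) -> str:
    # e.g. "Smart Contracts" -> "smart-contracts"
    return name.lower().replace(" ", "-")


def build_bundles(config: dict, pages: list) -> dict:
    """Return {category-slug: concatenated Markdown}; empty when no categories exist."""
    categories = config.get("categories") or []
    if not categories:
        # No categories configured: skip bundle generation entirely.
        return {}
    # Hypothetical key naming the categories whose pages go into every bundle
    # (the basics and reference content mentioned above).
    shared = set(config.get("shared_categories", []))
    bundles = {}
    for category in categories:
        wanted = {category} | shared
        parts = []
        for page in pages:
            if wanted & set(page["categories"]):
                # Assumed page-boundary format: a rule plus a title line.
                parts.append(f"---\n\nPage Title: {page['title']}\n\n{page['markdown']}")
        bundles[category_slug(category)] = "\n\n".join(parts)
    return bundles
```

Returning an empty mapping when no categories are configured keeps the caller simple: it just writes whatever bundles come back, which is nothing for a single-product project.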

UI/UX changes: wired up footer and AI Ready Docs page to use these new files.

Note: you still generate new LLM files by running python3 scripts/generate_llms.py in the terminal. Automation/workflows and centralization will come in future PR(s).
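
For anyone curious what "resolved" Markdown means in practice, here is a minimal sketch of two of the steps named earlier: stripping HTML comments and replacing variables. It is not taken from scripts/generate_llms.py. The {{ name }} placeholder syntax and the function name are assumptions, and snippet inclusion is left out entirely.

```python
import re

HTML_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Assumed placeholder syntax: {{ dotted.name }}
VARIABLE_RE = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def resolve_markdown(text: str, variables: dict) -> str:
    # Drop HTML comments first so commented-out placeholders are not expanded.
    text = HTML_COMMENT_RE.sub("", text)

    def substitute(match):
        # Unknown variables are left untouched so they stay visible in the output.
        return str(variables.get(match.group(1), match.group(0)))

    return VARIABLE_RE.sub(substitute, text)
```

For example, resolve_markdown("Use {{ version }}<!-- note --> now", {"version": "1.2"}) returns "Use 1.2 now".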

I know this thing is a monster, so let me know if you want to hop on a call for a walk-through, etc.

github-actions · 2025-09-08T17:05:27Z

🔍 Documentation URL Checker

This PR modifies documentation files in ways that could potentially create broken links.

Renamed/Moved files:

llms-files/llms-basics.txt -> .ai/categories/basics.bundle.md
llms-files/llms-dapps.txt -> .ai/categories/dapps.bundle.md
llms-files/llms-infrastructure.txt -> .ai/categories/infrastructure.bundle.md
llms-files/llms-networks.txt -> .ai/categories/networks.bundle.md
llms-files/llms-parachains.txt -> .ai/categories/parachains.bundle.md
llms-files/llms-reference.txt -> .ai/categories/reference.bundle.md
llms-files/llms-smart-contracts.txt -> .ai/categories/smart-contracts.bundle.md
llms-files/llms-tooling.txt -> .ai/categories/tooling.bundle.md

🚨 Please review these changes carefully 🚨

If not handled properly, broken links (404 errors) could appear. To maintain a smooth user experience, consider:

Adding redirects in the mkdocs.yml file from the old URLs to the new ones
Updating internal references to these files

Copilot

Pull Request Overview

This PR overhauls the LLM-friendly file generation system for Polkadot Docs, transforming the output from basic text files to structured Markdown and JSON formats optimized for AI model consumption.

Replaces single llms-full.txt file with structured site-index.json and llms-full.jsonl files containing metadata, content chunks, and indexing information
Generates resolved Markdown pages in /.ai/ directory with snippets/variables processed and HTML comments stripped for clean AI consumption
Implements category-based bundle generation with configurable llms-config.json for flexible content organization

Reviewed Changes

Copilot reviewed 76 out of 188 changed files in this pull request and generated no comments.

File	Description
.ai/pages/	New resolved Markdown pages for AI consumption with processed content and metadata
Various .ai/pages files	Individual documentation pages converted to clean Markdown format without snippets/variables

nhussein11

Hi @dawnkelly09, just checking in: can all the content under /.ai/ be easily regenerated if needed? I'm wondering because we're about to refactor the whole documentation architecture, and most of those pages will change. So, that shouldn't be an issue, right? Once the new architecture is ready, we can just run the LLMs script again, and all the content under /.ai/ will be re-populated.

eshaben

1️⃣ Only reviewed the AI-Ready page.

Can you please create an .ai/README.md file that summarizes what's going on in this directory and goes over the different file types?

get-support/ai-ready-docs.md

eshaben

2️⃣ Reviewed everything except for each script in the scripts directory and the resulting output in the .ai directory.

llms.txt

scripts/llms_config.json

.github/workflows/check-llms.yml

llms-files/llms-polkadot-protocol.txt

dawnkelly09 · 2025-09-12T16:00:16Z

> Hi @dawnkelly09, just checking in: can all the content under /.ai/ be easily regenerated if needed? I'm wondering because we're about to refactor the whole documentation architecture, and most of those pages will change. So, that shouldn't be an issue, right? Once the new architecture is ready, we can just run the LLMs script again, and all the content under /.ai/ will be re-populated.

Correct, Nico! This process is meant to update the /.ai/ files each time content is updated, so no worries about architecture or content changes, apart from anything we might end up needing to update in the config (like new categories, maybe?). Thank you for asking!

.ai/pages/develop-interoperability-send-messages.md

scripts/generate_llms.py

.ai/README.md

nhussein11

LGTM! One minor comment but after that this should be good to go! Thank youuu!

nhussein11 · 2025-09-18T18:49:17Z

Oh sorry, one more comment. When I ran the script, it regenerated some of the LLM files:

modified:   .ai/categories/dapps.md
modified:   .ai/site-index.json

Not sure if that indicates that this PR is out of sync or what, but I merged some PRs today, so I'd make sure to merge master into this branch first so we cover everything :)

This reverts commit 68f574d.

dawnkelly09 added 10 commits August 28, 2025 16:15

adds script to output .md pages to /.ai/, adds ai_exclude flag to needed pages, adds initial batch of ai files

0844bdb

adds Tutorial tag to relevant pages, updates scripts for generating new llms files (still need category files)

d29c5d3

Merge remote-tracking branch 'origin' into dawn/improved-llms

f96aea9

adds category file generation

ebe551e

updates main llms script, removes deprecated LLMs files

d547694

adds estimated token count to category bundle outputs

4c69492

adds variable for base url for ai artifact files

7cadc61

wired up existing UI, updated default file outputs

13e0578

update f string copy to match file renaming

6640c2a

Merge remote-tracking branch 'origin' into dawn/improved-llms

03e3f61

dawnkelly09 mentioned this pull request Sep 8, 2025

LLMs Overhaul (mkdocs changes) papermoonio/polkadot-mkdocs#125

Merged

dawnkelly09 added 2 commits September 8, 2025 13:09

updates check-llms workflow to verify llms.txt as llms-full.txt is deprecated with these changes

7e65d13

fresh llms

ebc5ca8

dawnkelly09 marked this pull request as ready for review September 8, 2025 17:11

Copilot AI review requested due to automatic review settings September 8, 2025 17:11

dawnkelly09 requested a review from a team as a code owner September 8, 2025 17:11

Copilot AI reviewed Sep 8, 2025

nhussein11 self-requested a review September 9, 2025 12:58

nhussein11 reviewed Sep 9, 2025

eshaben reviewed Sep 9, 2025

updates urls for prod

0f87ce1

dawnkelly09 added 6 commits September 12, 2025 12:33

update scripts per feedback (remove bundle from file path)

6e883cd

Fix filepaths to remove bundle

27820d1

remove log line that is no longer needed

876785b

patch to remove source repos and optional resources from llms.txt, WIP README

7de30d4

patch script to output llms-full.jsonl to root of docs repo (too big to download via browser like other /.ai/ files)

c430536

updates README

233263f

polkadot-developers deleted a comment from github-actions bot Sep 17, 2025

dawnkelly09 added the "B0 - Needs Review" and "A1 - Maintenance Major" labels Sep 17, 2025

dawnkelly09 requested a review from eshaben September 17, 2025 18:27

eshaben reviewed Sep 17, 2025

.ai/pages/develop-interoperability-send-messages.md (outdated, resolved)

scripts/generate_llms.py (outdated, resolved)

polkadot-developers deleted a comment from github-actions bot Sep 18, 2025

dawnkelly09 added 3 commits September 18, 2025 12:30

updates script to fix formatting issue in output files, fresh llms

5aace73

update code comment

5013a55

Merge remote-tracking branch 'origin' into dawn/improved-llms

a85d42a

dawnkelly09 requested a review from eshaben September 18, 2025 16:53

missed a file save!

2ba0688

polkadot-developers deleted a comment from github-actions bot Sep 18, 2025

eshaben approved these changes Sep 18, 2025

nhussein11 reviewed Sep 18, 2025

.ai/README.md (resolved)

nhussein11 approved these changes Sep 18, 2025

dawnkelly09 added 2 commits September 18, 2025 15:10

improved README

39f8282

fresh llms

5df7d22

polkadot-developers deleted a comment from github-actions bot Sep 18, 2025

dawnkelly09 added the "B1 - Ready to Merge" label and removed the "B0 - Needs Review" label Sep 18, 2025

eshaben approved these changes Sep 18, 2025

eshaben merged commit 68f574d into master Sep 18, 2025
11 of 12 checks passed

eshaben deleted the dawn/improved-llms branch September 18, 2025 19:35

eshaben added a commit that referenced this pull request Sep 18, 2025

Revert "Overhaul LLMs (docs changes) (#969)"

3d7898b

This reverts commit 68f574d.

eshaben added a commit that referenced this pull request Sep 18, 2025

Revert "Overhaul LLMs (docs changes) (#969)" (#1004)

48e917f

This reverts commit 68f574d.
