@dawnkelly09 dawnkelly09 commented Sep 8, 2025

Goes with: papermoonio/polkadot-mkdocs#125

This PR makes the following changes to overhaul LLM-friendly file generation for Polkadot Docs:

  • llms-config.json changes: the scripts that generate the LLM files are now fully configured in this file; no values are changed directly in the generation scripts

  • Output file format changes: the llms.txt in the root of the docs repo is the only remaining .txt output file. All other LLM-facing outputs are now Markdown or JSON/JSONL to help models parse and index information more efficiently and correctly

  • A single resolved Markdown page is output to the /.ai/ directory for each content page on the deployed docs site. Resolved Markdown means snippets and variables are replaced with their content, HTML comments are stripped, etc. Keeping these .md pages in the /.ai/ directory prevents them from being converted to HTML at site build, so they can be served from their GitHub raw URLs for the copy, download, and (in a future PR) view-page-in-Markdown features

  • llms.txt file improvements: now serves raw GH URLs for the resolved markdown pages organized by category, includes fields for source repos and additional resources

  • Full-site content changes: llms-full.txt replaced with new output files:

--> /.ai/site-index.json: array of one object per page with:
- id: unique slug that identifies the page
- title: page title from front matter (H1)
- slug: relative file path with dashes instead of slashes
- categories: the categories set up in llms-config.json and added to front matter
- raw_md_url: URL for the raw markdown with the snippets & variables rendered
- html_url: URL for the deployed page online (so model has a human friendly version to share with the user)
- preview: the first X characters of content that isn't inside a code fence, an admonition, or a bullet list, to give the model context for the contents of the page
- outline: lists depth (h1, h2, h3), title, and anchor (heading anchor slug) for each heading on the page to aid in chunking and indexing
- stats: provides character, word, heading, and estimated token counts to aid in chunking (token_estimator value can be customized via optional CLI flags to generate model-specific counts in the future if desired)
- NOTE: In testing, this file was approx 401 KB, so it is a very lightweight site index that is also model-friendly (llms-full.txt is currently 2.1 MB)
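
For reviewers, here is a minimal sketch of how the preview and stats fields described above could be derived from a resolved Markdown page. It is illustrative only and is not the code in the scripts directory. The function names, the 200-character default, the chars/4 token heuristic, and the choice to also skip headings in the preview are all assumptions.

```python
import re

def build_preview(markdown: str, max_chars: int = 200) -> str:
    """Collect prose lines, skipping code fences, admonitions, bullets, and headings."""
    kept = []
    in_fence = False
    in_admonition = False
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            in_fence = not in_fence          # toggle on opening/closing fence
            continue
        if in_fence:
            continue
        if stripped.startswith(("!!!", "???")):
            in_admonition = True             # MkDocs-style admonition header
            continue
        if in_admonition:
            # Admonition bodies are indented; blank lines may appear inside them.
            if not stripped or line.startswith(("    ", "\t")):
                continue
            in_admonition = False
        if not stripped or stripped.startswith("#"):
            continue                         # blank line or heading (assumption: skipped)
        if re.match(r"([-*+]|\d+\.)\s", stripped):
            continue                         # bullet or numbered list item
        kept.append(stripped)
    return " ".join(kept)[:max_chars]

def page_stats(markdown: str) -> dict:
    """Character, word, heading, and estimated token counts for one page."""
    # Simplification: '#' lines inside code fences are not excluded here.
    headings = [l for l in markdown.splitlines() if re.match(r"#{1,6}\s", l)]
    chars = len(markdown)
    return {
        "chars": chars,
        "words": len(markdown.split()),
        "headings": len(headings),
        "estimated_tokens": (chars + 3) // 4,   # hypothetical heuristic: ~4 chars per token
        "token_estimator": "chars_div_4",       # hypothetical estimator label
    }
```

A model-specific tokenizer could replace the heuristic without changing the shape of the stats object.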

--> /.ai/llms-full.jsonl: (default behavior) JSON lines of one object per H2/H3 section with:
- page_id: unique slug of the page the section belongs to
- page_title: title from page front matter
- index: position of this heading on the given page
- depth: 2 for H2 and 3 for H3 headings
- title: heading text
- anchor: heading anchor slug
- start_char: character count for start of section
- end_char: character count for end of section
- estimated_token_count: estimated token count for the content of this section to aid in chunking
- token_estimator: defaults to a heuristic method, can configure via optional CLI flag
- text: plain text content of the section
- NOTE: In testing, this file (which contains full site content plus metadata) was 3.2 MB while llms-full.txt is currently 2.1 MB. Though this file size is bigger, it is ready for easy chunking and/or model indexing right out of the box compared to the .txt version.
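
To illustrate the "ready for easy chunking" point, here is a minimal sketch of a consumer that reads the JSONL records and packs consecutive sections of the same page into chunks under a token budget. It relies only on the page_id and estimated_token_count fields listed above; the function names and the 1000-token default are assumptions, not part of this PR.

```python
import json
from typing import Iterable, Iterator

def read_sections(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def pack_sections(sections: Iterable[dict], max_tokens: int = 1000) -> list:
    """Group consecutive sections into chunks that stay on one page and under max_tokens."""
    chunks = []
    current, current_tokens, current_page = [], 0, None
    for section in sections:
        tokens = section["estimated_token_count"]
        new_page = section["page_id"] != current_page
        # Close the running chunk on a page change or when the budget would be exceeded.
        if current and (new_page or current_tokens + tokens > max_tokens):
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(section)
        current_tokens += tokens
        current_page = section["page_id"]
    if current:
        chunks.append(current)
    return chunks

# Usage (path taken from the description above):
# with open(".ai/llms-full.jsonl", encoding="utf-8") as f:
#     chunks = pack_sections(read_sections(f))
```

A single section larger than the budget still becomes its own chunk; splitting inside a section would need the start_char and end_char offsets or the text itself.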

  • Category file changes: llms-*.txt files replaced with /.ai/categories/<category-slug>.bundle.md: a single concatenated Markdown file for a category with page boundaries and titles.
    --> This category script is designed to check for categories in the llms-config.json file. If no categories are found, bundle generation is skipped to allow for projects with a single product or focus.
    --> The category script also outputs JSON and JSONL versions but I am not wiring them up to any UI yet as I'm deciding how to/if I want to use them
    --> Retains the existing behavior where basics and reference content is included in categories for model context purposes
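
The bundle behavior described above can be sketched roughly as follows. This is a hedged illustration rather than the actual category script: the "categories" config key, the page dictionary shape, the "Basics"/"Reference" category names, and the page-boundary format are all assumptions.

```python
import re

def slugify(name: str) -> str:
    """Lowercase a category name and join its words with dashes."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def build_bundles(config: dict, pages: list,
                  base_categories: tuple = ("Basics", "Reference")) -> dict:
    """Return {filename: bundle text}; empty when the config defines no categories."""
    categories = config.get("categories") or []   # hypothetical config key
    if not categories:
        return {}                                  # skip bundle generation entirely
    bundles = {}
    for category in categories:
        # Base content is included alongside each category for model context.
        wanted = {category, *base_categories}
        parts = [
            f"---\n\nPage Title: {p['title']}\n\n{p['markdown'].strip()}\n"
            for p in pages
            if wanted & set(p["categories"])
        ]
        bundles[f"{slugify(category)}.bundle.md"] = "\n".join(parts)
    return bundles
```

Returning an empty mapping for a config without categories keeps the caller simple: there is nothing to write, so single-product projects produce no bundle files.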

  • UI/UX changes: wired up footer and AI Ready Docs page to use these new files.
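
To make "resolved Markdown" from the list above concrete, here is a minimal sketch of the resolution step. It assumes snippets use a `--8<-- "path"` include line and variables use `{{ dotted.name }}` placeholders; neither syntax is confirmed by this description, and the function names are hypothetical.

```python
import re

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Assumed snippet syntax: a line containing only --8<-- "relative/path"
SNIPPET = re.compile(r"""^[ \t]*--8<--\s+['"]([^'"]+)['"][ \t]*$""", re.MULTILINE)
# Assumed variable syntax: {{ dotted.name }}
VARIABLE = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")

def lookup(variables: dict, dotted: str) -> str:
    """Walk a nested dict using a dotted key such as 'deps.sdk.version'."""
    value = variables
    for key in dotted.split("."):
        value = value[key]
    return str(value)

def resolve_markdown(text: str, snippets: dict, variables: dict) -> str:
    """Inline snippets, substitute variables, then strip HTML comments."""
    # Snippets go first so that variables inside snippet bodies are resolved too.
    text = SNIPPET.sub(lambda m: snippets[m.group(1)], text)
    text = VARIABLE.sub(lambda m: lookup(variables, m.group(1)), text)
    return HTML_COMMENT.sub("", text)
```

Unknown snippet paths or variable names raise KeyError in this sketch, which surfaces broken references at generation time instead of leaving placeholders in the output.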

Note: you still generate new LLM files via python3 scripts/generate_llms.py in the terminal. Automation/workflows, and centralization will come in future PR(s).

I know this thing is a monster, so let me know if you want to hop on a call for a walk-through, etc.

github-actions bot commented Sep 8, 2025

🔍 Documentation URL Checker

This PR modifies documentation files in ways that could potentially create broken links.

Renamed/Moved files:

llms-files/llms-basics.txt -> .ai/categories/basics.bundle.md
llms-files/llms-dapps.txt -> .ai/categories/dapps.bundle.md
llms-files/llms-infrastructure.txt -> .ai/categories/infrastructure.bundle.md
llms-files/llms-networks.txt -> .ai/categories/networks.bundle.md
llms-files/llms-parachains.txt -> .ai/categories/parachains.bundle.md
llms-files/llms-reference.txt -> .ai/categories/reference.bundle.md
llms-files/llms-smart-contracts.txt -> .ai/categories/smart-contracts.bundle.md
llms-files/llms-tooling.txt -> .ai/categories/tooling.bundle.md

🚨 Please review these changes carefully 🚨

If not handled properly, broken links (404 errors) could appear. To maintain a smooth user experience, consider:

  • Adding redirects in the mkdocs.yml file from the old URLs to the new ones
  • Updating internal references to these files

@dawnkelly09 dawnkelly09 marked this pull request as ready for review September 8, 2025 17:11
Copilot AI review requested due to automatic review settings September 8, 2025 17:11
@dawnkelly09 dawnkelly09 requested a review from a team as a code owner September 8, 2025 17:11

Copilot AI left a comment

Pull Request Overview

This PR overhauls the LLM-friendly file generation system for Polkadot Docs, transforming the output from basic text files to structured Markdown and JSON formats optimized for AI model consumption.

  • Replaces single llms-full.txt file with structured site-index.json and llms-full.jsonl files containing metadata, content chunks, and indexing information
  • Generates resolved Markdown pages in /.ai/ directory with snippets/variables processed and HTML comments stripped for clean AI consumption
  • Implements category-based bundle generation with configurable llms-config.json for flexible content organization

Reviewed Changes

Copilot reviewed 76 out of 188 changed files in this pull request and generated no comments.

File Description
.ai/pages/ New resolved Markdown pages for AI consumption with processed content and metadata
Various .ai/pages files Individual documentation pages converted to clean Markdown format without snippets/variables

@nhussein11 nhussein11 self-requested a review September 9, 2025 12:58

@nhussein11 nhussein11 left a comment

Hi @dawnkelly09, just checking in: can all the content under /.ai/ be easily regenerated if needed? I'm wondering because we're about to refactor the whole documentation architecture, and most of those pages will change. So, that shouldn't be an issue, right? Once the new architecture is ready, we can just run the LLMs script again, and all the content under /.ai/ will be re-populated.


@eshaben eshaben left a comment

1️⃣ Only reviewed the AI-Ready page.

Can you please create an .ai/README.md file that summarizes what's going on in this directory and goes over the different file types?


@eshaben eshaben left a comment

2️⃣ Reviewed everything except for each script in the scripts directory and the resulting output in the .ai directory.

@dawnkelly09

> Hi @dawnkelly09, just checking in: can all the content under /.ai/ be easily regenerated if needed? I'm wondering because we're about to refactor the whole documentation architecture, and most of those pages will change. So, that shouldn't be an issue, right? Once the new architecture is ready, we can just run the LLMs script again, and all the content under /.ai/ will be re-populated.

Correct, Nico! This process is meant to update the /.ai/ files each time content is updated, so no worries about architecture or content changes, aside from anything we might end up needing to update in the config (like new categories, maybe?). Thank you for asking!

@polkadot-developers polkadot-developers deleted a comment from github-actions bot Sep 17, 2025
@dawnkelly09 dawnkelly09 added B0 - Needs Review Pull request is ready for review A1 - Maintenance Major Pull request contains major updates to an existing page (i.e., adding a new section, reorgs, etc.) labels Sep 17, 2025
@dawnkelly09 dawnkelly09 requested a review from eshaben September 17, 2025 18:27
@polkadot-developers polkadot-developers deleted a comment from github-actions bot Sep 18, 2025
@dawnkelly09 dawnkelly09 requested a review from eshaben September 18, 2025 16:53
@polkadot-developers polkadot-developers deleted a comment from github-actions bot Sep 18, 2025
@polkadot-developers polkadot-developers deleted a comment from github-actions bot Sep 18, 2025

@nhussein11 nhussein11 left a comment

LGTM! One minor comment but after that this should be good to go! Thank youuu!

nhussein11 commented Sep 18, 2025

Oh sorry, one more comment. When I ran the script, it regenerated some LLM files:

modified:   .ai/categories/dapps.md
modified:   .ai/site-index.json

Not sure if that indicates this PR is out of sync or what, but I merged some PRs today, so I'd make sure to merge master into this branch first to ensure we are covering everything :)

@polkadot-developers polkadot-developers deleted a comment from github-actions bot Sep 18, 2025
@dawnkelly09 dawnkelly09 added B1 - Ready to Merge Pull request is ready to be merged and removed B0 - Needs Review Pull request is ready for review labels Sep 18, 2025
@eshaben eshaben merged commit 68f574d into master Sep 18, 2025
11 of 12 checks passed
@eshaben eshaben deleted the dawn/improved-llms branch September 18, 2025 19:35
eshaben added a commit that referenced this pull request Sep 18, 2025