Feature Request: Support Adding Arbitrary URLs (Blog Posts, Articles, Substack, PDFs)
Current Behavior
The context add <url> command currently only supports websites that publish an llms.txt or llms-full.txt file. When given a generic URL (e.g., a blog post, Substack article, or documentation page), it tries well-known paths like /llms.txt and /llms-full.txt, then fails with:
No llms.txt found at <url>. Tried:
- <url>/llms-full.txt
- <url>/llms.txt
This website may not provide an llms.txt file.
This means users cannot embed:
- Individual blog posts or tutorials
- Substack newsletters or articles
- Single documentation pages that don't have an llms.txt index
- PDFs or other documents hosted at a direct URL
- GitHub READMEs or wiki pages
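For reference, the probing described under Current Behavior can be pictured roughly as below. This is only a sketch reconstructed from the error output: the helper names `llmsTxtCandidates` and `notFoundMessage` are hypothetical, and stripping trailing slashes before appending the paths is an assumption, not something confirmed from the source code.

```typescript
// Hypothetical helper names; only the two paths and the message text come from
// the error output quoted above. Stripping trailing slashes is an assumption.
function llmsTxtCandidates(url: string): string[] {
  const base = url.replace(/\/+$/, "");
  // llms-full.txt comes first because the error output lists it first.
  return [`${base}/llms-full.txt`, `${base}/llms.txt`];
}

// Rebuilds the failure message in the same layout as the quoted output.
function notFoundMessage(url: string): string {
  const tried = llmsTxtCandidates(url).map((candidate) => `- ${candidate}`);
  return [
    `No llms.txt found at ${url}. Tried:`,
    ...tried,
    "This website may not provide an llms.txt file.",
  ].join("\n");
}

console.log(notFoundMessage("https://author.substack.com/p/article-slug"));
```

For a single article URL both candidates will almost always 404, which is why every source type in the list above currently ends in this error.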
Desired Behavior
Allow context add <url> (or a new subcommand/flag) to ingest any arbitrary URL and convert it into a documentation package that can be queried via the MCP server.
Supported Source Types
| Source | Example |
| --- | --- |
| Blog posts | https://engineering.example.com/react-server-components |
| Substack articles | https://author.substack.com/p/article-slug |
| GitHub raw files | https://raw.githubusercontent.com/.../README.md |
| PDFs | https://example.com/paper.pdf |
| HTML docs | https://docs.framework.io/guides/routing |
Use Cases
- Research & Learning: A developer finds a great deep-dive article on Substack about React internals and wants their AI to reference it during coding sessions.
- Internal Documentation: Teams have wikis or Confluence pages without llms.txt support.
- Academic Papers: Researchers want to query PDFs locally instead of uploading them to cloud services.
- One-off References: A specific GitHub issue or discussion contains critical context for a project.
Proposed Interface
Option A: Auto-detect in context add
Extend detectSourceType() to recognize when a URL doesn't have llms.txt and treat it as a single-document source:
# Fetches the HTML/Markdown/PDF at the URL directly
context add https://overreacted.io/things-i-dont-know-as-of-2018/
# Same as above, with explicit naming
context add https://overreacted.io/... --name overreacted --pkg-version 2018
Option B: Explicit flag
context add https://author.substack.com/p/slug --from-url
Option C: New subcommand
context fetch https://author.substack.com/p/slug --name my-article
Preference: Option A (auto-detect) keeps the CLI simple and intuitive. If llms.txt is not found and the URL is a specific page (not a domain root), fall back to fetching the page content directly.
Technical Considerations
Content Type Handling
The fetch logic needs to handle different content types:
- text/html → Convert to Markdown (existing html.ts logic?)
- text/markdown, text/plain → Store as-is
- application/pdf → Extract text (would need a new dependency like pdf-parse)
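A minimal sketch of the dispatch this implies, keyed on the response's Content-Type header. The `Handling` labels and the `handlingFor` name are hypothetical, as is the "unsupported" bucket for types not listed above. The actual conversion steps (html.ts, a PDF extractor) are deliberately left out.

```typescript
// Hypothetical labels for the three cases listed above, plus a catch-all.
type Handling = "convert-html" | "store-as-is" | "extract-pdf" | "unsupported";

function handlingFor(contentType: string | null): Handling {
  // Drop parameters such as "; charset=utf-8" and normalise case.
  const [first = ""] = (contentType ?? "").split(";");
  const mime = first.trim().toLowerCase();
  switch (mime) {
    case "text/html":
      return "convert-html";
    case "text/markdown":
    case "text/plain":
      return "store-as-is";
    case "application/pdf":
      return "extract-pdf";
    default:
      // Assumption: anything else is rejected rather than guessed at.
      return "unsupported";
  }
}

console.log(handlingFor("text/html; charset=utf-8")); // convert-html
```

Servers sometimes send a missing or generic header (for example application/octet-stream for a PDF), so a real implementation might also want to look at the URL's file extension. That is not covered here.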
URL vs Website Detection
Currently detectSourceType() returns "website" for all http:// / https:// URLs. We could distinguish:
- Domain root (https://svelte.dev) → Try llms.txt first (current behavior)
- Specific path (https://svelte.dev/blog/...) → Try llms.txt, but if not found, fetch the page directly
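The distinction above could be sketched as follows. `classifyUrl` and `planFor` are hypothetical names, not existing functions in the codebase. The sketch assumes a URL counts as a domain root when its path is empty or only slashes, and that query strings are ignored.

```typescript
type UrlKind = "domain-root" | "specific-path";
type Plan = "use-llms-txt" | "fetch-page-directly" | "fail";

function classifyUrl(url: string): UrlKind {
  const { pathname } = new URL(url);
  // "https://svelte.dev" and "https://svelte.dev/" both parse to pathname "/".
  return pathname.replace(/\/+$/, "") === "" ? "domain-root" : "specific-path";
}

// llmsTxtFound is the outcome of the existing llms.txt probing step.
function planFor(url: string, llmsTxtFound: boolean): Plan {
  if (llmsTxtFound) return "use-llms-txt";
  // Only specific pages fall back to a direct fetch; roots keep today's error.
  return classifyUrl(url) === "specific-path" ? "fetch-page-directly" : "fail";
}

console.log(planFor("https://overreacted.io/things-i-dont-know-as-of-2018/", false));
// fetch-page-directly
```

Keeping the llms.txt probe first means sites that already work today behave exactly as before. The fallback only changes the outcome for URLs that currently fail.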
Code Pointers
- packages/context/src/cli.ts: detectSourceType(), addFromWebsite()
- packages/context/src/llms-txt.ts: fetchLinkedDocs() (could reuse fetch logic)
- packages/context/src/package-builder.ts: buildPackage() (already accepts MarkdownFile[])
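Since buildPackage() already takes a list of Markdown files, a single fetched page could be wrapped as a one-element list. The sketch below is illustrative only: I have not checked the real MarkdownFile type, so `MarkdownFileSketch` with `path` and `content` fields is an assumed stand-in, and `fileNameFromUrl` and `toSingleDocument` are hypothetical helpers.

```typescript
// Assumed shape: the real MarkdownFile type in package-builder.ts may differ.
interface MarkdownFileSketch {
  path: string;
  content: string;
}

// Derives a file name from the last path segment of the URL.
function fileNameFromUrl(url: string): string {
  const segments = new URL(url).pathname.split("/").filter((s) => s.length > 0);
  const last = segments.pop() ?? "index";
  // Drop a trailing extension such as .html or .pdf before adding .md.
  const stem = last.replace(/\.[A-Za-z0-9]+$/, "");
  return `${stem || "index"}.md`;
}

// Wraps already-converted Markdown as a one-file list.
function toSingleDocument(url: string, markdown: string): MarkdownFileSketch[] {
  return [{ path: fileNameFromUrl(url), content: markdown }];
}

const files = toSingleDocument(
  "https://author.substack.com/p/article-slug",
  "# Article title\n\nBody text."
);
console.log(files.map((f) => f.path)); // [ 'article-slug.md' ]
```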
Related Issues
- The earlier llms.txt support issue (closed). This extends that work.
Would the maintainers be open to a PR for this? I'm happy to implement Option A or whichever approach the team prefers.