Feature request: Support adding arbitrary URLs (blog posts, Substack, PDFs, single pages) #72

@shivaram19

Feature Request: Support Adding Arbitrary URLs (Blog Posts, Articles, Substack, PDFs)

Current Behavior

The context add <url> command currently only supports websites that publish an llms.txt or llms-full.txt file. When given a generic URL (e.g., a blog post, Substack article, or documentation page), it tries well-known paths like /llms.txt and /llms-full.txt, then fails with:

No llms.txt found at <url>. Tried:
  - <url>/llms-full.txt
  - <url>/llms.txt

This website may not provide an llms.txt file.

This means users cannot embed:

  • Individual blog posts or tutorials
  • Substack newsletters or articles
  • Single documentation pages that don't have an llms.txt index
  • PDFs or other documents hosted at a direct URL
  • GitHub READMEs or wiki pages
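
As a rough illustration of the lookup described under Current Behavior, the candidate paths can be derived from the input URL as below. This is a sketch only: `llmsTxtCandidates` is a hypothetical name, and the real implementation in the repository may build or order these paths differently. The order here simply mirrors the error message.

```typescript
// Hypothetical helper, not taken from the codebase: lists the well-known
// llms.txt locations that would be tried for a given URL.
function llmsTxtCandidates(rawUrl: string): string[] {
  // Drop trailing slashes so the result never contains "//llms.txt".
  const base = rawUrl.replace(/\/+$/, "");
  // Same order as the error message: llms-full.txt first, then llms.txt.
  return [`${base}/llms-full.txt`, `${base}/llms.txt`];
}

console.log(llmsTxtCandidates("https://svelte.dev/"));
```

For a blog post URL, both candidates point at paths the site almost certainly does not serve, which is why the command fails for single pages.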

Desired Behavior

Allow context add <url> (or a new subcommand/flag) to ingest any arbitrary URL and convert it into a documentation package that can be queried via the MCP server.

Supported Source Types

Source              Example
Blog posts          https://engineering.example.com/react-server-components
Substack articles   https://author.substack.com/p/article-slug
GitHub raw files    https://raw.githubusercontent.com/.../README.md
PDFs                https://example.com/paper.pdf
HTML docs           https://docs.framework.io/guides/routing

Use Cases

  1. Research & Learning: A developer finds a great deep-dive article on Substack about React internals and wants their AI to reference it during coding sessions.
  2. Internal Documentation: Teams have wikis or Confluence pages without llms.txt support.
  3. Academic Papers: Researchers want to query PDFs locally instead of uploading them to cloud services.
  4. One-off References: A specific GitHub issue or discussion contains critical context for a project.

Proposed Interface

Option A: Auto-detect in context add

Extend detectSourceType() to recognize when a URL doesn't have llms.txt and treat it as a single-document source:

# Fetches the HTML/Markdown/PDF at the URL directly
context add https://overreacted.io/things-i-dont-know-as-of-2018/

# Same as above, with explicit naming
context add https://overreacted.io/... --name overreacted --pkg-version 2018

Option B: Explicit flag

context add https://author.substack.com/p/slug --from-url

Option C: New subcommand

context fetch https://author.substack.com/p/slug --name my-article

Preference: Option A (auto-detect) keeps the CLI simple and intuitive. If llms.txt is not found and the URL is a specific page (not a domain root), fall back to fetching the page content directly.

Technical Considerations

Content Type Handling

The fetch logic needs to handle different content types:

  • text/html → Convert to Markdown (existing html.ts logic?)
  • text/markdown, text/plain → Store as-is
  • application/pdf → Extract text (would need a new dependency like pdf-parse)
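
The mapping above could be expressed as a small classifier over the response's Content-Type header. This is a minimal sketch under assumptions: `ContentKind` and `classifyContentType` are hypothetical names, and the actual conversion steps (HTML to Markdown, PDF text extraction) are left out.

```typescript
// Hypothetical names for illustration; nothing here exists in the codebase.
type ContentKind = "html" | "as-is" | "pdf" | "unsupported";

function classifyContentType(header: string | null): ContentKind {
  // Strip parameters such as "; charset=utf-8" and normalize case.
  const raw = (header ?? "").split(";")[0] ?? "";
  const mime = raw.trim().toLowerCase();
  switch (mime) {
    case "text/html":
      return "html"; // would be converted to Markdown
    case "text/markdown":
    case "text/plain":
      return "as-is"; // stored without conversion
    case "application/pdf":
      return "pdf"; // would need text extraction
    default:
      return "unsupported";
  }
}

console.log(classifyContentType("text/html; charset=utf-8"));
```

Returning an explicit "unsupported" kind lets the caller report a clear error instead of storing binary or unknown content as if it were text.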

URL vs Website Detection

Currently detectSourceType() returns "website" for all http:// / https:// URLs. We could distinguish:

  • Domain root (https://svelte.dev) → Try llms.txt first (current behavior)
  • Specific path (https://svelte.dev/blog/...) → Try llms.txt, but if not found, fetch the page directly
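
One way to draw that distinction is to look at the parsed URL's pathname. The sketch below assumes hypothetical names (`FetchStrategy`, `chooseFetchStrategy`) and does not reflect how detectSourceType() is written today; it only shows the decision the two bullets describe.

```typescript
// Hypothetical names for illustration; not part of the existing CLI.
type FetchStrategy = "llms-txt-only" | "llms-txt-then-page";

function chooseFetchStrategy(rawUrl: string): FetchStrategy {
  // The WHATWG URL parser normalizes "https://svelte.dev" to pathname "/".
  const { pathname } = new URL(rawUrl);
  const isDomainRoot = pathname === "/" || pathname === "";
  // Domain root: keep the current llms.txt-only behavior.
  // Specific path: try llms.txt, then fall back to fetching the page itself.
  return isDomainRoot ? "llms-txt-only" : "llms-txt-then-page";
}

console.log(chooseFetchStrategy("https://svelte.dev"));
console.log(chooseFetchStrategy("https://svelte.dev/blog/some-post"));
```

This check ignores query strings, so a page addressed only by query parameters on the root path would be treated as a domain root. Whether that matters is a design question for the maintainers.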

Code Pointers

  • packages/context/src/cli.ts: detectSourceType(), addFromWebsite()
  • packages/context/src/llms-txt.ts: fetchLinkedDocs() — could reuse fetch logic
  • packages/context/src/package-builder.ts: buildPackage() — already accepts MarkdownFile[]

Would the maintainers be open to a PR for this? I'm happy to implement Option A or whichever approach the team prefers.
