Feature request: Support adding arbitrary URLs (blog posts, Substack, PDFs, single pages) #72

# Feature Request: Support Adding Arbitrary URLs (Blog Posts, Articles, Substack, PDFs)

## Current Behavior

The `context add <url>` command currently **only** supports websites that publish an `llms.txt` or `llms-full.txt` file. When given a generic URL (e.g., a blog post, Substack article, or documentation page), it tries the well-known paths `/llms-full.txt` and `/llms.txt`, then fails with:

```
No llms.txt found at <url>. Tried:
  - <url>/llms-full.txt
  - <url>/llms.txt

This website may not provide an llms.txt file.
```

This means users cannot embed:
- Individual blog posts or tutorials
- Substack newsletters or articles
- Single documentation pages that don't have an `llms.txt` index
- PDFs or other documents hosted at a direct URL
- GitHub READMEs or wiki pages

## Desired Behavior

Allow `context add <url>` (or a new subcommand/flag) to ingest **any arbitrary URL** and convert it into a documentation package that can be queried via the MCP server.

### Supported Source Types

| Source | Example |
|--------|---------|
| Blog posts | `https://engineering.example.com/react-server-components` |
| Substack articles | `https://author.substack.com/p/article-slug` |
| GitHub raw files | `https://raw.githubusercontent.com/.../README.md` |
| PDFs | `https://example.com/paper.pdf` |
| HTML docs | `https://docs.framework.io/guides/routing` |

## Use Cases

1. **Research & Learning**: A developer finds a great deep-dive article on Substack about React internals and wants their AI to reference it during coding sessions.
2. **Internal Documentation**: Teams have wikis or Confluence pages without `llms.txt` support.
3. **Academic Papers**: Researchers want to query PDFs locally instead of uploading them to cloud services.
4. **One-off References**: A specific GitHub issue or discussion contains critical context for a project.

## Proposed Interface

### Option A: Auto-detect in `context add`

Extend `detectSourceType()` to recognize when a URL doesn't have `llms.txt` and treat it as a single-document source:

```bash
# Fetches the HTML/Markdown/PDF at the URL directly
context add https://overreacted.io/things-i-dont-know-as-of-2018/

# Same as above, with explicit naming
context add https://overreacted.io/... --name overreacted --pkg-version 2018
```

### Option B: Explicit flag

```bash
context add https://author.substack.com/p/slug --from-url
```

### Option C: New subcommand

```bash
context fetch https://author.substack.com/p/slug --name my-article
```

**Preference**: Option A (auto-detect) keeps the CLI simple and intuitive. If `llms.txt` is not found and the URL is a specific page (not a domain root), fall back to fetching the page content directly.

## Technical Considerations

### Content Type Handling

The fetch logic needs to handle different content types:

- `text/html` → Convert to Markdown (possibly by reusing the existing `html.ts` logic)
- `text/markdown`, `text/plain` → Store as-is
- `application/pdf` → Extract text (would need a new dependency such as `pdf-parse`)
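
The mapping above could be sketched as a small classifier over the response's `Content-Type` header. This is only an illustration: `ContentKind` and `classifyContentType` are hypothetical names, not existing code in the repo, and the actual conversion steps (HTML to Markdown, PDF text extraction) are left out.

```typescript
// Hypothetical helper: neither ContentKind nor classifyContentType exists in the codebase.
type ContentKind = "html" | "text" | "pdf" | "unsupported";

function classifyContentType(header: string | null): ContentKind {
  // Drop parameters such as "; charset=utf-8" and normalize case.
  const [rawMime = ""] = (header ?? "").split(";");
  const mime = rawMime.trim().toLowerCase();

  switch (mime) {
    case "text/html":
      return "html"; // would be converted to Markdown
    case "text/markdown":
    case "text/plain":
      return "text"; // would be stored as-is
    case "application/pdf":
      return "pdf"; // would need text extraction
    default:
      return "unsupported"; // missing or unknown type: caller decides how to fail
  }
}
```

Returning an explicit `"unsupported"` kind, rather than guessing, would let the CLI print a clear error for content types it cannot ingest.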

### URL vs Website Detection

Currently `detectSourceType()` returns `"website"` for all `http://` / `https://` URLs. We could distinguish:

- **Domain root** (`https://svelte.dev`) → Try `llms.txt` first (current behavior)
- **Specific path** (`https://svelte.dev/blog/...`) → Try `llms.txt`, but if not found, fetch the page directly
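
A minimal sketch of that distinction and the fallback order, assuming the two fetch steps are passed in as functions. All names here (`isDomainRoot`, `resolveWebsiteSource`, `PageFetcher`, `ResolvedSource`) are hypothetical and only illustrate the idea; they are not the existing `detectSourceType()` or `addFromWebsite()` implementations.

```typescript
// Hypothetical types and helpers; none of these names exist in the codebase.
// A fetcher resolves to the document body, or null when nothing was found.
type PageFetcher = (url: string) => Promise<string | null>;

interface ResolvedSource {
  kind: "llms-txt" | "single-page";
  body: string;
}

// True for URLs such as https://svelte.dev or https://svelte.dev/
function isDomainRoot(input: string): boolean {
  const { pathname } = new URL(input);
  return pathname === "/" || pathname === "";
}

async function resolveWebsiteSource(
  url: string,
  tryLlmsTxt: PageFetcher,
  fetchPage: PageFetcher,
): Promise<ResolvedSource | null> {
  // Always try llms.txt first, which preserves the current behavior.
  const index = await tryLlmsTxt(url);
  if (index !== null) return { kind: "llms-txt", body: index };

  // A domain root without llms.txt still fails, as it does today.
  if (isDomainRoot(url)) return null;

  // A specific path falls back to fetching the page itself.
  const page = await fetchPage(url);
  return page === null ? null : { kind: "single-page", body: page };
}
```

Injecting the fetchers keeps the decision logic testable without network access; a real implementation would presumably wire in the existing `llms.txt` lookup and a new single-page fetch.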

### Code Pointers

- `packages/context/src/cli.ts`: `detectSourceType()`, `addFromWebsite()`
- `packages/context/src/llms-txt.ts`: `fetchLinkedDocs()` — could reuse fetch logic
- `packages/context/src/package-builder.ts`: `buildPackage()` — already accepts `MarkdownFile[]`

## Related Issues

- #51 — Implemented `llms.txt` support (closed). This extends that work.

---

**Would the maintainers be open to a PR for this?** I'm happy to implement Option A or whichever approach the team prefers.

