Skip to content

iNetworkX/web-reader-mcp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

web-reader-mcp

MCP server that fetches URLs and extracts clean, readable content as Markdown or plain text. Built on the Model Context Protocol so AI assistants (Claude, Cline, OpenCode, etc.) can read web pages.

Features

  • Extracts main article content from any URL using Readability
  • Converts HTML to clean Markdown via Turndown
  • Extracts metadata (title, author, publish date, Open Graph tags)
  • Optionally extracts all links from the page
  • Optional JavaScript rendering via Playwright for SPAs and JS-heavy sites
  • Exposes a single webReader tool over MCP Streamable HTTP transport

Quick Start

bun install
bun run dev

The server starts on http://localhost:3001 (or PORT env var).

Tool: webReader

Parameter Type Default Description
url string (required) URL to fetch
return_format "markdown" | "text" "markdown" Output format
timeout number (1–60) 20 Request timeout in seconds
retain_images boolean true Keep image references in output
with_links_summary boolean false Include a summary of all links
render_js boolean false Use Playwright to render JavaScript

Example Response

{
  "title": "Article Title",
  "url": "https://example.com/article",
  "content": "# Article Title\n\nArticle content in markdown...",
  "excerpt": "A brief summary of the article.",
  "byline": "Author Name",
  "siteName": "Example",
  "publishedTime": "2025-01-15T10:00:00Z",
  "metadata": { "og:description": "..." }
}

MCP Client Configuration

Claude Desktop

{
  "mcpServers": {
    "web-reader": {
      "url": "http://localhost:3001/mcp"
    }
  }
}

Cline (VS Code)

{
  "mcpServers": {
    "web-reader": {
      "url": "http://localhost:3001/mcp",
      "transportType": "streamable-http"
    }
  }
}

JavaScript Rendering (Optional)

By default, pages are fetched with a simple HTTP client (got). For JavaScript-heavy sites (SPAs, React apps), enable Playwright:

bun add playwright
npx playwright install chromium

Then set render_js: true when calling the tool. Note that Playwright requires ~300–500 MB RAM per browser tab.

Scripts

Command Description
bun run dev Development server with hot reload
bun run build Compile TypeScript
bun run start Production server (node dist/index.js)
bun run typecheck Type-check without emitting

Docker

docker compose up --build

Runs on port 3000 with health checks enabled.

Architecture

src/
├── index.ts              # Express server + MCP session management
├── types.ts              # Shared TypeScript interfaces
├── tools/
│   └── web-reader.ts     # Tool registration (Zod schema + handler)
└── lib/
    ├── fetcher.ts        # HTTP fetching (got)
    ├── fetcher-playwright.ts  # JS rendering (Playwright, optional)
    ├── parser.ts         # HTML → structured content (Readability)
    └── converter.ts      # HTML → Markdown (Turndown)

Tech Stack

  • Runtime: Node.js (ES2022)
  • Package Manager: Bun
  • HTTP Server: Express
  • MCP SDK: @modelcontextprotocol/sdk v1
  • HTML Parsing: jsdom + Readability
  • Markdown: Turndown
  • HTTP Client: got
  • JS Rendering: Playwright (optional)

License

MIT

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors