junipr-labs/mcp-web-scraper


MCP Web Scraper Server

Pricing: $3.90 per 1,000 tool calls (Pay-Per-Event), no subscription required.
Protocol: Model Context Protocol (MCP), JSON-RPC 2.0 over SSE or stdio.

Introduction

The Model Context Protocol (MCP) is an open standard for connecting AI assistants to external tools and data sources. Instead of hard-coding a scraping library into your agent, you connect the agent to an MCP server and it gains instant access to web data through a clean, standardized interface.

This actor is an MCP server that runs on Apify's infrastructure and exposes four powerful tools for web data retrieval: scrape_url, extract_content, search_web, and get_links. Any MCP-compatible AI client can connect and start making tool calls immediately — no API keys, no server setup, no infrastructure to maintain.

Compatible with: Claude Desktop, Claude Code, Cursor, Windsurf, VS Code Copilot, LangChain, CrewAI, AutoGPT, the OpenAI Agents SDK, and any client that implements the MCP specification.

Unlike a regular scraper actor that runs once and exits, this actor stays alive in Apify's Standby mode, keeping a persistent connection open for your AI agent to use throughout a conversation or autonomous workflow. The agent calls tools, gets results, and continues — exactly like calling a local function, but backed by Apify's residential proxy network and full Playwright browser fleet.

Available Tools

scrape_url — Full page scraping

Fetches a URL and returns the content in your preferred format. Supports JavaScript rendering for single-page apps, configurable CSS selector removal, content chunking for RAG pipelines, and full page metadata extraction.

{
  "name": "scrape_url",
  "arguments": {
    "url": "https://example.com/article",
    "outputFormat": "markdown",
    "enableChunking": true,
    "chunkSize": 1000
  }
}
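When enableChunking is set, the response includes a chunks array sized for RAG ingestion. The actor's actual chunking algorithm is not documented here; as a rough illustration, this sketch assumes chunkSize counts characters and that chunks prefer to break at whitespace:

```python
# Hypothetical sketch of character-based chunking with whitespace-preferring
# splits. The actor's real algorithm may differ (e.g. token-based sizing).
def chunk_text(text: str, chunk_size: int = 1000) -> list[str]:
    chunks = []
    while text:
        if len(text) <= chunk_size:
            chunks.append(text)
            break
        # Split at the last whitespace before the limit when possible
        split_at = text.rfind(" ", 0, chunk_size)
        if split_at <= 0:
            split_at = chunk_size
        chunks.append(text[:split_at])
        text = text[split_at:].lstrip()
    return chunks
```

Each chunk then arrives ready to embed, without mid-word breaks in the common case.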

extract_content — Targeted extraction

Extract specific elements from a page using CSS selectors or a natural language description. Useful when you need only the product price, the article body, or a table — not the entire page.

{
  "name": "extract_content",
  "arguments": {
    "url": "https://shop.example.com/product",
    "description": "product name and price",
    "outputFormat": "json"
  }
}

search_web — Web search with optional page scraping

Search Google, Bing, or DuckDuckGo and get structured results including titles, URLs, and snippets. Optionally scrape each result page and include its content.

{
  "name": "search_web",
  "arguments": {
    "query": "best JavaScript frameworks 2025",
    "maxResults": 10,
    "searchEngine": "google"
  }
}

get_links — Link extraction and analysis

Extract all links from a page and categorize them as internal, external, image, document, or social. Optionally check the HTTP status of each link for broken link detection.

{
  "name": "get_links",
  "arguments": {
    "url": "https://example.com",
    "type": "external",
    "checkStatus": false
  }
}
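The categorization can be pictured roughly like the Python sketch below. The extension and host lists here are illustrative assumptions, not the actor's actual rules:

```python
from urllib.parse import urlparse

# Illustrative extension/host lists; the actor's real lists may differ.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp"}
DOC_EXTS = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".csv"}
SOCIAL_HOSTS = {"twitter.com", "x.com", "facebook.com", "linkedin.com", "instagram.com"}

def categorize_link(base_url: str, href: str) -> str:
    """Classify a link as image, document, social, internal, or external."""
    base_host = urlparse(base_url).netloc.lower()
    parsed = urlparse(href)
    host = parsed.netloc.lower() or base_host  # relative links stay on-site
    path = parsed.path.lower()
    last_segment = path.rsplit("/", 1)[-1]
    ext = path[path.rfind("."):] if "." in last_segment else ""
    if ext in IMAGE_EXTS:
        return "image"
    if ext in DOC_EXTS:
        return "document"
    if host.removeprefix("www.") in SOCIAL_HOSTS:
        return "social"
    return "internal" if host == base_host else "external"
```

With type set (as in the example above), only links in that category are returned.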

How to Connect

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "junipr-web-scraper": {
      "transport": "sse",
      "url": "https://mcp-web-scraper.apify.actor/sse",
      "headers": { "Authorization": "Bearer YOUR_APIFY_TOKEN" }
    }
  }
}

Claude Code

claude mcp add junipr-web-scraper --transport sse "https://mcp-web-scraper.apify.actor/sse" --header "Authorization: Bearer YOUR_APIFY_TOKEN"

Cursor / Windsurf

In your MCP settings file, add the SSE connection pointing to https://mcp-web-scraper.apify.actor/sse with your Apify token as a Bearer header.

LangChain (Python)

import asyncio

from langchain_mcp_adapters.client import MultiServerMCPClient

async def main():
    client = MultiServerMCPClient({
        "web_scraper": {
            "transport": "sse",
            "url": "https://mcp-web-scraper.apify.actor/sse",
            "headers": {"Authorization": "Bearer YOUR_APIFY_TOKEN"},
        }
    })
    # get_tools() is a coroutine, so it must run inside an event loop
    return await client.get_tools()

tools = asyncio.run(main())

Custom Client (direct SSE)

Connect with an HTTP GET to the SSE endpoint. The server will send an endpoint event with the URL to POST MCP messages to, following the standard MCP SSE transport specification.

GET https://mcp-web-scraper.apify.actor/sse
Authorization: Bearer YOUR_APIFY_TOKEN
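Once the endpoint event arrives, tool calls are ordinary JSON-RPC 2.0 tools/call requests POSTed to that URL. A minimal helper for building the request body (the request id here is arbitrary):

```python
import json

def build_tool_call(request_id: str, tool: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for an MCP endpoint."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })
```

The resulting string is the POST body; the response comes back over the open SSE stream.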

Configuration

The following actor-level inputs control global server behavior. Configure them in the Apify Console before starting the actor.

| Parameter | Default | Description |
| --- | --- | --- |
| transport | "sse" | "sse" for HTTP clients, "stdio" for local testing |
| defaultRenderJs | true | Default JavaScript rendering (agents can override per call) |
| maxConcurrentRequests | 5 | Max simultaneous scraping operations (1–20) |
| defaultProxyGroup | "RESIDENTIAL" | "RESIDENTIAL", "DATACENTER", or "NONE" |
| maxPageSizeBytes | 5242880 | Max page size before truncation (5 MB default) |
| defaultTimeout | 30000 | Per-request timeout in milliseconds |
| enableSearchTool | true | Whether search_web is available to agents |
| rateLimitPerMinute | 60 | Max tool calls per minute (prevents runaway agents) |
| allowedDomains | [] | If non-empty, restrict scraping to these domains only |
| blockedDomains | [...] | Domains agents cannot scrape (SSRF prevention) |

Output Formats

Each tool returns a JSON response wrapped in the MCP protocol envelope:

{
  "jsonrpc": "2.0",
  "id": "req-1",
  "result": {
    "content": [{ "type": "text", "text": "{...tool response JSON...}" }],
    "isError": false
  }
}

The text field contains the tool-specific response. For scrape_url, this includes content.markdown, metadata.title, metadata.wordCount, timing info, and optional chunks for RAG pipelines. For search_web, it includes an array of results with position, title, URL, and snippet.
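A client typically unwraps this envelope in two steps: read result.content[0].text, then parse that string as JSON. A minimal helper:

```python
import json

def parse_tool_result(envelope: dict) -> dict:
    """Unwrap an MCP tool response: the tool payload is JSON serialized
    inside result.content[0].text; isError signals a tool-level failure."""
    result = envelope["result"]
    if result.get("isError"):
        raise RuntimeError(result["content"][0]["text"])
    return json.loads(result["content"][0]["text"])
```

The returned dict then exposes the tool-specific fields described above, such as content.markdown or the search results array.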

Monitor real-time session statistics via the scraper://usage MCP resource, which returns total calls, total errors, and session start time.

Pricing and Usage

This actor uses Pay-Per-Event (PPE) pricing at $3.90 per 1,000 tool calls ($0.0039 per call).

Pricing includes all platform compute costs — no hidden fees.

A billable event is counted when an MCP client sends a valid tool call (scrape_url, extract_content, search_web, or get_links) and the server processes it and returns a response. Protocol handshakes, tool listings, resource reads, and rate-limited rejections are not billed.

| Scenario | Calls | Cost |
| --- | --- | --- |
| Research session (20 searches + 30 scrapes) | 50 | $0.19 |
| Website audit (100 pages + link check) | 101 | $0.39 |
| Daily monitoring (10 URLs, 30 days) | 300 | $1.17 |
| AI agent workflow (200 calls/day, 30 days) | 6,000 | $23.40 |
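These scenario costs follow directly from the per-call rate; a one-line estimator:

```python
def estimate_cost(tool_calls: int, rate_per_1000: float = 3.90) -> float:
    """Estimate Pay-Per-Event cost in USD for a number of billable tool calls."""
    return round(tool_calls * rate_per_1000 / 1000, 2)
```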

Apify Standby mode compute time is billed separately from PPE events (approximately $0.001/min while idle). The actor automatically shuts down after 5 minutes of inactivity to minimize idle costs.

Security and Limitations

SSRF prevention: The server blocks requests to localhost, private IP ranges (10.x, 192.168.x, 172.16–31.x), link-local addresses (169.254.x), and any domain in the blockedDomains configuration. This prevents agents from inadvertently accessing internal network resources.
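Client code that pre-validates URLs can mirror this check with the standard ipaddress module; a minimal sketch (the server's actual blocklist may be broader, and it also blocks hostnames via blockedDomains):

```python
import ipaddress

def is_blocked_ip(ip: str) -> bool:
    """Reject loopback, private (10/8, 172.16/12, 192.168/16),
    and link-local (169.254/16) addresses, as the server does."""
    addr = ipaddress.ip_address(ip)
    return addr.is_loopback or addr.is_private or addr.is_link_local
```

Note that a robust SSRF guard must apply this check after DNS resolution, since a public hostname can resolve to a private address.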

Domain allowlisting: Set allowedDomains to restrict agents to a specific set of domains. Useful in enterprise deployments where agents should only access your own data sources.

Rate limiting: The rateLimitPerMinute setting prevents a runaway agent from making thousands of calls per minute. Rate-limited requests receive a RATE_LIMITED error with a retryAfter field indicating when to retry.
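A per-minute limit like this is commonly implemented as a sliding window over recent call timestamps. A hypothetical sketch (not the actor's actual implementation) that also yields the retryAfter value:

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit_per_minute` calls in any rolling 60 s window.

    check() returns (allowed, retry_after_seconds); the clock is injectable
    so the behavior can be tested deterministically.
    """
    def __init__(self, limit_per_minute: int, clock=time.monotonic):
        self.limit = limit_per_minute
        self.clock = clock
        self.calls = deque()  # timestamps of calls in the current window

    def check(self):
        now = self.clock()
        # Drop timestamps that have aged out of the 60-second window
        while self.calls and now - self.calls[0] >= 60:
            self.calls.popleft()
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return True, 0.0
        # Oldest call determines when a slot frees up
        return False, 60.0 - (now - self.calls[0])
```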

What this actor cannot do:

  • Bypass paywalls or authentication-protected pages
  • Solve CAPTCHAs (returns CAPTCHA_DETECTED error instead)
  • Maintain persistent browser sessions between separate actor runs
  • Guarantee results for sites with aggressive anti-bot measures

FAQ

What is MCP and why should I use this?

MCP (Model Context Protocol) is an open standard published by Anthropic that defines how AI assistants connect to external tools. Instead of writing custom code to give your AI agent access to a scraper, you point it at an MCP server and it can immediately call tools to fetch web data. This actor provides those tools running on Apify's infrastructure — no server to maintain, no proxies to configure, no browser to install.

Which AI tools support MCP?

Claude Desktop, Claude Code, Cursor, Windsurf, VS Code Copilot, LibreChat, and most major AI coding assistants support MCP natively. LangChain, CrewAI, AutoGPT, and the OpenAI Agents SDK support MCP via adapter libraries. The protocol is rapidly becoming the standard for AI tool connectivity in 2025–2026.

How do I connect this to Claude Desktop?

Add an entry to your claude_desktop_config.json file with the SSE transport pointing to https://mcp-web-scraper.apify.actor/sse and your Apify API token as a Bearer header. Restart Claude Desktop. The four tools will appear in the tool picker automatically.

Can I use this with LangChain?

Yes. Use the langchain-mcp-adapters package (Python) or @langchain/mcp-adapters (TypeScript). Point the SSE transport at the actor's endpoint with your Apify token. The tools will be auto-discovered and usable as standard LangChain tools.

How is this different from a regular scraper actor?

A regular scraper actor runs once, processes a list of URLs, saves output to a dataset, and exits. This actor is a persistent MCP server that stays alive and responds to on-demand tool calls from AI agents. The agent decides in real time which URLs to fetch based on its reasoning — it is not a batch job.

What does a "tool call" cost?

Each MCP tool call costs $0.0039 (about four-tenths of a cent). A search_web call with scrapeResults: true that internally scrapes 10 pages still counts as 1 tool call, not 11. Failed calls that return errors are still billed because the server did the work of attempting the request.

Can I restrict which domains agents can scrape?

Yes. Set allowedDomains to a list of domains (e.g., ["docs.mycompany.com", "api.myservice.com"]). Any agent attempt to scrape a domain not on this list will receive a DOMAIN_BLOCKED error. You can also add domains to blockedDomains to supplement the default SSRF blocklist.

Does this support authentication or cookies?

The scrape_url tool does not currently support passing custom cookies or authentication headers per call. For authenticated scraping, consider the RAG Web Content Extractor actor which supports cookie injection. This limitation exists intentionally to prevent misuse of the MCP server's proxy network for credential-based access.
