- **Pricing:** $3.90 per 1,000 tool calls (Pay-Per-Event) — no subscription required
- **Protocol:** Model Context Protocol (MCP) — JSON-RPC 2.0 over SSE or stdio
The Model Context Protocol (MCP) is an open standard for connecting AI assistants to external tools and data sources. Instead of hard-coding a scraping library into your agent, you connect the agent to an MCP server and it gains instant access to web data through a clean, standardized interface.
This actor is an MCP server that runs on Apify's infrastructure and exposes four powerful tools for web data retrieval: scrape_url, extract_content, search_web, and get_links. Any MCP-compatible AI client can connect and start making tool calls immediately — no API keys, no server setup, no infrastructure to maintain.
Compatible with: Claude Desktop, Claude Code, Cursor, Windsurf, VS Code Copilot, LangChain, CrewAI, AutoGPT, the OpenAI Agents SDK, and any client that implements the MCP specification.
Unlike a regular scraper actor that runs once and exits, this actor stays alive in Apify's Standby mode, keeping a persistent connection open for your AI agent to use throughout a conversation or autonomous workflow. The agent calls tools, gets results, and continues — exactly like calling a local function, but backed by Apify's residential proxy network and full Playwright browser fleet.
Fetches a URL and returns the content in your preferred format. Supports JavaScript rendering for single-page apps, configurable CSS selector removal, content chunking for RAG pipelines, and full page metadata extraction.
```json
{
  "name": "scrape_url",
  "arguments": {
    "url": "https://example.com/article",
    "outputFormat": "markdown",
    "enableChunking": true,
    "chunkSize": 1000
  }
}
```

Extract specific elements from a page using CSS selectors or a natural language description. Useful when you need only the product price, the article body, or a table — not the entire page.
```json
{
  "name": "extract_content",
  "arguments": {
    "url": "https://shop.example.com/product",
    "description": "product name and price",
    "outputFormat": "json"
  }
}
```

Search Google, Bing, or DuckDuckGo and get structured results including titles, URLs, and snippets. Optionally scrape each result page and include its content.
```json
{
  "name": "search_web",
  "arguments": {
    "query": "best JavaScript frameworks 2025",
    "maxResults": 10,
    "searchEngine": "google"
  }
}
```

Extract all links from a page and categorize them as internal, external, image, document, or social. Optionally check the HTTP status of each link for broken link detection.
```json
{
  "name": "get_links",
  "arguments": {
    "url": "https://example.com",
    "type": "external",
    "checkStatus": false
  }
}
```

Add to `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS) or `%APPDATA%\Claude\claude_desktop_config.json` (Windows):
```json
{
  "mcpServers": {
    "junipr-web-scraper": {
      "transport": "sse",
      "url": "https://mcp-web-scraper.apify.actor/sse",
      "headers": { "Authorization": "Bearer YOUR_APIFY_TOKEN" }
    }
  }
}
```

```shell
claude mcp add junipr-web-scraper --transport sse "https://mcp-web-scraper.apify.actor/sse"
```

In your MCP settings file, add the SSE connection pointing to `https://mcp-web-scraper.apify.actor/sse` with your Apify token as a Bearer header.
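Most MCP-aware editors use a JSON shape similar to the Claude Desktop entry. An illustrative sketch (the exact file location and schema vary by client and version — check your editor's MCP documentation):

```json
{
  "mcpServers": {
    "junipr-web-scraper": {
      "url": "https://mcp-web-scraper.apify.actor/sse",
      "headers": { "Authorization": "Bearer YOUR_APIFY_TOKEN" }
    }
  }
}
```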
```python
import asyncio
from langchain_mcp_adapters.client import MultiServerMCPClient

async def main():
    client = MultiServerMCPClient({
        "web_scraper": {
            "transport": "sse",
            "url": "https://mcp-web-scraper.apify.actor/sse",
            "headers": {"Authorization": "Bearer YOUR_APIFY_TOKEN"},
        }
    })
    # Auto-discover the four tools as standard LangChain tools.
    tools = await client.get_tools()

asyncio.run(main())
```

Connect with an HTTP GET to the SSE endpoint. The server sends an `endpoint` event with the URL to POST MCP messages to, following the standard MCP SSE transport specification.
```http
GET https://mcp-web-scraper.apify.actor/sse
Authorization: Bearer YOUR_APIFY_TOKEN
```
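Messages POSTed to that endpoint are plain JSON-RPC 2.0 requests using the MCP `tools/call` method. A minimal sketch of constructing such a payload (the field layout follows the MCP specification; the envelope shown matches the tool-call examples above):

```python
import json

def build_tool_call(request_id: str, tool: str, arguments: dict) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

msg = build_tool_call("req-1", "scrape_url",
                      {"url": "https://example.com", "outputFormat": "markdown"})
```

The resulting string is what a raw HTTP client would POST; higher-level MCP SDKs build this envelope for you.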
The following actor-level inputs control global server behavior. Configure them in the Apify Console before starting the actor.
| Parameter | Default | Description |
|---|---|---|
| `transport` | `"sse"` | `"sse"` for HTTP clients, `"stdio"` for local testing |
| `defaultRenderJs` | `true` | Default JavaScript rendering (agents can override per call) |
| `maxConcurrentRequests` | `5` | Max simultaneous scraping operations (1–20) |
| `defaultProxyGroup` | `"RESIDENTIAL"` | `"RESIDENTIAL"`, `"DATACENTER"`, or `"NONE"` |
| `maxPageSizeBytes` | `5242880` | Max page size before truncation (5 MB default) |
| `defaultTimeout` | `30000` | Per-request timeout in milliseconds |
| `enableSearchTool` | `true` | Whether `search_web` is available to agents |
| `rateLimitPerMinute` | `60` | Max tool calls per minute (prevents runaway agents) |
| `allowedDomains` | `[]` | If non-empty, restrict scraping to these domains only |
| `blockedDomains` | `[...]` | Domains agents cannot scrape (SSRF prevention) |
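Putting several of these together, a locked-down enterprise deployment might look like the following (illustrative values, not defaults):

```json
{
  "transport": "sse",
  "defaultProxyGroup": "RESIDENTIAL",
  "maxConcurrentRequests": 3,
  "rateLimitPerMinute": 30,
  "allowedDomains": ["docs.mycompany.com"],
  "enableSearchTool": false
}
```

This restricts agents to a single domain, disables open-web search, and caps throughput well below the platform defaults.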
Each tool returns a JSON response wrapped in the MCP protocol envelope:
```json
{
  "jsonrpc": "2.0",
  "id": "req-1",
  "result": {
    "content": [{ "type": "text", "text": "{...tool response JSON...}" }],
    "isError": false
  }
}
```

The `text` field contains the tool-specific response. For `scrape_url`, this includes `content.markdown`, `metadata.title`, `metadata.wordCount`, timing info, and optional `chunks` for RAG pipelines. For `search_web`, it includes an array of results with position, title, URL, and snippet.
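Because the tool payload arrives as a JSON string inside the envelope's `text` field, clients decode it in two steps. A minimal sketch (the envelope shape matches the example above; the sample payload here is made up for illustration):

```python
import json

def unwrap_tool_result(envelope: dict) -> dict:
    """Extract and decode the tool-specific JSON from an MCP result envelope."""
    result = envelope["result"]
    if result.get("isError"):
        raise RuntimeError("tool call returned an error")
    # The first content item carries the tool response as serialized JSON.
    return json.loads(result["content"][0]["text"])

envelope = {
    "jsonrpc": "2.0",
    "id": "req-1",
    "result": {
        "content": [{"type": "text",
                     "text": "{\"metadata\": {\"title\": \"Example\"}}"}],
        "isError": False,
    },
}
payload = unwrap_tool_result(envelope)
```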
Monitor real-time session statistics via the scraper://usage MCP resource, which returns total calls, total errors, and session start time.
This actor uses Pay-Per-Event (PPE) pricing at $3.90 per 1,000 tool calls ($0.0039 per call).
Pricing includes all platform compute costs — no hidden fees.
A billable event is counted when an MCP client sends a valid tool call (scrape_url, extract_content, search_web, or get_links) and the server processes it and returns a response. Protocol handshakes, tool listings, resource reads, and rate-limited rejections are not billed.
| Scenario | Calls | Cost |
|---|---|---|
| Research session (20 searches + 30 scrapes) | 50 | $0.19 |
| Website audit (100 pages + link check) | 101 | $0.39 |
| Daily monitoring (10 URLs, 30 days) | 300 | $1.17 |
| AI agent workflow (200 calls/day, 30 days) | 6,000 | $23.40 |
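The table values follow directly from the per-call rate; a quick estimator (rounded to whole cents, so small totals may differ from the table by a fraction of a cent):

```python
def estimate_cost(tool_calls: int, rate_per_1000: float = 3.90) -> float:
    """Estimate PPE cost in USD for a given number of billable tool calls."""
    return round(tool_calls * rate_per_1000 / 1000, 2)

monthly = estimate_cost(200 * 30)  # 200 calls/day for 30 days
```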
Apify Standby mode compute time is billed separately from PPE events (approximately $0.001/min while idle). The actor automatically shuts down after 5 minutes of inactivity to minimize idle costs.
SSRF prevention: The server blocks requests to localhost, private IP ranges (10.x, 192.168.x, 172.16–31.x), link-local addresses (169.254.x), and any domain in the blockedDomains configuration. This prevents agents from inadvertently accessing internal network resources.
Domain allowlisting: Set allowedDomains to restrict agents to a specific set of domains. Useful in enterprise deployments where agents should only access your own data sources.
Rate limiting: The rateLimitPerMinute setting prevents a runaway agent from making thousands of calls per minute. Rate-limited requests receive a RATE_LIMITED error with a retryAfter field indicating when to retry.
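A client can honor the `retryAfter` hint with a simple backoff loop. A sketch, where `call_tool` stands in for whatever transport function your MCP client uses (assumed here to return a dict with an optional `error` object shaped like the server's `RATE_LIMITED` response):

```python
import time

def call_with_backoff(call_tool, name: str, args: dict, max_attempts: int = 3):
    """Retry a tool call while the server answers with RATE_LIMITED."""
    for _ in range(max_attempts):
        response = call_tool(name, args)
        error = response.get("error")
        if not error or error.get("code") != "RATE_LIMITED":
            return response
        # Wait for the server-suggested interval before retrying.
        time.sleep(error.get("retryAfter", 60))
    raise RuntimeError("still rate-limited after retries")
```

Remember that rate-limited rejections are not billed, so retrying after the suggested delay costs nothing extra.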
What this actor cannot do:
- Bypass paywalls or authentication-protected pages
- Solve CAPTCHAs (returns a `CAPTCHA_DETECTED` error instead)
- Maintain persistent browser sessions between separate actor runs
- Guarantee results for sites with aggressive anti-bot measures
- RAG Web Content Extractor — Bulk web extraction optimized for RAG pipelines
MCP (Model Context Protocol) is an open standard published by Anthropic that defines how AI assistants connect to external tools. Instead of writing custom code to give your AI agent access to a scraper, you point it at an MCP server and it can immediately call tools to fetch web data. This actor provides those tools running on Apify's infrastructure — no server to maintain, no proxies to configure, no browser to install.
Claude Desktop, Claude Code, Cursor, Windsurf, VS Code Copilot, LibreChat, and most major AI coding assistants support MCP natively. LangChain, CrewAI, AutoGPT, and the OpenAI Agents SDK support MCP via adapter libraries. The protocol is rapidly becoming the standard for AI tool connectivity in 2025–2026.
Add an entry to your claude_desktop_config.json file with the SSE transport pointing to https://mcp-web-scraper.apify.actor/sse and your Apify API token as a Bearer header. Restart Claude Desktop. The four tools will appear in the tool picker automatically.
Yes. Use the langchain-mcp-adapters package (Python) or @langchain/mcp-adapters (TypeScript). Point the SSE transport at the actor's endpoint with your Apify token. The tools will be auto-discovered and usable as standard LangChain tools.
A regular scraper actor runs once, processes a list of URLs, saves output to a dataset, and exits. This actor is a persistent MCP server that stays alive and responds to on-demand tool calls from AI agents. The agent decides in real time which URLs to fetch based on its reasoning — it is not a batch job.
Each MCP tool call costs $0.0039 (about four-tenths of a cent). A search_web call with scrapeResults: true that internally scrapes 10 pages still counts as 1 tool call, not 11. Failed calls that return errors are still billed because the server did the work of attempting the request.
Yes. Set allowedDomains to a list of domains (e.g., ["docs.mycompany.com", "api.myservice.com"]). Any agent attempt to scrape a domain not on this list will receive a DOMAIN_BLOCKED error. You can also add domains to blockedDomains to supplement the default SSRF blocklist.
The scrape_url tool does not currently support passing custom cookies or authentication headers per call. For authenticated scraping, consider the RAG Web Content Extractor actor which supports cookie injection. This limitation exists intentionally to prevent misuse of the MCP server's proxy network for credential-based access.