# ParallelExtractTool

This notebook provides a quick overview for getting started with the Parallel AI [extract tool](/docs/integrations/tools/). For detailed documentation of all ParallelExtractTool features and configurations, head to the [API reference](https://python.langchain.com/api_reference/parallel_web/extract_tool/langchain_parallel_web.extract_tool.ParallelExtractTool.html).

The ParallelExtractTool provides access to Parallel AI's Extract API, which extracts clean, structured content from web pages.

## Overview

### Integration details

| Class | Package | Serializable | JS support |  Package latest |
| :--- | :--- | :---: | :---: | :---: |
| [ParallelExtractTool](https://python.langchain.com/api_reference/parallel_web/extract_tool/langchain_parallel_web.extract_tool.ParallelExtractTool.html) | [langchain-parallel-web](https://python.langchain.com/api_reference/parallel_web/) | ❌ | ❌ |  ![PyPI - Version](https://img.shields.io/pypi/v/langchain-parallel-web?style=flat-square&label=%20) |

### Tool features

- **Clean content extraction**: Extracts main content from web pages, removing ads, navigation, and boilerplate
- **Markdown formatting**: Returns content formatted as clean markdown
- **Batch processing**: Extract from multiple URLs in a single API call
- **Metadata extraction**: Includes title, publish date, and other metadata
- **Content length control**: Configure maximum characters per extraction
- **Error handling**: Gracefully handles failed extractions with detailed error information
- **Async support**: Supports async/await through `ainvoke()` for use in async applications

## Setup

The integration lives in the `langchain-parallel-web` package.

In [None]:
%pip install --quiet -U langchain-parallel-web

### Credentials

Head to [Parallel AI](https://beta.parallel.ai) to sign up and generate an API key. Once you've done this, set the `PARALLEL_AI_API_KEY` environment variable:

In [None]:
import getpass
import os

if not os.environ.get("PARALLEL_AI_API_KEY"):
    os.environ["PARALLEL_AI_API_KEY"] = getpass.getpass("Parallel AI API key:\n")
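In scripts or CI jobs, an interactive `getpass` prompt is usually not practical. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper written for this example and is not part of `langchain-parallel-web`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not provided by the package.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Calling `require_env("PARALLEL_AI_API_KEY")` at startup surfaces a missing key before any extraction is attempted.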

## Instantiation

The following cell instantiates `ParallelExtractTool`. The API key, base URL, and maximum content length can be set explicitly:

In [None]:
from langchain_parallel_web import ParallelExtractTool

# Basic instantiation - API key from environment
tool = ParallelExtractTool()

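# With explicit API key and custom settings.
# A separate name keeps the default `tool` above usable in later cells.
custom_tool = ParallelExtractTool(
    api_key=os.environ["PARALLEL_AI_API_KEY"],  # or pass the key string directly
    base_url="https://api.parallel.ai",  # default value
    max_chars_per_extract=5000,  # Limit content length
)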

## Invocation

### [Invoke directly with args](/docs/concepts/tools/#use-the-tool-directly)

You can invoke the tool with a list of URLs to extract content from:

In [None]:
# Extract from a single URL
result = tool.invoke(
    {"urls": ["https://en.wikipedia.org/wiki/Artificial_intelligence"]}
)

print(f"Extracted {len(result)} result(s)")
print(f"Title: {result[0]['title']}")
print(f"URL: {result[0]['url']}")
print(f"Content length: {len(result[0]['content'])} characters")
print(f"Content preview: {result[0]['content'][:200]}...")

In [None]:
# Extract from multiple URLs
result = tool.invoke(
    {
        "urls": [
            "https://en.wikipedia.org/wiki/Machine_learning",
            "https://en.wikipedia.org/wiki/Deep_learning",
            "https://en.wikipedia.org/wiki/Natural_language_processing",
        ]
    }
)

print(f"Extracted {len(result)} results")
for i, item in enumerate(result, 1):
    print(f"\n{i}. {item['title']}")
    print(f"   URL: {item['url']}")
    print(f"   Content length: {len(item['content'])} characters")

# Example response structure:
# [
#     {
#         "url": "https://example.com/article",
#         "title": "Article Title",
#         "content": "# Article Title\n\nMain content in markdown...",
#         "publish_date": "2024-01-15"  # Optional
#     }
# ]
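Because a multi-URL call returns a flat list, it can be convenient to look results up by URL. The sketch below assumes each item carries a `url` key, as in the example structure above, and that the returned `url` matches the one requested, which the source does not confirm. `index_by_url` is a hypothetical helper.

```python
from typing import Any


def index_by_url(results: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Map each result's URL to its result dict.

    If the same URL appears more than once, the last occurrence wins.
    """
    return {item["url"]: item for item in results}
```

With the `result` from the cell above, `index_by_url(result)["https://en.wikipedia.org/wiki/Deep_learning"]` would return that page's entry, provided the URL is echoed back unchanged.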

### [Invoke with ToolCall](/docs/concepts/tool_calling/#tool-execution)

We can also invoke the tool with a model-generated ToolCall, in which case a ToolMessage will be returned:

In [None]:
# This is usually generated by a model, but we'll create a tool call directly for demo purposes.
model_generated_tool_call = {
    "args": {
        "urls": [
            "https://en.wikipedia.org/wiki/Climate_change",
            "https://en.wikipedia.org/wiki/Renewable_energy",
        ]
    },
    "id": "call_123",
    "name": tool.name,  # "parallel_extract"
    "type": "tool_call",
}

result = tool.invoke(model_generated_tool_call)
print(result)
print(f"Tool name: {tool.name}")  # parallel_extract
print(f"Tool description: {tool.description}")
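For local testing it can help to build ToolCall-shaped dictionaries programmatically rather than by hand. The sketch below mirrors the four keys used in the cell above (`args`, `id`, `name`, `type`); `make_tool_call` is a hypothetical helper, and the generated ID format is an arbitrary choice for illustration.

```python
import uuid
from typing import Any


def make_tool_call(tool_name: str, urls: list[str]) -> dict[str, Any]:
    """Build a dict with the same keys as the hand-written tool call above."""
    return {
        "args": {"urls": list(urls)},
        # Assumption: a short random string is enough as an ID for local tests.
        "id": f"call_{uuid.uuid4().hex[:8]}",
        "name": tool_name,
        "type": "tool_call",
    }
```

The result can be passed to `tool.invoke(...)` in the same way as `model_generated_tool_call`, for example `tool.invoke(make_tool_call(tool.name, ["https://en.wikipedia.org/wiki/Climate_change"]))`.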

### [Async Usage](/docs/concepts/tools/#async)

The tool also supports `ainvoke()`, so extraction can be awaited inside async applications without blocking the event loop:

In [None]:
async def extract_async():
    return await tool.ainvoke(
        {
            "urls": [
                "https://en.wikipedia.org/wiki/Python_(programming_language)",
                "https://en.wikipedia.org/wiki/JavaScript",
            ]
        }
    )


# Run async extraction
result = await extract_async()
print(f"Extracted {len(result)} results asynchronously")
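When there are many URLs, one possible pattern is to split them into batches and await the batches concurrently. The sketch below takes any async callable that accepts `{"urls": [...]}` and returns a list of result dicts, so it can be tried without network access. `chunk`, `extract_in_batches`, and the default batch size of 5 are assumptions made for this example; the source states no batch size or rate limits.

```python
import asyncio
from typing import Any, Awaitable, Callable

Result = dict[str, Any]


def chunk(urls: list[str], size: int) -> list[list[str]]:
    """Split a list of URLs into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [urls[i : i + size] for i in range(0, len(urls), size)]


async def extract_in_batches(
    ainvoke: Callable[[dict[str, Any]], Awaitable[list[Result]]],
    urls: list[str],
    batch_size: int = 5,  # arbitrary illustrative default
) -> list[Result]:
    """Issue one call per batch concurrently and flatten the results."""
    batches = chunk(urls, batch_size)
    # asyncio.gather returns results in the order the awaitables were passed.
    per_batch = await asyncio.gather(*(ainvoke({"urls": b}) for b in batches))
    return [item for batch in per_batch for item in batch]
```

In a notebook this could be called as `await extract_in_batches(tool.ainvoke, urls)`.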

### Content Length Control

You can configure the maximum characters per extraction to control content length:

In [None]:
# Create tool with content length limit
limited_tool = ParallelExtractTool(max_chars_per_extract=2000)

result = limited_tool.invoke(
    {"urls": ["https://en.wikipedia.org/wiki/Quantum_computing"]}
)

print(f"Content length: {len(result[0]['content'])} characters")
print("Content is limited to approximately 2000 characters")
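One way to choose a value for `max_chars_per_extract` is to work backwards from a token budget. The sketch below uses a rough heuristic of about four characters per token, which is an assumption and not a real tokenizer; both function names are hypothetical.

```python
from typing import Any


def estimate_tokens(results: list[dict[str, Any]], chars_per_token: float = 4.0) -> int:
    """Roughly estimate tokens from the total length of all `content` fields."""
    total_chars = sum(len(item.get("content") or "") for item in results)
    return int(total_chars / chars_per_token)


def chars_per_url(token_budget: int, n_urls: int, chars_per_token: float = 4.0) -> int:
    """Split a token budget evenly across URLs and convert it to characters."""
    if n_urls < 1:
        raise ValueError("n_urls must be at least 1")
    return int(token_budget * chars_per_token) // n_urls
```

For example, a 4000-token budget spread over four URLs gives `chars_per_url(4000, 4)`, which could then be passed as `max_chars_per_extract`.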

### Error Handling

URLs that fail to extract do not abort the call. They are included in the results with an `error_type` field, and the error message is placed in `content`:

In [None]:
# Mix of valid and invalid URLs
result = tool.invoke(
    {
        "urls": [
            "https://en.wikipedia.org/wiki/Artificial_intelligence",
            "https://this-domain-does-not-exist-12345.com/",
        ]
    }
)

for item in result:
    if "error_type" in item:
        print(f"Failed: {item['url']}")
        print(f"Error: {item['content']}")
    else:
        print(f"Success: {item['url']}")
        print(f"Extracted {len(item['content'])} characters")
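The same `error_type` check can be wrapped in a small function that separates successful and failed extractions, which keeps downstream code from processing error messages as page content. `split_results` is a hypothetical helper that relies only on the presence of the `error_type` key shown above.

```python
from typing import Any

Result = dict[str, Any]


def split_results(results: list[Result]) -> tuple[list[Result], list[Result]]:
    """Return (successes, failures), treating any item with `error_type` as failed."""
    successes = [r for r in results if "error_type" not in r]
    failures = [r for r in results if "error_type" in r]
    return successes, failures
```

With the `result` from the cell above, `ok, failed = split_results(result)` gives two lists that can be handled separately, for example to retry the URLs in `failed`.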

## Chaining

We can use our tool in a chain by first binding it to a [tool-calling model](/docs/how_to/tool_calling/) and then calling it:

In [None]:
# | output: false
# | echo: false

# !pip install -qU langchain langchain-openai
from langchain.chat_models import init_chat_model

llm = init_chat_model(model="gpt-4o", model_provider="openai")

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableConfig, chain

prompt = ChatPromptTemplate(
    [
        (
            "system",
            "You are a helpful assistant that extracts and summarizes web content.",
        ),
        ("human", "{user_input}"),
        ("placeholder", "{messages}"),
    ]
)

# specifying tool_choice will force the model to call this tool.
llm_with_tools = llm.bind_tools([tool], tool_choice=tool.name)

llm_chain = prompt | llm_with_tools


@chain
def tool_chain(user_input: str, config: RunnableConfig):
    input_ = {"user_input": user_input}
    ai_msg = llm_chain.invoke(input_, config=config)
    tool_msgs = tool.batch(ai_msg.tool_calls, config=config)
    return llm_chain.invoke({**input_, "messages": [ai_msg, *tool_msgs]}, config=config)


tool_chain.invoke(
    "Extract and summarize the content from https://en.wikipedia.org/wiki/Large_language_model"
)

## Best Practices

- **Batch URLs**: Pass several URLs in one call rather than invoking the tool once per URL
- **Set content limits**: Use `max_chars_per_extract` to control response size and token usage
- **Handle errors**: Check for `error_type` in results to identify failed extractions
- **Use async in async applications**: Call `ainvoke()` so extraction does not block the event loop
- **Metadata fields**: Use `publish_date` and other metadata when available for context
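As one illustration of using metadata, results can be ordered by `publish_date` so the most recent pages come first. The sketch assumes dates arrive as ISO `YYYY-MM-DD` strings, as in the example response; items with a missing or unparseable date are placed last. `sort_by_publish_date` is a hypothetical helper.

```python
from datetime import date
from typing import Any, Optional

Result = dict[str, Any]


def _parse_date(item: Result) -> Optional[date]:
    """Parse `publish_date` as an ISO date, returning None if absent or invalid."""
    raw = item.get("publish_date")
    try:
        return date.fromisoformat(raw) if raw else None
    except (TypeError, ValueError):
        return None


def sort_by_publish_date(results: list[Result], newest_first: bool = True) -> list[Result]:
    """Return dated results in date order, followed by undated ones."""
    dated = [r for r in results if _parse_date(r) is not None]
    undated = [r for r in results if _parse_date(r) is None]
    dated.sort(key=_parse_date, reverse=newest_first)
    return dated + undated
```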

## Response Format

The tool returns a list of dictionaries with the following format:

```python
[
    {
        "url": "https://example.com/article",
        "title": "Article Title",
        "content": "# Article Title\n\nMain content formatted as markdown...",
        "publish_date": "2024-01-15"  # Optional, if available
    },
    # For failed extractions:
    {
        "url": "https://failed-site.com",
        "title": None,
        "content": "Error: 404 Not Found",
        "error_type": "http_error"
    }
]
```
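For type checking and basic validation, the format above can be described with a `TypedDict`. This is a sketch inferred from the example only: the class name `ExtractResult` is hypothetical, and treating `url`, `title`, and `content` as always present is an assumption based on both example entries containing them.

```python
from typing import Any, Optional, TypedDict


class ExtractResult(TypedDict, total=False):
    """Shape of one result item, inferred from the example above."""

    url: str
    title: Optional[str]  # None in the failed-extraction example
    content: str  # markdown on success, an error message on failure
    publish_date: str  # optional, present when available
    error_type: str  # present only for failed extractions


# Assumption: these keys appear in both successful and failed results.
REQUIRED_KEYS = {"url", "title", "content"}


def missing_keys(item: dict[str, Any]) -> set[str]:
    """Return the assumed-required keys that an item lacks."""
    return REQUIRED_KEYS - set(item)
```

An empty set from `missing_keys(item)` means the item has the keys this sketch expects.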

## API reference

For detailed documentation of all ParallelExtractTool features and configurations, head to the [API reference](https://python.langchain.com/api_reference/parallel_web/extract_tool/langchain_parallel_web.extract_tool.ParallelExtractTool.html).