# Web Scraping using Oxylabs Web Scraper API

This notebook shows how to use [Oxylabs Web Scraper API](https://oxylabs.io/products/scraper-api/web-data) with AutoGen agents to scrape data and generate automated reports.

First, you need to have Python installed on your system and access to an LLM provider. For this tutorial, we'll use OpenAI's API and the gpt-4o-mini model. Start by creating a virtual environment:

In [None]:
python -m venv venv
source venv/bin/activate

Install the required dependencies:

In [None]:
pip install aiohttp autogen-agentchat "autogen-ext[openai]"

As an example, let’s use the [Amazon source](https://developers.oxylabs.io/scraper-apis/web-scraper-api/amazon) to search for items on Amazon based on a provided query using your credentials from the Oxylabs dashboard.

Import `aiohttp` components and define an `AmazonScraper` class with a constructor, Web Scraper API endpoint URL, and credentials:

In [None]:
from aiohttp import BasicAuth, ClientSession


class AmazonScraper:
    def __init__(self) -> None:
        self._base_url = "https://realtime.oxylabs.io/v1/queries"
        self._auth = BasicAuth("USERNAME", "PASSWORD")

**NOTE:** Don't forget to replace placeholders with your credentials.
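
Hardcoded placeholders are fine for a quick test, but a common alternative is to read the credentials from environment variables. The sketch below assumes two hypothetical variable names, `OXYLABS_USERNAME` and `OXYLABS_PASSWORD`, which are not defined by Oxylabs or by this tutorial; the `load_credentials` helper is likewise illustrative.

```python
import os


def load_credentials(
    user_var: str = "OXYLABS_USERNAME",  # hypothetical variable name
    password_var: str = "OXYLABS_PASSWORD",  # hypothetical variable name
) -> tuple[str, str]:
    """Reads a username/password pair from environment variables."""
    try:
        return os.environ[user_var], os.environ[password_var]
    except KeyError as error:
        # Report which variable is missing instead of a bare KeyError.
        raise RuntimeError(f"Missing environment variable: {error.args[0]}") from error
```

The returned pair could then be passed to the constructor as `BasicAuth(*load_credentials())`.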

Define an asynchronous `get_amazon_search_data` method with a query parameter. Use type hints (`str → list[dict]`) for better readability.

In [None]:
async def get_amazon_search_data(self, query: str) -> list[dict]:
    """Gets search data for provided query from Amazon."""

Next, define your API payload together with the API call:

In [None]:
print(f"Fetching data for query: {query}")
payload = {
    "source": "amazon_search",
    "domain": "com",
    "query": query,
    "start_page": 1,
    "pages": 1,
    "parse": True,
}
session = ClientSession()

try:
    response = await session.post(
        self._base_url,
        auth=self._auth,
        json=payload,
    )
    response.raise_for_status()
    data = await response.json()
finally:
    await session.close()

The `auth=self._auth` argument authenticates the request against Web Scraper API. The `finally` clause closes the session whether the request succeeds or fails.

Trimming the response makes it easier for the AI to process. Add the following lines at the end of the method to return a list built from the parsed Amazon results:

In [None]:
results = data["results"][0]["content"]["results"]
return [*results.values()]
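
To make the parsing step concrete, here is a small offline sketch. The `sample_response` dictionary is made-up data that only mimics the nesting accessed above; the section names `organic` and `paid` and the item fields are assumptions for illustration, not a guaranteed API schema. It also shows an optional flattening step, in case each section holds its own list of items.

```python
from itertools import chain

# Made-up data mimicking the nesting accessed above; not real API output.
sample_response = {
    "results": [
        {
            "content": {
                "results": {
                    "organic": [{"title": "Laptop A", "price": 499.0}],
                    "paid": [{"title": "Laptop B", "price": 899.0}],
                }
            }
        }
    ]
}

sections = sample_response["results"][0]["content"]["results"]

# Same expression as in the method: one entry per section.
per_section = [*sections.values()]

# Optional: merge all sections into a single flat list of item dictionaries.
flat_items = list(chain.from_iterable(per_section))

print(len(per_section))  # 2
print([item["title"] for item in flat_items])  # ['Laptop A', 'Laptop B']
```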

You now have a class for scraping Amazon search results, which you can later extend to other sources or multiple pages. At this point, the full class should look like this:

In [None]:
from aiohttp import BasicAuth, ClientSession


class AmazonScraper:
    def __init__(self) -> None:
        self._base_url = "https://realtime.oxylabs.io/v1/queries"
        self._auth = BasicAuth("USERNAME", "PASSWORD")

    async def get_amazon_search_data(self, query: str) -> list[dict]:
        """Gets search data for provided query from Amazon."""
        print(f"Fetching data for query: {query}")
        payload = {
            "source": "amazon_search",
            "domain": "com",
            "query": query,
            "start_page": 1,
            "pages": 1,
            "parse": True,
        }
        session = ClientSession()

        try:
            response = await session.post(
                self._base_url,
                auth=self._auth,
                json=payload,
            )
            response.raise_for_status()
            data = await response.json()
        finally:
            # Always close the session, even if the request fails.
            await session.close()

        results = data["results"][0]["content"]["results"]
        return [*results.values()]
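
As one way to extend the class to multiple pages, the payload could be built by a small helper. This is a sketch: `build_search_payload` is a hypothetical function name, and it only reuses the payload keys already shown above.

```python
def build_search_payload(query: str, start_page: int = 1, pages: int = 1) -> dict:
    """Builds an amazon_search payload with the same keys used in the class above."""
    if start_page < 1 or pages < 1:
        raise ValueError("start_page and pages must be positive integers")
    return {
        "source": "amazon_search",
        "domain": "com",
        "query": query,
        "start_page": start_page,
        "pages": pages,
        "parse": True,
    }


payload = build_search_payload("laptop", pages=3)
print(payload["pages"])  # 3
```

With more than one page, the response would likely contain more than one entry in `data["results"]`, so the `[0]` lookup would need to become a loop. Treat that as an assumption and check the API documentation before relying on it.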

Save the class in `scraper.py`, then test it in `main.py` by creating an `AmazonScraper` and awaiting its `get_amazon_search_data` method:

In [None]:
import asyncio
from pprint import pprint
from scraper import AmazonScraper


async def main():
    scraper = AmazonScraper()
    # The method is a coroutine, so it must be awaited to get the results.
    pprint(await scraper.get_amazon_search_data("laptop"))


if __name__ == "__main__":
    asyncio.run(main())

Running `python main.py` prints a list of Amazon laptop results.

Now, create the `AmazonDataSummarizer` class in `summary.py` to implement AutoGen agents:

1. Build AI agents to summarize the scraped data by defining the `AmazonDataSummarizer` class and constructor variables for later use.
2. Use dependency injection to link code parts in a structured way.
3. Import and define the OpenAI client for AutoGen to communicate with OpenAI models.

Use your OpenAI API key here:

In [None]:
from autogen_ext.models.openai import OpenAIChatCompletionClient
from scraper import AmazonScraper


class AmazonDataSummarizer:
    def __init__(self, scraper: AmazonScraper) -> None:
        self._client = OpenAIChatCompletionClient(
            model="gpt-4o-mini",
            api_key="YOUR_API_KEY",
        )
        self._scraper = scraper
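
One practical benefit of injecting the scraper is that an offline stand-in can replace it, for example in tests. The following is a minimal sketch under that assumption: `SearchScraper`, `FakeScraper`, and the simplified `Summarizer` are hypothetical names used only for illustration, and the OpenAI client is left out entirely.

```python
import asyncio
from typing import Protocol


class SearchScraper(Protocol):
    """Anything with a compatible async search method (hypothetical interface)."""

    async def get_amazon_search_data(self, query: str) -> list[dict]: ...


class FakeScraper:
    """Offline stand-in that returns canned data instead of calling the API."""

    async def get_amazon_search_data(self, query: str) -> list[dict]:
        return [{"title": f"Fake {query}", "price": 1.0}]


class Summarizer:
    """Simplified stand-in for AmazonDataSummarizer, without the OpenAI client."""

    def __init__(self, scraper: SearchScraper) -> None:
        self._scraper = scraper  # injected dependency

    async def titles(self, query: str) -> list[str]:
        items = await self._scraper.get_amazon_search_data(query)
        return [item["title"] for item in items]


print(asyncio.run(Summarizer(FakeScraper()).titles("laptop")))  # ['Fake laptop']
```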

Define AI agent names using an Enum class for easier tracking:

In [None]:
from enum import Enum


class AgentName(str, Enum):
    """Enum for AI agent names."""

    PRICE_SUMMARIZER = "Price_Summarizer"
    DEAL_FINDER = "Deal_Finder"
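
Because the enum also inherits from `str`, its members behave like their string values in comparisons and set lookups. The short, self-contained sketch below repeats the enum definition to demonstrate this.

```python
from enum import Enum


class AgentName(str, Enum):
    """Enum for AI agent names."""

    PRICE_SUMMARIZER = "Price_Summarizer"
    DEAL_FINDER = "Deal_Finder"


# Members compare equal to their plain-string values...
print(AgentName.DEAL_FINDER == "Deal_Finder")  # True

# ...so a plain string can be looked up in a set of members.
print("Price_Summarizer" in {AgentName.PRICE_SUMMARIZER, AgentName.DEAL_FINDER})  # True

# A string can also be converted back to its member.
print(AgentName("Deal_Finder") is AgentName.DEAL_FINDER)  # True
```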

Import `AssistantAgent` and define the `_initialize_agents` method, which configures both agents:

In [None]:
# At the top of summary.py:
from autogen_agentchat.agents import AssistantAgent

# Inside AmazonDataSummarizer:
def _initialize_agents(self) -> list[AssistantAgent]:
    """Initializes the agents."""
    price_summarizer_agent = AssistantAgent(
        name=AgentName.PRICE_SUMMARIZER,
        model_client=self._client,
        reflect_on_tool_use=True,
        tools=[self._scraper.get_amazon_search_data],
        system_message="You are an expert in analyzing prices from online shopping data. Summarize the key price statistics, including average, min, max, and any interesting price patterns. Share your summary with the group.",
    )

    deal_finder_agent = AssistantAgent(
        name=AgentName.DEAL_FINDER,
        model_client=self._client,
        tools=[self._scraper.get_amazon_search_data],
        reflect_on_tool_use=True,
        system_message="You are a skilled deal finder in online shopping data. Find the best possible deals based on price, availability, and general value. Share your findings with the group. Respond with 'SUMMARY_COMPLETE' when you've shared your findings.",
    )

    # The order here is the order in which the agents will take turns.
    return [price_summarizer_agent, deal_finder_agent]

`AssistantAgent` needs:
* Agent name
* OpenAI client
* Scraping tools
* System message (clear instructions like ChatGPT prompts)

The system message defines what each agent should do. `SUMMARY_COMPLETE` signals when to stop running agents, while the `reflect_on_tool_use=True` flag makes each agent run another model pass over the tool's output, so its reply is a written response based on the scraped data rather than the raw tool result.

This integrates data sources with AutoGen agents. Define an async function in `tools`, and agents can use it. Prompts handle everything else.
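
Part of why a plain async function is enough: its name, type hints, and docstring can all be inspected at runtime, which gives a framework what it needs to describe the tool to a model. The sketch below uses only the standard library to show that metadata. It illustrates the idea and is not AutoGen's actual implementation; `describe_tool` is a hypothetical helper, and the function body is a stub.

```python
import inspect


async def get_amazon_search_data(query: str) -> list[dict]:
    """Gets search data for provided query from Amazon."""
    return []  # stub: no API call in this sketch


def describe_tool(func) -> dict:
    """Collects the name, docstring, and parameter annotations of a function."""
    signature = inspect.signature(func)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func),
        "parameters": {
            name: parameter.annotation.__name__
            for name, parameter in signature.parameters.items()
        },
        "is_async": inspect.iscoroutinefunction(func),
    }


print(describe_tool(get_amazon_search_data))
```

This is also a reason to keep the type hints and docstring on `get_amazon_search_data` accurate, since they are the most natural description of the tool available to the agent.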

Next, make the agents work together using AutoGen teams, which let agents collaborate on a shared task. `RoundRobinGroupChat` runs the agents sequentially, so they take turns analyzing Amazon data and sharing findings.

Finally, the `SUMMARY_COMPLETE` termination condition tells AutoGen when agents finish and the team should stop. Here's how it should look:

In [None]:
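# At the top of summary.py:
from autogen_agentchat.conditions import TextMentionTermination
from autogen_agentchat.messages import BaseChatMessage
from autogen_agentchat.teams import RoundRobinGroupChat

# Inside AmazonDataSummarizer:
async def generate_summary(self, query: str) -> None:
    """Generates a summary using AI agents based on the given query."""
    agents = self._initialize_agents()

    # Stop the team once a message mentions SUMMARY_COMPLETE.
    text_termination = TextMentionTermination("SUMMARY_COMPLETE")
    team = RoundRobinGroupChat(
        participants=agents,
        termination_condition=text_termination,
    )

    task = f"Search for products for the query {query} and provide a summary in formatted Markdown of your findings."
    messages = []

    # Keep only chat messages written by the two agents.
    async for message in team.run_stream(task=task):
        if isinstance(message, BaseChatMessage) and message.source in {
            AgentName.PRICE_SUMMARIZER,
            AgentName.DEAL_FINDER,
        }:
            messages.append(message.to_text())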

Set up the team with the agents and a termination condition, pass the task to `run_stream`, and collect the agents' messages. This produces a complete price and deal summary in Markdown format.
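
To illustrate the turn-taking and the stop condition without calling any model, here is a plain-Python sketch. It is a conceptual analogy only, not how `RoundRobinGroupChat` is implemented; `run_round_robin`, the `Agent` alias, and the canned replies are all hypothetical.

```python
from itertools import cycle
from typing import Callable

Agent = Callable[[list[str]], str]  # takes the history, returns a reply


def run_round_robin(
    agents: dict[str, Agent], stop_text: str, max_turns: int = 10
) -> list[str]:
    """Lets agents speak in a fixed order until one reply mentions stop_text."""
    history: list[str] = []
    for turn, (name, agent) in enumerate(cycle(agents.items())):
        if turn >= max_turns:
            break  # safety limit so the loop always ends
        reply = agent(history)
        history.append(f"{name}: {reply}")
        if stop_text in reply:
            break  # termination condition met
    return history


agents = {
    "Price_Summarizer": lambda history: "Average price is 700.",
    "Deal_Finder": lambda history: "Best deal found. SUMMARY_COMPLETE",
}
for line in run_round_robin(agents, stop_text="SUMMARY_COMPLETE"):
    print(line)
```

The `max_turns` guard reflects a general concern: a model may never emit the stop phrase. AutoGen provides a `MaxMessageTermination` condition that could serve a similar purpose in the real team.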

Save results to a Markdown file using this method:

In [None]:
def _write_to_md(self, messages: list[str]) -> None:
    """Writes the messages to a Markdown file."""
    # Explicit UTF-8 avoids platform-dependent encoding errors on non-ASCII text.
    with open("summary.md", "w", encoding="utf-8") as f:
        for message in messages:
            f.write(f"{message}\n\n")

Call this at the end of `generate_summary` to save the results. The complete method:

In [None]:
async def generate_summary(self, query: str) -> None:
    """Generates a summary using AI agents based on the given query."""
    agents = self._initialize_agents()

    text_termination = TextMentionTermination("SUMMARY_COMPLETE")
    team = RoundRobinGroupChat(
        participants=agents,
        termination_condition=text_termination,
    )

    task = f"Search for products for the query {query} and provide a summary in formatted Markdown of your findings."
    messages = []

    async for message in team.run_stream(task=task):
        if isinstance(message, BaseChatMessage) and message.source in {
            AgentName.PRICE_SUMMARIZER,
            AgentName.DEAL_FINDER,
        }:
            messages.append(message.to_text())

    # Persist the collected agent messages to summary.md.
    self._write_to_md(messages)

Now you have the complete tool for generating summaries using AutoGen agents with Web Scraper API data. Combine everything in the main file:

In [None]:
import asyncio
from scraper import AmazonScraper
from summary import AmazonDataSummarizer


async def main():
    scraper = AmazonScraper()
    summarizer = AmazonDataSummarizer(scraper=scraper)
    await summarizer.generate_summary(query="laptop")


if __name__ == "__main__":
    asyncio.run(main())

Running `python main.py` creates a `summary.md` file in your directory where you can view results in a Markdown-capable text editor.