# Content Collection Tasks with WebArchiverAgent

### Why would we want this?
`WebArchiverAgent` automatically retrieves and stores online content so that downstream tasks in a larger pipeline can use it later.  
It does this by driving a headless Selenium WebDriver.


## Requirements

AutoGen requires `Python>=3.8`. To run this notebook example, please install:
```bash
pip install "pyautogen[websurfer]"
```

## Ensure that we have the WebDrivers present for Selenium
Following the instructions in the [Selenium Documentation](https://www.selenium.dev/documentation/webdriver/troubleshooting/errors/driver_location/#download-the-driver), 
we first download the web driver for our browser of choice, or for all three: [Edge](https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/?form=MA13LH#downloads), [Firefox](https://github.com/mozilla/geckodriver/releases), [Chrome](https://chromedriver.chromium.org/downloads).
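
Before starting the agent, it can help to confirm that a driver is actually discoverable. The sketch below is not part of AutoGen. It only checks `PATH` for the usual driver executable names (`msedgedriver`, `geckodriver`, `chromedriver`), which assumes the drivers were installed that way. The helper name `find_drivers` is hypothetical.

```python
import shutil

# Usual executable names for each browser's driver (assumed to be on PATH).
DRIVER_EXECUTABLES = {
    "edge": "msedgedriver",
    "firefox": "geckodriver",
    "chrome": "chromedriver",
}


def find_drivers(which=shutil.which):
    """Map each browser name to its driver path, or None if it is not found."""
    return {browser: which(exe) for browser, exe in DRIVER_EXECUTABLES.items()}


if __name__ == "__main__":
    for browser, path in find_drivers().items():
        print(f"{browser:8s} -> {path or 'not found on PATH'}")
```

Any browser reported as found here should be a reasonable value for the `"browser"` setting used later in this notebook.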

In [1]:
# %%capture --no-stderr
import os
import logging
import autogen
from PIL import Image
from IPython.display import display
from autogen.agentchat.contrib.web_archiver_agent import WebArchiverAgent
from autogen.agentchat.user_proxy_agent import UserProxyAgent
from autogen.oai import config_list_from_json
from autogen.browser_utils import display_binary_image
from autogen.browser_utils import get_file_path_from_url

# Get the logger instance for the current module (__name__).
logger = logging.getLogger(__name__)



## Set your API Endpoint

The [`config_list_from_json`](https://microsoft.github.io/autogen/docs/reference/oai/openai_utils#config_list_from_json) function loads a list of configurations from an environment variable or a json file.

It first looks for environment variable "OAI_CONFIG_LIST" which needs to be a valid json string. If that variable is not found, it then looks for a json file named "OAI_CONFIG_LIST". It filters the configs by models (you can filter by other keys as well).

The WebSurferAgent uses a combination of models. GPT-4 and GPT-3.5-turbo-16k are recommended.

Your json config should look something like the following:
```json
[
    {
        "model": "gpt-4",
        "api_key": "<your OpenAI API key here>"
    },
    {
        "model": "gpt-3.5-turbo-16k",
        "api_key": "<your OpenAI API key here>"
    }
]
```

If you open this notebook in Colab, you can upload your files by clicking the file icon on the left panel and then choosing the "upload file" icon.
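
The lookup order described above (environment variable first, then a file of the same name, then filtering) can be sketched with the standard library alone. This is an illustration, not AutoGen's implementation: `load_config_list_sketch` and its `environ` parameter are hypothetical names, and the real `config_list_from_json` may handle more cases.

```python
import json
import os


def load_config_list_sketch(env_or_file, filter_dict=None, environ=None):
    """Illustrative stand-in for the lookup order described above."""
    environ = os.environ if environ is None else environ
    raw = environ.get(env_or_file)
    if raw is not None:
        # 1. The environment variable wins and must hold a JSON string.
        configs = json.loads(raw)
    else:
        # 2. Otherwise fall back to a file with the same name.
        with open(env_or_file, encoding="utf-8") as f:
            configs = json.load(f)
    if filter_dict:
        # 3. Keep configs whose value for every key is in the allowed list.
        configs = [
            c for c in configs
            if all(c.get(key) in allowed for key, allowed in filter_dict.items())
        ]
    return configs


if __name__ == "__main__":
    fake_env = {
        "OAI_CONFIG_LIST": json.dumps(
            [
                {"model": "gpt-4", "api_key": "<key>"},
                {"model": "gpt-3.5-turbo-16k", "api_key": "<key>"},
            ]
        )
    }
    selected = load_config_list_sketch(
        "OAI_CONFIG_LIST", {"model": ["gpt-4"]}, environ=fake_env
    )
    print([c["model"] for c in selected])
```

Matching in this sketch is exact, so a filter of `gpt-3.5-turbo` would not select a `gpt-3.5-turbo-16k` entry. If a filter returns an empty list, check that the model names in the filter and in the JSON agree.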


In [2]:
llm_config = {
    "timeout": 600,
    "cache_seed": 44,  # change the seed for different trials
    "config_list": config_list_from_json(
        "OAI_CONFIG_LIST",
        # filter_dict={"model": ["Sakura-SOLAR-Instruct-f16"]},
        # Other options: "gpt-4", "gpt-4-0613", "gpt-4-32k", "gpt-4-32k-0613", "gpt-4-1106-preview"
        filter_dict={"model": ["gpt-3.5-turbo"]},
    ),
    "temperature": 0,
}

summarizer_llm_config = {
    "timeout": 600,
    "cache_seed": 44,  # change the seed for different trials
    "config_list": config_list_from_json(
        "OAI_CONFIG_LIST",
        # filter_dict={"model": ["Sakura-SOLAR-Instruct-f16"]},
        filter_dict={"model": ["gpt-3.5-turbo"]},
    ),
    "temperature": 0,
}

## Configure Bing

For WebSurferAgent to be reasonably useful, it needs to be able to search the web -- and that means it needs a Bing API key. 
You can read more about how to get an API key on the [Bing Web Search API](https://www.microsoft.com/en-us/bing/apis/bing-web-search-api) page.

Once you have your key, either set it as the `BING_API_KEY` system environment variable, or simply input your key below.

In [3]:
bing_api_key = os.environ.get("BING_API_KEY", "")  # empty string if the variable is not set
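
If you would rather not paste the key into the notebook, one option is to fall back to a hidden prompt when the environment variable is missing. The sketch below is only a suggestion. `resolve_api_key` is a hypothetical helper, and its `environ` and `prompt` parameters exist so it can be exercised without touching the real environment.

```python
import os
from getpass import getpass


def resolve_api_key(env_name, environ=None, prompt=getpass):
    """Return the key from the environment, or ask for it interactively."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_name, "").strip()
    if not key:
        # Fall back to a hidden prompt so the key is not echoed on screen.
        key = prompt(f"Enter {env_name}: ").strip()
    return key
```

Calling `resolve_api_key("BING_API_KEY")` in place of the lookup above would keep the rest of the notebook unchanged.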

### Define our agents

In [4]:
# Specify where our web content will be stored, we'll use this at the end of the notebook
storage_path = "./content"

web_archiver_agent = WebArchiverAgent(
    name="ContentAgent",  # Choose any name you prefer
    system_message="You are a data collection agent specializing in content on the web.",
    max_depth=0,
    llm_config=llm_config,
    max_consecutive_auto_reply=0,
    silent=False,  # *NEW* Set to False to hear the inner conversation
    storage_path=storage_path,  # *NEW* This is where our archived content is stored, defaulting to `./content`
    browser_config={
        "bing_api_key": bing_api_key,
        "type": "selenium",  # *NEW* Here we specify that we intend to use our headless GUI browser. The default setting is "text".
        "browser": "edge",  # *NEW* We'll use the edge browser for these tests.  Choices include 'edge', 'firefox', and 'chrome'
        # "resolution": (1400,900), # *NEW* we specify the browser window size.  The default is (1920,5200)
        "render_text": False,  # *NEW* We still have the option to convert the output to text and render it on the screen
    },
)

# Define the user agent
user_proxy = autogen.agentchat.UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
    default_auto_reply="",
    is_termination_msg=lambda x: True,
    max_consecutive_auto_reply=0,
)

# We register our collection function as the default response
web_archiver_agent.register_reply(user_proxy, web_archiver_agent.collect_content)

### Let's take it for a spin!  
The AutoGen open-source framework has an academic paper on arxiv.org! We'd certainly like to have that in our archives for later retrieval.

In [5]:
link = "https://arxiv.org/abs/2308.08155"

user_proxy.initiate_chat(web_archiver_agent, message=link)

user_proxy (to ContentAgent):

https://arxiv.org/abs/2308.08155

--------------------------------------------------------------------------------
ContentAgent (to user_proxy):

Success: archived the following links in your chosen location ./content/ <-- https://arxiv.org/abs/2308.08155

--------------------------------------------------------------------------------


ChatResult(chat_history=[{'content': 'https://arxiv.org/abs/2308.08155', 'role': 'assistant'}, {'content': 'Success: archived the following links in your chosen location ./content/ <-- https://arxiv.org/abs/2308.08155', 'role': 'user'}], summary='Success: archived the following links in your chosen location ./content/ <-- https://arxiv.org/abs/2308.08155', cost=({'total_cost': 0}, {'total_cost': 0}), human_input=[])

### We'll try another, this time the Examples page from the official AutoGen website

In [6]:
link = "https://microsoft.github.io/autogen/docs/Examples"
user_proxy.initiate_chat(web_archiver_agent, message=link)

user_proxy (to ContentAgent):

https://microsoft.github.io/autogen/docs/Examples

--------------------------------------------------------------------------------
ContentAgent (to Content Classifier):

Title: `Examples | AutoGen`, Data: ```Examples`

--------------------------------------------------------------------------------
Content Classifier (to ContentAgent):

False

--------------------------------------------------------------------------------
ContentAgent (to Content Classifier):

Title: `Examples | AutoGen`, Data: ```Automated Multi Agent Chat`

--------------------------------------------------------------------------------
Content Classifier (to ContentAgent):

False

--------------------------------------------------------------------------------
ContentAgent (to Content Classifier):

Title: `Examples | AutoGen`, Data: ```AutoGen offers conversable agents powered by LLM, tool or human, which can be used to perform t

ChatResult(chat_history=[{'content': 'https://microsoft.github.io/autogen/docs/Examples', 'role': 'assistant'}, {'content': 'Success: archived the following links in your chosen location ./content/ <-- https://microsoft.github.io/autogen/docs/Examples', 'role': 'user'}], summary='Success: archived the following links in your chosen location ./content/ <-- https://microsoft.github.io/autogen/docs/Examples', cost=({'total_cost': 0}, {'total_cost': 0}), human_input=[])

We see a lot of communication taking place when listening to the inner dialog. The agent needs to confirm the relevance of each piece of content so that it's not storing advertisements or content unrelated to the page topic.
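
The pattern visible in the log, where each text fragment is sent together with the page title and a `True`/`False` verdict comes back, can be sketched as a simple filter. This is not the agent's actual code. `filter_relevant` and `toy_classifier` are hypothetical names, and the toy rule (reject fragments shorter than five words) merely stands in for the LLM call.

```python
from typing import Callable, List


def filter_relevant(
    title: str,
    chunks: List[str],
    is_relevant: Callable[[str, str], bool],
) -> List[str]:
    """Keep only the chunks the classifier accepts for this page title."""
    return [chunk for chunk in chunks if is_relevant(title, chunk)]


def toy_classifier(title: str, chunk: str) -> bool:
    """Stand-in for the LLM call: reject very short fragments."""
    return len(chunk.split()) >= 5


if __name__ == "__main__":
    chunks = [
        "Examples",
        "Automated Multi Agent Chat",
        "AutoGen offers conversable agents powered by LLM, tool or human",
    ]
    print(filter_relevant("Examples | AutoGen", chunks, toy_classifier))
```

Passing the classifier in as a callable keeps the filtering loop independent of which model, or which heuristic, makes the decision.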

### We'll collect one more recent and very interesting publication by the good scientists at Microsoft

In [7]:
link = "https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/"
user_proxy.initiate_chat(web_archiver_agent, message=link)

user_proxy (to ContentAgent):

https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/

--------------------------------------------------------------------------------
ContentAgent (to Content Classifier):

Title: `GraphRAG: Unlocking LLM discovery on narrative private data - Microsoft Research`, Data: ```Global`

--------------------------------------------------------------------------------
Content Classifier (to ContentAgent):

False

--------------------------------------------------------------------------------
ContentAgent (to Content Classifier):

Title: `GraphRAG: Unlocking LLM discovery on narrative private data - Microsoft Research`, Data: ```Microsoft Research Blog`

--------------------------------------------------------------------------------
Content Classifier (to ContentAgent):

False

--------------------------------------------------------------------------------
Co

ChatResult(chat_history=[{'content': 'https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/', 'role': 'assistant'}, {'content': 'Success: archived the following links in your chosen location ./content/ <-- https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/', 'role': 'user'}], summary='Success: archived the following links in your chosen location ./content/ <-- https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/', cost=({'total_cost': 0}, {'total_cost': 0}), human_input=[])

### Let's look at what was stored on disk

In [8]:
!ls {storage_path}/microsoft.com/graphrag-unlocking-llm-discovery-on-narrative-private-data/

aiex01_blog_hero_1400x788.png
aiex01_blog_hero_1400x788.txt
amit_emre_podcast_hero_feature_1400x788.jpg
amit_emre_podcast_hero_feature_1400x788.txt
content.txt
emnlp-2023-blogherofeature-1400x788-1.png
emnlp-2023-blogherofeature-1400x788-1.txt
graphrag-blogherofeature-1400x788-1.png
graphrag-blogherofeature-1400x788-1.txt
graphrag-figure3.jpg
graphrag-figure3.txt
graphrag_figure1.png
graphrag_figure1.txt
graphrag_figure2.png
graphrag_figure2.txt
headshot150px.png
headshot150px.txt
index.html
links.txt
metadata.txt
msr-ai-2x.png
newsplitwise-jan-24-blogherofeature-1400x788-1.jpg
newsplitwise-jan-24-blogherofeature-1400x788-1.txt
screenshot.png
sot-blogherofeature-1400x788-1.jpg
sot-blogherofeature-1400x788-1.txt
steven-truitt_360x360.jpg
steven-truitt_360x360.txt
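
The `!ls` call depends on the shell, so a pure-Python summary can be handy on other platforms. The sketch below counts files per extension and lists the images that have a same-named `.txt` file next to them, as most images in the listing above do. `summarize_archive` is a hypothetical helper, and what those `.txt` files contain is not assumed here.

```python
from collections import Counter
from pathlib import Path


def summarize_archive(folder):
    """Count files per extension and list images with a same-named .txt file."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    by_extension = Counter(p.suffix.lower() for p in files)
    text_stems = {p.stem for p in files if p.suffix.lower() == ".txt"}
    image_suffixes = {".png", ".jpg", ".jpeg"}
    paired_images = sorted(
        p.name
        for p in files
        if p.suffix.lower() in image_suffixes and p.stem in text_stems
    )
    return by_extension, paired_images
```

Calling it on the same folder that the `!ls` cell lists returns the extension counts and the paired image names.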


### Just for reference, what did the page look like?

In [None]:
last_page = list(web_archiver_agent.process_history.keys())[-1]

local_path = f"{storage_path}/{get_file_path_from_url(last_page)}"
screenshot_path = os.path.join(local_path, "screenshot.png")
assert os.path.exists(screenshot_path)

# Load the image
image = Image.open(screenshot_path)

# Display the image
display(image)

The bottom of the page appears to be cropped. Using the 'firefox' browser for our agent will trigger the "full page screenshot" function instead.<br>
But not to worry, everything is also stored to disk in its original form, including the source HTML as it was loaded in the desktop browser.
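
The screenshot cell above locates the archive folder with `get_file_path_from_url`. Judging only from the one directory shown earlier (`microsoft.com/graphrag-unlocking-llm-discovery-on-narrative-private-data/`), the mapping looks like "domain without `www.`" followed by the last path segment. The sketch below reproduces that guess with `urllib.parse`. It is an inference from a single example, not the library's implementation, and `guess_archive_subdir` is a hypothetical name.

```python
from urllib.parse import urlparse


def guess_archive_subdir(url):
    """Guess '<domain without www>/<last path segment>' for a URL."""
    parsed = urlparse(url)
    domain = parsed.netloc.lower()
    if domain.startswith("www."):
        domain = domain[len("www."):]
    # Ignore empty segments produced by leading or trailing slashes.
    segments = [s for s in parsed.path.split("/") if s]
    last = segments[-1] if segments else ""
    return f"{domain}/{last}" if last else domain
```

For real lookups, keep using `get_file_path_from_url`, since it is what the agent itself relies on.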

Below we confirm that our AutoGen agent successfully cataloged all of the content into `content.txt`.

In [28]:
with open(f"{local_path}/content.txt") as f:
    content = f.readlines()

# Find the first line containing the search term (None if it is absent)
idx = next((i for i, line in enumerate(content) if "What are the top 5" in line), None)
assert idx is not None, "search term not found in content.txt"
print(f"We located our search term on line {idx} out of a total {len(content)} lines\n")
print("The last 3 lines stored in content were:\n")
for line in content[-3:]:
    print(line)

We located our search term on line 14 out of a total 27 lines

The last 3 lines stored in content were:

In addition to relative comparisons, we also use SelfCheckGPT (opens in new tab) to perform an absolute measurement of faithfulness to help ensure factual, coherent results grounded in the source material. Results show that GraphRAG achieves a similar level of faithfulness to baseline RAG. We are currently developing an evaluation framework to measure performance on the class of problems above.  This will include more robust mechanisms for generating question-answer test sets as well as additional metrics, such as accuracy and context relevance.

By combining LLM-generated knowledge graphs and graph machine learning, GraphRAG enables us to answer important classes of questions that we cannot attempt with baseline RAG alone.  We have seen promising results after applying this technology to a variety of scenarios, including social media, news articles, workplace productivity, and chem

## Thanks for looking at our new WebArchiverAgent
### Stay tuned for more updates from AutoGen!