## Web Voyager

Inspired by the [LangChain](https://github.com/hwchase17/langchain) [Web Voyager](https://github.com/langchain-ai/langgraph/blob/main/examples/web-navigation/web_voyager.ipynb) example, this notebook showcases a [LangGraph](https://github.com/langchain-ai/langgraph/tree/main)-based web navigator powered by [Anthropic's Claude](https://www.anthropic.com/news/claude-3-family).

[WebVoyager](https://arxiv.org/abs/2401.13919) by He et al. is a vision-enabled web-browsing agent capable of controlling the mouse and keyboard.

<video 
    width="1280"
    height="720"
    controls>
    <source src="./video/web_voyager.mp4" type="video/mp4"/>
</video>

It works by viewing an annotated browser screenshot on each turn, then choosing the next step to take. The agent architecture is a basic reasoning and action (ReAct) loop.
The unique aspects of this agent are:
- Its use of [Set-of-Marks](https://som-gpt4v.github.io/)-like image annotations to serve as UI affordances for the agent
- Its application in the browser, using tools to control both the mouse and keyboard
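
To make the Set-of-Marks idea concrete, here is a minimal sketch of how numbered marks could be turned into text that accompanies the screenshot. The `BBox` fields and the `format_descriptions` helper are hypothetical names chosen for illustration; the actual class may structure this differently.

```python
from typing import TypedDict


class BBox(TypedDict):
    """One annotated element. These field names are assumptions."""
    x: float
    y: float
    type: str
    text: str


def format_descriptions(bboxes: list[BBox]) -> str:
    """Render each marked element as a numbered line the model can refer to."""
    lines = ["Valid bounding boxes:"]
    for index, bbox in enumerate(bboxes):
        # The index doubles as the numeric label drawn on the screenshot.
        lines.append(f'{index} (<{bbox["type"]}/>): "{bbox["text"]}"')
    return "\n".join(lines)


# Hypothetical marks, as a page-annotation script might report them.
bboxes: list[BBox] = [
    {"x": 120.0, "y": 48.0, "type": "textarea", "text": "Search"},
    {"x": 300.0, "y": 90.0, "type": "a", "text": "Images"},
]
description = format_descriptions(bboxes)
print(description)
```

Because each element has a stable number, the model can answer with a label such as `0` instead of pixel coordinates, and the tool layer can map that label back to the element's position.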

The overall design looks like the following:

<img src="./img/web-voyager.excalidraw.png" width="50%">

Before we build, let's configure the environment:

In [1]:
import sys
import logging

stream_handler = logging.StreamHandler(sys.stdout)
logger = logging.getLogger()


logger.setLevel(logging.INFO)
logger.addHandler(stream_handler)

In [2]:
import os
from getpass import getpass

def _set_if_undefined(var: str) -> None:
    # Prompt for the value only when the variable is missing or empty
    if not os.environ.get(var):
        os.environ[var] = getpass(f"Please provide your {var}: ")

_set_if_undefined("ANTHROPIC_API_KEY")

#### Install Agent requirements

The only additional requirement is the [Playwright](https://playwright.dev/) browser automation library. Uncomment the lines below to install it:

In [3]:
# %pip install --upgrade --quiet  playwright > /dev/null
# !playwright install

In [4]:
import nest_asyncio

# This is just required for running async playwright in a Jupyter notebook
nest_asyncio.apply()

## How It Works

### Overview:

The system encapsulates its functionality in the [WebVoyager](../../../../../../slangchain/slangchain/graphs/anthropic/web_voyager.py) class, which extends LangChain's [Chain](https://github.com/langchain-ai/langchain/blob/master/libs/langchain/langchain/chains/base.py) base class. This setup lets the system manage a sequence of actions and decisions in an automated web browsing task. The core functionality revolves around navigating web pages, interacting with elements (clicking, typing, scrolling), and deciding what to do next based on the current state of the page and the outcomes of previous actions.

### Key Components and Methods:

- **`mark_page`:** This asynchronous method annotates a web page by marking interactive elements with bounding boxes. It leverages a JavaScript script (`mark_page.js`) to visually identify clickable elements and text fields, facilitating the interaction process. After marking, it captures a screenshot and identifies the bounding boxes for later use.

- **`_create_agent`:** Initializes the agent by setting up a chain of actions that includes marking the page, interpreting the marked elements, sending prompts to the AI model (via `ChatAnthropic`), parsing the output, and deciding on the next action. This method effectively bridges the gap between visual page representation and AI-driven decision-making.

- **`init_workflow_nodes`:** Constructs a state graph representing the workflow of the web navigation task. Nodes in this graph correspond to various actions (e.g., click, type, scroll), and edges define the transition from one action to another based on the agent's decisions. This method is crucial for defining the logical flow of the navigation process.

- **`_acall`:** The primary asynchronous method responsible for executing the web navigation task. It initializes the browser, sets up the page, and iterates through the action chain by following the workflow defined in the state graph. This method captures the dynamic interaction with the web page, including making decisions, performing actions, and handling the outcomes until a final result is achieved.

- **Action Methods (`_click`, `_type_text`, `_scroll`, `_wait`, `_go_back`, `_to_google`):** A collection of asynchronous methods that correspond to specific actions the agent can perform on a web page. These include clicking a specified element, typing text into a field, scrolling the page or an element, waiting for a specified duration, navigating back, and going to the Google homepage. Each method takes the current state as input, performs the action using Playwright commands, and returns a description of the action for logging or decision-making purposes.

- **`_select_tool`:** Determines the next action to take based on the agent's current state and the prediction from the AI model. This routing function is a critical part of the decision-making process, enabling dynamic transitions between actions in the workflow.

- **`_annotate`, `_format_descriptions`, `_parse`:** Utility methods that prepare the agent's state and parse its actions. `_annotate` updates the state with marked page elements and screenshots; `_format_descriptions` generates textual descriptions of the marked elements for the AI model; `_parse` converts the model's output into actionable commands.

- **`_update_scratchpad`:** Updates the agent's memory (scratchpad) with the outcomes of the latest action. This method ensures the AI model has context for its decisions, incorporating the history of actions and their results.

- **`from_llm`:** A class method to instantiate the [WebVoyager](../../../../../../slangchain/slangchain/graphs/anthropic/web_voyager.py) class with a specific language model (`ChatAnthropic`), setting up the necessary configurations for the navigation task.
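
As a rough illustration of what `_parse` and `_select_tool` do, the sketch below parses a model reply into an action and routes on it. The `Action: ...` reply format, the `retry` fallback, and the node names `agent` and `END` are assumptions made for this example and are not taken from the class itself.

```python
ACTION_PREFIX = "Action: "  # assumed reply format, not confirmed by the source


def parse(text: str) -> dict:
    """Turn the final line of a model reply into an action name and arguments."""
    last_line = text.strip().split("\n")[-1]
    if not last_line.startswith(ACTION_PREFIX):
        # Unparseable replies are sent back to the model with an explanation.
        return {"action": "retry", "args": [f"Could not parse output: {text}"]}
    name, _, raw_args = last_line[len(ACTION_PREFIX):].partition(" ")
    args = [part.strip().strip("[]") for part in raw_args.split(";")] if raw_args else []
    return {"action": name.strip(), "args": args}


def select_tool(prediction: dict) -> str:
    """Pick the next node: stop on an answer, re-ask on retry, else run the tool."""
    action = prediction["action"]
    if action.upper().startswith("ANSWER"):
        return "END"
    if action == "retry":
        return "agent"
    return action


typed = parse("Thought: search for the paper.\nAction: Type [7]; WebVoyager paper arxiv")
answer = parse("Action: ANSWER; WebVoyager is a web agent.")
unparsed = parse("I am not sure what to do.")

print(typed, "->", select_tool(typed))
print(answer["action"], "->", select_tool(answer))
print(unparsed["action"], "->", select_tool(unparsed))
```

Keeping parsing and routing as separate pure functions makes both easy to test without launching a browser.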

### Functionality Flow:

1. **Initialization:** The system is initialized with a language model and configurations.
2. **Workflow Setup:** It constructs a state graph defining the logical flow of actions.
3. **Execution:** Through the `_acall` method, the system iterates over the workflow, performing web interactions based on AI decisions.
4. **Dynamic Decision-Making:** At each step, the system uses the model's prediction to choose the next action, repeating until the model returns a final answer.

This architecture combines asynchronous programming, AI-driven decision-making, and state management to navigate and interact with web pages, with the aim of automating a wide range of online tasks.
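
The flow above can be sketched as a small loop that asks a policy for the next action, records it in a scratchpad, and stops on an answer. This is a simplified stand-in under stated assumptions: `run`, `update_scratchpad`, and the scripted `decide` callable are hypothetical, no browser is involved, and the scratchpad layout only imitates the `Steps:` lines in the run output later in this notebook.

```python
from typing import Callable

Prediction = dict  # shape assumed here: {"action": str, "args": list[str]}


def update_scratchpad(scratchpad: str, step: int, prediction: Prediction) -> str:
    """Append one numbered step so the next decision can see the history."""
    entry = f"{step}. {prediction['action']}: {prediction['args']}"
    return f"Steps: {entry}" if not scratchpad else f"{scratchpad}\n{entry}"


def run(decide: Callable[[str], Prediction], max_steps: int = 10) -> tuple[str, str]:
    """Loop over decide -> record until an answer arrives or steps run out."""
    scratchpad = ""
    for step in range(1, max_steps + 1):
        prediction = decide(scratchpad)
        scratchpad = update_scratchpad(scratchpad, step, prediction)
        if prediction["action"].startswith("ANSWER"):
            return prediction["args"][0], scratchpad
        # A real agent would perform the browser action at this point.
    return "", scratchpad


# A scripted policy stands in for the model so the loop runs offline.
script = iter([
    {"action": "Type", "args": ["7", "WebVoyager paper arxiv"]},
    {"action": "Click", "args": ["19"]},
    {"action": "ANSWER;", "args": ["A short summary."]},
])
answer, scratchpad = run(lambda _scratchpad: next(script))
print(scratchpad)
print(answer)
```

The `max_steps` cap is a design choice for the sketch: it guarantees termination even if the policy never produces an answer.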


In [5]:
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model="claude-3-sonnet-20240229", max_tokens=4096)

In [6]:
from slangchain.graphs.anthropic.web_navigation.web_voyager import WebVoyager


web_voyager = WebVoyager.from_llm(
  llm=llm,
)

## Run Web Voyager

Now that we've created the whole agent executor, we can run it on a question! We'll start our browser at "google.com" and then let the agent control the rest.

The cell below runs the agent on a single question and prints the final result. The accumulated steps are logged after each action.

In [7]:
result = await web_voyager.ainvoke(input={"input": "Could you explain the WebVoyager paper (on arxiv)?"})
print(f"{result}")

Steps: 1. Type: ['7', 'WebVoyager paper arxiv']
Steps: 1. Type: ['7', 'WebVoyager paper arxiv']
2. Click: ['19']
Steps: 1. Type: ['7', 'WebVoyager paper arxiv']
2. Click: ['19']
3. ANSWER;: ['WebVoyager is an end-to-end web agent powered by a large multimodal model that can follow user instructions by interacting with real websites. It introduces a new benchmark compiling tasks from 15 popular sites and an automatic evaluation leveraging multimodal understanding. WebVoyager achieves 59.1% task success on the benchmark, surpassing GPT-4 and earlier methods. The automatic metric agrees 85.3% with human judgment, demonstrating reliable assessment of open-ended web agents.']
{'input': 'Could you explain the WebVoyager paper (on arxiv)?', 'output': 'WebVoyager is an end-to-end web agent powered by a large multimodal model that can follow user instructions by interacting with real websites. It introduces a new benchmark compiling tasks from 15 popular sites and an automatic evaluation levera