## Installation

```bash
uv pip install autogen-ext
uv add "autogen-ext[openai]"
```

## VideoSurfer

### Video Analysis
- Extract information from local video files
- Answer questions about video content
- Process both visual and audio elements

### Audio Processing
- Extract audio from videos
- Transcribe audio with timestamps
- Analyze spoken content

### Visual Analysis
- Capture screenshots at specific timestamps
- Save screenshots to disk
- Analyze visual content using AI vision capabilities

**Note**: *`video_surfer` requires `cv2`, `ffmpeg`, `whisper`, `web`*
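
The note above lists what VideoSurfer needs at runtime. A quick pre-flight check, sketched below, reports which pieces are missing before the agent is created. It assumes `cv2` and `whisper` are the import names of the Python packages and that `ffmpeg` is an executable on `PATH`; the helper name `check_video_surfer_deps` is made up for this example, and the `web` entry is left out because it does not map to an obvious single module or binary.

```python
import importlib.util
import shutil


def check_video_surfer_deps() -> dict[str, bool]:
    """Report which VideoSurfer prerequisites are visible from this interpreter."""
    return {
        # Python modules: located without actually importing them.
        "cv2": importlib.util.find_spec("cv2") is not None,
        "whisper": importlib.util.find_spec("whisper") is not None,
        # ffmpeg is a command-line tool, so look for it on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, available in check_video_surfer_deps().items():
        print(f"{name}: {'ok' if available else 'missing'}")
```

Running this before the first cell makes a missing dependency show up as a clear `missing` line rather than as an error in the middle of a tool call.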

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_ext.agents.video_surfer import VideoSurfer
from autogen_agentchat.messages import TextMessage
from autogen_agentchat.agents import UserProxyAgent
from autogen_agentchat.base import Response
from autogen_core import CancellationToken

In [2]:
model_client = OpenAIChatCompletionClient(model="gpt-4o")

In [3]:
video_surfer = VideoSurfer(name="VideoSurfer", model_client=model_client)

In [4]:
cancellation_token = CancellationToken()

In [5]:
video_path = "assets/The two talking cats [z3U0udLH974].f606.mp4"

message = TextMessage(
    content=f"For video at {video_path}: What is the duration of this video?",
    source="user",
)

# Send the message to VideoSurfer
response: Response = await video_surfer.on_messages(
    [message], cancellation_token=cancellation_token
)

response

Response(chat_message=ToolCallSummaryMessage(id='a32ca21e-0ad0-40f3-b681-8c81aaa7f9e0', source='VideoSurfer', models_usage=None, metadata={}, created_at=datetime.datetime(2025, 8, 12, 3, 42, 20, 96108, tzinfo=datetime.timezone.utc), content='The video is 55.80 seconds long.', type='ToolCallSummaryMessage', tool_calls=[FunctionCall(id='call_yzHmggBoGPOh8v1wEGAV8Ye6', arguments='{"video_path":"assets/The two talking cats [z3U0udLH974].f606.mp4"}', name='get_video_length')], results=[FunctionExecutionResult(content='The video is 55.80 seconds long.', name='get_video_length', call_id='call_yzHmggBoGPOh8v1wEGAV8Ye6', is_error=False)]), inner_messages=[ToolCallRequestEvent(id='4fef1dbf-e8e2-4ef5-9f41-2fbdab93558c', source='VideoSurfer', models_usage=RequestUsage(prompt_tokens=644, completion_tokens=33), metadata={}, created_at=datetime.datetime(2025, 8, 12, 3, 42, 20, 91830, tzinfo=datetime.timezone.utc), content=[FunctionCall(id='call_yzHmggBoGPOh8v1wEGAV8Ye6', arguments='{"video_path":"ass
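
In the response above, each `FunctionCall` carries its `arguments` as a JSON-encoded string, and the matching `FunctionExecutionResult` repeats the same call id. The sketch below pairs calls with results by that id. It uses plain dicts as stand-ins for those objects, with values copied from the printed output; real code would read the same fields from `response.chat_message.tool_calls` and `response.chat_message.results`. The helper name `pair_calls_with_results` is hypothetical.

```python
import json

# Stand-ins for the FunctionCall / FunctionExecutionResult objects shown above;
# only the fields that appear in the printed response are modeled here.
tool_calls = [
    {
        "id": "call_yzHmggBoGPOh8v1wEGAV8Ye6",
        "name": "get_video_length",
        "arguments": '{"video_path":"assets/The two talking cats [z3U0udLH974].f606.mp4"}',
    }
]
results = [
    {
        "call_id": "call_yzHmggBoGPOh8v1wEGAV8Ye6",
        "name": "get_video_length",
        "content": "The video is 55.80 seconds long.",
        "is_error": False,
    }
]


def pair_calls_with_results(calls, outcomes):
    """Match each tool call to its result by call id and decode the arguments."""
    by_id = {r["call_id"]: r for r in outcomes}
    paired = []
    for call in calls:
        result = by_id.get(call["id"])
        paired.append(
            {
                "tool": call["name"],
                "arguments": json.loads(call["arguments"]),  # JSON string -> dict
                "output": result["content"] if result else None,
                "failed": result["is_error"] if result else None,
            }
        )
    return paired


summary = pair_calls_with_results(tool_calls, results)
print(summary[0]["tool"], summary[0]["arguments"]["video_path"])
print(summary[0]["output"])
```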

## FileSurfer

### FileSurfer Capabilities
The FileSurfer module offers the following capabilities.

### Core Functionality
FileSurfer is an agent in AutoGen that acts as a local file previewer, allowing you to:

1. **Browse Local Files and Directories**
- Open files and directories using relative or absolute paths
- Navigate through file hierarchies

2. **View File Contents**
- Display file contents in a text-based viewport
- Navigate through large files with pagination

3. **Search Within Files**
- Search for specific text patterns within files
- Navigate between search results

### Available Tools
FileSurfer provides these specific tools:
- `open_path` - Open a file or directory at a specified path
- `page_up` - Scroll viewport up one page
- `page_down` - Scroll viewport down one page
- `find_on_page_ctrl_f` - Search for text within the current file (like Ctrl+F)
- `find_next` - Navigate to the next occurrence of the search term
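
The tools above describe a paged, text-only viewport. The sketch below is a toy model of that idea, not FileSurfer's implementation: the class `TextViewport`, the character-based page size, and the method names (chosen to echo the tool names) are all assumptions made for illustration.

```python
class TextViewport:
    """Toy paged view over a string; pages are fixed-size character slices."""

    def __init__(self, text: str, page_size: int = 1000) -> None:
        self.text = text
        self.page_size = page_size
        self.page = 0  # zero-based index of the current page
        self._query = ""

    @property
    def page_count(self) -> int:
        # Ceiling division, with at least one page for empty text.
        return max(1, -(-len(self.text) // self.page_size))

    def visible(self) -> str:
        start = self.page * self.page_size
        return self.text[start : start + self.page_size]

    def page_down(self) -> None:
        self.page = min(self.page + 1, self.page_count - 1)

    def page_up(self) -> None:
        self.page = max(self.page - 1, 0)

    def find(self, query: str) -> bool:
        """Jump to the first page containing `query` (case-insensitive)."""
        self._query = query.lower()
        return self._jump(start_page=0)

    def find_next(self) -> bool:
        """Jump to the next later page containing the last query."""
        return self._jump(start_page=self.page + 1)

    def _jump(self, start_page: int) -> bool:
        # Limitation of this sketch: a match that straddles two pages is missed.
        if not self._query:
            return False
        for p in range(start_page, self.page_count):
            chunk = self.text[p * self.page_size : (p + 1) * self.page_size]
            if self._query in chunk.lower():
                self.page = p
                return True
        return False


view = TextViewport("intro " * 50 + "Ablations " + "body " * 100 + "Ablations", page_size=200)
print(f"Showing page {view.page + 1} of {view.page_count}.")
view.find("ablations")
print(f"Showing page {view.page + 1} of {view.page_count}.")
```

Keeping the current page as the only navigation state is what lets page movement and search share one viewport: a search simply moves the page index, and paging continues from wherever the search landed.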

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_ext.agents.file_surfer import FileSurfer
from autogen_agentchat.messages import TextMessage
from autogen_agentchat.base import Response
from autogen_core import CancellationToken

In [2]:
# Initialize your model client
model_client = OpenAIChatCompletionClient(model="gpt-4o-mini")

In [3]:
# Create the FileSurfer agent
file_surfer = FileSurfer(
    name="FileSurfer",
    model_client=model_client,
    description="Standalone FileSurfer agent to browse local files",
    base_path="."
)

In [4]:
# Create a cancellation token
token = CancellationToken()

In [5]:
# Open a file

query = TextMessage(content="Open the file 2411.04468v1.md",
                   source="user")

response: Response = await file_surfer.on_messages([query],
                                                  cancellation_token=token)

print("Response after opening file:\n", response.chat_message.content[:1000])

Response after opening file:
 Path: /home/locch/Works/zsrc/2411.04468v1.md
Viewport position: Showing page 1 of 19.
arXiv:2411.04468v1 [cs.AI] 7 Nov 2024

Magentic-One: A Generalist Multi-Agent System
for Solving Complex Tasks

⋆ Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan ⋆
† Eduardo Salinas, Erkang (Eric) Zhu, Friederike Niedtner, Grace Proebsting,
Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang,
Ricky Loynd, Robert West, Victor Dibia †
⋄ Ahmed Awadallah, Ece Kamar, Rafah Hosn, Saleema Amershi ⋄

Microsoft Research AI Frontiers

Figure 1: An illustration of the Magentic-One mutli-agent team completing a complex task
from the GAIA benchmark. Magentic-One’s Orchestrator agent creates a plan, delegates tasks
to other agents, and tracks progress towards the goal, dynamically revising the plan as needed.
The Orchestrator can delegate tasks to a FileSurfer agent to read and handle files, a WebSurfer
agent to operate a web browser, or a Co
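
The cells in this section call `await` at the top level, which Jupyter allows but a plain `.py` script does not. A minimal sketch of the script form follows. `StubAgent` is a hypothetical stand-in so the example runs without a model client; in real code the `file_surfer` agent created above would take its place and would return a `Response` rather than a string.

```python
import asyncio


class StubAgent:
    """Hypothetical stand-in exposing an async on_messages, like the agents here."""

    async def on_messages(self, messages, cancellation_token=None):
        await asyncio.sleep(0)  # yield control, as a real network call would
        return f"received {len(messages)} message(s)"


async def main() -> str:
    agent = StubAgent()
    # Same call shape as in the notebook cells, now inside a coroutine.
    return await agent.on_messages(["Open the file 2411.04468v1.md"])


# asyncio.run starts an event loop, runs main() to completion, and closes the loop.
reply = asyncio.run(main())
print(reply)
```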

In [6]:
# Navigate within the file: page down

query = TextMessage(content="Page down",
                   source="user")

response: Response = await file_surfer.on_messages([query],
                                                  cancellation_token=token)

print("After paging down:\n", response.chat_message.content[:1000])

After paging down:
 Path: /home/locch/Works/zsrc/2411.04468v1.md
Viewport position: Showing page 2 of 19.
Together, Magentic-One’s agents achieve
strong performance on multiple challenging agentic benchmarks. Figure 1 shows an example of
Magentic-One solving one such benchmark task that requires multiple steps and diverse tools.
Key to Magentic-One’s performance is its modular and flexible multi-agent approach [51,
28, 53, 13, 52], implemented via the AutoGen2 framework [60]. The multi-agent paradigm
offers numerous advantages over monolithic single-agent approaches [51, 53, 6, 62], which we
believe makes it poised to become the leading paradigm in agentic development. For example,
encapsulating distinct skills in separate agents simplifies development and facilitates reusability,
akin to object-oriented programming. Magentic-One’s specific design further supports easy
adaptation and extensibility by enabling agents to be added or removed without altering other
agents, or the overa


In [7]:
# Find text within the file: search for "Ablations"

query = TextMessage(content="Find 'Ablations' in this file",
                   source="user")

response: Response = await file_surfer.on_messages([query],
                                                  cancellation_token=token)

print("After searching:\n", response.chat_message.content[:1000])

After searching:
 Path: /home/locch/Works/zsrc/2411.04468v1.md
Viewport position: Showing page 8 of 19.
report results for Magentic-One (GPT-4o, o1) on WebArena since the o1 model refused to complete
26% of WebArena Gitlab tasks, and 12% of Shopping Administration tasks, making a fair comparison impossible.

10

2024. This includes entries that are neither open-source, nor described by technical reports,
making them difficult to independently validate. Finally, we also include human performance
where available.

We use statistical tests to compare the performance of Magentic-One to baselines and say
that two methods are statistically comparable if the difference in their performance is not
statistically significant (α=0.05); details about our statistical methodology can be found in
Appendix A.

Magentic-One (GPT-4o, o1-preview) achieves statistically comparable performance to SOTA
methods on both GAIA and AssistantBench. On WebArena, only the GPT-4o variant was eval-
uated12, an


In [9]:
# Open a file with a different extension (PDF)

query = TextMessage(content="Open the file assets/2411.04468v1.pdf",
                   source="user")

response: Response = await file_surfer.on_messages([query],
                                                  cancellation_token=token)

print("Response after opening file:\n", response.chat_message.content[:1000])

Cannot set gray non-stroke color because /'P1' is an invalid float value


Response after opening file:
 Path: /home/locch/Works/zsrc/assets/2411.04468v1.pdf
Viewport position: Showing page 1 of 19.
arXiv:2411.04468v1 [cs.AI] 7 Nov 2024

Magentic-One: A Generalist Multi-Agent System
for Solving Complex Tasks

⋆ Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan ⋆
† Eduardo Salinas, Erkang (Eric) Zhu, Friederike Niedtner, Grace Proebsting,
Griffin Bassman, Jack Gerrits, Jacob Alber, Peter Chang,
Ricky Loynd, Robert West, Victor Dibia †
⋄ Ahmed Awadallah, Ece Kamar, Rafah Hosn, Saleema Amershi ⋄

Microsoft Research AI Frontiers

Figure 1: An illustration of the Magentic-One mutli-agent team completing a complex task
from the GAIA benchmark. Magentic-One’s Orchestrator agent creates a plan, delegates tasks
to other agents, and tracks progress towards the goal, dynamically revising the plan as needed.
The Orchestrator can delegate tasks to a FileSurfer agent to read and handle files, a WebSurfer
agent to operate a web browser,

## MultimodalWebSurfer

### Core Functionality

#### Web Navigation
- Visit specific URLs directly
- Perform web searches on search engines
- Navigate browser history (back button)
- Scroll through web pages (up/down)

#### Web Interaction
- Click on elements (buttons, links, etc.)
- Type text into form fields
- Hover over elements to reveal hidden content
- Scroll specific elements on a page

#### Content Analysis
- Take screenshots of web pages
- Extract and analyze text content
- Answer questions about page content
- Summarize entire web pages

#### Visual Processing
- Identify interactive elements on a page
- Create annotated screenshots with bounding boxes
- Process visual content using multimodal models
- Optionally use OCR for text extraction

### Advanced Features

#### Browser Control
- Launches and controls a Chromium browser via Playwright  
- Supports headless or visible browser operation  
- Can animate actions for better visibility  
- Handles file downloads  

#### Multimodal Capabilities
- Works with vision-capable models (like GPT-4o)  
- Processes both visual and textual information  
- Creates **"Set-of-Mark"** screenshots with interactive elements highlighted  
- Makes decisions based on visual understanding of web pages  

#### Tool-Based Interaction
Uses a comprehensive set of tools for web interaction:
- `visit_url` – Navigate to specific URLs  
- `web_search` – Perform web searches  
- `click` – Click on elements  
- `type` – Enter text into fields  
- `hover` – Hover over elements  
- `scroll_up` / `scroll_down` – Navigate page content  
- `answer_question` – Answer questions about page content  
- `summarize_page` – Generate page summaries  

#### Debugging & Monitoring
- Optional screenshot saving  
- Debug directory for logs and artifacts  
- Browser session management  

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()

from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_ext.agents.web_surfer import MultimodalWebSurfer
from autogen_agentchat.messages import TextMessage
from autogen_core import CancellationToken

**Note**: *`web_surfer` requires `playwright`*

In [2]:
model_client = OpenAIChatCompletionClient(model="gpt-4o")

In [3]:
surfer = MultimodalWebSurfer(
    name="WebSurfer",
    model_client=model_client,
    headless=True,
    downloads_folder="./downloads",
)

In [4]:
token = CancellationToken()

In [5]:
# Searching
query = "Search the web for 'latest advances in AI by Microsoft'. Summarize key findings."

response = await surfer.on_messages(
    [TextMessage(content=query, source="user")],
    cancellation_token=token
)

print(f"\nQuery: {query}\n---\n{response.chat_message.content}")


Query: Search the web for 'latest advances in AI by Microsoft'. Summarize key findings.
---
['I typed \'latest advances in AI by Microsoft\' into \'0 trên 2000 ký tự\'.\n\nThe web browser is open to the page [latest advances in AI by Microsoft - Tìm kiếm](https://www.bing.com/search?q=latest+advances+in+AI+by+Microsoft&form=QBLH&sp=-1&lq=0&pq=&sc=0-0&qs=n&sk=&cvid=0C9BBDF37EE94F08A7F98696A2D4896B).\nThe viewport shows 23% of the webpage, and is positioned at the top of the page\nThe following text is visible in the viewport:\n\nBỏ qua tới phần Nội dung\nlatest advances in AI by MicrosoftEnglishDi độngTất cả\nTìm kiếm\nTin tức\nHình ảnh\nVideo\nCopilot\nXem thêm\nCông cụ\nVề 7.400.000 kết quảWe launched a new category of Windows PCs designed for \nAI, called Copilot+ PCs; brought \nadvanced reasoning, planning and multimodal capabilities to Copilot; expanded our suite of \nAI tools and models in \nMicrosoft Azure; innovated with new \nAI model categories like Phi-3; and are 

In [8]:
print(response.chat_message.content[0])

I typed 'latest advances in AI by Microsoft' into '0 trên 2000 ký tự'.

The web browser is open to the page [latest advances in AI by Microsoft - Tìm kiếm](https://www.bing.com/search?q=latest+advances+in+AI+by+Microsoft&form=QBLH&sp=-1&lq=0&pq=&sc=0-0&qs=n&sk=&cvid=0C9BBDF37EE94F08A7F98696A2D4896B).
The viewport shows 23% of the webpage, and is positioned at the top of the page
The following text is visible in the viewport:

Bỏ qua tới phần Nội dung
latest advances in AI by MicrosoftEnglishDi độngTất cả
Tìm kiếm
Tin tức
Hình ảnh
Video
Copilot
Xem thêm
Công cụ
Về 7.400.000 kết quảWe launched a new category of Windows PCs designed for 
AI, called Copilot+ PCs; brought 
advanced reasoning, planning and multimodal capabilities to Copilot; expanded our suite of 
AI tools and models in 
Microsoft Azure; innovated with new 
AI model categories like Phi-3; and are exploring novel abilities of science and 
AI through 
Microsoft Research and our 
AI for Good program that benefit huma
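
The search cell prints `content` as a Python list, while the content-analysis cell further down prints a plain string, so code that consumes these responses has to handle both shapes. The helper below is a small sketch of one way to do that. It assumes the list mixes strings with non-text items such as a screenshot, which is inferred from the truncated output rather than confirmed; the name `text_parts` is made up for this example.

```python
from typing import Any, List, Union


def text_parts(content: Union[str, List[Any]]) -> List[str]:
    """Return only the string pieces of a message's content."""
    if isinstance(content, str):
        return [content]
    return [part for part in content if isinstance(part, str)]


# Placeholder object standing in for a non-text item such as an image.
fake_image = object()

print(text_parts("plain text reply"))
print(text_parts(["I typed 'latest advances in AI by Microsoft'.", fake_image]))
```

With this in place, `"\n".join(text_parts(response.chat_message.content))` gives printable text regardless of which message type the agent returned.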

In [9]:
# Navigate
query = "Navigate to the first search result and scroll down to extract the main headlines."

response = await surfer.on_messages(
    [TextMessage(content=query, source="user")],
    cancellation_token=token
)

print(f"\nQuery: {query}\n---\n{response.chat_message.content}")


Query: Navigate to the first search result and scroll down to extract the main headlines.
---
['I clicked \'Understanding AI at Microsoft\'.\n\nThe web browser is open to the page [Understanding AI at Microsoft](https://news.microsoft.com/ai/#:~:text=We%20launched%20a%20new%20category%20of%20Windows%20PCs,our%20AI%20for%20Good%20program%20that%20benefit%20humanity.?msockid=3e62115f8b79603701d207198a58612e).\nThe viewport shows 15% of the webpage, and is positioned at the top of the page\nThe following text is visible in the viewport:\n\nMicrosoft\nSource\nOur Company\nAI\nInnovation\nDigital Transformation\nDiversity & Inclusion\nSustainability\nMore\nAll MicrosoftSearch \nCart\nTop Microsoft AI newsBuilding Microsoft AI responsiblyFAQs AI storiesUnderstanding\nMicrosoft\n AI\nMicrosoft\n AI\n is\n working\n across\n many\n fronts\n to\n make\n the\n promise\n of\n generative\nartificial\n intelligence\n real,\n providing\n resources,\n information\n and\n access\n to\n help\n media\n

In [11]:
print(response.chat_message.content[0])

I clicked 'Understanding AI at Microsoft'.

The web browser is open to the page [Understanding AI at Microsoft](https://news.microsoft.com/ai/#:~:text=We%20launched%20a%20new%20category%20of%20Windows%20PCs,our%20AI%20for%20Good%20program%20that%20benefit%20humanity.?msockid=3e62115f8b79603701d207198a58612e).
The viewport shows 15% of the webpage, and is positioned at the top of the page
The following text is visible in the viewport:

Microsoft
Source
Our Company
AI
Innovation
Digital Transformation
Diversity & Inclusion
Sustainability
More
All MicrosoftSearch 
Cart
Top Microsoft AI newsBuilding Microsoft AI responsiblyFAQs AI storiesUnderstanding
Microsoft
 AI
Microsoft
 AI
 is
 working
 across
 many
 fronts
 to
 make
 the
 promise
 of
 generative
artificial
 intelligence
 real,
 providing
 resources,
 information
 and
 access
 to
 help
 media
outlets,
 journalists
 and
 influencers
 bring
 their
 audiences
 along
 on
 the
 journey.
Top Microsoft AI news Product & platform 
 Responsib

In [12]:
# Content Analysis

query = "Analyze the current page: what are the top three technologies discussed here?"

response = await surfer.on_messages(
    [TextMessage(content=query, source="user")],
    cancellation_token=token
)

print(f"\nQuery: {query}\n---\n{response.chat_message.content}")


Query: Analyze the current page: what are the top three technologies discussed here?
---
The "Understanding AI at Microsoft" webpage details Microsoft's comprehensive efforts to advance AI technology across various sectors, emphasizing responsible development and deployment. It highlights new capabilities in Azure AI Foundry and a range of AI tools and models aimed at enhancing productivity, particularly through the Copilot PCs. Microsoft's AI initiatives focus on democratizing AI benefits, ensuring inclusivity and transparency, and promoting innovation while addressing societal challenges such as accessibility, health, and sustainability. The webpage also underscores Microsoft's significant investments in advanced AI infrastructure and its commitment to global collaborations and partnerships, including its work with OpenAI.

Additionally, Microsoft is dedicated to building AI responsibly, implementing safeguards against potential risks, and fostering transparency and accountability t

## Markitdown

### Installation

```bash
uv add "markitdown[all]"
```

**Note**: *Install system dependencies (Debian/Ubuntu)*

```bash
sudo apt update
sudo apt install ffmpeg
```

In [1]:
# Let's convert a PDF document to Markdown and use an AI agent to summarize the content

In [3]:
#!markitdown assets/preparedness-framework-v2.pdf > preparedness-framework-v2.md

In [4]:
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False) # Set to True to enable plugins
result = md.convert("assets/preparedness-framework-v2.pdf")
print(result.text_content)

Preparedness Framework

Version 2. Last updated: 15th April, 2025

OpenAI’s mission is to ensure that AGI (artificial general intelligence) benefits all of humanity. To pursue
that mission, we are committed to safely developing and deploying highly capable AI systems, which
create significant benefits and also bring new risks. We build for safety at every step and share our
learnings so that society can make well-informed choices to manage new risks from frontier AI.

The Preparedness Framework is OpenAI’s approach to tracking and preparing for frontier capabilities
that create new risks of severe harm.1 We currently focus this work on three areas of frontier capability,
which we call Tracked Categories:

• Biological and Chemical capabilities that, in addition to unlocking discoveries and cures, can also

reduce barriers to creating and using biological or chemical weapons.

• Cybersecurity capabilities that, in addition to helping protect vulnerable systems, can also create

new ris

In [15]:
import os
from dotenv import load_dotenv
load_dotenv()

import tiktoken
from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import TextMessage
from autogen_core import CancellationToken

In [18]:
# Count tokens
encoding = tiktoken.encoding_for_model("gpt-4o-mini")
token_count = len(encoding.encode(result.text_content))
print(f"Prompt token count: {token_count}")

Prompt token count: 13482
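
A token count like the one above can be turned into a rough input-cost estimate once a price is known. The sketch below assumes pricing is quoted per one million input tokens. The price used is a placeholder and not a real rate, so substitute the current figure from the provider's pricing page. The function name `estimate_input_cost` is hypothetical.

```python
def estimate_input_cost(token_count: int, usd_per_million_tokens: float) -> float:
    """Rough cost of sending `token_count` prompt tokens, in USD."""
    if token_count < 0 or usd_per_million_tokens < 0:
        raise ValueError("token_count and price must be non-negative")
    return token_count / 1_000_000 * usd_per_million_tokens


# 13482 is the count printed above; 1.00 is a made-up placeholder price.
cost = estimate_input_cost(13482, usd_per_million_tokens=1.00)
print(f"Estimated input cost: ${cost:.6f}")
```

The estimate covers only the document text; the instruction wrapped around it and the model's reply add tokens of their own.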


In [21]:
assistant = AssistantAgent(
    name="Assistant",
    model_client=OpenAIChatCompletionClient(model="gpt-4o-mini"),
    description="You are a helpful assistant"
)

msg = TextMessage(content=f"Summarize this content {result.text_content} in 30 words",
                    source="user")

res = await assistant.on_messages([msg],cancellation_token=CancellationToken())

print(res.chat_message.content)

OpenAI's Preparedness Framework aims to manage risks from advanced AI capabilities by categorizing threats, measuring capabilities, developing safeguards, and ensuring transparent governance to protect against severe harm. 

TERMINATE


## References

https://openai.com/api/pricing/