What if every website had an intelligent, voice-enabled agent that could actually do things for you, not just answer questions into the void?
Quick Start · Architecture · Demo · Deploy · Challenge
WebClaw is a voice-first AI agent that lives on websites. Unlike traditional chat widgets that serve canned responses, WebClaw can:
- See the page: understands DOM structure, layout, and content in real-time
- Hear the user: captures speech via microphone with real-time streaming
- Speak back: responds with natural voice through Gemini's native audio
- Take actions: clicks buttons, fills forms, navigates pages, highlights elements
- Use knowledge: answers questions using site-specific knowledge bases
It is not a chatbot. It is a companion that operates the website alongside you.
| Current State | With WebClaw |
|---|---|
| Chat widgets serve canned responses | Agent understands context and takes action |
| Users abandon carts, fail forms, can't find features | Agent guides users through workflows in real-time |
| Support staff can't see what users see | Agent sees the exact same page and can operate it |
| Every interaction starts from zero | Personal agent carries context across sites |
| Text-only, turn-based, disconnected | Voice-first, real-time, integrated with the page |
WebClaw uses a Gateway architecture (not peer-to-peer) to provide privacy, security, and scalability. The Gateway brokers context between the site's knowledge base and the executing agent while enforcing asymmetric privacy: site knowledge flows to the agent, but user data stays private.
flowchart TB
subgraph browser ["User's Browser"]
direction LR
embed["Embed Script\n(Site Agent)\nScript Tag"]
ext["Chrome Extension\n(Personal Agent)\nWorks on any site"]
site["Target Website\nDOM + Content"]
end
subgraph gateway ["WebClaw Gateway: Cloud Run"]
direction TB
ws["WebSocket Server\nFastAPI"]
agent["ADK Agent Runtime\nGemini Live API"]
ctx["Context Broker\nKnowledge + Permissions"]
tools["DOM Action Engine\nclick · type · scroll\nnavigate · highlight"]
session["Session Store\nFirestore"]
end
embed <-->|"WebSocket\naudio + text + DOM"| ws
ext <-->|"WebSocket\naudio + text + DOM"| ws
embed <--> site
ext <--> site
ws <--> agent
agent <--> ctx
agent <--> tools
agent <--> session
agent <-->|"Bidirectional\nAudio Streaming"| gemini["Gemini Live API\n(Native Audio)"]
WebClaw operates in two complementary modes, each serving a different use case:
Site Agent mode (embed script): for website owners who want to add an intelligent agent to their site. Zero installation for visitors.
sequenceDiagram
participant Owner as Site Owner
participant Site as Website
participant User as Visitor
participant Embed as WebClaw Embed
participant GW as Gateway
participant Gemini as Gemini Live API
Owner->>Site: Add script tag + site_id
Owner->>GW: Configure knowledge base,<br/>persona, permissions
User->>Site: Visits website
Site->>Embed: Embed script loads (19KB)
Embed->>Embed: Create Shadow DOM overlay
Embed->>Embed: Capture DOM snapshot
User->>Embed: Clicks avatar or speaks
Embed->>GW: WebSocket connect + DOM snapshot
GW->>GW: Load site config + knowledge base
GW->>Gemini: Open bidi audio stream
User->>Embed: "Help me check out"
Embed->>GW: Audio stream (PCM 16kHz)
GW->>Gemini: Forward audio
Gemini->>GW: Audio response + tool calls
GW->>Embed: Audio (PCM 24kHz) + DOM actions
Embed->>Site: Execute: click "Add to Cart"
Embed->>User: Voice: "I've added it to your cart.<br/>Ready to check out?"
Personal Agent mode (Chrome extension): for users who want a personal AI assistant that travels with them across the web.
sequenceDiagram
participant User as User
participant Ext as Chrome Extension
participant Any as Any Website
participant GW as Gateway
participant Gemini as Gemini Live API
User->>Ext: Install extension, configure gateway
Ext->>Ext: Persistent mic permission granted
User->>Any: Navigates to any website
Ext->>Ext: Capture interactive elements
User->>Ext: "Find the return policy"
Ext->>GW: WebSocket + page snapshot
GW->>Gemini: Audio + context
Gemini->>GW: Response + scroll_to action
GW->>Ext: Audio + DOM action
Ext->>Any: Scroll to FAQ section
Ext->>Any: Highlight "Return Policy"
Ext->>User: Voice: "Here it is. Returns<br/>accepted within 30 days."
When a Personal Agent meets a Site Agent, the Gateway enforces asymmetric context sharing. The site's knowledge flows to help the user; the user's personal data never flows to the site.
flowchart LR
subgraph userSide ["User's Context (Private)"]
prefs["Preferences"]
history["Browsing History"]
personal["Personal Data"]
end
subgraph siteSide ["Site's Context (Shared)"]
kb["Knowledge Base"]
faq["FAQs & Docs"]
actions["Allowed Actions"]
persona["Brand Persona"]
end
subgraph agentCtx ["Agent's Working Context"]
merged["Merged Context\n(User prefs + Site KB)"]
end
prefs -->|"Private channel"| merged
history -->|"Private channel"| merged
personal -.->|"BLOCKED"| siteSide
kb -->|"Public channel"| merged
faq -->|"Public channel"| merged
actions -->|"Public channel"| merged
persona -->|"Public channel"| merged
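The asymmetric flow in the diagram above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Context Broker's actual code: the class names, field names, and both functions are hypothetical. Preferences and history reach the agent's working context, the site's published context is merged in, and nothing from the user's side is ever exposed to the site.

```python
from dataclasses import dataclass, field


@dataclass
class UserContext:
    """Private to the user (hypothetical field names)."""
    preferences: dict = field(default_factory=dict)
    browsing_history: list = field(default_factory=list)
    personal_data: dict = field(default_factory=dict)


@dataclass
class SiteContext:
    """Published by the site owner (hypothetical field names)."""
    knowledge_base: str = ""
    faqs: list = field(default_factory=list)
    allowed_actions: list = field(default_factory=list)
    persona: str = ""


def build_agent_context(user: UserContext, site: SiteContext) -> dict:
    """Merge both sides into the agent's working context.

    personal_data is deliberately never copied into the result.
    """
    return {
        "user": {
            "preferences": dict(user.preferences),
            "browsing_history": list(user.browsing_history),
        },
        "site": {
            "knowledge_base": site.knowledge_base,
            "faqs": list(site.faqs),
            "allowed_actions": list(site.allowed_actions),
            "persona": site.persona,
        },
    }


def build_site_view(user: UserContext) -> dict:
    """What the site may learn about the user: nothing."""
    return {}
```

Keeping the two builders separate makes the asymmetry explicit: there is one function that merges context for the agent, and a different (empty) one for anything sent toward the site.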
WebClaw features an animated avatar built with Canvas 2D that provides visual feedback for every agent state:
| State | Visual | Behavior |
|---|---|---|
| Idle | Gentle breathing, occasional blinks, subtle pulse | Agent is available and ready |
| Listening | Attentive eyes, blue glow ring pulsing | Microphone active, processing user speech |
| Speaking | Lip-synced mouth animation, green glow, gentle bounce | Agent is responding with voice |
| Thinking | Spinning arc indicator around head | Processing request, can still listen (barge-in) |
| Acting | Lightning bolt indicator ⚡ | Executing a DOM action on the page |
The avatar uses real audio analysis when connected to the Web Audio API for accurate lip sync, or falls back to simulated mouth movement driven by sinusoidal functions for natural-looking speech animation.
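The two lip-sync paths described above can be sketched as follows. This is a Python model of the idea rather than the Canvas code itself; the function names, the sinusoid frequencies, and the gain value are all assumptions chosen for illustration.

```python
import math


def simulated_mouth_openness(t: float) -> float:
    """Fallback path: blend two sinusoids so the motion looks less periodic.

    t is time in seconds; the result is a mouth openness in [0, 1].
    The 3.0 Hz and 7.5 Hz frequencies are illustrative assumptions.
    """
    slow = math.sin(2 * math.pi * 3.0 * t)
    fast = math.sin(2 * math.pi * 7.5 * t + 1.3)
    value = 0.5 + 0.3 * slow + 0.2 * fast
    return min(1.0, max(0.0, value))


def audio_mouth_openness(samples: list[float], gain: float = 4.0) -> float:
    """Audio-driven path: RMS of float samples in [-1, 1], scaled and clamped.

    The gain is an assumed tuning constant, not a documented value.
    """
    if not samples:
        return 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return min(1.0, rms * gain)
```

A renderer would call one of these once per animation frame and map the returned value to the height of the mouth shape.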
WebClaw's agent can perform 10 categories of DOM operations, each implemented as a Gemini function-calling tool:
| Tool | Description | Example Use Case |
|---|---|---|
| `click_element` | Click buttons, links, tabs, menu items | "Add this to my cart" |
| `type_text` | Type into input fields and textareas | "Fill in my email address" |
| `scroll_to` | Scroll to elements or by pixel amount | "Show me the pricing section" |
| `scroll_to_top` | Scroll to the very top of the page | "Go back to the top" |
| `scroll_to_bottom` | Scroll to the very bottom of the page | "Show me the footer" |
| `navigate_to` | Navigate to URLs within the site | "Go to the contact page" |
| `highlight_element` | Draw attention with glow border + tooltip | "Where is the search bar?" |
| `read_page` | Extract text content from elements | "What does this section say?" |
| `select_option` | Choose from dropdowns and selects | "Select medium size" |
| `check_checkbox` | Toggle checkboxes | "Agree to terms and conditions" |
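Because each tool maps to a kind of DOM operation, a gateway could gate tool calls against the `allowed_actions` and `restricted_actions` fields that a site supplies when it registers. The sketch below shows one way to do that. The tool-to-category mapping and the function name are hypothetical; in particular the `select` and `check` categories do not appear in this README and are invented for the example.

```python
# Hypothetical mapping from tool name to action category.
# The real gateway may group tools differently.
TOOL_CATEGORY = {
    "click_element": "click",
    "type_text": "type",
    "scroll_to": "scroll",
    "scroll_to_top": "scroll",
    "scroll_to_bottom": "scroll",
    "navigate_to": "navigate",
    "highlight_element": "highlight",
    "read_page": "read",
    "select_option": "select",   # assumed category name
    "check_checkbox": "check",   # assumed category name
}


def is_tool_allowed(tool: str, allowed: list[str], restricted: list[str]) -> bool:
    """Return True only if the tool's category is allowed and not restricted.

    Unknown tools are rejected, and a restriction always wins over an allowance.
    """
    category = TOOL_CATEGORY.get(tool)
    if category is None:
        return False
    if category in restricted:
        return False
    return category in allowed
```

With the registration example's settings (`type` restricted), `is_tool_allowed("type_text", ...)` is false while `click_element` is permitted.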
The action engine uses a smart element finder that tries three strategies in order:
- CSS selector: direct DOM query
- ARIA label: accessibility attribute matching
- Text content: fuzzy matching against interactive elements (buttons, links, labels)
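The three-step fallback above can be modelled in Python over a simplified list of elements. This is a sketch under stated assumptions, not the embed script's implementation: elements are plain dicts with hypothetical keys, the "CSS selector" step supports only `#id`, `.class`, and bare tag names, and the fuzzy step uses `difflib.SequenceMatcher` with an assumed threshold of 0.6.

```python
from difflib import SequenceMatcher

# Tags considered interactive for the text-matching step (assumed set).
INTERACTIVE_TAGS = {"button", "a", "label"}


def matches_selector(el: dict, selector: str) -> bool:
    """Tiny subset of CSS: '#id', '.class', or a bare tag name."""
    if selector.startswith("#"):
        return el.get("id") == selector[1:]
    if selector.startswith("."):
        return selector[1:] in el.get("classes", [])
    return el.get("tag") == selector


def find_element(elements: list[dict], query: str, threshold: float = 0.6):
    """Try selector, then ARIA label, then fuzzy text; return a dict or None."""
    # 1. CSS selector: direct lookup.
    for el in elements:
        if matches_selector(el, query):
            return el

    # 2. ARIA label: case-insensitive exact match.
    wanted = query.strip().lower()
    for el in elements:
        if el.get("aria_label", "").lower() == wanted:
            return el

    # 3. Text content: best fuzzy match among interactive elements.
    best, best_score = None, threshold
    for el in elements:
        if el.get("tag") not in INTERACTIVE_TAGS:
            continue
        score = SequenceMatcher(None, wanted, el.get("text", "").lower()).ratio()
        if score >= best_score:
            best, best_score = el, score
    return best
```

Ordering the strategies from most to least precise means an exact selector is never overridden by a loose text match, while a spoken phrase such as "return policy" still resolves when no selector is known.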
| Tool | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Gateway backend |
| Node.js | 18+ | Embed script build |
| Gemini API Key | - | Get one free |
git clone https://github.com/AfrexAI/webclaw.git
cd webclaw
cd gateway
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # macOS/Linux
# .venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Configure
cp .env.example .env
# Edit .env and add your GOOGLE_API_KEY
# Run
uvicorn main:app --host 127.0.0.1 --port 8081
You should see:
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8081
Verify with:
curl http://127.0.0.1:8081/health
# → {"status":"ok","service":"webclaw-gateway","version":"0.3.0"}
curl http://127.0.0.1:8081/api/sites
# → {"sites":[{"site_id":"demo","domain":"localhost",...}]}
cd embed
npm install
npm run build
# → dist/webclaw.js 19.6kb ⚡ Done in 2ms
Open the demo e-commerce site in your browser:
open demo-site/index.html
# Or: python -m http.server 3000 -d demo-site
The TechByte Store demo includes:
- 6 product cards with Add to Cart functionality
- FAQ section with expandable details
- Contact form with dropdown subject selection
- WebClaw integration via `<script>` tag
Load the extension for Personal Agent mode:
- Open `chrome://extensions` in Chrome
- Enable Developer Mode (toggle in top-right)
- Click Load unpacked → select the `extension/` folder
- Click the WebClaw icon on any page to activate
webclaw/
│
├── gateway/                      # Python FastAPI Backend
│   ├── main.py                   # WebSocket server, REST API, CORS (v0.3.0)
│   ├── agent/                    # ADK Agent Definition
│   │   ├── agent.py              # Root agent (Gemini 2.5 Flash native audio)
│   │   ├── prompts.py            # System prompt + site-specific builder
│   │   └── tools.py              # 10 DOM action tools (function-calling)
│   ├── context/                  # Context Broker
│   │   └── broker.py             # Site config, knowledge base, permissions, session history
│   ├── storage/                  # Persistent Storage
│   │   ├── __init__.py
│   │   └── firestore.py          # Firestore client (configs, sessions, knowledge, analytics)
│   ├── voice/                    # Voice pipeline (reserved)
│   ├── Dockerfile                # Cloud Run container (Python 3.12-slim)
│   ├── requirements.txt          # google-adk, fastapi, uvicorn, etc.
│   └── .env.example              # Environment template
│
├── embed/                        # Client-Side Embed Script (TypeScript)
│   ├── src/
│   │   ├── index.ts              # Main entry, overlay UI (Shadow DOM)
│   │   ├── gateway-client.ts     # WebSocket client + event system
│   │   ├── audio.ts              # Mic capture (16kHz) + playback (24kHz)
│   │   ├── avatar.ts             # Canvas 2D animated avatar (lip-sync)
│   │   ├── dom-actions.ts        # DOM action executor (10 operations)
│   │   ├── dom-snapshot.ts       # Token-efficient DOM serializer
│   │   ├── action-visualizer.ts  # Bezier flight animation to target elements
│   │   └── screenshot.ts         # Canvas-based viewport capture for vision
│   ├── dist/
│   │   └── webclaw.js            # Bundled output (26.1KB minified)
│   ├── package.json              # esbuild bundler
│   └── tsconfig.json
│
├── dashboard/                    # Site Owner Dashboard
│   └── dist/
│       └── index.html            # Vanilla HTML/JS dashboard (no build step)
│
├── extension/                    # Chrome Extension (Manifest V3)
│   ├── manifest.json             # Permissions: activeTab, storage, scripting
│   ├── popup.html                # Settings UI
│   ├── popup.js                  # Gateway URL, auto-activate, voice toggle
│   ├── content.js                # Page injection, WebSocket, DOM actions, negotiation
│   ├── background.js             # Service worker
│   └── icons/                    # Extension icons (16/48/128px)
│
├── demo-site/                    # Demo E-Commerce Site
│   └── index.html                # TechByte Store (products, FAQ, contact)
│
├── infra/                        # GCP Infrastructure
│   ├── main.tf                   # Terraform: Cloud Run, Artifact Registry,
│   │                             # Firestore, IAM
│   ├── deploy.sh                 # One-command Docker build + deploy
│   └── terraform.tfvars.example
│
├── CONCEPT.md                    # Full design document & vision
├── CHALLENGE.md                  # Hackathon rules reference
└── README.md                     # You are here
cd infra
./deploy.sh YOUR_PROJECT_ID us-central1
This will:
- Build the Docker image from `gateway/Dockerfile`
- Push to Artifact Registry
- Deploy to Cloud Run with session affinity (required for WebSocket)
- Output the public gateway URL
cd infra
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your GCP project ID and Gemini API key
terraform init
terraform apply
Terraform provisions:
- Cloud Run service with auto-scaling (0-10 instances)
- Artifact Registry for container images
- Firestore database (Native mode) for site configs and sessions
- IAM policy for public access (unauthenticated invocation)
Update your embed script to point to the Cloud Run URL:
<script src="https://webclaw-gateway-HASH-uc.a.run.app/embed.js"
data-site-id="your_site_id"
data-gateway="https://webclaw-gateway-HASH-uc.a.run.app">
</script>
Adding WebClaw to your website takes 60 seconds:
Step 1: Register your site via the API:
curl -X POST http://localhost:8081/api/sites \
-H "Content-Type: application/json" \
-d '{
"domain": "yoursite.com",
"persona_name": "Aria",
"persona_voice": "warm, professional, concise",
"welcome_message": "Hi! I'\''m Aria. How can I help you today?",
"knowledge_base": "We sell premium coffee. Free shipping over $30. 14-day returns.",
"allowed_actions": ["click", "scroll", "navigate", "highlight", "read"],
"restricted_actions": ["type"]
}'
# → {"site_id": "a1b2c3d4", "config": {...}}
Step 2: Add the script tag to your HTML:
<script src="https://your-gateway.run.app/embed.js"
data-site-id="a1b2c3d4"
data-gateway="https://your-gateway.run.app"
data-position="bottom-right"
data-theme="light"
data-color="#your-brand-color">
</script>
Configuration Options:
| Attribute | Default | Description |
|---|---|---|
| `data-site-id` | `demo` | Your registered site identifier |
| `data-gateway` | `http://localhost:8080` | Gateway URL |
| `data-position` | `bottom-right` | Overlay position (bottom-right or bottom-left) |
| `data-theme` | `light` | Color theme (light or dark) |
| `data-color` | `#4285f4` | Primary accent color (avatar, buttons, highlights) |
| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/health` | Health check |
| `GET` | `/embed.js` | Serve embed script |
| `GET` | `/api/sites` | List all registered sites |
| `GET` | `/api/sites/{id}` | Get site configuration |
| `POST` | `/api/sites` | Register a new site |
| `PUT` | `/api/sites/{id}` | Update site configuration |
| `WS` | `/ws/{site_id}/{session_id}` | Bidirectional streaming |
Client → Server:
| Frame Type | Format | Description |
|---|---|---|
| Binary | Raw PCM bytes | Audio data (16kHz, 16-bit, mono) |
| Text | `{"type":"text","text":"..."}` | Text message |
| Text | `{"type":"dom_snapshot","html":"...","url":"..."}` | Page structure |
| Text | `{"type":"dom_result","action_id":"...","result":{}}` | Action result |
| Text | `{"type":"image","data":"base64...","mimeType":"..."}` | Screenshot |
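As a rough illustration of the client-to-server frames in the table above, the following Python sketch builds a binary audio frame and three of the JSON text frames. The helper names are hypothetical, and little-endian byte order for the 16-bit samples is an assumption; the JSON shapes are copied from the table.

```python
import json
import struct


def float_to_pcm16(samples: list[float]) -> bytes:
    """Convert floats in [-1, 1] to 16-bit mono PCM bytes.

    Little-endian byte order is assumed here.
    """
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(s * 32767) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def text_frame(text: str) -> str:
    """Plain text message from the user."""
    return json.dumps({"type": "text", "text": text})


def dom_snapshot_frame(html: str, url: str) -> str:
    """Serialized page structure plus the page URL."""
    return json.dumps({"type": "dom_snapshot", "html": html, "url": url})


def dom_result_frame(action_id: str, result: dict) -> str:
    """Outcome of a DOM action the client executed."""
    return json.dumps({"type": "dom_result", "action_id": action_id, "result": result})
```

A client would send the output of `float_to_pcm16` as a binary WebSocket frame and the other three as text frames.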
Server → Client:
ADK events containing:
- `content.parts[].text`: Agent text responses
- `content.parts[].inlineData`: Audio data (PCM 24kHz, base64)
- `content.parts[].functionCall`: DOM actions for the client to execute
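A client has to split each incoming event into text, audio, and DOM actions. The Python sketch below shows one way to do that for an event that has already been parsed from JSON. The function name is hypothetical, and the inner field names (`data` inside `inlineData`, `name` and `args` inside `functionCall`) are assumptions about the event shape rather than something this README specifies.

```python
import base64


def dispatch_event(event: dict) -> list:
    """Return a list of (kind, payload) tuples, one per recognized part.

    Events without content (for example turn-completion signals) yield [].
    """
    results = []
    parts = (event.get("content") or {}).get("parts") or []
    for part in parts:
        if "text" in part:
            results.append(("text", part["text"]))
        elif "inlineData" in part:
            # Audio arrives base64-encoded; decode to raw PCM bytes.
            audio = base64.b64decode(part["inlineData"]["data"])
            results.append(("audio", audio))
        elif "functionCall" in part:
            call = part["functionCall"]
            results.append(("action", (call["name"], call.get("args", {}))))
    return results
```

The caller can then route `text` to the overlay, `audio` to the playback queue, and `action` to the DOM action executor.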
The gateway has been tested end-to-end with the Gemini Live API:
- ✅ WebSocket connection established
- ✅ Gemini Live API bidi stream opened (gemini-2.5-flash-native-audio-preview-12-2025)
- ✅ Session resumption handles received
- ✅ Audio response chunks streaming (PCM 24kHz, ~15KB per chunk)
- ✅ Output transcriptions generated
- ✅ Turn completion signals received
- ✅ Graceful disconnect handling
Sample test output:
AUDIO: audio/pcm;rate=24000 (12800 b64 chars)
AUDIO: audio/pcm;rate=24000 (15360 b64 chars)
AUDIO: audio/pcm;rate=24000 (15360 b64 chars)
AUDIO: audio/pcm;rate=24000 (15360 b64 chars)
AUDIO: audio/pcm;rate=24000 (15360 b64 chars)
AUDIO: audio/pcm;rate=24000 (5120 b64 chars)
EVT: ['modelVersion', 'usageMetadata', ...]
EVT: ['turnComplete', ...]
--- Done (18 events) ---
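The chunk sizes in the sample output can be turned into playback durations with a little arithmetic. The sketch below assumes the response audio is 16-bit mono (the README states only the 24kHz rate for output) and ignores base64 padding; the function name is hypothetical.

```python
def b64_chars_to_seconds(b64_chars: int, sample_rate: int = 24000,
                         bytes_per_sample: int = 2) -> float:
    """Estimate audio duration from the length of a base64 payload.

    Every 4 base64 characters encode 3 bytes; padding is ignored.
    16-bit mono samples are an assumption.
    """
    raw_bytes = b64_chars * 3 // 4
    samples = raw_bytes // bytes_per_sample
    return samples / sample_rate


# Chunk sizes taken from the sample test output above.
chunks = [12800, 15360, 15360, 15360, 15360, 5120]
durations = [b64_chars_to_seconds(c) for c in chunks]
total_seconds = sum(durations)
```

Under these assumptions a 15360-character chunk decodes to 11520 bytes, or 5760 samples, which is about 0.24 seconds of audio, and the six chunks together come to roughly 1.24 seconds.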
| Component | Technology | Role |
|---|---|---|
| AI Model | Gemini 2.5 Flash (Native Audio) | Multimodal understanding + function calling |
| Agent Framework | Google ADK | Agent lifecycle, tool execution, session management |
| Voice Streaming | Gemini Live API | Real-time bidirectional audio (PCM 16kHz in, 24kHz out) |
| Backend | FastAPI + Uvicorn | Async WebSocket server + REST API |
| Embed Script | TypeScript + esbuild | 26.1KB minified bundle, zero runtime dependencies |
| Overlay UI | Web Components (Shadow DOM) | Style-isolated, framework-agnostic |
| Avatar | Canvas 2D | 60fps animated face with lip-sync |
| Extension | Chrome Manifest V3 | Persistent mic access, cross-site persistence |
| Infrastructure | Terraform | Reproducible GCP deployment |
| Service | Usage |
|---|---|
| Cloud Run | Stateless gateway hosting with session affinity for WebSocket |
| Artifact Registry | Container image storage and versioning |
| Firestore | Site configurations, knowledge bases, session history |
| Gemini Live API | Real-time bidirectional voice AI |
Category: Live Agents
Hackathon: Gemini Live Agent Challenge
| Requirement | Status | Implementation |
|---|---|---|
| Uses a Gemini model | ✅ | gemini-2.5-flash-native-audio-preview-12-2025 (bidiGenerateContent) |
| Uses Google GenAI SDK or ADK | ✅ | Google ADK (google-adk) |
| At least one Google Cloud service | ✅ | Cloud Run, Firestore, Artifact Registry |
| New project created during contest | ✅ | First commit: March 6, 2026 |
| Demo video < 4 min | 🔲 | Planned |
| Public code repository | ✅ | This repo |
| Spin-up instructions | ✅ | See Quick Start |
| Bonus | Points | Status |
|---|---|---|
| Terraform deployment | +0.2 | ✅ `infra/main.tf` |
| GDG membership | +0.2 | ✅ |
| Blog post | +0.6 | 🔲 |
| Criterion | Weight | How WebClaw Delivers |
|---|---|---|
| Innovation & Multimodal UX | 40% | Breaks the text-box paradigm entirely. Users talk; the agent talks back AND operates the page. Animated avatar with lip-sync, DOM action visualization, voice barge-in support. Not a chatbot with a microphone icon: it is a companion that operates the website. |
| Technical Implementation | 30% | Full ADK agent pipeline with Gemini Live API bidirectional audio, 10-tool DOM action engine, context broker with asymmetric privacy, Shadow DOM isolation, Canvas 2D avatar, smart element finder with CSS/ARIA/text fallback, token-efficient DOM snapshot serializer. |
| Demo & Presentation | 30% | Extremely demo-friendly. "Watch the agent navigate to checkout, fill in the form, and complete the purchase: all while explaining what it is doing in natural voice." Visual, live, undeniable. |
| Decision | Rationale |
|---|---|
| Gateway over P2P | Provides privacy (asymmetric context), security (action validation), scalability (stateless Cloud Run), and analytics (centralized data). |
| Shadow DOM for overlay | Complete style isolation from host page. WebClaw's CSS never conflicts with the site's styles, regardless of framework. |
| Canvas 2D over Lottie/WebGL | Zero dependencies, <3KB of avatar code, 60fps on all devices. Good enough for lip-sync; no heavy animation library needed. |
| esbuild over Webpack/Rollup | 2ms build time. 26.1KB output. The embed script must be tiny and build instantly. |
| `<script>` tag integration | Same pattern as Google Analytics, Intercom, and Segment. Zero friction. Site owners already know this pattern. |
| PCM audio over Opus/WebM | Gemini Live API requires PCM. Direct PCM avoids encoding/decoding overhead and reduces latency. |
| Smart element finder | CSS selectors alone are fragile. ARIA labels provide accessibility-aware matching. Text content matching handles "click the Buy button" naturally. |
| Token-efficient DOM snapshots | Full DOM serialization would blow context windows. The snapshot walker prunes non-semantic elements, skips scripts/styles, and caps output at 4KB. |
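The snapshot idea in the last row of the table (prune, skip scripts and styles, cap the size) can be sketched with Python's standard `html.parser`. This is an illustration only: the real serializer lives in the TypeScript embed script, and the set of kept tags, the decision to keep only `id` attributes, and the class and function names here are assumptions. Only the 4KB cap comes from the table.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "noscript"}           # content dropped entirely
KEEP_TAGS = {"a", "button", "input", "select", "textarea", "form",
             "h1", "h2", "h3", "label", "nav", "main"}  # assumed "semantic" set
VOID_TAGS = {"input"}                                  # no closing tag emitted
MAX_BYTES = 4096                                       # 4KB cap from the table


class SnapshotBuilder(HTMLParser):
    """Collects kept tags and visible text, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in KEEP_TAGS:
            element_id = dict(attrs).get("id")
            ident = f' id="{element_id}"' if element_id else ""
            self.out.append(f"<{tag}{ident}>")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif self.skip_depth == 0 and tag in KEEP_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if self.skip_depth == 0 and text:
            self.out.append(text)


def snapshot(html: str) -> str:
    """Return a pruned serialization of html, at most MAX_BYTES of UTF-8."""
    builder = SnapshotBuilder()
    builder.feed(html)
    builder.close()
    joined = "".join(builder.out)
    return joined.encode("utf-8")[:MAX_BYTES].decode("utf-8", errors="ignore")
```

Wrapper elements such as `div` are dropped but their text is kept, so the model still sees what the page says without paying for layout markup.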
- Gateway backend with ADK + Gemini Live API
- Embed script with Shadow DOM overlay
- Canvas 2D avatar with lip-sync animation
- DOM action engine (10 tools)
- DOM snapshot serializer
- Chrome extension (Personal Agent)
- Demo e-commerce site
- Terraform + deploy script
- End-to-end audio streaming verified
- Firestore persistent sessions and knowledge base
- Site owner dashboard (config UI at `/dashboard`)
- Action visualization (avatar flies to target elements via bezier curve)
- Screenshot-based page understanding (vision)
- Agent negotiation protocol (Personal meets Site)
- Analytics pipeline (BigQuery)
- Multi-language voice support
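For the action visualization item in the list above, a flight path along a Bezier curve can be computed as in the Python sketch below. A quadratic curve, the 30-step sampling, and the 120-pixel lift of the control point are assumptions made for illustration; the README only says the avatar flies to the target along a bezier curve.

```python
def quadratic_bezier(p0, p1, p2, t: float):
    """Point on a quadratic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    x = u * u * p0[0] + 2 * u * t * p1[0] + t * t * p2[0]
    y = u * u * p0[1] + 2 * u * t * p1[1] + t * t * p2[1]
    return (x, y)


def flight_path(start, target, steps: int = 30, lift: float = 120.0):
    """Sample an arc from start to target.

    The control point sits above the midpoint (screen y grows downward),
    which makes the path bow upward instead of moving in a straight line.
    """
    control = ((start[0] + target[0]) / 2, min(start[1], target[1]) - lift)
    return [quadratic_bezier(start, control, target, i / steps)
            for i in range(steps + 1)]
```

An animation loop would step through the returned points, one per frame, to move the avatar from its resting corner to the element being acted on.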
MIT License. See LICENSE for details.
Built for the Gemini Live Agent Challenge by David Nzagha and the Nzagha Ventures team.
WebClaw: Because websites should have agents, not just chat widgets.