Navigate any website by voice. Built for disabled users.
WayPoint is a Chrome extension that makes any website navigable entirely by voice. Say "checkout" and it clicks. Say "search" and the field focuses. Say "scroll down" and it scrolls until you say stop. No tab-cycling. No mouse. No prior knowledge of the page's structure. Just intent.
Built for the Google Gemini Live Agent Challenge (UI Navigator category).
Live backend: https://waypoint-backend-97725420601.us-central1.run.app
WayPoint's core paradigm is DISCOVER → DOCUMENT → ACTIVATE.
DISCOVER — When you click "Index this page," WayPoint walks the entire DOM: every link, button, input, form, and semantic region gets stamped with a unique data-wp-id. A structural containment tree and a flat list of all interactive elements are extracted.
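The stamping step can be sketched roughly like this. This is illustrative, not the real `extractor.js`: the selector list and the returned field names are assumptions, and the real extractor also builds the containment tree.

```javascript
// Sketch of the DISCOVER step: walk the page, stamp each interactive
// element with a unique data-wp-id, and return a flat list of them.
// (Illustrative only — selector list and output shape are assumptions.)
function stampInteractives(root = document) {
  const SELECTOR = "a[href], button, input, select, textarea, [role=button]";
  const interactives = [];
  let nextId = 0;
  for (const el of root.querySelectorAll(SELECTOR)) {
    const id = `wp-${nextId++}`;
    el.setAttribute("data-wp-id", id); // stamp the live DOM node
    interactives.push({
      id,
      tag: el.tagName.toLowerCase(),
      label: (el.getAttribute("aria-label") || el.textContent || "").trim(),
    });
  }
  return interactives;
}
```

Stamping the DOM (rather than keeping references in JS) means a surface can still be found after a re-render, as long as the attribute survives.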
DOCUMENT — A three-layer pipeline builds an Intent Surface Map — a structured description of everything a user can do on the page:
| Layer | Latency | What happens |
|---|---|---|
| 0 | 0ms | Client-side semantic scan. Every <nav>, <main>, <form>, <dialog> becomes a surface. Voice activates immediately. |
| 1 | ~1s | Gemini (text) receives the full DOM tree + every interactive element. Groups nav links, enriches labels, writes voice triggers, guarantees 100% coverage. |
| 2 | ~2-4s | Gemini (vision) receives the screenshot. Identifies visual regions, icon-only buttons, hero CTAs, and anything the DOM analysis missed. |
Each layer merges into the previous. Structural surfaces not covered by Gemini are kept — so nothing is ever lost.
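The merge rule above can be sketched as a keyed union in which the incoming layer wins on conflicts but never deletes. The surface shape and field names here are assumptions:

```javascript
// Sketch of the layer-merge rule: Gemini's surfaces override matching
// structural surfaces, but any surface the new layer did not cover is
// kept, so coverage never shrinks. (Field names are illustrative.)
function mergeSurfaces(previous, incoming) {
  const merged = new Map(previous.map((s) => [s.id, s]));
  for (const s of incoming) {
    merged.set(s.id, { ...merged.get(s.id), ...s }); // enrich, don't replace
  }
  return [...merged.values()];
}
```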
ACTIVATE — You speak. Gemini Live receives the audio over a WebSocket, understands intent in context, and returns a tool call (activate_surface, scroll_page, enter_click_mode, etc.). The extension executes the action against the live DOM. Gemini speaks back through the same stream.
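The tool-call side of ACTIVATE amounts to a dispatch table from function names to DOM actions. A minimal sketch, assuming a `{ name, args }` call shape (the real handlers in the extension are more involved):

```javascript
// Hypothetical dispatch of Gemini Live tool calls to DOM actions.
// Handler names mirror the tools above; bodies are illustrative.
const handlers = {
  activate_surface: ({ surfaceId }) =>
    document.querySelector(`[data-wp-id="${surfaceId}"]`)?.click(),
  scroll_page: ({ direction }) =>
    window.scrollBy({ top: direction === "down" ? 400 : -400, behavior: "smooth" }),
};

function dispatchToolCall(call) {
  const handler = handlers[call.name];
  if (!handler) return { ok: false, error: `unknown tool: ${call.name}` };
  handler(call.args);
  return { ok: true }; // result is reported back to Gemini Live
}
```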
```
Chrome Extension (MV3)
│
├── content/index.js            — 3-layer indexing pipeline, state machine
├── content/extractor.js        — DOM tree builder, stamps every element (data-wp-id)
│     extractAllInteractives()  — every link/button/input
│     extractSemanticSurfaces() — Layer 0 instant surfaces
├── content/gemini-live.js      — Gemini Live WebSocket (STT + NLU + TTS in one stream)
│     AudioWorklet: Float32 mic → Int16 PCM → Gemini
│     Tool calls: activate_surface, scroll_page, enter_click_mode,
│                 navigate_highlight, activate_highlight
├── content/actions.js          — DOM action executor (5-stage fallback target lookup)
├── content/highlight.js        — Click mode: highlight focusable elements, navigate by voice
├── content/scroll-indicator.js — Visual scroll-direction overlay
├── content/voice.js            — Voice session lifecycle wrapper
├── content/observer.js         — MutationObserver, re-index on significant DOM changes
├── content/debug.js            — Floating shadow-DOM debug panel (LOGS + SURFACES tabs)
├── content/context.js          — Session state (intentMap, voiceMode, history)
├── content/feedback.js         — showBadge(), speak() for system messages
├── overlay/                    — Shadow DOM popup UI
└── background/                 — Service worker: screenshot capture, message relay
```
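The Float32 → Int16 PCM conversion that `content/gemini-live.js` runs in its AudioWorklet can be shown standalone. Web Audio delivers samples as floats in [-1, 1]; Gemini Live expects 16-bit PCM. This is a sketch of the conversion step only, not the worklet wiring:

```javascript
// Convert Web Audio Float32 samples ([-1, 1]) to Int16 PCM.
// Clamping first avoids integer overflow on out-of-range samples.
function floatTo16BitPCM(float32) {
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;      // asymmetric int16 range
  }
  return int16;
}
```

The asymmetric scale (−32768 vs 32767) matches the int16 range so that a full-scale negative sample doesn't wrap around.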
```
Cloud Run Backend (Node 20 + Express)
│
├── POST /index/text    — Layer 1: tree + all interactives → Gemini text → surfaces
├── POST /index/vision  — Layer 2: screenshot + current map → Gemini vision → enriched map
├── POST /index         — Backward-compatible combined call
├── POST /resolve       — Gemini utterance resolution (fallback, not the primary path)
├── GET  /config        — Serves GEMINI_API_KEY to the extension (never shipped in the bundle)
└── GET  /health        — Cloud Run health check
```
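The two simplest routes can be sketched as a pure handler so the shapes are easy to see. This is illustrative, not the actual backend source (which uses Express); the response field names are assumptions:

```javascript
// Pure-function sketch of the backend's /health and /config routes.
// The real backend is Express on Cloud Run; only the shapes are shown.
function handleRoute(url, env = process.env) {
  switch (url) {
    case "/health": // Cloud Run liveness probe
      return { status: 200, body: { ok: true } };
    case "/config": // hands the key to the extension at runtime,
                    // so it is never baked into the bundled JS
      return { status: 200, body: { geminiApiKey: env.GEMINI_API_KEY ?? "" } };
    default:
      return { status: 404, body: { error: "not found" } };
  }
}
```

Serving the key from `/config` keeps it out of the published extension bundle, at the cost of the backend having to decide who may fetch it.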
- Chrome Extension Manifest V3 — content scripts, service worker, shadow DOM
- Gemini 2.5 Flash — text + vision indexing via the Gemini REST API (`generativelanguage.googleapis.com`)
- Gemini Live (`gemini-2.5-flash-native-audio-preview-12-2025`) — bidirectional WebSocket; STT + NLU + TTS in one stream; tool calls for DOM actions
- AudioWorklet — Float32 → Int16 PCM conversion off the main thread
- Node.js + Express on Cloud Run — no Vertex AI SDK, auth via API key
- Bun — bundler for extension content scripts (IIFE format for MV3)
- Bun (`curl -fsSL https://bun.sh/install | bash`)
- A Gemini API key from aistudio.google.com
- Node.js 20+ (for the backend)
```sh
cd backend
cp .env.example .env
# Fill in GEMINI_API_KEY in .env
npm install
npm run dev
# → [WayPoint backend] Listening on port 3000
```

```sh
cd ext
bun run build
# → dist/content.js, dist/overlay-app.js
```

Load in Chrome:

- Open `chrome://extensions` → Developer mode → Load unpacked
- Select the `ext/` folder
The extension points at the live Cloud Run backend by default. To use a local backend, change `BACKEND_URL` in `ext/shared/constants.js` to `http://localhost:3000` and rebuild.
```sh
gcloud run deploy waypoint-backend \
  --source ./backend \
  --region us-central1 \
  --allow-unauthenticated \
  --set-env-vars "GEMINI_API_KEY=your-key,GOOGLE_CLOUD_PROJECT=your-project"
```

WayPoint understands natural language — these are examples, not a fixed list. Gemini Live resolves intent against the indexed surfaces of the current page.
| Say | Does |
|---|---|
| "go to checkout" / "open nav" / "menu" | Activates the matching surface |
| "scroll down" / "scroll up" | Continuous scroll until "stop" |
| "scroll to top" / "bottom" | Jump to page top or bottom |
| "click" | Enter click mode — highlights first focusable element |
| "right" / "left" / "next" / "back" | Navigate highlighted element |
| "up" / "down" | Move highlight to next/previous row |
| "select" / "activate" / "yes" | Click the highlighted element |
| "search for..." | Focuses search field, enters dictation mode |
| "read" | Reads element content aloud |
| "dismiss" / "close" | Dismisses modal or overlay |
Click the WP tab at the bottom-right of any indexed page to open the debug panel.
- LOGS tab — live pipeline events: what was heard, how it resolved, what action fired, any errors
- SURFACES tab — every indexed surface with type, action, element dimensions, and a green dot when the element is currently in the viewport
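The green "in viewport" dot boils down to a rectangle-overlap check against the visible window. A minimal sketch; the exact logic in `debug.js` may differ:

```javascript
// True when any part of the element's bounding rect overlaps the viewport.
// rect matches the shape of getBoundingClientRect(); viewport is
// { width, height } (e.g. window.innerWidth / innerHeight).
function isInViewport(rect, viewport) {
  return (
    rect.bottom > 0 &&
    rect.top < viewport.height &&
    rect.right > 0 &&
    rect.left < viewport.width
  );
}
```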
Google Gemini Live Agent Challenge · 2026