Navigate any website by voice. Built for disabled users.
WayPoint is a Chrome extension that makes any website navigable entirely by voice. Say "checkout" and it clicks. Say "search" and the field focuses. Say "scroll down" and it scrolls until you say stop. No tab-cycling. No mouse. No prior knowledge of the page's structure. Just intent.
Built for the Google Gemini Live Agent Challenge (UI Navigator category).
Live backend: https://waypoint-backend-97725420601.us-central1.run.app
WayPoint's core paradigm is DISCOVER → DOCUMENT → ACTIVATE.
DISCOVER — When you click "Index this page," WayPoint walks the entire DOM: every link, button, input, form, and semantic region gets stamped with a unique data-wp-id. A structural containment tree and a flat list of all interactive elements are extracted.
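The stamping step can be sketched roughly like this. This is illustrative, not the real `extractor.js`: the selector list and the returned field names are assumptions, and the real extractor also builds the containment tree.

```javascript
// Sketch of the DISCOVER step: walk the page, stamp each interactive
// element with a unique data-wp-id, and return a flat list of them.
// (Illustrative only — selector list and output shape are assumptions.)
function stampInteractives(root = document) {
  const SELECTOR = "a[href], button, input, select, textarea, [role=button]";
  const interactives = [];
  let nextId = 0;
  for (const el of root.querySelectorAll(SELECTOR)) {
    const id = `wp-${nextId++}`;
    el.setAttribute("data-wp-id", id); // stamp the live DOM node
    interactives.push({
      id,
      tag: el.tagName.toLowerCase(),
      label: (el.getAttribute("aria-label") || el.textContent || "").trim(),
    });
  }
  return interactives;
}
```

Stamping the DOM (rather than keeping references in JS) means a surface can still be found after a re-render, as long as the attribute survives.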
DOCUMENT — A three-layer pipeline builds an Intent Surface Map — a structured description of everything a user can do on the page:
| Layer | Latency | What happens |
|---|---|---|
| 0 | 0ms | Client-side semantic scan. Every <nav>, <main>, <form>, <dialog> becomes a surface. Voice activates immediately. |
| 1 | ~1s | Gemini (text) receives the full DOM tree + every interactive element. Groups nav links, enriches labels, writes voice triggers, guarantees 100% coverage. |
| 2 | ~2-4s | Gemini (vision) receives the screenshot. Identifies visual regions, icon-only buttons, hero CTAs, and anything the DOM analysis missed. |
Each layer merges into the previous. Structural surfaces not covered by Gemini are kept — so nothing is ever lost.
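The merge rule above can be sketched as a keyed union in which the incoming layer wins on conflicts but never deletes. The surface shape and field names here are assumptions:

```javascript
// Sketch of the layer-merge rule: Gemini's surfaces override matching
// structural surfaces, but any surface the new layer did not cover is
// kept, so coverage never shrinks. (Field names are illustrative.)
function mergeSurfaces(previous, incoming) {
  const merged = new Map(previous.map((s) => [s.id, s]));
  for (const s of incoming) {
    merged.set(s.id, { ...merged.get(s.id), ...s }); // enrich, don't replace
  }
  return [...merged.values()];
}
```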
ACTIVATE — You speak. Gemini Live receives the audio over a WebSocket, understands intent in context, and returns a tool call (activate_surface, scroll_page, enter_click_mode, etc.). The extension executes the action against the live DOM. Gemini speaks back through the same stream.
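The tool-call side of ACTIVATE amounts to a dispatch table from function names to DOM actions. A minimal sketch, assuming a `{ name, args }` call shape (the real handlers in the extension are more involved):

```javascript
// Hypothetical dispatch of Gemini Live tool calls to DOM actions.
// Handler names mirror the tools above; bodies are illustrative.
const handlers = {
  activate_surface: ({ surfaceId }) =>
    document.querySelector(`[data-wp-id="${surfaceId}"]`)?.click(),
  scroll_page: ({ direction }) =>
    window.scrollBy({ top: direction === "down" ? 400 : -400, behavior: "smooth" }),
};

function dispatchToolCall(call) {
  const handler = handlers[call.name];
  if (!handler) return { ok: false, error: `unknown tool: ${call.name}` };
  handler(call.args);
  return { ok: true }; // result is reported back to Gemini Live
}
```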
```
Chrome Extension (MV3)
│
├── content/index.js            — 3-layer indexing pipeline, state machine
├── content/extractor.js        — DOM tree builder, stamps every element (data-wp-id)
│     extractAllInteractives()  — every link/button/input
│     extractSemanticSurfaces() — Layer 0 instant surfaces
├── content/gemini-live.js      — Gemini Live WebSocket (STT + NLU + TTS in one stream)
│     AudioWorklet: Float32 mic → Int16 PCM → Gemini
│     Tool calls: activate_surface, scroll_page, enter_click_mode,
│                 navigate_highlight, activate_highlight
├── content/actions.js          — DOM action executor (5-stage fallback target lookup)
├── content/highlight.js        — Click mode: highlight focusable elements, navigate by voice
├── content/scroll-indicator.js — Visual scroll-direction overlay
├── content/voice.js            — Voice session lifecycle wrapper
├── content/observer.js         — MutationObserver, re-index on significant DOM changes
├── content/debug.js            — Floating shadow-DOM debug panel (LOGS + SURFACES tabs)
├── content/context.js          — Session state (intentMap, voiceMode, history)
├── content/feedback.js         — showBadge(), speak() for system messages
├── overlay/                    — Shadow DOM popup UI
└── background/                 — Service worker: screenshot capture, message relay
```
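The Float32 → Int16 PCM conversion that `content/gemini-live.js` runs in its AudioWorklet can be shown standalone. Web Audio delivers samples as floats in [-1, 1]; Gemini Live expects 16-bit PCM. This is a sketch of the conversion step only, not the worklet wiring:

```javascript
// Convert Web Audio Float32 samples ([-1, 1]) to Int16 PCM.
// Clamping first avoids integer overflow on out-of-range samples.
function floatTo16BitPCM(float32) {
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;      // asymmetric int16 range
  }
  return int16;
}
```

The asymmetric scale (−32768 vs 32767) matches the int16 range so that a full-scale negative sample doesn't wrap around.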
```
Cloud Run Backend (Node 20 + Express)
│
├── POST /index/text    — Layer 1: tree + all interactives → Gemini text → surfaces
├── POST /index/vision  — Layer 2: screenshot + current map → Gemini vision → enriched map
├── POST /index         — Backward-compatible combined call
├── POST /resolve       — Gemini utterance resolution (fallback, not the primary path)
├── GET  /config        — Serves GEMINI_API_KEY to the extension (never shipped in the bundle)
└── GET  /health        — Cloud Run health check
```
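The two simplest routes can be sketched as a pure handler so the shapes are easy to see. This is illustrative, not the actual backend source (which uses Express); the response field names are assumptions:

```javascript
// Pure-function sketch of the backend's /health and /config routes.
// The real backend is Express on Cloud Run; only the shapes are shown.
function handleRoute(url, env = process.env) {
  switch (url) {
    case "/health": // Cloud Run liveness probe
      return { status: 200, body: { ok: true } };
    case "/config": // hands the key to the extension at runtime,
                    // so it is never baked into the bundled JS
      return { status: 200, body: { geminiApiKey: env.GEMINI_API_KEY ?? "" } };
    default:
      return { status: 404, body: { error: "not found" } };
  }
}
```

Serving the key from `/config` keeps it out of the published extension bundle, at the cost of the backend having to decide who may fetch it.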
- Chrome Extension Manifest V3 — content scripts, service worker, shadow DOM
- Gemini 2.5 Flash — text + vision indexing via the Gemini REST API (`generativelanguage.googleapis.com`)
- Gemini Live (`gemini-2.5-flash-native-audio-preview-12-2025`) — bidirectional WebSocket; STT + NLU + TTS in one stream; tool calls for DOM actions
- AudioWorklet — Float32 → Int16 PCM conversion off the main thread
- Node.js + Express on Cloud Run — no Vertex AI SDK, auth via API key
- Bun — bundler for extension content scripts (IIFE format for MV3)
- Bun (`curl -fsSL https://bun.sh/install | bash`)
- A Gemini API key from aistudio.google.com
- Node.js 20+ (for the backend)
```sh
cd backend
cp .env.example .env
# Fill in GEMINI_API_KEY in .env
npm install
npm run dev
# → [WayPoint backend] Listening on port 3000
```

```sh
cd ext
bun run build
# → dist/content.js, dist/overlay-app.js
```

Load in Chrome:

- Open `chrome://extensions` → Developer mode → Load unpacked
- Select the `ext/` folder
The extension points at the live Cloud Run backend by default. To use a local backend, change `BACKEND_URL` in `ext/shared/constants.js` to `http://localhost:3000` and rebuild.
```sh
gcloud run deploy waypoint-backend \
  --source ./backend \
  --region us-central1 \
  --allow-unauthenticated \
  --set-env-vars "GEMINI_API_KEY=your-key,GOOGLE_CLOUD_PROJECT=your-project"
```

WayPoint understands natural language — these are examples, not a fixed list. Gemini Live resolves intent against the indexed surfaces of the current page.
| Say | Does |
|---|---|
| "go to checkout" / "open nav" / "menu" | Activates the matching surface |
| "scroll down" / "scroll up" | Continuous scroll until "stop" |
| "scroll to top" / "bottom" | Jump to page top or bottom |
| "click" | Enter click mode — highlights first focusable element |
| "right" / "left" / "next" / "back" | Navigate highlighted element |
| "up" / "down" | Move highlight to next/previous row |
| "select" / "activate" / "yes" | Click the highlighted element |
| "search for..." | Focuses search field, enters dictation mode |
| "read" | Reads element content aloud |
| "dismiss" / "close" | Dismisses modal or overlay |
Click the WP tab at the bottom-right of any indexed page to open the debug panel.
- LOGS tab — live pipeline events: what was heard, how it resolved, what action fired, any errors
- SURFACES tab — every indexed surface with type, action, element dimensions, and a green dot when the element is currently in the viewport
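The green "in viewport" dot boils down to a rectangle-overlap check against the visible window. A minimal sketch; the exact logic in `debug.js` may differ:

```javascript
// True when any part of the element's bounding rect overlaps the viewport.
// rect matches the shape of getBoundingClientRect(); viewport is
// { width, height } (e.g. window.innerWidth / innerHeight).
function isInViewport(rect, viewport) {
  return (
    rect.bottom > 0 &&
    rect.top < viewport.height &&
    rect.right > 0 &&
    rect.left < viewport.width
  );
}
```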
Google Gemini Live Agent Challenge · 2026