Inside the Agent

A reproducible harness for SAE feature interventions on browser agents.

A fully open reference implementation of SAE-feature-level steering on a browser agent, with a deterministic benchmark and a live interpretability telemetry surface.

Two SAE feature edits at one decision step shift overall success rate from 10% (baseline) to 57% (targeted) on a 60-trial held-out suite. The lift is concentrated on promotional traps, where the features were calibrated (0 → 79%), and on hallucination tasks (0 → 67%). On planning tasks the same edits hurt the agent (33 → 17%), a real cost we surface, not bury. A prompt-only control beats targeted overall at 73% by doing well across all categories; on promotional traps the two are nearly level (83% vs 79%). Direction-flipped, random, and matched-norm-noise controls all stay near baseline.

The honest framing: this is not a "best browser agent" claim. It is a working reference for runtime SAE interventions, an observability + controllability surface for agentic LLMs, and a benchmark that surfaces both wins and failure modes by construction. Features themselves are under-characterized; we name them by feature ID and logit-lens-derived behavior tag (f26737_ui_selection_vocab, f23803_distraction_avoidance_vocab) until independent validation lands.

Headline result (v0.8 — held-out 20 tasks × 3 trials = 60 per policy)

Wilson 95% CIs. All numbers regenerated from data/results/*.jsonl via python -m bench.report and verified by python -m bench.artifact_check on every CI run.

Policy	Success	95% CI	Δ vs baseline	Notes
baseline (no steering)	10.0%	[4.7%, 20.1%]	—	Falls for the trap most of the time
wrong-sign	13.3%	[6.9%, 24.2%]	+3 pts	Sign-flipped targeted edits — inside baseline CI ⇒ direction matters causally
random (per-trial seeded)	15.0%	[8.1%, 26.1%]	+5 pts	Random feature edits — small lift from "any intervention"
noise (matched-norm)	18.3%	[10.6%, 29.9%]	+8 pts	Random isotropic residual perturbation, same magnitude as targeted
targeted — 2 SAE feature edits at Step 0	56.7%	[44.1%, 68.4%]	+47 pts	f26737 (-6) + f23803 (+6), `position_mode=all`
prompt-only (system-prompt control)	73.3%	[61.0%, 82.9%]	+63 pts	"Avoid promotional banners; use search" in the system prompt
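
The intervals in the table can be re-derived from raw counts. The sketch below is a minimal Wilson score interval, not the code in bench/report.py; the function name `wilson_interval` is illustrative, and the counts 6/60 and 34/60 are the ones implied by the 10.0% and 56.7% rows.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return center - half, center + half


# Counts implied by the baseline and targeted rows (60 trials per policy).
for name, wins in [("baseline", 6), ("targeted", 34)]:
    low, high = wilson_interval(wins, 60)
    print(f"{name}: {wins / 60:.1%} [{low:.1%}, {high:.1%}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly at small n and extreme rates, which matters for cells such as 0/24.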

Action quality: valid vs executed (v0.24-F diagnostic)

A first-class diagnostic the reviewer flagged: success rate alone hides a gap between "model emitted well-formed JSON" and "Playwright actually dispatched the action." A targeted-steered model that emits more confident-looking but harder-to-dispatch selectors looks the same on valid_action but worse on executed.

Policy	n steps	valid_action	executed	parse-but-no-exec
`baseline`	600	100.0%	100.0%	0
`prompt-only`	600	100.0%	85.5%	87
`failure-mining`	599	90.0%	83.6%	38
`noise`	600	99.8%	48.3%	309
`targeted`	600	100.0%	36.3%	382
`random`	600	100.0%	32.5%	405
`wrong-sign`	600	99.0%	24.0%	450
`dynamic`	600	100.0%	22.7%	464

Targeted hits 100% valid_action but only 36.3% executed. The intervention is producing well-formed actions that Playwright can't dispatch (often because the selector pattern doesn't exist in the real DOM, or the click target is occluded). That gap is part of the cost, and a future policy that shrinks it without losing success is a clear next-generation target. We treat executed as a gating diagnostic for any future policy claim.
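
The gap between the two columns can be computed per step. This is a sketch under assumptions: `StepRecord` and `action_quality` are hypothetical names, and the synthetic data (600 valid steps, 218 of them executed) is chosen only to mirror the targeted row, where 600 − 382 = 218.

```python
from dataclasses import dataclass


@dataclass
class StepRecord:
    valid_action: bool  # model output parsed as a well-formed action
    executed: bool      # Playwright actually dispatched the action


def action_quality(steps: list[StepRecord]) -> dict:
    """Summarise valid vs executed rates and the parse-but-no-exec count."""
    n = len(steps)
    if n == 0:
        raise ValueError("no steps to summarise")
    return {
        "n_steps": n,
        "valid_action": sum(s.valid_action for s in steps) / n,
        "executed": sum(s.executed for s in steps) / n,
        # Steps that parsed cleanly but never reached the browser.
        "parse_but_no_exec": sum(s.valid_action and not s.executed for s in steps),
    }


# Synthetic steps shaped like the targeted row: all valid, 218 executed.
steps = [StepRecord(valid_action=True, executed=i < 218) for i in range(600)]
summary = action_quality(steps)
print(summary)
```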

The category-specific story (this is the real headline)

The targeted edits don't lift uniformly — the mechanism is category-specific. Breaking the 60 trials per policy out by task category:

Policy	promo (calibrated)	hallucination (cross-domain)	planning (out-of-distribution)
baseline	0% (0/24)	0% (0/18)	33% (6/18)
targeted	79% (19/24)	67% (12/18)	17% (3/18)
prompt-only	83% (20/24)	67% (12/18)	67% (12/18)
wrong-sign	4%	33%	6%
random	0%	22%	28%

Three honest findings:

Targeted dominates the calibration distribution. On promotional traps, which is what we tuned for, the agent goes from 0% to 79%. The original v0.2 headline (83% on 24 trials) was a promo-only number; the 56.7% overall averages across all three categories.
Targeted transfers cross-domain to hallucination tasks. 0% → 67% on a category we never calibrated against. Suppressing UI-selection vocabulary stops the agent from inventing buttons that don't exist. Evidence of cross-distribution generalization.
Targeted hurts on planning. 33% → 17%, worse than baseline. The features that block "click the wrong thing" also block "click the right thing" when multi-step navigation needs legitimate clicks. Mechanistically consistent — the logit lens predicts this failure.
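
The overall rates are trial-weighted averages of the category cells. The short check below uses only the counts printed in the table above; `RESULTS` and `overall` are illustrative names.

```python
from fractions import Fraction

# (successes, trials) per category, copied from the category table.
RESULTS = {
    "baseline":    {"promo": (0, 24),  "hallucination": (0, 18),  "planning": (6, 18)},
    "targeted":    {"promo": (19, 24), "hallucination": (12, 18), "planning": (3, 18)},
    "prompt-only": {"promo": (20, 24), "hallucination": (12, 18), "planning": (12, 18)},
}


def overall(by_category: dict) -> Fraction:
    """Pool successes and trials across categories (not a mean of percentages)."""
    wins = sum(k for k, _ in by_category.values())
    trials = sum(n for _, n in by_category.values())
    return Fraction(wins, trials)


for policy, cats in RESULTS.items():
    per_cat = ", ".join(f"{c} {k / n:.0%}" for c, (k, n) in cats.items())
    print(f"{policy}: {per_cat} -> overall {float(overall(cats)):.1%}")
```

Pooling counts rather than averaging the three percentages matters because promo has 24 trials and the other two categories have 18 each.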

When does SAE steering beat prompt-only?

Prompt-only wins on average; SAE steering matches it on hallucination and comes within 4 pp on promo, its calibration domain, but falls far behind on planning. The two interventions are mechanistically different and they tell different stories.

Category	prompt-only	targeted	who wins	margin
promo	83%	79%	prompt-only (barely)	+4 pp
hallucination	67%	67%	tie	0 pp
planning	67%	17%	prompt-only	+50 pp

Prompt-only modifies the input tokens ("Avoid promotional banners; use search."). It works because the model follows instructions and the instruction happens to be correct across all three categories.
Targeted modifies the residual stream at layer 19 by ±6 on two SAE features. It works on promo and hallucination because those features encode "click this option" UI-selection vocabulary, which is the trap. On planning, the same features encode the legitimate clicks the agent needs to navigate, so suppressing them backfires.
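
A feature-level edit of this kind can be pictured as adding a scaled decoder direction to the residual stream. The sketch below is an assumption about the mechanics, not the brain-server code: `apply_feature_edits` is a hypothetical helper, the 4-dimensional decoder rows are toy values, and treating `last_prompt_only` as "the final position of the prompt" is a guess at what that mode means.

```python
def apply_feature_edits(residual, decoder_rows, edits, position_mode="all"):
    """Return a copy of `residual` with sum(delta * decoder_row) added.

    residual:     list of positions, each a list[float] of width d_model
    decoder_rows: feature id -> decoder direction (list[float], width d_model)
    edits:        feature id -> delta (negative suppresses, positive boosts)
    """
    d_model = len(residual[0])
    shift = [0.0] * d_model
    for feature_id, delta in edits.items():
        row = decoder_rows[feature_id]
        for j in range(d_model):
            shift[j] += delta * row[j]

    if position_mode == "all":
        targets = range(len(residual))
    elif position_mode == "last_prompt_only":
        targets = [len(residual) - 1]
    else:
        raise ValueError(f"unknown position_mode: {position_mode}")

    out = [list(vec) for vec in residual]  # do not mutate the caller's data
    for t in targets:
        for j in range(d_model):
            out[t][j] += shift[j]
    return out


# Toy setup: three prompt positions, d_model = 4, orthogonal decoder rows.
DECODER = {26737: [1.0, 0.0, 0.0, 0.0], 23803: [0.0, 1.0, 0.0, 0.0]}
EDITS = {26737: -6.0, 23803: 6.0}
resid = [[0.0] * 4 for _ in range(3)]

out_all = apply_feature_edits(resid, DECODER, EDITS, "all")
out_last = apply_feature_edits(resid, DECODER, EDITS, "last_prompt_only")
print(out_all)
print(out_last)
```

Under this picture the wrong-sign control is the same call with every delta negated, which is why it isolates direction from magnitude.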

The combination (prompt-only AND SAE steering simultaneously) is the obvious next experiment and is on the roadmap. SAE steering is not a replacement for prompt engineering; it is a runtime intervention surface at a layer of representation prompts cannot directly access, with category-specific causal effects you can read and write live.

How to read these numbers honestly

Wrong-sign sits inside baseline's CI. Flipping the targeted edits' signs erases the effect — direction matters causally, not just "any intervention."
Random at 15% is the corrected number. v0.1 reported random at 45.8% due to a fixed-seed bug; v0.2-A fixed it; v0.8 confirms random doesn't get lucky much.
Targeted at 57% is the average across three categories. See breakdown above for the mechanistic story.
Position-mode caveat. The 57% / 79% figures use position_mode=all (delta applied at every position). The surgical position_mode=last_prompt_only (Modal default) gives 0% in our tests, so the effect is causal but not yet localized to a single token. The scope-comparison table is in artifacts/benchmark_report.md.
Verifier caveat. Headline rate uses the lenient verifier (cart contains target). A strict-cart pass that requires "exactly once, no other product polluted" is being captured directly in the runner (v0.22 P2, see roadmap) and will become the canonical headline once the full rerun lands. The earlier approximate strict from action history was removed in v0.24-D after it was found to count click intents rather than executed adds.
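
The wrong-sign, random, and matched-norm noise controls discussed above can be derived from the targeted edits. This is a sketch under stated assumptions: `N_FEATURES` is a hypothetical dictionary size, reusing the targeted magnitudes on randomly chosen features is one plausible reading of "random feature edits", and none of these helper names come from the repository.

```python
import math
import random

TARGETED = {26737: -6.0, 23803: 6.0}  # feature id -> delta, from the headline table
N_FEATURES = 65536                     # hypothetical SAE dictionary size


def wrong_sign(edits):
    """Same features, every delta negated."""
    return {fid: -delta for fid, delta in edits.items()}


def random_edits(edits, trial_seed, n_features=N_FEATURES):
    """Same magnitudes on randomly drawn features, seeded per trial."""
    rng = random.Random(trial_seed)  # a fresh RNG per trial avoids a fixed-seed bug
    fids = rng.sample(range(n_features), k=len(edits))
    return dict(zip(fids, edits.values()))


def l2(vec):
    return math.sqrt(sum(x * x for x in vec))


def matched_norm_noise(target_shift, trial_seed):
    """Isotropic Gaussian direction rescaled to the targeted shift's L2 norm."""
    rng = random.Random(trial_seed)
    gaussian = [rng.gauss(0.0, 1.0) for _ in target_shift]
    scale = l2(target_shift) / l2(gaussian)
    return [scale * x for x in gaussian]


flipped = wrong_sign(TARGETED)
trial_0 = random_edits(TARGETED, trial_seed=0)
noise = matched_norm_noise([-6.0, 6.0, 0.0, 0.0], trial_seed=0)
print(flipped, trial_0, round(l2(noise), 6))
```

Seeding from the trial index keeps each trial reproducible while still drawing different features across trials.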

This is not a claim that we found "the promotional bias feature." It's a claim that two specific SAE features, intervened at the first decision step, causally shift the agent's success rate — strongly on the calibration distribution, with measurable cross-domain transfer, AND with a documented failure mode on planning tasks. The features are characterized via three independent methods (logit lens, corpus probe, ablation) and labelled by what the methods agree on — f26737_ui_selection_vocab and f23803_distraction_avoidance_vocab. Full evidence in docs/feature_characterization.md.

See docs/methodology.md for the full writeup and method details.

What this is

LLM agents are black boxes. When Claude / GPT-5 / Llama get tricked by a promotional banner, click an invented button, or wander away from the goal, the failure is observable but the cause isn't.

Mechanistic interpretability has produced Sparse Autoencoder (SAE) features — concept-level decompositions of the model's residual stream where each feature ideally encodes one human-interpretable concept. Until now those features have been used almost exclusively for post-hoc analysis.

This project wires them into a working agent as a runtime intervention surface:

Read which features fire at every decision step (live telemetry)
Intervene by adding feature-level deltas to the residual stream during inference
See it all in a HUD: feature activations, intervention timeline, before/after action diff, success/failure verdict

What ships in the box (as of v0.22)

Interactive cockpit for browser-agent SAE interventions. Live SAE feature activations, an effect-size strip per active edit (source-coded colors), a command queue for HUD-issued edits that drain at the next agent step, a baseline-vs-current action diff, a 3-second viewport-ring pulse + source badge whenever a steering edit lands, and a live counterfactual at every steering step (WITHOUT EDIT row showing what the same model on the same prompt would have done without your intervention).
Trajectory replayer + in-HUD browser (v0.21). A ▶ REPLAY SAVED button lists every past data/trajectories/*.jsonl and replays it through the same cockpit at controllable speed — zero Modal cost, deterministic playback. Both ▶ TARGETED and ▷ baseline run buttons are in the HUD too, so the entire demo flow lives inside the browser.
Reproducible testbed. bench/artifact_check.py verifies that every published number in seed_manifest.json matches the committed artifacts/results/*.jsonl snapshot. Hard-fails CI on drift (v0.24-D). bench/report.py regenerates artifacts/benchmark_report.md. bench/make_chart.py regenerates artifacts/headline.png from raw artifacts. Strict-cart canonical (exactly-one-target, no pollution) is being captured directly in the runner and is on the roadmap.
11 controlled policies in POLICY_REGISTRY:
- baseline / static / random / wrong-sign / noise (controls)
- targeted — 2 contrast-derived SAE features at step 0
- targeted-f26737-only, targeted-f23803-only — per-feature ablation (v0.22)
- prompt-only — system-prompt-only control
- failure-mining — 4 data-derived features (v0.9)
- dynamic — per-step adaptive policy (v0.9 rewrite)
Live segment on real public sites. shopgym/web_env.py is a generic Playwright env, validated headlessly on Google Shopping (24+ sponsored cards in a named "Sponsored products" section vs "All products", the strongest visual binary), eBay /deals, and AliExpress. Walmart is documented as bot-walled by PerimeterX. Captured trajectories live under data/trajectories/ for replay.
Honest failure modes exposed. The v0.8 executed: bool per step surfaces the gap between "model emitted valid JSON" and "Playwright actually clicked something." The v0.22 strict-cart double-verifier captures both lenient and "cart contains exactly one of target" per trial.
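
The drift gate described in the testbed item above amounts to recomputing each published number from the committed results and failing on any mismatch. The sketch below is not bench/artifact_check.py: the manifest layout (`{"success_rate": {policy: rate}}`), the per-row `success` field, and the one-file-per-policy naming are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def success_rate(jsonl_path):
    """Fraction of rows whose (assumed) `success` field is truthy."""
    text = Path(jsonl_path).read_text()
    rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    if not rows:
        raise ValueError(f"{jsonl_path} has no rows")
    return sum(bool(r["success"]) for r in rows) / len(rows)


def find_drift(manifest, results_dir, tol=5e-4):
    """Return one message per policy whose published rate disagrees with the data."""
    drift = []
    for policy, published in manifest["success_rate"].items():
        actual = success_rate(Path(results_dir) / f"{policy}.jsonl")
        if abs(actual - published) > tol:
            drift.append(f"{policy}: manifest={published:.3f} artifacts={actual:.3f}")
    return drift


def demo():
    """Build a synthetic 34/60 results file and check two manifests against it."""
    with tempfile.TemporaryDirectory() as tmp:
        rows = [{"success": i < 34} for i in range(60)]
        payload = "\n".join(json.dumps(r) for r in rows)
        (Path(tmp) / "targeted.jsonl").write_text(payload)
        clean = find_drift({"success_rate": {"targeted": 0.567}}, tmp)
        stale = find_drift({"success_rate": {"targeted": 0.833}}, tmp)
    return clean, stale


clean, stale = demo()
print("clean:", clean)
print("stale:", stale)
```

A CI gate built on this would exit non-zero whenever the returned list is non-empty, for example with `raise SystemExit(1)`.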

Demo (live cockpit on real public sites)

The entire demo flow lives inside the HUD now — two terminals, then everything else is in-browser:

# Terminal A — WebSocket bridge (start once, leave running):
python -m agent.ws_server                # localhost:8765

# Terminal B — Next.js cockpit (start once, leave running):
cd hud && NEXT_PUBLIC_WS_URL=ws://localhost:8765/feed npm run dev   # localhost:3000

# Open http://localhost:3000. Everything else is point-and-click.

In the HUD you can:

▶ TARGETED (eBay) — fires a live targeted run on the real eBay /deals page (shopgym/tasks/real_ebay.json)
▷ baseline (no steering) — fires the same eBay task with no SAE edits — for A/B comparison
▶ REPLAY SAVED (top-right) — opens a dropdown of every saved trajectory under data/trajectories/*.jsonl with step counts and policy labels. Pick google_shopping_usb_c_cable · targeted · 6 steps for the strongest captured demo (24+ sponsored cards on Google Shopping with explicit "Sponsored products" section vs "All products"). Adjustable replay speed (fast / normal / slow / demo). Zero Modal calls during replay — deterministic playback.

What you'll see during a targeted run on the captured Google Shopping trajectory:

step 0  ▶ baseline:  click sponsored filter chip "36-72 inch long"
        ▷ targeted:  scroll past sponsored section + steering applied
                     (f26737 -6, f23803 +6)   ← step-0 emerald pulse
step 1  ▷ targeted:  click "Lightning Cables filter" (organic refinement)
step 2-4              click organic product cards from "All products"

Cockpit shows:
- Effect Size strip with the two edits as bipolar bars
- Counterfactual row "WITHOUT EDIT → click sponsored filter chip"
- Intervention pulse + badge
- Trajectory log step-by-step

Full runbook + 60-second talk track: docs/live_demo.md. Recording recipe: docs/recording_guide.md. Presentation script: docs/presentation_script.md.

Architecture

Three loosely-coupled processes:

hud (local Next.js)
  Verdict overlay + Steering flash + Feature bars colored by category
        ▲
        │ WebSocket events
        │
browser-worker (local Python)
  ShopGym deterministic storefronts + Playwright + verifiers
        │ HTTP: /act, /features, /steer_act
        ▼
brain-server (Modal L40S)
  Llama 3.1-8B-Instruct (BF16) + Goodfire SAE on layer 19
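
One step of the browser-worker's loop against the three brain-server endpoints can be sketched as follows. Only the endpoint names (/act, /features, /steer_act) come from the diagram; the method signatures, return shapes, `FakeBrain`, and the action strings are hypothetical stand-ins so the example runs offline.

```python
from typing import Callable, Protocol


class Brain(Protocol):
    """Assumed client surface mirroring /features, /act, and /steer_act."""

    def features(self, prompt: str) -> dict[int, float]: ...
    def act(self, prompt: str) -> str: ...
    def steer_act(self, prompt: str, edits: dict[int, float]) -> str: ...


class FakeBrain:
    """Offline stand-in: suppressing f26737 flips the action away from the banner."""

    def features(self, prompt):
        return {26737: 0.8, 23803: 0.1}

    def act(self, prompt):
        return "click #promo-banner"

    def steer_act(self, prompt, edits):
        return "type #search" if edits.get(26737, 0.0) < 0 else self.act(prompt)


def targeted_policy(step: int, activations: dict[int, float]) -> dict[int, float]:
    """Two edits at step 0 only, as in the targeted policy."""
    return {26737: -6.0, 23803: 6.0} if step == 0 else {}


def run_step(brain: Brain, step: int, prompt: str, policy: Callable) -> dict:
    activations = brain.features(prompt)   # read
    edits = policy(step, activations)      # decide
    if edits:
        action = brain.steer_act(prompt, edits)  # steer + act
    else:
        action = brain.act(prompt)               # plain act
    return {"step": step, "features": activations, "edits": edits, "action": action}


brain = FakeBrain()
first = run_step(brain, 0, "Buy a USB-C cable", targeted_policy)
second = run_step(brain, 1, "Buy a USB-C cable", targeted_policy)
print(first["action"], "|", second["action"])
```

The returned dictionary is roughly the information a HUD event would need: what was read, what was edited, and what action came out.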

Quickstart

Prerequisites

Python 3.11+ with pip
Node 20+ with npm
A Modal account (free; pip install modal && modal token new)
A HuggingFace account with the Llama 3.1-8B-Instruct license accepted (gated repo)

Install

git clone https://github.com/kalyvask/inside-the-agent
cd inside-the-agent

pip install -e ".[dev]"
playwright install chromium
cd hud && npm install && cd ..

cp .env.example .env
# Fill in HF_TOKEN, ANTHROPIC_API_KEY

modal token new
modal secret create hf-token HF_TOKEN=hf_xxx...
modal deploy modal_deploy/app.py

Day 1 — verify (5-test gate)

make verify

Runs five tests against the deployed brain-server:

Model + SAE load
Feature catalog has agent-relevant features
Feature reading on agent-style prompts
Steering produces observable behavior change
Latency under 5s/step

Reproduce the headline result

# Feature discovery + magnitude tuning (~10 min)
python -m verify.feature_drill
python -m verify.tune_deltas

# Step-0 calibration to find features that flip the first decision
python -m verify.step0_calibration

# Full 9-policy benchmark on the 20-task held-out suite × 3 trials
python -m bench.rerun_p0           # baseline / targeted / wrong-sign / random / noise / prompt-only
python -m bench.rerun_v0_9_extra   # failure-mining / dynamic (v0.9 additions)
python -m bench.rerun_p0_2_scope   # targeted at last_prompt_only + all_prompt (scope comparison)

# One-shot orchestrator that runs everything above + regenerates artifacts:
python -m bench.v0_8_finalize

# Inspect / verify the artifacts
python -m bench.artifact_check     # CI gate: hard-fails on drift between manifest and artifacts/results
python -m bench.report             # regenerates artifacts/benchmark_report.md

Watch the HUD live (for the demo)

Three terminals:

# Terminal 1 — WebSocket bridge (long-lived)
python -m agent.ws_server

# Terminal 2 — Next.js HUD frontend (long-lived)
cd hud && NEXT_PUBLIC_WS_URL=ws://localhost:8765/feed npm run dev
# Open http://localhost:3000
#
# Note (Windows): `npm run build` (production) shares `.next/` with `npm run dev`.
# Stop the dev server first if you need to run a production build locally.
# CI is unaffected; `next build` runs in a fresh checkout with no dev server.

# Terminal 3 — one-command live demo
python record_demo.py
# Or: python record_demo.py --task shopgym/tasks/held_out.json --pause 6.0

# Or replay an offline trajectory (no Modal cost):
python -m verify.replay_trajectory \
  data/trajectories/promo_held_001_seed_0_targeted.jsonl --slow

Warm a real-website session (only if you hit bot detection)

# Opens a real Chrome window. Click through any CAPTCHA / cookies,
# then ask your AI assistant to "go save" — it creates the sentinel
# file and the script writes data/<site>_storage_state.json.
python warm_session.py --url https://www.walmart.com/ \
    --out data/walmart_storage_state.json --channel chrome

Repository layout

inside-the-agent/
├── modal_deploy/         brain-server (Modal app, Llama + Goodfire SAE)
│   ├── app.py            primary: Llama 3.1-8B + Goodfire SAE l19, with
│   │                     steer_act / steer_act_with_noise / read_features /
│   │                     feature_logit_lens / feature_decoder_similarity /
│   │                     sae_validation endpoints
│   └── app_gemma.py      fallback: Gemma 2-9B + Gemma Scope (not gated).
│                         Runbook: docs/cross_model_path.md
├── sae/                  loader, steering controller, feature catalog
│   └── features.yaml     v0.4 logit-lens + v0.9 failure-mining labels
├── agent/                trajectory schema, prompts, agent loop, HUD publisher
│   ├── llm_agent.py      core loop: read features → policy → steer → act
│   ├── hud_publisher.py  events to ws_server (policy_meta, baseline_action,
│   │                     step_started, features_read, steering_applied,
│   │                     action_chosen, env_updated, task_done)
│   └── ws_server.py      FastAPI bridge — /feed (WS) /publish /control
│                         /control/pending /clear /screenshots /health
│                         /trajectories /replay /start_run
├── policies/             11 policies in POLICY_REGISTRY:
│                         baseline · static · dynamic (adaptive) ·
│                         random · wrong-sign · targeted · prompt-only ·
│                         noise · failure-mining ·
│                         targeted-f26737-only · targeted-f23803-only  (v0.22 ablation)
├── shopgym/              deterministic storefronts (templated) + WebEnv
│   ├── storefront_template.py  ShopGym env + verifier hookup + strict-cart
│   │                           double-capture (v0.22)
│   ├── web_env.py              generic Playwright env for real sites
│   └── tasks/                  held_out.json (20 tasks: 8 promo + 6 halluc
│                               + 6 planning), real_ebay.json, real_google.json,
│                               real_walmart.json, real_aliexpress.json
├── bench/
│   ├── runner.py               main CLI: --policy --tasks --hud --pause --position-mode
│   ├── rerun_p0.py             sequential rerun of all 6 main policies
│   ├── rerun_v0_9_extra.py     failure-mining + dynamic
│   ├── rerun_p0_2_scope.py     targeted at last_prompt_only + all_prompt
│   ├── rerun_v0_22.py          per-feature ablation + strict-cart + corpus probe
│   ├── v0_8_finalize.py        chains scope + v0.9 extras + report regen + manifest
│   ├── artifact_check.py       CI gate: verifies manifest ↔ jsonl consistency
│   ├── report.py               regenerates artifacts/benchmark_report.md
│   ├── make_chart.py           regenerates artifacts/headline.png (v0.20)
│   └── verifiers.py            lenient + strict + upsell verifiers
├── hud/                  Next.js cockpit on localhost:3000
│   ├── app/page.tsx            layout + event handlers
│   └── components/             DemoBanner (policy + scope + seed badges),
│                               BrowserViewport, FeatureBars,
│                               SteeringControls (start-run + presets),
│                               CommandQueue (queued / applied / consumed),
│                               EffectSizeStrip, InterventionTimeline,
│                               BeforeAfterDiff, CurrentAction (+ counterfactual),
│                               TrajectoryBrowser (saved-runs replay),
│                               Verdict, SteeringFlash
├── verify/               feature discovery + verification tooling:
│                         sae_smoke, sae_validation, feature_drill,
│                         feature_characterize (logit lens),
│                         corpus_probe_large (v0.22 — 1000-prompt wikitext probe),
│                         tune_deltas, step0_calibration, feature_ablations,
│                         replay_trajectory
├── docs/                 methodology, feature_characterization, demo_script,
│                         live_demo, real_world_generalization,
│                         cross_model_path, recording_guide, data_splits
├── tests/                46 unit tests (action parser, trajectory schema,
│                         verifiers, task config, noise routing, executed
│                         tracking, ...)
├── notebooks/            explore_demo_pages.py (12-site survey)
├── artifacts/            committed canonical subset of data/:
│                         seed_manifest.json, headline.png, benchmark_report.md,
│                         results/*.jsonl (10 benchmark policy snapshots),
│                         sample_trajectory_*.jsonl
├── record_demo.py        one-command live demo launcher (clear + warm + countdown + fire)
├── warm_session.py       headed-Chrome cookie warm-up for bot-walled sites
└── data/                 trajectories, results, baselines, screenshots (gitignored)

Roadmap

Immediate (this week — demo polish)

Main rerun + auto-finalize (running now, ~2h). bench/rerun_p0.py is replacing the stale v0.2 artifact rows. bench/v0_8_finalize.py auto-chains scope reruns + report regen + manifest refresh + artifact_check.
Regenerate artifacts/headline.png from the new numbers — current chart is v0.2.
Refresh README headline table with v0.7+ rates (random=0% after seed fix, noise + prompt-only rows added).
Flip artifact_check from soft-fail to hard-fail in CI once the artifact rows are consistent.
Record the live cockpit clip via python record_demo.py + screen capture.

Short-term (1-2 weeks — close P1 reviewer items)

Strict-cart as canonical headline. Reviewer P1: lenient verifier hides repeated add-to-cart pollution. Run a strict pass that captures cart_contains_target_exactly_once alongside lenient.
Per-feature ablation studies. f26737 alone vs f23803 alone vs combined — closes the "is the effect synergistic or additive?" question.
Sponsored-vs-organic decision on a search-results page. Needs the real-site selector flake addressed first (LLM emits search-result-N patterns that don't exist in real DOMs).
HUD: latency badge per step — credibility marker, ~30 min of plumbing existing timestamps.
HUD: counterfactual baseline diff. Currently uses a cache from a prior baseline run; live counterfactual = call brain twice/step (with + without edits), shows true per-step divergence. Doubles brain cost.
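
The difference between the lenient and strict-cart checks in the first item above can be made concrete. This is a sketch, not bench/verifiers.py: it assumes the cart is a list of product ids with one entry per unit added, and the function names and product ids are illustrative.

```python
from collections import Counter


def lenient_pass(cart: list[str], target: str) -> bool:
    """Lenient: the target appears in the cart at least once."""
    return target in cart


def strict_pass(cart: list[str], target: str) -> bool:
    """Strict: exactly one unit of the target and nothing else in the cart."""
    counts = Counter(cart)
    return counts[target] == 1 and len(counts) == 1


cases = {
    "clean": ["usb-c-cable"],
    "duplicate add": ["usb-c-cable", "usb-c-cable"],
    "polluted": ["usb-c-cable", "promo-bundle"],
    "missing": ["promo-bundle"],
}
for label, cart in cases.items():
    print(label, lenient_pass(cart, "usb-c-cable"), strict_pass(cart, "usb-c-cable"))
```

Only the clean cart passes both checks; the duplicate and polluted carts are exactly the cases the lenient verifier hides.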

Medium-term (next month — strengthen the science)

Cross-model Gemma replication. Scaffolded in modal_deploy/app_gemma.py; runbook in docs/cross_model_path.md. ~$15 Modal + 3 hours attended. Closes the biggest reviewer ask: "is the result Llama-specific or general?"
v0.22 — built. Larger corpus probe. verify/corpus_probe_large.py streams wikitext-103 (1000 prompts) and reports top-activating prompts per watched feature. Output: artifacts/corpus_probe_large.json. Tightens the lexical-cluster labels in docs/feature_characterization.md.
Failure-mining feature semantic characterization. f50853 / f19079 / f39820 / f44602 are still tagged fail_mode_a/b/c/d — their logit lens returned code symbols, not English clusters. v0.22 corpus probe ALSO runs on three of these; results will either reinforce or weaken the labels.
Cross-reference with Neuronpedia. It and other public SAE explorers may have richer data on our features; we have not checked yet.
v0.21 — built. HUD trajectory replay mode + browser. ▶ REPLAY SAVED button in the HUD lists every saved trajectory and replays it through the cockpit. Zero Modal cost.
v0.24 — scaffolded, awaiting demo. Cross-scale to Llama-3.3-70B + Goodfire l50. modal_deploy/app_70b.py with all methods pre-filled; runbook in docs/cross_scale_path.md. ~$25-40 Modal + 4 hours attended. Tests whether the planning failure mode at 8B is intrinsic to the lexical-feature limit (which should persist at 70B) or specific to the 8B SAE's representation (which scale should fix). Either result is publishable: the first as evidence the limit is in the SAE training objective, the second as evidence for Goodfire's "bigger models are easier to interpret" thesis at the agentic-intervention level.

Long-term (months — research direction)

Multi-domain expansion. Beyond promo / halluc / planning — add forms, comparison shopping, multi-step planning suites. Test whether targeted generalizes across task types.
Dynamic policy v2. Current adaptive thresholds (0.40 for failure-mining features) are hand-set. Learn thresholds from a validation split.
Compositional steering. Pair f26737 with each of its decoder-neighbors (cosine sim > 0.5) — does the steering effect amplify? Tests whether feature clusters or single features carry the meaning.
Reusable testbed. Package the runner + HUD + brain-server contract so others can plug in their SAE + their model. The wedge per reviewer P2: "reproducible testbed for runtime feature interventions in browser agents, with live telemetry and controllable steering."
Failure causality vs correlation. The 4 failure-mining features fire in 100% of failures — but a heartbeat fires in 100% of car accidents. The failure-mining policy (v0.9) tests whether suppressing them actually rescues behavior, separating the causal from correlational story.
Train a dedicated SAE on browser-agent residuals. The Goodfire SAE we use was trained on LMSYS-Chat-1M, a chat-style corpus. Its features reflect chat concepts; that is why the top intervention features encode lexical patterns ("ui-selection vocabulary") rather than agent-level concepts ("sponsored-banner-recognition"). An SAE trained on residual-stream activations collected from agent episodes (target: ~10M tokens across promo, hallucination, and planning categories) should yield features more semantically aligned with agent decisions. Significant cost (~$500-1000 GPU training run) and infrastructure to build, but it addresses the lexical-feature limit at its source rather than only at the model-scale level. The most direct path past the planning failure mode if the v0.24 70B run shows it persists.
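
For the "Dynamic policy v2" item above, threshold gating and a simple threshold search can be sketched as follows. The feature ids and the 0.40 threshold come from this README; `SUPPRESS_DELTA`, the validation pairs, and the accuracy-based selection rule are hypothetical, and this is one possible approach rather than the planned implementation.

```python
FAILURE_FEATURES = (50853, 19079, 39820, 44602)
THRESHOLD = 0.40        # the current hand-set value
SUPPRESS_DELTA = -4.0   # hypothetical magnitude, not taken from the repo


def dynamic_policy(activations, threshold=THRESHOLD):
    """Suppress any watched failure feature that fires above the threshold."""
    return {
        fid: SUPPRESS_DELTA
        for fid in FAILURE_FEATURES
        if activations.get(fid, 0.0) > threshold
    }


def pick_threshold(validation, candidates):
    """Pick the candidate that best separates failed from successful steps.

    validation: list of (peak_failure_activation, failed) pairs.
    """
    def accuracy(t):
        return sum((act > t) == failed for act, failed in validation) / len(validation)

    return max(candidates, key=accuracy)


# Synthetic validation split, for illustration only.
validation = [(0.10, False), (0.20, False), (0.35, False), (0.50, True), (0.70, True)]
best = pick_threshold(validation, candidates=[0.2, 0.3, 0.4, 0.6])
edits = dynamic_policy({50853: 0.55, 19079: 0.10}, threshold=best)
print(best, edits)
```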

Status of the 4 original "Open questions"

The four open questions from earlier reviewer feedback are now wired and measurable in the codebase:

Original ask	Status	Where
Failure-mode features as steering targets	✅ built (`failure-mining` policy + catalog labels)	`policies/failure_mining.py`
Cross-domain (hallucination + planning)	✅ wired	per-category section in `artifacts/benchmark_report.md`
Cross-model (Gemma 2-9B + Gemma Scope)	📘 runbook ready	`docs/cross_model_path.md`
Dynamic steering (not just step 0)	✅ rewritten	`policies/dynamic.py` watches failure features per step

Built on

Anthropic — Scaling Monosemanticity (the SAE → frontier-model story)
Goodfire AI — Llama-3.1-8B-Instruct SAE on layer 19 (the open-weight SAE this project uses)
Cho et al. — Control RL with SAE Features (the architecture this paper proposes; this project ships an open implementation)
Modal for the brain-server compute
Playwright for ShopGym browser automation

License

MIT (code), CC-BY-4.0 (writeup in docs/).

Citation

If this is useful in your own work:

@misc{kalyvas2026insidetheagent,
  title  = {Inside the Agent: A Live Interpretability HUD for Open-Source AI},
  author = {Kalyvas, Alexandros},
  year   = {2026},
  howpublished = {Stanford CS153 Frontier Systems},
  url    = {https://github.com/kalyvask/inside-the-agent}
}

Acknowledgements

CS153 Frontier Systems (Stanford GSB / SOE, Spring 2026). Thanks to the Goodfire AI team for releasing the open SAE that made this possible.
