kevinkicho/agent-vision-unity

Agent Vision for Unity

⚠️ WORK IN PROGRESS, NOT PRODUCTION READY. This project currently uses only static screenshots for visual verification. It is not a viable product for agentic Unity development assistance; it is an exploration of what agent-Unity integration could look like. Real agentic development assistance would require live video streaming, DOM-like scene inspection, and bidirectional control, none of which is fully implemented here.

Like Playwright for web apps, but for Unity games. Periodic screenshots and JSON state are exposed to an external agent over a local HTTP server for visual verification and regression testing.

What Each Script Does

Unity Side (C#) — drop into any Unity project

| Script | What it does |
| --- | --- |
| AgentVisionBootstrap.cs | Automatically creates a GameViewRecorder and GameStateLogger when a scene loads. Add this to any GameObject and it will start capturing on Play. |
| GameViewRecorder.cs | Captures a JPG screenshot of the Game view every 0.5 seconds and on every mouse click. Saves the frames to persistentDataPath/AgentVision/Frames/<timestamp>/. |
| GameStateLogger.cs | Writes a JSON snapshot of game state alongside each recorded frame. The base version logs timestamp, scene name, and FPS only. To log game-specific state (HP, cards, enemies), subclass it and override BuildGameJson(). See the "Custom Game State" section below. |
| UnityWebhookBridge.cs | An Editor script that sends an HTTP POST to localhost:8765/webhook whenever you press Play or Stop in Unity, so the server knows when gameplay starts and ends. |
| UnityBridgeNotifier.cs | An Editor script that writes a status.json file to %LocalAppData%/AI_Bridge/ on Play/Stop. This is a file-based fallback if the HTTP server isn't running. |
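
When the HTTP server is not running, an agent could poll the file written by UnityBridgeNotifier.cs instead. The following is a minimal Python sketch, assuming the file holds a single JSON document. Its field names are not documented here, so the helper (read_bridge_status, a hypothetical name) simply returns whatever it finds:

```python
import json
import os
from pathlib import Path
from typing import Any, Optional


def default_status_path() -> Path:
    """Location given in this README: %LocalAppData%/AI_Bridge/status.json."""
    # LOCALAPPDATA is only set on Windows; fall back to the home directory elsewhere.
    base = os.environ.get("LOCALAPPDATA", str(Path.home()))
    return Path(base) / "AI_Bridge" / "status.json"


def read_bridge_status(path: Path) -> Optional[Any]:
    """Return the parsed status file, or None if it is missing or half-written."""
    try:
        # utf-8-sig tolerates a byte-order mark if the writer emitted one.
        return json.loads(path.read_text(encoding="utf-8-sig"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```

Returning None for a partially written file lets a polling loop simply retry on the next tick instead of crashing.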

Python Side — the observation and analysis tools

| Script | What it does |
| --- | --- |
| agent_server.py | Runs an HTTP server on port 8765. Receives webhook events from Unity, captures screenshots of the Unity window every 2 seconds, and serves them via GET endpoints (/status, /frame, /events). This is the main entry point; start it first. |
| agent_client.py | A command-line tool to query the server. Run python agent_client.py full to get the latest event, screenshot, and pixel diagnosis all at once. |
| capture_unity_view.py | Finds the Unity window by title, takes a screenshot of its client area, and analyzes pixels for pink/black/brightness. Used as a library by other scripts. |
| run_agent_vision.py | A unified CLI with four modes: capture (one screenshot), session (record many frames at a target FPS), analyze (extract keyframes from a session), watch (continuous capture loop). |
| session_recorder.py | Records numbered frames (frame_00000.jpg, frame_00001.jpg, ...) into a timestamped folder. Optionally encodes them to MP4 via ffmpeg. Stop with stop_recording.py. |
| analyze_session.py | Reads a session folder, picks the most distinct keyframes, runs pixel diagnosis on each, and writes a report.json summary. |
| vision_daemon.py | Runs a loop that captures a screenshot and overwrites Log/current.png every N seconds. A simpler alternative to agent_server.py's built-in vision thread. |
| watch.py | Minimal loop that captures a screenshot and prints the diagnosis (bright/pink/black) to the terminal every few seconds. |
| auto_diag.py | Runs a detect-fix-retest loop: captures a screenshot, diagnoses it, applies a fix (shader, camera background, etc.), then captures again to verify. ⚠️ Game-specific: contains project-specific fix functions. Adapt for your own game. |
| unity_input.py | Sends mouse clicks, drags, and keyboard presses to the Unity window using Win32 SendInput. Intended for programmatic gameplay control. |
| windows_capture.py | Takes a screenshot using Win32 GDI calls with no PIL dependency. A fallback capture method. |
| stop_recording.py | Creates a STOP sentinel file that tells session_recorder.py to finish recording. |

Process Flow

┌─────────────────────────────────────────────────────────────┐
│                        SETUP                                │
│                                                             │
│  1. Copy the Assets/ folder into your Unity project          │
│  2. Add AgentVisionBootstrap to a GameObject in your scene   │
│  3. pip install pillow                                      │
│  4. python agent_server.py          ← start this first      │
│  5. Press Play in Unity            ← game starts            │
└─────────────────────────┬───────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│                    DURING GAMEPLAY                           │
│                                                             │
│  Unity (C#)                      Python (server)            │
│  ┌─────────────────────┐         ┌──────────────────┐      │
│  │ GameViewRecorder     │         │ Listens on :8765 │      │
│  │  captures frame.jpg  │──file──>│                  │      │
│  │  every 0.5s + click  │         │  /webhook  ←── POST    │
│  │                      │         │  /status   ──> GET      │
│  │ GameStateLogger      │         │  /frame    ──> GET      │
│  │  writes state.json   │──file──>│  /events   ──> GET      │
│  │                      │         │                  │      │
│  │ UnityWebhookBridge   │────────>│  (POST on       │      │
│  │  POST on Play/Stop   │  HTTP  │   Play/Stop)     │      │
│  └─────────────────────┘         └────────┬─────────┘      │
│                                            │                │
└────────────────────────────────────────────┼────────────────┘
                                             │
                                             ▼
┌─────────────────────────────────────────────────────────────┐
│                    AI AGENT QUERY                            │
│                                                             │
│  python agent_client.py full                                │
│                                                             │
│  Returns:                                                   │
│    • Latest event (play_started / play_stopped)             │
│    • Pixel diagnosis (brightness, pink%, black%)            │
│    • Screenshot saved to Log/current_frame.png              │
│                                                             │
│  OR query individual endpoints:                             │
│    python agent_client.py status   → event + diagnosis     │
│    python agent_client.py frame    → save screenshot only   │
│    python agent_client.py events   → last 20 webhook events│
└─────────────────────────────────────────────────────────────┘
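
The endpoints in the diagram can also be queried without agent_client.py. The following is a minimal sketch using only the standard library, assuming /status and /events return JSON bodies. The response fields are not specified here, so nothing below depends on them. build_url and fetch_json are hypothetical helper names:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8765"  # address printed by agent_server.py on startup


def build_url(endpoint: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an endpoint name such as "status" or "/status"."""
    return base_url.rstrip("/") + "/" + endpoint.lstrip("/")


def fetch_json(endpoint: str, base_url: str = BASE_URL, timeout: float = 5.0):
    """GET an endpoint and decode its body as JSON (assumes a JSON response)."""
    # An empty ProxyHandler makes urllib talk to localhost directly,
    # ignoring any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(build_url(endpoint, base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, fetch_json("/status") would return the decoded status document; if nothing is listening, the call raises urllib.error.URLError.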

Requirements

| Dependency | Version | Install |
| --- | --- | --- |
| Unity | 2022.3+ (URP recommended) | https://unity.com/download |
| Python | 3.8+ | https://python.org |
| Pillow | 10+ | pip install pillow |
| Windows | 10/11 | Required for Win32 screenshot capture |

No other Python packages are needed. The server uses only the standard library (http.server, json, threading).

Step-by-Step Reproduction

1. Copy Unity scripts into your project

The repo folder structure matches Unity's convention:

YourUnityProject/
  Assets/
    Editor/
      UnityWebhookBridge.cs      ← Editor-only: POSTs on Play/Stop
      UnityWebhookBridge.cs.meta
      UnityBridgeNotifier.cs      ← Editor-only: writes status.json
      UnityBridgeNotifier.cs.meta
    Scripts/
      AgentVision/
        AgentVisionBootstrap.cs  ← Add this to a GameObject
        AgentVisionBootstrap.cs.meta
        GameViewRecorder.cs      ← Auto-started by Bootstrap
        GameViewRecorder.cs.meta
        GameStateLogger.cs       ← Auto-started by Bootstrap
        GameStateLogger.cs.meta

Copy the contents of this repo's Assets/ folder into your Unity project's Assets/ folder, keeping the Editor/ and Scripts/AgentVision/ subfolders. The .meta files are included so that Unity preserves the script GUIDs.

2. Add to your scene

  1. Open any scene in Unity
  2. Create an empty GameObject (right-click hierarchy → Create Empty)
  3. Name it AgentVision
  4. In the Inspector, click Add Component → search for AgentVisionBootstrap
  5. Optionally adjust captureInterval (default 0.5s) and autoStart (default true) in the Inspector

3. Install Python dependencies

pip install pillow

4. Start the server

python agent_server.py

You should see:

[AgentServer] Running on http://127.0.0.1:8765
  POST /webhook   ← Unity sends events here
  GET  /status    → Latest event + screenshot diagnosis
  GET  /frame     → Base64 PNG of last capture
  GET  /events    → Last 20 events
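
The /frame endpoint is described as returning a base64-encoded PNG. Below is a sketch of decoding such a payload to disk. Whether the server wraps the string in JSON, and under which key, is not stated here, so the function (save_base64_png, a hypothetical name) takes the base64 text directly:

```python
import base64
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def save_base64_png(b64_text: str, out_path: Path) -> int:
    """Decode base64 text, check it looks like a PNG, write it, return the byte count."""
    compact = "".join(b64_text.split())  # drop newlines or spaces inside the payload
    raw = base64.b64decode(compact, validate=True)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with the PNG signature")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_bytes(raw)
    return len(raw)
```

Checking the signature before writing catches the case where an error message, rather than an image, came back from the server.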

If your Unity window title does not contain "Unity", pass the title explicitly:

python agent_server.py --title "My Game Title"

5. Press Play in Unity

The Console should show:

[GameViewRecorder] Output: C:\...\AgentVision\Frames\2026-04-28_16-34-43
[GameViewRecorder] Capturing every 0.5s as JPG
[AgentVisionBootstrap] Vision pipeline started automatically.
[WebhookBridge] Sent play_started to http://127.0.0.1:8765/webhook

6. Query from another terminal

python agent_client.py full

Expected output:

=== STATUS ===
Server time: 1777419319.055
Latest event: play_started at 2026-04-28T16:34:43Z
Scene: MyScene
Frame exists: True
Diagnosis: bright=90.6 pink=0.0 black=0.07
  > OK

=== FRAME ===
Frame saved to: Log/current_frame.png

7. Stop playing

Press Stop in Unity. The server receives play_stopped with needs_ai_attention: true.
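
To exercise the server without opening Unity, the Play/Stop notification can be imitated from Python. This is a minimal sketch: the event names and needs_ai_attention come from this README, but the exact payload schema that UnityWebhookBridge.cs sends is an assumption, as are the "event" and "scene" field names and the helper name build_webhook_request:

```python
import json
import urllib.request

WEBHOOK_URL = "http://127.0.0.1:8765/webhook"


def build_webhook_request(event: str, scene: str,
                          url: str = WEBHOOK_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST resembling a Play/Stop notification."""
    payload = {
        "event": event,  # "play_started" or "play_stopped"; key name is assumed
        "scene": scene,  # assumed key name
        "needs_ai_attention": event == "play_stopped",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to urllib.request.urlopen() would send it. Keeping construction separate from sending makes the payload easy to inspect or adjust once the real schema is known.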

Where Files Are Saved

| What | Where |
| --- | --- |
| Screenshot frames | %UserProfile%/AppData/LocalLow/<Company>/<Project>/AgentVision/Frames/<timestamp>/ (Unity's persistentDataPath on Windows) |
| State JSON files | Same folder as frames (state_XXXXX.json) |
| Bridge status.json | %LocalAppData%/AI_Bridge/status.json |
| Server screenshot | Log/current.png (in the directory where agent_server.py runs) |

<Company> defaults to DefaultCompany; <Project> is your Unity product name, which defaults to the project name (e.g. slay-the-spire-mock).

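Because each Play session writes into its own timestamped folder, an agent usually wants the newest one. Below is a sketch that picks it by name. It assumes folder names sort chronologically, which holds for names like 2026-04-28_16-34-43, and it takes the Frames directory as an argument rather than guessing where it is. Both function names are hypothetical:

```python
from pathlib import Path
from typing import List, Optional


def latest_session_dir(frames_root: Path) -> Optional[Path]:
    """Return the subfolder of frames_root whose name sorts last, or None."""
    if not frames_root.is_dir():
        return None
    sessions = sorted(p for p in frames_root.iterdir() if p.is_dir())
    return sessions[-1] if sessions else None


def list_frames(session_dir: Path) -> List[Path]:
    """Return the JPG frames of one session in name order."""
    return sorted(session_dir.glob("*.jpg"))
```

Sorting by name instead of modification time keeps the result stable even if an older folder is touched later, for example by an analysis script writing a report into it.
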
Optional: Session Recording

# Record a session at 10 FPS
python run_agent_vision.py session --fps 10 --title "YourUnityWindowTitle"

# Stop recording
python stop_recording.py

# Analyze keyframes from the session
python run_agent_vision.py analyze --dir "path/to/session/folder"
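
The stop mechanism is a plain sentinel file: the recorder keeps capturing until a file named STOP appears. Below is a sketch of that pattern, assuming the sentinel lives in the session folder and is removed once seen (the location used by session_recorder.py is not stated here). capture_frame is a hypothetical callback standing in for the real screenshot code:

```python
import time
from pathlib import Path
from typing import Callable


def record_until_stopped(session_dir: Path,
                         capture_frame: Callable[[Path], None],
                         fps: float = 10.0,
                         max_frames: int = 100000) -> int:
    """Call capture_frame for frame_00000.jpg, frame_00001.jpg, ... until STOP exists."""
    sentinel = session_dir / "STOP"  # assumed sentinel location
    interval = 1.0 / fps
    count = 0
    while count < max_frames and not sentinel.exists():
        capture_frame(session_dir / f"frame_{count:05d}.jpg")
        count += 1
        time.sleep(interval)
    if sentinel.exists():
        sentinel.unlink()  # consume the sentinel so the next session starts clean
    return count
```

A sentinel file needs no sockets or signals, so a separate process such as stop_recording.py can end the loop just by creating one file.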

Optional: Auto-Diagnosis Loop

# Captures, diagnoses, and attempts to fix common rendering issues
python auto_diag.py

Note: auto_diag.py is game-specific. It contains hardcoded fix functions for the Slay the Spire mock project. Adapt the try_fix() function for your own game.

Self-Diagnosis

Each captured frame is analyzed for rendering failures:

  • Pink pixels → shader fallback (missing URP material)
  • Black pixels (>60%) → nothing rendered (camera/rendering bug)
  • Overexposed (>240 avg) → material/lighting issue
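
The three checks can be expressed in a few lines. Below is a sketch over a list of (R, G, B) tuples. The 60% black and 240 average-brightness thresholds are from the list above, while the per-pixel definitions of "pink" and "black" and the 1% pink threshold are assumptions, not values taken from capture_unity_view.py:

```python
from typing import Dict, Iterable, Tuple

Pixel = Tuple[int, int, int]


def is_pink(p: Pixel) -> bool:
    # Assumed test for a magenta error shader: strong red and blue, little green.
    return p[0] > 200 and p[1] < 80 and p[2] > 200


def is_black(p: Pixel) -> bool:
    # Assumed test: every channel nearly zero.
    return max(p) < 10


def diagnose(pixels: Iterable[Pixel]) -> Dict[str, object]:
    """Return average brightness (0-255), pink %, black % and a verdict string."""
    pixels = list(pixels)
    if not pixels:
        raise ValueError("no pixels to analyze")
    n = len(pixels)
    bright = sum(sum(p) / 3 for p in pixels) / n
    pink = 100.0 * sum(map(is_pink, pixels)) / n
    black = 100.0 * sum(map(is_black, pixels)) / n
    if pink > 1.0:  # assumed threshold
        verdict = "PINK: shader fallback"
    elif black > 60.0:
        verdict = "BLACK: nothing rendered"
    elif bright > 240.0:
        verdict = "OVEREXPOSED: material/lighting issue"
    else:
        verdict = "OK"
    return {"bright": bright, "pink": pink, "black": black, "verdict": verdict}
```

Pixels for a real screenshot could come from Pillow, for example by converting the image to RGB and iterating over its pixel data.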

Custom Game State

The base GameStateLogger only logs timestamp, scene name, and FPS. To log your game's state (HP, enemies, cards, etc.), create a subclass:

using AgentVision;
using UnityEngine;

public class MyGameStateLogger : GameStateLogger
{
    protected override string BuildGameJson()
    {
        // Example: return your game's state as a JSON object string.
        // Player and GameManager are placeholders for your own game classes.
        return $"{{ \"playerHp\": {Player.Instance.hp}, \"score\": {GameManager.score} }}";
    }
}

Then replace go.AddComponent<GameStateLogger>() in AgentVisionBootstrap.cs with go.AddComponent<MyGameStateLogger>().

Note: All C# scripts use the AgentVision namespace. You can change this to match your project's conventions.
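
On the agent side, fields added through BuildGameJson() can then be read back from the state files. Below is a sketch, assuming each state_XXXXX.json holds one JSON object and that custom fields such as playerHp appear at the top level (how the base class nests the custom JSON is not specified here). read_field_history is a hypothetical helper name:

```python
import json
from pathlib import Path
from typing import Any, List, Tuple


def read_field_history(session_dir: Path, field: str) -> List[Tuple[str, Any]]:
    """Return (file name, value) for every state_*.json that contains the field."""
    history = []
    for path in sorted(session_dir.glob("state_*.json")):
        try:
            state = json.loads(path.read_text(encoding="utf-8-sig"))
        except json.JSONDecodeError:
            continue  # skip files that are still being written
        if isinstance(state, dict) and field in state:
            history.append((path.name, state[field]))
    return history
```

Calling read_field_history(session_dir, "playerHp") would give the value of that field over the course of a session, which is a convenient input for regression assertions.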

Making GIFs from Sessions

python -c "
from PIL import Image; import glob
frames = sorted(glob.glob('path/to/Frames/frame_*.jpg'))
images = [Image.open(f).resize((640,480)) for f in frames[:60]]
images[0].save('gameplay.gif', save_all=True, append_images=images[1:], duration=500, loop=0)
"

Troubleshooting

| Problem | Solution |
| --- | --- |
| No screenshots captured | Make sure AgentVisionBootstrap is on a GameObject in the active scene. |
| Screenshots are all black | The window capture looks for a window title containing "Unity". Use the --title flag if your window title differs. |
| agent_client.py can't connect | Make sure agent_server.py is already running on port 8765. |
| Compile error: FindObjectOfType is obsolete | The scripts use FindAnyObjectByType, which requires Unity 2022.3+. Update Unity, or replace the call with FindObjectOfType on older versions. |
| State JSON only shows timestamp and scene | That's the base logger. Subclass GameStateLogger and override BuildGameJson() to add game state. |
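
For the connection problem in particular, a quick check is whether anything is listening on the port at all. This is a standard-library sketch (is_listening is a hypothetical helper name):

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 8765, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False while agent_server.py appears to be running, the server is probably bound to a different port or address.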

Credits

  • Code: Generated by glm-5.1:cloud via Ollama Cloud
  • Design, input, feedback, testing: Kevin Cho (kevinkicho)

License

MIT License — see LICENSE.
