diff --git a/.codeactor/skills/commit.md b/.codeactor/skills/commit.md index a419141..b5064cf 100644 --- a/.codeactor/skills/commit.md +++ b/.codeactor/skills/commit.md @@ -29,18 +29,14 @@ 1. 运行 `git status --short` 获取所有变更文件列表。 2. **过滤排除以下文件**: - | 类别 | 排除规则 | - |------|----------| - | **数据文件** | 扩展名:`.csv`, `.tsv`, `.xlsx`, `.xls`, `.parquet`, `.arrow`, `.feather`, `.h5`, `.hdf5`, `.npz`, `.npy`, `.pkl`, `.joblib`, `.sqlite`, `.sqlite3`, `.db`, `.dta`, `.sav`, `.rds`, `.rda` | - | **二进制/编译产物** | 扩展名:`.exe`, `.dll`, `.so`, `.a`, `.o`, `.obj`, `.bin`, `.pt`, `.pth`, `.onnx`, `.safetensors`, `.gguf`, `.wasm`, `.pyc`, `.pyo`, `.class`, `.jar`, `.war`, `.apk`, `.ipa`, `.whl`, `.egg` | - | **媒体文件** | 扩展名:`.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.tif`, `.tiff`, `.webp`, `.svg`(非图标/UI资源时排除),`.mp3`, `.wav`, `.flac`, `.ogg`, `.mp4`, `.avi`, `.mov`, `.mkv`, `.webm` | - | **压缩包** | 扩展名:`.zip`, `.tar`, `.gz`, `.bz2`, `.7z`, `.rar`, `.xz`, `.zst`, `.tgz`, `.tar.gz`, `.tar.bz2` | - | **测试数据/夹具** | 路径包含:`test/data/`、`tests/data/`、`test/fixtures/`、`tests/fixtures/`、`testdata/`、`__test_data__/`、`sample_data/`、`*.testdata.*` | - | **大文件提醒** | 单个文件超过 **5MB** 时,跳过并提醒用户手动处理 | + | **数据文件** | + | **二进制/编译产物** | + | **媒体文件** | + | **压缩包** | + | **测试数据/夹具** | 3. 对过滤后的代码文件执行 `git add ...`。 4. 执行 `git commit -m ""` 提交。**无需用户确认,直接提交。** -5. 如果过滤后没有任何文件可提交,告知用户「没有需要提交的代码文件(数据文件、二进制文件已自动排除)」,然后结束。 ## 步骤 4:展示提交结果 提交完成后,运行 `git log --oneline -3` 展示最近3条提交记录,让用户确认 commit 内容是否正确。 @@ -53,4 +49,3 @@ **注意事项**: - 如果仓库没有变更,直接告知用户 "没有需要提交的变更" 并结束 - 所有 git 命令使用 `run_bash` 工具执行 -- 保持交互自然,不要过度自动化 diff --git a/docs/Browser_Agent_Design.md b/docs/Browser_Agent_Design.md new file mode 100644 index 0000000..eee6242 --- /dev/null +++ b/docs/Browser_Agent_Design.md @@ -0,0 +1,902 @@ +# Browser-Agent 设计方案:go-rod 浏览器自动化集成 + +> **状态**: 设计阶段 +> **版本**: v1.0 +> **日期**: 2025-07 +> **依赖**: [go-rod/rod](https://github.com/go-rod/rod) + +--- + +## 目录 + +1. [概述](#1-概述) +2. [根因分析](#2-根因分析) +3. [候选架构对比](#3-候选架构对比) +4. 
[推荐架构设计](#4-推荐架构设计) +5. [分阶段实施路线图](#5-分阶段实施路线图) +6. [文件结构规划](#6-文件结构规划) +7. [配置设计](#7-配置设计) +8. [浏览器工具定义](#8-浏览器工具定义) +9. [安全设计](#9-安全设计) +10. [数据流设计](#10-数据流设计) +11. [关键代码模式](#11-关键代码模式) +12. [TUI 与 HTTP/WebSocket 集成](#12-tui-与-httpwebsocket-集成) +13. [生命周期与韧性](#13-生命周期与韧性) +14. [测试策略](#14-测试策略) +15. [回滚与监控](#15-回滚与监控) + +--- + +## 1. 概述 + +### 1.1 背景 + +**CodeActor Agent** 是一个基于 Hub-and-Spoke 多 Agent 架构的 AI 驱动自主编程助手,使用 Go 语言构建。当前系统包含以下 Agent: + +| Agent | 职责 | +|-------|------| +| **Conductor** | 中央调度器:任务分类、计划制定、Agent 委派、结果审核 | +| **Repo-Agent** | 代码分析:语义搜索、代码骨架、函数片段 | +| **Coding-Agent** | 代码编写、文件修改、Shell 执行、测试、自调试 | +| **Chat-Agent** | 技术解释、通用问答 | +| **DevOps-Agent** | 系统管理、Shell 命令、日志检查、进程管理 | +| **Meta-Agent** | 运行时创建自定义专用 Agent | + +### 1.2 目标 + +集成 **go-rod** 浏览器自动化库,创建 **Browser-Agent**,使系统具备以下能力: + +- 🌐 **网页自动化**:导航、点击、表单填写 +- 📸 **视觉捕获**:截图、PDF 生成 +- 📊 **数据提取**:文本、HTML、结构化数据抓取 +- 🔧 **JS 执行**:页面内脚本执行(需用户确认) +- 🍪 **会话管理**:Cookie 读取/设置 +- 📈 **监控检测**:网站健康检查、内容变更监控 + +--- + +## 2. 根因分析 + +| 维度 | 结论 | +|------|------| +| **核心问题** | 当前系统缺乏浏览器交互能力,所有 Agent 仅限于 OS/文件/代码层操作 | +| **影响范围** | 无法执行网页自动化、数据抓取、表单填写、视觉测试、网站监控等任务 | +| **根本原因** | 项目初始范围聚焦于代码分析,未纳入浏览器自动化需求 | +| **关键洞察** | 问题不仅是添加库依赖,而是需要按照 Hub-and-Spoke 模式设计完整的 Agent,同时保持安全性、资源管理和工具化交互的一致性 | + +### 系统影响评估 + +| 组件 | 影响程度 | 说明 | +|------|----------|------| +| `internal/agents/` | **高** | 新增 `BrowserAgent` 结构体、注册、任务执行循环 | +| `internal/tools/` | **高** | 新增 17 个浏览器专用工具 | +| `internal/llm/` | **无** | 复用现有 LLM 抽象层 | +| `pkg/messaging/` | **中** | 新增浏览器任务/结果消息主题 | +| `internal/config/` | **中** | 新增 `[browser]` 配置段 | +| `internal/http/` | **低** | 可能需要新增 workspace 文件服务端点 | +| `internal/tui/` | **低** | 浏览器输出文本/文件路径在聊天 UI 中显示 | +| `main.go` | **中** | Agent 注册、依赖注入、生命周期钩子 | +| `go.mod` | **低** | 添加 `github.com/go-rod/rod` | +| `codebase/` (Rust) | **无** | 不受影响 | + +--- + +## 3. 
候选架构对比 + +### 候选 1:完全嵌入 Agent(紧耦合) + +``` +Browser-Agent 直接实例化 go-rod +``` + +| 优点 | 缺点 | +|------|------| +| 实现简单、开销低 | 浏览器实例无法跨任务共享 | +| Agent 完全控制 | 每个任务需冷启动浏览器 | + +### 候选 2:浏览器服务 + Agent 前端(解耦) + +``` +BrowserService(独立服务) ←→ Browser-Agent(命令翻译) +``` + +| 优点 | 缺点 | +|------|------| +| 关注点分离清晰 | 额外的通信层 | +| 浏览器生命周期可跨 Agent 复用 | 并发控制更复杂 | + +### 候选 3:混合模式 — Agent 持有 rod,生命周期由单例提供者管理 ⭐ + +``` +Browser-Agent ←→ BrowserManager(单例)←→ Chrome 进程 + ←→ Page 上下文(每任务独立) +``` + +| 优点 | 缺点 | +|------|------| +| 最优资源使用(单 Chromium 进程) | 需仔细同步控制 | +| Agent 代码简洁 | 浏览器崩溃影响所有待处理任务 | +| 避免冷启动成本 | 需实现健康检查和恢复机制 | +| 通过信号量控制并发 | — | + +--- + +## 4. 推荐架构设计 + +### 4.1 选择:候选 3 — 混合模式 + +**理由**:遵循"浏览器进程为单一事实来源"原则,通过 `BrowserManager` 单例管理整个应用生命周期的浏览器实例。与现有 Hub-and-Spoke 模式高度契合。 + +### 4.2 架构图 + +``` + ┌─────────────────┐ + │ Conductor │ ← 中央协调器 + │ (Orchestrator) │ + └────────┬────────┘ + │ + │ delegate_browser 工具调用 + │ 发布 BrowserTask 消息 + ▼ + ┌─────────────────┐ + │ Browser-Agent │ ← LLM 推理 + BrowserToolSet + │ (浏览器专家) │ + └────────┬────────┘ + │ + │ 工具执行调用 + ▼ + ┌─────────────────┐ + │ BrowserManager │ ← 单例,浏览器生命周期管理 + │ (浏览器管理器) │ + └────────┬────────┘ + │ + │ go-rod (Chrome DevTools Protocol) + ▼ + ┌─────────────────┐ + │ Chrome Browser │ ← Headless Chrome 进程 + │ (无头浏览器) │ + └────────┬────────┘ + │ + │ 为每个任务创建独立页面 + ▼ + ┌─────────────────┐ + │ Page Context │ ← 每个任务独立的标签页 + │ (页面上下文) │ + └─────────────────┘ +``` + +### 4.3 核心组件职责 + +| 组件 | 职责 | +|------|------| +| **Conductor** | 分类浏览器任务 → `delegate_browser` 调用 | +| **Browser-Agent** | 接收任务 → LLM 推理 → 工具调用 → 返回结果 | +| **BrowserManager** | 浏览器启动/关闭、页面获取/释放、健康检查、安全策略 | +| **BrowserToolSet** | 17 个浏览器操作工具,供 LLM 调用 | +| **WorkspaceGuard** | 复用现有机制,确保文件写入工作区内 | + +--- + +## 5. 
分阶段实施路线图 + +### Phase 1:基础设施 — 依赖、配置、BrowserManager + +| 步骤 | 内容 | 优先级 | +|------|------|--------| +| 1.1 | `go.mod` 添加 `github.com/go-rod/rod` 依赖 | P0 | +| 1.2 | `internal/config/config.go` 扩展 `[browser]` 配置段 | P0 | +| 1.3 | 创建 `internal/browser/manager.go` — 浏览器生命周期管理 | P0 | +| 1.4 | 创建 `internal/browser/security.go` — 安全策略实现 | P0 | +| 1.5 | 创建 `internal/browser/config.go` — Chrome 命令行标志生成 | P1 | + +### Phase 2:浏览器工具定义 + +| 步骤 | 内容 | 优先级 | +|------|------|--------| +| 2.1 | 创建 `internal/tools/browser/` 包目录 | P0 | +| 2.2 | 实现导航类工具:`navigate`, `go_back`, `go_forward`, `reload`, `get_current_url` | P0 | +| 2.3 | 实现交互类工具:`click`, `input`, `scroll`, `wait_element`, `wait` | P0 | +| 2.4 | 实现提取类工具:`extract_text`, `extract_html` | P1 | +| 2.5 | 实现输出类工具:`screenshot`, `pdf` | P1 | +| 2.6 | 实现高级工具:`evaluate_js`, `get_cookies`, `set_cookies` | P2 | +| 2.7 | 创建 `registry.go` — `BrowserTools()` 注册函数 | P0 | + +### Phase 3:Browser-Agent 实现 + +| 步骤 | 内容 | 优先级 | +|------|------|--------| +| 3.1 | 创建 `internal/agents/browser_agent.go` | P0 | +| 3.2 | 实现 LLM 推理循环(接收任务 → 获取页面 → 工具调用 → 返回结果) | P0 | +| 3.3 | 编写浏览器 Agent 系统提示词 | P0 | +| 3.4 | 实现 Agent 工厂函数和注册逻辑 | P1 | + +### Phase 4:Conductor 集成 + +| 步骤 | 内容 | 优先级 | +|------|------|--------| +| 4.1 | 在 `conductor.go` 中新增 `delegate_browser` 工具 | P0 | +| 4.2 | 扩展 Conductor 任务分类规则(识别浏览器类意图) | P1 | +| 4.3 | 通过消息系统实现 Conductor ↔ Browser-Agent 通信 | P0 | + +### Phase 5:TUI 与 HTTP/WebSocket 支持 + +| 步骤 | 内容 | 优先级 | +|------|------|--------| +| 5.1 | 确保 Browser-Agent 消息流入 TUI 显示 | P1 | +| 5.2 | HTTP 服务器添加 workspace 文件服务端点 `GET /workspace/{filepath}` | P1 | +| 5.3 | WebSocket 消息中包含文件路径引用 | P2 | + +### Phase 6:生命周期与韧性 + +| 步骤 | 内容 | 优先级 | +|------|------|--------| +| 6.1 | 实现 `BrowserManager.HealthCheck()` 健康检查 | P1 | +| 6.2 | 浏览器崩溃自动重启机制 | P1 | +| 6.3 | 空闲超时自动关闭浏览器释放资源 | P2 | +| 6.4 | `main.go` 优雅关闭时清理浏览器进程 | P1 | + +### Phase 7:测试与验证 + +| 步骤 | 内容 | 优先级 | +|------|------|--------| +| 7.1 | 浏览器工具单元测试(使用 `httptest` 本地服务器) | P1 | +| 7.2 | 
Browser-Agent 集成测试(模拟 LLM 输出) | P2 | +| 7.3 | 安全测试(`file://` 拦截、工作区外下载阻止) | P1 | +| 7.4 | 并发压力测试(信号量正确序列化) | P2 | +| 7.5 | 端到端系统测试 | P2 | + +--- + +## 6. 文件结构规划 + +``` +codeactor-agent/ +├── main.go # 🔄 修改:初始化BrowserManager,注册BrowserAgent +├── go.mod # 🔄 修改:+ go-rod 依赖 +├── config.example.toml # 🔄 修改:更新 [browser] 配置段 +│ +├── internal/ +│ ├── agents/ +│ │ ├── browser_agent.go # 🆕 Browser-Agent 实现 +│ │ ├── conductor.go # 🔄 修改:delegate_browser 工具 +│ │ └── agent_registry.go # 🔄 修改:BrowserAgent 注册 +│ │ +│ ├── tools/ +│ │ └── browser/ # 🆕 浏览器工具包 +│ │ ├── navigate.go # 导航到指定 URL +│ │ ├── click.go # 元素点击 +│ │ ├── input.go # 表单输入 +│ │ ├── extract.go # 文本/HTML 提取 +│ │ ├── screenshot.go # 截图 +│ │ ├── pdf.go # PDF 生成 +│ │ ├── evaluate_js.go # JavaScript 执行 +│ │ ├── wait_element.go # 等待元素出现 +│ │ ├── cookies.go # Cookie 管理 +│ │ ├── scroll.go # 页面滚动 +│ │ ├── history.go # 浏览器历史导航 +│ │ └── registry.go # BrowserTools() 注册函数 +│ │ +│ ├── browser/ # 🆕 浏览器管理包 +│ │ ├── manager.go # 浏览器生命周期、页面获取/释放 +│ │ ├── config.go # 配置结构和 Chrome flags 生成 +│ │ ├── security.go # 域名过滤、file:// 拦截、下载边界 +│ │ └── manager_test.go # BrowserManager 测试 +│ │ +│ ├── config/ +│ │ └── config.go # 🔄 修改:BrowserConfig 结构体 +│ │ +│ └── http/ +│ └── fileserver.go # 🆕 可选:workspace 文件服务端点 +``` + +> 🆕 = 新增文件 | 🔄 = 修改现有文件 + +--- + +## 7. 
配置设计 + +### 7.1 配置结构体 + +```go +// BrowserConfig 浏览器配置 +type BrowserConfig struct { + Headless bool `toml:"headless"` // 无头模式 + BrowserPath string `toml:"browser_path"` // 浏览器路径(空=自动下载) + UserDataDir string `toml:"user_data_dir"` // 用户数据目录(空=临时目录) + ViewportWidth int `toml:"viewport_width"` // 视口宽度 + ViewportHeight int `toml:"viewport_height"` // 视口高度 + AllowedDomains []string `toml:"allowed_domains"` // 允许域名列表(空=全部允许) + BlockedDomains []string `toml:"blocked_domains"` // 阻止域名列表 + TimeoutSeconds int `toml:"timeout_seconds"` // 单个操作超时 + MaxConcurrentPages int `toml:"max_concurrent_pages"` // 最大并发页面数 + AutoLaunch bool `toml:"auto_launch"` // 首次请求时自动启动 + IdleTimeout string `toml:"idle_timeout"` // 空闲超时(如 "5m") + AllowNoSandbox bool `toml:"allow_no_sandbox"` // 允许--no-sandbox(Docker中需要) + ExtraArgs []string `toml:"extra_args"` // 额外的Chrome命令行参数 +} +``` + +### 7.2 配置示例 + +```toml +[browser] +headless = true +browser_path = "" +user_data_dir = "" +viewport_width = 1280 +viewport_height = 720 +allowed_domains = ["example.com", "api.example.com"] +blocked_domains = ["malware.test"] +timeout_seconds = 30 +max_concurrent_pages = 4 +auto_launch = true +idle_timeout = "5m" +allow_no_sandbox = false # Docker 中设为 true +extra_args = [] +``` + +### 7.3 Chrome 启动标志(安全强化) + +```go +var secureChromeFlags = []string{ + "--headless=new", // 新版无头模式 + "--disable-gpu", // 禁用GPU + "--no-first-run", // 跳过首次运行向导 + "--disable-default-apps", // 禁用默认应用 + "--disable-extensions", // 禁用扩展 + "--disable-background-networking", // 禁用后台网络 + "--disable-sync", // 禁用同步 + "--disable-translate", // 禁用翻译 + "--hide-scrollbars", // 隐藏滚动条 + "--metrics-recording-only", // 仅记录指标 + "--mute-audio", // 静音 + "--disable-dev-shm-usage", // 使用 /tmp 而非 /dev/shm + "--no-sandbox", // 仅容器环境(由 allow_no_sandbox 控制) +} +``` + +--- + +## 8. 浏览器工具定义 + +### 8.1 工具总览 + +| # | 工具名 | 功能 | 参数 | 返回值 | +|---|--------|------|------|--------| +| 1 | `navigate` | 页面导航 | `url` (string), `timeout_seconds` (int?) 
| title, url, status | +| 2 | `click` | 元素点击 | `selector` (string), `button` (string?) | 操作结果 | +| 3 | `input` | 表单输入 | `selector` (string), `text` (string) | 操作结果 | +| 4 | `extract_text` | 提取文本 | `selector` (string), `max_chars` (int?) | 文本内容 | +| 5 | `extract_html` | 提取HTML | `selector` (string) | outerHTML(截断) | +| 6 | `screenshot` | 截图 | `selector` (string?), `whole_page` (bool), `output_file` (string?) | 文件路径 | +| 7 | `pdf` | 生成PDF | `output_file` (string?) | 文件路径 | +| 8 | `evaluate_js` | 执行JS | `code` (string) | 执行结果 JSON | +| 9 | `wait_element` | 等待元素 | `selector` (string), `timeout_seconds` (int?) | 是否出现 | +| 10 | `get_cookies` | 获取Cookie | 无 | Cookie列表 | +| 11 | `set_cookies` | 设置Cookie | `cookies` ([]map) | 操作结果 | +| 12 | `scroll` | 页面滚动 | `x` (int), `y` (int) | 操作结果 | +| 13 | `go_back` | 后退 | 无 | 操作结果 | +| 14 | `go_forward` | 前进 | 无 | 操作结果 | +| 15 | `reload` | 刷新 | 无 | 操作结果 | +| 16 | `get_current_url` | 获取当前URL | 无 | URL字符串 | +| 17 | `wait` | 等待毫秒 | `milliseconds` (int) | 操作结果 | + +### 8.2 工具安全分类 + +| 安全等级 | 工具 | 限制 | +|----------|------|------| +| 🟢 安全 | `navigate`, `click`, `input`, `extract_text`, `extract_html`, `screenshot`, `pdf`, `wait_element`, `scroll`, `go_back`, `go_forward`, `reload`, `get_current_url`, `wait`, `get_cookies`, `set_cookies` | URL 验证 + 工作区边界 | +| 🟡 需确认 | `evaluate_js` | 触发 `ask_user_for_help` 用户确认流程 | + +--- + +## 9. 安全设计 + +### 9.1 多层安全防护 + +``` +┌─────────────────────────────────────────────────────┐ +│ 第1层:URL 验证 │ +│ 仅允许 http:// 和 https:// scheme │ +│ 拦截 file://、data:、javascript:、chrome:// 等 │ +├─────────────────────────────────────────────────────┤ +│ 第2层:域名过滤 │ +│ HijackRequests 实现允许/阻止列表 │ +│ 支持域名模式匹配(如 *.example.com) │ +├─────────────────────────────────────────────────────┤ +│ 第3层:Chrome 沙箱 │ +│ --headless=new, --disable-extensions, etc. 
│ +│ --no-sandbox 仅在容器环境下启用 │ +├─────────────────────────────────────────────────────┤ +│ 第4层:工作区边界 │ +│ 截图/PDF/下载 强制写入 workspace 目录 │ +│ 复用现有 WorkspaceGuard 进行路径检查 │ +├─────────────────────────────────────────────────────┤ +│ 第5层:用户确认 │ +│ evaluate_js 等高风险操作需用户明确批准 │ +│ 通过 ask_user_for_help 机制实现 │ +├─────────────────────────────────────────────────────┤ +│ 第6层:资源限制 │ +│ --js-flags="--max-old-space-size=256" │ +│ --renderer-process-limit=4 │ +│ 操作超时 30s(可配置) │ +└─────────────────────────────────────────────────────┘ +``` + +### 9.2 BrowserManager 安全职责 + +```go +// BrowserManager 安全相关方法 +type BrowserManager struct { + // ... + securityPolicy *SecurityPolicy + workspaceGuard *WorkspaceGuard +} + +type SecurityPolicy struct { + AllowedDomains []string // 允许访问的域名 + BlockedDomains []string // 禁止访问的域名 + AllowFileAccess bool // 是否允许 file://(默认 false) + AllowDataURL bool // 是否允许 data: URL(默认 false) +} + +// ValidateURL 验证 URL 安全性 +func (m *Manager) ValidateURL(rawURL string) error + +// SetupPageSecurity 为页面设置安全路由器 +func (m *Manager) SetupPageSecurity(page *rod.Page) error + +// ValidateFilePath 验证文件路径在工作区内 +func (m *Manager) ValidateFilePath(path string) error +``` + +--- + +## 10. 数据流设计 + +### 10.1 完整请求流 + +``` +用户输入: "请截图 https://example.com 首页" + │ + ▼ +┌──────────────────────────────────────────────────────────┐ +│ Conductor │ +│ 1. 任务分类 → 识别为浏览器任务 │ +│ 2. 调用 delegate_browser(task="截图 example.com 首页") │ +│ 3. 发布 BrowserTask 消息到消息总线 │ +└──────────────┬───────────────────────────────────────────┘ + │ + ▼ +┌──────────────────────────────────────────────────────────┐ +│ Browser-Agent │ +│ 1. 从消息总线接收 BrowserTask │ +│ 2. 创建任务上下文(含超时) │ +│ 3. 从 BrowserManager 获取页面 │ +│ 4. 
开始 LLM 推理循环: │ +│ ┌─────────────────────────────────────────┐ │ +│ │ LLM: 调用 navigate("https://example.com")│ │ +│ │ Tool: 验证URL → 导航 → 等待加载 │ │ +│ │ Result: {title:"Example Domain", url:...}│ │ +│ │ │ │ +│ │ LLM: 调用 screenshot(whole_page=true) │ │ +│ │ Tool: 截图 → 保存到 workspace/browser/ │ │ +│ │ Result: {path:"browser/screenshots/x.png"}│ │ +│ │ │ │ +│ │ LLM: 调用 finish("截图已完成") │ │ +│ └─────────────────────────────────────────┘ │ +│ 5. 释放页面 │ +│ 6. 发送 BrowserResult 消息 │ +└──────────────┬───────────────────────────────────────────┘ + │ + ▼ +┌──────────────────────────────────────────────────────────┐ +│ Conductor │ +│ 1. 接收 BrowserResult │ +│ 2. 汇总结果返回给用户 │ +└──────────────────────────────────────────────────────────┘ +``` + +### 10.2 消息格式 + +**BrowserTask(Conductor → Browser-Agent)**: +```json +{ + "task_id": "uuid-xxx", + "prompt": "截图 https://example.com 首页", + "context": "用户需要整页截图", + "timeout_seconds": 60 +} +``` + +**BrowserResult(Browser-Agent → Conductor)**: +```json +{ + "task_id": "uuid-xxx", + "status": "success", + "summary": "截图已完成", + "files": [ + { + "type": "screenshot", + "path": "browser/screenshots/abc123.png", + "description": "example.com 首页截图" + } + ] +} +``` + +--- + +## 11. 
关键代码模式 + +### 11.1 BrowserManager 页面获取 + +```go +// AcquirePage 获取一个浏览器页面(受信号量控制) +func (m *Manager) AcquirePage(ctx context.Context) (*rod.Page, func(), error) { + // 信号量控制并发 + select { + case m.sem <- struct{}{}: + case <-ctx.Done(): + return nil, nil, ctx.Err() + } + + m.mu.Lock() + defer m.mu.Unlock() + + // 懒启动浏览器 + if m.browser == nil { + if err := m.launch(); err != nil { + <-m.sem + return nil, nil, fmt.Errorf("启动浏览器失败: %w", err) + } + } + + // 创建新页面 + page, err := m.browser.Page(proto.TargetCreateTarget{URL: "about:blank"}) + if err != nil { + <-m.sem + return nil, nil, fmt.Errorf("创建页面失败: %w", err) + } + + // 设置安全策略 + m.setupPageSecurity(page) + + // 释放函数 + release := func() { + page.Close() + <-m.sem + m.lastUsed = time.Now() + } + + return page, release, nil +} +``` + +### 11.2 浏览器工具示例(navigate) + +```go +// NavigateTool 页面导航工具 +type NavigateTool struct { + workspaceGuard *WorkspaceGuard +} + +func (t *NavigateTool) Execute(ctx context.Context, params map[string]interface{}) (interface{}, error) { + // 变量命名为 rawURL,避免遮蔽 net/url 包 + rawURL, ok := params["url"].(string) + if !ok { + return nil, fmt.Errorf("参数 'url' 必须为字符串") + } + + // URL 安全验证 + uri, err := url.Parse(rawURL) + if err != nil || (uri.Scheme != "http" && uri.Scheme != "https") { + return nil, fmt.Errorf("仅允许 http/https URL,收到: %s", rawURL) + } + + // 获取页面上下文 + page, ok := ctx.Value(PageCtxKey).(*rod.Page) + if !ok { + return nil, errors.New("页面上下文不可用") + } + + // 超时控制 + timeout := 30 * time.Second + ctx, cancel := context.WithTimeout(ctx, timeout) + defer cancel() + + // 导航 + if err := page.Timeout(timeout).Navigate(rawURL); err != nil { + return nil, fmt.Errorf("导航失败: %w", err) + } + + page.WaitLoad() + + info, _ := page.Info() + return map[string]string{ + "title": info.Title, + "url": info.URL, + "status": "success", + }, nil +} +``` + +### 11.3 Browser-Agent LLM 推理循环 + +```go +func (ag *BrowserAgent) handleTask(taskMsg BrowserTask) { + ctx, cancel := context.WithTimeout(context.Background(), ag.timeout) + defer cancel() + + // 获取页面 
+ page, release, err := ag.browserMgr.AcquirePage(ctx) + if err != nil { + ag.reportError(taskMsg, err) + return + } + defer release() + + // 将页面注入上下文 + ctx = context.WithValue(ctx, PageCtxKey, page) + + // 构建对话 + msgs := []llm.Message{ + {Role: "system", Content: browserSystemPrompt}, + {Role: "user", Content: taskMsg.Prompt}, + } + + // LLM 推理循环 + for { + response, err := ag.llm.Chat(ctx, msgs, ag.tools) + if err != nil { + ag.reportError(taskMsg, err) + return + } + + // 处理工具调用 + for _, call := range response.ToolCalls { + result := ag.executeTool(ctx, call) + msgs = append(msgs, llm.Message{ + Role: "tool", + Content: result, + }) + } + + if response.FinishReason == "stop" { + ag.sendResult(taskMsg, response.Content) + return + } + } +} +``` + +--- + +## 12. TUI 与 HTTP/WebSocket 集成 + +### 12.1 TUI 模式 + +- 浏览器 Agent 的文本输出通过消息系统流入 TUI,与现有 Agent 消息显示方式一致 +- 截图/PDF 文件路径以文本形式显示(如 `📸 截图已保存: browser/screenshots/abc.png`) +- 无需额外 TUI 组件修改 + +### 12.2 HTTP/WebSocket 模式 + +- **文件服务端点**:`GET /workspace/{filepath}` → 流式返回二进制文件 +- **WebSocket 消息**:Browser-Agent 结果消息中包含文件引用 +- **前端行为**:收到文件路径后通过 HTTP 端点获取文件内容渲染 + +```json +// WebSocket 消息示例 +{ + "type": "browser_result", + "task_id": "uuid-xxx", + "summary": "截图已完成", + "attachments": [ + { + "type": "image/png", + "url": "/workspace/browser/screenshots/abc123.png", + "description": "example.com 首页截图" + } + ] +} +``` + +### 12.3 工作区文件服务安全 + +```go +// 文件服务端点必须验证路径在工作区内 +func (s *Server) ServeWorkspaceFile(c *gin.Context) { + // 变量命名为 relPath,避免遮蔽 path/filepath 包 + relPath := c.Param("filepath") + fullPath := filepath.Join(s.workspaceDir, relPath) + + // WorkspaceGuard 验证 + if !s.workspaceGuard.IsPathInWorkspace(fullPath) { + c.Status(http.StatusForbidden) + return + } + + c.File(fullPath) +} +``` + +--- + +## 13. 
生命周期与韧性 + +### 13.1 状态机 + +``` + ┌─────────┐ 首次 AcquirePage ┌──────────┐ + │ Idle │ ──────────────────────→ │ Active │ + │ (空闲) │ │ (活跃) │ + └────┬────┘ └─────┬─────┘ + │ │ + │ 空闲超时 │ 页面全部释放 + │ (idle_timeout) │ + 无等待者 + │ │ + ▼ ▼ + ┌─────────┐ 崩溃/健康检查失败 ┌──────────┐ + │ Closed │ ←──────────────────────── │ Crashed │ + │ (已关闭) │ │ (已崩溃) │ + └─────────┘ └─────┬─────┘ + │ + │ 自动重启 + │ + ▼ + ┌──────────┐ + │ Active │ + │ (重新活跃) │ + └──────────┘ +``` + +### 13.2 健康检查 + +```go +// HealthCheck 通过 CDP ping 检查浏览器是否存活 +func (m *Manager) HealthCheck() error { + if m.browser == nil { + return nil // 尚未启动 + } + ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) + defer cancel() + return m.browser.Ping(ctx) +} +``` + +### 13.3 优雅关闭 + +```go +// Close 优雅关闭浏览器管理器 +func (m *Manager) Close() error { + m.mu.Lock() + defer m.mu.Unlock() + + if m.browser != nil { + if err := m.browser.Close(); err != nil { + return fmt.Errorf("关闭浏览器失败: %w", err) + } + m.browser = nil + } + + // 清理临时 user-data-dir + if m.tempDir != "" { + os.RemoveAll(m.tempDir) + } + + return nil +} +``` + +--- + +## 14. 
测试策略 + +### 14.1 测试层次 + +``` +┌──────────────────────────────────────────┐ +│ E2E 系统测试 │ +│ 真实用户请求 → Conductor → Browser-Agent │ +│ → Chrome → 真实网站 │ +├──────────────────────────────────────────┤ +│ 集成测试 │ +│ Browser-Agent + 模拟 LLM + 真实 Chrome │ +├──────────────────────────────────────────┤ +│ 单元测试 │ +│ 工具级测试(httptest 本地服务器) │ +├──────────────────────────────────────────┤ +│ 安全测试 │ +│ URL拦截、文件边界、JS确认流程 │ +└──────────────────────────────────────────┘ +``` + +### 14.2 测试用例清单 + +| 类别 | 测试用例 | 预期结果 | +|------|----------|----------| +| **工具测试** | `navigate` 正常URL | 返回页面标题和URL | +| | `navigate` file:// URL | 返回错误 | +| | `click` 有效选择器 | 点击成功 | +| | `click` 无效选择器 | 返回超时错误 | +| | `screenshot` 默认路径 | 文件保存到 workspace | +| | `screenshot` 自定义路径(工作区外) | 返回权限错误 | +| | `extract_text` 大页面 | 按 max_chars 截断 | +| **Agent测试** | 模拟LLM工具调用序列 | 正确执行并返回结果 | +| | 页面获取超时 | 返回超时错误 | +| **安全测试** | `file:///etc/passwd` 导航 | 被URL验证拦截 | +| | `evaluate_js` 未确认 | 等待确认/超时 | +| | 下载到工作区外 | WorkspaceGuard拦截 | +| **并发测试** | 超过 max_concurrent_pages 的请求 | 第5个请求阻塞直到释放 | +| **生命周期测试** | 空闲超时后新请求 | 浏览器重启并正常服务 | + +--- + +## 15. 回滚与监控 + +### 15.1 Feature Flag + +```toml +# 紧急回滚开关 +[agents] +enable_browser_agent = true # 设为 false 可禁用 Browser-Agent +``` + +- 设为 `false` 时:`BrowserManager` 不初始化,`delegate_browser` 工具返回"不可用" +- 不影响现有 Agent 的任何功能 +- `go-rod` 依赖可保留在 `go.mod` 中(不影响编译产物大小) + +### 15.2 监控指标 + +| 指标 | 来源 | 说明 | +|------|------|------| +| 浏览器存活状态 | `GET /health/browser` | `{"status":"ok","pages":2}` | +| 活跃页面数 | BrowserManager | 当前使用的标签页数 | +| 工具调用计数 | 日志 | 每种工具的调用频率 | +| 错误率 | 日志 | 工具执行失败/超时比例 | +| 浏览器崩溃次数 | 日志 | 自动恢复计数 | +| 内存使用 | Chrome进程监控 | 防止内存泄漏 | + +### 15.3 日志记录 + +```go +// 关键事件日志 +log.Info("浏览器启动", "flags", flags) +log.Info("页面获取", "available_pages", available) +log.Info("页面释放", "remaining_pages", remaining) +log.Warn("浏览器崩溃", "error", err, "restart_attempt", attempt) +log.Error("导航失败", "url", sanitizedURL, "error", err) +log.Info("浏览器空闲超时,自动关闭") +``` + +--- + +## 附录 + +### A. 
依赖项 + +``` +github.com/go-rod/rod # 浏览器自动化核心库 +``` + +### B. 参考资源 + +- [go-rod 官方文档](https://go-rod.github.io/) +- [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/) +- [CodeActor Agent 架构文档](./ARCHITECTURE.md) + +### C. 术语表 + +| 术语 | 说明 | +|------|------| +| CDP | Chrome DevTools Protocol — Chrome 调试协议 | +| Hub-and-Spoke | 中心辐射型架构 — Conductor 为 Hub,Agent 为 Spoke | +| BrowserManager | 浏览器管理器 — 管理浏览器实例的单例 | +| WorkspaceGuard | 工作区守卫 — 确保文件操作在安全边界内 | +| Semaphore | 信号量 — 控制并发页面数的同步原语 | +| HijackRequests | Rod 的请求拦截功能 — 用于域名过滤和安全策略 | + +--- + +> **文档维护者**: CodeActor Team +> **最后更新**: 2025-07 +> **状态**: 待审查 → 待实施 diff --git a/docs/Prompt_Cache_Optimization_Plan.md b/docs/Prompt_Cache_Optimization_Plan.md new file mode 100644 index 0000000..a2f96e9 --- /dev/null +++ b/docs/Prompt_Cache_Optimization_Plan.md @@ -0,0 +1,224 @@ +# Prompt 缓存优化方案 + +> 审计日期:2025-07-16 +> 审计依据:`docs/Prompt_cache.md` 最佳实践文档 +> 决策方法:经过 `deepthinking` 深度分析后确定 + +--- + +## 一、审计背景 + +对照 LLM Prompt Cache 最佳实践文档的五项核心检查清单,对项目中 7 个 Agent(Conductor、Coding、Repo、Chat、DevOps、Meta、ImplPlan)的 prompt 构建方式进行了全面审计。 + +LLM 缓存采用**严格前缀匹配**(Prefix Matching)机制:从第一个 Token 开始必须完全一致,一旦中途有任何字符不同,该字符之后的所有缓存全部失效。 + +## 二、审计结论总览 + +| 编号 | 严重程度 | 问题 | 决策 | +|------|---------|------|------| +| **A** | 🔴 P0 | Conductor 动态项目上下文放在静态 prompt 之前 | **必须修复** | +| **B** | 🟡 P0 | RepoAgent 动态数据插入顺序不当 | **必须修复** | +| **C** | 🟢 P1 | `FunctionDef.Parameters` 使用 `map[string]any` | **建议优化**(仅加防御注释) | +| **D** | 🟡 P0 | `FormatPrompt` 中 Environment 字段条件拼接 | **必须修复** | +| **E** | ⚪ P2 | 未实现路由亲和性 | **暂不修复**(集群部署时再处理) | + +--- + +## 三、P0 必须修复项 + +### 问题 A:Conductor 动态项目上下文放在静态 prompt 之前 + +**文件**:`internal/agents/conductor.go`(约第 688 行) + +**当前代码**: +```go +// ❌ 动态上下文被 prepend 到静态 prompt 之前 +systemPrompt = fmt.Sprintf("### Project Workspace Context\n%s\n\n", loadResult.Content) + systemPrompt +``` + +**问题分析**: +- `loadProjectContext()` 加载的 `CODEACTOR.md`/`CLAUDE.md`/`AGENTS.md` 每个项目内容不同 +- 它被放在 
`conductor.prompt.md`(139 行静态 prompt)的**最前面** +- 切换项目时,第一个 token 就不同 → 整个 System Prompt 缓存 100% 失效 +- Conductor 的 prompt 是系统中最大、最复杂的,缓存失效代价极高 + +**修复方案**:将 Project Context 移到 system prompt **末尾** + +```go +// ✅ 正确:动态上下文放在末尾 +systemPrompt := a.GlobalCtx.FormatPrompt(conductorPrompt) +// ... 追加 Custom Agents 注册信息 ... +if shouldLoadProjectContext { + systemPrompt += "\n\n### Project Workspace Context\n" + loadResult.Content + "\n" +} +``` + +**预期收益**: +- 139 行静态模板 + Environment + Language Instructions 成为固定前缀 +- 跨项目、跨会话共享缓存 +- 缓存命中率预计提升 60%~80% + +--- + +### 问题 B:RepoAgent 动态数据插入顺序不当 + +**文件**:`internal/agents/repo.go`(约第 195-217 行) + +**当前代码**: +```go +// ❌ 动态 investigation 数据插在静态 prompt 和环境信息之间 +systemPrompt := repoPrompt // 54 行静态 +systemPrompt += info // ← 动态数据插在中间 +systemPrompt = a.GlobalCtx.FormatPrompt(systemPrompt) // 追加 Environment +``` + +**问题分析**: +- `doPreInvestigate()` 返回的 Directory Tree、Core Functions、File Skeletons 每个项目不同,甚至同一项目代码变化后也不同 +- 动态数据后的 Environment + Language Instructions 缓存连带失效 +- RepoAgent 是高频调用 Agent + +**修复方案**:先 `FormatPrompt`(静态 + 环境),最后追加动态调查数据 + +```go +// ✅ 正确:静态在前,动态在最后 +systemPrompt := a.GlobalCtx.FormatPrompt(repoPrompt) +systemPrompt += info // investigation 数据放在最后 +``` + +**预期收益**: +- 54 行静态指令完全固化于前缀 +- 动态数据放尾部符合 LLM 注意力机制的"近因效应" + +--- + +### 问题 D:`FormatPrompt` 中 Environment 字段条件拼接 + +**文件**:`internal/globalctx/global_context.go`(`FormatPrompt` 方法) + +**当前代码**: +```go +// ❌ 条件判断导致同环境下前缀不一致 +if g.ProjectPath != "" { + sb.WriteString(fmt.Sprintf("- **Project Path**: %s\n", g.ProjectPath)) +} +if g.OS != "" { + sb.WriteString(fmt.Sprintf("- **Operating System**: %s\n", g.OS)) +} +if g.Arch != "" { + sb.WriteString(fmt.Sprintf("- **Architecture**: %s\n", g.Arch)) +} +``` + +**问题分析**: +- 条件分支导致相同环境下的请求前缀长度/内容不同 +- 若某字段为空被跳过,Environment 块的结构发生变化 +- 虽然 Environment 在最末尾,不会破坏前面的静态缓存,但会影响完整前缀一致性 + +**修复方案**:移除条件判断,始终输出完整字段结构 + +```go +// ✅ 正确:始终输出完整字段,空值用占位符 +projectPath := g.ProjectPath +if projectPath == "" { + projectPath = "[NOT SET]" 
+} +os := g.OS +if os == "" { + os = "[NOT SET]" +} +arch := g.Arch +if arch == "" { + arch = "[NOT SET]" +} + +sb.WriteString("\n\n### Environment\n") +sb.WriteString(fmt.Sprintf("- **Project Path**: %s\n", projectPath)) +sb.WriteString(fmt.Sprintf("- **Operating System**: %s\n", os)) +sb.WriteString(fmt.Sprintf("- **Architecture**: %s\n", arch)) +``` + +**预期收益**: +- 保证了 Environment 块的结构和前缀长度绝对一致 +- 最大化缓存命中率 + +--- + +## 四、P1 建议优化项 + +### 问题 C:`FunctionDef.Parameters` 使用 `map[string]any` + +**文件**:`internal/llm/engine.go`(`FunctionDef` 结构体) + +**现状**: +```go +type FunctionDef struct { + Name string `json:"name"` + Description string `json:"description,omitempty"` + Parameters map[string]any `json:"parameters,omitempty"` +} +``` + +**分析**: +- Go 标准库 `encoding/json` 对 `map` 序列化时按 key **字母排序**,行为是确定性的 ✅ +- 但如果未来切换 JSON 库(如 `sonic`、`jsoniter`),需确保启用 `SortMapKeys` 配置 +- 当前改为 struct 的工程成本高、收益低,不建议重构 + +**建议方案**:仅添加防御性注释 + +```go +// ⚠️ IMPORTANT: The current implementation relies on encoding/json's deterministic +// sorting of map keys (alphabetical order). If migrating to sonic, jsoniter, or +// another JSON library in the future, ensure SortMapKeys is enabled to maintain +// deterministic key ordering and prevent prompt cache fragmentation. 
+type FunctionDef struct { + Name string `json:"name"` + Description string `json:"description,omitempty"` + Parameters map[string]any `json:"parameters,omitempty"` +} +``` + +--- + +## 五、P2 暂不修复项 + +### 问题 E:未实现路由亲和性 + +**分析**: +- 当前开发环境为单节点运行,路由亲和性无实际影响 +- 引入 `prompt_cache_key` 或 Session Router 需改造 LLM Client 层,增加状态管理复杂度 + +**规划**: +- 集群部署时再引入一致性哈希路由或共享 Redis 缓存层 +- 可在 `llm.CallOptions` 中预留 `PromptCacheKey` 字段供未来使用 + +--- + +## 六、其他发现:可接受的架构代价 + +### Compact 压缩导致的缓存 Miss + +压缩引擎的 L3(丢弃早期消息)和 L2(截断工具输出)会改变消息结构,导致后续 LLM 调用的前缀变化。这是**可接受的架构代价**:缓存 Miss 是换取 Token 超限安全的必要手段,不应为保缓存而限制压缩。 + +### 动态 Agent 注册导致的工具定义变化 + +Meta-Agent 运行时注册自定义 Agent 会改变 `tool_defs` 列表。由于 `tool_defs` 作为 API 请求参数参与前缀匹配,动态注册天然导致 Cache Miss。这属于业务特性,无法避免。 + +--- + +## 七、实施优先级 + +| 优先级 | 问题 | 预计改动量 | 风险 | +|--------|------|-----------|------| +| **1** | D — FormatPrompt 条件拼接 | ~10 行 | 极低 | +| **2** | B — RepoAgent 顺序调整 | ~3 行 | 低 | +| **3** | A — Conductor 上下文移至末尾 | ~5 行 | 低(需验证 LLM 指令遵循度) | +| **4** | C — 添加防御注释 | ~5 行 | 零风险 | + +--- + +## 八、验证策略 + +1. **单元测试**:验证 `FormatPrompt` 在不同参数下输出前缀一致 +2. **集成测试**:使用 Mock LLM 拦截请求,统计前缀命中率 +3. **LLM 行为回归**:选取典型编码任务,验证修复后指令遵循率、工具调用准确率无退化 +4. 
**回滚方案**:所有变更通过独立 Git Commit 隔离,异常时一键 Revert \ No newline at end of file diff --git a/docs/Prompt_cache.md b/docs/Prompt_cache.md new file mode 100644 index 0000000..139d89f --- /dev/null +++ b/docs/Prompt_cache.md @@ -0,0 +1,58 @@ + + + +--- + +### 一、 提示词结构设计:严格分离静态与动态 + +LLM 缓存的底层机制是**前缀匹配(Prefix Matching)**:即缓存必须从第一个 Token 开始完全一致,一旦中途有任何一个字符不同,该字符之后的所有缓存将全部失效。 + +* **黄金法则:静态在前,动态在后** + * **最前部(Static Prefix)**:系统角色(System Prompt)、行为准则、全量工具定义(Tool Descriptions)、知识库文档。这些内容在 Agent 运行周期内几乎不变,占据了绝大多数 Token,应在此部分末尾打上缓存断点(Cache Breakpoint)。 + * **最后部(Dynamic Suffix)**:用户的当前提问、实时的环境变量、最新的观察结果(Observation)和步骤输出。 +* **反模式:严禁在系统提示词头部注入动态变量** + * 很多开发者习惯在 System Prompt 开头加上当前时间戳(`Current Time: xxx`)、请求 ID、或当前的任务进度状态。这会导致整个系统提示词前缀每一轮都在变化,使得数万 Token 的缓存 100% 彻底失效。时间戳或状态应作为独立的 System Message **追加**到对话末尾。 +* **确定性序列化(Deterministic Serialization)** + * Agent 经常需要将 JSON、字典对象或代码树序列化后放入上下文。必须确保每次序列化时的**字段顺序是固定且一致的**(例如总是按 Key 的字母排序)。由于库的随机性导致的键位倒置,会在不经意间破坏前缀一致性。 + +### 二、 对话与状态管理:遵守“只追加(Append-Only)”原则 + +在传统的软件工程中,我们习惯修改变量来更新状态;但在面向 LLM 缓存编程时,**上下文不是可编辑的变量,而是只追加的日志**。 + +* **绝对避免修改历史记录** + * **❌ 错误做法:滑动窗口截断(Sliding Window)**。当历史记录过长时,直接删除最前面的几轮对话。这会改变动态部分的开头,导致整个对话历史的缓存全毁。 + * **❌ 错误做法:修改历史消息**。如修改此前某一步的中间思考过程或纠正历史错误。 + * **✅ 正确做法:只追加内容**。以新的消息告知模型之前的错误,或将状态变化作为新的 user/system message 附在最后。 +* **使用工具调用代替模式切换** + * 不要通过修改 System Prompt 来让 Agent 切换工作模式(如从“规划模式”切换到“执行模式”)。应将所有模式对应的逻辑写死在静态工具列表中,让 Agent 通过触发特定工具来切换状态。统一的工具前缀命名(如 `browser_x`, `shell_y`)也能增加命中率的稳定性。 + +### 三、 系统架构与路由调度设计 + +即使提示词设计完美,如果系统调度不当,同样无法命中缓存。 + +* **路由粘性(Routing Stickiness)** + * **现象**:由于分布式集群中有多台推理服务器,如果同一 Agent 会话的不同步骤被负载均衡分配到了不同的机器,也会导致 Cache Miss。 + * **对策**:如果使用第三方 API(如 OpenAI),可以在 API 请求中带上 `prompt_cache_key` 参数或 `Session ID` 参数,确保具有相同前缀的请求倾向于被路由到同一台已缓存该 KV 状态的服务器上。自托管模型(如 vLLM 架构)也需配合 Session 路由打通 Prefix Cache。 +* **多智能体架构(Multi-Agent Swarm)优化** + * 不要试图构建一个包含 100 个工具、具有超级庞大 System Prompt 的“万能 Agent”。 + * 应使用多智能体架构,每个专精子 Agent 拥有高度稳定的、固定的工具集和提示词(例如专职代码审计的 Agent 只加载代码审计的 Prompt)。这样单个子 Agent 
被反复调用时,其头部缓存命中率会极高。 +* **注意缓存生命周期(TTL 悬崖)** + * 多数 API 厂商(如 Anthropic)的提示词缓存默认存活时间(TTL)只有 5 分钟。 + * **对策**:如果 Agent 有异步任务或长时间等待用户反馈(超过 5 分钟),缓存会失效。设计时应尽量让 Agent 密集执行任务;或对于超高价值的共享上下文,通过低频率的定时“Ping”来预热或维持缓存。 + +### 四、 应用层缓存补充(Exact / Semantic Caching) + +除了底层的 KV 缓存,Agent 系统自身也应该设计应用级缓存,拦截对 LLM 的不必要调用。 + +* **精准匹配缓存(Exact Match Caching)** + * 对于高度重复的 Agent 宏动作或工具调用。如果前置依赖和状态(如针对同一网页的相同查询)一致,直接从 Redis 等存储中返回上一次的工具解析或摘要结果。 +* **语义缓存(Semantic Caching)** + * Agent 经常会遇到“表述不同但意图相同”的用户指令。通过引入轻量级的 Embedding 模型(如 GPTCache),计算当前请求与历史请求的向量相似度。如果相似度极高,Agent 可以直接复用之前的规划路线(Plan)或输出,从而实现 100% 避免全量 LLM 推理。 + +### 总结:Agent 缓存优化的核心清单 + +1. **静态/动态拆分**:系统设定和工具说明放在最前,会话历史次之,当前任务和动态参数压轴。 +2. **清理系统提示词的“脏数据”**:移除一切时间戳、UUID 等动态变量。 +3. **遵循 Append-Only**:绝不随意修改、删除对话历史中的中间项。 +4. **固定输出格式**:强制业务系统的序列化(如 JSON)具有确定性的键值排序。 +5. **设计路由亲和性**:保障同一 Agent 任务的后续请求发往同样的缓存节点。 diff --git a/go.sum b/go.sum index 5e303d0..66f6780 100644 --- a/go.sum +++ b/go.sum @@ -12,8 +12,6 @@ github.com/atotto/clipboard v0.1.4 h1:EH0zSVneZPSuFR11BlR9YppQTVDbh5+16AmcJi4g1z github.com/atotto/clipboard v0.1.4/go.mod h1:ZY9tmq7sm5xIbd9bOK4onWV4S6X0u6GY7Vn0Yu86PYI= github.com/aymanbagabas/go-osc52/v2 v2.0.1 h1:HwpRHbFMcZLEVr42D4p7XBqjyuxQH5SMiErDT4WkJ2k= github.com/aymanbagabas/go-osc52/v2 v2.0.1/go.mod h1:uYgXzlJ7ZpABp8OJ+exZzJJhRNQ2ASbcXHWsFqH8hp8= -github.com/aymanbagabas/go-udiff v0.2.0 h1:TK0fH4MteXUDspT88n8CKzvK0X9O2xu9yQjWpi6yML8= -github.com/aymanbagabas/go-udiff v0.2.0/go.mod h1:RE4Ex0qsGkTAJoQdQQCA0uG+nAzJO/pI/QwceO5fgrA= github.com/aymanbagabas/go-udiff v0.4.1 h1:OEIrQ8maEeDBXQDoGCbbTTXYJMYRCRO1fnodZ12Gv5o= github.com/aymanbagabas/go-udiff v0.4.1/go.mod h1:0L9PGwj20lrtmEMeyw4WKJ/TMyDtvAoK9bf2u/mNo3w= github.com/aymerick/douceur v0.2.0 h1:Mv+mAeH1Q+n9Fr+oyamOlAkUNPWPlA8PPGR0QAaYuPk= diff --git a/internal/agents/coding.go b/internal/agents/coding.go index 70fe418..6dd3fec 100644 --- a/internal/agents/coding.go +++ b/internal/agents/coding.go @@ -70,6 +70,8 @@ func NewCodingAgent(globalCtx *globalctx.GlobalCtx, llm 
llm.Engine, maxSteps int } case "micro_agent": fn = globalCtx.MicroAgentTool.Execute + case "deepthinking": + fn = globalCtx.DeepThinkingTool.Execute case "agent_exit": fn = globalCtx.FlowOps.ExecuteAgentExit case "ask_user_for_help": diff --git a/internal/agents/coding.prompt.md b/internal/agents/coding.prompt.md index d41781b..5decdce 100644 --- a/internal/agents/coding.prompt.md +++ b/internal/agents/coding.prompt.md @@ -34,6 +34,8 @@ You have access to the following tools. You must use them to interact with the s * **Thinking & Debugging**: * Use the `thinking` tool to analyze complex problems, plan multi-step tasks, or debug errors. * *Trigger*: If a tool execution fails (e.g., test failed, compilation error), you **MUST** use the `thinking` tool to analyze the error before retrying. **Analyze -> Plan -> Fix**. + * The `micro_agent` tool can delegate focused subtasks to a specialized micro-agent. + * The `deepthinking` tool is an extremely expensive, last-resort analysis tool — see constraints below. # Workflow 1. **Analyze**: Understand the user's intent. If ambiguous, use the `thinking` tool or ask clarifying questions (only if necessary). @@ -58,3 +60,6 @@ You have access to the following tools. You must use them to interact with the s * **Be Proactive**: Don't wait for the user to drive every step. Take initiative. * **Be Thorough**: Verify your work. Don't leave broken code. * **Be Safe**: Protect the user's environment. + +### DeepThinking Tool (Last Resort) +- **`deepthinking`**: An extremely expensive, isolated deep analysis tool. ONLY use when conventional methods (thinking tool, micro_agent, code analysis) have been exhausted and the problem requires systematic multi-dimensional analysis. Input: `context` (full problem context including errors, background, what failed) and `goal` (specific objective). This tool is VERY expensive — do NOT use for simple issues. 
diff --git a/internal/agents/conductor.go b/internal/agents/conductor.go index adb2f6b..3e24f03 100644 --- a/internal/agents/conductor.go +++ b/internal/agents/conductor.go @@ -294,6 +294,8 @@ func NewConductorAgent(globalCtx *globalctx.GlobalCtx, engine llm.Engine, repo * fn = globalCtx.FileOps.ExecuteReadFile case "print_dir_tree": fn = globalCtx.FileOps.ExecutePrintDirTree + case "deepthinking": + fn = globalCtx.DeepThinkingTool.Execute default: continue } @@ -383,6 +385,8 @@ func (a *ConductorAgent) getToolFunc(name string) tools.ToolFunc { } case "micro_agent": return a.GlobalCtx.MicroAgentTool.Execute + case "deepthinking": + return a.GlobalCtx.DeepThinkingTool.Execute case "agent_exit": return a.GlobalCtx.FlowOps.ExecuteAgentExit case "ask_user_for_help": @@ -654,14 +658,7 @@ func (a *ConductorAgent) Run(ctx context.Context, input string, mem *memory.Conv // Always start with System Prompt (with any registered custom agents appended) systemPrompt := a.GlobalCtx.FormatPrompt(conductorPrompt) - if len(a.customAgents) > 0 { - systemPrompt += "\n\n### Custom Agents\nThe following specialized agents have been designed by Meta-Agent and are permanently available for delegation:\n\n" - for _, ca := range a.customAgents { - systemPrompt += fmt.Sprintf("- **%s** (`delegate_%s`): %s\n", ca.DisplayName, ca.Name, ca.Description) - } - systemPrompt += "\nUse these agents via their delegate tools for tasks matching their specializations.\n" - } - + var projectContext string // Load the project context files (CODEACTOR.md, CLAUDE.md, AGENTS.md) only on the first turn of a conversation; // follow-up questions in the same session do not need them re-injected, which avoids wasting tokens. // memory does not store system messages, so len(mem.GetMessages()) == 0 is enough to detect the first turn. @@ -671,8 +668,24 @@ func (a *ConductorAgent) Run(ctx context.Context, input string, mem *memory.Conv if a.Publisher != nil { a.Publisher.Publish("context_loaded", loadResult, a.Name()) } - systemPrompt = fmt.Sprintf("### Project Workspace Context\n%s\n\n", loadResult.Content) + systemPrompt + // Deferred append: first build the complete system prompt (static prefix + environment info + custom Agents), + 
// and only then append the project context, so the static prefix can be reused by the LLM prompt cache + projectContext = fmt.Sprintf("\n\n### Project Workspace Context\n%s\n", loadResult.Content) + } + } + + // Custom Agent descriptions + if len(a.customAgents) > 0 { + systemPrompt += "\n\n### Custom Agents\nThe following specialized agents have been designed by Meta-Agent and are permanently available for delegation:\n\n" + for _, ca := range a.customAgents { + systemPrompt += fmt.Sprintf("- **%s** (`delegate_%s`): %s\n", ca.DisplayName, ca.Name, ca.Description) } + systemPrompt += "\nUse these agents via their delegate tools for tasks matching their specializations.\n" + } + + // Append the project context (placed after all static content to preserve the cache hit rate) + if projectContext != "" { + systemPrompt += projectContext } messages = append(messages, llm.Message{ @@ -754,9 +767,11 @@ func (a *ConductorAgent) Run(ctx context.Context, input string, mem *memory.Conv metadata := map[string]interface{}{} if resp.Usage != nil { metadata["usage"] = map[string]interface{}{ - "prompt_tokens": resp.Usage.PromptTokens, - "completion_tokens": resp.Usage.CompletionTokens, - "total_tokens": resp.Usage.TotalTokens, + "prompt_tokens": resp.Usage.PromptTokens, + "completion_tokens": resp.Usage.CompletionTokens, + "total_tokens": resp.Usage.TotalTokens, + "cache_creation_input_tokens": resp.Usage.CacheCreationInputTokens, + "cache_read_input_tokens": resp.Usage.CacheReadInputTokens, } } if len(metadata) > 0 { diff --git a/internal/agents/conductor.prompt.md b/internal/agents/conductor.prompt.md index 93cc2b0..193a688 100644 --- a/internal/agents/conductor.prompt.md +++ b/internal/agents/conductor.prompt.md @@ -54,6 +54,9 @@ You have access to the following specialized sub-agents. You must delegate to th * **Decision Rule**: Before using Meta-Agent, first consider whether a combination of existing agents can solve the task. Only delegate to Meta-Agent when the task genuinely requires a novel agent design.
Once a custom agent is registered, prefer reusing it for similar tasks rather than invoking Meta-Agent again. * **Already Registered Agents**: Check the **Custom Agents** section in the system prompt to see which custom agents have already been created and are available for delegation. +### Special Tools +- **`deepthinking`**: An extremely expensive deep analysis tool. ONLY use as a last resort when all other approaches have failed. It performs exhaustive system analysis and produces comprehensive solution designs. Input: `context` (full problem context including errors, background, what was tried) and `goal` (specific objective). + ### Workflow Strategy Your core decision loop: **Analyze → Design (if needed) → Execute → Review → Iterate**. @@ -101,6 +104,11 @@ Working agents that produce final output are: **Coding-Agent**, **Chat-Agent**, 4. **No Long-Running Processes**: Do not instruct agents to start development servers or applications (e.g., `npm run dev`). Verification should be done via unit tests, syntax checks, or compilation. 5. **Delegate Repo Analysis**: The Conductor's own `read_file`, `search_by_regex`, `list_dir`, `print_dir_tree` are **LOW-PRIORITY fallbacks** for repository understanding. You MUST delegate all codebase exploration to Repo-Agent via `delegate_repo` — it has codebase semantic tools (`semantic_search`, `query_code_skeleton`, `query_code_snippet`) that are far more effective than raw file operations. Only use your own file tools as a last resort when Repo-Agent is unavailable or its result is clearly insufficient. 6. **Enforce Parallelism**: When delegating read-only or exploration tasks, explicitly require the sub-agent to use parallel tool calls. +7. **Use DeepThinking Sparingly**: You have access to a `deepthinking` tool for extreme cases. This tool is VERY expensive (high token cost, high latency). 
ONLY use `deepthinking` when ALL of the following are true: + - Conventional methods have been exhausted (thinking tool, micro_agent, repo analysis, delegation to sub-agents) + - The problem involves complex multi-system interactions, deep architectural questions, or requires systematic analysis beyond normal reasoning + - Multiple attempts using standard approaches have failed + **Never** use `deepthinking` for simple issues, quick fixes, syntax errors, or straightforward coding tasks. ### Output Format You must structure your textual response (before the tool call) using the following markdown `Thought Process` block: diff --git a/internal/agents/executor.go b/internal/agents/executor.go index d5d219d..2742480 100644 --- a/internal/agents/executor.go +++ b/internal/agents/executor.go @@ -75,9 +75,11 @@ func RunAgentLoop(ctx context.Context, cfg ExecutorConfig) (string, error) { metadata := map[string]interface{}{} if resp.Usage != nil { metadata["usage"] = map[string]interface{}{ - "prompt_tokens": resp.Usage.PromptTokens, - "completion_tokens": resp.Usage.CompletionTokens, - "total_tokens": resp.Usage.TotalTokens, + "prompt_tokens": resp.Usage.PromptTokens, + "completion_tokens": resp.Usage.CompletionTokens, + "total_tokens": resp.Usage.TotalTokens, + "cache_creation_input_tokens": resp.Usage.CacheCreationInputTokens, + "cache_read_input_tokens": resp.Usage.CacheReadInputTokens, } } if len(metadata) > 0 { diff --git a/internal/agents/meta.go b/internal/agents/meta.go index 4aca676..b554760 100644 --- a/internal/agents/meta.go +++ b/internal/agents/meta.go @@ -71,9 +71,11 @@ func (a *MetaAgent) Run(ctx context.Context, input string) (string, error) { metadata := map[string]interface{}{} if resp.Usage != nil { metadata["usage"] = map[string]interface{}{ - "prompt_tokens": resp.Usage.PromptTokens, - "completion_tokens": resp.Usage.CompletionTokens, - "total_tokens": resp.Usage.TotalTokens, + "prompt_tokens": resp.Usage.PromptTokens, + "completion_tokens": 
resp.Usage.CompletionTokens, + "total_tokens": resp.Usage.TotalTokens, + "cache_creation_input_tokens": resp.Usage.CacheCreationInputTokens, + "cache_read_input_tokens": resp.Usage.CacheReadInputTokens, } } if len(metadata) > 0 { diff --git a/internal/agents/repo.go b/internal/agents/repo.go index 63b3aa8..bbd61ee 100644 --- a/internal/agents/repo.go +++ b/internal/agents/repo.go @@ -79,6 +79,8 @@ func NewRepoAgent(globalCtx *globalctx.GlobalCtx, llm llm.Engine, publisher *mes fn = globalCtx.RepoOps.ExecuteQueryCodeSkeleton case "query_code_snippet": fn = globalCtx.RepoOps.ExecuteQueryCodeSnippet + case "deepthinking": + fn = globalCtx.DeepThinkingTool.Execute default: continue } @@ -173,11 +175,12 @@ func (a *RepoAgent) Run(ctx context.Context, input string) (string, error) { slog.Info("RepoAgent performing pre-investigation", "project_dir", a.GlobalCtx.ProjectPath) investigation, err := a.doPreInvestigate(a.GlobalCtx.ProjectPath) + var info string if err != nil { slog.Warn("RepoAgent pre-investigation failed", "error", err) } else { // Add investigation results to system prompt - info := "\n\nRepository Information:\n" + info = "\n\nRepository Information:\n" info += "\nDirectory Tree:\n" + investigation.Data.DirectoryTree + "\n" info += "\nCore Functions:\n" for _, fn := range investigation.Data.CoreFunctions { @@ -209,10 +212,10 @@ func (a *RepoAgent) Run(ctx context.Context, input string) (string, error) { info += fmt.Sprintf("File: %s\n```%s\n%s\n```\n", sk.Filepath, sk.Language, sk.SkeletonText) } - systemPrompt += info } systemPrompt = a.GlobalCtx.FormatPrompt(systemPrompt) + systemPrompt += info cfg := ExecutorConfig{ SystemPrompt: systemPrompt, diff --git a/internal/agents/repo.prompt.md b/internal/agents/repo.prompt.md index c9210bd..91fbc94 100644 --- a/internal/agents/repo.prompt.md +++ b/internal/agents/repo.prompt.md @@ -48,4 +48,7 @@ When you need to explore or investigate the repository beyond the provided repor Output a clear, structured summary 
that gives a developer a solid "mental map" of the codebase. **Language Compliance**: -The output summary MUST be in the language specified in **Language Instructions**. \ No newline at end of file +The output summary MUST be in the language specified in **Language Instructions**. + +### DeepThinking Tool (Last Resort) +- **`deepthinking`**: An extremely expensive deep analysis tool. ONLY use as a last resort when all other analysis methods have failed. Input: `context` (full problem context) and `goal` (specific objective). This tool is VERY expensive — do NOT use for simple code exploration tasks. \ No newline at end of file diff --git a/internal/agents/tools.json b/internal/agents/tools.json index 2f903f2..64666cb 100644 --- a/internal/agents/tools.json +++ b/internal/agents/tools.json @@ -154,6 +154,24 @@ "required": ["system_prompt", "task"] } }, + { + "name": "deepthinking", + "description": "A deep system analysis and design tool for complex, difficult problems. This tool performs exhaustive multi-dimensional analysis — including root cause analysis, system-level impact assessment, constraint analysis, and solution architecture design — using an isolated, expensive LLM call with a specialized system prompt. **IMPORTANT: This tool is VERY EXPENSIVE (high token cost and latency). Only use it when:** 1) Conventional methods (thinking tool, micro_agent, repo analysis, delegation to sub-agents) have been tried and failed to produce a satisfactory solution, 2) The problem is inherently complex and requires systematic design, or 3) Multiple failed attempts indicate a fundamental misunderstanding that needs deep analysis. **Do NOT use this tool** for simple issues, quick fixes, or routine tasks. 
The input must include the full problem context (execution environment feedback, key errors, problem background, constraints) and the specific goal to achieve.", + "parameters": { + "type": "object", + "properties": { + "context": { + "type": "string", + "description": "The complete problem context including: execution environment feedback, key errors encountered, problem background, constraints, what has been tried and failed, and any other relevant information needed for deep analysis." + }, + "goal": { + "type": "string", + "description": "The specific objective to achieve. What should the deep analysis produce? A solution design, a debugging strategy, an architectural decision, or a comprehensive plan." + } + }, + "required": ["context", "goal"] + } + }, { "name": "delete_file", "description": "You can use this tool to delete files, you can delete multi files in one toolcall, and you MUST make sure the files is exist before deleting.", diff --git a/internal/app/app.go b/internal/app/app.go index 157340a..8e35bf9 100644 --- a/internal/app/app.go +++ b/internal/app/app.go @@ -65,6 +65,12 @@ func (ca *CodingAssistant) Init(engine llm.Engine, workDir string) { microAgentEngine = ca.client.GetToolEngine("micro_agent") } + // Resolve tool-specific engine for deepthinking + deepthinkingEngine := engine + if ca.client != nil { + deepthinkingEngine = ca.client.GetToolEngine("deepthinking") + } + gctx := globalctx.GlobalCtx{ SpeakLang: ca.config.Agent.SpeakLang, ProjectPath: workDir, @@ -75,16 +81,17 @@ func (ca *CodingAssistant) Init(engine llm.Engine, workDir string) { CodebaseURL: fmt.Sprintf("http://127.0.0.1:%d", ca.CodebasePort), // Tools - FileOps: tools.NewFileOperationsTool(workDir), - SearchOps: tools.NewSearchOperationsTool(workDir), - SysOps: tools.NewSystemOperationsTool(workDir), - ReplaceTool: tools.NewReplaceBlockTool(workDir), - ThinkingTool: tools.NewThinkingTool(), - MicroAgentTool: tools.NewMicroAgentTool(microAgentEngine), - ImplPlanTool: 
tools.NewImplPlanTool(), - FlowOps: tools.NewFlowControlTool(workDir), - RepoOps: tools.NewRepoOperationsTool(fmt.Sprintf("http://127.0.0.1:%d", ca.CodebasePort), workDir), - UserConfirmMgr: userConfirmMgr, + FileOps: tools.NewFileOperationsTool(workDir), + SearchOps: tools.NewSearchOperationsTool(workDir), + SysOps: tools.NewSystemOperationsTool(workDir), + ReplaceTool: tools.NewReplaceBlockTool(workDir), + ThinkingTool: tools.NewThinkingTool(), + MicroAgentTool: tools.NewMicroAgentTool(microAgentEngine), + ImplPlanTool: tools.NewImplPlanTool(), + FlowOps: tools.NewFlowControlTool(workDir), + RepoOps: tools.NewRepoOperationsTool(fmt.Sprintf("http://127.0.0.1:%d", ca.CodebasePort), workDir), + UserConfirmMgr: userConfirmMgr, + DeepThinkingTool: tools.NewDeepThinkingTool(deepthinkingEngine), } ca.globalCtx = &gctx diff --git a/internal/config/config.go b/internal/config/config.go index 94b2e76..3013212 100644 --- a/internal/config/config.go +++ b/internal/config/config.go @@ -75,10 +75,11 @@ type ToolLLMOverride struct { // ToolsLLMConfig holds per-tool LLM overrides. // Priority: per-tool > tools.default > agent > global. type ToolsLLMConfig struct { - UseProvider string `toml:"use_provider"` // default for all tools - MicroAgent *ToolLLMOverride `toml:"micro_agent,omitempty"` - Thinking *ToolLLMOverride `toml:"thinking,omitempty"` - ImplPlan *ToolLLMOverride `toml:"impl_plan,omitempty"` + UseProvider string `toml:"use_provider"` // default for all tools + MicroAgent *ToolLLMOverride `toml:"micro_agent,omitempty"` + Thinking *ToolLLMOverride `toml:"thinking,omitempty"` + ImplPlan *ToolLLMOverride `toml:"impl_plan,omitempty"` + DeepThinking *ToolLLMOverride `toml:"deepthinking,omitempty"` } // TopLevelConfig groups the [global] section. 
@@ -178,6 +179,8 @@ func (c *Config) getToolOverride(toolName string) *ToolLLMOverride { return c.Tools.LLM.Thinking case "impl_plan": return c.Tools.LLM.ImplPlan + case "deepthinking": + return c.Tools.LLM.DeepThinking default: return nil } diff --git a/internal/globalctx/global_context.go b/internal/globalctx/global_context.go index 4a6eb94..2065da1 100644 --- a/internal/globalctx/global_context.go +++ b/internal/globalctx/global_context.go @@ -23,17 +23,18 @@ type GlobalCtx struct { MaxContextTokens int // Tools - FileOps *tools.FileOperationsTool - SearchOps *tools.SearchOperationsTool - SysOps *tools.SystemOperationsTool - ReplaceTool *tools.ReplaceBlockTool - ThinkingTool *tools.ThinkingTool - MicroAgentTool *tools.MicroAgentTool - ImplPlanTool *tools.ImplPlanTool - FlowOps *tools.FlowControlTool - RepoOps *tools.RepoOperationsTool - UserConfirmMgr *tools.UserConfirmManager - Guard *tools.WorkspaceGuard + FileOps *tools.FileOperationsTool + SearchOps *tools.SearchOperationsTool + SysOps *tools.SystemOperationsTool + ReplaceTool *tools.ReplaceBlockTool + ThinkingTool *tools.ThinkingTool + MicroAgentTool *tools.MicroAgentTool + ImplPlanTool *tools.ImplPlanTool + FlowOps *tools.FlowControlTool + RepoOps *tools.RepoOperationsTool + UserConfirmMgr *tools.UserConfirmManager + Guard *tools.WorkspaceGuard + DeepThinkingTool *tools.DeepThinkingTool } func (g *GlobalCtx) FormatPrompt(prompt string) string { @@ -41,17 +42,24 @@ func (g *GlobalCtx) FormatPrompt(prompt string) string { sb.WriteString(prompt) // Environment context - sb.WriteString("\n\n### Environment\n") - if g.ProjectPath != "" { - sb.WriteString(fmt.Sprintf("- **Project Path**: %s\n", g.ProjectPath)) + projectPath := g.ProjectPath + if projectPath == "" { + projectPath = "[NOT SET]" } - if g.OS != "" { - sb.WriteString(fmt.Sprintf("- **Operating System**: %s\n", g.OS)) + os := g.OS + if os == "" { + os = "[NOT SET]" } - if g.Arch != "" { - sb.WriteString(fmt.Sprintf("- **Architecture**: %s\n", g.Arch)) 
+ arch := g.Arch + if arch == "" { + arch = "[NOT SET]" } + sb.WriteString("\n\n### Environment\n") + sb.WriteString(fmt.Sprintf("- **Project Path**: %s\n", projectPath)) + sb.WriteString(fmt.Sprintf("- **Operating System**: %s\n", os)) + sb.WriteString(fmt.Sprintf("- **Architecture**: %s\n", arch)) + // Language if g.SpeakLang != "" { sb.WriteString(fmt.Sprintf("\n### Language Instructions\nYou MUST use **%s** for ALL output, including your internal 'Thought Process', 'Thinking Tool' usage, reasoning steps, and final responses.\n", g.SpeakLang)) diff --git a/internal/llm/engine.go b/internal/llm/engine.go index cf31383..83e5583 100644 --- a/internal/llm/engine.go +++ b/internal/llm/engine.go @@ -31,6 +31,11 @@ type ToolDef struct { } // FunctionDef defines a function tool's signature. +// +// ⚠️ IMPORTANT: The current implementation relies on encoding/json's deterministic +// sorting of map keys (alphabetical order). If migrating to sonic, jsoniter, or +// another JSON library in the future, ensure SortMapKeys is enabled to maintain +// deterministic key ordering and prevent prompt cache fragmentation. type FunctionDef struct { Name string `json:"name"` Description string `json:"description,omitempty"` @@ -52,9 +57,11 @@ type FunctionCall struct { // TokenUsage contains token usage information returned by the LLM API. type TokenUsage struct { - PromptTokens int64 `json:"prompt_tokens"` - CompletionTokens int64 `json:"completion_tokens"` - TotalTokens int64 `json:"total_tokens"` + PromptTokens int64 `json:"prompt_tokens"` + CompletionTokens int64 `json:"completion_tokens"` + TotalTokens int64 `json:"total_tokens"` + CacheCreationInputTokens int64 `json:"cache_creation_input_tokens,omitempty"` + CacheReadInputTokens int64 `json:"cache_read_input_tokens,omitempty"` } // Response represents the LLM's response to a GenerateContent call. 
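Reviewer sketch for the `FunctionDef` comment above: a small check of the property that comment relies on, namely that `encoding/json` emits map keys in sorted order, so two maps with the same entries serialize to identical bytes regardless of insertion order. `marshalSchema` and the sample schema are hypothetical and exist only for this illustration; they are not part of the project.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// marshalSchema serializes a tool parameter schema held in a map.
// encoding/json sorts map keys, so the output does not depend on the
// order in which the map was populated.
func marshalSchema(schema map[string]any) string {
	b, err := json.Marshal(schema)
	if err != nil {
		panic(err) // the illustrative inputs below are always marshalable
	}
	return string(b)
}

func main() {
	// Same entries, built in two different orders.
	a := map[string]any{"type": "object", "required": []string{"context", "goal"}}

	b := map[string]any{}
	b["required"] = []string{"context", "goal"}
	b["type"] = "object"

	fmt.Println(marshalSchema(a) == marshalSchema(b))
	fmt.Println(marshalSchema(a))
}
```

If the serializer were swapped for a library that does not sort keys by default, a comparison like the one in `main` would be a cheap regression test for prompt-cache stability.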
diff --git a/internal/llm/engine_openai.go b/internal/llm/engine_openai.go index 32e241d..289daf5 100644 --- a/internal/llm/engine_openai.go +++ b/internal/llm/engine_openai.go @@ -137,6 +137,24 @@ func (e *OpenAIEngine) generateStreaming(ctx context.Context, params openai.Chat CompletionTokens: acc.Usage.CompletionTokens, TotalTokens: acc.Usage.TotalTokens, } + + // Extract cache-related tokens from prompt_tokens_details.cached_tokens (OpenAI format) + if acc.Usage.PromptTokensDetails.CachedTokens > 0 { + usage.CacheReadInputTokens = acc.Usage.PromptTokensDetails.CachedTokens + } + + // Also try to extract cache fields from raw JSON (for Anthropic-compatible APIs) + if raw := acc.Usage.RawJSON(); raw != "" { + var rawUsage map[string]any + if err := json.Unmarshal([]byte(raw), &rawUsage); err == nil { + if cacheRead, ok := rawUsage["cache_read_input_tokens"].(float64); ok && cacheRead > 0 { + usage.CacheReadInputTokens = int64(cacheRead) + } + if cacheCreate, ok := rawUsage["cache_creation_input_tokens"].(float64); ok && cacheCreate > 0 { + usage.CacheCreationInputTokens = int64(cacheCreate) + } + } + } } return &Response{ @@ -279,6 +297,24 @@ func (e *OpenAIEngine) toResponse(completion *openai.ChatCompletion) *Response { CompletionTokens: completion.Usage.CompletionTokens, TotalTokens: completion.Usage.TotalTokens, } + + // Extract cache-related tokens from prompt_tokens_details.cached_tokens (OpenAI format) + if completion.Usage.PromptTokensDetails.CachedTokens > 0 { + resp.Usage.CacheReadInputTokens = completion.Usage.PromptTokensDetails.CachedTokens + } + + // Also try to extract cache fields from raw JSON (for Anthropic-compatible APIs) + if raw := completion.Usage.RawJSON(); raw != "" { + var rawUsage map[string]any + if err := json.Unmarshal([]byte(raw), &rawUsage); err == nil { + if cacheRead, ok := rawUsage["cache_read_input_tokens"].(float64); ok && cacheRead > 0 { + resp.Usage.CacheReadInputTokens = int64(cacheRead) + } + if cacheCreate, ok := 
rawUsage["cache_creation_input_tokens"].(float64); ok && cacheCreate > 0 { + resp.Usage.CacheCreationInputTokens = int64(cacheCreate) + } + } + } } return resp diff --git a/internal/tools/deepthinking.go b/internal/tools/deepthinking.go new file mode 100644 index 0000000..d840551 --- /dev/null +++ b/internal/tools/deepthinking.go @@ -0,0 +1,126 @@ +package tools + +import ( + "context" + "fmt" + + "codeactor/internal/llm" +) + +// DeepThinkingTool provides system-level analysis and design capabilities. +// It uses an isolated LLM call with a specialized system prompt for deep, +// structured analysis of complex problems. This tool is EXPENSIVE and should +// only be used after conventional methods have been exhausted. +type DeepThinkingTool struct { + LLM llm.Engine +} + +// NewDeepThinkingTool creates a new DeepThinkingTool with the given LLM client. +func NewDeepThinkingTool(llm llm.Engine) *DeepThinkingTool { + return &DeepThinkingTool{LLM: llm} +} + +// Execute performs deep system analysis using an isolated LLM call. +// It takes a problem context and a goal, then returns a comprehensive solution. +// +// Parameters: +// - context: The full problem context including execution environment feedback, +// key errors, problem background, and any relevant constraints. +// - goal: The specific objective to achieve. +// +// Returns a structured analysis and solution plan. 
+func (t *DeepThinkingTool) Execute(ctx context.Context, params map[string]interface{}) (interface{}, error) { + problemContext, ok := params["context"].(string) + if !ok || problemContext == "" { + return nil, fmt.Errorf("context parameter is required and must be a non-empty string") + } + + goal, ok := params["goal"].(string) + if !ok || goal == "" { + return nil, fmt.Errorf("goal parameter is required and must be a non-empty string") + } + + systemPrompt := getDeepThinkingSystemPrompt() + + task := fmt.Sprintf( + "# Problem Context\n\n%s\n\n---\n\n# Goal\n\n%s\n\n---\n\nPlease perform a thorough system analysis and provide a comprehensive solution following the structure defined in your system prompt.", + problemContext, + goal, + ) + + messages := []llm.Message{ + { + Role: llm.RoleSystem, + Content: systemPrompt, + }, + { + Role: llm.RoleUser, + Content: task, + }, + } + + resp, err := t.LLM.GenerateContent(ctx, messages, nil, nil) + if err != nil { + return nil, fmt.Errorf("deepthinking LLM call failed: %w", err) + } + + if len(resp.Choices) > 0 { + return resp.Choices[0].Content, nil + } + return "", nil +} + +// getDeepThinkingSystemPrompt returns the specialized system prompt for deep analysis. +func getDeepThinkingSystemPrompt() string { + return `# Role +You are a **Deep Thinking Engine** — an elite system analyst and solution architect. You are activated only for the most challenging problems that cannot be solved by conventional methods. + +# Your Mission +Perform exhaustive, multi-dimensional analysis of the given problem context and produce a comprehensive, actionable solution. + +# Analysis Framework +You MUST structure your response using the following framework: + +## 1. Root Cause Analysis +- Identify the fundamental problem, not just symptoms +- Trace the causal chain from observed failures back to root causes +- Distinguish between primary and secondary issues + +## 2. 
System-Level Impact Assessment +- Map all affected components and their interdependencies +- Evaluate the blast radius: what else could break? +- Identify cascading effects and hidden risks + +## 3. Constraint Analysis +- Technical constraints (language, framework, platform, performance) +- Resource constraints (time, compute, memory, budget) +- Organizational constraints (team capabilities, existing architecture, policies) +- Risk constraints (security, reliability, data integrity) + +## 4. Solution Design +- Propose 2-3 candidate solutions with clear pros/cons for each +- For each solution, provide: + * Approach overview + * Estimated effort (Low/Medium/High) + * Risk level (Low/Medium/High) + * Trade-offs and implications + +## 5. Recommended Solution — Detailed Plan +- Select the best solution and justify the choice +- Provide a step-by-step implementation plan +- For each step: what to do, which files/components to modify, what to verify +- Include specific code patterns, architectural diagrams (in text), or pseudocode where helpful + +## 6. 
Verification Strategy +- How to verify the solution works correctly +- Test cases and edge cases to consider +- Rollback plan if the solution fails + +# Critical Rules +- Be thorough and systematic — this is an expensive call, make it count +- Ground ALL analysis in the provided context — do not hallucinate +- Provide concrete, actionable recommendations — not vague advice +- Consider long-term maintainability and scalability +- Flag any assumptions you are making explicitly +` +} diff --git a/internal/tui/tui_helpers.go b/internal/tui/tui_helpers.go index 9b04ed1..d3dc988 100644 --- a/internal/tui/tui_helpers.go +++ b/internal/tui/tui_helpers.go @@ -59,20 +59,52 @@ func StartTUI(taskFilePath string, ca *app.CodingAssistant, tm *http.TaskManager } func (m model) computeFieldWidth() int { const minField = 38 - const maxField = 90 + const margin = 4 // small padding from terminal edges if m.termWidth <= 0 { - return 60 + return 80 } - avail := m.termWidth - 8 + avail := m.termWidth - margin if avail < minField { return minField } - if avail > maxField { - return maxField - } return avail } +// computeInputHeight calculates the textarea height based on content lines +// and available terminal space. Height grows with content but is capped. 
+func (m model) computeInputHeight() int { + const minHeight = 3 + const maxHeight = 12 + + // Count lines in current value (at least 1 for empty input) + lines := strings.Count(m.input.Value(), "\n") + 1 + desired := lines + 1 // +1 line for comfortable editing headroom + + if desired < minHeight { + desired = minHeight + } + + // Cap to at most half of the terminal height left after the reserved rows (and never above maxHeight) so viewport remains usable + if m.termHeight > 0 { + termMax := (m.termHeight - 8) / 2 // 8 lines reserved for separator + status + token dashboard + if termMax < minHeight { + termMax = minHeight + } + if termMax > maxHeight { + termMax = maxHeight + } + if desired > termMax { + desired = termMax + } + } else { + if desired > maxHeight { + desired = maxHeight + } + } + + return desired +} + // getToolCallIDFromEventContent extracts tool_call_id from event content. func getToolCallIDFromEventContent(content interface{}) string { if m, ok := content.(map[string]interface{}); ok { diff --git a/internal/tui/tui_model.go b/internal/tui/tui_model.go index 7207413..0454988 100644 --- a/internal/tui/tui_model.go +++ b/internal/tui/tui_model.go @@ -59,7 +59,7 @@ var ( toolDoneStyle = lipgloss.NewStyle().Foreground(lipgloss.Color("114")) // green — success toolErrorStyle = lipgloss.NewStyle().Foreground(lipgloss.Color("167")) // red — error - // Mode-specific styles (vim-like edit / command modes) + // Mode-specific styles (vim-like edit / command modes) — harmonized with TUI 256-color palette commandPrefixStyle = lipgloss.NewStyle().Foreground(lipgloss.Color("214")).Bold(true) // orange ":" commandModeBarStyle = lipgloss.NewStyle(). Background(lipgloss.Color("214")). @@ -147,9 +147,11 @@ func (c *tuiEventConsumer) Consume(event *messaging.MessageEvent) error { // AgentTokenUsage tracks token consumption for a single agent.
 type AgentTokenUsage struct {
-	AgentName    string
-	InputTokens  int64
-	OutputTokens int64
+	AgentName                string
+	InputTokens              int64
+	OutputTokens             int64
+	CacheCreationInputTokens int64
+	CacheReadInputTokens     int64
 }
 
 // TUI Model
@@ -220,8 +222,10 @@ type model struct {
 	currentModel string
 
 	// Token consumption tracking
-	inputTokens  int64 // accumulated input tokens
-	outputTokens int64 // accumulated output tokens
+	inputTokens              int64 // accumulated input tokens
+	outputTokens             int64 // accumulated output tokens
+	cacheCreationInputTokens int64 // accumulated cache creation input tokens
+	cacheReadInputTokens     int64 // accumulated cache read (hit) tokens
 
 	// Per-agent token tracking
 	tokenUsagePerAgent map[string]*AgentTokenUsage
@@ -238,29 +242,35 @@ type model struct {
 
 func initialModel(preloadedTaskContent string, ca *app.CodingAssistant, tm *http.TaskManager, dm *datamanager.DataManager, useDarkStyle bool) model {
 	ti := textarea.New()
+
+	// ── Editor input styles (harmonized with TUI 256-color palette) ──
+	// Accent: 39 (blue, matches NameNormal/tool names)
+	// Text: 252 (light gray, matches Body/AIResStyle)
+	// Muted: 245 (gray, matches ContentLine/ParamKey)
+	// Subtle bg: 236 (dark gray, barely visible on dark terminals)
+	// Cursor line: 237 (matches SeparatorStyle)
+	ti.Cursor.Style = lipgloss.NewStyle().Foreground(lipgloss.Color("39"))
 	ti.Placeholder = langManager.GetText("TaskDescPlaceholder")
 	ti.Focus()
 	ti.CharLimit = 0
 	ti.SetWidth(60)
-	ti.SetHeight(2)
+	ti.SetHeight(3)
 	ti.ShowLineNumbers = false
 
-	// Text style for both focused and blurred states
 	textStyle := lipgloss.NewStyle().Foreground(lipgloss.Color("252"))
 	ti.FocusedStyle.Text = textStyle
 	ti.BlurredStyle.Text = textStyle
 
-	// Edit mode base style: dark background shadow
-	editBaseStyle := lipgloss.NewStyle().Background(lipgloss.Color("235"))
+	editBaseStyle := lipgloss.NewStyle().Background(lipgloss.Color("236"))
 	ti.FocusedStyle.Base = editBaseStyle
 	ti.BlurredStyle.Base = editBaseStyle
 
-	ti.FocusedStyle.Prompt = lipgloss.NewStyle().Foreground(lipgloss.Color("39")).Bold(true).Background(lipgloss.Color("235"))
-	ti.BlurredStyle.Prompt = lipgloss.NewStyle().Foreground(lipgloss.Color("244")).Background(lipgloss.Color("235"))
-	ti.FocusedStyle.CursorLine = lipgloss.NewStyle().Background(lipgloss.Color("235"))
-	ti.BlurredStyle.CursorLine = lipgloss.NewStyle().Background(lipgloss.Color("235"))
-	ti.FocusedStyle.Placeholder = lipgloss.NewStyle().Foreground(lipgloss.Color("245")).Background(lipgloss.Color("235"))
-	ti.BlurredStyle.Placeholder = lipgloss.NewStyle().Foreground(lipgloss.Color("245")).Background(lipgloss.Color("235"))
+	ti.FocusedStyle.Prompt = lipgloss.NewStyle().Foreground(lipgloss.Color("39")).Bold(true).Background(lipgloss.Color("236"))
+	ti.BlurredStyle.Prompt = lipgloss.NewStyle().Foreground(lipgloss.Color("244")).Background(lipgloss.Color("236"))
+	ti.FocusedStyle.CursorLine = lipgloss.NewStyle().Background(lipgloss.Color("237"))
+	ti.BlurredStyle.CursorLine = lipgloss.NewStyle().Background(lipgloss.Color("237"))
+	ti.FocusedStyle.Placeholder = lipgloss.NewStyle().Foreground(lipgloss.Color("245")).Background(lipgloss.Color("236"))
+	ti.BlurredStyle.Placeholder = lipgloss.NewStyle().Foreground(lipgloss.Color("245")).Background(lipgloss.Color("236"))
 
 	// Dynamic prompt: "❯ " on first line, " " on continuation lines
 	ti.SetPromptFunc(2, func(line int) string {
diff --git a/internal/tui/tui_render.go b/internal/tui/tui_render.go
index 3b52b66..fe01a96 100644
--- a/internal/tui/tui_render.go
+++ b/internal/tui/tui_render.go
@@ -11,11 +11,50 @@ import (
 	"github.com/charmbracelet/lipgloss"
 )
 
-func (m *model) resizeViewport() {
-	footerHeight := 7
+// computeFooterHeight calculates the actual footer height based on current state.
+// This must match the row count produced by model.View() footer rendering.
+func (m *model) computeFooterHeight() int {
+	height := 1 // separator line
+
+	// Input area
+	if m.commandMode {
+		height += 1 // command mode line
+	} else {
+		height += m.computeInputHeight()
+		// Skill autocomplete suggestions
+		if m.skillAutoComplete && len(m.skillSuggestions) > 0 {
+			height += len(m.skillSuggestions) + 1 // suggestion lines + hint line
+		}
+	}
+
+	// Error message
 	if m.errMsg != "" {
-		footerHeight++
+		height += 1
 	}
+
+	// Token dashboard
+	totalTokens := m.inputTokens + m.outputTokens
+	if totalTokens == 0 {
+		// Single line: "In: 0 | Out: 0"
+		height += 1
+	} else {
+		// Dashboard with border: 2 (borders) + 1 (header) + 1 (separator) + agent rows
+		height += 4 // 2 borders + 1 header + 1 separator
+		for _, au := range m.tokenUsagePerAgent {
+			if au.InputTokens+au.OutputTokens > 0 {
+				height++
+			}
+		}
+	}
+
+	// Two blank lines + status line (see View() — footer.WriteString("\n") twice + statusLine)
+	height += 3
+
+	return height
+}
+
+func (m *model) resizeViewport() {
+	footerHeight := m.computeFooterHeight()
 	vpHeight := m.termHeight - footerHeight
 	if vpHeight < 3 {
 		vpHeight = 3
diff --git a/internal/tui/tui_update.go b/internal/tui/tui_update.go
index ec40634..bd019ce 100644
--- a/internal/tui/tui_update.go
+++ b/internal/tui/tui_update.go
@@ -798,10 +798,10 @@ func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
 				m.handleTaskHistoryCycle(msg.String())
 				return m, nil
 			}
-			// Non-empty: pass to viewport for scrolling
-			var vpCmd tea.Cmd
-			m.viewport, vpCmd = m.viewport.Update(msg)
-			return m, vpCmd
+			// Input has content: pass to textarea for line navigation
+			var cmd tea.Cmd
+			m.input, cmd = m.input.Update(msg)
+			return m, cmd
 
 		default:
 			// Reset history cursor when user starts typing
@@ -840,8 +840,9 @@ func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
 					case int:
 						completionVal = int64(v)
 					}
-					m.outputTokens += completionVal
 				}
+				m.outputTokens += completionVal
+
 				// Also track input tokens from API (PromptTokens)
 				var promptVal int64
 				if promptTokens, ok := usageMap["prompt_tokens"]; ok {
@@ -853,8 +854,35 @@ func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
 					case int:
 						promptVal = int64(v)
 					}
-					m.inputTokens += promptVal
 				}
+				m.inputTokens += promptVal
+
+				// Parse cache tokens
+				var cacheCreationVal int64
+				if cacheCreationTokens, ok := usageMap["cache_creation_input_tokens"]; ok {
+					switch v := cacheCreationTokens.(type) {
+					case float64:
+						cacheCreationVal = int64(v)
+					case int64:
+						cacheCreationVal = v
+					case int:
+						cacheCreationVal = int64(v)
+					}
+				}
+				m.cacheCreationInputTokens += cacheCreationVal
+
+				var cacheReadVal int64
+				if cacheReadTokens, ok := usageMap["cache_read_input_tokens"]; ok {
+					switch v := cacheReadTokens.(type) {
+					case float64:
+						cacheReadVal = int64(v)
+					case int64:
+						cacheReadVal = v
+					case int:
+						cacheReadVal = int64(v)
+					}
+				}
+				m.cacheReadInputTokens += cacheReadVal
 
 				// Per-agent token tracking
 				agentName := msg.event.From
@@ -868,6 +896,8 @@ func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
 					}
 					agentUsage.InputTokens += promptVal
 					agentUsage.OutputTokens += completionVal
+					agentUsage.CacheCreationInputTokens += cacheCreationVal
+					agentUsage.CacheReadInputTokens += cacheReadVal
 				}
 			} else {
 				// Fallback: estimate tokens from content string
diff --git a/internal/tui/tui_view.go b/internal/tui/tui_view.go
index d96c6cd..2a5f295 100644
--- a/internal/tui/tui_view.go
+++ b/internal/tui/tui_view.go
@@ -79,6 +79,7 @@ func (m model) View() string {
 	} else {
 		// ── Edit mode: textarea with dark background (via Base style), no bar ──
 		m.input.SetWidth(m.computeFieldWidth())
+		m.input.SetHeight(m.computeInputHeight())
 		inputLine := m.input.View()
 		footer.WriteString(lipgloss.NewStyle().Render(inputLine))
 		footer.WriteString("\n")
@@ -124,13 +125,13 @@ func (m model) View() string {
 	footer.WriteString(m.renderTokenDashboard())
 	footer.WriteString("\n")
 
-	// Status line: mode indicator + task indicator + model name
-	taskIndicator := ""
+	// Status line: Running indicator (leftmost) + mode indicator
+	var runningBadge string
 	if m.taskRunning {
 		if m.currentModel != "" {
-			taskIndicator = logStatusStyle.Render(fmt.Sprintf(" ◷ Running [%s]...", m.currentModel))
+			runningBadge = logStatusStyle.Render(fmt.Sprintf(" ◷ Running [%s]...", m.currentModel))
 		} else {
-			taskIndicator = logStatusStyle.Render(" ◷ Running...")
+			runningBadge = logStatusStyle.Render(" ◷ Running...")
 		}
 	}
 	footer.WriteString("\n")
@@ -144,7 +145,11 @@ func (m model) View() string {
 	} else {
 		statusLine = footerStyle.Render(langManager.GetText("EditModeTips"))
 	}
-	footer.WriteString(lipgloss.NewStyle().MarginLeft(2).Render(statusLine + taskIndicator))
+	if runningBadge != "" {
+		footer.WriteString(lipgloss.NewStyle().MarginLeft(2).Render(runningBadge + " " + statusLine))
+	} else {
+		footer.WriteString(lipgloss.NewStyle().MarginLeft(2).Render(statusLine))
+	}
 
 	b.WriteString(footer.String())
 
@@ -258,11 +263,18 @@ func (m model) renderTokenDashboard() string {
 		Padding(0, 1).
 		Width(m.termWidth - 2) // account for viewport padding
 
-	// Header
+	// Header — left-aligned "Total" label + token summary
 	headerStyle := lipgloss.NewStyle().
 		Bold(true).
 		Foreground(lipgloss.Color("240"))
-	header := headerStyle.Render("─ Token 消耗 ")
+	var header string
+
+	// Token column width for alignment
+	const maxAgentNameWidth = 10
+
+	inStr := formatToken(m.inputTokens)
+	outStr := formatToken(m.outputTokens)
+	sumStr := formatToken(totalTokens)
 
 	// Total line — highlighted
 	inputStyle := lipgloss.NewStyle().
@@ -275,14 +287,14 @@ func (m model) renderTokenDashboard() string {
 		Bold(true).
 		Foreground(lipgloss.Color("243")) // medium gray for sum
 
-	inStr := formatToken(m.inputTokens)
-	outStr := formatToken(m.outputTokens)
-	sumStr := formatToken(totalTokens)
-
-	totalLine := fmt.Sprintf("Total: ")
-	totalLine += inputStyle.Render(fmt.Sprintf("In: %s ", inStr))
-	totalLine += outputStyle.Render(fmt.Sprintf("Out: %s ", outStr))
-	totalLine += sumStyle.Render(fmt.Sprintf("Σ %s", sumStr))
+	// Combined header with token summary
+	header = headerStyle.Render(fmt.Sprintf("%-*s", maxAgentNameWidth, "Total")) + " " +
+		inputStyle.Render(fmt.Sprintf("In: %s ", inStr)) +
+		outputStyle.Render(fmt.Sprintf("Out: %s ", outStr))
+	if m.cacheReadInputTokens > 0 {
+		header += fmt.Sprintf("Cache: %s ", formatToken(m.cacheReadInputTokens))
+	}
+	header += sumStyle.Render(fmt.Sprintf("Σ %s", sumStr))
 
 	// Separator
 	sepStyle := lipgloss.NewStyle().Foreground(lipgloss.Color("236"))
@@ -295,16 +307,11 @@ func (m model) renderTokenDashboard() string {
 		}
 	}
 	sort.Slice(agents, func(i, j int) bool {
-		return (agents[i].InputTokens+agents[i].OutputTokens) > (agents[j].InputTokens+agents[j].OutputTokens)
+		return (agents[i].InputTokens + agents[i].OutputTokens) > (agents[j].InputTokens + agents[j].OutputTokens)
 	})
 
-	// Calculate column widths for alignment
-	const maxAgentNameWidth = 18
-	const colFormat = "In: %-6s Out: %-6s"
-
 	var lines []string
 	lines = append(lines, header)
-	lines = append(lines, totalLine)
 	lines = append(lines, sepStyle.Render(strings.Repeat("─", 48)))
 
 	for _, au := range agents {
@@ -313,7 +320,7 @@ func (m model) renderTokenDashboard() string {
 			agentLabel = agentLabel[:maxAgentNameWidth-1] + "…"
 		}
 		// Pad agent name to fixed width
-		paddedName := fmt.Sprintf("%-"+fmt.Sprintf("%ds", maxAgentNameWidth)+"s", agentLabel)
+		paddedName := fmt.Sprintf("%-*s", maxAgentNameWidth, agentLabel)
 
 		agentIn := formatToken(au.InputTokens)
 		agentOut := formatToken(au.OutputTokens)
@@ -326,6 +333,9 @@ func (m model) renderTokenDashboard() string {
 		agentLine := agentStyle.Render(paddedName)
 		agentLine += " " + agentInStyle.Render(fmt.Sprintf("In: %s ", agentIn))
 		agentLine += agentOutStyle.Render(fmt.Sprintf("Out: %s", agentOut))
+		if au.CacheReadInputTokens > 0 {
+			agentLine += " " + agentInStyle.Render(fmt.Sprintf("Cache: %s", formatToken(au.CacheReadInputTokens)))
+		}
 		lines = append(lines, agentLine)
 	}