Local server for dagger cli by lukemarsden · Pull Request #5 · helixml/helix

lukemarsden · 2023-11-04T09:41:23Z

conceptual review welcome

**Issue #1-3: WolfLobbyID Handling**
- Add WolfLobbyID to SessionMetadata (was missing)
- Save WolfLobbyID when creating external agent session
- Fix token response to return lobby ID instead of PIN

**Issue #5: Moonlight Credentials**
- Add api.credentials = 'helix' in MoonlightStreamViewer
- Matches moonlight-web-config/config.json setting

**Documentation**:
- docs/STREAMING_ISSUES_FOUND.md - Complete review findings
- 12 issues documented (3 critical fixed, 2 need action, 7 minor/future)

Remaining: Wolf host pairing needed before streaming works

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

PROVEN FACTS (from core dump + source analysis):

Thread Flow:
1. Thread 99 = HTTPS server (wolf.cpp:187, port 47984)
2. Processing /cancel endpoint (endpoints::https::cancel)
3. Fires StopStreamEvent SYNCHRONOUSLY (event_bus.hpp:171)
4. Handler calls gst_element_send_event FROM HTTPS THREAD
5. GStreamer recursively traverses pipeline (frames #7→#5)
6. Blocks on mutex 0x70537c0062b0 in libgstbase-1.0.so.0
7. Thread 40 (audio pipeline owner) is HEALTHY in ppoll
8. Only Thread 99 waiting on this mutex - no contention

GStreamer Analysis:
- gst_element_send_event IS thread-safe (uses recursive STATE_LOCK)
- Documented as "MT safe" - can be called from any thread
- But empirically CAUSES DEADLOCK when called from HTTPS thread
- GStreamer has both recursive (STATE/PAD) and NON-recursive (live_lock) mutexes

The Mystery:
- WHO holds mutex 0x70537c0062b0? NOT Thread 40, not any visible thread
- Options: abandoned by crashed thread, corrupted, or race condition
- Cannot prove exact mechanism without debugging symbols

CONFIRMED FIXES:
1. HTTPS connection leak (100% certain) - add close() in error handler
2. Replace gst_element_send_event with g_main_loop_quit (80% confidence)
   - Eliminates cross-thread pipeline calls
   - g_main_loop_quit IS thread-safe (documented)
   - Even though gst_element_send_event claims to be safe, empirically fails

Gaps in Evidence:
- No debug symbols for libgstbase (can't see frame #4 function)
- Core dump partially corrupted
- Can't identify mutex owner
- Need symbols + reproduction to prove exact mechanism
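The idea behind fix 2 above (ask the loop that owns the pipeline to quit, instead of reaching into the pipeline from the HTTPS thread) can be illustrated without GStreamer. The following is a minimal Go analogy, not Wolf's C++ code: `Pipeline`, `RequestStop`, and `Wait` are hypothetical names, and a channel stands in for `g_main_loop_quit`.

```go
package main

import (
	"fmt"
	"sync"
)

// Pipeline stands in for a resource that only its owner goroutine may touch.
// All names here are hypothetical; this is an analogy, not GStreamer code.
type Pipeline struct {
	quit     chan struct{} // closed to ask the owner loop to exit
	quitOnce sync.Once     // makes RequestStop safe to call repeatedly
	done     chan struct{} // closed by the owner once teardown finished
	stopped  bool          // written only by the owner, read after done closes
}

func NewPipeline() *Pipeline {
	return &Pipeline{quit: make(chan struct{}), done: make(chan struct{})}
}

// Run is the owner loop. Teardown happens here, on the owning goroutine,
// so no other goroutine ever needs the pipeline's internal locks.
func (p *Pipeline) Run() {
	defer close(p.done)
	<-p.quit
	p.stopped = true
}

// RequestStop may be called from any goroutine (for example an HTTP handler).
// It only signals the owner; it never blocks and never touches pipeline state.
func (p *Pipeline) RequestStop() {
	p.quitOnce.Do(func() { close(p.quit) })
}

// Wait blocks until the owner loop has finished its teardown.
func (p *Pipeline) Wait() { <-p.done }

func main() {
	p := NewPipeline()
	go p.Run()

	p.RequestStop() // e.g. from the /cancel handler
	p.RequestStop() // a duplicate request is harmless
	p.Wait()

	fmt.Println("stopped:", p.stopped)
}
```

The caller's only interaction with the pipeline is a non-blocking signal, so it cannot end up waiting on a mutex held somewhere inside the pipeline.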

Spec-Ref: helix-specs@792cfa369:001588_read-helixs-design2026

Spec-Ref: helix-specs@6c5123c58:001588_read-helixs-design2026

Issue #1 (stuck "Starting Desktop"):
- Add defer in StartDesktop to clear external_agent_status on any error
- Give waitForDesktopBridge its own 90s context decoupled from dockerCtx

Issue #4 (status not cleared on stop):
- StopDesktop unconditionally clears external_agent_status and status_message

Issue #5 (no restart button in Starting state):
- Frontend: show Stop button in "starting" state in both screenshot and stream modes
- Show "may have failed to start" message after 2-minute timeout

Issue #10a (duplicate sessions per spectask):
- Re-read task from DB before CreateSession; skip if PlanningSessionID already set

Issue #10b (scanner targets wrong sessions):
- processPendingPromptsForIdleSessions now filters to canonical planning_session_id only

Issue #2 (duplicate message sends):
- Add ClaimPromptForSending() atomic store method (UPDATE WHERE status IN pending/failed)
- Both interrupt and any-pending delivery paths use claim before send

Issue #7 (promotion race gives empty zvol):
- resolveDockerDataDir: acquire read lock before fresh zvol creation; re-check after

Issue #3: Already handled by existing open_thread on agent_ready reconnect

Issue #6: Fixed in merged PR #1947 (RecoverStaleBuilds 60s retry)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Spec-Ref: helix-specs@04b515c3c:001588_read-helixs-design2026
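The claim-before-send fix for Issue #2 can be sketched as follows. This is a minimal in-memory illustration, not the actual helix store: the real method is described as an atomic `UPDATE ... WHERE status IN (pending, failed)`, which is emulated here with a mutex. `PromptStore`, `Add`, and the `sending` status are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type PromptStatus string

const (
	StatusPending PromptStatus = "pending"
	StatusFailed  PromptStatus = "failed"
	StatusSending PromptStatus = "sending" // assumed name for the claimed state
)

// PromptStore is a hypothetical in-memory stand-in for the database table.
type PromptStore struct {
	mu     sync.Mutex
	status map[string]PromptStatus
}

func NewPromptStore() *PromptStore {
	return &PromptStore{status: make(map[string]PromptStatus)}
}

// Add registers a prompt in the pending state.
func (s *PromptStore) Add(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.status[id] = StatusPending
}

// ClaimPromptForSending mimics a conditional UPDATE: it flips the status to
// "sending" only if it is currently pending or failed, and reports whether
// this caller won the claim (the SQL equivalent is "rows affected == 1").
func (s *PromptStore) ClaimPromptForSending(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	switch s.status[id] {
	case StatusPending, StatusFailed:
		s.status[id] = StatusSending
		return true
	}
	return false
}

func main() {
	store := NewPromptStore()
	store.Add("prompt-1")

	// Two delivery paths race to send the same prompt; only the one that
	// wins the claim actually sends.
	var sends int32
	var wg sync.WaitGroup
	for _, path := range []string{"interrupt", "any-pending"} {
		wg.Add(1)
		go func(path string) {
			defer wg.Done()
			if store.ClaimPromptForSending("prompt-1") {
				atomic.AddInt32(&sends, 1)
			}
		}(path)
	}
	wg.Wait()
	fmt.Println("sends:", atomic.LoadInt32(&sends)) // prints "sends: 1"
}
```

Because the check and the status change happen as one step, whichever path loses the race sees a status that no longer matches and skips the send.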

Three pre-baked profiles for the customer's actual deployment, each sized to its hardware and using the current best open-weights models (April 2026: DeepSeek-V4-Pro, GLM-4.7-Flash, Qwen3.6-35B-A3B, Qwen3.5-27B). Verified composeparse handles each unchanged.

Profiles:
- design/sample-profiles/customer-node1-4xA100.yaml
  4× A100 80GB. 4 services on GPUs 0-2 (qwen3 embeddings sharing GPU 0, GLM-4.7-Flash 31B on GPU 1, Qwen3.6-35B-A3B MoE on GPU 2). GPU 3 reserved for desktops via Decision 15. A100 has no NVENC so video encoding falls back to libx264 software, which is fine for 1-2 sessions (verified live in cloud GPU campaign run #4). composeparse output: 4 services, GPUCount=3 → 4-GPU host has 1 GPU of explicit headroom.
- design/sample-profiles/customer-node2to4-4xL40S.yaml
  4× L40S 48GB. Same 4-services-on-3-GPUs shape as Node 1; sized for L40S's smaller VRAM (Qwen3.5-27B + Qwen3.6-35B-A3B FP8 fit single cards). Deployed identically to all three nodes (2, 3, 4); the inference router round-robins. L40S has full NVENC + display engine → hardware-accelerated desktops on GPU 3.
- design/sample-profiles/customer-node5-8xMI300X.yaml
  8× MI300X 192GB = 1.5 TiB total VRAM. Runs DeepSeek-V4-Pro 862B FP8 with TP=8 across all 8 cards via rocm/vllm. **Inference-only**: MI300X CDNA-3 is compute-only, can't render desktops (Mesa radeonsi refuses graphics context; verified live in cloud GPU campaign run #5). The rocm/vllm image needs explicit `entrypoint: ["vllm", "serve"]` because unlike vllm/vllm-openai its default entrypoint is /bin/bash (also verified live).

Wiring across the 4 surfaces:
- api/pkg/runner/composeparse/sample_profiles_test.go: locked in customer-node1 (4 services, 3 GPUs), customer-node2to4 (4, 3), customer-node5 (1 service, 8 GPUs). Future parser changes that break any of these will fail tests.
- design/sample-profiles/README.md: table updated with all three plus a new section explaining the per-node deployment.
- frontend/src/components/dashboard/profileBlocks.ts: three new curated entries in the Profile Gallery: "Customer Node 1 — 4×A100 80GB", "Customer Nodes 2-4 — 4×L40S 48GB (each)", and "Customer Node 5 — 8×MI300X big-iron (inference-only)". Each card has accurate pros/cons including the desktop-headroom story per node.
- integration-test/gpucloud/matrix.yaml: the existing disabled node1-a100-4x / node2-l40s-4x / node3-l40s-4x / node4-l40s-4x / node5-mi300x-8x entries now point at the new richer customer-nodeN-... profiles instead of the generic placeholders they had before.

UI flexibility audit (separate question from the user): the compose YAML field in EditRunnerProfile.tsx is a plain `<TextField multiline minRows={20}>` textarea, so smart users have full flexibility to define arbitrary services with any Docker image (incl. custom builds + private registries), any env vars, any CLI args, any GPU pinning. Validation is server-side via composeparse on save (parses the YAML, extracts model list + GPU count); there's no client-side allowlist or schema enforcement.

Test results: TestParse_SampleProfiles green for all 9 profiles including the new 3. Frontend builds clean (39s). Harness dry-run shows the new entries are correctly disabled (so accidental cloud spend is impossible without flipping enabled: true).

Spec-Ref: helix-specs@ac4cc3643:001959_we-need-to-replace-all
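The headroom arithmetic quoted for Node 1 (4 services, GPUCount=3, 1 GPU spare on a 4-GPU host) can be reproduced with a short sketch. This is not the composeparse implementation: `Service`, `DistinctGPUs`, and `Headroom` are hypothetical names, the service names are placeholders, and the sketch assumes GPUCount means the number of distinct device indices pinned across all services.

```go
package main

import (
	"fmt"
	"sort"
)

// Service is a hypothetical, simplified view of one compose service:
// just a name and the GPU device indices it is pinned to.
type Service struct {
	Name string
	GPUs []int
}

// DistinctGPUs returns the sorted set of device indices used by any service.
// Services sharing a GPU are counted once.
func DistinctGPUs(services []Service) []int {
	seen := make(map[int]bool)
	for _, s := range services {
		for _, g := range s.GPUs {
			seen[g] = true
		}
	}
	out := make([]int, 0, len(seen))
	for g := range seen {
		out = append(out, g)
	}
	sort.Ints(out)
	return out
}

// Headroom is the number of host GPUs left unpinned by the profile.
func Headroom(hostGPUs int, services []Service) int {
	return hostGPUs - len(DistinctGPUs(services))
}

func main() {
	// Placeholder layout matching the Node 1 shape described above:
	// two services sharing GPU 0, one on GPU 1, one on GPU 2.
	node1 := []Service{
		{Name: "svc-a", GPUs: []int{0}},
		{Name: "svc-b", GPUs: []int{0}},
		{Name: "svc-c", GPUs: []int{1}},
		{Name: "svc-d", GPUs: []int{2}},
	}
	fmt.Printf("services=%d gpuCount=%d headroom=%d\n",
		len(node1), len(DistinctGPUs(node1)), Headroom(4, node1))

	// Node 5 shape: one tensor-parallel service spanning all 8 cards.
	node5 := []Service{{Name: "svc-tp8", GPUs: []int{0, 1, 2, 3, 4, 5, 6, 7}}}
	fmt.Printf("services=%d gpuCount=%d headroom=%d\n",
		len(node5), len(DistinctGPUs(node5)), Headroom(8, node5))
}
```

Under these assumptions the first line reports 4 services on 3 GPUs with 1 GPU of headroom, and the second reports 1 service on 8 GPUs with none, which matches the inference-only description of Node 5.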

lukemarsden added 8 commits November 1, 2023 21:49

wip

b8a1045

what to do next

b47c9eb

wrap err

777ba28

merge

c6162de

implement local queue for locally injected sessions, ensuring we reject ones that won't fit in GPU memory. What to do next: why is it hanging??

143868f

switch to xsync.MapOf

a7d6816

initialize properly

66d4840

Delete after killing

f8e813a
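Commit 143868f describes a local queue for locally injected sessions that rejects ones that won't fit in GPU memory. A minimal sketch of that admission check is below. It is not the PR's code: `LocalQueue`, `Session`, and `ErrWontFit` are hypothetical names, and comparing against total GPU memory (rather than currently free memory) is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrWontFit marks sessions that can never be scheduled on this GPU.
var ErrWontFit = errors.New("session will not fit in GPU memory")

// Session is a hypothetical locally injected session with a memory estimate.
type Session struct {
	ID          string
	MemoryBytes uint64
}

// LocalQueue holds sessions waiting to run on the local GPU.
type LocalQueue struct {
	mu             sync.Mutex
	totalGPUMemory uint64
	pending        []Session
}

func NewLocalQueue(totalGPUMemory uint64) *LocalQueue {
	return &LocalQueue{totalGPUMemory: totalGPUMemory}
}

// Enqueue rejects a session up front if its memory requirement exceeds the
// GPU's total memory; otherwise it appends the session to the queue.
func (q *LocalQueue) Enqueue(s Session) error {
	if s.MemoryBytes > q.totalGPUMemory {
		return fmt.Errorf("%w: session %s needs %d bytes, GPU has %d",
			ErrWontFit, s.ID, s.MemoryBytes, q.totalGPUMemory)
	}
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending = append(q.pending, s)
	return nil
}

// Len reports how many sessions are currently queued.
func (q *LocalQueue) Len() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.pending)
}

func main() {
	q := NewLocalQueue(24 << 30) // assume a 24 GiB card

	if err := q.Enqueue(Session{ID: "small", MemoryBytes: 8 << 30}); err != nil {
		fmt.Println("unexpected:", err)
	}
	err := q.Enqueue(Session{ID: "huge", MemoryBytes: 40 << 30})
	fmt.Println("rejected:", errors.Is(err, ErrWontFit))
	fmt.Println("queued:", q.Len())
}
```

Rejecting at enqueue time means an oversized session fails fast with a wrapped error instead of sitting in the queue forever.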

lukemarsden merged commit f8e813a into long-running Nov 5, 2023

lukemarsden added a commit that referenced this pull request Mar 18, 2026

Progress: #1a, #1b, #4 done; start #5

6c5123c

Spec-Ref: helix-specs@792cfa369:001588_read-helixs-design2026

lukemarsden added a commit that referenced this pull request Mar 18, 2026

Progress: #5 done; start #10a

620ace9

Spec-Ref: helix-specs@6c5123c58:001588_read-helixs-design2026

lukemarsden merged 8 commits into long-running from local-server-for-dagger-cli
