Local server for dagger cli #5

Merged
lukemarsden merged 8 commits into long-running from local-server-for-dagger-cli
Nov 5, 2023

Conversation

@lukemarsden
Collaborator

conceptual review welcome

@lukemarsden lukemarsden merged commit f8e813a into long-running Nov 5, 2023
lukemarsden added a commit that referenced this pull request Oct 8, 2025
**Issue #1-3: WolfLobbyID Handling**
- Add WolfLobbyID to SessionMetadata (was missing)
- Save WolfLobbyID when creating external agent session
- Fix token response to return lobby ID instead of PIN

**Issue #5: Moonlight Credentials**
- Add api.credentials = 'helix' in MoonlightStreamViewer
- Matches moonlight-web-config/config.json setting

**Documentation**:
- docs/STREAMING_ISSUES_FOUND.md - Complete review findings
- 12 issues documented (3 critical fixed, 2 need action, 7 minor/future)

Remaining: Wolf host pairing needed before streaming works

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
lukemarsden added a commit that referenced this pull request Oct 16, 2025
lukemarsden added a commit that referenced this pull request Nov 14, 2025
PROVEN FACTS (from core dump + source analysis):

Thread Flow:
1. Thread 99 = HTTPS server (wolf.cpp:187, port 47984)
2. Processing /cancel endpoint (endpoints::https::cancel)
3. Fires StopStreamEvent SYNCHRONOUSLY (event_bus.hpp:171)
4. Handler calls gst_element_send_event FROM HTTPS THREAD
5. GStreamer recursively traverses pipeline (frames #7 to #5)
6. Blocks on mutex 0x70537c0062b0 in libgstbase-1.0.so.0
7. Thread 40 (audio pipeline owner) is HEALTHY in ppoll
8. Only Thread 99 waiting on this mutex - no contention

GStreamer Analysis:
- gst_element_send_event IS thread-safe (uses recursive STATE_LOCK)
- Documented as "MT safe" - can be called from any thread
- But empirically CAUSES DEADLOCK when called from HTTPS thread
- GStreamer has both recursive (STATE/PAD) and NON-recursive (live_lock) mutexes

The Mystery:
- WHO holds mutex 0x70537c0062b0? NOT Thread 40, not any visible thread
- Options: abandoned by crashed thread, corrupted, or race condition
- Cannot prove exact mechanism without debugging symbols

CONFIRMED FIXES:
1. HTTPS connection leak (100% certain) - add close() in error handler
2. Replace gst_element_send_event with g_main_loop_quit (80% confidence)
   - Eliminates cross-thread pipeline calls
   - g_main_loop_quit IS thread-safe (documented)
   - Even though gst_element_send_event claims to be safe, empirically fails

Gaps in Evidence:
- No debug symbols for libgstbase (can't see frame #4 function)
- Core dump partially corrupted
- Can't identify mutex owner
- Need symbols + reproduction to prove exact mechanism
lukemarsden added a commit that referenced this pull request Mar 18, 2026
Spec-Ref: helix-specs@792cfa369:001588_read-helixs-design2026
lukemarsden added a commit that referenced this pull request Mar 18, 2026
Spec-Ref: helix-specs@6c5123c58:001588_read-helixs-design2026
lukemarsden added a commit that referenced this pull request Mar 18, 2026
Issue #1 (stuck "Starting Desktop"):
- Add defer in StartDesktop to clear external_agent_status on any error
- Give waitForDesktopBridge its own 90s context decoupled from dockerCtx

Issue #4 (status not cleared on stop):
- StopDesktop unconditionally clears external_agent_status and status_message

Issue #5 (no restart button in Starting state):
- Frontend: show Stop button in "starting" state in both screenshot and stream modes
- Show "may have failed to start" message after 2-minute timeout

Issue #10a (duplicate sessions per spectask):
- Re-read task from DB before CreateSession; skip if PlanningSessionID already set
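
The re-read guard above might look roughly like the following Go sketch. The `specTask` and `taskStore` types and the `ensurePlanningSession` helper are hypothetical; only the `PlanningSessionID` field name comes from the commit message. The point is that the decision is made on a freshly loaded record, not on the caller's possibly stale copy.

```go
package main

import "fmt"

// specTask is a hypothetical, trimmed-down task record.
type specTask struct {
	ID                string
	PlanningSessionID string
}

// taskStore is an in-memory stand-in for the database.
type taskStore map[string]*specTask

// ensurePlanningSession re-reads the task by ID and only calls
// createSession when no planning session has been recorded yet.
// It returns the session ID and whether a new session was created.
func ensurePlanningSession(store taskStore, stale *specTask, createSession func() string) (string, bool) {
	fresh, ok := store[stale.ID]
	if !ok {
		return "", false
	}
	if fresh.PlanningSessionID != "" {
		// Another path already created the session: skip.
		return fresh.PlanningSessionID, false
	}
	fresh.PlanningSessionID = createSession()
	return fresh.PlanningSessionID, true
}

func main() {
	store := taskStore{"t1": {ID: "t1", PlanningSessionID: "ses_existing"}}
	staleCopy := &specTask{ID: "t1"} // loaded before the session was set
	id, created := ensurePlanningSession(store, staleCopy, func() string { return "ses_new" })
	fmt.Println(id, created)
}
```

In a real store the re-read and the write would presumably need to share a transaction or lock; the sketch only shows the check itself.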

Issue #10b (scanner targets wrong sessions):
- processPendingPromptsForIdleSessions now filters to canonical planning_session_id only

Issue #2 (duplicate message sends):
- Add ClaimPromptForSending() atomic store method (UPDATE WHERE status IN pending/failed)
- Both interrupt and any-pending delivery paths use claim before send
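
A minimal sketch of the claim-before-send idea, assuming an in-memory store in place of the database. The mutex-guarded check-and-set plays the role of the conditional `UPDATE ... WHERE status IN (pending, failed)`, where only the caller whose update affects a row may send. The `promptStore` type, the `"sending"` status name, and `countWinners` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// promptStore is an in-memory stand-in for the prompt table.
type promptStore struct {
	mu     sync.Mutex
	status map[string]string
}

// claimPromptForSending flips a pending or failed prompt to "sending"
// and reports whether this caller won the claim. The "sending" status
// name is an assumption made for this sketch.
func (s *promptStore) claimPromptForSending(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	switch s.status[id] {
	case "pending", "failed":
		s.status[id] = "sending"
		return true
	default:
		return false
	}
}

// countWinners races n goroutines for the same prompt and returns how
// many of them were allowed to send.
func countWinners(s *promptStore, id string, n int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	winners := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if s.claimPromptForSending(id) {
				mu.Lock()
				winners++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return winners
}

func main() {
	s := &promptStore{status: map[string]string{"p1": "pending"}}
	fmt.Println("senders that won the claim:", countWinners(s, "p1", 2))
}
```

Because the check and the status change happen as one step, the interrupt path and the any-pending path cannot both deliver the same prompt.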

Issue #7 (promotion race gives empty zvol):
- resolveDockerDataDir: acquire read lock before fresh zvol creation; re-check after
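
One way to read the Issue #7 fix is as a lock-then-re-check pattern, sketched below in Go. This is an interpretation, not the actual implementation: the `zvolRegistry` type, the assumption that promotion holds the write side of the same lock, and the `promoted/` and `fresh/` paths are all hypothetical. The resolver takes the read lock, which waits out any in-flight promotion, and looks again before creating a fresh (empty) volume.

```go
package main

import (
	"fmt"
	"sync"
)

// zvolRegistry is a hypothetical stand-in for zvol bookkeeping.
type zvolRegistry struct {
	promoteMu sync.RWMutex // assumed: write-held while a promotion runs
	mu        sync.Mutex   // guards dirs
	dirs      map[string]string
}

func newZvolRegistry() *zvolRegistry {
	return &zvolRegistry{dirs: map[string]string{}}
}

// promote records a promoted volume while holding the write lock.
func (r *zvolRegistry) promote(id, dir string) {
	r.promoteMu.Lock()
	defer r.promoteMu.Unlock()
	r.mu.Lock()
	r.dirs[id] = dir
	r.mu.Unlock()
}

// resolveDataDir returns the existing volume for id, creating a fresh
// one only if none exists once the read lock has been acquired.
func (r *zvolRegistry) resolveDataDir(id string) string {
	// Blocks while a promotion holds the write lock.
	r.promoteMu.RLock()
	defer r.promoteMu.RUnlock()

	r.mu.Lock()
	defer r.mu.Unlock()
	// Re-check after acquiring the lock: a promotion may have finished
	// between the caller's earlier lookup and this point.
	if dir, ok := r.dirs[id]; ok {
		return dir
	}
	dir := "fresh/" + id
	r.dirs[id] = dir
	return dir
}

func main() {
	r := newZvolRegistry()
	r.promote("s1", "promoted/s1")
	fmt.Println(r.resolveDataDir("s1"))
	fmt.Println(r.resolveDataDir("s2"))
}
```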

Issue #3: Already handled by existing open_thread on agent_ready reconnect

Issue #6: Fixed in merged PR #1947 (RecoverStaleBuilds 60s retry)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Spec-Ref: helix-specs@04b515c3c:001588_read-helixs-design2026
lukemarsden added a commit that referenced this pull request Apr 30, 2026
Three pre-baked profiles for the customer's actual deployment, each
sized to its hardware and using the current best open-weights models
(April 2026 — DeepSeek-V4-Pro, GLM-4.7-Flash, Qwen3.6-35B-A3B,
Qwen3.5-27B). Verified composeparse handles each unchanged.

Profiles:

- design/sample-profiles/customer-node1-4xA100.yaml
  4× A100 80GB. 4 services on GPUs 0-2 (qwen3 embeddings sharing GPU 0,
  GLM-4.7-Flash 31B on GPU 1, Qwen3.6-35B-A3B MoE on GPU 2). GPU 3
  reserved for desktops via Decision 15. A100 has no NVENC so video
  encoding falls back to libx264 software — fine for 1-2 sessions
  (verified live in cloud GPU campaign run #4). composeparse output:
  4 services, GPUCount=3 → 4-GPU host has 1 GPU of explicit headroom.

- design/sample-profiles/customer-node2to4-4xL40S.yaml
  4× L40S 48GB. Same 4-services-on-3-GPUs shape as Node 1; sized for
  L40S's smaller VRAM (Qwen3.5-27B + Qwen3.6-35B-A3B FP8 fit single
  cards). Deployed identically to all three nodes (2, 3, 4) — the
  inference router round-robins. L40S has full NVENC + display engine
  → hardware-accelerated desktops on GPU 3.

- design/sample-profiles/customer-node5-8xMI300X.yaml
  8× MI300X 192GB = 1.5 TiB total VRAM. Runs DeepSeek-V4-Pro 862B FP8
  with TP=8 across all 8 cards via rocm/vllm. **Inference-only** — MI300X
  CDNA-3 is compute-only, can't render desktops (Mesa radeonsi refuses
  graphics context — verified live in cloud GPU campaign run #5). The
  rocm/vllm image needs explicit `entrypoint: ["vllm", "serve"]` because
  unlike vllm/vllm-openai its default entrypoint is /bin/bash (also
  verified live).

Wiring across the 4 surfaces:

- api/pkg/runner/composeparse/sample_profiles_test.go — locked in:
  customer-node1 (4 services, 3 GPUs), customer-node2to4 (4, 3),
  customer-node5 (1 service, 8 GPUs). Future parser changes that break
  any of these will fail tests.
- design/sample-profiles/README.md — table updated with all three plus a
  new section explaining the per-node deployment.
- frontend/src/components/dashboard/profileBlocks.ts — three new curated
  entries in the Profile Gallery: "Customer Node 1 — 4×A100 80GB",
  "Customer Nodes 2-4 — 4×L40S 48GB (each)", and "Customer Node 5 —
  8×MI300X big-iron (inference-only)". Each card has accurate pros/cons
  including the desktop-headroom story per node.
- integration-test/gpucloud/matrix.yaml — the existing disabled
  node1-a100-4x / node2-l40s-4x / node3-l40s-4x / node4-l40s-4x /
  node5-mi300x-8x entries now point at the new richer
  customer-nodeN-... profiles instead of the generic placeholders they
  had before.

UI flexibility audit (separate question from the user): the compose
YAML field in EditRunnerProfile.tsx is a plain `<TextField multiline
minRows={20}>` textarea — smart users have full flexibility to define
arbitrary services with any Docker image (incl. custom builds + private
registries), any env vars, any CLI args, any GPU pinning. Validation is
server-side via composeparse on save (parses the YAML, extracts model
list + GPU count) — there's no client-side allowlist or schema
enforcement.

Test results: TestParse_SampleProfiles green for all 9 profiles
including the new 3. Frontend builds clean (39s). Harness dry-run shows
the new entries are correctly disabled (so accidental cloud spend is
impossible without flipping enabled: true).

Spec-Ref: helix-specs@ac4cc3643:001959_we-need-to-replace-all