Website #2

Merged
lukemarsden merged 12 commits into long-running from website on Oct 23, 2023

Conversation

@lukemarsden
Collaborator

NOTE: the Keycloak DB must be deleted, otherwise everything will break.

lukemarsden merged commit f2b7e71 into long-running on Oct 23, 2023
lukemarsden added a commit that referenced this pull request Oct 10, 2025
FINAL WORKING SOLUTION after 8 attempts:

Fix #1: Duplicate pause guard (Wolf)
- Prevents multiple EOS events
- Session count stays correct
- CONFIRMED working in logs

Fix #2: Prevent auto-leave on pause (Wolf + Helix)
- Lobbies don't auto-leave when Wolf-UI pauses
- Wolf-UI session stays connected to lobby even when disconnected
- Lobby never becomes empty
- No stale buffer accumulation
- Agents keep running

Test pattern: 1→2→3→1 should now work without rejoin hang!

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
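
The duplicate pause guard from Fix #1 can be illustrated with a minimal Python sketch. Wolf itself is C++, and the `Session` class, its `pause()`/`resume()` methods, and the `eos_events` counter are hypothetical names chosen for illustration, not Wolf's actual code.

```python
class Session:
    """Hypothetical streaming session; names are illustrative, not Wolf's API."""

    def __init__(self):
        self.paused = False
        self.eos_events = 0  # number of end-of-stream (EOS) events emitted

    def pause(self):
        # Guard: pausing an already-paused session is a no-op,
        # so at most one EOS event is emitted per pause.
        if self.paused:
            return False
        self.paused = True
        self.eos_events += 1
        return True

    def resume(self):
        self.paused = False
```

Calling `pause()` twice in a row emits a single EOS event; the second call returns `False` without touching any counters.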
lukemarsden added a commit that referenced this pull request Nov 14, 2025
THREE CRITICAL BUGS found causing HTTPS deadlock within 16 hours:

BUG #1: GStreamer Thread-Safety Violation (PRIMARY ROOT CAUSE)
- gst_element_send_event() called from HTTPS thread (wrong context!)
- Must be called from pipeline's g_main_loop_run() thread
- HTTPS thread blocks on GStreamer internal mutex (0x70537c0062b0)
- Located in streaming.cpp:124, 132, 176, 184, 401, 524
- FIX: Use g_main_loop_quit() instead (thread-safe)

BUG #2: NVIDIA Driver Mutex Deadlock (SECONDARY)
- Multiple GStreamer pipelines compete for NVIDIA mutex (0x705580003b80)
- Circular deadlock: HTTPS→GStreamer→NVIDIA→?
- Core dump shows 2 threads stuck on same NVIDIA mutex
- Inside proprietary libEGL_nvidia.so.0 (no symbols)
- FIX: Separate CUDA contexts per pipeline OR remove NVIDIA from SSL

BUG #3: HTTPS Connection Leak (CONTRIBUTING FACTOR)
- custom-https.cpp error handler doesn't close sockets
- 17 leaked connections in 16 hours (~1/hour leak rate!)
- Connections stuck in CLOSE_WAIT forever
- From: external browsers, moonlight-web, localhost
- FIX: Add socket->close() in error handler

COMPLETE DEADLOCK CHAIN:
1. HTTPS request fires StopStreamEvent (endpoints.hpp:484)
2. Event handler runs in HTTPS thread (synchronous dispatch)
3. Calls gst_element_send_event() - WRONG THREAD (Bug #1)
4. Blocks on GStreamer mutex
5. GStreamer holds mutex, waiting on NVIDIA
6. NVIDIA mutex held by another operation
7. ALL new HTTPS requests block on continue_lock()
8. System appears completely hung for HTTPS

EVIDENCE:
- HTTP (port 47989) still works perfectly
- HTTPS (port 47984) completely hung
- Core dump shows exact mutex addresses and call stacks
- 17 leaked CLOSE_WAIT connections
- Thread 99 stuck in gst_element_send_event from wrong context

CRITICAL FIX: Replace all gst_element_send_event(eos) with g_main_loop_quit()
in event handlers at streaming.cpp:124,132,176,184,401,524
lukemarsden added a commit that referenced this pull request Nov 19, 2025
BUG #1: Inconsistent NVIDIA runtime detection pattern
- Line 811 used 'grep -i nvidia' (too broad, matches image names)
- Line 779 used 'grep -i "runtimes.*nvidia"' (correct, matches runtime only)
- Fixed line 811 to use consistent pattern
- Prevents false positives when nvidia/cuda images are present

BUG #2: Race condition after Docker installation
- Docker daemon takes 1-3 seconds to initialize after systemctl start
- check_docker_sudo() was called immediately, could fail if daemon not ready
- Added 30-second wait loop checking 'docker ps' readiness
- Applied to both Ubuntu/Debian and Fedora installation paths
- Prevents intermittent failures: "Docker is not running" after fresh install

Both fixes are defensive and prevent edge cases without changing behavior
for correctly configured systems.
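
The readiness wait described in BUG #2 could look roughly like the following Python sketch. The original fix lives in a shell install script; `docker_ready` and `wait_until_ready` are hypothetical names, and only the `docker ps` check and the 30-second budget come from the commit message.

```python
import subprocess
import time


def docker_ready():
    """Return True if `docker ps` exits successfully (daemon accepts requests)."""
    try:
        result = subprocess.run(
            ["docker", "ps"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return False  # docker CLI is not installed
    return result.returncode == 0


def wait_until_ready(check=docker_ready, timeout=30.0, interval=1.0, sleep=time.sleep):
    """Poll `check` until it succeeds or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

Passing `check` and `sleep` as parameters keeps the loop testable without a running Docker daemon.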
chocobar added a commit that referenced this pull request Feb 11, 2026
…ation

- Fix silently swallowed Exec() error in migration (bug #1)
- Fix WHERE condition: LENGTH(name) > 255 instead of OCTET_LENGTH > 2704 (bug #2)
- Add Go-level name truncation in CreateSession, UpdateSession,
  UpdateSessionMeta, and UpdateSessionName to prevent cryptic GORM errors
- Add 6 unit tests covering truncation for ASCII, multibyte (CJK), and
  boundary cases across all session name write paths

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
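
The difference between a character limit and a byte limit, which this commit's truncation depends on, can be sketched in Python. The real change is in Go; `truncate_name` and `MAX_NAME_CHARS` are hypothetical names, and the 255 limit is taken from the `LENGTH(name) > 255` condition above.

```python
MAX_NAME_CHARS = 255  # character limit, from the LENGTH(name) > 255 condition


def truncate_name(name, limit=MAX_NAME_CHARS):
    """Truncate to at most `limit` characters (code points), not bytes."""
    # Python str slicing counts code points, so a multibyte character
    # is never cut in half.
    return name[:limit]
```

A 300-character CJK name truncates to 255 characters, which is 765 bytes in UTF-8, so a byte-based check and a character-based check disagree on such input.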
lukemarsden added a commit that referenced this pull request Mar 18, 2026
Both sessions have spec_task_id set. First has no agent_type
(incomplete creation). This race likely causes duplicate message
sends (issue #2) and the agent jumping from spec to implementation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
lukemarsden added a commit that referenced this pull request Mar 18, 2026
Spec-Ref: helix-specs@620ace9f4:001588_read-helixs-design2026
lukemarsden added a commit that referenced this pull request Mar 18, 2026
Spec-Ref: helix-specs@433412c98:001588_read-helixs-design2026
lukemarsden added a commit that referenced this pull request Mar 18, 2026
Issue #1 (stuck "Starting Desktop"):
- Add defer in StartDesktop to clear external_agent_status on any error
- Give waitForDesktopBridge its own 90s context decoupled from dockerCtx

Issue #4 (status not cleared on stop):
- StopDesktop unconditionally clears external_agent_status and status_message

Issue #5 (no restart button in Starting state):
- Frontend: show Stop button in "starting" state in both screenshot and stream modes
- Show "may have failed to start" message after 2-minute timeout

Issue #10a (duplicate sessions per spectask):
- Re-read task from DB before CreateSession; skip if PlanningSessionID already set

Issue #10b (scanner targets wrong sessions):
- processPendingPromptsForIdleSessions now filters to canonical planning_session_id only

Issue #2 (duplicate message sends):
- Add ClaimPromptForSending() atomic store method (UPDATE WHERE status IN pending/failed)
- Both interrupt and any-pending delivery paths use claim before send

Issue #7 (promotion race gives empty zvol):
- resolveDockerDataDir: acquire read lock before fresh zvol creation; re-check after

Issue #3: Already handled by existing open_thread on agent_ready reconnect

Issue #6: Fixed in merged PR #1947 (RecoverStaleBuilds 60s retry)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Spec-Ref: helix-specs@04b515c3c:001588_read-helixs-design2026
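
The claim-before-send idea behind `ClaimPromptForSending()` (Issue #2) can be sketched with Python's built-in `sqlite3`. This is an assumption-laden illustration, not the Helix store code: the `prompts` table, its columns, and the `'sending'` status value are hypothetical, and only the conditional `UPDATE ... WHERE status IN (pending, failed)` pattern comes from the commit message.

```python
import sqlite3


def claim_prompt_for_sending(conn, prompt_id):
    """Atomically move a prompt from pending/failed to sending.

    Returns True only for the caller whose UPDATE actually changed the row.
    """
    cur = conn.execute(
        "UPDATE prompts SET status = 'sending' "
        "WHERE id = ? AND status IN ('pending', 'failed')",
        (prompt_id,),
    )
    conn.commit()
    return cur.rowcount == 1


def make_demo_db():
    """Build an in-memory database holding one pending prompt (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prompts (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute("INSERT INTO prompts (id, status) VALUES (1, 'pending')")
    conn.commit()
    return conn
```

Because the status check and the status change happen in one statement, a second delivery path that tries to claim the same prompt sees zero affected rows and skips the send.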
lukemarsden added a commit that referenced this pull request Mar 19, 2026
Tests were missing expectations for two new store calls introduced by
the ZFS deployment issue fixes:

- ClaimPromptForSending (fix #2): added before MarkPromptAsSent /
  MarkPromptAsFailed in processAnyPendingPrompt and handleAgentReady
  tests

- GetSpecTask (fix #10b): added to all processPendingPromptsForIdleSessions
  tests; returns a SpecTask with PlanningSessionID matching the test
  session so canonical-session filtering passes through correctly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Spec-Ref: helix-specs@04b515c3c:001588_read-helixs-design2026
chocobar pushed a commit that referenced this pull request Apr 22, 2026
The Design Review UI made two sequential API calls on "Approve Design":
1. submitReviewMutation (marks review record approved)
2. v1SpecTasksApproveSpecsCreate (approves the spec task)

If #1 succeeded but #2 failed, the review showed "approved" but the
spec task stayed in spec_review with SpecApproval == nil — creating the
inconsistent state that led to the infinite loop.

Fix: move spec task approval into submitDesignReview's "approve" case
(matching the existing pattern where "request_changes" already updates
the spec task). Remove the redundant second API call from the frontend.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Spec-Ref: helix-specs@560941003:001869_bug-report-spec-tasks
philwinder added a commit that referenced this pull request Apr 28, 2026
Previously the activation prompt only carried Body. The Worker had to
call read_events to learn Subject, From, ThreadID, Extra — exactly the
round-trip that caused the docs-engineer to misroute issue #3 to PR #2
during the github demo's E2E run.

renderTrigger now formats every populated envelope field into the
prompt, omitting empties for cleanliness. The Trigger.Body field is
dropped; callers pass the full Message instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
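
The behaviour described for `renderTrigger` (emit every populated envelope field, omit empty ones) might look like this in a Python sketch. The `Message` dataclass, its field names, and the output layout are assumptions for illustration; only the field list (Subject, From, ThreadID, Extra, Body) comes from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical envelope; field names follow the commit message."""
    subject: str = ""
    sender: str = ""  # "From" in the commit message
    thread_id: str = ""
    body: str = ""
    extra: dict = field(default_factory=dict)


def render_trigger(msg):
    """Format every populated envelope field into the prompt, skipping empties."""
    lines = []
    for label, value in (
        ("Subject", msg.subject),
        ("From", msg.sender),
        ("ThreadID", msg.thread_id),
    ):
        if value:
            lines.append(f"{label}: {value}")
    for key in sorted(msg.extra):
        lines.append(f"Extra.{key}: {msg.extra[key]}")
    if msg.body:
        if lines:
            lines.append("")  # blank line between headers and body
        lines.append(msg.body)
    return "\n".join(lines)
```

With the envelope in the prompt, the receiver can see which thread a message belongs to without a separate lookup.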