Long running #3

Merged
lukemarsden merged 30 commits into main from long-running
Oct 23, 2023

Conversation

@lukemarsden
Collaborator

No description provided.

lukemarsden merged commit 966e0c6 into main on Oct 23, 2023
lukemarsden added a commit that referenced this pull request Nov 14, 2025
THREE CRITICAL BUGS found causing HTTPS deadlock within 16 hours:

BUG #1: GStreamer Thread-Safety Violation (PRIMARY ROOT CAUSE)
- gst_element_send_event() called from HTTPS thread (wrong context!)
- Must be called from pipeline's g_main_loop_run() thread
- HTTPS thread blocks on GStreamer internal mutex (0x70537c0062b0)
- Located in streaming.cpp:124, 132, 176, 184, 401, 524
- FIX: Use g_main_loop_quit() instead (thread-safe)

BUG #2: NVIDIA Driver Mutex Deadlock (SECONDARY)
- Multiple GStreamer pipelines compete for NVIDIA mutex (0x705580003b80)
- Circular deadlock: HTTPS→GStreamer→NVIDIA→?
- Core dump shows 2 threads stuck on same NVIDIA mutex
- Inside proprietary libEGL_nvidia.so.0 (no symbols)
- FIX: Separate CUDA contexts per pipeline OR remove NVIDIA from SSL

BUG #3: HTTPS Connection Leak (CONTRIBUTING FACTOR)
- custom-https.cpp error handler doesn't close sockets
- 17 leaked connections in 16 hours (~1/hour leak rate!)
- Connections stuck in CLOSE_WAIT forever
- From: external browsers, moonlight-web, localhost
- FIX: Add socket->close() in error handler

COMPLETE DEADLOCK CHAIN:
1. HTTPS request fires StopStreamEvent (endpoints.hpp:484)
2. Event handler runs in HTTPS thread (synchronous dispatch)
3. Calls gst_element_send_event() - WRONG THREAD (Bug #1)
4. Blocks on GStreamer mutex
5. GStreamer holds mutex, waiting on NVIDIA
6. NVIDIA mutex held by another operation
7. ALL new HTTPS requests block on continue_lock()
8. System appears completely hung for HTTPS

EVIDENCE:
- HTTP (port 47989) still works perfectly
- HTTPS (port 47984) completely hung
- Core dump shows exact mutex addresses and call stacks
- 17 leaked CLOSE_WAIT connections
- Thread 99 stuck in gst_element_send_event from wrong context

CRITICAL FIX: Replace all gst_element_send_event(eos) with g_main_loop_quit()
in event handlers at streaming.cpp:124,132,176,184,401,524
lukemarsden added a commit that referenced this pull request Mar 18, 2026
Spec-Ref: helix-specs@433412c98:001588_read-helixs-design2026
lukemarsden added a commit that referenced this pull request Mar 18, 2026
Issue #1 (stuck "Starting Desktop"):
- Add defer in StartDesktop to clear external_agent_status on any error
- Give waitForDesktopBridge its own 90s context decoupled from dockerCtx

Issue #4 (status not cleared on stop):
- StopDesktop unconditionally clears external_agent_status and status_message

Issue #5 (no restart button in Starting state):
- Frontend: show Stop button in "starting" state in both screenshot and stream modes
- Show "may have failed to start" message after 2-minute timeout

Issue #10a (duplicate sessions per spectask):
- Re-read task from DB before CreateSession; skip if PlanningSessionID already set

Issue #10b (scanner targets wrong sessions):
- processPendingPromptsForIdleSessions now filters to canonical planning_session_id only

Issue #2 (duplicate message sends):
- Add ClaimPromptForSending() atomic store method (UPDATE WHERE status IN pending/failed)
- Both interrupt and any-pending delivery paths use claim before send
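
The claim-before-send idea for Issue #2 can be sketched with an in-memory store, where a mutex plays the role of the conditional `UPDATE`. This is an illustration under assumptions, not the real store method: the `promptStore` type, the lowercase method name, and the `"sending"` status value are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// promptStore is a hypothetical in-memory stand-in for the prompt table.
type promptStore struct {
	mu     sync.Mutex
	status map[string]string
}

// claimPromptForSending mimics a conditional update along the lines of
// UPDATE ... SET status = 'sending' WHERE id = ? AND status IN ('pending', 'failed')
// and reports whether this caller was the one that changed the row.
func (s *promptStore) claimPromptForSending(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	switch s.status[id] {
	case "pending", "failed":
		s.status[id] = "sending"
		return true
	default:
		return false
	}
}

// deliver races several senders for one prompt and counts how many actually send.
func deliver(s *promptStore, id string, senders int) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	sent := 0
	for i := 0; i < senders; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Only the goroutine that wins the claim goes on to send.
			if s.claimPromptForSending(id) {
				mu.Lock()
				sent++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return sent
}

func main() {
	s := &promptStore{status: map[string]string{"p1": "pending"}}
	fmt.Println(deliver(s, "p1", 8)) // prints 1: exactly one sender wins the claim
}
```

With a real database the same guarantee would come from checking the affected-row count of the conditional update, so both delivery paths can call the claim without coordinating with each other.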

Issue #7 (promotion race gives empty zvol):
- resolveDockerDataDir: acquire read lock before fresh zvol creation; re-check after

Issue #3: Already handled by existing open_thread on agent_ready reconnect

Issue #6: Fixed in merged PR #1947 (RecoverStaleBuilds 60s retry)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Spec-Ref: helix-specs@04b515c3c:001588_read-helixs-design2026
lukemarsden added a commit that referenced this pull request Apr 26, 2026
…art-up

Embed auth fix #3: validate token in /auth/{authenticated,user} directly
philwinder added a commit that referenced this pull request Apr 28, 2026
Previously the activation prompt only carried Body. The Worker had to
call read_events to learn Subject, From, ThreadID, Extra — exactly the
round-trip that caused the docs-engineer to misroute issue #3 to PR #2
during the github demo's E2E run.

renderTrigger now formats every populated envelope field into the
prompt, omitting empties for cleanliness. The Trigger.Body field is
dropped; callers pass the full Message instead.
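
A rough sketch of what a `renderTrigger` along these lines might look like. The field names Subject, From, ThreadID, Extra, and Body come from the commit message; the `Message` struct layout, the type of `Extra`, and the output format are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Message is an assumed shape for the envelope; Extra's real type is not
// stated in the commit message.
type Message struct {
	Subject  string
	From     string
	ThreadID string
	Extra    map[string]string
	Body     string
}

// renderTrigger writes every populated envelope field as a "Name: value"
// header line, skips empty fields, then appends the body after a blank line.
func renderTrigger(m Message) string {
	var b strings.Builder
	writeField := func(name, value string) {
		if value != "" {
			fmt.Fprintf(&b, "%s: %s\n", name, value)
		}
	}
	writeField("Subject", m.Subject)
	writeField("From", m.From)
	writeField("ThreadID", m.ThreadID)

	// Sort Extra keys so the prompt is deterministic across runs.
	keys := make([]string, 0, len(m.Extra))
	for k := range m.Extra {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		writeField("Extra."+k, m.Extra[k])
	}

	if m.Body != "" {
		if b.Len() > 0 {
			b.WriteString("\n")
		}
		b.WriteString(m.Body)
	}
	return b.String()
}

func main() {
	fmt.Println(renderTrigger(Message{
		Subject: "Docs are out of date",
		From:    "alice",
		Extra:   map[string]string{"kind": "issue", "number": "3"},
		Body:    "Please update the README.",
	}))
}
```

Carrying identifying fields in the prompt itself means the Worker does not need a `read_events` round-trip to tell one thread from another.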

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
philwinder added a commit that referenced this pull request May 4, 2026
Previously the activation prompt only carried Body. The Worker had to
call read_events to learn Subject, From, ThreadID, Extra — exactly the
round-trip that caused the docs-engineer to misroute issue #3 to PR #2
during the github demo's E2E run.

renderTrigger now formats every populated envelope field into the
prompt, omitting empties for cleanliness. The Trigger.Body field is
dropped; callers pass the full Message instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
lukemarsden added a commit that referenced this pull request May 15, 2026
Spec-Ref: helix-specs@abaaa8c45:002021_investigate-notion
2 participants