fix: single owner for Slack bridge lifecycle (stop dual-launch race) #164

Merged
benvinegar merged 1 commit into main from fix/bridge-single-owner on Feb 24, 2026

@baudbot-agent (Collaborator) commented Feb 24, 2026

Problem

The Slack bridge is launched in two places that fight each other:

  1. start.sh: launches the bridge in a background subshell, ( bb_bridge_supervise ... ) &, before pi starts
  2. startup-cleanup.sh: launches the bridge in a slack-bridge tmux session after control-agent is live

This causes recurring issues:

  • Port 7890 conflicts — both try to bind, one crash-loops on EADDRINUSE
  • Orphaned supervisor loops — background subshell survives pi exit
  • Socket not found — bridge launched before pi means PI_SESSION_ID is wrong/missing, messages get dropped
  • Infinite restart spin — no max retries or backoff on the restart loop

Changes

1. start.sh: Remove bridge launch (cleanup only)

  • Kills processes recorded in stale PID files, the slack-bridge tmux session, and any process holding port 7890
  • Does NOT start the bridge — that's startup-cleanup.sh's job
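
The cleanup-only behaviour described above could be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the function names and the PID file path are hypothetical, and only the port number (7890) and the tmux session name (slack-bridge) come from this PR. It assumes `tmux` and `lsof` may or may not be installed and skips the corresponding step when they are missing.

```shell
#!/usr/bin/env bash
# Sketch only: names and the PID file path are assumptions, not from the PR.
BRIDGE_PORT=7890
BRIDGE_TMUX_SESSION=slack-bridge
BRIDGE_PID_FILE="${BRIDGE_PID_FILE:-/tmp/slack-bridge.pid}"   # hypothetical path

# Kill the process recorded in a PID file if it is still alive, then remove the file.
kill_stale_pidfile() {
  local pid_file=$1 pid
  [ -f "$pid_file" ] || return 0
  pid=$(cat "$pid_file")
  if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    kill "$pid" 2>/dev/null || true
  fi
  rm -f "$pid_file"
}

# Kill the named tmux session if tmux is installed and the session exists.
kill_bridge_tmux_session() {
  command -v tmux >/dev/null 2>&1 || return 0
  if tmux has-session -t "$1" 2>/dev/null; then
    tmux kill-session -t "$1"
  fi
}

# Kill whatever is listening on the given TCP port (skipped when lsof is absent).
release_port() {
  command -v lsof >/dev/null 2>&1 || return 0
  local pids
  pids=$(lsof -t -i "tcp:$1" -sTCP:LISTEN 2>/dev/null || true)
  [ -z "$pids" ] || kill $pids 2>/dev/null || true
}

# Cleanup only: nothing here starts the bridge.
cleanup_bridge() {
  kill_stale_pidfile "$BRIDGE_PID_FILE"
  kill_bridge_tmux_session "$BRIDGE_TMUX_SESSION"
  release_port "$BRIDGE_PORT"
}
```

In this sketch, `cleanup_bridge` would be called once before pi starts. A PID file can point at an unrelated process if the PID was reused, so a real script may want to verify the process name before killing it.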

2. startup-cleanup.sh: Add max retries + backoff to restart loop

  • Tracks consecutive fast failures (<60s runtime)
  • Gives up after 10 consecutive fast failures (logs FATAL)
  • Backs off: 5s base + 2s per failure, capped at 60s
  • Kills port holders before retrying (prevents EADDRINUSE spin)
  • Logs attempt count, runtime, and failure state
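
As an illustration of the retry policy listed above, here is a minimal sketch of a restart loop with a retry cap and linear backoff. The function and variable names are hypothetical, `SLEEP_CMD` is an added hook so the loop can be exercised without real delays, and the port-holder cleanup step is omitted. The thresholds (60s fast-failure window, 10 failures, 5s base, 2s per failure, 60s cap) are the ones stated in this PR.

```shell
#!/usr/bin/env bash
# Sketch only: names are assumptions; thresholds follow the PR description.
MAX_FAST_FAILURES=10   # give up after this many consecutive fast failures
FAST_FAIL_SECS=60      # a run shorter than this counts as a fast failure
BASE_DELAY=5           # seconds
PER_FAILURE_DELAY=2    # extra seconds per consecutive fast failure
MAX_DELAY=60           # cap on the delay

# Print the delay in seconds to wait after N consecutive fast failures.
backoff_delay() {
  local failures=$1
  local delay=$(( BASE_DELAY + PER_FAILURE_DELAY * failures ))
  if (( delay > MAX_DELAY )); then
    delay=$MAX_DELAY
  fi
  echo "$delay"
}

# Run "$@" repeatedly; return 1 after too many consecutive fast failures.
supervise() {
  local failures=0 attempt=0 start runtime delay status
  while true; do
    attempt=$(( attempt + 1 ))
    start=$(date +%s)
    status=0
    "$@" || status=$?
    runtime=$(( $(date +%s) - start ))
    if (( runtime < FAST_FAIL_SECS )); then
      failures=$(( failures + 1 ))
    else
      failures=0   # a long run resets the consecutive-failure counter
    fi
    echo "attempt=$attempt exit=$status runtime=${runtime}s fast_failures=$failures" >&2
    if (( failures >= MAX_FAST_FAILURES )); then
      echo "FATAL: giving up after $failures consecutive fast failures" >&2
      return 1
    fi
    delay=$(backoff_delay "$failures")
    "${SLEEP_CMD:-sleep}" "$delay"
  done
}
```

With these numbers the delay is 7s after the first fast failure and 23s after the ninth, so the 60s cap is never reached before the loop gives up at ten. It would only matter if the retry limit were raised.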

3. bin/ci/smoke-agent-runtime.sh: Relax bridge status check

  • Bridge supervisor status file is no longer created by start.sh
  • Check is now informational (log) rather than a hard gate
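
An informational check of this kind might look like the sketch below. The helper name and the status file path passed to it are hypothetical. The point is only that the function logs what it finds and always returns success, so it cannot gate the smoke test.

```shell
#!/usr/bin/env bash
# Sketch only: report bridge supervisor status without failing the smoke test.
report_bridge_status() {
  local status_file=$1
  if [ -f "$status_file" ]; then
    echo "INFO: bridge supervisor status: $(cat "$status_file")"
  else
    echo "INFO: no bridge supervisor status file at $status_file"
  fi
  return 0   # informational only; never a hard gate
}
```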

⚠️ Protected file

start.sh is a protected file (root-owned). Admin needs to review and deploy.

greptile-apps Bot commented Feb 24, 2026

Greptile Summary

Eliminated the Slack bridge dual-launch race condition by removing bridge startup from start.sh and making startup-cleanup.sh the single owner. Previously, both scripts launched competing bridge instances causing port conflicts, orphaned processes, and missing session UUIDs.

Key changes:

  • Removed bb_bridge_supervise background subshell launch and bridge-restart-policy.sh sourcing
  • Added comprehensive cleanup logic: kills stale PID file processes, terminates slack-bridge tmux sessions, force-releases port 7890
  • Bridge now only starts in startup-cleanup.sh after control-agent registers its session UUID, ensuring PI_SESSION_ID is correct
  • Detailed comments explain why start.sh can't own the bridge (session UUID not available until pi starts)
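
The ordering constraint above (no bridge until the session UUID exists) can be sketched as a guard in front of the tmux launch. This is an assumption-laden illustration: `launch_bridge` and `BRIDGE_CMD` are hypothetical, and the real script's launch command is not shown in this PR. Only the session name slack-bridge and the `PI_SESSION_ID` variable come from the source.

```shell
#!/usr/bin/env bash
# Sketch only: BRIDGE_CMD and launch_bridge are hypothetical names.
BRIDGE_TMUX_SESSION=slack-bridge          # session name from the PR
BRIDGE_CMD="${BRIDGE_CMD:-./run-bridge}"  # placeholder for the real bridge command

launch_bridge() {
  # Refuse to start without a session ID; a bridge started too early drops messages.
  if [ -z "${PI_SESSION_ID:-}" ]; then
    echo "refusing to launch bridge: PI_SESSION_ID is not set" >&2
    return 1
  fi
  # Detached tmux session; the ID is passed through the command's environment.
  tmux new-session -d -s "$BRIDGE_TMUX_SESSION" \
    "PI_SESSION_ID='$PI_SESSION_ID' exec $BRIDGE_CMD"
}
```

Failing fast when the variable is missing makes the "launched before pi" case an explicit error instead of a bridge that silently loses messages.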

Minor issue found:

  • Placeholder PR #XXX should be updated to actual PR number #164

Confidence Score: 4/5

  • Safe to merge with one trivial documentation fix needed
  • The change correctly solves the dual-launch race condition with proper cleanup logic. The implementation follows established patterns (startup-cleanup.sh already existed and properly handles bridge launches). Only a minor placeholder PR number needs updating. The fix addresses a real production issue (orphaned processes, port conflicts, missing session UUIDs) with a clean architectural solution.
  • No files require special attention - the single comment fix is trivial

Important Files Changed

| Filename | Overview |
| --- | --- |
| start.sh | Removes dual-launch race condition by delegating bridge ownership to startup-cleanup.sh, adds comprehensive cleanup logic |

Sequence Diagram

sequenceDiagram
    participant Admin
    participant start.sh
    participant pi
    participant control-agent
    participant startup-cleanup.sh
    participant bridge

    Note over Admin,bridge: OLD: Dual Launch (race condition)
    Admin->>start.sh: sudo -u baudbot_agent start.sh
    start.sh->>bridge: Launch in background subshell
    Note over bridge: Port 7890 (wrong/missing PI_SESSION_ID)
    start.sh->>pi: Start pi agent
    pi->>control-agent: Register session UUID
    control-agent->>startup-cleanup.sh: Run cleanup script
    startup-cleanup.sh->>bridge: Launch in tmux (correct PI_SESSION_ID)
    Note over bridge: Port 7890 conflict (EADDRINUSE)
    
    Note over Admin,bridge: NEW: Single Owner (this PR)
    Admin->>start.sh: sudo -u baudbot_agent start.sh
    start.sh->>start.sh: Kill stale PID file
    start.sh->>start.sh: Kill tmux session slack-bridge
    start.sh->>start.sh: Force-release port 7890
    start.sh->>pi: Start pi agent
    pi->>control-agent: Register session UUID
    control-agent->>startup-cleanup.sh: Run cleanup script
    startup-cleanup.sh->>bridge: Launch in tmux with PI_SESSION_ID
    Note over bridge: Port 7890 (correct session UUID)

Last reviewed commit: 3771d6b

@greptile-apps greptile-apps Bot left a comment

1 file reviewed, 1 comment

Comment thread start.sh Outdated
Comment on lines +17 to +18
# bridge-restart-policy.sh no longer needed — bridge is started by
# startup-cleanup.sh, not start.sh (see PR #XXX)

placeholder PR #XXX should be replaced with actual PR number #164

Suggested change
- # bridge-restart-policy.sh no longer needed — bridge is started by
- # startup-cleanup.sh, not start.sh (see PR #XXX)
+ # bridge-restart-policy.sh no longer needed — bridge is started by
+ # startup-cleanup.sh, not start.sh (see PR #164)

@baudbot-agent baudbot-agent force-pushed the fix/bridge-single-owner branch 2 times, most recently from 1024a36 to f1b7f9d Compare February 24, 2026 14:02
Comment thread start.sh
@baudbot-agent baudbot-agent force-pushed the fix/bridge-single-owner branch from f1b7f9d to dc4bfe9 Compare February 24, 2026 14:15
Two problems fixed:

1. Dual bridge launch: start.sh launched a bridge as a background
   subshell before pi started, then startup-pi.sh launched another
   in tmux after. This caused port conflicts, orphaned supervisors,
   and dropped messages. start.sh now only cleans up stale processes
   — startup-pi.sh is the sole bridge owner.

2. Infinite restart loop: the bridge restart loop had no max retries
   or backoff. A fatal config error would spin forever at 5s intervals.
   Now tracks consecutive fast failures (<60s runtime), backs off
   (5s + 2s per failure, capped at 60s), gives up after 10, and
   kills port holders before retrying.

Also renames startup-cleanup.sh → startup-pi.sh to clarify that this
is the agent-side startup script (called automatically by the
control-agent on every session start), not a manual cleanup tool.
@baudbot-agent baudbot-agent force-pushed the fix/bridge-single-owner branch from dc4bfe9 to f1c249d Compare February 24, 2026 14:16
@benvinegar benvinegar merged commit 06404b5 into main Feb 24, 2026
9 checks passed

2 participants