ops: add runtime health checks to doctor.sh#55

Merged
benvinegar merged 4 commits into main from benvinegar/doctor-runtime-checks
Feb 18, 2026

@benvinegar
Member

Adds runtime health checks to bin/doctor.sh, aligned with what the heartbeat monitors at runtime.

New checks

| Check | Pass | Warn | Fail |
| --- | --- | --- | --- |
| Slack bridge | Responding on :7890 | Not responding | |
| Disk usage | <80% | ≥80% | ≥90% |
| Stale sockets | None | Has stale .sock files | |
| Worktrees | ≤5 | >5 (suggest cleanup) | |
| Session logs | <500MB | ≥500MB (suggest pruning) | |

These checks run after the existing dependency/security/agent checks, so doctor.sh now covers both setup correctness and runtime health.
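
As a rough illustration of the Slack bridge check, the sketch below probes the port with `curl` and treats any HTTP response as healthy (the review flowchart notes that the bridge answers a bare request with 400). The function names, the request path, and the 2-second timeout are assumptions for illustration, not the code in bin/doctor.sh.

```shell
# Hypothetical helper: map curl's reported HTTP code to a check result.
# curl reports 000 when it could not connect or timed out.
bridge_result() {
  case "$1" in
    ''|000) echo "WARN" ;;   # no HTTP response at all
    *)      echo "PASS" ;;   # any response (even 400) means the bridge is up
  esac
}

# Probe the bridge. Port 7890 comes from the PR description; the path and
# timeout are assumed values.
check_bridge() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 \
    "http://localhost:${1:-7890}/" 2>/dev/null || true)
  bridge_result "$code"
}

check_bridge
```

Separating the classification from the probe keeps the pass/warn rule testable without a running bridge.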

New checks aligned with what the heartbeat monitors at runtime:
- Slack bridge responding (curl localhost:7890)
- Disk usage warning at 80%, fail at 90%
- Stale session sockets (no owning process)
- Orphaned worktrees (>5 triggers warning)
- Session log total size (warn at 500MB)
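
One possible shape for the stale-socket check is sketched below, using `fuser` (mentioned in the review flowchart) to test whether any process still holds a socket open. The helper names and the `SOCKET_DIR` placeholder are hypothetical; the actual socket location used by doctor.sh is not shown in this conversation.

```shell
# Succeeds when some process still has the socket open.
# If fuser (psmisc) is unavailable, assume "in use" so nothing is misreported.
socket_in_use() {
  command -v fuser >/dev/null 2>&1 || return 0
  fuser "$1" >/dev/null 2>&1
}

# Print every *.sock directly under the given directory that has no owner.
list_stale_sockets() {
  sock_dir=$1
  [ -d "$sock_dir" ] || return 0
  for sock in "$sock_dir"/*.sock; do
    [ -e "$sock" ] || continue          # glob matched nothing
    socket_in_use "$sock" || echo "$sock"
  done
}

# SOCKET_DIR is a placeholder; the real path is not given in this PR.
stale=$(list_stale_sockets "${SOCKET_DIR:-.}")
if [ -n "$stale" ]; then
  echo "WARN: stale sockets: $stale"
else
  echo "PASS: no stale sockets"
fi
```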
Comment thread bin/doctor.sh Outdated
@greptile-apps
greptile-apps Bot commented Feb 18, 2026

Greptile Summary

Adds five runtime health checks to bin/doctor.sh: Slack bridge connectivity, disk usage thresholds, stale session sockets, worktree accumulation, and session log size. These complement the existing setup/dependency checks and align with what the heartbeat extension monitors at runtime.

  • Critical bug: Three of the five new checks use $AGENT_HOME (lines 260, 279, 294–295), which is never defined in the script. The rest of doctor.sh consistently uses $BAUDBOT_HOME (defined on line 18). Because the script runs with set -euo pipefail, the undefined variable will cause the script to abort with an "unbound variable" error before reaching the summary section. This needs to be fixed before merging.
  • Minor: The "Orphaned worktrees" section counts all worktrees (not just orphans) but uses a misleadingly-named ORPHANS variable.

Confidence Score: 2/5

  • This PR has a script-breaking bug that must be fixed before merging — 3 of the 5 new checks use an undefined variable.
  • The use of $AGENT_HOME instead of $BAUDBOT_HOME will cause doctor.sh to crash with an unbound variable error under set -u whenever any of the stale sockets, worktrees, or session logs checks are reached. The Slack bridge and disk usage checks work correctly. The fix is straightforward (replace AGENT_HOME with BAUDBOT_HOME in 4 places), but the script is broken as-is.
  • bin/doctor.sh — the undefined $AGENT_HOME variable on lines 260, 279, 294, and 295 will crash the script
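
The failure mode described above can be reproduced in isolation. The sketch below runs two snippets in a child `bash` under `set -euo pipefail`: one expands the never-assigned `AGENT_HOME`, the other expands an assigned `BAUDBOT_HOME`. The `run_strict` helper and the `/tmp/baudbot` value are made up for the demonstration.

```shell
# Run a snippet in a child bash under strict mode and report whether it survived.
run_strict() {
  if bash -c "set -euo pipefail; $1" >/dev/null 2>&1; then
    echo "ok"
  else
    echo "aborted"
  fi
}

# AGENT_HOME is never assigned, so expanding it trips `set -u`.
run_strict 'unset AGENT_HOME; echo "$AGENT_HOME/workspace/worktrees"'      # prints "aborted"

# BAUDBOT_HOME is assigned first (the value here is a placeholder).
run_strict 'BAUDBOT_HOME=/tmp/baudbot; echo "$BAUDBOT_HOME/workspace/worktrees"'   # prints "ok"
```

Because the abort happens at the first expansion, nothing after that line in the script runs, including the summary section.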

Important Files Changed

| Filename | Overview |
| --- | --- |
| bin/doctor.sh | Adds 5 runtime health checks (Slack bridge, disk usage, stale sockets, worktrees, session logs). Contains a critical bug: uses undefined $AGENT_HOME instead of $BAUDBOT_HOME in 3 of the 5 new checks, which will crash the script under set -u. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[doctor.sh starts] --> B[User / Deps / Secrets / Runtime / Security / Agent checks]
    B --> C[Runtime Health section]
    C --> D{Slack bridge check}
    D -->|curl :7890 returns 400| D1[PASS]
    D -->|no response| D2[WARN]
    D1 --> E{Disk usage check}
    D2 --> E
    E -->|< 80%| E1[PASS]
    E -->|>= 80%| E2[WARN]
    E -->|>= 90%| E3[FAIL]
    E1 --> F{Stale sockets check}
    E2 --> F
    E3 --> F
    F -->|"⚠ Uses $AGENT_HOME (undefined)"| F0[CRASH: unbound variable]
    F -->|"After fix: uses $BAUDBOT_HOME"| F1{fuser finds stale .sock?}
    F1 -->|none stale| F2[PASS]
    F1 -->|stale found| F3[WARN]
    F2 --> G{Worktree count}
    F3 --> G
    G -->|<= 5| G1[PASS]
    G -->|> 5| G2[WARN]
    G1 --> H{Session log size}
    G2 --> H
    H -->|< 500MB| H1[PASS]
    H -->|>= 500MB| H2[WARN]
    H1 --> I[Summary]
    H2 --> I

Last reviewed commit: 946a9d2

@greptile-apps greptile-apps Bot left a comment

1 file reviewed, 4 comments

Comment thread bin/doctor.sh Outdated
Comment thread bin/doctor.sh Outdated
fi

# Orphaned worktrees
WORKTREE_DIR="$AGENT_HOME/workspace/worktrees"

Same $AGENT_HOME → $BAUDBOT_HOME fix needed

Suggested change
- WORKTREE_DIR="$AGENT_HOME/workspace/worktrees"
+ WORKTREE_DIR="$BAUDBOT_HOME/workspace/worktrees"
Prompt To Fix With AI
This is a comment left during a code review.
Path: bin/doctor.sh
Line: 279

Comment:
**Same `$AGENT_HOME` → `$BAUDBOT_HOME` fix needed**

```suggestion
WORKTREE_DIR="$BAUDBOT_HOME/workspace/worktrees"
```

How can I resolve this? If you propose a fix, please make it concise.

Member Author

Fixed (and that whole worktree-count block was later removed to keep the PR focused).

Comment thread bin/doctor.sh Outdated
Comment thread bin/doctor.sh Outdated
- All references to undefined AGENT_HOME replaced with BAUDBOT_HOME
- Renamed ORPHANS variable to WORKTREE_COUNT (counts all worktrees, not just orphaned)
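
To show why `WORKTREE_COUNT` is the more accurate name, here is a minimal sketch of a check that counts every directory under the worktrees path, orphaned or not, and warns above the threshold of 5 from the PR description. It assumes each worktree is a direct subdirectory of `$BAUDBOT_HOME/workspace/worktrees`; the helper names are hypothetical, and this block was later dropped from the PR.

```shell
# Count the immediate subdirectories of the given path (0 if it does not exist).
count_worktrees() {
  n=0
  for entry in "$1"/*/; do
    [ -d "$entry" ] || continue   # glob matched nothing
    n=$((n + 1))
  done
  echo "$n"
}

# More than 5 worktrees triggers a warning, per the PR description.
worktree_status() {
  if [ "$1" -gt 5 ]; then echo "WARN"; else echo "PASS"; fi
}

# Falls back to the current directory when BAUDBOT_HOME is unset (demo only).
WORKTREE_COUNT=$(count_worktrees "${BAUDBOT_HOME:-.}/workspace/worktrees")
echo "$(worktree_status "$WORKTREE_COUNT"): $WORKTREE_COUNT worktrees"
```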
Comment thread bin/doctor.sh Outdated
Keep only the checks that answer 'why isn't the bot responding':
bridge health, disk usage, stale sockets.
Comment thread bin/doctor.sh
fi

# Disk usage
DISK_PCT=$(df / 2>/dev/null | tail -1 | awk '{print $5}' | tr -d '%')

Bug: The disk usage check parses the output of df /, which can wrap onto two lines on systems with long filesystem names (e.g., LVM). The parse then yields a non-numeric value, and the doctor.sh script crashes.
Severity: MEDIUM

Suggested Fix

Modify the command to use the POSIX-compliant output format for df by adding the -P flag. This prevents line wrapping and ensures consistent output. The command should be df -P /.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: bin/doctor.sh#L248

Potential issue: The `doctor.sh` script parses disk usage with `df / | tail -1 | awk
'{print $5}'`. On systems with long filesystem names, such as those using LVM or network
mounts, the output of `df` wraps to a new line. This causes the parsing logic to
incorrectly assign the mount point (e.g., `/`) to the `DISK_PCT` variable. Because the
script runs with `set -e`, the subsequent attempt to perform a numeric comparison `[
"$DISK_PCT" -ge 90 ]` on a non-numeric value will cause the script to exit with an
error, preventing the health check from completing.
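
A hedged sketch of the suggested fix follows: it reads the capacity column from `df -P` output and, as an extra precaution not mentioned in the review, refuses to compare a non-numeric value. The function names are hypothetical; the 80% and 90% thresholds come from the PR description.

```shell
# Extract the capacity column (without the % sign) from `df -P` output on stdin.
# With -P each filesystem is reported on a single line, so row 2 is the data row.
disk_pct_from_df() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Classify a percentage; anything that is not a plain integer is UNKNOWN
# rather than being fed to a numeric test that would abort under `set -e`.
classify_disk() {
  case "$1" in
    ''|*[!0-9]*) echo "UNKNOWN"; return 0 ;;
  esac
  if [ "$1" -ge 90 ]; then
    echo "FAIL"
  elif [ "$1" -ge 80 ]; then
    echo "WARN"
  else
    echo "PASS"
  fi
}

classify_disk "$(df -P / 2>/dev/null | disk_pct_from_df)"
```

The numeric guard means a parsing surprise degrades to an UNKNOWN result instead of ending the whole run.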

@benvinegar benvinegar merged commit d176e76 into main Feb 18, 2026
9 checks passed
@benvinegar benvinegar deleted the benvinegar/doctor-runtime-checks branch February 18, 2026 15:57