ops: add runtime health checks to doctor.sh #55
New checks aligned with what the heartbeat monitors at runtime:
- Slack bridge responding (curl localhost:7890)
- Disk usage: warning at 80%, fail at 90%
- Stale session sockets (no owning process)
- Orphaned worktrees (>5 triggers warning)
- Session log total size (warn at 500MB)
Greptile Summary
Adds five runtime health checks to
Confidence Score: 2/5
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[doctor.sh starts] --> B[User / Deps / Secrets / Runtime / Security / Agent checks]
B --> C[Runtime Health section]
C --> D{Slack bridge check}
D -->|curl :7890 returns 400| D1[PASS]
D -->|no response| D2[WARN]
D1 --> E{Disk usage check}
D2 --> E
E -->|< 80%| E1[PASS]
E -->|80-89%| E2[WARN]
E -->|>= 90%| E3[FAIL]
E1 --> F{Stale sockets check}
E2 --> F
E3 --> F
F -->|"⚠ Uses $AGENT_HOME (undefined)"| F0[CRASH: unbound variable]
F -->|"After fix: uses $BAUDBOT_HOME"| F1{fuser finds stale .sock?}
F1 -->|none stale| F2[PASS]
F1 -->|stale found| F3[WARN]
F2 --> G{Worktree count}
F3 --> G
G -->|<= 5| G1[PASS]
G -->|> 5| G2[WARN]
G1 --> H{Session log size}
G2 --> H
H -->|< 500MB| H1[PASS]
H -->|>= 500MB| H2[WARN]
H1 --> I[Summary]
H2 --> I
Last reviewed commit: 946a9d2
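The bridge branch of the flowchart treats any HTTP response, even a 400, as proof that the Slack bridge is listening. A minimal sketch of that logic, assuming `curl` is available; the function names `probe_bridge` and `bridge_status` and the 3-second timeout are hypothetical and not taken from `doctor.sh`:

```bash
#!/usr/bin/env bash

# Print the HTTP status code returned by the bridge, or 000 if nothing answered.
# Port 7890 comes from the PR description; the timeout is an assumption.
probe_bridge() {
  local url="${1:-http://localhost:7890/}"
  local code
  # curl exits non-zero on connection failure, so guard it for `set -e` scripts.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 "$url" 2>/dev/null) || true
  printf '%s\n' "${code:-000}"
}

# Map a status code to a check result: any real HTTP response means a
# process is listening, so even 400 counts as PASS.
bridge_status() {
  case "$1" in
    000|'') echo "WARN" ;;
    [1-5][0-9][0-9]) echo "PASS" ;;
    *) echo "WARN" ;;
  esac
}
```

A caller would combine the two as `bridge_status "$(probe_bridge)"`. Keeping the classification separate from the network call makes the PASS/WARN rule easy to exercise without a running bridge.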
fi

# Orphaned worktrees
WORKTREE_DIR="$AGENT_HOME/workspace/worktrees"
Same $AGENT_HOME → $BAUDBOT_HOME fix needed
- WORKTREE_DIR="$AGENT_HOME/workspace/worktrees"
+ WORKTREE_DIR="$BAUDBOT_HOME/workspace/worktrees"
Prompt To Fix With AI
This is a comment left during a code review.
Path: bin/doctor.sh
Line: 279
Comment:
**Same `$AGENT_HOME` → `$BAUDBOT_HOME` fix needed**
```suggestion
WORKTREE_DIR="$BAUDBOT_HOME/workspace/worktrees"
```
How can I resolve this? If you propose a fix, please make it concise.
Fixed (and that whole worktree-count block was later removed to keep the PR focused).
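The "unbound variable" crash shown in the flowchart is what bash does under `set -u` when a script expands a name that was never assigned. A minimal sketch of a guard that reports the problem instead of aborting; `require_var` is a hypothetical helper, and the sketch assumes `doctor.sh` runs under bash:

```bash
#!/usr/bin/env bash

# Return 0 if the named variable is set and non-empty, 1 otherwise.
# ${!name:-} is bash indirect expansion; the :- default means the lookup
# does not abort even when the script runs with `set -u`.
require_var() {
  local name="$1"
  [ -n "${!name:-}" ]
}

# Only build the path when the variable it depends on actually exists.
if require_var BAUDBOT_HOME; then
  WORKTREE_DIR="$BAUDBOT_HOME/workspace/worktrees"
fi
```

A check written this way can emit a WARN for a missing variable rather than terminating the whole run.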
- All references to undefined AGENT_HOME replaced with BAUDBOT_HOME
- Renamed ORPHANS variable to WORKTREE_COUNT (counts all worktrees, not just orphaned)
Keep only the checks that answer 'why isn't the bot responding': bridge health, disk usage, stale sockets.
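Of the three remaining checks, stale sockets is the least obvious: a socket file is stale when it still exists on disk but no process holds it open. A minimal sketch, assuming `fuser` (from psmisc) is installed; the function name `count_stale_sockets` and the directory argument are hypothetical, since the actual socket location is not shown in this thread:

```bash
#!/usr/bin/env bash

# Count *.sock files in a directory that no running process has open.
count_stale_sockets() {
  local dir="$1" stale=0 sock
  # A missing directory means there is nothing to be stale.
  [ -d "$dir" ] || { echo 0; return 0; }
  for sock in "$dir"/*.sock; do
    # Skip the unexpanded glob and anything that is not a socket.
    [ -S "$sock" ] || continue
    # fuser exits non-zero when no process is accessing the file.
    if ! fuser "$sock" >/dev/null 2>&1; then
      stale=$((stale + 1))
    fi
  done
  echo "$stale"
}
```

Note that if `fuser` is missing, or cannot see processes owned by other users, every socket would be counted as stale, so a real check would probably verify the tool first.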
fi

# Disk usage
DISK_PCT=$(df / 2>/dev/null | tail -1 | awk '{print $5}' | tr -d '%')
Bug: The disk usage check parses `df /`, whose output can wrap onto a second line on systems with long filesystem names (e.g., LVM), causing the `doctor.sh` script to crash.
Severity: MEDIUM
Suggested Fix
Add the `-P` flag so that `df` uses the POSIX output format, which prevents line wrapping and keeps the columns consistent. The command becomes `df -P /`.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: bin/doctor.sh#L248
Potential issue: The `doctor.sh` script parses disk usage with `df / | tail -1 | awk
'{print $5}'`. On systems with long filesystem names, such as those using LVM or network
mounts, the output of `df` wraps to a new line. This causes the parsing logic to
incorrectly assign the mount point (e.g., `/`) to the `DISK_PCT` variable. Because the
script runs with `set -e`, the subsequent attempt to perform a numeric comparison `[
"$DISK_PCT" -ge 90 ]` on a non-numeric value will cause the script to exit with an
error, preventing the health check from completing.
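The failure mode described above can be avoided by asking `df` for POSIX output and validating the value before comparing it. A minimal sketch; `parse_df_pct` and `disk_status` are hypothetical names, the 80%/90% thresholds come from the PR description, and mapping an unparseable value to WARN is an assumption rather than the PR's behavior:

```bash
#!/usr/bin/env bash

# Read `df -P` output on stdin and print the Capacity column of the first
# data row without the % sign. -P keeps each filesystem on one line.
parse_df_pct() {
  awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Classify a usage percentage. Non-numeric input yields WARN instead of an
# "integer expression expected" error from the numeric test.
disk_status() {
  local pct="$1"
  case "$pct" in
    ''|*[!0-9]*) echo "WARN"; return 0 ;;
  esac
  if [ "$pct" -ge 90 ]; then
    echo "FAIL"
  elif [ "$pct" -ge 80 ]; then
    echo "WARN"
  else
    echo "PASS"
  fi
}
```

In a script the two would be chained as `disk_status "$(df -P / | parse_df_pct)"`. Selecting the row by number (`NR == 2`) rather than with `tail -1` also makes the intent explicit: one header line, one data line.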
Adds runtime health checks to bin/doctor.sh, aligned with what the heartbeat monitors at runtime.
New checks
These checks run after the existing dependency/security/agent checks, so doctor.sh now covers both setup correctness and runtime health.