feat(doctor): add --watch mode for continuous diagnostics#63
Conversation
Signed-off-by: Sagar <sagarkapri321@gmail.com>
btwshivam
left a comment
There was a problem hiding this comment.
the patch doesn't compile. you used // ... unchanged comments as literal code and deleted real fields. rebase on main and re-apply only the new watch-mode additions, then make check.
| # Enable AI analysis | ||
| sudo kerno doctor --ai`, | ||
| Args: cobra.NoArgs, | ||
| // ... (Use, Short, Long, Example unchanged) |
There was a problem hiding this comment.
this // ... (Use, Short, Long, Example unchanged) placeholder is a literal source-level deletion. you actually removed Use, Short, Long, Example, and Args: cobra.NoArgs from the cobra command. same anti-pattern at three more places in this file:
- line ~74:
// ... (rest of flags unchanged)deleted four flag registrations (--output,--ai,--no-ai,--quiet), leaving theuseAI,noAI,quietvars declared-but-unused (Go compile error). - inside
runDoctor:// ... (initial logic unchanged)deletedthresholds,analyzer,opts.durationdefaulting, so the next lines reference undefined names. - end of
runDoctor:// ... (rest of function unchanged)followed by a}closes the function early, andfor { ... }below ends up at file scope.
that's why Lint and Test are red. rebase on main and re-apply only the genuinely new lines: the watch field in the var block, the BoolVarP for --watch, the opts.watch field, and the if opts.watch { ... } branch inside runDoctor. then make check locally before pushing.
| // the "same" issue. It includes the rule, signal, metric, and process | ||
| // (if any), but excludes the severity and current value. | ||
| func (f *Finding) Fingerprint() string { | ||
| return fmt.Sprintf("%s|%s|%s|%s|%s", f.Rule, f.Signal, f.Metric, f.Process, f.Title) |
There was a problem hiding this comment.
including Title in the fingerprint is fragile. doctor rule titles often embed the current metric value, like syscall p99 is 240ms vs syscall p99 is 250ms next cycle. that flips an Ongoing finding into Resolved+New every tick, defeating the whole point of --watch.
key on rule + signal + process. drop Title and Metric (Metric also varies for the same reason):
func (f *Finding) Fingerprint() string {
return fmt.Sprintf("%s|%s|%s", f.Rule, f.Signal, f.Process)
}| renderWatchFrame(os.Stdout, style, cycle, start, diffed, signals, false) | ||
|
|
||
| // 6. Wait for next interval. | ||
| waitDuration := opts.interval - time.Since(cycleStart) |
There was a problem hiding this comment.
opts.interval - time.Since(cycleStart) goes negative whenever the collection window exceeds the interval. with the defaults (duration 30s, watch interval 10s), waitDuration is -20s, so the wait is skipped and the next cycle starts immediately. effective cadence becomes duration, not interval.
user's mental model for kerno doctor --watch --interval 10s is "show me a fresh view every 10s". simplest fix: when --watch is set, use a short collection window (say 2s) and let interval drive the wall-clock cadence. or clamp opts.duration = min(opts.duration, opts.interval / 2) at the top of runDoctorWatch.
| build collectorBuildResult, | ||
| opts doctorOpts, | ||
| logger *slog.Logger, | ||
| ) error { |
There was a problem hiding this comment.
issue #28 explicitly requires --watch --output json to emit NDJSON (one JSON object per cycle) so the watch loop can pipe into Loki / Logstash. this function ignores opts.output and always renders the terminal UI, so kerno doctor --watch --output json | jq just prints ANSI gibberish.
split at the top:
if opts.output == "json" {
return runDoctorWatchJSON(ctx, engine, build, opts, logger)
}
// existing terminal-UI loop belowthe JSON loop is much simpler than the TTY loop. one json.NewEncoder(os.Stdout) and emit the report per cycle.
| // 3. If finding exists in prev but not in curr, it is StatusResolved. | ||
| // 4. StatusResolved findings are kept for keepResolvedDuration (e.g. 30s) | ||
| // before being dropped. | ||
| func Diff(prev []DiffedFinding, curr []Finding, keepResolvedDuration time.Duration) []DiffedFinding { |
There was a problem hiding this comment.
no tests for Diff(). it's the core of the feature and the most subtle piece of logic in the PR. add internal/doctor/diff_test.go with table-driven cases for:
- new finding (not in prev)
- ongoing finding (same fingerprint two cycles in a row,
FirstSeenpreserved) - resolved finding within keep window (kept in output with
StatusResolved) - resolved finding past keep window (dropped from output)
these are roughly 40 lines of test code and they catch the Title-in-fingerprint issue immediately.
This PR adds a
--watch(-w) flag to thedoctorcommand, transforming it from a one-shot diagnostic into a continuous monitoring engine with a live, "sticky" terminal UI and real-time finding diffs.Fixes #28
Finding Fingerprinting: Implemented
Fingerprint()ininternal/doctor/finding.gotouniquely identify findings across cycles (Rule + Signal + Metric + Process).Diff Engine: Created
internal/doctor/diff.goto categorize findings asNew(prefix+),Ongoing(with duration), orResolved(prefix-with strikethrough). Resolved findings are kept visible for 30s before expiry.Zero-Dependency UI: Implemented an ANSI-based terminal loop in
internal/cli/doctor_watch.gotoavoid adding heavy TUI dependencies likebubbletea, ensuring high portability while still delivering a "sticky" live view.System Pulse: Added a throughput header showing aggregate events/sec (syscalls, scheduling, disk, etc.) to confirm pipeline health.
Testing
go build ./...passesgo test ./...passesgo vet ./...passesgolangci-lint run ./...passessudo ./bin/bpf-verify --read 5sconfirms 6/6 programs still load./scripts/verify.shpasses (or specific phase:./scripts/verify.sh quality)Checklist
feat(scope): subject)git commit -s)scripts/verify.sh