Skip to content

feat(doctor): add --watch mode for continuous diagnostics#63

Open
itssagarK wants to merge 1 commit into
optiqor:mainfrom
itssagarK:feat/watch-mode
Open

feat(doctor): add --watch mode for continuous diagnostics#63
itssagarK wants to merge 1 commit into
optiqor:mainfrom
itssagarK:feat/watch-mode

Conversation

@itssagarK
Copy link
Copy Markdown

This PR adds a --watch (-w) flag to the doctor command, transforming it from a one-shot diagnostic into a continuous monitoring engine with a live, "sticky" terminal UI and real-time finding diffs.

Fixes #28

Finding Fingerprinting: Implemented Fingerprint() in internal/doctor/finding.go touniquely identify findings across cycles (Rule + Signal + Metric + Process).

Diff Engine: Created internal/doctor/diff.go to categorize findings as New (prefix +), Ongoing (with duration), or Resolved (prefix - with strikethrough). Resolved findings are kept visible for 30s before expiry.

Zero-Dependency UI: Implemented an ANSI-based terminal loop in internal/cli/doctor_watch.go toavoid adding heavy TUI dependencies like bubbletea, ensuring high portability while still delivering a "sticky" live view.

System Pulse: Added a throughput header showing aggregate events/sec (syscalls, scheduling, disk, etc.) to confirm pipeline health.

Testing

  • go build ./... passes
  • go test ./... passes
  • go vet ./... passes
  • golangci-lint run ./... passes
  • Tested locally with:
  • N/A — pure docs/refactor
  • sudo ./bin/bpf-verify --read 5s confirms 6/6 programs still load
  • ./scripts/verify.sh passes (or specific phase: ./scripts/verify.sh quality)

Checklist

  • PR title follows Conventional Commits (feat(scope): subject)
  • All commits are DCO-signed (git commit -s)
  • No unrelated changes pulled in
  • Documentation updated where user-visible behavior changed
  • Added/updated tests for new code paths
  • If a new doctor rule, paired with a chaos scenario in scripts/verify.sh

Signed-off-by: Sagar <sagarkapri321@gmail.com>
@itssagarK itssagarK requested a review from btwshivam as a code owner May 13, 2026 07:07
@github-actions github-actions Bot added the area/doctor Diagnostic engine and rules label May 13, 2026
Copy link
Copy Markdown
Member

@btwshivam btwshivam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the patch doesn't compile. you used // ... unchanged comments as literal code and deleted real fields. rebase on main and re-apply only the new watch-mode additions, then make check.

Comment thread internal/cli/doctor.go
# Enable AI analysis
sudo kerno doctor --ai`,
Args: cobra.NoArgs,
// ... (Use, Short, Long, Example unchanged)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this // ... (Use, Short, Long, Example unchanged) placeholder is a literal source-level deletion. you actually removed Use, Short, Long, Example, and Args: cobra.NoArgs from the cobra command. same anti-pattern at three more places in this file:

  • line ~74: // ... (rest of flags unchanged) deleted four flag registrations (--output, --ai, --no-ai, --quiet), leaving the useAI, noAI, quiet vars declared-but-unused (Go compile error).
  • inside runDoctor: // ... (initial logic unchanged) deleted thresholds, analyzer, opts.duration defaulting, so the next lines reference undefined names.
  • end of runDoctor: // ... (rest of function unchanged) followed by a } closes the function early, and for { ... } below ends up at file scope.

that's why Lint and Test are red. rebase on main and re-apply only the genuinely new lines: the watch field in the var block, the BoolVarP for --watch, the opts.watch field, and the if opts.watch { ... } branch inside runDoctor. then make check locally before pushing.

// the "same" issue. It includes the rule, signal, metric, and process
// (if any), but excludes the severity and current value.
func (f *Finding) Fingerprint() string {
return fmt.Sprintf("%s|%s|%s|%s|%s", f.Rule, f.Signal, f.Metric, f.Process, f.Title)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

including Title in the fingerprint is fragile. doctor rule titles often embed the current metric value, like syscall p99 is 240ms vs syscall p99 is 250ms next cycle. that flips an Ongoing finding into Resolved+New every tick, defeating the whole point of --watch.

key on rule + signal + process. drop Title and Metric (Metric also varies for the same reason):

func (f *Finding) Fingerprint() string {
    return fmt.Sprintf("%s|%s|%s", f.Rule, f.Signal, f.Process)
}

renderWatchFrame(os.Stdout, style, cycle, start, diffed, signals, false)

// 6. Wait for next interval.
waitDuration := opts.interval - time.Since(cycleStart)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opts.interval - time.Since(cycleStart) goes negative whenever the collection window exceeds the interval. with the defaults (duration 30s, watch interval 10s), waitDuration is -20s, so the wait is skipped and the next cycle starts immediately. effective cadence becomes duration, not interval.

user's mental model for kerno doctor --watch --interval 10s is "show me a fresh view every 10s". simplest fix: when --watch is set, use a short collection window (say 2s) and let interval drive the wall-clock cadence. or clamp opts.duration = min(opts.duration, opts.interval / 2) at the top of runDoctorWatch.

build collectorBuildResult,
opts doctorOpts,
logger *slog.Logger,
) error {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue #28 explicitly requires --watch --output json to emit NDJSON (one JSON object per cycle) so the watch loop can pipe into Loki / Logstash. this function ignores opts.output and always renders the terminal UI, so kerno doctor --watch --output json | jq just prints ANSI gibberish.

split at the top:

if opts.output == "json" {
    return runDoctorWatchJSON(ctx, engine, build, opts, logger)
}
// existing terminal-UI loop below

the JSON loop is much simpler than the TTY loop. one json.NewEncoder(os.Stdout) and emit the report per cycle.

Comment thread internal/doctor/diff.go
// 3. If finding exists in prev but not in curr, it is StatusResolved.
// 4. StatusResolved findings are kept for keepResolvedDuration (e.g. 30s)
// before being dropped.
func Diff(prev []DiffedFinding, curr []Finding, keepResolvedDuration time.Duration) []DiffedFinding {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no tests for Diff(). it's the core of the feature and the most subtle piece of logic in the PR. add internal/doctor/diff_test.go with table-driven cases for:

  • new finding (not in prev)
  • ongoing finding (same fingerprint two cycles in a row, FirstSeen preserved)
  • resolved finding within keep window (kept in output with StatusResolved)
  • resolved finding past keep window (dropped from output)

these are roughly 40 lines of test code and they catch the Title-in-fingerprint issue immediately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/doctor Diagnostic engine and rules

Projects

None yet

Development

Successfully merging this pull request may close these issues.

UX: kerno doctor --watch continuous mode with diff-style output

2 participants