
feat(slm): detect crash-looping services and degrade node status (#1604) #1610

Merged
mrveiss merged 1 commit into Dev_new_gui from fix/1604-crash-loop-alerting
Mar 13, 2026

@mrveiss commented Mar 13, 2026

Summary

  • Add NRestarts collection and crash-loop detection to the HealthCollector agent
  • Add CRASH_LOOP enum value to the ServiceStatus model
  • Update reconciler._calculate_node_status() to degrade node status when autobot services are crash-looping or have more than 3 restarts
  • Capture journalctl error context for crash-looping services (same as for failed services)
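
The agent-side detection can be sketched roughly as follows. This is not the code from the PR: it assumes the collector reads `KEY=VALUE` lines such as those printed by `systemctl show <unit> --property=ActiveState,SubState,NRestarts`, and it treats a unit in the `auto-restart` sub-state with at least one restart as crash-looping. The helper names, the enum members other than CRASH_LOOP, and the enum string values are all hypothetical.

```python
from enum import Enum


class ServiceStatus(str, Enum):
    # Only CRASH_LOOP is named in the PR; the other members and all
    # string values are assumptions made for this sketch.
    RUNNING = "running"
    STOPPED = "stopped"
    FAILED = "failed"
    CRASH_LOOP = "crash_loop"


def parse_show_output(text: str) -> dict:
    """Parse KEY=VALUE lines into a dict, skipping lines without '='."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def classify(props: dict) -> tuple:
    """Return (status, n_restarts) for one unit (hypothetical rule)."""
    n_restarts = int(props.get("NRestarts") or 0)
    active = props.get("ActiveState", "")
    sub = props.get("SubState", "")
    if active == "failed":
        return ServiceStatus.FAILED, n_restarts
    # Assumption: a unit waiting to be restarted again counts as crash-looping.
    if sub == "auto-restart" and n_restarts > 0:
        return ServiceStatus.CRASH_LOOP, n_restarts
    if active == "active":
        return ServiceStatus.RUNNING, n_restarts
    return ServiceStatus.STOPPED, n_restarts


sample = "ActiveState=activating\nSubState=auto-restart\nNRestarts=7\n"
status, restarts = classify(parse_show_output(sample))
```

With the sample input, `status` is `ServiceStatus.CRASH_LOOP` and `restarts` is 7. Keeping parsing separate from classification lets the rule be tested without calling systemctl.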

Changed Files

  • slm/agent/health_collector.py — NRestarts property, crash-loop status mapping, error context for crash-loop
  • models/database.py — CRASH_LOOP enum
  • services/reconciler.py — crash-loop aware status calculation
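
For orientation, the crash-loop aware status calculation in the reconciler might look something like the sketch below. It is a guess at the shape of the logic, not the merged implementation: the `NodeStatus` enum, the `"autobot"` name prefix check, the `"crash_loop"` status string, and the dict layout of a service record are assumptions. Only the "crash-looping or more than 3 restarts" rule comes from the PR description.

```python
from enum import Enum
from typing import Iterable, Mapping

RESTART_THRESHOLD = 3  # from the PR: degrade when a service has >3 restarts
CRASH_LOOP = "crash_loop"  # assumed serialized value of ServiceStatus.CRASH_LOOP


class NodeStatus(str, Enum):
    # Hypothetical enum; the real model may have more states.
    ONLINE = "online"
    DEGRADED = "degraded"


def calculate_node_status(services: Iterable[Mapping]) -> NodeStatus:
    """Degrade the node if any autobot service looks unstable."""
    for svc in services:
        # Assumption: autobot services are identified by a name prefix.
        if not str(svc.get("name", "")).startswith("autobot"):
            continue
        crash_looping = svc.get("status") == CRASH_LOOP
        too_many_restarts = int(svc.get("n_restarts") or 0) > RESTART_THRESHOLD
        if crash_looping or too_many_restarts:
            return NodeStatus.DEGRADED
    return NodeStatus.ONLINE


node_status = calculate_node_status(
    [
        {"name": "autobot-backend", "status": "running", "n_restarts": 5},
        {"name": "cron", "status": "running", "n_restarts": 0},
    ]
)
```

Here `node_status` is `NodeStatus.DEGRADED` because the first service exceeds the restart threshold even though it currently reports as running.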

Test Plan

  • Deploy to .19 SLM server
  • Simulate a crash-looping service (systemctl restart in rapid succession)
  • Verify heartbeat reports crash-loop status with n_restarts count
  • Verify node status degrades to DEGRADED in SLM dashboard
  • Verify error_message field populated for crash-looping services
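
The heartbeat checks in the test plan could be automated with a small helper like the one below. It is only a sketch: the `n_restarts` and `error_message` field names come from the test plan, while the `status` key, the `"crash_loop"` value, and the per-service dict shape are assumptions about the heartbeat payload.

```python
EXPECTED_STATUS = "crash_loop"  # assumed serialized crash-loop status


def check_crash_loop_report(service: dict) -> list:
    """Return a list of problems found in one service entry of a heartbeat."""
    problems = []
    if service.get("status") != EXPECTED_STATUS:
        problems.append("status is not crash-loop")
    n_restarts = service.get("n_restarts")
    if not isinstance(n_restarts, int) or n_restarts <= 0:
        problems.append("n_restarts missing or not positive")
    if not service.get("error_message"):
        problems.append("error_message is empty")
    return problems


# Hypothetical heartbeat entry for a service that keeps restarting.
entry = {
    "name": "autobot-backend",
    "status": "crash_loop",
    "n_restarts": 6,
    "error_message": "Main process exited, code=exited, status=1/FAILURE",
}
problems = check_crash_loop_report(entry)
```

An empty `problems` list means the entry satisfies the three heartbeat-related checks; the dashboard check for DEGRADED still has to be done separately.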

Closes #1604

- Add NRestarts collection and crash-loop status mapping in health_collector
- Add CRASH_LOOP enum value to ServiceStatus model
- Update reconciler to degrade node status on crash-loop or high restart count
- Capture error context (journalctl) for crash-looping services
@github-actions

✅ SSOT Configuration Compliance: Passing

🎉 No hardcoded values detected that have SSOT config equivalents!

@mrveiss merged commit 5e8593f into Dev_new_gui on Mar 13, 2026
3 checks passed
@mrveiss deleted the fix/1604-crash-loop-alerting branch on March 13, 2026 16:25