Monitoring, Health Checks & Service Integration
Overview
Implement comprehensive monitoring, health checking, and service integration capabilities. This includes Prometheus metrics, health endpoints, service reload management, structured logging with audit trails, and network connectivity monitoring with degraded mode support.
Requirements
Module Structure
Implement the following modules:
internal/
├── monitoring/
│ ├── metrics.go # Prometheus metrics
│ ├── health.go # Health reporting
│ └── logging.go # Structured logging enhancements
├── services/
│ ├── reload.go # Service reload management
│ └── registry.go # Service discovery
pkg/
├── network/
│ └── connectivity.go # Network health checking
Prometheus Metrics
Implement metrics in internal/monitoring/metrics.go:
type MetricsCollector struct {
// Vec metrics are held by pointer, matching the return types of
// prometheus.NewGaugeVec and prometheus.NewCounterVec.
// Certificate metrics
certificateExpiry *prometheus.GaugeVec
certificateRenewalAttempts *prometheus.CounterVec
certificateAge prometheus.Gauge
lastRenewalTimestamp prometheus.Gauge
// Authentication metrics
jwtRefreshTotal *prometheus.CounterVec
keycloakAuthFailures *prometheus.CounterVec
jwtCacheHits prometheus.Counter
// System health metrics
up prometheus.Gauge
connectivityStatus prometheus.Gauge
serviceReloadTotal *prometheus.CounterVec
// Operational metrics
bootstrapStatus prometheus.Gauge
degradedMode prometheus.Gauge
backupFilesCount prometheus.Gauge
}
func NewMetricsCollector() *MetricsCollector {
return &MetricsCollector{
certificateExpiry: prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "hetzner_cert_rotation_certificate_expiry_seconds",
Help: "Time until certificate expiry in seconds",
},
[]string{"type"}, // bootstrap, current
),
certificateRenewalAttempts: prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "hetzner_cert_rotation_certificate_renewal_attempts_total",
Help: "Total number of certificate renewal attempts",
},
[]string{"result"}, // success, failure
),
// Initialize other metrics...
}
}
// Core functions to implement:
func (m *MetricsCollector) RegisterMetrics()
func (m *MetricsCollector) UpdateCertificateMetrics(cert *x509.Certificate)
func (m *MetricsCollector) RecordRenewalAttempt(success bool, duration time.Duration)
func (m *MetricsCollector) RecordJWTRefresh(success bool)
func (m *MetricsCollector) RecordServiceReload(service string, success bool)
func (m *MetricsCollector) SetDegradedMode(degraded bool)
Health Check System
Implement health checks in internal/monitoring/health.go:
type HealthChecker struct {
storage *storage.Manager
jwtManager *auth.JWTManager
connectivity *network.ConnectivityChecker
metrics *MetricsCollector
port int
server *http.Server
logger *slog.Logger
}
type HealthStatus struct {
Status string `json:"status"` // healthy, degraded, unhealthy
Checks map[string]CheckResult `json:"checks"`
Timestamp time.Time `json:"timestamp"`
}
type CheckResult struct {
Status string `json:"status"`
Message string `json:"message,omitempty"`
Details interface{} `json:"details,omitempty"`
}
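The issue does not say how the per-check results roll up into the top-level HealthStatus.Status. One possible rule, offered as a sketch only: treat any "unhealthy" or "critical" check as unhealthy, any other non-healthy value as degraded, and everything else as healthy. The overallStatus name and the rule itself are assumptions, and CheckResult is trimmed to the fields the example needs.

```go
package main

import "fmt"

// CheckResult mirrors the struct from the issue (Details omitted for brevity).
type CheckResult struct {
	Status  string
	Message string
}

// overallStatus folds per-check results into one top-level status.
// Assumed rule (not from the issue): "unhealthy" or "critical" wins,
// any other non-"healthy" value yields "degraded".
func overallStatus(checks map[string]CheckResult) string {
	status := "healthy"
	for _, c := range checks {
		switch c.Status {
		case "healthy":
			// nothing to downgrade
		case "unhealthy", "critical":
			return "unhealthy"
		default:
			status = "degraded"
		}
	}
	return status
}

func main() {
	checks := map[string]CheckResult{
		"certificate":  {Status: "warning", Message: "Certificate expires within 72 hours"},
		"connectivity": {Status: "healthy"},
	}
	fmt.Println(overallStatus(checks)) // degraded
}
```

Because the result does not depend on map iteration order, the function stays deterministic even though Go randomizes that order.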
// Health check endpoints:
// GET /health - Overall health status
// GET /health/certificate - Certificate validity status
// GET /health/connectivity - Network connectivity status
// GET /health/services - Dependent service status
func (h *HealthChecker) Start() error {
mux := http.NewServeMux()
mux.HandleFunc("/health", h.handleHealth)
mux.HandleFunc("/health/certificate", h.handleCertificateHealth)
mux.HandleFunc("/health/connectivity", h.handleConnectivityHealth)
mux.HandleFunc("/health/services", h.handleServicesHealth)
mux.HandleFunc("/metrics", h.handleMetrics)
h.server = &http.Server{
Addr: fmt.Sprintf(":%d", h.port),
Handler: mux,
}
return h.server.ListenAndServe()
}
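The certificate handler below calls calculateLifetimePercentage, which this issue leaves undefined. A minimal sketch follows, assuming the value means "share of the validity window already elapsed" (the issue does not say whether it is elapsed or remaining). It takes plain timestamps instead of an *x509.Certificate so it can run on its own; the lifetimePercentage and sampleWindow names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// lifetimePercentage returns how much of the validity window
// [notBefore, notAfter] has elapsed at time now, clamped to 0..100.
func lifetimePercentage(notBefore, notAfter, now time.Time) float64 {
	total := notAfter.Sub(notBefore)
	if total <= 0 {
		return 100 // degenerate window: treat as fully consumed
	}
	pct := float64(now.Sub(notBefore)) / float64(total) * 100
	switch {
	case pct < 0:
		return 0
	case pct > 100:
		return 100
	}
	return pct
}

// sampleWindow returns a fixed 90-day validity window for the demo.
func sampleWindow() (notBefore, notAfter time.Time) {
	notBefore = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	return notBefore, notBefore.Add(90 * 24 * time.Hour)
}

func main() {
	nb, na := sampleWindow()
	now := nb.Add(45 * 24 * time.Hour)
	fmt.Printf("%.1f%% of lifetime used\n", lifetimePercentage(nb, na, now)) // 50.0% of lifetime used
}
```

With a parsed certificate, the call would pass parsed.NotBefore, parsed.NotAfter and time.Now().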
func (h *HealthChecker) handleCertificateHealth(w http.ResponseWriter, r *http.Request) {
cert, err := h.storage.ReadCurrentCertificate()
if err != nil {
h.writeError(w, "Cannot read certificate", err)
return
}
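parsed, err := x509.ParseCertificate(cert.Certificate)
if err != nil {
// Without this check a malformed certificate would cause a nil dereference below.
h.writeError(w, "Cannot parse certificate", err)
return
}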
remaining := time.Until(parsed.NotAfter)
status := CheckResult{
Status: "healthy",
Details: map[string]interface{}{
"serial": cert.SerialNumber,
"expires_at": parsed.NotAfter,
"remaining": remaining.String(),
"percentage": calculateLifetimePercentage(parsed),
},
}
if remaining < 24*time.Hour {
status.Status = "critical"
status.Message = "Certificate expires within 24 hours"
} else if remaining < 72*time.Hour {
status.Status = "warning"
status.Message = "Certificate expires within 72 hours"
}
h.writeJSON(w, status)
}
Service Reload Management
Implement service reloading in internal/services/reload.go:
type ReloadManager struct {
services map[string]ServiceConfig
executor CommandExecutor
logger *slog.Logger
metrics *MetricsCollector
}
type ServiceConfig struct {
Name string
ReloadCommand string
HealthCheck string
Timeout time.Duration
Critical bool // If true, failure blocks rotation
}
type ReloadResult struct {
Service string
Success bool
Duration time.Duration
Error error
}
func (r *ReloadManager) ReloadAll() error {
results := make([]ReloadResult, 0, len(r.services))
hasFailure := false
hasCriticalFailure := false
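// Go randomizes map iteration order, so sort the names to get a deterministic
// reload order (requires the "sort" package). If a specific configured order
// matters, keep it in a slice alongside the map.
names := make([]string, 0, len(r.services))
for name := range r.services {
names = append(names, name)
}
sort.Strings(names)
for _, name := range names {
config := r.services[name]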
result := r.reloadService(name, config)
results = append(results, result)
if !result.Success {
hasFailure = true
if config.Critical {
hasCriticalFailure = true
}
}
r.metrics.RecordServiceReload(name, result.Success)
}
if hasCriticalFailure {
// Attempt rollback
r.logger.Error("Critical service reload failed, attempting rollback")
return fmt.Errorf("critical service reload failed")
}
if hasFailure {
r.logger.Warn("Some non-critical services failed to reload")
}
return nil
}
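reloadService below relies on a waitForHealth helper that the issue does not define. One way it might work is to poll the health check at a fixed interval until it passes or the context expires. In this sketch the check is an injected function standing in for the configured HealthCheck command; the interval parameter, flakyCheck and demo are illustrative assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForHealth calls check immediately and then once per interval until it
// succeeds or ctx ends. check stands in for running the HealthCheck command.
func waitForHealth(ctx context.Context, interval time.Duration, check func(context.Context) error) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var lastErr error
	for {
		if lastErr = check(ctx); lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("service not healthy before deadline: %w (last error: %v)", ctx.Err(), lastErr)
		case <-ticker.C:
		}
	}
}

// flakyCheck returns a check that fails the first n calls, then succeeds.
func flakyCheck(n int) func(context.Context) error {
	calls := 0
	return func(context.Context) error {
		calls++
		if calls <= n {
			return errors.New("not ready")
		}
		return nil
	}
}

// demo polls a check that fails `failures` times, giving up after timeoutMillis.
func demo(failures, timeoutMillis int) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMillis)*time.Millisecond)
	defer cancel()
	return waitForHealth(ctx, 10*time.Millisecond, flakyCheck(failures))
}

func main() {
	fmt.Println(demo(2, 1000))        // <nil>
	fmt.Println(demo(1000, 50) != nil) // true
}
```

Because reloadService shares one context between the reload command and the health wait, the configured timeout covers both steps together.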
func (r *ReloadManager) reloadService(name string, config ServiceConfig) ReloadResult {
start := time.Now()
// Execute reload command
ctx, cancel := context.WithTimeout(context.Background(), config.Timeout)
defer cancel()
if err := r.executor.Execute(ctx, config.ReloadCommand); err != nil {
return ReloadResult{
Service: name,
Success: false,
Duration: time.Since(start),
Error: err,
}
}
// Wait for service to be healthy
if config.HealthCheck != "" {
if err := r.waitForHealth(ctx, config.HealthCheck); err != nil {
return ReloadResult{
Service: name,
Success: false,
Duration: time.Since(start),
Error: fmt.Errorf("health check failed: %w", err),
}
}
}
return ReloadResult{
Service: name,
Success: true,
Duration: time.Since(start),
}
}
Network Connectivity Monitoring
Implement connectivity checking in pkg/network/connectivity.go:
type ConnectivityChecker struct {
endpoints []EndpointConfig
checkInterval time.Duration
timeout time.Duration
degradedThreshold time.Duration
currentStatus ConnectivityStatus
lastSuccessful time.Time
mutex sync.RWMutex
logger *slog.Logger
}
type EndpointConfig struct {
Name string
URL string
Critical bool
}
type ConnectivityStatus int
const (
StatusHealthy ConnectivityStatus = iota
StatusDegraded
StatusOffline
)
func (c *ConnectivityChecker) Start() {
ticker := time.NewTicker(c.checkInterval)
go func() {
for range ticker.C {
c.performHealthCheck()
}
}()
}
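The Start method above never stops its ticker, so the goroutine runs until the process exits. A possible variant, sketched here with a context for cancellation, is shown below. runPeriodically and countTicks are hypothetical names, and this is one option rather than the required design.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runPeriodically calls fn on every tick until ctx is cancelled, then
// stops the ticker so its resources are released.
func runPeriodically(ctx context.Context, interval time.Duration, fn func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// select picks randomly when both channels are ready, so
			// re-check cancellation before doing any work.
			if ctx.Err() != nil {
				return
			}
			fn()
		}
	}
}

// countTicks runs the loop until fn has been called n times, then cancels.
func countTicks(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	calls := 0
	runPeriodically(ctx, 5*time.Millisecond, func() {
		calls++
		if calls == n {
			cancel()
		}
	})
	return calls
}

func main() {
	fmt.Println(countTicks(3)) // 3
}
```

In the checker, Start could accept a context and launch `go runPeriodically(ctx, c.checkInterval, c.performHealthCheck)`.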
func (c *ConnectivityChecker) performHealthCheck() {
allHealthy := true
criticalHealthy := true
for _, endpoint := range c.endpoints {
healthy := c.checkEndpoint(endpoint)
if !healthy {
allHealthy = false
if endpoint.Critical {
criticalHealthy = false
}
}
}
c.updateStatus(allHealthy, criticalHealthy)
}
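performHealthCheck above depends on checkEndpoint, which is not shown in this issue. A rough sketch of what it might do: issue a GET with a timeout and treat any 2xx response as healthy. The 2xx rule, the checkURL name and the probe helper (which uses a throwaway local test server) are assumptions for illustration.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// checkURL reports whether a GET to url returns a 2xx status within timeout.
func checkURL(client *http.Client, url string, timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false
	}
	resp, err := client.Do(req)
	if err != nil {
		return false // DNS failure, refused connection, timeout, ...
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 300
}

// probe starts a local test server that always answers with code and checks it.
func probe(code int) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(code)
	}))
	defer srv.Close()
	return checkURL(srv.Client(), srv.URL, 2*time.Second)
}

func main() {
	fmt.Println(probe(http.StatusOK), probe(http.StatusServiceUnavailable)) // true false
}
```

In the checker itself, the timeout would come from the ConnectivityChecker.timeout field and the URL from EndpointConfig.URL.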
func (c *ConnectivityChecker) updateStatus(allHealthy, criticalHealthy bool) {
c.mutex.Lock()
defer c.mutex.Unlock()
oldStatus := c.currentStatus
if allHealthy {
c.currentStatus = StatusHealthy
c.lastSuccessful = time.Now()
} else if criticalHealthy {
// Check if we should enter degraded mode
if time.Since(c.lastSuccessful) > c.degradedThreshold {
c.currentStatus = StatusDegraded
}
} else {
c.currentStatus = StatusOffline
}
if oldStatus != c.currentStatus {
c.logger.Info("Connectivity status changed",
slog.String("old_status", c.statusString(oldStatus)),
slog.String("new_status", c.statusString(c.currentStatus)))
}
}
Audit Logging
Enhance logging in internal/monitoring/logging.go:
type AuditLogger struct {
logger *slog.Logger
filepath string
}
type AuditEvent struct {
Timestamp time.Time `json:"timestamp"`
EventType string `json:"event_type"`
Actor Actor `json:"actor"`
Resource Resource `json:"resource"`
Outcome string `json:"outcome"`
Details map[string]interface{} `json:"details,omitempty"`
}
type Actor struct {
Identity string `json:"identity"`
IPAddress string `json:"ip_address,omitempty"`
AuthMethod string `json:"auth_method"`
}
type Resource struct {
Type string `json:"type"`
Identifier string `json:"identifier"`
}
func (a *AuditLogger) LogCertificateRenewal(serial string, success bool, details map[string]interface{}) {
event := AuditEvent{
Timestamp: time.Now().UTC(),
EventType: "certificate_renewed",
Actor: Actor{
Identity: a.getMachineIdentity(),
IPAddress: a.getLocalIP(),
AuthMethod: "jwt",
},
Resource: Resource{
Type: "certificate",
Identifier: serial,
},
Outcome: a.outcomeString(success),
Details: details,
}
a.writeAuditEvent(event)
}
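Both logging methods end in writeAuditEvent, which the issue does not spell out. A simple approach, assuming one JSON object per line is an acceptable on-disk format, is to encode each event with encoding/json. The writeEvent, sampleEvent and encodeLine names and the sample values are hypothetical; the struct definitions are copied from the issue.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"time"
)

type Actor struct {
	Identity   string `json:"identity"`
	IPAddress  string `json:"ip_address,omitempty"`
	AuthMethod string `json:"auth_method"`
}

type Resource struct {
	Type       string `json:"type"`
	Identifier string `json:"identifier"`
}

type AuditEvent struct {
	Timestamp time.Time              `json:"timestamp"`
	EventType string                 `json:"event_type"`
	Actor     Actor                  `json:"actor"`
	Resource  Resource               `json:"resource"`
	Outcome   string                 `json:"outcome"`
	Details   map[string]interface{} `json:"details,omitempty"`
}

// writeEvent appends e to w as a single JSON line.
func writeEvent(w io.Writer, e AuditEvent) error {
	return json.NewEncoder(w).Encode(e) // Encode adds the trailing newline
}

// sampleEvent builds a fixed event so the output is reproducible.
func sampleEvent() AuditEvent {
	return AuditEvent{
		Timestamp: time.Date(2024, 1, 2, 3, 4, 5, 0, time.UTC),
		EventType: "certificate_renewed",
		Actor:     Actor{Identity: "host-1", AuthMethod: "jwt"},
		Resource:  Resource{Type: "certificate", Identifier: "01AB"},
		Outcome:   "success",
	}
}

// encodeLine renders an event into a string using writeEvent.
func encodeLine(e AuditEvent) (string, error) {
	var buf bytes.Buffer
	err := writeEvent(&buf, e)
	return buf.String(), err
}

func main() {
	line, err := encodeLine(sampleEvent())
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Print(line)
}
```

In the real logger, w would likely be the file at the configured audit_log_path, opened in append mode, with writes serialized if several goroutines log at once.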
func (a *AuditLogger) LogServiceReload(service string, success bool, duration time.Duration) {
event := AuditEvent{
Timestamp: time.Now().UTC(),
EventType: "service_reloaded",
Actor: Actor{
Identity: a.getMachineIdentity(),
AuthMethod: "system",
},
Resource: Resource{
Type: "service",
Identifier: service,
},
Outcome: a.outcomeString(success),
Details: map[string]interface{}{
"duration_ms": duration.Milliseconds(),
},
}
a.writeAuditEvent(event)
}
Degraded Mode Management
type DegradedModeManager struct {
connectivity *ConnectivityChecker
jwtManager *auth.JWTManager
metrics *MetricsCollector
inDegraded bool
enteredAt *time.Time
mutex sync.RWMutex
logger *slog.Logger
}
func (d *DegradedModeManager) CheckAndUpdateMode() {
d.mutex.Lock()
defer d.mutex.Unlock()
status := d.connectivity.GetStatus()
if status == StatusDegraded || status == StatusOffline {
if !d.inDegraded {
d.enterDegradedMode()
}
} else {
if d.inDegraded {
d.exitDegradedMode()
}
}
}
func (d *DegradedModeManager) enterDegradedMode() {
d.inDegraded = true
now := time.Now()
d.enteredAt = &now
d.logger.Warn("Entering degraded mode due to connectivity issues",
slog.Time("entered_at", now))
d.metrics.SetDegradedMode(true)
// Notify other components
d.jwtManager.EnableDegradedMode()
}
func (d *DegradedModeManager) exitDegradedMode() {
duration := time.Since(*d.enteredAt)
d.logger.Info("Exiting degraded mode - connectivity restored",
slog.Duration("duration", duration))
d.inDegraded = false
d.enteredAt = nil
d.metrics.SetDegradedMode(false)
// Trigger immediate renewal check
// Refresh JWT token
d.jwtManager.DisableDegradedMode()
}
Configuration
Required configuration:
monitoring:
metrics_port: 9091
health_port: 8081
audit_log_path: "/etc/certs/logs/rotation.log"
network:
health_check_interval: "30s"
connectivity_timeout: "10s"
degraded_mode_threshold: "5m"
endpoints:
- name: "certificate-api"
url: "https://certificate-api.internal/health"
critical: true
- name: "keycloak"
url: "https://keycloak.internal/health"
critical: true
services:
reload_commands:
nginx:
command: "systemctl reload nginx"
health_check: "curl -f http://localhost/health"
timeout: "30s"
critical: true
haproxy:
command: "systemctl reload haproxy"
health_check: "curl -f http://localhost:8080/health"
timeout: "30s"
critical: false
Acceptance Criteria
- Metrics Collection
  - Exports all specified Prometheus metrics
  - Updates metrics in real-time
  - Accessible via /metrics endpoint
  - Compatible with Prometheus scraping
- Health Checks
  - Provides accurate health status for all components
  - Returns appropriate HTTP status codes
  - Includes detailed health information
  - Responds within 5 seconds
- Service Reloading
  - Successfully reloads all configured services
  - Handles reload failures gracefully
  - Validates service health post-reload
  - Respects timeout configurations
- Audit Logging
  - Logs all certificate operations
  - Uses specified JSON format
  - Includes all required fields
  - Rotates logs appropriately
- Degraded Mode
  - Enters degraded mode after threshold
  - Continues operating with cached resources
  - Recovers automatically when connectivity restored
  - Alerts operations team appropriately
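The "appropriate HTTP status codes" criterion is not pinned down in this issue. One common convention, shown here purely as an assumption, is 200 for healthy and degraded (the service still works) and 503 for anything else. httpCodeFor, writeHealth and respond are hypothetical names, and HealthStatus is trimmed to its Status field.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// HealthStatus is trimmed to the one field this sketch needs.
type HealthStatus struct {
	Status string `json:"status"`
}

// httpCodeFor maps an overall status to an HTTP status code.
// Assumed mapping: degraded still answers 200, anything unknown is 503.
func httpCodeFor(status string) int {
	switch status {
	case "healthy", "degraded":
		return http.StatusOK
	default:
		return http.StatusServiceUnavailable
	}
}

// writeHealth sends the status as JSON with the matching HTTP code.
func writeHealth(w http.ResponseWriter, hs HealthStatus) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(httpCodeFor(hs.Status))
	if err := json.NewEncoder(w).Encode(hs); err != nil {
		// Headers are already sent, so the error can only be reported.
		fmt.Println("encode failed:", err)
	}
}

// respond renders a status through writeHealth and returns code and body.
func respond(status string) (int, string) {
	rec := httptest.NewRecorder()
	writeHealth(rec, HealthStatus{Status: status})
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := respond("unhealthy")
	fmt.Printf("%d %s", code, body) // 503 {"status":"unhealthy"}
}
```

Load balancers and Prometheus blackbox probes typically key off the status code alone, which is why the mapping is kept separate from the JSON body.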
Testing Requirements
- Unit tests for all monitoring components
- Test metric calculation and updates
- Test health check endpoints
- Test service reload with mock commands
- Test audit log formatting
- Test degraded mode transitions
- Integration tests for connectivity checking
- Load test health endpoints
- Test log rotation
- Simulate network failures
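For the "mock commands" item, a small fake executor may be enough. The CommandExecutor interface below is inferred from the r.executor.Execute(ctx, config.ReloadCommand) call in ReloadManager; fakeExecutor, reloadWith and demo are hypothetical test helpers, not part of the specified design.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// CommandExecutor is inferred from how the issue calls r.executor.Execute.
type CommandExecutor interface {
	Execute(ctx context.Context, command string) error
}

// fakeExecutor records every command and fails the ones listed in failures.
type fakeExecutor struct {
	calls    []string
	failures map[string]error
}

func (f *fakeExecutor) Execute(_ context.Context, command string) error {
	f.calls = append(f.calls, command)
	return f.failures[command] // nil when the command is not listed
}

// reloadWith runs each command through exec and counts the failures.
func reloadWith(exec CommandExecutor, commands []string) (failed int) {
	for _, c := range commands {
		if err := exec.Execute(context.Background(), c); err != nil {
			failed++
		}
	}
	return failed
}

// demo reloads nginx (succeeds) and haproxy (scripted to fail) via the fake.
func demo() (calls, failed int) {
	fake := &fakeExecutor{failures: map[string]error{
		"systemctl reload haproxy": errors.New("exit status 1"),
	}}
	failed = reloadWith(fake, []string{"systemctl reload nginx", "systemctl reload haproxy"})
	return len(fake.calls), failed
}

func main() {
	calls, failed := demo()
	fmt.Println(calls, failed) // 2 1
}
```

A unit test for ReloadAll could inject the same fake and assert that a failure of a Critical service returns an error while a non-critical failure does not.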
Dependencies
- Prometheus client library (github.com/prometheus/client_golang)
- Standard net/http package for health endpoints
- Standard os/exec for service commands
- Context package for timeouts
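As a sketch of how os/exec and context might combine behind the executor, the following runs a command string through `sh -c` and lets the context enforce the timeout. It assumes a POSIX shell is available; shellExecutor and run are hypothetical names, and the real implementation may prefer to avoid a shell entirely.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// shellExecutor runs commands through "sh -c" (assumes a POSIX shell).
type shellExecutor struct{}

// Execute runs command and returns an error on non-zero exit or timeout.
func (shellExecutor) Execute(ctx context.Context, command string) error {
	// CommandContext kills the started process once ctx is done.
	// Output is discarded here; Run reports a non-zero exit as an error.
	if err := exec.CommandContext(ctx, "sh", "-c", command).Run(); err != nil {
		if ctx.Err() != nil {
			return fmt.Errorf("%q timed out: %w", command, ctx.Err())
		}
		return fmt.Errorf("%q failed: %w", command, err)
	}
	return nil
}

// run executes command with a timeout given in milliseconds.
func run(command string, timeoutMillis int) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMillis)*time.Millisecond)
	defer cancel()
	return shellExecutor{}.Execute(ctx, command)
}

func main() {
	fmt.Println(run("exit 0", 2000))  // succeeds
	fmt.Println(run("exit 3", 2000))  // fails with a non-zero exit status
	fmt.Println(run("sleep 5", 100)) // cut short by the 100 ms timeout
}
```

Passing configuration strings to a shell means the config file must be trusted as much as the binary itself; splitting the command into an argument list and skipping the shell would narrow that exposure.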