Monitoring, Health Checks & Service Integration #209

@jmgilman

Overview

Implement comprehensive monitoring, health checking, and service integration capabilities. This includes Prometheus metrics, health endpoints, service reload management, structured logging with audit trails, and network connectivity monitoring with degraded mode support.

Requirements

Module Structure

Implement the following modules:

internal/
├── monitoring/
│   ├── metrics.go            # Prometheus metrics
│   ├── health.go             # Health reporting
│   └── logging.go            # Structured logging enhancements
├── services/
│   ├── reload.go             # Service reload management
│   └── registry.go           # Service discovery
pkg/
├── network/
│   └── connectivity.go       # Network health checking

Prometheus Metrics

Implement metrics in internal/monitoring/metrics.go:

type MetricsCollector struct {
    // Certificate metrics
    // prometheus.NewGaugeVec and prometheus.NewCounterVec return pointers,
    // so the Vec fields are declared as pointer types.
    certificateExpiry          *prometheus.GaugeVec
    certificateRenewalAttempts *prometheus.CounterVec
    certificateAge             prometheus.Gauge
    lastRenewalTimestamp       prometheus.Gauge
    
    // Authentication metrics
    jwtRefreshTotal           *prometheus.CounterVec
    keycloakAuthFailures      *prometheus.CounterVec
    jwtCacheHits              prometheus.Counter
    
    // System health metrics
    up                        prometheus.Gauge
    connectivityStatus        prometheus.Gauge
    serviceReloadTotal        *prometheus.CounterVec
    
    // Operational metrics
    bootstrapStatus           prometheus.Gauge
    degradedMode              prometheus.Gauge
    backupFilesCount          prometheus.Gauge
}

func NewMetricsCollector() *MetricsCollector {
    return &MetricsCollector{
        certificateExpiry: prometheus.NewGaugeVec(
            prometheus.GaugeOpts{
                Name: "hetzner_cert_rotation_certificate_expiry_seconds",
                Help: "Time until certificate expiry in seconds",
            },
            []string{"type"}, // bootstrap, current
        ),
        certificateRenewalAttempts: prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "hetzner_cert_rotation_certificate_renewal_attempts_total",
                Help: "Total number of certificate renewal attempts",
            },
            []string{"result"}, // success, failure
        ),
        // Initialize other metrics...
    }
}

// Core functions to implement:
func (m *MetricsCollector) RegisterMetrics()
func (m *MetricsCollector) UpdateCertificateMetrics(cert *x509.Certificate)
func (m *MetricsCollector) RecordRenewalAttempt(success bool, duration time.Duration)
func (m *MetricsCollector) RecordJWTRefresh(success bool)
func (m *MetricsCollector) RecordServiceReload(service string, success bool)
func (m *MetricsCollector) SetDegradedMode(degraded bool)

Health Check System

Implement health checks in internal/monitoring/health.go:

type HealthChecker struct {
    storage        *storage.Manager
    jwtManager     *auth.JWTManager
    connectivity   *network.ConnectivityChecker
    metrics        *MetricsCollector
    port           int
    server         *http.Server
    logger         *slog.Logger
}

type HealthStatus struct {
    Status     string                 `json:"status"` // healthy, degraded, unhealthy
    Checks     map[string]CheckResult `json:"checks"`
    Timestamp  time.Time              `json:"timestamp"`
}

type CheckResult struct {
    Status  string        `json:"status"`
    Message string        `json:"message,omitempty"`
    Details interface{}   `json:"details,omitempty"`
}

// Health check endpoints:
// GET /health - Overall health status
// GET /health/certificate - Certificate validity status
// GET /health/connectivity - Network connectivity status
// GET /health/services - Dependent service status

func (h *HealthChecker) Start() error {
    mux := http.NewServeMux()
    mux.HandleFunc("/health", h.handleHealth)
    mux.HandleFunc("/health/certificate", h.handleCertificateHealth)
    mux.HandleFunc("/health/connectivity", h.handleConnectivityHealth)
    mux.HandleFunc("/health/services", h.handleServicesHealth)
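    // Note: /metrics is served on the health port here, while the
    // configuration below also lists a separate metrics_port.
    mux.HandleFunc("/metrics", h.handleMetrics)
    
    h.server = &http.Server{
        Addr:              fmt.Sprintf(":%d", h.port),
        Handler:           mux,
        ReadHeaderTimeout: 5 * time.Second, // bound slow clients
    }
    
    // ListenAndServe blocks until the server stops, so callers should run
    // Start in its own goroutine. http.ErrServerClosed signals a graceful
    // shutdown rather than a failure.
    if err := h.server.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
        return err
    }
    return nil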
}

func (h *HealthChecker) handleCertificateHealth(w http.ResponseWriter, r *http.Request) {
    cert, err := h.storage.ReadCurrentCertificate()
    if err != nil {
        h.writeError(w, "Cannot read certificate", err)
        return
    }
    
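    // Do not discard the parse error: a nil certificate would panic below.
    parsed, err := x509.ParseCertificate(cert.Certificate)
    if err != nil {
        h.writeError(w, "Cannot parse certificate", err)
        return
    }
    remaining := time.Until(parsed.NotAfter)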
    
    status := CheckResult{
        Status: "healthy",
        Details: map[string]interface{}{
            "serial":       cert.SerialNumber,
            "expires_at":   parsed.NotAfter,
            "remaining":    remaining.String(),
            "percentage":   calculateLifetimePercentage(parsed),
        },
    }
    
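    if remaining <= 0 {
        // An expired certificate gets its own message instead of
        // "expires within 24 hours".
        status.Status = "critical"
        status.Message = "Certificate has expired"
    } else if remaining < 24*time.Hour {
        status.Status = "critical"
        status.Message = "Certificate expires within 24 hours"
    } else if remaining < 72*time.Hour {
        status.Status = "warning"
        status.Message = "Certificate expires within 72 hours"
    }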
    
    h.writeJSON(w, status)
}

Service Reload Management

Implement service reloading in internal/services/reload.go:

type ReloadManager struct {
    services map[string]ServiceConfig
    executor CommandExecutor
    logger   *slog.Logger
    metrics  *monitoring.MetricsCollector // defined in internal/monitoring
}

type ServiceConfig struct {
    Name          string
    ReloadCommand string
    HealthCheck   string
    Timeout       time.Duration
    Critical      bool  // If true, failure blocks rotation
}

type ReloadResult struct {
    Service   string
    Success   bool
    Duration  time.Duration
    Error     error
}

func (r *ReloadManager) ReloadAll() error {
    results := make([]ReloadResult, 0, len(r.services))
    hasFailure := false
    hasCriticalFailure := false
    
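    // Go randomizes map iteration order, so sort the service names to get
    // a deterministic order. A truly "configured" order would need the
    // services to be stored in a slice instead of a map.
    names := make([]string, 0, len(r.services))
    for name := range r.services {
        names = append(names, name)
    }
    sort.Strings(names)
    
    for _, name := range names {
        config := r.services[name]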
        result := r.reloadService(name, config)
        results = append(results, result)
        
        if !result.Success {
            hasFailure = true
            if config.Critical {
                hasCriticalFailure = true
            }
        }
        
        r.metrics.RecordServiceReload(name, result.Success)
    }
    
    if hasCriticalFailure {
        // Attempt rollback
        r.logger.Error("Critical service reload failed, attempting rollback")
        return fmt.Errorf("critical service reload failed")
    }
    
    if hasFailure {
        r.logger.Warn("Some non-critical services failed to reload")
    }
    
    return nil
}

func (r *ReloadManager) reloadService(name string, config ServiceConfig) ReloadResult {
    start := time.Now()
    
    // Execute reload command
    ctx, cancel := context.WithTimeout(context.Background(), config.Timeout)
    defer cancel()
    
    if err := r.executor.Execute(ctx, config.ReloadCommand); err != nil {
        return ReloadResult{
            Service:  name,
            Success:  false,
            Duration: time.Since(start),
            Error:    err,
        }
    }
    
    // Wait for service to be healthy
    if config.HealthCheck != "" {
        if err := r.waitForHealth(ctx, config.HealthCheck); err != nil {
            return ReloadResult{
                Service:  name,
                Success:  false,
                Duration: time.Since(start),
                Error:    fmt.Errorf("health check failed: %w", err),
            }
        }
    }
    
    return ReloadResult{
        Service:  name,
        Success:  true,
        Duration: time.Since(start),
    }
}

Network Connectivity Monitoring

Implement connectivity checking in pkg/network/connectivity.go:

type ConnectivityChecker struct {
    endpoints         []EndpointConfig
    checkInterval     time.Duration
    timeout           time.Duration
    degradedThreshold time.Duration
    currentStatus     ConnectivityStatus
    lastSuccessful    time.Time
    mutex             sync.RWMutex
    logger            *slog.Logger
}

type EndpointConfig struct {
    Name     string
    URL      string
    Critical bool
}

type ConnectivityStatus int

const (
    StatusHealthy ConnectivityStatus = iota
    StatusDegraded
    StatusOffline
)

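// Start runs periodic checks in a background goroutine until ctx is cancelled.
func (c *ConnectivityChecker) Start(ctx context.Context) {
    ticker := time.NewTicker(c.checkInterval)
    go func() {
        defer ticker.Stop() // release the ticker when the loop exits
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                c.performHealthCheck()
            }
        }
    }()
}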

func (c *ConnectivityChecker) performHealthCheck() {
    allHealthy := true
    criticalHealthy := true
    
    for _, endpoint := range c.endpoints {
        healthy := c.checkEndpoint(endpoint)
        if !healthy {
            allHealthy = false
            if endpoint.Critical {
                criticalHealthy = false
            }
        }
    }
    
    c.updateStatus(allHealthy, criticalHealthy)
}

func (c *ConnectivityChecker) updateStatus(allHealthy, criticalHealthy bool) {
    c.mutex.Lock()
    defer c.mutex.Unlock()
    
    oldStatus := c.currentStatus
    
    if allHealthy {
        c.currentStatus = StatusHealthy
        c.lastSuccessful = time.Now()
    } else if criticalHealthy {
        // Check if we should enter degraded mode
        if time.Since(c.lastSuccessful) > c.degradedThreshold {
            c.currentStatus = StatusDegraded
        }
    } else {
        c.currentStatus = StatusOffline
    }
    
    if oldStatus != c.currentStatus {
        c.logger.Info("Connectivity status changed",
            slog.String("old_status", c.statusString(oldStatus)),
            slog.String("new_status", c.statusString(c.currentStatus)))
    }
}

Audit Logging

Enhance logging in internal/monitoring/logging.go:

type AuditLogger struct {
    logger   *slog.Logger
    filepath string
}

type AuditEvent struct {
    Timestamp   time.Time              `json:"timestamp"`
    EventType   string                 `json:"event_type"`
    Actor       Actor                  `json:"actor"`
    Resource    Resource               `json:"resource"`
    Outcome     string                 `json:"outcome"`
    Details     map[string]interface{} `json:"details,omitempty"`
}

type Actor struct {
    Identity   string `json:"identity"`
    IPAddress  string `json:"ip_address,omitempty"`
    AuthMethod string `json:"auth_method"`
}

type Resource struct {
    Type       string `json:"type"`
    Identifier string `json:"identifier"`
}

func (a *AuditLogger) LogCertificateRenewal(serial string, success bool, details map[string]interface{}) {
    event := AuditEvent{
        Timestamp: time.Now().UTC(),
        EventType: "certificate_renewed",
        Actor: Actor{
            Identity:   a.getMachineIdentity(),
            IPAddress:  a.getLocalIP(),
            AuthMethod: "jwt",
        },
        Resource: Resource{
            Type:       "certificate",
            Identifier: serial,
        },
        Outcome: a.outcomeString(success),
        Details: details,
    }
    
    a.writeAuditEvent(event)
}

func (a *AuditLogger) LogServiceReload(service string, success bool, duration time.Duration) {
    event := AuditEvent{
        Timestamp: time.Now().UTC(),
        EventType: "service_reloaded",
        Actor: Actor{
            Identity:   a.getMachineIdentity(),
            AuthMethod: "system",
        },
        Resource: Resource{
            Type:       "service",
            Identifier: service,
        },
        Outcome: a.outcomeString(success),
        Details: map[string]interface{}{
            "duration_ms": duration.Milliseconds(),
        },
    }
    
    a.writeAuditEvent(event)
}

Degraded Mode Management

type DegradedModeManager struct {
    connectivity *ConnectivityChecker
    jwtManager   *auth.JWTManager
    metrics      *MetricsCollector
    inDegraded   bool
    enteredAt    *time.Time
    mutex        sync.RWMutex
    logger       *slog.Logger
}

func (d *DegradedModeManager) CheckAndUpdateMode() {
    d.mutex.Lock()
    defer d.mutex.Unlock()
    
    status := d.connectivity.GetStatus()
    
    if status == StatusDegraded || status == StatusOffline {
        if !d.inDegraded {
            d.enterDegradedMode()
        }
    } else {
        if d.inDegraded {
            d.exitDegradedMode()
        }
    }
}

func (d *DegradedModeManager) enterDegradedMode() {
    d.inDegraded = true
    now := time.Now()
    d.enteredAt = &now
    
    d.logger.Warn("Entering degraded mode due to connectivity issues",
        slog.Time("entered_at", now))
    
    d.metrics.SetDegradedMode(true)
    
    // Notify other components
    d.jwtManager.EnableDegradedMode()
}

func (d *DegradedModeManager) exitDegradedMode() {
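    // enteredAt is nil if exit is reached without a matching enter,
    // so guard the dereference.
    var duration time.Duration
    if d.enteredAt != nil {
        duration = time.Since(*d.enteredAt)
    }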
    
    d.logger.Info("Exiting degraded mode - connectivity restored",
        slog.Duration("duration", duration))
    
    d.inDegraded = false
    d.enteredAt = nil
    
    d.metrics.SetDegradedMode(false)
    
    // TODO: trigger an immediate renewal check and refresh the JWT token.
    d.jwtManager.DisableDegradedMode()
}

Configuration

Required configuration:

monitoring:
  metrics_port: 9091
  health_port: 8081
  audit_log_path: "/etc/certs/logs/rotation.log"

network:
  health_check_interval: "30s"
  connectivity_timeout: "10s"
  degraded_mode_threshold: "5m"
  endpoints:
    - name: "certificate-api"
      url: "https://certificate-api.internal/health"
      critical: true
    - name: "keycloak"
      url: "https://keycloak.internal/health"
      critical: true

services:
  reload_commands:
    nginx:
      command: "systemctl reload nginx"
      health_check: "curl -f http://localhost/health"
      timeout: "30s"
      critical: true
    haproxy:
      command: "systemctl reload haproxy"
      health_check: "curl -f http://localhost:8080/health"
      timeout: "30s"
      critical: false

Acceptance Criteria

  1. Metrics Collection

    • Exports all specified Prometheus metrics
    • Updates metrics in real-time
    • Accessible via /metrics endpoint
    • Compatible with Prometheus scraping
  2. Health Checks

    • Provides accurate health status for all components
    • Returns appropriate HTTP status codes
    • Includes detailed health information
    • Responds within 5 seconds
  3. Service Reloading

    • Successfully reloads all configured services
    • Handles reload failures gracefully
    • Validates service health post-reload
    • Respects timeout configurations
  4. Audit Logging

    • Logs all certificate operations
    • Uses specified JSON format
    • Includes all required fields
    • Rotates logs appropriately
  5. Degraded Mode

    • Enters degraded mode after the configured threshold (degraded_mode_threshold)
    • Continues operating with cached resources
    • Recovers automatically when connectivity is restored
    • Alerts operations team appropriately
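
The health check criteria above ask for appropriate HTTP status codes but do not name them. One possible mapping is sketched below. It assumes the per-check values used earlier ("healthy", "warning", "critical") roll up into the overall values "healthy", "degraded", and "unhealthy", and that only "unhealthy" should return 503; both function names are hypothetical.

```go
package main

import "net/http"

// overallStatus rolls individual check statuses up into a single value.
// Any critical, unhealthy, or unrecognized check makes the result
// "unhealthy"; otherwise any warning or degraded check makes it "degraded".
func overallStatus(checks map[string]string) string {
	overall := "healthy"
	for _, s := range checks {
		switch s {
		case "healthy":
			// No change.
		case "warning", "degraded":
			overall = "degraded"
		default:
			return "unhealthy"
		}
	}
	return overall
}

// httpStatusFor picks the response code for GET /health.
// Assumption: a degraded service still answers 200 so that it stays
// in rotation, and only an unhealthy one answers 503.
func httpStatusFor(overall string) int {
	if overall == "unhealthy" {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}
```

Whether a degraded instance should return 200 or 503 depends on how load balancers and alerting consume the endpoint, so this choice is worth settling explicitly.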

Testing Requirements

  • Unit tests for all monitoring components
  • Test metric calculation and updates
  • Test health check endpoints
  • Test service reload with mock commands
  • Test audit log formatting
  • Test degraded mode transitions
  • Integration tests for connectivity checking
  • Load test health endpoints
  • Test log rotation
  • Simulate network failures

Dependencies

  • Prometheus client library (github.com/prometheus/client_golang)
  • Standard net/http package for health endpoints
  • Standard os/exec for service commands
  • Context package for timeouts
