@bmcdorman

Overview

This PR adds Prometheus metrics for Redis connection status and for per-operation success and failure counts, so Redis health can be monitored and alerted on.

Changes Made

Application Code

  1. Added metrics module - src/metrics.ts:

    • database_redis_connection_status: Gauge tracking connection state (1=connected, 0=disconnected)
    • database_redis_operation_success_total: Counter for successful operations by type
    • database_redis_operation_failures_total: Counter for failed operations by type
  2. Updated RedisCache - src/RedisCache.ts:

    • Added event listeners for connection lifecycle events (connect, ready, error, close, end)
    • Wrapped operations in try-catch to track success/failure metrics
    • Updates connection gauge based on Redis events
  3. Added /metrics endpoint - src/index.ts:

    • Exposes the metrics above at the /metrics endpoint
    • Returns them in the Prometheus text exposition format
  4. Added prom-client dependency - package.json

  5. Documentation - REDIS_MONITORING.md:

    • Complete guide on metrics, alerts, dashboard, and testing
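
The gauge and counter updates described in items 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the code in src/RedisCache.ts: the helper names `bindConnectionGauge` and `tracked`, the `operation` label name, and the choice of which events set the gauge to 1 or 0 are all assumptions. The small interfaces stand in for prom-client's `Gauge` and `Counter` (which provide `set(value)` and `inc(labels)`) and for the Redis client, so the sketch has no dependencies.

```typescript
// Structural stand-ins for the prom-client objects (assumed shapes).
interface StatusGauge {
  set(value: number): void;
}
interface OperationCounter {
  inc(labels: { operation: string }): void; // "operation" label name is assumed
}
// Minimal view of a Redis client: anything with an on(event, listener) method.
interface LifecycleEmitter {
  on(event: string, listener: () => void): unknown;
}

// Hypothetical helper: mirror connection lifecycle events into the gauge.
// Assumed mapping: "ready" means connected (1); "close" and "end" mean
// disconnected (0). "connect" and "error" leave the gauge unchanged here.
function bindConnectionGauge(client: LifecycleEmitter, gauge: StatusGauge): void {
  gauge.set(0); // report disconnected until the client says otherwise
  client.on("ready", () => gauge.set(1));
  client.on("close", () => gauge.set(0));
  client.on("end", () => gauge.set(0));
}

// Hypothetical helper: run one cache operation and count its outcome.
async function tracked<T>(
  operation: string,
  success: OperationCounter,
  failures: OperationCounter,
  fn: () => Promise<T>,
): Promise<T> {
  try {
    const result = await fn();
    success.inc({ operation });
    return result;
  } catch (err) {
    failures.inc({ operation });
    throw err; // rethrow so callers still see the failure
  }
}
```

A cache method could then wrap its Redis call, for example `tracked("get", successCounter, failureCounter, () => client.get(key))`, where `successCounter`, `failureCounter`, and `client` are placeholders for the real objects. Rethrowing keeps the metrics purely observational: callers see the same errors as before.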

Kubernetes/Helm Changes (in the separate kipr-yamls repo)

Already pushed to the kipr-yamls main branch:

  • ServiceMonitor to scrape /metrics endpoint
  • PrometheusRule with Redis alerts
  • Updated Service with labels for Prometheus discovery

Alerts Configured

  • DatabaseRedisDown: Critical alert when Redis has been disconnected for more than 1 minute
  • DatabaseRedisOperationFailures: Warning when the failure rate stays above 0.1/sec for 2 minutes
  • DatabaseRedisHighFailureRate: Critical when the failure rate stays above 1/sec for 1 minute
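
To make the two failure-rate thresholds concrete, here is a small arithmetic sketch. It is not how the alerts are evaluated: the real rules live in the PrometheusRule and are evaluated by Prometheus, which also handles counter resets and the 1 and 2 minute hold durations. The function names and the two-sample rate calculation are assumptions for illustration only.

```typescript
type AlertLevel = "none" | "warning" | "critical";

// Per-second increase of a counter between two samples.
// Simplified: ignores counter resets, which Prometheus handles itself.
function failureRatePerSecond(
  previous: number,
  current: number,
  elapsedSeconds: number,
): number {
  if (elapsedSeconds <= 0) throw new Error("elapsedSeconds must be positive");
  return (current - previous) / elapsedSeconds;
}

// Map a failure rate onto the thresholds listed above.
// The hold durations (1 and 2 minutes) are deliberately not modelled.
function classifyFailureRate(ratePerSecond: number): AlertLevel {
  if (ratePerSecond > 1) return "critical"; // DatabaseRedisHighFailureRate threshold
  if (ratePerSecond > 0.1) return "warning"; // DatabaseRedisOperationFailures threshold
  return "none";
}
```

For example, a failure counter that goes from 100 to 130 over 60 seconds has a rate of 0.5/sec, which is above the warning threshold but below the critical one.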

Grafana Dashboard

A dashboard JSON is available (not included in this PR) with:

  • Redis connection status indicator
  • Operation success/failure rates
  • Success percentage gauge
  • Connection history timeline

Testing

This PR is UNTESTED and should be validated in staging:

  1. Deploy to staging
  2. Verify /metrics endpoint returns metrics
  3. Simulate Redis failure and verify alerts fire
  4. Check metrics accurately reflect Redis state
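
Step 2 could be partly automated with a check like the one below. It is a sketch under assumptions: the helper names are hypothetical, and the parser only reads sample lines of the Prometheus text exposition format (it skips `# HELP` and `# TYPE` comments and ignores label and value syntax).

```typescript
// Metric names introduced by this PR.
const EXPECTED_METRICS = [
  "database_redis_connection_status",
  "database_redis_operation_success_total",
  "database_redis_operation_failures_total",
];

// Collect the metric names that have at least one sample line in the body.
function metricNamesIn(body: string): Set<string> {
  const names = new Set<string>();
  for (const rawLine of body.split("\n")) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // skip blanks and comments
    const match = /^[a-zA-Z_:][a-zA-Z0-9_:]*/.exec(line);
    if (match) names.add(match[0]);
  }
  return names;
}

// Return the expected metrics that have no sample line in the body.
function missingMetrics(body: string, expected: string[] = EXPECTED_METRICS): string[] {
  const present = metricNamesIn(body);
  return expected.filter((name) => !present.has(name));
}
```

The body would come from a request to the staging service's /metrics endpoint; the host and port are deployment-specific and not given here. A labelled counter that has never been incremented may expose only its `# HELP` and `# TYPE` lines, so it can make sense to run a few cache operations before checking.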

Compatibility

- Add prom-client dependency for metrics
- Create /metrics endpoint exposing Prometheus metrics
- Track Redis connection status (database_redis_connection_status)
- Track Redis operation success/failure counts by operation type
- Update RedisCache to report connection events and operation metrics
- Add comprehensive monitoring documentation

Metrics exposed:
- database_redis_connection_status: 1=connected, 0=disconnected
- database_redis_operation_success_total: successful operations by type
- database_redis_operation_failures_total: failed operations by type

Note: This PR is independent and can be deployed alongside PR #10 for
graceful Redis failure handling, or standalone for monitoring only.