Skip to content

Monitoring Alerting

Temp edited this page Oct 3, 2025 · 2 revisions

Monitoring & Alerting Tools

Real-time monitoring, alerting, capacity planning, and resource analysis for production PostgreSQL databases.


📊 Overview

5 enterprise monitoring tools for comprehensive database observability:

Tool Purpose Category
monitor_real_time Real-time performance monitoring Observability
alert_threshold_set Metric threshold analysis and alerting Alerting
capacity_planning Growth projection and capacity forecasting Planning
resource_usage_analyze CPU/Memory/IO resource analysis Performance
replication_monitor Replication status and lag monitoring High Availability

🔧 Tool Details

monitor_real_time

Real-time monitoring of database performance metrics including queries, locks, connections, and I/O.

Parameters:

  • include_queries (boolean, optional): Include currently running queries
  • include_locks (boolean, optional): Include lock information
  • include_io (boolean, optional): Include I/O statistics
  • limit (integer, optional): Limit number of results (default: 10)

Returns:

  • timestamp: Current timestamp
  • metrics.connections: Connection statistics by state
  • metrics.active_queries: Currently running queries
  • metrics.locks: Lock information by type
  • metrics.io_statistics: Heap and index block statistics
  • metrics.database: Database size and modifications

Example:

result = monitor_real_time(
    include_queries=True,
    include_locks=True,
    include_io=True,
    limit=10
)
# Returns: {
#   "timestamp": "2025-10-03 15:30:45",
#   "metrics": {
#     "connections": {
#       "total": 25,
#       "by_state": [
#         {"state": "active", "count": 3},
#         {"state": "idle", "count": 22}
#       ]
#     },
#     "active_queries": {...},
#     "locks": {"total": 5, "blocked": 0},
#     "io_statistics": {
#       "heap_hit_ratio_percent": 99.2,
#       "index_hit_ratio_percent": 99.8
#     }
#   }
# }

Use Cases:

  • Production monitoring dashboards
  • Real-time performance troubleshooting
  • Lock contention detection
  • Connection pool monitoring

alert_threshold_set

Analyze database metrics against configurable alert thresholds.

Parameters:

  • metric_type (string, required): Metric to check (cache_hit_ratio, connection_count, transaction_age, database_size, replication_lag)
  • warning_threshold (number, required): Warning threshold value
  • critical_threshold (number, required): Critical threshold value
  • check_current (boolean, optional): Check current value against thresholds

Returns:

  • metric_type: Metric being monitored
  • thresholds: Configured warning and critical thresholds
  • current_value: Current metric value
  • alert_status: Current status (ok, warning, critical)
  • unit: Measurement unit

Example:

# Monitor cache hit ratio
result = alert_threshold_set(
    metric_type="cache_hit_ratio",
    warning_threshold=95.0,
    critical_threshold=90.0,
    check_current=True
)
# Returns: {
#   "metric_type": "cache_hit_ratio",
#   "thresholds": {"warning": 95.0, "critical": 90.0},
#   "current_value": 99.3,
#   "alert_status": "ok",
#   "unit": "percent"
# }

# Monitor connection pool
result = alert_threshold_set(
    metric_type="connection_count",
    warning_threshold=80,
    critical_threshold=95,
    check_current=True
)

Supported Metrics:

  • cache_hit_ratio - Buffer cache hit percentage
  • connection_count - Active database connections
  • transaction_age - Longest running transaction age (seconds)
  • database_size - Total database size (bytes)
  • replication_lag - Replication lag (seconds, replicas only)

Use Cases:

  • Automated alerting systems
  • Threshold-based monitoring
  • Performance degradation detection
  • Capacity warning systems

capacity_planning

Analyze database growth trends and project future capacity needs.

Parameters:

  • forecast_days (integer, required): Number of days to forecast
  • include_table_growth (boolean, optional): Include per-table growth analysis
  • include_index_growth (boolean, optional): Include per-index growth analysis

Returns:

  • current_state: Current database size and top tables/indexes
  • projections: Growth forecasts for specified period
  • recommendations: Storage and capacity recommendations

Example:

result = capacity_planning(
    forecast_days=90,
    include_table_growth=True,
    include_index_growth=True
)
# Returns: {
#   "current_state": {
#     "total_size": {"bytes": 50000000000, "gb": 46.57, "pretty": "47 GB"},
#     "user_data_size_mb": 35000,
#     "index_size_mb": 8500,
#     "top_tables": [
#       {"table": "orders", "size_mb": 12500, "row_count": 5000000},
#       {"table": "users", "size_mb": 8900, "row_count": 2000000}
#     ]
#   },
#   "projections": {
#     "forecast_days": 90,
#     "estimated_daily_growth_mb": 450,
#     "estimated_total_growth_gb": 38.6,
#     "projected_total_size_gb": 85.2
#   },
#   "recommendations": {
#     "recommended_storage_gb": 127.8,
#     "buffer_percentage": 50,
#     "planning_horizon_days": 90
#   }
# }

Use Cases:

  • Storage capacity planning
  • Budget forecasting
  • Growth trend analysis
  • Infrastructure scaling decisions

resource_usage_analyze

Analyze CPU, memory, and I/O resource usage patterns.

Parameters:

  • include_cpu (boolean, optional): Include CPU usage analysis
  • include_memory (boolean, optional): Include memory usage analysis
  • include_io (boolean, optional): Include I/O usage analysis

Returns:

  • resource_analysis.memory: Buffer cache and shared memory stats
  • resource_analysis.io: Disk I/O and cache hit ratios
  • resource_analysis.cpu: Query execution time statistics
  • recommendations: Resource optimization suggestions

Example:

result = resource_usage_analyze(
    include_cpu=True,
    include_memory=True,
    include_io=True
)
# Returns: {
#   "resource_analysis": {
#     "memory": {
#       "shared_buffers": "16384",
#       "buffer_cache_hit_ratio": 99.5,
#       "buffer_hits": 5234567,
#       "disk_reads": 25678
#     },
#     "io": {
#       "heap_blocks_from_disk": 12345,
#       "heap_blocks_from_cache": 987654,
#       "heap_hit_ratio": 98.8,
#       "index_blocks_from_disk": 3456,
#       "index_blocks_from_cache": 654321,
#       "index_hit_ratio": 99.5
#     },
#     "cpu": {
#       "total_execution_time_ms": 125678,
#       "total_calls": 456789,
#       "avg_query_time_ms": 0.275,
#       "max_query_time_ms": 1250.45
#     }
#   },
#   "recommendations": [
#     {
#       "category": "memory",
#       "priority": "info",
#       "recommendation": "Buffer cache hit ratio excellent at 99.5%"
#     }
#   ]
# }

Requirements: pg_stat_statements extension for CPU analysis

Use Cases:

  • Performance bottleneck identification
  • Resource optimization
  • Infrastructure right-sizing
  • Cost optimization

replication_monitor

Monitor replication status, lag, and health for primary and replica databases.

Parameters:

  • include_wal_status (boolean, optional): Include WAL sender/receiver status
  • include_slots (boolean, optional): Include replication slot information

Returns:

  • replication_status.is_replica: Whether this is a replica
  • replication_status.role: Database role (primary/replica)
  • replication_status.wal_senders: WAL sender connections (primary only)
  • replication_status.replication_slots: Active replication slots
  • lag_info: Replication lag statistics (replica only)

Example:

# On primary database
result = replication_monitor(
    include_wal_status=True,
    include_slots=True
)
# Returns: {
#   "replication_status": {
#     "is_replica": False,
#     "role": "primary",
#     "wal_senders": {
#       "count": 2,
#       "senders": [
#         {
#           "application_name": "replica1",
#           "client_addr": "10.0.1.5",
#           "state": "streaming",
#           "sync_state": "async",
#           "sent_lsn": "0/5A2F3C0",
#           "write_lsn": "0/5A2F3C0",
#           "flush_lsn": "0/5A2F3C0"
#         }
#       ]
#     },
#     "replication_slots": {
#       "total_count": 2,
#       "inactive_count": 0,
#       "slots": [...]
#     }
#   }
# }

# On replica database
result = replication_monitor(include_wal_status=True)
# Returns: {
#   "replication_status": {
#     "is_replica": True,
#     "role": "replica"
#   },
#   "lag_info": {
#     "receive_lsn": "0/5A2F3C0",
#     "replay_lsn": "0/5A2F380",
#     "lag_bytes": 64,
#     "lag_seconds": 0.05,
#     "is_replaying": True
#   }
# }

Use Cases:

  • High availability monitoring
  • Replication health checks
  • Lag detection and alerting
  • Disaster recovery readiness

🎯 Common Workflows

Production Monitoring Dashboard

# 1. Real-time metrics
metrics = monitor_real_time(
    include_queries=True,
    include_locks=True,
    include_io=True
)

# 2. Check thresholds
cache_status = alert_threshold_set(
    metric_type="cache_hit_ratio",
    warning_threshold=95,
    critical_threshold=90,
    check_current=True
)

conn_status = alert_threshold_set(
    metric_type="connection_count",
    warning_threshold=80,
    critical_threshold=95,
    check_current=True
)

# 3. Resource analysis
resources = resource_usage_analyze(
    include_cpu=True,
    include_memory=True,
    include_io=True
)

Capacity Planning Review

# 1. Analyze growth
capacity = capacity_planning(
    forecast_days=90,
    include_table_growth=True,
    include_index_growth=True
)

# 2. Current resource usage
resources = resource_usage_analyze(
    include_cpu=True,
    include_memory=True,
    include_io=True
)

# 3. Project needs
# Use capacity["recommendations"]["recommended_storage_gb"]
# Use capacity["projections"]["projected_total_size_gb"]

Replication Health Check

# 1. Check replication status
repl_status = replication_monitor(
    include_wal_status=True,
    include_slots=True
)

# 2. Monitor lag
if repl_status["replication_status"]["is_replica"]:
    lag_alert = alert_threshold_set(
        metric_type="replication_lag",
        warning_threshold=5,
        critical_threshold=30,
        check_current=True
    )

📊 Monitoring Best Practices

1. Regular Health Checks

  • Run monitor_real_time() every 1-5 minutes
  • Check alert_threshold_set() for key metrics
  • Review resource_usage_analyze() daily

2. Capacity Planning

  • Run capacity_planning() monthly
  • Track growth trends over time
  • Plan upgrades 3 months in advance

3. Alert Thresholds

Recommended Thresholds:

# Cache hit ratio
cache_warning = 95.0    # Below 95% investigate
cache_critical = 90.0   # Below 90% urgent action

# Connections
conn_warning = 80       # 80% of max_connections
conn_critical = 95      # 95% of max_connections

# Transaction age
txn_warning = 300       # 5 minutes
txn_critical = 1800     # 30 minutes

# Replication lag
lag_warning = 5         # 5 seconds
lag_critical = 30       # 30 seconds

4. Replication Monitoring

  • Monitor lag every 30 seconds
  • Alert on WAL sender disconnections
  • Watch for inactive replication slots

🚨 Alert Response Guide

High Connection Count

# 1. Identify connections
metrics = monitor_real_time(include_queries=True)

# 2. Analyze queries
from Core import get_top_queries
slow_queries = get_top_queries(sort_by="calls", limit=20)

# 3. Check for connection leaks
# Review application connection pooling

Low Cache Hit Ratio

# 1. Analyze resource usage
resources = resource_usage_analyze(include_memory=True, include_io=True)

# 2. Check buffer cache size
# Consider increasing shared_buffers in postgresql.conf

# 3. Review query patterns
# Identify queries causing excessive disk I/O

Replication Lag

# 1. Monitor replication
repl = replication_monitor(include_wal_status=True)

# 2. Check network connectivity
# Verify network bandwidth between primary and replica

# 3. Analyze replica load
# Check if replica is under heavy read load

📚 Related Documentation


See Home for more tool categories.

Clone this wiki locally