-
Notifications
You must be signed in to change notification settings - Fork 0
Monitoring Alerting
Real-time monitoring, alerting, capacity planning, and resource analysis for production PostgreSQL databases.
5 enterprise monitoring tools for comprehensive database observability:
Tool | Purpose | Category |
---|---|---|
monitor_real_time |
Real-time performance monitoring | Observability |
alert_threshold_set |
Metric threshold analysis and alerting | Alerting |
capacity_planning |
Growth projection and capacity forecasting | Planning |
resource_usage_analyze |
CPU/Memory/IO resource analysis | Performance |
replication_monitor |
Replication status and lag monitoring | High Availability |
Real-time monitoring of database performance metrics including queries, locks, connections, and I/O.
Parameters:
-
include_queries
(boolean, optional): Include currently running queries -
include_locks
(boolean, optional): Include lock information -
include_io
(boolean, optional): Include I/O statistics -
limit
(integer, optional): Limit number of results (default: 10)
Returns:
-
timestamp
: Current timestamp -
metrics.connections
: Connection statistics by state -
metrics.active_queries
: Currently running queries -
metrics.locks
: Lock information by type -
metrics.io_statistics
: Heap and index block statistics -
metrics.database
: Database size and modifications
Example:
result = monitor_real_time(
include_queries=True,
include_locks=True,
include_io=True,
limit=10
)
# Returns: {
# "timestamp": "2025-10-03 15:30:45",
# "metrics": {
# "connections": {
# "total": 25,
# "by_state": [
# {"state": "active", "count": 3},
# {"state": "idle", "count": 22}
# ]
# },
# "active_queries": {...},
# "locks": {"total": 5, "blocked": 0},
# "io_statistics": {
# "heap_hit_ratio_percent": 99.2,
# "index_hit_ratio_percent": 99.8
# }
# }
# }
Use Cases:
- Production monitoring dashboards
- Real-time performance troubleshooting
- Lock contention detection
- Connection pool monitoring
Analyze database metrics against configurable alert thresholds.
Parameters:
-
metric_type
(string, required): Metric to check (cache_hit_ratio
,connection_count
,transaction_age
,database_size
,replication_lag
) -
warning_threshold
(number, required): Warning threshold value -
critical_threshold
(number, required): Critical threshold value -
check_current
(boolean, optional): Check current value against thresholds
Returns:
-
metric_type
: Metric being monitored -
thresholds
: Configured warning and critical thresholds -
current_value
: Current metric value -
alert_status
: Current status (ok
,warning
,critical
) -
unit
: Measurement unit
Example:
# Monitor cache hit ratio
result = alert_threshold_set(
metric_type="cache_hit_ratio",
warning_threshold=95.0,
critical_threshold=90.0,
check_current=True
)
# Returns: {
# "metric_type": "cache_hit_ratio",
# "thresholds": {"warning": 95.0, "critical": 90.0},
# "current_value": 99.3,
# "alert_status": "ok",
# "unit": "percent"
# }
# Monitor connection pool
result = alert_threshold_set(
metric_type="connection_count",
warning_threshold=80,
critical_threshold=95,
check_current=True
)
Supported Metrics:
-
cache_hit_ratio
- Buffer cache hit percentage -
connection_count
- Active database connections -
transaction_age
- Longest running transaction age (seconds) -
database_size
- Total database size (bytes) -
replication_lag
- Replication lag (seconds, replicas only)
Use Cases:
- Automated alerting systems
- Threshold-based monitoring
- Performance degradation detection
- Capacity warning systems
Analyze database growth trends and project future capacity needs.
Parameters:
-
forecast_days
(integer, required): Number of days to forecast -
include_table_growth
(boolean, optional): Include per-table growth analysis -
include_index_growth
(boolean, optional): Include per-index growth analysis
Returns:
-
current_state
: Current database size and top tables/indexes -
projections
: Growth forecasts for specified period -
recommendations
: Storage and capacity recommendations
Example:
result = capacity_planning(
forecast_days=90,
include_table_growth=True,
include_index_growth=True
)
# Returns: {
# "current_state": {
# "total_size": {"bytes": 50000000000, "gb": 46.57, "pretty": "47 GB"},
# "user_data_size_mb": 35000,
# "index_size_mb": 8500,
# "top_tables": [
# {"table": "orders", "size_mb": 12500, "row_count": 5000000},
# {"table": "users", "size_mb": 8900, "row_count": 2000000}
# ]
# },
# "projections": {
# "forecast_days": 90,
# "estimated_daily_growth_mb": 450,
# "estimated_total_growth_gb": 38.6,
# "projected_total_size_gb": 85.2
# },
# "recommendations": {
# "recommended_storage_gb": 127.8,
# "buffer_percentage": 50,
# "planning_horizon_days": 90
# }
# }
Use Cases:
- Storage capacity planning
- Budget forecasting
- Growth trend analysis
- Infrastructure scaling decisions
Analyze CPU, memory, and I/O resource usage patterns.
Parameters:
-
include_cpu
(boolean, optional): Include CPU usage analysis -
include_memory
(boolean, optional): Include memory usage analysis -
include_io
(boolean, optional): Include I/O usage analysis
Returns:
-
resource_analysis.memory
: Buffer cache and shared memory stats -
resource_analysis.io
: Disk I/O and cache hit ratios -
resource_analysis.cpu
: Query execution time statistics -
recommendations
: Resource optimization suggestions
Example:
result = resource_usage_analyze(
include_cpu=True,
include_memory=True,
include_io=True
)
# Returns: {
# "resource_analysis": {
# "memory": {
# "shared_buffers": "16384",
# "buffer_cache_hit_ratio": 99.5,
# "buffer_hits": 5234567,
# "disk_reads": 25678
# },
# "io": {
# "heap_blocks_from_disk": 12345,
# "heap_blocks_from_cache": 987654,
# "heap_hit_ratio": 98.8,
# "index_blocks_from_disk": 3456,
# "index_blocks_from_cache": 654321,
# "index_hit_ratio": 99.5
# },
# "cpu": {
# "total_execution_time_ms": 125678,
# "total_calls": 456789,
# "avg_query_time_ms": 0.275,
# "max_query_time_ms": 1250.45
# }
# },
# "recommendations": [
# {
# "category": "memory",
# "priority": "info",
# "recommendation": "Buffer cache hit ratio excellent at 99.5%"
# }
# ]
# }
Requirements: pg_stat_statements extension for CPU analysis
Use Cases:
- Performance bottleneck identification
- Resource optimization
- Infrastructure right-sizing
- Cost optimization
Monitor replication status, lag, and health for primary and replica databases.
Parameters:
-
include_wal_status
(boolean, optional): Include WAL sender/receiver status -
include_slots
(boolean, optional): Include replication slot information
Returns:
-
replication_status.is_replica
: Whether this is a replica -
replication_status.role
: Database role (primary/replica) -
replication_status.wal_senders
: WAL sender connections (primary only) -
replication_status.replication_slots
: Active replication slots -
lag_info
: Replication lag statistics (replica only)
Example:
# On primary database
result = replication_monitor(
include_wal_status=True,
include_slots=True
)
# Returns: {
# "replication_status": {
# "is_replica": False,
# "role": "primary",
# "wal_senders": {
# "count": 2,
# "senders": [
# {
# "application_name": "replica1",
# "client_addr": "10.0.1.5",
# "state": "streaming",
# "sync_state": "async",
# "sent_lsn": "0/5A2F3C0",
# "write_lsn": "0/5A2F3C0",
# "flush_lsn": "0/5A2F3C0"
# }
# ]
# },
# "replication_slots": {
# "total_count": 2,
# "inactive_count": 0,
# "slots": [...]
# }
# }
# }
# On replica database
result = replication_monitor(include_wal_status=True)
# Returns: {
# "replication_status": {
# "is_replica": True,
# "role": "replica"
# },
# "lag_info": {
# "receive_lsn": "0/5A2F3C0",
# "replay_lsn": "0/5A2F380",
# "lag_bytes": 64,
# "lag_seconds": 0.05,
# "is_replaying": True
# }
# }
Use Cases:
- High availability monitoring
- Replication health checks
- Lag detection and alerting
- Disaster recovery readiness
# 1. Real-time metrics
metrics = monitor_real_time(
include_queries=True,
include_locks=True,
include_io=True
)
# 2. Check thresholds
cache_status = alert_threshold_set(
metric_type="cache_hit_ratio",
warning_threshold=95,
critical_threshold=90,
check_current=True
)
conn_status = alert_threshold_set(
metric_type="connection_count",
warning_threshold=80,
critical_threshold=95,
check_current=True
)
# 3. Resource analysis
resources = resource_usage_analyze(
include_cpu=True,
include_memory=True,
include_io=True
)
# 1. Analyze growth
capacity = capacity_planning(
forecast_days=90,
include_table_growth=True,
include_index_growth=True
)
# 2. Current resource usage
resources = resource_usage_analyze(
include_cpu=True,
include_memory=True,
include_io=True
)
# 3. Project needs
# Use capacity["recommendations"]["recommended_storage_gb"]
# Use capacity["projections"]["projected_total_size_gb"]
# 1. Check replication status
repl_status = replication_monitor(
include_wal_status=True,
include_slots=True
)
# 2. Monitor lag
if repl_status["replication_status"]["is_replica"]:
lag_alert = alert_threshold_set(
metric_type="replication_lag",
warning_threshold=5,
critical_threshold=30,
check_current=True
)
- Run
monitor_real_time()
every 1-5 minutes - Check
alert_threshold_set()
for key metrics - Review
resource_usage_analyze()
daily
- Run
capacity_planning()
monthly - Track growth trends over time
- Plan upgrades 3 months in advance
Recommended Thresholds:
# Cache hit ratio
cache_warning = 95.0 # Below 95% investigate
cache_critical = 90.0 # Below 90% urgent action
# Connections
conn_warning = 80 # 80% of max_connections
conn_critical = 95 # 95% of max_connections
# Transaction age
txn_warning = 300 # 5 minutes
txn_critical = 1800 # 30 minutes
# Replication lag
lag_warning = 5 # 5 seconds
lag_critical = 30 # 30 seconds
- Monitor lag every 30 seconds
- Alert on WAL sender disconnections
- Watch for inactive replication slots
# 1. Identify connections
metrics = monitor_real_time(include_queries=True)
# 2. Analyze queries
from Core import get_top_queries
slow_queries = get_top_queries(sort_by="calls", limit=20)
# 3. Check for connection leaks
# Review application connection pooling
# 1. Analyze resource usage
resources = resource_usage_analyze(include_memory=True, include_io=True)
# 2. Check buffer cache size
# Consider increasing shared_buffers in postgresql.conf
# 3. Review query patterns
# Identify queries causing excessive disk I/O
# 1. Monitor replication
repl = replication_monitor(include_wal_status=True)
# 2. Check network connectivity
# Verify network bandwidth between primary and replica
# 3. Analyze replica load
# Check if replica is under heavy read load
- Core Database Tools - Basic health monitoring
- Performance Intelligence - Query optimization
- Backup & Recovery - Backup strategies
- Security Best Practices - Secure monitoring
See Home for more tool categories.