@dementii-priadko commented Sep 16, 2025

Database name

PostgreSQL - This PR adds a Prometheus exporter for WAL-G PostgreSQL backup and WAL monitoring.

Pull request description

Describe what this PR adds

This PR introduces a WAL-G Prometheus Exporter that provides comprehensive observability for WAL-G backup operations for PostgreSQL databases.

🎯 What This PR Adds

This PR adds a complete Prometheus exporter (/exporter directory) with the following capabilities:

Core Exporter Components:

  • exporter.go - Main Prometheus collector implementation
  • main.go - HTTP server and CLI interface
  • pitr.go - Point-in-time recovery window calculations
  • wal_lag.go - LSN parsing and WAL lag calculation logic
  • mock-wal-g - Mock script for testing and development
  • go.mod/go.sum - Go module dependencies

Key Features:

📊 Backup Monitoring

  • Accurate backup type detection: Uses WAL-G's _D_ suffix naming convention to correctly distinguish full vs incremental backups
  • Base backup tracking: Incremental backups include base_backup label showing which full backup they're based on
  • Dual timestamps: Separate start and finish timestamps for backup duration calculation
  • Rich metadata: Backup name, type, WAL file, LSN range, permanence status
  • Success tracking: Only counts successful backups

📈 WAL Stream Monitoring

  • WAL completion timestamps per timeline
  • WAL integrity status monitoring per timeline
  • LSN lag calculations in bytes
  • Missing WAL segment detection
  • Timeline-based analytics

🔍 Storage Health Monitoring

  • Storage connectivity health checks with latency measurement
  • Automated storage aliveness detection
  • Connection failure tracking and alerting

PITR & Recovery Monitoring

  • Point-in-time recovery window size calculations
  • Recovery capability assessment
  • Backup retention visibility

🔧 Operational Metrics

  • Exporter scrape duration and error tracking
  • WAL-G operation error classification and counting
  • Performance monitoring for the exporter itself

📊 Metrics Provided

Backup Metrics

walg_backup_start_timestamp{backup_name, backup_type, wal_file, start_lsn, finish_lsn, permanent, base_backup}
walg_backup_finish_timestamp{backup_name, backup_type, wal_file, start_lsn, finish_lsn, permanent, base_backup}  
walg_backup_count{backup_type}

Critical Labels:

  • backup_type: full or delta (correctly determined by _D_ suffix presence)
  • base_backup: For incremental backups, shows which full backup they're based on
  • backup_name: Complete backup identifier
  • LSN range and WAL file information
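
To make the label set concrete, the following sketch renders one sample line in the Prometheus text exposition format using only the standard library. The helper formatSample and the sample label values are illustrative assumptions. A real exporter would normally let the Prometheus client library produce this output.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// labelEscaper escapes backslash, double quote, and newline in label values,
// as the text exposition format requires.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// formatSample renders name{labels} value with labels sorted by key.
// Hypothetical helper for illustration only.
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, k+`="`+labelEscaper.Replace(labels[k])+`"`)
	}
	return name + "{" + strings.Join(pairs, ",") + "} " +
		strconv.FormatFloat(value, 'f', -1, 64)
}

func main() {
	fmt.Println(formatSample("walg_backup_finish_timestamp", map[string]string{
		"backup_name": "base_000000010000000500000007_D_000000010000000000000025",
		"backup_type": "delta",
		"base_backup": "base_000000010000000000000025",
		"permanent":   "false",
	}, 1757980800))
}
```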

WAL Metrics

walg_wal_timestamp{timeline}
walg_lsn_lag_bytes{timeline}
walg_wal_integrity_status{timeline}
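
A PostgreSQL LSN is written as two hexadecimal numbers separated by a slash, the high and low 32 bits of a 64-bit position in the WAL. The sketch below shows how a byte lag between two LSNs can be derived from that representation. The function names parseLSN and lsnLagBytes are assumptions for this example, and which pair of LSNs the exporter actually compares is not specified here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLSN converts an LSN such as "0/3000060" into a 64-bit byte position.
func parseLSN(s string) (uint64, error) {
	hi, lo, ok := strings.Cut(s, "/")
	if !ok {
		return 0, fmt.Errorf("invalid LSN %q: missing '/'", s)
	}
	h, err := strconv.ParseUint(hi, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	l, err := strconv.ParseUint(lo, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	return h<<32 | l, nil
}

// lsnLagBytes returns how far behind is compared to ahead, clamped at zero.
func lsnLagBytes(ahead, behind uint64) uint64 {
	if behind >= ahead {
		return 0
	}
	return ahead - behind
}

func main() {
	current, err := parseLSN("0/3000060")
	if err != nil {
		panic(err)
	}
	archived, err := parseLSN("0/2000028")
	if err != nil {
		panic(err)
	}
	fmt.Println(lsnLagBytes(current, archived)) // 16777272
}
```

Clamping at zero avoids an unsigned underflow when the two positions are read at slightly different moments.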

Storage & Health Metrics

walg_storage_alive
walg_storage_latency_seconds
walg_pitr_window_seconds
walg_errors_total{operation, error_type}
walg_scrape_duration_seconds
walg_scrape_errors_total

🔧 Technical Implementation Highlights

Correct Backup Type Detection

One of the key technical achievements is accurate backup type classification:

The Problem: Naive implementations often mark ALL backups as "full" because every backup name starts with the base_ prefix.

The Solution: This exporter correctly uses WAL-G's actual naming convention:

  • Full backups: base_000000010000000000000025 (no _D_ suffix)
  • Incremental backups: base_000000010000000500000007_D_000000010000000000000025 (contains _D_)

func (b *BackupInfo) IsFullBackup() bool {
    // Incremental/delta backups have "_D_" in their name; full backups do not
    return !strings.Contains(b.BackupName, "_D_")
}

func (b *BackupInfo) GetBaseBackupName() string {
    // Extract base backup identifier from incremental backup name
    deltaIndex := strings.Index(b.BackupName, "_D_")
    if deltaIndex < 0 {
        // Full backups have no base backup
        return ""
    }
    baseIdentifier := b.BackupName[deltaIndex+len("_D_"):]
    return "base_" + baseIdentifier
}
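
For a quick, standalone check of the naming rule, the sketch below classifies the two example names from this description. The helper classifyBackupName is hypothetical and self-contained. It is not the PR's BackupInfo type, and it assumes Go 1.18+ for strings.Cut.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyBackupName derives the backup_type and base_backup label values
// from a backup name, using the "_D_" marker described above.
func classifyBackupName(name string) (backupType, baseBackup string) {
	_, suffix, isDelta := strings.Cut(name, "_D_")
	if !isDelta {
		return "full", "" // no marker: full backup, no base
	}
	return "delta", "base_" + suffix
}

func main() {
	names := []string{
		"base_000000010000000000000025",
		"base_000000010000000500000007_D_000000010000000000000025",
	}
	for _, n := range names {
		t, base := classifyBackupName(n)
		fmt.Printf("%s type=%s base_backup=%q\n", n, t, base)
	}
}
```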

⏱️ Dual Timestamp Architecture

  • walg_backup_start_timestamp - When backup operation started
  • walg_backup_finish_timestamp - When backup completed successfully
  • Benefits: Calculate backup duration, precise recovery point tracking, better alerting

🧪 Comprehensive Testing Framework

  • Mock WAL-G implementation for development and CI
  • Realistic test data generation
  • Full integration testing capabilities

🚀 Usage

Basic Usage

cd exporter
go build -o walg-exporter .
./walg-exporter --walg.path=/usr/local/bin/wal-g

Configuration Options

./walg-exporter \
  --web.listen-address=":9351" \
  --walg.path="/usr/local/bin/wal-g" \
  --scrape.interval="60s"

Prometheus Integration

scrape_configs:
  - job_name: 'walg-exporter'
    static_configs:
      - targets: ['localhost:9351']
    scrape_interval: 60s

📈 Monitoring Examples

Backup Age Monitoring

# Hours since the most recent successful backup
(time() - max(walg_backup_finish_timestamp)) / 3600

# Backup duration tracking
walg_backup_finish_timestamp - walg_backup_start_timestamp

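# Incremental backup chain analysis: seconds between each delta backup and the full backup it is based on
# (assumes a delta's base_backup label equals the full backup's backup_name)
walg_backup_finish_timestamp{backup_type="delta"} - on(base_backup) group_left label_replace(walg_backup_finish_timestamp{backup_type="full"}, "base_backup", "$1", "backup_name", "(.+)")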

Storage Health

# Storage connectivity issues
walg_storage_alive == 0

# High storage latency alerts
walg_storage_latency_seconds > 5

🧪 Testing

Development Testing

# Test with mock WAL-G
./walg-exporter --walg.path=./mock-wal-g --scrape.interval=10s

# Verify metrics
curl http://localhost:9351/metrics

Integration Testing

# Test with real WAL-G
./walg-exporter --walg.path=/path/to/wal-g

# Health check
curl http://localhost:9351/

📋 Files Added

This PR adds the complete /exporter directory with:

  • exporter.go - Core Prometheus collector (466 lines)
  • main.go - HTTP server and CLI interface
  • pitr.go - PITR window calculation logic
  • wal_lag.go - LSN parsing and lag calculation
  • mock-wal-g - Testing mock script
  • README.md - Comprehensive documentation
  • go.mod/go.sum - Go module configuration

🎯 Value Proposition

This exporter transforms WAL-G from a "black box" backup solution into a fully observable system:

  • Visibility: Complete backup and WAL operation monitoring
  • Accuracy: Correct backup type classification (fixes common misclassification issues)
  • Relationships: Track incremental backup chains and dependencies
  • Performance: Monitor backup duration, storage latency, and operation health
  • Integration: Seamless Prometheus/Grafana integration with rich labels
  • Production Ready: Comprehensive error handling, health checks, and operational metrics

🔗 Dependencies

The exporter requires:

  • Go 1.19+ for building
  • WAL-G binary accessible via PATH or --walg.path
  • Proper WAL-G configuration (storage credentials, etc.)
  • Network access to storage backend for health checks

📚 Documentation

Complete documentation is provided in /exporter/README.md including:

  • Installation and configuration guide
  • Metrics reference with examples
  • Grafana dashboard queries
  • Troubleshooting guide
  • Development and testing instructions

The exporter provides the following metrics:

### Backup Metrics
- `walg_backup_lag_seconds{backup_type}` - Time since last backup-push in seconds

For all timestamps, let's specify clearly whether it's the timestamp of the beginning of the process or the end of it.

- `walg_backup_count{backup_type}` - Number of backups (full/delta)

successful attempts only or all of them?

- `walg_backup_timestamp{backup_type}` - Timestamp of last backup

### WAL Metrics
- `walg_wal_lag_seconds{timeline}` - Time since last wal-push in seconds

Suggested change
- `walg_wal_lag_seconds{timeline}` - Time since last wal-push in seconds
- `walg_wal_lag_seconds{timeline}` - Time since last successful wal-push in seconds

Another question: "time since" is a derived metric. Isn't it better to export timestamps and let monitoring decide what to show to users/AI, raw timestamps or lag values (or both)?

- `walg_wal_integrity_status{timeline}` - WAL integrity status (1 = OK, 0 = ERROR)

### PITR Metrics
- `walg_pitr_window_seconds` - Point-in-time recovery window size in seconds

what if we have gaps / multiple windows?
