feat: Prometheus dashboards — Phase 1 (12 dashboards + 3 new collectors)#58

Merged: imajaydwivedi merged 4 commits into dev from feat/prometheus-dashboards-phase-1 on Apr 19, 2026

@imajaydwivedi

Owner

Phase 1 — Prometheus-backed Grafana dashboards

Ports 12 SQLMonitor dashboards to a new sql_exporter/Prometheus-Dashboards/ folder, and adds 3 new sql_exporter collectors to cover the metrics those dashboards depend on.

New sql_exporter collectors (3)

| File | Purpose | Cardinality |
|---|---|---|
| mssql_sqlagent_jobs.collector.yml | Per-job status / outcome / duration / next-run / step-failures-24h from msdb | 1 row per job |
| mssql_backup_history.collector.yml | Per-(database, backup_type) last-time/size/duration/age from msdb.dbo.backupset | 1 row per (db, backup_type) |
| mssql_xevent.collector.yml | Aggregated events/cpu/duration/reads/writes over last 5 min, per (event, db, result, app). Capped at TOP 500 rows | bounded |

Registered in sql_exporter.yml under two new jobs: mssql_msdb, mssql_xevent.

Generated dashboards (12)

Each dashboard is built from a small Python spec (_specs/<name>.py) via generate.py.

| UID | Title | Data panels | Deep-link tiles |
|---|---|---|---|
| prom_core_metrics_trend | Core Metrics - Trend | 9 | 0 |
| prom_wait_stats | Wait Stats | 4 | 0 |
| prom_disk_space | Disk Space | 5 | 0 |
| prom_ag_health_state | Ag Health State | 3 | 0 |
| prom_sql_agent_jobs | SQL Agent Jobs | 6 | 0 |
| prom_backup_history | Backup History | 6 | 0 |
| prom_xevent_trend | XEvent - Trend | 4 | 0 |
| prom_database_file_io_stats | Database File IO Stats | 12 | 0 |
| prom_dba_inventory | DBA Inventory | 6 | 8 |
| prom_monitoring_live_all_servers | Monitoring - Live - All Servers | 15 | 6 |
| prom_monitoring_live_distributed | Monitoring - Live - Distributed | 52 | 6 |
| prom_monitoring_perfmon_quest | Monitoring - Perfmon Counters - Quest Softwares - Distributed | 51 | 4 |

Totals: 173 Prometheus-backed data panels + 24 legacy_link_panel(...) deep-link tiles for panels that require the SQLMonitor inventory DB (alert history, AG-vs-nonAG backup split, LAMA config-change deltas, dm_os_memory_clerks snapshot, tempdb/log_space cache tables, sql_server_patching).
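As a quick cross-check of the totals, the per-dashboard counts listed above can be summed directly. This is only an arithmetic sketch; the `counts` dictionary is a hand-copied illustration and is not part of the repository.

```python
# (data panels, deep-link tiles) per dashboard UID, copied from the list above.
counts = {
    "prom_core_metrics_trend": (9, 0),
    "prom_wait_stats": (4, 0),
    "prom_disk_space": (5, 0),
    "prom_ag_health_state": (3, 0),
    "prom_sql_agent_jobs": (6, 0),
    "prom_backup_history": (6, 0),
    "prom_xevent_trend": (4, 0),
    "prom_database_file_io_stats": (12, 0),
    "prom_dba_inventory": (6, 8),
    "prom_monitoring_live_all_servers": (15, 6),
    "prom_monitoring_live_distributed": (52, 6),
    "prom_monitoring_perfmon_quest": (51, 4),
}

data_panels = sum(data for data, _ in counts.values())
link_tiles = sum(links for _, links in counts.values())
print(len(counts), data_panels, link_tiles)  # 12 173 24
```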

High-fidelity PromQL patterns

  • increase(metric[$__range]) — selective-duration deltas (File IO, Wait Stats).
  • @ end() offset $__range — prior-window comparison tables.
  • quantile_over_time($percentile_q, (expr)[$trend_window:]) — Core Metrics - Trend percentile trends.
  • topk($top_n, sum by (…) (…)) — XEvent / wait-type / memory-consumer trends.
  • time() - timestamp(up == 1) — data-collection-issue detection.
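The patterns above can be sketched as small string builders. This is a minimal illustration, not the code in `_specs/`: the function names and the example metric and label names are assumptions, while `$__range`, `$percentile_q`, `$trend_window` and `$top_n` are the Grafana variables named in the list. Note that in PromQL the `@` and `offset` modifiers attach to a selector or subquery, so the prior-window form places them inside `increase(...)`.

```python
# Hypothetical helper names; they only assemble PromQL text for Grafana.

def range_delta(metric: str) -> str:
    """Counter delta over the dashboard's selected time range."""
    return f"increase({metric}[$__range])"


def prior_range_delta(metric: str) -> str:
    """Same delta, evaluated over the window just before the selected one."""
    return f"increase({metric}[$__range] @ end() offset $__range)"


def percentile_trend(expr: str) -> str:
    """Percentile of an expression over a subquery window."""
    return f"quantile_over_time($percentile_q, ({expr})[$trend_window:])"


def top_n_by(expr: str, *labels: str) -> str:
    """Keep only the top N aggregated series to bound what a panel renders."""
    return f"topk($top_n, sum by ({', '.join(labels)}) ({expr}))"


def seconds_since_last_up() -> str:
    """Expression from the list above for data-collection-issue detection."""
    return "time() - timestamp(up == 1)"


# 'mssql_wait_time_seconds_total' and 'wait_type' are made-up names.
print(range_delta("mssql_wait_time_seconds_total"))
print(prior_range_delta("mssql_wait_time_seconds_total"))
print(top_n_by("rate(mssql_wait_time_seconds_total[5m])", "wait_type"))
```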

Regeneration & validation

cd sql_exporter/Prometheus-Dashboards
python3 generate.py                    # rebuild every dashboard
python3 _tools/validate.py             # structural JSON + target/expr sanity check

All 12 dashboards build clean and pass validation with no warnings.
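For readers who want a feel for what a "structural JSON + target/expr sanity check" might involve, here is a minimal sketch. It is not the contents of `_tools/validate.py`, which this PR page does not show; the specific checks (required keys, non-empty `expr`, balanced parentheses) are assumptions chosen for illustration.

```python
import json


def validate_dashboard(dashboard: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means clean."""
    problems = []
    for key in ("uid", "title", "panels"):
        if key not in dashboard:
            problems.append(f"missing top-level key: {key}")

    for panel in dashboard.get("panels", []):
        title = panel.get("title", "<untitled>")
        # Rows and text (deep-link) panels carry no queries.
        if panel.get("type") in ("row", "text"):
            continue
        targets = panel.get("targets", [])
        if not targets:
            problems.append(f"{title}: no targets")
        for target in targets:
            expr = target.get("expr", "").strip()
            if not expr:
                problems.append(f"{title}: empty expr")
            elif expr.count("(") != expr.count(")"):
                problems.append(f"{title}: unbalanced parentheses in expr")
    return problems


def validate_file(path: str) -> list[str]:
    """Load one generated *.json file and run the checks above."""
    with open(path, encoding="utf-8") as handle:
        return validate_dashboard(json.load(handle))
```

A wrapper could call `validate_file` on every `*.json` in the folder and exit non-zero if any list is non-empty.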

Next phases

  • Phase 2: 5 text-bound dashboards (WhoIsActive Workload, XEvent Workload, SQLMonitor-Alerts, Blitz Server Health, BlitzIndex Analysis) as numeric-summary + deep-link dashboards.
  • Phase 3: sql_exporter/README-sql_exporter.md refresh + docs/deployment/prometheus.md cross-links.
  • Phase 4: deploy collectors to sqlmonitor / AgHost-1A / AgHost-1B and verify metrics on https://prometheus.ajaydwivedi.com.

Pull Request opened by Augment Code with guidance from the PR author

Add three new collectors and register them in sql_exporter.yml under
two new jobs (mssql_msdb, mssql_xevent):

- mssql_sqlagent_jobs: per-job enabled / last_run_outcome /
  last_run_duration_seconds / last_run_end_time_utc /
  next_run_time_utc / is_running / step_failures_last_24h, sourced
  from msdb.dbo.sysjobs/sysjobhistory/sysjobactivity/sysjobschedules.

- mssql_backup_history: per-(database, backup_type) last_time_utc /
  last_duration_seconds / last_size_bytes / last_compressed_size_bytes
  / age_seconds / count_last_24h, sourced from msdb.dbo.backupset.

- mssql_xevent: aggregated events_count / cpu_time_ms_sum /
  duration_seconds_sum / logical_reads_sum / physical_reads_sum /
  writes_sum per (event_name, database_name, result, client_app_name)
  over the most recent 5 minutes. Caps at TOP 500 to bound cardinality.
  Guarded with an existence check on DBA.dbo.xevent_metrics so the
  collector is a no-op on instances that don't run the XEvent
  collector proc.

Port 12 SQLMonitor Grafana dashboards to Prometheus under
sql_exporter/Prometheus-Dashboards/. Each dashboard is generated from a
small Python spec; every spec produces a *.json that imports directly
into Grafana (schemaVersion 42, __inputs-bound DS_PROMETHEUS).
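To illustrate what an `__inputs`-bound dashboard looks like, the sketch below assembles a minimal importable skeleton. The `__inputs` entry and the `${DS_PROMETHEUS}` datasource reference follow Grafana's usual export layout; the function names and the example panel are assumptions and are not taken from `_lib/build.py`.

```python
import json

# Panels refer to the datasource through the input placeholder, which
# Grafana asks the user to resolve at import time.
PROM_DATASOURCE = {"type": "prometheus", "uid": "${DS_PROMETHEUS}"}


def dashboard_skeleton(uid: str, title: str, panels: list) -> dict:
    """Assemble a minimal dashboard dict (hypothetical helper)."""
    return {
        "__inputs": [
            {
                "name": "DS_PROMETHEUS",
                "label": "Prometheus",
                "description": "",
                "type": "datasource",
                "pluginId": "prometheus",
                "pluginName": "Prometheus",
            }
        ],
        "uid": uid,
        "title": title,
        "schemaVersion": 42,
        "panels": panels,
    }


def to_json(dashboard: dict) -> str:
    """Serialize with stable formatting so regenerated files diff cleanly."""
    return json.dumps(dashboard, indent=2, sort_keys=False) + "\n"


example_panel = {
    "id": 1,
    "type": "timeseries",
    "title": "Example panel",
    "datasource": PROM_DATASOURCE,
    "targets": [{"refId": "A", "expr": "up"}],
}
print(to_json(dashboard_skeleton("prom_wait_stats", "Wait Stats", [example_panel])))
```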

Dashboards (UID / data panels / text-link panels):

  prom_core_metrics_trend             9 / 0
  prom_wait_stats                     4 / 0
  prom_disk_space                     5 / 0
  prom_ag_health_state                3 / 0
  prom_sql_agent_jobs                 6 / 0
  prom_backup_history                 6 / 0
  prom_xevent_trend                   4 / 0
  prom_database_file_io_stats        12 / 0
  prom_dba_inventory                  6 / 8
  prom_monitoring_live_all_servers   15 / 6
  prom_monitoring_live_distributed   52 / 6
  prom_monitoring_perfmon_quest      51 / 4

Helper library (_lib/prom_dashboard.py, _lib/build.py):
  - Panel, Target, query_var, custom_var, constant_var dataclasses.
  - row() and legacy_link_panel() helpers.
  - build_dashboard(): per-panel-type options, thresholds, transforms.
  - write_dashboard(): JSON serialization.
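A rough idea of how dataclass-based helpers of this kind could fit together is sketched below. The class and function names mirror the ones listed above, but every field, default and the `panel_to_json` function are assumptions made for illustration; the real definitions live in `_lib/` and are not shown on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Target:
    """One PromQL query on a panel (fields are illustrative)."""
    expr: str
    legend: str = ""
    ref_id: str = "A"


@dataclass
class Panel:
    """A data panel described declaratively in a spec (fields are illustrative)."""
    title: str
    type: str = "timeseries"
    unit: str = "short"
    targets: list[Target] = field(default_factory=list)


def row(title: str) -> dict:
    """A Grafana row panel used to group the panels that follow it."""
    return {"type": "row", "title": title, "collapsed": False, "panels": []}


def panel_to_json(panel: Panel, panel_id: int) -> dict:
    """Hypothetical conversion step a build function might perform per panel."""
    return {
        "id": panel_id,
        "title": panel.title,
        "type": panel.type,
        "datasource": {"type": "prometheus", "uid": "${DS_PROMETHEUS}"},
        "fieldConfig": {"defaults": {"unit": panel.unit}, "overrides": []},
        "targets": [
            {"refId": t.ref_id, "expr": t.expr, "legendFormat": t.legend}
            for t in panel.targets
        ],
    }


# A spec would then reduce to a short list of rows and panels.
spec = [
    row("Overview"),
    Panel("Scrape health", targets=[Target("up", legend="{{instance}}")]),
]
```

Keeping specs as plain data and doing all JSON shaping in one build step is what lets a single generator rebuild every dashboard consistently.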

Specs (_specs/*.py) use high-fidelity PromQL patterns:
  - increase(metric[$__range]) for selective-duration deltas.
  - @ end() offset $__range for prior-window comparison tables.
  - quantile_over_time($percentile_q, (expr)[$trend_window:]) for
    percentile trends.
  - topk($top_n, sum by (…) (…)) for bounded series rendering.

Panels that require the SQLMonitor inventory DB (alert history,
AG-vs-nonAG backup split, LAMA config-change deltas, dm_os_memory_clerks
snapshot, tempdb/log_space cache tables, sql_server_patching) use
legacy_link_panel(...) to markdown-link back to the SQL-backed
dashboard, keeping every source section accounted for.
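A deep-link tile of this kind can be expressed as a Grafana text panel in markdown mode. The sketch below is a stand-in, not the real `legacy_link_panel(...)`: its signature is not shown in this PR, so the function name, parameters and the example URL are all assumptions.

```python
def legacy_link_panel_sketch(title: str, legacy_url: str, reason: str) -> dict:
    """Build a text panel that links back to the SQL-backed dashboard.

    Hypothetical signature; only the text-panel JSON shape follows Grafana.
    """
    content = (
        f"**{title}** needs the SQLMonitor inventory DB ({reason}).\n\n"
        f"[Open the SQL-backed dashboard]({legacy_url})"
    )
    return {
        "type": "text",
        "title": title,
        "options": {"mode": "markdown", "content": content},
    }


# The URL is a placeholder, not a real dashboard path.
tile = legacy_link_panel_sketch(
    "Alert history",
    "/d/hypothetical_uid/sqlmonitor-alerts",
    "alert rows are not exported as metrics",
)
print(tile["options"]["content"])
```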

Developer tooling (_tools/):
  - inspect_panels.py: source dashboard panel inventory.
  - validate.py: structural JSON + target/expr sanity check; all 12
    generated dashboards validate clean.

- docs/prometheus.md: add rows for mssql_sqlagent_jobs, mssql_backup_history,
  mssql_xevent to the collectors table; add a 'Prometheus-backed dashboard
  pack' section with the 12 Phase 1 dashboards and regeneration commands.

- sql_exporter/README-sql_exporter.md: add a 'Collectors' table describing
  every mssql_*.collector.yml file, its job binding, scrape interval and
  the metric prefix it publishes.
@imajaydwivedi imajaydwivedi marked this pull request as ready for review April 19, 2026 11:19
@imajaydwivedi imajaydwivedi merged commit a79d372 into dev Apr 19, 2026
@imajaydwivedi imajaydwivedi deleted the feat/prometheus-dashboards-phase-1 branch April 19, 2026 11:20
imajaydwivedi added a commit that referenced this pull request Apr 19, 2026