feat: Add Prometheus/Grafana monitoring and observability stack #3

Open
klagrida wants to merge 13 commits into main from feat/add-prometheus-grafana

Conversation

@klagrida
Contributor

Implement kube-prometheus-stack for comprehensive cluster monitoring:

Infrastructure:

  • Helm values configuration (core/monitoring/prometheus-grafana-values.yaml)
  • kube-prometheus-stack with Prometheus Operator
  • Prometheus Server with 7-day retention and 5Gi storage
  • Grafana with pre-built Kubernetes dashboards
  • Alertmanager for alert management
  • Node Exporter for node metrics
  • Kube State Metrics for Kubernetes object metrics
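
The retention and storage figures above correspond to kube-prometheus-stack values along these lines — an illustrative excerpt only, not the contents of the repo's actual prometheus-grafana-values.yaml:

```yaml
prometheus:
  prometheusSpec:
    retention: 7d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 5Gi
grafana:
  resources:
    requests:
      memory: 128Mi
    limits:
      memory: 256Mi
```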

Components:

  • Prometheus: Metrics collection and storage (512Mi-1Gi memory)
  • Grafana: Visualization dashboards (128Mi-256Mi memory)
  • Alertmanager: Alert routing and management (64Mi-128Mi memory)
  • Exporters: Node and kube-state-metrics exporters

Installation:

  • Installation script (scripts/install-monitoring.sh)
  • Uses prometheus-community Helm repository
  • Automatic namespace creation
  • 15-minute timeout for complete stack deployment
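
The installation script presumably boils down to a Helm invocation of this shape. This is a sketch: the release name, namespace, and exact flags are assumptions beyond what the list above states.

```shell
#!/usr/bin/env sh
# Sketch of scripts/install-monitoring.sh (release name and namespace assumed).
set -eu

HELM="${HELM:-helm}"   # set HELM=echo to dry-run without a cluster
VALUES_FILE="${KLDP_MONITORING_VALUES_FILE:-core/monitoring/prometheus-grafana-values.yaml}"

install_monitoring() {
  $HELM repo add prometheus-community https://prometheus-community.github.io/helm-charts
  $HELM repo update
  $HELM upgrade --install monitoring prometheus-community/kube-prometheus-stack \
    --namespace monitoring --create-namespace \
    --values "$VALUES_FILE" \
    --wait --timeout 15m
}
# The real script would call install_monitoring here.
```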

Developer Experience:

  • Makefile targets: install-monitoring, grafana, prometheus, alertmanager
  • Port forwarding helpers for all UIs
  • Enhanced status command with monitoring pods and release info
  • Default admin credentials (admin/admin)
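
The port-forwarding helpers behind those Makefile targets can be sketched as a small mapping from UI name to service and port. The service names here assume kube-prometheus-stack defaults for a release named "monitoring" — verify with `kubectl -n monitoring get svc`.

```shell
#!/usr/bin/env sh
# Map a UI name to the service/port pair to port-forward (names assumed).
ui_target() {
  case "$1" in
    grafana)      echo "svc/monitoring-grafana 3000:80" ;;
    prometheus)   echo "svc/monitoring-kube-prometheus-prometheus 9090:9090" ;;
    alertmanager) echo "svc/monitoring-kube-prometheus-alertmanager 9093:9093" ;;
    *)            echo "unknown UI: $1" >&2; return 1 ;;
  esac
}
# Usage: kubectl -n monitoring port-forward $(ui_target grafana)
```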

Monitoring Capabilities:

  • Kubernetes cluster metrics (nodes, pods, deployments)
  • Resource utilization (CPU, memory, disk, network)
  • KLDP component metrics (Airflow, MinIO, Spark)
  • Custom scrape configs for application monitoring
  • Pre-configured alerting rules for critical issues

Default Dashboards:

  • Kubernetes cluster overview
  • Node resource usage
  • Pod resource usage
  • Persistent volume monitoring
  • Network I/O and latency

Integrations:

  • Airflow metrics scraping (webserver, scheduler)
  • MinIO metrics collection
  • Spark Operator metrics
  • Custom ServiceMonitor support
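
The custom ServiceMonitor support can be exercised with a manifest along these lines — a hypothetical example for scraping MinIO; the metadata, labels, port name, and release selector are assumptions, not taken from this PR:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: minio              # hypothetical name
  namespace: monitoring
  labels:
    release: monitoring    # must match the Prometheus serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames: ["storage"]
  selector:
    matchLabels:
      app.kubernetes.io/name: minio   # assumed service label
  endpoints:
    - port: minio-api                 # assumed port name
      path: /minio/v2/metrics/cluster
      interval: 30s
```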

Configuration Highlights:

  • Optimized for local development resources
  • 7-day data retention
  • NodePort services for easy access
  • Default Kubernetes dashboards enabled
  • Alerting rules for common issues

Enables comprehensive observability:

  • Performance monitoring and optimization
  • Resource usage tracking
  • Troubleshooting and debugging
  • Capacity planning
  • SLA/SLO tracking

🤖 Generated with Claude Code

khalil and others added 13 commits December 14, 2025 15:13
Enhance CI workflow to validate all components installation and functionality:

**Extended Testing Coverage:**
- ✅ Airflow with KubernetesExecutor (existing)
- ✅ MinIO object storage (NEW)
- ✅ Spark Operator (NEW)
- ✅ Prometheus/Grafana monitoring stack (NEW)

**Changes:**
- Increased job timeout from 45 to 90 minutes
- Added MinIO installation and verification
- Added Spark Operator installation and verification
- Added Prometheus/Grafana installation and verification
- Enhanced resource reporting across all namespaces
- Improved failure diagnostics for all components

**MinIO Validation:**
- Helm install with OCI registry
- Pod readiness check (600s timeout)
- Storage namespace verification

**Spark Operator Validation:**
- Helm install from spark-operator repository
- Operator pod readiness check (300s timeout)
- CRD availability verification

**Monitoring Stack Validation:**
- kube-prometheus-stack installation
- Prometheus and Grafana pod readiness (600s each)
- Monitoring namespace verification

**Enhanced Debugging:**
- Resource status for all 4 namespaces (airflow, storage, spark, monitoring)
- Services and PVCs across all namespaces
- Events from all namespaces (last 20 per namespace)
- Failed pod descriptions for all namespaces

**Test Flow:**
1. DAG syntax validation
2. Minikube cluster setup
3. Install Airflow → verify
4. Install MinIO → verify
5. Install Spark Operator → verify
6. Install Monitoring → verify
7. Final component status check

This ensures the complete KLDP stack works end-to-end in CI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The specific image tags (bitnami/minio:2025.7.23-debian-12-r3) don't exist,
causing ImagePullBackOff errors in CI.

Let the Helm chart use its default compatible image versions instead.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Create separate CI values to speed up MinIO installation:
- Disabled persistence (use emptyDir instead of PVC)
- Disabled console to save resources
- Reduced resource limits (128Mi/256Mi)
- Single bucket for testing
- Faster timeout (10m vs 15m)

This avoids PVC provisioning delays and reduces resource usage in CI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This workflow isolates MinIO installation for diagnostic purposes:
- Tests Docker registry connectivity and DNS resolution
- Inspects Helm chart to verify default image tags
- Attempts to pull various MinIO image tags
- Pre-loads images into Minikube
- Installs MinIO with verbose Helm output
- Comprehensive pod status and event logging
- Tests alternative image sources (quay.io) if Bitnami fails
- Provides final diagnostic summary

Can be triggered manually via workflow_dispatch or auto-runs on
feat/add-prometheus-grafana branch for quick iteration.

This is temporary for debugging CI image pull failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replaced Unicode characters (✓, ✗, box drawing chars) with ASCII
equivalents ([OK], [FAIL], =) to fix YAML validation issues.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updated install-minio.sh to support KLDP_MINIO_VALUES_FILE env var
for CI values file override. Both CI workflows now use the script
instead of duplicating helm commands.

Benefits:
- Single source of truth for MinIO installation
- Easier to maintain and debug
- Consistent behavior between local and CI
- If it works locally, it works in CI

Environment variables:
- KLDP_MINIO_VERSION: Override MinIO chart version
- KLDP_MINIO_VALUES_FILE: Override values file path (for CI)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
No longer needed since we unified the installation approach.
Both local and CI now use scripts/install-minio.sh.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updated all installation scripts to support environment variable
overrides for values files, and updated CI to use scripts instead
of inline helm commands.

Scripts updated:
- install-airflow.sh: KLDP_AIRFLOW_VALUES_FILE
- install-spark.sh: KLDP_SPARK_VALUES_FILE
- install-monitoring.sh: KLDP_MONITORING_VALUES_FILE

Benefits:
- Single source of truth for ALL installations
- Consistent behavior between local and CI
- Easier to maintain and debug
- If it works locally, it works in CI

Environment variables for CI overrides:
- KLDP_AIRFLOW_VALUES_FILE (defaults to core/airflow/values.yaml)
- KLDP_MINIO_VALUES_FILE (defaults to core/storage/minio-values.yaml)
- KLDP_SPARK_VALUES_FILE (defaults to core/compute/spark-operator-values.yaml)
- KLDP_MONITORING_VALUES_FILE (defaults to core/monitoring/prometheus-grafana-values.yaml)
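
The override mechanism is plain POSIX parameter defaulting; only the variable names and default paths here come from this commit message:

```shell
#!/usr/bin/env sh
# Each script resolves its values file like this; CI exports the KLDP_*
# variable before invoking the script, local runs use the default.
AIRFLOW_VALUES="${KLDP_AIRFLOW_VALUES_FILE:-core/airflow/values.yaml}"
MINIO_VALUES="${KLDP_MINIO_VALUES_FILE:-core/storage/minio-values.yaml}"
echo "airflow values: $AIRFLOW_VALUES"
echo "minio values:   $MINIO_VALUES"
```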

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Removed CI-specific values files and unified configuration:
- Deleted: core/airflow/values-ci-emptydir.yaml
- Deleted: core/storage/minio-values-ci.yaml
- Optimized main values files to work well in both environments
- Updated CI workflow to use same values as local development

Benefits:
- CI validates the REAL configuration users run locally
- No configuration drift between environments
- Easier debugging: if it works locally, it works in CI
- Fewer files to maintain
- Single source of truth per component

Resource optimizations in core/airflow/values.yaml:
- Reduced memory requests to 256Mi for scheduler/webserver
- Reduced triggerer to 128Mi
- Still functional for local dev and fits in CI (2 CPUs, 4GB RAM)

Updated CLAUDE.md:
- Documented single source of truth approach
- Simplified CI troubleshooting section
- Updated local testing instructions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Airflow 3.x replaced the webserver with an api-server component.

Changes:
- Wait for scheduler instead of webserver (webserver no longer exists)
- Update port-forward command to use api-server service
- Update log commands to use correct component labels

This fixes CI failure where script was looking for non-existent
webserver component.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Reduce resource usage in CI by disabling triggerer component
which is not critical for basic validation.

Changes:
- Disable triggerer (saves 128Mi RAM, 100m CPU)
- Increase scheduler ready timeout from 5min to 10min
- Gives scheduler more time to stabilize in resource-constrained CI

This should fix the scheduler's 1/2-ready issue in CI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
