feat: Add Prometheus/Grafana monitoring and observability stack #3

Open
klagrida wants to merge 13 commits into main from feat/add-prometheus-grafana

Conversation

@klagrida
Contributor

Implement kube-prometheus-stack for comprehensive cluster monitoring:

Infrastructure:

  • Helm values configuration (core/monitoring/prometheus-grafana-values.yaml)
  • kube-prometheus-stack with Prometheus Operator
  • Prometheus Server with 7-day retention and 5Gi storage
  • Grafana with pre-built Kubernetes dashboards
  • Alertmanager for alert management
  • Node Exporter for node metrics
  • Kube State Metrics for Kubernetes object metrics
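
The retention and storage figures above correspond to kube-prometheus-stack values along these lines — an illustrative excerpt only, not the contents of the repo's actual prometheus-grafana-values.yaml:

```yaml
prometheus:
  prometheusSpec:
    retention: 7d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 5Gi
grafana:
  resources:
    requests:
      memory: 128Mi
    limits:
      memory: 256Mi
```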

Components:

  • Prometheus: Metrics collection and storage (512Mi-1Gi memory)
  • Grafana: Visualization dashboards (128Mi-256Mi memory)
  • Alertmanager: Alert routing and management (64Mi-128Mi memory)
  • Exporters: Node and kube-state-metrics exporters

Installation:

  • Installation script (scripts/install-monitoring.sh)
  • Uses prometheus-community Helm repository
  • Automatic namespace creation
  • 15-minute timeout for complete stack deployment
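
The installation script presumably boils down to a Helm invocation of this shape. This is a sketch: the release name, namespace, and exact flags are assumptions beyond what the list above states.

```shell
#!/usr/bin/env sh
# Sketch of scripts/install-monitoring.sh (release name and namespace assumed).
set -eu

HELM="${HELM:-helm}"   # set HELM=echo to dry-run without a cluster
VALUES_FILE="${KLDP_MONITORING_VALUES_FILE:-core/monitoring/prometheus-grafana-values.yaml}"

install_monitoring() {
  $HELM repo add prometheus-community https://prometheus-community.github.io/helm-charts
  $HELM repo update
  $HELM upgrade --install monitoring prometheus-community/kube-prometheus-stack \
    --namespace monitoring --create-namespace \
    --values "$VALUES_FILE" \
    --wait --timeout 15m
}
# The real script would call install_monitoring here.
```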

Developer Experience:

  • Makefile targets: install-monitoring, grafana, prometheus, alertmanager
  • Port forwarding helpers for all UIs
  • Enhanced status command with monitoring pods and release info
  • Default admin credentials (admin/admin)
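
The port-forwarding helpers behind those Makefile targets can be sketched as a small mapping from UI name to service and port. The service names here assume kube-prometheus-stack defaults for a release named "monitoring" — verify with `kubectl -n monitoring get svc`.

```shell
#!/usr/bin/env sh
# Map a UI name to the service/port pair to port-forward (names assumed).
ui_target() {
  case "$1" in
    grafana)      echo "svc/monitoring-grafana 3000:80" ;;
    prometheus)   echo "svc/monitoring-kube-prometheus-prometheus 9090:9090" ;;
    alertmanager) echo "svc/monitoring-kube-prometheus-alertmanager 9093:9093" ;;
    *)            echo "unknown UI: $1" >&2; return 1 ;;
  esac
}
# Usage: kubectl -n monitoring port-forward $(ui_target grafana)
```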

Monitoring Capabilities:

  • Kubernetes cluster metrics (nodes, pods, deployments)
  • Resource utilization (CPU, memory, disk, network)
  • KLDP component metrics (Airflow, MinIO, Spark)
  • Custom scrape configs for application monitoring
  • Pre-configured alerting rules for critical issues

Default Dashboards:

  • Kubernetes cluster overview
  • Node resource usage
  • Pod resource usage
  • Persistent volume monitoring
  • Network I/O and latency

Integrations:

  • Airflow metrics scraping (webserver, scheduler)
  • MinIO metrics collection
  • Spark Operator metrics
  • Custom ServiceMonitor support
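
The custom ServiceMonitor support can be exercised with a manifest along these lines — a hypothetical example for scraping MinIO; the metadata, labels, port name, and release selector are assumptions, not taken from this PR:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: minio              # hypothetical name
  namespace: monitoring
  labels:
    release: monitoring    # must match the Prometheus serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames: ["storage"]
  selector:
    matchLabels:
      app.kubernetes.io/name: minio   # assumed service label
  endpoints:
    - port: minio-api                 # assumed port name
      path: /minio/v2/metrics/cluster
      interval: 30s
```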

Configuration Highlights:

  • Optimized for local development resources
  • 7-day data retention
  • NodePort services for easy access
  • Default Kubernetes dashboards enabled
  • Alerting rules for common issues

Enables comprehensive observability:

  • Performance monitoring and optimization
  • Resource usage tracking
  • Troubleshooting and debugging
  • Capacity planning
  • SLA/SLO tracking

🤖 Generated with Claude Code

khalil and others added 13 commits December 14, 2025 15:13
Enhance CI workflow to validate all components installation and functionality:

**Extended Testing Coverage:**
- ✅ Airflow with KubernetesExecutor (existing)
- ✅ MinIO object storage (NEW)
- ✅ Spark Operator (NEW)
- ✅ Prometheus/Grafana monitoring stack (NEW)

**Changes:**
- Increased job timeout from 45 to 90 minutes
- Added MinIO installation and verification
- Added Spark Operator installation and verification
- Added Prometheus/Grafana installation and verification
- Enhanced resource reporting across all namespaces
- Improved failure diagnostics for all components

**MinIO Validation:**
- Helm install with OCI registry
- Pod readiness check (600s timeout)
- Storage namespace verification

**Spark Operator Validation:**
- Helm install from spark-operator repository
- Operator pod readiness check (300s timeout)
- CRD availability verification

**Monitoring Stack Validation:**
- kube-prometheus-stack installation
- Prometheus and Grafana pod readiness (600s each)
- Monitoring namespace verification

**Enhanced Debugging:**
- Resource status for all 4 namespaces (airflow, storage, spark, monitoring)
- Services and PVCs across all namespaces
- Events from all namespaces (last 20 per namespace)
- Failed pod descriptions for all namespaces

**Test Flow:**
1. DAG syntax validation
2. Minikube cluster setup
3. Install Airflow → verify
4. Install MinIO → verify
5. Install Spark Operator → verify
6. Install Monitoring → verify
7. Final component status check

This ensures the complete KLDP stack works end-to-end in CI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The specific image tags (bitnami/minio:2025.7.23-debian-12-r3) don't exist,
causing ImagePullBackOff errors in CI.

Let the Helm chart use its default compatible image versions instead.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Create separate CI values to speed up MinIO installation:
- Disabled persistence (use emptyDir instead of PVC)
- Disabled console to save resources
- Reduced resource limits (128Mi/256Mi)
- Single bucket for testing
- Faster timeout (10m vs 15m)

This avoids PVC provisioning delays and reduces resource usage in CI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This workflow isolates MinIO installation for diagnostic purposes:
- Tests Docker registry connectivity and DNS resolution
- Inspects Helm chart to verify default image tags
- Attempts to pull various MinIO image tags
- Pre-loads images into Minikube
- Installs MinIO with verbose Helm output
- Comprehensive pod status and event logging
- Tests alternative image sources (quay.io) if Bitnami fails
- Provides final diagnostic summary

Can be triggered manually via workflow_dispatch or auto-runs on
feat/add-prometheus-grafana branch for quick iteration.

This is temporary for debugging CI image pull failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replaced Unicode characters (✓, ✗, box drawing chars) with ASCII
equivalents ([OK], [FAIL], =) to fix YAML validation issues.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updated install-minio.sh to support KLDP_MINIO_VALUES_FILE env var
for CI values file override. Both CI workflows now use the script
instead of duplicating helm commands.

Benefits:
- Single source of truth for MinIO installation
- Easier to maintain and debug
- Consistent behavior between local and CI
- If it works locally, it works in CI

Environment variables:
- KLDP_MINIO_VERSION: Override MinIO chart version
- KLDP_MINIO_VALUES_FILE: Override values file path (for CI)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
No longer needed since we unified the installation approach.
Both local and CI now use scripts/install-minio.sh.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updated all installation scripts to support environment variable
overrides for values files, and updated CI to use scripts instead
of inline helm commands.

Scripts updated:
- install-airflow.sh: KLDP_AIRFLOW_VALUES_FILE
- install-spark.sh: KLDP_SPARK_VALUES_FILE
- install-monitoring.sh: KLDP_MONITORING_VALUES_FILE

Benefits:
- Single source of truth for ALL installations
- Consistent behavior between local and CI
- Easier to maintain and debug
- If it works locally, it works in CI

Environment variables for CI overrides:
- KLDP_AIRFLOW_VALUES_FILE (defaults to core/airflow/values.yaml)
- KLDP_MINIO_VALUES_FILE (defaults to core/storage/minio-values.yaml)
- KLDP_SPARK_VALUES_FILE (defaults to core/compute/spark-operator-values.yaml)
- KLDP_MONITORING_VALUES_FILE (defaults to core/monitoring/prometheus-grafana-values.yaml)
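
The override mechanism is plain POSIX parameter defaulting; only the variable names and default paths here come from this commit message:

```shell
#!/usr/bin/env sh
# Each script resolves its values file like this; CI exports the KLDP_*
# variable before invoking the script, local runs use the default.
AIRFLOW_VALUES="${KLDP_AIRFLOW_VALUES_FILE:-core/airflow/values.yaml}"
MINIO_VALUES="${KLDP_MINIO_VALUES_FILE:-core/storage/minio-values.yaml}"
echo "airflow values: $AIRFLOW_VALUES"
echo "minio values:   $MINIO_VALUES"
```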

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Removed CI-specific values files and unified configuration:
- Deleted: core/airflow/values-ci-emptydir.yaml
- Deleted: core/storage/minio-values-ci.yaml
- Optimized main values files to work well in both environments
- Updated CI workflow to use same values as local development

Benefits:
- CI validates the REAL configuration users run locally
- No configuration drift between environments
- Easier debugging: if it works locally, it works in CI
- Fewer files to maintain
- Single source of truth per component

Resource optimizations in core/airflow/values.yaml:
- Reduced memory requests to 256Mi for scheduler/webserver
- Reduced triggerer to 128Mi
- Still functional for local dev and fits in CI (2 CPUs, 4GB RAM)

Updated CLAUDE.md:
- Documented single source of truth approach
- Simplified CI troubleshooting section
- Updated local testing instructions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Airflow 3.x replaced the webserver with an api-server component.

Changes:
- Wait for scheduler instead of webserver (webserver no longer exists)
- Update port-forward command to use api-server service
- Update log commands to use correct component labels

This fixes CI failure where script was looking for non-existent
webserver component.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Reduce resource usage in CI by disabling triggerer component
which is not critical for basic validation.

Changes:
- Disable triggerer (saves 128Mi RAM, 100m CPU)
- Increase scheduler ready timeout from 5min to 10min
- Gives scheduler more time to stabilize in resource-constrained CI

This should fix the scheduler's 1/2-ready issue in CI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
