This is a demo project created by me (Nicole van der Hoeven) for my talk at Øredev 2025 in Malmö, Sweden, on November 5th, 2025. The talk is called "The Spice Must Flow: The Fremen Guide to Sustainable Observability". You can find my slides live here or you can download the PDF with speaker notes here. See the very end of this README for some references I used in the slides. Below is the abstract of the talk.
Nobody wants to be wasteful. But how do you balance that with the need for enough data to fix things when they go wrong? After all, no matter how much it costs, there are SLOs to keep. The spice-- the telemetry-- must flow.
The Fremen are a people who live on the planet Arrakis, simultaneously rich in spice while poor in water. But the Fremen have managed to thrive, not just survive, on very little while continuing to produce spice. What would it take to apply this "desert planet thinking" to observability?
In this talk, you'll learn about the hidden environmental and financial costs of your observability stack and how to reduce them. You'll learn about the complications of measuring cost for ephemeral resources on Kubernetes, how to use KEDA to "right-size" resources when you don't know what the right size is, and how to convince your teams to drop all but the most meaningful telemetry data. In the end, the spice must flow, but there must still be enough water to sustain all.
The Spice Runner game is a Dune-themed browser game with full observability and advanced autoscaling capabilities. This project demonstrates production-grade Kubernetes deployment patterns with comprehensive monitoring and intelligent scaling.
You can deploy this game yourself, or play it at nvdh.dev/spice.
This project includes the following features:
- Interactive Game: Dune-themed endless runner browser game
- Leaderboard System: Full-stack leaderboard with Go API, PostgreSQL, Redis, and OpenTelemetry instrumentation
- Full Instrumentation: Faro for frontend instrumentation, Alloy for telemetry collection
- Full Observability: Metrics, logs, and traces with Prometheus, Loki, and Tempo
- Distributed Tracing: OpenTelemetry instrumentation across frontend and backend services
- Pod Autoscaling: Pod autoscaling based on HTTP traffic, CPU, and memory with KEDA
- Cluster Autoscaler: Automatic node provisioning based on pending pods with GKE Cluster Autoscaler
- Real-time Monitoring: Grafana
- Energy Monitoring: Kepler integration for power consumption and energy tracking
- Load Testing: k6 for load testing
The application uses a multi-layer architecture with integrated observability and autoscaling:
The Spice Runner includes a production-grade leaderboard system that demonstrates real-world OpenTelemetry instrumentation:
- Public Leaderboard Page - View top 10 players at /spice/leaderboard.html
- Go API with full OpenTelemetry tracing and custom metrics
- PostgreSQL for persistent score storage
- Redis caching layer for performance optimization
- Distributed tracing showing request flow from frontend → API → database → cache
- Grafana dashboard visualizing leaderboard data, traces, and performance metrics
Key observability features:
- Trace every score submission from browser to database
- Track cache hit ratios and database query performance
- Identify bottlenecks with span attributes (e.g., slow COUNT queries)
- Correlate metrics with traces using exemplars
- Monitor anti-cheat validation and error rates
See docs/LEADERBOARD-SYSTEM.md for detailed documentation.
# Quick deploy (requires GCP project ID)
export GCP_PROJECT_ID=your-project-id
./deploy-leaderboard.sh
# Or deploy manually
cd leaderboard-api
docker build --platform linux/amd64 -t gcr.io/$GCP_PROJECT_ID/spice-runner-leaderboard:latest .
docker push gcr.io/$GCP_PROJECT_ID/spice-runner-leaderboard:latest
kubectl apply -f ../k8s/leaderboard-postgres.yaml
kubectl apply -f ../k8s/leaderboard-redis.yaml
kubectl apply -f ../k8s/leaderboard-api.yaml
kubectl apply -f ../k8s/leaderboard-dashboard.yaml
The Grafana dashboard is automatically provisioned via ConfigMap and appears in the "Leaderboard" folder.
This guide helps you deploy the Spice Runner game to Google Kubernetes Engine (GKE) with full observability and autoscaling.
Before you begin, ensure you have the following:
- A GKE cluster running
- kubectl configured to access your cluster
- gcloud CLI installed and authenticated
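A quick preflight check can confirm the command-line prerequisites before you start. The sketch below is not part of the repository: `missing_tools` is a hypothetical helper name, and it only checks that the binaries are on your PATH, not that they are authenticated or pointed at the right cluster.

```shell
# Hypothetical helper (not in the repo): print each named tool
# that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    # command -v exits non-zero when the tool is not found
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example: warn if kubectl or gcloud is missing.
# missing=$(missing_tools kubectl gcloud)
# [ -z "$missing" ] || echo "Install first: $missing" >&2
```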
To deploy the application to GKE, run the following commands:
# Configure your cluster
export CLUSTER_NAME="spice-runner-cluster"
export REGION="us-central1"
export ZONE="us-central1-a"
export GCP_PROJECT_ID=$(gcloud config get-value project)
# Create Grafana admin credentials secret
kubectl create secret generic grafana-admin-credentials \
--from-literal=admin-user=admin \
--from-literal=admin-password=YOUR_PASSWORD_HERE \
-n observability \
--dry-run=client -o yaml | kubectl apply -f -
# Deploy the observability stack
kubectl apply -f k8s/observability-stack.yaml
# Deploy the application
kubectl apply -f k8s/deployment-cloud-stack.yaml
kubectl apply -f k8s/service.yaml
# Apply KEDA autoscaling
kubectl apply -f k8s/keda-scaledobject.yaml
# Enable GKE Cluster Autoscaler
gcloud container clusters update $CLUSTER_NAME \
--enable-autoscaling \
--node-pool=default-pool \
--min-nodes=1 \
--max-nodes=10 \
--zone=$ZONE
The application supports two levels of autoscaling to handle varying workloads efficiently.
KEDA provides horizontal pod autoscaling based on multiple metrics:
- Min replicas: 1 (always available, prevents 502 errors)
- Max replicas: 20 (production), 200 (demo)
- Triggers: HTTP request rate, CPU utilization, memory utilization
For setup details, refer to the KEDA testing guide.
The KEDA configuration in k8s/keda-scaledobject.yaml includes two sets of values:
Production Configuration (Current)
- Purpose: Cost-effective, stable operation for real-world usage
- maxReplicaCount: 20 - Reasonable limit for actual workload
- pollingInterval: 30s - Reduces API load on Prometheus
- cooldownPeriod: 300s (5 min) - Prevents scaling flapping
- HTTP threshold: 10 req/s - Better pod utilization (fewer, busier pods)
- activationThreshold: 1 req/s - Requires actual traffic
- CPU threshold: 70% - Allows good resource utilization
Demo Configuration (For Live Demonstrations)
- Purpose: Highly responsive scaling for visual demonstrations
- maxReplicaCount: 200 - Shows extreme scaling capability
- pollingInterval: 5s - Very responsive to changes
- cooldownPeriod: 30s - Fast scale-down for demos
- HTTP threshold: 1 req/s - Approximates 1 pod per session
- activationThreshold: 0.2 req/s - Triggers on any activity
- CPU threshold: 50% - Scales up more aggressively
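To see how the two sets of thresholds change pod counts, the following sketch estimates the replica count for a given request rate. It assumes the HTTP trigger is evaluated as an average value per pod, so the target is roughly ceil(rate / threshold), clamped to the minimum and maximum replica counts. The real result also depends on the CPU and memory triggers and on how the Prometheus query in k8s/keda-scaledobject.yaml is written. `desired_replicas` is a hypothetical helper, and it accepts whole-number rates only.

```shell
# Hypothetical helper: estimate replicas for an HTTP-rate trigger.
# Usage: desired_replicas RATE THRESHOLD MIN MAX  (integers, req/s)
# The ( ... ) body runs in a subshell so the variables stay local.
desired_replicas() (
  rate=$1
  threshold=$2
  min=$3
  max=$4
  # Integer ceiling division: ceil(rate / threshold)
  want=$(( (rate + threshold - 1) / threshold ))
  # Clamp to the configured replica bounds
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
)

# 45 req/s with production values (threshold 10, max 20):
#   desired_replicas 45 10 1 20
# 45 req/s with demo values (threshold 1, max 200):
#   desired_replicas 45 1 1 200
```

With the same 45 req/s of traffic, this estimate gives 5 pods under the production values and 45 under the demo values, which is the "fewer, busier pods" trade-off described above.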
To switch to demo mode, update the values in k8s/keda-scaledobject.yaml using the inline comments marked # DEMO VALUES, then run:
kubectl apply -f k8s/keda-scaledobject.yaml
Note on Scale-to-Zero: The configuration keeps minReplicaCount: 1 to ensure the service is always available. With GCP Ingress, scaling to zero (minReplicaCount: 0) causes 502 errors when users visit the site because:
- No pods are running to handle requests
- KEDA can't detect incoming traffic without running pods
- The load balancer fails before KEDA can scale up
To implement true scale-to-zero without 502 errors, consider:
- KEDA HTTP Add-on (queues requests during cold starts)
- Knative Serving (built-in request queuing)
- Google Cloud Run (managed scale-to-zero)
GKE Cluster Autoscaler provides production-ready node autoscaling:
- Automatic node provisioning: Adds nodes when pods cannot be scheduled
- Node removal: Removes underutilized nodes to save costs
- Production-ready: Fully supported by Google
- Configuration: Configured for min 1 node, max 10 nodes
- Seamless integration: Works automatically with KEDA pod autoscaling
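Because the Cluster Autoscaler reacts to pods that cannot be scheduled, counting Pending pods is a quick way to see whether a scale-up should be under way. This is a sketch, not a script from the repository: `count_pending` is a hypothetical helper that assumes the default column layout of kubectl get pods --no-headers in a single namespace (NAME, READY, STATUS, and so on).

```shell
# Hypothetical helper: count pods whose STATUS column is "Pending".
# Reads `kubectl get pods --no-headers` output on stdin.
count_pending() {
  # n + 0 forces a numeric 0 when no line matched
  awk '$3 == "Pending" { n++ } END { print n + 0 }'
}

# Example:
# kubectl get pods --no-headers | count_pending
```

If the count stays above zero and no new nodes appear, the node pool limits are a reasonable first thing to check.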
You can test the autoscaling behavior using automated load tests or manual scaling.
To run automated load tests, use the following commands:
# KEDA load tests
./scripts/run-hpa-test.sh
To manually test scaling behavior, run the following commands:
# Scale up
kubectl scale deployment spice-runner --replicas=20
# Watch autoscaling
kubectl get pods -w
kubectl get nodes -w
The following documentation provides detailed guides for setup, configuration, and operations.
- Leaderboard system guide - Complete documentation with OpenTelemetry instrumentation
- Leaderboard API README - API-specific documentation
- KEDA quickstart - Quick KEDA setup
- KEDA testing guide - KEDA testing procedures
- HPA testing guide - Horizontal Pod Autoscaler guide
- Pod-per-session scaling - Advanced scaling patterns
- GKE cluster autoscaler - Node-level autoscaling configuration
- Observability and autoscaling - Integration guide
- Kubernetes monitoring setup - Kubernetes monitoring configuration
- Kubernetes Dashboard guide - Dashboard usage
- Pod-per-session monitoring - Session-based monitoring
- Kepler deployment summary - Energy monitoring deployment
- Kepler guide - Power consumption monitoring
- Implementation summary - Implementation overview
The project is organized as follows:
spice-runner/
├── k8s/ # Kubernetes manifests
│ ├── deployment-cloud-stack.yaml # Main game deployment
│ ├── service.yaml # Game service
│ ├── keda-scaledobject.yaml # KEDA autoscaling
│ ├── observability-stack.yaml # Grafana, Prometheus, Loki, Tempo
│ ├── leaderboard-postgres.yaml # PostgreSQL for leaderboard
│ ├── leaderboard-redis.yaml # Redis cache
│ ├── leaderboard-api.yaml # Leaderboard API service
│ ├── leaderboard-dashboard.yaml # Leaderboard Grafana dashboard
│ ├── kepler.yaml # Kepler energy monitoring
│ └── kepler-dashboard.yaml # Kepler Grafana dashboard
├── leaderboard-api/ # Go API with OpenTelemetry
│ ├── main.go # API implementation
│ ├── Dockerfile # Container image
│ └── go.mod # Go dependencies
├── scripts/ # Automation scripts
│ ├── install-keda.sh # Install KEDA
│ ├── deploy-kepler.sh # Deploy Kepler
│ ├── run-hpa-test.sh # Run KEDA tests
│ ├── hpa-load-test.js # Load testing
│ ├── mega-spike-test.js # Spike testing
│ ├── ultimate-demo-test.js # Comprehensive demo
│ ├── faro-init.js # Faro RUM initialization
│ ├── faro-instrumentation.js # Game instrumentation
│ ├── leaderboard-client.js # Leaderboard frontend
│ └── otel-metrics.js # Game session metrics
├── grafana-dashboards/ # Grafana dashboards
│ └── leaderboard-observability.json # Leaderboard dashboard
├── docs/ # Documentation
│ └── LEADERBOARD-SYSTEM.md # Leaderboard guide
├── img/ # Game graphics
├── index.html # Game frontend with leaderboard
├── leaderboard.html # Public leaderboard page (top 10)
├── nginx.conf # NGINX configuration
├── Dockerfile # Game container image
└── deploy-leaderboard.sh # Leaderboard deployment script
After you deploy the application, you can access the following services:
- Game: http://<YOUR_DOMAIN>/spice/
- Leaderboard: http://<YOUR_DOMAIN>/spice/leaderboard.html - Public leaderboard showing top 10 players
- Grafana: Access via LoadBalancer IP (requires login with credentials set during deployment)
  - Get IP: kubectl get service grafana -n observability
  - Default username: admin
  - Password: Set via the grafana-admin-credentials secret
  - Kepler Dashboard: Available in Grafana under "Energy" folder - "Kepler Energy & Power Consumption"
  - Leaderboard Dashboard: Available in Grafana under "Leaderboard" folder - "Spice Runner - Leaderboard Observability"
- Leaderboard API: http://<YOUR_DOMAIN>/spice/leaderboard/api/health
- Prometheus: http://prometheus.observability.svc.cluster.local:9090
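Right after a deployment, the endpoints above can take a little while to respond. The sketch below retries a command until it succeeds. `retry` is a hypothetical helper that is not part of the repository, and the attempt count and delay in the example are arbitrary.

```shell
# Hypothetical helper: run a command until it succeeds.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 on the first success, 1 once ATTEMPTS tries have failed.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    "$@" && return 0
    # Give up without sleeping after the final attempt
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example: poll the leaderboard health endpoint up to 10 times, 3 s apart.
# retry 10 3 curl -fsS "http://<YOUR_DOMAIN>/spice/leaderboard/api/health"
```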
The project is configured with production-friendly values to optimize costs:
- Production mode: Currently active (see Production vs Demo Configuration)
- Max 20 pods (vs 200 in demo mode)
- 30-second polling interval (vs 5 seconds)
- 5-minute cooldown (vs 30 seconds)
- Better pod utilization (10 req/s threshold vs 1 req/s)
- Cost impact: Runs 1 pod minimum (~$1-3/month), scales only when needed
- Automatic scaling: Adds nodes only when needed, removes when idle
- Resource efficiency: Right-sizes the cluster based on actual demand
- Cost savings: Reduces waste by scaling down during low traffic
- Configuration: Adjust --min-nodes and --max-nodes to control costs
- Set appropriate resource requests and limits on pods (already configured)
- Use preemptible/spot node pools for non-critical workloads
- Monitor usage in Grafana and adjust autoscaling parameters
- Switch to demo configuration only when needed for demonstrations
This section helps you diagnose and resolve common issues.
To diagnose pod scaling issues, run the following commands:
# Check KEDA
kubectl get scaledobject
kubectl describe scaledobject spice-runner-keda
# Check metrics
kubectl get --raw /apis/external.metrics.k8s.io/v1beta1
If nodes aren't being added when pods are pending:
# Check cluster autoscaler status
kubectl get events --all-namespaces | grep -i autoscal
# Check node pool autoscaling configuration
gcloud container node-pools describe default-pool \
--cluster=CLUSTER_NAME \
--zone=ZONE
If you experience unexpectedly high costs, take the following actions:
- Review instance types: Run kubectl get nodes -o wide to see active node types
- Check node pool limits: Run gcloud container node-pools describe default-pool --cluster=CLUSTER_NAME --zone=ZONE to verify the autoscaling configuration
- Set up billing alerts in GCP Console to monitor spending
If you cannot log into Grafana:
# Verify the secret exists
kubectl get secret grafana-admin-credentials -n observability
# Check if Grafana is using the secret
kubectl describe deployment grafana -n observability | grep -A 5 "Environment"
# Reset the password if needed
kubectl delete secret grafana-admin-credentials -n observability
kubectl create secret generic grafana-admin-credentials \
--from-literal=admin-user=admin \
--from-literal=admin-password=NEW_PASSWORD_HERE \
-n observability
# Restart Grafana to pick up new credentials
kubectl rollout restart deployment/grafana -n observability
- Electricity map per country
- GCP Carbon Data Across Google Regions
- CNCF TAG: Environmental Sustainability: repo
- (blog) How Grafana Labs switched to Karpenter to reduce costs and complexities in Amazon EKS: blog
- Statistics on data centre energy usage: O'Brien, I (2024). Data center emissions probably 662% higher than big tech claims. Can it keep up the ruse? Retrieved from The Guardian in October 2025: https://www.theguardian.com/technology/2024/sep/15/data-center-gas-emissions-tech
- Statistics on global energy consumption: Ritchie, H., Rosado, P., & Roser, M. (2020). Energy production and consumption. Our World in Data. Retrieved in October 2025 from https://ourworldindata.org/energy-production-consumption
- Dune Art Wallpaper, HD Movies 4K Wallpapers, Images and Background - Wallpapers Den: wallpapersden.com
- Sci Fi Dune Planet, HD wallpaper: https://www.peakpx.com
- Mystery of the Spacing Guild in Lynch's 'Dune' Movie: https://dunenewsnet.com/2024/03/the-spacing-guild-mystery-david-lynch-dune-movie/
- The Dune: Prophecy TV Series Is Doing Much More Than Setting Up the Bene Gesserit: https://www.ign.com/articles/the-dune-prophecy-tv-series-is-doing-much-more-than-setting-up-the-bene-gesserit
- Discovering the Fremen, Part 2: ‘Dune’s Ecologists: https://dunenewsnet.com/2025/03/discovering-the-fremen-ecologists-of-dune/
- How Dune created the sinister sounds of those menacing sandworms: https://ew.com/awards/oscars/dune-sandworm-sound-engineers/
- Dune 2 Trailer Hints At A Tricky Timeline Change: https://www.inverse.com/entertainment/dune-2-trailer-paul-eye-color-future-visions
- Dune: Part Two Makes One Key Change From the Book. The Result Is Brilliant.: https://slate.com/culture/2024/03/dune-part-2-zendaya-chani-timothee-chalamet-paul-book-versus-movie.html
- Fremen Sietch Water Cache: https://www.pinterest.com/pin/fremen-sietch-water-cache-by-mattw--692498880203474723/
- Dune: what the climate of Arrakis can tell us about the hunt for habitable exoplanets: https://theconversation.com/dune-what-the-climate-of-arrakis-can-tell-us-about-the-hunt-for-habitable-exoplanets-225145
- What Is A Mentat In Dune?: https://www.thegamer.com/dune-what-is-a-mentat/
Built with: Kubernetes, KEDA, GKE Cluster Autoscaler, Grafana Alloy, Prometheus, Loki, Tempo, Kepler, and k6.


