
nicolevanderhoeven/spice-runner

 
 


Spice Runner

This is a demo project created by me (Nicole van der Hoeven) for my talk at Øredev 2025 in Malmö, Sweden, on November 5th, 2025. The talk is called "The Spice Must Flow: The Fremen Guide to Sustainable Observability". You can find my slides live here or you can download the PDF with speaker notes here. See the very end of this README for some references I used in the slides. Below is the abstract of the talk.

Talk title cover

Nobody wants to be wasteful. But how do you balance that with the need for enough data to fix things when they go wrong? After all, no matter how much it costs, there are SLOs to keep. The spice-- the telemetry-- must flow.

The Fremen are a people who live on the planet Arrakis, simultaneously rich in spice while poor in water. But the Fremen have managed to thrive, not just survive, on very little while continuing to produce spice. What would it take to apply this "desert planet thinking" to observability?

In this talk, you'll learn about the hidden environmental and financial costs of your observability stack and how to reduce them. You'll learn about the complications of measuring cost for ephemeral resources on Kubernetes, how to use KEDA to "right-size" resources when you don't know what the right size is, and how to convince your teams to drop all but the most meaningful telemetry data. In the end, the spice must flow, but there must still be enough water to sustain all.

Spice Runner is a Dune-themed browser game that is instrumented end to end and runs on Kubernetes with both pod and node autoscaling. The project demonstrates Kubernetes deployment patterns that combine metrics, logs, traces, and energy monitoring with KEDA-driven scaling.

Spice Runner gameplay

You can deploy this game yourself, or play it at nvdh.dev/spice.

Features

This project includes the following features:

  • Interactive Game: Dune-themed endless runner browser game
  • Leaderboard System: Full-stack leaderboard with Go API, PostgreSQL, Redis, and OpenTelemetry instrumentation
  • Full Instrumentation: Faro for the frontend, Alloy for telemetry collection
  • Full Observability: Metrics, logs, and traces with Prometheus, Loki, and Tempo
  • Distributed Tracing: OpenTelemetry instrumentation across frontend and backend services
  • Pod Autoscaling: KEDA scales pods on HTTP traffic, CPU, and memory
  • Cluster Autoscaler: GKE Cluster Autoscaler provisions nodes when pods are pending
  • Real-time Monitoring: Grafana dashboards
  • Energy Monitoring: Kepler integration for power consumption and energy tracking
  • Load Testing: k6 scripts

Architecture

The application uses a multi-layer architecture with integrated observability and autoscaling:

Spice Runner architecture

Leaderboard System

The Spice Runner includes a production-grade leaderboard system that demonstrates real-world OpenTelemetry instrumentation:

  • Public Leaderboard Page - View top 10 players at /spice/leaderboard.html
  • Go API with full OpenTelemetry tracing and custom metrics
  • PostgreSQL for persistent score storage
  • Redis caching layer for performance optimization
  • Distributed tracing showing request flow from frontend → API → database → cache
  • Grafana dashboard visualizing leaderboard data, traces, and performance metrics

Key observability features:

  • Trace every score submission from browser to database
  • Track cache hit ratios and database query performance
  • Identify bottlenecks with span attributes (e.g., slow COUNT queries)
  • Correlate metrics with traces using exemplars
  • Monitor anti-cheat validation and error rates

See docs/LEADERBOARD-SYSTEM.md for detailed documentation.

Deploy the Leaderboard

# Quick deploy (requires GCP project ID)
export GCP_PROJECT_ID=your-project-id
./deploy-leaderboard.sh

# Or deploy manually
cd leaderboard-api
docker build --platform linux/amd64 -t gcr.io/$GCP_PROJECT_ID/spice-runner-leaderboard:latest .
docker push gcr.io/$GCP_PROJECT_ID/spice-runner-leaderboard:latest

kubectl apply -f ../k8s/leaderboard-postgres.yaml
kubectl apply -f ../k8s/leaderboard-redis.yaml
kubectl apply -f ../k8s/leaderboard-api.yaml
kubectl apply -f ../k8s/leaderboard-dashboard.yaml

The Grafana dashboard is automatically provisioned via ConfigMap and appears in the "Leaderboard" folder.

Quick start

This guide helps you deploy the Spice Runner game to Google Kubernetes Engine (GKE) with full observability and autoscaling.

Before you begin, ensure you have the following:

  • A GKE cluster running
  • kubectl configured to access your cluster
  • gcloud CLI installed and authenticated
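
The prerequisites above can be checked before running any deployment command. The following is a minimal sketch, not part of the repository: the function names `missing_tools` and `cluster_reachable` are hypothetical helpers introduced here for illustration.

```shell
# Hypothetical helpers (not part of this repository).

# Print the name of every required CLI tool that is not on PATH.
# Usage: missing_tools kubectl gcloud
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Return non-zero if kubectl cannot reach the currently configured cluster.
cluster_reachable() {
  kubectl cluster-info >/dev/null 2>&1
}
```

Running `missing_tools kubectl gcloud` prints nothing when both tools are installed, and `cluster_reachable || echo "kubectl cannot reach a cluster"` gives a quick signal that the kubeconfig context needs attention.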

Deploy to GKE

To deploy the application to GKE, run the following commands:

# Configure your cluster
export CLUSTER_NAME="spice-runner-cluster"
export REGION="us-central1"
export ZONE="us-central1-a"
export GCP_PROJECT_ID=$(gcloud config get-value project)

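# Create the observability namespace if it does not exist yet (safe to re-run)
kubectl create namespace observability --dry-run=client -o yaml | kubectl apply -f -

# Create Grafana admin credentials secret
kubectl create secret generic grafana-admin-credentials \
  --from-literal=admin-user=admin \
  --from-literal=admin-password=YOUR_PASSWORD_HERE \
  -n observability \
  --dry-run=client -o yaml | kubectl apply -f -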

# Deploy the observability stack
kubectl apply -f k8s/observability-stack.yaml

# Deploy the application
kubectl apply -f k8s/deployment-cloud-stack.yaml
kubectl apply -f k8s/service.yaml

# Apply KEDA autoscaling
kubectl apply -f k8s/keda-scaledobject.yaml

# Enable GKE Cluster Autoscaler
gcloud container clusters update "$CLUSTER_NAME" \
  --enable-autoscaling \
  --node-pool=default-pool \
  --min-nodes=1 \
  --max-nodes=10 \
  --zone="$ZONE"

Autoscaling

The application supports two levels of autoscaling to handle varying workloads efficiently.

KEDA (Pod autoscaling)

KEDA provides horizontal pod autoscaling based on multiple metrics:

  • Min replicas: 1 (always available, prevents 502 errors)
  • Max replicas: 20 (production), 200 (demo)
  • Triggers: HTTP request rate, CPU utilization, memory utilization
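
To get a feel for how the HTTP trigger translates traffic into pods, the sketch below applies the general Horizontal Pod Autoscaler rule for average-value metrics: desired replicas = ceil(total metric / threshold), clamped to the min and max replica counts. It is a simplification under stated assumptions: it ignores the HPA tolerance band, stabilization windows, and the fact that the highest result across all triggers wins. The function name `desired_replicas` is hypothetical and only accepts whole-number rates.

```shell
# Hypothetical helper: estimate the replica count for one average-value trigger.
# Usage: desired_replicas RATE THRESHOLD MIN MAX   (all non-negative integers)
desired_replicas() {
  rate=$1
  threshold=$2
  min=$3
  max=$4

  # Integer ceiling division: ceil(rate / threshold)
  want=$(( (rate + threshold - 1) / threshold ))

  # Clamp to the configured replica bounds
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi

  printf '%s\n' "$want"
}

desired_replicas 45 10 1 20    # 45 req/s at a 10 req/s threshold -> 5
desired_replicas 450 10 1 20   # capped by max replicas -> 20
desired_replicas 0 10 1 20     # no traffic, held at min replicas -> 1
```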

For setup details, refer to the KEDA testing guide.

Production vs Demo Configuration

The KEDA configuration in k8s/keda-scaledobject.yaml includes two sets of values:

Production Configuration (Current)

  • Purpose: Cost-effective, stable operation for real-world usage
  • maxReplicaCount: 20 - Reasonable limit for actual workload
  • pollingInterval: 30s - Reduces API load on Prometheus
  • cooldownPeriod: 300s (5 min) - Prevents scaling flapping
  • HTTP threshold: 10 req/s - Better pod utilization (fewer, busier pods)
  • activationThreshold: 1 req/s - Requires actual traffic
  • CPU threshold: 70% - Allows good resource utilization

Demo Configuration (For Live Demonstrations)

  • Purpose: Highly responsive scaling for visual demonstrations
  • maxReplicaCount: 200 - Shows extreme scaling capability
  • pollingInterval: 5s - Very responsive to changes
  • cooldownPeriod: 30s - Fast scale-down for demos
  • HTTP threshold: 1 req/s - Approximates 1 pod per session
  • activationThreshold: 0.2 req/s - Triggers on any activity
  • CPU threshold: 50% - Scales up more aggressively
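
The two profiles above map onto fields of a KEDA ScaledObject. The sketch below renders a manifest for either profile so that the differences are easy to compare. It is not the contents of k8s/keda-scaledobject.yaml: the function name `render_scaledobject` and the Prometheus query (including the metric name `http_requests_total`) are assumptions for illustration, and the memory trigger is omitted because its threshold is not listed above. The object name, target deployment, and Prometheus address are taken from elsewhere in this README.

```shell
# Hypothetical helper: print a ScaledObject manifest for one profile.
# Usage: render_scaledobject production|demo
render_scaledobject() {
  case "$1" in
    production) max=20;  poll=30; cooldown=300; http=10; activation=1;   cpu=70 ;;
    demo)       max=200; poll=5;  cooldown=30;  http=1;  activation=0.2; cpu=50 ;;
    *) echo "usage: render_scaledobject production|demo" >&2; return 1 ;;
  esac

  cat <<EOF
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: spice-runner-keda
spec:
  scaleTargetRef:
    name: spice-runner
  minReplicaCount: 1
  maxReplicaCount: ${max}
  pollingInterval: ${poll}
  cooldownPeriod: ${cooldown}
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.observability.svc.cluster.local:9090
        # Assumed query; the real metric name may differ.
        query: sum(rate(http_requests_total[1m]))
        threshold: "${http}"
        activationThreshold: "${activation}"
    - type: cpu
      metricType: Utilization
      metadata:
        value: "${cpu}"
EOF
}
```

For example, `render_scaledobject demo | kubectl apply --dry-run=client -f -` checks the rendered manifest against the cluster without changing anything, assuming KEDA is already installed.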

To switch to demo mode, update the values in k8s/keda-scaledobject.yaml using the inline comments marked # DEMO VALUES, then run:

kubectl apply -f k8s/keda-scaledobject.yaml

Note on Scale-to-Zero: The configuration keeps minReplicaCount: 1 to ensure the service is always available. With GCP Ingress, scaling to zero (minReplicaCount: 0) causes 502 errors when users visit the site because:

  • No pods are running to handle requests
  • KEDA can't detect incoming traffic without running pods
  • The load balancer fails before KEDA can scale up

To implement true scale-to-zero without 502 errors, consider:

  • KEDA HTTP Add-on (queues requests during cold starts)
  • Knative Serving (built-in request queuing)
  • Google Cloud Run (managed scale-to-zero)

GKE Cluster Autoscaler (Node Autoscaling)

GKE Cluster Autoscaler handles node autoscaling:

  • Automatic node provisioning: Adds nodes when pods cannot be scheduled
  • Node removal: Removes underutilized nodes to save costs
  • Production-ready: Fully supported by Google
  • Configuration: Minimum 1 node, maximum 10 nodes
  • Integration with KEDA: When KEDA adds pods that cannot be scheduled, the Cluster Autoscaler adds nodes for them

Testing

You can test the autoscaling behavior using automated load tests or manual scaling.

Load testing

To run automated load tests, use the following commands:

# KEDA load tests
./scripts/run-hpa-test.sh

Manual testing

To manually test scaling behavior, run the following commands:

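# Scale up manually (the KEDA-managed HPA resets the replica count on its next sync)
kubectl scale deployment spice-runner --replicas=20

# Watch autoscaling (each command blocks, so run them in separate terminals)
kubectl get pods -w
kubectl get nodes -w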

Documentation

The following documentation provides detailed guides for setup, configuration, and operations.

Leaderboard system

Autoscaling

Monitoring and operations

Setup and configuration

Project structure

The project is organized as follows:

spice-runner/
├── k8s/                                # Kubernetes manifests
│   ├── deployment-cloud-stack.yaml        # Main game deployment
│   ├── service.yaml                       # Game service
│   ├── keda-scaledobject.yaml            # KEDA autoscaling
│   ├── observability-stack.yaml          # Grafana, Prometheus, Loki, Tempo
│   ├── leaderboard-postgres.yaml         # PostgreSQL for leaderboard
│   ├── leaderboard-redis.yaml            # Redis cache
│   ├── leaderboard-api.yaml              # Leaderboard API service
│   ├── leaderboard-dashboard.yaml        # Leaderboard Grafana dashboard
│   ├── kepler.yaml                       # Kepler energy monitoring
│   └── kepler-dashboard.yaml             # Kepler Grafana dashboard
├── leaderboard-api/                    # Go API with OpenTelemetry
│   ├── main.go                            # API implementation
│   ├── Dockerfile                         # Container image
│   └── go.mod                             # Go dependencies
├── scripts/                            # Automation scripts
│   ├── install-keda.sh                    # Install KEDA
│   ├── deploy-kepler.sh                   # Deploy Kepler
│   ├── run-hpa-test.sh                    # Run KEDA tests
│   ├── hpa-load-test.js                   # Load testing
│   ├── mega-spike-test.js                 # Spike testing
│   ├── ultimate-demo-test.js              # Comprehensive demo
│   ├── faro-init.js                       # Faro RUM initialization
│   ├── faro-instrumentation.js            # Game instrumentation
│   ├── leaderboard-client.js              # Leaderboard frontend
│   └── otel-metrics.js                    # Game session metrics
├── grafana-dashboards/                 # Grafana dashboards
│   └── leaderboard-observability.json     # Leaderboard dashboard
├── docs/                               # Documentation
│   └── LEADERBOARD-SYSTEM.md              # Leaderboard guide
├── img/                                # Game graphics
├── index.html                          # Game frontend with leaderboard
├── leaderboard.html                    # Public leaderboard page (top 10)
├── nginx.conf                          # NGINX configuration
├── Dockerfile                          # Game container image
└── deploy-leaderboard.sh               # Leaderboard deployment script

Monitoring URLs

After you deploy the application, you can access the following services:

  • Game: http://<YOUR_DOMAIN>/spice/
  • Leaderboard: http://<YOUR_DOMAIN>/spice/leaderboard.html - Public leaderboard showing top 10 players
  • Grafana: Access via LoadBalancer IP (requires login with credentials set during deployment)
    • Get IP: kubectl get service grafana -n observability
    • Default username: admin
    • Password: Set via the grafana-admin-credentials secret
    • Kepler Dashboard: Available in Grafana under "Energy" folder - "Kepler Energy & Power Consumption"
    • Leaderboard Dashboard: Available in Grafana under "Leaderboard" folder - "Spice Runner - Leaderboard Observability"
  • Leaderboard API: http://<YOUR_DOMAIN>/spice/leaderboard/api/health
  • Prometheus: http://prometheus.observability.svc.cluster.local:9090
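
The Prometheus URL above only resolves inside the cluster, and the Grafana address has to be looked up from the LoadBalancer service. The sketch below wraps both steps. The function names `grafana_ip` and `prometheus_forward` are hypothetical. The service name `prometheus` is inferred from the cluster DNS name above.

```shell
# Hypothetical helpers (not part of this repository).

# Print the external IP assigned to the Grafana LoadBalancer service.
grafana_ip() {
  kubectl get service grafana -n observability \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Forward local port 9090 to the in-cluster Prometheus service.
# Blocks until interrupted; Prometheus is then reachable at http://localhost:9090.
prometheus_forward() {
  kubectl port-forward -n observability service/prometheus 9090:9090
}
```

`grafana_ip` prints an empty string while the load balancer is still being provisioned, so an empty result usually means "try again in a minute" rather than a misconfiguration.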

Cost Optimization

The project is configured with production-friendly values to optimize costs:

KEDA Pod Autoscaling

  • Production mode: Currently active (see Production vs Demo Configuration)
    • Max 20 pods (vs 200 in demo mode)
    • 30-second polling interval (vs 5 seconds)
    • 5-minute cooldown (vs 30 seconds)
    • Better pod utilization (10 req/s threshold vs 1 req/s)
  • Cost impact: Runs 1 pod minimum (~$1-3/month), scales only when needed

GKE Cluster Autoscaler

  • Automatic scaling: Adds nodes only when needed, removes when idle
  • Resource efficiency: Right-sizes the cluster based on actual demand
  • Cost savings: Reduces waste by scaling down during low traffic
  • Configuration: Adjust --min-nodes and --max-nodes to control costs

Additional Optimization Tips

  • Set appropriate resource requests and limits on pods (already configured)
  • Use preemptible/spot node pools for non-critical workloads
  • Monitor usage in Grafana and adjust autoscaling parameters
  • Switch to demo configuration only when needed for demonstrations

Troubleshooting

This section helps you diagnose and resolve common issues.

Pods not scaling

To diagnose pod scaling issues, run the following commands:

# Check KEDA
kubectl get scaledobject
kubectl describe scaledobject spice-runner-keda

# Check metrics
kubectl get --raw /apis/external.metrics.k8s.io/v1beta1
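
KEDA implements a ScaledObject by creating a Horizontal Pod Autoscaler named `keda-hpa-<scaledobject-name>`, so inspecting that HPA often shows why scaling did or did not happen. The sketch below assumes the ScaledObject lives in the current namespace and that KEDA was installed with its default namespace (`keda`) and operator deployment name (`keda-operator`). All three function names are hypothetical.

```shell
# Hypothetical helpers (not part of this repository).

# Derive the name of the HPA that KEDA generates for a ScaledObject.
keda_hpa_name() {
  printf 'keda-hpa-%s\n' "$1"
}

# Show current and target metric values plus recent scaling events.
inspect_keda_hpa() {
  hpa=$(keda_hpa_name "${1:-spice-runner-keda}")
  kubectl get hpa "$hpa"
  kubectl describe hpa "$hpa"
}

# Tail the KEDA operator logs (assumes a default KEDA install).
keda_operator_logs() {
  kubectl logs -n keda deployment/keda-operator --tail=50
}
```

Calling `inspect_keda_hpa` with no argument inspects `keda-hpa-spice-runner-keda`.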

Nodes not scaling

If nodes aren't added while pods are pending, run the following commands:

# Check cluster autoscaler status
kubectl get events --all-namespaces | grep -i autoscal

# Check node pool autoscaling configuration
gcloud container node-pools describe default-pool \
  --cluster="$CLUSTER_NAME" \
  --zone="$ZONE"
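
The Cluster Autoscaler only reacts to pods that the scheduler could not place, so it helps to confirm that such pods exist. A minimal sketch using standard kubectl field selectors follows. The function names `pending_pods` and `unschedulable_events` are hypothetical.

```shell
# Hypothetical helpers (not part of this repository).

# List pods in any namespace that are still waiting to be scheduled or started.
pending_pods() {
  kubectl get pods --all-namespaces --field-selector=status.phase=Pending
}

# List scheduler events that explain why pods could not be placed.
unschedulable_events() {
  kubectl get events --all-namespaces --field-selector=reason=FailedScheduling
}
```

If `pending_pods` returns nothing, the node count is probably not the limiting factor and the pod-level checks are the better place to look.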

High costs

If you experience unexpectedly high costs, take the following actions:

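  • Review instance types: Run kubectl get nodes -L node.kubernetes.io/instance-type to see the machine type of each active node
  • Check node pool limits: Run gcloud container node-pools describe default-pool --cluster=$CLUSTER_NAME --zone=$ZONE to verify the autoscaling limits
  • Set up billing alerts in GCP Console to monitor spending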

Grafana login issues

If you cannot log into Grafana:

# Verify the secret exists
kubectl get secret grafana-admin-credentials -n observability

# Check if Grafana is using the secret
kubectl describe deployment grafana -n observability | grep -A 5 "Environment"

# Reset the password if needed
kubectl delete secret grafana-admin-credentials -n observability
kubectl create secret generic grafana-admin-credentials \
  --from-literal=admin-user=admin \
  --from-literal=admin-password=NEW_PASSWORD_HERE \
  -n observability

# Restart Grafana to pick up new credentials
kubectl rollout restart deployment/grafana -n observability

References

Images


Built with: Kubernetes, KEDA, GKE Cluster Autoscaler, Grafana Alloy, Prometheus, Loki, Tempo, Kepler, and k6.

About

That "unable to connect" Chrome dino game, except you're a Fremen and you want to apply water discipline to Kubernetes clusters.
