A cloud-native, fault-tolerant distributed task queue demonstrating advanced backend engineering concepts
Modern microservices frequently encounter operations that are:
- Slow (video transcoding, report generation, ML inference)
- Blocking (external API calls with unpredictable latency)
- Unreliable (third-party services with rate limits or transient failures)
Handling these synchronously in API handlers leads to:
- ❌ High API latency and poor user experience
- ❌ Resource exhaustion under load (thread/goroutine starvation)
- ❌ Lost jobs when servers crash mid-execution
- ❌ Cascading failures from downstream service overload
This project implements a distributed, asynchronous task queue that:
- ✅ Decouples ingestion from execution - API returns instantly while work happens in background
- ✅ Guarantees at-least-once delivery - accepted jobs survive worker crashes and are retried
- ✅ Enforces distributed rate limiting - Protects downstream services from overload
- ✅ Self-heals via zombie task recovery - Automatically retries stuck jobs
- ✅ Provides deep observability - Real-time metrics for queue depth, latency distribution, and errors
- 🛡️ At-Least-Once Delivery using Redis atomic operations (`RPOPLPUSH`)
- 🔄 Zombie Task Recovery via automatic reaper process
- 💾 Durable Task Storage in Redis (persistence configurable via RDB/AOF)
- 🚨 Graceful Degradation on rate limit failures
- ⚡ Bounded Concurrency with Go channels to prevent memory overload
- 📊 Horizontal Scalability - Add more consumer nodes as load increases
- 🎯 Efficient Serialization with Protocol Buffers
- 🔥 Non-blocking Dispatcher that backpressures automatically
- 📈 Prometheus Metrics (Gauges, Histograms, Counters)
- 🐳 Multi-Stage Docker Build for minimal container size (~10MB)
- 🔐 Type-Safe Task Definitions via Protobuf schemas
- 🌐 Distributed Rate Limiting with atomic Lua scripts
```
  1. ENQUEUE             2. ATOMIC CLAIM          3. PROCESS             4. COMPLETE
┌──────────┐            ┌──────────┐            ┌──────────┐            ┌──────────┐
│ Producer │            │Dispatcher│            │  Worker  │            │  Redis   │
│  LPUSH   │───────────→│RPOPLPUSH │───────────→│ Execute  │───────────→│   LREM   │
│          │            │ (Atomic) │            │   Task   │            │  Remove  │
└──────────┘            └──────────┘            └──────────┘            └──────────┘
      │                       │
      ▼                       ▼
queue:pending         queue:processing
      │                       │
      └───────────┬───────────┘
                  │
                  ▼
         If Worker Crashes?
                  │
                  ▼
          ┌──────────────┐
          │    Reaper    │
          │  Moves back  │
          │  to pending  │
          └──────────────┘
```
This project showcases production-level understanding across five critical domains:
| Concept | Implementation | Why It Matters |
|---|---|---|
| Bounded Worker Pool | Fixed-size goroutine pool (5 workers) with buffered channel | Prevents unbounded goroutine creation that could exhaust memory under heavy load |
| Backpressure Handling | Dispatcher blocks on channel send when workers are busy | Naturally throttles Redis polling rate to match processing capacity |
| Lock-Free Coordination | Go channels for worker communication instead of mutexes | Eliminates deadlock risks and reduces contention overhead |
Code Example:

```go
// Bounded channel acts as a natural throttle
var TaskChannel = make(chan *proto.Task, config.MaxWorkerPoolSize)

// Dispatcher blocks here if all workers are busy
TaskChannel <- task // This line provides automatic backpressure!
```
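On the consuming side, a fixed-size pool reads from that same channel. A minimal sketch of how it might be wired up (names like `startWorkerPool` and `handleTask` are illustrative, not the project's actual identifiers):

```go
// Exactly MaxWorkerPoolSize goroutines range over the shared channel,
// so in-flight concurrency can never exceed the configured bound.
func startWorkerPool(ctx context.Context) {
	for i := 1; i <= config.MaxWorkerPoolSize; i++ {
		workerID := i
		go func() {
			for task := range TaskChannel {
				// The channel hands work off without any mutexes.
				handleTask(ctx, workerID, task)
			}
		}()
	}
}
```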
| Concept | Implementation | Why It Matters |
|---|---|---|
| At-Least-Once Delivery | Redis `RPOPLPUSH` atomically moves the task to a processing queue before execution | Guarantees the task isn't lost even if the worker crashes mid-execution |
| Idempotency | Tasks must be designed to handle duplicate execution | Critical for retry scenarios where same task may run multiple times |
| Zombie Detection | Reaper goroutine scans processing queue every 30s | Recovers tasks from dead workers without manual intervention |
The Reliability Guarantee:
```go
// ATOMIC operation: the task is in exactly ONE queue at all times
result, err := rdb.BRPopLPush(ctx, config.PendingQueue, config.ProcessingQueue, 0).Result()

// If the worker crashes here, the task is still in the processing queue;
// the reaper will recover it after a timeout
```

| Concept | Implementation | Why It Matters |
|---|---|---|
| Race Condition Prevention | Lua script executes atomically on Redis server | Eliminates "check-then-set" race conditions across multiple workers |
| Fixed-Window Rate Limiting | Redis counter with TTL, incremented via Lua | Protects downstream services from overload across all consumer instances |
| Atomic Read-Modify-Write | GET → CHECK → INCR → EXPIRE all in a single Lua script | Prevents two workers from both thinking they're "request #100" |
The Race Condition Problem (Without Lua):
```
Worker A: GET count → 99          Worker B: GET count → 99
Worker A: CHECK (99 < 100) ✅     Worker B: CHECK (99 < 100) ✅
Worker A: INCR → 100              Worker B: INCR → 101  ❌ EXCEEDED!
```
The Solution (With Atomic Lua):
```lua
-- Runs atomically on the Redis server - no race conditions possible
local count = tonumber(redis.call('GET', key)) or 0
if count < max then
    redis.call('INCR', key)
    return 1 -- allowed
else
    return 0 -- denied
end
```

| Metric Type | Use Case | Example Query |
|---|---|---|
| Gauge | Current queue depth | `scheduler_task_queue_depth` - alert if > 1000 |
| Histogram | Latency distribution | `histogram_quantile(0.99, rate(scheduler_task_processing_duration_seconds_bucket[5m]))` |
| Counter | Error rate tracking | `rate(scheduler_worker_errors_total[5m])` |
Why This Matters:
- Detect Bottlenecks: If queue depth keeps growing, you need more workers
- SLA Monitoring: Track P95/P99 latency to ensure you meet performance targets
- Alerting: Set up PagerDuty alerts when error rate exceeds threshold
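For reference, the three metric types above could be declared in `metrics/metrics.go` roughly like this (a sketch using `client_golang`'s `promauto` helpers; variable names and bucket choices are assumptions, not the project's actual code):

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Gauge: current number of tasks waiting in queue:pending
	QueueDepth = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "scheduler_task_queue_depth",
		Help: "Current number of tasks in the pending queue.",
	})

	// Histogram: task execution latency distribution
	ProcessingDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "scheduler_task_processing_duration_seconds",
		Help:    "Task execution latency in seconds.",
		Buckets: prometheus.DefBuckets,
	})

	// Counter: total failed task executions
	WorkerErrors = promauto.NewCounter(prometheus.CounterOpts{
		Name: "scheduler_worker_errors_total",
		Help: "Total number of failed task executions.",
	})
)
```

Because `promauto` registers against the default registry, these metrics show up on the `/metrics` endpoint the consumer already exposes on `:9090`.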
| Technique | Implementation | Benefit |
|---|---|---|
| Multi-Stage Docker Build | `golang:alpine` → `scratch` | Reduces the final image from ~800MB to ~10MB |
| Static Binary Compilation | `CGO_ENABLED=0` | Binary runs on a minimal scratch image without libc dependencies |
| Horizontal Scaling | Stateless consumers | Deploy 10 consumer pods in Kubernetes to 10x throughput |
Dockerfile Strategy:
```dockerfile
# Stage 1: Build a static binary (no libc needed thanks to CGO_ENABLED=0)
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o consumer ./consumer

# Stage 2: Minimal runtime
FROM scratch
COPY --from=builder /app/consumer /consumer
ENTRYPOINT ["/consumer"]
```

| Component | Technology | Justification |
|---|---|---|
| Backend Language | Go 1.21+ | Superior concurrency primitives (goroutines/channels), low memory footprint, fast compilation |
| Message Broker | Redis 7.0+ | Atomic list operations (`RPOPLPUSH`) enable the reliable-queue pattern with far less operational overhead than Kafka/RabbitMQ |
| Serialization | Protocol Buffers | Type-safe, backward-compatible schemas with significantly smaller payloads than JSON |
| Monitoring | Prometheus + Grafana | Industry-standard time-series DB with rich query language (PromQL) |
| Containerization | Docker (Multi-Stage) | Minimal attack surface with scratch base image (~10MB final size) |
| Scripting | Lua (Redis) | Server-side atomic operations to prevent distributed race conditions |
- Docker & Docker Compose
- Go 1.21+
- Redis 7.0+ (or use Docker setup below)
- `protoc` compiler (for regenerating `.proto` files)
```bash
git clone https://github.com/heydeepakch/distributed-task-scheduler.git
cd distributed-task-scheduler

# Install dependencies
go mod download
```

```bash
# Option A: Using Docker (Recommended)
docker run -d -p 6379:6379 redis:7-alpine

# Option B: Local Redis
redis-server

# Start Prometheus (on a different port to avoid conflict)
docker run -d -p 9091:9090 -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
```

```bash
# Terminal 1: Start worker with 5 concurrent goroutines
go run consumer/main.go
# Expected output:
# --- Starting Distributed Task Scheduler Worker Node ---
# Successfully connected to Redis.
# Starting 5 Worker Goroutines...
# Dispatcher started. Pulling tasks from Redis...
# Metrics server listening on :9090
```

```bash
# Terminal 2: Enqueue 5 sample tasks
go run producer/main.go
# Expected output:
# --- Starting Producer Node (API Server) ---
# Producer: 🚀 Task abc-123 submitted successfully to Redis!
# Producer: 🚀 Task def-456 submitted successfully to Redis!
# ...
```

```bash
# Prometheus metrics endpoint (raw)
curl http://localhost:9090/metrics
# Key metrics to observe:
# - scheduler_task_queue_depth: Current backlog size
# - scheduler_task_processing_duration_seconds: Latency histogram
# - scheduler_worker_errors_total: Failure count
```

```bash
# Access Prometheus dashboard
open http://localhost:9091
# Sample PromQL queries:
# 1. Queue depth over time:
scheduler_task_queue_depth
# 2. P99 latency (99th percentile):
histogram_quantile(0.99, rate(scheduler_task_processing_duration_seconds_bucket[5m]))
# 3. Error rate per minute:
rate(scheduler_worker_errors_total[1m])
```

```
distributed-task-scheduler/
│
├── consumer/              # Worker node (task processor)
│   └── main.go            # Dispatcher, worker pool, reaper logic
│
├── producer/              # API server (task submitter)
│   └── main.go            # Enqueues tasks to Redis
│
├── proto/                 # Protobuf schemas & generated code
│   ├── task.proto         # Task message definition
│   └── task.pb.go         # Auto-generated Go code
│
├── worker/                # Business logic layer
│   └── executor.go        # Simulates task execution (3s sleep)
│
├── pkg/
│   └── ratelimiter/       # Distributed rate limiter
│       └── ratelimiter.go # Executes Lua script atomically
│
├── lua/
│   └── rate_limit.lua     # Fixed-window rate limiting algorithm
│
├── metrics/
│   └── metrics.go         # Prometheus metric definitions
│
├── config/
│   └── config.go          # Centralized configuration
│
├── Dockerfile             # Multi-stage build for minimal image
├── prometheus.yml         # Prometheus scrape configuration
├── go.mod                 # Dependency management
└── README.md              # This file
```
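For orientation before the walkthrough below, `config/config.go` presumably centralizes the names used throughout the snippets. A minimal sketch (the exact values here are assumptions, not the file's real contents):

```go
package config

// Queue names and pool size shared by producer, dispatcher, and reaper.
// Values shown here are illustrative - check config/config.go for the real ones.
const (
	PendingQueue      = "queue:pending"    // tasks waiting to be claimed
	ProcessingQueue   = "queue:processing" // tasks currently being executed
	MaxWorkerPoolSize = 5                  // bounded concurrency per consumer node
	RedisAddr         = "localhost:6379"   // default Redis endpoint (assumed)
)
```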
```go
// 1. Create a typed task
task := &proto.Task{
Id: uuid.New().String(),
Type: "EMAIL_SEND",
Payload: []byte(`{"to": "user@example.com"}`),
SubmittedAt: time.Now().Unix(),
}
// 2. Serialize to bytes (efficient binary format)
taskData, _ := proto.Marshal(task)
// 3. Enqueue to Redis (non-blocking)
rdb.LPush(ctx, "queue:pending", taskData)
```

Result: The task is durably stored in Redis. The producer can immediately return HTTP 202 Accepted to the client.
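As an illustration of that contract, an enqueue endpoint could look like the sketch below. This is hedged: the import paths (go-redis v9, an assumed module path for the generated protobuf package) and the handler shape are assumptions - the repository's producer/main.go may simply enqueue a batch of demo tasks rather than run an HTTP server.

```go
package producer

import (
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
	gproto "google.golang.org/protobuf/proto"

	pb "github.com/heydeepakch/distributed-task-scheduler/proto" // assumed module path
)

// enqueueHandler is a hypothetical HTTP handler sketch, not the project's actual code.
func enqueueHandler(rdb *redis.Client) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		payload, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}

		task := &pb.Task{
			Id:          uuid.New().String(),
			Type:        "EMAIL_SEND",
			Payload:     payload,
			SubmittedAt: time.Now().Unix(),
		}

		taskData, err := gproto.Marshal(task)
		if err != nil {
			http.Error(w, "serialization failed", http.StatusInternalServerError)
			return
		}

		// Durable enqueue: once LPush succeeds, a consumer will eventually run the task.
		if err := rdb.LPush(r.Context(), "queue:pending", taskData).Err(); err != nil {
			http.Error(w, "queue unavailable", http.StatusServiceUnavailable)
			return
		}

		// Execution happens asynchronously - acknowledge right away.
		w.WriteHeader(http.StatusAccepted)
		fmt.Fprintf(w, "task %s accepted\n", task.Id)
	}
}
```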
```go
// CRITICAL: Atomic move from pending → processing
// This guarantees task is "reserved" before execution
result, err := rdb.BRPopLPush(
ctx,
"queue:pending", // Source queue
"queue:processing", // Destination queue
0, // Block forever until task available
).Result()
// Deserialize and send to worker pool
task := &proto.Task{}
proto.Unmarshal([]byte(result), task)
// Send to bounded channel (blocks if workers are busy)
TaskChannel <- task // Natural backpressure!
```

Why `RPOPLPUSH`?
- Atomic: Task can't be "lost" between pop and push
- Visible: The task remains in the `processing` queue until explicitly removed
- Recoverable: The reaper can detect tasks stuck in the `processing` queue
```go
// 1. Check distributed rate limit (only for certain task types)
if task.Type == "EMAIL_SEND" {
allowed, count, err := ratelimiter.IsAllowed(
ctx, rdb,
"rate:email_service", // Global key across all workers
100, // Max 100 requests
60, // Per 60 seconds
)
if !allowed {
// CRITICAL: Return without LREM
// Task stays in processing queue
// Reaper will retry after timeout
return
}
}
// 2. Execute business logic
err := worker.ExecuteTask(task) // Simulates 3s of work
// 3. On success, remove from processing queue
if err == nil {
taskData, _ := proto.Marshal(task)
rdb.LRem(ctx, "queue:processing", 1, taskData)
}
```

Rate Limiting Deep Dive:
The Lua script runs atomically on the Redis server:
```lua
local count = redis.call('GET', KEYS[1])
count = tonumber(count) or 0
if count < max_requests then
redis.call('INCR', KEYS[1])
if count == 0 then
redis.call('EXPIRE', KEYS[1], window_seconds)
end
return {1, count + 1} -- Allowed
else
return {0, count} -- Denied
end
```

This prevents the race condition where two workers both think they're "request #100".
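On the Go side, `pkg/ratelimiter/ratelimiter.go` presumably wraps this script. A minimal sketch using go-redis's `Script` helper is shown below; assume `max_requests` and `window_seconds` arrive via ARGV, and treat the names and import path as illustrative rather than the project's actual code:

```go
package ratelimiter

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// rateLimitScript mirrors lua/rate_limit.lua: KEYS[1] is the counter key,
// ARGV[1] is max_requests, ARGV[2] is window_seconds.
var rateLimitScript = redis.NewScript(`
local count = tonumber(redis.call('GET', KEYS[1])) or 0
if count < tonumber(ARGV[1]) then
    redis.call('INCR', KEYS[1])
    if count == 0 then
        redis.call('EXPIRE', KEYS[1], ARGV[2])
    end
    return {1, count + 1}
else
    return {0, count}
end
`)

// IsAllowed reports whether another request fits in the current window,
// along with the window's running count.
func IsAllowed(ctx context.Context, rdb *redis.Client, key string, maxRequests, windowSeconds int) (bool, int64, error) {
	res, err := rateLimitScript.Run(ctx, rdb, []string{key}, maxRequests, windowSeconds).Result()
	if err != nil {
		return false, 0, err
	}

	// Lua tables come back as []interface{}; integers arrive as int64.
	vals := res.([]interface{})
	return vals[0].(int64) == 1, vals[1].(int64), nil
}
```

Because the script executes as a single Redis command, every worker on every consumer node shares one consistent counter.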
```go
ticker := time.NewTicker(30 * time.Second)
for range ticker.C {
// Move one task from processing back to pending
result, err := rdb.RPopLPush(
ctx,
"queue:processing",
"queue:pending",
).Result()
if err == redis.Nil {
// Processing queue is empty - all good!
continue
}
// Log which task was recovered
task := &proto.Task{}
proto.Unmarshal([]byte(result), task)
fmt.Printf("⚠️ ZOMBIE REAPED: Task %s\n", task.Id)
}
```

When Does This Trigger?
- Worker crashes mid-execution
- Rate limit causes worker to abandon task
- Network partition prevents LREM from completing
Result: Task is automatically retried without manual intervention.
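Because a reaped task may have already run partially (or even fully, if only the final `LREM` was lost), handlers should be idempotent. One simple guard is a Redis `SETNX` deduplication key, sketched below; the key naming, TTL, and the trade-off that a crash after the claim suppresses retries until the key expires are all assumptions of this sketch, not part of the repository:

```go
package worker

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// processOnce runs fn at most once per task ID by claiming a dedup key first.
// Illustrative pattern only - not code from this repository.
func processOnce(ctx context.Context, rdb *redis.Client, taskID string, fn func() error) error {
	// SETNX-style claim: only the first worker to claim this ID proceeds.
	claimed, err := rdb.SetNX(ctx, "dedup:"+taskID, "done", 24*time.Hour).Result()
	if err != nil {
		return err
	}
	if !claimed {
		return nil // a previous attempt already handled this task
	}

	if err := fn(); err != nil {
		// Release the claim so the reaper's retry can run the task again.
		rdb.Del(ctx, "dedup:"+taskID)
		return err
	}
	return nil
}
```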
| Metric Name | Type | Description | Alert Threshold |
|---|---|---|---|
| `scheduler_task_queue_depth` | Gauge | Current number of tasks in the pending queue | > 1000 (capacity issue) |
| `scheduler_task_processing_duration_seconds` | Histogram | Task execution latency distribution | P99 > 10s (performance degradation) |
| `scheduler_worker_errors_total` | Counter | Failed task count by error type | Rate > 5/min (systemic failure) |
```promql
# 1. Queue backlog trend (smoothed over 5 minutes)
avg_over_time(scheduler_task_queue_depth[5m])

# 2. 95th percentile latency
histogram_quantile(0.95,
  rate(scheduler_task_processing_duration_seconds_bucket[5m])
)

# 3. Error rate per second
rate(scheduler_worker_errors_total[1m])

# 4. Queue drain rate (tasks/sec)
rate(scheduler_task_processing_duration_seconds_count[1m])
```
- Add Prometheus as a data source: `http://prometheus:9090`
- Create panels with the above queries
- Set alerts:
  - Queue depth > 1000 for 5 minutes → PagerDuty
  - Error rate > 1% → Slack notification
  - P99 latency > 15s → Email to on-call engineer
- ✅ At-least-once delivery guarantee
- ✅ Distributed rate limiting
- ✅ Automatic zombie task recovery
- ✅ Prometheus metrics for observability
- ✅ Docker containerization
- ✅ Bounded concurrency to prevent overload
| Missing Feature | Why It Matters | How to Implement |
|---|---|---|
| Dead Letter Queue (DLQ) | Tasks that fail repeatedly (poison messages) should be quarantined | After 3 retries, move the task to `queue:dlq` for manual review |
| Task Priority | Urgent tasks should jump the queue | Use Redis Sorted Sets with priority scores instead of lists |
| Task Timeouts | Prevent runaway tasks from blocking workers forever | Add a per-task timeout in the worker with `context.WithTimeout` |
| Graceful Shutdown | Workers should finish in-flight tasks before terminating | Catch SIGTERM, stop the dispatcher, drain the channel, then exit (see the sketch after this table) |
| Distributed Tracing | Track task journey across producer → Redis → consumer | Add OpenTelemetry spans with trace IDs in the task payload |
| Schema Versioning | Handle backward-compatible protobuf schema changes | Use protobuf field numbers and optional fields |
| Multi-Tenancy | Isolate tasks by customer/team | Use separate queues per tenant or add tenant ID to task |
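Of the gaps above, graceful shutdown is the most mechanical to add. A minimal sketch of how the consumer could drain in-flight work on SIGTERM; the types and helpers below are stubbed stand-ins, not the project's real dispatcher code:

```go
package main

import (
	"context"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// Hypothetical stand-ins for the project's real types and functions.
type Task struct{ ID string }

var taskChannel = make(chan *Task, 5)

func dispatch(ctx context.Context)            { /* BRPopLPush loop; stops on ctx.Done() */ }
func handleTask(ctx context.Context, t *Task) { /* execute task, then LREM */ }

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for task := range taskChannel { // exits once the channel is closed and drained
				handleTask(ctx, task)
			}
		}()
	}

	dispatcherDone := make(chan struct{})
	go func() {
		dispatch(ctx) // stops claiming new tasks once ctx is cancelled
		close(dispatcherDone)
	}()

	<-sigCh            // 1. SIGTERM/SIGINT received
	cancel()           // 2. tell the dispatcher to stop
	<-dispatcherDone   // 3. dispatcher has stopped sending
	close(taskChannel) // 4. workers drain whatever is still buffered
	wg.Wait()          // 5. wait for in-flight tasks, then exit cleanly
}
```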
- Before: I thought channels were just "queues"
- After: I understand channels as synchronization primitives that provide backpressure and coordination
- Key Insight: `TaskChannel <- task` is both a queue write and a flow-control mechanism
- Before: I assumed Redis commands were "fast enough" to be reliable
- After: I learned about race conditions in distributed systems (two workers checking the same counter)
- Key Insight: Lua scripts are essential for maintaining correctness across multiple nodes
- Before: I thought "just retry on failure" was enough
- After: I learned about zombie tasks, idempotency, and the importance of visibility into in-flight work
- Key Insight: `RPOPLPUSH` moves tasks to a "visible" processing queue, making recovery possible
- Before: I added logging as an afterthought
- After: I designed metrics first to understand system behavior
- Key Insight: You can't fix what you can't measure. Histograms reveal latency issues that averages hide.
- Before: I used JSON everywhere because it's "readable"
- After: I learned protobuf provides type safety, backward compatibility, and 10x smaller payloads
- Key Insight: The `.proto` schema acts as a contract between producer and consumer
- Kubernetes Deployment Manifests (Deployment, Service, HPA)
- Grafana Dashboard JSON for one-click monitoring setup
- Benchmark Suite to measure throughput (tasks/sec) under various loads
- Dead Letter Queue for poison message handling (sketched below)
- Task Priority Queue using Redis Sorted Sets
- OpenTelemetry Integration for distributed tracing
- Graceful Shutdown to drain in-flight tasks on SIGTERM
- Admin Dashboard (Web UI) to inspect queue state
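For the dead-letter-queue item above, one possible shape is to count failures per task ID in Redis and divert poison messages to `queue:dlq` after three attempts, as the feature-gap table suggests. This is a hedged sketch only; the `retries:` key convention and this function do not exist in the repository:

```go
package worker

import (
	"context"

	"github.com/redis/go-redis/v9"
)

const maxRetries = 3

// routeFailure records a failed attempt and, once a task has failed maxRetries
// times, quarantines it in queue:dlq instead of letting the reaper retry forever.
func routeFailure(ctx context.Context, rdb *redis.Client, taskID string, taskData []byte) error {
	// Count this failure (the key naming is a hypothetical convention).
	retries, err := rdb.Incr(ctx, "retries:"+taskID).Result()
	if err != nil {
		return err
	}

	if retries >= maxRetries {
		// Poison message: park it for manual review.
		if err := rdb.LPush(ctx, "queue:dlq", taskData).Err(); err != nil {
			return err
		}
		return rdb.LRem(ctx, "queue:processing", 1, taskData).Err()
	}

	// Otherwise leave it in queue:processing; the reaper will retry it.
	return nil
}
```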
- Redis RPOPLPUSH Pattern - Reliable Queue Pattern
- Go Concurrency Patterns - Advanced channel usage
- Prometheus Best Practices - Metric naming conventions
- Protocol Buffers Guide - Schema design
- Designing Data-Intensive Applications (Book) - Distributed systems theory
This is a portfolio project demonstrating backend engineering skills. If you have suggestions for improvements or find issues:
- Open an issue describing the problem
- Submit a PR with test coverage
- Follow existing code style (run `gofmt`)
MIT License - feel free to use this as a learning resource or starting point for your own projects.
Built by Deepak Chaudhary as a demonstration of production-grade backend engineering skills for software developer roles.
Connect with me:
- GitHub: @heydeepakch
- LinkedIn: Deepak Choudhary
- Email: hellodeepakch@gmail.com
If this project helped you learn about distributed systems, please ⭐ star the repo!
Built with ❤️ using Go and Redis!
