Distributed Task Scheduler: A Production-Grade Asynchronous Job Processing System

Go Version Redis Prometheus Docker

A cloud-native, fault-tolerant distributed task queue demonstrating advanced backend engineering concepts

Features · Architecture · Setup · Key Learnings


The Problem & Solution

The Problem

Modern microservices frequently encounter operations that are:

  • Slow (video transcoding, report generation, ML inference)
  • Blocking (external API calls with unpredictable latency)
  • Unreliable (third-party services with rate limits or transient failures)

Handling these synchronously in API handlers leads to:

  • ❌ High API latency and poor user experience
  • ❌ Resource exhaustion under load (thread/goroutine starvation)
  • ❌ Lost jobs when servers crash mid-execution
  • ❌ Cascading failures from downstream service overload

The Solution

This project implements a distributed, asynchronous task queue that:

  • Decouples ingestion from execution - API returns instantly while work happens in background
  • Guarantees at-least-once delivery - No job is ever lost, even during crashes
  • Enforces distributed rate limiting - Protects downstream services from overload
  • Self-heals via zombie task recovery - Automatically retries stuck jobs
  • Provides deep observability - Real-time metrics for queue depth, latency distribution, and errors

Features

Reliability & Fault Tolerance

  • 🛡️ At-Least-Once Delivery using Redis atomic operations (RPOPLPUSH)
  • 🔄 Zombie Task Recovery via automatic reaper process
  • 💾 Durable Task Storage in Redis, independent of any single worker's lifetime
  • 🚨 Graceful Degradation on rate limit failures

Performance & Scalability

  • Bounded Concurrency with Go channels to prevent memory overload
  • 📊 Horizontal Scalability - Add more consumer nodes as load increases
  • 🎯 Efficient Serialization with Protocol Buffers
  • 🔥 Non-blocking Dispatcher that applies backpressure automatically

Production Readiness

  • 📈 Prometheus Metrics (Gauges, Histograms, Counters)
  • 🐳 Multi-Stage Docker Build for minimal container size (~10MB)
  • 🔐 Type-Safe Task Definitions via Protobuf schemas
  • 🌐 Distributed Rate Limiting with atomic Lua scripts

System Architecture

High-Level Component Diagram

[High-level component diagram]

The Task Lifecycle Flow

1. ENQUEUE          2. ATOMIC CLAIM         3. PROCESS           4. COMPLETE
┌──────────┐       ┌──────────┐            ┌──────────┐         ┌──────────┐
│ Producer │       │Dispatcher│            │  Worker  │         │  Redis   │
│  LPUSH   │──────→│RPOPLPUSH │───────────→│ Execute  │────────→│  LREM    │
│          │       │(Atomic)  │            │   Task   │         │ Remove   │
└──────────┘       └──────────┘            └──────────┘         └──────────┘
                         │                       │
                         ▼                       ▼
                   queue:pending           queue:processing
                         │                       │
                         │                       │
                         └───────────────────────┘
                                    │
                                    ▼
                            If Worker Crashes?
                                    │
                                    ▼
                            ┌──────────────┐
                            │   Reaper     │
                            │ Moves back   │
                            │ to pending   │
                            └──────────────┘

Key Engineering Concepts Demonstrated

This project showcases production-level understanding across five critical domains:

1. Concurrency & Parallelism

| Concept | Implementation | Why It Matters |
| --- | --- | --- |
| Bounded Worker Pool | Fixed-size goroutine pool (5 workers) fed by a buffered channel | Prevents unbounded goroutine creation that could exhaust memory under heavy load |
| Backpressure Handling | Dispatcher blocks on the channel send when all workers are busy | Naturally throttles the Redis polling rate to match processing capacity |
| Lock-Free Coordination | Go channels for worker coordination instead of mutexes | Eliminates deadlock risks and reduces contention overhead |

Code Example:

// Bounded channel acts as a natural throttle
var TaskChannel = make(chan *proto.Task, config.MaxWorkerPoolSize)

// Dispatcher blocks here if all workers are busy
TaskChannel <- task  // This line provides automatic backpressure!
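The "Bounded Worker Pool" row in the table above is the other half of this pattern: a fixed number of goroutines range over the same channel. A minimal sketch of how such a pool can be started (the worker count reuses config.MaxWorkerPoolSize from the snippet above; processTask is an illustrative placeholder, not the repo's actual function):

// Start a fixed-size pool of workers that all drain the same bounded channel.
for i := 1; i <= config.MaxWorkerPoolSize; i++ {
    go func(workerID int) {
        for task := range TaskChannel {
            // A worker picks up the next task as soon as it is free;
            // the channel capacity bounds how much work is buffered in memory.
            processTask(workerID, task)
        }
    }(i)
}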

2. Distributed Systems Reliability

| Concept | Implementation | Why It Matters |
| --- | --- | --- |
| At-Least-Once Delivery | Redis RPOPLPUSH atomically moves the task to the processing queue before execution | Guarantees the task isn't lost even if a worker crashes mid-execution |
| Idempotency | Tasks must be designed to handle duplicate execution (see the sketch after the code below) | Critical for retry scenarios where the same task may run multiple times |
| Zombie Detection | A reaper goroutine scans the processing queue every 30s | Recovers tasks from dead workers without manual intervention |

The Reliability Guarantee:

// ATOMIC operation: the task is in exactly ONE queue at all times
result, err := rdb.BRPopLPush(ctx, config.PendingQueue, config.ProcessingQueue, 0).Result()
// If the worker crashes here, the task is still in ProcessingQueue;
// the reaper will recover it after the timeout
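Because at-least-once delivery means the same task can be executed twice, handlers need an idempotency guard. A best-effort sketch of one common pattern, not part of this repository (the "done:" key prefix and sendEmail are illustrative):

// Skip the side effect if a completion marker for this task ID already exists,
// and record the marker once the side effect succeeds.
done, err := rdb.Exists(ctx, "done:"+task.Id).Result()
if err != nil {
    return err // Redis error: leave the task for the reaper to retry
}
if done == 1 {
    return nil // duplicate delivery: the work was already completed
}
if err := sendEmail(task); err != nil { // illustrative side effect
    return err
}
// Mark completion so a later redelivery becomes a no-op.
return rdb.Set(ctx, "done:"+task.Id, 1, 24*time.Hour).Err()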

3. Distributed State Management

| Concept | Implementation | Why It Matters |
| --- | --- | --- |
| Race Condition Prevention | Lua script executes atomically on the Redis server | Eliminates "check-then-set" race conditions across multiple workers |
| Fixed-Window Rate Limiting | Redis counter with a TTL, incremented via Lua | Protects downstream services from overload across all consumer instances |
| Atomic Read-Modify-Write | GET → CHECK → INCR → EXPIRE, all in a single Lua script | Prevents two workers from both thinking they're "request #100" |

The Race Condition Problem (Without Lua):

Worker A: GET count → 99          Worker B: GET count → 99
Worker A: CHECK (99 < 100) ✅      Worker B: CHECK (99 < 100) ✅  
Worker A: INCR → 100               Worker B: INCR → 101  ❌ EXCEEDED!

The Solution (With Atomic Lua):

-- Runs atomically on the Redis server - no race conditions possible
-- ('key' and 'max' are placeholders for KEYS[1] and the configured limit)
local count = tonumber(redis.call('GET', key)) or 0
if count < max then
    redis.call('INCR', key)
    return 1  -- allowed
else
    return 0  -- denied
end

4. Observability & Production Operations

| Metric Type | Use Case | Example Query |
| --- | --- | --- |
| Gauge | Current queue depth | scheduler_task_queue_depth (alert if > 1000) |
| Histogram | Latency distribution | histogram_quantile(0.99, rate(scheduler_task_processing_duration_seconds_bucket[5m])) |
| Counter | Error rate tracking | rate(scheduler_worker_errors_total[5m]) |

Why This Matters:

  • Detect Bottlenecks: If queue depth keeps growing, you need more workers
  • SLA Monitoring: Track P95/P99 latency to ensure you meet performance targets
  • Alerting: Set up PagerDuty alerts when error rate exceeds threshold
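For reference, a minimal sketch of how these three metrics might be declared with the Prometheus Go client; the repo's metrics/metrics.go will differ in details such as buckets and labels:

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Gauge: current number of tasks sitting in queue:pending.
    QueueDepth = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "scheduler_task_queue_depth",
        Help: "Current number of tasks in the pending queue.",
    })

    // Histogram: task execution latency distribution.
    ProcessingDuration = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "scheduler_task_processing_duration_seconds",
        Help:    "Task execution latency in seconds.",
        Buckets: prometheus.DefBuckets,
    })

    // Counter: failed task executions.
    WorkerErrors = promauto.NewCounter(prometheus.CounterOpts{
        Name: "scheduler_worker_errors_total",
        Help: "Total number of failed task executions.",
    })
)

Workers can then time each execution with prometheus.NewTimer(ProcessingDuration) and call WorkerErrors.Inc() on failure, while the consumer serves everything over HTTP via promhttp.Handler().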

5. Cloud-Native Deployment

| Technique | Implementation | Benefit |
| --- | --- | --- |
| Multi-Stage Docker Build | golang:alpine builder → scratch runtime | Reduces the final image from ~800MB to ~10MB |
| Static Binary Compilation | CGO_ENABLED=0 | Binary runs on a minimal scratch image without libc dependencies |
| Horizontal Scaling | Stateless consumers | Deploy 10 consumer pods in Kubernetes to 10x throughput |

Dockerfile Strategy:

# Stage 1: Build a static binary
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o consumer ./consumer

# Stage 2: Minimal runtime
FROM scratch
COPY --from=builder /app/consumer /consumer
ENTRYPOINT ["/consumer"]

🛠️ Technology Stack

| Component | Technology | Justification |
| --- | --- | --- |
| Backend Language | Go 1.25+ | Superior concurrency primitives (goroutines/channels), low memory footprint, fast compilation |
| Message Broker | Redis 7.0+ | Atomic list operations (RPOPLPUSH) give at-least-once semantics from a single lightweight dependency, without the operational overhead of Kafka/RabbitMQ |
| Serialization | Protocol Buffers | Type-safe, backward-compatible schemas with significantly smaller payloads than JSON |
| Monitoring | Prometheus + Grafana | Industry-standard time-series DB with a rich query language (PromQL) |
| Containerization | Docker (Multi-Stage) | Minimal attack surface with a scratch base image (~10MB final size) |
| Scripting | Lua (Redis) | Server-side atomic operations to prevent distributed race conditions |

Quick Start

Prerequisites

  • Docker & Docker Compose
  • Go 1.20+
  • Redis 7.0+ (or use Docker setup below)
  • protoc compiler (for regenerating .proto files)

1. Clone & Setup

git clone https://github.com/heydeepakch/distributed-task-scheduler.git
cd distributed-task-scheduler

# Install dependencies
go mod download

2. Start Infrastructure (Redis + Monitoring)

# Option A: Using Docker (Recommended)
docker run -d -p 6379:6379 redis:7-alpine

# Option B: Local Redis
redis-server

# Start Prometheus (on a different port to avoid conflict)
docker run -d -p 9091:9090 -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

3. Run the Consumer (Worker Node)

# Terminal 1: Start worker with 5 concurrent goroutines
go run consumer/main.go

# Expected output:
# --- Starting Distributed Task Scheduler Worker Node ---
# Successfully connected to Redis.
# Starting 5 Worker Goroutines...
# Dispatcher started. Pulling tasks from Redis...
# Metrics server listening on :9090

4. Submit Tasks (Producer)

# Terminal 2: Enqueue 5 sample tasks
go run producer/main.go

# Expected output:
# --- Starting Producer Node (API Server) ---
# Producer: 🚀 Task abc-123 submitted successfully to Redis!
# Producer: 🚀 Task def-456 submitted successfully to Redis!
# ...

5. View Metrics

# Prometheus metrics endpoint (raw)
curl http://localhost:9090/metrics

# Key metrics to observe:
# - scheduler_task_queue_depth: Current backlog size
# - scheduler_task_processing_duration_seconds: Latency histogram
# - scheduler_worker_errors_total: Failure count

6. Monitor with Prometheus UI

# Access Prometheus dashboard
open http://localhost:9091

# Sample PromQL queries:
# 1. Queue depth over time:
scheduler_task_queue_depth

# 2. P99 latency (99th percentile):
histogram_quantile(0.99, rate(scheduler_task_processing_duration_seconds_bucket[5m]))

# 3. Error rate per minute:
rate(scheduler_worker_errors_total[1m])

📁 Project Structure

distributed-task-scheduler/
│
├── consumer/                    # Worker node (task processor)
│   └── main.go                 # Dispatcher, worker pool, reaper logic
│
├── producer/                   # API server (task submitter)
│   └── main.go                 # Enqueues tasks to Redis
│
├── proto/                      # Protobuf schemas & generated code
│   ├── task.proto              # Task message definition
│   └── task.pb.go              # Auto-generated Go code
│
├── worker/                     # Business logic layer
│   └── executor.go             # Simulates task execution (3s sleep)
│
├── pkg/
│   └── ratelimiter/            # Distributed rate limiter
│       └── ratelimiter.go      # Executes Lua script atomically
│
├── lua/
│   └── rate_limit.lua          # Fixed-window rate limiting algorithm
│
├── metrics/
│   └── metrics.go              # Prometheus metric definitions
│
├── config/
│   └── config.go               # Centralized configuration
│
├── Dockerfile                  # Multi-stage build for minimal image
├── prometheus.yml              # Prometheus scrape configuration
├── go.mod                      # Dependency management
└── README.md                   # This file

How It Works

Phase 1: Task Submission (Producer)

// 1. Create a typed task
task := &proto.Task{
    Id:          uuid.New().String(),
    Type:        "EMAIL_SEND",
    Payload:     []byte(`{"to": "user@example.com"}`),
    SubmittedAt: time.Now().Unix(),
}

// 2. Serialize to bytes (efficient binary format)
taskData, _ := proto.Marshal(task)

// 3. Enqueue to Redis (non-blocking)
rdb.LPush(ctx, "queue:pending", taskData)

Result: Task is durably stored in Redis. Producer can immediately return HTTP 202 Accepted to client.
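In this repository the producer is a standalone main.go that enqueues a fixed batch, but the same call sits naturally behind an HTTP handler that acknowledges immediately. A hedged sketch (the handler, route, and error handling are illustrative, not part of the repo; rdb is assumed to be a shared go-redis client):

// Hypothetical API handler: enqueue the task, then return 202 without waiting for execution.
func submitTaskHandler(w http.ResponseWriter, r *http.Request) {
    task := &proto.Task{
        Id:          uuid.New().String(),
        Type:        "EMAIL_SEND",
        Payload:     []byte(`{"to": "user@example.com"}`),
        SubmittedAt: time.Now().Unix(),
    }

    taskData, err := proto.Marshal(task)
    if err != nil {
        http.Error(w, "failed to encode task", http.StatusInternalServerError)
        return
    }

    // Durable enqueue: once LPUSH succeeds, the worker fleet owns the job.
    if err := rdb.LPush(r.Context(), "queue:pending", taskData).Err(); err != nil {
        http.Error(w, "failed to enqueue task", http.StatusServiceUnavailable)
        return
    }

    w.WriteHeader(http.StatusAccepted) // the client gets an ID long before the task runs
    _, _ = w.Write([]byte(task.Id))
}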


Phase 2: Atomic Task Claiming (Dispatcher)

// CRITICAL: Atomic move from pending → processing
// This guarantees task is "reserved" before execution
result, err := rdb.BRPopLPush(
    ctx,
    "queue:pending",      // Source queue
    "queue:processing",   // Destination queue
    0,                    // Block forever until task available
).Result()

// Deserialize and send to worker pool
task := &proto.Task{}
proto.Unmarshal([]byte(result), task)

// Send to bounded channel (blocks if workers are busy)
TaskChannel <- task  // Natural backpressure!

Why RPOPLPUSH?

  • Atomic: Task can't be "lost" between pop and push
  • Visible: Task remains in processing queue until explicitly removed
  • Recoverable: Reaper can detect tasks stuck in processing queue

Phase 3: Rate-Limited Execution (Worker)

// 1. Check distributed rate limit (only for certain task types)
if task.Type == "EMAIL_SEND" {
    allowed, count, err := ratelimiter.IsAllowed(
        ctx, rdb,
        "rate:email_service", // Global key across all workers
        100,                  // Max 100 requests
        60,                   // Per 60 seconds
    )
    
    if !allowed {
        // CRITICAL: Return without LREM
        // Task stays in processing queue
        // Reaper will retry after timeout
        return
    }
}

// 2. Execute business logic
err := worker.ExecuteTask(task)  // Simulates 3s of work

// 3. On success, remove from processing queue
if err == nil {
    taskData, _ := proto.Marshal(task)
    rdb.LRem(ctx, "queue:processing", 1, taskData)
}

Rate Limiting Deep Dive:

The Lua script runs atomically on the Redis server:

-- The limit and window are passed in as script arguments
local max_requests = tonumber(ARGV[1])
local window_seconds = tonumber(ARGV[2])

local count = redis.call('GET', KEYS[1])
count = tonumber(count) or 0

if count < max_requests then
    redis.call('INCR', KEYS[1])
    if count == 0 then
        redis.call('EXPIRE', KEYS[1], window_seconds)
    end
    return {1, count + 1}  -- Allowed
else
    return {0, count}      -- Denied
end

This prevents the race condition where two workers both think they're request #100.
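pkg/ratelimiter/ratelimiter.go is the piece that ships this script to Redis. A minimal sketch of such a wrapper, assuming go-redis v9 (the embedded script literal and result handling are illustrative, not the repo's exact code):

package ratelimiter

import (
    "context"

    "github.com/redis/go-redis/v9"
)

// rateLimitScript mirrors lua/rate_limit.lua; go-redis loads it once and reuses EVALSHA.
var rateLimitScript = redis.NewScript(`
local max_requests = tonumber(ARGV[1])
local window_seconds = tonumber(ARGV[2])
local count = tonumber(redis.call('GET', KEYS[1])) or 0
if count < max_requests then
    redis.call('INCR', KEYS[1])
    if count == 0 then
        redis.call('EXPIRE', KEYS[1], window_seconds)
    end
    return {1, count + 1}
else
    return {0, count}
end
`)

// IsAllowed runs the fixed-window check atomically on the Redis server.
func IsAllowed(ctx context.Context, rdb *redis.Client, key string, maxRequests, windowSeconds int) (bool, int64, error) {
    res, err := rateLimitScript.Run(ctx, rdb, []string{key}, maxRequests, windowSeconds).Result()
    if err != nil {
        return false, 0, err
    }
    // A Lua table of two integers comes back as a []interface{} of int64 values.
    vals := res.([]interface{})
    return vals[0].(int64) == 1, vals[1].(int64), nil
}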


Phase 4: Zombie Recovery (Reaper)

ticker := time.NewTicker(30 * time.Second)

for range ticker.C {
    // Move one task from processing back to pending
    result, err := rdb.RPopLPush(
        ctx,
        "queue:processing",
        "queue:pending",
    ).Result()
    
    if err == redis.Nil {
        // Processing queue is empty - all good!
        continue
    }
    
    // Log which task was recovered
    task := &proto.Task{}
    proto.Unmarshal([]byte(result), task)
    fmt.Printf("⚠️ ZOMBIE REAPED: Task %s\n", task.Id)
}

When Does This Trigger?

  • Worker crashes mid-execution
  • Rate limit causes worker to abandon task
  • Network partition prevents LREM from completing

Result: Task is automatically retried without manual intervention.


Monitoring & Observability

Key Metrics Exposed

| Metric Name | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| scheduler_task_queue_depth | Gauge | Current number of tasks in the pending queue | > 1000 (capacity issue) |
| scheduler_task_processing_duration_seconds | Histogram | Task execution latency distribution | P99 > 10s (performance degradation) |
| scheduler_worker_errors_total | Counter | Failed task count by error type | Rate > 5/min (systemic failure) |

Sample Prometheus Queries

# 1. Queue backlog trend (smoothed over 5 minutes)
avg_over_time(scheduler_task_queue_depth[5m])

# 2. 95th percentile latency
histogram_quantile(0.95, 
  rate(scheduler_task_processing_duration_seconds_bucket[5m])
)

# 3. Error rate per second
rate(scheduler_worker_errors_total[1m])

# 4. Queue drain rate (tasks/sec)
rate(scheduler_task_processing_duration_seconds_count[1m])

Grafana Dashboard Setup

  1. Add Prometheus as data source: http://prometheus:9090
  2. Create panels with above queries
  3. Set alerts:
    • Queue depth > 1000 for 5 minutes → PagerDuty
    • Error rate > 1% → Slack notification
    • P99 latency > 15s → Email to on-call engineer

Production Considerations

What's Implemented

  • ✅ At-least-once delivery guarantee
  • ✅ Distributed rate limiting
  • ✅ Automatic zombie task recovery
  • ✅ Prometheus metrics for observability
  • ✅ Docker containerization
  • ✅ Bounded concurrency to prevent overload

What's Missing for Production (Learning Opportunities)

| Missing Feature | Why It Matters | How to Implement |
| --- | --- | --- |
| Dead Letter Queue (DLQ) | Tasks that fail repeatedly (poison messages) should be quarantined | After 3 retries, move the task to queue:dlq for manual review |
| Task Priority | Urgent tasks should jump the queue | Use Redis Sorted Sets with priority scores instead of lists |
| Task Timeouts | Prevent runaway tasks from blocking workers forever | Add a per-task timeout in the worker with context.WithTimeout |
| Graceful Shutdown | Workers should finish in-flight tasks before terminating | Catch SIGTERM, stop the dispatcher, drain the channel, then exit (see the sketch below this table) |
| Distributed Tracing | Track a task's journey across producer → Redis → consumer | Add OpenTelemetry spans with trace IDs in the task payload |
| Schema Versioning | Handle backward-compatible protobuf schema changes | Use protobuf field numbers and optional fields |
| Multi-Tenancy | Isolate tasks by customer/team | Use separate queues per tenant or add a tenant ID to each task |
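To make the graceful-shutdown row concrete, here is a hedged sketch of how the consumer's main loop could drain on SIGTERM. Names like claimNextTask and processTask are illustrative, and the repo does not implement this yet:

// Illustrative graceful shutdown: stop claiming tasks on SIGTERM/SIGINT,
// let the dispatcher close the channel, and wait for in-flight work to finish.
func run() {
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    taskCh := make(chan *proto.Task, config.MaxWorkerPoolSize)

    var wg sync.WaitGroup
    for i := 0; i < config.MaxWorkerPoolSize; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for task := range taskCh { // exits once the dispatcher closes the channel
                processTask(task)
            }
        }()
    }

    // The dispatcher is the only sender, so it owns closing the channel.
    go func() {
        defer close(taskCh)
        for ctx.Err() == nil {
            // claimNextTask should use a bounded BRPOPLPUSH timeout so this loop
            // can notice cancellation instead of blocking forever.
            task, err := claimNextTask(ctx)
            if err != nil {
                continue
            }
            taskCh <- task
        }
    }()

    <-ctx.Done() // SIGTERM / SIGINT received
    wg.Wait()    // in-flight tasks complete before the process exits
}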

What I Learned Building This

1. Concurrency is Hard, But Go Makes It Manageable

  • Before: I thought channels were just "queues"
  • After: I understand channels as synchronization primitives that provide backpressure and coordination
  • Key Insight: TaskChannel <- task is both a queue write and a flow control mechanism

2. Distributed Systems Require Atomic Operations

  • Before: I assumed Redis commands were "fast enough" to be reliable
  • After: I learned about race conditions in distributed systems (two workers checking the same counter)
  • Key Insight: Lua scripts are essential for maintaining correctness across multiple nodes

3. Reliability Requires Design, Not Just Retries

  • Before: I thought "just retry on failure" was enough
  • After: I learned about zombie tasks, idempotency, and the importance of visibility into in-flight work
  • Key Insight: RPOPLPUSH moves tasks to a "visible" processing queue, making recovery possible

4. Observability is Not Optional

  • Before: I added logging as an afterthought
  • After: I designed metrics first to understand system behavior
  • Key Insight: You can't fix what you can't measure. Histograms reveal latency issues that averages hide.

5. Protobuf > JSON for Systems Integration

  • Before: I used JSON everywhere because it's "readable"
  • After: I learned protobuf provides type safety, backward compatibility, and 10x smaller payloads
  • Key Insight: The .proto schema acts as a contract between producer and consumer

Future Enhancements

  • Kubernetes Deployment Manifests (Deployment, Service, HPA)
  • Grafana Dashboard JSON for one-click monitoring setup
  • Benchmark Suite to measure throughput (tasks/sec) under various loads
  • Dead Letter Queue for poison message handling
  • Task Priority Queue using Redis Sorted Sets
  • OpenTelemetry Integration for distributed tracing
  • Graceful Shutdown to drain in-flight tasks on SIGTERM
  • Admin Dashboard (Web UI) to inspect queue state



🤝 Contributing

This is a portfolio project demonstrating backend engineering skills. If you have suggestions for improvements or find issues:

  1. Open an issue describing the problem
  2. Submit a PR with test coverage
  3. Follow existing code style (run gofmt)

📄 License

MIT License - feel free to use this as a learning resource or starting point for your own projects.


About

Built by Deepak Chaudhary as a demonstration of production-grade backend engineering skills for software developer roles.

Connect with me:


If this project helped you learn about distributed systems, please ⭐ star the repo!

Built with ❤️ using Go and Redis!
