Distributed Task Scheduler: A Production-Grade Asynchronous Job Processing System

Go Version Redis Prometheus Docker

A cloud-native, fault-tolerant distributed task queue demonstrating advanced backend engineering concepts

Features · Architecture · Setup · Key Learnings


The Problem & Solution

The Problem

Modern microservices frequently encounter operations that are:

  • Slow (video transcoding, report generation, ML inference)
  • Blocking (external API calls with unpredictable latency)
  • Unreliable (third-party services with rate limits or transient failures)

Handling these synchronously in API handlers leads to:

  • ❌ High API latency and poor user experience
  • ❌ Resource exhaustion under load (thread/goroutine starvation)
  • ❌ Lost jobs when servers crash mid-execution
  • ❌ Cascading failures from downstream service overload

The Solution

This project implements a distributed, asynchronous task queue that:

  • Decouples ingestion from execution - API returns instantly while work happens in background
  • Guarantees at-least-once delivery - No job is ever lost, even during crashes
  • Enforces distributed rate limiting - Protects downstream services from overload
  • Self-heals via zombie task recovery - Automatically retries stuck jobs
  • Provides deep observability - Real-time metrics for queue depth, latency distribution, and errors

Features

Reliability & Fault Tolerance

  • 🛡️ At-Least-Once Delivery using Redis atomic operations (RPOPLPUSH)
  • 🔄 Zombie Task Recovery via automatic reaper process
  • 💾 Durable Task Storage in Redis, independent of any single worker's lifetime
  • 🚨 Graceful Degradation on rate limit failures

Performance & Scalability

  • Bounded Concurrency with Go channels to prevent memory overload
  • 📊 Horizontal Scalability - Add more consumer nodes as load increases
  • 🎯 Efficient Serialization with Protocol Buffers
  • 🔥 Non-blocking Dispatcher that applies backpressure automatically

Production Readiness

  • 📈 Prometheus Metrics (Gauges, Histograms, Counters)
  • 🐳 Multi-Stage Docker Build for minimal container size (~10MB)
  • 🔐 Type-Safe Task Definitions via Protobuf schemas
  • 🌐 Distributed Rate Limiting with atomic Lua scripts

System Architecture

High-Level Component Diagram

[High-level component diagram]

The Task Lifecycle Flow

1. ENQUEUE          2. ATOMIC CLAIM         3. PROCESS           4. COMPLETE
┌──────────┐       ┌──────────┐            ┌──────────┐         ┌──────────┐
│ Producer │       │Dispatcher│            │  Worker  │         │  Redis   │
│  LPUSH   │──────→│RPOPLPUSH │───────────→│ Execute  │────────→│  LREM    │
│          │       │(Atomic)  │            │   Task   │         │ Remove   │
└──────────┘       └──────────┘            └──────────┘         └──────────┘
                         │                       │
                         ▼                       ▼
                   queue:pending           queue:processing
                         │                       │
                         │                       │
                         └───────────────────────┘
                                    │
                                    ▼
                            If Worker Crashes?
                                    │
                                    ▼
                            ┌──────────────┐
                            │   Reaper     │
                            │ Moves back   │
                            │ to pending   │
                            └──────────────┘

Key Engineering Concepts Demonstrated

This project showcases production-level understanding across five critical domains:

1. Concurrency & Parallelism

| Concept | Implementation | Why It Matters |
| --- | --- | --- |
| Bounded Worker Pool | Fixed-size goroutine pool (5 workers) fed by a buffered channel | Prevents unbounded goroutine creation that could exhaust memory under heavy load |
| Backpressure Handling | Dispatcher blocks on the channel send when all workers are busy | Naturally throttles the Redis polling rate to match processing capacity |
| Lock-Free Coordination | Go channels for worker coordination instead of mutexes | Eliminates deadlock risks and reduces contention overhead |

Code Example:

// Bounded channel acts as a natural throttle
var TaskChannel = make(chan *proto.Task, config.MaxWorkerPoolSize)

// Dispatcher blocks here if all workers are busy
TaskChannel <- task  // This line provides automatic backpressure!
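The "Bounded Worker Pool" row in the table above is the other half of this pattern: a fixed number of goroutines range over the same channel. A minimal sketch of how such a pool can be started (the worker count reuses config.MaxWorkerPoolSize from the snippet above; processTask is an illustrative placeholder, not the repo's actual function):

// Start a fixed-size pool of workers that all drain the same bounded channel.
for i := 1; i <= config.MaxWorkerPoolSize; i++ {
    go func(workerID int) {
        for task := range TaskChannel {
            // A worker picks up the next task as soon as it is free;
            // the channel capacity bounds how much work is buffered in memory.
            processTask(workerID, task)
        }
    }(i)
}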

2. Distributed Systems Reliability

| Concept | Implementation | Why It Matters |
| --- | --- | --- |
| At-Least-Once Delivery | Redis RPOPLPUSH atomically moves the task to the processing queue before execution | Guarantees the task isn't lost even if a worker crashes mid-execution |
| Idempotency | Tasks must be designed to handle duplicate execution (see the sketch after the code below) | Critical for retry scenarios where the same task may run multiple times |
| Zombie Detection | A reaper goroutine scans the processing queue every 30s | Recovers tasks from dead workers without manual intervention |

The Reliability Guarantee:

// ATOMIC operation: the task is in exactly ONE queue at all times
result, err := rdb.BRPopLPush(ctx, config.PendingQueue, config.ProcessingQueue, 0).Result()
// If the worker crashes here, the task is still in ProcessingQueue;
// the reaper will recover it after the timeout
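Because at-least-once delivery means the same task can be executed twice, handlers need an idempotency guard. A best-effort sketch of one common pattern, not part of this repository (the "done:" key prefix and sendEmail are illustrative):

// Skip the side effect if a completion marker for this task ID already exists,
// and record the marker once the side effect succeeds.
done, err := rdb.Exists(ctx, "done:"+task.Id).Result()
if err != nil {
    return err // Redis error: leave the task for the reaper to retry
}
if done == 1 {
    return nil // duplicate delivery: the work was already completed
}
if err := sendEmail(task); err != nil { // illustrative side effect
    return err
}
// Mark completion so a later redelivery becomes a no-op.
return rdb.Set(ctx, "done:"+task.Id, 1, 24*time.Hour).Err()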

3. Distributed State Management

| Concept | Implementation | Why It Matters |
| --- | --- | --- |
| Race Condition Prevention | Lua script executes atomically on the Redis server | Eliminates "check-then-set" race conditions across multiple workers |
| Fixed-Window Rate Limiting | Redis counter with a TTL, incremented via Lua | Protects downstream services from overload across all consumer instances |
| Atomic Read-Modify-Write | GET → CHECK → INCR → EXPIRE, all in a single Lua script | Prevents two workers from both thinking they're "request #100" |

The Race Condition Problem (Without Lua):

Worker A: GET count → 99          Worker B: GET count → 99
Worker A: CHECK (99 < 100) ✅      Worker B: CHECK (99 < 100) ✅  
Worker A: INCR → 100               Worker B: INCR → 101  ❌ EXCEEDED!

The Solution (With Atomic Lua):

-- Runs atomically on the Redis server - no race conditions possible
-- ('key' and 'max' are placeholders for KEYS[1] and the configured limit)
local count = tonumber(redis.call('GET', key)) or 0
if count < max then
    redis.call('INCR', key)
    return 1  -- allowed
else
    return 0  -- denied
end

4. Observability & Production Operations

| Metric Type | Use Case | Example Query |
| --- | --- | --- |
| Gauge | Current queue depth | scheduler_task_queue_depth (alert if > 1000) |
| Histogram | Latency distribution | histogram_quantile(0.99, rate(scheduler_task_processing_duration_seconds_bucket[5m])) |
| Counter | Error rate tracking | rate(scheduler_worker_errors_total[5m]) |

Why This Matters:

  • Detect Bottlenecks: If queue depth keeps growing, you need more workers
  • SLA Monitoring: Track P95/P99 latency to ensure you meet performance targets
  • Alerting: Set up PagerDuty alerts when error rate exceeds threshold
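For reference, a minimal sketch of how these three metrics might be declared with the Prometheus Go client; the repo's metrics/metrics.go will differ in details such as buckets and labels:

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Gauge: current number of tasks sitting in queue:pending.
    QueueDepth = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "scheduler_task_queue_depth",
        Help: "Current number of tasks in the pending queue.",
    })

    // Histogram: task execution latency distribution.
    ProcessingDuration = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "scheduler_task_processing_duration_seconds",
        Help:    "Task execution latency in seconds.",
        Buckets: prometheus.DefBuckets,
    })

    // Counter: failed task executions.
    WorkerErrors = promauto.NewCounter(prometheus.CounterOpts{
        Name: "scheduler_worker_errors_total",
        Help: "Total number of failed task executions.",
    })
)

Workers can then time each execution with prometheus.NewTimer(ProcessingDuration) and call WorkerErrors.Inc() on failure, while the consumer serves everything over HTTP via promhttp.Handler().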

5. Cloud-Native Deployment

| Technique | Implementation | Benefit |
| --- | --- | --- |
| Multi-Stage Docker Build | golang:alpine builder → scratch runtime | Reduces the final image from ~800MB to ~10MB |
| Static Binary Compilation | CGO_ENABLED=0 | Binary runs on a minimal scratch image without libc dependencies |
| Horizontal Scaling | Stateless consumers | Deploy 10 consumer pods in Kubernetes to 10x throughput |

Dockerfile Strategy:

# Stage 1: Build a static binary
FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o consumer ./consumer

# Stage 2: Minimal runtime
FROM scratch
COPY --from=builder /app/consumer /consumer
ENTRYPOINT ["/consumer"]

🛠️ Technology Stack

| Component | Technology | Justification |
| --- | --- | --- |
| Backend Language | Go 1.25+ | Superior concurrency primitives (goroutines/channels), low memory footprint, fast compilation |
| Message Broker | Redis 7.0+ | Atomic list operations (RPOPLPUSH) give at-least-once semantics from a single lightweight dependency, without the operational overhead of Kafka/RabbitMQ |
| Serialization | Protocol Buffers | Type-safe, backward-compatible schemas with significantly smaller payloads than JSON |
| Monitoring | Prometheus + Grafana | Industry-standard time-series DB with a rich query language (PromQL) |
| Containerization | Docker (Multi-Stage) | Minimal attack surface with a scratch base image (~10MB final size) |
| Scripting | Lua (Redis) | Server-side atomic operations to prevent distributed race conditions |

Quick Start

Prerequisites

  • Docker & Docker Compose
  • Go 1.20+
  • Redis 7.0+ (or use Docker setup below)
  • protoc compiler (for regenerating .proto files)

1. Clone & Setup

git clone https://github.com/heydeepakch/distributed-task-scheduler.git
cd distributed-task-scheduler

# Install dependencies
go mod download

2. Start Infrastructure (Redis + Monitoring)

# Option A: Using Docker (Recommended)
docker run -d -p 6379:6379 redis:7-alpine

# Option B: Local Redis
redis-server

# Start Prometheus (on a different port to avoid conflict)
docker run -d -p 9091:9090 -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

3. Run the Consumer (Worker Node)

# Terminal 1: Start worker with 5 concurrent goroutines
go run consumer/main.go

# Expected output:
# --- Starting Distributed Task Scheduler Worker Node ---
# Successfully connected to Redis.
# Starting 5 Worker Goroutines...
# Dispatcher started. Pulling tasks from Redis...
# Metrics server listening on :9090

4. Submit Tasks (Producer)

# Terminal 2: Enqueue 5 sample tasks
go run producer/main.go

# Expected output:
# --- Starting Producer Node (API Server) ---
# Producer: 🚀 Task abc-123 submitted successfully to Redis!
# Producer: 🚀 Task def-456 submitted successfully to Redis!
# ...

5. View Metrics

# Prometheus metrics endpoint (raw)
curl http://localhost:9090/metrics

# Key metrics to observe:
# - scheduler_task_queue_depth: Current backlog size
# - scheduler_task_processing_duration_seconds: Latency histogram
# - scheduler_worker_errors_total: Failure count

6. Monitor with Prometheus UI

# Access Prometheus dashboard
open http://localhost:9091

# Sample PromQL queries:
# 1. Queue depth over time:
scheduler_task_queue_depth

# 2. P99 latency (99th percentile):
histogram_quantile(0.99, rate(scheduler_task_processing_duration_seconds_bucket[5m]))

# 3. Error rate per minute:
rate(scheduler_worker_errors_total[1m])

📁 Project Structure

distributed-task-scheduler/
│
├── consumer/                    # Worker node (task processor)
│   └── main.go                 # Dispatcher, worker pool, reaper logic
│
├── producer/                   # API server (task submitter)
│   └── main.go                 # Enqueues tasks to Redis
│
├── proto/                      # Protobuf schemas & generated code
│   ├── task.proto              # Task message definition
│   └── task.pb.go              # Auto-generated Go code
│
├── worker/                     # Business logic layer
│   └── executor.go             # Simulates task execution (3s sleep)
│
├── pkg/
│   └── ratelimiter/            # Distributed rate limiter
│       └── ratelimiter.go      # Executes Lua script atomically
│
├── lua/
│   └── rate_limit.lua          # Fixed-window rate limiting algorithm
│
├── metrics/
│   └── metrics.go              # Prometheus metric definitions
│
├── config/
│   └── config.go               # Centralized configuration
│
├── Dockerfile                  # Multi-stage build for minimal image
├── prometheus.yml              # Prometheus scrape configuration
├── go.mod                      # Dependency management
└── README.md                   # This file

How It Works

Phase 1: Task Submission (Producer)

// 1. Create a typed task
task := &proto.Task{
    Id:          uuid.New().String(),
    Type:        "EMAIL_SEND",
    Payload:     []byte(`{"to": "user@example.com"}`),
    SubmittedAt: time.Now().Unix(),
}

// 2. Serialize to bytes (efficient binary format)
taskData, _ := proto.Marshal(task)

// 3. Enqueue to Redis (non-blocking)
rdb.LPush(ctx, "queue:pending", taskData)

Result: Task is durably stored in Redis. Producer can immediately return HTTP 202 Accepted to client.
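In this repository the producer is a standalone main.go that enqueues a fixed batch, but the same call sits naturally behind an HTTP handler that acknowledges immediately. A hedged sketch (the handler, route, and error handling are illustrative, not part of the repo; rdb is assumed to be a shared go-redis client):

// Hypothetical API handler: enqueue the task, then return 202 without waiting for execution.
func submitTaskHandler(w http.ResponseWriter, r *http.Request) {
    task := &proto.Task{
        Id:          uuid.New().String(),
        Type:        "EMAIL_SEND",
        Payload:     []byte(`{"to": "user@example.com"}`),
        SubmittedAt: time.Now().Unix(),
    }

    taskData, err := proto.Marshal(task)
    if err != nil {
        http.Error(w, "failed to encode task", http.StatusInternalServerError)
        return
    }

    // Durable enqueue: once LPUSH succeeds, the worker fleet owns the job.
    if err := rdb.LPush(r.Context(), "queue:pending", taskData).Err(); err != nil {
        http.Error(w, "failed to enqueue task", http.StatusServiceUnavailable)
        return
    }

    w.WriteHeader(http.StatusAccepted) // the client gets an ID long before the task runs
    _, _ = w.Write([]byte(task.Id))
}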


Phase 2: Atomic Task Claiming (Dispatcher)

// CRITICAL: Atomic move from pending → processing
// This guarantees task is "reserved" before execution
result, err := rdb.BRPopLPush(
    ctx,
    "queue:pending",      // Source queue
    "queue:processing",   // Destination queue
    0,                    // Block forever until task available
).Result()

// Deserialize and send to worker pool
task := &proto.Task{}
proto.Unmarshal([]byte(result), task)

// Send to bounded channel (blocks if workers are busy)
TaskChannel <- task  // Natural backpressure!

Why RPOPLPUSH?

  • Atomic: Task can't be "lost" between pop and push
  • Visible: Task remains in processing queue until explicitly removed
  • Recoverable: Reaper can detect tasks stuck in processing queue

Phase 3: Rate-Limited Execution (Worker)

// 1. Check distributed rate limit (only for certain task types)
if task.Type == "EMAIL_SEND" {
    allowed, count, err := ratelimiter.IsAllowed(
        ctx, rdb,
        "rate:email_service", // Global key across all workers
        100,                  // Max 100 requests
        60,                   // Per 60 seconds
    )
    
    if !allowed {
        // CRITICAL: Return without LREM
        // Task stays in processing queue
        // Reaper will retry after timeout
        return
    }
}

// 2. Execute business logic
err := worker.ExecuteTask(task)  // Simulates 3s of work

// 3. On success, remove from processing queue
if err == nil {
    taskData, _ := proto.Marshal(task)
    rdb.LRem(ctx, "queue:processing", 1, taskData)
}

Rate Limiting Deep Dive:

The Lua script runs atomically on the Redis server:

-- The limit and window are passed in as script arguments
local max_requests = tonumber(ARGV[1])
local window_seconds = tonumber(ARGV[2])

local count = redis.call('GET', KEYS[1])
count = tonumber(count) or 0

if count < max_requests then
    redis.call('INCR', KEYS[1])
    if count == 0 then
        redis.call('EXPIRE', KEYS[1], window_seconds)
    end
    return {1, count + 1}  -- Allowed
else
    return {0, count}      -- Denied
end

This prevents the race condition where two workers both think they're request #100.
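pkg/ratelimiter/ratelimiter.go is the piece that ships this script to Redis. A minimal sketch of such a wrapper, assuming go-redis v9 (the embedded script literal and result handling are illustrative, not the repo's exact code):

package ratelimiter

import (
    "context"

    "github.com/redis/go-redis/v9"
)

// rateLimitScript mirrors lua/rate_limit.lua; go-redis loads it once and reuses EVALSHA.
var rateLimitScript = redis.NewScript(`
local max_requests = tonumber(ARGV[1])
local window_seconds = tonumber(ARGV[2])
local count = tonumber(redis.call('GET', KEYS[1])) or 0
if count < max_requests then
    redis.call('INCR', KEYS[1])
    if count == 0 then
        redis.call('EXPIRE', KEYS[1], window_seconds)
    end
    return {1, count + 1}
else
    return {0, count}
end
`)

// IsAllowed runs the fixed-window check atomically on the Redis server.
func IsAllowed(ctx context.Context, rdb *redis.Client, key string, maxRequests, windowSeconds int) (bool, int64, error) {
    res, err := rateLimitScript.Run(ctx, rdb, []string{key}, maxRequests, windowSeconds).Result()
    if err != nil {
        return false, 0, err
    }
    // A Lua table of two integers comes back as a []interface{} of int64 values.
    vals := res.([]interface{})
    return vals[0].(int64) == 1, vals[1].(int64), nil
}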


Phase 4: Zombie Recovery (Reaper)

ticker := time.NewTicker(30 * time.Second)

for range ticker.C {
    // Move one task from processing back to pending
    result, err := rdb.RPopLPush(
        ctx,
        "queue:processing",
        "queue:pending",
    ).Result()
    
    if err == redis.Nil {
        // Processing queue is empty - all good!
        continue
    }
    
    // Log which task was recovered
    task := &proto.Task{}
    proto.Unmarshal([]byte(result), task)
    fmt.Printf("⚠️ ZOMBIE REAPED: Task %s\n", task.Id)
}

When Does This Trigger?

  • Worker crashes mid-execution
  • Rate limit causes worker to abandon task
  • Network partition prevents LREM from completing

Result: Task is automatically retried without manual intervention.


Monitoring & Observability

Key Metrics Exposed

| Metric Name | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| scheduler_task_queue_depth | Gauge | Current number of tasks in the pending queue | > 1000 (capacity issue) |
| scheduler_task_processing_duration_seconds | Histogram | Task execution latency distribution | P99 > 10s (performance degradation) |
| scheduler_worker_errors_total | Counter | Failed task count by error type | Rate > 5/min (systemic failure) |

Sample Prometheus Queries

# 1. Queue backlog trend (smoothed over 5 minutes)
avg_over_time(scheduler_task_queue_depth[5m])

# 2. 95th percentile latency
histogram_quantile(0.95, 
  rate(scheduler_task_processing_duration_seconds_bucket[5m])
)

# 3. Error rate per second
rate(scheduler_worker_errors_total[1m])

# 4. Queue drain rate (tasks/sec)
rate(scheduler_task_processing_duration_seconds_count[1m])

Grafana Dashboard Setup

  1. Add Prometheus as data source: http://prometheus:9090
  2. Create panels with above queries
  3. Set alerts:
    • Queue depth > 1000 for 5 minutes → PagerDuty
    • Error rate > 1% → Slack notification
    • P99 latency > 15s → Email to on-call engineer

Production Considerations

What's Implemented

  • ✅ At-least-once delivery guarantee
  • ✅ Distributed rate limiting
  • ✅ Automatic zombie task recovery
  • ✅ Prometheus metrics for observability
  • ✅ Docker containerization
  • ✅ Bounded concurrency to prevent overload

What's Missing for Production (Learning Opportunities)

| Missing Feature | Why It Matters | How to Implement |
| --- | --- | --- |
| Dead Letter Queue (DLQ) | Tasks that fail repeatedly (poison messages) should be quarantined | After 3 retries, move the task to queue:dlq for manual review |
| Task Priority | Urgent tasks should jump the queue | Use Redis Sorted Sets with priority scores instead of lists |
| Task Timeouts | Prevent runaway tasks from blocking workers forever | Add a per-task timeout in the worker with context.WithTimeout |
| Graceful Shutdown | Workers should finish in-flight tasks before terminating | Catch SIGTERM, stop the dispatcher, drain the channel, then exit (see the sketch below this table) |
| Distributed Tracing | Track a task's journey across producer → Redis → consumer | Add OpenTelemetry spans with trace IDs in the task payload |
| Schema Versioning | Handle backward-compatible protobuf schema changes | Use protobuf field numbers and optional fields |
| Multi-Tenancy | Isolate tasks by customer/team | Use separate queues per tenant or add a tenant ID to each task |
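To make the graceful-shutdown row concrete, here is a hedged sketch of how the consumer's main loop could drain on SIGTERM. Names like claimNextTask and processTask are illustrative, and the repo does not implement this yet:

// Illustrative graceful shutdown: stop claiming tasks on SIGTERM/SIGINT,
// let the dispatcher close the channel, and wait for in-flight work to finish.
func run() {
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    taskCh := make(chan *proto.Task, config.MaxWorkerPoolSize)

    var wg sync.WaitGroup
    for i := 0; i < config.MaxWorkerPoolSize; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for task := range taskCh { // exits once the dispatcher closes the channel
                processTask(task)
            }
        }()
    }

    // The dispatcher is the only sender, so it owns closing the channel.
    go func() {
        defer close(taskCh)
        for ctx.Err() == nil {
            // claimNextTask should use a bounded BRPOPLPUSH timeout so this loop
            // can notice cancellation instead of blocking forever.
            task, err := claimNextTask(ctx)
            if err != nil {
                continue
            }
            taskCh <- task
        }
    }()

    <-ctx.Done() // SIGTERM / SIGINT received
    wg.Wait()    // in-flight tasks complete before the process exits
}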

What I Learned Building This

1. Concurrency is Hard, But Go Makes It Manageable

  • Before: I thought channels were just "queues"
  • After: I understand channels as synchronization primitives that provide backpressure and coordination
  • Key Insight: TaskChannel <- task is both a queue write and a flow control mechanism

2. Distributed Systems Require Atomic Operations

  • Before: I assumed Redis commands were "fast enough" to be reliable
  • After: I learned about race conditions in distributed systems (two workers checking the same counter)
  • Key Insight: Lua scripts are essential for maintaining correctness across multiple nodes

3. Reliability Requires Design, Not Just Retries

  • Before: I thought "just retry on failure" was enough
  • After: I learned about zombie tasks, idempotency, and the importance of visibility into in-flight work
  • Key Insight: RPOPLPUSH moves tasks to a "visible" processing queue, making recovery possible

4. Observability is Not Optional

  • Before: I added logging as an afterthought
  • After: I designed metrics first to understand system behavior
  • Key Insight: You can't fix what you can't measure. Histograms reveal latency issues that averages hide.

5. Protobuf > JSON for Systems Integration

  • Before: I used JSON everywhere because it's "readable"
  • After: I learned protobuf provides type safety, backward compatibility, and 10x smaller payloads
  • Key Insight: The .proto schema acts as a contract between producer and consumer

Future Enhancements

  • Kubernetes Deployment Manifests (Deployment, Service, HPA)
  • Grafana Dashboard JSON for one-click monitoring setup
  • Benchmark Suite to measure throughput (tasks/sec) under various loads
  • Dead Letter Queue for poison message handling
  • Task Priority Queue using Redis Sorted Sets
  • OpenTelemetry Integration for distributed tracing
  • Graceful Shutdown to drain in-flight tasks on SIGTERM
  • Admin Dashboard (Web UI) to inspect queue state



🤝 Contributing

This is a portfolio project demonstrating backend engineering skills. If you have suggestions for improvements or find issues:

  1. Open an issue describing the problem
  2. Submit a PR with test coverage
  3. Follow existing code style (run gofmt)

📄 License

MIT License - feel free to use this as a learning resource or starting point for your own projects.


About

Built by Deepak Chaudhary as a demonstration of production-grade backend engineering skills for software developer roles.

Connect with me:


If this project helped you learn about distributed systems, please ⭐ star the repo!

Built with ❤️ using Go and Redis!
