Distributed Web Crawler (Go + Redis + Docker)

This project is a small, readable example of how distributed crawling works. It is built for learning, so the code stays practical and simple.

The system has three pieces:

a Seeder that pushes starting URLs,
one or more Workers that crawl pages,
and a Storage module that writes results to JSON files.

Redis is used as the shared queue and coordination layer between worker processes.

About the project

Distributed workers consuming from one queue
Concurrent crawling with Go goroutines
URL deduplication with a Redis Set
Per-domain rate limiting
Crawl depth limiting (MAX_DEPTH)
Metadata extraction (title, description, links)
Simple structured output for later indexing or analysis

Architecture

Components

Seeder: Adds initial URLs from SEED_URLS into Redis.
Worker: Pulls URLs with BLPOP, crawls pages, extracts metadata, and pushes newly found links.
Storage: Appends crawl results into JSON Lines files under ./data.
Redis: Stores queue items, dedupe set, and domain rate-limit keys.

Text Diagram

+------------------+           +------------------+
|  Seeder Service  |           |   Worker N       |
|  (ROLE=seeder)   |           | goroutines BLPOP |
+--------+---------+           +---------+--------+
         |                               |
         | RPUSH CrawlItem (url, depth)  |
         v                               |
      +--+-------------------------------+--+
      |              Redis                  |
      |  List: crawl:queue                  |
      |  Set:  crawl:seen (dedupe)          |
      |  Key:  crawl:domain:last:<domain>   |
      +--+-------------------------------+--+
         ^                               |
         | RPUSH discovered links        | Fetch HTML
+--------+---------+                     | Extract metadata
|   Worker 1       |                     | Store JSON
| goroutines BLPOP |---------------------+
+--------+---------+
         |
         v
+----------------------+
| data/results-*.jsonl |
+----------------------+

Workflow (How It Actually Runs)

Seeder starts and reads SEED_URLS.
Each seed URL is added to Redis only if it has not been seen before.
Workers block on Redis with BLPOP and wait for the next URL.
A worker fetches the page and extracts:
- page title
- meta description
- outgoing links
The result is written to a JSON Lines file.
Discovered links are re-queued with depth + 1.
Crawling stops expanding once MAX_DEPTH is reached.
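The depth rule in the last two steps can be sketched as a small pure function. This is an illustration under one assumption: a page at depth equal to MAX_DEPTH is still crawled, but its links are not re-queued. The function name childItems is hypothetical.

```go
package main

import "fmt"

// CrawlItem mirrors the (url, depth) pair pushed onto the queue.
type CrawlItem struct {
	URL   string
	Depth int
}

// childItems returns the items to re-queue for the links found on parent's
// page. Assumption: once parent.Depth reaches maxDepth, nothing is re-queued.
// (childItems is a hypothetical name used for illustration.)
func childItems(parent CrawlItem, links []string, maxDepth int) []CrawlItem {
	if parent.Depth >= maxDepth {
		return nil
	}
	children := make([]CrawlItem, 0, len(links))
	for _, link := range links {
		children = append(children, CrawlItem{URL: link, Depth: parent.Depth + 1})
	}
	return children
}

func main() {
	seed := CrawlItem{URL: "https://example.com", Depth: 0}
	links := []string{"https://www.iana.org/domains/example"}

	// With maxDepth = 1 the seed's links are re-queued at depth 1.
	fmt.Println(childItems(seed, links, 1))

	// A page already at depth 1 does not expand any further.
	fmt.Println(len(childItems(CrawlItem{URL: links[0], Depth: 1}, links, 1)))
}
```

Under this rule, MAX_DEPTH=0 crawls only the seeds themselves.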

Important behavior:

SEED_URLS are starting points only.
If MAX_DEPTH is greater than 0, workers will crawl links discovered from those seeds.
URL deduplication ensures the same URL is never queued, and so never crawled, twice.

Rate Limiting and Deduplication

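Deduplication: Redis SADD on crawl:seen reports whether a URL was newly added, so only unseen URLs enter the queue.
Rate limiting: Redis SETNX with an expiry on crawl:domain:last:<domain> succeeds only while the key is absent, so a domain is crawled at most once per expiry window (RATE_LIMIT_SECONDS).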

This keeps the crawler from repeatedly hitting the same URL or hammering one domain too quickly.

Setup and Run

1. Start Everything

docker compose up --build

2. Scale Workers (Optional)

docker compose up --build --scale worker=3

3. Stop Services

docker compose down

If you want a clean restart (new Redis data and empty results):

docker compose down -v
rm -rf data
docker compose up --build

Configuration You Will Use Most

SEED_URLS: Comma-separated starting URLs
MAX_DEPTH: Maximum number of link hops from a seed page to follow (seeds are depth 0)
CONCURRENCY: Goroutines per worker container
RATE_LIMIT_SECONDS: Minimum gap between crawls for the same domain

Where Results Are Stored

Results are written to:

./data/results-<worker-id>.jsonl

Each line of the file is a single JSON object. The example below is pretty-printed for readability; in the file it occupies one line:

{
  "url": "https://example.com",
  "depth": 0,
  "title": "Example Domain",
  "description": "...",
  "timestamp": "2026-03-16T12:00:00Z",
  "links": [
    "https://www.iana.org/domains/example"
  ]
}

Project File Map

main.go: app entrypoint and config loading
seeder.go: pushes initial URLs into queue
worker.go: worker loop and concurrent crawl workers
crawler.go: HTTP fetch and HTML metadata extraction
queue.go: Redis queue, dedupe, and rate limit logic
storage.go: JSON Lines file writer
docker-compose.yml: Redis + seeder + worker services
Dockerfile: builds one binary for both roles

Notes

This project is intentionally lightweight rather than production-ready. It skips features such as robots.txt handling, retries with backoff, advanced observability, and autoscaling policies.

The goal is to clearly explain distributed crawling with a clean codebase you can read in one sitting.
