This project is a small, readable example of how distributed crawling works. It is built for learning, so the code stays practical and simple.
The system has three pieces:
- a Seeder that pushes starting URLs,
- one or more Workers that crawl pages,
- and a Storage module that writes results to JSON files.
Redis is used as the shared queue and coordination layer between worker processes.
Key features:

- Distributed workers consuming from one queue
- Concurrent crawling with Go goroutines
- URL deduplication with a Redis set
- Per-domain rate limiting
- Crawl depth scheduling (`MAX_DEPTH`)
- Metadata extraction (title, description, links)
- Simple structured output for later indexing or analysis
- Seeder: Adds initial URLs from `SEED_URLS` into Redis.
- Worker: Pulls URLs with `BLPOP`, crawls pages, extracts metadata, and pushes newly found links.
- Storage: Appends crawl results into JSON Lines files under `./data`.
- Redis: Stores queue items, the dedupe set, and domain rate-limit keys.
    +------------------+                  +------------------+
    |  Seeder Service  |                  |     Worker N     |
    |  (ROLE=seeder)   |                  | goroutines BLPOP |
    +--------+---------+                  +---------+--------+
             |                                      |
             | RPUSH CrawlItem (url, depth)         |
             v                                      |
          +--+--------------------------------------+--+
          |                   Redis                    |
          |  List: crawl:queue                         |
          |  Set:  crawl:seen (dedupe)                 |
          |  Key:  crawl:domain:last:<domain>          |
          +--+--------------------------------------+--+
             ^                                      |
             | RPUSH discovered links               | Fetch HTML
    +--------+---------+                            | Extract metadata
    |     Worker 1     |                            | Store JSON
    | goroutines BLPOP |----------------------------+
    +--------+---------+
             |
             v
    +----------------------+
    | data/results-*.jsonl |
    +----------------------+
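The `CrawlItem` pushed onto `crawl:queue` in the diagram can be sketched as a small struct. The source only shows `(url, depth)`, so the exact field names, tags, and encoding here are assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CrawlItem is the unit of work pushed onto the Redis list.
// Field names are illustrative; the real struct may differ.
type CrawlItem struct {
	URL   string `json:"url"`
	Depth int    `json:"depth"`
}

// Encode serializes the item for RPUSH onto crawl:queue.
func (c CrawlItem) Encode() (string, error) {
	b, err := json.Marshal(c)
	return string(b), err
}

// DecodeItem parses a payload popped by BLPOP.
func DecodeItem(payload string) (CrawlItem, error) {
	var c CrawlItem
	err := json.Unmarshal([]byte(payload), &c)
	return c, err
}

func main() {
	payload, _ := CrawlItem{URL: "https://example.com", Depth: 0}.Encode()
	fmt.Println(payload) // {"url":"https://example.com","depth":0}
	item, _ := DecodeItem(payload)
	fmt.Println(item.URL, item.Depth) // https://example.com 0
}
```

Encoding the whole item as one JSON string keeps the Redis list a plain list of strings, which is all `RPUSH`/`BLPOP` need.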
How it works:

- The Seeder starts and reads `SEED_URLS`.
- Each seed URL is added to Redis only if it has not been seen before.
- Workers block on Redis with `BLPOP` and wait for the next URL.
- A worker fetches the page and extracts:
  - page title
  - meta description
  - outgoing links
- The result is written to a JSON Lines file.
- Discovered links are re-queued with `depth + 1`.
- Crawling stops expanding once `MAX_DEPTH` is reached.
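The extraction step above could look roughly like this. The real `crawler.go` may well use a proper HTML parser; this regexp-based sketch only illustrates which pieces of metadata come out of a page:

```go
package main

import (
	"fmt"
	"regexp"
)

// Regexp-based extraction is fragile on real-world HTML; it is used here
// only to keep the sketch dependency-free.
var (
	titleRe = regexp.MustCompile(`(?is)<title[^>]*>(.*?)</title>`)
	descRe  = regexp.MustCompile(`(?is)<meta\s+name="description"\s+content="([^"]*)"`)
	linkRe  = regexp.MustCompile(`(?is)<a\s+[^>]*href="([^"]+)"`)
)

func extractTitle(html string) string {
	if m := titleRe.FindStringSubmatch(html); m != nil {
		return m[1]
	}
	return ""
}

func extractDescription(html string) string {
	if m := descRe.FindStringSubmatch(html); m != nil {
		return m[1]
	}
	return ""
}

func extractLinks(html string) []string {
	var links []string
	for _, m := range linkRe.FindAllStringSubmatch(html, -1) {
		links = append(links, m[1])
	}
	return links
}

func main() {
	page := `<html><head><title>Example Domain</title>
	<meta name="description" content="An example page"></head>
	<body><a href="https://www.iana.org/domains/example">More</a></body></html>`
	fmt.Println(extractTitle(page))       // Example Domain
	fmt.Println(extractDescription(page)) // An example page
	fmt.Println(extractLinks(page))       // [https://www.iana.org/domains/example]
}
```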
Important behavior:
- `SEED_URLS` are starting points only.
- If `MAX_DEPTH` is greater than `0`, workers will crawl links discovered from those seeds.
- URL dedupe ensures the same URL is not crawled twice.
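The seed-then-expand behavior can be simulated with an in-memory queue and seen-set standing in for the Redis list and set. This is a sketch of the rule described above, not the project's actual code:

```go
package main

import "fmt"

type item struct {
	url   string
	depth int
}

// expand simulates the queue discipline: seeds enter at depth 0, each page's
// links are re-queued at depth+1, the seen set dedupes, and pages at
// MAX_DEPTH are crawled but not expanded further. links maps a URL to the
// links "found" on that page.
func expand(seeds []string, links map[string][]string, maxDepth int) []item {
	seen := map[string]bool{}
	var queue, crawled []item
	for _, s := range seeds {
		if !seen[s] {
			seen[s] = true
			queue = append(queue, item{s, 0})
		}
	}
	for len(queue) > 0 {
		it := queue[0]
		queue = queue[1:]
		crawled = append(crawled, it)
		if it.depth >= maxDepth {
			continue // stop expanding once MAX_DEPTH is reached
		}
		for _, l := range links[it.url] {
			if !seen[l] {
				seen[l] = true
				queue = append(queue, item{l, it.depth + 1})
			}
		}
	}
	return crawled
}

func main() {
	links := map[string][]string{
		"a": {"b", "c"},
		"b": {"a", "d"},
	}
	for _, it := range expand([]string{"a"}, links, 1) {
		fmt.Println(it.url, it.depth)
	}
	// a 0, b 1, c 1 — "d" is never reached because MAX_DEPTH is 1,
	// and "a" is never re-crawled thanks to the seen set.
}
```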
- Deduplication: Redis `SADD` on `crawl:seen` means only new URLs enter the queue.
- Rate limiting: Redis `SETNX` with an expiry on `crawl:domain:last:<domain>` limits how often the same domain is crawled.
This keeps the crawler from repeatedly hitting the same URL or hammering one domain too quickly.
Build and start the stack:

    docker compose up --build

Run with three workers:

    docker compose up --build --scale worker=3

Stop everything:

    docker compose down

If you want a clean restart (new Redis data and empty results):

    docker compose down -v
    rm -rf data
    docker compose up --build

Configuration environment variables:

- `SEED_URLS`: Comma-separated starting URLs
- `MAX_DEPTH`: How far from seed pages to continue crawling
- `CONCURRENCY`: Goroutines per worker container
- `RATE_LIMIT_SECONDS`: Minimum gap between crawls for the same domain
Results are written to:

    ./data/results-<worker-id>.jsonl
Each line is a single JSON object, for example:
    {
      "url": "https://example.com",
      "depth": 0,
      "title": "Example Domain",
      "description": "...",
      "timestamp": "2026-03-16T12:00:00Z",
      "links": [
        "https://www.iana.org/domains/example"
      ]
    }

Project layout:

- `main.go`: app entrypoint and config loading
- `seeder.go`: pushes initial URLs into the queue
- `worker.go`: worker loop and concurrent crawl workers
- `crawler.go`: HTTP fetch and HTML metadata extraction
- `queue.go`: Redis queue, dedupe, and rate-limit logic
- `storage.go`: JSON Lines file writer
- `docker-compose.yml`: Redis + seeder + worker services
- `Dockerfile`: builds one binary for both roles
This project is intentionally not production-heavy. It skips features like robots.txt handling, retries with backoff, advanced observability, and autoscaling policies.
The goal is to clearly explain distributed crawling with a clean codebase you can read in one sitting.