Core Features:
- Distributed Architecture: Multiple workers coordinate via a Redis queue
- Real-time Event Streaming: Kafka integration for live crawl monitoring
- Smart Deduplication: Bloom filter reduces memory by 90% (10 MB vs 100 MB for 1M URLs)
- Robots.txt Compliance: Automatic parsing and rule enforcement
- Rate Limiting: Token bucket algorithm with domain-level throttling
- Horizontal Scaling: Add more workers without code changes
- Production Monitoring: Prometheus metrics + Grafana dashboards
- Zero IP Bans: Achieved 95% success rate in testing
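To make the domain-level throttling concrete, here is a minimal token bucket sketch using only the JDK. It is not the project's actual class: `DomainRateLimiter`, its constructor parameters, and the explicit `nowNanos` argument are assumptions chosen for illustration, and the state is kept in process memory rather than shared between workers.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-domain token bucket; names and parameters are illustrative only. */
final class DomainRateLimiter {
    private static final class Bucket {
        double tokens;
        long lastRefillNanos;

        Bucket(double tokens, long now) {
            this.tokens = tokens;
            this.lastRefillNanos = now;
        }
    }

    private final double capacity;        // maximum burst size per domain
    private final double refillPerSecond; // steady-state requests per second per domain
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    DomainRateLimiter(double capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
    }

    /** Takes one token for the domain if available; returns false when the bucket is empty. */
    boolean tryAcquire(String domain, long nowNanos) {
        // Each domain starts with a full bucket the first time it is seen.
        Bucket b = buckets.computeIfAbsent(domain, d -> new Bucket(capacity, nowNanos));
        synchronized (b) {
            // Refill in proportion to the time elapsed since the last call, capped at capacity.
            double elapsedSeconds = Math.max(0, nowNanos - b.lastRefillNanos) / 1_000_000_000.0;
            b.tokens = Math.min(capacity, b.tokens + elapsedSeconds * refillPerSecond);
            b.lastRefillNanos = nowNanos;
            if (b.tokens >= 1.0) {
                b.tokens -= 1.0;
                return true;
            }
            return false;
        }
    }

    /** Convenience overload that uses the JVM's monotonic clock. */
    boolean tryAcquire(String domain) {
        return tryAcquire(domain, System.nanoTime());
    }
}
```

A worker would call `tryAcquire(host)` before each fetch and sleep or requeue the URL when it returns false. Sharing the limit across several worker processes would require keeping the bucket state in a shared store such as Redis, which this sketch does not attempt.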
Technical Highlights:
- RESTful APIs with Spring Boot
- Atomic queue operations (no race conditions)
- Configurable worker pool (default: 10 workers)
- Persistent queue (survives restarts via Redis AOF)
- Event replay capability (Kafka topic retention)
- Health checks & actuator endpoints
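The combination of a worker pool and atomic queue operations can be sketched with JDK classes alone. This is only an illustration under assumptions: `WorkerPoolSketch` and `drain` are hypothetical names, and a `ConcurrentLinkedDeque` stands in for the Redis list, with `addFirst` playing the role of LPUSH and `pollLast` the role of RPOP.

```java
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: a fixed pool of workers draining a shared queue. */
final class WorkerPoolSketch {

    /** Drains the queue with workerCount threads and returns how many URLs were popped. */
    static int drain(List<String> urls, int workerCount) {
        ConcurrentLinkedDeque<String> queue = new ConcurrentLinkedDeque<>();
        urls.forEach(queue::addFirst);                      // like LPUSH: push at the head

        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workerCount);
        for (int i = 0; i < workerCount; i++) {
            pool.submit(() -> {
                String url;
                // like RPOP: each pop is atomic, so no two workers receive the same entry
                while ((url = queue.pollLast()) != null) {
                    processed.incrementAndGet();            // the real worker would crawl url here
                }
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();             // preserve the interrupt for callers
        }
        return processed.get();
    }
}
```

Calling `WorkerPoolSketch.drain(urls, 10)` mirrors the default of 10 workers. Because every pop removes exactly one entry, the number of pops equals the number of queued URLs regardless of how the threads interleave.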
Data Flow:
- URL → Redis Queue (LPUSH)
- Worker → Get URL (RPOP) [Atomic Operation]
- Worker → Check Robots.txt
- Worker → Wait for Rate Limit Token
- Worker → Publish "STARTED" event to Kafka
- Worker → Fetch Page (Jsoup)
- Worker → Extract Links
- Worker → Add New Links to Queue
- Worker → Mark URL as Visited (Redis Set)
- Worker → Publish "COMPLETED" event to Kafka
- Metrics → Update Prometheus Counters
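The steps above can be sketched as a single-worker loop. Everything here is an in-memory stand-in rather than the project's code: `CrawlLoopSketch`, `fetchLinks`, and `allowedByRobots` are hypothetical names, a deque replaces the Redis list, a `HashSet` replaces the Redis set, and a plain list replaces the Kafka topic. The sketch also assumes that a URL already marked as visited is skipped when popped, which the list above does not state explicitly.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical single-worker version of the data flow, with in-memory stand-ins. */
final class CrawlLoopSketch {

    /**
     * Runs until the queue is empty and returns the events that would go to Kafka.
     * fetchLinks stands in for "fetch page + extract links";
     * allowedByRobots stands in for the robots.txt check.
     */
    static List<String> run(String seed,
                            Function<String, List<String>> fetchLinks,
                            Predicate<String> allowedByRobots) {
        Deque<String> queue = new ArrayDeque<>();   // stand-in for the Redis list
        Set<String> visited = new HashSet<>();      // stand-in for the Redis set
        List<String> events = new ArrayList<>();    // stand-in for the Kafka topic

        queue.addFirst(seed);                       // LPUSH
        while (!queue.isEmpty()) {
            String url = queue.pollLast();          // RPOP
            // Assumption: skip URLs that were already crawled or are disallowed.
            if (visited.contains(url) || !allowedByRobots.test(url)) {
                continue;
            }
            // A rate-limit wait would go here before the fetch.
            events.add("STARTED " + url);
            for (String link : fetchLinks.apply(url)) {
                if (!visited.contains(link)) {
                    queue.addFirst(link);           // add newly discovered links to the queue
                }
            }
            visited.add(url);                       // mark as visited
            events.add("COMPLETED " + url);
        }
        return events;
    }
}
```

Passing the fetcher and the robots check as functions keeps the loop testable without a network; in the real service those would be backed by Jsoup and the robots.txt parser, and the metrics update would follow each COMPLETED event.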
Tech Stack:
- Java 17: Modern Java features (records, pattern matching)
- Spring Boot 3.1: Microservice framework with DI
- Redis: Distributed queue + visited URLs tracking
- Apache Kafka: Real-time event streaming
- HTML Parser (Jsoup): HTML parsing and link extraction
- Deduplication (Guava Bloom Filter): Memory-efficient duplicate detection
- Monitoring (Prometheus + Micrometer): Metrics collection and export
- Containerization (Docker + Docker Compose): Easy local setup
- Build Tool (Maven 3.9): Dependency management
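To show why a Bloom filter is memory-efficient for duplicate detection, here is a small JDK-only sketch. It is a stand-in, not Guava's implementation: `UrlBloomFilter`, its hashing scheme (String.hashCode combined with FNV-1a via double hashing), and the sizing helper are assumptions for illustration. The sizing uses the standard formulas m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Hypothetical minimal Bloom filter for URLs; not thread-safe. */
final class UrlBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    UrlBloomFilter(int expectedUrls, double falsePositiveRate) {
        this.numBits = optimalBits(expectedUrls, falsePositiveRate);
        this.numHashes = Math.max(1, (int) Math.round((double) numBits / expectedUrls * Math.log(2)));
        this.bits = new BitSet(numBits);
    }

    /** Standard sizing formula: m = -n * ln(p) / (ln 2)^2, rounded up. */
    static int optimalBits(int n, double p) {
        return (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** Adds the URL; returns true if at least one bit changed (the URL was definitely new). */
    boolean add(String url) {
        boolean changed = false;
        int h1 = url.hashCode();
        int h2 = fnv1a(url) | 1;                 // force an odd, non-zero step
        for (int i = 0; i < numHashes; i++) {
            int idx = Math.floorMod(h1 + i * h2, numBits);
            if (!bits.get(idx)) {
                bits.set(idx);
                changed = true;
            }
        }
        return changed;
    }

    /** Returns false if the URL was never added; true means "probably added". */
    boolean mightContain(String url) {
        int h1 = url.hashCode();
        int h2 = fnv1a(url) | 1;
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) {
                return false;
            }
        }
        return true;
    }

    /** 32-bit FNV-1a over the UTF-8 bytes, used as the second hash. */
    private static int fnv1a(String s) {
        int hash = 0x811C9DC5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 16777619;
        }
        return hash;
    }
}
```

The filter stores only bits, never the URLs themselves, so its size depends on the expected URL count and the chosen false-positive rate rather than on URL length. The trade-off is that `mightContain` can occasionally report a URL as seen when it was not, which for a crawler means a small fraction of pages may be skipped.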
Installation:
Clone the repository
git clone https://github.com/yourusername/distributed-web-crawler.git
cd distributed-web-crawler
Start infrastructure services
docker-compose up -d
Wait 30-45 seconds for Kafka to initialize.
Verify services are running
docker-compose ps
docker exec crawler-redis redis-cli ping
docker exec crawler-kafka kafka-topics --bootstrap-server localhost:9092 --list
Build the application
mvn clean install
Run the application
mvn spring-boot:run
The application will start on http://localhost:8080
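Once the application is running, its health endpoint can be polled from Java as a quick smoke test. This is a sketch under assumptions: `HealthCheck` is a hypothetical helper, the path `/actuator/health` is Spring Boot Actuator's default and may be configured differently in this project, and the check assumes the default compact JSON body containing `"status":"UP"`.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

/** Hypothetical smoke test against the Actuator health endpoint. */
final class HealthCheck {

    /** Builds the health URI from the base URL, assuming the default Actuator path. */
    static URI healthUri(String baseUrl) {
        return URI.create(baseUrl).resolve("/actuator/health");
    }

    /** Returns true if the endpoint answers 200 and reports status UP. */
    static boolean isUp(String baseUrl) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(healthUri(baseUrl))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode() == 200 && response.body().contains("\"status\":\"UP\"");
    }
}
```

For example, `HealthCheck.isUp("http://localhost:8080")` would be expected to return true after startup completes, provided the endpoint is exposed at the default path.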