mishel26/web-crawler

Core Features:

- Distributed Architecture - Multiple workers coordinate via a Redis queue
- Real-time Event Streaming - Kafka integration for live crawl monitoring
- Smart Deduplication - Bloom filter reduces memory by 90% (10MB vs 100MB for 1M URLs)
- Robots.txt Compliance - Automatic parsing and rule enforcement
- Rate Limiting - Token bucket algorithm with domain-level throttling
- Horizontal Scaling - Add more workers without code changes
- Production Monitoring - Prometheus metrics + Grafana dashboards
- Zero IP Bans - Achieved 95% success rate in testing
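The rate-limiting feature can be illustrated with a minimal token bucket sketch. This is not the repository's code: the class names `TokenBucket` and `DomainRateLimiter`, the capacity of 2 tokens, and the refill rate of 1 token per second are all assumptions chosen for illustration. The clock is passed in as a parameter so the behaviour is deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Token bucket: holds up to `capacity` tokens, refilled at `refillPerSecond`. */
final class TokenBucket {
    private final double capacity;
    private final double refillPerSecond;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;           // start full so the first requests pass
        this.lastRefillNanos = nowNanos;
    }

    /** Takes one token if available. The caller supplies the current time. */
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}

/** Domain-level throttling: one bucket per domain, created lazily. */
final class DomainRateLimiter {
    private final Map<String, TokenBucket> buckets = new ConcurrentHashMap<>();
    private final double capacity;
    private final double refillPerSecond;

    DomainRateLimiter(double capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
    }

    boolean tryAcquire(String domain, long nowNanos) {
        return buckets
            .computeIfAbsent(domain, d -> new TokenBucket(capacity, refillPerSecond, nowNanos))
            .tryAcquire(nowNanos);
    }
}

final class TokenBucketDemo {
    public static void main(String[] args) {
        // Hypothetical settings: burst of 2 requests, then 1 request per second per domain.
        DomainRateLimiter limiter = new DomainRateLimiter(2.0, 1.0);
        long now = System.nanoTime();
        for (int i = 0; i < 3; i++) {
            System.out.println("example.com request " + i + " allowed: "
                + limiter.tryAcquire("example.com", now));
        }
    }
}
```

A worker that must wait for a token would presumably sleep and retry when `tryAcquire` returns false. Because each domain has its own bucket, a slow or strict site does not hold up requests to other domains.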

Technical Highlights:

- RESTful APIs with Spring Boot
- Atomic queue operations (no race conditions)
- Configurable worker pool (default: 10 workers)
- Persistent queue (survives restarts via Redis AOF)
- Event replay capability (Kafka topic retention)
- Health checks & actuator endpoints
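Why an atomic dequeue rules out duplicate work can be shown with a small in-process sketch. It assumes a `ConcurrentLinkedQueue` as a stand-in for the Redis list and plain threads as stand-ins for crawler workers. `WorkerPoolSketch`, `DrainResult`, and `drain` are hypothetical names, not taken from the project.

```java
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class WorkerPoolSketch {
    /** How many URLs were handled in total and how many of them were distinct. */
    record DrainResult(int handled, int distinct) {}

    /** Drains `urls` with `workerCount` threads sharing one queue. */
    static DrainResult drain(List<String> urls, int workerCount) {
        Queue<String> queue = new ConcurrentLinkedQueue<>(urls); // stand-in for the Redis list
        Set<String> seen = ConcurrentHashMap.newKeySet();
        AtomicInteger handled = new AtomicInteger();

        ExecutorService pool = Executors.newFixedThreadPool(workerCount);
        for (int i = 0; i < workerCount; i++) {
            pool.execute(() -> {
                String url;
                // poll() is atomic: each element is handed to exactly one caller,
                // which is the property an atomic pop from the shared queue provides.
                while ((url = queue.poll()) != null) {
                    seen.add(url);
                    handled.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return new DrainResult(handled.get(), seen.size());
    }

    public static void main(String[] args) {
        List<String> urls = List.of("https://example.com/a", "https://example.com/b",
                                    "https://example.com/c");
        // 10 mirrors the default worker count listed above.
        System.out.println(drain(urls, 10));
    }
}
```

If `handled` and `distinct` both equal the number of submitted URLs, no URL was processed twice and none was lost, regardless of how many workers ran.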

Data Flow:

  1. URL → Redis Queue (LPUSH)
  2. Worker → Get URL (RPOP) [Atomic Operation]
  3. Worker → Check Robots.txt
  4. Worker → Wait for Rate Limit Token
  5. Worker → Publish "STARTED" event to Kafka
  6. Worker → Fetch Page (Jsoup)
  7. Worker → Extract Links
  8. Worker → Add New Links to Queue
  9. Worker → Mark URL as Visited (Redis Set)
  10. Worker → Publish "COMPLETED" event to Kafka
  11. Metrics → Update Prometheus Counters
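The steps above can be sketched as a single-threaded loop with in-memory stand-ins. This is an illustration under stated assumptions rather than the project's implementation: an `ArrayDeque` replaces the Redis list, a `HashSet` replaces the Redis set, a list of strings replaces the Kafka topic, a map of URL to outgoing links replaces the Jsoup fetch, and an `int` replaces the Prometheus counter. The visited check at dequeue time is also an assumption, since the flow does not say where deduplication happens.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

final class CrawlLoopSketch {
    final Deque<String> queue = new ArrayDeque<>();   // stand-in for the Redis list
    final Set<String> visited = new HashSet<>();      // stand-in for the Redis set
    final List<String> events = new ArrayList<>();    // stand-in for the Kafka topic
    int pagesCrawled = 0;                             // stand-in for a Prometheus counter

    private final Map<String, List<String>> fakeWeb;  // url -> outgoing links (replaces Jsoup)
    private final Predicate<String> robotsAllow;      // true if robots.txt permits the URL

    CrawlLoopSketch(Map<String, List<String>> fakeWeb, Predicate<String> robotsAllow) {
        this.fakeWeb = fakeWeb;
        this.robotsAllow = robotsAllow;
    }

    void submit(String url) {
        queue.addFirst(url);                          // step 1: LPUSH
    }

    /** Runs steps 2-11 for one URL; returns false when the queue is empty. */
    boolean step() {
        String url = queue.pollLast();                // step 2: RPOP
        if (url == null) return false;
        if (visited.contains(url)) return true;       // assumed dedup check
        if (!robotsAllow.test(url)) return true;      // step 3: robots.txt
        // step 4: a real worker would wait here for a rate limit token
        events.add("STARTED " + url);                 // step 5
        List<String> links = fakeWeb.getOrDefault(url, List.of()); // steps 6-7
        for (String link : links) {                   // step 8
            if (!visited.contains(link)) queue.addFirst(link);
        }
        visited.add(url);                             // step 9
        events.add("COMPLETED " + url);               // step 10
        pagesCrawled++;                               // step 11
        return true;
    }

    void run() {
        while (step()) { /* keep going until the queue is drained */ }
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "http://site/a", List.of("http://site/b", "http://site/private"),
            "http://site/b", List.of("http://site/a"));
        CrawlLoopSketch crawler = new CrawlLoopSketch(web, u -> !u.contains("/private"));
        crawler.submit("http://site/a");
        crawler.run();
        crawler.events.forEach(System.out::println);
    }
}
```

Pushing at the head and popping from the tail gives first-in, first-out order, so the crawl proceeds breadth-first from the seed URL.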

Tech Stack:

- Java 17: Modern Java features (records, pattern matching)
- Spring Boot 3.1: Microservice framework with DI
- Redis: Distributed queue + visited URLs tracking
- Apache Kafka: Real-time event streaming
- HTML Parser (Jsoup): HTML parsing and link extraction
- Deduplication (Guava Bloom Filter): Memory-efficient duplicate detection
- Monitoring (Prometheus + Micrometer): Metrics collection and export
- Containerization (Docker + Docker Compose): Easy local setup
- Build Tool (Maven 3.9): Dependency management
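To show how Bloom-filter deduplication works in principle, here is a hand-rolled sketch built on `java.util.BitSet`. It is a simplified stand-in for Guava's `BloomFilter`, not the project's code. The class name `SimpleBloomFilter`, the FNV-1a based double hashing, and the sizes used in `main` are assumptions. The method names `put` and `mightContain` mirror Guava's.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Records the URL by setting numHashes bit positions. */
    void put(String url) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(url, i));
        }
    }

    /** False means definitely never added; true means probably added. */
    boolean mightContain(String url) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(url, i))) return false;
        }
        return true;
    }

    // Double hashing: the i-th position is (h1 + i * h2) mod numBits.
    private int index(String url, int i) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1;  // force odd so the step is never zero
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // FNV-1a style hash with a caller-supplied starting value.
    private static int fnv1a(byte[] data, int seed) {
        int hash = seed;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 0x01000193;
        }
        return hash;
    }

    public static void main(String[] args) {
        // Hypothetical sizing: about 1 MB of bits and 7 hash functions.
        SimpleBloomFilter filter = new SimpleBloomFilter(8 * 1024 * 1024, 7);
        filter.put("https://example.com/");
        System.out.println(filter.mightContain("https://example.com/"));
    }
}
```

A Bloom filter never reports a URL it has seen as new, but it can occasionally report an unseen URL as already seen. For a crawler this means a small fraction of pages may be skipped in exchange for the lower memory footprint.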

Installation:

Clone the repository

git clone https://github.com/yourusername/distributed-web-crawler.git
cd distributed-web-crawler

Start infrastructure services

docker-compose up -d

Wait 30-45 seconds for Kafka to initialize.

Verify services are running

docker-compose ps

Test Redis

docker exec crawler-redis redis-cli ping

Expected: PONG

Test Kafka

docker exec crawler-kafka kafka-topics --bootstrap-server localhost:9092 --list

Build the application

mvn clean install

Run the application

mvn spring-boot:run

The application will start on http://localhost:8080
