
Core Features:

Distributed Architecture - Multiple workers coordinate via a Redis queue
Real-time Event Streaming - Kafka integration for live crawl monitoring
Smart Deduplication - Bloom filter reduces memory by 90% (10MB vs 100MB for 1M URLs)
Robots.txt Compliance - Automatic parsing and rule enforcement
Rate Limiting - Token bucket algorithm with domain-level throttling
Horizontal Scaling - Add more workers without code changes
Production Monitoring - Prometheus metrics + Grafana dashboards
Zero IP Bans - Achieved 95% success rate in testing
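
The domain-level token bucket mentioned in the feature list could look roughly like the sketch below. It is not taken from the project's source: the class name `DomainRateLimiter`, the capacity, and the refill rate are all assumptions chosen for illustration, and the clock is passed in explicitly so the behavior is easy to check.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-domain token bucket; names and limits are illustrative only. */
public class DomainRateLimiter {
    private static final class Bucket {
        double tokens;
        long lastRefillNanos;

        Bucket(double tokens, long lastRefillNanos) {
            this.tokens = tokens;
            this.lastRefillNanos = lastRefillNanos;
        }
    }

    private final double capacity;        // maximum burst size per domain
    private final double refillPerSecond; // sustained request rate per domain
    private final Map<String, Bucket> buckets = new HashMap<>();

    public DomainRateLimiter(double capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
    }

    /** Consumes one token for the domain if available at time nowNanos. */
    public synchronized boolean tryAcquire(String domain, long nowNanos) {
        // A domain seen for the first time starts with a full bucket.
        Bucket b = buckets.computeIfAbsent(domain, d -> new Bucket(capacity, nowNanos));
        // Refill in proportion to elapsed time, capped at capacity.
        double elapsedSeconds = (nowNanos - b.lastRefillNanos) / 1_000_000_000.0;
        b.tokens = Math.min(capacity, b.tokens + elapsedSeconds * refillPerSecond);
        b.lastRefillNanos = nowNanos;
        if (b.tokens >= 1.0) {
            b.tokens -= 1.0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed limits: burst of 2 requests, then 1 request per second per domain.
        DomainRateLimiter limiter = new DomainRateLimiter(2, 1.0);
        long now = System.nanoTime();
        System.out.println(limiter.tryAcquire("example.com", now));
        System.out.println(limiter.tryAcquire("example.com", now));
        System.out.println(limiter.tryAcquire("example.com", now));
    }
}
```

A worker would call `tryAcquire` with the URL's host before fetching and wait or requeue the URL when it returns false. With several worker processes, the bucket state would have to live somewhere shared (for example in Redis) rather than in a local map.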

Technical Highlights:

RESTful APIs with Spring Boot
Atomic queue operations (no race conditions)
Configurable worker pool (default: 10 workers)
Persistent queue (survives restarts via Redis AOF)
Event replay capability (Kafka topic retention)
Health checks & actuator endpoints

Data Flow:

URL → Redis Queue (LPUSH)
Worker → Get URL (RPOP) [Atomic Operation]
Worker → Check Robots.txt
Worker → Wait for Rate Limit Token
Worker → Publish "STARTED" event to Kafka
Worker → Fetch Page (Jsoup)
Worker → Extract Links
Worker → Add New Links to Queue
Worker → Mark URL as Visited (Redis Set)
Worker → Publish "COMPLETED" event to Kafka
Metrics → Update Prometheus Counters
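
The flow above can be sketched as a single-process loop. This is only an illustration under stated assumptions: an in-memory `ArrayDeque` stands in for the Redis list (`addFirst` for LPUSH, `pollLast` for RPOP), a `HashSet` for the Redis visited set, a list of strings for the Kafka events, and a hypothetical `linkGraph` map replaces the Jsoup fetch and link extraction. The robots.txt check, rate limiting, and metrics steps are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** In-memory stand-in for the crawl loop; no Redis, Kafka, or network access. */
public class CrawlLoopSketch {

    /** Crawls a fake link graph from seed and returns the emitted events in order. */
    public static List<String> crawl(String seed, Map<String, List<String>> linkGraph) {
        Deque<String> queue = new ArrayDeque<>(); // stands in for the Redis list
        Set<String> visited = new HashSet<>();    // stands in for the Redis set
        List<String> events = new ArrayList<>();  // stands in for the Kafka topic

        queue.addFirst(seed);                     // URL -> queue (LPUSH)
        while (!queue.isEmpty()) {
            String url = queue.pollLast();        // worker gets URL (RPOP)
            if (visited.contains(url)) {
                continue;                         // queued more than once; skip
            }
            events.add("STARTED " + url);
            // "Fetch" the page and "extract" links from the fake graph.
            List<String> links = linkGraph.getOrDefault(url, List.of());
            for (String link : links) {
                if (!visited.contains(link)) {
                    queue.addFirst(link);         // add new links to the queue
                }
            }
            visited.add(url);                     // mark URL as visited
            events.add("COMPLETED " + url);
        }
        return events;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("c", "a"));
        crawl("a", graph).forEach(System.out::println);
    }
}
```

Because a URL is marked as visited only after its links are queued, the same URL can enter the queue more than once; the check right after the pop keeps it from being fetched twice.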

Tech Stack:

Java 17: Modern Java features (records, pattern matching)
Spring Boot 3.1: Microservice framework with DI
Redis: Distributed queue + visited URLs tracking
Apache Kafka: Real-time event streaming
HTML Parser (Jsoup): HTML parsing and link extraction
Deduplication (Guava Bloom Filter): Memory-efficient duplicate detection
Monitoring (Prometheus, Micrometer): Metrics collection and export
Containerization (Docker + Docker Compose): Easy local setup
Build Tool (Maven 3.9): Dependency management
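
To show why a Bloom filter is memory-efficient for duplicate detection, here is a minimal standard-library sketch. It is a stand-in for the Guava Bloom filter named above, not the project's code: the class name `UrlBloomFilter`, the FNV-1a style hashing, and the sizing parameters are assumptions. The filter stores only bits, never the URLs themselves, so it can report false positives but never false negatives.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Illustrative Bloom filter for URLs; a stand-in for Guava's BloomFilter. */
public class UrlBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public UrlBloomFilter(int expectedItems, double falsePositiveRate) {
        // Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        double ln2 = Math.log(2);
        this.numBits = Math.max(1,
                (int) Math.ceil(-expectedItems * Math.log(falsePositiveRate) / (ln2 * ln2)));
        this.numHashes = Math.max(1,
                (int) Math.round((double) numBits / expectedItems * ln2));
        this.bits = new BitSet(numBits);
    }

    /** FNV-1a style 64-bit hash; the start value selects one of two hash functions. */
    private static long hash(byte[] data, long start) {
        long h = start;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** Derives the i-th bit index from two base hashes (double hashing). */
    private int index(long h1, long h2, int i) {
        return Math.floorMod(h1 + i * h2, numBits);
    }

    public void put(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        long h1 = hash(data, 0xcbf29ce484222325L);
        long h2 = hash(data, 0x84222325cbf29ce4L);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(h1, h2, i));
        }
    }

    /** False means definitely never added; true means probably added. */
    public boolean mightContain(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        long h1 = hash(data, 0xcbf29ce484222325L);
        long h2 = hash(data, 0x84222325cbf29ce4L);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A worker would call `mightContain(url)` before queueing a link and `put(url)` once it has been seen. The expected item count and false-positive rate passed to the constructor determine the memory footprint; the values the project actually uses are not shown in this README.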

Installation:

Clone the repository

git clone https://github.com/yourusername/distributed-web-crawler.git
cd distributed-web-crawler

Start infrastructure services

docker-compose up -d

Wait 30-45 seconds for Kafka to initialize.

Verify services are running

docker-compose ps

Test Redis

docker exec crawler-redis redis-cli ping

Expected: PONG

Test Kafka

docker exec crawler-kafka kafka-topics --bootstrap-server localhost:9092 --list

Build the application

mvn clean install

Run the application

mvn spring-boot:run

The application will start on http://localhost:8080

mishel26/web-crawler
