The Java Concurrent Web Crawler is a Spring Boot-based application that provides a concurrent web crawling service with user authentication and authorization. The application allows users to submit keyword-based crawl requests, processes them concurrently, and caches the results for efficient retrieval.
- JWT-based Authentication: Secure user authentication using JSON Web Tokens
- Role-based Authorization: Support for different user roles (e.g., ADMIN, USER)
- Concurrent Web Crawling: Parallel processing of web crawling tasks
- Redis Caching: Efficient caching of crawl states and results
- PostgreSQL Database: Persistent storage of users and crawl requests
- Token Blacklisting: Secure logout functionality with Redis-based token blacklisting
- RESTful API: Clean REST endpoints for crawling and authentication
- Java 25
- Spring Boot 3.x
- Spring Data JPA
- Spring Security
- Spring MVC
- Lombok
- JWT (jjwt)
- PostgreSQL 17
- Redis 7
- Liquibase (Database migrations)
- Docker & Docker Compose
```
java-concurrent-web-crawler/
├── src/
│   ├── main/
│   │   ├── java/
│   │   │   └── com.concurrent_web_crawler/
│   │   │       ├── auth/                       # Authentication module
│   │   │       │   ├── dto/                    # Data Transfer Objects
│   │   │       │   ├── model/                  # Domain models (UserAccount, Role)
│   │   │       │   ├── port.out/               # Output ports
│   │   │       │   ├── repository/             # JPA repositories
│   │   │       │   ├── security/               # Security configuration
│   │   │       │   ├── service/                # Business logic
│   │   │       │   └── web/                    # REST controllers
│   │   │       ├── crawler/                    # Web crawler module
│   │   │       │   ├── dto/                    # DTOs for crawling
│   │   │       │   ├── enumerator/             # Enums (CrawlStatus)
│   │   │       │   ├── infra.executor/         # Crawl job execution
│   │   │       │   ├── model/                  # Domain models
│   │   │       │   ├── port.out/               # Output ports
│   │   │       │   ├── repository/             # JPA repositories
│   │   │       │   ├── service/                # Crawling business logic
│   │   │       │   ├── util/                   # Utility classes
│   │   │       │   └── web/                    # REST controllers
│   │   │       ├── shared/                     # Shared components
│   │   │       │   ├── config/                 # Application configuration
│   │   │       │   └── exception/              # Exception handling
│   │   │       └── CrawlerApplication.java     # Application entry point
│   │   └── resources/
│   │       ├── db.changelog/                   # Liquibase migrations
│   │       └── application.properties          # Application config
│   └── test/
├── docker-compose.yml
├── pom.xml
└── .env.example
```
The application follows a Hexagonal Architecture (Ports and Adapters) pattern:
- Domain Layer: Core business logic and models
- Port Layer: Interfaces defining boundaries
- Adapter Layer: Implementations (Web controllers, Repositories, External services)
- Shared Layer: Cross-cutting concerns (Configuration, Exception handling)
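To make the port/adapter split concrete, here is a minimal, hypothetical sketch of an output port and an infrastructure adapter. The PageFetcherPort name and the JDK HttpClient implementation are illustrative only and are not taken from the project's code:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Output port: the domain layer depends only on this interface (name is illustrative).
interface PageFetcherPort {
    String fetchHtml(String url) throws Exception;
}

// Adapter: an infrastructure-side implementation of the port using the JDK HTTP client.
class HttpClientPageFetcher implements PageFetcherPort {

    private final HttpClient client = HttpClient.newHttpClient();

    @Override
    public String fetchHtml(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

The domain services only see PageFetcherPort, so the HTTP implementation can be swapped (or mocked in tests) without touching business logic.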
Login with username and password.
Request Body:
json { "username": "string", "password": "string" }
Response (Success):
json { "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...", "expiresIn": 3600 }
Logout and blacklist the current token.
Headers: Authorization: Bearer <token>
Response:
json { "message": "Logged out successfully" }
Get current authenticated user information.
Headers: Authorization: Bearer <token>
Response:
json { "id": 1, "username": "john_doe", "roles": ["ROLE_USER"] }
Start a new web crawl for a given keyword.
Headers: Authorization: Bearer <token>
Request Body:
json { "keyword": "spring boot tutorial" }
Response:
json { "crawlId": "abc123xyz" }
Get the current state of a crawl request.
Headers: Authorization: Bearer <token>
Response:
json { "id": "abc123xyz", "status": "ACTIVE", "urls": string[] }
UserAccount:

| Column | Type | Notes |
|--------|------|-------|
| id | BIGINT | Primary key |
| username | VARCHAR | Unique |
| email | VARCHAR | Unique |
| password | VARCHAR | Hashed |
| role | VARCHAR | |
| created_at | TIMESTAMP | |

Crawl request:

| Column | Type | Notes |
|--------|------|-------|
| id | BIGINT | Primary key |
| keyword_normalized | VARCHAR | |
| status | VARCHAR | |
| result_json | JSONB | |
| created_by | BIGINT | FK → UserAccount |
| created_at | TIMESTAMP | |
| completed_at | TIMESTAMP | |
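Assuming conventional Spring Data JPA mappings, the user table could correspond to an entity roughly like the sketch below; the table name and annotations are inferred from the schema above rather than copied from the source:

```java
import jakarta.persistence.*;
import java.time.Instant;

// Sketch of a JPA entity matching the user table; column names follow the schema listed above.
@Entity
@Table(name = "user_account")   // assumed table name
public class UserAccount {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(nullable = false, unique = true)
    private String username;

    @Column(nullable = false, unique = true)
    private String email;

    @Column(nullable = false)
    private String password;        // stored as a BCrypt hash

    @Column(nullable = false)
    private String role;            // e.g. ROLE_USER, ROLE_ADMIN

    @Column(name = "created_at")
    private Instant createdAt;

    // getters/setters omitted (the project uses Lombok)
}
```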
- Java 25 (or compatible JDK)
- Maven 3.8+
- Docker and Docker Compose
```bash
git clone <repository-url>
cd java-concurrent-web-crawler
```
Copy the example environment file and update the values:
```bash
cp .env.example .env
```
Edit .env with your configuration:
```properties
# App DB
APP_DB_HOST=localhost
APP_DB_PORT=5432
APP_DB_NAME=app_db
APP_DB_USER=app_user
APP_DB_PASSWORD=app_password

# Common
DB_DRIVER=org.postgresql.Driver
LIQUIBASE_CHANGELOG=classpath:db/changelog/db.changelog-master.yaml
LIQUIBASE_DEFAULT_SCHEMA=public
SPRING_APP_NAME=crawler

# JWT configuration (IMPORTANT: use a strong secret with at least 32 characters)
JWT_SECRET=your-secret-key-at-least-32-characters-long
JWT_ACCESS_TTL_SECONDS=3600
JWT_REFRESH_TTL_SECONDS=1209600

# Redis
REDIS_HOST=localhost
REDIS_PORT=6379

# Base URL (optional)
BASE_URL=http://localhost:8080
```
```bash
docker-compose up -d
```
This will start:
- PostgreSQL on port 5432
- Redis on port 6380
```bash
./mvnw clean install
```
```bash
./mvnw spring-boot:run
```
The application will start on http://localhost:8080.
The main configuration file is located at src/main/resources/application.properties.
Key configurations:
```properties
# Server
server.port=8080

# Database
spring.datasource.url=jdbc:postgresql://${APP_DB_HOST:localhost}:${APP_DB_PORT:5432}/${APP_DB_NAME:app_db}
spring.datasource.username=${APP_DB_USER:app_user}
spring.datasource.password=${APP_DB_PASSWORD:app_password}
spring.datasource.driver-class-name=${DB_DRIVER:org.postgresql.Driver}

# JPA/Hibernate
spring.jpa.hibernate.ddl-auto=none
spring.jpa.show-sql=true

# Liquibase
spring.liquibase.enabled=true
spring.liquibase.change-log=${LIQUIBASE_CHANGELOG:classpath:db/changelog/db.changelog-master.yaml}
spring.liquibase.default-schema=${LIQUIBASE_DEFAULT_SCHEMA:public}

# Redis
spring.data.redis.host=${REDIS_HOST:localhost}
spring.data.redis.port=${REDIS_PORT:6379}

# JWT
security.jwt.secret=${JWT_SECRET}
security.jwt.access-ttl-seconds=${JWT_ACCESS_TTL_SECONDS:3600}
security.jwt.refresh-ttl-seconds=${JWT_REFRESH_TTL_SECONDS:1209600}

# Cache
spring.cache.type=redis

# Application
spring.application.name=${SPRING_APP_NAME:crawler}
```
The application uses JWT tokens for stateless authentication:
- User logs in with username/password
- Server validates credentials and returns a JWT token
- Client includes the token in the Authorization: Bearer <token> header for subsequent requests
- Server validates the token for each protected endpoint
When a user logs out, the token is added to a Redis-based blacklist to prevent reuse.
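A minimal sketch of how that blacklist check could be wired into the filter chain, assuming a StringRedisTemplate and a blacklist: key prefix (both are assumptions, not the project's actual classes):

```java
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.web.filter.OncePerRequestFilter;
import java.io.IOException;

// Sketch: reject requests whose bearer token was blacklisted at logout.
public class TokenBlacklistFilter extends OncePerRequestFilter {

    private final StringRedisTemplate redis;

    public TokenBlacklistFilter(StringRedisTemplate redis) {
        this.redis = redis;
    }

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        String header = request.getHeader("Authorization");
        if (header != null && header.startsWith("Bearer ")) {
            String token = header.substring(7);
            // "blacklist:" key prefix is an assumption; logout would store the token
            // with a TTL equal to its remaining validity.
            if (Boolean.TRUE.equals(redis.hasKey("blacklist:" + token))) {
                response.sendError(HttpServletResponse.SC_UNAUTHORIZED, "Token revoked");
                return;
            }
        }
        chain.doFilter(request, response);
    }
}
```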
User passwords are hashed using BCrypt before storage.
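With Spring Security this typically means exposing a BCryptPasswordEncoder bean and encoding passwords at registration time; a minimal sketch, assuming standard Spring configuration:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.crypto.bcrypt.BCryptPasswordEncoder;
import org.springframework.security.crypto.password.PasswordEncoder;

@Configuration
public class PasswordConfig {

    // BCrypt salts each hash automatically; encoder.matches(raw, hash) verifies logins.
    @Bean
    public PasswordEncoder passwordEncoder() {
        return new BCryptPasswordEncoder();
    }
}
```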
The application uses Redis for multiple caching layers:
- Crawl State Cache (crawlState): In-progress crawl states
- Final Crawl State Cache (crawlStateFinal): Completed crawl results
- Token Blacklist: Revoked JWT tokens
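Assuming Spring's cache abstraction backed by Redis (spring.cache.type=redis in the configuration above), reads of completed crawls could be served from the crawlStateFinal cache along these lines; the service and method names are illustrative:

```java
import java.util.List;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class CrawlResultReader {

    // Completed results are served from the "crawlStateFinal" cache named above;
    // the fallback body is a placeholder for loading from persistent storage.
    @Cacheable(cacheNames = "crawlStateFinal", key = "#crawlId")
    public List<String> finalUrls(String crawlId) {
        return List.of();
    }
}
```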
The web crawler uses:
- ConcurrentHashMap: Thread-safe storage of active crawl states
- ExecutorService: Concurrent processing of crawl jobs
- Callback Pattern: onStateDone() callback for crawl completion
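Putting those pieces together, a simplified, illustrative sketch of the execution model might look like this (only the onStateDone callback name mirrors the description above; everything else is an assumption):

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Simplified model of the crawl executor: active states live in a ConcurrentHashMap,
// jobs run on a shared ExecutorService, and a callback fires when a crawl finishes.
public class CrawlExecutorSketch {

    private final Map<String, List<String>> activeCrawls = new ConcurrentHashMap<>();
    private final ExecutorService executor = Executors.newFixedThreadPool(8);

    public String submit(String keyword, Consumer<String> onStateDone) {
        String crawlId = UUID.randomUUID().toString().substring(0, 8);
        activeCrawls.put(crawlId, List.of());
        executor.submit(() -> {
            List<String> urls = crawl(keyword);      // visit pages, collect matching URLs
            activeCrawls.put(crawlId, urls);
            onStateDone.accept(crawlId);             // e.g. move the state to the final cache
        });
        return crawlId;
    }

    public List<String> currentUrls(String crawlId) {
        return activeCrawls.getOrDefault(crawlId, List.of());
    }

    private List<String> crawl(String keyword) {
        // Placeholder for the actual fetching/parsing logic.
        return List.of();
    }
}
```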
```bash
./mvnw test
```
```bash
./mvnw clean package
```
The JAR file will be created in the target/ directory.
You can containerize the application:
```dockerfile
FROM eclipse-temurin:25-jdk-alpine
WORKDIR /app
COPY target/*.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]
```
Build and run:
```bash
docker build -t java-concurrent-web-crawler .
docker run -p 8080:8080 --env-file .env java-concurrent-web-crawler
```
Spring Boot Actuator endpoints (if enabled):
- /actuator/health - Application health status
- /actuator/info - Application information
- /actuator/metrics - Application metrics
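If the spring-boot-starter-actuator dependency is added, the endpoints above can be exposed through application.properties, for example:

```properties
# Expose only the Actuator endpoints listed above (assumes the Actuator starter is on the classpath)
management.endpoints.web.exposure.include=health,info,metrics
```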
This project is licensed under the MIT License - see the LICENSE file for details.