
Fetch Jobs from Portals | Extract Skills | Store in MongoDB (Apache Kafka, Java Spring Boot, Python Scraper)


Job Collection Pipeline

Overview

The Job Collection Pipeline is a comprehensive system designed to crawl job postings from LinkedIn and Indeed, process the data in real time, and store it in MongoDB. The system uses Docker, Kafka, Puppeteer, Spring Boot, and an optional LLM (Large Language Model) integration via Ollama.

Components

  1. Docker Compose Configuration (Kafka-compose.yml)

    • Contains services for MongoDB, ZooKeeper, Kafka, Kafdrop, and the Ollama LLM runtime.
    • Allows easy deployment and management of the required services using Docker.
  2. Model Initialization (entrypoint.sh)

    • Executes the necessary steps to pull the required model for Ollama inside the container.
    • Ensures that the LLM model is available for use within the system.
  3. Crawler (Crawler Package)

    • Provides a comprehensive package for crawling job postings from LinkedIn and Indeed.
    • Utilizes Puppeteer to scrape data efficiently from the target websites.
  4. Kafka Streaming (Kafka Integration)

    • Streams the crawled job data in real-time to the Spring Boot application using Kafka.
    • Generates CSV files containing all crawled postings and unscrapable URLs for further analysis.
  5. Kafka Management (Kafdrop, Docker Terminal)

    • Facilitates the management and monitoring of Kafka topics using Kafdrop or Docker terminal.
    • Allows users to inspect message queues and debug any issues that may arise.
  6. Spring Boot Application

    • Consumes data from Kafka and processes it.
    • Extracts skills from job descriptions using text processing and fuzzy matching (see the sketch after this list).
    • Integrates with the LLM model for advanced text analysis (optional).
    • Inserts processed data into MongoDB using the MongoDB Java API.
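
The repository's own extraction code is not reproduced here; as a rough illustration of the fuzzy-matching idea behind the skill-extraction step, the self-contained Java sketch below scores description tokens against a small skill dictionary using classic Levenshtein distance. The skill list, threshold, and class name are hypothetical placeholders, not taken from this project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: skill list, threshold, and names are hypothetical.
public class SkillExtractorSketch {

    private static final List<String> SKILLS =
            List.of("java", "python", "kafka", "mongodb", "docker", "spring boot");

    // Returns skills whose best token match in the description is "close enough".
    public static List<String> extractSkills(String description) {
        String[] tokens = description.toLowerCase(Locale.ROOT).split("[^a-z0-9+#.]+");
        List<String> found = new ArrayList<>();
        for (String skill : SKILLS) {
            for (String token : tokens) {
                // Accept exact matches or small edit distances (fuzzy match).
                int distance = levenshtein(skill.replace(" ", ""), token);
                if (distance <= Math.max(1, skill.length() / 5)) {
                    found.add(skill);
                    break;
                }
            }
        }
        return found;
    }

    // Classic dynamic-programming Levenshtein edit distance.
    private static int levenshtein(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                                    dp[i - 1][j - 1] + cost);
            }
        }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(extractSkills("Looking for a Jawa developer with Kafka and MongoDB experience"));
        // Prints [java, kafka, mongodb]: "Jawa" is within edit distance 1 of "java".
    }
}
```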

Usage

  1. Setup Docker Compose

    • Ensure Docker Compose is installed on your system.
    • Run docker-compose -f Kafka-compose.yml up to start all required services.
  2. Initialize Model

    • Execute entrypoint.sh to pull the required model for Ollama.
  3. Run Crawler

    • Navigate to the crawler package and execute the crawling script.
    • Monitor the process and check generated CSV files for crawled data and unscrapable URLs.
  4. Manage Kafka

    • Access Kafdrop or use Docker terminal to manage Kafka topics and messages.
  5. Configure Spring Boot

    • Modify application.properties to configure MongoDB URI and other settings as needed.
  6. Run Spring Boot Application

    • Start the Spring Boot application to consume Kafka messages and process job postings (a consumer sketch follows this list).
    • Verify that data is correctly inserted into MongoDB.
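
For orientation, here is a minimal sketch of what the consuming side can look like with Spring Kafka and Spring Data MongoDB on the classpath. The topic name, group id, collection name, "title||description" message format, and the naive keyword matcher are assumptions for illustration, not the repository's actual configuration.

```java
import java.util.ArrayList;
import java.util.List;

import org.bson.Document;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Service;

// Minimal sketch; topic, group id, collection, and message format are assumed.
@Service
public class JobPostingConsumerSketch {

    private final MongoTemplate mongoTemplate;

    public JobPostingConsumerSketch(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    // Consume one crawled posting per Kafka message and store the processed result.
    @KafkaListener(topics = "job-postings", groupId = "job-pipeline")
    public void consume(String message) {
        String[] parts = message.split("\\|\\|", 2);
        String title = parts[0];
        String description = parts.length > 1 ? parts[1] : "";

        Document job = new Document("title", title)
                .append("description", description)
                .append("skills", extractSkills(description));

        mongoTemplate.insert(job, "jobs");
    }

    // Placeholder: the real pipeline applies fuzzy matching (and optionally the LLM).
    private List<String> extractSkills(String description) {
        List<String> found = new ArrayList<>();
        String lower = description.toLowerCase();
        for (String skill : List.of("java", "python", "kafka", "mongodb", "docker")) {
            if (lower.contains(skill)) {
                found.add(skill);
            }
        }
        return found;
    }
}
```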

Integration with LLM Model (Optional)

  1. Uncomment the relevant code in the Kafka consumer service file to enable integration with the LLM model.
  2. Follow the instructions provided in the code comments to chat with the LLM and extract specific information from job descriptions (a request sketch follows below).
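
As a rough illustration of what such a call can look like, the sketch below posts a prompt to the Ollama container's /api/generate endpoint using Java's built-in HTTP client. The model name ("llama2"), prompt wording, and class name are placeholder assumptions; the project's actual integration lives in the commented-out code of the Kafka consumer service.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of one possible way to query the Ollama container over its REST API.
// Model name and prompt are assumptions, not the repository's configuration.
public class OllamaClientSketch {

    public static void main(String[] args) throws Exception {
        String description = "We need a backend engineer with Java, Kafka and MongoDB experience.";
        String prompt = "List the technical skills mentioned in this job description: " + description;

        // /api/generate is Ollama's non-chat completion endpoint; stream=false
        // returns the whole answer in a single JSON object.
        String body = """
                {"model": "llama2", "prompt": "%s", "stream": false}
                """.formatted(prompt.replace("\"", "\\\""));

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11434/api/generate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // The JSON reply carries the model's answer in its "response" field.
        System.out.println(response.body());
    }
}
```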

Video Demonstration

Screencast.from.03-17-24.03-47-18.2.mp4

Credits

  1. FuzzyMatching - Medium Article
  2. LLM Model - Ollama

Contributing

Contributions to the Job Collection Pipeline are welcome! Feel free to submit bug reports, feature requests, or pull requests to help improve the pipeline.

License

This project is licensed under the MIT License.
