A Retrieval-Augmented Generation (RAG) system that crawls a website's documentation, indexes it as vector embeddings, and answers user queries from the indexed content.
- Automated web crawling with configurable depth
- Efficient document processing and chunking
- Vector embeddings using state-of-the-art models
- Fast similarity search using ChromaDB
- REST API interface for easy integration
- Comprehensive logging and monitoring
- Python 3.8+
- Internet connection for web crawling
- Clone the repository:
    git clone <repository-url>
    cd <repository-directory>
- Install dependencies:
    pip install -r requirements.txt
- Start the API server:
    python app.py
- Initialize the model with a website URL:
    curl -X POST "http://localhost:8000/initialize" \
      -H "Content-Type: application/json" \
      -d '{"base_url": "https://example.com", "max_pages": 100}'
- Query the model:
    curl -X POST "http://localhost:8000/query" \
      -H "Content-Type: application/json" \
      -d '{"question": "What is the main purpose of this website?"}'
- Get system statistics:
    curl "http://localhost:8000/stats"

- POST /initialize: Initialize the RAG model with a website URL
- POST /query: Submit a question to the RAG model
- GET /stats: Get system statistics
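The same three calls can be made from Python. The following is a minimal sketch using only the standard library. It assumes the server is running at the address used in the curl examples and that every endpoint replies with JSON; the response format is not documented here, so treat that as an assumption. The helper names `build_request`, `call`, and `demo` are hypothetical and not part of this project.

```python
import json
import urllib.request

# Assumption: same host and port as in the curl examples.
BASE_URL = "http://localhost:8000"


def build_request(path, payload=None):
    """Build a GET request, or a POST request with a JSON body if payload is given."""
    url = f"{BASE_URL}{path}"
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path, payload=None, timeout=30):
    """Send the request and decode the reply (assumed to be JSON)."""
    request = build_request(path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def demo():
    """Run the three documented calls in order; requires a running server."""
    print(call("/initialize", {"base_url": "https://example.com", "max_pages": 100}))
    print(call("/query", {"question": "What is the main purpose of this website?"}))
    print(call("/stats"))
```

Calling `demo()` while the API server is running mirrors the curl sequence: initialize first, then query, then read the statistics.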
The system can be configured through the following parameters:
- max_pages: Maximum number of pages to crawl (default: 100)
- chunk_size: Size of document chunks (default: 1000)
- chunk_overlap: Overlap between chunks (default: 200)
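To illustrate how `chunk_size` and `chunk_overlap` interact, here is a minimal sliding-window sketch. It assumes both values are measured in characters, which this document does not state, and the function `chunk_text` is a hypothetical stand-in rather than the project's actual document processor.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive chunks share chunk_overlap characters.
    Assumption: sizes are in characters, not tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text, so the
        # tail is not emitted again as a redundant shorter chunk.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# With the defaults, a 2500-character text yields windows starting at
# offsets 0, 800, and 1600.
lengths = [len(c) for c in chunk_text("x" * 2500)]  # [1000, 1000, 900]
```

With these defaults each new chunk advances by 800 characters, so a larger overlap preserves more context across chunk boundaries but produces more chunks to embed and store.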
The system consists of four main components:
- Web Crawler: Systematically traverses and extracts content from websites
- Document Processor: Processes and chunks text, generates embeddings
- Vector Database: Stores and retrieves document embeddings
- RAG Model: Combines retrieval and generation for accurate responses
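As an illustration of the first component, the sketch below shows one way a breadth-first crawler with a page limit and a depth limit could be written. It is not the project's crawler: the names `crawl`, `extract_links`, and `max_depth` are hypothetical, the same-host restriction is an assumption, and page fetching is passed in as a callable so the traversal logic can be tried without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute URLs found in html, with #fragments removed."""
    parser = LinkExtractor()
    parser.feed(html)
    links = []
    for href in parser.links:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=100, max_depth=2):
    """Breadth-first crawl limited to start_url's host.

    fetch is a callable mapping a URL to its HTML, or None on failure
    (an assumption made here so the sketch can run offline).
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth >= max_depth:
            continue  # keep the page but do not follow its links
        for link in extract_links(url, html):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages


# A made-up three-page site, used in place of real HTTP requests.
site = {
    "https://example.com/": '<a href="/docs">Docs</a> <a href="https://other.example.org/">Ext</a>',
    "https://example.com/docs": '<a href="/docs/api#intro">API</a> <a href="/">Home</a>',
    "https://example.com/docs/api": "<p>API reference</p>",
}
pages = crawl("https://example.com/", site.get, max_pages=100, max_depth=1)
```

With `max_depth=1` the crawl keeps the start page and the pages it links to directly, and skips the external host; raising the depth to 2 also reaches `/docs/api`.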
- The system uses efficient vector similarity search for fast retrieval
- Document chunks overlap (see chunk_overlap) so that context spanning a chunk boundary is preserved
- The embedding model is chosen for its balance of performance and accuracy
- The system includes comprehensive error handling and logging
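To make the similarity-search point concrete, here is a brute-force sketch of cosine-similarity retrieval in plain Python. In this system ChromaDB performs the search; the code below is only a simplified stand-in for the idea, and the three-dimensional vectors are made-up illustration values, not output of any embedding model. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k=3):
    """Return the k (score, text) pairs most similar to query_vector.

    indexed_chunks is a list of (text, vector) pairs. Every vector is
    compared against the query, which is fine for a sketch but is the
    part a vector database replaces with an index.
    """
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in indexed_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up vectors for illustration only.
index = [
    ("Install dependencies with pip.", [0.9, 0.1, 0.0]),
    ("Query the model over REST.", [0.1, 0.9, 0.2]),
    ("Crawl depth is configurable.", [0.0, 0.2, 0.9]),
]
results = top_k([0.8, 0.2, 0.1], index, k=2)
```

Here the query vector points mostly along the first axis, so the installation chunk ranks first and the REST chunk second. The retrieved chunks would then be passed to the generation step as context.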
This project is licensed under the GNU Affero General Public License; see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.