RAG System for Website Documentation

A Retrieval-Augmented Generation (RAG) system that crawls a website's documentation, indexes it as vector embeddings, and answers user queries against the indexed content.

Features

Automated web crawling with configurable depth
Efficient document processing and chunking
Vector embeddings using state-of-the-art models
Fast similarity search using ChromaDB
REST API interface for easy integration
Comprehensive logging and monitoring

Prerequisites

Python 3.8+
Internet connection for web crawling

Installation

Clone the repository:

git clone <repository-url>
cd <repository-directory>

Install dependencies:

pip install -r requirements.txt

Usage

Start the API server:

python app.py

Initialize the model with a website URL:

curl -X POST "http://localhost:8000/initialize" \
     -H "Content-Type: application/json" \
     -d '{"base_url": "https://example.com", "max_pages": 100}'

Query the model:

curl -X POST "http://localhost:8000/query" \
     -H "Content-Type: application/json" \
     -d '{"question": "What is the main purpose of this website?"}'

Get system statistics:

curl "http://localhost:8000/stats"
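
The same three calls can be made from Python. The following is a minimal sketch using only the standard library. It assumes the server listens on localhost:8000 as in the curl examples and that responses are JSON; the helper names build_request and call are hypothetical and not part of this project.

```python
import json
import urllib.request

# Assumption: same host and port as the curl examples above.
BASE = "http://localhost:8000"


def build_request(path, payload=None):
    """Build a POST request with a JSON body, or a GET request if no payload."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {} if data is None else {"Content-Type": "application/json"}
    method = "GET" if data is None else "POST"
    return urllib.request.Request(BASE + path, data=data, headers=headers, method=method)


def call(path, payload=None):
    """Send the request and decode the response (assumed to be JSON)."""
    with urllib.request.urlopen(build_request(path, payload)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example usage, with the API server running:
# call("/initialize", {"base_url": "https://example.com", "max_pages": 100})
# call("/query", {"question": "What is the main purpose of this website?"})
# call("/stats")
```

Crawling and indexing may take a while on larger sites, so a real client would likely want a timeout and error handling around call.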

API Endpoints

POST /initialize: Initialize the RAG model with a website URL
POST /query: Submit a question to the RAG model
GET /stats: Get system statistics

Configuration

The system can be configured through the following parameters:

max_pages: Maximum number of pages to crawl (default: 100)
chunk_size: Size of document chunks (default: 1000)
chunk_overlap: Overlap between chunks (default: 200)
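
To illustrate how chunk_size and chunk_overlap interact, here is a minimal sliding-window sketch. It assumes both values are measured in characters, which this README does not state; the function name chunk_text is hypothetical, and the project's processor may split text differently (for example on sentence or token boundaries).

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters (assumed unit),
    where each window repeats the last chunk_overlap characters of the
    previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the defaults, the window advances 800 characters at a time, so a 2,500-character document yields three chunks starting at offsets 0, 800, and 1600.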

Architecture

The system consists of four main components:

Web Crawler: Systematically traverses and extracts content from websites
Document Processor: Processes and chunks text, generates embeddings
Vector Database: Stores and retrieves document embeddings
RAG Model: Combines retrieval and generation for accurate responses
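
As an illustration of the Web Crawler component, the sketch below performs a breadth-first traversal capped by max_pages. It is not the code in crawler.py. The function crawl and its fetch_links parameter are hypothetical, and restricting the crawl to the starting host is an assumption. Passing link extraction in as a callable keeps the sketch runnable without network access.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl(base_url, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting at base_url.

    fetch_links(url) is a caller-supplied function (hypothetical) that
    returns the href values found on that page.
    """
    host = urlparse(base_url).netloc
    seen = {base_url}
    queue = deque([base_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in fetch_links(url):
            # Resolve relative links and drop "#fragment" suffixes.
            link, _fragment = urldefrag(urljoin(url, href))
            # Assumption: stay on the starting host; skip known pages.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

In a real crawler, fetch_links would download the page and parse its anchors; the pages collected here would then be passed to the Document Processor.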

Performance Considerations

The system uses efficient vector similarity search for fast retrieval
Document chunking is optimized for context preservation
The embedding model is chosen for its balance of performance and accuracy
The system includes comprehensive error handling and logging
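
The retrieval step mentioned in the first point can be illustrated with a toy example. The sketch below ranks chunks by cosine similarity over simple word-count vectors. This is a stand-in for explanation only: the project itself uses a learned embedding model and ChromaDB, and the names embed, cosine, and top_k are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (not a learned model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

This version scores every chunk on each query, which is linear in the number of chunks. A vector database avoids that cost by indexing the embeddings ahead of time.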

License

This project is licensed under the GNU Affero General Public License; see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
