This repository was archived by the owner on Jan 13, 2026. It is now read-only.

hotosm/docs-rag

RAG System for Website Documentation

A Retrieval-Augmented Generation (RAG) system that crawls a website's documentation, indexes it as vector embeddings, and answers user queries from the retrieved content.

Features

  • Automated web crawling with configurable depth
  • Efficient document processing and chunking
  • Vector embeddings using state-of-the-art models
  • Fast similarity search using ChromaDB
  • REST API interface for easy integration
  • Comprehensive logging and monitoring

Prerequisites

  • Python 3.8+
  • Internet connection for web crawling

Installation

  1. Clone the repository:
git clone <repository-url>
cd <repository-directory>
  2. Install dependencies:
pip install -r requirements.txt

Usage

  1. Start the API server:
python app.py
  2. Initialize the model with a website URL:
curl -X POST "http://localhost:8000/initialize" \
     -H "Content-Type: application/json" \
     -d '{"base_url": "https://example.com", "max_pages": 100}'
  3. Query the model:
curl -X POST "http://localhost:8000/query" \
     -H "Content-Type: application/json" \
     -d '{"question": "What is the main purpose of this website?"}'
  4. Get system statistics:
curl "http://localhost:8000/stats"
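
The same calls can be made from Python. The sketch below uses only the standard library and mirrors the curl commands above. It assumes the server listens on localhost:8000 and returns JSON; the response format is not documented here, and the helper names (build_post, send, run_demo) are illustrative and not part of this project.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # same host and port as the curl examples


def build_post(path, payload):
    """Build a JSON POST request for one of the API endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_BASE + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a request and decode the body, assuming the server replies with JSON."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_demo():
    """Initialize, query, and read stats. Requires the API server to be running."""
    init = build_post(
        "/initialize", {"base_url": "https://example.com", "max_pages": 100}
    )
    print(send(init))

    query = build_post(
        "/query", {"question": "What is the main purpose of this website?"}
    )
    print(send(query))

    # GET /stats takes no request body.
    print(send(urllib.request.Request(API_BASE + "/stats")))
```

Call run_demo() once the server from step 1 is up. Crawling up to max_pages pages may take a while, so the first request can be slow.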

API Endpoints

  • POST /initialize: Initialize the RAG model with a website URL
  • POST /query: Submit a question to the RAG model
  • GET /stats: Get system statistics

Configuration

The system can be configured through the following parameters:

  • max_pages: Maximum number of pages to crawl (default: 100)
  • chunk_size: Size of document chunks (default: 1000)
  • chunk_overlap: Overlap between chunks (default: 200)
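
To show how chunk_size and chunk_overlap interact, here is a minimal sliding-window chunker. It is a sketch only: it assumes both values are measured in characters, which this README does not state, and chunk_text is a hypothetical name, not necessarily how the project's document processor is implemented.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size that share chunk_overlap characters.

    Assumption: sizes are counted in characters (they could equally be tokens).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With the defaults, each window starts 800 characters after the previous one, so a 2,500-character document yields three chunks of 1,000, 1,000, and 900 characters. A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed and store.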

Architecture

The system consists of four main components:

  1. Web Crawler: Systematically traverses and extracts content from websites
  2. Document Processor: Processes and chunks text, generates embeddings
  3. Vector Database: Stores and retrieves document embeddings
  4. RAG Model: Combines retrieval and generation for accurate responses
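
As an illustration of the first component, a crawler bounded by page count and link depth can be written as a breadth-first traversal. This is a sketch under assumptions: get_links stands in for fetching a page and extracting its links, and max_depth is a hypothetical parameter name (only max_pages is documented above).

```python
from collections import deque


def crawl(start_url, get_links, max_pages=100, max_depth=2):
    """Breadth-first crawl that stops at max_pages pages or max_depth link hops.

    get_links(url) is a caller-supplied function returning the URLs linked
    from a page; it is a placeholder for real fetching and HTML parsing.
    """
    visited = []                      # pages in the order they were crawled
    seen = {start_url}                # URLs already queued, to avoid revisits
    queue = deque([(start_url, 0)])   # (url, depth from the start page)

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Breadth-first order means that when max_pages cuts the crawl short, the pages closest to the start URL are the ones that get indexed.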

Performance Considerations

  • The system uses efficient vector similarity search for fast retrieval
  • Document chunking is optimized for context preservation
  • The embedding model is chosen for its balance of performance and accuracy
  • The system includes comprehensive error handling and logging
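
The retrieval step behind vector similarity search can be sketched in a few lines. The example below is a toy stand-in, not the project's code: it uses word counts instead of a learned embedding model and a linear scan instead of ChromaDB, purely to show how chunks are ranked by cosine similarity to the question.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def top_k(question, chunks, k=3):
    """Return the k chunks most similar to the question, best first."""
    query_vector = embed(question)
    scored = [(cosine(query_vector, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]
```

In a RAG pipeline, the chunks returned by a step like top_k are passed to the generation model as context for answering the question.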

License

This project is licensed under the GNU Affero General Public License; see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

About

RAG LLM model and chat interface, providing a user-facing knowledge base of HOT's tools
