RAG Builder

A Retrieval-Augmented Generation (RAG) system that builds a searchable knowledge base from multiple data sources, including local files and GitHub repositories. Embedding and answer generation run entirely locally, using a Hugging Face embedding model and Ollama for the language model.

Overview

This project creates a RAG system that:

  1. Loads documents from multiple sources (local directories, GitHub repositories)
  2. Splits them into manageable chunks
  3. Creates vector embeddings using Hugging Face's local embedding model
  4. Stores them in a ChromaDB vector database
  5. Provides both CLI and web interfaces to query your knowledge base using Ollama

Features

  • Multi-Source Support: Index documents from local directories and GitHub repositories simultaneously
  • Extensible Architecture: Plugin-based system makes it easy to add new data sources
  • All Text Files: Supports all common text file formats (markdown, code, config files, etc.)
  • GitHub Integration: Index public and private repositories with personal access token support
  • Text Chunking: Intelligently splits documents into overlapping chunks for better retrieval
  • Local Vector Storage: Uses ChromaDB for persistent vector storage
  • Local Embeddings: Uses Hugging Face's all-MiniLM-L6-v2 model to embed text locally for semantic search
  • Local LLM: Uses Ollama for running language models locally (no API keys required)
  • Interactive Querying: CLI and web interfaces to ask questions about your knowledge base
  • Context-Aware Responses: Generates answers based on retrieved context from your documents
  • Hybrid Search: Combines semantic similarity and exact keyword matching for better accuracy
  • Enhanced Retrieval: Optional query expansion and reranking, plus a configurable relevance threshold to filter out weak matches
  • Source Attribution: See which source (local/GitHub) each search result comes from
  • Web Interface: Modern, responsive web application with source management UI
  • REST API: HTTP endpoints for integration with other applications
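
To illustrate the Hybrid Search feature listed above, here is a minimal sketch of how a semantic score and a keyword score might be blended. It is not the project's actual scoring code: the function names and the `semanticWeight` parameter are hypothetical, and the embeddings are assumed to be plain numeric arrays.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Fraction of distinct query terms that occur in the text (case-insensitive
// substring match).
function keywordScore(query, text) {
  const terms = [...new Set(query.toLowerCase().split(/\W+/).filter(Boolean))];
  if (terms.length === 0) return 0;
  const haystack = text.toLowerCase();
  const hits = terms.filter((term) => haystack.includes(term)).length;
  return hits / terms.length;
}

// Weighted blend of both signals. semanticWeight is a hypothetical tuning knob.
function hybridScore(queryVec, docVec, query, text, semanticWeight = 0.7) {
  return (
    semanticWeight * cosineSimilarity(queryVec, docVec) +
    (1 - semanticWeight) * keywordScore(query, text)
  );
}
```

A higher `semanticWeight` favors meaning over literal term overlap; lowering it helps when queries contain exact identifiers such as file names or function names.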

Prerequisites

  • Node.js (v14 or higher)
  • Ollama installed and running locally
  • ChromaDB server running locally
  • GitHub personal access token (for private repositories)

Installation

  1. Clone the repository:
git clone <repository-url>
cd rag-builder
  2. Install dependencies:
npm install
  3. Create a .env file in the project root with your configuration:
# Core Configuration
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama3
CHROMA_URL=http://localhost:8000

# Source Configuration
SOURCE_TYPE=local                    # 'local' or 'github'
ENABLE_MULTIPLE_SOURCES=true         # Enable multiple sources simultaneously

# Local Source Configuration
OBSIDIAN_VAULT_PATH=/path/to/your/obsidian/vault   # Legacy support
LOCAL_PATH=/path/to/your/documents                  # Or use this
LOCAL_FILE_EXTENSIONS=                              # Leave empty for all text files

# GitHub Source Configuration
GITHUB_REPO_URL=https://github.com/owner/repo
GITHUB_TOKEN=ghp_xxxxxxxxxxxx                       # Optional for public repos
GITHUB_BRANCH=main
GITHUB_FILE_EXTENSIONS=                             # Leave empty for all text files

# Search and Chunking Parameters
SEARCH_RESULTS_COUNT=8
RELEVANCE_THRESHOLD=0.25
ENABLE_QUERY_EXPANSION=true
ENABLE_RERANKING=true
CHUNK_SIZE=1200
CHUNK_OVERLAP=300
  4. Install and start Ollama:
# Install Ollama (if not already installed)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull the model you want to use (default is llama3)
ollama pull llama3

# Start Ollama server
ollama serve
  5. Install and start ChromaDB:
# Install ChromaDB
pip install chromadb

# Start ChromaDB server
chroma run --host localhost --port 8000

Usage

Web Interface (Recommended)

  1. Start the web server:
npm run dev
# or
npm run web
  2. Open your browser and navigate to http://localhost:3000

  3. Use the Settings button to:

    • Add local directory sources
    • Add GitHub repository sources
    • Manage multiple sources
    • Configure file extensions to index
  4. Click "Refresh" to rebuild the knowledge base after adding sources

CLI Mode

npm start
# or
node src/main.js

For CLI mode, configure sources via environment variables in the .env file.

Force Refresh

To rebuild the vector store with all configured sources:

node src/main.js --refresh
# or
node src/web.js --refresh

Data Sources

Local File System

Index any directory containing text files:

  • Supports all text file formats by default
  • Configure specific extensions if needed
  • Recursively processes subdirectories
  • Preserves directory structure in metadata
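
The behavior described above can be sketched roughly as follows. This is an illustration rather than the code in localSource.js: `listFiles` and the shape of its return value are hypothetical, and an empty extension list here accepts every file without checking whether it is actually text.

```javascript
const fs = require('fs');
const path = require('path');

// Recursively collect files under rootDir. `extensions` is a list such as
// ['.md', '.txt']; an empty list means "accept every file".
function listFiles(rootDir, extensions = []) {
  const results = [];
  const allowed = extensions.map((ext) => ext.toLowerCase());

  function walk(dir) {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const fullPath = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        walk(fullPath); // descend into subdirectories
      } else if (entry.isFile()) {
        const ext = path.extname(entry.name).toLowerCase();
        if (allowed.length === 0 || allowed.includes(ext)) {
          results.push({
            path: fullPath,
            // The path relative to the root keeps the directory structure
            // available as metadata.
            relativePath: path.relative(rootDir, fullPath),
          });
        }
      }
    }
  }

  walk(rootDir);
  return results;
}
```

For example, `listFiles(process.env.LOCAL_PATH, ['.md'])` would return only the Markdown files under the configured directory.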

GitHub Repositories

Index public or private GitHub repositories:

  • Supports authentication via personal access tokens
  • Configurable file extensions
  • Respects GitHub API rate limits
  • Indexes specific branches
  • Handles large repositories efficiently
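
As a rough sketch of what token authentication and rate-limit handling can look like against the GitHub REST API: the helper names below are hypothetical and this is not necessarily how githubSource.js does it. The `x-ratelimit-remaining` and `x-ratelimit-reset` response headers are GitHub's own; the response headers are assumed to have been copied into a plain object with lower-case keys.

```javascript
// Build request headers for the GitHub REST API. The token is optional for
// public repositories.
function githubHeaders(token) {
  const headers = { Accept: 'application/vnd.github+json' };
  if (token) headers.Authorization = `Bearer ${token}`;
  return headers;
}

// Return how many milliseconds to wait before sending the next request.
// 0 means the request can go out immediately.
function rateLimitDelayMs(headers, nowMs = Date.now()) {
  const remaining = Number(headers['x-ratelimit-remaining']);
  const resetSeconds = Number(headers['x-ratelimit-reset']); // UTC epoch seconds
  if (Number.isNaN(remaining) || Number.isNaN(resetSeconds)) return 0;
  if (remaining > 0) return 0;
  return Math.max(0, resetSeconds * 1000 - nowMs);
}
```

A loader could call `rateLimitDelayMs` after each response and sleep for the returned duration before fetching the next file.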

Adding New Sources

The plugin architecture makes it easy to add new sources:

  1. Extend the BaseSource class
  2. Implement required methods (loadDocuments, validateSource, etc.)
  3. Register in sourceManager.js
  4. Add UI controls if needed

Future source ideas: Notion, Google Drive, Confluence, RSS feeds, etc.
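
The steps above might look something like the sketch below. The `BaseSource` class here is only a stand-in so the example runs on its own; the real interface in src/modules/source/baseSource.js may differ. `InMemorySource` and the `{ content, metadata }` document shape are hypothetical.

```javascript
// Stand-in for the project's BaseSource (assumed interface).
class BaseSource {
  constructor(config = {}) {
    this.config = config;
  }

  async validateSource() {
    throw new Error('validateSource() not implemented');
  }

  async loadDocuments() {
    throw new Error('loadDocuments() not implemented');
  }
}

// Hypothetical source that serves documents from an in-memory array.
class InMemorySource extends BaseSource {
  // A source is valid when it was given an array of documents.
  async validateSource() {
    return Array.isArray(this.config.documents);
  }

  // Return documents in an assumed { content, metadata } shape so that
  // later stages can attribute results to this source.
  async loadDocuments() {
    return this.config.documents.map((doc, index) => ({
      content: doc.content,
      metadata: { source: 'memory', id: doc.id ?? String(index) },
    }));
  }
}
```

A real plugin would replace the in-memory array with calls to the external service and would then be registered in sourceManager.js, as described in step 3.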

Deploying with Docker

The easiest way to run the entire RAG Builder stack (including the web app, Ollama, and ChromaDB) is with Docker.

Prerequisites

  • Docker and Docker Compose installed

Running with Docker

  1. Configure Sources: Edit the docker-compose.yml file to set your environment variables and mount your local directories.

  2. Build and Start Services:

    docker-compose up --build
  3. Pull the Ollama Model:

    docker-compose exec ollama ollama pull llama3
  4. Access the Application: Navigate to http://localhost:3000

API Endpoints

Source Management

  • GET /api/sources - List active sources
  • GET /api/sources/types - List available source types
  • POST /api/sources/validate - Validate source configuration
  • POST /api/sources/add - Add new source
  • DELETE /api/sources/:id - Remove source

Core Functionality

  • GET /api/health - Health check
  • POST /api/init - Initialize RAG system
  • POST /api/query - Process a query
  • POST /api/refresh - Refresh/rebuild vector store
  • GET /api/settings/vault-path - Get current vault path (legacy)
  • POST /api/settings/vault-path - Update vault path (legacy)

Debug Endpoints

  • GET /api/debug/stats - Get system statistics
  • POST /api/debug/search - Test search functionality
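
As an example of calling the API from another application, the sketch below sends a question to POST /api/query. The request body field name `query` is an assumption, as is the shape of the response; check webServer.js for the actual contract. It relies on the global `fetch`, which is available in Node.js 18 and later.

```javascript
// Build the URL and fetch options for POST /api/query.
// ASSUMPTION: the server expects a JSON body of the form { query: "..." }.
function buildQueryRequest(question, baseUrl = 'http://localhost:3000') {
  return {
    url: new URL('/api/query', baseUrl).toString(),
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ query: question }),
    },
  };
}

// Send the request and return the parsed JSON response, whatever its shape.
async function askKnowledgeBase(question, baseUrl) {
  const { url, options } = buildQueryRequest(question, baseUrl);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Query failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

With the web server running, `askKnowledgeBase('How do I configure chunking?').then(console.log)` would print whatever the endpoint returns.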

Project Structure

rag-builder/
├── src/
│   ├── main.js                    # CLI entry point
│   ├── web.js                     # Web server entry point
│   └── modules/
│       ├── config.js              # Configuration management
│       ├── cli.js                 # CLI interface
│       ├── webServer.js           # Web server and API endpoints
│       ├── source/                # Source plugins
│       │   ├── baseSource.js      # Base source interface
│       │   ├── localSource.js     # Local file system source
│       │   ├── githubSource.js    # GitHub repository source
│       │   ├── sourceManager.js   # Source management
│       │   └── loader.js          # Legacy loader (backward compatibility)
│       ├── indexing/              # Document processing
│       │   ├── chunker.js         # Text chunking
│       │   └── embedder.js        # Embedding generation
│       ├── store/                 # Storage
│       │   └── chroma.js          # ChromaDB vector store
│       └── query/                 # Search and retrieval
│           ├── search.js          # Basic search
│           └── advancedSearch.js  # Enhanced search features
├── public/                        # Web interface
│   ├── index.html                 # HTML template
│   ├── styles.css                 # CSS styling
│   └── script.js                  # Frontend JavaScript
├── package.json                   # Dependencies and scripts
├── docker-compose.yml             # Docker configuration
└── .env                          # Environment configuration

Configuration Options

Environment Variables

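Core Settings

  • OLLAMA_HOST: URL of the Ollama server (e.g. http://localhost:11434)
  • OLLAMA_MODEL: Ollama model used for generation (default: llama3)
  • CHROMA_URL: URL of the ChromaDB server (e.g. http://localhost:8000)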

Source Settings

  • SOURCE_TYPE: Default source type ('local' or 'github')
  • ENABLE_MULTIPLE_SOURCES: Enable multiple sources (default: true)

Local Source

  • LOCAL_PATH or OBSIDIAN_VAULT_PATH: Path to local directory
  • LOCAL_FILE_EXTENSIONS: Comma-separated list (empty for all text files)

GitHub Source

  • GITHUB_REPO_URL: Repository URL
  • GITHUB_TOKEN: Personal access token (optional for public repos)
  • GITHUB_BRANCH: Branch to index (default: main)
  • GITHUB_FILE_EXTENSIONS: Comma-separated list (empty for all text files)
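
For illustration, a comma-separated extension setting could be parsed along these lines. `parseExtensions` is a hypothetical helper, and whether the project expects extensions with or without a leading dot is an assumption; this sketch accepts both.

```javascript
// Turn a value such as "md, .txt,JS" into ['.md', '.txt', '.js'].
// An empty or missing value yields [], which callers can treat as
// "no filter".
function parseExtensions(rawValue) {
  if (!rawValue) return [];
  return rawValue
    .split(',')
    .map((item) => item.trim().toLowerCase())
    .filter(Boolean) // drop empty entries such as a trailing comma
    .map((item) => (item.startsWith('.') ? item : `.${item}`));
}
```

Typical usage would be `parseExtensions(process.env.LOCAL_FILE_EXTENSIONS)` or `parseExtensions(process.env.GITHUB_FILE_EXTENSIONS)`.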

Search Parameters

  • SEARCH_RESULTS_COUNT: Number of documents to retrieve (default: 8)
  • RELEVANCE_THRESHOLD: Minimum relevance score (default: 0.25)
  • ENABLE_QUERY_EXPANSION: Enable query expansion (default: true)
  • ENABLE_RERANKING: Enable result reranking (default: true)
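
A minimal sketch of how SEARCH_RESULTS_COUNT and RELEVANCE_THRESHOLD could be applied to a list of candidate results. It assumes each result carries a `score` where higher means more relevant; if the vector store reports distances instead, they would need to be converted first. `selectResults` is a hypothetical name.

```javascript
// Keep results scoring at or above the threshold, best first, capped at
// maxResults. The input array is not modified.
function selectResults(results, threshold = 0.25, maxResults = 8) {
  return results
    .filter((result) => result.score >= threshold)
    .sort((a, b) => b.score - a.score)
    .slice(0, maxResults);
}
```

Raising the threshold trades recall for precision: fewer chunks reach the language model, but those that do are more likely to be on topic.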

Chunking Parameters

  • CHUNK_SIZE: Chunk size in characters (default: 1200)
  • CHUNK_OVERLAP: Chunk overlap in characters (default: 300)
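
To show how CHUNK_SIZE and CHUNK_OVERLAP interact, here is a simple fixed-width chunker. It is a sketch only: the project's chunker.js may split on sentence or paragraph boundaries, and `chunkText` is a hypothetical name.

```javascript
// Split text into chunks of at most chunkSize characters. Each chunk starts
// (chunkSize - chunkOverlap) characters after the previous one, so
// neighbouring chunks share chunkOverlap characters.
function chunkText(text, chunkSize = 1200, chunkOverlap = 300) {
  if (chunkOverlap >= chunkSize) {
    throw new Error('chunkOverlap must be smaller than chunkSize');
  }
  const step = chunkSize - chunkOverlap;
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return chunks;
}
```

With the defaults, a 3000-character document yields three chunks starting at offsets 0, 900, and 1800, and the last 300 characters of each chunk reappear at the start of the next.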

How It Works

  1. Source Configuration: Configure multiple data sources through web UI or environment variables
  2. Document Loading: Each source plugin loads documents according to its implementation
  3. Text Processing: Documents are split into overlapping chunks for optimal retrieval
  4. Embedding Creation: Vector embeddings generated using local Hugging Face model
  5. Vector Storage: Embeddings stored in ChromaDB with source metadata
  6. Query Processing: Natural language queries retrieve relevant chunks from all sources
  7. Response Generation: Ollama generates contextual answers based on retrieved content
  8. Source Attribution: Results show which source each piece of information came from
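
Steps 7 and 8 can be sketched as follows: retrieved chunks are formatted into a prompt with their source labels, and the prompt is sent to Ollama's /api/generate endpoint. The `{ source, text }` chunk shape, the prompt wording, and the function names are assumptions for illustration, not the project's actual implementation. The Ollama call uses the global `fetch` (Node.js 18 or later).

```javascript
// Format retrieved chunks into a prompt, labelling each with its source.
function buildPrompt(question, chunks) {
  const context = chunks
    .map((chunk, i) => `[${i + 1}] (${chunk.source}) ${chunk.text}`)
    .join('\n\n');
  return [
    'Answer the question using only the context below.',
    '',
    'Context:',
    context,
    '',
    `Question: ${question}`,
  ].join('\n');
}

// Send the prompt to Ollama and return the generated text. With
// stream: false, Ollama replies with a single JSON object whose
// `response` field holds the answer.
async function generateAnswer(prompt, host = 'http://localhost:11434', model = 'llama3') {
  const response = await fetch(`${host}/api/generate`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  if (!response.ok) {
    throw new Error(`Ollama returned HTTP ${response.status}`);
  }
  const data = await response.json();
  return data.response;
}
```

Numbering the chunks in the prompt makes it straightforward to map an answer back to the local or GitHub source it drew on.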

Troubleshooting

Common Issues

  1. "No documents found": Add at least one source in the settings
  2. "Could not connect to Ollama": Ensure Ollama is running (ollama serve)
  3. "ChromaDB connection error": Start ChromaDB (chroma run --host localhost --port 8000)
  4. "GitHub rate limit": Add a personal access token for higher limits
  5. "Repository not found": Check URL format and access permissions

Performance Tips

  • For large repositories, consider specifying file extensions to limit scope
  • Use personal access tokens for better GitHub API rate limits
  • The first run downloads the embedding model, so it requires internet access
  • Subsequent runs are faster as ChromaDB persists the vector store

License

ISC

Contributing

Feel free to submit issues and enhancement requests! The plugin architecture makes it easy to add new data sources.
