A Retrieval-Augmented Generation (RAG) system that builds a searchable knowledge base from multiple data sources including local files and GitHub repositories. This implementation runs entirely locally using Ollama for the language model and Hugging Face embeddings.
This project creates a RAG system that:
- Loads documents from multiple sources (local directories, GitHub repositories)
- Splits them into manageable chunks
- Creates vector embeddings using Hugging Face's local embedding model
- Stores them in a ChromaDB vector database
- Provides both CLI and web interfaces to query your knowledge base using Ollama
- Multi-Source Support: Index documents from local directories and GitHub repositories simultaneously
- Extensible Architecture: Plugin-based system makes it easy to add new data sources
- All Text Files: Supports all common text file formats (markdown, code, config files, etc.)
- GitHub Integration: Index public and private repositories with personal access token support
- Text Chunking: Intelligently splits documents into overlapping chunks for better retrieval
- Local Vector Storage: Uses ChromaDB for persistent vector storage
- Local Embeddings: Uses Hugging Face's all-MiniLM-L6-v2 model for reliable semantic understanding
- Local LLM: Uses Ollama for running language models locally (no API keys required)
- Interactive Querying: CLI and web interfaces to ask questions about your knowledge base
- Context-Aware Responses: Generates answers based on retrieved context from your documents
- Hybrid Search: Combines semantic similarity and exact keyword matching for better accuracy
- Enhanced Retrieval: Improved relevance scoring and filtering for more precise results
- Source Attribution: See which source (local/GitHub) each search result comes from
- Web Interface: Modern, responsive web application with source management UI
- REST API: HTTP endpoints for integration with other applications
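The hybrid search feature above combines semantic similarity with keyword matching. The following is a minimal sketch of one way such a blend could work, assuming each candidate already carries a semantic similarity between 0 and 1. The function names, the 0.7/0.3 weighting, and the substring-based keyword match are illustrative assumptions, not the project's code in advancedSearch.js.

```javascript
// Hypothetical keyword score: fraction of query terms found in the text.
// Uses a simple case-insensitive substring check, not true token matching.
function keywordScore(query, text) {
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  if (terms.length === 0) return 0;
  const haystack = text.toLowerCase();
  const hits = terms.filter((term) => haystack.includes(term)).length;
  return hits / terms.length;
}

// Blend both scores, drop results below the threshold, and sort best-first.
// The threshold default mirrors RELEVANCE_THRESHOLD=0.25 from the .env example.
function hybridRank(query, candidates, { semanticWeight = 0.7, threshold = 0.25 } = {}) {
  return candidates
    .map((candidate) => ({
      ...candidate,
      score:
        semanticWeight * candidate.similarity +
        (1 - semanticWeight) * keywordScore(query, candidate.text),
    }))
    .filter((candidate) => candidate.score >= threshold)
    .sort((a, b) => b.score - a.score);
}

const ranked = hybridRank('chroma port', [
  { text: 'ChromaDB listens on port 8000', similarity: 0.5, source: 'local' },
  { text: 'Ollama runs language models', similarity: 0.6, source: 'github' },
  { text: 'Unrelated note', similarity: 0.1, source: 'local' },
]);
console.log(ranked.map((result) => `${result.source}: ${result.score.toFixed(2)}`));
```

In this example the first candidate overtakes the second despite a lower semantic similarity, because it contains both query terms. The third falls below the threshold and is dropped.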
- Node.js (v14 or higher)
- Ollama installed and running locally
- ChromaDB server running locally
- GitHub personal access token (for private repositories)
- Clone the repository:
git clone <repository-url>
cd rag-builder
- Install dependencies:
npm install
- Create a .env file in the project root with your configuration:
# Core Configuration
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama3
CHROMA_URL=http://localhost:8000
# Source Configuration
SOURCE_TYPE=local # 'local' or 'github'
ENABLE_MULTIPLE_SOURCES=true # Enable multiple sources simultaneously
# Local Source Configuration
OBSIDIAN_VAULT_PATH=/path/to/your/obsidian/vault # Legacy support
LOCAL_PATH=/path/to/your/documents # Or use this
LOCAL_FILE_EXTENSIONS= # Leave empty for all text files
# GitHub Source Configuration
GITHUB_REPO_URL=https://github.com/owner/repo
GITHUB_TOKEN=ghp_xxxxxxxxxxxx # Optional for public repos
GITHUB_BRANCH=main
GITHUB_FILE_EXTENSIONS= # Leave empty for all text files
# Search and Chunking Parameters
SEARCH_RESULTS_COUNT=8
RELEVANCE_THRESHOLD=0.25
ENABLE_QUERY_EXPANSION=true
ENABLE_RERANKING=true
CHUNK_SIZE=1200
CHUNK_OVERLAP=300
- Install and start Ollama:
# Install Ollama (if not already installed)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull the model you want to use (default is llama3)
ollama pull llama3
# Start Ollama server
ollama serve
- Install and start ChromaDB:
# Install ChromaDB
pip install chromadb
# Start ChromaDB server
chroma run --host localhost --port 8000
- Start the web server:
npm run dev
# or
npm run web
- Open your browser and navigate to http://localhost:3000
- Use the Settings button to:
  - Add local directory sources
  - Add GitHub repository sources
  - Manage multiple sources
  - Configure file extensions to index
- Click "Refresh" to rebuild the knowledge base after adding sources
npm start
# or
node src/main.js
For CLI mode, configure sources via environment variables in the .env file.
To rebuild the vector store with all configured sources:
node src/main.js --refresh
# or
node src/web.js --refresh
Index any directory containing text files:
- Supports all text file formats by default
- Configure specific extensions if needed
- Recursively processes subdirectories
- Preserves directory structure in metadata
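The recursive traversal and extension filtering described above could look roughly like the sketch below. It is an illustration only: the helper names listFiles and toDocument and the shape of the document record are assumptions, and the real logic in localSource.js may differ (for example in how it detects text files).

```javascript
const fs = require('fs');
const path = require('path');

// Recursively collect files under rootDir. An empty extensions array means
// "accept every file"; this sketch does not try to detect binary files.
function listFiles(rootDir, extensions = []) {
  const results = [];
  for (const entry of fs.readdirSync(rootDir, { withFileTypes: true })) {
    const fullPath = path.join(rootDir, entry.name);
    if (entry.isDirectory()) {
      results.push(...listFiles(fullPath, extensions));
    } else if (entry.isFile()) {
      const ext = path.extname(entry.name).toLowerCase();
      if (extensions.length === 0 || extensions.includes(ext)) {
        results.push(fullPath);
      }
    }
  }
  return results;
}

// Turn a file into a document record, keeping its path relative to the
// source root as metadata so the directory structure is preserved.
function toDocument(rootDir, filePath) {
  return {
    content: fs.readFileSync(filePath, 'utf8'),
    metadata: { source: 'local', relativePath: path.relative(rootDir, filePath) },
  };
}
```

Calling listFiles('/path/to/your/documents', ['.md']) would return only Markdown files, while omitting the second argument returns every file found.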
Index public or private GitHub repositories:
- Supports authentication via personal access tokens
- Configurable file extensions
- Respects GitHub API rate limits
- Indexes specific branches
- Handles large repositories efficiently
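As a rough illustration of branch-specific indexing with optional authentication, the sketch below builds a request for GitHub's REST "get a tree" endpoint and filters the returned entries by extension. The helper names are hypothetical, branch names containing special characters are not handled, and githubSource.js may use a different endpoint or client library.

```javascript
// Parse "https://github.com/owner/repo" (optionally ending in ".git" or "/").
function parseRepoUrl(repoUrl) {
  const match = /^https:\/\/github\.com\/([^/]+)\/([^/]+?)(?:\.git)?\/?$/.exec(repoUrl);
  if (!match) throw new Error(`Unrecognized GitHub URL: ${repoUrl}`);
  return { owner: match[1], repo: match[2] };
}

// Build the URL and headers for a recursive tree listing of one branch.
// The token is optional for public repositories.
function buildTreeRequest(repoUrl, branch = 'main', token = '') {
  const { owner, repo } = parseRepoUrl(repoUrl);
  const headers = { Accept: 'application/vnd.github+json' };
  if (token) headers.Authorization = `Bearer ${token}`;
  return {
    url: `https://api.github.com/repos/${owner}/${repo}/git/trees/${branch}?recursive=1`,
    headers,
  };
}

// Keep only files (type "blob") whose path ends with an allowed extension.
// An empty extensions array keeps every file.
function filterTree(tree, extensions = []) {
  return tree
    .filter((item) => item.type === 'blob')
    .filter((item) => extensions.length === 0 || extensions.some((ext) => item.path.endsWith(ext)))
    .map((item) => item.path);
}
```

The request object can be passed to any HTTP client. The response's tree array is what filterTree expects, and for very large repositories GitHub marks the response as truncated, in which case the listing is incomplete.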
The plugin architecture makes it easy to add new sources:
- Extend the BaseSource class
- Implement required methods (loadDocuments, validateSource, etc.)
- Register in sourceManager.js
- Add UI controls if needed
Future source ideas: Notion, Google Drive, Confluence, RSS feeds, etc.
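To make the steps for adding a source more concrete, here is a hedged sketch of a custom source. The BaseSource class below is a stand-in written for this example; the real interface in src/modules/source/baseSource.js may differ. Only the method names loadDocuments and validateSource come from this README; MemorySource, the config shape, and the document shape are hypothetical.

```javascript
// Stand-in for the project's BaseSource (assumed interface, not the real one).
class BaseSource {
  constructor(config = {}) {
    this.config = config;
  }
  async validateSource() {
    throw new Error('validateSource() must be implemented');
  }
  async loadDocuments() {
    throw new Error('loadDocuments() must be implemented');
  }
}

// Hypothetical source that serves documents from an in-memory array of notes.
class MemorySource extends BaseSource {
  async validateSource() {
    return Array.isArray(this.config.notes) && this.config.notes.length > 0;
  }
  async loadDocuments() {
    return this.config.notes.map((text, index) => ({
      content: text,
      metadata: { source: 'memory', id: `note-${index}` },
    }));
  }
}

// Validate first, then load; return an empty list for an invalid source.
async function loadFrom(source) {
  return (await source.validateSource()) ? source.loadDocuments() : [];
}

loadFrom(new MemorySource({ notes: ['First note', 'Second note'] })).then((docs) => {
  console.log(docs.map((doc) => doc.metadata.id));
});
```

A real plugin would still need to be registered in sourceManager.js, which this sketch does not cover.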
The easiest way to run the entire RAG Builder stack (including the web app, Ollama, and ChromaDB) is with Docker.
- Docker installed on your system
- Docker Compose (usually included with Docker Desktop)
- Configure Sources: Edit the docker-compose.yml file to set your environment variables and mount your local directories.
- Build and Start Services:
docker-compose up --build
- Pull the Ollama Model:
docker-compose exec ollama ollama pull llama3
- Access the Application: Navigate to http://localhost:3000
- GET /api/sources - List active sources
- GET /api/sources/types - List available source types
- POST /api/sources/validate - Validate source configuration
- POST /api/sources/add - Add new source
- DELETE /api/sources/:id - Remove source
- GET /api/health - Health check
- POST /api/init - Initialize RAG system
- POST /api/query - Process a query
- POST /api/refresh - Refresh/rebuild vector store
- GET /api/settings/vault-path - Get current vault path (legacy)
- POST /api/settings/vault-path - Update vault path (legacy)
- GET /api/debug/stats - Get system statistics
- POST /api/debug/search - Test search functionality
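A small client for the query endpoint might look like the sketch below. The request body is not documented here, so the JSON field name query is an assumption; check webServer.js for the shape the endpoint actually expects. The ask function relies on the global fetch, which is available in Node.js 18 and later, not in v14.

```javascript
// Build a request for POST /api/query.
// ASSUMPTION: the endpoint accepts a JSON body of the form { "query": "..." }.
function buildQueryRequest(baseUrl, question) {
  return {
    url: new URL('/api/query', baseUrl).toString(),
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ query: question }),
    },
  };
}

// Send the request and return the parsed JSON response.
async function ask(baseUrl, question) {
  const { url, options } = buildQueryRequest(baseUrl, question);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}
```

With the web server running, ask('http://localhost:3000', 'What does the chunker do?') would resolve to whatever JSON the server returns.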
rag-builder/
├── src/
│ ├── main.js # CLI entry point
│ ├── web.js # Web server entry point
│ └── modules/
│ ├── config.js # Configuration management
│ ├── cli.js # CLI interface
│ ├── webServer.js # Web server and API endpoints
│ ├── source/ # Source plugins
│ │ ├── baseSource.js # Base source interface
│ │ ├── localSource.js # Local file system source
│ │ ├── githubSource.js # GitHub repository source
│ │ ├── sourceManager.js # Source management
│ │ └── loader.js # Legacy loader (backward compatibility)
│ ├── indexing/ # Document processing
│ │ ├── chunker.js # Text chunking
│ │ └── embedder.js # Embedding generation
│ ├── store/ # Storage
│ │ └── chroma.js # ChromaDB vector store
│ └── query/ # Search and retrieval
│ ├── search.js # Basic search
│ └── advancedSearch.js # Enhanced search features
├── public/ # Web interface
│ ├── index.html # HTML template
│ ├── styles.css # CSS styling
│ └── script.js # Frontend JavaScript
├── package.json # Dependencies and scripts
├── docker-compose.yml # Docker configuration
└── .env # Environment configuration
- OLLAMA_HOST: Ollama server URL (default: http://localhost:11434)
- OLLAMA_MODEL: Model name to use (default: llama3)
- CHROMA_URL: ChromaDB server URL (default: http://localhost:8000)
- SOURCE_TYPE: Default source type ('local' or 'github')
- ENABLE_MULTIPLE_SOURCES: Enable multiple sources (default: true)
- LOCAL_PATH or OBSIDIAN_VAULT_PATH: Path to local directory
- LOCAL_FILE_EXTENSIONS: Comma-separated list (empty for all text files)
- GITHUB_REPO_URL: Repository URL
- GITHUB_TOKEN: Personal access token (optional for public repos)
- GITHUB_BRANCH: Branch to index (default: main)
- GITHUB_FILE_EXTENSIONS: Comma-separated list (empty for all text files)
- SEARCH_RESULTS_COUNT: Number of documents to retrieve (default: 8)
- RELEVANCE_THRESHOLD: Minimum relevance score (default: 0.25)
- ENABLE_QUERY_EXPANSION: Enable query expansion (default: true)
- ENABLE_RERANKING: Enable result reranking (default: true)
- CHUNK_SIZE: Chunk size in characters (default: 1200)
- CHUNK_OVERLAP: Chunk overlap in characters (default: 300)
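The defaults listed above can be applied with a small loader like the following sketch. It is not the project's config.js: the function name loadConfig, the property names, the precedence of LOCAL_PATH over OBSIDIAN_VAULT_PATH, and the overlap check are assumptions made for illustration. Only a subset of the variables is shown.

```javascript
// Read settings from an env-like object, falling back to the documented defaults.
function loadConfig(env = process.env) {
  const toNumber = (value, fallback) => {
    const parsed = Number.parseFloat(value);
    return Number.isNaN(parsed) ? fallback : parsed;
  };
  const toBool = (value, fallback) =>
    value === undefined || value === '' ? fallback : String(value).toLowerCase() === 'true';
  // "a, b" -> ['a', 'b']; an empty value yields [] (meaning "all text files").
  const toList = (value) =>
    (value || '').split(',').map((item) => item.trim()).filter(Boolean);

  const config = {
    ollamaHost: env.OLLAMA_HOST || 'http://localhost:11434',
    ollamaModel: env.OLLAMA_MODEL || 'llama3',
    chromaUrl: env.CHROMA_URL || 'http://localhost:8000',
    enableMultipleSources: toBool(env.ENABLE_MULTIPLE_SOURCES, true),
    // ASSUMPTION: LOCAL_PATH wins when both variables are set.
    localPath: env.LOCAL_PATH || env.OBSIDIAN_VAULT_PATH || '',
    localFileExtensions: toList(env.LOCAL_FILE_EXTENSIONS),
    githubBranch: env.GITHUB_BRANCH || 'main',
    searchResultsCount: toNumber(env.SEARCH_RESULTS_COUNT, 8),
    relevanceThreshold: toNumber(env.RELEVANCE_THRESHOLD, 0.25),
    chunkSize: toNumber(env.CHUNK_SIZE, 1200),
    chunkOverlap: toNumber(env.CHUNK_OVERLAP, 300),
  };
  // An overlap at least as large as the chunk would make chunking never advance.
  if (config.chunkOverlap >= config.chunkSize) {
    throw new Error('CHUNK_OVERLAP must be smaller than CHUNK_SIZE');
  }
  return config;
}
```

Passing a plain object instead of process.env, as in loadConfig({ CHUNK_SIZE: '500', CHUNK_OVERLAP: '100' }), makes the loader easy to exercise without touching the real environment.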
- Source Configuration: Configure multiple data sources through web UI or environment variables
- Document Loading: Each source plugin loads documents according to its implementation
- Text Processing: Documents are split into overlapping chunks for optimal retrieval
- Embedding Creation: Vector embeddings generated using local Hugging Face model
- Vector Storage: Embeddings stored in ChromaDB with source metadata
- Query Processing: Natural language queries retrieve relevant chunks from all sources
- Response Generation: Ollama generates contextual answers based on retrieved content
- Source Attribution: Results show which source each piece of information came from
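The text processing step in the pipeline above splits documents into overlapping chunks. A minimal fixed-window version is sketched below, using the documented defaults of 1200 characters with a 300-character overlap. The function name chunkText is hypothetical, and the project's chunker.js may split on paragraph or sentence boundaries rather than at fixed offsets.

```javascript
// Split text into windows of chunkSize characters, where each window
// starts (chunkSize - overlap) characters after the previous one.
function chunkText(text, chunkSize = 1200, overlap = 300) {
  if (overlap >= chunkSize) {
    throw new Error('overlap must be smaller than chunkSize');
  }
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the text, so no trailing
    // chunk is emitted that only repeats already-covered characters.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const sample = Array.from({ length: 3000 }, (_, i) => String.fromCharCode(97 + (i % 26))).join('');
const pieces = chunkText(sample);
console.log(pieces.length, pieces.map((piece) => piece.length));
```

The overlap means the last 300 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still retrievable in one piece.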
- "No documents found": Add at least one source in the settings
- "Could not connect to Ollama": Ensure Ollama is running (ollama serve)
- "ChromaDB connection error": Start ChromaDB (chroma run --host localhost --port 8000)
- "GitHub rate limit": Add a personal access token for higher limits
- "Repository not found": Check URL format and access permissions
- For large repositories, consider specifying file extensions to limit scope
- Use personal access tokens for better GitHub API rate limits
- The first run downloads the embedding model - ensure internet connectivity
- Subsequent runs are faster as ChromaDB persists the vector store
ISC
Feel free to submit issues and enhancement requests! The plugin architecture makes it easy to add new data sources.