RAG Builder

A Retrieval-Augmented Generation (RAG) system that builds a searchable knowledge base from multiple data sources, including local files and GitHub repositories. Embedding and generation both run locally: Ollama serves the language model and a Hugging Face model produces the embeddings.

Overview

This project creates a RAG system that:

Loads documents from multiple sources (local directories, GitHub repositories)
Splits them into manageable chunks
Creates vector embeddings using Hugging Face's local embedding model
Stores them in a ChromaDB vector database
Provides both CLI and web interfaces to query your knowledge base using Ollama

Features

Multi-Source Support: Index documents from local directories and GitHub repositories simultaneously
Extensible Architecture: Plugin-based system makes it easy to add new data sources
All Text Files: Supports all common text file formats (markdown, code, config files, etc.)
GitHub Integration: Index public and private repositories with personal access token support
Text Chunking: Splits documents into overlapping chunks (1200 characters with a 300-character overlap by default) so that passages spanning a chunk boundary remain retrievable
Local Vector Storage: Uses ChromaDB for persistent vector storage
Local Embeddings: Uses the all-MiniLM-L6-v2 sentence-embedding model from Hugging Face (384-dimensional vectors), run locally
Local LLM: Uses Ollama for running language models locally (no API keys required)
Interactive Querying: CLI and web interfaces to ask questions about your knowledge base
Context-Aware Responses: Generates answers based on retrieved context from your documents
Hybrid Search: Combines semantic similarity and exact keyword matching for better accuracy
Enhanced Retrieval: Optional query expansion and reranking, plus a minimum relevance threshold that filters out weak matches
Source Attribution: See which source (local/GitHub) each search result comes from
Web Interface: Modern, responsive web application with source management UI
REST API: HTTP endpoints for integration with other applications

Prerequisites

Node.js (v14 or higher)
Ollama installed and running locally
ChromaDB server running locally
GitHub personal access token (optional; required only for private repositories)

Installation

Clone the repository:

git clone <repository-url>
cd rag-builder

Install dependencies:

npm install

Create a .env file in the project root with your configuration:

# Core Configuration
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama3
CHROMA_URL=http://localhost:8000

# Source Configuration
SOURCE_TYPE=local                    # 'local' or 'github'
ENABLE_MULTIPLE_SOURCES=true         # Enable multiple sources simultaneously

# Local Source Configuration
OBSIDIAN_VAULT_PATH=/path/to/your/obsidian/vault   # Legacy support
LOCAL_PATH=/path/to/your/documents                  # Or use this
LOCAL_FILE_EXTENSIONS=                              # Leave empty for all text files

# GitHub Source Configuration
GITHUB_REPO_URL=https://github.com/owner/repo
GITHUB_TOKEN=ghp_xxxxxxxxxxxx                       # Optional for public repos
GITHUB_BRANCH=main
GITHUB_FILE_EXTENSIONS=                             # Leave empty for all text files

# Search and Chunking Parameters
SEARCH_RESULTS_COUNT=8
RELEVANCE_THRESHOLD=0.25
ENABLE_QUERY_EXPANSION=true
ENABLE_RERANKING=true
CHUNK_SIZE=1200
CHUNK_OVERLAP=300

Install and start Ollama:

# Install Ollama (if not already installed)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull the model you want to use (default is llama3)
ollama pull llama3

# Start Ollama server
ollama serve

Install and start ChromaDB:

# Install ChromaDB
pip install chromadb

# Start ChromaDB server
chroma run --host localhost --port 8000

Usage

Web Interface (Recommended)

Start the web server:

npm run dev
# or
npm run web

Open your browser and navigate to http://localhost:3000
Use the Settings button to:
- Add local directory sources
- Add GitHub repository sources
- Manage multiple sources
- Configure file extensions to index
Click "Refresh" to rebuild the knowledge base after adding sources

CLI Mode

npm start
# or
node src/main.js

For CLI mode, configure sources via environment variables in the .env file.

Force Refresh

To rebuild the vector store with all configured sources:

node src/main.js --refresh
# or
node src/web.js --refresh

Data Sources

Local File System

Index any directory containing text files:

Supports all text file formats by default
Configure specific extensions if needed
Recursively processes subdirectories
Preserves directory structure in metadata
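
The traversal described above can be sketched roughly as follows. This is a minimal illustration, not the project's localSource.js: the helper name `collectFiles` and the returned object shape are assumptions, and a real loader would also need to skip binary files when no extension filter is set.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: recursively collect files under `rootDir`.
// `extensions` is an array such as [".md", ".js"]; an empty array means "no filter".
function collectFiles(rootDir, extensions = []) {
  const results = [];
  const allowed = new Set(extensions.map((ext) => ext.toLowerCase()));

  function walk(currentDir) {
    for (const entry of fs.readdirSync(currentDir, { withFileTypes: true })) {
      const fullPath = path.join(currentDir, entry.name);
      if (entry.isDirectory()) {
        walk(fullPath); // descend into subdirectories
      } else if (entry.isFile()) {
        const ext = path.extname(entry.name).toLowerCase();
        if (allowed.size === 0 || allowed.has(ext)) {
          results.push({
            path: fullPath,
            // Path relative to the root, kept as metadata so the
            // directory structure survives indexing.
            relativePath: path.relative(rootDir, fullPath),
          });
        }
      }
    }
  }

  walk(rootDir);
  return results;
}
```

Calling `collectFiles("/path/to/your/documents", [".md"])` would return one entry per Markdown file, each carrying its path relative to the indexed root.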

GitHub Repositories

Index public or private GitHub repositories:

Supports authentication via personal access tokens
Configurable file extensions
Respects GitHub API rate limits
Indexes specific branches
Handles large repositories efficiently
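
As an illustration of how a repository URL, branch, and optional token might be turned into a GitHub API request, here is a small sketch. The helper names are hypothetical and the project's githubSource.js may work differently; the only external facts relied on are GitHub's REST endpoint for listing a Git tree recursively and the `Authorization: Bearer` header for personal access tokens.

```javascript
// Hypothetical helper: split a repository URL into owner and repo name.
function parseGitHubRepoUrl(repoUrl) {
  const url = new URL(repoUrl);
  if (url.hostname !== "github.com") {
    throw new Error(`Not a github.com URL: ${repoUrl}`);
  }
  const [owner, repo] = url.pathname.split("/").filter(Boolean);
  if (!owner || !repo) {
    throw new Error(`Expected https://github.com/<owner>/<repo>, got: ${repoUrl}`);
  }
  return { owner, repo: repo.replace(/\.git$/, "") };
}

// Hypothetical helper: describe the request that lists every file on a branch.
// The token is optional; authenticated requests get a higher rate limit.
function buildTreeRequest(repoUrl, branch = "main", token = "") {
  const { owner, repo } = parseGitHubRepoUrl(repoUrl);
  const headers = {
    Accept: "application/vnd.github+json",
    "User-Agent": "rag-builder-example", // illustrative value
  };
  if (token) {
    headers.Authorization = `Bearer ${token}`;
  }
  return {
    url: `https://api.github.com/repos/${owner}/${repo}/git/trees/${encodeURIComponent(branch)}?recursive=1`,
    headers,
  };
}
```

Separating URL parsing from request construction makes the "Repository not found" case easier to diagnose, since a malformed URL fails before any network call is made.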

Adding New Sources

The plugin architecture makes it easy to add new sources:

Extend the BaseSource class
Implement required methods (loadDocuments, validateSource, etc.)
Register in sourceManager.js
Add UI controls if needed

Future source ideas: Notion, Google Drive, Confluence, RSS feeds, etc.

Deploying with Docker

The easiest way to run the entire RAG Builder stack (including the web app, Ollama, and ChromaDB) is with Docker.

Prerequisites

Docker installed on your system
Docker Compose (usually included with Docker Desktop)

Running with Docker

Configure Sources: Edit the docker-compose.yml file to set your environment variables and mount your local directories.
Build and Start Services:
```
docker-compose up --build
```

Pull the Ollama Model:

docker-compose exec ollama ollama pull llama3

Access the Application: Navigate to http://localhost:3000

API Endpoints

Source Management

GET /api/sources - List active sources
GET /api/sources/types - List available source types
POST /api/sources/validate - Validate source configuration
POST /api/sources/add - Add new source
DELETE /api/sources/:id - Remove source

Core Functionality

GET /api/health - Health check
POST /api/init - Initialize RAG system
POST /api/query - Process a query
POST /api/refresh - Refresh/rebuild vector store
GET /api/settings/vault-path - Get current vault path (legacy)
POST /api/settings/vault-path - Update vault path (legacy)
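
A client call to the query endpoint could be sketched as below. The request body shape `{ query: "..." }` is an assumption, since the payload format is not documented here; check webServer.js for the actual field names. The `fetch` call relies on the global available in Node.js 18 and later.

```javascript
// Assumption: the server accepts a JSON body of the form { "query": "..." }.
function buildQueryRequest(question, baseUrl = "http://localhost:3000") {
  return {
    url: `${baseUrl}/api/query`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query: question }),
    },
  };
}

// Sends the request using the global fetch (Node.js 18+).
async function askKnowledgeBase(question, baseUrl) {
  const { url, options } = buildQueryRequest(question, baseUrl);
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Query failed with HTTP ${response.status}`);
  }
  return response.json();
}
```

With the web server running, `askKnowledgeBase("How do I add a source?")` would resolve to whatever JSON the endpoint returns.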

Debug Endpoints

GET /api/debug/stats - Get system statistics
POST /api/debug/search - Test search functionality

Project Structure

rag-builder/
├── src/
│   ├── main.js                    # CLI entry point
│   ├── web.js                     # Web server entry point
│   └── modules/
│       ├── config.js              # Configuration management
│       ├── cli.js                 # CLI interface
│       ├── webServer.js           # Web server and API endpoints
│       ├── source/                # Source plugins
│       │   ├── baseSource.js      # Base source interface
│       │   ├── localSource.js     # Local file system source
│       │   ├── githubSource.js    # GitHub repository source
│       │   ├── sourceManager.js   # Source management
│       │   └── loader.js          # Legacy loader (backward compatibility)
│       ├── indexing/              # Document processing
│       │   ├── chunker.js         # Text chunking
│       │   └── embedder.js        # Embedding generation
│       ├── store/                 # Storage
│       │   └── chroma.js          # ChromaDB vector store
│       └── query/                 # Search and retrieval
│           ├── search.js          # Basic search
│           └── advancedSearch.js  # Enhanced search features
├── public/                        # Web interface
│   ├── index.html                 # HTML template
│   ├── styles.css                 # CSS styling
│   └── script.js                  # Frontend JavaScript
├── package.json                   # Dependencies and scripts
├── docker-compose.yml             # Docker configuration
└── .env                           # Environment configuration

Configuration Options

Environment Variables

Core Settings

OLLAMA_HOST: Ollama server URL (default: http://localhost:11434)
OLLAMA_MODEL: Model name to use (default: llama3)
CHROMA_URL: ChromaDB server URL (default: http://localhost:8000)

Source Settings

SOURCE_TYPE: Default source type ('local' or 'github')
ENABLE_MULTIPLE_SOURCES: Enable multiple sources (default: true)

Local Source

LOCAL_PATH or OBSIDIAN_VAULT_PATH: Path to local directory
LOCAL_FILE_EXTENSIONS: Comma-separated list (empty for all text files)

GitHub Source

GITHUB_REPO_URL: Repository URL
GITHUB_TOKEN: Personal access token (optional for public repos)
GITHUB_BRANCH: Branch to index (default: main)
GITHUB_FILE_EXTENSIONS: Comma-separated list (empty for all text files)
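
To make the comma-separated extension lists and the documented defaults concrete, here is one way the source settings could be read. This is a sketch rather than the project's config.js: the function names are hypothetical, and both the precedence of LOCAL_PATH over OBSIDIAN_VAULT_PATH and the tolerance for entries written with or without a leading dot are assumptions.

```javascript
// Hypothetical parser: "md, .JS,txt" -> [".md", ".js", ".txt"]; empty -> [].
// An empty array is interpreted as "index all text files".
function parseExtensions(raw) {
  if (!raw) return [];
  return raw
    .split(",")
    .map((item) => item.trim().toLowerCase())
    .filter(Boolean)
    .map((ext) => (ext.startsWith(".") ? ext : `.${ext}`));
}

// Hypothetical reader for the source-related variables listed above.
function readSourceConfig(env = process.env) {
  return {
    // Assumption: LOCAL_PATH wins when both variables are set.
    localPath: env.LOCAL_PATH || env.OBSIDIAN_VAULT_PATH || null,
    localExtensions: parseExtensions(env.LOCAL_FILE_EXTENSIONS),
    github: {
      repoUrl: env.GITHUB_REPO_URL || null,
      token: env.GITHUB_TOKEN || null,
      branch: env.GITHUB_BRANCH || "main", // documented default
      extensions: parseExtensions(env.GITHUB_FILE_EXTENSIONS),
    },
  };
}
```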

Search Parameters

SEARCH_RESULTS_COUNT: Number of documents to retrieve (default: 8)
RELEVANCE_THRESHOLD: Minimum relevance score (default: 0.25)
ENABLE_QUERY_EXPANSION: Enable query expansion (default: true)
ENABLE_RERANKING: Enable result reranking (default: true)

Chunking Parameters

CHUNK_SIZE: Chunk size in characters (default: 1200)
CHUNK_OVERLAP: Chunk overlap in characters (default: 300)
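
To show what these two parameters mean in practice, here is a minimal fixed-size character chunker. It is a sketch under simple assumptions: the function name is hypothetical, and the project's chunker.js may split on sentence or paragraph boundaries rather than at raw character offsets.

```javascript
// Split `text` into chunks of up to `chunkSize` characters, where each
// chunk starts `chunkSize - chunkOverlap` characters after the previous one.
function chunkText(text, chunkSize = 1200, chunkOverlap = 300) {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const chunks = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current chunk already reaches the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

With the defaults, each chunk starts 900 characters after the previous one, so a 3000-character document yields three chunks covering characters 0-1200, 900-2100, and 1800-3000.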

How It Works

Source Configuration: Configure one or more data sources through the web UI or environment variables
Document Loading: Each source plugin loads documents according to its implementation
Text Processing: Documents are split into overlapping chunks for optimal retrieval
Embedding Creation: Vector embeddings are generated using a local Hugging Face model
Vector Storage: Embeddings are stored in ChromaDB with source metadata
Query Processing: Natural language queries retrieve relevant chunks from all sources
Response Generation: Ollama generates contextual answers based on retrieved content
Source Attribution: Results show which source each piece of information came from
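
The last two steps can be sketched as assembling retrieved chunks into a prompt and sending it to Ollama. The prompt wording, the helper names, and the `{ source, text }` chunk shape are assumptions made for this example; the request targets Ollama's `/api/generate` endpoint, though the project itself may use a different endpoint or a client library.

```javascript
// Build a prompt that numbers each retrieved chunk and labels its source,
// so the answer can point back to where the information came from.
function buildPrompt(question, chunks) {
  const context = chunks
    .map((chunk, index) => `[${index + 1}] (${chunk.source}) ${chunk.text}`)
    .join("\n\n");
  return [
    "Answer the question using only the context below.",
    "Cite sources by their bracketed number.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
    "Answer:",
  ].join("\n");
}

// Describe the HTTP request for Ollama's generate endpoint.
// Falls back to the documented defaults when the variables are unset.
function buildGenerateRequest(question, chunks, env = process.env) {
  const host = env.OLLAMA_HOST || "http://localhost:11434";
  return {
    url: `${host}/api/generate`,
    body: {
      model: env.OLLAMA_MODEL || "llama3",
      prompt: buildPrompt(question, chunks),
      stream: false, // request a single JSON response instead of a stream
    },
  };
}
```

Posting `JSON.stringify(body)` to `url` returns a JSON object whose `response` field holds the generated answer.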

Troubleshooting

Common Issues

"No documents found": Add at least one source in the settings
"Could not connect to Ollama": Ensure Ollama is running (ollama serve)
"ChromaDB connection error": Start ChromaDB (chroma run --host localhost --port 8000)
"GitHub rate limit": Add a personal access token for higher limits
"Repository not found": Check URL format and access permissions

Performance Tips

For large repositories, consider specifying file extensions to limit scope
Use personal access tokens for better GitHub API rate limits
The first run downloads the embedding model, so it requires an internet connection
Subsequent runs are faster as ChromaDB persists the vector store

License

ISC

Contributing

Feel free to submit issues and enhancement requests! The plugin architecture makes it easy to add new data sources.
