project with UI for managing vector stores by allowing a user to add new stores via url/document upload

medright/vectorize-ui

Vectorize UI - Developer README

Overview

Vectorize UI is a Ruby on Rails application that provides a web interface for creating and managing vector stores using PostgreSQL's pgvector extension. The application enables semantic search capabilities over documents through OpenAI embeddings, supporting both file uploads and Git repository ingestion.

Tech Stack

  • Ruby: 3.2.2
  • Rails: 8.0.2
  • Database: PostgreSQL with pgvector extension
  • Frontend: Tailwind CSS, Turbo Rails, Importmap
  • AI/ML: OpenAI API for embeddings (text-embedding-3-small)
  • Background Jobs: Solid Queue
  • Caching: Solid Cache
  • WebSockets: Solid Cable (Action Cable)
  • File Processing: Active Storage, rubyzip
  • Deployment: Kamal, Docker, Thruster

Prerequisites

Before setting up the application, ensure you have the following installed:

Required Software

  • Ruby 3.2.2 (use rbenv or rvm to manage Ruby versions)
  • PostgreSQL 14+ with pgvector extension
  • Bundler (gem install bundler)
  • Git (for repository cloning and repo ingestion features)
  • Node.js (for asset management, though importmap reduces this need)

Installing PostgreSQL with pgvector

The application requires the pgvector extension for PostgreSQL to store and query vector embeddings.

macOS (using Homebrew)

brew install postgresql
brew install pgvector

# Start PostgreSQL
brew services start postgresql

Ubuntu/Debian

sudo apt-get update
sudo apt-get install postgresql postgresql-contrib
# The pgvector package name must match your PostgreSQL major version (16 shown here)
sudo apt-get install postgresql-16-pgvector

# Start PostgreSQL
sudo systemctl start postgresql
sudo systemctl enable postgresql

Docker

If you prefer using Docker for PostgreSQL:

docker run -d \
  --name postgres-vectorize \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  ankane/pgvector

Verify pgvector Installation

Connect to your PostgreSQL instance and verify pgvector is available:

psql -U postgres

-- Inside psql:
CREATE EXTENSION IF NOT EXISTS vector;
SELECT * FROM pg_extension WHERE extname = 'vector';

Installation & Setup

1. Clone the Repository

git clone https://github.com/medright/vectorize-ui.git
cd vectorize-ui

2. Install Ruby Dependencies

bundle install

3. Configure Environment Variables

Create a .env file in the root directory with the following variables:

# Required: OpenAI API Key for embeddings
OPENAI_API_KEY=sk-your-openai-api-key-here

# Optional: Use a different OpenAI model (default: text-embedding-3-small)
OPENAI_EMBED_MODEL=text-embedding-3-small

# Optional: Use alternative OpenAI-compatible endpoint (e.g., OpenRouter)
# OPENAI_BASE_URL=https://openrouter.ai/api/v1

# Optional: Database configuration (if not using defaults)
# DATABASE_URL=postgresql://user:password@localhost/vectorize_ui_development

# Optional: Rails environment (default: development)
# RAILS_ENV=development

Important: You must have a valid OpenAI API key to use the embedding and search features. Get one from OpenAI Platform.
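The variables above can be resolved in Ruby roughly as follows. This is a minimal sketch, not the application's actual configuration code: embedding_settings is a hypothetical helper, and the defaults are the ones listed in this README.

```ruby
# Hypothetical helper (not part of the app): resolves embedding settings
# from an ENV-like object, applying the defaults documented in this README.
DEFAULT_EMBED_MODEL = "text-embedding-3-small"
DEFAULT_BASE_URL = "https://api.openai.com/v1"

def embedding_settings(env = ENV)
  api_key = env["OPENAI_API_KEY"].to_s.strip
  # Fail early: embedding and search cannot work without a key.
  raise ArgumentError, "OPENAI_API_KEY is required" if api_key.empty?

  {
    api_key: api_key,
    model: env.fetch("OPENAI_EMBED_MODEL", DEFAULT_EMBED_MODEL),
    base_url: env.fetch("OPENAI_BASE_URL", DEFAULT_BASE_URL)
  }
end
```

Passing a plain Hash instead of ENV makes the helper easy to exercise in a console or a test.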

4. Database Setup

Create and migrate the database:

# Create the database
rails db:create

# Run migrations (this will enable pgvector extension and create tables)
rails db:migrate

The migrations will automatically:

  • Enable the pgvector extension in PostgreSQL
  • Create tables for vector_stores, documents, chunks
  • Set up vector indexes for efficient similarity search

5. Verify Setup

Check that everything is configured correctly:

# Check database connection and pgvector
rails runner "puts ActiveRecord::Base.connection.execute('SELECT * FROM pg_extension WHERE extname = \'vector\';').to_a"

# Check OpenAI API connectivity (requires OPENAI_API_KEY in .env)
rails runner "puts Embedding::OpenaiProvider.embed(['test']).inspect"

Linting & Pre-commit Hooks

This project runs RuboCop via pre-commit so commits fail fast when style regressions slip in. Install the runner once, then wire up the hook:

brew install pre-commit              # macOS example (alternatively use pipx or pip)
pre-commit install                   # installs .git/hooks/pre-commit
pre-commit run rubocop --all-files   # optional full-project lint + autocorrect

The hook executes bin/rubocop --parallel --force-exclusion --autocorrect, so it honors .rubocop.yml, skips excluded files, and safely fixes offenses on the files staged for commit. For more aggressive fixes, run bin/rubocop -A manually.

Running the Application

Development Mode

The application uses Procfile.dev to run multiple processes:

# Install foreman if you don't have it
gem install foreman

# Start all services (web server + CSS watcher)
bin/dev

This will start:

  • Rails server on http://localhost:3000
  • Tailwind CSS watcher for live CSS compilation

Alternatively, run services separately:

# Terminal 1: Rails server
bin/rails server

# Terminal 2: Tailwind CSS watcher
bin/rails tailwindcss:watch

Access the Application

Open your browser and navigate to:

http://localhost:3000

Background Jobs

The application uses Solid Queue for background job processing (document processing, embedding generation). In development, jobs run automatically in the same process. For production-like testing:

# Start background job worker
bundle exec rake solid_queue:start

Features

1. Vector Store Management

  • Create Vector Stores: Set up named vector stores with configurable embedding dimensions
  • List Vector Stores: View all created vector stores with their metadata
  • Delete Vector Stores: Remove vector stores and all associated documents (cascade delete)
  • Configurable Dimensions: Support for different embedding dimensions (default: 1536)

2. Document Ingestion

File Upload

  • Single/Multiple File Upload: Upload individual text files or multiple files at once
  • ZIP Archive Support: Upload ZIP files containing multiple documents
  • Supported File Types:
    • Markdown (.md, .markdown, .mdx)
    • Text files (.txt)
    • Code files (.rb, .py, .js, .ts, .tsx, .jsx, .go, .java, .kt, .rs, .c, .cpp, .h, .hpp, .swift, .sh)
    • Configuration files (.json, .yml, .yaml, .xml, .toml, .ini, .cfg)
    • Web files (.html, .css, .scss, .sql)
  • Active Storage Integration: Files are stored using Rails Active Storage

Git Repository Ingestion

  • Clone & Index Repositories: Provide a Git repository URL to clone and index the entire codebase
  • Incremental Updates: Re-process repositories to detect changes and update only modified files
  • Change Detection: Uses git diff to identify added, modified, renamed, and deleted files
  • Smart Filtering:
    • Ignores common directories (.git, node_modules, vendor, etc.)
    • Respects .gitignore patterns
    • Filters by file extensions for text content
    • Size limits to prevent processing of large binary files
  • Language Detection: Automatically detects programming language from file extensions
  • Repository Metadata: Tracks commit SHA, file paths, line numbers, and repository URLs
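The filtering and language detection described above could look something like the sketch below. The directory list, extension list, size cap, and method names are illustrative assumptions, not values read from RepoWalker, and .gitignore matching is left out for brevity.

```ruby
# Illustrative values only; the real RepoWalker settings may differ.
IGNORED_DIRS = %w[.git node_modules vendor].freeze
TEXT_EXTENSIONS = %w[.md .txt .rb .py .js .ts .go .json .yml .yaml .html .css .sql].freeze
MAX_FILE_BYTES = 1_000_000 # assumed size cap

LANGUAGES = {
  ".rb" => "ruby", ".py" => "python", ".js" => "javascript",
  ".ts" => "typescript", ".go" => "go", ".md" => "markdown"
}.freeze

# True when a repository file should be chunked and embedded.
def indexable?(relative_path, size_bytes)
  directories = relative_path.split("/")[0...-1]
  return false if directories.any? { |dir| IGNORED_DIRS.include?(dir) }
  return false unless TEXT_EXTENSIONS.include?(File.extname(relative_path).downcase)

  size_bytes <= MAX_FILE_BYTES
end

# Maps a file extension to a language label, or nil when unknown.
def detect_language(relative_path)
  LANGUAGES[File.extname(relative_path).downcase]
end
```

For example, indexable?("node_modules/pkg/index.js", 100) is false because of the ignored directory, while indexable?("app/models/chunk.rb", 2_000) passes all three checks.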

3. Text Processing & Chunking

Intelligent Chunking

  • Token-Aware Chunking: Uses tiktoken for accurate token counting
  • Semantic Splitting: Preserves document structure by splitting on meaningful boundaries
    • Markdown: Code blocks, headers, paragraphs
    • Code: Function boundaries, class definitions
    • Prose: Paragraphs, sentences
  • Configurable Chunk Size: Default 1000 tokens with 120 token overlap
  • Context Preservation: Adds context from neighboring chunks for better semantic understanding
  • Content-Type Aware: Different chunking strategies for code vs documentation
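To make the size and overlap settings concrete, here is a minimal sliding-window sketch. It works on an array of tokens and makes no attempt at the semantic splitting described above; chunk_tokens is a hypothetical name, and in the usage note a whitespace split stands in for tiktoken.

```ruby
# Splits a token array into windows of `size` tokens, where each window
# repeats the last `overlap` tokens of the previous one.
def chunk_tokens(tokens, size: 1000, overlap: 120)
  raise ArgumentError, "overlap must be smaller than size" unless overlap < size
  return [] if tokens.empty?

  step = size - overlap
  chunks = []
  start = 0
  loop do
    chunks << tokens[start, size]
    break if start + size >= tokens.length # the last window reached the end

    start += step
  end
  chunks
end
```

With the defaults, each new chunk starts 880 tokens after the previous one, so a 2,500-token document yields three chunks starting at tokens 0, 880, and 1760. A rough way to try it is chunk_tokens(text.split), keeping in mind that whitespace-separated words only approximate model tokens.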

Processing Pipeline

  • Extract text from files or repositories
  • Chunk content into token-sized pieces
  • Generate embeddings via OpenAI API
  • Store chunks with metadata in PostgreSQL
  • Write vector embeddings to pgvector column

4. Vector Search

  • Semantic Search: Query vector stores using natural language
  • Cosine Similarity: Uses pgvector's cosine distance operator (<=>)
  • Fast Vector Lookups: IVFFlat index for efficient similarity search
  • Paginated Results: Browse search results with pagination (20 per page)
  • Result Context: View matching text chunks with source information
  • Real-time Search: Instant search results as you type
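For intuition, the sketch below computes in plain Ruby what the <=> operator returns (cosine distance, that is, 1 minus cosine similarity) and ranks chunks by it with simple pagination. In the application the ranking happens in PostgreSQL; the search method and the hash-based chunk shape here are assumptions for illustration only.

```ruby
# Cosine distance: 0.0 for same direction, 1.0 for orthogonal, 2.0 for opposite.
# Not meaningful for zero-length vectors (the division yields NaN).
def cosine_distance(a, b)
  raise ArgumentError, "dimension mismatch" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1.0 - dot / (norm_a * norm_b)
end

# In-memory stand-in for "ORDER BY vector <=> query LIMIT ... OFFSET ...".
# `chunks` is assumed to be an array of hashes with a :vector key.
def search(query_vector, chunks, page: 1, per_page: 20)
  ranked = chunks.sort_by { |chunk| cosine_distance(query_vector, chunk[:vector]) }
  ranked.slice((page - 1) * per_page, per_page) || []
end
```

Smaller distances mean closer matches, which is why results are sorted in ascending order.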

5. Document Management

  • Document Status Tracking: Monitor processing status (queued, running, succeeded, failed)
  • Progress Updates: Real-time progress updates via Turbo Streams
  • Error Handling: Detailed error messages for failed processing
  • Retry Mechanism: Re-process failed documents with a single click
  • Document Metadata: Track source type, processing timestamps, and file information
  • File Tree View: For repository documents, browse the file tree structure
  • Rich Text Notes: Add notes to documents using Action Text

6. Chunk Inspection

  • View Chunks: Examine individual text chunks with their embeddings
  • Metadata Display: See token count, checksum, model used, dimensions
  • Source Tracking: View file path, language, line numbers, commit SHA
  • Filter by Path: List all chunks from a specific file in repository documents

7. Background Processing

  • Asynchronous Processing: Documents are processed in background jobs
  • Solid Queue Integration: Reliable job queue backed by PostgreSQL
  • Rate Limit Handling: Automatic retry with exponential backoff for API rate limits
  • Batch Embedding: Efficient batching of embedding requests (max 128 items per batch)
  • Job Monitoring: Track job status and progress in the UI
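Batching and retry with exponential backoff can be sketched as below. This is an illustration under stated assumptions rather than the logic in ProcessDocumentJob: RateLimitError, with_backoff, and embed_in_batches are hypothetical names, the block stands in for the real API call, and the attempt limit and delays are made up for the example.

```ruby
class RateLimitError < StandardError; end # hypothetical error type

# Runs the block, retrying on RateLimitError with delays of
# base_delay, 2 * base_delay, 4 * base_delay, and so on.
def with_backoff(max_attempts: 5, base_delay: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield
  rescue RateLimitError
    raise if attempt >= max_attempts # give up after the last allowed attempt

    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Sends texts to the block in batches of at most `batch_size` and
# concatenates the per-batch results in order.
def embed_in_batches(texts, batch_size: 128, **backoff_options)
  texts.each_slice(batch_size).flat_map do |batch|
    with_backoff(**backoff_options) { yield batch }
  end
end
```

A call such as embed_in_batches(texts) { |batch| client_call(batch) } would issue one request per 128 texts, where client_call is a placeholder for whatever performs the embedding request. The injectable sleeper exists so the delays can be observed in tests without actually sleeping.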

8. Real-time Updates

  • Turbo Streams: Live updates without page refresh
  • Status Broadcasting: Real-time document processing status
  • Dynamic UI Updates: New documents appear instantly in the list

Architecture

Models

  • VectorStore: Container for related documents
  • Document: Represents uploaded files or ingested repositories
  • Chunk: Text segments with embeddings stored as pgvector

Controllers

  • VectorStoresController: CRUD operations for vector stores, search endpoint
  • DocumentsController: Document upload/creation, display, retry failed processing
  • ChunksController: View individual chunks (show only)

Services

  • Embedding::OpenaiProvider: OpenAI API integration for embeddings. Includes the nested PgVectorWriter class for direct SQL operations for writing vector data.
  • TokenChunker: Advanced token-aware text chunking with semantic boundaries
  • Chunker: Simple character-based chunking (fallback)
  • RepoWalker: Git repository file collection with filtering
  • ZipExtractor: Extract and process ZIP archive contents

Jobs

  • ProcessDocumentJob: Main job for processing documents
    • Handles both file uploads and repository ingestion
    • Chunks text content
    • Generates embeddings in batches
    • Writes vectors to database
    • Tracks processing status and errors

Database Schema

  • vector_stores: Name, description, embedding_dimensions
  • documents: Title, notes (ActionText), source_type, source_ref, status, progress, processing timestamps, files_tree, last_processed_commit
  • chunks: Content, token_count, checksum, vector (pgvector), path, language, line numbers, model, dimensions, commit_sha, repo_url
  • Uses Solid Queue tables for job management
  • Uses Solid Cache tables for caching
  • Uses Solid Cable tables for WebSockets
  • Uses Active Storage tables for file attachments

Development Workflow

Making Code Changes

Rails reloads application code automatically in development mode, so most code changes take effect without restarting the server.

Tailwind CSS

CSS classes are compiled via Tailwind. The CSS watcher (bin/rails tailwindcss:watch) automatically recompiles when you change HTML/ERB files.

Console

Access the Rails console for debugging:

rails console

Database Console

Access PostgreSQL directly:

rails dbconsole

Testing Document Processing

Test with a Sample File

  1. Create a vector store via the UI
  2. Upload a markdown or code file
  3. Monitor the processing status
  4. Once complete, use the search feature to query the content

Test with a Git Repository

  1. Create a vector store via the UI
  2. Add a document with a Git repository URL (e.g., https://github.com/rails/rails)
  3. Wait for cloning and processing to complete
  4. Search the repository content semantically

Troubleshooting

Common Issues

pgvector extension not found

# Ensure pgvector is installed for your PostgreSQL version
# For PostgreSQL 16:
sudo apt-get install postgresql-16-pgvector

# Then recreate the database
rails db:drop db:create db:migrate

OpenAI API errors

  • Verify your OPENAI_API_KEY is set in .env
  • Check API key validity at OpenAI Platform
  • Monitor rate limits; the app includes automatic retry logic

Background jobs not processing

  • In development, jobs run in-process by default
  • Check logs: tail -f log/development.log
  • Verify database connectivity

Repository cloning fails

  • Ensure Git is installed: git --version
  • Check repository URL is accessible
  • Verify SSH keys if using private repositories

Production Deployment

The application includes Docker support and Kamal configuration for deployment.

Docker

# Build image
docker build -t vectorize-ui .

# Run container
docker run -d \
  -p 80:80 \
  -e RAILS_MASTER_KEY=<your-master-key> \
  -e DATABASE_URL=<your-database-url> \
  -e OPENAI_API_KEY=<your-api-key> \
  vectorize-ui

Kamal

See .kamal/ directory for deployment configuration.

Environment Variables Reference

Variable | Required | Default | Description
OPENAI_API_KEY | Yes | - | OpenAI API key for embeddings
OPENAI_EMBED_MODEL | No | text-embedding-3-small | OpenAI embedding model
OPENAI_BASE_URL | No | https://api.openai.com/v1 | OpenAI API base URL
DATABASE_URL | No | (from database.yml) | PostgreSQL connection string
RAILS_ENV | No | development | Rails environment
RAILS_MAX_THREADS | No | 5 | Max threads for Puma
RAILS_MASTER_KEY | Production | - | Decrypts credentials.yml.enc

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

License

This project is released under the PolyForm Strict License 1.0.0. See LICENSE for the complete terms.

Support

For issues and questions:

  • Open an issue on GitHub
  • Check existing issues for solutions
  • Review logs in log/development.log
