Vectorize UI - Developer README

Overview

Vectorize UI is a Ruby on Rails application that provides a web interface for creating and managing vector stores using PostgreSQL's pgvector extension. The application enables semantic search capabilities over documents through OpenAI embeddings, supporting both file uploads and Git repository ingestion.

Tech Stack

Ruby: 3.2.2
Rails: 8.0.2
Database: PostgreSQL with pgvector extension
Frontend: Tailwind CSS, Turbo Rails, Importmap
AI/ML: OpenAI API for embeddings (text-embedding-3-small)
Background Jobs: Solid Queue
Caching: Solid Cache
WebSockets: Solid Cable (Action Cable)
File Processing: Active Storage, rubyzip
Deployment: Kamal, Docker, Thruster

Prerequisites

Before setting up the application, ensure you have the following installed:

Required Software

Ruby 3.2.2 (use rbenv or rvm to manage Ruby versions)
PostgreSQL 14+ with pgvector extension
Bundler (gem install bundler)
Git (for repository cloning and repo ingestion features)
Node.js (for asset management, though importmap reduces this need)

Installing PostgreSQL with pgvector

The application requires the pgvector extension for PostgreSQL to store and query vector embeddings.

macOS (using Homebrew)

brew install postgresql
brew install pgvector

# Start PostgreSQL
brew services start postgresql

Ubuntu/Debian

sudo apt-get update
sudo apt-get install postgresql postgresql-contrib
# Use the pgvector package that matches your PostgreSQL major version (16 shown here)
sudo apt-get install postgresql-16-pgvector

# Start PostgreSQL
sudo systemctl start postgresql
sudo systemctl enable postgresql

Docker

If you prefer using Docker for PostgreSQL:

docker run -d \
  --name postgres-vectorize \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  ankane/pgvector

Verify pgvector Installation

Connect to your PostgreSQL instance and verify pgvector is available:

psql -U postgres

# Inside psql:
CREATE EXTENSION IF NOT EXISTS vector;
SELECT * FROM pg_extension WHERE extname = 'vector';

Installation & Setup

1. Clone the Repository

git clone https://github.com/medright/vectorize-ui.git
cd vectorize-ui

2. Install Ruby Dependencies

bundle install

3. Configure Environment Variables

Create a .env file in the root directory with the following variables:

# Required: OpenAI API Key for embeddings
OPENAI_API_KEY=sk-your-openai-api-key-here

# Optional: Use a different OpenAI model (default: text-embedding-3-small)
OPENAI_EMBED_MODEL=text-embedding-3-small

# Optional: Use alternative OpenAI-compatible endpoint (e.g., OpenRouter)
# OPENAI_BASE_URL=https://openrouter.ai/api/v1

# Optional: Database configuration (if not using defaults)
# DATABASE_URL=postgresql://user:password@localhost/vectorize_ui_development

# Optional: Rails environment (default: development)
# RAILS_ENV=development

Important: The embedding and search features require a valid OpenAI API key. Get one from the OpenAI Platform.
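As a rough illustration of how these variables might be resolved at boot, the sketch below reads each setting and falls back to the documented default. `EmbeddingConfig` and `load_embedding_config` are hypothetical names used only for this example; they are not taken from the application code.

```ruby
# Hypothetical value object holding the resolved embedding settings.
EmbeddingConfig = Struct.new(:api_key, :model, :base_url, keyword_init: true)

# Hypothetical helper: resolves settings from ENV (or any Hash-like object).
def load_embedding_config(env = ENV)
  EmbeddingConfig.new(
    # fetch without a default raises KeyError when the key is missing,
    # so a misconfigured .env is reported early.
    api_key: env.fetch("OPENAI_API_KEY"),
    model: env.fetch("OPENAI_EMBED_MODEL", "text-embedding-3-small"),
    base_url: env.fetch("OPENAI_BASE_URL", "https://api.openai.com/v1")
  )
end

config = load_embedding_config({ "OPENAI_API_KEY" => "sk-your-openai-api-key-here" })
puts config.model
```

Passing a plain Hash instead of `ENV` keeps the helper easy to exercise in a console without touching the real environment.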

4. Database Setup

Create and migrate the database:

# Create the database
rails db:create

# Run migrations (this will enable pgvector extension and create tables)
rails db:migrate

The migrations will automatically:

Enable the pgvector extension in PostgreSQL
Create tables for vector_stores, documents, chunks
Set up vector indexes for efficient similarity search

5. Verify Setup

Check that everything is configured correctly:

# Check database connection and pgvector
rails runner "puts ActiveRecord::Base.connection.execute('SELECT * FROM pg_extension WHERE extname = \'vector\';').to_a"

# Check OpenAI API connectivity (requires OPENAI_API_KEY in .env)
rails runner "puts Embedding::OpenaiProvider.embed(['test']).inspect"

Linting & Pre-commit Hooks

This project runs RuboCop via pre-commit so commits fail fast when style regressions slip in. Install the runner once, then wire up the hook:

brew install pre-commit              # macOS example (alternatively use pipx or pip)
pre-commit install                   # installs .git/hooks/pre-commit
pre-commit run rubocop --all-files   # optional full-project lint + autocorrect

The hook executes bin/rubocop --parallel --force-exclusion --autocorrect, so it honors .rubocop.yml, skips excluded files, and safely fixes offenses on the files staged for commit. For more aggressive fixes, run bin/rubocop -A manually.

Running the Application

Development Mode

The application uses Procfile.dev to run multiple processes:

# Install foreman if you don't have it
gem install foreman

# Start all services (web server + CSS watcher)
bin/dev

This will start:

Rails server on http://localhost:3000
Tailwind CSS watcher for live CSS compilation

Alternatively, run services separately:

# Terminal 1: Rails server
bin/rails server

# Terminal 2: Tailwind CSS watcher
bin/rails tailwindcss:watch

Access the Application

Open your browser and navigate to:

http://localhost:3000

Background Jobs

The application uses Solid Queue for background job processing (document processing, embedding generation). In development, jobs run automatically in the same process. For production-like testing:

# Start background job worker
bundle exec rake solid_queue:start

Features

1. Vector Store Management

Create Vector Stores: Set up named vector stores with configurable embedding dimensions
List Vector Stores: View all created vector stores with their metadata
Delete Vector Stores: Remove vector stores and all associated documents (cascade delete)
Configurable Dimensions: Support for different embedding dimensions (default: 1536)

2. Document Ingestion

File Upload

Single/Multiple File Upload: Upload individual text files or multiple files at once
ZIP Archive Support: Upload ZIP files containing multiple documents
Supported File Types:
- Markdown (.md, .markdown, .mdx)
- Text files (.txt)
- Code files (.rb, .py, .js, .ts, .tsx, .jsx, .go, .java, .kt, .rs, .c, .cpp, .h, .hpp, .swift, .sh)
- Configuration files (.json, .yml, .yaml, .xml, .toml, .ini, .cfg)
- Web files (.html, .css, .scss, .sql)
Active Storage Integration: Files are stored using Rails Active Storage
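The supported-types list can be pictured as a simple extension allowlist. The following is a minimal sketch under that assumption; `SUPPORTED_EXTENSIONS` and `supported_file?` are hypothetical names, and the application's actual filtering may differ.

```ruby
# Extensions copied from the supported file types listed above.
SUPPORTED_EXTENSIONS = %w[
  .md .markdown .mdx .txt
  .rb .py .js .ts .tsx .jsx .go .java .kt .rs .c .cpp .h .hpp .swift .sh
  .json .yml .yaml .xml .toml .ini .cfg
  .html .css .scss .sql
].freeze

# Hypothetical helper: true when the file name ends in a supported extension.
# The comparison is case-insensitive, so "README.MD" is accepted.
def supported_file?(filename)
  SUPPORTED_EXTENSIONS.include?(File.extname(filename).downcase)
end

%w[docs/README.MD app/models/chunk.rb logo.png Makefile].each do |name|
  puts "#{name}: #{supported_file?(name)}"
end
```

Files without an extension, such as `Makefile`, are rejected by this sketch because `File.extname` returns an empty string for them.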

Git Repository Ingestion

Clone & Index Repositories: Provide a Git repository URL to clone and index the entire codebase
Incremental Updates: Re-process repositories to detect changes and update only modified files
Change Detection: Uses git diff to identify added, modified, renamed, and deleted files
Smart Filtering:
- Ignores common directories (.git, node_modules, vendor, etc.)
- Respects .gitignore patterns
- Filters by file extensions for text content
- Size limits to prevent processing of large binary files
Language Detection: Automatically detects programming language from file extensions
Repository Metadata: Tracks commit SHA, file paths, line numbers, and repository URLs
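To make the change-detection step more concrete, here is a minimal sketch that classifies the output of `git diff --name-status` between two commits. The `parse_name_status` helper is hypothetical and is not the application's implementation; it only relies on the documented output format, where each line holds a status code, a tab, and one path (two paths for renames).

```ruby
# Hypothetical parser for `git diff --name-status <old_sha> <new_sha>` output.
def parse_name_status(output)
  changes = { added: [], modified: [], deleted: [], renamed: [] }
  output.each_line(chomp: true) do |line|
    next if line.empty?

    status, *paths = line.split("\t")
    # Rename codes carry a similarity score (for example "R100"),
    # so only the first character is inspected.
    case status[0]
    when "A" then changes[:added] << paths.first
    when "M" then changes[:modified] << paths.first
    when "D" then changes[:deleted] << paths.first
    when "R" then changes[:renamed] << { from: paths[0], to: paths[1] }
    end
  end
  changes
end

sample = "A\tlib/new.rb\nM\tREADME.md\nD\tdocs/old.txt\nR100\tapp/a.rb\tapp/b.rb\n"
p parse_name_status(sample)
```

An incremental update could then re-embed the added and modified paths, drop chunks for deleted paths, and treat a rename as a delete of the old path plus an add of the new one. That mapping is an assumption about how such a result would be used.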

3. Text Processing & Chunking

Intelligent Chunking

Token-Aware Chunking: Uses tiktoken for accurate token counting
Semantic Splitting: Preserves document structure by splitting on meaningful boundaries
- Markdown: Code blocks, headers, paragraphs
- Code: Function boundaries, class definitions
- Prose: Paragraphs, sentences
Configurable Chunk Size: Defaults to 1000 tokens per chunk with a 120-token overlap between consecutive chunks
Context Preservation: Adds context from neighboring chunks for better semantic understanding
Content-Type Aware: Different chunking strategies for code vs documentation
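The size and overlap settings can be illustrated with a plain sliding window. This is a simplified sketch, not the application's TokenChunker: it assumes the text has already been tokenized, ignores semantic boundaries, and uses the hypothetical helper name `chunk_tokens`.

```ruby
# Splits a token array into windows of `size` tokens, where each window
# shares its first `overlap` tokens with the end of the previous window.
def chunk_tokens(tokens, size: 1000, overlap: 120)
  raise ArgumentError, "overlap must be smaller than size" unless overlap < size
  return [] if tokens.empty?

  step = size - overlap
  chunks = []
  start = 0
  loop do
    chunks << tokens[start, size]
    # Stop once the current window reaches the end of the input.
    break if start + size >= tokens.length

    start += step
  end
  chunks
end

# Whitespace splitting stands in for a real tokenizer here.
words = "the quick brown fox jumps over the lazy dog".split
chunk_tokens(words, size: 4, overlap: 1).each { |chunk| puts chunk.join(" ") }
```

With the defaults, each new chunk starts 880 tokens after the previous one, so a 2000-token document yields three chunks, the last of which is shorter than the rest.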

Processing Pipeline

Extract text from files or repositories
Chunk content into token-sized pieces
Generate embeddings via OpenAI API
Store chunks with metadata in PostgreSQL
Write vector embeddings to pgvector column

4. Vector Search

Semantic Search: Query vector stores using natural language
Cosine Similarity: Uses pgvector's cosine distance operator (<=>)
Fast Vector Lookups: IVFFlat index for efficient similarity search
Paginated Results: Browse search results with pagination (20 per page)
Result Context: View matching text chunks with source information
Real-time Search: Instant search results as you type
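For readers unfamiliar with the `<=>` operator, the sketch below reproduces in plain Ruby what it computes: cosine distance, defined as 1 minus cosine similarity, so smaller values mean closer matches. The `nearest_chunks` helper is a hypothetical in-memory stand-in for ordering rows by distance in SQL; in the application the ranking is done by PostgreSQL, not in Ruby.

```ruby
# Cosine distance as pgvector's <=> operator defines it: 1 - cosine similarity.
# The result ranges from 0 (same direction) to 2 (opposite direction).
def cosine_distance(a, b)
  raise ArgumentError, "dimension mismatch" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1.0 - dot / (norm_a * norm_b)
end

# Hypothetical helper: returns the k chunks closest to the query vector,
# ordered from nearest to farthest.
def nearest_chunks(query, chunks, k: 20)
  chunks.min_by(k) { |chunk| cosine_distance(query, chunk[:vector]) }
end

chunks = [
  { id: 1, vector: [0.0, 1.0] },
  { id: 2, vector: [1.0, 0.1] },
  { id: 3, vector: [-1.0, 0.0] }
]
p nearest_chunks([1.0, 0.0], chunks, k: 2).map { |chunk| chunk[:id] }
```

Real embeddings have many more dimensions (1536 by default here), but the arithmetic is the same.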

5. Document Management

Document Status Tracking: Monitor processing status (queued, running, succeeded, failed)
Progress Updates: Real-time progress updates via Turbo Streams
Error Handling: Detailed error messages for failed processing
Retry Mechanism: Re-process failed documents with a single click
Document Metadata: Track source type, processing timestamps, and file information
File Tree View: For repository documents, browse the file tree structure
Rich Text Notes: Add notes to documents using Action Text

6. Chunk Inspection

View Chunks: Examine individual text chunks with their embeddings
Metadata Display: See token count, checksum, model used, dimensions
Source Tracking: View file path, language, line numbers, commit SHA
Filter by Path: List all chunks from a specific file in repository documents

7. Background Processing

Asynchronous Processing: Documents are processed in background jobs
Solid Queue Integration: Reliable job queue backed by PostgreSQL
Rate Limit Handling: Automatic retry with exponential backoff for API rate limits
Batch Embedding: Efficient batching of embedding requests (max 128 items per batch)
Job Monitoring: Track job status and progress in the UI
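Batching and rate-limit retries can be combined roughly as follows. This is a sketch under stated assumptions rather than the application's job code: `RateLimitError`, `with_backoff`, and `embed_in_batches` are hypothetical names, and the embedder is any callable that maps an array of texts to an array of vectors.

```ruby
# Hypothetical error class standing in for an API rate-limit response.
class RateLimitError < StandardError; end

# Retries the block after delays of base, 2*base, 4*base, ... seconds.
# The sleeper is injectable so the delay can be stubbed out.
def with_backoff(max_attempts: 5, base_delay: 1.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield
  rescue RateLimitError
    # Give up and re-raise once the attempt budget is spent.
    raise if attempt >= max_attempts

    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Embeds texts in slices of at most 128 items, the batch limit noted above.
def embed_in_batches(texts, embedder, batch_size: 128, **backoff_options)
  texts.each_slice(batch_size).flat_map do |batch|
    with_backoff(**backoff_options) { embedder.call(batch) }
  end
end

# Stand-in embedder that returns a one-dimensional "vector" per text.
fake_embedder = ->(batch) { batch.map { |text| [text.length.to_f] } }
vectors = embed_in_batches(Array.new(300) { |i| "text #{i}" }, fake_embedder)
puts vectors.length
```

Only the failed batch is retried, so a rate limit late in a large document does not repeat the requests that already succeeded.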

8. Real-time Updates

Turbo Streams: Live updates without page refresh
Status Broadcasting: Real-time document processing status
Dynamic UI Updates: New documents appear instantly in the list

Architecture

Models

VectorStore: Container for related documents
Document: Represents uploaded files or ingested repositories
Chunk: Text segments with embeddings stored as pgvector

Controllers

VectorStoresController: CRUD operations for vector stores, search endpoint
DocumentsController: Document upload/creation, display, retry failed processing
ChunksController: View individual chunks (show only)

Services

Embedding::OpenaiProvider: OpenAI API integration for embeddings. Includes the nested PgVectorWriter class, which writes vector data using direct SQL.
TokenChunker: Advanced token-aware text chunking with semantic boundaries
Chunker: Simple character-based chunking (fallback)
RepoWalker: Git repository file collection with filtering
ZipExtractor: Extract and process ZIP archive contents

Jobs

ProcessDocumentJob: Main job for processing documents
- Handles both file uploads and repository ingestion
- Chunks text content
- Generates embeddings in batches
- Writes vectors to database
- Tracks processing status and errors

Database Schema

vector_stores: Name, description, embedding_dimensions
documents: Title, notes (ActionText), source_type, source_ref, status, progress, processing timestamps, files_tree, last_processed_commit
chunks: Content, token_count, checksum, vector (pgvector), path, language, line numbers, model, dimensions, commit_sha, repo_url
Uses Solid Queue tables for job management
Uses Solid Cache tables for caching
Uses Solid Cable tables for WebSockets
Uses Active Storage tables for file attachments

Development Workflow

Making Code Changes

Rails reloads application code in development mode, so most code changes take effect on the next request without restarting the server. Changes to initializers and other files under config/ still require a restart.

Tailwind CSS

CSS classes are compiled via Tailwind. The CSS watcher (bin/rails tailwindcss:watch) automatically recompiles when you change HTML/ERB files.

Console

Access the Rails console for debugging:

rails console

Database Console

Access PostgreSQL directly:

rails dbconsole

Testing Document Processing

Test with a Sample File

Create a vector store via the UI
Upload a markdown or code file
Monitor the processing status
Once complete, use the search feature to query the content

Test with a Git Repository

Create a vector store via the UI
Add a document with a Git repository URL (e.g., https://github.com/rails/rails)
Wait for cloning and processing to complete
Search the repository content semantically

Troubleshooting

Common Issues

pgvector extension not found

# Ensure pgvector is installed for your PostgreSQL version
# For PostgreSQL 16:
sudo apt-get install postgresql-16-pgvector

# Then recreate the database
rails db:drop db:create db:migrate

OpenAI API errors

Verify your OPENAI_API_KEY is set in .env
Check API key validity at OpenAI Platform
Monitor rate limits; the app includes automatic retry logic

Background jobs not processing

In development, jobs run in-process by default
Check logs: tail -f log/development.log
Verify database connectivity

Repository cloning fails

Ensure Git is installed: git --version
Check repository URL is accessible
Verify SSH keys if using private repositories

Production Deployment

The application includes Docker support and Kamal configuration for deployment.

Docker

# Build image
docker build -t vectorize-ui .

# Run container
docker run -d \
  -p 80:80 \
  -e RAILS_MASTER_KEY=<your-master-key> \
  -e DATABASE_URL=<your-database-url> \
  -e OPENAI_API_KEY=<your-api-key> \
  vectorize-ui

Kamal

See the .kamal/ directory for deployment configuration.

Environment Variables Reference

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `OPENAI_API_KEY` | Yes | - | OpenAI API key for embeddings |
| `OPENAI_EMBED_MODEL` | No | `text-embedding-3-small` | OpenAI embedding model |
| `OPENAI_BASE_URL` | No | `https://api.openai.com/v1` | OpenAI API base URL |
| `DATABASE_URL` | No | (from database.yml) | PostgreSQL connection string |
| `RAILS_ENV` | No | `development` | Rails environment |
| `RAILS_MAX_THREADS` | No | `5` | Max threads for Puma |
| `RAILS_MASTER_KEY` | Production | - | Decrypts credentials.yml.enc |

Contributing

Fork the repository
Create a feature branch
Make your changes
Test thoroughly
Submit a pull request

License

This project is released under the PolyForm Strict License 1.0.0. See LICENSE for the complete terms.

Support

For issues and questions:

Open an issue on GitHub
Check existing issues for solutions
Review logs in log/development.log
