Gemma Embedding Framework

A complete, production-ready framework for semantic search and RAG (Retrieval-Augmented Generation) applications. Built with Google's EmbeddingGemma model, hybrid search with dense + sparse vectors, and configurable cross-encoder rerankers.

Overview

This framework provides everything you need to build semantic search applications:

  • Embedding Service - The core application: a high-performance Python service with gRPC, HTTP, and HTTPS support
  • API Server - Express.js REST gateway with document processing and Qdrant integration
  • Node.js SDK - Complete client library with Qdrant integration and hybrid search
  • Document Processing - Built-in parsers for PDF, Office, HTML, TMX, and XLIFF formats
  • Hybrid Search - Dense (semantic) + sparse (keyword/IDF) vectors with RRF fusion

The core application is the Embedding Service. The API Server and SDK, including document processing, are provided for testing and education, so you can evaluate how indexing and search work across different strategies.

┌─────────────────────────────────────────────────────────────────┐
│                          Your App                               │
│                                                                 │
│  import { GemmaClient } from '@neurelectra/gemma-embed-sdk';    │
│  const client = new GemmaClient({ embeddingHost, qdrantUrl });  │
│  await client.search('docs', 'query', { mode: 'hybrid' });      │
└──────────────────────────────┬──────────────────────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │ gRPC                 │ REST                 │
        ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│Embedding Service│    │   API Server    │    │     Qdrant      │
│ - EmbeddingGemma│    │  (Test Bed)     │    │  Vector Store   │
│ - BGE Reranker  │    │    :3000        │    │  Dense + Sparse │
│   :50051        │    └─────────────────┘    │    :6333        │
└─────────────────┘                           └─────────────────┘

Features

  • 768-dim Embeddings - EmbeddingGemma-300M produces high-quality semantic vectors
  • Hybrid Search - Dense (semantic) + sparse (keyword) vectors with corpus-wide IDF
  • Search Modes - dense, sparse, hybrid, or auto detection
  • Two-Stage Retrieval - Vector search + cross-encoder reranking for best accuracy
  • Multi-Protocol - gRPC for speed, HTTP/HTTPS REST for convenience
  • Document Processing - PDF, DOCX, XLSX, HTML, TMX, XLIFF support with chunking
  • Bilingual Support - TMX/XLIFF parsers for translation memory search
  • GPU Acceleration - Automatic CUDA detection for 10-20x speedup
  • Configurable Rerankers - Speed, quality, balanced, or multilingual presets
  • Let's Encrypt - Built-in SSL certificate provisioning

Quick Start

Prerequisites

Step 1: Hugging Face Setup

The EmbeddingGemma model requires accepting Google's license:

  1. Go to huggingface.co/google/embeddinggemma-300m
  2. Click "Agree and access repository" to accept the terms
  3. Generate an access token at huggingface.co/settings/tokens
  4. Set the token as an environment variable:
export HF_TOKEN=hf_your_token_here

Step 2: Start Qdrant (Optional)

If you want to use the full RAG pipeline with document storage:

docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant

Create a collection for your documents. For best search quality, use hybrid collections with both dense (semantic) and sparse (keyword/IDF) vectors:

# Hybrid collection (recommended) - dense + sparse with IDF
curl -X PUT 'http://localhost:6333/collections/documents' \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": {
      "dense": {"size": 768, "distance": "Cosine"}
    },
    "sparse_vectors": {
      "sparse": {"modifier": "idf"}
    }
  }'

# Or dense-only collection (simpler, semantic search only)
curl -X PUT 'http://localhost:6333/collections/documents_dense' \
  -H 'Content-Type: application/json' \
  -d '{"vectors": {"size": 768, "distance": "Cosine"}}'

Why Hybrid? Hybrid collections combine the best of both approaches:

  • Dense vectors capture semantic meaning ("car" matches "automobile")
  • Sparse vectors with IDF capture keyword importance across the entire corpus
  • RRF fusion combines both for superior search quality

Step 3: Build and Start

# Clone the repository
git clone https://github.com/your-org/gemma-embedding-framework.git
cd gemma-embedding-framework

# Build with default reranker (bge-gemma - quality)
docker-compose build

# Start all services
docker-compose up

The services will be available at:

  • API Server (REST gateway + web UI) - http://localhost:3000
  • Embedding Service (gRPC) - localhost:50051
  • Qdrant - http://localhost:6333

Web Interface

The API server includes a web UI for testing at http://localhost:3000:

[Screenshot: Frontend Overview]

Features:

  • Collection Management - Create and select vector collections
  • Document Indexing - Upload files or paste text with configurable chunking
  • Semantic Search - Search with multiple modes and reranking strategies

See the Frontend User Guide for detailed usage instructions.


Using the API Server for Testing

The API server provides a convenient REST interface for testing the embedding pipeline.

Health Check

curl http://localhost:3000/health

Response:

{
  "status": "ready",
  "embedding_service": true,
  "service_info": {
    "embeddingModel": "google/embeddinggemma-300m",
    "rerankerModel": "BAAI/bge-reranker-v2-gemma",
    "device": "cuda"
  },
  "capabilities": {
    "extensions": [".txt", ".pdf", ".docx", ".html", ".tmx", ".xliff"],
    "chunkingStrategies": ["fixed", "sentence", "paragraph", "segment"]
  }
}

Upload a Document

# Upload a text file
curl -X POST http://localhost:3000/upload \
  -F "file=@document.txt"

# Upload a PDF with custom chunking
curl -X POST http://localhost:3000/upload \
  -F "file=@report.pdf" \
  -F 'options={"chunking": {"strategy": "paragraph"}}'

# Upload raw text
curl -X POST http://localhost:3000/upload \
  -H "Content-Type: application/json" \
  -d '{"text": "Your content here..."}'

Response:

{
  "status": "Indexed!",
  "document_id": "doc_1704672000000",
  "chunks_created": 5,
  "total_tokens": 1234,
  "latency": {
    "parse_ms": 45.2,
    "chunk_ms": 2.1,
    "embed_ms": 850.5,
    "qdrant_ms": 120.3,
    "total_ms": 1018.1
  }
}

Semantic Search

# Hybrid search (recommended for hybrid collections)
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is machine learning?",
    "limit": 5,
    "searchMode": "hybrid"
  }'

# Auto-detect search mode based on collection type (default)
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is machine learning?",
    "limit": 5,
    "searchMode": "auto"
  }'

# Dense-only search (semantic similarity)
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is machine learning?",
    "searchMode": "dense"
  }'

# Sparse-only search (keyword/IDF matching)
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is machine learning?",
    "searchMode": "sparse"
  }'

# Without reranking (faster)
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is machine learning?",
    "limit": 5,
    "skipRerank": true
  }'

Search Modes:

  • auto - Detects collection type and uses the appropriate mode. Default, recommended.
  • hybrid - Dense + sparse with RRF fusion. Best quality on hybrid collections.
  • dense - Semantic similarity only. Best for conceptual queries and synonyms.
  • sparse - Keyword/IDF matching only. Best for exact keyword queries.

Response:

{
  "results": [
    {
      "text": "Machine learning is a subset of artificial intelligence...",
      "original_score": 0.85,
      "rerank_score": 0.92,
      "metadata": {
        "document_id": "doc_123",
        "filename": "ml-intro.pdf",
        "chunk_index": 3
      }
    }
  ],
  "searchMode": "hybrid",
  "reranked": true,
  "latency": {
    "embed_ms": 45.2,
    "qdrant_ms": 12.5,
    "rerank_ms": 85.3,
    "total_ms": 143.0
  }
}

Direct Embedding

# Single embedding
curl -X POST http://localhost:3000/embed \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, world!",
    "taskType": "query"
  }'

# Batch embeddings
curl -X POST http://localhost:3000/embed/batch \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["First document", "Second document"],
    "taskType": "document"
  }'

Reranking

curl -X POST http://localhost:3000/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "capital of France",
    "documents": [
      "Paris is a beautiful city.",
      "Berlin is the capital of Germany.",
      "Paris is the capital of France."
    ]
  }'

SDK Usage

For programmatic integration, use the Node.js SDK:

npm install @neurelectra/gemma-embed-sdk
# or link locally during development
cd sdk && npm link
cd ../your-project && npm link @neurelectra/gemma-embed-sdk

Full RAG Pipeline with Hybrid Search (Recommended)

import { GemmaClient } from '@neurelectra/gemma-embed-sdk';

// Create unified client
const client = new GemmaClient({
  embeddingHost: 'localhost:50051',
  qdrantUrl: 'http://localhost:6333'
});

await client.waitForReady();

// Create hybrid collection with IDF weighting
await client.qdrant.createCollection('my_docs', {
  vectorType: 'hybrid',     // dense + sparse vectors
  dimensions: 768,
  distance: 'Cosine',
  sparse: { useIdf: true }  // corpus-wide IDF weighting
});

// Index documents with chunking strategy
await client.index('my_docs', [
  { text: 'Machine learning is a subset of AI...', metadata: { source: 'ml.pdf' } },
  { text: 'Neural networks use layers of neurons...', metadata: { source: 'nn.pdf' } }
], {
  strategy: 'sentence',  // 'fixed' | 'sentence' | 'paragraph'
  chunkSize: 1000,
  chunkOverlap: 200
});

// Hybrid search (combines dense semantic + sparse keyword with IDF)
const results = await client.search('my_docs', 'What is machine learning?', {
  mode: 'hybrid',   // 'dense' | 'sparse' | 'hybrid'
  limit: 5,
  rerank: true,
  rerankStrategy: 'smart'
});

console.log(results);
client.close();

Embedding Client Only

import { EmbeddingClient } from '@neurelectra/gemma-embed-sdk';

// Connect to embedding service
const client = new EmbeddingClient('localhost:50051');
await client.waitForReady();

// Generate embeddings
const { vector, metadata } = await client.embed('What is AI?', 'query');
console.log(`Dimensions: ${vector.length}`); // 768

// Batch processing
const batch = await client.embedBatch([
  'First document',
  'Second document'
], 'document');

// Reranking
const ranked = await client.rerank(
  'machine learning',
  ['ML is great', 'Weather today', 'Deep learning'],
  { topK: 2, strategy: 'smart' }
);

// Cleanup
client.close();

Direct Qdrant Access

import { QdrantManager, SparseEncoder } from '@neurelectra/gemma-embed-sdk';

const qdrant = new QdrantManager('http://localhost:6333');
const encoder = new SparseEncoder({ minTokenLength: 3 });

// List collections
const collections = await qdrant.listCollections();

// Count points
const count = await qdrant.count('my_docs');

// Export/import vocabulary for sparse encoder
const vocab = encoder.exportVocab();
encoder.importVocab(vocab);

See sdk/README.md for full SDK documentation.


Document Processing

The API server supports multiple document formats with intelligent chunking.

Supported Formats

  • Plain Text (.txt, .md, .csv) - TextParser
  • PDF (.pdf) - PdfParser
  • Microsoft Office (.docx, .xlsx, .pptx) - OfficeParser
  • HTML (.html, .htm) - HtmlParser
  • TMX (.tmx) - TmxParser (bilingual)
  • XLIFF (.xlf, .xliff) - XliffParser (bilingual)

Chunking Strategies

  • fixed - Fixed character size with intelligent boundary breaking. Best for general documents and predictable chunk sizes.
  • sentence - Groups sentences up to the target size and handles abbreviations. Best for articles, essays, and well-formatted text.
  • paragraph - Preserves paragraph boundaries and sub-splits long paragraphs. Best for reports and structured documents.
  • segment - Uses the existing segments from bilingual files. Best for TMX and XLIFF translation files.

Chunking in SDK:

// Fixed chunking (default) - character-based with smart boundaries
await client.index('docs', documents, {
  strategy: 'fixed',
  chunkSize: 500,
  chunkOverlap: 50
});

// Sentence chunking - groups sentences, handles abbreviations (Mr., Dr., etc.)
await client.index('docs', documents, {
  strategy: 'sentence',
  chunkSize: 1000,
  chunkOverlap: 200
});

// Paragraph chunking - preserves document structure
await client.index('docs', documents, {
  strategy: 'paragraph',
  chunkSize: 1500
});

Chunking in API:

# Upload with sentence chunking
curl -X POST http://localhost:3000/upload \
  -F "file=@document.pdf" \
  -F 'options={"chunking": {"strategy": "sentence", "chunkSize": 1000, "overlap": 200}}'

Upload Options

{
  "chunking": {
    "strategy": "fixed",
    "chunkSize": 1000,
    "overlap": 200
  },
  "parser": {
    "preserveFormatting": false
  },
  "bilingual": {
    "skipTarget": false,
    "sourceOnly": false
  }
}
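
For example, the sketch below uploads a TMX file with segment chunking and indexes only the source-language segments (assuming that is what sourceOnly controls). It uses the same file and options multipart fields as the curl examples above and requires Node 18+ for global fetch, FormData, and Blob; it is illustrative, not part of the SDK.

// Upload a TMX file, chunked by its existing segments, source side only.
// Sketch only - requires Node 18+ (global fetch, FormData, Blob).
import { readFile } from 'node:fs/promises';

async function uploadTmx(path) {
  const form = new FormData();
  form.append('file', new Blob([await readFile(path)]), 'memory.tmx');
  form.append('options', JSON.stringify({
    chunking: { strategy: 'segment' },   // reuse the segments already present in the TMX
    bilingual: { sourceOnly: true }      // assumed to skip target-language segments
  }));
  const res = await fetch('http://localhost:3000/upload', { method: 'POST', body: form });
  return res.json();                     // { status, document_id, chunks_created, ... }
}

uploadTmx('./memory.tmx').then(console.log);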

Reranker Selection

Choose the reranker that fits your use case:

  • speed - ms-marco-MiniLM (~90MB, ~20ms, English). Best for low-latency English apps.
  • balanced - mxbai-rerank-base (~400MB, ~50ms, English). Best for a speed/quality trade-off.
  • quality - bge-reranker-v2-gemma (~2GB, ~80ms, English). Best English accuracy.
  • multilingual - bge-reranker-v2-m3 (~568MB, ~60ms, 100+ languages). Best for non-English content.

Important: For non-English content (Portuguese, Spanish, Chinese, etc.), use multilingual. The other rerankers perform poorly on non-English text.

# Build with speed reranker (English, fastest)
docker-compose build --build-arg RERANKER_MODEL=speed

# Build with quality reranker (English, default)
docker-compose build --build-arg RERANKER_MODEL=quality

# Build with multilingual reranker (100+ languages)
docker-compose build --build-arg RERANKER_MODEL=multilingual

Algorithmic Reranking Strategies

In addition to model-based reranking, the service supports algorithmic reranking strategies that run directly in the API server (~1-3ms), with no neural-network inference and no call to the embedding service.

  • smart (~1ms) - max(BM25, vector) per document. Recommended default.
  • bm25 (~2ms) - BM25 keyword matching. Best for keyword queries in any language.
  • hybrid (~1ms) - Vector + BM25 weighted average. Balances semantic and keyword relevance.
  • model (~80ms) - Cross-encoder neural reranking. Best accuracy (GPU recommended).
  • mmr (~60ms) - Maximal Marginal Relevance. Reduces redundant results.

Smart Strategy (Recommended):

The smart strategy takes the best of BM25 and vector similarity for each document:

  • When keywords match: Uses BM25 score (100%)
  • When no keyword match: Falls back to vector score (semantic similarity)

This solves the problem where BM25 returns 0% for semantically similar content without exact keyword matches.
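
A rough sketch of that rule, assuming both scores are already normalized to the 0-1 range (bm25_score is a hypothetical field name; the actual implementation lives in api-server/rerankers/smart.js):

// Illustrative only - assumes BM25 and vector scores are normalized to [0, 1].
// The real logic lives in api-server/rerankers/smart.js.
function smartScore(bm25Score, vectorScore) {
  // Keyword hit: trust BM25. No keyword overlap (BM25 = 0): fall back to the
  // vector score so semantically similar chunks are not dropped.
  return Math.max(bm25Score, vectorScore);
}

const results = [
  { text: 'ML is a subset of AI',      original_score: 0.82, bm25_score: 0.0 },
  { text: 'intro to machine learning', original_score: 0.74, bm25_score: 0.9 }
];

const reranked = results
  .map(r => ({ ...r, rerank_score: smartScore(r.bm25_score, r.original_score) }))
  .sort((a, b) => b.rerank_score - a.rerank_score);

console.log(reranked); // the keyword match ranks first; the pure-semantic hit keeps its vector score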

# Smart - Best of BM25 and vector (recommended)
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning",
    "rerankStrategy": "smart"
  }'

# BM25 - Fast, keyword-based (great for non-English content!)
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning",
    "rerankStrategy": "bm25"
  }'

# Hybrid - Weighted average of vector and BM25
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "database optimization",
    "rerankStrategy": "hybrid"
  }'

# MMR - For diverse results (reduce redundancy)
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "python tutorials",
    "rerankStrategy": "mmr"
  }'

Tip: For CPU-only deployments, use smart, bm25, or hybrid for fast reranking (~1-3ms). Reserve model and mmr for GPU deployments where they can run efficiently.

Skip Reranking

For fastest searches (vector similarity only):

curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Sua pergunta aqui",
    "skipRerank": true
  }'

Hybrid Search (Dense + Sparse)

Hybrid search combines two complementary approaches for superior search quality:

How It Works

  • Dense - Captures semantic meaning ("car" matches "automobile", "vehicle")
  • Sparse (IDF) - Captures keyword importance ("Maduro" matches exact name mentions)

Problem with Dense-Only: Semantic search may miss exact keyword matches that are critical for names, technical terms, or specific phrases.

Problem with Sparse-Only (BM25): Keyword matching misses semantically similar content when words differ ("injured" vs "wounded").

Solution - Hybrid: Combine both with RRF (Reciprocal Rank Fusion) to get the best of both worlds.
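
RRF works on ranks rather than raw scores, so the dense and sparse scales never need to be reconciled: each document is scored by 1/(k + rank) in every result list and the contributions are summed. A minimal illustrative sketch (not the framework's exact implementation; k = 60 is the conventional constant):

// Reciprocal Rank Fusion: score(doc) = sum over result lists of 1 / (k + rank).
// Illustrative sketch only - not the framework's exact implementation.
function rrfFuse(denseResults, sparseResults, k = 60) {
  const scores = new Map();
  for (const list of [denseResults, sparseResults]) {
    list.forEach((doc, i) => {
      const rank = i + 1;                                   // 1-based rank within this list
      scores.set(doc.id, (scores.get(doc.id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])                            // highest fused score first
    .map(([id, score]) => ({ id, score }));
}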

Collection Types

When creating collections, choose the appropriate type:

// Hybrid collection (recommended) - best search quality
await client.qdrant.createCollection('my_docs', {
  vectorType: 'hybrid',
  sparse: { useIdf: true }  // Corpus-wide IDF weighting
});

// Dense-only collection - simpler, semantic search only
await client.qdrant.createCollection('simple_docs', {
  vectorType: 'dense'
});

Search Mode Selection

  • auto - Detects collection type. Default, recommended.
  • hybrid - Dense + sparse with RRF fusion. Best quality on hybrid collections.
  • dense - Dense vectors only. Best for conceptual/synonym queries.
  • sparse - Sparse vectors only. Best for exact keyword matching.

Why IDF Matters

Without IDF: Common words like "the", "is", "a" have equal weight to important terms.

With IDF: Terms that appear in fewer documents (like proper nouns, technical terms) get higher weight, improving keyword matching quality.

The SDK's SparseEncoder generates TF (term frequency) vectors, and Qdrant applies IDF weighting at search time using corpus-wide statistics.
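
As a toy illustration of that split (a naive tokenizer and a BM25-style IDF formula are shown for intuition only; the real SparseEncoder manages vocabulary indices, and Qdrant computes IDF from its own corpus statistics):

// Toy term-frequency side - the SDK's SparseEncoder does the real work.
function toyTermFrequencies(text) {
  const tf = new Map();
  for (const token of text.toLowerCase().split(/\W+/).filter(t => t.length >= 3)) {
    tf.set(token, (tf.get(token) ?? 0) + 1);   // mirrors minTokenLength: 3 from the SDK example
  }
  return tf; // "hybrid search beats dense search" -> { hybrid: 1, search: 2, beats: 1, dense: 1 }
}

// BM25-style IDF, for intuition only; Qdrant applies its own IDF at search time.
// N = documents in the collection, df = documents containing the term.
const idf = (N, df) => Math.log(1 + (N - df + 0.5) / (df + 0.5));

console.log(idf(10000, 3));    // rare term (e.g. a proper noun) -> large weight (~8.0)
console.log(idf(10000, 9500)); // very common term -> weight near zero (~0.05)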


Project Structure

gemma-embedding-framework/
├── README.md                    # This file
├── docker-compose.yml           # Service orchestration
├── vector.proto                 # gRPC service contract
├── docs/                        # Documentation
│   ├── FRONTEND_GUIDE.md        # Web UI user guide
│   └── images/                  # Screenshots for documentation
├── embedding-service-python/    # Python embedding service
│   ├── Dockerfile
│   ├── server.py               # gRPC + HTTP server
│   ├── download_models.py      # Model pre-download script
│   ├── rerankers/              # Algorithmic rerankers (Python)
│   │   ├── bm25.py             # BM25 implementation
│   │   ├── mmr.py              # MMR diversity ranking
│   │   ├── hybrid.py           # Vector + BM25 hybrid
│   │   └── smart.py            # max(BM25, vector)
│   └── README.md               # Detailed service docs
├── api-server/                  # Express.js REST gateway (Test Bed)
│   ├── Dockerfile
│   ├── server.js               # Main API server
│   ├── package.json
│   ├── rerankers/              # Local algorithmic rerankers
│   │   ├── bm25.js             # BM25 keyword ranking
│   │   ├── smart.js            # max(BM25, vector) strategy
│   │   └── hybrid.js           # Weighted vector + BM25
│   ├── public/                 # Web UI
│   │   └── index.html          # Search interface with collection management
│   └── lib/                    # Document processing
│       ├── processor.js        # Main orchestrator
│       ├── parsers/            # Format parsers
│       ├── chunkers/           # Chunking strategies
│       └── plugins/            # TMX, XLIFF plugins
└── sdk/                         # Node.js client SDK
    ├── package.json
    ├── README.md               # SDK documentation
    └── src/
        ├── index.js            # Module exports
        ├── embedding-client.js # gRPC client for embeddings
        ├── qdrant-manager.js   # Qdrant collection/search operations
        ├── sparse-encoder.js   # Text → sparse vector conversion
        ├── gemma-client.js     # Unified client (embeddings + Qdrant)
        └── chunkers/           # Text chunking strategies
            ├── base.js         # BaseChunker & ChunkerRegistry
            ├── fixed.js        # Fixed-size chunking
            ├── sentence.js     # Sentence-based chunking
            └── paragraph.js    # Paragraph-based chunking

Configuration

Environment Variables

  • HF_TOKEN (required) - Hugging Face access token
  • RERANKER_MODEL (default: bge-gemma) - Reranker preset or model ID
  • QDRANT_URL (default: http://localhost:6333) - Qdrant connection URL
  • COLLECTION_NAME (default: documents) - Qdrant collection name
  • PORT (default: 3000) - API server port
  • GRPC_PORT (default: 50051) - Embedding service gRPC port
  • HTTP_ENABLED (default: false) - Enable HTTP REST on embedding service
  • HTTP_PORT (default: 8080) - Embedding service HTTP port
  • PRODUCTION_MODE (default: false) - Suppress verbose logging (hides query text, instance IDs)

Docker Compose Override

Create docker-compose.override.yml for local customization:

services:
  embedding-service:
    environment:
      - HTTP_ENABLED=true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Performance

Typical latencies (RTX 3080):

  • Single embedding - ~50ms
  • Batch embedding (10 texts) - ~150ms
  • Rerank (20 documents) - ~80ms
  • Full search pipeline - ~200ms

Tips for optimization:

  • Use batch endpoints for multiple texts
  • Enable GPU acceleration
  • Use gRPC for lowest latency
  • Pre-filter with metadata before reranking

Troubleshooting

"401 Unauthorized" during build

Verify your Hugging Face token and that you've accepted the model license:

curl -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami

"Service not ready" from API server

The embedding service needs time to load models. Wait for the log message:

[Startup] Embedding service is ready!
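
If you are scripting against the REST gateway, you can also poll /health until it reports ready. A minimal sketch assuming Node 18+ (global fetch); the SDK's waitForReady() covers the same need over gRPC:

// Poll the API server until the embedding service has finished loading models.
// Sketch only - requires Node 18+ for global fetch; tune attempts/delay as needed.
async function waitForHealthy(url = 'http://localhost:3000/health', attempts = 60) {
  for (let i = 0; i < attempts; i++) {
    try {
      const body = await (await fetch(url)).json();
      if (body.status === 'ready') return body;              // matches the /health response above
    } catch { /* gateway not up yet - keep waiting */ }
    await new Promise(resolve => setTimeout(resolve, 5000)); // wait 5 seconds between checks
  }
  throw new Error('Embedding service did not become ready in time');
}

waitForHealthy().then(() => console.log('Ready to index and search'));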

Slow performance

  • Ensure GPU is being used: check for "device": "cuda" in /health
  • Use batch endpoints instead of single requests
  • Consider the speed reranker for lower latency

License

This framework uses:


Contributing

Contributions are welcome! Please open an issue or pull request.

Support

For issues and feature requests, please open an issue in the repository.
