Gemma Embedding Framework

A complete, production-ready framework for semantic search and RAG (Retrieval-Augmented Generation) applications. Built with Google's EmbeddingGemma model, hybrid search with dense + sparse vectors, and configurable cross-encoder rerankers.

Overview

This framework provides everything you need to build semantic search applications:

  • Embedding Service - The core application: a high-performance Python service with gRPC, HTTP, and HTTPS support
  • API Server - Express.js REST gateway with document processing and Qdrant integration
  • Node.js SDK - Complete client library with Qdrant integration and hybrid search
  • Document Processing - Built-in parsers for PDF, Office, HTML, TMX, and XLIFF formats
  • Hybrid Search - Dense (semantic) + sparse (keyword/IDF) vectors with RRF fusion

The core application is the Embedding Service. The API Server and SDK, including document processing, are provided for testing and education, so you can evaluate how indexing and search work across different strategies.

┌─────────────────────────────────────────────────────────────────┐
│                          Your App                               │
│                                                                 │
│  import { GemmaClient } from '@neurelectra/gemma-embed-sdk';    │
│  const client = new GemmaClient({ embeddingHost, qdrantUrl });  │
│  await client.search('docs', 'query', { mode: 'hybrid' });      │
└──────────────────────────────┬──────────────────────────────────┘
                               │
        ┌──────────────────────┼──────────────────────┐
        │ gRPC                 │ REST                 │
        ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│Embedding Service│    │   API Server    │    │     Qdrant      │
│ - EmbeddingGemma│    │  (Test Bed)     │    │  Vector Store   │
│ - BGE Reranker  │    │    :3000        │    │  Dense + Sparse │
│   :50051        │    └─────────────────┘    │    :6333        │
└─────────────────┘                           └─────────────────┘

Features

  • 768-dim Embeddings - EmbeddingGemma-300M produces high-quality semantic vectors
  • Hybrid Search - Dense (semantic) + sparse (keyword) vectors with corpus-wide IDF
  • Search Modes - dense, sparse, hybrid, or auto detection
  • Two-Stage Retrieval - Vector search + cross-encoder reranking for best accuracy
  • Multi-Protocol - gRPC for speed, HTTP/HTTPS REST for convenience
  • Document Processing - PDF, DOCX, XLSX, HTML, TMX, XLIFF support with chunking
  • Bilingual Support - TMX/XLIFF parsers for translation memory search
  • GPU Acceleration - Automatic CUDA detection for 10-20x speedup
  • Configurable Rerankers - Speed, quality, balanced, or multilingual presets
  • Let's Encrypt - Built-in SSL certificate provisioning

Quick Start

Prerequisites

Step 1: Hugging Face Setup

The EmbeddingGemma model requires accepting Google's license:

  1. Go to huggingface.co/google/embeddinggemma-300m
  2. Click "Agree and access repository" to accept the terms
  3. Generate an access token at huggingface.co/settings/tokens
  4. Set the token as an environment variable:
export HF_TOKEN=hf_your_token_here

Step 2: Start Qdrant (Optional)

If you want to use the full RAG pipeline with document storage:

docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant

Create a collection for your documents. For best search quality, use hybrid collections with both dense (semantic) and sparse (keyword/IDF) vectors:

# Hybrid collection (recommended) - dense + sparse with IDF
curl -X PUT 'http://localhost:6333/collections/documents' \
  -H 'Content-Type: application/json' \
  -d '{
    "vectors": {
      "dense": {"size": 768, "distance": "Cosine"}
    },
    "sparse_vectors": {
      "sparse": {"modifier": "idf"}
    }
  }'

# Or dense-only collection (simpler, semantic search only)
curl -X PUT 'http://localhost:6333/collections/documents_dense' \
  -H 'Content-Type: application/json' \
  -d '{"vectors": {"size": 768, "distance": "Cosine"}}'

Why Hybrid? Hybrid collections combine the best of both approaches:

  • Dense vectors capture semantic meaning ("car" matches "automobile")
  • Sparse vectors with IDF capture keyword importance across the entire corpus
  • RRF fusion combines both for superior search quality

Step 3: Build and Start

# Clone the repository
git clone https://github.com/your-org/gemma-embedding-framework.git
cd gemma-embedding-framework

# Build with default reranker (bge-gemma - quality)
docker-compose build

# Start all services
docker-compose up

The services will be available at:

  • API Server (REST gateway + web UI) - http://localhost:3000
  • Embedding Service (gRPC) - localhost:50051
  • Qdrant - http://localhost:6333

Web Interface

The API server includes a web UI for testing at http://localhost:3000:

[Screenshot: Frontend Overview]

Features:

  • Collection Management - Create and select vector collections
  • Document Indexing - Upload files or paste text with configurable chunking
  • Semantic Search - Search with multiple modes and reranking strategies

See the Frontend User Guide for detailed usage instructions.


Using the API Server for Testing

The API server provides a convenient REST interface for testing the embedding pipeline.

Health Check

curl http://localhost:3000/health

Response:

{
  "status": "ready",
  "embedding_service": true,
  "service_info": {
    "embeddingModel": "google/embeddinggemma-300m",
    "rerankerModel": "BAAI/bge-reranker-v2-gemma",
    "device": "cuda"
  },
  "capabilities": {
    "extensions": [".txt", ".pdf", ".docx", ".html", ".tmx", ".xliff"],
    "chunkingStrategies": ["fixed", "sentence", "paragraph", "segment"]
  }
}

Upload a Document

# Upload a text file
curl -X POST http://localhost:3000/upload \
  -F "file=@document.txt"

# Upload a PDF with custom chunking
curl -X POST http://localhost:3000/upload \
  -F "file=@report.pdf" \
  -F 'options={"chunking": {"strategy": "paragraph"}}'

# Upload raw text
curl -X POST http://localhost:3000/upload \
  -H "Content-Type: application/json" \
  -d '{"text": "Your content here..."}'

Response:

{
  "status": "Indexed!",
  "document_id": "doc_1704672000000",
  "chunks_created": 5,
  "total_tokens": 1234,
  "latency": {
    "parse_ms": 45.2,
    "chunk_ms": 2.1,
    "embed_ms": 850.5,
    "qdrant_ms": 120.3,
    "total_ms": 1018.1
  }
}

Semantic Search

# Hybrid search (recommended for hybrid collections)
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is machine learning?",
    "limit": 5,
    "searchMode": "hybrid"
  }'

# Auto-detect search mode based on collection type (default)
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is machine learning?",
    "limit": 5,
    "searchMode": "auto"
  }'

# Dense-only search (semantic similarity)
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is machine learning?",
    "searchMode": "dense"
  }'

# Sparse-only search (keyword/IDF matching)
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is machine learning?",
    "searchMode": "sparse"
  }'

# Without reranking (faster)
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is machine learning?",
    "limit": 5,
    "skipRerank": true
  }'

Search Modes:

  • auto - Detects collection type and uses the appropriate mode. Default, recommended.
  • hybrid - Dense + sparse with RRF fusion. Best quality on hybrid collections.
  • dense - Semantic similarity only. Best for conceptual queries and synonyms.
  • sparse - Keyword/IDF matching only. Best for exact keyword queries.

Response:

{
  "results": [
    {
      "text": "Machine learning is a subset of artificial intelligence...",
      "original_score": 0.85,
      "rerank_score": 0.92,
      "metadata": {
        "document_id": "doc_123",
        "filename": "ml-intro.pdf",
        "chunk_index": 3
      }
    }
  ],
  "searchMode": "hybrid",
  "reranked": true,
  "latency": {
    "embed_ms": 45.2,
    "qdrant_ms": 12.5,
    "rerank_ms": 85.3,
    "total_ms": 143.0
  }
}

Direct Embedding

# Single embedding
curl -X POST http://localhost:3000/embed \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello, world!",
    "taskType": "query"
  }'

# Batch embeddings
curl -X POST http://localhost:3000/embed/batch \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["First document", "Second document"],
    "taskType": "document"
  }'

Reranking

curl -X POST http://localhost:3000/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "query": "capital of France",
    "documents": [
      "Paris is a beautiful city.",
      "Berlin is the capital of Germany.",
      "Paris is the capital of France."
    ]
  }'

SDK Usage

For programmatic integration, use the Node.js SDK:

npm install @neurelectra/gemma-embed-sdk
# or link locally during development
cd sdk && npm link
cd ../your-project && npm link @neurelectra/gemma-embed-sdk

Full RAG Pipeline with Hybrid Search (Recommended)

import { GemmaClient } from '@neurelectra/gemma-embed-sdk';

// Create unified client
const client = new GemmaClient({
  embeddingHost: 'localhost:50051',
  qdrantUrl: 'http://localhost:6333'
});

await client.waitForReady();

// Create hybrid collection with IDF weighting
await client.qdrant.createCollection('my_docs', {
  vectorType: 'hybrid',     // dense + sparse vectors
  dimensions: 768,
  distance: 'Cosine',
  sparse: { useIdf: true }  // corpus-wide IDF weighting
});

// Index documents with chunking strategy
await client.index('my_docs', [
  { text: 'Machine learning is a subset of AI...', metadata: { source: 'ml.pdf' } },
  { text: 'Neural networks use layers of neurons...', metadata: { source: 'nn.pdf' } }
], {
  strategy: 'sentence',  // 'fixed' | 'sentence' | 'paragraph'
  chunkSize: 1000,
  chunkOverlap: 200
});

// Hybrid search (combines dense semantic + sparse keyword with IDF)
const results = await client.search('my_docs', 'What is machine learning?', {
  mode: 'hybrid',   // 'dense' | 'sparse' | 'hybrid'
  limit: 5,
  rerank: true,
  rerankStrategy: 'smart'
});

console.log(results);
client.close();

Embedding Client Only

import { EmbeddingClient } from '@neurelectra/gemma-embed-sdk';

// Connect to embedding service
const client = new EmbeddingClient('localhost:50051');
await client.waitForReady();

// Generate embeddings
const { vector, metadata } = await client.embed('What is AI?', 'query');
console.log(`Dimensions: ${vector.length}`); // 768

// Batch processing
const batch = await client.embedBatch([
  'First document',
  'Second document'
], 'document');

// Reranking
const ranked = await client.rerank(
  'machine learning',
  ['ML is great', 'Weather today', 'Deep learning'],
  { topK: 2, strategy: 'smart' }
);

// Cleanup
client.close();

Direct Qdrant Access

import { QdrantManager, SparseEncoder } from '@neurelectra/gemma-embed-sdk';

const qdrant = new QdrantManager('http://localhost:6333');
const encoder = new SparseEncoder({ minTokenLength: 3 });

// List collections
const collections = await qdrant.listCollections();

// Count points
const count = await qdrant.count('my_docs');

// Export/import vocabulary for sparse encoder
const vocab = encoder.exportVocab();
encoder.importVocab(vocab);

See sdk/README.md for full SDK documentation.


Document Processing

The API server supports multiple document formats with intelligent chunking.

Supported Formats

  • Plain Text (.txt, .md, .csv) - TextParser
  • PDF (.pdf) - PdfParser
  • Microsoft Office (.docx, .xlsx, .pptx) - OfficeParser
  • HTML (.html, .htm) - HtmlParser
  • TMX (.tmx) - TmxParser (bilingual)
  • XLIFF (.xlf, .xliff) - XliffParser (bilingual)

Chunking Strategies

  • fixed - Fixed character size with intelligent boundary breaking. Best for general documents and predictable chunk sizes.
  • sentence - Groups sentences up to the target size and handles abbreviations. Best for articles, essays, and well-formatted text.
  • paragraph - Preserves paragraph boundaries and sub-splits long paragraphs. Best for reports and structured documents.
  • segment - Uses the existing segments from bilingual files. Best for TMX and XLIFF translation files.

Chunking in SDK:

// Fixed chunking (default) - character-based with smart boundaries
await client.index('docs', documents, {
  strategy: 'fixed',
  chunkSize: 500,
  chunkOverlap: 50
});

// Sentence chunking - groups sentences, handles abbreviations (Mr., Dr., etc.)
await client.index('docs', documents, {
  strategy: 'sentence',
  chunkSize: 1000,
  chunkOverlap: 200
});

// Paragraph chunking - preserves document structure
await client.index('docs', documents, {
  strategy: 'paragraph',
  chunkSize: 1500
});

Chunking in API:

# Upload with sentence chunking
curl -X POST http://localhost:3000/upload \
  -F "file=@document.pdf" \
  -F 'options={"chunking": {"strategy": "sentence", "chunkSize": 1000, "overlap": 200}}'

Upload Options

{
  "chunking": {
    "strategy": "fixed",
    "chunkSize": 1000,
    "overlap": 200
  },
  "parser": {
    "preserveFormatting": false
  },
  "bilingual": {
    "skipTarget": false,
    "sourceOnly": false
  }
}
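
For example, the sketch below uploads a TMX file with segment chunking and indexes only the source-language segments (assuming that is what sourceOnly controls). It uses the same file and options multipart fields as the curl examples above and requires Node 18+ for global fetch, FormData, and Blob; it is illustrative, not part of the SDK.

// Upload a TMX file, chunked by its existing segments, source side only.
// Sketch only - requires Node 18+ (global fetch, FormData, Blob).
import { readFile } from 'node:fs/promises';

async function uploadTmx(path) {
  const form = new FormData();
  form.append('file', new Blob([await readFile(path)]), 'memory.tmx');
  form.append('options', JSON.stringify({
    chunking: { strategy: 'segment' },   // reuse the segments already present in the TMX
    bilingual: { sourceOnly: true }      // assumed to skip target-language segments
  }));
  const res = await fetch('http://localhost:3000/upload', { method: 'POST', body: form });
  return res.json();                     // { status, document_id, chunks_created, ... }
}

uploadTmx('./memory.tmx').then(console.log);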

Reranker Selection

Choose the reranker that fits your use case:

  • speed - ms-marco-MiniLM (~90MB, ~20ms, English). Best for low-latency English apps.
  • balanced - mxbai-rerank-base (~400MB, ~50ms, English). Best for a speed/quality trade-off.
  • quality - bge-reranker-v2-gemma (~2GB, ~80ms, English). Best English accuracy.
  • multilingual - bge-reranker-v2-m3 (~568MB, ~60ms, 100+ languages). Best for non-English content.

Important: For non-English content (Portuguese, Spanish, Chinese, etc.), use multilingual. The other rerankers perform poorly on non-English text.

# Build with speed reranker (English, fastest)
docker-compose build --build-arg RERANKER_MODEL=speed

# Build with quality reranker (English, default)
docker-compose build --build-arg RERANKER_MODEL=quality

# Build with multilingual reranker (100+ languages)
docker-compose build --build-arg RERANKER_MODEL=multilingual

Algorithmic Reranking Strategies

In addition to model-based reranking, the service supports algorithmic reranking strategies that run directly in the API server (~1-3ms), with no neural-network inference and no call to the embedding service.

  • smart (~1ms) - max(BM25, vector) per document. Recommended default.
  • bm25 (~2ms) - BM25 keyword matching. Best for keyword queries in any language.
  • hybrid (~1ms) - Vector + BM25 weighted average. Balances semantic and keyword relevance.
  • model (~80ms) - Cross-encoder neural reranking. Best accuracy (GPU recommended).
  • mmr (~60ms) - Maximal Marginal Relevance. Reduces redundant results.

Smart Strategy (Recommended):

The smart strategy takes the best of BM25 and vector similarity for each document:

  • When keywords match: Uses BM25 score (100%)
  • When no keyword match: Falls back to vector score (semantic similarity)

This solves the problem where BM25 returns 0% for semantically similar content without exact keyword matches.
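
A rough sketch of that rule, assuming both scores are already normalized to the 0-1 range (bm25_score is a hypothetical field name; the actual implementation lives in api-server/rerankers/smart.js):

// Illustrative only - assumes BM25 and vector scores are normalized to [0, 1].
// The real logic lives in api-server/rerankers/smart.js.
function smartScore(bm25Score, vectorScore) {
  // Keyword hit: trust BM25. No keyword overlap (BM25 = 0): fall back to the
  // vector score so semantically similar chunks are not dropped.
  return Math.max(bm25Score, vectorScore);
}

const results = [
  { text: 'ML is a subset of AI',      original_score: 0.82, bm25_score: 0.0 },
  { text: 'intro to machine learning', original_score: 0.74, bm25_score: 0.9 }
];

const reranked = results
  .map(r => ({ ...r, rerank_score: smartScore(r.bm25_score, r.original_score) }))
  .sort((a, b) => b.rerank_score - a.rerank_score);

console.log(reranked); // the keyword match ranks first; the pure-semantic hit keeps its vector score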

# Smart - Best of BM25 and vector (recommended)
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning",
    "rerankStrategy": "smart"
  }'

# BM25 - Fast, keyword-based (great for non-English content!)
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "machine learning",
    "rerankStrategy": "bm25"
  }'

# Hybrid - Weighted average of vector and BM25
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "database optimization",
    "rerankStrategy": "hybrid"
  }'

# MMR - For diverse results (reduce redundancy)
curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "python tutorials",
    "rerankStrategy": "mmr"
  }'

Tip: For CPU-only deployments, use smart, bm25, or hybrid for fast reranking (~1-3ms). Reserve model and mmr for GPU deployments where they can run efficiently.

Skip Reranking

For fastest searches (vector similarity only):

curl -X POST http://localhost:3000/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Sua pergunta aqui",
    "skipRerank": true
  }'

Hybrid Search (Dense + Sparse)

Hybrid search combines two complementary approaches for superior search quality:

How It Works

  • Dense - Captures semantic meaning ("car" matches "automobile", "vehicle")
  • Sparse (IDF) - Captures keyword importance ("Maduro" matches exact name mentions)

Problem with Dense-Only: Semantic search may miss exact keyword matches that are critical for names, technical terms, or specific phrases.

Problem with Sparse-Only (BM25): Keyword matching misses semantically similar content when words differ ("injured" vs "wounded").

Solution - Hybrid: Combine both with RRF (Reciprocal Rank Fusion) to get the best of both worlds.
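
RRF works on ranks rather than raw scores, so the dense and sparse scales never need to be reconciled: each document is scored by 1/(k + rank) in every result list and the contributions are summed. A minimal illustrative sketch (not the framework's exact implementation; k = 60 is the conventional constant):

// Reciprocal Rank Fusion: score(doc) = sum over result lists of 1 / (k + rank).
// Illustrative sketch only - not the framework's exact implementation.
function rrfFuse(denseResults, sparseResults, k = 60) {
  const scores = new Map();
  for (const list of [denseResults, sparseResults]) {
    list.forEach((doc, i) => {
      const rank = i + 1;                                   // 1-based rank within this list
      scores.set(doc.id, (scores.get(doc.id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])                            // highest fused score first
    .map(([id, score]) => ({ id, score }));
}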

Collection Types

When creating collections, choose the appropriate type:

// Hybrid collection (recommended) - best search quality
await client.qdrant.createCollection('my_docs', {
  vectorType: 'hybrid',
  sparse: { useIdf: true }  // Corpus-wide IDF weighting
});

// Dense-only collection - simpler, semantic search only
await client.qdrant.createCollection('simple_docs', {
  vectorType: 'dense'
});

Search Mode Selection

  • auto - Detects collection type. Default, recommended.
  • hybrid - Dense + sparse with RRF fusion. Best quality on hybrid collections.
  • dense - Dense vectors only. Best for conceptual/synonym queries.
  • sparse - Sparse vectors only. Best for exact keyword matching.

Why IDF Matters

Without IDF: Common words like "the", "is", "a" have equal weight to important terms.

With IDF: Terms that appear in fewer documents (like proper nouns, technical terms) get higher weight, improving keyword matching quality.

The SDK's SparseEncoder generates TF (term frequency) vectors, and Qdrant applies IDF weighting at search time using corpus-wide statistics.
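
As a toy illustration of that split (a naive tokenizer and a BM25-style IDF formula are shown for intuition only; the real SparseEncoder manages vocabulary indices, and Qdrant computes IDF from its own corpus statistics):

// Toy term-frequency side - the SDK's SparseEncoder does the real work.
function toyTermFrequencies(text) {
  const tf = new Map();
  for (const token of text.toLowerCase().split(/\W+/).filter(t => t.length >= 3)) {
    tf.set(token, (tf.get(token) ?? 0) + 1);   // mirrors minTokenLength: 3 from the SDK example
  }
  return tf; // "hybrid search beats dense search" -> { hybrid: 1, search: 2, beats: 1, dense: 1 }
}

// BM25-style IDF, for intuition only; Qdrant applies its own IDF at search time.
// N = documents in the collection, df = documents containing the term.
const idf = (N, df) => Math.log(1 + (N - df + 0.5) / (df + 0.5));

console.log(idf(10000, 3));    // rare term (e.g. a proper noun) -> large weight (~8.0)
console.log(idf(10000, 9500)); // very common term -> weight near zero (~0.05)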


Project Structure

gemma-embedding-framework/
├── README.md                    # This file
├── docker-compose.yml           # Service orchestration
├── vector.proto                 # gRPC service contract
├── docs/                        # Documentation
│   ├── FRONTEND_GUIDE.md        # Web UI user guide
│   └── images/                  # Screenshots for documentation
├── embedding-service-python/    # Python embedding service
│   ├── Dockerfile
│   ├── server.py               # gRPC + HTTP server
│   ├── download_models.py      # Model pre-download script
│   ├── rerankers/              # Algorithmic rerankers (Python)
│   │   ├── bm25.py             # BM25 implementation
│   │   ├── mmr.py              # MMR diversity ranking
│   │   ├── hybrid.py           # Vector + BM25 hybrid
│   │   └── smart.py            # max(BM25, vector)
│   └── README.md               # Detailed service docs
├── api-server/                  # Express.js REST gateway (Test Bed)
│   ├── Dockerfile
│   ├── server.js               # Main API server
│   ├── package.json
│   ├── rerankers/              # Local algorithmic rerankers
│   │   ├── bm25.js             # BM25 keyword ranking
│   │   ├── smart.js            # max(BM25, vector) strategy
│   │   └── hybrid.js           # Weighted vector + BM25
│   ├── public/                 # Web UI
│   │   └── index.html          # Search interface with collection management
│   └── lib/                    # Document processing
│       ├── processor.js        # Main orchestrator
│       ├── parsers/            # Format parsers
│       ├── chunkers/           # Chunking strategies
│       └── plugins/            # TMX, XLIFF plugins
└── sdk/                         # Node.js client SDK
    ├── package.json
    ├── README.md               # SDK documentation
    └── src/
        ├── index.js            # Module exports
        ├── embedding-client.js # gRPC client for embeddings
        ├── qdrant-manager.js   # Qdrant collection/search operations
        ├── sparse-encoder.js   # Text → sparse vector conversion
        ├── gemma-client.js     # Unified client (embeddings + Qdrant)
        └── chunkers/           # Text chunking strategies
            ├── base.js         # BaseChunker & ChunkerRegistry
            ├── fixed.js        # Fixed-size chunking
            ├── sentence.js     # Sentence-based chunking
            └── paragraph.js    # Paragraph-based chunking

Configuration

Environment Variables

  • HF_TOKEN (required) - Hugging Face access token
  • RERANKER_MODEL (default: bge-gemma) - Reranker preset or model ID
  • QDRANT_URL (default: http://localhost:6333) - Qdrant connection URL
  • COLLECTION_NAME (default: documents) - Qdrant collection name
  • PORT (default: 3000) - API server port
  • GRPC_PORT (default: 50051) - Embedding service gRPC port
  • HTTP_ENABLED (default: false) - Enable HTTP REST on embedding service
  • HTTP_PORT (default: 8080) - Embedding service HTTP port
  • PRODUCTION_MODE (default: false) - Suppress verbose logging (hides query text, instance IDs)

Docker Compose Override

Create docker-compose.override.yml for local customization:

services:
  embedding-service:
    environment:
      - HTTP_ENABLED=true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Performance

Typical latencies (RTX 3080):

  • Single embedding - ~50ms
  • Batch embedding (10 texts) - ~150ms
  • Rerank (20 documents) - ~80ms
  • Full search pipeline - ~200ms

Tips for optimization:

  • Use batch endpoints for multiple texts
  • Enable GPU acceleration
  • Use gRPC for lowest latency
  • Pre-filter with metadata before reranking

Troubleshooting

"401 Unauthorized" during build

Verify your Hugging Face token and that you've accepted the model license:

curl -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami

"Service not ready" from API server

The embedding service needs time to load models. Wait for the log message:

[Startup] Embedding service is ready!
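
If you are scripting against the REST gateway, you can also poll /health until it reports ready. A minimal sketch assuming Node 18+ (global fetch); the SDK's waitForReady() covers the same need over gRPC:

// Poll the API server until the embedding service has finished loading models.
// Sketch only - requires Node 18+ for global fetch; tune attempts/delay as needed.
async function waitForHealthy(url = 'http://localhost:3000/health', attempts = 60) {
  for (let i = 0; i < attempts; i++) {
    try {
      const body = await (await fetch(url)).json();
      if (body.status === 'ready') return body;              // matches the /health response above
    } catch { /* gateway not up yet - keep waiting */ }
    await new Promise(resolve => setTimeout(resolve, 5000)); // wait 5 seconds between checks
  }
  throw new Error('Embedding service did not become ready in time');
}

waitForHealthy().then(() => console.log('Ready to index and search'));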

Slow performance

  • Ensure GPU is being used: check for "device": "cuda" in /health
  • Use batch endpoints instead of single requests
  • Consider the speed reranker for lower latency

License

This framework uses:


Contributing

Contributions are welcome! Please open an issue or pull request.

Support

For issues and feature requests, please open an issue in the repository.
