A complete, production-ready framework for semantic search and RAG (Retrieval-Augmented Generation) applications. Built with Google's EmbeddingGemma model, hybrid search with dense + sparse vectors, and configurable cross-encoder rerankers.
This framework provides everything you need to build semantic search applications:
- Embedding Service - The core application: a high-performance Python service with gRPC, HTTP, and HTTPS support
- API Server - Express.js REST gateway with document processing and Qdrant integration
- Node.js SDK - Complete client library with Qdrant integration and hybrid search
- Document Processing - Built-in parsers for PDF, Office, HTML, TMX, and XLIFF formats
- Hybrid Search - Dense (semantic) + sparse (keyword/IDF) vectors with RRF fusion
The core application is the Embedding Service. The API Server and SDK, including document processing, exist to support testing and education, so you can evaluate how indexing and search work across different strategies.
┌──────────────────────────────────────────────────────────────────┐
│                             Your App                              │
│                                                                    │
│  import { GemmaClient } from '@neurelectra/gemma-embed-sdk';      │
│  const client = new GemmaClient({ embeddingHost, qdrantUrl });    │
│  await client.search('docs', 'query', { mode: 'hybrid' });        │
└──────────────────────────────┬─────────────────────────────────────┘
                               │
         ┌─────────────────────┼─────────────────────┐
         │ gRPC                │                      │ REST
         ▼                     ▼                      ▼
┌─────────────────┐   ┌─────────────────┐    ┌─────────────────┐
│Embedding Service│   │   API Server    │    │     Qdrant      │
│ - EmbeddingGemma│   │   (Test Bed)    │    │  Vector Store   │
│ - BGE Reranker  │   │      :3000      │    │ Dense + Sparse  │
│     :50051      │   └─────────────────┘    │      :6333      │
└─────────────────┘                          └─────────────────┘
| Feature | Description |
|---|---|
| 768-dim Embeddings | EmbeddingGemma-300M produces high-quality semantic vectors |
| Hybrid Search | Dense (semantic) + sparse (keyword) vectors with corpus-wide IDF |
| Search Modes | dense, sparse, hybrid, or auto detection |
| Two-Stage Retrieval | Vector search + cross-encoder reranking for best accuracy |
| Multi-Protocol | gRPC for speed, HTTP/HTTPS REST for convenience |
| Document Processing | PDF, DOCX, XLSX, HTML, TMX, XLIFF support with chunking |
| Bilingual Support | TMX/XLIFF parsers for translation memory search |
| GPU Acceleration | Automatic CUDA detection for 10-20x speedup |
| Configurable Rerankers | Speed, quality, balanced, or multilingual presets |
| Let's Encrypt | Built-in SSL certificate provisioning |
- Docker and Docker Compose
- Hugging Face account with access token
- (Optional) Qdrant for vector storage
The EmbeddingGemma model requires accepting Google's license:
- Go to huggingface.co/google/embeddinggemma-300m
- Click "Agree and access repository" to accept the terms
- Generate an access token at huggingface.co/settings/tokens
- Set the token as an environment variable:
export HF_TOKEN=hf_your_token_here

If you want to use the full RAG pipeline with document storage:
docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant

Create a collection for your documents. For best search quality, use hybrid collections with both dense (semantic) and sparse (keyword/IDF) vectors:
# Hybrid collection (recommended) - dense + sparse with IDF
curl -X PUT 'http://localhost:6333/collections/documents' \
-H 'Content-Type: application/json' \
-d '{
"vectors": {
"dense": {"size": 768, "distance": "Cosine"}
},
"sparse_vectors": {
"sparse": {"modifier": "idf"}
}
}'
# Or dense-only collection (simpler, semantic search only)
curl -X PUT 'http://localhost:6333/collections/documents_dense' \
-H 'Content-Type: application/json' \
-d '{"vectors": {"size": 768, "distance": "Cosine"}}'Why Hybrid? Hybrid collections combine the best of both approaches:
- Dense vectors capture semantic meaning ("car" matches "automobile")
- Sparse vectors with IDF capture keyword importance across the entire corpus
- RRF fusion combines both for superior search quality
# Clone the repository
git clone https://github.com/your-org/gemma-embedding-framework.git
cd gemma-embedding-framework
# Build with default reranker (bge-gemma - quality)
docker-compose build
# Start all services
docker-compose up

The services will be available at:
- API Server: http://localhost:3000
- Embedding Service (gRPC): localhost:50051
- Embedding Service (HTTP): http://localhost:8080 (when enabled)
The API server includes a web UI for testing at http://localhost:3000:
Features:
- Collection Management - Create and select vector collections
- Document Indexing - Upload files or paste text with configurable chunking
- Semantic Search - Search with multiple modes and reranking strategies
See the Frontend User Guide for detailed usage instructions.
The API server provides a convenient REST interface for testing the embedding pipeline.
curl http://localhost:3000/health

Response:
{
"status": "ready",
"embedding_service": true,
"service_info": {
"embeddingModel": "google/embeddinggemma-300m",
"rerankerModel": "BAAI/bge-reranker-v2-gemma",
"device": "cuda"
},
"capabilities": {
"extensions": [".txt", ".pdf", ".docx", ".html", ".tmx", ".xliff"],
"chunkingStrategies": ["fixed", "sentence", "paragraph", "segment"]
}
}

# Upload a text file
curl -X POST http://localhost:3000/upload \
-F "file=@document.txt"
# Upload a PDF with custom chunking
curl -X POST http://localhost:3000/upload \
-F "file=@report.pdf" \
-F 'options={"chunking": {"strategy": "paragraph"}}'
# Upload raw text
curl -X POST http://localhost:3000/upload \
-H "Content-Type: application/json" \
-d '{"text": "Your content here..."}'Response:
{
"status": "Indexed!",
"document_id": "doc_1704672000000",
"chunks_created": 5,
"total_tokens": 1234,
"latency": {
"parse_ms": 45.2,
"chunk_ms": 2.1,
"embed_ms": 850.5,
"qdrant_ms": 120.3,
"total_ms": 1018.1
}
}

# Hybrid search (recommended for hybrid collections)
curl -X POST http://localhost:3000/search \
-H "Content-Type: application/json" \
-d '{
"query": "What is machine learning?",
"limit": 5,
"searchMode": "hybrid"
}'
# Auto-detect search mode based on collection type (default)
curl -X POST http://localhost:3000/search \
-H "Content-Type: application/json" \
-d '{
"query": "What is machine learning?",
"limit": 5,
"searchMode": "auto"
}'
# Dense-only search (semantic similarity)
curl -X POST http://localhost:3000/search \
-H "Content-Type: application/json" \
-d '{
"query": "What is machine learning?",
"searchMode": "dense"
}'
# Sparse-only search (keyword/IDF matching)
curl -X POST http://localhost:3000/search \
-H "Content-Type: application/json" \
-d '{
"query": "What is machine learning?",
"searchMode": "sparse"
}'
# Without reranking (faster)
curl -X POST http://localhost:3000/search \
-H "Content-Type: application/json" \
-d '{
"query": "What is machine learning?",
"limit": 5,
"skipRerank": true
  }'

Search Modes:
| Mode | Description | Best For |
|---|---|---|
| `auto` | Detects collection type and uses appropriate mode | Default, recommended |
| `hybrid` | Dense + sparse with RRF fusion | Best quality on hybrid collections |
| `dense` | Semantic similarity only | Conceptual queries, synonyms |
| `sparse` | Keyword/IDF matching only | Exact keyword queries |
Response:
{
"results": [
{
"text": "Machine learning is a subset of artificial intelligence...",
"original_score": 0.85,
"rerank_score": 0.92,
"metadata": {
"document_id": "doc_123",
"filename": "ml-intro.pdf",
"chunk_index": 3
}
}
],
"searchMode": "hybrid",
"reranked": true,
"latency": {
"embed_ms": 45.2,
"qdrant_ms": 12.5,
"rerank_ms": 85.3,
"total_ms": 143.0
}
}

# Single embedding
curl -X POST http://localhost:3000/embed \
-H "Content-Type: application/json" \
-d '{
"text": "Hello, world!",
"taskType": "query"
}'
# Batch embeddings
curl -X POST http://localhost:3000/embed/batch \
-H "Content-Type: application/json" \
-d '{
"texts": ["First document", "Second document"],
"taskType": "document"
  }'

curl -X POST http://localhost:3000/rerank \
-H "Content-Type: application/json" \
-d '{
"query": "capital of France",
"documents": [
"Paris is a beautiful city.",
"Berlin is the capital of Germany.",
"Paris is the capital of France."
]
  }'

For programmatic integration, use the Node.js SDK:
npm install @neurelectra/gemma-embed-sdk
# or link locally during development
cd sdk && npm link
cd ../your-project && npm link @neurelectra/gemma-embed-sdk

import { GemmaClient } from '@neurelectra/gemma-embed-sdk';
// Create unified client
const client = new GemmaClient({
embeddingHost: 'localhost:50051',
qdrantUrl: 'http://localhost:6333'
});
await client.waitForReady();
// Create hybrid collection with IDF weighting
await client.qdrant.createCollection('my_docs', {
vectorType: 'hybrid', // dense + sparse vectors
dimensions: 768,
distance: 'Cosine',
sparse: { useIdf: true } // corpus-wide IDF weighting
});
// Index documents with chunking strategy
await client.index('my_docs', [
{ text: 'Machine learning is a subset of AI...', metadata: { source: 'ml.pdf' } },
{ text: 'Neural networks use layers of neurons...', metadata: { source: 'nn.pdf' } }
], {
strategy: 'sentence', // 'fixed' | 'sentence' | 'paragraph'
chunkSize: 1000,
chunkOverlap: 200
});
// Hybrid search (combines dense semantic + sparse keyword with IDF)
const results = await client.search('my_docs', 'What is machine learning?', {
mode: 'hybrid', // 'dense' | 'sparse' | 'hybrid'
limit: 5,
rerank: true,
rerankStrategy: 'smart'
});
console.log(results);
client.close();

import { EmbeddingClient } from '@neurelectra/gemma-embed-sdk';
// Connect to embedding service
const client = new EmbeddingClient('localhost:50051');
await client.waitForReady();
// Generate embeddings
const { vector, metadata } = await client.embed('What is AI?', 'query');
console.log(`Dimensions: ${vector.length}`); // 768
// Batch processing
const batch = await client.embedBatch([
'First document',
'Second document'
], 'document');
// Reranking
const ranked = await client.rerank(
'machine learning',
['ML is great', 'Weather today', 'Deep learning'],
{ topK: 2, strategy: 'smart' }
);
// Cleanup
client.close();

import { QdrantManager, SparseEncoder } from '@neurelectra/gemma-embed-sdk';
const qdrant = new QdrantManager('http://localhost:6333');
const encoder = new SparseEncoder({ minTokenLength: 3 });
// List collections
const collections = await qdrant.listCollections();
// Count points
const count = await qdrant.count('my_docs');
// Export/import vocabulary for sparse encoder
const vocab = encoder.exportVocab();
encoder.importVocab(vocab);

See sdk/README.md for full SDK documentation.
The API server supports multiple document formats with intelligent chunking.
| Format | Extensions | Parser |
|---|---|---|
| Plain Text | `.txt`, `.md`, `.csv` | TextParser |
| PDF | `.pdf` | PdfParser |
| Microsoft Office | `.docx`, `.xlsx`, `.pptx` | OfficeParser |
| HTML | `.html`, `.htm` | HtmlParser |
| TMX | `.tmx` | TmxParser (bilingual) |
| XLIFF | `.xlf`, `.xliff` | XliffParser (bilingual) |
| Strategy | Description | Best For |
|---|---|---|
| `fixed` | Fixed character size with intelligent boundary breaking | General documents, predictable chunk sizes |
| `sentence` | Groups sentences up to target size, handles abbreviations | Articles, essays, well-formatted text |
| `paragraph` | Preserves paragraph boundaries, sub-splits long paragraphs | Reports, structured documents |
| `segment` | Uses existing segments from bilingual files | TMX, XLIFF translation files |
Chunking in SDK:
// Fixed chunking (default) - character-based with smart boundaries
await client.index('docs', documents, {
strategy: 'fixed',
chunkSize: 500,
chunkOverlap: 50
});
// Sentence chunking - groups sentences, handles abbreviations (Mr., Dr., etc.)
await client.index('docs', documents, {
strategy: 'sentence',
chunkSize: 1000,
chunkOverlap: 200
});
// Paragraph chunking - preserves document structure
await client.index('docs', documents, {
strategy: 'paragraph',
chunkSize: 1500
});

Chunking in API:
# Upload with sentence chunking
curl -X POST http://localhost:3000/upload \
-F "file=@document.pdf" \
-F 'options={"chunking": {"strategy": "sentence", "chunkSize": 1000, "overlap": 200}}'{
"chunking": {
"strategy": "fixed",
"chunkSize": 1000,
"overlap": 200
},
"parser": {
"preserveFormatting": false
},
"bilingual": {
"skipTarget": false,
"sourceOnly": false
}
}

Choose the reranker that fits your use case:
| Preset | Model | Size | Latency | Languages | Best For |
|---|---|---|---|---|---|
| `speed` | ms-marco-MiniLM | ~90MB | ~20ms | English | Low-latency English apps |
| `balanced` | mxbai-rerank-base | ~400MB | ~50ms | English | Speed/quality trade-off |
| `quality` | bge-reranker-v2-gemma | ~2GB | ~80ms | English | Best English accuracy |
| `multilingual` | bge-reranker-v2-m3 | ~568MB | ~60ms | 100+ languages | Non-English content |
Important: For non-English content (Portuguese, Spanish, Chinese, etc.), use multilingual. The other rerankers perform poorly on non-English text.
# Build with speed reranker (English, fastest)
docker-compose build --build-arg RERANKER_MODEL=speed
# Build with quality reranker (English, default)
docker-compose build --build-arg RERANKER_MODEL=quality
# Build with multilingual reranker (100+ languages)
docker-compose build --build-arg RERANKER_MODEL=multilingual

In addition to model-based reranking, the service supports algorithmic reranking that runs locally without neural network inference. These strategies run directly in the API server (~1-3ms) without calling the embedding service.
| Strategy | Speed | Description | Best For |
|---|---|---|---|
| `smart` | ~1ms | max(BM25, vector) per document | Recommended default |
| `bm25` | ~2ms | BM25 keyword matching | Keyword queries, any language |
| `hybrid` | ~1ms | Vector + BM25 weighted average | Semantic + keyword balance |
| `model` | ~80ms | Cross-encoder neural reranking | Best accuracy (GPU recommended) |
| `mmr` | ~60ms | Maximal Marginal Relevance | Reducing redundant results |
Smart Strategy (Recommended):
The smart strategy takes the best of BM25 and vector similarity for each document:
- When keywords match: Uses BM25 score (100%)
- When no keyword match: Falls back to vector score (semantic similarity)
This solves the problem where BM25 returns 0% for semantically similar content without exact keyword matches.
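As a rough sketch, the core of the strategy looks like the snippet below (illustrative JavaScript, not the server's actual implementation; both score arrays are assumed to be normalized to the same 0-1 range):

```js
// Illustrative "smart" reranking: for each document, keep whichever signal is
// stronger — BM25 (keyword match) or vector similarity (semantic match) — then
// sort by the resulting score.
function smartRerank(results, bm25Scores, vectorScores) {
  return results
    .map((doc, i) => ({
      ...doc,
      rerank_score: Math.max(bm25Scores[i], vectorScores[i])
    }))
    .sort((a, b) => b.rerank_score - a.rerank_score);
}
```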
# Smart - Best of BM25 and vector (recommended)
curl -X POST http://localhost:3000/search \
-H "Content-Type: application/json" \
-d '{
"query": "machine learning",
"rerankStrategy": "smart"
}'
# BM25 - Fast, keyword-based (great for non-English content!)
curl -X POST http://localhost:3000/search \
-H "Content-Type: application/json" \
-d '{
"query": "machine learning",
"rerankStrategy": "bm25"
}'
# Hybrid - Weighted average of vector and BM25
curl -X POST http://localhost:3000/search \
-H "Content-Type: application/json" \
-d '{
"query": "database optimization",
"rerankStrategy": "hybrid"
}'
# MMR - For diverse results (reduce redundancy)
curl -X POST http://localhost:3000/search \
-H "Content-Type: application/json" \
-d '{
"query": "python tutorials",
"rerankStrategy": "mmr"
  }'

Tip: For CPU-only deployments, use smart, bm25, or hybrid for fast reranking (~1-3ms). Reserve model and mmr for GPU deployments where they can run efficiently.
For fastest searches (vector similarity only):
curl -X POST http://localhost:3000/search \
-H "Content-Type: application/json" \
-d '{
"query": "Sua pergunta aqui",
"skipRerank": true
}'Hybrid search combines two complementary approaches for superior search quality:
| Vector Type | What It Captures | Example |
|---|---|---|
| Dense | Semantic meaning | "car" matches "automobile", "vehicle" |
| Sparse (IDF) | Keyword importance | "Maduro" matches exact name mentions |
Problem with Dense-Only: Semantic search may miss exact keyword matches that are critical for names, technical terms, or specific phrases.
Problem with Sparse-Only (BM25): Keyword matching misses semantically similar content when words differ ("injured" vs "wounded").
Solution - Hybrid: Combine both with RRF (Reciprocal Rank Fusion) to get the best of both worlds.
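For reference, here is a minimal sketch of Reciprocal Rank Fusion, assuming the standard formula with the common k = 60 constant (the actual fusion happens inside Qdrant, not in this snippet):

```js
// Reciprocal Rank Fusion (illustrative): each document's fused score is the sum
// of 1 / (k + rank) over every ranked list it appears in.
function rrfFuse(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((docId, index) => {
      const rank = index + 1;
      scores.set(docId, (scores.get(docId) || 0) + 1 / (k + rank));
    });
  }
  // Return [docId, score] pairs, best first.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

// Example: fuse a dense (semantic) ranking with a sparse (keyword) ranking.
const fused = rrfFuse([
  ['doc3', 'doc1', 'doc7'],  // dense results, best first
  ['doc1', 'doc9', 'doc3']   // sparse results, best first
]);
```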
When creating collections, choose the appropriate type:
// Hybrid collection (recommended) - best search quality
await client.qdrant.createCollection('my_docs', {
vectorType: 'hybrid',
sparse: { useIdf: true } // Corpus-wide IDF weighting
});
// Dense-only collection - simpler, semantic search only
await client.qdrant.createCollection('simple_docs', {
vectorType: 'dense'
});

| Mode | Uses | Best For |
|---|---|---|
| `auto` | Detects collection type | Default - recommended |
| `hybrid` | Dense + Sparse with RRF | Best quality on hybrid collections |
| `dense` | Dense vectors only | Conceptual/synonym queries |
| `sparse` | Sparse vectors only | Exact keyword matching |
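A plausible sketch of what auto detection boils down to: inspect the collection's config in Qdrant and pick hybrid when sparse vectors are defined (the detection logic below is illustrative, not the API server's actual code):

```js
// Illustrative "auto" mode detection: query Qdrant's collection info endpoint
// and choose hybrid search only if the collection defines sparse vectors.
async function detectSearchMode(qdrantUrl, collection) {
  const res = await fetch(`${qdrantUrl}/collections/${collection}`);
  const { result } = await res.json();
  const hasSparse = Object.keys(result.config.params.sparse_vectors || {}).length > 0;
  return hasSparse ? 'hybrid' : 'dense';
}
```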
Without IDF: Common words like "the", "is", "a" have equal weight to important terms.
With IDF: Terms that appear in fewer documents (like proper nouns, technical terms) get higher weight, improving keyword matching quality.
The SDK's SparseEncoder generates TF (term frequency) vectors, and Qdrant applies IDF weighting at search time using corpus-wide statistics.
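Conceptually, that client-side step is plain term-frequency counting over a shared vocabulary. The sketch below is illustrative only and does not reproduce the SDK's real tokenizer or index mapping:

```js
// Illustrative TF (term frequency) sparse encoding. Each term maps to a stable
// vocabulary index; values are raw counts. Qdrant's "idf" modifier then
// down-weights terms that appear in many documents at search time.
const vocab = new Map(); // term -> index, shared across all documents

function encodeSparse(text, minTokenLength = 3) {
  const counts = new Map();
  for (const token of text.toLowerCase().split(/\W+/)) {
    if (token.length < minTokenLength) continue;
    if (!vocab.has(token)) vocab.set(token, vocab.size);
    const idx = vocab.get(token);
    counts.set(idx, (counts.get(idx) || 0) + 1);
  }
  return {
    indices: [...counts.keys()],
    values: [...counts.values()]
  };
}
```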
gemma-embedding-framework/
├── README.md # This file
├── docker-compose.yml # Service orchestration
├── vector.proto # gRPC service contract
├── docs/ # Documentation
│ ├── FRONTEND_GUIDE.md # Web UI user guide
│ └── images/ # Screenshots for documentation
├── embedding-service-python/ # Python embedding service
│ ├── Dockerfile
│ ├── server.py # gRPC + HTTP server
│ ├── download_models.py # Model pre-download script
│ ├── rerankers/ # Algorithmic rerankers (Python)
│ │ ├── bm25.py # BM25 implementation
│ │ ├── mmr.py # MMR diversity ranking
│ │ ├── hybrid.py # Vector + BM25 hybrid
│ │ └── smart.py # max(BM25, vector)
│ └── README.md # Detailed service docs
├── api-server/ # Express.js REST gateway (Test Bed)
│ ├── Dockerfile
│ ├── server.js # Main API server
│ ├── package.json
│ ├── rerankers/ # Local algorithmic rerankers
│ │ ├── bm25.js # BM25 keyword ranking
│ │ ├── smart.js # max(BM25, vector) strategy
│ │ └── hybrid.js # Weighted vector + BM25
│ ├── public/ # Web UI
│ │ └── index.html # Search interface with collection management
│ └── lib/ # Document processing
│ ├── processor.js # Main orchestrator
│ ├── parsers/ # Format parsers
│ ├── chunkers/ # Chunking strategies
│ └── plugins/ # TMX, XLIFF plugins
└── sdk/ # Node.js client SDK
├── package.json
├── README.md # SDK documentation
└── src/
├── index.js # Module exports
├── embedding-client.js # gRPC client for embeddings
├── qdrant-manager.js # Qdrant collection/search operations
├── sparse-encoder.js # Text → sparse vector conversion
├── gemma-client.js # Unified client (embeddings + Qdrant)
└── chunkers/ # Text chunking strategies
├── base.js # BaseChunker & ChunkerRegistry
├── fixed.js # Fixed-size chunking
├── sentence.js # Sentence-based chunking
└── paragraph.js # Paragraph-based chunking
| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | (required) | Hugging Face access token |
| `RERANKER_MODEL` | `bge-gemma` | Reranker preset or model ID |
| `QDRANT_URL` | `http://localhost:6333` | Qdrant connection URL |
| `COLLECTION_NAME` | `documents` | Qdrant collection name |
| `PORT` | `3000` | API server port |
| `GRPC_PORT` | `50051` | Embedding service gRPC port |
| `HTTP_ENABLED` | `false` | Enable HTTP REST on embedding service |
| `HTTP_PORT` | `8080` | Embedding service HTTP port |
| `PRODUCTION_MODE` | `false` | Suppress verbose logging (hides query text, instance IDs) |
Create docker-compose.override.yml for local customization:
services:
  embedding-service:
    environment:
      - HTTP_ENABLED=true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Typical latencies (RTX 3080):
| Operation | Latency |
|---|---|
| Single embedding | ~50ms |
| Batch embedding (10 texts) | ~150ms |
| Rerank (20 documents) | ~80ms |
| Full search pipeline | ~200ms |
Tips for optimization:
- Use batch endpoints for multiple texts
- Enable GPU acceleration
- Use gRPC for lowest latency
- Pre-filter with metadata before reranking (see the sketch below)
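For the last point, a hedged sketch of metadata pre-filtering via the SDK; the `filter` option and its shape are assumptions modeled on Qdrant's standard filter syntax, so check the SDK documentation for the exact parameter name:

```js
// Hypothetical example: narrow the candidate set with a metadata filter before
// reranking, so the reranker only scores chunks from one source file.
// The `filter` option shown here is an assumption, not a documented SDK field.
const results = await client.search('my_docs', 'quarterly revenue figures', {
  mode: 'hybrid',
  limit: 5,
  rerank: true,
  filter: {
    must: [
      { key: 'metadata.filename', match: { value: 'report-2024.pdf' } }
    ]
  }
});
```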
Verify your Hugging Face token and that you've accepted the model license:
curl -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoamiThe embedding service needs time to load models. Wait for the log message:
[Startup] Embedding service is ready!
- Ensure GPU is being used: check for `"device": "cuda"` in `/health`
- Use batch endpoints instead of single requests
- Consider the `speed` reranker for lower latency
This framework uses:
- EmbeddingGemma - Subject to Google's terms
- BGE-Reranker-v2-Gemma - MIT License
- BGE-Reranker-v2-M3 - MIT License (multilingual)
- MS-MARCO MiniLM - Apache 2.0
Contributions are welcome! Please open an issue or pull request.
For issues and feature requests, please open an issue in the repository.
