A high-performance data ingestion pipeline in Go for processing CSV datasets and storing them in Qdrant vector database with hybrid search support.
- High Performance: Concurrent processing with configurable worker pools
- Hybrid Search: Dense vectors (semantic) + BM25 sparse vectors (keyword)
- Memory Optimized:
- FLOAT16 vector storage
- 1.5-bit binary quantization
- On-disk vector and payload storage
- MRL Support: Matryoshka Representation Learning for dimension reduction (768D → 512D)
- Docker Ready: Complete Docker Compose setup with CUDA-accelerated TEI
```
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  CSV Files  │────▶│  Go Workers  │────▶│   Qdrant    │
└─────────────┘     └──────────────┘     └─────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │  TEI (CUDA)  │
                    │  Dense 512D  │
                    └──────────────┘
```
- Dense Embeddings: Generated by TEI using `google/embeddinggemma-300m` with MRL (512D)
- Sparse Vectors: Server-side BM25 generation by Qdrant
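The usual way to apply MRL is to keep the leading components of the full 768D embedding and re-normalize to unit length so cosine similarity remains meaningful. A minimal sketch of that reduction (the function name `truncateMRL` is illustrative, not from this repository):

```go
package main

import (
	"fmt"
	"math"
)

// truncateMRL keeps the first dim components of a Matryoshka embedding
// and re-normalizes to unit length so cosine distance stays meaningful.
func truncateMRL(vec []float32, dim int) []float32 {
	out := make([]float32, dim)
	copy(out, vec[:dim])
	var norm float64
	for _, v := range out {
		norm += float64(v) * float64(v)
	}
	norm = math.Sqrt(norm)
	if norm > 0 {
		for i := range out {
			out[i] = float32(float64(out[i]) / norm)
		}
	}
	return out
}

func main() {
	full := make([]float32, 768) // full-size model output
	for i := range full {
		full[i] = 0.1
	}
	reduced := truncateMRL(full, 512) // 768D → 512D
	fmt.Println(len(reduced))
}
```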
- Docker & Docker Compose
- NVIDIA GPU with CUDA support (for TEI)
- Hugging Face account with access token
1. Clone the repository

   ```bash
   git clone <repository-url>
   cd qdrant-data-processor
   ```

2. Configure environment

   ```bash
   cp .env.example .env
   # Edit .env and set your HF_TOKEN
   ```

3. Prepare dataset

   ```bash
   mkdir dataset
   # Place your CSV files in the dataset/ folder
   ```

4. Run the pipeline

   ```bash
   docker-compose up -d
   docker-compose logs -f app
   ```
All configuration is done via environment variables (see .env.example):
| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | - | Hugging Face token (required) |
| `TEI_ENDPOINT` | `http://tei:80` | TEI service URL |
| `QDRANT_HOST` | `qdrant` | Qdrant host |
| `QDRANT_PORT` | `6334` | Qdrant gRPC port |
| `QDRANT_API_KEY` | - | Qdrant Cloud API key (optional) |
| `COLLECTION_NAME` | `amazon-usa-products` | Collection name |
| `DENSE_MODEL` | `google/embeddinggemma-300m` | Dense embedding model |
| `SPARSE_MODEL` | `Qdrant/bm25` | Sparse embedding model |
| `BATCH_SIZE` | `64` | Processing batch size |
| `WORKER_COUNT` | `10` | Parallel workers |
| `DATASET_PATH` | `/dataset` | Path to CSV files |
This project uses the Amazon Products Dataset 2023 (1.4M Products) from Kaggle.
Source: Kaggle Dataset Link
Setup:
- Download `amazon_categories.csv` and `amazon_products.csv`.
- Move them to the `dataset/` directory:

  ```bash
  mkdir -p dataset
  mv /path/to/download/amazon_categories.csv dataset/
  mv /path/to/download/amazon_products.csv dataset/
  ```
Products CSV (`amazon_products.csv`):

```
asin,title,imgUrl,productURL,stars,reviews,price,listPrice,category_id,isBestSeller,boughtInLastMonth
```

Categories CSV (`amazon_categories.csv`):

```
id,category_name
```

The project uses `google/embeddinggemma-300m`, which is a gated model.
Action Required:
- Visit the model page on Hugging Face.
- Accept the access terms.
- Generate an Access Token (Read) from your Hugging Face settings.
- Set the `HF_TOKEN` in your `.env` file.
Each point in Qdrant will store a JSON payload similar to this:
```json
{
  "asin": "B08BHPBBZ9",
  "title": "Hand-woven Hollow Out Soft Straw Shoulder Bag with Pearl Flower, Boho Straw Handle Tote Summer Beach Bag Handbag",
  "category_id": 118,
  "category_name": "Women's Handbags",
  "stars": 4.4,
  "reviews": 3,
  "price": 49.99,
  "list_price": 0,
  "is_best_seller": false,
  "bought_in_last_month": 0,
  "img_url": "https://m.media-amazon.com/images/I/615nVyNZ5+L._AC_UL320_.jpg",
  "product_url": "https://www.amazon.com/dp/B08BHPBBZ9",
  "embedding_text": "title: hand-woven hollow out soft straw shoulder bag with pearl flower, boho straw handle tote summer beach bag handbag | text: women's handbags"
}
```

The Qdrant collection is created with these optimizations:
| Setting | Value | Purpose |
|---|---|---|
| Dense Vector Size | 512D | MRL-reduced EmbeddingGemma |
| Distance | Cosine | Semantic similarity |
| Datatype | FLOAT16 | 50% memory reduction |
| Quantization | 1.5-bit Binary | Fast search, low memory |
| Vectors On Disk | Yes | Memory efficiency |
| Payloads On Disk | Yes | Memory efficiency |
| Sparse Vectors | BM25 + IDF | Keyword search |
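The FLOAT16 and quantization rows translate directly into storage arithmetic. A back-of-envelope sketch for 1.4M vectors at 512D (raw vector bytes only; Qdrant's actual on-disk layout adds index and metadata overhead):

```go
package main

import "fmt"

func main() {
	const (
		vectors = 1_400_000 // dataset size
		dims    = 512       // MRL-reduced dimension
	)
	// Bytes per vector: FLOAT32 = 4 bytes/dim, FLOAT16 = 2 bytes/dim,
	// 1.5-bit binary quantization ≈ 1.5 bits/dim.
	f32 := float64(vectors) * dims * 4 / (1 << 30)
	f16 := float64(vectors) * dims * 2 / (1 << 30)
	q15 := float64(vectors) * dims * 1.5 / 8 / (1 << 30)
	fmt.Printf("float32: %.2f GiB\nfloat16: %.2f GiB\n1.5-bit: %.2f GiB\n", f32, f16, q15)
}
```

The FLOAT16 figure is exactly half the FLOAT32 one, which is where the "50% memory reduction" in the table comes from; the quantized copy is what Qdrant keeps hot for fast scoring.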
Each point includes these searchable fields:
| Field | Type | Indexed |
|---|---|---|
| `asin` | string | No |
| `title` | string | No |
| `category_id` | integer | ✅ |
| `category_name` | keyword | ✅ |
| `stars` | float | ✅ |
| `price` | float | ✅ |
| `is_best_seller` | bool | ✅ |
| `bought_in_last_month` | integer | ✅ |
| `reviews` | integer | ✅ |
```
qdrant-data-processor/
├── config/config.go            # Configuration management
├── models/models.go            # Data models
├── readers/csv_reader.go       # CSV streaming reader
├── processors/preprocessor.go  # Text formatting
├── embeddings/tei_client.go    # TEI OpenAI-compatible client
├── storage/qdrant_client.go    # Qdrant client
├── workers/pool.go             # Concurrent worker pool
├── main.go                     # Entry point
├── go.mod                      # Go module
├── Dockerfile                  # Multi-stage build
├── docker-compose.yml          # Full stack
├── .env.example                # Environment template
└── .gitignore                  # Git exclusions
```
```bash
go build -o qdrant-data-processor .
```

```bash
# Start Qdrant
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant:v1.16.3

# Start TEI (requires NVIDIA GPU)
docker run --gpus 1 -p 8080:80 \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-embeddings-inference:cuda-1.8.3 \
  --model-id google/embeddinggemma-300m --port 80

# Run processor
QDRANT_HOST=localhost TEI_ENDPOINT=http://localhost:8080 ./qdrant-data-processor
```

To rebuild the app image:

```bash
docker-compose build app
```

Real-world benchmark on an NVIDIA RTX 3080 Ti / Intel i7-12700K:
- Configuration: 20 workers, batch size 500
- Throughput: ~380-400 items/sec (end-to-end)
- TEI Latency: ~23-26s per batch of 500 (complex embeddings)
- Qdrant Upsert: ~26ms per batch (extremely fast due to async writes)
- Total Dataset: ~1.4M products processed in ~1 hour
**TEI fails to start:**

- Ensure `HF_TOKEN` is set in `.env`
- Verify the GPU is available: `nvidia-smi`
- Check the TEI logs: `docker-compose logs tei`

**Reset the collection:**

```bash
curl -X DELETE http://localhost:6333/collections/amazon-usa-products
docker-compose restart app
```

**High memory usage:**

- Reduce `WORKER_COUNT`
- Reduce `BATCH_SIZE`
MIT License
- Qdrant - Vector database
- Hugging Face TEI - Embedding inference
- Google EmbeddingGemma - Embedding model