Qdrant Data Processor

A high-performance data ingestion pipeline in Go that processes CSV datasets and stores them in a Qdrant vector database with hybrid search support.

Features

  • High Performance: Concurrent processing with configurable worker pools
  • Hybrid Search: Dense vectors (semantic) + BM25 sparse vectors (keyword)
  • Memory Optimized:
    • FLOAT16 vector storage
    • 1.5-bit binary quantization
    • On-disk vector and payload storage
  • MRL Support: Matryoshka Representation Learning for dimension reduction (768D → 512D)
  • Docker Ready: Complete Docker Compose setup with CUDA-accelerated TEI

Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  CSV Files  │────▶│  Go Workers  │────▶│   Qdrant    │
└─────────────┘     └──────────────┘     └─────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │  TEI (CUDA)  │
                    │  Dense 512D  │
                    └──────────────┘
  • Dense Embeddings: Generated by TEI using google/embeddinggemma-300m with MRL (512D)
  • Sparse Vectors: Server-side BM25 generation by Qdrant
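The fan-out stage in the diagram can be sketched as a minimal Go worker pool. Names and stubs below are illustrative only; the actual implementation lives in workers/pool.go and batches TEI requests rather than embedding one row at a time:

```go
package main

import (
	"fmt"
	"sync"
)

// Product is a minimal stand-in for a parsed CSV row.
type Product struct {
	ASIN  string
	Title string
}

// embed is a stub for the TEI call; the real client sends batched requests.
func embed(p Product) []float32 {
	return make([]float32, 512) // MRL-reduced 512D dense vector
}

// processAll fans rows out to `workers` goroutines and counts completed items.
func processAll(rows []Product, workers int) int {
	jobs := make(chan Product)
	var wg sync.WaitGroup
	var mu sync.Mutex
	processed := 0

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				_ = embed(p) // then upsert the point to Qdrant
				mu.Lock()
				processed++
				mu.Unlock()
			}
		}()
	}
	for _, r := range rows {
		jobs <- r
	}
	close(jobs)
	wg.Wait()
	return processed
}

func main() {
	rows := []Product{{"B08BHPBBZ9", "Straw Shoulder Bag"}, {"B000000000", "Tote"}}
	fmt.Println(processAll(rows, 4))
}
```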

Quick Start

Prerequisites

  • Docker & Docker Compose
  • NVIDIA GPU with CUDA support (for TEI)
  • Hugging Face account with access token

Setup

  1. Clone the repository

    git clone <repository-url>
    cd qdrant-data-processor
  2. Configure environment

    cp .env.example .env
    # Edit .env and set your HF_TOKEN
  3. Prepare dataset

    mkdir dataset
    # Place your CSV files in the dataset/ folder
  4. Run the pipeline

    docker-compose up -d
    docker-compose logs -f app

Configuration

All configuration is done via environment variables (see .env.example):

| Variable | Default | Description |
|---|---|---|
| HF_TOKEN | - | Hugging Face token (required) |
| TEI_ENDPOINT | http://tei:80 | TEI service URL |
| QDRANT_HOST | qdrant | Qdrant host |
| QDRANT_PORT | 6334 | Qdrant gRPC port |
| QDRANT_API_KEY | - | Qdrant Cloud API key (optional) |
| COLLECTION_NAME | amazon-usa-products | Collection name |
| DENSE_MODEL | google/embeddinggemma-300m | Dense embedding model |
| SPARSE_MODEL | Qdrant/bm25 | Sparse embedding model |
| BATCH_SIZE | 64 | Processing batch size |
| WORKER_COUNT | 10 | Number of parallel workers |
| DATASET_PATH | /dataset | Path to CSV files |

Dataset

This project uses the Amazon Products Dataset 2023 (1.4M Products) from Kaggle.

Source: Kaggle Dataset Link

Setup:

  1. Download amazon_categories.csv and amazon_products.csv.
  2. Move them to the dataset/ directory:
    mkdir -p dataset
    mv /path/to/download/amazon_categories.csv dataset/
    mv /path/to/download/amazon_products.csv dataset/

Input Format

Products CSV (amazon_products.csv):

asin,title,imgUrl,productURL,stars,reviews,price,listPrice,category_id,isBestSeller,boughtInLastMonth

Categories CSV (amazon_categories.csv):

id,category_name

Model Access

The project uses google/embeddinggemma-300m, which is a gated model.

Action Required:

  1. Visit the model page on Hugging Face.
  2. Accept the access terms.
  3. Generate an Access Token (Read) from your Hugging Face settings.
  4. Set the HF_TOKEN in your .env file.

Qdrant Payload Example

Each point in Qdrant will store a JSON payload similar to this:

{
  "asin": "B08BHPBBZ9",
  "title": "Hand-woven Hollow Out Soft Straw Shoulder Bag with Pearl Flower, Boho Straw Handle Tote Summer Beach Bag Handbag",
  "category_id": 118,
  "category_name": "Women's Handbags",
  "stars": 4.4,
  "reviews": 3,
  "price": 49.99,
  "list_price": 0,
  "is_best_seller": false,
  "bought_in_last_month": 0,
  "img_url": "https://m.media-amazon.com/images/I/615nVyNZ5+L._AC_UL320_.jpg",
  "product_url": "https://www.amazon.com/dp/B08BHPBBZ9",
  "embedding_text": "title: hand-woven hollow out soft straw shoulder bag with pearl flower, boho straw handle tote summer beach bag handbag | text: women's handbags"
}
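Judging from the payload example, embedding_text is the lowercased title and category name joined with fixed prefixes. A sketch of that formatting follows; the authoritative logic is in processors/preprocessor.go, so treat this as a guess at it:

```go
package main

import (
	"fmt"
	"strings"
)

// embeddingText reproduces the "title: ... | text: ..." format seen in the
// payload example, with both parts lowercased.
func embeddingText(title, category string) string {
	return fmt.Sprintf("title: %s | text: %s",
		strings.ToLower(title), strings.ToLower(category))
}

func main() {
	fmt.Println(embeddingText("Straw Shoulder Bag", "Women's Handbags"))
}
```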

Collection Configuration

The Qdrant collection is created with these optimizations:

| Setting | Value | Purpose |
|---|---|---|
| Dense Vector Size | 512D | MRL-reduced EmbeddingGemma |
| Distance | Cosine | Semantic similarity |
| Datatype | FLOAT16 | 50% memory reduction |
| Quantization | 1.5-bit Binary | Fast search, low memory |
| Vectors On Disk | Yes | Memory efficiency |
| Payloads On Disk | Yes | Memory efficiency |
| Sparse Vectors | BM25 + IDF | Keyword search |
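Expressed against Qdrant's REST API, these settings correspond roughly to a request like the one below. Field names follow Qdrant's documented collection schema, but the exact encoding value for 1.5-bit quantization is an assumption (it requires Qdrant ≥ 1.15); the project itself creates the collection through the Go client in storage/qdrant_client.go:

```http
PUT /collections/amazon-usa-products
Content-Type: application/json

{
  "vectors": {
    "dense": { "size": 512, "distance": "Cosine", "datatype": "float16", "on_disk": true }
  },
  "sparse_vectors": {
    "bm25": { "modifier": "idf" }
  },
  "quantization_config": {
    "binary": { "encoding": "one_and_half_bits", "always_ram": true }
  },
  "on_disk_payload": true
}
```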

Payload Fields

Each point includes these searchable fields:

| Field | Type | Indexed |
|---|---|---|
| asin | string | No |
| title | string | No |
| category_id | integer | Yes |
| category_name | keyword | Yes |
| stars | float | Yes |
| price | float | Yes |
| is_best_seller | bool | Yes |
| bought_in_last_month | integer | Yes |
| reviews | integer | Yes |
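Payload indexes like these are created per field; via Qdrant's REST API the equivalent request would look roughly like this (a sketch only; the project does it through the Go client):

```http
PUT /collections/amazon-usa-products/index
Content-Type: application/json

{ "field_name": "category_name", "field_schema": "keyword" }
```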

Project Structure

qdrant-data-processor/
├── config/config.go          # Configuration management
├── models/models.go          # Data models
├── readers/csv_reader.go     # CSV streaming reader
├── processors/preprocessor.go # Text formatting
├── embeddings/tei_client.go  # TEI OpenAI-compatible client
├── storage/qdrant_client.go  # Qdrant client
├── workers/pool.go           # Concurrent worker pool
├── main.go                   # Entry point
├── go.mod                    # Go module
├── Dockerfile                # Multi-stage build
├── docker-compose.yml        # Full stack
├── .env.example              # Environment template
└── .gitignore               # Git exclusions

Development

Build Locally

go build -o qdrant-data-processor .

Run Locally

# Start Qdrant
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant:v1.16.3

# Start TEI (requires NVIDIA GPU)
docker run --gpus 1 -p 8080:80 \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-embeddings-inference:cuda-1.8.3 \
  --model-id google/embeddinggemma-300m --port 80

# Run processor
QDRANT_HOST=localhost TEI_ENDPOINT=http://localhost:8080 ./qdrant-data-processor

Docker Build

docker-compose build app

Performance

Real-world benchmark on NVIDIA RTX 3080Ti / Intel i7 12700K:

  • Configuration: 20 workers, batch size 500
  • Throughput: ~380-400 items/sec (end-to-end)
  • TEI Latency: ~23-26 s per batch of 500 (embedding generation dominates the pipeline)
  • Qdrant Upsert: ~26 ms per batch (fast due to asynchronous writes)
  • Total Dataset: ~1.4M products processed in ~1 hour

These figures are mutually consistent: 20 workers × (500 items / ~25 s per batch) ≈ 400 items/sec, and 400 items/sec × 3600 s ≈ 1.44M products per hour.

Troubleshooting

TEI fails to start

  • Ensure HF_TOKEN is set in .env
  • Verify GPU is available: nvidia-smi
  • Check TEI logs: docker-compose logs tei

Dimension mismatch error

  • Delete the collection and restart:
    curl -X DELETE http://localhost:6333/collections/amazon-usa-products
    docker-compose restart app

Memory issues

  • Reduce WORKER_COUNT
  • Reduce BATCH_SIZE

License

MIT License
