Qdrant Data Processor

A high-performance data ingestion pipeline in Go that processes CSV datasets and stores them in a Qdrant vector database with hybrid search support.

Features

  • High Performance: Concurrent processing with configurable worker pools
  • Hybrid Search: Dense vectors (semantic) + BM25 sparse vectors (keyword)
  • Memory Optimized:
    • FLOAT16 vector storage
    • 1.5-bit binary quantization
    • On-disk vector and payload storage
  • MRL Support: Matryoshka Representation Learning for dimension reduction (768D → 512D)
  • Docker Ready: Complete Docker Compose setup with CUDA-accelerated TEI

Architecture

┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  CSV Files  │────▶│  Go Workers  │────▶│   Qdrant    │
└─────────────┘     └──────────────┘     └─────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │  TEI (CUDA)  │
                    │  Dense 512D  │
                    └──────────────┘
  • Dense Embeddings: Generated by TEI using google/embeddinggemma-300m with MRL (512D)
  • Sparse Vectors: Server-side BM25 generation by Qdrant
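The fan-out stage in the diagram can be sketched as a minimal Go worker pool. Names and stubs below are illustrative only; the actual implementation lives in workers/pool.go and batches TEI requests rather than embedding one row at a time:

```go
package main

import (
	"fmt"
	"sync"
)

// Product is a minimal stand-in for a parsed CSV row.
type Product struct {
	ASIN  string
	Title string
}

// embed is a stub for the TEI call; the real client sends batched requests.
func embed(p Product) []float32 {
	return make([]float32, 512) // MRL-reduced 512D dense vector
}

// processAll fans rows out to `workers` goroutines and counts completed items.
func processAll(rows []Product, workers int) int {
	jobs := make(chan Product)
	var wg sync.WaitGroup
	var mu sync.Mutex
	processed := 0

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				_ = embed(p) // then upsert the point to Qdrant
				mu.Lock()
				processed++
				mu.Unlock()
			}
		}()
	}
	for _, r := range rows {
		jobs <- r
	}
	close(jobs)
	wg.Wait()
	return processed
}

func main() {
	rows := []Product{{"B08BHPBBZ9", "Straw Shoulder Bag"}, {"B000000000", "Tote"}}
	fmt.Println(processAll(rows, 4))
}
```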

Quick Start

Prerequisites

  • Docker & Docker Compose
  • NVIDIA GPU with CUDA support (for TEI)
  • Hugging Face account with access token

Setup

  1. Clone the repository

    git clone <repository-url>
    cd qdrant-data-processor
  2. Configure environment

    cp .env.example .env
    # Edit .env and set your HF_TOKEN
  3. Prepare dataset

    mkdir dataset
    # Place your CSV files in the dataset/ folder
  4. Run the pipeline

    docker-compose up -d
    docker-compose logs -f app

Configuration

All configuration is done via environment variables (see .env.example):

| Variable | Default | Description |
|---|---|---|
| HF_TOKEN | - | Hugging Face token (required) |
| TEI_ENDPOINT | http://tei:80 | TEI service URL |
| QDRANT_HOST | qdrant | Qdrant host |
| QDRANT_PORT | 6334 | Qdrant gRPC port |
| QDRANT_API_KEY | - | Qdrant Cloud API key (optional) |
| COLLECTION_NAME | amazon-usa-products | Collection name |
| DENSE_MODEL | google/embeddinggemma-300m | Dense embedding model |
| SPARSE_MODEL | Qdrant/bm25 | Sparse embedding model |
| BATCH_SIZE | 64 | Processing batch size |
| WORKER_COUNT | 10 | Number of parallel workers |
| DATASET_PATH | /dataset | Path to CSV files |

Dataset

This project uses the Amazon Products Dataset 2023 (1.4M Products) from Kaggle.

Source: Kaggle Dataset Link

Setup:

  1. Download amazon_categories.csv and amazon_products.csv.
  2. Move them to the dataset/ directory:
    mkdir -p dataset
    mv /path/to/download/amazon_categories.csv dataset/
    mv /path/to/download/amazon_products.csv dataset/

Input Format

Products CSV (amazon_products.csv):

asin,title,imgUrl,productURL,stars,reviews,price,listPrice,category_id,isBestSeller,boughtInLastMonth

Categories CSV (amazon_categories.csv):

id,category_name

Model Access

The project uses google/embeddinggemma-300m, which is a gated model.

Action Required:

  1. Visit the model page on Hugging Face.
  2. Accept the access terms.
  3. Generate an Access Token (Read) from your Hugging Face settings.
  4. Set the HF_TOKEN in your .env file.

Qdrant Payload Example

Each point in Qdrant will store a JSON payload similar to this:

{
  "asin": "B08BHPBBZ9",
  "title": "Hand-woven Hollow Out Soft Straw Shoulder Bag with Pearl Flower, Boho Straw Handle Tote Summer Beach Bag Handbag",
  "category_id": 118,
  "category_name": "Women's Handbags",
  "stars": 4.4,
  "reviews": 3,
  "price": 49.99,
  "list_price": 0,
  "is_best_seller": false,
  "bought_in_last_month": 0,
  "img_url": "https://m.media-amazon.com/images/I/615nVyNZ5+L._AC_UL320_.jpg",
  "product_url": "https://www.amazon.com/dp/B08BHPBBZ9",
  "embedding_text": "title: hand-woven hollow out soft straw shoulder bag with pearl flower, boho straw handle tote summer beach bag handbag | text: women's handbags"
}
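Judging from the payload example, embedding_text is the lowercased title and category name joined with fixed prefixes. A sketch of that formatting follows; the authoritative logic is in processors/preprocessor.go, so treat this as a guess at it:

```go
package main

import (
	"fmt"
	"strings"
)

// embeddingText reproduces the "title: ... | text: ..." format seen in the
// payload example, with both parts lowercased.
func embeddingText(title, category string) string {
	return fmt.Sprintf("title: %s | text: %s",
		strings.ToLower(title), strings.ToLower(category))
}

func main() {
	fmt.Println(embeddingText("Straw Shoulder Bag", "Women's Handbags"))
}
```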

Collection Configuration

The Qdrant collection is created with these optimizations:

| Setting | Value | Purpose |
|---|---|---|
| Dense Vector Size | 512D | MRL-reduced EmbeddingGemma |
| Distance | Cosine | Semantic similarity |
| Datatype | FLOAT16 | 50% memory reduction |
| Quantization | 1.5-bit Binary | Fast search, low memory |
| Vectors On Disk | Yes | Memory efficiency |
| Payloads On Disk | Yes | Memory efficiency |
| Sparse Vectors | BM25 + IDF | Keyword search |
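Expressed against Qdrant's REST API, these settings correspond roughly to a request like the one below. Field names follow Qdrant's documented collection schema, but the exact encoding value for 1.5-bit quantization is an assumption (it requires Qdrant ≥ 1.15); the project itself creates the collection through the Go client in storage/qdrant_client.go:

```http
PUT /collections/amazon-usa-products
Content-Type: application/json

{
  "vectors": {
    "dense": { "size": 512, "distance": "Cosine", "datatype": "float16", "on_disk": true }
  },
  "sparse_vectors": {
    "bm25": { "modifier": "idf" }
  },
  "quantization_config": {
    "binary": { "encoding": "one_and_half_bits", "always_ram": true }
  },
  "on_disk_payload": true
}
```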

Payload Fields

Each point includes these searchable fields:

| Field | Type | Indexed |
|---|---|---|
| asin | string | No |
| title | string | No |
| category_id | integer | Yes |
| category_name | keyword | Yes |
| stars | float | Yes |
| price | float | Yes |
| is_best_seller | bool | Yes |
| bought_in_last_month | integer | Yes |
| reviews | integer | Yes |
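Payload indexes like these are created per field; via Qdrant's REST API the equivalent request would look roughly like this (a sketch only; the project does it through the Go client):

```http
PUT /collections/amazon-usa-products/index
Content-Type: application/json

{ "field_name": "category_name", "field_schema": "keyword" }
```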

Project Structure

qdrant-data-processor/
├── config/config.go          # Configuration management
├── models/models.go          # Data models
├── readers/csv_reader.go     # CSV streaming reader
├── processors/preprocessor.go # Text formatting
├── embeddings/tei_client.go  # TEI OpenAI-compatible client
├── storage/qdrant_client.go  # Qdrant client
├── workers/pool.go           # Concurrent worker pool
├── main.go                   # Entry point
├── go.mod                    # Go module
├── Dockerfile                # Multi-stage build
├── docker-compose.yml        # Full stack
├── .env.example              # Environment template
└── .gitignore               # Git exclusions

Development

Build Locally

go build -o qdrant-data-processor .

Run Locally

# Start Qdrant
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant:v1.16.3

# Start TEI (requires NVIDIA GPU)
docker run --gpus 1 -p 8080:80 \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-embeddings-inference:cuda-1.8.3 \
  --model-id google/embeddinggemma-300m --port 80

# Run processor
QDRANT_HOST=localhost TEI_ENDPOINT=http://localhost:8080 ./qdrant-data-processor

Docker Build

docker-compose build app

Performance

Real-world benchmark on NVIDIA RTX 3080Ti / Intel i7 12700K:

  • Configuration: 20 workers, batch size 500
  • Throughput: ~380-400 items/sec (end-to-end)
  • TEI Latency: ~23-26 s per batch of 500 (embedding generation dominates the pipeline)
  • Qdrant Upsert: ~26 ms per batch (fast due to asynchronous writes)
  • Total Dataset: ~1.4M products processed in ~1 hour

These figures are mutually consistent: 20 workers × (500 items / ~25 s per batch) ≈ 400 items/sec, and 400 items/sec × 3600 s ≈ 1.44M products per hour.

Troubleshooting

TEI fails to start

  • Ensure HF_TOKEN is set in .env
  • Verify GPU is available: nvidia-smi
  • Check TEI logs: docker-compose logs tei

Dimension mismatch error

  • Delete the collection and restart:
    curl -X DELETE http://localhost:6333/collections/amazon-usa-products
    docker-compose restart app

Memory issues

  • Reduce WORKER_COUNT
  • Reduce BATCH_SIZE

License

MIT License
