## Day 2: HNSW Tuning and Filtering Optimization

In this tutorial, we'll explore how Qdrant's HNSW algorithm affects search performance and how payload indexes can dramatically improve filtering speed. You'll learn to:

- Tune our HNSW configurations
- Compare filtering performance with and without payload indexes

### 1. Installing Required Dependencies

We'll need several libraries to work with Qdrant and perform our performance tests:

In [None]:
#!pip install datasets qdrant-client tqdm openai -q

These packages let us:

- `datasets`: Load and work with the DBpedia dataset
- `qdrant-client`: Interact with our Qdrant vector database
- `tqdm`: Show progress bars during data processing
- `openai`: Generate embeddings for our queries

We'll also use `time` from the Python standard library for performance timing, so it needs no installation.

In [1]:
from datasets import load_dataset
from qdrant_client import QdrantClient, models
from tqdm import tqdm
import time
import os



### 2. Connecting to Qdrant Cloud

Now we'll establish a connection to our Qdrant Cloud instance. Unlike the Day 1 tutorial, which used an in-memory database, this one uses Qdrant Cloud for performance testing. That lets us test with larger datasets and measure real-world performance on cloud infrastructure.

In [2]:
#from google.colab import userdata
from materials.q_drant_client import client

collections=[CollectionDescription(name='store'), CollectionDescription(name='my_first_collection'), CollectionDescription(name='my_collection'), CollectionDescription(name='day0_first_system'), CollectionDescription(name='dbpedia_100K'), CollectionDescription(name='dev_vectors'), CollectionDescription(name='production_vectors')]


### 3. Loading the DBpedia Dataset

We'll use the DBpedia entities dataset, which contains 100K Wikipedia articles with **pre-computed embeddings** from OpenAI's `text-embedding-3-large` model. Each vector has 1536 dimensions (the first 1536 dimensions of a `text-embedding-3-large` embedding), which makes the dataset well suited to testing HNSW performance on high-dimensional data.

With 100K vectors, the dataset is large enough to show real performance differences, and it includes titles, text, and categories for filtering tests.

In [None]:
ds = load_dataset("Qdrant/dbpedia-entities-openai3-text-embedding-3-large-1536-100K")

collection_name = "dbpedia_100K"

### 4. Creating Our Collection

We're starting with `m=0` for a specific reason: **bulk upload speed**. When `m=0`, Qdrant doesn't build any HNSW connections during indexing, which makes uploading 100K vectors much faster. This is perfect for our testing workflow, where we'll:

1. Upload data quickly with `m=0`
2. Test full scan performance
3. Update to `m=16` and test HNSW performance
4. Compare the difference

> **<font color='red'>Warning:</font>** Don't use this technique on subsequent bulk uploads. Setting `m=0` will delete the existing HNSW index. Rebuilding from scratch is slow and resource-intensive.

The `strict_mode_config` with `enabled=False` and `unindexed_filtering_retrieve=True` allows us to test filtering without payload indexes, so we can measure the performance impact of adding indexes later.

> Qdrant (Managed) Cloud runs in strict mode by default.

In [None]:
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=1536,
        distance=models.Distance.COSINE
    ),
    hnsw_config=models.HnswConfigDiff(
        m=0,  # number of neighbor links per node (0 skips building the HNSW graph)
        ef_construct=100,  # how many candidates are considered for each node during index build
        full_scan_threshold=10000  # below this threshold, do a full scan; above it, use HNSW search
    ),
    strict_mode_config=models.StrictModeConfig(
        enabled=False,
        unindexed_filtering_retrieve=True  # Allow filtering without indexes
    )
)

print(f"Created collection: {collection_name}")

### 5. Exploring Our Dataset

Before we start uploading data, let's examine the dataset structure to understand what we're working with. This helps us verify the data format and plan our upload strategy.

In [None]:
print("Dataset info:")
print(ds)

print("\nFirst example (proper access):")
first_example = ds['train'][0]
print(first_example)

print("\nDataset features:")
print(ds['train'].features)

print("\nAvailable columns:")
print(ds['train'].column_names)

### 6. Bulk Uploading Points

We're uploading 100K vectors in large batches (10K each) to speed up the process. With `m=0`, the upload is much faster since Qdrant isn't building HNSW connections during indexing.

> **<font color='red'>Warning:</font>** Don't use this technique on subsequent bulk uploads. Setting `m=0` will delete the existing HNSW index. Rebuilding from scratch is slow and resource-intensive.

The payload includes fields we'll use for filtering tests: `length` (text length) and `has_numbers` (a boolean flag). These will let us test different filter types later.
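The batching and payload logic described above can be tried without a Qdrant connection. The following is a minimal sketch: `batch_ranges` and `build_payload` are hypothetical helper names introduced here for illustration, and the sample record is made up.

```python
from typing import Iterator, Tuple


def batch_ranges(total: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs that cover range(total) without overlap."""
    for start in range(0, total, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(start + batch_size, total)


def build_payload(example: dict) -> dict:
    """Derive the two filterable fields from a record's text."""
    text = example["text"]
    return {
        "length": len(text),                              # integer field, suits range filters
        "has_numbers": any(ch.isdigit() for ch in text),  # boolean field, suits match filters
    }


# Made-up record, used only to show the shape of the derived payload.
sample = {"text": "Founded in 1984"}
sample_payload = build_payload(sample)
ranges = list(batch_ranges(25, 10))

print(sample_payload)
print(ranges)
```

Checking the ranges on a small total like this is a cheap way to confirm that the final, shorter batch is not dropped before uploading the real 100K points.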

In [None]:
batch_size = 10000
total_points = len(ds['train'])

print(f"Uploading {total_points} points in batches of {batch_size}")

def upload_batch_without_indexes(start_idx, end_idx):
    points = []
    for i in range(start_idx, min(end_idx, total_points)):
        example = ds['train'][i]

        # Get the embedding
        embedding = example['text-embedding-3-large-1536-embedding']

        # Create payload
        payload = {
            'text': example['text'],
            'title': example['title'],
            '_id': example['_id'],
            'length': len(example['text']),
            'has_numbers': any(char.isdigit() for char in example['text'])
        }

        points.append(models.PointStruct(
            id=i,
            vector=embedding,
            payload=payload
        ))

    if points:
        client.upload_points(collection_name=collection_name, points=points)
        return len(points)
    return 0

# Upload all batches
total_uploaded = 0
for i in tqdm(range(0, total_points, batch_size), desc="Uploading points"):
    uploaded = upload_batch_without_indexes(i, i + batch_size)
    total_uploaded += uploaded

print(f"\nUpload completed! Total points uploaded: {total_uploaded}")

### 7. Updating to HNSW Configuration

Now we'll update the collection to use `m=16`, which builds HNSW connections between vectors. This should dramatically improve search speed by creating a navigable graph structure for vector search: query cost grows roughly logarithmically with collection size, instead of linearly as in a full scan.

The `m=16` parameter sets the number of edges each node keeps in the graph. Higher values tend to improve recall at the cost of a larger index and a longer build, so 16 is a balance between search speed and index size.
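To get a feel for the index-size side of that trade-off, here is a rough back-of-envelope estimate. It rests on assumptions that are not taken from this tutorial: vectors stored as 4-byte floats, neighbor IDs stored as 4-byte integers, and a bottom graph layer that allows up to `2 * m` links per node (a common HNSW convention). Upper layers and any engine-specific overhead are ignored, so treat the numbers as an order-of-magnitude guide rather than Qdrant's actual accounting.

```python
def estimate_bytes(num_vectors: int, dims: int, m: int) -> dict:
    """Rough size estimate under the assumptions stated above (not exact)."""
    vector_bytes = num_vectors * dims * 4   # assumed float32 components
    link_bytes = num_vectors * (2 * m) * 4  # assumed 2*m links of 4 bytes each on layer 0
    return {"vectors": vector_bytes, "links": link_bytes}


est = estimate_bytes(num_vectors=100_000, dims=1536, m=16)
print(f"Raw vectors: {est['vectors'] / 1e6:.1f} MB")
print(f"Graph links (layer 0, upper bound): {est['links'] / 1e6:.1f} MB")
```

Under these assumptions the graph links are small next to the raw vectors at this dimensionality, which suggests that the main cost of raising `m` shows up in build time rather than memory.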

In [None]:
client.update_collection(
    collection_name=collection_name,
    hnsw_config=models.HnswConfigDiff(
        m=16,  # Updated from 0 to 16
    )
)

print("HNSW indexing enabled with m=16")

Building the HNSW index takes some time. One way to check whether it is fully built and optimized is to look at the **collection status**: once it changes from `YELLOW` to `GREEN`, the process is complete.


In [4]:
client.get_collection(collection_name=collection_name).status

<CollectionStatus.GREEN: 'green'>

However, you can start querying the collection right away!
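If you would rather wait for the build to finish before benchmarking, a polling loop is one option. The sketch below is a generic helper, not part of `qdrant-client`: `wait_until_green` is a hypothetical name, and it takes any callable that returns the current status, so it runs here against a stand-in status source instead of a live server.

```python
import time
from typing import Callable


def wait_until_green(
    get_status: Callable[[], object],
    timeout_s: float = 600.0,
    poll_interval_s: float = 5.0,
) -> bool:
    """Poll get_status() until it reports 'green' or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        # Accept either an enum with a .value attribute or a plain string.
        label = str(getattr(status, "value", status)).lower()
        if label == "green":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


# Stand-in status source so the sketch runs without a server.
fake_statuses = iter(["yellow", "yellow", "green"])
became_green = wait_until_green(
    lambda: next(fake_statuses), timeout_s=5, poll_interval_s=0
)
print(became_green)
```

With the notebook's client, the callable could be `lambda: client.get_collection(collection_name=collection_name).status`, which is the same call used in the cell above.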


### 8. Creating a Test Query

We're using OpenAI to generate an embedding for our query. We must use the same model (`text-embedding-3-large`) and dimensions (1536, first 1536 dimensions out of 3072) that were used to create our dataset embeddings.
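If an embedding comes back at the full 3072 dimensions, it has to be cut down to match the dataset. A minimal sketch of that step follows, assuming the vector is a plain list of floats. `truncate_and_normalize` is a hypothetical helper, and the re-normalization to unit length is an assumption about good practice after truncation, not something this tutorial prescribes.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(vector: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` components and rescale them to unit L2 norm."""
    if dims > len(vector):
        raise ValueError(f"cannot take {dims} dims from a {len(vector)}-d vector")
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalized
    return [x / norm for x in head]


# Toy 4-dimensional "embedding", used only for illustration.
full = [3.0, 4.0, 12.0, 0.5]
short = truncate_and_normalize(full, 2)
print(short)
```

A quick `len(query_embedding)` check, as in the cell below, is the simplest guard against a dimension mismatch with the collection's `size=1536`.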

> We've provided the embedding of the test query for the experiments, so you **don't have to use the OpenAI API**.


In [6]:
import numpy as np
from materials.embedding_model import embed_model

new_query = "artificial intelligence"

def get_query_embedding(text):
    try:
        response = embed_model.embed_query(text)
        return response
    except Exception as e:
        print(f"Error getting OpenAI embedding: {e}")
        print("Using random vector as fallback...")
        return np.random.normal(0, 1, 1536).tolist()

# Get the embedding
query_embedding = get_query_embedding(new_query)
print(len(query_embedding))

1536


In [None]:
# Test query already embedded, if you prefer to avoid using OpenAI API

import requests

url = "https://storage.googleapis.com/qdrant-examples/query_embedding_day_2.json"
resp = requests.get(url, timeout=30)
resp.raise_for_status()  # Fail early if the download did not succeed

query_embedding = resp.json()["query_vector"]

print(f"Embedding dimensions: {len(query_embedding)}")
print(f"First 5 values: {query_embedding[:5]}")

### 9. Performing Similarity Search

After all vectors are indexed, we'll test the baseline performance with `m=16`.

In [7]:
print("Running baseline performance test...")

# Warm up the RAM index/vectors cache with a test query
print("Warming up caches...")
client.query_points(collection_name=collection_name, query=query_embedding, limit=1)

# Measure vector search performance
search_times = []
for _ in range(3):  # Multiple runs for a stable average
    start_time = time.time()
    response = client.query_points(
        collection_name=collection_name,
        query=query_embedding,
        limit=10
    )
    search_time = (time.time() - start_time) * 1000
    search_times.append(search_time)

baseline_time = sum(search_times) / len(search_times)

print(f"Average search time: {baseline_time:.2f}ms")
print(f"Search times: {[f'{t:.2f}ms' for t in search_times]}")
print(f"Found {len(response.points)} results")
print(f"Top result: '{response.points[0].payload['title']}' (score: {response.points[0].score:.4f})")

Running baseline performance test...
Warming up caches...
Average search time: 345.55ms
Search times: ['473.12ms', '283.73ms', '279.79ms']
Found 10 results
Top result: 'A.I. Artificial Intelligence' (score: 0.5392)


The first time you run a query, Qdrant may need to load parts of the index from disk into memory, which can make it slower. After that, those parts stay cached in memory, so repeated queries are much faster. But if your machine is low on memory or you wait too long, the system might remove that cached data, and the process would repeat.
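Because of this caching effect, a cold first run can distort a short benchmark. One way to reduce that is to discard warm-up runs and report the median alongside the mean. The helper below is a sketch under that assumption: `benchmark` is a hypothetical name, and it times any zero-argument callable, so the example uses a trivial stand-in instead of a real query.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(fn: Callable[[], object], runs: int = 5, warmup: int = 1) -> Dict[str, object]:
    """Time fn() over several runs after discarding warm-up calls."""
    for _ in range(warmup):
        fn()  # not timed: gives caches a chance to fill first
    timings_ms: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000)
    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),  # less sensitive to one slow run
        "timings_ms": timings_ms,
    }


# Trivial stand-in workload so the sketch runs without a server.
calls = []
stats = benchmark(lambda: calls.append(1), runs=3, warmup=2)
print(f"timed runs: {len(stats['timings_ms'])}, total calls: {len(calls)}")
```

To time the search from this section, the callable could be `lambda: client.query_points(collection_name=collection_name, query=query_embedding, limit=10)`. Note that these timings include the network round trip to the cloud instance, not only the search itself.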

### 10. Testing Filtering Without Indexes

Now we'll test filtering performance without any payload indexes. This forces Qdrant to scan through all 100K vectors and check each one against the filter condition.

We're comparing the search time with and without a filter to see the overhead of full scan filtering.

In [9]:
print("Testing filtering without payload indexes")

# Create a text-based filter
text_filter = models.Filter(
    must=[
        models.FieldCondition(
            key="text",
            match=models.MatchText(text="data")
        )
    ]
)

# Run multiple times for more reliable measurement
unindexed_times = []
for i in range(3):
    start_time = time.time()
    response = client.query_points(
        collection_name=collection_name,
        query=query_embedding,
        limit=10,
        search_params=models.SearchParams(hnsw_ef=100),
        query_filter=text_filter
    )
    unindexed_times.append((time.time() - start_time) * 1000)

unindexed_filter_time = sum(unindexed_times) / len(unindexed_times)

print(f"Filtered search (WITHOUT index): {unindexed_filter_time:.2f}ms")
print(f"Individual times: {[f'{t:.2f}ms' for t in unindexed_times]}")
print(f"Overhead vs baseline: {unindexed_filter_time - baseline_time:.2f}ms")
print(f"Found {len(response.points)} matching results")
if response.points:
    print(f"Top result: '{response.points[0].payload['text']}'\nScore: {response.points[0].score:.4f}")
else:
    print("No results found - try a different filter term")

Testing filtering without payload indexes
Filtered search (WITHOUT index): 942.11ms
Individual times: ['1224.59ms', '780.72ms', '821.02ms']
Overhead vs baseline: 596.56ms
Found 10 matching results
Top result: 'Cyc (/ˈsaɪk/) is an artificial intelligence project that attempts to assemble a comprehensive ontology and knowledge base of everyday common sense knowledge, with the goal of enabling AI applications to perform human-like reasoning.The project was started in 1984 by Douglas Lenat at MCC and is developed by the Cycorp company.Parts of the project are released as OpenCyc, which provides an API, RDF endpoint, and data dump under an open source license.'
Score: 0.3030


### 11. Creating Payload Indexes

Now we'll create a [full text index](https://qdrant.tech/documentation/concepts/indexing/#full-text-index) on our text payload field. This should dramatically improve filtering performance by allowing Qdrant to quickly locate matching records instead of scanning all vectors.
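The general idea behind a text index can be shown with a toy inverted index. This is an illustration of the concept only, not Qdrant's implementation: the documents are made up, and whitespace splitting stands in for a real word tokenizer.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Made-up documents, keyed by point ID.
docs: Dict[int, str] = {
    0: "Open data portal for city budgets",
    1: "A history of jazz",
    2: "Data structures in practice",
}


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a stand-in for a word tokenizer)."""
    return text.lower().split()


def scan_match(documents: Dict[int, str], term: str) -> Set[int]:
    """Unindexed path: tokenize and check every document on every query."""
    return {doc_id for doc_id, text in documents.items() if term in tokenize(text)}


def build_inverted_index(documents: Dict[int, str]) -> Dict[str, Set[int]]:
    """Indexed path: map each token to the IDs of the documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


index = build_inverted_index(docs)
scanned = scan_match(docs, "data")
looked_up = index.get("data", set())
print(sorted(scanned), sorted(looked_up))
```

Both paths return the same IDs. The difference is that the scan touches every document per query, while the lookup pays that cost once at index-build time.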

> **<font color='red'>Warning:</font>** The **Filterable HNSW** index (Qdrant’s native vector index structure designed for vector search with filtering) is built **only if** the payload indices are created **before** the HNSW index is built.


In [11]:
client.create_payload_index(
    collection_name=collection_name,
    field_name="text",
    wait=True,
    field_schema=models.TextIndexParams(
        type="text",
        tokenizer="word",
        phrase_matching=False
        )
    )

print("Payload index created for 'text' field")

# If you want filter‑aware HNSW and you built the graph before creating payload indexes,
# rebuild the graph to attach filter data structures.

# Note: Reindexing takes up a lot of resources, and it is advised to set payload
# indexes only once, before creating HNSW graph.
# client.update_collection(collection_name=collection_name, hnsw_config=models.HnswConfigDiff(m=0))
# client.update_collection(collection_name=collection_name, hnsw_config=models.HnswConfigDiff(m=16))

Payload index created for 'text' field


### 12. Testing Filtering With Indexes

Now we'll test the same filter query, but this time with the payload index in place. The performance difference should be dramatic: indexed filtering should be much faster than the unindexed filtering we just tested.

This comparison shows the real-world impact of adding indexes to your vector search engine.

In [13]:
print("Testing filtering WITH payload indexes...")

print("Warming up caches...")
client.query_points(collection_name=collection_name, query=query_embedding, limit=1)

# Run multiple times for more reliable measurement
indexed_times = []
for i in range(3):
    start_time = time.time()
    response = client.query_points(
        collection_name=collection_name,
        query=query_embedding,
        limit=10,
        search_params=models.SearchParams(hnsw_ef=100),
        query_filter=text_filter
    )
    indexed_times.append((time.time() - start_time) * 1000)

indexed_filter_time = sum(indexed_times) / len(indexed_times)

print(f"Filtered search (WITH index): {indexed_filter_time:.2f}ms")
print(f"Individual times: {[f'{t:.2f}ms' for t in indexed_times]}")
print(f"Overhead vs baseline: {indexed_filter_time - baseline_time:.2f}ms")
print(f"Found {len(response.points)} matching results")
if response.points:
    print(f"Top result: '{response.points[0].payload['text']}'\nScore: {response.points[0].score:.4f}")
else:
    print("No results found - try a different filter term")

Testing filtering WITH payload indexes...
Warming up caches...
Filtered search (WITH index): 324.50ms
Individual times: ['412.68ms', '283.26ms', '277.56ms']
Overhead vs baseline: -21.05ms
Found 10 matching results
Top result: 'Cyc (/ˈsaɪk/) is an artificial intelligence project that attempts to assemble a comprehensive ontology and knowledge base of everyday common sense knowledge, with the goal of enabling AI applications to perform human-like reasoning.The project was started in 1984 by Douglas Lenat at MCC and is developed by the Cycorp company.Parts of the project are released as OpenCyc, which provides an API, RDF endpoint, and data dump under an open source license.'
Score: 0.3030
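To put the two measurements side by side, the averages printed above can be turned into a speedup ratio. The numbers below are copied from this notebook's own outputs. With only three runs each, one of them cold, they are indicative rather than a rigorous benchmark.

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the 'after' measurement is."""
    return before_ms / after_ms


unindexed_ms = 942.11  # "Filtered search (WITHOUT index)" average from the output above
indexed_ms = 324.50    # "Filtered search (WITH index)" average from the output above

ratio = speedup(unindexed_ms, indexed_ms)
reduction_pct = (1 - indexed_ms / unindexed_ms) * 100

print(f"Speedup: {ratio:.2f}x")
print(f"Time reduction: {reduction_pct:.1f}%")
```

The negative "Overhead vs baseline" in the indexed run means the filtered query was no slower than the unfiltered baseline within measurement noise, which is what you would hope for once the filter is served from an index.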


## Next Steps

In this tutorial, you've learned how to:

- **Optimize initial upload speed** by starting with `m=0` and building HNSW later
- **Measure filtering overhead** with and without payload indexes
- **Tune our HNSW index and parameters**

These techniques help you understand the performance trade-offs in vector search engines and optimize your applications for production use.