[Bug]: RangeSearch result not as expected #34199

cydrain · 2024-06-26T09:05:10Z

Is there an existing issue for this?

I have searched the existing issues

Environment

- Milvus version: v2.3.4
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): ubuntu22.04
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Generate 10,000 vectors and create an "IVF_SQ8" index with metric "COSINE".
Run RangeSearch with radius=0.7; it returns the following results:

Top 0: id: 1613, distance: 0.8454123735427856, entity: {}      <========== this result is missing when RangeSearch is run with radius=0.8
Top 1: id: 9179, distance: 0.8377795219421387, entity: {}
Top 2: id: 3374, distance: 0.8356838226318359, entity: {}
Top 3: id: 5438, distance: 0.8329548835754395, entity: {}
Top 4: id: 9604, distance: 0.8325211405754089, entity: {}
Top 5: id: 7765, distance: 0.8311684131622314, entity: {}
Top 6: id: 6103, distance: 0.8307194113731384, entity: {}
Top 7: id: 8275, distance: 0.827889084815979, entity: {}
Top 8: id: 5087, distance: 0.8266986012458801, entity: {}
Top 9: id: 8819, distance: 0.826603889465332, entity: {}

Run RangeSearch again with radius=0.8:

Top 0: id: 9179, distance: 0.8377795219421387, entity: {}
Top 1: id: 3374, distance: 0.8356838226318359, entity: {}
Top 2: id: 7765, distance: 0.8311684131622314, entity: {}
Top 3: id: 8138, distance: 0.8248189687728882, entity: {}
Top 4: id: 552, distance: 0.818252682685852, entity: {}
Top 5: id: 3877, distance: 0.8178315162658691, entity: {}
Top 6: id: 2179, distance: 0.8175950646400452, entity: {}
Top 7: id: 3890, distance: 0.8130142688751221, entity: {}
Top 8: id: 9989, distance: 0.8103870153427124, entity: {}
Top 9: id: 3051, distance: 0.8075019717216492, entity: {}

Setting "max_empty_result_buckets=65000", or any other value, still does not bring back the result with "id=1613".
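As a quick sanity check on the two result lists above, the sketch below diffs them. The ids and distances are copied from the output above; the helper name `missing_hits` is made up for this illustration, and it assumes that a COSINE range search should return every hit whose similarity is greater than the radius.

```python
# Results copied from the two runs reported above: {id: distance}.
radius_07 = {
    1613: 0.8454123735427856, 9179: 0.8377795219421387,
    3374: 0.8356838226318359, 5438: 0.8329548835754395,
    9604: 0.8325211405754089, 7765: 0.8311684131622314,
    6103: 0.8307194113731384, 8275: 0.827889084815979,
    5087: 0.8266986012458801, 8819: 0.826603889465332,
}
radius_08 = {
    9179: 0.8377795219421387, 3374: 0.8356838226318359,
    7765: 0.8311684131622314, 8138: 0.8248189687728882,
    552: 0.818252682685852, 3877: 0.8178315162658691,
    2179: 0.8175950646400452, 3890: 0.8130142688751221,
    9989: 0.8103870153427124, 3051: 0.8075019717216492,
}


def missing_hits(wide, narrow, radius):
    """Ids found by the wider search that exceed `radius` but are absent
    from the narrower search, ordered by descending similarity."""
    ids = [i for i, dist in wide.items() if dist > radius and i not in narrow]
    return sorted(ids, key=lambda i: wide[i], reverse=True)


missing = missing_hits(radius_07, radius_08, 0.8)
print(missing)
```

The first id printed is 1613, the hit called out above; the list also contains the other radius=0.7 hits above 0.8 that the radius=0.8 run did not return.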

Expected Behavior

The result with "id=1613" should be returned.
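One way to confirm the expected behavior independently of the index is an exact brute-force range search over the same kind of data. The following is a minimal pure-Python sketch, not part of the original report: the function names are hypothetical, the data is regenerated with a fixed seed rather than read back from Milvus, and it assumes that for COSINE a hit satisfies radius < similarity <= range_filter.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_range_search(query, vectors, radius, range_filter=1.0, limit=10):
    """Brute-force range search: keep hits with radius < score <= range_filter,
    best first, truncated to `limit`."""
    scored = ((i, cosine(query, v)) for i, v in enumerate(vectors))
    hits = [(i, s) for i, s in scored if radius < s <= range_filter]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:limit]


random.seed(0)
dim = 128
vectors = [[random.random() for _ in range(dim)] for _ in range(1000)]
query = [float(i) for i in range(dim)]  # same shape as the script's query vector

wide = exact_range_search(query, vectors, radius=0.7)
narrow = exact_range_search(query, vectors, radius=0.8)

# With an exact search, every radius=0.7 hit above 0.8 must also lead
# the radius=0.8 result list, in the same order.
above = [hit for hit in wide if hit[1] > 0.8]
print(above == narrow[:len(above)])
```

An approximate index such as IVF_SQ8 is not guaranteed to satisfy this property, which is why a comparison against exact results is a useful reference point.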

Steps To Reproduce

Run this script:

import random

from pymilvus import (
    connections,
    FieldSchema, CollectionSchema, DataType,
    Collection,
    utility
)

_HOST = '127.0.0.1'
_PORT = '19530'

# Const names
_COLLECTION_NAME = 'demo'
_ID_FIELD_NAME = 'id_field'
_VECTOR_FIELD_NAME = 'float_vector_field'

# Vector parameters
_BATCH = 100000
_ROWS = 10000
_DIM = 128
_INDEX_FILE_SIZE = 32  # max file size of stored index
_NQ = 1

# Index parameters
_METRIC_TYPE = 'COSINE'
_INDEX_TYPE = 'IVF_SQ8'
_NLIST = 1024
_NPROBE = 16
_TOPK = 10

# Create a Milvus connection
def create_connection():
    print(f"\nCreate connection...")
    connections.connect(host=_HOST, port=_PORT)
    print(f"\nList connections:")
    print(connections.list_connections())

# Create a collection named 'demo'
def create_collection(name, id_field, vector_field):
    field1 = FieldSchema(name=id_field, dtype=DataType.INT64, description="int64", is_primary=True)
    field2 = FieldSchema(name=vector_field, dtype=DataType.FLOAT_VECTOR, description="float vector", dim=_DIM,
                         is_primary=False)
    schema = CollectionSchema(fields=[field1, field2], description="collection description")
    collection = Collection(name=name, data=None, schema=schema, properties={"collection.ttl.seconds": 15})
    print("\ncollection created:", name)
    return collection

def has_collection(name):
    return utility.has_collection(name)

# Drop a collection in Milvus
def drop_collection(name):
    collection = Collection(name)
    collection.drop()
    print("\nDrop collection: {}".format(name))

# List all collections in Milvus
def list_collections():
    print("\nlist collections:")
    print(utility.list_collections())

def insert(collection, num, dim):
    data_idx = [i for i in range(num)]
    data_vec = [[random.random() for _ in range(dim)] for _ in range(num)]
    if num <= _BATCH:
        collection.insert([data_idx, data_vec])
    else:
        i = 0
        while i < num:
            n = min(_BATCH, num - i)
            collection.insert([data_idx[i:i+n], data_vec[i:i+n]])
            i += n
    return data_vec

def get_entity_num(collection):
    print("\nThe number of entity:")
    print(collection.num_entities)

def create_index(collection, field_name):
    # IVF_SQ8 only takes "nlist"; "efConstruction" and "M" are HNSW parameters,
    # and _EFC / _M were never defined in this script.
    index_param = {
        "index_type": _INDEX_TYPE,
        "params": {"nlist": _NLIST},
        "metric_type": _METRIC_TYPE}
    collection.create_index(field_name, index_param)
    print("\nCreated index:\n{}".format(collection.index().params))

def drop_index(collection):
    collection.drop_index()
    print("\nDrop index successfully")

def load_collection(collection):
    collection.load()

def release_collection(collection):
    collection.release()

def search(collection, vector_field, id_field, search_vectors):
    search_param = {
        "data": search_vectors,
        "anns_field": vector_field,
        "param": {"metric_type": _METRIC_TYPE, "params": {"nprobe": _NPROBE, "radius": 0.8,
                                                          "range_filter": 1.0,
                                                          "max_empty_result_buckets": 65000}},
        "limit": _TOPK,
        "expr": "id_field >= 0"}
    results = collection.search(**search_param)
    # results = collection.query(expr="id_field >= 0")
    for i, result in enumerate(results):
        print("\nSearch result for {}th vector: ".format(i))
        for j, res in enumerate(result):
            print("Top {}: {}".format(j, res))

def set_properties(collection):
    collection.set_properties(properties={"collection.ttl.seconds": 1800})

def main():
    # create a connection
    create_connection()

    # drop collection if the collection exists
    if has_collection(_COLLECTION_NAME):
        drop_collection(_COLLECTION_NAME)

    # create collection
    collection = create_collection(_COLLECTION_NAME, _ID_FIELD_NAME, _VECTOR_FIELD_NAME)

    # alter ttl properties of collection level
    set_properties(collection)

    # show collections
    list_collections()

    # insert 10000 vectors with 128 dimension
    vectors = insert(collection, _ROWS, _DIM)

    collection.flush()
    # get the number of entities
    get_entity_num(collection)

    # create index
    create_index(collection, _VECTOR_FIELD_NAME)

    # load data to memory
    load_collection(collection)

    # search
    query_vec = [[i + j for i in range(_DIM)] for j in range(_NQ)]
    search(collection, _VECTOR_FIELD_NAME, _ID_FIELD_NAME, query_vec)

    # release memory
    # release_collection(collection)

    # drop collection index
    # drop_index(collection)

    # drop collection
    # drop_collection(_COLLECTION_NAME)

if __name__ == '__main__':
    main()

Milvus Log

No response

Anything else?

No response


cydrain · 2024-06-26T09:05:30Z

/assign

cydrain · 2024-06-27T03:55:15Z

| Milvus Version | Knowhere Version | Issue Reproduced |
|---|---|---|
| v2.3.4 | | Yes |
| v2.3.18 | v2.2.6 | Yes |
| v2.4.0 | v2.3.0 | No |
| v2.4.5 | | No |
| master (`a08000c`) | | No |

cydrain · 2024-06-27T07:05:50Z

The IVF range search parameter "max_empty_result_buckets" was introduced in Knowhere v2.3.0 (zilliztech/knowhere#455).
So in Milvus v2.4.0 and later releases, users can improve IVF range search recall by increasing "max_empty_result_buckets".
In Milvus v2.3.x, there is no parameter for improving IVF range search recall.
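To illustrate why such a parameter affects recall, here is a toy model of an IVF range search. It assumes the search stops probing once that many consecutive buckets yield no in-range hits. This is only a sketch under that assumption: it is not Knowhere's implementation, the function name is hypothetical, and the bucket contents (ids and scores) are made up.

```python
def toy_ivf_range_search(buckets, radius, max_empty_result_buckets):
    """Scan buckets in probe order and stop after `max_empty_result_buckets`
    consecutive buckets contribute no hit with score > radius."""
    hits = []
    empty_streak = 0
    for bucket in buckets:
        in_range = [(vid, score) for vid, score in bucket if score > radius]
        if in_range:
            hits.extend(in_range)
            empty_streak = 0
        else:
            empty_streak += 1
            if empty_streak >= max_empty_result_buckets:
                break  # give up early; later buckets are never scanned
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up buckets, listed in the order they would be probed.
buckets = [
    [(1, 0.84), (2, 0.83)],
    [(3, 0.75)],
    [(4, 0.72)],
    [(5, 0.85)],  # best match sits behind two buckets with no hit above 0.8
]

early = [vid for vid, _ in toy_ivf_range_search(buckets, 0.8, 2)]
patient = [vid for vid, _ in toy_ivf_range_search(buckets, 0.8, 3)]
wide = [vid for vid, _ in toy_ivf_range_search(buckets, 0.7, 2)]
print(early, patient, wide)
```

In this toy setup the search at radius 0.8 with a limit of 2 empty buckets stops before reaching id 5, while either a larger limit or a smaller radius (which leaves no bucket empty) finds it.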

cydrain · 2024-06-27T07:07:48Z

/close

sre-ci-robot · 2024-06-27T07:07:51Z

@cydrain: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cydrain added the kind/bug and needs-triage labels Jun 26, 2024

cydrain assigned yanliang567 Jun 26, 2024

sre-ci-robot assigned cydrain Jun 26, 2024

cydrain mentioned this issue Jun 26, 2024

[Bug]: Filtering results do not match expectations when using the radius param. #30327

Closed

1 task

yanliang567 added the triage/accepted label and removed the needs-triage label Jun 26, 2024

yanliang567 removed their assignment Jun 26, 2024

sre-ci-robot closed this as completed Jun 27, 2024
