[Bug]: RangeSearch result not as expected #34199

Closed
1 task done
cydrain opened this issue Jun 26, 2024 · 5 comments
Assignees
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@cydrain
Contributor

cydrain commented Jun 26, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.3.4
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): ubuntu22.04
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. Generate 10,000 vectors and create an "IVF_SQ8" index with metric "COSINE".
  2. Run RangeSearch with radius=0.7; it returns the following results:
Top 0: id: 1613, distance: 0.8454123735427856, entity: {}      <========== this result is missing when RangeSearch is run with radius=0.8
Top 1: id: 9179, distance: 0.8377795219421387, entity: {}
Top 2: id: 3374, distance: 0.8356838226318359, entity: {}
Top 3: id: 5438, distance: 0.8329548835754395, entity: {}
Top 4: id: 9604, distance: 0.8325211405754089, entity: {}
Top 5: id: 7765, distance: 0.8311684131622314, entity: {}
Top 6: id: 6103, distance: 0.8307194113731384, entity: {}
Top 7: id: 8275, distance: 0.827889084815979, entity: {}
Top 8: id: 5087, distance: 0.8266986012458801, entity: {}
Top 9: id: 8819, distance: 0.826603889465332, entity: {}
  3. Run RangeSearch again with radius=0.8:
Top 0: id: 9179, distance: 0.8377795219421387, entity: {}
Top 1: id: 3374, distance: 0.8356838226318359, entity: {}
Top 2: id: 7765, distance: 0.8311684131622314, entity: {}
Top 3: id: 8138, distance: 0.8248189687728882, entity: {}
Top 4: id: 552, distance: 0.818252682685852, entity: {}
Top 5: id: 3877, distance: 0.8178315162658691, entity: {}
Top 6: id: 2179, distance: 0.8175950646400452, entity: {}
Top 7: id: 3890, distance: 0.8130142688751221, entity: {}
Top 8: id: 9989, distance: 0.8103870153427124, entity: {}
Top 9: id: 3051, distance: 0.8075019717216492, entity: {}
  4. Set "max_empty_result_buckets=65000" or any other value; the result with "id=1613" still does not come back.
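
To make the gap between the two runs concrete, the sketch below diffs the two result lists exactly as pasted above. The helper name `missing_in_narrower` is made up for illustration, and the `radius < distance` comparison assumes the COSINE convention in which a larger value means a closer match.

```python
# (id, distance) pairs copied from the radius=0.7 run reported above.
radius_07 = [
    (1613, 0.8454123735427856), (9179, 0.8377795219421387),
    (3374, 0.8356838226318359), (5438, 0.8329548835754395),
    (9604, 0.8325211405754089), (7765, 0.8311684131622314),
    (6103, 0.8307194113731384), (8275, 0.827889084815979),
    (5087, 0.8266986012458801), (8819, 0.826603889465332),
]

# (id, distance) pairs copied from the radius=0.8 run reported above.
radius_08 = [
    (9179, 0.8377795219421387), (3374, 0.8356838226318359),
    (7765, 0.8311684131622314), (8138, 0.8248189687728882),
    (552, 0.818252682685852), (3877, 0.8178315162658691),
    (2179, 0.8175950646400452), (3890, 0.8130142688751221),
    (9989, 0.8103870153427124), (3051, 0.8075019717216492),
]


def missing_in_narrower(wide, narrow, radius):
    """Return IDs the wider search found above `radius` that the narrower search omitted."""
    narrow_ids = {entity_id for entity_id, _ in narrow}
    return [entity_id for entity_id, dist in wide
            if dist > radius and entity_id not in narrow_ids]


missing = missing_in_narrower(radius_07, radius_08, 0.8)
print(missing)
```

On the pasted data this reports not only id=1613 but every radius=0.7 hit that is absent from the radius=0.8 run, even though all ten radius=0.7 hits have a distance above 0.8.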

Expected Behavior

RangeSearch with radius=0.8 should also return the result with "id=1613", since its distance (0.8454) lies within the requested range.

Steps To Reproduce

Run this script:

import random

from pymilvus import (
    connections,
    FieldSchema, CollectionSchema, DataType,
    Collection,
    utility
)

_HOST = '127.0.0.1'
_PORT = '19530'

# Const names
_COLLECTION_NAME = 'demo'
_ID_FIELD_NAME = 'id_field'
_VECTOR_FIELD_NAME = 'float_vector_field'

# Vector parameters
_BATCH = 100000
_ROWS = 10000
_DIM = 128
_INDEX_FILE_SIZE = 32  # max file size of stored index
_NQ = 1

# Index parameters
_METRIC_TYPE = 'COSINE'
_INDEX_TYPE = 'IVF_SQ8'
_NLIST = 1024
_NPROBE = 16
_TOPK = 10

# Create a Milvus connection
def create_connection():
    print("\nCreate connection...")
    connections.connect(host=_HOST, port=_PORT)
    print("\nList connections:")
    print(connections.list_connections())

# Create a collection named 'demo'
def create_collection(name, id_field, vector_field):
    field1 = FieldSchema(name=id_field, dtype=DataType.INT64, description="int64", is_primary=True)
    field2 = FieldSchema(name=vector_field, dtype=DataType.FLOAT_VECTOR, description="float vector", dim=_DIM,
                         is_primary=False)
    schema = CollectionSchema(fields=[field1, field2], description="collection description")
    collection = Collection(name=name, data=None, schema=schema, properties={"collection.ttl.seconds": 15})
    print("\ncollection created:", name)
    return collection

def has_collection(name):
    return utility.has_collection(name)

# Drop a collection in Milvus
def drop_collection(name):
    collection = Collection(name)
    collection.drop()
    print("\nDrop collection: {}".format(name))

# List all collections in Milvus
def list_collections():
    print("\nlist collections:")
    print(utility.list_collections())

def insert(collection, num, dim):
    data_idx = [i for i in range(num)]
    data_vec = [[random.random() for _ in range(dim)] for _ in range(num)]
    if num <= _BATCH:
        collection.insert([data_idx, data_vec])
    else:
        i = 0
        while i < num:
            n = min(_BATCH, num - i)
            collection.insert([data_idx[i:i+n], data_vec[i:i+n]])
            i += n
    return data_vec

def get_entity_num(collection):
    print("\nThe number of entity:")
    print(collection.num_entities)

def create_index(collection, field_name):
    # IVF_SQ8 only takes "nlist" as a build parameter; the original also passed
    # "efConstruction": _EFC and "M": _M, but those names are never defined in this script.
    index_param = {
        "index_type": _INDEX_TYPE,
        "params": {"nlist": _NLIST},
        "metric_type": _METRIC_TYPE}
    collection.create_index(field_name, index_param)
    print("\nCreated index:\n{}".format(collection.index().params))

def drop_index(collection):
    collection.drop_index()
    print("\nDrop index successfully")

def load_collection(collection):
    collection.load()

def release_collection(collection):
    collection.release()

def search(collection, vector_field, id_field, search_vectors):
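    search_param = {
        "data": search_vectors,
        "anns_field": vector_field,
        "param": {
            "metric_type": _METRIC_TYPE,
            "params": {
                "nprobe": _NPROBE,
                # COSINE range search: keep hits with radius < similarity <= range_filter
                "radius": 0.8,
                "range_filter": 1.0,
                # introduced in knowhere v2.3.0, so it has no effect before Milvus v2.4.0
                "max_empty_result_buckets": 65000,
            },
        },
        "limit": _TOPK,
        "expr": f"{id_field} >= 0"}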
    results = collection.search(**search_param)
    # results = collection.query(expr="id_field >= 0")
    for i, result in enumerate(results):
        print("\nSearch result for {}th vector: ".format(i))
        for j, res in enumerate(result):
            print("Top {}: {}".format(j, res))

def set_properties(collection):
    collection.set_properties(properties={"collection.ttl.seconds": 1800})

def main():
    # create a connection
    create_connection()

    # drop collection if the collection exists
    if has_collection(_COLLECTION_NAME):
        drop_collection(_COLLECTION_NAME)

    # create collection
    collection = create_collection(_COLLECTION_NAME, _ID_FIELD_NAME, _VECTOR_FIELD_NAME)

    # alter ttl properties of collection level
    set_properties(collection)

    # show collections
    list_collections()

    # insert 10000 vectors with 128 dimension
    vectors = insert(collection, _ROWS, _DIM)

    collection.flush()
    # get the number of entities
    get_entity_num(collection)

    # create index
    create_index(collection, _VECTOR_FIELD_NAME)

    # load data to memory
    load_collection(collection)

    # search
    query_vec = [[i + j for i in range(_DIM)] for j in range(_NQ)]
    search(collection, _VECTOR_FIELD_NAME, _ID_FIELD_NAME, query_vec)

    # release memory
    # release_collection(collection)

    # drop collection index
    # drop_index(collection)

    # drop collection
    # drop_collection(_COLLECTION_NAME)

if __name__ == '__main__':
    main()
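
One way to check which IDs a range search should return is an exact brute-force pass over the vectors that `insert()` already returns. This is a minimal sketch, not part of the original script: `cosine_similarity` and `brute_force_range` are hypothetical helpers, and the `radius < similarity <= range_filter` bounds are an assumption about the COSINE range convention. It relies on the script using the row index as the primary key.

```python
import math


def cosine_similarity(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def brute_force_range(vectors, query, radius, range_filter, limit):
    """Return up to `limit` (row_index, similarity) pairs inside the range, best first."""
    hits = []
    for idx, vec in enumerate(vectors):
        sim = cosine_similarity(query, vec)
        if radius < sim <= range_filter:
            hits.append((idx, sim))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:limit]


# Tiny self-contained example with 2-dimensional vectors.
example_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
example_hits = brute_force_range(example_vectors, [1.0, 0.0], 0.5, 1.0, 10)
print(example_hits)
```

In the script above this could be called as `brute_force_range(vectors, query_vec[0], 0.8, 1.0, _TOPK)` inside `main()`. Because IVF_SQ8 quantizes the stored vectors, the distances Milvus reports may differ slightly from these exact values.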

Milvus Log

No response

Anything else?

No response

@cydrain cydrain added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 26, 2024
@cydrain
Contributor Author

cydrain commented Jun 26, 2024

/assign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 26, 2024
@yanliang567 yanliang567 removed their assignment Jun 26, 2024
@cydrain
Contributor Author

cydrain commented Jun 27, 2024

| Milvus Version | Knowhere Version | Issue Reproduced |
|---|---|---|
| v2.3.4 | | Yes |
| v2.3.18 | v2.2.6 | Yes |
| v2.4.0 | v2.3.0 | No |
| v2.4.5 | | No |
| master (a08000c) | | No |

@cydrain
Contributor Author

cydrain commented Jun 27, 2024

The IVF range search parameter "max_empty_result_buckets" was introduced in knowhere v2.3.0 (zilliztech/knowhere#455).
So in Milvus v2.4.0 and later releases, users can improve IVF range search recall by increasing "max_empty_result_buckets".
In Milvus v2.3.x, there is no parameter for improving IVF range search recall.
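
The recall trade-off behind "max_empty_result_buckets" can be illustrated with a toy model. This is not knowhere's code: it assumes the parameter stops the scan once that many consecutive buckets contribute no in-range hit, and the function name `toy_ivf_range_search` and the bucket contents are invented for illustration.

```python
def toy_ivf_range_search(buckets, radius, range_filter, max_empty_result_buckets):
    """Toy model of an IVF range scan with early termination.

    `buckets` is a list of buckets, assumed to be ordered nearest centroid first;
    each bucket is a list of (id, similarity) pairs.
    """
    hits = []
    consecutive_empty = 0
    for bucket in buckets:
        in_range = [(i, s) for i, s in bucket if radius < s <= range_filter]
        if in_range:
            hits.extend(in_range)
            consecutive_empty = 0
        else:
            consecutive_empty += 1
            if consecutive_empty >= max_empty_result_buckets:
                break  # give up: too many buckets in a row produced nothing
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Invented data: the best match (id 5) sits in the fourth bucket.
toy_buckets = [
    [(1, 0.83), (2, 0.75)],
    [(3, 0.72)],
    [(4, 0.78)],
    [(5, 0.845)],
]

wide = toy_ivf_range_search(toy_buckets, 0.7, 1.0, 2)
narrow = toy_ivf_range_search(toy_buckets, 0.8, 1.0, 2)
narrow_patient = toy_ivf_range_search(toy_buckets, 0.8, 1.0, 3)
print(wide, narrow, narrow_patient)
```

In this model a tighter radius makes more buckets count as empty, so the scan can stop before it reaches a bucket holding a strong match; raising the threshold lets it continue at the cost of scanning more buckets.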

@cydrain
Contributor Author

cydrain commented Jun 27, 2024

/close

@sre-ci-robot
Contributor

@cydrain: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


3 participants