[Bug]: range search results are not the same as the intersection of search results with ranges #32630

yanliang567 · 2024-04-26T02:20:16Z

Is there an existing issue for this?

I have searched the existing issues

Environment

- Milvus version: master-20240425-f06509bf-amd64
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):

Current Behavior

My case is

create a collection and insert 10000_128d vectors
build index IVF_FLAT with nlist 128, metric=L2
load
gen a vectorA and get search_res with search for topk=1000
radius = search_res[0].distances[350]
range_filter = search_res[0].distances[50]
range search with the same vectorA for topk=300
check the range search result with the results of search_res[50:350]

expect:
300 results are same in search_res and range_search_results
actual:
only 51 of range_search_results hit the search_res[50:350], less than 20%

Expected Behavior

if it is hard to get 100% same results with search, it should be >90%

Steps To Reproduce

No response

Milvus Log

test case:

@pytest.mark.tags(CaseLabel.L0)
    @pytest.mark.parametrize("index_type",ct.all_index_types[:6])
    @pytest.mark.parametrize("metric", ct.float_metrics)
    @pytest.mark.parametrize("vector_data_type", ["FLOAT_VECTOR"])
    def test_range_search_default(self, index_type, metric, vector_data_type):
        """
        target: verify the range search returns correct results
        method: 1. create collection, insert 20000 vectors,
                2. search with topk=1000
                3. range search with first 100 distance as filter
                4. verified the range search results is same as the search top 100 results
        """
        collection_w = self.init_collection_general(prefix, auto_id=True, insert_data=False, is_index=False,
                                                    vector_data_type=vector_data_type, with_json=False)[0]
        for _ in range(3):
            data = cf.gen_default_dataframe_data(nb=2000, auto_id=True, with_json=False)
            collection_w.insert(data)

        collection_w.flush()
        _index_params = {"index_type": index_type, "metric_type": metric, "params": cf.get_index_params_params(index_type)}
        collection_w.create_index(ct.default_float_vec_field_name, index_params=_index_params)
        collection_w.load()

        for i in range(2):
            if i % 2 != 0:
                # add some growing segments
                log.debug("verify range search with growing segments")
                for _ in range(2):
                    data = cf.gen_default_dataframe_data(nb=2000, auto_id=True, with_json=False)
                    collection_w.insert(data)

            search_params = {"metric_type": metric, "params": {}}
            nq = 1
            search_vectors = cf.gen_vectors(nq, ct.default_dim)
            search_res = collection_w.search(search_vectors, default_search_field,
                                             search_params, limit=1000)[0]
            # assert len(search_res[0].ids) == 1000
            log.debug(f"search topk=1000 returns {len(search_res[0].ids)}")
            check_topk = 300
            check_from = 30
            ids = search_res[0].ids[check_from:check_from+check_topk]
            radius = search_res[0].distances[check_from+check_topk]
            range_filter = search_res[0].distances[check_from]
            # if metric != "L2":
            #     tmp = radius
            #     radius = range_filter
            #     range_filter = tmp
            range_search_params = {"metric_type": metric, "params": {"radius": radius, "range_filter": range_filter}}
            range_res = collection_w.search(search_vectors, default_search_field,
                                            range_search_params, limit=check_topk)[0]
            range_ids = range_res[0].ids
            log.debug(f"range search radius={radius}, range_filter={range_filter}, results num: {len(range_ids)}")
            hit_rate = round(len(set(ids).intersection(set(range_ids))) / len(set(ids)), 2)
            log.debug(f"range search results hit rate: {hit_rate}")
            assert hit_rate >= 0.9

Anything else?

The same situation with IVF_*, SCCAN Index. HNSW has a better result of 80% hit rate.
Checking the first 10 results, you will see the range_res have better distances

The text was updated successfully, but these errors were encountered:

yanliang567 · 2024-04-26T02:21:17Z

/assign @liliu-z
/unassign

yanliang567 · 2024-04-26T06:00:44Z

changed a bit for my test: use FLAT index for the first search and then rebuild the collection with test target index type, the results is a bit better --50%, but still not perfect.

liliu-z · 2024-04-28T09:30:46Z

Try enlarge range search param

liliu-z · 2024-04-28T09:41:04Z

/assign @yanliang567

liliu-z · 2024-04-28T09:41:08Z

/unassign

yanliang567 · 2024-04-28T09:45:30Z

Try enlarge range search param

range search results are already better than search results.
I understand the approaches are different, but I also think we should align the results between search and range search, or it is very confusing to users. @xiaofan-luan @liliu-z
/assign @xiaofan-luan

yanliang567 · 2024-04-28T09:46:16Z

besides, we need to align the results in search, range_search, search with pagination, grouping search, etc.

liliu-z · 2024-04-28T09:51:10Z

Try enlarge range search param

range search results are already better than search results. I understand the approaches are different, but I also think we should align the results between search and range search, or it is very confusing to users. @xiaofan-luan @liliu-z /assign @xiaofan-luan

@yanliang567 I don't agree about the result alignment

search/range are using different algos, it is impossible to do this result alignment
I didn't see any necessity to combine the usage of topK search and range search. In that way, we should change range search into topK with distance filtering instead of an independent search type.

Also, we have iterator, search, range search, I didn't see a strong signal that we need to align all the results.

yanliang567 · 2024-04-29T01:27:21Z

@liliu-z From a user’s point of view, he does not care about the algos, he is just confusing about the results are not fitting for each other, and he does not know which results are accurate

xiaofan-luan · 2024-04-29T13:45:30Z

I think there is no necessity to sync all the search result.
But it does make sense to have at least 90% similar result. Otherwise this may indicate a huge accuracy loss?
@yanliang567 @liliu-z

liliu-z · 2024-05-06T03:21:02Z

I think there is no necessity to sync all the search result. But it does make sense to have at least 90% similar result. Otherwise this may indicate a huge accuracy loss? @yanliang567 @liliu-z

Yes, we have a param that can be tuned to get better recall

xiaofan-luan · 2024-05-06T04:12:11Z

let's set the goal to be:

default params has more than 90% recall
through the knob tuning we can get 95% recall or higher

yanliang567 · 2024-05-07T01:39:01Z

Then we need some improvement here in range search. /assign @liliu-z

cydrain · 2024-06-06T10:03:19Z

/assign

cydrain · 2024-06-11T02:56:35Z

cqy123456 · 2024-06-12T03:29:04Z

growing index still not support fp16/bf16/binary vector.

cqy123456 · 2024-06-12T09:24:25Z

plz use this code try again:

@pytest.mark.tags(CaseLabel.L0)
@pytest.mark.parametrize("vector_data_type", ct.all_float_vector_types)
@pytest.mark.parametrize("with_growing", [True, False])
def test_range_search_default(self, index_type, metric, vector_data_type, with_growing):
    """
    target: verify the range search returns correct results
    method: 1. create collection, insert 8000 vectors,
            2. search with topk=1000
            3. range search from the 30th-330th distance as filter
            4. verified the range search results is same as the search results in the range
    """
    counter = 0
    collection_w = self.init_collection_general(prefix, auto_id=True, insert_data=False, is_index=False,
                                                vector_data_type=vector_data_type, with_json=False)[0]
    nb = 2000
    for i in range(3):
        data = cf.gen_general_default_list_data(nb=nb, auto_id=True, start=counter,
                                                vector_data_type=vector_data_type, with_json=False)
        collection_w.insert(data)
        counter = counter + nb

    collection_w.flush()
    _index_params = {"index_type": "FLAT", "metric_type": metric, "params": {}}
    collection_w.create_index(ct.default_float_vec_field_name, index_params=_index_params)
    collection_w.load()

    if with_growing is True:
        # add some growing segments
        for _ in range(2):
            data = cf.gen_general_default_list_data(nb=nb, auto_id=True, start=counter,
                                                    vector_data_type=vector_data_type, with_json=False)
            collection_w.insert(data)
            counter = counter + nb

    search_params = {"params": {}}
    nq = 1
    search_vectors = cf.gen_vectors(nq, ct.default_dim, vector_data_type=vector_data_type)
    search_res = collection_w.search(search_vectors, default_search_field,
                                        search_params, limit=1000)[0]
    assert len(search_res[0].ids) == 1000
    log.debug(f"search topk=1000 returns {len(search_res[0].ids)}")
    check_topk = 300
    check_from = 30
    ids = search_res[0].ids[check_from:check_from + check_topk]
    radius = search_res[0].distances[check_from + check_topk]
    range_filter = search_res[0].distances[check_from]

    # rebuild the collection with test target index
    collection_w.release()
    collection_w.indexes[0].drop()
    _index_params = {"index_type": index_type, "metric_type": metric,
                        "params": cf.get_index_params_params(index_type)}
    collection_w.create_index(ct.default_float_vec_field_name, index_params=_index_params)
    collection_w.load()

    params = cf.get_search_params_params(index_type)
    params.update({"radius": radius, "range_filter": range_filter})
    if index_type == "HNSW":
        params.update({"ef": check_topk+100})
    if index_type == "IVF_PQ":
        params.update({"max_empty_result_buckets": 100})
    range_search_params = {"params": params}
    range_res = collection_w.search(search_vectors, default_search_field,
                                    range_search_params, limit=check_topk)[0]
    range_ids = range_res[0].ids
    # assert len(range_ids) == check_topk
    log.debug(f"cqy: index params: _index_params{_index_params}")
    log.debug(f"cqy: knn ids={ids}")
    log.debug(f"cqy: knn dis={search_res[0].distances[check_from:check_from + check_topk]}")
    log.debug(f"cqy: range_search ids={range_res[0].ids}")
    log.debug(f"cqy: range_search dis={range_res[0].distances}")
    log.debug(f"cqy: range search radius={radius}, range_filter={range_filter}, range results num: {len(range_ids)}")
    hit_rate = round(len(set(ids).intersection(set(range_ids))) / len(set(ids)), 2)
    log.debug(f"cqy: range search results with growing {with_growing} hit rate: {hit_rate}")
    assert hit_rate >= 0.2    # issue #32630 to improve the accuracy

cqy123456 · 2024-06-12T09:26:17Z

test case in milvus may has some problem. when i = 1, use current index to generate gt, instead of flat.

cydrain · 2024-06-13T02:53:20Z

some errors in the test script:

in line 6912 and 6926, "cf.gen_general_default_list_data" is called to generate test data, but this API always generates id from 0 by default, should change it like following:

    row_cnt = 0
    nb = 2000
    for i in range(3):
        data = cf.gen_general_default_list_data(nb=nb, auto_id=True, start=row_cnt,
                                                vector_data_type=vector_data_type, with_json=False)
        collection_w.insert(data)
        row_cnt += nb

2. in the 1st loop, "FLAT" index is dropped and new specified index is created, so in the 2nd loop, this new created index is used to calculate the golden, and makes the recall calculation meaningless.

this test script need update @yanliang567

cydrain · 2024-06-13T03:02:11Z

/assign @yanliang567

yanliang567 · 2024-06-17T08:59:11Z

updating and rerunning the tests...

yanliang567 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 26, 2024

yanliang567 self-assigned this Apr 26, 2024

yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 26, 2024

yanliang567 added this to the 2.4.1 milestone Apr 26, 2024

sre-ci-robot assigned liliu-z and unassigned yanliang567 Apr 26, 2024

sre-ci-robot assigned yanliang567 Apr 28, 2024

sre-ci-robot unassigned liliu-z Apr 28, 2024

sre-ci-robot assigned xiaofan-luan Apr 28, 2024

yanliang567 assigned liliu-z May 7, 2024

yanliang567 modified the milestones: 2.4.1, 2.4.2 May 7, 2024

yanliang567 modified the milestones: 2.4.2, 2.4.3 May 24, 2024

yanliang567 added this to the 2.4.4 milestone May 30, 2024

yanliang567 mentioned this issue Jun 3, 2024

[Bug]: [Nightly] The accuracy of range search using FLOAT16_VECTOR is lower than expected #33512

Open

1 task

yanliang567 modified the milestones: 2.4.4, 2.4.5 Jun 5, 2024

sre-ci-robot assigned cydrain Jun 6, 2024

yanliang567 removed their assignment Jun 7, 2024

sre-ci-robot assigned yanliang567 Jun 13, 2024

yanliang567 modified the milestones: 2.4.5, 2.4.6 Jun 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug]: range search results are not the same as the intersection of search results with ranges #32630

[Bug]: range search results are not the same as the intersection of search results with ranges #32630

yanliang567 commented Apr 26, 2024

yanliang567 commented Apr 26, 2024

yanliang567 commented Apr 26, 2024

liliu-z commented Apr 28, 2024

liliu-z commented Apr 28, 2024

liliu-z commented Apr 28, 2024

yanliang567 commented Apr 28, 2024

yanliang567 commented Apr 28, 2024

liliu-z commented Apr 28, 2024

yanliang567 commented Apr 29, 2024

xiaofan-luan commented Apr 29, 2024

liliu-z commented May 6, 2024

xiaofan-luan commented May 6, 2024

yanliang567 commented May 7, 2024

cydrain commented Jun 6, 2024

cydrain commented Jun 11, 2024

cqy123456 commented Jun 12, 2024

cqy123456 commented Jun 12, 2024

cqy123456 commented Jun 12, 2024

cydrain commented Jun 13, 2024 •

edited

Loading

cydrain commented Jun 13, 2024

yanliang567 commented Jun 17, 2024

[Bug]: range search results are not the same as the intersection of search results with ranges #32630

[Bug]: range search results are not the same as the intersection of search results with ranges #32630

Comments

yanliang567 commented Apr 26, 2024

Is there an existing issue for this?

Environment

Current Behavior

Expected Behavior

Steps To Reproduce

Milvus Log

Anything else?

yanliang567 commented Apr 26, 2024

yanliang567 commented Apr 26, 2024

liliu-z commented Apr 28, 2024

liliu-z commented Apr 28, 2024

liliu-z commented Apr 28, 2024

yanliang567 commented Apr 28, 2024

yanliang567 commented Apr 28, 2024

liliu-z commented Apr 28, 2024

yanliang567 commented Apr 29, 2024

xiaofan-luan commented Apr 29, 2024

liliu-z commented May 6, 2024

xiaofan-luan commented May 6, 2024

yanliang567 commented May 7, 2024

cydrain commented Jun 6, 2024

cydrain commented Jun 11, 2024

cqy123456 commented Jun 12, 2024

cqy123456 commented Jun 12, 2024

cqy123456 commented Jun 12, 2024

cydrain commented Jun 13, 2024 • edited Loading

cydrain commented Jun 13, 2024

yanliang567 commented Jun 17, 2024

cydrain commented Jun 13, 2024 •

edited

Loading