[Bug]: Zero Recall Rate, Very Strange for multi-vector search!!!!!!! #33294

Open
1 task done
JackTan25 opened this issue May 22, 2024 · 22 comments
Assignees
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@JackTan25

JackTan25 commented May 22, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: build from source commit 648d5661ca8771fadc664f427ac330b083b8734e
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):    kafka
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus  2.4.3
- OS(Ubuntu or CentOS): Ubuntu20.04
- CPU/Memory: 12c 128G
- GPU: 
- Others:

Current Behavior

The recall rate is zero.
[image]
None of the returned results are correct.

Expected Behavior

A high recall rate.

Steps To Reproduce

1. Modify the rescore code as follows:

func (ws *weightedScorer) getActivateFunc() activateFunc {
	mUpper := strings.ToUpper(ws.getMetricType())
	isCosine := mUpper == strings.ToUpper(metric.COSINE)
	isIP := mUpper == strings.ToUpper(metric.IP)
	if isCosine {
		f := func(distance float32) float32 {
			return (1 + distance) * 0.5
		}
		return f
	}

	if isIP {
		f := func(distance float32) float32 {
			return 0.5 + float32(math.Atan(float64(distance)))/math.Pi
		}
		return f
	}

	f := func(distance float32) float32 {
		// Return the distance unchanged; that is how my metric is meant to be used.
		return distance
	}
	return f
}

2. Run my scripts like this:
$ cd small_data
$ python3 data_load_small_data.py
$ python3 milvus_small_multi_vector.py > milvus.txt
$ python3 ground_truth.py > ground_truth.txt

Milvus Log

No response

Anything else?

No response

@JackTan25 JackTan25 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 22, 2024
@JackTan25
Author

JackTan25 commented May 22, 2024

small_data.zip
@czs007

  1. Modify the source code here:
    [image]
  2. unzip small_data.zip & cd small_data
  3. python3 data_load_small_data.py
  4. python3 milvus_small_multi_vector.py > milvus.txt
  5. python3 ground_truth.py > ground_truth.txt
  6. Compare the two result files (both sorted by distance): the distances and IDs are all completely wrong, so the recall rate is zero.
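
For readers following along, here is a minimal sketch of how a recall rate like the one discussed above could be computed from two ID lists. The helper name `recallAtK` and the sample IDs are hypothetical; the actual comparison in small_data.zip is not shown in this thread and may work differently.

```go
package main

import "fmt"

// recallAtK returns the fraction of the first k ground-truth IDs that also
// appear among the first k returned IDs. Hypothetical helper, not part of
// Milvus or of the scripts attached to this issue.
func recallAtK(groundTruth, returned []int64, k int) float64 {
	if k > len(groundTruth) {
		k = len(groundTruth)
	}
	if k == 0 {
		return 0
	}
	want := make(map[int64]struct{}, k)
	for _, id := range groundTruth[:k] {
		want[id] = struct{}{}
	}
	limit := k
	if limit > len(returned) {
		limit = len(returned)
	}
	hits := 0
	for _, id := range returned[:limit] {
		if _, ok := want[id]; ok {
			hits++
			delete(want, id) // count each ground-truth ID at most once
		}
	}
	return float64(hits) / float64(k)
}

func main() {
	truth := []int64{7, 3, 9, 1} // made-up exact top-4 IDs
	got := []int64{3, 8, 7, 5}   // made-up IDs returned by the search
	fmt.Printf("recall@4 = %.2f\n", recallAtK(truth, got, 4)) // recall@4 = 0.50
}
```

A recall of zero under this definition means that none of the returned IDs are in the ground-truth top k.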

@JackTan25 JackTan25 changed the title [Bug]: Zero Recall Rerate, Very Strange for multi-vector search!!!!!!! [Bug]: Zero Recall Rate, Very Strange for multi-vector search!!!!!!! May 22, 2024
@yanliang567
Contributor

/assign @czs007
/unassign

@sre-ci-robot sre-ci-robot assigned czs007 and unassigned yanliang567 May 23, 2024
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 23, 2024
@yanliang567 yanliang567 added this to the 2.4.2 milestone May 23, 2024
@JackTan25
Author

JackTan25 commented May 23, 2024

I'm in a hurry, so please help me. Thanks very much. cc @czs007. Can you reproduce it?

@czs007
Contributor

czs007 commented May 23, 2024

working on it. @JackTan25

@JackTan25
Author

> working on it. @JackTan25

Thanks, can you reproduce it? cc @czs007

@czs007
Contributor

czs007 commented May 23, 2024

@JackTan25

big := func(i, j int) bool {

Please change the comparison function here from "big" to "small". With the previous activation function, a larger score was better, so the ranking sorted in descending order. Now that your function returns the original L2 distance, a smaller value is better, so the results should be sorted in ascending order.
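
A minimal sketch of the sort-direction point, assuming a simple list of (id, score) pairs. The `scored` type and the `rank` helper are made up for illustration and are not the Milvus rerank code; only the idea of flipping the comparison comes from the comment above.

```go
package main

import (
	"fmt"
	"sort"
)

// scored is a hypothetical (id, score) pair.
type scored struct {
	id    int64
	score float32
}

// rank sorts hits in place. When largerIsBetter is true (a similarity-style
// score), the order is descending; otherwise (a raw distance such as L2),
// the order is ascending.
func rank(hits []scored, largerIsBetter bool) {
	sort.SliceStable(hits, func(i, j int) bool {
		if largerIsBetter {
			return hits[i].score > hits[j].score
		}
		return hits[i].score < hits[j].score
	})
}

func main() {
	hits := []scored{{1, 0.9}, {2, 0.1}, {3, 0.4}}
	rank(hits, false)       // raw L2 distance: smallest first
	fmt.Println(hits[0].id) // 2
}
```

If raw distances are sorted in descending order instead, the farthest entities come first, which would be consistent with a recall rate of zero.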

@JackTan25
Author

Ok, let me try it. Thanks.

@JackTan25
Author

Hi, I tested a 1,000-row dataset and it works as expected, but when the dataset grows to 1 million rows (100w), I get only 2%. I want to raise the limit of 16384; where in the source code should I change it? @czs007

@czs007
Contributor

czs007 commented May 23, 2024

@JackTan25 what do you mean by 2%? recall?

@JackTan25
Author

Yes. I raised the top-k to 990,000 (99w) and now get 64%.

@czs007
Contributor

czs007 commented May 23, 2024

@JackTan25 Why does the value of TopK need to be so large?

@JackTan25
Author

JackTan25 commented May 23, 2024

It still seems strange to me. You can test with the small_data I gave you: when topk2 is low (50), the recall rate is still zero, and when I raise it to 1000 (the dataset has 1000 rows), it only reaches 94%. Multi-vector recall is very low, while single-vector search has very high recall and is fast. cc @czs007
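
One possible reason why a larger per-column top-k raises multi-vector recall (not confirmed anywhere in this thread) is that the entity with the best combined distance need not be near the top of any single column's list. The sketch below uses made-up per-column distances and hypothetical helpers (`topK`, `candidateSet`, `bestBySum`); it assumes candidates are gathered per column before reranking, which is an assumption rather than a statement about the Milvus implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// topK returns the indices of the k smallest distances.
func topK(dist []float64, k int) []int {
	idx := make([]int, len(dist))
	for i := range idx {
		idx[i] = i
	}
	sort.SliceStable(idx, func(a, b int) bool { return dist[idx[a]] < dist[idx[b]] })
	if k > len(idx) {
		k = len(idx)
	}
	return idx[:k]
}

// candidateSet is the union of the per-column top-k lists.
func candidateSet(k int, columns ...[]float64) map[int]bool {
	set := make(map[int]bool)
	for _, col := range columns {
		for _, id := range topK(col, k) {
			set[id] = true
		}
	}
	return set
}

// bestBySum returns the index with the smallest summed distance over all
// columns. It expects at least one column; all columns must have equal length.
func bestBySum(columns ...[]float64) int {
	best, bestSum := -1, 0.0
	for i := range columns[0] {
		sum := 0.0
		for _, col := range columns {
			sum += col[i]
		}
		if best == -1 || sum < bestSum {
			best, bestSum = i, sum
		}
	}
	return best
}

func main() {
	// Made-up distances from one query to six entities in two vector columns.
	a := []float64{0.1, 0.2, 0.5, 0.6, 0.9, 1.0}
	b := []float64{1.0, 0.9, 0.5, 0.6, 0.2, 0.1}
	best := bestBySum(a, b) // entity 2 has the smallest summed distance
	for _, k := range []int{2, 3} {
		fmt.Printf("k=%d: true best in candidates = %v\n", k, candidateSet(k, a, b)[best])
	}
}
```

With these numbers the true best entity is absent from the candidates at k=2 and present at k=3, so how much recall is lost depends on both the data and the query.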

@JackTan25
Author

The issue is that the query itself makes a big difference to the result: different queries get different recall.

@xiaofan-luan
Contributor

> I feel strange yet. I think you can test the small_data I give you, when the topk2 is low(50), the recall rate is still zero, when I lift it up to 1000(dataset is 1000), it can only get 94%. Multi Vector's recall is very low, but Single Vector recall is very high and quick. cc @czs007

I think it still means the ranking function you modified has a bug. Maybe you should debug into it.

@yanliang567 yanliang567 modified the milestones: 2.4.2, 2.4.3 May 24, 2024
@JackTan25
Author

JackTan25 commented May 24, 2024

> > I feel strange yet. I think you can test the small_data I give you, when the topk2 is low(50), the recall rate is still zero, when I lift it up to 1000(dataset is 1000), it can only get 94%. Multi Vector's recall is very low, but Single Vector recall is very high and quick. cc @czs007

> I think it still means the ranking function you modified has some bug. maybe you should debug into it

Well, the recall is low, but when I raise top-k2 the recall really does grow. The ranking function is right; maybe it's a bug in the algorithm. cc @xiaofan-luan

@JackTan25
Author

JackTan25 commented May 24, 2024

[image] [image]

I modified just these two places. cc @xiaofan-luan @czs007 Is this right?

@JackTan25
Author

@czs007 @xiaofan-luan Is there any other logic that we need to check? I'm not familiar with this code module.

@JackTan25
Author

Well, I found one thing: it seems that for the weighted rank, the score is not a sum over the vector columns but comes from a single column. @czs007 Where is the logic for this part?
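
To make the observation above concrete, here is a rough sketch of what summing weighted scores per entity across columns could look like. The `hit` type, the `weightedSum` helper, and all numbers are hypothetical; this is not Milvus's actual rerank code. It also shows how an entity that appears in only one column's result list ends up with a score from that single column.

```go
package main

import "fmt"

// hit is a hypothetical (id, score) pair from one vector column's search.
type hit struct {
	id    int64
	score float32
}

// weightedSum accumulates weight*score per ID across the per-column result
// lists. An ID missing from a column's list contributes nothing for that
// column, so its total reflects only the columns where it was returned.
func weightedSum(weights []float32, columns [][]hit) map[int64]float32 {
	total := make(map[int64]float32)
	for c, hits := range columns {
		for _, h := range hits {
			total[h.id] += weights[c] * h.score
		}
	}
	return total
}

func main() {
	colA := []hit{{1, 2}, {2, 4}} // made-up results for the first column
	colB := []hit{{2, 6}, {3, 8}} // made-up results for the second column
	total := weightedSum([]float32{0.5, 0.5}, [][]hit{colA, colB})
	fmt.Println(total[1], total[2], total[3]) // 1 5 4
}
```

Only ID 2 appears in both lists and gets a true two-column sum; IDs 1 and 3 carry a single column's contribution, which distorts the ordering when the scores are raw distances.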

@JackTan25
Author

> > > I feel strange yet. I think you can test the small_data I give you, when the topk2 is low(50), the recall rate is still zero, when I lift it up to 1000(dataset is 1000), it can only get 94%. Multi Vector's recall is very low, but Single Vector recall is very high and quick. cc @czs007

> > I think it still means the ranking function you modified has some bug. maybe you should debug into it

> well, I think the recall is low,but I modify the top-k2, it can really grow the recall. The ranking function is right, maybe it's the algorithm's bug. cc @xiaofan-luan

When I raise it to 1000, I get 100%. cc @xiaofan-luan

@JackTan25
Author

JackTan25 commented May 24, 2024

[image]

The code here is very strange. What does realTopK mean: the limit the user gives, or the number of vector columns? cc @czs007 @xiaofan-luan

@yanliang567 yanliang567 modified the milestones: 2.4.3, 2.4.4, 2.4.5 May 30, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.5, 2.4.6 Jun 26, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.6, 2.4.7 Jul 19, 2024
@yanliang567
Contributor

any updates?

@xiaofan-luan
Contributor

> [image] The code here is very strange here. What does the meaning of realTopK? The limit user gives or the number of vector column? cc @czs007 @xiaofan-luan

I guess realTopK just means top-k, but sometimes a search cannot return the full limit of results. For example, if you ask for top-k 1000 but there are only 500 entities in Milvus, then the real top-k is 500.
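
Read that way, the real top-k is just the smaller of the requested limit and the number of results actually available. A tiny sketch, with a hypothetical helper name that does not appear in the Milvus source:

```go
package main

import "fmt"

// effectiveTopK returns how many results can actually be returned: the
// requested limit, capped by the number of available results.
// Hypothetical helper illustrating the guess above.
func effectiveTopK(limit, available int) int {
	if available < limit {
		return available
	}
	return limit
}

func main() {
	fmt.Println(effectiveTopK(1000, 500)) // 500
}
```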

@yanliang567 yanliang567 modified the milestones: 2.4.7, 2.4.8 Aug 12, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.8, 2.4.10 Aug 19, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.10, 2.4.11 Sep 5, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.11, 2.4.12 Sep 18, 2024
Projects
None yet
Development

No branches or pull requests

4 participants