Sudden high search latencies with 50% used QueryNodes #33293
-
Some questions:
CPU usage of the querynode is not high, which indicates the bottleneck is in query(), since query() might fetch data from storage.
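For illustration, a minimal sketch (assuming pymilvus; the collection name, filter expression, and output fields are hypothetical) that times a query() call from the client side, which can help confirm whether fetching field data from storage is what dominates the latency:

```python
import time
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
coll = Collection("my_collection")  # hypothetical collection name

start = time.perf_counter()
# Fetching output fields forces the query node to read field data from
# storage, which can be slow even while CPU usage stays low.
res = coll.query(expr="id > 0", output_fields=["id"], limit=100)
print(f"query() took {time.perf_counter() - start:.3f}s for {len(res)} rows")
```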
-
I was thinking about the topK value, which is not the one we requested. Would it be possible that the query nodes ask for a higher topK depending on the targeted bucket? Other than that, I noticed that in the search topK, there are 2 query nodes that are not contributing.
-
Hi!
Now the load is more spread across all query nodes, and I'm starting to wonder if the bottleneck is the fact that all collections have 4 shards: with that many shard delegators and growing segments per collection, plus 10 query nodes, maybe it's a bit too many messages for the querynode delegators to process. Should we recreate all collections with shards_num=2? pprof is something we will do after this move.
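For reference, a minimal sketch (assuming pymilvus; the collection name, field names, and vector dimension are hypothetical) of recreating a collection with shards_num=2; shards_num is fixed at creation time, so moving from 4 to 2 shards means creating a new collection and re-ingesting the data:

```python
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="localhost", port="19530")

schema = CollectionSchema(fields=[
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
])

# shards_num cannot be changed on an existing collection, so the data has to
# be re-ingested into the new one.
coll = Collection(name="my_collection_v2", schema=schema, shards_num=2)
```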
-
I don't think reducing the shard number will really help, unless the querynodes are the bottleneck. But you could try with the next release. The segment number distribution seems to be not very balanced; did you check the reason?
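As an illustration, a minimal sketch (assuming pymilvus; the collection name is hypothetical, and attribute names such as nodeID and num_rows may vary across pymilvus versions) that summarizes how loaded segments are spread across query nodes, which is one way to look at the distribution:

```python
from collections import Counter
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")

# One entry per loaded segment, including which query node serves it.
segments = utility.get_query_segment_info("my_collection")

segs_per_node = Counter()
rows_per_node = Counter()
for seg in segments:
    segs_per_node[seg.nodeID] += 1
    rows_per_node[seg.nodeID] += seg.num_rows

for node_id in sorted(segs_per_node):
    print(f"query node {node_id}: {segs_per_node[node_id]} segments, "
          f"{rows_per_node[node_id]} rows")
```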
-
Hey!
Thanks a lot again for your help and availability, which we really appreciate <3
-
Hey!
It seems we have finally stabilized the situation and understood a lot of things. Here is a summary of everything we did to conclude this issue: