[Bug]: [benchmark][cluster] After queryNode pod failure is restored, hybrid_search RT increases in replica=2 scene #32466

Closed
1 task done
wangting0128 opened this issue Apr 19, 2024 · 3 comments
Assignees
Labels
2.4-features
kind/bug: Issues or changes related to a bug
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
test/benchmark: benchmark test
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@wangting0128
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4-20240417-dff96c32-amd64
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar    
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0rc66
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: multi-vector-chaos-2-v5j4l

server:

NAME                                                              READY   STATUS                            RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
multi-vector-chaos-2-v5j4l-etcd-0                                 1/1     Running                           0               6m54s   10.104.28.105   4am-node33   <none>           <none>
multi-vector-chaos-2-v5j4l-etcd-1                                 1/1     Running                           0               6m53s   10.104.27.201   4am-node31   <none>           <none>
multi-vector-chaos-2-v5j4l-etcd-2                                 1/1     Running                           0               6m53s   10.104.29.100   4am-node35   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-datacoord-69b9555746-ss4jw      1/1     Running                           0               6m54s   10.104.14.140   4am-node18   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-datanode-69f78d5bf7-lkkkj       1/1     Running                           1 (2m25s ago)   6m54s   10.104.14.143   4am-node18   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-indexcoord-65dc8c9cc-vxkt6      1/1     Running                           0               6m54s   10.104.24.155   4am-node29   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-indexnode-7c6c84d97d-l8l55      1/1     Running                           0               6m53s   10.104.27.192   4am-node31   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-indexnode-7c6c84d97d-nmbcc      1/1     Running                           0               6m53s   10.104.31.205   4am-node34   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-indexnode-7c6c84d97d-rpmcg      1/1     Running                           0               6m53s   10.104.9.6      4am-node14   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-proxy-5b64744657-mkz2v          1/1     Running                           1 (2m23s ago)   6m54s   10.104.14.141   4am-node18   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-querycoord-74c6556d54-8bngv     1/1     Running                           1 (2m24s ago)   6m54s   10.104.14.142   4am-node18   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-querynode-5d7b7c9548-cw48h      1/1     Running                           0               6m54s   10.104.21.143   4am-node24   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-querynode-5d7b7c9548-sqffh      1/1     Running                           0               6m54s   10.104.15.209   4am-node20   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-querynode-5d7b7c9548-vf55p      1/1     Running                           0               6m54s   10.104.16.243   4am-node21   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-rootcoord-5f4b9fb754-dntt7      1/1     Running                           1 (2m39s ago)   6m54s   10.104.29.86    4am-node35   <none>           <none>
multi-vector-chaos-2-v5j4l-minio-0                                1/1     Running                           0               6m54s   10.104.27.196   4am-node31   <none>           <none>
multi-vector-chaos-2-v5j4l-minio-1                                1/1     Running                           0               6m54s   10.104.29.97    4am-node35   <none>           <none>
multi-vector-chaos-2-v5j4l-minio-2                                1/1     Running                           0               6m53s   10.104.28.108   4am-node33   <none>           <none>
multi-vector-chaos-2-v5j4l-minio-3                                1/1     Running                           0               6m53s   10.104.24.166   4am-node29   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-bookie-0                        1/1     Running                           0               6m54s   10.104.27.200   4am-node31   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-bookie-1                        1/1     Running                           0               6m53s   10.104.29.98    4am-node35   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-bookie-2                        1/1     Running                           0               6m53s   10.104.28.110   4am-node33   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-bookie-init-jlnqm               0/1     Completed                         0               6m54s   10.104.24.154   4am-node29   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-broker-0                        1/1     Running                           0               6m54s   10.104.24.156   4am-node29   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-proxy-0                         1/1     Running                           0               6m54s   10.104.28.98    4am-node33   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-pulsar-init-474zn               0/1     Completed                         0               6m54s   10.104.24.153   4am-node29   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-recovery-0                      1/1     Running                           0               6m54s   10.104.15.208   4am-node20   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-zookeeper-0                     1/1     Running                           0               6m54s   10.104.24.163   4am-node29   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-zookeeper-1                     1/1     Running                           0               6m2s    10.104.26.119   4am-node32   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-zookeeper-2                     1/1     Running                           0               5m16s   10.104.34.137   4am-node37   <none>           <none>

after pod failure:

NAME                                                              READY   STATUS                            RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
multi-vector-chaos-2-v5j4l-etcd-0                                 1/1     Running                           0               6h47m   10.104.28.105   4am-node33   <none>           <none>
multi-vector-chaos-2-v5j4l-etcd-1                                 1/1     Running                           0               6h47m   10.104.27.201   4am-node31   <none>           <none>
multi-vector-chaos-2-v5j4l-etcd-2                                 1/1     Running                           0               6h47m   10.104.29.100   4am-node35   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-datacoord-69b9555746-ss4jw      1/1     Running                           0               6h47m   10.104.14.140   4am-node18   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-datanode-69f78d5bf7-lkkkj       1/1     Running                           1 (6h43m ago)   6h47m   10.104.14.143   4am-node18   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-indexcoord-65dc8c9cc-vxkt6      1/1     Running                           0               6h47m   10.104.24.155   4am-node29   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-indexnode-7c6c84d97d-l8l55      1/1     Running                           0               6h47m   10.104.27.192   4am-node31   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-indexnode-7c6c84d97d-nmbcc      1/1     Running                           0               6h47m   10.104.31.205   4am-node34   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-indexnode-7c6c84d97d-rpmcg      1/1     Running                           0               6h47m   10.104.9.6      4am-node14   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-proxy-5b64744657-mkz2v          1/1     Running                           1 (6h43m ago)   6h47m   10.104.14.141   4am-node18   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-querycoord-74c6556d54-8bngv     1/1     Running                           1 (6h43m ago)   6h47m   10.104.14.142   4am-node18   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-querynode-5d7b7c9548-cw48h      1/1     Running                           22 (112m ago)   6h47m   10.104.21.143   4am-node24   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-querynode-5d7b7c9548-sqffh      1/1     Running                           0               6h47m   10.104.15.209   4am-node20   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-querynode-5d7b7c9548-vf55p      1/1     Running                           0               6h47m   10.104.16.243   4am-node21   <none>           <none>
multi-vector-chaos-2-v5j4l-milvus-rootcoord-5f4b9fb754-dntt7      1/1     Running                           1 (6h43m ago)   6h47m   10.104.29.86    4am-node35   <none>           <none>
multi-vector-chaos-2-v5j4l-minio-0                                1/1     Running                           0               6h47m   10.104.27.196   4am-node31   <none>           <none>
multi-vector-chaos-2-v5j4l-minio-1                                1/1     Running                           0               6h47m   10.104.29.97    4am-node35   <none>           <none>
multi-vector-chaos-2-v5j4l-minio-2                                1/1     Running                           0               6h47m   10.104.28.108   4am-node33   <none>           <none>
multi-vector-chaos-2-v5j4l-minio-3                                1/1     Running                           0               6h47m   10.104.24.166   4am-node29   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-bookie-0                        1/1     Running                           0               6h47m   10.104.27.200   4am-node31   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-bookie-1                        1/1     Running                           0               6h47m   10.104.29.98    4am-node35   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-bookie-2                        1/1     Running                           0               6h47m   10.104.28.110   4am-node33   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-bookie-init-jlnqm               0/1     Completed                         0               6h47m   10.104.24.154   4am-node29   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-broker-0                        1/1     Running                           0               6h47m   10.104.24.156   4am-node29   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-proxy-0                         1/1     Running                           0               6h47m   10.104.28.98    4am-node33   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-pulsar-init-474zn               0/1     Completed                         0               6h47m   10.104.24.153   4am-node29   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-recovery-0                      1/1     Running                           0               6h47m   10.104.15.208   4am-node20   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-zookeeper-0                     1/1     Running                           0               6h47m   10.104.24.163   4am-node29   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-zookeeper-1                     1/1     Running                           0               6h46m   10.104.26.119   4am-node32   <none>           <none>
multi-vector-chaos-2-v5j4l-pulsar-zookeeper-2                     1/1     Running                           0               6h45m   10.104.34.137   4am-node37   <none>           <none>
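To see which pods were affected between the two listings above, the restart counts can be compared. The following is a minimal sketch: the function names `restart_counts` and `restart_delta` are illustrative, the sample rows are abbreviated copies of the listings (trailing columns dropped), and it assumes columns are separated by at least two spaces.

```python
import re


def restart_counts(listing):
    """Map pod name -> restart count from `kubectl get pods -o wide` text."""
    counts = {}
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        cols = re.split(r"\s{2,}", line.strip())
        # cols: NAME, READY, STATUS, RESTARTS, AGE, ...
        # RESTARTS looks like "0" or "22 (112m ago)"; keep the leading integer.
        counts[cols[0]] = int(cols[3].split()[0])
    return counts


def restart_delta(before, after):
    """Return pods whose restart count grew, with the size of the increase."""
    return {
        pod: after[pod] - before.get(pod, 0)
        for pod in after
        if after[pod] > before.get(pod, 0)
    }


BEFORE = """\
NAME                                                              READY   STATUS    RESTARTS        AGE
multi-vector-chaos-2-v5j4l-milvus-proxy-5b64744657-mkz2v          1/1     Running   1 (2m23s ago)   6m54s
multi-vector-chaos-2-v5j4l-milvus-querynode-5d7b7c9548-cw48h      1/1     Running   0               6m54s
multi-vector-chaos-2-v5j4l-milvus-querynode-5d7b7c9548-sqffh      1/1     Running   0               6m54s
"""

AFTER = """\
NAME                                                              READY   STATUS    RESTARTS        AGE
multi-vector-chaos-2-v5j4l-milvus-proxy-5b64744657-mkz2v          1/1     Running   1 (6h43m ago)   6h47m
multi-vector-chaos-2-v5j4l-milvus-querynode-5d7b7c9548-cw48h      1/1     Running   22 (112m ago)   6h47m
multi-vector-chaos-2-v5j4l-milvus-querynode-5d7b7c9548-sqffh      1/1     Running   0               6h47m
"""

delta = restart_delta(restart_counts(BEFORE), restart_counts(AFTER))
print(delta)
```

On these rows only the querynode pod ending in `cw48h` shows an increase, which matches the pod targeted by the failure injection.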

pod failure: 2024-04-17 06:20:35 ~ 2024-04-17 08:22:38
Screenshot 2024-04-19 16 37 06

Proxy Search Latency
image
Screenshot 2024-04-19 16 36 30

queryNode Search Request Latency
After the failed queryNode comes back online, hybrid_search requests are still not sent to it, which causes RT to increase.
Screenshot 2024-04-19 16 37 35
Screenshot 2024-04-19 16 41 21

client pod name: multi-vector-chaos-2-v5j4l-2424246326

Expected Behavior

No response

Steps To Reproduce

1. Run concurrent hybrid_search for 6h (reqs=10, concurrent number=50).
2. After the concurrent test has run for 2h, inject a queryNode pod failure lasting 2h.
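A single hybrid_search call from step 1 might look roughly like the sketch below. It is not the benchmark's own client code. It assumes the pymilvus 2.4 `AnnSearchRequest`, `RRFRanker`, and `Collection.hybrid_search` API, an already established default connection, and a hypothetical collection name passed by the caller. Only the first three of the ten sub-requests from the report are listed; sub-requests without their own `top_k` are assumed to fall back to the top-level `top_k` of 10, and the metric type for `float_vector` is assumed to be the dataset-level `L2`.

```python
import random

# Field metadata copied from the report (dims, metric types).
FIELDS = {
    "float_vector": {"dim": 128, "metric_type": "L2", "binary": False},
    "float_vector_1": {"dim": 200, "metric_type": "L2", "binary": False},
    "binary_vector_1": {"dim": 512, "metric_type": "JACCARD", "binary": True},
}

# First three of the ten sub-requests in 'concurrent_tasks'.
REQ_SPECS = [
    {"anns_field": "float_vector", "search_param": {"nprobe": 32},
     "expr": "int64_1 < 100000"},
    {"anns_field": "float_vector_1", "search_param": {"ef": 64},
     "expr": "id > 10", "top_k": 60},
    {"anns_field": "binary_vector_1", "search_param": {"nprobe": 64},
     "top_k": 2000},
]


def random_query(field):
    """Build one random query vector (nq=1) for the given field."""
    meta = FIELDS[field]
    if meta["binary"]:
        # Binary vectors are passed as bytes, 8 dimensions per byte.
        return [bytes(random.getrandbits(8) for _ in range(meta["dim"] // 8))]
    return [[random.random() for _ in range(meta["dim"])]]


def build_request_kwargs(spec, default_limit=10):
    """Translate one report entry into AnnSearchRequest keyword arguments."""
    field = spec["anns_field"]
    return {
        "data": random_query(field),
        "anns_field": field,
        "param": {"metric_type": FIELDS[field]["metric_type"],
                  "params": spec["search_param"]},
        "limit": spec.get("top_k", default_limit),  # assumed fallback
        "expr": spec.get("expr"),
    }


def run_hybrid_search(collection_name):
    """Issue one hybrid_search; requires pymilvus and a live connection."""
    from pymilvus import AnnSearchRequest, Collection, RRFRanker

    reqs = [AnnSearchRequest(**build_request_kwargs(s)) for s in REQ_SPECS]
    collection = Collection(collection_name)  # hypothetical collection name
    return collection.hybrid_search(reqs, RRFRanker(), limit=10, timeout=600)
```

The pymilvus import is kept inside `run_hybrid_search` so the request-building helpers can be inspected without a running cluster.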

Milvus Log

No response

Anything else?

test result:

[2024-04-17 10:05:27,512 -  INFO - fouram]: Print locust final stats. (locust_runner.py:56)
[2024-04-17 10:05:27,512 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-04-17 10:05:27,512 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-04-17 10:05:27,512 -  INFO - fouram]: grpc     hybrid_search                                                                 287537     0(0.00%) |   3744     685   20072   2900 |   13.31        0.00 (stats.py:789)
[2024-04-17 10:05:27,512 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-04-17 10:05:27,512 -  INFO - fouram]:          Aggregated                                                                    287537     0(0.00%) |   3744     685   20072   2900 |   13.31        0.00 (stats.py:789)
[2024-04-17 10:05:27,512 -  INFO - fouram]:  (stats.py:790)
[2024-04-17 10:05:27,516 -  INFO - fouram]: [PerfTemplate] Report data: 
{'server': {'deploy_tool': 'helm',
            'deploy_mode': 'cluster',
            'config_name': 'cluster_2c2m',
            'config': {'queryNode': {'resources': {'limits': {'cpu': '8.0',
                                                              'memory': '32Gi'},
                                                   'requests': {'cpu': '5.0',
                                                                'memory': '17Gi'}},
                                     'replicas': 3},
                       'indexNode': {'resources': {'limits': {'cpu': 16,
                                                              'memory': '8Gi'},
                                                   'requests': {'cpu': 9,
                                                                'memory': '5Gi'}},
                                     'replicas': 3},
                       'dataNode': {'resources': {'limits': {'cpu': '2.0',
                                                             'memory': '2Gi'},
                                                  'requests': {'cpu': '2.0',
                                                               'memory': '2Gi'}}},
                       'cluster': {'enabled': True},
                       'pulsar': {},
                       'kafka': {},
                       'minio': {'metrics': {'podMonitor': {'enabled': True}}},
                       'etcd': {'metrics': {'enabled': True,
                                            'podMonitor': {'enabled': True}}},
                       'metrics': {'serviceMonitor': {'enabled': True}},
                       'log': {'level': 'debug'},
                       'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus',
                                         'tag': '2.4-20240417-dff96c32-amd64'}}},
            'host': 'multi-vector-chaos-2-v5j4l-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_concurrent_locust_ivf_sq8_hybrid_search_cluster',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'scalars_index': {'int64_1': {},
                                                                      'id': {'index_type': 'INVERTED'}},
                                                    'vectors_index': {'float_vector_1': {'index_type': 'HNSW',
                                                                                         'index_param': {'M': 8,
                                                                                                         'efConstruction': 200},
                                                                                         'metric_type': 'L2'},
                                                                      'binary_vector_1': {'index_type': 'BIN_IVF_FLAT',
                                                                                          'index_param': {'nlist': 2048},
                                                                                          'metric_type': 'JACCARD'}},
                                                    'scalars_params': {'float_vector_1': {'params': {'dim': 200},
                                                                                          'other_params': {'dataset': 'text2img',
                                                                                                           'dim': 200}},
                                                                       'binary_vector_1': {'params': {'dim': 512},
                                                                                           'other_params': {'dataset': 'binary',
                                                                                                            'dim': 512}},
                                                                       'array_varchar_1': {'params': {'max_length': 10,
                                                                                                      'max_capacity': 5},
                                                                                           'other_params': {'varchar_filled': False}}},
                                                    'dataset_name': 'sift',
                                                    'dataset_size': '5m',
                                                    'ni_per': 5000},
                                 'collection_params': {'other_fields': ['float_vector_1',
                                                                        'array_varchar_1',
                                                                        'int64_1',
                                                                        'binary_vector_1'],
                                                       'shards_num': 2},
                                 'load_params': {'replica_number': 2},
                                 'index_params': {'index_type': 'IVF_SQ8',
                                                  'index_param': {'nlist': 1024}},
                                 'concurrent_params': {'concurrent_number': 50,
                                                       'during_time': '6h',
                                                       'interval': 20,
                                                       'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'hybrid_search',
                                                       'weight': 1,
                                                       'params': {'nq': 1,
                                                                  'top_k': 10,
                                                                  'reqs': [{'search_param': {'nprobe': 32},
                                                                            'anns_field': 'float_vector',
                                                                            'expr': 'int64_1 '
                                                                                    '< '
                                                                                    '100000'},
                                                                           {'search_param': {'ef': 64},
                                                                            'anns_field': 'float_vector_1',
                                                                            'expr': 'id '
                                                                                    '> '
                                                                                    '10',
                                                                            'top_k': 60},
                                                                           {'search_param': {'nprobe': 64},
                                                                            'anns_field': 'binary_vector_1',
                                                                            'top_k': 2000},
                                                                           {'search_param': {'nprobe': 32},
                                                                            'anns_field': 'float_vector',
                                                                            'expr': 'int64_1 '
                                                                                    '< '
                                                                                    '100000'},
                                                                           {'search_param': {'ef': 64},
                                                                            'anns_field': 'float_vector_1',
                                                                            'expr': 'id '
                                                                                    '<= '
                                                                                    '100000',
                                                                            'top_k': 60},
                                                                           {'search_param': {'nprobe': 64},
                                                                            'anns_field': 'binary_vector_1',
                                                                            'top_k': 2000},
                                                                           {'search_param': {'nprobe': 32},
                                                                            'anns_field': 'float_vector',
                                                                            'expr': 'int64_1 '
                                                                                    '> '
                                                                                    '200000'},
                                                                           {'search_param': {'ef': 128},
                                                                            'anns_field': 'float_vector_1',
                                                                            'expr': 'id '
                                                                                    '> '
                                                                                    '10',
                                                                            'top_k': 60},
                                                                           {'search_param': {'nprobe': 32},
                                                                            'anns_field': 'binary_vector_1',
                                                                            'top_k': 2000},
                                                                           {'search_param': {'nprobe': 16},
                                                                            'anns_field': 'float_vector',
                                                                            'expr': 'int64_1 '
                                                                                    '< '
                                                                                    '100000'}],
                                                                  'rerank': {'RRFRanker': []},
                                                                  'timeout': 600,
                                                                  'random_data': True}}]},
            'run_id': 2024041740362808,
            'datetime': '2024-04-17 03:20:36.798427',
            'client_version': '2.4.0'},
 'result': {'test_result': {'index': {'RT': 501.0145,
                                      'float_vector_1': {'RT': 468.2082},
                                      'binary_vector_1': {'RT': 339.8257},
                                      'int64_1': {'RT': 0.5274},
                                      'id': {'RT': 0.5349}},
                            'insert': {'total_time': 646.141,
                                       'VPS': 7738.2491,
                                       'batch_time': 0.6461,
                                       'batch': 5000},
                            'flush': {'RT': 2.57},
                            'load': {'RT': 13.607},
                            'Locust': {'Aggregated': {'Requests': 287537,
                                                      'Fails': 0,
                                                      'RPS': 13.31,
                                                      'fail_s': 0.0,
                                                      'RT_max': 20072.57,
                                                      'RT_avg': 3744.43,
                                                      'TP50': 2900.0,
                                                      'TP99': 16000.0},
                                       'hybrid_search': {'Requests': 287537,
                                                         'Fails': 0,
                                                         'RPS': 13.31,
                                                         'fail_s': 0.0,
                                                         'RT_max': 20072.57,
                                                         'RT_avg': 3744.43,
                                                         'TP50': 2900.0,
                                                         'TP99': 16000.0}}}}}
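The aggregate figures in the report above can be cross-checked with simple arithmetic. This sketch assumes a closed-loop load model in which each of the 50 concurrent users issues its next request as soon as the previous one returns (no wait time between requests), so throughput is approximately users divided by mean response time. That assumption is not stated in the report.

```python
# Values copied from the Locust 'Aggregated' section of the report.
total_requests = 287537
measured_rps = 13.31
rt_avg_ms = 3744.43
concurrent_users = 50  # 'concurrent_number' in the test parameters

# Run length implied by request count and throughput (roughly 6.0 hours).
duration_hours = total_requests / measured_rps / 3600

# Throughput implied by concurrency and mean latency (roughly 13.35 req/s).
expected_rps = concurrent_users / (rt_avg_ms / 1000)

print(f"implied duration: {duration_hours:.2f} h")
print(f"implied throughput: {expected_rps:.2f} req/s")
```

Both values agree with the configured 6h duration and the measured 13.31 req/s, which suggests throughput in this run was bounded by response time rather than by the client.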
@wangting0128 wangting0128 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. test/benchmark benchmark test 2.4-features labels Apr 19, 2024
@wangting0128 wangting0128 added this to the 2.4.1 milestone Apr 19, 2024
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 19, 2024
@yanliang567 yanliang567 removed their assignment Apr 19, 2024
@wangting0128 wangting0128 added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Apr 23, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.1, 2.4.2 May 7, 2024
sre-ci-robot pushed a commit that referenced this issue May 8, 2024
issue: #32466

This PR updates the proxy's shard leader cache when a shard's location
changes, so that after a query node failover the proxy can find the
recovered replica.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
weiliu1031 added a commit to weiliu1031/milvus that referenced this issue May 10, 2024
…us-io#32470)

issue: milvus-io#32466

This PR updates the proxy's shard leader cache when a shard's location
changes, so that after a query node failover the proxy can find the
recovered replica.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
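The behavior described in the commit message can be illustrated with a toy model. This is not Milvus code: the class `ProxyShardCache`, its methods, and the node names are all hypothetical, and the sketch only shows why a cache that is never refreshed keeps routing around a recovered node.

```python
class ProxyShardCache:
    """Toy proxy-side cache mapping a shard to the nodes that can serve it."""

    def __init__(self, leaders):
        self.leaders = {shard: list(nodes) for shard, nodes in leaders.items()}
        self._counter = 0

    def pick(self, shard):
        """Round-robin over the nodes currently cached for the shard."""
        nodes = self.leaders[shard]
        node = nodes[self._counter % len(nodes)]
        self._counter += 1
        return node

    def mark_unavailable(self, shard, node):
        """Drop a node from the cache after a failed request."""
        self.leaders[shard] = [n for n in self.leaders[shard] if n != node]

    def on_shard_location_changed(self, shard, nodes):
        """Refresh the cached node list when the shard's location changes."""
        self.leaders[shard] = list(nodes)


cache = ProxyShardCache({"shard-0": ["qn-a", "qn-b"]})

# qn-a fails and is dropped from the cache.
cache.mark_unavailable("shard-0", "qn-a")

# qn-a recovers, but without a refresh the cache never learns about it.
stale_targets = {cache.pick("shard-0") for _ in range(10)}

# With a refresh on location change, traffic reaches both nodes again.
cache.on_shard_location_changed("shard-0", ["qn-a", "qn-b"])
fresh_targets = {cache.pick("shard-0") for _ in range(10)}

print(sorted(stale_targets), sorted(fresh_targets))
```

In the stale case every request lands on the one remaining node, which is consistent with the increased RT reported in this issue.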
@weiliu1031
Contributor

Please verify this with the latest image.

@weiliu1031
Contributor

/assign @wangting0128

@wangting0128
Contributor Author

Verification passed.

image: master-20240517-225f4a61-amd64
argo task: multi-vector-chaos-2-fzrk5
