
[Bug]: [benchmark] Continuous concurrent upsert and count(*) testing milvus OOM #31705

Closed · 1 task done
elstic opened this issue Mar 28, 2024 · 29 comments
Labels: kind/bug, test/benchmark, triage/accepted

@elstic (Contributor) commented Mar 28, 2024:

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4-20240327-d37e1fdd-amd64
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka):
- SDK version (e.g. pymilvus v2.0.0rc2): 2.4.0rc19
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

After inserting 2 million vectors plus a string scalar field (the string column is the partition key), we run continuous concurrent upserts and count(*) queries at a concurrency of 5, upserting 2000 rows per request. Milvus OOMs and restarts.

argo task: upsert-count-5d5dk
client pod: upsert-count-5d5dk-99120586 (qa ns)

test env: 4am cluster, qa-milvus ns

upsert-count-5d5dk-1-41-1560-etcd-0                               1/1     Running                           0                125m    10.104.21.35    4am-node24   <none>           <none>
upsert-count-5d5dk-1-41-1560-milvus-standalone-6f47687599-smpwm   0/1     CrashLoopBackOff                  15 (3m10s ago)   125m    10.104.30.8     4am-node38   <none>           <none>
upsert-count-5d5dk-1-41-1560-minio-78d5bffc84-p59g8               1/1     Running                           0                125m    10.104.34.102   4am-node37   <none>           <none>

[screenshot]

resource: 8c16g
[screenshot]

Expected Behavior

No response

Steps To Reproduce

1. Deploy Milvus
2. Insert 2 million rows, then create an index and load the collection
3. Run concurrent upsert and count(*) (a minimal setup sketch follows)
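
A minimal pymilvus 2.4 sketch of steps 1-2 (the collection name, host, and partition-key value distribution are assumptions; the actual test uses the fouram benchmark config shown in the comments below):

    # Sketch only: names and data distribution are illustrative assumptions.
    import numpy as np
    from pymilvus import (
        connections, Collection, CollectionSchema, FieldSchema, DataType,
    )

    connections.connect(host="127.0.0.1", port="19530")  # assumed endpoint

    fields = [
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("vector", DataType.FLOAT_VECTOR, dim=128),
        # varchar_1 is the partition key, per the report (max_length: 100).
        FieldSchema("varchar_1", DataType.VARCHAR, max_length=100,
                    is_partition_key=True),
    ]
    coll = Collection("upsert_count_test", CollectionSchema(fields),
                      shards_num=2, num_partitions=16)

    # Insert 2M rows in batches of 5000 (ni_per: 5000).
    for start in range(0, 2_000_000, 5000):
        ids = list(range(start, start + 5000))
        vecs = np.random.random((5000, 128)).tolist()
        keys = [f"key_{i % 16}" for i in ids]
        coll.insert([ids, vecs, keys])

    coll.create_index("vector", {"index_type": "HNSW", "metric_type": "L2",
                                 "params": {"M": 8, "efConstruction": 200}})
    coll.load()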

Milvus Log

No response

Anything else?

No response

@elstic added the kind/bug, needs-triage, and test/benchmark labels on Mar 28, 2024
@elstic added this to the 2.4.0 milestone on Mar 28, 2024
@xiaofan-luan (Contributor):

I think this might be expected: you need enough memory buffer before compaction is done. But is our quota limitation not working as expected?

@xiaofan-luan (Contributor):

/assign @bigsheeper

@xiaofan-luan (Contributor):

Another question is how we can improve compaction throughput to keep up with an intense write/delete workload. We would like to rethink the compaction policy and compaction implementation.

@yanliang567 (Contributor):

I think an 8c16g standalone pod should be able to hold 2M 128-d entities with 5 concurrent upsert requests without OOM.
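(Back-of-envelope: 2,000,000 x 128 float32 values are roughly 2,000,000 x 128 x 4 B ≈ 1 GiB of raw vector data, and an HNSW graph with M=8 adds only modest overhead, so the base dataset sits far below 16 GiB; the growth under test has to come from the write path.)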

> I think this might be expected: you need enough memory buffer before compaction is done. But is our quota limitation not working as expected?

@yanliang567 added the triage/accepted label and removed the needs-triage label on Mar 29, 2024
@yanliang567 removed their assignment on Mar 29, 2024
@xiaofan-luan (Contributor):

> I think an 8c16g standalone pod should be able to hold 2M 128-d entities with 5 concurrent upsert requests without OOM.
>
> I think this might be expected: you need enough memory buffer before compaction is done. But is our quota limitation not working as expected?

That depends on how fast compaction can work.

@bigsheeper (Contributor):

> I think this might be expected: you need enough memory buffer before compaction is done. But is our quota limitation not working as expected?

I believe this is because the memory growth is too rapid, and the quota center takes several seconds to provide feedback to the proxy interceptor.
On the other hand, the memory growth may also be attributed to datanode/querynode consuming the remaining messages in the message queue (MQ), further exacerbating the situation.

@xiaofan-luan (Contributor):

> I think this might be expected: you need enough memory buffer before compaction is done. But is our quota limitation not working as expected?
>
> I believe this is because the memory growth is too rapid, and the quota center takes several seconds to provide feedback to the proxy interceptor. On the other hand, the memory growth may also be attributed to datanode/querynode consuming the remaining messages in the message queue (MQ), further exacerbating the situation.

We may want to apply the insert throughput limit earlier. Could that be an option?

For example:
0-60% memory usage: no limit
60-70%: 20 MB/s
70-80%: 10 MB/s
80-90%: 5 MB/s
90%+: writes denied
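
A toy sketch of the proposed tiers (illustrative only; this is not Milvus's quota-center implementation, and the thresholds and rates are just the ones proposed above):

    from typing import Optional

    def allowed_write_rate_mb_s(mem_used_fraction: float) -> Optional[float]:
        """Map memory usage to a write-rate cap per the tiers above.

        Returns None for "no limit" and 0.0 for "writes denied".
        """
        if mem_used_fraction < 0.60:
            return None   # below 60%: no limit
        if mem_used_fraction < 0.70:
            return 20.0   # 60-70%: 20 MB/s
        if mem_used_fraction < 0.80:
            return 10.0   # 70-80%: 10 MB/s
        if mem_used_fraction < 0.90:
            return 5.0    # 80-90%: 5 MB/s
        return 0.0        # 90%+: deny writes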

@bigsheeper (Contributor):

> We may want to apply the insert throughput limit earlier. Could that be an option?
>
> For example: 0-60% memory usage: no limit; 60-70%: 20 MB/s; 70-80%: 10 MB/s; 80-90%: 5 MB/s; 90%+: writes denied

Yes, we support such functionality, provided that the quotaAndLimits.dml.upsertRate configuration is set (by the way, it is configured in the cloud). Otherwise, only the "90%+: writes denied" tier applies.

@bigsheeper (Contributor):

@elstic Maybe you can try setting quotaAndLimits.dml.upsertRate.max to 0.5 (meaning 0.5 MB/s), and also speed up the quota center checks by lowering quotaAndLimits.quotaCenterCollectInterval from 3 (s) to 1 (s).

@elstic (Contributor, Author) commented Apr 7, 2024:

> @elstic Maybe you can try setting quotaAndLimits.dml.upsertRate.max to 0.5 (meaning 0.5 MB/s), and also speed up the quota center checks by lowering quotaAndLimits.quotaCenterCollectInterval from 3 (s) to 1 (s).

Milvus still crashes; here are the test details:

upsert-count-zpm8n-1-4-6358-etcd-0                                1/1     Running                           0                  32m     10.104.26.166   4am-node32   <none>           <none>
upsert-count-zpm8n-1-4-6358-milvus-standalone-66f9f44c65-f69ls    0/1     CrashLoopBackOff                  5 (59s ago)        32m     10.104.33.104   4am-node36   <none>           <none>
upsert-count-zpm8n-1-4-6358-minio-5f9cf686db-dmb7s                1/1     Running                           0                  32m     10.104.33.103   4am-node36   <none>           <none>

I changed the quota parameters as you suggested:

  quotaAndLimits:
    dml:
      upsertRate:
        max: 0.5
    quotaCenterCollectInterval: 1

milvus resources:

  standalone:
    resources:
      limits:
        cpu: 8
        memory: 16Gi
      requests:
        cpu: 8
        memory: 16Gi

client parameters:

    dataset_params:
      dataset_name: sift
      dim: 128
      dataset_size: 2m
      ni_per: 5000
      metric_type: L2
      scalars_params:
        varchar_1:
          params:
            is_partition_key: true
            max_length: 100
    collection_params:
      other_fields:
        - varchar_1
      shards_num: 2
      num_partitions: 16
    load_params: {}
    release_params: {}
    index_params:
      index_type: HNSW
      index_param:
        M: 8
        efConstruction: 200
    query_params: {}
    search_params: {}
    resource_groups_params:
      reset: false
    database_user_params:
      reset_rbac: false
      reset_db: false
    concurrent_params:
      concurrent_number:
        - 1
      during_time: 3h
      interval: 20
    concurrent_tasks:
      - type: query
        weight: 1
        params:
          expr: ''
          output_fields:
            - count(*)
      - type: upsert
        weight: 1
        params:
          nb: 2000
          random_id: false
          start_id: 1

That is, insert 2 million 128-d vectors plus a varchar column of max length 100, then run count(*) and upsert serially (concurrency reduced to 1), upserting 2000 rows at a time; a sketch of the loop follows.
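
The workload approximates the loop below (a minimal pymilvus 2.4 sketch; the collection name, connection, and id/key scheme are assumptions, the real driver being the fouram benchmark configured above):

    # Sketch of the query + upsert tasks (weight 1:1, concurrency 1,
    # interval 20 s, during_time 3 h). Names are illustrative assumptions.
    import time
    import numpy as np
    from pymilvus import connections, Collection

    connections.connect(host="127.0.0.1", port="19530")
    coll = Collection("upsert_count_test")

    nb, start_id = 2000, 1
    deadline = time.time() + 3 * 3600          # during_time: 3h
    while time.time() < deadline:
        # count(*) checks whether upserts grow or shrink the row count.
        res = coll.query(expr="", output_fields=["count(*)"])
        print("count(*):", res[0]["count(*)"])

        # Upsert 2000 rows with sequential ids (random_id: false, start_id: 1),
        # wrapping inside the existing 2M ids so the count should stay flat.
        ids = [(start_id + i) % 2_000_000 for i in range(nb)]
        vecs = np.random.random((nb, 128)).tolist()
        keys = [f"key_{i % 16}" for i in ids]
        coll.upsert([ids, vecs, keys])
        start_id += nb

        time.sleep(20)                         # interval: 20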

resource grafana:
[screenshot]

@xiaofan-luan (Contributor):

@elstic
Is this related to count?

@elstic (Contributor, Author) commented Apr 7, 2024:

> @elstic Is this related to count?

I think it's unrelated; the count operation was only added to verify whether upsert causes the amount of data to grow or shrink.
I will start another test with only upsert operations to confirm this.

@elstic (Contributor, Author) commented Apr 7, 2024:

> @elstic Is this related to count?
>
> I think it's unrelated; the count operation was only added to verify whether upsert causes the amount of data to grow or shrink. I will start another test with only upsert operations to confirm this.

The crash reproduces with upsert only.

upsert-count-wbvb7-1-7-2291-etcd-0                                1/1     Running                           0                 36m     10.104.16.188   4am-node21   <none>           <none>
upsert-count-wbvb7-1-7-2291-milvus-standalone-578cdc9c59-tbsx4    0/1     CrashLoopBackOff                  6 (2m10s ago)     36m     10.104.27.228   4am-node31   <none>           <none>
upsert-count-wbvb7-1-7-2291-minio-84b8454844-kcmmc                1/1     Running                           0                 36m     10.104.30.6     4am-node38   <none>           <none>

@bigsheeper (Contributor):

@elstic Has quotaAndLimits.dml.enabled been configured to true?

@elstic (Contributor, Author) commented Apr 9, 2024:

> @elstic Has quotaAndLimits.dml.enabled been configured to true?

No modification; it's the default value.
Do I need to set it to true and rerun? Are there any other configurations that need to change?

@bigsheeper (Contributor):

> @elstic Has quotaAndLimits.dml.enabled been configured to true?
>
> No modification; it's the default value. Do I need to set it to true and rerun? Are there any other configurations that need to change?

Nope. I'll share a user document about quotas and limits with you.

@elstic (Contributor, Author) commented Apr 10, 2024:

When I set quotaAndLimits.dml.enabled to true and continually upsert, Milvus's memory no longer grows rapidly.

quotaAndLimits:
  dml:
    enabled: true
    upsertRate:
      max: 0.5
  quotaCenterCollectInterval: 1

[screenshot]

@xiaofan-luan (Contributor):

> quotaAndLimits.dml.enabled

So this is the memory back pressure at work?

@bigsheeper (Contributor):

> quotaAndLimits.dml.enabled
>
> So this is the memory back pressure at work?

The memory did not reach the memory low watermark, so it's just rate limiting. To observe the backpressure effect, we need to continue upserting for a longer period.

[screenshot]

@yanliang567 modified the milestones: 2.4.0, 2.4.1 on Apr 18, 2024
@elstic (Contributor, Author) commented Apr 28, 2024:

This case works fine in the 2.4 branch, but crashes in the master branch.
image: master-20240426-c080dc16-amd64

server:

upsert-count-1744400-1-58-9840-etcd-0                             1/1     Running                       0                  11m     10.104.32.58    4am-node39   <none>           <none>
upsert-count-1744400-1-58-9840-milvus-standalone-86894b955z7rsg   1/1     Running                       0                  11m     10.104.15.137   4am-node20   <none>           <none>
upsert-count-1744400-1-58-9840-minio-bfcf4c7d8-cdwld              1/1     Running                       0                  11m     10.104.32.53    4am-node39   <none>           <none> (base.py:257)
[2024-04-28 00:25:57,207 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'NAME|upsert-count-1744400-1-58-9840-milvus|upsert-count-1744400-1-58-9840-minio|upsert-count-1744400-1-58-9840-etcd|upsert-count-1744400-1-58-9840-pulsar|upsert-count-1744400-1-58-9840-zookeeper|upsert-count-1744400-1-58-9840-kafka|upsert-count-1744400-1-58-9840-log|upsert-count-1744400-1-58-9840-tikv'  (util_cmd.py:14)
[2024-04-28 00:26:07,024 -  INFO - fouram]: [CliClient] pod details of release(upsert-count-1744400-1-58-9840): 
 I0428 00:25:58.458175    3223 request.go:665] Waited for 1.16826314s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/autoscaling/v2beta1?timeout=32s
NAME                                                              READY   STATUS                        RESTARTS           AGE     IP              NODE         NOMINATED NODE   READINESS GATES
upsert-count-1744400-1-58-9840-etcd-0                             1/1     Running                       0                  5h22m   10.104.32.58    4am-node39   <none>           <none>
upsert-count-1744400-1-58-9840-milvus-standalone-86894b955z7rsg   0/1     CrashLoopBackOff              37 (2m6s ago)      5h22m   10.104.15.137   4am-node20   <none>           <none>
upsert-count-1744400-1-58-9840-minio-bfcf4c7d8-cdwld              1/1     Running                       0                  5h22m   10.104.32.53    4am-node39   <none>           <none>

deploy config:

    standalone:
      resources:
        limits:
          cpu: 8
          memory: 16Gi
        requests:
          cpu: 8
          memory: 16Gi
      profiling:
        enabled: true
    extraConfigFiles:
      user.yaml: |+
        quotaAndLimits:
          dml:
            enabled: true
            upsertRate:
              max: 0.5
          quotaCenterCollectInterval: 1

grafana:
[screenshot]

client pod: fouramf-6tth5-60-4232-milvus-standalone-5f9fd547d-m5mgs

@elstic (Contributor, Author) commented Apr 28, 2024:

@bigsheeper please pay attention to this issue.

@bigsheeper (Contributor):

[screenshot]

Related to #32647.

This should be fixed; please help verify, @elstic.

@bigsheeper (Contributor):

/assign @elstic
/unassign

@sre-ci-robot assigned elstic and unassigned bigsheeper on Apr 29, 2024
@elstic (Contributor, Author) commented Apr 30, 2024:

Verified as fixed.
Verification image: master-20240430-42d0412e

@elstic closed this as completed on Apr 30, 2024
@elstic (Contributor, Author) commented May 22, 2024:

The issue has resurfaced.
When I turn on DML rate limiting and keep upserting concurrently, Milvus crashes.
image: master-20240521-3d105fcb-amd64
deploy config:

    standalone:
      resources:
        limits:
          cpu: 8
          memory: 16Gi
        requests:
          cpu: 8
          memory: 16Gi
      profiling:
        enabled: true
    extraConfigFiles:
      user.yaml: |+
        quotaAndLimits:
          dml:
            enabled: true
            upsertRate:
              max: 0.5
          quotaCenterCollectInterval: 1

server:

upsert-count-1718000-1-71-5787-etcd-0                             1/1     Running       0               11m     10.104.26.12    4am-node32   <none>           <none>
upsert-count-1718000-1-71-5787-milvus-standalone-9d776678-rwvjh   1/1     Running       0               11m     10.104.20.245   4am-node22   <none>           <none>
upsert-count-1718000-1-71-5787-minio-697ff7664-z4mj8              1/1     Running       0               11m     10.104.25.97    4am-node30   <none>           <none> (base.py:257)
[2024-05-22 00:23:22,654 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'NAME|upsert-count-1718000-1-71-5787-milvus|upsert-count-1718000-1-71-5787-minio|upsert-count-1718000-1-71-5787-etcd|upsert-count-1718000-1-71-5787-pulsar|upsert-count-1718000-1-71-5787-zookeeper|upsert-count-1718000-1-71-5787-kafka|upsert-count-1718000-1-71-5787-log|upsert-count-1718000-1-71-5787-tikv'  (util_cmd.py:14)
[2024-05-22 00:23:32,987 -  INFO - fouram]: [CliClient] pod details of release(upsert-count-1718000-1-71-5787):
 I0522 00:23:24.303787    3444 request.go:665] Waited for 1.198514145s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/autoscaling/v1?timeout=32s
NAME                                                              READY   STATUS             RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
upsert-count-1718000-1-71-5787-etcd-0                             1/1     Running            0               5h18m   10.104.26.12    4am-node32   <none>           <none>
upsert-count-1718000-1-71-5787-milvus-standalone-9d776678-rwvjh   0/1     CrashLoopBackOff   9 (34s ago)     5h18m   10.104.20.245   4am-node22   <none>           <none>
upsert-count-1718000-1-71-5787-minio-697ff7664-z4mj8              1/1     Running            0               5h18m   10.104.25.97    4am-node30   <none>           <none> (cli_client.py:138)

client error log:
[screenshot]

grafana:
[screenshot]

prometheus:
[screenshot]

@elstic reopened this on May 22, 2024
@elstic assigned bigsheeper and unassigned elstic on May 22, 2024
@elstic modified the milestones: 2.4.1, 2.4.2 on May 22, 2024
@xiaofan-luan (Contributor):

@bigsheeper, please help with this.

@bigsheeper (Contributor):

[screenshot]

sre-ci-robot pushed a commit that referenced this issue May 23, 2024
issue: #31705

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
czs007 pushed a commit that referenced this issue May 23, 2024
issue: #31705

pr: #33289

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue May 24, 2024
If the request is limited by rate limiter, limiter should not "Cancel".
This is because, if limited, tokens are not deducted; instead, "Cancel"
operation would increase the token count.

issue: #31705

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue May 24, 2024
If the request is limited by rate limiter, limiter should not "Cancel".
This is because, if limited, tokens are not deducted; instead, "Cancel"
operation would increase the token count.

issue: #31705

pr: #33335

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
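
The fix is easiest to see with a toy token bucket (a simplified Python sketch, not Milvus's actual limiter code): a limited request never deducted any tokens, so "cancelling" it refunds tokens it never took and inflates the bucket.

    class TokenBucket:
        """Toy token bucket; not Milvus's actual limiter implementation."""

        def __init__(self, tokens: float):
            self.tokens = tokens

        def reserve(self, n: float) -> bool:
            if self.tokens >= n:
                self.tokens -= n   # tokens are deducted only on success
                return True
            return False           # limited: nothing was deducted

        def cancel(self, n: float) -> None:
            # Refund a reservation; only valid after a successful reserve().
            self.tokens += n

    bucket = TokenBucket(tokens=10)
    if not bucket.reserve(100):
        # Pre-fix behavior: cancelling a *limited* request refunds tokens
        # that were never taken, so the bucket grows and the limit erodes.
        bucket.cancel(100)
    print(bucket.tokens)  # 110: the bug; the fix skips Cancel when limited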
@bigsheeper (Contributor):

/assign @elstic
please help to verify

@elstic (Contributor, Author) commented May 24, 2024:

Verified as fixed.
Verification image: master-20240524-7730b910-amd64

@elstic elstic closed this as completed May 24, 2024
@elstic closed this as completed on May 24, 2024