
[Bug]: [benchmark] Continuous concurrent upsert and count(*) testing milvus OOM #31705

Closed · 1 task done
elstic opened this issue Mar 28, 2024 · 29 comments
Labels: kind/bug, test/benchmark, triage/accepted

@elstic (Contributor) commented Mar 28, 2024:

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4-20240327-d37e1fdd-amd64
- Deployment mode (standalone or cluster): standalone
- MQ type (rocksmq, pulsar or kafka):
- SDK version (e.g. pymilvus v2.0.0rc2): 2.4.0rc19
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

After inserting 2 million vectors plus a string scalar field (the string column is the partition key), we run continuous concurrent upserts and count(*) queries at a concurrency of 5, upserting 2000 rows per request. Milvus OOMs and restarts.

argo task: upsert-count-5d5dk
client pod: upsert-count-5d5dk-99120586 (qa ns)

test env: 4am cluster, qa-milvus ns

upsert-count-5d5dk-1-41-1560-etcd-0                               1/1     Running                           0                125m    10.104.21.35    4am-node24   <none>           <none>
upsert-count-5d5dk-1-41-1560-milvus-standalone-6f47687599-smpwm   0/1     CrashLoopBackOff                  15 (3m10s ago)   125m    10.104.30.8     4am-node38   <none>           <none>
upsert-count-5d5dk-1-41-1560-minio-78d5bffc84-p59g8               1/1     Running                           0                125m    10.104.34.102   4am-node37   <none>           <none>

[screenshot]

resource: 8c16g
[screenshot]

Expected Behavior

No response

Steps To Reproduce

1. Deploy Milvus
2. Insert 2 million rows, then create an index and load the collection
3. Run concurrent upsert and count(*) (a minimal setup sketch follows)
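
A minimal pymilvus 2.4 sketch of steps 1-2 (the collection name, host, and partition-key value distribution are assumptions; the actual test uses the fouram benchmark config shown in the comments below):

    # Sketch only: names and data distribution are illustrative assumptions.
    import numpy as np
    from pymilvus import (
        connections, Collection, CollectionSchema, FieldSchema, DataType,
    )

    connections.connect(host="127.0.0.1", port="19530")  # assumed endpoint

    fields = [
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("vector", DataType.FLOAT_VECTOR, dim=128),
        # varchar_1 is the partition key, per the report (max_length: 100).
        FieldSchema("varchar_1", DataType.VARCHAR, max_length=100,
                    is_partition_key=True),
    ]
    coll = Collection("upsert_count_test", CollectionSchema(fields),
                      shards_num=2, num_partitions=16)

    # Insert 2M rows in batches of 5000 (ni_per: 5000).
    for start in range(0, 2_000_000, 5000):
        ids = list(range(start, start + 5000))
        vecs = np.random.random((5000, 128)).tolist()
        keys = [f"key_{i % 16}" for i in ids]
        coll.insert([ids, vecs, keys])

    coll.create_index("vector", {"index_type": "HNSW", "metric_type": "L2",
                                 "params": {"M": 8, "efConstruction": 200}})
    coll.load()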

Milvus Log

No response

Anything else?

No response

@elstic added the kind/bug, needs-triage, and test/benchmark labels on Mar 28, 2024
@elstic added this to the 2.4.0 milestone on Mar 28, 2024
@xiaofan-luan (Contributor):

I think this might be expected: you need enough memory buffer before compaction is done. But is our quota limitation not working as expected?

@xiaofan-luan (Contributor):

/assign @bigsheeper

@xiaofan-luan (Contributor):

Another question is how we can improve compaction throughput to keep up with an intense write/delete workload. We would like to rethink the compaction policy and compaction implementation.

@yanliang567 (Contributor):

I think an 8c16g standalone pod should be able to hold 2M 128-d entities with 5 concurrent upsert requests without OOM.
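(Back-of-envelope: 2,000,000 x 128 float32 values are roughly 2,000,000 x 128 x 4 B ≈ 1 GiB of raw vector data, and an HNSW graph with M=8 adds only modest overhead, so the base dataset sits far below 16 GiB; the growth under test has to come from the write path.)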

> I think this might be expected: you need enough memory buffer before compaction is done. But is our quota limitation not working as expected?

@yanliang567 added the triage/accepted label and removed the needs-triage label on Mar 29, 2024
@yanliang567 removed their assignment on Mar 29, 2024
@xiaofan-luan (Contributor):

> I think an 8c16g standalone pod should be able to hold 2M 128-d entities with 5 concurrent upsert requests without OOM.
>
> I think this might be expected: you need enough memory buffer before compaction is done. But is our quota limitation not working as expected?

That depends on how fast compaction can work.

@bigsheeper (Contributor):

> I think this might be expected: you need enough memory buffer before compaction is done. But is our quota limitation not working as expected?

I believe this is because the memory growth is too rapid, and the quota center takes several seconds to provide feedback to the proxy interceptor.
On the other hand, the memory growth may also be attributed to datanode/querynode consuming the remaining messages in the message queue (MQ), further exacerbating the situation.

@xiaofan-luan (Contributor):

> I think this might be expected: you need enough memory buffer before compaction is done. But is our quota limitation not working as expected?
>
> I believe this is because the memory growth is too rapid, and the quota center takes several seconds to provide feedback to the proxy interceptor. On the other hand, the memory growth may also be attributed to datanode/querynode consuming the remaining messages in the message queue (MQ), further exacerbating the situation.

We may want to apply the insert throughput limit earlier. Could that be an option?

For example:
0-60% memory usage: no limit
60-70%: 20 MB/s
70-80%: 10 MB/s
80-90%: 5 MB/s
90%+: writes denied
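
A toy sketch of the proposed tiers (illustrative only; this is not Milvus's quota-center implementation, and the thresholds and rates are just the ones proposed above):

    from typing import Optional

    def allowed_write_rate_mb_s(mem_used_fraction: float) -> Optional[float]:
        """Map memory usage to a write-rate cap per the tiers above.

        Returns None for "no limit" and 0.0 for "writes denied".
        """
        if mem_used_fraction < 0.60:
            return None   # below 60%: no limit
        if mem_used_fraction < 0.70:
            return 20.0   # 60-70%: 20 MB/s
        if mem_used_fraction < 0.80:
            return 10.0   # 70-80%: 10 MB/s
        if mem_used_fraction < 0.90:
            return 5.0    # 80-90%: 5 MB/s
        return 0.0        # 90%+: deny writes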

@bigsheeper (Contributor):

> We may want to apply the insert throughput limit earlier. Could that be an option?
>
> For example: 0-60% memory usage: no limit; 60-70%: 20 MB/s; 70-80%: 10 MB/s; 80-90%: 5 MB/s; 90%+: writes denied

Yes, we support such functionality, provided that the quotaAndLimits.dml.upsertRate configuration is set (by the way, it is configured in the cloud). Otherwise, only the "90%+: writes denied" tier applies.

@bigsheeper (Contributor):

@elstic Maybe you can try setting quotaAndLimits.dml.upsertRate.max to 0.5 (meaning 0.5 MB/s), and also speed up the quota center checks by lowering quotaAndLimits.quotaCenterCollectInterval from 3 (s) to 1 (s).

@elstic (Contributor, Author) commented Apr 7, 2024:

> @elstic Maybe you can try setting quotaAndLimits.dml.upsertRate.max to 0.5 (meaning 0.5 MB/s), and also speed up the quota center checks by lowering quotaAndLimits.quotaCenterCollectInterval from 3 (s) to 1 (s).

Milvus still crashes; here are the test details:

upsert-count-zpm8n-1-4-6358-etcd-0                                1/1     Running                           0                  32m     10.104.26.166   4am-node32   <none>           <none>
upsert-count-zpm8n-1-4-6358-milvus-standalone-66f9f44c65-f69ls    0/1     CrashLoopBackOff                  5 (59s ago)        32m     10.104.33.104   4am-node36   <none>           <none>
upsert-count-zpm8n-1-4-6358-minio-5f9cf686db-dmb7s                1/1     Running                           0                  32m     10.104.33.103   4am-node36   <none>           <none>

I changed the quota parameters as you suggested:

  quotaAndLimits:
    dml:
      upsertRate:
        max: 0.5
    quotaCenterCollectInterval: 1

milvus resources:

  standalone:
    resources:
      limits:
        cpu: 8
        memory: 16Gi
      requests:
        cpu: 8
        memory: 16Gi

client parameters:

    dataset_params:
      dataset_name: sift
      dim: 128
      dataset_size: 2m
      ni_per: 5000
      metric_type: L2
      scalars_params:
        varchar_1:
          params:
            is_partition_key: true
            max_length: 100
    collection_params:
      other_fields:
        - varchar_1
      shards_num: 2
      num_partitions: 16
    load_params: {}
    release_params: {}
    index_params:
      index_type: HNSW
      index_param:
        M: 8
        efConstruction: 200
    query_params: {}
    search_params: {}
    resource_groups_params:
      reset: false
    database_user_params:
      reset_rbac: false
      reset_db: false
    concurrent_params:
      concurrent_number:
        - 1
      during_time: 3h
      interval: 20
    concurrent_tasks:
      - type: query
        weight: 1
        params:
          expr: ''
          output_fields:
            - count(*)
      - type: upsert
        weight: 1
        params:
          nb: 2000
          random_id: false
          start_id: 1

That is, insert 2 million 128-d vectors plus a varchar column of max length 100, then run count(*) and upsert serially (concurrency reduced to 1), upserting 2000 rows at a time; a sketch of the loop follows.
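
The workload approximates the loop below (a minimal pymilvus 2.4 sketch; the collection name, connection, and id/key scheme are assumptions, the real driver being the fouram benchmark configured above):

    # Sketch of the query + upsert tasks (weight 1:1, concurrency 1,
    # interval 20 s, during_time 3 h). Names are illustrative assumptions.
    import time
    import numpy as np
    from pymilvus import connections, Collection

    connections.connect(host="127.0.0.1", port="19530")
    coll = Collection("upsert_count_test")

    nb, start_id = 2000, 1
    deadline = time.time() + 3 * 3600          # during_time: 3h
    while time.time() < deadline:
        # count(*) checks whether upserts grow or shrink the row count.
        res = coll.query(expr="", output_fields=["count(*)"])
        print("count(*):", res[0]["count(*)"])

        # Upsert 2000 rows with sequential ids (random_id: false, start_id: 1),
        # wrapping inside the existing 2M ids so the count should stay flat.
        ids = [(start_id + i) % 2_000_000 for i in range(nb)]
        vecs = np.random.random((nb, 128)).tolist()
        keys = [f"key_{i % 16}" for i in ids]
        coll.upsert([ids, vecs, keys])
        start_id += nb

        time.sleep(20)                         # interval: 20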

resource grafana:
[screenshot]

@xiaofan-luan (Contributor):

@elstic
Is this related to count?

@elstic (Contributor, Author) commented Apr 7, 2024:

> @elstic Is this related to count?

I think it's unrelated; the count operation was only added to verify whether upsert causes the amount of data to grow or shrink.
I will start another test with only upsert operations to confirm this.

@elstic (Contributor, Author) commented Apr 7, 2024:

> @elstic Is this related to count?
>
> I think it's unrelated; the count operation was only added to verify whether upsert causes the amount of data to grow or shrink. I will start another test with only upsert operations to confirm this.

The crash reproduces with upsert only.

upsert-count-wbvb7-1-7-2291-etcd-0                                1/1     Running                           0                 36m     10.104.16.188   4am-node21   <none>           <none>
upsert-count-wbvb7-1-7-2291-milvus-standalone-578cdc9c59-tbsx4    0/1     CrashLoopBackOff                  6 (2m10s ago)     36m     10.104.27.228   4am-node31   <none>           <none>
upsert-count-wbvb7-1-7-2291-minio-84b8454844-kcmmc                1/1     Running                           0                 36m     10.104.30.6     4am-node38   <none>           <none>

@bigsheeper (Contributor):

@elstic Has quotaAndLimits.dml.enabled been configured to true?

@elstic (Contributor, Author) commented Apr 9, 2024:

> @elstic Has quotaAndLimits.dml.enabled been configured to true?

No modification; it's the default value.
Do I need to set it to true and rerun? Are there any other configurations that need to change?

@bigsheeper (Contributor):

> @elstic Has quotaAndLimits.dml.enabled been configured to true?
>
> No modification; it's the default value. Do I need to set it to true and rerun? Are there any other configurations that need to change?

Nope. I'll share a user document about quotas and limits with you.

@elstic (Contributor, Author) commented Apr 10, 2024:

When I set quotaAndLimits.dml.enabled to true and continually upsert, Milvus's memory no longer grows rapidly.

quotaAndLimits:
  dml:
    enabled: true
    upsertRate:
      max: 0.5
  quotaCenterCollectInterval: 1

[screenshot]

@xiaofan-luan (Contributor):

> quotaAndLimits.dml.enabled

So this is the memory back pressure at work?

@bigsheeper (Contributor):

> quotaAndLimits.dml.enabled
>
> So this is the memory back pressure at work?

The memory did not reach the memory low watermark, so it's just rate limiting. To observe the backpressure effect, we need to continue upserting for a longer period.

[screenshot]

@yanliang567 modified the milestones: 2.4.0, 2.4.1 on Apr 18, 2024
@elstic (Contributor, Author) commented Apr 28, 2024:

This case works fine in the 2.4 branch, but crashes in the master branch.
image: master-20240426-c080dc16-amd64

server:

upsert-count-1744400-1-58-9840-etcd-0                             1/1     Running                       0                  11m     10.104.32.58    4am-node39   <none>           <none>
upsert-count-1744400-1-58-9840-milvus-standalone-86894b955z7rsg   1/1     Running                       0                  11m     10.104.15.137   4am-node20   <none>           <none>
upsert-count-1744400-1-58-9840-minio-bfcf4c7d8-cdwld              1/1     Running                       0                  11m     10.104.32.53    4am-node39   <none>           <none> (base.py:257)
[2024-04-28 00:25:57,207 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'NAME|upsert-count-1744400-1-58-9840-milvus|upsert-count-1744400-1-58-9840-minio|upsert-count-1744400-1-58-9840-etcd|upsert-count-1744400-1-58-9840-pulsar|upsert-count-1744400-1-58-9840-zookeeper|upsert-count-1744400-1-58-9840-kafka|upsert-count-1744400-1-58-9840-log|upsert-count-1744400-1-58-9840-tikv'  (util_cmd.py:14)
[2024-04-28 00:26:07,024 -  INFO - fouram]: [CliClient] pod details of release(upsert-count-1744400-1-58-9840): 
 I0428 00:25:58.458175    3223 request.go:665] Waited for 1.16826314s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/autoscaling/v2beta1?timeout=32s
NAME                                                              READY   STATUS                        RESTARTS           AGE     IP              NODE         NOMINATED NODE   READINESS GATES
upsert-count-1744400-1-58-9840-etcd-0                             1/1     Running                       0                  5h22m   10.104.32.58    4am-node39   <none>           <none>
upsert-count-1744400-1-58-9840-milvus-standalone-86894b955z7rsg   0/1     CrashLoopBackOff              37 (2m6s ago)      5h22m   10.104.15.137   4am-node20   <none>           <none>
upsert-count-1744400-1-58-9840-minio-bfcf4c7d8-cdwld              1/1     Running                       0                  5h22m   10.104.32.53    4am-node39   <none>           <none>

deploy config:

    standalone:
      resources:
        limits:
          cpu: 8
          memory: 16Gi
        requests:
          cpu: 8
          memory: 16Gi
      profiling:
        enabled: true
    extraConfigFiles:
      user.yaml: |+
        quotaAndLimits:
          dml:
            enabled: true
            upsertRate:
              max: 0.5
          quotaCenterCollectInterval: 1

grafana:
[screenshot]

client pod: fouramf-6tth5-60-4232-milvus-standalone-5f9fd547d-m5mgs

@elstic (Contributor, Author) commented Apr 28, 2024:

@bigsheeper please pay attention to this issue.

@bigsheeper (Contributor):

[screenshot]

Related to #32647.

This should be fixed; please help verify, @elstic.

@bigsheeper (Contributor):

/assign @elstic
/unassign

@sre-ci-robot assigned elstic and unassigned bigsheeper on Apr 29, 2024
@elstic (Contributor, Author) commented Apr 30, 2024:

Verified as fixed.
Verification image: master-20240430-42d0412e

@elstic closed this as completed on Apr 30, 2024
@elstic (Contributor, Author) commented May 22, 2024:

The issue has resurfaced.
When I turn on DML rate limiting and keep upserting concurrently, Milvus crashes.
image: master-20240521-3d105fcb-amd64
deploy config:

    standalone:
      resources:
        limits:
          cpu: 8
          memory: 16Gi
        requests:
          cpu: 8
          memory: 16Gi
      profiling:
        enabled: true
    extraConfigFiles:
      user.yaml: |+
        quotaAndLimits:
          dml:
            enabled: true
            upsertRate:
              max: 0.5
          quotaCenterCollectInterval: 1

server:

upsert-count-1718000-1-71-5787-etcd-0                             1/1     Running       0               11m     10.104.26.12    4am-node32   <none>           <none>
upsert-count-1718000-1-71-5787-milvus-standalone-9d776678-rwvjh   1/1     Running       0               11m     10.104.20.245   4am-node22   <none>           <none>
upsert-count-1718000-1-71-5787-minio-697ff7664-z4mj8              1/1     Running       0               11m     10.104.25.97    4am-node30   <none>           <none> (base.py:257)
[2024-05-22 00:23:22,654 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'NAME|upsert-count-1718000-1-71-5787-milvus|upsert-count-1718000-1-71-5787-minio|upsert-count-1718000-1-71-5787-etcd|upsert-count-1718000-1-71-5787-pulsar|upsert-count-1718000-1-71-5787-zookeeper|upsert-count-1718000-1-71-5787-kafka|upsert-count-1718000-1-71-5787-log|upsert-count-1718000-1-71-5787-tikv'  (util_cmd.py:14)
[2024-05-22 00:23:32,987 -  INFO - fouram]: [CliClient] pod details of release(upsert-count-1718000-1-71-5787):
 I0522 00:23:24.303787    3444 request.go:665] Waited for 1.198514145s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/autoscaling/v1?timeout=32s
NAME                                                              READY   STATUS             RESTARTS        AGE     IP              NODE         NOMINATED NODE   READINESS GATES
upsert-count-1718000-1-71-5787-etcd-0                             1/1     Running            0               5h18m   10.104.26.12    4am-node32   <none>           <none>
upsert-count-1718000-1-71-5787-milvus-standalone-9d776678-rwvjh   0/1     CrashLoopBackOff   9 (34s ago)     5h18m   10.104.20.245   4am-node22   <none>           <none>
upsert-count-1718000-1-71-5787-minio-697ff7664-z4mj8              1/1     Running            0               5h18m   10.104.25.97    4am-node30   <none>           <none> (cli_client.py:138)

client error log:
[screenshot]

grafana:
[screenshot]

prometheus:
[screenshot]

@elstic reopened this on May 22, 2024
@elstic assigned bigsheeper and unassigned elstic on May 22, 2024
@elstic modified the milestones: 2.4.1, 2.4.2 on May 22, 2024
@xiaofan-luan (Contributor):

@bigsheeper, please help with this.

@bigsheeper (Contributor):

[screenshot]

sre-ci-robot pushed a commit that referenced this issue May 23, 2024
issue: #31705

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
czs007 pushed a commit that referenced this issue May 23, 2024
issue: #31705

pr: #33289

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue May 24, 2024
If the request is limited by rate limiter, limiter should not "Cancel".
This is because, if limited, tokens are not deducted; instead, "Cancel"
operation would increase the token count.

issue: #31705

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue May 24, 2024
If the request is limited by rate limiter, limiter should not "Cancel".
This is because, if limited, tokens are not deducted; instead, "Cancel"
operation would increase the token count.

issue: #31705

pr: #33335

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
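
The fix is easiest to see with a toy token bucket (a simplified Python sketch, not Milvus's actual limiter code): a limited request never deducted any tokens, so "cancelling" it refunds tokens it never took and inflates the bucket.

    class TokenBucket:
        """Toy token bucket; not Milvus's actual limiter implementation."""

        def __init__(self, tokens: float):
            self.tokens = tokens

        def reserve(self, n: float) -> bool:
            if self.tokens >= n:
                self.tokens -= n   # tokens are deducted only on success
                return True
            return False           # limited: nothing was deducted

        def cancel(self, n: float) -> None:
            # Refund a reservation; only valid after a successful reserve().
            self.tokens += n

    bucket = TokenBucket(tokens=10)
    if not bucket.reserve(100):
        # Pre-fix behavior: cancelling a *limited* request refunds tokens
        # that were never taken, so the bucket grows and the limit erodes.
        bucket.cancel(100)
    print(bucket.tokens)  # 110: the bug; the fix skips Cancel when limited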
@bigsheeper (Contributor):

/assign @elstic
please help to verify

@elstic (Contributor, Author) commented May 24, 2024:

Verified as fixed.
Verification image: master-20240524-7730b910-amd64

@elstic elstic closed this as completed May 24, 2024
@elstic closed this as completed on May 24, 2024