
insert data too quickly causing datanode crash #25680

Closed
1 task done
iytprince2 opened this issue Jul 17, 2023 · 13 comments
Assignees
Labels
kind/bug Issues or changes related to a bug · stale Indicates no updates for 30 days · triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@iytprince2

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.2.11
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2): pymilvus 2.2.1
- OS (Ubuntu or CentOS): Debian
- CPU/Memory:
- GPU:
- Others: benchmark tool https://github.com/zilliztech/vectordb-benchmark

Current Behavior

When I run the command
python3 main.py recall --host xxx --engine milvus --dataset-name glove-200-angular --config-name milvus_recall_k8s.yaml

the DataNode pods start to fail:
pod/my-release-milvus-datanode-6b885d46-cvjzv 0/1 Running 2 11m
pod/my-release-milvus-datanode-6b885d46-s6vj2 0/1 CrashLoopBackOff 2 11m

Expected Behavior

I can insert all the data I want without the DataNode crashing.

Steps To Reproduce

1. Kubernetes resource config
proxy:
  enabled: true
  replicas: 2 #modified
  resources:
    requests:
      memory: "4Gi" #modified
      cpu: "0.5" #modified 
    limits:
      memory: "4Gi" #modified
      cpu: "1" #modified
   
rootCoordinator:
  enabled: true
  replicas: 2       
  resources:
    requests:
      memory: "2Gi" #modified  
      cpu: "0.5"  #modified
    limits:
      memory: "2Gi" #modified 
      cpu: "1" #modified
  activeStandby:
    enabled: true
  
queryCoordinator:
  enabled: true
  replicas: 2         
  resources:
    requests:
      memory: "1Gi" #modified 
      cpu: "0.5"  #modified
    limits:
      memory: "1Gi"  #modified
      cpu: "1" #modified
  activeStandby:
    enabled: true

queryNode:
  enabled: true
  replicas: 2 #modified
  resources:
    requests:
      memory: "8Gi" #modified  
      cpu: "2" #modified
    limits:
      memory: "8Gi" #modified  
      cpu: "4" #modified
  
indexCoordinator:
  enabled: true
  replicas: 2         
  resources:
    requests:
      memory: "1Gi" #modified  
      cpu: "0.5" #modified  
    limits:
      memory: "1Gi" #modified  
      cpu: "1" #modified
  activeStandby:
    enabled: true

indexNode:
  enabled: true
  replicas: 2 #modified
  resources:
    requests:
      memory: "4Gi" #modified  
      cpu: "2" #modified
    limits:
      memory: "4Gi" #modified  
      cpu: "2" #modified
        
dataCoordinator:
  enabled: true
  replicas: 2         
  resources:
    requests:
      memory: "1Gi" #modified  
      cpu: "0.5" #modified
    limits:
      memory: "1Gi" #modified  
      cpu: "1" #modified
  activeStandby:
    enabled: true
  
dataNode:
  enabled: true
  replicas: 2 #modified
  resources:
    requests:
      memory: "4Gi" #modified  
      cpu: "2"  #modified
    limits:
      memory: "8Gi" #modified 
      cpu: "4" #modified

2. Benchmark tool config
# Connection parameters can be added dynamically
connection_params:
  secure: false
  port: 19532

# Collection parameters can be added dynamically
collection_params:
  collection_name: milvus_benchmark_collection
  other_fields: []

insert_params:
#  The data size of the vector inserted each time can be specified
  batch: 1000

# Index parameters can be added dynamically
index_params:
  index_type: HNSW
  index_param:
    M: 8
    efConstruction: 200

# Load parameters can be added dynamically
# example:
#  load_params:
#     timeout: 60
load_params: {}

# Search parameters can be added dynamically
search_params:
  timeout: 1800

# The following three sets of parameters will be cross-combined,
# and then the corresponding recall rate test will be carried out in turn.
  top_k: [1]
  nq: [10]
  search_param:
    ef: [128, 512, 1024]

3. Run command
python3 main.py recall --host xxx --engine milvus --dataset-name glove-200-angular --config-name milvus_recall_k8s.yaml

Milvus Log

[2023/07/17 13:12:52.433 +00:00] [WARN] [datanode/flush_task.go:230] ["flush task error detected"] [error="All attempts results:\nattempt #1:All attempts results:\nattempt #1:One or more of the specified parts could not be found. The part may not have been uploaded, or the specified entity tag may not match the part's entity tag.\n\nattempt #2:All attempts results:\nattempt #1:One or more of the specified parts could not be found. The part may not have been uploaded, or the specified entity tag may not match the part's entity tag.\n\nattempt #3:All attempts results:\nattempt #1:One or more of the specified parts could not be found. The part may not have been uploaded, or the specified entity tag may not match the part's entity tag.\n\nattempt #4:All attempts results:\nattempt #1:One or more of the specified parts could not be found. The part may not have been uploaded, or the specified entity tag may not match the part's entity tag.\n\nattempt #5:All attempts results:\nattempt #1:One or more of the specified parts could not be found. The part may not have been uploaded, or the specified entity tag may not match the part's entity tag.\n\nattempt #6:All attempts results:\nattempt #1:One or more of the specified parts could not be found. The part may not have been uploaded, or the specified entity tag may not match the part's entity tag.\n\nattempt #7:All attempts results:\nattempt #1:One or more of the specified parts could not be found. The part may not have been uploaded, or the specified entity tag may not match the part's entity tag.\n\nattempt #8:All attempts results:\nattempt #1:One or more of the specified parts could not be found. The part may not have been uploaded, or the specified entity tag may not match the part's entity tag.\n\nattempt #9:All attempts results:\nattempt #1:One or more of the specified parts could not be found. The part may not have been uploaded, or the specified entity tag may not match the part's entity tag.\n\nattempt #10:All attempts results:\nattempt #1:One or more of the specified parts could not be found. The part may not have been uploaded, or the specified entity tag may not match the part's entity tag.\n\n"] []
[2023/07/17 13:12:52.433 +00:00] [ERROR] [datanode/flush_manager.go:775] ["flush pack with error, DataNode quit now"] [error="execution failed"] [stack="github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:775\ngithub.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:204"]
panic: execution failed

goroutine 877 [running]:
github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1(0xc062ade550)
/go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:777 +0x1611
github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish(0xc00138c180, 0xc001b81b00, 0xc000651430)
/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:204 +0xdb
created by github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).init.func1
/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:121 +0xf1

Anything else?

No response

@iytprince2 iytprince2 added kind/bug Issues or changes related to a bug and needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 17, 2023
@iytprince2
Author

However, if I change insert_params: batch from 1000 to 100, everything works fine (see the config sketch below).
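
For reference, the workaround only changes the insert_params section of the benchmark config (milvus_recall_k8s.yaml); everything else stays as in the reproduction steps above:

insert_params:
#  The data size of the vector inserted each time can be specified
  batch: 100   # reduced from 1000; smaller batches no longer trigger the flush failure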

@xiaofan-luan
Collaborator


"One or more of the specified parts could not be found."
I don't think this is an error from Milvus.
I guess this is more of a MinIO issue.

@yanliang567
Contributor

@iytprince2 if you want to benchmark Milvus, please try the new repo: VectorDBBench

/assign @iytprince2
/unassign

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 18, 2023
@iytprince2
Author

@yanliang567
This is not a problem with the benchmark; it involves data insertion. If I change the Milvus version to 2.2.9 with the same config, everything works fine (a sketch of the corresponding Helm override is below).
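
For anyone reproducing that comparison, a minimal sketch of pinning the deployment to 2.2.9 through Helm values — this assumes the upstream milvus-helm chart's image.all keys and the v2.2.9 image tag, so verify both against the chart's values.yaml and the published image tags before using:

image:
  all:
    repository: milvusdb/milvus
    tag: v2.2.9   # assumed tag format for the 2.2.9 release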

@iytprince2
Author

So I don't think it's a problem with MinIO itself; I use the same MinIO deployment in both cases.

@xiaofan-luan
Collaborator

(screenshot attached in the original comment; content not recoverable here)

@xiaofan-luan
Collaborator

The error comes from the S3 service, with status code 400.

If you can provide detailed logs for datacoord and datanode, we can help check whether there are more details.

@stale

stale bot commented Aug 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale Indicates no updates for 30 days label Aug 18, 2023
@stale stale bot closed this as completed Sep 7, 2023
@iytprince2
Author

Finally, I confirmed this to be a MinIO proxy problem: there is too much latency between the proxy and the backend nodes. But I still have no idea about the underlying mechanism.

@iytprince2
Author

/reopen

@sre-ci-robot
Contributor

@iytprince2: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sre-ci-robot sre-ci-robot reopened this Sep 11, 2023
@stale stale bot removed the stale Indicates no updates for 30 days label Sep 11, 2023
@xiaofan-luan
Collaborator

You may want to increase the number of flush routines or limit the insert throughput.

What is the expected throughput of your insertion? (A sketch of a server-side insert rate limit is below.)
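
As a rough sketch of the second suggestion (limiting insert throughput on the server side), Milvus 2.2 exposes DML rate limits under quotaAndLimits in milvus.yaml. The keys and the MB/s unit below are taken from my reading of the 2.2-era config template, so verify them against the milvus.yaml shipped with your exact version before applying:

quotaAndLimits:
  enabled: true
  dml:
    enabled: true      # turn on DML throttling (disabled by default)
    insertRate:
      max: 8           # example cap in MB/s; tune to what the datanodes can flush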

@stale

stale bot commented Oct 11, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale Indicates no updates for 30 days label Oct 11, 2023
@stale stale bot closed this as completed Oct 19, 2023