
[Bug]: Indexnode will crash and auto restart later after create index #25438

Closed
NicoYuan1986 opened this issue Jul 10, 2023 · 9 comments

@NicoYuan1986 (Contributor)

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20230627-3a222e97 
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0.dev80
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The IndexNode crashes and then automatically restarts.

devops : chaos-testing

kubectl logs search-it-tussv-milvus-indexnode-76cbd6db95-gnb4b --previous

terminate called after throwing an instance of 'milvus::storage::S3ErrorException'
  what():  Error:PutObjectBuffer:SlowDown  Resource requested is unreadable, please reduce your request rate
SIGABRT: abort
PC=0x7fbe4580e00b m=8 sigcode=18446744073709551610
signal arrived during cgo execution

Expected Behavior

Index creation should succeed.

Steps To Reproduce

search-it-tussv-milvus-indexnode-76cbd6db95-8zx2p                 1/1     Running                  169 (102s ago)    35h     10.102.5.65     devops-node21   <none>           <none>
search-it-tussv-milvus-indexnode-76cbd6db95-bhbtp                 1/1     Running                  166 (4m16s ago)   35h     10.102.5.64     devops-node21   <none>           <none>
search-it-tussv-milvus-indexnode-76cbd6db95-fvbsm                 1/1     Running                  148 (6m12s ago)   35h     10.102.9.101    devops-node13   <none>           <none>
search-it-tussv-milvus-indexnode-76cbd6db95-gnb4b                 1/1     Running                  147 (12m ago)     35h     10.102.9.100    devops-node13   <none>           <none>
search-it-tussv-milvus-indexnode-76cbd6db95-lfmhd                 1/1     Running                  146 (12m ago)     35h     10.102.9.74     devops-node13   <none>           <none>
search-it-tussv-milvus-indexnode-76cbd6db95-rqbpf                 1/1     Running                  165 (6m52s ago)   35h     10.102.5.62     devops-node21   <none>           <none>
search-it-tussv-milvus-indexnode-76cbd6db95-t4254                 1/1     Running                  165 (10m ago)     35h     10.102.5.59     devops-node21   <none>           <none>
search-it-tussv-milvus-indexnode-76cbd6db95-wp22z                 1/1     Running                  158 (10m ago)     3d9h    10.102.9.139    devops-node13   <none>           <none>

Milvus Log

No response

Anything else?

No response

@NicoYuan1986 NicoYuan1986 added kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jul 10, 2023
@NicoYuan1986 NicoYuan1986 added this to the 2.3 milestone Jul 10, 2023
@NicoYuan1986 NicoYuan1986 changed the title [Bug]: Indexnode will crash and auto restart later [Bug]: Indexnode will crash and auto restart later after create index Jul 10, 2023
@xiaocai2333 (Contributor)

Same as #25129; it has been fixed.
Please retest with the newest image.

@NicoYuan1986 (Contributor, Author) commented Jul 10, 2023

Tried master-20230708-3112dad3: every time flush() is called, the DataNode crashes.
Pods:

search-it-tussv-milvus-datanode-789fb8595d-8n56m                  1/1     Running                  2 (4m13s ago)    56m
search-it-tussv-milvus-datanode-789fb8595d-sdlq5                  1/1     Running                  1 (12m ago)      19m

log:

[2023/07/10 08:04:10.452 +00:00] [WARN] [storage/minio_chunk_manager.go:208] ["failed to put object"] [bucket=milvus-bucket] [path=file/stats_log/442673616441845659/442673616441845660/442748657200691429/100/1] [error="Resource requested is unreadable, please reduce your request rate"]
[2023/07/10 08:04:10.505 +00:00] [DEBUG] [datanode/flush_task.go:221] ["flush pack composed"] [segmentID=442748657200491051] [insertLogs=4] [statsLogs=1] [deleteLogs=0] [flushed=false] [dropped=false]
[2023/07/10 08:04:10.506 +00:00] [WARN] [datanode/flush_task.go:231] ["flush task error detected"] [error="attempt #0: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/1/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/stats_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/101/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/0/442754307931730063: Resource requested is unreadable, please reduce your request rate: attempt #1: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/0/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/1/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/stats_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/101/442754307931730063: Resource requested is unreadable, please reduce your request rate: attempt #2: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/101/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/0/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/1/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/stats_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: attempt #3: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/101/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/0/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/1/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/stats_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: 
failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: attempt #4: failed to write file/stats_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/101/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/0/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/1/442754307931730063: Resource requested is unreadable, please reduce your request rate: attempt #5: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/101/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/0/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/1/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/stats_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: attempt #6: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/101/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/0/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/1/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/stats_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: attempt #7: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/101/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/0/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write 
file/insert_log/442673616441845659/442673616441845660/442748657200491051/1/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/stats_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: attempt #8: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/0/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/1/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/stats_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/101/442754307931730063: Resource requested is unreadable, please reduce your request rate: attempt #9: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/101/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/0/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/insert_log/442673616441845659/442673616441845660/442748657200491051/1/442754307931730063: Resource requested is unreadable, please reduce your request rate: failed to write file/stats_log/442673616441845659/442673616441845660/442748657200491051/100/442754307931730063: Resource requested is unreadable, please reduce your request rate"] []
[2023/07/10 08:04:10.509 +00:00] [ERROR] [datanode/flush_manager.go:853] ["flush pack with error, DataNode quit now"] [error="execution failed"] [errorVerbose="execution failed\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).getFlushPack\n | \t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:232\n | github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish\n | \t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:189\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1571\nWraps: (2) execution failed\nError types: (1) *withstack.withStack (2) *errutil.leafError"] [stack="github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:853\ngithub.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:205"]
panic: execution failed
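
For readers following the retry behavior in the error above: all ten attempts were issued back to back and all were rejected with the same SlowDown response, after which the flush task gave up and the DataNode panicked. Below is a minimal sketch of a retry loop that spaces attempts out with exponential backoff and jitter; the put callback and the error-string check are illustrative assumptions, not the actual chunk-manager API.

package storageutil

import (
	"context"
	"math/rand"
	"strings"
	"time"
)

// retryPut retries a single object write with exponential backoff plus jitter,
// so repeated SlowDown responses do not hammer the object store.
// put is a hypothetical stand-in for the real chunk-manager write call.
func retryPut(ctx context.Context, put func(context.Context) error, maxAttempts int) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 10 * time.Second
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = put(ctx); err == nil {
			return nil
		}
		// Only throttling errors are worth retrying; everything else fails fast.
		if !strings.Contains(err.Error(), "reduce your request rate") {
			return err
		}
		// Wait for the current backoff plus random jitter before the next attempt.
		jitter := time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff + jitter):
		}
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return err
}

The attempt cap could also be dropped for throttling errors, which amounts to the "wait for S3 to recover" behavior discussed below.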

@xiaofan-luan (Contributor)

Looks like this is being rate-limited by S3. Is this a very large test?

@xiaocai2333 (Contributor)

/assign @XuanYang-cn
Can we prevent the DataNode from crashing? Should we trigger the Milvus rate limit instead? Does that make sense?

@XuanYang-cn (Contributor)

@xiaocai2333 Many conditions can trigger a sync, for example the 10-minute checkpoint (cp) sync, so denying insertion won't fix the problem.

When the S3 rate limit is hit, the best option is simply to wait for S3 to recover; there is no real difference between the DataNode crashing and the DataNode staying online but unable to process any messages.

@xiaofan-luan (Contributor)

We need to change flush to asynchronous logic.

@xiaofan-luan (Contributor)

Currently flush is handled synchronously, so we have no choice but to panic.
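
To make the proposal above concrete, here is a rough sketch of an asynchronous flush path; all names (asyncFlusher, flushPack, upload) are hypothetical and this is not the actual DataNode code. Flush packs go onto a queue served by a background worker, and a pack whose upload is throttled is retried by that worker instead of the whole process panicking.

package flushasync

import (
	"log"
	"time"
)

// flushPack is a hypothetical placeholder for the binlogs a flush task must persist.
type flushPack struct {
	segmentID int64
}

// asyncFlusher owns a bounded queue of pending packs and a single background worker.
type asyncFlusher struct {
	queue  chan flushPack
	upload func(flushPack) error // hypothetical stand-in for writing binlogs to object storage
}

func newAsyncFlusher(upload func(flushPack) error) *asyncFlusher {
	f := &asyncFlusher{
		queue:  make(chan flushPack, 1024),
		upload: upload,
	}
	go f.loop()
	return f
}

// Enqueue hands a pack to the background worker. Callers block only when the
// queue is full, which provides natural backpressure while S3 is throttling.
func (f *asyncFlusher) Enqueue(p flushPack) {
	f.queue <- p
}

func (f *asyncFlusher) loop() {
	for p := range f.queue {
		// Keep retrying the same pack until it succeeds; ordering within the
		// queue is preserved, and the foreground path never needs to panic.
		for {
			err := f.upload(p)
			if err == nil {
				break
			}
			log.Printf("flush of segment %d failed, will retry: %v", p.segmentID, err)
			time.Sleep(5 * time.Second)
		}
	}
}

A real implementation would also need to preserve the order in which flush results are reported back and to distinguish retryable throttling from fatal errors, which this sketch deliberately leaves out.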

@stale (bot) commented Aug 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no udpates for 30 days label Aug 18, 2023
@NicoYuan1986 (Contributor, Author)

Not reproduced for weeks on master-20230823-7af0f7d9.
