
[Bug]: [Nightly] DataNode crashes reporting error "syncTimestamp Failed: find no available rootcoord" #25976

Closed

NicoYuan1986 (Contributor) opened this issue Jul 28, 2023 · 5 comments
Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 833674c
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0.dev109
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

DataNode crashes, reporting the error "syncTimestamp Failed: find no available rootcoord".

grafana link: https://grafana-ci.zilliz.cc/d/uLf5cJ3Gz/milvus2-0?orgId=1&var-datasource=prometheus&var-app_name=milvus&var-namespace=milvus-ci&var-instance=mdp-446-n&var-collection=All&var-pod=mdp-446-n-milvus-datacoord-7cdb5ddb46-62tp2&var-component=&from=1690480084424&to=1690485759491

error message:
2023-07-28T03:10:08.822864838+08:00 stdout F [2023/07/27 19:10:08.816 +00:00] [WARN] [datanode/flush_manager.go:941] ["failed to SaveBinlogPaths"] [segmentID=443149363355077541] [error="attempt #0: err: find no available datacoord, check datacoord state
, /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:352 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:129 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]
/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:325 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).SaveBinlogPaths
/go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:909 github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1.1
/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:40 github.com/milvus-io/milvus/pkg/util/retry.Do
/go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:908 github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1
/go/src/github.com/milvus-io/milvus/internal/datanode/flush_task.go:205 github.com/milvus-io/milvus/internal/datanode.(*flushTaskRunner).waitFinish
/usr/local/go/src/runtime/asm_amd64.s:1571 runtime.goexit
: attempt #1: find no available datacoord, check datacoord state
[attempts #2 through #46 repeat "find no available datacoord, check datacoord state" verbatim]
: attempt #47: context canceled: attempt #48: context canceled: attempt #49: context canceled"]
2023-07-28T03:10:08.830866717+08:00 stderr F panic: attempt #0: err: find no available datacoord, check datacoord state
2023-07-28T03:10:08.830890559+08:00 stderr F , /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace
2023-07-28T03:10:08.830899292+08:00 stderr F /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:352 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall
2023-07-28T03:10:08.83090431+08:00 stderr F /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:129 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]
2023-07-28T03:10:08.830913422+08:00 stderr F /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:325 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).SaveBinlogPaths
2023-07-28T03:10:08.830917496+08:00 stderr F /go/src/github.com/milvus-io/milvus/internal/datanode/flush_manager.go:909 github.com/milvus-io/milvus/internal/datanode.flushNotifyFunc.func1.1
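
For context on the error shape: the long "attempt #N" chain is produced by a retry helper (pkg/util/retry.Do in the stack trace) that folds every attempt's error into a single message before the flush callback panics. Below is a minimal, self-contained sketch of that pattern; `retryDo`, `saveBinlogPaths`, and the attempt count/interval are illustrative assumptions, not the actual Milvus API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryDo is an illustrative stand-in for a retry helper like
// pkg/util/retry.Do: it calls fn up to `attempts` times and folds each
// attempt's error into one combined error, which is why the panic above
// lists "attempt #0" through "attempt #49" in a single string.
func retryDo(ctx context.Context, fn func() error, attempts int, interval time.Duration) error {
	var errs []error
	for i := 0; i < attempts; i++ {
		// Once the surrounding context is gone, remaining attempts fail
		// fast, matching the trailing "context canceled" entries.
		if ctx.Err() != nil {
			errs = append(errs, fmt.Errorf("attempt #%d: %w", i, ctx.Err()))
			continue
		}
		if err := fn(); err != nil {
			errs = append(errs, fmt.Errorf("attempt #%d: %w", i, err))
			time.Sleep(interval)
			continue
		}
		return nil
	}
	return errors.Join(errs...)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Stand-in for the DataCoord RPC that kept failing in this issue.
	saveBinlogPaths := func() error {
		return errors.New("find no available datacoord, check datacoord state")
	}

	if err := retryDo(ctx, saveBinlogPaths, 50, 100*time.Millisecond); err != nil {
		// The flush path treats exhausted retries as unrecoverable:
		// losing binlog paths would corrupt segment state, so the
		// DataNode panics rather than continuing, as the stderr
		// trace above shows.
		panic(err)
	}
}
```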

Expected Behavior

The nightly test should pass; DataNode should not crash.

Steps To Reproduce

No response

Milvus Log

  1. link: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/446/pipeline/237/
  2. log: artifacts-milvus-distributed-pulsar-nightly-446-pymilvus-e2e-logs.tar.gz
  3. the error starts at about [2023-07-27T19:07:48.041Z] [gw0] [ 74%] FAILED testcases/test_search.py::TestSearchString::test_search_with_different_string_expr[128-True-False-"0" <= varchar <= "100"]
  4. collection name: search_collection_sh3erWbW

Anything else?

No response

@NicoYuan1986 NicoYuan1986 added kind/bug Issues or changes related to a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 28, 2023
@NicoYuan1986 NicoYuan1986 added this to the 2.3 milestone Jul 28, 2023
@yanliang567 (Contributor)

/assign @jiaoew1991
/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 28, 2023
@jiaoew1991 (Contributor)

/assign @MrPresent-Han
/unassign

@MrPresent-Han (Contributor)

[2023/07/27 19:07:44.381 +00:00] [WARN] [sessionutil/session_util.go:798] ["connection lost detected, shuting down"]
[2023/07/27 19:07:44.381 +00:00] [WARN] [sessionutil/session_util.go:464] ["session keepalive channel closed"]
[2023/07/27 19:07:44.381 +00:00] [INFO] [sessionutil/session_util.go:845] ["session key is deleted, exit..."] [role=rootcoord] [key=by-dev/meta/session/rootcoord]
[2023/07/27 19:07:42.370 +00:00] [INFO] [rootcoord/root_coord.go:1096] ["received request to describe collection"] [traceID=4723c0af45537c9ddc345ac95cbc4dcb] [collectionName=] [dbName=] [id=443149363355077442] [ts=18446744073709551615] [allowUnavailable=true]
[2023/07/27 19:07:39.880 +00:00] [WARN] [rootcoord/root_coord.go:242] ["failed to update tso"] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded
(1) attached stack trace
  -- stack trace:
  | github.com/milvus-io/milvus/internal/tso.(*timestampOracle).saveTimestamp
  |   /go/src/github.com/milvus-io/milvus/internal/tso/tso.go:98
  | github.com/milvus-io/milvus/internal/tso.(*timestampOracle).UpdateTimestamp
  |   /go/src/github.com/milvus-io/milvus/internal/tso/tso.go:201
  | github.com/milvus-io/milvus/internal/tso.(*GlobalTSOAllocator).UpdateTSO
  |   /go/src/github.com/milvus-io/milvus/internal/tso/global_allocator.go:100
  | github.com/milvus-io/milvus/internal/rootcoord.(*Core).tsLoop
  |   /go/src/github.com/milvus-io/milvus/internal/rootcoord/root_coord.go:241
  | runtime.goexit
  |   /usr/local/go/src/runtime/asm_amd64.s:1571
Wraps: (2) context deadline exceeded
Error types: (1) *withstack.withStack (2) context.deadlineExceededError"]
[2023/07/27 19:07:39.880 +00:00] [WARN] [etcd/etcd_kv.go:647] ["Slow etcd operation save"] ["time spent"=10.000248845s] [key=by-dev/kv/tso/timestamp]
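
The last two WARN lines are causally linked: rootcoord persists its TSO high watermark to etcd under a context deadline, so when the etcd save takes ~10 s the deadline fires first. A hedged sketch of that shape, using the upstream go.etcd.io/etcd/client/v3 API; the 3-second timeout, endpoint, and the body of `saveTimestamp` are illustrative assumptions mirroring the stack trace, not Milvus's exact code or configuration.

```go
package main

import (
	"context"
	"encoding/binary"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// saveTimestamp persists a TSO high watermark to etcd under a deadline,
// mirroring the tsLoop -> UpdateTimestamp -> saveTimestamp chain above.
func saveTimestamp(cli *clientv3.Client, ts uint64) error {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, ts)

	// If etcd needs ~10s to service this Put (the "Slow etcd operation
	// save" WARN), this deadline fires first and the caller logs
	// "failed to update tso ... context deadline exceeded".
	_, err := cli.Put(ctx, "by-dev/kv/tso/timestamp", string(buf))
	return err
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	if err := saveTimestamp(cli, uint64(time.Now().UnixNano())); err != nil {
		log.Printf("failed to update tso: %v", err)
	}
}
```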

I believe this is caused by an etcd performance bottleneck: the keep-alive session expired, and rootCoord exited by itself as a result.
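
A hedged sketch of the session mechanism behind that diagnosis, using the upstream etcd clientv3 API directly (the endpoint, key value, and 10-second TTL are illustrative; Milvus wraps this logic in sessionutil): if etcd cannot answer keep-alive requests before the lease TTL elapses, the lease expires, etcd deletes the session key server-side, the keep-alive channel closes, and the component shuts itself down.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Register a service key bound to a short-TTL lease, the way
	// sessionutil registers by-dev/meta/session/rootcoord.
	lease, err := cli.Grant(context.Background(), 10)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Put(context.Background(), "by-dev/meta/session/rootcoord",
		"rootcoord-addr", clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}

	// KeepAlive must refresh the lease before its TTL elapses. When etcd
	// is too slow to answer, the lease expires, etcd deletes the session
	// key ("session key is deleted, exit..."), and the channel closes.
	ch, err := cli.KeepAlive(context.Background(), lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	for range ch {
		// Keep-alive responses arrive here while the session is healthy.
	}

	// Closed channel == "session keepalive channel closed": the component
	// is no longer discoverable by its peers, so it shuts itself down,
	// exactly as the rootcoord log shows.
	log.Println("connection lost detected, shutting down")
}
```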


stale bot commented Sep 7, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no updates for 30 days label Sep 7, 2023
@NicoYuan1986 (Contributor, Author)

The timestamp has now been removed (v2.3.0).
