
[Bug]: Milvus standalone service cannot start #31858

Closed
0215Arthur opened this issue Apr 3, 2024 · 15 comments
Labels: kind/bug, stale, triage/needs-information

@0215Arthur

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.3.12
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):     rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): 2.3.7
- OS(Ubuntu or CentOS): Ubuntu
- CPU/Memory:  16Core, 64GB
- GPU: -
- Others: - docker compose up

Current Behavior

I built a collection (DiskANN index) with 12M embeddings (1536 dimensions) and moved its volume to a new machine with a higher hardware configuration, but Milvus standalone cannot start on the new machine.

I used fio to check the disk performance (EBS gp3 disk): IOPS 850.

fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=2200m --bs=2300 --name=mytest

Container logs at startup:
milvus-standalone-prod-02  | [2024/04/03 02:07:31.193 +00:00] [INFO] [datacoord/index_builder.go:258] ["index builder peek client error, there is no available"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.193 +00:00] [INFO] [datacoord/index_builder.go:195] ["there is no idle indexing node, wait a minute..."]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.220 +00:00] [WARN] [rootcoord/root_coord.go:1598] ["failed to updateTimeTick"] [role=rootcoord] [error="skip ChannelTimeTickMsg from un-recognized session 38"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.220 +00:00] [WARN] [proxy/proxy.go:368] [sendChannelsTimeTickLoop.UpdateChannelTimeTick] [ErrorCode=UnexpectedError] [Reason="skip ChannelTimeTickMsg from un-recognized session 38"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.420 +00:00] [WARN] [rootcoord/root_coord.go:1598] ["failed to updateTimeTick"] [role=rootcoord] [error="skip ChannelTimeTickMsg from un-recognized session 38"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.420 +00:00] [WARN] [proxy/proxy.go:368] [sendChannelsTimeTickLoop.UpdateChannelTimeTick] [ErrorCode=UnexpectedError] [Reason="skip ChannelTimeTickMsg from un-recognized session 38"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.423 +00:00] [WARN] [sessionutil/session_util.go:551] ["fail to retry keepAliveOnce"] [serverName=rootcoord] [LeaseID=7587877772355459635] [error="attempt #0: etcdserver: requested lease not found: attempt #1: etcdserver: requested lease not found: attempt #2: etcdserver: requested lease not found"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.423 +00:00] [WARN] [sessionutil/session_util.go:551] ["fail to retry keepAliveOnce"] [serverName=datacoord] [LeaseID=7587877772355459653] [error="attempt #0: etcdserver: requested lease not found: attempt #1: etcdserver: requested lease not found: attempt #2: etcdserver: requested lease not found"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.423 +00:00] [WARN] [sessionutil/session_util.go:551] ["fail to retry keepAliveOnce"] [serverName=proxy] [LeaseID=7587877772355459687] [error="attempt #0: etcdserver: requested lease not found: attempt #1: etcdserver: requested lease not found: attempt #2: etcdserver: requested lease not found"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.423 +00:00] [WARN] [sessionutil/session_util.go:880] ["connection lost detected, shuting down"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.423 +00:00] [WARN] [sessionutil/session_util.go:880] ["connection lost detected, shuting down"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.423 +00:00] [ERROR] [rootcoord/root_coord.go:281] ["Root Coord disconnected from etcd, process will exit"] ["Server Id"=38] [stack="github.com/milvus-io/milvus/internal/rootcoord.(*Core).Register.func1\n\t/go/src/github.com/milvus-io/milvus/internal/rootcoord/root_coord.go:281"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.423 +00:00] [WARN] [sessionutil/session_util.go:880] ["connection lost detected, shuting down"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.423 +00:00] [ERROR] [datacoord/server.go:271] ["disconnected from etcd and exited"] [serverID=38] [stack="github.com/milvus-io/milvus/internal/datacoord.(*Server).Register.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datacoord/server.go:271"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.423 +00:00] [WARN] [sessionutil/session_util.go:551] ["fail to retry keepAliveOnce"] [serverName=indexcoord] [LeaseID=7587877772355459650] [error="attempt #0: etcdserver: requested lease not found: attempt #1: etcdserver: requested lease not found: attempt #2: etcdserver: requested lease not found"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.423 +00:00] [WARN] [sessionutil/session_util.go:551] ["fail to retry keepAliveOnce"] [serverName=datanode] [LeaseID=7587877772355459707] [error="attempt #0: etcdserver: requested lease not found: attempt #1: etcdserver: requested lease not found: attempt #2: etcdserver: requested lease not found"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.423 +00:00] [WARN] [sessionutil/session_util.go:880] ["connection lost detected, shuting down"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.423 +00:00] [ERROR] [datanode/data_node.go:189] ["Data Node disconnected from etcd, process will exit"] ["Server Id"=38] [stack="github.com/milvus-io/milvus/internal/datanode.(*DataNode).Register.func1\n\t/go/src/github.com/milvus-io/milvus/internal/datanode/data_node.go:189"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.423 +00:00] [WARN] [sessionutil/session_util.go:551] ["fail to retry keepAliveOnce"] [serverName=querynode] [LeaseID=7587877772355459714] [error="attempt #0: etcdserver: requested lease not found: attempt #1: etcdserver: requested lease not found: attempt #2: etcdserver: requested lease not found"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.423 +00:00] [WARN] [sessionutil/session_util.go:880] ["connection lost detected, shuting down"]
milvus-standalone-prod-02  | [2024/04/03 02:07:31.423 +00:00] [ERROR] [querynodev2/server.go:169] ["Query Node disconnected from etcd, process will exit"] ["Server Id"=38] [stack="github.com/milvus-io/milvus/internal/querynodev2.(*QueryNode).Register.func1\n\t/go/src/github.com/milvus-io/milvus/internal/querynodev2/server.go:169"]

full log: milvus.ec2.log

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@0215Arthur added the kind/bug (Issues or changes related to a bug) and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one) labels on Apr 3, 2024
@yanliang567 (Contributor)

I think etcd is a bit slow and the Milvus components lost their keepalive with the etcd service:
milvus-standalone-prod-02 | [2024/04/03 02:01:26.981 +00:00] [WARN] [etcd/etcd_kv.go:656] ["Slow etcd operation save"] ["time spent"=7.00170008s] [key=by-dev/kv/gid/timestamp]
I'd suggest using NVMe volumes, as you are running a DiskANN index. @0215Arthur do you have any metrics to check the resource usage during that time?

/assign @0215Arthur
/unassign
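For capturing those metrics, a minimal way to watch disk pressure while Milvus starts would be something like the following (assuming the sysstat tools are installed; the device name is only a placeholder for the EBS volume backing the Milvus data directory):

# Per-device utilization, queue depth and await every 5 seconds (device name is an example).
iostat -xm 5 /dev/nvme1n1
# Or record a window of disk stats to a file for later review: 120 samples, 5 s apart.
sar -d -o milvus-start.sar 5 120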

@yanliang567 added the triage/needs-information (Indicates an issue needs more information in order to work on it) label and removed the needs-triage label on Apr 3, 2024
@0215Arthur (Author)


  • The data volume is on NVMe SSD (AWS EBS gp3 volumes; the throughput is 125 MB/s). Will the throughput be the bottleneck?
(image attached)

@xiaofan-luan (Contributor)


What about your etcd disk?
Is there a chance etcd shares the disk with the Milvus node?

@xiaofan-luan (Contributor)

@0215Arthur
Also, 3000 IOPS is usually not good enough for DiskANN.
We need a disk with 50,000 IOPS or more.

@0215Arthur (Author)


Yes, etcd shares the disk with the Milvus node (it is just standalone).

@xiaofan-luan (Contributor)


This is mainly because Milvus eats up all the IOPS and EBS is rate-limiting your disk.

  1. Don't use EBS for DiskANN.
  2. At the very least, don't share your etcd disk with Milvus.

@0215Arthur (Author)

@xiaofan-luan Thanks very much for your help.
Are there any other storage solutions you would recommend if we don't use EBS on AWS?

@xiaofan-luan (Contributor)

We use local NVMe SSDs.

One possible solution is to buy 10 gp3 disks and RAID them together.
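As a rough sketch of what "RAID them together" could look like on Linux (a software RAID 0 stripe built with mdadm; the device names, volume count, and mount point below are placeholders, not part of the original advice):

# Stripe four attached gp3 volumes (example device names) into one RAID 0 array.
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1
mkfs.ext4 /dev/md0                      # format the striped array
mkdir -p /var/lib/milvus-data
mount /dev/md0 /var/lib/milvus-data     # point the Milvus data/volumes directory here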

@0215Arthur (Author)


OK. Besides, how do I configure the etcd disk so that it is separate from the Milvus node's disk in standalone mode?

@xiaofan-luan (Contributor)


You need to mount at least two disks; the etcd data dir has its own config, separate from the local data dir.

If you can use Zilliz Cloud for DiskANN, that would save you a lot of trouble. We hit similar issues every day, which is why we know how to handle them.
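A minimal sketch of that two-disk setup under docker compose (device names and mount points are placeholders; the exact bind-mount source to change depends on your docker-compose.yml, so check where the etcd service stores its data, typically a ./volumes/etcd style directory):

# Format and mount a dedicated volume for etcd data (example device name).
mkfs.ext4 /dev/nvme2n1
mkdir -p /mnt/etcd-data
mount /dev/nvme2n1 /mnt/etcd-data
# Then, in docker-compose.yml, point the etcd service's volume source at /mnt/etcd-data
# instead of the shared Milvus volumes directory, and restart with: docker compose up -d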


stale bot commented May 3, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale (indicates no updates for 30 days) label on May 3, 2024
@stale stale bot closed this as completed on May 12, 2024
@nelsonspbr

I am having similar issues trying to set up Milvus standalone with 10k vectors of dim=1024. The disk is NVMe; I get 160k randrw IOPS with fio. 1k vectors worked fine and took seconds to finish, but I left 10k running overnight and it never finished. Some WARN messages from the Milvus container:

[2024/06/12 06:23:44.636 +00:00] [WARN] [tso/global_allocator.go:100] ["clock offset is huge, check network latency and clock skew"] [jet-lag=4.442299524s] [prev-physical=2024/06/12 06:23:40.194 +00:00] [now=2024/06/12 06:23:44.636 +00:00]
[2024/06/12 06:23:44.636 +00:00] [WARN] [etcd/etcd_kv.go:656] ["Slow etcd operation load with prefix"] ["time spent"=4.042356125s] [keys="[]"]
[2024/06/12 06:24:04.860 +00:00] [WARN] [etcd/etcd_kv.go:656] ["Slow etcd operation save"] ["time spent"=2.21633216s] [key=by-dev/kv/gid/timestamp]
[2024/06/12 06:24:25.083 +00:00] [WARN] [etcd/etcd_kv.go:656] ["Slow etcd operation save"] ["time spent"=4.439712441s] [key=by-dev/kv/gid/timestamp]
[2024/06/12 06:24:25.084 +00:00] [WARN] [etcd/etcd_kv.go:656] ["Slow etcd operation load with prefix"] ["time spent"=2.489465676s] [keys="[]"]
[2024/06/12 06:24:45.307 +00:00] [WARN] [etcd/etcd_kv.go:656] ["Slow etcd operation save"] ["time spent"=2.213150313s] [key=by-dev/kv/gid/timestamp]
...
[2024/06/12 06:33:21.147 +00:00] [WARN] [server/rocksmq_impl.go:670] ["rocksmq produce too slowly"] [topic=by-dev-rootcoord-dml_5] ["get lock elapse"=0] ["alloc elapse"=0] ["write elapse"=1438] ["updatePage elapse"=0] ["produce total elapse"=1438]
[2024/06/12 06:33:21.147 +00:00] [WARN] [rootcoord/timeticksync.go:313] ["rootcoord send tt to all channels too slowly"] [chanNum=1] [span=1438]
[2024/06/12 06:33:41.371 +00:00] [WARN] [tso/global_allocator.go:100] ["clock offset is huge, check network latency and clock skew"] [jet-lag=727.053189ms] [prev-physical=2024/06/12 06:33:40.644 +00:00] [now=2024/06/12 06:33:41.371 +00:00]
[2024/06/12 06:34:42.043 +00:00] [WARN] [tso/global_allocator.go:100] ["clock offset is huge, check network latency and clock skew"] [jet-lag=1.399967219s] [prev-physical=2024/06/12 06:34:40.643 +00:00] [now=2024/06/12 06:34:42.043 +00:00]
[2024/06/12 06:35:22.491 +00:00] [WARN] [etcd/etcd_kv.go:656] ["Slow etcd operation save"] ["time spent"=2.847038969s] [key=by-dev/kv/gid/timestamp]
[2024/06/12 06:35:42.715 +00:00] [WARN] [etcd/etcd_kv.go:656] ["Slow etcd operation save"] ["time spent"=2.071376625s] [key=by-dev/kv/gid/timestamp]
[2024/06/12 06:35:42.715 +00:00] [WARN] [tso/global_allocator.go:100] ["clock offset is huge, check network latency and clock skew"] [jet-lag=2.071443374s] [prev-physical=2024/06/12 06:35:40.644 +00:00] [now=2024/06/12 06:35:42.715 +00:00]
[2024/06/12 06:35:58.907 +00:00] [WARN] [server/rocksmq_impl.go:670] ["rocksmq produce too slowly"] [topic=by-dev-rootcoord-dml_5] ["get lock elapse"=0] ["alloc elapse"=0] ["write elapse"=599] ["updatePage elapse"=0] ["produce total elapse"=599]
[2024/06/12 06:35:58.908 +00:00] [WARN] [rootcoord/timeticksync.go:313] ["rootcoord send tt to all channels too slowly"] [chanNum=1] [span=599]
[2024/06/12 06:37:00.539 +00:00] [WARN] [tso/global_allocator.go:100] ["clock offset is huge, check network latency and clock skew"] [jet-lag=1.895624756s] [prev-physical=2024/06/12 06:36:58.643 +00:00] [now=2024/06/12 06:37:00.539 +00:00]

@xiaofan-luan (Contributor)


You need to check your system time. It seems that etcd and Milvus have a large clock skew, and this could cause serious issues.
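For reference, a quick way to check the host clock (both commands are standard on systemd/chrony hosts); note that in a single-host standalone deployment the "jet-lag" warnings may also just reflect the slow etcd saves shown in the log rather than a real clock problem:

timedatectl status    # shows local time, RTC time, and whether NTP synchronization is active
chronyc tracking      # if chrony is installed: current offset and drift relative to the NTP source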

@nelsonspbr

Isn't etcd set up by the same container that runs Milvus? It's a standalone deployment -- single node, single container, everything in it. In that case, how could they be operating on different times?

Also, I just ran more tests — using vectors of dim=1024, if I try to index 1023 vectors it works, but if I try 1024 it hangs:

(20240328-vdb) nmg@css-host-113 20240328-vdb$ ./setup.diskann --vec 1023

collection_name   = vdb
consistency_level = Strong
dim               = 1024
num_shards        = 1
seed              = 1337
step              = 8192
uri               = http://localhost:19530
vec               = 1023

VDB connect              1718286203.351802480 1718286203.371534270        0.020
VDB cleanup              1718286203.371534270 1718286203.375350260        0.004
VDB collection           1718286203.375350260 1718286203.384824931        0.009

100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  3.13it/s]

VDB insert               1718286203.384824931 1718286203.711976963        0.327
VDB flush                1718286203.711976963 1718286207.779546954        4.068
VDB index                1718286207.779546954 1718286208.799875698        1.020
VDB total                1718286203.351802480 1718286208.799875698        5.448
(20240328-vdb) nmg@css-host-113 20240328-vdb$ ./setup.diskann --vec 1024

collection_name   = vdb
consistency_level = Strong
dim               = 1024
num_shards        = 1
seed              = 1337
step              = 8192
uri               = http://localhost:19530
vec               = 1024

VDB connect              1718286258.797450141 1718286258.830366281        0.033
VDB cleanup              1718286258.830366281 1718286258.833845211        0.003
VDB collection           1718286258.833845211 1718286258.843161521        0.009

100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  2.99it/s]

VDB insert               1718286258.843161521 1718286259.182248213        0.339
VDB flush                1718286259.182248213 1718286263.213862396        4.032
<hangs>

@yanliang567 (Contributor)

I suspect it is caused by building the DiskANN index, because Milvus only builds an index once there are 1024 or more vectors in the collection. Please file a new issue and attach the full Milvus logs for investigation. @nelsonspbr
