
[Bug]: There are still thousands of small segments here but the compaction has stopped #32553

Open
ThreadDao opened this issue Apr 24, 2024 · 8 comments

@ThreadDao (Contributor)

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: cardinal-milvus-io-2.3-d63549588-20240412 and cardinal-milvus-io-2.3-f6c0c15-20240408
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. birdwatcher show segments indicates there are 9843 small segments:
--- Growing: 0, Sealed: 0, Flushed: 9907, Dropped: 145241
--- Small Segments: 9843, row count: 4419481	 Other Segments: 64, row count: 52319999
--- Total Segments: 9907, row count: 56739480
  2. But starting from about noon yesterday, the compaction seems to have stopped. (screenshot of compaction metrics attached)
  3. By the way:
Milvus(laion1b-test-2) > show channel-watch --collection 448960188666872753
=============================
key: laion1b-test-2/meta/channelwatch/2423/laion1b-test-2-rootcoord-dml_4_448960188666872753v0
Channel Name:laion1b-test-2-rootcoord-dml_4_448960188666872753v0 	 WatchState: WatchSuccess
Channel Watch start from: 2024-04-24 10:44:16 +0800, timeout at: 1970-01-01 08:00:00 +0800
Start Position ID: [8 129 133 1 16 129 159 1 24 0 32 0], time: 2024-04-24 10:43:23.703 +0800
Unflushed segments: []
Flushed segments: []
Dropped segments: []
=============================
key: laion1b-test-2/meta/channelwatch/2423/laion1b-test-2-rootcoord-dml_5_448960188666872753v1
Channel Name:laion1b-test-2-rootcoord-dml_5_448960188666872753v1 	 WatchState: WatchSuccess
Channel Watch start from: 2024-04-24 10:44:16 +0800, timeout at: 1970-01-01 08:00:00 +0800
Start Position ID: [8 255 132 1 16 214 140 2 24 0 32 0], time: 2024-04-24 10:43:23.703 +0800
Unflushed segments: []
Flushed segments: []
Dropped segments: []
--- Total Channels: 2

Milvus(laion1b-test-2) > show checkpoint --collection 448960188666872753
vchannel laion1b-test-2-rootcoord-dml_4_448960188666872753v0 seek to 2024-04-24 10:55:04.277 +0800 CST, cp channel: laion1b-test-2-rootcoord-dml_4_448960188666872753v0, Source: Channel Checkpoint
vchannel laion1b-test-2-rootcoord-dml_5_448960188666872753v1 seek to 2024-04-24 10:55:04.277 +0800 CST, cp channel: laion1b-test-2-rootcoord-dml_5_448960188666872753v1, Source: Channel Checkpoint

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

pods:

laion1b-test-2-etcd-0                                             1/1     Running                           1 (103d ago)      130d    10.104.25.31    4am-node30   <none>           <none>
laion1b-test-2-etcd-1                                             1/1     Running                           0                 130d    10.104.30.94    4am-node38   <none>           <none>
laion1b-test-2-etcd-2                                             1/1     Running                           0                 130d    10.104.34.225   4am-node37   <none>           <none>
laion1b-test-2-milvus-datanode-8d8768545-hxgzw                    1/1     Running                           40 (10m ago)      12d     10.104.31.76    4am-node34   <none>           <none>
laion1b-test-2-milvus-datanode-8d8768545-m5wfw                    1/1     Running                           0                 10h     10.104.6.205    4am-node13   <none>           <none>
laion1b-test-2-milvus-indexnode-65bd756d8c-76g79                  1/1     Running                           0                 14d     10.104.5.106    4am-node12   <none>           <none>
laion1b-test-2-milvus-indexnode-65bd756d8c-dxf6c                  1/1     Running                           1 (8d ago)        14d     10.104.19.223   4am-node28   <none>           <none>
laion1b-test-2-milvus-indexnode-65bd756d8c-gm26p                  1/1     Running                           3 (31h ago)       14d     10.104.30.167   4am-node38   <none>           <none>
laion1b-test-2-milvus-mixcoord-794cfb7b6d-m495w                   0/1     Running                           39 (5m37s ago)    12d     10.104.29.207   4am-node35   <none>           <none>
laion1b-test-2-milvus-proxy-fcbf74dfb-tjnqk                       1/1     Running                           0                 11d     10.104.34.246   4am-node37   <none>           <none>
laion1b-test-2-milvus-querynode-0-6c76d87fcd-4m98p                1/1     Running                           2 (5d10h ago)     12d     10.104.23.203   4am-node27   <none>           <none>
laion1b-test-2-milvus-querynode-0-6c76d87fcd-hmqv9                1/1     Running                           0                 9h      10.104.24.90    4am-node29   <none>           <none>
laion1b-test-2-milvus-querynode-0-6c76d87fcd-qlhms                1/1     Running                           2 (2d10h ago)     14d     10.104.17.251   4am-node23   <none>           <none>
laion1b-test-2-milvus-querynode-0-6c76d87fcd-wsrbz                1/1     Running                           3 (3h29m ago)     15h     10.104.29.249   4am-node35   <none>           <none>
laion1b-test-2-pulsar-bookie-0                                    1/1     Running                           0                 130d    10.104.33.107   4am-node36   <none>           <none>
laion1b-test-2-pulsar-bookie-1                                    1/1     Running                           0                 40m     10.104.18.164   4am-node25   <none>           <none>
laion1b-test-2-pulsar-bookie-2                                    1/1     Running                           0                 130d    10.104.25.32    4am-node30   <none>           <none>
laion1b-test-2-pulsar-broker-0                                    1/1     Running                           1 (3d7h ago)      12d     10.104.1.147    4am-node10   <none>           <none>
laion1b-test-2-pulsar-proxy-0                                     1/1     Running                           0                 12d     10.104.32.200   4am-node39   <none>           <none>
laion1b-test-2-pulsar-recovery-0                                  1/1     Running                           1 (7h21m ago)     32d     10.104.31.87    4am-node34   <none>           <none>
laion1b-test-2-pulsar-zookeeper-0                                 1/1     Running                           0                 130d    10.104.29.87    4am-node35   <none>           <none>
laion1b-test-2-pulsar-zookeeper-1                                 1/1     Running                           0                 12d     10.104.21.196   4am-node24   <none>           <none>
laion1b-test-2-pulsar-zookeeper-2                                 1/1     Running                           0                 130d    10.104.34.229   4am-node37   <none>           <none>

Anything else?

No response

@ThreadDao ThreadDao added kind/bug Issues or changes related to a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 24, 2024
@ThreadDao ThreadDao added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Apr 24, 2024
@ThreadDao ThreadDao added this to the 2.3.15 milestone Apr 24, 2024
@xiaofan-luan (Contributor)

/assign @XuanYang-cn
I think I've seen similar cases before. We need to improve the compaction policy a bit.
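
For a sense of what such a policy tweak could look like: a minimal sketch in Go of merging small segments up to a target size. All names and thresholds here (SegmentInfo, maxRowsPerSegment, smallRatio) are illustrative assumptions, not datacoord's actual code.

```go
package compaction

import "sort"

// SegmentInfo is a stand-in for datacoord's segment metadata.
type SegmentInfo struct {
	ID      int64
	NumRows int64
}

const (
	maxRowsPerSegment = 1_000_000 // assumed target segment size
	smallRatio        = 0.5       // assumed "small segment" cutoff
)

// smallSegmentPlans greedily buckets small segments into merge plans that
// approach the target size, so thousands of tiny segments collapse into a
// handful of full ones.
func smallSegmentPlans(segments []*SegmentInfo) [][]int64 {
	var small []*SegmentInfo
	for _, s := range segments {
		if float64(s.NumRows) < smallRatio*maxRowsPerSegment {
			small = append(small, s)
		}
	}
	// Largest first so each bucket fills up quickly.
	sort.Slice(small, func(i, j int) bool { return small[i].NumRows > small[j].NumRows })

	var plans [][]int64
	var bucket []int64
	var rows int64
	for _, s := range small {
		// With the 0.5 cutoff, an overflowing bucket always has >= 2 members.
		if rows+s.NumRows > maxRowsPerSegment && len(bucket) > 0 {
			plans = append(plans, bucket)
			bucket, rows = nil, 0
		}
		bucket = append(bucket, s.ID)
		rows += s.NumRows
	}
	if len(bucket) > 1 { // merging a single segment gains nothing
		plans = append(plans, bucket)
	}
	return plans
}
```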

@XuanYang-cn (Contributor) commented Apr 24, 2024

/assign @tedxu
Please help investigate.

@XuanYang-cn (Contributor)

(screenshot of compaction tasks attached)
All compaction tasks failed with "illegal compaction plan".

@XuanYang-cn (Contributor)

There are empty segments in the datacoord meta:

[2024/04/24 09:41:05.433 +00:00] [WARN] [datanode/compactor.go:594] ["compact wrong, all segments' binlogs are empty"] [planID=449298080779028688]
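
That warning suggests such plans should be filtered out before they ever reach a datanode. A rough sketch of a pre-flight guard (hypothetical types and names, not the real compactor API) that prunes empty segments and rejects a plan with no data left:

```go
package compaction

import "errors"

// SegmentBinlogs is a stand-in for the per-segment binlog metadata a
// compaction plan carries; names are illustrative, not Milvus's real API.
type SegmentBinlogs struct {
	SegmentID    int64
	FieldBinlogs [][]string // binlog paths per field; all empty means no data
}

var errAllSegmentsEmpty = errors.New("illegal plan: all segments' binlogs are empty")

// pruneEmptySegments drops segments that carry no binlogs and rejects the
// plan outright if nothing with data remains, so the datanode never receives
// a plan it can only fail on.
func pruneEmptySegments(segs []*SegmentBinlogs) ([]*SegmentBinlogs, error) {
	var kept []*SegmentBinlogs
	for _, s := range segs {
		for _, f := range s.FieldBinlogs {
			if len(f) > 0 {
				kept = append(kept, s)
				break
			}
		}
	}
	if len(kept) == 0 {
		return nil, errAllSegmentsEmpty
	}
	return kept, nil
}
```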

@XuanYang-cn (Contributor)

(screenshot attached)

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 25, 2024
@yanliang567 yanliang567 removed their assignment Apr 25, 2024
@XuanYang-cn (Contributor) commented Apr 26, 2024

Msgs were lost for segment A (449118743319978903).

The channel recovered at ts 449242669990281216, which is earlier than segment A's start position ts 449242704121430020.

Segment A's start position:

{\"start_position\":{\"channel_name\":\"laion1b-test-2-rootcoord-dml_5_448960188666872753v1\",\"msgID\":\"CJaCARCGvgIYACAA\",\"msgGroup\":\"datanode-2313-laion1b-test-2-rootcoord-dml_5_448960188666872753v1-true\",\"timestamp\":449242704121430020},\"segmentID\":449118743319978903}
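
As a side note, these hybrid timestamps can be decoded to wall-clock time to check the ordering. Milvus's tsoutil.ParseTS does this split; the self-contained sketch below assumes the usual layout of physical milliseconds above 18 logical bits:

```go
package main

import (
	"fmt"
	"time"
)

// A Milvus hybrid (TSO) timestamp packs physical milliseconds since the Unix
// epoch above an 18-bit logical counter; tsoutil.ParseTS in the codebase
// performs this split. Reproduced here to compare the two positions above.
const logicalBits = 18

func parseTS(ts uint64) (time.Time, uint64) {
	logical := ts & ((1 << logicalBits) - 1)
	physicalMs := int64(ts >> logicalBits)
	return time.UnixMilli(physicalMs).UTC(), logical
}

func main() {
	for _, ts := range []uint64{
		449242669990281216, // channel recovery position (recoverTs)
		449242704121430020, // segment A start position
	} {
		t, logical := parseTS(ts)
		fmt.Printf("ts=%d -> %s (logical=%d)\n", ts, t.Format(time.RFC3339Nano), logical)
	}
}
```

By this decoding, the recovery position is roughly two minutes earlier than segment A's start position.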

Recovery log:

2024-04-21 19:07:54.994	
[2024/04/21 19:07:54.994 +00:00] [INFO] [datanode/channel_meta.go:220] ["adding segment"] [type=Normal] [segmentID=449118743319978903] [collectionID=448960188666872753] [partitionID=448960188666872761] [channel=laion1b-test-2-rootcoord-dml_5_448960188666872753v1] [startPosition=<nil>] [endPosition=<nil>] [recoverTs=449242669990281216] [importing=false]

First msg consumed after the channel recovered:

The first msgPack it consumed spans start=449242669990281216 to end=449243021459849216, with PosTime 2024/04/21 19:02:05.564 +00:00.

Since msgStart < segAStart < msgEnd, this msgPack was not skipped and should have contained the insert rows of segment A, but it actually did not.
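
Stated as a predicate, the invariant being violated looks like this (hypothetical types; a sketch, not the real msgstream code):

```go
package recovery

// MsgPack is a stand-in for a consumed message pack's time window; the types
// here are illustrative, not the real msgstream API.
type MsgPack struct {
	StartTs, EndTs uint64
}

// shouldContainSegmentStart states the invariant from the analysis above:
// when msgStart < segStart < msgEnd, the pack's window covers the segment's
// start position, so the pack must carry that segment's first insert rows.
func shouldContainSegmentStart(pack MsgPack, segStartTs uint64) bool {
	return pack.StartTs < segStartTs && segStartTs < pack.EndTs
}
```

For segment A's start position and the window above, the predicate holds, yet the rows were missing, which is consistent with the message-loss conclusion.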

2024-04-21 19:19:45.255	
[2024/04/21 19:19:45.255 +00:00] [INFO] [datanode/channel_meta.go:220] ["adding segment"] [type=New] [segmentID=449118743320195625] [collectionID=448960188666872753] [partitionID=448960188666872809] [channel=laion1b-test-2-rootcoord-dml_5_448960188666872753v1] [startPosition="channel_name:\"laion1b-test-2-rootcoord-dml_5_448960188666872753v1\" msgID:\"\\010\\226\\202\\001\\020\\301\\213\\001\\030\\000 \\000\" msgGroup:\"datanode-2313-laion1b-test-2-rootcoord-dml_5_448960188666872753v1-true\" timestamp:449242669990281216 "] [endPosition="channel_name:\"laion1b-test-2-rootcoord-dml_5_448960188666872753v1\" msgID:\"\\010\\245\\202\\001\\020\\343\\006\\030\\000 \\000\" msgGroup:\"datanode-2326-laion1b-test-2-rootcoord-dml_5_448960188666872753v1-true\" timestamp:449243021459849216 "] [recoverTs=0] [importing=false]

The segment was then flushed with zero numRows:

2024-04-21 19:20:32.987	
[2024/04/21 19:20:32.987 +00:00] [INFO] [datanode/flush_manager.go:885] [SaveBinlogPath] [segmentID=449118743319978903] [collectionID=448960188666872753] [vchannel=laion1b-test-2-rootcoord-dml_5_448960188666872753v1] [SegmentID=449118743319978903] [startPos=null] [checkPoints="[{\"segmentID\":449118743319978903,\"position\":{\"channel_name\":\"laion1b-test-2-rootcoord-dml_5_448960188666872753v1\",\"msgID\":\"CKWCARCcDRgAIAA=\",\"msgGroup\":\"datanode-2326-laion1b-test-2-rootcoord-dml_5_448960188666872753v1-true\",\"timestamp\":449243022429519925}}]"] ["Length of Field2BinlogPaths"=0] ["Length of Field2Stats"=0] ["Length of Field2Deltalogs"=0]

@xiaofan-luan (Contributor)

  1. Let's figure out a way to skip zero-length segments.
  2. Ideally we should commit only when the segment is synced: we need to remove the autoCommit inside the message queue and commit only on sync (see the sketch after this list).
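
A minimal sketch of point 2, assuming the consumer exposes a manual commit (the interfaces are hypothetical, not Milvus's actual mq wrapper):

```go
package mq

import "sync"

// Position is an opaque consume position (e.g. a message ID's bytes).
type Position []byte

// Committer durably records a consume position; illustrative interface only.
type Committer interface {
	Commit(pos Position) error
}

// syncCommitter implements "commit only on sync": consumed positions are
// only remembered, and are acknowledged solely after the data they carried
// has been persisted.
type syncCommitter struct {
	mu      sync.Mutex
	pending Position
	c       Committer
}

// OnConsume records the position of a consumed pack without committing it.
func (s *syncCommitter) OnConsume(pos Position) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pending = pos
}

// OnSync commits the last consumed position after a successful flush/sync.
func (s *syncCommitter) OnSync() error {
	s.mu.Lock()
	pos := s.pending
	s.mu.Unlock()
	if pos == nil {
		return nil
	}
	return s.c.Commit(pos)
}
```

The point of the design is that a crash between consume and sync replays messages on recovery instead of silently dropping them, which is how it would avoid the kind of loss seen with segment A above.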

XuanYang-cn added a commit to XuanYang-cn/milvus that referenced this issue May 7, 2024
See also: milvus-io#32553

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
XuanYang-cn added a commit to XuanYang-cn/milvus that referenced this issue May 7, 2024
See also: milvus-io#32553
pr: milvus-io#32821

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue May 8, 2024
Add more informative logs

See also: #32553

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
XuanYang-cn added a commit to XuanYang-cn/milvus that referenced this issue May 8, 2024
See also: milvus-io#32553

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue May 8, 2024
See also: #32553
pr: #32821

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
XuanYang-cn added a commit to XuanYang-cn/milvus that referenced this issue May 11, 2024
See also: milvus-io#32553

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
XuanYang-cn added a commit to XuanYang-cn/milvus that referenced this issue May 11, 2024
See also: milvus-io#32553

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue May 13, 2024
See also: #32553

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
@yanliang567 yanliang567 modified the milestones: 2.3.15, 2.3.16 May 16, 2024
@XuanYang-cn (Contributor)

Should be fixed
/unassign
/unassign @tedxu
/assign @ThreadDao

@sre-ci-robot sre-ci-robot assigned ThreadDao and unassigned tedxu and XuanYang-cn Jul 1, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.16, 2.3.19 Jul 9, 2024
@yanliang567 yanliang567 modified the milestones: 2.3.19, 2.3.20 Jul 19, 2024
Labels
kind/bug: Issues or changes related to a bug
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.