
[Bug]: System continues to create empty segments after upgrade to 2.4 #33646

Open
1 task done
Archalbc opened this issue Jun 5, 2024 · 19 comments
Labels
help wanted Extra attention is needed

Comments

@Archalbc

Archalbc commented Jun 5, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: from 2.3.12 to 2.4.4
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2): Go SDK 2.3.2

- MaxSegmentSize: 2048

Current Behavior

After upgrading our Milvus cluster from 2.3.12 to 2.4.4, the system started to continuously create empty segments in all my collections.
(screenshots: segment-count monitoring graphs)

This bug should have been fixed since 2.3.15 in #32553.

We decided to downgrade to 2.3.17, and unfortunately we also encountered an issue: our index versions had been upgraded to "4", and the querynodes in 2.3.17 were not able to load them.
Sounds related to #33242.

Timeline:

  • ~11:40 deploy the upgrade from 2.3.12 to 2.4.4
  • ~15:15 downgrade to 2.3.17, but ran into the downgrade issue described above
  • ~16:00 stop traffic and restore all collections using milvus-backup; now it's OK.

The client is using Go SDK 2.3.2; we don't think that is the issue (?).

Schema:
(screenshot: collection schema)

Expected Behavior

  • No empty segments created
  • Being able to roll back to 2.3.

Steps To Reproduce

1. Setup a Milvus cluster in 2.3.12
2. Add collections and start indexing
3. Upgrade to 2.4.4

Milvus Log

Sorry, I don't have logs anymore since we downgraded everything.

Anything else?

No response

@Archalbc Archalbc added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 5, 2024
@Archalbc Archalbc changed the title [Bug]: System continues to create empty segment after upgrade to 2.4 [Bug]: System continues to create empty segments after upgrade to 2.4 Jun 5, 2024
@yanliang567
Contributor

I guess the empty segments are actually not "empty"; they are compacted segments pending indexing. In Milvus 2.3.12 the default maxSize of a segment is 512MB, while in Milvus 2.4.4 it changed to 1GB, so after the upgrade Milvus is trying to compact the segments to the new maxSize. If you check the compaction tasks or indexing tasks, I believe you will see the new tasks.
BTW, a bigger segment size generally leads to better search performance. But it is not the bigger the better, because a bigger segment also means less efficient memory usage.
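For reference, a minimal sketch of how the compaction state mentioned above could be checked from the client side with the Go SDK v2. The address, the collection name, and the manual-compaction call itself are illustrative assumptions, not taken from this issue:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/milvus-io/milvus-sdk-go/v2/client"
)

func main() {
	ctx := context.Background()

	// Placeholder address; point this at the actual Milvus proxy.
	c, err := client.NewGrpcClient(ctx, "localhost:19530")
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Trigger a compaction on a placeholder collection and read back its state.
	compactionID, err := c.ManualCompaction(ctx, "my_collection", 0)
	if err != nil {
		log.Fatal(err)
	}
	state, err := c.GetCompactionState(ctx, compactionID)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("compaction %d state: %v\n", compactionID, state)
}
```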

/assign @Archalbc
/unassign

@sre-ci-robot sre-ci-robot assigned Archalbc and unassigned yanliang567 Jun 6, 2024
@yanliang567 yanliang567 added help wanted Extra attention is needed and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. kind/bug Issues or changes related a bug labels Jun 6, 2024
@Archalbc
Author

Archalbc commented Jun 6, 2024

> I guess the empty segments are actually not "empty"; they are compacted segments pending indexing. […]

Hey, sorry, I completely forgot to say that we have already been at 2GB for maxSegmentSize since 2.3.12, as our querynodes have 32GB. (I'm updating the initial post.)

@Archalbc Archalbc removed their assignment Jun 6, 2024
@yanliang567
Contributor

Trying to reproduce the issue in house... @Archalbc did you change any other configs for the milvus cluster?
/assign @Archalbc

@Archalbc
Author

Archalbc commented Jun 7, 2024

> Trying to reproduce the issue in house... @Archalbc did you change any other configs for the milvus cluster?

Hello, here is my values.yaml for the milvus configuration:

```yaml
extraConfigFiles:
  user.yaml: |+
    queryNode:
      scheduler:
        maxReadConcurrentRatio: 2
    dataCoord:
      segment:
        maxSize: 2048
    queryCoord:
      overloadedMemoryThresholdPercentage: 80
```

Vector dimension: 256
HNSW: ef: 100, efConstruction: 360, M: 8
The client calls both query() and search() interfaces; fields: embedding and createdAt.
4 shards per collection
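For context, a hedged sketch of how those HNSW parameters map onto Go SDK v2 calls. The collection/field names and the L2 metric type are assumptions for illustration only:

```go
package main

import (
	"context"
	"log"

	"github.com/milvus-io/milvus-sdk-go/v2/client"
	"github.com/milvus-io/milvus-sdk-go/v2/entity"
)

func main() {
	ctx := context.Background()
	c, err := client.NewGrpcClient(ctx, "localhost:19530") // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// HNSW index with the parameters reported above: M=8, efConstruction=360.
	// The metric type (L2) is an assumption, not stated in the issue.
	idx, err := entity.NewIndexHNSW(entity.L2, 8, 360)
	if err != nil {
		log.Fatal(err)
	}
	if err := c.CreateIndex(ctx, "my_collection", "embedding", idx, false); err != nil {
		log.Fatal(err)
	}

	// ef=100 is applied at query time via the HNSW search parameter.
	sp, err := entity.NewIndexHNSWSearchParam(100)
	if err != nil {
		log.Fatal(err)
	}
	_ = sp // would be passed to c.Search(...)
}
```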

@Archalbc Archalbc removed their assignment Jun 7, 2024
@xiaofan-luan
Contributor

@Archalbc
Are you using bulk insertion?
#33604 might be the related issue.
We will fix this soon.

@Archalbc
Author

> @Archalbc Are you using bulk insertion? #33604 might be the related issue. We will fix this soon.

My workmate will answer you soon.

@Archalbc
Author

Archalbc commented Jun 10, 2024

New event: after the downgrade to 2.3.17 (so in the end I upgraded from 2.3.12 to 2.3.17), everything was normal until last Friday around 1 AM, when we again encountered a huge quantity of new segments. But this time I can't see those segments in ATTU or birdwatcher :o; they are telling me I should have around 100 sealed segments, but this is clearly not the case.
In 2.4.4 I was able to see those empty segments in ATTU.

(screenshots: sealed-segment count graphs)

Birdwatcher right now is saying:

```
--- Growing: 32, Sealed: 0, Flushed: 68, Dropped: 19
--- Total Segments: 100, row count: 65129835
```

and ATTU for the collection ID 450234905777292996:
(screenshot: ATTU segment view)

while the sealed-segment visualization is showing more than 70 segments. As you can see, before the sudden increase of sealed segments I should have 40 segments for this collection, which is normal because we have 2 replicas.

I rolled back again to 2.3.12, and I got my 40 sealed segments again...
(screenshot: segment count back to 40 after rollback)

@xiaofan-luan
Contributor

> New event: after the downgrade to 2.3.17, everything was normal until last Friday around 1 AM, when we again encountered a huge quantity of new segments. […]

  1. Did you have any import or delete at that time?
  2. You have 32 growing segments, which means you might have 32 partitions or have enabled the partition-key feature. If you have many partitions, having a hundred segments might not be a big deal.

@xiaofan-luan
Contributor

Because each time you flush (or auto-flush), your segment number will grow by 32, and after compaction the number will decrease.

@Archalbc
Author

  1. We did nothing special during the night, just usual operations.
  2. We always have only one partition per collection.

We don't call flush in the code; we let the system auto-flush.
There is definitely something strange, because after reverting back to 2.3.12 the odd segments disappear and we are back to the normal number of segments.
We can clearly see that those strange segments are never flushed or compacted; they stay there forever...
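For reference, one way to cross-check what ATTU and birdwatcher report is to list the persistent segments from the client side. A minimal sketch with the Go SDK v2 (address and collection name are placeholders):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/milvus-io/milvus-sdk-go/v2/client"
)

func main() {
	ctx := context.Background()
	c, err := client.NewGrpcClient(ctx, "localhost:19530") // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// List the persistent (sealed/flushed) segments of a placeholder collection
	// and print each one's state and row count, to spot 0-row segments.
	segs, err := c.GetPersistentSegmentInfo(ctx, "my_collection")
	if err != nil {
		log.Fatal(err)
	}
	for _, s := range segs {
		fmt.Printf("segment %d: state=%v rows=%d\n", s.ID, s.State, s.NumRows)
	}
}
```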

@xiaofan-luan
Contributor

> We did nothing special during the night, just usual operations. We always have only one partition per collection. […]

If you only have one collection and one partition, there seems to be no reason you would see 32 growing segments.

Did you do import or insert? Can you offer full logs for debugging?

@xiaofan-luan
Contributor

I think we need logs and also information about "what operation did you do when you saw a segment with 0 entities".

@Archalbc
Author

Archalbc commented Jun 10, 2024

> If you only have one collection and one partition, there seems to be no reason you would see 32 growing segments. Did you do import or insert? Can you offer full logs for debugging?

No, we have 8 collections. Since the rollback to 2.3.12 we have the correct number of segments and it's not growing.
(screenshot: stable segment count after rollback)

I'm starting to wonder if this is because the clients are still using Go SDK 2.3.2?

Sorry, I cannot provide any logs since I roll back ASAP each time and our logging system is currently under maintenance u_u.
My workmate is not available today to give more insight on the client side.
Maybe we will see with the team if we can somehow reproduce this in another cluster.

@xiaofan-luan
Contributor

I guess this might not be related to the SDK version. But we need more clues about why these segments are created.

@yanliang567
Contributor

@Archalbc I did not reproduce this issue in my Milvus deployment, but I can try more times. If it reproduces for you, could you please attach the etcd backup for investigation? Check https://github.com/milvus-io/birdwatcher for details about how to back up etcd with birdwatcher.
One more question: could you please double check with your workmates that there were no insert or delete operations before/during the system upgrade?

@flsworld

Hi,
Sorry for the late answer!

> Are you using bulk insertion? #33604 might be the related issue. We will fix this soon.

We're not using bulk insert. We're upserting each time.

@xiaofan-luan
Contributor

Upsert seems to be the issue, because an upsert also causes one delete for each upserted entity.
The segments are in L0 state. If you roll back your cluster version to 2.3, you might lose some deletes in L0 segments (because 2.3 does not recognize these deletes).

The segment number increase might also be related to L0 deletes, but this should be as expected and won't affect search performance.
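For clarity, a hedged sketch of the upsert path under discussion, using the Go SDK v2 (collection name, field names, and data below are placeholders): each upsert of an existing primary key implies a delete of the old row plus an insert of the new one, and per the comment above it is those deletes that land in L0 segments on 2.4.

```go
package main

import (
	"context"
	"log"

	"github.com/milvus-io/milvus-sdk-go/v2/client"
	"github.com/milvus-io/milvus-sdk-go/v2/entity"
)

func main() {
	ctx := context.Background()
	c, err := client.NewGrpcClient(ctx, "localhost:19530") // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// Placeholder data: one 256-dim vector keyed by a primary key that already exists,
	// so the upsert rewrites it (delete of the old row + insert of the new one).
	ids := entity.NewColumnInt64("id", []int64{1})
	vec := make([]float32, 256)
	embeddings := entity.NewColumnFloatVector("embedding", 256, [][]float32{vec})

	if _, err := c.Upsert(ctx, "my_collection", "", ids, embeddings); err != nil {
		log.Fatal(err)
	}
}
```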

@Archalbc
Author

> Upsert seems to be the issue, because an upsert also causes one delete for each upserted entity. […]

I can understand that upsert operations are kinda tricky, but I don't understand why we see a change of behavior on the cluster. Everything looks fine in 2.3.12, and I doubt the cluster will be OK with an infinite number of segments (you can see in the graph that it never stops creating segments and they are never cleaned up).

How can I securely provide you the etcd backup without exposing it here?

@yanliang567
Contributor

please send it to my mail: yanliang.qiao@zilliz.com
