[Bug]: IndexNode OOM after upgrading from 2.3.12 to 2.4.4 #34273
Comments
Checking the logs, I did not see anything suspicious. The default max segment size changed from 512MB to 1024MB; that's the only suspect I can think of. @artinshahverdian quick question: how did you observe the index memory usage?
According to the log information, the size of the new segment to build the index for is
/assign @artinshahverdian
can confirm the segment size default value is changed in 2.4.4:
these are my configs now. If I reduce these to:
and trigger compaction, will I get smaller segments so that I can use an 8GB machine for the indexNode, or can the existing segments not change anymore?
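For reference, the segment size setting lives under `dataCoord` in `milvus.yaml`. The exact values in the question above were not shown in the thread; this is a sketch of what reverting to the pre-2.4 default would look like, assuming the standard `milvus.yaml` layout:

```yaml
dataCoord:
  segment:
    # Maximum segment size in MB. The default changed from
    # 512 in 2.3.x to 1024 in 2.4.x.
    maxSize: 512
```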
There is no way to reduce the segment size through compaction. The recommended approach is to scale the indexnode memory up to 10GB; for a 2.3GB segment, 10GB of memory should be sufficient for building the index.
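As a back-of-the-envelope check on that recommendation: building an HNSW index transiently needs a multiple of the raw segment size (vectors, graph links, and temporary buffers). The 3x multiplier below is a rule of thumb, not a Milvus-documented figure:

```python
def hnsw_build_memory_gb(segment_gb: float, peak_multiplier: float = 3.0) -> float:
    """Rough peak-memory estimate for building an HNSW index.

    peak_multiplier is a rule-of-thumb factor covering the raw vectors,
    the graph links, and temporary buffers during construction.
    """
    return segment_gb * peak_multiplier

# A 2.3GB segment at a 3x multiplier needs roughly 6.9GB of peak memory,
# which is why 8GB is borderline and 10GB is the safer recommendation.
print(round(hnsw_build_memory_gb(2.3), 1))
```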
@artinshahverdian I have not changed the segment size or built a disk index. Is there any way I can find the big segment and verify its size?
@xiaocai2333 do you see any downside to changing the segment size back to 512MB?
Reverting the configuration poses no issues. However, once a segment has been generated, it cannot be reduced from 2GB back to 512MB.
@artinshahverdian It would be great if you could share more logs from datacoord/datanode. We can try to investigate how the large segment was generated.
@xiaocai2333 @artinshahverdian
And why does this segment grow to 2.3GB with a 1GB setting?
@xiaofan-luan my index is HNSW, ef_construction: 50, M: 16. I looked at the files stored in S3 and cannot find any segment close to 2GB; all of them are less than 1GB.
Can you verify how much memory it takes to build an index for a 1GB segment? If it takes more than 4GB, maybe a 1GB segment size is too large as the default for 2c8g users.
If the question is addressed to me: we are in the middle of a data reset and have reset most of our data, so I can't really run this experiment. But I have reduced the segment size to 512MB now and will see if we can use 8GB RAM for the indexnode in the future.
From the indexnode logs, it can be seen that an index is being built for a segment with 398898 rows and a dimension of 1536.
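Assuming float32 vectors (4 bytes per dimension), the raw vector data alone for that segment comes out close to the 2.3GB figure discussed above:

```python
# Raw vector size for the segment reported in the indexnode logs,
# assuming float32 (4 bytes) per dimension.
rows, dim, bytes_per_float = 398898, 1536, 4
raw_bytes = rows * dim * bytes_per_float
print(f"{raw_bytes / 2**30:.2f} GiB")  # prints "2.28 GiB"
```

This suggests the reported segment size reflects the raw vectors alone, before any index-build overhead is added on top.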
@artinshahverdian |
@artinshahverdian Could you provide the datacoord logs from the upgrade process? Or could you download this segment's data and send it to my email cai.zhang@zilliz.com? |
Sorry, I can't share the segment since there is sensitive data in there. I have restarted the pod after the migration was done, so I don't really have the logs anymore. |
Okay, were you able to successfully build the index after migrating your cluster? |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Is there an existing issue for this?
Environment
Current Behavior
I am running Milvus 2.4.4 in cluster mode on AWS EKS. I am seeing the indexnode crash while it is building an index. I have just upgraded from 2.3.12 to 2.4.4 and have a dedicated nodegroup for the indexnode. The machine has 8GB of memory. Why would the indexnode work fine in 2.3.12 with the same memory and get OOM-killed after upgrading to 2.4.4? Anything I'm missing? Logs for the indexnode are included from start until the crash. Logs are set at info level.
After upgrading to a 16GB node, memory usage didn't go above 6GB; it dropped and grew multiple times. I suspect Milvus is not monitoring memory usage and doesn't trigger garbage collection before allocating more memory.
My segment size and max segment size are the default and I have not overridden anything.
indexnode.log
Expected Behavior
The indexnode should work fine on an 8GB machine, as it did in 2.3.12, and run garbage collection periodically.
Steps To Reproduce
No response
Milvus Log
indexnode.log
Anything else?
No response