Bug: inaccurate low disk space warning during inode exhaustion, newly autodiscovered hosts not collected #2864

Closed
e271828- opened this Issue Jun 20, 2017 · 6 comments

e271828- commented Jun 20, 2017

After noticing that newly added hosts weren't reflected in the stats, we saw that Prometheus was nearly pegging two CPUs (normal is 10-60%) and logging many errors about low disk space and failed writes. Around 875MB was still free according to df.

Handling of low disk space conditions should be better (why were newly autodiscovered hosts not being reported?) and there seems to be a bug in the TSDB writer. Unfortunately we don't have full forensics, but if it recurs we'll snapshot the host.

Member

gouthamve commented Jun 20, 2017

Author

e271828- commented Jun 20, 2017

@gouthamve inode exhaustion is possible. We'll check for that if it happens again. I see that a 12-hour-old Prometheus install monitoring 2,000 hosts is using 194,455 files. That seems excessive.
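
A quick way to check for this condition is to compare free bytes with free inodes on the data volume, which is what `df` and `df -i` report. The sketch below is a minimal illustration and not something posted in the thread: `fs_usage` and `count_files` are hypothetical helper names, and it assumes a POSIX system where `os.statvfs` is available. Some filesystems report a total inode count of zero, in which case the inode figures are not meaningful.

```python
import os


def fs_usage(path):
    """Return byte and inode usage for the filesystem that contains `path`."""
    st = os.statvfs(path)
    return {
        # Space available to unprivileged processes, in bytes.
        "bytes_free": st.f_bavail * st.f_frsize,
        "bytes_total": st.f_blocks * st.f_frsize,
        # Inodes available to unprivileged processes.
        "inodes_free": st.f_favail,
        "inodes_total": st.f_files,
    }


def count_files(root):
    """Count non-directory entries under `root`.

    Each one consumes an inode; directories consume inodes too,
    but are not counted here.
    """
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total
```

Pointing `count_files` at the Prometheus data directory and comparing the result with `inodes_total` would show how much of the inode budget the storage is consuming while `bytes_free` still looks healthy.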

e271828- changed the title from "Bug(s): false low disk space warning, newly autodiscovered hosts not being collected with space low" to "Bug(s): inaccurate low disk space warning (maybe inode exhaustion?); newly autodiscovered hosts not collected when space low" on Jun 20, 2017

e271828- changed the title from "Bug(s): inaccurate low disk space warning (maybe inode exhaustion?); newly autodiscovered hosts not collected when space low" to "Bug: inaccurate low disk space warning during inode exhaustion, newly autodiscovered hosts not collected" on Jun 20, 2017

Author

e271828- commented Jun 20, 2017

We've recreated the scenario and confirmed it was inode exhaustion. At minimum the log messages should reflect that.

Old data should also just be purged in the disk-full state rather than dropping new host data on the ground.

Member

gouthamve commented Jun 21, 2017

The error the OS returns on inode exhaustion is the same as the one it returns when the disk is full.

This is caused by the structure of the storage. We create one file per time series (so there are as many files as there are time series) and then append new data to that file. We cannot drop old data to make room for new data: both belong to the same file, and we have run out of inodes.
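
To illustrate why the log message cannot tell the two cases apart on its own: on Linux both conditions surface as `ENOSPC` ("No space left on device"), so a program has to query the filesystem to see which resource ran out. The following is a hedged sketch of that idea, not Prometheus code. `explain_enospc` is a hypothetical helper, and the `fs_stats` parameter exists only so the function can be exercised without filling a real disk.

```python
import errno
import os


def explain_enospc(exc, path, fs_stats=os.statvfs):
    """Return a more specific reason for an ENOSPC error, or None.

    `exc` is the OSError raised by a failed write or create, and
    `path` is any path on the affected filesystem.
    """
    if exc.errno != errno.ENOSPC:
        return None
    st = fs_stats(path)
    # f_files == 0 means the filesystem does not report inode counts.
    if st.f_files > 0 and st.f_favail == 0:
        return "inode exhaustion"
    if st.f_bavail == 0:
        return "disk full"
    # Other causes (quotas, reserved blocks) also produce ENOSPC.
    return "ENOSPC with free blocks and inodes reported"
```

In the situation described in this issue, with about 875MB free and no inodes left, such a check would report inode exhaustion rather than a full disk.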

You could fix this by formatting the disk to have more inodes. This issue has been fixed in 2.0.
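
When sizing a reformatted disk, a rough estimate of the inode budget can be derived from the file count reported earlier in this thread. The sketch below is only a back-of-the-envelope calculation under stated assumptions: it treats the file count as scaling linearly with the number of hosts, and both function names are hypothetical.

```python
def files_per_host(total_files, hosts):
    """Average number of files on disk per monitored host."""
    return total_files / hosts


def projected_files(total_files, hosts, target_hosts):
    """Scale the observed file count linearly to `target_hosts`, rounding up."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-total_files * target_hosts // hosts)
```

With the figures reported above (194,455 files for 2,000 hosts), `files_per_host` gives roughly 97 files per host. The projection is a lower bound for the inode count to request when formatting, since directories and other files need inodes as well. On ext4, for example, `mke2fs` accepts `-N` to set the number of inodes and `-i` to set the bytes-per-inode ratio.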

Member

brian-brazil commented Jul 6, 2017

Sounds like there's not much more we can do here.


lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators on Mar 23, 2019
