Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve how we record file size metric to avoid a race in our file lagging alert #1194

Merged
merged 3 commits into from
Oct 24, 2019

Conversation

slim-bean
Copy link
Collaborator

Currently we read file size by looking at the matched list of files from the glob pattern and then for each doing a stat on the file. We also report the number of bytes read in a file from the tailer.

Together these two metrics are used to create an alert if a tailer is lagging too far behind the size of the file (was helpful when we had issues in the early day with files stopping being tailed)

The problem is, when a file rolls, the sync() code will lookup the list of files to follow and see the new file, which we then report the size on (which will be small). Meanwhile the number of bytes read are reported from the existing file (which will be large).

The alert takes an abs() on the difference of these two metrics (because if tailing a file was stuck/broken the new file size would continuously be changing (and often less than the current file size), so it was necessary to look at the overall difference between their two sizes)

The alert does require a tailing diff to exist for a certain duration, but when log files roll very quickly it's not uncommon for this race to happen frequent enough to fire the alert as a false positive.

This PR updates the forked hpcloud/tail library to report the size based on the file handle and not the file name, and it also updates the sync() functions check on file size to use this new method if the file is currently being tailed.

This change has the upside of removing the race between reading the current file size by file name and how fast files are rolled and the bytes read size is reported by the tailer.

It does however introduce a new problem, where a file tailer could stop reading a file within a few bytes of the size of the file, and the alert would not fire because it doesn't exceed the threshold (although this situation would still be visible through metrics)

I think the tradeoffs here are worth it to hopefully remove the false positives we see on very quickly rolling log files.

Copy link
Contributor

@cyriltovena cyriltovena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with two minor nitpick I'd like to see changed.

…iler reports for size (or read the file size from the file name if no tailer exists yet)

Update vendored (and forked) hpcloud tail library which includes a Size() function which reports size from the file handle and not file name
@slim-bean slim-bean merged commit 94b9c4b into master Oct 24, 2019
@rfratto rfratto deleted the improve-file-size-metric branch November 1, 2019 12:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants