
Promtail: Add a stream lagging metric #2618

Merged: slim-bean merged 11 commits into master from promtail-stream-lagging on Sep 18, 2020

Conversation

@slim-bean (Collaborator) commented Sep 11, 2020

This is similar to what we accomplish with promtail_file_bytes_total - promtail_read_bytes_total, which gives the difference between the current file size and how far Promtail has read into it.

Those metrics have problems, however, and unfortunately this approach has problems too, but I think there are fewer of them here and the output is more meaningful.

How it works:

Every time a batch is successfully sent to Loki, we iterate through all the streams in the batch, take the most recent entry, subtract its timestamp from time.Now(), and report the result as a gauge named promtail_stream_lag_seconds with a label for the filename.
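
For illustration only, here is a minimal sketch of that idea. This is not the actual Promtail client code; the `entry` and `stream` types, the `recordStreamLag` function, and the example filename are hypothetical stand-ins.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical stand-ins for the batch structures Promtail sends to Loki.
type entry struct {
	Timestamp time.Time
	Line      string
}

type stream struct {
	Filename string
	Entries  []entry
}

// Gauge labeled by filename, matching the metric name described above.
var streamLag = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Namespace: "promtail",
	Name:      "stream_lag_seconds",
	Help:      "Difference between now and the newest entry in the last successfully sent batch, per file.",
}, []string{"filename"})

func init() {
	prometheus.MustRegister(streamLag)
}

// recordStreamLag would run after Loki returns a 2xx for a batch.
func recordStreamLag(streams []stream) {
	now := time.Now()
	for _, s := range streams {
		if len(s.Entries) == 0 {
			continue
		}
		// Entries are appended in order, so the last one is the most recent.
		newest := s.Entries[len(s.Entries)-1].Timestamp
		streamLag.WithLabelValues(s.Filename).Set(now.Sub(newest).Seconds())
	}
}

func main() {
	// Example: a stream whose newest entry is 3s old reports a lag of ~3s.
	recordStreamLag([]stream{{
		Filename: "/var/log/app.log",
		Entries:  []entry{{Timestamp: time.Now().Add(-3 * time.Second), Line: "example"}},
	}})
}
```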

Problems with this approach:

  • This metric adds very little value if Promtail is assigning the timestamp when the line is read; it might still be useful, but not as useful as having Promtail extract the timestamp from the log line itself.
  • If you are extracting the timestamp from the log line and those timestamps are somehow old or delayed, this metric will be skewed.
  • Slowly updated log files will show a lag when none really exists. For example, if a log file writes an entry and it took 3s to send, that 3s value will be present and scraped until the next log line for that stream is processed and changes it; if no new lines are written to that file, the metric will keep reporting 3s until it expires and is automatically removed after 1m, or until the file stops being tailed (see the expiry sketch after this list). This leads to some strange artifacts of 1m-long flat lines in the output. We could lower the expiration to a much smaller value, but then you would likely lose visibility into slowly written streams which are genuinely far behind.
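
To illustrate the expiry behaviour mentioned in the last point, here is a minimal sketch, again not Promtail's actual code. It builds on the `streamLag` gauge from the sketch above; `lastUpdated`, `markUpdated`, and `sweepStaleSeries` are hypothetical names, and a real implementation would need locking around the map.

```go
// lastUpdated tracks when each file's gauge was last set.
var lastUpdated = map[string]time.Time{} // filename -> last update time

// markUpdated would be called alongside streamLag.WithLabelValues(...).Set(...).
func markUpdated(filename string) {
	lastUpdated[filename] = time.Now()
}

// sweepStaleSeries drops series that have not been updated within maxAge
// (e.g. 1 minute), so a stale lag value is no longer exported on scrape.
func sweepStaleSeries(maxAge time.Duration) {
	cutoff := time.Now().Add(-maxAge)
	for filename, t := range lastUpdated {
		if t.Before(cutoff) {
			streamLag.DeleteLabelValues(filename)
			delete(lastUpdated, filename)
		}
	}
}
```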

All that being said, I do think this helps close the loop, in normal circumstances and for normal-volume log streams, on how long it takes from when lines are written until we get a 200 indicating Loki has stored them.

NOTE: this PR also changes the default batch size from 100KB to 1MB. It was discovered that 100KB is too small for high-volume log files, which can generate logs at a rate of over 100KB/sec, meaning multiple batches per second were required and Promtail would fall behind. Increasing this to 1MB addresses the issue and does not affect slower streams, which are still sent based on the BatchWait setting (default 1s).
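
As an illustration of the batching behaviour this note relies on, here is a minimal sketch, not the actual Promtail client code; `runBatcher` and the constant names are hypothetical, and it assumes the same `time` import as the sketches above. A batch is flushed either when it reaches the size limit or when the wait interval elapses, whichever comes first.

```go
const (
	defaultBatchSizeBytes = 1 << 20         // 1MB, raised from 100KB in this PR
	defaultBatchWait      = 1 * time.Second // slow streams still flush roughly every second
)

// runBatcher accumulates lines and calls send when either limit is hit.
func runBatcher(lines <-chan string, send func([]string)) {
	var (
		batch      []string
		batchBytes int
	)
	flush := func() {
		if len(batch) > 0 {
			send(batch)
			batch, batchBytes = nil, 0
		}
	}
	ticker := time.NewTicker(defaultBatchWait)
	defer ticker.Stop()
	for {
		select {
		case line, ok := <-lines:
			if !ok {
				flush()
				return
			}
			batch = append(batch, line)
			batchBytes += len(line)
			if batchBytes >= defaultBatchSizeBytes {
				flush() // size-driven flush for high-volume streams
			}
		case <-ticker.C:
			flush() // time-driven flush for low-volume streams
		}
	}
}
```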

@owen-d (Member) left a comment:

Not super familiar with this code, but lgtm.

@@ -34,39 +37,43 @@ const (
// Label reserved to override the tenant ID while processing
// pipeline stages
ReservedLabelTenantID = "__tenant_id__"

LatencyLabel = "filename"
A contributor commented:

?

The contributor followed up:

Oh, I see. I think we need a better name.

@codecov-commenter

Codecov Report

Merging #2618 into master will decrease coverage by 0.19%.
The diff coverage is 28.57%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2618      +/-   ##
==========================================
- Coverage   63.03%   62.83%   -0.20%     
==========================================
  Files         169      169              
  Lines       15015    15050      +35     
==========================================
- Hits         9464     9457       -7     
- Misses       4792     4832      +40     
- Partials      759      761       +2     
| Impacted Files | Coverage Δ |
|---|---|
| pkg/logentry/metric/metricvec.go | 75.00% <0.00%> (-21.43%) ⬇️ |
| pkg/promtail/api/types.go | 20.00% <ø> (ø) |
| pkg/promtail/targets/file/filetarget.go | 64.40% <0.00%> (-1.88%) ⬇️ |
| pkg/promtail/client/client.go | 77.77% <40.00%> (-7.38%) ⬇️ |
| pkg/promtail/positions/positions.go | 46.49% <0.00%> (-13.16%) ⬇️ |

@slim-bean merged commit 0b1dbe2 into master Sep 18, 2020
@slim-bean deleted the promtail-stream-lagging branch September 18, 2020 16:15
owen-d pushed a commit that referenced this pull request Jul 8, 2021
…-date defaults (#3559)

* Doc: Remove removed --ingester.recover-from-wal option

Removed in:
8a9b94a ("removes recover flag (#3326)")

* Doc: Fix out-of-date defaults

0b1dbe2 ("Promtail: Add a stream lagging metric (#2618)")
d3bf21e ("Promtail: (and also fluent-bit) change the max batch size to 1MB  (#2710)")