Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increasing the threshold for a file lag and reducing the severity to warning #1749

Merged
merged 1 commit into from
Feb 26, 2020

Conversation

slim-bean
Copy link
Collaborator

After a year of playing with this alert and fixing race conditions, I think the races are fixed however the existing tolerance of 100kB is really much too small for busy promtail instances processing 5000lines/sec or more.

The original intent of this alert was to look for complete failures to tail a file and bugs in tailing which have since been squashed, therefore I think it's appropriate to increasing the threshold a file can fall behind to 1MB. This would still catch any new bugs in tailing or other issue but not be so flaky when log volume spikes on a promtail instance.

We also often find this alert is a red herring and sensitive to large bursts in log volume, and do not think it's appropriate to page someone when it fires. However it's still useful to know if this is happening as if nothing else it's going to indicate some delay in getting your logs to Loki.

The other reason to change it to warning is that it's hard to define an action anyone can take in an immediate sense when it fires which is another bar we hold for critial alerts.

Other alerts will catch the cases where Loki is not receiving logs at all.

Signed-off-by: Edward Welch edward.welch@grafana.com

…warning

Signed-off-by: Edward Welch <edward.welch@grafana.com>
Copy link
Member

@rfratto rfratto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@slim-bean slim-bean merged commit 8df56ce into master Feb 26, 2020
@slim-bean slim-bean deleted the adjust-promtail-file-lagging branch February 26, 2020 21:45
cyriltovena pushed a commit to cyriltovena/loki that referenced this pull request Jun 11, 2021
* Added new KV Store client, MultiClient.

This client is configured with multiple stores, one of them is designated as
primary. All client operations are forwarded to the primary store.

MultiClient also does "mirroring" of values from primary to secondary store.

MultiClient can listen on changes in runtime configuration (via overrides
mechanism), and switch primary store and enable/disable mirroring.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Use Stop, which is now part of kv.Client.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Put back setting of defaultLimits -- used when loading YAML files.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Moved setting of default limits for YAML unmarshal to separate function.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Pass multi-client context as argument.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* watchConfigChannel now reacts on context being done as well

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Changed Mirroring to *bool.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Ignore mock by yaml.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Renamed mirroring to mirror-enabled to be consistent with MultiConfig.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Renamed 'multi' to 'multi_kv_config' in overrides.yaml.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Forward writes done via CAS function to secondary client.

Mirroring goroutine was removed, and replaced by forwarding writes
done via CAS function to secondary client. Rate limits config was
removed, but there is now timeout for secondary write, to avoid
blocking CAS function for too long, if secondary write is slow
(eg. etcd being down can cause very long writes).

Only WatchKey and WatchPrefix functions now react on change of
primary client.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Added metrics to multi client.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Removed equality check when writing to secondary store.

Without watch-and-mirror functionality, there is no need to check
if value is already present in the secondary store.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Renamed OverridesManager and moved it to its own package.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Make lint happy, 3.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Make lint happy, 4.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Add metric type to variable names, yaml name changes, fixed metric names, removed forgotten log.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Fixed yet one more yaml name.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Fixed tests after changing yaml fields.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Fix bug when default limits are not applied until next overrides reload.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Ignore LoadConfig if LoadPath is empty.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Use channels to communicate config updates.

Instead of spawning new goroutine for each config update,
we now use channels to communicate config updates.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Initialize limits before starting runtimeconfig Manager.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Updated CHANGELOG.md and arguments.md.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Typo

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Fix compilation error in ingester_v2_test.go.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Fixed error after rebase.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Fixed error after rebase.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Use logger with component="multikv" to log messages.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Improve log message when runtime config file is not specified.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>

* Don't use memberlist in the example, as it is still experimental.
Addressed other review feedback.

Signed-off-by: Peter Štibraný <peter.stibrany@grafana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants