Ruler: documentation for recording rules #3851

Merged
59 changes: 59 additions & 0 deletions docs/sources/configuration/_index.md
@@ -566,6 +566,61 @@ storage:
# CLI flag: -ruler.storage.local.directory
[directory: <filename> | default = ""]

# Remote-write configuration to send rule samples to a Prometheus remote-write endpoint.
remote_write:
# Enable remote-write functionality.
# CLI flag: -ruler.remote-write.enabled
[enabled: <boolean> | default = false]

client:
# The URL of the endpoint to send samples to.
url: <string>

# Timeout for requests to the remote write endpoint.
[remote_timeout: <duration> | default = 30s]

# Custom HTTP headers to be sent along with each remote write request.
# Be aware that headers that are set by Prometheus itself can't be overwritten.
headers:
[<string>: <string> ...]

# HTTP proxy server to use to connect to the targets.
[proxy_url: <string>]

# Sets the `Authorization` header on every remote write request with the
# configured username and password.
# password and password_file are mutually exclusive.
basic_auth:
[username: <string>]
[password: <secret>]
[password_file: <string>]

# `Authorization` header configuration.
authorization:
# Sets the authentication type.
[type: <string> | default: Bearer]
# Sets the credentials. It is mutually exclusive with
# `credentials_file`.
[credentials: <secret>]
# Sets the credentials with the credentials read from the configured file.
# It is mutually exclusive with `credentials`.
[credentials_file: <filename>]

tls_config:
# CA certificate to validate API server certificate with.
[ca_file: <filename>]

# Certificate and key files for client cert authentication to the server.
[cert_file: <filename>]
[key_file: <filename>]

# ServerName extension to indicate the name of the server.
# https://tools.ietf.org/html/rfc4366#section-3.1
[server_name: <string>]

# Disable validation of the server certificate.
[insecure_skip_verify: <boolean>]

# File path to store temporary rule files
# CLI flag: -ruler.rule-path
[rule_path: <filename> | default = "/rules"]
@@ -1751,6 +1806,10 @@ logs in Loki.
# If no rule is matched the `retention_period` is used.
[retention_stream: <array> | default = none]

# Capacity of remote-write queues; if a queue exceeds its capacity it will evict the oldest samples.
# CLI flag: -ruler.remote-write.queue-capacity
[ruler_remote_write_queue_capacity: <int> | default = 10000]

# Feature renamed to 'runtime configuration', flag deprecated in favor of -runtime-config.file (runtime_config.file in YAML).
# CLI flag: -limits.per-user-override-config
[per_tenant_override_config: <string>]
176 changes: 99 additions & 77 deletions docs/sources/alerting/_index.md → docs/sources/rules/_index.md
@@ -1,13 +1,17 @@
---
title: Alerting
aliases:
- /alerting/
title: Alerting and Recording Rules
weight: 700
---

# Alerting
# Rules and the Ruler

Loki includes a component called the Ruler, adapted from our upstream project, Cortex. The Ruler is responsible for continually evaluating a set of configurable queries and then alerting when certain conditions happen, e.g. a high percentage of error logs.
Loki includes a component called the Ruler, adapted from our upstream project, Cortex. The Ruler is responsible for continually evaluating a set of configurable queries and performing an action based on the result.

First, ensure the Ruler component is enabled. The following is a basic configuration which loads rules from configuration files:
This example configuration sources rules from a local disk.

[Ruler storage](#ruler-storage) provides further details.

```yaml
ruler:
@@ -24,72 +28,19 @@ ruler:

```

## Prometheus Compatible

When running the Ruler (which runs by default in the single binary), Loki accepts rules files and then schedules them for continual evaluation. These are _Prometheus compatible_! This means the rules file has the same structure as in [Prometheus' Alerting Rules](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/), except that the rules specified are in LogQL.

Let's see what that looks like:

The syntax of a rule file is:

```yaml
groups:
[ - <rule_group> ]
```

A simple example file could be:
We support two kinds of rules: [alerting](#alerting-rules) rules and [recording](#recording-rules) rules.

```yaml
groups:
- name: example
rules:
- alert: HighThroughputLogStreams
expr: sum by(container) (rate({job=~"loki-dev/.*"}[1m])) > 1000
for: 2m
```
## Alerting Rules

### `<rule_group>`
We support [Prometheus-compatible](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/) alerting rules. From Prometheus' documentation:

```yaml
# The name of the group. Must be unique within a file.
name: <string>
> Alerting rules allow you to define alert conditions based on Prometheus expression language expressions and to send notifications about firing alerts to an external service.

# How often rules in the group are evaluated.
[ interval: <duration> | default = Ruler.evaluation_interval || 1m ]

rules:
[ - <rule> ... ]
```

### `<rule>`

The syntax for alerting rules is (see the LogQL [Metric Queries](https://grafana.com/docs/loki/latest/logql/#metric-queries) for more details):

```yaml
# The name of the alert. Must be a valid label value.
alert: <string>

# The LogQL expression to evaluate (must be an instant vector). Every evaluation cycle this is
# evaluated at the current time, and all resultant time series become
# pending/firing alerts.
expr: <string>

# Alerts are considered firing once they have been returned for this long.
# Alerts which have not yet fired for long enough are considered pending.
[ for: <duration> | default = 0s ]

# Labels to add or overwrite for each alert.
labels:
[ <labelname>: <tmpl_string> ]

# Annotations to add to each alert.
annotations:
[ <labelname>: <tmpl_string> ]
```
Loki alerting rules are exactly the same, except they use LogQL for their expressions.

### Example

A full-fledged example of a rules file might look like:
A complete example of a rules file:

```yaml
groups:
@@ -117,25 +68,96 @@ groups:
severity: critical
```

## Use cases
## Recording Rules

The Ruler's Prometheus compatibility further accentuates the marriage between metrics and logs. For those looking to get started alerting based on logs, or wondering why this might be useful, here are a few use cases we think fit very well.
We support [Prometheus-compatible](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#recording-rules) recording rules. From Prometheus' documentation:

### We aren't using metrics yet
> Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series.

> Querying the precomputed result will then often be much faster than executing the original expression every time it is needed. This is especially useful for dashboards, which need to query the same expression repeatedly every time they refresh.

Loki allows you to run [_metric queries_](https://grafana.com/docs/loki/latest/logql/#metric-queries) over your logs, which means
that you can derive a numeric aggregation from them, such as the number of requests over time from your NGINX access log.

### Example

Many nascent projects, apps, or even companies may not have a metrics backend yet. We tend to add logging support before metric support, so if you're in this stage, alerting based on logs can help bridge the gap. It's easy to start building Loki alerts for things like _the percentage of error logs_ such as the example from earlier:
```yaml
- alert: HighPercentageError
expr: |
sum(rate({app="foo", env="production"} |= "error" [5m])) by (job)
/
sum(rate({app="foo", env="production"}[5m])) by (job)
> 0.05
name: NginxRules
interval: 1m
rules:
- record: nginx:requests:rate1m
expr: |
sum(
rate({container="nginx"}[1m])
)
labels:
cluster: "us-central1"
```

This query (`expr`) is executed every minute (`interval`), and its result is stored in the metric name we have
defined (`record`). The metric `nginx:requests:rate1m` can now be sent to Prometheus, where it is stored
just like any other metric.
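
Once remote-written, the recorded series can be queried like any other Prometheus metric; for example, `sum by (cluster) (nginx:requests:rate1m)` aggregates the series produced by the rule above.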

### Remote-Write

With recording rules, you can run these metric queries continually on an interval and have the resulting metrics written
to a Prometheus-compatible remote-write endpoint; in effect, they produce Prometheus metrics from log entries.

At the time of writing, the following backends support this:

- [Prometheus](https://prometheus.io/docs/prometheus/latest/disabled_features/#remote-write-receiver) (`>=v2.25.0`):
  Prometheus is generally a pull-based system, but since `v2.25.0` it can also accept metrics written directly to it; see the launch sketch after this list.
- [Cortex](https://cortexmetrics.io/docs/api/#remote-write)
- [Thanos (`Receiver`)](https://thanos.io/tip/components/receive.md/)
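
For instance, a sketch of launching Prometheus with the receiver enabled (the feature flag is taken from the linked disabled-features page; the configuration file path is a placeholder):

```bash
# Expose the /api/v1/write endpoint so the Ruler can push samples (Prometheus >= v2.25.0).
prometheus --enable-feature=remote-write-receiver --config.file=prometheus.yml
```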

Here is an example remote-write configuration for sending to a local Prometheus instance:

```yaml
ruler:
... other settings ...

remote_write:
enabled: true
client:
url: http://localhost:9090/api/v1/write
```

Further configuration options can be found under [ruler_config](/configuration#ruler_config).
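
As a sketch of what a more complete client block might look like, here is a configuration that adds basic authentication, a custom header, and a CA certificate, using only options from the [ruler_config](/configuration#ruler_config) reference above (the URL, header, and file paths are placeholders):

```yaml
ruler:
  remote_write:
    enabled: true
    client:
      url: https://prometheus.example.com/api/v1/write
      remote_timeout: 30s
      # Custom headers sent with every remote-write request.
      headers:
        X-Scope-OrgID: tenant-1
      # Username/password for the `Authorization` header.
      basic_auth:
        username: loki-ruler
        password_file: /etc/secrets/remote-write-password
      tls_config:
        # CA used to validate the receiver's certificate.
        ca_file: /etc/ssl/certs/remote-write-ca.pem
```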

### Resilience and Durability

Given the above remote-write configuration, one needs to take into account what would happen if the remote-write receiver
becomes unavailable.
Comment on lines +129 to +130 (Contributor):

Suggested change:
- Given the above remote-write configuration, one needs to take into account what would happen if the remote-write receiver becomes unavailable.
+ A remote-write configuration needs to take into account what would happen if the remote-write receiver becomes unavailable.

Contributor Author: I think it's incorrect to refer to this by an indefinite article, as there can only be one.
I was trying to draw the reader's attention to the section above in case they hadn't yet read it.
Keen to hear your thoughts on this.

The Ruler component provides some durability by buffering all outgoing writes in an in-memory queue. This queue
holds all metric samples that are due to be written to the remote-write receiver; while that receiver is down, the buffer
will grow in size.

Once the queue is full, the oldest samples will be evicted from the queue. The size of this queue is controllable globally,
or on a per-tenant basis, with the [`ruler_remote_write_queue_capacity`](/configuration#limits_config) limit setting. By default, this value is set to 10000 samples.
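
For example, a sketch of raising the global capacity in the limits configuration (the value shown is illustrative):

```yaml
limits_config:
  # Allow up to 50000 samples to be buffered before the oldest are evicted.
  ruler_remote_write_queue_capacity: 50000
```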

**NOTE**: this queue only exists in-memory at this time; there is no Write-Ahead Log (WAL) functionality available yet.
This means that if your Ruler instance crashes, all pending metric samples in the queue that have not yet been written will be lost.

### Operational Considerations

Metrics are available to monitor recording rule evaluations and writes.

| Metric | Description |
|---|---|
| `recording_rules_samples_queued_current` | Number of samples queued to be remote-written. |
| `recording_rules_samples_queued_total` | Total number of samples queued. |
| `recording_rules_samples_queue_capacity` | Number of samples that can be queued before eviction of the oldest samples occurs. |
| `recording_rules_samples_evicted_total` | Number of samples evicted from queue because the queue is full. |
| `recording_rules_remote_write_errors` | Number of samples that failed to be remote-written due to error. |
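
As one way to use these, here is a sketch of a Prometheus alerting rule that fires when queued samples are being evicted (the metric name is taken from the table above; the threshold, duration, and labels are illustrative, and your setup may add a namespace prefix to the metric):

```yaml
groups:
  - name: loki-ruler-remote-write
    rules:
      - alert: RulerSamplesBeingEvicted
        # Evictions mean the remote-write receiver is down or the queue is undersized.
        expr: rate(recording_rules_samples_evicted_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Loki Ruler is evicting queued remote-write samples."
```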

## Use cases

The Ruler's Prometheus compatibility further accentuates the marriage between metrics and logs. For those looking to get started with metrics and alerts based on logs, or wondering why this might be useful, here are a few use cases we think fit very well.

### Black box monitoring

We don't always control the source code of applications we run. Think load balancers and the myriad components (both open source and closed third-party) that support our applications; it's a common problem that these don't expose a metric you want (or any metrics at all). How then, can we bring them into our observability stack in order to monitor them effectively? Alerting based on logs is a great answer for these problems.
We don't always control the source code of applications we run. Load balancers and a myriad of other components, both open source and closed third-party, support our applications but don't expose the metrics we want; some don't expose any metrics at all. Loki's alerting and recording rules can produce metrics and alert on the state of the system, bringing these components into our observability stack by using their logs. This is an incredibly powerful way to introduce advanced observability into legacy architectures.

### Event alerting

@@ -162,7 +184,7 @@ Creating these alerts in LogQL is attractive because these metrics can be extrac

## Interacting with the Ruler

Because the rule files are identical to Prometheus rule files, we can interact with the Loki Ruler via [`cortex-tool`](https://github.com/grafana/cortex-tools#rules). The CLI is in early development, but works alongside both Loki and cortex. Make sure to pass the `--backend=loki` argument to commands when using it with Loki.
Because the rule files are identical to Prometheus rule files, we can interact with the Loki Ruler via [`cortextool`](https://github.com/grafana/cortex-tools#rules). The CLI is in early development, but it works with both Loki and Cortex. Pass the `--backend=loki` option when using it with Loki.
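
For example, a sketch of loading a rules file with cortextool (the address, tenant ID, and file name are placeholders; confirm the exact flags against the cortextool documentation for your version):

```bash
# Upload rules.yaml to the Ruler for tenant "fake" (Loki's default tenant when auth is disabled).
cortextool rules load --backend=loki --address=http://localhost:3100 --id=fake ./rules.yaml
```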

> **Note:** Not all commands in cortextool currently support Loki.

@@ -275,8 +297,8 @@ Yaml files are expected to be [Prometheus compatible](#Prometheus_Compatible) bu

There are a few things coming to increase the robustness of this service. In no particular order:

- Recording rules.
- Backend metric stores adapters for generated alert and recording rule data. The first will likely be Cortex, as Loki is built atop it.
- WAL for recording rules.
- Backend metric stores adapters for generated alert rule data.

## Misc Details: Metrics backends vs in-memory
