2 changes: 2 additions & 0 deletions _topic_maps/_topic_map.yml
Original file line number Diff line number Diff line change
@@ -3283,6 +3283,8 @@ Topics:
File: network-observability-network-policy
- Name: Observing the network traffic
File: observing-network-traffic
- Name: Network observability alerts
File: network-observability-alerts
- Name: Using metrics with dashboards and alerts
File: metrics-alerts-dashboards
- Name: Monitoring the Network Observability Operator
152 changes: 152 additions & 0 deletions modules/network-observability-alerts-about-promql-expression.adoc
@@ -0,0 +1,152 @@
// Module included in the following assemblies:
//
// * network_observability/network-observability-alerts.adoc

:_mod-docs-content-type: REFERENCE
[id="network-observability-alerts-about-promql-expression_{context}"]
= About the PromQL expression for alerts

[role="_abstract"]
Learn about the base Prometheus Query Language (`PromQL`) query pattern and how to customize it so that you can configure network observability alerts for your specific needs.

The alerting API in the network observability `FlowCollector` custom resource (`CR`) is mapped to the Prometheus Operator API, generating a `PrometheusRule`. You can see the `PrometheusRule` in the default `netobserv` namespace by running the following command:

[source,terminal]
----
$ oc get prometheusrules -n netobserv -oyaml
----

[id="example-example-query-alert-for-surge-in-incoming-traffic_{context}"]
== An example query for an alert about a surge in incoming traffic

This example provides the base `PromQL` query pattern for an alert about a surge in incoming traffic:

[source,promql]
----
sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace)
----

This query calculates the byte rate coming from the `openshift-ingress` namespace to any of your workloads' namespaces over the past 30 minutes.

You can customize the query, including retaining only some rates, running the query for specific time periods, and setting a final threshold.

Filtering noise:: Appending `> 1000` to this query retains only the rates observed that are greater than `1 KB/s`, which eliminates noise from low-bandwidth consumers.
+
`(sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)`
+
The byte rate is relative to the sampling interval defined in the `FlowCollector` custom resource (`CR`) configuration. If the sampling interval is `1:100`, the actual traffic might be approximately 100 times higher than the reported metrics.

Time comparison:: You can run the same query for a particular period of time using the `offset` modifier. For example, a query for one day earlier can be run using `offset 1d`, and a query for five hours ago can be run using `offset 5h`.
+
`sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)`
+
You can use the formula `100 * (<query now> - <query from the previous day>) / <query from the previous day>` to calculate the percentage of increase compared to the previous day. This value can be negative if the byte rate today is lower than the previous day.

Final threshold:: You can apply a final threshold to filter increases that are lower than the desired percentage. For example, `> 100` eliminates increases that are lower than 100%.

Together, the complete expression for the `PrometheusRule` looks like the following:

[source,promql]
----
...
expr: |-
  (100 *
    (
      (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
      - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
    )
    / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
  > 100
----
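
As a variation on the same pattern, the following sketch compares the current rate with the rate one week earlier and keeps only increases above 50%. The `offset 1w` window, the `5000` noise floor, and the `50` threshold are illustrative assumptions, not recommended values; adjust them to your own traffic profile.

[source,yaml]
----
# Illustrative variant only: the offset, noise floor, and threshold are assumptions.
expr: |-
  (100 *
    (
      (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 5000)
      - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1w)) by (DstK8S_Namespace)
    )
    / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1w)) by (DstK8S_Namespace))
  > 50
----

With this expression, a namespace that receives 3000 B/s now and received 1000 B/s a week earlier yields `100 * (3000 - 1000) / 1000 = 200`, which exceeds the threshold of 50.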

[id="alert-metadata-fields_{context}"]
== Alert metadata fields

The Network Observability Operator uses components from other {product-title} features, such as the monitoring stack, to enhance visibility into network traffic. For more information, see "Monitoring stack architecture".

Some metadata must be configured for the alert definitions. This metadata is used by Prometheus and the `Alertmanager` service from the monitoring stack, or by the *Network Health* dashboard.

The following example shows an `AlertingRule` resource with the configured metadata:

[source,yaml]
----
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: netobserv-alerts
  namespace: openshift-monitoring
spec:
  groups:
  - name: NetObservAlerts
    rules:
    - alert: NetObservIncomingBandwidth
      annotations:
        netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"100","unit":"%","upperBound":"500"}'
        message: |-
          NetObserv is detecting a surge of incoming traffic: current traffic to {{ $labels.DstK8S_Namespace }} has increased by more than 100% since yesterday.
        summary: "Surge in incoming traffic"
      expr: |-
        (100 *
          (
            (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
            - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
          )
          / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
        > 100
      for: 1m
      labels:
        app: netobserv
        netobserv: "true"
        severity: warning

where:

`spec.groups.rules.labels.netobserv`::
When set to `"true"`, specifies that the *Network Health* dashboard detects the alert.
`spec.groups.rules.labels.severity`::
Specifies the severity of the alert. The following values are valid: `critical`, `warning`, or `info`.

You can use the output labels from the defined `PromQL` expression in the `message` annotation. In the example, because results are grouped by `DstK8S_Namespace`, the expression `{{ $labels.DstK8S_Namespace }}` is used in the message text.

The `netobserv_io_network_health` annotation is optional and controls how the alert is rendered on the *Network Health* dashboard.

The `netobserv_io_network_health` annotation is a JSON string consisting of the following fields:

.Fields for the netobserv_io_network_health annotation
[cols="2,2,6",options="header"]
|===
| Field
| Type
| Description

| `namespaceLabels`
| List of strings
| One or more labels that hold namespaces. When provided, the alert appears under the *Namespaces* tab.

| `nodeLabels`
| List of strings
| One or more labels that hold node names. When provided, the alert appears under the *Nodes* tab.

| `threshold`
| String
| The alert threshold, expected to match the threshold defined in the `PromQL` expression.

| `unit`
| String
| The data unit, used only for display purposes.

| `upperBound`
| String
| An upper bound value used to compute the score on a closed scale. Metric values exceeding this bound are clamped.

| `links`
| List of objects
| A list of links to display contextually with the alert. Each link requires a `name` (display name) and `url`.

| `trafficLinkFilter`
| String
| An additional filter to inject into the URL for the *Network Traffic* page.
|===

The `namespaceLabels` and `nodeLabels` fields are mutually exclusive. If neither is provided, the alert appears under the *Global* tab.
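
As an illustration of the fields in the table, the following sketch shows an annotation for a node-scoped alert. The `DstK8S_HostName` label, the threshold and upper bound values, and the link name and URL are hypothetical placeholders; substitute the labels that your own `PromQL` expression outputs.

[source,yaml]
----
annotations:
  # Hypothetical values for illustration only.
  # nodeLabels is set instead of namespaceLabels, so the alert appears under the Nodes tab.
  netobserv_io_network_health: '{"nodeLabels":["DstK8S_HostName"],"threshold":"10","unit":"%","upperBound":"50","links":[{"name":"Runbook","url":"https://example.com/runbook"}]}'
----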
60 changes: 60 additions & 0 deletions modules/network-observability-alerts-about.adoc
@@ -0,0 +1,60 @@
// Module included in the following assemblies:
//
// * network_observability/network-observability-alerts.adoc

:_mod-docs-content-type: CONCEPT
[id="network-observability-alerts-about_{context}"]
= About network observability alerts

[role="_abstract"]
Network observability includes predefined alerts. Use these alerts to gain insight into the health and performance of your {product-title} applications and infrastructure.

The predefined alerts provide a quick health indication of your cluster's network in the *Network Health* dashboard. You can also customize alerts using Prometheus Query Language (PromQL) queries.

By default, network observability creates alerts that are contextual to the features you enable.

For example, packet drop-related alerts are created only if the `PacketDrop` agent feature is enabled in the `FlowCollector` custom resource (CR). Alerts are built on metrics, and you might see configuration warnings if enabled alerts are missing their required metrics.

You can configure these metrics in the `spec.processor.metrics.includeList` object of the `FlowCollector` CR.
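
For example, the following sketch enables the `PacketDrop` agent feature and adds a drop-related metric to the include list. The `spec.agent.ebpf` settings and the metric name `namespace_drop_packets_total` are assumptions made for illustration; check the `FlowCollector` API reference for the names available in your version.

[source,yaml]
----
spec:
  agent:
    ebpf:
      privileged: true   # assumption: packet drop tracking typically requires privileged mode
      features:
      - PacketDrop
  processor:
    metrics:
      includeList:
      - namespace_drop_packets_total   # assumed metric name, shown for illustration
----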

[id="network-observability-default-alert-templates_{context}"]
== List of default alert templates

These alert templates are installed by default:

`PacketDropsByDevice`:: Triggers on a high percentage of packet drops from devices (`/proc/net/dev`).
`PacketDropsByKernel`:: Triggers on a high percentage of packet drops by the kernel; it requires the `PacketDrop` agent feature.
`IPsecErrors`:: Triggers when IPsec encryption errors are detected by network observability; it requires the `IPSec` agent feature.
`NetpolDenied`:: Triggers when traffic denied by network policies is detected by network observability; it requires the `NetworkEvents` agent feature.
`LatencyHighTrend`:: Triggers when an increase of TCP latency is detected by network observability; it requires the `FlowRTT` agent feature.
`DNSErrors`:: Triggers when DNS errors are detected by network observability; it requires the `DNSTracking` agent feature.
//* `ExternalEgressHighTrend`: TODO.
//* `ExternalIngressHighTrend`: TODO.

These are operational alerts that relate to the self-health of network observability:

`NetObservNoFlows`:: Triggers when no flows are being observed for a certain period.
`NetObservLokiError`:: Triggers when flows are being dropped due to Loki errors.

You can configure, extend, or disable alerts for network observability. You can view the resulting `PrometheusRule` resource in the default `netobserv` namespace by running the following command:

[source,terminal]
----
$ oc get prometheusrules -n netobserv -oyaml
----

[id="network-health-dashboard_{context}"]
== Network Health dashboard

When alerts are enabled in the Network Observability Operator, two things happen:

* New alerts appear in the *Observe* → *Alerting* → *Alerting rules* tab in the {product-title} web console.
* A new *Network Health* dashboard appears in the *Observe* section of the {product-title} web console.

The *Network Health* dashboard provides a summary of triggered alerts and pending alerts, distinguishing between critical, warning, and minor issues. Alerts for rule violations are displayed in the following tabs:

* *Global*: Shows alerts that are global to the cluster.
* *Nodes*: Shows alerts for rule violations per node.
* *Namespaces*: Shows alerts for rule violations per namespace.

Click a resource card to see more information. A three-dot menu appears next to each alert. From this menu, you can navigate to *Network Traffic* → *Traffic flows* to see more detailed information for the selected resource.
44 changes: 44 additions & 0 deletions modules/network-observability-configuring-predefined-alerts.adoc
@@ -0,0 +1,44 @@
// Module included in the following assemblies:
//
// network_observability/network-observability-alerts.adoc

:_mod-docs-content-type: CONCEPT
[id="network-observability-configuring-predefined-alerts_{context}"]
= Configuring predefined alerts

[role="_abstract"]
Alerts in the Network Observability Operator are defined using alert templates and variants in the `spec.processor.metrics.alerts` object of the `FlowCollector` custom resource (CR). You can customize the default templates and variants for flexible, fine-grained alerting.

After you enable alerts, the *Network Health* dashboard appears in the *Observe* section of the {product-title} web console.

For each template, you can define a list of variants, each with its own thresholds and grouping configuration. For more information, see "List of default alert templates".

The following example defines two variants for the `PacketDropsByKernel` template:

[source,yaml,subs="attributes,verbatim"]
----
apiVersion: flows.netobserv.io/v1beta1
kind: FlowCollector
metadata:
  name: flow-collector
spec:
  processor:
    metrics:
      alerts:
      - template: PacketDropsByKernel
        variants:
        # triggered when the whole cluster traffic (no grouping) reaches 10% of drops
        - thresholds:
            critical: "10"
        # triggered when per-node traffic reaches 5% of drops, with gradual severity
        - thresholds:
            critical: "15"
            warning: "10"
            info: "5"
          groupBy: Node
----

[NOTE]
====
Customizing an alert replaces the default configuration for that template. If you want to keep the default configurations, you must manually replicate them.
====
40 changes: 40 additions & 0 deletions modules/network-observability-creating-custom-alert-rules.adoc
@@ -0,0 +1,40 @@
// Module included in the following assemblies:
//
// * network_observability/network-observability-alerts.adoc

:_mod-docs-content-type: PROCEDURE
[id="network-observability-creating-custom-alert-rules_{context}"]
= Creating custom alert rules

[role="_abstract"]
Use the Prometheus Query Language (`PromQL`) to define a custom `AlertingRule` resource that triggers alerts based on specific network metrics, such as traffic surges.

.Prerequisites

* You are familiar with `PromQL`.
* You have installed {product-title} 4.14 or later.
* You have access to the cluster as a user with the `cluster-admin` role.
* You have installed the Network Observability Operator.

.Procedure

. Create a YAML file named `custom-alert.yaml` that contains your `AlertingRule` resource.
. Apply the custom alert rule by running the following command:
+
[source,terminal]
----
$ oc apply -f custom-alert.yaml
----
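
A minimal sketch of what `custom-alert.yaml` from step 1 might contain is shown here. It reuses the `netobserv_workload_ingress_bytes_total` metric and the `netobserv: "true"` label that marks an alert for the *Network Health* dashboard. The alert name, the message text, the `for` duration, and the 1 MB/s threshold are hypothetical and chosen only for illustration.

[source,yaml]
----
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: netobserv-alerts
  namespace: openshift-monitoring
spec:
  groups:
  - name: NetObservAlerts
    rules:
    - alert: NetObservHighIngressRate   # hypothetical alert name
      annotations:
        summary: "High incoming traffic"
        message: |-
          Incoming traffic to {{ $labels.DstK8S_Namespace }} exceeds 1 MB/s.
      # Hypothetical threshold: per-namespace ingress byte rate above 1000000 B/s
      expr: |-
        sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000000
      for: 5m
      labels:
        netobserv: "true"
        severity: warning
----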

.Verification

. Verify that the `PrometheusRule` resource was created in the `netobserv` namespace by running the following command:
+
[source,terminal]
----
$ oc get prometheusrules -n netobserv -oyaml
----
+
The output should include the `netobserv-alerts` rule you just created, confirming that the resource was generated correctly.

. Confirm the rule is active by checking the *Network Health* dashboard in the {product-title} web console → *Observe*.
12 changes: 12 additions & 0 deletions modules/network-observability-disabling-predefined-alerts.adoc
@@ -0,0 +1,12 @@
// Module included in the following assemblies:
//
// * network_observability/network-observability-alerts.adoc

:_mod-docs-content-type: REFERENCE
[id="network-observability-disabling-predefined-alerts_{context}"]
= Disabling predefined alerts

[role="_abstract"]
You can disable alert templates in the `spec.processor.metrics.disableAlerts` field of the `FlowCollector` custom resource (CR). This setting accepts a list of alert template names. For a list of alert template names, see "List of default alert templates".

If a template is disabled and overridden in the `spec.processor.metrics.alerts` field, the disable setting takes precedence and the alert rule is not created.
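
For example, the following sketch disables two of the default templates. The choice of templates is arbitrary and shown only to illustrate the shape of the list.

[source,yaml]
----
spec:
  processor:
    metrics:
      # Alert rules for these templates are not created,
      # even if the templates also appear under spec.processor.metrics.alerts.
      disableAlerts:
      - PacketDropsByDevice
      - DNSErrors
----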
32 changes: 32 additions & 0 deletions modules/network-observability-enabling-alerts.adoc
@@ -0,0 +1,32 @@
// Module included in the following assemblies:
//
// network_observability/network-observability-alerts.adoc

:_mod-docs-content-type: PROCEDURE
[id="network-observability-enabling-alerts_{context}"]
= Enabling Technology Preview alerts in network observability

[role="_abstract"]
Network Observability Operator alerts are a Technology Preview feature. To use this feature, you must enable it in the `FlowCollector` custom resource (CR), and then continue with configuring alerts to your specific needs.

.Procedure

. Edit the `FlowCollector` CR to set the experimental alerts flag to `true`:
+
[source,yaml]
----
apiVersion: flows.netobserv.io/v1beta1
kind: FlowCollector
metadata:
  name: flow-collector
spec:
  processor:
    advanced:
      env:
        EXPERIMENTAL_ALERTS_HEALTH: "true"
----

You can still use the existing method for creating alerts. For more information, see "Creating alerts".

//for NetObserv 1.10, specific to new alerts functionality and new health dashboard as a Technology Preview feature. This may or may not be needed when the feature GA's.
//Kept the ID generic but the title specific to Technology Preview, just in case this needs to be updated for GA, then only the title will change and not the URL or ID, so xrefs should still function.