Merge pull request #63276 from skrthomas/OSDOCS-6754
OSDOCS-6754: Network Observability easier configuration
skrthomas committed Sep 28, 2023
2 parents 8241acc + ea323a5 commit 8012330
Showing 7 changed files with 93 additions and 7 deletions.
6 changes: 4 additions & 2 deletions modules/network-observability-flowcollector-view.adoc
@@ -57,14 +57,15 @@ spec:
type: configmap
name: loki-gateway-ca-bundle
certFile: service-ca.crt
namespace: loki-namespace # <5>
consolePlugin:
register: true
logLevel: info
portNaming:
enable: true
portNames:
"3100": loki
-quickFilters: <5>
+quickFilters: <6>
- name: Applications
filter:
src_namespace!: 'openshift-,netobserv'
@@ -87,4 +88,5 @@ spec:
<2> You can set the Sampling specification, `spec.agent.ebpf.sampling`, to manage resources. Lower sampling values might consume a large amount of computational, memory, and storage resources. You can mitigate this by specifying a higher sampling ratio value. A value of 100 means 1 flow in every 100 is sampled. A value of 0 or 1 means all flows are captured. The lower the value, the more flows are returned and the more accurate the derived metrics are. By default, eBPF sampling is set to a value of 50, so 1 flow in every 50 is sampled. Note that more sampled flows also mean more storage is needed. It is recommended to start with the default values and refine empirically to determine which setting your cluster can manage.
<3> The optional specifications `spec.processor.logTypes`, `spec.processor.conversationHeartbeatInterval`, and `spec.processor.conversationEndTimeout` can be set to enable conversation tracking. When enabled, conversation events are queryable in the web console. The values for `spec.processor.logTypes` are as follows: `FLOWS`, `CONVERSATIONS`, `ENDED_CONVERSATIONS`, or `ALL`. Storage requirements are highest for `ALL` and lowest for `ENDED_CONVERSATIONS`.
<4> The Loki specification, `spec.loki`, specifies the Loki client. The default values match the Loki install paths mentioned in the Installing the Loki Operator section. If you used another installation method for Loki, specify the appropriate client information for your install.
-<5> The `spec.quickFilters` specification defines filters that show up in the web console. The `Applications` filter keys, `src_namespace` and `dst_namespace`, are negated (`!`), so the `Applications` filter shows all traffic that _does not_ originate from, or have a destination in, any `openshift-` or `netobserv` namespaces. For more information, see Configuring quick filters below.
+<5> The original certificates are copied to the Network Observability instance namespace and watched for updates. When not provided, the namespace defaults to the value of `spec.namespace`. If you chose to install Loki in a different namespace, you must specify it in the `spec.loki.tls.caCert.namespace` field. Similarly, the `spec.exporters.kafka.tls.caCert.namespace` field is available for Kafka installed in a different namespace.
+<6> The `spec.quickFilters` specification defines filters that show up in the web console. The `Applications` filter keys, `src_namespace` and `dst_namespace`, are negated (`!`), so the `Applications` filter shows all traffic that _does not_ originate from, or have a destination in, any `openshift-` or `netobserv` namespaces. For more information, see Configuring quick filters below.
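
As a minimal sketch of callouts 2 and 5, the fragment below sets the documented default sampling value and points the Loki CA certificate at a separate namespace. Every field path and value is taken from the sample and callouts above; `loki-namespace` is a placeholder for wherever Loki is actually installed, and the rest of the `FlowCollector` resource is omitted.

[source,yaml]
----
spec:
  agent:
    ebpf:
      sampling: 50                  # default: 1 flow in every 50 is sampled; 0 or 1 captures all flows
  loki:
    tls:
      caCert:
        type: configmap
        name: loki-gateway-ca-bundle
        certFile: service-ca.crt
        namespace: loki-namespace   # placeholder; when omitted, spec.namespace is used
----

Raising `sampling` to 100 would sample roughly half as many flows as the default, which reduces storage needs at the cost of less accurate derived metrics.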
7 changes: 4 additions & 3 deletions modules/network-observability-lokistack-create.adoc
@@ -20,7 +20,7 @@ It is recommended to deploy the LokiStack in the same namespace referenced by th
kind: LokiStack
metadata:
name: loki
-namespace: netobserv
+namespace: netobserv <1>
spec:
size: 1x.small
storage:
@@ -30,11 +30,12 @@ It is recommended to deploy the LokiStack in the same namespace referenced by th
secret:
name: loki-s3
type: s3
-storageClassName: gp3 <1>
+storageClassName: gp3 <2>
tenants:
mode: openshift-network
----
-<1> Use a storage class name that is available on the cluster for `ReadWriteOnce` access mode. You can use `oc get storageclasses` to see what is available on your cluster.
+<1> The installation examples in this documentation use the same namespace, `netobserv`, across all components. You can optionally use a different namespace.
+<2> Use a storage class name that is available on the cluster for `ReadWriteOnce` access mode. You can use `oc get storageclasses` to see what is available on your cluster.
+
[IMPORTANT]
====
30 changes: 30 additions & 0 deletions modules/network-observability-rate-limit-alert.adoc
@@ -0,0 +1,30 @@
//
// network_observability/configuring-operator.adoc

:_content-type: CONCEPT
[id="network-observability-netobserv-dashboard-rate-limit-alerts_{context}"]
= Creating Loki rate limit alerts for the NetObserv dashboard
You can create custom rules for the *NetObserv* dashboard metrics to trigger alerts when Loki rate limits have been reached.

An example of an alerting rule configuration YAML file is as follows:
[source,yaml]
----
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: loki-alerts
  namespace: openshift-operators-redhat
spec:
  groups:
  - name: LokiRateLimitAlerts
    rules:
    - alert: LokiTenantRateLimit
      annotations:
        message: |-
          {{ $labels.job }} {{ $labels.route }} is experiencing 429 errors.
        summary: "Some requests are being answered with the rate limit error code."
      expr: sum(irate(loki_request_duration_seconds_count{status_code="429"}[1m])) by (job, namespace, route) / sum(irate(loki_request_duration_seconds_count[1m])) by (job, namespace, route) * 100 > 0
      for: 10s
      labels:
        severity: warning
----
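
The rule above fires as soon as the share of 429 responses stays above 0 percent for 10 seconds. As a hedged variation that is not part of this commit, a second entry under `rules:` could escalate only when the share remains above a threshold. The alert name `LokiTenantRateLimitHigh`, the 5 percent threshold, the `5m` duration, and the `critical` severity are illustrative assumptions; the metric and labels are the ones used above.

[source,yaml]
----
    - alert: LokiTenantRateLimitHigh   # hypothetical alert name
      annotations:
        message: |-
          {{ $labels.job }} {{ $labels.route }} is returning 429 errors for more than 5% of requests.
        summary: "More than 5% of requests are answered with the rate limit error code."
      # Same ratio as LokiTenantRateLimit, compared against an assumed 5 percent threshold.
      expr: sum(irate(loki_request_duration_seconds_count{status_code="429"}[1m])) by (job, namespace, route) / sum(irate(loki_request_duration_seconds_count[1m])) by (job, namespace, route) * 100 > 5
      for: 5m
      labels:
        severity: critical
----

Keeping both rules would give an early warning at any rate limiting and a stronger signal when it persists, but suitable thresholds depend on the cluster and should be tuned empirically.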
@@ -0,0 +1,40 @@
// Module included in the following assemblies:

// * networking/network_observability/troubleshooting-network-observability.adoc

:_content-type: PROCEDURE
[id="network-observability-troubleshooting-loki-tenant-rate-limit_{context}"]
= LokiStack rate limit errors
A rate limit placed on the Loki tenant can result in temporary loss of data and a 429 error: `Per stream rate limit exceeded (limit:xMB/sec) while attempting to ingest for stream`. Consider setting an alert to notify you of this error. For more information, see "Creating Loki rate limit alerts for the NetObserv dashboard" in the Additional resources of this section.

You can update the LokiStack custom resource (CR) with the `perStreamRateLimit` and `perStreamRateLimitBurst` specifications, as shown in the following procedure.

.Procedure
. Navigate to *Operators* -> *Installed Operators*, viewing *All projects* from the *Project* dropdown.
. Look for *Loki Operator*, and select the *LokiStack* tab.
. Create a *LokiStack* instance or edit an existing one using the *YAML view* to add the `perStreamRateLimit` and `perStreamRateLimitBurst` specifications:
+
[source, yaml]
----
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: loki
  namespace: netobserv
spec:
  limits:
    global:
      ingestion:
        perStreamRateLimit: 6 <1>
        perStreamRateLimitBurst: 30 <2>
  tenants:
    mode: openshift-network
  managementState: Managed
----
<1> The default value for `perStreamRateLimit` is `3`.
<2> The default value for `perStreamRateLimitBurst` is `15`.

. Click *Save*.

.Verification
After you update the `perStreamRateLimit` and `perStreamRateLimitBurst` specifications, the pods in your cluster restart and the 429 rate limit error no longer occurs.
4 changes: 4 additions & 0 deletions networking/network_observability/installing-operators.adoc
@@ -22,6 +22,10 @@ include::modules/network-observability-without-loki.adoc[leveloffset=+1]
include::modules/network-observability-loki-install.adoc[leveloffset=+1]
include::modules/network-observability-loki-secret.adoc[leveloffset=+2]
[role="_additional-resources"]
.Additional resources
* For more information about the option to use different namespaces for the separate components, see the `spec.loki.tls.caCert.namespace` specification in the xref:../network_observability/flowcollector-api.adoc#network-observability-flowcollector-api-specifications_network_observability[Flow Collector API Reference] and callout number 5 in the xref:../network_observability/configuring-operator.adoc#network-observability-flowcollector-view_network_observability[Flow Collector sample resource].
include::modules/network-observability-lokistack-create.adoc[leveloffset=+2]
include::modules/network-observability-lokistack-ingestion-query.adoc[leveloffset=+2]
include::modules/network-observability-auth-multi-tenancy.adoc[leveloffset=+1]
@@ -10,4 +10,9 @@ You can use the web console to monitor alerts related to the health of the Netwo


include::modules/network-observability-viewing-alerts.adoc[leveloffset=+1]
include::modules/network-observability-disabling-health-alerts.adoc[leveloffset=+2]
include::modules/network-observability-rate-limit-alert.adoc[leveloffset=+1]

[role="_additional-resources"]
.Additional resources
* For more information about creating alerts that you can see on the dashboard, see xref:../../monitoring/managing-alerts.adoc#creating-alerting-rules-for-user-defined-projects_managing-alerts[Creating alerting rules for user-defined projects].
@@ -16,4 +16,8 @@ include::modules/troubleshooting-network-observability-flowlogs-pipeline-kafka.a

include::modules/troubleshooting-network-observability-network-flow.adoc[leveloffset=+1]

include::modules/troubleshooting-network-observability-controller-manager-pod-out-of-memory.adoc[leveloffset=+1]

== Resource troubleshooting

include::modules/troubleshooting-network-observability-loki-tenant-rate-limit.adoc[leveloffset=+1]
