diff --git a/_topic_maps/_topic_map.yml b/_topic_maps/_topic_map.yml
index 5f01d2577153..d2215e8439f0 100644
--- a/_topic_maps/_topic_map.yml
+++ b/_topic_maps/_topic_map.yml
@@ -40,6 +40,8 @@ Topics:
   File: configuring-lokistack-otlp
 - Name: OpenTelemetry data model
   File: opentelemetry-data-model
+- Name: Loki query performance troubleshooting
+  File: loki-query-performance-troubleshooting
 ---
 Name: Upgrading logging
 Dir: upgrading
diff --git a/configuring/configuring-the-log-store.adoc b/configuring/configuring-the-log-store.adoc
index fa18fde710d5..7f791d83ff72 100644
--- a/configuring/configuring-the-log-store.adoc
+++ b/configuring/configuring-the-log-store.adoc
@@ -55,6 +55,7 @@ include::modules/logging-loki-reliability-hardening.adoc[leveloffset=+2]
 include::modules/loki-retention.adoc[leveloffset=+2]
 include::modules/loki-memberlist-ip.adoc[leveloffset=+2]
 include::modules/loki-restart-hardening.adoc[leveloffset=+2]
+//include::modules/enabling-automatic-stream-sharding.adoc[leveloffset=+2]
 
 //Advanced deployment and scalability
 [id="advanced_{context}"]
diff --git a/configuring/loki-query-performance-troubleshooting.adoc b/configuring/loki-query-performance-troubleshooting.adoc
new file mode 100644
index 000000000000..2852e43053f2
--- /dev/null
+++ b/configuring/loki-query-performance-troubleshooting.adoc
@@ -0,0 +1,22 @@
+:_newdoc-version: 2.18.4
+:_template-generated: 2025-09-22
+:_mod-docs-content-type: ASSEMBLY
+include::_attributes/common-attributes.adoc[]
+
+:toc:
+[id="loki-query-performance-troubleshooting_{context}"]
+= Loki query performance troubleshooting
+
+:context: loki-query-performance-troubleshooting
+
+This documentation describes methods for optimizing your logging stack to improve query performance and provides steps for troubleshooting query performance issues.
+
+include::modules/best-practices-for-loki-query-performance.adoc[leveloffset=+1]
+
+include::modules/best-practices-for-loki-labels.adoc[leveloffset=+1]
+
+include::modules/configuration-of-stream-labels-in-loki-operator.adoc[leveloffset=+1]
+
+include::modules/analyzing-loki-query-performance.adoc[leveloffset=+1]
+
+include::modules/query-performance-analysis.adoc[leveloffset=+1]
diff --git a/modules/analyzing-loki-query-performance.adoc b/modules/analyzing-loki-query-performance.adoc
new file mode 100644
index 000000000000..9405068d4e54
--- /dev/null
+++ b/modules/analyzing-loki-query-performance.adoc
@@ -0,0 +1,68 @@
+// Module included in the following assemblies:
+//
+// * configuring/loki-query-performance-troubleshooting.adoc
+
+:_newdoc-version: 2.18.4
+:_template-generated: 2025-10-24
+:_mod-docs-content-type: PROCEDURE
+
+[id="analyzing-loki-query-performance_{context}"]
+= Analyzing Loki query performance
+
+Every query and subquery in Loki generates a `metrics.go` log line with performance statistics.
+Queriers emit a `metrics.go` line for each subquery, and the query frontend emits a single summary `metrics.go` line for the query as a whole.
+Use these statistics to calculate the query performance metrics.
+
+.Prerequisites
+* You have administrator permissions.
+* You have access to the {ocp-product-title} web console.
+* You have installed and configured the {loki-op}.
+
+.Procedure
+. In the {ocp-product-title} web console, navigate to *Observe* -> *Metrics*.
+
+. Note the following values:
+
+* *duration*: Denotes the amount of time a query took to run.
+* *queue_time*: Denotes the time a query spent in the queue before being processed.
+* *chunk_refs_fetch_time*: Denotes the amount of time spent getting chunk information from the index.
+* *store_chunks_download_time*: Denotes the amount of time spent getting chunks from cache or storage.
+
+. Calculate the following performance metrics:
+
+** Calculate the total query time as `total_duration`:
++
+[subs=+quotes]
+----
+total_duration = *duration* + *queue_time*
+----
+
+** Calculate the percentage of the total duration that the query spent in the queue as `Queue Time`:
++
+[subs=+quotes]
+----
+Queue Time = *queue_time* / total_duration * 100
+----
+
+** Calculate the percentage of the total duration that was spent getting chunk information from the index as `Chunk Refs Fetch Time`:
++
+[subs=+quotes]
+----
+Chunk Refs Fetch Time = *chunk_refs_fetch_time* / total_duration * 100
+----
+
+** Calculate the percentage of the total duration that was spent getting chunks from cache or storage as `Chunks Download Time`:
++
+[subs=+quotes]
+----
+Chunks Download Time = *store_chunks_download_time* / total_duration * 100
+----
+
+** Calculate the percentage of the total duration that was spent executing the query as `Execution Time`:
++
+[subs=+quotes]
+----
+Execution Time = (*duration* - *chunk_refs_fetch_time* - *store_chunks_download_time*) / total_duration * 100
+----
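++
+For example, assume a hypothetical `metrics.go` summary line that reports a `duration` of 48s, a `queue_time` of 2s, a `chunk_refs_fetch_time` of 5s, and a `store_chunks_download_time` of 20s. These numbers are illustrative only; substitute the values from your own query. The calculations then yield the following results, and by construction the four percentages always sum to 100%:
++
+----
+total_duration        = 48s + 2s = 50s
+Queue Time            = 2s / 50s * 100 = 4%
+Chunk Refs Fetch Time = 5s / 50s * 100 = 10%
+Chunks Download Time  = 20s / 50s * 100 = 40%
+Execution Time        = (48s - 5s - 20s) / 50s * 100 = 46%
+----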
+
+. Refer to https://docs.redhat.com/en/documentation/red_hat_openshift_logging/latest/html/about_openshift_logging/index/analyze-query-performance_loki-query-performance-troubleshooting[Query performance analysis] to understand what each metric indicates and how it affects query performance.
diff --git a/modules/best-practices-for-loki-labels.adoc b/modules/best-practices-for-loki-labels.adoc
new file mode 100644
index 000000000000..292045c815a9
--- /dev/null
+++ b/modules/best-practices-for-loki-labels.adoc
@@ -0,0 +1,20 @@
+// Module included in the following assemblies:
+//
+// * configuring/loki-query-performance-troubleshooting.adoc
+
+:_newdoc-version: 2.18.4
+:_template-generated: 2025-09-25
+:_mod-docs-content-type: CONCEPT
+
+[id="best-practices-for-loki-labels_{context}"]
+= Best practices for Loki labels
+
+Labels in Loki are the keyspace on which Loki shards incoming data. They are also the index used for finding logs at query time. You can optimize query performance by using labels properly.
+
+Consider the following criteria when creating labels:
+
+* Labels should describe infrastructure, such as regions, clusters, servers, applications, namespaces, or environments.
+
+* Labels are long-lived. Label values should generate logs perpetually, or at least for several hours.
+
+* Labels are intuitive for querying.
diff --git a/modules/best-practices-for-loki-query-performance.adoc b/modules/best-practices-for-loki-query-performance.adoc
new file mode 100644
index 000000000000..7900c7f6336c
--- /dev/null
+++ b/modules/best-practices-for-loki-query-performance.adoc
@@ -0,0 +1,38 @@
+// Module included in the following assemblies:
+//
+// * configuring/loki-query-performance-troubleshooting.adoc
+
+
+:_newdoc-version: 2.18.4
+:_template-generated: 2025-09-25
+:_mod-docs-content-type: CONCEPT
+
+[id="best-practices-for-loki-query-performance_{context}"]
+= Best practices for Loki query performance
+
+You can take the following steps to improve Loki query performance:
+
+* Ensure that you are running the latest version of the {loki-op}.
+
+* Ensure that you have migrated the LokiStack schema to version `v13`.
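++
+The following example is a minimal sketch of what the `v13` schema configuration can look like in the `LokiStack` resource after migration. The `effectiveDate` value is a placeholder; use the date on which the `v13` schema takes effect in your environment:
++
+[source,yaml]
+----
+apiVersion: loki.grafana.com/v1
+kind: LokiStack
+metadata:
+  name: logging-loki
+  namespace: openshift-logging
+spec:
+  storage:
+    schemas:
+    - version: v13
+      effectiveDate: "2024-10-01" # placeholder date
+----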
+
+* Ensure that you use reliable and fast object storage. Loki places significant demands on object storage.
+If you are not using an object storage solution from a cloud provider, use solid-state drives (SSDs) for your object storage.
+By using SSDs, you can benefit from the high parallelization capabilities of Loki.
++
+To better understand how Loki uses object storage, you can run the following query in the *Metrics* dashboard of the {ocp-product-title} web console:
++
+[source]
+----
+sum by(status, container, operation) (label_replace(rate(loki_s3_request_duration_seconds_count{namespace="openshift-logging"}[5m]), "status", "${1}xx", "status_code", "([0-9]).."))
+----
+
+* The {loki-op} enables automatic stream sharding by default. The default automatic stream sharding mechanism is adequate in most cases, and you do not need to configure the `perStream*` attributes.
+
+* If you use the OpenTelemetry Protocol (OTLP) data model, you can configure additional stream labels in LokiStack. For more information, see link:https://docs.redhat.com/en/documentation/red_hat_openshift_logging/latest/html/configuring/configuring-the-log-store#best-practices-for-loki-labels_loki-query-performance-troubleshooting[Best practices for Loki labels].
+
+* Different types of queries have different performance characteristics. Use simple filter queries instead of regular expressions for better performance.
+
+[role="_additional-resources"]
+.Additional resources
+* link:https://docs.redhat.com/en/documentation/red_hat_openshift_logging/latest/html/about_openshift_logging/index/analyzing-loki-query-performance_loki-query-performance-troubleshooting[Analyzing Loki query performance]
diff --git a/modules/configuration-of-stream-labels-in-loki-operator.adoc b/modules/configuration-of-stream-labels-in-loki-operator.adoc
new file mode 100644
index 000000000000..2d5e3295ff30
--- /dev/null
+++ b/modules/configuration-of-stream-labels-in-loki-operator.adoc
@@ -0,0 +1,59 @@
+// Module included in the following assemblies:
+//
+// * configuring/loki-query-performance-troubleshooting.adoc
+
+:_newdoc-version: 2.18.4
+:_template-generated: 2025-09-25
+:_mod-docs-content-type: CONCEPT
+
+[id="configuration-of-stream-labels-in-loki-operator_{context}"]
+= Configuration of stream labels in the {loki-op}
+
+Which labels the {loki-op} uses as stream labels depends on the data model that you use: ViaQ or OpenTelemetry Protocol (OTLP).
+
+Both models come with a predefined set of stream labels. For more information, see link:https://docs.redhat.com/en/documentation/red_hat_openshift_logging/latest/html/configuring_logging/opentelemetry-data-model[OpenTelemetry data model].
+
+ViaQ model::
+ViaQ does not support structured metadata.
+To configure stream labels for the ViaQ model, add the configuration to the `ClusterLogForwarder` resource. For example:
++
+[source,yaml]
+----
+apiVersion: observability.openshift.io/v1
+kind: ClusterLogForwarder
+metadata:
+  name: instance
+  namespace: openshift-logging
+spec:
+  serviceAccount:
+    name: logging-collector
+  outputs:
+  - name: lokistack-out
+    type: lokiStack
+    lokiStack:
+      target:
+        name: logging-loki
+        namespace: openshift-logging
+      labelKeys:
+        application:
+          ignoreGlobal:
+          labelKeys: []
+        audit:
+          ignoreGlobal:
+          labelKeys: []
+        infrastructure:
+          ignoreGlobal:
+          labelKeys: []
+        global: []
+----
++
+The `lokiStack.labelKeys` field contains the configuration that maps log record keys to the Loki labels used to identify streams.
+
+OTLP model::
+In the OTLP model, all labels that are not specified as stream labels are attached as structured metadata.
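++
+The following example is a minimal sketch of how you might promote additional OTLP resource attributes to stream labels in the `LokiStack` resource. The attribute names are examples only, and the exact fields that are available can vary with the {loki-op} version:
++
+[source,yaml]
+----
+apiVersion: loki.grafana.com/v1
+kind: LokiStack
+metadata:
+  name: logging-loki
+  namespace: openshift-logging
+spec:
+  limits:
+    global:
+      otlp:
+        streamLabels:
+          resourceAttributes:
+          - name: "k8s.namespace.name" # example attribute promoted to a stream label
+          - name: "k8s.container.name" # example attribute promoted to a stream label
+----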
+
+The following are the best practices for creating stream labels:
+
+* The labels have low cardinality, with at most tens of values.
+* The label values are long-lived. For example, the first level of an HTTP path: `/load`, `/save`, and `/update`.
+* The labels can be used in queries to improve query performance.
diff --git a/modules/query-performance-analysis.adoc b/modules/query-performance-analysis.adoc
new file mode 100644
index 000000000000..bea3adf98ee0
--- /dev/null
+++ b/modules/query-performance-analysis.adoc
@@ -0,0 +1,49 @@
+// Module included in the following assemblies:
+//
+// * configuring/loki-query-performance-troubleshooting.adoc
+
+:_newdoc-version: 2.18.4
+:_template-generated: 2025-09-22
+:_mod-docs-content-type: CONCEPT
+
+[id="query-performance-analysis_{context}"]
+= Query performance analysis
+
+For the best query performance, as much of the total query time as possible should be spent in query execution, which the `Execution Time` metric denotes.
+The following table lists reasons why the other performance metrics might be high and the steps that you can take to improve them.
+You can also reduce the execution time itself by modifying your queries, which improves the overall performance.
+
+[options="header",cols="2,5,5"]
+|====
+|Issue
+|Reason
+|Fix
+
+.2+|High `Execution Time`
+|Queries might be doing many CPU-intensive operations, such as regular expression processing.
+
+a| You can make the following changes:
+
+* Change your queries to reduce or remove regular expressions.
+* Add more CPU resources.
+
+|Your queries have many small log lines.
+
+|If your queries return many small lines, execution speed depends on how fast Loki can iterate over the lines, which is bound by the CPU clock frequency. To make queries faster, use nodes with faster CPUs.
+
+|High `Queue Time`
+|You do not have enough queriers running.
+|Increase the number of querier replicas in the `LokiStack` spec.
+
+|High `Chunk Refs Fetch Time`
+|There are not enough index-gateway replicas in the `LokiStack` spec.
+|Increase the number of index-gateway replicas, or ensure that they have enough CPU resources.
+
+|High `Chunks Download Time`
+|The chunks might be too small.
+|Check the average chunk size by dividing the `total_bytes` value by the `cache_chunk_req` value. The result is the average number of uncompressed bytes per chunk. For the best performance, this value should be on the order of megabytes. If the chunks are only a few hundred bytes or a few kilobytes in size, revisit your labels to ensure that you are not splitting your data into very small chunks.
+
+|Query timing out
+|The query timeout value might be too low.
+|Increase the `queryTimeout` value in the `LokiStack` spec.
+|====
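+
+Several fixes in the preceding table refer to fields of the `LokiStack` custom resource. The following example is a minimal, hypothetical sketch of where those fields are set; the replica counts and the timeout value are placeholders, not recommendations:
+
+[source,yaml]
+----
+apiVersion: loki.grafana.com/v1
+kind: LokiStack
+metadata:
+  name: logging-loki
+  namespace: openshift-logging
+spec:
+  template:
+    querier:
+      replicas: 2 # increase to address a high Queue Time
+    indexGateway:
+      replicas: 2 # increase to address a high Chunk Refs Fetch Time
+  limits:
+    global:
+      queries:
+        queryTimeout: 5m # increase if queries time out
+----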