title | description | lead | date | lastmod | draft | images | menu | weight | toc |
---|---|---|---|---|---|---|---|---|---|
Standard Operational Procedures | The LokiStack Alerts and Standard Operational Procedures | | 2022-06-21 08:48:45 +0000 | 2022-06-21 08:48:45 +0000 | false | | | 100 | true |
The following page describes Standard Operational Procedures for alerts provided and managed by the Loki Operator for any LokiStack instance.
A service(s) is unable to perform its duties for a number of requests, resulting in potential loss of data.
A service(s) is failing to process at least 10% of all incoming requests.
Critical
- Console access to the cluster
- Edit access to the deployed operator and Loki namespace:
  - OpenShift
    - `openshift-logging` (LokiStack)
    - `openshift-operators-redhat` (Loki Operator)
- Check the logs of the service that is emitting the server error (5XX)
- Ensure that store services (`ingester`, `querier`, `index-gateway`, `compactor`) can communicate with backend storage
- Examine metrics for signs of failure
  - WAL Complications
    - `loki_ingester_wal_disk_full_failures_total`
    - `loki_ingester_wal_corruptions_total`
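The 10% threshold in the summary above can be pictured as a simple ratio of failed requests to all requests. The following is a minimal sketch of that arithmetic only, not the alert rule shipped by the operator; the function names and the sample counts are hypothetical.

```python
def error_ratio(failed: int, total: int) -> float:
    """Return the fraction of requests that failed (0.0 when there is no traffic)."""
    if total <= 0:
        return 0.0
    return failed / total


def breaches_threshold(failed: int, total: int, threshold: float = 0.10) -> bool:
    """Return True when at least `threshold` of all requests failed (10% by default)."""
    return error_ratio(failed, total) >= threshold


if __name__ == "__main__":
    # Hypothetical counts of 5XX responses versus all responses in some window.
    print(breaches_threshold(failed=12, total=100))  # 12% of requests failed
    print(breaches_threshold(failed=3, total=100))   # 3% of requests failed
```

Treating an empty window as a ratio of zero avoids a division by zero when a service receives no traffic.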
The LokiStack Gateway component is unable to perform its duties for a number of write requests, resulting in potential loss of data.
The LokiStack Gateway is failing to process at least 10% of all incoming write requests.
Critical
- Console access to the cluster
- Edit access to the deployed operator and Loki namespace:
  - OpenShift
    - `openshift-logging` (LokiStack)
    - `openshift-operators-redhat` (Loki Operator)
- Ensure that the LokiStack Gateway component is ready and available
- Ensure that the `distributor`, `ingester`, and `index-gateway` components are ready and available
- Ensure that store services (`ingester`, `querier`, `index-gateway`, `compactor`) can communicate with backend storage
- Examine metrics for signs of failure
  - WAL Complications
    - `loki_ingester_wal_disk_full_failures_total`
    - `loki_ingester_wal_corruptions_total`
The LokiStack Gateway component is unable to perform its duties for a number of query requests, resulting in a potential disruption.
The LokiStack Gateway is failing to process at least 10% of all incoming query requests.
Critical
- Console access to the cluster
- Edit access to the deployed operator and Loki namespace:
  - OpenShift
    - `openshift-logging` (LokiStack)
    - `openshift-operators-redhat` (Loki Operator)
- Ensure that the LokiStack Gateway component is ready and available
- Ensure that the `query-frontend`, `querier`, `ingester`, and `index-gateway` components are ready and available
- Ensure that store services (`ingester`, `querier`, `index-gateway`, `compactor`) can communicate with backend storage
- Examine metrics for signs of failure
  - WAL Complications
    - `loki_ingester_wal_disk_full_failures_total`
    - `loki_ingester_wal_corruptions_total`
A service(s) is unavailable, resulting in potential loss of data.
A service(s) has crashed.
Critical
- Console access to the cluster
- Edit access to the deployed operator and Loki namespace:
  - OpenShift
    - `openshift-logging` (LokiStack)
    - `openshift-operators-redhat` (Loki Operator)
- Check the logs of the service that is panicking
- Examine metrics for signs of failure
A service(s) is affected by slow request responses.
A service(s) is slower than expected at processing data.
Critical
- Console access to the cluster
- Edit access to the deployed operator and Loki namespace:
  - OpenShift
    - `openshift-logging` (LokiStack)
    - `openshift-operators-redhat` (Loki Operator)
- Check the logs of all the services
- Check to ensure that the Loki components can reach the storage
  - Particularly for queriers, examine metrics for a small query queue: `cortex_query_scheduler_inflight_requests`
A tenant is being rate limited, resulting in potential loss of data.
A service(s) is rate limiting at least 10% of all incoming requests.
Warning
- Console access to the cluster
- Edit access to the deployed operator and Loki namespace:
  - OpenShift
    - `openshift-logging` (LokiStack)
    - `openshift-operators-redhat` (Loki Operator)
- Examine the metrics for the reason and tenant that is being limited: `loki_discarded_samples_total{namespace="<namespace>"}`
- Increase the limits allocated to the tenant in the LokiStack CRD
  - For ingestion limits, please consult the table below
  - For query limits, the `MaxEntriesLimitPerQuery`, `MaxChunksPerQuery`, or `MaxQuerySeries` can be changed to raise the limit
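To see which tenant is limited and for what reason, the discarded-samples metric from the first step can be grouped per tenant and reason. The helper below is a minimal sketch that only builds such a query string; it assumes the metric carries `tenant` and `reason` labels, and the function name and the `5m` window are hypothetical choices.

```python
def discarded_samples_query(namespace: str, window: str = "5m") -> str:
    """Build a PromQL expression grouping discarded samples by tenant and reason.

    Assumes the metric exposes `tenant` and `reason` labels.
    """
    selector = f'loki_discarded_samples_total{{namespace="{namespace}"}}'
    return f"sum by (tenant, reason) (rate({selector}[{window}]))"


if __name__ == "__main__":
    # The namespace is one of those named in the access requirements above.
    print(discarded_samples_query("openshift-logging"))
```

The resulting string can be pasted into the console's metrics view; the `reason` value in each returned series is what the ingestion limit table is keyed on.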
Reason | Corresponding Ingestion Limit Keys |
---|---|
`rate_limited` | `ingestionRate`, `ingestionBurstSize` |
`stream_limit` | `maxGlobalStreamsPerTenant` |
`label_name_too_long` | `maxLabelNameLength` |
`label_value_too_long` | `maxLabelValueLength` |
`line_too_long` | `maxLineSize` |
`max_label_names_per_series` | `maxLabelNamesPerSeries` |
`per_stream_rate_limit` | `perStreamRateLimit`, `perStreamRateLimitBurst` |
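The table above can also be kept as a small lookup for use in tooling. This is a minimal sketch that transcribes the table as-is; the constant and function names are hypothetical.

```python
from typing import Dict, List

# Discard reason -> ingestion limit keys, transcribed from the table above.
INGESTION_LIMIT_KEYS: Dict[str, List[str]] = {
    "rate_limited": ["ingestionRate", "ingestionBurstSize"],
    "stream_limit": ["maxGlobalStreamsPerTenant"],
    "label_name_too_long": ["maxLabelNameLength"],
    "label_value_too_long": ["maxLabelValueLength"],
    "line_too_long": ["maxLineSize"],
    "max_label_names_per_series": ["maxLabelNamesPerSeries"],
    "per_stream_rate_limit": ["perStreamRateLimit", "perStreamRateLimitBurst"],
}


def limit_keys_for(reason: str) -> List[str]:
    """Return the ingestion limit keys to review for a discard reason.

    Unknown reasons yield an empty list rather than raising.
    """
    return list(INGESTION_LIMIT_KEYS.get(reason, []))


if __name__ == "__main__":
    print(limit_keys_for("rate_limited"))
```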
The cluster is unable to push logs to backend storage in a timely manner.
The cluster is unable to push logs to backend storage in a timely manner.
Warning
- Console access to the cluster
- Edit access to the deployed operator and Loki namespace:
  - OpenShift
    - `openshift-logging` (LokiStack)
    - `openshift-operators-redhat` (Loki Operator)
- Ensure that the cluster can communicate with the backend storage
The cluster is unable to retrieve logs from backend storage in a timely manner.
The cluster is unable to retrieve logs from backend storage in a timely manner.
Warning
- Console access to the cluster
- Edit access to the deployed operator and Loki namespace:
  - OpenShift
    - `openshift-logging` (LokiStack)
    - `openshift-operators-redhat` (Loki Operator)
- Ensure that the cluster can communicate with the backend storage
The write path is under high pressure and requires a storage flush.
The write path is flushing the storage in response to back pressure.
Warning
- Console access to the cluster
- Edit access to the deployed operator and Loki namespace:
  - OpenShift
    - `openshift-logging` (LokiStack)
    - `openshift-operators-redhat` (Loki Operator)
- Adjust the ingestion limits for the affected tenant or increase the number of ingesters
The read path is under high load.
The query queue is currently under high load.
Warning
- Console access to the cluster
- Edit access to the deployed operator and Loki namespace:
  - OpenShift
    - `openshift-logging` (LokiStack)
    - `openshift-operators-redhat` (Loki Operator)
- Increase the number of queriers
The LokiStack warns that a newer object storage schema is available for configuration.
The schema configuration does not contain the most recent schema version and needs an update.
Warning
- Console access to the cluster
- Edit access to the namespace where the LokiStack is deployed:
  - OpenShift
    - `openshift-logging` (LokiStack)
- Add a new object storage schema V13 with a future EffectiveDate
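The step above needs a schema entry whose effective date lies in the future. The following is a minimal sketch of how such an entry could be computed; it assumes the entry is expressed with `version` and `effectiveDate` fields and a `YYYY-MM-DD` date, and the function name and the seven-day offset are hypothetical.

```python
from datetime import date, timedelta


def future_schema_entry(today: date, days_ahead: int = 7, version: str = "v13") -> dict:
    """Return a schema entry that takes effect `days_ahead` days after `today`.

    The field names `version` and `effectiveDate` are assumptions about how
    the entry is written in the LokiStack resource.
    """
    if days_ahead < 1:
        raise ValueError("the effective date must be in the future")
    effective = today + timedelta(days=days_ahead)
    return {"version": version, "effectiveDate": effective.isoformat()}


if __name__ == "__main__":
    print(future_schema_entry(date.today()))
```

Rejecting a non-positive offset keeps the sketch in line with the requirement that the new schema starts at a future date rather than applying retroactively.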