Binary file added docs/images/actions-pipeline-history.png
Binary file added docs/images/alert-history.png
Binary file added docs/images/debug-alert-history.png
Binary file added docs/images/pipeline-history.png
1 change: 1 addition & 0 deletions docs/user-guide/alerts/.pages
@@ -2,6 +2,7 @@ nav:
- Alerts Overview: index.md
- Alerts in OpenObserve: alerts.md
- Alert Folders: alert-folders.md
- Alert History: alert-history.md
- Alert Conditions: alert-conditions.md
- Import and Export Alerts: import-export-alerts.md
- Multi-window Selector in Scheduled Alerts (SQL Mode): multi-window-selector-scheduled-alerts-concept.md
66 changes: 66 additions & 0 deletions docs/user-guide/alerts/alert-history.md
@@ -0,0 +1,66 @@

This guide describes how the Alert History feature in OpenObserve works, where its data originates, who can access it, how to interpret the Alert History table, and how to debug failed or skipped alerts.

## Overview
All alert trigger data is stored in the `triggers` stream inside the `_meta` organization. Access to `_meta` is restricted and managed through IAM, which means only users with elevated privileges can view it directly.

The Alert History page brings this information into the user’s organization. It provides visibility into alert evaluations, including when each alert ran, its evaluation duration, and its final status. This design allows alert owners to monitor alert performance and troubleshoot issues without requiring access to the `_meta` organization.

!!! note "Who can access it"
Any user who has permission to view, update, or delete alerts can also access Alert History. This ensures that alert managers and operators can analyze their alerts’ execution history without depending on users with higher administrative access.


## How to interpret the Alert History table
![alert-history](../../images/alert-history.png)

Each row represents one alert evaluation.

- **Alert Name**: The name of the configured alert.
- **Type**: Indicates whether the alert is Scheduled or Real-time.
- **Is Silenced**: Shows whether the alert was silenced during evaluation.
- **Timestamp**: Time when the alert manager picked up the alert for evaluation.
- **Start Time** and **End Time**: The time range of data evaluated.
- **Duration**: How long the alert condition remained true.
- **Status**: The result of the alert evaluation.
- **Retries**: Number of times the system retried alert delivery when the destination did not acknowledge it. By default, the system retries up to three times. <br>
**Note**: The environment variable `ZO_SCHEDULER_MAX_RETRIES` defines how many times the scheduler retries a failed execution.
- **Actions**: Opens a detailed view that includes:

- **Evaluation Time**: The time taken to complete the alert’s search query.
- **Silenced**: Indicates whether the alert was silenced.
- **Source Node**: The node that processed the alert. Useful for debugging distributed environments.

- **Status codes**:

- **completed**: The alert condition was met, and the notification was sent to its destination.
- **failed**: The alert evaluation or delivery failed. The trigger record in `_meta` includes the error field with details.
- **condition_not_met**: The configured alert condition was not satisfied for that time range.
- **skipped**: The scheduled evaluation window was missed due to a delay, and the system evaluated the next aligned window.
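The retry behavior behind the **Retries** column can be sketched as follows. This is an illustrative Python model only, not OpenObserve's actual (Rust) implementation; `deliver` is a hypothetical stand-in for the call to the alert destination:

```python
import os

# ZO_SCHEDULER_MAX_RETRIES caps how often the scheduler re-attempts
# a failed execution; 3 mirrors the default described above.
MAX_RETRIES = int(os.environ.get("ZO_SCHEDULER_MAX_RETRIES", "3"))

def send_with_retries(deliver, payload, max_retries=MAX_RETRIES):
    """Call `deliver(payload)` until it succeeds or retries are exhausted.

    Returns (status, retries_used), mirroring the Status and Retries
    columns in the Alert History table.
    """
    for attempt in range(max_retries + 1):
        try:
            deliver(payload)
            return "completed", attempt
        except Exception:
            continue  # destination did not acknowledge; retry
    return "failed", max_retries
```

For example, a destination that fails twice and then acknowledges would appear in the table as **completed** with **Retries** = 2.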

## How to debug a failed alert
This process applies only to users who have access to the `_meta` organization.
![debug-alert-history](../../images/debug-alert-history.png)

1. From the organization selector, switch to `_meta`.
2. Go to **Logs**.
3. From the stream dropdown, select `triggers`.
4. Set the required time range and select Run query.
5. Locate the error log for the failed alert.
6. Identify the `scheduler_trace_id` field. <br>
**Example:** <br>
`scheduler_trace_id: 358zJCLiWVdApBSBXM50YYnIwgA/019a5d91932174c3bab84fff2785f73f` <br>
The trace ID is the part after the slash: `019a5d91932174c3bab84fff2785f73f`
7. Copy this trace ID.
8. Switch back to your organization.
9. Go to **Logs**.
10. In the SQL editor, search using the trace ID.
For example,
`match_all('019a5d91932174c3bab84fff2785f73f')`
This displays all logs related to that alert evaluation.
11. Review the logs to identify the failure cause, such as query issues, destination errors, timeouts, or node resource problems.
12. Use the **Source Node** field from the **Alert History** details to locate the node that processed the alert and check its introspection logs if needed.
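The trace-ID handling in steps 6 through 10 can be scripted. This minimal Python sketch covers only the string handling; the field format is taken from the example above:

```python
def extract_trace_id(scheduler_trace_id: str) -> str:
    """Return the trace ID, i.e. the part after the slash in a
    scheduler_trace_id value from the triggers stream."""
    return scheduler_trace_id.rsplit("/", 1)[-1]

def match_all_query(trace_id: str) -> str:
    """Build the match_all() search expression for the SQL editor."""
    return f"match_all('{trace_id}')"

raw = "358zJCLiWVdApBSBXM50YYnIwgA/019a5d91932174c3bab84fff2785f73f"
query = match_all_query(extract_trace_id(raw))
# query == "match_all('019a5d91932174c3bab84fff2785f73f')"
```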

## Why you might see a skipped status
A **skipped** status appears when a scheduled alert runs later than its expected window. <br>
For example, an alert configured with a 5-minute period and 5-minute frequency is scheduled to run at 12:00 PM. It should normally evaluate data from 11:55 to 12:00.
If the alert manager experiences a delay and runs the job at 12:05 PM, it evaluates the current aligned window (12:00 to 12:05) instead of the earlier one. The earlier window (11:55 to 12:00) is marked as skipped to indicate that evaluation for that range did not occur because of a delay in job pickup or data availability.
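The window alignment described above can be computed as follows. This is a minimal Python sketch assuming wall-clock times and a frequency that divides an hour evenly; it is not OpenObserve's implementation:

```python
from datetime import datetime, timedelta

def aligned_window(now: datetime, frequency_min: int = 5):
    """Return (start, end) of the current aligned evaluation window:
    `end` is `now` floored to the nearest frequency boundary."""
    floored = now.replace(second=0, microsecond=0)
    floored -= timedelta(minutes=floored.minute % frequency_min)
    return floored - timedelta(minutes=frequency_min), floored

# Picked up on time at 12:00 -> evaluates 11:55 to 12:00.
# Picked up late at 12:05   -> evaluates 12:00 to 12:05;
# the 11:55 to 12:00 window is recorded as skipped.
```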
@@ -142,6 +142,13 @@ While stream schemas are synchronized across all clusters in real-time, the actu

This design maintains data residency compliance while enabling unified configuration management.

## How nodes coordinate internally using NATS
OpenObserve uses NATS for internal coordination between nodes within a region. This coordination lets nodes share information for purposes such as caching and maintaining cluster awareness.

To improve the reliability of inter-node communication, OpenObserve now broadcasts NATS events through NATS stream queues instead of NATS key-value watchers. The stream queue ensures reliable delivery by retrying event transmission until all subscribers receive the event for processing.

Apart from the nodes list, nothing is now stored in NATS key-value storage.
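The practical difference is that a key-value watcher that misses an update gets no redelivery, whereas a stream queue redelivers until every subscriber acknowledges. A toy Python sketch of acknowledgement-based redelivery (illustrative only; OpenObserve's actual implementation uses NATS JetStream from Rust):

```python
def broadcast(event, subscribers, max_rounds=10):
    """Redeliver `event` until every subscriber returns True (an ack),
    mimicking a stream queue's at-least-once delivery semantics."""
    pending = set(subscribers)
    for _ in range(max_rounds):
        for sub in list(pending):
            if sub(event):            # True means acknowledged
                pending.discard(sub)  # stop redelivering to this one
        if not pending:
            return True               # all subscribers processed the event
    return False                      # gave up after max_rounds
```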

## Limitations

**No cluster identification in results:** Query results do not indicate which cluster provided specific data. To identify the source, query each cluster individually.
37 changes: 37 additions & 0 deletions docs/user-guide/pipelines/pipeline-history.md
@@ -0,0 +1,37 @@
This guide provides information about how the Pipeline History feature in OpenObserve works, where the data originates from, who can access it, and how to interpret the pipeline execution records.


## Overview
Pipeline History provides visibility into every pipeline run, including its execution time, status, and duration. Each record represents one instance of a scheduled or manually triggered pipeline execution.

!!! note "Who can access"

Any user who has permission to view, update, or delete pipelines can also view pipeline history. This ensures that users responsible for managing or maintaining pipelines can monitor their performance and investigate failures without requiring elevated access.


## How to interpret the Pipeline History table
![pipeline-history](../../images/pipeline-history.png)
The table lists each pipeline run with key execution details.

- **Pipeline Name**: Name of the executed pipeline.
- **Type**: Execution type. The current verified value is **Scheduled**.
- **Is Silenced**: Indicates whether the pipeline was silenced during execution. The green speaker icon means it was not silenced.
- **Timestamp**: Time when the pipeline run was recorded.
- **Start Time**: Time when the execution started.
- **End Time**: Time when the execution finished.
- **Duration**: Total execution time for the pipeline run.
- **Status**: Final outcome of the execution. One of the following values:

- **completed** – The pipeline ran successfully and completed execution.
- **failed** – The pipeline encountered an error and did not complete.
- **condition_not_satisfied** – The pipeline’s scheduled condition was evaluated but not met, so no data processing occurred.

- **Retries**: Number of times the scheduler retried the run after a failure. The maximum number of retries is defined by the environment variable `ZO_SCHEDULER_MAX_RETRIES`, which controls retry behavior for both pipelines and alerts.
- **Actions**: When you select the Actions icon, the Pipeline Execution Details dialog opens and displays additional metadata for that specific execution:

![actions-pipeline-history](../../images/actions-pipeline-history.png)

- **Query Time**: Time taken by the SQL query within the pipeline to execute. This helps measure query performance.
- **Source Node**: Node responsible for executing the run. This helps identify where the execution occurred for debugging and performance monitoring.<br>
Example: <br>
**Source Node**: `o2-openobserve-alertmanager-0`
3 changes: 3 additions & 0 deletions docs/user-guide/streams/schema-settings.md
@@ -33,6 +33,9 @@ To learn more, visit the [Fields and Index in Streams](streams/fields-and-index-
## User-Defined Schema (UDS)

By default, OpenObserve stores all fields it detects. However, in high-ingestion environments, especially those with thousands of unique fields per log, such as large-scale services, this can degrade performance.

> UDS applies to logs, metrics, and traces.

User-Defined Schema (UDS) allows you to select a subset of fields that are:

- Retained for storage