From 519d7a2cf9272a8b15f96bad5aa3677cd12b8844 Mon Sep 17 00:00:00 2001
From: Israel Blancas
Date: Mon, 30 Sep 2024 16:48:36 +0200
Subject: [PATCH 1/2] TRACING-4615: improve OpenTelemetry connectors
 documentation

Signed-off-by: Israel Blancas
---
 .../otel-collector-connectors.adoc | 321 +++++++++++++++---
 1 file changed, 278 insertions(+), 43 deletions(-)

diff --git a/observability/otel/otel-collector/otel-collector-connectors.adoc b/observability/otel/otel-collector/otel-collector-connectors.adoc
index eeb52eb26876..37c68b388349 100644
--- a/observability/otel/otel-collector/otel-collector-connectors.adoc
+++ b/observability/otel/otel-collector/otel-collector-connectors.adoc
@@ -8,47 +8,6 @@ toc::[]
 
 A connector connects two pipelines. It consumes data as an exporter at the end of one pipeline and emits data as a receiver at the start of another pipeline. It can consume and emit data of the same or different data type. It can generate and emit data to summarize the consumed data, or it can merely replicate or route data.
 
-[id="routing-connector_{context}"]
-== Routing Connector
-
-The Routing Connector routes logs, metrics, and traces to specified pipelines according to resource attributes and their routing conditions, which are written as OpenTelemetry Transformation Language (OTTL) statements.
-
-:FeatureName: The Routing Connector
-include::snippets/technology-preview.adoc[]
-
-.OpenTelemetry Collector custom resource with an enabled Routing Connector
-[source,yaml]
-----
-  config: |
-    connectors:
-      routing:
-        table: # <1>
-        - statement: route() where attributes["X-Tenant"] == "dev" # <2>
-          pipelines: [traces/dev] # <3>
-        - statement: route() where attributes["X-Tenant"] == "prod"
-          pipelines: [traces/prod]
-        default_pipelines: [traces/dev] # <4>
-        error_mode: ignore # <5>
-        match_once: false # <6>
-    service:
-      pipelines:
-        traces/in:
-          receivers: [otlp]
-          exporters: [routing]
-        traces/dev:
-          receivers: [routing]
-          exporters: [otlp/dev]
-        traces/prod:
-          receivers: [routing]
-          exporters: [otlp/prod]
-----
-<1> Connector routing table.
-<2> Routing conditions written as OTTL statements.
-<3> Destination pipelines for routing the matching telemetry data.
-<4> Destination pipelines for routing the telemetry data for which no routing condition is satisfied.
-<5> Error-handling mode: The `propagate` value is for logging an error and dropping the payload. The `ignore` value is for ignoring the condition and attempting to match with the next one. The `silent` value is the same as `ignore` but without logging the error. The default is `propagate`.
-<6> When set to `true`, the payload is routed only to the first pipeline whose routing condition is met. The default is `false`.
-
 [id="forward-connector_{context}"]
 == Forward Connector
 
@@ -94,6 +53,126 @@ service:
 # ...
 ----
 
+[id="routing-connector_{context}"]
+== Routing Connector
+
+The Routing Connector routes logs, metrics, and traces to specified pipelines according to resource attributes and their routing conditions, which are written as OpenTelemetry Transformation Language (OTTL) statements.
+
+:FeatureName: The Routing Connector
+include::snippets/technology-preview.adoc[]
+
+.OpenTelemetry Collector custom resource with an enabled Routing Connector
+[source,yaml]
+----
+  config: |
+    connectors:
+      routing:
+        table:
+        - statement: route() where attributes["X-Tenant"] == "dev" # <1>
+          pipelines: [traces/dev] # <2>
+        - statement: route() where attributes["X-Tenant"] == "prod"
+          pipelines: [traces/prod]
+        default_pipelines: [traces/dev] # <3>
+        error_mode: ignore
+        match_once: false
+    service:
+      pipelines:
+        traces/in:
+          receivers: [otlp]
+          exporters: [routing]
+        traces/dev:
+          receivers: [routing]
+          exporters: [otlp/dev]
+        traces/prod:
+          receivers: [routing]
+          exporters: [otlp/prod]
+----
+<1> Routing conditions written as OTTL statements.
+<2> Destination pipelines for routing the matching telemetry data. In this case, the matching data goes to the `traces/dev` pipeline when the `X-Tenant` attribute equals `dev`.
+<3> In this example, if the `X-Tenant` attribute is neither `dev` nor `prod`, the telemetry data is forwarded to the `traces/dev` pipeline because no routing condition is satisfied.
+
+The connector routes exclusively based on OTTL statements, which are limited to resource attributes. Currently, it does not support matching against context values.
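+
+For example, the following routing table matches on other resource attributes; the attribute values and the destination pipeline names are illustrative and must exist in your own configuration:
+
+[source,yaml]
+----
+    connectors:
+      routing:
+        table:
+        - statement: route() where attributes["deployment.environment"] == "staging"
+          pipelines: [traces/staging]
+        - statement: route() where IsMatch(attributes["service.name"], "payment.*")
+          pipelines: [traces/payments]
+----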
+
+.Parameters used by the Routing Connector
+[options="header"]
+[cols="a,a,a"]
+|===
+|Parameter |Description |Default
+
+|`table`
+|Connector routing table containing routing conditions and destination pipelines.
+|`[]`
+
+|`table.statement`
+|Routing conditions written as OTTL (OpenTelemetry Transformation Language) statements.
+|N/A
+
+|`table.pipelines`
+|Destination pipelines for routing the matching telemetry data.
+|N/A
+
+|`default_pipelines`
+|Destination pipelines for routing the telemetry data for which no routing condition is satisfied.
+|`[]`
+
+|`error_mode`
+|Error-handling mode: `propagate` logs an error and drops the payload, `ignore` ignores the condition and attempts to match with the next one, and `silent` is the same as `ignore` but without logging the error.
+|`propagate`
+
+|`match_once`
+|When set to `true`, the payload is routed only to the first pipeline whose routing condition is met.
+|`false`
+|===
+
+=== Troubleshooting the Routing Connector
+
+If you encounter issues with the Routing Connector, use the following troubleshooting steps to help diagnose and resolve common problems.
+
+==== Pipeline not receiving telemetry data
+
+If telemetry data is not being routed to the expected pipeline, check the following:
+
+.Procedure
+
+- Verify the OTTL statement syntax in the routing table: ensure that the `attributes` used in your `statement` match the telemetry resource attributes exactly, including correct attribute names and values. Typos in the OTTL expressions can prevent routing.
++
+- Ensure that the destination pipeline exists in the `service.pipelines` configuration: for example, ensure that the `traces/dev` and `traces/prod` pipelines are defined correctly in the OpenTelemetry Collector custom resource.
+
+==== Incorrect routing conditions
+
+If telemetry data is not being routed to the intended pipelines, the routing condition in the OTTL statement might be incorrect.
+
+.Procedure
+
+- Confirm that the attribute used in the condition exists in the telemetry resource: use the <> to inspect telemetry data and verify that the `X-Tenant` attribute or other used attributes are present and correctly populated.
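++
+A minimal sketch of such an inspection, assuming you temporarily add the `debug` exporter to the pipeline that feeds the connector so that the incoming attributes are written to the Collector logs:
++
+[source,yaml]
+----
+    exporters:
+      debug:
+        verbosity: detailed
+    service:
+      pipelines:
+        traces/in:
+          receivers: [otlp]
+          exporters: [routing, debug]
+----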
++
+- Test with a simplified condition: try routing based on simpler conditions, for example `route() where attributes["X-Tenant"] != ""`, to isolate potential issues with the logic in more complex expressions.
+
+==== Default pipeline is not applied
+
+If telemetry data that does not match any routing conditions is not being routed to the default pipeline, check the following:
+
+.Procedure
+
+- Verify the `default_pipelines` parameter: ensure that `default_pipelines` is correctly defined. For example, the `traces/dev` pipeline should be listed in `default_pipelines`.
++
+- Ensure that the attribute used in the routing conditions is correct: if the attribute does not exist, the Routing Connector skips routing and uses the default pipeline.
+
+==== Error handling behavior not as expected
+
+If errors are not handled as expected, check the `error_mode` parameter:
+
+.Procedure
+
+- `propagate`: Logs errors and drops the payload. Ensure that the logging level is set to capture these errors.
++
+- `ignore`: Ignores the error and attempts to match the next condition. Ensure this behavior matches your intended setup.
++
+- `silent`: Does not log errors. If you are not seeing errors in logs, confirm that `silent` is not set in your configuration.
+
+Use verbose logging to gain more insight into how routing decisions are made and where issues may occur in the telemetry flow.
+
+
 [id="spanmetrics-connector_{context}"]
 == Spanmetrics Connector
 
@@ -109,7 +188,31 @@ include::snippets/technology-preview.adoc[]
   config: |
     connectors:
       spanmetrics:
-        metrics_flush_interval: 15s # <1>
+        histogram:
+          explicit:
+            buckets: [100us, 1ms, 2ms, 6ms, 10ms, 100ms, 250ms]
+          unit: ms
+          disable: false
+        dimensions: # <1>
+        - name: http.method
+          default: GET
+        - name: http.status_code
+        exemplars:
+          enabled: true
+        exclude_dimensions: ['status.code']
+        dimensions_cache_size: 1000
+        aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
+        metrics_flush_interval: 15s
+        metrics_expiration: 5m
+        events:
+          enabled: true
+          dimensions:
+          - name: exception.type
+          - name: exception.message
+        resource_metrics_key_attributes:
+        - service.name
+        - telemetry.sdk.language
+        - telemetry.sdk.name
     service:
       pipelines:
         traces:
           exporters: [spanmetrics]
         metrics:
           receivers: [spanmetrics]
 # ...
 ----
-<1> Defines the flush interval of the generated metrics. Defaults to `15s`.
+<1> The list of additional dimensions to include alongside the default ones.
+
+The metrics are calculated as follows:
+
+* Request counts are calculated based on the number of spans observed for each unique combination of dimensions, including errors. Multiple metrics can be aggregated, allowing users to view call counts filtered by specific dimensions such as `service.name` and `span.name`.
+* Error counts are derived from request counts, with the error status code included as a metric dimension.
+* Duration is calculated as the difference between the span's start and end times and is placed into the appropriate duration histogram bucket for each unique set of dimensions.
+
+Each metric includes at least the following dimensions because they are common across all spans:
+
+* `service.name`
+* `span.name`
+* `span.kind`
+* `status.code`
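+
+For example, the following sketch forwards the generated metrics to a Prometheus exporter; the `prometheus` exporter and its endpoint are illustrative and must be defined in your own configuration:
+
+[source,yaml]
+----
+    exporters:
+      prometheus:
+        endpoint: "0.0.0.0:8889"
+    service:
+      pipelines:
+        traces:
+          receivers: [otlp]
+          exporters: [spanmetrics]
+        metrics:
+          receivers: [spanmetrics]
+          exporters: [prometheus]
+----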
+
+.Parameters used by the Spanmetrics Connector
+[options="header"]
+[cols="a,a,a"]
+|===
+|Parameter |Description |Default
+
+|`histogram`
+|Configures the type of histogram for recording span duration measurements. The options are `explicit` and `exponential`.
+|`explicit`
+
+|`histogram.explicit.buckets`
+|Duration histogram time buckets.
+|`[2ms, 4ms, 6ms, 8ms, 10ms, 50ms, 100ms, 200ms, 400ms, 800ms, 1s, 1400ms, 2s, 5s, 10s, 15s]`
+
+|`histogram.unit`
+|The time unit for recording duration measurements. Can be `ms` or `s`.
+|`ms`
+
+|`histogram.disable`
+|Disables all histogram metrics.
+|`false`
+
+|`dimensions`
+|The list of additional dimensions to include alongside the default ones.
+|`[]`
+
+|`exemplars.enabled`
+|Enables attaching exemplars to the generated metrics.
+|`false`
+
+|`exclude_dimensions`
+|The list of dimensions to exclude from the default set of dimensions. This is used to remove unnecessary data from the metrics.
+|`[]`
+
+|`dimensions_cache_size`
+|Specifies the cache size for storing dimensions to optimize the collector's memory usage. Must be a number bigger than 0.
+|`1000`
+
+|`aggregation_temporality`
+|Specifies the aggregation temporality for the generated metrics. It can be set to either `AGGREGATION_TEMPORALITY_CUMULATIVE` or `AGGREGATION_TEMPORALITY_DELTA`.
+|`AGGREGATION_TEMPORALITY_CUMULATIVE`
+
+|`metrics_flush_interval`
+|Flush interval of the generated metrics.
+|`60s`
+
+|`metrics_expiration`
+|Specifies the time after which metrics are no longer exported if no new spans are received. A value of `0` indicates that the metrics never expire.
+|`0`
+
+|`events.enabled`
+|Enables event metrics.
+|`false`
+
+|`events.dimensions`
+|Specifies the list of span event attributes to add as dimensions to the events metric, in addition to the common and configured dimensions for span and resource attributes.
+|`[]`
+
+|`resource_metrics_key_attributes`
+|Filters the resource attributes used to generate the resource metrics key map hash. This is useful when changing resource attributes, for example a process ID, might disrupt counter metrics.
+|`[]`
+|===
+
+
+=== Troubleshooting the Spanmetrics Connector
+
+If you encounter issues with the Spanmetrics Connector, use the following troubleshooting steps to help diagnose and resolve common problems.
+
+==== Missing or incomplete metrics
+
+If the expected metrics, such as the request count, error count, or duration, are missing or incomplete, consider the following:
+
+.Procedure
+
+- Check the span data: ensure that the incoming spans contain the required attributes, such as `service.name`, `span.kind`, and `span.name`, to aggregate the metrics correctly. Missing attributes may result in incomplete metrics.
++
+- Validate `metrics_flush_interval` and `metrics_expiration`: ensure that the `metrics_flush_interval` is correctly set and that `metrics_expiration` is sufficient. Metrics might not be exported if the expiration time is too short or spans are not consistently received.
+
+==== Incorrect histogram bucket configuration
+
+If the span duration is not being correctly categorized into histogram buckets, check the following:
+
+.Procedure
+
+- Ensure the correct bucket intervals are defined: review the `histogram.explicit.buckets` configuration. Adjust the bucket intervals to align with the span durations in your telemetry data. If spans fall outside the specified bucket ranges, they may not be aggregated as expected.
++
+- Check the `histogram.unit`: verify that the `histogram.unit`, either `ms` or `s`, matches the desired time unit for recording span durations. Incorrect units could result in an unexpected distribution of spans across histogram buckets.
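+
+For example, the following sketch records durations in seconds and widens the bucket boundaries for services whose spans routinely take longer than one second; the boundary values are illustrative and must match your own latency profile:
+
+[source,yaml]
+----
+    connectors:
+      spanmetrics:
+        histogram:
+          unit: s
+          explicit:
+            buckets: [100ms, 250ms, 500ms, 1s, 2s, 5s, 10s, 30s]
+----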
+
+==== Unexpected aggregation behavior
+
+If the metrics are aggregated in a way that does not meet your expectations, review the following:
+
+.Procedure
+
+- Verify `aggregation_temporality`: ensure that the `aggregation_temporality` is set to the appropriate value, either `AGGREGATION_TEMPORALITY_CUMULATIVE` or `AGGREGATION_TEMPORALITY_DELTA`, depending on your needs. Cumulative aggregation might result in metrics that increase over time, while delta aggregation resets after each export.
+
+==== Metrics not including the expected dimensions
+
+If certain dimensions, such as `http.method` or `http.status_code`, are missing from the metrics, check the following:
+
+.Procedure
+
+- Ensure dimensions are correctly specified: review the `dimensions` field in the configuration and ensure all necessary dimensions are listed. If a dimension is missing from this list, it is not included in the aggregated metrics.
++
+- Check `exclude_dimensions`: ensure that the dimensions you expect are not listed in the `exclude_dimensions` field. Excluded dimensions do not appear in the final metrics.
+
+==== Events not showing in the metrics
+
+If event-based metrics, such as exception types or messages, do not appear, check the following:
+
+.Procedure
+
+- Verify `events.enabled`: ensure that `events.enabled` is set to `true` in the configuration. If this parameter is `false`, event metrics are not generated.
++
+- Review event dimensions: make sure that the event attributes you want to track are listed under `events.dimensions`. Only the specified event attributes are included in the event metrics.
++
+- Check if spans include events: ensure that the incoming spans contain the events you are trying to aggregate, such as `exception.type` and `exception.message`.
+
 [role="_additional-resources"]
 [id="additional-resources_otel-collector-connectors_{context}"]
 == Additional resources
 * link:https://opentelemetry.io/docs/specs/otlp/[OpenTelemetry Protocol (OTLP) documentation]
+* link:https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/ottl/README.md[OpenTelemetry Transformation Language (OTTL) documentation]

From 1b8937a59b4a6edff72246002a0be2cdcb8f6ba6 Mon Sep 17 00:00:00 2001
From: Israel Blancas
Date: Tue, 1 Oct 2024 17:24:41 +0200
Subject: [PATCH 2/2] Improve wording

Signed-off-by: Israel Blancas
---
 .../otel-collector/otel-collector-connectors.adoc | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/observability/otel/otel-collector/otel-collector-connectors.adoc b/observability/otel/otel-collector/otel-collector-connectors.adoc
index 37c68b388349..51a54b5e44d4 100644
--- a/observability/otel/otel-collector/otel-collector-connectors.adoc
+++ b/observability/otel/otel-collector/otel-collector-connectors.adoc
@@ -6,12 +6,12 @@ include::_attributes/common-attributes.adoc[]
 
 toc::[]
 
-A connector connects two pipelines. It consumes data as an exporter at the end of one pipeline and emits data as a receiver at the start of another pipeline. It can consume and emit data of the same or different data type. It can generate and emit data to summarize the consumed data, or it can merely replicate or route data.
+A connector connects two pipelines. It consumes data as an exporter at the end of one pipeline and emits data as a receiver at the start of another pipeline. It can consume and emit data of the same or different types. It can generate and emit data to summarize the consumed data, or it can replicate or route data.
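+
+For example, a connector is listed in the `exporters` list of one pipeline and in the `receivers` list of another. The following minimal sketch wires two trace pipelines together with the Forward Connector, which the next section describes; the pipeline names are illustrative:
+
+[source,yaml]
+----
+  config: |
+    connectors:
+      forward: {}
+    service:
+      pipelines:
+        traces/app:
+          receivers: [otlp]
+          exporters: [forward]
+        traces/out:
+          receivers: [forward]
+          exporters: [otlp]
+----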
[id="forward-connector_{context}"] == Forward Connector -The Forward Connector merges two pipelines of the same type. +The Forward Connector merges two pipelines that process the same data type. :FeatureName: The Forward Connector include::snippets/technology-preview.adoc[] @@ -91,7 +91,10 @@ include::snippets/technology-preview.adoc[] <2> Destination pipelines for routing the matching telemetry data. In this case, the `traces/dev` pipeline when the `X-Tenant` attribute is equal to `dev`. <3> In this example, if `X-Tenant` attribute is not `dev` or `prod`, the telemetry data will be forwarded to those pipelines. +[NOTE] +==== The connector routes exclusively based on OTTL statements, which are limited to resource attributes. Currently, it does not support matching against context values. +==== .Parameters used by the Routing Connector [options="header"] @@ -104,7 +107,7 @@ The connector routes exclusively based on OTTL statements, which are limited to |`[]` |`table.statement` -|Routing conditions written as OTTL (OpenTelemetry Transformation Language) statements. +|Routing conditions written as OTTL statements. |N/A |`table.pipelines` @@ -270,7 +273,7 @@ Each metric will include at least the following dimensions, as they are common a |`[]` |`dimensions_cache_size` -|Specifies the cache size for storing dimensions to optimize the collector's memory usage. Must be a number bigger than 0. +|Defines the cache size for storing dimensions, optimizing the memory usage of the collector. Must be a number bigger than 0. |`1000` |`aggregation_temporality`