This is a companion proposal to the Error Code Foundation and a second entry in the series that began with #6861 — Dependency Version Pinning for Reproducible Builds. It describes how Autoware's three telemetry channels — logs, diagnostics, and traces — can be unified under a common identifier using OpenTelemetry as the consumption stack.
Motivation
Autoware currently produces three parallel telemetry streams that cannot be correlated:
/rosout — free-text strings emitted by RCLCPP_* macros. Fault detail is buried in human-readable sentences; no structured key is shared with other channels.
/diagnostics — DiagnosticStatus records with name/message/hardware_id and an open-ended values[] key-value list. Rich fault classification exists (via DiagnosticStatus.level and diagnostic_aggregator / DiagGraph / HazardStatus), but it is not linked to any log entry that describes the same fault.
Traces — not standardized at all in Autoware today. Node latency, callback chains, and service durations are not exported to any trace backend.
The result is that diagnosing a fault requires manually cross-referencing three streams with no common key. MRM activations, planning failures, and hardware anomalies each produce evidence in multiple channels that today's tooling cannot join automatically.
This proposal adopts OpenTelemetry as the single observability backbone that ingests all three signals, joined by a common trace_id. The emission surface (the AUTOWARE_LOG_*_CODE macros, the autoware::error::set diagnostic helper, and the trace_id/span_id slots) is defined by the companion Error Code Foundation proposal. This document is concerned solely with how that surface is consumed — and OpenTelemetry adoption here is a proposal decided on its own merits, independent of whether the Error Code Foundation is adopted.
Goals
G1: Single trace join. All three telemetry channels — /rosout log records, /diagnostics status records, and OTLP spans — carry the same trace_id, enabling backend correlation without bespoke tooling.
G2: No custom Collector code for the logging path. Structured keys in the /rosout suffix are named to align with OpenTelemetry semantic conventions, so a stock OpenTelemetry Collector (filelog/syslog receiver + key_value_parser operator) can extract them into OTLP attributes with configuration only — zero code.
G3: Non-disruptive to existing diagnostic infrastructure. diagnostic_aggregator, DiagGraph, and HazardStatus keep working as-is; observability keys are added to DiagnosticStatus.values[] alongside existing keys.
G4: Backend-agnostic. Jaeger, Tempo, Loki, Grafana, Foxglove — any OTLP-capable backend is interchangeable. The choice is a deployment concern, not a contract concern.
G5: Not error-specific. The same logfmt + OTel backbone carries INFO-level telemetry (state transitions, mission events, throughput metrics) with no error.* keys, so normal-operation and fault data land on one correlated timeline.
Non-Goals
Not a precondition for the Error Code Foundation. The error-code emission surface stands on its own; this document can be adopted, deferred, or replaced without modifying that contract.
Not bit-for-bit identical to the upstream OTel Collector ROS receiver. If a dedicated ROS 2 → OTLP bridge already exists or emerges upstream, this proposal's architecture is fully compatible with it; the logging path's stock-Collector approach is a starting point, not a ceiling.
Not defining the full SLO or alerting policy. What thresholds trigger alerts, and how they are routed, is outside scope.
Not replacing the ROS 2 rcl logging subsystem. The existing RCLCPP_* macros continue to work unchanged; the structured macros are a superset.
Proposed Design
1. Logging Path
Emission (defined by the Error Code Foundation): the AUTOWARE_LOG_*_CODE macros append a logfmt-encoded suffix — with a fixed | delimiter and escaped values — to the end of rcl_interfaces/Log.msg.msg on /rosout. Because values are escaped, a newline, =, or quote in the human-readable message cannot corrupt the record.
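To make the escaping claim concrete, here is a minimal sketch of how a logfmt suffix could be built. The helper names (escape_logfmt_value, append_suffix) and the exact quoting rules are assumptions for illustration; the actual rules are whatever the Error Code Foundation specifies.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not the Error Code Foundation API): quote a value when
// it contains characters that would break key=value tokenization.
std::string escape_logfmt_value(const std::string & value)
{
  bool needs_quotes = value.empty();
  for (char c : value) {
    if (c == ' ' || c == '=' || c == '"' || c == '\\' || c == '\n' || c == '\r' || c == '\t') {
      needs_quotes = true;
      break;
    }
  }
  if (!needs_quotes) {
    return value;
  }
  std::string out = "\"";
  for (char c : value) {
    switch (c) {
      case '"': out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      case '\n': out += "\\n"; break;
      case '\r': out += "\\r"; break;
      case '\t': out += "\\t"; break;
      default: out += c;
    }
  }
  out += '"';
  return out;
}

using Field = std::pair<std::string, std::string>;

// Hypothetical helper: append " | key=value key=value ..." to the message.
std::string append_suffix(const std::string & message, const std::vector<Field> & fields)
{
  std::string record = message + " |";
  for (const auto & [key, value] : fields) {
    record += ' ' + key + '=' + escape_logfmt_value(value);
  }
  return record;
}
```

With this scheme a value such as "solver timeout" is emitted as a quoted token, so a space or = inside a value cannot be mistaken for a field boundary.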
The logfmt keys are named to align with OpenTelemetry semantic conventions. A stock Collector needs no translation table:
| logfmt key | OTel attribute / OTLP field |
| --- | --- |
| error.code | error.code (16-bit code) |
| error.canonical | error.canonical |
| error.canonical_name | error.canonical_name |
| error.domain | error.domain |
| error.domain_name | error.domain_name |
| error.value | error.value |
| error.value_name | error.value_name |
| error.detail | error.detail |
| trace_id | OTLP LogRecord.trace_id (16 bytes) |
| span_id | OTLP LogRecord.span_id (8 bytes) |
The error.* keys are an Autoware extension in the semconv error.* namespace and do not collide with the standard code.* (source-code location) namespace.
Consumption: A stock otel/opentelemetry-collector-contrib image with a filelog receiver tailing the container log stream and a key_value_parser operator extracts the suffix into OTLP LogRecord.attributes. No custom Collector plugin and no ROS bridge node are required. This configuration-only approach is the design's central bet — validated by the proof of concept below.
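The extraction step can be pictured with a small stand-in written in C++. This is only a sketch of what a key-value parsing stage amounts to, not the Collector's implementation; parse_suffix is a hypothetical name, and the sketch assumes the suffix starts at the last " | " in the record.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for a key-value parsing stage: split the record at the
// last " | " delimiter and read space-separated key=value pairs.
// Simplifications: a quoted value containing " | " is not handled, and escape
// sequences are not decoded (the backslash is dropped, the next char is kept).
std::map<std::string, std::string> parse_suffix(const std::string & record)
{
  std::map<std::string, std::string> attrs;
  const std::string delim = " | ";
  const auto pos = record.rfind(delim);
  if (pos == std::string::npos) {
    return attrs;  // plain message with no structured suffix
  }
  const std::string s = record.substr(pos + delim.size());
  std::size_t i = 0;
  while (i < s.size()) {
    while (i < s.size() && s[i] == ' ') ++i;  // skip separators
    const auto eq = s.find('=', i);
    if (eq == std::string::npos) break;
    const std::string key = s.substr(i, eq - i);
    i = eq + 1;
    std::string value;
    if (i < s.size() && s[i] == '"') {
      ++i;  // skip opening quote
      while (i < s.size() && s[i] != '"') {
        if (s[i] == '\\' && i + 1 < s.size()) ++i;  // drop the backslash
        value += s[i++];
      }
      ++i;  // skip closing quote
    } else {
      while (i < s.size() && s[i] != ' ') value += s[i++];
    }
    attrs[key] = value;
  }
  return attrs;
}
```

Each resulting map entry corresponds to one OTLP LogRecord attribute; a record without the delimiter yields an empty map, which matches the claim that plain RCLCPP_* output keeps flowing unchanged.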
Representative call sites (planning and control):
// autoware_planning_validator: before
RCLCPP_ERROR(get_logger(), "Invalid Trajectory detected. Use previous trajectory.");
// after
AUTOWARE_LOG_ERROR_CODE(get_logger(), planning_validator::INVALID_TRAJECTORY,
  "Invalid Trajectory detected. Use previous trajectory.");
// autoware_mpc_lateral_controller: before
RCLCPP_ERROR(logger_, "MPC failed due to %s", mpc_solved_status.reason.c_str());
// after: finite reason set becomes enumerated values
AUTOWARE_LOG_ERROR_CODE(logger_, control_lateral::QP_SOLVER_ERROR,
  "MPC failed due to %s", mpc_solved_status.reason.c_str());
Resulting record — the cause is legible from error.value_name alone:
MPC failed due to qp solver error | error.code=8455 error.canonical=13 error.canonical_name=INTERNAL error.domain=33 error.domain_name=CONTROL_LATERAL error.value=7 error.value_name=QP_SOLVER_ERROR trace_id=4bf92f3577b34da6a3ce929d0e0e4736 span_id=00f067aa0ba902b7
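The numbers in the sample record above are consistent with a simple packing: 33 * 256 + 7 = 8455, i.e. the domain in the high byte and the value in the low byte of the 16-bit code. That layout is inferred from this one record only and is an assumption here; the Error Code Foundation proposal is the authority on the real encoding. A sketch under that assumption, with hypothetical function names:

```cpp
#include <cstdint>

// Assumed layout (inferred from the sample record only):
// high byte = error.domain, low byte = error.value.
constexpr std::uint16_t make_error_code(std::uint8_t domain, std::uint8_t value)
{
  return static_cast<std::uint16_t>((static_cast<std::uint16_t>(domain) << 8) | value);
}

constexpr std::uint8_t domain_of(std::uint16_t code)
{
  return static_cast<std::uint8_t>(code >> 8);
}

constexpr std::uint8_t value_of(std::uint16_t code)
{
  return static_cast<std::uint8_t>(code & 0xFF);
}

// CONTROL_LATERAL (33) + QP_SOLVER_ERROR (7) from the sample record.
static_assert(make_error_code(33, 7) == 8455, "sample record should round-trip");
```

If the layout holds, a backend can filter by domain with a range query on error.code alone, although the explicit error.domain key makes that unnecessary.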
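Non-error telemetry on the same path. The logfmt + OTel backbone is not error-specific. INFO-level events use the event.* namespace; measurements use metric.*. Four representative cases:
State-machine transition (autoware_default_adapi_universe, autoware_state.cpp).
MRM activation (autoware_mrm_handler, mrm_handler_core.cpp): a top safety KPI; any transition to MRM_OPERATING is alert-worthy.
Route lifecycle (autoware_mission_planner_universe, mission_planner.cpp).
Perception throughput (autoware_tensorrt_bevformer, bevformer_node.cpp): fires at inference rate (~10 Hz); pre-aggregate or sample before export.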
All four join the same trace timeline as any fault in scope, so normal operation and faults are one correlated story.
2. Diagnostic Path
Emission (defined by the Error Code Foundation): a single autoware::error::set(status, ec, fault_class) call is inserted into existing DiagnosticStatus-emitting code. It writes reserved KV keys into DiagnosticStatus.values[]:
| Key | Content |
| --- | --- |
| autoware.error.code | 16-bit error code |
| autoware.error.canonical | canonical error class integer |
| autoware.error.domain_name | domain name string |
| autoware.error.value_name | value name string |
| autoware.fault.class | NF / SF / LF / SPF |
| autoware.trace.id | hex trace_id (join key for log and span) |
Existing diagnostic_aggregator, DiagGraph, and HazardStatus keep working unchanged; set() only appends to values[]. The fault_class is also preserved in the existing four-array HazardStatus structure for backward-compatible readers.
Representative call site (autoware_system_monitor, voltage_monitor.cpp):
A consumer reads these with autoware::error::get(status) and joins the diagnostic record with the matching /rosout entry and any trace span via autoware.trace.id.
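The append-only behaviour can be sketched without ROS 2 by using stand-in structs. Everything below is hypothetical: the struct definitions mimic diagnostic_msgs only loosely, ErrorCode is a placeholder type, and set_error_keys takes the trace_id explicitly, whereas the proposal's set(status, ec, fault_class) takes three arguments and sources the ID elsewhere.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Minimal stand-ins for diagnostic_msgs KeyValue / DiagnosticStatus so the
// sketch compiles without ROS 2.
struct KeyValue { std::string key; std::string value; };
struct DiagnosticStatus
{
  std::uint8_t level = 0;
  std::string name;
  std::vector<KeyValue> values;
};

// Hypothetical error descriptor; the real type belongs to the Error Code Foundation.
struct ErrorCode
{
  std::uint16_t code;
  int canonical;
  std::string domain_name;
  std::string value_name;
};

// Sketch of an append-only setter: existing entries in values[] are untouched.
void set_error_keys(
  DiagnosticStatus & status, const ErrorCode & ec,
  const std::string & fault_class, const std::string & trace_id)
{
  auto add = [&status](const std::string & key, const std::string & value) {
    status.values.push_back(KeyValue{key, value});
  };
  add("autoware.error.code", std::to_string(ec.code));
  add("autoware.error.canonical", std::to_string(ec.canonical));
  add("autoware.error.domain_name", ec.domain_name);
  add("autoware.error.value_name", ec.value_name);
  add("autoware.fault.class", fault_class);
  add("autoware.trace.id", trace_id);
}

// Reader side: return the value for a key, or an empty string if absent.
std::string get_value(const DiagnosticStatus & status, const std::string & key)
{
  for (const auto & kv : status.values) {
    if (kv.key == key) return kv.value;
  }
  return "";
}

// Usage example with made-up inputs: one pre-existing key plus the reserved keys.
DiagnosticStatus example_status()
{
  DiagnosticStatus status;
  status.name = "example";
  status.values.push_back(KeyValue{"existing_key", "existing_value"});
  set_error_keys(
    status, ErrorCode{8455, 13, "CONTROL_LATERAL", "QP_SOLVER_ERROR"}, "SPF",
    "4bf92f3577b34da6a3ce929d0e0e4736");
  return status;
}
```

Because the setter only appends, a consumer that iterates values[] looking for its own keys sees exactly what it saw before.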
3. Tracing Path
A ScopedSpan RAII helper provides trace_id and span_id for a scope. All log and diagnostic emission inside the scope inherits these IDs from a thread-local current() context. Spans are exported to a trace backend (Jaeger, Tempo, or any OTLP-capable store) via OTLP/HTTP.
Representative call site (autoware_mission_planner_universe, mission_planner.cpp):
void MissionPlanner::on_set_lanelet_route(
  const SetLaneletRoute::Request::SharedPtr req,
  const SetLaneletRoute::Response::SharedPtr res)
{
  autoware::error::tracing::ScopedSpan span("svc/set_lanelet_route");
  // existing handler body: the route_set INFO log and any AUTOWARE_LOG_*_CODE
  // inside inherit span's trace_id / span_id from current()
}
The exported span carries no error.* payload. Its trace_id is the join key: a successful route-set event and any fault logged during route-setting land on one trace timeline. Tracing is a telemetry feature in its own right; error codes ride on it but do not define it.
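For readers who want to see the mechanism, here is a minimal sketch of an RAII span backed by a thread-local context. It is not the PoC's or the proposal's implementation: SpanContext, random_hex, and the restore-on-destruction behaviour are assumptions, and span export is omitted entirely.

```cpp
#include <cstddef>
#include <cstdio>
#include <random>
#include <string>
#include <utility>

struct SpanContext
{
  std::string trace_id;  // 32 hex chars (16 bytes) when set
  std::string span_id;   // 16 hex chars (8 bytes) when set
};

// One context per thread; loggers would read IDs from here.
inline SpanContext & current()
{
  thread_local SpanContext ctx;
  return ctx;
}

// Hypothetical ID source: n_bytes random bytes rendered as lowercase hex.
inline std::string random_hex(std::size_t n_bytes)
{
  thread_local std::mt19937_64 rng{std::random_device{}()};
  std::uniform_int_distribution<int> byte(0, 255);
  std::string out;
  char buf[3];
  for (std::size_t i = 0; i < n_bytes; ++i) {
    std::snprintf(buf, sizeof(buf), "%02x", byte(rng));
    out += buf;
  }
  return out;
}

class ScopedSpan
{
public:
  explicit ScopedSpan(std::string name) : name_(std::move(name)), parent_(current())
  {
    SpanContext & ctx = current();
    if (ctx.trace_id.empty()) ctx.trace_id = random_hex(16);  // root span starts a trace
    ctx.span_id = random_hex(8);                              // every span gets its own ID
  }
  // A real implementation would export the finished span here.
  ~ScopedSpan() { current() = parent_; }
  ScopedSpan(const ScopedSpan &) = delete;
  ScopedSpan & operator=(const ScopedSpan &) = delete;

private:
  std::string name_;
  SpanContext parent_;  // restored when the scope ends
};

// Usage example: a nested span keeps the trace_id and gets a new span_id.
bool nested_spans_share_trace()
{
  ScopedSpan outer("svc/set_lanelet_route");
  const SpanContext outer_ctx = current();
  {
    ScopedSpan inner("plan");
    if (current().trace_id != outer_ctx.trace_id) return false;
    if (current().span_id == outer_ctx.span_id) return false;
  }
  return current().span_id == outer_ctx.span_id;  // parent restored
}
```

Restoring the parent context in the destructor is what lets nested handlers share one trace_id while each span keeps its own span_id.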
Multi-threaded executor caveat. When a callback hands work to another thread (e.g., under MultiThreadedExecutor), trace_id/span_id must be explicitly captured before the handoff and re-attached on the worker thread. The PoC exercises this pattern (see the Proof of Concept section below).
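The capture-then-attach pattern can be sketched as follows. The names capture and ScopedAttach are illustrative stand-ins (the PoC refers to capture()/attach(), whose exact signatures are not shown here), and a plain std::thread stands in for an executor worker.

```cpp
#include <string>
#include <thread>
#include <utility>

struct SpanContext
{
  std::string trace_id;
  std::string span_id;
};

// Thread-local context: each thread starts with empty IDs.
inline SpanContext & current()
{
  thread_local SpanContext ctx;
  return ctx;
}

// Copy the calling thread's context so it can travel with the work item.
inline SpanContext capture() { return current(); }

// Hypothetical RAII guard: install a captured context on this thread and
// restore the previous one when the scope ends.
class ScopedAttach
{
public:
  explicit ScopedAttach(SpanContext ctx) : previous_(current()) { current() = std::move(ctx); }
  ~ScopedAttach() { current() = previous_; }
  ScopedAttach(const ScopedAttach &) = delete;
  ScopedAttach & operator=(const ScopedAttach &) = delete;

private:
  SpanContext previous_;
};

// Usage example: returns the trace_id the worker thread observes.
std::string trace_id_seen_by_worker(bool propagate)
{
  current() = {"4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"};
  const SpanContext captured = capture();  // taken before the handoff
  std::string seen;
  std::thread worker([&seen, captured, propagate]() {
    if (propagate) {
      ScopedAttach attach(captured);  // re-attach on the worker thread
      seen = current().trace_id;
    } else {
      seen = current().trace_id;      // thread-local is empty here
    }
  });
  worker.join();
  current() = {};
  return seen;
}
```

Without the explicit attach the worker sees an empty context, which is exactly the failure mode the caveat warns about: logs emitted on the worker would silently drop out of the trace.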
4. Collection Architecture
The three signals reach the backend by separate routes. The logging path requires no custom Collector plugin, only standard filelog receiver configuration. The diagnostic path requires a small consumer that reads autoware.error.* keys from DiagGraphStatus and emits OTLP. The tracing path exports directly via OTLP/HTTP from each node. Backends are interchangeable behind the OTLP boundary.
Implementation Note: opentelemetry-cpp
The proof of concept (below) uses a deliberately minimal, hand-rolled tracing implementation — UUID generation + thread_local for IDs, and a ~50-line libcurl OTLP/HTTP POST — chosen to keep the feasibility skeleton dependency-light and runnable with no extra apt packages.
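To give a sense of how small such an exporter can be, the sketch below builds the JSON body for a single span in the OTLP/HTTP JSON encoding, which is conventionally POSTed to the /v1/traces path (default port 4318) with Content-Type application/json. This is not the PoC's code: FinishedSpan and to_otlp_json are hypothetical, the libcurl call is left out, and strings are assumed to need no JSON escaping.

```cpp
#include <cstdint>
#include <string>

// Hypothetical record of a completed span.
struct FinishedSpan
{
  std::string trace_id;  // 32 lowercase hex chars
  std::string span_id;   // 16 lowercase hex chars
  std::string name;
  std::uint64_t start_unix_nano;
  std::uint64_t end_unix_nano;
};

// Build an OTLP/HTTP JSON request body containing one span.
// Assumption: service_name and span.name contain no characters that need
// JSON escaping; a production exporter must escape them.
std::string to_otlp_json(const std::string & service_name, const FinishedSpan & span)
{
  std::string body;
  body += R"({"resourceSpans":[{"resource":{"attributes":[)";
  body += R"({"key":"service.name","value":{"stringValue":")" + service_name + R"("}}]},)";
  body += R"("scopeSpans":[{"spans":[{)";
  body += R"("traceId":")" + span.trace_id + R"(",)";
  body += R"("spanId":")" + span.span_id + R"(",)";
  body += R"("name":")" + span.name + R"(",)";
  body += R"("kind":2,)";  // 2 = SPAN_KIND_SERVER
  // 64-bit integers are carried as decimal strings in the JSON encoding.
  body += R"("startTimeUnixNano":")" + std::to_string(span.start_unix_nano) + R"(",)";
  body += R"("endTimeUnixNano":")" + std::to_string(span.end_unix_nano) + R"("}]}]}]})";
  return body;
}
```

Batching, retry, and backpressure are deliberately absent here, which is the gap the production SDK is meant to fill.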
The production implementation will use opentelemetry-cpp, the official CNCF OpenTelemetry C++ API and SDK. opentelemetry-cpp provides:
W3C traceparent / tracestate context propagation across process and thread boundaries
Batched, async OTLP export (gRPC and HTTP) with configurable retry and backpressure
Sampling API (head-based, tail-based via Collector)
Unified logs, metrics, and traces signals through a single SDK
Native OTLP LogRecord.trace_id / span_id fields (eliminating the attribute-level workaround described in the Proof of Concept caveat)
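As background for the first item in the list above, the W3C traceparent header (version 00) has the fixed form 00-<trace_id>-<span_id>-<flags>, where the IDs are lowercase hex and bit 0 of the flags marks the trace as sampled. The sketch below formats and validates such a header by hand; the TraceParent struct and function names are illustrative only, and opentelemetry-cpp ships its own propagator that would be used in practice.

```cpp
#include <cstddef>
#include <optional>
#include <string>

struct TraceParent
{
  std::string trace_id;  // 32 lowercase hex chars, not all zero
  std::string span_id;   // 16 lowercase hex chars, not all zero
  bool sampled;
};

inline bool is_lower_hex(const std::string & s)
{
  for (char c : s) {
    if (!((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f'))) return false;
  }
  return !s.empty();
}

inline bool all_zero(const std::string & s)
{
  return s.find_first_not_of('0') == std::string::npos;
}

std::string format_traceparent(const TraceParent & tp)
{
  return "00-" + tp.trace_id + "-" + tp.span_id + "-" + (tp.sampled ? "01" : "00");
}

// Version 00 layout: 2 + 1 + 32 + 1 + 16 + 1 + 2 = 55 characters.
std::optional<TraceParent> parse_traceparent(const std::string & header)
{
  if (header.size() != 55 || header.compare(0, 3, "00-") != 0) return std::nullopt;
  if (header[35] != '-' || header[52] != '-') return std::nullopt;
  const std::string trace_id = header.substr(3, 32);
  const std::string span_id = header.substr(36, 16);
  const std::string flags = header.substr(53, 2);
  if (!is_lower_hex(trace_id) || !is_lower_hex(span_id) || !is_lower_hex(flags)) {
    return std::nullopt;
  }
  if (all_zero(trace_id) || all_zero(span_id)) return std::nullopt;  // invalid IDs
  const bool sampled = (std::stoi(flags, nullptr, 16) & 0x01) != 0;
  return TraceParent{trace_id, span_id, sampled};
}
```

The trace_id and span_id widths match the 16-byte and 8-byte OTLP LogRecord fields, so the same hex strings used in the logfmt suffix can be carried across process boundaries unchanged.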
The idiomatic integration path in a ROS 2 workspace is a vendor package (e.g., opentelemetry_cpp_vendor), following the same pattern as tvm_vendor, qpoases_vendor, and similar packages in autoware_universe. opentelemetry-cpp is not available in the Ubuntu Noble apt repositories, but all of its build dependencies (protobuf, nlohmann-json, libcurl) are. An OTLP-HTTP-only build (no gRPC) is straightforward and keeps the dependency footprint small.
The hand-rolled PoC exporter is a stand-in that validates the design claims; it is not a competitor to opentelemetry-cpp and will be replaced before any production rollout.
Proof of Concept
A runnable PoC is available at https://github.com/youtalk/awf-error-observability-poc, targeting ROS 2 Jazzy. Start it with docker compose up and open localhost:16686 (Jaeger UI).
The PoC validates four observability claims:
O1: No custom bridge for the logging path. A stock otel/opentelemetry-collector-contrib image with only a filelog receiver and a key_value_parser operator extracts error.code, error.canonical_name, trace_id, and span_id from the ROS container log files. No custom Collector plugin was written, and no ROS bridge node runs. This directly validates the semconv-alignment bet: because the logfmt keys match OTel attribute names, configuration alone is sufficient.
O2: The trace_id join. The same trace_id value appears in three places simultaneously: the span in Jaeger, the error log ingested via filelog, and the diagnostic status record emitted as an OTLP log. Manual cross-channel lookup is eliminated.
O3: ScopedSpan under MultiThreadedExecutor. A span opened in one callback survives hand-off to a worker thread via explicit capture()/attach(), and the exported span covers the full async operation.
O4: End-to-end OTLP export. Spans export from a ROS 2 node to Jaeger via OTLP/HTTP with no intermediate bridge process.
One honest caveat. In the PoC, the diagnostic log carries trace_id as an OTLP attribute (autoware.trace.id) rather than the first-class OTLP LogRecord.trace_id / traceId field. This means automatic backend trace↔diagnostic-log correlation requires an OTLP transform processor step in the Collector config, rather than working natively. In the production implementation using opentelemetry-cpp, the SDK sets LogRecord.trace_id natively, eliminating this step. The value-match already proves the join works; the caveat is about ergonomics, not correctness.
Feedback Requested
Stock-Collector bet (G2). The logging path's core claim is that logfmt keys aligned to OTel semconv names let a standard filelog + key_value_parser Collector configuration replace a bespoke ROS receiver — no code required. Does the community see failure modes in this approach for production log volumes (~10 Hz per node, multi-node deployments)?
opentelemetry-cpp vendor package. Is there appetite for an opentelemetry_cpp_vendor package in autoware_universe? Alternatively, is there a preference to depend on a system-installed OTel SDK once distros ship a recent enough version?
/diagnostics reserved-key namespace. The autoware.error.* / autoware.fault.* / autoware.trace.* keys are added to DiagnosticStatus.values[]. Are there known consumers that would conflict with this namespace, or a preference for a different prefix?
Tracing scope granularity. The proposal instruments at ROS service / action handler boundaries (ScopedSpan per handler). Is this the right granularity, or should spans also wrap individual topic callbacks? What is the acceptable overhead threshold?
Relationship to the Error Code Foundation. This proposal builds on the emission surface defined by the companion Error Code Foundation proposal (the AUTOWARE_LOG_*_CODE macros and autoware::error::set). If the community wants to evaluate this observability stack independently — before or without adopting the error-code scheme — what would a minimal standalone integration look like?