
Design proposal for extension system in the OTel Dataflow Engine#2293

Merged
jmacd merged 8 commits into open-telemetry:main from lquerel:extension-proposal
Mar 18, 2026
Conversation

@lquerel
Contributor

@lquerel lquerel commented Mar 12, 2026

Change Summary

This PR adds a design proposal describing the extension system for the OTel Dataflow Engine.

The document introduces a capability-based extension architecture allowing receivers, processors, and exporters to access non-pdata functionality through well-defined capability interfaces maintained in the engine core.

The proposal covers:

  • core concepts such as capabilities, extension providers, and extension instances
  • integration of extensions into the existing configuration model
  • the user experience for declaring extensions and binding capabilities
  • the developer experience for implementing extension providers
  • the runtime architecture for resolving and instantiating extensions
  • the execution models supported by extensions (local vs shared)
  • comparison with the Go Collector extension model
  • a phased evolution plan (native extensions → hierarchical placement → WASM extensions)
  • implementation recommendations for building high-performance extensions aligned with the engine's thread-per-core design

The goal of this document is to provide maintainers with a clear architectural proposal to review before implementing the extension system.

What issue does this PR close?

How are these changes tested?

This PR introduces documentation only and does not modify runtime code.

Are there any user-facing changes?

Yes.

This proposal describes a future extension system that will introduce new configuration capabilities such as:

  • an extensions section in pipeline configurations
  • a capabilities section in node definitions

These changes are not implemented yet but outline the intended user-facing configuration model for extensions.

@lquerel lquerel requested a review from a team as a code owner March 12, 2026 21:31
@github-actions github-actions bot added the rust Pull requests that update Rust code label Mar 12, 2026
@codecov

codecov bot commented Mar 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.72%. Comparing base (4ec2c64) to head (7cf41ff).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2293      +/-   ##
==========================================
- Coverage   87.73%   87.72%   -0.01%     
==========================================
  Files         578      578              
  Lines      198334   198334              
==========================================
- Hits       174004   173990      -14     
- Misses      23804    23818      +14     
  Partials      526      526              
Components Coverage Δ
otap-dataflow 89.75% <ø> (-0.01%) ⬇️
query_abstraction 80.61% <ø> (ø)
query_engine 90.61% <ø> (ø)
syslog_cef_receivers ∅ <ø> (∅)
otel-arrow-go 52.44% <ø> (ø)
quiver 91.91% <ø> (ø)


* cheap to clone
* cheap to pass around during initialization
* inexpensive to call
Contributor


I also noticed that you have mentioned 5. Use background tasks for slow-path work in the doc below.

I like this separation of concerns. If we are expecting capability handles to be lightweight, should we establish that capability handle methods are sync? That would make the separation clear - extensions do I/O in their background tasks, handles just serve the latest state. It also aligns with the Go Collector model where all auth interfaces (e.g., Server.Authenticate, HTTPClient.RoundTripper) are synchronous.
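The separation described above can be sketched as follows. This is a hypothetical illustration, not the engine's actual API (the names `TokenSource`, `CachedTokenHandle`, and `refresh` are invented): the extension's background task writes the latest value into a shared slot, and the handle's only job is a cheap synchronous read.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical capability trait: sync, serves already-available state.
trait TokenSource {
    fn current_token(&self) -> String;
}

// Handle: cheap to clone, reads whatever the background task last published.
#[derive(Clone)]
struct CachedTokenHandle {
    token: Arc<RwLock<String>>,
}

impl TokenSource for CachedTokenHandle {
    fn current_token(&self) -> String {
        self.token.read().unwrap().clone()
    }
}

// Stand-in for the extension's background task: it alone performs the slow
// I/O (token refresh) and publishes the result into the shared slot.
fn refresh(slot: &Arc<RwLock<String>>, new_token: &str) {
    *slot.write().unwrap() = new_token.to_string();
}
```

The handle never blocks on I/O; it only pays for a read lock on state the background task keeps fresh.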

Contributor Author


Yes, I generally agree with that. We should start with synchronous handles, even if it means adding support for asynchronous handles later (if we determine that it's important).

Member


We should start with synchronous handles

Agreed. Nearly all logging libraries offer only a sync interface and handle the heavy lifting (disk/network) in the background. I think of "extensions" as very similar to logging.

Contributor

@gouslu gouslu Mar 13, 2026


When I think of extensions, I think of general-purpose, service-locator-like solutions: an extension can be an authentication provider, a database adapter, a rate-limiter counter, and so on. Some of these absolutely require async methods. For example, suppose a cache processor supports multiple key/value stores through a capability named "KeyValueAccessor" for databases like Redis, AWS, or Azure; an extension implementing that capability could, based on configuration, write to any of these databases.

But, I also see extensions as something that will hit an "increment(tenant_id)" method to track something like a rate limiting functionality, where lifecycle regularly resets the counters based on time and other information.

I think extensions can and should always provide the ability to execute async calls as well as non-async ones.

I agree that a lot of problems can be solved by updating state in the lifecycle, where the extension capabilities are sync traits with methods that just check the current state, but we don't have to live with that limitation. In fact, I struggle to understand on what basis there is such a limitation that we have to accept to begin with.

I think the real discussion should be about the memory boundaries, how we address all the requirements in this document, and how we achieve all the goals in the long term: a design choice that is the best overall, with minimal drawbacks and the right trade-offs.

I agree that hot-path capability methods should be as lightweight as possible — ideally sync reads of cached state. But I don't think we should prevent async methods at the trait level, and we don't have to make that trade-off, because some capabilities are inherently async (such as database queries). We can recommend sync-first as a best practice while leaving async as an option for capabilities that need it. I don't think we need to decide now to give up async method support; there is no good reason for it.

Contributor

@gouslu gouslu Mar 13, 2026


I also noticed that you have mentioned 5. Use background tasks for slow-path work in the doc below.

I like this separation of concerns. If we are expecting capability handles to be lightweight, should we establish that capability handle methods are sync? That would make the separation clear - extensions do I/O in their background tasks, handles just serve the latest state. It also aligns with the Go Collector model where all auth interfaces (e.g., Server.Authenticate, HTTPClient.RoundTripper) are synchronous.

Go extensions are sync because goroutines make blocking transparent; there are no async/await keywords in Go, and all methods are sync by the nature of the language. That is a different philosophy than Rust's, so the two are not really comparable.

That's not a design decision; it's the only option Go has.

Contributor

@gouslu gouslu Mar 13, 2026


One more thing to mention: the Go Collector has many extensions that would require async functions in Rust. For example, redisstorageextension implements the storage.Client interface, a pattern that allows selecting from many storage providers. This functionality cannot be implemented without async in Rust; that is the reality. We have no reason to take a path of non-sync today and async tomorrow, because, firstly, Rust's type system will make that switch very painful, and secondly, it is not a trade-off we need to make today. Adding async support from day 1 is not only very possible; I already have a prototype that demonstrates this capability, using a pattern for shared state that is used in very large libraries like tokio-rs/axum, as mentioned here.

Contributor


Go extensions are sync because goroutines make blocking transparent; there are no async/await keywords in Go, and all methods are sync by the nature of the language. That is a different philosophy than Rust's, so the two are not really comparable.

That's not a design decision; it's the only option Go has.

True, Go doesn't have async/await, but the usage pattern is still sync. When a receiver calls Server.Authenticate(), it doesn't spawn a goroutine or select on a channel; it calls and gets an answer. That's the pattern we're applying.

For example, redisstorageextension implements the storage.Client interface, a pattern that allows selecting from many storage providers. This functionality cannot be implemented without async in Rust; that is the reality. We have no reason to take a path of non-sync today and async tomorrow, because, firstly, Rust's type system will make that switch very painful, and secondly, it is not a trade-off we need to make today.

Great find! The storage example is a fair point — a KV store needs request-response, so the sync-read-of-cached-state pattern doesn't fit there. But that doesn't mean every capability should be async. Auth and token sources work naturally as sync — the extension refreshes in the background, the handle serves the latest value.

If we later need an async capability like KV storage, we'd define that trait as async from the start (no migration needed). I don't think the switch would be painful here. From what I understand, we're defining capability traits per capability, not one universal trait. The existing sync traits wouldn't change. So, it wouldn't be a "sync today, convert to async tomorrow" situation. It's "sync capabilities stay sync, async capabilities are born async." We don't need to make everything async today to leave room for it later.
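The "sync capabilities stay sync, async capabilities are born async" idea can be sketched with per-capability traits. All names here (`AuthCheck`, `KvStore`, `StaticAuth`) are hypothetical, not the engine's types; the boxed-future alias is one common way to keep an async trait dyn-compatible without forcing anything onto the sync traits.

```rust
use std::future::Future;
use std::pin::Pin;

// One common alias for dyn-compatible async trait methods.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

// Sync capability: born sync, stays sync. No migration pressure.
trait AuthCheck {
    fn check(&self, token: &str) -> bool;
}

// Async capability: born async from the start, via boxed futures.
#[allow(dead_code)]
trait KvStore {
    fn get<'a>(&'a self, key: &'a str) -> BoxFuture<'a, Option<String>>;
}

// A trivial sync implementation, for illustration only.
struct StaticAuth {
    expected: String,
}

impl AuthCheck for StaticAuth {
    fn check(&self, token: &str) -> bool {
        token == self.expected
    }
}
```

Because each capability is its own trait, adding `KvStore` later would not touch any existing implementor of `AuthCheck`.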

Contributor

@gouslu gouslu Mar 14, 2026


Great find! The storage example is a fair point — a KV store needs request-response, so the sync-read-of-cached-state pattern doesn't fit there. But that doesn't mean every capability should be async. Auth and token sources work naturally as sync — the extension refreshes in the background, the handle serves the latest value.

If we later need an async capability like KV storage, we'd define that trait as async from the start (no migration needed). I don't think the switch would be painful here. From what I understand, we're defining capability traits per capability, not one universal trait. The existing sync traits wouldn't change. So, it wouldn't be a "sync today, convert to async tomorrow" situation. It's "sync capabilities stay sync, async capabilities are born async." We don't need to make everything async today to leave room for it later.

Agreed! It is about the extension system's capability to support async. If this capability isn't baked in from the get-go, I think migrating to it would be a non-trivial job, because async doesn't come for free. So the original design should take that into consideration as a non-negotiable requirement: even if the support isn't there from day 1, the path to support it should be crystal clear and well defined. Our best bet is to start with a design that is the friendliest to that goal, preferably capable from day one. And there are very good design choices we can take that have async support from day one.

Contributor Author


I think we should start by supporting synchronous capability traits. In a later phase we can add support for asynchronous capabilities, but I don't believe this is the type of capability we want to encourage initially.

The engine is designed around predictable latency, thread-per-core execution, and minimal synchronization, so capability handles are expected to be lightweight and safe to use on the hot path. For that reason, the preferred model is that capability methods are synchronous and operate on already available local state, while any I/O or slow operations are handled by background tasks within the extension.

When we introduce async capabilities later, we should ensure they are used in ways that preserve these properties. In particular, async capabilities should generally be reserved for control-path or slow-path operations, not for the hot data path.

For example, exposing a Redis interface directly in the hot path would likely be a poor fit. A better pattern would be for the Redis extension to run background tasks that synchronize or refresh a local cache or snapshot, while the capability handle exposes a synchronous lookup against that locally available state. This keeps the data path predictable while still allowing integration with external systems.
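A minimal sketch of that pattern, with illustrative names only (`AllowlistLookup`, `SnapshotHandle`, `apply_refresh` are invented): the extension's background task would own the Redis connection and periodically publish a snapshot, while the capability handle serves sync lookups against the local copy, so the hot path never performs I/O.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical capability: a sync lookup against locally cached state.
trait AllowlistLookup {
    fn is_allowed(&self, tenant: &str) -> bool;
}

#[derive(Clone)]
struct SnapshotHandle {
    snapshot: Arc<RwLock<HashMap<String, bool>>>,
}

impl AllowlistLookup for SnapshotHandle {
    fn is_allowed(&self, tenant: &str) -> bool {
        self.snapshot
            .read()
            .unwrap()
            .get(tenant)
            .copied()
            .unwrap_or(false)
    }
}

// Stand-in for one refresh step of the background sync task; in the real
// extension this would be fed by the Redis round-trip.
fn apply_refresh(slot: &Arc<RwLock<HashMap<String, bool>>>, update: HashMap<String, bool>) {
    *slot.write().unwrap() = update;
}
```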

To keep this evolution possible, we should ensure that the capability registry and handle model do not assume capabilities are always synchronous, even if the first version only supports sync traits. That way the system can evolve to support async capabilities where they make sense without requiring a redesign of the extension framework.

Contributor


I think we should start with sync, plan to add async later.

The async space will be trickier, as we all can see. I do imagine a useful processor that makes one request/response to Redis per pipeline data. We will have to address how to hand off async work (local or shared), how to limit concurrency and apply backpressure, etc.

Comment on lines +351 to +355
```text
Rc
RefCell
Cell
```
Contributor


I like almost everything in this design proposal, but this requirement creates a tension: it means we need traits that are !Send + !Sync, but we also need traits that are Send for the shared execution model and for use in shared nodes. This puts us in a situation where we need some kind of dual-mode system for each capability, with dual implementations. That is not a bad idea, but it means we would ideally create dual traits and ask implementers to provide dual implementations.

I really like how we bind extensions to capabilities, and the hierarchical approach for local or shared execution models; everything in the design so far seems achievable, with a caveat. For example, if a capability is configured at a global level and then bound to an exporter in the pipeline, when the pipeline does a get::<MyCapability>() on that capability, the engine will know what to return based on that binding configuration. That is amazing! It can work in a system where there is only one trait per capability (not local and shared versions of that trait) and a registry with hierarchical storage of capabilities, bootstrapped at startup from the configuration. It cannot work in a system where local and shared traits of the same capability both exist, because the consumer of that extension would need to know whether to request the shared or the local version based on the config, which would make the binding config not really work.

Additionally, shared nodes cannot use !Send capability trait objects at all. With dual traits, a misconfiguration would only be caught at runtime with a confusing error.

At least, that's based on my understanding of how Rust's type system and the memory boundaries work. If you think I am missing something, I would love for you to share your vision of how to achieve this; otherwise, I think that if we drop this Rc/RefCell/Cell requirement entirely, we have a solution we can work with (although I will need to prototype it a little before I can confirm 100%).

I would say that extensions shouldn't even have a concept of shared and local; they should always be Send. That would be the cleanest approach, as far as my experimentation on this design goes.

This assessment is based on the understanding that this design document is asking from us all these three:

  1. !Send capability traits (Rc/RefCell in trait impls).
  2. Single capability trait per capability (to make the bindings and hierarchical scope work)
  3. Cross-pipeline sharing (handle cloned to other cores)

What I am saying is that by dropping the !Send requirement, making all extensions Send, and not having a local/shared separation, we can achieve everything here at once (I believe, but again subject to a PoC). What we lose by dropping this requirement is relative, and likely minimal compared to the volume of pdata and I/O.
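The trade-off being debated can be made concrete with two hypothetical handle shapes (neither is the engine's actual type): a !Send local handle built on Rc/RefCell, which pays no atomic or locking cost but is pinned to one core, and a Send + Sync shared handle built on Arc/Mutex, which is usable from any core at the cost of synchronization.

```rust
use std::cell::RefCell;
use std::rc::Rc;
use std::sync::{Arc, Mutex};

// Local execution model: !Send, no atomics, one instance per core.
#[derive(Clone)]
struct LocalCounter(Rc<RefCell<u64>>);

// Shared execution model: Send + Sync, clones usable across cores.
#[derive(Clone)]
struct SharedCounter(Arc<Mutex<u64>>);

impl LocalCounter {
    fn increment(&self) -> u64 {
        let mut n = self.0.borrow_mut(); // runtime borrow check, no locking
        *n += 1;
        *n
    }
}

impl SharedCounter {
    fn increment(&self) -> u64 {
        let mut n = self.0.lock().unwrap(); // cross-thread synchronization
        *n += 1;
        *n
    }
}
```

The type-system tension in the thread is exactly that a single capability trait object cannot be both of these at once: a shared node cannot hold the Rc-based handle at all.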

Contributor


I wanted to reiterate that I really like the binding and multi-scope extensions. We already have a need for a global extension, it seems, for the Azure Monitor Exporter. So, regardless of what we come up with on the other side, I am hugely supportive of this design aspect.

Contributor

@gouslu gouslu Mar 17, 2026


Just to clarify: when I re-read this, I realized that asking to drop RefCell and Cell is not necessary. Where that ask came from is that I don't think RefCell and Cell would be very useful in most cases if we had Send-only capability traits with async support, where state is shared across clones via Arc<>-wrapped fields.

Contributor


I do not want to downplay this concern, and yet, I think we can do this. (Confident!)

We will pay double the cost, in places, for implementing both shared and local capabilities for extension binding/discovery/registry.

Comment on lines +45 to +47
Capability interfaces (traits) are **defined and maintained in the OTel Dataflow
Engine core**. Extensions implement these predefined capabilities but **cannot
introduce new capability interfaces outside the core**.
Contributor


This could make the core repo a bottleneck. Every new capability type would require an engine PR before anyone can build an extension for it. We already have contrib-nodes for community-contributed nodes; it seems natural to allow contrib-defined capabilities too.

Config validation wouldn't be affected. Capabilities would use the same distributed_slice discovery as nodes. Whether the trait lives in the engine or contrib, it's in the same binary and discoverable at startup.

A possible middle ground could be that core capabilities like auth stay in the engine, but the system allows capabilities to be defined externally too (using the same distributed_slice registration). The engine doesn't need to know every capability type; it just needs to wire a node's binding to the right extension instance. Type safety is enforced by the trait bound at the node level.
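One way to see why the wiring is location-independent is a TypeId-keyed registry. This is a loose, hypothetical sketch echoing the idea above, not the engine's actual types: the registry stores erased handles and resolves a binding to whatever concrete capability type was registered, regardless of whether that type was defined in the engine core or in a contrib crate.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;
use std::sync::Arc;

// Minimal capability registry: stores handles type-erased, keyed by TypeId.
// The engine never needs to enumerate capability types; the trait bound at
// the lookup site enforces type safety.
#[derive(Default)]
struct Capabilities {
    slots: HashMap<TypeId, Arc<dyn Any + Send + Sync>>,
}

impl Capabilities {
    fn insert<T: Any + Send + Sync>(&mut self, value: Arc<T>) {
        self.slots.insert(TypeId::of::<T>(), value);
    }

    fn get<T: Any + Send + Sync>(&self) -> Option<Arc<T>> {
        self.slots
            .get(&TypeId::of::<T>())
            .and_then(|v| v.clone().downcast::<T>().ok())
    }
}
```

Since the key is the Rust TypeId, a capability type defined in a contrib crate works exactly like one defined in core; both are in the same binary and resolvable at startup.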

Contributor


I think: start sealed, then open up if there is a great justification and a good game plan. Sealing can be an on/off switch rather than an important design decision we need to make today(?). The only reason I suggest this is that making it open is a bit like opening Pandora's box, especially if the code base is not yet mature enough for it.

Contributor


This is a fair concern. The Collector ecosystem has begun to see requests for new, experimental extension interfaces. One example is a request for https://github.com/cockroachdb/pebble as an API surface or even as a "get pebble API stub" extension. For example, someone has a feature for tailsamplingprocessor that lets it offload data for longer-window trace assembly. We don't want to add a pebble dependency to every collector-contrib release, though, so we imagine a single-consumer extension API which exists only to allow optional configuration of a heavy dependency.

Eventually, capability definitions will need to live outside the core repository. At that point, we will require some level of stability in the registry and discovery process.

jmacd added a commit to jmacd/otel-arrow that referenced this pull request Mar 16, 2026
The engine supports two config formats — single-pipeline (-p) and
full engine (-c) — but only -p was documented. This caused confusion
during review of the extension system proposal (PR open-telemetry#2293), whose
examples use the engine-level format.

- configs/README.md: add Configuration Formats section explaining
  both modes with examples and CLI flags, document engine-conf/
  subdirectory
- src/README.md: fix CLI examples, add -c usage, link to configs
  README
- README.md: replace inaccurate claim that all configs deserialize
  into EngineConfig with accurate description of both formats

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lquerel
Contributor Author

lquerel commented Mar 17, 2026

@jmacd do we merge this spec or not?


* provider name
* supported capabilities
* supported execution model (local or shared)
Contributor


This is an exclusive-or, I think. If an extension provider wished to be both local and shared, it would register two different variations of itself. Does that match your expectation?

otlp_recv1:
  type: receiver:otlp
  capabilities:
    auth_check: auth_main
Contributor


This supports a 1-to-1 mapping from capability (name) to extension (name).

I'm thinking about cases where the mapping is 1-to-N, and the Collector concept of middleware extension might apply. The collector does not have a node-level configuration type at all, but it does have standard configuration types, and the configuration for middleware extensions is a list.

    middlewares: [first_intercept, second_intercept, third_intercept]

Can we imagine wanting named capabilities, like for example two storage extensions (metadata and data)

      storage/meta: metadata_store # on SSD
      storage/data: oteldata_store    # on disk

with (somewhere else) documentation or configuration of which extension to choose at which point

      rate_limit: ordinary_limiter
      rate_limit/special_tenant: unlimited_limiter

```text
Rc
RefCell
Cell
```
Contributor


It makes me nervous to see both Cell and RefCell in this list.

I'm familiar with Rc<RefCell<Box<dyn Thing>>> as a standard pattern in Rust. I'm not sure when you would use Cell for capability discovery or binding.
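For contrast, a hypothetical handle-local stats struct (invented for illustration, not part of the proposal) shows where each type naturally fits: Cell suits small Copy values that are swapped whole, while RefCell hands out dynamic borrows of non-Copy data. Neither is obviously required for capability discovery or binding itself.

```rust
use std::cell::{Cell, RefCell};

// Per-core, single-threaded bookkeeping on a capability handle.
struct HandleStats {
    calls: Cell<u64>,                    // Copy value: get/set by value, no borrows
    last_error: RefCell<Option<String>>, // non-Copy: borrowed and mutated in place
}

impl HandleStats {
    fn record_call(&self) {
        // Cell has no runtime borrow bookkeeping; just read, bump, write back.
        self.calls.set(self.calls.get() + 1);
    }

    fn record_error(&self, msg: &str) {
        // RefCell performs a runtime borrow check, then mutates in place.
        *self.last_error.borrow_mut() = Some(msg.to_string());
    }
}
```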

Comment thread rust/otap-dataflow/docs/extensions.md Outdated
jmacd added a commit to jmacd/otel-arrow that referenced this pull request Mar 18, 2026
Add a clean-room extension system with name-based capability binding,
inspired by but stripped from PR open-telemetry#2293 (~780 lines vs ~4,100).

Core design:
- Three-level type hierarchy: InstanceCapabilities (per extension instance)
  → ExtensionRegistry (per pipeline) → Capabilities (per node, TypeId-keyed)
- CapabilitySlot: Shared (Arc<dyn Trait>, cross-core) vs Local (factory
  producing Box<dyn Trait>, one per core)
- CoreLocalCache: per-core !Send cache using Rc<RefCell<Box<T>>>, enables
  multiple Local nodes on the same core to share a single instance
- NodeContext: wraps PipelineContext + &CoreLocalCache, passed to node
  factories via Deref<Target=PipelineContext>

Config model:
- extensions: section in PipelineConfig declares instances by name
- nodes: each node has capabilities: mapping capability_name → instance_name
- Two nodes can bind the same capability to different extension instances

Key properties:
- Sync-first: no async anywhere
- Factory signature: PipelineContext → NodeContext (all 35 factories updated)
- Extension trait + ExtensionFactory with distributed_slice registration
- Per-node resolution in build() before node factory calls

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
jmacd added a commit to jmacd/otel-arrow that referenced this pull request Mar 18, 2026
Add a clean-room extension system with name-based capability binding,
inspired by but stripped from PR open-telemetry#2293 (~540 lines vs ~4,100).

Core design:
- Three-level type hierarchy: InstanceCapabilities (per extension instance)
  → ExtensionRegistry (per pipeline) → Capabilities (per node, TypeId-keyed)
- CapabilitySlot: Shared (Arc<dyn Trait>, cross-core) vs Local (factory
  producing Box<dyn Trait>, one per core)
- Capabilities accessible via PipelineContext.capabilities() — no factory
  signature changes needed

Config model:
- extensions: section in PipelineConfig declares instances by name
- nodes: each node has capabilities: mapping capability_name → instance_name
- Two nodes can bind the same capability to different extension instances

Key properties:
- Sync-first: no async anywhere
- Zero factory signature changes: capabilities flow through PipelineContext
- Extension trait + ExtensionFactory with distributed_slice registration
- Per-node resolution in build() before node factory calls

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jmacd jmacd added this pull request to the merge queue Mar 18, 2026
Merged via the queue into open-telemetry:main with commit cd545fa Mar 18, 2026
69 checks passed

Labels

rust Pull requests that update Rust code

Projects

Status: Done


5 participants