Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Start order of extensions is not guaranteed? #8732

Closed
yurishkuro opened this issue Oct 24, 2023 · 15 comments · Fixed by #8768
Closed

Start order of extensions is not guaranteed? #8732

yurishkuro opened this issue Oct 24, 2023 · 15 comments · Fixed by #8768
Assignees
Labels
bug Something isn't working

Comments

@yurishkuro
Copy link
Member

Describe the bug
Maybe I am misreading the situation, but this code iterates over map when starting extensions, so it cannot guarantee in which order they are started:

for extID, ext := range bes.extMap {

Even though the documentation implies that the order is define by the declaration order:

service:
# extensions lists the extensions added to the service. They are started
# in the order presented below and stopped in the reverse order.
extensions: [health_check, pprof, zpages]

I observed some spontaneous failures in my experiments with JaegerV2 which declares extensions with an order expectation: extensions: [jaeger_storage, jaeger_query] but the logs show differently:

2023-10-24T01:17:01.143Z	info	extensions/extensions.go:33	Starting extensions...
2023-10-24T01:17:01.143Z	info	extensions/extensions.go:36	Extension is starting...	{"kind": "extension", "name": "jaeger_query"}

Steps to reproduce
It's random behavior, hard to reproduce reliably.

What did you expect to see?
Extensions should be starting in the order defined in the config.

What did you see instead?
Seeing different order.

What version did you use?
Version: v0.87.0

What config did you use?
https://github.com/jaegertracing/jaeger/blob/28520b31471b36d9e949de4ffaa016c2640a3be7/cmd/jaeger-v2/internal/all-in-one.yaml#L1

Environment
OS: MacOS, Linux
Compiler(if manually compiled): go 1.20.x

@atoulme
Copy link
Contributor

atoulme commented Oct 24, 2023

I'd be happy to work on this one.

@dmitryax
Copy link
Member

I'm not sure this is a bug. I think this was implemented intentionally. All extensions that we have in core and contrib do not depend on each other. This assumption is used in some discussions around config grouping/merging, e.g.: open-telemetry/opentelemetry-collector-contrib#18509 and #8394.

I think we need a strong argument to introduce this behavior.

@yurishkuro do you have an extension that depends on another extension? Can you please let us know why it's needed in your use case?

@yurishkuro
Copy link
Member Author

yurishkuro commented Oct 25, 2023

The use case is explained here: https://docs.google.com/document/d/1s4_6VgAS7qAVp6iEm5KYvpiGw3h2Ja5T5HpKo29iv00/view#heading=h.1ded8prqa1xm

flowchart LR
    E[Exporter] -->|Get storage| S(StorageExtension)
    Q[Jaeger-Query extension] -->|Get storage| S(StorageExtension)

I believe this may also be required by the Encoding extensions (#6272), which is a similar use case.

The alternative to declaration order would be allowing to express dependencies explicitly and then building a dependency graph to control the initialization order.

@yurishkuro
Copy link
Member Author

@djaglowski You mentioned service graph in open-telemetry/opentelemetry-collector-contrib#18509 (comment), is there a way to declare dependencies between extensions into that graph?

@atoulme
Copy link
Contributor

atoulme commented Oct 25, 2023

I believe this may also be required by the Encoding extensions (open-telemetry/opentelemetry-collector-contrib#28686), which is a similar use case.

The encoding extensions don't do much with their lifecycle, since the plan is just to cast them to marshaler/unmarshaler types.

I'm more worried that our current documentation is making a promise it is not keeping. I don't really see the harm in being deterministic in the start/shutdown order, it might pay off to have the pprof extension load before others for example.

@yurishkuro
Copy link
Member Author

The encoding extensions don't do much with their lifecycle, since the plan is just to cast them to marshaler/unmarshaler types.

it's not about their lifecycle, but about other components that depend on them, some of which may also be extensions.

@yurishkuro
Copy link
Member Author

I don't really see the harm in being deterministic in the start/shutdown order, it might pay off to have the pprof extension load before others for example.

I agree. Current non-determinism leads to difficult to troubleshoot bugs - it took me a while to realize that this was the root cause of the failures I encountered.

@djaglowski
Copy link
Member

@djaglowski You mentioned service graph in open-telemetry/opentelemetry-collector-contrib#18509 (comment), is there a way to declare dependencies between extensions into that graph?

Not currently. The relationships between pipeline components are very rigidly defined, but the way extensions are used by pipeline components is more of a convention. e.g. the component defines a parameter called storage and expects to find an extension with that ID. We could in theory formalize this but it would be quite some work and it's not clear to me that it's necessary.

I agree extensions should be started deterministically. Since the current behavior is non-deterministic, users already must implicitly accept that extensions may start in any order, including the order we would choose deterministically.

Would it be sufficient to start extensions in order, then start pipeline components as we do now? Stopping would be the opposite - pipeline components, then extensions in reverse order.

@yurishkuro
Copy link
Member Author

Would it be sufficient to start extensions in order, then start pipeline components as we do now? Stopping would be the opposite - pipeline components, then extensions in reverse order.

That would work for me, and what the open PR is doing.

@yurishkuro
Copy link
Member Author

Just to hold off until discussion in #8732 is resolved

@dmitryax since you blocked the PR, could you explain what form of resolution you expect to happen?

@dmitryax
Copy link
Member

dmitryax commented Oct 26, 2023

The use case is explained here: https://docs.google.com/document/d/1s4_6VgAS7qAVp6iEm5KYvpiGw3h2Ja5T5HpKo29iv00/view#heading=h.1ded8prqa1xm

If users would have to point the jaeger_query extension and the exporter to the same storage extension, maybe it can be just one extension (jaeger_query or any other name) implementing both serving the query path and the storage interface that exporters can use. The storage part of the extension can be configurable to use any storage under the hood and provide it to the exporter via the storage interface. I believe, that way, we would avoid users misconfiguring exporter and the jaeger_query extension to use different storages. WDYT?

I agree. Current non-determinism leads to difficult to troubleshoot bugs - it took me a while to realize that this was the root cause of the failures I encountered.

I'm just worried that by introducing this behavior, we limit possible options for grouping/extending configs which are currently being discussed in different issues. Also, if we accept this change and decide to integrate the extensions into the pipeline graph later, it'll be a breaking change for the users expecting the order of initialization according to the config definition. And given that that order is currently not applicable to receivers and exporters, this change doesn't bring any consistency.

@yurishkuro
Copy link
Member Author

maybe it can be just one extension

That doesn't work well. First, traditional jaeger-collector binary has no UI parts, and the intent is to be able to replicate it with jaeger-v2 by configuring the binary without the UI. Being forced into using only one extension means we have to bundle a lot of functionality into a single monolith module. That's rather antithetical to the OTel Collector modular design.

And given that that order is currently not applicable to receivers and exporters

What about processors, isn't their order enforced?

@dmitryax
Copy link
Member

First, traditional jaeger-collector binary has no UI parts, and the intent is to be able to replicate it with jaeger-v2 by configuring the binary without the UI. Being forced into using only one extension means we have to bundle a lot of functionality into a single monolith module. That's rather antithetical to the OTel Collector modular design.

I was talking only about embedding storage extension interface into the jaeger_query extension. It doesn't require bringing UI, right? They still should be different Go modules, but do they need to be exposed to the end users as different components?

What about processors, isn't their order enforced?

It is, but as part of the pipeline graph. Extensions are not part of the graph.

@yurishkuro
Copy link
Member Author

I was talking only about embedding storage extension interface into the jaeger_query extension.

jaeger_query and UI are the same thing, in practice. If they weren't, we'd again have a dependency of one extension on another, because neither of those can be expressed as another component type.

What about processors, isn't their order enforced?
It is, but as part of the pipeline graph. Extensions are not part of the graph.

What I mean is - isn't the order of processors also determined by their declaration order in the config? afaik there is no other way to declare dependencies between processors.

Frankly, having a graph-based mechanism of ordering would definitely be more preferable for extensions than config-based. When constructing an extension I would specify that it depends on another extension and the runtime would make sure they are started in the correct order (I think it's fine to create them in any order).

@dmitryax
Copy link
Member

I just wanted to confirm that exposing the extensions as different components in the configuration interface is useful for Jaeger end-users. If it is, I'm ok with merging the PR. Sooner or later, we eventually might run into other extensions with interdependencies.

dmitryax pushed a commit that referenced this issue Nov 2, 2023
**Description**:
Enforce order of start and shutdown of extensions according to their
internally declared dependencies

**Link to tracking Issue**:
Resolves #8732

**Motivation**:
This is an alternative approach to #8733 which uses declaration order in
the config to start extensions. That approach (a) enforces order when
it's not always necessary to enforce, and (b) exposes unnecessary
complexity to the user by making them responsible for the order.

This PR instead derives the desired order of extensions based on the
dependencies they declare by implementing a `DependentExtension`
interface. That means that extensions that must depend on others can
expose this interface and be guaranteed to start after their
dependencies, while other extensions can be started in arbitrary order
(same as happens today because of iterating over a map).

The extensions that have dependencies have two options to expose them:
1. if the dependency is always static (e.g. `jaeger_query` extension
depending on `jaeger_storage` as in the OP), the extension can express
this statically as well, by returning a predefined ID of the dependent
extension
2. in cases where dependencies are dynamic, the extension can read the
names of the dependencies from its configuration.

The 2nd scenario is illustrated by the following configuration. Here
each complex extension knows that it needs dependencies that implement
`storage` and `encoding` interfaces (both existing APIs in collector &
contrib), but does not know statically which instances of those, the
actual names are supplied by the user in the configuration.

```yaml
extensions:
  complex_extension_1:
    storage: filestorage
    encoding: otlpencoding
  complex_extension_2:
    storage: dbstorage
    encoding: jsonencoding
  filestorage:
    ...
  dbstorage:
    ...
  otlpencoding:
  jsonencoding:
```

**Changes**:
* Introduce `DependentExtension` optional interface 
* Change `Extensions` constructor to derive the required order using a
directed graph (similar to pipelines)
* Inherited from #8733 - use new ordered list of IDs to
start/stop/notify extensions in the desired order (previously a map was
used to iterate over, which resulted in random order).
* Tests

**Testing**:
Unit tests

---------

Signed-off-by: Yuri Shkuro <github@ysh.us>
Co-authored-by: Antoine Toulme <antoine@lunar-ocean.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants