Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Service] Partial start can cause panic #6507

Open
MovieStoreGuy opened this issue Nov 8, 2022 · 9 comments
Open

[Service] Partial start can cause panic #6507

MovieStoreGuy opened this issue Nov 8, 2022 · 9 comments
Labels
area: management Collector lifecycle management area:service bug Something isn't working

Comments

@MovieStoreGuy
Copy link
Contributor

Describe the bug

On service start up, if a component was to error, the start up process is stopped and then the shutdown method is invoked.

If a component requires Start to be called in order to populate internal values (ie, storing a reference to a context.CancelFunc), the service.Shutdown will panic and the original error is lost.

Steps to reproduce
If possible, provide a recipe for reproducing the error.

Have two components, one that errors on start up and another that requires Start to be called for the Shutdown method to be valid.

For example the prometheus receiver requires Start to be called in order for Shutdown to be valid.
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/prometheusreceiver/metrics_receiver.go#L307-L312

However, I believe that is a seperate concern.

What did you expect to see?

I think there is a few things that are part of this is clarify the expected behaviour for component.Componet to include if Shutdown can be called without a call to Start.

I think resolve that expectation will dictate what would be the following steps.

What did you see instead?

As the bug describes, it is possible for a graceful shutdown to panic.

What version did you use?
Version: v0.63.0

However, this impacts all versions I believe

What config did you use?
NA

Environment
OS: Darwin/arm & linux/arm (within a docker container)
Compiler(if manually compiled): go 1.18.4 and go 1.19.2

@bogdandrutu
Copy link
Member

On service start up, if a component was to error, the start up process is stopped and then the shutdown method is invoked.

Should we call "Shutdown" only on the components that we started?

@MovieStoreGuy
Copy link
Contributor Author

I know we had spoken about this over slack but just to recap:

Should we call "Shutdown" only on the components that we started?

I don't think we should, mainly because that would imply that the service is managing that state and I think would over complicate cognitive model, and potential break the OpAMP.

However, I do think that following the example of the behaviour for http.Server as Shutdown is safe to call regardless of starting however, once Shutdown correctly, nil is returned along with any repeated called to Shutdown and if the shutdown errors then the original error is returned for each repeated calls.

What I would expect as a definition of done:

  • The existing components description is updated to clarify behaviour
  • A component helper method to wrap this functionality for the components
  • Updated lifecycle tests to include validating shutting down a component is safe without being started

cc: @tigrannajaryan since this may impact work with OpAMP

@tigrannajaryan
Copy link
Member

I don't think we should, mainly because that would imply that the service is managing that state and I think would over complicate cognitive model, and potential break the OpAMP.

@MovieStoreGuy can you expand on this? How can this break the OpAMP?

@MovieStoreGuy
Copy link
Contributor Author

MovieStoreGuy commented Nov 16, 2022

I don't think we should, mainly because that would imply that the service is managing that state and I think would over complicate cognitive model, and potential break the OpAMP.

@MovieStoreGuy can you expand on this? How can this break the OpAMP?

My concern was that if we start mandating that components are not restartable, then OpAMP may have to change it works managing components.

However, I believe we should keep restartable components however add in that Shutdown can be called independent from Start which I don't think is a problem but I more wanted to confirm that adding in this requirement of a component wouldn't impact the project.

@tigrannajaryan
Copy link
Member

We do not necessarily need the component instances to be restartable. The current restart logic is that we shutdown all components and create new components before we start them.

This seems to be a cleaner solution since it removes the need to rely on component implementations being restartable.

Independently from this, we need to decide who is responsible for tracking that Start was called and didn't fail so that Shutdown must be called for that component.

I am reluctant to make this a component responsibility. If any component implement this state tracking incorrectly we will have a problem.

I would prefer that the Service keeps track of Start-ed components and calls their Shutdown (only if Start succeeded). We can do this in one place in the Service and will make sure it it works correctly for all components.

Not a strong opinion, but this seems to be a better option to me.

@cpheps
Copy link
Contributor

cpheps commented Jan 4, 2023

@MovieStoreGuy I wanted to check in on that status of this issue as it's been open for a few months. We've run into this quite a few times with our distribution of the collector especially using the hostmetrics receiver. We might be able to lend some help if need be.

@djaglowski
Copy link
Member

djaglowski commented Jan 10, 2023

We do not necessarily need the component instances to be restartable. The current restart logic is that we shutdown all components and create new components before we start them.

This seems to be a cleaner solution since it removes the need to rely on component implementations being restartable.

I agree with this. Restartability could introduce a lot of complexity. I would prefer we stick with this approach for now. The main downside of this approach is that state can be lost, but I think we can solve this in the future by supporting a notion of passing state to subsequent instances of a component. In any case, I think restartability is a separate concern from the original issue.

@djaglowski
Copy link
Member

djaglowski commented Jan 10, 2023

It seems we've already done a few things to address this. (Clarification added in #6535, and tests that call Shutdown without calling Start)

However, I'm not convinced we're addressing a partial start, as called out in the original issue. When Start fails, it may leave a component in an unusual state. This may just be a correctness concern for each individual implementation, but I'm wondering if we should identify a general strategy that is more resilient.

As a simple example, consider:

type MyComp struct {
    statefulThingOne
    statefulThingTwo
}

func (c *MyComp) Start(...) error {
    c.statefulThingOne := newState(...)
    c.statefulThingTwo, err := mightFail(...) // returns nil, err in case of failure
    return err
}

func (c *MyComp) Shutdown(...) error {
    c.statefulThingOne.Stop() 
    c.statefulThingTwo.Stop() // oops, nil pointer
}

If an error returned from Start means that we do not to call Shutdown, then isn't there potentially some partial state can remain problematic? Could this reserve a resource that interferes with subsequent attempts to start a new instance of the component, or just cause a leak?

If on the other hand, we call Shutdown function regardless of the result of Start, then each component must correctly account for any possible partial state. Again, this is largely a matter of correctness, but given the issues we've observed already, I think it may be worth looking for ways to protect against these cases.

@atoulme
Copy link
Contributor

atoulme commented Dec 19, 2023

This is a pattern we can use to support partial starts:
open-telemetry/opentelemetry-collector-contrib#30017

I am finding out that the tests we have in otelcontribcol do not do a good job catching all issues with a Shutdown call because they don't account for errors returned on creating the component.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: management Collector lifecycle management area:service bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants