Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

services: stop moduleService underlying service after timing out on AwaitRunning #409

Merged
merged 2 commits into from
Oct 21, 2023

Conversation

dimitarvdimitrov
Copy link
Contributor

Context

We discovered this in the Mimir ruler which can have a subservice which takes a really long time to fully start (order of minutes). If the process start is interrupted during that time, then that service remains running until the process is killed.

Description of the bug

When moduleService is interrupted during startup it immediately enters a Failed state. This means the services manager doesn't attempt to shut it down because it couldn't even start. This is a problem because

  1. the moduleService delegates to another services.Service the lifecycle of the underlying service; this means that the state of moduleService may not match that of the underlying service
  2. the moduleService can fail its own startup if it couldn't wait for the underlying service to properly start; this means that the service can start properly eventually, but its parent moduleService shows as Failed; this prevents from shutting down the subservice; so the goroutines continue running until the process stops

Fix

This PR introduces two changes to address this

  1. When the startup procedure is interrupted when waiting for the underlying service, assume the underlying service will later complete its startup and return no errors.
  2. When stopping the moduleService also account for services which are still left in their starting state. We still want to signal to them to stop.

Checklist

  • Tests updated
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

…waitRunning

### Context
We discovered this in the Mimir ruler which can have a subservice which takes a really long time to fully start (order of minutes). If the start is interrupted, that service remains running until the process is terminated.

### Description of the bug
When `moduleService` is interrupted during startup it immediately enters a Failed state. This means the services manager doesn't attempt to shut it down because it couldn't even start. This is a problem because

1. the moduleService delegates to another services.Service the lifecycle of the underlying service; this means that the state of moduleService may not match that of the underlying service
2. the moduleService can fail its own startup if it couldn't wait for the underlying service to properly start; this means that the service can start properly eventually, but its parent moduleService shows as Failed; this prevents from shutting down the subservice; so the goroutines continue running until the process stops

### Fix

This PR introduces two changes to address this
1. When the startup procedure is interrupted when waiting for the underlying service, assume the underlying service will later complete its startup and return no errors.
2. When stopping the moduleService also account for services which are still left in their starting state. We still want to signal to them to stop.
@aknuds1 aknuds1 requested a review from a team October 18, 2023 06:46
Copy link

@flxbk flxbk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@pracucci pracucci self-requested a review October 18, 2023 11:53
fallthrough
case services.Starting:
// If upstream services can tolerate this service being failed and/or not started, and they have already started,
// they should track the state of this service and change state accordingly.
level.Debug(w.logger).Log("msg", "stopping", "module", w.name)

err = services.StopAndAwaitTerminated(context.Background(), w.service)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean we could end up calling the stop function while the start is still running? I think that would be violation of the contract.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Dimitar on Slack:

regarding your comment: since we’re using the services.BasicService abstraction to start and stop it will manage the state of the underlying w.service. When we call StopAndAwaitTerminated, this calls w.service.StopAsync() . Then if w.service is still running its startingFn, the context to that will be cancelled. And depending on the return error of that startingFn the BasicService will either run stoppingFn or will exit directly

Copy link
Contributor

@pracucci pracucci Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that using the BasicService implementation your change is safe. By contract I meant the service design documented here. However, I can't find this specific case documented, but it's actually documented in the Service interface:

	// StopAsync tell the service to stop. This method doesn't block and can be called multiple times.
	// If Service is New, it is Terminated without having been started nor stopped.
	// If Service is in Starting or Running state, this initiates shutdown and returns immediately.
	// If Service has already been stopped, this method returns immediately, without taking action.
	StopAsync()

Specifically this:

If Service is in Starting or Running state, this initiates shutdown and returns immediately.

So all good for me.

@dimitarvdimitrov dimitarvdimitrov merged commit 736bca5 into main Oct 21, 2023
3 checks passed
@dimitarvdimitrov dimitarvdimitrov deleted the dimitar/moduleService-shutdown branch October 21, 2023 09:53
ying-jeanne pushed a commit that referenced this pull request Nov 2, 2023
…waitRunning (#409)

* services: stop moduleService underlying service after timing out on AwaitRunning

### Context
We discovered this in the Mimir ruler which can have a subservice which takes a really long time to fully start (order of minutes). If the start is interrupted, that service remains running until the process is terminated.

### Description of the bug
When `moduleService` is interrupted during startup it immediately enters a Failed state. This means the services manager doesn't attempt to shut it down because it couldn't even start. This is a problem because

1. the moduleService delegates to another services.Service the lifecycle of the underlying service; this means that the state of moduleService may not match that of the underlying service
2. the moduleService can fail its own startup if it couldn't wait for the underlying service to properly start; this means that the service can start properly eventually, but its parent moduleService shows as Failed; this prevents from shutting down the subservice; so the goroutines continue running until the process stops

### Fix

This PR introduces two changes to address this
1. When the startup procedure is interrupted when waiting for the underlying service, assume the underlying service will later complete its startup and return no errors.
2. When stopping the moduleService also account for services which are still left in their starting state. We still want to signal to them to stop.

* Add CHANGELOG.md
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants