-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
services: stop moduleService underlying service after timing out on AwaitRunning #409
Conversation
…waitRunning ### Context We discovered this in the Mimir ruler which can have a subservice which takes a really long time to fully start (order of minutes). If the start is interrupted, that service remains running until the process is terminated. ### Description of the bug When `moduleService` is interrupted during startup it immediately enters a Failed state. This means the services manager doesn't attempt to shut it down because it couldn't even start. This is a problem because 1. the moduleService delegates to another services.Service the lifecycle of the underlying service; this means that the state of moduleService may not match that of the underlying service 2. the moduleService can fail its own startup if it couldn't wait for the underlying service to properly start; this means that the service can start properly eventually, but its parent moduleService shows as Failed; this prevents from shutting down the subservice; so the goroutines continue running until the process stops ### Fix This PR introduces two changes to address this 1. When the startup procedure is interrupted when waiting for the underlying service, assume the underlying service will later complete its startup and return no errors. 2. When stopping the moduleService also account for services which are still left in their starting state. We still want to signal to them to stop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
fallthrough | ||
case services.Starting: | ||
// If upstream services can tolerate this service being failed and/or not started, and they have already started, | ||
// they should track the state of this service and change state accordingly. | ||
level.Debug(w.logger).Log("msg", "stopping", "module", w.name) | ||
|
||
err = services.StopAndAwaitTerminated(context.Background(), w.service) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean we could end up calling the stop function while the start is still running? I think that would be violation of the contract.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From Dimitar on Slack:
regarding your comment: since we’re using the services.BasicService abstraction to start and stop it will manage the state of the underlying w.service. When we call StopAndAwaitTerminated, this calls w.service.StopAsync() . Then if w.service is still running its startingFn, the context to that will be cancelled. And depending on the return error of that startingFn the BasicService will either run stoppingFn or will exit directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that using the BasicService
implementation your change is safe. By contract I meant the service design documented here. However, I can't find this specific case documented, but it's actually documented in the Service
interface:
// StopAsync tell the service to stop. This method doesn't block and can be called multiple times.
// If Service is New, it is Terminated without having been started nor stopped.
// If Service is in Starting or Running state, this initiates shutdown and returns immediately.
// If Service has already been stopped, this method returns immediately, without taking action.
StopAsync()
Specifically this:
If Service is in Starting or Running state, this initiates shutdown and returns immediately.
So all good for me.
…waitRunning (#409) * services: stop moduleService underlying service after timing out on AwaitRunning ### Context We discovered this in the Mimir ruler which can have a subservice which takes a really long time to fully start (order of minutes). If the start is interrupted, that service remains running until the process is terminated. ### Description of the bug When `moduleService` is interrupted during startup it immediately enters a Failed state. This means the services manager doesn't attempt to shut it down because it couldn't even start. This is a problem because 1. the moduleService delegates to another services.Service the lifecycle of the underlying service; this means that the state of moduleService may not match that of the underlying service 2. the moduleService can fail its own startup if it couldn't wait for the underlying service to properly start; this means that the service can start properly eventually, but its parent moduleService shows as Failed; this prevents from shutting down the subservice; so the goroutines continue running until the process stops ### Fix This PR introduces two changes to address this 1. When the startup procedure is interrupted when waiting for the underlying service, assume the underlying service will later complete its startup and return no errors. 2. When stopping the moduleService also account for services which are still left in their starting state. We still want to signal to them to stop. * Add CHANGELOG.md
Context
We discovered this in the Mimir ruler which can have a subservice which takes a really long time to fully start (order of minutes). If the process start is interrupted during that time, then that service remains running until the process is killed.
Description of the bug
When
moduleService
is interrupted during startup it immediately enters a Failed state. This means the services manager doesn't attempt to shut it down because it couldn't even start. This is a problem becauseFix
This PR introduces two changes to address this
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]