Service instance deploy state, errors #2034
Conversation
@prev_state = current_state

if state != @prev_state
  rpc_client.async.notification('/node_service_pods/set_state', [node.id, state])
Because this is an async notification, we won't notice if the RPC gets dropped, and the server is left waiting... If this was changed to an `rpc_client.request` instead, we would get an error if the RPC fails, and we could crash and retry sooner? OTOH, I suppose right now we'll eventually re-send this notification, so it should still make progress ATM.
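A minimal sketch of the two approaches being contrasted, assuming `rpc_client.request` takes the same path-and-params arguments as the notification call above and raises on failure; the rescue handling is illustrative, not the PR's actual code:

```ruby
# Fire-and-forget: if the RPC gets dropped, the agent never notices.
rpc_client.async.notification('/node_service_pods/set_state', [node.id, state])

# Request/response: a dropped or failed RPC surfaces as an error here,
# so the caller could crash and retry sooner instead of waiting for the
# next time this notification happens to be re-sent.
begin
  rpc_client.request('/node_service_pods/set_state', [node.id, state])
rescue => exc
  warn "set_state request failed: #{exc.message}"
  raise
end
```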
See #2050
Merged
self.grid_service.set(:deployed_at => deploy_rev)

deploy_futures = []
total_instances = self.instance_count
self.grid_service.grid_service_instances.where(:instance_number.gt => total_instances).destroy
Perhaps this should only happen after all instances have been deployed...?
Yes, I think it's safer that way. Maybe also so that it sends the notification to the nodes.
If we go full-on, then the deploy operation would also wait for those extra instances to get terminated, and also report any failures there...
Why after? Imho these should be terminated before the actual deploy (to make room for scheduling).
Hmm... the scheduler still uses the `Container` info for scheduling decisions, not the `ServiceInstance` models? Would need to fix that if you want the scheduler to re-schedule the new instances based on the old instances being gone, plus wait for the containers to terminate and release their resources.
…ng update would confuse new host node running update
# bail out early if anything fails
if deploy_futures.select{|f| f.ready? }.map{|f| f.value }.any?{|d| d.error? } # raises on any Future exceptions
  raise DeployError, "one or more instances failed"
There could be a configurable threshold... continue deployment even if fewer than `min_instances` fail?
Maybe `min_instances_to_fail = (1 - min_health) * instance_count`? So with 0.8 min health, 20% of the deployed instances can fail.
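A minimal sketch of that calculation applied to the bail-out check quoted above; `min_health` and the rounding policy are assumptions here, not what the PR implements:

```ruby
# Illustrative: tolerate up to (1 - min_health) of the instances failing
# before aborting the deploy, e.g. min_health 0.8 => 20% may fail.
allowed_failures = instance_count - (min_health * instance_count).round

failed_count = deploy_futures.select{|f| f.ready? }.map{|f| f.value }.count{|d| d.error? }

if failed_count > allowed_failures
  raise DeployError, "#{failed_count}/#{instance_count} instances failed (at most #{allowed_failures} tolerated)"
end
```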
Imho this could be a deploy_opt: cancel or continue on error. A calculation based on min_health is too magical.
host_node_id: node.id,
deploy_rev: deploy_rev,
desired_state: desired_state,
rev: nil, # reset
This is problematic because both the `ensure_service_instance(..., 'stopped')` and `ensure_service_instance(..., 'running')` happen with the same `deploy_rev`... Ideally, each `ensure_service_instance` operation would just set its own `deploy_rev: Time.now.utc` (XXX: with sub-second precision?), but then I'm not sure what effect that would have on the actual Docker container `deploy_rev`?
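A minimal sketch of the per-operation `deploy_rev` idea floated above, assuming a plain UTC timestamp string with sub-second precision would be acceptable as a rev; this is only an illustration of the suggestion, not the PR's behaviour:

```ruby
# Illustrative: each ensure_service_instance operation gets its own rev, so the
# 'stopped' and 'running' deploys of a moving instance don't share one.
deploy_rev = Time.now.utc.strftime('%F %T.%6N %z')
```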
Why is that problematic?
Well, setting it to `nil` is a workaround for that issue, but it feels slightly wrong... The problematic thing is that we have to do two "deploys" using the same `deploy_rev`, because the service instance moves across two nodes.
end

self.terminate if service_pod.terminated?
Should not terminate the actor if the `ensure_terminated` fails.
`terminate` is a bit confusing as it's used to terminate the pod, the container and the actor itself.
The `terminate` term is from Celluloid, so we would have to rename the `service_pods.desired_state` :)

BTW, what happens here now is that the server destroys the `GridServiceInstance`, and the agent signals `ServicePodWorker.destroy`... then if the `docker rm` fails, the `ServicePodWorker` just logs a warning and terminates itself. The Docker container remains running, and the `ServicePodManager` doesn't pick that up until you restart the agent.

I'm not sure the existing implementation behaves better either... it will crash the `ServicePodWorker`, and then the `ServicePodManager` will not restart it.
Now we stick around if the `ensure_terminated` fails, so the `ServicePodManager` will re-try the `destroy` on the next update loop.
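A minimal sketch of that behaviour, assuming the worker's destroy path calls `ensure_terminated` and only terminates the Celluloid actor on success; names and structure here are illustrative, not the PR's exact code:

```ruby
def destroy
  ensure_terminated(service_pod)   # e.g. docker rm; raises if the container can't be removed
  sync_state_to_master
  self.terminate                   # success: the actor can go away
rescue => exc
  # Failure: log and stick around, so the ServicePodManager re-tries the
  # destroy on its next update loop instead of leaving an orphaned container.
  warn "failed to terminate service pod: #{exc.message}"
end
```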
I think it's a good idea to stick around and try again on the next update loop.
Design-wise looks good IMO
Still WIP on the CLI
Took out the …
I suggest we merge this with the agent/server/api changes, and the basic …
LGTM
Fixes #2032, fixes #1907
If creating the Docker container for a service instance failed during deployment, previous versions of Kontena would only log the errors on the agent, and they would not be reported back to the server or shown in the CLI. With #1607, the agent `/service_pods/create` errors would be reported back to the server, but the `GridServiceDeployer` would not return them to the CLI (#1907). With #1873, the agent `ServicePodWorker` no longer reported the Docker API errors back to the server, and would crash instead (#2032).

This PR provides end-to-end per-instance error handling for service deployments: the agent reports `ServiceInstance.error` states back to the server, the server tracks the `GridServiceInstanceDeploy.error` state, the API includes the `/v1/services/:grid/:stack/:service/deploys/:deploy` `{"instance_deploys": {"error": ...}}` states, and the CLI can render those (WIP).

The `ServiceInstance` now also has a new `error` field. This is updated independently of the `state` field: a new `deploy_rev` with a `desired_state: stopped` might still result in `state: running` with `error: failed to stop...`. Then later, the agent might independently re-apply and update that `deploy_rev` to `state: stopped`, `error: null`.

Agent
- Refactor `ServicePodWorker#apply` to handle the `ensure_desired_state` success and error cases, protocol state and actor lifecycle
- Simplify `@prev_state` to just compare on the complete state update in `sync_state_to_master` (see the sketch below)

The agent will also sync state to the master if the `error` state changes, like on a later `update` cycle.
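A minimal sketch of that simplified comparison, assuming a `current_state` helper that builds the complete state update (including `error`) as in the diff excerpt near the top of this thread; the method and variable names here are illustrative:

```ruby
def sync_state_to_master(service_pod, error = nil)
  state = current_state(service_pod, error)

  # Compare the complete state update, so a changed error (and not just a
  # changed state) also triggers a notification to the master.
  if state != @prev_state
    rpc_client.async.notification('/node_service_pods/set_state', [node.id, state])
    @prev_state = state
  end
end
```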
Server

- Add a new `error` field to the `ServiceInstance` model
- Add a new `GridServiceInstanceDeploy` model embedded in the `GridServiceDeploy`, with per-instance state and errors (see the sketch below)
- Change `GridServiceInstanceDeployer#ensure_service_instance` to update and wait for the `deploy_rev`, also when stopping an old instance on a different node
  - This ensures that the instance was actually stopped. This also now warns if it was unable to stop it.
- Log warnings if unable to stop existing service instance
- Refactor the `GridServiceInstanceDeployer` around the new `GridServiceInstanceDeploy` model
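A minimal sketch of what such an embedded per-instance deploy model might look like, assuming Mongoid (as used by the server models elsewhere in this PR's diff); apart from the per-instance `state` and `error` described above, the field names are illustrative rather than the PR's actual schema:

```ruby
require 'mongoid'

# Illustrative: one embedded document per deployed instance, tracking its
# deploy state and any error reported back by the agent.
class GridServiceInstanceDeploy
  include Mongoid::Document
  include Mongoid::Timestamps

  embedded_in :grid_service_deploy

  field :instance_number, type: Integer
  field :state, type: Symbol, default: :created   # e.g. :created, :ongoing, :success, :error
  field :error, type: String
end

class GridServiceDeploy
  include Mongoid::Document

  embeds_many :grid_service_instance_deploys
end
```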
API

- `instance_deploys` field to the `/v1/services/:grid/:stack/:service/deploys/:deploy` API
- `/v1/services/:grid/:stack/:service/deploys/:deploy`
CLI

- Rough hack for `kontena service deploy` to report the per-instance errors
- TODO: support for `kontena stack deploy`

`kontena service scale redis 4`
`kontena service scale redis-fail 2`