
Service instance deploy state, errors #2034

Merged
merged 21 commits into master from fix/service-instance-deploy-errors on Apr 5, 2017

Conversation

@SpComb (Contributor) commented Mar 30, 2017

Fixes #2032, fixes #1907

If creating the Docker container for a service instance failed during deployment, previous versions of Kontena would only log the errors on the agent, and they would not be reported back to the server or shown in the CLI. With #1607, the agent /service_pods/create errors would be reported back to the server, but the GridServiceDeployer would not return them to the CLI (#1907). With #1873, the agent ServicePodWorker no longer reported the Docker API errors back to the server, and would crash instead (#2032).

This PR provides end-to-end per-instance error handling for service deployments: the agent reports ServiceInstance.error states back to the server, the server tracks the GridServiceInstanceDeploy.error state, the API exposes the per-instance states via /v1/services/:grid/:stack/:service/deploys/:deploy {"instance_deploys": {"error": ...}}, and the CLI can render those (WIP).

The ServiceInstance now also has a new error field. It is updated independently of the state field: a new deploy_rev with desired_state: stopped might still result in state: running with error: failed to stop.... Later, the agent might independently re-apply that deploy_rev and update it to state: stopped, error: null.
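
For orientation, here is a rough sketch of the server-side instance model with the new field. This is a sketch only, assuming the server's Mongoid models; the class name follows the grid_service_instances association used further down, and the field types and comments are guesses, not verbatim code.

    require 'mongoid'

    # Sketch only: the real model lives in the Kontena server codebase.
    class GridServiceInstance
      include Mongoid::Document

      field :instance_number, type: Integer
      field :host_node_id, type: BSON::ObjectId
      field :desired_state, type: String # e.g. 'running', 'stopped'
      field :deploy_rev, type: String    # set by the deployer, echoed back by the agent
      field :rev, type: String           # last deploy_rev applied by the agent
      field :state, type: String         # last state reported by the agent
      field :error, type: String         # new in this PR: last agent error, or nil
    end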

Agent

  • Refactor ServicePodWorker#apply to handle the ensure_desired_state success and error cases, protocol state and actor lifecycle

  • Simplify @prev_state handling to just compare against the complete state update in sync_state_to_master (see the sketch below)

    The agent will also sync state to the master if the error state changes, e.g. on a later update cycle.
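
A minimal sketch of that simplification follows. This is stand-in code, not the agent's actual ServicePodWorker; the collaborators and the exact shape of the state hash are assumptions based on the diff hunks quoted in the review below.

    # Sketch only: rpc_client and node stand in for the agent's real collaborators.
    class StateSyncSketch
      def initialize(rpc_client, node)
        @rpc_client = rpc_client
        @node = node
        @prev_state = nil
      end

      # Build the complete state update (now including the error field) and only
      # notify the master when anything in it has changed since the last sync.
      def sync_state_to_master(service_pod, current_state, error = nil)
        state = {
          service_id: service_pod[:service_id],
          instance_number: service_pod[:instance_number],
          rev: service_pod[:deploy_rev],
          state: current_state,
          error: error,
        }
        return if state == @prev_state

        @rpc_client.async.notification('/node_service_pods/set_state', [@node.id, state])
        @prev_state = state
      end
    end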

Server

  • Add a new error field to the ServiceInstance model

  • Add a new GridServiceInstanceDeploy model embedded in the GridServiceDeploy, with per-instance state and errors (see the sketch after this list)

  • Change GridServiceInstanceDeployer#ensure_service_instance to update and wait for the deploy_rev, including when stopping an old instance on a different node

    This ensures that the instance was actually stopped, and it now warns if the instance could not be stopped.

  • Log warnings if unable to stop existing service instance

    INFO -- GridServiceInstanceDeployer: Deploying service instance development/null/redis-2 to node core-02 at 2017-03-30 14:01:30 UTC...
    INFO -- GridServiceInstanceDeployer: Stopping existing service service development/null/redis instance 2 on previous node core-01...
    
    INFO -- GridServiceInstanceDeployer: Deploying service instance development/null/redis-2 to node core-01 at 2017-03-30 14:02:59 UTC...
    INFO -- GridServiceInstanceDeployer: Stopping existing service service development/null/redis instance 2 on previous node core-02...
    WARN -- GridServiceInstanceDeployer: Failed to stop existing service development/null/redis instance 2 on previous node core-02: Host node is offline
    
    WARN -- GridServiceInstanceDeployer: Replacing orphaned service development/null/redis instance 2 on destroyed node
    
  • Refactor the GridServiceInstanceDeployer around the new GridServiceInstanceDeploy model
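
A rough sketch of the new embedded model referenced above. Sketch only: field names other than state and error are guesses based on the instance_deploys API output below, not verbatim server code.

    require 'mongoid'

    # Sketch only: per-instance deploy state embedded under GridServiceDeploy,
    # serialized as the instance_deploys array in the API output below.
    class GridServiceInstanceDeploy
      include Mongoid::Document
      embedded_in :grid_service_deploy

      field :instance_number, type: Integer
      field :host_node_id, type: BSON::ObjectId # rendered as the node name in the API
      field :state, type: String                # e.g. 'ongoing', 'error'
      field :error, type: String
    end

    class GridServiceDeploy
      include Mongoid::Document
      embeds_many :grid_service_instance_deploys
    end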

API

  • Add a new instance_deploys field to the /v1/services/:grid/:stack/:service/deploys/:deploy API

/v1/services/:grid/:stack/:service/deploys/:deploy

{
   "reason" : "GridServiceDeployer::DeployError: one or more instances failed",
   "id" : "58dcf836de3578252b000001",
   "finished_at" : null,
   "state" : "error",
   "instance_deploys" : [
      {
         "state" : "error",
         "instance_number" : 1,
         "node" : "core-02",
         "error" : "GridServiceInstanceDeployer::AgentError: Docker::Error::NotFoundError: No such image: redis:nonexist\n"
      },
      {
         "node" : "core-01",
         "instance_number" : 2,
         "error" : null,
         "state" : "ongoing"
      },
      {
         "state" : "ongoing",
         "error" : null,
         "node" : "core-02",
         "instance_number" : 3
      },
      {
         "state" : "ongoing",
         "error" : null,
         "instance_number" : 4,
         "node" : "core-01"
      }
   ],
   "started_at" : "2017-03-30T12:21:10.801+00:00",
   "created_at" : "2017-03-30T12:21:10.616Z",
   "service_id" : "development/null/redis-fail"
}

CLI

Rough hack for kontena service deploy to report the per-instance errors.

TODO support for kontena stack deploy

kontena service scale redis 4

 [done] Scaling redis to 4 instances      
⊛ Deployed instance development/null/redis-1 to node core-01
⊛ Deployed instance development/null/redis-2 to node core-02
⊛ Deployed instance development/null/redis-3 to node core-01
⊛ Deployed instance development/null/redis-4 to node core-02

kontena service scale redis-fail 2

 [fail] Scaling redis-fail to 1 instances      
 [error] GridServiceDeployer::DeployError: one or more instances failed:
	⊗ Failed to deploy instance development/null/redis-fail-1 to node core-01: GridServiceInstanceDeployer::ServiceError: Docker::Error::NotFoundError: No such image: redis-nonexist:latest

@SpComb changed the title from "[WIP] Service instance deploy errors" to "[WIP] Service instance deploy state, errors" on Mar 30, 2017
@prev_state = current_state

if state != @prev_state
  rpc_client.async.notification('/node_service_pods/set_state', [node.id, state])
Contributor Author

Because this is an async notification, we won't notice if the RPC gets dropped, and the server is left waiting... if this was changed to a rpc_client.request instead, we would get an error if the RPC fails... then we could crash and retry sooner? OTOH, I suppose right now we'll eventually re-send this notification, so it should still make progress ATM.
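
For comparison, the request-based variant being discussed might look roughly like this. Illustrative sketch only: whether the agent's rpc_client exposes a blocking request with this signature is an assumption, and the method name is made up.

    # Sketch only: a request/response RPC surfaces delivery failures, unlike the
    # fire-and-forget notification above.
    def sync_state_via_request(rpc_client, node, state)
      rpc_client.request('/node_service_pods/set_state', [node.id, state])
      @prev_state = state
    rescue => exc
      # With the async notification this failure would go unnoticed; here the
      # worker could crash (and get restarted) or retry the sync sooner.
      warn "set_state failed: #{exc.message}"
      raise
    end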

Contributor

See #2050

Contributor Author

Merged

self.grid_service.set(:deployed_at => deploy_rev)

deploy_futures = []
total_instances = self.instance_count
self.grid_service.grid_service_instances.where(:instance_number.gt => total_instances).destroy
Contributor Author

Perhaps this should only happen after all instances have been deployed...?

Contributor

Yes, I think it's safer that way. Maybe also so that it would send the notification to the nodes

Contributor Author

If we go full-on, then the deploy operation would also wait for those extra instances to get terminated, and also report any failures there...

Contributor

Why after? Imho these should be terminated before the actual deploy (to make room for scheduling).

Contributor Author

Hmm... the scheduler still uses the Container info for scheduling decisions, not the ServiceInstance models? Would need to fix that if you want the scheduler to re-schedule the new instances based on the old instances being gone, plus wait for the containers to terminate and release their resources.


# bail out early if anything fails
if deploy_futures.select{|f| f.ready? }.map{|f| f.value }.any?{|d| d.error? } # raises on any Future exceptions
  raise DeployError, "one or more instances failed"
Contributor Author

There could be a configurable threshold... continue deployment even if fewer than min_instances fail?

Contributor

Maybe min_instances_to_fail = (1 - min_health) * instance_count? So with 0.8 min_health, 20% of the deployed instances could fail.
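
Purely as an illustration of that arithmetic (made-up values, not code from this PR):

    min_health = 0.8
    instance_count = 10

    # Allow up to (1 - min_health) of the instances to fail before aborting the deploy.
    min_instances_to_fail = ((1 - min_health) * instance_count).round  # => 2

    instance_states = %w[error ongoing error success success success success ongoing ongoing success]
    failed = instance_states.count('error')                            # => 2

    abort_deploy = failed > min_instances_to_fail                      # => false, keep deploying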

Contributor

IMHO this could be a deploy_opt: cancel or continue on error. A calculation based on min_health is too magical.

host_node_id: node.id,
deploy_rev: deploy_rev,
desired_state: desired_state,
rev: nil, # reset
Contributor Author

This is problematic because both the ensure_service_instance(..., 'stopped') and ensure_service_instance(..., 'running') happen with the same deploy_rev... Ideally, each ensure_service_instance operation would just set its own deploy_rev: Time.now.utc (XXX: with sub-second precision?), but then I'm not sure what effect that would have on the actual Docker container deploy_rev?

Contributor

Why is that problematic?

Contributor Author

Well, setting it to nil is a workaround for that issue, but it feels slightly wrong... The problematic thing is that we have to do two "deploys" using the same deploy_rev, because the service instance moves across two nodes.

@SpComb added this to the 1.2.0 milestone on Mar 30, 2017
end

self.terminate if service_pod.terminated?
Contributor Author

Should not terminate the actor if the ensure_terminated fails.

Contributor

terminate is a bit confusing, as it's used to terminate the pod, the container and the actor itself.

Contributor Author

The terminate term is from Celluloid, so we would have to rename the service_pods.desired_state :)

BTW, what happens here now is that the server destroys the GridServiceInstance, and the agent signals ServicePodWorker.destroy... then if the docker rm fails, the ServicePodWorker just logs a warning and terminates itself. The Docker container remains running, and the ServicePodManager doesn't pick that up until you restart the agent.

I'm not sure the existing implementation behaves better either... it will crash the ServicePodWorker, and then the ServicePodManager will not restart it.

Contributor Author

Now we stick around if the ensure_terminated fails, so the ServicePodManager will re-try the destroy on the next update loop.
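
Roughly the intended shape of that behavior (sketch only; the method name and the error handling here are assumptions, not the actual ServicePodWorker code):

    # Sketch only: terminate the Celluloid actor only once the container is
    # actually gone, so a failed docker rm leaves the worker alive and the
    # ServicePodManager can re-send the destroy on the next update loop.
    def apply_terminated(service_pod)
      ensure_terminated(service_pod) # assumed to raise if e.g. docker rm fails
      self.terminate if service_pod.terminated?
    rescue => exc
      warn "failed to terminate service pod: #{exc.message}"
      # stay alive; the destroy will be retried on the next update
    end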

@jnummelin (Contributor) commented Apr 5, 2017

I think it's a good idea to stick around and try again on next update loop

@jnummelin (Contributor)

Design-wise looks good IMO

@SpComb (Contributor Author) commented Apr 3, 2017

Still WIP on the CLI kontena stack deploy and possibly better kontena service deploy output, but otherwise ready for review.

@SpComb changed the title from "[WIP] Service instance deploy state, errors" to "Service instance deploy state, errors" on Apr 3, 2017
@SpComb (Contributor Author) commented Apr 5, 2017

Took out the GridServiceDeployer changes, including the part waiting for ongoing instance deploy futures to complete. Fixing the GridServiceDeployer future stuff can be a separate PR, making this easier to review.

@SpComb (Contributor Author) commented Apr 5, 2017

I suggest we merge this with the agent/server/api changes, and the basic kontena service deploy output, and then have a separate PR to improve the CLI stack/service deploy output further.

@jakolehm (Contributor) left a comment

LGTM

@SpComb merged commit 7aec8a5 into master Apr 5, 2017
@SpComb deleted the fix/service-instance-deploy-errors branch April 5, 2017 12:29