
Service instance deploy state, errors #2034

Merged
merged 21 commits into master from fix/service-instance-deploy-errors on Apr 5, 2017

Conversation

@SpComb (Contributor) commented Mar 30, 2017

Fixes #2032, fixes #1907

If creating the Docker container for a service instance failed during deployment, previous versions of Kontena would only log the errors on the agent, and they would not be reported back to the server or shown in the CLI. With #1607, the agent /service_pods/create errors would be reported back to the server, but the GridServiceDeployer would not return them to the CLI (#1907). With #1873, the agent ServicePodWorker no longer reported the Docker API errors back to the server, and would crash instead (#2032).

This PR provides end-to-end per-instance error handling for service deployments: the agent reports ServiceInstance.error states back to the server, the server tracks the GridServiceInstanceDeploy.error state, the API exposes the per-instance states via /v1/services/:grid/:stack/:service/deploys/:deploy {"instance_deploys": {"error": ...}}, and the CLI can render those (WIP).

The ServiceInstance now also has a new error field. It is updated independently of the state field: a new deploy_rev with desired_state: stopped might still result in state: running with error: failed to stop.... Later, the agent might independently re-apply that deploy_rev and update it to state: stopped, error: null.
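
For orientation, here is a rough sketch of the server-side instance model with the new field. This is a sketch only, assuming the server's Mongoid models; the class name follows the grid_service_instances association used further down, and the field types and comments are guesses, not verbatim code.

    require 'mongoid'

    # Sketch only: the real model lives in the Kontena server codebase.
    class GridServiceInstance
      include Mongoid::Document

      field :instance_number, type: Integer
      field :host_node_id, type: BSON::ObjectId
      field :desired_state, type: String # e.g. 'running', 'stopped'
      field :deploy_rev, type: String    # set by the deployer, echoed back by the agent
      field :rev, type: String           # last deploy_rev applied by the agent
      field :state, type: String         # last state reported by the agent
      field :error, type: String         # new in this PR: last agent error, or nil
    end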

Agent

  • Refactor ServicePodWorker#apply to handle the ensure_desired_state success and error cases, protocol state and actor lifecycle

  • Simplify @prev_state handling to just compare against the complete state update in sync_state_to_master (see the sketch below)

    The agent will also sync state to the master if the error state changes, e.g. on a later update cycle.
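
A minimal sketch of that simplification follows. This is stand-in code, not the agent's actual ServicePodWorker; the collaborators and the exact shape of the state hash are assumptions based on the diff hunks quoted in the review below.

    # Sketch only: rpc_client and node stand in for the agent's real collaborators.
    class StateSyncSketch
      def initialize(rpc_client, node)
        @rpc_client = rpc_client
        @node = node
        @prev_state = nil
      end

      # Build the complete state update (now including the error field) and only
      # notify the master when anything in it has changed since the last sync.
      def sync_state_to_master(service_pod, current_state, error = nil)
        state = {
          service_id: service_pod[:service_id],
          instance_number: service_pod[:instance_number],
          rev: service_pod[:deploy_rev],
          state: current_state,
          error: error,
        }
        return if state == @prev_state

        @rpc_client.async.notification('/node_service_pods/set_state', [@node.id, state])
        @prev_state = state
      end
    end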

Server

  • Add a new error field to the ServiceInstance model

  • Add a new GridServiceInstanceDeploy model embedded in the GridServiceDeploy, with per-instance state and errors (see the sketch after this list)

  • Change GridServiceInstanceDeployer#ensure_service_instance to update and wait for the deploy_rev, including when stopping an old instance on a different node

    This ensures that the instance was actually stopped, and it now warns if the instance could not be stopped.

  • Log warnings if unable to stop existing service instance

    INFO -- GridServiceInstanceDeployer: Deploying service instance development/null/redis-2 to node core-02 at 2017-03-30 14:01:30 UTC...
    INFO -- GridServiceInstanceDeployer: Stopping existing service service development/null/redis instance 2 on previous node core-01...
    
    INFO -- GridServiceInstanceDeployer: Deploying service instance development/null/redis-2 to node core-01 at 2017-03-30 14:02:59 UTC...
    INFO -- GridServiceInstanceDeployer: Stopping existing service service development/null/redis instance 2 on previous node core-02...
    WARN -- GridServiceInstanceDeployer: Failed to stop existing service development/null/redis instance 2 on previous node core-02: Host node is offline
    
    WARN -- GridServiceInstanceDeployer: Replacing orphaned service development/null/redis instance 2 on destroyed node
    
  • Refactor the GridServiceInstanceDeployer around the new GridServiceInstanceDeploy model
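
A rough sketch of the new embedded model referenced above. Sketch only: field names other than state and error are guesses based on the instance_deploys API output below, not verbatim server code.

    require 'mongoid'

    # Sketch only: per-instance deploy state embedded under GridServiceDeploy,
    # serialized as the instance_deploys array in the API output below.
    class GridServiceInstanceDeploy
      include Mongoid::Document
      embedded_in :grid_service_deploy

      field :instance_number, type: Integer
      field :host_node_id, type: BSON::ObjectId # rendered as the node name in the API
      field :state, type: String                # e.g. 'ongoing', 'error'
      field :error, type: String
    end

    class GridServiceDeploy
      include Mongoid::Document
      embeds_many :grid_service_instance_deploys
    end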

API

  • Add a new instance_deploys field to the /v1/services/:grid/:stack/:service/deploys/:deploy API

/v1/services/:grid/:stack/:service/deploys/:deploy

{
   "reason" : "GridServiceDeployer::DeployError: one or more instances failed",
   "id" : "58dcf836de3578252b000001",
   "finished_at" : null,
   "state" : "error",
   "instance_deploys" : [
      {
         "state" : "error",
         "instance_number" : 1,
         "node" : "core-02",
         "error" : "GridServiceInstanceDeployer::AgentError: Docker::Error::NotFoundError: No such image: redis:nonexist\n"
      },
      {
         "node" : "core-01",
         "instance_number" : 2,
         "error" : null,
         "state" : "ongoing"
      },
      {
         "state" : "ongoing",
         "error" : null,
         "node" : "core-02",
         "instance_number" : 3
      },
      {
         "state" : "ongoing",
         "error" : null,
         "instance_number" : 4,
         "node" : "core-01"
      }
   ],
   "started_at" : "2017-03-30T12:21:10.801+00:00",
   "created_at" : "2017-03-30T12:21:10.616Z",
   "service_id" : "development/null/redis-fail"
}

CLI

Rough hack for kontena service deploy to report the per-instance errors.

TODO support for kontena stack deploy

kontena service scale redis 4

 [done] Scaling redis to 4 instances      
⊛ Deployed instance development/null/redis-1 to node core-01
⊛ Deployed instance development/null/redis-2 to node core-02
⊛ Deployed instance development/null/redis-3 to node core-01
⊛ Deployed instance development/null/redis-4 to node core-02

kontena service scale redis-fail 2

 [fail] Scaling redis-fail to 1 instances      
 [error] GridServiceDeployer::DeployError: one or more instances failed:
	⊗ Failed to deploy instance development/null/redis-fail-1 to node core-01: GridServiceInstanceDeployer::ServiceError: Docker::Error::NotFoundError: No such image: redis-nonexist:latest

@SpComb changed the title from "[WIP] Service instance deploy errors" to "[WIP] Service instance deploy state, errors" on Mar 30, 2017
@prev_state = current_state

if state != @prev_state
  rpc_client.async.notification('/node_service_pods/set_state', [node.id, state])
Contributor Author

Because this is an async notification, we won't notice if the RPC gets dropped, and the server is left waiting... if this was changed to a rpc_client.request instead, we would get an error if the RPC fails... then we could crash and retry sooner? OTOH, I suppose right now we'll eventually re-send this notification, so it should still make progress ATM.
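
For comparison, the request-based variant being discussed might look roughly like this. Illustrative sketch only: whether the agent's rpc_client exposes a blocking request with this signature is an assumption, and the method name is made up.

    # Sketch only: a request/response RPC surfaces delivery failures, unlike the
    # fire-and-forget notification above.
    def sync_state_via_request(rpc_client, node, state)
      rpc_client.request('/node_service_pods/set_state', [node.id, state])
      @prev_state = state
    rescue => exc
      # With the async notification this failure would go unnoticed; here the
      # worker could crash (and get restarted) or retry the sync sooner.
      warn "set_state failed: #{exc.message}"
      raise
    end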

Contributor

See #2050

Contributor Author

Merged

self.grid_service.set(:deployed_at => deploy_rev)

deploy_futures = []
total_instances = self.instance_count
self.grid_service.grid_service_instances.where(:instance_number.gt => total_instances).destroy
Contributor Author

Perhaps this should only happen after all instances have been deployed...?

Contributor

Yes, I think it's safer that way. Maybe also so that it would send the notification to the nodes

Contributor Author

If we go full-on, then the deploy operation would also wait for those extra instances to get terminated, and also report any failures there...

Contributor

Why after? Imho these should be terminated before the actual deploy (to make room for scheduling).

Contributor Author

Hmm... the scheduler still uses the Container info for scheduling decisions, not the ServiceInstance models? Would need to fix that if you want the scheduler to re-schedule the new instances based on the old instances being gone, plus wait for the containers to terminate and release their resources.


# bail out early if anything fails
if deploy_futures.select{|f| f.ready? }.map{|f| f.value }.any?{|d| d.error? } # raises on any Future exceptions
  raise DeployError, "one or more instances failed"
Contributor Author

There could be a configurable threshold... continue deployment even if fewer than min_instances fail?

Contributor

Maybe min_instances_to_fail = (1 - min_health) * instance_count? So with 0.8 min_health, 20% of the deployed instances could fail.
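
Purely as an illustration of that arithmetic (made-up values, not code from this PR):

    min_health = 0.8
    instance_count = 10

    # Allow up to (1 - min_health) of the instances to fail before aborting the deploy.
    min_instances_to_fail = ((1 - min_health) * instance_count).round  # => 2

    instance_states = %w[error ongoing error success success success success ongoing ongoing success]
    failed = instance_states.count('error')                            # => 2

    abort_deploy = failed > min_instances_to_fail                      # => false, keep deploying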

Contributor

IMHO this could be a deploy_opt: cancel or continue on error. A calculation based on min_health is too magical.

host_node_id: node.id,
deploy_rev: deploy_rev,
desired_state: desired_state,
rev: nil, # reset
Contributor Author

This is problematic because both the ensure_service_instance(..., 'stopped') and ensure_service_instance(..., 'running') happen with the same deploy_rev... Ideally, each ensure_service_instance operation would just set its own deploy_rev: Time.now.utc (XXX: with sub-second precision?), but then I'm not sure what effect that would have on the actual Docker container deploy_rev?

Contributor

Why is that problematic?

Contributor Author

Well, setting it to nil is a workaround for that issue, but it feels slightly wrong... The problematic thing is that we have to do two "deploys" using the same deploy_rev, because the service instance moves across two nodes.

@SpComb added this to the 1.2.0 milestone on Mar 30, 2017
end

self.terminate if service_pod.terminated?
Contributor Author

Should not terminate the actor if the ensure_terminated fails.

Contributor

terminate is a bit confusing, as it's used to terminate the pod, the container and the actor itself.

Contributor Author

The terminate term is from Celluloid, so we would have to rename the service_pods.desired_state :)

BTW, what happens here now is that the server destroys the GridServiceInstance, and the agent signals ServicePodWorker.destroy... then if the docker rm fails, the ServicePodWorker just logs a warning and terminates itself. The Docker container remains running, and the ServicePodManager doesn't pick that up until you restart the agent.

I'm not sure the existing implementation behaves better either... it will crash the ServicePodWorker, and then the ServicePodManager will not restart it.

Contributor Author

Now we stick around if the ensure_terminated fails, so the ServicePodManager will re-try the destroy on the next update loop.
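
Roughly the intended shape of that behavior (sketch only; the method name and the error handling here are assumptions, not the actual ServicePodWorker code):

    # Sketch only: terminate the Celluloid actor only once the container is
    # actually gone, so a failed docker rm leaves the worker alive and the
    # ServicePodManager can re-send the destroy on the next update loop.
    def apply_terminated(service_pod)
      ensure_terminated(service_pod) # assumed to raise if e.g. docker rm fails
      self.terminate if service_pod.terminated?
    rescue => exc
      warn "failed to terminate service pod: #{exc.message}"
      # stay alive; the destroy will be retried on the next update
    end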

@jnummelin (Contributor) commented Apr 5, 2017

I think it's a good idea to stick around and try again on next update loop

@jnummelin (Contributor)

Design-wise looks good IMO

@SpComb (Contributor Author) commented Apr 3, 2017

Still WIP on the CLI kontena stack deploy and possibly better kontena service deploy output, but otherwise ready for review.

@SpComb changed the title from "[WIP] Service instance deploy state, errors" to "Service instance deploy state, errors" on Apr 3, 2017
@SpComb (Contributor Author) commented Apr 5, 2017

Took out the GridServiceDeployer changes, including the part waiting for ongoing instance deploy futures to complete. Fixing the GridServiceDeployer future stuff can be a separate PR, making this easier to review.

@SpComb (Contributor Author) commented Apr 5, 2017

I suggest we merge this with the agent/server/api changes, and the basic kontena service deploy output, and then have a separate PR to improve the CLI stack/service deploy output further.

@jakolehm (Contributor) left a comment

LGTM

@SpComb merged commit 7aec8a5 into master Apr 5, 2017
@SpComb deleted the fix/service-instance-deploy-errors branch April 5, 2017 12:29