
Healthcheck Support #641

Closed
aluzzardi opened this issue May 20, 2016 · 21 comments
@aluzzardi
Member

Since health checks are coming in the engine, we should do something about them (both in orchestration and in networking).

/cc @mrjana @aaronlehmann

@aluzzardi
Member Author

aluzzardi commented Jun 8, 2016

@dongluochen / @runshenzhu / @dperny / @nishanttotla Could one of you take care of this one? It's self contained and quite high value.

Basically we need:

  1. Agent to move a task to FAILED if health checks are failing
  2. Add health check details into ContainerStatus

I can go over the details with one or two volunteers :)

/cc @stevvooe
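
For illustration, here is a minimal sketch of those two items. All names below (TaskState, ContainerStatus fields, applyHealth) are hypothetical, not the real swarmkit API:

package agent

// Hypothetical types for illustration only; the real swarmkit API differs.
type TaskState int

const (
    TaskStateRunning TaskState = iota
    TaskStateFailed
)

type ContainerStatus struct {
    ContainerID string
    Health      string // mirrored from the engine: "starting", "healthy" or "unhealthy"
}

type TaskStatus struct {
    State     TaskState
    Message   string
    Container ContainerStatus
}

// applyHealth folds an engine health report into the task status: it
// records the health detail (item 2) and moves the task to FAILED when
// the check reports unhealthy (item 1).
func applyHealth(status *TaskStatus, health string) {
    status.Container.Health = health
    if health == "unhealthy" {
        status.State = TaskStateFailed
        status.Message = "container health check failed"
    }
}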

@mrjana
Contributor

mrjana commented Jun 8, 2016

@aluzzardi Does that mean the task is also brought down? If so, that would just naturally work for networking.

@aluzzardi
Member Author

@mrjana Yeah, right. When a health check fails, we should bring the task down.

@nishanttotla
Contributor

@aluzzardi I'd like to pick this up. Let me put in some thought then we can discuss tomorrow.

nishanttotla self-assigned this Jun 8, 2016
@runshenzhu
Contributor

@nishanttotla @aluzzardi Sorry for being late. Can I join the discussion tomorrow?

@nishanttotla
Contributor

For reference, here are the relevant docker/docker healthcheck PRs:
moby/moby#22719
moby/moby#23218

@nishanttotla
Contributor

nishanttotla commented Jun 10, 2016

Here's a brief update based on our discussion earlier.
These are the healthcheck options Docker supports:

  --health-cmd            Command to run to check health
  --health-interval       Time between running the check
  --health-retries        Consecutive failures needed to report unhealthy
  --health-timeout        Maximum time to allow one check to run
  --no-healthcheck        Disable any container-specified HEALTHCHECK

We should be able to support all of these at the service level, with the understanding that they're passed down to Tasks as is.
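
For reference, these flags map onto the HealthConfig struct in engine-api (reproduced here from the engine-api sources of the time; the vendored copy is authoritative):

// HealthConfig holds configuration settings for the HEALTHCHECK feature.
type HealthConfig struct {
    Test     []string      // The check to run, e.g. {"CMD-SHELL", "curl -f localhost"}; {"NONE"} disables it
    Interval time.Duration // Time to wait between checks (zero means inherit)
    Timeout  time.Duration // Time before a single check is considered hung
    Retries  int           // Consecutive failures needed to report unhealthy
}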

The ContainerState struct in engine-api contains a Health field that is returned to us on inspect.

// Health stores information about the container's healthcheck results
type Health struct {
    Status        string               // Status is one of Starting, Healthy or Unhealthy
    FailingStreak int                  // FailingStreak is the number of consecutive failures
    Log           []*HealthcheckResult // Log contains the last few results (oldest first)
}

Based on the Status, we can decide what action to take for the task.
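
As a sketch, the agent could read this on inspect and branch on Status alone (engine-api client calls as vendored at the time; treat the details as assumptions):

package agent

import (
    "golang.org/x/net/context"

    "github.com/docker/engine-api/client"
)

// checkHealth inspects a container and reports whether its healthcheck
// (if any) is currently failing. Only Status is consulted; FailingStreak
// and Log are left to the engine.
func checkHealth(ctx context.Context, cli client.APIClient, containerID string) (bool, error) {
    resp, err := cli.ContainerInspect(ctx, containerID)
    if err != nil {
        return false, err
    }
    if resp.State != nil && resp.State.Health != nil {
        return resp.State.Health.Status == "unhealthy", nil
    }
    return false, nil // no healthcheck configured
}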

TODO for this issue:

  0. Update engine-api vendored version
  1. Parse healthcheck options at the CLI into a HealthConfig
  2. Add the HealthConfig to the service spec, plumb it down to task spec and container config
  3. Check health data before reporting Task status

Docker currently doesn't exit/remove failing containers, so it's up to us to do that.
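
A sketch of that responsibility, with an entirely hypothetical runtime interface (the real shutdown path would go through the task controller):

// containerRuntime is hypothetical; it stands in for whatever the
// executor uses to drive the engine.
type containerRuntime interface {
    Stop(containerID string) error
    Remove(containerID string) error
}

// failUnhealthy stops and removes a container whose healthcheck is
// failing, since Docker itself leaves it running.
func failUnhealthy(rt containerRuntime, containerID string) error {
    if err := rt.Stop(containerID); err != nil {
        return err
    }
    return rt.Remove(containerID)
}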

One big question to think about here is whether the healthchecks should be decoupled from Docker. What this means is whether we should simply rely on the Status reported by inspect, or have more complex logic inside Swarmkit that allows us to do more using the FailingStreak and Log fields. My opinion is to go with the former right now for simplicity.

Small note: Healthchecks can alternatively be defined in the Dockerfile and be part of the image, but options specified while running containers override them.

Also at this time I'm not sure how or if we should allow for updating of healthcheck options.

cc @aluzzardi @runshenzhu @dperny

@stevvooe
Contributor

When do we plan on collecting this? Right now, it only seems relevant at start or wait, but we don't have a mechanism to poll the health state.

@stevvooe
Contributor

Also, who is going to consume this data? The current PR is very container-specific. Shouldn't a determination be made at the agent level and then we change the status and report it in the orchestrator? I really think we need to avoid having the orchestrator dip down into container status to figure out the state of a container. That is supposed to be reported in the task state.

@nishanttotla
Contributor

nishanttotla commented Jun 14, 2016

@stevvooe correct, the orchestrator shouldn't have to look into the container status to retrieve health info. The agent should set the task status based on container health, and the orchestrator should make a decision based on that.

As for polling, should the orchestrator be responsible for that, or just assume that the agent manages it and updates task status whenever health checks are failing?

@stevvooe
Contributor

The agent should set the task status based on container health, and the orchestrator should make a decision based on that.

The orchestrator should not know about containers. I'm really worried this health checking model is fatally flawed. We have task status and the health should flow into that and fail the task. If we get too many things flying back and forth, we will experience massive stability problems.

@nishanttotla
Contributor

The orchestrator should not know about containers.

Sorry if my articulation was unclear, but this is how I see it too. The healthchecks should stay on the agent, and any health info should flow into task status (as you point out). There should be no back and forth between orchestrator and agent on this.

@stevvooe
Contributor

I guess the main issue with this proposal is that it focuses on what docker has as a health check, rather than how swarmkit views unhealthy task execution. We need to define a model for how swarmkit should interpret health, if at all, and then add that functionality to the docker executor.

We need to ask the following:

  1. Will the manager take action based on the health check result? (probably not)
  2. Will the manager make the results of health checks available? (yes)
  3. Will the Worker/TaskManager/Controller understand the concept of task health?
  4. Can task health be folded into the current state model?

The current proposal has the health check data being interpreted by the task manager, which will affect genericity. The interpretation should happen entirely within the executor. Such tasks should be considered failed from a state machine perspective, though we can define a richer reasoning behind the failure (per 2).

BTW, I see no issue in flowing this data into the manager, but we need to come up with a way of not taking action on it (because we don't want to do 1).

We need to be careful that whatever we adopt into the distributed model doesn't contribute to instability. Something as mutable as a health check result isn't going to have strong convergence properties, meaning that decisions will most likely be made on stale data (read: it will be flappy). This is just as true for the agent as it is for the orchestrator.

I don't really know the answers to 3 and 4. My recommendation is that we try to work with the current state model and have controllers communicate health issues by returning terminal errors. If we find that this model is insufficient or commonly complex, we can import it into the task execution model.
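
Concretely, that recommendation could look something like this. Every name below is hypothetical; the real contract lives in swarmkit's agent/exec package:

package exec

import (
    "errors"

    "golang.org/x/net/context"
)

var ErrHealthCheckFailed = errors.New("task health check failed")

type controller struct {
    containerExited chan struct{} // closed when the container exits
    healthFailed    chan struct{} // closed when the healthcheck reports unhealthy
}

// Wait blocks until the task terminates. A health failure surfaces as a
// terminal error, so the existing task state machine marks the task
// FAILED without the orchestrator ever inspecting container internals.
func (c *controller) Wait(ctx context.Context) error {
    select {
    case <-c.containerExited:
        return nil // normal exit; exit-code handling elided
    case <-c.healthFailed:
        return ErrHealthCheckFailed
    case <-ctx.Done():
        return ctx.Err()
    }
}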

@gianarb
Member

gianarb commented Jul 19, 2016

Hello,
This talk explains very well what, in my opinion, makes for good health checking:
https://www.youtube.com/watch?v=l-w2skD_56E

I'm speaking from a very high-level point of view, but I hope it helps whoever is working on this feature, which is very important IMHO.

From Swarm's point of view, a healthcheck is needed when we start a deploy: it's a good way to decide whether or not to add containers to the "visible" pool. Right now (if I understood correctly) there is only one factor that elects a container as ready to be visible: the status of the container. We could also check the health status, if it exists for that container.

  1. Will the Worker/TaskManager/Controller understand the concept of task health?
     Yes, in order to understand how an update is doing and to manage a rollback if necessary.
  2. Can task health be folded into the current state model?
     I think yes; it must be one of the factors that determines the current status.

@aluzzardi
Member Author

@runshenzhu @stevvooe I think this was implemented, correct?

@runshenzhu
Contributor

yes, I believe it's implemented.

@nishanttotla
Contributor

nishanttotla commented Aug 4, 2016

@runshenzhu is it more than monitoring events from running containers? Do we want a broader healthcheck framework in SwarmKit?

@aelsabbahy

@gianarb which talk are you referring to? The video you linked is 7hrs+ long.

@gianarb
Member

gianarb commented Aug 5, 2016

@aelsabbahy it starts at a specific point :) The speaker is @kelseyhightower.
I see now that there is something similar in swarmkit, #1085, but when I first replied here it wasn't ready yet :)

@aelsabbahy

@gianarb Ah thanks! Yeah, that was a great talk. Here's a direct link to that presentation for anyone who's interested: https://vimeo.com/173610242

I also blogged about a similar subject here.

@aluzzardi
Member Author

Closing this down since it was implemented.

Will open another one regarding networking and HC.
