
Healthcheck Support #641

Closed
aluzzardi opened this issue May 20, 2016 · 21 comments
@aluzzardi
Member

Since health checks are coming in the engine, we should do something about them (both in orchestration and in networking).

/cc @mrjana @aaronlehmann

@aluzzardi
Member Author

aluzzardi commented Jun 8, 2016

@dongluochen / @runshenzhu / @dperny / @nishanttotla Could one of you take care of this one? It's self contained and quite high value.

Basically we need:

  1. Agent to move a task to FAILED if health checks are failing
  2. Add health check details into ContainerStatus

I can go over the details with one or two volunteers :)

/cc @stevvooe
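
For illustration, here is a minimal sketch of those two items. All names below (TaskState, ContainerStatus fields, applyHealth) are hypothetical, not the real swarmkit API:

package agent

// Hypothetical types for illustration only; the real swarmkit API differs.
type TaskState int

const (
    TaskStateRunning TaskState = iota
    TaskStateFailed
)

type ContainerStatus struct {
    ContainerID string
    Health      string // mirrored from the engine: "starting", "healthy" or "unhealthy"
}

type TaskStatus struct {
    State     TaskState
    Message   string
    Container ContainerStatus
}

// applyHealth folds an engine health report into the task status: it
// records the health detail (item 2) and moves the task to FAILED when
// the check reports unhealthy (item 1).
func applyHealth(status *TaskStatus, health string) {
    status.Container.Health = health
    if health == "unhealthy" {
        status.State = TaskStateFailed
        status.Message = "container health check failed"
    }
}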

@mrjana
Contributor

mrjana commented Jun 8, 2016

@aluzzardi Does that mean the task is also brought down? If so, that would just naturally work for networking.

@aluzzardi
Member Author

@mrjana Yeah, right. When a health check fails, we should bring the task down.

@nishanttotla
Contributor

@aluzzardi I'd like to pick this up. Let me put in some thought then we can discuss tomorrow.

nishanttotla self-assigned this Jun 8, 2016
@runshenzhu
Contributor

@nishanttotla @aluzzardi Sorry for being late. Can I join the discussion tomorrow?

@nishanttotla
Contributor

For reference, here are the relevant docker/docker healthcheck PRs:
moby/moby#22719
moby/moby#23218

@nishanttotla
Contributor

nishanttotla commented Jun 10, 2016

Here's a brief update based on our discussion earlier.
These are the healthcheck options Docker supports:

  --health-cmd            Command to run to check health
  --health-interval       Time between running the check
  --health-retries        Consecutive failures needed to report unhealthy
  --health-timeout        Maximum time to allow one check to run
  --no-healthcheck        Disable any container-specified HEALTHCHECK

We should be able to support all of these at the service level, with the understanding that they're passed down to Tasks as is.
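
For reference, these flags map onto the HealthConfig struct in engine-api (reproduced here from the engine-api sources of the time; the vendored copy is authoritative):

// HealthConfig holds configuration settings for the HEALTHCHECK feature.
type HealthConfig struct {
    Test     []string      // The check to run, e.g. {"CMD-SHELL", "curl -f localhost"}; {"NONE"} disables it
    Interval time.Duration // Time to wait between checks (zero means inherit)
    Timeout  time.Duration // Time before a single check is considered hung
    Retries  int           // Consecutive failures needed to report unhealthy
}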

The ContainerState struct in engine-api contains a Health field that is returned to us on inspect.

// Health stores information about the container's healthcheck results
type Health struct {
    Status        string               // Status is one of Starting, Healthy or Unhealthy
    FailingStreak int                  // FailingStreak is the number of consecutive failures
    Log           []*HealthcheckResult // Log contains the last few results (oldest first)
}

Based on the Status, we can decide what action to take for the task.
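
As a sketch, the agent could read this on inspect and branch on Status alone (engine-api client calls as vendored at the time; treat the details as assumptions):

package agent

import (
    "golang.org/x/net/context"

    "github.com/docker/engine-api/client"
)

// checkHealth inspects a container and reports whether its healthcheck
// (if any) is currently failing. Only Status is consulted; FailingStreak
// and Log are left to the engine.
func checkHealth(ctx context.Context, cli client.APIClient, containerID string) (bool, error) {
    resp, err := cli.ContainerInspect(ctx, containerID)
    if err != nil {
        return false, err
    }
    if resp.State != nil && resp.State.Health != nil {
        return resp.State.Health.Status == "unhealthy", nil
    }
    return false, nil // no healthcheck configured
}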

TODO for this issue:

  0. Update engine-api vendored version
  1. Parse healthcheck options at the CLI into a HealthConfig
  2. Add the HealthConfig to the service spec, plumb it down to task spec and container config
  3. Check health data before reporting Task status

Docker currently doesn't exit/remove failing containers, so it's up to us to do that.
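
A sketch of that responsibility, with an entirely hypothetical runtime interface (the real shutdown path would go through the task controller):

// containerRuntime is hypothetical; it stands in for whatever the
// executor uses to drive the engine.
type containerRuntime interface {
    Stop(containerID string) error
    Remove(containerID string) error
}

// failUnhealthy stops and removes a container whose healthcheck is
// failing, since Docker itself leaves it running.
func failUnhealthy(rt containerRuntime, containerID string) error {
    if err := rt.Stop(containerID); err != nil {
        return err
    }
    return rt.Remove(containerID)
}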

One big question to think about here is whether the healthchecks should be decoupled from Docker. What this means is whether we should simply rely on the Status reported by inspect, or have more complex logic inside Swarmkit that allows us to do more using the FailingStreak and Log fields. My opinion is to go with the former right now for simplicity.

Small note: Healthchecks can alternatively be defined in the Dockerfile and be part of the image, but options specified while running containers override them.

Also at this time I'm not sure how or if we should allow for updating of healthcheck options.

cc @aluzzardi @runshenzhu @dperny

@stevvooe
Contributor

When do we plan on collecting this? Right now, it only seems relevant at start or wait, but we don't have a mechanism to poll the health state.

@stevvooe
Contributor

Also, who is going to consume this data? The current PR is very container-specific. Shouldn't a determination be made at the agent level and then we change the status and report it in the orchestrator? I really think we need to avoid having the orchestrator dip down into container status to figure out the state of a container. That is supposed to be reported in the task state.

@nishanttotla
Contributor

nishanttotla commented Jun 14, 2016

@stevvooe correct, the orchestrator shouldn't have to look into the container status to retrieve health info. The agent should set the task status based on container health, and the orchestrator should make a decision based on that.

As for polling, should the orchestrator be responsible for that, or just assume that the agent manages it and updates task status whenever health checks are failing?

@stevvooe
Contributor

The agent should set the task status based on container health, and the orchestrator should make a decision based on that.

The orchestrator should not know about containers. I'm really worried this health checking model is fatally flawed. We have task status and the health should flow into that and fail the task. If we get too many things flying back and forth, we will experience massive stability problems.

@nishanttotla
Contributor

The orchestrator should not know about containers.

Sorry if my articulation was unclear, but this is how I see it too. The healthchecks should stay on the agent, and any health info should flow into task status (as you point out). There should be no back and forth between orchestrator and agent on this.

@stevvooe
Contributor

I guess the main issue with this proposal is that it focuses on what docker has as a health check, rather than how swarmkit views unhealthy task execution. We need to define a model for how swarmkit should interpret health, if at all, and then add that functionality to the docker executor.

We need to ask the following:

  1. Will the manager take action based on the health check result? (probably not)
  2. Will the manager make the results of health checks available? (yes)
  3. Will the Worker/TaskManager/Controller understand the concept of task health?
  4. Can task health be folded into the current state model?

The current proposal has the health check data being interpreted by the task manager, which will affect genericity. The interpretation should happen entirely within the executor. Such tasks should be considered failed from a state machine perspective, though we can define a richer reasoning behind the failure (per 2).

BTW, I see no issue in flowing this data into the manager, but we need to come up with a way of not taking action on it (because we don't want to do 1).

We need to be careful that whatever we adopt into the distributed model doesn't contribute to instability. Something as mutable as a health check result isn't going to have strong convergence properties, meaning that decisions will most likely be made on stale data (read: it will be flappy). This is just as true for the agent as it is for the orchestrator.

I don't really know the answers to 3 and 4. My recommendation is that we try to work with the current state model and have controllers communicate health issues by returning terminal errors. If we find that this model is insufficient or commonly complex, we can import it into the task execution model.
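
Concretely, that recommendation could look something like this. Every name below is hypothetical; the real contract lives in swarmkit's agent/exec package:

package exec

import (
    "errors"

    "golang.org/x/net/context"
)

var ErrHealthCheckFailed = errors.New("task health check failed")

type controller struct {
    containerExited chan struct{} // closed when the container exits
    healthFailed    chan struct{} // closed when the healthcheck reports unhealthy
}

// Wait blocks until the task terminates. A health failure surfaces as a
// terminal error, so the existing task state machine marks the task
// FAILED without the orchestrator ever inspecting container internals.
func (c *controller) Wait(ctx context.Context) error {
    select {
    case <-c.containerExited:
        return nil // normal exit; exit-code handling elided
    case <-c.healthFailed:
        return ErrHealthCheckFailed
    case <-ctx.Done():
        return ctx.Err()
    }
}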

@gianarb
Member

gianarb commented Jul 19, 2016

Hello,
This talk explains very well what, in my opinion, makes for good health checking:
https://www.youtube.com/watch?v=l-w2skD_56E

I'm speaking from a very high-level point of view, but I hope it helps whoever is working on this feature, which is very important IMHO.

From Swarm's point of view, a healthcheck is needed when we start a deploy: it's a good way to decide whether or not to add containers to the "visible" pool. Right now (if I understood correctly) there is only one factor that elects a container as ready to be visible: the status of the container. We could also check the health status, if it exists for that container.

  1. Will the Worker/TaskManager/Controller understand the concept of task health?
     Yes, in order to understand how an update is doing and to manage a rollback if necessary.
  2. Can task health be folded into the current state model?
     I think yes; it must be one of the factors that determines the current status.

@aluzzardi
Member Author

@runshenzhu @stevvooe I think this was implemented, correct?

@runshenzhu
Contributor

yes, I believe it's implemented.

@nishanttotla
Contributor

nishanttotla commented Aug 4, 2016

@runshenzhu is it more than monitoring events from running containers? Do we want a broader healthcheck framework in SwarmKit?

@aelsabbahy

@gianarb which talk are you referring to? The video you linked is 7hrs+ long.

@gianarb
Member

gianarb commented Aug 5, 2016

@aelsabbahy it starts at a specific point :) The speaker is @kelseyhightower.
I see now that there is something similar in swarmkit, #1085, but when I first replied here it wasn't ready yet :)

@aelsabbahy

@gianarb Ah thanks! Yeah, that was a great talk. Here's a direct link to that presentation for anyone who's interested: https://vimeo.com/173610242

I also blogged about a similar subject here.

@aluzzardi
Member Author

Closing this down since it was implemented.

Will open another one regarding networking and HC.
