Healthcheck Support #641
@dongluochen / @runshenzhu / @dperny / @nishanttotla Could one of you take care of this one? It's self-contained and quite high value. Basically we need:
I can go over the details with one or two volunteers :) /cc @stevvooe
@aluzzardi Does that mean the task is also brought down? If so, that would just naturally work for networking.
@mrjana Yeah, right. When a health check fails, we should bring the task down.
@aluzzardi I'd like to pick this up. Let me put some thought into it, then we can discuss tomorrow.
@nishanttotla @aluzzardi Sorry for the late reply. Can I join the discussion tomorrow?
For reference, here are the relevant
Here's a brief update based on our discussion earlier.
We should be able to support all of these at the service level, with the understanding that they're passed down to Tasks as-is.
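A rough sketch of what that could look like, assuming illustrative names (ServiceSpec, TaskSpec, and HealthcheckOptions below are placeholders, not SwarmKit's actual API): the options are declared once on the service and copied to each task unchanged.

```go
// Package spec sketches service-level healthcheck options that are passed
// down to each task as-is. Type and field names are illustrative only.
package spec

import "time"

// HealthcheckOptions are declared once on the service.
type HealthcheckOptions struct {
	Test     []string      // e.g. {"CMD-SHELL", "curl -f http://localhost/ || exit 1"}
	Interval time.Duration // time between checks
	Timeout  time.Duration // per-check timeout
	Retries  int           // consecutive failures before a task counts as unhealthy
}

type ServiceSpec struct {
	Image       string
	Healthcheck *HealthcheckOptions
}

type TaskSpec struct {
	Image       string
	Healthcheck *HealthcheckOptions
}

// taskSpecFromService copies the service's healthcheck options to the task unchanged.
func taskSpecFromService(s ServiceSpec) TaskSpec {
	return TaskSpec{Image: s.Image, Healthcheck: s.Healthcheck}
}
```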
Based on the TODO for this issue:
Docker currently doesn't exit/remove failing containers, so it's up to us to do that. One big question to think about here is whether the healthchecks should be decoupled from Docker; what this means is whether we should simply rely on the engine's healthcheck mechanism.

Small note: healthchecks can alternatively be defined in the Dockerfile and be part of the image, but options specified while running containers override them. Also, at this time I'm not sure how, or whether, we should allow updating healthcheck options.
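For illustration, a minimal sketch of setting healthcheck options through the Docker Go client at container-create time, which is how run-time options would override whatever HEALTHCHECK the image's Dockerfile declares. The image name is a placeholder, and the exact ContainerCreate signature varies across client versions.

```go
package main

import (
	"context"
	"time"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()

	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}

	// Healthcheck options supplied at container-create time; these override
	// any HEALTHCHECK instruction baked into the image.
	cfg := &container.Config{
		Image: "nginx", // placeholder image, not from the issue
		Healthcheck: &container.HealthConfig{
			Test:     []string{"CMD-SHELL", "curl -f http://localhost/ || exit 1"},
			Interval: 30 * time.Second,
			Timeout:  5 * time.Second,
			Retries:  3,
		},
	}

	if _, err := cli.ContainerCreate(ctx, cfg, nil, nil, nil, "healthcheck-demo"); err != nil {
		panic(err)
	}
}
```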
When do we plan on collecting this? Right now, it only seems relevant at start or wait, but we don't have a mechanism to poll the health state.
Also, who is going to consume this data? The current PR is very container-specific. Shouldn't a determination be made at the agent level, and then we change the status and report it in the orchestrator? I really think we need to avoid having the orchestrator dip down into container status to figure out the state of a container. That is supposed to be reported in the task state.
@stevvooe correct, the orchestrator shouldn't have to look into the container status to retrieve health info. The agent should set the task status based on container health, and the orchestrator should make a decision based on that. As for polling, should the orchestrator be responsible for that, or just assume that the agent manages it and updates task status whenever health checks are failing?
The orchestrator should not know about containers. I'm really worried this health checking model is fatally flawed. We have task status and the health should flow into that and fail the task. If we get too many things flying back and forth, we will experience massive stability problems.
Sorry if my articulation was unclear, but this is how I see it too. The healthchecks should stay on the agent, and any health info should flow into task status (as you point out). There should be no back and forth between orchestrator and agent on this.
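A sketch of how that could look on the agent side, assuming the engine's events API and a hypothetical setTaskState callback (the task-state names are illustrative, not SwarmKit's actual types): the agent watches health_status events and folds an unhealthy result into the task status, so the orchestrator never inspects the container.

```go
package agent

import (
	"context"
	"fmt"
	"strings"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/client"
)

// TaskState mirrors the idea that health flows into task status; the names
// here are illustrative, not SwarmKit's actual types.
type TaskState int

const (
	TaskStateRunning TaskState = iota
	TaskStateFailed
)

// watchHealth runs on the agent: it subscribes to the engine's events for one
// container, and when the container turns unhealthy it marks the corresponding
// task failed. Only the resulting task state is ever reported upward.
func watchHealth(ctx context.Context, cli *client.Client, containerID string, setTaskState func(TaskState, string)) error {
	f := filters.NewArgs()
	f.Add("container", containerID)
	f.Add("event", "health_status")

	msgs, errs := cli.Events(ctx, types.EventsOptions{Filters: f})
	for {
		select {
		case m := <-msgs:
			// The engine reports actions like "health_status: unhealthy".
			if strings.Contains(string(m.Action), "unhealthy") {
				setTaskState(TaskStateFailed, "container healthcheck failed")
				return fmt.Errorf("container %s became unhealthy", containerID)
			}
		case err := <-errs:
			return err
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```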
I guess the main issue with this proposal is that it focuses on what docker has as a health check, rather than how swarmkit views unhealthy task execution. We need to define a model for how swarmkit should interpret health, if at all, and then add that functionality to the docker executor. We need to ask the following:
The current proposal has the health check data being interpreted by the task manager, which will affect genericity. The interpretation should be entirely within the executor. Such tasks should be considered failed from a state machine perspective, albeit we can define a richer reasoning behind the failure (per 2).

BTW, I see no issue in flowing this data into the manager, but we need to come up with a way of not taking action on it (because we don't want to do 1). We need to be careful that whatever we adopt into the distributed model doesn't contribute to instability. Something as mutable as a health check result isn't going to have strong convergence properties, meaning that decisions will most likely be made on stale data (read: it will be flappy). This is just as true for the agent as it is for the orchestrator.

I don't really know the answers to 3 and 4. My recommendation is that we try to work with the current state model and have controllers communicate health issues by returning terminal errors. If we find that this model is insufficient or commonly complex, we can import it into the task execution model.
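A minimal sketch of that recommendation, with hypothetical names (this is not SwarmKit's actual executor API): the controller's Wait surfaces a healthcheck failure as a terminal error, so the existing task state machine simply records the task as failed without any new health-specific states flowing between agent and orchestrator.

```go
package executor

import (
	"context"
	"errors"
)

// ErrUnhealthy is a terminal error: Wait returns it so the existing task
// state machine marks the task failed, with the healthcheck failure kept as
// the reason. Names here are illustrative only.
var ErrUnhealthy = errors.New("task failed healthcheck")

// controller is a minimal stand-in for an executor controller.
type controller struct {
	containerExit chan error    // receives when the container exits on its own
	unhealthy     chan struct{} // signalled by the agent's health watcher
}

// Wait blocks until the task ends. A healthcheck failure is surfaced as a
// terminal error rather than as a new task state.
func (c *controller) Wait(ctx context.Context) error {
	select {
	case err := <-c.containerExit:
		return err
	case <-c.unhealthy:
		return ErrUnhealthy
	case <-ctx.Done():
		return ctx.Err()
	}
}
```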
Hello, I'm speaking from a very high-level point of view, but I hope to help whoever is working on this feature, because it's very important IMHO. A healthcheck, from Swarm's point of view, is necessary when we start a deploy; it's a good way to decide whether or not to add containers to the "visible" pool. Right now (if I understood well) there is only one factor used to elect a container
@runshenzhu @stevvooe I think this was implemented, correct? |
Yes, I believe it's implemented.
@runshenzhu is it more than monitoring events from running containers? Do we want a broader healthcheck framework in SwarmKit? |
@gianarb which talk are you referring to? The video you linked is 7hrs+ long. |
@aelsabbahy it starts from a specific point :) the speaker is @kelseyhightower. |
@gianarb Ah, thanks! Yeah, that was a great talk; here's a direct link to that presentation for anyone who's interested: https://vimeo.com/173610242 I also blogged about a similar subject here.
Closing this down since it was implemented. Will open another one regarding networking and HC. |
Since health checks are coming in the engine, we should do something about them (both in the orchestration and in networking).
/cc @mrjana @aaronlehmann