Set NodeReady=False when docker is dead #7763

Merged

merged 1 commit into kubernetes:master from wojtek-t:kubelet_removing_nodes May 7, 2015

Conversation

@wojtek-t
Member

wojtek-t commented May 5, 2015

@googlebot googlebot added the cla: yes label May 5, 2015

@wojtek-t
Member

wojtek-t commented May 5, 2015

Ref #7222

@wojtek-t
Member

wojtek-t commented May 5, 2015

e2e tests are passing:

Ran 38 of 43 Specs in 1474.965 seconds
SUCCESS! -- 38 Passed | 0 Failed | 1 Pending | 4 Skipped
I0505 13:31:05.709929 18131 driver.go:96] All tests pass

@wojtek-t
Member

wojtek-t commented May 5, 2015

cc @dchen1107 (since @vmarmol seems to be OOO today)

@vmarmol
Contributor

vmarmol commented May 5, 2015

The only thing that worries me about this approach is whether we'll get a lot of transient errors due to it. Setting the status won't remove scheduled pods, so maybe it isn't much of a concern. We may not be a schedulable node for a bit.

@dchen1107
Member

dchen1107 commented May 5, 2015

I have the same concern as @vmarmol above. We observed that Docker can be temporarily unavailable quite often, even under average workload pressure, so our system should tolerate such failures. My concern is that we might ping-pong those workloads by switching the node status here.

@dchen1107
Member

dchen1107 commented May 5, 2015

My concern is that if we just query the docker version, it might run into "resource temporarily unavailable" (I forget whether that is the exact error thrown by docker) a lot.

@davidopp
Member

davidopp commented May 5, 2015

NodeController won't evict pods until the node has been NodeReady=False for 5 minutes. Are the transient Docker unavailabilities longer than 5 minutes?

(Though it may still be a problem from a UX perspective. Maybe it's better to mask transient Docker problems on the Kubelet side so the user doesn't get confused.)
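[Editor's note] To make the timing concrete, here is a minimal Go sketch of the controller-side rule described above; the constant and function names are invented for illustration and are not NodeController's actual code:

package sketch

import "time"

// Per the comment above: a node must have been NodeReady=False for a
// full grace period before its pods are evicted. Name is illustrative.
const podEvictionGracePeriod = 5 * time.Minute

// shouldEvictPods applies that rule given the last transition time of
// the node's ready condition.
func shouldEvictPods(ready bool, readyTransitionTime time.Time) bool {
	return !ready && time.Since(readyTransitionTime) >= podEvictionGracePeriod
}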

@dchen1107
Member

dchen1107 commented May 5, 2015

@davidopp I agree with you on this. The Kubelet should have some logic inside to decide when docker is not available. Today monit periodically checks docker's health and restarts it if it has been unresponsive for a certain time (I forget the interval I initially set). Perhaps we should ask the kubelet to do this instead of monit. But if we take that position, what about other container runtimes? rkt + systemd?

@wojtek-t
Member

wojtek-t commented May 5, 2015

Do you have any suggestions on how to fix this?

@vmarmol
Contributor

vmarmol commented May 5, 2015

We should only surface the failure when we've observed Docker to be down for 5 minutes or so. That keeps us from flapping due to transient issues. We may need to ping Docker at a higher frequency than we do for node status.
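[Editor's note] For illustration, a minimal Go sketch of the scheme @vmarmol describes, assuming a runtime client with a Version() health probe; the names (kubeletSketch, lastRuntimeUp, runtimeUpThreshold) are invented for this sketch and are not the PR's actual code:

package sketch

import (
	"sync"
	"time"
)

// ContainerRuntime is a stand-in for the runtime client; only the
// health probe matters here.
type ContainerRuntime interface {
	Version() (string, error)
}

// Threshold proposed in the comment above.
const runtimeUpThreshold = 5 * time.Minute

type kubeletSketch struct {
	mu            sync.Mutex
	lastRuntimeUp time.Time
	runtime       ContainerRuntime
}

// updateRuntimeUp is meant to run periodically in the background,
// more often than node status is posted. It records the last time the
// runtime responded to a probe.
func (kl *kubeletSketch) updateRuntimeUp() {
	if _, err := kl.runtime.Version(); err == nil {
		kl.mu.Lock()
		kl.lastRuntimeUp = time.Now()
		kl.mu.Unlock()
	}
}

// runtimeUp is consulted when building the NodeReady condition: a
// single failed ping does not flip the node to NotReady; only a
// sustained outage does.
func (kl *kubeletSketch) runtimeUp() bool {
	kl.mu.Lock()
	defer kl.mu.Unlock()
	return time.Since(kl.lastRuntimeUp) < runtimeUpThreshold
}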

@yujuhong
Contributor

yujuhong commented May 5, 2015

Just FYI, there is a function in kubelet.go checking whether docker is alive at startup. You could probably repurpose that for your use. https://github.com/GoogleCloudPlatform/kubernetes/blob/master/pkg/kubelet/kubelet.go#L100
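[Editor's note] As a hedged reconstruction (based only on the call sites visible later in this PR, not the linked file's contents), that helper likely behaves like the poll-until-deadline loop below; it reuses the ContainerRuntime interface from the sketch above:

package sketch

import "time"

// waitUntilRuntimeIsUp polls the runtime until it responds or the
// timeout elapses. With a very small (or zero) timeout there is little
// or no room to retry, which is what the review below points out.
func waitUntilRuntimeIsUp(runtime ContainerRuntime, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	var err error
	for {
		if _, err = runtime.Version(); err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return err // out of time; surface the last probe error
		}
		time.Sleep(100 * time.Millisecond) // assumed poll interval
	}
}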

@wojtek-t
Member

wojtek-t commented May 6, 2015

@yujuhong thanks - I will try to fix this today.

@wojtek-t
Member

wojtek-t commented May 6, 2015

I've updated the PR so that the failure is reported only if Docker hasn't responded during the last 5 minutes.

PTAL

@wojtek-t
Member

wojtek-t commented May 6, 2015

I'm running e2e tests - will post the results in a few minutes.

pkg/kubelet/kubelet.go

@@ -1444,6 +1455,15 @@ func (kl *Kubelet) GetPodByName(namespace, name string) (*api.Pod, bool) {
return kl.podManager.GetPodByName(namespace, name)
}
func (kl *Kubelet) updateRuntimeUp() {

@vmarmol
Contributor

vmarmol commented May 6, 2015

Looks like we don't actually call this anywhere in the Kubelet?

@wojtek-t
Member

wojtek-t commented May 6, 2015

Good catch! Thanks.

pkg/kubelet/kubelet.go
@@ -1444,6 +1455,15 @@ func (kl *Kubelet) GetPodByName(namespace, name string) (*api.Pod, bool) {
return kl.podManager.GetPodByName(namespace, name)
}
func (kl *Kubelet) updateRuntimeUp() {
err := waitUntilRuntimeIsUp(kl.containerRuntime, 0)

@vmarmol
Contributor

vmarmol commented May 6, 2015

I don't think we can use 0 as a value here. I think the check will always fail (or be racy).

@wojtek-t
Member

wojtek-t commented May 6, 2015

Done.

@wojtek-t
Member

wojtek-t commented May 6, 2015

I will rerun e2e tests and will update this thread.

@wojtek-t
Member

wojtek-t commented May 6, 2015

It seems that the e2e tests are failing. I will take a deeper look tomorrow, but if you have any ideas about what might be wrong, please let me know.

@wojtek-t
Member

wojtek-t commented May 6, 2015

Ran 38 of 43 Specs in 798.277 seconds
FAIL! -- 22 Passed | 16 Failed | 1 Pending | 4 Skipped
F0506 19:35:42.270701 31759 driver.go:94] At least one test failed

The first 22 tests passed, the next 16 failed :(

pkg/kubelet/kubelet.go
@@ -1444,6 +1456,15 @@ func (kl *Kubelet) GetPodByName(namespace, name string) (*api.Pod, bool) {
return kl.podManager.GetPodByName(namespace, name)
}
func (kl *Kubelet) updateRuntimeUp() {
err := waitUntilRuntimeIsUp(kl.containerRuntime, 25*time.Millisecond)

@vmarmol
Contributor

vmarmol commented May 6, 2015

We can probably wait longer since we have a thread for this.

@wojtek-t
Member

wojtek-t commented May 7, 2015

Done.
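[Editor's note] Since updateRuntimeUp runs on its own background goroutine (the "thread" mentioned above), the Kubelet can afford a longer in-line wait. A hypothetical wiring of that loop, continuing the earlier sketch; the 10-second period and the stop channel are assumptions, not the PR's actual code:

package sketch

import "time"

// startRuntimeUpLoop probes the runtime on a fixed period so that node
// status reporting can consult the cached lastRuntimeUp timestamp.
func (kl *kubeletSketch) startRuntimeUpLoop(stopCh <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(10 * time.Second) // assumed period
		defer ticker.Stop()
		for {
			kl.updateRuntimeUp()
			select {
			case <-ticker.C:
			case <-stopCh:
				return
			}
		}
	}()
}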

@wojtek-t
Member

wojtek-t commented May 7, 2015

I've run e2e tests - all tests passed:

Ran 38 of 43 Specs in 1593.325 seconds
SUCCESS! -- 38 Passed | 0 Failed | 1 Pending | 4 Skipped
I0507 10:56:37.327217 5607 driver.go:96] All tests pass

@vmarmol
Contributor

vmarmol commented May 7, 2015

LGTM, thanks @wojtek-t!

vmarmol added a commit that referenced this pull request May 7, 2015

Merge pull request #7763 from wojtek-t/kubelet_removing_nodes
Set NodeReady=False when docker is dead

@vmarmol vmarmol merged commit 581821d into kubernetes:master May 7, 2015

4 checks passed

Shippable: builds completed
cla/google: All necessary CLAs are signed
continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage increased (+0.06%) to 49.53%

@wojtek-t wojtek-t deleted the wojtek-t:kubelet_removing_nodes branch May 15, 2015
