Evict pods w/o rate-limit when cloud says node is gone. #21187

Merged: 1 commit, Mar 1, 2016
21 changes: 19 additions & 2 deletions pkg/controller/node/nodecontroller.go
@@ -480,9 +480,26 @@ func (nc *NodeController) monitorNodeStatus() error {
 				continue
 			}
 			if remaining {
-				// queue eviction of the pods on the node
+				// Immediately evict pods (skip rate-limited evictor)
 				glog.V(2).Infof("Deleting node %s is delayed while pods are evicted", node.Name)
-				nc.evictPods(node.Name)
+				go func(nodeName string) {
+					nc.evictorLock.Lock()
+					defer nc.evictorLock.Unlock()
+					remaining, err := nc.deletePods(nodeName)
+					if err != nil {
+						glog.Errorf("Unable to evict pods from node %s: %v", nodeName, err)
+						nc.podEvictor.Add(nodeName)
+						return
+					}
+					if !remaining {
+						return
+					}
+					// Immediately terminate pods.
Contributor:
I was just suggesting adding to terminationEvictor here, instead of invoking terminatePods.
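(For reference, a minimal sketch of that alternative, reusing only identifiers already present in this diff; this is not the merged code. The goroutine would hand the node to the rate-limited termination queue instead of calling terminatePods inline:)

go func(nodeName string) {
	nc.evictorLock.Lock()
	defer nc.evictorLock.Unlock()
	remaining, err := nc.deletePods(nodeName)
	if err != nil {
		glog.Errorf("Unable to evict pods from node %s: %v", nodeName, err)
		nc.podEvictor.Add(nodeName)
		return
	}
	if remaining {
		// Suggested alternative: queue for the rate-limited termination
		// pass rather than terminating inline.
		nc.terminationEvictor.Add(nodeName)
	}
}(node.Name)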

Member Author:
That's what I had this morning, but @gmarek convinced me otherwise. We don't expect any actual pods to be running, though they may still be represented in the apiserver. If I understand the system correctly, it should come out fine whether we queue or directly terminate.

If we try to terminate directly and it fails for whatever reason, we'll bail, but on the next scan through the nodes we should attempt to evict/terminate again.

If we add to the evictor queue, we might get more retries before the next full node scan, but we may also be backing up the rate-limited piece for no good reason.

If that description doesn't match what you think would actually happen, I'm happy to change the PR.

Contributor:
Yup - I don't think that rate limiting makes any sense here. This is the branch responsible for evicting Pods from Nodes that are gone from the cloud provider's perspective. If, for whatever reason, we were still able to contact such a Node, it would mean there's a serious bug in that provider's control plane.

+					if _, _, err := nc.terminatePods(nodeName, time.Now()); err != nil {
+						glog.Errorf("Unable to terminate pods on node %s: %v", nodeName, err)
+						nc.terminationEvictor.Add(nodeName)
+					}
+				}(node.Name)
 				continue
 			}

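For context, the hunk above sits in the branch of monitorNodeStatus where the NodeController has asked the cloud provider whether the node still exists. A rough sketch of that surrounding logic, with the pod-counting helper name assumed rather than quoted from the file (instances here would come from nc.cloud.Instances()):

// Rough sketch of the surrounding branch; helper names are assumptions.
if _, err := instances.ExternalID(node.Name); err == cloudprovider.InstanceNotFound {
	// Log and record an event that the node is no longer present in the cloud.
	remaining, err := nodeHasRemainingPods(node.Name) // assumed helper
	if err != nil {
		continue
	}
	if remaining {
		// The immediate-eviction goroutine added by this PR runs here, and
		// deletion of the Node object is delayed until its pods are gone.
		continue
	}
	// Otherwise the Node API object is deleted right away.
}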