implement InstanceShutdownByProviderID to aws cloudprovider #59930
@zetaab There is a race condition here when the instance is stopping. That'd cause it to be deleted under the current logic since InstanceExistsByProviderID only checks if the instance isn't running.
/sig aws
@jhorwit2 you are right: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-lifecycle.html Do you have suggestions for how I should solve this? I could add the "stopping" state to this shutdown function as well. In theory that is not correct, but what do you think? Also, the node might be deleted if the instance is in the pending state after it is started again. Could we add both stopping and pending to this shutdown function? For a new node, the pending state has no effect on the current working solution, because the node has not been added to the cluster yet.
Hmm, actually I have a better idea. Currently there are two ways nodes get deleted: the nodelifecycle controller or the cloud controller manager. Nodelifecycle controller: https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go#L1291-L1306 this does not care about instance state at all? So with the nodelifecycle controller, the instance is currently not deleted in AWS if we use this cloudprovider shutdown function as-is, and the node taint is added correctly, ONLY when it is safe to detach volumes. Cloud controller manager: What do you think, can we modify InstanceExistsByProviderID to work a little differently in order to keep nodes in the cluster? The current solution is not perfect, because if an instance stays in the rebooting state for a long time, the node is deleted from the cluster. Which solution is better? 1) add pending and stopping states to shutdown 2) modify InstanceExistsByProviderID (we would then need to make the same change in all cloudproviders). The spec is going to be changed from
		return false, err
	}
	if len(instances) == 0 {
		return false, nil
Does this imply that we took so long that AWS removed the instance from the list? I agree it is unexpected, but maybe glog.Warning and return true, nil
+1 we should return true here with a warning
If we return true here, the instance is not removed from the cluster; the shutdown taint is added instead. But the instance really has been deleted from AWS, so imo the correct behavior is to return false here. By returning false, the code continues with the remaining checks and finally removes the node from the cluster.
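To make the ordering in this comment concrete, here is a minimal sketch of the decision as described above. The names `nodeAction` and `decide` are hypothetical and are not taken from the PR or from the controller code; the sketch only assumes that the shutdown check runs first and that the exists check runs when shutdown returns false.

```go
package main

import "fmt"

// nodeAction is a hypothetical label for what happens to the node.
type nodeAction string

const (
	actionTaint  nodeAction = "add shutdown taint"
	actionDelete nodeAction = "delete node"
	actionKeep   nodeAction = "keep node"
)

// decide follows the ordering described in the comment (an assumption about
// the caller, not its actual code): a true shutdown result taints the node,
// and only a false result lets the exists check decide on deletion.
func decide(shutdown, exists bool) nodeAction {
	if shutdown {
		return actionTaint
	}
	if !exists {
		return actionDelete
	}
	return actionKeep
}

func main() {
	// Instance already removed from AWS: shutdown=false lets the node be deleted.
	fmt.Println(decide(false, false))
	// Returning true for the same instance would leave a tainted node behind.
	fmt.Println(decide(true, false))
}
```

Under these assumptions, returning true for an empty instance list would keep a node whose instance no longer exists, which is the concern raised here.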
	state := instances[0].State.Name
	// valid state for detaching volumes is stopped
	if *state == "stopped" {
I think also terminated. Might also be better to use the constants in the ec2 sdk: InstanceStateNameStopped and InstanceStateNameTerminated
@justinsb can you read my longer comment above? The thing here is that with the nodelifecycle controller this is going to work pretty well: nodes are not deleted from the cluster if they are in some odd state like pending. However, in the future, with the cloud controller manager, this is not going to work. We have two interfaces where we should check all possible instance states; otherwise the nodes are deleted from the cluster. Currently we check only stopped, terminated (? I will add it) and active. We should add the shutting-down, stopping, pending and rebooting instance states to InstanceShutdownByProviderID or InstanceExistsByProviderID. Imo exists is the correct place for these.
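One possible reading of "exists is the correct place", sketched with a hypothetical helper `instanceExists` that is not from the PR: treat every state other than terminated as still existing, so transitional states never cause the node to be deleted. Treating terminated as the only "gone" state is an assumption made for this sketch, and the state strings are copied from the comment above rather than from the SDK.

```go
package main

import "fmt"

// stateTerminated is a local stand-in for the terminated state name.
const stateTerminated = "terminated"

// instanceExists is a hypothetical helper. It assumes that only a terminated
// instance should count as gone; any other state, including an unknown or
// empty one, keeps the node in the cluster.
func instanceExists(state string) bool {
	return state != stateTerminated
}

func main() {
	// States named in the discussion, plus the two stable ones.
	states := []string{
		"pending", "running", "rebooting", "stopping",
		"stopped", "shutting-down", "terminated",
	}
	for _, s := range states {
		fmt.Printf("%-14s exists=%v\n", s, instanceExists(s))
	}
}
```

The design choice here is to default to keeping the node: a state the code does not recognize is treated the same as a transitional one.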
		return false, fmt.Errorf("multiple instances found for instance: %s", instanceID)
	}

	state := instances[0].State.Name
Maybe aws.StringValue(instances[0].State.Name) to avoid a panic if name is not set (though we'll still panic if state is nil, but ...)
I'd prefer checking state before doing this
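A minimal sketch of the nil-safe read discussed in the two comments above. To keep it self-contained, `instance` and `instanceState` are simplified stand-ins for the SDK's `ec2.Instance` and `ec2.InstanceState`, and `stringValue` is a local copy of the behavior of `aws.StringValue` (empty string for a nil pointer). The helper name `stateName` is hypothetical.

```go
package main

import "fmt"

// instanceState and instance are simplified stand-ins for the SDK types;
// only the fields used in this sketch are modeled.
type instanceState struct{ Name *string }
type instance struct{ State *instanceState }

// stringValue mirrors aws.StringValue: it returns "" for a nil pointer.
func stringValue(s *string) string {
	if s == nil {
		return ""
	}
	return *s
}

// stateName checks State before dereferencing it, then reads Name safely.
// The boolean reports whether a State was present at all.
func stateName(i *instance) (string, bool) {
	if i == nil || i.State == nil {
		return "", false
	}
	return stringValue(i.State.Name), true
}

func main() {
	stopped := "stopped"
	fmt.Println(stateName(&instance{State: &instanceState{Name: &stopped}}))
	fmt.Println(stateName(&instance{State: &instanceState{}})) // Name not set
	fmt.Println(stateName(&instance{}))                        // State is nil
}
```

Checking `State` first covers the case that `aws.StringValue` alone does not, which is the panic mentioned in the first comment.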
I'm not very familiar with these controllers - it seems like they are new. I'm digging into what they do, but some comments on the code in the meantime.
Also, why are you stopping your instances on AWS, vs just terminating them?
Ah - this is brand new. Let's review alongside the reapplication of #59323. I put some comments on the API here: #59323 (comment) /ok-to-test
fyi @eparis
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle rotten
/milestone v1.12
@zetaab: You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to set the milestone. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/retest
@kubernetes/sig-aws-misc
	if instance.State != nil {
		state := aws.StringValue(instance.State.Name)
		// valid state for detaching volumes
		if state == "stopped" || state == "terminated" {
Can we use InstanceStateNameStopped and InstanceStateNameTerminated constants instead?
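A sketch of what the comparison could look like with named constants instead of string literals. To stay self-contained, the two constants below are local stand-ins carrying the same values as the SDK's `ec2.InstanceStateNameStopped` and `ec2.InstanceStateNameTerminated`; in the provider code the SDK constants would be used directly. The helper name `safeToDetach` is hypothetical.

```go
package main

import "fmt"

// Local stand-ins with the same values as the EC2 SDK constants
// ec2.InstanceStateNameStopped and ec2.InstanceStateNameTerminated.
const (
	instanceStateNameStopped    = "stopped"
	instanceStateNameTerminated = "terminated"
)

// safeToDetach reports whether the state is one of the two states the diff
// treats as valid for detaching volumes.
func safeToDetach(state string) bool {
	return state == instanceStateNameStopped || state == instanceStateNameTerminated
}

func main() {
	for _, s := range []string{"running", "stopping", "stopped", "terminated"} {
		fmt.Printf("%-10s safeToDetach=%v\n", s, safeToDetach(s))
	}
}
```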
mostly looks good to me. one minor change and we are good to go.
changes according what was asked
use string
do not delete instance if it is in any other state than running
use constants
fix
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gnufied, zetaab The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test all [submit-queue is verifying that this PR is safe to merge]
Automatic merge from submit-queue (batch tested with PRs 67368, 59930, 68074). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.
@zetaab: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
What this PR does / why we need it: implement InstanceShutdownByProviderID to aws cloudprovider
Which issue(s) this PR fixes:
Fixes #59925
Special notes for your reviewer:
Release note: