instances: initial implementation of instancesV2 interface #131

Merged
k8s-ci-robot merged 1 commit into kubernetes:master on Oct 21, 2020

Conversation

nicolehanjing
Member

What type of PR is this?
/kind feature

What this PR does / why we need it:
Take over the initial work here: #127
This is a first pass at implementing Instances.
Some TODOs:

  • add more unit tests
  • support node naming policy other than private DNS names

Which issue(s) this PR fixes:

Part of #125

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Add initial implementation of Instances for the v2 provider.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 25, 2020
@k8s-ci-robot
Contributor

Hi @nicolehanjing. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 25, 2020
// parseInstanceIDFromProviderID parses the node's instance ID based on the well-known provider ID format:
// * aws://<availability-zone>/<instance-id>
// This function always assumes a valid providerID format was provided.
func parseInstanceIDFromProviderID(providerID string) (string, error) {
Member Author

@andrewsykim is that intended to only parse well-formatted providerID? Should I take care of invalid cases?
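For reference, a variant that rejects malformed input could look like the sketch below. It is not the function from this PR: the name parseInstanceID is hypothetical, and it assumes the availability zone is always present in the provider ID, which may be stricter than what the provider needs to accept.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseInstanceID is a hypothetical, stricter variant of the function under
// review. It expects aws://<availability-zone>/<instance-id> and returns an
// error for anything else instead of assuming the input is valid.
func parseInstanceID(providerID string) (string, error) {
	u, err := url.Parse(providerID)
	if err != nil {
		return "", fmt.Errorf("invalid provider ID %q: %v", providerID, err)
	}
	if u.Scheme != "aws" {
		return "", fmt.Errorf("invalid provider ID %q: scheme must be aws", providerID)
	}
	// u.Host carries the availability zone; u.Path carries "/<instance-id>".
	instanceID := strings.TrimPrefix(u.Path, "/")
	if u.Host == "" || instanceID == "" || strings.Contains(instanceID, "/") {
		return "", fmt.Errorf("invalid provider ID %q: want aws://<availability-zone>/<instance-id>", providerID)
	}
	return instanceID, nil
}

func main() {
	id, err := parseInstanceID("aws://us-west-2a/i-0123456789abcdef0")
	fmt.Println(id, err)

	_, err = parseInstanceID("i-0123456789abcdef0")
	fmt.Println(err)
}
```

Whether the extra validation is worth it depends on whether any caller can pass a provider ID that was not set by this provider.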


if ec2Instance.State != nil {
state := aws.StringValue(ec2Instance.State.Name)
if state == ec2.InstanceStateNameTerminated || state == ec2.InstanceStateNameStopping || state == ec2.InstanceStateNameStopped {
Member Author

@andrewsykim A little confused on how we define "shutdown"
I feel that states after "shutting down" (terminated, stopping, stopped) should all be considered as a "shutdown" state, is that right? 🤔

Member

I think "shutdown" is really referring to "stopped" here. The key difference from "terminated" is that a stopped instance can go back to the running state, whereas terminated instances are gone for good.

Member

So we should remove the check for ec2.InstanceStateNameTerminated here.

Member Author

gotcha, thanks for the explanation!
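A minimal sketch of the state check after dropping "terminated", under stated assumptions: the state names are mirrored as plain strings so the example has no AWS SDK dependency, the helper name isShutdown is hypothetical, and treating "stopping" as shutdown follows the remaining checks in the snippet above rather than a settled decision.

```go
package main

import "fmt"

// EC2 instance state names, mirrored as plain strings for this sketch.
// In the PR these come from the ec2.InstanceStateName* constants.
const (
	stateRunning    = "running"
	stateStopping   = "stopping"
	stateStopped    = "stopped"
	stateTerminated = "terminated"
)

// isShutdown reports whether an instance counts as shut down. A stopped
// instance can be started again, so it is shut down but still exists.
// A terminated instance is gone for good and is not reported as shut down.
func isShutdown(state string) bool {
	switch state {
	case stateStopping, stateStopped:
		return true
	default:
		return false
	}
}

func main() {
	for _, s := range []string{stateRunning, stateStopping, stateStopped, stateTerminated} {
		fmt.Printf("%s: shutdown=%v\n", s, isShutdown(s))
	}
}
```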

var err error
var ec2Instance *ec2.Instance
if node.Spec.ProviderID == "" {
// TODO: support node name policy other than private DNS names
Member Author

Not sure I have enough context here; can you share more input? :) @andrewsykim

Member

In the existing implementation, only private DNS is allowed for a node's name (see kubernetes/kubernetes#52241). We should allow other naming policies. I think a reasonable starting point is allowing the node name to be either the private DNS name or the instance name.

Member

I'm thinking that maybe we need a ConfigMap or something to store this information. For this PR maybe just implementing with private DNS is sufficient.

Member Author

gotcha, thanks for the context!

Member

After looking into kops and EKS in some detail, I don't think we would want "arbitrary" names, or at least if we do, it would have to come with huge caveats around node security.

In particular, EKS and kops-controller match an instance ID attestation with the privateDNSName, via an STS based authentication webhook in EKS using aws-iam-authenticator, or via issuance of a certificate with the privateDNSName as the node name with presentation of the AWS instance identity document in kops-controller's case (CAPI will likely also implement the latter).

In theory, this means we can securely support instanceID, privateDNSName or other unique identifiers returned by EC2 DescribeInstances, but beyond that, it would have to come with big "you don't want to use this" warnings.

cc @nckturner @justinsb

Member

Also, why would we use a ConfigMap vs. say ComponentConfig for the controller?

Contributor

EKS currently has a lot of built-in assumptions on private DNS. That being said, instance ID is attractive because it's guaranteed to be unique. I'm ok with making it possible for other names, but we will need to be able to restrict it for EKS to stuff available to DescribeInstances, like @randomvariable said. I don't think a ConfigMap is a good idea though. It seems like that suggests we want to allow on-the-fly reconfiguration, which for EKS we definitely wouldn't want. Guessing, but if we allowed customers to make changes here then we'd probably want it protected behind an EKS API, which would be easier to do if it was configuration passed in via flags/ComponentConfig file on disk, and required a restart.

Member

> Also, why would we use a ConfigMap vs. say ComponentConfig for the controller?

I'm not particularly tied to using ConfigMap here, but if we use a config file it should ideally be yaml/json and not INI. I'm not sure ComponentConfig is relevant here since this is a config file read by aws-cloud-controller-manager, not a config for any of its options/flags.

> EKS currently has a lot of built-in assumptions on private DNS. That being said, instance ID is attractive because it's guaranteed to be unique. I'm ok with making it possible for other names, but we will need to be able to restrict it for EKS to stuff available to DescribeInstances

Sounds like there needs to be a broader discussion on this topic for sure. @nicolehanjing since we know for sure that we still want to support private DNS, let's get this PR only working with that and make sure we have a follow-up PR to support instance ID and other naming policies.

Member Author

gotcha! thanks for the info!

@nicolehanjing nicolehanjing force-pushed the nicoleh-instances-v2 branch 3 times, most recently from a14a944 to 72c968f Compare September 29, 2020 04:58
}
}

func TestInstanceExists(t *testing.T) {
Member Author

@andrewsykim Updated the unit tests, let me know if you have any other suggestions :)

@randomvariable
Member

randomvariable commented Sep 29, 2020

We'll need some follow up issues to do the following, but not blocking this PR:

  • Add rate limiters and metrics collection to the AWS request handlers
  • Check if there's anything needed to support custom EC2 endpoints (for GovCloud/Secret Region and Outposts setups)

@randomvariable
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 29, 2020
@andrewsykim
Member

andrewsykim commented Sep 29, 2020

> We'll need some follow up issues to do the following, but not blocking this PR:
>
>   • Add rate limiters and metrics collection to the AWS request handlers
>   • Check if there's anything needed to support custom EC2 endpoints (for airgapped and Outposts setups)

+1, we should definitely add the first one before removing the alpha gating env var (but in a follow-up PR):

if cloudProvider == awsv2.ProviderName {
    if v2Enabled := os.Getenv(enableAlphaV2EnvVar); v2Enabled != "true" {
        klog.Fatalf("aws/v2 cloud provider requires environment variable ENABLE_ALPHA_V2=true to be set")
    }
}

@nicolehanjing nicolehanjing changed the title [WIP] instances: initial implementation of instancesV2 interface instances: initial implementation of instancesV2 interface Sep 29, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 29, 2020
func (i *instances) InstanceExists(ctx context.Context, node *v1.Node) (bool, error) {
var err error
if node.Spec.ProviderID == "" {
_, err = i.getInstanceByPrivateDNSName(ctx, node.Name)
Contributor

I wouldn't mind some level 4 (or maybe higher) logging here that printed a line "looking for node X by private DNS name".

Member Author

gotcha, will add!

}
}

_, err = i.getInstanceByProviderID(ctx, node.Spec.ProviderID)
Contributor

Another log line at 4 or higher for "looking for node by provider ID".

if err != nil {
return false, err
}
}
Contributor

If the instance does exist by private DNS name, and the provider ID is empty, we are still falling through this block and calling getInstanceByProviderID()? Should we have an else statement for the provider ID check?

Member Author

updated! I wrapped the logic in an if-else block, which ensures we only call getInstanceByProviderID() when the providerID is not empty

if node.Spec.ProviderID == "" {
_, err = i.getInstanceByPrivateDNSName(ctx, node.Name)
if err == cloudprovider.InstanceNotFound {
return false, nil
Contributor

If we group this entire thing into an if else, then we can also add a single err != nil and err == cloudprovider.InstanceNotFound at the end of the function, and add logs (maybe at level 6?) for not found instances.
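The suggested shape can be sketched as follows. This is an illustration with stand-in types, not the code from the PR: errInstanceNotFound stands in for cloudprovider.InstanceNotFound, the node and lookup types are hypothetical, and the klog calls are left out to keep the example dependency-free.

```go
package main

import (
	"errors"
	"fmt"
)

// errInstanceNotFound stands in for cloudprovider.InstanceNotFound.
var errInstanceNotFound = errors.New("instance not found")

// node is a hypothetical stand-in for the fields of *v1.Node used here.
type node struct {
	name       string
	providerID string
}

// lookup bundles the two lookups so they can be faked in this sketch.
type lookup struct {
	byPrivateDNSName func(name string) error
	byProviderID     func(providerID string) error
}

// instanceExists picks exactly one lookup with if/else, then handles
// "not found" and other errors once at the end of the function.
func instanceExists(l lookup, n node) (bool, error) {
	var err error
	if n.providerID == "" {
		err = l.byPrivateDNSName(n.name)
	} else {
		err = l.byProviderID(n.providerID)
	}

	if errors.Is(err, errInstanceNotFound) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	l := lookup{
		byPrivateDNSName: func(string) error { return nil },
		byProviderID:     func(string) error { return errInstanceNotFound },
	}
	fmt.Println(instanceExists(l, node{name: "ip-192-168-0-1.ec2.internal"}))
	fmt.Println(instanceExists(l, node{name: "ip-192-168-0-1.ec2.internal", providerID: "aws://us-west-2a/i-0123456789abcdef0"}))
}
```

With this structure the provider ID lookup is never reached when the provider ID is empty, which also addresses the fall-through noted earlier in the review.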

var ec2Instance *ec2.Instance
if node.Spec.ProviderID == "" {
// TODO: support node name policy other than private DNS names
ec2Instance, err = i.getInstanceByPrivateDNSName(ctx, node.Name)
Contributor

Instead of having the getInstanceByPrivateDNSName and then getInstanceByProviderID logic in multiple places, can we put all that logic into a getInstance(ctx context.Context, node *Node) (*ec2.Instance, error) function? That way it will be easier to refactor or add other forms of getInstance in the future.

Member Author

The only difference between these two functions is the EC2 request. I think we could put the logic into a getInstance(ctx context.Context, node *Node) (*ec2.Instance, error) function, but inside that function we need to differentiate the request based on the type of node info given.
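As a rough sketch of that idea, the request could be built in one place and everything after it shared. The types below are simplified stand-ins for ec2.DescribeInstancesInput and ec2.Filter, and newDescribeRequest is a hypothetical helper name; "private-dns-name" is the EC2 DescribeInstances filter for an instance's private DNS name.

```go
package main

import "fmt"

// filter and describeInstancesInput are simplified stand-ins for the
// AWS SDK types ec2.Filter and ec2.DescribeInstancesInput.
type filter struct {
	Name   string
	Values []string
}

type describeInstancesInput struct {
	InstanceIDs []string
	Filters     []filter
}

// newDescribeRequest builds the only part that differs between the two
// lookups. instanceID is expected to be already parsed from the provider ID
// and empty when the node has no provider ID.
func newDescribeRequest(nodeName, instanceID string) describeInstancesInput {
	if instanceID != "" {
		return describeInstancesInput{InstanceIDs: []string{instanceID}}
	}
	// Fall back to the node name, assumed here to be the private DNS name.
	return describeInstancesInput{
		Filters: []filter{{Name: "private-dns-name", Values: []string{nodeName}}},
	}
}

func main() {
	fmt.Printf("%+v\n", newDescribeRequest("ip-192-168-0-1.ec2.internal", ""))
	fmt.Printf("%+v\n", newDescribeRequest("ip-192-168-0-1.ec2.internal", "i-0123456789abcdef0"))
}
```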

@nckturner
Contributor

/cc @micahhausler @wongma7


// getInstance returns the instance if the instance with the given node info still exists.
// If false an error will be returned, the instance will be immediately deleted by the cloud controller manager.
func (i *instances) getInstance(ctx context.Context, node *v1.Node) (*ec2.Instance, error) {
Member Author

@nckturner Updated!

  • added the logs
  • unified two functions getInstanceByProviderID and getInstanceByPrivateDNSName into getInstance and the only difference is the ec2 request input
  • likewise unified two functions InstanceShutdownByProviderID and InstanceShutdownByPrivateDNSName into InstanceShutdown

PTAL! :)

request = &ec2.DescribeInstancesInput{
InstanceIds: []*string{aws.String(instanceID)},
Filters: []*ec2.Filter{
newEc2Filter("instance-state-name", aliveFilter...),
Contributor

It may be better to describe without this aliveFilter and instead filter out terminated instances locally like v1 did. I do not have numbers or documentation, but my understanding is that filters can affect DescribeInstances performance. cc kubernetes/kubernetes#78140

Member Author

gotcha, will update! Thanks for the context
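Filtering locally could look roughly like the sketch below. The instance type is a hypothetical stand-in for *ec2.Instance with only the fields needed here, and dropTerminated is an illustrative name, not a function from v1 or from this PR.

```go
package main

import "fmt"

// instance is a hypothetical stand-in for *ec2.Instance.
type instance struct {
	ID    string
	State string
}

// dropTerminated filters the DescribeInstances result on the client side,
// so the request itself does not need an instance-state-name filter.
func dropTerminated(in []instance) []instance {
	alive := make([]instance, 0, len(in))
	for _, inst := range in {
		if inst.State == "terminated" {
			continue
		}
		alive = append(alive, inst)
	}
	return alive
}

func main() {
	described := []instance{
		{ID: "i-0aaa", State: "running"},
		{ID: "i-0bbb", State: "terminated"},
		{ID: "i-0ccc", State: "stopped"},
	}
	fmt.Println(dropTerminated(described))
}
```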

Member

@andrewsykim andrewsykim left a comment

Really minor comments

_, err := i.getInstance(ctx, node)

if err == cloudprovider.InstanceNotFound {
klog.V(6).Infof("instance not found")
Member

Log the node name here for the instance:

klog.V(6).Infof("instance not found for node: %s", node.Name)


nodeName := "ip-192-168-0-1.ec2.internal"

tests := []struct {
Member

These tests are looking great :)

tests := []struct {
name string
node *v1.Node
expectedEc2Output *ec2.DescribeInstancesOutput
Member

I would call this field mockedEC2Output instead, since we are not actually validating against this as an "expected" value.

Member

same applies for other tests

Member Author

gotcha, will update!

@nicolehanjing
Member Author

/retest

@nicolehanjing nicolehanjing force-pushed the nicoleh-instances-v2 branch 2 times, most recently from ccc5ee3 to 278380e Compare October 20, 2020 03:48
@nicolehanjing
Member Author

@andrewsykim updated! PTAL :)


if err == ErrInstanceTerminated {
klog.V(6).Infof("instance terminated for node: %s", node.Name)
return true, nil
Member

ErrInstanceTerminated is for terminated instances, which I don't think applies to the "shutdown" case. Checking ec2.InstanceStateNameStopped might be enough for checking shutdown.

Member

Oh I see you're already checking that below. In that case for terminated state this should return false, nil or false, err

Member

I think this whole block can be simplified actually to:

ec2Instance, err := i.getInstance(ctx, node)
if err != nil {
    return false, err
}

Member Author

gotcha, updated!

func (i *instances) InstanceShutdown(ctx context.Context, node *v1.Node) (bool, error) {
ec2Instance, err := i.getInstance(ctx, node)
if err != nil {
return false, err
Member Author

@andrewsykim Updated the shutdown checks, PTAL!
Let me know if you have any other suggestions :)

}

state := instances[0].State.Name
if *state == ec2.InstanceStateNameTerminated {
Member

If DescribeInstances returns only 1 instance and that instance has state InstanceStateNameTerminated, then I think we should treat it similar to the case of len(instances) == 0 and here return nil, cloudprovider.InstanceNotFound.

Member Author

gotcha, updated!
I was thinking to have a different log for terminated instances, but as "terminated" is just an intermediate state I agree that we should treat it the same as 'not exist'

}

if len(instances) > 1 {
return nil, fmt.Errorf("getInstance: multiple instances found")
Member

nit: this should use errors.New
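Putting the last two review comments together, the result handling could be sketched like this. The types are stand-ins (instance for *ec2.Instance, errInstanceNotFound for cloudprovider.InstanceNotFound) and pickInstance is a hypothetical helper name, so treat it as an illustration of the agreed behavior rather than the merged code.

```go
package main

import (
	"errors"
	"fmt"
)

// errInstanceNotFound stands in for cloudprovider.InstanceNotFound.
var errInstanceNotFound = errors.New("instance not found")

// instance is a hypothetical stand-in for *ec2.Instance.
type instance struct {
	ID    string
	State string
}

// pickInstance turns a DescribeInstances result into a single instance.
// A lone terminated instance is treated the same as an empty result.
func pickInstance(instances []instance) (*instance, error) {
	if len(instances) == 0 {
		return nil, errInstanceNotFound
	}
	if len(instances) > 1 {
		// No formatting verbs are needed, so errors.New is enough.
		return nil, errors.New("getInstance: multiple instances found")
	}
	if instances[0].State == "terminated" {
		return nil, errInstanceNotFound
	}
	return &instances[0], nil
}

func main() {
	_, err := pickInstance([]instance{{ID: "i-0aaa", State: "terminated"}})
	fmt.Println(err)

	inst, err := pickInstance([]instance{{ID: "i-0bbb", State: "stopped"}})
	fmt.Println(inst.ID, err)
}
```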

Member

@andrewsykim andrewsykim left a comment

I think this is a reasonable starting point. Nothing added here is binding so we can revisit some decisions in a follow-up PR. Some functionality we need to revisit is:

  • Cluster tagging support
  • Support instance ID as the node name
  • Configuration API to store the various things that are configurable.

Thanks @nicolehanjing

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 21, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, nicolehanjing

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 21, 2020
@k8s-ci-robot k8s-ci-robot merged commit 805fde8 into kubernetes:master Oct 21, 2020