Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Take node labels from cloud tags on AWS #9575

Merged
merged 4 commits into from
Oct 23, 2020

Conversation

johngmyers
Copy link
Member

@johngmyers johngmyers commented Jul 15, 2020

On AWS, refactors the kops-controller to source the node labels from the instance's cloud tags. This is so kops-controller doesn't read the (possibly unapplied) instancegroup spec from the state store.

The other cloud providers are strongly encouraged to refactor to this design.

Reverts #9519 due to its inability to handle tags with empty values (as mentioned in #9536).

Fixes #9856

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 15, 2020
@johngmyers
Copy link
Member Author

/cc @rifelpet since reverting his PR.

@k8s-ci-robot k8s-ci-robot added the area/provider/aws Issues or PRs related to aws provider label Jul 15, 2020
@johngmyers
Copy link
Member Author

/retest

@johngmyers
Copy link
Member Author

/assign @justinsb

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 19, 2020
@johngmyers
Copy link
Member Author

/approve cancel
Would like @justinsb review on this design change.

@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 19, 2020
@johngmyers johngmyers force-pushed the node-labels branch 2 times, most recently from cced7d0 to b92da3e Compare July 20, 2020 03:55
@hakman
Copy link
Member

hakman commented Jul 22, 2020

I think @rifelpet was saying that AWS will fix soon the inability to handle tags with empty values. Maybe best to exclude from this PR.

@johngmyers
Copy link
Member Author

As can be seen from the expected test results, this PR will try to push k8s.io/cluster-autoscaler/node-template/label/node-role.kubernetes.io/node tags with empty values. So it cannot go in without either reverting #9519 or waiting for AWS to fix.

@hakman
Copy link
Member

hakman commented Aug 15, 2020

For Cluster Autoscaler it does not matter what value the tags have. Could set the value to "1" or "true" for now and remove it when the AWS fix is fully deployed. It is already available in us-east-1 I think, right @rifelpet ?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 17, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 17, 2020
@johngmyers johngmyers force-pushed the node-labels branch 2 times, most recently from eca4809 to cf3b6e2 Compare August 17, 2020 21:20
@rifelpet
Copy link
Member

I believe you can remove the revert commit now, the AWS fix has been rolled out.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 28, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 28, 2020
@johngmyers
Copy link
Member Author

@olemarkus that Test_ServerGroupModelBuilder looks difficult to update. A good project would be to refactor it to use golden output files like the AWS tests use.

@johngmyers
Copy link
Member Author

It's #9211 that's causing Test_ServerGroupModelBuilder to interact badly with this change. Now every time there's a change to model.CloudTagesForInstanceGroup() a whole lot of manual changes need to be made to the expected output it has encoded in test case structs.

I'm inclined to disable that test until it can be recoded to use model.RunGoldenTest().

@justinsb
Copy link
Member

I'm trying to understand the underlying problem. We update the instancegroup to add/remove labels, and then they get applied immediately?

I'd like to get the nodes able to get their configuration directly from kops-controller rather than requiring it to be on S3 (and giving the nodes S3 permission); that does require all the information in the instancegroup.yaml. The PoC is in #8198 . So the issue is that if we head in that direction, we likely won't be able to put all the information into the instance tags (?). I feel that the problem likely still exists when booting the nodes as well, as they read the latest cluster & instancegroup from S3. So I'm trying to understand the problem & the direction here :-)

The changes in here to fix the labels are good though - we should have been using BuildNodeLabels - that looks like a bug.

@johngmyers
Copy link
Member Author

I've listed the dependency graph in #9229. The underlying problem is indeed that changes to labels get applied too soon. There's also the issue that the instancegroup spec is written as the versioned API but read as the unversioned API.

The direction I'm trying to take is refactoring away from reading things that are updated at edit time towards things that are updated only on a "kops update cluster --yes" or "terraform apply".

Not that I understand why people want kops to generate Terraform output.

@johngmyers
Copy link
Member Author

/retest

1 similar comment
@johngmyers
Copy link
Member Author

/retest

@olemarkus
Copy link
Member

@olemarkus that Test_ServerGroupModelBuilder looks difficult to update. A good project would be to refactor it to use golden output files like the AWS tests use.

Yeah it is a beast, but it has also been very helpful. I'll about converting to golden output after I am done with the PR/branches I have in queue now.

@johngmyers
Copy link
Member Author

@justinsb I'm having some difficulty teasing out the remaining intent of #8198, especially since my recent kops-controller worker-node bootstrap work addresses the problem of getting the keypairs.

We are currently advertising a model where edits to the cluster and instancegroup specs don't have an immediate effect on the cluster. In this model the admin can get a preview of the cloud-provider-level changes and the changes are applied as a set only when the admin takes a confirming action. After that point new instances get the new specification, whereas existing nodes tend to keep their existing configuration.

Reality is a lot more fuzzy, with some changes taking effect at edit and/or preview time. I expect admins would find such discrepancies undesirably surprising.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 5, 2020
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Sep 5, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 11, 2020
@justinsb
Copy link
Member

Thanks @johngmyers ... I think my objections were around whether there were going to be more things to solve down the road, which is a silly thing for me to block on ... let's try it and see!

/approve
/lgtm

We can discuss today whether we want this in 1.19

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 23, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: justinsb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 23, 2020
@k8s-ci-robot k8s-ci-robot merged commit fbb172c into kubernetes:master Oct 23, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Oct 23, 2020
@johngmyers johngmyers deleted the node-labels branch October 25, 2020 00:13
k8s-ci-robot added a commit that referenced this pull request Oct 25, 2020
…575-upstream-release-1.19

Automated cherry pick of #9575: Rename NodeReconciler to LegacyNodeReconciler
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kops-controller area/provider/aws Issues or PRs related to aws provider area/provider/openstack Issues or PRs related to openstack provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kops-controller can't determine AWS region
6 participants