Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes zone/region labels setup and kubelet stucking on startup if credentials stored in secret for legacy vSphere cloudprovider. #101028

Merged

Conversation

lobziik
Copy link
Member

@lobziik lobziik commented Apr 12, 2021

What this PR does / why we need it:

Disables attempts of obtaining zones within kubelet during initial node registration by vSphere provider if credentials stored in secret.
Setting zone and region labels for node moved to KCM in such case.

If credentials stored in cloud-provider config file as plaintext current behaviour does not change.

Which issue(s) this PR fixes:

Fixes #75175. Kubelet does not stucking on startup with this patch. Zone labels populates for nodes during KCM startup.

Notes:

For proper functioning ClusterRole + ClusterRoleBinding need to be created if RBAC is in use. This should be documented somewhere, would be awesome if anybody will point me a good place for this. In case of lack of permissions labels will not be set.

What type of PR is this?

/kind bug
/sig cloud-provider

/assing @andrewsykim

Does this PR introduce a user-facing change?

Fixed bug with leads to Node goes "Not-ready" state when credentials for vCenter stored in a secret and Zones feature is in use.
Zone labels setup moved to KCM component, kubelet skips this step during startup in such case. If credentials stored in cloud-provider config file as plaintext current behaviour does not change and no action required.

Action required: For proper functioning "kube-system:vsphere-legacy-cloud-provider" should be allowed to update node object if vCenter credentials stored in secret and Zone feature used.

@k8s-ci-robot k8s-ci-robot added release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 12, 2021
@k8s-ci-robot
Copy link
Contributor

Hi @lobziik. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Danil-Grigorev
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 12, 2021
func (vs *VSphere) GetZone(ctx context.Context) (cloudprovider.Zone, error) {
nodeName, err := vs.CurrentNodeName(ctx, vs.hostName)
if err != nil {
klog.Errorf("Cannot get node name.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.Errorf("Cannot get node name.")
klog.ErrorS(nil, "Cannot get node name.")

@yangjunmyfm192085
Copy link
Contributor

Is it necessary to structured log migration?

@lobziik
Copy link
Member Author

lobziik commented Apr 13, 2021

@yangjunmyfm192085 Dunno, in my opinion logs format migration should be done for entire module. Would prefer to keep it consistent with other provider's code for now.

@lobziik
Copy link
Member Author

lobziik commented Apr 13, 2021

/retest

@lobziik
Copy link
Member Author

lobziik commented May 20, 2021

@andrewsykim @cheftako, Hello! Could you please take a brief look on this please? Is this approach reasonable? :)

@@ -894,7 +901,16 @@ func (vs *VSphere) LoadBalancer() (cloudprovider.LoadBalancer, bool) {
}

func (vs *VSphere) isZoneEnabled() bool {
return vs.cfg != nil && vs.cfg.Labels.Zone != "" && vs.cfg.Labels.Region != ""
isEnabled := vs.cfg != nil && vs.cfg.Labels.Zone != "" && vs.cfg.Labels.Region != ""
Copy link
Member

@cheftako cheftako May 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really both zone and region need to be set or just one of zone and region needs to be set? (I see that was the previous logic, so no problems going forward like that, just checking)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://vmware.github.io/vsphere-storage-for-kubernetes/documentation/zones.html this doc tells that both need to be set. Case with one parameter does not described at all.

@cheftako
Copy link
Member

Happy to approve but would like to see one of the primary vsphere contributors lgtm first.

@lobziik
Copy link
Member Author

lobziik commented Jun 22, 2021

@divyenpatel @andrewsykim @SandeepPissay Hello! Could you please take a look on this please? Would be awesome to get a feedback from vsphere maintainers :)

@SandeepPissay
Copy link
Contributor

@lobziik unfortunately I do not have bandwidth to review code changes in legacy vSphere cloud provider. Its also been a while I looked at that code.

@jcpowermac
Copy link

@lobziik unfortunately I do not have bandwidth to review code changes in legacy vSphere cloud provider. Its also been a while I looked at that code.

@SandeepPissay our mutal customers would like to be able to support zones. If you are unable to review is there someone else in the reviewer/approver list that can?

@gnufied
Copy link
Member

gnufied commented Oct 21, 2021

/assign

@lobziik lobziik force-pushed the vsphere-zones-with-secret-creds branch from 2030d91 to bfdbfa2 Compare January 5, 2022 14:58
@lobziik
Copy link
Member Author

lobziik commented Jan 5, 2022

Updated pr a bit, what was done:

  • rebased on top of current master
  • fixes according @lubronzhan review

I noticed this in the release notes, is kube-syste:vsphere-legacy-cloud-provider an official role or just an example? We should make this clear in the release notes

@andrewsykim, official, i think. I introduced client instantiation here: https://github.com/kubernetes/kubernetes/pull/101028/files#diff-5000d45019379218cc35bc5f967f2731b4dbff230399361c8f6b9a8af3e7e1f0R277
Name is just hardcoded as in other legacy providers. I fixed typo in PR description, thanks for catching that!

@lobziik
Copy link
Member Author

lobziik commented Jan 5, 2022

/test pull-kubernetes-integration

@andrewsykim
Copy link
Member

@andrewsykim, official, i think. I introduced client instantiation here: https://github.com/kubernetes/kubernetes/pull/101028/files#diff-5000d45019379218cc35bc5f967f2731b4dbff230399361c8f6b9a8af3e7e1f0R277
Name is just hardcoded as in other legacy providers. I fixed typo in PR description, thanks for catching that!

Fwiw I think this is just used as the user-agent in the client and not for RBAC

@lubronzhan
Copy link
Contributor

lubronzhan commented Jan 5, 2022

Thanks for reminding Andrew!
Just realized that. If we don't automatically add the role binding and role before node registration, the code here will try send 3 requests to APIserver and get error of not enough permission, failing to add the labels and won't ever add it again.
Unless customer add those rolebinding and role, then recreate nodes again.
Should we consider adding the rolebinding and role as default?

@andrewsykim, official, i think. I introduced client instantiation here: https://github.com/kubernetes/kubernetes/pull/101028/files#diff-5000d45019379218cc35bc5f967f2731b4dbff230399361c8f6b9a8af3e7e1f0R277
Name is just hardcoded as in other legacy providers. I fixed typo in PR description, thanks for catching that!

Fwiw I think this is just used as the user-agent in the client and not for RBAC

Ok so it means the client will still use the kubelet's service account? Let me check again

@lubronzhan
Copy link
Contributor

Looking at the code of ClientOrDie here, looks like the input will be used as service account name to create config

cachedTokenSource := transport.NewCachedTokenSource(&tokenSourceImpl{
namespace: t.Namespace,
serviceAccountName: saName,
coreClient: t.CoreClient,
expirationSeconds: t.expirationSeconds,
leewayPercent: t.leewayPercent,
})
configCopy.Wrap(transport.ResettableTokenSourceWrapTransport(cachedTokenSource))
t.roundTripperFuncMap[saName] = configCopy.WrapTransport

@andrewsykim
Copy link
Member

Ah you're right @lubronzhan, I was thinking of the simple client builder:

// Config returns a client config for a fixed client
func (b SimpleControllerClientBuilder) Config(name string) (*restclient.Config, error) {
clientConfig := *b.ClientConfig
return restclient.AddUserAgent(&clientConfig, name), nil
}

@andrewsykim
Copy link
Member

Should we consider adding the rolebinding and role as default?

We no longer support bootstrapping role/rolebindings specific to cloud providers, these should be handled externally

@lubronzhan
Copy link
Contributor

If the role/rolebinding is handled externally, could we run into this issue?

If we don't automatically add the role binding and role before node registration, the code here will try send 3 requests to APIserver and get error of not enough permission, failing to add the labels and won't ever add it again.
Unless customer add those rolebinding and role, then recreate nodes again.

@lobziik
Copy link
Member Author

lobziik commented Jan 11, 2022

@lubronzhan, I went through the code again and yes, seems this will engage only once - during new Node handling within KCM...

I will need check this again from kubelet perspective, poked around quite a while ago and don't exactly remember how it behaves with secret there.

@lobziik
Copy link
Member Author

lobziik commented Jan 12, 2022

I dug into kubelet code, and, as far as I can say there is only one place where zone labels populates:

zones, ok := kl.cloud.Zones()

In other words, yes, in case of not enough permissions this what would happen exactly.

If we don't automatically add the role binding and role before node registration, the code here will try send 3 requests to APIserver and get error of not enough permission, failing to add the labels and won't ever add it again.
Unless customer add those rolebinding and role, then recreate nodes again.

@lobziik
Copy link
Member Author

lobziik commented Jan 13, 2022

I attempted to fix the case with not sufficient permissions in 3dfc011
Doubting about this solution though - this handler might be a quite hot piece of code (calls on every node update), but didn't figure out anything better yet...
@lubronzhan could you please take a look?

@lubronzhan
Copy link
Contributor

Hi @lobziik Right now the node label will be set every time when node is updated.
Does that mean after the customer add the role and rolebinding he needs to somehow trigger the node updates?

@lobziik
Copy link
Member Author

lobziik commented Jan 25, 2022

Hi @lobziik Right now the node label will be set every time when node is updated.
Does that mean after the customer add the role and rolebinding he needs to somehow trigger the node updates?

Hi! For some reason i thought that this handler would be triggered time to time, during status update/heartbeat, but seems not. So, yes, in current variant node update need to be to be triggered somehow for reconcile zone labels.

p.s. made this handler run every 5 minutes in cd1d530 for avoid necessity to trigger node update.
Additionaly, this function might be used for reconcile changes in vSphere tags, since topology labels based on that

cc WDYT @lubronzhan ?

In the case of `vsphere-legacy-cloud-provider` client
has insufficient permissions to update Node after its addition,
this handler will attempt to populate Nodes topology labels later.
@lobziik lobziik force-pushed the vsphere-zones-with-secret-creds branch from 3dfc011 to cd1d530 Compare January 25, 2022 13:38
@lobziik
Copy link
Member Author

lobziik commented Feb 7, 2022

Gentle ping @lubronzhan, @andrewsykim :)

@lubronzhan
Copy link
Contributor

Sorry I missed the email notification.
Yeah now it makes more sense!
/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 7, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheftako, lobziik, lubronzhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit c2ba0a4 into kubernetes:master Feb 7, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 7, 2022
lobziik added a commit to lobziik/cluster-kube-controller-manager-operator that referenced this pull request Jun 1, 2022
Starting kube 1.24, legacy vsphere cloud-provider requires
permission to update nodes in order to set topology related labels.

For more info see kubernetes/kubernetes#101028
atiratree pushed a commit to atiratree/cluster-kube-controller-manager-operator that referenced this pull request Jun 20, 2022
Starting kube 1.24, legacy vsphere cloud-provider requires
permission to update nodes in order to set topology related labels.

For more info see kubernetes/kubernetes#101028
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cloudprovider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

VSphere cloud provider settings with secret not working.