vSphere Cloud Provider should support implement Zones() interface #64021
Comments
|
As briefly discussed via Slack, you want to avoid communication with the vSphere API while the kubelet is bootstrapping; hence the metadata service and/or other options. I wonder if there is some work going on regarding that metadata service, or other options for the kubelet to get some data from the ESXi host or other parts of the system. If so, could you share some links to PRs, branches or the like?
|
@embano1 sounds good on the first pass, need to think through it some more, but a few notes for now: We need Go bindings for the tagging API; there have been several requests to add support in govmomi and govc: vmware/govmomi#957. As for metadata, there are at least two ways for vCenter and guests to share data without a network connection: "guestinfo" and "namespaceDB". Both are key-value-like stores where, from within the guest, data can be read/written via guest RPC over VMCI or the VM backdoor. Standard vmware-tools ships with
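For reference (my addition, not from the comment above): with VMware Tools or open-vm-tools installed, guestinfo keys can be read and written from inside the guest roughly like this; the key name is made up for illustration:

```console
# read a guestinfo key set on the VM (e.g. via extraConfig)
$ vmtoolsd --cmd "info-get guestinfo.k8s.zone"
# publish a key from the guest so vCenter-side tooling can read it back
$ vmtoolsd --cmd "info-set guestinfo.k8s.zone Cluster-ABC-Site-A"
```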
|
@dougm Are you saying that a metadata-service is not needed, but guestinfo/namespaceDB could be used instead?
|
@hoegaarden yes I think guestinfo/namespaceDB could solve this problem rather than having to define+build a new metadata-service. |
|
@dougm So you are thinking about configuring the region & zone info into e.g. the guestinfo store? If I understand the initial proposal correctly, @embano1 thought about tagging only the host (and/or the cluster and/or the DC) but not each individual VM, and then having the metadata service collect all those tags and flatten them. My questions here:
|
|
Populating the guestinfo and keeping it in sync is one option. We can use property collector notifications to sync after a migration for example. I need to take a closer look at the namespaceDB option, but there is an event queue designed for this type of interaction. George updated vmware/govmomi#1123 with some of the advantages of namespaceDB over guestinfo. |
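As an illustration of the property-collector approach mentioned above (a sketch under assumptions, not the eventual implementation): a controller could subscribe to changes of a VM's `runtime.host` property via govmomi's `property.Wait`. An authenticated `vim25.Client` and a known VM reference are assumed; client setup and error handling are abbreviated.

```go
// Package vspherezones contains illustrative sketches only.
package vspherezones

import (
	"context"
	"fmt"

	"github.com/vmware/govmomi/property"
	"github.com/vmware/govmomi/vim25"
	"github.com/vmware/govmomi/vim25/types"
)

// watchHost blocks and reports every change to the VM's runtime.host property,
// e.g. after a vMotion or an HA restart on another ESXi host.
func watchHost(ctx context.Context, c *vim25.Client, vm types.ManagedObjectReference) error {
	pc := property.DefaultCollector(c)
	// property.Wait invokes the callback on each update to the listed
	// properties; returning true from the callback stops the wait.
	return property.Wait(ctx, pc, vm, []string{"runtime.host"}, func(changes []types.PropertyChange) bool {
		for _, change := range changes {
			if host, ok := change.Val.(types.ManagedObjectReference); ok {
				fmt.Printf("VM %s is now on host %s\n", vm.Value, host.Value)
			}
		}
		return false // keep watching
	})
}
```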
|
I am a little bit hesitant about changing the labels of the kubelet VM after a vMotion/HA operation. Some questions that come to my mind:
For the reasons listed above, I'd rather leave the labels as applied initially and go with a recommendation engine for phase 2, which reports out-of-compliance status and allows an admin to trigger reconciliation actions that ship with the controller (CRD). Comparable to DRS partially automated mode, i.e. initial placement and then recommendations only. E.g. (unfinished thoughts):

```console
# query current compliance status (CRD)
$ kubectl status vSphere
+-----------+-------------------+----------------------------------+
| VM        | Status            | Details                          |
+-----------+-------------------+----------------------------------+
| worker-01 | Compliant         |                                  |
| worker-02 | Out of compliance | DRS Anti-Affinity rule violated  |
+-----------+-------------------+----------------------------------+

# get details
$ kubectl describe vSphere worker-02
....
Status: Out of compliance
Recommendation: Migration to ESXi host ESX-1
....

# apply recommendation
$ kubectl rebalance vSphere worker-02
worker-02 successfully migrated to ESXi-1
```
|
On the issue of updating Kubelet labels after initialization: #59314
|
|
Thanks @embano1 for raising this issue. I also support the approach where we do not update the labels, as in my understanding that would create more issues than it resolves in a DRS-activated cluster.
|
I think the labels are important to manage somehow, especially with vMotion or the like in mind. However, I believe that is definitely phase 2 and I am not too worried about that right now. I also like the idea of the CRD-managed suggestions / rebalance. I think right now, for phase 1, it'd be important to figure out how we actually share the region/zone information from the hosts to the guests.
I'd love to come to an agreement on how the region/zone information is passed into the VM / can be queried by the VM, so we can start to work on that. Having said that, chances are good I am missing some information and people are already working on that. In that case, please let me know :)
|
/assign @jiatongw |
|
@dougm: GitHub didn't allow me to assign the following users: jiatongw. Note that only kubernetes members and repo collaborators can be assigned.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/assign |
|
Update after discussion with @jiatongw on open questions for phase 1: vSphere categories map to Kubernetes well-known labels (region and zone), e.g.

```json
[
  {
    "category": [
      {
        "description": "Represents a well-known label in Kubernetes mapping to failure-domain.beta.kubernetes.io/region",
        "allowEmpty": false,
        "name": "k8s-io-region",
        "tags": [
          "EMEA",
          "US"
        ]
      },
      {
        "description": "Represents a well-known label in Kubernetes mapping to failure-domain.beta.kubernetes.io/zone",
        "allowEmpty": false,
        "name": "k8s-io-zone",
        "tags": [
          "Cluster-ABC-Site-A",
          "Cluster-ABC-Site-B"
        ]
      }
    ]
  }
]
```

These categories would be applied on a VM level to allow drift detection, e.g. after vMotion. Also, these tags may only be associated with VMs, and only one tag per category is allowed (options in vSphere when creating categories). Category and tag names have to be configurable by the vSphere admin. A recommendation could be "k8s-io-<...>" to avoid parsing errors with some special characters. If the vSphere admin does not create the tags, or there is a spelling error, the implementation can be configured to warn or fail when the Kubelet starts.

The implementation would also add another failure domain (and thus topologyKey), specific to vSphere: the ESXi host is also a failure domain (multiple VMs on one host). The Kubernetes scheduler should be made aware of this failure domain by adding a custom topologyKey.

By having the zone locality tag associated on a VM level, a controller could check for drift, e.g. after HA or vMotion (phase 2). Reconciliation behavior is customer specific. One approach without disruption to existing workloads could be taints: the controller would taint the migrated VM (kubelet) so that existing workloads continue to run.

Also todo: double-check with @tusharnt on whether we need modifications for persistent volume controllers. Since vSphere storage (VMFS, vSAN) typically is shared across all nodes in the cluster, expecting no issues out of the box. Any concerns for non-uniform stretched storage clusters? From the Kubernetes docs:
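For illustration only (not from the original comment): with a recent govc that includes the vAPI tagging commands, creating and attaching such categories/tags could look roughly like the following. The category/tag names match the example above; the inventory paths are made up.

```console
# categories for region and zone
$ govc tags.category.create -d "maps to failure-domain.beta.kubernetes.io/region" k8s-io-region
$ govc tags.category.create -d "maps to failure-domain.beta.kubernetes.io/zone" k8s-io-zone

# tags within those categories
$ govc tags.create -c k8s-io-region EMEA
$ govc tags.create -c k8s-io-zone Cluster-ABC-Site-A

# attach them to inventory objects (paths are examples)
$ govc tags.attach EMEA /DC-EMEA
$ govc tags.attach Cluster-ABC-Site-A /DC-EMEA/host/Cluster-ABC
```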
|
|
Latest update: the tagging would be applied at the host level. We can use
An example of the vSphere configuration file is shown below. If users don't provide
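The example configuration did not survive extraction; purely as a sketch (the `[Labels]` section and key names are my assumption, aligned with the `k8s-region`/`k8s-zone` category names shown in the later test output), the zone-related part of a vsphere.conf might look like:

```
# vsphere.conf (sketch) - sits alongside the usual [Global]/[VirtualCenter] sections
[Labels]
# names of the vSphere tag categories that hold the region and zone tags
region = k8s-region
zone = k8s-zone
```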
|
Update from a chat with @hoegaarden on CFCR implications with these changes: Phase 1
Phase 2
|
Automatic merge from submit-queue (batch tested with PRs 67052, 67094, 66795). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Add zones support for vSphere cloud provider (in-tree)

**What this PR does / why we need it**: This PR added zones (built-in node labels) support for the vSphere cloud provider (in-tree). More details can be found in the issue below.

**Which issue(s) this PR fixes**: Partially fixes phase 1 of issue #64021

**Special notes for your reviewer**:

**Release note**:

```release-note
NONE
```
Update required to continue work on kubernetes#64021
- The govmomi tag API changed
- Pulling in the new vapi/simulator package for testing the VCP Zones impl
Automatic merge from submit-queue (batch tested with PRs 66973, 67704, 67722, 67723, 63512). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

godeps: update vmware/govmomi

**What this PR does / why we need it**: Update required to continue work on #64021
- The govmomi tag API changed
- Pulling in the new vapi/simulator package for testing the VCP Zones impl

**Release note**:

```release-note
NONE
```
- Add tests for GetZones()
- Fix bug where a host tag other than region or zone caused an error
- Fix bug where GetZones() errored if zone tag was set, but region was not

Follow up to PR kubernetes#66795 / towards kubernetes#64021
Automatic merge from submit-queue (batch tested with PRs 66980, 67604, 67741, 67715). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

vsphere: add tests for Cloud Provider Zones implementation

**What this PR does / why we need it**:
- Add tests for GetZones()
- Fix bug where a host tag other than region or zone caused an error
- Fix bug where GetZones() errored if zone tag was set, but region was not

Follow up to PR #66795 / towards #64021

**Release note**:

```release-note
NONE
```
Rather than just looking for zone tags at the VM's Host level, traverse up the hierarchy. This allows zone tags to be attached at host level, along with cluster, datacenter, root folder and any inventory folders in between. Issue kubernetes#64021
Automatic merge from submit-queue (batch tested with PRs 54935, 67768, 67896, 67787). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

vsphere: support zone tags at any level in the hierarchy

**What this PR does / why we need it**: Rather than just looking for zone tags at the VM's Host level, traverse up the hierarchy. This allows zone tags to be attached at host level, along with cluster, datacenter, root folder and any inventory folders in between. Issue kubernetes#64021

Example log output from the tests, with tags attached at host level:

```console
Found "k8s-region" tag (k8s-region-US) for e85df495-93b9-4b0e-96f1-dc9d56e97263 attached to HostSystem:host-19
Found "k8s-zone" tag (k8s-zone-US-CA1) for e85df495-93b9-4b0e-96f1-dc9d56e97263 attached to HostSystem:host-19
```

And region tag at Datacenter level and zone tag at Cluster level:

```console
Found "k8s-zone" tag (k8s-zone-US-CA1) for e85df495-93b9-4b0e-96f1-dc9d56e97263 attached to ComputeResource:computeresource-21
Found "k8s-region" tag (k8s-region-US) for e85df495-93b9-4b0e-96f1-dc9d56e97263 attached to Datacenter:datacenter-2
```

**Release note**:

```release-note
NONE
```
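As a rough illustration of the hierarchy traversal idea described in this PR (a sketch under assumptions, not the code that was merged): using govmomi's `mo.Ancestors` together with the vAPI `tags` package, and the example category names `k8s-region`/`k8s-zone` from the test output above. An authenticated `vim25.Client` and `tags.Manager` are assumed.

```go
// Package vspherezones contains illustrative sketches only.
package vspherezones

import (
	"context"

	"github.com/vmware/govmomi/vapi/tags"
	"github.com/vmware/govmomi/vim25"
	"github.com/vmware/govmomi/vim25/mo"
	"github.com/vmware/govmomi/vim25/types"
)

// zoneFromHierarchy walks the inventory ancestors of a node's HostSystem and
// collects region/zone tags attached at any level (host, cluster, datacenter,
// folders). The most specific (closest to the host) tag wins.
func zoneFromHierarchy(ctx context.Context, c *vim25.Client, m *tags.Manager, host types.ManagedObjectReference) (region, zone string, err error) {
	// Ancestors returns the inventory path from the root folder down to the host itself.
	ancestors, err := mo.Ancestors(ctx, c, c.ServiceContent.PropertyCollector, host)
	if err != nil {
		return "", "", err
	}
	// Iterate from the host upwards.
	for i := len(ancestors) - 1; i >= 0; i-- {
		attached, err := m.GetAttachedTags(ctx, ancestors[i].Reference())
		if err != nil {
			return "", "", err
		}
		for _, tag := range attached {
			cat, err := m.GetCategory(ctx, tag.CategoryID)
			if err != nil {
				return "", "", err
			}
			switch cat.Name {
			case "k8s-region":
				if region == "" {
					region = tag.Name
				}
			case "k8s-zone":
				if zone == "" {
					zone = tag.Name
				}
			}
		}
	}
	return region, zone, nil
}
```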
|
I was recently made aware of this issue; however, it's not being tracked for the Kubernetes 1.12 release. Is there a reason why an issue in kubernetes/features was never opened? I only ask because it doesn't have any visibility to folks on the release team, so it may not be added to blogs, release notes, etc.
|
We were missing a release note, but #66795 has it now and it'll be included in the next generation of CHANGELOG-1.12.md. I think we skipped k/features for the same reason as kubernetes/enhancements#501 (comment): "Since this is an entirely VMWare feature, it does not need to be tracked here." The Zones feature already existed and this was just the vSphere implementation, so we assumed approval was not required. But we had not considered how a k/features issue would be used in docs, etc.
|
@dougm thanks for the clarification. I'm still learning over here too. Let me chat with some release folks and see where the gray area is.
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
|
/remove-lifecycle stale |
|
How about using dmidecode to get the UUID? For smaller sites it is interesting enough to not land on the same ESXi host.
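For context (my addition): inside a Linux guest, the SMBIOS system UUID, which on vSphere corresponds to the VM's BIOS UUID, can be read like this:

```console
# requires root; prints the SMBIOS system UUID (the VM's BIOS UUID on vSphere)
$ sudo dmidecode -s system-uuid
# or, without installing dmidecode:
$ sudo cat /sys/class/dmi/id/product_uuid
```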
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
|
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
|
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
|
@fejta-bot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Is this a BUG REPORT or FEATURE REQUEST?:
/kind feature
/sig vmware
What happened:
Currently, the vSphere Cloud Provider (VCP) does not implement discovering Zones/Failure Domains based on well-known labels out of the box. To quote the docs:
The consequences for users are:
What you expected to happen:
VCP should populate Kubernetes well-known labels in order for the scheduler (default scheduling policy) and end user (affinity/anti-affinity settings) to work out of the box in a vSphere environment.
VCP should also be able to reconcile labels, e.g. in case of a VM failover (HA) or when DRS is enabled in a vSphere cluster (e.g. "should" rules to balance host utilization within a rack/failure domain).
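For illustration (not part of the original issue text): once the cloud provider populates the well-known labels, they can be inspected per node with standard kubectl, e.g.:

```console
$ kubectl get nodes -L failure-domain.beta.kubernetes.io/region -L failure-domain.beta.kubernetes.io/zone
```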
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Discussed the following two-step approach with VMware Kubernetes engineering team:
Phase 1:
Implement basic functionality for VCP to support zones. This requires consensus on what a zone and a region map to in a vSphere environment. E.g. a region could map to a vSphere data center and a zone to a vSphere DRS/HA-enabled cluster. VMware Cloud on AWS (VMC) multi-AZ deployments would map nicely to this.
But that might conflict with customer on-premises environments, where a single vSphere cluster is stretched between two sites/data center buildings, i.e. from a logical perspective a single vSphere data center and cluster.
This is why a labelling (tagging) mechanism must be employed. A vSphere or Kubernetes cluster operator would define and apply vSphere tags to data centers, clusters and ESXi hosts. For example:
In the stretched cluster example mentioned above that tagging/labeling scheme would translate to:
Each Kubelet VM running on ESXi would, through a to-be-defined local metadata service (i.e. 169.x.y.z, no external network call), query the pre-defined tags and translate them into region/zone labels. To continue with the stretched cluster example:
failure-domain.beta.kubernetes.io/region=EMEA
failure-domain.beta.kubernetes.io/zone=DE-K8s-1

failure-domain.beta.kubernetes.io/region=EMEA
failure-domain.beta.kubernetes.io/zone=DE-K8s-2

Phase 2:
The improvements to VCP zone support suggested in phase 1 would solve initial placement and correct labelling of VMs and kubelets in a vSphere environment. Since many commercial Kubernetes distributions, e.g. OpenShift, test and certify against VCP, enriching VCP with this functionality would benefit all vSphere customers running any Kubernetes distribution.
However, vSphere offers advanced features like dynamic cluster rebalancing and high availability for VMs, which even in a Kubernetes environment provide a lot of value. That's why on "day 2", the initial labelling applied by the Kubelet could change (e.g. after HA or vMotion/DRS) and thus break Kubernetes scheduling assumptions/decisions.
This is why a monitoring/reconciliation control loop is needed. This could be implemented as a controller inside Kubernetes and in fact was demonstrated by @anfernee during a recent SIG VMware community call.
From a testing/certification perspective, VMware would need to work jointly with vendors of commercial Kubernetes distributions so that this controller would be recommended and shipped out of the box for production environments and customers continue to gain the benefits of the vSphere platform, protecting their investment.
@frapposelli @cantbewong
Environment:
Kubernetes version (use kubectl version): all versions affected
Kernel (e.g. uname -a): n/a

Related Issues/Discussions:
kubernetes/pkg/cloudprovider/providers/vsphere/vsphere.go, line 701 at commit 9d6d1a1