Reflect the VM statuses to the KubeVirtCluster status #272

Merged

Conversation

Contributor

@nunnatsa nunnatsa commented Dec 14, 2023

What this PR does / why we need it

Be more verbose when a VM is not scheduled: add meaningful reasons and messages to the KubeVirtMachine conditions, to be reflected in the Cluster resources.

Also, make the KubeVirtMachine.status.ready field be printed even when it is false, and add it as a new printed column named "Ready".
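For reference, a printed column like that is typically produced by a kubebuilder marker on the KubevirtMachine type; a minimal sketch (the PR's actual marker may differ):

// Makes `kubectl get kubevirtmachine` show the ready state as a column.
// +kubebuilder:printcolumn:name="Ready",type=boolean,JSONPath=".status.ready"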

  • todo: complete unit tests

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 14, 2023
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 14, 2023
@nunnatsa
Contributor Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Dec 14, 2023
@@ -275,6 +276,7 @@ func (r *KubevirtMachineReconciler) reconcileNormal(ctx *context.MachineContext)
// Mark VMProvisionedCondition to indicate that the VM has successfully started
conditions.MarkTrue(ctx.KubevirtMachine, infrav1.VMProvisionedCondition)
} else {
conditions.MarkFalse(ctx.KubevirtMachine, infrav1.VMProvisionedCondition, "notScheduled", clusterv1.ConditionSeverityInfo, "unknown reason")
Contributor

I think what we need here is a series of checks into why the VM isn't ready.

So, think about adding a function here that sets the Ready condition to false, something like the logic below. The idea is that we start by looking for detailed reasons the VM might not be ready, and then slowly work down to more generic reasons.

func mapVMStatusToMachineCondition(...) {
    if <vm has condition type: PodScheduled, status: "False", reason: Unschedulable> {
        conditions.MarkFalse(ctx.KubevirtMachine, infrav1.VMProvisionedCondition, "notScheduled", clusterv1.ConditionSeverityInfo, <reason given from vm condition>)
    } else {
        conditions.MarkFalse(ctx.KubevirtMachine, infrav1.VMProvisionedCondition, "vmNotReady", clusterv1.ConditionSeverityInfo, <reason vm condition is set to not ready>)
    }
}

In the future, that function will keep getting expanded. At a high level, the logic will look something like this:

if vm is not schedulable {
    ...
} else if dv storage profile not found {
    ...
} else if dv pending storage provisioning {
    ...
} else if dv failure to import image {
    ...
} else {
    <generic VMProvisioned = false reason>
}
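A minimal runnable sketch of that helper in Go, under the assumption that the virt-launcher pod's scheduling condition is mirrored onto the VMI's conditions (as the pseudocode implies); the signature, the vmi parameter, and the fallback message are illustrative, not this PR's final code (corev1 is k8s.io/api/core/v1, kubevirtv1 is kubevirt.io/api/core/v1):

func mapVMStatusToMachineCondition(ctx *context.MachineContext, vmi *kubevirtv1.VirtualMachineInstance) {
    // Most specific reason first: the VM's pod could not be scheduled.
    for _, c := range vmi.Status.Conditions {
        if string(c.Type) == string(corev1.PodScheduled) &&
            c.Status == corev1.ConditionFalse &&
            c.Reason == corev1.PodReasonUnschedulable {
            conditions.MarkFalse(ctx.KubevirtMachine, infrav1.VMProvisionedCondition,
                "notScheduled", clusterv1.ConditionSeverityInfo, "%s", c.Message)
            return
        }
    }
    // Generic fallback when no specific cause was identified.
    conditions.MarkFalse(ctx.KubevirtMachine, infrav1.VMProvisionedCondition,
        "vmNotReady", clusterv1.ConditionSeverityInfo, "VM is not ready")
}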

@coveralls

coveralls commented Dec 17, 2023

Pull Request Test Coverage Report for Build 7285359056

  • 59 of 65 (90.77%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+1.2%) to 63.356%

Changes missing coverage:

File | Covered Lines | Changed/Added Lines | %
controllers/kubevirtmachine_controller.go | 3 | 4 | 75.0%
pkg/kubevirt/machine.go | 56 | 61 | 91.8%

Totals:
Change from base Build 7226166058: +1.2%
Covered Lines: 1008
Relevant Lines: 1591

💛 - Coveralls

@nunnatsa
Contributor Author

/hold

need to complete unit tests

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 17, 2023
@nunnatsa nunnatsa force-pushed the reflect-vm-status-to-cluster branch 2 times, most recently from a22e2d1 to 341359d on December 18, 2023 12:39
@nunnatsa
Contributor Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 18, 2023
Contributor

@davidvossel davidvossel left a comment


One high-level comment is that I want to make sure the capi Machine object gets the condition messages from the KubeVirtMachine object.

From what I understand, right now the KubeVirtMachine's ready condition will show something like the output below when a DV can't be imported because of an insecure registry.

{
  "conditions": [
    {
      "lastTransitionTime": "2023-12-17T15:37:31Z",
      "message": "0 of 2 completed",
      "reason": "DVNotReady",
      "severity": "Info",
      "status": "False",
      "type": "Ready"
    },
    {
      "lastTransitionTime": "2023-12-17T15:37:31Z",
      "message": "DataVolume capi-quickstart-md-0-rjxxm-capi-quickstart-boot-volume is still provisioning",
      "reason": "DVNotReady",
      "severity": "Info",
      "status": "False",
      "type": "VMProvisioned"
    }
  ]
}

What does the corresponding Machine object and MachineDeployment have in their status for these KubeVirtMachine conditions?

The goal is for the KubeVirtMachine object to present these conditions in a way that the capi Machine APIs bubble up information about why the VM isn't progressing as expected.

config/default/manager_image_patch.yaml (outdated review thread, resolved)
Comment on lines -479 to +483
- clusterToKubevirtMachines, err := util.ClusterToObjectsMapper(mgr.GetClient(), &infrav1.KubevirtMachineList{}, mgr.GetScheme())
+ clusterToKubevirtMachines, err := util.ClusterToTypedObjectsMapper(mgr.GetClient(), &infrav1.KubevirtMachineList{}, mgr.GetScheme())
Contributor

I'm curious what this change does.

Contributor Author

The ClusterToObjectsMapper function is deprecated; its documentation says to use ClusterToTypedObjectsMapper instead:

// Deprecated: This function is deprecated and will be removed in a future release, use ClusterToTypedObjectsMapper instead.
// The problem with this function is that it uses UnstructuredList to retrieve objects, with the default client configuration
// this will lead to uncached List calls, which is a major performance issue.
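For context, a hedged sketch of how the typed mapper is typically wired into the controller builder; the builder chain below is illustrative (controller-runtime v0.16-style Watches), not the PR's exact setup code:

clusterToKubevirtMachines, err := util.ClusterToTypedObjectsMapper(mgr.GetClient(), &infrav1.KubevirtMachineList{}, mgr.GetScheme())
if err != nil {
    return err
}

return ctrl.NewControllerManagedBy(mgr).
    For(&infrav1.KubevirtMachine{}).
    // Re-reconcile all KubevirtMachines of a Cluster when the Cluster changes,
    // using cached typed List calls instead of unstructured ones.
    Watches(&clusterv1.Cluster{}, handler.EnqueueRequestsFromMapFunc(clusterToKubevirtMachines)).
    Complete(r)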

case cdiv1.Failed:
return "DVFailed", fmt.Sprintf("DataVolume %s failed", m.dataVolume.Name), true
default:
msg := "DataVolume is not ready"
Contributor

maybe add the phase to this default message?

msg := fmt.Sprintf("DataVolume %s is not ready. Phase = %s", m.dataVolume.Name, m.dataVolume.Status.Phase)
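In context, a sketch of how the default branch might read inside the existing phase switch (return shape taken from the diff above):

switch m.dataVolume.Status.Phase {
case cdiv1.Failed:
    return "DVFailed", fmt.Sprintf("DataVolume %s failed", m.dataVolume.Name), true
default:
    // Include the current CDI phase so the condition message says what
    // the DataVolume is waiting on.
    msg := fmt.Sprintf("DataVolume %s is not ready. Phase = %s", m.dataVolume.Name, m.dataVolume.Status.Phase)
    return "DVNotReady", msg, true
}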

pkg/kubevirt/machine.go (4 outdated review threads, resolved)
@nunnatsa
Contributor Author

nunnatsa commented Dec 19, 2023

One high-level comment is that I want to make sure the capi Machine object gets the condition messages from the KubeVirtMachine object.

For this KubeVirtMachine status:

{
  "conditions": [
    {
      "lastTransitionTime": "2023-12-19T07:26:28Z",
      "message": "0 of 2 completed",
      "reason": "DVNotReady",
      "severity": "Info",
      "status": "False",
      "type": "Ready"
    },
    {
      "lastTransitionTime": "2023-12-19T07:28:28Z",
      "message": "DataVolume capi-quickstart-md-0-d7xjr-capi-quickstart-boot-volume import is not running: DataVolume too small to contain image",
      "reason": "DVNotReady",
      "severity": "Info",
      "status": "False",
      "type": "VMProvisioned"
    }
  ],
  "ready": false
}

We get this Machine (capi) status (only the Reason field gets here, without the message):

{
  "bootstrapReady": true,
  "conditions": [
    {
      "lastTransitionTime": "2023-12-19T07:26:28Z",
      "message": "1 of 2 completed",
      "reason": "DVNotReady",
      "severity": "Info",
      "status": "False",
      "type": "Ready"
    },
    {
      "lastTransitionTime": "2023-12-19T07:26:18Z",
      "status": "True",
      "type": "BootstrapReady"
    },
    {
      "lastTransitionTime": "2023-12-19T07:26:28Z",
      "message": "0 of 2 completed",
      "reason": "DVNotReady",
      "severity": "Info",
      "status": "False",
      "type": "InfrastructureReady"
    },
    {
      "lastTransitionTime": "2023-12-19T07:24:36Z",
      "reason": "WaitingForNodeRef",
      "severity": "Info",
      "status": "False",
      "type": "NodeHealthy"
    }
  ],
  "lastUpdated": "2023-12-19T07:26:18Z",
  "observedGeneration": 2,
  "phase": "Provisioning"
}

What does the corresponding Machine object and MachineDeployment have in their status for these KubeVirtMachine conditions?

And the MachineDeployment status is (neither reason nor message is reflected here):

{
  "conditions": [
    {
      "lastTransitionTime": "2023-12-19T07:24:36Z",
      "message": "Minimum availability requires 1 replicas, current 0 available",
      "reason": "WaitingForAvailableMachines",
      "severity": "Warning",
      "status": "False",
      "type": "Ready"
    },
    {
      "lastTransitionTime": "2023-12-19T07:24:36Z",
      "message": "Minimum availability requires 1 replicas, current 0 available",
      "reason": "WaitingForAvailableMachines",
      "severity": "Warning",
      "status": "False",
      "type": "Available"
    }
  ],
  "observedGeneration": 1,
  "phase": "ScalingUp",
  "replicas": 1,
  "selector": "cluster.x-k8s.io/cluster-name=capi-quickstart,cluster.x-k8s.io/deployment-name=capi-quickstart-md-0",
  "unavailableReplicas": 1,
  "updatedReplicas": 1
}

However, the capi Cluster shows nothing:

{
  "conditions": [
    {
      "lastTransitionTime": "2023-12-19T07:26:11Z",
      "status": "True",
      "type": "Ready"
    },
    {
      "lastTransitionTime": "2023-12-19T07:26:11Z",
      "status": "True",
      "type": "ControlPlaneInitialized"
    },
    {
      "lastTransitionTime": "2023-12-19T07:26:11Z",
      "status": "True",
      "type": "ControlPlaneReady"
    },
    {
      "lastTransitionTime": "2023-12-19T07:24:41Z",
      "status": "True",
      "type": "InfrastructureReady"
    }
  ],
  "infrastructureReady": true,
  "observedGeneration": 2,
  "phase": "Provisioned"
}

@nunnatsa
Contributor Author

Comparing the conditions with the current version (main branch):

capi Machine InfrastructureReady condition (the only change):

Main:

    {
      "lastTransitionTime": "2023-12-19T08:13:21Z",
      "message": "0 of 2 completed",
      "reason": "WaitingForBootstrapData",
      "severity": "Info",
      "status": "False",
      "type": "InfrastructureReady"
    },

PR:

    {
      "lastTransitionTime": "2023-12-19T07:26:28Z",
      "message": "0 of 2 completed",
      "reason": "DVNotReady",
      "severity": "Info",
      "status": "False",
      "type": "InfrastructureReady"
    },

No change in MachineDeployment

@@ -90,6 +94,23 @@ func NewMachine(ctx *context.MachineContext, client client.Client, namespace str
machine.vmInstance = vm
}

if vm != nil && vm.Spec.Template != nil {
for _, vol := range vm.Spec.Template.Spec.Volumes {
Contributor

Sorry, I didn't catch this earlier.

Technically we need to be looking for the presence of vm.Spec.DataVolumeTemplates here instead of vm.Spec.Template.Spec.Volumes.

This is because the Volumes could be attaching the DataVolume as a PVC instead of a DataVolume; the two are interchangeable.

The source of truth in our case for whether a VM has a data import flow associated with it is to look at the DataVolumeTemplate section.
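A hedged sketch of that check; the loop body is illustrative (DataVolumeTemplateSpec embeds ObjectMeta, so each template has a Name):

if vm != nil && len(vm.Spec.DataVolumeTemplates) > 0 {
    for _, dvt := range vm.Spec.DataVolumeTemplates {
        // Each DataVolumeTemplate names a DataVolume the VM will create,
        // i.e. a data import flow whose progress we should surface.
        dvName := dvt.Name
        // ...look up the DataVolume by dvName and keep it on the machine wrapper...
        _ = dvName
    }
}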

@@ -63,6 +66,7 @@ func NewMachine(ctx *context.MachineContext, client client.Client, namespace str
vmiInstance: nil,
vmInstance: nil,
sshKeys: sshKeys,
dataVolume: nil,
Contributor

I think this needs to be a slice instead of a single variable. It's possible for a VM to have multiple DataVolumeTemplates associated with it. In the use case that you and I are most familiar with (HyperShift) this will not be the case right now, but it's possible in the generic capk use case.
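For illustration, the single field from the diff above would turn into something like this (the field name dataVolumes is hypothetical; cdiv1 is the CDI v1beta1 API package):

type Machine struct {
    // ...existing fields (vmiInstance, vmInstance, sshKeys, ...)...

    // One entry per DataVolumeTemplate on the VM, instead of a single pointer.
    dataVolumes []*cdiv1.DataVolume
}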

Comment on lines +48 to +55
- apiGroups:
  - cdi.kubevirt.io
  resources:
  - datavolumes
  verbs:
  - get
  - list
  - watch
Contributor

Oh, I guess this is needed here too, in clusterkubevirtadm/cmd/credentials/credentials.go.
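Assuming credentials.go builds its role rules programmatically, the matching rule would look roughly like this (a sketch mirroring the YAML rule above; rbacv1 is k8s.io/api/rbac/v1):

rbacv1.PolicyRule{
    APIGroups: []string{"cdi.kubevirt.io"},
    Resources: []string{"datavolumes"},
    Verbs:     []string{"get", "list", "watch"},
},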

Contributor Author

@davidvossel - done

If not all the cluster VMs are ready, the CAPK operator now sets the new
`AllMachinesAreReady` condition in the KubeVirtCluster CR to `False`.

Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>
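A hedged sketch of what setting that condition might look like in the cluster reconciler; the Go constant name, reason string, and message variable are all illustrative (the commit message only fixes the condition type AllMachinesAreReady):

// If any KubevirtMachine is not ready, reflect that on the KubeVirtCluster.
conditions.MarkFalse(kubevirtCluster,
    infrav1.AllMachinesAreReadyCondition, // assumed constant for "AllMachinesAreReady"
    "MachinesNotReady",                   // illustrative reason
    clusterv1.ConditionSeverityInfo,
    "%s", aggregatedNotReadyMessage)      // e.g. joined per-machine reasons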
Contributor
@davidvossel davidvossel left a comment


/approve
/lgtm

Looks like a good start. I'm sure we'll expand on this in the future.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 21, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: davidvossel, nunnatsa

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [davidvossel,nunnatsa]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 1ab9fc5 into kubernetes-sigs:main Dec 21, 2023
13 checks passed
@nunnatsa nunnatsa deleted the reflect-vm-status-to-cluster branch December 21, 2023 18:54