BUG 1836141: Assert VM exists if VM state is Creating #147

Merged
merged 1 commit into openshift:master on Aug 10, 2020

Conversation

@JoelSpeed

What this PR does / why we need it:

Currently, when a VM is being created in Azure, we first check whether the VM exists and then attempt to create it if not.

On the first attempt to create, we always get an asynchronous timeout:

```
E0709 16:44:25.752572       1 actuator.go:78] Machine error: failed to reconcile machine "jspeed-test-8cnpg-worker-centralus3-g6htz"s: failed to create vm jspeed-test-8cnpg-worker-centralus3-g6htz: failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed
```

This causes the Machine to be requeued. Currently, because `Exists` does not report the machine as existing, we attempt to create it again and get the following error:

```
E0709 16:44:26.642473       1 actuator.go:78] Machine error: failed to reconcile machine "jspeed-test-8cnpg-worker-centralus3-g6htz"s: failed to create vm jspeed-test-8cnpg-worker-centralus3-g6htz: vm jspeed-test-8cnpg-worker-centralus3-g6htz is still in provisioning state Creating, reconcile
```

The VM exists immediately after the first call to create it, but we attempt to recreate it because `Exists` claims that it does not exist. This can lead to issues if there is a transient error after the first `Create` but before the VM becomes `Running`: we could see an error, determine that the creation failed, and move the Machine to the `Failed` phase.

If this happens, the VM will still start, but because the Machine is `Failed`, we do not track it and do not remove it when the Machine is deleted. Therefore we can leak VMs.

This PR fixes `Exists` so that if the VM exists on the API but is in the `Creating` phase, it is considered to exist. Now we do not attempt to create the VM more than once, and, if the first VM creation succeeds (even with the async error), the Machine is considered to exist, so we will not fail a Machine while the VM is in the `Creating` phase.
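
For illustration only, a minimal Go sketch of the shape of this `Exists` change, under stated assumptions: `vmGetter`, `VMState`, and the state constants are hypothetical stand-ins, not the provider's actual types (the real code works with the Azure SDK's `compute.VirtualMachine` and its `ProvisioningState`, as the snippet later in the thread shows):

```go
// A sketch only: vmGetter, VMState, and errNotFound are hypothetical
// stand-ins for the provider's Azure client and SDK types.
package sketch

import "errors"

// VMState mirrors the Azure provisioning states seen in the logs above.
type VMState string

const (
	VMStateCreating  VMState = "Creating"
	VMStateSucceeded VMState = "Succeeded"
)

// errNotFound stands in for the "VM not found" error from the Azure API.
var errNotFound = errors.New("vm not found")

// vmGetter stands in for the client the actuator uses to look up a VM.
type vmGetter interface {
	Get(name string) (VMState, error)
}

// exists reports whether the Machine's VM should be treated as existing.
// The behavioural change described in this PR is the final return: a VM
// in the Creating state now counts as existing, so the actuator will not
// issue a second Create call while the first is still in flight.
func exists(client vmGetter, name string) (bool, error) {
	state, err := client.Get(name)
	if errors.Is(err, errNotFound) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return state == VMStateSucceeded || state == VMStateCreating, nil
}
```

The key point is the last line: previously only a fully provisioned VM was treated as existing, so the create path was re-entered while the VM was still `Creating`.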

@openshift-ci-robot added the `bugzilla/severity-medium` and `bugzilla/valid-bug` labels on Jul 10, 2020
@openshift-ci-robot

@JoelSpeed: This pull request references Bugzilla bug 1836141, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validations were run on this bug:
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

BUG 1836141: Assert VM exists if VM state is Creating

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Danil-Grigorev left a comment

/lgtm

@openshift-ci-robot added the `lgtm` label on Jul 10, 2020
@openshift-ci-robot removed the `lgtm` label on Jul 10, 2020
@JoelSpeed (Author)

Just noticed that this change will prevent us from entering this block of code

```go
} else {
	vm, ok := vmInterface.(compute.VirtualMachine)
	if !ok {
		return errors.New("returned incorrect vm interface")
	}
	if vm.ProvisioningState == nil {
		return fmt.Errorf("vm %s is nil provisioning state, reconcile", s.scope.Machine.Name)
	}
	vmState := getVMState(vm)
	s.scope.MachineStatus.VMID = vm.ID
	s.scope.MachineStatus.VMState = &vmState
	s.setMachineCloudProviderSpecifics(vm)
	if *vm.ProvisioningState == "Failed" {
		// If VM failed provisioning, delete it so it can be recreated
		err = s.Delete(ctx)
		if err != nil {
			return fmt.Errorf("failed to delete machine: %w", err)
		}
		return fmt.Errorf("vm %s is deleted, retry creating in next reconcile", s.scope.Machine.Name)
	} else if *vm.ProvisioningState != "Succeeded" {
		return fmt.Errorf("vm %s is still in provisioning state %s, reconcile", s.scope.Machine.Name, *vm.ProvisioningState)
	}
}
```

However, if the VM goes `Failed`, `Exists` will return false, but the machine has been provisioned, so the Machine will go `Failed` anyway (which is arguably more idiomatic behaviour than retrying), and the other details are updated regardless.
I could clean this up in this PR or a later PR, thoughts?

> Is there any other provisioning state we want to include here, e.g. `Deleting`?

I'd be tempted to keep the scope of this PR narrow and create a separate BZ for the fact that we should have a `Deleting` check here too. Currently we remove the node object before the VM is gone, which is not the intention of the code in the machine controller: it checks that the VM exists and is meant to wait for it to go away before removing the node and finalizer. So we are artificially saying the VM is gone before it actually is.
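
For context only, and explicitly not part of this PR: a hedged sketch of what the suggested `Deleting` check could look like, reusing the hypothetical `VMState`, `vmGetter`, and `errNotFound` names from the earlier sketch:

```go
// Hypothetical follow-up, NOT part of this PR: also treat a VM in the
// Deleting state as existing, so the node object and finalizer are only
// removed once the VM is actually gone. VMStateDeleting is an assumed
// name; the types come from the earlier illustrative sketch.
const VMStateDeleting VMState = "Deleting"

func existsIncludingDeleting(client vmGetter, name string) (bool, error) {
	state, err := client.Get(name)
	if errors.Is(err, errNotFound) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	switch state {
	case VMStateSucceeded, VMStateCreating, VMStateDeleting:
		return true, nil
	default:
		return false, nil
	}
}
```

With a check like this, `Exists` would keep returning true until the VM is actually gone, letting the machine controller hold the node object and finalizer until deletion completes.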

@JoelSpeed (Author)

/retest

1 similar comment
@enxebre (Member)

enxebre commented Jul 20, 2020

/retest

@enxebre (Member)

enxebre commented Jul 29, 2020

/approve

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the `approved` label on Jul 29, 2020
@Danil-Grigorev

/lgtm

@openshift-ci-robot added the `lgtm` label on Aug 10, 2020
@openshift-merge-robot merged commit f138f6b into openshift:master on Aug 10, 2020
@openshift-ci-robot

@JoelSpeed: All pull requests linked via external trackers have merged: openshift/cluster-api-provider-azure#147. Bugzilla bug 1836141 has been moved to the MODIFIED state.

In response to this:

BUG 1836141: Assert VM exists if VM state is Creating

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
