BUG 1836141: Assert VM exists if VM state is Creating #147

Merged
merged 1 commit into openshift:master on Aug 10, 2020

Conversation

@JoelSpeed

What this PR does / why we need it:

Currently, when a VM is being created in Azure, we first check whether the VM exists and then attempt to create it if not.

On the first attempt to create, we always get an asynchronous timeout:

```
E0709 16:44:25.752572       1 actuator.go:78] Machine error: failed to reconcile machine "jspeed-test-8cnpg-worker-centralus3-g6htz"s: failed to create vm jspeed-test-8cnpg-worker-centralus3-g6htz: failed to create or get machine: compute.VirtualMachinesCreateOrUpdateFuture: asynchronous operation has not completed
```

This causes the Machine to be requeued. Currently, because `Exists` does not report the machine as existing, we attempt to create it again and get the following error:

```
E0709 16:44:26.642473       1 actuator.go:78] Machine error: failed to reconcile machine "jspeed-test-8cnpg-worker-centralus3-g6htz"s: failed to create vm jspeed-test-8cnpg-worker-centralus3-g6htz: vm jspeed-test-8cnpg-worker-centralus3-g6htz is still in provisioning state Creating, reconcile
```

The VM exists immediately after the first call to create it, but we attempt to recreate it because `Exists` claims that it does not exist. This can lead to issues if there is a transient error after the first `Create` but before the VM becomes `Running`: we could see an error, determine that the creation failed, and move the Machine to the `Failed` phase.

If this happens, the VM will still start, but because the Machine is `Failed`, we do not track it and do not remove it when the Machine is deleted. Therefore we can leak VMs.

This PR fixes `Exists` so that if the VM exists on the API but is in the `Creating` phase, it is considered to exist. Now we do not attempt to create the VM more than once, and, if the first VM creation succeeds (even with the async error), the Machine is considered to exist, so we will not fail a Machine while the VM is in the `Creating` phase.
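
For illustration only, a minimal Go sketch of the shape of this `Exists` change, under stated assumptions: `vmGetter`, `VMState`, and the state constants are hypothetical stand-ins, not the provider's actual types (the real code works with the Azure SDK's `compute.VirtualMachine` and its `ProvisioningState`, as the snippet later in the thread shows):

```go
// A sketch only: vmGetter, VMState, and errNotFound are hypothetical
// stand-ins for the provider's Azure client and SDK types.
package sketch

import "errors"

// VMState mirrors the Azure provisioning states seen in the logs above.
type VMState string

const (
	VMStateCreating  VMState = "Creating"
	VMStateSucceeded VMState = "Succeeded"
)

// errNotFound stands in for the "VM not found" error from the Azure API.
var errNotFound = errors.New("vm not found")

// vmGetter stands in for the client the actuator uses to look up a VM.
type vmGetter interface {
	Get(name string) (VMState, error)
}

// exists reports whether the Machine's VM should be treated as existing.
// The behavioural change described in this PR is the final return: a VM
// in the Creating state now counts as existing, so the actuator will not
// issue a second Create call while the first is still in flight.
func exists(client vmGetter, name string) (bool, error) {
	state, err := client.Get(name)
	if errors.Is(err, errNotFound) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return state == VMStateSucceeded || state == VMStateCreating, nil
}
```

The key point is the last line: previously only a fully provisioned VM was treated as existing, so the create path was re-entered while the VM was still `Creating`.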

@openshift-ci-robot added the `bugzilla/severity-medium` and `bugzilla/valid-bug` labels on Jul 10, 2020
@openshift-ci-robot

@JoelSpeed: This pull request references Bugzilla bug 1836141, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validations were run on this bug:
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

BUG 1836141: Assert VM exists if VM state is Creating

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Danil-Grigorev left a comment

/lgtm

@openshift-ci-robot added the `lgtm` label on Jul 10, 2020
@openshift-ci-robot removed the `lgtm` label on Jul 10, 2020
@JoelSpeed (Author)

Just noticed that this change will prevent us from entering this block of code

```go
} else {
	vm, ok := vmInterface.(compute.VirtualMachine)
	if !ok {
		return errors.New("returned incorrect vm interface")
	}
	if vm.ProvisioningState == nil {
		return fmt.Errorf("vm %s is nil provisioning state, reconcile", s.scope.Machine.Name)
	}
	vmState := getVMState(vm)
	s.scope.MachineStatus.VMID = vm.ID
	s.scope.MachineStatus.VMState = &vmState
	s.setMachineCloudProviderSpecifics(vm)
	if *vm.ProvisioningState == "Failed" {
		// If VM failed provisioning, delete it so it can be recreated
		err = s.Delete(ctx)
		if err != nil {
			return fmt.Errorf("failed to delete machine: %w", err)
		}
		return fmt.Errorf("vm %s is deleted, retry creating in next reconcile", s.scope.Machine.Name)
	} else if *vm.ProvisioningState != "Succeeded" {
		return fmt.Errorf("vm %s is still in provisioning state %s, reconcile", s.scope.Machine.Name, *vm.ProvisioningState)
	}
}
```

However, if the VM goes `Failed`, `Exists` will return false, but the machine has been provisioned, so the Machine will go `Failed` anyway (which is arguably more idiomatic behaviour than retrying), and the other details are updated regardless.
I could clean this up in this PR or a later PR, thoughts?

> Is there any other provisioning state we want to include here, e.g. `Deleting`?

I'd be tempted to keep the scope of this PR narrow and create a separate BZ for the fact that we should have a `Deleting` check here too. Currently we remove the node object before the VM is gone, which is not the intention of the code in the machine controller: it checks that the VM exists and is meant to wait for it to go away before removing the node and finalizer. So we are artificially saying the VM is gone before it actually is.
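
For context only, and explicitly not part of this PR: a hedged sketch of what the suggested `Deleting` check could look like, reusing the hypothetical `VMState`, `vmGetter`, and `errNotFound` names from the earlier sketch:

```go
// Hypothetical follow-up, NOT part of this PR: also treat a VM in the
// Deleting state as existing, so the node object and finalizer are only
// removed once the VM is actually gone. VMStateDeleting is an assumed
// name; the types come from the earlier illustrative sketch.
const VMStateDeleting VMState = "Deleting"

func existsIncludingDeleting(client vmGetter, name string) (bool, error) {
	state, err := client.Get(name)
	if errors.Is(err, errNotFound) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	switch state {
	case VMStateSucceeded, VMStateCreating, VMStateDeleting:
		return true, nil
	default:
		return false, nil
	}
}
```

With a check like this, `Exists` would keep returning true until the VM is actually gone, letting the machine controller hold the node object and finalizer until deletion completes.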

@JoelSpeed (Author)

/retest

1 similar comment
@enxebre (Member)

enxebre commented Jul 20, 2020

/retest

@enxebre (Member)

enxebre commented Jul 29, 2020

/approve

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the `approved` label on Jul 29, 2020
@Danil-Grigorev

/lgtm

@openshift-ci-robot added the `lgtm` label on Aug 10, 2020
@openshift-merge-robot merged commit f138f6b into openshift:master on Aug 10, 2020
@openshift-ci-robot

@JoelSpeed: All pull requests linked via external trackers have merged: openshift/cluster-api-provider-azure#147. Bugzilla bug 1836141 has been moved to the MODIFIED state.

In response to this:

BUG 1836141: Assert VM exists if VM state is Creating

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
