Bug 1808971: Machine status shows "running" when an instance was terminated #575

Conversation

Danil-Grigorev
Contributor

Setting machine instance state to unknown on machine failure

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Apr 29, 2020
@openshift-ci-robot
Contributor

@Danil-Grigorev: This pull request references Bugzilla bug 1808971, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1808971: Machine status shows "running" when an instance was terminated

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Contributor

@alexander-demicev alexander-demicev left a comment

@alexander-demicev
Contributor

This can break consistency, because we set the same annotation in two different places. However, the idea of setting it in one place is nice.

@@ -426,9 +426,18 @@ func (r *ReconcileMachine) setPhase(machine *machinev1.Machine, phase string, er
	machine.Status.LastUpdated = &now
	if phase == phaseFailed && errorMessage != "" {
		machine.Status.ErrorMessage = &errorMessage
		if machine.Annotations == nil {
			machine.Annotations = map[string]string{}
Member

you could save the if:

annotations := m.GetAnnotations()
annotations["key"] = "value"
m.SetAnnotations(annotations)

Contributor Author

This will panic when Annotations is nil (assignment to an entry in a nil map), especially in tests. I'm not sure how to avoid the nil check here.

Member

@enxebre enxebre Apr 30, 2020

I'm missing where exactly this would panic. I'd expect GetAnnotations to give you an empty map, but maybe I'm wrong. I'm not totally sure what the API machinery defaulting will return here. You're right, the check is safe.
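
To make the point of this thread concrete, here is a minimal, standalone sketch of the nil-map behaviour being discussed. It uses a plain `map[string]string` rather than a real Machine object, and `setAnnotation` and `writePanics` are hypothetical helpers invented for illustration. The assumption is that an accessor such as GetAnnotations hands back the underlying field unchanged, so it can be nil when no annotations were ever set.

```go
package main

import "fmt"

// setAnnotation is a hypothetical helper (not part of the PR): it allocates
// the map when it is nil, writes the key, and returns the possibly new map.
func setAnnotation(annotations map[string]string, key, value string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[key] = value
	return annotations
}

// writePanics reports whether writing an entry into m panics.
func writePanics(m map[string]string) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	m["key"] = "value"
	return false
}

func main() {
	// A nil map, as on an object that was created without annotations.
	var annotations map[string]string

	fmt.Println(annotations["key"] == "") // reading from a nil map is safe
	fmt.Println(writePanics(annotations)) // writing to a nil map panics

	annotations = setAnnotation(annotations, "key", "value")
	fmt.Println(annotations["key"])
}
```

Reading from a nil map is fine, which is probably why the shorter form looks safe at first glance; only the write needs the guard.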

Contributor

		if machine.Annotations == nil {
			machine.Annotations = map[string]string{}
		}
		machine.Annotations[MachineInstanceStateAnnotationName] = "Unknown"
Member

Can we put the "Unknown" literal into a constant?

		if machine.Annotations == nil {
			machine.Annotations = map[string]string{}
		}
		machine.Annotations[MachineInstanceStateAnnotationName] = "Unknown"
Member

Contributor Author

Here is another place where the machine will get into the Failed phase, and the VM will no longer be monitored or updated:

	if isInvalidMachineConfigurationError(err) {
		if err := r.setPhase(m, phaseFailed, err.Error()); err != nil {
			return reconcile.Result{}, err
		}

I was thinking of handling it generically.

Member

Makes sense. In the other scenario the provider actually has a chance to set a valid state as the machine goes to Failed, so it might be good to leave it. Happy to hear others' thoughts.
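
As a rough illustration of the "handle it generically" idea from this thread, the sketch below puts the annotation update inside a simplified `setPhase`, so that every caller that moves a machine to the failed phase also resets the instance state. It is not the PR's actual code: `Machine` is a stand-in for `machinev1.Machine`, the annotation key and the `phaseFailed` and `phaseRunning` values are assumptions, and the Patch calls that persist the change are left out.

```go
package main

import "fmt"

const (
	// Assumed annotation key; the PR only shows the constant's name.
	MachineInstanceStateAnnotationName = "machine.openshift.io/instance-state"
	// The "Unknown" value comes from the diff under review.
	unknownInstanceState = "Unknown"
	// Assumed phase values, for illustration only.
	phaseFailed  = "Failed"
	phaseRunning = "Running"
)

// Machine is a simplified stand-in for machinev1.Machine.
type Machine struct {
	Annotations  map[string]string
	Phase        *string
	ErrorMessage *string
}

// setPhase records the phase. When the machine fails with an error message,
// it also stores the message and marks the instance state as unknown, because
// a failed machine's instance is no longer monitored.
func setPhase(m *Machine, phase, errorMessage string) {
	m.Phase = &phase
	if phase == phaseFailed && errorMessage != "" {
		m.ErrorMessage = &errorMessage
		if m.Annotations == nil {
			m.Annotations = map[string]string{}
		}
		m.Annotations[MachineInstanceStateAnnotationName] = unknownInstanceState
	}
}

func main() {
	m := &Machine{Annotations: map[string]string{
		MachineInstanceStateAnnotationName: "running",
	}}
	setPhase(m, phaseFailed, "invalid machine configuration")
	fmt.Println(*m.Phase, m.Annotations[MachineInstanceStateAnnotationName])
}
```

The trade-off raised in the review still applies: a provider actuator may set the same annotation elsewhere, so writing it in two places can leave the two out of step.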

@Danil-Grigorev Danil-Grigorev force-pushed the set-unknown-vm-state-on-failure branch 3 times, most recently from c10a93c to 96bb756 Compare April 30, 2020 10:20
@@ -426,9 +426,18 @@ func (r *ReconcileMachine) setPhase(machine *machinev1.Machine, phase string, er
	machine.Status.LastUpdated = &now
	if phase == phaseFailed && errorMessage != "" {
		machine.Status.ErrorMessage = &errorMessage
		if machine.Annotations == nil {
			machine.Annotations = map[string]string{}
Contributor

		if err := r.Client.Patch(context.Background(), machine, baseToPatch); err != nil {
			klog.Errorf("Failed to update machine %q: %v", machine.GetName(), err)
			return err
		}
	}
	if err := r.Client.Status().Patch(context.Background(), machine, baseToPatch); err != nil {
Contributor

Can you check whether conflict errors are coming through now? I can't remember whether Patch needs an up-to-date resource version for the call to succeed.

@@ -382,7 +382,7 @@ func TestSetPhase(t *testing.T) {
 		t.Fatal(err)
 	}
 	if *lastUpdated != *got.Status.LastUpdated {
-		t.Errorf("Expected: %v, got: %v", *lastUpdated, *got.Status.LastUpdated)
+		t.Errorf("Expected: %v, got: %v", *lastUpdated, got.Status.LastUpdated)
Contributor

Why is this changed? I don't see how it relates 🤔

@@ -401,12 +403,21 @@ func TestSetPhase(t *testing.T) {
 	}
 	// validate persisted object
 	if expecterErrorMessage != *got.Status.ErrorMessage {
-		t.Errorf("Expected: %v, got: %v", expecterErrorMessage, *got.Status.ErrorMessage)
+		t.Errorf("Expected: %v, got: %v", expecterErrorMessage, got.Status.ErrorMessage)
Contributor

Same as above, why is this deref changed?

		t.Fatal("Got phase nil")
	}
	if *got.Status.Phase != phaseFailed {
		t.Errorf("Got: %v, expected: %v", *got.Status.Phase, phaseRunning)
Contributor

The rest of the assertions around here are this way around

Suggested change
-		t.Errorf("Got: %v, expected: %v", *got.Status.Phase, phaseRunning)
+		t.Errorf("Expected: %v, got: %v", phaseFailed, *got.Status.Phase)

Contributor Author

Oh I missed the fact that all those fields are pointers :D Thanks for the tip, will revert it.

@Danil-Grigorev Danil-Grigorev force-pushed the set-unknown-vm-state-on-failure branch 3 times, most recently from c1656b4 to 1de9a9b Compare May 5, 2020 12:51
Bug 1808971: Machine status shows "running" when an instance was terminated
@openshift-ci-robot
Contributor

@Danil-Grigorev: The following tests failed, say /retest to rerun all failed tests:

Test name                      Commit   Details  Rerun command
ci/prow/e2e-aws-scaleup-rhel7  c38b02f  link     /test e2e-aws-scaleup-rhel7
ci/prow/e2e-azure-operator     c38b02f  link     /test e2e-azure-operator
ci/prow/e2e-azure              c38b02f  link     /test e2e-azure

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

		if machine.Annotations == nil {
			machine.Annotations = map[string]string{}
		}
		machine.Annotations[MachineInstanceStateAnnotationName] = unknownInstanceState
Member

This might cause skew if the actuators set something in the provider.Status. We might want to consolidate that.

@enxebre
Member

enxebre commented May 6, 2020

/approve
@JoelSpeed PTAL

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 6, 2020
Contributor

@JoelSpeed JoelSpeed left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 6, 2020
@openshift-merge-robot openshift-merge-robot merged commit 67083c6 into openshift:master May 6, 2020
@openshift-ci-robot
Contributor

@Danil-Grigorev: All pull requests linked via external trackers have merged: openshift/machine-api-operator#575. Bugzilla bug 1808971 has been moved to the MODIFIED state.

In response to this:

Bug 1808971: Machine status shows "running" when an instance was terminated

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
6 participants