Bug 2029993: Prevent Machine from being considered provisioned until it exists in AWS #431

Merged
merged 4 commits into from Dec 13, 2021


elmiko

@elmiko elmiko commented Dec 7, 2021

We are seeing cases where, within a second or two of creating an instance, AWS does not return it and instead responds with a 400 error saying that the instance does not exist.
Right now this makes the machine go into Failed and, in conjunction with an MHC, can lead to leaked machines, as the instances are deleted before AWS acknowledges the existence of the Machine.

This PR is an attempt to improve this experience.
It makes three changes:

  • Do not set the addresses after create. This prevents the machine controller from reporting the instance as provisioned; if Exists does not return the instance, the controller calls Create again.
  • Before creating, attempt Exists again by looking for existing instances. If this succeeds, requeue; the next reconcile should then work and call Update, which corrects the status and provider ID.
  • If Exists fails a second time at the beginning of Create but an instance ID has already been set on the status, requeue. We can't create another instance, as that would cause a leak.
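
The rules above can be sketched as a small decision function. This is only an illustration of the described behaviour, not the code in this PR: `decideCreate`, `Action`, and the two parameters are hypothetical names, and the real actuator works on Machine objects and AWS API responses rather than plain booleans and strings.

```go
package main

import "fmt"

// Action is a hypothetical outcome of one Create attempt.
type Action string

const (
	ActionCreate  Action = "create"  // no instance is known, so launching one is safe
	ActionRequeue Action = "requeue" // wait and let the next reconcile decide
)

// decideCreate is an assumed helper, not a function from the PR.
// visibleInAWS is the result of re-checking for existing instances at the
// start of Create; statusInstanceID is the instance ID recorded on the
// Machine status, or "" if none has been recorded yet.
func decideCreate(visibleInAWS bool, statusInstanceID string) Action {
	if visibleInAWS {
		// The instance has shown up: requeue so the next reconcile
		// calls Update, which corrects the status and provider ID.
		return ActionRequeue
	}
	if statusInstanceID != "" {
		// An instance was already requested but AWS does not return it
		// yet. Creating another one would leak the first.
		return ActionRequeue
	}
	return ActionCreate
}

func main() {
	fmt.Println(decideCreate(true, "i-0abc"))  // instance visible
	fmt.Println(decideCreate(false, "i-0abc")) // ID recorded, not visible yet
	fmt.Println(decideCreate(false, ""))       // nothing created so far
}
```

Only the last case, where nothing is visible and no instance ID has been recorded, results in a new instance being launched.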

The only flaw I can currently think of is that if an instance is terminated before Exists ever succeeds, we will requeue forever, because Exists currently silently ignores these kinds of errors. To improve this, Exists could return an "Instance was terminated" error that we can unwrap and handle specifically. That would need to be handled in the core Machine controller, so it needs further work.
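
A minimal sketch of that follow-up idea, assuming a dedicated error type that the caller unwraps with `errors.As`. Nothing here exists in the PR or in the Machine controller: `InstanceTerminatedError`, `exists`, `shouldFail`, and the state strings passed in are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// InstanceTerminatedError is a hypothetical error type that an Exists
// check could return instead of silently ignoring a terminated instance.
type InstanceTerminatedError struct {
	InstanceID string
}

func (e *InstanceTerminatedError) Error() string {
	return fmt.Sprintf("instance %s was terminated", e.InstanceID)
}

// exists is a stand-in for the provider's Exists check. The state
// argument replaces a real AWS lookup for the purpose of this sketch.
func exists(instanceID, state string) (bool, error) {
	switch state {
	case "pending", "running":
		return true, nil
	case "terminated":
		// Wrap the typed error so callers can still unwrap it.
		return false, fmt.Errorf("checking instance: %w",
			&InstanceTerminatedError{InstanceID: instanceID})
	default:
		// Not visible yet: no error, so the caller keeps requeueing.
		return false, nil
	}
}

// shouldFail reports whether the caller should stop requeueing because
// the instance is known to be gone for good.
func shouldFail(err error) bool {
	var terminated *InstanceTerminatedError
	return errors.As(err, &terminated)
}

func main() {
	_, err := exists("i-0abc", "terminated")
	fmt.Println(shouldFail(err)) // true

	_, err = exists("i-0abc", "")
	fmt.Println(shouldFail(err)) // false
}
```

Wrapping with `%w` keeps the typed error reachable through `errors.As` even if intermediate layers add context, which is what would let the core controller tell "terminated" apart from "not visible yet".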

This is a cherry-pick of openshift/machine-api-provider-aws#11

Co-authored-by: Joel Speed <joel.speed@hotmail.co.uk>
@openshift-ci openshift-ci bot added the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Dec 7, 2021
@openshift-ci

openshift-ci bot commented Dec 7, 2021

@elmiko: This pull request references Bugzilla bug 2029993, which is invalid:

  • expected dependent Bugzilla bug 2025767 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE), but it is ON_QA instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 2029993: Prevent Machine from being considered provisioned until it exists in AWS

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Dec 7, 2021
@elmiko
Author

elmiko commented Dec 7, 2021

/label backport-risk-assessed

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Dec 7, 2021
@elmiko
Author

elmiko commented Dec 7, 2021

missed a reference, fixing it up now

Co-authored-by: Joel Speed <joel.speed@hotmail.co.uk>
@elmiko
Author

elmiko commented Dec 7, 2021

/bugzilla refresh

@openshift-ci

openshift-ci bot commented Dec 7, 2021

@elmiko: This pull request references Bugzilla bug 2029993, which is invalid:

  • expected dependent Bugzilla bug 2025767 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE), but it is ON_QA instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh


elmiko and others added 2 commits December 7, 2021 15:13
…y issues

Co-authored-by: Joel Speed <joel.speed@hotmail.co.uk>
Co-authored-by: Joel Speed <joel.speed@hotmail.co.uk>
@elmiko
Author

elmiko commented Dec 7, 2021

/cherry-pick release-4.8

@openshift-cherrypick-robot

@elmiko: once the present PR merges, I will cherry-pick it on top of release-4.8 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.8


@JoelSpeed

/approve
/retest

@openshift-ci

openshift-ci bot commented Dec 8, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 8, 2021
@cblecker
Member

/retest
/bugzilla refresh

@openshift-ci openshift-ci bot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Dec 10, 2021
@openshift-ci

openshift-ci bot commented Dec 10, 2021

@cblecker: This pull request references Bugzilla bug 2029993, which is valid. The bug has been moved to the POST state.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.z) matches configured target release for branch (4.9.z)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 2025767 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE))
  • dependent Bugzilla bug 2025767 targets the "4.10.0" release, which is one of the valid target releases: 4.10.0
  • bug has dependents

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

/retest
/bugzilla refresh


@openshift-ci openshift-ci bot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Dec 10, 2021
@JoelSpeed

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 10, 2021
@sunzhaohua2

/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Dec 13, 2021
@openshift-merge-robot openshift-merge-robot merged commit 8226e86 into openshift:release-4.9 Dec 13, 2021
@openshift-ci

openshift-ci bot commented Dec 13, 2021

@elmiko: All pull requests linked via external trackers have merged:

Bugzilla bug 2029993 has been moved to the MODIFIED state.

In response to this:

Bug 2029993: Prevent Machine from being considered provisioned until it exists in AWS


@openshift-cherrypick-robot

@elmiko: #431 failed to apply on top of branch "release-4.8":

Applying: Prevent Machine from being considered provisioned until it exists in AWS
Applying: Fix existing tests after provisioning improvements
Using index info to reconstruct a base tree...
M	pkg/actuators/machine/actuator_test.go
M	pkg/actuators/machine/reconciler_test.go
M	pkg/actuators/machine/stubs.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/actuators/machine/stubs.go
Auto-merging pkg/actuators/machine/reconciler_test.go
CONFLICT (content): Merge conflict in pkg/actuators/machine/reconciler_test.go
Auto-merging pkg/actuators/machine/actuator_test.go
CONFLICT (content): Merge conflict in pkg/actuators/machine/actuator_test.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 Fix existing tests after provisioning improvements
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.8


@elmiko
Author

elmiko commented Dec 13, 2021

i'll manually pick to 4.8 on monday

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. lgtm Indicates that a PR is ready to be merged.

6 participants