Remove provisioning paused annotation #6306
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: carbonin

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from b45aad7 to 69d9d6c
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #6306      +/-   ##
==========================================
- Coverage   68.29%   68.28%   -0.01%
==========================================
  Files         241      241
  Lines       35863    35878      +15
==========================================
+ Hits        24491    24499       +8
- Misses       9212     9215       +3
- Partials     2160     2164       +4
```
This was added to avoid a race between the provisioning stack running on the bootstrap control plane and the real one running in the installed cluster. During IPI this annotation is removed when bootstrap is finished, but for assisted and agent installs it was never getting removed.

This was causing the baremetal operator to never start in the installed cluster, which prevented the machine API from recognizing that all machines were running. This, in turn, resulted in the cluster never being able to finish finalizing, with the machine API operator reporting a degraded condition like:

```
- lastTransitionTime: "2024-05-09T22:11:21Z"
  message: 'Failed when progressing towards operator: 4.16.0-0.nightly-2024-05-08-222442 because
    minimum worker replica count (2) not yet met: current running replicas 0, waiting for
    [test-infra-cluster-96-64zpj-worker-0-59bzh test-infra-cluster-96-64zpj-worker-0-cktvz]'
  reason: SyncingFailed
  status: "True"
  type: Degraded
```

Caused by https://issues.redhat.com/browse/OCPBUGS-33157
Related to https://issues.redhat.com/browse/OCPBUGS-33493
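For context, here is a minimal, self-contained sketch of the gating effect described above. The annotation key and the gate function are illustrative placeholders, not the actual cluster-baremetal-operator code: as long as the paused annotation stays on the Provisioning resource, the in-cluster baremetal-operator never starts.

```go
package main

import "fmt"

// pausedAnnotation stands in for the real annotation key this PR removes; the
// exact key is defined by the installer / cluster-baremetal-operator and is
// not reproduced here.
const pausedAnnotation = "provisioning.example.io/paused"

// shouldStartBMO sketches the gate: while the paused annotation is present on
// the Provisioning resource, the in-cluster baremetal-operator is not rolled
// out. If nothing ever removes the annotation (the assisted/agent case before
// this PR), BMO never starts.
func shouldStartBMO(provisioningAnnotations map[string]string) bool {
	_, paused := provisioningAnnotations[pausedAnnotation]
	return !paused
}

func main() {
	annotations := map[string]string{pausedAnnotation: "true"}
	fmt.Println(shouldStartBMO(annotations)) // false until the annotation is removed
}
```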
Force-pushed from 69d9d6c to 5e534e3
@carbonin: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
```
@@ -885,13 +887,18 @@ func (g *installerGenerator) updateBootstrap(ctx context.Context, bootstrapPath
			// drop this from the list of Files because we don't want to run BMO
			continue
		}
		if provErr := removeProvisioningPausedAnnotation(&config.Storage.Files[i]); provErr != nil {
```
Currently, whenever assisted is creating a BMH resource the BMH status is unmanaged (SaaS) or externally provisioned (ZTP). If assisted's behavior changes, or ironic/metal3 tries to update the BMH status, we will end up with the original issue (https://issues.redhat.com//browse/OCPBUGS-33157).

Perhaps removing this annotation from the assisted-installer-controller after all control plane nodes are ready is preferable, since it aligns with the desired CBO behavior.
The annotation is purely to prevent BMO in the cluster interfering with the BMO on the baremetal bootstrap host while it is still doing its work.
In a system with no baremetal bootstrap host it is not required.
```
func fileToBMH(file *config_latest_types.File) (*bmh_v1alpha1.BareMetalHost, error) {

func removeProvisioningPausedAnnotation(file *config_latest_types.File) error {
	provisioning := &metal3iov1alpha1.Provisioning{}
	if err := deserializeToObject(file, provisioning); err != nil {
```
This will cause us problems in future, because the API definition will have to be kept up to date every time a field is added to the Provisioning resource, and we will certainly forget that this is here and then wonder why the new data is getting silently dropped.
This might still be the best thing to do in the short term, but it's not a great substitute for a long-term fix that doesn't require adding this annotation at all.
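To make that concern concrete, here is a small self-contained sketch (the field names are illustrative, not the real Provisioning schema): unmarshalling a manifest into a typed struct that is missing a newer field and marshalling it back silently drops that field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Spec mimics an out-of-date local copy of the Provisioning spec: upstream has
// since added a "watchAllNamespaces" field (name chosen for illustration) that
// this definition does not know about.
type Spec struct {
	ProvisioningNetwork string `json:"provisioningNetwork,omitempty"`
}

func main() {
	manifest := []byte(`{"provisioningNetwork":"Disabled","watchAllNamespaces":true}`)

	var spec Spec
	_ = json.Unmarshal(manifest, &spec) // the unknown field is silently ignored

	out, _ := json.Marshal(spec)
	fmt.Println(string(out)) // {"provisioningNetwork":"Disabled"} - the newer field is gone
}
```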
How long do you think it will be until we get a proper long-term fix?
Do you think doing this in the controller against the actual cluster with a patch API call will be any more maintainable?
We got a successful test on openshift/installer#8391, but there are a few moving parts so it might take a minute to land.
Using a patch API call would avoid the out-of-date-API-definition problem.
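For illustration, here is a minimal sketch of that approach using a JSON merge patch through the dynamic client; the group/version/resource, object name, and annotation key below are assumptions for the example, not taken from this PR.

```go
package unpause

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// removePausedAnnotation drops the annotation with a JSON merge patch (setting
// a key to null deletes it). Because only the annotation is touched, no local
// copy of the Provisioning schema is needed and unknown fields are never dropped.
func removePausedAnnotation(ctx context.Context, client dynamic.Interface) error {
	// GVR, resource name, and annotation key are illustrative assumptions.
	gvr := schema.GroupVersionResource{Group: "metal3.io", Version: "v1alpha1", Resource: "provisionings"}
	patch := []byte(`{"metadata":{"annotations":{"provisioning.example.io/paused":null}}}`)

	_, err := client.Resource(gvr).Patch(ctx, "provisioning-configuration", types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```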
Closing this as the issue was fixed in the most recent 4.16 payloads.
List all the issues related to this PR
Caused by https://issues.redhat.com/browse/OCPBUGS-33157
Related to https://issues.redhat.com/browse/OCPBUGS-33493
What environments does this code impact?
How was this code tested?
Checklist
docs, README, etc.

Reviewers Checklist