Skip to content

Commit

Permalink
controllers/provisioning_controller: Do not update ClusterOperator in…
Browse files Browse the repository at this point in the history
… SetupWithManager

SetupWithManager(mgr) is called before mgr.Start() in main(), so it's
running before the leader lease has been acquired.  We don't want to
be writing to anything before we acquire the lease, because that could
cause contention with an actual lease-holder that is managing those
resources.  We can still perform read-only prechecks in
SetupWithManager.  And then we wait patiently until we get the lease,
and update ClusterOperator (and anything else we manage) in Instead,
patiently wait until we have the lease in the Reconcile() function.

This partially rolls back 2e9d117 (Ensure baremetal CO is
completely setup before Reconcile, 2020-11-30, #81), but that addition
predated ensureClusterOperator being added early in Reconcile in
4f2d314 (Make sure ensureClusterOperator() is called before its
status is updated, 2020-12-15, #71):

  $ git log --oneline | grep -n '2e9d1177\|4f2d3141'
  468:4f2d3141 Make sure ensureClusterOperator() is called before its status is updated
  506:2e9d1177 Ensure baremetal CO is completely setup before Reconcile

So the ensureClusterOperator call in SetupWithManager is no longer
needed.

And this partially rolls back 8798044 (Handle case when
Provisioning CR is absent on the Baremetal platform, 2020-11-30, #81).
That "we're enabled, but there isn't a Provisioning custom resource
yet" handling happens continually in Reconcile (where the write will
be protected by the operator holding the lease).

Among other improvements, this change will prevent a
nominally-successful install where the operator never acquired a lease
[1]:

 $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/pods/openshift-machine-api_cluster-baremetal-operator-5c57b874f5-s9zmq_cluster-baremetal-operator.log >cbo.log
  $ head -n4 cbo.log
  I1222 01:05:34.274563       1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
  I1222 01:05:34.318283       1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
  I1222 01:05:34.403202       1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
  I1222 01:05:34.430552       1 provisioning_controller.go:620] "Network stack calculation" NetworkStack=1
  $ tail -n2 cbo.log
  E1222 02:36:57.323869       1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"
  E1222 02:37:00.248442       1 leaderelection.go:332] error retrieving resource lock openshift-machine-api/cluster-baremetal-operator: leases.coordination.k8s.io "cluster-baremetal-operator" is forbidden: User "system:serviceaccount:openshift-machine-api:cluster-baremetal-operator" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "openshift-machine-api"

but still managed to write Available=True (with that 'new CO status' line):

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352/artifacts/e2e-agnostic-ovn/gather-extra/artifacts/clusteroperators.json | jq -r '.items[] | select(.metadata.name == "baremetal").status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message'
  2023-12-22T01:05:34Z Progressing=False WaitingForProvisioningCR:
  2023-12-22T01:05:34Z Degraded=False :
  2023-12-22T01:05:34Z Available=True WaitingForProvisioningCR: Waiting for Provisioning CR on BareMetal Platform
  2023-12-22T01:05:34Z Upgradeable=True :
  2023-12-22T01:05:34Z Disabled=False :

"I'll never get this lease, and I need a lease to run all my
controllers" doesn't seem very Available=True to me, and with this
commit, we won't touch the ClusterOperator and the install will time
out.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-baremetal-operator/395/pull-ci-openshift-cluster-baremetal-operator-master-e2e-agnostic-ovn/1737988020168036352
  • Loading branch information
wking committed Dec 22, 2023
1 parent 2043697 commit 2f98d8c
Showing 1 changed file with 1 addition and 18 deletions.
19 changes: 1 addition & 18 deletions controllers/provisioning_controller.go
Original file line number Diff line number Diff line change
Expand Up @@ -593,25 +593,8 @@ func (r *ProvisioningReconciler) updateProvisioningMacAddresses(ctx context.Cont
// SetupWithManager configures the manager to run the controller
func (r *ProvisioningReconciler) SetupWithManager(mgr ctrl.Manager) error {
ctx := context.Background()
err := r.ensureClusterOperator()
if err != nil {
return errors.Wrap(err, "unable to set get baremetal ClusterOperator")
}

// Check the Platform Type to determine the state of the CO
enabled := IsEnabled(r.EnabledFeatures)

// If Platform is BareMetal, we could still be missing the Provisioning CR
if enabled {
baremetalConfig, err := r.readProvisioningCR(context.Background())
if err != nil || baremetalConfig == nil {
err = r.updateCOStatus(ReasonProvisioningCRNotFound, "Waiting for Provisioning CR on BareMetal Platform", "")
if err != nil {
return fmt.Errorf("unable to put %q ClusterOperator in Available state: %w", clusterOperatorName, err)
}
}
}

var err error
r.NetworkStack, err = r.networkStackFromServiceNetwork(ctx)
if err != nil {
return fmt.Errorf("unable to determine network stack from service network: %w", err)
Expand Down

0 comments on commit 2f98d8c

Please sign in to comment.