Fix two different causes of infinite reconciliation loop #1473

luksa · 2023-11-17T15:18:09Z

No description provided.

jwendell · 2023-11-17T15:34:26Z

controllers/istio_controller.go

@@ -285,6 +285,10 @@ func (r *IstioReconciler) updateStatus(ctx context.Context, log logr.Logger, ist
 	}
 	status.AppliedValues = appliedValues

+	if reflect.DeepEqual(istio.Status, *status) {


worth adding a simple comment reminding readers why this is necessary?

It's pretty clear from the code itself. Why update the status if it hasn't changed? Additionally, the commit message fully explains why this is needed. I wouldn't want to add that long explanation here.

jwendell · 2023-11-17T15:35:17Z

tests/integration/common-operator-integ-suite.sh

@@ -158,6 +158,11 @@ check_ready() {
  ${COMMAND} wait deployment "${DEPLOYMENT_NAME}" -n "${NS}" --for condition=Available=True --timeout=${TIMEOUT}
 }

+fail() {


nit: When I read a function called fail I expect it to fail (i.e., exit 1) which is not the case here. It's confusing.

renamed to logFailure

jwendell · 2023-11-17T15:37:15Z

tests/integration/common-operator-integ-suite.sh

@@ -172,13 +177,33 @@ main_test() {
    ${COMMAND} get ns "${CONTROL_PLANE_NS}" >/dev/null 2>&1 || ${COMMAND} create namespace "${CONTROL_PLANE_NS}"
    sed -e "s/version:.*/version: ${ver}/g" "${ISTIO_MANIFEST}" | ${COMMAND} apply -f - -n "${CONTROL_PLANE_NS}"

+    echo "Wait for Istio to be Reconciled"


Do we need this? Or is waiting for Ready enough?

Although Ready should be enough, it makes sense to check Reconciled. Take this as an additional test assertion.

jwendell · 2023-11-17T15:41:59Z

tests/integration/common-operator-integ-suite.sh

+    echo "Wait for Istio to be Ready"
+    ${COMMAND} wait istio/istio-sample -n "${CONTROL_PLANE_NS}" --for condition=Ready=True --timeout=${TIMEOUT}
+
+    echo "Give the operator some time to settle down"


nit: s/some time/30s/

jwendell · 2023-11-17T15:42:54Z

tests/integration/common-operator-integ-suite.sh

+    echo "Give the operator some time to settle down"
+    sleep 30
+
+    echo "Check that the operator has stopped reconciling the resource"


nit: append sth like "Waiting 30s"

just so that people running this test manually know what to expect

The test waits in a few other places as well, and it doesn't say how long... but I've made the change as suggested

…of operator and istiod

Istiod updates the `failurePolicy` in the istio-validator-<rev>-<ns> ValidatingWebhookConfiguration when the webhook endpoint becomes ready. The operator should ignore this change in order to prevent an infinite reconciliation loop caused by the operator and istiod reverting each other's changes to the `failurePolicy` field.
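One way to picture "ignoring" the `failurePolicy` change is to drop that field from both the desired and the live object before comparing them. The following is only a minimal sketch under that assumption, not the operator's actual code: the helper names `stripFailurePolicy` and `equalIgnoringFailurePolicy`, the webhook name, and the policy values are all hypothetical, and the objects are modelled as plain maps rather than real Kubernetes types.

```go
package main

import (
	"fmt"
	"reflect"
)

// stripFailurePolicy returns a copy of a ValidatingWebhookConfiguration-like
// object in which every entry under "webhooks" has its "failurePolicy" key
// removed. Only the levels that are modified are copied; the input is not
// mutated. (Hypothetical helper, for illustration only.)
func stripFailurePolicy(obj map[string]any) map[string]any {
	out := make(map[string]any, len(obj))
	for k, v := range obj {
		out[k] = v
	}
	webhooks, ok := obj["webhooks"].([]any)
	if !ok {
		return out
	}
	stripped := make([]any, len(webhooks))
	for i, w := range webhooks {
		wh, ok := w.(map[string]any)
		if !ok {
			stripped[i] = w
			continue
		}
		c := make(map[string]any, len(wh))
		for k, v := range wh {
			if k != "failurePolicy" {
				c[k] = v
			}
		}
		stripped[i] = c
	}
	out["webhooks"] = stripped
	return out
}

// equalIgnoringFailurePolicy reports whether two objects match once the
// failurePolicy field of each webhook is disregarded.
func equalIgnoringFailurePolicy(a, b map[string]any) bool {
	return reflect.DeepEqual(stripFailurePolicy(a), stripFailurePolicy(b))
}

func main() {
	// What the operator rendered (illustrative values).
	desired := map[string]any{
		"webhooks": []any{
			map[string]any{"name": "example-webhook", "failurePolicy": "Ignore"},
		},
	}
	// What is in the cluster after istiod changed failurePolicy.
	live := map[string]any{
		"webhooks": []any{
			map[string]any{"name": "example-webhook", "failurePolicy": "Fail"},
		},
	}
	// If the objects differ only in failurePolicy, no update is needed.
	fmt.Println("needs update:", !equalIgnoringFailurePolicy(desired, live))
}
```

With a comparison like this, istiod's edit no longer looks like drift to the operator, so neither component keeps reverting the other.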

…ache

Before updating the status, the operator should check whether the status has changed. If it doesn't do this, it may cause the lastTransitionTimestamp in the status conditions to flip between two values. This happens in the following scenario:

1. operator deploys istiod Deployment and other resources
2. istiod Deployment becomes ready; this triggers a reconciliation of the Istio resource
3. the operator changes the state of the Ready condition to True and updates the lastTransitionTimestamp; this change is posted to the API server, but as per normal controller behavior, the local cache isn't yet updated
4. if something else now triggers another reconciliation before the controller's local cache is synced (i.e. if the reconciliation is triggered before the status update makes it into the local cache), the operator will now again set the Ready condition to True and update the lastTransitionTimestamp; again, it posts this change to the API server (NOTE: because a patch operation is used to update the status, this change is not treated as a conflict)
5. during this previous reconciliation, the operator is notified of the status change it made before; this causes another reconciliation of the locally-cached Istio resource
6. the object's status doesn't change, because the Ready condition is already true, however the lastTransitionTimestamp in the locally cached copy is different from the one in the API server; the operator updates the status again, overwriting the correct timestamp with the old one.
7. this process is then repeated continuously; the operator keeps switching the lastTransitionTimestamp between two values, since it keeps processing two different states of the Istio resource.

By performing the status update only if the status in the locally cached copy has changed during reconciliation, we break this infinite loop.
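The guard described above can be sketched in isolation as follows. This is a simplified illustration, assuming stand-in types (`Condition`, `IstioStatus`) and a hypothetical `patchIfChanged` helper with a `patch` callback in place of the real API call; the actual change in `controllers/istio_controller.go` compares `istio.Status` with the newly computed status using `reflect.DeepEqual`.

```go
package main

import (
	"fmt"
	"reflect"
)

// Condition and IstioStatus are simplified stand-ins for the real status
// types; the field names here are assumptions made for this sketch.
type Condition struct {
	Type               string
	Status             string
	LastTransitionTime string
}

type IstioStatus struct {
	Conditions []Condition
}

// patchIfChanged calls patch only when the newly computed status differs
// from the status in the locally cached object. It reports whether a patch
// was sent. (Hypothetical helper, for illustration only.)
func patchIfChanged(cached, computed IstioStatus, patch func(IstioStatus) error) (bool, error) {
	if reflect.DeepEqual(cached, computed) {
		// Nothing changed during this reconciliation: skip the write so
		// that a stale cached copy cannot overwrite a newer timestamp.
		return false, nil
	}
	if err := patch(computed); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	patch := func(s IstioStatus) error {
		fmt.Println("patching status:", s)
		return nil
	}

	cached := IstioStatus{Conditions: []Condition{
		{Type: "Ready", Status: "True", LastTransitionTime: "t1"},
	}}

	// Reconciliation that computes the same status as the cached copy.
	unchanged := IstioStatus{Conditions: []Condition{
		{Type: "Ready", Status: "True", LastTransitionTime: "t1"},
	}}
	sent, _ := patchIfChanged(cached, unchanged, patch)
	fmt.Println("patch sent:", sent)

	// Reconciliation that computes a genuinely different status.
	changed := IstioStatus{Conditions: []Condition{
		{Type: "Ready", Status: "False", LastTransitionTime: "t2"},
	}}
	sent, _ = patchIfChanged(cached, changed, patch)
	fmt.Println("patch sent:", sent)
}
```

Note that `reflect.DeepEqual` distinguishes a nil slice from an empty one, so a sketch like this relies on both statuses being built the same way.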

luksa added the tide/merge-method-rebase label (denotes a PR that should be rebased by tide when it merges) Nov 17, 2023

openshift-ci bot added the size/M label Nov 17, 2023

jwendell reviewed Nov 17, 2023

luksa added 2 commits November 17, 2023 16:53

luksa force-pushed the OSSM-5375 branch from 079567a to a148fa0 Compare November 17, 2023 15:54

jwendell approved these changes Nov 17, 2023

luksa added the okay to merge label Nov 17, 2023

openshift-merge-bot bot merged commit 23927f4 into maistra:maistra-3.0 Nov 17, 2023
8 checks passed

luksa deleted the OSSM-5375 branch November 27, 2023 07:24
