Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix two different causes of infinite reconciliation loop #1473

Merged
merged 2 commits into from
Nov 17, 2023

Conversation

luksa
Copy link
Contributor

@luksa luksa commented Nov 17, 2023

No description provided.

@luksa luksa added the tide/merge-method-rebase Denotes a PR that should be rebased by tide when it merges. label Nov 17, 2023
@openshift-ci openshift-ci bot added the size/M label Nov 17, 2023
@@ -285,6 +285,10 @@ func (r *IstioReconciler) updateStatus(ctx context.Context, log logr.Logger, ist
}
status.AppliedValues = appliedValues

if reflect.DeepEqual(istio.Status, *status) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth adding a simple comment reminding readers why this is necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's pretty clear from the code itself. Why update the status if it hasn't changed? Additionally, the commit message fully explains why this is needed. I wouldn't want to add that long explanation here.

@@ -158,6 +158,11 @@ check_ready() {
${COMMAND} wait deployment "${DEPLOYMENT_NAME}" -n "${NS}" --for condition=Available=True --timeout=${TIMEOUT}
}

fail() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: When I read a function called fail I expect it to fail (i.e., exit 1) which is not the case here. It's confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed to logFailure

@@ -172,13 +177,33 @@ main_test() {
${COMMAND} get ns "${CONTROL_PLANE_NS}" >/dev/null 2>&1 || ${COMMAND} create namespace "${CONTROL_PLANE_NS}"
sed -e "s/version:.*/version: ${ver}/g" "${ISTIO_MANIFEST}" | ${COMMAND} apply -f - -n "${CONTROL_PLANE_NS}"

echo "Wait for Istio to be Reconciled"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? Or just wait for Ready is enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although Ready should be enough, it makes sense to check Reconciled. Take this as an additional test assertion.

echo "Wait for Istio to be Ready"
${COMMAND} wait istio/istio-sample -n "${CONTROL_PLANE_NS}" --for condition=Ready=True --timeout=${TIMEOUT}

echo "Give the operator some time to settle down"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/some time/30s/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

echo "Give the operator some time to settle down"
sleep 30

echo "Check that the operator has stopped reconciling the resource"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: append sth like "Waiting 30s"

just so that people running this test manually knows what to expect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests waits in a few other places as well, and it doesn't say how long... but I've made the change as suggested

…of operator and istiod

Istiod updates the `failurePolicy` in the istio-validator-<rev>-<ns> ValidatingWebhookConfiguration when the webhook endpoint becomes ready. The operator should ignore this change in order to prevent an infinite reconciliation loop caused by the operator and istiod reverting each other's changes to the `failurePolicy` field.
…ache

Before updating the status, the operator should check whether the status has changed. If it doesn't do this, it may cause the lastTransitionTimestamp in the status conditions to flip between two values.

This happens in the following scenario:
1. operator deploys istiod Deployment and other resources
2. istiod Deployment becomes ready; this triggers a reconciliation of the Istio resource
3. the operator changes the state of the Ready condition to True and updates the lastTransitionTimestamp; this change is posted to the API server, but as per normal controller behavior, the local cache isn't yet updated
4. if something else now triggers another reconciliation before the controller's local cache is synced (i.e. if the reconciliation is triggered before the status update makes it into the local cache), the operator will now again set the Ready condition to True and update the lastTransitionTimestamp; again, it posts this change to the API server (NOTE: because a patch operation is used to update the status, this change is not treated as a conflict)
5. during this previous reconciliation, the operator is notified of the status changed it made before; this causes another reconciliation of the locally-cached Istio resource
6. the object's status doesn't change, because the Ready condition is already true, however the lastTransitionTimestamp in the locally cached copy is different from the one in the API server; the operator updates the status again, overwriting the correct timestamp with the old one.
7. this process is then repeated continuously; the operator keeps switching the lastTransitionTimestamp between two values, since it keeps processing two different states of the Istio resource.

 By performing the status update only if the status in the locally cached copy has changed during reconciliation, we break this infinite loop.
@openshift-merge-bot openshift-merge-bot bot merged commit 23927f4 into maistra:maistra-3.0 Nov 17, 2023
8 checks passed
@luksa luksa deleted the OSSM-5375 branch November 27, 2023 07:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
okay to merge size/M tide/merge-method-rebase Denotes a PR that should be rebased by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants