Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not return errors from reconciliation when waiting for update job #224

Closed
Szymongib opened this issue Mar 22, 2021 · 5 comments · Fixed by #288
Closed

Do not return errors from reconciliation when waiting for update job #224

Szymongib opened this issue Mar 22, 2021 · 5 comments · Fixed by #288

Comments

@Szymongib
Copy link
Contributor

Summary

Before changes are rolled out to Mattermost deployment, the update job is deployed to verify that the image is correct. The Operator waits for the job to complete before starting deployment update.

From reconciler perspective, when the job is still running the error is returned to requeue reconciliation request. If error occur enough number of times, in rare cases this may significantly delay rolling out the deployment due to error reconciliation back off.

We should modify logic that waits for the update job to finish to return some indication that it is not yet finished rather than returning an error and use constant time for requeue delay. For now it should be enough to return bool alongside error and propagate it from checkUpdateJob function to reconciler.

@imskr
Copy link
Contributor

imskr commented May 28, 2022

Can I work on this?

@Szymongib
Copy link
Contributor Author

Sure @imskr, I will assign you.

@imskr
Copy link
Contributor

imskr commented May 29, 2022

Thanks @Szymongib After reading the description I have some thoughts:

  1. We can return bool alongside error in updateMattermostDeployment and we can maybe return true here after checkUpdateJob

  2. Inside Reconcile() I have to use the true bool from above to somehow stop from reconciling.

Am I right?

@gabrieljackson
Copy link
Collaborator

Hey @imskr, yeah that seems like a fairly reasonable implementation.

For (2), we would want to take the returned bool and have the reconciler queue up another reconcile loop after a certain specific delay. Something like 5-10 seconds seems about right to me. The idea here is we don't want to have the operator think there was too many errors and fall into delayed reconciliation checks which it will do at a certain threshold.

@Szymongib
Copy link
Contributor Author

Inside Reconcile() I have to use the true bool from above to somehow stop from reconciling.

Yes, the general idea sound right, we might however consider returning some struct representing the status from checkMattermost instead of simple boolean for better clarity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants