Bug 1767004: defer provided api update in operator groups #1114
@jpeeler: This pull request references Bugzilla bug 1767004, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.
This fixes a bug where the CSV status was never updated when the CSV was invalid. The cause was twofold:
1) The CSV and OperatorGroup reconcile loops were continuously undoing each other's changes, giving the CSV no chance to get synced beyond provided API conflict detection.
2) Because the provided APIs are produced differently for operator groups (uses GVKStringToProvidedAPISet) and OLM (uses NewOperatorFromV1Alpha1CSV) when an invalid CSV is in use, the resolver keeps signalling that an operator group annotation update is needed after that is no longer true.
The changes ensure that OLM does not get stuck attempting to update the operator group annotations when an update is no longer needed, and that the operator group sync loop does not prematurely dismiss a provided API before the CSV has had a chance to check the requirements and produce status.
In order to detect a premature dismissal, providedAPIsFromCSVs was modified to return not only the APIs but also the CSV that provides a given API.
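As a rough illustration of the last point, here is a minimal sketch of returning each provided API together with the CSV that provides it. The `APIKey` and `CSV` types and the `providedAPIsByCSV` helper are simplified, hypothetical stand-ins and not OLM's actual types or signatures.

```go
package main

import "fmt"

// APIKey and CSV are hypothetical, simplified stand-ins for OLM's types.
type APIKey struct {
	Group, Version, Kind string
}

type CSV struct {
	Name  string
	Phase string
	Owned []APIKey
}

// providedAPIsByCSV maps every owned API to the CSV that provides it,
// instead of collapsing everything into a plain set of APIs.
func providedAPIsByCSV(csvs []*CSV) map[APIKey]*CSV {
	out := make(map[APIKey]*CSV)
	for _, csv := range csvs {
		for _, api := range csv.Owned {
			out[api] = csv
		}
	}
	return out
}

func main() {
	// Example values are made up for illustration.
	example := &CSV{
		Name:  "example-operator.v0.1.0",
		Phase: "Pending",
		Owned: []APIKey{{Group: "example.com", Version: "v1", Kind: "Example"}},
	}
	for api, csv := range providedAPIsByCSV([]*CSV{example}) {
		fmt.Printf("%s/%s %s is provided by %s (phase %s)\n",
			api.Group, api.Version, api.Kind, csv.Name, csv.Phase)
	}
}
```

Keeping the CSV next to the API lets a later step inspect the provider's phase without a second lookup.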
8004312
to
ad49491
Compare
/retest
@jpeeler Thank you for getting on this so quickly! I had a few comments, but other than those, this PR really has me thinking about how/if we can simplify the provided API reconciliation. It would be nice if only OperatorGroup syncing handled provided API annotation updates, WDYT?
  // Prune providedAPIs annotation if the cluster has fewer providedAPIs (handles CSV deletion)
- if intersection := groupProvidedAPIs.Intersection(providedAPIsFromCSVs); len(intersection) < len(groupProvidedAPIs) {
+ //if intersection := groupProvidedAPIs.Intersection(providedAPIsFromCSVs); len(intersection) < len(groupProvidedAPIs) {
+ if len(intersection) < len(groupProvidedAPIs) {
This should only happen when the OperatorGroup has provided APIs that no longer exist in the namespace. Could you explain how this condition is met otherwise?
This was changed mainly because providedAPIsFromCSVs is no longer an APISet, so calculating the intersection isn't possible. But with an invalid API version specified in a CSV, it is possible that the operator group has APIs that were never on the cluster (which is part of the problem here). I can remove the commented-out line I left in there if you want.
> But with an invalid api version specified in a CSV, it is possible that the operator group has apis that were never on the cluster (which is part of the problem here)

What does "never on the cluster" mean? IIRC, the point here is to remove APIs that no CSV in the OperatorGroup's namespace claims to provide. I would rather have false positives and clean up more often.

> I can remove the commented out line I left in there if you want.

I think we want to avoid commented-out code piling up, but we don't need to cause another round of CI over it.
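For reference, the intersection-based pruning discussed here can be sketched as follows. `APISet` is modelled as a plain set of strings and `prune` is a hypothetical helper; neither is OLM's actual implementation.

```go
package main

import "fmt"

// APISet is a simplified, hypothetical stand-in for a set of provided APIs.
type APISet map[string]struct{}

// Intersection returns the APIs present in both sets.
func (s APISet) Intersection(other APISet) APISet {
	out := make(APISet)
	for api := range s {
		if _, ok := other[api]; ok {
			out[api] = struct{}{}
		}
	}
	return out
}

// prune drops every annotated API that no CSV claims to provide and reports
// whether the annotation would need to be rewritten.
func prune(group, fromCSVs APISet) (APISet, bool) {
	intersection := group.Intersection(fromCSVs)
	return intersection, len(intersection) < len(group)
}

func main() {
	group := APISet{"Example.v1.example.com": {}, "Stale.v1.example.com": {}}
	fromCSVs := APISet{"Example.v1.example.com": {}}
	kept, changed := prune(group, fromCSVs)
	fmt.Println(len(kept), changed)
}
```

Under this rule an API that is annotated but not claimed by any CSV is always removed, which is the "clean up more often" behaviour.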
also see: #1114 (comment)
also see: #1114 (comment)
Remember that this bug is dealing with an invalid API caused by botched syntax. With the syntax shown in the bug, the CRD version could never have existed.
The problem with cleaning up more often is that the two sync loops end up fighting each other with a constant stream of updates, which prevents validation from ever occurring, so nothing gets reported back to the user.
@njhale In general, it seems that having any loop update resources other than the one the loop was started for leads to bad contention. But operator groups and CSVs are coupled in a way that makes the problem particularly bad.
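A toy model of that contention, assuming the CSV loop re-adds its API to the annotation whenever it is missing and the OperatorGroup loop prunes it whenever it is present. The `group` type and both sync functions are hypothetical and only count annotation writes.

```go
package main

import "fmt"

// group is a toy stand-in for an OperatorGroup: it tracks whether the
// provided-API annotation lists the CSV's API and how often it was written.
type group struct {
	annotated bool
	writes    int
}

// csvSync mimics the CSV loop adding its API to the annotation.
func csvSync(g *group) {
	if !g.annotated {
		g.annotated = true
		g.writes++
	}
}

// groupSync mimics the OperatorGroup loop pruning an API it considers stale.
// With deferPrune set, it leaves the annotation alone.
func groupSync(g *group, deferPrune bool) {
	if g.annotated && !deferPrune {
		g.annotated = false
		g.writes++
	}
}

// run alternates the two loops and returns the total number of writes.
func run(rounds int, deferPrune bool) int {
	g := &group{}
	for i := 0; i < rounds; i++ {
		csvSync(g)
		groupSync(g, deferPrune)
	}
	return g.writes
}

func main() {
	fmt.Println("writes without deferral:", run(5, false))
	fmt.Println("writes with deferral:", run(5, true))
}
```

In this model the write count grows with every round unless the prune is deferred, in which case the annotation settles after the first write.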
@@ -230,20 +237,40 @@ func (a *Operator) providedAPIsFromCSVs(group *v1.OperatorGroup, logger *logrus.
  logger.WithError(err).Warn("could not create OperatorSurface from csv")
  continue
  }
- providedAPIsFromCSVs = providedAPIsFromCSVs.Union(operatorSurface.ProvidedAPIs().StripPlural())
+ for providedAPI := range operatorSurface.ProvidedAPIs().StripPlural() {
+ providedAPIsFromCSVs[providedAPI] = csv
Why lug the CSV around when you can filter out the ones you don't want here?
I don't think I want to filter out any CSVs. The CSVs are returned so that in pruneProvidedAPIs while iterating over the provided apis, the CSV phase for a given api can be checked to ensure it has progressed to the point of validation (past phase pending).
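One possible reading of that check, as a minimal sketch: pruning is deferred while any providing CSV has not moved past the pending phase. The types, the string phase values, and this `pruneProvidedAPIs` signature are assumptions for illustration and do not reproduce the code in this PR.

```go
package main

import "fmt"

// CSV is a hypothetical, simplified stand-in; Phase uses plain strings.
type CSV struct {
	Name  string
	Phase string
}

// pastPending reports whether the CSV has progressed beyond the empty and
// "Pending" phases, i.e. its requirements have been checked at least once.
func pastPending(csv *CSV) bool {
	return csv.Phase != "" && csv.Phase != "Pending"
}

// pruneProvidedAPIs returns the APIs to keep in the annotation. If any
// providing CSV is still awaiting validation, nothing is pruned and
// deferred is true.
func pruneProvidedAPIs(groupAPIs []string, provided map[string]*CSV) (keep []string, deferred bool) {
	for _, csv := range provided {
		if !pastPending(csv) {
			return groupAPIs, true
		}
	}
	for _, api := range groupAPIs {
		if _, ok := provided[api]; ok {
			keep = append(keep, api)
		}
	}
	return keep, false
}

func main() {
	provided := map[string]*CSV{"Example.v1.example.com": {Name: "example.v0.1.0", Phase: "Pending"}}
	keep, deferred := pruneProvidedAPIs([]string{"Example.v1.example.com", "Other.v1.example.com"}, provided)
	fmt.Println(keep, deferred)
}
```

Once the CSV reaches a later phase such as "Failed", the same call prunes the unclaimed API as usual.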
hmm, I think they are roughly isomorphic, but this isn't a big deal.
I just have one last comment I wanted to get your thoughts on before we merge this: #1114 (comment)
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: jpeeler, njhale
@jpeeler: All pull requests linked via external trackers have merged. Bugzilla bug 1767004 has been moved to the MODIFIED state.