Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1767004: defer provided api update in operator groups #1114

Conversation

jpeeler
Copy link
Contributor

@jpeeler jpeeler commented Nov 7, 2019

This is the case of a bug where the CSV status wasn't updated due to an
invalid CSV. The cause for this was twofold:

  1. The CSV and OperatorGroup reconcile loops were undoing each others
    changes continously, giving no chance for the CSV to get synced beyond
    provided api conflict detection.

  2. Due to the way the provided APIs are produced differently for
    operator groups (uses GVKSTringToProvidedAPISet) and OLM (uses
    NewOperatorFromV1Alpha1CSV) when an invalid CSV is in use, the resolver
    attempts to continue to instruct that an operator group annotation
    update is needed beyond the point of that being true.

The changes are to ensure that OLM does not get stuck attempting to
update the operator group annotations when they are no longer needed
combined with ensuring that the operator group sync loop does not try to
prematurely dismiss a provided api before the CSV has a chance to check
the requirements and produce status.

In order to accomplish "premature detection", providedAPIsFromCSVs was
modified to not only return the APIs but also the CSV that provides a
given API.

Description of the change:

Motivation for the change:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /docs
  • Commit messages sensible and descriptive

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Nov 7, 2019
@openshift-ci-robot
Copy link
Collaborator

@jpeeler: This pull request references Bugzilla bug 1767004, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Bug 1767004: defer provided api update in operator groups

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 7, 2019
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 7, 2019
This is the case of a bug where the CSV status wasn't updated due to an
invalid CSV. The cause for this was twofold:

1) The CSV and OperatorGroup reconcile loops were undoing each others
changes continously, giving no chance for the CSV to get synced beyond
provided api conflict detection.

2) Due to the way the provided APIs are produced differently for
operator groups (uses GVKSTringToProvidedAPISet) and OLM (uses
NewOperatorFromV1Alpha1CSV) when an invalid CSV is in use, the resolver
attempts to continue to instruct that an operator group annotation
update is needed beyond the point of that being true.

The changes are to ensure that OLM does not get stuck attempting to
update the operator group annotations when they are no longer needed
combined with ensuring that the operator group sync loop does not try to
prematurely dismiss a provided api before the CSV has a chance to check
the requirements and produce status.

In order to accomplish "premature detection", providedAPIsFromCSVs was
modified to not only return the APIs but also the CSV that provides a
given API.
@jpeeler
Copy link
Contributor Author

jpeeler commented Nov 7, 2019

/retest

@jpeeler
Copy link
Contributor Author

jpeeler commented Nov 7, 2019

/retest
I think this PR is good to go, but CI seems to be broken right now.

@jpeeler
Copy link
Contributor Author

jpeeler commented Nov 8, 2019

/retest

2 similar comments
@dinhxuanvu
Copy link
Member

/retest

@dinhxuanvu
Copy link
Member

/retest

Copy link
Member

@njhale njhale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpeeler Thank you for getting on this so quickly! I had a few comments, but other than those, this PR really has me thinking about how/if we can simplify the provided API reconciliation. It would be nice if only OperatorGroup syncing handled provided API annotation updates, WDYT?

pkg/controller/operators/olm/operator_test.go Show resolved Hide resolved
test/e2e/csv_e2e_test.go Show resolved Hide resolved
// Prune providedAPIs annotation if the cluster has fewer providedAPIs (handles CSV deletion)
if intersection := groupProvidedAPIs.Intersection(providedAPIsFromCSVs); len(intersection) < len(groupProvidedAPIs) {
//if intersection := groupProvidedAPIs.Intersection(providedAPIsFromCSVs); len(intersection) < len(groupProvidedAPIs) {
if len(intersection) < len(groupProvidedAPIs) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should only happen when the OperatorGroup has provided APIs that no longer exist in the namespace. Could you explain how this condition is met otherwise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was changed mainly because proviedAPIsFromCSVs no longer is an APISet, so calculating the intersection isn't possible. But with an invalid api version specified in a CSV, it is possible that the operator group has apis that were never on the cluster (which is part of the problem here). I can remove the commented out line I left in there if you want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But with an invalid api version specified in a CSV, it is possible that the operator group has apis that were never on the cluster (which is part of the problem here)

What does "never on the cluster" mean? IIRC, the point here is to remove APIs that no CSV in the OperatorGroup's namespace claims to provide. I would rather have false positives and cleanup more often

I can remove the commented out line I left in there if you want.

I think we want to avoid commented code piling up, but we don't need to cause another round of CI over it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also see: #1114 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also see: #1114 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember that this bug is dealing with an invalid api due to botched syntax. Due to the botched syntax as shown in the bug, the CRD version could never have possibly existed.

The problem with cleaning up more often is there are two sync loops that end up fighting each other with a constant stream of updates, which prevents validation from ever occurring and therefore not getting reported back to the user.

@jpeeler
Copy link
Contributor Author

jpeeler commented Nov 13, 2019

@njhale In general, it seems that having any loops updating resources other than the one the loop was started for leads to bad contention. But operator groups and CSVs are coupled in a way that makes the problem particularly bad.

@@ -230,20 +237,40 @@ func (a *Operator) providedAPIsFromCSVs(group *v1.OperatorGroup, logger *logrus.
logger.WithError(err).Warn("could not create OperatorSurface from csv")
continue
}
providedAPIsFromCSVs = providedAPIsFromCSVs.Union(operatorSurface.ProvidedAPIs().StripPlural())
for providedAPI := range operatorSurface.ProvidedAPIs().StripPlural() {
providedAPIsFromCSVs[providedAPI] = csv
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why lug the CSV around when you can filter out the ones you don't want here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I want to filter out any CSVs. The CSVs are returned so that in pruneProvidedAPIs while iterating over the provided apis, the CSV phase for a given api can be checked to ensure it has progressed to the point of validation (past phase pending).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I think they are roughly isomorphic, but this isn't a big deal.

Copy link
Member

@njhale njhale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have one last comment I wanted to get your thoughts on before we merge this: #1114 (comment)

@njhale
Copy link
Member

njhale commented Nov 15, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 15, 2019
@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jpeeler, njhale

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 76754c9 into operator-framework:master Nov 15, 2019
@openshift-ci-robot
Copy link
Collaborator

@jpeeler: All pull requests linked via external trackers have merged. Bugzilla bug 1767004 has been moved to the MODIFIED state.

In response to this:

Bug 1767004: defer provided api update in operator groups

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants