Can't recreate operator if the installplan exist in 4.4 #1570

Closed
horis233 opened this issue Jun 3, 2020 · 9 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/support Indicates an issue that is a support question.

Comments

@horis233
Contributor

horis233 commented Jun 3, 2020

Bug Report

Removing and then re-adding an operator can be blocked by the previous installplan in OCP 4.4.

What did you do?

  1. Deploy several operators into the namespace.

  2. Deploy an ibm-licensing-operator into the same namespace.
    a new installplan is created in the namespace.

Screen Shot 2020-06-03 at 6 48 27 PM

We can see that the installplan has one CSV and lists all the subscriptions as its owners.
  3. Remove ibm-licensing-operator.
    The ibm-licensing-operator is removed, and its subscription is also removed from the owners of the installplan, but the installplan itself is still there.

Screen Shot 2020-06-03 at 6 49 23 PM

  4. Then add ibm-licensing-operator to the operandrequest again.
    The licensing operator stays blocked at UpgradePending.

Screen Shot 2020-06-03 at 6 51 19 PM

What did you expect to see?

The operator can be added back into the cluster, as it can in OCP 4.3.

What did you see instead? Under which circumstances?

The licensing operator stays blocked at UpgradePending forever.

The catalog controller returns very little in the way of error messages.

The only error message I found was logged when deleting the operator:

time="2020-06-03T22:49:08Z" level=info msg=syncing event=delete reconciling="*v1alpha1.Subscription" selflink=/apis/operators.coreos.com/v1alpha1/namespaces/ibm-common-services/subscriptions/ibm-licensing-operator
E0603 22:49:08.993210       1 reconciler.go:257] unexpected subscription state in installplan reconciler *subscription.subscriptionDeletedState
time="2020-06-03T22:49:09Z" level=info msg=syncing id=IzUFY ip=install-9s9hj namespace=ibm-common-services phase=Complete

Environment

  • operator-lifecycle-manager version:
    - name: operator
      version: 4.4.5
    - name: operator-lifecycle-manager
      version: 0.14.2
  • Kubernetes version information:
➜  ~ kubectl version
Client Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.6-beta.0", GitCommit:"e7f962ba86f4ce7033828210ca3556393c377bcc", GitTreeState:"clean", BuildDate:"2020-01-15T08:26:26Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.1", GitCommit:"b23e21a", GitTreeState:"clean", BuildDate:"2020-05-18T09:20:55Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind:

OCP 4.4.5

Possible Solution

Delete the remaining installplan and redeploy the operator.
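
As a rough illustration of this workaround, the sketch below builds the `kubectl` command that deletes a leftover installplan. The helper name `delete_installplan_cmd` is hypothetical; the installplan name and namespace are the ones that appear in the log output above, and the sketch assumes `kubectl` is on the PATH and pointed at the affected cluster.

```python
import shlex
from typing import List


def delete_installplan_cmd(name: str, namespace: str) -> List[str]:
    """Build the argv for deleting one InstallPlan (hypothetical helper)."""
    return ["kubectl", "delete", "installplan", name, "--namespace", namespace]


# Values taken from the log lines in this report.
cmd = delete_installplan_cmd("install-9s9hj", "ibm-common-services")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; after that, the operator can be redeployed as described.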

Additional context

I did not hit this error in OCP 4.3.

My guess is that the logic for checking for an existing installplan changed in OLM 4.4.

In 4.3
https://github.com/operator-framework/operator-lifecycle-manager/blob/release-4.3/pkg/controller/operators/catalog/operator.go#L960-L1008

In 4.4, OLM uses the generation to check for an existing installplan:
https://github.com/operator-framework/operator-lifecycle-manager/blob/release-4.4/pkg/controller/operators/catalog/operator.go#L1020-L1040
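
To make the suspected cause concrete, here is a toy model of a generation-based lookup. It is not OLM's code: `InstallPlan`, `ensure_install_plan`, and the field names are simplified stand-ins, and the idea that a re-created subscription resolves to the same generation as the leftover installplan is an assumption based on the behaviour described in this report.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstallPlan:
    name: str
    generation: int
    phase: str  # e.g. "Installing" or "Complete"


def ensure_install_plan(existing: List[InstallPlan], generation: int) -> InstallPlan:
    """Reuse any plan with a matching generation, otherwise create a new one."""
    for plan in existing:
        if plan.generation == generation:
            # A leftover plan is returned even if it has already finished.
            return plan
    new_plan = InstallPlan(
        name=f"install-gen{generation}", generation=generation, phase="Installing"
    )
    existing.append(new_plan)
    return new_plan


# Leftover plan from the first install; its steps were already applied.
plans = [InstallPlan(name="install-9s9hj", generation=2, phase="Complete")]

# Re-adding the operator resolves to the same generation (assumption).
reused = ensure_install_plan(plans, generation=2)
print(reused.name, reused.phase)

# With the leftover plan deleted, a fresh plan is created instead.
fresh = ensure_install_plan([], generation=2)
print(fresh.name, fresh.phase)
```

In this model the reused plan is already Complete, so nothing re-applies the operator's resources, which would be consistent with the subscription staying in UpgradePending.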

@gyliu513 @chenzhiwei @DanielXLee @taylormgeorge91

@horis233
Contributor Author

Is there any update on this issue? I can reproduce it from both the OCP console and the CLI.

Here are the steps to reproduce it from the OCP console:

  1. Taking ActiveMQ Artemis and Advanced Cluster Management for Kubernetes as examples, I deploy these two operators from the console into the default namespace.

Screen Shot 2020-06-19 at 4 24 16 PM

We can see three operators are deployed in the default namespace (Advanced Cluster Management for Kubernetes requires etcd).

Screen Shot 2020-06-19 at 4 24 16 PM

Two installplans are generated: one for `ActiveMQ Artemis` and the other for `Advanced Cluster Management for Kubernetes` and `etcd`.
  2. Delete the Advanced Cluster Management for Kubernetes operator from the default namespace.

Screen Shot 2020-06-19 at 4 27 12 PM

We can see that `Advanced Cluster Management for Kubernetes` is deleted.

Screen Shot 2020-06-19 at 4 27 07 PM

The installplan for `Advanced Cluster Management for Kubernetes` is still there, because the `ActiveMQ Artemis` operator still owns it.
  3. Re-create Advanced Cluster Management for Kubernetes.
    Advanced Cluster Management for Kubernetes hangs in the UpgradePending status.

Screen Shot 2020-06-19 at 4 31 04 PM
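
Why the installplan outlives the deleted operator can be sketched with a small model of Kubernetes owner references, where a dependent object is only garbage collected once none of its owners remain. The function name and the owner list below are illustrative assumptions, not values read from the cluster.

```python
from typing import List, Set


def is_garbage_collected(owners: List[str], live_objects: Set[str]) -> bool:
    """A dependent is collected only when none of its owners still exist."""
    return not any(owner in live_objects for owner in owners)


# Assumed owner list: every subscription in the namespace owns the installplan.
installplan_owners = [
    "ActiveMQ Artemis",
    "Advanced Cluster Management for Kubernetes",
    "etcd",
]

# After deleting only Advanced Cluster Management for Kubernetes.
live_after_delete = {"ActiveMQ Artemis", "etcd"}
print(is_garbage_collected(installplan_owners, live_after_delete))

# Only once every owning subscription is gone would the plan be collected.
print(is_garbage_collected(installplan_owners, set()))
```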

From my understanding, this issue is caused by PR 5938bf6 @ecordell

Does OLM not recommend deploying multiple operators into one namespace? Or are there any suggestions for avoiding this issue?

@taylormgeorge91

I've seen this as well when deleting an operator and trying to re-create it (using CSVs). The new CSV will get associated with the previous installPlan if it still exists, and then the operator will be stuck in pending because it does not create a new installPlan but instead is linked with one that was deemed successful already (even though its resources have since been cleaned up by the CSV delete).
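
One way to spot this state is to look for subscriptions that are pending while pointing at an installplan that has already finished. The sketch below works on plain dictionaries shaped like the JSON returned by `kubectl get subscriptions -o json` and `kubectl get installplans -o json`. The field paths (`status.state`, `status.installPlanRef.name`, `status.phase`) are my understanding of the v1alpha1 API and should be checked against the cluster; the function name and the sample data are hypothetical.

```python
from typing import Dict, List


def find_stuck_subscriptions(subscriptions: List[dict], install_plans: List[dict]) -> List[str]:
    """Return names of subscriptions pending on an already Complete installplan."""
    phase_by_plan: Dict[str, str] = {
        ip["metadata"]["name"]: ip.get("status", {}).get("phase", "")
        for ip in install_plans
    }
    stuck = []
    for sub in subscriptions:
        status = sub.get("status", {})
        plan_name = status.get("installPlanRef", {}).get("name")
        pending = status.get("state") == "UpgradePending"
        if pending and phase_by_plan.get(plan_name) == "Complete":
            stuck.append(sub["metadata"]["name"])
    return stuck


# Hypothetical sample data mirroring the situation in this thread.
install_plans = [{"metadata": {"name": "install-9s9hj"}, "status": {"phase": "Complete"}}]
subscriptions = [
    {
        "metadata": {"name": "ibm-licensing-operator"},
        "status": {"state": "UpgradePending", "installPlanRef": {"name": "install-9s9hj"}},
    },
    {
        "metadata": {"name": "healthy-operator"},
        "status": {"state": "AtLatestKnown", "installPlanRef": {"name": "install-9s9hj"}},
    },
]
print(find_stuck_subscriptions(subscriptions, install_plans))
```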

@exdx
Member

exdx commented Jun 23, 2020

By design, InstallPlan objects persist after the CSV is deleted: they reflect what was installed on the cluster and should be removed separately. Think of them as an audit log of what was installed.

However, if an existing InstallPlan is causing problems when installing a new CSV, that sounds like a legitimate bug. We are currently looking into some other issues around InstallPlan deletion and may take on this issue in the near future.

@exdx exdx added kind/bug Categorizes issue or PR as related to a bug. triage/support Indicates an issue that is a support question. labels Jun 23, 2020
@njhale
Member

njhale commented Aug 3, 2020

Updates:

This issue should be fixed on master and select release branches.

@njhale
Member

njhale commented Aug 18, 2020

I'm going to close this out. Please re-open if you are still experiencing the issue on master.

@njhale njhale closed this as completed Aug 18, 2020
@horis233
Contributor Author

@njhale I can still reproduce this issue on OCP 4.5.6. Is your fix merged into this release branch yet?

@horis233
Contributor Author

horis233 commented Sep 1, 2020

@njhale Thanks, I verified the fix with 4.6.0-0.nightly-2020-09-01-042030.

@nictownsend

Has this been backported to 4.4.x? I'm seeing it on 4.4.20

@taylormgeorge91

taylormgeorge91 commented Sep 22, 2020

Yes. Backport to 4.4 is delivered in 4.4.21
4.5 fix is delivered in 4.5.8

Reference:
4.4: https://bugzilla.redhat.com/show_bug.cgi?id=1869717
4.5: https://bugzilla.redhat.com/show_bug.cgi?id=1864121
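
For anyone scripting a check against these release numbers, a minimal sketch follows. The thresholds 4.4.21 and 4.5.8 come from this comment. Treating releases before 4.4 as unaffected follows the original report, and treating 4.6 and later as fixed is an assumption based on the nightly verified above. The function names are hypothetical.

```python
from typing import Tuple

# First z-stream release carrying the fix, per minor stream (from this thread).
FIXED_IN = {(4, 4): 21, (4, 5): 8}


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'X.Y.Z' or 'X.Y.Z-suffix' into a numeric tuple."""
    major, minor, patch = version.split("-")[0].split(".")[:3]
    return int(major), int(minor), int(patch)


def is_affected(version: str) -> bool:
    """True if the OCP version falls in a range reported as affected here."""
    major, minor, patch = parse_version(version)
    if (major, minor) in FIXED_IN:
        return patch < FIXED_IN[(major, minor)]
    # Assumption: streams outside 4.4 and 4.5 are not affected.
    return False


for v in ["4.4.5", "4.4.20", "4.4.21", "4.5.6", "4.5.8", "4.6.0-0.nightly-2020-09-01-042030"]:
    print(v, is_affected(v))
```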
