Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug in olm install #6490

Merged
merged 1 commit into from Jul 17, 2023

Conversation

nunnatsa
Copy link
Contributor

@nunnatsa nunnatsa commented Jul 6, 2023

When installing OLM, retry creating resources, if the CRD is not ready yet.

Fixes: #6489

Description of the change

On olm install command, if the creation of a resource fails with "no-match" error, retry creating it, assuming the CRD is not ready yet, but is going to be ready very soon.

Motivation for the change

Bug fix. See #6489

##Checklist

If the pull request includes user-facing changes, extra documentation is required:

@nunnatsa nunnatsa temporarily deployed to deploy July 6, 2023 06:46 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 6, 2023 06:46 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 6, 2023 06:46 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 6, 2023 06:46 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 6, 2023 06:46 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 6, 2023 06:46 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 6, 2023 06:46 — with GitHub Actions Inactive
@nunnatsa
Copy link
Contributor Author

nunnatsa commented Jul 6, 2023

Result of running the olm install command with v1.30.0:

$ operator-sdk olm install --verbose
DEBU[0000] Debug logging is set                         
INFO[0000] Fetching CRDs for version "latest"           
INFO[0000] Fetching resources for resolved version "latest" 
INFO[0001] Creating CRDs and resources                  
INFO[0001]   Creating CustomResourceDefinition "catalogsources.operators.coreos.com" 
INFO[0001]   Creating CustomResourceDefinition "clusterserviceversions.operators.coreos.com" 
INFO[0001]   Creating CustomResourceDefinition "installplans.operators.coreos.com" 
INFO[0001]   Creating CustomResourceDefinition "olmconfigs.operators.coreos.com" 
INFO[0001]   Creating CustomResourceDefinition "operatorconditions.operators.coreos.com" 
INFO[0001]   Creating CustomResourceDefinition "operatorgroups.operators.coreos.com" 
INFO[0001]   Creating CustomResourceDefinition "operators.operators.coreos.com" 
INFO[0001]   Creating CustomResourceDefinition "subscriptions.operators.coreos.com" 
INFO[0001]   Creating Namespace "olm"                   
INFO[0001]   Creating Namespace "operators"             
INFO[0001]   Creating ServiceAccount "olm/olm-operator-serviceaccount" 
INFO[0001]   Creating ClusterRole "system:controller:operator-lifecycle-manager" 
INFO[0001]   Creating ClusterRoleBinding "olm-operator-binding-olm" 
INFO[0001]   Creating OLMConfig "cluster"               
FATA[0001] Failed to install OLM version "latest": failed to create CRDs and resources: no matches for kind "OLMConfig" in version "operators.coreos.com/v1" 

Result after this fix; notice the OLMConfig resource:

$ build/operator-sdk olm install                   
INFO[0000] Fetching CRDs for version "latest"           
INFO[0000] Fetching resources for resolved version "latest" 
INFO[0001] Creating CRDs and resources                  
INFO[0001]   Creating CustomResourceDefinition "catalogsources.operators.coreos.com" 
INFO[0001]     CustomResourceDefinition "catalogsources.operators.coreos.com" created 
INFO[0001]   Creating CustomResourceDefinition "clusterserviceversions.operators.coreos.com" 
INFO[0001]     CustomResourceDefinition "clusterserviceversions.operators.coreos.com" created 
INFO[0001]   Creating CustomResourceDefinition "installplans.operators.coreos.com" 
INFO[0001]     CustomResourceDefinition "installplans.operators.coreos.com" created 
INFO[0001]   Creating CustomResourceDefinition "olmconfigs.operators.coreos.com" 
INFO[0001]     CustomResourceDefinition "olmconfigs.operators.coreos.com" created 
INFO[0001]   Creating CustomResourceDefinition "operatorconditions.operators.coreos.com" 
INFO[0001]     CustomResourceDefinition "operatorconditions.operators.coreos.com" created 
INFO[0001]   Creating CustomResourceDefinition "operatorgroups.operators.coreos.com" 
INFO[0001]     CustomResourceDefinition "operatorgroups.operators.coreos.com" created 
INFO[0001]   Creating CustomResourceDefinition "operators.operators.coreos.com" 
INFO[0001]     CustomResourceDefinition "operators.operators.coreos.com" created 
INFO[0001]   Creating CustomResourceDefinition "subscriptions.operators.coreos.com" 
INFO[0001]     CustomResourceDefinition "subscriptions.operators.coreos.com" created 
INFO[0001]   Creating Namespace "olm"                   
INFO[0001]     Namespace "olm" created                  
INFO[0001]   Creating Namespace "operators"             
INFO[0001]     Namespace "operators" created            
INFO[0001]   Creating ServiceAccount "olm/olm-operator-serviceaccount" 
INFO[0001]     ServiceAccount "olm/olm-operator-serviceaccount" created 
INFO[0001]   Creating ClusterRole "system:controller:operator-lifecycle-manager" 
INFO[0001]     ClusterRole "system:controller:operator-lifecycle-manager" created 
INFO[0001]   Creating ClusterRoleBinding "olm-operator-binding-olm" 
INFO[0001]     ClusterRoleBinding "olm-operator-binding-olm" created 
INFO[0001]   Creating OLMConfig "cluster"               
INFO[0001]     the CRD for OLMConfig "cluster" is not ready yet. Retrying in one second 
INFO[0004]     OLMConfig "cluster" created              
INFO[0004]   Creating Deployment "olm/olm-operator"     
INFO[0004]     Deployment "olm/olm-operator" created    
INFO[0004]   Creating Deployment "olm/catalog-operator" 
INFO[0004]     Deployment "olm/catalog-operator" created 
INFO[0004]   Creating ClusterRole "aggregate-olm-edit"  
INFO[0004]     ClusterRole "aggregate-olm-edit" created 
INFO[0004]   Creating ClusterRole "aggregate-olm-view"  
INFO[0004]     ClusterRole "aggregate-olm-view" created 
INFO[0004]   Creating OperatorGroup "operators/global-operators" 
INFO[0004]     OperatorGroup "operators/global-operators" created 
INFO[0004]   Creating OperatorGroup "olm/olm-operators" 
INFO[0004]     OperatorGroup "olm/olm-operators" created 
INFO[0004]   Creating ClusterServiceVersion "olm/packageserver" 
INFO[0004]     ClusterServiceVersion "olm/packageserver" created 
INFO[0004]   Creating CatalogSource "olm/operatorhubio-catalog" 
INFO[0004]     CatalogSource "olm/operatorhubio-catalog" created 
INFO[0004] Waiting for deployment/olm-operator rollout to complete 
INFO[0004]   Waiting for Deployment "olm/olm-operator" to rollout: 0 of 1 updated replicas are available 
INFO[0005]   Deployment "olm/olm-operator" successfully rolled out 
INFO[0005] Waiting for deployment/catalog-operator rollout to complete 
INFO[0005]   Deployment "olm/catalog-operator" successfully rolled out 
INFO[0005] Waiting for deployment/packageserver rollout to complete 
INFO[0005]   Waiting for Deployment "olm/packageserver" to rollout: 0 of 2 updated replicas are available 
INFO[0012]   Deployment "olm/packageserver" successfully rolled out 
INFO[0012] Successfully installed OLM version "latest"  

NAME                                            NAMESPACE    KIND                        STATUS
catalogsources.operators.coreos.com                          CustomResourceDefinition    Installed
clusterserviceversions.operators.coreos.com                  CustomResourceDefinition    Installed
installplans.operators.coreos.com                            CustomResourceDefinition    Installed
olmconfigs.operators.coreos.com                              CustomResourceDefinition    Installed
operatorconditions.operators.coreos.com                      CustomResourceDefinition    Installed
operatorgroups.operators.coreos.com                          CustomResourceDefinition    Installed
operators.operators.coreos.com                               CustomResourceDefinition    Installed
subscriptions.operators.coreos.com                           CustomResourceDefinition    Installed
olm                                                          Namespace                   Installed
operators                                                    Namespace                   Installed
olm-operator-serviceaccount                     olm          ServiceAccount              Installed
system:controller:operator-lifecycle-manager                 ClusterRole                 Installed
olm-operator-binding-olm                                     ClusterRoleBinding          Installed
cluster                                                      OLMConfig                   Installed
olm-operator                                    olm          Deployment                  Installed
catalog-operator                                olm          Deployment                  Installed
aggregate-olm-edit                                           ClusterRole                 Installed
aggregate-olm-view                                           ClusterRole                 Installed
global-operators                                operators    OperatorGroup               Installed
olm-operators                                   olm          OperatorGroup               Installed
packageserver                                   olm          ClusterServiceVersion       Installed
operatorhubio-catalog                           olm          CatalogSource               Installed

if err != nil {
if !apierrors.IsAlreadyExists(err) {

for i := 0; i < 5; i++ { // try to create 5 times before giving up
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like if we get the NoMatchError 5 times, we'll swallow the error and return nil. We probably would want to surface the error in this case instead of swallowing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 good point. But in that case, what do you say about making it a lop of ten, just for case?

@oceanc80
Copy link
Collaborator

oceanc80 commented Jul 6, 2023

We probably want a thorough set of unit tests to both reproduce the original error and exercise the edge cases of the new retry loop.

@nunnatsa nunnatsa temporarily deployed to deploy July 6, 2023 16:54 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 6, 2023 16:54 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 6, 2023 16:54 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 6, 2023 16:54 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 6, 2023 16:54 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 6, 2023 16:54 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 6, 2023 16:54 — with GitHub Actions Inactive
log.Infof(" the CRD for %s %q is not ready yet. Retrying in one second", kind, resourceName)
time.Sleep(time.Second)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a bit more idiomatic (at least from our kubernetes adjacency) to use the client-go retry package.

See: https://pkg.go.dev/k8s.io/client-go/util/retry#OnError

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to using the retry package

Copy link
Contributor

@everettraven everettraven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @nunnatsa !

I have a general question - does retrying actually resolve the error?

I am wondering if part of the problem is that there is no logic in the installation method in

log.Print("Creating CRDs and resources")
if err := c.DoCreate(ctx, objs...); err != nil {
return nil, fmt.Errorf("failed to create CRDs and resources: %v", err)
}
that differentiates between the actual CustomResourceDefinition resources and the instances of the CRDs so the ordering in which these resources are being created could be out of whack. This makes me think that retrying won't suffice in the event the CRD hasn't even been created yet due to the ordering of resource creation. I am thinking it might make more sense to have a distinct step for creating the CRDs that verifies all the CRDs have been created and then attempts to create the resources. (although it does appear to be the case that the list of resources is sorted such that CRDs come first here:
resources := append(crdResources, olmResources...)
)

Also leaving a couple comments:

# release notes and/or the migration guide
entries:
- description: >
Fix a bug where `olm install` command is failed fo "no-match" error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: It might be nice to show an example of the error output

log.Infof(" the CRD for %s %q is not ready yet. Retrying in one second", kind, resourceName)
time.Sleep(time.Second)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to using the retry package

internal/olm/client/client.go Show resolved Hide resolved
@nunnatsa
Copy link
Contributor Author

nunnatsa commented Jul 9, 2023

Thank you all. You raised some very good points, and I was trying to address them all.

The timeout flag is not available for this function, but it's used to create the context. The retry.OnError func is cool, but it does not work well with timeout context, at least in my tests here. I eventually used the wait.ExponentialBackoffWithContext function that is very similar, but does check the context. It requires some more logic code, but not too much.

As for the ordering: this is a great point and I checked it before I even started the implementation. For now, all the CRDs are deployed before the resources. I fully agree that for the long run, it will be better to split the CRD creation and the resource creation. But this refactoring is out of my capacity right now - sorry.

Anyway, since the order is right, then this solution solves the problem. In my environment I always see (only) one failure message, and then the creation succeeds.

@nunnatsa nunnatsa temporarily deployed to deploy July 9, 2023 06:52 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 9, 2023 06:52 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 9, 2023 06:52 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 9, 2023 06:52 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 9, 2023 06:52 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 9, 2023 06:52 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 9, 2023 06:52 — with GitHub Actions Inactive
Copy link
Contributor

@everettraven everettraven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me. A couple nits:

return false, err
})

if err != nil && err == context.DeadlineExceeded {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
if err != nil && err == context.DeadlineExceeded {
if err != nil && errors.Is(err, context.DeadlineExceeded) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point

Comment on lines 10 to 29
$ operator-sdk olm install --verbose
DEBU[0000] Debug logging is set
INFO[0000] Fetching CRDs for version "latest"
INFO[0000] Fetching resources for resolved version "latest"
INFO[0001] Creating CRDs and resources
INFO[0001] Creating CustomResourceDefinition "catalogsources.operators.coreos.com"
INFO[0001] Creating CustomResourceDefinition "clusterserviceversions.operators.coreos.com"
INFO[0001] Creating CustomResourceDefinition "installplans.operators.coreos.com"
INFO[0001] Creating CustomResourceDefinition "olmconfigs.operators.coreos.com"
INFO[0001] Creating CustomResourceDefinition "operatorconditions.operators.coreos.com"
INFO[0001] Creating CustomResourceDefinition "operatorgroups.operators.coreos.com"
INFO[0001] Creating CustomResourceDefinition "operators.operators.coreos.com"
INFO[0001] Creating CustomResourceDefinition "subscriptions.operators.coreos.com"
INFO[0001] Creating Namespace "olm"
INFO[0001] Creating Namespace "operators"
INFO[0001] Creating ServiceAccount "olm/olm-operator-serviceaccount"
INFO[0001] Creating ClusterRole "system:controller:operator-lifecycle-manager"
INFO[0001] Creating ClusterRoleBinding "olm-operator-binding-olm"
INFO[0001] Creating OLMConfig "cluster"
FATA[0001] Failed to install OLM version "latest": failed to create CRDs and resources: no matches for kind "OLMConfig" in version "operators.coreos.com/v1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I would recommend only showing the full error message to keep the changelog entry as short as possible but still informative

@everettraven
Copy link
Contributor

@nunnatsa it looks like the sanity check is failing due to an out of date fork. I think you need to fetch the latest changes on master and rebase this PR on top of that.

@nunnatsa
Copy link
Contributor Author

@nunnatsa it looks like the sanity check is failing due to an out of date fork. I think you need to fetch the latest changes on master and rebase this PR on top of that.

I think the sanity check is broken. I rebased my pr, again, just before thr lastest push.

When installing OLM, retry creating resources, if the CRD is not ready
yet.

Signed-off-by: Nahshon Unna-Tsameret <nunnatsa@redhat.com>
@nunnatsa
Copy link
Contributor Author

@everettraven - done

@nunnatsa nunnatsa temporarily deployed to deploy July 11, 2023 06:05 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 11, 2023 06:05 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 11, 2023 06:05 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 11, 2023 06:05 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 11, 2023 06:05 — with GitHub Actions Inactive
@nunnatsa nunnatsa temporarily deployed to deploy July 11, 2023 06:05 — with GitHub Actions Inactive
Copy link
Member

@rashmigottipati rashmigottipati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 17, 2023
@everettraven everettraven merged commit 7911713 into operator-framework:master Jul 17, 2023
28 checks passed
@nunnatsa nunnatsa deleted the fix-olm-install branch July 18, 2023 06:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

olm install fails in some clusters
6 participants