
Scorecard does not wait for finalisers to complete before removing operator #2094

Closed
oseoin opened this issue Oct 23, 2019 · 17 comments · Fixed by #2597
Labels: kind/bug, language/ansible, scorecard, triage/support

oseoin commented Oct 23, 2019

Bug Report

What did you do?
Created a simple Ansible Operator with SDK v0.11.0 to reproduce an issue we are seeing on our Ansible operators where finalisers do not complete within a few seconds.

Using the following tasks:

---
- name: "Set state of {{ meta.name }}-test-cfg"
  k8s:
    state: "{{ state }}"
    definition:
      apiVersion: "v1"
      kind: ConfigMap
      metadata:
        name: "{{ meta.name }}-test-cfg"
        namespace: "{{ meta.namespace }}"
      data:
        key: value

- name: test wait
  shell: sleep 10

Set a finaliser in watches.yaml:

- version: v1
  group: example.com
  kind: TestCr
  playbook: /opt/ansible/playbook.yml
  finalizer:
    name: example.com
    vars:
      state: absent

Used the following scorecard configuration:

scorecard:
  plugins:
    - basic:
        olm-deployed: false
        cr-manifest:
          - "deploy/crds/icm.ibm.com_v1_testcr_cr.yaml"
        init-timeout: 300

What did you expect to see?
With the sleep task disabled I see the following as expected:

operator-sdk scorecard --verbose   
DEBU[0000] Debug logging is set                         
INFO[0000] Using config file: /Users/oseoin/playground/kubernetes/operators/operator-test/.osdk-scorecard.yaml 
Basic Tests:
        Writing into CRs has an effect: 1/1
        Spec Block Exists: 1/1
        Status Block Exists: 1/1

What did you see instead? Under which circumstances?
Instead I get an error that cleanup does not complete:

operator-sdk scorecard --verbose       
DEBU[0000] Debug logging is set                         
INFO[0000] Using config file: /Users/oseoin/playground/kubernetes/operators/operator-test/.osdk-scorecard.yaml 
WARN[0044] time="2019-10-23T15:19:35+01:00" level=info msg="a cleanup function failed with error: cleanup function failed: timed out waiting for the condition\n"
time="2019-10-23T15:19:45+01:00" level=info msg="a cleanup function failed with error: cleanup function failed: timed out waiting for the condition\n"
time="2019-10-23T15:19:45+01:00" level=error msg="Failed to cleanup resources: (a cleanup function failed; see stdout for more details)" 
Basic Tests:
        Spec Block Exists: 1/1
        Status Block Exists: 1/1
        Writing into CRs has an effect: 1/1

Watching the operator deployment I can see that it gets removed before the finaliser completes.
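One way to observe this while scorecard runs (a sketch; the deployment and resource kind names are assumed from the test operator above and may differ in your project):

# In a second terminal while scorecard is running:
kubectl get deployments -w
# Watch the CR's finalizers and deletionTimestamp as cleanup happens:
kubectl get testcrs -o yaml -w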

Environment

  • operator-sdk version:

operator-sdk version: "v0.11.0"
commit: "39c65c36159a9c249e5f3c178205cc6e86c16f8d"

  • go version:
    go version: "go1.13.1 darwin/amd64"

  • Kubernetes version information:

Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.4", GitCommit:"67d2fcf276fcd9cf743ad4be9a9ef5828adc082f", GitTreeState:"clean", BuildDate:"2019-09-18T14:41:55Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind:
    Docker Desktop 2.1.4.0 on macOS 10.15

  • Are you writing your operator in ansible, helm, or go?
    Ansible

Possible Solution
Scorecard should wait before removing the operator deployment if finalisers are set.
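In rough shell terms, the cleanup would need to look something like the sketch below. All names here are assumed from the test operator above (the CR, CRD and deployment names will differ per project); this is only an illustration of the ordering, not what scorecard actually runs.

# Delete the CR first and wait for it to go away, so the operator is still
# running while the finaliser executes (it may take longer than a few seconds):
kubectl delete testcr example-testcr --wait=true --timeout=300s
# Only after the CR is gone, remove the CRD and the operator deployment:
kubectl delete crd testcrs.example.com --wait=true
kubectl delete deployment operator-test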

camilamacedo86 added the language/ansible, triage/duplicate, triage/support, and scorecard labels and removed the triage/duplicate label on Oct 23, 2019
camilamacedo86 self-assigned this on Oct 23, 2019

camilamacedo86 commented Oct 24, 2019

Hi @oseoin,

Thank you for raising it. Could you please give us the following information?

  1. Have you tried increasing the init-timeout (e.g. init-timeout: 120)? See the doc.
  2. Is this project public? Could you share it?
  3. What command have you been using to call scorecard? Could you provide the .osdk-scorecard.yaml used?
  4. Is there any information in stdout? (See "see stdout for more details".)

camilamacedo86 removed their assignment on Oct 24, 2019

mew2057 commented Oct 25, 2019

I'm hitting a similar problem as well (stdout doesn't show anything):

Scorecard command:

# If I don't specify cr-manifest and csv-path it doesn't work.
GO111MODULE=on operator-sdk scorecard --config .osdk-scorecard.yaml --cr-manifest  deploy/crds/ibm_v1alpha1_csiscaleoperator_cr.yaml  --csv-path deploy/olm-catalog/ibm-spectrum-scale-csi-operator/0.0.1/ibm-spectrum-scale-csi-operator.v0.0.1.clusterserviceversion.yaml

.osdk-scorecard.yaml

scorecard:
  init-timeout: 120
  output: json
  plugins:
    - basic:
        cr-manifest:
            - "deploy/crds/ibm_v1alpha1_csiscaleoperator_cr.yaml"
    - olm:
        cr-manifest:
            - "deploy/crds/ibm_v1alpha1_csiscaleoperator_cr.yaml"
        csv-path: "deploy/olm-catalog/ibm-spectrum-scale-csi-operator/0.0.1/ibm-spectrum-scale-csi-operator.v0.0.1.clusterserviceversion.yaml"

EDIT: I was playing around with my helper scripts, and it looks like if I redeploy on top of the scorecard and then delete again, everything behaves and the delete works. Otherwise the scorecard puts the Custom Resource into a state where it can't be deleted.


mew2057 commented Oct 25, 2019

I just ran two configurations of my watches.yaml, and it looks like there's a pretty concrete connection between the finalizer and the scorecard failing. The only delta between the two runs was the finalizer change.

I ran both versions through travis and got the following results:

Scorecard 100%
"Good" config:

---
- version: v1alpha1
  group: scale.ibm.com
  kind: CSIScaleOperator
  role: /opt/ansible/roles/csi-scale
  ManageStatus: True

Scorecard 50%
"Bad" Config

---
- version: v1alpha1
  group: scale.ibm.com
  kind: CSIScaleOperator
  role: /opt/ansible/roles/csi-scale
  ManageStatus: True
  finalizer:
    name: finalizer.scale.ibm.com
    vars:
      state: absent

I've also done a run with the cluster-info before and after the scorecard run:

https://travis-ci.org/mew2057/ibm-spectrum-scale-csi-operator/jobs/602927683


camilamacedo86 commented Oct 25, 2019

Hi @mew2057,

Thank you very much for the great info provided. I would like to isolate the scenario to know whether it is a scorecard issue or a finalizer issue with the Ansible operator. So, could you check it outside of the scorecard? I mean, could you install everything (make install), check that it is all running, and then try to delete the objects (make uninstall) to see if it gets stuck?

See here a Makefile with make install and make uninstall targets, which can help you do this test.


camilamacedo86 commented Oct 25, 2019

Hi @mew2057,

Also, checking your logs I noticed that you are using Python 2.

DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date. A future version of pip will drop support for Python 2.7.

So, try to use Python 3. The following shows how to add it in your .travis file.

# Python 3.6 is the default Python when language: python.
# But when not language: python, Python 3 can not be used by default.
# There is pip (/usr/bin/pip), but not pip3.
# Because of it the global env is required.
env:
  global:
    - PATH=/opt/python/3.6.7/bin:$PATH

# Install python3 and utilities dependencies
addons:
  apt:
    packages:
      - "python3"
      - "python3-pip" 

I am guessing that this could be the root cause since, because of the deprecation, we upgraded the images to use Python 3. See the changelog.

Upgrade Python version from 2.7 to 3.6, Ansible version from 2.8.0 to ~=2.8 and ansible-runner from 1.2 to 1.3.4 in the Ansible based images. (#1947)

Please let us know if using Python 3 solves your scenario.

mew2057 added a commit to mew2057/ibm-spectrum-scale-csi-operator that referenced this issue Oct 28, 2019

mew2057 commented Oct 28, 2019

Makefile

For the Makefile it looks like the delete is a bit slow (~74 seconds), but that's probably the finalizer running when the object gets removed. That should fit within the init-timeout: 120 in my .osdk-scorecard.yaml configuration.

make install

$ time make install
....... Creating namespace .......
kubectl create namespace csiscaleoperator
namespace/csiscaleoperator created
....... Applying CRDS and Operator .......
kubectl apply -f deploy/crds/ibm_v1alpha1_csiscaleoperator_crd.yaml -n csiscaleoperator
customresourcedefinition.apiextensions.k8s.io/csiscaleoperators.scale.ibm.com created
....... Applying Rules and Service Account .......
kubectl apply -f deploy/role.yaml -n csiscaleoperator
role.rbac.authorization.k8s.io/ibm-spectrum-scale-csi-operator created
clusterrole.rbac.authorization.k8s.io/ibm-spectrum-scale-csi-operator created
clusterrole.rbac.authorization.k8s.io/csi-nodeplugin created
clusterrole.rbac.authorization.k8s.io/external-attacher-runner created
clusterrole.rbac.authorization.k8s.io/external-provisioner-runner created
kubectl apply -f deploy/role_binding.yaml  -n csiscaleoperator
rolebinding.rbac.authorization.k8s.io/ibm-spectrum-scale-csi-operator created
clusterrolebinding.rbac.authorization.k8s.io/ibm-spectrum-scale-csi-operator created
clusterrolebinding.rbac.authorization.k8s.io/csi-nodeplugin created
clusterrolebinding.rbac.authorization.k8s.io/external-provisioner-runner created
clusterrolebinding.rbac.authorization.k8s.io/external-attacher-runner created
kubectl apply -f deploy/service_account.yaml  -n csiscaleoperator
serviceaccount/ibm-spectrum-scale-csi-operator created
serviceaccount/external-attacher-runner created
serviceaccount/csi-nodeplugin created
serviceaccount/external-provisioner-runner created
....... Applying Operator .......
kubectl apply -f deploy/operator.yaml -n csiscaleoperator
deployment.apps/ibm-spectrum-scale-csi-operator created
....... Creating the Cluster .......
kubectl apply -f deploy/crds/ibm_v1alpha1_csiscaleoperator_cr.yaml -n csiscaleoperator
csiscaleoperator.scale.ibm.com/csi-scale-operator created

real	0m2.436s
user	0m1.427s
sys	0m0.227s

make uninstall

time make uninstall
....... Uninstalling .......
....... Deleting CRDs.......
kubectl delete -f deploy/crds/ibm_v1alpha1_csiscaleoperator_crd.yaml -n csiscaleoperator
warning: deleting cluster-scoped resources, not scoped to the provided namespace
customresourcedefinition.apiextensions.k8s.io "csiscaleoperators.scale.ibm.com" deleted
....... Deleting Rules and Service Account .......
kubectl delete -f deploy/role.yaml -n csiscaleoperator
role.rbac.authorization.k8s.io "ibm-spectrum-scale-csi-operator" deleted
warning: deleting cluster-scoped resources, not scoped to the provided namespace
clusterrole.rbac.authorization.k8s.io "ibm-spectrum-scale-csi-operator" deleted
clusterrole.rbac.authorization.k8s.io "csi-nodeplugin" deleted
clusterrole.rbac.authorization.k8s.io "external-attacher-runner" deleted
clusterrole.rbac.authorization.k8s.io "external-provisioner-runner" deleted
kubectl delete -f deploy/role_binding.yaml -n csiscaleoperator
rolebinding.rbac.authorization.k8s.io "ibm-spectrum-scale-csi-operator" deleted
warning: deleting cluster-scoped resources, not scoped to the provided namespace
clusterrolebinding.rbac.authorization.k8s.io "ibm-spectrum-scale-csi-operator" deleted
clusterrolebinding.rbac.authorization.k8s.io "csi-nodeplugin" deleted
clusterrolebinding.rbac.authorization.k8s.io "external-provisioner-runner" deleted
clusterrolebinding.rbac.authorization.k8s.io "external-attacher-runner" deleted
kubectl delete -f deploy/service_account.yaml -n csiscaleoperator
serviceaccount "ibm-spectrum-scale-csi-operator" deleted
serviceaccount "external-attacher-runner" deleted
serviceaccount "csi-nodeplugin" deleted
serviceaccount "external-provisioner-runner" deleted
....... Deleting Operator .......
kubectl delete -f deploy/operator.yaml -n csiscaleoperator
deployment.apps "ibm-spectrum-scale-csi-operator" deleted
....... Deleting namespace csiscaleoperator.......
kubectl delete namespace csiscaleoperator
namespace "csiscaleoperator" deleted

real	1m14.311s
user	0m0.732s
sys	0m0.099s

Makefile (modified from example):

[root ibm-spectrum-scale-csi-operator]# cat Makefile
.DEFAULT_GOAL:=help
SHELL:=/bin/bash
NAMESPACE=csiscaleoperator

# Modified from operator framework sample.
##@ Application

install: ## Install all resources (CR/CRD's, RBCA and Operator)
	@echo ....... Creating namespace .......
	- kubectl create namespace ${NAMESPACE}
	@echo ....... Applying CRDS and Operator .......
	- kubectl apply -f deploy/crds/ibm_v1alpha1_csiscaleoperator_crd.yaml -n ${NAMESPACE}
	@echo ....... Applying Rules and Service Account .......
	- kubectl apply -f deploy/role.yaml -n ${NAMESPACE}
	- kubectl apply -f deploy/role_binding.yaml  -n ${NAMESPACE}
	- kubectl apply -f deploy/service_account.yaml  -n ${NAMESPACE}
	@echo ....... Applying Operator .......
	- kubectl apply -f deploy/operator.yaml -n ${NAMESPACE}
	@echo ....... Creating the Cluster .......
	- kubectl apply -f deploy/crds/ibm_v1alpha1_csiscaleoperator_cr.yaml -n ${NAMESPACE}

uninstall: ## Uninstall all that all performed in the $ make install
	@echo ....... Uninstalling .......
	@echo ....... Deleting CRDs.......
	- kubectl delete -f deploy/crds/ibm_v1alpha1_csiscaleoperator_crd.yaml -n ${NAMESPACE}
	@echo ....... Deleting Rules and Service Account .......
	- kubectl delete -f deploy/role.yaml -n ${NAMESPACE}
	- kubectl delete -f deploy/role_binding.yaml -n ${NAMESPACE}
	- kubectl delete -f deploy/service_account.yaml -n ${NAMESPACE}
	@echo ....... Deleting Operator .......
	- kubectl delete -f deploy/operator.yaml -n ${NAMESPACE}
	@echo ....... Deleting namespace ${NAMESPACE}.......
	- kubectl delete namespace ${NAMESPACE}

.PHONY: help
help:  ## Display this help
	@awk 'BEGIN {FS = ":.*##"; printf "\nUsage:\n  make \033[36m<target>\033[0m\n"} /^[a-zA-Z0-9_-]+:.*?##/ { printf "  \033[36m%-15s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) } ' $(MAKEFILE_LIST)

Environment Change

I updated my config to match the recommendation and it still looks like I'm hitting the timeout on the resource cleanup:

https://travis-ci.org/mew2057/ibm-spectrum-scale-csi-operator/jobs/603884581#L358

Thoughts

Is the timeout tripping on the delete because the timeout is not being properly loaded by the scorecard? If I don't run the scorecard with the csv-path in the argument list it errors out, so maybe something similar is happening with the timeout?

$ operator-sdk scorecard --init-timeout 120 --config .osdk-scorecard.yaml --cr-manifest  deploy/crds/ibm_v1alpha1_csiscaleoperator_cr.yaml  --csv-path deploy/olm-catalog/ibm-spectrum-scale-csi-operator/0.0.1/ibm-spectrum-scale-csi-operator.v0.0.1.clusterserviceversion.yaml
INFO[0000] Using config file: .osdk-scorecard.yaml
WARN[0000] Plugin directory not found; skipping external plugins: stat scorecard: no such file or directory
WARN[0035] time="2019-10-28T06:02:14-07:00" level=info msg="a cleanup function failed with error: cleanup function failed: timed out waiting for the condition\n"
time="2019-10-28T06:02:29-07:00" level=info msg="a cleanup function failed with error: cleanup function failed: timed out waiting for the condition\n"
time="2019-10-28T06:02:29-07:00" level=error msg="Failed to cleanup resources: (a cleanup function failed; see stdout for more details)"
ERRO[0035] Plugin `olm-tests` failed with error (failed to create global resources: object is being deleted: customresourcedefinitions.apiextensions.k8s.io "csiscaleoperators.scale.ibm.com" already exists)
Basic Tests:
	Spec Block Exists: 1/1
	Status Block Exists: 1/1
	Writing into CRs has an effect: 1/1
Failed Plugin: olm-tests:

Total Score: 50%

Apparently not.

Addendum

I ran the make uninstall after that test and it's taking even longer than before:

make uninstall
....... Uninstalling .......
....... Deleting CRDs.......
kubectl delete -f deploy/crds/ibm_v1alpha1_csiscaleoperator_crd.yaml -n csiscaleoperator
warning: deleting cluster-scoped resources, not scoped to the provided namespace
customresourcedefinition.apiextensions.k8s.io "csiscaleoperators.scale.ibm.com" deleted

It's been sitting there for about five minutes. I'm going to look at the scorecard source and try to recreate these conditions by hand so I can trace where the hang is occurring, because the Custom Resource is not showing any information.


camilamacedo86 commented Oct 28, 2019

Hi @oseoin,

Could you share your role.yaml file? Also, could you try using the same permissions that we have in the samples and let us know if that solves your issue? Note that if you edit the CRD when it gets stuck and remove the finalizer, it will be deleted. So it is possible that this is related to the permissions that allow the operator to remove the finalizer.


mew2057 commented Oct 28, 2019

Is this related to #1513? I've tried a lot of different configs (bumped to 0.11.0, verified the role was present and even dug through the scorecard logic) and I can't identify any other sources.

Additionally, clearing the CR's finalizer will clear the CR and CRD 100% of the time after the scorecard fails.
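For reference, clearing the finalizer looks roughly like this (the CR name and namespace are taken from the make install output above; the CR that scorecard creates may be named differently):

kubectl patch csiscaleoperator csi-scale-operator -n csiscaleoperator \
  --type=merge -p '{"metadata":{"finalizers":[]}}'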


oseoin commented Oct 29, 2019

Hi @camilamacedo86, apologies for the delay in getting back to you. I tested using init-timeout: 300. Our projects are not public but I have attached the test operator that I created to recreate the issue, which I have verified uses the same permissions as in the memcached sample. Unfortunately there is no other output in stdout, and the JSON output does not provide any additional information. I think that this has to be a scorecard issue rather than a permissions one as the finaliser works fine when run manually - unless scorecard uses permissions differently?
operator-test.zip.
Thanks!


jmccormick2001 commented Oct 29, 2019

I played around this morning and I think I was able to recreate your error...

jeffmc@doppio114 ~/projects/memcached-operator> operator-sdk scorecard -o text
INFO[0000] Using config file: /home/jeffmc/projects/memcached-operator/.osdk-scorecard.yaml
WARN[0027] time="2019-10-29T13:14:52-05:00" level=info msg="a cleanup function failed with error: cleanup function failed: timed out waiting for the condition\n"
time="2019-10-29T13:15:02-05:00" level=info msg="a cleanup function failed with error: cleanup function failed: timed out waiting for the condition\n"
time="2019-10-29T13:15:02-05:00" level=error msg="Failed to cleanup resources: (a cleanup function failed; see stdout for more details)"
ERRO[0027] Plugin OLM Integration failed with error (failed to create global resources: object is being deleted: customresourcedefinitions.apiextensions.k8s.io "memcacheds.cache.example.com" already exists)
basic:
Spec Block Exists : pass
Status Block Exists : pass
Writing into CRs has an effect : pass

I did this by artificially putting a sleep of 75 seconds in the finalizer function.

This causes an issue where you can't remove the CR without doing this:
kubectl patch crd/memcacheds.cache.example.com -p '{"metadata":{"finalizers":[]}}' --type=merge

Adjusting the scorecard config to use an init-timeout value greater than 75 seconds still results in the following errors:

INFO[0000] Using config file: /home/jeffmc/projects/memcached-operator/.osdk-scorecard.yaml
ERRO[0014] Plugin Basic Tests failed with error (failed to create cr resource: object is being deleted: memcacheds.cache.example.com "example-memcached" already exists)
ERRO[0024] Plugin OLM Integration failed with error (failed to create global resources: object is being deleted: customresourcedefinitions.apiextensions.k8s.io "memcacheds.cache.example.com" already exists)

jmccormick2001 commented:

Based on my debugging so far, it appears the scorecard deletes resources in an order like this:
Deployment.apps
RoleBinding.rbac.authorization.k8s.io
Role.rbac.authorization.k8s.io
ServiceAccount
CustomResourceDefinition.apiextensions.k8s.io

What I suspect is that scorecard assumes that removing the CRD will cause any associated CRs to be removed. This is not happening when a finalizer is specified; instead you see the CRD deletion hang and the CR not being removed.

It seems to me the CR should be removed prior to the CRD, which works around this hang.

joelanford commented:

The Deployment.apps is the operator itself, right? If so, the correct order would be:

  1. CustomResourceDefinition.apiextensions.k8s.io
  2. Deployment.apps
  3. RoleBinding.rbac.authorization.k8s.io
  4. Role.rbac.authorization.k8s.io
  5. ServiceAccount

Deleting the CRD causes all CRs of that type to be deleted, but if the CR has a finalizer, then the operator still needs to be running. So we would need to make sure that our code waits until the CRDs from step 1 are actually deleted before proceeding to step 2.
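A rough shell equivalent of that ordering, using the memcached example names from the logs above (the deployment and RBAC names are assumptions; scorecard does this through its Go client, so this is only an illustration):

# 1. Delete the CRD and wait until it is actually gone. The operator is still
#    running at this point, so the CR's finalizer can complete and unblock it.
kubectl delete crd memcacheds.cache.example.com --wait=true --timeout=300s
# 2-5. Only then remove the operator deployment and its RBAC.
kubectl delete deployment memcached-operator
kubectl delete rolebinding memcached-operator
kubectl delete role memcached-operator
kubectl delete serviceaccount memcached-operator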


jmccormick2001 commented Oct 29, 2019

@joelanford I confirmed that order does indeed work, so it sounds like the fix here would be to change scorecard to delete in that order.

jmccormick2001 self-assigned this on Oct 30, 2019
jmccormick2001 added the kind/bug label on Nov 1, 2019
mew2057 added a commit to mew2057/ibm-spectrum-scale-csi-operator that referenced this issue Nov 13, 2019
openshift-bot commented:

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label on Jan 30, 2020

mew2057 commented Feb 17, 2020

/remove-lifecycle stale

This is still very much a problem.

openshift-ci-robot removed the lifecycle/stale label on Feb 17, 2020
jmccormick2001 commented:

FYI, I'm picking this one back up and should have a fix for this in a day or so. I'll change scorecard's resource cleanup logic to remove the CRD resource first during cleanup. I'll also change the deletion timeout from a hard-coded value to the --init-timeout configuration setting, which will allow finalizers to complete before the operator deployment is deleted.

jmccormick2001 commented:

OK, the order of deletion was actually fine, but the initTimeout value was not being used in the deletion wait. I've added that; a PR is coming up soon.
